In [122]:
import numpy as np
import pandas as pd

In [123]:
index = pd.date_range("20200101", periods=10)
index

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

In [124]:
pd.date_range("01/02/2020", periods=10)

DatetimeIndex(['2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05',
               '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09',
               '2020-01-10', '2020-01-11'],
              dtype='datetime64[ns]', freq='D')

In [125]:
s= pd.Series(np.random.randn(5), index=["a",'b','c','d','e'])
s

a   -0.277808
b   -0.628505
c    0.675435
d   -0.358144
e    1.449935
dtype: float64

In [126]:
np.random.random(5)

array([0.48817435, 0.251305  , 0.28014273, 0.09803181, 0.37700778])

In [127]:
np.random.randn(4)

array([ 1.07698086, -2.04836859,  1.00481405,  0.15297552])

In [128]:
df = pd.DataFrame(np.random.random((4,4)), index=["a",'b','c','d'], columns=["AA",'BB','CC','DD'])
df

Unnamed: 0,AA,BB,CC,DD
a,0.775726,0.462317,0.211228,0.231928
b,0.243526,0.028878,0.108328,0.25108
c,0.348137,0.665673,0.338999,0.865507
d,0.488909,0.365217,0.600413,0.293932


In [129]:
s.head(3)

a   -0.277808
b   -0.628505
c    0.675435
dtype: float64

In [130]:
s.tail(3)

c    0.675435
d   -0.358144
e    1.449935
dtype: float64

In [131]:
df.head(3)

Unnamed: 0,AA,BB,CC,DD
a,0.775726,0.462317,0.211228,0.231928
b,0.243526,0.028878,0.108328,0.25108
c,0.348137,0.665673,0.338999,0.865507


In [132]:
df.describe()

Unnamed: 0,AA,BB,CC,DD
count,4.0,4.0,4.0,4.0
mean,0.464074,0.380521,0.314742,0.410612
std,0.230815,0.265763,0.212539,0.304369
min,0.243526,0.028878,0.108328,0.231928
25%,0.321984,0.281132,0.185503,0.246292
50%,0.418523,0.413767,0.275113,0.272506
75%,0.560613,0.513156,0.404353,0.436826
max,0.775726,0.665673,0.600413,0.865507


In [133]:
df[:2]

Unnamed: 0,AA,BB,CC,DD
a,0.775726,0.462317,0.211228,0.231928
b,0.243526,0.028878,0.108328,0.25108


In [134]:
df1 = df.copy()

In [135]:
df1.columns = [i.lower() for i in df1.columns]
df1

Unnamed: 0,aa,bb,cc,dd
a,0.775726,0.462317,0.211228,0.231928
b,0.243526,0.028878,0.108328,0.25108
c,0.348137,0.665673,0.338999,0.865507
d,0.488909,0.365217,0.600413,0.293932


In [136]:
s.array

# To get the actual data inside an Index or Series, use the .array property.

<PandasArray>
[-0.2778076889814041, -0.6285051195763908,  0.6754350306064895,
 -0.3581442916759302,   1.449935363824358]
Length: 5, dtype: float64

In [137]:
s.index.array

<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas uses them is a bit beyond the scope of this introduction. See dtypes for more.

In [138]:
# If you know you need a NumPy array, use Series.to_numpy() or numpy.asarray().
s.to_numpy()

array([-0.27780769, -0.62850512,  0.67543503, -0.35814429,  1.44993536])

In [139]:
np.asarray(s)

array([-0.27780769, -0.62850512,  0.67543503, -0.35814429,  1.44993536])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values. See dtypes for more.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are two possibly useful representations:

An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz.
A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded.

Timezones may be preserved with dtype=object:

In [140]:
ser = pd.Series(pd.date_range("2010", periods= 4, tz="CET"))
ser

0   2010-01-01 00:00:00+01:00
1   2010-01-02 00:00:00+01:00
2   2010-01-03 00:00:00+01:00
3   2010-01-04 00:00:00+01:00
dtype: datetime64[ns, CET]

In [141]:
ser.to_numpy(dtype=object)

array([Timestamp('2010-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-02 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-03 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-04 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

In [142]:
ser.to_numpy(dtype='datetime64[ns]')

array(['2009-12-31T23:00:00.000000000', '2010-01-01T23:00:00.000000000',
       '2010-01-02T23:00:00.000000000', '2010-01-03T23:00:00.000000000'],
      dtype='datetime64[ns]')

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.
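
A minimal sketch of this dtype coercion, using two small hypothetical frames (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frames; names and values are illustrative only.
ints_and_floats = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
with_strings = pd.DataFrame({"i": [1, 2], "s": ["x", "y"]})

# Integers and floats are upcast to a common float dtype.
numeric_arr = ints_and_floats.to_numpy()
print(numeric_arr.dtype)

# Once strings are involved, the only common dtype is object.
object_arr = with_strings.to_numpy()
print(object_arr.dtype)
```
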
In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:

When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy array or the extension array. Series.array will always return an ExtensionArray, and will never copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying / coercing values.
When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
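
The first drawback can be seen in a short sketch. The two Series below are illustrative: one is backed by a NumPy array, the other by an extension type (categorical).

```python
import numpy as np
import pandas as pd

plain = pd.Series([1, 2, 3])                                 # NumPy-backed
categorical = pd.Series(["a", "b", "a"], dtype="category")  # extension-backed

# .values depends on the backing data: an ndarray here, a Categorical there.
print(type(plain.values).__name__, type(categorical.values).__name__)

# .array is always an ExtensionArray; .to_numpy() is always an ndarray.
is_ext = [isinstance(s.array, pd.api.extensions.ExtensionArray)
          for s in (plain, categorical)]
is_ndarray = [isinstance(s.to_numpy(), np.ndarray)
              for s in (plain, categorical)]
print(is_ext, is_ndarray)
```
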
#### Accelerated operations
pandas has support for accelerating certain types of binary numerical and boolean operations using the numexpr and bottleneck libraries.

These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast when dealing with arrays that have NaNs.

Here is a sample (using 100 column x 100,000 row DataFrames):

Operation	0.11.0 (ms)	Prior Version (ms)	Ratio to Prior
df1 > df2	13.32	125.35	0.1063
df1 * df2	21.71	36.63	0.5928
df1 + df2	22.04	36.50	0.6039
You are highly encouraged to install both libraries. See the section Recommended Dependencies for more installation info.

Both are enabled by default; you can control this by setting the following options:

New in version 0.20.0.

In [143]:
pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
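
If the switch should apply only to one block of code rather than the whole session, pandas' option_context context manager can scope it. A minimal sketch, with a made-up frame standing in for real data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.ones((3, 2)), columns=["x", "y"])  # illustrative data

before = pd.get_option("compute.use_numexpr")

# Inside the block the accelerated path is switched off...
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")
    total = (frame + frame).sum().sum()

# ...and the previous setting is restored on exit.
after = pd.get_option("compute.use_numexpr")
print(before, inside, after, total)
```
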

In [144]:
df = pd.DataFrame({
       'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
       'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
       'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,1.132474,-0.229191,
b,0.049871,0.921456,0.090987
c,-0.684029,0.87043,-0.815336
d,,0.895508,-1.37657


In [145]:
row = df.iloc[1]
row

one      0.049871
two      0.921456
three    0.090987
Name: b, dtype: float64

In [146]:
column = df['two']
column

a   -0.229191
b    0.921456
c    0.870430
d    0.895508
Name: two, dtype: float64

In [147]:
df.sub(row, axis ="columns")

Unnamed: 0,one,two,three
a,1.082603,-1.150647,
b,0.0,0.0,0.0
c,-0.733901,-0.051025,-0.906323
d,,-0.025947,-1.467556


In [148]:
df.sub(row, axis=1)

Unnamed: 0,one,two,three
a,1.082603,-1.150647,
b,0.0,0.0,0.0
c,-0.733901,-0.051025,-0.906323
d,,-0.025947,-1.467556


In [149]:
df.sub(column, axis = "index")

Unnamed: 0,one,two,three
a,1.361665,0.0,
b,-0.871584,0.0,-0.830469
c,-1.55446,0.0,-1.685767
d,,0.0,-2.272078


In [150]:
df.sub(column, axis=0)

Unnamed: 0,one,two,three
a,1.361665,0.0,
b,-0.871584,0.0,-0.830469
c,-1.55446,0.0,-1.685767
d,,0.0,-2.272078


In [151]:
dfmi = df.copy()
dfmi.index = pd.MultiIndex.from_tuples([(1,"a"),(1,"b"),(1,"c"),(2,'a')],
                                       names=["first",'second'])
dfmi

Unnamed: 0_level_0,Unnamed: 1_level_0,one,two,three
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,a,1.132474,-0.229191,
1,b,0.049871,0.921456,0.090987
1,c,-0.684029,0.87043,-0.815336
2,a,,0.895508,-1.37657


In [152]:
dfmi.sub(column,axis=0,level='second')

Unnamed: 0_level_0,Unnamed: 1_level_0,one,two,three
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,a,1.361665,0.0,
1,b,-0.871584,0.0,-0.830469
1,c,-1.55446,0.0,-1.685767
2,a,,1.1247,-1.147379


Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple of the same type as the left-hand side. For example:

In [153]:
s = pd.Series(np.arange(10,20))
s

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

In [154]:
div, mod = divmod(s, 4)  # equivalent to (s // 4, s % 4), not s / 4
print(div)
print(mod)

0    2
1    2
2    3
3    3
4    3
5    3
6    4
7    4
8    4
9    4
dtype: int32
0    2
1    3
2    0
3    1
4    2
5    3
6    0
7    1
8    2
9    3
dtype: int32


In [155]:
idx=pd.Index(np.arange(10))
idx

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [156]:
divv, modd = divmod(idx,3)
print(divv)
print(modd)

Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')


In [157]:
#We can also do elementwise divmod():

di, mo = divmod(s,[2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
print(di)
print(mo)

0    5
1    5
2    4
3    4
4    3
5    3
6    3
7    3
8    3
9    3
dtype: int32
0    0
1    1
2    0
3    1
4    2
5    3
6    1
7    2
8    0
9    1
dtype: int32
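
As a quick sanity check of the floor-division semantics, the sketch below uses illustrative values (including negatives) and confirms that the two parts returned by divmod() match the // and % operators and recombine into the original values:

```python
import numpy as np
import pandas as pd

values = pd.Series([-7, -1, 0, 5, 12])  # illustrative values, including negatives
divisor = 4

quotient, remainder = divmod(values, divisor)

# divmod matches the separate floor-division and modulo operators...
assert quotient.equals(values // divisor)
assert remainder.equals(values % divisor)

# ...and the two parts always recombine into the original values.
rebuilt = quotient * divisor + remainder
print(rebuilt.equals(values))
```
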


#### Working with missing value/data
In Series and DataFrame, the arithmetic functions accept a fill_value, namely a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).

In [158]:
df

Unnamed: 0,one,two,three
a,1.132474,-0.229191,
b,0.049871,0.921456,0.090987
c,-0.684029,0.87043,-0.815336
d,,0.895508,-1.37657


In [159]:
dfm = df.copy()
dfm

Unnamed: 0,one,two,three
a,1.132474,-0.229191,
b,0.049871,0.921456,0.090987
c,-0.684029,0.87043,-0.815336
d,,0.895508,-1.37657


In [160]:
df+dfm

Unnamed: 0,one,two,three
a,2.264947,-0.458383,
b,0.099742,1.842911,0.181973
c,-1.368059,1.740861,-1.630673
d,,1.791017,-2.75314


In [161]:
df.add(dfm,fill_value=0)

Unnamed: 0,one,two,three
a,2.264947,-0.458383,
b,0.099742,1.842911,0.181973
c,-1.368059,1.740861,-1.630673
d,,1.791017,-2.75314
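
Because df and dfm above are identical, their missing values sit at the same locations and fill_value has no visible effect. A small sketch with hypothetical frames whose gaps differ shows the rule more clearly:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})    # illustrative data
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

plain = left + right                     # NaN wherever either side is missing
filled = left.add(right, fill_value=0)   # NaN only where both sides are missing

print(plain["x"].tolist())
print(filled["x"].tolist())
```
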


#### Flexible comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is analogous to the binary arithmetic operations described above:

In [162]:
df.ne(dfm)

Unnamed: 0,one,two,three
a,False,False,True
b,False,False,False
c,False,False,False
d,True,False,False


In [163]:
df.le(dfm)

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True


In [164]:
df.gt(dfm)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [165]:
df.lt(dfm)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [166]:
df.eq(dfm)

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True
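
The False entries from eq (and the True entries from ne) above arise because NaN never compares equal to itself, even though df and dfm hold the same data. A short sketch with an illustrative Series contrasts the elementwise comparison with equals(), which treats NaNs at matching locations as equal:

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan, 3.0])  # illustrative data
right = left.copy()

elementwise = left.eq(right)         # NaN == NaN evaluates to False
all_equal = bool(elementwise.all())  # so this reduction is False
same_data = left.equals(right)       # NaNs in the same place count as equal

print(elementwise.tolist(), all_equal, same_data)
```
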


#### Boolean reductions
You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result.

In [167]:
(df>0).all()

one      False
two      False
three    False
dtype: bool

In [168]:
(df>0).any()

one      True
two      True
three    True
dtype: bool
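
The reductions above return one value per column. A sketch of collapsing further to a single boolean, and of why a multi-element object cannot be used directly in an if statement (the frame is illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, -1.0], "b": [2.0, np.nan]})  # illustrative
mask = frame > 0

# Chain two reductions to go from DataFrame -> Series -> scalar.
any_positive = bool(mask.any().any())
all_positive = bool(mask.all().all())

# A multi-element object has no single truth value, so bool() raises.
try:
    bool(mask)
    ambiguous = False
except ValueError:
    ambiguous = True

print(any_positive, all_positive, ambiguous)
```
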

In [169]:
df.empty

False

In [170]:
df.bool

<bound method NDFrame.bool of         one       two     three
a  1.132474 -0.229191       NaN
b  0.049871  0.921456  0.090987
c -0.684029  0.870430 -0.815336
d       NaN  0.895508 -1.376570>

In [171]:
dd=pd.DataFrame(columns=list('ABCD'))
dd

Unnamed: 0,A,B,C,D


In [172]:
dd.empty

True

In [173]:
pd.Series([True,]).bool()
#To evaluate single-element pandas objects in a boolean context, use the method bool()

True

In [174]:
df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})
df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

In [175]:
df1.equals(df2)

False

In [176]:
df1.equals(df2.sort_index())

True

In [177]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                       'B': [np.nan, 2., 3., np.nan, 6.]})

df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
                       'B': [np.nan, np.nan, 3., 4., 6., 8.]})


print(df1)
print(df2)
df1.combine_first(df2)

# combine_first() fills the null values in df1 with the values at the same locations in df2.

     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0


Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0


In [178]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                       'B': [np.nan, 2., 3., np.nan, 6.]})

df2 = pd.DataFrame({'A': [5., 2., np.nan, 3., 7.],
                       'B': [np.nan, 3., 4., 6., 8.]})


def combiner(x,y):
    return np.where(pd.isna(x),y,x)

combiner(df1,df2) #the two df will have to be same size

array([[ 1., nan],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 5.,  6.],
       [ 7.,  6.]])
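
A function such as combiner can also be handed to DataFrame.combine(), which applies it to each pair of aligned columns. The sketch below reuses the frames from the cell above and checks that the result agrees with combine_first():

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                    'B': [np.nan, 2., 3., np.nan, 6.]})
df2 = pd.DataFrame({'A': [5., 2., np.nan, 3., 7.],
                    'B': [np.nan, 3., 4., 6., 8.]})

def combiner(x, y):
    # Take x where it is present, otherwise fall back to y.
    return np.where(pd.isna(x), y, x)

combined = df1.combine(df2, combiner)   # combiner is called once per column pair
reference = df1.combine_first(df2)

print(combined)
pd.testing.assert_frame_equal(combined, reference, check_dtype=False)
```
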

#### Descriptive statistics
There exists a large number of methods for computing descriptive statistics and other related operations on Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or integer:

Series: no axis argument needed
DataFrame: “index” (axis=0, default), “columns” (axis=1)
For example:

In [179]:
df

Unnamed: 0,one,two,three
a,1.132474,-0.229191,
b,0.049871,0.921456,0.090987
c,-0.684029,0.87043,-0.815336
d,,0.895508,-1.37657


In [180]:
df.mean(0)

one      0.166105
two      0.614551
three   -0.700307
dtype: float64

In [181]:
df.mean(1)

a    0.451641
b    0.354104
c   -0.209645
d   -0.240531
dtype: float64
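
The same reductions can be written with axis names instead of integers. A small sketch on an illustrative frame, showing that the two spellings agree:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"one": [1.0, 2.0, np.nan],
                      "two": [4.0, 5.0, 6.0]})  # illustrative data

# axis=0 / "index": reduce down the rows, one result per column.
per_column = frame.mean(axis="index")

# axis=1 / "columns": reduce across the columns, one result per row.
per_row = frame.mean(axis="columns")

print(per_column.equals(frame.mean(axis=0)), per_row.equals(frame.mean(axis=1)))
```
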

In [182]:
df.max(0)

one      1.132474
two      0.921456
three    0.090987
dtype: float64

In [183]:
df.quantile(0)

one     -0.684029
two     -0.229191
three   -1.376570
Name: 0, dtype: float64

In [184]:
df.quantile(1)

one      1.132474
two      0.921456
three    0.090987
Name: 1, dtype: float64

In [185]:
df.sum(0)

one      0.498315
two      2.458203
three   -2.100920
dtype: float64

In [186]:
df.sum(1)

a    0.903282
b    1.062313
c   -0.628935
d   -0.481061
dtype: float64

In [187]:
df.cumsum(0)

Unnamed: 0,one,two,three
a,1.132474,-0.229191,
b,1.182345,0.692264,0.090987
c,0.498315,1.562695,-0.72435
d,,2.458203,-2.10092


In [188]:
df.cumsum(1)

Unnamed: 0,one,two,three
a,1.132474,0.903282,
b,0.049871,0.971327,1.062313
c,-0.684029,0.186401,-0.628935
d,,0.895508,-0.481061


In [189]:
df.cumprod(0)

Unnamed: 0,one,two,three
a,1.132474,-0.229191,
b,0.056478,-0.21119,0.090987
c,-0.038632,-0.183826,-0.074185
d,,-0.164618,0.10212


In [190]:
df.cumprod(1)

Unnamed: 0,one,two,three
a,1.132474,-0.259553,
b,0.049871,0.045954,0.004181
c,-0.684029,-0.5954,0.485451
d,,0.895508,-1.23273


All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [191]:
df.sum(0,skipna=False)

one           NaN
two      2.458203
three         NaN
dtype: float64

In [192]:
df.std()

one      0.913813
two      0.562880
three    0.740510
dtype: float64

In [193]:
t=(df-df.mean())/df.std()
t

Unnamed: 0,one,two,three
a,1.057513,-1.498972,
b,-0.127197,0.54524,1.068579
c,-0.930316,0.45459,-0.155339
d,,0.499143,-0.913241


In [194]:
t.std()

one      1.0
two      1.0
three    1.0
dtype: float64

In [195]:
tt= df.sub(df.mean(1),axis=0).div(df.std(1),axis=0)
tt

Unnamed: 0,one,two,three
a,0.707107,-0.707107,
b,-0.618649,1.153691,-0.535042
c,-0.505916,1.151867,-0.645951
d,,0.707107,-0.707107


In [196]:
tt.std(1)

a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64

Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different from expanding() and rolling(). For more details please see this note.

In [197]:
df.cumsum()

Unnamed: 0,one,two,three
a,1.132474,-0.229191,
b,1.182345,0.692264,0.090987
c,0.498315,1.562695,-0.72435
d,,2.458203,-2.10092


In [198]:
df.mad() #mean absolute deviation

one      0.644246
two      0.421871
three    0.527529
dtype: float64
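
mad() was deprecated in pandas 1.5 and is no longer available in pandas 2.0 and later, so the cell above only runs on older releases. A minimal equivalent, assuming the default behaviour (column-wise, NaNs skipped), computes the mean absolute deviation around the mean directly:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"one": [1.0, 2.0, 3.0, np.nan],
                      "two": [0.0, 0.0, 4.0, 4.0]})  # illustrative data

# Mean absolute deviation around each column's mean; NaNs are skipped
# by both mean() calls, as they were by mad().
mean_abs_dev = (frame - frame.mean()).abs().mean()
print(mean_abs_dev)
```
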

Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:

In [199]:
np.mean(df["one"])

0.16610514423593403
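
The NaN-skipping applies only while the input is still a Series, because NumPy hands the call over to the Series' own method. A sketch with illustrative data showing what happens once the data is a plain ndarray:

```python
import numpy as np
import pandas as pd

column = pd.Series([1.0, np.nan, 3.0])  # illustrative data

on_series = np.mean(column)                 # dispatches to Series.mean, skips NaN
on_ndarray = np.mean(column.to_numpy())     # plain NumPy: NaN propagates
nan_aware = np.nanmean(column.to_numpy())   # explicit NaN-skipping variant

print(on_series, on_ndarray, nan_aware)
```
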

In [200]:
s.nunique() 
#Series.nunique() will return the number of unique non-NA values in a Series

10

In [201]:
s

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

In [202]:
s[2:4]=10
s

0    10
1    11
2    10
3    10
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):

In [203]:
s.describe()

count    10.000000
mean     14.000000
std       3.527668
min      10.000000
25%      10.250000
50%      14.500000
75%      16.750000
max      19.000000
dtype: float64

In [204]:
df.describe()

Unnamed: 0,one,two,three
count,3.0,4.0,3.0
mean,0.166105,0.614551,-0.700307
std,0.913813,0.56288,0.74051
min,-0.684029,-0.229191,-1.37657
25%,-0.317079,0.595525,-1.095953
50%,0.049871,0.882969,-0.815336
75%,0.591172,0.901995,-0.362175
max,1.132474,0.921456,0.090987


In [205]:
frame = pd.DataFrame(np.random.randn(1000,5), columns=["A","B","C",'D','E'])
frame

Unnamed: 0,A,B,C,D,E
0,-1.086023,-0.417524,-0.220130,-1.860732,-1.094743
1,-0.729619,-0.518742,0.337549,-0.590817,-1.664892
2,1.045178,-0.053691,1.604740,-0.023918,0.635816
3,-0.350436,1.770533,1.035366,0.581002,-0.141550
4,-0.961587,-1.012551,-0.279449,-0.078929,1.072264
...,...,...,...,...,...
995,-0.874782,0.832773,-0.836291,-0.685154,0.861474
996,-0.496013,0.076264,-0.088671,0.878740,-0.224351
997,-2.207074,-1.064801,-0.954497,0.713059,-0.515545
998,0.496478,0.531871,0.586562,-0.802352,0.775669


In [206]:
frame.iloc[::3] = np.nan
frame

Unnamed: 0,A,B,C,D,E
0,,,,,
1,-0.729619,-0.518742,0.337549,-0.590817,-1.664892
2,1.045178,-0.053691,1.604740,-0.023918,0.635816
3,,,,,
4,-0.961587,-1.012551,-0.279449,-0.078929,1.072264
...,...,...,...,...,...
995,-0.874782,0.832773,-0.836291,-0.685154,0.861474
996,,,,,
997,-2.207074,-1.064801,-0.954497,0.713059,-0.515545
998,0.496478,0.531871,0.586562,-0.802352,0.775669


In [207]:
frame.describe()

Unnamed: 0,A,B,C,D,E
count,666.0,666.0,666.0,666.0,666.0
mean,-0.024685,0.098379,-0.060594,0.027102,0.041394
std,1.024178,0.986233,0.971969,1.021303,1.049761
min,-3.281101,-2.98934,-3.260154,-3.266113,-2.804449
25%,-0.759076,-0.587065,-0.721486,-0.654268,-0.717297
50%,-0.038857,0.124438,-0.050908,0.033779,0.027907
75%,0.634978,0.763905,0.602821,0.716171,0.773111
max,3.079727,2.723311,2.994951,3.031193,3.58821


In [208]:
#You can select specific percentiles to include in the output:
df.describe(percentiles=[.05,.25,.60,.95])

Unnamed: 0,one,two,three
count,3.0,4.0,3.0
mean,0.166105,0.614551,-0.700307
std,0.913813,0.56288,0.74051
min,-0.684029,-0.229191,-1.37657
5%,-0.610639,-0.064248,-1.320447
25%,-0.317079,0.595525,-1.095953
50%,0.049871,0.882969,-0.815336
60%,0.266392,0.890493,-0.634072
95%,1.024213,0.917563,0.000354
max,1.132474,0.921456,0.090987


In [209]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if none are, only categorical columns:

In [210]:
f = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
f

Unnamed: 0,a,b
0,Yes,0
1,Yes,1
2,No,2
3,No,3


In [211]:
f.describe()

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


This behavior can be controlled by providing a list of types as include/exclude arguments. The special value all can also be used:

In [212]:
f.describe(include=['object'])

Unnamed: 0,a
count,4
unique,2
top,Yes
freq,2


In [213]:
f.describe(include=["number"])

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


In [214]:
f.describe(include='all')

Unnamed: 0,a,b
count,4,4.0
unique,2,
top,Yes,
freq,2,
mean,,1.5
std,,1.290994
min,,0.0
25%,,0.75
50%,,1.5
75%,,2.25
max,,3.0


This feature relies on select_dtypes; refer to its documentation for details about accepted inputs.
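
A brief sketch of select_dtypes on the same frame f, showing the kind of column selection that describe() performs through include/exclude:

```python
import pandas as pd

f = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

numeric_only = f.select_dtypes(include="number")   # keeps column b
non_numeric = f.select_dtypes(exclude="number")    # keeps column a

print(list(numeric_only.columns), list(non_numeric.columns))
```
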

#### Index of min/max values
The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [215]:
df.idxmin()

one      c
two      a
three    d
dtype: object

In [216]:
df.idxmax()

one      a
two      b
three    b
dtype: object

In [217]:
g = pd.Series(np.random.random(20))
g

0     0.624630
1     0.200240
2     0.617599
3     0.764376
4     0.242109
5     0.966214
6     0.606640
7     0.659463
8     0.079981
9     0.985191
10    0.514804
11    0.740630
12    0.343256
13    0.216425
14    0.965135
15    0.122677
16    0.693645
17    0.221979
18    0.623134
19    0.943684
dtype: float64

In [218]:
g.idxmax(), g.idxmin()

(9, 8)

In [219]:
d1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
d1

Unnamed: 0,A,B,C
0,0.864378,1.041644,-0.143899
1,0.446322,-1.349712,-0.025831
2,0.508761,0.718674,0.106848
3,0.6512,-0.996189,-1.024657
4,0.590447,-0.420384,0.217416


In [220]:
d1.idxmin(axis=0)

A    1
B    1
C    3
dtype: int64

In [221]:
d1.idxmin(axis=1)

0    C
1    B
2    C
3    C
4    B
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [222]:
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
df3

Unnamed: 0,A
e,2.0
d,1.0
c,1.0
b,3.0
a,


In [223]:
df3["A"].idxmin()

#idxmin and idxmax are called argmin and argmax in NumPy.

'd'
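
To make the relationship to NumPy concrete: idxmin() returns an index label, whereas the argmin family returns an integer position. A sketch using the same data as df3 above:

```python
import numpy as np
import pandas as pd

col = pd.Series([2, 1, 1, 3, np.nan], index=list("edcba"))

label = col.idxmin()      # index label of the first minimum
position = col.argmin()   # integer position of the same element
numpy_position = np.nanargmin(col.to_numpy())  # NumPy's NaN-skipping analogue

print(label, position, numpy_position)
```
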

The value_counts() Series method and top-level function compute a histogram of a 1D array of values. The top-level function can also be used on regular arrays:

In [224]:
p=np.random.randint(3,10,size=20)
p

array([4, 7, 5, 8, 5, 3, 8, 4, 7, 6, 3, 9, 6, 6, 6, 6, 4, 8, 3, 8])

In [225]:
data=pd.Series(p)
data.value_counts()

6    5
8    4
4    3
3    3
7    2
5    2
9    1
dtype: int64

In [226]:
pd.value_counts(data)

6    5
8    4
4    3
3    3
7    2
5    2
9    1
dtype: int64
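
The Series method also accepts a few options that change what is counted. A sketch with illustrative data, covering missing values and relative frequencies:

```python
import numpy as np
import pandas as pd

labels = pd.Series(["a", "b", "a", np.nan, "a"])  # illustrative data

counts = labels.value_counts()                    # NaN excluded by default
with_missing = labels.value_counts(dropna=False)  # NaN counted as its own bucket
shares = labels.value_counts(normalize=True)      # relative frequencies

print(counts.to_dict(), int(with_missing.sum()), shares.to_dict())
```
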

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

In [227]:
data.mode()

0    6
dtype: int32

In [228]:
type(df1)

pandas.core.frame.DataFrame

In [229]:
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [230]:
df1.mode()

Unnamed: 0,A,B
0,1.0,2.0
1,3.0,3.0
2,5.0,6.0


In [231]:
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                       "B": np.random.randint(-10, 15, size=50)})
df5.mode()

Unnamed: 0,A,B
0,2,5


In [232]:
df5

Unnamed: 0,A,B
0,0,12
1,2,5
2,5,5
3,0,1
4,1,-9
5,6,-1
6,2,6
7,2,-1
8,3,14
9,6,-9


#### Discretization and quantiling
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions:

In [233]:
arr = np.random.randn(20)
factor = pd.cut(arr, 4)
factor

[(-2.115, -1.008], (0.0944, 1.197], (-2.115, -1.008], (-1.008, 0.0944], (1.197, 2.3], ..., (0.0944, 1.197], (0.0944, 1.197], (0.0944, 1.197], (-1.008, 0.0944], (0.0944, 1.197]]
Length: 20
Categories (4, interval[float64]): [(-2.115, -1.008] < (-1.008, 0.0944] < (0.0944, 1.197] < (1.197, 2.3]]

In [234]:
factor = pd.cut(arr, [-5, -1, 0, 1, 5])
factor

[(-5, -1], (0, 1], (-5, -1], (0, 1], (1, 5], ..., (0, 1], (0, 1], (0, 1], (-1, 0], (0, 1]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [235]:
arr = np.random.randn(30)
factor = pd.qcut(arr, [0, .25, .5, .75, 1])
factor

[(-0.348, -0.0428], (-0.0428, 0.592], (-3.36, -0.348], (0.592, 1.574], (-3.36, -0.348], ..., (-0.0428, 0.592], (-0.0428, 0.592], (-3.36, -0.348], (-0.0428, 0.592], (-3.36, -0.348]]
Length: 30
Categories (4, interval[float64]): [(-3.36, -0.348] < (-0.348, -0.0428] < (-0.0428, 0.592] < (0.592, 1.574]]

In [236]:
pd.value_counts(factor)

(0.592, 1.574]       8
(-3.36, -0.348]      8
(-0.0428, 0.592]     7
(-0.348, -0.0428]    7
dtype: int64
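
Because the cells above use random data, their bins change on every run. The following sketch uses fixed, illustrative values to contrast the two functions, and also shows the labels argument of cut():

```python
import numpy as np
import pandas as pd

ages = np.array([3, 17, 25, 42, 68, 90])  # illustrative values

# cut: fixed bin edges chosen by value, with optional human-readable labels.
groups = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=["minor", "adult", "senior"])

# qcut: edges chosen from the data so each bin holds about the same count.
quartiles = pd.qcut(np.arange(8), 4)

print(list(groups))
print(pd.Series(quartiles).value_counts().tolist())
```
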

In [237]:
#We can also pass infinite values to define the bins:
arr = np.random.randn(20)
factor = pd.cut(arr, [-np.inf, 0, np.inf])
factor


[(0.0, inf], (0.0, inf], (-inf, 0.0], (0.0, inf], (0.0, inf], ..., (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]

In [238]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Function application
To apply your own or another library’s functions to pandas objects, you should be aware of the methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.

Tablewise Function Application: pipe()

Row or Column-wise Function Application: apply()

Aggregation API: agg() and transform()

Applying Elementwise Functions: applymap()
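
A compact sketch of the column-wise, aggregation, and elementwise cases on an illustrative frame (pipe() is covered next). The elementwise step is written with Series.map inside apply() as an assumption-free spelling that does not depend on which pandas release provides applymap():

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 6.0, 8.0]})  # illustrative

# apply: call a function once per column (or per row with axis=1).
spread = frame.apply(lambda col: col.max() - col.min())

# agg: reduce each column with one or more named aggregations.
summary = frame.agg(["min", "max"])

# transform: return an object with the same shape as the input.
centered = frame.transform(lambda col: col - col.mean())

# Elementwise: map a scalar function over every value of every column.
labels = frame.apply(lambda col: col.map(lambda v: f"{v:.1f}"))

print(spread, summary, centered, labels, sep="\n\n")
```
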

f, g, and h are functions taking and returning ``DataFrame`` objects. Compare the nested call

f(g(h(df), arg1=1), arg2=2, arg3=3)

with the equivalent chained form:

(df.pipe(h)
    .pipe(g, arg1=1)
    .pipe(f, arg2=2, arg3=3))

Pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your own or another library’s functions in method chains, alongside pandas’ methods.
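
A runnable sketch of the two styles. The three functions are hypothetical stand-ins for h, g, and f; any functions that take and return a DataFrame would do:

```python
import pandas as pd

# Hypothetical stand-ins for h, g, and f: each takes and returns a DataFrame.
def drop_missing(frame):
    return frame.dropna()

def scale(frame, factor=1):
    return frame * factor

def add_total(frame, name="total"):
    return frame.assign(**{name: frame.sum(axis=1)})

raw = pd.DataFrame({"x": [1.0, None, 3.0], "y": [10.0, 20.0, 30.0]})

# Nested calls read inside-out...
nested = add_total(scale(drop_missing(raw), factor=2), name="sum")

# ...while the pipe chain reads top to bottom in execution order.
chained = (raw.pipe(drop_missing)
              .pipe(scale, factor=2)
              .pipe(add_total, name="sum"))

print(chained)
```
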

In [239]:
import statsmodels.formula.api as sm
bb = pd.read_csv('data/baseball.csv', index_col='id')
(bb.query('h > 0')
 .assign(ln_h=lambda df: np.log(df.h))
 .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
 .fit()
 .summary()  )

FileNotFoundError: [Errno 2] File b'data/baseball.csv' does not exist: b'data/baseball.csv'
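
The cell above fails only because data/baseball.csv is not present. The part worth noting is the tuple passed to pipe: (callable, 'data') tells pipe to hand the DataFrame over as the keyword argument data rather than as the first positional argument. A self-contained sketch, with a hypothetical stand-in for sm.ols so that neither statsmodels nor the CSV file is needed:

```python
import pandas as pd

def describe_model(formula, data):
    # Hypothetical stand-in for a function that expects the frame as `data`.
    return f"{formula} fitted on {len(data)} rows"

frame = pd.DataFrame({"h": [0, 2, 5], "hr": [0, 1, 3]})  # illustrative data

# Equivalent to describe_model("hr ~ h", data=frame.query("h > 0")).
result = (frame.query("h > 0")
               .pipe((describe_model, "data"), "hr ~ h"))
print(result)
```
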

### Sorting
Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.

The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its index levels.

In [None]:
f = pd.DataFrame({
       'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
f

In [None]:
f.sort_index()

In [None]:
unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                            columns=['three', 'two', 'one'])
unsorted_df

In [None]:
unsorted_df.sort_index()

In [None]:
unsorted_df.sort_index(ascending=False)

In [None]:
unsorted_df.sort_index(ascending=True)

In [None]:
unsorted_df.sort_index(axis=0)

In [None]:
unsorted_df["two"].sort_index(ascending=False)

#### By values
The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values() method is used to sort a DataFrame by its column or row values. The optional by parameter to DataFrame.sort_values() may be used to specify one or more columns to use to determine the sorted order.
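
A minimal sketch of these calls on an illustrative frame, using the resulting index order to show the effect of each sort key:

```python
import pandas as pd

frame = pd.DataFrame({"one": [2, 1, 1, 1],
                      "two": [1, 3, 2, 4],
                      "three": [5, 4, 3, 2]})  # illustrative data

by_two = frame.sort_values(by="two")             # single sort key
by_both = frame.sort_values(by=["one", "two"])   # ties in "one" broken by "two"
descending = frame["two"].sort_values(ascending=False)

print(by_two.index.tolist(), by_both.index.tolist(), descending.tolist())
```
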