In [1]:
import numpy as np
import pandas as pd

In [2]:
index = pd.date_range("20200101", periods=10)
index

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

In [3]:
pd.date_range("01/02/2020", periods=10)

DatetimeIndex(['2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05',
               '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09',
               '2020-01-10', '2020-01-11'],
              dtype='datetime64[ns]', freq='D')

In [4]:
s= pd.Series(np.random.randn(5), index=["a",'b','c','d','e'])
s

a   -0.205723
b   -0.141447
c    0.699864
d    0.728211
e   -0.979283
dtype: float64

In [5]:
np.random.random(5)

array([0.08020103, 0.44620141, 0.68698383, 0.39680892, 0.67901591])

In [6]:
np.random.randn(4)

array([ 0.37783787,  0.48024375,  0.49558613, -0.36306795])

In [7]:
df = pd.DataFrame(np.random.random((4,4)), index=["a",'b','c','d'], columns=["AA",'BB','CC','DD'])
df

Unnamed: 0,AA,BB,CC,DD
a,0.920808,0.002207,0.875871,0.126957
b,0.603517,0.64401,0.418788,0.842141
c,0.599518,0.455333,0.34619,0.346122
d,0.211778,0.440516,0.049556,0.899635


In [8]:
s.head(3)

a   -0.205723
b   -0.141447
c    0.699864
dtype: float64

In [9]:
s.tail(3)

c    0.699864
d    0.728211
e   -0.979283
dtype: float64

In [10]:
df.head(3)

Unnamed: 0,AA,BB,CC,DD
a,0.920808,0.002207,0.875871,0.126957
b,0.603517,0.64401,0.418788,0.842141
c,0.599518,0.455333,0.34619,0.346122


In [11]:
df.describe()

Unnamed: 0,AA,BB,CC,DD
count,4.0,4.0,4.0,4.0
mean,0.583905,0.385516,0.422601,0.553714
std,0.290179,0.271811,0.341794,0.377742
min,0.211778,0.002207,0.049556,0.126957
25%,0.502583,0.330939,0.272032,0.291331
50%,0.601518,0.447925,0.382489,0.594131
75%,0.68284,0.502502,0.533059,0.856514
max,0.920808,0.64401,0.875871,0.899635


In [12]:
df[:2]

Unnamed: 0,AA,BB,CC,DD
a,0.920808,0.002207,0.875871,0.126957
b,0.603517,0.64401,0.418788,0.842141


In [13]:
df1 = df.copy()

In [14]:
df1.columns = [i.lower() for i in df1.columns]
df1

Unnamed: 0,aa,bb,cc,dd
a,0.920808,0.002207,0.875871,0.126957
b,0.603517,0.64401,0.418788,0.842141
c,0.599518,0.455333,0.34619,0.346122
d,0.211778,0.440516,0.049556,0.899635


In [15]:
s.array

# To get the actual data inside an Index or Series, use the .array property

<PandasArray>
[-0.2057226089205353, -0.1414474357857834,  0.6998637499165424,
  0.7282112405823299, -0.9792832745231942]
Length: 5, dtype: float64

In [16]:
s.index.array

<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas uses them are a bit beyond the scope of this introduction. See dtypes for more.

In [17]:
# If you know you need a NumPy array, use to_numpy() or numpy.asarray().
s.to_numpy()

array([-0.20572261, -0.14144744,  0.69986375,  0.72821124, -0.97928327])

In [18]:
np.asarray(s)

array([-0.20572261, -0.14144744,  0.69986375,  0.72821124, -0.97928327])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values. See dtypes for more.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are two possibly useful representations:

1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
2. A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded

Timezones may be preserved with dtype=object.

In [19]:
ser = pd.Series(pd.date_range("2010", periods= 4, tz="CET"))
ser

0   2010-01-01 00:00:00+01:00
1   2010-01-02 00:00:00+01:00
2   2010-01-03 00:00:00+01:00
3   2010-01-04 00:00:00+01:00
dtype: datetime64[ns, CET]

In [20]:
ser.to_numpy(dtype=object)

array([Timestamp('2010-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-02 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-03 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-04 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

In [21]:
ser.to_numpy(dtype='datetime64[ns]')

array(['2009-12-31T23:00:00.000000000', '2010-01-01T23:00:00.000000000',
       '2010-01-02T23:00:00.000000000', '2010-01-03T23:00:00.000000000'],
      dtype='datetime64[ns]')

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:

1. When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy array or the extension array. Series.array will always return an ExtensionArray, and will never copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying / coercing values.
2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
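
The difference between .array and .to_numpy(), and the dtype coercion described in the note above, can be checked with a small sketch. The frames and variable names below are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5])

# .array exposes the backing ExtensionArray; .to_numpy() returns a plain ndarray.
backing = s.array
as_numpy = s.to_numpy()
is_extension = isinstance(backing, pd.api.extensions.ExtensionArray)
is_ndarray = isinstance(as_numpy, np.ndarray)

# Hypothetical frames: one mixes ints and floats, the other mixes ints and strings.
mixed_numeric = pd.DataFrame({"ints": [1, 2, 3], "floats": [0.5, 1.5, 2.5]})
with_strings = pd.DataFrame({"ints": [1, 2, 3], "labels": ["x", "y", "z"]})

# The resulting dtype is chosen to accommodate every column.
numeric_dtype = mixed_numeric.to_numpy().dtype
object_dtype = with_strings.to_numpy().dtype
print(is_extension, is_ndarray, numeric_dtype, object_dtype)
```

Because the integer column has to be converted to float (or to Python objects) to share one array, the result in both cases is a coerced copy rather than a view on the original data.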

#### Accelerated operations
pandas has support for accelerating certain types of binary numerical and boolean operations using the numexpr and bottleneck libraries.

These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast when dealing with arrays that have NaNs.

Here is a sample (using 100 column x 100,000 row DataFrames):

| Operation | 0.11.0 (ms) | Prior Version (ms) | Ratio to Prior |
|---|---|---|---|
| df1 > df2 | 13.32 | 125.35 | 0.1063 |
| df1 * df2 | 21.71 | 36.63 | 0.5928 |
| df1 + df2 | 22.04 | 36.50 | 0.6039 |

You are highly encouraged to install both libraries. See the section Recommended Dependencies for more installation info.

Both are enabled by default. You can control this by setting the following options:

New in version 0.20.0.

In [22]:
pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
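
If the override should apply only to one block of code, pd.option_context is an alternative to a global set_option call. The following is a minimal sketch; the variable names are illustrative only.

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
before = pd.get_option("compute.use_numexpr")

# option_context applies the override only inside the with-block.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

after = pd.get_option("compute.use_numexpr")
print(before, inside, after)
```

This keeps the rest of the session on whatever value was active before, which is convenient when comparing timings with and without acceleration.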

In [23]:
df = pd.DataFrame({
       'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
       'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
       'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,-1.083574,1.377826,-0.827176
c,-1.255433,0.656288,-0.980413
d,,-1.17692,1.202161


In [24]:
row = df.iloc[1]
row

one     -1.083574
two      1.377826
three   -0.827176
Name: b, dtype: float64

In [25]:
column = df['two']
column

a   -0.604258
b    1.377826
c    0.656288
d   -1.176920
Name: two, dtype: float64

In [26]:
df.sub(row, axis ="columns")

Unnamed: 0,one,two,three
a,0.936414,-1.982083,
b,0.0,0.0,0.0
c,-0.171859,-0.721537,-0.153236
d,,-2.554746,2.029338


In [27]:
df.sub(row, axis=1)

Unnamed: 0,one,two,three
a,0.936414,-1.982083,
b,0.0,0.0,0.0
c,-0.171859,-0.721537,-0.153236
d,,-2.554746,2.029338


In [28]:
df.sub(column, axis = "index")

Unnamed: 0,one,two,three
a,0.457098,0.0,
b,-2.461399,0.0,-2.205002
c,-1.911721,0.0,-1.636701
d,,0.0,2.379082


In [29]:
df.sub(column, axis=0)

Unnamed: 0,one,two,three
a,0.457098,0.0,
b,-2.461399,0.0,-2.205002
c,-1.911721,0.0,-1.636701
d,,0.0,2.379082


In [30]:
dfmi = df.copy()
dfmi.index = pd.MultiIndex.from_tuples([(1,"a"),(1,"b"),(1,"c"),(2,'a')],
                                       names=["first",'second'])
dfmi

first,second,one,two,three
1,a,-0.14716,-0.604258,
1,b,-1.083574,1.377826,-0.827176
1,c,-1.255433,0.656288,-0.980413
2,a,,-1.17692,1.202161


In [31]:
dfmi.sub(column,axis=0,level='second')

first,second,one,two,three
1,a,0.457098,0.0,
1,b,-2.461399,0.0,-2.205002
1,c,-1.911721,0.0,-1.636701
2,a,,-0.572662,1.806419


Series and Index also support the divmod() builtin. It performs floor division and the modulo operation at the same time, returning a two-tuple of the same type as the left-hand side. For example:

In [32]:
s = pd.Series(np.arange(10,20))
s

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

In [33]:
div, mod = divmod(s, 4)  # floor division (s // 4) and remainder (s % 4)
print(div)
print(mod)

0    2
1    2
2    3
3    3
4    3
5    3
6    4
7    4
8    4
9    4
dtype: int32
0    2
1    3
2    0
3    1
4    2
5    3
6    0
7    1
8    2
9    3
dtype: int32


In [34]:
idx=pd.Index(np.arange(10))
idx

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [35]:
divv, modd = divmod(idx,3)
print(divv)
print(modd)

Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')


In [36]:
#We can also do elementwise divmod():

di, mo = divmod(s,[2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
print(di)
print(mo)

0    5
1    5
2    4
3    4
4    3
5    3
6    3
7    3
8    3
9    3
dtype: int32
0    0
1    1
2    0
3    1
4    2
5    3
6    1
7    2
8    0
9    1
dtype: int32
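
As a quick sanity check on what divmod() returns, the sketch below compares it with the `//` and `%` operators and rebuilds the original values from the two parts. It recreates the same Series as above; the helper names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, 20))

# divmod returns (quotient, remainder) in one call.
div, mod = divmod(s, 4)

# The two parts match the floor-division and modulo operators.
matches_operators = div.equals(s // 4) and mod.equals(s % 4)

# quotient * divisor + remainder gives back the original values.
reconstructed = div * 4 + mod
print(matches_operators, reconstructed.equals(s))
```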


#### Working with missing value/data
In Series and DataFrame, the arithmetic functions accept a fill_value, namely a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).

In [37]:
df

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,-1.083574,1.377826,-0.827176
c,-1.255433,0.656288,-0.980413
d,,-1.17692,1.202161


In [38]:
dfm = df.copy()
dfm

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,-1.083574,1.377826,-0.827176
c,-1.255433,0.656288,-0.980413
d,,-1.17692,1.202161


In [39]:
df+dfm

Unnamed: 0,one,two,three
a,-0.29432,-1.208515,
b,-2.167147,2.755651,-1.654353
c,-2.510866,1.312576,-1.960825
d,,-2.35384,2.404323


In [40]:
df.add(dfm,fill_value=0)

Unnamed: 0,one,two,three
a,-0.29432,-1.208515,
b,-2.167147,2.755651,-1.654353
c,-2.510866,1.312576,-1.960825
d,,-2.35384,2.404323
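
Because dfm is an exact copy of df, every missing location is missing in both frames, so fill_value makes no visible difference in the cells above. A small sketch with two made-up frames (left and right are hypothetical names) shows the effect more clearly:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

# Plain addition: NaN on either side produces NaN.
plain = left + right

# fill_value=0 substitutes 0 only where exactly one side is missing;
# the last row stays NaN because both sides are missing there.
filled = left.add(right, fill_value=0)
print(plain)
print(filled)
```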


#### Flexible comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is analogous to the binary arithmetic operations described above:

In [41]:
df.ne(dfm)

Unnamed: 0,one,two,three
a,False,False,True
b,False,False,False
c,False,False,False
d,True,False,False


In [42]:
df.le(dfm)

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True


In [43]:
df.gt(dfm)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [44]:
df.lt(dfm)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [45]:
df.eq(dfm)

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True
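
The False entries in eq() and the True entries in ne() above appear where both frames hold NaN, since NaN never compares equal to itself. A minimal sketch of the difference between elementwise comparison and equals(), using a made-up Series:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = a.copy()

# Elementwise comparison: NaN == NaN evaluates to False.
elementwise = a.eq(b)
all_equal = bool(elementwise.all())

# equals() treats NaNs in the same location as equal.
identical = a.equals(b)
print(elementwise.tolist(), all_equal, identical)
```

This is why equals() is the safer way to test whether two objects hold the same data when missing values may be present.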


#### Boolean reductions
You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result.

In [46]:
(df>0).all()

one      False
two      False
three    False
dtype: bool

In [47]:
(df>0).any()

one      False
two       True
three     True
dtype: bool

In [48]:
df.empty

False

In [49]:
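df.bool
# Without parentheses this only displays the bound method; bool() itself is meant for single-element objects.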

<bound method NDFrame.bool of         one       two     three
a -0.147160 -0.604258       NaN
b -1.083574  1.377826 -0.827176
c -1.255433  0.656288 -0.980413
d       NaN -1.176920  1.202161>

In [50]:
dd=pd.DataFrame(columns=list('ABCD'))
dd

Unnamed: 0,A,B,C,D


In [51]:
dd.empty

True

In [52]:
pd.Series([True,]).bool()
#To evaluate single-element pandas objects in a boolean context, use the method bool()

True
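
To collapse a whole boolean DataFrame to a single True/False, the reductions can be chained. The sketch below uses a small made-up frame and avoids the bool() method, on the assumption that recent pandas releases discourage it; check the documentation for your version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [-0.1, -1.0, np.nan], "two": [-0.6, 1.3, 0.6]})
positive = df > 0  # NaN > 0 evaluates to False

# Reduce per column first, then across the resulting Series.
any_per_column = positive.any()
any_overall = bool(positive.any().any())
all_overall = bool(positive.all().all())

# .empty is a property, not a method.
empty_flag = pd.DataFrame(columns=list("AB")).empty
print(any_per_column.tolist(), any_overall, all_overall, empty_flag)
```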

In [53]:
df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})
df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

In [54]:
df1.equals(df2)

False

In [55]:
df1.equals(df2.sort_index())

True

In [56]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                       'B': [np.nan, 2., 3., np.nan, 6.]})

df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
                       'B': [np.nan, np.nan, 3., 4., 6., 8.]})


print(df1)
print(df2)
df1.combine_first(df2)

# df2 is the provided DataFrame used to fill null values in df1.

     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0


Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0


In [57]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                       'B': [np.nan, 2., 3., np.nan, 6.]})

df2 = pd.DataFrame({'A': [5., 2., np.nan, 3., 7.],
                       'B': [np.nan, 3., 4., 6., 8.]})


def combiner(x,y):
    return np.where(pd.isna(x),y,x)

combiner(df1, df2)  # called directly like this, the two DataFrames must have the same shape

array([[ 1., nan],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 5.,  6.],
       [ 7.,  6.]])
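
Calling combiner directly returns a bare NumPy array and needs equally shaped inputs. Passing the same function to DataFrame.combine() lets pandas align the two frames first and return a labeled DataFrame. The sketch below reuses the differently sized frames from the combine_first example; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0, 5.0, np.nan],
                    "B": [np.nan, 2.0, 3.0, np.nan, 6.0]})
df2 = pd.DataFrame({"A": [5.0, 2.0, 4.0, np.nan, 3.0, 7.0],
                    "B": [np.nan, np.nan, 3.0, 4.0, 6.0, 8.0]})


def combiner(x, y):
    # Take y wherever x is missing, otherwise keep x.
    return np.where(pd.isna(x), y, x)


# combine() aligns on the union of the indexes, then applies combiner column by column.
combined = df1.combine(df2, combiner)

# For this particular combiner the result should match combine_first().
reference = df1.combine_first(df2)
print(combined)
```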

#### Descriptive statistics
There exists a large number of methods for computing descriptive statistics and other related operations on Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or integer:

1. Series: no axis argument needed
2. DataFrame: “index” (axis=0, default), “columns” (axis=1)

For example:

In [58]:
df

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,-1.083574,1.377826,-0.827176
c,-1.255433,0.656288,-0.980413
d,,-1.17692,1.202161


In [59]:
df.mean(0)

one     -0.828722
two      0.063234
three   -0.201809
dtype: float64

In [60]:
df.mean(1)

a   -0.375709
b   -0.177642
c   -0.526519
d    0.012621
dtype: float64

In [61]:
df.max(0)

one     -0.147160
two      1.377826
three    1.202161
dtype: float64

In [62]:
df.quantile(0)

one     -1.255433
two     -1.176920
three   -0.980413
Name: 0, dtype: float64

In [63]:
df.quantile(1)

one     -0.147160
two      1.377826
three    1.202161
Name: 1, dtype: float64

In [64]:
df.sum(0)

one     -2.486166
two      0.252936
three   -0.605428
dtype: float64

In [65]:
df.sum(1)

a   -0.751417
b   -0.532925
c   -1.579557
d    0.025241
dtype: float64

In [66]:
df.cumsum(0)

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,-1.230733,0.773568,-0.827176
c,-2.486166,1.429856,-1.807589
d,,0.252936,-0.605428


In [67]:
df.cumsum(1)

Unnamed: 0,one,two,three
a,-0.14716,-0.751417,
b,-1.083574,0.294252,-0.532925
c,-1.255433,-0.599145,-1.579557
d,,-1.17692,0.025241


In [68]:
df.cumprod(0)

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,0.159459,-0.832562,-0.827176
c,-0.200189,-0.5464,0.810974
d,,0.64307,0.974922


In [69]:
df.cumprod(1)

Unnamed: 0,one,two,three
a,-0.14716,0.088922,
b,-1.083574,-1.492975,1.234954
c,-1.255433,-0.823926,0.807787
d,,-1.17692,-1.414848


All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [70]:
df.sum(0,skipna=False)

one           NaN
two      0.252936
three         NaN
dtype: float64

In [71]:
df.std()

one      0.596472
two      1.163814
three    1.218286
dtype: float64

In [72]:
t=(df-df.mean())/df.std()
t

Unnamed: 0,one,two,three
a,1.142655,-0.573538,
b,-0.427265,1.129554,-0.513317
c,-0.715391,0.509578,-0.639097
d,,-1.065594,1.152415


In [73]:
t.std()

one      1.0
two      1.0
three    1.0
dtype: float64

In [74]:
tt= df.sub(df.mean(1),axis=0).div(df.std(1),axis=0)
tt

Unnamed: 0,one,two,three
a,0.707107,-0.707107,
b,-0.669493,1.149507,-0.480013
c,-0.705266,1.144435,-0.439168
d,,-0.707107,0.707107


In [75]:
tt.std(1)

a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64

Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different from expanding() and rolling(). For more details please see this note.

In [76]:
df.cumsum()

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,-1.230733,0.773568,-0.827176
c,-2.486166,1.429856,-1.807589
d,,0.252936,-0.605428
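
The difference between cumsum() and expanding() mentioned above can be seen on a three-element Series. This is a minimal sketch with made-up values, relying on the default min_periods of expanding():

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# cumsum keeps NaN at its original position and skips it in the running total.
cumulative = s.cumsum()

# expanding().sum() reports the sum of the non-missing values seen so far,
# so the NaN position receives the running total instead.
expanding_sum = s.expanding().sum()
print(cumulative.tolist(), expanding_sum.tolist())
```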


In [77]:
df.mad()  # mean absolute deviation (mad() was deprecated in pandas 1.5 and removed in 2.0)

one      0.454375
two      0.953823
three    0.935980
dtype: float64
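
On pandas versions where mad() is not available, the same quantity can be computed from its definition: the mean of the absolute deviations from the mean. A sketch with made-up data follows; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 6.0])

# Mean absolute deviation: average distance from the mean.
mean_abs_dev = (values - values.mean()).abs().mean()

# The same expression works column-wise on a DataFrame; NaNs are skipped by default.
frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, 6.0], "y": [0.0, 0.0, np.nan, 4.0]})
frame_mad = (frame - frame.mean()).abs().mean()
print(mean_abs_dev)
print(frame_mad)
```

Applied to the df used in this notebook, the same expression should reproduce the values shown in the output above.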

Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:

In [78]:
np.mean(df["one"])

-0.8287221066065781

In [79]:
s.nunique() 
#Series.nunique() will return the number of unique non-NA values in a Series

10

In [80]:
s

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

In [81]:
s[2:4]=10
s

0    10
1    11
2    10
3    10
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):

In [82]:
s.describe()

count    10.000000
mean     14.000000
std       3.527668
min      10.000000
25%      10.250000
50%      14.500000
75%      16.750000
max      19.000000
dtype: float64

In [83]:
df.describe()

Unnamed: 0,one,two,three
count,3.0,4.0,3.0
mean,-0.828722,0.063234,-0.201809
std,0.596472,1.163814,1.218286
min,-1.255433,-1.17692,-0.980413
25%,-1.169503,-0.747423,-0.903795
50%,-1.083574,0.026015,-0.827176
75%,-0.615367,0.836673,0.187492
max,-0.14716,1.377826,1.202161


In [84]:
frame = pd.DataFrame(np.random.randn(1000,5), columns=["A","B","C",'D','E'])
frame

Unnamed: 0,A,B,C,D,E
0,1.119111,0.570313,-2.371166,-1.568762,-0.584562
1,1.688560,0.572376,-0.456748,1.369131,-1.228487
2,1.222151,-0.085707,1.233326,1.996221,0.719593
3,-1.252786,1.457041,1.687060,0.749092,-0.316029
4,1.173119,-0.802799,0.583715,-1.664543,-0.068625
...,...,...,...,...,...
995,2.180632,-0.304672,-2.201666,0.443770,0.014395
996,0.205895,0.513372,0.382290,-0.190812,-1.516186
997,0.215898,-0.833873,-0.639641,0.156782,0.635647
998,1.751670,4.078931,1.703456,0.729440,0.926235


In [85]:
frame.iloc[::3] = np.nan
frame

Unnamed: 0,A,B,C,D,E
0,,,,,
1,1.688560,0.572376,-0.456748,1.369131,-1.228487
2,1.222151,-0.085707,1.233326,1.996221,0.719593
3,,,,,
4,1.173119,-0.802799,0.583715,-1.664543,-0.068625
...,...,...,...,...,...
995,2.180632,-0.304672,-2.201666,0.443770,0.014395
996,,,,,
997,0.215898,-0.833873,-0.639641,0.156782,0.635647
998,1.751670,4.078931,1.703456,0.729440,0.926235


In [86]:
frame.describe()

Unnamed: 0,A,B,C,D,E
count,666.0,666.0,666.0,666.0,666.0
mean,0.008088,0.037701,0.040177,0.009678,-0.013471
std,0.979957,0.963086,1.064007,1.03775,1.001306
min,-2.828268,-2.670476,-3.570007,-3.118183,-4.389366
25%,-0.678669,-0.587674,-0.701599,-0.667675,-0.731302
50%,-0.02174,0.028418,0.046913,-0.027177,0.053234
75%,0.605654,0.67499,0.729106,0.707497,0.673535
max,3.308914,4.078931,3.565297,3.935502,3.199069


In [87]:
#You can select specific percentiles to include in the output:
df.describe(percentiles=[.05,.25,.60,.95])

Unnamed: 0,one,two,three
count,3.0,4.0,3.0
mean,-0.828722,0.063234,-0.201809
std,0.596472,1.163814,1.218286
min,-1.255433,-1.17692,-0.980413
5%,-1.238247,-1.091021,-0.965089
25%,-1.169503,-0.747423,-0.903795
50%,-1.083574,0.026015,-0.827176
60%,-0.896291,0.404179,-0.421309
95%,-0.240801,1.269595,0.999228
max,-0.14716,1.377826,1.202161


In [88]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if none are, only categorical columns:

In [89]:
f = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
f

Unnamed: 0,a,b
0,Yes,0
1,Yes,1
2,No,2
3,No,3


In [90]:
f.describe()

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


This behavior can be controlled by providing a list of types as include/exclude arguments. The special value all can also be used:

In [91]:
f.describe(include=['object'])

Unnamed: 0,a
count,4
unique,2
top,No
freq,2


In [92]:
f.describe(include=["number"])

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


In [93]:
f.describe(include='all')

Unnamed: 0,a,b
count,4,4.0
unique,2,
top,No,
freq,2,
mean,,1.5
std,,1.290994
min,,0.0
25%,,0.75
50%,,1.5
75%,,2.25
max,,3.0


That feature relies on select_dtypes. Refer to its documentation for details about accepted inputs.
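
A short sketch of select_dtypes on the same two-column frame shows which columns each include/exclude choice would hand to describe(). The result names are illustrative.

```python
import pandas as pd

f = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

# Keep only numeric columns, or everything except numeric columns.
numeric_cols = list(f.select_dtypes(include=["number"]).columns)
other_cols = list(f.select_dtypes(exclude=["number"]).columns)
print(numeric_cols, other_cols)
```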

#### Index of min/max values
The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [94]:
df.idxmin()

one      c
two      d
three    c
dtype: object

In [95]:
df.idxmax()

one      a
two      b
three    d
dtype: object

In [96]:
g = pd.Series(np.random.random(20))
g

0     0.100725
1     0.360648
2     0.800412
3     0.229033
4     0.444566
5     0.775535
6     0.419960
7     0.239429
8     0.776986
9     0.132523
10    0.242754
11    0.362829
12    0.831762
13    0.628190
14    0.285809
15    0.542261
16    0.822526
17    0.075662
18    0.027751
19    0.453622
dtype: float64

In [97]:
g.idxmax(), g.idxmin()

(12, 18)

In [98]:
d1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
d1

Unnamed: 0,A,B,C
0,-0.799103,0.188722,-2.395955
1,-2.129379,1.362218,-0.451214
2,0.241526,0.200685,1.031585
3,0.626288,1.980188,-2.057583
4,-0.705452,-0.602839,1.130286


In [99]:
d1.idxmin(axis=0)

A    1
B    4
C    0
dtype: int64

In [100]:
d1.idxmin(axis=1)

0    C
1    A
2    B
3    C
4    A
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [101]:
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
df3

Unnamed: 0,A
e,2.0
d,1.0
c,1.0
b,3.0
a,


In [102]:
df3["A"].idxmin()

#idxmin and idxmax are called argmin and argmax in NumPy.

'd'
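
The practical difference from NumPy is that idxmin() returns an index label, while argmin() on the underlying array returns an integer position. A minimal sketch with a made-up Series (no missing values, to keep the comparison simple):

```python
import pandas as pd

s = pd.Series([2, 1, 1, 3], index=list("edcb"))

# Label of the first minimum.
label = s.idxmin()

# Integer position of the first minimum in the underlying array.
position = int(s.to_numpy().argmin())

# Looking up the position in the index gives back the label.
same = s.index[position] == label
print(label, position, same)
```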

The value_counts() Series method computes a histogram of a 1D array of values. It is also available as a top-level function that can be used on regular arrays:

In [103]:
p=np.random.randint(3,10,size=20)
p

array([6, 4, 9, 8, 9, 6, 7, 8, 8, 4, 3, 7, 8, 7, 8, 6, 9, 3, 7, 4])

In [104]:
data=pd.Series(p)
data.value_counts()

8    5
7    4
9    3
6    3
4    3
3    2
dtype: int64

In [105]:
pd.value_counts(data)

8    5
7    4
9    3
6    3
4    3
3    2
dtype: int64

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

In [106]:
data.mode()

0    8
dtype: int32

In [107]:
type(df1)

pandas.core.frame.DataFrame

In [108]:
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [109]:
df1.mode()

Unnamed: 0,A,B
0,1.0,2.0
1,3.0,3.0
2,5.0,6.0


In [110]:
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                       "B": np.random.randint(-10, 15, size=50)})
df5.mode()

Unnamed: 0,A,B
0,5,-6


In [111]:
df5

Unnamed: 0,A,B
0,5,0
1,0,-6
2,6,-5
3,6,-10
4,5,0
5,2,12
6,0,-8
7,3,-2
8,3,7
9,6,14


#### Discretization and quantiling
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions.

In [112]:
arr = np.random.randn(20)
factor = pd.cut(arr, 4)
factor

[(-1.058, -0.133], (-0.133, 0.791], (-1.058, -0.133], (-1.058, -0.133], (-0.133, 0.791], ..., (-1.058, -0.133], (-0.133, 0.791], (-0.133, 0.791], (-1.058, -0.133], (0.791, 1.716]]
Length: 20
Categories (4, interval[float64]): [(-1.987, -1.058] < (-1.058, -0.133] < (-0.133, 0.791] < (0.791, 1.716]]

In [113]:
factor = pd.cut(arr, [-5, -1, 0, 1, 5])
factor

[(-1, 0], (0, 1], (-1, 0], (-1, 0], (0, 1], ..., (-1, 0], (-1, 0], (0, 1], (-1, 0], (1, 5]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [114]:
arr = np.random.randn(30)
factor = pd.qcut(arr, [0, .25, .5, .75, 1])
factor

[(-1.708, -1.007], (0.628, 2.457], (-0.0334, 0.628], (-0.0334, 0.628], (-1.007, -0.0334], ..., (-1.708, -1.007], (0.628, 2.457], (-0.0334, 0.628], (-1.007, -0.0334], (-0.0334, 0.628]]
Length: 30
Categories (4, interval[float64]): [(-1.708, -1.007] < (-1.007, -0.0334] < (-0.0334, 0.628] < (0.628, 2.457]]

In [115]:
pd.value_counts(factor)

(0.628, 2.457]       8
(-1.708, -1.007]     8
(-0.0334, 0.628]     7
(-1.007, -0.0334]    7
dtype: int64

In [116]:
#We can also pass infinite values to define the bins:
arr = np.random.randn(20)
factor = pd.cut(arr, [-np.inf, 0, np.inf])
factor


[(-inf, 0.0], (0.0, inf], (0.0, inf], (0.0, inf], (0.0, inf], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (-inf, 0.0]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
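
Since the cells above use random data, the bin assignments change on every run. The sketch below uses fixed, made-up values so the result is reproducible, and also shows the labels argument, which replaces the interval notation with names or integer codes.

```python
import numpy as np
import pandas as pd

ages = [5, 17, 30, 64, 80]

# cut: explicit bin edges, right-inclusive by default, with readable labels.
groups = pd.cut(ages, bins=[0, 18, 65, np.inf], labels=["child", "adult", "senior"])

# qcut: four quantile-based bins; labels=False returns integer bin codes.
quartile_codes = pd.qcut(np.arange(8), 4, labels=False)
print(list(groups), list(quartile_codes))
```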

In [117]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Function application
To apply your own or another library’s functions to pandas objects, you should be aware of the methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.

1. Tablewise Function Application: pipe()
2. Row or Column-wise Function Application: apply()
3. Aggregation API: agg() and transform()
4. Applying Elementwise Functions: applymap()

Suppose f, g, and h are functions that take and return a DataFrame. Compare the nested call

f(g(h(df), arg1=1), arg2=2, arg3=3)

with the equivalent chain:

(df.pipe(h)
    .pipe(g, arg1=1)
    .pipe(f, arg2=2, arg3=3))

Pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your own or another library’s functions in method chains, alongside pandas’ methods.
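
A runnable sketch of the two styles follows. The three helper functions (drop_missing, scale, add_offset) are hypothetical stand-ins for h, g, and f and are not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_missing(frame):
    # Hypothetical step 1: remove rows with missing values.
    return frame.dropna()


def scale(frame, factor):
    # Hypothetical step 2: multiply every value by a factor.
    return frame * factor


def add_offset(frame, offset=0.0):
    # Hypothetical step 3: shift every value by an offset.
    return frame + offset


df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# Nested style: reads inside-out.
nested = add_offset(scale(drop_missing(df), 2), offset=1.0)

# Chained style: reads top to bottom in execution order.
chained = (df.pipe(drop_missing)
             .pipe(scale, factor=2)
             .pipe(add_offset, offset=1.0))
print(chained)
```

Both forms compute the same result; the chained version simply lists the steps in the order they run.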

In [119]:
df

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,-1.083574,1.377826,-0.827176
c,-1.255433,0.656288,-0.980413
d,,-1.17692,1.202161


In [120]:
df.apply(np.mean)

one     -0.828722
two      0.063234
three   -0.201809
dtype: float64

In [121]:
df.apply(np.mean, axis=1)

a   -0.375709
b   -0.177642
c   -0.526519
d    0.012621
dtype: float64

In [126]:
df.apply(lambda x:x.max()- x.min(), axis=0)

one      1.108273
two      2.554746
three    2.182574
dtype: float64
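
The aggregation API listed earlier, agg() and transform(), is not demonstrated in the cells here. A minimal sketch on a made-up frame: agg() reduces each column to one value per function, while transform() returns an object with the same shape as the input.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

# agg with a list of function names: one row per function.
summary = frame.agg(["min", "max"])

# transform keeps the original shape; here each column is centered on its mean.
centered = frame.transform(lambda col: col - col.mean())
print(summary)
print(centered)
```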

In [127]:
df

Unnamed: 0,one,two,three
a,-0.14716,-0.604258,
b,-1.083574,1.377826,-0.827176
c,-1.255433,0.656288,-0.980413
d,,-1.17692,1.202161


In [128]:
df.reindex(index=['c','d','b'], columns =["three",'two','one'])

Unnamed: 0,three,two,one
c,-0.980413,0.656288,-1.255433
d,1.202161,-1.17692,
b,-0.827176,1.377826,-1.083574


In [129]:
df.reindex(["c",'a','d'], axis="index")

Unnamed: 0,one,two,three
c,-1.255433,0.656288,-0.980413
a,-0.14716,-0.604258,
d,,-1.17692,1.202161


In [130]:
df.reindex(['two','one','three'],axis="columns")

Unnamed: 0,two,one,three
a,-0.604258,-0.14716,
b,1.377826,-1.083574,-0.827176
c,0.656288,-1.255433,-0.980413
d,-1.17692,,1.202161


### Sorting
Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.

The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its index levels.

In [None]:
f = pd.DataFrame({
       'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
f

In [None]:
f.sort_index()

In [None]:
unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                            columns=['three', 'two', 'one'])
unsorted_df

In [None]:
unsorted_df.sort_index()

In [None]:
unsorted_df.sort_index(ascending=False)

In [None]:
unsorted_df.sort_index(ascending=True)

In [None]:
unsorted_df.sort_index(axis=0)

In [None]:
unsorted_df["two"].sort_index(ascending=False)
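
The sorting cells above carry no recorded output, so here is a small sketch with fixed, made-up values that shows what each sort_index() variant does. The frame below is hypothetical and only mirrors the label order of unsorted_df.

```python
import pandas as pd

frame = pd.DataFrame(
    {"three": [1.0, 2.0, 3.0, 4.0],
     "two": [5.0, 6.0, 7.0, 8.0],
     "one": [9.0, 10.0, 11.0, 12.0]},
    index=["a", "d", "c", "b"],
)

# Sort rows by index label (axis=0 is the default).
by_rows = frame.sort_index()
by_rows_desc = frame.sort_index(ascending=False)

# Sort columns by their labels instead.
by_columns = frame.sort_index(axis=1)
print(by_rows)
print(by_columns)
```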

#### By values
The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values() method is used to sort a DataFrame by its column or row values. The optional by parameter to DataFrame.sort_values() may be used to specify one or more columns to use to determine the sorted order.
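
A minimal sketch of the by parameter, using a made-up frame with tied values so that the effect of a second sort key is visible:

```python
import pandas as pd

df_sort = pd.DataFrame({"one": [2, 1, 1, 1],
                        "two": [1, 3, 2, 4],
                        "three": [5, 4, 3, 2]})

# Sort by a single column.
by_two = df_sort.sort_values(by="two")

# Sort by several columns: ties in "one" are broken by "two".
by_one_then_two = df_sort.sort_values(by=["one", "two"])
print(by_two)
print(by_one_then_two)
```

The original row labels travel with the rows, so the index of the result records where each row came from.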