In [1]:
import numpy as np
import pandas as pd

In [2]:
index = pd.date_range("20200101", periods=10)
index

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

In [5]:
pd.date_range("01/02/2020", periods=10)

DatetimeIndex(['2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05',
               '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09',
               '2020-01-10', '2020-01-11'],
              dtype='datetime64[ns]', freq='D')

In [6]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s

a    1.557325
b    1.105505
c   -0.792604
d   -1.446353
e   -0.688473
dtype: float64

In [7]:
np.random.random(5)

array([0.55382526, 0.28895585, 0.56556242, 0.28957263, 0.09300487])

In [8]:
np.random.randn(4)

array([0.47859304, 0.07671038, 1.31119782, 1.32821047])

In [9]:
df = pd.DataFrame(np.random.random((4, 4)), index=["a", "b", "c", "d"],
                  columns=["AA", "BB", "CC", "DD"])
df

Unnamed: 0,AA,BB,CC,DD
a,0.519544,0.918958,0.647774,0.586377
b,0.579397,0.85564,0.047073,0.55789
c,0.238252,0.687699,0.207992,0.698361
d,0.442447,0.54672,0.830636,0.343075


In [10]:
s.head(3)

a    1.557325
b    1.105505
c   -0.792604
dtype: float64

In [11]:
s.tail(3)

c   -0.792604
d   -1.446353
e   -0.688473
dtype: float64

In [12]:
df.head(3)

Unnamed: 0,AA,BB,CC,DD
a,0.519544,0.918958,0.647774,0.586377
b,0.579397,0.85564,0.047073,0.55789
c,0.238252,0.687699,0.207992,0.698361


In [13]:
df.describe()

Unnamed: 0,AA,BB,CC,DD
count,4.0,4.0,4.0,4.0
mean,0.44491,0.752254,0.433369,0.546426
std,0.14874,0.168217,0.366883,0.148508
min,0.238252,0.54672,0.047073,0.343075
25%,0.391398,0.652454,0.167762,0.504186
50%,0.480996,0.77167,0.427883,0.572134
75%,0.534507,0.87147,0.693489,0.614373
max,0.579397,0.918958,0.830636,0.698361


In [14]:
df[:2]

Unnamed: 0,AA,BB,CC,DD
a,0.519544,0.918958,0.647774,0.586377
b,0.579397,0.85564,0.047073,0.55789


In [15]:
df1 = df.copy()

In [16]:
df1.columns = [i.lower() for i in df1.columns]
df1

Unnamed: 0,aa,bb,cc,dd
a,0.519544,0.918958,0.647774,0.586377
b,0.579397,0.85564,0.047073,0.55789
c,0.238252,0.687699,0.207992,0.698361
d,0.442447,0.54672,0.830636,0.343075


In [20]:
s.array

# To get the actual data inside an Index or Series, use the .array property

<PandasArray>
[ 1.5573250963881682,  1.1055049188258699, -0.7926044418868525,
 -1.4463527044360456, -0.6884731346965303]
Length: 5, dtype: float64

In [21]:
s.index.array

<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas uses them are a bit beyond the scope of this introduction. See dtypes for more.

In [25]:
# If you know you need a NumPy array, use Series.to_numpy() or numpy.asarray().
s.to_numpy()

array([ 1.5573251 ,  1.10550492, -0.79260444, -1.4463527 , -0.68847313])

In [26]:
np.asarray(s)

array([ 1.5573251 ,  1.10550492, -0.79260444, -1.4463527 , -0.68847313])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values. See dtypes for more.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are two possibly useful representations:

- An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
- A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded

Timezones may be preserved with dtype=object:

In [27]:
ser = pd.Series(pd.date_range("2010", periods= 4, tz="CET"))
ser

0   2010-01-01 00:00:00+01:00
1   2010-01-02 00:00:00+01:00
2   2010-01-03 00:00:00+01:00
3   2010-01-04 00:00:00+01:00
dtype: datetime64[ns, CET]

In [28]:
ser.to_numpy(dtype=object)

array([Timestamp('2010-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-02 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-03 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2010-01-04 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

In [29]:
ser.to_numpy(dtype='datetime64[ns]')

array(['2009-12-31T23:00:00.000000000', '2010-01-01T23:00:00.000000000',
       '2010-01-02T23:00:00.000000000', '2010-01-03T23:00:00.000000000'],
      dtype='datetime64[ns]')

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:

- When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy array or the extension array. Series.array will always return an ExtensionArray, and will never copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying / coercing values.
- When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
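
As a rough illustration of the accessors discussed above, the sketch below builds a small categorical Series and two toy DataFrames. All names and values here are made up for this example; they are not taken from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a Series backed by an extension type (categorical).
cat_ser = pd.Series(["low", "high", "low"], dtype="category")

ext_arr = cat_ser.array      # an ExtensionArray (here a Categorical)
np_arr = cat_ser.to_numpy()  # always a numpy.ndarray, possibly a coerced copy

# Integer and float columns are coerced to a common float dtype ...
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
# ... while adding a string column forces object dtype for the whole array.
mixed = numeric.assign(s=["x", "y"])

print(type(ext_arr).__name__, type(np_arr).__name__)
print(numeric.to_numpy().dtype, mixed.to_numpy().dtype)
```

The point of the sketch is only that .array and .to_numpy() each have a predictable return type, which is the property .values lacks.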

#### Accelerated operations
pandas supports accelerating certain types of binary numerical and boolean operations using the numexpr and bottleneck libraries.

These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast when dealing with arrays that have NaNs.

Here is a sample (using 100 column x 100,000 row DataFrames):

Operation	0.11.0 (ms)	Prior Version (ms)	Ratio to Prior
df1 > df2	13.32	125.35	0.1063
df1 * df2	21.71	36.63	0.5928
df1 + df2	22.04	36.50	0.6039
You are highly encouraged to install both libraries. See the section Recommended Dependencies for more installation info.

Both are enabled by default; you can control this by setting the options:

New in version 0.20.0.

In [34]:
pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
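
If the change should only apply to part of a computation, one possible approach is pd.option_context, which restores the previous value on exit. This is a minimal sketch; the variable names are illustrative only.

```python
import pandas as pd

# Remember the current setting so it can be compared afterwards.
before = pd.get_option("compute.use_numexpr")

# Temporarily switch numexpr off; the previous value is restored on exit.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

after = pd.get_option("compute.use_numexpr")
print(before, inside, after)
```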

In [35]:
df = pd.DataFrame({
       'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
       'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
       'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,0.449903,0.592746,
b,1.342945,0.894979,0.33857
c,1.654376,-2.473131,0.337498
d,,-1.377757,-0.498093


In [39]:
row = df.iloc[1]
row

one      1.342945
two      0.894979
three    0.338570
Name: b, dtype: float64

In [41]:
column = df['two']
column

a    0.592746
b    0.894979
c   -2.473131
d   -1.377757
Name: two, dtype: float64

In [42]:
df.sub(row, axis ="columns")

Unnamed: 0,one,two,three
a,-0.893042,-0.302233,
b,0.0,0.0,0.0
c,0.311431,-3.36811,-0.001072
d,,-2.272736,-0.836663


In [45]:
df.sub(row, axis=1)

Unnamed: 0,one,two,three
a,-0.893042,-0.302233,
b,0.0,0.0,0.0
c,0.311431,-3.36811,-0.001072
d,,-2.272736,-0.836663


In [46]:
df.sub(column, axis = "index")

Unnamed: 0,one,two,three
a,-0.142843,0.0,
b,0.447966,0.0,-0.556409
c,4.127507,0.0,2.810629
d,,0.0,0.879664


In [48]:
df.sub(column, axis=0)

Unnamed: 0,one,two,three
a,-0.142843,0.0,
b,0.447966,0.0,-0.556409
c,4.127507,0.0,2.810629
d,,0.0,0.879664
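
The four cells above show that the axis argument decides which labels the Series is matched against: axis="columns" (or 1) aligns the Series index with the column names, while axis="index" (or 0) aligns it with the row labels. A small sketch with made-up, deterministic values (the toy frame is an assumption for illustration) makes the arithmetic easy to check by hand:

```python
import pandas as pd

# Toy data chosen so the results can be verified mentally.
toy = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
                   index=["a", "b", "c"])

row = toy.iloc[1]     # Series labelled by the column names
column = toy["two"]   # Series labelled by the row index

# Subtract row "b" from every row: match on column labels.
by_row = toy.sub(row, axis="columns")
# Subtract column "two" from every column: match on row labels.
by_col = toy.sub(column, axis="index")

print(by_row)
print(by_col)
```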


In [51]:
dfmi = df.copy()
dfmi.index = pd.MultiIndex.from_tuples([(1,"a"),(1,"b"),(1,"c"),(2,'a')],
                                       names=["first",'second'])
dfmi

first,second,one,two,three
1,a,0.449903,0.592746,
1,b,1.342945,0.894979,0.33857
1,c,1.654376,-2.473131,0.337498
2,a,,-1.377757,-0.498093


In [55]:
dfmi.sub(column,axis=0,level='second')

first,second,one,two,three
1,a,-0.142843,0.0,
1,b,0.447966,0.0,-0.556409
1,c,4.127507,0.0,2.810629
2,a,,-1.970502,-1.090838


Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple of the same type as the left-hand side. For example:

In [58]:
s = pd.Series(np.arange(10,20))
s

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

In [63]:
div, mod = divmod(s, 4)  # floor division (s // 4) and remainder (s % 4)
print(div)
print(mod)

0    2
1    2
2    3
3    3
4    3
5    3
6    4
7    4
8    4
9    4
dtype: int32
0    2
1    3
2    0
3    1
4    2
5    3
6    0
7    1
8    2
9    3
dtype: int32


In [64]:
idx=pd.Index(np.arange(10))
idx

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [66]:
divv, modd = divmod(idx,3)
print(divv)
print(modd)

Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')


In [67]:
#We can also do elementwise divmod():

di, mo = divmod(s,[2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
print(di)
print(mo)

0    5
1    5
2    4
3    4
4    3
5    3
6    3
7    3
8    3
9    3
dtype: int32
0    0
1    1
2    0
3    1
4    2
5    3
6    1
7    2
8    0
9    1
dtype: int32


#### Working with missing value/data
In Series and DataFrame, the arithmetic functions accept a fill_value, namely a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).

In [68]:
df

Unnamed: 0,one,two,three
a,0.449903,0.592746,
b,1.342945,0.894979,0.33857
c,1.654376,-2.473131,0.337498
d,,-1.377757,-0.498093


In [74]:
dfm = df.copy()
dfm

Unnamed: 0,one,two,three
a,0.449903,0.592746,
b,1.342945,0.894979,0.33857
c,1.654376,-2.473131,0.337498
d,,-1.377757,-0.498093


In [75]:
df+dfm

Unnamed: 0,one,two,three
a,0.899806,1.185491,
b,2.68589,1.789958,0.677139
c,3.308752,-4.946263,0.674996
d,,-2.755514,-0.996186


In [76]:
df.add(dfm,fill_value=0)

Unnamed: 0,one,two,three
a,0.899806,1.185491,
b,2.68589,1.789958,0.677139
c,3.308752,-4.946263,0.674996
d,,-2.755514,-0.996186
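
Because dfm is a copy of df, both frames are missing values in exactly the same places, so fill_value=0 makes no visible difference in the cell above. A minimal sketch with two hypothetical frames whose gaps differ shows the effect:

```python
import numpy as np
import pandas as pd

# Made-up data: position 1 is missing only on the left, position 2 on both sides.
left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

plain = left + right                    # NaN wherever either side is missing
filled = left.add(right, fill_value=0)  # NaN only where both sides are missing

print(plain)
print(filled)
```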


#### Flexible comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is analogous to the binary arithmetic operations described above:

In [77]:
df.ne(dfm)

Unnamed: 0,one,two,three
a,False,False,True
b,False,False,False
c,False,False,False
d,True,False,False


In [78]:
df.le(dfm)

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True


In [79]:
df.gt(dfm)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [80]:
df.lt(dfm)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [81]:
df.eq(dfm)

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True


#### Boolean reductions
You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result.

In [84]:
(df>0).all()

one      False
two      False
three    False
dtype: bool

In [91]:
(df>0).any()

one      True
two      True
three    True
dtype: bool
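
Reductions can also be chained to collapse a boolean DataFrame to a single value. The sketch below uses a small made-up frame (not the df from the cells above) so the results are easy to verify:

```python
import pandas as pd

# Toy frame: column "one" has a negative value, column "two" does not.
frame = pd.DataFrame({"one": [1.0, -2.0], "two": [3.0, 4.0]})
mask = frame > 0

per_column = mask.all()          # one Series entry per column
everything = mask.all().all()    # True only if every cell is positive
anything = mask.any().any()      # True if at least one cell is positive

print(per_column)
print(everything, anything)
```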

In [103]:
df.empty

False

In [104]:
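df.bool  # without parentheses this only returns the bound method; df.bool() on a multi-element DataFrame raises ValueError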

<bound method NDFrame.bool of         one       two     three
a  0.449903  0.592746       NaN
b  1.342945  0.894979  0.338570
c  1.654376 -2.473131  0.337498
d       NaN -1.377757 -0.498093>

In [108]:
dd=pd.DataFrame(columns=list('ABCD'))
dd

Unnamed: 0,A,B,C,D


In [109]:
dd.empty

True

In [114]:
pd.Series([True,]).bool()
#To evaluate single-element pandas objects in a boolean context, use the method bool()

True

In [115]:
df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})
df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

In [116]:
df1.equals(df2)

False

In [117]:
df1.equals(df2.sort_index())

True
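
The reason equals() exists alongside eq() is that NaN never compares equal to NaN elementwise, whereas equals() treats NaNs in the same location as equal. A minimal sketch with a hypothetical one-column frame:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"v": [1.0, np.nan]})
b = a.copy()

elementwise = a.eq(b)["v"].tolist()               # NaN == NaN is False
all_equal_elementwise = bool(a.eq(b).all().all())
same = a.equals(b)                                # NaNs in matching spots count as equal

print(elementwise, all_equal_elementwise, same)
```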

In [120]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                       'B': [np.nan, 2., 3., np.nan, 6.]})

df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
                       'B': [np.nan, np.nan, 3., 4., 6., 8.]})


print(df1)
print(df2)
df1.combine_first(df2)

# combine_first() fills null values in df1 with the corresponding values from df2.

     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0


Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0


In [124]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                       'B': [np.nan, 2., 3., np.nan, 6.]})

df2 = pd.DataFrame({'A': [5., 2., np.nan, 3., 7.],
                       'B': [np.nan, 3., 4., 6., 8.]})


def combiner(x,y):
    return np.where(pd.isna(x),y,x)

combiner(df1, df2)  # called directly like this, both DataFrames must have the same shape

array([[ 1., nan],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 5.,  6.],
       [ 7.,  6.]])
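
Calling combiner directly returns a bare NumPy array and loses the labels. One way to keep a DataFrame is to pass the same function to DataFrame.combine, which calls it once per pair of aligned columns. The sketch below uses smaller made-up frames, and the names left and right are illustrative only:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, np.nan]})
right = pd.DataFrame({"A": [5.0, 2.0, np.nan], "B": [np.nan, 3.0, 4.0]})


def combiner(x, y):
    # x and y are aligned columns; keep x unless it is missing.
    return np.where(pd.isna(x), y, x)


combined = left.combine(right, combiner)
print(combined)
```

With this particular combiner the result should match left.combine_first(right); combine() is the more general tool when a different rule is needed.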

#### Descriptive statistics
There exists a large number of methods for computing descriptive statistics and other related operations on Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or integer:

- Series: no axis argument needed
- DataFrame: “index” (axis=0, default), “columns” (axis=1)

For example:

In [126]:
df

Unnamed: 0,one,two,three
a,0.449903,0.592746,
b,1.342945,0.894979,0.33857
c,1.654376,-2.473131,0.337498
d,,-1.377757,-0.498093


In [127]:
df.mean(0)

one      1.149075
two     -0.590791
three    0.059325
dtype: float64

In [128]:
df.mean(1)

a    0.521324
b    0.858831
c   -0.160419
d   -0.937925
dtype: float64

In [129]:
df.max(0)

one      1.654376
two      0.894979
three    0.338570
dtype: float64

In [130]:
df.quantile(0)

one      0.449903
two     -2.473131
three   -0.498093
Name: 0, dtype: float64

In [133]:
df.quantile(1)

one      1.654376
two      0.894979
three    0.338570
Name: 1, dtype: float64

In [134]:
df.sum(0)

one      3.447224
two     -2.363164
three    0.177975
dtype: float64

In [135]:
df.sum(1)

a    1.042649
b    2.576494
c   -0.481257
d   -1.875850
dtype: float64

In [136]:
df.cumsum(0)

Unnamed: 0,one,two,three
a,0.449903,0.592746,
b,1.792848,1.487724,0.33857
c,3.447224,-0.985407,0.676068
d,,-2.363164,0.177975


In [137]:
df.cumsum(1)

Unnamed: 0,one,two,three
a,0.449903,1.042649,
b,1.342945,2.237924,2.576494
c,1.654376,-0.818755,-0.481257
d,,-1.377757,-1.87585


In [138]:
df.cumprod(0)

Unnamed: 0,one,two,three
a,0.449903,0.592746,
b,0.604195,0.530495,0.33857
c,0.999565,-1.311983,0.114267
d,,1.807594,-0.056915


In [139]:
df.cumprod(1)

Unnamed: 0,one,two,three
a,0.449903,0.266678,
b,1.342945,1.201907,0.406929
c,1.654376,-4.091489,-1.380869
d,,-1.377757,0.686251


All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [140]:
df.sum(0,skipna=False)

one           NaN
two     -2.363164
three         NaN
dtype: float64

In [142]:
df.std()

one      0.625202
two      1.609430
three    0.482738
dtype: float64

In [145]:
t=(df-df.mean())/df.std()
t

Unnamed: 0,one,two,three
a,-1.118312,0.735376,
b,0.310092,0.923165,0.57846
c,0.80822,-1.16957,0.57624
d,,-0.488972,-1.1547


In [146]:
t.std()

one      1.0
two      1.0
three    1.0
dtype: float64

In [147]:
tt= df.sub(df.mean(1),axis=0).div(df.std(1),axis=0)
tt

Unnamed: 0,one,two,three
a,-0.707107,0.707107,
b,0.962142,0.071841,-1.033983
c,0.860777,-1.096945,0.236168
d,,-0.707107,0.707107


In [148]:
tt.std(1)

a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
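
The cells above standardize the data twice: first per column (subtract the column mean, divide by the column standard deviation), then per row using sub() and div() with axis=0 so that the row statistics align on the row labels. A sketch with a made-up frame without missing values confirms that both results have mean 0 and standard deviation 1 along the chosen axis:

```python
import numpy as np
import pandas as pd

# Toy data; values are arbitrary but give non-zero spread in every row and column.
toy = pd.DataFrame({"one": [1.0, 2.0, 3.0, 6.0],
                    "two": [2.0, 4.0, 4.0, 8.0],
                    "three": [4.0, 3.0, 9.0, 1.0]})

# Column-wise: plain operators broadcast the per-column statistics across rows.
col_std = (toy - toy.mean()) / toy.std()

# Row-wise: the per-row statistics must be aligned on the index (axis=0).
row_std = toy.sub(toy.mean(axis=1), axis=0).div(toy.std(axis=1), axis=0)

print(col_std.std())
print(row_std.std(axis=1))
```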

Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different from expanding() and rolling(). For more details please see this note.

In [151]:
df.cumsum()

Unnamed: 0,one,two,three
a,0.449903,0.592746,
b,1.792848,1.487724,0.33857
c,3.447224,-0.985407,0.676068
d,,-2.363164,0.177975


In [153]:
df.mad() #mean absolute deviation

one      0.466114
two      1.334653
three    0.371612
dtype: float64
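
DataFrame.mad() was deprecated in pandas 1.5 and removed in 2.0, so the cell above fails on recent versions. The same quantity can be computed by hand as the mean of the absolute deviations from the mean. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"one": [1.0, 2.0, np.nan, 6.0],
                    "two": [1.0, 1.0, 1.0, 5.0]})

# Mean absolute deviation per column; NaNs are skipped by mean() by default.
mad = (toy - toy.mean()).abs().mean()
print(mad)
```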

Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:

In [154]:
np.mean(df["one"])

1.1490746273214292

In [157]:
s.nunique() 
#Series.nunique() will return the number of unique non-NA values in a Series

10

In [158]:
s

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

In [159]:
s[2:4]=10
s

0    10
1    11
2    10
3    10
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):

In [162]:
s.describe()

count    10.000000
mean     14.000000
std       3.527668
min      10.000000
25%      10.250000
50%      14.500000
75%      16.750000
max      19.000000
dtype: float64

In [163]:
df.describe()

Unnamed: 0,one,two,three
count,3.0,4.0,3.0
mean,1.149075,-0.590791,0.059325
std,0.625202,1.60943,0.482738
min,0.449903,-2.473131,-0.498093
25%,0.896424,-1.6516,-0.080297
50%,1.342945,-0.392506,0.337498
75%,1.49866,0.668304,0.338034
max,1.654376,0.894979,0.33857


In [172]:
frame = pd.DataFrame(np.random.randn(1000,5), columns=["A","B","C",'D','E'])
frame

Unnamed: 0,A,B,C,D,E
0,1.295016,-1.473933,2.496577,-0.691634,-0.193041
1,-0.136743,0.346487,-0.820700,0.167932,1.147856
2,-1.309312,0.509353,0.957942,0.537695,0.791656
3,-0.079666,-1.051060,-1.139281,-0.609384,-0.863128
4,0.807666,-0.448098,0.839355,-0.111455,1.854615
...,...,...,...,...,...
995,-0.136743,0.127676,-0.459501,1.192484,-1.412761
996,0.015956,1.735598,-1.737024,0.230325,0.001870
997,1.283001,0.688865,-0.100347,0.314033,-0.645448
998,-0.228850,1.362646,0.069623,1.090768,0.696858


In [173]:
frame.iloc[::3] = np.nan
frame

Unnamed: 0,A,B,C,D,E
0,,,,,
1,-0.136743,0.346487,-0.820700,0.167932,1.147856
2,-1.309312,0.509353,0.957942,0.537695,0.791656
3,,,,,
4,0.807666,-0.448098,0.839355,-0.111455,1.854615
...,...,...,...,...,...
995,-0.136743,0.127676,-0.459501,1.192484,-1.412761
996,,,,,
997,1.283001,0.688865,-0.100347,0.314033,-0.645448
998,-0.228850,1.362646,0.069623,1.090768,0.696858


In [174]:
frame.describe()

Unnamed: 0,A,B,C,D,E
count,666.0,666.0,666.0,666.0,666.0
mean,-0.038969,0.025101,-0.053885,0.013629,-0.06305
std,0.999477,0.996208,1.003246,0.988198,1.001996
min,-2.90029,-3.104851,-2.880584,-2.963152,-3.050889
25%,-0.724998,-0.68996,-0.781257,-0.602064,-0.730167
50%,-0.093509,0.009281,-0.056331,0.021452,-0.092027
75%,0.591938,0.718364,0.645989,0.639454,0.626842
max,3.05811,2.878342,2.803697,2.903416,2.691081


In [178]:
#You can select specific percentiles to include in the output:
df.describe(percentiles=[.05,.25,.60,.95])

Unnamed: 0,one,two,three
count,3.0,4.0,3.0
mean,1.149075,-0.590791,0.059325
std,0.625202,1.60943,0.482738
min,0.449903,-2.473131,-0.498093
5%,0.539207,-2.308825,-0.414534
25%,0.896424,-1.6516,-0.080297
50%,1.342945,-0.392506,0.337498
60%,1.405231,0.198645,0.337712
95%,1.623233,0.849644,0.338463
max,1.654376,0.894979,0.33857


In [179]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if none are, only categorical columns:

In [183]:
f = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
f

Unnamed: 0,a,b
0,Yes,0
1,Yes,1
2,No,2
3,No,3


In [184]:
f.describe()

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


This behavior can be controlled by providing a list of types as include/exclude arguments. The special value 'all' can also be used:

In [185]:
f.describe(include=['object'])

Unnamed: 0,a
count,4
unique,2
top,Yes
freq,2


In [186]:
f.describe(include=["number"])

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


In [187]:
f.describe(include='all')

Unnamed: 0,a,b
count,4,4.0
unique,2,
top,Yes,
freq,2,
mean,,1.5
std,,1.290994
min,,0.0
25%,,0.75
50%,,1.5
75%,,2.25


That feature relies on select_dtypes. Refer to it for details about accepted inputs.
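
A short sketch of select_dtypes on the same kind of mixed frame, showing how the include and exclude arguments pick columns by dtype (the frame is rebuilt here so the snippet is self-contained):

```python
import pandas as pd

f = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

numeric_cols = f.select_dtypes(include=["number"]).columns.tolist()
other_cols = f.select_dtypes(exclude=["number"]).columns.tolist()

print(numeric_cols, other_cols)
```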

#### Index of min/max values
The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [189]:
df.idxmin()

one      a
two      c
three    d
dtype: object

In [190]:
df.idxmax()

one      c
two      b
three    b
dtype: object

In [198]:
g = pd.Series(np.random.random(20))
g

0     0.866841
1     0.210203
2     0.814791
3     0.491553
4     0.441327
5     0.244257
6     0.461070
7     0.956984
8     0.895748
9     0.868099
10    0.772524
11    0.867795
12    0.394307
13    0.710722
14    0.018234
15    0.143318
16    0.250915
17    0.827019
18    0.288053
19    0.007449
dtype: float64

In [200]:
g.idxmax(), g.idxmin()

(7, 19)

In [205]:
d1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
d1

Unnamed: 0,A,B,C
0,-0.178896,0.50828,1.342792
1,0.695323,-0.345841,0.065437
2,-0.079593,-0.253305,-1.748115
3,-1.038757,0.829311,1.41431
4,-0.541476,-0.246002,0.465728


In [206]:
d1.idxmin(axis=0)

A    3
B    1
C    2
dtype: int64

In [208]:
d1.idxmin(axis=1)

0    A
1    B
2    C
3    A
4    A
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [210]:
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
df3

Unnamed: 0,A
e,2.0
d,1.0
c,1.0
b,3.0
a,


In [213]:
df3["A"].idxmin()

#idxmin and idxmax are called argmin and argmax in NumPy.

'd'
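
The naming difference also reflects a difference in what is returned: idxmin() gives an index label, while argmin() gives an integer position. A sketch on the same values, rebuilt as a Series so the snippet stands alone:

```python
import numpy as np
import pandas as pd

s3 = pd.Series([2, 1, 1, 3, np.nan], index=list("edcba"))

label = s3.idxmin()     # label of the first minimum; NaN is skipped
position = s3.argmin()  # integer position of the same element

print(label, position)
```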

The value_counts() Series method and top-level function compute a histogram of a 1D array of values. It can also be used as a function on regular arrays:

In [222]:
p=np.random.randint(3,10,size=20)
p

array([4, 3, 4, 4, 5, 5, 6, 9, 5, 4, 6, 9, 7, 6, 3, 8, 3, 4, 4, 5])

In [223]:
data=pd.Series(p)
data.value_counts()

4    6
5    4
6    3
3    3
9    2
8    1
7    1
dtype: int64

In [224]:
pd.value_counts(data)

4    6
5    4
6    3
3    3
9    2
8    1
7    1
dtype: int64
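
value_counts() also accepts normalize and dropna arguments. The sketch below uses a tiny made-up Series to show relative frequencies and the handling of NaN:

```python
import numpy as np
import pandas as pd

votes = pd.Series([4, 4, 5, np.nan])

counts = votes.value_counts()                # NaN is dropped by default
shares = votes.value_counts(normalize=True)  # relative frequencies of non-NaN values
with_nan = votes.value_counts(dropna=False)  # NaN gets its own bucket

print(counts)
print(shares)
print(with_nan)
```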

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

In [226]:
data.mode()

0    4
dtype: int32

In [229]:
type(df1)

pandas.core.frame.DataFrame

In [231]:
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [230]:
df1.mode()

Unnamed: 0,A,B
0,1.0,2.0
1,3.0,3.0
2,5.0,6.0


In [232]:
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                       "B": np.random.randint(-10, 15, size=50)})
df5.mode()

Unnamed: 0,A,B
0,1,7
1,3,14


In [233]:
df5

Unnamed: 0,A,B
0,3,11
1,2,5
2,6,0
3,3,14
4,1,9
5,6,3
6,0,-1
7,4,7
8,2,-5
9,0,14


#### Discretization and quantiling
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions:

In [235]:
arr = np.random.randn(20)
factor = pd.cut(arr, 4)
factor

[(-0.493, 0.293], (-1.282, -0.493], (0.293, 1.078], (1.078, 1.864], (0.293, 1.078], ..., (-0.493, 0.293], (0.293, 1.078], (1.078, 1.864], (-1.282, -0.493], (-1.282, -0.493]]
Length: 20
Categories (4, interval[float64]): [(-1.282, -0.493] < (-0.493, 0.293] < (0.293, 1.078] < (1.078, 1.864]]

In [236]:
factor = pd.cut(arr, [-5, -1, 0, 1, 5])
factor

[(0, 1], (-1, 0], (0, 1], (1, 5], (0, 1], ..., (0, 1], (0, 1], (1, 5], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [237]:
arr = np.random.randn(30)
factor = pd.qcut(arr, [0, .25, .5, .75, 1])
factor

[(-2.276, -0.624], (0.548, 2.227], (-2.276, -0.624], (-2.276, -0.624], (0.548, 2.227], ..., (-0.624, 0.0279], (-2.276, -0.624], (0.0279, 0.548], (-0.624, 0.0279], (0.548, 2.227]]
Length: 30
Categories (4, interval[float64]): [(-2.276, -0.624] < (-0.624, 0.0279] < (0.0279, 0.548] < (0.548, 2.227]]

In [238]:
pd.value_counts(factor)

(0.548, 2.227]      8
(-2.276, -0.624]    8
(0.0279, 0.548]     7
(-0.624, 0.0279]    7
dtype: int64
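
With random input the bin edges change on every run, so a deterministic sketch may be easier to follow. The values and label names below are made up for illustration; the labels argument of cut() replaces the interval notation with readable names.

```python
import numpy as np
import pandas as pd

# qcut: eight evenly spaced values split into quartiles, two values per bin.
values = np.arange(8)
quartiles = pd.qcut(values, 4)
sizes = pd.Series(quartiles).value_counts().tolist()

# cut: explicit bin edges with hypothetical human-readable labels.
labelled = pd.cut([-2.0, -0.5, 0.5, 2.0], [-5, -1, 0, 1, 5],
                  labels=["very low", "low", "high", "very high"])

print(sizes)
print(list(labelled))
```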

In [240]:
#We can also pass infinite values to define the bins:
arr = np.random.randn(20)
factor = pd.cut(arr, [-np.inf, 0, np.inf])
factor


[(0.0, inf], (0.0, inf], (-inf, 0.0], (0.0, inf], (0.0, inf], ..., (0.0, inf], (-inf, 0.0], (0.0, inf], (-inf, 0.0], (-inf, 0.0]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]