 # Descriptive Statistics

 `pandas` objects come with a built-in set of common mathematical and statistical methods. Most of them are reductions or summary statistics: they collapse a Series, or each row or column of a DataFrame, to a single value.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.DataFrame(
  np.random.standard_normal((4,2)).round(2)*8,
  index=list("abcd"),
  columns=["one", "two"]
)
df

      one   two
a    9.52  2.56
b  -20.88  0.24
c   12.48  7.68
d    3.36  -8.8


In [3]:
df.loc["a", "two"] = np.nan
df.loc["c"] = np.nan
df

      one   two
a    9.52   NaN
b  -20.88  0.24
c     NaN   NaN
d    3.36  -8.8


In [4]:
df.sum()

one   -8.00
two   -8.56
dtype: float64

In [5]:
# Sum across columns:
df.sum(axis=1)

a     9.52
b   -20.64
c     0.00
d    -5.44
dtype: float64

 When every value in a row is `NaN`, the sum is 0, because `NaN` values are skipped by default.

In [6]:
df.sum(axis=1, skipna=False)

a      NaN
b   -20.64
c      NaN
d    -5.44
dtype: float64

 With `skipna=False`, any `NaN` value in a row makes the result for that row `NaN`.
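
 If a sum of 0 for an all-`NaN` row is misleading, `sum` also accepts a `min_count` argument. The sketch below rebuilds the frame by hand (`df_demo` is a hypothetical name, and its values are copied from the output shown earlier) and requires at least one valid value per row:

```python
import numpy as np
import pandas as pd

# Same values as the frame above, entered by hand so the example is reproducible.
df_demo = pd.DataFrame(
    {"one": [9.52, -20.88, np.nan, 3.36], "two": [np.nan, 0.24, np.nan, -8.8]},
    index=list("abcd"),
)

# NaN values are still skipped, but a row needs at least one valid value;
# otherwise its result is NaN instead of 0.
row_sums = df_demo.sum(axis=1, min_count=1)
print(row_sums)
```

 Row `c` now yields `NaN`, while rows `a`, `b` and `d` keep the sums they had with the default settings.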

 Some aggregation methods, like `mean`, need at least one non-NA value to produce a result; otherwise they return `NaN`:

In [7]:
df.mean(axis=1)

a     9.52
b   -10.32
c      NaN
d    -2.72
dtype: float64

In [8]:
df.idxmax()

one    a
two    b
dtype: object

In [9]:
df.cumsum()

      one    two
a    9.52    NaN
b  -11.36   0.24
c     NaN    NaN
d    -8.0  -8.56


In [10]:
df.describe()

             one       two
count        3.0       2.0
mean   -2.666667     -4.28
std    16.071109  6.392245
min       -20.88      -8.8
25%        -8.76     -6.54
50%         3.36     -4.28
75%         6.44     -2.02
max         9.52      0.24


In [11]:
pd.Series(["a", "a", "b", "c"]*4).describe()

count     16
unique     3
top        a
freq       8
dtype: object

In [12]:
df.quantile()

one    3.36
two   -4.28
Name: 0.5, dtype: float64

In [13]:
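# DataFrame.mad() was removed in pandas 2.0; this is the equivalent mean absolute deviation:
(df - df.mean()).abs().mean()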

one    12.142222
two     4.520000
dtype: float64

In [14]:
df.std()

one    16.071109
two     6.392245
dtype: float64

 ### Correlation & Covariance

In [15]:
price = pd.read_pickle("data/yahoo_price.pkl")
vol = pd.read_pickle("data/yahoo_volume.pkl")

In [16]:
price.tail()

                  AAPL        GOOG         IBM       MSFT
Date
2016-10-17  117.550003  779.960022  154.770004  57.220001
2016-10-18  117.470001   795.26001  150.720001      57.66
2016-10-19  117.120003       801.5  151.259995  57.529999
2016-10-20  117.059998  796.969971  151.520004      57.25
2016-10-21  116.599998  799.369995  149.630005      59.66


In [17]:
returns = price.pct_change()

In [18]:
returns.tail()

                 AAPL       GOOG        IBM       MSFT
Date
2016-10-17   -0.00068   0.001837   0.002072  -0.003483
2016-10-18  -0.000681   0.019616  -0.026168    0.00769
2016-10-19  -0.002979   0.007846   0.003583  -0.002255
2016-10-20  -0.000512  -0.005652   0.001719  -0.004867
2016-10-21   -0.00393   0.003011  -0.012474   0.042096


In [19]:
returns["IBM"].corr(returns["MSFT"])

0.49976361144151144

In [20]:
returns["AAPL"].cov(returns["GOOG"])

0.00010745748920152606
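
 Correlation is covariance rescaled by the two standard deviations. The pickled price files are not reproduced here, so the following is only a sketch on synthetic data; the series `x` and `y` and the seed are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
x = pd.Series(rng.standard_normal(250))
y = 0.5 * x + pd.Series(rng.standard_normal(250))

# Pearson correlation = covariance / (std_x * std_y)
manual_corr = x.cov(y) / (x.std() * y.std())
print(np.isclose(manual_corr, x.corr(y)))
```

 Both `cov` and `std` use the sample (n - 1) normalization by default, so the two results agree up to floating-point error.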

 **Note**: The `corrwith` method computes the correlation of each column with a Series passed as the argument.

In [21]:
returns.corrwith(returns["IBM"])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

 We can also pass a DataFrame as the argument; `corrwith` then computes the correlation between columns with matching names:

In [22]:
returns.corrwith(vol, axis=0)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64
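
 Matching by column name also determines what happens when the two frames do not share all columns. A small sketch with made-up frames (`left` and `right` are hypothetical, filled with random numbers) suggests the behavior:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
left = pd.DataFrame(rng.standard_normal((50, 3)), columns=["A", "B", "C"])
right = pd.DataFrame(rng.standard_normal((50, 2)), columns=["A", "B"])

# Columns are aligned by name; "C" has no partner in `right`, so its result is NaN.
result = left.corrwith(right)
print(result)

# drop=True omits labels that are missing from one of the frames.
result_dropped = left.corrwith(right, drop=True)
print(result_dropped)
```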

In [23]:
rep_ser = pd.Series(["a", "c", "a", "c", "b", "b", "c", "b", "d"])
rep_ser.unique()

array(['a', 'c', 'b', 'd'], dtype=object)

In [24]:
rep_ser.value_counts()

c    3
b    3
a    2
d    1
dtype: int64

In [25]:
mask = rep_ser.isin(["a", "b"])

In [26]:
# Filtering the data:
rep_ser[mask]

0    a
2    a
4    b
5    b
7    b
dtype: object

In [27]:
uni_vals = pd.Series(["c", "a", "b"])
indices = pd.Index(uni_vals).get_indexer(rep_ser)

In [28]:
# -1 because uni_vals doesn't contain "d"
indices

array([ 1,  0,  1,  0,  2,  2,  0,  2, -1])
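
 `get_indexer` returns, for each value in its argument, the position of that value in the index, or -1 when the value is absent. The sketch below, using hypothetical names `categories` and `values`, shows one way to drop the misses and map the positions back to labels:

```python
import pandas as pd

categories = pd.Index(["c", "a", "b"])
values = pd.Series(["a", "c", "d", "b"])

# "d" is not in `categories`, so its position is -1.
positions = categories.get_indexer(values)
print(positions)

# Keep only the positions that were found, then look the labels up again.
found = positions[positions != -1]
print(categories[found].tolist())
```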

In [29]:
dt = pd.DataFrame({"a":[4,4,1,2,3],"b":[2,2,1,3,5],"c":[1,2,1,1,4]})
dt

   a  b  c
0  4  2  1
1  4  2  2
2  1  1  1
3  2  3  1
4  3  5  4


 Compute value counts of a single column:

In [30]:
dt["a"].value_counts().sort_index()

1    1
2    1
3    1
4    2
Name: a, dtype: int64

 Compute value counts for every column with the apply method:

In [31]:
dt.apply(pd.Series.value_counts).fillna(0)

     a    b    c
1  1.0  1.0  3.0
2  1.0  2.0  1.0
3  1.0  1.0  0.0
4  2.0  0.0  1.0
5  0.0  1.0  0.0


 A DataFrame also has its own `value_counts` method, but it counts how often each unique row (each combination of column values) occurs, rather than counting values column by column.

In [32]:
dt_1 = pd.DataFrame({
  "a": [1,1,0,0,2],
  "b": [1,1,2,2,0],
  "c": [1,1,1,1,2]
})
dt_1

   a  b  c
0  1  1  1
1  1  1  1
2  0  2  1
3  0  2  1
4  2  0  2


In [33]:
# The index represents the unique rows as a hierarchical index:
dt_1.value_counts()

a  b  c
0  2  1    2
1  1  1    2
2  0  2    1
dtype: int64
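
 Because the result is indexed by row combinations, individual combinations can be looked up by tuple. A brief sketch, rebuilding the same frame under the hypothetical name `dt_demo`, that also uses the `normalize` and `subset` arguments:

```python
import pandas as pd

dt_demo = pd.DataFrame({
    "a": [1, 1, 0, 0, 2],
    "b": [1, 1, 2, 2, 0],
    "c": [1, 1, 1, 1, 2],
})

# Share of rows taken by each unique (a, b, c) combination.
shares = dt_demo.value_counts(normalize=True)

# Look up one combination by its tuple key.
print(shares.loc[(1, 1, 1)])

# Count combinations of a subset of the columns only.
pair_counts = dt_demo.value_counts(subset=["b", "c"])
print(pair_counts)
```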