<h1 align='center'>5.4 Summarizing and Computing Descriptive Statistics</h1>

Compared with the similar reduction and summary-statistics methods found on NumPy arrays, the pandas methods have built-in handling for missing data.

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [2]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [3]:
df.sum(axis=1, skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
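To make the contrast with NumPy concrete, here is a minimal sketch on made-up values (the variable names are illustrative, not from the original notebook). It assumes default settings: a plain NumPy reduction propagates NaN, while the pandas reduction skips it unless skipna=False is passed.

```python
import numpy as np
import pandas as pd

values = [1.4, np.nan, 0.75]  # hypothetical data with one missing entry

arr = np.array(values)
ser = pd.Series(values)

numpy_sum = arr.sum()                   # NaN propagates through a plain NumPy sum
numpy_nansum = np.nansum(arr)           # NumPy needs the NaN-aware variant to skip it
pandas_sum = ser.sum()                  # pandas skips NaN by default (skipna=True)
pandas_strict = ser.sum(skipna=False)   # opting out restores NaN propagation

print(numpy_sum, numpy_nansum, pandas_sum, pandas_strict)
```

This is why df.sum() above returns a number for each column, while the skipna=False call returns NaN for any row containing a missing value.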

In [4]:
df.idxmax()

one    b
two    d
dtype: object

In [5]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


On non-numeric data, describe produces alternative summary statistics, such as the number of unique values and the most frequent value.

In [6]:
obj = pd.Series(['a', 'b', 'b', 'c'] * 4)

In [7]:
obj.describe()

count     16
unique     3
top        b
freq       8
dtype: object
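As a rough check of what the non-numeric summary reports, the sketch below rebuilds the same Series and compares the describe fields with values derived from nunique and value_counts. It is only an illustration of how the fields relate, not a description of how pandas computes them internally.

```python
import pandas as pd

obj = pd.Series(['a', 'b', 'b', 'c'] * 4)

summary = obj.describe()
counts = obj.value_counts()   # most frequent value comes first

# 'unique' is the number of distinct values; 'top' and 'freq' describe the mode
print(summary['unique'], obj.nunique())
print(summary['top'], counts.index[0])
print(summary['freq'], counts.iloc[0])
```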

<b>Covariance and Correlation</b>

In [9]:
import pandas_datareader.data as web

# Download the price history for each ticker (requires network access)
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

# One column per ticker: adjusted closing prices and traded volume
price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})

volume = pd.DataFrame({ticker: data['Volume']
                       for ticker, data in all_data.items()})

In [12]:
returns = price.pct_change()

In [13]:
returns.tail()

                AAPL       IBM      MSFT      GOOG
Date
2020-05-26 -0.006774  0.028465 -0.010572  0.004679
2020-05-27  0.004357  0.031045  0.001322  0.000579
2020-05-28  0.000440 -0.008045 -0.002255 -0.000783
2020-05-29 -0.000974  0.002971  0.010198  0.008604
2020-06-01  0.012298 -0.000080 -0.002292  0.002029


The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance.

In [14]:
returns['MSFT'].corr(returns['IBM'])

0.5969004746680456

In [15]:
returns['MSFT'].cov(returns['IBM'])

0.00016219443170635022
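Because the stock data above needs a network download, the following sketch uses two small made-up Series to show what "overlapping, non-NA, aligned-by-index" means in practice. The labels and values are hypothetical. The covariance and correlation are recomputed by hand on the rows both Series share and compared with the corr and cov results.

```python
import numpy as np
import pandas as pd

# Hypothetical series with partly overlapping indexes and one missing value
x = pd.Series([0.01, -0.02, 0.03, np.nan, 0.02], index=list('abcde'))
y = pd.Series([0.02, -0.01, 0.01, 0.04, 0.03, 0.00], index=list('bcdefg'))

# Align on index labels and keep only rows where both values are present
pair = pd.concat([x, y], axis=1, keys=['x', 'y']).dropna()

# Sample covariance (n - 1 denominator) and Pearson correlation by hand
dx = pair['x'] - pair['x'].mean()
dy = pair['y'] - pair['y'].mean()
manual_cov = (dx * dy).sum() / (len(pair) - 1)
manual_corr = manual_cov / (pair['x'].std() * pair['y'].std())

print(list(pair.index))
print(x.cov(y), manual_cov)
print(x.corr(y), manual_corr)
```

Only the labels present in both Series with non-missing values contribute, so the unmatched labels and the NaN entry are ignored.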

In [16]:
returns.corr()

          AAPL       IBM      MSFT      GOOG
AAPL  1.000000  0.531264  0.710722  0.642200
IBM   0.531264  1.000000  0.596900  0.527997
MSFT  0.710722  0.596900  1.000000  0.751200
GOOG  0.642200  0.527997  0.751200  1.000000


DataFrame’s corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, as in the returns.corr() output above.
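The structure of these matrices can be checked offline with a small sketch. The random data and the column names A, B, and C are assumptions made purely for illustration; they stand in for the downloaded returns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the returns table: 50 rows, 3 made-up columns
frame = pd.DataFrame(rng.standard_normal((50, 3)), columns=['A', 'B', 'C'])

corr_matrix = frame.corr()
cov_matrix = frame.cov()

# Each entry of the matrix matches the corresponding pairwise Series result
print(corr_matrix.loc['A', 'B'], frame['A'].corr(frame['B']))
# The diagonal of the covariance matrix holds each column's variance
print(cov_matrix.loc['A', 'A'], frame['A'].var())
```

Both matrices are symmetric and labeled by column name on both axes, and the correlation matrix has ones on its diagonal.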

<b>Unique Values, Value Counts, and Membership</b>

In [17]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [19]:
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [21]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

isin performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame.

In [22]:
mask = obj.isin(['b', 'c'])

In [23]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [24]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
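The same mask works on a DataFrame column. The sketch below uses a hypothetical table (the names orders, item, and qty are invented for illustration) and keeps only the rows whose item is in the wanted set. Negating the mask with ~ selects the complement.

```python
import pandas as pd

# Hypothetical table; the column names are illustrative only
orders = pd.DataFrame({'item': ['c', 'a', 'd', 'a', 'b'],
                       'qty': [3, 1, 4, 1, 5]})

wanted = ['b', 'c']
mask = orders['item'].isin(wanted)

subset = orders[mask]    # rows whose item is 'b' or 'c'
rest = orders[~mask]     # all remaining rows

print(subset)
```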

Related to isin is the Index.get_indexer method, which gives you an index array from an array of possibly non-distinct values into another array of distinct values.

In [25]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)
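One detail the example above does not exercise is what happens when a value has no match. In this small sketch, with the made-up value 'z' added for illustration, get_indexer marks such entries with -1, so they can be filtered out before using the positions.

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])
to_match = pd.Series(['c', 'a', 'z', 'b'])   # 'z' is not among the distinct values

positions = unique_vals.get_indexer(to_match)
print(positions.tolist())   # [0, 2, -1, 1]

# Keep only the positions that were found, then map them back to values
found = positions != -1
recovered = unique_vals.take(positions[found])
print(list(recovered))
```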

In some cases, you may want to compute a histogram on multiple related columns in a DataFrame.

In [26]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

In [31]:
result = data.apply(pd.Series.value_counts).fillna(0)


In [32]:
result

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0


The row labels in the result are the distinct values occurring in all of the columns. The values are the respective counts of these values in each column.
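To spell out what the per-column counting does, here is an equivalent sketch that counts each column separately and then aligns the counts on the union of observed values. The final cast to integers is an optional choice made here, since the filled counts are whole numbers.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count each column on its own; the Series are indexed by the observed values
counts = {col: data[col].value_counts() for col in data.columns}

# Building a DataFrame aligns the counts on the union of values;
# a value absent from a column shows up as NaN and is filled with 0
manual = pd.DataFrame(counts).fillna(0).sort_index()

# Optional: store the counts as integers rather than floats
manual = manual.astype(int)
print(manual)
```

Each column of the result sums to the number of rows in data, which is a quick sanity check on the counts.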