Statistical methods help in understanding and analyzing the behavior of data. We will now look at a few statistical functions that can be applied to Pandas objects.



# Percent Change

Series and DataFrame objects both provide the pct_change() method (as did Panel before it was removed from pandas). This method compares every element with the element before it and computes the fractional change, so a result of 0.5 means a 50% increase.

In [2]:
import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])
print(s.pct_change())

df = pd.DataFrame(np.random.randn(5, 2))
df.pct_change(axis=1)

0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64


    0         1
0 NaN  1.969364
1 NaN -0.589671
2 NaN  3.931782
3 NaN -1.816564
4 NaN -0.382268


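By default, pct_change() works down each column, comparing every row with the row above it. To compare across columns within each row instead, pass the axis=1 argument. In that case the first column has no prior element, which is why column 0 is NaN in the output above.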
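To make the calculation concrete, here is a minimal sketch that reproduces pct_change() by hand with shift() and then contrasts the two axes. The DataFrame values and the variable names (`manual`, `down_columns`, `across_rows`) are made up for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

# pct_change() amounts to (current - previous) / previous.
previous = s.shift(1)
manual = (s - previous) / previous
print(np.allclose(manual, s.pct_change(), equal_nan=True))

# Illustrative fixed values, so the two axes are easy to compare by eye.
df = pd.DataFrame({"x": [10.0, 20.0, 30.0], "y": [20.0, 30.0, 60.0]})

down_columns = df.pct_change()       # default: each row vs. the row above it
across_rows = df.pct_change(axis=1)  # axis=1: each column vs. the column to its left

print(down_columns)
print(across_rows)
```

With axis=1 the whole first column is NaN, because there is no column to its left to compare against.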

# Covariance
Covariance measures how two sets of numeric data vary together. The Series object has a cov method that computes the covariance between two Series. Pairs in which either value is missing (NA) are excluded automatically.

In [3]:
# Covariance between two Series
 
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
s1.cov(s2)

-0.268789680155105
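The automatic exclusion of NA values can be checked with a small sketch. The two Series below are illustrative values, not taken from the example above; the idea is that cov should give the same result as dropping the incomplete pairs by hand first.

```python
import numpy as np
import pandas as pd

# Illustrative data: each Series has one missing value, at different positions.
s1 = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# Only positions 0, 1 and 4 have a value in both Series.
cov_with_gaps = s1.cov(s2)

# Drop the incomplete pairs manually and compute the covariance again.
complete = pd.concat([s1, s2], axis=1).dropna()
cov_complete = complete[0].cov(complete[1])

print(cov_with_gaps, cov_complete)
```

Both numbers should agree, since only the three complete pairs contribute in either case.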

When applied to a DataFrame, the cov method computes the covariance between every pair of columns and returns the result as a matrix.

In [4]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(frame['a'].cov(frame['b']))
print(frame.cov())

0.0595349642574335
          a         b         c         d         e
a  0.896929  0.059535  0.293494  0.146922 -0.053572
b  0.059535  0.275797  0.071105  0.058189 -0.069838
c  0.293494  0.071105  0.799903 -0.186835 -0.117847
d  0.146922  0.058189 -0.186835  0.336709  0.085880
e -0.053572 -0.069838 -0.117847  0.085880  0.691008


Note − The covariance between columns a and b printed by the first statement is the same value that appears in the a/b cells of the matrix returned by cov on the DataFrame.
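A few properties of the covariance matrix can be verified directly. The following is a sketch only: the seed, the three-column shape, and the name `cov_matrix` are assumptions chosen for reproducibility rather than anything from the original cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only so the run is repeatable
frame = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])
cov_matrix = frame.cov()

# An off-diagonal entry equals the pairwise Series covariance.
pair_matches = np.isclose(cov_matrix.loc["a", "b"], frame["a"].cov(frame["b"]))

# The matrix is symmetric, and its diagonal holds each column's variance.
is_symmetric = np.allclose(cov_matrix.to_numpy(), cov_matrix.to_numpy().T)
diag_is_variance = np.allclose(np.diag(cov_matrix.to_numpy()), frame.var().to_numpy())

print(pair_matches, is_symmetric, diag_is_variance)
```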

# Correlation
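Correlation measures the strength of the relationship between two arrays of values (Series). The method parameter selects how it is computed: pearson (the default) measures linear correlation, while spearman and kendall are rank-based and measure how consistently one series increases with the other.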

In [5]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

print(frame['a'].corr(frame['b']))
print(frame.corr())

0.5550596610077916
          a         b         c         d         e
a  1.000000  0.555060  0.138750 -0.285901 -0.080175
b  0.555060  1.000000 -0.446502 -0.562767  0.148401
c  0.138750 -0.446502  1.000000  0.328998 -0.123943
d -0.285901 -0.562767  0.328998  1.000000  0.400317
e -0.080175  0.148401 -0.123943  0.400317  1.000000


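In pandas versions before 2.0, any non-numeric column in the DataFrame is excluded automatically. From pandas 2.0 onward, non-numeric columns are no longer dropped silently; pass numeric_only=True to corr() to exclude them.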
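The difference between the linear and rank-based methods shows up on data that is monotonic but not linear. The sketch below uses made-up data (y is the cube of x) plus a hypothetical text column named `label`. It assumes a pandas version that accepts the numeric_only argument on corr(), and it leaves out kendall on the assumption that pandas relies on SciPy for that method.

```python
import pandas as pd

x = pd.Series(range(1, 11), dtype="float64")
frame = pd.DataFrame({
    "x": x,
    "y": x ** 3,                  # increases with x, but not linearly
    "label": list("abcdefghij"),  # non-numeric column, to be excluded
})

# numeric_only=True keeps the text column out of the calculation.
pearson = frame.corr(numeric_only=True).loc["x", "y"]
spearman = frame.corr(method="spearman", numeric_only=True).loc["x", "y"]

print(pearson)   # below 1: the relationship is not a straight line
print(spearman)  # 1 (up to rounding): the rank order of x and y is identical
```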

# Data Ranking
Ranking assigns a rank to each element of a Series, starting at 1 for the smallest value. By default, tied elements are each assigned the mean of the ranks they would otherwise occupy.

In [6]:
import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5), index=list('abcde'))
print(s)
s['d'] = s['b'] # so there's a tie
s.rank()

a    0.913427
b    1.594444
c   -1.869449
d   -1.548400
e    1.338800
dtype: float64


a    2.0
b    4.5
c    1.0
d    4.5
e    3.0
dtype: float64

rank() optionally takes an ascending parameter, which defaults to True; when it is False, the data is ranked in reverse order, with larger values assigned smaller ranks.
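As a quick illustration of the ascending parameter, here is a minimal sketch with made-up integer values, including one tie:

```python
import pandas as pd

# Illustrative values; b and d are tied for the largest value.
s = pd.Series([10, 30, 20, 30], index=list("abcd"))

ascending_rank = s.rank()                  # smallest value gets rank 1
descending_rank = s.rank(ascending=False)  # largest value gets rank 1

print(pd.DataFrame({"value": s,
                    "ascending": ascending_rank,
                    "descending": descending_rank}))
```

The tied values b and d share rank 3.5 in ascending order and rank 1.5 in descending order, because the default tie-breaking method averages the ranks in both cases.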

Rank supports different tie-breaking methods, specified with the method parameter −

average − average rank of tied group

min − lowest rank in the group

max − highest rank in the group

first − ranks assigned in the order they appear in the array
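The four tie-breaking methods listed above can be compared side by side. The sketch below uses an illustrative Series with a three-way tie; the name `ranks` is just a local variable for the comparison table.

```python
import pandas as pd

# Illustrative values: a, b and e are tied at 7.
s = pd.Series([7, 7, 3, 9, 7], index=list("abcde"))

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame(
    {method: s.rank(method=method) for method in ["average", "min", "max", "first"]}
)
print(ranks)
```

The tied group occupies ranks 2, 3 and 4, so average gives each tied element 3.0, min gives 2.0, max gives 4.0, and first hands out 2.0, 3.0 and 4.0 in order of appearance.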