
In [None]:
'''
If you are doing data science or statistics with pandas, you are in luck,
because the data frame comes with basic statistical functionality built in.
In this section, we will examine snow totals from Alta for 2006 through
2015. I scraped this data off the Utah Avalanche Center website, but will
use the .read_table function of pandas to create a data frame.
'''

In [5]:
import pandas as pd

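# Note: each data row below starts with "... " (a leftover interpreter
# continuation prompt), so pandas parses the year column as strings.
data = '''year\tinches\tlocation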
... 2006\t633.5\tutah
... 2007\t356\tutah
... 2008\t654\tutah
... 2009\t578\tutah
... 2010\t430\tutah
... 2011\t553\tutah
... 2012\t329.5\tutah
... 2013\t382.5\tutah
... 2014\t357.5\tutah
... 2015\t267.5\tutah'''

In [10]:
from io import StringIO
snow = pd.read_table(StringIO(data))

In [11]:
snow

Unnamed: 0,year,inches,location
0,... 2006,633.5,utah
1,... 2007,356.0,utah
2,... 2008,654.0,utah
3,... 2009,578.0,utah
4,... 2010,430.0,utah
5,... 2011,553.0,utah
6,... 2012,329.5,utah
7,... 2013,382.5,utah
8,... 2014,357.5,utah
9,... 2015,267.5,utah
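The data string above keeps a leading `... ` on each row, which is why the year column shows values such as `... 2006`. If that prefix is unintended (an assumption; it looks like a leftover interpreter prompt), a minimal sketch of stripping it before parsing might look like the following. The names `raw`, `cleaned`, and `snow_clean` are illustrative, and only three rows are repeated to keep the example short.

```python
from io import StringIO

import pandas as pd

# A shortened copy of the data, including the stray "... " prefixes.
raw = '''year\tinches\tlocation
... 2006\t633.5\tutah
... 2007\t356\tutah
... 2008\t654\tutah'''

# Drop the four-character "... " prefix wherever a line starts with it.
cleaned = '\n'.join(
    line[4:] if line.startswith('... ') else line
    for line in raw.splitlines()
)

snow_clean = pd.read_table(StringIO(cleaned))

# With the prefix gone, year is parsed as an integer column.
print(snow_clean.dtypes)
```

The rest of this section keeps the original `snow` frame, so its outputs still show year as a string column.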


In [13]:
snow.describe(include='all')

Unnamed: 0,year,inches,location
count,10,10.0,10
unique,10,,1
top,... 2006,,utah
freq,1,,10
mean,,454.15,
std,,138.357036,
min,,267.5,
25%,,356.375,
50%,,406.25,
75%,,571.75,
max,,654.0,


In [None]:
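'''
The .quantile method shows the 50% quantile (the median) by default; the
q parameter can be specified to get different levels. Because year and
location hold strings here, numeric_only=True limits the calculation to
the numeric columns:
'''
snow.quantile(numeric_only=True)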
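As a small sketch of the q parameter, the example below rebuilds just the inches column in a frame named `inches` (a name chosen for illustration) and asks for the median and the three quartiles.

```python
import pandas as pd

# The inches values from the snow data, as a single numeric column.
inches = pd.DataFrame({
    'inches': [633.5, 356, 654, 578, 430, 553, 329.5, 382.5, 357.5, 267.5],
})

# Default q=0.5 returns a Series with one value per column.
median = inches.quantile()

# A list of levels returns a DataFrame indexed by those levels.
quartiles = inches.quantile(q=[0.25, 0.5, 0.75])

print(median)
print(quartiles)
```

The quartiles match the 25%, 50%, and 75% rows that .describe reports for inches.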


In [18]:
# To get just the count of non-missing cells in each column, use the .count method:

snow.count()

year        10
inches      10
location    10
dtype: int64
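Because the snow data has no missing values, every count above is 10. A minimal sketch with a small made-up frame (the values and the name `df` are illustrative only) shows how missing cells are excluded:

```python
import numpy as np
import pandas as pd

# A made-up frame with one missing value in each column.
df = pd.DataFrame({
    'inches': [633.5, np.nan, 654.0],
    'location': ['utah', 'utah', None],
})

# Count non-missing cells per column.
per_column = df.count()

# Count non-missing cells per row instead.
per_row = df.count(axis=1)

print(per_column)
print(per_row)
```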

In [19]:
'''
If you have data and want to know whether any of the values in the
columns evaluate to True in a boolean context, use the .any method:
'''
snow.any()

year        True
inches      True
location    True
dtype: bool

In [20]:
'''
This method can also be applied across each row by passing the axis=1
parameter:
'''
snow.any(axis=1)

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
dtype: bool

In [21]:
'''
Likewise, there is a corresponding .all method to indicate whether all
of the values are truthy:
'''
snow.all()

year        True
inches      True
location    True
dtype: bool
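Every value in the snow data is truthy, so .any and .all both return True everywhere. A minimal sketch with a made-up frame of zeros and non-zeros (the name `flags` and its values are illustrative) makes the difference visible:

```python
import pandas as pd

# A made-up frame: zero is falsy, any other number is truthy.
flags = pd.DataFrame({
    'zeros': [0, 0, 0],
    'mixed': [0, 1, 2],
    'ones': [1, 1, 1],
})

any_by_column = flags.any()        # is at least one value truthy?
all_by_column = flags.all()        # are all values truthy?
any_by_row = flags.any(axis=1)     # same checks, across each row
all_by_row = flags.all(axis=1)

print(any_by_column)
print(all_by_column)
```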

In [22]:
# rank
'''
The .rank method goes through every column and assigns each cell its rank
within that column. The ranks for the year column aren't particularly
useful here:
'''
snow.rank()

Unnamed: 0,year,inches,location
0,1.0,9.0,5.5
1,2.0,3.0,5.5
2,3.0,10.0,5.5
3,4.0,8.0,5.5
4,5.0,6.0,5.5
5,6.0,7.0,5.5
6,7.0,2.0,5.5
7,8.0,5.0,5.5
8,9.0,4.0,5.5
9,10.0,1.0,5.5


In [23]:
snow.rank(ascending=False)

Unnamed: 0,year,inches,location
0,10.0,2.0,5.5
1,9.0,8.0,5.5
2,8.0,1.0,5.5
3,7.0,3.0,5.5
4,6.0,5.0,5.5
5,5.0,4.0,5.5
6,4.0,9.0,5.5
7,3.0,6.0,5.5
8,2.0,7.0,5.5
9,1.0,10.0,5.5


In [24]:
'''
Note that because the values in the location column are all the same, each
one receives the average of the tied ranks by default. To change this
behavior, set the method parameter to 'min', 'max', 'first', or 'dense' to
get the lowest rank, the highest rank, the order of appearance, or a ranking
by group (instead of by item), respectively ('average' is the default):
'''
snow.rank(method='min')

Unnamed: 0,year,inches,location
0,1.0,9.0,1.0
1,2.0,3.0,1.0
2,3.0,10.0,1.0
3,4.0,8.0,1.0
4,5.0,6.0,1.0
5,6.0,7.0,1.0
6,7.0,2.0,1.0
7,8.0,5.0,1.0
8,9.0,4.0,1.0
9,10.0,1.0,1.0
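To compare the tie-breaking methods side by side, here is a minimal sketch on a small made-up numeric series with one tie (the name `ties` and its values are illustrative):

```python
import pandas as pd

# A made-up series in which the two 20s are tied.
ties = pd.Series([10, 20, 20, 30], name='value')

# Rank the same series with each tie-breaking method.
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: ties.rank(method=m) for m in methods})

print(ranks)
```

The tied values get 2.5 with 'average', 2 with 'min', 3 with 'max', and 2 then 3 with 'first'. With 'dense' they get 2 and the next value gets 3, so no rank is skipped.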


In [None]:
# clip
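'''
Occasionally, there are outliers in the data. If this is problematic, the
.clip method limits values to a lower and an upper bound (array-like bounds
are aligned along the axis given by the axis parameter). The call below is
left commented out because it fails on this frame:
'''
# snow.clip(lower=400, upper=600)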

In [27]:
'''
For our data, clipping the whole frame fails because location is a column
containing strings. Unless your columns are all numeric, you might want to
run the .clip method on an individual series or on the subset of columns
that need to be clipped:
'''
snow[['inches']].clip(lower=400, upper=600)

Unnamed: 0,inches
0,600.0
1,400.0
2,600.0
3,578.0
4,430.0
5,553.0
6,400.0
7,400.0
8,400.0
9,400.0
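As a sketch of both options, the example below clips a single series and then a numeric-only subset chosen with .select_dtypes. The small frame `mixed` is a made-up stand-in for the snow data.

```python
import pandas as pd

# A made-up frame mixing a numeric column with a string column.
mixed = pd.DataFrame({
    'inches': [633.5, 356.0, 578.0],
    'location': ['utah', 'utah', 'utah'],
})

# Option 1: clip one series.
clipped_series = mixed['inches'].clip(lower=400, upper=600)

# Option 2: select only the numeric columns, then clip them together.
numeric_clipped = mixed.select_dtypes(include='number').clip(lower=400, upper=600)

print(clipped_series)
print(numeric_clipped)
```

Selecting by dtype avoids listing column names by hand when a frame has many numeric columns.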


In [29]:
# Correlation and Covariance
'''
We've already seen that the series object can perform a Pearson correlation
with another series. The data frame offers similar functionality, but it
computes pairwise correlations among all of the numeric columns. It can also
compute a Kendall or Spearman correlation when 'kendall' or 'spearman' is
passed to the optional method parameter. Passing numeric_only=True skips the
string columns:
'''
snow.corr(numeric_only=True)


Unnamed: 0,inches
inches,1.0


In [43]:
snow.corr('spearman', numeric_only=True)

Unnamed: 0,inches
inches,1.0
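With only one numeric column, the correlation matrices above are a single 1.0. A minimal sketch on a made-up two-column frame (the name `df` and its columns are illustrative) shows how Pearson and Spearman can differ when the relationship is monotonic but not linear:

```python
import pandas as pd

# A made-up frame: x_squared grows with x, but not in a straight line.
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['x_squared'] = df['x'] ** 2

# Pearson measures linear association.
pearson = df.corr()

# Spearman works on ranks, so any strictly increasing relationship scores 1.
spearman = df.corr(method='spearman')

print(pearson)
print(spearman)
```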


In [None]:
'''
If you have two data frames that you want to correlate, you can use the
.corrwith method to compute column-wise (the default) or row-wise
(when axis=1) Pearson correlations:
'''


In [38]:
snow2 = snow[['inches']] - 100
snow.corrwith(snow2, numeric_only=True)

inches    1.0
dtype: float64
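Subtracting a constant leaves the correlation at 1.0, so the result above is not very telling. A minimal sketch with two made-up frames (`a` and `b` are illustrative names) shows .corrwith matching columns by name and returning one correlation per column:

```python
import pandas as pd

# Two made-up frames with the same column names and index.
a = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0], 'y': [2.0, 1.0, 4.0, 3.0]})
b = pd.DataFrame({'x': [10.0, 20.0, 30.0, 40.0], 'y': [-2.0, -1.0, -4.0, -3.0]})

# Column-wise (the default): a['x'] vs b['x'], and a['y'] vs b['y'].
by_column = a.corrwith(b)

print(by_column)
```

Here b's x column is a scaled copy of a's, giving a correlation of 1, while b's y column is the negation of a's, giving -1.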

In [44]:
'''
The .cov method of the data frame computes the pairwise covariance
(non-normalized correlation). As with .corr, numeric_only=True skips the
string columns:
'''
snow.cov(numeric_only=True)


Unnamed: 0,inches
inches,19142.669444
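To illustrate what "non-normalized correlation" means, the sketch below divides a covariance matrix by the product of the column standard deviations and compares the result with .corr. The frame `df` and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# A made-up frame with two numeric columns.
df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 5.0, 4.0, 5.0],
})

cov = df.cov()
std = df.std()

# Dividing each covariance by std_i * std_j gives the Pearson correlation.
normalized = cov / np.outer(std, std)

corr = df.corr()

print(normalized)
print(corr)
```

For the snow data, the single covariance value is the variance of inches, which is the square of the std reported by .describe.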
