# Basic Statistics

This section covers measures of location and measures of spread.

> The methods shown here work on a Series as well as on a DataFrame.


## Measures of Location
- Min, Max, Mean, Median

In [2]:
import pandas as pd

# Let's take the data from Sololearn
us_president = pd.read_csv('https://sololearn.com/uploads/files/president_heights_party.csv', index_col='name')

In [3]:
# Examples of Measures of Location
print("Minimum:\n", us_president.min(), "\n")
print("Maximum:\n", us_president.max(), "\n")
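# numeric_only=True skips the text column 'party'; recent pandas versions raise an error without it
print("Mean:\n", us_president.mean(numeric_only=True))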

Minimum:
 order              1
age               42
height           163
party     democratic
dtype: object 

Maximum:
 order       45
age         70
height     193
party     whig
dtype: object 

Mean:
 order      23.022222
age        55.000000
height    180.000000
dtype: float64
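
The output above mixes numbers and text: `min()` and `max()` include `party` (strings are compared alphabetically, so the result has dtype `object`), while the mean only makes sense for numeric columns. A minimal sketch of that behaviour, using a hypothetical three-row table rather than the real data set:

```python
import pandas as pd

# Hypothetical stand-in for the presidents table (values are made up for illustration).
toy = pd.DataFrame(
    {
        "age": [57, 61, 57],
        "height": [189, 170, 189],
        "party": ["none", "federalist", "democratic-republican"],
    },
    index=["A", "B", "C"],
)

mins = toy.min()                     # 'party' is compared alphabetically
means = toy.mean(numeric_only=True)  # 'party' is skipped

print(mins)
print(means)
```

Here `mins["party"]` is the alphabetically first party name, and `means` contains only `age` and `height`.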



In [4]:
# The same methods on a Series and on a DataFrame
print("Series:\n", us_president['age'].min(), "\n")
print("DF:\n", us_president[['age', 'height']].mean(), "\n")
print("Median:\n", us_president['age'].median())

Series:
 42 

DF:
 age        55.0
height    180.0
dtype: float64 

Median:
 55.0


## Quantiles

As the name suggests, quantiles divide the sorted data into contiguous intervals that each hold an equal share of the observations. The median is the simplest case: it splits the data into two halves of 50% each. Quartiles (the 0.25, 0.5, and 0.75 quantiles) split the data into four groups of 25% each.
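
To make the relationship between the median and the quartiles concrete, here is a small sketch on a hypothetical list of seven ages (not the real data). It assumes pandas' default linear interpolation between neighbouring values.

```python
import pandas as pd

# Hypothetical ages, already sorted for readability.
ages = pd.Series([42, 47, 51, 55, 58, 64, 70])

# The three cut points that split the data into four groups.
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The 0.5 quantile is the median.
print(quartiles.loc[0.5] == ages.median())
```

With seven values, the 0.25 quantile falls halfway between the second and third value, which is why the result can be a number that does not appear in the data.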

In [5]:
# We already have the median age from above.
# Now compute the quartiles (1 is the maximum).

us_president['age'].quantile([0.25, 0.5, 0.75, 1])

0.25    51.0
0.50    55.0
0.75    58.0
1.00    70.0
Name: age, dtype: float64

In [6]:
# The same call on a DataFrame
us_president[['age', 'height']].quantile([0.25, 0.5, 0.75, 1])

| | age | height |
|---|---|---|
| 0.25 | 51.0 | 175.0 |
| 0.50 | 55.0 | 182.0 |
| 0.75 | 58.0 | 183.0 |
| 1.00 | 70.0 | 193.0 |


## Measures of Spread
- Range, Variance, Standard Deviation
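
The range is simply the maximum minus the minimum. pandas has no dedicated method for it on a DataFrame, so one way to get it is to subtract the two results. The sketch below uses a hypothetical three-row table whose extremes match the minimum and maximum of `age` and `height` shown in this notebook; the middle row is made up.

```python
import pandas as pd

# Hypothetical rows; only the extremes are taken from the notebook's output.
toy = pd.DataFrame({"age": [42, 55, 70], "height": [163, 180, 193]})

# Column-wise range: max minus min.
value_range = toy.max() - toy.min()
print(value_range)
```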

### Variance

In [10]:
# .var() returns the (sample) variance of each selected column
us_president[['age', 'height']].var().round(2)

age       43.50
height    48.68
dtype: float64

### Standard Deviation

In [12]:
# .std() returns the standard deviation, the square root of the variance
us_president[['age', 'height']].std().round(6)

age       6.595453
height    6.977236
dtype: float64
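
As a check on what `.var()` and `.std()` compute, the sketch below rebuilds both by hand on a hypothetical series. By default pandas divides by `n - 1` (the sample variance, `ddof=1`); passing `ddof=0` gives the population variance instead.

```python
import math

import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Sample variance: squared deviations from the mean, divided by n - 1.
deviations = values - values.mean()
sample_var = (deviations ** 2).sum() / (len(values) - 1)
sample_std = math.sqrt(sample_var)

print(sample_var, values.var())     # same number twice
print(sample_std, values.std())     # same number twice
print(values.var(ddof=0))           # population variance
```

The same relationship holds for the results above: squaring the standard deviation of `age` (6.595453) gives the variance of 43.50.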

### Summary .describe() Function

In [13]:
# This function is used to obtain the quantiles, mean, std, count, min, and max of a DF
us_president.describe()

| | order | age | height |
|---|---|---|---|
| count | 45.0 | 45.0 | 45.0 |
| mean | 23.022222 | 55.0 | 180.0 |
| std | 13.136502 | 6.595453 | 6.977236 |
| min | 1.0 | 42.0 | 163.0 |
| 25% | 12.0 | 51.0 | 175.0 |
| 50% | 23.0 | 55.0 | 182.0 |
| 75% | 34.0 | 58.0 | 183.0 |
| max | 45.0 | 70.0 | 193.0 |
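
The `25%`, `50%`, and `75%` rows of `.describe()` are the same numbers that `.quantile()` returns, and the `percentiles` parameter lets you ask for other cut points. A minimal sketch on a hypothetical one-column table:

```python
import pandas as pd

# Hypothetical ages, not the real data set.
toy = pd.DataFrame({"age": [42, 47, 51, 55, 58, 64, 70]})

summary = toy.describe()
print(summary)

# The percentile rows agree with .quantile() and .median().
print(summary.loc["25%", "age"] == toy["age"].quantile(0.25))
print(summary.loc["50%", "age"] == toy["age"].median())

# Request the 10th and 90th percentiles instead of the default quartiles.
custom = toy.describe(percentiles=[0.1, 0.9])
print(custom)
```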


### Categorical Variable

In [19]:
# For a categorical column, .value_counts() counts how often each category occurs
us_president['party'].value_counts()

republican               19
democratic               15
democratic-republican     4
whig                      4
none                      1
federalist                1
national union            1
Name: party, dtype: int64

In [17]:
# .describe() summarizes a Series or DataFrame. On non-numeric data it reports the count,
# the number of unique values, the most frequent value (top), and how often it occurs (freq)
us_president['party'].describe()

count             45
unique             7
top       republican
freq              19
Name: party, dtype: object
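
If proportions are more useful than raw counts, `.value_counts()` accepts `normalize=True`. A short sketch on a hypothetical four-element series:

```python
import pandas as pd

# Hypothetical party labels, not the real data set.
parties = pd.Series(["republican", "republican", "democratic", "whig"], name="party")

# normalize=True returns each category's share of the total instead of its count.
shares = parties.value_counts(normalize=True)
print(shares)
```

The shares always add up to 1, so they can be read directly as fractions of the data.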

# Group by and Aggregations

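`groupby` splits the rows into groups that share the same value in a column (here the categorical column `party`) and then computes a statistic for each group, much like a pivot table in Excel.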
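
One way to picture what `groupby` does is to write the split and the per-group calculation out by hand. The sketch below does that on a hypothetical five-row table and compares it with the one-line `groupby` call.

```python
import pandas as pd

# Hypothetical rows, not the real data set.
toy = pd.DataFrame(
    {
        "party": ["whig", "whig", "republican", "republican", "republican"],
        "height": [173, 183, 168, 180, 192],
    }
)

# Split the table by party, then apply the mean to each piece.
manual = {party: group["height"].mean() for party, group in toy.groupby("party")}

# The same result in a single call.
grouped = toy.groupby("party")["height"].mean()

print(manual)
print(grouped)
```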

In [20]:
us_president.groupby('party').mean()

| party | order | age | height |
|---|---|---|---|
| democratic | 26.066667 | 52.6 | 181.066667 |
| democratic-republican | 4.5 | 57.25 | 176.5 |
| federalist | 2.0 | 61.0 | 170.0 |
| national union | 17.0 | 56.0 | 178.0 |
| none | 1.0 | 57.0 | 189.0 |
| republican | 29.631579 | 55.263158 | 180.894737 |
| whig | 11.0 | 58.25 | 176.0 |


In [21]:
# Select a column after groupby to aggregate only that column
us_president.groupby('party')['height'].mean()

party
democratic               181.066667
democratic-republican    176.500000
federalist               170.000000
national union           178.000000
none                     189.000000
republican               180.894737
whig                     176.000000
Name: height, dtype: float64

In [24]:
import numpy as np

In [26]:
# .agg() applies several aggregations in one call.
# Pass the function names as strings: 'median' is available this way, and recent pandas versions warn when min, max, or np.median are passed directly
us_president.groupby('party')['age'].agg(['min', 'median', 'max'])

| party | min | median | max |
|---|---|---|---|
| democratic | 43 | 52.0 | 65 |
| democratic-republican | 57 | 57.0 | 58 |
| federalist | 61 | 61.0 | 61 |
| national union | 56 | 56.0 | 56 |
| none | 57 | 57.0 | 57 |
| republican | 42 | 54.0 | 70 |
| whig | 50 | 57.5 | 68 |
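
When the result columns need more descriptive names than `min`, `median`, and `max`, `.agg()` also accepts keyword arguments (named aggregation). A minimal sketch on a hypothetical table; the column names `youngest`, `median_age`, and `oldest` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows, not the real data set.
toy = pd.DataFrame(
    {
        "party": ["whig", "whig", "whig", "republican", "republican", "republican", "republican"],
        "age": [50, 65, 68, 42, 54, 70, 51],
    }
)

# Each keyword becomes a column name; the value is the aggregation to apply.
result = toy.groupby("party")["age"].agg(
    youngest="min",
    median_age="median",
    oldest="max",
)
print(result)
```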


In [31]:
# Pass a dictionary to choose the aggregations for each column
stats = ['min', 'max', 'var', 'std']
us_president.groupby('party').agg({'age': stats, 'height': stats})


| party | age min | age max | age var | age std | height min | height max | height var | height std |
|---|---|---|---|---|---|---|---|---|
| democratic | 43 | 65 | 38.542857 | 6.208289 | 168 | 193 | 41.352381 | 6.430582 |
| democratic-republican | 57 | 58 | 0.25 | 0.5 | 163 | 189 | 137.0 | 11.7047 |
| federalist | 61 | 61 | NaN | NaN | 170 | 170 | NaN | NaN |
| national union | 56 | 56 | NaN | NaN | 178 | 178 | NaN | NaN |
| none | 57 | 57 | NaN | NaN | 189 | 189 | NaN | NaN |
| republican | 42 | 70 | 51.871345 | 7.202176 | 168 | 193 | 41.877193 | 6.471259 |
| whig | 50 | 68 | 82.916667 | 9.105859 | 173 | 183 | 22.666667 | 4.760952 |
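
Two things about this result are worth a closer look. The columns form a two-level index (column name, then statistic), and the variance and standard deviation are missing for parties with a single president, because the sample variance divides by `n - 1`. The sketch below reproduces both on a hypothetical three-row table and shows one possible way to flatten the column names.

```python
import pandas as pd

# Hypothetical rows, not the real data set: two whigs, one federalist.
toy = pd.DataFrame(
    {
        "party": ["whig", "whig", "federalist"],
        "age": [50, 68, 61],
    }
)

stats = toy.groupby("party").agg({"age": ["min", "max", "var"]})

# Join the two header levels into single names such as 'age_min'.
stats.columns = ["_".join(col) for col in stats.columns]
print(stats)
```

The `federalist` row has no variance because one observation cannot vary, while the `whig` row is computed from its two ages.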
