#  <font color='#eb3483'> Introduction to Statistics </font>
To become a successful Data Scientist you must know your basics. Math and Stats are the building blocks of Machine Learning algorithms. It is important to know the techniques behind various Machine Learning algorithms in order to know how and when to use them. Now the question arises, what exactly is Statistics?


There are two main categories in Statistics, namely:

1. Descriptive Statistics

Descriptive Statistics (a.k.a. summary statistics) uses the data to provide descriptions of the population, either through numerical calculations or graphs or tables. 

2. Inferential Statistics

Inferential Statistics makes inferences and predictions about a population based on a sample of data taken from the population in question. 


<img src="./media/stats.png" alt="drawing" width="500"/>


# <font color='#eb3483'> Descriptive Stats </font>

Descriptive statistics are a set of measures (numbers, basically) that summarize a dataset. They fall into three basic types:

1. Measures of central tendency
2. Measures of dispersion
3. Measures of shape






<img src="./media/descriptive.png" alt="drawing" width="800"/>



### <font color='#eb3483'> Some terms we should know: </font>

- The <font color='#eb3483'> **population** </font> is the full set of items or individuals we want to study, and from which data is collected.  
- A <font color='#eb3483'> **Sample** </font> is a subset of the population.  
- A <font color='#eb3483'>**Variable**</font> is any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item.  
- A <font color='#eb3483'>**parameter** </font> (also called a statistical parameter or population parameter) is a quantity that describes a population, for example its mean or median. More formally, it is a quantity that indexes a family of probability distributions.  

<table><tr>
<td> <img src="./media/population.png" alt="drawing" width="300"/> </td>
<td> <img src="./media/sample.png" alt="drawing" width="360"/> </td>
</tr></table>



In [1]:
from IPython.display import Image
import numpy as np
import pandas as pd

import scipy.stats as stats

In [2]:
weights = [100, 150, 150, 200, 250, 300, 325, 400,415, 500, 600, 1000]

###  <font color='#eb3483'> 1. Measures of Central Tendency </font>

<font color='#eb3483'> **Mean**  </font>

(aka arithmetic mean, average): The sum of all values in a sample divided by the number of values in the sample

In [3]:
weights_mean = np.mean(weights)
weights_mean

365.8333333333333
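As a quick sanity check, the definition can also be applied by hand. The sketch below uses only built-in Python on the same `weights` list; `manual_mean` is just an illustrative name.

```python
weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

# mean = sum of the values divided by how many values there are
manual_mean = sum(weights) / len(weights)
print(manual_mean)
```

This should print the same value that `np.mean(weights)` returned above.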

<font color='#eb3483'> **Median** </font>

The middle value of the sorted sample: half of the values fall below it and half above.  
<img src="./media/median.png" alt="drawing" width="200"/>  
In this case the sample has an even number of values, so the median is the average of the two middle values.

In [4]:
np.median(weights)

312.5
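The even/odd rule can be sketched in a few lines of plain Python. This is only an illustration of the definition (the names `ordered` and `manual_median` are made up here), not how NumPy implements it.

```python
weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

ordered = sorted(weights)
n = len(ordered)
middle = n // 2

if n % 2 == 0:
    # even number of values: average the two middle ones
    manual_median = (ordered[middle - 1] + ordered[middle]) / 2
else:
    # odd number of values: take the single middle one
    manual_median = ordered[middle]

print(manual_median)  # 312.5, the average of 300 and 325
```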

<font color='#eb3483'> **Mode** </font>

The most common value
<img src="./media/mode.png" alt="drawing" width="200"/>  


In [37]:
# use stats.mode(weights).mode to get the actual numerical mode
stats.mode(weights)

ModeResult(mode=array([150]), count=array([2]))
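A sample can have more than one mode when several values are tied for the highest count. As a minimal sketch (using `collections.Counter` from the standard library rather than SciPy, with illustrative variable names), we can list every value that reaches the highest count:

```python
from collections import Counter

weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

counts = Counter(weights)               # value -> number of occurrences
highest_count = max(counts.values())

# keep every value tied for the highest count
modes = sorted(value for value, count in counts.items() if count == highest_count)
print(modes, highest_count)  # [150] 2
```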

### <font color='#eb3483'> 2. Measures of Dispersion </font>

<font color='#eb3483'> **Range** </font>

A measure of how spread out the values in a dataset are: the difference between the maximum and the minimum value.

In [38]:
np.max(weights) - np.min(weights)

900

<font color='#eb3483'> **Quartiles and Inter Quartile Range (IQR)** </font>

The IQR is a measure of variability based on dividing a dataset into quartiles: it is the difference between the upper quartile (Q3) and the lower quartile (Q1).

In [40]:
quantiles = stats.mstats.mquantiles(weights)
quantiles

array([172.5 , 312.5 , 461.75])

In [41]:
IQR = quantiles[-1] - quantiles[0]
IQR

289.25000000000006

<font color='#eb3483'>**Variance**  </font>

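Describes how much a random variable differs from its expected value. It is computed from the squared deviations.  
The variance is the average of the squared differences from the mean. The formula below divides by $n$ (the population form); the sample variance divides by $n-1$ instead.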

$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$
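To make the formula concrete, here is a small sketch in plain Python that computes the variance of `weights` both ways: dividing by `n` as in the formula, and dividing by `n - 1`. The variable names are illustrative.

```python
weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

n = len(weights)
mean = sum(weights) / n
squared_deviations = [(w - mean) ** 2 for w in weights]

population_variance = sum(squared_deviations) / n        # divides by n, as in the formula
sample_variance = sum(squared_deviations) / (n - 1)      # divides by n - 1

print(population_variance)  # roughly 57445.14
print(sample_variance)      # roughly 62667.42
```

`np.var(weights)` should give the first value and `np.var(weights, ddof=1)` the second.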


<font color='#eb3483'> **Standard Deviation**  </font>

The standard deviation is the square root of the variance.  
It measures the dispersion of a dataset around its mean, in the same units as the data:
$$
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$


In [9]:
weights_std = np.std(weights)
weights_std

239.6771555423856

### <font color='#eb3483'> 3. Measures of Shape </font>

<font color='#eb3483'> **Skewness Coefficient** </font>

A measure of how skewed the data is (values stretched to the left or the right of the central values)  
<img src="./media/skewness.png" alt="drawing" width="300"/>  



The Skewness Coefficient is defined as the third central moment divided by the cube of the standard deviation. Its formula is: 

$$\frac{1}{n} \sum_{i=1}^{n} \frac{(x_i-\bar{x})^{3}}{\sigma^3}$$

Don't worry if you don't know what a moment is, or find that equation a little overwhelming - the important takeaway is understanding what skew means!

We can calculate the skewness directly with `scipy.stats.skew`

In [44]:
stats.skew(weights)

1.3623858394083481
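If you are curious, the formula can be reproduced by hand. The sketch below uses plain Python, with `m2` and `m3` as illustrative names for the second and third central moments, and the standard deviation computed dividing by `n`.

```python
weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

n = len(weights)
mean = sum(weights) / n

m2 = sum((w - mean) ** 2 for w in weights) / n   # second central moment (variance)
m3 = sum((w - mean) ** 3 for w in weights) / n   # third central moment

# skewness = third central moment / standard deviation cubed
manual_skew = m3 / m2 ** 1.5
print(manual_skew)
```

The result should agree with the `stats.skew(weights)` output above. It is positive because the 1000 g potato stretches the distribution to the right.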

<font color='#eb3483'> **Kurtosis Coefficient** </font>

A measure of how concentrated the values are around the central values
<img src="./media/kurtosis.png" alt="drawing" width="300"/>  



The Kurtosis Coefficient is defined as the fourth central moment divided by the fourth power of the standard deviation. Its formula is: 

$$\frac{1}{n} \sum_{i=1}^{n} \frac{(x_i-\bar{x})^{4}}{\sigma^4} - 3$$

Generally we subtract 3 from the value, since a perfect normal distribution has a kurtosis coefficient of 3. The result is called excess kurtosis, and it is 0 for a normal distribution.

We can calculate it directly with `scipy.stats.kurtosis`

In [11]:
stats.kurtosis(weights)

1.4285722765161841
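As with skewness, the formula can be checked by hand. This is a plain Python sketch with illustrative names (`m2`, `m4`), again dividing by `n`.

```python
weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

n = len(weights)
mean = sum(weights) / n

m2 = sum((w - mean) ** 2 for w in weights) / n   # second central moment (variance)
m4 = sum((w - mean) ** 4 for w in weights) / n   # fourth central moment

# excess kurtosis = fourth central moment / variance squared, minus 3
manual_kurtosis = m4 / m2 ** 2 - 3
print(manual_kurtosis)
```

The result should agree with the `stats.kurtosis(weights)` output above, which subtracts 3 by default.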

# <font color='#eb3483'> Descriptive Statistics in Pandas </font>
Pandas provides a lot of awesome functionality to do everything we learned above directly on our pandas dataframes and series.

In [45]:
import pandas as pd

In [46]:
potato_varieties = ["Monalisa", "Spunta", "Kennebec"]

`np.random.choice` generates an array taking random samples from a list

In [47]:
np.random.choice(potato_varieties, 5) 

array(['Kennebec', 'Spunta', 'Kennebec', 'Monalisa', 'Kennebec'],
      dtype='<U8')

We are going to generate a random dataframe of potatoes, using `np.random.choice` to pick a random variety for each weight:

In [48]:
potatos = pd.DataFrame(
        {
            "weight": weights,
            "variety": np.random.choice(potato_varieties, len(weights))    
        }
)

In [49]:
potatos.head()

|   | weight | variety |
|---|--------|----------|
| 0 | 100 | Kennebec |
| 1 | 150 | Kennebec |
| 2 | 150 | Kennebec |
| 3 | 200 | Monalisa |
| 4 | 250 | Monalisa |


Let's check some of the statistics we computed above using pandas!

In [50]:
potatos.weight.median()

312.5

In [51]:
potatos.weight.mean()

365.8333333333333

In [52]:
potatos.weight.max()

1000

In [20]:
potatos.weight.min()

100

In [21]:
potatos.variety.mode()

0    Kennebec
dtype: object

In [22]:
potatos.weight.std()

250.33462453768604

**NOTE:** We see the standard deviation calculated by pandas is greater than the one provided by numpy (*weights_std*), why?

In [23]:
weights_std

239.6771555423856

The reason is simple: pandas calculates the standard deviation dividing by `N-1` instead of by `N`. This is called [Bessel's Correction](https://en.wikipedia.org/wiki/Bessel's_correction), and it is usually done to correct the bias we get when we estimate the standard deviation from a sample instead of from the whole population.

If we really want to get the `biased` standard deviation (that is, assuming our pandas dataframe is the whole population), we can do it like this:

In [24]:
potatos.weight.std(ddof=0)

239.6771555423856
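The two versions differ only by a constant factor, `sqrt(N / (N - 1))`. The sketch below checks this with the standard library's `statistics` module (`pstdev` divides by `N`, `stdev` divides by `N - 1`) and shows how the factor shrinks as the sample grows. The sample sizes in the loop are arbitrary illustrative values.

```python
import math
import statistics

weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

population_std = statistics.pstdev(weights)  # divides by N (like ddof=0)
sample_std = statistics.stdev(weights)       # divides by N - 1 (like ddof=1)

n = len(weights)
correction = math.sqrt(n / (n - 1))

# the ratio of the two standard deviations equals the correction factor
print(sample_std / population_std, correction)

# the correction matters less and less as the sample gets bigger
for size in (12, 100, 10_000):
    print(size, math.sqrt(size / (size - 1)))
```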

We can use `quantile` to get a specific quantile. The X% quantile is the value below which X% of the data falls.

In [25]:
potatos.weight.quantile(0.9) # weight larger than 90% of the potatos

590.0
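Where does 590 come from, given that it is not one of our weights? By default pandas interpolates linearly between the two closest sorted values. The function below is a simplified sketch of that idea (`linear_quantile` is a made-up name, not a pandas function).

```python
import math

weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q     # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower           # how far we are between the two neighbours
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# position is 11 * 0.9 = 9.9, so we land 90% of the way from 500 to 600
q90 = linear_quantile(weights, 0.9)
print(q90)  # approximately 590.0
```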

The pandas function `pandas.qcut` converts a numerical column into a categorical column with N quantile-based bins, each holding roughly the same number of rows. If we use N=4, the bins are delimited by the quartiles. This is a good way of grouping a numerical variable.

In [26]:
weights

[100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

In [27]:
weights_quartiles = pd.qcut(weights, 4)
potatos["quartile"] = weights_quartiles
potatos.quartile

0      (99.999, 187.5]
1      (99.999, 187.5]
2      (99.999, 187.5]
3       (187.5, 312.5]
4       (187.5, 312.5]
5       (187.5, 312.5]
6      (312.5, 436.25]
7      (312.5, 436.25]
8      (312.5, 436.25]
9     (436.25, 1000.0]
10    (436.25, 1000.0]
11    (436.25, 1000.0]
Name: quartile, dtype: category
Categories (4, interval[float64]): [(99.999, 187.5] < (187.5, 312.5] < (312.5, 436.25] < (436.25, 1000.0]]

In [28]:
potatos

|   | weight | variety | quartile |
|---|--------|----------|------------------|
| 0 | 100 | Kennebec | (99.999, 187.5] |
| 1 | 150 | Monalisa | (99.999, 187.5] |
| 2 | 150 | Kennebec | (99.999, 187.5] |
| 3 | 200 | Spunta | (187.5, 312.5] |
| 4 | 250 | Kennebec | (187.5, 312.5] |
| 5 | 300 | Spunta | (187.5, 312.5] |
| 6 | 325 | Monalisa | (312.5, 436.25] |
| 7 | 400 | Kennebec | (312.5, 436.25] |
| 8 | 415 | Monalisa | (312.5, 436.25] |
| 9 | 500 | Kennebec | (436.25, 1000.0] |


We can see the 4 categories calculated are:

```
Categories (4, interval[float64]): [(99.999, 187.5] < (187.5, 312.5] < (312.5, 436.25] < (436.25, 1000.0]]
```
with Q1=187.5 and Q3=436.25.

Please note there are different ways to calculate quartiles, so results can differ depending on the method. For the same data, `stats.mstats.mquantiles` gave Q1=172.5 and Q3=461.75 earlier.
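To see how two methods can disagree on the same data, here is a rough sketch of the "plotting position" family of quantile estimators. It assumes the scheme documented for `scipy.stats.mstats.mquantiles`, whose defaults are `alphap=0.4` and `betap=0.4`; setting both to 1 gives plain linear interpolation, which is the pandas default. The function name is made up for illustration.

```python
import math

weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

def plotting_position_quantile(values, p, alphap, betap):
    """Quantile estimate for probability p with plotting positions (alphap, betap)."""
    ordered = sorted(values)
    n = len(ordered)
    m = alphap + p * (1 - alphap - betap)
    aleph = n * p + m                            # 1-based fractional position
    k = math.floor(min(max(aleph, 1), n - 1))    # clip so both neighbours exist
    gamma = min(max(aleph - k, 0.0), 1.0)        # interpolation weight
    return (1 - gamma) * ordered[k - 1] + gamma * ordered[k]

for label, a, b in [("alphap=betap=0.4", 0.4, 0.4), ("alphap=betap=1  ", 1, 1)]:
    q1 = plotting_position_quantile(weights, 0.25, a, b)
    q3 = plotting_position_quantile(weights, 0.75, a, b)
    print(label, round(q1, 2), round(q3, 2))
```

With our 12 weights, the first setting should give roughly 172.5 and 461.75, and the second 187.5 and 436.25. Both are valid; they just place the quartile at slightly different positions between the same two neighbouring values.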

In [29]:
potatos.dtypes

weight         int64
variety       object
quartile    category
dtype: object

In [30]:
potatos.variety = potatos.variety.astype("category")

In [31]:
potatos.dtypes

weight         int64
variety     category
quartile    category
dtype: object

We can use the `.describe()` method to calculate the main summary statistics at once. By default it only covers the numeric columns.

In [53]:
potatos.describe()

|   | weight |
|-------|------------|
| count | 12.0 |
| mean | 365.833333 |
| std | 250.334625 |
| min | 100.0 |
| 25% | 187.5 |
| 50% | 312.5 |
| 75% | 436.25 |
| max | 1000.0 |
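`.describe()` also combines nicely with `groupby` when we want the same summary per category. Since the varieties in our dataframe are random, the sketch below builds a small separate dataframe with fixed, made-up variety labels so the result is reproducible.

```python
import pandas as pd

weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

# hypothetical, fixed labels (not the random ones used above)
varieties = ["Kennebec"] * 4 + ["Monalisa"] * 4 + ["Spunta"] * 4

example = pd.DataFrame({"weight": weights, "variety": varieties})

# one row of summary statistics per variety
per_variety = example.groupby("variety")["weight"].describe()
print(per_variety)
```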


# <font color='#eb3483'> Correlation </font>

Another tool in our descriptive statistics toolbox is correlation - a measure of the shape of a joint distribution (i.e. a distribution based on two variables). We say that two variables are correlated when they vary together following the same trend, that is, when there is a strong relationship between them.

![](https://statistics.laerd.com/statistical-guides/img/pearson-1-small.png)

For example, let's calculate the diameter of each potato based on its weight (I spent way too much time looking up these formulas, **you can safely ignore the following unless you are interested in potato density**), and add a variable that is not related at all, `% starch`.

In [56]:
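# potato density in g/cm^3
density_gr_cm3 = 158.5/250
# volume (cm^3) of a potato weighing w grams; defined for reference, not used below
volume_cm3 = lambda w: w/density_gr_cm3
k = np.pi * (4/3)
# we assume spherical potatos: V = (4/3)*pi*r^3, so r = (V/k)^(1/3)
radius_cm = lambda v: np.power(v/k , 1/3)
# diameter = 2 * radius, inflated by up to 20% of random noise
diameter_noise = lambda v: 2*radius_cm(v)*(1+0.2*np.random.random())
# note: the weight is passed in directly as v, without converting it through volume_cm3
potatos["diameter"] = potatos.weight.apply(diameter_noise)
# pct_starch is pure random noise, unrelated to weight
potatos["pct_starch"] = np.random.random(potatos.shape[0])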

In [57]:
potatos.head()

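|   | weight | variety | diameter | pct_starch |
|---|--------|----------|----------|------------|
| 0 | 100 | Kennebec | 6.062765 | 0.472005 |
| 1 | 150 | Kennebec | 7.730029 | 0.52241 |
| 2 | 150 | Kennebec | 7.85793 | 0.787091 |
| 3 | 200 | Monalisa | 8.471267 | 0.363602 |
| 4 | 250 | Monalisa | 9.033225 | 0.192372 |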


The most common way to measure correlation is the [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) (PPMCC), which ranges from -1 (perfect inverse correlation) through 0 (no linear correlation) to 1 (perfect positive correlation). As a rule of thumb, absolute values above 0.5 are considered high correlations, and absolute values above 0.3 moderate ones.

![](https://statistics.laerd.com/statistical-guides/img/pearson-2-small.png)

We can calculate Pearson's correlation between two arrays with `np.corrcoef`.

In [58]:
np.corrcoef(potatos.weight, potatos.diameter)

array([[1.        , 0.95747596],
       [0.95747596, 1.        ]])

The output is a symmetric matrix showing the correlation between variables (a variable is always totally correlated with itself, and thus the diagonal will always be 1).
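Under the hood, Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. Here is a plain Python sketch of that definition; the `pearson` helper and the two derived lists are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

weights = [100, 150, 150, 200, 250, 300, 325, 400, 415, 500, 600, 1000]

scaled = [2 * w + 5 for w in weights]   # hypothetical variable, perfectly linear in weight
negated = [-w for w in weights]         # hypothetical variable, perfectly inverse

print(pearson(weights, scaled))    # very close to 1
print(pearson(weights, negated))   # very close to -1
```

Note that scaling or shifting a variable does not change the correlation, only the direction of the relationship matters for the sign.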

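We can calculate the correlation among all numerical variables in a pandas DataFrame with `.corr()`. In pandas 2.0 and later, pass `numeric_only=True` so that non-numeric columns such as `variety` are skipped.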

In [59]:
potatos.corr()

|   | weight | diameter | pct_starch |
|------------|----------|----------|------------|
| weight | 1.0 | 0.957476 | 0.350774 |
| diameter | 0.957476 | 1.0 | 0.264196 |
| pct_starch | 0.350774 | 0.264196 | 1.0 |
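Notice that `pct_starch` was generated completely at random, yet its correlation with `weight` is 0.35, which the rule of thumb would call moderate. With only 12 potatoes that can easily happen by chance. The simulation below is a rough sketch of this effect: it repeatedly draws two unrelated random columns of 12 values and counts how often their correlation exceeds 0.3 in absolute value. The seed, the number of trials and the helper function are arbitrary choices made for the illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

random.seed(0)        # arbitrary seed, only for repeatability
n_potatoes = 12       # same sample size as our dataframe
n_trials = 2000       # arbitrary number of repetitions

hits = 0
for _ in range(n_trials):
    a = [random.random() for _ in range(n_potatoes)]
    b = [random.random() for _ in range(n_potatoes)]   # unrelated to a
    if abs(pearson(a, b)) > 0.3:
        hits += 1

share_above_threshold = hits / n_trials
print(share_above_threshold)
```

The share is far from negligible, so a moderate correlation on a very small sample should be taken with a grain of salt.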


<br>
<br>

## <font color='#eb3483'> Remember that correlation does NOT imply causation </font>


<img src="./media/corr1.png" alt="drawing" width="800"/>
<img src="./media/corr2.png" alt="drawing" width="800"/>
