<a href="https://colab.research.google.com/github/keshavvprabhu/statistics_tutorials/blob/main/statistics101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Statistics?

Statistics is the science of collecting, organizing, and analyzing data.

# Standard Deviation

Standard Deviation is used to measure the ***spread*** of values in a sample. We can use the formula below to calculate the Standard Deviation of a given sample:

\begin{align}
\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}
\end{align}

where: 

$x_i$ = each value in the sample, $n$ = sample size, $\bar{x}$ = sample mean



A Standard Deviation cannot be good or bad because it simply tells us how spread out the values are in a sample.



### Note:

The higher the value of the standard deviation, the more spread out the values are in a sample. Conversely, the lower the value of the standard deviation, the more tightly packed together the values are.
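
To make the note above concrete, here is a minimal sketch that compares two samples with the same mean. Both samples (`tightly_packed` and `spread_out`) are hypothetical values chosen only for illustration. They share a mean of 50, but the second one has a much larger standard deviation.

```python
import statistics

# Two hypothetical samples with the same mean but a different spread.
tightly_packed = [48, 49, 50, 51, 52]
spread_out = [10, 30, 50, 70, 90]

for name, sample in (("tightly packed", tightly_packed), ("spread out", spread_out)):
    mean = statistics.mean(sample)
    std_dev = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    print(f"{name}: mean = {mean}, standard deviation = {std_dev:.2f}")
```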

### Why is the Standard Deviation important?

Standard deviation is important because it tells us how spread out the values are in a given dataset.

Whenever we analyze a dataset, we are interested in finding out the following metrics:

* The center of the dataset
* The spread of the values in the dataset

By knowing where the center is located and how spread out the values are, we can gain a good understanding of the distribution of values in any dataset.


## Visualization
Box plots are a good way to visualize the spread of a dataset. A box plot shows the median and the quartiles, so the center and the spread can be read at a glance.
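
A box plot is drawn from a handful of summary values: the minimum, the quartiles, and the maximum. The sketch below computes them with `statistics.quantiles` (available from Python 3.8). The `data` values are hypothetical, and plotting libraries may use a slightly different quartile convention, so treat the numbers as an illustration rather than as what a particular plot will show.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [1, 10, 100, 1000, 10000]

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

five_number_summary = {
    "minimum": min(data),
    "first quartile (Q1)": q1,
    "median (Q2)": q2,
    "third quartile (Q3)": q3,
    "maximum": max(data),
}

for label, value in five_number_summary.items():
    print(f"{label}: {value}")

# The interquartile range is the length of the "box" in a box plot.
print(f"interquartile range (IQR): {q3 - q1}")
```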

## Coefficient of Variation
One way to determine if a standard deviation is high is to compare it to the mean of the dataset.

A coefficient of variation ($C_v$) is a way to measure how spread out the values are in a dataset relative to the mean. It's calculated as:

\begin{align}
    C_v = \frac{\sigma}{\bar{x}}
\end{align}

where:

$\sigma$ = standard deviation;

$\bar{x}$ = mean of dataset
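
As a rough illustration of why the comparison to the mean matters, the sketch below uses two hypothetical samples with the same standard deviation but very different means. The helper name `coefficient_of_variation` is an assumption made for this example, not a function from the `statistics` module.

```python
import statistics

def coefficient_of_variation(sample):
    """Return the sample standard deviation divided by the sample mean."""
    mean = statistics.mean(sample)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return statistics.stdev(sample) / mean

# Hypothetical samples: same spread, very different means.
small_values = [8, 10, 12]
large_values = [998, 1000, 1002]

print(f"Standard deviation of small_values: {statistics.stdev(small_values)}")
print(f"Standard deviation of large_values: {statistics.stdev(large_values)}")
print(f"C_v of small_values: {coefficient_of_variation(small_values):.3f}")
print(f"C_v of large_values: {coefficient_of_variation(large_values):.3f}")
```

Both samples have a standard deviation of 2, but relative to the mean the spread of `small_values` is one hundred times larger.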

## Variance

Variance is the square of the standard deviation:

Variance = $\sigma ^ 2$
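
The relationship between variance and standard deviation can be checked with the standard library. This is a small sketch on a hypothetical `sample`. Note that `statistics.variance` and `statistics.stdev` both use the sample formula, which divides by $n-1$.

```python
import math
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [1, 10, 100, 1000, 10000]

variance = statistics.variance(sample)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(sample)      # sample standard deviation

print(f"Variance: {variance}")
print(f"Standard deviation: {std_dev}")
print(f"Square root of the variance: {math.sqrt(variance)}")
```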



In [None]:
import math
import statistics

In [None]:
list_numbers = (1,10,100,1000,10000)
sample_size = len(list_numbers)
sample_mean = statistics.mean(list_numbers)
print(f"List of Numbers: {list_numbers}")
print(f"Sample Size: {sample_size}")
print(f"Mean of the Sample: {sample_mean}")

# Accumulate the squared deviations from the mean
# (named so that it does not shadow the built-in sum())
sum_squared_deviations = 0
for number in list_numbers:
    sum_squared_deviations += (number - sample_mean) ** 2

std_dev = math.sqrt(sum_squared_deviations / (sample_size - 1))
print(f"Calculated Standard Deviation: {std_dev}")
print(f"Actual Standard Deviation: {statistics.stdev(list_numbers)}")
print(f"Coefficient of Variation: {std_dev / sample_mean}")

List of Numbers: (1, 10, 100, 1000, 10000)
Sample Size: 5
Mean of the Sample: 2222.2
Calculated Standard Deviation: 4368.044093184042
Actual Standard Deviation: 4368.044093184042
Coefficient of Variation: 1.9656394983278025


# Mean
\begin{align}
\text{Mean} = \frac{\text{sum of all numbers}}{\text{count of numbers}}
\end{align}

Mathematically, it can also be written as

\begin{align}
    \bar{x} = \frac{\sum_{i=1}^n x_i} {n}
\end{align}



# Median

To find the median you need to do the following:
1. Sort the data 
2. If the number of items is odd, pick the middle number
3. If the number of items is even, average the two middle numbers

# Mode
The mode is the number that appears most often in the sample. A sample can have more than one mode.


# Range
Range = largest number - smallest number


# Midrange
Midrange = (Largest Number + Smallest Number) / 2

# Mean

Let us illustrate it with an example  

In [None]:
list_numbers = (45, 98, 35, 96, 42, 39, 41, 41, 25, 63, 69, 75, 64, 15, 9, 8, 2, 49, 35, 68, 97, 94, 31, 13, 99)
# list_numbers = (45, 98, 35, 96, 58)

In [None]:
total = 0
for i in list_numbers:
    total += i

calculated_mean = total / len(list_numbers)
print(f"Calculated Mean: {calculated_mean}")

print(f"Actual Mean: {statistics.mean(list_numbers)}")

Calculated Mean: 50.12
Actual Mean: 50.12


# Median 
Let us illustrate it with the example below

In [None]:
sorted_numbers = sorted(list_numbers)
print(f"Sorted List: {sorted_numbers}")
sample_size = len(list_numbers)
print(f"Size : {sample_size}")

mid_point = sample_size // 2
if sample_size % 2 == 0:
    # Even number of items: average the two middle values
    median = (sorted_numbers[mid_point - 1] + sorted_numbers[mid_point]) / 2
else:
    # Odd number of items: take the middle value
    median = sorted_numbers[mid_point]
print(f"Calculated Median: {median}")

print(f"Actual Median: {statistics.median(sorted_numbers)}")

Sorted List: [2, 8, 9, 13, 15, 25, 31, 35, 35, 39, 41, 41, 42, 45, 49, 63, 64, 68, 69, 75, 94, 96, 97, 98, 99]
Size : 25
Calculated Median: 42
Actual Median: 42


# Mode
Let us illustrate with an example below


In [None]:
from collections import Counter
dict_output = Counter(sorted_numbers)

print(f"Dataset : {sorted_numbers}")
# Highest number of repetitions of any value (0 for an empty dataset)
max_value = max(dict_output.values(), default=0)

# Every value that is repeated max_value times is a mode
list_modes = [k for k, v in dict_output.items() if v == max_value]

# If every value appears exactly once, no value stands out as a mode
if len(list_modes) == len(sorted_numbers):
    print("There are no modes")
else:
    print(f"Calculated Mode: {list_modes}")


Dataset : [2, 8, 9, 13, 15, 25, 31, 35, 35, 39, 41, 41, 42, 45, 49, 63, 64, 68, 69, 75, 94, 96, 97, 98, 99]
Calculated Mode: [35, 41]


# Range
Let us look at the example

In [None]:
print(f"Dataset : {sorted_numbers}")

largest = max(sorted_numbers)
smallest = min(sorted_numbers)

calculated_range = largest - smallest 

print(f"Calculated Range: {calculated_range}")

Dataset : [2, 8, 9, 13, 15, 25, 31, 35, 35, 39, 41, 41, 42, 45, 49, 63, 64, 68, 69, 75, 94, 96, 97, 98, 99]
Calculated Range: 97


# Midrange

In [None]:
print(f"Dataset : {sorted_numbers}")

largest = max(sorted_numbers)
smallest = min(sorted_numbers)

calculated_midrange = (largest + smallest) / 2

print(f"Calculated Midrange: {calculated_midrange}")

Dataset : [2, 8, 9, 13, 15, 25, 31, 35, 35, 39, 41, 41, 42, 45, 49, 63, 64, 68, 69, 75, 94, 96, 97, 98, 99]
Calculated Midrange: 50.5


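# What is a Confusion Matrix?
A confusion matrix compares the predictions of a classification model with the true labels:

Prediction\Truth | Negative | Positive
---|---|---
Negative | True Negative (TN) | False Negative (FN)
Positive | False Positive (FP) | True Positive (TP)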


Precision and Recall are metrics that allow us to evaluate the quality of our classification model.
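
The four cells of a confusion matrix (TP, FP, TN, FN) can be counted directly from the true labels and the predictions. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative). The lists `actual` and `predicted` and the helper `confusion_counts` are hypothetical names introduced for this example.

```python
from collections import Counter

# Hypothetical labels: 1 = positive, 0 = negative.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

def confusion_counts(actual, predicted):
    """Count TP, FP, TN and FN, assuming every label is either 0 or 1."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter()
    for truth, prediction in zip(actual, predicted):
        if prediction == 1 and truth == 1:
            counts["TP"] += 1  # predicted positive, actually positive
        elif prediction == 1 and truth == 0:
            counts["FP"] += 1  # predicted positive, actually negative
        elif prediction == 0 and truth == 0:
            counts["TN"] += 1  # predicted negative, actually negative
        else:
            counts["FN"] += 1  # predicted negative, actually positive
    return counts

counts = confusion_counts(actual, predicted)
print(f"TP={counts['TP']} FP={counts['FP']} TN={counts['TN']} FN={counts['FN']}")
```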


# What is Accuracy?
Accuracy is the fraction of all predictions that are correct. Mathematically, it can be written as follows:
\begin{align}
A = \frac{TP+TN}{TP+FP+TN+FN}
\end{align}
The issue with the accuracy metric is that, if the dataset is not balanced, it can give skewed results that are misleading.
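
A small sketch of the imbalance problem: the hypothetical dataset below has 95 negatives and 5 positives, and the "model" always predicts the negative class. Accuracy is computed as the share of correct predictions, $(TP+TN)$ divided by the total number of predictions.

```python
# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1).
actual = [0] * 95 + [1] * 5

# A "model" that always predicts the negative class.
predicted = [0] * len(actual)

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"Accuracy: {accuracy:.2f}")
```

The accuracy is 0.95 even though the model never finds a single positive case.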


# What is Precision?
Precision is the fraction of positive predictions that are actually positive. Mathematically, it can be written as follows:
\begin{align}
P = \frac{TP}{TP+FP}
\end{align}


# What is Recall?
Recall is the fraction of actual positives that the model correctly identifies. Mathematically, it can be written as follows:
\begin{align}
R = \frac{TP}{TP+FN}
\end{align}
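
The precision and recall formulas translate into a few lines of Python. This is a minimal sketch: the counts `tp`, `fp` and `fn` are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here to avoid a division error, not a universal convention.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    # Assumption: report 0.0 when the model made no positive predictions.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that were found: TP / (TP + FN)."""
    # Assumption: report 0.0 when the dataset contains no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 30, 10, 20

print(f"Precision: {precision(tp, fp):.2f}")  # 30 / 40
print(f"Recall: {recall(tp, fn):.2f}")        # 30 / 50
```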

# What is Underfitting?
Underfitting occurs when a model is too simplistic to capture the patterns in the data, so it is not able to predict reliably.


# What is Overfitting?
Overfitting occurs when a model's results look too good to be true. The model may behave perfectly during training, but when it is actually in production it may not yield reliable results.
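
Both problems can be seen in a toy experiment. The sketch below is an illustration under stated assumptions, not a real training setup: the data is hypothetical ($y = 2x$ plus random noise), `constant_model` stands in for an underfitting model, and `memorizing_model` stands in for an overfitting model that simply memorizes the training points.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is repeatable

def make_data(xs):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    return [(x, 2 * x + random.gauss(0, 1)) for x in xs]

train_data = make_data(range(0, 20, 2))  # x = 0, 2, ..., 18
test_data = make_data(range(1, 20, 2))   # x = 1, 3, ..., 19 (not seen in training)

def mean_squared_error(model, data):
    return statistics.mean((model(x) - y) ** 2 for x, y in data)

# Underfitting: a constant model that ignores x entirely.
train_mean = statistics.mean(y for _, y in train_data)

def constant_model(x):
    return train_mean

# Overfitting: memorize the training points, noise included, and answer
# with the y value of the nearest training x.
def memorizing_model(x):
    nearest_x, nearest_y = min(train_data, key=lambda point: abs(point[0] - x))
    return nearest_y

models = (("constant (underfit)", constant_model), ("memorizing (overfit)", memorizing_model))
for name, model in models:
    train_mse = mean_squared_error(model, train_data)
    test_mse = mean_squared_error(model, test_data)
    print(f"{name}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

The constant model has a large error on both the training and the test data. The memorizing model has zero error on the training data but a nonzero error on the unseen test data, so its training score overstates how well it will do in production.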