# Statistics, so you can (hopefully) have fun in FML

The goal of this document is to help you get comfortable with some of the fundamental math and statistics concepts we’ve touched on in the lectures. Understanding these will make it easier to follow along with the machine learning topics we’re covering and, hopefully, give you the tools to start seeing some cool patterns and insights hidden in the data.

We'll start by covering the concepts in theory and then follow up with practical examples. This approach should help you build a strong intuition for how these ideas work in real-world data. As a little teaser, we will look at data with statistics from previous years of this course 🤓.



*NOTE: Everything was written off the top of my head and double-checked against the statistics study materials by Ing. Martina Litschmannová, Ph.D.: [Study materials from statistics](https://mi21.vsb.cz/sites/mi21.vsb.cz/files/unit/interaktivni_uvod_do_statistiky.pdf)*

### For all my zoomer students

![Haha](https://github.com/lowoncuties/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/subway.gif?raw=true)

## Math and statistics terms 

### Right, right ?

<img src="https://github.com/lowoncuties/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/raw/master/images/materials_meme2.png" alt="haha funny" style="width:400px;">


### **Basic concepts**

- **Minimum (Min)**: The smallest value in a dataset. It's simply the lowest observation in your data.
  
- **Maximum (Max)**: The largest value in a dataset. This represents the largest observed data point.

### **Mean and variance**

- **Mean**: The average value of your data. It’s the sum of all values divided by the number of values.

- **Variance**: The average squared distance of the data points from the mean, i.e. a measure of how spread out the data is around the mean. High variance means the data points are widely dispersed; low variance means they are close to the mean. This is best explained with an example, since it's easier to understand once you have some context. Variance is always expressed in the squared unit of whatever we measure ($unit^2$).
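
To make the two bullets above a bit more concrete, here is a minimal sketch using plain Python and the built-in `statistics` module. The heights are made-up values for illustration only. The text does not say whether it means the population variance (divide by n) or the sample variance (divide by n - 1), so the sketch shows both.

```python
import statistics

# Made-up heights in centimetres (illustrative data, not from the course)
heights_cm = [160, 165, 170, 175, 180]

smallest = min(heights_cm)   # 160
largest = max(heights_cm)    # 180

# Mean: sum of all values divided by the number of values
mean_cm = sum(heights_cm) / len(heights_cm)   # 170.0

# Population variance: average squared distance from the mean (divide by n)
pop_var = sum((x - mean_cm) ** 2 for x in heights_cm) / len(heights_cm)   # 50.0

# Sample variance: same sum of squares, but divided by n - 1
sample_var = statistics.variance(heights_cm)   # 62.5

print(f"min = {smallest} cm, max = {largest} cm, mean = {mean_cm} cm")
print(f"population variance = {pop_var} cm^2, sample variance = {sample_var} cm^2")
```

Notice the unit: the heights are in cm, but both variances come out in cm², which is exactly why variance is hard to read directly. Also keep in mind that libraries pick different defaults: pandas `.var()` uses the sample version (`ddof=1`), while `np.var` uses the population version (`ddof=0`).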

*Remember: If someone is mean, it doesn't necessarily mean that he is mean*

### **Median** 
Middle value when your data is ordered from smallest to largest. If you have an odd number of data points, the median is the middle one. If you have an even number, the median is the average of the two middle values. 

**Interesting fact** is that the median is often a better measure of "central tendency" than the mean (average), especially in the presence of outliers. 

**Why?** Imagine a scenario where you’re looking at the incomes of 10 people, 9 of whom earn around \$40,000, but 1 person earns \$1,000,000. The mean would be heavily skewed by the millionaire, pulling the average up to roughly \$136,000, a value that doesn’t represent most people's earnings. On the other hand, the median would still be around \$40,000, giving a much clearer picture of what the "typical" person in that group earns. This is why the median often gives a more realistic sense of central tendency when extreme values are present.
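
The income scenario above can be checked in a few lines. This is only a sketch: the text says the nine people earn "around" 40,000, so using exactly 40,000 for each of them is an assumption made to keep the numbers easy to follow.

```python
import statistics

# Assumed incomes: nine people at exactly 40,000 and one millionaire
incomes = [40_000] * 9 + [1_000_000]

mean_income = statistics.mean(incomes)      # 136000, dragged up by one person
median_income = statistics.median(incomes)  # 40000.0, average of the two middle values

print(f"mean   = {mean_income}")
print(f"median = {median_income}")

# Odd number of values: the median is the single middle value
odd_median = statistics.median([3, 1, 2])      # 2
# Even number of values: the median is the average of the two middle values
even_median = statistics.median([4, 1, 3, 2])  # 2.5
```

Nobody in the group actually earns anything close to the mean, while the median lands right where most of the people are.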

### **Quartiles and quantiles**
- **Quartiles**: from the word "quarter"; they divide your data into four equal parts after it has been sorted in ascending order
  - **Q1 (First Quartile)**: The value that marks the first 25% of the data. It’s the point below which 25% of the data lie.
  - **Q2 (Second Quartile/Median)**: The middle value, splitting the data in half. 50% of the values lie below this point.
  - **Q3 (Third Quartile)**: The value that marks the first 75% of the data, meaning 75% of the data points are below this point.

- **Quantiles**: cut points that split the sorted data into groups containing an equal number of data points
  - **decile**: divides the data into 10 parts. **Example:** the 5th decile = median = Q2; it's basically a different name for the same cut point in the data
  - **percentile**: divides the data into 100 parts. **Example:** the 80th percentile is the value below which 80% of the data points fall. If you think about it, the 25th percentile = Q1
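
The relationships in the list above (5th decile = median = Q2, 25th percentile = Q1) can be verified with `statistics.quantiles` from the standard library. This is a sketch on made-up exam scores. The choice of `method="inclusive"` is an assumption on my side; different tools use slightly different interpolation rules, so the exact values may differ a little elsewhere.

```python
import statistics

# Made-up exam scores (illustrative data), already sorted for readability
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95]

# method="inclusive" treats the data as the whole population and
# interpolates linearly between neighbouring values where needed
quartiles = statistics.quantiles(scores, n=4, method="inclusive")
deciles = statistics.quantiles(scores, n=10, method="inclusive")
percentiles = statistics.quantiles(scores, n=100, method="inclusive")

q1, q2, q3 = quartiles
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")   # Q1 = 65.0, Q2 = 75.0, Q3 = 85.0

# The k-th cut point sits at index k - 1 in each list
fifth_decile = deciles[4]
p25 = percentiles[24]
p50 = percentiles[49]
p80 = percentiles[79]

print(f"5th decile = {fifth_decile}, 50th percentile = {p50}")
print(f"25th percentile = {p25}, 80th percentile = {p80}")
```

Quartiles, deciles and percentiles are all the same idea, just with a different number of cuts (`n=4`, `n=10`, `n=100`).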
  
  


### **Interquartile Range (IQR)**
Difference between the third quartile (Q3) and the first quartile (Q1). It measures the spread of the middle 50% of the data.

**Equation:** *IQR = Q3 - Q1*

**Why is it useful?** IQR tells us about the spread of the central part of the data, ignoring the extreme values. It’s useful because it is not affected by outliers as much as the range (Max - Min) is.
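
Here is a small sketch of that robustness claim. The two lists are made up and differ only in their last value; `iqr` and `value_range` are hypothetical helper names introduced just for this illustration, and the inclusive quantile method is again an assumption.

```python
import statistics

def iqr(values):
    """Interquartile range: Q3 - Q1 (inclusive quantile method)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def value_range(values):
    """Range: Max - Min."""
    return max(values) - min(values)

# Made-up data: the second list only differs in its last (extreme) value
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]

print(value_range(clean), iqr(clean))                 # 8 4.0
print(value_range(with_outlier), iqr(with_outlier))   # 99 4.0
```

One extreme value blows the range up from 8 to 99, while the IQR does not move at all.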

### **Outliers**
The simplest explanation would be "*The data you love and hate at the same time*"

Data points that are significantly different from the majority of the data. They lie outside of the expected range and can either suggest interesting insights or "mess-up" our results.

**Equation**: Any **data point < Q1 - 1.5 * IQR** OR **data point > Q3 + 1.5 * IQR**

**Example**: Imagine 10 people working the same job, 9 of whom earn around \$40,000, but 1 person earns \$200,000. This one person would probably be an outlier. That can suggest an interesting insight, so we can focus on that person (maybe we can find out why he earns that much, maybe there are some shenanigans going on, etc. Everything depends on the quality and quantity of the data).
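
The 1.5 * IQR rule and the salary example can be put together in a short sketch. The concrete salaries are made up to match "around 40,000", and `find_outliers` is a hypothetical helper written for this illustration, not a function from any library.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

# Made-up salaries: nine people around 40,000 and one person at 200,000
salaries = [38_000, 39_000, 39_500, 40_000, 40_000,
            40_500, 41_000, 41_500, 42_000, 200_000]

lower, upper, outliers = find_outliers(salaries)
print(f"fences: [{lower}, {upper}]")   # fences: [37000.0, 44000.0]
print(f"outliers: {outliers}")         # outliers: [200000]
```

Everything between the two fences counts as "expected", so only the 200,000 salary gets flagged. Whether you then drop it or investigate it is up to you and the data.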


<img src="https://github.com/lowoncuties/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/raw/master/images/materials_meme3.png" alt="haha meme" style="width:400px;">


## Standard Deviation (std)
The standard deviation is a measure of how much individual data points in a dataset deviate (or vary) from the mean. In simpler terms, it tells you how spread out the data is. 

**Interesting fact** is that **std = $\sqrt{Variance}$**, which basically brings the variance back to the original unit. We can think of it as the typical ± distance from the mean, which we can interpret in human terms (**we cannot interpret variance like this since it's in $unit^2$**). 

**Example**: If the std of a price is \$3000, the prices typically deviate from the mean by about ± \$3000. With better insight into the data, we can usually conclude a lot based on the std, such as market volatility, extreme weather changes, how many students studied for the test, etc. 

It helps in understanding the variability or consistency in the data.
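
The link between variance and std is easy to see on a tiny example. This is a sketch with made-up prices, and it assumes the population versions of both measures (divide by n); the sample versions would give slightly larger numbers.

```python
import math
import statistics

# Made-up prices in dollars (illustrative data)
prices = [2000, 4000, 4000, 4000, 5000, 5000, 7000, 9000]

mean_price = statistics.mean(prices)       # 5000 dollars
variance = statistics.pvariance(prices)    # 4000000 dollars^2 (hard to interpret)
std = math.sqrt(variance)                  # 2000.0 dollars (same unit as the prices)

# How many prices fall within one std of the mean?
within_one_std = [p for p in prices if abs(p - mean_price) <= std]

print(f"mean = {mean_price}, variance = {variance}, std = {std}")
print(f"{len(within_one_std)} of {len(prices)} prices lie within mean ± std")
```

A variance of 4,000,000 "dollars squared" tells us very little, but a std of 2000 dollars around a mean of 5000 dollars is something we can picture straight away.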


## Practical example

```python
import pandas as pd  # dataframes
import numpy as np  # matrices and linear algebra
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # another matplotlib interface - styled and easier to use
```