<a href="https://colab.research.google.com/github/prokope/learning-data-science/blob/main/statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Statistics


## Purpose


I want to learn the most commonly used statistical methods and apply them to real problems.

## Importing libs

In [4]:
import pandas as pd
from IPython.display import display

## Essential data types in statistics

### Numeric (Quantitative) Data


Values that represent measurable quantities.

Sub-types:
- Continuous: can take any value within a range (temperature, height)
- Discrete: countable values (number of movies, number of children; always whole numbers)

💡 **Numerical data allows calculations and continuous visualizations**

In [None]:
df = pd.DataFrame(
    {
        'age': [18, 36, 54], # Discrete (can only be whole numbers in context)
        'height': [1.73, 1.86, 1.99] # Continuous (any value within a range)
    }
)

print(df)

   age  height
0   18    1.73
1   36    1.86
2   54    1.99
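
Because both columns are numeric, arithmetic and summary statistics work on them directly. A minimal sketch with the same values as above; the `height_cm` column and the assumption that heights are in meters are illustrative additions, not part of the original data:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [18, 36, 54],           # Discrete
    'height': [1.73, 1.86, 1.99],  # Continuous (assumed to be meters)
})

# pandas stores the whole-number column as integers and the other as floats
print(df.dtypes)

# Arithmetic works element-wise on numeric columns
df['height_cm'] = df['height'] * 100  # hypothetical derived column

# Aggregations are also available
age_sum = df['age'].sum()
print(age_sum)
```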


### Qualitative (Categorical) Data

Non-numerical labels that describe categories or groups.

Sub-types:
- Nominal: No natural order (car brands, hair color, countries)
- Ordinal: Has a natural order, but is not numerical (education levels, satisfaction levels)

In [None]:
df = pd.DataFrame({
    'name': ["Paul", "Richard", "John", "Lennon"], # Nominal, because names have no natural order
    'birth_country': ["Brazil", "England", "Ireland", "United States"], # Also nominal
    'education_level': ["Bachelor's", "Master's", "MBA", "PhD"] # Ordinal: the levels have a natural order (Bachelor's -> Master's/MBA -> PhD)
})

print(df)

      name  birth_country education_level
0     Paul         Brazil      Bachelor's
1  Richard        England        Master's
2     John        Ireland             MBA
3   Lennon  United States             PhD
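
The difference between nominal and ordinal data can be made explicit in pandas with an ordered categorical. The sketch below uses a hypothetical satisfaction survey (the level names are made up for illustration) rather than the education column above:

```python
import pandas as pd

# Hypothetical satisfaction levels, listed from lowest to highest
levels = ['low', 'medium', 'high']
satisfaction = pd.Series(
    pd.Categorical(['medium', 'low', 'high', 'medium'],
                   categories=levels, ordered=True)
)

# Ordered categories support comparisons and min/max, but not arithmetic
at_least_medium = satisfaction >= 'medium'
print(at_least_medium.tolist())
print(satisfaction.max())
```

A nominal column (such as country) would use `ordered=False`, in which case comparisons like `>=` are not allowed.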


### Boolean (Binary) Data

Boolean (or binary) data is a subcategory that takes only two values. These values represent yes/no, true/false, or presence/absence conditions.

**Why it matters**
- Perfect for filters
- Used in machine learning as flags/features
- Easy to convert to 0 and 1
- Very common in real world datasets

In [None]:
df = pd.DataFrame({
    'name': ["Paul", "Alice", "Bob"],
    'age': [19, 7, 20],
    'is_adult': ["Yes", "No", "Yes"] # Two-valued labels stored as text; pandas booleans would be True/False
})
df

    name  age is_adult
0   Paul   19      Yes
1  Alice    7       No
2    Bob   20      Yes
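
The points listed under "Why it matters" can be sketched as follows. This is one possible approach, assuming the labels are exactly `"Yes"` and `"No"`; the `is_adult_flag` column name is a hypothetical addition:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Paul", "Alice", "Bob"],
    'age': [19, 7, 20],
    'is_adult': ["Yes", "No", "Yes"],
})

# Map the text labels to real booleans
df['is_adult'] = df['is_adult'].map({'Yes': True, 'No': False})

# A boolean column works directly as a row filter
adults = df[df['is_adult']]
print(adults)

# Casting to int gives the 0/1 encoding
df['is_adult_flag'] = df['is_adult'].astype(int)
print(df)
```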


## Measures of Central Tendency

### Creating a fictional CSV

In [5]:
cars_price = {
    'car': ["Jetta", "Civic", "Corolla", "320i", "Compass", "Versa", "X1"],
    'brand': ["Volkswagen", "Honda", "Toyota", "BMW", "Jeep", "Nissan", "BMW"],
    'price': [220000, 122000, 140000, 262000, 148000, 123990, 280000]
}

cars_price = pd.DataFrame(cars_price)
cars_price

       car       brand   price
0    Jetta  Volkswagen  220000
1    Civic       Honda  122000
2  Corolla      Toyota  140000
3     320i         BMW  262000
4  Compass        Jeep  148000
5    Versa      Nissan  123990
6       X1         BMW  280000


### mean(): Mean calculation
Syntax: <code>df["column"].mean()</code>

In [None]:
cars_price["price"].mean()

np.float64(185141.42857142858)

### median(): Median calculation
Syntax: <code>df["column"].median()</code>

In [None]:
cars_price["price"].median()

148000.0
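
To see what `mean()` and `median()` compute, both can be reproduced by hand. A minimal sketch using the same prices as `cars_price`; the variable names are illustrative:

```python
import pandas as pd

prices = pd.Series([220000, 122000, 140000, 262000, 148000, 123990, 280000])

# Mean: sum of the values divided by how many there are
manual_mean = prices.sum() / len(prices)

# Median: middle value of the sorted data
# (there are 7 values here, an odd count, so there is a single middle value)
sorted_prices = prices.sort_values().reset_index(drop=True)
manual_median = sorted_prices.iloc[len(sorted_prices) // 2]

print(manual_mean, prices.mean())
print(manual_median, prices.median())
```

With an even number of values, the median would instead be the average of the two middle values.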

### mode(): Mode calculation
Syntax: <code>df["column"].mode()</code>

In [None]:
cars_price.brand.mode()

  brand
0   BMW


## Measures of Dispersion

### std(): Standard Deviation calculation

Syntax: <code>df["column"].std()</code>

Formula (pandas computes the sample standard deviation by default, dividing by n - 1):
$$
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$


In [None]:
# Using the cars DataFrame (cars_price) created before:
cars_price

       car       brand   price
0    Jetta  Volkswagen  220000
1    Civic       Honda  122000
2  Corolla      Toyota  140000
3     320i         BMW  262000
4  Compass        Jeep  148000
5    Versa      Nissan  123990
6       X1         BMW  280000


In [None]:
cars_price.price.std()

67409.29718977233
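
The result above is the sample standard deviation, because `std()` uses `ddof=1` by default. The population version (dividing by n) is available through `ddof=0`. A short sketch with the same prices:

```python
import pandas as pd

prices = pd.Series([220000, 122000, 140000, 262000, 148000, 123990, 280000])

# Default: sample standard deviation (divides by n - 1)
sample_std = prices.std()

# Population standard deviation (divides by n)
population_std = prices.std(ddof=0)

print(sample_std)
print(population_std)  # smaller, since the divisor is larger
```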

### var(): Variance calculation

Syntax: <code>df["column"].var()</code>

Formula:
$$
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$


In [None]:
cars_price["price"].var()
# The variance is much larger than the standard deviation because the deviations are squared, so it is in squared units.

4544013347.619048
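
The variance and the standard deviation carry the same information: the standard deviation is the square root of the variance. A quick check with the same prices:

```python
import math
import pandas as pd

prices = pd.Series([220000, 122000, 140000, 262000, 148000, 123990, 280000])

variance = prices.var()  # in squared price units
std = prices.std()       # in the same units as the prices

# Squaring the standard deviation recovers the variance
matches = math.isclose(std ** 2, variance)
print(matches)
```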

### range: How far apart are the smallest and largest values?

The range measures how spread out the data is by calculating:

$$
Range = max - min
$$

💡 **It's very sensitive to outliers, because it only uses max and min.**

**Pandas has no direct .range() method, so you need to calculate it like this:**

In [14]:
price_range = cars_price.price.max() - cars_price.price.min() # Named price_range to avoid shadowing the built-in range()
display(cars_price.price.max()) # Max value
display(cars_price.price.min()) # Min value
price_range

280000

122000

158000

### Interquartile Range (IQR): How spread out is the middle 50% of the data?

The IQR is the range that covers the middle 50% of the data.

It is defined as:

$$
IQR = Q3 - Q1
$$

Where:
- Q3: The value below which 75% of data fall.
- Q1: The value below which 25% of data fall.

💡 **This one is robust against outliers, because it ignores extreme values**

In [20]:
Q1 = cars_price.price.quantile(0.25)
Q3 = cars_price.price.quantile(0.75)
IQR = Q3 - Q1
display(Q3)
display(Q1)
display(IQR)

np.float64(241000.0)

np.float64(131995.0)

np.float64(109005.0)
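
The difference in outlier sensitivity between the range and the IQR can be seen by adding one extreme value. In this sketch the 2,000,000 price is a hypothetical outlier, and `value_range` and `iqr` are illustrative helper names:

```python
import pandas as pd

prices = pd.Series([220000, 122000, 140000, 262000, 148000, 123990, 280000])

def value_range(s):
    """Largest value minus smallest value."""
    return s.max() - s.min()

def iqr(s):
    """Third quartile minus first quartile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Hypothetical extra car priced far above the rest
with_outlier = pd.concat([prices, pd.Series([2_000_000])], ignore_index=True)

print(value_range(prices), value_range(with_outlier))
print(iqr(prices), iqr(with_outlier))
```

The range grows by more than ten times, while the IQR changes far less.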

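### MAD: Median Absolute Deviation


Pandas has no built-in method for the median absolute deviation. The old <code>df["column"].mad()</code> method computed the *mean* absolute deviation instead, and it was removed in pandas 2.0, so the value is calculated manually here.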

The **Median Absolute Deviation (MAD)** is defined as:

$$
\text{MAD} = \mathrm{median}\left( \, | x_i - \mathrm{median}(x) | \, \right)
$$

In [11]:
mad = (cars_price["price"] - cars_price["price"].median()).abs().median()
display(cars_price["price"].sort_values())
display(mad)

    price
1  122000
5  123990
2  140000
4  148000
0  220000
3  262000
6  280000


26000.0
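
The calculation above can be wrapped in a small reusable function. This is a sketch, not a pandas API: `median_abs_deviation` is a hypothetical helper name, and the mean absolute deviation is included only for comparison.

```python
import pandas as pd

def median_abs_deviation(s: pd.Series) -> float:
    """Median of the absolute deviations from the median."""
    return float((s - s.median()).abs().median())

prices = pd.Series([220000, 122000, 140000, 262000, 148000, 123990, 280000])

mad_value = median_abs_deviation(prices)
print(mad_value)  # 26000.0

# For comparison: the mean absolute deviation uses means instead of medians
mean_abs_dev = (prices - prices.mean()).abs().mean()
print(mean_abs_dev)
```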