# Scikit-learn Fundamentals


In [1]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = pd.read_csv("../data/data.csv")
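data_label = data.copy()

# Use one encoder per column so each keeps its own classes_ mapping
passed_encoder = LabelEncoder()
gender_encoder = LabelEncoder()
data_label["Passed_LabelEncoded"] = passed_encoder.fit_transform(data_label["Passed"])
data_label["Gender_LabelEncoded"] = gender_encoder.fit_transform(data_label["Gender"])

print("Label Encoded data")
print(data_label)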

FileNotFoundError: [Errno 2] No such file or directory: '../data/data.csv'
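
Since the CSV could not be loaded above, here is a minimal sketch of the same label-encoding step on a small in-memory table. The rows in `sample` are hypothetical stand-ins (the real contents of `../data/data.csv` are not shown); only the `Passed` and `Gender` column names come from the cell above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for ../data/data.csv; the real file's rows are unknown
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Passed": ["Yes", "No", "Yes", "Yes"],
})

# One encoder per column, so each keeps its own label-to-integer mapping
encoders = {}
for column in ["Passed", "Gender"]:
    encoder = LabelEncoder()
    sample[f"{column}_LabelEncoded"] = encoder.fit_transform(sample[column])
    encoders[column] = encoder

# classes_ holds the original labels in sorted order; a label's index is its code
passed_classes = list(encoders["Passed"].classes_)
gender_classes = list(encoders["Gender"].classes_)

# inverse_transform maps the integer codes back to the original labels
decoded_passed = list(
    encoders["Passed"].inverse_transform(sample["Passed_LabelEncoded"])
)

print(passed_classes, gender_classes)
print(sample)
```

Keeping a separate encoder per column means the mapping for `Passed` is still available after `Gender` has been encoded, which matters if the codes need to be decoded later.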

---
# 📊 StandardScaler

**StandardScaler** standardizes features so that they are centered around **0** with a standard deviation of **1**.

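Use it when features are measured on different scales (for example, study hours versus test scores), since distance-based and gradient-based models are sensitive to feature magnitude.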

---

## 📐 Formula

$$
z = \frac{x - \mu}{\sigma}
$$

### Where:

- $x$ = Actual value  
- $\mu$ = Mean of the column  
- $\sigma$ = Standard deviation of the column  
- $z$ = Standardized value (scaled output)

---

## 🧠 What This Means

- Subtracting the mean ($\mu$) centers the data around 0.
- Dividing by the standard deviation ($\sigma$) rescales the spread, so one unit of $z$ equals one standard deviation.
- After scaling:
  - Mean ≈ 0  
  - Standard Deviation ≈ 1  

---

## 📝 Example

Original Data:

$$
[10, 20, 30, 40, 50]
$$

Mean:

$$
\mu = 30
$$

Standard Deviation (population form, dividing by $N$, which is what StandardScaler uses):

$$
\sigma \approx 14.14
$$

For $x = 10$:

$$
z = \frac{10 - 30}{14.14} \approx -1.41
$$

For $x = 30$:

$$
z = \frac{30 - 30}{14.14} = 0
$$
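
The arithmetic in this example can be checked by hand with NumPy. This is a minimal sketch, not how scikit-learn is implemented internally; the variable names are illustrative. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which matches the 14.14 used above. The sample standard deviation (`ddof=1`) would be about 15.81 and would not reproduce these values.

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50], dtype=float)

mu = x.mean()            # mean of the column
sigma = x.std()          # population standard deviation (ddof=0)
z = (x - mu) / sigma     # z = (x - mu) / sigma, applied element-wise

print("mu:", mu)
print("sigma:", round(sigma, 2))
print("z:", np.round(z, 2))

# After scaling, the mean is (numerically) 0 and the standard deviation is 1
print("mean of z:", round(z.mean(), 6), "std of z:", round(z.std(), 6))
```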

---


In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = {
    "StudyHours": [1, 2, 3, 4, 5],
    "TestScore": [10, 20, 30, 40, 50]
}

df = pd.DataFrame(data)

# fit_transform learns each column's mean and standard deviation, then scales
standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(df)

print("Standard Scaler Output")
print(pd.DataFrame(standard_scaled, columns=df.columns))

Standard Scaler Output
   StudyHours  TestScore
0   -1.414214  -1.414214
1   -0.707107  -0.707107
2    0.000000   0.000000
3    0.707107   0.707107
4    1.414214   1.414214
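
Both columns come out identical because `TestScore` is exactly ten times `StudyHours`, and standardization removes that factor. As a rough sketch, the fitted scaler's `mean_` and `scale_` attributes expose the per-column $\mu$ and $\sigma$, so the transform can be reproduced by hand and undone with `inverse_transform`. The DataFrame is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "StudyHours": [1, 2, 3, 4, 5],
    "TestScore": [10, 20, 30, 40, 50],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Per-column statistics learned during fit
print("mean_: ", scaler.mean_)    # column means (mu)
print("scale_:", scaler.scale_)   # column standard deviations (sigma, ddof=0)

# Reproduce the transform by hand from the stored statistics
manual = (df.to_numpy() - scaler.mean_) / scaler.scale_
print("manual matches scaler:", np.allclose(manual, scaled))

# inverse_transform undoes the scaling: x = z * sigma + mu
restored = scaler.inverse_transform(scaled)
print("round trip matches original:", np.allclose(restored, df.to_numpy()))
```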


---
# 📊 MinMaxScaler

**MinMaxScaler** scales features so that all values fall within a fixed range, usually between **0 and 1**.

It preserves the shape of the original distribution but rescales the magnitude of the values.

---

## 📐 Formula

$$
X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}
$$

### Where:

- $X$ = Actual value  
- $X_{min}$ = Minimum value in the column  
- $X_{max}$ = Maximum value in the column  
- $X_{scaled}$ = Scaled output value  

---

## 🧠 What This Means

- Subtracting $X_{min}$ shifts the lowest value to 0.
- Dividing by $(X_{max} - X_{min})$ rescales the data between 0 and 1.
- After scaling:
  - Minimum value → 0  
  - Maximum value → 1  
  - All other values → Between 0 and 1  

---

## 📝 Example

Original Data:

$$
[1, 2, 3, 4, 5]
$$

$$
X_{min} = 1
$$

$$
X_{max} = 5
$$

If $X = 1$:

$$
X_{scaled} = \frac{1 - 1}{5 - 1} = 0
$$

If $X = 3$:

$$
X_{scaled} = \frac{3 - 1}{5 - 1} = 0.5
$$

If $X = 5$:

$$
X_{scaled} = \frac{5 - 1}{5 - 1} = 1
$$
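
The same three results fall out of a direct NumPy calculation. This is a minimal sketch of the formula above with illustrative variable names, not scikit-learn's internal code.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)

x_min, x_max = x.min(), x.max()

# X_scaled = (X - X_min) / (X_max - X_min), applied element-wise
x_scaled = (x - x_min) / (x_max - x_min)

print("min:", x_min, "max:", x_max)
print("scaled:", x_scaled)
```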

---


In [None]:
# fit_transform learns each column's min and max, then maps values into [0, 1]
minmax_scaler = MinMaxScaler()
minmax_scaled = minmax_scaler.fit_transform(df)

print("MinMax Scaler Output")
print(pd.DataFrame(minmax_scaled, columns=df.columns))

MinMax Scaler Output
   StudyHours  TestScore
0        0.00       0.00
1        0.25       0.25
2        0.50       0.50
3        0.75       0.75
4        1.00       1.00
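
The 0 to 1 range is only the default. As a sketch, `MinMaxScaler` accepts a `feature_range` argument for a different target interval, and the fitted scaler stores each column's minimum and maximum in `data_min_` and `data_max_`. The range `(-1, 1)` below is an arbitrary choice for illustration, and the DataFrame is rebuilt so the block runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "StudyHours": [1, 2, 3, 4, 5],
    "TestScore": [10, 20, 30, 40, 50],
})

# Map each column into [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(df)

# Per-column minimum and maximum learned during fit
print("data_min_:", scaler.data_min_)
print("data_max_:", scaler.data_max_)

print(pd.DataFrame(scaled, columns=df.columns))
```

With this range the column minimum maps to -1, the maximum to 1, and the midpoint (3 hours, score 30) to 0.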


In [None]:
import numpy as np

matrix = np.array([[1,2,3], [4,5,6], [7,8,9]])

In [None]:
data = pd.DataFrame(matrix)
print(data)

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
