### 1. Load Dataset and EDA
1. Load the dataset
2. Explore the dataset (shape, head, info)
3. Check for missing values
4. Unique values in categorical features
5. Target distribution
6. Numeric feature distribution and outliers
7. Categorical feature distribution
8. Correlation matrix for numeric features
9. Pairwise relationships (if applicable)
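
Several of the checks listed above can be sketched in a few lines of pandas. The block below is a minimal illustration, not the notebook's own EDA: the `df` frame is a stand-in built from the first five rows printed in this notebook, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded dataset: the first five rows shown in the
# notebook output, so the sketch runs without the CSV file.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# Step 3: count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Step 4: unique values in each non-numeric (categorical) column
categorical_cols = df.select_dtypes(exclude="number").columns
for col in categorical_cols:
    print(col, df[col].unique())

# Step 5: target distribution
target_counts = df["Purchased"].value_counts()
print(target_counts)

# Step 6: summary statistics for the numeric features
numeric_cols = df.select_dtypes(include="number").columns
print(df[numeric_cols].describe())

# Step 8: correlation matrix for the numeric features
corr = df[numeric_cols].corr()
print(corr)
```

With only a handful of rows the statistics are not meaningful; the point is the shape of each call.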

Notes:
For very large datasets (more than 1 million rows), consider sampling or reading the file in chunks rather than loading everything into memory.
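
As a rough sketch of chunked reading, `pd.read_csv` accepts a `chunksize` argument and then returns an iterator of DataFrames instead of one large frame. The in-memory CSV below is a hypothetical stand-in for a large file on disk, and the chunk size and sampling fraction are arbitrary choices for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "Country,Age,Salary,Purchased\n" + "\n".join(
    f"Spain,{20 + i % 40},{40000 + 100 * i},Yes" for i in range(1000)
)

total_rows = 0
salary_sum = 0.0
sampled_parts = []

# With chunksize set, read_csv yields DataFrames of at most 250 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total_rows += len(chunk)
    salary_sum += chunk["Salary"].sum()
    # Keep a 10% random sample of each chunk for later exploration.
    sampled_parts.append(chunk.sample(frac=0.1, random_state=0))

mean_salary = salary_sum / total_rows
sample = pd.concat(sampled_parts, ignore_index=True)
print(total_rows, mean_salary, len(sample))
```

Aggregates such as sums and counts can be accumulated chunk by chunk, while the per-chunk sample gives a smaller frame that fits in memory for plotting.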

In [20]:
import pandas as pd
import numpy as np

# Load dataset
dataset = pd.read_csv('../datasets/data.csv') 
print("Dataset Shape:", dataset.shape)
print("First 5 rows of the dataset:")
print(dataset.head())

Dataset Shape: (10, 4)
First 5 rows of the dataset:
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


### 2. Separating Features (X) and Target (y)

In [None]:
X = dataset.iloc[:, :-1].values  # All rows, all columns except last
y = dataset.iloc[:, -1].values   # All rows, last column

### 3. Handling Missing Data
Problem: The Age and Salary columns contain missing values.<br>
Solution: Use SimpleImputer to replace each missing value with the mean of its column.

In [22]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])   # Learn the column means of Age and Salary (columns 1 and 2)
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
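
The two filled-in values in the output above (38.78 for Age and 63777.78 for Salary) are the means of the non-missing entries in each column. The sketch below checks this against `np.nanmean`; the `numeric` array is reconstructed from the values printed in this notebook, with `NaN` placed where the imputed values appear.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary, reconstructed from the printed output;
# NaN marks the two entries that were missing before imputation.
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# statistics_ holds the per-column fill values learned by fit()
print(imputer.statistics_)

# The same means computed directly, ignoring NaN entries
manual_means = np.nanmean(numeric, axis=0)
print(manual_means)
```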


### 4. Encoding Categorical Data
Country (categorical) → One-Hot Encoding: each country becomes its own 0/1 column, and the remaining columns are passed through unchanged.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
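
To see which one-hot column corresponds to which country, the fitted encoder's `categories_` attribute can be inspected. This is a minimal sketch on a three-row stand-in array (`X_demo` is illustrative, taken from the first three rows of the data), not a change to the cell above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows from the dataset, used here as a small stand-in.
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = ct.fit_transform(X_demo)

# The one-hot columns follow the order of the learned categories.
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(X_encoded)
```

The categories are stored in sorted order, which is why the first three columns of the encoded output correspond to France, Germany, and Spain.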


Purchased (Yes/No) → Label Encoding

In [24]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
print(y)  # Yes -> 1, No -> 0

[0 1 0 0 1 1 0 1 0 1]
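
The mapping from labels to integers can be read from the encoder's `classes_` attribute, and `inverse_transform` converts predictions back to the original labels. A small sketch, using the first five target values as an illustrative `labels` list:

```python
from sklearn.preprocessing import LabelEncoder

# First five target values from the dataset.
labels = ["No", "Yes", "No", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted; the position of each class is its integer code.
print(le.classes_)
print(encoded)

# Map integer codes back to the original string labels.
decoded = le.inverse_transform(encoded)
print(decoded)
```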


### 5. Splitting the Dataset into Training and Test Sets

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train)
print(X_test)

[[1.0 0.0 0.0 35.0 58000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]]
[[0.0 1.0 0.0 50.0 83000.0]
 [0.0 0.0 1.0 27.0 48000.0]]
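
With a dataset this small, a random split can leave the test set with only one class. One option, shown here as a sketch rather than as part of the original workflow, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both subsets. The arrays below are stand-ins: `y_demo` copies the encoded target printed earlier, and `X_demo` is a placeholder feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Encoded target as printed in the label-encoding step.
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
# Placeholder features: one row per sample (values are not meaningful).
X_demo = np.arange(20).reshape(10, 2)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each subset keeps the 50/50 class balance of the full target.
print(np.bincount(y_train), np.bincount(y_test))
```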


### 6. Feature Scaling
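Why? Age and Salary are on very different scales, which affects distance-based and gradient-based models such as KNN, SVM, and Logistic Regression.<br>
Note: Fit the scaler on the training set only, after the train-test split, and reuse it to transform the test set. This prevents information from the test data leaking into training.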

In [26]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # Skip one-hot columns
X_test[:, 3:] = sc.transform(X_test[:, 3:])
print(X_train)
print(X_test)

[[1.0 0.0 0.0 -0.7529426005471072 -0.6260377781240918]
 [1.0 0.0 0.0 1.008453807952985 1.0130429500553495]
 [1.0 0.0 0.0 1.7912966561752484 1.8325833141450703]
 [0.0 1.0 0.0 -1.7314961608249362 -1.0943465576039322]
 [1.0 0.0 0.0 -0.3615211764359756 0.42765697570554906]
 [0.0 1.0 0.0 0.22561095973072184 0.05040823668012247]
 [0.0 0.0 1.0 -0.16581046438040975 -0.27480619351421154]
 [0.0 0.0 1.0 -0.013591021670525094 -1.3285009473438525]]
[[0.0 1.0 0.0 2.1827180802863797 2.3008920936249107]
 [0.0 0.0 1.0 -2.3186282969916334 -1.7968097268236927]]
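
The leakage concern that applies to scaling also applies to imputation: in the steps above, the mean imputer was fit on all rows before the split. One way to keep every learned statistic tied to the training set is to wrap the steps in scikit-learn's `Pipeline` and `ColumnTransformer`. The block below is a sketch under stated assumptions, not the notebook's method: the `df` frame is reconstructed from the values printed in this notebook, the step names are arbitrary labels, and `handle_unknown="ignore"` and `sparse_threshold=0` are added choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Dataset reconstructed from the outputs printed in this notebook.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})
X = df[["Country", "Age", "Salary"]]
y = (df["Purchased"] == "Yes").astype(int)

# Split first, so nothing is learned from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Numeric columns: impute with the mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        # Ignore countries at transform time that were not seen during fit.
        ("country", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
        ("numeric", numeric_steps, ["Age", "Salary"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Means, standard deviations, and categories come from X_train only.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Because the imputer here sees only the training rows, the resulting numbers will differ slightly from the outputs above, where the means were computed over all ten rows.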
