# Day 1: Data Preprocessing
Updated to current scikit-learn APIs and runnable in VS Code.

## Setup in VS Code

1) Install Python 3.x and ensure it is on PATH.
2) Create and activate a venv in the repo root:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```

3) Install dependencies in the same environment:

```powershell
python -m pip install numpy pandas scikit-learn jupyter ipykernel
```

Then select the `.venv` kernel in VS Code and run the cells.

## Step 1: Importing the libraries

In [None]:
import numpy as np
import pandas as pd


## Step 2: Importing the dataset

In [None]:
from pathlib import Path

# Search the current directory and each of its parents for datasets/Data.csv
cwd = Path.cwd()
data_path = None
for parent in [cwd] + list(cwd.parents):
    candidate = parent / "datasets" / "Data.csv"
    if candidate.exists():
        data_path = candidate
        break
if data_path is None:
    raise FileNotFoundError("Could not find datasets/Data.csv. Run the notebook from the repo root or one of its subdirectories.")

dataset = pd.read_csv(data_path)

# Features: every column except the last; target: the last column
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values
X, Y
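
To see what the feature/target split above produces without needing the CSV file, here is a minimal sketch on an in-memory frame. The column names and values are made up for illustration and are only assumed to resemble `Data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv: column names and values are assumptions.
toy_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
    "Purchased": ["No", "Yes", "No"],
})

# Same slicing pattern as the cell above: all but the last column, then the last column.
toy_features = toy_df.iloc[:, :-1].values
toy_target = toy_df.iloc[:, -1].values

print(toy_features.shape, toy_target.shape)  # (3, 3) (3,)
print(toy_df.isna().sum())  # missing values per column
```

Because the frame mixes strings and numbers, `.values` returns an object-dtype array, which is why the later steps address columns by position.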


## Step 3: Handling the missing data

In [None]:
from sklearn.impute import SimpleImputer
# Replace NaN entries in the numeric columns (positions 1 and 2) with the column mean
imputer = SimpleImputer(strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
X
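
As a quick check of what mean imputation does, the following sketch runs `SimpleImputer` on a small made-up array. The `toy_*` names and the values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric block with one missing value in each column.
toy = np.array([
    [40.0, 60000.0],
    [np.nan, 50000.0],
    [30.0, np.nan],
    [50.0, 70000.0],
])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_filled = toy_imputer.fit_transform(toy)

# statistics_ holds the per-column means computed from the non-missing entries.
print(toy_imputer.statistics_)
print(toy_filled)
```

Each missing entry is replaced by the mean of the observed values in its own column, so here the gaps become 40.0 and 60000.0.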


## Step 4: Encoding categorical data

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
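# One-hot encode the categorical column at position 0; pass the other columns through unchanged
ct = ColumnTransformer(transformers=[('country', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
# Map the target labels to integers
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)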
X, Y
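
The column layout after encoding can be hard to read from the raw array, so here is a small sketch on made-up rows. The `toy_*` names and values are assumptions, and `sparse_threshold=0` is set here only to force a dense result for printing.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up rows: a categorical column followed by a numeric column.
toy_X = np.array([
    ["France", 44.0],
    ["Spain", 27.0],
    ["Germany", 30.0],
    ["Spain", 38.0],
], dtype=object)
toy_Y = np.array(["No", "Yes", "No", "Yes"])

toy_ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
toy_encoded = toy_ct.fit_transform(toy_X)

# Categories are sorted, so the one-hot columns appear in this order.
toy_categories = toy_ct.named_transformers_["country"].categories_[0]
print(toy_categories)
print(toy_encoded)

toy_label_encoder = LabelEncoder()
toy_labels = toy_label_encoder.fit_transform(toy_Y)
print(toy_label_encoder.classes_, toy_labels)
```

The one-hot columns come first, in sorted category order, and the passed-through columns are appended after them.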


## Step 5: Splitting the dataset into a training set and a test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
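
To illustrate what `test_size` and `random_state` control, the sketch below splits a made-up ten-row array twice with the same seed. All `toy_*` and `split_*` names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, alternating labels.
toy_X = np.arange(20).reshape(10, 2)
toy_Y = np.array([0, 1] * 5)

split_a = train_test_split(toy_X, toy_Y, test_size=0.2, random_state=0)
split_b = train_test_split(toy_X, toy_Y, test_size=0.2, random_state=0)

toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = split_a
print(toy_X_train.shape, toy_X_test.shape)  # (8, 2) (2, 2)

# The same random_state yields the same shuffle, so both splits are identical.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```

Fixing `random_state` makes the split repeatable across runs, which helps when comparing results between sessions.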


## Step 6: Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
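# with_mean=False scales each feature to unit variance without centering it,
# which also keeps the scaler usable if X is a sparse matrix
sc_X = StandardScaler(with_mean=False)
# Fit on the training set only, then apply the same scaling to the test set
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)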
X_train, X_test
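
The effect of `with_mean=False` is easiest to see next to the default behaviour. The sketch below compares the two on a made-up single-feature array; the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature training data.
toy_train = np.array([[1.0], [2.0], [3.0]])

# Scaling only: divide by the standard deviation, keep the original offset.
scaled_only = StandardScaler(with_mean=False).fit_transform(toy_train)

# Default behaviour: subtract the mean, then divide by the standard deviation.
standardized = StandardScaler().fit_transform(toy_train)

print(scaled_only.ravel(), scaled_only.mean())
print(standardized.ravel(), standardized.mean())
```

Both results have unit variance, but only the default scaler produces zero-mean features. If the encoded matrix is dense, as it is in the toy data here, either variant can be used.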
