# **Scikit-Learn From Scratch**

## What is Scikit-learn?
- Scikit-learn is one of the most popular Python libraries for machine learning and data mining
- It provides tools for building, training, and testing machine learning models
- The library has a simple and consistent API, which makes models easy to implement
- It supports both supervised (e.g., classification, regression) and unsupervised (e.g., clustering, dimensionality reduction) learning algorithms

## What is Data Mining?
- Data mining is the process of discovering patterns, trends, and useful information from large datasets using statistical, mathematical, and machine learning techniques

## Why Scikit-learn?
- It integrates well with other Python libraries like NumPy and Pandas
- It has many built-in machine learning models
- Offers tools for model evaluation, including cross-validation and performance metrics
- Provides functions for data preprocessing, such as scaling, normalization, and encoding
- Scikit-learn is open-source and widely used in academia and industry

## Scikit-learn Workflow

![workflow image](assets/scikit-learn-workflow.webp)

## What do we cover?
1. Getting the data ready
2. Choosing the right estimator/algorithm/model for our problem
3. Fitting the model and using it to make predictions on our data
4. Evaluating the model
5. Improving the model
6. Saving and loading a trained model
7. Putting it all together

<br/>

---

<br/>

## Standard Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

<br/>

## 01 - Getting our data ready to be used with machine learning

There are 3 main things we have to do:
1. Split the data into features and labels (usually `X` and `y`)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
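The three steps above can be sketched on a tiny table. This is only an illustration: the `toy` DataFrame and its values are hypothetical stand-ins rather than the notebook's CSV files, and mean imputation is just one possible filling strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not loaded from the notebook's CSV files)
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota"],
    "Odometer (KM)": [35431.0, np.nan, 84714.0, 154365.0],
    "Price": [15323, 19943, 28343, 13434],
})

# 1. Split into features (X) and labels (y)
X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

# 2. Fill (impute) the missing odometer reading with the column mean
imputer = SimpleImputer(strategy="mean")
odometer = imputer.fit_transform(X_toy[["Odometer (KM)"]])

# 3. Encode the non-numerical column as one-hot columns
encoder = OneHotEncoder()
make_encoded = encoder.fit_transform(X_toy[["Make"]]).toarray()

# Combine everything into one fully numeric feature matrix
X_ready = np.hstack([make_encoded, odometer])
print(X_ready.shape)
```

After these steps `X_ready` contains only numbers and no missing values, which is the form scikit-learn estimators expect.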

In [2]:
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

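|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |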


In [3]:
# Define features (X) & labels (y)
X = heart_disease.drop("target", axis=1)
X.head()

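|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |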


In [4]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

<br/>

In [5]:
# Split data into training & test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [6]:
X.shape

(303, 13)
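Because `train_test_split` shuffles the rows, each run produces a different split unless the shuffle is fixed. The sketch below passes `random_state` to make the split repeatable; the arrays are hypothetical random data with the same shape as the heart disease features, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the heart disease features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# random_state fixes the shuffle, so the same rows land in each set on every run
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)
print(np.array_equal(X_test_a, split_b[1]))
```

With 303 rows and `test_size=0.2`, the test set gets 61 rows and the training set gets the remaining 242, matching the shapes shown above.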

> **An ML principle:** never evaluate or test a model on the same data that was used to train it
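A small experiment can illustrate why this principle matters. The sketch below assumes synthetic data from `make_regression` (not the notebook's datasets) and compares the score on the training rows with the score on held-out rows.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=25.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model_demo = RandomForestRegressor(n_estimators=50, random_state=0)
model_demo.fit(X_tr, y_tr)

# Scoring on the training rows rewards memorisation of those rows
train_score = model_demo.score(X_tr, y_tr)
# Scoring on unseen rows estimates how the model generalises
test_score = model_demo.score(X_te, y_te)

print(f"train: {train_score:.3f}  test: {test_score:.3f}")
```

The training score is typically noticeably higher than the test score, so reporting it alone would overstate how well the model handles new data.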

<br/>

### 01.1 Ensure all the data is numeric

In [7]:
car_sales = pd.read_csv('data/car-sales-extended.csv')
car_sales.head()

|   | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431 | 4 | 15323 |
| 1 | BMW | Blue | 192714 | 5 | 19943 |
| 2 | Honda | White | 84714 | 4 | 28343 |
| 3 | Toyota | White | 154365 | 4 | 13434 |
| 4 | Nissan | Blue | 181577 | 3 | 14043 |


In [8]:
len(car_sales), car_sales.shape

(1000, (1000, 5))

In [9]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object
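Instead of reading the `dtypes` listing by eye, the non-numeric columns can be picked out programmatically. This is a minimal sketch on a hypothetical two-row table (`toy_sales`) that mirrors the column names above.

```python
import pandas as pd

# Hypothetical two-row table with the same columns as car_sales
toy_sales = pd.DataFrame({
    "Make": ["Honda", "BMW"],
    "Colour": ["White", "Blue"],
    "Odometer (KM)": [35431, 192714],
    "Doors": [4, 5],
    "Price": [15323, 19943],
})

# Keep only the columns whose dtype is not numeric
non_numeric_cols = toy_sales.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

These are the columns that need encoding before a model can be fitted.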

<br/>

### 01.2 - Define features (X) and labels (y)

In [10]:
X = car_sales.drop("Price", axis=1)
X.head(3)

|   | Make | Colour | Odometer (KM) | Doors |
|---|---|---|---|---|
| 0 | Honda | White | 35431 | 4 |
| 1 | BMW | Blue | 192714 | 5 |
| 2 | Honda | White | 84714 | 4 |


In [11]:
y = car_sales["Price"]
y.head(3)

0    15323
1    19943
2    28343
Name: Price, dtype: int64

<br/>

### 01.3 - Split existing data into train & test sets

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 4), (200, 4), (800,), (200,))

<br/>

## 02 - Choose the right model & hyperparameters

In this scenario, we are trying to predict a price, which is a number, so this is a regression task. We can therefore choose scikit-learn's `RandomForestRegressor` as the model.

In [13]:
from sklearn.ensemble import RandomForestRegressor

# Build machine learning model & train it using split data
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'BMW'

> We got **ValueError: could not convert string to float: 'BMW'**. That's why we should ensure all the data is numerical before fitting a model
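The same failure can be reproduced in isolation. The sketch below uses a hypothetical four-row table and catches the exception, so the message can be inspected without stopping the notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical features that still contain a text column
toy_X = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Nissan"],
    "Doors": [4, 5, 4, 3],
})
toy_y = [15323, 19943, 13434, 14043]

error_message = None
try:
    # fit() tries to convert every feature to a float and fails on the text
    RandomForestRegressor(n_estimators=5, random_state=0).fit(toy_X, toy_y)
except ValueError as err:
    error_message = str(err)

print(error_message)
```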

<br/>

### 02.1 Data Preprocessing

#### Converting categorical data into numerical data - using Scikit-learn

In [16]:
car_sales.head()

|   | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431 | 4 | 15323 |
| 1 | BMW | Blue | 192714 | 5 | 19943 |
| 2 | Honda | White | 84714 | 4 | 28343 |
| 3 | Toyota | White | 154365 | 4 | 13434 |
| 4 | Nissan | Blue | 181577 | 3 | 14043 |


In [17]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")

# Transforming features (X)
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [18]:
transformed_X.shape

(1000, 13)

In [19]:
X.head()

|   | Make | Colour | Odometer (KM) | Doors |
|---|---|---|---|---|
| 0 | Honda | White | 35431 | 4 |
| 1 | BMW | Blue | 192714 | 5 |
| 2 | Honda | White | 84714 | 4 |
| 3 | Toyota | White | 154365 | 4 |
| 4 | Nissan | Blue | 181577 | 3 |


In [20]:
pd.DataFrame(transformed_X).head()

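|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 35431.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 192714.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 84714.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 154365.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 181577.0 |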


> Column 12 holds the `Odometer (KM)` values, passed through unchanged <br/>
> Columns 0 - 11 hold the one-hot encoded values of the `Make`, `Colour`, and `Doors` columns

In [21]:
car_sales["Doors"].value_counts()

Doors
4    856
5     79
3     65
Name: count, dtype: int64

In [22]:
car_sales["Make"].value_counts()

Make
Toyota    398
Honda     304
Nissan    198
BMW       100
Name: count, dtype: int64

In [23]:
car_sales["Colour"].value_counts()

Colour
White    407
Blue     321
Black     99
Red       94
Green     79
Name: count, dtype: int64

> There are 3 different door counts, 4 different vehicle makes, and 5 different colors: `3 + 4 + 5 = 12`. After one-hot encoding, these 12 categories are represented by columns 0 - 11 above
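The same count can be obtained without running three separate `value_counts()` calls. This is a small sketch on a hypothetical table (`toy_cars`) that was built to contain the same number of distinct values per column as the real data.

```python
import pandas as pd

# Hypothetical rows covering 4 makes, 5 colours, and 3 door counts
toy_cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Nissan", "Honda"],
    "Colour": ["White", "Blue", "Black", "Red", "Green"],
    "Doors": [4, 5, 3, 4, 4],
})

# nunique() counts the distinct values in each column
per_column = toy_cars[["Make", "Colour", "Doors"]].nunique()

# One-hot encoding creates one output column per distinct value
expected_columns = int(per_column.sum())
print(per_column.to_dict(), expected_columns)
```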

<br/>

#### Converting categorical data into numerical data - using Pandas

In [24]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies.head()

|   | Doors | Make_BMW | Make_Honda | Make_Nissan | Make_Toyota | Colour_Black | Colour_Blue | Colour_Green | Colour_Red | Colour_White |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | False | True | False | False | False | False | False | False | True |
| 1 | 5 | True | False | False | False | False | True | False | False | False |
| 2 | 4 | False | True | False | False | False | False | False | False | True |
| 3 | 4 | False | False | False | True | False | False | False | False | True |
| 4 | 3 | False | False | True | False | False | True | False | False | False |


In [25]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]]).astype(int)
dummies.head()

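|   | Doors | Make_BMW | Make_Honda | Make_Nissan | Make_Toyota | Colour_Black | Colour_Blue | Colour_Green | Colour_Red | Colour_White |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |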
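Note that `Doors` stays as a single numeric column in the output above. By default, `pd.get_dummies` only encodes text and categorical columns. If `Doors` should be treated as a category, as in the `OneHotEncoder` version, one option is to list the columns explicitly. The table below is a hypothetical three-row sample.

```python
import pandas as pd

# Hypothetical three-row sample with the same categorical columns
toy_cats = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 4],
})

# columns= forces encoding of the listed columns, including the integer Doors;
# dtype=int gives 0/1 instead of False/True
encoded = pd.get_dummies(
    toy_cats, columns=["Make", "Colour", "Doors"], dtype=int
)
print(encoded.columns.tolist())
```

Here `Doors` is replaced by one indicator column per door count (`Doors_4`, `Doors_5`).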


<br/>

In [26]:
# Let's refit the model using transformed data
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model.fit(X_train, y_train)

In [27]:
# Model score: for a regressor, score() returns R^2 (coefficient of determination), not accuracy
model.score(X_test, y_test)

0.3235867221569877
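For regressors, `score()` returns the coefficient of determination (R^2) rather than an accuracy. The sketch below checks this on synthetic data from `make_regression` (an assumption for illustration, not the car sales data) by comparing `score()` with `r2_score` from `sklearn.metrics`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

reg = RandomForestRegressor(n_estimators=50, random_state=42)
reg.fit(X_tr, y_tr)

# Both lines compute R^2 on the test set
score_via_method = reg.score(X_te, y_te)
score_via_metric = r2_score(y_te, reg.predict(X_te))

print(score_via_method, score_via_metric)
```

An R^2 of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean of the targets.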