# **Introduction to Scikit-learn**

## What is Scikit-learn?
- Scikit-learn is one of the most popular Python libraries for machine learning and data mining
- It provides tools for building, training, and testing machine learning models
- It has a simple and consistent API that makes models easy to implement
- Supports both supervised (e.g., classification, regression) and unsupervised (e.g., clustering, dimensionality reduction) learning algorithms
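
To illustrate the consistent API, here is a minimal sketch on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, `kmeans` and `cluster_labels` are assumptions made for this example and are not part of the heart disease walkthrough.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration (not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Supervised estimators share the same fit / predict / score methods
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    model.fit(X_demo, y_demo)
    predictions = model.predict(X_demo)
    print(type(model).__name__, predictions.shape, round(model.score(X_demo, y_demo), 3))

# Unsupervised estimators are fitted without labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
cluster_labels = kmeans.predict(X_demo)
print("KMeans cluster labels:", sorted(set(cluster_labels.tolist())))
```

Swapping one estimator for another usually means changing only the line that creates it.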

## What is Data Mining?
- Data mining is the process of discovering patterns, trends, and useful information from large datasets using statistical, mathematical, and machine learning techniques

## Why Scikit-learn?
- It integrates well with other Python libraries like NumPy and Pandas
- It has many built-in machine learning models
- Offers tools for model evaluation, including cross-validation and performance metrics
- Provides functions for data preprocessing, such as scaling, normalization, and encoding
- Scikit-learn is open-source and widely used in academia and industry
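
As a rough sketch of how preprocessing, cross-validation and metrics fit together, the example below scales synthetic features and scores a model with 5-fold cross-validation. The dataset, the choice of `LogisticRegression` and all variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Inside a pipeline the scaler is re-fitted on the training part of each fold,
# so the validation part does not influence the scaling
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", cv_scores.round(3))
print("Mean accuracy:", round(cv_scores.mean(), 3))
```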

## Scikit-learn Workflow
![workflow image](assets/scikit-learn-workflow.webp)

<br/>

---

<br/>

In [1]:
# Import necessary basic libraries
import pandas as pd

<br/>

### Import Data

In [2]:
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.head()

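| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |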


In [3]:
heart_disease.shape

(303, 14)

<br/>

### Defining features & labels

In [4]:
# Create X (feature matrix)
X = heart_disease.drop("target", axis=1)

# Create y (label vector)
y = heart_disease["target"]
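
The same split can be tried on a tiny frame to see what each object looks like. The three-row `toy` DataFrame below is a hypothetical stand-in that reuses a few values from the rows printed earlier; `X_toy` and `y_toy` are names made up for this sketch.

```python
import pandas as pd

# Hypothetical three-row frame that mimics a few columns of the dataset
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "target": [1, 1, 1],
})

X_toy = toy.drop("target", axis=1)  # returns a new DataFrame; `toy` is unchanged
y_toy = toy["target"]               # a 1-D Series of labels

print(X_toy.columns.tolist())    # ['age', 'chol']
print(X_toy.shape, y_toy.shape)  # (3, 2) (3,)
```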

<br/>

### Choose the right model & hyperparameters

Heart disease prediction is a classification problem, so let's use a random forest model (`RandomForestClassifier`).

In [5]:
from sklearn.ensemble import RandomForestClassifier

In [6]:
# Instantiate the classifier (clf is the model instance)
clf = RandomForestClassifier()

# Keep the default hyperparameters and inspect them
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
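
Any of the hyperparameters listed above can be overridden. The sketch below shows two ways to do it; the specific values (`n_estimators=200`, `max_depth=5`, `min_samples_leaf=2`) and the name `custom_clf` are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters can be set when the estimator is created ...
custom_clf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)

# ... or changed later with set_params()
custom_clf.set_params(min_samples_leaf=2)

params = custom_clf.get_params()
print(params["n_estimators"], params["max_depth"], params["min_samples_leaf"])  # 200 5 2
```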

<br/>

In [7]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
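
The split above is random, so the rows in each part change on every run. A minimal sketch of a repeatable split follows, assuming synthetic data with the same number of rows (303) as the heart disease dataset; `random_state=42` is an arbitrary choice, and `stratify` is an optional extra that the notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the heart disease data
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixes the shuffle so the split is repeatable
    stratify=y_demo,   # keeps the class ratio similar in both parts
)
print(X_tr.shape, X_te.shape)  # (242, 13) (61, 13)
```

With 303 rows and `test_size=0.2`, the test set holds 61 rows, which matches the `support` total in the classification report further down.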

In [8]:
X_train.head(3)

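| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 297 | 59 | 1 | 0 | 164 | 176 | 1 | 0 | 90 | 0 | 1.0 | 1 | 2 | 1 |
| 232 | 55 | 1 | 0 | 160 | 289 | 0 | 0 | 145 | 1 | 0.8 | 1 | 1 | 3 |
| 261 | 52 | 1 | 0 | 112 | 230 | 0 | 1 | 160 | 0 | 0.0 | 2 | 1 | 2 |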


In [9]:
y_train.head(3)

297    0
232    0
261    0
Name: target, dtype: int64

<br/>

In [10]:
# Train the model
clf.fit(X_train, y_train)

In [11]:
# Make predictions
y_preds = clf.predict(X_test)
y_preds

array([0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1], dtype=int64)
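
Besides hard class labels, a fitted `RandomForestClassifier` can also report class probabilities through `predict_proba`. The sketch below uses synthetic data and a separate model named `demo_clf` (both assumptions for illustration), so it does not reproduce the predictions printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

hard_labels = demo_clf.predict(X_te[:5])          # one class label per row
probabilities = demo_clf.predict_proba(X_te[:5])  # one column per class, rows sum to 1

print(hard_labels)
print(probabilities.round(2))
```

Each predicted label corresponds to the column with the highest probability in that row.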

<br/>

### Evaluate the model performance

In [12]:
# Evaluate the model on the test data
clf.score(X_test, y_test)

0.8524590163934426

Accuracy = 85.25%

<br/>

In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [14]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.83      0.79      0.81        24
           1       0.87      0.89      0.88        37

    accuracy                           0.85        61
   macro avg       0.85      0.84      0.84        61
weighted avg       0.85      0.85      0.85        61



In [15]:
confusion_matrix(y_test, y_preds)

array([[19,  5],
       [ 4, 33]], dtype=int64)

In [16]:
accuracy_score(y_test, y_preds)

0.8524590163934426
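
The numbers in the report can be checked by hand from the confusion matrix. The sketch below recomputes accuracy and the class 1 metrics from the four counts shown above; the variable names are chosen for this example.

```python
# Confusion matrix from the output above: rows are true classes, columns are predicted classes
tn, fp = 19, 5   # true class 0: predicted 0, predicted 1
fn, tp = 4, 33   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:  {accuracy:.4f}")     # 0.8525
print(f"precision: {precision_1:.2f}")  # 0.87
print(f"recall:    {recall_1:.2f}")     # 0.89
print(f"f1-score:  {f1_1:.2f}")         # 0.88
```

These values agree with the class 1 row of the classification report and with `accuracy_score`.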

<br/>

### Improve the model

In [17]:
# Try different numbers of estimators (n_estimators)
for i in range(10, 101, 5):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 86.89%

Trying model with 15 estimators...
Model accuracy on test set: 83.61%

Trying model with 20 estimators...
Model accuracy on test set: 81.97%

Trying model with 25 estimators...
Model accuracy on test set: 86.89%

Trying model with 30 estimators...
Model accuracy on test set: 90.16%

Trying model with 35 estimators...
Model accuracy on test set: 86.89%

Trying model with 40 estimators...
Model accuracy on test set: 86.89%

Trying model with 45 estimators...
Model accuracy on test set: 85.25%

Trying model with 50 estimators...
Model accuracy on test set: 86.89%

Trying model with 55 estimators...
Model accuracy on test set: 88.52%

Trying model with 60 estimators...
Model accuracy on test set: 83.61%

Trying model with 65 estimators...
Model accuracy on test set: 85.25%

Trying model with 70 estimators...
Model accuracy on test set: 83.61%

Trying model with 75 estimators...
Model accuracy on test set: 85.25%

...
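
The test accuracies above move between 81.97% and 90.16% without a clear trend. With only 61 test rows, one prediction is worth about 1.6 percentage points, and no `random_state` is set, so part of this variation is probably noise. One possible way to get a steadier comparison is cross-validation with a fixed seed, sketched below on synthetic data; the dataset, the three candidate values and the names `cv_results` and `best_n` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in, used only for illustration
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

cv_results = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)  # accuracy on 5 folds
    cv_results[n] = scores.mean()
    print(f"{n:>3} estimators: mean CV accuracy {scores.mean() * 100:.2f}%")

best_n = max(cv_results, key=cv_results.get)
print("Best n_estimators by mean CV accuracy:", best_n)
```

Averaging over several folds makes each score depend less on which rows happened to land in the test set.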

<br/>

### Save and load the model using Pickle

#### What is Pickle?

- Pickle is a Python module for serializing and deserializing Python objects: it lets us save them to a file or transfer them over a network, and later load them back in their original state
- In the context of ML, Pickle lets us save trained scikit-learn models to a file, so we can reload and use them later without retraining

In [18]:
import pickle

# Save the trained random forest model (clf) to "random_forest_model_1.pkl"
# The with-statement closes the file once the model has been written
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

> The 'wb' mode in the open() function stands for "write binary". It opens the file in binary mode for writing, which is necessary when saving objects like models with pickle.

In [19]:
# Load the saved model; the with-statement closes the file afterwards
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.8524590163934426

> The 'rb' mode in the open() function stands for "read binary". It opens the file in binary mode for reading, which is necessary when loading objects like models with pickle.
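
A save/load round trip can be verified by comparing predictions before and after. The sketch below does this with a small model trained on synthetic data and a hypothetical file name (`demo_model.pkl`) inside a temporary directory, so nothing is left on disk. Keep in mind that loading a pickle can execute arbitrary code, so only load files from sources you trust.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small model, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    # Each with-statement closes its file even if an error occurs
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    with open(model_path, "rb") as f:
        restored = pickle.load(f)

same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Restored model gives identical predictions:", same_predictions)
```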