# Master's in Applied Artificial Intelligence
## Machine Learning Algorithms Course

Notebooks for the MLA course

by [*lufer*](mailto:lufer@ipca.pt)

---



# ML Modelling - Part VII - Improving a Model
\
**Contents**:

1.  **Model Improvement**



This notebook explores the requirements and processes needed to improve an ML model.

# Environment preparation


**Importing necessary Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

#import libraries for training
from sklearn.model_selection import train_test_split


In [None]:
import datetime
print(f"Last updated: {datetime.datetime.now()}")

**Mounting Drive**

In [None]:

from google.colab import drive

# it will ask for your Google Drive credentials
drive.mount('/content/gDrive/', force_remount=True)

# 1 - What ML Algorithms are there?

Choosing a Machine Learning algorithm depends on many factors, such as the size of the dataset, the type of data in it, the goal of the model, and others.

Scikit-learn offers a graphical flowchart that facilitates this selection.

![picture](https://scikit-learn.org/stable/_static/ml_map.png)

# 2 - Improving a Model


RandomForest is an ensemble model that usually performs very well.

**Note:**
\
The predictions of the first model represent the initial (baseline) results. Can they be improved?

\
Improvements can come from the data, the model, or its hyperparameters:

* From the data perspective:

> - Can the data be improved?
> - Could we collect more data? (Usually, the more data, the better!)

* From the model perspective:

> - Is there a better model?
> - Can the current model be improved? (exploring hyperparameters)


*Hyperparameters versus Parameters*

* *Parameters* = the patterns that the model learns from the data during training
* *Hyperparameters* = model settings that we choose and can try to adjust

\
How to tune hyperparameters:
1. Manually
2. Randomly with [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
3. Exhaustively with [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

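The difference between hyperparameters and parameters can be illustrated with a minimal sketch. It uses synthetic data from `make_classification` (an assumption for illustration only, not the heart-disease dataset), and the names `X_demo`, `y_demo` and `demo_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (not the course dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: settings we choose before training
demo_model = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
hyperparams = demo_model.get_params()
print(hyperparams["n_estimators"], hyperparams["max_depth"])

# Parameters: what the model learns from the data during fit()
demo_model.fit(X_demo, y_demo)
print(len(demo_model.estimators_))       # the fitted decision trees
print(demo_model.feature_importances_)   # one learned importance per feature
```

Attributes ending in an underscore (such as `estimators_`) only exist after `fit()`, which is a handy way to tell learned parameters from hyperparameters.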

### *Load Dataset*



In [None]:
#Importing a real-world dataset prepared for classification

filePath='/content/gDrive/MyDrive/MIA/ColabNotebooks/Datasets/'
hd = pd.read_csv(filePath+"heart-disease.csv")
pd.set_option("display.precision", 2)

In [None]:
hd.head()

### Baseline Model

In [None]:
#import model
from sklearn.ensemble import RandomForestClassifier

#importing metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


**Create Data for training - *Features* - (X) and Ground Truth Values - *Labels* (y)**

In [None]:
# Create X (all the feature columns)
X = hd.drop("target", axis=1)

# Create y (the target column)
y = hd["target"]

# Check the head of the features DataFrame
X.head()

In [None]:
# Check the head and the value counts of the labels
y.head()

In [None]:
y.value_counts()

**Suppose we model without splitting the data (don't do this!)**

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

# fit the model on the full dataset
rf.fit(X, y)

# evaluate the model on the same data it was trained on (misleadingly optimistic)
y_model = rf.predict(X)

accuracy_score(y, y_model)

**Split dataset**

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3) # hold out 30% for testing (the default would be 25%)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Check the sizes of the splits
print(f"Training data: {len(X_train)} samples, {len(y_train)} labels")
print(f"Testing data: {len(X_test)} samples, {len(y_test)} labels")

**Get the Model**

In [None]:
# Since we're working on a classification problem, we'll use a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()


**Train the Model**

In [None]:
rf.fit(X_train,y_train)

In [None]:
X_test.head()

**Predicting**

In [None]:
# Use the model to make a prediction on the test data (further evaluation)
y_preds = rf.predict(X_test)
y_preds

**Evaluate**

\
*Remember:*


Each model has a built-in `score()` method.

- The `score()` method measures how well the model learned the patterns between the features and the labels.

- The `score()` method uses a default evaluation metric that depends on the type of model.
- For classifiers, `score()` returns the mean `accuracy` on the given data.


In [None]:
# Evaluate the model on the training set.
train_acc = rf.score(X_train, y_train)
print(f"The model's accuracy on the training dataset is: {train_acc*100}%")
#perfect! why?

In [None]:
# Evaluate the model on the test set
test_acc = rf.score(X_test, y_test)
print(f"The model's accuracy on the testing dataset is: {test_acc*100:.2f}%")
#worse than the previous prediction. Why?


Other evaluation methods:
- `classification_report(y_true, y_pred)` - Builds a text report showing various classification metrics such as `precision`, `recall` and `F1-score`.
- `confusion_matrix(y_true, y_pred)` - Creates a confusion matrix to compare predictions to truth labels.
- `accuracy_score(y_true, y_pred)` - Computes the accuracy score (the default metric) for a classifier.

\
**Note:** All metrics have the following in common: they compare a model's predictions (y_pred) to truth labels (y_true).

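To make these metrics concrete, here is a small sketch that computes them by hand from the four confusion-matrix counts. The toy labels `y_true_demo` and `y_pred_demo` are invented for illustration; only plain Python is used.

```python
# Toy labels, invented for illustration (1 = positive class, 0 = negative class)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)                    # share of correct predictions
precision = tp / (tp + fp)                           # how many predicted positives are right
recall = tp / (tp + fn)                              # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```

With these toy labels every metric happens to equal 0.75, because there is exactly one false positive and one false negative.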
In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Create a classification report
print(classification_report(y_test, y_preds))

In [None]:
# Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

In [None]:
# Compute the accuracy score (same as the score() method for classifiers)
accuracy_score(y_test, y_preds), rf.score(X_test,y_test)

**Create a function to help with metrics calculation**

In [None]:
#show metrics
def evaluate_preds(y_true, y_preds):
    """
    Compares the true labels (y_true) with the predicted labels (y_preds),
    prints the metrics and returns them as a dictionary.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2),
                   "precision": round(precision, 2),
                   "recall": round(recall, 2),
                   "f1": round(f1, 2)}
    print(f"Acc: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")

    return metric_dict

In [None]:
bs_metrics = evaluate_preds(y_test, y_preds)
#bs_metrics

### Improving processes

We can try changing the value of each hyperparameter (hereafter also called simply parameter).
\
For instance, we can try changing `n_estimators` or `random_state`.


**Notes**:
- All experiments with different parameters should be cross-validated.
  - **Attention:** Beware of cross-validation for time series problems (as for time series, you don't want to mix samples from the future with samples from the past).
- All experiments will use three datasets: training, validation and testing.

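For the time series caveat, a forward-chaining split keeps every training sample earlier than every test sample. The function below is a hypothetical, simplified stand-in written for illustration; scikit-learn provides `TimeSeriesSplit` for the same purpose, and its exact fold boundaries may differ from this sketch.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * test_size
        train_idx = list(range(train_end))                        # everything seen so far
        test_idx = list(range(train_end, train_end + test_size))  # the next block in time
        yield train_idx, test_idx

ts_splits = list(expanding_window_splits(n_samples=8, n_splits=3))
for train_idx, test_idx in ts_splits:
    print(f"train={train_idx} test={test_idx}")
```

Unlike a shuffled k-fold split, the training window only grows forward, so no sample from the future is used to predict the past.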
\
As we saw above, there are three ways to tune hyperparameters:

- Manually
- Randomly with RandomizedSearchCV
- Exhaustively with GridSearchCV


Where:

- `RandomizedSearchCV` does not try out all parameter values; instead, a fixed number of parameter settings (given by `n_iter`) is sampled from the specified distributions.

- `GridSearchCV` exhaustively tries every combination of the values in the specified parameter grid.

Both implement a “fit” and a “score” method. They also implement “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

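The difference between the exhaustive and the random strategy can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn is implemented; `demo_grid`, `all_combinations` and `sampled` are hypothetical names.

```python
import itertools
import random

# A small, hypothetical hyperparameter grid
demo_grid = {"n_estimators": [100, 200],
             "max_depth": [None, 5, 10]}

# Exhaustive strategy (grid search idea): every combination, 2 * 3 = 6 in total
keys = list(demo_grid)
all_combinations = [dict(zip(keys, values))
                    for values in itertools.product(*demo_grid.values())]
print(len(all_combinations))

# Random strategy (randomized search idea): only a fixed number of combinations
rng = random.Random(42)
sampled = rng.sample(all_combinations, k=3)
for combo in sampled:
    print(combo)
```

The random strategy trades completeness for speed: it evaluates only `k` settings, however large the grid is.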

[See more in the RandomForestClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)



**Check existing Hyperparameters**

In [None]:
#create a new instance
#rf = RandomForestClassifier()
rf.get_params()
#Hyperparameters


There are several (hyper)parameters that can be explored. It is also possible to see the default value of each (hyper)parameter.
\


### **Explore Hyperparameters manually**


Hyperparameter tuning introduces a third set, a validation set.

So the process becomes:
1. Train a model on the training data.
2. (Try to) improve the model's hyperparameters on the validation set.
3. Evaluate the model on the test set.

**Creating the Validation Set**

Until now we have worked with training and test datasets: we train a model on the training set and evaluate it on the test set.

\
To do that, we split the data into train and test sets using the `train_test_split()` function.

\
Now we repeat the process to split the test set into two new sets: a validation set and a test set.


```
X = Train + Validation + Test
```

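As a rough sketch of what the three-way split does at the index level, the snippet below divides a hypothetical dataset of 100 rows into 70% train, 15% validation and 15% test using only plain Python. The size `n_samples` and the variable names are assumptions for illustration.

```python
import random

n_samples = 100  # hypothetical dataset size
indices = list(range(n_samples))
random.Random(42).shuffle(indices)  # shuffle so each set is a random sample

n_train = n_samples * 70 // 100  # 70% for training
n_valid = n_samples * 15 // 100  # 15% for validation

train_idx = indices[:n_train]
valid_idx = indices[n_train:n_train + n_valid]
test_idx = indices[n_train + n_valid:]  # the remaining 15% for testing

print(len(train_idx), len(valid_idx), len(test_idx))
```

The three index lists never overlap, which is what keeps the validation and test scores honest.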


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Set the seed
np.random.seed(42)

# Split into X (features) & y (labels)
X = hd.drop("target", axis=1)
y = hd["target"]

# Training and test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

# Create validation and test split by splitting the testing data in half (30% test -> 15% validation, 15% test)
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Make predictions in X_valid
y_preds = clf.predict(X_valid)

# Evaluate the classifier
baseline_metrics = evaluate_preds(y_valid, y_preds)
baseline_metrics

In [None]:
# Check the sizes of the splits
print(f"Training data: {len(X_train)} samples, {len(y_train)} labels")
print(f"Validation data: {len(X_valid)} samples, {len(y_valid)} labels")
print(f"Testing data: {len(X_test)} samples, {len(y_test)} labels")

**Remember**:

In this improvement process, we use:

- X_train data for training
- X_valid data for tuning the hyperparameters (validation)
- X_test data for the final model evaluation

**Make predictions using `X_valid` data**

In [None]:
#score (use clf, the model trained on the new training split; rf was trained on the earlier split)
a=clf.score(X_valid, y_valid)
print(f"Score: {a*100:.4f}%")

ac = accuracy_score(y_valid,y_preds)
ac

**Evaluate the model**

The first evaluation is with default parameters

In [None]:
# Evaluate the 1st classifier
baseline_metrics = evaluate_preds(y_valid, y_preds)


**Let's try to improve the model manually**

Change the hyperparameter `n_estimators` from 100 (the default) to 200 and see if the model improves on the validation set.

In [None]:
#new instance
rf2 = RandomForestClassifier(n_estimators=200)
rf2.fit(X_train,y_train)

In [None]:
#make predictions with a different hyperparameter
y_preds2 = rf2.predict(X_valid)


In [None]:
# Evaluate the 2nd classifier
rf2_metrics = evaluate_preds(y_valid, y_preds2)

**Without cross-validation**


In [None]:
# Try different numbers of estimators (trees)... (no cross-validation)
np.random.seed(42)
for i in range(100, 200, 10):
    print(f"->Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on validation set: {model.score(X_valid, y_valid) * 100:.2f}%")
    print("")

**With cross-validation**

\
*Remember:*

[Cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html) is a way of making sure the results you're getting are consistent across your training and test datasets, because it uses multiple versions of the training and test sets.

- We can achieve this by calling `cross_val_score(model, X, y, cv=5)`.

- `model` is the estimator to evaluate;
- `X` is the full feature set;
- `y` is the full label set;
- `cv` is the number of train and test splits.

\
`cross_val_score` will automatically create the splits from the data (in this case 5 different splits, which is known as 5-fold cross-validation).

It is usually a better indicator of model quality than the accuracy score of a single split.

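To show what "5 different splits" means, here is a simplified sketch of how k-fold cross-validation partitions the row indices. The helper `k_fold_indices` is hypothetical and written for illustration; for classifiers, scikit-learn uses a stratified variant that also balances the classes in each fold, which this sketch ignores.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print(f"train={train_idx} test={test_idx}")
```

Each row is used for testing exactly once and for training in the other four rounds; the final score is the mean over the five rounds.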
In [None]:
from sklearn.model_selection import cross_val_score

# With cross-validation
np.random.seed(42)
for i in range(100, 200, 10):
    print(f"-> Trying model with [{i}] estimators...")
    model = RandomForestClassifier(n_estimators=i)
    model.fit(X_train, y_train)

    # Measure the model score on a single train/test split
    model_score = model.score(X_test, y_test)
    print(f"Model accuracy on single test set split: {model_score * 100:.2f}%")

    # Measure the mean cross-validation score across 5 different train and test splits
    cross_val_mean = np.mean(cross_val_score(model, X, y, cv=5))
    print(f"5-fold cross-validation score: {cross_val_mean * 100:.2f}%")

    print("-------------------")

After finding the best performance (the score with `n_estimators=120` and `cv=5`), we can try adjusting another hyperparameter.

In [None]:
#try max_depth hyperparameter

model = RandomForestClassifier(
                                n_estimators=120,
                                max_depth=10)
model.fit(X_train, y_train)

# Measure the model score on a single train/test split
model_score = model.score(X_test, y_test)
print(f"Model accuracy on single test set split: {model_score * 100:.2f}%")

# Measure the mean cross-validation score across 5 different train and test splits
cross_val_mean = np.mean(cross_val_score(model, X, y, cv=5))
print(f"5-fold cross-validation score; n_estimators=120; max_depth=10: {cross_val_mean * 100:.2f}%")
#answer: worse or better?

And so on: we keep trying and evaluating each hyperparameter combination, which is very hard work!

### **Exploring *RandomizedSearchCV***

\
First, it is necessary to create a *dictionary* of parameter distributions (collections of different values for specific hyperparameters) that we'd like to analyse.

```
param_values = {"hyperparameter_name": [values_to_randomly_try]}

```

In [None]:
# Hyperparameter grid for RandomizedSearchCV
param_values = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
                "max_depth": [None, 5, 10, 20, 30],
                "max_features": ["sqrt", "log2", None],
                "min_samples_split": [2, 4, 6, 8],
                "min_samples_leaf": [1, 2, 4, 8]}

In [None]:
#check
param_values.values()

In [None]:
# Curiosity
# Count the total number of hyperparameter combinations to test
total_combinations = np.prod([len(value) for value in param_values.values()])
# 6 * 5 * 3 * 4 * 4 = 1440
# with cv=5, that would mean 1440 * 5 = 7200 model fits...a lot!
print(f"Just to have an idea...there are {total_combinations} potential combinations of hyperparameters to test.")

In [None]:
# Start the timer
import time
start_time = time.time()

from sklearn.model_selection import RandomizedSearchCV, train_test_split

np.random.seed(42)

# Split into X & y
X = hd.drop("target", axis=1)
y = hd["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# n_jobs=1 uses a single core; set n_jobs=-1 to use all available cores (if that causes errors, go back to n_jobs=1)
rfc = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
n_iter = 5 # try 5 models total
rfc = RandomizedSearchCV(estimator=rfc,
                            param_distributions=param_values,
                            n_iter=n_iter,                          # how many models to try
                            cv=5,                                   # 5-fold cross-validation
                            verbose=2)                              # print out results

# Fit the RandomizedSearchCV version of the classifier (it does cross-validation for us, so no separate validation set is needed)
rfc.fit(X_train, y_train);

# Finish the timer
end_time = time.time()
print(f"[INFO] Total time taken for {n_iter} random combinations of hyperparameters: {end_time - start_time:.2f} seconds.")

**Check all the results and get the best combination**

In [None]:
# Find the best hyperparameters found by RandomizedSearchCV
rfc.best_params_

After this process, each time we call `predict()` on `rfc` (our `RandomizedSearchCV` version of the classifier), it will use the best hyperparameters it found.

Make predictions with the best hyperparameters

In [None]:
# Make predictions with the best hyperparameters
rfc_y_preds = rfc.predict(X_test)

# Evaluate the predictions
rscv_metrics = evaluate_preds(y_test, rfc_y_preds)

### **Exploring GridSearchCV**

Since we've already tried to find good hyperparameters using `RandomizedSearchCV`, we'll create another hyperparameter grid based on the `best_params_` of `rfc`, with fewer options, and then use `GridSearchCV` to look for a better set.

In [None]:
# Create hyperparameter grid similar to rfc.best_params_
grid2 = {"n_estimators": [1000, 200],
              "max_depth": [30,40],
              "max_features": ["log2"],
              "min_samples_split": [2, 4, 6],
              "min_samples_leaf": [4]}
grid2.values()

In [None]:
#2*2*1*3*1 = 12 combinations
np.prod([len(value) for value in grid2.values()])
# with cv=5, GridSearchCV will fit 12 * 5 = 60 models!

In [None]:
# Start the timer
import time
start_time = time.time()

from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(42)

# Split into X & y
X = hd.drop("target", axis=1)
y = hd["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Set n_jobs to -1 to use all available machine cores (if this produces errors, try n_jobs=1)
clf = RandomForestClassifier(n_jobs=-1)

# Setup GridSearchCV
gs_clf = GridSearchCV(estimator=clf,
                      param_grid=grid2,
                      cv=5,         # 5-fold cross-validation
                      verbose=2)    # print out progress

# Fit the GridSearchCV version of clf
gs_clf.fit(X_train, y_train);

# Find the running time
end_time = time.time()
print(f"[INFO] Total time taken: {end_time - start_time:.2f} seconds.")

*Check the best parameters*

In [None]:
# Check the best hyperparameters found with GridSearchCV
gs_clf.best_params_

*Predicting*

In [None]:
# Make predictions with the GridSearchCV classifier
gs_y_preds = gs_clf.predict(X_test)

# Evaluate the predictions
gs_metrics = evaluate_preds(y_test, gs_y_preds)
gs_metrics

Visualizing the results better

In [None]:
df = pd.DataFrame({"Baseline": bs_metrics,
                   "By Hand": rf2_metrics,
                   "RandomizedSearchCV": rscv_metrics,
                   "GridSearchCV": gs_metrics})

In [None]:
df

### ***Comparing performances***

In [None]:
df.plot.bar(figsize=(10, 8));

There are many other techniques to explore:

- **Collecting more data** - Based on the results our models are getting now, it seems like they're very capable of finding patterns. Collecting more data may improve a model's ability to find patterns.

- **Trying a more advanced model** - More advanced ensemble methods can be explored (XGBoost, CatBoost, etc.).

### Preserve the model

The GridSearchCV model (`gs_clf`) looks to be the best model. Let's preserve it, so we can share it or continue working on it later, by exporting it to a file.

Let's serialize the model to a file using Python's [pickle](https://docs.python.org/3/library/pickle.html) module.

In [None]:
#Preserve the model
import pickle

# Save an existing model to file
fileName = "gs_random_forest_model_1.pkl" # .pkl extension stands for "pickle"
#gs_clf was the last model
with open(fileName, "wb") as f:  # the with-block closes the file automatically
    pickle.dump(gs_clf, f)

### Load the model

In [None]:
# Load a saved model
with open(fileName, "rb") as f:
    loaded_model = pickle.load(f)

Use the loaded model:

In [None]:
# Make predictions and evaluate the loaded model
new_y_preds = loaded_model.predict(X_test)
new_model_metrics = evaluate_preds(y_test, new_y_preds)
new_model_metrics

In [None]:
df = pd.DataFrame({"Original GS":gs_metrics,
                  "Loaded One":new_model_metrics})
df

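The save and load steps can be checked in isolation with a small round-trip sketch. It uses a plain dictionary as a stand-in for a fitted model (an assumption for illustration; any picklable Python object behaves the same way) and a temporary directory, so nothing is left on disk.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable object works the same way
demo_object = {"n_estimators": 200, "max_depth": 30}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")

    # Serialize to file; the with-block closes the file automatically
    with open(demo_path, "wb") as f:
        pickle.dump(demo_object, f)

    # Deserialize from the same file
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)

print(restored == demo_object)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.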
In [None]:
#End!