<img src="estratek_logo.jpg" style="width:543px;height:190px"/>

<img src="flavor_bean.gif" style="width:474px;height:316px"/>

# Chocolate Rating Predictor

This project compares several tuned supervised machine learning models for predicting a chocolate bar's rating based on several factors.

#### Acknowledgements
These ratings were compiled by Brady Brelinski, Founding Member of the Manhattan Chocolate Society. For up-to-date information, as well as additional content (including interviews with craft chocolate makers), please see his website: Flavors of Cacao

#### Inspiration
Where are the best cocoa beans grown?
Which countries produce the highest-rated bars?
What’s the relationship between cocoa solids percentage and rating?



####  Data Source:
    - URL: https://flavorsofcacao.com/chocolate_database.html

    - Number of Instances: 1795.
    - Number of Attributes: 11 + output attribute
    - Attribute information:
        - Input variables:
             1 - Company (Maker-if known): Name of the company manufacturing the bar.
             2 - Specific Bean Origin or Bar Name: The specific geo-region of origin for the bar.
             3 - REF: A value linked to when the review was entered in the database. Higher = more recent.
             4 - ReviewDate: Date of publication of the review.
             5 - CocoaPercent: Cocoa percentage (darkness) of the chocolate bar being reviewed.
             6 - CompanyLocation: Manufacturer base country.
        - Output variable:
             7 - Rating: Expert rating for the bar.


#### ML Models compared:
    - XGBoost
    - Deep Learning



## I. Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
%matplotlib inline

## II. Get the Data

##### About this dataset

Chocolate is one of the most popular candies in the world. Each year, residents of the United States collectively eat more than 2.8 billion pounds. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown.

**Flavors of Cacao Rating System:**

    5= Elite (Transcending beyond the ordinary limits)
    4= Premium (Superior flavor development, character and style)
    3= Satisfactory(3.0) to praiseworthy(3.75) (well made with special qualities)
    2= Disappointing (Passable but contains at least one significant flaw)
    1= Unpleasant (mostly unpalatable)
Each chocolate is evaluated from a combination of both objective qualities and subjective interpretation. A rating here only represents an experience with one bar from one batch. Batch numbers, vintages and review dates are included in the database when known.

The database is narrowly focused on plain dark chocolate with an aim of appreciating the flavors of the cacao when made into chocolate. The ratings do not reflect health benefits, social missions, or organic status.

Flavor is the most important component of the Flavors of Cacao ratings. Diversity, balance, intensity and purity of flavors are all considered. It is possible for a straightforward single-note chocolate to rate as high as a complex flavor profile that changes throughout. Genetics, terroir, post-harvest techniques, processing and storage can all be discussed when considering the flavor component.

Texture has a great impact on the overall experience, and it is also possible for texture-related issues to impact flavor. It is a good way to evaluate the maker's vision, attention to detail and level of proficiency.

Aftermelt is the experience after the chocolate has melted. Higher quality chocolate will linger and be long lasting and enjoyable. Since the aftermelt is the last impression you get from the chocolate, it receives equal importance in the overall rating.

Overall Opinion is really where the ratings reflect a subjective opinion. Ideally it is my evaluation of whether or not the components above worked together and an opinion on the flavor development, character and style. It is also here where each chocolate can usually be summarized by the most prominent impressions that you would remember about each chocolate.

In [None]:
# Read the csv file specifying the column names
chocolates = pd.read_csv('flavors_of_cacao_orig.csv', names=['company', 'origin', 'ref', 'review', 'cocoa_percent', 'company_location', 'rating', 'bean_type', 'bean_origin'])

chocobak = chocolates.copy()   # independent backup; plain assignment would only alias the same DataFrame
chocolates.head(20)

## III. Data Preprocessing and Exploratory Analysis

1. Check for null values
2. Check and handle duplicates
3. Encode categorical features
4. Remove or impute null values
5. Scale features (Standard Scaling)
6. Balance the data per rating? (Does it make sense when using linear prediction?)
7. Describe the data
8. Visualize the data
9. Data Analysis Conclusions

##### 1. Check for null values

In [None]:
chocolates.info()

##### 2. Check and handle duplicates

In [None]:
chocolates.shape

In [None]:
duplicates = chocolates[chocolates.duplicated()]
print(duplicates.count())

In [None]:
chocolates.drop_duplicates(inplace=True)

In [None]:
chocolates.shape
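
The duplicate check above can be illustrated on a toy frame. This is a minimal sketch with made-up values (the maker names and numbers are hypothetical, not taken from the dataset); it shows that `duplicated()` flags only the second and later copies and that `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Toy frame with hypothetical values; the last row repeats the first one exactly.
toy = pd.DataFrame({
    'company': ['Maker A', 'Maker B', 'Maker A'],
    'cocoa_percent': [0.70, 0.63, 0.70],
    'rating': [3.5, 4.0, 3.5],
})

dup_mask = toy.duplicated()        # True only for the second and later copies of a row
deduped = toy.drop_duplicates()    # keeps the first occurrence by default

print(dup_mask.tolist())
print(deduped.shape)
```

Comparing `shape` before and after, as the cells above do, is a quick way to confirm how many rows were dropped.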

##### 3. Encode categorical features

- The categorical features are 'company', 'origin', 'company_location', 'bean_type' and 'bean_origin'; they still need to be encoded.

In [None]:
chocolates.info()

In [None]:
chocolates.head(5)

* Let's convert all the numeric columns to numeric formats

In [None]:
# Convert 'cocoa_percent' column to numeric format
chocolates['cocoa_percent'] = chocolates['cocoa_percent'].str.rstrip('%').astype('float') / 100

# Convert 'review' column to numeric format
chocolates['review'] = chocolates['review'].astype('int')

# Convert 'ref' column to numeric format since it means some order (higher = more recent)
chocolates['ref'] = chocolates['ref'].astype('int')

# Convert 'rating' column to numeric format. It could have been a classification using discrete values in 0.25 steps,
# but this time we prefer to use a linear regression model to predict a number
chocolates['rating'] = chocolates['rating'].astype('float')

chocolates.head(5)
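
If some raw values cannot be parsed, `astype('float')` raises an error. A more forgiving variant, sketched here on hypothetical raw strings, uses `pd.to_numeric(..., errors='coerce')` so that unparseable entries become NaN and can be handled in the null-value step.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately unparseable.
raw = pd.Series(['63%', '70%', '75.5%', 'n/a'])

# Strip the percent sign, coerce anything unparseable to NaN, then scale to the 0-1 range
cocoa = pd.to_numeric(raw.str.rstrip('%'), errors='coerce') / 100

print(cocoa.tolist())
```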

In [None]:
# Show the categorical columns
cat_features = chocolates.select_dtypes(include=['object']).columns.to_list()
cat_features

### Working with Sparse Matrices (Disabled)

In [None]:
# This code shows how to use csr_matrix, a way to compress sparse matrices and still index them as usual.
'''

from scipy.sparse import csr_matrix

# Create a larger dense matrix
dense_matrix = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 2, 0, 0],
    [0, 3, 0, 0, 0],
    [0, 0, 0, 4, 0],
    [0, 0, 0, 0, 5]
])

# Display the dense matrix
print("Dense Matrix:")
print(dense_matrix)

# Convert the dense matrix to a sparse matrix (Compressed Sparse Row format - CSR)
sparse_matrix = csr_matrix(dense_matrix)

# Display the sparse matrix
print("\nSparse Matrix:")
print(sparse_matrix)


# Accessing non-zero elements of the sparse matrix
print("\nNon-zero elements of the sparse matrix:")
print(sparse_matrix.data)

# Sparsity index = share of zero entries, i.e. 1 - (non-zero count / total element count)
total_elements_dense = dense_matrix.size
non_zero_elements_dense = np.count_nonzero(dense_matrix)
sparsity_index_dense = 1 - non_zero_elements_dense / total_elements_dense

# For a csr_matrix, .size counts stored values only, so take the total from .shape
total_elements_sparse = sparse_matrix.shape[0] * sparse_matrix.shape[1]
non_zero_elements_sparse = sparse_matrix.nnz
sparsity_index_sparse = 1 - non_zero_elements_sparse / total_elements_sparse

# Display sparsity indices
print("Sparsity Index for Dense Matrix: {:.4f}".format(sparsity_index_dense))
print("Sparsity Index for Sparse Matrix: {:.4f}".format(sparsity_index_sparse))

# Example indexing for both dense and sparse matrices
row_index = 2
column_index = 1

# Indexing the dense matrix
print("\nValue in dense matrix at ({}, {}): {}".format(row_index, column_index, dense_matrix[row_index, column_index]))

# Indexing the sparse matrix
print("Value in sparse matrix at ({}, {}): {}".format(row_index, column_index, sparse_matrix[row_index, column_index]))


'''

In [None]:
# ************ CHANGE THIS once the categorical features are going to be one-hot encoded
# encoded_chocolates = pd.get_dummies(chocolates,columns=cat_features,drop_first=True)
encoded_chocolates = chocolates    # TEMPORARY
encoded_chocolates.info()



# sparse_enconded_chocolates = csr_matrix(encoded_chocolates)

# print(sparse_enconded_chocolates.data)
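
The one-hot encoding step that is still disabled above can be previewed on a toy frame. This is a minimal sketch with hypothetical values: `pd.get_dummies` with `drop_first=True` creates one indicator column per category except the first (in sorted order), which avoids a redundant column.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    'company_location': ['France', 'U.S.A.', 'France', 'Italy'],
    'cocoa_percent': [0.70, 0.72, 0.65, 0.80],
})

# One indicator column per category, minus the first category ('France')
encoded = pd.get_dummies(toy, columns=['company_location'], drop_first=True)

print(encoded.columns.tolist())
```

With high-cardinality columns such as 'company' or 'origin', this produces many mostly-zero columns, which is where a sparse representation becomes useful.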



##### 4. Remove or impute null values
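- First we check for NaN or null values.
- The missing values are in the categorical columns 'bean_type' and 'bean_origin', so we fill them with 'Unknown'. A SimpleImputer block is kept below, disabled, in case all missing values need to be imputed at once.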

In [None]:
# Replace blank values with NaN
chocolates.replace(r'^\s*$', np.nan, regex=True, inplace=True)
encoded_chocolates.replace(r'^\s*$', np.nan, regex=True, inplace=True)


In [None]:
# List all the rows with null values
chocolates[chocolates.isnull().any(axis=1)]

In [None]:
#  Count the null values
null_values = chocolates.isna().sum().sum()
null_values, len(chocolates)

In [None]:
# Imputing null categorical values on columns bean_type and bean_origin
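# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the parent frame
encoded_chocolates['bean_type'] = encoded_chocolates['bean_type'].fillna('Unknown')
encoded_chocolates['bean_origin'] = encoded_chocolates['bean_origin'].fillna('Unknown')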


In [None]:
#  Check the null values count again
null_values = chocolates.isna().sum().sum()
null_values

In [None]:
encoded_chocolates

* COMMENT: No null values remain
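
The whole null-handling sequence (blank strings to NaN, count, fill) can be sketched on a toy frame. The values below are hypothetical and only meant to show what each step does.

```python
import numpy as np
import pandas as pd

# Hypothetical values: an empty string, a whitespace-only string and a real missing value
toy = pd.DataFrame({
    'bean_type': ['Criollo', '', '   ', None],
    'rating': [3.5, 3.0, 2.75, 4.0],
})

# Blank or whitespace-only strings become NaN, so they are counted as missing
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)
n_missing = int(cleaned['bean_type'].isna().sum())

# Fill the missing categorical values with an explicit placeholder
cleaned['bean_type'] = cleaned['bean_type'].fillna('Unknown')

print(n_missing, cleaned['bean_type'].tolist())
```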

In [None]:
# Imputer block, kept disabled in case it is needed later. Not this time.
# Note: strategy='mean' only works on numeric columns.
'''
from sklearn.impute import SimpleImputer


imputer = SimpleImputer(strategy='mean')
chocolates_imputed_matrix = imputer.fit_transform(encoded_chocolates)
imputed_chocolates = pd.DataFrame(chocolates_imputed_matrix, columns=encoded_chocolates.columns)
imputed_chocolates
'''

imputed_chocolates = encoded_chocolates
imputed_chocolates

# --- UP TO HERE ----

##### 5. Scale features (Standard Scaler)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
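scaler = StandardScaler()
# Columns left out of scaling: the categorical features plus the target ('rating')
cat_feat_cols = ['company', 'origin', 'company_location','bean_type', 'bean_origin','rating']
scaled_features = scaler.fit_transform(imputed_chocolates.drop(cat_feat_cols, axis=1))

scaled_chocolate_columns = imputed_chocolates.columns.drop(cat_feat_cols)
# Keep the original row index so the scaled frame stays aligned with imputed_chocolates
scaled_chocolates = pd.DataFrame(scaled_features, columns=scaled_chocolate_columns,
                                 index=imputed_chocolates.index)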
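
Standard scaling subtracts each column's mean and divides by its standard deviation. The sketch below, on a small hypothetical array, checks that `StandardScaler` matches the manual formula when the population standard deviation (`ddof=0`, NumPy's default) is used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features: cocoa fraction and review year
X_demo = np.array([[0.60, 2010.0],
                   [0.70, 2012.0],
                   [0.80, 2014.0]])

scaled_demo = StandardScaler().fit_transform(X_demo)

# Manual z-score: (x - mean) / std, with the population standard deviation
manual_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled_demo, manual_demo))
```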

##### 6. Describe the Data
1. chocolates
2. encoded_chocolates
3. imputed_chocolates
4. scaled_chocolates (normalized)
5. X, y

In [None]:
chocolates.describe()

In [None]:
scaled_chocolates.describe()

In [None]:
# Number of chocolates per rating value

print(chocolates.rating.value_counts())

In [None]:
# Number of chocolates per bean origin
pd.set_option('display.max_rows', None)
print(chocolates.bean_origin.value_counts())

#### 6.5. VIF (Variance Inflation Factor)



VIF measures the strength of the correlation between the independent variables.
The VIF score of an independent variable represents how well that variable is explained by the other independent variables.

- VIF starts at 1 and has no upper limit
- VIF = 1, no correlation between the independent variable and the other variables
- VIF  exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others

* Removing columns with high VIF scores can help to reduce multicollinearity and improve the performance of the model.
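
To make the definition concrete: the VIF of feature i is 1 / (1 - R²), where R² comes from regressing feature i on all the other features. The sketch below computes it with plain NumPy on synthetic data (all names and values are illustrative assumptions); it is not the statsmodels implementation, only the same idea with an explicit intercept.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns (plus intercept) and use 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic features: c is almost a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X_demo = np.column_stack([a, b, c])

vifs = [vif(X_demo, i) for i in range(X_demo.shape[1])]
print([round(v, 2) for v in vifs])
```

The two nearly collinear columns get a large VIF, while the independent column stays close to 1.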


In [None]:
#'''

from statsmodels.stats.outliers_influence import variance_inflation_factor


vif = pd.DataFrame()
vif['vif'] = [variance_inflation_factor(scaled_features, i) for i in range(scaled_features.shape[1])]
vif['Features'] = scaled_chocolate_columns

# Checking the values...
vif
# '''

* No columns have a vif > 5, so we leave it as it is

#### 6.6 Apply Over Sampling Technique

Oversampling rebalances a dataset whose classes have unequal sample counts.
When data is scarce, it balances the classes by increasing the number of samples in the rare ones.

Over sampling techniques for classification problems
    1. Random Oversampling
    2. Synthetic Minority Oversampling Technique (SMOTE)
    3. Adaptive Synthetic Sampling (ADASYN)

We will use SMOTE here.

##### SMOTE (Synthetic Minority Oversampling Technique)

SMOTE uses the K-nearest neighbors algorithm to create synthetic data.
For each minority class observation, it produces synthetic examples between that observation and its nearest minority neighbors, rather than over-sampling with replacement.
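
The core interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration on hypothetical points, not the imbalanced-learn implementation: pick a minority point, pick one of its k nearest minority neighbors, and place a new point at a random position on the segment between them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def smote_like_sample(X, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbors."""
    i = rng.integers(len(X))
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]    # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                         # random position along the segment
    return X[i] + gap * (X[j] - X[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic.shape)
```

Because each synthetic point is an interpolation between two real points, it always stays inside the region already covered by the minority class.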

In [None]:
from imblearn.over_sampling import SMOTE

oversample = SMOTE(k_neighbors=4)
# transform the dataset

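X = scaled_chocolates
# SMOTE needs discrete class labels, so each 0.25-step rating is treated as a class.
# Every class must have more than k_neighbors samples for SMOTE to run.
y = imputed_chocolates['rating'].astype(str)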

X_os, y_os = oversample.fit_resample(X, y)


In [None]:
X_os.shape, y_os.shape

In [None]:
y.value_counts()

In [None]:
y_os.value_counts()

In [None]:
# Observe the data has been balanced

sns.countplot(x=y_os)
sns.countplot(x=y, fill=False)
plt.grid()

#### 6.7 Apply Under Sampling Technique (*** NOT CHOSEN ***)

Unlike oversampling, this technique balances an imbalanced dataset by reducing the size of the larger classes. It is typically used when there is a lot of data (big datasets).

Undersampling techniques for classification problems
    1. Random undersampling
    2. Near Miss Undersampling
    3. Tomek Link Undersampling

The possible advantage is a shorter run time, because the training dataset is smaller; it also helps with memory problems.
    
We will use Near Miss here.
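
For comparison, the simplest technique in the list, random undersampling, can be sketched with pandas alone. Note that this is not Near Miss (which selects majority samples by their distance to minority samples); it only shows the general effect of shrinking every class to the size of the smallest one. The values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: class '3.0' has 4 rows, '3.5' has 2, '4.0' has 1
toy = pd.DataFrame({
    'cocoa_percent': [0.60, 0.65, 0.70, 0.72, 0.75, 0.80, 0.85],
    'rating': ['3.0', '3.0', '3.0', '3.0', '3.5', '3.5', '4.0'],
})

# Size of the smallest class
n_min = int(toy['rating'].value_counts().min())

# Randomly keep n_min rows from every class
balanced = toy.groupby('rating').sample(n=n_min, random_state=0)

print(balanced['rating'].value_counts().to_dict())
```

The example also shows the drawback noted later in this notebook: with a very rare class, almost all of the data is thrown away.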


In [None]:
from imblearn.under_sampling import NearMiss

nmiss = NearMiss()

# transform the dataset

X = scaled_chocolates
# Treat each 0.25-step rating as a class label, as in the oversampling step
y = imputed_chocolates['rating'].astype(str)

X_us, y_us = nmiss.fit_resample(X, y)


In [None]:
X_us.shape, y_us.shape

In [None]:
y.value_counts()

In [None]:
y_us.value_counts()

In [None]:
# Observe the data has been balanced

sns.countplot(x=y_us)
# sns.countplot(x=y, fill=False)
plt.grid()

In [None]:
# Selection will be to use Oversampling.  Too few examples for the undersampled dataset

X = X_os
y = y_os

#### 7. Visualize the Data

In [None]:
sns.countplot(wines['type'])
plt.grid()
plt.show()

In [None]:
sns.countplot(data=wines, x=wines['quality'], hue='type', palette='PuRd')
plt.grid()
plt.show()

In [None]:
numerical_columns = chocolates.select_dtypes(include=['number'])
corr_matrix = numerical_columns.corr()

In [None]:
sns.pairplot(wines, hue='type', palette='PuRd')

In [None]:
plt.figure(figsize=(16,6))
sns.heatmap(corr_matrix, cmap='PuOr', annot=True)

### 8. Red vs. White Wine Data Analysis

In [None]:
red_corr = red_wines.corr()
plt.figure(figsize=(16,6))
sns.heatmap(red_corr, cmap='PuOr', annot=True)

* The most negatively correlated variable with quality is Volatile Acidity and Sulfur Dioxide.
* The most positively correlated variable with quality is Alcohol

In [None]:
white_corr = white_wines.corr()
plt.figure(figsize=(16,6))
sns.heatmap(white_corr, cmap='PuOr', annot=True)

* Same results as with red wines regarding correlation with quality

In [None]:
wines.hist(figsize=(10, 10), bins=60, color='darkred')
plt.tight_layout()
plt.show()

In [None]:

# Convert the filtered DataFrames to NumPy arrays before passing to histplot
red_alcohol = np.array(wines[wines['type'] == 'red']['alcohol'])
red_alcohol

In [None]:
# Convert the filtered DataFrames to NumPy arrays before passing to histplot
white_alcohol = np.array(wines[wines['type'] == 'white']['alcohol'])

white_alcohol

In [None]:

# Create the histograms
# sns.histplot(data=red_alcohol, alpha=0.4, bins='auto', kde=True, color='red')
# sns.histplot(data=white_alcohol, alpha=0.4, bins='auto', kde=True, color='gray')
# plt.show()


In [None]:
# Create the lmplot
sns.lmplot(
    x='free sulfur dioxide',
    y='total sulfur dioxide',
    # x='alcohol',
    # y='residual sugar',
    data=wines,
    hue='type',
    palette={'white': 'gray', 'red': 'darkred'},
    height=8
)

# Use more informative axis labels than are provided by default
# sns.set_axis_labels("Alcohol level (%)", "Residual Sugar")

# Add a title to the plot
plt.title("Free vs. Total Sulfur Dioxide in Wine")

# Show the plot
plt.show()

##### CONCLUSIONS

1. The data is unbalanced: the number of samples differs widely across rating values.
2. We need to apply data sampling methods for imbalanced datasets.
3. Methods:
    Oversampling:
    3.1. Random Oversampling
    3.2. Synthetic Minority Oversampling Technique (SMOTE)
    3.3. Adaptive Synthetic Sampling (ADASYN)
    Undersampling:
    3.4. Random under sampling
    3.5. Near Miss Under Sampling
    3.6. Tomek Links Under Sampling
4. We corrected data imbalance using oversampling (SMOTE) technique.


## IV. Model training class definitions

In [None]:
import datetime

class ScoreLogger:
    def __init__(self):
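        # Same columns that add() writes, so the frame layout is consistent from the start
        self.df = pd.DataFrame(columns=['Timestamp', 'Epic', 'Model', 'Score'])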

    def add(self, epic, model, score):
        # Get the now date time
        now_ts = datetime.datetime.now()
        new_row = pd.DataFrame({'Timestamp': [now_ts], 'Epic': [epic], 'Model': [model], 'Score': [score]})
        self.df = pd.concat([self.df, new_row], ignore_index=True)

    def print(self):
        if (len(self.df) == 0):
            print('Nothing to show here.')
        else:
            self.df = self.df.sort_values(by=['Score'], ascending=False)
            print(self.df.to_string())
            print('\n')
            print('Timestamp:',  self.df['Timestamp'].iloc[0])
            print('Best epic:',  self.df['Epic'].iloc[0])
            print('Best model:', self.df['Model'].iloc[0])
            print('Best score:', self.df['Score'].iloc[0])
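    def clear(self):
        self.df = pd.DataFrame(columns=['Timestamp', 'Epic', 'Model', 'Score'])

logger = ScoreLogger()
# logger.clear()

# Label for the current experiment run, used by the logger.add(...) calls below.
# The value is an assumed placeholder; set it to whatever identifies the run.
this_epic = 'baseline'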

#### Getting all the training source data: X, y, X_train, X_test, y_train, y_test

In [None]:
from sklearn.model_selection import train_test_split

# Uncomment in case you DON'T want to use oversampling or undersampling
# X = scaled_chocolates
# y = imputed_chocolates['rating']

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
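
When the split is made on data that has not been rebalanced, passing `stratify=y` keeps the class proportions the same in the training and test sets. A minimal sketch on synthetic labels (80% class 0, 20% class 1; all names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced demo data: 80 samples of class 0 and 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=101)

# Both parts keep the 80/20 class ratio
print(len(y_tr), int(y_tr.sum()), len(y_te), int(y_te.sum()))
```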


In [None]:
X_train.info()

In [None]:
y_train.describe()


## V. Model Evaluation

### 1. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=200, random_state=101)

log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))

In [None]:
# Check all LogisticRegression hyperparameters

# Get the default parameters
default_parameters = LogisticRegression().get_params()


# Print the default parameters
print('Parameter             Value')
print('-'*30)
for parameter, value in default_parameters.items():
    print(f"{parameter:20}: {value}")

In [None]:
# Define the hyperparameter grid
hyperparameters = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],            # among the solvers below, only 'saga' supports 'l1'
    'solver': ['lbfgs', 'sag', 'saga'],
    'tol': [0.0001, 0.00001],
    'max_iter': [1000, 2000, 5000],
}

In [None]:
# C A U T I O N !!!     7min process ahead

# Run the model with all the parameter combinations
grid_search = GridSearchCV(log_reg, hyperparameters, cv=5, verbose=3)
# grid_search.fit(X_train, y_train)
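
Not every solver supports every penalty: in scikit-learn, 'lbfgs' and 'sag' accept the 'l2' penalty but not 'l1', while 'saga' accepts both. A single dictionary grid therefore includes combinations that fail to fit. One way around this, sketched here as an assumption about how the grid could be organized, is to pass a list of dictionaries so that only valid combinations are generated. `ParameterGrid` is used just to count them.

```python
from sklearn.model_selection import ParameterGrid

# Settings shared by every combination
common = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'tol': [0.0001, 0.00001],
    'max_iter': [1000, 2000, 5000],
}

# Single-dictionary grid: every penalty is crossed with every solver
full_grid = {**common, 'penalty': ['l1', 'l2'], 'solver': ['lbfgs', 'sag', 'saga']}

# List-of-dictionaries grid: only solver/penalty pairs that are supported
valid_grid = [
    {**common, 'penalty': ['l2'], 'solver': ['lbfgs', 'sag', 'saga']},
    {**common, 'penalty': ['l1'], 'solver': ['saga']},
]

print(len(ParameterGrid(full_grid)), len(ParameterGrid(valid_grid)))
```

`GridSearchCV` accepts the list form directly as its parameter grid, so the search skips the invalid third of the candidates.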

In [None]:
# Best results seen in the GridSearchCV log:
# C=10, max_iter=2000, penalty=l2, solver=sag, tol=1e-05;, score=0.500 total time=   1.7s
# C=1, max_iter=1000, penalty=l1, solver=saga, tol=0.0001;, score=0.497 total time=   1.3s
# C=10, max_iter=5000, penalty=l2, solver=sag, tol=0.0001;, score=0.451 total time=   2.8s
# C=10, max_iter=5000, penalty=l2, solver=sag, tol=0.0001;, score=0.497 total time=   2.6s

# To read the best hyperparameters from a completed search instead:
# best_hyperparameters = grid_search.best_params_
# print(best_hyperparameters)
# print(grid_search.best_score_)

# Train the model with the best hyperparameters
best_hyperparameters = {'C': 0.15, 'max_iter': 5000, 'penalty': 'l1', 'solver': 'saga', 'tol': 0.000001}   # Found manually
log_reg.set_params(**best_hyperparameters)

In [None]:
log_reg.fit(X_train, y_train)

# Evaluate the accuracy of the model on the test set
y_pred = log_reg.predict(X_test)
# accuracy = np.mean(y_pred == y_test)

# print('Accuracy:', accuracy)

print ('Score', log_reg.score(X_test, y_test))


In [None]:
print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))

In [None]:
print('Logistic Regression done!')
logger.add(this_epic, 'LogisticRegression', log_reg.score(X_test, y_test))

### 2. KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

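knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
# Predict with this model so the report below does not reuse the previous model's y_pred
y_pred = knn.predict(X_test)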

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))
# log_reg.score

#### Choosing a K Value

Let's go ahead and use the elbow method to pick a good K Value:

In [None]:
error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

* We choose K = 30
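
Reading K off the test-set error curve uses the test set for model selection, which makes the final test score optimistic. An alternative, sketched here on synthetic data (all names and sizes are illustrative assumptions), is to draw the same curve from cross-validated error on the training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, random_state=0)

k_values = range(1, 21)
cv_error = []
for k in k_values:
    # Mean accuracy over 5 folds, converted to an error rate
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    cv_error.append(1 - acc)

best_k = k_values[int(np.argmin(cv_error))]
print(best_k)
```

The list `cv_error` can be plotted against `k_values` in the same way as the error-rate chart above.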

In [None]:
# Check all KNeighborsClassifier hyperparameters

# Get the default parameters
default_parameters = KNeighborsClassifier().get_params()


# Print the default parameters
print('Parameter             Value')
print('-'*30)
for parameter, value in default_parameters.items():
    print(f"{parameter:20}: {value}")
    

In [None]:
# Define the hyperparameter grid
hyperparameters = {
    # 'n_neighbors': [3, 5, 7, 9, 11],
    'n_neighbors': [30],                # Chosen as the best K according to the Elbow chart
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [30, 50, 70, 90, 110],
    'p': [1, 2, 3]
}

In [None]:
# C A U T I O N !!!     5min process ahead

# Run the model with all the parameter combinations
grid_search = GridSearchCV(knn, hyperparameters, cv=5, verbose=3)
grid_search.fit(X_train, y_train)



* Train the model with the best hyper parameters

In [None]:
# Get the best hyperparameters and the best cross-validated score
best_hyperparameters = grid_search.best_params_
print(best_hyperparameters)
print(grid_search.best_score_)

# Set the best hyperparameters on the model (it is refit in the next cell)
knn.set_params(**best_hyperparameters)


In [None]:
knn.fit(X_train, y_train)

# Evaluate the accuracy of the model on the test set
y_pred = knn.predict(X_test)
# accuracy = np.mean(y_pred == y_test)

# print('Accuracy:', accuracy)

print ('Score', knn.score(X_test, y_test))

In [None]:
print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))

In [None]:
print('K Nearest Neighbors done!')
logger.add(this_epic, 'KNN', knn.score(X_test, y_test))

### 3. Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

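dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
# Predict with this model so the report below does not reuse the previous model's y_pred
y_pred = dtree.predict(X_test)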

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))
# log_reg.score

In [None]:
# Check all DecisionTreeClassifier hyperparameters

# Get the default parameters
default_parameters = DecisionTreeClassifier().get_params()


# Print the default parameters
print('Parameter             Value')
print('-'*30)
for parameter, value in default_parameters.items():
    print(f"{parameter:20}: {value}")
    

In [None]:
# Define the hyperparameter grid
hyperparameter_grid = {
    'max_depth': [3, 5, 7, 9, 11],
    'min_samples_split': [2, 5, 10, 20, 50],
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
}

In [None]:
# C A U T I O N !!!     1min process ahead

# Run the model with all the parameter combinations
grid_search = GridSearchCV(dtree, hyperparameter_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train)



* Train the model with the best hyper parameters

In [None]:
# Get the best hyperparameters and the best cross-validated score
best_hyperparameters = grid_search.best_params_
print(best_hyperparameters)
print(grid_search.best_score_)

# Set the best hyperparameters on the model (it is refit in the next cell)
dtree.set_params(**best_hyperparameters)


In [None]:
dtree.fit(X_train, y_train)

# Evaluate the accuracy of the model on the test set
y_pred = dtree.predict(X_test)
# accuracy = np.mean(y_pred == y_test)

# print('Accuracy:', accuracy)

print ('Score', dtree.score(X_test, y_test))

In [None]:
print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))

In [None]:
print('Decision Tree done!')
logger.add(this_epic, 'DecisionTree', dtree.score(X_test, y_test))

### 4. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rforest = RandomForestClassifier()
rforest.fit(X_train, y_train)
y_pred = rforest.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))
# log_reg.score

In [None]:
# Check all RandomForestClassifier hyperparameters

# Get the default parameters
default_parameters = RandomForestClassifier().get_params()


# Print the default parameters
print('Parameter             Value')
print('-'*30)
for parameter, value in default_parameters.items():
    print(f"{parameter:20}: {value}")
    

In [None]:
# Define the hyperparameter grid
hyperparameter_grid = {
    # 'n_estimators': [100, 200, 500, 1000],   # full range; a repeated dict key would be overridden by the entry below
    'n_estimators': [1000],
    'max_depth': [3, 5, 7, 9, 11],
    'min_samples_split': [2, 5, 10, 20, 50],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
}

In [None]:
# C A U T I O N !!!     45 min process ahead

# Run the model with all the parameter combinations
grid_search = GridSearchCV(rforest, hyperparameter_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train)



* Train the model with the best hyper parameters

In [None]:
# Get the best hyperparameters and the best cross-validated score
best_hyperparameters = grid_search.best_params_
print(best_hyperparameters)
print(grid_search.best_score_)

# Set the best hyperparameters on the model (it is refit in the next cell).
# The hard-coded values below override the search result; remove this line to use grid_search.best_params_.
best_hyperparameters = {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 11, 'min_samples_split': 2, 'n_estimators': 1000}    # REMOVE!!!!
rforest.set_params(**best_hyperparameters)


In [None]:
rforest.fit(X_train, y_train)

# Evaluate the accuracy of the model on the test set
y_pred = rforest.predict(X_test)
# accuracy = np.mean(y_pred == y_test)

# print('Accuracy:', accuracy)

print ('Score', rforest.score(X_test, y_test))


In [None]:
print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))

In [None]:
# Make predictions on the training set using the RandomForestClassifier object
y_pred_train = rforest.predict(X_train)

# Calculate the training-set error. This is not the out-of-bag (OOB) error:
# these predictions are made on rows the trees saw during fitting.
train_error = np.mean(y_pred_train != y_train)

# Print the training-set error
print('Train set error:', train_error)

In [None]:
# Here we check bias vs. variance by comparing the mean error on the training set and the test set
trainset_error = train_error
testset_error = np.mean(y_test != y_pred)
print('Train set error', trainset_error)
print('Test set error', testset_error)
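
A true out-of-bag estimate scores each tree only on the rows left out of its bootstrap sample, which differs from predicting on the training set. In scikit-learn this is available by fitting the forest with `oob_score=True`. A minimal sketch on synthetic data (names and sizes are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# oob_score=True requires bootstrap sampling, which is the default
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X_demo, y_demo)

oob_error_demo = 1 - forest.oob_score_              # error on rows each tree did not see
train_error_demo = 1 - forest.score(X_demo, y_demo)  # error on rows used for fitting

print(round(oob_error_demo, 3), round(train_error_demo, 3))
```

The OOB error behaves like a validation error, so comparing it with the training error gives a bias-versus-variance check that does not touch the test set.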

In [None]:
print('Random Forest done!')
logger.add(this_epic, 'Random Forest', rforest.score(X_test, y_test))

### 5. Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC

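svm = SVC()
svm.fit(X_train, y_train)
# Predict with this model so the report below does not reuse the previous model's y_pred
y_pred = svm.predict(X_test)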

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))
# log_reg.score

In [None]:
# Check all SVM hyperparameters

# Get the default parameters
default_parameters = SVC().get_params()


# Print the default parameters
print('Parameter             Value')
print('-'*30)
for parameter, value in default_parameters.items():
    print(f"{parameter:20}: {value}")
    

In [None]:
# Define the hyperparameter grid
hyperparameter_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['auto', 'scale'],
}

In [None]:
# C A U T I O N !!!     6 min process ahead

# Run the model with all the parameter combinations
grid_search = GridSearchCV(svm, hyperparameter_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train)



* Train the model with the best hyper parameters

In [None]:
# Get the best hyperparameters and the best cross-validated score
best_hyperparameters = grid_search.best_params_
print(best_hyperparameters)
print(grid_search.best_score_)

# Set the best hyperparameters on the model (it is refit in the next cell)
svm.set_params(**best_hyperparameters)


In [None]:
svm.fit(X_train, y_train)

# Evaluate the accuracy of the model on the test set
y_pred = svm.predict(X_test)
# accuracy = np.mean(y_pred == y_test)

# print('Accuracy:', accuracy)

print ('Score',svm.score(X_test, y_test))


In [None]:
print('Confusion Matrix:') 

print(confusion_matrix(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred))
print('Classification Report:') 
print(classification_report(y_test, y_pred, zero_division=0))

In [None]:
# Make predictions on the training set using the SVC object
y_pred_train = svm.predict(X_train)

# Calculate the training-set error (an SVM has no out-of-bag samples)
train_error = np.mean(y_pred_train != y_train)

# Print the training-set error
print('Train set error:', train_error)

In [None]:
# Here we check bias vs. variance by comparing the mean error on the training set and the test set
trainset_error = train_error
testset_error = np.mean(y_test != y_pred)
print('Train set error', trainset_error)
print('Test set error', testset_error)

In [None]:
print('Support Vector Machine done!')
logger.add(this_epic, 'SVM', svm.score(X_test, y_test))

### 6. XGBoost

In [None]:
import xgboost as xgb
from xgboost import XGBClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

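# Encode the class labels: fit on the training labels only, then reuse the same mapping for the test labels
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

# 'multi:softmax' is the XGBoost objective name for multiclass classification
xgb_model = XGBClassifier(objective='multi:softmax', learning_rate = 0.1,
              max_depth = 1, n_estimators = 330)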

xgb_model.fit(X_train, y_train_encoded)
y_pred = xgb_model.predict(X_test)
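
XGBoost expects class labels encoded as consecutive integers starting at 0, and the same mapping has to be used for training and test labels. The sketch below, with hypothetical rating labels, shows why the encoder should be fitted once: refitting on the test labels yields different codes for the same classes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels (ratings treated as categories)
y_train_demo = np.array(['2.5', '3.0', '3.5', '4.0', '3.0'])
y_test_demo = np.array(['3.0', '4.0'])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(y_train_demo)   # learns the label-to-integer mapping
test_codes = encoder.transform(y_test_demo)         # reuses that mapping

# Refitting on the test labels alone produces a different, incompatible mapping
refit_codes = LabelEncoder().fit_transform(y_test_demo)

# Encoded predictions can be mapped back to the original labels
decoded = encoder.inverse_transform(test_codes)

print(test_codes.tolist(), refit_codes.tolist(), decoded.tolist())
```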

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print('Confusion Matrix:') 

# y_pred holds encoded labels, so compare against the encoded test labels
print(confusion_matrix(y_test_encoded, y_pred))
print('Classification Report:') 
print(classification_report(y_test_encoded, y_pred, zero_division=0))


In [None]:
# Check all XGBoost hyperparameters

# Get the default parameters
default_parameters = XGBClassifier().get_params()


# Print the default parameters
print('Parameter             Value')
print('-'*30)
for parameter, value in default_parameters.items():
    print(f"{parameter:20}: {value}")
    

In [None]:
# Define the hyperparameter grid
hyperparameter_grid = {
    'n_estimators': [200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
}

In [None]:

random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=hyperparameter_grid, cv=5, verbose=3)
random_search.fit(X_train, y_train_encoded)


In [None]:
# C A U T I O N !!!     6 min process ahead

# Run the model with all the parameter combinations
grid_search = GridSearchCV(xgb_model, hyperparameter_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train_encoded)
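
The two searches above differ mainly in how many candidates they evaluate. `GridSearchCV` tries every combination in the grid, while `RandomizedSearchCV` draws `n_iter` candidates (10 by default, which is what the call above uses). A rough, standard-library sketch of the resulting fit counts follows; the sampling is only an approximation of what scikit-learn does internally, and the seed is arbitrary.

```python
import random
from itertools import product

# Same grid as in the notebook cell above
demo_grid = {
    'n_estimators': [200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
}
cv_folds = 5         # cv=5 in both searches
random_n_iter = 10   # RandomizedSearchCV default when n_iter is not given

# Enumerate every combination, as a grid search would
names = list(demo_grid)
all_candidates = [dict(zip(names, values))
                  for values in product(*(demo_grid[name] for name in names))]
grid_fits = len(all_candidates) * cv_folds

# Draw a subset without replacement, as a random search over lists would
rng = random.Random(0)  # arbitrary seed, for a reproducible illustration
sampled_candidates = rng.sample(all_candidates,
                                k=min(random_n_iter, len(all_candidates)))
random_fits = len(sampled_candidates) * cv_folds

print('Grid search:', len(all_candidates), 'candidates,', grid_fits, 'fits')
print('Random search:', len(sampled_candidates), 'candidates,', random_fits, 'fits')
```

Both searches also refit the best candidate once on the full training set when `refit=True`, which is the default.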



* Train the model with the best hyperparameters

In [None]:
# Get the best hyperparameters from both searches
best_random_hyperparameters = random_search.best_params_
best_grid_hyperparameters = grid_search.best_params_
print('RandomizedSearch')
print(best_random_hyperparameters)
print(random_search.best_score_)
print('GridSearch')
print(best_grid_hyperparameters)
print(grid_search.best_score_)

# Apply the best grid-search hyperparameters to the estimator
# (change the estimator name when reusing this cell for another model)
xgb_model.set_params(**best_grid_hyperparameters)


In [None]:
xgb_model.fit(X_train, y_train_encoded)

# Evaluate the accuracy of the model on the test set
y_pred = xgb_model.predict(X_test)
# accuracy = np.mean(y_pred == y_test)

# print('Accuracy:', accuracy)

print ('Score',xgb_model.score(X_test, y_test_encoded))

In [None]:
print('Confusion Matrix:') 

print(confusion_matrix(y_test_encoded, y_pred))
sns.heatmap(confusion_matrix(y_test_encoded, y_pred))
print('Classification Report:') 
print(classification_report(y_test_encoded, y_pred, zero_division=0))

In [None]:
# Make predictions on the training set using the fitted XGBoost model
y_pred_train = xgb_model.predict(X_train)

# Calculate the training-set error rate
# (this is not a true out-of-bag estimate; the variable name is kept for the next cell)
oob_error = np.mean(y_pred_train != y_train_encoded)

# Print the training error
print('Training error:', oob_error)

In [None]:
# Check bias vs. variance by comparing the mean error on the training set and the test set
trainset_error = oob_error
testset_error = np.mean(y_test_encoded != y_pred)
print('Train set error', trainset_error)
print('Test set error', testset_error)

In [None]:
# Make predictions on the test data
y_pred_encoded = xgb_model.predict(X_test)

# Decode the class labels
y_pred = le.inverse_transform(y_pred_encoded)

In [None]:
print('XGBoost done!')
logger.add(this_epic, 'XGBoost', xgb_model.score(X_test, y_test_encoded))

In [None]:
logger.df

### 7. Final Tuning for the Winning Model: XGBoost

In [None]:
# Create a LabelEncoder object
le = LabelEncoder()

# Encode the class labels
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)  # reuse the mapping learned from y_train

xgb_model = xgb.XGBClassifier(early_stopping_rounds=10)


xgb_model.fit(X_train, y_train_encoded, eval_set=[(X_test, y_test_encoded)])
y_pred = xgb_model.predict(X_test)
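
`early_stopping_rounds=10` asks XGBoost to stop adding trees once the metric on `eval_set` has gone 10 consecutive rounds without improving. The rule can be sketched in plain Python as below. This is a simplified illustration with made-up loss values, not XGBoost's actual implementation, and `best_round_with_early_stopping` is a hypothetical helper.

```python
def best_round_with_early_stopping(losses, patience):
    """Return (best_round, last_round_run) for a list of validation losses.

    Lower loss is better. Training stops as soon as `patience` rounds
    have passed since the best round without a new best loss.
    """
    best_round = 0
    best_loss = float("inf")
    for current_round, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, len(losses) - 1


# Made-up validation losses: the best value within reach is at round 2
demo_losses = [0.90, 0.80, 0.75, 0.76, 0.77, 0.78, 0.74]

stopped = best_round_with_early_stopping(demo_losses, patience=3)
not_stopped = best_round_with_early_stopping(demo_losses, patience=10)

print('patience=3  ->', stopped)
print('patience=10 ->', not_stopped)
```

With a patience of 3 the loop stops at round 5 and never sees the later improvement at round 6, which is the usual trade-off of a small patience. Because the stopping decision looks at the evaluation set, scores measured on that same set tend to be optimistic; a separate validation split is a common alternative.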



In [None]:
# Check all XGBoost hyperparameters

# Get the default parameters
default_parameters = XGBClassifier().get_params()


# Print the default parameters
print('Parameter             Value')
print('-'*30)
for parameter, value in default_parameters.items():
    print(f"{parameter:20}: {value}")
    

In [None]:
# Define the hyperparameter grid
hyperparameter_grid = {
    'n_estimators': [300],
    'max_depth': [7],
    'min_child_weight': [1, 2],
    'early_stopping_rounds': [10, 20],
    'learning_rate': [0.3],
}

# GridSearch
# {'learning_rate': 0.3, 'max_depth': 7, 'n_estimators': 300}
# 0.7797805560613957

In [None]:
# C A U T I O N !!!     6 min process ahead

# Run the model with all the parameter combinations
grid_search = GridSearchCV(xgb_model, hyperparameter_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train_encoded, eval_set=[(X_test, y_test_encoded)])



* Train the model with the best hyperparameters

In [None]:

# Get the best hyperparameters
best_grid_hyperparameters = grid_search.best_params_

print('GridSearch')
print(best_grid_hyperparameters)
print(grid_search.best_score_)

# Train the model with the best hyperparameters   (CHANGE THE ESTIMATOR)
xgb_model.set_params(**best_grid_hyperparameters)


In [None]:
from sklearn.model_selection import validation_curve
from sklearn.model_selection import StratifiedKFold

# reproducibility
seed = 101
np.random.seed(seed)

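# Stack the encoded train and test labels into a single 1-D Series.
# Note: this only lines up with X if the rows of X are ordered as X_train followed by X_test.
y_encoded = pd.Series(np.concatenate([y_train_encoded, y_test_encoded]))
print('y_train_encoded: ', len(y_train_encoded))
print('y_test_encoded: ', len(y_test_encoded))
print('y_encoded: ', len(y_encoded))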
y_encoded.describe()

In [None]:
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
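
`StratifiedKFold` builds folds that keep roughly the same class proportions as the full label set, which matters when some ratings are rare. The sketch below shows the idea in plain Python, without the shuffling that `shuffle=True` adds. It is a simplified illustration, not scikit-learn's algorithm, and `stratified_fold_ids` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def stratified_fold_ids(labels, n_splits):
    """Assign each sample a fold id so every class is spread evenly over the folds."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    fold_ids = [None] * len(labels)
    for indices in indices_by_class.values():
        # Deal the samples of one class to the folds in round-robin order
        for position, index in enumerate(indices):
            fold_ids[index] = position % n_splits
    return fold_ids


# Made-up labels: class 'a' is twice as frequent as class 'b'
labels_demo = ['a'] * 6 + ['b'] * 3
fold_ids = stratified_fold_ids(labels_demo, n_splits=3)

# Count how many samples of each class landed in each fold
per_fold = Counter(zip(fold_ids, labels_demo))
for fold in range(3):
    print('fold', fold, '-> a:', per_fold[(fold, 'a')], 'b:', per_fold[(fold, 'b')])
```

Every fold ends up with the same 2:1 ratio as the full label list, so each validation split sees all classes.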


In [None]:
# Fixed hyperparameters; only n_estimators is varied by the validation curve
default_params = {
    # 'objective': 'binary:logistic',
    'max_depth': 7,
    'min_child_weight': 2,
    'learning_rate': 0.3,
}

n_estimators_range = np.linspace(1, 200, 10).astype('int')

train_scores, test_scores = validation_curve(
    XGBClassifier(**default_params),
    X, y_encoded,
    param_name = 'n_estimators',
    param_range = n_estimators_range,
    cv=cv,
    scoring='accuracy', 
    verbose=3
)


In [None]:
# Show the validation curve plot

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

fig = plt.figure(figsize=(10, 6), dpi=100)

plt.title("Validation Curve with XGBoost (eta = 0.3)")
plt.xlabel("number of trees")
plt.ylabel("Accuracy")
plt.ylim(0.0, 1.1)

plt.plot(n_estimators_range,
             train_scores_mean,
             label="Training score",
             color="r")

plt.plot(n_estimators_range,
             test_scores_mean,
             label="Cross-validation score",
             color="g")

plt.fill_between(n_estimators_range,
                 train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std,
                 alpha=0.2, color="r")

plt.fill_between(n_estimators_range,
                 test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std,
                 alpha=0.2, color="g")

plt.axhline(y=1, color='k', ls='dashed')

plt.legend(loc="best")
plt.show()

i = np.argmax(test_scores_mean)
print("Best cross-validation result ({0:.2f}) obtained for {1} trees".format(test_scores_mean[i], n_estimators_range[i]))

#### Manually check the predictions against the true labels

In [None]:

# Decode the training predictions so they are comparable with the original rating labels
y_pred_train_df = pd.DataFrame(le.inverse_transform(xgb_model.predict(X_train)),
                               columns=['y_pred_train'])

# Reset the row index so features, labels, and predictions line up row by row
out_train_df = pd.concat(
    [X_train.reset_index(drop=True), pd.Series(np.asarray(y_train)), y_pred_train_df],
    ignore_index=True, sort=False, axis=1)
out_train_df.columns = X_train.columns.to_list() + ['y_train'] + ['y_pred_train']
out_train_df


In [None]:

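# Decode the test predictions so they are comparable with the original rating labels
y_pred_df = pd.DataFrame(le.inverse_transform(xgb_model.predict(X_test)),
                         columns=['y_pred'])

# Reset the row index so features, labels, and predictions line up row by row
out_test_df = pd.concat(
    [X_test.reset_index(drop=True), pd.Series(np.asarray(y_test)), y_pred_df],
    ignore_index=True, sort=False, axis=1)
out_test_df.columns = X_test.columns.to_list() + ['y_test'] + ['y_pred']
out_test_df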

In [None]:
out_train_df.to_excel('out_train.xlsx', sheet_name='Training', index=False)
out_test_df.to_excel('out_test.xlsx', sheet_name='Test', index=False)

In [None]:
# Saving the model

xgb_model.save_model('xgb_model.json')