# Model Making

### Description

This notebook file must be run after taking the output from datasetPreparation.ipynb

Training of the model will be done ONLY. Manipulation of the dataset must NOT be done here. Only importing of dataset is allowed. 

Train Test Splits will be done on datasetPrepation.ipynb

This notebook will output the top performing ZERO-SHOT models. No hyperparameter tuning will be done in this notebook.

Technique used will be the following:

In [139]:
import pandas as pd
import numpy as np
import time

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import lightgbm as lgb
from xgboost import XGBClassifier

import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.metrics import classification_report, accuracy_score

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

import joblib
import os

### Importing of dataset: train, validation, and test sets

In [83]:
X_train = pd.read_csv('data/cleaned/X_train.csv')
X_test = pd.read_csv('data/cleaned/X_test.csv')
y_train = pd.read_csv('data/cleaned/y_train.csv')
y_test = pd.read_csv('data/cleaned/y_test.csv')

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

zero_shot_performances = []

X_train shape: (39818, 6)
X_test shape: (17065, 6)
y_train shape: (39818, 1)
y_test shape: (17065, 1)


In [84]:
# Reshape y_train and y_test to be 1D arrays
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

###  **UNCOMMENT TO IMPORT ORIGINAL, CLEANED, AND UNDERSAMPLED**

In [85]:
# original_X_train = pd.read_csv('data/original/X_train.csv')
# cleaned_X_train = pd.read_csv('data/cleaned/X_train.csv')
# undersampled_X_train = pd.read_csv('data/undersampled/X_train.csv')

# original_X_test = pd.read_csv('data/original/X_test.csv')
# cleaned_X_test = pd.read_csv('data/cleaned/X_test.csv')
# undersampled_X_test = pd.read_csv('data/undersampled/X_test.csv')

# original_y_train = pd.read_csv('data/original/y_train.csv')
# cleaned_y_train = pd.read_csv('data/cleaned/y_train.csv')
# undersampled_y_train = pd.read_csv('data/undersampled/y_train.csv')

# original_y_test = pd.read_csv('data/original/y_test.csv')
# cleaned_y_test = pd.read_csv('data/cleaned/y_test.csv')
# undersampled_y_test = pd.read_csv('data/undersampled/y_test.csv')

### Functions

In [86]:
def get_performance_report(model, X_test, y_test):
    y_pred = model.predict(X_test)
    
    return classification_report(y_test, y_pred)

In [87]:
def load_model(filename, X_test, y_test):
    if os.path.exists(filename):
        print("Model Found: Loading...")
        model = joblib.load(filename)
        performance = get_performance_report(model, X_test, y_test)
        print(performance)

        return model, performance, True
    
    else:
        print(f"Model {filename} not found")

        return None, None, False

In [88]:
def print_classification_report(report):
    # Define the headers
    headers = ["precision", "recall", "f1-score", "support"]
    
    # Print the header
    print(f"{'':>15} {'precision':>10} {'recall':>10} {'f1-score':>10} {'support':>10}")
    
    # Print the per-class metrics
    for class_name, metrics in report.items():
        if class_name in ['accuracy', 'macro avg', 'weighted avg']:
            continue  # Skip overall metrics for now
        print(f"{class_name:>15} {metrics['precision']:>10.2f} {metrics['recall']:>10.2f} {metrics['f1-score']:>10.2f} {metrics['support']:>10}")
    
    # Print the overall accuracy
    print(f"\n{'accuracy':>15} {' ':>10} {' ':>10} {report['accuracy']:>10.2f} {sum([report[class_name]['support'] for class_name in report if class_name not in ['accuracy', 'macro avg', 'weighted avg']]):>10}")

    # Print the macro and weighted averages
    for avg in ['macro avg', 'weighted avg']:
        metrics = report[avg]
        print(f"{avg:>15} {metrics['precision']:>10.2f} {metrics['recall']:>10.2f} {metrics['f1-score']:>10.2f} {metrics['support']:>10}")


### Training of model

The following models will be used for training on a stars classification dataset:

1. **Logistic Regression**: A linear model for binary classification, extended here for multi-class classification. It models the probability of each class using the logistic function.

2. **Support Vector Machines (SVM)**: A classification model that finds the hyperplane best separating the data into classes, effective for high-dimensional data.

3. **Gaussian Naive Bayes**: A probabilistic classifier based on Bayes' theorem with the assumption of feature independence. Suitable for high-dimensional datasets.

4. **K-Nearest Neighbors (KNN)**: An instance-based learning algorithm that classifies a sample based on the majority class among its k-nearest neighbors, effective for non-linear decision boundaries.

5. **Decision Trees**: A model that splits the data into subsets based on feature values, providing a clear visual representation of decision-making, suitable for both classification and regression.

6. **Extreme Gradient Boost (XGBoost)**: An efficient implementation of gradient-boosted decision trees designed for speed and performance, widely used for structured/tabular data.

7. **Light Gradient Boost Machine (LightGBM)**: A highly efficient gradient boosting framework optimized for speed and memory usage, making it suitable for large datasets and high-dimensional data.

8. **Random Forest:** An ensemble method that uses multiple decision trees to improve classification accuracy and control overfitting. It builds numerous decision trees during training and outputs the mode of the classes for classification tasks.

9. **AdaBoost:** An ensemble method that combines multiple weak classifiers to form a strong classifier. It adjusts the weights of incorrectly classified instances so that subsequent classifiers focus more on difficult cases, improving overall accuracy.

10. **Feedforward Neural Network (FNN)**: A simple type of artificial neural network where information moves in one direction, from input to output, suitable for basic and complex classification tasks.


**train_zero_shot** gets the model and trains it one time only.

Returns: classification report of the model

In [89]:
def train_zero_shot(model, X_train, X_test, y_train, y_test):
    # Fit model once
    model.fit(X_train, y_train)

    # Predict X_test
    y_pred = model.predict(X_test)

    # Get classification report
    class_report = classification_report(y_test, y_pred, output_dict=True)

    return class_report
    

In [90]:
def train_with_tuning(model, param_grid, X_train, X_test, y_train, y_test):
    # Perform hyperparameter tuning with GridSearchCV
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=5, verbose=3, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    # Get the best model
    best_model = grid_search.best_estimator_
    
    # Train the best model on the training set
    best_model.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = best_model.predict(X_test)
    
    # Calculate the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {accuracy:.2f}")
    
    # Print the classification report
    class_report = classification_report(y_test, y_pred, output_dict=True)
    print("Classification Report:")
    print(class_report)
    
    return class_report, best_model 

##### Logistic Regression

In [91]:
model = LogisticRegression()

report = train_zero_shot(model, X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

                 precision     recall   f1-score    support
              0       0.93       0.95       0.94       5688
              1       0.95       0.93       0.94       5688
              2       0.99       1.00       1.00       5689

       accuracy                             0.96      17065
      macro avg       0.96       0.96       0.96      17065
   weighted avg       0.96       0.96       0.96      17065


##### Support Vector Machines

In [92]:
report = train_zero_shot(SVC(), X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

                 precision     recall   f1-score    support
              0       0.94       0.97       0.95       5688
              1       0.98       0.93       0.95       5688
              2       0.99       1.00       1.00       5689

       accuracy                             0.97      17065
      macro avg       0.97       0.97       0.97      17065
   weighted avg       0.97       0.97       0.97      17065


##### Naive Bayes

In [93]:
model = GaussianNB()

report = train_zero_shot(model, X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

                 precision     recall   f1-score    support
              0       0.87       0.76       0.81       5688
              1       0.80       0.89       0.84       5688
              2       0.99       0.99       0.99       5689

       accuracy                             0.88      17065
      macro avg       0.89       0.88       0.88      17065
   weighted avg       0.89       0.88       0.88      17065


In [95]:
# # Define parameters
# param_grid = {
#     'var_smoothing': np.logspace(0, -9, num=100)
# }

# filename = "models/gnb_model.joblib"

# gnb_model, gnb_performance, isLoaded = load_model(filename, X_test, y_test)

# # Train Model
# if not(isLoaded):
#     print("Training GNB model")
#     gnb_model, gnb_performance = train_with_tuning(param_grid, X_train, X_test, y_train, y_test)

#     # Export model
#     joblib.dump(gnb_model, filename)

# # Put model in array
# models.append(gnb_model)
# model_performances.append(gnb_performance)

##### K-Nearest Neighbors

In [96]:
model = KNeighborsClassifier()

report = train_zero_shot(model, X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

                 precision     recall   f1-score    support
              0       0.94       0.97       0.95       5688
              1       0.97       0.94       0.96       5688
              2       0.99       1.00       0.99       5689

       accuracy                             0.97      17065
      macro avg       0.97       0.97       0.97      17065
   weighted avg       0.97       0.97       0.97      17065


In [98]:
# param_grid = {
#     'n_neighbors': [3, 5, 7, 9, 11],
#     'weights': ['uniform', 'distance'],
#     'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
#     'p': [1, 2]
# }

# filename = "models/knn_model.joblib"

# knn_model, knn_performance, isLoaded = load_model(filename, X_test, y_test)

# # Train Model
# if not(isLoaded):
#     print("Training KNN model")
#     knn_model, knn_performance = k_nearest_neighbors(param_grid, X_train, X_test, y_train, y_test)

#     # Export model
#     joblib.dump(knn_model, filename)

# # Put model in array
# models.append(knn_model)
# model_performances.append(knn_performance)

##### Decision Trees

In [99]:
model = DecisionTreeClassifier()

report = train_zero_shot(model, X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

                 precision     recall   f1-score    support
              0       0.94       0.94       0.94       5688
              1       0.94       0.94       0.94       5688
              2       1.00       1.00       1.00       5689

       accuracy                             0.96      17065
      macro avg       0.96       0.96       0.96      17065
   weighted avg       0.96       0.96       0.96      17065


In [101]:
# # Define the parameter grid for hyperparameter tuning
# param_grid = {
#     'criterion': ['gini', 'entropy'],
#     'splitter': ['best', 'random'],
#     'max_depth': [None, 10, 20, 30, 40, 50],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 2, 4],
#     'max_features': [None, 'sqrt', 'log2']
# }

# filename = "models/dt_model.joblib"

# dt_model, dt_performance, isLoaded = load_model(filename, X_test, y_test)

# # Train Model
# if not(isLoaded):
#     print("Training DT model")
#     dt_model, dt_performance = decision_tree(param_grid, X_train, X_test, y_train, y_test)

#     # Export model
#     joblib.dump(dt_model, filename)

# # Put model in array
# models.append(dt_model)
# model_performances.append(dt_performance)

##### XGBoost

In [102]:
model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

report = train_zero_shot(model, X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

                 precision     recall   f1-score    support
              0       0.95       0.97       0.96       5688
              1       0.97       0.95       0.96       5688
              2       1.00       1.00       1.00       5689

       accuracy                             0.97      17065
      macro avg       0.97       0.97       0.97      17065
   weighted avg       0.97       0.97       0.97      17065


In [103]:
# XGB Model param_grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2],
    'reg_alpha': [0, 0.01, 0.1],
    'reg_lambda': [1, 1.5, 2]
}

##### LightGBM

In [104]:
model = lgb.LGBMClassifier(random_state=42)

report = train_zero_shot(model, X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000492 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1530
[LightGBM] [Info] Number of data points in the train set: 39818, number of used features: 6
[LightGBM] [Info] Start training from score -1.098587
[LightGBM] [Info] Start training from score -1.098587
[LightGBM] [Info] Start training from score -1.098663
                 precision     recall   f1-score    support
              0       0.95       0.97       0.96       5688
              1       0.97       0.95       0.96       5688
              2       0.99       1.00       1.00       5689

       accuracy                             0.97      17065
      macro avg       0.97       0.97       0.97      17065
   weighted avg       0.97       0.97       0.97      17065


In [105]:
# LGBM Model param_grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 50, 100],
    'max_depth': [-1, 10, 20, 30],
    'min_data_in_leaf': [20, 50, 100],
    'feature_fraction': [0.6, 0.8, 1.0],
    'bagging_fraction': [0.6, 0.8, 1.0],
    'bagging_freq': [0, 5, 10],
    'lambda_l1': [0, 0.01, 0.1],
    'lambda_l2': [0, 0.01, 0.1]
}

##### Random Forest

In [106]:
model = RandomForestClassifier()

report = train_zero_shot(model, X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

                 precision     recall   f1-score    support
              0       0.95       0.97       0.96       5688
              1       0.98       0.95       0.96       5688
              2       1.00       1.00       1.00       5689

       accuracy                             0.97      17065
      macro avg       0.97       0.97       0.97      17065
   weighted avg       0.97       0.97       0.97      17065


##### AdaBoost

In [107]:
model = AdaBoostClassifier()

report = train_zero_shot(model, X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

print_classification_report(report)

                 precision     recall   f1-score    support
              0       0.46       0.81       0.59       5688
              1       0.33       0.09       0.14       5688
              2       0.99       0.97       0.98       5689

       accuracy                             0.62      17065
      macro avg       0.59       0.62       0.57      17065
   weighted avg       0.59       0.62       0.57      17065


##### Neural Network

In [111]:
def train_neural_network(X_train, X_test, y_train, y_test):
    # Determine the number of classes
    num_classes = 3
    
    # Convert labels to one-hot encoding
    y_train_one_hot = tf.keras.utils.to_categorical(y_train, num_classes)
    
    # Define the neural network model
    model = tf.keras.models.Sequential()
    model.add(layers.InputLayer(input_shape=(X_train.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(32, activation='relu'))
    model.add(layers.Dense(num_classes, activation='softmax'))

    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    # Train the model
    model.fit(X_train, y_train_one_hot, epochs=1, batch_size=32, validation_split=0.1)

    # Make predictions on the test set
    y_pred_probs = model.predict(X_test)
    y_pred = y_pred_probs.argmax(axis=1)

    # Calculate the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {accuracy:.2f}")

    # Get the classification report
    report = classification_report(y_test, y_pred, output_dict=True)
    return report

In [112]:
report = train_neural_network(X_train, X_test, y_train, y_test)

zero_shot_performances.append(report)

Model Accuracy: 0.96


### Compare all zero shot performances

In [131]:
model_titles = [
    'Logistic Regression',
    'Support Vector Machines',
    'Gaussian Naive Bayes',
    'K-Nearest Neighbors',
    'Decision Trees',
    'Extreme Gradient Boost',
    'Light Gradient Boost Machine',
    'Random Forest',
    'AdaBoost',
    'Feedforward Neural Network'
]

for title, report in zip(model_titles, zero_shot_performances):
    print(title)
    print_classification_report(report)
    print('-----------------------------------------------------------')

Logistic Regression
                 precision     recall   f1-score    support
              0       0.93       0.95       0.94       5688
              1       0.95       0.93       0.94       5688
              2       0.99       1.00       1.00       5689

       accuracy                             0.96      17065
      macro avg       0.96       0.96       0.96      17065
   weighted avg       0.96       0.96       0.96      17065
-----------------------------------------------------------
Support Vector Machines
                 precision     recall   f1-score    support
              0       0.94       0.97       0.95       5688
              1       0.98       0.93       0.95       5688
              2       0.99       1.00       1.00       5689

       accuracy                             0.97      17065
      macro avg       0.97       0.97       0.97      17065
   weighted avg       0.97       0.97       0.97      17065
------------------------------------------------------

In [138]:
model_accuracies = pd.DataFrame(columns=('Model', 'Accuracy'))

for title, report in zip(model_titles, zero_shot_performances):
    model_accuracies = pd.concat([model_accuracies, pd.DataFrame({'Model': [title], 'Accuracy': [report['accuracy']]})], ignore_index=True)


model_accuracies.sort_values('Accuracy', ascending=False)

Unnamed: 0,Model,Accuracy
7,Random Forest,0.97363
5,Extreme Gradient Boost,0.973513
6,Light Gradient Boost Machine,0.972986
3,K-Nearest Neighbors,0.968649
1,Support Vector Machines,0.967184
9,Feedforward Neural Network,0.963551
4,Decision Trees,0.959742
0,Logistic Regression,0.959156
2,Gaussian Naive Bayes,0.883211
8,AdaBoost,0.622678


### Hyperparameter Tuning

We now get the top five models and tune each of them.

In [143]:
top_models_zero_shot = [RandomForestClassifier(), XGBClassifier(), lgb.LGBMClassifier(), KNeighborsClassifier(), SVC()]
top_model_titles = [
    'Random Forest',
    'Extreme Gradient Boost',
    'Light Gradient Boost Machine',
    'K-Nearest Neighbors',
    'Support Vector Machines',
]

hyper_parameters = [
    {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 6, 9],
        'learning_rate': [0.01, 0.1, 0.2],
        'subsample': [0.6, 0.8, 1.0]
    },
    {
        'num_leaves': [31, 50, 100],
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [100, 200, 300]
    },
    {
        'n_neighbors': [3, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan']
    },
    {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf', 'poly'],
        'gamma': ['scale', 'auto']
    }
]

models_after_hyperparameter_tuning = []

tuned_model_data = pd.DataFrame(columns=['Model', 'Accuracy', "Best Hyperparameters", "Time To Train"])

reports_after_tuning = []

In [145]:
for title, model, param_grid in zip(top_model_titles, top_models_zero_shot, hyper_parameters):
    print(f"Training {title}")

    # Get start time
    start_time = time.time()

    # Start hyperparameter tuning
    report, best_model = train_with_tuning(model, param_grid, X_train, X_test, y_train, y_test)

    # Get end time
    end_time = time.time()

    # Get elapsed time
    elapsed_time = end_time - start_time
    print(f"Total time taken to train {title}: {elapsed_time:.2f} seconds")

    # Get best hyperparameters of the model
    hyperparameters = model.get_params()
    print(f"Best Hyperparameters: {hyperparameters}")

    # Append report data
    reports_after_tuning.append(report)
    print_classification_report(report)

    # Append Dataframe data
    tuned_model_data = pd.concat([tuned_model_data, pd.DataFrame({'Model': [title], 
                                                                  'Accuracy': [report['accuracy']], 
                                                                  "Best Hyperparameters": [hyperparameters], 
                                                                  "Time To Train": [elapsed_time]
                                                                  })], ignore_index=True)

    # Save models just in case
    models_after_hyperparameter_tuning.append(best_model)
    
    print("\n----------\n")

Training Random Forest
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Model Accuracy: 0.97
Classification Report:
{'0': {'precision': 0.9506003430531732, 'recall': 0.9743319268635724, 'f1-score': 0.9623198471956937, 'support': 5688}, '1': {'precision': 0.9761388286334056, 'recall': 0.9493670886075949, 'f1-score': 0.9625668449197861, 'support': 5688}, '2': {'precision': 0.9975451516745573, 'recall': 1.0, 'f1-score': 0.9987710674157303, 'support': 5689}, 'accuracy': 0.97456782888954, 'macro avg': {'precision': 0.9747614411203788, 'recall': 0.9745663384903892, 'f1-score': 0.9745525865104034, 'support': 17065}, 'weighted avg': {'precision': 0.9747627762338013, 'recall': 0.97456782888954, 'f1-score': 0.9745540057006117, 'support': 17065}}
Total time taken to train Random Forest: 697.62 seconds
Best Hyperparameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples':

In [146]:
tuned_model_data

Unnamed: 0,Model,Accuracy,Best Hyperparameters,Time To Train
0,Random Forest,0.974568,"{'bootstrap': True, 'ccp_alpha': 0.0, 'class_w...",697.622665
1,Extreme Gradient Boost,0.974509,"{'objective': 'binary:logistic', 'base_score':...",117.401529
2,Light Gradient Boost Machine,0.973044,"{'boosting_type': 'gbdt', 'class_weight': None...",60.905026
3,K-Nearest Neighbors,0.968473,"{'algorithm': 'auto', 'leaf_size': 30, 'metric...",3.198234
4,Support Vector Machines,0.968473,"{'C': 1.0, 'break_ties': False, 'cache_size': ...",76.192002
