# Classification

## Introduction

Electricity pricing is a critical aspect of energy markets, influencing both consumers and suppliers. In this project, we analyze electricity price trends in **British Columbia (BC)** and **Alberta (AB)**, Canada, using historical electricity measurements. Our objective is to develop a predictive model that estimates whether the electricity price in British Columbia will **increase (UP)** or **decrease (DOWN)**.

### Problem Statement

Given a dataset containing electricity-related metrics, we aim to predict the **bc_price_evo** variable, which indicates whether the electricity price in BC is increasing or decreasing. The dataset includes:

- **Date and Time of measurement**
- **Electricity Price and Demand** in British Columbia and Alberta
- **Electricity Transfer** between the two regions

The ultimate goal is to build an accurate machine learning model that can effectively classify price movement in BC.


### Evaluation Metric

Our model will be evaluated using the **Accuracy Score**, which measures the percentage of correct predictions.


```math
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total predictions}}
```

A higher accuracy score indicates a better-performing model.
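As a small illustration of the formula above, the sketch below computes accuracy by hand and with scikit-learn's `accuracy_score`. The label lists are hypothetical and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, for illustration only (not taken from the dataset).
y_true = ["UP", "DOWN", "UP", "UP", "DOWN"]
y_pred = ["UP", "DOWN", "DOWN", "UP", "DOWN"]

# Manual computation: correct predictions divided by total predictions.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# The same quantity computed by scikit-learn.
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Here four of the five predictions match, so both values are 0.8.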

### Submission Format

The submission should be a CSV file containing the predicted **bc_price_evo** values for each test ID. The format should be:

```csv
id,bc_price_evo
28855,UP
28856,UP
28857,DOWN
...
```
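A file in this shape can be produced with pandas. The sketch below is only an illustration: it reuses the three example rows shown above and writes to an in-memory buffer rather than to a real file.

```python
import io
import pandas as pd

# Hypothetical ids and predictions, mirroring the example rows above.
submission = pd.DataFrame(
    {"bc_price_evo": ["UP", "UP", "DOWN"]},
    index=pd.Index([28855, 28856, 28857], name="id"),
)

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
submission.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Naming the index `id` is what makes `to_csv` emit the `id` column header.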


This project will explore various machine learning techniques to improve prediction accuracy and gain insights into electricity price fluctuations.


### Libraries

For this project, we will use a few libraries that simplify the work.

In [None]:
import pandas as pd                 # pandas for the data structure manipulation 
import numpy as np                  # numpy for numerical operations
import matplotlib.pyplot as plt     # matplotlib for plotting
# scikit-learn is used for modelling and evaluation; the required modules are imported later, in each section

## Data Processing

This document outlines a step-by-step approach to handling data using Python. The process includes data loading, inspecting, handling missing values, and data normalization.

In [None]:
data_dir = '../data/classification/'
output_dir = '../output/classification/submission/'

### 1. Loading Data  
The first step is to import the necessary libraries and load the dataset into a Pandas DataFrame.

In [None]:
df_train_raw = pd.read_csv(data_dir + 'train.csv', index_col=0)
df_train_raw.head()

In [None]:
df_test_raw = pd.read_csv(data_dir + 'test.csv', index_col=0)
df_test_raw.head()

### 2. Data Inspection  

Looking at the training and test datasets, we see an identifier, 7 feature columns that provide information for the predictions, and (in the training set only) the target column:
- `id` - Unique identifier used by Kaggle

- `date` - Date at which the measurement was made, between the 15th of May 2015 and the 13th of December 2017 (normalized between 0 and 1)
- `hour` - Hour of measurement as a half hour period of time over 24 hours (values originally between 0 and 47, here normalized between 0 and 1)
- `bc_price` - Electricity price in British Columbia (normalized between 0 and 1)
- `bc_demand` - Electricity demand in British Columbia (normalized between 0 and 1)
- `ab_price` - Electricity price in Alberta (normalized between 0 and 1)
- `ab_demand` - Electricity demand in Alberta (normalized between 0 and 1)
- `transfer` - Electricity transfer scheduled between British Columbia and Alberta (normalized between 0 and 1)
- `bc_price_evo` - Is the price in British Columbia going UP or DOWN compared to the last 24 hours? This is the target variable (i.e., it is only given during training)

Before processing, it is essential to check the structure and properties of the data.

In [None]:
df_train_raw.info()
df_train_raw.describe()

In [None]:
df_test_raw.info()
df_test_raw.describe()

The data is clean: there are no missing values, and the features are already normalized, so it is ready for processing.
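The checks behind this conclusion can be made explicit. The sketch below runs them on a toy frame with hypothetical values; only the column names follow the dataset description, and the same calls would apply to `df_train_raw`.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset description.
toy = pd.DataFrame({
    "bc_price": [0.05, 0.07, 0.06],
    "bc_demand": [0.40, 0.55, 0.48],
    "bc_price_evo": ["UP", "DOWN", "UP"],
})

# Count missing values per column (all zeros means nothing to impute).
missing_per_column = toy.isna().sum()

# Check that every numeric feature lies in the normalized range [0, 1].
numeric = toy.select_dtypes(include="number")
in_unit_range = bool(((numeric >= 0) & (numeric <= 1)).all().all())

# Class balance of the target, as fractions.
class_share = toy["bc_price_evo"].value_counts(normalize=True)

print(missing_per_column, in_unit_range, class_share, sep="\n")
```

Looking at the class balance is useful because accuracy is easier to interpret when the share of the majority class is known.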

### 3. Data Preparation

For training, we need to separate the features from the labels (the `bc_price_evo` column).

In [None]:
df_train = df_train_raw.drop(columns=['bc_price_evo'])
df_train_labels = df_train_raw.loc[:, 'bc_price_evo'].copy()

df_train.head()
df_train_labels.head()

The test dataset needs no further preparation, so we can use the raw data for prediction. To keep the workflow consistent across models, we define a `save_submission` function that writes our predictions to a CSV file for submission.

In [None]:
df_test = df_test_raw.copy()

In [None]:
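def save_submission(df_test_labels, name_model):
    """Write the predicted labels, indexed by test id, to <output_dir>/<name_model>.csv."""
    submission = df_test.copy()
    submission['bc_price_evo'] = df_test_labels
    # Only the index (id) and the prediction column are written.
    submission.to_csv(f'{output_dir}{name_model}.csv', columns=['bc_price_evo'])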

## Classification Models

### 1. Logistic Regression

Logistic Regression is a statistical method used for binary classification tasks. Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that a given input belongs to a particular class. It applies the **sigmoid function** to transform linear predictions into probability values ranging from 0 to 1.  

**How It Works**:
1. Computes a weighted sum of input features.
2. Applies the **sigmoid function** to map the result to a probability.
3. Uses a decision threshold (typically 0.5) to classify data points into one of two categories.
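
The three steps above can be sketched in a few lines of NumPy. The weights, bias, and feature vector are hypothetical, and treating `UP` as the positive class is an assumption made for the illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one feature vector (illustration only).
weights = np.array([2.0, -1.0])
bias = 0.5
x = np.array([0.3, 0.8])

score = weights @ x + bias          # step 1: weighted sum = 0.6 - 0.8 + 0.5 = 0.3
probability = sigmoid(score)        # step 2: probability of the positive class
label = "UP" if probability >= 0.5 else "DOWN"   # step 3: threshold at 0.5

print(round(float(probability), 4), label)
```

A score of exactly 0 maps to a probability of 0.5, so the 0.5 threshold is equivalent to checking the sign of the weighted sum.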

Logistic Regression is available in scikit-learn as `LogisticRegression` in the `sklearn.linear_model` module, so we can use it directly as our model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

scikit-learn's `LogisticRegression` supports many hyperparameters, such as `max_iter`, `C` (inverse of regularization strength), and `random_state`, and choosing good values is not straightforward. To guide the choice, we estimate the accuracy: we split the training dataset into a sub-training set and a sub-validation set, train the model on the former, and measure its accuracy on the latter. This is only an estimate of the true accuracy, but it helps us compare hyperparameter settings.
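As a minimal sketch of the hold-out procedure described above, the loop below compares several values of `C`. It runs on synthetic data from `make_classification`, which is only a stand-in for the project dataset, so the numbers it prints are not comparable to the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (assumption).
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.20, random_state=23)

results = {}
for C in [0.01, 0.1, 1.0, 10, 15, 100]:
    model = LogisticRegression(C=C, max_iter=10000, random_state=0)
    model.fit(X_tr, y_tr)
    # Always score on the rows held out by *this* split.
    results[C] = accuracy_score(y_va, model.predict(X_va))

best_C = max(results, key=results.get)
print(results, best_C)
```

Keeping the split fixed (same `random_state`) across all settings means that differences in accuracy come from the hyperparameters rather than from a different choice of validation rows.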

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train, df_train_labels, test_size=0.20, random_state=23)
clf = LogisticRegression(max_iter=10000, random_state=0, C=15, class_weight='balanced')
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")

The preceding cell shows how we estimate the accuracy of the model with the hyperparameters `max_iter=10000, random_state=0, C=15, class_weight='balanced'`. Repeating this for other settings gives the following table of estimated accuracies (`d=` marks the default value of each parameter):

| Solver (d=lbfgs) | C (d=1.0) | Penalty (d=l2) | Class Weight (d=None) | Fit Intercept (d=True) | L1 Ratio (d=None) | Accuracy (%) |
|--------|----|---------|--------------|--------------|----------|--------------|
| default | default | default | default | default | default | 74.01 |
| liblinear | default | default | default | default | default | 73.94 |
| default | 0.01 | default | default | default | default | 62.97 |
| default | 0.1  | default | default | default | default | 66.77 |
| default | 10   | default | default | default | default | 75.05 |
| default | 100  | default | default | default | default | 74.96 |
| default | 15   | default | default | default | default | 75.12 |
| liblinear | 15 | l1 | default | default | default | 75.00 |
| saga | 15 | elasticnet | default | default | 0.5 | 75.08 |
| default | 15 | default | balanced | default | default | 75.13 |
| default | 15 | default | balanced | False | default | 71.65 |


According to the table above, the highest estimated accuracy (about 75.1%) is obtained with `max_iter=10000, random_state=0, C=15, class_weight='balanced'`, so we use these settings for our model.

In [None]:
clf = LogisticRegression(max_iter=10000, random_state=0, C=15, class_weight='balanced')
clf.fit(df_train, df_train_labels)
test_labels = clf.predict(df_test)

In [None]:
#concat test_labels to df_test index
save_submission(test_labels, 'LogisticRegression')

Our predictions are saved in a CSV file, which we can now submit to Kaggle. The resulting score is 0.7284, which is modest but a reasonable starting point.

### 2. Decision Tree

A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. It splits the dataset into subsets based on feature values, creating a tree-like structure where each internal node represents a decision based on a feature, branches correspond to decision outcomes, and leaf nodes represent the final prediction. The model aims to find the most informative splits to maximize classification accuracy.

Decision Trees are easy to interpret and can handle both numerical and categorical data. However, they are prone to overfitting, especially when the tree depth is large. Techniques such as pruning, limiting tree depth, and setting minimum samples for splits can help mitigate this issue.
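
To make the idea of an "informative split" concrete, the sketch below computes the Gini impurity, the default `criterion` of `DecisionTreeClassifier`, for a parent node and for two candidate splits. The labels are hypothetical, and the helper names `gini_impurity` and `split_impurity` are introduced only for this illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_impurity(left, right):
    """Size-weighted impurity of the two children produced by a split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_impurity(left) + (len(right) / total) * gini_impurity(right)

# Hypothetical labels (illustration only).
parent = ["UP", "UP", "UP", "DOWN", "DOWN", "DOWN"]
good_split = (["UP", "UP", "UP"], ["DOWN", "DOWN", "DOWN"])   # pure children
poor_split = (["UP", "DOWN", "UP"], ["DOWN", "UP", "DOWN"])   # mixed children

print(gini_impurity(parent), split_impurity(*good_split), split_impurity(*poor_split))
```

The tree prefers the split with the lowest weighted impurity, which here is the one that separates the two classes completely.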

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


The dataset was split into training (70%) and testing (30%) subsets. Several hyperparameter configurations were tested to determine the best-performing model. The primary parameters tested include:

- Criterion: gini, entropy, log_loss

- Max Depth: None, 20

- Min Samples Split: 2, 5, 10, 20

- Min Samples Leaf: 1, 5

The Decision Tree Classifier was trained and tested under each configuration, and accuracy was used as the evaluation metric.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train, df_train_labels, test_size=0.3, random_state=21)
dtree = DecisionTreeClassifier(random_state=1, max_depth=20)
dtree.fit(X_train, y_train)

acc = accuracy_score(y_test, dtree.predict(X_test)) * 100
print(f"Decision Tree model accuracy: {acc:.2f}%")

| Criterion (d=gini) | Max Depth (d=None) | Min Samples Split (d=2) | Min Samples Leaf (d=1) | Accuracy (%) |
|---------|---------|--------------|--------------|--------------|
| default | default | default | default | 84.63 |
| default | default | 5 | default | 84.61 |
| default | default | 10 | default | 84.27 |
| default | default | 20 | 5 | 84.09 |
| default | 20 | default | default | 85.06 |
| default | 20 | 5 | default | 84.67 |
| default | 20 | 10 | 5 | 84.38 |
| default | 20 | 20 | 5 | 84.15 |
| entropy | default | default | default | 84.73 |
| entropy | default | 5 | default | 84.60 |
| entropy | default | 10 | 5 | 84.13 |
| entropy | 20 | default | default | 84.64 |
| entropy | 20 | 5 | default | 84.45 |
| entropy | 20 | 10 | 5 | 84.05 |
| log_loss | default | default | default | 84.73 |
| log_loss | default | 5 | default | 84.60 |
| log_loss | default | 10 | 5 | 84.13 |
| log_loss | 20 | default | default | 84.64 |
| log_loss | 20 | 5 | default | 84.45 |
| log_loss | 20 | 10 | 5 | 84.05 |


The best-performing configuration was a Decision Tree with:

- Criterion: gini

- Max Depth: 20

- Min Samples Split: 2

- Min Samples Leaf: 1

This configuration achieved an accuracy of 85.06% on the held-out 30% split. The model was then retrained on the full training set and used to generate predictions for the test dataset, yielding a Kaggle competition score of 0.8648.


In [None]:
dtree = DecisionTreeClassifier(random_state=1, max_depth=20)
dtree.fit(df_train, df_train_labels)
dtree_test_labels = dtree.predict(df_test)

In [None]:
save_submission(dtree_test_labels, 'DecisionTree')

The Decision Tree model achieved strong predictive performance: the best configuration reached 85.06% on the held-out split and a Kaggle score of 0.8648. Limiting the tree depth to 20 gave a small gain over an unlimited depth (85.06% versus 84.63% with the gini criterion), which is consistent with a mild reduction in overfitting.

Using the gini criterion produced the best results, though entropy and log_loss (which gave identical results) performed comparably. The default min_samples_split of 2 lets the model make finer splits; in our runs, larger values slightly lowered the estimated accuracy.

Despite the high accuracy, Decision Trees have some limitations, including sensitivity to noisy data and overfitting when the depth is too large.

### 3. Random Forest

Random Forest is an ensemble learning method that constructs multiple decision trees and combines their outputs to make predictions. It helps improve accuracy and reduce overfitting compared to a single decision tree. The final classification is determined by majority voting among all decision trees in the forest.
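
The voting step can be sketched without any library. The per-tree predictions below are hypothetical, and `majority_vote` is a helper introduced only for this illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for three samples (illustration only).
per_tree_predictions = [
    ["UP", "DOWN", "UP"],    # tree 1
    ["UP", "DOWN", "DOWN"],  # tree 2
    ["DOWN", "DOWN", "UP"],  # tree 3
    ["UP", "UP", "UP"],      # tree 4
    ["UP", "DOWN", "DOWN"],  # tree 5
]

# Transpose so that each inner tuple holds all trees' votes for one sample.
forest_prediction = [majority_vote(votes) for votes in zip(*per_tree_predictions)]
print(forest_prediction)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch shows the principle and not the library's exact procedure.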


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

A Random Forest Classifier was used to predict bc_price_evo. The dataset was split into training (80%) and validation (20%) subsets. The model was trained using various hyperparameter settings, including:

- Number of Estimators: 50, 100

- Max Depth: None, 20

- Min Samples Split: 2, 5, 10, 20

- Min Samples Leaf: 1, 5

The model was trained and validated under each configuration, and accuracy was used as the evaluation metric.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df_train, df_train_labels, test_size=0.2, random_state=28)

rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X_train, y_train)

# Evaluate on the validation rows produced by the split above.
acc = accuracy_score(y_val, rf_model.predict(X_val)) * 100
print(f"Random Forest model accuracy: {acc:.2f}%")

| N Estimators (d=100) | Max Depth (d=None) | Min Samples Split (d=2) | Min Samples Leaf (d=1) | Accuracy (%) |
|---------|---------|--------------|--------------|--------------|
| 50 | default | default | default | 97.46 |
| 50 | default | default | 5 | 91.53 |
| 50 | default | 5 | default | 96.85 |
| 50 | default | 5 | 5 | 91.53 |
| 50 | default | 10 | default | 94.63 |
| 50 | default | 10 | 5 | 91.53 |
| 50 | default | 20 | default | 91.37 |
| 50 | 20 | default | default | 97.15 |
| 50 | 20 | default | 5 | 91.32 |
| 50 | 20 | 5 | default | 96.53 |
| 50 | 20 | 5 | 5 | 91.32 |
| 50 | 20 | 10 | default | 94.31 |
| 50 | 20 | 10 | 5 | 91.32 |
| 50 | 20 | 20 | default | 91.02 |
| default | default | default | default | 97.47 |
| default | default | default | 5 | 91.67 |
| default | default | 5 | default | 96.95 |
| default | default | 5 | 5 | 91.67 |
| default | default | 10 | default | 94.98 |
| default | default | 10 | 5 | 91.67 |
| default | default | 20 | default | 91.63 |
| default | 20 | default | default | 97.35 |
| default | 20 | default | 5 | 91.64 |
| default | 20 | 5 | default | 96.66 |
| default | 20 | 5 | 5 | 91.64 |
| default | 20 | 10 | default | 94.49 |
| default | 20 | 10 | 5 | 91.64 |
| default | 20 | 20 | default | 90.97 |

According to the table, the best configuration is the default one, with an estimated accuracy of 97.47%.

In [None]:
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(df_train, df_train_labels)
rf_test_labels = rf_model.predict(df_test)

In [None]:
save_submission(rf_test_labels, 'RandomForest')

The Random Forest model reached an estimated accuracy of 97.47% on the held-out split and a Kaggle score of 0.8816, the best of all models tested. Compared to a single Decision Tree, the ensemble reduced overfitting and improved generalization.

The gap between the 97.47% estimate and the 0.8816 Kaggle score is large, which suggests that the estimate is optimistic, for example because of overfitting or because the evaluation rows were not fully separate from the training rows. Further analysis using cross-validation and feature importance would be beneficial. In the table above, `min_samples_split` and `min_samples_leaf` had the largest effect on accuracy (larger values lowered it), while changing the number of estimators from 50 to 100 or capping the depth at 20 changed it by well under one point.
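
The cross-validation and feature-importance analysis suggested above could look like the following sketch. It runs on synthetic data from `make_classification` as a stand-in for the project dataset, and `n_estimators=50` is chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=400, n_features=7, random_state=1)

model = RandomForestClassifier(n_estimators=50, random_state=1)

# 5-fold cross-validation: every row is used for validation exactly once,
# and never while it is also part of the training fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())

# Feature importances come from a model fitted on the data.
model.fit(X, y)
importances = model.feature_importances_
print(importances)
```

Reporting the mean and spread over folds gives a more stable estimate than a single split, and it rules out accidental overlap between training and evaluation rows.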

### 4. Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for classification and regression tasks.  
It works by finding the optimal hyperplane that best separates data points into different classes.  

Key concepts of SVM include:  
- **Margin Maximization**: SVM aims to maximize the margin between the closest data points (support vectors) and the decision boundary.  
- **Kernel Trick**: SVM can handle non-linearly separable data by using kernel functions (e.g., linear, polynomial, RBF) to map data into a higher-dimensional space.  
- **Regularization (C Parameter)**: Controls the trade-off between achieving a low error rate and maintaining a large margin.  
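
To illustrate the kernel idea, the sketch below evaluates the RBF kernel by hand and compares it with scikit-learn's `rbf_kernel`. The two points are hypothetical, and the `rbf` helper is introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Hypothetical points in the normalized [0, 1] feature space (illustration only).
a = np.array([0.2, 0.4])
b = np.array([0.3, 0.6])

for gamma in [0.5, 20]:
    manual = rbf(a, b, gamma)
    library = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]
    print(gamma, manual, library)
```

A larger `gamma` makes the similarity fall off faster with distance, so each training point influences a smaller neighbourhood and the decision boundary can become more local.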

SVM is available in scikit-learn as the `SVC` class in the `sklearn.svm` module, so we can use it directly as our model.

In [None]:
from sklearn.svm import SVC
from numpy.random import RandomState   # used unseeded below, so each run draws a different split

As with Logistic Regression, we estimate the accuracy on a held-out split to choose the hyperparameters.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train, df_train_labels, test_size=0.20, random_state=RandomState())
svm = SVC(kernel="rbf", gamma=20, C=100.0, class_weight='balanced', max_iter=100000)
svm.fit(X_train, y_train)
acc = accuracy_score(y_test, svm.predict(X_test)) * 100
print(f"SVM model accuracy: {acc:.2f}%")

The estimated accuracies we obtained are:

| Kernel (d='rbf') | Gamma (d='scale') | C (d=1.0)     | Degree (d=3) | Coef0 (d=0.0) | Class Weight (d=None) | Max Iter (d=-1) | Accuracy (%) |
|---------|--------|--------|--------|-------|--------------|-----------|--------------|
| rbf     | 0.5    | 1.0    | default | default | default | default | 76.33 |
| rbf     | 0.5    | 10.0   | default | default | default | default | 77.39 |
| rbf     | 0.5    | 100.0  | default | default | default | default | 78.39 |
| rbf     | default| 100.0  | default | default | default | default | 79.83 |
| linear  | default| 100.0  | default | default | default | default | 74.75 |
| sigmoid | default| 100.0  | default | default | default | default | 43.20 |
| poly    | default| 100.0  | default | default | default | default | 76.42 |
| rbf     | scale  | 100.0  | default | default | default | default | 79.83 |
| rbf     | auto   | 100.0  | default | default | default | default | 76.78 |
| rbf     | 0.01   | 100.0  | default | default | default | default | 75.60 |
| rbf     | 0.1    | 100.0  | default | default | default | default | 76.76 |
| rbf     | 1      | 100.0  | default | default | default | default | 78.66 |
| rbf     | 5      | 100.0  | default | default | default | default | 80.70 |
| rbf     | 10     | 100.0  | default | default | default | default | 81.37 |
| rbf     | 20     | 100.0  | default | default | default | default | 82.05 |
| rbf     | 20     | 100.0  | default | default | balanced | 100000    | 82.12 |
| rbf     | 0.05   | 10.0   | 3      | default | default | default | 75.05 |
| rbf     | 0.05   | 10.0   | 3      | 1      | balanced | 5000      | 75.05 |


The best hyperparameters according to the table are `kernel="rbf", gamma=20, C=100.0, class_weight='balanced'`. For the final model we set `max_iter=-1`, which removes the iteration limit.

In [None]:
svm = SVC(kernel="rbf", gamma=20, C=100.0, class_weight='balanced', max_iter=-1) #82.12%
svm.fit(df_train, df_train_labels)
test_labels = svm.predict(df_test)

In [None]:
save_submission(test_labels, 'SVM')

Our predictions are saved in a CSV file, which we can now submit to Kaggle. The resulting score is 0.8176, close to the estimated accuracy of 82.12%.

### 5. Naive Bayes


Naïve Bayes is a probabilistic classification algorithm based on Bayes' theorem. It assumes that features are conditionally independent given the class label.

Several variants exist. Since our features are continuous numerical values, we use GaussianNB because:

- It models each feature, within each class, as normally distributed, which is the usual choice for continuous data.

- Other Naïve Bayes variants (like MultinomialNB, ComplementNB, and BernoulliNB) are designed for discrete data such as counts or binary indicators and are not suitable for continuous numerical data.
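
As a rough sketch of what Gaussian Naïve Bayes computes, the code below multiplies a class prior by per-feature Gaussian likelihoods and compares the normalized result with scikit-learn's `GaussianNB`. The data points are hypothetical, and `var_smoothing` is set to a very small value so that the library result stays close to the hand computation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-feature data (illustration only).
X = np.array([[0.1, 0.5], [0.2, 0.4], [0.3, 0.6],
              [0.7, 0.2], [0.8, 0.3], [0.9, 0.1]])
y = np.array([0, 0, 0, 1, 1, 1])

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution, evaluated element-wise."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

x_new = np.array([0.4, 0.5])
posteriors = []
for c in (0, 1):
    Xc = X[y == c]
    prior = len(Xc) / len(X)
    # "Naive" step: the joint likelihood is the product of per-feature densities.
    likelihood = np.prod(gaussian_pdf(x_new, Xc.mean(axis=0), Xc.var(axis=0)))
    posteriors.append(prior * likelihood)
posteriors = np.array(posteriors) / np.sum(posteriors)

# With a tiny var_smoothing, scikit-learn should give nearly the same posteriors.
sk = GaussianNB(var_smoothing=1e-12).fit(X, y).predict_proba([x_new])[0]
print(posteriors, sk)
```

The product over features is where the independence assumption enters: correlations between features are ignored when the likelihood is formed.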

Note: scikit-learn classifiers accept string labels such as `'UP'` / `'DOWN'` directly, so encoding is optional. We nevertheless encode the labels as integers here (`'UP'` -> `1` and `'DOWN'` -> `0`) and decode the predictions before saving them.
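
The mapping between string labels and integers can be checked with a short sketch of `LabelEncoder`. The label list is hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels (illustration only).
labels = ["UP", "DOWN", "DOWN", "UP"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are sorted, so 'DOWN' comes before 'UP'.
print(list(encoder.classes_))   # class order defines the integer codes
print(list(encoded))

# inverse_transform maps the integers back to the original strings.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Because the classes are sorted alphabetically, `DOWN` receives code 0 and `UP` receives code 1.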

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

In [None]:
# UP → 1, DOWN → 0
label_encoder = LabelEncoder()
df_train_labels_encode = label_encoder.fit_transform(df_train_labels)

The dataset was split into training (70%) and testing (30%) subsets, with the labels encoded as UP = 1 and DOWN = 0. The key parameter adjusted was var_smoothing, which adds a fraction of the largest feature variance to every variance to keep the probability calculations numerically stable.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train, df_train_labels_encode, test_size=0.3, random_state=21)

gnb = GaussianNB(var_smoothing=1e-3)
gnb.fit(X_train, y_train)

acc = accuracy_score(y_test, gnb.predict(X_test)) * 100
print(f"Gaussian Naive Bayes model accuracy: {acc:.2f}%")

| Var Smoothing (d=1e-9) | Accuracy (%) |
|---------|---------|
| default | 68.65 |
| 1e-10 | 68.65 |
| 1e-08 | 68.65 |
| 1e-07 | 68.65 |
| 1e-06 | 68.67 |
| 1e-05 | 68.71 |
| 0.0001 | 69.13 |
| 0.001 | 70.00 |
| 0.01 | 65.80 |
| 0.1 | 62.42 |

And the best hyperparameter we get according to the table is: `var_smoothing=1e-3`

In [None]:
gnb = GaussianNB(var_smoothing=1e-3)
gnb.fit(df_train, df_train_labels_encode)
gnb_test_labels_encode = gnb.predict(df_test)

gnb_test_labels = label_encoder.inverse_transform(gnb_test_labels_encode)

In [None]:
save_submission(gnb_test_labels, 'GaussianNB')

The Naïve Bayes model's predictions are saved in a CSV file, and the resulting Kaggle score is 0.6945. Both this score and the estimated accuracy (70.00%) are lower than those of the Decision Tree (85.06%) and Random Forest (97.47%), so Naïve Bayes serves mainly as a baseline classifier for this task.

It is common for Naïve Bayes (NB) to perform worse than models such as Random Forest, kNN, or SVM, for reasons including:
- Naïve Bayes assumes features are independent given the class, which is rarely true in real-world data.
- NB estimates the likelihood of each feature separately, which can be inaccurate when features are correlated.

### 6. K-Nearest Neighbors (KNN)


k-Nearest Neighbors (kNN) is a non-parametric, instance-based learning algorithm used for classification tasks. It classifies a given data point by looking at the majority class of its closest k neighbors. The algorithm is sensitive to distance metrics and weighting schemes, making hyperparameter tuning crucial for performance optimization.
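
A minimal sketch of distance-weighted kNN with the Manhattan metric is shown below. The training points are hypothetical, `knn_predict` is a helper introduced only for this illustration, and the small epsilon used to avoid division by zero is a simplification of how scikit-learn handles zero distances.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Distance-weighted kNN vote using the Manhattan (L1) distance."""
    distances = np.abs(X_train - x).sum(axis=1)
    nearest = np.argsort(distances)[:k]
    votes = {}
    for i in nearest:
        # Closer neighbours get larger weights; epsilon avoids division by zero.
        weight = 1.0 / (distances[i] + 1e-12)
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + weight
    return max(votes, key=votes.get)

# Hypothetical training points (illustration only).
X_train = np.array([[0.1, 0.1], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8], [0.5, 0.5]])
y_train = np.array(["DOWN", "DOWN", "UP", "UP", "UP"])

print(knn_predict(X_train, y_train, np.array([0.15, 0.2]), k=3))
print(knn_predict(X_train, y_train, X_train[0], k=3))
```

The second call queries a point that is itself in the training set: its zero distance dominates the vote, so it gets its own label back. This is why a distance-weighted kNN should only be scored on rows that were not used for fitting.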

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [None]:
# UP → 1, DOWN → 0
label_encoder = LabelEncoder()
df_train_labels_encode = label_encoder.fit_transform(df_train_labels)

A kNN classifier was used to predict bc_price_evo. The dataset was split into training (80%) and validation (20%) subsets. The model was trained using various hyperparameter settings, including:

- Number of Neighbors: 1, 3, 5, 7, 9, 11

- Distance Metrics: minkowski, euclidean, manhattan, chebyshev

- Weighting Methods: uniform, distance

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df_train, df_train_labels_encode, test_size=0.2, random_state=28)

knn = KNeighborsClassifier(n_neighbors=9, weights="distance", metric="manhattan")
knn.fit(X_train, y_train)

# Evaluate on the validation rows produced by the split above.
acc = accuracy_score(y_val, knn.predict(X_val)) * 100
print(f"kNN model accuracy: {acc:.2f}%")

| N Neighbors (d=5) | Weights (d='uniform') | Metric (d='minkowski') | Accuracy (%) |
|---------|----------|-----------|--------------|
| 1 | uniform | minkowski | 94.64 |
| 1 | uniform | euclidean | 94.64 |
| 1 | uniform | manhattan | 94.85 |
| 1 | uniform | chebyshev | 94.50 |
| 1 | distance | minkowski | 94.64 |
| 1 | distance | euclidean | 94.64 |
| 1 | distance | manhattan | 94.85 |
| 1 | distance | chebyshev | 94.50 |
| 3 | distance | minkowski | 95.23 |
| 3 | distance | euclidean | 95.23 |
| 3 | distance | manhattan | 95.43 |
| 3 | distance | chebyshev | 95.07 |
| 5 | distance | minkowski | 95.39 |
| 5 | distance | euclidean | 95.39 |
| 5 | distance | manhattan | 95.59 |
| 5 | distance | chebyshev | 95.15 |
| 7 | distance | minkowski | 95.59 |
| 7 | distance | euclidean | 95.59 |
| 7 | distance | manhattan | 95.68 |
| 7 | distance | chebyshev | 95.29 |
| 9 | distance | minkowski | 95.52 |
| 9 | distance | euclidean | 95.52 |
| 9 | distance | manhattan | 95.95 |
| 9 | distance | chebyshev | 95.58 |
| 11 | distance | minkowski | 95.53 |
| 11 | distance | euclidean | 95.53 |
| 11 | distance | manhattan | 95.84 |
| 11 | distance | chebyshev | 95.75 |

The best-performing configuration was a kNN model with:

- Number of Neighbors: 9

- Weighting Method: distance

- Distance Metric: manhattan

This configuration achieved an estimated accuracy of 95.95% on the held-out split. The model was then retrained on the full training set and used to generate predictions for the test dataset, yielding a Kaggle competition score of 0.7870.

In [None]:
knn = KNeighborsClassifier(n_neighbors=9, weights="distance", metric="manhattan")
knn.fit(df_train, df_train_labels_encode)
knn_test_labels_encode = knn.predict(df_test)

knn_test_labels = label_encoder.inverse_transform(knn_test_labels_encode)

In [None]:
save_submission(knn_test_labels, 'knn_model')

The kNN model reached an estimated accuracy of 95.95% on the held-out split. However, its Kaggle score of 0.7870 is much lower, which suggests that this estimate is optimistic and that the model generalizes less well than models such as Random Forest. Finally, kNN is computationally expensive at prediction time, especially for large datasets, because each prediction requires distance computations against the stored training points.

## Conclusion

In this project, multiple machine learning models were evaluated for predicting the evolution of electricity prices in British Columbia (BC). The models varied in complexity, assumptions, and performance. Below is a summary of their results:

| Model | Accuracy (%) | Kaggle Score |
| --- | --- | --- |
| Logistic Regression | 75.12 | 0.7284 |
| Support Vector Machine (SVM) | 82.12 | 0.8176 |
| Decision Tree | 85.06 | 0.8648 |
| Random Forest | **97.47** | **0.8816** |
| Naïve Bayes | 70.00 | 0.6945 |
| k-Nearest Neighbors (kNN) | 95.95 | 0.7870 |

Key takeaways:
- The Random Forest classifier achieved the highest estimated accuracy (97.47%) and the best Kaggle competition score (0.8816). Its ensemble approach generalized better than the other models.
- kNN had a high estimated accuracy (95.95%) but a much lower Kaggle score (0.7870), indicating that the estimate was optimistic. The Decision Tree model also showed solid accuracy (85.06%) but was outperformed by Random Forest.
- SVM showed consistent results between its estimated accuracy (82.12%) and its Kaggle score (0.8176), making it a reliable choice for structured data.
- Naïve Bayes had the lowest accuracy (70.00%) and Kaggle score (0.6945). This was expected due to its assumption of feature independence, which is rarely true in real-world datasets.
- Logistic Regression and Naïve Bayes were the fastest to train but lacked predictive power. Random Forest and kNN, while highly accurate, were more computationally expensive.