# Classification

## Introduction

Electricity pricing is a critical aspect of energy markets, influencing both consumers and suppliers. In this project, we analyze electricity price trends in **British Columbia (BC)** and **Alberta (AB)**, Canada, using historical electricity measurements. Our objective is to develop a predictive model that estimates whether the electricity price in British Columbia will **increase (UP)** or **decrease (DOWN)**.

### Problem Statement

Given a dataset containing electricity-related metrics, we aim to predict the **bc_price_evo** variable, which indicates whether the electricity price in BC is increasing or decreasing. The dataset includes:

- **Date and Time of measurement**
- **Electricity Price and Demand** in British Columbia and Alberta
- **Electricity Transfer** between the two regions

The ultimate goal is to build an accurate machine learning model that can effectively classify price movement in BC.


### Evaluation Metric

Our model will be evaluated using the **Accuracy Score**, which measures the percentage of correct predictions.


```math
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total predictions}}
```

A higher accuracy score indicates a better-performing model.
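
As a quick illustration of the formula, the sketch below computes accuracy by hand on a few made-up labels. The helper name `accuracy` and the sample labels are illustrative only; in the rest of the notebook the same quantity comes from `sklearn.metrics.accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels, for illustration only
y_true = ["UP", "DOWN", "UP", "UP"]
y_pred = ["UP", "DOWN", "DOWN", "UP"]
print(accuracy(y_true, y_pred))  # 3 correct out of 4 -> 0.75
```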

### Submission Format

The submission should be a CSV file containing the predicted **bc_price_evo** values for each test ID. The format should be:

```csv
id,bc_price_evo
28855,UP
28856,UP
28857,DOWN
...
```
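
A file in this format can be produced with pandas by indexing the predictions with the test IDs. The sketch below uses made-up predictions and writes to an in-memory buffer; the variable names are illustrative and not part of the project code.

```python
import io
import pandas as pd

# Made-up test IDs and predictions, for illustration only
test_ids = [28855, 28856, 28857]
predictions = ["UP", "UP", "DOWN"]

# The index carries the ids, so to_csv writes them as the first column
submission = pd.DataFrame(
    {"bc_price_evo": predictions},
    index=pd.Index(test_ids, name="id"),
)

# Write to an in-memory buffer here; pass a file path instead to save to disk
buffer = io.StringIO()
submission.to_csv(buffer)
print(buffer.getvalue())
```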


This project will explore various machine learning techniques to improve prediction accuracy and gain insights into electricity price fluctuations.


### Libraries

For this project, we use a few libraries that simplify the process:

In [3]:
import pandas as pd                 # pandas for the data structure manipulation 
import numpy as np                  # numpy for numerical operations
import matplotlib.pyplot as plt     # matplotlib for plotting
# import sklearn                    # scikit-learn for machine learning and evaluation (the required modules are imported later, in each section)

## Data Processing

This section walks through the data handling steps in Python: loading the data, inspecting it (including checks for missing values and normalization), and preparing it for training.

In [None]:
data_dir = '../data/classification/'
output_dir = '../output/classification/submission/'

### 1. Loading Data  
The first step is to load the training and test sets into pandas DataFrames.

In [10]:
df_train_raw = pd.read_csv(data_dir + 'train.csv', index_col=0)
df_train_raw.head()

id,date,hour,bc_price,bc_demand,ab_price,ab_demand,transfer,bc_price_evo
0,0.45206,1.0,0.074096,0.578846,0.005029,0.494821,0.489912,UP
1,0.455555,0.574468,0.033025,0.349003,0.001554,0.264889,0.829386,DOWN
2,0.027521,0.617021,0.098325,0.533918,0.003467,0.422915,0.414912,UP
3,0.455732,0.93617,0.041822,0.588515,0.00286,0.448731,0.525,UP
4,4.4e-05,0.255319,0.051489,0.30244,0.003467,0.422915,0.414912,UP


In [11]:
df_test_raw = pd.read_csv(data_dir + 'test.csv', index_col=0)
df_test_raw.head()

id,date,hour,bc_price,bc_demand,ab_price,ab_demand,transfer
28855,0.02699,0.638298,0.090128,0.295894,0.003467,0.422915,0.414912
28856,0.009203,0.021277,0.055632,0.386195,0.003467,0.422915,0.414912
28857,0.429494,0.0,0.051849,0.377417,0.003467,0.422915,0.414912
28858,0.885978,0.106383,0.034856,0.198007,0.00225,0.29855,0.740789
28859,0.469493,0.723404,0.042122,0.453734,0.002823,0.436044,0.480702


### 2. Data Inspection  

Looking at the training and test datasets, there are seven feature columns that provide information for the predictions, along with the `id` index and the `bc_price_evo` target (training set only):
- `id` - Unique identifier used by Kaggle
- `date` - Date at which the measurement was made, between the 15th of May 2015 and the 13th of December 2017 (normalized between 0 and 1)
- `hour` - Hour of measurement as a half hour period of time over 24 hours (values originally between 0 and 47, here normalized between 0 and 1)
- `bc_price` - Electricity price in British Columbia (normalized between 0 and 1)
- `bc_demand` - Electricity demand in British Columbia (normalized between 0 and 1)
- `ab_price` - Electricity price in Alberta (normalized between 0 and 1)
- `ab_demand` - Electricity demand in Alberta (normalized between 0 and 1)
- `transfer` - Electricity transfer scheduled between British Columbia and Alberta (normalized between 0 and 1)
- `bc_price_evo` - Is the price in British Columbia going UP or DOWN compared to the last 24 hours? This is the target variable (i.e., it is only given during training)
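
Because `hour` was scaled from the range 0 to 47, the original half-hour slot can be recovered by multiplying by 47 and rounding. A minimal sketch, assuming plain min-max scaling as described above and assuming that slot 0 corresponds to 00:00; the helper name `decode_half_hour` is hypothetical.

```python
def decode_half_hour(hour_normalized):
    """Map a normalized hour in [0, 1] back to (slot, hour, minute)."""
    slot = round(hour_normalized * 47)  # half-hour slot index, 0..47
    # Assumption: slot 0 is 00:00, slot 1 is 00:30, and so on
    return slot, slot // 2, 30 * (slot % 2)

# Values taken from the first rows of the training set
for value in (1.0, 0.574468, 0.255319):
    slot, hh, mm = decode_half_hour(value)
    print(f"{value} -> slot {slot} ({hh:02d}:{mm:02d})")
```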

Before processing, it's essential to check the structure and properties of the data.

In [20]:
df_train_raw.info()
df_train_raw.describe()

<class 'pandas.core.frame.DataFrame'>
Index: 28855 entries, 0 to 28854
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          28855 non-null  float64
 1   hour          28855 non-null  float64
 2   bc_price      28855 non-null  float64
 3   bc_demand     28855 non-null  float64
 4   ab_price      28855 non-null  float64
 5   ab_demand     28855 non-null  float64
 6   transfer      28855 non-null  float64
 7   bc_price_evo  28855 non-null  object 
dtypes: float64(7), object(1)
memory usage: 2.0+ MB


statistic,date,hour,bc_price,bc_demand,ab_price,ab_demand,transfer
count,28855.0,28855.0,28855.0,28855.0,28855.0,28855.0,28855.0
mean,0.497199,0.50662,0.060432,0.434533,0.003634,0.427201,0.497219
std,0.340534,0.290504,0.043706,0.162271,0.012532,0.120954,0.152048
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.031813,0.255319,0.036268,0.320813,0.002358,0.381926,0.414912
50%,0.456219,0.510638,0.051219,0.452098,0.003467,0.422915,0.414912
75%,0.880536,0.765957,0.076018,0.542696,0.003467,0.476437,0.599123
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [21]:
df_test_raw.info()
df_test_raw.describe()

<class 'pandas.core.frame.DataFrame'>
Index: 9619 entries, 28855 to 38473
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       9619 non-null   float64
 1   hour       9619 non-null   float64
 2   bc_price   9619 non-null   float64
 3   bc_demand  9619 non-null   float64
 4   ab_price   9619 non-null   float64
 5   ab_demand  9619 non-null   float64
 6   transfer   9619 non-null   float64
dtypes: float64(7)
memory usage: 601.2 KB


The data is clean: there are no missing values, and the features are already normalized, so it is ready for preparation.
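
The same conclusion can be checked programmatically rather than read off the summary output. The sketch below runs on a small made-up frame so that it is self-contained; in the notebook the checks would be applied to `df_train_raw` and `df_test_raw`. The helper name `check_clean` is hypothetical.

```python
import pandas as pd

def check_clean(df, feature_cols):
    """Return (no_missing, in_unit_range) for the given feature columns."""
    features = df[feature_cols]
    no_missing = not features.isna().any().any()
    # NaN compares as False, so a missing value also fails the range check
    in_unit_range = bool(((features >= 0) & (features <= 1)).all().all())
    return no_missing, in_unit_range

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "bc_price": [0.07, 0.03, 0.10],
    "bc_demand": [0.58, 0.35, 0.53],
})
print(check_clean(toy, ["bc_price", "bc_demand"]))
```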

### 3. Data Preparation

For training, we need to separate the features from the labels (the `bc_price_evo` column).

In [None]:
df_train = df_train_raw.drop(columns=['bc_price_evo'])
df_train_labels = df_train_raw.loc[:, 'bc_price_evo'].copy()

df_train.head()
df_train_labels.head()

id,date,hour,bc_price,bc_demand,ab_price,ab_demand,transfer
0,0.452060,1.000000,0.074096,0.578846,0.005029,0.494821,0.489912
1,0.455555,0.574468,0.033025,0.349003,0.001554,0.264889,0.829386
2,0.027521,0.617021,0.098325,0.533918,0.003467,0.422915,0.414912
3,0.455732,0.936170,0.041822,0.588515,0.002860,0.448731,0.525000
4,0.000044,0.255319,0.051489,0.302440,0.003467,0.422915,0.414912
...,...,...,...,...,...,...,...
28850,0.026503,1.000000,0.082232,0.427551,0.003467,0.422915,0.414912
28851,0.451927,0.574468,0.033626,0.564564,0.002198,0.624806,0.553947
28852,0.907482,0.893617,0.055872,0.329664,0.003695,0.316416,0.602193
28853,0.915800,0.936170,0.044884,0.355549,0.003072,0.241326,0.420614


id
0          UP
1        DOWN
2          UP
3          UP
4          UP
         ... 
28850      UP
28851    DOWN
28852    DOWN
28853    DOWN
28854    DOWN
Name: bc_price_evo, Length: 28855, dtype: object
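
Since accuracy is the evaluation metric, it helps to know how the two classes are distributed: the share of the majority class is the accuracy that a constant prediction would reach. A minimal sketch on made-up labels; in the notebook this would be applied to `df_train_labels`, whose actual class proportions are not shown here.

```python
import pandas as pd

# Made-up labels, for illustration only
labels = pd.Series(["UP", "DOWN", "UP", "UP", "DOWN"], name="bc_price_evo")

# Relative frequency of each class
class_share = labels.value_counts(normalize=True)
print(class_share)

# Accuracy of always predicting the most frequent class
majority_baseline = class_share.max()
print(f"Majority-class baseline: {majority_baseline:.2f}")
```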

In [13]:
df_test = pd.read_csv(data_dir + 'test.csv', index_col=0)

In [14]:
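def save_submission(df_test_labels, name_model):
    """Write the predicted labels, indexed by test id, to a submission CSV."""
    test = df_test.copy()  # relies on the notebook-level df_test loaded above
    test['bc_price_evo'] = df_test_labels
    # Only the id index and the predicted label are written
    test.to_csv(f'../data/classification/submission/{name_model}.csv', columns=['bc_price_evo'])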

## Logistic Regression (Son)

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df_train, df_train_labels, test_size=0.20, random_state=23)

# clf = LogisticRegression(max_iter=10000, random_state=0) # 74.01%
# clf = LogisticRegression(max_iter=10000, random_state=0, solver='liblinear', dual=True) # 73.94%

# clf = LogisticRegression(max_iter=10000, random_state=0, C=0.01) # 62.97%
# clf = LogisticRegression(max_iter=10000, random_state=0, C=0.1) # 66.77%
# clf = LogisticRegression(max_iter=10000, random_state=0, C=10) # 75.05%
# clf = LogisticRegression(max_iter=10000, random_state=0, C=100) # 74.96%
# clf = LogisticRegression(max_iter=10000, random_state=0, C=15) # 75.12%

# clf = LogisticRegression(max_iter=10000, random_state=0, C=15, penalty='l1', solver='liblinear') # 75%
# clf = LogisticRegression(max_iter=10000, random_state=0, C=15, penalty='elasticnet', solver='saga', l1_ratio=0.5) # 75.08%
# clf = LogisticRegression(max_iter=10000, random_state=0, C=15, class_weight='balanced') # 75.13%
# clf = LogisticRegression(max_iter=10000, random_state=0, C=15, class_weight='balanced', fit_intercept=False) # 71.65%

clf = LogisticRegression(max_iter=10000, random_state=0, C=15, class_weight='balanced') # 75.13%
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")

Logistic Regression model accuracy: 75.13%
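
The accuracies noted in the comments above come from a single 80/20 split, so small differences between settings (for example 75.12% versus 75.13%) may not be meaningful. One way to get a steadier comparison is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` so that it runs on its own; the candidate values of `C` are an illustrative subset, and the scores it prints say nothing about the electricity data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (df_train, df_train_labels)
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

cv_means = {}
for C in (0.1, 1.0, 15.0):
    clf = LogisticRegression(max_iter=10000, C=C, class_weight="balanced")
    # Accuracy on each of 5 folds; the mean is less sensitive to one lucky split
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_means[C] = scores.mean()
    print(f"C={C}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```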


In [9]:
clf = LogisticRegression(max_iter=10000, random_state=0, C=15, class_weight='balanced')
clf.fit(df_train, df_train_labels)

test_labels = clf.predict(df_test)

In [10]:
# Attach test_labels to the df_test index and write the submission file
save_submission(test_labels, 'LogisticRegression')

## Decision Tree (Tu)

In [10]:
save_submission(test_labels, 'DecisionTree')

## Random Forest (Tho)


## Support Vector Machine (SVM) (Son)


In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from numpy.random import RandomState

In [None]:
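# Note: an unseeded RandomState gives a different split on each run,
# so the accuracies noted in the comments below are not exactly reproducible.
X_train, X_test, y_train, y_test = train_test_split(df_train, df_train_labels, test_size=0.20, random_state=RandomState())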

# svm = SVC(kernel="rbf", gamma=0.5, C=1.0) #76.33%
# svm = SVC(kernel="rbf", gamma=0.5, C=10.0) #77.39%
# svm = SVC(kernel="rbf", gamma=0.5, C=100.0) #78.39%

# svm = SVC(kernel="rbf", C=100.0) #79.83%
# svm = SVC(kernel="linear", C=100.0) #74.75%
# svm = SVC(kernel="sigmoid", C=100.0) #43.20%
# svm = SVC(kernel="poly", C=100.0) #76.42%

# svm = SVC(kernel="rbf", gamma='scale', C=100.0) #79.83%
# svm = SVC(kernel="rbf", gamma='auto', C=100.0) #76.78%
# svm = SVC(kernel="rbf", gamma=0.01, C=100.0) #75.60%
# svm = SVC(kernel="rbf", gamma=0.1, C=100.0) #76.76%
# svm = SVC(kernel="rbf", gamma=1, C=100.0) #78.66%
# svm = SVC(kernel="rbf", gamma=5, C=100.0) #80.70%
# svm = SVC(kernel="rbf", gamma=10, C=100.0) #81.37%
# svm = SVC(kernel="rbf", gamma=20, C=100.0) #82.05%

svm = SVC(kernel="rbf", gamma=20, C=100.0, class_weight='balanced', max_iter=100000) #82.12%
# svm = SVC(kernel="rbf", gamma=0.05, C=10.0, degree=3) #75.05%
#svm = SVC(kernel="rbf", gamma=0.05, C=10.0, degree=3, coef0=1, class_weight='balanced', max_iter=5000) #75.05%

svm.fit(X_train, y_train)

acc = accuracy_score(y_test, svm.predict(X_test)) * 100
print(f"SVM model accuracy: {acc:.2f}%")

SVM model accuracy: 77.51%
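
`GridSearchCV` is imported above, but the search over `gamma` and `C` was carried out by hand. The sketch below shows how such a search could be automated, again on synthetic data so that the block is self-contained. The grid values are a small illustrative subset, not a recommendation, and the result says nothing about the electricity data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for (df_train, df_train_labels)
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# Every combination of gamma and C is scored with 3-fold cross-validation
param_grid = {"gamma": [0.1, 1, 10], "C": [1.0, 10.0, 100.0]}
search = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the best parameters, which could then be used for the test-set predictions.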


In [59]:
svm = SVC(kernel="rbf", gamma=20, C=100.0, class_weight='balanced', max_iter=100000) #82.12%
svm.fit(df_train, df_train_labels)

test_labels = svm.predict(df_test)



In [60]:
save_submission(test_labels, 'SVM')

## Naive Bayes (Tu)


## K-Nearest Neighbors (KNN) (Tho)