# Introduction 

In this work, we’ll build a binary classification model that predicts whether an electrical grid is stable or unstable, using the UCI Electrical Grid Stability Simulated dataset.

The dataset is available from the UCI Machine Learning Repository via this [link](https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+).

In [68]:
# Import data manipulation libraries
import pandas as pd
import numpy as np

In [69]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [70]:
# Read the uploaded dataset into a dataframe
data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Datasets/uci_data.csv')

In [71]:
# Overview the dataframe
data

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf
0,2.959060,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,0.055347,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.781760,-0.005957,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.277210,-0.920492,0.163041,0.766689,0.839444,0.109853,0.003471,unstable
3,0.716415,7.669600,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,0.028871,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.797110,0.455450,0.656947,0.820923,0.049860,unstable
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,2.930406,9.487627,2.376523,6.187797,3.343416,-0.658054,-1.449106,-1.236256,0.601709,0.779642,0.813512,0.608385,0.023892,unstable
9996,3.392299,1.274827,2.954947,6.894759,4.349512,-1.663661,-0.952437,-1.733414,0.502079,0.567242,0.285880,0.366120,-0.025803,stable
9997,2.364034,2.842030,8.776391,1.008906,4.299976,-1.380719,-0.943884,-1.975373,0.487838,0.986505,0.149286,0.145984,-0.031810,stable
9998,9.631511,3.994398,2.757071,7.821347,2.514755,-0.966330,-0.649915,-0.898510,0.365246,0.587558,0.889118,0.818391,0.037789,unstable


Looking at the dataframe above, we can see that the `stab` and `stabf` columns are directly related.

Whenever `stab` <= **0**, the corresponding value in `stabf` is **"stable"**; otherwise, `stabf` is **"unstable"**.

Because `stabf` is fully determined by `stab`, keeping `stab` as a feature would leak the label into the model, so we drop it from the dataframe. This leaves `stabf` as the sole dependent variable.
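The rule can be checked before the column is dropped. The sketch below rebuilds a few of the `stab`/`stabf` pairs shown in the preview in a small, hypothetical `sample` dataframe (on the full data one would use `data` instead) and tests whether thresholding `stab` at zero reproduces `stabf`.

```python
import numpy as np
import pandas as pd

# A handful of (stab, stabf) pairs copied from the dataframe preview above.
sample = pd.DataFrame({
    'stab': [0.055347, -0.005957, 0.003471, 0.028871, -0.025803],
    'stabf': ['unstable', 'stable', 'unstable', 'unstable', 'stable'],
})

# Re-derive the label from the numeric column: stab <= 0 means "stable".
derived = np.where(sample['stab'] <= 0, 'stable', 'unstable')

# True only if every derived label matches the recorded one.
labels_match = bool((derived == sample['stabf']).all())
print(labels_match)
```

If the same check holds on all rows, `stab` carries no information beyond the label itself.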

In [72]:
data.drop('stab', axis=1, inplace=True)

In [73]:
# Checking for null values
print(data.isna().sum())

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stabf    0
dtype: int64


Let's take a look at the distribution of the values in the `stabf` column.

In [74]:
data['stabf'].value_counts()

unstable    6380
stable      3620
Name: stabf, dtype: int64

What we have is a **binary classification problem** with just two classes.
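The counts above also show that the classes are not balanced. As a small worked example, the snippet below turns the counts into proportions and computes the accuracy that a trivial model would reach by always predicting the majority class. This baseline is an added reference point, not something computed in the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {'unstable': 6380, 'stable': 3620}
total = sum(counts.values())

# Share of each class in the full dataset.
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(proportions)
print(majority_label, baseline_accuracy)
```

Any classifier trained later should be judged against this baseline of roughly 0.638 rather than against 0.5.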

Now let's split the data into training and testing sets.

In [75]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the dataframe
x_train, x_test, y_train, y_test = train_test_split(data.drop(['stabf'], axis=1), data['stabf'], test_size=0.2, random_state=1)
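With `test_size=0.2`, the 10,000 rows are divided into 8,000 training rows and 2,000 test rows. The sketch below reproduces those sizes on a synthetic stand-in with the same class counts. It also passes `stratify=y`, which the notebook does not use; it is shown only as an option for keeping the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the dataset (assumption:
# the feature values are irrelevant here, only the labels matter).
y = np.array(['unstable'] * 6380 + ['stable'] * 3620)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y is NOT used in the notebook; it is shown here as an option.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Share of 'unstable' rows that ended up in the test split.
test_unstable_share = float(np.mean(y_te == 'unstable'))
print(len(y_tr), len(y_te), round(test_unstable_share, 3))
```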


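We are going to use `StandardScaler` to standardize the training set, `x_train`, and the test set, `x_test`. Note that the classifiers trained below are fitted on the unscaled `x_train` and `x_test`, not on the standardized copies.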

In [76]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler
 
# Instantiate into a variable
scaler = StandardScaler()
 
# Standardize x_train (fit the scaler on the training set only)
norm_x_train = scaler.fit_transform(x_train)
# Transform to dataframe
norm_x_train = pd.DataFrame(norm_x_train, columns=x_train.columns)
 
# Standardize x_test with the statistics learned from x_train
norm_x_test = scaler.transform(x_test)
# Transform to dataframe
norm_x_test = pd.DataFrame(norm_x_test, columns=x_test.columns)
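A common convention is to learn the scaling statistics from the training split only and reuse them on the test split, so that no information from the test data reaches the preprocessing step. The following is a minimal sketch of that convention on hypothetical random data; the arrays `train` and `test` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # hypothetical training features
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))    # hypothetical test features

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Training columns end up with mean ~0 and standard deviation ~1 by construction.
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
# Test columns are close to, but generally not exactly, 0.
print(test_scaled.mean(axis=0).round(3))
```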

# Training Classifiers and Measuring Performance

Now we move on to training different models on our data. For each model, we will follow this order:

- Select a classifier
  - Train the classifier
  - Make predictions with the classifier
  - Measure the performance of the classifier

First, let us import the functions we are going to use for measuring performance.

In [77]:
# Import essential libraries from sklearn.metrics
from sklearn.metrics import (precision_score, accuracy_score, 
                             recall_score, f1_score, 
                             confusion_matrix)
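The select, train, predict and measure steps listed above can be wrapped in one helper. This is only a sketch: `evaluate` is a hypothetical function that does not appear in the notebook, and the tiny dataset and `DecisionTreeClassifier` are used purely to show a call.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier


def evaluate(clf, X_train, y_train, X_test, y_test, pos_label='stable'):
    """Train clf, predict on the test set and return the four scores."""
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {
        'precision': precision_score(y_test, y_pred, pos_label=pos_label),
        'recall': recall_score(y_test, y_pred, pos_label=pos_label),
        'f1': f1_score(y_test, y_pred, pos_label=pos_label),
        'accuracy': accuracy_score(y_test, y_pred),
    }


# Tiny, perfectly separable toy data (illustrative only).
X_train = [[0], [1], [2], [3], [4], [5]]
y_train = ['stable', 'stable', 'stable', 'unstable', 'unstable', 'unstable']
X_test = [[0.5], [4.5]]
y_test = ['stable', 'unstable']

scores = evaluate(DecisionTreeClassifier(random_state=1),
                  X_train, y_train, X_test, y_test)
print(scores)
```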

# Question
What is the accuracy on the test set using the random forest classifier? Give the answer to 4 decimal places.

### Random Forest
 
- Training and Prediction

In [78]:
# Import RandomForestClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
 
# Instantiate RandomForestClassifier into a variable
rf = RandomForestClassifier(random_state=1)
 
# Fit the model
rf.fit(x_train, y_train)
 
# Make predictions for test set
y_test_predictions = rf.predict(x_test)

- Performance Measurement

In [79]:
 
metric_scores = [precision_score, recall_score, f1_score, accuracy_score]
for score in metric_scores:
    if score != accuracy_score:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions, pos_label='stable'), 4))
    else:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions), 4))

Precision Score: 0.9191
Recall Score: 0.8778
F1 Score: 0.898
Accuracy Score: 0.929


# Answer
The accuracy on the test set is `0.929` (`0.9290` to 4 decimal places).

- Confusion Matrix

In [80]:
# Print confusion matrix setting the label parameter as ['unstable', 'stable']
cnf_mat = confusion_matrix(y_test, y_test_predictions, labels=['unstable', 'stable'])
print (cnf_mat)

[[1233   55]
 [  87  625]]
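The four scores printed earlier can be recovered from this confusion matrix by hand. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels, both in the order `['unstable', 'stable']`) and treats `'stable'` as the positive class, matching `pos_label='stable'`.

```python
# Confusion matrix from the cell above; rows are true labels and columns are
# predicted labels, both ordered ['unstable', 'stable'].
cnf = [[1233, 55],
       [87, 625]]

# With 'stable' as the positive class, the first row holds the negatives.
tn, fp = cnf[0]
fn, tp = cnf[1]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# prints: 0.9191 0.8778 0.898 0.929
```

These values agree with the Random Forest scores reported by the metric functions.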


### Extra Trees Classifier
- Training and Predictions

In [81]:
# Import ExtraTreesClassifier from sklearn.ensemble
from sklearn.ensemble import ExtraTreesClassifier
 
# Instantiate it into a variable
et_clf = ExtraTreesClassifier(random_state=1)
 
# Fit the model
et_clf.fit(x_train, y_train)
 
# Make predictions on the test set
y_test_predictions = et_clf.predict(x_test)

- Performance Measurement

In [82]:
 
metric_scores = [precision_score, recall_score, f1_score, accuracy_score]
for score in metric_scores:
    if score != accuracy_score:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions, pos_label='stable'), 4))
    else:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions), 4))

Precision Score: 0.941
Recall Score: 0.8511
F1 Score: 0.8938
Accuracy Score: 0.928


\begin{equation}
\text{Accuracy Score} = 0.928
\label{equation 1}
\end{equation}

- Confusion Matrix

In [83]:
# Print confusion matrix setting the label parameter as ['unstable', 'stable']
cnf_mat = confusion_matrix(y_test, y_test_predictions, labels=['unstable', 'stable'])
print (cnf_mat)

[[1250   38]
 [ 106  606]]


# Question 
What is the accuracy on the test set using the xgboost classifier? Give the answer to 4 decimal places.

### Extreme Gradient Boosting
- Training and Predictions

In [84]:
# Import xgboost
import xgboost as xgb
 
# Instantiate into a variable
xgb_clf = xgb.XGBClassifier(random_state=1)
 
# Fit the model
xgb_clf.fit(x_train, y_train)
 
# Make predictions
y_test_predictions = xgb_clf.predict(x_test)

- Performance Measurement

In [85]:
 
metric_scores = [precision_score, recall_score, f1_score, accuracy_score]
for score in metric_scores:
    if score != accuracy_score:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions, pos_label='stable'), 4))
    else:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions), 4))

Precision Score: 0.9206
Recall Score: 0.8469
F1 Score: 0.8822
Accuracy Score: 0.9195


# Answer
The accuracy on the test set using Extreme Gradient Boosting is `0.9195`.

- Confusion Matrix

In [86]:
# Print confusion matrix setting the label parameter as ['unstable', 'stable']
cnf_mat = confusion_matrix(y_test, y_test_predictions, labels=['unstable', 'stable'])
print (cnf_mat)

[[1236   52]
 [ 109  603]]


# Question
What is the accuracy on the test set using the LGBM classifier? Give the answer to 4 decimal places.

### Light Gradient Boosting
- Training and Predictions

In [87]:
# Import lightgbm
import lightgbm as lgb
 
# Instantiate into a variable
lgb_clf = lgb.LGBMClassifier(random_state=1)
 
# Fit the model
lgb_clf.fit(x_train, y_train)
 
# Make predictions
y_test_predictions = lgb_clf.predict(x_test)

- Performance Measurement

In [88]:
 
metric_scores = [precision_score, recall_score, f1_score, accuracy_score]
for score in metric_scores:
    if score != accuracy_score:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions, pos_label='stable'), 4))
    else:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions), 4))

Precision Score: 0.9285
Recall Score: 0.8933
F1 Score: 0.9105
Accuracy Score: 0.9375


# Answer
The accuracy on the test set using LGBM is `0.9375`.

- Confusion Matrix

In [89]:
# Print confusion matrix setting the label parameter as ['unstable', 'stable']
cnf_mat = confusion_matrix(y_test, y_test_predictions, labels=['unstable', 'stable'])
print (cnf_mat)

[[1239   49]
 [  76  636]]
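With all four untuned models evaluated, their test accuracies can be placed side by side. The snippet below copies the accuracy values printed in the performance cells and ranks them; the dictionary keys are display names chosen here for illustration.

```python
# Test-set accuracies copied from the four performance cells above.
accuracies = {
    'Random Forest': 0.929,
    'Extra Trees': 0.928,
    'XGBoost': 0.9195,
    'LightGBM': 0.9375,
}

# Rank the models from highest to lowest accuracy.
ranking = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f'{name:<14}{acc:.4f}')

best_model = ranking[0][0]
# Gap between best and worst model, expressed in test samples (2000 rows).
gap_in_samples = round((ranking[0][1] - ranking[-1][1]) * 2000)
print(best_model, gap_in_samples)
```

On this split the spread between the best and worst model is 36 of the 2,000 test rows, so the ranking should be read with some caution.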


# Improving Classifiers with RandomizedSearchCV

We are going to improve the `ExtraTreesClassifier` by running a `RandomizedSearchCV` on it.

Let's begin by specifying the hyperparameters needed for this operation.

In [90]:
# Specify hyperparameter grid needed
n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
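# Note: 'auto' was removed from max_features in scikit-learn 1.3; drop it from the list on newer versions
max_features = ['auto', 'sqrt', 'log2', None]
 
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'max_features': max_features,
}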
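It helps to know how large this grid is before sampling from it. The sketch below counts every combination with `itertools.product` and compares that to `n_iter=10`, the number of candidates the randomized search in this notebook draws. The grid is re-declared so the snippet runs on its own.

```python
from itertools import product

hyperparameter_grid = {
    'n_estimators': [50, 100, 300, 500, 1000],
    'min_samples_split': [2, 3, 5, 7, 9],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'max_features': ['auto', 'sqrt', 'log2', None],
}

# Every possible combination of the four hyperparameters: 5 * 5 * 5 * 4.
all_combinations = list(product(*hyperparameter_grid.values()))
n_combinations = len(all_combinations)

n_iter = 10  # number of candidates sampled by RandomizedSearchCV in this notebook
coverage = n_iter / n_combinations

print(n_combinations, coverage)
# prints: 500 0.02
```

The search therefore evaluates only 2% of the grid, so its "best" parameters are the best among the sampled candidates, not necessarily the best in the whole grid.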

- Training and Predictions

# Question
To improve the Extra Trees Classifier, use the following parameters (number of estimators, minimum number of samples required to split a node, minimum number of samples required at a leaf node, and the number of features to consider when looking for the best split) as the hyperparameter grid for a randomized cross-validation search (`RandomizedSearchCV`).

**What are the best hyperparameters from the randomized search CV?**

In [91]:
 
# Import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
 
# Instantiate into a variable
rscv_clf = RandomizedSearchCV(et_clf, cv=5, param_distributions=hyperparameter_grid, n_iter=10, scoring='accuracy', n_jobs=-1, verbose=1, random_state=1)
 
# Fit the model
rscv_clf.fit(x_train, y_train)
 
# Make predictions on test set
y_test_predictions = rscv_clf.predict(x_test)
 
rscv_clf.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.5min finished


{'max_features': None,
 'min_samples_leaf': 8,
 'min_samples_split': 2,
 'n_estimators': 1000}

# Answer
 
The best hyperparameters are:
 
>{'max_features': None,
 'min_samples_leaf': 8,
 'min_samples_split': 2,
 'n_estimators': 1000}

- Performance Measurement

In [92]:
 
metric_scores = [precision_score, recall_score, f1_score, accuracy_score]
for score in metric_scores:
    if score != accuracy_score:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions, pos_label='stable'), 4))
    else:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions), 4))

Precision Score: 0.9211
Recall Score: 0.8694
F1 Score: 0.8945
Accuracy Score: 0.927


In [93]:
# Print confusion matrix setting the label parameter as ['unstable', 'stable']
cnf_mat = confusion_matrix(y_test, y_test_predictions, labels=['unstable', 'stable'])
print (cnf_mat)

[[1235   53]
 [  93  619]]


# Question 
Train a new ExtraTreesClassifier model with the new hyperparameters from the RandomizedSearchCV (with random_state = 1). Is the accuracy of the new optimal model higher or lower than that of the initial ExtraTreesClassifier model with no hyperparameter tuning?

## Training ExtraTreesClassifier with hyperparameters from RandomizedSearchCV

In [94]:
# Import ExtraTreesClassifier from sklearn.ensemble
from sklearn.ensemble import ExtraTreesClassifier
 
# Instantiate it into a variable
# Use hyperparameters derived from RandomizedSearchCV
et_clf = ExtraTreesClassifier(n_estimators=1000, min_samples_split=2, min_samples_leaf=8, max_features=None, random_state=1)
 
# Fit the model
et_clf.fit(x_train, y_train)
 
# Make predictions on the test set
y_test_predictions = et_clf.predict(x_test)

In [95]:
 
metric_scores = [precision_score, recall_score, f1_score, accuracy_score]
for score in metric_scores:
    if score != accuracy_score:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions, pos_label='stable'), 4))
    else:
        print(score.__name__.replace('_', ' ').title()+':', round(score(y_test, y_test_predictions), 4))

Precision Score: 0.9211
Recall Score: 0.8694
F1 Score: 0.8945
Accuracy Score: 0.927


# Answer 
The accuracy score of the new ExtraTreesClassifier with the optimal hyperparameters is `0.927`. This is slightly lower than the accuracy score of `0.928` obtained earlier in equation **(\ref{equation 1})** from the initial ExtraTreesClassifier with no hyperparameter tuning.
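The retrained model reproduces the scores obtained from `rscv_clf.predict` exactly. That is consistent with the default `refit=True` behaviour of `RandomizedSearchCV`, which refits the best candidate on the whole training set after the search. The sketch below illustrates this on hypothetical synthetic data with a deliberately small grid; it is not the notebook's dataset or grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem (assumption: stands in for the grid dataset).
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train = X[:240], X[240:], y[:240]

small_grid = {'n_estimators': [10, 20], 'min_samples_leaf': [1, 4]}
search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=1),
    param_distributions=small_grid,
    n_iter=3, cv=3, scoring='accuracy', random_state=1,
)
search.fit(X_train, y_train)  # refit=True by default

# Train a fresh model by hand with the best parameters and the same seed.
manual = ExtraTreesClassifier(random_state=1, **search.best_params_)
manual.fit(X_train, y_train)

same_predictions = np.array_equal(search.predict(X_test), manual.predict(X_test))
print(search.best_params_, same_predictions)
```

When the predictions match, retraining by hand adds no new information; the comparison with the untuned model can be made directly from the search object.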

# Question
Find the feature importance using the optimal ExtraTreesClassifier model. Which features are the most and least important respectively?

In [96]:
feature_importance = et_clf.feature_importances_
 
print (feature_importance)

[0.13723979 0.14050787 0.1346805  0.13541662 0.00368361 0.0053368
 0.00542927 0.0049625  0.10256224 0.1075776  0.11306257 0.10954062]


In [97]:
feature_importance_series = pd.Series(feature_importance, index=data.drop('stabf', axis=1).columns)
 
print (feature_importance_series)

tau1    0.137240
tau2    0.140508
tau3    0.134680
tau4    0.135417
p1      0.003684
p2      0.005337
p3      0.005429
p4      0.004963
g1      0.102562
g2      0.107578
g3      0.113063
g4      0.109541
dtype: float64


We can further sort the values to have a logical order of the importance of each feature.

In [98]:
print (feature_importance_series.sort_values())

p1      0.003684
p4      0.004963
p2      0.005337
p3      0.005429
g1      0.102562
g2      0.107578
g4      0.109541
g3      0.113063
tau3    0.134680
tau4    0.135417
tau1    0.137240
tau2    0.140508
dtype: float64


# Answer
From the output of the cell above, the most important feature is `tau2` and the least important feature is `p1`.
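The same answer can be read off programmatically instead of by eye. The sketch below rebuilds the importance values printed above in a standalone Series named `importances` (a name chosen here for illustration), picks the extremes with `idxmax` and `idxmin`, and sums the importances per feature family (`tau`, `p`, `g`) as an extra summary that the notebook does not compute.

```python
import pandas as pd

# Feature importances copied from the printed Series above.
importances = pd.Series({
    'tau1': 0.137240, 'tau2': 0.140508, 'tau3': 0.134680, 'tau4': 0.135417,
    'p1': 0.003684, 'p2': 0.005337, 'p3': 0.005429, 'p4': 0.004963,
    'g1': 0.102562, 'g2': 0.107578, 'g3': 0.113063, 'g4': 0.109541,
})

most_important = importances.idxmax()
least_important = importances.idxmin()

# Strip the trailing digit to group features by family: tau, p and g.
family = importances.index.str.rstrip('0123456789')
group_totals = importances.groupby(family).sum()

print(most_important, least_important)
print(group_totals.sort_values(ascending=False))
```

The `tau` and `g` families account for nearly all of the total importance, while the four `p` features together contribute about 2%.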