# Classification -- BBC Articles

- Text Vectorization (from scikit-learn):
  - TF-IDF Vectorizer
- Classifiers (from scikit-learn):
  - Random Forest
  - Logistic Regression
  - K-Nearest Neighbors
  - Simple Decision Tree
  - Gaussian Naive Bayes

## Import Libraries and Set Settings

In [1]:
!pip install wordcloud



In [2]:
!pip install plotly



In [3]:
!pip install nltk



In [4]:
import os                              # Python default package
import numpy as np
import pandas as pd
import ipywidgets as widgets

from sqlalchemy import create_engine   # conda install -c anaconda sqlalchemy
from wordcloud import WordCloud        # conda install -c conda-forge wordcloud
from ipywidgets import interact, fixed

# For visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Azure ML Specific
from azureml.core import Workspace, Dataset

In [5]:
sns.set_theme(style="whitegrid")
pd.options.display.max_rows = 3000

## Import Dataset

In this notebook, we work with the `bbc-with-category-target` dataset registered in the `BBCArticles` workspace.

In [6]:
# Specific Azure ML for importing Datasets
subscription_id = '546d9c91-7fcf-4547-836c-10b640e06628'
resource_group = 'NSSCapstoneProject'
workspace_name = 'BBCArticles'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='bbc-with-category-target')
bbc = dataset.to_pandas_dataframe()

display(bbc.shape)
display(bbc.head())

(2225, 6)

| | category | titles | contents | content_lengths | processed_contents | category_target |
|---|---|---|---|---|---|---|
| 0 | business | Ad sales boost Time Warner profit | Quarterly profits at US media giant TimeWarner... | 2516 | quarterly profits us media giant timewarner ju... | 0 |
| 1 | business | Dollar gains on Greenspan speech | The dollar has hit its highest level against t... | 2213 | dollar hit highest level euro almost three mon... | 0 |
| 2 | business | Yukos unit buyer faces loan claim | The owners of embattled Russian oil giant Yuko... | 1512 | owners embattled russian oil giant yukos ask b... | 0 |
| 3 | business | High fuel prices hit BA's profits | British Airways has blamed high fuel prices fo... | 2368 | british airways blamed high fuel prices 40 dro... | 0 |
| 4 | business | Share boost for feud-hit Reliance | The board of Indian conglomerate Reliance has ... | 849 | board indian conglomerate reliance agreed shar... | 0 |


### How are these labels mapped?

In [7]:
# Each pair of (category, category_target) records
records = bbc[["category", "category_target"]].to_records(index=False)

# Checking the mapping of the new labels
for pair in np.unique(np.array(records)):
    print(pair)

('business', 0)
('entertainment', 1)
('politics', 2)
('sport', 3)
('tech', 4)
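
The same mapping can also be read off with pandas alone. The sketch below uses a small stand-in frame (`toy` is hypothetical, not the real `bbc` data) and keeps one row per distinct pair:

```python
import pandas as pd

# Hypothetical stand-in for the `bbc` frame, with the same two columns
toy = pd.DataFrame({
    "category": ["business", "sport", "business", "tech", "sport"],
    "category_target": [0, 3, 0, 4, 3],
})

# One row per distinct (category, category_target) pair, ordered by target
mapping = (
    toy[["category", "category_target"]]
    .drop_duplicates()
    .sort_values("category_target")
    .reset_index(drop=True)
)
print(mapping)
```

On the real data, replacing `toy` with `bbc` should list the five pairs shown above.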


## Classification -- Ground Work

### Split data to Training And Testing

In [8]:
# Import libraries
from sklearn.model_selection import train_test_split

In [9]:
# Features and Target
X = bbc["processed_contents"]
y = bbc["category_target"]

In [10]:
# Perform train-test-split: 20% Test-Size
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=777,
    stratify=y  # Keep the class proportions the same in the training and test sets
)

In [11]:
# Check result of splitting
print("X:", X.shape)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)

X: (2225,)
X_train: (1780,)
X_test: (445,)
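
To see what `stratify` does, the class shares in each split can be compared. This is a minimal sketch on made-up labels (`X_toy` and `y_toy` are hypothetical), not a check run on the BBC data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2
y_toy = pd.Series([0] * 60 + [1] * 30 + [2] * 10)
X_toy = pd.Series([f"doc {i}" for i in range(len(y_toy))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=777, stratify=y_toy
)

# Share of each class in the two splits; with stratify they should match closely
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

The same two `value_counts(normalize=True)` calls on `y_train` and `y_test` would confirm the split used in this notebook.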


**NOTE**
- **Currently, all the X's are raw text**
- **We need to convert them into numeric feature vectors**
- **This will be done using the TF-IDF Vectorizer**

### Using TF-IDF to Vectorize the Text

In [12]:
# Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer

- Convert a collection of raw documents to a matrix of TF-IDF features
- `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`

```python
class sklearn.feature_extraction.text.TfidfVectorizer(
    *,
    input='content', 
    encoding='utf-8', 
    decode_error='strict', 
    strip_accents=None, 
    lowercase=True, 
    preprocessor=None, 
    tokenizer=None, 
    analyzer='word', 
    stop_words=None, 
    token_pattern='(?u)\b\w\w+\b', 
    ngram_range=(1, 1), 
    max_df=1.0, 
    min_df=1, 
    max_features=None, 
    vocabulary=None, 
    binary=False, 
    dtype=<class 'numpy.float64'>, 
    norm='l2', 
    use_idf=True, 
    smooth_idf=True, 
    sublinear_tf=False
)
```
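
As a quick check of the equivalence with `CountVectorizer` followed by `TfidfTransformer` noted above, the following sketch compares both routes on a tiny made-up corpus (`docs` is illustrative only) using default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Tiny illustrative corpus, not taken from the BBC dataset
docs = [
    "profits rise at media giant",
    "dollar gains on speech",
    "oil giant faces loan claim",
]

# Route 1: TfidfVectorizer in one step
direct = TfidfVectorizer().fit_transform(docs)

# Route 2: raw counts first, then the TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(direct.toarray(), two_step.toarray())
print("Both routes give the same matrix:", same)
```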

Define the hyperparameters of the TF-IDF Vectorizer

In [13]:
# Hyper-parameters for TF-IDF Vectorizer
ngram_range = (1, 2)   # Use unigrams and bigrams
min_df = 10            # Integer = absolute count: ignore terms that appear in fewer than 10 documents (the "cut-off")
max_df = 1.0           # Float = proportion: ignore terms that appear in more than 100% of documents, i.e. no upper cut-off
# max_features = 300
norm = "l2"            # Scale each document vector to unit Euclidean length
sublinear_tf = True    # Replace tf with 1 + log(tf)
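
To illustrate how `ngram_range` and `min_df` shape the vocabulary, here is a small sketch on a made-up three-document corpus (`docs` is hypothetical, and `min_df=2` is used only because the corpus is tiny):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "oil" is in 3 documents, "prices" and "rise" in 2, the rest in 1
docs = ["oil prices rise", "oil prices fall", "oil firm profits rise"]

# Unigrams and bigrams, keeping every term
vocab_all = sorted(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit(docs).vocabulary_
)

# Same, but dropping terms that appear in fewer than 2 documents
vocab_min2 = sorted(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(docs).vocabulary_
)

print(len(vocab_all), "terms without a cut-off")
print(vocab_min2)
```

Only terms and bigrams shared by at least two documents survive the cut-off, which is how `min_df = 10` keeps the feature count manageable on the full corpus.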

In [14]:
# Create a TF-IDF Vectorizer instance
tfidf = TfidfVectorizer(
    encoding="utf-8",
    ngram_range=ngram_range,
    stop_words=None, # We already cleaned the text earlier in pre-processing
    lowercase=False, # We already cleaned the text earlier in pre-processing
    max_df=max_df,
    min_df=min_df,
    norm=norm,
    sublinear_tf=sublinear_tf
)

# Store the vectorized features and labels of the training and testing data
# We can pass these to various ML algorithms later for comparing classification performance
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train

# Only transform the test data, reusing the vocabulary and IDF weights fitted on the training data
features_test = tfidf.transform(X_test).toarray()
labels_test = y_test

In [15]:
# Checking where we are
print("features_train:", features_train.shape)
print("labels_train:", labels_train.shape)
print("features_test:", features_test.shape)
print("labels_test:", labels_test.shape)

features_train: (1780, 5765)
labels_train: (1780,)
features_test: (445, 5765)
labels_test: (445,)


In [16]:
# Checking what they look like
print("--- Vectorized Features for Training:")
print(features_train)
print("--- Target Labels:")
print(labels_train)

--- Vectorized Features for Training:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
--- Target Labels:
709     1
1374    3
2016    4
704     1
260     0
1709    3
2162    4
793     1
56      0
1342    3
2079    4
2013    4
826     1
1857    4
139     1
1727    3
329     0
1323    2
686     1
1793    3
643     1
1992    4
1903    4
844     1
1914    4
781     1
1890    4
2189    4
368     0
642     1
1231    3
1980    4
1911    4
2042    4
2173    4
1505    3
836     1
488     0
787     1
791     1
1366    3
1215    2
233     0
1414    3
142     0
834     1
1332    2
1735    3
1961    4
215     0
1532    3
1476    3
2208    4
1378    3
68      0
1781    3
145     0
2095    4
1831    4
255     0
1481    3
2207    4
185     0
1240    2
2018    4
83      0
2144    4
1477    3
1712    3
1027    2
850     1
1872    4
113     0
1399    3
1670    3
1492    3
405     0
1962    4
1189   

**We now have our feature matrices (dense NumPy arrays, because of `.toarray()`) ready to be used by the models**
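
Because `.toarray()` is called, the features are stored densely: a 1780 x 5765 `float64` array takes roughly 82 MB. The sketch below, on a synthetic corpus (`docs` is generated purely for illustration), compares the memory of the sparse matrix returned by `fit_transform` with its dense copy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic corpus: 50 documents with 5 unique tokens each (illustrative only)
docs = [" ".join(f"term{i}x{j}" for j in range(5)) for i in range(50)]

sparse_features = TfidfVectorizer().fit_transform(docs)  # SciPy CSR matrix
dense_features = sparse_features.toarray()               # NumPy array

# CSR storage = non-zero values + their column indices + row pointers
sparse_bytes = (
    sparse_features.data.nbytes
    + sparse_features.indices.nbytes
    + sparse_features.indptr.nbytes
)
dense_bytes = dense_features.nbytes
density = sparse_features.nnz / (dense_features.shape[0] * dense_features.shape[1])

print("sparse bytes:", sparse_bytes)
print("dense bytes:", dense_bytes)
print("share of non-zero entries:", density)
```

Most of the classifiers used below (Random Forest, Logistic Regression, K-Nearest Neighbors, Decision Tree) accept sparse input directly, whereas `GaussianNB` requires a dense array, which is one reason to convert here.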

## Classification: Using Random Forest Model (RF) -- Out of the Box

In [17]:
# Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [18]:
# Initialize the model
model_rf = RandomForestClassifier(random_state=777)

In [19]:
# Fit the model
model_rf.fit(features_train, labels_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=777,
                       verbose=0, warm_start=False)

In [20]:
# Predict on the test data
labels_pred = model_rf.predict(features_test)
display(labels_pred)

array([1, 2, 0, 3, 3, 0, 2, 3, 3, 1, 0, 2, 2, 4, 2, 3, 2, 1, 1, 3, 3, 3,
       1, 0, 3, 1, 0, 2, 1, 0, 3, 0, 1, 4, 0, 0, 0, 0, 3, 1, 2, 3, 4, 0,
       1, 2, 0, 1, 0, 1, 1, 2, 3, 1, 3, 3, 0, 4, 0, 4, 3, 3, 2, 3, 1, 3,
       3, 1, 4, 1, 4, 0, 4, 4, 0, 3, 1, 2, 3, 0, 3, 2, 4, 4, 2, 3, 3, 0,
       0, 2, 4, 2, 3, 0, 4, 4, 3, 3, 1, 0, 3, 3, 3, 2, 4, 0, 0, 2, 2, 3,
       4, 3, 0, 1, 2, 4, 4, 0, 3, 3, 3, 0, 2, 0, 3, 1, 3, 3, 1, 0, 0, 4,
       2, 4, 3, 0, 2, 0, 3, 2, 1, 2, 2, 1, 3, 0, 3, 3, 1, 3, 1, 0, 3, 2,
       3, 2, 1, 3, 2, 2, 4, 3, 3, 0, 2, 0, 1, 1, 3, 4, 0, 2, 4, 3, 2, 0,
       1, 0, 4, 4, 3, 1, 2, 1, 4, 0, 4, 0, 0, 2, 0, 0, 0, 1, 4, 0, 3, 4,
       3, 0, 1, 1, 0, 0, 4, 0, 3, 2, 0, 1, 1, 4, 4, 0, 0, 3, 0, 3, 0, 2,
       1, 0, 0, 1, 3, 2, 4, 1, 4, 4, 3, 0, 3, 3, 3, 3, 1, 2, 3, 0, 3, 2,
       1, 1, 2, 0, 1, 4, 0, 3, 1, 2, 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 0, 2,
       2, 3, 1, 3, 0, 3, 3, 3, 4, 1, 1, 0, 2, 0, 4, 0, 3, 0, 0, 4, 3, 4,
       0, 0, 3, 2, 0, 0, 2, 1, 1, 2, 2, 2, 1, 4, 0,

### Random Forest Performance Metrics

In [21]:
# Accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9483146067415731


In [22]:
# Classification Report
report = classification_report(labels_test, labels_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.92      1.00      0.96       102
           1       0.99      0.94      0.96        77
           2       0.96      0.95      0.96        84
           3       0.91      0.99      0.95       102
           4       1.00      0.84      0.91        80

    accuracy                           0.95       445
   macro avg       0.96      0.94      0.95       445
weighted avg       0.95      0.95      0.95       445
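
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows. The sketch below does this for recall, using the rounded values printed in the report, so it can only match the report to two decimals:

```python
import numpy as np

# Per-class recall and support copied from the Random Forest report above
recall = np.array([1.00, 0.94, 0.95, 0.99, 0.84])
support = np.array([102, 77, 84, 102, 80])

# Macro average: plain mean over classes, each class counts equally
macro_recall = recall.mean()

# Weighted average: mean over classes weighted by the number of test samples
weighted_recall = np.average(recall, weights=support)

print("macro recall:", round(float(macro_recall), 2))
print("weighted recall:", round(float(weighted_recall), 2))
```

The support-weighted recall is the share of all test samples predicted correctly, which is why it agrees with the accuracy row.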



## Classification: Using Logistic Regression Model (LR) -- Out of the Box

In [23]:
# Import libraries
from sklearn.linear_model import LogisticRegression

In [24]:
# Initialize the model
model_lr = LogisticRegression()

In [25]:
# Fit the model
model_lr.fit(features_train, labels_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [26]:
# Predict on the test data
labels_pred = model_lr.predict(features_test)
display(labels_pred)

array([1, 2, 0, 3, 3, 0, 2, 3, 3, 1, 0, 2, 2, 4, 2, 3, 2, 1, 1, 3, 3, 3,
       1, 0, 3, 1, 0, 2, 1, 0, 3, 0, 1, 4, 0, 0, 0, 0, 3, 1, 2, 3, 4, 0,
       1, 2, 0, 1, 0, 1, 1, 2, 3, 1, 3, 3, 0, 4, 0, 4, 3, 3, 2, 3, 1, 3,
       3, 1, 4, 1, 4, 0, 4, 4, 0, 3, 1, 2, 3, 0, 3, 2, 4, 4, 2, 3, 3, 0,
       0, 2, 4, 2, 3, 0, 4, 4, 3, 3, 1, 0, 3, 3, 3, 2, 4, 0, 0, 2, 2, 3,
       4, 3, 0, 1, 2, 4, 4, 0, 3, 3, 3, 0, 2, 0, 3, 1, 3, 3, 1, 2, 0, 4,
       2, 4, 3, 0, 2, 0, 3, 2, 1, 2, 2, 2, 3, 0, 3, 3, 1, 4, 1, 0, 3, 2,
       3, 2, 1, 3, 2, 2, 4, 1, 3, 0, 2, 0, 1, 1, 3, 4, 0, 2, 4, 3, 2, 0,
       1, 0, 4, 4, 3, 1, 2, 1, 4, 0, 4, 0, 0, 2, 0, 0, 0, 1, 4, 0, 3, 4,
       3, 0, 1, 1, 0, 1, 4, 0, 3, 2, 0, 1, 1, 4, 4, 0, 0, 3, 0, 3, 0, 2,
       1, 0, 0, 1, 3, 2, 4, 1, 2, 4, 4, 0, 3, 3, 3, 3, 1, 2, 3, 0, 3, 2,
       1, 1, 2, 0, 1, 4, 0, 3, 1, 2, 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 0, 2,
       2, 2, 1, 3, 0, 4, 4, 3, 4, 1, 1, 0, 2, 0, 4, 0, 3, 0, 0, 4, 3, 4,
       0, 0, 3, 2, 1, 0, 2, 1, 1, 2, 2, 2, 1, 4, 0,

### Logistic Regression Performance Metrics

In [27]:
# Accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9730337078651685


In [28]:
# Classification Report
report = classification_report(labels_test, labels_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.99      0.98       102
           1       1.00      0.97      0.99        77
           2       0.95      1.00      0.98        84
           3       0.96      1.00      0.98       102
           4       1.00      0.89      0.94        80

    accuracy                           0.97       445
   macro avg       0.98      0.97      0.97       445
weighted avg       0.97      0.97      0.97       445



## Classification: Using K-Nearest Neighbors Model (KNN) -- Out of the Box

In [29]:
# Import libraries
from sklearn.neighbors import KNeighborsClassifier

In [30]:
# Initialize the model
model_knn = KNeighborsClassifier()

In [31]:
# Fit the model
model_knn.fit(features_train, labels_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [32]:
# Predict on the test data
labels_pred = model_knn.predict(features_test)
display(labels_pred)

array([1, 2, 0, 3, 3, 0, 2, 3, 3, 1, 0, 2, 2, 4, 2, 3, 2, 1, 1, 3, 3, 3,
       1, 4, 3, 1, 0, 2, 1, 0, 3, 4, 1, 4, 0, 0, 0, 0, 3, 1, 2, 3, 4, 0,
       1, 2, 0, 1, 0, 1, 1, 2, 3, 1, 3, 3, 0, 4, 0, 4, 3, 3, 2, 3, 1, 3,
       3, 1, 4, 1, 4, 0, 4, 1, 0, 3, 1, 2, 3, 0, 3, 2, 4, 4, 2, 3, 3, 0,
       0, 2, 4, 2, 3, 0, 4, 4, 3, 3, 1, 0, 3, 3, 3, 2, 4, 0, 0, 2, 2, 3,
       4, 3, 0, 1, 2, 4, 4, 0, 3, 3, 3, 0, 2, 0, 3, 1, 3, 3, 1, 2, 0, 4,
       2, 4, 3, 0, 2, 0, 3, 2, 1, 2, 2, 1, 3, 0, 3, 3, 1, 4, 1, 0, 3, 2,
       3, 2, 1, 3, 2, 2, 4, 1, 3, 0, 2, 0, 1, 1, 3, 4, 0, 2, 4, 3, 2, 0,
       1, 0, 4, 4, 3, 1, 2, 1, 4, 0, 4, 0, 0, 2, 0, 0, 0, 1, 4, 0, 3, 4,
       3, 0, 1, 1, 0, 1, 4, 4, 3, 2, 0, 1, 1, 4, 4, 0, 0, 3, 0, 3, 0, 2,
       1, 0, 0, 1, 3, 2, 4, 1, 2, 4, 4, 0, 3, 3, 3, 3, 1, 2, 3, 0, 3, 2,
       1, 1, 2, 0, 1, 4, 0, 3, 1, 2, 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 0, 4,
       2, 2, 1, 3, 2, 4, 4, 3, 4, 1, 1, 0, 2, 0, 4, 0, 3, 0, 0, 4, 3, 4,
       0, 0, 3, 2, 0, 0, 2, 1, 1, 2, 2, 2, 1, 4, 0,

### K-Nearest Neighbors Performance Metrics

In [33]:
# Accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9662921348314607


In [34]:
# Classification Report
report = classification_report(labels_test, labels_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.95      0.96       102
           1       0.99      0.99      0.99        77
           2       0.97      0.99      0.98        84
           3       0.97      1.00      0.99       102
           4       0.95      0.90      0.92        80

    accuracy                           0.97       445
   macro avg       0.97      0.97      0.97       445
weighted avg       0.97      0.97      0.97       445



## Classification: Using Simple Decision Tree Model (DT) -- Out of the Box

In [35]:
# Import libraries
from sklearn.tree import DecisionTreeClassifier

In [36]:
# Initialize the model
model_dt = DecisionTreeClassifier()

In [37]:
# Fit the model
model_dt.fit(features_train, labels_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [38]:
# Predict on the test data
labels_pred = model_dt.predict(features_test)
display(labels_pred)

array([2, 4, 0, 3, 3, 0, 2, 3, 3, 1, 3, 2, 2, 4, 2, 3, 0, 1, 4, 3, 3, 3,
       4, 4, 3, 4, 0, 2, 1, 0, 3, 0, 1, 4, 0, 0, 0, 0, 3, 1, 2, 3, 4, 0,
       1, 0, 4, 1, 0, 0, 4, 2, 3, 3, 3, 2, 4, 4, 0, 4, 3, 3, 0, 2, 1, 3,
       3, 1, 4, 1, 4, 4, 4, 4, 0, 3, 0, 2, 3, 2, 3, 2, 4, 4, 2, 3, 3, 0,
       0, 2, 4, 2, 4, 0, 4, 3, 3, 3, 1, 0, 3, 3, 3, 2, 4, 0, 0, 2, 2, 3,
       4, 3, 0, 0, 2, 4, 4, 0, 3, 3, 3, 2, 2, 0, 3, 1, 3, 3, 0, 3, 4, 4,
       2, 0, 3, 0, 1, 0, 3, 4, 1, 2, 2, 1, 3, 0, 3, 3, 1, 4, 1, 0, 3, 2,
       3, 2, 1, 3, 2, 2, 2, 0, 3, 0, 2, 0, 1, 1, 3, 4, 4, 2, 0, 3, 2, 0,
       1, 0, 4, 0, 3, 1, 2, 1, 2, 0, 4, 0, 2, 2, 0, 2, 0, 1, 4, 0, 3, 4,
       3, 0, 1, 1, 0, 4, 4, 2, 3, 2, 0, 1, 1, 4, 4, 0, 2, 3, 0, 3, 3, 2,
       1, 0, 0, 1, 3, 2, 4, 1, 2, 4, 4, 2, 3, 3, 3, 3, 1, 2, 2, 0, 3, 0,
       1, 1, 0, 0, 1, 4, 2, 3, 1, 2, 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 0, 1,
       2, 3, 3, 3, 3, 4, 4, 3, 4, 1, 1, 0, 2, 1, 4, 0, 3, 0, 0, 4, 3, 4,
       0, 0, 3, 2, 2, 4, 2, 1, 1, 2, 2, 2, 1, 4, 0,

### Simple Decision Tree Performance Metrics

In [39]:
# Accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy:", accuracy)

Accuracy: 0.797752808988764


In [40]:
# Classification Report
report = classification_report(labels_test, labels_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.75      0.75       102
           1       0.90      0.74      0.81        77
           2       0.71      0.79      0.75        84
           3       0.88      0.90      0.89       102
           4       0.79      0.79      0.79        80

    accuracy                           0.80       445
   macro avg       0.80      0.79      0.80       445
weighted avg       0.80      0.80      0.80       445
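
When a model makes this many errors, a confusion matrix shows which classes are mistaken for which. A minimal sketch on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical), using scikit-learn's `confusion_matrix`:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes; one class-1 sample is predicted as class 2
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
```

Calling `confusion_matrix(labels_test, labels_pred)` after any of the prediction cells in this notebook would give the corresponding 5 x 5 matrix.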



## Classification: Using Gaussian Naive Bayes Model (GNB) -- Out of the Box

In [41]:
# Import libraries
from sklearn.naive_bayes import GaussianNB

In [42]:
# Initialize the model
model_gnb = GaussianNB()

In [43]:
# Fit the model
model_gnb.fit(features_train, labels_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [44]:
# Predict on the test data
labels_pred = model_gnb.predict(features_test)
display(labels_pred)

array([1, 2, 0, 3, 3, 0, 2, 3, 3, 1, 0, 2, 2, 4, 2, 3, 0, 2, 1, 3, 3, 3,
       1, 0, 3, 1, 0, 0, 1, 0, 3, 0, 1, 4, 0, 0, 0, 0, 3, 1, 2, 3, 4, 0,
       4, 0, 0, 1, 0, 1, 1, 2, 3, 1, 3, 3, 0, 4, 0, 4, 4, 3, 2, 3, 1, 3,
       3, 4, 4, 1, 4, 0, 4, 4, 0, 3, 1, 2, 3, 0, 3, 2, 4, 4, 2, 3, 3, 0,
       0, 2, 4, 2, 1, 0, 4, 4, 3, 3, 1, 0, 3, 3, 3, 2, 4, 0, 4, 2, 2, 3,
       4, 3, 0, 1, 2, 4, 4, 0, 3, 3, 3, 0, 2, 0, 3, 1, 3, 3, 1, 2, 0, 4,
       2, 4, 3, 2, 2, 0, 3, 2, 1, 2, 2, 1, 3, 0, 3, 3, 1, 4, 2, 0, 3, 2,
       3, 2, 1, 3, 2, 2, 4, 1, 3, 0, 2, 0, 2, 1, 3, 4, 4, 2, 4, 3, 2, 0,
       1, 0, 4, 4, 3, 1, 2, 4, 4, 0, 4, 0, 0, 2, 0, 0, 0, 4, 4, 0, 3, 4,
       3, 0, 1, 1, 0, 1, 4, 0, 3, 2, 0, 1, 2, 4, 4, 0, 0, 3, 0, 3, 0, 2,
       1, 0, 2, 1, 3, 2, 4, 1, 2, 4, 4, 0, 3, 3, 3, 3, 1, 2, 3, 0, 3, 2,
       1, 1, 2, 0, 1, 4, 0, 3, 1, 2, 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 0, 2,
       2, 0, 1, 3, 0, 4, 4, 3, 4, 1, 1, 0, 2, 0, 2, 0, 3, 0, 4, 4, 3, 4,
       0, 0, 3, 2, 4, 0, 2, 1, 1, 2, 2, 2, 1, 4, 0,

### Gaussian Naive Bayes Performance Metrics

In [45]:
# Accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9213483146067416


In [46]:
# Classification Report
report = classification_report(labels_test, labels_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.91      0.92       102
           1       0.98      0.83      0.90        77
           2       0.86      0.93      0.89        84
           3       1.00      0.99      1.00       102
           4       0.85      0.93      0.89        80

    accuracy                           0.92       445
   macro avg       0.92      0.92      0.92       445
weighted avg       0.93      0.92      0.92       445



## Classification With Hyper-Parameters Tuning

### Using Random Forest

Previously, the out-of-the-box Random Forest scored 0.9483 accuracy on the test data.

#### Using GridSearch

In [47]:
# Import Libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

In [48]:
# Define the parameters for the grid
n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 10, 15, 20, 25, 30]
min_samples_split = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
min_samples_leaf = [1, 5, 10, 15]

# Finalize the params for GridSearchCV
hyper_params = dict(
    n_estimators=n_estimators,
    max_depth=max_depth,
    min_samples_split=min_samples_split,
    min_samples_leaf=min_samples_leaf
)
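
Before launching a search, it helps to count how many models the grid implies. The sketch below repeats the grid under a different name (`hyper_params_check`, a hypothetical name) and assumes 3-fold cross-validation as used in this notebook:

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid defined above
hyper_params_check = dict(
    n_estimators=[100, 300, 500, 800, 1200],
    max_depth=[5, 10, 15, 20, 25, 30],
    min_samples_split=[5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    min_samples_leaf=[1, 5, 10, 15],
)

# Every combination of values is one candidate model
n_candidates = len(ParameterGrid(hyper_params_check))

# Each candidate is fitted once per cross-validation fold
n_fits = n_candidates * 3

print(n_candidates, "candidates,", n_fits, "fits")
```

With 5 x 6 x 11 x 4 combinations the search is expensive, so a coarser grid or `RandomizedSearchCV` may be worth considering when run time matters.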

In [49]:
# Initialize the model with GridSearchCV: 3-fold CV
model_rf = RandomForestClassifier(random_state=777)
grid_cv_rf = GridSearchCV(
    model_rf,
    hyper_params,
    cv=3,       # 3-fold cross-validation
    verbose=4,
    n_jobs=-1   # Use all available cores
)

In [50]:
# Fit on the training set to get the best model
best_rf = grid_cv_rf.fit(features_train, labels_train)

Fitting 3 folds for each of 1320 candidates, totalling 3960 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   17.5s

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 213 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 384 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 605 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 874 tasks      | elapsed: 13.6min
[Parallel(n_jobs=-1)]: Done 1193 tasks      | elapsed: 19.5min
[Parallel(n_jobs=-1)]: Done 1560 tasks      | elapsed: 27.4min
[Parallel(n_jobs=-1)]: Done 1977 tasks      | elapsed: 36.1min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 47.5min
[Parallel(n_jobs=-1)]: Done 2957 tasks      | elapsed: 61.7min
[Parallel(n_jobs=-1)]: Done 3520 tasks      | elapsed: 75.5min
[Parallel(n_jobs=-1)]: Done 3960 out of 3

In [51]:
best_rf.best_params_

{'max_depth': 30,
 'min_samples_leaf': 1,
 'min_samples_split': 20,
 'n_estimators': 300}
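
With the default `refit=True`, `GridSearchCV` already refits the best candidate on the full training data, so the tuned model is available as `best_estimator_` without retyping the parameters. A minimal sketch on synthetic data (`X_toy`, `y_toy` and the small grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem, only to keep the example fast
X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=777),
    {"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
)
grid.fit(X_toy, y_toy)

# Best candidate, already refit on all of X_toy
best_model = grid.best_estimator_
print(grid.best_params_, grid.best_score_)

# Predicting through the search object delegates to the same refit model
same_predictions = bool((grid.predict(X_toy) == best_model.predict(X_toy)).all())
```

Note that `best_score_` is the mean cross-validated score on the training folds, which is a different quantity from the test accuracy reported elsewhere in this notebook.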

### Using Random Forest With Best Params From Grid Search

In [53]:
# Initialize the model
model_rf_best = RandomForestClassifier(
    random_state=777,
    max_depth=30,
    min_samples_leaf=1,
    min_samples_split=20,
    n_estimators=300
)

In [54]:
# Fit the model
model_rf_best.fit(features_train, labels_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=30, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=20,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=777,
                       verbose=0, warm_start=False)

In [55]:
# Predict on the test data
labels_pred = model_rf_best.predict(features_test)
display(labels_pred)

array([1, 2, 0, 3, 3, 0, 2, 3, 3, 1, 0, 2, 2, 4, 2, 3, 2, 1, 1, 3, 3, 3,
       1, 0, 3, 1, 0, 2, 1, 0, 3, 0, 1, 4, 0, 0, 0, 0, 3, 1, 2, 3, 4, 0,
       1, 2, 0, 1, 0, 1, 1, 2, 3, 1, 3, 3, 0, 4, 0, 4, 3, 3, 2, 3, 1, 3,
       3, 1, 4, 1, 4, 4, 4, 4, 0, 3, 1, 2, 3, 0, 3, 2, 4, 4, 2, 3, 3, 0,
       0, 2, 4, 2, 3, 0, 4, 4, 3, 3, 1, 0, 3, 3, 3, 2, 4, 0, 0, 2, 2, 3,
       4, 3, 0, 1, 2, 4, 4, 0, 3, 3, 3, 0, 2, 0, 3, 1, 3, 3, 1, 0, 0, 4,
       2, 4, 3, 0, 2, 0, 3, 2, 1, 2, 2, 1, 3, 0, 3, 3, 1, 4, 1, 0, 3, 2,
       3, 2, 1, 3, 2, 2, 4, 3, 3, 0, 2, 0, 1, 1, 3, 4, 0, 2, 4, 3, 2, 0,
       1, 0, 4, 4, 3, 1, 2, 1, 4, 0, 4, 0, 0, 2, 0, 0, 0, 1, 4, 0, 3, 4,
       3, 0, 1, 1, 0, 0, 4, 2, 3, 2, 0, 1, 1, 4, 4, 0, 0, 3, 0, 3, 0, 2,
       1, 0, 0, 1, 3, 2, 4, 1, 4, 4, 3, 0, 3, 3, 3, 3, 1, 2, 3, 0, 3, 2,
       1, 1, 2, 0, 1, 4, 0, 3, 1, 2, 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 0, 4,
       2, 3, 1, 3, 0, 3, 4, 3, 4, 1, 1, 0, 2, 0, 4, 0, 3, 0, 0, 4, 3, 4,
       0, 0, 3, 2, 1, 0, 2, 1, 1, 2, 2, 2, 1, 4, 0,

### Random Forest With Best Params From Grid Search Performance Metrics

In [56]:
# Accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9550561797752809


In [57]:
# Classification Report
report = classification_report(labels_test, labels_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.95       102
           1       1.00      0.94      0.97        77
           2       0.96      0.95      0.96        84
           3       0.93      1.00      0.96       102
           4       0.99      0.89      0.93        80

    accuracy                           0.96       445
   macro avg       0.96      0.95      0.95       445
weighted avg       0.96      0.96      0.95       445



### Using Logistic Regression

Previously, the out-of-the-box Logistic Regression scored 0.9730 accuracy on the test data.

#### Using GridSearch

In [72]:
# Import Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

In [73]:
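# Define the parameters for the grid
# NOTE: the default "lbfgs" solver only supports the "l2" penalty, so the
# "l1" candidates fail to fit and only the "l2" candidates are effectively compared
param_grid = {
    "C": [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    "penalty": ["l1", "l2"]
}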

In [74]:
# Initialize the model with GridSearchCV: 5-fold CV
model_lr = LogisticRegression()
grid_cv_lr = GridSearchCV(
    model_lr,
    param_grid,
    cv=5,       # 5-fold cross-validation
    verbose=4,
    n_jobs=-1   # Use all available cores
)

In [75]:
# Fit on the training set to get the best model
best_lr = grid_cv_lr.fit(features_train, labels_train)

Fitting 5 folds for each of 28 candidates, totalling 140 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 126 tasks      | elapsed:   19.6s
[Parallel(n_jobs=-1)]: Done 140 out of 140 | elapsed:   23.9s finished


In [76]:
best_lr.best_params_

{'C': 0.7, 'penalty': 'l2'}
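
If both penalties are meant to be compared, the solver has to support them. The sketch below is one possible setup, not what this notebook ran: it swaps in the `saga` solver, which handles both `l1` and `l2`, on synthetic data (`X_toy`, `y_toy` and the reduced grid are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic problem, only to keep the example fast
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=5000),  # saga supports l1 and l2
    {"C": [0.1, 1.0], "penalty": ["l1", "l2"]},
    cv=3,
)
grid.fit(X_toy, y_toy)

# A NaN mean score would indicate that a candidate failed to fit
scores = grid.cv_results_["mean_test_score"]
all_finite = bool(np.all(np.isfinite(scores)))
print(grid.best_params_, "all candidates scored:", all_finite)
```

Inspecting `cv_results_["mean_test_score"]` in the same way on `best_lr` would show which candidates in the original search actually produced a score.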

### Using Logistic Regression With Best Params From GridSearch

In [77]:
# Initialize the model
model_lr_best = LogisticRegression(C=0.7, penalty="l2")

In [78]:
# Fit the model
model_lr_best.fit(features_train, labels_train)

LogisticRegression(C=0.7, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [79]:
# Predict on the test data
labels_pred = model_lr_best.predict(features_test)
display(labels_pred)

array([1, 2, 0, 3, 3, 0, 2, 3, 3, 1, 0, 2, 2, 4, 2, 3, 2, 1, 1, 3, 3, 3,
       1, 0, 3, 1, 0, 2, 1, 0, 3, 0, 1, 4, 0, 0, 0, 0, 3, 1, 2, 3, 4, 0,
       1, 2, 0, 1, 0, 1, 1, 2, 3, 1, 3, 3, 0, 4, 0, 4, 3, 3, 2, 3, 1, 3,
       3, 1, 4, 1, 4, 0, 4, 4, 0, 3, 1, 2, 3, 0, 3, 2, 4, 4, 2, 3, 3, 0,
       0, 2, 4, 2, 3, 0, 4, 4, 3, 3, 1, 0, 3, 3, 3, 2, 4, 0, 0, 2, 2, 3,
       4, 3, 0, 1, 2, 4, 4, 0, 3, 3, 3, 0, 2, 0, 3, 1, 3, 3, 1, 2, 0, 4,
       2, 4, 3, 0, 2, 0, 3, 2, 1, 2, 2, 2, 3, 0, 3, 3, 1, 4, 1, 0, 3, 2,
       3, 2, 1, 3, 2, 2, 4, 1, 3, 0, 2, 0, 1, 1, 3, 4, 0, 2, 4, 3, 2, 0,
       1, 0, 4, 4, 3, 1, 2, 1, 4, 0, 4, 0, 0, 2, 0, 0, 0, 1, 4, 0, 3, 4,
       3, 0, 1, 1, 0, 1, 4, 0, 3, 2, 0, 1, 1, 4, 4, 0, 0, 3, 0, 3, 0, 2,
       1, 0, 0, 1, 3, 2, 4, 1, 2, 4, 4, 0, 3, 3, 3, 3, 1, 2, 3, 0, 3, 2,
       1, 1, 2, 0, 1, 4, 0, 3, 1, 2, 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 0, 2,
       2, 2, 1, 3, 0, 4, 4, 3, 4, 1, 1, 0, 2, 0, 4, 0, 3, 0, 0, 4, 3, 4,
       0, 0, 3, 2, 1, 0, 2, 1, 1, 2, 2, 2, 1, 4, 0,

### Logistic Regression With Best Params From Grid Search Performance Metrics

In [80]:
# Accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9730337078651685


In [81]:
# Classification Report
report = classification_report(labels_test, labels_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.99      0.98       102
           1       1.00      0.97      0.99        77
           2       0.95      1.00      0.98        84
           3       0.96      1.00      0.98       102
           4       1.00      0.89      0.94        80

    accuracy                           0.97       445
   macro avg       0.98      0.97      0.97       445
weighted avg       0.97      0.97      0.97       445



### Using K-Nearest Neighbors

Previously, the out-of-the-box K-Nearest Neighbors scored 0.9663 accuracy on the test data.

#### Using GridSearch

In [82]:
# Import libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

In [92]:
# Define the parameters for the grid
param_grid = {
    "n_neighbors": [2, 3, 4, 5, 6, 7],
    "p": [1, 2, 3, 4]
}

In [93]:
# Initialize the model with GridSearchCV: 3-fold CV
model_knn = KNeighborsClassifier()
grid_cv_knn = GridSearchCV(
    model_knn,
    param_grid,
    cv=3,       # 3-fold cross-validation
    verbose=4,
    n_jobs=-1   # Use all available cores
)

In [94]:
# Fit on the training set to get the best model
best_knn = grid_cv_knn.fit(features_train, labels_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed: 15.9min finished


In [95]:
best_knn.best_params_

{'n_neighbors': 7, 'p': 2}
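
The `p` parameter is the power of the Minkowski distance that `KNeighborsClassifier` uses by default: `p=1` is the Manhattan distance and `p=2` the Euclidean distance. A short sketch of the formula (the helper `minkowski` is written here for illustration and is not a scikit-learn function):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d_manhattan = minkowski(a, b, 1)  # |3| + |4|
d_euclidean = minkowski(a, b, 2)  # sqrt(3^2 + 4^2)

print("p=1:", d_manhattan)
print("p=2:", d_euclidean)
```

The search above settled on `p=2`, the same Euclidean distance the out-of-the-box model uses, so the improvement comes from the larger `n_neighbors`.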

### Using K-Nearest Neighbors With Best Params From GridSearch

In [96]:
# Initialize the model
model_knn_best = KNeighborsClassifier(n_neighbors=7, p=2)

In [97]:
# Fit the model
model_knn_best.fit(features_train, labels_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform')

In [98]:
# Predict on the test data
labels_pred = model_knn_best.predict(features_test)
display(labels_pred)

array([1, 2, 0, 3, 3, 0, 2, 3, 3, 1, 0, 2, 2, 4, 2, 3, 2, 1, 1, 3, 3, 3,
       1, 0, 3, 1, 0, 2, 1, 0, 3, 4, 1, 4, 0, 0, 0, 0, 3, 1, 2, 3, 4, 0,
       1, 2, 0, 1, 0, 1, 1, 2, 3, 1, 3, 3, 0, 4, 0, 4, 3, 3, 2, 3, 1, 3,
       3, 1, 4, 1, 4, 0, 4, 1, 0, 3, 1, 2, 3, 0, 3, 2, 4, 4, 2, 3, 3, 0,
       0, 2, 4, 2, 3, 0, 4, 4, 3, 3, 1, 0, 3, 3, 3, 2, 4, 0, 0, 2, 2, 3,
       4, 3, 0, 1, 2, 4, 4, 0, 3, 3, 3, 0, 2, 0, 3, 1, 3, 3, 1, 2, 0, 4,
       2, 4, 3, 4, 2, 0, 3, 2, 1, 2, 2, 1, 3, 0, 3, 3, 1, 4, 1, 0, 3, 2,
       3, 2, 1, 3, 2, 2, 4, 1, 3, 0, 2, 0, 1, 1, 3, 4, 0, 2, 4, 3, 2, 0,
       1, 0, 4, 4, 3, 1, 2, 1, 4, 0, 4, 0, 0, 2, 0, 0, 0, 1, 4, 0, 3, 4,
       3, 0, 1, 1, 0, 4, 4, 0, 3, 2, 0, 1, 1, 4, 4, 0, 0, 3, 0, 3, 0, 2,
       1, 0, 0, 1, 3, 2, 4, 1, 2, 4, 4, 0, 3, 3, 3, 3, 1, 2, 3, 0, 3, 2,
       1, 1, 2, 0, 1, 4, 0, 3, 1, 2, 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 0, 4,
       2, 2, 1, 3, 0, 4, 4, 3, 4, 1, 1, 0, 2, 0, 4, 0, 3, 0, 0, 4, 3, 4,
       0, 0, 3, 2, 1, 0, 2, 1, 1, 2, 2, 2, 1, 4, 0,

### K-Nearest Neighbors With Best Params From Grid Search Performance Metrics

In [99]:
# Accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9730337078651685


In [100]:
# Classification Report
report = classification_report(labels_test, labels_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.97      0.98       102
           1       0.99      0.97      0.98        77
           2       0.98      0.99      0.98        84
           3       0.98      1.00      0.99       102
           4       0.94      0.93      0.93        80

    accuracy                           0.97       445
   macro avg       0.97      0.97      0.97       445
weighted avg       0.97      0.97      0.97       445



### Results

Accuracy scores on the `test` data:

- Random Forest -- Out of the Box: 0.9483
- Logistic Regression -- Out of the Box: 0.9730
- K-Nearest Neighbors -- Out of the Box: 0.9663
- Simple Decision Tree -- Out of the Box: 0.7978
- Gaussian Naive Bayes -- Out of the Box: 0.9213
- Random Forest -- GridSearchCV: 0.9551
- Logistic Regression -- GridSearchCV: 0.9730
- K-Nearest Neighbors -- GridSearchCV: 0.9730
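
For a side-by-side comparison, the scores listed above can be collected into a pandas Series and sorted. The labels in this sketch are shortened by hand; the numbers are copied from the list:

```python
import pandas as pd

# Test accuracies copied from the results list above
scores = {
    "Random Forest (out of the box)": 0.9483,
    "Logistic Regression (out of the box)": 0.9730,
    "K-Nearest Neighbors (out of the box)": 0.9663,
    "Simple Decision Tree (out of the box)": 0.7978,
    "Gaussian Naive Bayes (out of the box)": 0.9213,
    "Random Forest (GridSearchCV)": 0.9551,
    "Logistic Regression (GridSearchCV)": 0.9730,
    "K-Nearest Neighbors (GridSearchCV)": 0.9730,
}

# Highest accuracy first
ranking = pd.Series(scores, name="test_accuracy").sort_values(ascending=False)
print(ranking)
```

Three configurations tie at 0.9730, and tuning left the Logistic Regression score unchanged, so the differences among the top models on 445 test articles amount to only a few documents.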