# **ICS5110 Notebook**

View the web page for this project [here](https://mkenely.com/ics5110).

- [Feature Reference](https://mkenely.com/ics5110/features)
- [Feature Distributions](https://mkenely.com/ics5110/distributions)
- [Correlation Matrix](https://mkenely.com/ics5110/correlation_matrix)
- [Feature vs G3 Scatter Plots](https://mkenely.com/ics5110/scatter_plots)


### **Imports**

In [1]:
import os
import sys

import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import pickle

from gradio_implementations import pca_gradio



### **Data**

In [2]:
portugese_df = pd.read_csv('./data/Portuguese.csv')

le = LabelEncoder()
encoding_mappings = {}

for column in portugese_df.columns:
    if portugese_df[column].dtype == 'object':
        portugese_df[column] = le.fit_transform(portugese_df[column])
        encoding_mappings[column] = {index: label for index, label in enumerate(le.classes_)}

# Features: every column except the target G3 and the grade columns G1 and G2
X = portugese_df.drop(columns=['G1', 'G2', 'G3'])

y = portugese_df['G3']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
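The `encoding_mappings` dictionary built above stores, for each encoded column, the integer code assigned to each original label. A minimal sketch of how such a mapping could be used to decode a column again is given below. The toy frame, its values, and the names `toy_df` and `toy_mappings` are illustrative assumptions and are not part of the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data (values are illustrative only)
toy_df = pd.DataFrame({'school': ['GP', 'MS', 'GP'], 'age': [15, 17, 16]})
original_school = toy_df['school'].tolist()

# Encode one categorical column and record the code -> label mapping,
# mirroring the dictionary comprehension used in the cell above
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(toy_df['school'])
toy_mappings = {
    'school': {int(index): str(label) for index, label in enumerate(le_demo.classes_)}
}

# Decode the integer codes back to the original labels
decoded = pd.Series(encoded).map(toy_mappings['school']).tolist()
print(toy_mappings)
print(decoded)
```

`LabelEncoder` assigns codes in sorted label order, so the mapping is only valid for the fit that produced it, which is why it is captured immediately after each `fit_transform`.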

### **Models**

#### **PCA**

**Imports**

In [3]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

**Standardisation**
1. Subtract the mean from each variable
2. Divide by the standard deviation

In [4]:
X_s = X.copy()

for column in X.columns:
    X_s[column] = (X[column] - X[column].mean()) / X[column].std()
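As a quick check of the two steps above, the sketch below standardises a small toy frame and confirms that each column ends up with mean 0 and standard deviation 1. The frame and its values are made up for illustration. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, so the result differs slightly from scalers that divide by the population standard deviation.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
demo = pd.DataFrame({'age': [15, 16, 17, 18, 19], 'absences': [0, 2, 4, 6, 8]})

# Step 1: subtract the mean; step 2: divide by the standard deviation
demo_s = (demo - demo.mean()) / demo.std()  # pandas std() defaults to ddof=1

means = demo_s.mean()
stds = demo_s.std()
print(means.round(6).to_dict())
print(stds.round(6).to_dict())
```

A column with zero variance would divide by zero here, so constant columns would need to be dropped or handled separately.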

**Find number of components required to achieve two levels of explained variance: 95% and 90%**

In [5]:
accepted_v1 = 0.95
accepted_v2 = 0.90

pca = PCA()
pca.fit(X_s)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_1 = np.argmax(cumulative_variance >= accepted_v1) + 1
n_components_2 = np.argmax(cumulative_variance >= accepted_v2) + 1
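The component count comes from the first position at which the cumulative explained variance reaches the threshold. The sketch below walks through the same `np.cumsum` / `np.argmax` logic on a made-up variance vector. The ratios and thresholds are illustrative assumptions, not values from the dataset.

```python
import numpy as np

# Hypothetical explained-variance ratios for five components (sum to 1)
ratios = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
cumulative = np.cumsum(ratios)  # approximately [0.5, 0.8, 0.9, 0.96, 1.0]

# argmax on a boolean array returns the index of the first True;
# adding 1 converts the zero-based index into a component count
n_95 = int(np.argmax(cumulative >= 0.95) + 1)
n_85 = int(np.argmax(cumulative >= 0.85) + 1)
print(n_95, n_85)
```

One caveat of this idiom: if no entry meets the threshold, `np.argmax` returns 0 and the expression yields 1, so it relies on the cumulative sum actually reaching the requested level.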

In [6]:
print(f'{"Original features:":<1}{len(X.columns):>15}')
print(f'{"PCA components for variance 1: ":<1}{n_components_1:>2}')
print(f'{"PCA components for variance 2: ":<1}{n_components_2:>2}')

Original features:             30
PCA components for variance 1: 27
PCA components for variance 2: 24


**Fit a baseline model on the original features, as well as two models on PCA-reduced features at the two variance levels**

In [7]:
pca_1 = PCA(n_components=n_components_1)
pca_2 = PCA(n_components=n_components_2)

X_train_pca_1 = pca_1.fit_transform(X_train)
X_test_pca_1 = pca_1.transform(X_test)
X_train_pca_2 = pca_2.fit_transform(X_train)
X_test_pca_2 = pca_2.transform(X_test)

normal_model = LinearRegression()
pca_model_1 = LinearRegression()
pca_model_2 = LinearRegression()

normal_model.fit(X_train, y_train)
pca_model_1.fit(X_train_pca_1, y_train)
pca_model_2.fit(X_train_pca_2, y_train)
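In the cells above, the component counts were derived from the standardised `X_s`, whereas the PCA objects are fitted on the unstandardised `X_train`. One possible way to keep scaling, projection and regression consistent is a scikit-learn pipeline fitted on the training split only. The sketch below runs on synthetic data, so the names `X_demo`, `y_demo` and `pipe`, and all resulting numbers, are assumptions for illustration and say nothing about the notebook's results. Passing a float to `n_components` asks `PCA` to keep enough components to reach that share of explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scaling and PCA are fitted on the training split only, then reused on the test split
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver='full'),
    LinearRegression(),
)
pipe.fit(X_tr, y_tr)

n_kept = pipe.named_steps['pca'].n_components_
r2 = pipe.score(X_te, y_te)
print(n_kept, round(r2, 3))
```

Fixing `random_state` in `train_test_split` also makes the split, and therefore the reported scores, reproducible between runs.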

**Results**

In [8]:
normal_model_size = sys.getsizeof(pickle.dumps(normal_model))
pca_model_1_size = sys.getsizeof(pickle.dumps(pca_model_1))
pca_model_2_size = sys.getsizeof(pickle.dumps(pca_model_2))

normal_model_size_kb = normal_model_size / 1024
pca_model_1_size_kb = pca_model_1_size / 1024
pca_model_2_size_kb = pca_model_2_size / 1024
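`sys.getsizeof` reports the in-memory size of the `bytes` object returned by `pickle.dumps`, which includes a small fixed object overhead on top of the serialised payload. If the size of the serialised model itself is of interest, `len()` on the pickled bytes is an alternative. The sketch below compares the two on a stand-in object, not on one of the fitted models.

```python
import pickle
import sys

# Stand-in object; any picklable model could be used instead
payload = pickle.dumps({'coef': [0.1, 0.2, 0.3], 'intercept': 1.5})

serialized_bytes = len(payload)          # size of the pickle stream itself
in_memory_bytes = sys.getsizeof(payload) # pickle stream plus bytes-object overhead
overhead = in_memory_bytes - serialized_bytes

print(serialized_bytes, in_memory_bytes, overhead)
```

The overhead is constant per object, so it does not change the ranking of the three models, but it slightly inflates each absolute figure.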

In [9]:
# LinearRegression.score returns R² (coefficient of determination); it is reported here as 'accuracy'
normal_score = normal_model.score(X_test, y_test)
pca_1_score = pca_model_1.score(X_test_pca_1, y_test)
pca_2_score = pca_model_2.score(X_test_pca_2, y_test)

results_df = pd.DataFrame({
    'model': ['Normal', 'PCA 95%', 'PCA 90%'],
    'accuracy': [normal_score, pca_1_score, pca_2_score],
    'model_size (KB)': [normal_model_size_kb, pca_model_1_size_kb, pca_model_2_size_kb],
    'relative_accuracy': [1, pca_1_score / normal_score, pca_2_score / normal_score]
})

results_df['accuracy'] = results_df['accuracy'].apply(lambda x: round(x, 3))
results_df['model_size (KB)'] = results_df['model_size (KB)'].apply(lambda x: round(x, 3))
results_df['relative_accuracy'] = results_df['relative_accuracy'].apply(lambda x: round(x, 3))

results_df.set_index('model', inplace=True)
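The `accuracy` column holds the value returned by `LinearRegression.score`, which is the coefficient of determination R² rather than a classification accuracy. The sketch below computes R² by hand on made-up numbers to show what the figure measures. The arrays are illustrative assumptions only.

```python
import numpy as np

# Hypothetical true grades and predictions (illustrative values only)
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 16.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20
r2_manual = 1.0 - ss_res / ss_tot               # 1 - 2/20 = 0.9

print(r2_manual)
```

Because R² can be close to zero or even negative, the ratio used for `relative_accuracy` is easiest to interpret when the baseline score is clearly positive.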

In [10]:
results_df.head()

| model | accuracy | model_size (KB) | relative_accuracy |
|---|---|---|---|
| Normal | 0.231 | 1.253 | 1.0 |
| PCA 95% | 0.224 | 0.849 | 0.971 |
| PCA 90% | 0.199 | 0.802 | 0.863 |


### **Gradio**

#### **PCA**

In [11]:
pca_gradio.make_gradio(
    [normal_model, pca_model_1, pca_model_2],
    [pca_1, pca_2],
    [normal_model_size_kb, pca_model_1_size_kb, pca_model_2_size_kb],
)

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


