# First Home Assignment
Made by: <br>
- Diogo Araújo, fc60997 - 2 H <br>
- Joel Oliveira, fc59442 - 3 H <br>
- João Braz, f60419 - 2 H <br>

For this first home assignment, we were tasked with building a few supervised learning models. The data we use comes from the "UCI Superconductivity Data" dataset, which is provided as the "train.csv" and "unique_m.csv" files. <br>

The dependent variable (y) is the last column of our data, which is the "critical_temp" column. <br>

## Import Libraries and Data
In this initial section, we only import the libraries needed for the coming objectives and load the data from the ".csv" files. <br>

In [59]:
# Import all the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from time import time
from scipy.sparse import dok_matrix
from scipy.sparse.linalg import svds
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, max_error, ConfusionMatrixDisplay, confusion_matrix, matthews_corrcoef


np.seterr(divide='ignore', invalid='ignore');

# Get the data from the .csv files
df_train = pd.read_csv('train.csv') #PCA -> dense
df_unique = pd.read_csv('unique_m.csv') #SVD -> sparse

# Merge both of the datasets
df = df_train.merge(df_unique, left_index=True, right_index=True)

## Objective 1
In this objective, we aim to create the dimensionality reduction model for our data. For this, we will use PCA (Principal Component Analysis) and then analyze its results. <br>

We will use StandardScaler for scaling and PCA for reducing our data. However, these transformations are only applied during model fitting, which makes our models easier to compare. The data itself therefore stays untransformed in this section, where we only create our scaling and dimensionality reduction models. <br>

In [60]:
# Check the dataframe for missing values (doesn't appear to have any)
print("Dataframe Missing Values:", df.isna().sum().sum())

# Get the X and Y from the dataframe
x = df.drop(columns=["critical_temp_x", "critical_temp_y", "material"])
y = df.critical_temp_x

# Split the data into training and testing sets (Full data and Dimensionality Reduction data)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=50)

# Check if merge of both dataframes is sparse or dense (on training set)
print(f"Percentage of 0's in the X matrix = {(x_train==0).sum().sum() / np.prod(x_train.shape):.3f}")

# Create the dimensionality reduction model (PCA, as df is dense; n_components=0.90 keeps at least 90% of the variance)
pca = PCA(n_components=0.90, svd_solver="full")

# Fit the model with the training data
pca.fit(x_train)

# Check model for number of components to reach explainability
# (note: the PCA above is fitted on the unscaled training data)
print(f"Number of components to reach 90% explainability with unscaled dataset = {len(pca.explained_variance_)}")


Dataframe Missing Values: 0
Percentage of 0's in the X matrix = 0.495
Number of components to reach 90% explainability with unscaled dataset = 2


### Discuss Results (1)
We start this exercise by checking the dataframe for missing values. As we found none, we moved on to splitting the data. The scaler and the PCA are fitted on the training data only, so the testing data plays no part in defining the transformation. <br>

After this, as can be seen from the results obtained above, we checked the fraction of values equal to 0 in the training split of our merged dataframe, which is 0.495. Since slightly more than half of the values are not equal to 0, we consider our data a dense matrix and, as such, we decided to use the PCA dimensionality reduction model, as it is more effective on dense matrices than SVD. The data is scaled with StandardScaler before the dimensionality reduction (using PCA) is performed. Note that the pipelines use StandardScaler with its default settings, which center every feature, so the values equal to 0 are not preserved after scaling. For the PCA model, we keep the smallest number of components that explains at least 90% of the variance. <br>
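
As a rough illustration of the dense versus sparse decision, the sketch below measures the fraction of zeros in a matrix and, for a mostly-zero matrix, applies `TruncatedSVD`, the scikit-learn reducer that accepts sparse input without centering it. The synthetic matrix, the 0.95 threshold and the 5 components are assumptions made for this example and are not taken from our data. <br>

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic matrix (assumption): roughly 95% of the entries are set to zero
rng = np.random.default_rng(0)
dense = rng.random((200, 50))
dense[dense < 0.95] = 0.0
x_sparse = csr_matrix(dense)

# Fraction of zeros, computed from the number of stored non-zero entries
zero_fraction = 1.0 - x_sparse.nnz / np.prod(x_sparse.shape)
print(f"Fraction of zeros = {zero_fraction:.3f}")

# TruncatedSVD works directly on the sparse matrix (no centering step)
svd = TruncatedSVD(n_components=5, random_state=0)
x_reduced = svd.fit_transform(x_sparse)
print("Reduced shape =", x_reduced.shape)
```

With a zero fraction close to 0.5, as in our case, the memory advantage of a sparse representation is small, which is consistent with treating the data as dense. <br>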

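We then checked how many components are needed to reach this level of explained variance. The cell above fits PCA on the unscaled training data and reports only 2 components, most likely because a few large-magnitude features dominate the unscaled variance. With standardization, as in the pipelines of Objective 2, 66 components are needed. This is a good decrease in the number of features, as originally there were 167 (the 170 merged columns minus the two target columns and the material column). <br>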
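
The number of components PCA needs for a given variance threshold depends strongly on whether the features are standardized first. The following is a minimal sketch on synthetic data (not our dataset), assuming that one feature has a much larger scale than the others. <br>

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 independent features, one on a much larger scale
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(500, 10))
x_demo[:, 0] *= 1000.0

# Same PCA settings as in this notebook, with and without standardization
pca_raw = PCA(n_components=0.90, svd_solver="full").fit(x_demo)
pca_scaled = PCA(n_components=0.90, svd_solver="full").fit(
    StandardScaler().fit_transform(x_demo)
)

print("Components without scaling =", pca_raw.n_components_)
print("Components with scaling    =", pca_scaled.n_components_)
```

Without scaling, the large-scale feature alone carries almost all of the variance, so a single component is enough. After scaling, every feature contributes equally and more components are required. <br>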

## Objective 2
In this objective, we have to create the regression and classification models. <br>

### 2.1)
For this part, we are making the regression model. We will be using Decision Trees (DT), in their sklearn implementation, as PCA components are decorrelated from each other. As such, with no linear relation between components, we don't expect good performance from the Linear Regression (LR) model. <br>
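
The decorrelation of PCA components can be checked numerically. The sketch below uses four strongly correlated synthetic features (an assumption for illustration, not our data) and compares the largest off-diagonal correlation before and after PCA. <br>

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): four noisy copies of the same signal
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
x_corr = np.hstack([base + 0.1 * rng.normal(size=(300, 1)) for _ in range(4)])

# Correlation matrices of the original features and of the PCA components
corr_before = np.corrcoef(x_corr, rowvar=False)
corr_after = np.corrcoef(PCA().fit_transform(x_corr), rowvar=False)

# Look only at the off-diagonal entries
off_diag = ~np.eye(4, dtype=bool)
max_before = np.abs(corr_before[off_diag]).max()
max_after = np.abs(corr_after[off_diag]).max()
print(f"Max |corr| before PCA = {max_before:.3f}")
print(f"Max |corr| after PCA  = {max_after:.2e}")
```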

The complexity in this exercise will come from training the Decision Tree model. <br>

In [61]:
# Create (and test the time) for the Regression Models (for full and reduced data)
t = time()
dtr = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", DecisionTreeRegressor())
]).fit(x_train, y_train)
print(f"Train time without PCA = {time() - t:.3f}")

t = time()
dtr_pca = Pipeline([
    ("scaler", StandardScaler()),
    ("dim", PCA(n_components=0.9, svd_solver="full")),
    ("regressor", DecisionTreeRegressor())
]).fit(x_train, y_train)
print(f"Train time with PCA = {time() - t:.3f}")

# Print out the number of elements in each dataset (check reduction)
x_train_reduced = PCA(n_components=0.9, svd_solver="full").fit_transform(
    StandardScaler().fit_transform(x_train)
)
print("Full dataset number of elements = ", np.prod(x_train.shape),
      "\nReduced dataset number of elements = ", np.prod(x_train_reduced.shape))

# Get predictions for model
t = time()
preds = dtr.predict(x_test)
print(f"Prediction time without PCA = {time() - t:.3f}")

t=time()
preds_pca = dtr_pca.predict(x_test)
print(f"Prediction time with PCA = {time() - t:.3f}", end="\n"+"-"*120+"\n")

# Compute the evaluation values (RMSE, max error and Pearson correlation for full and dim. reduced data)
# RMSE is computed as sqrt(MSE), as the `squared` argument is deprecated in recent scikit-learn versions
rmse = np.sqrt(mean_squared_error(y_test, preds))
max_err = max_error(y_test, preds)
pearson_r = np.corrcoef(y_test, preds)[0,1]

rmse_pca = np.sqrt(mean_squared_error(y_test, preds_pca))
max_err_pca = max_error(y_test, preds_pca)
pearson_r_pca = np.corrcoef(y_test, preds_pca)[0,1]

# Print out the evaluation values for the data (comparison)
print(f"Full Dataset RMSE = {rmse:.3f}", "\t"*5,  f"| Reduced Dataset RMSE = {rmse_pca:.3f}")
print(f"Full Dataset Max. Error = {max_err:.3f}", "\t"*4, f"| Reduced Dataset Max. Error = {max_err_pca:.3f}")
print(f"Full Dataset Pearson Corr. = {pearson_r:.3f}", "\t"*4, f"| Reduced Dataset Pearson Corr = {pearson_r_pca:.3f}")

Train time without PCA = 3.163
Train time with PCA = 3.456
Full dataset number of elements =  3195712 
Reduced dataset number of elements =  1262976
Prediction time without PCA = 0.020
Prediction time with PCA = 0.023
------------------------------------------------------------------------------------------------------------------------
Full Dataset RMSE = 11.774 					 | Reduced Dataset RMSE = 12.106
Full Dataset Max. Error = 114.400 				 | Reduced Dataset Max. Error = 87.400
Full Dataset Pearson Corr. = 0.941 				 | Reduced Dataset Pearson Corr = 0.938


### Discuss Results (2.1)
As can be seen from the previous results, we can compare / evaluate the various results while using the full dataset and the dimensionality reduction dataset. <br>

From the initial lines, we can see that the training times of the two models are quite similar (3.163 s without PCA against 3.456 s with PCA), so the complexity won't come from this part. The same can be said for the prediction time. Although the difference between the two would increase with a lot more elements, it is not something to take into account in our case. <br>
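
The timings above come from a single run each, so small differences may partly be noise. One possible way to get more stable numbers is to repeat the call and keep the median, as in this sketch. The helper `median_runtime` and the workload are hypothetical and not part of the assignment code. <br>

```python
from statistics import median
from time import perf_counter


def median_runtime(func, repeats=5):
    """Return the median wall-clock time (in seconds) of calling func()."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        func()
        timings.append(perf_counter() - start)
    return median(timings)


# Example workload (assumption): any zero-argument callable can be timed,
# for instance `lambda: dtr.predict(x_test)` in this notebook
elapsed = median_runtime(lambda: sum(i * i for i in range(100_000)))
print(f"Median runtime = {elapsed:.4f} s")
```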

The total size of the data also changes considerably, from 3195712 elements in the full training data to 1262976 elements in the reduced data, which is roughly 40% of the original size. <br>

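The difference in regression quality is also minimal between the full and reduced datasets. The RMSE stayed nearly the same (11.774 against 12.106), as did the Pearson correlation (0.941 against 0.938), while the maximum error is lower on the reduced dataset (87.400 against 114.400). <br>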
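
For reference, the three metrics used above can be reproduced by hand on a small example. The values below are made up for illustration, and the manual results are compared against the sklearn functions. <br>

```python
import numpy as np
from sklearn.metrics import max_error, mean_squared_error

# Made-up targets and predictions (assumption, not taken from our data)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 35.0])

errors = y_true - y_pred
rmse_manual = np.sqrt(np.mean(errors ** 2))          # root of the mean squared error
max_err_manual = np.max(np.abs(errors))              # largest absolute error
pearson_manual = np.corrcoef(y_true, y_pred)[0, 1]   # linear correlation

# The manual values agree with the sklearn implementations
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(max_err_manual, max_error(y_true, y_pred))

print(f"RMSE = {rmse_manual:.3f}, Max. Error = {max_err_manual:.3f}, Pearson = {pearson_manual:.3f}")
```

The RMSE averages over all samples, while the maximum error depends on a single worst prediction, which is why the two can move in different directions between models. <br>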

With these results, it is preferable to use the reduced dataset, as the two perform very similarly but the reduced one is smaller. <br>

### 2.2)
For this part, we are making the classification model. From the exercise, we were instructed to use both NaiveBayes (NB) and DecisionTrees (DT) for the full and reduced datasets. <br>

We will also have to convert the dependent variable into classes. These are VeryLow [0.0, 1.0), Low [1.0, 5.0), Medium [5.0, 20.0), High [20.0, 100.0) and VeryHigh (100.0 and above). We will then create two models, one with the full dataset (direct variables) and the other with the projection of the full data onto a smaller data space (dimensionality reduction). These results will then be compared and discussed. <br>
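
The same binning can also be expressed with `pandas.cut`, as an alternative to a hand-written function. The sketch below assumes the thresholds used by the `to_class` function in the next cell (1, 5, 20 and 100, with the lower bound of each interval included), and the sample temperatures are made up. <br>

```python
import numpy as np
import pandas as pd

# Made-up temperatures (assumption), chosen to sit on and around the thresholds
temps = pd.Series([0.5, 1.0, 4.9, 5.0, 19.9, 20.0, 99.9, 100.0, 150.0])

# right=False makes every bin closed on the left: [0, 1), [1, 5), ..., [100, inf)
classes = pd.cut(
    temps,
    bins=[0, 1, 5, 20, 100, np.inf],
    labels=["VeryLow", "Low", "Medium", "High", "VeryHigh"],
    right=False,
)
print(pd.DataFrame({"critical_temp": temps, "class": classes}))
```

Values below 0 fall outside every bin and become missing values, which matches the `np.nan` fallback of `to_class`. <br>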

Our first objective, for this exercise, was to compare the models on the full dataset. After that, we would compare both of the models using the reduced dataset. <br>

In [62]:
# Create a function to add classes
def to_class(x: float) -> str:
    if 0 <= x < 1.0:
        return "VeryLow"
    elif 1 <= x < 5.0:
        return "Low"
    elif 5 <= x < 20.0:
        return "Medium"
    elif 20 <= x < 100.0:
        return "High"
    elif x >= 100:
        return "VeryHigh"
    return np.nan

# Add classes to our dependent variable (apply function)
y_train_class = y_train.apply(to_class)
y_test_class = y_test.apply(to_class)

## FULL DATA (not dim. reduced)
print("-----------FULL DATASET------------")

# Create (and compare training time) for both models 
t = time()
dtc = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", DecisionTreeClassifier())
]).fit(x_train, y_train_class)
print(f"Decision Tree train time (no PCA) = {time() - t:.3f}")

t = time()
nb = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", GaussianNB())
]).fit(x_train, y_train_class)
print(f"Naive Bayes train time (no PCA) = {time() - t:.3f}")

# Get predictions from models (and check time)
t = time()
dtc_preds = dtc.predict(x_test)
print(f"Decision Tree predict time (no PCA) = {time() - t:.3f}")

t=time()
nb_preds = nb.predict(x_test)
print(f"Naive Bayes predict time (no PCA) = {time() - t:.3f}")

# Compute model evaluations
dtc_mfcc = matthews_corrcoef(y_test_class,dtc_preds)
nb_mfcc= matthews_corrcoef(y_test_class, nb_preds)

# Print out model evaluations (comparison)
print(f"Decision Tree MFCC = {dtc_mfcc:.3f}", "\t"*5,  f"| Naive Bayes MFCC = {nb_mfcc:.3f}")

## REDUCED DATA (PCA reduced)
print("-----------REDUCED DATASET------------")

# Create (and compare training time) for both models (also fit)
t = time()
dtc = Pipeline([
    ("scaler", StandardScaler()),
    ("dim", PCA(n_components=0.9, svd_solver="full")),
    ("classifier", DecisionTreeClassifier())
]).fit(x_train, y_train_class)
print(f"Decision Tree train time (with PCA) = {time() - t:.3f}")

t = time()
nb = Pipeline([
    ("scaler", StandardScaler()),
    ("dim", PCA(n_components=0.9, svd_solver="full")),
    ("classifier", GaussianNB())
]).fit(x_train, y_train_class)
print(f"Naive Bayes train time (with PCA) = {time() - t:.3f}")

# Get predictions from models (and check time)
t = time()
dtc_preds = dtc.predict(x_test)
print(f"Decision Tree predict time (with PCA) = {time() - t:.3f}")

t = time()
nb_preds = nb.predict(x_test)
print(f"Naive Bayes predict time (with PCA) = {time() - t:.3f}")

# Compute model evaluations
dtc_mfcc = matthews_corrcoef(y_test_class, dtc_preds)
nb_mfcc= matthews_corrcoef(y_test_class, nb_preds)

# Print out model evaluations (comparison)
print(f"Decision Tree MFCC = {dtc_mfcc:.3f}", "\t"*5,  f"| Naive Bayes MFCC = {nb_mfcc:.3f}")


-----------FULL DATASET------------
Decision Tree train time (no PCA) = 5.075
Naive Bayes train time (no PCA) = 0.176
Decision Tree predict time (no PCA) = 0.010
Naive Bayes predict time (no PCA) = 0.019
Decision Tree MFCC = 0.803 					 | Naive Bayes MFCC = 0.325
-----------REDUCED DATASET------------
Decision Tree train time (with PCA) = 4.501
Naive Bayes train time (with PCA) = 0.809
Decision Tree predict time (with PCA) = 0.013
Naive Bayes predict time (with PCA) = 0.033
Decision Tree MFCC = 0.783 					 | Naive Bayes MFCC = 0.195


### Discuss Results (2.2)
Starting with the full dataset, the first two lines show how large the difference in training time between our models is: 5.075 s for the DT model against 0.176 s for the NB model. Given the small size of the data, this is an extremely noticeable difference, and it will only increase with more data. <br>

However, from the next lines we can tell that the DT model is actually faster at making predictions than the NB model (0.010 s against 0.019 s). This is because it only has to traverse the tree, which has a lower complexity than calculating the probabilities / likelihoods of every class. The DT model also obtained a much better MCC (Matthews correlation coefficient) score, labelled MFCC in the output: 0.803 against 0.325. <br>
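
To show why the Matthews correlation coefficient is informative when the classes are unbalanced, the sketch below compares it with accuracy on a tiny example. The labels and class counts are made up for illustration. <br>

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Made-up labels (assumption): 8 samples of one class and 2 of the other
y_true_demo = ["High"] * 8 + ["Low"] * 2

# A classifier that always predicts the majority class
y_pred_majority = ["High"] * 10
# A classifier that finds both "Low" samples but mislabels one "High" sample
y_pred_mixed = ["High"] * 7 + ["Low"] + ["Low"] * 2

acc_majority = accuracy_score(y_true_demo, y_pred_majority)
mcc_majority = matthews_corrcoef(y_true_demo, y_pred_majority)
acc_mixed = accuracy_score(y_true_demo, y_pred_mixed)
mcc_mixed = matthews_corrcoef(y_true_demo, y_pred_mixed)

print(f"Majority-only: accuracy = {acc_majority:.3f}, MCC = {mcc_majority:.3f}")
print(f"Mixed:         accuracy = {acc_mixed:.3f}, MCC = {mcc_mixed:.3f}")
```

The majority-only classifier still reaches a high accuracy, but its MCC shows that it carries no information about the classes. <br>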

Moving on to the reduced dataset (after the separator line), we can see that the DT training time was nearly the same (4.501 s), while the added PCA step had a bigger impact on the NB model, which took significantly more time to train (0.809 s). <br>

Even with the decorrelated variables produced by PCA, the NB model did not perform well, and its MCC even dropped to 0.195. This is most likely because decision trees are flexible function approximators that performed better, as they can adapt to many more kinds of relation in the data. <br>

Comparing both decision trees, as with the regression, we can see similar results on both datasets (an MCC of 0.803 against 0.783). Therefore, using the reduced dataset is advisable for the same reasons given previously. <br>