In [1]:
# Paul-Jason Mello
# Professor Shim
# CMPE 257
# April 21st, 2022

# Dimensionality reduction and PCA

In [2]:
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

## 1. Discuss curse of dimensionality

In [3]:
# The curse of dimensionality says that as the number of dimensions grows, the data becomes sparser: a fixed number of
# samples is spread over an exponentially larger space. Adding more features therefore does not always improve
# accuracy. If each of D features is split into K bins, the space has K^D cells, so the number of training samples
# needed to cover it grows exponentially with D.
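The K^D growth can be made concrete with a small sketch. The values K = 10 bins and 1,000 samples are illustrative assumptions, not figures from this notebook.

```python
# Number of grid cells when each of D features is split into K bins: K ** D.
# K and N_SAMPLES are illustrative assumptions, not values from the notebook.
K = 10
N_SAMPLES = 1000


def cells(k: int, d: int) -> int:
    """Return the number of cells in a grid with k bins along each of d dimensions."""
    return k ** d


for d in (1, 2, 3, 11):
    n_cells = cells(K, d)
    # Average number of samples per cell if the samples were spread evenly.
    density = N_SAMPLES / n_cells
    print(f"D = {d:2d}: {n_cells:,} cells, {density:.2e} samples per cell")
```

With a fixed sample count, the average number of samples per cell shrinks by a factor of K for every added dimension, which is the sparsity described above.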

## 2. Discuss any 3 dimensionality reduction techniques

In [4]:
# Forward Selection
# 
# Forward selection starts with an empty model and adds features one at a time, each time choosing the feature that
# improves the model the most. The accuracy is tested after every addition, and the process stops once adding a feature
# no longer improves the model. The best combination of columns found is then used as the model's features. This
# reduces dimensionality by keeping only the columns most relevant to the target variable.

# Backward Selection
# 
# Backward selection applies the same process in reverse: it starts with all features and removes them from the model
# one at a time, dropping the least useful feature at each step.
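A minimal sketch of greedy forward selection is shown below. It assumes scikit-learn's built-in `load_wine` data (a different, smaller wine dataset than the UCI red wine file used later), a decision tree as the scoring model, and 3-fold cross-validated accuracy as the criterion; all three are illustrative choices, not the method used in this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the small built-in wine dataset stands in for the real data.
X, y = load_wine(return_X_y=True)


def cv_score(columns):
    """Mean 3-fold cross-validated accuracy using only the given column indices."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=3).mean()


selected = []                       # column indices chosen so far
remaining = list(range(X.shape[1]))
best_score = 0.0

while remaining:
    # Score every candidate feature when added to the current selection.
    scores = {c: cv_score(selected + [c]) for c in remaining}
    candidate = max(scores, key=scores.get)
    # Stop as soon as the best candidate no longer improves the model.
    if scores[candidate] <= best_score:
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_score = scores[candidate]

print("Selected columns:", selected)
print("Cross-validated accuracy:", round(best_score, 3))
```

Backward selection would mirror this loop: start with every column in `selected` and repeatedly drop the column whose removal hurts the score the least.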

# ANOVA
# 
# ANOVA (analysis of variance) is a statistical test of whether the mean of a variable differs significantly between
# groups. For feature selection, each feature is tested against the target classes, and the reduction comes from
# removing the columns whose differences are not statistically significant.
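One way to apply ANOVA for feature selection is scikit-learn's `f_classif` score combined with `SelectKBest`. The sketch below again assumes the built-in `load_wine` data as a stand-in, and `k = 5` is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: the small built-in wine dataset stands in for the real data.
data = load_wine()
X, y = data.data, data.target

# Keep the 5 features with the highest ANOVA F-statistic against the class label.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

# Report the F-statistic and p-value computed for every feature.
for name, f_value, p_value in zip(data.feature_names, selector.scores_, selector.pvalues_):
    print(f"{name:30s} F = {f_value:8.2f}  p = {p_value:.3g}")

print(X.shape, "->", X_reduced.shape)
```

A p-value threshold could be used instead of a fixed `k` if the goal is to drop only the statistically insignificant columns.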

## 3. Explain PCA in your own words

In [5]:
# PCA
# 
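# Principal Component Analysis is a statistical method for reducing dimensionality. It applies an orthogonal
# transformation that turns correlated variables into a new set of uncorrelated variables (the principal components),
# ordered by how much of the data's variance each one captures. Keeping only the first few components reduces the number
# of dimensions while retaining most of the variance, which can also help machine learning models generalize.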
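The steps behind PCA can be sketched directly with NumPy: center the data, compute the covariance matrix, take its eigenvectors, and project onto the leading ones. The synthetic two-column data below is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strongly correlated 2-D data (illustrative only).
x = rng.normal(size=200)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=200)])

# 1. Center each column at zero.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Share of the total variance captured by each component.
explained_ratio = eigenvalues / eigenvalues.sum()

# 5. Project onto the first principal component.
projected = X_centered @ eigenvectors[:, :1]

print("Explained variance ratio:", explained_ratio)
print("Projected shape:", projected.shape)
```

The eigenvectors are orthogonal to each other, which is the "orthogonal transformation" mentioned above; `sklearn.decomposition.PCA` performs an equivalent computation via SVD.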

## 4. Perform classification on wine dataset using any model of your choice using all the features

In [6]:
# Using the red wine data downloaded from the uci website found below
# https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

wine = pd.read_csv("winequality-red.csv", sep =";")
wine.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5
6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7
8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5


In [7]:
print(wine.nunique())

fixed acidity            96
volatile acidity        143
citric acid              80
residual sugar           91
chlorides               153
free sulfur dioxide      60
total sulfur dioxide    144
density                 436
pH                       89
sulphates                96
alcohol                  65
quality                   6
dtype: int64


In [8]:
# No missing values in any column

print(wine.isnull().sum())

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


In [9]:
wineTarget = pd.DataFrame()
wineTarget["quality"] = wine["quality"]
wine = wine.drop("quality", axis = 1)

In [10]:
x_train, x_test, y_train, y_test = train_test_split(wine, wineTarget)

In [11]:
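model = RandomForestClassifier()
# y_train is a one-column DataFrame; flatten it to a 1-D array, which is the shape scikit-learn expects for labels
model.fit(x_train, y_train.values.ravel())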


RandomForestClassifier()

In [12]:
accuracy = model.score(x_test, y_test)
print("Test Accuracy: " + str(accuracy*100) + "%")

Test Accuracy: 71.0%


In [13]:
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[  0   0   3   0   0   0]
 [  0   0   6   2   0   0]
 [  0   0 149  29   0   0]
 [  0   0  37 111   4   0]
 [  0   0   3  26  24   1]
 [  0   0   0   3   2   0]]


In [14]:
# Quality is scored between 0 and 10. However, the data only contains 6 unique values (3 through 8), which is why the
# confusion matrix is 6x6.
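The confusion matrix printed above can be read more easily by deriving accuracy and per-class recall from it. The sketch below copies that matrix by hand; the class labels 3 through 8 are an assumption about the row order, based on the six quality values present in the red wine data.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true class, columns = predicted class).
cm = np.array([
    [0, 0,   3,   0,  0, 0],
    [0, 0,   6,   2,  0, 0],
    [0, 0, 149,  29,  0, 0],
    [0, 0,  37, 111,  4, 0],
    [0, 0,   3,  26, 24, 1],
    [0, 0,   0,   3,  2, 0],
])
labels = [3, 4, 5, 6, 7, 8]  # assumed quality value for each row/column

# Overall accuracy: correct predictions (the diagonal) over all test samples.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions over the number of true samples of that class.
recall = np.diag(cm) / cm.sum(axis=1)

print(f"Accuracy: {accuracy:.2%}")
for label, r, n in zip(labels, recall, cm.sum(axis=1)):
    print(f"quality {label}: recall = {r:.2f} ({n} test samples)")
```

The rare quality values at both ends are never predicted correctly, so the overall accuracy is driven almost entirely by the two most common classes.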

## 5. Perform classification on wine dataset using PCA and taking all top components such that 90% variance is explained by them

In [15]:
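# One component is enough here: the first principal component alone explains ~95% of the variance (see
# explained_variance_ratio_ below), which already exceeds the 90% threshold. Note that the features are not
# standardized, so columns with large numeric ranges tend to dominate the first component.
pca = PCA(n_components = 1)
pcaWine = pca.fit_transform(wine)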

In [16]:
pcadf = pd.DataFrame(data = pcaWine, columns = ["PCA1"])

In [17]:
data = pd.concat([pcadf, wineTarget["quality"]], axis = 1)

In [18]:
# Only one principal component is needed to explain ~95% of the variance

print(pca.explained_variance_ratio_)

[0.94657698]
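Rather than fixing the number of components by hand, scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` and keeps as many components as needed to reach that share of the variance. The sketch below uses synthetic data (an assumption, not the wine data) with one deliberately large-scale column to show how feature scaling changes the answer.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 independent features, the first on a much larger scale.
X = rng.normal(size=(500, 5)) * np.array([100.0, 1.0, 1.0, 1.0, 1.0])

# Keep the smallest number of components explaining at least 90% of the variance.
pca_raw = PCA(n_components=0.90, svd_solver="full").fit(X)

# Same threshold after standardizing every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=0.90, svd_solver="full").fit(X_scaled)

print("Unscaled:", pca_raw.n_components_, "component(s),",
      "cumulative variance", pca_raw.explained_variance_ratio_.cumsum())
print("Scaled:  ", pca_scaled.n_components_, "component(s),",
      "cumulative variance", pca_scaled.explained_variance_ratio_.cumsum())
```

On unscaled data a single large-range column can account for nearly all of the variance, so a high explained-variance ratio from one component may reflect the units of the features more than their information content.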


In [19]:
x_train, x_test, y_train, y_test = train_test_split(pcaWine,  wineTarget["quality"])

In [20]:
model = RandomForestClassifier()
model.fit(x_train, y_train)

RandomForestClassifier()

In [21]:
# With only one principal component, the model correctly predicted the quality of a wine about 52% of the time.

accuracy = model.score(x_test, y_test)
print("Test Accuracy: " + str(accuracy*100) + "%")

Test Accuracy: 51.74999999999999%


In [22]:
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[  0   0   0   0   0   0]
 [  2   2   7   3   1   1]
 [  0   1 102  44  13   0]
 [  0   7  60  86  21   0]
 [  0   1  14  14  17   0]
 [  0   1   2   1   0   0]]


## 6. Discuss the difference in results vs difference in number of features used for classification.

In [23]:
# In the first implementation, with 11 features (12 columns minus the target variable), we were able to achieve 71%
# accuracy on the test set. In the PCA-driven model we correctly predicted 52% using only a single principal component.
# So a single component retains a substantial part of the model's accuracy, but about 19 percentage points are lost even
# though that component explains ~95% of the variance: high explained variance does not guarantee that the information
# needed for classification is kept. PCA is still useful because models train and predict faster on fewer dimensions,
# and when a data frame is very large, reducing it with PCA may be necessary while retaining most of the information.
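Because the two models above were trained on different random splits, part of the gap between them may be noise. A hedged sketch of a like-for-like comparison follows: it fixes `random_state`, reuses one split for both models, and standardizes before PCA. It assumes the built-in `load_wine` data as a stand-in, so the numbers it prints are not comparable to the results in this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the small built-in wine dataset stands in for the real data.
X, y = load_wine(return_X_y=True)

# One fixed split shared by both models, so the comparison is like for like.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Model 1: random forest on all original features.
full_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Model 2: standardize, keep components explaining 90% of the variance, then classify.
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90, svd_solver="full"),
    RandomForestClassifier(random_state=0),
).fit(X_train, y_train)

acc_full = full_model.score(X_test, y_test)
acc_pca = pca_model.score(X_test, y_test)
n_kept = pca_model.named_steps["pca"].n_components_

print(f"All {X.shape[1]} features: accuracy = {acc_full:.3f}")
print(f"{n_kept} PCA components: accuracy = {acc_pca:.3f}")
```

Wrapping the scaler and PCA in a pipeline also ensures they are fitted on the training split only, which avoids leaking information from the test set.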