# **Principal Component Analysis for Dataset Optimization with Dimension Reduction**
Principal Component Analysis is used to find patterns in data and then express them as principal components, which emphasize the similarities and differences that are discovered. These principal components can then be examined to see what portion of information each one contributes to the whole. This will allow for selection and removal of those principal components contributing very little. 

This is very useful for machine learning purposes on large datasets with many attributes, otherwise known as the dimensions of a dataset. This is because after elimination, the remaining principal components can be used as the new attributes; representative, without significant loss, of the information provided by the original dataset. This compression of data provides a reduction in the overall dimensionality and a new dataset which will be more manageable for use with machine learning algorithms.

# **Importing Packages and Loading Wine Datasets**

White Wine and Red Wine datasets were imported from UCI's Machine Learning Repository, and also joined together to form an additional dataset of all wine. Column names were changed to remove spaces so coding with attributes is easier later on.

In [0]:
#Imports
#Most packages used, and the rest were experimented with
import numpy as np
import sympy as sp
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale, normalize
from sklearn.preprocessing import StandardScaler
%matplotlib inline

In [0]:
#Load White Wine Dataset into Pandas DataFrame from UCI Database
urlW = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
ww = pd.read_csv(urlW, sep = ';' , header = 'infer')

#Load Red Wine Dataset into Pandas DataFrame from UCI Database
urlR = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
rw = pd.read_csv(urlR, sep = ';' , header = 'infer')

#Remove Spaces from Column Names
ww.columns = ['fixedAcidity', 'volatileAcidity', 'citricAcid', 'residualSugar', 'chlorides','freeSulfurDioxide', 'totalSulfurDioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
rw.columns = ['fixedAcidity', 'volatileAcidity', 'citricAcid', 'residualSugar', 'chlorides','freeSulfurDioxide', 'totalSulfurDioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']

# Merged Dataframes of Red and White Wine for a combined list of wine
frames = [rw,ww]
aw = pd.concat(frames, keys=['Red Wine', 'White Wine'])

# **Exploratory Data Analysis**

The data used for our project comes from two datasets concerning the characteristics of wine and the possible effect they have on overall quality: one for white wine and one for red wine. A third, combined dataset was created by joining the two. The red and white datasets differ greatly in their number of instances, and the two types of wine are likely to have different values for their attributes. Both factors seemed important enough that we decided to explore the datasets separately as well as combined.

Initial observations drawn from the data show 4898 instances of white wine and 1599 instances of red wine, for a total of 6497 samples.

There are 12 attributes: 11 predictor variables and 1 target variable concerning quality. It is important to explore the data further to search for errors, outliers, or other significant observations.

In [0]:
#View Number of Instances and Attributes for Each Set
rw.shape, ww.shape,  aw.shape

In [0]:
#Overview of what the data looks like
aw

### **Data Summary**
Summary statistics show the datasets contain no missing values. The quality columns in the red and white datasets show mostly similar values. However, many of the predictor variables differ noticeably between the two datasets. Because there are roughly three times as many white wine instances as red, white wine will have about three times the influence on the summary values of the combined dataset.
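
The "three times the influence" remark follows from the combined mean being a count-weighted average of the two per-type means. Below is a minimal sketch using the instance counts reported above. The function name `pooled_mean` and the example means are made up for illustration and are not values from the datasets.

```python
# Instance counts reported in the notebook text
n_red, n_white = 1599, 4898

def pooled_mean(mean_a, n_a, mean_b, n_b):
    """Mean of the concatenation of two groups, given each group's mean and size."""
    return (mean_a * n_a + mean_b * n_b) / (n_a + n_b)

# Share of the combined dataset that is white wine
w_white = n_white / (n_red + n_white)
print(round(w_white, 3))  # 0.754

# Hypothetical per-type means, chosen only to show the pull toward white wine
example = pooled_mean(10.0, n_red, 20.0, n_white)
print(example)  # much closer to 20 than to 10
```

With real data the same quantity is what `aw.describe()` reports in its `mean` row.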

In [0]:
#Summarize Red Wine Data
rw.describe()

In [0]:
#Histograms of Red Wine Attributes Side by Side
#Create single figure for all histograms and the array that holds all histograms
#Achieved through research and a combination of multiple ideas found from researching on google and stackexchange
#Matplotlib package used
fig, axarr = plt.subplots(3,4,figsize=(20,7))
fig.suptitle('Histograms of Red Wine Attributes', fontsize=16)
fig.tight_layout()
fig.subplots_adjust(top=0.95, hspace=0.3, wspace=0.4)

#Create histograms in the array for each Red Wine attribute
rw.quality.hist(ax=axarr[0,0])
axarr[0,0].set_xlabel("Quality")
axarr[0,0].set_ylabel("Counts")
rw.alcohol.hist(ax=axarr[0,1])
axarr[0,1].set_xlabel("Alcohol")
rw.density.hist(ax=axarr[0,2])
axarr[0,2].set_xlabel("Density")
rw.residualSugar.hist(ax=axarr[0,3])
axarr[0,3].set_xlabel("Residual Sugar")
rw.pH.hist(ax=axarr[1,0])
axarr[1,0].set_xlabel("pH")
axarr[1,0].set_ylabel("Counts")
rw.sulphates.hist(ax=axarr[1,1])
axarr[1,1].set_xlabel("Sulphates")
rw.freeSulfurDioxide.hist(ax=axarr[1,2])
axarr[1,2].set_xlabel("Free Sulfur Dioxide")
rw.totalSulfurDioxide.hist(ax=axarr[1,3])
axarr[1,3].set_xlabel("Total Sulfur Dioxide")
rw.chlorides.hist(ax=axarr[2,0])
axarr[2,0].set_xlabel("Chlorides")
axarr[2,0].set_ylabel("Counts")
rw.fixedAcidity.hist(ax=axarr[2,1])
axarr[2,1].set_xlabel("Fixed Acidity")
rw.volatileAcidity.hist(ax=axarr[2,2])
axarr[2,2].set_xlabel("Volatile Acidity")
rw.citricAcid.hist(ax=axarr[2,3])
axarr[2,3].set_xlabel("Citric Acid")

In [0]:
#Summarize White Wine Data
ww.describe()

In [0]:
#Histograms of White Wine Attributes Side by Side
#Create figure and the array that holds all histograms
fig, axarr = plt.subplots(3,4,figsize=(20,7))
fig.suptitle('Histograms of White Wine Attributes', fontsize=16)
fig.tight_layout()
fig.subplots_adjust(top=0.95, hspace=0.3, wspace=0.4)

#Create histograms in the array for each White Wine attribute
ww.quality.hist(ax=axarr[0,0])
axarr[0,0].set_xlabel("Quality")
axarr[0,0].set_ylabel("Counts")
ww.alcohol.hist(ax=axarr[0,1])
axarr[0,1].set_xlabel("Alcohol")
ww.density.hist(ax=axarr[0,2])
axarr[0,2].set_xlabel("Density")
ww.residualSugar.hist(ax=axarr[0,3])
axarr[0,3].set_xlabel("Residual Sugar")
ww.pH.hist(ax=axarr[1,0])
axarr[1,0].set_xlabel("pH")
axarr[1,0].set_ylabel("Counts")
ww.sulphates.hist(ax=axarr[1,1])
axarr[1,1].set_xlabel("Sulphates")
ww.freeSulfurDioxide.hist(ax=axarr[1,2])
axarr[1,2].set_xlabel("Free Sulfur Dioxide")
ww.totalSulfurDioxide.hist(ax=axarr[1,3])
axarr[1,3].set_xlabel("Total Sulfur Dioxide")
ww.chlorides.hist(ax=axarr[2,0])
axarr[2,0].set_xlabel("Chlorides")
axarr[2,0].set_ylabel("Counts")
ww.fixedAcidity.hist(ax=axarr[2,1])
axarr[2,1].set_xlabel("Fixed Acidity")
ww.volatileAcidity.hist(ax=axarr[2,2])
axarr[2,2].set_xlabel("Volatile Acidity")
ww.citricAcid.hist(ax=axarr[2,3])
axarr[2,3].set_xlabel("Citric Acid")

In [0]:
#Summarize All Wine Data
aw.describe()

In [0]:
#Side-by-side Histograms Red and White Wine Attributes
#Create figure and the array that holds all histograms
fig, axarr = plt.subplots(3,4,figsize=(20,7))
fig.suptitle('Histogram of Red and White Wine Attributes', fontsize=16)
fig.tight_layout()
fig.subplots_adjust(top=0.95, hspace=0.3, wspace=0.4)

#Create histograms in the array for each attribute in All Wine
aw.quality.hist(ax=axarr[0,0])
axarr[0,0].set_xlabel("Quality")
axarr[0,0].set_ylabel("Counts")
aw.alcohol.hist(ax=axarr[0,1])
axarr[0,1].set_xlabel("Alcohol")
aw.density.hist(ax=axarr[0,2])
axarr[0,2].set_xlabel("Density")
aw.residualSugar.hist(ax=axarr[0,3])
axarr[0,3].set_xlabel("Residual Sugar")
aw.pH.hist(ax=axarr[1,0])
axarr[1,0].set_xlabel("pH")
axarr[1,0].set_ylabel("Counts")
aw.sulphates.hist(ax=axarr[1,1])
axarr[1,1].set_xlabel("Sulphates")
aw.freeSulfurDioxide.hist(ax=axarr[1,2])
axarr[1,2].set_xlabel("Free Sulfur Dioxide")
aw.totalSulfurDioxide.hist(ax=axarr[1,3])
axarr[1,3].set_xlabel("Total Sulfur Dioxide")
aw.chlorides.hist(ax=axarr[2,0])
axarr[2,0].set_xlabel("Chlorides")
axarr[2,0].set_ylabel("Counts")
aw.fixedAcidity.hist(ax=axarr[2,1])
axarr[2,1].set_xlabel("Fixed Acidity")
aw.volatileAcidity.hist(ax=axarr[2,2])
axarr[2,2].set_xlabel("Volatile Acidity")
aw.citricAcid.hist(ax=axarr[2,3])
axarr[2,3].set_xlabel("Citric Acid")

### **Detect Outliers** 
After reviewing the histograms for all three datasets, we can see that many of the attributes are skewed right. This suggests there may be outliers, and the boxplots below for the combined dataset of red and white wines do show many points beyond the whiskers, which extend 1.5 times the interquartile range (IQR, the spread of the middle 50% of data values) past the quartiles. Because there are so many of them, we cannot assume any of these outliers are errors at this time.

However, at least two of these instances are rather extreme. Future data models may want to be explored without these entries, or at least with an explanation of why the winemaking process, whether natural or unnatural, resulted in such high numbers.

The values for quality, the target variable, are all centered around the middle of the rating scale, which runs from 0 to 10. The exact criteria used to assign these ratings should be examined. It is also important to note that the attributes are not all measured on the same scale and that there are notable differences in the means between the datasets.
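
The rule the boxplot whiskers use can also be applied numerically. The sketch below counts, per column, the values lying more than 1.5 IQR outside the quartiles. The helper name `iqr_outlier_counts` and the toy DataFrame are hypothetical. With the notebook's data one would pass a DataFrame such as `aw` instead.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column bounds with the DataFrame's columns
    outside = (df < lower) | (df > upper)
    return outside.sum()

# Toy data: column "a" has one extreme value, column "b" has none
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [5, 5, 6, 6, 7, 7, 8, 8, 9, 9],
})
counts = iqr_outlier_counts(demo)
print(counts)
```

A count of this kind only flags candidates. Whether a flagged value is an error still has to be judged from domain knowledge.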

In [0]:
#Side-by-side Boxplots of Red and White Wine Attributes
#Create figure and the array that holds all boxplots
fig, axarr = plt.subplots(3,4,figsize=(30,7))
fig.suptitle('Boxplots of Red and White Wine Attributes', fontsize=16)
fig.tight_layout()
fig.subplots_adjust(top=0.95, hspace=0.3, wspace=0.4)

#Create boxplots in the array for each attribute in All Wine
#Seaborn Package used for boxplots
#Data is passed with the x= keyword so the boxplots are horizontal and no positional argument is relied on
sns.boxplot(x=aw.quality, ax=axarr[0,0])
axarr[0,0].set_xlabel("Quality")
sns.boxplot(x=aw.alcohol, ax=axarr[0,1])
axarr[0,1].set_xlabel("Alcohol")
sns.boxplot(x=aw.density, ax=axarr[0,2])
axarr[0,2].set_xlabel("Density")
sns.boxplot(x=aw.residualSugar, ax=axarr[0,3])
axarr[0,3].set_xlabel("Residual Sugar")
sns.boxplot(x=aw.pH, ax=axarr[1,0])
axarr[1,0].set_xlabel("pH")
sns.boxplot(x=aw.sulphates, ax=axarr[1,1])
axarr[1,1].set_xlabel("Sulphates")
sns.boxplot(x=aw.freeSulfurDioxide, ax=axarr[1,2])
axarr[1,2].set_xlabel("Free Sulfur Dioxide")
sns.boxplot(x=aw.totalSulfurDioxide, ax=axarr[1,3])
axarr[1,3].set_xlabel("Total Sulfur Dioxide")
sns.boxplot(x=aw.chlorides, ax=axarr[2,0])
axarr[2,0].set_xlabel("Chlorides")
sns.boxplot(x=aw.fixedAcidity, ax=axarr[2,1])
axarr[2,1].set_xlabel("Fixed Acidity")
sns.boxplot(x=aw.volatileAcidity, ax=axarr[2,2])
axarr[2,2].set_xlabel("Volatile Acidity")
sns.boxplot(x=aw.citricAcid, ax=axarr[2,3])
axarr[2,3].set_xlabel("Citric Acid")

## **Correlation**

The overall correlation between most of our variables seems to be low. Our target variable is most correlated with alcohol, at around 0.44. The strongest correlation overall is the negative one between density and alcohol, meaning that as one increases, the other tends to decrease. It should also be noted that the correlations are a lot lower on average in the white wine dataset than in the red wine dataset. Whether this is a result of the type of wine or the number of instances cannot be inferred.

In [0]:
#Correlation Visualization for Red Wine Features
#Code retrieved from kaggle source listed at the bottom of this page
#Calculate Correlation of attributes for Red Wine
corrRW = rw.corr()

#Create heatmap to display correlation values
#Matplotlib.pyplot and Seaborn Packages used
plt.figure(figsize=(10,10))
sns.heatmap(corrRW, vmax=1, square=True,annot=True,cmap='cubehelix')
plt.title('Correlation between different Red Wine features')

In [0]:
#Correlation Visualization for White Wine Features
#Calculate Correlation of attributes for White Wine
corrWW = ww.corr()

#Create heatmap to display correlation values
plt.figure(figsize=(10,10))
sns.heatmap(corrWW, vmax=1, square=True,annot=True,cmap='cubehelix')
plt.title('Correlation between different White Wine features')

In [0]:
#Correlation Visualization for Red and White Wine Features
#Calculate Correlation of attributes for Combined Wine Data
corrAW = aw.corr()

#Create heatmap to display correlation values
plt.figure(figsize=(10,10))
sns.heatmap(corrAW, vmax=1, square=True,annot=True,cmap='cubehelix')
plt.title('Correlation between different features of Red and White Wine Combined')

## **Separate Out Features from the Target Variable Quality**

We want to explore the principal components relating to only the features and don't want the numbers for quality affecting the overall explained variance.

In [0]:
#Separate Features from Target Variable
#Identify Features
features = ['fixedAcidity', 'volatileAcidity', 'citricAcid', 'residualSugar', 'chlorides','freeSulfurDioxide', 'totalSulfurDioxide', 'density', 'pH', 'sulphates', 'alcohol']

#Separate attributes from quality for Red Wine
rwf = rw.loc[:, features].values
rwq = rw.loc[:,['quality']].values

#Separate attributes from quality for White Wine
wwf = ww.loc[:, features].values
wwq = ww.loc[:,['quality']].values

#Separate attributes from quality for All Wine
awf = aw.loc[:, features].values
awq = aw.loc[:,['quality']].values

#Ensure the data was preserved
rwf.shape, wwf.shape, awf.shape

## **Principal Component Analysis: No Standardization**
Without standardizing our data, after running the code below, we can see that the first principal component accounts for around 90% of the explained variance. This suggests the variance is not spread evenly across our features, and that the attributes with the largest values are influencing the first principal component the most.
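
To illustrate why a large-valued attribute can dominate, here is a small sketch on synthetic data (not the wine data). Two independent features are generated on very different scales, and the share of variance captured by the first component is compared before and after standardizing. The helper `explained_variance_ratio` is written here for illustration and is not scikit-learn's attribute of the same name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical, independent features on very different scales
small = rng.normal(0.0, 0.1, size=500)
large = rng.normal(0.0, 50.0, size=500)
X = np.column_stack([small, large])

def explained_variance_ratio(data):
    """Eigenvalues of the covariance matrix, largest first, as fractions of their sum."""
    cov = np.cov(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return eigvals / eigvals.sum()

raw_ratio = explained_variance_ratio(X)

# Standardize each column to mean 0 and standard deviation 1
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_ratio = explained_variance_ratio(Z)

print("first component share, raw data:         ", raw_ratio[0])
print("first component share, standardized data:", std_ratio[0])
```

On the raw data nearly all variance lands on the first component simply because one feature has far larger numbers. After standardizing, the two components share the variance roughly equally.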

In [0]:
#Perform PCA analysis with all untransformed datasets and barcharts for the resulting principal components 
#Create the figure and the array that holds all barplots
fig, axarr = plt.subplots(1,3,figsize=(20,5))
fig.suptitle('Percentage of Explained Variance Per Principal Component', fontsize=16)
fig.tight_layout()
fig.subplots_adjust(top=0.9, wspace=0.1)

#PCA transformation for Red Wine Features
#sklearn.decomposition Package for PCA used
#Process learned from sources listed at the end of this file
pcaRW = PCA()
pcaRW.fit(rwf)
pca_RW = pcaRW.transform(rwf)

#PCA transformation for White Wine Features
pcaWW = PCA()
pcaWW.fit(wwf)
pca_WW = pcaWW.transform(wwf)

#PCA transformation for All Wine Features
pcaAW = PCA()
pcaAW.fit(awf)
pca_AW = pcaAW.transform(awf)

#Calculate and round the explained variance for the PCA transformation of Red Wine features
#Process learned from StatQuest Youtube video listed in sources below
#sklearn.decomposition package used for calculating explained variance
#numpy package used for rounding
perVarRW = np.round(pcaRW.explained_variance_ratio_*100, decimals = 4)

#Calculate and round the explained variance for the PCA transformation of White Wine features
perVarWW = np.round(pcaWW.explained_variance_ratio_*100, decimals = 4)

#Calculate and round the explained variance for the PCA transformation of All Wine features
perVarAW = np.round(pcaAW.explained_variance_ratio_*100, decimals = 4)

#Create principal component labels for barplots
#Process learned from StatQuest Youtube video listed in sources below
label1 = ['PC' + str(x) for x in range(1, len(perVarRW)+1)]
label2 = ['PC' + str(x) for x in range(1, len(perVarWW)+1)]
label3 = ['PC' + str(x) for x in range(1, len(perVarAW)+1)]

#Insert data from each PCA transformation into array and barplots
#Process learned from StatQuest Youtube video listed in sources below
axarr[0].bar(x=range(1,len(perVarRW)+1), height = perVarRW, tick_label=label1)
axarr[1].bar(x=range(1,len(perVarWW)+1), height = perVarWW, tick_label=label2)
axarr[2].bar(x=range(1,len(perVarAW)+1), height = perVarAW, tick_label=label3)

#Label barplots
axarr[0].set_xlabel("Number of Components Red Wine")
axarr[0].set_ylabel("Percentage of Explained Variance")
axarr[1].set_xlabel("Number of Components White Wine")
axarr[2].set_xlabel("Number of Components All Wine")

# **Principal Component Analysis**
As we saw above, one principal component accounted for most of the variance. PCA identifies the directions with the highest variance, so features measured on larger scales dominate unless the data is standardized first. Since the values for each of the features were not measured along the same scale, standardizing allows the attributes with smaller values to contribute equally to the principal components and spreads the variance more evenly over all of them.

We will demonstrate the full process with the All Wine dataset, and then use the PCA Package in sklearn on the others to save time.

Process followed for both calculation and sklearn package from Sebastian Raschka's website and the Kaggle sources listed at the end.

### **Step 1: Data Standardization**
The features will be transformed to have a mean of 0 and a standard deviation of 1 so that all attributes have the same unit scale and the data is centered. Standardization does not make the data normally distributed; it shifts and rescales each attribute without changing the shape of its distribution.

The first step is to subtract the mean of each attribute from all of that attribute's values. If the data is viewed visually, the curve has shifted so that it is centered on the origin, but its shape is preserved.

After centering, each of the attribute's values is divided by the attribute's standard deviation. This gives the data a new standard deviation of 1. Recall that standard deviation is a measure of how spread out the data values are.
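
As a check on the two steps just described, the sketch below standardizes a small made-up array by hand, computing z = (x - mean) / std per column, and compares the result with `StandardScaler`. The array values are arbitrary. `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 700.0]])

mu = X.mean(axis=0)      # per-column mean
sigma = X.std(axis=0)    # per-column population standard deviation (ddof=0)

Z_manual = (X - mu) / sigma                     # center, then scale
Z_sklearn = StandardScaler().fit_transform(X)   # same operation via sklearn

print(np.allclose(Z_manual, Z_sklearn))
print(Z_manual.mean(axis=0), Z_manual.std(axis=0))
```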

In [0]:
#Standardize datasets
#sklearn.preprocessing package used for Standard Scaler
 
#Transform Red Wine Features Dataset
stdRW = StandardScaler().fit_transform(rwf)
#Transform White Wine Features Dataset
stdWW = StandardScaler().fit_transform(wwf)
#Transform All Wine Features Dataset
stdAW = StandardScaler().fit_transform(awf)

### **Step 2: Covariance Matrix**

In order to perform PCA, we need to obtain the covariance matrix for our data. This will allow us to perform eigendecomposition to find the eigenvalues and eigenvectors. 
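
A brief sketch of the covariance calculation on synthetic standardized data, compared against `np.cov`. The data is random and stands in for the standardized wine features. The last check shows one consequence of standardizing with the population standard deviation while dividing by n - 1: the covariance matrix equals the correlation matrix scaled by n / (n - 1), so its diagonal is very slightly above 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a standardized feature matrix (200 samples, 3 features)
Z = rng.normal(size=(200, 3))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

n = Z.shape[0]

# Sample covariance: columns already have mean 0, so no further centering is needed
cov_manual = Z.T @ Z / (n - 1)

# NumPy's built-in version (rowvar=False means columns are variables)
cov_numpy = np.cov(Z, rowvar=False)

# Correlation matrix for comparison
corr = np.corrcoef(Z, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
print(np.allclose(cov_manual, corr * n / (n - 1)))
```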


In [0]:
#Calculate mean vector
#Numpy Package used
meansAW = np.mean(stdAW, axis=0)

#Display Mean Vector
meansAW

In [0]:
#Calculate Covariance Matrix
covMatAWC = (stdAW - meansAW).T.dot((stdAW-meansAW))/(stdAW.shape[0]-1)

#Display Covariance Matrix
covMatAWC

In [0]:
#Covariance Visualization for standardized All Wine Features
plt.figure(figsize=(8,8))
sns.heatmap(covMatAWC, vmax=1, square=True,annot=True,cmap='cubehelix')
plt.title('Covariance between different standardized features for All Wine')

### **Step 3: Eigendecomposition**
We will now retrieve the eigenvectors and eigenvalues from the covariance matrix.

The eigenvectors are the principal components. The eigenvector with the largest eigenvalue points in the direction of greatest variance in the dataset, and each subsequent eigenvector points in the direction of greatest remaining variance. Specifically, each eigenvector, if placed on a plot of the standardized data, would be a line through the data that is perpendicular to all other eigenvectors.

The eigenvalues correspond to the amount of variance there is along a specific eigenvector. Larger values explain more variance.
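
The properties described above can be checked on a small made-up symmetric matrix. This sketch uses `np.linalg.eigh`, an alternative to the `np.linalg.eig` call used in this notebook: for symmetric matrices such as a covariance matrix it returns real eigenvalues in ascending order. The 2x2 matrix is arbitrary.

```python
import numpy as np

# Arbitrary symmetric matrix standing in for a covariance matrix
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

eigvals, eigvecs = np.linalg.eigh(C)  # eigenvectors are the columns of eigvecs

# Each pair satisfies C v = lambda v
defining_property_holds = all(
    np.allclose(C @ eigvecs[:, i], eigvals[i] * eigvecs[:, i])
    for i in range(len(eigvals))
)

# Eigenvectors are mutually perpendicular and have unit length
orthonormal = np.allclose(eigvecs.T @ eigvecs, np.eye(2))

# The eigenvalues sum to the total variance (the trace of C)
total_matches_trace = np.isclose(eigvals.sum(), np.trace(C))

print(defining_property_holds, orthonormal, total_matches_trace)
```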


In [0]:
#Calculate Eigenvalues and EigenVectors
#Numpy Package used for the linear algebra calculations of the eigenvectors and eigenvalues
eigvalsAWC , eigvecsAWC = np.linalg.eig(covMatAWC)

#Show results
print('\nEigenvectors All Wine Calculated\n%s' %eigvecsAWC)
print('\nEigenvalues All Wine Calculated\n%s' %eigvalsAWC)


### **Step 4: Select Principal Components**
Sort the eigenvalues from largest to smallest in order to identify the smallest ones. The smaller an eigenvalue, the less important its eigenvector is in representing the data distribution. The largest eigenvalue and its corresponding eigenvector form the principal component that captures the most variance in the data.
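
An alternative to sorting a list of pairs is to sort with `np.argsort`, which keeps eigenvalues and eigenvector columns as arrays and reorders both with the same index. A minimal sketch with made-up values:

```python
import numpy as np

# Made-up eigenvalues and eigenvectors (eigenvectors stored as columns)
eigvals = np.array([0.5, 3.0, 1.5])
eigvecs = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

# Indices that sort the eigenvalues from largest to smallest
order = np.argsort(eigvals)[::-1]

sorted_vals = eigvals[order]
sorted_vecs = eigvecs[:, order]  # reorder columns so they still match their eigenvalues

print(sorted_vals)
print(sorted_vecs[:, 0])  # eigenvector belonging to the largest eigenvalue
```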

In [0]:
#Pair eigenvalues and eigenvectors in list
eigPairsAWC = [(np.abs(eigvalsAWC[i]), eigvecsAWC[:,i]) for i in range (len(eigvalsAWC))]

#Sort the above list from largest to smallest and print the eigenvalues
eigPairsAWC.sort(key=lambda x: x[0], reverse=True)
print('Eigenvalues for All Wine Calculated')
for i in eigPairsAWC:
  print(i[0])

### **Step 5: Calculate Explained Variance**

The first two principal components clearly contribute to much more of the total variance than the rest of the components. We can also see some eigenvalues have very low values. Eigenvalues near zero are generally discarded for principal component analysis.

However, we need to calculate the explained variance ratio for each principal component in order to tell us exactly how much information each principal component accounts for. Doing so will allow us to properly select the principal components that can be removed from the model.
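
In symbols, with the eigenvalues sorted so that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ (here $p = 11$ features), the explained variance ratio of component $k$ and the cumulative ratio of the first $m$ components can be written as follows. The symbols $r_k$ and $R_m$ are names chosen here for illustration.

$$
r_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}, \qquad R_m = \sum_{k=1}^{m} r_k
$$

The ratios sum to 1, so $R_p = 1$, and multiplying $r_k$ by 100 gives the percentages plotted in this notebook.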



In [0]:
#Calculate Explained Variance
#Add all eigenvalues together for total variance
totVar = sum(eigvalsAWC)

#Calculate the percentage of the contribution to variance for each principal component
expVar = [(i/totVar)*100 for i in sorted(eigvalsAWC, reverse=True)]

#Plot Explained variance of each Principal Component
#x positions run from 1 to 11 so each bar lines up with its component number
plt.figure(figsize=(10,10))
plt.bar(range(1, len(expVar)+1), expVar)
plt.ylabel('Explained Variance')
plt.xlabel('Principal Components')

After performing the calculations, the barplot shows the variance is much more spread out along the principal components, and the first one no longer accounts for 90% of the variance, but only around 27%. The last components contribute very little. All eleven components are present in the result; the final one is simply the smallest.

Once the significant principal components are chosen, their eigenvectors are placed as the columns of a projection matrix, and the standardized data is multiplied by this matrix to give a dataset with fewer dimensions. Equivalently, the transposed eigenvectors are multiplied on the left of the transposed data.
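
The projection step can be sketched as follows on synthetic data. The random matrix stands in for the standardized features, `k = 2` is an arbitrary choice, and the variable names are illustrative. The sketch also confirms that the variance of each projected column equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic correlated data standing in for the standardized features
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the covariance matrix, sorted largest first
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                    # number of components to keep (arbitrary here)
W = eigvecs[:, :k]       # projection matrix, shape (4, k)

scores = Z @ W                   # projected data, shape (300, k)
scores_alt = (W.T @ Z.T).T       # same result written as "transposed eigenvectors on the left"

print(scores.shape)
print(np.allclose(scores, scores_alt))
print(scores.var(axis=0, ddof=1), eigvals[:k])
```

The signs of individual columns may differ from what `sklearn.decomposition.PCA` returns, since an eigenvector and its negative describe the same direction.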

## **Principal Component Analysis with sklearn Package**

In [0]:
#Perform PCA with all components on standardized datasets

#PCA transformation for Red Wine Features
pca1 = PCA()
pcaStdRW = pca1.fit_transform(stdRW)

#PCA transformation for White Wine Features
pca2 = PCA()
pcaStdWW = pca2.fit_transform(stdWW)

#PCA transformation for All Wine Features
pca3 = PCA()
pcaStdAW = pca3.fit_transform(stdAW)                      

### **Visualize Principal Components and their Explained Variance**



In [0]:
#Visualize Explained Variance
#Create the figure and the array that holds all barplots
fig, axarr = plt.subplots(1,3,figsize=(20,5))
fig.suptitle('Percentage of Explained Variance Per Principal Component', fontsize=16)
fig.tight_layout()
fig.subplots_adjust(top=0.9, wspace=0.1)


#Calculate and round the explained variance for the PCA transformation of Red Wine features
perVarStdRW = np.round(pca1.explained_variance_ratio_*100, decimals = 4)

#Calculate and round the explained variance for the PCA transformation of White Wine features
perVarStdWW = np.round(pca2.explained_variance_ratio_*100, decimals = 4)

#Calculate and round the explained variance for the PCA transformation of All Wine features
perVarStdAW = np.round(pca3.explained_variance_ratio_*100, decimals = 4)

#Create principal component labels for barplots
label1 = ['PC' + str(x) for x in range(1, len(perVarStdRW)+1)]
label2 = ['PC' + str(x) for x in range(1, len(perVarStdWW)+1)]
label3 = ['PC' + str(x) for x in range(1, len(perVarStdAW)+1)]

#Insert data from each PCA transformation into array and barplots
axarr[0].bar(x=range(1,len(perVarStdRW)+1), height = perVarStdRW, tick_label=label1)
axarr[1].bar(x=range(1,len(perVarStdWW)+1), height = perVarStdWW, tick_label=label2)
axarr[2].bar(x=range(1,len(perVarStdAW)+1), height = perVarStdAW, tick_label=label3)

#Label barplots
axarr[0].set_xlabel("Number of Components Red Wine")
axarr[0].set_ylabel("Percentage of Explained Variance")
axarr[1].set_xlabel("Number of Components White Wine")
axarr[2].set_xlabel("Number of Components All Wine")

We can see the resulting principal components for each of the three datasets and the amounts of variance contributed by each one. It is interesting to see that the second principal component in the All Wine dataset contributes significantly more to the total variance than the second component in the other datasets. This may be something to explore further in the future. It can also be seen that the variance in the Red Wine distribution decreases more evenly from PC1 to PC11. PC11 is labeled explicitly in these plots for all three datasets; it clearly contributes very little in each and is a candidate for removal.

### **Visualize the Total Explained Variance Accounted for as the Number of Principal Components Increases**
To aid us further in selecting the principal components to remove, we will visualize the total variance accounted for as more principal components are included. This additional step was taken from the kaggle source.


In [0]:
#Visualize Explained Variance as Number of Principal Components Increases
#Create the figure and the array that holds all line plots
fig, axarr = plt.subplots(1,3,figsize=(20,5))
fig.suptitle('Explained Variance vs Number of Components', fontsize=16)
fig.tight_layout()
fig.subplots_adjust(top=0.9, wspace=0.1)

#Matplotlib.pyplot used for plotting data
#sklearn package used for calculating the explained variance of each principal component
#Numpy package used for calculating the total variance of all principal components
#Calculate Total Explained Variance and Create plots for Red Wine Principal Components and their corresponding Explained Variance Values
axarr[0].plot(np.cumsum(pca1.explained_variance_ratio_))
axarr[0].set_xlabel("Number of Components Red Wine")
axarr[0].set_ylabel("Cumulative Explained Variance")
#Insert dashed line to show the cutoff point we used for explained variance at 90%
axarr[0].axhline(y=0.9,color='gray',linestyle='--')

#Calculate Total Explained Variance and Create plots for White Wine Principal Components and their corresponding Explained Variance Values
axarr[1].plot(np.cumsum(pca2.explained_variance_ratio_))
axarr[1].set_xlabel("Number of Components White Wine")
axarr[1].axhline(y=0.9,color='gray',linestyle='--')

#Calculate Total Explained Variance and Create plots for All Wine Principal Components and their corresponding Explained Variance Values
axarr[2].plot(np.cumsum(pca3.explained_variance_ratio_))
axarr[2].set_xlabel("Number of Components All Wine")
axarr[2].axhline(y=0.9,color='gray',linestyle='--')

From the resulting plots, we can see that there is just about 90% Explained Variance in all standardized datasets at the sixth principal component, and therefore we can drop the remaining components because they provide little information. 

The 90% cutoff is often treated as the minimum total variance that should be accounted for; beyond that, it was not chosen for any particular reason. Many analyses instead look for the principal components that account for 95% or even 99% of total variance. Doing so would result in choosing 7 and 8 principal components, respectively.
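
Reading a component count off a plot is approximate, and when an array is plotted against its default index the x-axis starts at 0, so counts read from such a plot can be off by one. A small sketch for computing the count directly is given below. The helper `components_for_threshold` and the ratio values are hypothetical. With the notebook's objects one would pass, for example, `pca1.explained_variance_ratio_`.

```python
import numpy as np

def components_for_threshold(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= threshold; +1 converts index to a count
    idx = int(np.searchsorted(cumulative, threshold, side="left"))
    # Clamp in case rounding leaves the final cumulative value just under the threshold
    return min(idx + 1, len(cumulative))

# Hypothetical ratios, sorted largest first, summing to 1
ratios = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

for t in (0.80, 0.90, 0.99):
    print(t, components_for_threshold(ratios, t))

# scikit-learn can also select the count itself when n_components is a fraction:
#   PCA(n_components=0.90, svd_solver="full")
```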

### **Perform Principal Component Analysis with Six Principal Components**
Besides PCA, we will also merge the six principal components back with the target variable in order to visualize them together. For the All Wine principal components, we attempted, but were unable, to write code that would join them with the quality values. A special method is needed, since the All Wine dataset was originally created by joining two datasets. If it is desired to use that data for machine learning methods, it might be more appropriate for the All Wine dataset to be created outside of Python, loaded in, and then transformed with the procedure above.
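
One possible way around the join problem, offered as a sketch rather than as the method used in this project: `pd.concat(..., axis=1)` aligns rows by index, and a frame built with `keys=` carries a two-level index that does not match the plain 0..n-1 index of the principal component frame. Resetting the index of the quality column makes the two line up by position. The tiny frames below are stand-ins. This assumes the rows of the component frame are in the same order as the combined frame, and `wineType` is a column name invented here.

```python
import numpy as np
import pandas as pd

# Stand-ins for the red and white frames and their keyed concatenation
red = pd.DataFrame({"quality": [5, 6]})
white = pd.DataFrame({"quality": [7, 4, 6]})
combined = pd.concat([red, white], keys=["Red Wine", "White Wine"])

# Stand-in for the PCA output: plain 0..n-1 index, same row order as `combined`
scores = pd.DataFrame(
    np.arange(10.0).reshape(5, 2),
    columns=["Principal Component 1", "Principal Component 2"],
)

# Drop the two-level index so rows align by position
quality = combined[["quality"]].reset_index(drop=True)
final = pd.concat([scores, quality], axis=1)

# Optionally keep the wine type from the first index level
final["wineType"] = combined.index.get_level_values(0).to_numpy()

print(final)
```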

In [0]:
#Perform PCA with desired components of 6

#PCA on Standardized Red Wine Data with only 6 Principal Components
pcaRW6 = PCA(n_components=6)
pcaRW6.fit(stdRW)
pca_StdRW = pcaRW6.transform(stdRW)

#PCA on Standardized White Wine Data with only 6 Principal Components
pcaWW6 = PCA(n_components=6)
pcaWW6.fit(stdWW)
pca_StdWW = pcaWW6.transform(stdWW)

#PCA on Standardized All Wine Data with only 6 Principal Components
pcaAW6 = PCA(n_components=6)
pcaAW6.fit(stdAW)
pca_StdAW = pcaAW6.transform(stdAW)


#Create Dataframes of Principal Components for each wine Dataset
finalPrincipalRW = pd.DataFrame(data= pca_StdRW , columns = ['Principal Component 1','Principal Component 2','Principal Component 3',
                                                         'Principal Component 4','Principal Component 5','Principal Component 6'])
finalPrincipalWW = pd.DataFrame(data= pca_StdWW, columns = ['Principal Component 1','Principal Component 2','Principal Component 3',
                                                         'Principal Component 4','Principal Component 5','Principal Component 6'])
finalPrincipalAW = pd.DataFrame(data= pca_StdAW, columns = ['Principal Component 1','Principal Component 2','Principal Component 3',
                                                         'Principal Component 4','Principal Component 5','Principal Component 6'])


#Merge Red Wine Principal Components with the Target Variable
finalRW = pd.concat([finalPrincipalRW, rw[['quality']]], axis = 1)

#Merge White Wine Principal Components with the Target Variable
finalWW = pd.concat([finalPrincipalWW, ww[['quality']]], axis = 1)

#Special join needed for All Wine components because the original dataset was created by concatenating two datasets.
#Unable to figure out how to do correctly
#frames = [rw[['quality']], ww[['quality']]]
#qualityAW = pd.concat(frames, keys=['Red Wine', 'White Wine'])
#finalAW = pd.merge(left=finalPrincipalAW, right=qualityAW, left_on='Principal Component 6', right_on='quality')
#finalAW = finalPrincipalAW.join(qualityAW.set_index('quality'), on='quality')

#Display Red Wine Principal Components with corresponding Quality Values
finalRW

In [0]:
#Display White Wine Principal Components with corresponding Quality Values
finalWW

# **Results**
We were able to remove five principal components from our datasets. This should make models for predicting wine quality, and other machine learning methods, easier to work with in the future. Almost half of the dimensions have been removed, reducing them to six without losing much total variance. The correlations between our principal components can be seen, in the visualizations below, to be much lower than the correlations between the attributes in our original datasets.

We now have six, instead of eleven, attributes that can be used for machine learning methods in the determination of quality or other explorations, a significantly lower number of dimensions to work with. Depending on the goal, it may be appropriate to use more principal components to account for more variance. It should also be noted that the number of principal components used for White Wine should probably be one more than for the other datasets, as the plot above shows 90% total variance is accounted for closer to the 7th, not the 6th, principal component. The decision about what percentage of total variance to aim for will, in general, depend largely on the number of attributes in the original dataset and the desired machine learning outcome.

Future explorations should include investigation of outliers and varying transformations (log, square root, inverse square root) of the original data to possibly allow for elimination of additional principal components and further reduction in dimensionality.
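
The low correlations between principal components are a property of the method: projecting onto the eigenvectors of the covariance matrix produces columns that are uncorrelated with each other. A short sketch on synthetic correlated data (not the wine data) illustrates this. The mixing matrix is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic features made correlated by mixing independent columns
mixing = np.array([[1.0, 0.8, 0.2],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(400, 3)) @ mixing
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Project onto all eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
scores = Z @ eigvecs

corr_features = np.corrcoef(Z, rowvar=False)      # original features: correlated
corr_scores = np.corrcoef(scores, rowvar=False)   # components: identity up to rounding

print(np.round(corr_features, 2))
print(np.round(corr_scores, 2))
```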

In [0]:
#Correlation Visualization for Red Wine Features
#Calculate Correlation of the Red Wine Principal Components
corrfinalRW = finalRW.corr()

#Create Figure from correlation data
#Matplotlib.pyplot and Seaborn Packages used
plt.figure(figsize=(10,10))
sns.heatmap(corrfinalRW, vmax=1, square=True,annot=True,cmap='cubehelix')
plt.title('Correlation between different components of Red Wine')

In [0]:
#Correlation Visualization for White Wine Features
#Calculate Correlation of the White Wine Principal Components
corrfinalWW = finalWW.corr()

#Create Figure
plt.figure(figsize=(10,10))
sns.heatmap(corrfinalWW, vmax=1, square=True,annot=True,cmap='cubehelix')
plt.title('Correlation between different components of White Wine')

# **Sources**

The following sources were used mainly for coding purposes. Much of the code for visualizations, arrays, and some other smaller aspects was put together from numerous Google search results and past experience. The experience in exploring datasets in more detail came from Towson's COSC 431 Selected Topics course on Data Mining. 

https://www.kaggle.com/nirajvermafcb/principal-component-analysis-explained

http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html

https://www.youtube.com/watch?v=Lsue2gEM9D0&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=27&t=0s



---


These sources were used for simpler explanations throughout the project's descriptions and analysis portions.

http://www.lauradhamilton.com/introduction-to-principal-component-analysis-pca

https://365datascience.com/standardization/



# **Dataset Source**
https://archive.ics.uci.edu/ml/datasets/Wine+Quality


# **Contributors**

Nathan Koh - nkoh1@students.towson.edu

Craig Neely - cneely6@students.towson.edu



---


Group Project

Math 490

Towson University

# **Visualization Experiments**
The following two blocks of code are our attempts at displaying the Seaborn correlation heatmaps for all of the datasets together in one figure, using gridspec from the Matplotlib package. The numerous annotations on each heatmap and the colorbars were the main contributors to the difficulties experienced.

In [0]:
#Create figure and array to display all heatmaps together
fig, axarr = plt.subplots(1,3,figsize=(40,40))
fig.suptitle('Correlation Heatmap', fontsize=40)
fig.tight_layout()
fig.subplots_adjust(top=.9999, wspace=0.2, hspace=0)

#Modify Colorbar
cbar_ax = fig.add_axes([1, .33, .05, .35])

#Create Correlation Heatmaps for All Datasets
sns.heatmap(corrRW, vmax=1, square=True, annot=True,cmap='cubehelix', cbar=False, ax=axarr[0])
sns.heatmap(corrWW, vmax=1, square=True, annot=True,cmap='cubehelix', cbar=False, ax=axarr[1])
sns.heatmap(corrAW, vmax=1, square=True, annot=True,cmap='cubehelix', cbar_ax= cbar_ax, ax=axarr[2])

#Label specific subplots
axarr[0].set_xlabel('Correlation between different Red Wine features', fontsize=16)
axarr[1].set_xlabel('Correlation between different White Wine features', fontsize=16)
axarr[2].set_xlabel('Correlation between different All Wine features', fontsize=16)

In [0]:
#Another attempt to create the figure, using gridspec width ratios instead.
fig, (ax1, ax2, ax3, axcb) = plt.subplots(1, 4, figsize=(40, 40), gridspec_kw={'width_ratios': [1, 1, 1, .08]})
#Share the y-axis across the three heatmaps. Axes.sharey is used here because
#get_shared_y_axes().join() is not supported by newer Matplotlib releases.
ax2.sharey(ax1)
ax3.sharey(ax1)
#fig, axarr = plt.subplots(1,3,figsize=(40,40))
fig.suptitle('Principal Component Analysis: No Normalization', fontsize=16)

#fig.subplots_adjust(top=0.95, wspace=0.2, hspace=0)

#Create Correlation Heatmaps
g1 = sns.heatmap(corrRW, vmax=1, square=True, annot=True, cmap='cubehelix', cbar=False, ax=ax1)
g1.set_ylabel('')
g2 = sns.heatmap(corrWW, vmax=1, square=True, annot=True, cmap='cubehelix', cbar=False, ax=ax2)
g2.set_ylabel('')
#Hide the repeated y tick labels without clearing the shared ticks
g2.tick_params(left=False, labelleft=False)
#The colorbar is drawn in its own axes (axcb), so no 'shrink' setting is passed
g3 = sns.heatmap(corrAW, vmax=1, square=True, annot=True, cmap='cubehelix', cbar_ax=axcb, ax=ax3)
g3.set_ylabel('')
g3.tick_params(left=False, labelleft=False)

#Label Specific Subplots (the axes of this figure, not axarr from the previous cell)
ax1.set_xlabel('Correlation between different Red Wine features', fontsize=16)
ax2.set_xlabel('Correlation between different White Wine features', fontsize=16)
ax3.set_xlabel('Correlation between different All Wine features', fontsize=16)
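One possible way around the colorbar difficulties described above is to let Matplotlib place a single shared colorbar itself. The sketch below is only an illustration under stated assumptions: it uses plain Matplotlib `imshow` instead of Seaborn heatmaps, it leaves out the per-cell annotations, and the three matrices are randomly generated stand-ins for `corrRW`, `corrWW` and `corrAW`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for corrRW, corrWW and corrAW (assumption: 11x11 matrices)
rng = np.random.default_rng(0)
corr_matrices = {
    'Red Wine': np.corrcoef(rng.normal(size=(200, 11)), rowvar=False),
    'White Wine': np.corrcoef(rng.normal(size=(200, 11)), rowvar=False),
    'All Wine': np.corrcoef(rng.normal(size=(200, 11)), rowvar=False),
}

# constrained_layout leaves room for the title and the shared colorbar
fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=True, constrained_layout=True)
for ax, (name, corr) in zip(axes, corr_matrices.items()):
    # Fixed vmin/vmax so that all three panels use the same color scale
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap='cubehelix')
    ax.set_xlabel('Correlation between different ' + name + ' features')

fig.suptitle('Correlation Heatmaps')
# A single colorbar attached to all three axes at once
cbar = fig.colorbar(im, ax=axes, shrink=0.8)
```

Because every panel is drawn with the same `vmin` and `vmax`, the one colorbar is valid for all three heatmaps, which avoids positioning a separate colorbar axes by hand.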


