# Using PCA and LDA for dimensionality reduction

The wine dataset is an example of a multivariate dataset. It contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different types of grapes (referred to as “cultivars”). The analysis quantified 13 constituents found in each of the three types of wine.
Using PCA on this dataset helps us understand which features matter: by looking at the weights of the original features in the principal components, we can see which features contribute most to the variability in the wine dataset.

**Attention:** The code in this notebook creates Google Cloud resources that can incur costs.

Refer to the Google Cloud pricing documentation for details.

For example:

* [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)


In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd

# Load dataset
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Standardize the features (fit_transform returns a NumPy array, so df is no longer a DataFrame)
scaler = StandardScaler()
df = scaler.fit_transform(df)

# Apply PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

# Visualize 2D Projection
plt.figure(figsize=(8,6))
plt.scatter(principalDf['principal component 1'], principalDf['principal component 2'], c=data.target)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
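
A 2D projection keeps only part of the information in the 13 original features. As a minimal sketch (the variable names `wine`, `X_scaled`, `pca_2d`, and `variance_ratio` are illustrative and not part of the original notebook), the `explained_variance_ratio_` attribute of a fitted PCA object reports the share of total variance that each component captures:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reload and standardize the data so this cell runs on its own
wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Fit a two-component PCA, matching the projection plotted above
pca_2d = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each component
variance_ratio = pca_2d.explained_variance_ratio_
for i, ratio in enumerate(variance_ratio, start=1):
    print(f"Principal component {i}: {ratio:.1%} of variance")
print(f"Total retained: {variance_ratio.sum():.1%}")
```

If the total retained is low, the scatter plot may hide structure that only shows up in later components.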


We can also print a table in which each value is the weight (loading) of an original feature in one of the principal components.

In [None]:
components_df = pd.DataFrame(pca.components_, columns=data.feature_names, index=['Component 1', 'Component 2'])

print(components_df)
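
The full table can be hard to scan by eye. One possible way to read it, sketched here with illustrative names (`loadings`, `top_pc1`) that are not in the original notebook, is to sort the features by the absolute value of their weight on a component. The absolute value is used because the sign of a principal component is arbitrary.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reload and standardize the data so this cell runs on its own
wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
pca_2d = PCA(n_components=2).fit(X_scaled)

# Same layout as the table above: one row per component, one column per feature
loadings = pd.DataFrame(
    pca_2d.components_,
    columns=wine.feature_names,
    index=["Component 1", "Component 2"],
)

# Five features with the largest absolute weight on the first component
top_pc1 = loadings.loc["Component 1"].abs().sort_values(ascending=False).head(5)
print(top_pc1)
```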

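Next, let's use LDA to identify the constituents that account for the most variance between the types of wine. Unlike PCA, LDA is supervised: it uses the cultivar labels to find the directions that best separate the classes.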

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Apply LDA
lda = LDA(n_components=2)
lda_components = lda.fit_transform(df, data.target)
lda_df = pd.DataFrame(data = lda_components, columns = ['LDA 1', 'LDA 2'])

# Visualize 2D Projection
plt.figure(figsize=(8,6))
plt.scatter(lda_df['LDA 1'], lda_df['LDA 2'], c=data.target)
plt.xlabel('LDA 1')
plt.ylabel('LDA 2')
plt.show()
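
With three classes, LDA can produce at most two discriminant axes (the number of classes minus one), so `n_components=2` keeps all of them. As a hedged sketch with illustrative variable names (`lda_2d`, `lda_ratio`), the fitted object's `explained_variance_ratio_` attribute, available with the default `svd` solver, shows how the between-class variance is split across the two axes:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Reload and standardize the data so this cell runs on its own
wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Fit LDA with the class labels (supervised, unlike PCA)
lda_2d = LinearDiscriminantAnalysis(n_components=2).fit(X_scaled, wine.target)

# Share of between-class variance explained by each discriminant axis
lda_ratio = lda_2d.explained_variance_ratio_
for i, ratio in enumerate(lda_ratio, start=1):
    print(f"LDA {i}: {ratio:.1%} of between-class variance")
```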


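To view the most discriminative features, inspect the `coef_` attribute of the fitted LDA object. Because the features were standardized, coefficients with larger absolute values indicate features with more influence on a class's score: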

In [None]:
# Create a DataFrame with the LDA coefficients and feature names.
# coef_ has one row per class (in the order of lda.classes_) and one column per feature.
coef_df = pd.DataFrame(lda.coef_, columns=data.feature_names, index=['Class 1 vs Rest', 'Class 2 vs Rest', 'Class 3 vs Rest'])
print(coef_df)
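
To turn the coefficient table into a single ranking, one rough heuristic (an assumption of this sketch, not a formal importance measure and not part of the original notebook) is to average the absolute coefficients across the three classes. The names `coef` and `ranking` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Reload and standardize the data so this cell runs on its own
wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
lda_fit = LinearDiscriminantAnalysis(n_components=2).fit(X_scaled, wine.target)

# One row per class, one column per feature
coef = pd.DataFrame(lda_fit.coef_, columns=wine.feature_names, index=wine.target_names)

# Heuristic ranking: mean absolute coefficient across classes, largest first
ranking = coef.abs().mean(axis=0).sort_values(ascending=False)
print(ranking.head(5))
```

This comparison is only meaningful because the features were standardized first; on raw data, the coefficient magnitudes would also reflect each feature's units.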
