<a href="https://colab.research.google.com/github/inamansari21/datascience/blob/main/PCA_new.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
import seaborn as sns
from sklearn.decomposition import PCA
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage


In [None]:
df = pd.read_csv("wine.csv")


In [None]:
df1 = df.iloc[:, 1:].copy()  # drop the first column; copy so later column assignments do not write to a view of df


In [None]:
df1.head()


In [None]:
df1.describe()


# **Finding correlation**

In [None]:
cor = df1.corr()


In [None]:
cor.style.background_gradient(cmap='coolwarm')


Several variables are strongly correlated. For example, the correlation between flavanoids and dilution is high (about 0.78). We could drop one variable from each such pair, but doing this by hand is long and tedious, so we use PCA for dimensionality reduction instead.
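To make the "long and tedious" manual route concrete, here is a minimal sketch that lists every pair of columns whose absolute Pearson correlation exceeds a threshold. The helper name `high_corr_pairs`, the 0.7 threshold, and the synthetic data are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

def high_corr_pairs(values, names, threshold=0.7):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold."""
    corr = np.corrcoef(values, rowvar=False)  # Pearson correlation between columns
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only, skips the diagonal
            r = float(corr[i, j])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    # strongest correlations first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Synthetic stand-in for the wine features (hypothetical columns and values)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)  # nearly a copy of "a"
c = rng.normal(size=200)            # independent noise
demo = np.column_stack([a, b, c])

pairs = high_corr_pairs(demo, ["a", "b", "c"])
print(pairs)
```

In the notebook, the equivalent call would be `high_corr_pairs(df1.to_numpy(), list(df1.columns))`, run before any cluster-label column is added to `df1`.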


# **Dimensionality Reduction with PCA**

In [None]:
df_norm = StandardScaler().fit_transform(df1)  # standardize each feature to zero mean and unit variance


In [None]:
pca = PCA(n_components=13)


In [None]:
principalComponents = pca.fit_transform(df_norm)


In [None]:
PC = range(1, pca.n_components_+1)
plt.bar(PC, pca.explained_variance_ratio_, color='blue')
plt.xlabel('Principal Components')
plt.ylabel('Variance %')
plt.xticks(PC)


In [None]:
PCA_components = pd.DataFrame(principalComponents)


In [None]:
plt.scatter(PCA_components[0], PCA_components[1], alpha=.3, color='blue')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.show()


As the bar chart shows, most of the variance is captured by the first 2 components. Since the components from the 3rd onward contribute comparatively little, let's use just the first 2 components in our analysis. The scatter plot suggests that there may be 3 clusters present.
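Reading the bar chart by eye can be backed up with the cumulative explained variance. The sketch below finds the smallest number of leading components that reaches a chosen share of the variance. The helper name `components_for_variance`, the targets, and the ratios are hypothetical; the ratios are not the values from the wine data.

```python
import numpy as np

def components_for_variance(ratios, target=0.75):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(ratios)
    # index of the first running total >= target, converted to a 1-based count
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(ratios))  # guard against rounding at the very end

# Hypothetical explained-variance ratios, sorted in decreasing order
ratios = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
n_75 = components_for_variance(ratios, target=0.75)
n_95 = components_for_variance(ratios, target=0.95)
print(n_75, n_95)
```

In the notebook, the input would be `pca.explained_variance_ratio_`, which scikit-learn already returns in decreasing order.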


# **Finding the number of clusters**

In [None]:
wcss = []


In [None]:
for i in range(1, 15):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(PCA_components.iloc[:,:2])  # first 2 components, matching the clustering below
    wcss.append(kmeans.inertia_)  # within-cluster sum of squares for this k


In [None]:
plt.plot(range(1, 15), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()


The elbow plot levels off at about k=3, so let's use 3 clusters.
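For readers unsure what the WCSS axis measures, here is a small sketch that computes the within-cluster sum of squares by hand: the squared distance from each point to the mean of its cluster, summed over all points. The function name `within_cluster_ss` and the four-point data set are made up for illustration. For a converged k-means fit, this should agree with `inertia_` up to numerical tolerance.

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Tiny hypothetical data set: two tight pairs of points, far apart
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
one_cluster = within_cluster_ss(pts, [0, 0, 0, 0])
two_clusters = within_cluster_ss(pts, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

The large drop from one cluster to two is the kind of bend the elbow method looks for; past the true number of groups, extra clusters reduce the value only slightly.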


# **K-means clusters**

In [None]:
model = KMeans(n_clusters=3, random_state=42)  # fixed seed so the cluster labels are reproducible
model.fit(PCA_components.iloc[:,:2])


In [None]:
labels = model.predict(PCA_components.iloc[:,:2])


In [None]:
plt.scatter(PCA_components[0], PCA_components[1], c=labels)
plt.show()


In [None]:
k_new_df=pd.DataFrame(principalComponents[:,0:2])


In [None]:
model_k = KMeans(n_clusters=3, random_state=42)  # fixed seed so the cluster labels are reproducible
model_k.fit(k_new_df)


In [None]:
model_k.labels_


In [None]:
md=pd.Series(model_k.labels_)


In [None]:
df1['clust']=md


In [None]:
k_new_df.head()


In [None]:
df1.groupby(df1.clust).mean()


# **H-Clusters**

In [None]:
# Ward linkage always uses Euclidean distance; the `affinity` argument was removed in recent scikit-learn releases
model2 = AgglomerativeClustering(n_clusters=3, linkage='ward')


In [None]:
h_cluster = model2.fit(PCA_components.iloc[:,:2])


In [None]:
labels2 = model2.labels_


In [None]:
X = PCA_components.iloc[:,:1]
Y = PCA_components.iloc[:,1:2]


In [None]:
plt.figure(figsize=(10, 7))  
plt.scatter(X, Y, c=labels2) 


In [None]:
h_new_df=pd.DataFrame(principalComponents[:,0:2])


In [None]:
h_new_df.head()


In [None]:
hcf = linkage(h_new_df,method="complete",metric="euclidean")


In [None]:
plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
sch.dendrogram(
    hcf,
    leaf_rotation=0.,
    leaf_font_size=8.,
)
plt.show()


In [None]:
# Euclidean distance is the default; the `affinity` argument was removed in recent scikit-learn releases
h_complete = AgglomerativeClustering(n_clusters=5, linkage='complete').fit(h_new_df)


In [None]:
h_complete.labels_


In [None]:
cluster_labels=pd.Series(h_complete.labels_)


In [None]:
df1['clust']=cluster_labels


In [None]:
df1.head()


In [None]:
df1.groupby(df1.clust).mean()


## Conclusion
Using PCA, we reduced the 13 variables to 2 principal components and then applied k-means and hierarchical clustering. Based on these results, we can reasonably assume that there are 3 clusters in the wine data set.
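One way to probe this conclusion is to check whether the two clustering methods group the samples the same way. Cluster numbers are arbitrary, so the labels cannot be compared element by element. The sketch below builds a contingency table instead. The function name `contingency` and the example labels are hypothetical.

```python
import numpy as np

def contingency(labels_a, labels_b):
    """Count how many samples fall in each (cluster in A, cluster in B) pair."""
    a_ids, a_idx = np.unique(labels_a, return_inverse=True)
    b_ids, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_ids), len(b_ids)), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)  # increment one cell per sample
    return table

# Hypothetical labels: identical grouping, different cluster numbering
kmeans_labels = [0, 0, 1, 1, 2, 2]
hier_labels = [2, 2, 0, 0, 1, 1]
table = contingency(kmeans_labels, hier_labels)
print(table)
```

In the notebook, the call would be `contingency(model_k.labels_, labels2)`. If each row has a single dominant entry, the k-means and Ward solutions largely agree; counts spread across a row indicate that the methods split the data differently.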

