### GreenDS
# Fundamentals of Data Science
## Example on Unsupervised Machine Learning - Clustering
### Example 03.2

### Introduction

This Jupyter Notebook continues Example 03.1. Here we first reduce the dimensionality of the data with Principal Component Analysis (PCA) and then repeat the cluster analysis. The goal is to improve the quality of the cluster definition.

The data come from the Agricultural Census of Portugal. Data on **level of education**, **labour** and **production** for 2019 were aggregated into one table at the level of the freguesia (civil parish).

## 1. Prepare your environment and explore data

The first steps are the same as in the previous notebook.

In [None]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import random
from matplotlib import pyplot as plt

%matplotlib inline

In [None]:
# read data
census_df = pd.read_csv("./raw-data/data_agric_census_freg.csv")
census_df.shape

In [None]:
census_df.info()

In [None]:
sns.set(style='white',font_scale=1.3, rc={'figure.figsize':(20,20)})
ax=census_df.hist(bins=20,color='red' )

In [None]:
census_df.plot( kind = 'box', subplots = True, layout = (4,4), sharex = False, sharey = False,color='black')
plt.show()

In [None]:
values = ['Norte','Centro','Área Metropolitana de Lisboa', 'Alentejo', 'Algarve']
df1 = census_df.loc[census_df['NUTS2'].isin(values)].copy()

In [None]:
# call the method (with parentheses) to get the summary statistics
df1.describe()

In [None]:
df1.drop(['municipality', 'freguesia', 'NUTS2'], axis = 1, inplace = True)

## 2. Perform the Principal Component Analysis

The data need to be scaled; otherwise variables with larger absolute values would carry more weight in the analysis and bias the result.

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
data_cluster=df1.copy()
data_cluster[data_cluster.columns]=std_scaler.fit_transform(data_cluster)

In [None]:
data_cluster.describe()
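
The effect of scaling on PCA can be illustrated with a minimal sketch on synthetic data (the two features, their scales and the variable names below are assumptions made for illustration, not part of the census table). Without scaling, the feature with the larger variance dominates the first component almost entirely.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales (illustrative only)
small = rng.normal(0, 1, size=500)
large = rng.normal(0, 1000, size=500)
X = np.column_stack([small, large])

# PCA on the raw data: the large-scale feature dominates the first component
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA on standardised data: both features contribute on an equal footing
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("Unscaled:", raw_ratio.round(3))
print("Scaled:  ", scaled_ratio.round(3))
```

After standardisation the two components should share the variance roughly evenly, which is what we would expect from two unrelated variables.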

Here we calculate the PCA, in this case retaining only two components.

In [None]:
from sklearn.decomposition import PCA
pca_2 = PCA(2)
pca_2_result = pca_2.fit_transform(data_cluster)

print ('Cumulative variance explained by 2 principal components: {:.2%}'.format(np.sum(pca_2.explained_variance_ratio_)))

The cumulative variance retained is around 70% of the total variance, which can be considered a good value. However, we should check the plot of the first factorial plane, that is, the samples projected onto the two principal components.
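
One way to judge whether two components are enough is to look at the cumulative explained variance over all components. The sketch below does this on synthetic data; the data, the 70% threshold and the names `X_demo` and `n_needed` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: 6 observed variables driven by 2 latent factors plus noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(300, 6))

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(StandardScaler().fit_transform(X_demo))
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 70% (hypothetical threshold)
n_needed = int(np.searchsorted(cumulative, 0.70) + 1)

for i, share in enumerate(cumulative, start=1):
    print(f"{i} component(s): {share:.1%}")
print("Components needed for 70%:", n_needed)
```

With the notebook's data, the same idea would apply to `PCA().fit(data_cluster)`.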

In [None]:
sns.set(style='white', rc={'figure.figsize':(9,6)},font_scale=1.1)

plt.scatter(x=pca_2_result[:, 0], y=pca_2_result[:, 1], color='red',lw=0.1)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Data represented by the 2 strongest principal components',fontweight='bold')
plt.show()

We can see that the plot is highly dependent on a single sample, which strongly biases the results. We should consider cleaning the sample set of outliers. We will use Isolation Forest to do this.

In [None]:
from sklearn.ensemble import IsolationForest
df2 = df1.copy()

In [None]:
# Model building
# contamination is the expected share of outliers: 0.1 flags about 10% of the samples
model = IsolationForest(n_estimators=150, max_samples='auto', contamination=0.1, max_features=1.0)
model.fit(df2)

In [None]:
# Adding 'scores' and 'anomaly' columns to df
scores=model.decision_function(df2)
anomaly=model.predict(df2)

df2['scores']=scores
df2['anomaly']=anomaly

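# rows flagged as outliers (prediction -1); a separate name avoids shadowing the prediction array
outliers = df2.loc[df2['anomaly'] == -1]
anomaly_index = list(outliers.index)
print('Total number of outliers is:', len(outliers))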

In [None]:
# dropping outliers
df2 = df2.drop(anomaly_index, axis = 0).reset_index(drop=True)
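
It is worth keeping in mind that `contamination` fixes the share of samples that Isolation Forest flags, regardless of how many genuine outliers exist. The following sketch on synthetic data (all values and names here are assumptions for illustration) plants 10 unusual points among 200, yet about 20 are flagged because `contamination=0.1`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 190 regular points plus 10 points scattered over a much wider range (illustrative)
inliers = rng.normal(0, 1, size=(190, 2))
scattered = rng.uniform(-8, 8, size=(10, 2))
X_if = np.vstack([inliers, scattered])

# random_state is set here only to make the sketch reproducible
forest = IsolationForest(n_estimators=150, contamination=0.1, random_state=42)
labels = forest.fit_predict(X_if)  # -1 = outlier, 1 = inlier

n_flagged = int((labels == -1).sum())
print("Samples flagged:", n_flagged, "of", len(X_if))
```

If the flagged share looks too aggressive for the data at hand, a lower `contamination` value or `contamination='auto'` may be worth trying.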

We can repeat the data visualisation to see the effects of removing the outliers.

In [None]:
sns.set(style='white',font_scale=1.3, rc={'figure.figsize':(20,20)})
ax=df2.hist(bins=20,color='red' )

In [None]:
df2.plot( kind = 'box', subplots = True, layout = (4,4), sharex = False, sharey = False,color='black')
plt.show()

In [None]:
# dropping columns that we don't need any more
df2.drop(['scores', 'anomaly'], axis = 1, inplace =True)

## Repeat PCA calculation

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
data_cluster=df2.copy()
data_cluster[data_cluster.columns]=std_scaler.fit_transform(data_cluster)

In [None]:
from sklearn.decomposition import PCA
pca_2 = PCA(2)
pca_2_result = pca_2.fit_transform(data_cluster)

print ('Cumulative variance explained by 2 principal components: {:.2%}'.format(np.sum(pca_2.explained_variance_ratio_)))

In [None]:
sns.set(style='white', rc={'figure.figsize':(9,6)},font_scale=1.1)

plt.scatter(x=pca_2_result[:, 0], y=pca_2_result[:, 1], color='red',lw=0.1)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Data represented by the 2 strongest principal components',fontweight='bold')
plt.show()

This time, the bias is not as pronounced as before. We will proceed to the cluster creation.

In [None]:
pca_2_result = pd.DataFrame(pca_2_result, columns=["PC1","PC2"])

In [None]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(pca_2_result)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
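
The elbow in a WCSS curve can be hard to read, so a second criterion is sometimes useful. The sketch below uses the silhouette score as a complement; this is a swapped-in technique, not part of the original notebook, and the synthetic blobs and names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters (illustrative only)
centers = [(-10, -10), (0, 10), (10, -10)]
X_blobs, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=42)

# Silhouette score for each candidate number of clusters (it needs at least 2 clusters)
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels_k)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k by silhouette:", best_k)
```

Higher silhouette values indicate tighter, better separated clusters, so the preferred number of clusters is the one with the highest score.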

In [None]:
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(12, 5))
dendrogram = sch.dendrogram(sch.linkage(pca_2_result, method = 'ward'))
plt.title('Dendrogram')
plt.ylabel('Euclidean distances')
plt.show()
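
A dendrogram can also be cut programmatically to obtain cluster labels. The sketch below is a minimal illustration on synthetic data using SciPy's `fcluster`; the data and the choice of four clusters are assumptions for the example.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Four well-separated synthetic clusters (illustrative only)
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X_tree, _ = make_blobs(n_samples=200, centers=centers, cluster_std=1.0, random_state=0)

# Build the Ward linkage tree, then cut it so that at most 4 clusters remain
Z = sch.linkage(X_tree, method="ward")
tree_labels = sch.fcluster(Z, t=4, criterion="maxclust")

print("Cluster sizes:", np.bincount(tree_labels)[1:])  # fcluster labels start at 1
```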

In [None]:
df_kmeans = pca_2_result.copy()

Based on the elbow plot and the dendrogram above, we will use four clusters.

In [None]:
# Training model
kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(df_kmeans)

In [None]:
df_kmeans

In [None]:
# Start again from a clean copy of the PCA scores before attaching the cluster labels
df_kmeans = pca_2_result.copy()
# Checking number of items in clusters and creating 'Cluster' column
df_kmeans['Cluster'] = y_kmeans
df_kmeans['Cluster'].value_counts()

In [None]:
# plt.figure(figsize=(15,7))
sns.scatterplot(data=df_kmeans, x='PC1', y='PC2', hue = 'Cluster', s=15, palette="tab10")

## Hierarchical clustering
### Agglomerative clustering

In [None]:
# Copying data sets
df_AgglomerativeC = pca_2_result.copy()

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Training model
AgglomerativeC = AgglomerativeClustering(n_clusters=4, metric = 'euclidean', linkage = 'ward')
y_AgglomerativeC = AgglomerativeC.fit_predict(df_AgglomerativeC)

In [None]:
# Start again from a clean copy of the PCA scores before attaching the cluster labels
df_AgglomerativeC = pca_2_result.copy()
# Checking number of items in clusters and creating 'Cluster' column
df_AgglomerativeC['Cluster'] = y_AgglomerativeC
df_AgglomerativeC['Cluster'].value_counts()

In [None]:
plt.figure(figsize=(15,7))
sns.scatterplot(data=df_AgglomerativeC, x='PC1', y='PC2', hue = 'Cluster', s=15, palette="tab10")
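
To quantify how closely the two clusterings agree, one option is the adjusted Rand index, which ignores the arbitrary numbering of the clusters. The sketch below runs on synthetic data; the blobs and variable names are assumptions for illustration, and this comparison is an addition rather than part of the original analysis.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Four well-separated synthetic clusters (illustrative only)
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X_cmp, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

# Cluster the same data with both methods
km_labels = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit_predict(X_cmp)
ag_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X_cmp)

# 1.0 means identical partitions; values near 0 mean agreement no better than chance
ari = adjusted_rand_score(km_labels, ag_labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

With the notebook's results, the equivalent call would be `adjusted_rand_score(y_kmeans, y_AgglomerativeC)`.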