### GreenDS
# Fundamentals of Data Science
## Example on Unsupervised Machine Learning - Clustering
### Example 03.1

### Introduction

The purpose of this Jupyter Notebook is to demonstrate the process of creating clusters on data for which we do not have a prior classification. We will explore two commonly used methods:
- K-means
- Hierarchical clustering

The data come from the Agricultural Census of Portugal: data from 2019 on **level of education**, **labour** and **production** were aggregated into a single table at the level of the freguesia (civil parish).

## 1. Prepare your environment and explore data

Import the necessary modules and the data file.

In [None]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline

The data were extracted from the dms_INE database using the SQL script stored in the *script* directory of this project. Import the data into a pandas DataFrame:

In [None]:
# read data
census_df = pd.read_csv("./raw-data/data_agric_census_freg.csv")
census_df.shape

We can see that the table contains 3068 rows and 16 columns. Let us inspect the structure of the pandas DataFrame:

In [None]:
census_df.info()

We can also see a preview of the table:

In [None]:
census_df.head()

The table contains 16 columns. The first three identify the freguesia (the operational unit of the table), its municipality and its NUTS 2 region. The following columns correspond to education (`e_` prefix) and labour (`l_` prefix), and the final two hold the production value, in euros, and the production area, in hectares.
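As a small illustration of this naming convention, the sketch below groups columns by their prefix. The toy dataframe and its values are made up for the example; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy dataframe with made-up values; only the column names mirror the census table
toy_df = pd.DataFrame({
    "freguesia": ["A", "B"],
    "e_none": [10, 3],
    "e_basic": [40, 25],
    "l_family": [30, 12],
    "l_regular": [5, 8],
    "value_eur": [120000.0, 45000.0],
    "area_ha": [350.0, 90.0],
})

# Select the education and labour columns by their prefix
education_cols = [c for c in toy_df.columns if c.startswith("e_")]
labour_cols = [c for c in toy_df.columns if c.startswith("l_")]

print(education_cols)
print(labour_cols)
```

The same list comprehensions could be applied to `census_df.columns` to work with one group of variables at a time.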

We can start exploring data by checking the histograms.

In [None]:
sns.set(style='white',font_scale=1.3, rc={'figure.figsize':(20,20)})
ax=census_df.hist(bins=20,color='red' )
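Besides inspecting the histograms visually, the asymmetry of a variable can be quantified with the sample skewness. The sketch below uses synthetic log-normal data as a stand-in for a census variable (the data and the `toy` dataframe are assumptions for illustration) and shows how a log transform might reduce the skew.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), standing in for a census variable
rng = np.random.default_rng(42)
toy = pd.DataFrame({"value_eur": rng.lognormal(mean=10, sigma=1, size=1000)})

# Sample skewness: values well above 0 indicate a long right tail
skew_raw = toy["value_eur"].skew()

# log1p compresses large values and usually reduces right skew
skew_log = np.log1p(toy["value_eur"]).skew()

print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

On the real table, `census_df.skew(numeric_only=True)` would give one skewness value per numeric column.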

We can observe that most of the variables are skewed, indicating an uneven distribution. This may reduce the effectiveness of linear methods for analysing the data. Let's create another visualization, a scatterplot matrix between all variables. It is suggested that you analyse the output image in detail.

In [None]:
g = sns.PairGrid(census_df, hue="NUTS2")
g.map(sns.scatterplot)
g.add_legend()

A detailed analysis reveals a group of samples that behaves differently from the others. These are the samples from Madeira and the Azores. Since these regions are insular, and their agriculture naturally differs from the practices of mainland Portugal, it is a good idea to remove the rows from these two regions from the analysis. Therefore, we keep only the rows for the NUTS 2 regions of mainland Portugal.

In [None]:
values = ['Norte','Centro','Área Metropolitana de Lisboa', 'Alentejo', 'Algarve']
df1 = census_df.loc[census_df['NUTS2'].isin(values)].copy()

Let us preview the current format of the dataframe.

In [None]:
df1.head()

And we can make a scatter plot to compare two variables, in this case the production value and production area.

In [None]:
sns.lmplot( x="value_eur", y="area_ha", data=df1, fit_reg=False, hue='NUTS2', legend=True)

In the next cell, create scatter plots of the production value against other variables, chosen from the education and labour factors.

In [None]:
# put your code here

To continue with the analysis, we will remove the non-numeric columns (identified as *object* in the dataframe structure above):

In [None]:
df1.drop(['municipality', 'freguesia', 'NUTS2'], axis = 1, inplace = True)

In [None]:
df1.head()

We can also prepare a summary table for the dataframe, with descriptive statistics:

In [None]:
df1.describe(include='all')

Boxplots are very useful for analysing the presence of outliers. The next code generates a boxplot for each variable:

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=5,figsize=(12,18))
fig.suptitle('Outliers\n', size = 25)

sns.boxplot(ax=axes[0, 0], data=df1['e_none'], palette='Spectral').set_title("education none")
sns.boxplot(ax=axes[0, 1], data=df1['e_basic'], palette='Spectral').set_title("education basic")
sns.boxplot(ax=axes[0, 2], data=df1['e_secondary'], palette='Spectral').set_title("education secondary")
sns.boxplot(ax=axes[0, 3], data=df1['e_superior'], palette='Spectral').set_title("education superior")
sns.boxplot(ax=axes[0, 4], data=df1['l_family'], palette='Spectral').set_title("labour family")
sns.boxplot(ax=axes[1, 0], data=df1['l_holder'], palette='Spectral').set_title("labour holder")
sns.boxplot(ax=axes[1, 1], data=df1['l_spouse'], palette='Spectral').set_title("labour spouse")
sns.boxplot(ax=axes[1, 2], data=df1['l_other_fam'], palette='Spectral').set_title("labour other family")
sns.boxplot(ax=axes[1, 3], data=df1['l_regular'], palette='Spectral').set_title("labour regular")
sns.boxplot(ax=axes[1, 4], data=df1['l_non_regular'], palette='Spectral').set_title("labour non regular")
sns.boxplot(ax=axes[2, 0], data=df1['l_non_hired'], palette='Spectral').set_title("labour non hired")
sns.boxplot(ax=axes[2, 1], data=df1['value_eur'], palette='Spectral').set_title("production value")
sns.boxplot(ax=axes[2, 2], data=df1['area_ha'], palette='Spectral').set_title("production area")

plt.tight_layout()


As we suspected from the analysis of the histograms above, there are many outliers.
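The points drawn individually in a boxplot are, by seaborn's default, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on a small made-up series (the values are not from the census) could look like this:

```python
import pandas as pd

# Small made-up sample with one obviously extreme value
toy = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], name="value")

# Quartiles and interquartile range
q1, q3 = toy.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers in a boxplot
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
print(toy[is_outlier].tolist())
```

Applying the same mask column by column to `df1` would give a count of flagged points per variable.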

## Scaling data

Now we make a copy of the dataframe and scale it. Here, scaling means standardizing each variable to zero mean and unit variance, so that all variables vary on a comparable scale. This is important so that a variable does not carry a higher weight in the cluster analysis only because its absolute values are larger.

In [None]:
df2 = df1.copy()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df3=scaler.fit_transform(df2)

In [None]:
# Check the scaled values
df3
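To make the effect of `StandardScaler` concrete, the sketch below compares it with a manual z-score on a tiny made-up array (the array `X` is an assumption for illustration, not census data). After scaling, each column should have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales (e.g. workers and euros)
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0],
              [4.0, 7000.0]])

X_scaled = StandardScaler().fit_transform(X)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Note that `fit_transform` returns a NumPy array, which is why `df3` above no longer carries column names.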

## 2. Determine k

First, we need to determine the number of clusters, k. We can use the Elbow method as a guide.

In [None]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(df3)
    wcss.append(kmeans.inertia_)
    
plt.subplots(nrows=1, ncols=1,figsize=(10,10))
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

To determine the number of clusters, look for the point where the slope of the curve changes most sharply (the "elbow"): beyond it, adding more clusters reduces the WCSS only slightly.
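Reading the elbow off the plot is subjective. One simple heuristic, offered here only as a rough aid, is to look for the largest second difference of the WCSS curve. The WCSS values below are hypothetical, chosen for illustration; in practice the `wcss` list computed above would be used.

```python
import numpy as np

# Hypothetical WCSS values for k = 1..8 (made up for this sketch)
ks = np.arange(1, 9)
wcss_example = np.array([1000.0, 780.0, 590.0, 410.0, 385.0, 365.0, 350.0, 340.0])

# Second difference: how much the drop in WCSS shrinks around each interior k
second_diff = np.diff(wcss_example, n=2)

# second_diff[i] is centred on ks[i + 1], so skip the first and last k
elbow_k = int(ks[1:-1][np.argmax(second_diff)])
print(elbow_k)
```

This heuristic can be misled by noisy or smoothly decaying curves, so it is best used together with the plot rather than instead of it.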

A dendrogram can also help. It is a tree representation of an agglomerative clustering, which is also an unsupervised method.

In [None]:
import scipy.cluster.hierarchy as sch

# Reuse plt (imported above) instead of importing pyplot a second time
plt.figure(figsize=(12, 5))
# Ward linkage computed on the scaled data
dendrogram = sch.dendrogram(sch.linkage(df3, method = 'ward'))
plt.title('Dendrogram')
plt.ylabel('Euclidean distances')
plt.show()

From the Elbow method and the dendrogram, it seems that there are 4 clusters, one of them very small. Let's use that number.
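The cluster sizes suggested by a dendrogram can be checked numerically by cutting the tree at a chosen number of clusters with SciPy's `fcluster`. The sketch below uses synthetic data with three larger groups and one very small one (all values are made up); on the real data, `df3` would take the place of `X`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: three larger groups and one very small group (made up)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.3, size=(50, 2)),
    rng.normal(loc=[20, 20], scale=0.3, size=(3, 2)),
])

# Same linkage method as the dendrogram above, then cut the tree into 4 clusters
Z = linkage(X, method="ward")
labels = fcluster(Z, t=4, criterion="maxclust")

# Cluster sizes (fcluster labels start at 1, so drop the empty bin 0)
sizes = np.bincount(labels)[1:]
print(sorted(sizes.tolist()))
```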

## 3. Start K-means calculation

In [None]:
# Fit on the scaled array (df3), the same data used for the Elbow method
df_kmeans = df3.copy()

In [None]:
# Training model with the 4 clusters chosen above
kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(df_kmeans)

In [None]:
df_kmeans

In [None]:
df_kmeans = df1.copy()
# Checking number of items in clusters and creating 'Cluster' column
df_kmeans['Cluster'] = y_kmeans
df_kmeans['Cluster'].value_counts()

The number of elements in each cluster is very different. There are only four samples that belong to cluster 3. Those samples need to be identified in detail, in order to understand why they separate from the others. We can make a plot to check the clusters in relation to two selected variables.

In [None]:
plt.figure(figsize=(15,7))
sns.scatterplot(data=df_kmeans, x='e_secondary', y='value_eur', hue = 'Cluster', s=15, palette="tab10")

The production value seems to be a major factor in the creation of the clusters. The samples of cluster 3 are the ones with the highest values. We could follow up on this to identify which freguesias these are, and try to understand their behaviour.
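Because the name columns were dropped from `df1` but the row index was preserved, the freguesias in a given cluster can be looked up by index in the original table. The sketch below illustrates the idea with two small made-up dataframes, `names_df` and `clustered_df`, standing in for `census_df` and `df_kmeans`; the names and values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for `census_df` (with names) and `df_kmeans` (with cluster labels)
names_df = pd.DataFrame(
    {"freguesia": ["A", "B", "C", "D", "E"],
     "municipality": ["M1", "M1", "M2", "M3", "M3"]},
    index=[0, 1, 4, 7, 9],  # non-contiguous index, as left behind by row filtering
)
clustered_df = pd.DataFrame(
    {"value_eur": [1e4, 2e4, 9e6, 3e4, 8e6], "Cluster": [0, 0, 3, 1, 3]},
    index=[0, 1, 4, 7, 9],
)

# Index labels of the rows in the cluster of interest
idx = clustered_df.index[clustered_df["Cluster"] == 3]

# Look the names up by index label and attach the production value
result = names_df.loc[idx].join(clustered_df.loc[idx, "value_eur"])
print(result)
```

Using `.loc` with index labels, rather than positions, keeps the lookup correct even after rows have been filtered out.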

Try plotting the production value against other variables, to see how the clusters appear.

In [None]:
# add your code here to create plots...

## 4. Hierarchical clustering
### Agglomerative clustering

Another unsupervised ML clustering method is agglomerative clustering, which takes a different approach. First, a distance metric is computed between all samples; then a linkage method is applied to merge samples, step by step, based on those distances. In this case, we will use the Euclidean distance and the Ward linkage method.
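The two steps can be sketched with SciPy on four made-up points forming two close pairs. This is only an illustration of the idea, not the exact procedure scikit-learn runs internally.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Four made-up points: two close pairs, far from each other
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Step 1: pairwise Euclidean distances between all samples
D = squareform(pdist(X, metric="euclidean"))

# Step 2: Ward linkage merges, at each step, the pair of clusters whose
# union increases the within-cluster variance the least
Z = linkage(X, method="ward")

# Each row of Z: [cluster_a, cluster_b, merge_distance, n_samples_in_new_cluster]
print(D.round(2))
print(Z)
```

The close pairs are merged first at a small distance, and the final merge joining the two pairs happens at a much larger one, which is what shows up as a tall branch in a dendrogram.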

In [None]:
# Copying data sets
df_AgglomerativeC = df3.copy()

In [None]:
from sklearn.cluster import AgglomerativeClustering

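# Training model (Ward linkage always uses Euclidean distances, the default metric;
# the `affinity` argument was removed in recent scikit-learn versions)
AgglomerativeC = AgglomerativeClustering(n_clusters=4, linkage = 'ward')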
y_AgglomerativeC = AgglomerativeC.fit_predict(df_AgglomerativeC)

In [None]:
df_AgglomerativeC = df2.copy()
# Checking number of items in clusters and creating 'Cluster' column
df_AgglomerativeC['Cluster'] = y_AgglomerativeC
df_AgglomerativeC['Cluster'].value_counts()

Four clusters were created. Again, one of the clusters has only one sample. Let's create a plot.

In [None]:
plt.figure(figsize=(15,7))
sns.scatterplot(data=df_AgglomerativeC, x='e_secondary', y='value_eur', hue = 'Cluster', s=15, palette="tab10")

The separation between clusters is not as clear as in the k-means result. The work on this dataset still needs more exploration. Continue to the other notebook of this example.
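One way to compare the two results beyond visual inspection is a contingency table of the two label vectors, together with the adjusted Rand index. The sketch below uses short made-up label arrays as stand-ins for `y_kmeans` and `y_AgglomerativeC`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Made-up label vectors standing in for `y_kmeans` and `y_AgglomerativeC`
labels_a = np.array([0, 0, 0, 1, 1, 2, 2, 2])
labels_b = np.array([1, 1, 1, 0, 0, 2, 2, 0])

# Contingency table: rows are clusters of the first method, columns of the second
table = pd.crosstab(labels_a, labels_b, rownames=["k-means"], colnames=["agglomerative"])

# Adjusted Rand index: 1.0 means identical partitions (up to label renaming)
ari = adjusted_rand_score(labels_a, labels_b)
print(table)
print(round(ari, 3))
```

Cluster numbers are arbitrary in both methods, so agreement shows up as one dominant cell per row of the table rather than as matching labels.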