# Life Expectancy Analysis by District - Madrid 2019

In [None]:
# Libraries 

library(tidyverse)
library(readxl)
library(janitor)
library(heatmaply)
library(factoextra)
library(cluster)

In [None]:
# Dataset

data <- read_excel("../input/es-madrid-2019/ES_Madrid_2019.xlsx")


In [None]:
head(data)

**Simple Data Cleaning**

* Column Names
* Column Data Type
* Modification of the First Column

In [None]:
# Column names: promote the first row to the header, then rename the first column
data <- data %>%
    row_to_names(row_number = 1)
names(data)[1] <- 'Distrito'

In [None]:
# Convert columns 2 to 22 from character to numeric
data <- 
    data %>% mutate_at(2:22, as.numeric)
head(data)

In [None]:
# Round the numeric columns (life expectancy) to 2 decimal places
data <- 
    data %>% mutate_if(is.numeric, round, digits = 2)
head(data,5)

In [None]:
# District column: keep the text from the 4th character onwards (drops the leading prefix)
data <- data %>% mutate(Distrito = str_sub(Distrito,4))
head(data, 3)

**Dataset Analysis**

In [None]:
# Dataset Summary
summary(data)

The dataset contains no NAs, and the boxplots below show no outliers.
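A minimal sketch of how this check could be done numerically rather than by eye. The data frame `toy` and its values are made up for illustration and only stand in for the cleaned dataset; the outlier rule is the usual 1.5 * IQR boxplot rule from `boxplot.stats`.

```r
# Toy data frame standing in for the cleaned dataset (values are made up)
toy <- data.frame(
    Distrito = c("A", "B", "C", "D", "E"),
    y2018 = c(83.1, 84.0, 82.5, 83.7, 84.2),
    y2019 = c(83.4, 84.3, 82.9, 83.9, 84.6)
)

# Keep only the numeric columns
num_cols <- toy[sapply(toy, is.numeric)]

# Number of missing values per column
na_counts <- colSums(is.na(num_cols))

# Number of points beyond the boxplot whiskers (1.5 * IQR rule) per column
outlier_counts <- sapply(num_cols, function(x) length(boxplot.stats(x)$out))

print(na_counts)
print(outlier_counts)
```

If both vectors contain only zeros, the columns have no missing values and no boxplot outliers.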

In [None]:
# One boxplot per numeric column (column 1 holds the district names)
for (i in 2:ncol(data)) {
    boxplot(data[[i]], main = names(data)[i])
}

We will group the districts into clusters based on their life expectancy.

In [None]:
# Use the district names as row names and drop the district column
data <- data %>%
    as.data.frame()
rownames(data) <- data[,1]
data <- data[, -1]
head(data, 3)

First conclusions with a heatmap and a distance matrix

In [None]:
ggheatmap(as.matrix(data), seriate="mean")


In [None]:
# Distance matrix (the data needs to be standardized first)
data_st <- scale(data)
d_st <- dist(data_st, method = "euclidean")
fviz_dist(d_st, show_labels = TRUE)

We can see the first groups emerging from the pairwise distances between districts. For example, Moratalaz, Chamberí and Salamanca are closer to each other than they are to other districts such as Puente de Vallecas.
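To make the idea of "distance between districts" concrete, here is a small sketch of the same two steps (standardise, then Euclidean distance) on a made-up matrix. The row names `A`, `B`, `C` and the columns `y1`, `y2` are hypothetical and not taken from the dataset.

```r
# Toy matrix: rows are hypothetical districts, columns are hypothetical years
m <- matrix(c(80.0, 81.0,
              80.5, 81.5,
              85.0, 86.0),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("y1", "y2")))

# Centre each column and divide it by its standard deviation
m_st <- scale(m)

# Pairwise Euclidean distances between rows, as a full square matrix
d <- as.matrix(dist(m_st, method = "euclidean"))
print(round(d, 2))
```

Rows with similar values in every column (here `A` and `B`) end up with a small distance, which is what the darker or lighter blocks in the distance plot reflect.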

In [None]:
# Dendrogram using Ward's method
res.hc <- hclust(d_st, method = "ward.D2")
fviz_dend(res.hc, cex = 0.5)

We will cut the dendrogram into 4 clusters.
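As a rough illustration of what cutting a dendrogram returns, the sketch below runs `hclust` and `cutree` on made-up one-dimensional data. The names `d1` to `d8` and the values are hypothetical.

```r
# Toy 1-D data (made up): four tight groups of two points each
x <- matrix(c(1, 1.1, 5, 5.1, 9, 9.1, 13, 13.1), ncol = 1,
            dimnames = list(paste0("d", 1:8), "value"))

# Hierarchical clustering with Ward's method on Euclidean distances
hc <- hclust(dist(x), method = "ward.D2")

# Named integer vector: one cluster id per row
groups <- cutree(hc, k = 4)

# Cluster sizes
print(table(groups))
```

The result of `cutree` can be used to count the members of each cluster or passed on to a plotting function.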

In [None]:
cluster <- cutree(res.hc, k = 4)
fviz_dend(res.hc, k = 4, # Cut in four groups
          cex = 0.5, # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE) # Add rectangle around groups

In [None]:
fviz_cluster(list(data = data_st, cluster = cluster),
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"), 
             ellipse.type = "convex", # Convex hull around each cluster
             repel = TRUE, # Avoid label overplotting (slow)
             show.clust.cent = FALSE, ggtheme = theme_minimal())

In [None]:
set.seed(123)

In [None]:
km.res <- kmeans(data_st,4)
fviz_cluster(km.res, data_st)

We will use the Elbow and Silhouette methods to identify the best number of clusters.
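For reference, the quantity behind the Elbow plot is the total within-cluster sum of squares for each candidate k. A minimal sketch of computing it by hand follows, on made-up data with three obvious groups; `toy` and the choice of `nstart = 25` are assumptions for this illustration only.

```r
set.seed(123)

# Toy data (made up): three well-separated groups of 10 points each
toy <- rbind(
    matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2),
    matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
    kmeans(toy, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 2))
```

The "elbow" is the k after which the values stop dropping sharply; on this toy data that happens at k = 3.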

In [None]:
# Elbow
fviz_nbclust(data_st, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2) +
  labs(subtitle = "Elbow method")

In [None]:
# Silhouette method
fviz_nbclust(data_st, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

Cluster Analysis

In [None]:
sil <- silhouette(km.res$cluster, dist(data_st))
rownames(sil) <- rownames(data_st)
fviz_silhouette(sil)

A negative silhouette value indicates that a district is probably not assigned to the appropriate cluster: on average it is closer to the members of another cluster than to those of its own.
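A small sketch of how the affected observations could be listed by name instead of being read off the plot. The data and labels are made up, and point `e` is deliberately given the wrong label so that its silhouette width is negative.

```r
library(cluster)

# Toy 1-D data (made up): two groups around 1 and around 5
x <- matrix(c(1, 1.2, 0.9, 5, 1.1, 5.2, 4.9), ncol = 1,
            dimnames = list(letters[1:7], "value"))

# Hypothetical labels: "e" (value 1.1) lies with group 1 but is labelled 2
labels <- c(1L, 1L, 1L, 2L, 2L, 2L, 2L)

sil <- silhouette(labels, dist(x))
rownames(sil) <- rownames(x)

# Observations with a negative silhouette width
misplaced <- rownames(sil)[sil[, "sil_width"] < 0]
print(misplaced)
```

The same filter can be applied to any silhouette object whose row names have been set.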

Based on the results from the Silhouette analysis, we will try with the following number of clusters: 3.

In [None]:
km.res_3 <- kmeans(data_st,3)
fviz_cluster(km.res_3, data_st)

In [None]:
sil <- silhouette(km.res_3$cluster, dist(data_st))
rownames(sil) <- rownames(data_st)
fviz_silhouette(sil)

Again, one district is not in the appropriate cluster (negative silhouette value).

Based on the Elbow Method, we will try with the following number of clusters: 5.

In [None]:
km.res_5 <- kmeans(data_st,5)
fviz_cluster(km.res_5, data_st)

In [None]:
sil <- silhouette(km.res_5$cluster, dist(data_st))
rownames(sil) <- rownames(data_st)
fviz_silhouette(sil)

Based on these results, the best number of clusters is 4.
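One possible way to put the candidates side by side is to compare the average silhouette width for each k. The sketch below does this on made-up data with four clear groups; `toy`, `centres` and `nstart = 25` are assumptions for the illustration and the numbers say nothing about the Madrid dataset.

```r
library(cluster)
set.seed(123)

# Toy data (made up): four well-separated groups of 8 points each
centres <- rbind(c(0, 0), c(0, 6), c(6, 0), c(6, 6))
toy <- do.call(rbind, lapply(1:4, function(i) {
    cbind(rnorm(8, centres[i, 1], 0.4), rnorm(8, centres[i, 2], 0.4))
}))

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(c(3, 4, 5), function(k) {
    km <- kmeans(toy, centers = k, nstart = 25)
    mean(silhouette(km$cluster, dist(toy))[, "sil_width"])
})
names(avg_sil) <- c("k=3", "k=4", "k=5")
print(round(avg_sil, 3))
```

The k with the highest average width is the one where observations fit their own cluster best relative to the nearest other cluster.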

Conclusions:

In [None]:
set.seed(123)
final <- kmeans(data_st, 4)
print(final)

In [None]:
fviz_cluster(final, data = data_st)