
#1. Problem Definition


a) **Defining the Data Analytic Question** 

* The bank has difficulty understanding how the various characteristics of its customers influence their chances of defaulting on loans.
* There is a gap in understanding how the different customer characteristics relate to one another, and how the bank can mine this information to tailor its marketing strategy to specific customer segments.

We shall, therefore, perform K-means and hierarchical clustering on a bank's customer data to draw insights into the characteristics of different customers.

b) **Metrics for Success**

This study shall be deemed successful if:


*   EDA is properly performed on the data
*   Relevant insights are drawn from the data analysis
*   Customers are optimally clustered based on their characteristics
*   K-Means and Hierarchical clustering are clearly compared on the dataset






c) **Context**

Customer segmentation is the process of dividing the customer base into groups of individuals who are similar in ways that are relevant to marketing, such as gender, age, level of education, credit score, interests, and spending habits. Customer segmentation can help us divide a diverse market into a number of smaller, more homogeneous markets based on one or more meaningful characteristics.

The benefits of customer segmentation include:

* Greater company focus - a company is able to understand its customer base and segments, and so be aware of which products or services to offer to which segment.
* Targeted communication - with well-defined segments, a company can choose the proper communication platforms and pass on information relevant to each segment.
* Satisfaction of customer preferences - given that segments differ from one another, say in income level and age, a company can offer different product/service bundles and incentives to different segments.

The study will be relevant in that it will:
- Help the bank relate the various customer characteristics, say level of income and age, to the chances of default, and so understand where higher credit risks lie.
- Help the bank employ targeted communication for each customer segment.


d) **Experimental Design**


Steps followed are:
- Business Understanding 
- Reading and previewing data
- Data preprocessing 
- Modelling
- Challenging the solution 
- Conclusion 

e) **Relevance of the data**

The dataset used in this study was sourced from Kaggle and comprises bank customer information. This information includes customer age, level of education, years of employment, income, debts and debt-income ratio. The data has 850 observations and 10 features. It was deemed relevant in meeting the objectives of this study.

#2. Loading data

In [None]:
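# Load data.table quietly; the library() call must be wrapped for the messages to be suppressed
suppressPackageStartupMessages(library(data.table))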

In [None]:
# Installation and loading of relevant packages
library(data.table)
library(dplyr)
install.packages('caret')
library(caret)
install.packages('Amelia')
library(Amelia)
#package for the multiple imputation of multivariate incomplete data
install.packages("psych")
library(psych)
install.packages("ggcorrplot")
library(ggcorrplot)
install.packages("cluster")
library(cluster)







In [None]:
#install.packages("factoextra")
#library(factoextra)

In [None]:
# Reading the data
df <- fread('/content/Customer_Segmentation.csv')

In [None]:
#review 4 rows
head(df,4)
#View(df)

#3. Checking the Data

In [None]:
# Review bottom rows
tail(df,5)

In [None]:
# Check size of data

dim(df)

Data consists of 850 entries and 10 variables 

In [None]:
# Viewing features in the dataset
colnames(df)

In [None]:
# Checking the structure of the dataset
#library(dplyr)
glimpse(df)


Our data is in numeric (int, dbl) and character (chr) datatypes  

In [None]:
#Checking for unique values in each column

lapply(df, function(x){length(unique(x))})

#4. Data Preparation

##Uniformity

In [None]:
# We convert categorical variables (education and defaulted) to factors
df$Edu = factor(df$Edu)
df$Defaulted = factor(df$Defaulted)

In [None]:
#checking the new structure of the dataset
# str() prints its output directly; wrapping it in print() would also print a stray NULL
str(df)

In [None]:
#Viewing the statistical summary of the data
print(summary(df))

The summary shows missing values (NAs) under the Defaulted variable, which shall be dealt with.



##Completeness

In [None]:
#checking if there are missing values in the data
anyNA(df)

In [None]:
#checking which columns have missing values
print(colSums(is.na(df)))

There are 150 missing values under the feature 'Defaulted'.

In [None]:
# Calculating the percentage of missing values 
data.frame(colMeans(is.na(df)))*100

The 150 missing values affect about 17.6% of the observations (150 of 850). To avoid losing information from our relatively small dataset, we shall forward fill the missing values, i.e. replace each missing value with the last observed value above it.

In [None]:
#visualizing missing data - uses the library(Amelia)

missmap(df)

The missmap shows that there are some missing values.

In [None]:
# To forward fill missing values. The library(tidyr) is needed.
library(tidyr)
df_1 <- df %>% fill(Defaulted)
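
To make the effect of forward filling concrete, here is a minimal sketch on a small made-up table. The data frame `toy` and its values are assumptions for illustration only and are not taken from the bank dataset.

```r
library(tidyr)

# Toy data (hypothetical values, not from the bank dataset)
toy <- data.frame(id = 1:6, Defaulted = c(0, NA, NA, 1, NA, 0))

# fill() propagates the last non-missing value downwards by default
toy_filled <- fill(toy, Defaulted)
print(toy_filled$Defaulted)
```

Each filled value depends only on the row order, not on the customer's other characteristics, which is worth keeping in mind when interpreting the filled `Defaulted` column.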

In [None]:
# Check whether any missing values remain
sum(is.na(df_1))

No missing values remain.

In [None]:
#Confirm that no missing values with a missmap

missmap(df_1)

From the missmap, it is confirmed that there are no missing values left in the dataset

In [None]:
# View our new dataset without NAs
head(df_1, 10)

In [None]:
# New data size
dim(df_1)

We have 850 entries and 10 columns. All information in the dataset is retained

Dataset is now free of missing values. 

## Consistency

In [None]:
# Checking for duplicate rows
# (the result is stored under a name that does not shadow the duplicated() function)
duplicate_rows <- df_1[duplicated(df_1), ]
duplicate_rows

There are no duplicated rows in the dataset.

## Outliers

In [None]:
#checking for outliers

#plotting boxplots for all the numerical variables

#par(mfrow=c(1,1))   
#The par() function allows to set parameters to the plot. 
#The mfrow() parameter allows to split the screen in several panels. Subsequent charts will be drawn in panels.

boxplot((df_1$Age), horizontal = TRUE, col = 'red', main = "Boxplot of Age")

boxplot((df_1$`Years Employed`), horizontal = TRUE, col = 'red', main = "Boxplot of Years Employed")

boxplot((df_1$Income), horizontal = TRUE, col = 'cyan', main = "Boxplot of Income")

boxplot((df_1$`Card Debt`), horizontal = TRUE, col = 'purple', main = "Boxplot of Card Debt")

boxplot((df_1$`Other Debt`), horizontal = TRUE, col = 'violet', main = "Boxplot of Other Debt")

boxplot((df_1$DebtIncomeRatio), horizontal = TRUE, col = 'red', main = "Boxplot of Debt-Income Ratio")


Observations:
* Income has a single outlier, which lies above 400
* Card debt has one outlier, which lies above 20
* Other debt has 3 outliers, which lie above 20
* The debt-income ratio has 1 outlier, which lies above 40
* Age has no outliers
* Years employed has 2 outliers

#5. Univariate Analysis

In [None]:
#obtaining the statistical properties of the variables - uses the #library(psych)
describe(df_1)

In [None]:
# obtaining the statistical summary of the data
summary(df_1)

##Histograms

In [None]:
#plotting multiple histograms of the numerical variables
# par(mfrow=c(2,2))

#histogram of Income
hist((df_1$Income), col = 'cyan', main = "Histogram of Income")

#histogram of Age
hist((df_1$Age), col = 'purple', main = "Histogram of Age")

#histogram of Card Debt
hist((df_1$`Card Debt`), col = 'violet', main = "Histogram of Card Debt")

#histogram of Other Debt
hist((df_1$`Other Debt`), col = 'lightblue', main = "Histogram of Other Debt")

#histogram of Years Employed
hist((df_1$`Years Employed`), col = 'blue', main = "Histogram of Years Employed")

#histogram of Debt-Income Ratio
hist((df_1$DebtIncomeRatio), col = 'red', main = "Histogram of DebtIncomeRatio")

Observations:
* All the variables appear to be positively skewed except for Age, which appears to be approximately Gaussian.

##Bar Charts

In [None]:
# We will plot bar charts for factor features i.e Edu and Defaulted 

Education_level <-df_1$Edu                            #fetching education level
Education_level_frequency<- table(Education_level)    #creating a frequency table

default <-df_1$Defaulted                            #fetching default status
default_frequency<- table(default)                   #creating a frequency table

# Plotting bar charts using the frequency tables 

#par(mfrow=c(2,2))
barplot((Education_level_frequency), col = "gold", main = "Bar chart of Education Level")

#par(mfrow=c(2,2))
barplot((default_frequency), col = "red", main = "Bar chart of Default")

Observations:
* Most customers have not attained the highest level of education
* Most of the customers pay their debts (0 - not defaulters)

#6. Bivariate Analysis

###**Correlation**

In [None]:
install.packages('corrplot')
library(corrplot)

In [None]:
head(df_1)

In [None]:
#creating a dataframe of all the numerical columns
age<- df_1$Age
experience<- df_1$`Years Employed`
income <- df_1$Income
carddebt <- df_1$`Card Debt`
otherdebt <- df_1$`Other Debt`
debtratio <- df_1$DebtIncomeRatio

In [None]:
numerical <- data.frame(age, experience, income, carddebt, otherdebt, debtratio)
head(numerical)

In [None]:
#calculating a correlation matrix of the dataframe created 

corr <- round(cor(numerical), 1) #numerical matrix to 1 decimal point
head(corr[, 1:6])  #previewing the matrix

Observation:
* Age has a positive correlation with all variables except the debt-income ratio, with which it has zero correlation
* There is no correlation between the debt-income ratio and income, experience or age
* Income is positively correlated with age, experience and debt (card debt and other debt)

In [None]:
# Visualizing the correlation matrix (ggcorrplot was already installed in the package-loading cell)
library(ggcorrplot)
ggcorrplot(corr, method = "circle")

The more intense the color and the larger the circle, the stronger the correlation between the two variables.

In [None]:
corrmatrix <- cor(df_1[,4:7])
corrplot(corrmatrix, method = 'number')

###**Boxplots**

In [None]:
#Finding out how all the variables relate to debt defaulting

#plotting boxplots to show how Default relates to Income
# fill is mapped to the grouping factor; a continuous fill cannot be shown on a boxplot
ggplot(data = df_1, mapping = aes(x = Income, y = Defaulted, fill = Defaulted)) + 
  geom_boxplot()

In [None]:
#plotting boxplots to show how Default relates to Age
ggplot(data = df_1, mapping = aes(x = Age, y = Defaulted, fill = Defaulted)) + 
  geom_boxplot()

In [None]:
#plotting boxplots to show how Default relates to Card Debt
ggplot(data = df_1, mapping = aes(x = `Card Debt`, y = Defaulted, fill = Defaulted)) + 
  geom_boxplot()

In [None]:
#plotting boxplots to show how Default relates to Other Debt
ggplot(data = df_1, mapping = aes(x = `Other Debt`, y = Defaulted, fill = Defaulted)) + 
  geom_boxplot()

In [None]:
#plotting boxplots to show how Default relates to the debt-income ratio
ggplot(data = df_1, mapping = aes(x = DebtIncomeRatio, y = Defaulted, fill = Defaulted)) + 
  geom_boxplot()

In [None]:
#plotting boxplots to show how Default relates to years employed
ggplot(data = df_1, mapping = aes(x = `Years Employed`, y = Defaulted, fill = Defaulted)) + 
  geom_boxplot()

###**Barplot**

In [None]:
library(ggplot2)
install.packages('plotly')
library(plotly)

In [None]:
#barplot showing the mean income at each age
# stat_summary() computes the mean per group; position is not an aesthetic and a continuous fill is dropped by geom_bar()
ggplot(data = df_1, mapping = aes(x = Age, y = Income)) +
  stat_summary(fun = mean, geom = "bar")

In [None]:
#barplot showing how the mean debt-income ratio varies across 5 equal-width income bands
ggplot(data = df_1, mapping = aes(x = cut(Income, breaks = 5), y = DebtIncomeRatio)) + 
  stat_summary(fun = mean, geom = "bar") +
  labs(x = "Income band")

In [None]:
#barplot showing the mean card debt at each age
ggplot(data = df_1, mapping = aes(x = Age, y = `Card Debt`)) + 
  stat_summary(fun = mean, geom = "bar")

###**Scatter Plots**



In [None]:
#scatter plots to assess the linear relationships of the numerical variables

#scatter plot of Income and the card debt 
# columns are referred to by name inside aes(), not through df_1$...
ggplot(data = df_1) + 
  geom_point(mapping = aes(x = Income, y = `Card Debt`, color = Income)) 

Observation:
Customers with lower income have more card debt as compared to customers with higher income

In [None]:
#scatter plot of Income and the other debt 
ggplot(data = df_1) + 
  geom_point(mapping = aes(x = Income, y = `Other Debt`, color = Income)) 

Observation:
Customers with lower income have more 'other debt' as compared to customers with higher income

In [None]:
#scatter plot of Income and the age 
ggplot(data = df_1) + 
  geom_point(mapping = aes(x = Age, y = Income, color = Age)) 

Observation:

Age and income have a positive relationship. Customers above 40 years of age earn more than those below 40 years

In [None]:
#scatter plot of Income and the experience
ggplot(data = df_1) + 
  geom_point(mapping = aes(x = `Years Employed`, y = Income, color = `Years Employed`)) 

Experience and income have a positive relationship. Customers with more than 20 years of experience earn more than those below 20 years of experience

In [None]:
#scatter plot of Income and debt income ratio 
ggplot(data = df_1) + 
  geom_point(mapping = aes(x = Income, y = DebtIncomeRatio, color = Income)) 

Low income customers tend to have a higher debt-income ratio than those with higher income

In [None]:
#scatter plot of age and the debt income ratio
ggplot(data = df_1) + 
  geom_point(mapping = aes(x = Age, y = DebtIncomeRatio, color = DebtIncomeRatio)) 

Customers between 20-39 years of age have a higher debt income ratio compared to those above 40 years of age

In [None]:
#scatter plot of age and the other debt
ggplot(data = df_1) + 
  geom_point(mapping = aes(x = Age, y = `Other Debt`, color = Age)) 

Customers between 20-39 years of age have a higher 'other debt' compared to those above 40 years of age

In [None]:
install.packages('GGally')
library(GGally)

In [None]:
# scatter plots of all the numerical variables
#ggpairs(df_1, columns = c(2,7), ggplot2::aes(colour=Income)) 

#7. Implementing the Solution

###Data Pre-processing

In [None]:
install.packages('plyr')
library(plyr)
library(dplyr)
library(caret)

In [None]:
#Converting factor variables to ordinal features

# View the order of Edu Variable
print(table(df_1$Edu))

In [None]:
#setting Edu as an ordered factor and viewing its levels
df_1$Edu <- factor(df_1$Edu, ordered = TRUE, levels = c(1,2,3,4,5))
df_1$Edu 

In [None]:
# View the order of Defaulted Variable
print(table(df_1$Defaulted))

In [None]:
#setting Defaulted as an ordered factor and viewing its levels
df_1$Defaulted <- factor(df_1$Defaulted, ordered = TRUE, levels = c(0,1))
df_1$Defaulted 

In [None]:
unique(df_1$Address)

In [None]:
#encoding character variable address
# The address codes are listed once and reused for both steps
address_levels <- c('NBA001','NBA021','NBA013',
'NBA009','NBA008','NBA011','NBA010','NBA000','NBA004','NBA005','NBA022','NBA018',
'NBA002','NBA006','NBA007','NBA003','NBA026','NBA016','NBA019','NBA020','NBA012','NBA014','NBA015',
'NBA017','NBA023','NBA025','NBA027','NBA031','NBA024','NBA034','NBA029')

# Address as an ordered factor
df_1$Address <- factor(df_1$Address, ordered = TRUE, levels = address_levels)

# Relabel each level with its numeric part, e.g. 'NBA021' becomes 21
df_1$Address_Numeric <- mapvalues(df_1$Address, from = address_levels,
                                  to = as.numeric(sub('NBA', '', address_levels)))

In [None]:
#normalizing numerical variables
#function to normalize
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# subjecting the function to the variables
df_1$Income <- normalize(df_1$Income)
df_1$`Card Debt` <- normalize(df_1$`Card Debt`)
df_1$Age <- normalize(df_1$Age)
df_1$`Other Debt` <- normalize(df_1$`Other Debt`)
df_1$`Years Employed` <- normalize(df_1$`Years Employed`)
df_1$DebtIncomeRatio <- normalize(df_1$DebtIncomeRatio)
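
As a quick sanity check of the min-max formula used above, the sketch below applies the same calculation to three made-up ages. The helper name `min_max` and the values are assumptions for illustration; the smallest value should map to 0 and the largest to 1.

```r
# Same formula as normalize() above, under a hypothetical name
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

ages <- c(20, 35, 50)   # hypothetical values
scaled <- min_max(ages)
print(scaled)           # 0.0 0.5 1.0
```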

In [None]:
#viewing the structure of the encoded variables
str(df_1)

In [None]:
# Remove the address and customer id columns; to retain address numeric instead.

df_1$`Customer Id`<- NULL
df_1$Address <- NULL

In [None]:
# Rename columns
names(df_1)[names(df_1)== 'Edu'] <- "Education_Level"
names(df_1)[names(df_1)== 'Years Employed'] <- "Years_Employed"
names(df_1)[names(df_1)== 'Card Debt'] <- "Card_Debt"
names(df_1)[names(df_1)== 'Other Debt'] <- "Other_Debt"
# DebtIncomeRatio keeps its original name, since later cells refer to it as df2$DebtIncomeRatio

In [None]:
# Print column names after adjustments
print(colnames(df_1))

In [None]:
df2 <- df_1 %>% relocate(where(is.factor), .after = last_col())

In [None]:
head(df2,4)

#K-Means Clustering

K-means clustering is an unsupervised machine learning algorithm used for partitioning observations into a set of k clusters, where k is pre-specified. It tries to assign observations to mutually exclusive groups (clusters) such that observations within the same cluster are as similar as possible (high intra-class similarity), whereas observations from different clusters are as dissimilar as possible (low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e. centroid), which corresponds to the mean of the observations assigned to the cluster.
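
The statement that each centroid is the mean of its assigned observations can be checked on a small synthetic example. This is only a sketch: the matrix `toy` is made-up data, not the bank dataset.

```r
set.seed(1)

# Two well-separated toy groups in 2-D (hypothetical data)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# Recompute each centroid by hand as the column-wise mean of its members
manual_centers <- rbind(
  colMeans(toy[fit$cluster == 1, , drop = FALSE]),
  colMeans(toy[fit$cluster == 2, , drop = FALSE])
)

print(fit$centers)
print(manual_centers)
```

The two printed matrices should agree, which is what "represented by its centroid" means in practice.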

##Base Model

In [None]:
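# Features used for clustering: the first 8 columns of df2
df3<- df2[, c(1:8)]
# Class attribute (default status), kept separately for reference
df3.class<- df2[, "Defaulted"]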

#previewing the class attribute
head(df3.class)

In [None]:
#applying k-means clustering algorithm with number of centres (k=2) 

set.seed(123)  #for reproducibility
#stats::kmeans(df3, centers = 2,nstart=10)
df_k1 <- kmeans(df3, centers = 2)
print(df_k1)


In [None]:
#to visualize our clusters, we will convert all our variables to numerical variables 
str(df2)

In [None]:
#creating columns for a dataframe with all numeric values
Age<- df2$Age
Years.Employed<- df2$Years_Employed
Income <- df2$Income
Card.Debt <- df2$Card_Debt
Other.Debt <- df2$Other_Debt
DebtIncome.Ratio <- df2$DebtIncomeRatio
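# as.character() is applied first so that the factor labels, not the internal level codes, are converted
Education.numeric <- as.numeric(as.character(df2$Education_Level))
Defaulted.numeric<- as.numeric(as.character(df2$Defaulted))
Address.numeric <- as.numeric(as.character(df2$Address_Numeric))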

In [None]:
df2.numeric <- data.frame(Age, Years.Employed, Income, Card.Debt, Other.Debt, DebtIncome.Ratio, Education.numeric, Defaulted.numeric, Address.numeric)

head(df2.numeric)

In [None]:
dim(df2.numeric)

In [None]:
#library(data.table)
#require(reshape2)
#df2.numeric <- rownames(df2) 
#melt(df2)

In [None]:
# re-running k-means (k = 2) on df3; this fit is used to visualize and interpret the clustering results

set.seed(123)  #for reproducibility
df_k2 <- kmeans(df3, centers = 2)
print(df_k2)

###Visualizing and interpreting results of K-means

In [None]:
# Visualizing results 

#knitr::opts_chunk$set(fig.width=18, fig.height=12) 

library(repr)
options(repr.plot.width=18, repr.plot.height=12)

#set.seed(123)
#init <- sample(2, nrow(df3), replace=TRUE)
#plot(df3, col=init)

plot(df3, col=df_k2$cluster, main='K-Means with 2 clusters')

In [None]:
# Interpreting results
# Cluster size
df_k2$size

The data has been grouped into 2 clusters of 695 and 155 observations, which we interpret as non-defaulters (0) and defaulters (1) respectively.






In [None]:
# Between-cluster sum of squares
df_k2$betweenss

The sum of squares between the two clusters is 530.45

In [None]:
# Within-cluster sum of squares (one value per cluster)
df_k2$withinss

The sums of squares within the two clusters are 381.47 and 105.97

In [None]:
# Total within-cluster sum of squares
df_k2$tot.withinss

The total sum of squares within both clusters is 487.44

In [None]:
# Total sum of squares
df_k2$totss

The total sum of squares of the dataset is 1017.89, which equals the between-cluster sum of squares plus the total within-cluster sum of squares (530.45 + 487.44).
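
The relationship between these quantities can be sketched as follows. The matrix `toy` is made-up data used only to show the identity; the last lines plug in the figures reported above to express the between-cluster share of the total.

```r
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)   # hypothetical data
fit <- kmeans(toy, centers = 2, nstart = 10)

# Total SS splits into a between-cluster part and a within-cluster part
identity_holds <- isTRUE(all.equal(fit$totss, fit$betweenss + fit$tot.withinss))
print(identity_holds)

# Share of the total variation explained by the clustering (toy data)
print(fit$betweenss / fit$totss)

# The same ratio using the figures reported in this notebook
reported_share <- 530.45 / 1017.89
print(round(reported_share, 3))   # about 0.521
```

A ratio of roughly 0.52 means that about half of the total variation lies between the two clusters, with the rest remaining inside them.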



In [None]:
#visualizing the clusters
#fviz_cluster(df_k2, data = df2.numeric)

#Optimizing K-means Algorithm

For the k-means algorithm, the number of clusters must be specified before the analysis. In this study we have been given the class variable, Defaulted, which has two classes, so we expect two clusters. For datasets that do not have clear classes, however, the optimal k is not known in advance. In that case we can look for it by plotting the total within-cluster sum of squares against the number of clusters. A bend (elbow) in the graph suggests an appropriate number of clusters for that dataset.

In [None]:
# finding the optimal number of clusters

#fviz_nbclust(x = df2.numeric,FUNcluster = kmeans, method = 'wss')

options(repr.plot.width=12, repr.plot.height=8)

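set.seed(123)  #for reproducibility
wss <- numeric(8)

# For 1 to 8 cluster centers
for (i in 1:8) {
  # A temporary object is used so that df_k2 (the k = 2 fit above) is not overwritten
  km_i <- kmeans(df3, centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km_i$tot.withinss
}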

# Plot total within sum of squares vs. number of clusters
plot(1:8, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

The elbow method suggests that 2 clusters are appropriate for this dataset.

In [None]:
# Determining Optimal clusters (k) Using Average Silhouette Method

#fviz_nbclust(x = df2.numeric,FUNcluster = kmeans, method = 'silhouette' )
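
Since the `factoextra` call is commented out, the average silhouette method can also be sketched with `silhouette()` from the `cluster` package, which is already loaded in this notebook. The example below runs on made-up data (`toy`) and is an illustration under that assumption, not a result for the bank dataset.

```r
library(cluster)

set.seed(1)
# Two well-separated toy groups in 2-D (hypothetical data)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for each candidate k
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 20)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

# The k with the highest average silhouette width is preferred
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))
print(best_k)
```

To apply the same idea here, `toy` would be replaced by the numeric feature table used for clustering.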

In [None]:
#applying k-means algorithm as with k = 2 and k =5 
#df_k2 <- kmeans(df2.numeric, centers = 2)
#df_k5 <- kmeans(df2.numeric, centers = 5)

#p1 <- fviz_cluster(df_k2, geom = "point", data = df2.numeric) + ggtitle(" K = 2")
#p2 <- fviz_cluster(df_k5, geom = "point", data = df2.numeric) + ggtitle(" K = 5")

#library(gridExtra)

In [None]:
#grid.arrange(p1, p2, nrow = 2)

In [None]:
#setting number of random starts to 15
#df_k3 <- kmeans(df2.numeric, centers = 2, nstart = 15)
#print(df_k3)

In [None]:
#visualizing the clusters 
#fviz_cluster(df_k3, data = df2.numeric)

In [None]:
#Descriptive statistics (mean) at the cluster level
#df2.numeric %>% 
#  mutate(Cluster = df_k3$cluster) %>%
#  group_by(Cluster) %>%
#  summarize_all('mean')

**Notes from k-means clustering**



Some advantages of k-means clustering we have observed are:

* It’s quite easy to implement
* The algorithm is computationally fast

Some disadvantages are:

* The number of clusters has to be specified at the start, before we know how many clusters the data actually contains

#Hierarchical Clustering

Hierarchical clustering is also an unsupervised machine learning algorithm. It builds a hierarchy of clusters, i.e. a tree-like structure, so the number of clusters does not have to be specified in advance, which addresses the disadvantage of k-means clustering mentioned above.
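
A minimal sketch of this idea on made-up data: the tree is built once, and the number of clusters is chosen afterwards by cutting it. The matrix `toy` and its three groups are assumptions for illustration.

```r
set.seed(1)

# Three well-separated toy groups of 10 points each (hypothetical data)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 6, sd = 0.3), ncol = 2)
)

# Build the hierarchy once, without choosing k
tree <- hclust(dist(toy), method = "ward.D2")

# The same tree can then be cut into any number of clusters
groups2 <- cutree(tree, k = 2)
groups3 <- cutree(tree, k = 3)
print(table(groups2))
print(table(groups3))
```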

In [None]:
# We use the R function hclust() for hierarchical clustering
# 
options(repr.plot.width=12, repr.plot.height=8)

set.seed(123)  #for reproducibility

# First we use the dist() function to compute the Euclidean distance between observations
dst <- dist(df2, method = "euclidean")

# We apply the hierarchical clustering algorithm using Ward's method (ward.D2)
df2_h <- hclust(dst, method = "ward.D2" )

#plotting a dendrogram
plot(df2_h, cex = 0.6, hang = -1)

Our dendrogram is cluttered, so we will try different approaches to extract information from it.


In [None]:
summary(df2_h)

###Cutting the tree

In [None]:
plot(df2_h)
abline(h=20,col='red')

In [None]:
# Cut by height
cutree(df2_h, h=20)

In [None]:
# Cut by number of clusters
cutree(df2_h, k=2)

In [None]:
#Determine the optimal k
#fviz_nbclust(df2.numeric, FUN = hcut, method = "silhouette", 
#                   k.max = 10)

In [None]:
#obtaining the subgroups from the dendrograms

# Ward's method
#hc5 <- hclust(d, method = "ward.D2" )

# Cut tree into 2 groups
sub_grp <- cutree(df2_h, k = 2)

# Number of members in each cluster
table(sub_grp)

The two clusters obtained from hierarchical clustering contain 693 and 157 observations, which we interpret as non-defaulters and defaulters respectively.



In [None]:
install.packages('dendextend')
suppressPackageStartupMessages(library(dendextend))

In [None]:
avg_dend_obj <- as.dendrogram(df2_h)
# color the branches by cluster, using k = 2 to match the two-cluster cut used above
avg_col_dend <- color_branches(avg_dend_obj, k = 2)
plot(avg_col_dend)

In [None]:
suppressPackageStartupMessages(library(dplyr))
df2_ct <- mutate(df2, cluster = sub_grp)
#count(df2_ct,cluster)

In [None]:
#visualizing the clusters generated using a plot of income and debt default

suppressPackageStartupMessages(library(ggplot2))
ggplot(df2_ct, aes(x=Income, y = Defaulted, color = factor(cluster))) + geom_point()

In [None]:
#visualizing the clusters generated using a plot of experience and defaulted
ggplot(df2_ct, aes(x=Years_Employed, y = Defaulted, color = factor(cluster))) + geom_point()

In [None]:
#creating a confusion matrix to compare the actual classes and the predicted classes

table(df2_ct$cluster,df2$Defaulted)

The number of true positives is the highest, and the number of true negatives is also high. False positives and false negatives are fewer, although the number of false positives is still significant.
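
One way to summarize such a table in a single number is the share of observations where cluster and class agree. Cluster numbers are arbitrary, so the sketch below tries both possible mappings of cluster to class and keeps the better one. The vectors `actual` and `cluster` are made-up labels, not this notebook's results.

```r
# Hypothetical labels for 10 customers (not the notebook's results)
actual  <- factor(c(0, 0, 0, 0, 1, 1, 0, 1, 0, 1))
cluster <- c(1, 1, 1, 2, 2, 2, 1, 2, 1, 1)

tab <- table(cluster, actual)
print(tab)

# Agreement if cluster 1 = class 0 and cluster 2 = class 1
agree_direct <- sum(diag(tab)) / sum(tab)

# Agreement if the cluster numbers are swapped
agree_swapped <- sum(diag(tab[2:1, ])) / sum(tab)

agreement <- max(agree_direct, agree_swapped)
print(agreement)   # 0.8
```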


**Notes from Hierarchical Clustering**



Some advantages of hierarchical clustering include:

* We did not need to specify the number of clusters
* The algorithm is also easy to implement
* It outputs a hierarchy

Some of the disadvantages include:

* The dendrogram had too many data points, which led to a lot of overlapping, so we could not read any information from its structure. Attempts to rectify this were time consuming because of the size of the dataset, which suggests that this algorithm scales poorly to very large datasets

## Rtsne

In [None]:
install.packages('Rtsne')
library('Rtsne')

In [None]:
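set.seed(123)  #t-SNE is stochastic, so we fix the seed for reproducibility
data <- unique(df3)
datatsne<-Rtsne(data[,1:8])
# color by the Defaulted column of the de-duplicated data, so that colors line up with the plotted rows
plot(datatsne$Y, col=data$Defaulted)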

In [None]:
model = kmeans(df3, 2)
clusplot(df3, model$cluster, color=T,shade=T)

#8. Challenging the Solution

Based on the metrics for success outlined at the beginning of this study, we can consider this study successful. However, we were not able to make the best use of hierarchical clustering, as we could not interpret the dendrogram. To rectify this, it would probably be best to draw samples of the data and use these samples to plot the dendrogram.
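
The sampling idea could look roughly like the sketch below. The data frame `toy` is a made-up stand-in for the preprocessed numeric features, and the sample size of 50 is an arbitrary assumption chosen so that the leaf labels stay readable.

```r
set.seed(123)

# Stand-in for the preprocessed numeric features (hypothetical data)
toy <- as.data.frame(matrix(runif(850 * 4), ncol = 4))

# Draw a random sample of rows and cluster only those
sample_rows <- sample(nrow(toy), 50)
tree_small <- hclust(dist(toy[sample_rows, ]), method = "ward.D2")

# With 50 leaves instead of 850 the dendrogram is much easier to read
plot(tree_small, cex = 0.7, hang = -1, main = "Dendrogram of a 50-row sample")
```

Repeating this with a few different samples would show whether the structure seen in the tree is stable.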

The k-means clustering and the hierarchical clustering produced different cluster sizes. It is difficult to tell which of the two clustered the data best. However, it is evident that k-means is less computationally expensive than hierarchical clustering and can be run on large datasets within a reasonable time frame.

