**Customer Segmentation Project in R**
Customer segmentation is the process of dividing a customer base into groups of individuals who are similar in ways relevant to marketing, such as gender, age, interests, and spending habits.
Customer segmentation is one of the most important applications of unsupervised learning. Using clustering techniques, companies can identify distinct segments of customers, allowing them to target their potential user base. For this reason, I will use K-means clustering.
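
To make the idea concrete, here is a minimal sketch of K-means on synthetic data. The `toy` data frame and its `income` and `score` columns are made up for illustration and are not the mall dataset; the only assumption is that two clearly separated groups exist.

```r
# Hypothetical toy data: two well-separated groups of 20 customers each.
set.seed(42)
toy <- data.frame(
  income = c(rnorm(20, mean = 20, sd = 2), rnorm(20, mean = 80, sd = 2)),
  score  = c(rnorm(20, mean = 80, sd = 2), rnorm(20, mean = 20, sd = 2))
)

# Ask kmeans() for two clusters; nstart reruns the algorithm from several
# random starts and keeps the solution with the lowest within-cluster spread.
fit <- kmeans(toy, centers = 2, nstart = 10)

# Each row receives a cluster label; the centers summarise each segment.
table(fit$cluster)
fit$centers
```

With groups this far apart, each planted group should end up in its own cluster. Real customer data is rarely this clean, which is why the number of clusters has to be chosen with care.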
I will first read in the data and inspect its structure; the packages required are loaded in the cells that use them.

In [18]:
customer_data=read.csv("../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv")
str(customer_data)

In [19]:
names(customer_data)

In [20]:
head(customer_data)

In [21]:
sd(customer_data$Age)
summary(customer_data$Annual.Income..k..)
sd(customer_data$Annual.Income..k..)
summary(customer_data$Age)

In [22]:
sd(customer_data$Spending.Score..1.100.)

**Customer Gender Visualization**

For the visualisation, I will create a bar plot and a pie chart to show the gender distribution in the customer_data dataset.

In [23]:
a=table(customer_data$Gender)
barplot(a,main="Using BarPlot to display Gender Comparison",
       ylab="Count",
       xlab="Gender",
       col=rainbow(3),
       legend=rownames(a))

From the above bar plot, we observe that there are more female customers than male customers.
Now, let us draw a pie chart to observe the ratio of female to male customers.

In [24]:
pct=round(a/sum(a)*100)
lbs=paste0(names(a)," ",pct,"%")  # labels taken from the table itself, e.g. "Female 56%"
library(plotrix)
pie3D(a,labels=lbs,
   main="Pie Chart Depicting Ratio of Female and Male")

From the above chart, we conclude that 56% of the customers in the dataset are female and 44% are male.
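
The percentages can also be read directly from the frequency table instead of from the chart. A minimal sketch of the arithmetic, using a made-up `gender` vector (hypothetical, not the mall data) chosen so that the split matches the one reported above:

```r
# Hypothetical sample: 14 women and 11 men.
gender <- c(rep("Female", 14), rep("Male", 11))
counts <- table(gender)

# prop.table() converts counts to proportions that sum to 1;
# multiplying by 100 and rounding gives whole percentages.
pct <- round(100 * prop.table(counts))
pct
```

Applied to the real data, the same two lines would take `customer_data$Gender` in place of `gender`.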

**Visualization of Age Distribution**
I will now create a histogram to view the frequency distribution of customer ages. I will first take a summary of the Age variable.

In [25]:
summary(customer_data$Age)

In [26]:
hist(customer_data$Age,
    col="pink",
    main="Histogram to Show Count of Age Class",
    xlab="Age Class",
    ylab="Frequency",
    labels=TRUE)

In [27]:
boxplot(customer_data$Age,
       col="beige",
       main="Boxplot for Descriptive Analysis of Age")

From the above two visualizations, we conclude that the most common customer ages are between 30 and 35. The minimum customer age is 18 and the maximum is 70.
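
The most frequent age class can also be extracted programmatically rather than read off the plot. A sketch under stated assumptions: the `ages` vector below is made up for illustration, and the 5-year `breaks` are my choice, not necessarily the bins `hist()` picks by default for the real data.

```r
# Hypothetical ages, for illustration only.
ages <- c(18, 22, 25, 31, 32, 33, 34, 35, 41, 47, 52, 70)

# plot = FALSE returns the bin edges and counts without drawing anything.
h <- hist(ages, breaks = seq(15, 70, by = 5), plot = FALSE)

# Locate the bin with the highest count and report its edges.
i <- which.max(h$counts)
c(lower = h$breaks[i], upper = h$breaks[i + 1], count = h$counts[i])

# Minimum and maximum age.
range(ages)
```

Note that `which.max()` returns only the first bin when several bins tie for the highest count.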

**Analysis of the Annual Income of the Customers**

I will first create visualizations to analyze the annual income of the customers. I will plot a histogram and then examine the same data using a density plot.

In [28]:
summary(customer_data$Annual.Income..k..)
hist(customer_data$Annual.Income..k..,
  col="#660033",
  main="Histogram for Annual Income",
  xlab="Annual Income Class",
  ylab="Frequency",
  labels=TRUE)

In [29]:
plot(density(customer_data$Annual.Income..k..),
    col="yellow",
    main="Density Plot for Annual Income",
    xlab="Annual Income Class",
    ylab="Density")
polygon(density(customer_data$Annual.Income..k..),
        col="#ccff66")

From the above descriptive analysis, we conclude that the minimum annual income of the customers is 15 and the maximum is 137, both in thousands, as the column name Annual.Income..k.. indicates.

**Analyzing Spending Score of the Customers**

In [30]:
summary(customer_data$Spending.Score..1.100.)

In [31]:
boxplot(customer_data$Spending.Score..1.100.,
   horizontal=TRUE,
   col="#990000",
   main="BoxPlot for Descriptive Analysis of Spending Score")

The descriptive analysis of the spending score shows a minimum of 1, a maximum of 99, and a mean of 50.20.

In [32]:
hist(customer_data$Spending.Score..1.100.,
    main="Histogram for Spending Score",
    xlab="Spending Score Class",
    ylab="Frequency",
    col="#6600cc",
    labels=TRUE)

From the histogram, we conclude that the 40 to 50 spending-score class contains the most customers of all the classes.

**K-means Treatment**
To choose the number of clusters, I will compute the total within-cluster sum of squares for K from 1 to 10, using columns 3 to 5 of customer_data, and plot it against K.

In [33]:
library(purrr)
set.seed(123)
# function to calculate the total within-cluster sum of squares for a given k
iss <- function(k) {
  kmeans(customer_data[,3:5],k,iter.max=100,nstart=100,algorithm="Lloyd" )$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
plot(k.values, iss_values,
    type="b", pch = 19, frame = FALSE, 
    xlab="Number of clusters K",
    ylab="Total intra-clusters sum of squares")
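
One common way to read a curve like this is to look for the "elbow", the value of K after which the total within-cluster sum of squares stops falling sharply. The sketch below illustrates that reading on synthetic data with three planted groups. The `pts` matrix, the `mu` centers, and the `wss` helper are hypothetical names introduced for illustration; this is not the result for the mall data.

```r
set.seed(123)

# Hypothetical data: three well-separated groups of 30 points each.
mu <- rbind(c(0, 0), c(10, 10), c(0, 10))
pts <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, mean = mu[i, 1]), rnorm(30, mean = mu[i, 2]))
}))

# Total within-cluster sum of squares for a given k.
wss <- function(k) kmeans(pts, centers = k, nstart = 25)$tot.withinss
wss_values <- sapply(1:6, wss)

# Reduction in the sum of squares gained by each additional cluster.
drops <- -diff(wss_values)
round(wss_values, 1)
round(drops, 1)
```

On data like this, the first two entries of `drops` should be large and the later ones small, which places the elbow at K = 3. On real data the bend is often less pronounced, so the plot is usually read alongside other criteria.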