# Customer Segmentation

In this practical we need to work on customer segmentation using unsupervised learning.

You are given a dataset of information collected on customers of an American shopping centre (a mall).
The dataset has a relatively small set of features. The objective is to identify groups of customers that have similar patterns based on the dataset. As part of this analysis you should try to characterise the groups in terms of the common characteristics they have.

### Dataset

- CustomerID: Unique ID assigned to the customer
- Gender: Gender of the customer
- Age: Age of the customer
- Annual Income (k$): Annual income of the customer, in thousands of dollars
- Spending Score (1-100): Score assigned by the mall based on customer behavior and spending nature



## Data Exploration

As in previous projects, the first step is to load the dataset and try to understand the data. Hint: follow the steps you used in previous practicals to perform the tasks indicated in the following sections.

The dataset is located in the csv file: `dataset/Mall_Customers.csv`

In [None]:
# ToDo: import libraries


In [None]:
# ToDo: load the dataset


In [None]:
# ToDo: Check the dataset
# Look at a snapshot of your data
# Check their datatypes and basic stats

In [None]:
# ToDo: Check for null values

By now you should have a high-level understanding of the features in your dataset.
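
The loading and checking steps above could be sketched roughly as follows. This is only one possible shape for the solution: the helper name `describe_dataset` is hypothetical, and the three rows in `sample_csv` are made-up stand-ins (one value is deliberately left empty) so that the snippet runs without the file. In the practical you would pass `dataset/Mall_Customers.csv` to the helper instead.

```python
import io

import pandas as pd

# Made-up rows with the same columns as the practical's CSV (illustrative only).
# The last row has an empty Spending Score so the null check has something to find.
sample_csv = io.StringIO(
    "CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,25,30,60\n"
    "2,Female,40,70,20\n"
    "3,Female,33,55,\n"
)


def describe_dataset(source):
    """Load a CSV and return the frame together with its per-column null counts."""
    df = pd.read_csv(source)
    null_counts = df.isnull().sum()
    return df, null_counts


df_sample, null_counts = describe_dataset(sample_csv)

print(df_sample.head())      # snapshot of the first rows
print(df_sample.dtypes)      # data type of each column
print(df_sample.describe())  # basic statistics for the numerical columns
print(null_counts)           # number of missing values per column
```

On the real file the null counts should all be zero; the sample above only shows what a non-zero count looks like.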

#### Simple Exploratory Data Analysis
Your next step is to perform some simple statistical analysis on your dataset. The objective is to look for any 'unusual' or 'unexpected' patterns in your dataset and in the relationship between features. You are also looking for any 'hints' that may indicate how the feature values may help in the customer segmentation process.

In [None]:
# ToDo: plot the distributions of values for each feature


In [None]:

# Counting the values for the Gender feature (categorical)

# seaborn is a statistical data visualisation library that simplifies the creation of specific plots 
import seaborn as sns

plt.figure(1, figsize = (10, 2))
sns.countplot( y = 'Gender', data = df)
plt.show()

### Scatterplots

Scatterplots can give you an indication of whether your dataset naturally groups into clusters.

Try generating a `scatter_matrix` using all the numerical features in your dataset. Check Lecture 3 for an example of using `scatter_matrix`.
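
As a rough sketch of the idea, the snippet below calls `pandas.plotting.scatter_matrix` on a randomly generated stand-in frame. The `toy` data, the seed, and the choice to drop `CustomerID` before plotting are assumptions made for illustration; they are not taken from the practical's dataset or from Lecture 3.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random stand-in data with the same column names as the practical's dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "CustomerID": np.arange(1, 51),
    "Gender": rng.choice(["Male", "Female"], size=50),
    "Age": rng.integers(18, 70, size=50),
    "Annual Income (k$)": rng.integers(15, 140, size=50),
    "Spending Score (1-100)": rng.integers(1, 101, size=50),
})

# Keep only numerical columns; CustomerID is numeric but carries no information.
numeric = toy.drop(columns=["CustomerID"]).select_dtypes(include="number")

# One scatter plot per pair of features, with histograms on the diagonal.
# In a notebook the figure is shown inline; elsewhere call plt.show().
axes = scatter_matrix(numeric, figsize=(10, 10), diagonal="hist")
```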

In [None]:
# ToDo: plot scatter plots between features


### Grouping by Gender
As gender is not a numerical feature, it was not included in the scatter plots above. We can instead generate scatter plots with the values grouped by gender. Our aim is to see how gender may affect the grouping of values.

Using Seaborn or Matplotlib we can create such scatter plots relatively easily. See the example below, and produce similar plots for all other combinations of features.

In [None]:

plt.figure(1 , figsize = (15 , 6))

plt.scatter(x = 'Annual Income (k$)', y = 'Spending Score (1-100)' , 
            data = df[df['Gender'] == 'Male'],
            s = 200 , alpha = 0.5 , label = 'Male')
plt.scatter(x = 'Annual Income (k$)', y = 'Spending Score (1-100)' , 
            data = df[df['Gender'] == 'Female'],
            s = 200 , alpha = 0.5 , label = 'Female')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Annual Income vs Spending Score w.r.t Gender')
plt.legend()
plt.show()

In [None]:
# ToDo produce scatter plots per gender for Age - Income, and Age - Spending.

## Observations

At this stage you should be able to make the following observations:
- CustomerID is not a useful feature. It does not carry any information that would be valuable for grouping customers.
- The dataset has been cleaned. There are no null or missing values. The values are well distributed, without any obviously erroneous values or outliers.
- There are some visible distinctions between groups when considering Income vs Spending.
- There are some less visible distinctions between groups when considering Age vs Spending.
- Gender values do not indicate any significant patterns in the scatterplots.

# Clustering with K-means

We will work with K-means clustering using only the numerical values of the dataset.

First create a version of the dataset where you drop the `CustomerID` and `Gender`.
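
A minimal sketch of this step, assuming a frame with the same column names as the dataset. The three rows in `toy` are made up for illustration, and the name `toy_numeric` is hypothetical.

```python
import pandas as pd

# Made-up rows with the practical's column names (illustrative only).
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [25, 40, 33],
    "Annual Income (k$)": [30, 70, 55],
    "Spending Score (1-100)": [60, 20, 45],
})

# drop() returns a new frame, so `toy` itself is left unchanged.
toy_numeric = toy.drop(columns=["CustomerID", "Gender"])
print(list(toy_numeric.columns))
```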

In [None]:
# ToDo: create dataset for clustering


In order to estimate the right number for K, you will need to work through a range of potential values and calculate the inertia for each model. Then you will plot the inertia values and try to identify the 'elbow' in the plot.

*TODO*
- Create your raw data matrix using `.values` from the pandas dataframe
- Create an empty list called `inertia`. You will use it to store the inertia value of each model.
- Create a for-loop with a variable `n` going through a range of values from 1 to 10 (`range(1,11)`)
    - In the loop create a KMeans model with `n` clusters - it is good to set a fixed `random_state`.
    - Fit the model on the dataset
    - Retrieve the `.inertia_` attribute and append it to your `inertia` list
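
The steps above can be sketched as follows. The matrix `X` here is random stand-in data so that the snippet runs on its own; in the practical it would come from `.values` on your clustering dataframe. Setting `n_init=10` and `random_state=42` are assumptions made for reproducibility, not values prescribed by the practical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the raw data matrix (120 samples, 3 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))

inertia = []
for n in range(1, 11):
    # A fixed random_state makes the centroid initialisation repeatable.
    model = KMeans(n_clusters=n, n_init=10, random_state=42)
    model.fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centroid.
    inertia.append(model.inertia_)

print(inertia)
```

Because the stand-in data has no real cluster structure, its inertia curve will not show a clear elbow; the real dataset should.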

In [None]:
# ToDo: collect inertia values for multiple 'k' values

In [None]:
# Plot the inertia values

plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

#### Observation
You should observe an elbow point at around `k=6`. This is the number of clusters used in the rest of the practical.

# Characterise the clusters

You need to identify the customer characteristics for each cluster, for final reporting.

First you need to train the K-means model using the right number of clusters.
Then retrieve the `labels_` attribute from the trained model; it holds the cluster index assigned to each customer.

Create a copy of your dataset using `.copy()` and then append a column with the cluster numbers, titled `Cluster`.
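
A possible sketch of these steps is shown below. The `toy` frame is random stand-in data, and `n_init=10` and `random_state=42` are assumed settings. The copy is named `toy_clustered` here to keep it apart from the `df_clustered` frame that you will build from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random stand-in for the numerical clustering dataset (illustrative only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(18, 70, size=60),
    "Annual Income (k$)": rng.integers(15, 140, size=60),
    "Spending Score (1-100)": rng.integers(1, 101, size=60),
})

# Train the final model with the number of clusters chosen from the elbow plot.
model = KMeans(n_clusters=6, n_init=10, random_state=42)
model.fit(toy.values)

# Work on a copy so the original frame keeps only the input features.
toy_clustered = toy.copy()
toy_clustered["Cluster"] = model.labels_  # one cluster index per row

print(toy_clustered.head())
```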

In [None]:
# ToDo: train your K-Means using the right number of clusters


# ToDo: extract the list of labels and append them as a 
#       column titled `Cluster` - Use a copy of your dataset

Now you can generate scatterplots of your dataset, with the points grouped by cluster. This visualisation will help you see how the different clusters are shaped. The code below assumes that the dataset copy is named `df_clustered`.

In [None]:
plt.figure(1 , figsize = (15 , 6))

for c in range(0, 6):
    plt.scatter(x = 'Age' , y = 'Annual Income (k$)' , 
                data = df_clustered[df_clustered['Cluster'] == c],
                s = 200 , alpha = 0.5 , label = c)

plt.xlabel('Age')
plt.ylabel('Annual Income (k$)')
plt.title('Age vs Annual Income w.r.t Cluster')
plt.legend()
plt.show()

### Observation from this scatter plot
- Cluster 0: High income
- Cluster 1: Mid-range income, Older
- Cluster 2: High income, middle age
- Cluster 3: Mid-range income, young/mid age
- Cluster 4: Low income
- Cluster 5: Low income, young

In [None]:
# ToDo: Generate similar scatter plots for Age vs Spending, and Income vs Spending.

## Final Observations

You should be able to characterise the clusters with observations like the ones here:

- Cluster 0: High income, Low spending, Any age
- Cluster 1: Mid income, Mid spending, Older
- Cluster 2: High income, High spending, Middle age
- Cluster 3: Mid income, Mid spending, Young/Middle age
- Cluster 4: Low income, Low spending, Any age
- Cluster 5: Low income, High spending, Young
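
One way to back such descriptions with numbers is to compare per-cluster averages. The sketch below assumes a frame that already has a `Cluster` column. The six rows and their hand-assigned labels are made up for illustration and are not results from the real dataset.

```python
import pandas as pd

# Made-up rows with hand-assigned cluster labels (illustrative only).
toy_clustered = pd.DataFrame({
    "Age": [20, 24, 50, 60, 35, 31],
    "Annual Income (k$)": [20, 30, 60, 50, 90, 100],
    "Spending Score (1-100)": [80, 70, 50, 40, 90, 80],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Mean of every feature within each cluster, plus the number of customers in it.
summary = toy_clustered.groupby("Cluster").mean()
summary["Size"] = toy_clustered.groupby("Cluster").size()

print(summary)
```

Reading across a row of `summary` gives the typical age, income and spending of that cluster, which can then be turned into labels such as "low income, high spending, young".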

# 3D Visualisation
Visualising multidimensional data can be difficult. Because we are working with only 3 features, we can produce a 3D visualisation and highlight the different clusters.

The following code uses the library plotly to produce these 3D visualisations.

In [None]:
import plotly as py
import plotly.graph_objs as go

trace1 = go.Scatter3d(
    x= df_clustered['Age'],
    y= df_clustered['Spending Score (1-100)'],
    z= df_clustered['Annual Income (k$)'],
    mode='markers',
    marker=dict(
        color = df_clustered['Cluster'], 
        size= 10,
        line=dict(
            color= df_clustered['Cluster'],
            width= 12
        ),
        opacity=0.8
    )
)
data = [trace1]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    ),
    title= 'Clusters',
    scene = dict(
            xaxis = dict(title  = 'Age'),
            yaxis = dict(title  = 'Spending Score'),
            zaxis = dict(title  = 'Annual Income')
        )
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)