## Codio Activity 6.8: Running PCA with Clustering

**Expected Time = 120 minutes**

**Total Points = 36**

Now that you've seen how PCA reduces the dimensionality of data while retaining important information, it is time to apply these ideas to a real dataset. In this activity, you will use a dataset from a marketing campaign to identify groups of similar customers. Once the cluster labels are assigned, you will briefly explore each cluster for patterns that help characterize its customers.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)
- [Problem 7](#Problem-7)
- [Problem 8](#Problem-8)
- [Problem 9](#Problem-9)


In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
import warnings

In [4]:
warnings.filterwarnings("ignore")

### The Dataset

More information on the dataset can be found [here](https://www.kaggle.com/imakash3011/customer-personality-analysis). Below, the data is loaded, its info is displayed, the numeric features are summarized with `describe`, and the first five rows are shown.

In [11]:
df = pd.read_csv('data/marketing_campaign.csv', sep = '\t')

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

In [15]:
df.describe()

| | ID | Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2240.0 | 2240.0 | 2216.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | ... | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 |
| mean | 5592.159821 | 1968.805804 | 52247.251354 | 0.444196 | 0.50625 | 49.109375 | 303.935714 | 26.302232 | 166.95 | 37.525446 | ... | 5.316518 | 0.072768 | 0.074554 | 0.072768 | 0.064286 | 0.013393 | 0.009375 | 3.0 | 11.0 | 0.149107 |
| std | 3246.662198 | 11.984069 | 25173.076661 | 0.538398 | 0.544538 | 28.962453 | 336.597393 | 39.773434 | 225.715373 | 54.628979 | ... | 2.426645 | 0.259813 | 0.262728 | 0.259813 | 0.245316 | 0.114976 | 0.096391 | 0.0 | 0.0 | 0.356274 |
| min | 0.0 | 1893.0 | 1730.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| 25% | 2828.25 | 1959.0 | 35303.0 | 0.0 | 0.0 | 24.0 | 23.75 | 1.0 | 16.0 | 3.0 | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| 50% | 5458.5 | 1970.0 | 51381.5 | 0.0 | 0.0 | 49.0 | 173.5 | 8.0 | 67.0 | 12.0 | ... | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| 75% | 8427.75 | 1977.0 | 68522.0 | 1.0 | 1.0 | 74.0 | 504.25 | 33.0 | 232.0 | 50.0 | ... | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| max | 11191.0 | 1996.0 | 666666.0 | 2.0 | 2.0 | 99.0 | 1493.0 | 199.0 | 1725.0 | 259.0 | ... | 20.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 11.0 | 1.0 |


In [17]:
df.head()

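| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |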


[Back to top](#Index:)

## Problem 1

### Preparing the Data

**4 Points**

Before building cluster models, the data needs to be limited to numeric representations. How many non-numeric columns are there, and what are their names? Assign your solution as a list of strings to `object_cols` below. The names should match the column names in the DataFrame exactly.
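
One way to answer this without reading the `info()` output by eye is to ask pandas which columns are not numeric. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the campaign data) and `select_dtypes(exclude="number")`. The same call on `df` should list the object columns, assuming the frame holds no other non-numeric dtypes.

```python
import pandas as pd

# Hypothetical miniature of the campaign data: two text columns, two numeric ones.
toy = pd.DataFrame({
    "Education": ["PhD", "Master", "Graduation"],
    "Income": [58138.0, 46344.0, 71613.0],
    "Kidhome": [0, 1, 0],
    "Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"],
})

# Keep only the columns whose dtype is not numeric, then collect their names.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
print(len(non_numeric_cols))
```

The resulting list preserves the column order of the frame, so it can be compared directly against a hand-written answer.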

In [19]:
### GRADED

# YOUR CODE HERE
object_cols = ['Education', 'Marital_Status', 'Dt_Customer']

# Answer check
print(object_cols)
print(type(object_cols))

['Education', 'Marital_Status', 'Dt_Customer']
<class 'list'>


[Back to top](#Index:)

## Problem 2

### Dropping the `object` columns 

**4 Points**

To simplify things, eliminate the columns containing `object` datatypes.  Assign your new DataFrame to `df_numeric` below.

In [21]:
### GRADED

# YOUR CODE HERE
df_numeric = df.select_dtypes(exclude=['object'])

# Answer check
print(df_numeric.shape)
df_numeric.info()

(2240, 26)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Income               2216 non-null   float64
 3   Kidhome              2240 non-null   int64  
 4   Teenhome             2240 non-null   int64  
 5   Recency              2240 non-null   int64  
 6   MntWines             2240 non-null   int64  
 7   MntFruits            2240 non-null   int64  
 8   MntMeatProducts      2240 non-null   int64  
 9   MntFishProducts      2240 non-null   int64  
 10  MntSweetProducts     2240 non-null   int64  
 11  MntGoldProds         2240 non-null   int64  
 12  NumDealsPurchases    2240 non-null   int64  
 13  NumWebPurchases      2240 non-null   int64  
 14  NumCatalogPurchases  2240 non-null   int64  
 15  NumStorePurchases    2240 n

[Back to top](#Index:)

## Problem 3

### Dropping non-informative columns

**4 Points**

Two columns, `Z_CostContact` and `Z_Revenue`, each have only one unique value. Also, the `ID` column is essentially an index. None of these add information to our problem. Drop the columns `Z_CostContact`, `Z_Revenue`, and `ID`, and save the remaining all-numeric data as a DataFrame to `df_clean` below.

In [25]:
### GRADED

# YOUR CODE HERE
df_clean = df_numeric.drop(columns=['Z_CostContact', 'Z_Revenue', 'ID'])

# Answer check
print(df_clean.shape)
df_clean.info()

(2240, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2240 non-null   int64  
 1   Income               2216 non-null   float64
 2   Kidhome              2240 non-null   int64  
 3   Teenhome             2240 non-null   int64  
 4   Recency              2240 non-null   int64  
 5   MntWines             2240 non-null   int64  
 6   MntFruits            2240 non-null   int64  
 7   MntMeatProducts      2240 non-null   int64  
 8   MntFishProducts      2240 non-null   int64  
 9   MntSweetProducts     2240 non-null   int64  
 10  MntGoldProds         2240 non-null   int64  
 11  NumDealsPurchases    2240 non-null   int64  
 12  NumWebPurchases      2240 non-null   int64  
 13  NumCatalogPurchases  2240 non-null   int64  
 14  NumStorePurchases    2240 non-null   int64  
 15  NumWebVisitsMonth    2240 n
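
Constant columns like the two dropped above can also be found programmatically rather than by inspecting `describe()`. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical) that uses `DataFrame.nunique()` to flag columns with a single distinct value. The `ID` column is still dropped by name, since "looks like an index" is a judgment call rather than something `nunique` can decide.

```python
import pandas as pd

# Hypothetical miniature: two constant columns, an identifier, and one real feature.
toy = pd.DataFrame({
    "ID": [5524, 2174, 4141],
    "Recency": [58, 38, 26],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# Count distinct values per column and keep the names of columns with exactly one.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# Drop the constant columns together with the identifier column.
toy_clean = toy.drop(columns=constant_cols + ["ID"])
print(toy_clean.columns.tolist())
```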

[Back to top](#Index:)

## Problem 4

### Dropping the missing data

**4 Points**

Note that the `Income` column is missing data.  This will cause issues for `PCA` and clustering algorithms.  Drop the missing data using pandas `.dropna` method on `df_clean`, and assign your non-missing dataset as a DataFrame to `df_clean_nona` below. 

In [29]:
### GRADED

# YOUR CODE HERE
df_clean_nona = df_clean.dropna(subset=['Income'])

# Answer check
print(df_clean_nona.shape)
df_clean_nona.info()

(2216, 23)
<class 'pandas.core.frame.DataFrame'>
Index: 2216 entries, 0 to 2239
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2216 non-null   int64  
 1   Income               2216 non-null   float64
 2   Kidhome              2216 non-null   int64  
 3   Teenhome             2216 non-null   int64  
 4   Recency              2216 non-null   int64  
 5   MntWines             2216 non-null   int64  
 6   MntFruits            2216 non-null   int64  
 7   MntMeatProducts      2216 non-null   int64  
 8   MntFishProducts      2216 non-null   int64  
 9   MntSweetProducts     2216 non-null   int64  
 10  MntGoldProds         2216 non-null   int64  
 11  NumDealsPurchases    2216 non-null   int64  
 12  NumWebPurchases      2216 non-null   int64  
 13  NumCatalogPurchases  2216 non-null   int64  
 14  NumStorePurchases    2216 non-null   int64  
 15  NumWebVisitsMonth    2216 non-nu
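
Before dropping rows, it can help to confirm where the missing values are and to notice what `dropna` does to the index. The sketch below uses a made-up frame (`toy` is hypothetical) to count missing values per column with `isna().sum()` and then drops the incomplete rows. Note that the surviving rows keep their original index labels, which is why the output above reports 2216 entries with labels still running from 0 to 2239.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with two missing Income values.
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, np.nan],
    "Recency": [58, 38, 26, 26],
})

# Count missing entries in each column.
missing_per_col = toy.isna().sum()
print(missing_per_col)

# Drop rows where Income is missing; the remaining index labels are unchanged.
toy_nona = toy.dropna(subset=["Income"])
print(toy_nona.shape)
print(toy_nona.index.tolist())
```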

[Back to top](#Index:)

## Problem 5

### Scaling the Data

**4 Points**

As with the earlier PCA models, the data needs to be mean-centered.


Below, scale the `df_clean_nona` by subtracting its mean and by dividing it by its standard deviation.  Assign your results as a DataFrame to `df_scaled` below.  
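
The scaling described above can be sketched on a small made-up frame (`toy` is hypothetical). The example also compares the result with `StandardScaler`, which the notebook imports. The two differ by a constant factor because pandas `.std()` uses the sample standard deviation (`ddof=1`) while `StandardScaler` uses the population standard deviation (`ddof=0`); on a dataset with thousands of rows the difference is negligible.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature with two numeric features on very different scales.
toy = pd.DataFrame({
    "Income": [20000.0, 40000.0, 60000.0, 80000.0],
    "Recency": [10.0, 20.0, 40.0, 50.0],
})

# Manual standardization: subtract the column mean, divide by the column std (ddof=1).
manual = (toy - toy.mean()) / toy.std()

# StandardScaler performs the same centering but divides by the ddof=0 std.
sk = StandardScaler().fit_transform(toy)

# The two results differ only by the factor sqrt(n / (n - 1)).
n = len(toy)
ratio = np.sqrt(n / (n - 1))
print(np.allclose(sk, manual.to_numpy() * ratio))
```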

In [31]:
### GRADED

# YOUR CODE HERE
df_scaled = (df_clean_nona - df_clean_nona.mean()) / df_clean_nona.std()

# Answer check
print(df_scaled.shape)
print(type(df_scaled))

(2216, 23)
<class 'pandas.core.frame.DataFrame'>


[Back to top](#Index:)

## Problem 6

### PCA

**4 Points**

With the data cleaned and scaled, you are ready to perform PCA.  Below, use the `PCA` transformer from `sklearn` to transform your data and select the top three principal components.  First, create an instance of the `PCA` that limits the number of components to 3 using the `n_components` argument.  Also, set the argument `random_state = 42`  and assign your instance as `pca` below.

In [33]:
### GRADED

# YOUR CODE HERE
pca = PCA(n_components=3, random_state=42)

# Answer check
print(pca)
print(pca.n_components)

PCA(n_components=3, random_state=42)
3


[Back to top](#Index:)

## Problem 7

### Extracting the Components

**4 Points**

Use the `.fit_transform` method with argument equal to `df_scaled` on `pca` to extract the three principal components.  Save these components as an array to the variable `components` below.  

In [35]:
### GRADED

# YOUR CODE HERE
components = pca.fit_transform(df_scaled)

# Answer check
print(type(components))
print(components.shape)

<class 'numpy.ndarray'>
(2216, 3)
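
After fitting, it is worth checking how much of the total variance the three components actually retain. The sketch below does this on synthetic data (the array `X` is made up and stands in for `df_scaled`) using the fitted transformer's `explained_variance_ratio_` attribute. The variable names carry a `toy_` prefix so they do not overwrite the graded `pca` and `components`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled data: 200 rows, 6 features, two of them correlated.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit PCA with three components and project the data onto them.
toy_pca = PCA(n_components=3, random_state=42)
toy_components = toy_pca.fit_transform(X)
print(toy_components.shape)

# Fraction of total variance captured by each component, and by all three together.
ratios = toy_pca.explained_variance_ratio_
print(ratios)
print(ratios.sum())
```

If the summed ratio is low, the clusters found in the reduced space may be ignoring a large share of the structure in the original features.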


[Back to top](#Index:)

## Problem 8

### `KMeans`

**4 Points**

Complete the code below according to the following instructions:


- To the `kmeans` variable, assign the `KMeans` clusterer with argument `n_clusters` equal to `3` and argument `random_state` equal to `42`. To this, chain the `fit()` method with argument equal to `components`.
- Copy the code line that reads the data  in your solution code.
- Copy the code to drop the missing value in your solution. Here, inside the `dropna()` function, set the argument `subset` equal to `['Income']`.
- Inside `df_clustered`, create a new column `cluster`. To this column, assign `kmeans.labels_`.


In [39]:
### GRADED

# YOUR CODE HERE
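# Fit KMeans with three clusters on the principal components
kmeans = KMeans(n_clusters=3, random_state=42).fit(components)
# Reload the raw data and drop the rows with missing Income, as in Problem 4
df = pd.read_csv('data/marketing_campaign.csv', sep = '\t')
df_clustered = df.dropna(subset=['Income']).copy()
# The labels line up row-for-row with the rows that remain
df_clustered['cluster'] = kmeans.labels_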

# Answer check
print(type(df_clustered))
print(df_clustered.shape)
df_clustered.head(10)

<class 'pandas.core.frame.DataFrame'>
(2216, 30)


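| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2 |
| 5 | 7446 | 1967 | Master | Together | 62513.0 | 0 | 1 | 09-09-2013 | 16 | 520 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 0 |
| 6 | 965 | 1971 | Graduation | Divorced | 55635.0 | 0 | 1 | 13-11-2012 | 34 | 235 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 0 |
| 7 | 6177 | 1985 | PhD | Married | 33454.0 | 1 | 0 | 08-05-2013 | 32 | 76 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2 |
| 8 | 4855 | 1974 | PhD | Together | 30351.0 | 1 | 0 | 06-06-2013 | 19 | 14 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 2 |
| 9 | 5899 | 1950 | PhD | Together | 5648.0 | 1 | 1 | 13-03-2014 | 68 | 28 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2 |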
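
Attaching `labels_` to a DataFrame only works if the frame has exactly the rows that were clustered, in the same order. The sketch below illustrates this on synthetic data (the blob centers, `toy_components`, and `toy_df` are all made up). The frame is given a non-contiguous index on purpose, to mimic the gaps left by `dropna`; assigning a NumPy array to a column is positional, so the gaps do not matter as long as the lengths agree. Setting `n_init=10` explicitly is an assumption made to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs of 20 points each in 3-D "component" space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
toy_components = np.vstack(
    [c + rng.normal(scale=0.5, size=(20, 3)) for c in centers]
)

# A frame with one row per clustered point and a non-contiguous index.
toy_df = pd.DataFrame(
    {"row_id": np.arange(60)}, index=np.arange(0, 120, 2)
)

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(toy_components)

# The label array must have one entry per row before it is attached.
assert len(toy_kmeans.labels_) == len(toy_df)
toy_df["cluster"] = toy_kmeans.labels_

# Cluster sizes, ordered by cluster label.
cluster_sizes = toy_df["cluster"].value_counts().sort_index()
print(cluster_sizes)
```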


[Back to top](#Index:)

## Problem 9

### Examining the Results

**4 Points**

The image below shows a `boxenplot` of the amount spent on meat products in each cluster. Suppose you are marketing a meat sale and advertising carries a per-customer cost. If you could market to only one cluster, which would you target? Assign your response as an integer to `target_cluster` below.

![](images/meats.png)

In [None]:
### GRADED

# YOUR CODE HERE
target_cluster = 1

# Answer check
print(type(target_cluster))
print(target_cluster)
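
The same comparison can be made numerically by summarizing spending per cluster. The sketch below uses invented numbers (the `toy` frame is hypothetical and says nothing about which cluster is correct for the real data) and picks the cluster with the highest median `MntMeatProducts` via `groupby`. On the real data, the analogous call would be made on `df_clustered`.

```python
import pandas as pd

# Invented spending figures for three clusters; not taken from the campaign data.
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "MntMeatProducts": [400, 520, 120, 150, 10, 25],
})

# Median meat spending within each cluster.
median_spend = toy.groupby("cluster")["MntMeatProducts"].median()
print(median_spend)

# The cluster label with the largest median.
toy_target = int(median_spend.idxmax())
print(toy_target)
```

The median is used here because spending distributions tend to be skewed; the mean or an upper quantile would be reasonable alternatives depending on the campaign goal.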

While this is a start, there is much more work to be done.  We glossed over perhaps one of the most important parts of the task -- feature engineering.  Some of the columns that were objects could be represented numerically.  Also, we could try different numbers of components from PCA and numbers of clusters.  In a business setting, it is important to keep the number of clusters small so that the groups can be distinguished in meaningful ways, so we don't want to let the number of clusters get too large.  
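
As one illustration of the feature engineering mentioned above, the object columns could be turned into numbers instead of being dropped. This is only a sketch on a made-up frame (`toy` and the derived column name `customer_days` are hypothetical). It assumes `Dt_Customer` is in day-month-year order, which the sample rows such as `21-08-2013` suggest, and it one-hot encodes `Education` with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical miniature containing two of the original object columns.
toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Graduation"],
    "Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"],
})

# Parse the enrollment date (assumed day-month-year) and convert it to
# "days as a customer", measured relative to the most recent enrollment.
enrolled = pd.to_datetime(toy["Dt_Customer"], format="%d-%m-%Y")
toy["customer_days"] = (enrolled.max() - enrolled).dt.days

# One-hot encode the categorical column and drop the raw date string.
encoded = pd.get_dummies(
    toy.drop(columns="Dt_Customer"), columns=["Education"], dtype=int
)
print(encoded)
```

Every column in `encoded` is numeric, so it could be scaled and passed to PCA alongside the other features.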