<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Module 2:</span> Dimensionality Reduction</h1>
<hr>


In the previous module, you created an analytical base table with useful customer-level features for **purchase patterns**.

However, remember, our client wishes to incorporate information about **specific item purchases** into the clusters. For example, our model should be more likely to group together customers who buy similar items.

* In this module, we'll prepare individual item features for our clustering algorithms.
* The Curse of Dimensionality is especially relevant for clustering because, with many features, observations tend to be "far away" from each other, which makes distance-based grouping less meaningful.
* We'll introduce a simple way to reduce the number of dimensions by applying thresholds.

<br><hr id="toc">

### In this module, we will cover...

1. The Curse of Dimensionality
2. Item data
3. Experimental example: one hot encoding
4. High dimensionality
5. Thresholds

### First, let's import libraries and load the cleaned transaction-level data.

Import the libraries that you'll need.

In [1]:
# Commonly used data science packages: NumPy, Pandas, Matplotlib, Seaborn 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Read the cleaned transaction-level data
data = pd.read_csv('data/cleaned_transaction.csv')

# The Curse of Dimensionality in Machine Learning

Datasets often have too many features to feed into a machine learning model effectively, so we need methods for dimensionality reduction. For an introduction to the Curse of Dimensionality, see the link below.

[Wiki Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)	
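
To make the "far away" intuition concrete, here is a minimal sketch on synthetic data (uniform random points, not the transaction dataset). The function name `distance_contrast` and the sample sizes are illustrative assumptions. It measures how much farther the farthest point is than the nearest one, relative to the nearest. As the number of dimensions grows, this contrast shrinks, so "near" and "far" neighbors become hard to tell apart.

```python
import numpy as np

def distance_contrast(n_points, n_dims, rng):
    """Relative gap between the farthest and nearest neighbor of one point."""
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    # Euclidean distances from the first point to every other point
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
contrasts = {d: distance_contrast(500, d, rng) for d in (2, 20, 200, 2000)}
for d, c in contrasts.items():
    print(f'{d:>5} dimensions: contrast = {c:.2f}')
```

A clustering algorithm that relies on Euclidean distance has less signal to work with when the contrast is small, which is one reason to reduce the number of item features before clustering.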

### The curse of dimensionality in the transaction dataset: the StockCode and Description columns

In [8]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Sales
0,536370,22728,ALARM CLOCK BAKELIKE PINK,24,12/1/10 8:45,3.75,12583,France,90.0
1,536370,22727,ALARM CLOCK BAKELIKE RED,24,12/1/10 8:45,3.75,12583,France,90.0
2,536370,22726,ALARM CLOCK BAKELIKE GREEN,12,12/1/10 8:45,3.75,12583,France,45.0
3,536370,21724,PANDA AND BUNNIES STICKER SHEET,12,12/1/10 8:45,0.85,12583,France,10.2
4,536370,21883,STARS GIFT TAPE,24,12/1/10 8:45,0.65,12583,France,15.6


Q1: Display the number of unique values in the StockCode and Description columns.

In [8]:
# Your answer here:
print(f'Number of unique StockCode: {len(data.StockCode.unique())}')
print(f'Number of unique Description: {len(data.Description.unique())}')

Number of unique StockCode: 2574
Number of unique Description: 2639


In [None]:
# %load solution/m2q1.py


Number of unique StockCode is: 2574
Number of unique Description is: 2639
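
The two counts differ (2574 stock codes versus 2639 descriptions), which implies that at least some stock codes appear with more than one description. Below is a minimal sketch of how one might check this. The rows are hypothetical and not taken from the dataset; on the real data you would use `data` in place of `toy`.

```python
import pandas as pd

# Hypothetical rows: code 'A1' appears with two spellings of its description
toy = pd.DataFrame({
    'StockCode':   ['A1', 'A1', 'B2', 'C3'],
    'Description': ['RED MUG', 'RED MUG ', 'BLUE BAG', 'GIFT TAPE'],
})

# Series.nunique() counts distinct non-null values
n_codes = toy['StockCode'].nunique()
n_descriptions = toy['Description'].nunique()

# Number of distinct descriptions per stock code; keep codes with more than one
desc_per_code = toy.groupby('StockCode')['Description'].nunique()
ambiguous_codes = desc_per_code[desc_per_code > 1]

print(n_codes, n_descriptions)
print(ambiguous_codes)
```

This is one reason to build item features from StockCode rather than Description: the code is the more stable identifier.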


#  Experimenting on high dimension item data

Because the item data is high-dimensional, aggregating it to the customer level produces a very wide table. To study the effect with manageable processing time, we first create a dataframe containing only two customer IDs.

**Create an exp_df that only contains transactions for 2 customers (i.e. 14566 and 17844)**


In [9]:
exp_df = data[data['CustomerID'].isin([14566,17844])]

In [10]:
exp_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Sales
19250,563900,85099C,JUMBO BAG BAROQUE BLACK WHITE,200,8/21/11 11:05,1.79,14566,Channel Islands,358.0
19251,563900,85099B,JUMBO BAG RED RETROSPOT,200,8/21/11 11:05,1.79,14566,Channel Islands,358.0
19252,563900,23199,JUMBO BAG APPLES,200,8/21/11 11:05,1.79,14566,Channel Islands,358.0
19253,563900,22386,JUMBO BAG PINK POLKADOT,200,8/21/11 11:05,1.79,14566,Channel Islands,358.0
19851,564428,21993,FLORAL FOLK STATIONERY SET,12,8/25/11 11:27,1.25,17844,Canada,15.0


In [11]:
exp_df.shape

(9, 9)

Q2: Use the get_dummies function in pandas to create dummy variables for CustomerID on each transaction record, then add the original CustomerID column to the resulting dataframe.

In [None]:
# Your answer here:


In [None]:
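# %load solution/m2q2.py
# Preview the dummy variables: one column per customer, one row per transaction
pd.get_dummies(exp_df['CustomerID'])
# Store the dummies and add CustomerID back so we can group by it later
exp_item_dummies = pd.get_dummies(exp_df['CustomerID'])
exp_item_dummies['CustomerID'] = exp_df['CustomerID']
exp_item_dummies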


Unnamed: 0,14566,17844
19250,1,0
19251,1,0
19252,1,0
19253,1,0
19851,0,1
19852,0,1
19853,0,1
19854,0,1
19855,0,1


Unnamed: 0,14566,17844,CustomerID
19250,1,0,14566
19251,1,0,14566
19252,1,0,14566
19253,1,0,14566
19851,0,1,17844
19852,0,1,17844
19853,0,1,17844
19854,0,1,17844
19855,0,1,17844


**Next, we can aggregate this information to the customer-level**

In [20]:
exp_item_data = exp_item_dummies.groupby('CustomerID').sum()

In [21]:
exp_item_data

CustomerID,14566,17844
14566,4,0
17844,0,5
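
The same two steps carry over from CustomerID to item codes. The sketch below uses a small hypothetical transaction table (the customer IDs and stock codes are made up) to show one-hot encoding of StockCode followed by a roll-up to the customer level. It also cross-checks the result against `pd.crosstab`, which tabulates the same counts directly.

```python
import pandas as pd

# Hypothetical transactions: one row per item line on an invoice
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'StockCode':  ['A', 'B', 'A', 'B', 'C'],
})

# Step 1: one dummy column per StockCode (still one row per transaction)
toy_dummies = pd.get_dummies(toy['StockCode'])
toy_dummies['CustomerID'] = toy['CustomerID']

# Step 2: roll up to one row per customer; each cell counts purchases
toy_item_data = toy_dummies.groupby('CustomerID').sum().astype(int)

# Cross-check: crosstab counts CustomerID x StockCode combinations directly
toy_counts = pd.crosstab(toy['CustomerID'], toy['StockCode'])

print(toy_item_data)
```

Each row is now a customer and each column an item, so the number of columns equals the number of distinct stock codes.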


# High dimensionality on the StockCode column

### Above, we used CustomerID to experiment with the get_dummies method, also known as "One Hot Encoding". Now imagine applying get_dummies to StockCode, which generates one column per item and therefore a very high-dimensional dataframe.

Q3: Create a dataframe of dummy variables for 'StockCode', this time for the full dataset.

Name it item_dummies.

Then, add 'CustomerID' to this new dataframe so that we can roll up by customer later.

Then, display the first 5 rows in this dataframe.


In [23]:
# Your answer here:
item_dummies = pd.get_dummies(data.StockCode)
item_dummies['CustomerID'] = data.CustomerID
item_dummies.head(5)

Unnamed: 0,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,...,90201A,90201B,90201C,90201D,90202D,90204,C2,M,POST,CustomerID
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583


In [None]:
# %load solution/m2q3.py


In [32]:
item_dummies.head()

Unnamed: 0,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,...,90201A,90201B,90201C,90201D,90202D,90204,C2,M,POST,CustomerID
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583


**Create item_data by aggregating at the customer level and display the first 5 rows of item_data**

In [24]:
item_data = item_dummies.groupby('CustomerID').sum()

In [25]:
item_data.head()

CustomerID,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,...,90192,90201A,90201B,90201C,90201D,90202D,90204,C2,M,POST
12347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
12349,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12352,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,5


In [26]:
# Total times each item was purchased
print(item_data.sum())

10002        12
10120         1
10125        13
10133         5
10135         4
11001         8
15034         5
15036        19
15039         3
15044A        6
15044B        3
15044C        2
15044D        4
15056BL      50
15056N       35
15056P       24
15058A        9
15058B        8
15058C        4
15060B       12
16008        11
16011         3
16012         4
16014        10
16016        16
16045         8
16048         8
16054         2
16156L        6
16156S       12
           ... 
90098         1
90099         2
90108         1
90114         1
90120B        1
90145         2
90160A        1
90160B        1
90160C        1
90160D        1
90161B        1
90161C        1
90161D        1
90162A        1
90162B        1
90164A        1
90170         1
90173         1
90184B        1
90184C        1
90192         1
90201A        1
90201B        3
90201C        2
90201D        1
90202D        1
90204         1
C2            6
M            34
POST       1055
Length: 2574, dtype: int64

In [27]:
# Save item_data.csv for use in the next module
item_data.to_csv('data/item_data.csv')
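
With 2574 item columns, most cells in a customer-by-item table are expected to be zero, since each customer buys only a small share of the catalogue. Here is a rough sketch for measuring this on a hypothetical customer-by-item table. The values are invented; substitute `item_data` for `toy_item_data` to measure the real table.

```python
import pandas as pd

# Hypothetical customer-level item counts (rows: customers, columns: items)
toy_item_data = pd.DataFrame(
    {'A': [2, 0, 0, 0], 'B': [0, 1, 0, 0], 'C': [0, 0, 0, 3], 'D': [1, 0, 0, 0]},
    index=pd.Index([101, 102, 103, 104], name='CustomerID'),
)

# Share of cells equal to zero (sparsity of the table)
zero_share = (toy_item_data == 0).to_numpy().mean()

# Number of distinct items each customer bought
items_per_customer = (toy_item_data > 0).sum(axis=1)

print(f'{zero_share:.0%} of cells are zero')
print(items_per_customer)
```

A table that is mostly zeros spread over thousands of columns is exactly the situation in which distance-based clustering struggles.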

# Setting Thresholds to handle high dimensionality

One very **simple and straightforward way** to reduce the dimensionality of this item data is to set a **threshold** for keeping features.

<br>
**For example, we can keep only the features for the 20 most popular items. First, let's see which items those are and the number of times they were purchased.**

Q4: Using the item_data dataframe, perform the following steps to see the 20 most purchased items.

1. Take the sum by column.
2. Sort the values.
3. Look at the last 20 (since they are sorted in ascending order by default).



In [None]:
# Your answer here:


In [None]:
# %load solution/m2q4.py

In [41]:
item_data.sum().sort_values(ascending=False).head(20)

POST      1055
22326      271
22423      222
22554      197
22556      179
21731      169
22328      166
22629      160
22551      158
21212      143
20725      141
23084      140
20750      132
20719      128
20726      123
85099B     123
21080      122
22139      117
22630      115
22961      114
dtype: int64

In [42]:
top_20_items = item_data.sum().sort_values().tail(20).index

In [43]:
# Display most popular 20 items

top_20_items

Index(['22961', '22630', '22139', '21080', '85099B', '20726', '20719', '20750',
       '23084', '20725', '21212', '22551', '22629', '22328', '21731', '22556',
       '22554', '22423', '22326', 'POST'],
      dtype='object')

In [47]:
# Display the columns for two of the top 20 items
item_data[['22961','22630']].head()

CustomerID,22961,22630
12347,0,0
12348,0,0
12349,0,0
12350,0,0
12352,0,1


Q5: Keep only the features for those 20 items.

In [31]:
# Your answer here:
# Column labels of the 20 most purchased items, most popular first
top20 = item_data.sum().sort_values(ascending=False).head(20).index
# Keep only those 20 item columns
top_20_item_data = item_data[top20]
top_20_item_data.head()

CustomerID,POST,22326,22423,22554,22556,21731,22328,22629,22551,21212,20725,23084,20750,20719,20726,85099B,21080,22139,22630,22961
12347,0,0,4,0,0,5,0,0,0,0,0,3,0,4,0,0,0,0,0,0
12348,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12349,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12350,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
12352,5,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [None]:
# %load solution/m2q5.py

CustomerID,22961,22630,22139,21080,85099B,20726,20719,20750,23084,20725,21212,22551,22629,22328,21731,22556,22554,22423,22326,POST
12347,0,0,0,0,0,0,4,0,3,0,0,0,0,0,5,0,0,4,0,0
12348,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
12349,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1
12350,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
12352,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,5


In [32]:
# Save threshold_item_data.csv

top_20_item_data.to_csv('data/threshold_item_data.csv')
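
Here are two variations on the threshold idea, sketched on a hypothetical table. `Series.nlargest` picks the top N items in one call, and a minimum-count cutoff keeps every item purchased at least a given number of times. The name `min_purchases` and the cutoff values are illustrative assumptions, not values used in this module.

```python
import pandas as pd

# Hypothetical customer-level item counts
toy_item_data = pd.DataFrame(
    {'A': [5, 3, 0], 'B': [1, 0, 0], 'C': [0, 2, 2], 'D': [0, 1, 1]},
    index=pd.Index([101, 102, 103], name='CustomerID'),
)

item_totals = toy_item_data.sum()  # A=8, B=1, C=4, D=2

# Variation 1: keep the N most purchased items
top_n = 2
top_items = item_totals.nlargest(top_n).index
top_item_data = toy_item_data[top_items]

# Variation 2: keep every item purchased at least min_purchases times
min_purchases = 2
frequent_items = item_totals[item_totals >= min_purchases].index
frequent_item_data = toy_item_data[frequent_items]

print(list(top_item_data.columns))
print(list(frequent_item_data.columns))
```

A fixed top-N threshold gives a predictable number of features, while a minimum-count threshold lets the number of features follow the data. Which one fits better depends on how many features the clustering step can handle.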

## Summary of Module 2

Congratulations! 

As a reminder, here are a few things you did in this module:
* You learned about the Curse of Dimensionality and how it can cause issues for clustering.
* You used another toy example to see the process of rolling up item data.
* You created customer-level item features that represent the number of times each item was purchased.
* And you reduced the dimensionality of that dataset using thresholds.
