#### COMPANION WORKBOOK

# Dimensionality Reduction

This module does not have a separate Coding Section. Instead, we will be using the exercises below to run all mission-critical code.

Remember, when you have many features (high dimensionality), clustering becomes especially hard because every observation ends up "far away" from every other observation. The amount of "space" that a data point could potentially occupy grows larger and larger, so clusters become very hard to form.
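
To make the "far away" intuition concrete, here is a minimal sketch using random points rather than the transaction data. It compares the nearest and farthest pairwise distances in 2 dimensions and in 1000 dimensions. The function name `distance_contrast` and the point counts are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def distance_contrast(n_points, n_dims):
    """Ratio of nearest to farthest pairwise distance (closer to 1 = less contrast)."""
    points = rng.random((n_points, n_dims))
    # Pairwise Euclidean distances via broadcasting
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep each pair once and skip the zero self-distances on the diagonal
    pair_dists = dists[np.triu_indices(n_points, k=1)]
    return pair_dists.min() / pair_dists.max()

contrast_low = distance_contrast(50, 2)
contrast_high = distance_contrast(50, 1000)
print(contrast_low, contrast_high)
```

In high dimensions the ratio moves toward 1: the nearest pair is almost as far apart as the farthest pair, so "close" and "far" stop being useful distinctions for a clustering algorithm.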

#### First, let's import libraries we'll need for data cleaning and feature engineering.

In general, it's good practice to keep all of your library imports at the top of your notebook or program. **Tip:** If you forget one, you can always add it here later and re-run this code block.

We've provided comments for guidance.

In [2]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd

# Increase Pandas Max Display Columns
pd.set_option('display.max_columns', 100)


#### Next, read in the cleaned transaction dataset (not the analytical base table) that we saved in the previous module.
* Remember, we saved it as <code style="color:crimson">'cleaned_transactions.csv'</code>.

In [4]:
cleaned_transactions_df = pd.read_csv('euro_cleaned_transactions.csv')

#### <span style="color:#555">EXERCISES</span>

Complete each of the following exercises.

## <span style="color:RoyalBlue">Exercise 2.1 - Item Data</span>

So how does The Curse of Dimensionality arise in this problem?

Well, in the previous module, we created a customer-level **analytical base table** with important features such as total sales by customer and average cart value by customer. However, remember, the client would also like to include **information about individual items** that were purchased.
* For example, if two customers purchased similar items, our model should be more likely to group them into the same cluster.
* In other words, we care not just about *how much* a customer purchases, but also *what* they purchase.

To get a better idea of what this would entail, let's take another look at the item information from our transactions dataset.

#### A.) Display the first 10 StockCodes and Descriptions from the cleaned transaction dataset.

In [7]:
cleaned_transactions_df[['StockCode', 'Description']].head(10)

Unnamed: 0,StockCode,Description
0,22728,ALARM CLOCK BAKELIKE PINK
1,22727,ALARM CLOCK BAKELIKE RED
2,22726,ALARM CLOCK BAKELIKE GREEN
3,21724,PANDA AND BUNNIES STICKER SHEET
4,21883,STARS GIFT TAPE
5,10002,INFLATABLE POLITICAL GLOBE
6,21791,VINTAGE HEADS AND TAILS CARD GAME
7,21035,SET/2 RED RETROSPOT TEA TOWELS
8,22326,ROUND SNACK BOXES SET OF4 WOODLAND
9,22629,SPACEBOY LUNCH BOX


<strong style="color:RoyalBlue">Expected output:</strong>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>StockCode</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>22728</td>
      <td>ALARM CLOCK BAKELIKE PINK</td>
    </tr>
    <tr>
      <th>1</th>
      <td>22727</td>
      <td>ALARM CLOCK BAKELIKE RED</td>
    </tr>
    <tr>
      <th>2</th>
      <td>22726</td>
      <td>ALARM CLOCK BAKELIKE GREEN</td>
    </tr>
    <tr>
      <th>3</th>
      <td>21724</td>
      <td>PANDA AND BUNNIES STICKER SHEET</td>
    </tr>
    <tr>
      <th>4</th>
      <td>21883</td>
      <td>STARS GIFT TAPE</td>
    </tr>
    <tr>
      <th>5</th>
      <td>10002</td>
      <td>INFLATABLE POLITICAL GLOBE</td>
    </tr>
    <tr>
      <th>6</th>
      <td>21791</td>
      <td>VINTAGE HEADS AND TAILS CARD GAME</td>
    </tr>
    <tr>
      <th>7</th>
      <td>21035</td>
      <td>SET/2 RED RETROSPOT TEA TOWELS</td>
    </tr>
    <tr>
      <th>8</th>
      <td>22326</td>
      <td>ROUND SNACK BOXES SET OF4 WOODLAND</td>
    </tr>
    <tr>
      <th>9</th>
      <td>22629</td>
      <td>SPACEBOY LUNCH BOX</td>
    </tr>
  </tbody>
</table>

As you can see, just within the first 10 transactions, we have 10 different items!

#### B.) Next, display the number of unique items in the dataset.

In [8]:
len(cleaned_transactions_df.StockCode.unique())

# number of unique items sold in this online retailer

2574

<strong style="color:RoyalBlue">Expected output:</strong>
<pre>
2574
</pre>
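
As a side note, pandas also offers `.nunique()` for counting distinct values. The sketch below uses a made-up Series of stock codes (not the real dataset) to show one difference worth knowing: `.unique()` keeps a missing value as its own entry, while `.nunique()` skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical stock codes, including a repeat and a missing value
codes = pd.Series(['22728', '22727', '22728', np.nan])

n_with_nan = len(codes.unique())   # unique() keeps NaN as its own value
n_without_nan = codes.nunique()    # nunique() skips NaN by default

print(n_with_nan, n_without_nan)
```

If the column has no missing values, the two counts agree.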

Wow, that's a lot!

But we still haven't explained how this would lead to **high-dimensionality** at the **customer level**. To understand how, let's first look at how we'll "roll up" the item data to the customer level.

## <span style="color:RoyalBlue">Exercise 2.2 - Toy Example: Rolling Up Item Data</span>

To illustrate how we'll **roll up item information to the customer level**, let's use a toy example. By this point, we've already used toy examples several times to clarify a concept or gain a deeper understanding of what we're doing. In general, toy examples are one of the best learning tools for data science, and we encourage you to continue using them, even after this program.

#### A.) First, create a <code style="color:crimson">toy_df</code> that only contains transactions for 2 customers in <code style="color:crimson">cleaned_transactions_df</code>.
* Include transactions for these 2 CustomerIDs: <code style="color:crimson">14566</code> and <code style="color:crimson">17844</code>.
* By the way, there's nothing special about these customers... we just chose them because they have relatively few purchases, making the toy example more manageable.
* Then, display the toy dataframe.

In [10]:
toy_df = cleaned_transactions_df[cleaned_transactions_df.CustomerID.isin([14566, 17844])]
toy_df

# Creates a mini version of the problem we're trying to solve, for illustrative purposes

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Sales
19250,563900,85099C,JUMBO BAG BAROQUE BLACK WHITE,200,8/21/2011 11:05,1.79,14566,Channel Islands,358.0
19251,563900,85099B,JUMBO BAG RED RETROSPOT,200,8/21/2011 11:05,1.79,14566,Channel Islands,358.0
19252,563900,23199,JUMBO BAG APPLES,200,8/21/2011 11:05,1.79,14566,Channel Islands,358.0
19253,563900,22386,JUMBO BAG PINK POLKADOT,200,8/21/2011 11:05,1.79,14566,Channel Islands,358.0
19851,564428,21993,FLORAL FOLK STATIONERY SET,12,8/25/2011 11:27,1.25,17844,Canada,15.0
19852,564428,23295,SET OF 12 MINI LOAF BAKING CASES,8,8/25/2011 11:27,0.83,17844,Canada,6.64
19853,564428,23293,SET OF 12 FAIRY CAKE BAKING CASES,16,8/25/2011 11:27,0.83,17844,Canada,13.28
19854,564428,23296,SET OF 6 TEA TIME BAKING CASES,8,8/25/2011 11:27,1.25,17844,Canada,10.0
19855,564428,23294,SET OF 6 SNACK LOAF BAKING CASES,8,8/25/2011 11:27,0.83,17844,Canada,6.64


<strong style="color:RoyalBlue">Expected output:</strong>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>InvoiceNo</th>
      <th>StockCode</th>
      <th>Description</th>
      <th>Quantity</th>
      <th>InvoiceDate</th>
      <th>UnitPrice</th>
      <th>CustomerID</th>
      <th>Country</th>
      <th>Sales</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>19250</th>
      <td>563900</td>
      <td>85099C</td>
      <td>JUMBO  BAG BAROQUE BLACK WHITE</td>
      <td>200</td>
      <td>8/21/11 11:05</td>
      <td>1.79</td>
      <td>14566</td>
      <td>Channel Islands</td>
      <td>358.00</td>
    </tr>
    <tr>
      <th>19251</th>
      <td>563900</td>
      <td>85099B</td>
      <td>JUMBO BAG RED RETROSPOT</td>
      <td>200</td>
      <td>8/21/11 11:05</td>
      <td>1.79</td>
      <td>14566</td>
      <td>Channel Islands</td>
      <td>358.00</td>
    </tr>
    <tr>
      <th>19252</th>
      <td>563900</td>
      <td>23199</td>
      <td>JUMBO BAG APPLES</td>
      <td>200</td>
      <td>8/21/11 11:05</td>
      <td>1.79</td>
      <td>14566</td>
      <td>Channel Islands</td>
      <td>358.00</td>
    </tr>
    <tr>
      <th>19253</th>
      <td>563900</td>
      <td>22386</td>
      <td>JUMBO BAG PINK POLKADOT</td>
      <td>200</td>
      <td>8/21/11 11:05</td>
      <td>1.79</td>
      <td>14566</td>
      <td>Channel Islands</td>
      <td>358.00</td>
    </tr>
    <tr>
      <th>19851</th>
      <td>564428</td>
      <td>21993</td>
      <td>FLORAL FOLK STATIONERY SET</td>
      <td>12</td>
      <td>8/25/11 11:27</td>
      <td>1.25</td>
      <td>17844</td>
      <td>Canada</td>
      <td>15.00</td>
    </tr>
    <tr>
      <th>19852</th>
      <td>564428</td>
      <td>23295</td>
      <td>SET OF 12 MINI LOAF BAKING CASES</td>
      <td>8</td>
      <td>8/25/11 11:27</td>
      <td>0.83</td>
      <td>17844</td>
      <td>Canada</td>
      <td>6.64</td>
    </tr>
    <tr>
      <th>19853</th>
      <td>564428</td>
      <td>23293</td>
      <td>SET OF 12 FAIRY CAKE BAKING CASES</td>
      <td>16</td>
      <td>8/25/11 11:27</td>
      <td>0.83</td>
      <td>17844</td>
      <td>Canada</td>
      <td>13.28</td>
    </tr>
    <tr>
      <th>19854</th>
      <td>564428</td>
      <td>23296</td>
      <td>SET OF 6 TEA TIME BAKING CASES</td>
      <td>8</td>
      <td>8/25/11 11:27</td>
      <td>1.25</td>
      <td>17844</td>
      <td>Canada</td>
      <td>10.00</td>
    </tr>
    <tr>
      <th>19855</th>
      <td>564428</td>
      <td>23294</td>
      <td>SET OF 6 SNACK LOAF BAKING CASES</td>
      <td>8</td>
      <td>8/25/11 11:27</td>
      <td>0.83</td>
      <td>17844</td>
      <td>Canada</td>
      <td>6.64</td>
    </tr>
  </tbody>
</table>

As you can see, the first customer in our toy dataframe bought 4 different items and the second customer bought 5 different items.

However, we can't pass text descriptions into our machine learning algorithms, so we need to find some way to represent each unique item.
* Good news! We can use a tool we've already seen: <code style="color:steelblue">.get_dummies()</code>! 
* That's right, we can create **dummy variables** for each item.
* While we might technically be able to use either the StockCode or Description columns, let's create dummy variables for <code style="color:steelblue">'StockCode'</code> just to be safe, since that's the actual item ID column.
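
As a quick refresher, here is a minimal sketch of what `pd.get_dummies()` does, using a made-up three-row table (`mini_df` and its values are hypothetical). Passing `dtype=int` is an optional choice here: recent pandas versions return `True`/`False` by default, and `dtype=int` asks for 0/1 values like the ones shown in this workbook.

```python
import pandas as pd

# Hypothetical mini transaction table: three rows, two distinct items
mini_df = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'StockCode': ['A', 'B', 'A'],
})

# One indicator column per unique StockCode, one row per transaction.
# dtype=int requests 0/1 values (recent pandas versions default to True/False).
mini_dummies = pd.get_dummies(mini_df['StockCode'], dtype=int)
print(mini_dummies)
```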

#### B.) Create a dataframe of dummy variables for <code style="color:steelblue">'StockCode'</code>.
* Name it <code style="color:crimson">toy_item_dummies</code>.
* We don't need the other features right now, so you can actually just directly pass in the <code style="color:steelblue">toy_df.StockCode</code> Series to <code style="color:steelblue">pd.get_dummies()</code>.
* Then, add <code style="color:steelblue">'CustomerID'</code> to this new dataframe so that we can roll up by customer later.
* Finally, display the dataframe.

In [12]:
toy_item_dummies = pd.get_dummies( toy_df.StockCode )

toy_item_dummies['CustomerID'] = toy_df.CustomerID

toy_item_dummies

Unnamed: 0,21993,22386,23199,23293,23294,23295,23296,85099B,85099C,CustomerID
19250,0,0,0,0,0,0,0,0,1,14566
19251,0,0,0,0,0,0,0,1,0,14566
19252,0,0,1,0,0,0,0,0,0,14566
19253,0,1,0,0,0,0,0,0,0,14566
19851,1,0,0,0,0,0,0,0,0,17844
19852,0,0,0,0,0,1,0,0,0,17844
19853,0,0,0,1,0,0,0,0,0,17844
19854,0,0,0,0,0,0,1,0,0,17844
19855,0,0,0,0,1,0,0,0,0,17844


<strong style="color:RoyalBlue">Expected output:</strong>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>21993</th>
      <th>22386</th>
      <th>23199</th>
      <th>23293</th>
      <th>23294</th>
      <th>23295</th>
      <th>23296</th>
      <th>85099B</th>
      <th>85099C</th>
      <th>CustomerID</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>19250</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>14566</td>
    </tr>
    <tr>
      <th>19251</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>14566</td>
    </tr>
    <tr>
      <th>19252</th>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>14566</td>
    </tr>
    <tr>
      <th>19253</th>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>14566</td>
    </tr>
    <tr>
      <th>19851</th>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>17844</td>
    </tr>
    <tr>
      <th>19852</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>17844</td>
    </tr>
    <tr>
      <th>19853</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>17844</td>
    </tr>
    <tr>
      <th>19854</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>17844</td>
    </tr>
    <tr>
      <th>19855</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>17844</td>
    </tr>
  </tbody>
</table>

As you can see, we now have a new dataframe of dummy variables. Besides <code style="color:steelblue">'CustomerID'</code>, it has 9 other variables: one for each of the unique items in the toy dataframe.

#### C.) Finally, aggregate this information to the customer-level.
* In fact, it's as simple as grouping by customer and counting the number of times each customer bought each item.

In [13]:
toy_item_data = toy_item_dummies.groupby('CustomerID').sum()
toy_item_data

Unnamed: 0_level_0,21993,22386,23199,23293,23294,23295,23296,85099B,85099C
CustomerID
14566,0,1,1,0,0,0,0,1,1
17844,1,0,0,1,1,1,1,0,0


<strong style="color:RoyalBlue">Expected output:</strong>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>21993</th>
      <th>22386</th>
      <th>23199</th>
      <th>23293</th>
      <th>23294</th>
      <th>23295</th>
      <th>23296</th>
      <th>85099B</th>
      <th>85099C</th>
    </tr>
    <tr>
      <th>CustomerID</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>14566</th>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <th>17844</th>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

Now we have **customer-level** features that represent the **number of times a customer bought each item**, and each unique item has its own feature.

That's exactly the type of information we want to include in our clustering model!
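
For comparison, the same customer-by-item counts can also be produced in one call with `pd.crosstab()`. The sketch below is an alternative route, not the one this workbook uses, and it runs on a made-up table (`mini_df` is hypothetical) so the two results can be checked side by side.

```python
import pandas as pd

mini_df = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2],
    'StockCode': ['A', 'B', 'A', 'A', 'C'],
})

# Route 1: dummy variables, then group by customer and sum
dummies = pd.get_dummies(mini_df['StockCode'], dtype=int)
dummies['CustomerID'] = mini_df['CustomerID']
rolled_up = dummies.groupby('CustomerID').sum()

# Route 2: cross-tabulate customers against items in one call
counts = pd.crosstab(mini_df['CustomerID'], mini_df['StockCode'])

print(rolled_up)
print(counts)
```

Both tables have one row per customer and one column per item, and customer 2 shows a count of 2 for item A because that item appears in two of their transaction rows.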

## <span style="color:RoyalBlue">Exercise 2.3 - High Dimensionality</span>

Now, perhaps the alarms in your head have already started ringing!
* In the toy example, we had 9 unique items, and that translated to 9 customer-level item features. 
* What do you think would happen for the full dataset?

Let's see for ourselves.

#### A.) First, create a dataframe of dummy variables for <code style="color:steelblue">'StockCode'</code>, this time for the full dataset.
* Name it <code style="color:crimson">item_dummies</code>.
* Then, add <code style="color:steelblue">'CustomerID'</code> to this new dataframe so that we can roll up by customer later.
* Then, display the first 5 rows in this dataframe.

In [15]:
item_dummies = pd.get_dummies( cleaned_transactions_df.StockCode )
item_dummies['CustomerID'] = cleaned_transactions_df.CustomerID
item_dummies.head()

Unnamed: 0,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,15044B,15044C,15044D,15056BL,15056N,15056P,15058A,15058B,15058C,15060B,16008,16011,16012,16014,16016,16045,16048,16054,16156L,16156S,16161G,16161P,16161U,16168M,16169E,16169K,16169M,16218,16219,16225,16235,16236,16237,16238,16258A,16259,17003,17011F,17012A,17012B,...,85232A,85232B,85232D,90001B,90001D,90013A,90013C,90018C,90019A,90024B,90030A,90030B,90030C,90031,90036E,90037B,90057,90070,90087,90098,90099,90108,90114,90120B,90145,90160A,90160B,90160C,90160D,90161B,90161C,90161D,90162A,90162B,90164A,90170,90173,90184B,90184C,90192,90201A,90201B,90201C,90201D,90202D,90204,C2,M,POST,CustomerID
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12583
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12583
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12583
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12583
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12583


As you can see, there are MANY features in this item dummies dataset.
* 1 for customer ID
* 2574 for the items!
* And very importantly... most of the values are 0. Each row is a single transaction line, so it has a 1 only in the column for the item purchased and a 0 in every other column.
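
One way to check the structure of a dummy-variable table is to sum across each row. The sketch below does this on a made-up Series of stock codes (`mini_codes` is hypothetical); on the real data the same check would need `'CustomerID'` dropped first, since that column is not a dummy variable.

```python
import pandas as pd

mini_codes = pd.Series(['A', 'B', 'A', 'C'], name='StockCode')
mini_dummies = pd.get_dummies(mini_codes, dtype=int)

# Each transaction row has a 1 in exactly one item column
ones_per_row = mini_dummies.sum(axis=1)
print(ones_per_row.tolist())
```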

#### B.) Next, roll up the item dummies data into customer-level item data.
* Name it <code style="color:crimson">item_data</code>.
* This could take a few seconds.
* Then, display the first 5 rows of the dataframe.

In [16]:
item_data = item_dummies.groupby('CustomerID').sum()
item_data.head()

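# "Rolling up" means aggregating the transaction-level rows into one row per customer:
# group by CustomerID and sum each item's dummy column to get purchase counts.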

Unnamed: 0_level_0,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,15044B,15044C,15044D,15056BL,15056N,15056P,15058A,15058B,15058C,15060B,16008,16011,16012,16014,16016,16045,16048,16054,16156L,16156S,16161G,16161P,16161U,16168M,16169E,16169K,16169M,16218,16219,16225,16235,16236,16237,16238,16258A,16259,17003,17011F,17012A,17012B,...,85231B,85232A,85232B,85232D,90001B,90001D,90013A,90013C,90018C,90019A,90024B,90030A,90030B,90030C,90031,90036E,90037B,90057,90070,90087,90098,90099,90108,90114,90120B,90145,90160A,90160B,90160C,90160D,90161B,90161C,90161D,90162A,90162B,90164A,90170,90173,90184B,90184C,90192,90201A,90201B,90201C,90201D,90202D,90204,C2,M,POST
CustomerID
12347,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
12349,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
12350,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
12352,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,5


As you can see, even after rolling up to the customer level, most of the values are still 0. That means that most customers are not buying a huge array of different items, which is to be expected.

#### C.) Finally, let's display the total number of times each item was purchased.
* This quick check confirms these features are pretty sparse.

In [17]:
item_data.sum()

10002       12
10120        1
10125       13
10133        5
10135        4
          ... 
90202D       1
90204        1
C2           6
M           34
POST      1055
Length: 2574, dtype: int64

<strong style="color:RoyalBlue">Expected output:</strong>
<pre>
10002        12
10120         1
10125        13
10133         5
10135         4
11001         8
15034         5
15036        19
15039         3
15044A        6
15044B        3
15044C        2
15044D        4
15056BL      50
15056N       35
15056P       24
15058A        9
15058B        8
15058C        4
15060B       12
16008        11
16011         3
16012         4
16014        10
16016        16
16045         8
16048         8
16054         2
16156L        6
16156S       12
           ... 
90098         1
90099         2
90108         1
90114         1
90120B        1
90145         2
90160A        1
90160B        1
90160C        1
90160D        1
90161B        1
90161C        1
90161D        1
90162A        1
90162B        1
90164A        1
90170         1
90173         1
90184B        1
90184C        1
90192         1
90201A        1
90201B        3
90201C        2
90201D        1
90202D        1
90204         1
C2            6
M            34
POST       1055
Length: 2574, dtype: int64
</pre>

As you can see, most items were purchased only a handful of times or fewer!
* First of all, we've just created 2574 customer-level item features, which leads to The Curse of Dimensionality.
* To make matters even worse, most of the values for many of those features are 0!
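
If you want to put numbers on "sparse", a sketch like the following could be adapted. It uses a made-up customer-by-item table (`mini_item_data` and the cutoff `handful` are hypothetical) and computes two things: the share of items purchased at most a handful of times, and the share of all cells that are zero.

```python
import pandas as pd

# Hypothetical customer-by-item purchase counts (rows: customers, columns: items)
mini_item_data = pd.DataFrame(
    {'A': [2, 0, 0, 1], 'B': [0, 0, 1, 0], 'C': [0, 0, 0, 1], 'POST': [3, 1, 4, 2]},
    index=pd.Index([101, 102, 103, 104], name='CustomerID'),
)

# Total purchases per item, the same kind of Series as item_data.sum()
totals = mini_item_data.sum()

# Share of items purchased at most `handful` times (the cutoff is an arbitrary choice)
handful = 3
rare_share = (totals <= handful).mean()

# Share of all cells in the table that are zero
zero_share = (mini_item_data == 0).to_numpy().mean()
print(rare_share, zero_share)
```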

However, there's no need to panic. Next, we'll introduce a strategy for reducing the number of item features that we actually keep.

#### D.) Before moving on, let's save this customer-level item dataframe as <code style="color:crimson">'item_data.csv'</code>. We'll use it again in the next module.
* In the next module, we'll look at an alternative way to reduce dimensionality.
* Again, we won't set <code style="color:steelblue">index=None</code> because we want to keep the CustomerID's as the index.

In [18]:
item_data.to_csv('euro_item_data.csv')
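
When the file is read back later, the saved index comes back as an ordinary first column unless you tell pandas otherwise. The sketch below shows the round trip on a made-up two-customer table, using an in-memory buffer in place of a file on disk (`mini_item_data` is hypothetical).

```python
import io
import pandas as pd

mini_item_data = pd.DataFrame(
    {'22326': [1, 0], 'POST': [4, 1]},
    index=pd.Index([12347, 12348], name='CustomerID'),
)

# Write with the index included (the default), using an in-memory buffer
# in place of a file on disk
buffer = io.StringIO()
mini_item_data.to_csv(buffer)
buffer.seek(0)

# index_col=0 tells read_csv to treat the first column as the index again
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```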

## <span style="color:RoyalBlue">Exercise 2.4 - Thresholds</span>

One very **simple and straightforward way** to reduce the dimensionality of this item data is to set a **threshold** for keeping features.
* The rationale is that you might only want to keep **popular items**.
* For example, let's say item A was only purchased by 2 customers. Well, the feature for item A will be 0 for almost all observations, which isn't very helpful.
* On the other hand, let's say item B was purchased by 100 customers. The feature for item B will allow more meaningful comparisons.
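
The item A / item B comparison above counts *customers*, which is slightly different from counting total purchases. As a sketch of that variant (the exercise below ranks by total purchases instead), the following keeps only items bought by at least a minimum number of distinct customers. The table `mini_item_data` and the cutoff `min_customers` are hypothetical.

```python
import pandas as pd

mini_item_data = pd.DataFrame(
    {'A': [5, 0, 0, 0], 'B': [1, 1, 0, 1], 'C': [0, 0, 2, 0]},
    index=pd.Index([101, 102, 103, 104], name='CustomerID'),
)

# Number of distinct customers who bought each item at least once
customers_per_item = (mini_item_data > 0).sum()

# Keep only items bought by at least min_customers customers (hypothetical cutoff)
min_customers = 2
popular_items = customers_per_item[customers_per_item >= min_customers].index
popular_item_data = mini_item_data[popular_items]
print(customers_per_item.to_dict(), list(popular_items))
```

Note that item A has the highest purchase total here but only one buyer, so the two definitions of "popular" can disagree.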

To make this concrete, assume we only wish to keep item features for the **20 most popular items**. 

#### A.) First, we can see which items those are and the number of times they were purchased.
1. Take the sum by column.
2. Sort the values.
3. Look at the last 20 (since they are sorted in ascending order by default).

In [19]:
item_data.sum().sort_values().tail(20)

22961      114
22630      115
22139      117
21080      122
85099B     123
20726      123
20719      128
20750      132
23084      140
20725      141
21212      143
22551      158
22629      160
22328      166
21731      169
22556      179
22554      197
22423      222
22326      271
POST      1055
dtype: int64

<strong style="color:RoyalBlue">Expected output:</strong>
<pre>
22961      114
22630      115
22139      117
21080      122
85099B     123
20726      123
20719      128
20750      132
23084      140
20725      141
21212      143
22551      158
22629      160
22328      166
21731      169
22556      179
22554      197
22423      222
22326      271
POST      1055
dtype: int64
</pre>
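
pandas also has a shortcut for this pattern, `Series.nlargest()`. The sketch below compares the two approaches on a made-up Series of totals (`totals` is hypothetical). The main difference is ordering: sorting then taking the tail gives ascending order, while `nlargest()` gives descending order.

```python
import pandas as pd

totals = pd.Series({'A': 3, 'B': 10, 'C': 1, 'D': 7})

# Approach used above: sort ascending, then take the last n
top_by_sort = totals.sort_values().tail(2)

# Alternative: nlargest returns the n largest values in descending order
top_by_nlargest = totals.nlargest(2)

print(top_by_sort.index.tolist(), top_by_nlargest.index.tolist())
```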

#### B.) Next, if we take the <code style="color:steelblue">.index</code> of the above series, we can get just a list of the StockCodes for those 20 items.

In [20]:
top_20_items = item_data.sum().sort_values().tail(20).index
print( top_20_items )

Index(['22961', '22630', '22139', '21080', '85099B', '20726', '20719', '20750',
       '23084', '20725', '21212', '22551', '22629', '22328', '21731', '22556',
       '22554', '22423', '22326', 'POST'],
      dtype='object')


<strong style="color:RoyalBlue">Expected output:</strong>
<pre>
Index(['22961', '22630', '22139', '21080', '85099B', '20726', '20719', '20750',
       '23084', '20725', '21212', '22551', '22629', '22328', '21731', '22556',
       '22554', '22423', '22326', 'POST'],
      dtype='object')
</pre>

#### C.) Keep only the features for those 20 items. Save them in a new object <code style="color:steelblue">top_20_item_data</code>.
* Then, as a quick sanity check, display its shape.

In [21]:
top_20_item_data = item_data[top_20_items]

# Indexing a DataFrame with a list-like of column labels keeps only those columns

top_20_item_data.shape

(414, 20)

<strong style="color:RoyalBlue">Expected output:</strong>
<pre>
(414, 20)
</pre>
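
If the square-bracket selection above looks unfamiliar, here is a minimal sketch of how it works on a made-up table (`mini_item_data` and `wanted` are hypothetical). Passing a list-like of column labels inside `[]` returns a DataFrame with just those columns, in the order given; the `.loc` form spells out the same selection more explicitly.

```python
import pandas as pd

mini_item_data = pd.DataFrame(
    {'A': [1, 0], 'B': [0, 2], 'C': [3, 1]},
    index=pd.Index([101, 102], name='CustomerID'),
)

# A list-like of column labels (here a pandas Index) inside [] selects those columns
wanted = pd.Index(['C', 'A'])
subset = mini_item_data[wanted]

# The same selection written explicitly with .loc: all rows, chosen columns
subset_loc = mini_item_data.loc[:, wanted]
print(subset)
```

The number of rows is unchanged because only columns are being filtered, which is why the shape check keeps every customer.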

#### D.) Take a look at some example rows in <code style="color:steelblue">top_20_item_data</code> for yourself.
* These 20 features are much more manageable than the 2574 from earlier, and they are arguably the most important features because they are the most popular items.

In [22]:
top_20_item_data.head()

Unnamed: 0_level_0,22961,22630,22139,21080,85099B,20726,20719,20750,23084,20725,21212,22551,22629,22328,21731,22556,22554,22423,22326,POST
CustomerID
12347,0,0,0,0,0,0,4,0,3,0,0,0,0,0,5,0,0,4,0,0
12348,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
12349,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1
12350,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
12352,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,5


#### E.) Finally, save this top 20 items dataframe as <code style="color:crimson">'threshold_item_data.csv'</code>.
* We'll see a different way to reduce dimensionality in the next module, but we'll come back to this dataframe again in Module 4.
* Do **not** set <code style="color:steelblue">index=None</code> because we want to keep the CustomerID's as the index.

In [23]:
top_20_item_data.to_csv('euro_threshold_item_data.csv')

Congratulations on making it through the Dimensionality Reduction module! As a reminder, here are a few things you did in this module:
* You learned about the Curse of Dimensionality and how it can cause issues for clustering.
* You used another toy example to see the process of rolling up item data.
* You created customer-level item features that represent the number of times each item was purchased.
* And you reduced the dimensionality of that dataset using thresholds.

In the next module, Principal Components Analysis, we'll look at a different way to reduce the number of customer-level item features. This is a more advanced technique, and it's actually considered its own Unsupervised Learning task!