# Unsupervised Learning Guided Lesson


## Lesson Goals

In this guided lesson, we will analyze an unsupervised learning problem from start to finish and introduce several different processing techniques.


**Introduction**

As a data scientist or analyst, you may be asked open ended question about a dataset. One example is to find some patterns in a dataset that is unlabeled. Typically, this happens when analyzing a group of customers and trying to find a common themes between the transactions. The leading choice of algorithm for this type of problem is an unsupervised algorithm. Specifically, this lesson will be using clustering to analyze this problem. In this lesson, we will be analyzing a log of transactions from a bakery to make recommendations about the marketing and sale of products.


**The Data**

The dataset we will be analyzing comes from Kaggle and is a log of transactions from a bakery in Edinburgh called The Bread Basket. We will start by evaluating the types of the variables in the data as well as the contents of the categorical variables and the distribution of the numerical variables. 

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

breadbasket = pd.read_csv('../BreadBasket_DMS.csv')
breadbasket.dtypes

Date           object
Time           object
Transaction     int64
Item           object
dtype: object

In [2]:
breadbasket.head()

Unnamed: 0,Date,Time,Transaction,Item
0,2016-10-30,09:58:11,1,Bread
1,2016-10-30,10:05:34,2,Scandinavian
2,2016-10-30,10:05:34,2,Scandinavian
3,2016-10-30,10:07:57,3,Hot chocolate
4,2016-10-30,10:07:57,3,Jam


In [3]:
breadbasket.describe()

Unnamed: 0,Transaction
count,21293.0
mean,4951.990889
std,2787.7584
min,1.0
25%,2548.0
50%,5067.0
75%,7329.0
max,9684.0


As we can see from these three functions, we have 4 columns in the dataset. The date and time are separated into two columns and stored as text. This means we will have to combine and convert them later on. The transaction variable is ordinal, so the summary statistics really have no meaning in this context. The only meaningful information from the describe function is the max which tells us we have 9684 transactions in the dataset. From the head function, we see that each row represents an item in the transaction. This means that there are potentially multiple items per transaction or maybe just one. We might benefit from consolidating the data and creating a new dataset that contains one row per transaction.

So the data we have is a breakdown of each item in a transaction and the date and time when the transaction occurred. The data we do not have is any information about the customer. Since we cannot associate the transactions back to customers, we cannot tell if a certain customer is a regular who comes in and buys a coffee and a pastry every day or whether a customer is a tourist who came in once and bought a specialty dessert.

Our strategy to make sense of the dataset will be to generate derived variables from this transaction log and cluster based on these derived variables. We will evaluate the aggregate information regarding each cluster and make recommendations about which products to advertise and which products should be in stock and at what days and times.


## Exploring the Variables

While we only have a few variables, we should explore their contents.

First, let's look at the time and day of week. Hour is a crucial factor since customer behavior differs significantly between the morning and the afternoon. However, we can even find differences between customer behavior at 7 am vs. at 9 am. Similarly, we see differences between weekday and weekend customer behavior.

To examine the date and time, we must reformat this variable. We start by combining the date and time into one column and then transforming the column to a datetime column. This allows us to extract the hour and the time of day.

In [4]:
breadbasket['DateTime'] = pd.to_datetime(breadbasket.Date + ' ' + breadbasket.Time)

Next, we look at the item variable. This variable will tell us how many products are sold by the bakery and which products are more popular.

In [5]:
breadbasket.Item.unique()

array(['Bread', 'Scandinavian', 'Hot chocolate', 'Jam', 'Cookies',
       'Muffin', 'Coffee', 'Pastry', 'Medialuna', 'Tea', 'NONE',
       'Tartine', 'Basket', 'Mineral water', 'Farm House', 'Fudge',
       'Juice', "Ella's Kitchen Pouches", 'Victorian Sponge', 'Frittata',
       'Hearty & Seasonal', 'Soup', 'Pick and Mix Bowls', 'Smoothies',
       'Cake', 'Mighty Protein', 'Chicken sand', 'Coke',
       'My-5 Fruit Shoot', 'Focaccia', 'Sandwich', 'Alfajores', 'Eggs',
       'Brownie', 'Dulce de Leche', 'Honey', 'The BART', 'Granola',
       'Fairy Doors', 'Empanadas', 'Keeping It Local', 'Art Tray',
       'Bowl Nic Pitt', 'Bread Pudding', 'Adjustment', 'Truffles',
       'Chimichurri Oil', 'Bacon', 'Spread', 'Kids biscuit', 'Siblings',
       'Caramel bites', 'Jammie Dodgers', 'Tiffin', 'Olum & polenta',
       'Polenta', 'The Nomad', 'Hack the stack', 'Bakewell',
       'Lemon and coconut', 'Toast', 'Scone', 'Crepes', 'Vegan mincepie',
       'Bare Popcorn', 'Muesli', 'Crisps', 'Pi

We can look at the counts as well to see what items are most popular.

In [6]:
breadbasket.Item.value_counts()

Coffee                           5471
Bread                            3325
Tea                              1435
Cake                             1025
Pastry                            856
NONE                              786
Sandwich                          771
Medialuna                         616
Hot chocolate                     590
Cookies                           540
Brownie                           379
Farm House                        374
Muffin                            370
Alfajores                         369
Juice                             369
Soup                              342
Scone                             327
Toast                             318
Scandinavian                      277
Truffles                          193
Coke                              185
Spanish Brunch                    172
Fudge                             159
Baguette                          152
Jam                               149
Tiffin                            146
Mineral wate

There are a total of 95 items. The most common items are coffee, bread, tea, cake and pastry. Since 95 is a really large number, if we created dummy variables out of this data, it would produce too many variables. One option is to classify the data. Typically, when working on these types of problems, companies will have a classification system for the items they sell. However, let's try to come up with our own.

Looking at the list of unique items, we can identify a number of obvious categories. We have beverages and breakfast pastries like muffins and medialuna. We have items for kids like juice and pouches. We also have non food items like gift vouchers and t-shirts. Another group of items that we can notice is ready to eat snacks like popcorn and crisps. With a bit of work, we can narrow it down from 95 products to 11 categories: beverage, other, kids, snacks, bread, breakfast pastry, dessert, condiments, breakfast, lunch, and other foods. The last group is used to classify mostly uncommon items that sell very little (like polenta) or have names that are hard to identify (like "Hack the Stack").

We generate the categories using lists and then use the lists to create dummy variables.



In [7]:
beverage = ['Hot chocolate', 'Coffee', 'Tea', 'Mineral water', 'Juice', 'Coke', 'Smoothies']
other = ['NONE', 'Christmas common', 'Gift voucher', "Valentine's card", 'Tshirt', 'Afternoon with the baker', 'Postcard', 'Siblings', 'Nomad bag', 'Adjustment', 'Drinking chocolate spoons ', 'Coffee granules ']
kids = ["Ella's Kitchen Pouches", 'My-5 Fruit Shoot', 'Kids biscuit']
snacks = ['Mighty Protein', 'Pick and Mix Bowls', 'Caramel bites', 'Bare Popcorn', 'Crisps', 'Cherry me Dried fruit', 'Raw bars']
bread = ['Bread', 'Toast', 'Baguette', 'Focaccia', 'Scandinavian']
breakfast_pastry = ['Muffin', 'Pastry', 'Medialuna', 'Scone']
dessert = ['Cookies', 'Tartine', 'Fudge', 'Victorian Sponge', 'Cake', 'Alfajores', 'Brownie', 'Bread Pudding', 'Bakewell', 'Raspberry shortbread sandwich', 'Lemon and coconut', 'Crepes', 'Chocolates', 'Truffles', 'Panatone']
condiments = ['Jam', 'Dulce de Leche', 'Honey', 'Gingerbread syrup', 'Extra Salami or Feta', 'Bacon', 'Spread', 'Chimichurri Oil']
breakfast = ['Eggs', 'Frittata', 'Granola', 'Muesli', 'Duck egg', 'Brioche and salami']
lunch = ['Soup', 'Sandwich', 'Chicken sand', 'Salad', 'Chicken Stew']


other_food = [x for x in breadbasket.Item.unique() if x not in beverage 
                and x not in other and x not in kids and x not in snacks 
                and x not in bread and x not in breakfast_pastry 
                and x not in dessert and x not in condiments 
                and x not in breakfast and x not in lunch]


breadbasket['beverage'] = np.where(breadbasket.Item.isin(beverage), 1, 0)
breadbasket['other'] = np.where(breadbasket.Item.isin(other), 1, 0)
breadbasket['kids'] = np.where(breadbasket.Item.isin(kids), 1, 0)
breadbasket['snacks'] = np.where(breadbasket.Item.isin(snacks), 1, 0)
breadbasket['bread'] = np.where(breadbasket.Item.isin(bread), 1, 0)
breadbasket['breakfast_pastry'] = np.where(breadbasket.Item.isin(breakfast_pastry), 1, 0)
breadbasket['dessert'] = np.where(breadbasket.Item.isin(dessert), 1, 0)
breadbasket['condiments'] = np.where(breadbasket.Item.isin(condiments), 1, 0)
breadbasket['breakfast'] = np.where(breadbasket.Item.isin(breakfast), 1, 0)
breadbasket['lunch'] = np.where(breadbasket.Item.isin(lunch), 1, 0)
breadbasket['other_food'] = np.where(breadbasket.Item.isin(other_food), 1, 0)
breadbasket.head()

Unnamed: 0,Date,Time,Transaction,Item,DateTime,beverage,other,kids,snacks,bread,breakfast_pastry,dessert,condiments,breakfast,lunch,other_food
0,2016-10-30,09:58:11,1,Bread,2016-10-30 09:58:11,0,0,0,0,1,0,0,0,0,0,0
1,2016-10-30,10:05:34,2,Scandinavian,2016-10-30 10:05:34,0,0,0,0,1,0,0,0,0,0,0
2,2016-10-30,10:05:34,2,Scandinavian,2016-10-30 10:05:34,0,0,0,0,1,0,0,0,0,0,0
3,2016-10-30,10:07:57,3,Hot chocolate,2016-10-30 10:07:57,1,0,0,0,0,0,0,0,0,0,0
4,2016-10-30,10:07:57,3,Jam,2016-10-30 10:07:57,0,0,0,0,0,0,0,1,0,0,0


**Processing the Data**

The first bit of work we will do to process the data is to aggregate by transaction. This will give us the count of each category per transaction. We will use the groupby function to find the sum in each transaction. We group by the datetime as well since we want to keep this column after the aggregation. This should not be a problem since a transaction number and a datetime uniquely identifies each row. 

In [8]:
bread_group = breadbasket.groupby(['Transaction','DateTime']).sum()
bread_group.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,beverage,other,kids,snacks,bread,breakfast_pastry,dessert,condiments,breakfast,lunch,other_food
Transaction,DateTime,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,2016-10-30 09:58:11,0,0,0,0,1,0,0,0,0,0,0
2,2016-10-30 10:05:34,0,0,0,0,2,0,0,0,0,0,0
3,2016-10-30 10:07:57,1,0,0,0,0,0,1,1,0,0,0
4,2016-10-30 10:08:41,0,0,0,0,0,1,0,0,0,0,0
5,2016-10-30 10:13:03,1,0,0,0,1,1,0,0,0,0,0


Now the transaction number and the datetime are indices in this aggregated dataset. If we would like to use the information in these columns, we would have to reset the index.

In [9]:
bread_group.reset_index(level=['DateTime'], inplace=True)

Next, we will generate a column for day of week and for hour.

In [10]:
bread_group['hour'] = bread_group.DateTime.dt.hour
bread_group['day'] = bread_group.DateTime.dt.day_name()
bread_group.day.value_counts()

Saturday     2068
Friday       1488
Sunday       1264
Thursday     1252
Tuesday      1203
Monday       1135
Wednesday    1121
Name: day, dtype: int64

Saturday has the most transactions of any weekday by far.

In [11]:
bread_group.hour.value_counts()

11    1445
12    1347
10    1267
13    1163
14    1130
9     1007
15     924
16     583
8      375
17     160
18      52
19      34
7       16
20      15
22       7
23       3
21       2
1        1
Name: hour, dtype: int64

11 am has the most transactions followed by noon and 10 am.

Now let's create dummy variables out of the day column and drop all other non numeric columns to prepare our dataset for the ML algorithm.

In [12]:
bread_days = pd.get_dummies(data=bread_group, columns=['day'])
bread_days.drop(columns='DateTime', inplace=True, axis=1)

## Some More Transformations - PCA

Our plan here is to use k-means clustering. However, an important note on k-means clustering is that it does not respond well to dummy variable columns. Therefore, our best option is to transform the data using principal component analysis or PCA. What PCA does is project our data onto a lower dimensional subspace. The new data will typically reduce the dimensions of our original data and will therefore, contain less variables. The first dimension will explain the most amount of variation in the data and subsequent components will explain less and less variation. This transformation will provide us with a smaller amount of continuous variables that we can cluster more effectively.

Here we chose to generate 4 components.

In [13]:
from sklearn.decomposition import PCA

pca = PCA(n_components=4)

principalComponents = pca.fit_transform(bread_days)
principalDf = pd.DataFrame(data = principalComponents ,columns = ['pc1', 'pc2', 'pc3', 'pc4'])

principalDf.head()           

Unnamed: 0,pc1,pc2,pc3,pc4
0,3.197413,-0.926669,0.244933,0.005981
1,2.227896,-1.274346,1.183371,-0.159231
2,2.107647,0.424884,-0.219793,0.879771
3,2.19896,-0.580674,-0.783351,-0.031003
4,2.206812,0.037385,0.443747,-0.333643


## The Algorithm - K-Means Clustering

We are now ready to cluster the data using k-means. We chose to create 5 clusters. The choice is normally arbitrary though there are ways to optimize the number of clusters. Here, the choice is more driven by the number of transaction clusters we would like to create. Two clusters would definitely be too few to capture meaningful differences while 10 is certainly too many.

In [14]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)
bread_clusters = kmeans.fit(principalDf)
bread_clusters.cluster_centers_

array([[ 1.63146200e+00,  5.06638014e-02,  2.89304459e-02,
         7.19925427e-03],
       [-4.42371509e+00, -8.64897829e-02,  1.99163392e-02,
        -1.07060028e-02],
       [ 3.46747051e+00, -5.43764931e-02, -3.01346671e-02,
        -3.01326448e-03],
       [-2.29584237e+00,  3.82205465e-02,  1.87745961e-02,
         1.46931638e-02],
       [-3.07237949e-01, -2.62397308e-02, -3.66405545e-02,
        -1.44834332e-02]])

Let's apply the labels back to our original data so we can do some analysis.

In [15]:
bread_days['labels'] = bread_clusters.fit_predict(principalDf)
bread_days.reset_index('Transaction', inplace=True)
bread_merged = pd.merge(breadbasket, bread_days[['Transaction', 'labels']], on='Transaction', how='outer')
bread_merged.head()

Unnamed: 0,Date,Time,Transaction,Item,DateTime,beverage,other,kids,snacks,bread,breakfast_pastry,dessert,condiments,breakfast,lunch,other_food,labels
0,2016-10-30,09:58:11,1,Bread,2016-10-30 09:58:11,0,0,0,0,1,0,0,0,0,0,0,3
1,2016-10-30,10:05:34,2,Scandinavian,2016-10-30 10:05:34,0,0,0,0,1,0,0,0,0,0,0,3
2,2016-10-30,10:05:34,2,Scandinavian,2016-10-30 10:05:34,0,0,0,0,1,0,0,0,0,0,0,3
3,2016-10-30,10:07:57,3,Hot chocolate,2016-10-30 10:07:57,1,0,0,0,0,0,0,0,0,0,0,3
4,2016-10-30,10:07:57,3,Jam,2016-10-30 10:07:57,0,0,0,0,0,0,0,1,0,0,0,3


Let's so some analysis on the clusters. First let's look at how many transactions we have per cluster.

In [16]:
bread_merged.labels.value_counts()

2    5296
3    4446
1    4041
0    3932
4    3578
Name: labels, dtype: int64

The largest cluster is the 5th cluster (our clusters are numbered 0 through 4).

One interesting thing to check is whether the clusters captured a different type of transaction by looking at the hour breakdown for each cluster.

In [17]:
pd.crosstab(bread_days.hour,bread_days.labels)

labels,0,1,2,3,4
hour,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,0,0,0,1,0
7,0,0,0,16,0
8,0,0,0,375,0
9,0,0,0,982,25
10,0,0,0,1004,263
11,1111,0,0,0,334
12,1080,0,0,0,267
13,0,0,1130,0,33
14,0,0,1128,0,2
15,0,924,0,0,0


We can clearly see a separation. Clusters 0, 2, and 4 center around noon. Cluster 1 is an early morning cluster. Cluster 3 is an evening cluster.

We can do the same analysis for day of week.

In [18]:
pd.crosstab(bread_group.day,bread_days.labels)

labels,0,1,2,3,4
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Friday,358,269,355,326,160
Monday,243,232,250,315,95
Saturday,441,411,506,393,224
Sunday,297,209,304,295,127
Thursday,287,194,264,387,120
Tuesday,293,223,271,300,116
Wednesday,225,213,283,329,71


In cluster 1 (the early morning cluster), the disparity between the weekends and the weekdays is small. While in the clusters that center around later times, there seem to be more transactions during the weekends.

Let's also look at the top 5 products per cluster

In [19]:
a = bread_merged.groupby(['labels']).Item.value_counts()
b = a.to_frame("counts").reset_index()
b.set_index("Item", inplace=True)
b.groupby('labels').counts.nlargest(5)

labels  Item         
0       Bread             893
        Coffee            821
        NONE              185
        Cake              164
        Pastry            164
1       Coffee            927
        Bread             560
        Tea               382
        Cake              328
        Hot chocolate     183
2       Coffee           1188
        Bread             677
        Tea               402
        Sandwich          381
        Cake              302
3       Coffee           1163
        Bread            1027
        Pastry            389
        Medialuna         260
        Tea               198
4       Coffee           1372
        Tea               290
        Hot chocolate     177
        Bread             168
        Pastry            149
Name: counts, dtype: int64

While tea and coffee are popular in all 3 clusters, the morning cluster (Cluster 1) contains only bread, beverages and breakfast pastries. Clusters 0 and 2 are afternoon cluster and contain cake and sandwiches as top items. Cluster 3 is also an afternoon cluster and contains more desserts.

We can use this data to run promotions for certain items like cake and sandwiches at certain hours to increase our sales.