In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

breadbasket = pd.read_csv('BreadBasket_DMS.csv')
breadbasket.dtypes

Date           object
Time           object
Transaction     int64
Item           object
dtype: object

In [2]:
breadbasket.head()

Unnamed: 0,Date,Time,Transaction,Item
0,2016-10-30,09:58:11,1,Bread
1,2016-10-30,10:05:34,2,Scandinavian
2,2016-10-30,10:05:34,2,Scandinavian
3,2016-10-30,10:07:57,3,Hot chocolate
4,2016-10-30,10:07:57,3,Jam


In [3]:
breadbasket.describe()

Unnamed: 0,Transaction
count,21293.0
mean,4951.990889
std,2787.7584
min,1.0
25%,2548.0
50%,5067.0
75%,7329.0
max,9684.0


To examine the date and time, we must reformat these variables. We start by combining the date and time into one column and then converting that column to a datetime type. This allows us to extract the hour and the day of the week.

In [4]:
breadbasket['DateTime'] = pd.to_datetime(breadbasket.Date + ' ' + breadbasket.Time)

In [5]:
breadbasket.Item.unique()

array(['Bread', 'Scandinavian', 'Hot chocolate', 'Jam', 'Cookies',
       'Muffin', 'Coffee', 'Pastry', 'Medialuna', 'Tea', 'NONE',
       'Tartine', 'Basket', 'Mineral water', 'Farm House', 'Fudge',
       'Juice', "Ella's Kitchen Pouches", 'Victorian Sponge', 'Frittata',
       'Hearty & Seasonal', 'Soup', 'Pick and Mix Bowls', 'Smoothies',
       'Cake', 'Mighty Protein', 'Chicken sand', 'Coke',
       'My-5 Fruit Shoot', 'Focaccia', 'Sandwich', 'Alfajores', 'Eggs',
       'Brownie', 'Dulce de Leche', 'Honey', 'The BART', 'Granola',
       'Fairy Doors', 'Empanadas', 'Keeping It Local', 'Art Tray',
       'Bowl Nic Pitt', 'Bread Pudding', 'Adjustment', 'Truffles',
       'Chimichurri Oil', 'Bacon', 'Spread', 'Kids biscuit', 'Siblings',
       'Caramel bites', 'Jammie Dodgers', 'Tiffin', 'Olum & polenta',
       'Polenta', 'The Nomad', 'Hack the stack', 'Bakewell',
       'Lemon and coconut', 'Toast', 'Scone', 'Crepes', 'Vegan mincepie',
       'Bare Popcorn', 'Muesli', 'Crisps', 'Pi

In [6]:
breadbasket.Item.value_counts()

Coffee                           5471
Bread                            3325
Tea                              1435
Cake                             1025
Pastry                            856
NONE                              786
Sandwich                          771
Medialuna                         616
Hot chocolate                     590
Cookies                           540
Brownie                           379
Farm House                        374
Muffin                            370
Alfajores                         369
Juice                             369
Soup                              342
Scone                             327
Toast                             318
Scandinavian                      277
Truffles                          193
Coke                              185
Spanish Brunch                    172
Fudge                             159
Baguette                          152
Jam                               149
Tiffin                            146
Mineral wate

Looking at the list of unique items, we can identify a number of obvious categories. There are beverages, and breakfast pastries like muffins and medialuna. There are items for kids like juice and pouches, and non-food items like gift vouchers and t-shirts. Another noticeable group is ready-to-eat snacks like popcorn and crisps. With a bit of work, we can narrow it down from 95 products to 11 categories: beverage, other, kids, snacks, bread, breakfast pastry, dessert, condiments, breakfast, lunch, and other foods. The last group collects mostly uncommon items that sell very little (like polenta) or have names that are hard to identify (like "Hack the stack").
We define each category as a list of items and then use the lists to create dummy variables.

In [7]:
beverage = ['Hot chocolate', 'Coffee', 'Tea', 'Mineral water', 'Juice', 'Coke', 'Smoothies']
other = ['NONE', 'Christmas common', 'Gift voucher', "Valentine's card", 'Tshirt', 'Afternoon with the baker', 'Postcard', 'Siblings', 'Nomad bag', 'Adjustment', 'Drinking chocolate spoons ', 'Coffee granules ']
kids = ["Ella's Kitchen Pouches", 'My-5 Fruit Shoot', 'Kids biscuit']
snacks = ['Mighty Protein', 'Pick and Mix Bowls', 'Caramel bites', 'Bare Popcorn', 'Crisps', 'Cherry me Dried fruit', 'Raw bars']
bread = ['Bread', 'Toast', 'Baguette', 'Focaccia', 'Scandinavian']
breakfast_pastry = ['Muffin', 'Pastry', 'Medialuna', 'Scone']
dessert = ['Cookies', 'Tartine', 'Fudge', 'Victorian Sponge', 'Cake', 'Alfajores', 'Brownie', 'Bread Pudding', 'Bakewell', 'Raspberry shortbread sandwich', 'Lemon and coconut', 'Crepes', 'Chocolates', 'Truffles', 'Panatone']
condiments = ['Jam', 'Dulce de Leche', 'Honey', 'Gingerbread syrup', 'Extra Salami or Feta', 'Bacon', 'Spread', 'Chimichurri Oil']
breakfast = ['Eggs', 'Frittata', 'Granola', 'Muesli', 'Duck egg', 'Brioche and salami']
lunch = ['Soup', 'Sandwich', 'Chicken sand', 'Salad', 'Chicken Stew']
other_food = [x for x in breadbasket.Item.unique() if x not in beverage 
                and x not in other and x not in kids and x not in snacks 
                and x not in bread and x not in breakfast_pastry 
                and x not in dessert and x not in condiments 
                and x not in breakfast and x not in lunch]

breadbasket['beverage'] = np.where(breadbasket.Item.isin(beverage), 1, 0)
breadbasket['other'] = np.where(breadbasket.Item.isin(other), 1, 0)


In [8]:
breadbasket['kids'] = np.where(breadbasket.Item.isin(kids), 1, 0)
breadbasket['snacks'] = np.where(breadbasket.Item.isin(snacks), 1, 0)
breadbasket['bread'] = np.where(breadbasket.Item.isin(bread), 1, 0)
breadbasket['breakfast_pastry'] = np.where(breadbasket.Item.isin(breakfast_pastry), 1, 0)
breadbasket['dessert'] = np.where(breadbasket.Item.isin(dessert), 1, 0)
breadbasket['condiments'] = np.where(breadbasket.Item.isin(condiments), 1, 0)
breadbasket['breakfast'] = np.where(breadbasket.Item.isin(breakfast), 1, 0)
breadbasket['lunch'] = np.where(breadbasket.Item.isin(lunch), 1, 0)
breadbasket['other_food'] = np.where(breadbasket.Item.isin(other_food), 1, 0)
breadbasket.head()

Unnamed: 0,Date,Time,Transaction,Item,DateTime,beverage,other,kids,snacks,bread,breakfast_pastry,dessert,condiments,breakfast,lunch,other_food
0,2016-10-30,09:58:11,1,Bread,2016-10-30 09:58:11,0,0,0,0,1,0,0,0,0,0,0
1,2016-10-30,10:05:34,2,Scandinavian,2016-10-30 10:05:34,0,0,0,0,1,0,0,0,0,0,0
2,2016-10-30,10:05:34,2,Scandinavian,2016-10-30 10:05:34,0,0,0,0,1,0,0,0,0,0,0
3,2016-10-30,10:07:57,3,Hot chocolate,2016-10-30 10:07:57,1,0,0,0,0,0,0,0,0,0,0
4,2016-10-30,10:07:57,3,Jam,2016-10-30 10:07:57,0,0,0,0,0,0,0,1,0,0,0
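
The same dummy columns can also be built in a loop from a single mapping. The following is only a minimal sketch on toy data: the `toy` frame and the `categories` dict are hypothetical stand-ins with a handful of items, not the full lists used above.

```python
import pandas as pd

# Toy stand-in for the real data; the items and categories below are a small
# illustrative subset, not the full lists used in the notebook.
toy = pd.DataFrame({
    "Transaction": [1, 2, 2, 3, 3],
    "Item": ["Bread", "Coffee", "Muffin", "Jam", "Polenta"],
})

# Hypothetical helper dict: category name -> list of items in that category.
categories = {
    "beverage": ["Coffee", "Tea"],
    "bread": ["Bread", "Toast"],
    "breakfast_pastry": ["Muffin", "Scone"],
    "condiments": ["Jam", "Honey"],
}

# One 0/1 column per category.
for name, items in categories.items():
    toy[name] = toy["Item"].isin(items).astype(int)

# Any item not claimed by a category falls into other_food.
toy["other_food"] = (toy[list(categories)].sum(axis=1) == 0).astype(int)

print(toy)
```

Keeping the mapping in one dict means a new category only needs one new entry, and the catch-all column is derived rather than listed by hand.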


Processing the Data 
The first step in processing the data is to aggregate by transaction, which gives us the count of each category per transaction. We use the groupby function to sum the category columns within each transaction. We also group by the datetime because we want to keep this column after the aggregation. This should not be a problem as long as every row of a transaction carries the same datetime, so that adding it to the grouping does not split any transaction.

In [10]:
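#numeric_only=True keeps only the 0/1 category columns; recent pandas versions would otherwise try to "sum" the text columns too.
bread_group = breadbasket.groupby(['Transaction','DateTime']).sum(numeric_only=True)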
bread_group.head(5)

Transaction,DateTime,beverage,other,kids,snacks,bread,breakfast_pastry,dessert,condiments,breakfast,lunch,other_food
1,2016-10-30 09:58:11,0,0,0,0,1,0,0,0,0,0,0
2,2016-10-30 10:05:34,0,0,0,0,2,0,0,0,0,0,0
3,2016-10-30 10:07:57,1,0,0,0,0,0,1,1,0,0,0
4,2016-10-30 10:08:41,0,0,0,0,0,1,0,0,0,0,0
5,2016-10-30 10:13:03,1,0,0,0,1,1,0,0,0,0,0
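
Grouping by both columns is only safe if each transaction has a single timestamp. A quick way to check this is sketched below on toy data; the `toy` frame and the variable names are illustrative assumptions, and the same check could be run on `breadbasket`.

```python
import pandas as pd

# Toy stand-in: transaction 2 has two item rows with the same timestamp.
toy = pd.DataFrame({
    "Transaction": [1, 2, 2, 3],
    "DateTime": pd.to_datetime([
        "2016-10-30 09:58:11",
        "2016-10-30 10:05:34",
        "2016-10-30 10:05:34",
        "2016-10-30 10:07:57",
    ]),
    "bread": [1, 1, 1, 0],
    "beverage": [0, 0, 0, 1],
})

# Number of distinct timestamps seen in each transaction.
stamps_per_transaction = toy.groupby("Transaction")["DateTime"].nunique()
one_stamp_each = bool((stamps_per_transaction == 1).all())

# If every transaction has one timestamp, both groupings yield the same number of groups.
by_both = toy.groupby(["Transaction", "DateTime"]).sum(numeric_only=True)
by_transaction = toy.groupby("Transaction").sum(numeric_only=True)

print(one_stamp_each, len(by_both) == len(by_transaction))
```

If the check fails on real data, some transactions would be split into several rows after the aggregation.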


In [11]:
#Now the transaction number and the datetime are indices in this aggregated dataset. 
#If we would like to use the information in these columns, we would have to reset the index.
bread_group.reset_index(level=['DateTime'], inplace=True)

In [12]:
bread_group['hour'] = bread_group.DateTime.dt.hour
bread_group['day'] = bread_group.DateTime.dt.day_name()
bread_group.day.value_counts()

Saturday     2068
Friday       1488
Sunday       1264
Thursday     1252
Tuesday      1203
Monday       1135
Wednesday    1121
Name: day, dtype: int64

In [13]:
bread_group.hour.value_counts()
#11 am has the most transactions followed by noon and 10 am.

11    1445
12    1347
10    1267
13    1163
14    1130
9     1007
15     924
16     583
8      375
17     160
18      52
19      34
7       16
20      15
22       7
23       3
21       2
1        1
Name: hour, dtype: int64

In [14]:
#Now let's create dummy variables out of the day column and drop the remaining non-numeric 
#column (DateTime) to prepare our dataset for the ML algorithm.
bread_days = pd.get_dummies(data=bread_group, columns=['day'])
bread_days.drop(columns='DateTime', inplace=True)

In [15]:
bread_days.head(5)

Transaction,beverage,other,kids,snacks,bread,breakfast_pastry,dessert,condiments,breakfast,lunch,other_food,hour,day_Friday,day_Monday,day_Saturday,day_Sunday,day_Thursday,day_Tuesday,day_Wednesday
1,0,0,0,0,1,0,0,0,0,0,0,9,0,0,0,1,0,0,0
2,0,0,0,0,2,0,0,0,0,0,0,10,0,0,0,1,0,0,0
3,1,0,0,0,0,0,1,1,0,0,0,10,0,0,0,1,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,10,0,0,0,1,0,0,0
5,1,0,0,0,1,1,0,0,0,0,0,10,0,0,0,1,0,0,0


Our plan here is to use k-means clustering. However, k-means relies on Euclidean distances, so it tends to work poorly on dummy (0/1) variable columns. Therefore, we first transform the data using principal component analysis (PCA). PCA projects the data onto a lower-dimensional subspace, so the transformed data has fewer variables than the original. The first component explains the most variation in the data, and each subsequent component explains less. This transformation gives us a smaller number of continuous variables that we can cluster more effectively.

In [16]:
from sklearn.decomposition import PCA

pca = PCA(n_components=4)

principalComponents = pca.fit_transform(bread_days)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['pc1', 'pc2', 'pc3', 'pc4'])
principalDf.head()

Unnamed: 0,pc1,pc2,pc3,pc4
0,3.197413,-0.926669,0.244939,0.005987
1,2.227896,-1.274346,1.183375,-0.159225
2,2.107647,0.424883,-0.219924,0.879613
3,2.19896,-0.580674,-0.783346,-0.030998
4,2.206812,0.037385,0.44375,-0.333639
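
It can help to check how much variation each component explains, and whether one wide-ranging column (such as an hour column) dominates the first component. The sketch below uses synthetic data, not the bakery data, and adds `StandardScaler` as an optional comparison that is not part of the analysis above. On the real data, the equivalent check would be `pca.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three 0/1 columns plus an hour-like column with a much wider range.
n = 500
dummies = rng.integers(0, 2, size=(n, 3)).astype(float)
hour = rng.integers(7, 20, size=(n, 1)).astype(float)
X = np.hstack([dummies, hour])

# PCA on the raw columns versus PCA on standardized columns.
raw = PCA(n_components=2).fit(X)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("raw:   ", raw.explained_variance_ratio_)
print("scaled:", scaled.explained_variance_ratio_)
print("raw pc1 loadings:", raw.components_[0].round(3))
```

In this synthetic case the unscaled first component is almost entirely the hour-like column, because PCA follows variance and that column has by far the largest spread.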


The Algorithm - K-Means Clustering 
We are now ready to cluster the data using k-means. We chose to create 5 clusters. The number of clusters is often picked by judgment, though there are ways to optimize it, such as the elbow method or silhouette scores. Here, the choice is driven by the number of transaction segments we would like to work with: two clusters would be too few to capture meaningful differences, while 10 would be too many to interpret.

In [17]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0)  #fixed seed so the cluster numbering is repeatable across runs
bread_clusters = kmeans.fit(principalDf)
bread_clusters.cluster_centers_

array([[-1.34565874e+00, -2.66021414e-02, -4.26035315e-02,
         3.07945176e-04],
       [ 2.92578124e+00, -1.80072702e-01, -3.26662330e-02,
         1.20675708e-02],
       [ 1.09643866e+00,  1.39898044e+00,  2.68322652e-01,
        -1.09054389e-01],
       [-3.60462547e+00, -1.92471430e-02,  3.54632858e-02,
         5.46597782e-03],
       [ 6.77368177e-01, -3.51491324e-01, -6.26087189e-02,
         2.81354593e-02]])
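
One common way to compare candidate values for the number of clusters is to fit k-means over a range of k and look at the inertia (for an elbow) and the silhouette score. The sketch below runs on synthetic blobs with four known centers, which are an assumption for illustration only; on the real data `X` would be replaced by `principalDf`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative only).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_: within-cluster sum of squares; silhouette: closer to 1 is better.
    scores[k] = (model.inertia_, silhouette_score(X, model.labels_))
    print(k, round(scores[k][0], 1), round(scores[k][1], 3))

# Pick the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k][1])
print("best k by silhouette:", best_k)
```

Real transaction data is rarely this clean, so the scores are a guide rather than a rule, and the business use of the clusters still matters.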

In [18]:
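#Let's apply the labels back to our original data so we can do some analysis.
#We reuse the labels from the fit above; calling fit_predict again would refit the model and could renumber the clusters.
bread_days['labels'] = bread_clusters.labels_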
bread_days.reset_index('Transaction', inplace=True)
bread_merged = pd.merge(breadbasket, bread_days[['Transaction', 'labels']], on='Transaction', how='outer')
bread_merged.head()

Unnamed: 0,Date,Time,Transaction,Item,DateTime,beverage,other,kids,snacks,bread,breakfast_pastry,dessert,condiments,breakfast,lunch,other_food,labels
0,2016-10-30,09:58:11,1,Bread,2016-10-30 09:58:11,0,0,0,0,1,0,0,0,0,0,0,1
1,2016-10-30,10:05:34,2,Scandinavian,2016-10-30 10:05:34,0,0,0,0,1,0,0,0,0,0,0,1
2,2016-10-30,10:05:34,2,Scandinavian,2016-10-30 10:05:34,0,0,0,0,1,0,0,0,0,0,0,1
3,2016-10-30,10:07:57,3,Hot chocolate,2016-10-30 10:07:57,1,0,0,0,0,0,0,0,0,0,0,1
4,2016-10-30,10:07:57,3,Jam,2016-10-30 10:07:57,0,0,0,0,0,0,0,1,0,0,0,1


In [19]:
#Let's do some analysis on the clusters. First let's look at how many item rows fall in each cluster
#(bread_merged has one row per item sold, so these are item counts rather than transaction counts).
bread_merged.labels.value_counts()

4    5344
1    4733
2    4707
0    4041
3    2468
Name: labels, dtype: int64

Measured in item rows, the largest cluster is cluster 4 (our clusters are numbered 0 through 4).
One interesting thing to check is whether the clusters capture different types of transactions. We start by looking at the hour breakdown for each cluster.

In [20]:
pd.crosstab(bread_days.hour,bread_days.labels)

hour / labels,0,1,2,3,4
1,0,1,0,0,0
7,0,16,0,0,0
8,0,375,0,0,0
9,0,1007,0,0,0
10,0,1045,221,1,0
11,0,0,766,679,0
12,0,0,647,700,0
13,0,0,24,15,1124
14,0,0,0,0,1130
15,924,0,0,0,0


In [21]:
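#We can clearly see a separation by hour. Cluster 1 is an early morning cluster (up to 10 am).
#Clusters 2 and 3 share the late morning and noon hours (10 am to 1 pm), cluster 4 covers
#1 pm to 2 pm, and cluster 0 starts at 3 pm.
#We can do the same analysis for day of week. bread_group is still indexed by Transaction
#while bread_days now has a fresh integer index, so we pass plain arrays to compare row by row.
pd.crosstab(bread_group.day.to_numpy(), bread_days.labels.to_numpy(), rownames=['day'], colnames=['labels'])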

day / labels,0,1,2,3,4
Friday,269,338,262,242,357
Monday,232,321,191,147,244
Saturday,411,406,378,273,507
Sunday,209,305,214,199,305
Thursday,194,396,222,178,262
Tuesday,223,306,222,181,271
Wednesday,213,339,139,146,284


In cluster 1 (the early morning cluster), the differences between the days of the week are small. In the clusters that center around later times, Saturday stands out with more transactions than the other days.
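
Because the clusters differ in size, raw counts can be hard to compare across columns. One option is to normalize each cluster's column so it shows the share of each day. This is a minimal sketch on toy data; the `toy` frame is a hypothetical stand-in for the day and label columns.

```python
import pandas as pd

# Toy stand-in: one row per transaction with its day and cluster label.
toy = pd.DataFrame({
    "day": ["Saturday", "Saturday", "Monday", "Monday", "Saturday", "Monday"],
    "labels": [0, 0, 0, 1, 1, 1],
})

# normalize="columns" turns each cluster's column into shares that sum to 1.
shares = pd.crosstab(toy["day"], toy["labels"], normalize="columns")
print(shares.round(2))
```

With shares instead of counts, a day that is over-represented in one cluster is visible even when that cluster is small.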
Let's also look at the top 5 products per cluster.

In [22]:
a = bread_merged.groupby(['labels']).Item.value_counts()
b = a.to_frame("counts").reset_index()
b.set_index("Item", inplace=True)
b.groupby('labels').counts.nlargest(5)

labels  Item         
0       Coffee            927
        Bread             560
        Tea               382
        Cake              328
        Hot chocolate     183
1       Coffee           1272
        Bread            1059
        Pastry            405
        Medialuna         269
        Tea               217
2       Coffee           1855
        Tea               382
        NONE              219
        Cake              200
        Pastry            198
3       Bread             902
        Coffee            214
        Farm House        117
        Pastry            100
        NONE               97
4       Coffee           1203
        Bread             663
        Tea               409
        Sandwich          388
        Cake              306
Name: counts, dtype: int64

Coffee is among the top two items in all 5 clusters, and tea makes the top 5 in every cluster except cluster 3. The morning cluster (cluster 1) contains only bread, beverages and breakfast pastries in its top 5. Clusters 0, 2 and 4 all include cake, and cluster 4 (the early afternoon cluster) is the only one where sandwiches make the list. Cluster 3 is led by bread rather than coffee and is the only cluster with Farm House in its top 5.
We can use these patterns to time promotions, for example on cake and sandwiches in the early afternoon, to increase our sales.