# Recommendation System and Customer Segmentation on E-Commerce Data

**Author**: Moch Nabil Farras Dhiya

**E-mail**: nabilfarras923@gmail.com

**Institution**: Bandung Institute of Technology

**Student ID**: 10120034


---

**About**: This side project builds a recommendation system and segments customers using e-commerce transactional data, in the following steps:

1.   Extract the data from the Kaggle website
2.   Transform the data (making sure it is usable and consistent)
3.   Build a Recommendation System model using the Apriori Algorithm
4.   Segment the customers based on their transaction history

# Import Modules

In [1]:
# Connect to local
import os

# Importing and transforming file
import pandas as pd

# Data manipulation
import numpy as np
import re # Cleaning texts
import datetime as dt # Datetime manipulation

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from libreco.data import random_split, DatasetPure
from libreco.algorithms import SVDpp
from libreco.evaluation import evaluate



# Import Data

In [2]:
data = pd.read_csv('data.csv', encoding= 'unicode_escape')

In [3]:
data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France


# Feature Engineering

Notice that the InvoiceDate column contains the date and time of the transaction. We will **split** the information into **2 different columns (Date and Time)** to make it easier to gain insight and build the model later on.

In [4]:
# Extract the time
data['InvoiceTime'] = data['InvoiceDate'].apply(lambda x: x.split()[1])

# Extract the date
data['InvoiceDate'] = data['InvoiceDate'].apply(lambda x: x.split()[0])
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], format = '%m/%d/%Y')
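As an alternative to splitting the string by hand, the date and time can both be derived from a single parsed timestamp. The following is only a minimal sketch on a toy frame (the name toy is hypothetical, the two values are copied from the preview above), and it assumes every InvoiceDate string follows the month/day/year hour:minute pattern. Notice that the time comes out zero-padded (08:26 instead of 8:26).

```python
import pandas as pd

# Toy frame mimicking the raw InvoiceDate strings (values copied from the preview above)
toy = pd.DataFrame({'InvoiceDate': ['12/1/2010 8:26', '12/9/2011 12:50']})

# Parse the full string once, then derive both columns from the timestamps
stamp = pd.to_datetime(toy['InvoiceDate'], format='%m/%d/%Y %H:%M')

toy['InvoiceTime'] = stamp.dt.strftime('%H:%M')  # time of day as text, zero-padded
toy['InvoiceDate'] = stamp.dt.normalize()        # date only (time set to midnight)

print(toy)
```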

In [5]:
data = data[['InvoiceNo', 'InvoiceDate', 'InvoiceTime', 'CustomerID', 'Country',
             'StockCode', 'Description', 'Quantity', 'UnitPrice']]

data

Unnamed: 0,InvoiceNo,InvoiceDate,InvoiceTime,CustomerID,Country,StockCode,Description,Quantity,UnitPrice
0,536365,2010-12-01,8:26,17850.0,United Kingdom,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2.55
1,536365,2010-12-01,8:26,17850.0,United Kingdom,71053,WHITE METAL LANTERN,6,3.39
2,536365,2010-12-01,8:26,17850.0,United Kingdom,84406B,CREAM CUPID HEARTS COAT HANGER,8,2.75
3,536365,2010-12-01,8:26,17850.0,United Kingdom,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,3.39
4,536365,2010-12-01,8:26,17850.0,United Kingdom,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,3.39
...,...,...,...,...,...,...,...,...,...
541904,581587,2011-12-09,12:50,12680.0,France,22613,PACK OF 20 SPACEBOY NAPKINS,12,0.85
541905,581587,2011-12-09,12:50,12680.0,France,22899,CHILDREN'S APRON DOLLY GIRL,6,2.10
541906,581587,2011-12-09,12:50,12680.0,France,23254,CHILDRENS CUTLERY DOLLY GIRL,4,4.15
541907,581587,2011-12-09,12:50,12680.0,France,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,4.15


# Initial EDA

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   InvoiceDate  541909 non-null  datetime64[ns]
 2   InvoiceTime  541909 non-null  object        
 3   CustomerID   406829 non-null  float64       
 4   Country      541909 non-null  object        
 5   StockCode    541909 non-null  object        
 6   Description  540455 non-null  object        
 7   Quantity     541909 non-null  int64         
 8   UnitPrice    541909 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 37.2+ MB


In [7]:
for col in data.columns:
    print(f'========== {col} ==========')
    display(data[col].value_counts())



573585     1114
581219      749
581492      731
580729      721
558475      705
           ... 
554023        1
554022        1
554021        1
554020        1
C558901       1
Name: InvoiceNo, Length: 25900, dtype: int64



2011-12-05    5331
2011-12-08    4940
2011-11-29    4313
2011-11-16    4195
2011-11-11    4089
              ... 
2011-03-13     537
2010-12-19     522
2011-05-01     452
2010-12-22     291
2011-02-06     279
Name: InvoiceDate, Length: 305, dtype: int64



15:56    2628
14:41    2554
15:17    2376
16:14    2372
14:09    2172
         ... 
6:14        1
6:13        1
6:12        1
20:32       1
6:21        1
Name: InvoiceTime, Length: 774, dtype: int64



17841.0    7983
14911.0    5903
14096.0    5128
12748.0    4642
14606.0    2782
           ... 
15070.0       1
15753.0       1
17065.0       1
16881.0       1
16995.0       1
Name: CustomerID, Length: 4372, dtype: int64



United Kingdom          495478
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
USA                        291
Hong Kong                  288
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58
Lebanon 



85123A    2313
22423     2203
85099B    2159
47566     1727
20725     1639
          ... 
21431        1
22275        1
17001        1
90187A       1
72759        1
Name: StockCode, Length: 4070, dtype: int64



WHITE HANGING HEART T-LIGHT HOLDER     2369
REGENCY CAKESTAND 3 TIER               2200
JUMBO BAG RED RETROSPOT                2159
PARTY BUNTING                          1727
LUNCH BAG RED RETROSPOT                1638
                                       ... 
Missing                                   1
historic computer difference?....se       1
DUSTY PINK CHRISTMAS TREE 30CM            1
WRAP BLUE RUSSIAN FOLKART                 1
PINK BERTIE MOBILE PHONE CHARM            1
Name: Description, Length: 4223, dtype: int64



 1        148227
 2         81829
 12        61063
 6         40868
 4         38484
           ...  
-472           1
-161           1
-1206          1
-272           1
-80995         1
Name: Quantity, Length: 722, dtype: int64



1.25      50496
1.65      38181
0.85      28497
2.95      27768
0.42      24533
          ...  
84.21         1
46.86         1
28.66         1
156.45        1
224.69        1
Name: UnitPrice, Length: 1630, dtype: int64

In [8]:
# Split the columns into two types: object-like (including datetime) and numeric columns
object_cols = [col for col in data.columns if data[col].dtypes in ['object', 'datetime64[ns]']]
numeric_cols = [col for col in data.columns if col not in object_cols]

## Object Columns

In [9]:
data[object_cols].describe().transpose()


Unnamed: 0,count,unique,top,freq,first,last
InvoiceNo,541909,25900,573585,1114,NaT,NaT
InvoiceDate,541909,305,2011-12-05 00:00:00,5331,2010-12-01,2011-12-09
InvoiceTime,541909,774,15:56,2628,NaT,NaT
Country,541909,38,United Kingdom,495478,NaT,NaT
StockCode,541909,4070,85123A,2313,NaT,NaT
Description,540455,4223,WHITE HANGING HEART T-LIGHT HOLDER,2369,NaT,NaT


## Numeric Columns

In [10]:
data[numeric_cols].describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CustomerID,406829.0,15287.69057,1713.600303,12346.0,13953.0,15152.0,16791.0,18287.0
Quantity,541909.0,9.55225,218.081158,-80995.0,1.0,3.0,10.0,80995.0
UnitPrice,541909.0,4.611114,96.759853,-11062.06,1.25,2.08,4.13,38970.0


# Data Cleaning & Manipulation

Notice that several entries have unusual StockCode values (for example POST, BANK CHARGES, and M) that do not look like product codes. Thus, we will drop these entries.

In [11]:
# StockCode values that do not look like product codes
unusual_codes = ['BANK CHARGES', 'C2', 'CRUK', 'D', 'DOT', 'M', 'PADS', 'POST']

data = data.loc[~data['StockCode'].isin(unusual_codes)]

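We will also drop the entries that have a negative value in any numeric column. Note that CustomerID is one of the numeric columns and a missing value (NaN) fails the "greater than or equal to 0" check, so entries without a CustomerID are dropped as well.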

In [12]:
data = data[data[numeric_cols].ge(0).all(axis = 1)].reset_index(drop = True)

data

Unnamed: 0,InvoiceNo,InvoiceDate,InvoiceTime,CustomerID,Country,StockCode,Description,Quantity,UnitPrice
0,536365,2010-12-01,8:26,17850.0,United Kingdom,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2.55
1,536365,2010-12-01,8:26,17850.0,United Kingdom,71053,WHITE METAL LANTERN,6,3.39
2,536365,2010-12-01,8:26,17850.0,United Kingdom,84406B,CREAM CUPID HEARTS COAT HANGER,8,2.75
3,536365,2010-12-01,8:26,17850.0,United Kingdom,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,3.39
4,536365,2010-12-01,8:26,17850.0,United Kingdom,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,3.39
...,...,...,...,...,...,...,...,...,...
396365,581587,2011-12-09,12:50,12680.0,France,22613,PACK OF 20 SPACEBOY NAPKINS,12,0.85
396366,581587,2011-12-09,12:50,12680.0,France,22899,CHILDREN'S APRON DOLLY GIRL,6,2.10
396367,581587,2011-12-09,12:50,12680.0,France,23254,CHILDRENS CUTLERY DOLLY GIRL,4,4.15
396368,581587,2011-12-09,12:50,12680.0,France,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,4.15
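The drop from 541,909 to 396,369 rows is larger than negative values alone would suggest. A minimal sketch on made-up rows (the frame toy is hypothetical) shows why: a missing CustomerID fails the non-negative check just like a negative Quantity does.

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one with a missing CustomerID, one with a negative Quantity
toy = pd.DataFrame({
    'CustomerID': [17850.0, np.nan, 12680.0],
    'Quantity':   [6, 2, -1],
    'UnitPrice':  [2.55, 1.25, 4.15],
})

# NaN >= 0 evaluates to False, so the second row fails the check
# even though none of its values is negative
keep = toy.ge(0).all(axis=1)
filtered = toy[keep].reset_index(drop=True)

print(keep.tolist())
print(filtered)
```

If rows without a CustomerID should be kept, the check could be restricted to Quantity and UnitPrice instead; whether that is desirable depends on the model, since the recommendation steps below index by CustomerID.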


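In addition, the same product sometimes appears more than once within a single transaction (same invoice, date, time, customer, and unit price), with equal or different quantities. We will sum the quantities of such entries into a single TotalQuantity. Since transform keeps the original number of rows, the affected entries become identical rows, which are removed later in the Duplicate Value(s) Handling section.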

In [13]:
data['TotalQuantity'] = data.groupby(['InvoiceNo', 'InvoiceDate', 'InvoiceTime',
                                      'CustomerID', 'Country', 'StockCode', 'Description', 
                                      'UnitPrice'])['Quantity'].transform('sum')

data = data[['InvoiceNo', 'InvoiceDate', 'InvoiceTime',
             'CustomerID', 'Country', 'StockCode', 'Description', 
             'TotalQuantity', 'UnitPrice']]
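To see what the aggregation does to the row count, here is a minimal sketch on made-up rows (the frame toy and its quantities are hypothetical, and only two grouping keys are used for brevity). It compares transform('sum') followed by drop_duplicates with a direct groupby sum.

```python
import pandas as pd

# Made-up rows: stock code 71270 is recorded twice on the same invoice
toy = pd.DataFrame({
    'InvoiceNo': ['536381', '536381', '536409'],
    'StockCode': ['71270', '71270', '22111'],
    'Quantity':  [1, 3, 2],
})
keys = ['InvoiceNo', 'StockCode']

# transform('sum') broadcasts the group total back to every original row
toy['TotalQuantity'] = toy.groupby(keys)['Quantity'].transform('sum')

# Without the per-row Quantity, the two 71270 rows are now identical
collapsed = toy[keys + ['TotalQuantity']].drop_duplicates().reset_index(drop=True)

# groupby(...).sum() reaches the same totals in a single step
direct = toy.groupby(keys, as_index=False)['Quantity'].sum()

print(collapsed)
print(direct)
```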

# Missing Value(s) Handling 

In [14]:
missing_pct = []

for col in data.columns:
    missing = data[col].isnull().sum()
    missing_pct.append(100 * missing/len(data))
    
missing_df = pd.DataFrame({'Column': data.columns,
                           'Missing Percentage': missing_pct})
missing_df

Unnamed: 0,Column,Missing Percentage
0,InvoiceNo,0.0
1,InvoiceDate,0.0
2,InvoiceTime,0.0
3,CustomerID,0.0
4,Country,0.0
5,StockCode,0.0
6,Description,0.0
7,TotalQuantity,0.0
8,UnitPrice,0.0


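Notice that no missing values remain in the data. The raw data did contain missing CustomerID and Description values (see the output of data.info() above), but the entries without a CustomerID were dropped by the negative value check, because a missing value fails the comparison with 0. Since the Description column is now complete as well, the entries with a missing Description were evidently removed along the way. Either way, the data is good to go.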
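The missing percentage per column can also be computed without an explicit loop. This is only a sketch on a made-up frame (toy is hypothetical); it relies on the fact that the mean of a boolean mask is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values in both columns
toy = pd.DataFrame({
    'CustomerID':  [17850.0, np.nan, 12680.0, np.nan],
    'Description': ['WHITE METAL LANTERN', None, 'WICKER STAR', 'PARTY BUNTING'],
})

# isnull() gives a True/False mask; its column-wise mean is the missing fraction
missing_pct = toy.isnull().mean() * 100

print(missing_pct)
```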

# Duplicate Value(s) Handling 

In [15]:
print(f'Duplicate entry percentage: {100 - 100 * len(data.drop_duplicates()) / len(data)} %')

Duplicate entry percentage: 2.5070010343870592 %


Since the duplicate entry percentage is not 0 %, there must be duplicate entries recorded in the dataset.

In [16]:
data[data.duplicated(keep = False)]

Unnamed: 0,InvoiceNo,InvoiceDate,InvoiceTime,CustomerID,Country,StockCode,Description,TotalQuantity,UnitPrice
112,536381,2010-12-01,9:41,15311.0,United Kingdom,71270,PHOTO CLIP LINE,4,1.25
124,536381,2010-12-01,9:41,15311.0,United Kingdom,71270,PHOTO CLIP LINE,4,1.25
472,536409,2010-12-01,11:45,17908.0,United Kingdom,90199C,5 STRAND GLASS NECKLACE CRYSTAL,6,6.35
474,536409,2010-12-01,11:45,17908.0,United Kingdom,22111,SCOTTIE DOG HOT WATER BOTTLE,2,4.95
478,536409,2010-12-01,11:45,17908.0,United Kingdom,22866,HAND WARMER SCOTTY DOG DESIGN,2,2.10
...,...,...,...,...,...,...,...,...,...
396159,581538,2011-12-09,11:34,14446.0,United Kingdom,22992,REVOLVER WOODEN RULER,2,1.95
396164,581538,2011-12-09,11:34,14446.0,United Kingdom,21194,PINK HONEYCOMB PAPER FAN,3,0.65
396165,581538,2011-12-09,11:34,14446.0,United Kingdom,35004B,SET OF 3 BLACK FLYING DUCKS,3,5.45
396166,581538,2011-12-09,11:34,14446.0,United Kingdom,22694,WICKER STAR,2,2.10


Notice that it does not make sense for the same item in the same transaction (same Invoice Date and Time) to be recorded more than once. Thus, we will simply drop the duplicate entries.

**Note**: These duplicates can result from the aggregation performed in the previous section (Data Cleaning & Manipulation), which assigns the same TotalQuantity to every repeated entry.

In [17]:
data = data.drop_duplicates().reset_index(drop = True)

In [18]:
print(f'Duplicate entry percentage: {100 - 100 * len(data.drop_duplicates()) / len(data)} %')

Duplicate entry percentage: 0.0 %


# Recommendation System

Here, we will build two models: a Customer-Based and an Item-Based Recommendation System.

## Customer-Based

For this model, we will prepare two versions of the input data, Data 1 and Data 2.

**Data 1**: Each customer's purchase count for each product

**Data 2**: Binary input indicating whether a customer has already purchased a product or not

### Data Preparation

In [19]:
customer_item_matrix = data.pivot_table(index = 'CustomerID', 
                                        columns = 'StockCode', 
                                        values = 'TotalQuantity',
                                        aggfunc = 'count')
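To make the difference between the two inputs concrete, here is a minimal sketch on made-up transactions (the frame toy and the names counts, purchase and binary are hypothetical). Notice that aggfunc = 'count' counts the number of recorded purchases, not the number of units bought, and leaves NaN where a customer never bought an item.

```python
import pandas as pd

# Made-up transactions: customer 1 bought item A twice and item B once
toy = pd.DataFrame({
    'CustomerID':    [1, 1, 1, 2],
    'StockCode':     ['A', 'A', 'B', 'B'],
    'TotalQuantity': [6, 2, 4, 12],
})

# One row per customer, one column per item, cells hold the number of purchase rows
counts = toy.pivot_table(index='CustomerID', columns='StockCode',
                         values='TotalQuantity', aggfunc='count')

purchase = counts.fillna(0)          # Data 1: purchase counts, 0 where never bought
binary = counts.notna().astype(int)  # Data 2: 1 if bought at least once, else 0

print(purchase)
print(binary)
```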

Data 1

In [20]:
# Customer-Item Matrix by Count Aggregate
# Counts are at least 1 wherever they exist, so only the missing cells need to become 0
purchase_matrix = customer_item_matrix.fillna(0)

purchase_matrix

CustomerID,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214O,90214P,90214R,90214S,90214T,90214U,90214V,90214W,90214Y,90214Z
12346.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12348.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12349.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12350.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18280.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18281.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18282.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18283.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
stock_codes = purchase_matrix.columns
temp_purchase = purchase_matrix.reset_index()

purchase_data = pd.melt(temp_purchase, id_vars = 'CustomerID', value_vars = stock_codes)

purchase_data

Unnamed: 0,CustomerID,StockCode,value
0,12346.0,10002,0.0
1,12347.0,10002,0.0
2,12348.0,10002,0.0
3,12349.0,10002,0.0
4,12350.0,10002,0.0
...,...,...,...
15861760,18280.0,90214Z,0.0
15861761,18281.0,90214Z,0.0
15861762,18282.0,90214Z,0.0
15861763,18283.0,90214Z,0.0
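Notice that the melted table has one row for every customer-item pair (about 15.9 million rows), and most of the values are 0. The sketch below, on a made-up 3 x 2 matrix (all names are hypothetical), shows how the long format grows and how the observed pairs could be kept on their own. Whether the zero rows should be dropped depends on how the model treats unobserved pairs, so this is an illustration only.

```python
import pandas as pd

# Made-up customer-item matrix: 3 customers x 2 items
matrix = pd.DataFrame(
    [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    index=pd.Index([12346.0, 12347.0, 12348.0], name='CustomerID'),
    columns=pd.Index(['10002', '10080'], name='StockCode'),
)

# Wide to long: one row per (customer, item) pair
long_format = pd.melt(matrix.reset_index(), id_vars='CustomerID',
                      value_vars=matrix.columns)

# Keep only the pairs with at least one recorded purchase
observed = long_format[long_format['value'] > 0].reset_index(drop=True)

print(len(long_format), len(observed))
```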


Data 2

In [22]:
# Customer-Item Matrix by Binary Input
# 1 where the customer bought the item at least once, 0 otherwise (missing cells count as 0)
binary_matrix = (customer_item_matrix > 0).astype(int)

binary_matrix

CustomerID,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214O,90214P,90214R,90214S,90214T,90214U,90214V,90214W,90214Y,90214Z
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12350.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18280.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18281.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18282.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18283.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
stock_codes = binary_matrix.columns
temp_binary = binary_matrix.reset_index()

binary_data = pd.melt(temp_binary, id_vars = 'CustomerID', value_vars = stock_codes)

binary_data

Unnamed: 0,CustomerID,StockCode,value
0,12346.0,10002,0
1,12347.0,10002,0
2,12348.0,10002,0
3,12349.0,10002,0
4,12350.0,10002,0
...,...,...,...
15861760,18280.0,90214Z,0
15861761,18281.0,90214Z,0
15861762,18282.0,90214Z,0
15861763,18283.0,90214Z,0


### Train Test Split

In [24]:
# LibRecommender expects the columns to be named user, item and label.
# Without the rename, the split fails with:
# AttributeError: 'DataFrame' object has no attribute 'user'
purchase_data = purchase_data.rename(columns = {'CustomerID': 'user',
                                                'StockCode': 'item',
                                                'value': 'label'})

# Split whole data into three folds for training, evaluating and testing
train_data, eval_data, test_data = random_split(purchase_data, multi_ratios=[0.8, 0.1, 0.1])

train_data, data_info = DatasetPure.build_trainset(train_data)
eval_data = DatasetPure.build_evalset(eval_data)
test_data = DatasetPure.build_testset(test_data)

data_info

In [None]:
svdpp = SVDpp(task="rating", data_info=data_info, embed_size=16, n_epochs=3, lr=0.001,
              reg=None, batch_size=256)
# monitor metrics on eval_data during training
svdpp.fit(train_data, verbose=2, eval_data=eval_data, metrics=["rmse", "mae", "r2"])

# do final evaluation on test data
print("evaluate_result: ", evaluate(model=svdpp, data=test_data,
                                    metrics=["rmse", "mae"]))
# predict preference of user 2211 to item 110
print("prediction: ", svdpp.predict(user=2211, item=110))
# recommend 7 items for user 2211
print("recommendation: ", svdpp.recommend_user(user=2211, n_rec=7))

# cold-start prediction
print("cold prediction: ", svdpp.predict(user="ccc", item="not item",
                                         cold_start="average"))
# cold-start recommendation
print("cold recommendation: ", svdpp.recommend_user(user="are we good?",
                                                    n_rec=7,
                                                    cold_start="popular"))
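The rmse and mae metrics requested in the evaluation call above can be written out by hand to make clear what they measure. This is a minimal NumPy sketch with made-up labels and predictions; it is not LibRecommender's implementation, and the function names are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Made-up purchase counts and predictions
labels = [1, 0, 2]
predictions = [1, 1, 0]

print(rmse(labels, predictions), mae(labels, predictions))
```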

### Modelling

In [None]:
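# Build the model with default parameters
# (interactions, user_features and item_features are not defined earlier in this notebook)
import tensorrec

model = tensorrec.TensorRec()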

# Fit the model for 5 epochs
model.fit(interactions, user_features, item_features, epochs=5, verbose=True)

# Predict scores and ranks for all users and all items
predictions = model.predict(user_features=user_features,
                            item_features=item_features)
predicted_ranks = model.predict_rank(user_features=user_features,
                                     item_features=item_features)

# Calculate and print the recall at 10
r_at_k = tensorrec.eval.recall_at_k(predicted_ranks, interactions, k=10)
print(np.mean(r_at_k))
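For reference, recall at k is the share of a user's relevant items that appear among the top k ranked items. The sketch below is a plain NumPy version on made-up ranks and interactions; it is not TensorRec's implementation, and the function name and array layout (users in rows, items in columns, ranks starting at 1) are assumptions.

```python
import numpy as np

def recall_at_k(ranks, interactions, k):
    """Per-user recall at k; users without any relevant item get NaN."""
    relevant = np.asarray(interactions) > 0
    in_top_k = np.asarray(ranks) <= k
    hits = (relevant & in_top_k).sum(axis=1)
    totals = relevant.sum(axis=1)
    return np.where(totals > 0, hits / np.maximum(totals, 1), np.nan)

# Made-up example: 2 users x 3 items
ranks = np.array([[1, 2, 3],
                  [3, 1, 2]])
interactions = np.array([[1, 0, 1],
                         [0, 0, 1]])

recall = recall_at_k(ranks, interactions, k=2)
print(recall, np.nanmean(recall))
```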

In [None]:
# Customer-Customer Similarity Matrix
# (missing counts are filled with 0 because cosine_similarity does not accept NaN)
customer_customer_sim_matrix = pd.DataFrame(cosine_similarity(customer_item_matrix.fillna(0)))

customer_customer_sim_matrix

In [None]:
customer_customer_sim_matrix.columns = customer_item_matrix.index
customer_customer_sim_matrix['CustomerID'] = customer_item_matrix.index
customer_customer_sim_matrix = customer_customer_sim_matrix.set_index('CustomerID')

customer_customer_sim_matrix
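The cosine similarity between two customers is the dot product of their purchase vectors divided by the product of the vector lengths, so two customers with the same purchase pattern score 1 and two customers with no item in common score 0. Here is a minimal NumPy sketch on a made-up matrix; the function name is hypothetical, and giving all-zero rows a similarity of 0 is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    unit = matrix / norms            # scale every row to length 1
    return unit @ unit.T             # dot products of unit vectors

# Made-up purchases: customers 0 and 1 bought the same items, customer 2 did not
purchases = np.array([[1, 1, 0],
                      [1, 1, 0],
                      [0, 0, 1]])

sim = cosine_similarity_rows(purchases)
print(sim.round(2))
```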

Now, we will test the model: for each customer, we look up the most similar other customer and derive up to ten item recommendations from the difference between their purchase histories.

In [None]:
# Recommendations: one row per customer, with up to ten recommended items

rank_cols = ['1st Item', '2nd Item', '3rd Item', '4th Item', '5th Item',
             '6th Item', '7th Item', '8th Item', '9th Item', '10th Item']

# One description per StockCode (the first one that appears in the data)
descriptions = data.drop_duplicates('StockCode').set_index('StockCode')['Description']

rows = []

In [None]:
for customer in customer_customer_sim_matrix.index:
    # Items bought by customer A
    purchases_A = customer_item_matrix.loc[customer]
    items_bought_by_A = purchases_A[purchases_A > 0].index

    # Items bought by customer B, the customer most similar to A (excluding A itself)
    highest_sim_customer = customer_customer_sim_matrix[customer].drop(customer) \
                           .sort_values(ascending = False).index[0]
    purchases_B = customer_item_matrix.loc[highest_sim_customer]
    items_bought_by_B = purchases_B[purchases_B > 0].index

    # Pick up to 10 items that B bought and A has not; these are recommended to A
    items_recommended_to_A = [item for item in items_bought_by_B if item not in items_bought_by_A][:10]

    # Look up the descriptions and pad with '-' when fewer than 10 items are available
    names = descriptions.reindex(items_recommended_to_A).tolist()
    rows.append([customer] + names + ['-'] * (10 - len(names)))

In [None]:
cust_recommendations_df = pd.DataFrame(rows, columns = ['CustomerID'] + rank_cols)

cust_recommendations_df

## Item-Based

In [None]:
# Item-Item Similarity Matrix
# (missing counts are filled with 0 because cosine_similarity does not accept NaN)
item_item_sim_matrix = pd.DataFrame(cosine_similarity(np.transpose(customer_item_matrix.fillna(0))))

item_item_sim_matrix

In [None]:
item_item_sim_matrix.columns = np.transpose(customer_item_matrix).index
item_item_sim_matrix['StockCode'] = np.transpose(customer_item_matrix).index
item_item_sim_matrix = item_item_sim_matrix.set_index('StockCode')

item_item_sim_matrix
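Notice that every item has a similarity of 1 with itself, so the item has to be excluded before picking its nearest neighbours. The following is a minimal sketch on a made-up purchase matrix; the helper top_similar and all other names are hypothetical, and the similarity is computed with NumPy to keep the example self-contained.

```python
import numpy as np
import pandas as pd

def top_similar(sim_matrix, key, n=10):
    """Return the n labels most similar to `key`, excluding `key` itself."""
    scores = sim_matrix.loc[key].drop(key).sort_values(ascending=False)
    return scores.index[:n].tolist()

# Made-up purchases: rows are customers, columns are items
purchases = pd.DataFrame([[1, 1, 0],
                          [1, 1, 1],
                          [0, 0, 1]], columns=['A', 'B', 'C'])

# Items become rows after transposing; scale each row to length 1
items = purchases.T.to_numpy(dtype=float)
unit = items / np.linalg.norm(items, axis=1, keepdims=True)

item_sim = pd.DataFrame(unit @ unit.T,
                        index=purchases.columns, columns=purchases.columns)

print(top_similar(item_sim, 'A', n=2))
```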

In [None]:
# Recommendations: one row per item, with up to ten similar items

rank_cols = ['1st Item', '2nd Item', '3rd Item', '4th Item', '5th Item',
             '6th Item', '7th Item', '8th Item', '9th Item', '10th Item']

rows = []

In [None]:
for item in item_item_sim_matrix.index:
    # Most similar items first, excluding the item itself
    items_recommended = item_item_sim_matrix.loc[item].drop(item) \
                        .sort_values(ascending = False).index[:10].tolist()

    # Pad with '-' when fewer than 10 items are available
    rows.append([item] + items_recommended + ['-'] * (10 - len(items_recommended)))

In [None]:
item_recommendations_df = pd.DataFrame(rows, columns = ['StockCode'] + rank_cols)

item_recommendations_df

# Customer Segmentation

# Summary & Recommendations

## Summary



## Recommendations

