# E-shop Recommender Engine

# Solution strategy framework for the data science problem

<img src="images/4Dframework.png" width="80%">

# 1- Define the problem and goal
## The problem
An e-shop wants to increase its revenue by suggesting items similar to those customers have already put in their baskets while they shop online.

## The goal
The e-shop needs a recommender engine that suggests items similar to the ones customers already have in their baskets. This raises the chance that customers select and purchase additional items from the online platform, and hence increases the revenue of the e-shop.

In [1]:
# Importing the required libraries
import pandas as pd
import numpy as np
from tqdm import tqdm

# 2- Discover the data

In [2]:
def load_data(file_path, format = 'txt'):
    if format == 'excel':
        df = pd.read_excel(file_path)
    elif format == 'csv':
        df = pd.read_csv(file_path)
    elif format == 'tsv':
        df = pd.read_csv(file_path, sep = '\t')
    elif format == 'txt':
        df = pd.read_table(file_path)
    else:
        raise ValueError('Invalid file format. The readable formats are "excel", "csv", "tsv" and "txt".')
    return df

def drop_small_orders(df, min_order_size = 15):
    ''' 
    In order to have a well-developed model, only the purchase records of the customers that have bought 
    at least min_order_size items are considered. Hence, this function filters out the smaller orders.
    '''
    return df[df.groupby('customer_ID').customer_ID.transform(len) >= min_order_size]

def test_df_prep(df, customer_ID_col, max_order_size_test = 3, num_sample_per_customer_type = 2):
    ''' 
    This function creates the test dataframe from the available dataframe. The test dataframe includes 
    * num_sample_per_customer_type (2 in this case) samples of the customers that have bought 1 item,
    * num_sample_per_customer_type (2 in this case) samples of the customers that have bought 2 items,
    * ..., and finally 
    * num_sample_per_customer_type (2 in this case) samples of the customers that have bought 
    max_order_size_test (3 in this case) items.
    '''
    df_test = pd.DataFrame()
    cols = customer_ID_col.split()
    for i in range(1, max_order_size_test + 1):
        cols.append('item ' + str(i))
    labels = [x for x in range(max_order_size_test * num_sample_per_customer_type)]
    df_test_items = pd.DataFrame(index = labels, columns = cols)
    c = 0
    for i in range(1, max_order_size_test + 1):
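        df_test_part = df[df.groupby(customer_ID_col)[customer_ID_col].transform(len) == i]
        # sort_values returns a new frame, which avoids SettingWithCopyWarning on the slice
        df_test_part = df_test_part.sort_values(customer_ID_col)
        j = 1
        k = 0
        while j < num_sample_per_customer_type + 1:
            if df_test_part.iloc[k : k + i, 3].duplicated().sum() == 0:
                # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                df_test = pd.concat([df_test, df_test_part.iloc[k: k + i, :]])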
                df_test_items.iloc[c, 0] = df_test_part.iloc[k, 0]
                for n in range(i):
                    df_test_items.iloc[c, 1 + n] = df_test_part.iloc[k + n, 3]
                c = c + 1
                j = j + 1
            k = k + i
    df_test_itemss = df_test_items.replace(np.nan,' ')
    df_test_itemss.set_index(customer_ID_col, inplace=True, drop=True)
    return df_test, df_test_itemss

def expand_item_column(df, col):
    '''
    This function one hot encodes the column 'col' of the dataframe.
    '''
    df_ohe = []
    df_ohe.append(df)
    df_ohe.append(pd.get_dummies(df[col], prefix = None, sparse = False))
    df = pd.concat(df_ohe, axis = 1)
    return df

def drop_cols(df, cols=[]):
    '''
    This function drops the columns 'cols' from the dataframe.
    '''
    df = df.drop(cols, axis = 1)
    return df

def consolidate_orders(df, col):
    '''
    This function consolidates the orders of the customers, meaning instead of having several rows that indicate
    the purchase record of a specific customer, all these rows will be consolidated into one single row.
    '''
    df = df.groupby(col).sum().reset_index()
    return df

def get_items_list(df, col):
    '''
    This function produces the list of all the items purchased by all the customers in the dataframe.
    '''
    items_list = df.columns.tolist()
    items_list.remove(col)
    return items_list

In [3]:
# The path of the input and output files
file_path = 'data/All Transations - 2 Weeks.txt'
results_path = 'data/recommendations.csv'

# Customers need at least this many purchased items to be included in the training set.
min_order_size_train = 20

# Test customers have between 1 and this many purchased items.
max_order_size_test = 3

# number of samples per every customer type (type means the number of items purchased)
num_sample_per_customer_type = 2

customer_ID_col = 'customer_ID'

## Step 2.1: Loading the data

In [4]:
df = load_data(file_path, format = 'tsv')

The provided data are the purchase history of a large number of customers of the e-shop over a period of 2 weeks.

## Step 2.2: Examining the data and getting a high-level overview

In [5]:
df.head(10)

Unnamed: 0,customer_ID,L1,L2,L3,sku,brand
0,168266,Power Tools,Power Saws and Accessories,Reciprocating Saw Blades,265105,2768
1,123986,Safety,Spill Control Supplies,Temporary Leak Repair,215839,586
2,158978,Hardware,Door Hardware,Thresholds,284756,1793
3,449035,"Electronics, Appliances, and Batteries",Batteries,Standard Batteries,12579,1231
4,781232,Motors,General Purpose AC Motors,General Purpose AC Motors,194681,2603
5,116599,Pneumatics,Pneumatic Tube Fittings,Pneumatic Push to Connect Tube Fittings,167757,3889
6,701116,Motors,General Purpose AC Motors,General Purpose AC Motors,310296,1068
7,555497,Motors,Motor Supplies,Capacitors,306732,1068
8,282317,Safety,Footwear and Footwear Accessories,Insoles,148549,2696
9,644437,Hand Tools,Sockets and Bits,Crowfoot Socket Wrenches,283869,3356


The dataframe has 6 columns as follows:
* **customer_ID:** Every customer has a unique ID but may have bought more than one item from the e-shop. Hence, there are repetitions in the customer_ID column.
* **L1:** Level 1 Product hierarchy (most broad)
* **L2:** Level 2 Product hierarchy
* **L3:** Level 3 Product hierarchy (most granular)
* **sku:** Product ID (encoded)
* **brand:** Product brand (encoded)

In [6]:
df[df[customer_ID_col] == 66563]

Unnamed: 0,customer_ID,L1,L2,L3,sku,brand
1430483,66563,Plumbing,Shut-Off Valves,Ball Valves,310757,271
1430525,66563,Safety,Gloves and Hand Protection,Inspection Gloves and Glove Liners,249781,934
1430567,66563,Cleaning,Paper Products and Dispensers,"Paper Towels, Rolls",184171,1726


Examining a sample customer shows that they have bought 3 items from the e-shop. Hence, the elements in the customer_ID column are not all unique numbers.
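
The repetition of customer IDs can also be summarised for all customers at once. The snippet below is a minimal sketch on a small hypothetical dataframe (the rows in `toy` are made up for illustration; only the column names follow the notebook). It counts the number of rows, i.e. purchased items, per customer.

```python
import pandas as pd

# Hypothetical toy transactions; only the column names match the notebook
toy = pd.DataFrame({
    'customer_ID': [1, 7, 7, 66563, 66563, 66563],
    'L3': ['Rainsuits', 'Insoles', 'Socks', 'Ball Valves', 'Trash Bags', 'Toilet Paper'],
})

# Number of rows (purchased items) per customer
items_per_customer = toy.groupby('customer_ID').size()
print(items_per_customer)
```

Applied to the real dataframe, the same `groupby(...).size()` call would give the order size of every customer, which is the quantity the training and test filters later rely on.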

## Step 2.3: Inspecting the dataset in more detail (i.e. length, columns and data types)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2107537 entries, 0 to 2107536
Data columns (total 6 columns):
customer_ID    int64
L1             object
L2             object
L3             object
sku            int64
brand          int64
dtypes: int64(3), object(3)
memory usage: 96.5+ MB


* The dataframe has 2,107,537 entries.
* All the columns are categorical, including the integer-typed ones, which hold encoded IDs rather than quantities.
* Object columns:
    * L1
    * L2
    * L3
* Integer columns (encoded IDs):
    * customer_ID
    * sku
    * brand

In [8]:
df[customer_ID_col].nunique()

801575

In [9]:
df['sku'].nunique()

275958

In [10]:
df['brand'].nunique()

4574

In [11]:
df.describe(include = ['O'])

Unnamed: 0,L1,L2,L3
count,2107537,2107537,2107537
unique,33,593,6203
top,Safety,Gloves and Hand Protection,Standard Batteries
freq,447381,137779,38793


This shows that the number of unique values in each column is far smaller than the total number of entries, which is consistent with all the columns being categorical.

### Creating the list of all the columns

In [12]:
num_cols = list(df.select_dtypes(include=[np.number]).columns)
object_cols = list(df.select_dtypes(include=['O']).columns)

cols = object_cols + num_cols

## Step 2.4: Checking for duplicates and NaN values

In [13]:
df.duplicated().sum()

52070

As expected, there are duplicates in the dataframe, since some customers have bought the same item several times. These rows are legitimate purchase records and are therefore kept.

In [14]:
df.isnull().values.any()

False

The dataframe has no NaN values, so no cleaning is needed at this stage.

Had there been any NaN values, they would have needed further investigation to decide whether the affected rows should be dropped or the missing values replaced based on similar items.
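
For illustration only, the two options mentioned above could look as follows. This is a sketch on a hypothetical dataframe with artificially injected gaps; the placeholder label `'UNKNOWN'` is an assumption and not something used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values injected for the example
toy = pd.DataFrame({
    'customer_ID': [1, 2, 3],
    'L3': ['Socks', np.nan, 'Insoles'],
    'brand': [10, 20, np.nan],
})

# Count missing values per column to see where the gaps are
missing_per_column = toy.isnull().sum()

# Option 1: drop the rows where the item category itself is missing
dropped = toy.dropna(subset=['L3'])

# Option 2: keep the rows and mark the gap with an explicit placeholder
filled = toy.fillna({'L3': 'UNKNOWN'})
```

Which option is appropriate depends on how many rows are affected and on whether a missing category can be inferred from, for example, the `sku` of the same product elsewhere in the data.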

## Step 2.5: Creating the test and training dataframes
### Step 2.5.1: Extraction of the elements of the test set from the dataframe
This is performed based on the available dataframe. The test dataframe includes:
* num_sample_per_customer_type (2 in this case) samples of customers that have bought 1 item, 
* num_sample_per_customer_type (2 in this case) samples of customers that have bought 2 items, 
* ..., and 
* num_sample_per_customer_type (2 in this case) samples of customers that have bought max_order_size_test (3 in this case) items.

In [15]:
df_test, df_test_items = test_df_prep(df, customer_ID_col, max_order_size_test, num_sample_per_customer_type)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [16]:
df_test

Unnamed: 0,customer_ID,L1,L2,L3,sku,brand
1409074,1,Safety,Footwear and Footwear Accessories,Steel-Toe Work Boots and Shoes,148301,1185
990328,3,Safety,Rainwear,Rainsuits,93710,2676
1010661,7,Safety,Footwear and Footwear Accessories,Steel-Toe PVC and Rubber Boots,155789,3854
1010619,7,Safety,Footwear and Footwear Accessories,Steel-Toe Work Boots and Shoes,132147,4746
311176,66275,Cleaning,Floor Care,Carpet & Upholstery Cleaning Chemicals,189129,4815
311134,66275,Cleaning,Cleaning Chemicals,"Etchants, Rust and Lime Removers",260363,4815
1430483,66563,Plumbing,Shut-Off Valves,Ball Valves,310757,271
1430525,66563,Safety,Gloves and Hand Protection,Inspection Gloves and Glove Liners,249781,934
1430567,66563,Cleaning,Paper Products and Dispensers,"Paper Towels, Rolls",184171,1726
371374,66783,Cleaning,Cleaning Equipment and Vacuum Cleaners,Vacuum Cleaner Accessory Kits,321985,1068


In [17]:
df_test_items

customer_ID,item 1,item 2,item 3
1,Steel-Toe Work Boots and Shoes,,
3,Rainsuits,,
7,Steel-Toe PVC and Rubber Boots,Steel-Toe Work Boots and Shoes,
66275,Carpet & Upholstery Cleaning Chemicals,"Etchants, Rust and Lime Removers",
66563,Ball Valves,Inspection Gloves and Glove Liners,"Paper Towels, Rolls"
66783,Vacuum Cleaner Accessory Kits,Parts,Vacuum Cleaner Hoses


The items listed for every customer_ID in the df_test_items are taken from the most granular level, i.e. L3, of the df_test. For example, the customer_ID 66783 has purchased 3 items as follows:
* Vacuum Cleaner Accessory Kits
* Parts
* Vacuum Cleaner Hoses

The target of the recommender engine is to suggest other items to each of these customers that are very similar to the items they already have put in their baskets.

### Step 2.5.2: Creating the training dataframe by first dropping small orders

In [18]:
df_train = drop_small_orders(df, min_order_size_train)

In [19]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 248099 entries, 58 to 2107483
Data columns (total 6 columns):
customer_ID    248099 non-null int64
L1             248099 non-null object
L2             248099 non-null object
L3             248099 non-null object
sku            248099 non-null int64
brand          248099 non-null int64
dtypes: int64(3), object(3)
memory usage: 13.2+ MB


As can be seen, the training set has 248,099 entries out of 2,107,537 entries of the main dataframe, from the customers that have bought at least min_order_size_train (20 in this case) items.
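
The filtering step can be reproduced on a small scale. The sketch below uses a hypothetical dataframe and an illustrative threshold of 2; it relies on `transform('size')`, which should behave like the `transform(len)` call in `drop_small_orders` here.

```python
import pandas as pd

# Hypothetical toy transactions: customer 1 has 1 row, customer 2 has 2, customer 3 has 3
toy = pd.DataFrame({
    'customer_ID': [1, 2, 2, 3, 3, 3],
    'L3': ['a', 'b', 'c', 'd', 'e', 'f'],
})

min_order_size = 2  # illustrative threshold, not the notebook's value

# transform('size') broadcasts each customer's row count back onto that customer's rows
order_sizes = toy.groupby('customer_ID')['customer_ID'].transform('size')

# Keep only the rows of customers whose order size reaches the threshold
kept = toy[order_sizes >= min_order_size]
```

Because the comparison is `>=`, customers with exactly `min_order_size` rows are kept, which matches the "at least" wording above.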

### Step 2.5.3: Concatenating the training set and the test set

In [20]:
df_train_test = pd.concat([df_train, df_test], ignore_index=True)

### Step 2.5.4: One hot encoding the column 'L3', the most granular column

In [21]:
df_train_test = expand_item_column(df_train_test, ['L3'])

The above two steps are performed so that the items in the test dataframe are also encoded and appear in the list of items. In other words, every item in a test basket must have a column in the encoded dataframe before the engine is asked to suggest items similar to it. The test customers' purchases are not used when computing the similarities; their items only need to be present in the list of all items.
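
The effect of encoding the concatenated dataframe rather than the two sets separately can be seen in a small sketch. The data below are hypothetical, and the test customer deliberately buys an item that no training customer has bought.

```python
import pandas as pd

# Hypothetical data: 'Rainsuits' appears only in the test set
train = pd.DataFrame({'customer_ID': [10, 10], 'L3': ['Socks', 'Insoles']})
test = pd.DataFrame({'customer_ID': [1], 'L3': ['Rainsuits']})

# Encoding the two sets separately yields different, incompatible column sets
train_only_cols = set(pd.get_dummies(train[['L3']]).columns)
test_only_cols = set(pd.get_dummies(test[['L3']]).columns)

# Encoding the concatenated frame yields one shared column set
combined = pd.concat([train, test], ignore_index=True)
encoded = pd.concat([combined[['customer_ID']], pd.get_dummies(combined[['L3']])], axis=1)

# Split back by position, as done later in the notebook
train_enc = encoded.iloc[:len(train)]
test_enc = encoded.iloc[len(train):]
```

After the split, both parts share the column `L3_Rainsuits`; in the training part it is all zeros, so the item takes no part in the similarity computation beyond its self-similarity.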

### Step 2.5.5: Dropping the unwanted columns

In [22]:
dropping_cols = [x for x in cols if x != customer_ID_col]
print(dropping_cols)

['L1', 'L2', 'L3', 'sku', 'brand']


In [23]:
df_train_test = drop_cols(df_train_test, dropping_cols)

### Step 2.5.6: Extracting the training set and test set from the concatenated and processed dataframe

In [24]:
df_train = df_train_test.iloc[0:df_train.shape[0], :]

In [25]:
df_test = df_train_test.iloc[df_train.shape[0]:, :]

### Step 2.5.7: Consolidating the training set and test set

In [26]:
df_train = consolidate_orders(df_train, customer_ID_col)

In [27]:
df_test = consolidate_orders(df_test, customer_ID_col)
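
As a small illustration of what the consolidation does, the sketch below applies the same `groupby(...).sum()` call as `consolidate_orders` to a hypothetical one-hot encoded dataframe.

```python
import pandas as pd

# Hypothetical one-hot encoded purchases: customer 7 has two rows, customer 9 has one
encoded = pd.DataFrame({
    'customer_ID': [7, 7, 9],
    'L3_Insoles': [1, 0, 0],
    'L3_Socks': [0, 1, 1],
})

# One row per customer; each item column holds how many of that customer's rows had the item
consolidated = encoded.groupby('customer_ID').sum().reset_index()
print(consolidated)
```

A customer who bought the same category several times ends up with a count above 1 in that column. This does not affect the Jaccard similarity used later, which only looks at whether an entry is nonzero.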

### Step 2.5.8: Overviewing the training set and test set
A high-level overview of the training set and test set is performed to check that the dataframes fed to the recommender engine (training set) and used to evaluate its performance (test set) have the expected shape and content.

In [28]:
df_train.head(10)

Unnamed: 0,customer_ID,L3_12 Volt Accessories,L3_12-Point Flange Head Cap Screws,L3_3-Ring Binder Accessories,L3_3-Ring Binders,L3_3.3 Inch Diameter Motors,L3_4.4 Inch Diameter Motors,L3_5 X 20mm Glass and Ceramic Fuses,L3_5S Red Tag Stations,L3_A/C Mounting Pads,...,L3_Worker Emergency Identification,L3_Worm Gear Clamps,L3_Wrap-a-Round Tape Measures,L3_Wrist Rests and Palm Supports,L3_Wrist Supports and Wraps,L3_Y Strainers,L3_Yard Hydrants,L3_Zone Valve Actuators,L3_Zone Valves,L3_pH Meters
0,66334,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,66361,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
2,66619,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,66768,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,66849,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,66883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,66916,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,67077,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8,67226,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,67250,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7453 entries, 0 to 7452
Columns: 3790 entries, customer_ID to L3_pH Meters
dtypes: int64(1), uint8(3789)
memory usage: 27.0 MB


The training set consists of 7,453 unique customers that have purchased at least min_order_size_train (20 in this case) items.

In [30]:
df_test

Unnamed: 0,customer_ID,L3_12 Volt Accessories,L3_12-Point Flange Head Cap Screws,L3_3-Ring Binder Accessories,L3_3-Ring Binders,L3_3.3 Inch Diameter Motors,L3_4.4 Inch Diameter Motors,L3_5 X 20mm Glass and Ceramic Fuses,L3_5S Red Tag Stations,L3_A/C Mounting Pads,...,L3_Worker Emergency Identification,L3_Worm Gear Clamps,L3_Wrap-a-Round Tape Measures,L3_Wrist Rests and Palm Supports,L3_Wrist Supports and Wraps,L3_Y Strainers,L3_Yard Hydrants,L3_Zone Valve Actuators,L3_Zone Valves,L3_pH Meters
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,66275,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,66563,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,66783,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The test set consists of 6 unique customers selected from the customers that have bought at most max_order_size_test (3 in this case) items.

Both training set and test set are nice and clean and can be used to develop the recommender engine and evaluate its performance.

## Step 2.6: Prepare a list of all the items 
This is a list of all the items purchased by the customers that are present in both the training set and test set.

In [31]:
items_list = get_items_list(df_train_test, customer_ID_col)

# 3- Develop the recommender engine

In [32]:
def create_similarity_df(df, items_list):
    '''
    This function creates the 'Jaccard similarity' matrix based on the description given below.
    '''
    items_df = df[items_list]
    n_items = len(items_list)
    similarity_matrix = np.eye(n_items)
    # Pre-compute, for every item, the set of row positions (customers) that bought it
    buyers = [set(np.flatnonzero(items_df.iloc[:, i].values)) for i in range(n_items)]
    progress_bar = tqdm(total = n_items, mininterval = 5)
    # Visit every item, including the last one, and fill the upper triangle only
    for i in range(n_items):
        progress_bar.update()
        nonzero_x = buyers[i]
        for j in range(i + 1, n_items):
            nonzero_y = buyers[j]
            intersection_size = len(nonzero_x.intersection(nonzero_y))
            if intersection_size > 0:
                union_size = len(nonzero_x.union(nonzero_y))
                jaccard_similarity = intersection_size / union_size
                # The matrix is symmetric, so mirror the value below the diagonal
                similarity_matrix[i, j] = jaccard_similarity
                similarity_matrix[j, i] = jaccard_similarity
    progress_bar.close()
    similarity_df = pd.DataFrame(data = similarity_matrix, index = items_list, columns = items_list)
    return similarity_df

The developed recommender engine calculates the similarity based on the **'Jaccard similarity'**, which is illustrated below:

![jaccSim](images/jaccard_similarity3.png)


In the above definition:
* A: represents the customers that have purchased item A.
* B: represents the customers that have purchased item B.
* Intersection of A and B: the customers that have purchased both items A and item B.
* Union of A and B: All the customers that have purchased either item A or item B.

Jaccard similarity is defined as the size of the intersection of A and B divided by the size of the union of A and B. It does not consider how many units of an item are purchased by a customer. It is a natural fit for this type of problem, since an item has either been purchased or not, and since the data contain no ratings of the items by the customers.

As an example: 
Item A has been purchased by customers {'Mark', 'Kat', 'Jon'}, and item B has been purchased by customers {'Kat', 'Sam'}. Only 'Kat' has purchased both items, so the intersection of A and B has 1 member. Together, items A and B have been purchased by 'Mark', 'Kat', 'Jon', and 'Sam', so the union of A and B has 4 members. Therefore, the Jaccard similarity of item A to item B is 1 / 4 = 0.25.

Hence, it can be concluded that:
* Jaccard similarity of item A to item B is equal to Jaccard similarity of item B to item A.
* Jaccard similarity of item A to item A is equal to 1.
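
The worked example and the two properties above can be checked with a few lines of plain Python. The helper `jaccard_similarity` is a name introduced here for illustration; it is not part of the notebook's code.

```python
def jaccard_similarity(buyers_a, buyers_b):
    """Jaccard similarity of two sets of customers; returns 0.0 if both sets are empty."""
    union = buyers_a | buyers_b
    if not union:
        return 0.0
    return len(buyers_a & buyers_b) / len(union)


buyers_a = {'Mark', 'Kat', 'Jon'}   # customers who purchased item A
buyers_b = {'Kat', 'Sam'}           # customers who purchased item B

print(jaccard_similarity(buyers_a, buyers_b))  # 0.25
print(jaccard_similarity(buyers_b, buyers_a))  # 0.25 (symmetric)
print(jaccard_similarity(buyers_a, buyers_a))  # 1.0
```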

### Creating the Jaccard similarity matrix

In [33]:
similarity_df = create_similarity_df(df_train, items_list)

100%|█████████████████████████████████████████████████████████████████████████████▉| 3788/3789 [13:40<00:00, 16.27it/s]
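
The pairwise loop takes several minutes for a few thousand items. As a possible alternative, and not the method used in this notebook, the same matrix can be obtained from a single matrix product. The function `jaccard_matrix` below is a hypothetical sketch, shown on toy data that reuses the Mark/Kat/Jon/Sam example plus an item `C` that nobody bought.

```python
import numpy as np
import pandas as pd


def jaccard_matrix(basket_df):
    """Pairwise Jaccard similarity between the columns of a customer-by-item count frame."""
    m = (basket_df.to_numpy() > 0).astype(np.int64)   # binarise the purchase counts
    intersection = m.T @ m                             # customers who bought both items
    item_counts = np.diag(intersection)                # customers who bought each item
    union = item_counts[:, None] + item_counts[None, :] - intersection
    sim = np.zeros(intersection.shape, dtype=float)
    # Divide only where the union is non-empty; other entries stay 0
    np.divide(intersection, union, out=sim, where=union > 0)
    # Mirror the notebook, which starts from an identity matrix
    np.fill_diagonal(sim, 1.0)
    return pd.DataFrame(sim, index=basket_df.columns, columns=basket_df.columns)


# Rows are customers, columns are items; item C was never purchased
toy = pd.DataFrame(
    {'A': [1, 1, 1, 0], 'B': [0, 1, 0, 1], 'C': [0, 0, 0, 0]},
    index=['Mark', 'Kat', 'Jon', 'Sam'],
)
sim = jaccard_matrix(toy)
print(sim)
```

The union is computed by inclusion-exclusion, |A| + |B| - |A ∩ B|, so no Python-level loop over item pairs is needed. Whether this is faster in practice on the full training set was not measured here.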

Once a set of new customers, in this case the test set, is fed to the engine, it:
* calculates the similarity of their orders with the available items based on the Jaccard similarity matrix.
* ranks the items based on the calculated similarities.
* suggests the top items similar to the items in the customers' baskets.

This has been illustrated in the next section using the test dataset.
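
The three steps can be sketched on a tiny example. All values below, including the similarity matrix and the customer IDs 101 and 102, are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

items = ['Boots', 'Socks', 'Insoles']

# Hypothetical item-item similarity matrix (symmetric, ones on the diagonal)
similarity = pd.DataFrame(
    [[1.0, 0.8, 0.5],
     [0.8, 1.0, 0.3],
     [0.5, 0.3, 1.0]],
    index=items, columns=items,
)

# One row per customer, 1 where the item is in the basket
baskets = pd.DataFrame([[1, 0, 0], [1, 0, 1]], index=[101, 102], columns=items)

# Each score is the sum of the similarities between a candidate item and the basket items
scores = baskets.dot(similarity)

# Rank the items for one customer, highest score first
ranking_101 = scores.loc[101].sort_values(ascending=False).index.tolist()
print(scores)
print(ranking_101)
```

Because the score of an item is a sum over all basket items, it can exceed 1 when the basket holds more than one item. For customer 102, for example, 'Socks' scores 0.8 + 0.3 = 1.1.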


# 4- Deploy the recommender engine

In [34]:
def create_customers_list(df, customer_ID_col, new_customers = None):
    '''
    This function creates a list of the customers of the dataframe df.
    '''
    if not new_customers:
        new_customers = df[customer_ID_col]
    return new_customers

def score_cutomers_baskets(customers_list, df, customer_ID_col, items_list, similarity_df):
    '''
    This function scores the baskets of the customers using a dot product and the similarity matrix.
    Basically, this function gives the similarity score of all the items in the e-shop compared to the items 
    currently in the basket of the customer.
    '''
    # Dot product of the customer-by-item basket matrix with the item-by-item similarity matrix
    customers_baskets_scores = df[items_list].dot(similarity_df)
    customers_baskets_scores[customer_ID_col] = customers_list
    customers_baskets_scores.set_index(customer_ID_col, inplace=True, drop=True)
    return customers_baskets_scores

def gen_recoms_df(num_top_recoms, customers_list, customers_baskets_scores):
    '''
    This function generates recommendations based on the items currently in the baskets of the customers.
    Basically, it sorts the items in the e-shop based on the scores of the customers_baskets_scores dataframe 
    from the most similar to the least similar.
    '''
    cols = ['Recom. ' + str(x) for x in range(1, num_top_recoms + 1)] + ['Score ' + str(x) for x in range(1, num_top_recoms + 1)]
    recoms_df = pd.DataFrame(index = customers_list, columns = cols)
    for customer in customers_list:
        sorted_items = customers_baskets_scores.sort_values(by = customer, ascending = False, axis = 1).loc[customer, :].index
        for i in range(num_top_recoms):
            item = sorted_items[i]
            item_col = cols[i]
            score_col = cols[i + num_top_recoms]
            recoms_df.loc[customer, item_col] = item
            recoms_df.loc[customer, score_col] = customers_baskets_scores.loc[customer, item]
    recoms_df.reset_index(inplace = True, drop = False)
    return recoms_df

In [35]:
# The number of top similar items the recommender engine suggests
num_top_recoms = 5

## Step 4.1: Creating the list of customers
This list is created using the test dataset prepared in Section 2 (Discover the data).

In [36]:
customers_list = create_customers_list(df_test, customer_ID_col)

## Step 4.2: Scoring the baskets of the customers

In [37]:
customers_baskets_scores = score_cutomers_baskets(customers_list, df_test, customer_ID_col, items_list, similarity_df)

## Step 4.3: Generating the top recommendations for the customers

In [38]:
recoms_df = gen_recoms_df(num_top_recoms, customers_list, customers_baskets_scores)

## Step 4.4: Creating the final 'result' dataframe

In [39]:
df_test_items.reset_index(drop = True, inplace = True)

In [40]:
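# The join_axes argument was removed in pandas 1.0; reindexing on the left frame's index is the replacement
result = pd.concat([df_test_items, recoms_df], axis = 1).reindex(df_test_items.index)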

In [41]:
result.set_index(customer_ID_col, inplace=True, drop=True)

## Step 4.5: Evaluating the performance of the recommender engine 

In [42]:
df_test_items

Unnamed: 0,item 1,item 2,item 3
0,Steel-Toe Work Boots and Shoes,,
1,Rainsuits,,
2,Steel-Toe PVC and Rubber Boots,Steel-Toe Work Boots and Shoes,
3,Carpet & Upholstery Cleaning Chemicals,"Etchants, Rust and Lime Removers",
4,Ball Valves,Inspection Gloves and Glove Liners,"Paper Towels, Rolls"
5,Vacuum Cleaner Accessory Kits,Parts,Vacuum Cleaner Hoses


In [43]:
recoms_df

Unnamed: 0,customer_ID,Recom. 1,Recom. 2,Recom. 3,Recom. 4,Recom. 5,Score 1,Score 2,Score 3,Score 4,Score 5
0,1,L3_Steel-Toe Work Boots and Shoes,L3_Socks,L3_Insoles,L3_UNCATEGORIZED,L3_Plain-Toe Work Boots and Shoes,1.0,0.796512,0.532258,0.0708333,0.0628571
1,3,L3_Rainsuits,L3_Rain Jackets and Coats,L3_Steel-Toe PVC and Rubber Boots,L3_PVC and Rubber Overboots and Overshoes,L3_Plain-Toe PVC and Rubber Boots,1.0,0.0927152,0.084507,0.0512821,0.0490196
2,7,L3_Steel-Toe Work Boots and Shoes,L3_Steel-Toe PVC and Rubber Boots,L3_Socks,L3_Insoles,L3_Rain Jackets and Coats,1.00615,1.00615,0.803361,0.554985,0.12844
3,66275,L3_Carpet & Upholstery Cleaning Chemicals,"L3_Etchants, Rust and Lime Removers",L3_Floor Strippers,L3_Portion Control System Chemicals,L3_Fire Resistant Treatments,1.03226,1.03226,0.2846,0.230519,0.176692
4,66563,"L3_Paper Towels, Rolls",L3_Ball Valves,L3_Inspection Gloves and Glove Liners,L3_Toilet Paper,L3_Trash Bags,1.06721,1.04829,1.03026,0.605578,0.447326
5,66783,L3_Vacuum Cleaner Hoses,L3_Vacuum Cleaner Accessory Kits,L3_Parts,L3_Office and Drafting Chairs,L3_Air Mattress Pumps,1.30559,1.30556,1.01114,0.32652,0.325


A few of the recommendations to the customers, developed by the engine (reported by 'item name, similarity'), are evaluated:
* The recommendations to the customer with the index 0, who has bought 'Steel-Toe Work Boots and Shoes', are: 'Socks, 0.80', 'Insoles, 0.53', and 'Plain-Toe Work Boots and Shoes, 0.06' (leaving out the purchased item itself and the 'UNCATEGORIZED' entry). All of these items are likely to be needed by the customer, even though the similarity of the last item is less than 0.1.
* The recommendations to the customer with the index 2, who has bought 'Steel-Toe PVC and Rubber Boots' and 'Steel-Toe Work Boots and Shoes', are: 'Socks, 0.80', 'Insoles, 0.55', and 'Rain Jackets and Coats, 0.13'. All of these are closely related to the items bought by the customer, who might need and buy them as well.
* The recommendations to the customer with the index 4, who has bought 'Ball Valves', 'Inspection Gloves and Glove Liners' and 'Paper Towels, Rolls', are: 'Toilet Paper, 0.61' and 'Trash Bags, 0.45'. These plausibly complement the paper towels already in the basket.

Based on this qualitative evaluation, the engine appears to identify items that are related to the items purchased by the customers. Hence, subject to the improvements listed below, it is a promising tool that can be deployed to help increase the revenue of the e-shop.

### Writing the recommendations to a csv file

In [44]:
result.to_csv(results_path)

## Improvements to the recommender engine
* The developed engine recommends the very items that are already in the customer's basket (they appear as the top recommendations in the results above). This can be frustrating and confusing for the customer. In later versions, these items should be filtered out.
* Currently, the recommender engine does not consider the brand of items. Some people purchase items from a brand simply because they prefer that brand, even if they do not need the item at the time of purchase. Taking this feature into account could further improve the performance of the engine and increase the revenue of the e-shop.
* Level 1 (most broad) and Level 2 of the product hierarchy have not been considered in the model. Including these levels in the recommender engine may further improve its performance.
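
One possible way to address the first improvement is to hide the scores of the items already in a basket before ranking. The sketch below uses hypothetical scores and a hypothetical customer 101; the names `masked` and `top_items` are introduced here for illustration and do not exist in the notebook.

```python
import pandas as pd

items = ['Boots', 'Socks', 'Insoles']

# Hypothetical scores and basket for a single customer who already has 'Boots'
scores = pd.DataFrame([[1.0, 0.8, 0.5]], index=[101], columns=items)
baskets = pd.DataFrame([[1, 0, 0]], index=[101], columns=items)

# Replace the scores of items already in the basket with NaN
masked = scores.mask(baskets > 0)

# Take the top-N remaining items per customer
num_top = 2
top_items = {
    customer: row.dropna().nlargest(num_top).index.tolist()
    for customer, row in masked.iterrows()
}
print(top_items)
```

With this masking in place, the purchased item no longer occupies the first recommendation slot, so all `num_top` suggestions are new to the customer.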