![BTS](https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/img/Logo-BTS.jpg)

# Session 06: Market Basket Analysis and Recommender Systems
### Lenin Escobar <lenin.escobar@bts.tech> - Advanced Data Analysis

 <p>In this assignment you should create a system able to suggest 5 products based on a customer's purchase. Use the dataset from the "Finding Association Rules for Market Basket Analysis" example (KNIME; path: LOCAL/Example Workflows/Retail/). Use a Table Writer node to download the data (the transactions and items tables) to your local drive. You will then be able to generate association rules using the apriori algorithm.

Create a Python notebook including a method that recommends 5 products given a purchase (transaction).

You should also create a method that prints those recommendations given a set (CSV) of transactions.

The deliverable should be a Python notebook, and you should comment each step. Please explain the criteria you used for the recommendations and any other action that you could take based on the association rules.

</p> 

<p>The data files were exported using KNIME (I created a workflow):</p>

 <img src="KNIME_Workflow.png" alt="KNIME_Workflow" width="500" height="600"> 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from itertools import islice
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpmax, fpgrowth
import apyori
import csv

In [2]:
DATA_PATH = 'Data/'

In [3]:
items_file = DATA_PATH + 'Items.csv'

In [4]:
transaction_file = DATA_PATH + 'Transactions.csv'

In [5]:
items = pd.read_csv(items_file, header=0)
items.iloc[:10, :10]

Unnamed: 0,Item,Price,ProductName
0,0,2.511933,swiss cheese
1,1,52.441562,Cherry coke
2,2,28.215382,Bio Coke
3,3,3.991994,Peppers
4,4,1.042489,scrambled egg
5,5,3.900205,alkopops(rum/cherry)
6,6,1.655984,Pomegranate
7,7,4.267187,strawberries
8,8,2.729088,chachasa
9,9,0.846793,Riesling(White wine)


In [6]:
#items[items.ProductName == 'lobster']

In [7]:
transaction = pd.read_csv(transaction_file, header=0)
transaction = transaction.rename(columns={"Col0": "Transactions"})
transaction.iloc[:10, :10]

Unnamed: 0,Transactions
0,224 80 109 177 50 43 83 173 70 202 94 227 162 ...
1,56 95 106 186 103 170 69 198 186 211 83 24 78 ...
2,9 196 184 119 88 196 222 94 212 187 95 3 224 5...
3,228 9 193 127 163 117 24 34 204 163 48 74 69 2...
4,94 9 22 133 107 228 77 173 38 109 32 31 110 79...
5,13 184 209 20 229 207 32 162 3 54 163 20 17 81...
6,158 203 205 25 137 16 194 70 65 198 64 145 241...
7,167 117 187 12 235 231 128 17 84 173 87 66 36 ...
8,241 222 107 200 203 92 74 145 170 239 215 59 229
9,12 235 41 95 79 133 132 12 235 98 121 138 65 1...


In [8]:
transaction = pd.concat([transaction[['Transactions']], transaction['Transactions'].str.split(' ', expand=True)], axis=1)
transaction.head(5)

Unnamed: 0,Transactions,0,1,2,3,4,5,6,7,8,...,48,49,50,51,52,53,54,55,56,57
0,224 80 109 177 50 43 83 173 70 202 94 227 162 ...,224,80,109,177,50,43,83,173,70,...,,,,,,,,,,
1,56 95 106 186 103 170 69 198 186 211 83 24 78 ...,56,95,106,186,103,170,69,198,186,...,,,,,,,,,,
2,9 196 184 119 88 196 222 94 212 187 95 3 224 5...,9,196,184,119,88,196,222,94,212,...,,,,,,,,,,
3,228 9 193 127 163 117 24 34 204 163 48 74 69 2...,228,9,193,127,163,117,24,34,204,...,,,,,,,,,,
4,94 9 22 133 107 228 77 173 38 109 32 31 110 79...,94,9,22,133,107,228,77,173,38,...,,,,,,,,,,


In [9]:
#filter_row = items[items.Item == 241]
#cell_str = str(filter_row.Price.values[0]) + ',' + filter_row.ProductName.values[0]
#print(cell_str)

In [None]:
dataset = []
for rowIndex, row in transaction.iterrows():  # one row per transaction
    tmp_row = []
    for columnIndex, value in row.items():
        # Skip the raw 'Transactions' string; only the split item columns are used
        if columnIndex != 'Transactions':
            # Padding cells of shorter transactions are None/NaN; skip them
            if pd.notna(value):
                # Look up the product name for this item ID
                filter_row = items[items.Item == int(value)]
                value = filter_row.ProductName.values[0]
                tmp_row.append(value)
    dataset.append(tmp_row)

In [None]:
# Share of transactions that contain 'lobster' (i.e. its support)
df_tmp = pd.DataFrame(dataset)
tmp = df_tmp.apply(lambda row: row.astype(str).str.contains('lobster').any(), axis=1)
tmp.sum() / transaction.shape[0]

In [None]:
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df_encode = pd.DataFrame(te_ary, columns=te.columns_)
df_encode
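
To make the encoding step above easier to follow, here is a minimal pure-Python sketch of the same idea: one boolean column per distinct item and one row per basket. The `baskets` list is made-up toy data (not taken from the KNIME export), and this only approximates what `TransactionEncoder` produces.

```python
# Toy baskets (made-up data, for illustration only)
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter"],
]

# Column order: every distinct item, sorted alphabetically
columns = sorted({item for basket in baskets for item in basket})

# One row per basket, True where the basket contains the column's item
encoded = [[item in basket for item in columns] for basket in baskets]

print(columns)
for row in encoded:
    print(row)
```

Each row of `encoded` plays the role of one row of `df_encode`.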

<h1 style="background-color:powderblue;">First, we are going to mine frequent itemsets using the fpgrowth and apriori algorithms</h1>

In [None]:
frequent_itemsets = fpgrowth(df_encode,verbose = 0, min_support=0.1, use_colnames=True)
frequent_itemsets.head(5)

In [None]:
apriori(df_encode, min_support=0.3, use_colnames=True)
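
Frequent itemsets only tell us which items occur together often. To turn them into rules we also need confidence and lift. The following is a small sketch of how those three metrics can be computed by hand. The helper names `support` and `rule_metrics` and the toy `baskets` are my own assumptions for illustration, not part of mlxtend or apyori.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(b) for b in baskets) / len(baskets)


def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes both the antecedent and the consequent occur in at least one
    basket; otherwise a division by zero would be raised.
    """
    s_both = support(baskets, set(antecedent) | set(consequent))
    s_ante = support(baskets, antecedent)
    s_cons = support(baskets, consequent)
    confidence = s_both / s_ante   # P(consequent | antecedent)
    lift = confidence / s_cons     # > 1 means bought together more than chance
    return s_both, confidence, lift


# Toy baskets (made-up data, for illustration only)
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter"],
    ["bread", "butter"],
]

s, c, l = rule_metrics(baskets, ["bread"], ["milk"])
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

In this toy data, bread and milk appear together in 2 of 4 baskets, bread in 3 and milk in 2, so the rule bread -> milk has a lift above 1.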

<h1 style="background-color:powderblue;">Now, we are going to create a real implementation</h1>

In [None]:
#Following the Python naming conventions: https://www.python.org/dev/peps/pep-0008/#method-names-and-instance-variables

In [None]:
class Suggestion():
    """Recommendation Class, using Apriori algorithm"""
    def __init__(self, input_file, combo_discount = 10):
        self.dataset_file = input_file # CSV format
        self.combo_discount = combo_discount # Promo discount when purchasing in special combos
        self.dataset_df = pd.DataFrame() # Holds sniffed dataset as dataframe
        self.dataset_ls = [] # Holds sniffed dataset as list
        self.association_rules_raw = [] # Apriori RelationRecord
        self.association_rules = {} # Final set of rules
    def __sniff_file(self):
        """Deduce the format of a CSV file
        :return:
        """
        try:
            sniffer = csv.Sniffer()
            sample_bytes = 100
            # Read a small sample and close the file before sniffing
            with open(self.dataset_file) as sample_file:
                sample = sample_file.read(sample_bytes)
            dialect = sniffer.sniff(sample)
            #has_header = sniffer.has_header(sample)
            #print(has_header)

            if dialect.delimiter == ',':
                with open(self.dataset_file ) as fileObj:
                    transactions = list(apyori.load_transactions(fileObj, delimiter=","))

                    # remove empty strings if any
                    for li in transactions:
                        li = list(filter(None, li))
                        self.dataset_ls.append(li)
                
                #For inspection only: this dataframe is not used when generating the rules
                self.dataset_df = pd.read_csv(self.dataset_file, header = None) 
                #for i in range(0, self.dataset_df.shape[0]):
                #    self.dataset_ls.append([str(self.dataset_df.values[i,j]) for j in range(0, self.dataset_df.shape[1])])
        except csv.Error as msg:
            print(msg)
    def __generate_rules(self):
        """Generate association rules (apriori)
        :return:
        """
        # This is our business rule:
        # min_lift=1.0 keeps only rules whose lift is at least 1, i.e. items that
        # are bought together at least as often as independence would predict
        self.association_rules_raw = apyori.apriori(self.dataset_ls, min_support=0.01, min_confidence=0.01, min_lift=1.0, max_length=None)
    def __generate_business_rules(self):
        """Generate business rules based on apriori results
        :return:
        """
        for item in self.association_rules_raw:
            #association_rules items type is: "RelationRecord" which is a namedtuple
            #Its definition is the following
            #SupportRecord = namedtuple( # pylint: disable=C0103
            #    'SupportRecord', ('items', 'support'))
            #RelationRecord = namedtuple( # pylint: disable=C0103
            #    'RelationRecord', SupportRecord._fields + ('ordered_statistics',))
            #OrderedStatistic = namedtuple( # pylint: disable=C0103
            #    'OrderedStatistic', ('items_base', 'items_add', 'confidence', 'lift',))

            #Source code can be inspected at: https://github.com/ymoch/apyori/blob/master/apyori.py
            
            #We need at least 2 items
            if len(item[0]) < 2:
                continue

            for x in item[2]:

                baseItemList = list(x[0])
                # if base item set is empty then go to the next record.
                if not baseItemList:
                    continue

                # sort the baseItemList before adding it as a key to the AssociationRules
                baseItemList.sort()
                baseItemList_key = tuple(baseItemList)

                if baseItemList_key not in self.association_rules.keys():
                    self.association_rules[baseItemList_key] = []

                self.association_rules[baseItemList_key].append((list(x[1]), x[3]))

        # sort the rules in descending order of lift values.
        for ruleList in self.association_rules:
            self.association_rules[ruleList].sort(key=lambda x: x[1], reverse=True)

    def analyse_rules(self):
        """Analyse association rules (apriori)
        :return:
        """
        self.__sniff_file()
        self.__generate_rules()
        self.__generate_business_rules()

    def recommend(self, item_list, total_recommendation=5):
        """
        Return the top rules for the selected items
        :param item_list: list of selected items
        :param total_recommendation: number of recommendations (5 by default)
        :return: list of (recommended items, lift) pairs, highest lift first
        """

        # our dictionary key is a sorted tuple; sorted() leaves the caller's list untouched
        itemTuple = tuple(sorted(item_list))

        if itemTuple not in self.association_rules.keys():
            return []

        return self.association_rules[itemTuple][:total_recommendation]

    def get_promos(self, item_list, total_deals=5):
        """
        Calculate discount percentage based on lift
        discount_percentage = combo_discount * lift
        item_list is a list of items selected by user
        total_deals is total deals required.
        :return:
        """
        
        for item in self.recommend(item_list, total_deals):
            print(f'If you buy this item: {item[0]}, along with these: {item_list}, then you will get a total discount of: {round((item[1] * self.combo_discount), 2)} Monsieur!')

<h1 style="background-color:powderblue;">Finally, we are going to ask our virtual assistant "Majordome"</h1>

In [None]:
#Using a well-known dataset for testing
test_file = DATA_PATH + 'store_data.csv'
Majordome = Suggestion(input_file = test_file, combo_discount = 5)

In [None]:
Majordome.analyse_rules()

In [None]:
Majordome.dataset_df.head(3)

In [None]:
Majordome.recommend(item_list = ['cookies'], total_recommendation=3)

In [None]:
Majordome.get_promos(item_list = ['cookies'], total_deals=3)
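
`recommend` only answers when the selected items match a rule key exactly. For a whole purchase, one possible extension is to collect the rules whose base items are all contained in the basket and rank the suggested items by lift. The sketch below assumes a rules dictionary with the same shape as `Suggestion.association_rules` (a sorted tuple of base items mapped to `(added_items, lift)` pairs). The function name `recommend_for_basket`, the `toy_rules` values and the in-memory CSV are hypothetical and hand-written for illustration.

```python
import csv
import io


def recommend_for_basket(rules, basket, top_n=5):
    """Suggest up to `top_n` items that are not already in `basket`.

    `rules` maps a tuple of base items to a list of (added_items, lift) pairs.
    """
    in_basket = set(basket)
    best = {}  # candidate item -> highest lift seen so far
    for base, consequents in rules.items():
        # Only use rules whose base items were all purchased
        if not in_basket.issuperset(base):
            continue
        for added, lift in consequents:
            for item in added:
                if item not in in_basket:
                    best[item] = max(lift, best.get(item, 0.0))
    # Highest lift first; ties broken alphabetically for a stable order
    ranked = sorted(best, key=lambda item: (-best[item], item))
    return ranked[:top_n]


# Hypothetical rules, hand-written for illustration only
toy_rules = {
    ("cookies",): [(["milk"], 1.8), (["tea"], 1.2)],
    ("bread",): [(["butter"], 1.5)],
    ("bread", "cookies"): [(["jam"], 2.1), (["milk"], 1.1)],
}

# A tiny in-memory CSV standing in for a file of transactions
toy_csv = io.StringIO("cookies,bread\ntea\n")
for row in csv.reader(toy_csv):
    basket = [cell for cell in row if cell]  # drop empty cells
    print(basket, "->", recommend_for_basket(toy_rules, basket))
```

Iterating over the rules instead of over every subset of the basket keeps the cost proportional to the number of rules, which matters here because some transactions contain dozens of items.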