# Practice Session 04: Basket analysis

Author: <font color="blue">Àlex Montoya Pérez</font>

E-mail: <font color="blue">alex.montoya01@estudiant.upf.edu</font>

Date: <font color="blue">19/10/2023</font>

# **Google Colaboratory Setup & Imports**

I developed this lab in Google Colaboratory. Because the lab reads several files stored in Google Drive, I set up the environment as follows:


1.   Importing the drive module from the google.colab package.
2.   Mounting the Google Drive at the specified path (/content/drive).
3.   Changing the current working directory to the directory that holds all the needed data (/content/drive/MyDrive/MineriaDadesMasives/Labs/).

Verify that we are in the correct directory:


4.   Printing the current working directory path using !pwd.
5.   Listing the contents of the current directory using !ls.

In [78]:
from google.colab import drive
drive.mount('/content/drive')
#Here is how to change current working directory
#By default the current working directory is /content
%cd /content/drive/MyDrive/MineriaDadesMasives/Labs/
#Print path and content of the current directory
!pwd
!ls

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/MineriaDadesMasives/Labs
/content/drive/MyDrive/MineriaDadesMasives/Labs
data				       ps04_association_rules.ipynb	ps08_data_streams.ipynb
old				       ps05_content_based_recsys.ipynb	ps09_forecasting.ipynb
ps01_02_data_preparation_242873.ipynb  ps06_item_based_recsys.ipynb	README.md
ps03_near_duplicates.ipynb	       ps07_outlier_analysis.ipynb


In [79]:
!pip install apyori



In [80]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import gzip
from apyori import apriori

# 1. Playing with apyori

In [81]:
transactions = [
    ['beer', 'chips', 'nuts', 'olives'],
    ['beer', 'chips', 'olives'],
    ['chips', 'nuts' ],
    ['chips', 'olives'],
    ['beer', 'nuts' ],
    ['chips'],
    ['nuts', 'olives'],
    ['beer', 'nuts'],
    ['beer', 'chips', 'olives'],
    ['beer', 'nuts', 'olives'],
    ['coke', 'nuts', 'olives'],
    ['beer', 'nuts', 'chips'],
    ['beer', 'nuts', 'olives', 'chips'],
    ['nuts', 'olives'],
    ['coke'],
    ['coke', 'olives'],
    ['beer', 'olives'],
    ['coke', 'nuts', 'olives'],
    ['beer', 'nuts', 'olives', 'coke'],
    ['beer', 'nuts', 'coke'],
]
results = list(apriori(transactions, min_support=0.3, min_confidence=0.75, min_lift=1.0))

In [82]:
# LEAVE AS-IS

def print_apyori_output (association_results, info=False, info_key=False):
    for relation_record in association_results:
        itemset = list(relation_record.items)

        # Consider only itemsets of two elements
        if len(itemset) > 1:

            print("Rules involving itemset %s" % itemset)
            support = relation_record.support

            for rules in relation_record.ordered_statistics:
                antecedent = list(rules.items_base)
                consequent = list(rules.items_add)

                if info_key:
                    antecedent = [info.loc[x][info_key] for x in antecedent]
                    consequent = [info.loc[x][info_key] for x in consequent]

                confidence = rules.confidence
                lift = rules.lift

                print("%s => %s (support=%.4f, confidence=%.2f, lift=%.2f)" %
                      (antecedent, consequent, support, confidence, lift))
            print()

In [83]:
#Print the results to see the rules having the desired support, confidence and lift.
print_apyori_output(results)

In [84]:
count_c_b = 0
count_c_b_o = 0
count_olives = 0

for transaction in transactions:
    #Count appearances of chips and beer together in a transaction
    if('beer' in transaction and 'chips' in transaction):
        count_c_b+=1
    #Count appearances of chips beer and olives together in a transaction
    if('beer' in transaction and 'chips' in transaction and 'olives' in transaction):
        count_c_b_o+=1
    #Count transactions that contain olives
    if('olives' in transaction):
        count_olives+=1

# Support of the antecedent {chips, beer} and of the consequent {olives}, then confidence and lift of the rule
print("support_chips_beer =", count_c_b/len(transactions))
print("support_olives =", count_olives/len(transactions))
print("confidence_chips_beer_olives =", count_c_b_o/count_c_b)
print("lift_chips_beer =", count_c_b_o/count_c_b/(count_olives/len(transactions)))

count_o_c = 0
count_o_c_b = 0
count_beer = 0
for transaction in transactions:
    #Count appearances of olives and chips together in a transaction
    if('olives' in transaction and 'chips' in transaction):
        count_o_c+=1
    #Count appearances of olives, chips and beer together in a transaction
    if('olives' in transaction and 'chips' in transaction and 'beer' in transaction):
        count_o_c_b+=1
    #Count transactions that contain beer
    if('beer' in transaction):
        count_beer+=1

# Support of the antecedent {olives, chips} and of the consequent {beer}, then confidence and lift of the rule
print("\nsupport_olives_chips =", count_o_c/len(transactions))
print("support_beer =", count_beer/len(transactions))
print("confidence_olives_chips_beer =", count_o_c_b/count_o_c)
print("lift_olives_chips =", count_o_c_b/count_o_c/(count_beer/len(transactions)))

support_chips_beer = 0.25
support_olives = 0.65
confidence_chips_beer_olives = 0.8
lift_chips_beer = 1.2307692307692308

support_olives_chips = 0.25
support_beer = 0.55
confidence_olives_chips_beer = 0.8
lift_olives_chips = 1.4545454545454546
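
The manual checks above can be generalized to any rule. The sketch below is a minimal helper, assuming transactions are lists of item names as in the toy example. The function name `rule_metrics` and the keys of the dictionary it returns are hypothetical and not part of apyori.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    # Count baskets containing the antecedent, the consequent, and both
    n_antecedent = sum(antecedent <= b for b in baskets)
    n_consequent = sum(consequent <= b for b in baskets)
    n_both = sum((antecedent | consequent) <= b for b in baskets)
    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError("antecedent and consequent must each appear at least once")
    confidence = n_both / n_antecedent
    return {
        "support_antecedent": n_antecedent / n,
        "support_consequent": n_consequent / n,
        "support_rule": n_both / n,
        "confidence": confidence,
        "lift": confidence / (n_consequent / n),
    }

# Same toy transactions as in the cell above
transactions = [
    ['beer', 'chips', 'nuts', 'olives'], ['beer', 'chips', 'olives'],
    ['chips', 'nuts'], ['chips', 'olives'], ['beer', 'nuts'], ['chips'],
    ['nuts', 'olives'], ['beer', 'nuts'], ['beer', 'chips', 'olives'],
    ['beer', 'nuts', 'olives'], ['coke', 'nuts', 'olives'],
    ['beer', 'nuts', 'chips'], ['beer', 'nuts', 'olives', 'chips'],
    ['nuts', 'olives'], ['coke'], ['coke', 'olives'], ['beer', 'olives'],
    ['coke', 'nuts', 'olives'], ['beer', 'nuts', 'olives', 'coke'],
    ['beer', 'nuts', 'coke'],
]

chips_beer_olives = rule_metrics(transactions, ['chips', 'beer'], ['olives'])
olives_chips_beer = rule_metrics(transactions, ['olives', 'chips'], ['beer'])
print(chips_beer_olives)
print(olives_chips_beer)
```

Note that the helper reports the support of the antecedent (the value printed above as `support_chips_beer`) separately from the support of the whole rule, which counts only the baskets that contain the antecedent and the consequent together.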


# 2. Load and prepare the shopping baskets

In [85]:
# LEAVE AS-IS

# File names
INPUT_PRODUCTS = "data/instacart/instacart-products.csv"
INPUT_TRANSACTIONS = "data/instacart/instacart-transactions.csv.gz"

# Read into a dataframe
products = pd.read_csv(INPUT_PRODUCTS, delimiter=",")

# Set product_id as index, and drop column aisle_id
products = products.set_index('product_id').drop(columns=['aisle_id'])

products.head(100)

product_id,product_name,department_id
1,Chocolate Sandwich Cookies,19
2,All-Seasons Salt,13
3,Robust Golden Unsweetened Oolong Tea,7
4,Smart Ones Classic Favorites Mini Rigatoni Wit...,1
5,Green Chile Anytime Sauce,13
...,...,...
96,Sprinklez Confetti Fun Organic Toppings,13
97,Organic Chamomile Lemon Tea,7
98,2% Yellow American Cheese,16
99,Local Living Butter Lettuce,4


## 2.1. Select by department

In [86]:
# LEAVE AS-IS

DEPT_BAKERY = 3
DEPT_VEGGIES = 4
DEPT_ALCOHOL = 5
DEPT_WORLD = 6
DEPT_DRINKS = 7
DEPT_PETS = 8
DEPT_PHARMACY = 11
DEPT_CLEANING = 17
DEPT_BABIES = 18

In [87]:
def select_from_departments(products, product_id, department_id):
    #Return the product ids from product_id whose department is in department_id
    product_id_belonging = []
    #Look up the department of every requested product
    department_belonging = products.loc[product_id].department_id
    for product in product_id:
        #Keep the product if its department is one of the desired departments
        if(department_belonging[product] in department_id):
            product_id_belonging.append(product)
    #If no product belongs to the desired departments, the list stays empty
    return product_id_belonging
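
The same selection can also be written without an explicit loop. The sketch below is one possible alternative, assuming a `products` frame indexed by `product_id` with a `department_id` column, as loaded above. The name `select_from_departments_vectorized` is hypothetical, and the small frame only reuses a few ids and departments that appear in the test output of this notebook.

```python
import pandas as pd

def select_from_departments_vectorized(products, product_ids, department_ids):
    """Return the ids in product_ids whose department is in department_ids."""
    # Department of each requested product, in the order given
    departments = products.loc[product_ids, "department_id"]
    # Boolean mask: True where the department is one of the desired ones
    mask = departments.isin(department_ids).to_numpy()
    return departments.index[mask].tolist()

# Tiny stand-in for the real products table (illustrative subset)
toy_products = pd.DataFrame(
    {"department_id": [11, 8, 4, 8, 17]},
    index=pd.Index([22, 26, 45, 54, 57], name="product_id"),
)

selected = select_from_departments_vectorized(toy_products, [22, 26, 45, 54, 57], [8, 17])
print(selected)
```

`Series.isin` tests all products in one call, which may matter when the function runs once per transaction over thousands of rows.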

### Test select_from_departments

In [88]:
#Test that the function behaves as expected

## TEST 1
product_id = [22, 26, 45, 54, 57, 71, 111, 112]
department_id = [DEPT_PETS, DEPT_CLEANING]
print("\nTest Case 1:\n", product_id, "\n")
print("Input products:")
for product in product_id:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")
products_selected = select_from_departments(products, product_id, department_id)
print("\nSelected products from departments", department_id, ":")
for product in products_selected:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")

## TEST 2
product_id = [1, 2, 3, 4, 5, 6, 7, 8]
department_id = [DEPT_WORLD, DEPT_BAKERY, 13]
print("\nTest Case 2:\n", product_id, "\n")
print("Input products:")
for product in product_id:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")
products_selected = select_from_departments(products, product_id, department_id)
print("\nSelected products from departments", department_id, ":")
for product in products_selected:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")

## TEST 3
product_id = [100, 200, 300, 400, 500, 600, 700, 800]
department_id = [DEPT_PETS, DEPT_CLEANING, 13]
print("\nTest Case 3:\n", product_id, "\n")
print("Input products:")
for product in product_id:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")
products_selected = select_from_departments(products, product_id, department_id)
print("\nSelected products from departments", department_id, ":")
for product in products_selected:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")




Test Case 1:
 [22, 26, 45, 54, 57, 71, 111, 112] 

Input products:
22 Fresh Breath Oral Rinse Mild Mint (dept  11 )
26 Fancy Feast Trout Feast Flaked Wet Cat Food (dept  8 )
45 European Cucumber (dept  4 )
54 24/7 Performance Cat Litter (dept  8 )
57 Flat Toothpicks (dept  17 )
71 Ultra 7 Inch Polypropylene Traditional Plates (dept  17 )
111 Fabric Softener, Geranium Scent (dept  17 )
112 Hot Tomatillo Salsa (dept  13 )

Selected products from departments [8, 17] :
26 Fancy Feast Trout Feast Flaked Wet Cat Food (dept  8 )
54 24/7 Performance Cat Litter (dept  8 )
57 Flat Toothpicks (dept  17 )
71 Ultra 7 Inch Polypropylene Traditional Plates (dept  17 )
111 Fabric Softener, Geranium Scent (dept  17 )

Test Case 2:
 [1, 2, 3, 4, 5, 6, 7, 8] 

Input products:
1 Chocolate Sandwich Cookies (dept  19 )
2 All-Seasons Salt (dept  13 )
3 Robust Golden Unsweetened Oolong Tea (dept  7 )
4 Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce (dept  1 )
5 Green Chile Anytime Sauce (d

## 2.2. Read and filter transactions

In [89]:
TRANSACTION_LIMIT = 5000

# Read the compressed transactions file and keep only the items that belong to the given departments
def keep_items_of_department(departments):
    with gzip.open(INPUT_TRANSACTIONS, "rt") as inputfile:

        # Create a CSV reader
        reader = csv.reader(inputfile, delimiter=",")
        transactions = []
        i = 0
        # Iterate through the CSV file
        for row in reader:
            # Convert to integers
            items = [int(x) for x in row]
            # Keep only the products of this row that belong to the desired departments
            temp = select_from_departments(products, items, departments)
            # If at least one product remains, store the filtered transaction and count it; stop once the counter exceeds TRANSACTION_LIMIT
            if(temp != []):
                i+=1
                transactions.append(temp)
                if(i%1000 == 0):
                    print("Reading transaction:", i)
                if(i > TRANSACTION_LIMIT):
                    break

    return transactions

## 2.3. Extract association rules and comment on them (DEPT_CLEANING)

In [90]:
#Extract rules from the filtered transactions using the chosen support, confidence, and lift thresholds.
results = list(apriori(keep_items_of_department([DEPT_CLEANING]), min_support=0.0008, min_confidence=0.2, min_lift=1.0))
print_apyori_output(results, products, 'product_name')

Reading transaction: 1000
Reading transaction: 2000
Reading transaction: 3000
Reading transaction: 4000
Reading transaction: 5000
Rules involving itemset [47865, 5047]
['Easy Open TabsBags'] => ['Quart Storage Bags'] (support=0.0012, confidence=0.23, lift=38.47)

Rules involving itemset [37357, 8021]
['Natural Laundry Detergent, Free & Clear 33'] => ['100% Recycled Paper Towels'] (support=0.0010, confidence=0.21, lift=3.84)

Rules involving itemset [31801, 21653]
['Compostable Forks'] => ['9 Inch Plates'] (support=0.0010, confidence=0.25, lift=35.72)

Rules involving itemset [41387, 21653]
['Compostable Forks'] => ['Plastic Spoons'] (support=0.0016, confidence=0.40, lift=90.93)
['Plastic Spoons'] => ['Compostable Forks'] (support=0.0016, confidence=0.36, lift=90.93)



I would recommend the two rules that link compostable forks and plastic spoons. Customers who buy compostable forks also buy plastic spoons with 40% confidence, the highest confidence in the output, and the reverse rule holds with 36% confidence. I therefore propose that the application suggest plastic spoons to customers who buy compostable forks, and vice versa. My criterion is to recommend only rules whose confidence exceeds 30% and whose lift exceeds 30. Both rules meet it, and their lift of 90.93 is the highest in the output, which makes them more compelling than any other rule.

I set the minimum confidence threshold to 0.2 and the minimum lift threshold to 1 in the code cell so that the output also shows the rules that fall short of my criteria.
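
The criteria above (confidence above 30% and lift above 30) can also be applied in code rather than by eye. The sketch below is a minimal illustration. The `Record` and `Stat` tuples are hypothetical stand-ins that expose only the attributes `print_apyori_output` reads, so the example runs without apyori, and `filter_rules` is an assumed helper name. The sample values are copied from the output above.

```python
from collections import namedtuple

# Hypothetical stand-ins exposing the attributes read by print_apyori_output
Stat = namedtuple("Stat", ["items_base", "items_add", "confidence", "lift"])
Record = namedtuple("Record", ["items", "support", "ordered_statistics"])

def filter_rules(records, min_confidence=0.3, min_lift=30.0):
    """Keep rules whose confidence and lift both exceed the thresholds."""
    kept = []
    for record in records:
        for stat in record.ordered_statistics:
            if stat.confidence > min_confidence and stat.lift > min_lift:
                kept.append((sorted(stat.items_base), sorted(stat.items_add),
                             record.support, stat.confidence, stat.lift))
    return kept

# Rules taken from the DEPT_CLEANING output above
records = [
    Record(frozenset(["Easy Open TabsBags", "Quart Storage Bags"]), 0.0012, [
        Stat(frozenset(["Easy Open TabsBags"]), frozenset(["Quart Storage Bags"]), 0.23, 38.47),
    ]),
    Record(frozenset(["Compostable Forks", "Plastic Spoons"]), 0.0016, [
        Stat(frozenset(["Compostable Forks"]), frozenset(["Plastic Spoons"]), 0.40, 90.93),
        Stat(frozenset(["Plastic Spoons"]), frozenset(["Compostable Forks"]), 0.36, 90.93),
    ]),
]

kept = filter_rules(records)
for antecedent, consequent, support, confidence, lift in kept:
    print("%s => %s (support=%.4f, confidence=%.2f, lift=%.2f)" %
          (antecedent, consequent, support, confidence, lift))
```

With real apyori results, the same function should work on the list returned by `apriori`, provided the records expose these attributes as they do in `print_apyori_output`.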

## 2.4. Extract association rules and comment on them (other departments)

In [91]:
results = list(apriori(keep_items_of_department([DEPT_BABIES, DEPT_BAKERY]), min_support=0.002, min_confidence=0.2, min_lift=1.0))
print_apyori_output(results, products, 'product_name')

Reading transaction: 1000
Reading transaction: 2000
Reading transaction: 3000
Reading transaction: 4000
Reading transaction: 5000
Rules involving itemset [3020, 34134]
['Broccoli & Apple Stage 2 Baby Food'] => ['Spinach Peas & Pear Stage 2 Baby Food'] (support=0.0024, confidence=0.34, lift=46.34)
['Spinach Peas & Pear Stage 2 Baby Food'] => ['Broccoli & Apple Stage 2 Baby Food'] (support=0.0024, confidence=0.32, lift=46.34)

Rules involving itemset [47888, 43875]
['Baby Food Stage 2 Blueberry Pear & Purple Carrot'] => ['Apple and Carrot Stage 2 Baby Food'] (support=0.0022, confidence=0.22, lift=41.58)
['Apple and Carrot Stage 2 Baby Food'] => ['Baby Food Stage 2 Blueberry Pear & Purple Carrot'] (support=0.0022, confidence=0.41, lift=41.58)




I kept items from both the babies department and the bakery department. Applying the criteria stated earlier (confidence above 30% and lift above 30), I recommend three of the four output rules:

*   If you are purchasing Broccoli & Apple Stage 2 Baby Food, I suggest also considering Spinach Peas & Pear Stage 2 Baby Food: the rule has a confidence of 34% and a lift of 46.34.

*   The reverse rule also qualifies (confidence 32%, lift 46.34): if you are purchasing Spinach Peas & Pear Stage 2 Baby Food, consider Broccoli & Apple Stage 2 Baby Food.

*   Lastly, if you are buying Apple and Carrot Stage 2 Baby Food, I recommend also considering Baby Food Stage 2 Blueberry Pear & Purple Carrot (confidence 41%, lift 41.58). The reverse rule is excluded because its confidence is only 22%.

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>