# Apriori

## Dataset

### Layout

* Columns:
    * Market store products, one product per column (20 columns total)
* Rows: 7,501 observations
    * Each row represents one customer transaction and lists the products purchased in it

### Background

* Business owner of market store in French countryside town
* Wants to optimize inventory and boost sales
* Wants to offer attractive new deals to customers
* Deal format is *buy this product, get that product for free*
* Hired a data scientist to identify the best association rules among the market products bought by customers

### Goals

* Build an Apriori association rule learning model that identifies the best association rules between market products, so that *buy one product, get an associated product for free* deals maximize sales potential

## Import Libraries

In [1]:
!pip3 install apyori

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try brew install
    xyz, where xyz is the package you are trying to
    install.

    If you wish to install a Python library that isn't in Homebrew,
    use a virtual environment:

    python3 -m venv path/to/venv
    source path/to/venv/bin/activate
    python3 -m pip install xyz

    If you wish to install a Python application that isn't in Homebrew,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. You can install pipx with

    brew install pipx

    You may restore the old behavior of pip by passing
    the '--break-system-packages' flag to pip, or by adding

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Data Preprocessing

* Passing `header=None` to the Pandas `read_csv` function overrides the default behavior of treating the first row as column names, so the first transaction is read as data
* The Apyori library function expects the data as a list of transactions instead of a Pandas data frame
    * Iterate through the rows of the dataset and append each transaction to the list
* The Apyori library function also expects every value in a transaction to be of data type string
* Each row has room for up to 20 products, so Pandas fills the empty cells of shorter transactions with not a number (`NaN`), and the `str` conversion turns each of these into the string `'nan'`

In [3]:
dataset = pd.read_csv('Market_Basket_Optimization.csv', header=None)
transactions = []
for i in range(0, len(dataset)):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

In [12]:
print(*transactions[:10], sep='\n')

['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil']
['burgers', 'meatballs', 'eggs', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']
['chutney', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']
['turkey', 'avocado', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']
['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']
['low fat yogurt', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'na
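
The `'nan'` padding visible in the printed transactions can be reproduced on a tiny example. The sketch below is illustrative only: the three-row CSV text and the names `toy`, `padded`, and `cleaned` are assumptions, not part of the original notebook. It also shows one possible variant that drops the missing cells instead of converting them to strings.

```python
import io

import pandas as pd

# Hypothetical three-transaction CSV; the first row sets the column count.
csv_text = "burgers,meatballs,eggs\nchutney\nturkey,avocado\n"
toy = pd.read_csv(io.StringIO(csv_text), header=None)

# Same conversion as the notebook: every cell becomes a string,
# so missing cells (NaN) turn into the literal string 'nan'.
padded = [[str(value) for value in row] for row in toy.values]

# Variant (an assumption, not the notebook's method): skip missing cells.
cleaned = [[str(value) for value in row if pd.notna(value)] for row in toy.values]

print(padded)
print(cleaned)
```

Whether to keep or drop the padding is a design choice; the notebook keeps it and relies on the thresholds passed to `apriori` to filter rules.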

## Train Apriori Model on Dataset

* The `apriori` function from the Apyori library both trains the model and returns the association rules
    * `transactions` parameter
        * Expects a list of transactions
    * `min_support` parameter
        * Sets the minimum support value
        * This value is determined by the business case
        * For the market products business case, one decides that products must appear in at least $3$ transactions per day
            * Products that appear in only 1-2 transactions per day are not considered frequent enough to build strong rules for the business case
            * Since the business case considers transactions per week, one calculates $3\ \text{transactions} \times 7\ \text{days}$ to get $21$
            * Total number of transactions is $7501$
            * Support formula is:

                $Support(T) = \frac{\text{number of transactions containing } T}{\text{total number of transactions}}$

                $T = \text{product (or set of products)}$

                $Support(T) = \frac{21}{7501} \approx 0.0028 \approx 0.003$
            * This means the products in a rule must appear together in at least $0.3\%$ of all transactions
    * `min_confidence` parameter
        * Sets the minimum confidence value
        * Use rule of thumb
            * Start with $0.8$, which is a default confidence value for a similar association rule learning function in R
            * If no rules returned, divide by 2 to get $0.4$
            * If no rules returned, divide by 2 to get $0.2$
        * With $0.2$, the second product must appear in at least $20\%$ of the transactions that contain the first product
    * `min_lift` parameter
        * Sets the minimum lift value
        * Lift measures quality or relevance of a rule
        * Use rule of thumb
            * Start with $3$
            * Increase by values of $3$ as needed
            * Under this rule of thumb, rules with a lift below $3$ are treated as not relevant enough
    * `min_length` parameter
        * Sets the minimum number of products in a rule
        * This value is determined by the business case
        * For market business case, this will be $2$
    * `max_length` parameter
        * Sets the maximum number of products in a rule
        * This value is determined by the business case
        * For market business case, this will be $2$
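
The three thresholds above can be made concrete with a from-scratch calculation. This is a minimal sketch on made-up data: `toy_transactions` and the helper names `support`, `confidence`, and `lift` are illustrative assumptions and are not part of the Apyori API. The formulas are the standard definitions: confidence is the support of both products divided by the support of the first, and lift is confidence divided by the support of the second.

```python
# Hypothetical transactions, each stored as a set of products.
toy_transactions = [
    {"pasta", "shrimp"},
    {"pasta", "escalope"},
    {"pasta", "shrimp", "olive oil"},
    {"olive oil"},
    {"shrimp"},
    {"escalope", "olive oil"},
    {"mineral water"},
    {"mineral water", "olive oil"},
]


def support(itemset):
    """Fraction of transactions that contain every product in itemset."""
    hits = sum(itemset <= transaction for transaction in toy_transactions)
    return hits / len(toy_transactions)


def confidence(lhs, rhs):
    """Share of transactions containing lhs that also contain rhs."""
    return support(lhs | rhs) / support(lhs)


def lift(lhs, rhs):
    """Confidence of lhs -> rhs relative to the overall support of rhs."""
    return confidence(lhs, rhs) / support(rhs)


# Minimum support from the business case: 3 per day * 7 days out of 7501.
min_support = (3 * 7) / 7501

print(support({"pasta", "shrimp"}))       # 2 of 8 transactions
print(confidence({"pasta"}, {"shrimp"}))  # 2 of the 3 pasta transactions
print(lift({"pasta"}, {"shrimp"}))
print(round(min_support, 3))
```

A lift above $1$ means the second product is bought more often alongside the first than it is overall; the `min_lift` threshold keeps only rules where that effect is strong.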

In [5]:
from apyori import apriori

rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2,
                max_length=2)

## Visualize Results

### Put Rules Results into List

* Put rules results into a list for display purposes via the `list` function

In [6]:
results = list(rules)

### Display First Results Coming Directly from Output of Apriori Function

In [7]:
results

[RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
 RelationRecord(items=frozenset({'pasta', 'escalope'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]),
 RelationRecord(items=frozenset({'fromage blanc', 'honey'}), support=0.003332888948140248, ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confidence=0

### Rules Results Analysis

In rules results output, analyzing the first rule yields the following:

* The first product is **light cream**, denoted by `items_base=frozenset({'light cream'})`
* The second product is **chicken**, denoted by `items_add=frozenset({'chicken'})`
* The rule is: *If a customer buys light cream, the customer also has a high chance of buying chicken*
* That chance is measured by confidence, `confidence=0.29`
* This means customers who buy light cream have a $29\%$ chance of also buying chicken
* Lift is `lift=4.84`, meaning customers who buy light cream are about $4.84$ times as likely to buy chicken as customers overall; every returned rule has a lift of at least $3$ because of `min_lift=3`
* Support is `support=0.0045`, meaning light cream and chicken appear together in $0.45\%$ of transactions
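
As a sanity check, the three statistics of the first rule can be converted back into transaction counts. The sketch below uses only the values printed in the output and the total of $7501$ transactions; the variable names are illustrative, and the rounding assumes the underlying counts are whole numbers.

```python
TOTAL_TRANSACTIONS = 7501

# Values copied from the first RelationRecord in the output.
support_pair = 0.004532728969470737
confidence = 0.29059829059829057
lift = 4.84395061728395

# support = pair_count / total, so pair_count = support * total.
pair_count = round(support_pair * TOTAL_TRANSACTIONS)

# confidence = pair_count / light_cream_count.
light_cream_count = round(pair_count / confidence)

# lift = confidence / support(chicken).
chicken_support = confidence / lift
chicken_count = round(chicken_support * TOTAL_TRANSACTIONS)

print(pair_count)         # 34 transactions contain both products
print(light_cream_count)  # 117 transactions contain light cream
print(chicken_count)      # 450 transactions contain chicken
```

Read this way, the rule says that 34 of the 117 light cream transactions also include chicken, while chicken appears in only about $6\%$ of all transactions.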

### Custom Function for Organizing Rules into Data Frame

* `inspect` is a custom function that organizes the rules into rows that can be loaded into a Pandas data frame
* Once the rules are in a data frame, one can sort them by any column in descending order
* `lhs` list
    * Holds the left-hand side product of each rule
* `rhs` list
    * Holds the right-hand side product of each rule
* `supports` list
    * Holds the support of each rule
* `confidences` list
    * Holds the confidence of each rule
* `lifts` list
    * Holds the lift of each rule
* Returns the five lists zipped together as a list of tuples, one tuple per rule

In [8]:
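def inspect(results):
    # Each result is a RelationRecord: (items, support, ordered_statistics)
    # result[2][0] is its first OrderedStatistic: (items_base, items_add, confidence, lift)
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    # One (lhs, rhs, support, confidence, lift) tuple per rule
    return list(zip(lhs, rhs, supports, confidences, lifts))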
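
The positional indexing in `inspect` can be hard to read. The sketch below shows the same extraction using the field names that appear in the printed output (`support`, `ordered_statistics`, `items_base`, `items_add`, `confidence`, `lift`). To stay self-contained it defines stand-in named tuples rather than importing Apyori, and `inspect_by_name` and `sample` are hypothetical names introduced here for illustration.

```python
from collections import namedtuple

# Stand-ins mirroring the field names shown in the printed output.
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)


def inspect_by_name(results):
    """Return one (lhs, rhs, support, confidence, lift) tuple per record."""
    rows = []
    for record in results:
        # Like the original function, only the first ordered statistic is used.
        stat = record.ordered_statistics[0]
        rows.append((
            next(iter(stat.items_base)),  # single left-hand side product
            next(iter(stat.items_add)),   # single right-hand side product
            record.support,
            stat.confidence,
            stat.lift,
        ))
    return rows


# First rule from the output above, rebuilt with the stand-in types.
sample = [
    RelationRecord(
        items=frozenset({"light cream", "chicken"}),
        support=0.004532728969470737,
        ordered_statistics=[
            OrderedStatistic(
                items_base=frozenset({"light cream"}),
                items_add=frozenset({"chicken"}),
                confidence=0.29059829059829057,
                lift=4.84395061728395,
            )
        ],
    )
]

rows = inspect_by_name(sample)
print(rows)
```

Both versions assume each side of a rule holds exactly one product, which matches rules limited to two products.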

## Put Results into Well Organized Pandas Data Frame

* Creates a Pandas data frame from the output of the `inspect` function with the specified column names

In [9]:
results_data_frame = pd.DataFrame(inspect(results),
                                  columns=['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

### Display Results Unsorted

In [10]:
results_data_frame

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
8,pasta,shrimp,0.005066,0.322034,4.506672


### Display Results Sorted by Descending Lifts

* `nlargest` returns up to `n` rows with the largest values in the given column, sorted in descending order

In [11]:
results_data_frame.nlargest(n=10, columns=['Lift'])

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
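
For comparison, the same top-by-lift selection can be written with `sort_values` followed by `head`. The sketch below rebuilds a four-row subset of the table by hand; `rules_table`, `top_by_nlargest`, and `top_by_sort` are illustrative names, and the lift values are copied from the output above.

```python
import pandas as pd

# Four rules copied from the results table (support and confidence omitted).
rules_table = pd.DataFrame(
    [
        ("light cream", "chicken", 4.843951),
        ("mushroom cream sauce", "escalope", 3.790833),
        ("pasta", "escalope", 4.700812),
        ("fromage blanc", "honey", 5.164271),
    ],
    columns=["Left Hand Side", "Right Hand Side", "Lift"],
)

# Two ways to get the two rules with the highest lift.
top_by_nlargest = rules_table.nlargest(n=2, columns="Lift")
top_by_sort = rules_table.sort_values("Lift", ascending=False).head(2)

print(top_by_nlargest)
print(top_by_nlargest.equals(top_by_sort))
```

Both calls keep the original row index, which is why the sorted output above starts with row 3.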


## Analyzing Results

* The rule with first product **fromage blanc** and second product **honey** has the highest lift ($5.16$), making it the strongest rule
* The business owner could offer a deal where customers who buy fromage blanc get honey for free