# Assignment 3: Association Analysis

To read Excel files, you might need to install the `xlrd` package, using something like:

Select the conda environment you use for this module (skip this step if you have not created a separate environment for this module):

    conda activate myEnvironment  # where myEnvironment is the conda environment you use for this module

then install as usual

    conda install xlrd

Note:

 * To run this command from within a notebook, prefix the command with `!`.
 * You will also need the package `mlxtend` which you installed as part of the Week 10 - ARM practical.

In [1]:
conda install xlrd

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/wazby/opt/anaconda3

  added / updated specs:
    - xlrd


The following packages will be UPDATED:

  ca-certificates    anaconda::ca-certificates-2020.10.14-0 --> conda-forge::ca-certificates-2020.12.5-h033912b_0
  certifi                anaconda::certifi-2020.6.20-py38_0 --> conda-forge::certifi-2020.12.5-py38h50d1736_1


Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.


In [2]:
# ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.
# Got this error when running next step - trying to run openpyxl
# https://anaconda.org/anaconda/openpyxl
# https://stackoverflow.com/questions/65250207/pandas-cannot-open-an-excel-xlsx-file

!conda install -c anaconda openpyxl -y

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/wazby/opt/anaconda3

  added / updated specs:
    - openpyxl


The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2020.12.~ --> anaconda::ca-certificates-2020.10.14-0
  certifi            conda-forge::certifi-2020.12.5-py38h5~ --> anaconda::certifi-2020.6.20-py38_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


---

You may find the following useful to obtain the data from the UCI data repository, and to read it into a dataframe.

In [3]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

import requests, os
csvUrl = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/groceries.csv"
csvFile = 'data/groceries.csv'
xlUrl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
xlFile = 'data/Online Retail.xlsx'
dataFile = xlFile
url = xlUrl
if not os.path.exists('data'):
    os.makedirs('data')
if not os.path.isfile(dataFile):
    r = requests.get(url)
    with open(dataFile, 'wb') as f:
        f.write(r.content)
if (dataFile == xlFile):
    df = pd.read_excel(dataFile, engine='openpyxl')    #https://stackoverflow.com/questions/65250207/pandas-cannot-open-an-excel-xlsx-file
else:
    df = pd.read_csv(dataFile)
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


The following lines tidy up the description column, ensure that every row is assigned an invoice number, and that they represent actual transactions.

In [4]:
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]
df.shape

(532621, 8)

In [5]:
# Get the unique list of countries
print(df['Country'].unique())

['United Kingdom' 'France' 'Australia' 'Netherlands' 'Germany' 'Norway'
 'EIRE' 'Switzerland' 'Spain' 'Poland' 'Portugal' 'Italy' 'Belgium'
 'Lithuania' 'Japan' 'Iceland' 'Channel Islands' 'Denmark' 'Cyprus'
 'Sweden' 'Finland' 'Austria' 'Bahrain' 'Israel' 'Greece' 'Hong Kong'
 'Singapore' 'Lebanon' 'United Arab Emirates' 'Saudi Arabia'
 'Czech Republic' 'Canada' 'Unspecified' 'Brazil' 'USA'
 'European Community' 'Malta' 'RSA']


__Task 1.1__: Select the transactions arising from the `Country` having _9042_ records in the dataframe and convert them to the OneHotEncoded form, where each column has (0,1) values representing the (absence,presence) of that product in a given basket, where each basket (row) is labeled by its `InvoiceNo`.
Use mlxtend's `apriori` function to find the frequent itemsets where the minimum support threshold is set to 0.02. You should check the number of frequent itemsets &mdash; you should find there are 528. 

Hints
1. Use `groupby` and `size()` to determine the number of rows per `Country`.
2. Use `groupby` and `sum()` on the `Quantity` to encode as 0 and positive integer, and `reset_index()` so that the rows are labeled by `InvoiceNo`. Remember to set any positive numbers to 1 rather than a frequency count.
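
The two hints can be sketched on a tiny made-up frame. This is only an illustration: the `toy` data, the target row count of 4, and the use of `unstack(fill_value=0)` to pivot descriptions into columns are assumptions, not part of the assignment.

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the Online Retail sheet.
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A2", "A3"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "BOWL"],
    "Quantity": [6, 2, 1, 3, 12],
    "Country": ["Germany", "Germany", "Germany", "Germany", "France"],
})

# Hint 1: rows per country, then pick the country with the target row count.
rows_per_country = toy.groupby("Country").size()
target = rows_per_country[rows_per_country == 4].index[0]

# Hint 2: sum quantities per (invoice, product) and pivot products into columns.
basket = (
    toy[toy["Country"] == target]
    .groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()
    .unstack(fill_value=0)
)

# Any positive quantity becomes 1, so each cell records presence, not a count.
onehot = (basket > 0).astype(int)
print(onehot)
```

Each row of `onehot` is one basket labeled by its `InvoiceNo`, which is the shape `apriori` expects.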

In [6]:
## BEGIN YOUR ANSWER HERE

In [7]:
# Attempt 1
# df.groupby('Country').size()

# Attempt 2
# df['Country'].value_counts()[df['Country'].value_counts()==9042]

# Finally used this method as it returns an array with the values
print(df.groupby('Country').filter(lambda x: len(x) == 9042)['Country'].unique().item(0))


Germany


In [8]:
# Filter so that only German transactions are in the data frame
country = df.groupby('Country').filter(lambda x: len(x) == 9042)['Country'].unique().item(0)

df = df.loc[df['Country'] == country]

# Note: Setting InvoiceNo as index causes me a disadvantage in EDA later to identify abnormal products
# No real benefit to setting it as index so commenting out

# df.set_index(['InvoiceNo'], inplace = True)

df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1109,536527,22809,SET OF 6 T-LIGHTS SANTA,6,2010-12-01 13:04:00,2.95,12662.0,Germany
1110,536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,2010-12-01 13:04:00,2.55,12662.0,Germany
1111,536527,84945,MULTI COLOUR SILVER T-LIGHT HOLDER,12,2010-12-01 13:04:00,0.85,12662.0,Germany
1112,536527,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,2010-12-01 13:04:00,1.65,12662.0,Germany
1113,536527,22244,3 HOOK HANGER MAGIC GARDEN,12,2010-12-01 13:04:00,1.95,12662.0,Germany


#### Quick EDA

In [9]:
# Number of invoices in the data frame 

df['InvoiceNo'].nunique()

457

In [10]:
# Number of stock items in the data frame

df['StockCode'].nunique()

1665

In [11]:
# Check for Abnormalities

df.groupby('StockCode').InvoiceNo.nunique()

# POST and M look strange

StockCode
10002       1
10125       6
10135       1
11001       2
15034       1
         ... 
90201C      1
90201D      1
90202D      1
M           7
POST      374
Name: InvoiceNo, Length: 1665, dtype: int64

In [12]:
df.loc[df['StockCode'] == 'M']

# M is for Manual as in documentation
# It is a product as it has a price
# I think it will be worth while dropping this product as it is generic

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
159703,550354,M,Manual,1,2011-04-18 10:28:00,222.75,13811.0,Germany
248485,558841,M,Manual,1,2011-07-04 11:59:00,30.0,12471.0,Germany
249272,558895,M,Manual,1,2011-07-04 15:54:00,389.68,12619.0,Germany
398150,571223,M,Manual,1,2011-10-14 13:36:00,599.5,13810.0,Germany
455619,575632,M,Manual,1,2011-11-10 13:44:00,40.46,12473.0,Germany
455620,575632,M,Manual,1,2011-11-10 13:44:00,424.06,12473.0,Germany
455621,575632,M,Manual,1,2011-11-10 13:44:00,549.34,12473.0,Germany
455648,575636,M,Manual,1,2011-11-10 13:46:00,40.46,12473.0,Germany
479546,577168,M,Manual,1,2011-11-18 10:42:00,0.0,12603.0,Germany


In [13]:
df.loc[df['StockCode'] == 'POST']

# This stock code refers to postage which is really not a stock item
# I will delete this stock code

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1123,536527,POST,POSTAGE,1,2010-12-01 13:04:00,18.0,12662.0,Germany
5073,536840,POST,POSTAGE,1,2010-12-02 18:27:00,18.0,12738.0,Germany
5369,536861,POST,POSTAGE,3,2010-12-03 10:44:00,18.0,12427.0,Germany
6602,536967,POST,POSTAGE,1,2010-12-03 12:57:00,18.0,12600.0,Germany
6973,536983,POST,POSTAGE,1,2010-12-03 14:30:00,18.0,12712.0,Germany
...,...,...,...,...,...,...,...,...
537459,581266,POST,POSTAGE,5,2011-12-08 11:25:00,18.0,12621.0,Germany
541216,581494,POST,POSTAGE,2,2011-12-09 10:13:00,18.0,12518.0,Germany
541730,581570,POST,POSTAGE,1,2011-12-09 11:59:00,18.0,12662.0,Germany
541767,581574,POST,POSTAGE,2,2011-12-09 12:09:00,18.0,12526.0,Germany


In [14]:
# Check for more abnormalities

df['StockCode'].value_counts()

POST     374
22326    113
22328     72
22554     64
22423     63
        ... 
23089      1
23090      1
22089      1
23094      1
90098      1
Name: StockCode, Length: 1665, dtype: int64

In [15]:
# Check for more abnormalities

df['Description'].value_counts()

POSTAGE                               374
ROUND SNACK BOXES SET OF4 WOODLAND    113
ROUND SNACK BOXES SET OF 4 FRUITS      72
PLASTERS IN TIN WOODLAND ANIMALS       64
REGENCY CAKESTAND 3 TIER               63
                                     ... 
TRAVEL CARD WALLET RETRO PETALS         1
ANT COPPER PINK BOUDICCA BRACELET       1
VINTAGE CHRISTMAS GIFT BAG LARGE        1
DRAWER KNOB VINTAGE GLASS STAR          1
PACK 3 IRON ON DOG PATCHES              1
Name: Description, Length: 1695, dtype: int64

In [16]:
# ROUND SNACK BOXES SET OF 4 FRUITS
# ROUND SNACK BOXES SET OF4 WOODLAND

# Same product, maybe?

In [17]:
# Number of stock items in the data frame

print(df['Description'].nunique())
print(df['StockCode'].nunique())

# This could mean that some of the descriptions are different for the same stock code
# See note at end of this answer for my interpretation

1695
1665


---

In [18]:
# To make up the transactions list I will only need these fields from the dataframe
# Please see note at end of the answer - I feel it would have been better perhaps to use StockCode

df2 = df[['InvoiceNo', 'Description']]
df2

Unnamed: 0,InvoiceNo,Description
1109,536527,SET OF 6 T-LIGHTS SANTA
1110,536527,ROTATING SILVER ANGELS T-LIGHT HLDR
1111,536527,MULTI COLOUR SILVER T-LIGHT HOLDER
1112,536527,5 HOOK HANGER MAGIC TOADSTOOL
1113,536527,3 HOOK HANGER MAGIC GARDEN
...,...,...
541801,581578,SET OF 4 PANTRY JELLY MOULDS
541802,581578,PACK OF 20 NAPKINS PANTRY DESIGN
541803,581578,PACK OF 20 NAPKINS RED APPLES
541804,581578,JINGLE BELL HEART ANTIQUE SILVER


In [19]:
crosstab = pd.crosstab(df2.InvoiceNo, df2.Description).astype('bool').astype('int')
crosstab

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE SKULLS,...,YULETIDE IMAGES GIFT WRAP SET,ZINC HEART T-LIGHT HOLDER,ZINC STAR T-LIGHT HOLDER,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS,ZINC HEART LATTICE T-LIGHT HOLDER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536527,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536840,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536861,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536967,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581266,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581494,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581574,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# crosstab.POST and crosstab.Manual
# Identified as stock items to drop from the analysis

crosstab.drop('POSTAGE', inplace=True, axis=1)
crosstab.drop('Manual', inplace=True, axis=1)

In [21]:
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(crosstab, min_support=0.02,use_colnames=True)
frequent_itemsets.nunique()
frequent_itemsets.sort_values(by='support',ascending=False)

Unnamed: 0,support,itemsets
178,0.245077,(ROUND SNACK BOXES SET OF4 WOODLAND)
176,0.157549,(ROUND SNACK BOXES SET OF 4 FRUITS)
137,0.137856,(PLASTERS IN TIN WOODLAND ANIMALS)
165,0.137856,(REGENCY CAKESTAND 3 TIER)
441,0.131291,"(ROUND SNACK BOXES SET OF4 WOODLAND, ROUND SNA..."
...,...,...
226,0.021882,(SPACEBOY ROCKET LOLLY MAKERS)
440,0.021882,"(ROUND CONTAINER SET OF 5 RETROSPOT, ROUND SNA..."
439,0.021882,"(SPACEBOY BIRTHDAY CARD, ROBOT BIRTHDAY CARD)"
235,0.021882,(TABLECLOTH RED APPLES DESIGN)


In [22]:
# Should find there are 528 such item sets

frequent_itemsets.shape[0]

528

#### Note:

In my experience as an ERP consultant, the product description should not be used as the unique identifier. Descriptions on the item master are likely to be changed over time (spelling corrections, definition changes), and stock items are sometimes duplicated. At least one of these scenarios appears to apply to this dataset, because the number of unique descriptions (1695) is not equal to the number of stock codes (1665). I ran the crosstab and `apriori` again with a data frame of StockCode and found **537** frequent itemsets. For me this is the more accurate answer. There is also scope to merge stock items with identical descriptions, which do exist, into a single item.

We were given an expected number of frequent itemsets, **528** in this case, so I returned to using the Description field. If this were a company use case exercise, I think I would be more comfortable using the StockCode because my EDA would revolve around it and I would have researched the products more thoroughly to find potential duplicates.
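
One way to look for the description drift mentioned above is to count distinct descriptions per stock code. The sketch below runs on invented rows: the second spelling of the 22326 description is hypothetical and only stands in for the kind of correction I have in mind.

```python
import pandas as pd

# Toy rows; the "SET OF 4 WOODLAND" spelling is a made-up variant for illustration.
toy = pd.DataFrame({
    "StockCode": ["22326", "22326", "22328", "POST"],
    "Description": [
        "ROUND SNACK BOXES SET OF4 WOODLAND",
        "ROUND SNACK BOXES SET OF 4 WOODLAND",
        "ROUND SNACK BOXES SET OF 4 FRUITS",
        "POSTAGE",
    ],
})

# Number of distinct descriptions recorded against each stock code.
desc_per_code = toy.groupby("StockCode")["Description"].nunique()

# Codes with more than one description are candidates for manual review.
ambiguous = desc_per_code[desc_per_code > 1]
print(ambiguous)
```

Running the same two lines on the real frame would list the codes to inspect before deciding whether to key the baskets on `StockCode`.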

In [23]:
## END YOUR ANSWER HERE

---

__Task 1.2__: Use mlxtend's `association_rules` function to find the association rules where the minimum lift threshold is 1.
Sort them in non-increasing order of lift (largest to smallest).
You should then check the number of such rules &mdash; you should find there are 738 such rules.

In [24]:
## BEGIN YOUR ANSWER HERE

In [25]:
from mlxtend.frequent_patterns import association_rules

# Find the association rules where the minimum lift threshold is 1

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)


# Sort them in non-increasing order of lift (largest to smallest).

rules.sort_values(by='lift',ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
127,(DOLLY GIRL CHILDRENS BOWL),(DOLLY GIRL CHILDRENS CUP),0.026258,0.024070,0.024070,0.916667,38.083333,0.023438,11.711160
126,(DOLLY GIRL CHILDRENS CUP),(DOLLY GIRL CHILDRENS BOWL),0.024070,0.026258,0.024070,1.000000,38.083333,0.023438,inf
69,(BLUE VINTAGE SPOT BEAKER),(PINK VINTAGE SPOT BEAKER),0.030635,0.024070,0.024070,0.785714,32.642857,0.023333,4.554340
68,(PINK VINTAGE SPOT BEAKER),(BLUE VINTAGE SPOT BEAKER),0.024070,0.030635,0.024070,1.000000,32.642857,0.023333,inf
478,"(RED STRIPE CERAMIC DRAWER KNOB, WHITE SPOT RE...",(BLUE STRIPE CERAMIC DRAWER KNOB),0.032823,0.028446,0.021882,0.666667,23.435897,0.020948,2.914661
...,...,...,...,...,...,...,...,...,...
359,(ROUND SNACK BOXES SET OF4 WOODLAND),(REGENCY CAKESTAND 3 TIER),0.245077,0.137856,0.039387,0.160714,1.165816,0.005602,1.027236
239,(ROUND SNACK BOXES SET OF4 WOODLAND),(PACK OF 72 RETROSPOT CAKE CASES),0.245077,0.085339,0.024070,0.098214,1.150870,0.003155,1.014277
238,(PACK OF 72 RETROSPOT CAKE CASES),(ROUND SNACK BOXES SET OF4 WOODLAND),0.085339,0.245077,0.024070,0.282051,1.150870,0.003155,1.051500
356,(REGENCY CAKESTAND 3 TIER),(ROUND SNACK BOXES SET OF 4 FRUITS),0.137856,0.157549,0.024070,0.174603,1.108245,0.002351,1.020662


In [26]:
# Should find there are 738 such rules

rules.shape[0]

738

In [27]:
## END YOUR ANSWER HERE

---

__Task 1.3__: Comparing row indexes 452 and 453 above, which have the same lift value (32.642857), by reviewing the rule metrics above, would it be better to suggest 'BLUE VINTAGE SPOT BEAKER' to someone who already had 'PINK VINTAGE SPOT BEAKER', or vice-versa? Give reasons for your answer.

In [28]:
## BEGIN YOUR ANSWER HERE

In [29]:
rules.iloc[[452, 453]]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
452,(WOODLAND MINI BACKPACK),(SPACEBOY MINI BACKPACK),0.043764,0.032823,0.021882,0.5,15.233333,0.020445,1.934354
453,(SPACEBOY MINI BACKPACK),(WOODLAND MINI BACKPACK),0.032823,0.043764,0.021882,0.666667,15.233333,0.020445,2.868709


**Note**

The task expects the rows at index 452 and 453 to have a lift of 32.642857, but on my machine those index positions hold a different pair of rules (lift 15.233333). I assume the environment used to prepare the specification notebook produced the rules in a different order from mine. As a result, I'll search for the right rows using the lift value of 32.642857.

After the rules have been sorted, I think it's a good idea to add a step that resets the index. This should help the results to be indexed in the same order on all platforms.


**P.S.**

It turns out that the unstable sorting algorithm is to blame for the sorting discrepancy.
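
A minimal sketch of the reset step suggested in the note, on a made-up rules table. Using `kind="mergesort"` (a stable sort) is my assumption about how to keep tied lift values in their original relative order; it is not something the specification asks for.

```python
import pandas as pd

# Toy rules table with tied lift values and arbitrary index labels (illustrative only).
toy_rules = pd.DataFrame(
    {"rule": ["a->b", "b->a", "c->d", "d->c"], "lift": [2.0, 2.0, 5.0, 5.0]},
    index=[10, 11, 12, 13],
)

# Stable sort by lift (largest first), then renumber the rows from 0.
ranked = (
    toy_rules
    .sort_values(by="lift", ascending=False, kind="mergesort")
    .reset_index(drop=True)
)
print(ranked)
```

After this, position and index label agree, so `ranked.loc[0]` and `ranked.iloc[0]` refer to the same rule.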

In [30]:
import numpy as np

# Since there are various levels of precision, the numpy isclose function is used to compare floats. 
# i.e. The lift number output in the rules table does not exactly match the number stored in the memory. 
# As a result, use this function to set an accuracy threshold when comparing.

rules['findMe'] = np.isclose(rules['lift'], 32.6428, rtol=1e-05, atol=1e-08, equal_nan=False)

rules.loc[rules['findMe'] == True]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,findMe
68,(PINK VINTAGE SPOT BEAKER),(BLUE VINTAGE SPOT BEAKER),0.02407,0.030635,0.02407,1.0,32.642857,0.023333,inf,True
69,(BLUE VINTAGE SPOT BEAKER),(PINK VINTAGE SPOT BEAKER),0.030635,0.02407,0.02407,0.785714,32.642857,0.023333,4.55434,True
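
To see why a tolerance is needed, here is a small check that does not depend on the rules table. It assumes the stored lift is 457 / 14, which is consistent with the printed 32.642857 but is my inference rather than a value read from memory.

```python
import numpy as np

# Assumed stored value: 457 / 14 = 32.642857142857..., displayed as 32.642857.
stored_lift = 457 / 14

# Exact comparison against the displayed value fails.
exact_match = stored_lift == 32.642857

# np.isclose accepts |a - b| <= atol + rtol * |b|, about 3.3e-4 for these defaults.
close_to_displayed = bool(np.isclose(stored_lift, 32.642857, rtol=1e-05, atol=1e-08))
close_to_truncated = bool(np.isclose(stored_lift, 32.6428, rtol=1e-05, atol=1e-08))
close_to_rough = bool(np.isclose(stored_lift, 32.64, rtol=1e-05, atol=1e-08))

print(exact_match, close_to_displayed, close_to_truncated, close_to_rough)
```

The truncated value 32.6428 still matches because it is within the tolerance, while 32.64 is too far away.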


**Comments**

 *  Please note the difference in index numbers (68 and 69 here rather than 452 and 453), which is addressed under the **Note** heading above.
 
##### Support
Support tells us how often an itemset appears across all transactions: it is the fraction of the total number of transactions that contain the itemset, and it signifies the popularity of the itemset. The value of support helps us decide which rules are worth investigating further. Both rules here have a support of 0.02407, because they are built from the same itemset {BLUE VINTAGE SPOT BEAKER, PINK VINTAGE SPOT BEAKER}. Individually, the blue beaker has a support of 0.030635 and the pink beaker 0.02407, so these itemsets aren't particularly popular. If an itemset has very low support, we have little information about the relationship between its items, so any conclusion drawn from it should be treated with caution. I'd like to point out that the minimum support threshold for frequent itemsets was set to 0.02, and these itemsets only scrape by. It could also be argued that a good recommendation for this itemset is valuable to the business, because it helps to move slow-moving stock items and to increase the amount of stock moved in a single transaction. The rare item problem occurs when items that appear infrequently in the dataset are pruned, despite the fact that they could generate interesting and potentially useful rules. This example illustrates why it is important not to set the minimum support threshold too high in certain cases, as these relationships would otherwise be overlooked.
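
As a small illustration of the definition, support can be computed by hand on a few made-up baskets. The item names and baskets below are invented and are not taken from the Online Retail data.

```python
# Toy baskets, made up for illustration.
baskets = [
    {"PINK BEAKER", "BLUE BEAKER"},
    {"BLUE BEAKER"},
    {"BLUE BEAKER", "CAKESTAND"},
    {"SNACK BOXES"},
    {"PINK BEAKER", "BLUE BEAKER", "SNACK BOXES"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


pair_support = support({"PINK BEAKER", "BLUE BEAKER"}, baskets)
blue_support = support({"BLUE BEAKER"}, baskets)
print(pair_support, blue_support)  # 0.4 0.8
```

The pair can never have a higher support than either of its items, which is the property `apriori` relies on when it prunes candidates.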

##### Confidence
Confidence describes how likely the consequent is to be in the cart given that the cart already contains the antecedent. In other words, it is the conditional probability of the consequent given the antecedent. The nearer the value is to 1, the more likely the consequent is to be present with the antecedent. The rule {PINK VINTAGE SPOT BEAKER} -> {BLUE VINTAGE SPOT BEAKER} has a confidence of 1, while the rule {BLUE VINTAGE SPOT BEAKER} -> {PINK VINTAGE SPOT BEAKER} has a confidence of 0.785714. So, if we want to increase the number of products a customer buys, we should suggest the Blue Vintage Spot Beaker to a customer who already has the Pink Vintage Spot Beaker: in this dataset, every basket that contained the pink beaker also contained the blue one.
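
The two confidence values can be reproduced from basket counts. The counts below are my inference from the reported supports and the 457 invoices found in the EDA (0.02407 × 457 ≈ 11 and 0.030635 × 457 ≈ 14); they are not read directly from the data.

```python
n_pink = 11  # assumed number of baskets containing the pink beaker
n_blue = 14  # assumed number of baskets containing the blue beaker
n_both = 11  # assumed number of baskets containing both beakers

# confidence(X -> Y) = count(X and Y) / count(X)
conf_pink_to_blue = n_both / n_pink
conf_blue_to_pink = n_both / n_blue

print(round(conf_pink_to_blue, 6), round(conf_blue_to_pink, 6))  # 1.0 0.785714
```

Under these assumed counts, three baskets contain the blue beaker without the pink one, which is why the two directions differ.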

##### Lift
Lift adjusts the confidence of X -> Y for the support (frequency) of the consequent Y. It is the ratio of the probability of having Y in the cart when X is known to be present to the probability of having Y in the cart without any knowledge of X. A lift greater than 1 indicates that having X in the cart increases the likelihood of Y, a lift of 1 indicates that X and Y are independent, and a lift less than 1 indicates that having X in the cart decreases the likelihood of Y. The lift of both beaker rules is 32.642857. This indicates that the two items are very strongly associated with each other and we can make a strong recommendation. By contrast, consider milk in a grocery store: milk is bought in almost every transaction, so a rule with milk as the consequent can have high confidence without having a lift much greater than 1.
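
A short sketch of the lift calculation, covering both the beakers and the milk example. The beaker counts (11, 14, 11 out of 457 baskets) are inferred from the reported supports, and the grocery counts are invented purely to show a lift of 1.

```python
def lift(n_x, n_y, n_both, n_total):
    """Lift of X -> Y (identical for Y -> X): P(X and Y) / (P(X) * P(Y))."""
    return (n_both / n_total) / ((n_x / n_total) * (n_y / n_total))


# Beakers: assumed counts, consistent with the supports in the rules table.
beaker_lift = lift(n_x=11, n_y=14, n_both=11, n_total=457)

# Hypothetical grocery store: bread in 300 baskets, milk in 900, both in 270 of 1000.
milk_lift = lift(n_x=300, n_y=900, n_both=270, n_total=1000)

print(round(beaker_lift, 6), round(milk_lift, 6))
```

In the grocery case the confidence of bread -> milk is 0.9, yet the lift is about 1 because milk is in 90% of all baskets anyway.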

##### Leverage
Leverage is the difference between the observed frequency of X and Y occurring together and the frequency that would be expected if X and Y were independent. Independence is indicated by a leverage value of 0. In the case of the beakers the value is 0.023333. In a sales situation, the rationale is to see how many more units (items X and Y combined) are sold than anticipated from the separate sales. Since these itemsets are rare and only just made it over the support threshold, leverage is necessarily small here, so I assume the metric is weak and does not completely represent the relationship between the items in this case. I will disregard this metric in this analysis.
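
The leverage value can be checked with the same kind of arithmetic. The counts are assumptions inferred from the reported supports and the 457 invoices, not values taken directly from the data.

```python
# Assumed counts: total baskets, baskets with pink, with blue, with both.
n_total, n_pink, n_blue, n_both = 457, 11, 14, 11

observed = n_both / n_total                         # P(pink and blue)
expected = (n_pink / n_total) * (n_blue / n_total)  # P(pink) * P(blue) if independent
leverage = observed - expected

# The same difference expressed as a number of baskets.
extra_baskets = leverage * n_total

print(round(leverage, 6), round(extra_baskets, 2))  # 0.023333 10.66
```

Leverage can never exceed the support of the pair, so a rare pair will always show a small leverage even when the association is strong.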

##### Conviction
A high conviction value indicates that the consequent depends strongly on the antecedent. Conviction is calculated as (1 - consequent support) / (1 - confidence), so with a perfect confidence score the denominator becomes 0 (due to 1 - 1), and the conviction score is reported as 'inf', meaning infinite, as in k/0. If items are independent, the conviction is 1. In the case of {PINK VINTAGE SPOT BEAKER} -> {BLUE VINTAGE SPOT BEAKER} our conviction is infinite, while for {BLUE VINTAGE SPOT BEAKER} -> {PINK VINTAGE SPOT BEAKER} our conviction is 4.55434. That value is still well above 1, so the items are not independent in that direction either, but the dependence is weaker. This supports our recommendation, as we should recommend the rule with the stronger dependence.
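
A minimal sketch of the conviction calculation, including the division-by-zero case. The counts (11 pink, 14 blue, 11 both, 457 baskets) are inferred from the reported supports, and returning `math.inf` for a confidence of 1 mirrors the `inf` shown in the rules table.

```python
import math


def conviction(confidence, consequent_support):
    """(1 - consequent support) / (1 - confidence), infinite when confidence is 1."""
    if confidence == 1.0:
        return math.inf
    return (1 - consequent_support) / (1 - confidence)


# PINK -> BLUE: confidence 11/11, consequent (blue) support 14/457.
pink_to_blue = conviction(confidence=11 / 11, consequent_support=14 / 457)

# BLUE -> PINK: confidence 11/14, consequent (pink) support 11/457.
blue_to_pink = conviction(confidence=11 / 14, consequent_support=11 / 457)

print(pink_to_blue, round(blue_to_pink, 5))  # inf 4.55434
```

A conviction of 1 would mean the rule is wrong as often as expected under independence, so both directions here sit on the dependent side.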

##### The Recommendation
So, given this analysis of the metrics, my recommendation is as follows. If I recommend the BLUE VINTAGE SPOT BEAKER to a client who already has the PINK VINTAGE SPOT BEAKER, I do so with 100% confidence and infinite conviction, because every basket in this dataset that contained the pink beaker also contained the blue one. Whereas if I recommend the PINK VINTAGE SPOT BEAKER to a client who already has the BLUE VINTAGE SPOT BEAKER, there is only about a 79% chance that they would also buy it. The high lift indicates a strong association between the items. The leverage and support are not particularly useful for assessing the relationship due to the rare item problem. That said, the support metric was how we identified the rules in the first place.

This association rule will maximise the sales: **{PINK VINTAGE SPOT BEAKER} -> {BLUE VINTAGE SPOT BEAKER}**

In [31]:
## END YOUR ANSWER HERE