# Module AA: Association Analysis

This example is taken from the following notebook:

https://www.kaggle.com/code/sangwookchn/association-rule-learning-with-scikit-learn/notebook

In [None]:
# Please install the package. An alternative package is mlxtend.
!pip install apyori



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from apyori import apriori

In [None]:
dataset = pd.read_csv('http://pluto.hood.edu/~dong/datasets/Market_Basket_Optimisation.csv', header = None) # header = None so the first row is read as data, not as column names
print(dataset.shape)
dataset.head()

(7501, 20)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [None]:
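# Transform the DataFrame into a list of lists so that each transaction can be indexed more easily.
# Note: str() turns a missing cell (NaN) into the string 'nan', so shorter baskets are padded with 'nan' items.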
transactions = []
for i in range(0, dataset.shape[0]):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

print(transactions[0])

['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil']
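
Most baskets hold fewer than 20 items, so the loop above pads them with the string 'nan' (what `str()` returns for a missing cell). The following is a minimal sketch of one way to drop those placeholders. It uses a small made-up DataFrame (`toy`, a hypothetical name) instead of the real file, and `clean_transactions` is likewise an illustrative name.

```python
import pandas as pd

# Hypothetical three-basket frame; None marks an empty cell, as in the ragged CSV.
toy = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", None, None],
    ["turkey", "avocado", None],
])

# Keep only the cells that actually hold an item.
clean_transactions = [
    [str(item) for item in row if pd.notna(item)]
    for row in toy.values
]

print(clean_transactions)
```

Dropping the padding may change which rules apriori returns, so results obtained this way would not necessarily match the table shown later in this notebook.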


In [None]:
# Support: number of transactions containing a set of items / total number of transactions
#       --> products that are bought at least 3 times a day (3 x 7 = 21 purchases, assuming one week of data)
#       --> 21 / 7501 is about 0.0028, rounded up to 0.003 below
# Confidence: should not be too high, as this will lead to obvious rules
# Try many combinations of values to experiment with the model.

rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)

# Collect the rules into a list (apriori returns a generator)
results = list(rules)

In [None]:
#Transferring the list to a table

results = pd.DataFrame(results)
results.head(10)

Unnamed: 0,items,support,ordered_statistics
0,"(chicken, light cream)",0.004533,"[((light cream), (chicken), 0.2905982905982905..."
1,"(escalope, mushroom cream sauce)",0.005733,"[((mushroom cream sauce), (escalope), 0.300699..."
2,"(escalope, pasta)",0.005866,"[((pasta), (escalope), 0.3728813559322034, 4.7..."
3,"(fromage blanc, honey)",0.003333,"[((fromage blanc), (honey), 0.2450980392156863..."
4,"(herb & pepper, ground beef)",0.015998,"[((herb & pepper), (ground beef), 0.3234501347..."
5,"(ground beef, tomato sauce)",0.005333,"[((tomato sauce), (ground beef), 0.37735849056..."
6,"(light cream, olive oil)",0.0032,"[((light cream), (olive oil), 0.20512820512820..."
7,"(whole wheat pasta, olive oil)",0.007999,"[((whole wheat pasta), (olive oil), 0.27149321..."
8,"(pasta, shrimp)",0.005066,"[((pasta), (shrimp), 0.3220338983050847, 4.506..."
9,"(avocado, milk, spaghetti)",0.003333,"[((avocado, spaghetti), (milk), 0.416666666666..."
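
The `ordered_statistics` column is truncated in the table above, which makes confidence and lift hard to read. The sketch below shows one way to flatten each record into one row per rule. To stay self-contained it does not import apyori; the two namedtuples are stand-ins that are assumed to mirror the field names of apyori's result tuples, and the single record uses rounded, illustrative values. `flatten` and `flat_rules` are hypothetical names.

```python
from collections import namedtuple

# Stand-ins assumed to mirror apyori's result tuples:
# items / support / ordered_statistics, and items_base / items_add / confidence / lift.
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")

# One hand-written record with rounded, illustrative values.
records = [
    RelationRecord(
        items=frozenset({"chicken", "light cream"}),
        support=0.0045,
        ordered_statistics=[
            OrderedStatistic(frozenset({"light cream"}), frozenset({"chicken"}), 0.29, 4.84),
        ],
    )
]

def flatten(records):
    """Return one dict per rule: antecedent, consequent, support, confidence, lift."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": ", ".join(sorted(stat.items_base)),
                "consequent": ", ".join(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows

flat_rules = flatten(records)
print(flat_rules[0])
```

If the field names match, passing the list of rules (before it is converted to a DataFrame) to `flatten` and then calling `pd.DataFrame(flat_rules)` should give a table with one readable row per rule.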


"The first item in the list is a list itself containing three items. The first item of the list shows the grocery items in the rule.

For instance from the first item, we can see that light cream and chicken are commonly bought together. This makes sense since people who purchase light cream are careful about what they eat hence they are more likely to buy chicken i.e. white meat instead of red meat i.e. beef. Or this could mean that light cream is commonly used in recipes for chicken.

The support value for the first rule is 0.0045. This number is calculated by dividing the number of transactions containing [both] light cream [and chicken] by [the] total number of transactions. The confidence level for the rule is 0.2905 which shows that out of all the transactions that contain light cream, 29.05% of the transactions also contain chicken. Finally, the lift of 4.84 tells us that chicken is 4.84 times more likely to be bought by the customers who buy light cream compared to the default likelihood of the sale of chicken."

From https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/
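
To make the three measures concrete, here is a small sketch that computes them by hand for the rule light cream -> chicken. The eight baskets are made up for illustration, and `support` and `rule_metrics` are hypothetical helper names, not part of apyori. Note that the support of a rule counts the transactions that contain both sides of the rule.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    confidence = both / support(transactions, antecedent)
    lift = confidence / support(transactions, consequent)
    return both, confidence, lift

# Made-up baskets: light cream appears in 3, chicken in 3, both together in 2.
baskets = [
    ["light cream", "chicken"],
    ["light cream", "chicken", "eggs"],
    ["light cream", "eggs"],
    ["chicken"],
    ["eggs"],
    ["eggs", "milk"],
    ["milk"],
    ["milk", "eggs"],
]

s, c, l = rule_metrics(baskets, ["light cream"], ["chicken"])
print(f"support={s:.3f} confidence={c:.3f} lift={l:.3f}")
```

Here support is 2/8, confidence is (2/8) / (3/8) = 2/3, and lift is (2/3) / (3/8), which is about 1.78: in this toy data, chicken is roughly 1.8 times more likely in baskets with light cream than in baskets overall.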

## In-Class: Association Analysis ##
Customer Computer Configuration

In [None]:
pc_purchase = pd.read_csv('http://pluto.hood.edu/~dong/datasets/PC-Purchase-Data.csv') # The first row holds the option names, so it is read as the header (the default)
print(pc_purchase.shape)
pc_purchase.head()

(67, 12)


Unnamed: 0,Intel Core i3,Intel Core i5,Intel Core i7,10 inch screen,12 inch screen,15 inch screen,2 GB,4 GB,8 GB,320 GB,500 GB,750 GB
0,0,1,0,0,1,0,0,1,0,0,1,0
1,0,1,0,0,0,1,0,0,1,0,0,1
2,0,1,0,0,1,0,0,1,0,1,0,0
3,1,0,0,0,1,0,0,0,1,0,1,0
4,0,0,1,0,0,1,0,0,1,0,0,1


The data represent the configurations for a small number of laptop orders placed over the web. The main options from which customers can choose are the type of processor, screen size, memory, and hard drive. A '1' signifies that a customer selected a particular option. If the manufacturer can better understand which components are often ordered together, it can speed up final assembly by having partially completed laptops with the most popular combinations of components configured before the order is placed.

In [None]:
# Transform the one-hot DataFrame into a list of lists: one list of selected options per order
computers = []
for i in range(0, pc_purchase.shape[0]):
  newList = []
  for j in range(0, pc_purchase.shape[1]):
    # Keep the column name whenever the option was selected (value 1)
    if(pc_purchase.values[i][j] == 1):
      newList.append(str(pc_purchase.columns[j]))
  computers.append(newList)
print(computers[1])

['Intel Core i5', '15 inch screen', '8 GB', '750 GB']
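
The same conversion can be written more compactly. The sketch below uses plain Python lists in place of the DataFrame so that it runs on its own; `columns`, `rows`, `labels`, and `orders` are hypothetical names and the two orders are made up. It also strips whitespace from the option names, on the assumption that a header such as `2 GB ` carries a trailing space (the results table further down prints that item as `2 GB ` with a space before the comma).

```python
# Made-up header and two one-hot rows; "2 GB " has a trailing space on purpose.
columns = ["Intel Core i5", "15 inch screen", "2 GB ", "8 GB", "750 GB"]
rows = [
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 0],
]

# Clean the option names once, then keep the names whose flag is 1 in each row.
labels = [name.strip() for name in columns]
orders = [
    [label for label, flag in zip(labels, row) if flag == 1]
    for row in rows
]

print(orders)
```

With the real data, `pc_purchase.columns` and `pc_purchase.values` would take the place of `columns` and `rows`.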


In [None]:
rules = apriori(computers, min_support = 0.02, min_confidence = 0.2, min_lift = 3, min_length = 3)

# Collect the rules into a list (apriori returns a generator)
results = list(rules)

In [None]:
#Transferring the list to a table

results = pd.DataFrame(results)
print(results.shape)
results.head(28)

(13, 3)


Unnamed: 0,items,support,ordered_statistics
0,"(15 inch screen, Intel Core i3, 320 GB)",0.029851,"[((15 inch screen, Intel Core i3), (320 GB), 1..."
1,"(8 GB, 15 inch screen, 750 GB)",0.074627,"[((750 GB), (8 GB, 15 inch screen), 0.29411764..."
2,"(15 inch screen, 750 GB, Intel Core i7)",0.074627,"[((750 GB), (15 inch screen, Intel Core i7), 0..."
3,"(8 GB, 15 inch screen, Intel Core i7)",0.059701,"[((8 GB), (15 inch screen, Intel Core i7), 0.3..."
4,"(4 GB, 750 GB, Intel Core i7)",0.059701,"[((Intel Core i7), (4 GB, 750 GB), 0.333333333..."
5,"(8 GB, 750 GB, Intel Core i7)",0.059701,"[((Intel Core i7), (8 GB, 750 GB), 0.333333333..."
6,"(2 GB , 10 inch screen, 500 GB, Intel Core i3)",0.029851,"[((2 GB , 10 inch screen), (500 GB, Intel Core..."
7,"(2 GB , 320 GB, 12 inch screen, Intel Core i3)",0.029851,"[((320 GB, 12 inch screen), (2 GB , Intel Core..."
8,"(2 GB , 12 inch screen, 750 GB, Intel Core i3)",0.029851,"[((2 GB , 750 GB), (12 inch screen, Intel Core..."
9,"(12 inch screen, 750 GB, Intel Core i7, 4 GB)",0.044776,"[((Intel Core i7), (12 inch screen, 750 GB, 4 ..."


**Your task is to find the most popular configurations. Use support, confidence, and lift correctly to explain your findings.**

Hint: make sure the data are in the right format.


The two most popular configurations in the table above are (8 GB, 15 inch screen, 750 GB) and (15 inch screen, 750 GB, Intel Core i7), each with a support of 0.0746 (5 of the 67 orders). The other configurations follow with lower support (see the table above). Support is the proportion of orders that contain every item in the configuration. Confidence is the likelihood that the second part of a rule (the consequent) appears in an order, given that the first part (the antecedent) is already in it. Lift divides that confidence by the overall frequency of the consequent, so it measures how much more likely the consequent is when the antecedent is present than it is in general. Every itemset in the table has at least one rule that passed the thresholds min_confidence = 0.2 and min_lift = 3, and these two itemsets have the highest support, so they are the strongest rules found.
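
The ranking by support can be checked with a short sketch. The (items, support) pairs below are copied from the first five rows of the results table, and 67 is the number of orders reported by `pc_purchase.shape`; `table`, `ranked`, and `n_orders` are illustrative names.

```python
# (items, support) pairs copied from the first five rows of the results table.
table = [
    (("15 inch screen", "Intel Core i3", "320 GB"), 0.029851),
    (("8 GB", "15 inch screen", "750 GB"), 0.074627),
    (("15 inch screen", "750 GB", "Intel Core i7"), 0.074627),
    (("8 GB", "15 inch screen", "Intel Core i7"), 0.059701),
    (("4 GB", "750 GB", "Intel Core i7"), 0.059701),
]
n_orders = 67  # number of rows in pc_purchase

# Sort by support, highest first; sorted() keeps tied rows in their original order.
ranked = sorted(table, key=lambda row: row[1], reverse=True)

for items, sup in ranked[:2]:
    # support * n_orders recovers the number of orders containing the itemset
    print(items, sup, round(sup * n_orders))
```

On the full DataFrame, `results.sort_values("support", ascending=False)` should give the same ordering.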

In [1]:
from google.colab import drive
drive.mount('/content/drive')
#do not include the output from installation.
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc
!pip install pypandoc
!pip install nbconvert
!cp "./drive/My Drive/Colab Notebooks/Copy of module_aa.ipynb" ./
!jupyter nbconvert --to PDF "Copy of module_aa.ipynb"
