
# Lecture 8.3 - Computing Association Rules with `dplyr`

### Review - Association Rules

Consider the rule $\{butter\} \rightarrow \{whole.milk\}$

  * $Support(\textrm{butter and milk}) = \frac{\textrm{number of butter and milk transactions}}{\textrm{total number of transactions}}$
  * $Support(\textrm{butter}) = \frac{\textrm{number of butter transactions}}{\textrm{total number of transactions}}$
  * $Confidence = \frac{Support(\textrm{butter and milk})}{Support(\textrm{butter})}$
  * $Lift = \frac{Confidence}{Support(\textrm{milk})}$
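
Substituting the definition of confidence into the definition of lift gives an equivalent form. The short derivation below is a sketch; the symbols $A$ and $B$ are shorthand introduced here for the left-hand item (butter) and the right-hand item (milk):

$$
\begin{aligned}
Lift(A \rightarrow B) &= \frac{Confidence(A \rightarrow B)}{Support(B)} \\
&= \frac{Support(A \textrm{ and } B)}{Support(A) \cdot Support(B)}
\end{aligned}
$$

The last expression is symmetric in $A$ and $B$, so a rule and its reverse share the same lift even though their confidences generally differ.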
  

### Example: investigate the rule $\{butter\} \longrightarrow \{whole.milk\}$ with `dplyr`
  

In [26]:
groceries <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Groceries.csv')
head(groceries)

,frankfurter,sausage,liver.loaf,ham,meat,finished.products,organic.sausage,chicken,turkey,pork,⋯,candles,light.bulbs,sound.storage.medium,newspapers,photo.film,pot.plants,flower.soil.fertilizer,flower..seeds.,shopping.bags,bags
,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0


#### Use the commands below to load the `dplyr` package and select only the columns of interest.

In [27]:
library(dplyr)
butter_milk <- groceries %>%
                select(butter, whole.milk)
head(butter_milk)

,butter,whole.milk
,<int>,<int>
1,0,0
2,0,0
3,0,1
4,0,0
5,0,1
6,1,1


#### Next, we can compute the total number of transactions using the `nrow()` function:

In [28]:
N <- nrow(groceries)
N

#### Note that we could compute Support(Butter) in two steps: first count the butter transactions, then divide by `N`:

In [29]:
butter_milk %>%
  summarize(Nbutter = sum(butter)) %>% 
  mutate(support_butter = Nbutter/N)

Nbutter,support_butter
<int>,<dbl>
545,0.05541434


#### We can alternatively compute Support(Butter) all at once in a single step:

In [30]:
butter_milk %>%
  summarize(support_butter = sum(butter)/N)

support_butter
<dbl>
0.05541434
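
Because each column holds only 0s and 1s, the support of an item is also the mean of its column. The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical, not taken from the groceries data):

```r
library(dplyr)

# Hypothetical toy data: 1 = item was in the basket, 0 = it was not
toy <- data.frame(butter = c(0, 1, 0, 1, 0))

toy_support <- toy %>%
  summarize(support_sum  = sum(butter) / n(),  # count of 1s over the row count
            support_mean = mean(butter))       # the same quantity via the mean

toy_support
```

Both columns hold the same value, so either form can be used inside `summarize()`.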


#### Now, let's compute the support of whole.milk in a similar way:

In [31]:
butter_milk %>%
  summarize(support_milk = sum(whole.milk)/N)

support_milk
<dbl>
0.255516


#### Next, we need to compute the support of $\{Butter\;and\;Milk\}$

To do this, we multiply the two 0/1 columns, `butter * whole.milk`. Think about why this product identifies the transactions that contain both items!

In [32]:
butter_milk %>%
  mutate(butter_and_milk = butter * whole.milk) %>%
  summarize(support_rule = sum(butter_and_milk)/N)

support_rule
<dbl>
0.02755465
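
One way to see why the product works is to try it on a few made-up baskets. The sketch below uses hypothetical vectors (not the groceries data) covering all four combinations of buying or not buying each item:

```r
# Hypothetical toy baskets: 1 = purchased, 0 = not purchased
butter     <- c(0, 0, 1, 1)
whole_milk <- c(0, 1, 0, 1)

# The product is 1 only in the row where both indicators are 1
both <- butter * whole_milk
both

# It agrees with the logical "and" of the two purchases
both_logical <- as.integer(butter == 1 & whole_milk == 1)
identical(as.integer(both), both_logical)
```

Summing the product therefore counts the transactions that contain both items.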


#### Now, we can put it all together (and also compute the confidence and lift).

In [33]:
groceries %>%
  mutate(bought_butter_milk = butter *  whole.milk) %>%
  summarize(support_milk = sum(whole.milk)/N,
            support_butter = sum(butter)/N,
            support_rule = sum(bought_butter_milk)/N) %>%
  mutate(confidence = support_rule/support_butter) %>%
  mutate(lift = confidence/support_milk)

support_milk,support_butter,support_rule,confidence,lift
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.255516,0.05541434,0.02755465,0.4972477,1.946053


#### Important Things to Note

* You must compute values before you use them
    * Compute Supports before Confidence
    * Compute Confidence before Lift
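
The same ordering rule applies inside a single `mutate()` call, because `dplyr` creates the columns in the order they are written. The sketch below uses a hypothetical toy data frame with made-up items `a` and `b` for a rule $\{a\} \rightarrow \{b\}$:

```r
library(dplyr)

# Hypothetical toy indicators for a rule {a} -> {b}
toy <- data.frame(a = c(1, 1, 0, 0, 1),
                  b = c(1, 0, 1, 0, 1))

toy_stats <- toy %>%
  summarize(support_a    = mean(a),
            support_b    = mean(b),
            support_rule = mean(a * b)) %>%
  # Within a single mutate(), columns are created in order,
  # so lift can use the confidence defined on the line above it
  mutate(confidence = support_rule / support_a,
         lift       = confidence / support_b)

toy_stats
```

Swapping the two lines inside `mutate()` would fail, since `confidence` would not exist yet when `lift` is computed.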

## <font color="red"> Activity 8.3 - Exercise 1 </font>

Compute and interpret all interesting statistics (supports, confidence, and lift) for the rule $\{domestic\,eggs\}\rightarrow\{ham\}$

In [88]:
library(dplyr)
eggs_ham <- groceries %>%
              select(ham, domestic.eggs)
head(eggs_ham)

,ham,domestic.eggs
,<int>,<int>
1,0,0
2,0,0
3,0,0
4,0,0
5,0,0
6,0,0


In [89]:
N <- nrow(groceries)
N

In [90]:
# Support of ham
eggs_ham %>%
  summarize(support_ham = sum(ham)/N)

support_ham
<dbl>
0.02602949


In [112]:
# Joint support of domestic eggs and ham (the support of the rule)
eggs_ham %>%
  mutate(eggs_and_ham = domestic.eggs * ham) %>%
  summarize(support_rule = sum(eggs_and_ham)/N)

support_rule
<dbl>
0.004168785


In [113]:
groceries %>%
  mutate(bought_eggs_ham = domestic.eggs * ham) %>%
  summarize(support_ham = sum(ham)/N,
            support_eggs = sum(domestic.eggs)/N,
            support_rule = sum(bought_eggs_ham)/N) %>%
  mutate(confidence = support_rule/support_eggs) %>%
  mutate(lift = confidence/support_ham)

support_ham,support_eggs,support_rule,confidence,lift
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.02602949,0.06344687,0.004168785,0.06570513,2.524258
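
As a check on the output above, the confidence and lift can be recomputed by hand from the three supports. This is a worked sketch using only the values printed above:

$$
\begin{aligned}
Confidence &= \frac{Support(\textrm{eggs and ham})}{Support(\textrm{eggs})} = \frac{0.004168785}{0.06344687} \approx 0.0657 \\
Lift &= \frac{Confidence}{Support(\textrm{ham})} = \frac{0.06570513}{0.02602949} \approx 2.52
\end{aligned}
$$

Read this way, ham appears in about 2.6% of all transactions but in about 6.6% of the transactions that contain domestic eggs, so knowing eggs were purchased makes a ham purchase roughly 2.5 times as likely. The joint support is only about 0.4%, so the rule covers a small share of transactions.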


#### If you're interested, the code below shows how to compute many rules at once.
##### Here's the big idea:


* Stack the items that could appear on the left-hand side (LHS) of a rule into one column named `lhs`; the right-hand side is always `whole.milk`
* Group by `lhs`
* Compute:
    * Support
    * Confidence
    * Lift
  

#### Step 0 - Read the data and load libraries

In [None]:
library(tidyr)
library(dplyr)

In [None]:
groceries <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Groceries.csv')
N <- nrow(groceries)

#### Step 1 - Stack all of the other products

In [None]:
groceries_stacked <-
  groceries %>%
  gather(key = "lhs",
         value = "pur_lhs",
         -whole.milk) 
head(groceries_stacked)

,whole.milk,lhs,pur_lhs
,<int>,<chr>,<int>
1,0,frankfurter,0
2,0,frankfurter,0
3,1,frankfurter,0
4,0,frankfurter,0
5,1,frankfurter,0
6,1,frankfurter,0
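
In current versions of `tidyr`, `gather()` is superseded by `pivot_longer()`. The sketch below shows what the equivalent stacking step might look like on a hypothetical toy table (the `toy` data frame and its values are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy version of the groceries indicator table
toy <- data.frame(whole.milk = c(1, 0, 1),
                  butter     = c(1, 0, 0),
                  yogurt     = c(0, 1, 1))

toy_stacked <- toy %>%
  pivot_longer(cols      = -whole.milk,  # stack every column except whole.milk
               names_to  = "lhs",        # former column names
               values_to = "pur_lhs")    # former 0/1 values

toy_stacked
```

The rows come out in a different order than with `gather()` (all items for the first transaction, then all items for the second, and so on), but grouped summaries by `lhs` are unaffected by row order.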


#### Step 2 - Find whether lhs and milk were bought together

In [None]:
groceries_stacked <-
  groceries_stacked %>%
  mutate(pur_both = whole.milk * pur_lhs) 
head(groceries_stacked)

,whole.milk,lhs,pur_lhs,pur_both
,<int>,<chr>,<int>,<int>
1,0,frankfurter,0,0
2,0,frankfurter,0,0
3,1,frankfurter,0,0
4,0,frankfurter,0,0
5,1,frankfurter,0,0
6,1,frankfurter,0,0


#### Step 3 - Compute the support, confidence, and lift for each rule

In [None]:
# Note that we group_by the products to keep them separate.
many_rules <-
  groceries_stacked %>%
  group_by(lhs) %>%
  summarize(sup_milk = sum(whole.milk)/N,
            sup_lhs = sum(pur_lhs)/N,
            joint_support = sum(pur_both)/N) %>%
  mutate(conf = joint_support/sup_lhs) %>%
  mutate(lift = conf/sup_milk)
head(many_rules)

lhs,sup_milk,sup_lhs,joint_support,conf,lift
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
abrasive.cleaner,0.255516,0.0035587189,0.0016268429,0.4571429,1.7890967
artif..sweetener,0.255516,0.0032536858,0.0011184545,0.34375,1.3453169
baby.cosmetics,0.255516,0.0006100661,0.000305033,0.5,1.9568245
baby.food,0.255516,0.0001016777,0.0,0.0,0.0
bags,0.255516,0.0004067107,0.0001016777,0.25,0.9784123
baking.powder,0.255516,0.0176919166,0.009252669,0.5229885,2.0467935


#### Step 4 - Filter out rules with low support; sort by lift

In [None]:
many_rules %>%
  filter(joint_support > .05) %>%
  arrange(desc(lift))

lhs,sup_milk,sup_lhs,joint_support,conf,lift
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
yogurt,0.255516,0.1395018,0.0560244,0.4016035,1.571735
other.vegetables,0.255516,0.1934926,0.07483477,0.3867578,1.513634
rolls.buns,0.255516,0.1839349,0.05663447,0.3079049,1.205032


Interpretation of the first rule, $\{yogurt\} \rightarrow \{whole.milk\}$:

* Whole milk appears in 25.6% of all transactions.
* Among transactions that include yogurt, whole milk appears 40.2% of the time (the confidence).
* The lift of 1.57 means that knowing yogurt was purchased raises the likelihood of a milk purchase by 57%, relative to the baseline rate at which milk is purchased.
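
The percent change implied by a lift value is $(lift - 1) \times 100$. As a small sketch, the code below applies this to the three lift values from the filtered output; the data frame `top_rules` and the column name `pct_change` are hypothetical names introduced here:

```r
library(dplyr)

# Lift values copied from the filtered output above
top_rules <- data.frame(lhs  = c("yogurt", "other.vegetables", "rolls.buns"),
                        lift = c(1.571735, 1.513634, 1.205032))

top_rules_pct <- top_rules %>%
  # (lift - 1) * 100 is the percent change relative to the baseline rate
  mutate(pct_change = (lift - 1) * 100)

top_rules_pct
```

A lift below 1 would give a negative percent change, meaning the left-hand item makes a milk purchase less likely than the baseline.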