# <center> Association Rules Mining </center>

___

Association rule mining is one of the most important concepts in machine learning and is best known for its use in market basket analysis. 
But that is not its only use case.

For example: In a store, all vegetables are placed in the same aisle, all dairy items are placed together and cosmetics form another set of such groups. Investing time and resources on deliberate product placements like this not only reduces a customer’s shopping time, but also reminds the customer of what relevant items (s)he might be interested in buying, thus helping stores cross-sell in the process. Association rules help uncover all such relationships between items from huge databases. 

Association rule mining is an unsupervised technique for uncovering patterns or relations between items. A rule expresses an association between A and B as A => B, i.e. if A is purchased, B is likely to be purchased as well. 

An association rule consists of an antecedent and a consequent.

$${\{Pen, Pencil\}} \to \{Paper\}$$
$$     {antecedent} \to consequent$$

For a given rule, the `itemset` is the set of all the items in the antecedent and the consequent.

$$itemset = \{Pen, Pencil, Paper\}$$

The goodness of an association rule is measured by three primary metrics: support, confidence and lift.


**Support**

This measure gives an idea of how frequent an itemset is in all the transactions. 

- Consider itemset1 = {bread} and itemset2 = {shampoo}. There will be far more transactions containing bread than those containing shampoo. So as you rightly guessed, itemset1 will generally have a higher support than itemset2. 

- Now consider itemset1 = {bread, butter} and itemset2 = {bread, shampoo}. Many transactions will have both bread and butter on the cart but bread and shampoo? Not so much. So in this case, itemset1 will generally have a higher support than itemset2. Mathematically, support is the fraction of the total number of transactions in which the itemset occurs.

$$
Support(\{A\} \to \{B\}) = \frac{Transactions\ containing\ both\ A\ and\ B}{Total\ number\ of\ transactions}
$$

Value of `support` helps us identify the rules worth considering for further analysis. 

For example, one might want to consider only the itemsets which occur at least 50 times out of a total of 10,000 transactions i.e. support = 0.005. 

If an itemset happens to have a very low support, we do not have enough information on the relationship between its items and hence no conclusions can be drawn from such a rule.
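The support definition can be sketched in a few lines of plain Python. The baskets below are hypothetical and invented purely for illustration, and the helper name `support` is an assumption, not part of any library.

```python
# Hypothetical baskets, invented for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "shampoo"},
    {"milk", "butter", "bread"},
    {"shampoo"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


print(support({"bread"}, transactions))             # bread is in 4 of 5 baskets
print(support({"bread", "butter"}, transactions))   # both are in 3 of 5 baskets
print(support({"bread", "shampoo"}, transactions))  # both are in 1 of 5 baskets
```

Adding an item to an itemset can never increase its support, which is why {bread, shampoo} cannot be more frequent than {bread} alone.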

**Confidence**

This measure defines the likelihood of the consequent being on the cart given that the cart already has the antecedents. 

For example, it answers the question: of all the transactions containing, say, {Kellogg's Cornflakes}, how many also had {Milk} on them? 

We can say by common knowledge that {Kellogg's Cornflakes} → {Milk} should be a high confidence rule. Technically, confidence is the conditional probability of occurrence of the consequent given the antecedent.


$$
Confidence(\{A\} \to \{B\}) = \frac{Transactions\ containing\ both\ A\ and\ B}{Transactions\ containing\ A}
$$

Consider a few more examples before moving ahead. 

- What do you think would be the confidence for {Butter} → {Bread}? 
   That is, what fraction of transactions having butter also had bread? Very high i.e. a value close to 1? That’s right. 
   
- What about {Yogurt} → {Milk}? High again. {Toothbrush} → {Milk}? Not so sure? Confidence for this rule will also be high since {Milk} is such a frequent itemset and would be present in every other transaction.

*For such a frequent consequent, it hardly matters what you have in the antecedent. The confidence for an association rule with a very frequent consequent will almost always be high.*

![Confidence](confidence.png)

<i><center>Total transactions = 100.</center></i>
<i><center>10 of them have both milk and toothbrush, 70 have milk but no toothbrush and 4 have toothbrush but no milk.</center></i>

Consider the numbers from the above figure. `Confidence` for $\{Toothbrush\} \to \{Milk\}$
will be 10/(10+4) ≈ 0.71. Looks like a high confidence value. But we know intuitively that these two products have a weak association and there is something misleading about this high confidence value. Lift is introduced to overcome this challenge.

If confidence is very high, it implies that when A is purchased then the probability of purchasing B is very high i.e. the rule is strong.

**`Considering just the value of confidence limits our capability to make any business inference.`**

**Lift**

Lift controls for the support (frequency) of consequent while calculating the conditional probability of occurrence of {B} given {A}. 

Lift is a very literal term given to this measure. Think of it as the **`lift`** that {A} provides to our confidence for having {B} on the cart. To rephrase, lift is the rise in probability of having {B} on the cart with the knowledge of {A} being present over the probability of having {B} on the cart without any knowledge about presence of {A}. Mathematically,

$$
Lift(\{A\} \to \{B\}) = \Bigg( \frac{Transactions\ containing\ both\ A\ and\ B}{Transactions\ containing\ A} \Bigg) \Big/ (Fraction\ of\ transactions\ containing\ B)
$$




In cases where {A} actually leads to {B} on the cart, value of lift will be greater than 1. 

Let us understand this with an example which will be continuation of the {Toothbrush} → {Milk} rule.

Probability of having milk on the cart with the knowledge that toothbrush is present (i.e. confidence): 10/(10+4) ≈ 0.71

Now to put this number in perspective, consider the probability of having milk on the cart without any knowledge about toothbrush: 80/100 = 0.8

These numbers show that having toothbrush on the cart actually reduces the probability of having milk on the cart to 0.71 from 0.8! This will be a lift of 0.71/0.8 ≈ 0.89. Now that’s more like the real picture. A value of lift less than 1 shows that having toothbrush on the cart does not increase the chances of occurrence of milk on the cart in spite of the rule showing a high confidence value. A value of lift greater than 1 vouches for high association between {B} and {A}. The higher the lift, the greater the chance that a customer who has already bought {A} will also buy {B}. Lift is the measure that will help store managers decide product placements in the aisles.
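The arithmetic for the {Toothbrush} → {Milk} rule can be reproduced directly from the counts in the figure. This is only a small sketch; the variable names are chosen here for illustration, and the values are computed without intermediate rounding.

```python
# Counts taken from the figure: 100 transactions in total.
total = 100
both = 10             # milk and toothbrush
milk_only = 70        # milk, no toothbrush
toothbrush_only = 4   # toothbrush, no milk

toothbrush = both + toothbrush_only   # 14 carts contain a toothbrush
milk = both + milk_only               # 80 carts contain milk

confidence = both / toothbrush        # P(milk | toothbrush)
support_milk = milk / total           # P(milk)
lift = confidence / support_milk      # confidence relative to the baseline

print(f"confidence = {confidence:.3f}")    # 0.714
print(f"P(milk)    = {support_milk:.3f}")  # 0.800
print(f"lift       = {lift:.3f}")          # 0.893
```

The lift stays below 1 even though the confidence looks high, because milk is already in 80% of all carts.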




## What does lift > 1 mean?


When is the lift of (A -> B) greater than 1? Exactly when the confidence of the rule exceeds the overall frequency of the consequent:


$$
Lift(\{A\} \to \{B\}) > 1 \iff \Bigg( \frac{Transactions\ containing\ both\ A\ and\ B}{Transactions\ containing\ A} \Bigg) > (Fraction\ of\ transactions\ containing\ B)
$$


What does this mean?
In the milk example, the probability of buying milk is 0.8, so a rule {A} → {Milk} has a lift greater than 1 only if more than 80% of the carts containing {A} also contain milk.

If lift is very high, it implies that B is purchased far more often alongside A than its overall frequency would suggest, i.e. the rule is very strong.

***`Ideally, we look for rules that have sufficiently high support, high confidence and high lift.`***


**Association Rules Mining**

Now that we understand how to quantify the importance of association of products within an itemset, the next step is to generate rules from the entire list of items and identify the most important ones. This is not as simple as it might sound. Supermarkets will have thousands of different products in store. For d items there are $2^{d} - 1$ possible non-empty itemsets, a number that grows exponentially with the number of items. Computing lift values for each of these quickly becomes computationally very expensive. How do we deal with this problem? How do we come up with a set of the most important association rules to consider? The **`Apriori`** algorithm comes to our rescue.

We will see apriori algorithm as part of our activity.
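To give a feel for how Apriori avoids scanning every possible itemset, here is a minimal pure-Python sketch of its level-wise idea: an itemset can only be frequent if all of its subsets are frequent, so candidates of size k are built only from frequent itemsets of size k-1. The function name `apriori_sketch` and the baskets are hypothetical; this is not the mlxtend implementation and makes no attempt at efficiency.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    items = sorted({item for b in baskets for item in b})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that reach the support threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical baskets, invented for illustration only.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "milk"],
    ["bread", "butter", "milk", "jam"],
]

frequent = apriori_sketch(transactions, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Here `jam` is dropped at the first level, so no candidate containing it is ever counted; that pruning is what keeps the search far smaller than the full set of itemsets.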



##### Association rule mining is a rule-based learning method used to discover frequent patterns and correlations in datasets such as transactional and relational data.

##### In essence it computes the co-occurrence statistics between items, in the form of an implication expression (X → Y).

##### For instance, in customer basket analysis, {diaper} → {beer} means that if diapers are bought, beer is likely to be put into the basket too.

#### 4 fundamental concepts in association rules:

* *(Not a Rule)* Support(X): the fraction of all instances in which X occurs. 

* Support(X→Y) is the probability of co-occurrence of both items within all data.

* Confidence(X→Y) is the probability that Y occurs given that X is present.

* Lift(X→Y) is the probability of Y being bought given that X is present, taking into account the popularity of Y as well.

* Conviction(X→Y) is a measure of implication. A value > 1 indicates that Y depends on X; the higher the value, the stronger the dependence.




# Example 1

### Before getting into more formulas and terminology, let's begin with a simple example.

Mlxtend is a rich and useful library for machine learning. It provides association rule mining utilities, including an implementation of the *apriori* algorithm.

You can install mlxtend via pip or conda.

In [3]:
pip install mlxtend



In [4]:
#import libraries
import pandas as pd
#import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

To use association rules, first we need some data in one-hot encoded format.

Imagine a grocery database in which each order ID is associated with some products...

In [11]:
# Defining the dictionary
data = {'ID':[1,2,3,4,5,6],
        'Onion':[1,0,0,1,1,1],
        'Potato':[1,1,0,1,1,1],
        'Burger':[1,1,0,0,1,1],
        'Milk':[0,1,1,1,0,1],
        'Beer':[0,0,1,0,1,0]
    
}

type(data)



dict

In [12]:
# create the dataframe 
df = pd.DataFrame(data)
print(df)
#print(type(df))

   ID  Onion  Potato  Burger  Milk  Beer
0   1      1       1       1     0     0
1   2      0       1       1     1     0
2   3      0       0       0     1     1
3   4      1       1       0     1     0
4   5      1       1       1     0     1
5   6      1       1       1     1     0




### Then, we can generate frequent itemsets based on *support*.

Here we need to set the minimum support to a value in [0,1]. Using min_support = 0.5 means we only want itemsets that occur in at least half of the transactions. Note that the default value of min_support is 0.5.

`apriori(df, min_support=0.5, use_colnames=False, max_len=None)`

<img src="tables-of-definitions-and-properties-of-association-rules-measures.png" width='600' height='400'/>

In [22]:
# Use apriori
frequent_itemsets = apriori(df[['Onion','Potato','Burger','Milk','Beer']], min_support=0.50,use_colnames=True)



In [23]:
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.666667,(Onion)
1,0.833333,(Potato)
2,0.666667,(Burger)
3,0.666667,(Milk)
4,0.666667,"(Onion, Potato)"
5,0.5,"(Burger, Onion)"
6,0.666667,"(Burger, Potato)"
7,0.5,"(Milk, Potato)"
8,0.5,"(Burger, Onion, Potato)"


Itemsets with 1, 2 or 3 items are returned, each with support ≥ 0.5.

The only itemset with 3 products is [Onion, Potato, Burger].

### Final Step: generate the rules with their corresponding support, confidence and lift, (and leverage & conviction):

```association_rules(df, metric='confidence', min_threshold=0.8)```

* Here, df means the frequent_itemsets dataframe; 

* `metric` is the measure used to decide whether a rule is of interest. You can set it to one of the five metrics (support, confidence, lift, leverage or conviction).

* `min_threshold` is the minimum value for the specified metric.

In [25]:
# Association rules, metric confidence
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.8)



In [26]:
rules



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf
1,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf
2,"(Burger, Onion)",(Potato),0.5,0.833333,0.5,1.0,1.2,0.083333,inf


In [41]:
rules1=association_rules(frequent_itemsets, metric='lift', min_threshold=1)



In [42]:
rules1



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf
1,(Potato),(Onion),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667
2,(Burger),(Onion),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333
3,(Onion),(Burger),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333
4,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf
5,(Potato),(Burger),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667
6,"(Burger, Onion)",(Potato),0.5,0.833333,0.5,1.0,1.2,0.083333,inf
7,"(Burger, Potato)",(Onion),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333
8,"(Onion, Potato)",(Burger),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333
9,(Burger),"(Onion, Potato)",0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333


In [45]:
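# Both comparisons are strict, and no rule has a lift above 1.2, so this filter returns an empty frame.
# The result gets its own name so that rules1 stays available for later cells.
rules1_strict=rules1[(rules1['confidence']>0.8) & (rules1['lift']>1.2)]


In [46]:
rules1_strict
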


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


<img src="tables-of-definitions-and-properties-of-association-rules-measures.png" width='600' height='400'/>

In [33]:
# Keep rules with lift >= 1.125 and confidence > 0.8
rules2=association_rules(frequent_itemsets, metric='lift', min_threshold=1.125)


In [34]:
rules2=rules2[rules2['confidence']>0.8]
rules2



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf
4,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf
6,"(Burger, Onion)",(Potato),0.5,0.833333,0.5,1.0,1.2,0.083333,inf


### Interpreting the result:

Every rule returned has a lift above 1, which means its items occur together more often than would be expected if they were bought independently.

Several rules are high in confidence as well, but domain knowledge is still needed to explain why.

Subsetting on lift and confidence returns the rules whose itemsets are relatively strongly correlated in this data.

We can see that:

* {Onion} → {Potato}, {Burger} → {Potato} and {Burger, Onion} → {Potato} all have a confidence of 1.0 and a lift of 1.2: every basket containing Onion or Burger also contains Potato.


### How do we come up with these interpretations? Let's look at the formula for each term:

The table of definitions and properties of association rule measures (the picture shown above) illustrates the mathematical definition and probabilistic interpretation of each measure. 

It also states the minimum, maximum and independence values.

### Some notes on Lift, Conviction & Leverage:


1.  Lift(X→Y): the likelihood of Y being bought when X is present, taking into account the popularity of Y as well.
    > When Lift = 1, X has no impact on Y  
    > When Lift > 1, X and Y occur together more often than expected under independence  
    > When Lift < 1, X and Y occur together less often than expected under independence
2.  Conviction(X→Y): a measure of implication, with value 1 if the items are unrelated.
    > A high conviction value means that the consequent depends strongly on the antecedent. For instance, with a perfect confidence score, the denominator becomes 0 (due to 1 - 1), and the conviction score is defined as 'inf'. Similar to lift, if items are independent, the conviction is 1.
3.  Leverage(X→Y): the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent. A leverage value of 0 indicates independence.
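As a rough check on these notes, the metrics can be recomputed by hand from three support values. The sketch below uses the standard definitions of confidence, lift, leverage and conviction; the helper name `rule_metrics` is an assumption made for illustration. The inputs are the Example 1 counts: Potato appears in 5 of 6 baskets, Onion in 4 of 6, and both together in 4 of 6.

```python
def rule_metrics(support_x, support_y, support_xy):
    """Metrics for the rule X -> Y, given the three supports as fractions."""
    confidence = support_xy / support_x
    lift = confidence / support_y
    leverage = support_xy - support_x * support_y
    # Conviction divides by (1 - confidence); a perfect rule is reported as infinity.
    if confidence >= 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_y) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# {Potato} -> {Onion}
potato_to_onion = rule_metrics(5 / 6, 4 / 6, 4 / 6)
# {Onion} -> {Potato}: same itemset, opposite direction
onion_to_potato = rule_metrics(4 / 6, 5 / 6, 4 / 6)

for name, metrics in [("Potato -> Onion", potato_to_onion),
                      ("Onion -> Potato", onion_to_potato)]:
    print(name, {key: round(value, 6) for key, value in metrics.items()})
```

Lift and leverage are symmetric in X and Y, while confidence and conviction depend on the direction of the rule, which matches the two rows for this pair in the rules table above.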

# Example 2 (Shopping Basket)

In [47]:
retail_shopping_basket = {'ID':[1,2,3,4,5,6],
                         'Basket':[['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
                                   ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
                                   ['Soda', 'Chips', 'Milk'],
                                   ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
                                   ['Soda', 'Coffee', 'Milk', 'Bread'],
                                   ['Beer', 'Chips']
                                  ]
                         }



In [53]:
# Convert to dataframe
retail = pd.DataFrame(retail_shopping_basket)
retail



Unnamed: 0,ID,Basket
0,1,"[Beer, Diaper, Pretzels, Chips, Aspirin]"
1,2,"[Diaper, Beer, Chips, Lotion, Juice, BabyFood,..."
2,3,"[Soda, Chips, Milk]"
3,4,"[Soup, Beer, Diaper, Milk, IceCream]"
4,5,"[Soda, Coffee, Milk, Bread]"
5,6,"[Beer, Chips]"


First one-hot encode the basket, but how?

Converting the items in each basket to one-hot columns takes three steps:

* `retail.drop('Basket', axis=1)` drops the Basket column and returns dataframe 1.
* `retail.Basket.str.join(',').str.get_dummies(',')` converts the item lists into one-hot columns and returns dataframe 2.
* `join` then combines dataframes 1 and 2.

The net effect is to drop the Basket column holding the item lists and add the corresponding one-hot columns.

In [61]:
#preparing one-hot encoded data
retail=retail.drop('Basket',axis=1).join(retail.Basket.str.join(',').str.get_dummies(','))


In [62]:
#now the retail dataframe has one hot coded data as shown below
retail



Unnamed: 0,ID,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,2,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,3,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,4,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,5,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,6,0,0,1,0,1,0,0,0,0,0,0,0,0,0



Making use of `Series.str.get_dummies`, we can easily encode lists of items in a dataframe's column!
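To make clear what the encoding step produces, the same one-hot table can be sketched in plain Python without pandas. This is only an illustration of the idea, not how `get_dummies` is implemented; the baskets are the ones from the `retail_shopping_basket` dictionary above.

```python
baskets = [
    ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
    ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
    ['Soda', 'Chips', 'Milk'],
    ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
    ['Soda', 'Coffee', 'Milk', 'Bread'],
    ['Beer', 'Chips'],
]

# One column per distinct item, in alphabetical order.
items = sorted({item for basket in baskets for item in basket})

# One row per basket: 1 if the item is in the basket, 0 otherwise.
one_hot = [[int(item in basket) for item in items] for basket in baskets]

print(items)
for row in one_hot:
    print(row)
```

Each row should line up with the corresponding row of the one-hot `retail` dataframe shown above, minus the `ID` column.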

In [86]:
#dropping ID column and using apriori algo

#frequent_itemsets_2 = retail.drop('ID',axis=1)
frequent_itemsets_2= apriori(retail[['Aspirin','BabyFood','Beer','Bread','Chips','Coffee','Diaper','IceCream','Juice','Lotion','Milk','Pretzels','Soda','Soup']], min_support=0.40,use_colnames=True)



In [87]:
frequent_itemsets_2



Unnamed: 0,support,itemsets
0,0.666667,(Beer)
1,0.666667,(Chips)
2,0.5,(Diaper)
3,0.666667,(Milk)
4,0.5,"(Beer, Chips)"
5,0.5,"(Beer, Diaper)"


Judging by support alone, [Beer, Chips] and [Beer, Diaper] are the two frequent two-item baskets of interest.

But which pair is more strongly correlated?

In [88]:
#association_rules
our_rules=association_rules(frequent_itemsets_2, metric='lift', min_threshold=1.1)
our_rules



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Beer),(Chips),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333
1,(Chips),(Beer),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333
2,(Beer),(Diaper),0.666667,0.5,0.5,0.75,1.5,0.166667,2.0
3,(Diaper),(Beer),0.5,0.666667,0.5,1.0,1.5,0.166667,inf


In [89]:
our_rules1=our_rules[(our_rules['confidence']>0.65) & (our_rules['lift']>1.15)]
our_rules1



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(Beer),(Diaper),0.666667,0.5,0.5,0.75,1.5,0.166667,2.0
3,(Diaper),(Beer),0.5,0.666667,0.5,1.0,1.5,0.166667,inf


In [90]:
association_rules(frequent_itemsets_2)



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Diaper),(Beer),0.5,0.666667,0.5,1.0,1.5,0.166667,inf


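What can you discover by comparing the last two results? *(Hint: by default, `association_rules` uses `metric='confidence'` and `min_threshold=0.8`.)*

With the defaults, only {Diaper} → {Beer} (confidence 1.0) survives; {Beer} → {Diaper} has the same lift of 1.5 but a confidence of only 0.75. Clearly, {Diaper, Beer} is the most associated itemset in this data!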
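The difference between the two directions of the Beer/Diaper rule can be verified by counting baskets directly. The following is a small sketch in plain Python using the six baskets from this example; the 0.8 cut-off mirrors the default `min_threshold` of `association_rules` when `metric='confidence'`.

```python
baskets = [
    ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
    ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
    ['Soda', 'Chips', 'Milk'],
    ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
    ['Soda', 'Coffee', 'Milk', 'Bread'],
    ['Beer', 'Chips'],
]

n = len(baskets)
n_beer = sum('Beer' in b for b in baskets)                       # 4 baskets
n_diaper = sum('Diaper' in b for b in baskets)                   # 3 baskets
n_both = sum('Beer' in b and 'Diaper' in b for b in baskets)     # 3 baskets

# Confidence depends on which item is the antecedent.
confidence = {
    'Diaper -> Beer': n_both / n_diaper,
    'Beer -> Diaper': n_both / n_beer,
}
# Lift is the same in both directions.
lift = (n_both / n) / ((n_beer / n) * (n_diaper / n))

passing = [rule for rule, c in confidence.items() if c >= 0.8]

print(confidence)
print(round(lift, 6))
print(passing)
```

Every basket with diapers also contains beer, but one beer basket has no diapers, which is why only one direction clears a confidence threshold of 0.8.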

# Example 3 - Movie Genre Associations

It seems a bit boring playing only with basket analysis and imaginary datasets.

In this example, let's play with an open dataset [MovieLens (small)](https://grouplens.org/datasets/movielens/).

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

We might want to take a look at the data first:

In [95]:
movies = pd.read_csv("movies.csv")
movies.head(5)
movies1=movies.drop('title', axis=1)
movies1.head()



Unnamed: 0,movieId,genres
0,1,Adventure|Animation|Children|Comedy|Fantasy
1,2,Adventure|Children|Fantasy
2,3,Comedy|Romance
3,4,Comedy|Drama|Romance
4,5,Comedy


Convert the data into one-hot encoded form, following the same `drop` / `get_dummies` / `join` pattern used for the retail baskets in Example 2 (here the genres are separated by `|`).

In [111]:
hs=movies1.genres
hs



0       Adventure|Animation|Children|Comedy|Fantasy
1                        Adventure|Children|Fantasy
2                                    Comedy|Romance
3                              Comedy|Drama|Romance
4                                            Comedy
                           ...                     
9120                        Adventure|Drama|Romance
9121                Action|Adventure|Fantasy|Sci-Fi
9122                                    Documentary
9123                                         Comedy
9124                                    Documentary
Name: genres, Length: 9125, dtype: object

In [112]:
# genres is already a '|'-separated string, so get_dummies can split it directly
movies2=movies1.drop('genres',axis=1).join(movies1.genres.str.get_dummies('|'))



In [113]:
movies2



Unnamed: 0,movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,0,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9120,162672,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
9121,163056,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9122,163949,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9123,164977,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [114]:
# One hot encoding
movies_ohe = movies2
movies_ohe.shape



(9125, 21)

In [115]:
movies_ohe.head(5)



Unnamed: 0,movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,0,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [117]:
movies_ohe.columns



Index(['movieId', '(no genres listed)', 'Action', 'Adventure', 'Animation',
       'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
       'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance',
       'Sci-Fi', 'Thriller', 'War', 'Western'],
      dtype='object')

In [118]:

frequent_itemsets_movies = apriori(movies_ohe[['Action', 'Adventure', 'Animation',
       'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
       'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance',
       'Sci-Fi', 'Thriller', 'War', 'Western']], min_support=0.025,use_colnames=True)



In [119]:
# association rules
rules_movies =  association_rules(frequent_itemsets_movies, metric='lift', min_threshold=1.25)



In [120]:
rules_movies



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Action),(Adventure),0.169315,0.122411,0.058301,0.344337,2.812955,0.037575,1.338475
1,(Adventure),(Action),0.122411,0.169315,0.058301,0.476276,2.812955,0.037575,1.586111
2,(Action),(Crime),0.169315,0.120548,0.038247,0.22589,1.87386,0.017836,1.136081
3,(Crime),(Action),0.120548,0.169315,0.038247,0.317273,1.87386,0.017836,1.216716
4,(Action),(Sci-Fi),0.169315,0.086795,0.040986,0.242071,2.789015,0.026291,1.20487
5,(Sci-Fi),(Action),0.086795,0.169315,0.040986,0.472222,2.789015,0.026291,1.573929
6,(Action),(Thriller),0.169315,0.189479,0.062904,0.371521,1.960746,0.030822,1.289654
7,(Thriller),(Action),0.189479,0.169315,0.062904,0.331984,1.960746,0.030822,1.24351
8,(Children),(Adventure),0.06389,0.122411,0.02926,0.457976,3.741299,0.021439,1.619096
9,(Adventure),(Children),0.122411,0.06389,0.02926,0.239033,3.741299,0.021439,1.230158


***As we can see, the support and hence the confidence values in this dataset are fairly small, which makes it difficult to interpret the result based on these two values alone. Lift and conviction, in contrast, remain intuitive and representative. That is why we should understand the meaning of all 5 metrics to interpret the result accurately!***

In [121]:
rules_movies[(rules_movies.conviction>1.25)]



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Action),(Adventure),0.169315,0.122411,0.058301,0.344337,2.812955,0.037575,1.338475
1,(Adventure),(Action),0.122411,0.169315,0.058301,0.476276,2.812955,0.037575,1.586111
5,(Sci-Fi),(Action),0.086795,0.169315,0.040986,0.472222,2.789015,0.026291,1.573929
6,(Action),(Thriller),0.169315,0.189479,0.062904,0.371521,1.960746,0.030822,1.289654
8,(Children),(Adventure),0.06389,0.122411,0.02926,0.457976,3.741299,0.021439,1.619096
10,(Fantasy),(Adventure),0.071671,0.122411,0.030685,0.428135,3.497518,0.021912,1.534608
12,(Sci-Fi),(Adventure),0.086795,0.122411,0.027726,0.319444,2.609607,0.017101,1.289519
14,(Animation),(Children),0.048986,0.06389,0.027068,0.552573,8.648758,0.023939,2.092205
15,(Children),(Animation),0.06389,0.048986,0.027068,0.423671,8.648758,0.023939,1.650122
17,(Children),(Comedy),0.06389,0.363288,0.032877,0.51458,1.416453,0.009666,1.311672


In [122]:
rules_movies[(rules_movies.confidence>0.5)&(rules_movies.lift>1.15)]



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
14,(Animation),(Children),0.048986,0.06389,0.027068,0.552573,8.648758,0.023939,2.092205
17,(Children),(Comedy),0.06389,0.363288,0.032877,0.51458,1.416453,0.009666,1.311672
19,(Romance),(Comedy),0.169315,0.363288,0.090082,0.532039,1.464511,0.028572,1.360609
23,(Romance),(Drama),0.169315,0.478356,0.10126,0.598058,1.250236,0.020267,1.29781
24,(War),(Drama),0.040219,0.478356,0.031014,0.771117,1.612015,0.011775,2.279087
29,(Mystery),(Thriller),0.059507,0.189479,0.036055,0.605893,3.197672,0.024779,2.056601


* Although we might expect {Romance, Drama} to be a strong pair, it is not as correlated as other groups such as {Animation, Children}, which has much higher lift and conviction values.

Ordering a subset of the rules by lift and conviction shows:

* The highest correlation: {Animation, Children} correlates in both directions! Recall those Pixar & Disney films that we love watching.
* {Children, Adventure} ...
* {Fantasy, Adventure} ... How to interpret these two pairs?

The best way is to go back to your movies table and check it out!

# Summary

To recap, a straightforward 4-step approach to association rule mining:

1. One-hot encode the baskets in a dataframe.
2. Generate frequent itemsets using `apriori`.
3. Generate rules with `association_rules`.
4. Interpret & evaluate the result with the metrics.

### References:
1. [Introduction to Market Basket Analysis in Python](http://pbpython.com/market-basket-analysis.html)
2. [Movie genre associations](https://mathematicaforprediction.wordpress.com/2013/10/06/movie-genre-associations/)
3. [Mining Association Rules](https://paginas.fe.up.pt/~ec/files_0506/slides/04_AssociationRules.pdf)
4. [Association Rules Generation from Frequent Itemsets](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)
5. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872