# The Apriori Algorithm in Python: Mining Movie Choices

Credits to: 
- Usman Malik https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/
- Hadelin de Ponteves and his course Machine Learning from A to Z
- My wonderful Data Engineer Shokat Ali

Association rules are used to identify underlying relations between different items. Take the example of a movie platform where customers can rent or buy movies. Usually there are clear patterns in what customers choose, for instance the superhero theme or the kids category.

More profit can be generated if the relationships between the movies can be identified.

If movies A and B are frequently bought together, this pattern can be exploited to increase profit.

People who buy or rent one of these two movies can be nudged into renting or buying the other one, via campaigns or suggestions within the platform.

Today we are very familiar with these recommendation engines on Netflix and Amazon, to name the most prominent.

## The Apriori Algorithm

The Apriori algorithm belongs to the family of association rule mining algorithms.

### Theory of Apriori Algorithm

There are three major components of the Apriori algorithm:

- Support
- Confidence
- Lift

Let's analyze each component. Before we start, we need to agree on the time window that makes business sense. In our example it could be all the movies purchased or rented by individual customers in a month or a year.

#### Support

Support in our use case refers to the popularity of a movie. It is calculated as the number of transactions that contain the movie divided by the total number of transactions.

For instance, if 25 out of 100 transactions contain The Avengers, the support for The Avengers can be calculated as:

Support(The Avengers) = (Transactions containing The Avengers)/(Total Transactions)

Support(The Avengers) = 25/100 = 25%
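
The same calculation can be sketched in a few lines of Python. The toy `transactions` list and the `support` helper below are hypothetical and only serve to reproduce the 25/100 figure from the example.

```python
# Toy data (hypothetical): 100 transactions, 25 of which contain "The Avengers".
transactions = (
    [["Thor", "The Avengers"]] * 10
    + [["The Avengers"]] * 15
    + [["Thor"]] * 10
    + [["Moana"]] * 65
)

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

print(support({"The Avengers"}, transactions))  # 0.25
```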

#### Confidence

Confidence refers to the likelihood that The Avengers is also bought or rented if Thor is bought or rented. It is calculated as the number of transactions in which Thor and The Avengers were bought together, divided by the total number of transactions in which Thor is bought or rented.

Confidence(Thor → Avengers) = (Transactions containing both (Thor and The Avengers))/(Transactions containing Thor)

Suppose 10 transactions contain both Thor and The Avengers, while 20 transactions contain Thor. The likelihood of buying The Avengers when Thor is bought is then:

Confidence(Thor → Avengers) = 10/20  
                            = 50%
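
As a rough sketch, confidence can be computed by counting transactions directly. The toy data and the `confidence` helper are hypothetical; the counts are chosen to match the 10 and 20 transactions used in the example.

```python
# Toy data (hypothetical): Thor appears in 20 transactions,
# 10 of which also contain The Avengers.
transactions = (
    [["Thor", "The Avengers"]] * 10
    + [["The Avengers"]] * 15
    + [["Thor"]] * 10
    + [["Moana"]] * 65
)

def confidence(base, add, transactions):
    """Share of transactions containing `base` that also contain `add`."""
    base = set(base)
    both = base | set(add)
    n_base = sum(1 for t in transactions if base.issubset(t))
    n_both = sum(1 for t in transactions if both.issubset(t))
    return n_both / n_base

print(confidence({"Thor"}, {"The Avengers"}, transactions))  # 0.5
```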

#### Lift

Lift(Thor → Avengers) refers to the increase in the ratio of sales of The Avengers when Thor is sold. It is calculated by dividing Confidence(Thor → Avengers) by Support(Avengers). Mathematically it can be represented as:

Lift(Thor → Avengers) = (Confidence (Thor → Avengers))/(Support (Avengers))  
With the numbers above, it can be calculated as:

Lift(Thor → Avengers) = 50%/25%  
                      = 2

Lift tells us that customers who buy Thor are 2 times as likely to buy The Avengers as customers in general.

A lift of 1 means there is no association between the products. A lift greater than 1 means that the products are more likely to be bought together. Finally, a lift of less than 1 refers to the case where the two products are unlikely to be bought together.
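
A minimal sketch of the lift calculation, again on hypothetical toy data built to match the Thor and Avengers figures; the `lift` helper is an illustrative name, not part of any library.

```python
# Toy data (hypothetical): 100 transactions, Thor in 20, The Avengers in 25,
# both together in 10.
transactions = (
    [["Thor", "The Avengers"]] * 10
    + [["The Avengers"]] * 15
    + [["Thor"]] * 10
    + [["Moana"]] * 65
)

def lift(base, add, transactions):
    """Confidence(base -> add) divided by Support(add)."""
    base, add = set(base), set(add)
    n = len(transactions)
    n_base = sum(1 for t in transactions if base.issubset(t))
    n_add = sum(1 for t in transactions if add.issubset(t))
    n_both = sum(1 for t in transactions if (base | add).issubset(t))
    confidence = n_both / n_base
    support_add = n_add / n
    return confidence / support_add

print(lift({"Thor"}, {"The Avengers"}, transactions))  # 2.0
```

In this toy data no transaction contains both Thor and Moana, so `lift({"Thor"}, {"Moana"}, transactions)` is 0, an extreme case of a lift below 1.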

### Steps Involved in Apriori Algorithm

For large sets of data, there can be hundreds of items in hundreds of thousands of transactions. The Apriori algorithm tries to extract rules for each possible combination of items. For instance, lift can be calculated for item 1 and item 2, item 1 and item 3, item 1 and item 4, then item 2 and item 3, item 2 and item 4, and then for larger combinations such as item 1, item 2 and item 3; similarly item 1, item 2 and item 4, and so on.

As you can see from the above example, this process can be extremely slow due to the number of combinations. To speed up the process, we need to perform the following steps:

1. Set a minimum value for support and confidence. This means that we are only interested in rules for items that occur with a certain minimum frequency (support) and that co-occur with other items at a minimum rate (confidence).
2. Extract all the subsets with a support value higher than the minimum threshold.
3. Select all the rules from those subsets with a confidence value higher than the minimum threshold.
4. Order the rules by descending lift.
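
The four steps above can be sketched in plain Python as follows. This is a minimal illustration on hypothetical toy data, not the code of any particular library; the function names `frequent_itemsets` and `rules` and the two thresholds are assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 2: level-wise search that only extends frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # Keep the candidates whose support reaches the threshold.
        level = {}
        for candidate in current:
            s = sum(1 for t in transactions if candidate <= t) / n
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Build (k+1)-item candidates; drop any with an infrequent subset.
        k += 1
        current = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in current
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def rules(frequent, min_confidence):
    """Steps 3 and 4: keep confident rules and sort them by lift."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for base in map(frozenset, combinations(itemset, r)):
                add = itemset - base
                conf = supp / frequent[base]
                if conf >= min_confidence:
                    lift = conf / frequent[add]
                    found.append((set(base), set(add), supp, conf, lift))
    return sorted(found, key=lambda rule: rule[4], reverse=True)

# Step 1: thresholds chosen arbitrarily for the toy data below.
transactions = [
    ["Thor", "The Avengers"],
    ["Thor", "The Avengers", "Moana"],
    ["Thor"],
    ["Moana", "Zootopia"],
    ["Moana", "Zootopia"],
    ["The Avengers"],
    ["Moana"],
    ["Zootopia", "Moana", "Thor"],
]
found = rules(frequent_itemsets(transactions, min_support=0.25), min_confidence=0.55)
for base, add, supp, conf, lift in found:
    print(base, "->", add, round(supp, 3), round(conf, 3), round(lift, 3))
```

The pruning in `frequent_itemsets` is what keeps the search tractable: a combination is only counted if all of its smaller subsets already passed the support threshold.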

### Implementing Apriori Algorithm with Python

In this section we will use the Apriori algorithm to find rules that describe associations between different movies, given about 7500 transactions over the course of a month. The movies in the dataset were picked at random; this is not real data.

Another interesting point is that we do not need to write the code to calculate support, confidence, and lift for all the possible combinations of items. We will use an off-the-shelf library where all of the code has already been implemented.

The library is apyori. Install it with the following command in your environment: pip install apyori

If you are planning to embed this Python code inside an Alteryx workflow (2018.3 and up), uncomment the following lines:

In [None]:
""" ALTERYX
from ayx import Alteryx
Alteryx.installPackages("apyori")
Alteryx.installPackages("numpy")
Alteryx.installPackages("pandas")
"""

#### Import the Libraries
The first step, as always, is to import the required libraries. Execute the following script to do so:

In [2]:
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori

#### Importing the Dataset
Now let's import the dataset and see what we're working with. 

In [3]:
movie_data = pd.read_csv('movie_dataset.csv', header = None)
num_records = len(movie_data)
print(num_records)

7501


In [4]:
movie_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,The Revenant,13 Hours,Allied,Zootopia,Jigsaw,Achorman,Grinch,Fast and Furious,Ghostbusters,Wolverine,Mad Max,John Wick,La La Land,The Good Dunosaur,Ninja Turtles,The Good Dunosaur Bad Moms,2 Guns,Inside Out,Valerian,Spiderman 3
1,Beirut,Martian,Get Out,,,,,,,,,,,,,,,,,
2,Deadpool,,,,,,,,,,,,,,,,,,,
3,X-Men,Allied,,,,,,,,,,,,,,,,,,
4,Ninja Turtles,Moana,Ghost in the Shell,Ralph Breaks the Internet,John Wick,,,,,,,,,,,,,,,


Use the following script if you are reading data inside an Alteryx workflow:

In [5]:
""" ALTERYX
movie_data = Alteryx.read("#1")
"""


Now we will use the Apriori algorithm to find out which movies are commonly bought or rented together, so that the platform owner can suggest or advertise related movies together in order to increase profit.

#### Data Preprocessing

The Apriori library we are going to use requires our dataset to be in the form of a list of lists, where the whole dataset is a big list and each transaction in the dataset is an inner list within the outer big list. Currently we have data in the form of a pandas dataframe. To convert our pandas dataframe into a list of lists, execute the following script:

In [6]:
records = []  
for i in range(0, num_records):  
    records.append([str(movie_data.values[i,j]) for j in range(0, 20)])

In [7]:
records

[['The Revenant',
  '13 Hours',
  'Allied',
  'Zootopia',
  'Jigsaw',
  'Achorman',
  'Grinch',
  'Fast and Furious',
  'Ghostbusters',
  'Wolverine',
  'Mad Max',
  'John Wick',
  'La La Land',
  'The Good Dunosaur',
  'Ninja Turtles',
  'The Good Dunosaur Bad Moms',
  '2 Guns',
  'Inside Out',
  'Valerian',
  'Spiderman 3'],
 ['Beirut',
  'Martian',
  'Get Out',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['Deadpool',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['X-Men',
  'Allied',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['Ninja Turtles',
  'Moana',
  'Ghost in the Shell',
  'Ralph Breaks the Internet',
  'John Wick',
  'nan'
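
As the output shows, empty cells end up in each transaction as the string 'nan', so 'nan' is treated like any other item. One possible way to avoid this, sketched here on a small hypothetical sample rather than the full dataset, is to filter those placeholders out before mining. The name `clean_records` is an assumption for the example, and rule counts obtained from cleaned data may differ from those quoted in this article.

```python
# Hypothetical sample in the same shape as `records` above.
records = [
    ["Beirut", "Martian", "Get Out", "nan", "nan"],
    ["Deadpool", "nan", "nan", "nan", "nan"],
    ["X-Men", "Allied", "nan", "nan", "nan"],
]

# Keep only real titles; 'nan' is the string form of a missing cell.
clean_records = [[movie for movie in row if movie != "nan"] for row in records]

print(clean_records)
```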

#### Applying Apriori

We can now specify the parameters of the apriori function.

- The list of transactions
- min_support
- min_confidence
- min_lift
- min_length (the minimum number of items that you want in your rules, typically 2)

Let's suppose that we want only movies that are purchased at least 40 times in a month. The minimum support for those items can be calculated as 40/7500 = 0.0053. The minimum confidence for the rules is 20% or 0.2. Similarly, we specify the value for lift as 3, and finally min_length is 2 since we want at least two movies in our rules. These values are chosen mostly arbitrarily and need to be fine-tuned empirically.
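
The support threshold is simply a minimum count divided by the number of transactions, which can be written out as below. The variable names are illustrative only.

```python
min_count = 40             # a movie must appear in at least this many transactions
total_transactions = 7500  # approximate number of transactions in the month

min_support = round(min_count / total_transactions, 4)
print(min_support)  # 0.0053
```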

Execute the following script:

In [None]:
association_rules = apriori(records, min_support=0.0053, min_confidence=0.20, min_lift=3, min_length=2)
association_results = list(association_rules)  

In the second line we convert the rules found by the apriori function into a list, since it is easier to view the results in this form.

#### Viewing the Results
Let's first find the total number of rules mined by the apriori function. Execute the following script:

In [None]:
print(len(association_results))

The script above should return 32. Each item corresponds to one rule.

Let's print the first item in the association_results list to see the first rule. Execute the following script:

In [None]:
print(association_results[0])

Each item in the list is a record containing three fields. The first field shows the movies in the rule, the second is the support, and the third holds the confidence and lift of the rule.

For instance, from the first item we can see that Red Sparrow and Green Lantern are commonly bought together.

The support value for the first rule is 0.0057. This number is calculated by dividing the number of transactions containing both Red Sparrow and Green Lantern by the total number of transactions. The confidence level for the rule is 0.3006, which shows that out of all the transactions that contain Red Sparrow, 30% also contain Green Lantern. Finally, the lift of 3.79 tells us that Green Lantern is 3.79 times more likely to be bought by customers who buy Red Sparrow compared to the default likelihood of the sale of Green Lantern.

The following script displays the rules in a data frame in a much more legible way:

In [None]:
results = []
for item in association_results:
    # Each record holds the itemset, its support, and a list of
    # ordered statistics (base items, added items, confidence, lift).
    stat = item.ordered_statistics[0]

    # Rule direction: customers who chose the base also chose the add-on
    value0 = ', '.join(sorted(stat.items_base))
    value1 = ', '.join(sorted(stat.items_add))

    # Round the metrics instead of truncating their string form
    value2 = round(item.support, 4)
    value3 = round(stat.confidence, 4)
    value4 = round(stat.lift, 4)

    rows = (value0, value1, value2, value3, value4)
    results.append(rows)

labels = ['Title 1', 'Title 2', 'Support', 'Confidence', 'Lift']
movie_suggestion = pd.DataFrame.from_records(results, columns=labels)

print(movie_suggestion)

Use this script if you want to output the results in Alteryx:

In [None]:
""" ALTERYX
Alteryx.write(movie_suggestion, 1)
"""

## Conclusion

Association rule mining algorithms such as Apriori are very useful for finding simple associations between data items. They are easy to implement and easy to explain. Google, Amazon, Netflix, and Spotify use more complex algorithms for their recommendation engines.