# Practice Session 04: Association rules mining

Association rule mining techniques are useful for finding common patterns of items in large data sets. One specific application, **market basket analysis**, is useful for online shops: if we know that items A and B are frequently bought together, we can design actions to increase profit, such as:

- A and B can be placed together, so that a customer who buys one of the products does not have to go far to find the other.
- People who buy one of the products can be targeted through an advertisement campaign to buy the other.
- Collective discounts can be offered on these products if the customer buys both of them.
- Both A and B can be packaged together.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Author: <font color="blue">Judith Camacho</font>

E-mail: <font color="blue">judith.camacho01@estudiant.upf.edu</font>

Date: <font color="blue">19/10/2020</font>

In [2]:

import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori



If the apyori library is not already installed on your machine, you can install it with: `pip install apyori`

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 0. The Apriori Algorithm in a nutshell

There are three major components of the Apriori algorithm, which we describe below using as an example the case where transactions are purchase histories.

**Support**: the number of transactions containing a particular item divided by the total number of transactions:

   *Support(A) = (Transactions containing (A))/(Total Transactions)*

**Confidence**: indicates the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought:

   *Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)*

**Lift**: the increase in the ratio of sales of B when A is sold. Lift(A→B) can be calculated by dividing Confidence(A→B) by Support(B):

   *Lift(A→B) = (Confidence (A→B))/(Support (B))*
   
A Lift of 1 means there is no association between products A and B. Lift greater than 1.0 means products A and B are more likely to be bought together. Lift less than 1.0 indicates two products are unlikely to be bought together.
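
Substituting the definition of confidence into the lift formula gives an equivalent form. It makes explicit that lift compares the observed co-occurrence of A and B with what independence would predict, and that it is symmetric in A and B:

$$
\mathrm{Lift}(A \to B) = \frac{\mathrm{Confidence}(A \to B)}{\mathrm{Support}(B)} = \frac{\mathrm{Support}(A \text{ and } B)}{\mathrm{Support}(A) \cdot \mathrm{Support}(B)}
$$

If A and B were bought independently, Support(A and B) would equal Support(A) · Support(B), which gives a lift of exactly 1.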

The Apriori algorithm first finds itemsets having the desired level of support, and then within those itemsets tries to derive rules having the desired confidence and lift.
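
The first phase described above (finding itemsets with enough support) can be sketched in plain Python. This is a minimal illustration of the level-wise idea, not the code apyori actually runs; the function name `frequent_itemsets` and the toy baskets are assumptions made for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Candidates of size k are unions of two frequent itemsets of size k-1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a frequent itemset is frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy baskets, invented for illustration
toy = [
    ['beer', 'chips'],
    ['beer', 'chips', 'nuts'],
    ['beer', 'nuts'],
    ['chips'],
]
found = frequent_itemsets(toy, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what keeps the search tractable: an itemset is only counted against the data if all of its smaller subsets were already found to be frequent.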

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1. Playing with apyori

The [apyori library](https://pypi.org/project/apyori/) is an implementation of the Apriori algorithm. Its typical usage is to receive a list of transactions and then print the association rules it found.

To use this library, we pass a list in which each element represents a transaction, for instance:

```python
transactions = [
    ['beer', 'chips', 'nuts', 'olives'],
    ['beer', 'chips', 'olives'],
    ['chips', 'nuts' ],
    ['chips', 'olives'],
    ['beer', 'nuts' ],
    ['chips'],
    ['nuts', 'olives'],
    ['beer', 'nuts'],
    ['beer', 'chips', 'olives'],
    ['beer', 'nuts', 'olives'],
]
results = list(apriori(transactions, min_support=0.2, min_confidence=0.75, min_lift=1.0))
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your own example of transactions (at least 20 transactions) and execution of the apriori algorithm, in which you should obtain at least ONE and at most THREE rules.</font>

In [72]:
'''transactions = [
    ['rice', 'chicken', 'potatoes', 'olives', 'chocolate'],
    ['rice', 'chicken', 'olives'],
    ['potatoes', 'chocolate'],
    ['chicken', 'olives'],
    ['rice', 'potatoes', 'chocolate'],
    ['chicken', 'rice'],
    ['chocolate', 'olives'],
    ['rice', 'chocolate'],
    ['rice', 'chicken', 'olives'], 
    ['rice', 'olives', 'chocolate'],
    ['chocolate', 'rice'],
    ['chicken', 'potatoes', 'olives'],
    ['rice', 'olives'],
    ['chocolate'],
    ['chocolate', 'rice', 'potatoes'],
    ['chicken', 'potatoes'],
    ['rice', 'potatoes', 'chicken'],
    ['rice', 'chicken', 'olives'],
    ['olives', 'chocolate'],
    ['chicken', 'chocolate', 'olives'],
]'''

transactions = [
    ['beer', 'chips', 'nuts', 'olives'],
    ['beer', 'chips', 'olives'],
    ['chips', 'nuts' ],
    ['chips', 'olives'],
    ['beer', 'nuts' ],
    ['chips'],
    ['nuts', 'olives'],
    ['beer', 'nuts'],
    ['beer', 'chips', 'olives'], 
    ['beer', 'nuts', 'olives'], 

    ['chips', 'nuts', 'olives'],
    ['olives'],
    ['chips', 'nuts', 'olives'],
    ['beer', 'olives'],
    ['olives', 'nuts' ],
    ['beer'],
    ['nuts', 'olives', 'beer'],
    ['nuts'],
    ['chips', 'olives'], 
    ['beer', 'nuts'],
]

results = list(apriori(transactions, min_support=0.2, min_confidence=0.75, min_lift=1.0))

The function below, which you can leave as-is, prints the output of the apyori library in a readable format. Use it to print the results of your association rules mining.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [43]:
def print_apyori_output(association_results):
    for relation_record in association_results:
        itemset = list(relation_record.items)
        
        # Consider only itemsets with two or more elements
        if len(itemset) > 1: 
        
            print("Rules involving itemset %s" % itemset)
            support = relation_record.support

            for rules in relation_record.ordered_statistics:
                antecedent = list(rules.items_base)
                consequent = list(rules.items_add)
                confidence = rules.confidence
                lift = rules.lift

                print("%s => %s (support=%.2f, confidence=%.2f, lift=%.2f)" %
                      (antecedent, consequent, support, confidence, lift))
            print()

<font size="+1" color="red">Replace this cell with (1) a printout of the rules you have obtained, and (2) for each of those rules, indicate clearly how the support, confidence, and lift is calculated. Do not merely repeat the formula: indicate how each number is computed based on the transactions you provided, as if you were trying to verify that the numbers are correct.</font>

In [73]:
print_apyori_output(results)

Rules involving itemset ['olives', 'chips']
['chips'] => ['olives'] (support=0.35, confidence=0.78, lift=1.20)




Support is 0.35 because 7 of the 20 transactions contain both chips and olives (7/20 = 0.35).

Confidence is 0.78 because 7 transactions contain both chips and olives, while 9 contain chips (7/9 = 0.777...).

Lift is 1.20 because the confidence (7/9) divided by the support of olives, which appear in 13 of the 20 transactions (13/20 = 0.65), gives 0.777.../0.65 ≈ 1.20. A lift above 1 means that chips and olives are bought together more often than would be expected if they were independent, which matches the transaction list.
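
These figures can be double-checked by counting directly over the 20 transactions, without using apyori. The helper name `count` is an assumption made for this sketch; the transactions are copied from the cell above.

```python
transactions = [
    ['beer', 'chips', 'nuts', 'olives'],
    ['beer', 'chips', 'olives'],
    ['chips', 'nuts'],
    ['chips', 'olives'],
    ['beer', 'nuts'],
    ['chips'],
    ['nuts', 'olives'],
    ['beer', 'nuts'],
    ['beer', 'chips', 'olives'],
    ['beer', 'nuts', 'olives'],
    ['chips', 'nuts', 'olives'],
    ['olives'],
    ['chips', 'nuts', 'olives'],
    ['beer', 'olives'],
    ['olives', 'nuts'],
    ['beer'],
    ['nuts', 'olives', 'beer'],
    ['nuts'],
    ['chips', 'olives'],
    ['beer', 'nuts'],
]


def count(items):
    """Number of transactions that contain every item in `items`."""
    return sum(set(items) <= set(t) for t in transactions)


n = len(transactions)
support = count(['chips', 'olives']) / n                      # 7 / 20
confidence = count(['chips', 'olives']) / count(['chips'])    # 7 / 9
lift = confidence / (count(['olives']) / n)                   # (7/9) / (13/20)

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# prints: support=0.35, confidence=0.78, lift=1.20
```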

# 2. Load and prepare the services purchased dataset

Next we will use a dataset contained in `services_purchased.csv` with 1000 customers that purchased up to 8 different services from a portfolio of a Big Internet Player. The portfolio includes:

- *WEBHOSTING*: Web hosting
- *OFFICESUITE*: Office suite that includes email and office tools such as documents, spreadsheets and presentations
- *SECURITY*: Security solutions to protect against cyber-attacks
- *CLOUD_IAAS*: Cloud sub-product: infrastructure as a service
- *CLOUD_PAAS*: Cloud sub-product: platform as a service
- *CONTENTMGM*: Content management solution such as Wordpress, Joomla!, Drupal, etc.
- *CHATBOT*: Chatbot for customer care
- *ADVERTISING*: Advertising

Each record (row) corresponds to a company and each column represents one of the products from the portfolio and can take the value 1 if the product was purchased or 0 if it was not.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [76]:
INPUT_FILENAME = "services_purchased.csv"

In [111]:
dataset = pd.read_csv(INPUT_FILENAME, sep=",")
dataset.head()
dataset.size

9000

<font size="+1" color="red">Replace this cell with code to print how many customers have requested each service.</font>

In [93]:
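# Note: columns[1:8] covers WEBHOSTING through CHATBOT only; the last service,
# ADVERTISING, sits at position 8 and would need columns[1:9] to be included
for col in dataset.columns[1:8]:
    print(col, dataset[col].sum())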

WEBHOSTING 274
OFFICESUITE 176
SECURITY 608
CLOUD_IAAS 67
CLOUD_PAAS 6
CONTENTMGM 152
CHATBOT 0
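
The printout above lists seven services, while the portfolio has eight: the slice `[1:8]` stops before ADVERTISING. A sketch that avoids positional slicing is to drop the identifier column by name and sum the rest. The miniature table below is hypothetical and its values are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the purchases table (values invented for illustration)
toy = pd.DataFrame({
    "ID_customer": [1, 2, 3, 4],
    "WEBHOSTING":  [1, 0, 1, 0],
    "SECURITY":    [1, 1, 1, 0],
    "ADVERTISING": [0, 0, 1, 0],
})

# Drop the identifier by name, then sum each 0/1 column to count its buyers
counts = toy.drop(columns="ID_customer").sum()
print(counts)
```

Because the service columns are selected by exclusion rather than by position, the count keeps covering every service if columns are added or reordered.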


<font size="+1" color="red">Replace this cell with code to verify that all customers have purchased at least one service, otherwise remove them from the dataset.</font>

In [112]:
cols_list = list(dataset)

cols_list.remove("ID_customer")
dataset["sum"] = dataset[cols_list].sum(axis=1)

dataset = dataset[dataset["sum"]>0]
dataset.size

7530

<font size="+1" color="red">Replace this cell with code to remove the ID_customer column, which we do not need.</font>

In [117]:

cols = dataset.columns
newcols = cols.drop(["ID_customer", "sum"])

dataset1= dataset[newcols]

dataset1.head()

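|   | WEBHOSTING | OFFICESUITE | SECURITY | CLOUD_IAAS | CLOUD_PAAS | CONTENTMGM | CHATBOT | ADVERTISING |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |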


Now, you need to create a variable named `transactions` containing the dataset as a list of transactions.

The first five elements of this `transactions` variable should be:

```python
[
  ['SECURITY'],
  ['OFFICESUITE', 'SECURITY'],
  ['WEBHOSTING', 'SECURITY', 'CONTENTMGM'],
  ['SECURITY'],
  ['WEBHOSTING', 'OFFICESUITE', 'SECURITY', 'CONTENTMGM'],
  ...
]
```

You can iterate through the rows of a dataframe `df` with `for recordnum, record in df.iterrows()`  and through its columns with `for column in df.columns`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create the "transactions" list.</font>

In [124]:
transactions = []
for index, row in dataset1.iterrows():
    purchased_products = []
    for col in dataset1.columns:
        if row[col] == 1:
            purchased_products.append(col)
    # Append once per customer, after all columns have been checked
    transactions.append(purchased_products)

In [125]:
print(transactions[:5])

[['SECURITY'], ['OFFICESUITE', 'SECURITY'], ['WEBHOSTING', 'SECURITY', 'CONTENTMGM'], ['SECURITY'], ['WEBHOSTING', 'OFFICESUITE', 'SECURITY', 'CONTENTMGM']]

# 3. Run the Apriori algorithm

Execute the Apriori algorithm using [apyori.apriori](https://pypi.org/project/apyori/) **twice**, with different minimum values for support, confidence, and lift. **Remember to set the "lift" parameter to a value strictly greater than 1.0.** 

Your first run should seek rules with high support, and should return a set having more than 1 and less than 5 rules.

Your second run should seek rules with high lift, and should return a set having more than 1 and less than 5 rules.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to produce and print high-support rules.</font>

In [135]:
results = list(apriori(transactions, min_support=0.1, min_confidence=0.70, min_lift=1.0))
results1 = list(apriori(transactions, min_support=0.05, min_confidence=0.75, min_lift=1.0))
print_apyori_output(results)


Rules involving itemset ['CONTENTMGM', 'SECURITY']
['CONTENTMGM'] => ['SECURITY'] (support=0.18, confidence=0.91, lift=1.12)



In [134]:
print_apyori_output(results1)

Rules involving itemset ['CONTENTMGM', 'SECURITY']
['CONTENTMGM'] => ['SECURITY'] (support=0.18, confidence=0.91, lift=1.12)

Rules involving itemset ['CONTENTMGM', 'WEBHOSTING', 'SECURITY']
['CONTENTMGM', 'WEBHOSTING'] => ['SECURITY'] (support=0.07, confidence=0.93, lift=1.15)



<font size="+1" color="red">Replace this cell with a brief commentary on the rules that you have found.</font>

In the first execution of the Apriori algorithm, the only rule found is that clients who buy ContentMGM also buy Security, with a confidence of 91%.
In the second, we additionally find that clients who buy both ContentMGM and Webhosting also buy Security, with a confidence of 93%.
Both rules have low support (0.18 and 0.07) because only a small share of customers purchased these combinations. We can also see that adding Webhosting to the antecedent raises the confidence only from 0.91 to 0.93, and the lift from 1.12 to 1.15.

<font size="+1" color="red">Replace this cell with code to produce and print high-lift rules.</font>

In [146]:
results = list(apriori(transactions, min_support=0.05, min_confidence=0.80, min_lift=1.1))
print_apyori_output(results)

Rules involving itemset ['CONTENTMGM', 'SECURITY']
['CONTENTMGM'] => ['SECURITY'] (support=0.18, confidence=0.91, lift=1.12)

Rules involving itemset ['CONTENTMGM', 'WEBHOSTING', 'SECURITY']
['CONTENTMGM', 'WEBHOSTING'] => ['SECURITY'] (support=0.07, confidence=0.93, lift=1.15)



<font size="+1" color="red">Replace this cell with a brief commentary on the rules that you have found.</font>

When I raised the min_lift value further, I did not get any rules, so 1.1 is the highest threshold I could use and still get results.
Once again I get the same rules as before. The reason is that these three products are among the most frequently purchased (as can be seen in the transaction list).

<font size="+1" color="red">Replace this cell with (1) a description of the customers that purchase the office suite product, and (2) a description of the customers that purchase the platform as a service product. You may need to do additional runs of Apriori to obtain the rules you will need for this characterization.</font>

In [163]:
# Customers that purchased the office suite
cust_office = dataset1.where(dataset1["OFFICESUITE"]==1).dropna()

transactions1 = []
for index, row in cust_office.iterrows():
    purchased_products = []
    for col in cust_office.columns:
        if row[col] == 1:
            purchased_products.append(col)
    # Append once per customer, after all columns have been checked
    transactions1.append(purchased_products)

# Note: this filter keeps customers with WEBHOSTING or CLOUD_PAAS, not only CLOUD_PAAS
cust_platform = dataset1[(dataset1["WEBHOSTING"]==1) | (dataset1["CLOUD_PAAS"]==1)]
transactions2 = []
for index, row in cust_platform.iterrows():
    purchased_products = []
    for col in cust_platform.columns:
        if row[col] == 1:
            purchased_products.append(col)
    # Append once per customer, after all columns have been checked
    transactions2.append(purchased_products)

In [174]:
results = list(apriori(transactions1, min_support=0.15, min_confidence=0.75, min_lift=1.2))
print_apyori_output(results)

Rules involving itemset ['CONTENTMGM', 'SECURITY']
['CONTENTMGM'] => ['SECURITY'] (support=0.20, confidence=0.82, lift=1.21)

Rules involving itemset ['OFFICESUITE', 'CONTENTMGM', 'SECURITY']
['CONTENTMGM'] => ['OFFICESUITE', 'SECURITY'] (support=0.20, confidence=0.82, lift=1.21)
['OFFICESUITE', 'CONTENTMGM'] => ['SECURITY'] (support=0.20, confidence=0.82, lift=1.21)



In [173]:
results = list(apriori(transactions2, min_support=0.2, min_confidence=0.75, min_lift=1))
print_apyori_output(results)

Rules involving itemset ['WEBHOSTING', 'SECURITY']
['SECURITY'] => ['WEBHOSTING'] (support=0.68, confidence=0.98, lift=1.00)
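
Another way to characterize the buyers of one product, sketched here under assumptions, is to keep the full dataset and select only the rules that mention that product. The rules below are copied from the high-lift run printed earlier; `rules_mentioning` is a hypothetical helper and not part of apyori.

```python
# (antecedent, consequent, support, confidence, lift), copied from the printed rules
rules = [
    (['CONTENTMGM'], ['SECURITY'], 0.18, 0.91, 1.12),
    (['CONTENTMGM', 'WEBHOSTING'], ['SECURITY'], 0.07, 0.93, 1.15),
]


def rules_mentioning(rules, item):
    """Keep the rules whose antecedent or consequent contains `item`."""
    return [r for r in rules if item in r[0] or item in r[1]]


for antecedent, consequent, support, confidence, lift in rules_mentioning(rules, 'WEBHOSTING'):
    print("%s => %s (support=%.2f, confidence=%.2f, lift=%.2f)" %
          (antecedent, consequent, support, confidence, lift))
```

With this approach the support, confidence and lift values stay relative to all customers, whereas mining a pre-filtered subset makes the filtered product appear in every transaction.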



<font size="+1" color="red">Replace this cell with your conclusions. What would be your top three recommendations towards this service provider? Remember to justify clearly your recommendations based on the results from the association rules mining.</font>

In my opinion, the top three products are Security, ContentMGM and OfficeSuite, because they are among the most purchased. They also appear in the rules with the highest lift values, which means that these combinations occur more frequently than would be expected if the purchases were independent.

# DELIVER (individually)

Remember to read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/data-mining-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook

## Extra points available

For more learning and extra points, perform association rules mining on this [bakery dataset](https://github.com/viktree/curly-octo-chainsaw). There is a nice [notebook](https://github.com/viktree/curly-octo-chainsaw/blob/master/Bakery%20Transactions.ipynb) describing how to load this data, feel free to copy-paste from that notebook the data loading and cleaning parts. Format the data in the format that apyori expects, run the association rules mining, and write your conclusions briefly.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: experiments on the bakery dataset</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>