# Association Analysis

**Author: Jessica Cervi**

**Expected time = 2 hours**

**Total points = 55 points**


## Assignment Overview

In this assignment, we will work on association analysis. 

As we have seen, there are many complex ways to analyze data (clustering, regression, Neural Networks, Random Forests, SVM, etc.). The challenge with many of these approaches is that they can be difficult to tune, challenging to interpret and require quite a bit of data prep and feature engineering to get good results. In other words, they can be very powerful but require a lot of knowledge to implement properly.

Association analysis is relatively light on math concepts and easy to explain to non-technical people. In addition, it is an unsupervised learning tool that looks for hidden patterns, so there is limited need for data prep and feature engineering. It is a good start for certain cases of data exploration and can point the way for a deeper dive into the data using other approaches.

For this assignment we will use the Python implementation of association analysis found in the `MLxtend` library. This implementation should be very familiar to anyone who has exposure to `scikit-learn` and `pandas`. 

**NOTE:** Technically, market basket analysis is just one application of association analysis. In this assignment though, we will use association analysis and market basket analysis terms interchangeably.

This assignment is designed to build your familiarity and comfort in coding in Python. It will also help you review the key topics from each module. As you progress through the assignment, answers to the questions will get increasingly complex. You must adopt a data scientist's mindset when completing this assignment. Remember to run your code from each cell before submitting your assignment. Running your code beforehand will notify you of errors and give you a chance to fix them before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless asked specifically. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked in the question. 
- You can download the Grading Report after submitting the assignment. It will include the feedback and hints on incorrect questions. 

### Learning Objectives

- Understand the basics of association analysis
- Apply one-hot encoding to a dataframe
- Apply association analysis to data using the library `MLxtend`
- Interpret the results

## Index: 

#### Association Analysis
+ [Question 01](#q1)
+ [Question 02](#q2)
+ [Question 03](#q3)
+ [Question 04](#q4)
+ [Question 05](#q5)
+ [Question 06](#q6)

## Association Analysis


Association rules are normally written like this: {Bread} -> {Beer}, which means that customers who purchased bread also tended to purchase beer in the same transaction.

In the above example, the {Bread} is the **antecedent** and the {Beer} is the **consequent**. Both antecedents and consequents can have multiple items. In other words, {Bread, Gum} -> {Beer, Chips} is also a valid rule.

**Support** is the relative frequency with which the items in a rule appear together in the transactions. In many instances, you may want to look for high support in order to make sure it is a useful relationship. However, there may be instances where a low support is useful if you are trying to find “hidden” relationships.

**Confidence** is a measure of the reliability of the rule. A confidence of .5 in the above example would mean that in 50% of the cases where Bread and Gum were purchased, the purchase also included Beer and Chips. For product recommendation, a 50% confidence may be perfectly acceptable but in a medical situation, this level may not be high enough.

**Lift** is the ratio of the observed support to the support expected if the antecedent and the consequent were independent (click [here](https://en.wikipedia.org/wiki/Association_rule_learning) for more details). The basic rule of thumb is that a lift value close to 1 means the antecedent and the consequent are roughly independent. Lift values greater than 1 are generally more “interesting” and could indicate a useful rule pattern.
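The three metrics above can be worked through by hand on a tiny example. The sketch below uses a made-up list of transactions and two hypothetical helper functions (`support` and `rule_metrics`); it only illustrates the definitions and is not how `MLxtend` computes them internally.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"Bread", "Beer"},
    {"Bread", "Beer", "Chips"},
    {"Bread", "Milk"},
    {"Beer", "Chips"},
    {"Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    sup_a = support(antecedent, transactions)
    sup_c = support(consequent, transactions)
    confidence = both / sup_a   # how often the rule holds when the antecedent occurs
    lift = confidence / sup_c   # confidence relative to the consequent's base rate
    return both, confidence, lift


sup, conf, lift = rule_metrics({"Bread"}, {"Beer"}, transactions)
print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
# support=0.40, confidence=0.67, lift=1.11
```

Here {Bread} -> {Beer} has a lift only slightly above 1, so in this toy data buying bread barely changes the chance of buying beer.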

One final note related to the data: this analysis requires that all the data for a transaction be included in one row and that the items be one-hot encoded. 
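To make the required layout concrete, here is a minimal sketch on made-up data. It uses `pd.crosstab`, which is just one possible way to reach this shape and is not the approach the graded questions ask for; the column names mirror the dataset, but the rows are invented.

```python
import pandas as pd

# Hypothetical long-format purchases: one row per (invoice, item) line.
lines = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 2, 2, 3],
    "Description": ["Bread", "Beer", "Bread", "Gum", "Bread", "Beer"],
})

# Count item occurrences per invoice, then collapse the counts to 0/1 flags.
one_hot = pd.crosstab(lines["InvoiceNo"], lines["Description"]).gt(0).astype(int)
print(one_hot)
```

Each invoice now occupies a single row, and each product column holds 1 if the product appears on that invoice and 0 otherwise.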

You can find the documentation about the `MLxtend` library [here](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/).


### Apriori algorithm
The Apriori algorithm is commonly used for association analysis in scenarios like e-commerce. The objective is to find which items are commonly bought together.

Since big e-commerce sites usually have tens of thousands of different items and millions of different historical orders, an exhaustive brute-force search of which items are commonly bought together is unfeasible.

The **Apriori** algorithm allows us to find all the combinations of up to _k_ items that appear in at least _t_ different orders. The computation is much faster than brute-force approaches because the algorithm simplifies the search by recursively eliminating item combinations that do not satisfy the search constraints.

The recursive elimination of item combinations is illustrated in the image below:

<img src="data/combination-graph.png">

The image represents a scenario with 5 different items _a,b,c,d,e_, and we need to find the combinations of such items up to length _k_ with a threshold _t_. In this example the combination _ab_ is not present in at least _t_ orders and it is therefore discarded. Since _ab_ does not satisfy the threshold, any other itemset containing _ab_ will also be discarded. The combinations are built by adding one extra element at a time (up to _k_ elements), and the search space is iteratively reduced by keeping only the combinations that satisfy the threshold.
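The pruning idea described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions (a hypothetical function `apriori_sketch`, a made-up list of orders, and an absolute count threshold instead of a relative support); it is not the `MLxtend` implementation.

```python
from itertools import combinations


def apriori_sketch(transactions, min_count, max_len):
    """Level-wise search for itemsets found in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    frequent = {s: count(s) for s in current}

    size = 1
    while current and size < max_len:
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: drop a candidate if any (size - 1)-subset is not frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        current = {c for c in candidates if count(c) >= min_count}
        frequent.update({c: count(c) for c in current})
    return frequent


# Hypothetical orders over the items a, b, c, d, e.
orders = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "c", "e"}, {"a", "d", "e"}, {"b", "d"}]
result = apriori_sketch(orders, min_count=2, max_len=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this toy data _ab_ never occurs, so no larger itemset containing both _a_ and _b_ is ever generated or counted, which is where the savings over brute force come from.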

## Importing the dataset and exploratory data analysis

For this assignment, we will be using data that comes from the UCI Machine Learning Repository. This data represents transactional data from a UK retailer from 2010-2011. This mostly represents sales to wholesalers, so it is slightly different from consumer purchase patterns but is still a useful case study.

You can find more information about the dataset [here](http://archive.ics.uci.edu/ml/datasets/Online+Retail).

Below we import the necessary libraries and read the dataset using the `pandas` `read_excel` function.

### Import libraries

In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

df = pd.read_excel('data/Online_retail.xlsx')


Next, we use the function `head()` to display the first five rows of our dataframe.

In [None]:
df.head()

Here's a description of the attributes in our dataframe:

- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation. 
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. 
- Description: Product (item) name. Nominal. 
- Quantity: The quantities of each product (item) per transaction. Numeric.	
- InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated. 
- UnitPrice: Unit price. Numeric, Product price per unit in sterling. 
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. 
- Country: Country name. Nominal, the name of the country where each customer resides.
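As a small illustration of the `InvoiceNo` convention described above, the sketch below flags cancellations on a made-up column. The invoice numbers are hypothetical, and the check is case-insensitive on the assumption that the prefix may appear in either case.

```python
import pandas as pd

# Hypothetical invoice numbers; the column may mix integers and strings.
toy = pd.DataFrame({"InvoiceNo": [536365, "C536379", 536380, "c536391"]})

# Cast to str first so integer invoice numbers do not break the string check.
is_cancelled = toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")
print(is_cancelled.tolist())  # [False, True, False, True]
```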



[Back to top](#Index:) 
<a id='q1'></a>

### Question 1:

*5 points*
 
For the sake of keeping the data set small, we will only look at sales for France.
 
Create a subset of our dataframe containing only the orders from `France`. Assign the new dataframe to `df_france`

In [None]:
### GRADED

### YOUR SOLUTION HERE
df_france = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


The first thing we need to do is to consolidate the items into one transaction per row with each product **one-hot encoded**. 


[Back to top](#Index:) 
<a id='q2'></a>

### Question 2:

*10 points*

Group the entries of `df_france` by `InvoiceNo` and `Description` into a series called `basket`. Each entry of `basket` should be the sum of `Quantity` for a given product within a given invoice.

In [None]:
### GRADED

### YOUR SOLUTION HERE
basket = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q3'></a>

### Question 3:

*10 points*

Unstack the series `basket` to create a dataframe `df_basket`. Do so by using the function `unstack()`.

Next, reset the index of your dataframe, fill the NaN values with zeroes, and set the index column to `InvoiceNo`.

**HINT: The functions `reset_index()`, `fillna()` and `set_index()` will be useful**

In [None]:
### GRADED

### YOUR SOLUTION HERE
df_basket = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Let's have a look at our new dataframe!

In [None]:
df_basket

There are a lot of zeros in the data. We also need to convert any positive values to 1 and any negative values to 0.
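As a quick cross-check of the intended result, the same 0/1 conversion can be expressed as a vectorized comparison. This is an alternative sketch on a made-up table, not the function-based approach that the next question asks for.

```python
import pandas as pd

# Hypothetical quantity table: rows are invoices, columns are products.
quantities = pd.DataFrame(
    {"Bread": [2.0, 0.0, -1.0], "Beer": [0.0, 12.0, 3.0]},
    index=pd.Index([1, 2, 3], name="InvoiceNo"),
)

# The comparison yields booleans; casting to int gives the 0/1 flags.
flags = (quantities > 0).astype(int)
print(flags)
```

Negative quantities (for example, returns) and zeros both end up as 0, while any positive quantity becomes 1.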

[Back to top](#Index:) 
<a id='q4'></a>

### Question 4:

*10 points*

Define a function, `encode_units` that takes one float, `x`. Your function should return `0` if `x` is less than or equal to zero or 1 if `x` is greater than or equal to one.

Use the function `applymap()` to apply this function to your dataframe `df_basket`. Assign the new dataframe to `basket_sets`.

Finally, remove the `POSTAGE` column.

In [None]:
### GRADED

### YOUR SOLUTION HERE
basket_sets = None

def encode_units(x):
    return

###
### YOUR CODE HERE
###



In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Now that the data is structured properly, we can generate frequent item sets that have a support of at least 7%.
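To see what a support threshold means on one-hot data, note that the mean of a 0/1 column is the share of invoices containing that product. The sketch below applies this to a made-up table with an illustrative threshold; it covers single items only, whereas `apriori` also evaluates combinations.

```python
import pandas as pd

# Hypothetical one-hot basket: 1 means the product appears on that invoice.
toy_sets = pd.DataFrame({
    "Bread": [1, 1, 1, 0, 1],
    "Beer":  [1, 0, 1, 0, 0],
    "Gum":   [0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the support of that single item.
item_support = toy_sets.mean()
min_support = 0.3  # illustrative threshold, not the 0.07 used in the assignment
frequent_items = item_support[item_support >= min_support]
print(frequent_items)
```

With this threshold, Bread (0.8) and Beer (0.4) are kept while Gum (0.2) is dropped.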

[Back to top](#Index:) 
<a id='q5'></a>

### Question 5:

*10 points*

Use the function `apriori` from the `mlxtend` library on `basket_sets` to create a dataframe `frequent_itemsets` with the support for each itemset. Set the arguments of `apriori` to `min_support = 0.07` and `use_colnames = True`.

In [None]:
### GRADED

### YOUR SOLUTION HERE

frequent_itemsets = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Let's have a look at our result.

In [None]:
frequent_itemsets.head()

The final step is to generate the rules with their corresponding support, confidence, and lift. We can do so using the function `association_rules` from `mlxtend`.
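Conceptually, turning frequent itemsets into rules means splitting each itemset into an antecedent and a consequent and computing the metrics from the stored supports. The sketch below shows this on made-up support values with a hypothetical helper `rules_from_supports`; it is an illustration of the idea, not the `mlxtend` implementation, and it assumes every subset of a frequent itemset is also present in the dictionary.

```python
from itertools import combinations

# Hypothetical supports for frequent itemsets.
supports = {
    frozenset({"Bread"}): 0.6,
    frozenset({"Beer"}): 0.5,
    frozenset({"Bread", "Beer"}): 0.4,
}


def rules_from_supports(supports):
    """Enumerate antecedent -> consequent rules with support, confidence, lift."""
    rules = []
    for itemset, sup in supports.items():
        # Single-item sets yield no rules because range(1, 1) is empty.
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                consequent = itemset - antecedent
                confidence = sup / supports[antecedent]
                lift = confidence / supports[consequent]
                rules.append((set(antecedent), set(consequent), sup, confidence, lift))
    return rules


toy_rules = rules_from_supports(supports)
for a, c, sup, conf, lift in toy_rules:
    print(f"{a} -> {c}: support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

Note that the two rules built from the same itemset share the same support and lift but can differ in confidence, because confidence depends on which side is the antecedent.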

[Back to top](#Index:) 
<a id='q6'></a>

### Question 6:

*10 points*

Use the function `association_rules` from `mlxtend` to create the desired dataframe from `frequent_itemsets`. Name this dataframe `rules`. Set the argument `metric="lift"`.

In [None]:
### GRADED

### YOUR SOLUTION HERE

rules = None

###
### YOUR CODE HERE
###



In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Let's have a look at our result!

In [None]:
rules.head()

Now, the tricky part is figuring out what this tells us. For instance, we can see that there are quite a few rules with a high lift value, which means that the combination occurs more frequently than would be expected given the number of transactions and product combinations. We can also see several rules where the confidence is high. This part of the analysis is where domain knowledge comes in handy. Since we do not have that, we’ll just look for a couple of illustrative examples.

We can filter the dataframe using standard `pandas` code. In this case, we look for a large lift (at least 6) and a high confidence (at least 0.8), like so:

In [None]:
rules[ (rules['lift'] >= 6) & (rules['confidence'] >= 0.8) ]

What is also interesting to see is how the combinations vary by the country of purchase. Let’s check out what some popular combinations might be in Germany. 

We can do so by creating a new subset of our dataframe (with `Germany` set as the country) and by applying the same steps as above.

In [None]:
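# Build the invoice-by-product quantity table for Germany.
basket2 = (df[df['Country'] == "Germany"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Reuse encode_units from Question 4 to turn quantities into 0/1 flags.
basket_sets2 = basket2.applymap(encode_units)
basket_sets2.drop('POSTAGE', inplace=True, axis=1)
# Note the lower support threshold (5%) compared with the 7% used for France.
frequent_itemsets2 = apriori(basket_sets2, min_support=0.05, use_colnames=True)
rules2 = association_rules(frequent_itemsets2, metric="lift")

# Keep rules with a lift of at least 4 and a confidence of at least 0.5.
rules2[ (rules2['lift'] >= 4) &
        (rules2['confidence'] >= 0.5)]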

It seems that in addition to David Hasselhoff, Germans love Plasters in Tin Spaceboy and Woodland Animals.
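Since the same preparation steps are repeated for each country, they could be wrapped in a small helper. The sketch below is one possible way to do this, assuming the column names of this dataset; the function name `one_hot_basket` and the toy rows are hypothetical, and it uses a vectorized comparison in place of `encode_units`.

```python
import pandas as pd


def one_hot_basket(df, country, drop_cols=("POSTAGE",)):
    """Build a 0/1 invoice-by-product table for one country (hypothetical helper)."""
    quantities = (
        df[df["Country"] == country]
        .groupby(["InvoiceNo", "Description"])["Quantity"]
        .sum()
        .unstack(fill_value=0)   # missing (invoice, product) pairs become 0
    )
    flags = (quantities > 0).astype(int)
    # errors="ignore" keeps the helper usable when a column is absent.
    return flags.drop(columns=list(drop_cols), errors="ignore")


# Made-up rows that mimic the structure of the retail dataset.
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 3],
    "Description": ["MUG", "POSTAGE", "MUG", "TIN"],
    "Quantity": [6, 1, -6, 4],
    "Country": ["Germany", "Germany", "Germany", "France"],
})
toy_germany = one_hot_basket(toy, "Germany")
print(toy_germany)
```

The resulting table could then be passed to `apriori` and `association_rules` exactly as in the cells above.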

