## General instructions

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel/runtime** (Colab: in the menubar, select *Runtime*$\rightarrow$*Factory Reset Runtime*; Jupyter: in the menubar, select *Kernel*$\rightarrow$*Restart*) and then **run all cells** (Colab: in the menubar, select *Runtime*$\rightarrow$*Run all*; Jupyter: in the menubar, select *Cell*$\rightarrow$*Run All*).

Make sure you fill in any place that says `YOUR CODE HERE` or `"YOUR ANSWER HERE"`, as well as the list of the group members in the following cell.

Enter here the *Group Name* and the list of *Group Members*.

`Doi eletronics`

`Burco Lorenzo, Persello Riccardo`

To make evaluation possible, DO NOT delete or cut the cells containing code and answers. Once you have finished, download the notebook (Colab: in the menubar, select *File*$\rightarrow$*Download .ipynb*; Jupyter: in the menubar, select *File*$\rightarrow$*Download as*$\rightarrow$*Notebook (.ipynb)*) and upload it as an assignment on the e-learning platform.

The following cell loads the Google Drive extension for the current notebook when the variable `MOUNT` is `True`. This allows you to mount the Google Drive filesystem for file persistence; the mount point is `/content/gdrive`.
It also sets the `PATH` variable, so that from now on you can refer to external files by writing:

```python
os.path.join(PATH, filename)
```

This joins `PATH` and the filename into a single file path.

In [None]:
import os
MOUNT = False
if 'google.colab' in str(get_ipython()) and MOUNT:
    from google.colab import drive
    drive.mount('/content/gdrive')
    PATH = '/content/gdrive/MyDrive'
else:
    PATH = '.'

# Important warning

**⚠️ Do not copy, remove, or modify test cells; if you do, your assignment might be graded incorrectly ⚠️**

---

This notebook will deal with the Employee Attrition dataset, describing a possible analysis workflow for frequent itemsets and association rules.

In a nutshell, association analysis is an unsupervised machine learning method that tries to identify frequent patterns in *transactions* (it is also known as *market basket analysis*), i.e., associations that occur more often than would be expected by chance.

In this specific case, we associate each employee with his/her characteristics and try to find the frequent combinations of features that occur when employees do or do not show **attrition**. To this end, we will transform the dataset so that each employee is represented by the itemset of his/her feature values (e.g., the department he/she works for, the business travel frequency, the education level, etc.).

In this case there are no specific assertions; the outcome of the analysis will be graded manually. In general, you can add multiple cells below a task assignment to perform the task.

# Step 1

Load the Employee Attrition Dataset from the `pkl` file and check whether the dataset contains missing values. If it does, try to identify them.

A helpful tool for this is the [`missingno` library](https://github.com/ResidentMario/missingno). Take a look at it and try to understand how to use it.
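
As a minimal sketch of the missing-value check, the following uses plain pandas on a small made-up table. The column names and values are purely illustrative and are not taken from the real dataset, which would presumably be loaded with `pd.read_pickle` from the `pkl` file.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (names and values are invented).
toy = pd.DataFrame({
    "Age": [34, np.nan, 45, 29],
    "Department": ["Sales", "R&D", None, "Sales"],
    "Attrition": ["No", "Yes", "No", "No"],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

If `missingno` is installed, `missingno.matrix(toy)` should give a visual overview of the same information.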

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Step 2

Frequent itemset and association analyses require categorical data. Categorical data are qualitative information that can be stored and identified by names or labels (e.g., gender, age group, educational level, satisfaction level). They may be ordered when they refer to some quantitative dimension (e.g., age group [20, 30) precedes age group [30, 40)). A categorical variable is not necessarily a string or a label: it may also be encoded with a numerical value (e.g., satisfaction level on a Likert scale ranging from 1 to 5).

Identify the categorical variables in the dataset, also looking at the data dictionary on Kaggle. Notice that those variables could also have an integer encoding; therefore, for each (integer) numerical variable that you do not consider categorical, briefly explain why.
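
One possible starting point, sketched here on invented data, is to list each column's dtype and number of distinct values. The cut-off of 5 distinct values is an arbitrary assumption for illustration; the data dictionary remains the authoritative source.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data (invented): an integer-coded level, a true quantity, and a label.
toy = pd.DataFrame({
    "Education": [1, 3, 2, 5, 4, 3],
    "MonthlyIncome": [2100, 5400, 3300, 8700, 4100, 6000],
    "Department": ["Sales", "R&D", "R&D", "HR", "Sales", "R&D"],
})

# Overview: dtype and number of distinct values per column.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})
print(summary)

# Heuristic (an assumption, not a rule): non-numeric columns, or numeric
# columns with few distinct values, are candidates for being categorical.
threshold = 5
candidates = [
    col for col in toy.columns
    if not is_numeric_dtype(toy[col]) or toy[col].nunique() <= threshold
]
print(candidates)
```

The heuristic only flags candidates; an integer column with few values can still be a genuine count, so each case needs a short justification.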

Write down, in the following cell, a brief statement with your observations about the outcome of this preliminary analysis.

# Step 3

The numerical variables of the dataset also convey relevant information for the analyses. Therefore, we want to transform them into categorical ones. Specifically:

1. among the numerical variables, identify those which might be meaningful for characterizing the employee (e.g., the age) and those which are not (e.g., the Employee Number which is basically a primary key in the DB).
2. the meaningful numerical variables should be transformed into categorical ones by means of *discretization*, or *binning*. There are several approaches to discretization (see [this Medium article](https://towardsdatascience.com/an-introduction-to-discretization-in-data-science-55ef8c9775a2) for a review), but the two major approaches for transforming an interval $[\alpha, \beta]$ into its discretized counterpart consisting of $n$ categories are:
   - *equal-width* discretization: $w = \frac{\beta - \alpha}{n}$ and each category $c_i, i = 0, \ldots, n - 1$ is an interval $c_i = [\alpha + i \cdot w, \alpha + (i + 1) \cdot w)$ such that $v$ is encoded in category $i$ if $v \in c_i$ (the last interval also includes $\beta$). In pandas the function is called `pd.cut()`.
   - *equal-frequency* discretization: it defines $n + 1$ breakpoints $b_i, i = 0, \ldots, n$, with $b_0 = 0$ and $b_n = \beta - \alpha$, such that the intervals $c_i = [\alpha + b_i, \alpha + b_{i + 1}), i = 0, \ldots, n - 1$ contain about the same number of values (i.e., about $\frac{m}{n}$, where $m$ is the size of the dataset). In pandas the function is called `pd.qcut()`.

   Note that pandas builds right-closed intervals $(a, b]$ by default. For simplicity, for this analysis I suggest using the *equal-frequency* approach with 4 categories.

The transformed dataset, at the end of this preparation task, should comprise only categorical variables.
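
To see the difference between the two approaches, here is a small sketch on an invented series of ages (the values are made up for illustration). It applies `pd.cut()` and `pd.qcut()` with 4 bins each and counts how many values fall in each bin.

```python
import pandas as pd

# Invented ages, deliberately skewed by one large value.
ages = pd.Series([18, 22, 25, 29, 31, 35, 40, 58], name="Age")

# Equal-width: 4 intervals of the same width over [min, max].
equal_width = pd.cut(ages, bins=4)

# Equal-frequency: 4 intervals holding about the same number of values.
equal_freq = pd.qcut(ages, q=4)

# Count the values per interval, in interval order.
width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts)
print(freq_counts)
```

On skewed data, equal-width bins can end up nearly empty, while equal-frequency bins stay balanced; this matters here because sparsely populated categories rarely reach the support threshold.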

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Step 4

Given the preprocessed data, you have to convert it into the format expected by the `mlxtend` library for frequent patterns. You can have a look [here](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/) for the documentation.

In particular, a *one-hot* encoding is required, i.e., a dataframe whose columns correspond to the different items and whose rows are the transactions (here, the employees). Each value is a binary indicator of whether that row (i.e., transaction) contains that item. Here is an example of the expected format:

|Index|Bread|Coke|Milk|Beer|Diaper|
|-----|-----|----|----|----|------|
|1    |1    |1   |1   |0   |0     |
|2    |1    |0   |0   |1   |1     |
|3    |0    |1   |0   |1   |1     |
|4    |1    |0   |1   |1   |1     |
|5    |0    |1   |1   |0   |1     |
|.....|...  |... |... |... |...   |

This can be achieved with the `pd.get_dummies()` function; look it up in the pandas documentation.
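
As a minimal sketch, this is what `pd.get_dummies()` produces on a tiny invented table with two categorical columns (the rows are made up for illustration). Each resulting column is named `<column>_<value>`.

```python
import pandas as pd

# Toy categorical data (invented rows).
toy = pd.DataFrame({
    "Department": ["Sales", "R&D", "Sales"],
    "Attrition": ["No", "Yes", "No"],
})

# One boolean column per (column, value) pair.
one_hot = pd.get_dummies(toy, dtype=bool)
print(one_hot)
```

Each row has exactly one `True` per original column, so the row sums equal the number of original columns.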

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

The [FP-Growth](https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm) algorithm is an alternative algorithm for determining frequent itemsets, based on a compact and efficient data structure called the FP-tree, which stores the transactions in compressed form. You can take a look at the linked document or, better, at [this blog article](https://towardsdatascience.com/understand-and-build-fp-growth-algorithm-in-python-d8b989bab342) if you are interested in the details of the algorithm.

The algorithm is already implemented in the `mlxtend` library as the `fpgrowth` function.

Use this algorithm to compute the frequent itemsets in the transformed dataset, with a minimum support threshold of 0.05 and a maximum itemset length of 4. Inspect the extracted itemsets.
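
To make the roles of the support threshold and the maximum length concrete, the sketch below enumerates frequent itemsets by brute force on the five market-basket transactions of the example table above. This is deliberately not FP-Growth: it only illustrates what the output means, and the helper `frequent_itemsets` is a hypothetical name introduced here.

```python
from itertools import combinations

# The five transactions from the one-hot example table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Bread", "Beer", "Diaper"},
    {"Coke", "Beer", "Diaper"},
    {"Bread", "Milk", "Beer", "Diaper"},
    {"Coke", "Milk", "Diaper"},
]

def frequent_itemsets(transactions, min_support, max_len):
    """Brute-force enumeration (NOT FP-Growth) of frequent itemsets."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            # Support: fraction of transactions containing every item.
            support = sum(itemset <= t for t in transactions) / n
            if support >= min_support:
                result[itemset] = support
    return result

itemsets = frequent_itemsets(transactions, min_support=0.6, max_len=2)
for itemset, support in sorted(itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

With `mlxtend`, the corresponding call on a one-hot dataframe `df` would look like `fpgrowth(df, min_support=0.05, max_len=4, use_colnames=True)`; check the documentation of your installed version for the exact parameters.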

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Association rules

Once the frequent itemsets have been computed, it is meaningful to transform them into implication-style rules, called *association rules*. These rules have the form $Age\_Range\_(36.0, 43.0] \rightarrow Attrition\_No$, indicating that there is a direction in the relation of the common occurrence of the feature values. Note that a rule describes co-occurrence, not necessarily causation.

In order to measure the quality of these rules (recall that the method is unsupervised, so we need a metric), a popular choice is the *lift*. Specifically, $\mathrm{lift}(A \rightarrow B) = \frac{\mathrm{freq}(A \cup B)}{\mathrm{freq}(A) \cdot \mathrm{freq}(B)}$, where $\mathrm{freq}(X)$ is the relative frequency (support) of the itemset $X$, i.e., the fraction of transactions containing all the items of $X$. Values greater than 1 indicate that $A$ and $B$ occur together more often than would be expected if they were independent.

Another metric is the *confidence*, which estimates the probability of finding the itemset $B$ given that $A$ has been found, i.e., $\mathrm{confidence}(A \rightarrow B) = \frac{\mathrm{freq}(A \cup B)}{\mathrm{freq}(A)}$.

Use the frequent itemsets computed in the previous step to find the association rules whose lift is greater than 1.
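
As a worked example of the two formulas, the sketch below computes confidence and lift by hand on the five market-basket transactions of the example table in Step 4. The helper names `support`, `confidence` and `lift` are introduced here for illustration and are not part of `mlxtend`.

```python
# The five transactions from the one-hot example table in Step 4.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Bread", "Beer", "Diaper"},
    {"Coke", "Beer", "Diaper"},
    {"Bread", "Milk", "Beer", "Diaper"},
    {"Coke", "Milk", "Diaper"},
]

def support(itemset):
    """freq(X): fraction of transactions containing every item of X."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """freq(A u B) / freq(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """freq(A u B) / (freq(A) * freq(B))."""
    return confidence(antecedent, consequent) / support(consequent)

# Beer -> Diaper: every basket with Beer also has Diaper.
print(confidence({"Beer"}, {"Diaper"}), lift({"Beer"}, {"Diaper"}))

# Bread -> Coke: the two items rarely appear together.
print(confidence({"Bread"}, {"Coke"}), lift({"Bread"}, {"Coke"}))
```

In `mlxtend`, rules are generated with `association_rules(..., metric="lift", min_threshold=1)`. As far as I know the threshold is applied as "greater than or equal", so a further filter may be needed for a strictly greater lift; some recent versions also expect an additional `num_itemsets` argument, so check the documentation of the installed version.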

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Step 7

Since we are interested in investigating the attrition phenomenon, extract from the association rules those having `Attrition_Yes` or `Attrition_No` as a consequent (i.e., those of the form `A` $\rightarrow$ `Attrition_Yes` and `A` $\rightarrow$ `Attrition_No`).

Note that the values in the antecedent and consequent columns of the association rules are frozen sets, e.g., `frozenset({'Attrition_No'})`.
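
Because the column holds `frozenset` objects, a plain string comparison will not match. A minimal sketch of one way to filter is shown below on a hand-made rules table: the rows, the item name `OverTime_Yes` and the lift values are all invented for illustration, and only the column names mimic the usual rules output.

```python
import pandas as pd

# Hand-made stand-in for a rules dataframe (all values are invented).
rules = pd.DataFrame({
    "antecedents": [
        frozenset({"Age_Range_(36.0, 43.0]"}),
        frozenset({"Attrition_No"}),
        frozenset({"OverTime_Yes"}),
    ],
    "consequents": [
        frozenset({"Attrition_No"}),
        frozenset({"Department_Sales"}),
        frozenset({"Attrition_Yes"}),
    ],
    "lift": [1.10, 1.05, 1.80],
})

# Keep rules whose consequent is exactly one of the two attrition items.
targets = {frozenset({"Attrition_Yes"}), frozenset({"Attrition_No"})}
mask = rules["consequents"].map(lambda consequent: consequent in targets)
attrition_rules = rules[mask]
print(attrition_rules)
```

Comparing whole frozensets keeps only rules whose consequent contains the attrition item alone; to also accept larger consequents that include it, a subset test such as `frozenset({"Attrition_Yes"}) <= consequent` could be used instead.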

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Step 8

Write down your observations about the frequent patterns dealing with attrition. What can you deduce from the analysis of those patterns?