## General instructions

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel/runtime** (Colab: in the menubar, select *Runtime*$\rightarrow$*Factory Reset Runtime*; Jupyter: in the menubar, select *Kernel*$\rightarrow$*Restart*) and then **run all cells** (Colab: in the menubar, select *Runtime*$\rightarrow$*Run all*; Jupyter: in the menubar, select *Cell*$\rightarrow$*Run All*).

Make sure you fill in any place that says `YOUR CODE HERE` or `"YOUR ANSWER HERE"`, as well as the list of the group members in the following cell.

Enter here the *Group Name* and the list of *Group Members*.

`GROUP NAME`

`GROUP MEMBERS`

To make evaluation possible, DO NOT delete or cut the cells containing code and answers. Once you have finished, you can download the notebook (Colab: in the menubar, select *File*$\rightarrow$*Download .ipynb*; Jupyter: in the menubar, select *File*$\rightarrow$*Download as*$\rightarrow$*Notebook (.ipynb)*) and upload it as an assignment on the e-learning platform.

The following cell loads the Google Drive extension for the current notebook when the variable `MOUNT` is `True`. This allows you to mount the Google Drive filesystem for file persistence. The mountpoint will be `/content/gdrive`.
It also sets the `PATH` variable, so that from now on, whenever you have to refer to an external file, you can do so by writing:

```python
os.path.join(PATH, filename)
```

This joins `PATH` and the filename into a single path, so the file is looked up inside the directory that `PATH` points to.
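As a small illustration, the sketch below shows what `os.path.join` produces for the two values that `PATH` can take in the next cell. The file name `example.csv` and the list name `joined_paths` are hypothetical and only serve the example.

```python
import os

# The two values PATH can take in this notebook: local run, or mounted Google Drive.
candidate_bases = ['.', '/content/gdrive/MyDrive']

# 'example.csv' is a hypothetical file name used only for illustration.
joined_paths = [os.path.join(base, 'example.csv') for base in candidate_bases]

for path in joined_paths:
    print(path)
```

The same call works unchanged in both environments, which is why the rest of the notebook can rely on `PATH` instead of hard-coded directories.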

In [None]:
import os
MOUNT = False
if 'google.colab' in str(get_ipython()) and MOUNT:
    from google.colab import drive
    drive.mount('/content/gdrive')
    PATH = '/content/gdrive/MyDrive'
else:
    PATH = '.'

# Important warning

**⚠️ Avoid copying, removing, or modifying test cells. If you do, your assignment might be graded incorrectly ⚠️**

---

In this practice we will investigate frequent itemset analysis and association rules.

For this purpose, you should have the `mlxtend` library installed.

To install a library in your environment, you can issue the command `!pip3 install <library_name>` from a notebook cell. On Colab you have to reissue it every time the virtual machine is restarted, since a fresh default Python environment is created each time.

The `!` before the command tells the notebook to run that command on the underlying operating system (which, incidentally, on Colab is a Unix/Linux OS).

If you want to check whether the library is installed before (re)-installing it you can issue the following command:

```bash
!pip3 freeze | grep -i "mlxtend"
```

Alternatively, you can try to import the library through the python `import` statement and check whether this instruction raises an `ImportError`.
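A minimal sketch of this import-based check is shown below. The helper name `can_import` and the module names passed to it are assumptions chosen for illustration; `json` is part of the standard library, while the second name is made up and expected not to exist.

```python
import importlib


def can_import(module_name):
    """Return True if the module imports cleanly, False if ImportError is raised."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# `json` ships with Python, so the import succeeds.
print(can_import("json"))
# A made-up module name, so the import is expected to fail.
print(can_import("surely_not_a_real_module_xyz"))
```

Catching `ImportError` specifically (rather than using a bare `except`) keeps unrelated errors visible.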

The library documentation is available [here](http://rasbt.github.io/mlxtend/).

For this practice, we will use the **IBM HR Analytics Employee Attrition & Performance** dataset, which is available on the [Kaggle Repository](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset). It contains a set of employee attrition data from IBM. A CSV file is provided alongside the notebook; however, refer to the website for the data dictionary.

We will also learn how to save a binary `pickle` file, so as to speed up saving and retrieving intermediate data when working with big datasets.

# Preliminaries 1

Write the code to ensure that the `mlxtend` library is installed on your system, by trying to import it and installing it if needed.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
try:
    import mlxtend
except ImportError:
    assert False, "mlxtend cannot be imported"

# Preliminaries 2

Search the Kaggle Repository and retrieve the IBM HR Analytics Employee Attrition & Performance dataset. It should be available to your scripts in a file named `WA_Fn-UseC_-HR-Employee-Attrition.csv`.

There is no formal *answer* to this task; you just have to make the file available to your notebook.

The `%%time` cell magic on the first line of a cell reports the time spent executing the whole cell, which is useful for comparing the performance of different ways to accomplish a task. (The single-percent form, `%time`, only times the statement written on the same line.)

In [None]:
%%time

import pandas as pd
try:
    df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
except Exception as e:
    assert False, f"the employee attrition file seems to be not available: {str(e)}"

# Preliminaries 3

Save the `df` object to a binary `pickle` (`.pkl`) file. You can refer to [this resource](https://www.datacamp.com/community/tutorials/pickle-python-tutorial) for a short tutorial on pickle. The expected file name is `employee-attrition.pkl`.
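To illustrate the general mechanics of pickling, here is a minimal round-trip sketch on a plain Python dictionary. The object `record`, the file name `example.pkl`, and the use of a temporary directory are assumptions made for the example; they are not the data or the file name expected by the task.

```python
import os
import pickle
import tempfile

# Hypothetical toy object, used only to demonstrate the round trip.
record = {"name": "example", "values": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "example.pkl")

    # Pickle files are binary, so open in 'wb' mode to write...
    with open(pkl_path, "wb") as f:
        pickle.dump(record, f)

    # ...and in 'rb' mode to read the object back.
    with open(pkl_path, "rb") as f:
        restored = pickle.load(f)

print(restored == record)
```

The key point is the binary file mode: `'wb'` when dumping and `'rb'` when loading.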

In [None]:
%%time
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
%%time

import pickle
with open('employee-attrition.pkl', 'rb') as f:
    df2 = pickle.load(f)

In [None]:
assert df.equals(df2), 'The contents of the two dataframes differ'

If everything went well, you should now have the data in binary form, ready for the analysis.