# Homework 2
This homework was assigned on 31 October 2024.
We were asked to complete a series of tasks on a *basket-item* dataset:
1. Compute the frequency of each product, then draw a bar plot of the top 20. The first axis shows the product name, the second one the product frequency.
2. Compute the frequency of each level 3 and level 4 category, then plot the top 5 of each as in point 1.
3. Apply the APRIORI algorithm.

The dataset of transactions we were given is owned by a third party. As I have no rights over it, I assume we are not allowed to share it; if you are interested, please reach out to the professors of *Introduction to Data Mining* at [this link](https://web.dmi.unict.it/courses/l-31/course-units/?seuid=8EAB2D3A-4281-40F4-83A0-C6B007577BA2).

### Notes
We were required to follow an object-oriented approach. My work doesn't strictly follow this requirement, as a Jupyter notebook is provided.

I did my best to keep verbose or distracting code out of it.


### Loading environment
We use the `python-dotenv` package to load environment variables from a `.env` file, in particular the variable `TASK_DATASET_FILE_PATH`, which points to our dataset file.

In [None]:
import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

dataset_file_path = os.getenv('TASK_DATASET_FILE_PATH')
print("Our dataset path is: %s" % dataset_file_path)
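If the variable is missing, `os.getenv` silently returns `None` and the error only shows up later, when the dataset is opened. A minimal sketch of a guard that fails early is shown below. The helper name `require_env_path` and the placeholder path are my own assumptions for illustration, they are not part of this project.

```python
import os
from pathlib import Path


def require_env_path(name: str) -> Path:
    """Return the path stored in environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file"
        )
    return Path(value)


# Placeholder value, used only so that this sketch runs on its own.
os.environ.setdefault("TASK_DATASET_FILE_PATH", "/path/to/dataset")
dataset_path = require_env_path("TASK_DATASET_FILE_PATH")
print("Our dataset path is: %s" % dataset_path)
```

In the notebook, the call would come right after `load_dotenv()`, so that the values from `.env` are already in the environment.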

### Loading our dataset
Due to the large size of our raw dataset file, a few helpers have been implemented to make my work easier, in particular:
- A **Preprocessor** class, which is responsible for:
  - Selecting the features we are interested in.
  - Dropping invalid records from our dataset.
  - Splitting our dataset into three different dataframes: **items**, **categories**, **transactions**.
- A **CacheManager** and its **Cache** product:
  - The **CacheManager**'s responsibility is to cache the **Preprocessor**'s results, so that iteration time, memory and computing resources are saved.
  - **Cache** is the result of a **CacheManager**; it holds the three dataframes needed.
  - Python's **pickle** format is used to serialize our dataframes, as it preserves the pandas DataFrame information. You must create your own cache files by running this notebook, as **pickle** files from third parties can execute malicious code when loaded.
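To make the caching idea concrete, here is a minimal sketch of a build-once, load-afterwards pickle cache. It is not the code in `src.cache`: the class name `PickleCacheSketch`, its constructor arguments and the toy `build` function are assumptions for illustration, and plain lists stand in for the dataframes so the sketch only needs the standard library.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import Callable


class PickleCacheSketch:
    """Hypothetical cache: run `build` once, then reuse the pickled result."""

    def __init__(self, cache_path: Path, build: Callable[[], SimpleNamespace]):
        self.cache_path = Path(cache_path)
        self.build = build

    def get_cache(self) -> SimpleNamespace:
        if self.cache_path.exists():
            # Only unpickle files you created yourself.
            with self.cache_path.open("rb") as fh:
                return pickle.load(fh)
        cache = self.build()  # the expensive preprocessing step
        with self.cache_path.open("wb") as fh:
            pickle.dump(cache, fh)
        return cache


calls = []


def build() -> SimpleNamespace:
    # Stand-in for the preprocessing; records how many times it runs.
    calls.append(1)
    return SimpleNamespace(items=[10, 20], categories=["A"], transactions=[(1, 10)])


with tempfile.TemporaryDirectory() as tmp:
    manager = PickleCacheSketch(Path(tmp) / "cache.pkl", build)
    first = manager.get_cache()   # builds and writes the pickle file
    second = manager.get_cache()  # reads the pickle file, build() is not called again
```

The second call skips the preprocessing entirely, which is where the saving in iteration time comes from.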

The following code loads our dataset:

In [None]:
from src.cache import CacheManager
cache = CacheManager(dataset_file_path).get_cache()

Our cache object holds the three dataframes I considered useful to extract to complete the homework:


#### Items
`cache.items` holds the data about the items present in the original dataset.

In [None]:
cache.items

#### Categories
`cache.categories` holds the data about the categories of the original dataset. The categories originate from a tree structure, but it is not necessary to reconstruct it for this task, so this dataframe just holds the needed information for each category, regardless of its level or parent.

In [None]:
cache.categories

#### Transactions
`cache.transactions` holds the transactions from the original files. The data has been pruned down to the two needed features:
- Transaction id ( *scontrino_id* ).
- Item id ( *cod_prod* ).

In [None]:
cache.transactions

## Task 1
### Frequency of each product
To compute the data frame of frequencies of each product, we count the occurrences of each `cod_prod` in the **transactions** data frame.

I preferred to keep the occurrence counts, as I found them easier to understand, so I calculate the required frequency by dividing each result's *count* column by the number of rows of the *transactions* data frame.

Then the result is joined with the **items** data frame, from which I select only `descr_prod`, the product's description, as we are required to use it for the labels in the next plot.

The code and the data frame holding the frequency of each product follow.

In [None]:
most_frequent_df = (
    cache.transactions['cod_prod'].value_counts(sort=True).to_frame()
    # len() gives the number of rows; DataFrame.size would count every cell (rows x columns)
    .assign(frequency=lambda x: x['count'] / len(cache.transactions))
    .join(cache.items[['descr_prod']], 'cod_prod')
)

most_frequent_df
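The count, divide and join steps can be checked by hand on a toy dataset. The sketch below uses made-up data (three receipts, three products), so all the values are assumptions. It also renames the series and its index explicitly, so that it does not depend on the column name that `value_counts` produces in a given pandas version.

```python
import pandas as pd

# Toy stand-ins for cache.transactions and cache.items (made-up values).
transactions = pd.DataFrame({
    "scontrino_id": [1, 1, 2, 2, 3, 3, 3],
    "cod_prod": [10, 20, 10, 30, 10, 20, 30],
})
items = pd.DataFrame(
    {"descr_prod": ["milk", "bread", "eggs"]},
    index=pd.Index([10, 20, 30], name="cod_prod"),
)

counts = (
    transactions["cod_prod"]
    .value_counts()            # occurrences of each product, sorted descending
    .rename("count")
    .rename_axis("cod_prod")
    .to_frame()
)

freq = (
    counts
    # share of transaction rows in which the product appears
    .assign(frequency=counts["count"] / len(transactions))
    # attach the product description, matched on the cod_prod index
    .join(items[["descr_prod"]])
)
print(freq)
```

Here product 10 appears in 3 of the 7 rows, so its frequency is 3/7. Dividing by `transactions["scontrino_id"].nunique()` instead would give the share of receipts, which is a different but equally reasonable reading of "frequency".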

The requested bar plot follows.

In [None]:
import matplotlib.pyplot as plt

# keep only the top twenty products, as required
most_frequent_df = most_frequent_df[:20]
# draw a bigger figure
plt.figure(figsize=(8,10))
# use barh for horizontal bars, which I found a better fit here than vertical ones
plt.barh(most_frequent_df['descr_prod'], most_frequent_df['count'])
plt.show()

## Task 2
### Frequency of categories of level 3
To compute the frequency of the level 3 categories, we need to:
1. Select the *transactions* data frame.
2. Join it with *items*, selecting only `liv3` from the items.
3. Compute the `value_counts` of `liv3`.
4. Join the result with the *categories* data frame.


In [None]:
top_level3_df = (
    cache.transactions
    .join(cache.items[['liv3']], 'cod_prod' )
    ['liv3'].value_counts()
    .to_frame()
    .assign(frequency=lambda x: x['count'] / len(cache.transactions))  # len() = number of rows
    .join(cache.categories)
)

top_level3_df

The bar plot of the top 5 categories of level 3 follows:

In [None]:
# keep only the top five categories, as required
top_level3_df = top_level3_df[:5]
# draw a bigger figure
plt.figure(figsize=(8,6))
# use barh for horizontal bars, which I found a better fit here than vertical ones
plt.barh(top_level3_df['descr'], top_level3_df['count'])
plt.show()

### Frequency of categories of level 4
The procedure is the same as [before](#frequency-of-categories-of-level-3); this time we extract the most frequent categories of level 4:

In [None]:
top_level4_df = (
    cache.transactions
    .join(cache.items[['liv4']], 'cod_prod' )
    ['liv4'].value_counts()
    .to_frame()
    .assign(frequency=lambda x: x['count'] / len(cache.transactions))  # len() = number of rows
    .join(cache.categories)
)

top_level4_df

And we plot them in the same way:

In [None]:
# keep only the top five categories, as required
top_level4_df = top_level4_df[:5]
# draw a bigger figure
plt.figure(figsize=(8,6))
# use barh for horizontal bars, which I found a better fit here than vertical ones
plt.barh(top_level4_df['descr'], top_level4_df['count'])
plt.show()
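Since the level 3 and level 4 computations differ only in the column name, they could be factored into one helper. The sketch below is a possible refactoring, not code from this notebook: the function name `category_frequency` and the toy data are assumptions, while the column names `cod_prod`, `liv3`, `liv4` and `descr` are the ones used in the cells above.

```python
import pandas as pd


def category_frequency(transactions, items, categories, level):
    """Count how often each category of the given level column (e.g. 'liv3') occurs."""
    counts = (
        transactions
        .join(items[[level]], on="cod_prod")[level]  # map each row to its category
        .value_counts()
        .rename("count")
        .rename_axis(level)
        .to_frame()
    )
    return (
        counts
        .assign(frequency=counts["count"] / len(transactions))
        .join(categories)  # attach the category description, matched on the index
    )


# Toy stand-ins for the three cached dataframes (made-up values).
transactions = pd.DataFrame({"cod_prod": [10, 20, 10, 30, 10, 20, 30]})
items = pd.DataFrame(
    {"liv3": ["A", "A", "B"], "liv4": ["A1", "A2", "B1"]},
    index=pd.Index([10, 20, 30], name="cod_prod"),
)
categories = pd.DataFrame(
    {"descr": ["Dairy", "Bakery", "Milk", "Cheese", "Bread"]},
    index=["A", "B", "A1", "A2", "B1"],
)

top_level3 = category_frequency(transactions, items, categories, "liv3").head(5)
top_level4 = category_frequency(transactions, items, categories, "liv4").head(5)
```

With such a helper, each of the two tasks reduces to one call followed by the same plotting code.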

## Task 3
It is required to apply **Apriori** to our transactions.
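
As a reference for what the algorithm does, here is a minimal pure-Python sketch of Apriori on a toy set of baskets. It is an illustration under my own assumptions (the function name, the toy baskets and the 0.5 support threshold are made up), not the solution used for this homework, and it is not optimized for a dataset of this size.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: support} for every itemset whose support is >= min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        # fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    singles = {frozenset([item]) for b in baskets for item in b}
    current = {s for s in singles if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


toy_baskets = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
    {"bread"},
]
frequent_itemsets = apriori(toy_baskets, min_support=0.5)
for itemset, supp in sorted(frequent_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), supp)
```

On our data, the baskets could be built by grouping the transactions by receipt, for example with `cache.transactions.groupby('scontrino_id')['cod_prod'].apply(set)`, assuming `scontrino_id` is available as a column or index level.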