## Importing libraries

In [8]:
from gsppy.gsp import GSP
import pandas as pd


## Loading the data set

Get the IDs of the customers retained after the filtering in task 1, and keep only their rows.

In [14]:
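# The index of new_df.csv holds the IDs of the customers kept after task 1
customer_indexes = pd.read_csv('data/new_df.csv', index_col=0, decimal='.').index.tolist()
df = pd.read_csv('data/clean_df.csv', index_col=0, decimal='.')

# Keep only the rows that belong to those customers
df = df[df['CustomerID'].isin(customer_indexes)]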

4171


Model the data as a list of baskets, each basket being the list of products it contains, as required by the GSP implementation.

For reference: https://github.com/jacksonpradolima/gsp-py

In [10]:
df_grouped = df.groupby(['BasketID'])['ProdDescr'].apply(list)
baskets = df_grouped.values.tolist()

print("Number of transactions: ", len(baskets))

Number of transactions:  14929
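To make the grouping step concrete, here is a minimal sketch on a toy frame. The rows are hypothetical; only the column names `BasketID` and `ProdDescr` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the ones used above.
toy = pd.DataFrame({
    "BasketID": ["A", "A", "B", "B", "B"],
    "ProdDescr": ["mug", "candle", "mug", "lantern", "candle"],
})

# One list of product descriptions per basket, in row order.
toy_baskets = toy.groupby("BasketID")["ProdDescr"].apply(list).tolist()
print(toy_baskets)
# [['mug', 'candle'], ['mug', 'lantern', 'candle']]
```

Each inner list is one transaction, which is the shape passed to `GSP(...)` later on.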


Because of the computational cost, we keep only baskets whose length lies
strictly between 6 and 80 items. These boundaries were inferred from the *mean*
and *standard deviation* of the `Imax` attribute defined for task 1.

In [11]:
# Keep baskets with more than 6 and fewer than 80 items
filtered_baskets = [b for b in baskets if 6 < len(b) < 80]

print("Number of filtered baskets:", len(filtered_baskets))

Number of filtered baskets: 11229
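The notebook does not show how the boundaries were derived from the mean and standard deviation. The sketch below is one possible rule, assuming bounds of the form mean ± k·std. The helper names, the value of `k`, and the sample values are all hypothetical.

```python
from statistics import mean, stdev

def length_bounds(values, k=1.0):
    """Return (low, high) as mean -/+ k sample standard deviations.

    Assumed rule, not taken from the notebook; low is clipped at 0.
    """
    m, s = mean(values), stdev(values)
    return max(0.0, m - k * s), m + k * s

def filter_by_length(baskets, low, high):
    """Keep baskets whose length lies strictly between low and high."""
    return [b for b in baskets if low < len(b) < high]

# Hypothetical Imax-like values, for illustration only.
imax_values = [10, 20, 30, 40, 50]
low, high = length_bounds(imax_values)

# Three toy baskets of length 5, 20 and 60.
toy_baskets = [["x"] * 5, ["x"] * 20, ["x"] * 60]
kept = filter_by_length(toy_baskets, low, high)
print([len(b) for b in kept])
```

With these toy values only the basket of length 20 falls inside the bounds.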


## Perform GSP

The results are written to the `results.txt` file, because the output is too long
to be displayed in the notebook.

In [13]:
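# Relative minimum support (fraction of baskets) and the matching absolute count
minsup = 0.007
minsup_count = len(filtered_baskets) * minsup

print("Minimum support:", int(minsup_count))

# Warning: this search takes hours to run
resultGSP = GSP(filtered_baskets).search(minsup)

# Write the frequent patterns to a file instead of the notebook output
with open('results.txt', 'w') as f:
    print(resultGSP, file=f)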




Minimum support: 78
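The printed value comes from truncating the fractional threshold. The short sketch below redoes the arithmetic with the figures reported above. How the library compares a pattern's frequency against the threshold is not shown in the notebook, so the rounded-up count is given only under the assumption of a `>=` comparison against the fractional value.

```python
import math

n_baskets = 11229  # number of filtered baskets reported above
minsup = 0.007     # relative minimum support used above

threshold = n_baskets * minsup        # fractional threshold, about 78.603
truncated = int(threshold)            # value printed by the notebook
rounded_up = math.ceil(threshold)     # smallest integer count >= threshold

print(round(threshold, 3), truncated, rounded_up)
```

Under that assumption, a pattern would have to occur in at least 79 baskets to be reported, even though the notebook prints 78.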
