In this tutorial, we will explore the Market Basket Analysis dataset, which can be downloaded from Kaggle using the following link: Market Basket Analysis Dataset.

We will start by performing data cleaning to ensure the dataset is ready for analysis. Next, we will extract frequent itemsets using the FP-Growth algorithm, a powerful tool for mining frequent patterns in large datasets. Additionally, we will analyze the frequent itemsets on a per-country basis, enabling us to uncover unique purchasing patterns and trends across different regions.

By the end of this tutorial, you will have a solid understanding of how to preprocess transactional data and use the FP-Growth algorithm to gain valuable insights into customer purchasing behavior.

Before we begin, we need to install pandas and mlxtend: pandas to load and query the data, and mlxtend to run the FP-Growth algorithm.

To install pandas and mlxtend with pip:

```bash
pip install pandas
pip install mlxtend==0.23.1
```

# Importing Data

Once the libraries are installed, we can import the data. There are several ways to do this, such as using the Kaggle API or downloading the CSV file directly. In this tutorial, we use the folder downloaded from Kaggle, renamed to `data`. The file path is stored in the variable `FILE` below; adjust it to match your setup.

When pandas reads the CSV file, a small number of rows fail to parse because they are poorly formatted. Since so few rows are affected, we skip them with `on_bad_lines="skip"` and still retain the vast majority of the data.

In [None]:
import pandas as pd

FILE = "./data/Assignment-1_Data.csv"

data = pd.read_csv(FILE, sep=";", on_bad_lines="skip", low_memory=False)

print(data.head())
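
To get a rough idea of how many rows `on_bad_lines="skip"` is likely to drop, the sketch below counts rows that have more fields than the header, which is what pandas treats as a bad line. It uses only the standard library. The helper `count_bad_lines` and the sample text are hypothetical and not taken from the dataset.

```python
import csv
import io


def count_bad_lines(text, sep=";"):
    """Return (bad, total): rows with more fields than the header, and all data rows."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    rows = [row for row in reader if row]  # ignore blank lines
    bad = sum(1 for row in rows if len(row) > len(header))
    return bad, len(rows)


# Made-up sample: the second data row contains an extra ";" inside the item name.
sample = (
    "BillNo;Itemname;Country\n"
    "1001;WHITE MUG;France\n"
    "1002;RED;BLUE CANDLE;Germany\n"
    "1003;PAPER CHAIN KIT;France\n"
)

bad, total = count_bad_lines(sample)
print(f"{bad} of {total} rows have too many fields")
```

To apply the same check to the real file, read its contents with `open(FILE).read()` and pass the result to `count_bad_lines`.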

# Cleaning Data

In this step, we clean the dataset by keeping only the columns relevant to our analysis: `BillNo`, `Itemname`, and `Country`. All other columns are dropped. Next, we remove rows where `BillNo`, `Itemname`, or `Country` is missing, since incomplete records can introduce inaccuracies into the analysis. Finally, we strip leading and trailing whitespace from the `Itemname` column so that the same item is not counted under two different names.

In [None]:
columns_to_keep = ['BillNo', 'Itemname', 'Country']

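# Keep only the relevant columns; .copy() avoids a SettingWithCopyWarning below
data = data[columns_to_keep].copy()

# Drop rows with missing values
data.dropna(inplace=True)

# Remove leading and trailing whitespace from item names
data['Itemname'] = data['Itemname'].str.strip()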

print(data.head())
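
As a small illustration of why the whitespace step matters, the sketch below counts distinct item names before and after stripping. The names are made up for the example.

```python
from collections import Counter

# Made-up item names; three of them differ only by surrounding whitespace.
raw_names = ["WHITE MUG", "WHITE MUG ", " WHITE MUG", "RED CANDLE"]

before = Counter(raw_names)
after = Counter(name.strip() for name in raw_names)

print(len(before), "distinct names before stripping")
print(len(after), "distinct names after stripping")
```

Without stripping, the three spellings of the same item would each be counted separately, which lowers the item's apparent support.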

Group Data
- Group transaction data by country 
- Remove "Unspecified" country transactions

In [None]:
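# Split the data into one DataFrame per country
country_datas = {country: group for country, group in data.groupby('Country')}

# Remove transactions with no specified country (no error if the key is absent)
country_datas.pop("Unspecified", None)

# Use a separate loop variable so the cleaned `data` DataFrame is not overwritten
for country, country_df in country_datas.items():
    print(f"Data for {country}:")
    print(country_df.head())
    print("\n")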

Modify Data to be Transactions for TransactionEncoder
- Build transactions by collecting the items that share a `BillNo`
- Store each country's transactions as a list of item lists, the input format `TransactionEncoder` expects

In [None]:
country_transactions = {}

for country, country_df in country_datas.items():
    # Collect the items of each bill into a list, giving one list per transaction.
    # Using lists (rather than a comma-joined string) keeps item names that
    # contain commas intact.
    country_transactions[country] = (
        country_df.groupby('BillNo')['Itemname'].apply(list).tolist()
    )

# Show the first transaction of each country
for country, transactions in country_transactions.items():
    print(f"Transactions for {country}:")
    print(transactions[0])
    print("\n")
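
The grouping step can be sketched without pandas on a few made-up rows. Each `(BillNo, Itemname)` pair is appended to the basket for its bill, and the baskets become the list of transactions. One item name deliberately contains a comma to show that keeping items in a list does not depend on any separator character.

```python
from collections import defaultdict

# Made-up (BillNo, Itemname) rows; one item name contains a comma.
rows = [
    (1001, "WHITE MUG"),
    (1001, "METAL SIGN,CUPCAKE"),
    (1002, "WHITE MUG"),
    (1001, "PAPER CHAIN KIT"),
]

# Append each item to the basket that belongs to its bill number.
baskets = defaultdict(list)
for bill_no, item in rows:
    baskets[bill_no].append(item)

toy_transactions = list(baskets.values())
print(toy_transactions)
```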

# Generating Frequent Itemsets

Make One-Hot Encoded DataFrame for FP-Growth Algorithm
- The `fpgrowth` function from the mlxtend library expects data in a one-hot encoded pandas DataFrame

In [None]:
from mlxtend.preprocessing import TransactionEncoder

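for country, transactions in country_transactions.items():
    te = TransactionEncoder()
    # Learn the item vocabulary and encode each transaction as a boolean row
    te_ary = te.fit(transactions).transform(transactions)
    # Replace the list of transactions with its one-hot encoded DataFrame
    country_transactions[country] = pd.DataFrame(te_ary, columns=te.columns_)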
    
for country, transactions in country_transactions.items():
    print(f"Transactions for {country}:")
    print(transactions.head())
    print("\n")
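
To make the one-hot format concrete, here is a minimal hand-rolled sketch on made-up transactions. It is not how `TransactionEncoder` is implemented; it only illustrates the shape of the result: one column per distinct item and one boolean row per transaction.

```python
# Made-up transactions for illustration.
toy_transactions = [
    ["WHITE MUG", "RED CANDLE"],
    ["WHITE MUG"],
    ["RED CANDLE", "PAPER CHAIN KIT"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in toy_transactions for item in basket})

# One row per transaction: True where the basket contains the item.
one_hot = [[item in basket for item in columns] for basket in toy_transactions]

print(columns)
for row in one_hot:
    print(row)
```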

In [None]:
from mlxtend.frequent_patterns import fpgrowth, association_rules

print(len(country_transactions))

fq_itemsets = {}
fq_rules = {}
# Apply FP-Growth to each country's transactions
for country, transactions in country_transactions.items():
    frequent_itemsets = fpgrowth(transactions, min_support=0.1, use_colnames=True)
    fq_itemsets[country] = frequent_itemsets
    # association_rules raises a ValueError on an empty DataFrame, so skip those
    if frequent_itemsets.empty:
        continue
    top_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)
    # Keep rules whose confidence is at least 0.2
    rules = association_rules(top_itemsets, metric='confidence', min_threshold=0.2)
    fq_rules[country] = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
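
The `support`, `confidence`, and `lift` columns kept above can be computed by hand on a toy example. The sketch below uses the standard definitions for a rule A → C: support is the fraction of transactions containing both A and C, confidence is that support divided by the support of A, and lift is confidence divided by the support of C. The helper names and the data are made up, and the code assumes the antecedent and consequent each occur at least once.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(set(antecedent) | set(consequent), transactions)
    s_antecedent = support(antecedent, transactions)
    s_consequent = support(consequent, transactions)
    confidence = s_both / s_antecedent
    lift = confidence / s_consequent
    return s_both, confidence, lift


# Made-up transactions for illustration.
toy = [
    ["MUG", "CANDLE"],
    ["MUG", "CANDLE", "KIT"],
    ["MUG"],
    ["KIT"],
]

s, c, l = rule_metrics({"MUG"}, {"CANDLE"}, toy)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if they were bought independently.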


In [None]:
for country, itemsets in fq_itemsets.items():
    print(f"Frequent itemsets for {country}")
    print(len(itemsets))
    print(itemsets.sort_values(by='support', ascending=False).head(5))
    print("\n")
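
Because `min_support=0.1` is a fraction of each country's own transactions, the absolute number of baskets it implies varies with the size of the country's data. The sketch below, using a hypothetical helper `min_count_for_support`, shows the smallest number of baskets an itemset must appear in to reach the threshold.

```python
def min_count_for_support(n_transactions, min_support=0.1):
    """Smallest number of baskets an itemset must appear in to reach min_support."""
    for count in range(1, n_transactions + 1):
        if count / n_transactions >= min_support:
            return count
    return None  # threshold cannot be reached (min_support > 1)


for n in (5, 10, 11, 100, 1000):
    print(n, "transactions ->", min_count_for_support(n), "baskets needed")
```

For a country with ten or fewer transactions, a single occurrence is enough to count as frequent, so the itemsets reported for small countries should be read with caution.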

In [None]:
for country, rules in fq_rules.items():
    print(f"Association rules for {country}")
    print(len(rules))
    print(rules.head(5))
    print("\n")