# Market Basket Analysis

In this tutorial, we will explore the Market Basket Analysis dataset, which can be downloaded from Kaggle using the following link: Market Basket Analysis Dataset.

We will start by performing data cleaning to ensure the dataset is ready for analysis. Next, we will extract frequent itemsets using the FP-Growth algorithm, a powerful tool for mining frequent patterns in large datasets. Additionally, we will analyze the frequent itemsets on a per-country basis, enabling us to uncover unique purchasing patterns and trends across different regions.

By the end of this tutorial, you will have a solid understanding of how to preprocess transactional data and use the FP-Growth algorithm to gain valuable insights into customer purchasing behavior.

Before we begin, we need to install two libraries: pandas, to query the data, and mlxtend, to run the FP-Growth algorithm.

To install pandas and mlxtend with pip:

```bash
pip install pandas
pip install mlxtend==0.23.1
```

# Importing Data

Once the libraries have been installed, we can import the data into our program. There are multiple ways to do this, such as using the Kaggle API or downloading the CSV file. In this tutorial, we use the folder downloaded from Kaggle, renamed to `data`. The file path is stored in the variable `FILE` below and can be changed to match your setup.

When importing the CSV file with pandas, some rows fail to parse because of inconsistent formatting. Since only a small number of rows are affected, we skip them with `on_bad_lines="skip"` and still retain a large portion of the data.

In [None]:
import pandas as pd

FILE = "./data/Assignment-1_Data.csv"

data = pd.read_csv(FILE, sep=";", on_bad_lines="skip", low_memory=False)

print(data.head())
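
To see what `on_bad_lines="skip"` does in isolation, here is a minimal sketch on a made-up, in-memory sample (the rows below are hypothetical and not taken from the dataset). pandas treats a line with too many fields as a bad line; a line with too few fields is kept and padded with missing values instead.

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has one field too many.
raw = (
    "BillNo;Itemname;Country\n"
    "1001;MUG;France\n"
    "1002;PLATE;extra;Germany\n"
    "1003;BOWL;Spain\n"
)

sample = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")

# Compare the number of data lines with the number of parsed rows
total_rows = raw.count("\n") - 1  # minus the header line
skipped_rows = total_rows - len(sample)

print(sample)
print(f"Skipped {skipped_rows} of {total_rows} rows")
```

Counting the skipped rows this way is a quick check that only a small share of the file is being discarded.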

# Cleaning Data

In this step, we clean the dataset by keeping only the columns relevant to our analysis: `BillNo`, `Itemname`, and `Country`. All other columns are dropped. Next, we remove incomplete entries, that is, rows where `BillNo`, `Itemname`, or `Country` is missing, since they cannot be assigned to a transaction, an item, or a region. Finally, we strip leading and trailing whitespace from the `Itemname` column so that the same item is not counted under two different names.

In [None]:
columns_to_keep = ['BillNo', 'Itemname', 'Country']

data = data[columns_to_keep]

# Drop rows with missing values (dropna returns a new DataFrame)
data = data.dropna()

# Strip leading and trailing whitespace from item names
data['Itemname'] = data['Itemname'].str.strip()

print(data.head())
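
The effect of the whitespace cleanup is easiest to see on a toy example. The sketch below uses a small made-up DataFrame (the values are hypothetical) and counts distinct item names before and after stripping.

```python
import pandas as pd

# Hypothetical rows: "MUG" appears with and without a trailing space,
# and one row has a missing item name.
toy = pd.DataFrame({
    "BillNo": [1, 1, 2, 3],
    "Itemname": ["MUG", "MUG ", " PLATE", None],
    "Country": ["France", "France", "France", "Spain"],
})

cleaned = toy.dropna().copy()

before = cleaned["Itemname"].nunique()
cleaned["Itemname"] = cleaned["Itemname"].str.strip()
after = cleaned["Itemname"].nunique()

print(f"Distinct item names before: {before}, after: {after}")
```

Without the strip, `"MUG"` and `"MUG "` would become two separate columns later in the one-hot encoding, splitting the item's support between them.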

To analyze transactions across different regions, we separate the data by country. We create a dictionary called `country_datas` by using `groupby` on the `Country` column: each country is a key, and its value is the subset of rows for that country. Some transactions have a `Country` of `Unspecified`; we exclude these from our data. To keep the per-country results reliable, we only keep countries with more than 1000 rows of transactional data.

Finally, we print a quick preview of the rows for each country; this loop can be commented out.

In [None]:
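country_datas = {country: group for country, group in data.groupby('Country')}

# Drop rows whose country is unknown (no error if the key is absent)
country_datas.pop("Unspecified", None)

# Keep countries with more than 1000 rows of transaction details
country_datas = {key: value for key, value in country_datas.items() if value.shape[0] > 1000}

# Optional preview; the loop variable is not named `data`, so the cleaned DataFrame is not overwritten
for country, country_df in country_datas.items():
    print(f"Data for {country}:")
    print(country_df.head())
    print("\n")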
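
The split-and-filter step can be tried on a toy DataFrame first. This is a minimal sketch with made-up rows; `MIN_ROWS` is a hypothetical name, set to 2 here only so the tiny example has something to filter.

```python
import pandas as pd

toy = pd.DataFrame({
    "BillNo": [1, 1, 2, 3, 4],
    "Itemname": ["MUG", "PLATE", "MUG", "BOWL", "CUP"],
    "Country": ["France", "France", "France", "Spain", "Unspecified"],
})

MIN_ROWS = 2  # hypothetical threshold for the toy data

# One sub-DataFrame per country
by_country = {country: frame for country, frame in toy.groupby("Country")}

# Remove the placeholder country; pop with a default never raises KeyError
by_country.pop("Unspecified", None)

# Keep only countries with more than MIN_ROWS rows
by_country = {c: f for c, f in by_country.items() if len(f) > MIN_ROWS}

print(list(by_country))
```

Here only France survives: Spain has a single row and `Unspecified` is removed explicitly.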

At this point we have `country_datas`, which contains one row per purchased item for each country, identified by `BillNo`. To prepare the data for frequent itemset mining, we need to group all rows sharing the same `BillNo` into a single transaction, so that all items purchased together are treated as one unit. We achieve this by using the `groupby` function to aggregate the items by `BillNo`.

The print statements below show example transactions for a closer look. After this transformation, our transaction data for each country is ready for the `TransactionEncoder` in `mlxtend`.

In [None]:
country_transactions = {}

# Collect the items of each bill into one list per transaction.
# Building the lists directly, instead of joining the names with commas
# and splitting them again, keeps item names that contain commas intact.
for country, country_df in country_datas.items():
    baskets = country_df.groupby('BillNo')['Itemname'].apply(list)
    country_transactions[country] = baskets.tolist()

# Preview the first transaction of each country
for country, transactions in country_transactions.items():
    print(f"Transactions for {country}:")
    print(transactions[0])
    print("\n")
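
The grouping step can be sketched on a toy DataFrame. The rows below are made up, and one item name deliberately contains a comma to show what to watch for when baskets are built through a comma-joined string: splitting the string again cuts that name in two.

```python
import pandas as pd

toy = pd.DataFrame({
    "BillNo": [1, 1, 2, 2, 2],
    "Itemname": ["MUG", "PLATE, LARGE", "MUG", "BOWL", "CUP"],
})

# One list of item names per bill
baskets = toy.groupby("BillNo")["Itemname"].apply(list).tolist()
print(baskets)

# Round trip through a comma-joined string, for comparison
joined = toy.groupby("BillNo")["Itemname"].apply(",".join)
split_again = joined.apply(lambda s: s.split(",")).tolist()
print(split_again)
```

The first basket has two items when built as a list, but three fragments after the join and split round trip.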

# Generating Frequent Itemsets

To perform frequent itemset mining with the FP-Growth algorithm, we first transform the `country_transactions` dictionary into a one-hot encoded format. The FP-Growth function from the `mlxtend` library expects a binary matrix in which each row represents a transaction and each column represents the presence or absence of one item. Using `TransactionEncoder` from `mlxtend.preprocessing`, we transform each country's transactions into a pandas DataFrame with this encoding.

In [None]:
from mlxtend.preprocessing import TransactionEncoder

for country, transactions in country_transactions.items():
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    data = pd.DataFrame(te_ary, columns=te.columns_)
    country_transactions[country] = data
    
for country, transactions in country_transactions.items():
    print(f"Transactions for {country}:")
    print(transactions.head())
    print("\n")
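
To make the target layout concrete, the sketch below builds the same kind of boolean matrix by hand with plain pandas for three made-up baskets. It only illustrates the shape of the data and is not a replacement for `TransactionEncoder`.

```python
import pandas as pd

# Hypothetical baskets
baskets = [["MUG", "PLATE"], ["MUG"], ["BOWL", "MUG"]]

# One column per distinct item, one row per basket
items = sorted({item for basket in baskets for item in basket})
onehot = pd.DataFrame(
    [[item in basket for item in items] for basket in baskets],
    columns=items,
)

print(onehot)

# The column mean is the share of baskets containing each item,
# which is the support of that single item.
print(onehot.mean())
```

Reading the matrix this way also explains why the number of columns grows with the number of distinct items in a country.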

Next, we apply the FP-Growth algorithm to extract frequent itemsets and association rules for each country's transactions. The cell below processes every country kept so far; the commented-out `if` line can be re-enabled to restrict the run to a subset such as United Kingdom, France, and Germany. We identify frequent itemsets with a minimum support threshold of 0.1, so only itemsets that appear in at least 10% of a country's transactions are included. The itemsets are then sorted by support to highlight the most common combinations of items. Additionally, we use the `association_rules` function to derive rules, keeping those with a confidence of at least 0.8.

Finally, we displayed the frequent itemsets and association rules for each country to gain insights into region-specific shopping patterns.

In [None]:
from mlxtend.frequent_patterns import fpgrowth, association_rules

fq_itemsets = {}
fq_rules = {}
# Apply FP-Growth to each country's transactions
for country, transactions in country_transactions.items():
    #if country in {'United Kingdom', 'France', 'Germany', 'Australia', 'Austria', 'Bahrain', 'Belgium'}:
    frequent_itemsets = fpgrowth(transactions, min_support=0.1, use_colnames=True)
    top_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)
    rules = association_rules(top_itemsets, metric='confidence', min_threshold=0.8)
    fq_itemsets[country] = frequent_itemsets
    fq_rules[country] = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
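
The two thresholds used above can be checked by hand on a handful of baskets. The following is a minimal sketch in plain Python with made-up baskets; the helper names `support` and `confidence` are chosen for illustration and are not mlxtend functions.

```python
# Hypothetical baskets
baskets = [
    {"MUG", "PLATE"},
    {"MUG", "PLATE", "BOWL"},
    {"MUG"},
    {"BOWL"},
    {"MUG", "PLATE"},
]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(round(support({"MUG", "PLATE"}), 2))
print(round(confidence({"PLATE"}, {"MUG"}), 2))
print(round(confidence({"MUG"}, {"PLATE"}), 2))
```

With a confidence threshold of 0.8, the rule PLATE → MUG would be kept (every basket with a plate also has a mug), while MUG → PLATE would be dropped, even though both rules come from the same itemset.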


In [None]:
for country, itemsets in fq_itemsets.items():
    print(f"Frequent itemsets for {country}")
    print(len(itemsets))
    print(itemsets.sort_values(by='support', ascending=False).head(5))
    print("\n")

In [None]:
for country, rules in fq_rules.items():
    print(f"Association rules for {country}")
    print(len(rules))
    print(rules.head(5))
    print("\n")

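To analyze the strength of the association rules, we categorize them by lift. A lift above 1 means the antecedent and consequent occur together more often than expected if they were independent (positive correlation), a lift of exactly 1 means they are independent, and a lift below 1 means they occur together less often than expected (negative correlation). The cell below counts rules with lift > 1 as positive and groups everything with lift ≤ 1 under negative. Overall, most of the rules are positively correlated, which suggests that the high-confidence rules reflect genuine co-purchasing rather than the popularity of the consequent alone.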

In [None]:
for country, rules in fq_rules.items():
    # Separate rules based on lift
    positive_corr = rules[rules['lift'] > 1]
    negative_corr = rules[rules['lift'] <= 1]
    
    # Print association rule correlation summary
    print(f"Association rules correlation for country: {country}")
    print(f"Total Rules: {len(rules)}")
    print(f"Positive correlation rules: {len(positive_corr)}")
    print(f"Negative correlation rules: {len(negative_corr)}")
    print("\n")
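
Lift is the rule's confidence divided by the support of its consequent. The sketch below works through three made-up rules and separates the independent case (lift equal to 1) from the negative case; the function names and the tolerance are assumptions for illustration.

```python
import math


def lift(confidence, consequent_support):
    """Lift of a rule: confidence divided by the consequent's support."""
    return confidence / consequent_support


def classify(lift_value, tol=1e-9):
    """Label a lift value as positive, independent, or negative."""
    if math.isclose(lift_value, 1.0, abs_tol=tol):
        return "independent"
    return "positive" if lift_value > 1.0 else "negative"


# (confidence, consequent support) for three hypothetical rules
examples = [(0.9, 0.3), (0.8, 0.8), (0.8, 0.95)]
labels = [classify(lift(c, s)) for c, s in examples]
print(labels)
```

The formula also shows why rules filtered at a confidence of 0.8 are mostly positive: their lift exceeds 1 whenever the consequent appears in fewer than 80% of transactions.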


# Apriori Algorithm

The Apriori algorithm is another method for detecting frequent itemsets. Here we use the `apriori` and `association_rules` functions from the `mlxtend` library. We apply Apriori to the same countries with the same minimum support threshold of 0.1. The itemsets are then sorted in descending order of support to identify the most common combinations of items. Additionally, we use the `association_rules` function to extract rules from the frequent itemsets, applying the same minimum confidence threshold of 0.8 as before.
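
For intuition about how Apriori searches, here is a simplified level-wise sketch in plain Python. It is not mlxtend's implementation: `apriori_sketch` is a hypothetical helper that grows candidates one item at a time and prunes any candidate that has an infrequent subset.

```python
from itertools import combinations


def apriori_sketch(baskets, min_support):
    """Return {frozenset: support} for all itemsets meeting min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1 candidates: every single item
    current = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while current:
        # Count support and keep candidates that meet the threshold
        scored = {c: support(c) for c in current}
        level = {c: s for c, s in scored.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


toy_baskets = [
    {"MUG", "PLATE"},
    {"MUG", "PLATE", "BOWL"},
    {"MUG"},
    {"BOWL"},
    {"MUG", "PLATE"},
]
result = apriori_sketch(toy_baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

In this toy run, `BOWL` falls below the threshold at the first level, so no itemset containing it is ever counted. That pruning is the main idea behind the algorithm.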

In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

apriori_itemsets = {}
apriori_rules = {}

In [None]:
# Apply Apriori to each country's transactions
for country, transactions in country_transactions.items():
    #if country in {'United Kingdom', 'France', 'Germany', 'Australia', 'Austria', 'Bahrain', 'Belgium'}:
    print(f"Processing Apriori for {country}...\n")
        
    # Generate frequent itemsets using Apriori
    frequent_itemsets = apriori(transactions, min_support=0.1, use_colnames=True)
        
    # Sort itemsets by support in descending order
    top_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)
        
    # Generate association rules from the frequent itemsets
    rules = association_rules(top_itemsets, metric='confidence', min_threshold=0.8)
        
    # Store results
    apriori_itemsets[country] = frequent_itemsets
    apriori_rules[country] = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Finally, we stored and displayed the frequent itemsets and association rules for each country, providing insights into purchasing patterns based on the Apriori algorithm.

In [None]:
# Display results for frequent itemsets
for country, itemsets in apriori_itemsets.items():
    print(f"Frequent itemsets for {country} (Apriori)")
    print(len(itemsets))
    print(itemsets.sort_values(by='support', ascending=False).head(5))
    print("\n")

In [None]:
# Display results for association rules
for country, rules in apriori_rules.items():
    print(f"Association rules for {country} (Apriori)")
    print(len(rules))
    print(rules.head(5))
    print("\n")

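Once we run through all the calculations, we observe that the results of the Apriori algorithm match the results obtained by FP-Growth for all countries. This is expected: both algorithms return exactly the itemsets that meet the support threshold and differ only in how they search for them.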
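
One way to check this claim programmatically is to compare the two result tables as mappings from itemset to support, since the row order can differ between algorithms. The sketch below uses two small hand-made DataFrames with the `support` and `itemsets` columns that mlxtend returns; `itemsets_as_dict` and `same_itemsets` are hypothetical helper names.

```python
import pandas as pd


def itemsets_as_dict(itemsets, ndigits=10):
    """Map each itemset (as a frozenset) to its rounded support."""
    return {
        frozenset(row.itemsets): round(row.support, ndigits)
        for row in itemsets.itertuples(index=False)
    }


def same_itemsets(a, b):
    """True if both tables hold the same itemsets with the same support."""
    return itemsets_as_dict(a) == itemsets_as_dict(b)


# Toy stand-ins for one country's FP-Growth and Apriori output
fp_result = pd.DataFrame({
    "support": [0.6, 0.8],
    "itemsets": [frozenset({"MUG", "PLATE"}), frozenset({"MUG"})],
})
ap_result = pd.DataFrame({
    "support": [0.8, 0.6],
    "itemsets": [frozenset({"MUG"}), frozenset({"PLATE", "MUG"})],
})

print(same_itemsets(fp_result, ap_result))

# In the tutorial, the same check could be run per country, for example:
# all(same_itemsets(fq_itemsets[c], apriori_itemsets[c]) for c in fq_itemsets)
```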