# Market Basket Analysis

In this tutorial, we will explore the Market Basket Analysis dataset, which can be downloaded from Kaggle using the following link: Market Basket Analysis Dataset.

We will start by performing data cleaning to ensure the dataset is ready for analysis. Next, we will extract frequent itemsets using the FP-Growth algorithm, a powerful tool for mining frequent patterns in large datasets. Additionally, we will analyze the frequent itemsets on a per-country basis, enabling us to uncover unique purchasing patterns and trends across different regions.

By the end of this tutorial, you will have a solid understanding of how to preprocess transactional data and use the FP-Growth algorithm to gain valuable insights into customer purchasing behavior.

Before we begin, we need to install pandas and mlxtend: pandas to query the data and mlxtend to run the FP-Growth algorithm.

To install pandas and mlxtend with pip:

```bash
pip install pandas
pip install mlxtend==0.23.1
```

# Importing Data

Once the libraries have been installed, we can import the data into our program. There are multiple ways of doing this, such as using the Kaggle API or downloading the CSV file. In this tutorial, we use the folder downloaded from Kaggle, renamed to `data`. The file path is stored in the variable `FILE` below and can be changed to match your setup.

When importing the CSV file with pandas, a small number of rows fail to parse because of inconsistent formatting. Since so few rows are affected, we skip them with `on_bad_lines="skip"` and still retain the vast majority of the data.

In [None]:
import pandas as pd

FILE = "./data/Assignment-1_Data.csv"

data = pd.read_csv(FILE, sep=";", on_bad_lines="skip", low_memory=False)

print(data.head())
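
To get a feel for how many rows `on_bad_lines="skip"` discards, one option is to count the lines that have more fields than the header, which is what pandas treats as a bad line. The sketch below does this with the standard `csv` module on a small in-memory sample. The sample text and the helper name `count_bad_lines` are illustrative assumptions, not part of the dataset or of pandas.

```python
import csv
import io

def count_bad_lines(text, sep=";"):
    """Count data lines that have more fields than the header line."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    return sum(1 for row in reader if len(row) > len(header))

# Hypothetical sample: the second data line has one field too many
sample = "BillNo;Itemname;Country\n1;MUG;France\n2;CUP;EXTRA;Spain\n3;PLATE;Italy\n"
print(count_bad_lines(sample))
```

To run the same check on the real file, the text could be read with `open(FILE, encoding="utf-8").read()`, assuming the file is UTF-8 encoded.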

# Cleaning Data

In this step, we will clean the dataset by focusing on the columns that are most relevant to our analysis: `BillNo`, `Itemname`, and `Country`. All other columns will be dropped, as they are not necessary for the insights we aim to extract. Next, we will remove any entries where the data is incomplete, specifically rows where `BillNo`, `Itemname`, or `Country` are missing. These incomplete records can introduce inconsistencies or inaccuracies in our analysis, so it is important to exclude them. Additionally, we will clean up the Itemname column by removing any leading or trailing whitespace to ensure consistency and accuracy when analyzing item names. These preprocessing steps will ensure that the dataset is clean, focused, and ready for further analysis.

In [None]:
columns_to_keep = ['BillNo', 'Itemname', 'Country']

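# Keep only the relevant columns and drop rows with missing values;
# copy() avoids pandas' SettingWithCopyWarning on the assignment below
data = data[columns_to_keep].dropna().copy()

# Strip leading and trailing whitespace from item names
data['Itemname'] = data['Itemname'].str.strip()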

print(data.head())

To analyze transactions across different regions, we separate the data by country. This is achieved by creating a dictionary called `country_datas`, which uses the `groupby` function to group all rows by the `Country` column. Each country serves as a key in the dictionary, with its corresponding value being the subset of the data containing only transactions for that country. Some transactions have a `Country` of `Unspecified`, which we exclude from our data. To keep the results reliable, we only keep the countries with more than 1000 rows of transactional data.

Finally, the code prints a quick preview of the transactions for each country; this loop can be commented out.

In [None]:
country_datas = {country: group for country, group in data.groupby('Country')}

# Exclude transactions whose country is unspecified (no error if the key is absent)
country_datas.pop("Unspecified", None)

# Keep countries with more than 1000 rows of transaction details
country_datas = {key: value for key, value in country_datas.items() if value.shape[0] > 1000}

for country, country_df in country_datas.items():
    print(f"Data for {country}:")
    print(country_df.head())
    print("\n")

At this point, `country_datas` holds the rows for each country, with one row per purchased item and a `BillNo` identifying the bill it belongs to. To prepare the data for frequent itemset mining, we need to combine all rows sharing the same `BillNo` into a single transaction. This ensures that all items purchased together are treated as a single unit. We achieve this by using the `groupby` function to aggregate the items by `BillNo`.

The print statements below show an example transaction for each country. After this transformation, our transaction data for each country is ready for the `TransactionEncoder` in `mlxtend`.

In [None]:
country_transactions = {}

# Collect the items of each bill into a list. Building lists directly avoids
# joining and re-splitting on ',', which would break item names that contain a comma.
for country, country_df in country_datas.items():
    country_transactions[country] = (
        country_df.groupby('BillNo')['Itemname'].apply(list).tolist()
    )

# Preview the first transaction of each country
for country, transactions in country_transactions.items():
    print(f"Transactions for {country}:")
    print(transactions[0])
    print("\n")

# Generating Frequent Itemsets

To perform frequent itemset mining using the FP-Growth algorithm, we first prepared our data by transforming the `country_transactions` dictionary into a one-hot encoded format. The FP-Growth functions from the `mlxtend` library expect the input to be a binary matrix where each row represents a transaction, and columns represent the presence or absence of items. Using `TransactionEncoder` from `mlxtend.preprocessing`, we transformed each country's transaction data into a Pandas DataFrame with binary encoding.

In [None]:
from mlxtend.preprocessing import TransactionEncoder

for country, transactions in country_transactions.items():
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    data = pd.DataFrame(te_ary, columns=te.columns_)
    country_transactions[country] = data
    
for country, transactions in country_transactions.items():
    print(f"Transactions for {country}:")
    print(transactions.head())
    print("\n")
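
To make the binary matrix concrete, here is a minimal plain-Python sketch of the same idea on a toy list of transactions. It is not how `TransactionEncoder` is implemented; the helper name `one_hot` and the toy items are assumptions made for illustration.

```python
def one_hot(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        # True where the transaction contains the item for that column
        rows.append([item in present for item in columns])
    return columns, rows

# Hypothetical toy transactions
toy = [["MUG", "PLATE"], ["CUP"], ["PLATE", "CUP"]]
columns, rows = one_hot(toy)
print(columns)
for row in rows:
    print(row)
```

Each row corresponds to one transaction and each column to one item, which is the layout the frequent pattern functions expect.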

Next, we apply the FP-Growth algorithm to extract frequent itemsets and association rules for each country's transactions. The code runs on every remaining country; the commented-out filter can restrict it to a subset such as United Kingdom, France, and Germany. We use a minimum support threshold of 0.1, so only itemsets that appear in at least 10% of a country's transactions are kept. The frequent itemsets are then sorted by their support values to highlight the most common combinations of items. Finally, the `association_rules` function derives rules from the itemsets, keeping those with a confidence of at least 0.8.

Finally, we displayed the frequent itemsets and association rules for each country to gain insights into region-specific shopping patterns.

In [None]:
from mlxtend.frequent_patterns import fpgrowth, association_rules

fq_itemsets = {}
fq_rules = {}
# Apply FP-Growth to each country's transactions
for country, transactions in country_transactions.items():
    #if country in {'United Kingdom', 'France', 'Germany', 'Australia', 'Austria', 'Bahrain', 'Belgium'}:
    frequent_itemsets = fpgrowth(transactions, min_support=0.1, use_colnames=True)
    top_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)
    rules = association_rules(top_itemsets, metric='confidence', min_threshold=0.8)
    fq_itemsets[country] = frequent_itemsets
    fq_rules[country] = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
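
To illustrate what `min_support` means, the sketch below finds frequent itemsets on a toy dataset by brute-force enumeration. This is deliberately not FP-Growth, which avoids enumerating candidates; it only shows the support definition that both algorithms share. The function name `frequent_itemsets`, the `max_len` cap, and the toy data are assumptions for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=2):
    """Brute-force search: support = share of transactions containing the itemset."""
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            support = sum(set(combo) <= b for b in baskets) / len(baskets)
            if support >= min_support:
                result[frozenset(combo)] = support
    return result

# Hypothetical toy transactions
toy = [["MUG", "PLATE"], ["MUG", "PLATE", "CUP"], ["CUP"], ["MUG"]]
found = frequent_itemsets(toy, min_support=0.5)
for itemset, support in found.items():
    print(sorted(itemset), support)
```

With a threshold of 0.5, the pair `MUG` and `PLATE` qualifies because it appears in two of the four transactions, while pairs involving `CUP` do not.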


In [None]:
for country, itemsets in fq_itemsets.items():
    print(f"Frequent itemsets for {country}")
    print(len(itemsets))
    print(itemsets.sort_values(by='support', ascending=False).head(5))
    print("\n")

In [None]:
for country, rules in fq_rules.items():
    print(f"Association rules for {country}")
    print(len(rules))
    print(rules.head(5))
    print("\n")

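To analyze the strength of the association rules, we categorize them by lift. A lift above 1 means the antecedent and consequent occur together more often than expected if they were independent (positive correlation), a lift of exactly 1 means they are independent, and a lift below 1 means they occur together less often than expected (negative correlation). The code below separates rules with lift > 1 from those with lift ≤ 1. Overall, most of the rules are positively correlated, which suggests the rules reflect genuine co-purchasing patterns rather than items that are simply popular on their own.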

In [None]:
for country, rules in fq_rules.items():
    # Separate rules based on lift
    positive_corr = rules[rules['lift'] > 1]
    negative_corr = rules[rules['lift'] <= 1]
    
    # Print association rule correlation summary
    print(f"Association rules correlation for country: {country}")
    print(f"Total Rules: {len(rules)}")
    print(f"Positive correlation rules: {len(positive_corr)}")
    print(f"Negative correlation rules: {len(negative_corr)}")
    print("\n")
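
The metrics themselves can be computed by hand. The following sketch, assuming a toy dataset and a hypothetical helper named `rule_metrics`, uses the standard definitions: confidence is the support of the combined itemset divided by the support of the antecedent, and lift is the confidence divided by the support of the consequent.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    a, c = set(antecedent), set(consequent)
    support_a = sum(a <= b for b in baskets) / n
    support_c = sum(c <= b for b in baskets) / n
    support_ac = sum((a | c) <= b for b in baskets) / n
    confidence = support_ac / support_a
    lift = confidence / support_c
    return support_ac, confidence, lift

# Hypothetical toy transactions
toy = [["MUG", "PLATE"], ["MUG", "PLATE", "CUP"], ["CUP"], ["MUG"]]
support, confidence, lift = rule_metrics(toy, ["PLATE"], ["MUG"])
print(support, confidence, lift)
```

Here every transaction containing `PLATE` also contains `MUG`, so the confidence is 1.0, and since `MUG` appears in only three of four transactions, the lift is above 1.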


# Apriori Algorithm

The Apriori algorithm is another method for detecting frequent itemsets. The `mlxtend` library provides it through the `apriori` function, which we combine with `association_rules` as before. We apply Apriori to the same countries with the same minimum support threshold of 0.1. The itemsets are then sorted in descending order of support to identify the most common combinations of items, and `association_rules` extracts rules with a minimum confidence threshold of 0.8, matching the FP-Growth run.

In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

apriori_itemsets = {}
apriori_rules = {}

In [None]:
# Apply Apriori to each country's transactions
for country, transactions in country_transactions.items():
    #if country in {'United Kingdom', 'France', 'Germany', 'Australia', 'Austria', 'Bahrain', 'Belgium'}:
    print(f"Processing Apriori for {country}...\n")
        
    # Generate frequent itemsets using Apriori
    frequent_itemsets = apriori(transactions, min_support=0.1, use_colnames=True)
        
    # Sort itemsets by support in descending order
    top_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)
        
    # Generate association rules from the frequent itemsets
    rules = association_rules(top_itemsets, metric='confidence', min_threshold=0.8)
        
    # Store results
    apriori_itemsets[country] = frequent_itemsets
    apriori_rules[country] = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Finally, we stored and displayed the frequent itemsets and association rules for each country, providing insights into purchasing patterns based on the Apriori algorithm.

In [None]:
# Display results for frequent itemsets
for country, itemsets in apriori_itemsets.items():
    print(f"Frequent itemsets for {country} (Apriori)")
    print(len(itemsets))
    print(itemsets.sort_values(by='support', ascending=False).head(5))
    print("\n")

In [None]:
# Display results for association rules
for country, rules in apriori_rules.items():
    print(f"Association rules for {country} (Apriori)")
    print(len(rules))
    print(rules.head(5))
    print("\n")

Once we run through all the calculations, we observe that the results of the Apriori algorithm match the results obtained by FP-Growth for all countries.
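
One way to check this programmatically is to compare the itemsets and their supports while ignoring row order. The sketch below is a hedged example: `itemsets_match` is a hypothetical helper, and the two small dictionaries stand in for result tables with `itemsets` and `support` columns.

```python
import math

def itemsets_match(left, right, tol=1e-9):
    """Compare two results that expose 'itemsets' and 'support' columns."""
    a = dict(zip(left["itemsets"], left["support"]))
    b = dict(zip(right["itemsets"], right["support"]))
    return a.keys() == b.keys() and all(
        math.isclose(a[k], b[k], abs_tol=tol) for k in a
    )

# Hypothetical stand-ins for two result tables, with rows in a different order
fp = {"itemsets": [frozenset({"MUG"}), frozenset({"MUG", "PLATE"})], "support": [0.75, 0.5]}
ap = {"itemsets": [frozenset({"MUG", "PLATE"}), frozenset({"MUG"})], "support": [0.5, 0.75]}
print(itemsets_match(fp, ap))
```

Because the helper only indexes by column name and iterates the values, it should also accept the DataFrames stored in `fq_itemsets` and `apriori_itemsets`, for example `itemsets_match(fq_itemsets[country], apriori_itemsets[country])`.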

# Knowledge Representation

At the final stage of Knowledge Discovery in Databases (KDD), 
the knowledge representation is captured in a graph database, 
which enables better insight and visualization of the discovered patterns. 
In this step, we store the association rules discovered by algorithms such as 
Apriori in a Neo4j graph database.

In the provided code, a connection to the Neo4j database is established, 
and a dynamic structure is created for storing countries, 
association rules (antecedents and consequents), and their relationships, 
such as `HAS_RULE` and `RESULTS_INTO`. 
This allows you to represent associations between items (antecedents and consequents) 
for specific countries with their 
corresponding support, confidence, and lift values.

The graph is drawn with the `networkx` and `matplotlib` libraries 
and saved as an image. Each rule is drawn as a directed edge, 
with the support, confidence, and lift values labeled on the edges, 
giving a visual overview of the strength and reliability of each rule. 
This representation helps in understanding the relationships between items in a country 
and can support further use cases such as recommendation systems.

Here's a breakdown of the core steps involved in the code:

1. **Neo4j Graph Database Connection**: Establishing a connection to a Neo4j database where the rules will be stored.
2. **Storing Rules**: The `create_rule` function dynamically creates country nodes, antecedent nodes, and consequent nodes, and links them with relationships such as `HAS_RULE` and `RESULTS_INTO`, with properties for support, confidence, and lift.
3. **Visualizing the Graph**: The `visualize_graph` function generates a visual representation of the graph using `networkx`, highlighting the rules between antecedents and consequents.
4. **Clearing the Database**: Before creating new rules, the `clear_database` function deletes any existing rules and nodes, ensuring the graph remains up-to-date with the latest rules.

This step completes the KDD pipeline by transforming raw data into actionable insights represented as knowledge in a graph structure, ready for further analysis or decision-making.

### Why this is better than traditional storage:
- **Efficient Relationship Modeling**: Unlike traditional relational databases, which struggle with complex, many-to-many relationships, Neo4j is optimized for handling and querying connected data. It allows for intuitive representation and querying of relationships, making it easier to uncover hidden patterns and associations between items.
  
- **Scalability and Flexibility**: As the data grows, graph databases like Neo4j efficiently scale and adapt to complex structures, without the need for expensive joins or cumbersome relational tables. This flexibility is ideal for dynamic and evolving data, like association rules, where relationships are central to the analysis.

These features enable faster, more efficient querying and provide deeper insights into the data compared to traditional row-based storage systems.


In [None]:
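import os

from neo4j import GraphDatabase
import networkx as nx
import matplotlib.pyplot as plt

# Neo4j connection details, read from environment variables so that credentials
# are not stored in the notebook (the variable names NEO4J_URI, NEO4J_USERNAME,
# and NEO4J_PASSWORD are a suggested convention, set them before running)
uri = os.environ.get("NEO4J_URI", "neo4j+s://27f471e5.databases.neo4j.io")  # Your Aura instance URI
username = os.environ.get("NEO4J_USERNAME", "neo4j")  # Your Neo4j username
password = os.environ["NEO4J_PASSWORD"]  # Your Neo4j password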

# Initialize the driver
driver = GraphDatabase.driver(uri, auth=(username, password))

# Function to create a country node (dynamic)
def create_country(tx, country_name):
    tx.run("MERGE (ct:Country {name: $country_name})", country_name=country_name)

# Function to create rules based on the given antecedent, consequent, and properties
def create_rule(tx, country_name, antecedent_items, consequent_items, support, confidence, lift):
    tx.run(
        """
        MERGE (a:Antecedent {items: $antecedent_items})
        MERGE (c:Consequent {items: $consequent_items})
        WITH a, c
        MATCH (ct:Country {name: $country_name})
        MERGE (ct)-[:HAS_RULE]->(a)-[:RESULTS_INTO {support: $support, confidence: $confidence, lift: $lift}]->(c)
        """,
        antecedent_items=antecedent_items,
        consequent_items=consequent_items,
        support=support,
        confidence=confidence,
        lift=lift,
        country_name=country_name
    )

def clear_database(tx):
    # Delete HAS_RULE relationships
    tx.run("MATCH (a)-[re:HAS_RULE]-(b) DELETE re")
    # Delete RESULTS_INTO relationships
    tx.run("MATCH (a)-[re:RESULTS_INTO]-(b) DELETE re")
    # Delete all nodes
    tx.run("MATCH (n) DELETE n")
    
# Function to fetch rules for a specific country
def fetch_rules(tx, country_name):
    query = """
    MATCH (ct:Country {name: $country_name})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
    RETURN ct, a, r1, r2, c
    """
    result = tx.run(query, country_name=country_name)
    return [
        {"antecedent": record["a"].get("items"), "consequent": record["c"].get("items"), 
         "support": record["r2"].get("support"), "confidence": record["r2"].get("confidence"), "lift": record["r2"].get("lift")}
        for record in result
    ]

# Function to create a graph representation of the rules
def create_graph(rules):
    G = nx.DiGraph()  # Directed graph
    
    # Add nodes and edges
    for rule in rules:
        antecedent = str(rule['antecedent'])
        consequent = str(rule['consequent'])
        G.add_node(antecedent, type='Antecedent')
        G.add_node(consequent, type='Consequent')
        G.add_edge(
            antecedent, consequent,
            support=rule['support'], confidence=rule['confidence'], lift=rule['lift']
        )
    
    return G

# Function to visualize the graph and save it as an image
def visualize_graph(G, filename="graph.png"):
    pos = nx.spring_layout(G)  # Positioning nodes
    plt.figure(figsize=(12, 12))  # Adjust figure size
    nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=3000, font_size=10, font_weight='bold', edge_color='gray')
    
    # Draw edge labels (support, confidence, lift)
    edge_labels = {
        (u, v): f"Supp: {d['support']:.2f}, Conf: {d['confidence']:.2f}, Lift: {d['lift']:.2f}"
        for u, v, d in G.edges(data=True)
    }
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
    
    # Save the graph as an image
    plt.title("Association Rules Graph")
    plt.savefig(filename)
    plt.close()


In [None]:
with driver.session() as session:
    # Clear the database in the specified sequence
    session.execute_write(clear_database)
    print("All nodes and relationships deleted in sequence.")

In [None]:
for country, rules in fq_rules.items():
    print(f"Association rules for {country} in graph database - Neo4j")
    
    # Initialize the list to hold the rule dictionaries; it is named
    # rule_records so it does not shadow mlxtend's association_rules function
    rule_records = []
    
    rules = rules.head(5)

    # Iterate over each row in the DataFrame
    for index, row in rules.iterrows():
        # Create a dictionary for each rule and append it to the list
        rule = {
            'antecedents': row['antecedents'],
            'consequents': row['consequents'],
            'support': row['support'],
            'confidence': row['confidence'],
            'lift': row['lift']
        }
        rule_records.append(rule)
    
    # Display the result
    print(rule_records)
    print("\n")
    
    country_name = country
    
    with driver.session() as session:
        # Create the country node
        session.execute_write(create_country, country_name)
    
        # Loop through the rule_records list and create each rule
        for rule in rule_records:
            antecedents = list(rule['antecedents'])
            consequents = list(rule['consequents'])
            support = rule['support']
            confidence = rule['confidence']
            lift = rule['lift']
            
            # Create the rule in the database with the dynamic country name
            session.execute_write(create_rule, country_name, antecedents, consequents, support, confidence, lift)
        
        # Fetch rules for the specified country
        rules = session.execute_read(fetch_rules, country_name)
    
        # Create a graph from the fetched rules
        G = create_graph(rules)
        
        # Visualize the graph and save the image
        visualize_graph(G, filename=f"association_rules_{country_name}.png")
        
        print(f"Graph image saved as 'association_rules_{country_name}.png'")
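
One detail worth noting when storing rules: converting a `frozenset` with `list()` gives no guaranteed element order, and Cypher compares list properties in order, so the same itemset could end up as two different nodes under `MERGE`. A possible safeguard, sketched below with a hypothetical helper `rule_to_params`, is to sort the items and cast the metrics to plain floats before passing them as parameters.

```python
def rule_to_params(rule):
    """Convert one rule into plain, deterministically ordered Cypher parameters."""
    return {
        "antecedent_items": sorted(rule["antecedents"]),
        "consequent_items": sorted(rule["consequents"]),
        "support": float(rule["support"]),
        "confidence": float(rule["confidence"]),
        "lift": float(rule["lift"]),
    }

# Hypothetical rule, shaped like one row of the rules table
rule = {"antecedents": frozenset({"PLATE", "CUP"}), "consequents": frozenset({"MUG"}),
        "support": 0.5, "confidence": 1.0, "lift": 4 / 3}
params = rule_to_params(rule)
print(params["antecedent_items"])
```

The keys mirror the parameter names used in `create_rule`, so the resulting dictionary could be unpacked into that call alongside the country name.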


The saved graph images do not render well, and we were unable to resolve this. However, you can view the graph database's contents and experiment with it directly by following these steps:

**Steps to explore the graph data:**

1. **Visit the Neo4j Browser** at [this link](https://27f471e5.databases.neo4j.io/browser/).
   
2. **Login details:**
   - **Username**: `neo4j`
   - **Password**: the password of your Aura instance

3. **Wait for 60 seconds** before connecting to the database, or you can log in to [Neo4j Console](https://console.neo4j.io) to ensure that your Aura instance is available and ready to use.

4. **Use the following Cypher queries** for each country to view their respective association rule graphs. These queries will return nodes and relationships representing the association rules:

### Queries for Graph Representation:

- **Australia**:
  ```cypher
  MATCH (ct:Country {name: 'Australia'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **Belgium**:
  ```cypher
  MATCH (ct:Country {name: 'Belgium'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **France**:
  ```cypher
  MATCH (ct:Country {name: 'France'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **Germany**:
  ```cypher
  MATCH (ct:Country {name: 'Germany'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **Netherlands**:
  ```cypher
  MATCH (ct:Country {name: 'Netherlands'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **Norway**:
  ```cypher
  MATCH (ct:Country {name: 'Norway'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **Portugal**:
  ```cypher
  MATCH (ct:Country {name: 'Portugal'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **Spain**:
  ```cypher
  MATCH (ct:Country {name: 'Spain'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **Switzerland**:
  ```cypher
  MATCH (ct:Country {name: 'Switzerland'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

- **United Kingdom**:
  ```cypher
  MATCH (ct:Country {name: 'United Kingdom'})-[r1:HAS_RULE]->(a:Antecedent)-[r2:RESULTS_INTO]->(c:Consequent)
  RETURN ct, a, r1, r2, c
  ```

---

These queries will help you visualize the relationship between `Country`, `Antecedent`, and `Consequent` nodes, showing how the association rules are structured for each country.

### Potential Use Cases for the Knowledge Graphs:
These knowledge graphs could be utilized for various purposes such as:

- **Trend Analysis**: Compare association rules across different countries.
- **Product Recommendations**: Discover frequently bought items in different countries and make recommendations.
- **Data-Driven Decisions**: Utilize association rules to inform strategic decisions.
- **Visual Insights**: Present graph-based insights for stakeholders to interpret complex data visually.

By following the above steps, you can explore and analyze the data directly in Neo4j's browser and leverage it for deeper analysis or further use cases.

In [None]:
# Close the driver
driver.close()