# Analyzing Store Data Through Market Basket Analysis

## 1) Libraries and packages

### a) Import libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from apyori import apriori

### b) Read data and display

### Data Understanding:
• <u>What does each row in the dataset represent?</u>

Each row in the dataset represents a single transaction, identified by its TransactionID.

• <u>How are the items in each transaction represented in the dataset?</u>

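The items in each transaction are stored in the Items column as a single comma-separated string (for example, "Bread, Milk, Eggs"), which has to be split into a list of items before mining.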


In [171]:
df = pd.read_csv("data.csv")
df

Unnamed: 0,TransactionID,Items
0,1,"Bread, Milk, Eggs"
1,2,"Bread, Juice, Cheese"
2,3,"Milk, Eggs, Juice"
3,4,"Bread, Milk"
4,5,"Eggs, Juice"
5,6,"Bread, Cheese"
6,7,"Milk, Cheese"
7,8,"Bread, Milk, Eggs, Juice"
8,9,"Juice, Cheese"
9,10,"Bread, Milk, Cheese"


In [172]:
df.shape

(10, 2)

## 2) Association Rules

Extract association rules from the given dataset, taking into consideration the Minimum
Support Threshold (MST) of 20% and Minimum Confidence Threshold (MCT) of 70%.

### Algorithm Application:
• <u>Why is the Apriori algorithm used in association rule mining?</u>

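The Apriori algorithm is used in association rule mining because it finds frequent itemsets through a systematic, level-wise search: it first counts single items, then builds candidate itemsets of size k+1 only from the frequent itemsets of size k. This relies on the downward-closure property (every subset of a frequent itemset must itself be frequent), so any candidate that contains an infrequent subset is pruned without being counted. Association rules are then derived from the frequent itemsets that remain.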

• <u>What is the significance of the min_support parameter in the Apriori algorithm,
and how does it impact the results?</u>

The min_support parameter sets the minimum fraction of transactions that must contain an itemset for it to count as frequent. With the 10 transactions here and min_support=0.2, an itemset must appear in at least 2 transactions. Higher values yield fewer itemsets, and therefore fewer rules, each backed by a larger share of the data. Lower values admit more itemsets, including rare combinations whose rules may rest on only a handful of transactions, and they increase the search cost. The value should be chosen to match the size of the dataset and the goals of the analysis.
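The level-wise search and the role of min_support can be sketched in plain Python. This is an illustrative sketch, not the implementation inside apyori: the function name `frequent_itemsets` and the hand-typed `transactions` list (copied from the data shown earlier) are assumptions made for the example.

```python
from itertools import combinations

# The ten transactions from data.csv, typed out by hand for this sketch
transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Juice", "Cheese"},
    {"Milk", "Eggs", "Juice"},
    {"Bread", "Milk"},
    {"Eggs", "Juice"},
    {"Bread", "Cheese"},
    {"Milk", "Cheese"},
    {"Bread", "Milk", "Eggs", "Juice"},
    {"Juice", "Cheese"},
    {"Bread", "Milk", "Cheese"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset that meets min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    # Level 1 candidates: every single item
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that reach the threshold
        level = {}
        for candidate in candidates:
            support = sum(candidate <= t for t in transactions) / n
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


for threshold in (0.2, 0.3):
    print(threshold, len(frequent_itemsets(transactions, threshold)))
```

On these ten transactions the sketch finds 16 frequent itemsets at a threshold of 0.2 and 9 at 0.3, which shows how quickly a higher min_support shrinks the search.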


### i) Data Preprocessing

### Data Preprocessing:
• <u>Why is it necessary to convert the 'Items' column into a list of items during data
preprocessing?</u>

The Items column stores each transaction as a single comma-separated string such as "Bread, Milk, Eggs". Association rule mining algorithms expect each transaction as a collection of individual items; the apriori function from apyori, for example, takes a list of transactions in which each transaction is itself a list of items. Splitting the Items column into lists, and stripping the whitespace around each name, gives the algorithm the structure it needs to count which items occur together.

### Handling Missing Values or Inconsistent Data:

• <u>How would you handle missing values or inconsistent data in a real-world
dataset?</u>

<u>Missing Values:</u> For missing values, several strategies can be employed:

**Imputation:** Replace missing values with a calculated estimate, such as the mean, median, or mode of the column.

**Deletion:** Remove rows with missing values, but this should be done cautiously to avoid losing valuable information.

**Advanced Imputation:** Use more sophisticated techniques like machine learning-based imputation methods for better accuracy.

<u>Inconsistent Data:</u>

**Standardization:** Standardize or normalize data to ensure consistency in scale and units.

**Data Cleaning:** Identify and correct inconsistent entries manually or through automated cleaning procedures.

**Outlier Handling:** Detect and handle outliers that might contribute to inconsistencies in the dataset.
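For transaction data specifically, the deletion and data-cleaning strategies above might look like the following sketch. It assumes that rows with no usable Items string are simply dropped and that item names differing only in case or whitespace refer to the same product; the helper name `clean_transactions` and the sample `raw` values are hypothetical and not part of the dataset.

```python
def clean_transactions(raw_items):
    """Turn raw 'Items' strings into clean item lists, dropping unusable rows."""
    cleaned = []
    for raw in raw_items:
        # Missing values: skip anything that is not a non-empty string
        # (this also skips None and the float NaN that pandas uses)
        if not isinstance(raw, str) or not raw.strip():
            continue
        # Inconsistent entries: trim whitespace, unify capitalisation,
        # and ignore empty fragments left by stray commas
        items = [part.strip().title() for part in raw.split(",") if part.strip()]
        # Remove duplicate items within a transaction, keeping first-seen order
        items = list(dict.fromkeys(items))
        if items:
            cleaned.append(items)
    return cleaned


# Hypothetical messy input, invented for illustration
raw = ["Bread, Milk, Eggs", " bread ,MILK", None, "", "Juice,, Cheese, juice"]
print(clean_transactions(raw))
# [['Bread', 'Milk', 'Eggs'], ['Bread', 'Milk'], ['Juice', 'Cheese']]
```

Dropping rows is a judgment call: it keeps the item counts honest, but it should be checked against how many transactions are lost.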

In [180]:
# Split each comma-separated 'Items' string into a list of item names,
# stripping the whitespace around every name
records = [
    [item.strip() for item in items.split(',')]
    for items in df['Items'].astype(str)
]

# display
records

[['Bread', 'Milk', 'Eggs'],
 ['Bread', 'Juice', 'Cheese'],
 ['Milk', 'Eggs', 'Juice'],
 ['Bread', 'Milk'],
 ['Eggs', 'Juice'],
 ['Bread', 'Cheese'],
 ['Milk', 'Cheese'],
 ['Bread', 'Milk', 'Eggs', 'Juice'],
 ['Juice', 'Cheese'],
 ['Bread', 'Milk', 'Cheese']]

### ii) Apriori Algorithm

### Rule Generation:
• <u>What does the confidence metric represent in the context of association rules?</u>

In association rule mining, confidence measures the reliability of a rule. It is the conditional probability that a transaction containing the antecedent (if-part) also contains the consequent (then-part), that is, the support of the whole itemset divided by the support of the antecedent. For instance, a confidence of 80% means that the consequent is present in 80% of the transactions that contain the antecedent. Confidence is directional (A -> B and B -> A generally differ), and it can be high simply because the consequent is common, which is why lift is usually reported alongside it.

• <u>How does adjusting the min_threshold parameter affect the generated association rules?</u>

The min_threshold parameter in the question corresponds to the min_confidence argument of the apriori call used in this notebook: it sets the cutoff a rule must reach to be kept. A higher threshold filters out rules with lower confidence, leaving a smaller but more reliable set. A lower threshold keeps rules with lower confidence, producing a larger set that may include weak associations. The parameter is therefore a control for trading rule quality against rule quantity, to be set according to the requirements of the analysis or application domain.
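To make the metrics concrete, the sketch below computes support, confidence, and lift by hand for the item pair Eggs and Juice. The helper names `support`, `confidence`, and `lift` are assumptions made for this example (they are not apyori functions), and the `transactions` list is typed out from the data shown earlier.

```python
# The ten transactions from data.csv, typed out by hand for this sketch
transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Juice", "Cheese"},
    {"Milk", "Eggs", "Juice"},
    {"Bread", "Milk"},
    {"Eggs", "Juice"},
    {"Bread", "Cheese"},
    {"Milk", "Cheese"},
    {"Bread", "Milk", "Eggs", "Juice"},
    {"Juice", "Cheese"},
    {"Bread", "Milk", "Cheese"},
]


def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(antecedent, consequent) / support(consequent)


print(round(confidence({"Eggs"}, {"Juice"}), 2))  # 3 of the 4 Eggs baskets: 0.75
print(round(confidence({"Juice"}, {"Eggs"}), 2))  # 3 of the 5 Juice baskets: 0.6
print(round(lift({"Eggs"}, {"Juice"}), 2))        # 0.75 / 0.5 = 1.5
```

With a 70% confidence threshold, Eggs -> Juice is kept while the reverse rule Juice -> Eggs is dropped, even though both come from the same itemset.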

In [183]:
association_rules = apriori(records, min_support=0.2, min_confidence=0.7, min_lift=1)
association_results = list(association_rules)

In [184]:
print("There are {} Relation derived.".format(len(association_results)))

There are 4 Relation derived.


In [185]:
for result in association_results:
    print(result.items)

frozenset({'Eggs', 'Juice'})
frozenset({'Eggs', 'Milk'})
frozenset({'Eggs', 'Bread', 'Milk'})
frozenset({'Eggs', 'Juice', 'Milk'})


## 3) Rules generated

In [189]:
for item in association_results:
    # Each result holds (items, support, ordered_statistics).
    # The direction of a rule comes from items_base -> items_add in
    # ordered_statistics; the element order of the frozenset is arbitrary.
    for stat in item.ordered_statistics:
        antecedent = ", ".join(sorted(stat.items_base))
        consequent = ", ".join(sorted(stat.items_add))
        print("Rule: " + antecedent + " -> " + consequent)
        print("Support: " + str(item.support))
        print("Confidence: " + str(stat.confidence))
        print("Lift: " + str(stat.lift))
        print("=====================================")

Rule: Eggs -> Juice
Support: 0.3
Confidence: 0.7499999999999999
Lift: 1.4999999999999998
=====================================
Rule: Eggs -> Milk
Support: 0.3
Confidence: 0.7499999999999999
Lift: 1.2499999999999998
=====================================
Rule: Bread, Eggs -> Milk
Support: 0.2
Confidence: 1.0
Lift: 1.6666666666666667
=====================================
Rule: Juice, Milk -> Eggs
Support: 0.2
Confidence: 1.0
Lift: 2.5
=====================================


### Experimentation:
• <u>What happens if you change the value of min_support to a higher or lower
value? How does it impact the number of frequent itemsets?</u>

Changing min_support directly changes how many itemsets count as frequent. A higher value reduces the number of frequent itemsets, keeping only those found in a large share of the transactions. A lower value increases the number, admitting itemsets that appear in only a few transactions; with 10 transactions, a threshold of 20% already accepts any itemset seen twice. The parameter therefore controls the granularity of the discovered patterns, trading broad, well-supported patterns against rarer and more specific ones.

• <u>How does adjusting the confidence threshold influence the number and quality of
association rules?</u>

Changing the confidence threshold affects both the number and the quality of the association rules. A higher threshold yields fewer rules whose consequent follows the antecedent more consistently, while a lower threshold expands the rule set to include weaker associations. Note that the confidence threshold only filters rules built from itemsets that already passed min_support, so the two thresholds work together. The setting should reflect the level of certainty the analysis requires.
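Both experiments can be tried on the ten transactions with a small brute-force sketch. It is a stand-in for rerunning apriori with different arguments, not apyori itself: the helpers `count` and `mine` are hypothetical names, every subset of the five items is enumerated directly, and no lift filter is applied.

```python
from itertools import combinations

# The ten transactions from data.csv, typed out by hand for this sketch
transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Juice", "Cheese"},
    {"Milk", "Eggs", "Juice"},
    {"Bread", "Milk"},
    {"Eggs", "Juice"},
    {"Bread", "Cheese"},
    {"Milk", "Cheese"},
    {"Bread", "Milk", "Eggs", "Juice"},
    {"Juice", "Cheese"},
    {"Bread", "Milk", "Cheese"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions)


def mine(min_support, min_confidence):
    """Return (frequent itemsets, rules) found by exhaustive enumeration."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    # Brute force is acceptable here: five items give only 31 subsets
    frequent = [
        c
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if count(c) / n >= min_support
    ]
    rules = []
    for itemset in frequent:
        # Try every non-empty proper subset as the antecedent
        for k in range(1, len(itemset)):
            for antecedent in combinations(itemset, k):
                conf = count(itemset) / count(antecedent)
                if conf >= min_confidence:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, conf))
    return frequent, rules


for min_support, min_confidence in [(0.2, 0.7), (0.3, 0.7), (0.2, 0.9)]:
    frequent, rules = mine(min_support, min_confidence)
    print(min_support, min_confidence, len(frequent), len(rules))
```

On this data, raising min_support from 0.2 to 0.3 cuts the frequent itemsets from 16 to 9 and the rules from 4 to 2, while raising min_confidence from 0.7 to 0.9 at the original support keeps only the 2 rules with 100% confidence.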

### Real-world Application:
• <u>Can you think of a real-world scenario where association rule mining can be
applied to improve business operations or customer experience?</u>

In the context of a supermarket, association rule mining can enhance business operations and customer experience. Analyzing transaction data, the algorithm may reveal associations like "customers who buy diapers are likely to purchase baby wipes." This insight enables targeted product placements and promotions, improving inventory management and offering personalized discounts. The store can strategically optimize aisle layouts and marketing strategies, providing a more efficient and satisfying shopping experience for customers while boosting sales and operational efficiency.

• <u>How might the insights gained from association rule mining be utilized by a
retail business?</u>

For a retail business, association rule mining unveils valuable purchasing patterns. By discovering associations like "customers buying grills often purchase barbecue sauce," retailers can optimize product placements, create targeted promotions, and enhance cross-selling strategies. Insights gained help retailers understand customer behavior, tailor marketing campaigns, and optimize inventory management. This leads to a more personalized shopping experience, increased sales, and improved operational efficiency, demonstrating the practical application of association rule mining in enhancing decision-making and customer satisfaction in the retail sector.