# Association Analysis
## Frequent Itemset Mining using Apriori Algorithm

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

### Reading Data
For my dataset I chose a grocery dataset I found on Kaggle here:
https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset/ <br>
The dataset has 38765 rows of grocery-store purchase records and needs some pre-processing to extract a dataset of transactions, as seen below.

In [None]:
df = pd.read_csv('./Groceries_dataset.csv', sep=',')
df.head(5)

### Extracting all transactions for association analysis
I group by 'Member_number' and 'Date' in the original dataset and keep only the item list of each group.

In [None]:
# print(list(df.groupby(['Member_number','Date']))[0])
# Each (member, date) group becomes one transaction: the list of items bought together
transactions = [group['itemDescription'].tolist() for _, group in df.groupby(['Member_number', 'Date'])]
transactions[:5]
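
To make the grouping step concrete, here is a minimal sketch on a handful of made-up purchase records. The values in `purchases` and the names `baskets` and `toy_transactions` are illustrative assumptions, not taken from the Kaggle file. It uses plain Python rather than pandas, so baskets come out in insertion order instead of the sorted key order that `groupby` produces.

```python
from collections import defaultdict

# Made-up records in the shape (Member_number, Date, itemDescription).
purchases = [
    (1000, "15-03-2015", "whole milk"),
    (1000, "15-03-2015", "yogurt"),
    (1000, "24-06-2014", "pastry"),
    (1001, "15-03-2015", "soda"),
    (1001, "15-03-2015", "whole milk"),
]

# Items that share the same (member, date) key end up in the same basket.
baskets = defaultdict(list)
for member, date, item in purchases:
    baskets[(member, date)].append(item)

toy_transactions = list(baskets.values())
print(toy_transactions)
```

Each inner list is one transaction: everything a single member bought on a single day.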

### Put the transactions list (of lists) into a dataframe to be able to use methods from Tutorial_9_AssociationAnalysis
The transactions dataset has 14963 rows (more than 10000 rows).

In [None]:
transactions_df = pd.DataFrame(transactions)
print("Columns: ", transactions_df.columns)
transactions_df

Now each row of the dataframe represents the items that were purchased together on the same day by the same member.
The dataset is a **sparse dataset**, because a relatively high percentage of its cells are None, NaN or equivalent.
Let's see all the unique items in the dataset.

In [None]:
items = np.unique(transactions_df.values[transactions_df.values != None])
print("Number of unique items: ", len(items))
items

### Data Preprocessing - use the function provided in the tutorial

The apriori function requires a dataframe whose values are either 0 and 1 or True and False.
Our data consists entirely of strings (item names), so we need to **One Hot Encode** it.
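
As a rough illustration of what one hot encoding means for transaction data, the sketch below encodes three made-up baskets in plain Python. The names `toy_transactions`, `toy_items` and `toy_encoded` are assumptions introduced for this example only.

```python
# Three made-up transactions (lists of item names).
toy_transactions = [
    ["whole milk", "yogurt"],
    ["pastry"],
    ["soda", "whole milk"],
]

# Every distinct item becomes one column.
toy_items = sorted({item for basket in toy_transactions for item in basket})

# One dict per transaction: True if the item is in the basket, False otherwise.
toy_encoded = [
    {item: item in basket for item in toy_items}
    for basket in toy_transactions
]

for row in toy_encoded:
    print(row)
```

Passing such a list of dicts to `pd.DataFrame` gives one boolean column per item and one row per transaction.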

##### Custom One Hot Encoding
Note: I replaced 0 with False and 1 with True to get rid of the warning: <br>
"DeprecationWarning: DataFrames with non-bool types result in worse computationalperformance and their support might be discontinued in the future.Please use a DataFrame with bool type"

In [None]:
encoded_vals = []
for index, row in transactions_df.iterrows():
    labels = {}
    uncommons = list(set(items) - set(row))
    commons = list(set(items).intersection(row))
    for uc in uncommons:
        labels[uc] = False
    for com in commons:
        labels[com] = True
    encoded_vals.append(labels)
encoded_vals

One Hot Encoded dataset

In [None]:
ohe_df = pd.DataFrame(encoded_vals)
print("All columns (items): ", ohe_df.columns)
# Number of items in each transaction (True counts as 1)
items_per_transaction = ohe_df.sum(axis=1)
print("Number of single item transactions: ", (items_per_transaction == 1).sum())
# Drop rows with single item transaction. There are 205 of them and this speeds up the analysis a bit.
ohe_df = ohe_df[items_per_transaction > 1]
ohe_df.head(3)

Now we're ready to apply the Apriori algorithm, since we have a dataframe with one one-hot-encoded row per transaction.

### Applying Apriori

The apriori function from the mlxtend library provides a fast and efficient Apriori implementation.  <br>
<br>
> **apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)**

##### Parameters
- ` df ` : One-Hot-Encoded DataFrame, i.e. a DataFrame that has 0 and 1 or True and False as values
- ` min_support ` : Floating point value between 0 and 1 that indicates the minimum support required for an itemset to be selected. <br>
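
To show what `min_support` filters on, here is a small brute-force sketch on made-up baskets. The helper `support` and the toy data are assumptions for illustration. It checks every 1- and 2-itemset directly, whereas Apriori prunes candidates whose subsets are already infrequent, so this is not how mlxtend computes the result.

```python
from itertools import combinations

# Made-up transactions, stored as sets for fast subset checks.
toy_transactions = [
    {"whole milk", "yogurt"},
    {"whole milk", "rolls/buns"},
    {"whole milk", "yogurt", "soda"},
    {"soda"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

min_support = 0.5  # toy threshold, much higher than the values used in this notebook
toy_items = sorted(set().union(*toy_transactions))

# Keep every 1- and 2-itemset whose support reaches the threshold.
frequent = {}
for size in (1, 2):
    for candidate in combinations(toy_items, size):
        s = support(candidate, toy_transactions)
        if s >= min_support:
            frequent[candidate] = s

print(frequent)
```

Here 'whole milk' appears in 3 of 4 baskets (support 0.75), while 'rolls/buns' appears in only 1 of 4 (support 0.25) and is dropped.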

In [None]:
freq_items = apriori(ohe_df, min_support=0.01, use_colnames=True, verbose=1)
# Print the 10 most frequently bought itemsets
freq_items.sort_values('support', ascending=False)[:10]

### Get results on different pairs of minimum support and minimum confidence

In [None]:
support_thresholds = [0.001, 0.005, 0.01]
confidence_thresholds = [0.01, 0.05, 0.1]
# Keep the resulting rules for every combination of thresholds in a list
rules = []

for support_threshold in support_thresholds:
    for confidence_threshold in confidence_thresholds:
        freq_items_t = apriori(ohe_df, min_support=support_threshold, use_colnames=True, verbose=1)
        rules.append(association_rules(freq_items_t, metric="confidence", min_threshold=confidence_threshold))

# Print association rules for support = 0.01 and confidence = 0.1 (last computed)
rules[-1]
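
Because the rule sets are stored in a flat list, it helps to see which index belongs to which threshold pair. The sketch below rebuilds the loop order with `itertools.product`. The name `threshold_pairs` is introduced here for illustration; the two threshold lists are copied from the cell above.

```python
from itertools import product

support_thresholds = [0.001, 0.005, 0.01]
confidence_thresholds = [0.01, 0.05, 0.1]

# product() iterates in the same order as the nested loops above:
# the support threshold changes slowest, the confidence threshold fastest.
threshold_pairs = list(product(support_thresholds, confidence_thresholds))

for i, (s, c) in enumerate(threshold_pairs):
    print(f"rules[{i}] -> min_support={s}, min_confidence={c}")
```

In other words, `rules[i]` was computed with `support_thresholds[i // 3]` and `confidence_thresholds[i % 3]`.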

### Visualizing results

1. **Support vs Confidence**

In [None]:
rule_colors = matplotlib.cm.tab10(range(9))
plt.rcParams['figure.figsize'] = [14, 10]

fig, ax = plt.subplots(nrows=3, ncols=3)
for i, r in enumerate(rules):
    row = i // 3
    col = i % 3
    ax_c = ax[row, col]
    ax_c.scatter(r['support'], r['confidence'], s=12, c=[rule_colors[i] if i<8 else 'blue'], alpha=0.75,
                label=f"(s, c) = ({support_thresholds[row]}, {confidence_thresholds[col]})")
    ax_c.legend()
plt.suptitle('Confidence vs Support for different thresholds of support and confidence')
fig.supylabel('Confidence')
fig.supxlabel('Support')
plt.show()

## Analysis of results and how they change with each pair of minimum support and minimum confidence

1. At low thresholds of Support and Confidence, the algorithm selects many association rules.<br>
2. When the Support threshold is increased, only the most frequent itemsets are considered and the number of rules decreases.<br>
3. When the Confidence threshold is increased, only rules whose ratio support(item1 ∪ item2) / support(item1) is at least the threshold are kept, so the number of rules decreases. <br>
4. When both the Support and Confidence thresholds are increased, only the rules with both higher Support and higher Confidence are kept.<br>
<br>
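
The confidence ratio from point 3 can be worked through by hand on made-up baskets. The helpers `support` and `confidence` and the toy data are assumptions for illustration; they mirror the definition of the metric, not mlxtend's internals.

```python
# Made-up transactions, stored as sets.
toy_transactions = [
    {"whole milk", "yogurt"},
    {"whole milk", "rolls/buns"},
    {"whole milk", "yogurt", "soda"},
    {"soda"},
    {"whole milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

conf_yogurt_to_milk = confidence({"yogurt"}, {"whole milk"}, toy_transactions)
conf_milk_to_yogurt = confidence({"whole milk"}, {"yogurt"}, toy_transactions)
print(conf_yogurt_to_milk, conf_milk_to_yogurt)

# A toy confidence threshold keeps only the stronger direction of the rule.
min_confidence = 0.6
kept = [name for name, c in [("yogurt -> whole milk", conf_yogurt_to_milk),
                             ("whole milk -> yogurt", conf_milk_to_yogurt)]
        if c >= min_confidence]
print(kept)
```

Both directions share the same joint support (2 of 5 baskets), but the confidences differ because the antecedent supports differ. This is why raising the confidence threshold can drop one direction of a rule while keeping the other.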
For example, when Support >= 0.01 and Confidence >= 0.05:


In [None]:
# support_threshold = 0.01, confidence_threshold = 0.05
selected_rules = rules[7]
selected_rules.sort_values('confidence', ascending=False)

We can see that for support_threshold = 0.01, the 3 most confident rules are:<br>
<br>
(yogurt) -> (whole milk)<br>
(rolls/buns) -> (whole milk)<br>
(other vegetables) -> (whole milk)

## Store Manager insights and actions

### Decide which result sets are meaningful
The manager can look at the support of each item (product) and decide to use a specific threshold.<br>
For example, the manager can decide to work with the top k=10 rules in terms of support.<br>
The manager can then run the apriori algorithm with that support threshold and select either a fixed number of association rules or a confidence threshold.

In [None]:
# Select the rules with the highest support (lowest thresholds: support >= 0.001, confidence >= 0.01)
selected_rules = rules[0]
selected_rules.sort_values('support', ascending=False)[:10]

#### The manager then decides on the set of meaningful rules

In [None]:
# From above table, select support_threshold=0.01
freq_items_t = apriori(ohe_df, min_support=0.01, use_colnames=True, verbose=1)
manager_rules = association_rules(freq_items_t, metric="confidence", min_threshold=0.01)
manager_rules.sort_values('confidence', ascending=False)[['antecedents', 'consequents', 'confidence']][:10]

### Manager actions to improve sales, order inventory and ensure items are accessible easily
<br>
Based on the selected rules the manager can decide to:<br>

1. Improve sales by discounting the antecedent items, the consequent items, or both.<br>
For example, given the association rules selected above, the manager can discount 'yogurt' and increase the 'whole milk' inventory in anticipation of higher 'yogurt' sales. The manager can also discount items only when they are bought together. For example, if the store has a high inventory of already discounted 'soda' that cannot be discounted further, the manager can offer a discount only when 'soda' and 'whole milk' are bought together, without discounting 'whole milk' itself.<br>
2. Order inventory proactively, for example ordering 'other vegetables' together with {'rolls/buns' and 'whole milk'}, and keep inventory levels that can satisfy the most confident association rules.
3. Arrange items on shelves and in refrigerators so that items linked by the most confident association rules are kept close together, for example 'yogurt' next to 'whole milk', and similarly 'rolls/buns' next to 'other vegetables'. The manager can also identify the most frequently bought items and give them an easily accessible shelf location.
