<h1><center>Basket Analysis using Association Rules</center></h1>

![download.jpeg](attachment:download.jpeg)

Market basket analysis is a data mining technique that is used to identify association rules between items in large data sets of transactional data. It is used to identify which items are frequently purchased together in order to understand consumer behavior and to identify potential opportunities for product bundling or cross-selling.

For example, a market basket analysis of a grocery store's transaction data might reveal that customers who purchase bread are very likely to also purchase butter. This information could be used to create a product bundle (e.g., a "baking essentials" bundle including bread and butter) or to cross-sell butter to customers who are purchasing bread (e.g., by placing the butter near the bread in the store).

Benefits of market basket analysis include:

    Improved customer satisfaction: By understanding which items are frequently purchased together, a retailer can make it easier for customers to find everything they need in one place, which can improve the overall shopping experience.

    Increased sales: By bundling or cross-selling items that are frequently purchased together, a retailer can increase sales of those items.

    Enhanced marketing efforts: Market basket analysis can help a retailer understand which items are most likely to be of interest to their customers, which can help them tailor their marketing efforts to better target those customers.

    Improved inventory management: By understanding which items are frequently purchased together, a retailer can better predict which items they will need to have in stock at any given time, which can help them manage their inventory more efficiently.
    


### In this project we will:

    
    1. Analyse and preprocess the dataset
    2. Visualize the weekly, monthly and yearly sales and draw insights from the plotted graphs
    3. Visualize the top and bottom selling products
    4. Visualize the top customers for this business
    5. Generate association rules to determine the relationships between the products
    6. Identify the frequently purchased products
    

## 1. Importing the Necessary Libraries

In [1]:
##Importing the necessary libraries
import numpy as np
import pandas as pd
import altair as alt
import holoviews as hv

##for visualization styles
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
plt.style.use('ggplot')
import cufflinks as cf ## mainly used for pandas like visualization, colors, graphs, chart gallery e.t.c
import plotly.express as px
import plotly.offline as py
from plotly.offline import plot
import plotly.graph_objects as go

## for association rules
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import warnings; warnings.simplefilter('ignore')


In [None]:
## Reading the dataset
data = pd.read_csv('Groceries_dataset.csv')
data.head()

## 2. Data Preprocessing

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
## Number of unique items in the dataset
print(f'There are : {len(data["itemDescription"].unique())} unique items')

In [None]:
##Renaming the columns
data.rename(columns = {'Member_number':'id','itemDescription':'item'}, inplace = True) 

In [None]:
##Splitting the date column to form new separate columns for the day, month and year
#Convert the 'Date' column to datetime format
#(assumes day-first dates such as dd-mm-yyyy; drop dayfirst=True if the file uses another format)
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)
 
#Extracting year, month, day and weekday (Monday=0 ... Sunday=6)
data['year'] = data['Date'].dt.year
data['month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
data['weekday'] = data['Date'].dt.weekday

#Rearranging the columns (copy so that later edits do not touch `data`)
dataFrame = data[['id', 'Date','year', 'month', 'day','weekday','item']].copy()
dataFrame.head()

In [None]:
dataFrame["year"].unique()

In [None]:
dataFrame["month"].unique()

## 3. Exploratory Data Analysis

### 1. The business yearly sales

In [None]:
from holoviews import opts
hv.extension('bokeh')

#Filter data for 2014 and 2015 
df1=dataFrame[dataFrame['year'] == 2014]
df2=dataFrame[dataFrame['year'] == 2015]

#Monthly data for the two years 
sales_2014=hv.Bars(df1.groupby(['month'])['item'].count()).opts(ylabel="Number of items", title='Number of items sold in 2014')
sales_2015=hv.Bars(df2.groupby(['month'])['item'].count()).opts(ylabel="Number of items", title='Number of items sold in 2015')

#Combining the two plots
(sales_2014 + sales_2015).opts(opts.Bars(width=380, height=300,tools=['hover'],show_grid=True))


1. 2015 has higher sales than 2014, so sales grew between the two years
2. October 2015 has the highest sales
3. February and September recorded the lowest sales in 2015, and September was also the lowest month in 2014


### 2. The business monthly quantity purchases

In [None]:
#Creating temporary data which has quantity purchased column
temp=dataFrame.copy()
temp['qty_purchased']=dataFrame['id'].map(dataFrame['id'].value_counts())

#Keeping the first 5000 rows, since Altair refuses larger datasets by default (MaxRowsError)
temp1=temp[:5000].copy()

#Plotting
brush = alt.selection(type='interval', encodings=['x'])

#Plotting the bar chart
bars = alt.Chart().mark_bar(color="orange").encode(
    x=alt.X('month(Date):O',title="Month"),
    y=alt.Y('mean(qty_purchased):Q',title="Mean quantity purchased"),
    opacity=alt.condition(brush, alt.OpacityValue(1), alt.OpacityValue(0.7)),
    tooltip=['month(Date)','mean(qty_purchased)']
).add_selection(
    brush
).properties(height=400,width=600,title="Monthly Quantity Purchases")

#Plotting average line
line = alt.Chart().mark_rule(color='green').encode(
    y='mean(qty_purchased):Q',
    size=alt.SizeValue(3),
    tooltip=['mean(qty_purchased)']
).transform_filter(
    brush
)

#Display plot using sliced data
alt.layer(bars, line, data=temp1)

1. With a four-month brush window, the largest change in the average appears when the window moves past January-April
2. The highest average falls in the May-August window, with June the peak month

### 3. The business weekly quantity purchases

In [None]:
#Converting weekday variable to category
temp1.weekday = temp1.weekday.astype('category') 

#Creating a new dataframe which has the frequency of weekdays
#(rename_axis/reset_index(name=...) gives the same two columns on old and new pandas versions)
weekday_bin=temp1['weekday'].value_counts().rename_axis('weekday').reset_index(name='count')

#Plotting bar chart
bars = alt.Chart(weekday_bin).mark_bar(color="darkgreen").encode(
    x='weekday',
    y=alt.Y("count",title='Number of purchases')
)

#Adding data labels
text = bars.mark_text(
    align='center',
    baseline='middle',
    dy=-7 ,
    size=15,
).encode(
    text='count',
    tooltip=[alt.Tooltip('weekday'),
            alt.Tooltip('count')]
)

#Combining both
(bars + text).properties(
    width=800,
    height=400,
    title="Weekly Quantity Purchases"
)

### 4. Performance of the products

In [None]:
#Graph : Item by count
fig = px.bar(dataFrame["item"].value_counts()[:20], orientation="v", color=dataFrame["item"].value_counts()[:20], color_continuous_scale=px.colors.sequential.Plasma, 
             log_x=False, labels={'value':'Count', 
                                'index':'Item',
                                 'color':'None'
                                })

fig.update_layout(
    font_color="black",
    title_font_color="blue",
    legend_title_font_color="green",
    title_text="Performance of Goods Sold"
)

fig.show()

1. Whole milk has the highest sales
2. Pork has the fewest purchases among the 20 top-selling items plotted
3. The top 10 purchased products are all food and drink items

### 5. The top and bottom 10 Fast moving products in both years

In [None]:
#Setting plot style
#('seaborn-white' was renamed in newer Matplotlib releases, so the style is set through seaborn)
sns.set_style("white")
plt.figure(figsize = (15, 8))

#Top 10 fast moving products
plt.subplot(1,2,1)
ax=sns.countplot(y="item", hue="year", data=dataFrame, palette="pastel",
              order=dataFrame["item"].value_counts().iloc[:10].index)

#Rotate the numeric x tick labels without overwriting their text
ax.tick_params(axis='x', labelsize=11, labelrotation=40)
ax.set_title('Top 10 Fast moving products',fontsize= 22)
ax.set_xlabel('Total # of items purchased',fontsize = 20) 
ax.set_ylabel('Top 10 items', fontsize = 20)
plt.tight_layout()

#Bottom 10 fast moving products
plt.subplot(1,2,2)
ax=sns.countplot(y="item", hue="year", data=dataFrame, palette="pastel",
              order=dataFrame["item"].value_counts().iloc[-10:].index)
ax.tick_params(axis='x', labelsize=11, labelrotation=40)
ax.set_title('Bottom 10 Fast moving products',fontsize= 22)
ax.set_xlabel('Total # of items purchased',fontsize = 20) 
ax.set_ylabel('Bottom 10 items', fontsize = 20)
plt.tight_layout()

1. Whole milk is the most purchased product in both 2014 and 2015, whereas the least purchased is preservation products, which no one bought in 2015
2. Almost all of the top products saw a rise in 2015, except soda and bottled water
3. Most of the bottom products did not see a rise in 2015, except whiskey, chicken, bag and baby cosmetics

### 6. The top customers with most purchases in both years

In [None]:
top_customers=temp[['id', 'qty_purchased','year']].sort_values(by = 'qty_purchased',ascending = False).head(500)
top_customers.id = top_customers.id.astype('category') 
top_customers.year = top_customers.year.astype('category') 
alt.Chart(top_customers).mark_bar(color="blue").encode(
    x='qty_purchased',
    y=alt.Y('id', sort='-x'),
    color='year',
    tooltip=['id','qty_purchased']
).properties(height=400,width=600,title="Top Customers")

1. Customer 3180 tops the list and has been a loyal customer in both years
2. A few customers appear inconsistent, having purchased a lot in 2014 but not in 2015. Such consistency matters when estimating customer lifetime, but with only two years of data we cannot say much about any individual customer's lifetime

## 4. Association Rules with Apriori Algorithm

Association rules are widely used to analyze retail basket or transaction data. They aim to identify strong rules in the transactions using measures of interestingness.
Association rules unveil the relationship between items purchased together, usually denoted X and Y: X is the item being purchased (the antecedent) and Y is the item most likely to be bought together with X (the consequent). The rules are evaluated with three measures:
    
    1. Support - The fraction of all transactions in which product X appears. That is, the popularity of product X.
    2. Confidence - The likelihood that product Y is purchased when product X is purchased.
    3. Lift - How much more likely product Y is to be purchased when product X is purchased, while controlling for how popular product Y is.
They are calculated as follows:

![assocr2-300x228.png](attachment:assocr2-300x228.png)
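
To make the three measures concrete, here is a minimal sketch that computes them on five made-up baskets. The baskets and the helper names `support`, `confidence` and `lift` are assumptions for illustration; they are not taken from the dataset or from any library.

```python
# Toy baskets, invented for illustration only
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "yogurt"},
    {"bread", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(x, y, baskets):
    """Support of X and Y together divided by the support of X."""
    return support(set(x) | set(y), baskets) / support(x, baskets)

def lift(x, y, baskets):
    """Confidence of X -> Y divided by the support of Y."""
    return confidence(x, y, baskets) / support(y, baskets)

x, y = {"bread"}, {"butter"}
print(f"support(bread)             = {support(x, baskets):.2f}")
print(f"support(butter)            = {support(y, baskets):.2f}")
print(f"confidence(bread -> butter) = {confidence(x, y, baskets):.2f}")
print(f"lift(bread -> butter)       = {lift(x, y, baskets):.2f}")
```

In this toy data the rule bread -> butter has a lift of 1.25. A lift above 1 suggests the two items appear together more often than their individual popularity alone would predict.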

The Apriori algorithm finds frequent itemsets by relying on two properties:

    1. Every subset of a frequent itemset must itself be frequent.
    2. Every superset of an infrequent itemset must be infrequent too.

The algorithm therefore works iteratively. Starting from single items, it discards every itemset whose support falls below the minimum support threshold and builds larger candidates only from the itemsets that remain, stopping when no candidate survives.

Generating every possible rule can take a long time, so rules whose lift is below a chosen threshold are removed.

![img054.webp](attachment:img054.webp)

<h3><center>Apriori Algorithm Steps</center></h3>

1. Choose the support threshold. In the example above, the support threshold is set to 1
2. Eliminate the items that do not meet the support threshold
3. Combine the remaining items into 2-itemsets (pairs)
4. From the pairs formed, eliminate those with a support of 1 or below
5. Form 3-itemsets (triples) from the remaining items
6. Remove the triples that fall below the threshold
7. The itemsets that remain are the items frequently purchased together
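
The steps above can be sketched in plain Python as a level-wise search. This is a simplified illustration rather than the implementation used by any library: the function name `frequent_itemsets`, the `min_count` parameter (an itemset must appear in at least this many baskets) and the toy baskets are all assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_count):
    """Level-wise search: return {itemset: count} for itemsets found in at least `min_count` baskets."""
    baskets = [frozenset(b) for b in baskets]

    def count(candidates):
        # Number of baskets that contain each candidate itemset
        return {c: sum(c <= b for b in baskets) for c in candidates}

    # Level 1: count single items and drop the infrequent ones
    items = {item for b in baskets for item in b}
    singles = count({frozenset([i]) for i in items})
    current = {c: n for c, n in singles.items() if n >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Toy baskets, invented for illustration only
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "jam"},
]

result = frequent_itemsets(baskets, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With `min_count=2`, jam is dropped at the first level because it appears in only one basket, so no pair or triple containing jam is ever counted. That early pruning is what keeps the search manageable on larger data.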

### 4.1 Preparing items for the Apriori Algorithm

In [None]:
products = data["item"].unique()

In [None]:
one_hot = pd.get_dummies(data['item'])

data.drop(['item'], inplace=True, axis=1)
data = data.join(one_hot)

data.head()

In [None]:
records = data.groupby(["id","Date"])[products[:]].sum()
records = records.reset_index()[products]

records

In [None]:
def get_product_names(x):
    for product in products:
        if x[product] != 0:
            x[product] = product
    return x

records = records.apply(get_product_names, axis=1)
records.head()

print(f"Total transactions: {len(records)}")

In [None]:
# Removing zeros in DataFrame
x = records.values
x = [sub[~(sub == 0)].tolist() for sub in x if sub[sub != 0].tolist()]
transactions = x
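
The cells above reach a list of item lists by way of a one-hot table. As a point of comparison, the same per-basket grouping can be sketched in plain Python on a few made-up rows. The `(member id, date, item)` tuple layout and all values below are assumptions for illustration, not rows from the dataset.

```python
from collections import defaultdict

# Toy purchase rows (member id, date, item), invented for illustration only
rows = [
    (1, "2015-01-05", "bread"),
    (1, "2015-01-05", "butter"),
    (2, "2015-01-05", "milk"),
    (1, "2015-02-10", "jam"),
    (2, "2015-01-05", "yogurt"),
]

# One basket per (member id, date) pair, holding the items bought in that visit
baskets = defaultdict(list)
for member_id, date, item in rows:
    baskets[(member_id, date)].append(item)

# Apriori implementations typically expect exactly this list-of-lists shape
transactions = list(baskets.values())
print(transactions)
```

Each inner list is one transaction, so the same member shopping on two different dates contributes two separate baskets.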

In [None]:
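# Note: this import shadows the mlxtend `apriori` imported at the top of the notebook
from apyori import apriori
# apyori reads min_support, min_confidence, min_lift and max_length; other keywords are silently ignored
rule_generator = apriori(transactions, min_support = 0.0003, min_confidence = 0.01, min_lift = 3)
association_results = list(rule_generator)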

In [None]:
print(association_results[0])

In [None]:
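for record in association_results:

    # Each record can hold several rules for the same itemset, so read the
    # antecedent and consequent from every ordered statistic instead of the unordered itemset
    for stat in record.ordered_statistics:
        antecedent = ", ".join(stat.items_base)
        consequent = ", ".join(stat.items_add)
    
        print("Rule : ", antecedent, " -> ", consequent)
        print("Support : ", record.support)
        print("Confidence : ", stat.confidence)
        print("Lift : ", stat.lift)
    
        print("=====================================")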

### 4.2 Association Rules based on Consumer Baskets

In [None]:
from apyori import apriori

In [None]:
## Reload the raw data, since `data` was renamed and one-hot encoded above
df = pd.read_csv('Groceries_dataset.csv')

## Grouping each member's basket by id and date, and counting the items in the basket
basket = df.groupby(['Member_number','Date'])
basket.count()

In [None]:
## Convert it into a DataFrame with one row per basket and its items joined into one string
df.groupby(['Member_number','Date'], as_index = False)['itemDescription'].agg(', '.join)

In [None]:
## Let's see which items were in the baskets
list_transactions = [group['itemDescription'].tolist() for _, group in basket]
list_transactions[:20]

For example, the basket that customer 1000 bought on the 15th of March contained 4 items: 'sausage', 'whole milk', 'semi-finished bread' and 'yogurt'.


### 4.3 Building Apriori Algorithm from the customers' baskets

In [None]:
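## Building the rules for apriori with the customers' baskets in list_transactions
## (max_length=2 limits the search to single items and pairs; apyori ignores keywords it does not read, such as min_length)
rules = apriori(list_transactions, min_support=0.001, min_confidence=0.05, min_lift=1.2, max_length=2)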

In [None]:
results = list(rules)

In [None]:
def inspect(results):
    # Flatten every rule (ordered statistic) of every record into one row,
    # so that both A -> B and B -> A are kept when both pass the thresholds
    rows = []
    for result in results:
        for stat in result.ordered_statistics:
            rows.append((', '.join(stat.items_base), ', '.join(stat.items_add),
                         result.support*100, stat.confidence*100, stat.lift))
    return rows
final_result = pd.DataFrame(inspect(results), columns=['Antecedent','Consequent','Support(%)','Confidence(%)','lift'])
final_result['Rule'] = final_result['Antecedent'] + '->' + final_result['Consequent']

In [None]:
final_result

In [None]:
# Changing rules Parameters
rules_2 = apriori(list_transactions, min_support = 0.001, min_confidence = 0.1, max_length = 3)
results_2 = list(rules_2)

In [None]:
# Creating a new data frame of the three-item rules and counting them
pd.options.display.float_format = '{:,.2f}'.format
columns = ['Left Hand Side 1', 'Left Hand Side 2','Right Hand Side', 'Support(%)', 'Confidence(%)', 'Lift']
rows = []
for record in results_2:
    if len(record.items) > 2:
        for stat in record.ordered_statistics:
            # Keep only rules of the form (item 1 + item 2) -> item 3
            if len(stat.items_base) != 2:
                continue
            lhs1, lhs2 = sorted(stat.items_base)
            rows.append({'Left Hand Side 1': lhs1, 'Left Hand Side 2': lhs2,
                         'Right Hand Side': list(stat.items_add)[0],
                         'Support(%)': record.support*100,
                         'Confidence(%)': stat.confidence*100, 'Lift': stat.lift})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
final_df_2 = pd.DataFrame(rows, columns = columns)
final_df_2['Rules'] = final_df_2['Left Hand Side 1'] + ' + ' + final_df_2['Left Hand Side 2'] + ' -> ' + final_df_2['Right Hand Side']
print('Number of Rules: ', final_df_2['Rules'].count(), 'Rules')
final_df_2.head()

In [None]:
final_df_2.to_csv('newRules.csv')

In [None]:
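# Save the rules to a pickle file
# (`rules` is a generator that list(rules) has already consumed, so pickle the `results` list instead)
import pickle

with open('apriori_rules.pickle', 'wb') as f:
    pickle.dump(results, f)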