## Association Rule Based Recommender System

### 1 ) Business Problem

Armut, Turkey's leading online service platform, acts as a hub linking service providers with clients. It facilitates easy access to services like cleaning, renovation, and moving through a user-friendly interface on computers or smartphones. The aim is to utilize Association Rule Learning to build a product recommendation system using a dataset that includes users who have received services and their respective service categories.

### Dataset Story

The dataset contains details about the services obtained by customers and the corresponding service categories. It also includes the date and time information for each service rendered.

### Variables

- **UserId**: Distinct customer identifier
- **ServiceId**: Anonymized services associated with each category. For instance, within the cleaning category, a service could be upholstery cleaning. ServiceId may appear across different categories, signifying diverse services under distinct categories or a service with CategoryId = 7 and ServiceId = 4 could be radiator cleaning, whereas a service with CategoryId = 2 and ServiceId = 4 might be furniture assembly
- **CategoryId**: Anonymized categories. For instance; cleaning, moving, renovation
- **CreateDate**: The date on which the service was purchased

### 2 ) Data Understanding

In [1]:
## Import the necessary library and functions


!pip install mlxtend

# mlxtend kütüphanesini yükleme
!pip install --user mlxtend

# mlxtend'den gerekli fonksiyonları yükleme
from mlxtend.frequent_patterns import apriori, association_rules

import warnings
warnings.filterwarnings("ignore")

import pandas as pd

pd.set_option('display.max_columns', None)

# pd.set_option('display.max_rows', None)

pd.set_option('display.width', 500)

pd.set_option('display.expand_frame_repr', False)



In [2]:
## Load the dataset

df_ = pd.read_csv("armut_data.csv")

In [3]:
### Creating a copy of the dataframe to work on it without altering the original.

df=df_.copy()
df.head(10)

Unnamed: 0,UserId,ServiceId,CategoryId,CreateDate
0,25446,4,5,2017-08-06 16:11:00
1,22948,48,5,2017-08-06 16:12:00
2,10618,0,8,2017-08-06 16:13:00
3,7256,9,4,2017-08-06 16:14:00
4,25446,48,5,2017-08-06 16:16:00
5,14354,15,1,2017-08-06 16:27:00
6,14162,21,5,2017-08-06 16:28:00
7,21230,46,4,2017-08-06 16:34:00
8,25446,6,7,2017-08-06 16:39:00
9,10659,4,5,2017-08-06 16:44:00


In [4]:
## The shape of dataset

df.shape

(162523, 4)

In [5]:
# Display information about the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162523 entries, 0 to 162522
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   UserId      162523 non-null  int64 
 1   ServiceId   162523 non-null  int64 
 2   CategoryId  162523 non-null  int64 
 3   CreateDate  162523 non-null  object
dtypes: int64(3), object(1)
memory usage: 5.0+ MB


In [6]:
## An overview of descriptive statistics.

df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
UserId,162523.0,13089.803862,7325.81606,0.0,6953.0,13139.0,19396.0,25744.0
ServiceId,162523.0,21.64114,13.774405,0.0,13.0,18.0,32.0,49.0
CategoryId,162523.0,4.325917,3.129292,0.0,1.0,4.0,6.0,11.0


In [7]:
## Check any missing values.

df.isnull().values.any()

False

In [8]:
## The number of unique values.

df.nunique()

UserId         24826
ServiceId         50
CategoryId        12
CreateDate    117510
dtype: int64

### 3 ) Data Preprocessing

In [9]:
# Create a new column and named Service by combining ServiceId and CategoryId

df['Service'] = [str(row[1]) + "_" + str(row[2]) for row in df.values]

df.head()


Unnamed: 0,UserId,ServiceId,CategoryId,CreateDate,Service
0,25446,4,5,2017-08-06 16:11:00,4_5
1,22948,48,5,2017-08-06 16:12:00,48_5
2,10618,0,8,2017-08-06 16:13:00,0_8
3,7256,9,4,2017-08-06 16:14:00,9_4
4,25446,48,5,2017-08-06 16:16:00,48_5


In [10]:
# The dataset consists of the date and time when the services were acquired, without any basket definition (such as invoice, etc.). 
# To apply Association Rule Learning, a basket definition (such as invoice, etc.) must be created. 
# Here, the basket definition will be the monthly services received by each customer. 
# Each basket needs to be identified with a unique ID.

# First, create a new date variable that includes only the year and month.
# Then, concatenate the UserID and the newly created date variable with an underscore ("_") and assign this to a new variable named ID.


In [11]:
## Change data type for Create date 

df["CreateDate"] = pd.to_datetime(df["CreateDate"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162523 entries, 0 to 162522
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   UserId      162523 non-null  int64         
 1   ServiceId   162523 non-null  int64         
 2   CategoryId  162523 non-null  int64         
 3   CreateDate  162523 non-null  datetime64[ns]
 4   Service     162523 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 6.2+ MB


In [12]:
## Create a new_date column

df["New_Date"] = df["CreateDate"].dt.strftime("%Y-%m")

df.head()

Unnamed: 0,UserId,ServiceId,CategoryId,CreateDate,Service,New_Date
0,25446,4,5,2017-08-06 16:11:00,4_5,2017-08
1,22948,48,5,2017-08-06 16:12:00,48_5,2017-08
2,10618,0,8,2017-08-06 16:13:00,0_8,2017-08
3,7256,9,4,2017-08-06 16:14:00,9_4,2017-08
4,25446,48,5,2017-08-06 16:16:00,48_5,2017-08


In [13]:
# Create CartId column by combining UserId and CreateDate

df["Cartid"] = [str(row[0]) + "_" + str(row[5]) for row in df.values]

df.head()

Unnamed: 0,UserId,ServiceId,CategoryId,CreateDate,Service,New_Date,Cartid
0,25446,4,5,2017-08-06 16:11:00,4_5,2017-08,25446_2017-08
1,22948,48,5,2017-08-06 16:12:00,48_5,2017-08,22948_2017-08
2,10618,0,8,2017-08-06 16:13:00,0_8,2017-08,10618_2017-08
3,7256,9,4,2017-08-06 16:14:00,9_4,2017-08,7256_2017-08
4,25446,48,5,2017-08-06 16:16:00,48_5,2017-08,25446_2017-08


In [14]:
# Create invoices pivot table

invoice_product_df = df.groupby(['Cartid', 'Service'])['Service'].count().unstack().fillna(0).applymap(lambda x: 1 if x > 0 else 0)
invoice_product_df.head()

Service,0_8,10_9,11_11,12_7,13_11,14_7,15_1,16_8,17_5,18_4,19_6,1_4,20_5,21_5,22_0,23_10,24_10,25_0,26_7,27_7,28_4,29_0,2_0,30_2,31_6,32_4,33_4,34_6,35_11,36_1,37_0,38_4,39_10,3_5,40_8,41_3,42_1,43_2,44_0,45_6,46_4,47_7,48_5,49_1,4_5,5_11,6_7,7_3,8_5,9_4
Cartid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1
0_2017-08,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0
0_2017-09,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
0_2018-01,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
0_2018-04,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
10000_2017-08,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162523 entries, 0 to 162522
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   UserId      162523 non-null  int64         
 1   ServiceId   162523 non-null  int64         
 2   CategoryId  162523 non-null  int64         
 3   CreateDate  162523 non-null  datetime64[ns]
 4   Service     162523 non-null  object        
 5   New_Date    162523 non-null  object        
 6   Cartid      162523 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 8.7+ MB


### 4) Data Analysis and Data Model

In [16]:
# Find frequent item sets using Apriori algorithm with a minimum support of 0.01

frequent_itemsets = apriori(invoice_product_df, min_support=0.01, use_colnames=True)

In [17]:
# Generate association rules based on support with a minimum threshold of 0.01

rules = association_rules(frequent_itemsets, metric="support", min_threshold=0.01)

In [18]:
# Display the first few rows of the generated association rules
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(13_11),(2_0),0.056627,0.130286,0.012819,0.226382,1.737574,0.005442,1.124216,0.449965
1,(2_0),(13_11),0.130286,0.056627,0.012819,0.098394,1.737574,0.005442,1.046325,0.488074
2,(15_1),(2_0),0.120963,0.130286,0.033951,0.280673,2.154278,0.018191,1.209066,0.609539
3,(2_0),(15_1),0.130286,0.120963,0.033951,0.260588,2.154278,0.018191,1.188833,0.616073
4,(33_4),(15_1),0.02731,0.120963,0.011233,0.411311,3.400299,0.007929,1.493211,0.725728


In [19]:
## Using the arl_recommender function, recommend services to a user who has most recently received the 2_0 service.


def arl_recommender(rules_df, product_id, rec_count=1):
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    # Sort the rules by lift in descending order to capture the most relevant product first.
    # Alternatively, sorting can be done by confidence based on preference.
    
    recommendation_list = [] # Create an empty list for recommended products.
    
    # antecedents: X
    # Since items are mentioned, it is returned as a frozenset. Combine the index and the service.
    # i: index
    # product: X, which is the service for which a recommendation is requested.
    for i, product in sorted_rules["antecedents"].items():
        for j in list(product): # Iterate over the services (product):
            if j == product_id: # If the requested recommendation product is found:
                recommendation_list.append(list(sorted_rules.iloc[i]["consequents"]))
                # Using the index i, add the consequents (Y) value at this index to the recommendation_list.

    # To avoid duplicates in the recommendation list:
    # For example, in 2-item or 3-item combinations, the same product might appear multiple times;
    # Use the unique property of dictionaries.
    recommendation_list = list({item for item_list in recommendation_list for item in item_list})
    return recommendation_list[:rec_count] # Return the recommended products up to the requested count.

In [20]:
## Recommend services to a user who has most recently received the 2_0 service

arl_recommender(rules,"2_0", 4)

['13_11', '38_4', '25_0']