# Business Problem

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. Association rules are found by searching the data for frequent if-then patterns and ranking them with the criteria **Support** and **Confidence** to identify the most important relationships. Support measures how frequently an itemset appears in the data, while Confidence measures how often the if-then statement is found to be true. A third criterion, **Lift**, compares the actual Confidence with the Confidence that would be expected if the two sides of the rule were independent. Lift shows how many times more often the if-then statement holds than would be expected under independence.

More precisely, for a rule X → Y, Support is the fraction of transactions in the dataset that contain the itemset, and Confidence is the percentage of all transactions satisfying X that also satisfy Y. A lift of 1 implies that the occurrence of the antecedent and the occurrence of the consequent are independent of each other. When two events are independent, no rule can be drawn involving them.
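To make the three measures concrete, the sketch below computes them for a rule X → Y over a handful of made-up baskets. The basket contents and the helper names (`support`, `confidence`, `lift`) are illustrative assumptions; they are not part of the Armut data or of any library.

```python
# Toy baskets (hypothetical data, for illustration only).
baskets = [
    {"cleaning", "painting"},
    {"cleaning", "painting", "moving"},
    {"cleaning"},
    {"moving"},
    {"painting", "moving"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(x, y, baskets):
    """Estimate of P(Y | X): support(X and Y) / support(X)."""
    return support(set(x) | set(y), baskets) / support(x, baskets)

def lift(x, y, baskets):
    """Confidence of X -> Y divided by support(Y); 1.0 means independence."""
    return confidence(x, y, baskets) / support(y, baskets)

x, y = {"cleaning"}, {"painting"}
print(support(x | y, baskets), confidence(x, y, baskets), lift(x, y, baskets))
```

A lift above 1 here would suggest that baskets containing X contain Y more often than chance alone predicts; a value below 1 would suggest the opposite.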

Armut, Turkey's largest online service platform, brings together service providers and people who want to receive services. It provides easy access to services such as cleaning, renovation and transportation with a few taps on a computer or smartphone. The goal is to build a service recommendation system with **Association Rule Learning**, using a data set that contains the users, the services they received and the categories of those services.

The data set consists of the services customers receive and the categories of these services. It contains the date and time information of each service received.

## Variables

* UserId: Customer number
* ServiceId: Anonymized services belonging to each category. (Example: Upholstery washing service under the cleaning category)
* A ServiceId can be found under different categories and refers to different services under different categories. (Example: Service with CategoryId 7 and ServiceId 4 is honeycomb cleaning, while service with CategoryId 2 and ServiceId 4 is furniture assembly)
* CategoryId: Anonymized categories. (Example: Cleaning, transportation, renovation category)
* CreateDate: The date the service was purchased

# Data Preprocessing

In [1]:
#!pip install mlxtend


In [37]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)
from mlxtend.frequent_patterns import apriori, association_rules
import plotly.express as px
from pandas_profiling import ProfileReport
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [2]:
df_ = pd.read_csv("../input/armut-data/armut_data.csv")
df = df_.copy()
df.head()

|   | UserId | ServiceId | CategoryId | CreateDate |
|---|---|---|---|---|
| 0 | 25446 | 4 | 5 | 2017-08-06 16:11:00 |
| 1 | 22948 | 48 | 5 | 2017-08-06 16:12:00 |
| 2 | 10618 | 0 | 8 | 2017-08-06 16:13:00 |
| 3 | 7256 | 9 | 4 | 2017-08-06 16:14:00 |
| 4 | 25446 | 48 | 5 | 2017-08-06 16:16:00 |


In [3]:
def check_df(dataframe, head=10):
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Variables #####################")
    print(dataframe.columns)
    print("##################### Descriptive Stats #####################")
    print(dataframe.describe().T)
    print("##################### Null Values #####################")
    print(dataframe.isnull().sum())
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Info #####################")
    dataframe.info()  # info() prints its own report and returns None, so no print() is needed
check_df(df)

In [4]:
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

In [6]:
df['New_Date'] = pd.to_datetime(df['CreateDate']).dt.strftime('%Y-%m')

In [33]:
df_serv = df["ServiceId"].value_counts().sort_values(ascending=False)
df_category = df["CategoryId"].value_counts().sort_values(ascending=False)

In [45]:
fig = make_subplots(rows=2, cols=1, subplot_titles=["Service - Count", "Category - Count"])
# Shared bar outline style for both subplots
bar_outline = dict(line=dict(color='rgba(58, 71, 80, 1.0)', width=3))
fig.add_trace(go.Bar(x=df_serv.head(10).index, y=df_serv.head(10).values, marker=bar_outline), row=1, col=1)
fig.add_trace(go.Bar(x=df_category.head(10).index, y=df_category.head(10).values, marker=bar_outline), row=2, col=1)
fig.update_layout(height=1200, width=1200, paper_bgcolor='rgb(233,233,233)', title="Service and Category Counts", showlegend=False)

In [51]:
df_sc = df.groupby("ServiceId").agg({"CategoryId": "count"}).sort_values("CategoryId", ascending=False).reset_index()
df_sc.head(10).style.background_gradient(subset= "CategoryId", cmap='Reds')

|   | ServiceId | CategoryId |
|---|---|---|
| 0 | 18 | 32740 |
| 1 | 15 | 11348 |
| 2 | 2 | 11326 |
| 3 | 49 | 6690 |
| 4 | 38 | 5604 |
| 5 | 48 | 5588 |
| 6 | 13 | 4680 |
| 7 | 22 | 4035 |
| 8 | 17 | 3676 |
| 9 | 19 | 3623 |


In [55]:
# Number of records per ServiceId (the "CategoryId" column holds the row count, not the number of distinct categories)
# Service 18 has the highest number of records
fig = px.scatter(df_sc, x="ServiceId", y="CategoryId",
                 size="CategoryId", color="CategoryId",
                 hover_name="ServiceId", size_max=60)
fig.show()

## Preparing ARL Data Structure (Invoice-Product Matrix)

In [56]:
# A ServiceId refers to a different service under each CategoryId, so we combine them into a new variable "ServiceId_CategoryId"
cols = ['ServiceId', 'CategoryId']
df['Service'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

In [57]:
cols = ['UserId', 'New_Date']
# User_Id and Date Information
df['Market_Basket'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

In [62]:
# User and number of services received in same month
df['Market_Basket'].value_counts()

11669_2017-08    42
7480_2017-12     41
7478_2017-08     33
4864_2018-07     32
3360_2018-05     30
                 ..
23180_2018-01     1
18671_2018-01     1
1340_2018-01      1
6611_2018-01      1
17497_2018-08     1
Name: Market_Basket, Length: 71220, dtype: int64

In [65]:
# 1 if the basket contains the service, else 0 (crosstab counts rows per pair; clip caps counts at 1)
mar_bas_df = pd.crosstab(df["Market_Basket"], df["Service"]).clip(upper=1)

The Apriori algorithm takes its name from "a priori", meaning "prior", because each step uses information obtained in the previous step. The algorithm is iterative: it first finds the frequent single items, then builds larger candidate itemsets only from the itemsets found frequent in the previous pass. It is used to discover frequent itemsets in databases containing transaction data.
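The level-wise idea described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the `mlxtend` implementation used in this notebook; the function name `apriori_sketch` and the toy baskets are hypothetical.

```python
from itertools import combinations

def apriori_sketch(baskets, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    n = len(baskets)
    baskets = [frozenset(b) for b in baskets]

    def frequent(candidates):
        # Count how many baskets contain each candidate and keep the frequent ones.
        counts = {c: sum(c <= b for b in baskets) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = frequent({frozenset([i]) for i in items})
    result = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = frequent(candidates)
        result.update(current)
        k += 1
    return result

toy = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
found = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what makes the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded without scanning the data.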

In [66]:
frequent_itemsets = apriori(mar_bas_df,
                            min_support=0.01,
                            use_colnames=True)

rules = association_rules(frequent_itemsets,
                          metric="support",
                          min_threshold=0.01)


DataFrames with non-bool types result in worse computational performance and their support might be discontinued in the future. Please use a DataFrame with bool type
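The warning above is raised because the basket matrix holds 0/1 integers. One possible way to address it, sketched here on a small made-up frame that stands in for `mar_bas_df`, is to cast the matrix to boolean before passing it to `apriori`. The frame contents and the names `basket_matrix` and `basket_bool` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for mar_bas_df: rows are baskets, columns are services.
basket_matrix = pd.DataFrame(
    {"2_0": [1, 0, 1], "22_0": [0, 1, 1], "25_0": [0, 1, 1]},
    index=["u1_2017-08", "u2_2017-08", "u3_2017-09"],
)

# Cast 0/1 integers to False/True; any nonzero value becomes True.
basket_bool = basket_matrix.astype(bool)
print(basket_bool.dtypes)
```

In this notebook the equivalent step would be `mar_bas_df.astype(bool)` before the `apriori` call.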



In [67]:
rules[(rules["support"]>0.01) & (rules["confidence"]>0.1) & (rules["lift"]>5)]. \
sort_values("confidence", ascending=False)

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 15 | (25_0) | (22_0) | 0.042895 | 0.047515 | 0.01112 | 0.259247 | 5.456141 | 0.009082 | 1.285834 |
| 14 | (22_0) | (25_0) | 0.047515 | 0.042895 | 0.01112 | 0.234043 | 5.456141 | 0.009082 | 1.249553 |
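The derived columns in this output follow from the three support values. As a check, the sketch below recomputes confidence, lift, leverage and conviction for the rule (25_0) → (22_0) from the numbers printed above, using the standard definitions of these measures. The variable names are illustrative.

```python
# Values copied from the rule (25_0) -> (22_0) in the output above.
antecedent_support = 0.042895
consequent_support = 0.047515
rule_support = 0.01112

# Confidence: share of antecedent baskets that also contain the consequent.
confidence = rule_support / antecedent_support
# Lift: confidence relative to the consequent's baseline frequency.
lift = confidence / consequent_support
# Leverage: observed joint support minus the support expected under independence.
leverage = rule_support - antecedent_support * consequent_support
# Conviction: how much more often the rule would fail if X and Y were independent.
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 3), round(leverage, 6), round(conviction, 4))
```

The results agree with the table up to the rounding of the printed inputs. Because lift is symmetric in X and Y, both rows of the output share the same lift, while confidence and conviction differ by direction.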


In [69]:
def arl_recommender(rules_df, service_id, rec_count=1):
    """Return up to rec_count consequent lists from rules whose antecedent contains service_id, highest lift first."""
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []
    # Walk the rules in descending lift order and collect the consequents of matching rules
    for antecedent, consequent in zip(sorted_rules["antecedents"], sorted_rules["consequents"]):
        if service_id in antecedent:
            recommendation_list.append(list(consequent))
    return recommendation_list[:rec_count]

# Recommend 1 service for service "2_0", taken from the highest-lift rule with "2_0" in its antecedent
arl_recommender(rules, "2_0", 1)
# Recommend up to 3 services for service "25_0", ranked by rule lift
arl_recommender(rules, "25_0", 3)

[['22_0'], ['2_0']]