# Description

Association rule mining applied to the [InstaCart](https://www.kaggle.com/competitions/instacart-market-basket-analysis/data?select=order_products__train.csv.zip) dataset.

# Importing Libraries

In [350]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori

# Reading the Datasets

In [353]:
# Transactions table
df=pd.read_csv(r"C:\Users\olima\Documents\Python\Market-basket\order_products__train.csv")

# Product names table
prod=pd.read_csv(r"C:\Users\olima\Documents\Python\Market-basket\products.csv")

# Exploring the Datasets

In [351]:
df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
2,1,10246,3,0
3,1,49683,4,0
5,1,13176,6,0
6,1,47209,7,0
7,1,22035,8,1


In [354]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1384617 entries, 0 to 1384616
Data columns (total 4 columns):
 #   Column             Non-Null Count    Dtype
---  ------             --------------    -----
 0   order_id           1384617 non-null  int64
 1   product_id         1384617 non-null  int64
 2   add_to_cart_order  1384617 non-null  int64
 3   reordered          1384617 non-null  int64
dtypes: int64(4)
memory usage: 42.3 MB


## Checking for missing values

In [355]:
df.order_id.isna().sum()

0

In [356]:
df.product_id.isna().sum()

0

## Counting unique IDs

In [357]:
df.order_id.nunique()

131209

In [358]:
df.product_id.nunique()

39123

In [359]:
df.order_id.value_counts()

1395075    80
2813632    80
949182     77
2869702    76
341238     76
           ..
1144944     1
1144765     1
1144608     1
1144038     1
3214874     1
Name: order_id, Length: 131209, dtype: int64

In [360]:
df.product_id.value_counts()

24852    18726
13176    15480
21137    10894
21903     9784
47626     8135
         ...  
42744        1
5871         1
47237        1
9305         1
38900        1
Name: product_id, Length: 39123, dtype: int64

# Transforming the Datasets

## Filters

In [399]:
# Filter out orders with a single product and products that appear rarely (the threshold is arbitrary)
# The product filter is needed because the algorithm did not run with all products; I am not sure how to improve this
filtro_orderId=list(df.order_id.value_counts()[df.order_id.value_counts()>1].index)
filtro_produtos=list(df.product_id.value_counts()[df.product_id.value_counts()>200].index)

df=df[df['order_id'].isin(filtro_orderId)]
df=df[df['product_id'].isin(filtro_produtos)]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 749601 entries, 2 to 1384616
Data columns (total 4 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   order_id           749601 non-null  int64
 1   product_id         749601 non-null  int64
 2   add_to_cart_order  749601 non-null  int64
 3   reordered          749601 non-null  int64
dtypes: int64(4)
memory usage: 28.6 MB
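
As a possible alternative, the same two filters can be written with `groupby(...).transform("size")`, which avoids building intermediate lists. This is only a sketch on a toy frame: the name `toy`, its data, and the thresholds are illustrative assumptions, not the notebook's values. As in the cell above, both counts are computed on the unfiltered frame.

```python
import pandas as pd

# Toy transactions (hypothetical data, not the InstaCart file)
toy = pd.DataFrame({
    "order_id":   [1, 1, 2, 3, 3, 3],
    "product_id": [10, 20, 10, 10, 20, 30],
})

# Row-aligned counts, both computed on the unfiltered frame
order_size = toy.groupby("order_id")["product_id"].transform("size")
product_freq = toy.groupby("product_id")["order_id"].transform("size")

# Keep orders with more than one product and products seen more than once
filtered = toy[(order_size > 1) & (product_freq > 1)]
print(filtered)
```

Note that with either approach an order can still end up with a single product after the product filter is applied, because the order sizes are measured before rare products are dropped.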


In [183]:
df.order_id.nunique()

124361

In [400]:
df.product_id.nunique()

1138

## Join with the product names table

In [21]:
prod.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [401]:
data = df.merge(prod, on="product_id")
data.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,1,10246,3,0,Organic Celery Hearts,83,4
1,2869,10246,4,1,Organic Celery Hearts,83,4
2,3378,10246,19,0,Organic Celery Hearts,83,4
3,14119,10246,6,0,Organic Celery Hearts,83,4
4,17152,10246,22,1,Organic Celery Hearts,83,4


In [402]:
# Keep only order_id and product_name (selected by name rather than by position)
data = data[['order_id', 'product_name']]

In [390]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1083575 entries, 0 to 1083574
Data columns (total 2 columns):
 #   Column        Non-Null Count    Dtype 
---  ------        --------------    ----- 
 0   order_id      1083575 non-null  int64 
 1   product_name  1083575 non-null  object
dtypes: int64(1), object(1)
memory usage: 24.8+ MB


In [403]:
# When generating the rules, I noticed that some expressions were getting in the way, so I fixed them by hand
# This may deserve more attention, since it affects the quality of the rules
substituicao={'Organic ':'','Box of ':'','Bag of':'','Bananas':'Banana','Hass':''}
data['product_name']=data['product_name'].replace(substituicao,regex=True).str.strip()
data.head()

Unnamed: 0,order_id,product_name
0,1,Celery Hearts
1,2869,Celery Hearts
2,3378,Celery Hearts
3,14119,Celery Hearts
4,17152,Celery Hearts
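
One way to give the name clean-up more attention is to collect the substitutions in a single function that uses word boundaries. The sketch below is an assumption about how that could look, not the notebook's method: `PATTERNS` and `normalize_name` are hypothetical names, and unlike the cell above it also removes a trailing "Organic" (as in "Apple Honeycrisp Organic").

```python
import re

# Substitutions similar to the cell above, as (pattern, replacement) pairs
PATTERNS = [
    (r"\bOrganic\b", ""),
    (r"\b(?:Box|Bag) of\b", ""),
    (r"\bHass\b", ""),
    (r"\bBananas\b", "Banana"),
]

def normalize_name(name):
    """Apply every substitution, then tidy up leftover whitespace."""
    for pattern, repl in PATTERNS:
        name = re.sub(pattern, repl, name)
    # Collapse the gaps left behind by the removals
    return re.sub(r"\s+", " ", name).strip()

print(normalize_name("Bag of Organic Bananas"))    # Banana
print(normalize_name("Organic Hass Avocado"))      # Avocado
print(normalize_name("Apple Honeycrisp Organic"))  # Apple Honeycrisp
```

It could be applied to the column with `data['product_name'].map(normalize_name)`.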


In [344]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 503894 entries, 0 to 503893
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   order_id      503894 non-null  int64 
 1   product_name  503894 non-null  object
dtypes: int64(1), object(1)
memory usage: 11.5+ MB


## One-hot encoding

In [404]:
basket= data[['order_id','product_name']].value_counts().unstack().reset_index().fillna(0).set_index('order_id')
basket.head()



order_id,& Raw Strawberry Serenity Kombucha,0% Greek Strained Yogurt,1% Low Fat Milk,1% Lowfat Milk,100 Calorie Per Bag Popcorn,100% Apple Juice,100% Grated Parmesan Cheese,100% Lactose Free Fat Free Milk,100% Natural Spring Water,100% Pure Apple Juice,...,"Yogurt, Lowfat, Strawberry","Yogurt, Strained Low-Fat, Coconut",Yokids Lemonade/Blueberry Variety Pack Yogurt Squeezers Tubes,Yukon Gold Potatoes 5lb Bag,ZBar Chocolate Brownie Energy Snack,Zero Calorie Cola,Zucchini,Zucchini Spirals,Zucchini Squash,smartwater® Electrolyte Enhanced Water
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [405]:
def encode_units(x):
    # Any count of 1 or more means the product is in the basket;
    # always returns a bool (the original returned None for 0 < x < 1)
    return x >= 1

basket_sets = basket.applymap(encode_units)
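
The pivot and the boolean conversion can also be done in one step with `pd.crosstab`. This is a minimal sketch on toy data (the frame `toy` and its contents are hypothetical), offered as an alternative rather than as what the notebook ran.

```python
import pandas as pd

# Toy order lines (hypothetical data)
toy = pd.DataFrame({
    "order_id":     [1, 1, 2, 3, 3],
    "product_name": ["Banana", "Avocado", "Banana", "Avocado", "Avocado"],
})

# Count each (order, product) pair, then turn the counts into booleans
toy_basket = pd.crosstab(toy["order_id"], toy["product_name"]) > 0
print(toy_basket)
```

Order 3 contains Avocado twice, but the comparison `> 0` still yields a single `True`, which is the shape `apriori` expects.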

In [326]:
basket_sets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96792 entries, 1 to 3421070
Columns: 326 entries,  Bananas to Zucchini
dtypes: bool(326)
memory usage: 30.8 MB


# Creating the Rules

Now that the data is structured properly, we can generate frequent itemsets with a support of at least 1%. This threshold was chosen so that enough useful examples come out; it is mostly arbitrary, so try other values and see how the resulting rules change.

Idea for choosing support: suppose we want rules only for items that are purchased at least 5 times a day, or 7 x 5 = 35 times a week, in a dataset that covers a **one-week time period** and contains 7,500 transactions (I don't know the time period of our dataset). The support threshold for those items would be 35/7500 ≈ 0.0047.
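
The arithmetic behind this idea can be wrapped in a small helper. The function name `min_support_from_frequency` is hypothetical, and the figures are the ones from the paragraph above, not measurements from the InstaCart data.

```python
def min_support_from_frequency(purchases_per_day, days, n_transactions):
    """Support of an item bought `purchases_per_day` times a day over `days` days."""
    return purchases_per_day * days / n_transactions

# 5 purchases a day, 7 days, 7500 transactions
threshold = min_support_from_frequency(5, 7, 7500)
print(round(threshold, 4))  # 0.0047
```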

In [407]:
frequent_itemsets = apriori(basket_sets, min_support=0.01, use_colnames=True,low_memory=True)
frequent_itemsets.sort_values('support',ascending=False)

Unnamed: 0,support,itemsets
12,0.323934,(Banana)
114,0.152652,(Strawberries)
6,0.147080,(Avocado)
11,0.096460,(Baby Spinach)
85,0.077563,(Raspberries)
...,...,...
46,0.010103,(Garbanzo Beans)
74,0.010049,(Lemonade)
137,0.010031,"(Cilantro, Avocado)"
25,0.010023,(Butternut Squash)


Support measures how popular an itemset is: the number of transactions containing the itemset divided by the total number of transactions.

Maximum support is 0.32.

Restricting to products that appeared 500 times, the maximum support was 0.36. Restricting further might help the rules.
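
To make the definition of support concrete, here is a small pure-Python sketch on toy baskets. The data and the helper `support` are hypothetical and are not part of mlxtend.

```python
# Toy baskets (hypothetical data)
baskets = [
    {"Banana", "Avocado"},
    {"Banana"},
    {"Avocado", "Strawberries"},
    {"Banana", "Strawberries"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

print(support({"Banana"}, baskets))             # 0.75
print(support({"Banana", "Avocado"}, baskets))  # 0.25
```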


In [409]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Banana),(2% Reduced Fat Milk),0.323934,0.027540,0.010129,0.031270,1.135444,0.001208,1.003850
1,(2% Reduced Fat Milk),(Banana),0.027540,0.323934,0.010129,0.367809,1.135444,0.001208,1.069401
2,(Banana),(Apple Honeycrisp Organic),0.323934,0.019974,0.010557,0.032589,1.631569,0.004086,1.013040
3,(Apple Honeycrisp Organic),(Banana),0.019974,0.323934,0.010557,0.528520,1.631569,0.004086,1.433925
4,(Banana),(Asparagus),0.323934,0.034162,0.014660,0.045256,1.324745,0.003594,1.011620
...,...,...,...,...,...,...,...,...,...
191,"(Banana, Strawberries)",(Raspberries),0.075409,0.077563,0.015185,0.201369,2.596193,0.009336,1.155023
192,"(Raspberries, Strawberries)",(Banana),0.027896,0.323934,0.015185,0.544352,1.680443,0.006149,1.483747
193,(Banana),"(Raspberries, Strawberries)",0.323934,0.027896,0.015185,0.046877,1.680443,0.006149,1.019915
194,(Raspberries),"(Banana, Strawberries)",0.077563,0.075409,0.015185,0.195777,2.596193,0.009336,1.149670


People look quite health-conscious with these filters.
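
To focus on the strongest rules, the table can be filtered and sorted like any other DataFrame. The sketch below copies a few rows from the output above into a small frame; the confidence threshold of 0.3 is an arbitrary assumption, and the itemsets are written as plain strings here, whereas mlxtend stores them as frozensets.

```python
import pandas as pd

# A few rows copied from the output above (not the full rule set)
sample_rules = pd.DataFrame({
    "antecedents": ["Banana", "2% Reduced Fat Milk",
                    "Apple Honeycrisp Organic", "Raspberries"],
    "consequents": ["2% Reduced Fat Milk", "Banana",
                    "Banana", "Banana, Strawberries"],
    "confidence":  [0.031270, 0.367809, 0.528520, 0.195777],
    "lift":        [1.135444, 1.135444, 1.631569, 2.596193],
})

# Hypothetical threshold: keep confidence >= 0.3, highest lift first
strong = (
    sample_rules[sample_rules["confidence"] >= 0.3]
    .sort_values("lift", ascending=False)
)
print(strong)
```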

Confidence refers to the likelihood that item B is also bought when item A is bought. It is calculated as the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought. In the second rule (index 1), the confidence is 0.37, which means that about 37% of the orders containing 2% Reduced Fat Milk also contain Banana.
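
The confidence of that rule can be recomputed from the support columns of the table. The inputs below are the rounded values shown in the output, so the result agrees with the reported 0.367809 only up to rounding.

```python
# Values from the rule at index 1: (2% Reduced Fat Milk) -> (Banana)
support_milk = 0.027540             # antecedent support
support_milk_and_banana = 0.010129  # support of both items together

# Confidence(A -> B) = Support(A and B) / Support(A)
confidence = support_milk_and_banana / support_milk
print(round(confidence, 2))  # 0.37
```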

Lift(A -> B) measures how much more often B is bought when A is in the basket than B is bought overall. It is calculated as Confidence(A -> B) divided by Support(B). For example, a lift of 1.29 means that B is 1.29 times as likely to be bought when A is bought as it is in general (a 29% increase). A lift of 1 means that A and B appear together no more often than expected if they were independent.
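
The same check works for lift, again using the rounded values printed in the table for the rule at index 1.

```python
# Values from the rule at index 1: (2% Reduced Fat Milk) -> (Banana)
confidence = 0.367809      # Confidence(Milk -> Banana)
support_banana = 0.323934  # consequent support

# Lift(A -> B) = Confidence(A -> B) / Support(B)
lift = confidence / support_banana
print(round(lift, 3))  # 1.135
```

Banana is therefore about 13.5% more likely to be in a basket that contains 2% Reduced Fat Milk than in a basket chosen at random.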

I don't know yet what conviction is.
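
For reference, conviction is commonly defined as (1 - Support(B)) / (1 - Confidence(A -> B)): the ratio between how often A would appear without B if the two were independent and how often that actually happens. Values above 1 suggest that the consequent depends on the antecedent. The sketch below applies that formula to the rounded values of the rule at index 1 and reproduces the reported conviction up to rounding.

```python
# Values from the rule at index 1: (2% Reduced Fat Milk) -> (Banana)
confidence = 0.367809      # Confidence(Milk -> Banana)
support_banana = 0.323934  # consequent support

# Conviction(A -> B) = (1 - Support(B)) / (1 - Confidence(A -> B))
conviction = (1 - support_banana) / (1 - confidence)
print(round(conviction, 3))  # 1.069
```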

# Conclusion

Association rule mining algorithms such as Apriori are very useful for finding simple associations between data items. They are easy to implement and highly explainable. For more advanced insights, such as those used by Google or Amazon, more complex approaches such as recommender systems are used. Still, this method is a simple way to get basic associations if that is all your use case needs.

We could also try using departments instead of individual products.
