The dataset used and its full description are available at https://www.kaggle.com/datasets/akashdeepkuila/bakery

The dataset belongs to "The Bread Basket", a bakery located in Edinburgh. It provides the transaction details of customers who ordered items from this bakery online during the period from 26-01-11 to 27-12-03. The dataset has 20507 entries, over 9000 transactions, and 5 columns.

Exploratory analysis

In [53]:
import pandas as pd
import numpy as np

In [None]:
!pip install mlxtend --upgrade

In [55]:
dados = pd.read_csv(r'C:\Users\Monique\Documents\alura_ML_NS\dados-padaria.csv')

In [56]:
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20507 entries, 0 to 20506
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   TransactionNo  20507 non-null  int64 
 1   Items          20507 non-null  object
 2   DateTime       20507 non-null  object
 3   Daypart        20507 non-null  object
 4   DayType        20507 non-null  object
dtypes: int64(1), object(4)
memory usage: 801.2+ KB


In [57]:
dados.head()

,TransactionNo,Items,DateTime,Daypart,DayType
0,1,Bread,2016-10-30 09:58:11,Morning,Weekend
1,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
2,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
3,3,Hot chocolate,2016-10-30 10:07:57,Morning,Weekend
4,3,Jam,2016-10-30 10:07:57,Morning,Weekend


Data preparation

In [58]:
# Build the list of transactions: one list of distinct items per transaction

transacao = []

# unique() yields each transaction number exactly once
for item in dados['TransactionNo'].unique():
    # Group the items that belong to the same transaction, dropping duplicates
    lista = list(set(dados[dados['TransactionNo']==item]['Items']))
    transacao.append(lista)

In [59]:
transacao[0:5]

[['Bread'],
 ['Scandinavian'],
 ['Jam', 'Hot chocolate', 'Cookies'],
 ['Muffin'],
 ['Pastry', 'Coffee', 'Bread']]
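
The loop above filters the whole DataFrame once per transaction. As a hedged alternative, the same grouping can be sketched with `groupby`, shown here on a small hypothetical frame that mimics the first rows of `dados`. The names `toy` and `transactions` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the first rows of `dados` (not the real file).
toy = pd.DataFrame({
    "TransactionNo": [1, 2, 2, 3, 3, 3],
    "Items": ["Bread", "Scandinavian", "Scandinavian",
              "Hot chocolate", "Jam", "Cookies"],
})

# One pass over the data: collect the distinct items of each transaction.
# sort=False keeps transactions in order of first appearance, like unique().
# sorted() makes the item order deterministic, unlike list(set(...)).
transactions = (
    toy.groupby("TransactionNo", sort=False)["Items"]
       .apply(lambda items: sorted(set(items)))
       .tolist()
)

print(transactions)
```

Sorting the items inside each transaction is a design choice made here for reproducible output; it does not affect the one-hot encoding that follows.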

In [60]:
# Convert the categorical data to booleans using one-hot encoding
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
transacao_te = te.fit(transacao).transform(transacao)

In [61]:
transacao_te

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [62]:
transacao_transformado = pd.DataFrame(transacao_te, columns=te.columns_)
transacao_transformado

,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9460,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9461,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
9462,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9463,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
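
To make the shape of the encoded table concrete, here is a minimal pandas-only sketch of the same one-hot idea on hypothetical toy transactions. It is not how `TransactionEncoder` is implemented; `toy_transactions`, `exploded`, and `onehot` are names introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy transactions (not the bakery data).
toy_transactions = [["Bread"], ["Coffee", "Bread"], ["Coffee", "Toast"]]

# One row per (transaction, item) pair; the index keeps the transaction position.
exploded = pd.Series(toy_transactions).explode()

# Indicator columns per item, then collapse back to one row per transaction:
# a cell is True if the item appears anywhere in that transaction.
onehot = pd.get_dummies(exploded).groupby(level=0).any()

print(onehot)
```

Each row is a transaction and each column an item, which is the layout `apriori` expects as input.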


Association rules: identifying previously unknown relationships in large datasets

Support: how often an itemset is sold, expressed as a fraction of the total number of transactions.
s(X) = (number of transactions containing X) / Ntransactions
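
As a small worked example of this definition, the sketch below counts support by hand on four hypothetical transactions. The `support` helper and the toy data are assumptions made for illustration, not part of mlxtend or of the bakery dataset.

```python
# Hypothetical toy transactions (not the bakery data).
toy_transactions = [
    {"Coffee", "Bread"},
    {"Coffee"},
    {"Bread", "Tea"},
    {"Coffee", "Bread", "Cake"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Coffee appears in 3 of the 4 transactions.
support_coffee = support({"Coffee"}, toy_transactions)
# Coffee and Bread appear together in 2 of the 4 transactions.
support_coffee_bread = support({"Coffee", "Bread"}, toy_transactions)

print(support_coffee, support_coffee_bread)
```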

In [63]:
from mlxtend.frequent_patterns import apriori

In [64]:
# Create the support column for each frequent itemset
# s(X -> Y) = (number of transactions containing X U Y) / Ntransactions

items_frequentes_apriori = apriori(transacao_transformado, use_colnames=True, min_support=0.02)
items_frequentes_apriori.sort_values(['support'], ascending=False)

# Changing min_support increases or reduces the number of itemsets returned

,support,itemsets
4,0.478394,(Coffee)
1,0.327205,(Bread)
16,0.142631,(Tea)
3,0.103856,(Cake)
20,0.090016,"(Coffee, Bread)"
11,0.086107,(Pastry)
12,0.071844,(Sandwich)
9,0.061807,(Medialuna)
7,0.05832,(Hot chocolate)
23,0.054728,"(Cake, Coffee)"
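
The effect of the `min_support` threshold can be illustrated with a brute-force search over hypothetical toy transactions. This sketch enumerates every candidate itemset, so it is not the Apriori algorithm (which prunes candidates); `frequent_itemsets` and `max_len` are illustrative names, not mlxtend's API.

```python
from itertools import combinations

# Hypothetical toy transactions (not the bakery data).
toy_transactions = [
    {"Coffee", "Bread"},
    {"Coffee", "Cake"},
    {"Coffee", "Bread", "Cake"},
    {"Tea"},
    {"Coffee"},
]

def frequent_itemsets(transactions, min_support, max_len=2):
    """Brute force: keep every itemset whose support is >= min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t) / n
            if support >= min_support:
                result[candidate] = support
    return result

strict = frequent_itemsets(toy_transactions, min_support=0.5)
loose = frequent_itemsets(toy_transactions, min_support=0.3)

# A lower threshold keeps more itemsets.
print(len(strict), len(loose))
```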


In [65]:
from mlxtend.frequent_patterns import association_rules

In [66]:
# Confidence: how often the items in Y appear in transactions that contain X; it measures how reliable the rule's inference is
# c(X -> Y) = s(X U Y) / s(X)

regras_apriori = association_rules(items_frequentes_apriori, metric='confidence', min_threshold=0.5)
regras_apriori.head(15)

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Cake),(Coffee),0.103856,0.478394,0.054728,0.526958,1.101515,0.005044,1.102664,0.10284
1,(Cookies),(Coffee),0.054411,0.478394,0.028209,0.518447,1.083723,0.002179,1.083174,0.0817
2,(Hot chocolate),(Coffee),0.05832,0.478394,0.029583,0.507246,1.060311,0.001683,1.058553,0.060403
3,(Juice),(Coffee),0.038563,0.478394,0.020602,0.534247,1.11675,0.002154,1.119919,0.108738
4,(Medialuna),(Coffee),0.061807,0.478394,0.035182,0.569231,1.189878,0.005614,1.210871,0.170091
5,(Pastry),(Coffee),0.086107,0.478394,0.047544,0.552147,1.154168,0.006351,1.164682,0.146161
6,(Sandwich),(Coffee),0.071844,0.478394,0.038246,0.532353,1.112792,0.003877,1.115384,0.109205
7,(Toast),(Coffee),0.033597,0.478394,0.023666,0.704403,1.472431,0.007593,1.764582,0.332006


Result: the metrics obtained support the analysis of product sales by indicating which items are most likely to be sold together.

antecedent: the itemset X on the left-hand side of the rule (a single item in every rule above).
consequent: the itemset Y on the right-hand side of the rule (a single item in every rule above).
antecedent support: fraction of all transactions that contain X.
consequent support: fraction of all transactions that contain Y.
support: fraction of all transactions that contain both X and Y (how often they are sold together).
confidence: fraction of the transactions containing X that also contain Y.
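
As a check on these definitions, the sketch below recomputes the metrics for the (Toast) -> (Coffee) rule from the three support values in the rules table. The confidence formula is the one given in the notebook; the lift, leverage, and conviction formulas are the standard definitions and are added here as an assumption, since the notebook does not state them.

```python
# Values copied from the (Toast) -> (Coffee) row of the rules table above.
support_x = 0.033597   # antecedent support, s(Toast)
support_y = 0.478394   # consequent support, s(Coffee)
support_xy = 0.023666  # support, s(Toast U Coffee)

# c(X -> Y) = s(X U Y) / s(X)
confidence = support_xy / support_x

# Standard definitions (assumed, not stated in the notebook):
lift = confidence / support_y                    # > 1 means positive association
leverage = support_xy - support_x * support_y    # excess over independence
conviction = (1 - support_y) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```

Because the inputs are rounded to six decimals, the results match the table only up to small rounding differences.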

The remaining metrics are still under analysis; even so, Coffee is the product most associated with other sales, since it is the consequent of every rule above.