# Solution Planning

## Business Problem

**What is the business problem?**
1. Select the most valuable customers to join the "Loyals" loyalty program.

2. Answer the marketing team's business questions.

### Output

**What will I deliver? / Where does the business team want to see it?**

* 1) List in xls, sent by email. It must contain the customers who will join Loyals (the loyalty program).
   - Format:
   
| client_id | is_loyal |
|-----------|----------|
|1          |yes       |
|2          |no        |

* 2) PDF report answering the business questions, sent by email and presented to the marketing team:
    - Who are the people eligible to join the Loyals program?
    - How many customers will be part of the group?
    - What are the main characteristics of these customers?
    - What percentage of revenue comes from Loyals?
    - What revenue is expected from this group over the coming months?
    - What conditions make a person eligible for Loyals?
    - What conditions cause a person to be removed from Loyals?
    - What guarantees that the Loyals program is better than the rest of the customer base?
    - What actions can the marketing team take to increase revenue?

### Input

**Data sources:**
    Dataset "Ecommerce.csv", containing one year of e-commerce sales.

**Tools:**
    Python 3.8.12, Jupyter Notebook, Git, GitHub.

### Process

**Problem type:**
Splitting customers into groups.
    
**Main methods:**
    Clustering.

**Business questions:**
* 1) Who are the people eligible to join the Loyals program?
    - What does being eligible mean? / What are the "highest value" customers? (according to the business area)
        - Revenue:
             - High average ticket
             - High LTV (total revenue the customer has generated with us)
             - Low recency (time since the last purchase)
             - High basket size (number of products per purchase)
             - Low churn probability (would use the output of a model)
             - High predicted LTV (would use the output of a model)
             - High purchase propensity (would use the output of a model)
        - Cost:
             - Low return rate
        - Shopping experience:
             - High average review score

 PS: the features above will be created in the feature engineering step.
 
 
* 2) How many customers will be part of the group?
    - Number of customers
    - % of the total number of customers
    
    
* 3) What are the main characteristics of these customers?
    - Describe the customer characteristics:
        - Age
        - Country
        - Salary
        - Location
     - Describe the customers' main purchasing behaviors (business metrics)
        - See the clustering features (question 1)
         
 For look-alike: prospect similar customers on the internet


* 4) What percentage of revenue comes from Loyals?
    - Compute the company's total revenue for the year.
    - Compute the revenue (%) of the Loyals cluster alone.
   
   
* 5) What revenue is expected from this group over the coming months?
    - Compute the LTV of Loyals (with moving average, time series, ARIMA, ...)
    - Time series (ARMA, ARIMA, Holt-Winters, etc.)
    - Cohort analysis (by time, location, product, ...)

 There should be a revenue target; check with the business team.
    
    
* 6) What conditions make a person eligible for Loyals?
    - Define the evaluation period (every 1 month, 3 months, ...)
    - The customer's "performance" must be close to the average of the Loyals cluster.
    
    
* 7) What conditions cause a person to be removed from Loyals?
    - The customer's "performance" is no longer close to the average of the Loyals cluster.
   
   
* 8) What guarantees that the Loyals program is better than the rest of the customer base?
    - A/B test
    - Hypothesis test


* 9) What actions can the marketing team take to increase revenue?
    - Discount
    - Priority purchasing
    - Cheaper shipping
    - Company visit
    - Offer a personal stylist
    - Recommend cross-sell
    - Offer exclusive content
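
Question 4 above reduces to a ratio between the revenue of one cluster and the total revenue. The snippet below is a minimal sketch of that calculation, assuming a customer-level table with a hypothetical `cluster` column (the clustering step does not exist yet in this notebook) and made-up revenue values.

```python
import pandas as pd

# Hypothetical customer-level table: the `cluster` labels and the
# revenue values are invented for illustration only.
customers = pd.DataFrame({
    'customer_id':   [1, 2, 3, 4],
    'gross_revenue': [500.0, 300.0, 150.0, 50.0],
    'cluster':       ['loyals', 'loyals', 'other', 'other'],
})

def revenue_share(df, cluster_name):
    """Return the fraction of total gross revenue contributed by one cluster."""
    total = df['gross_revenue'].sum()
    in_cluster = df.loc[df['cluster'] == cluster_name, 'gross_revenue'].sum()
    return in_cluster / total

share = revenue_share(customers, 'loyals')
print(f'Loyals share of revenue: {share:.1%}')
```

Returning a fraction and formatting it only at print time keeps the value usable in later calculations.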

## Solution Benchmarking

### Desk Research

Read articles on customer segmentation online to understand what the market is doing.

Based on the market solutions, identify with the business team what we can build as an MVP.

1. RFM segmentation model.

# Environment Preparation

## Imports

In [134]:
import pandas            as pd
import matplotlib.pyplot as plt
import seaborn           as sns
import datetime          as dt
from tabulate                 import tabulate
from IPython.display          import HTML, display

#from IPython.display          import Image

## Helper Functions

In [53]:
def jupyter_settings():
    """ Optimize general settings, standardize plot sizes, etc. """
    %matplotlib inline
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [12, 6]
    plt.rcParams['font.size'] = 20
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.set_option( 'display.expand_frame_repr', False )
    pd.set_option('display.max_columns', 30)
    pd.set_option('display.max_rows', 30)
    sns.set()
jupyter_settings()

# Data Collection

In [54]:
%ls -l ../data/raw/

total 83400
-rw-rw-r--@ 1 home  staff  42697197 Apr 29  2021 Ecommerce.csv


In [55]:
#read data
df_raw = pd.read_csv('../data/raw/Ecommerce.csv', encoding='unicode_escape')

In [56]:
df_raw.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Unnamed: 8
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,29-Nov-16,2.55,17850.0,United Kingdom,
1,536365,71053,WHITE METAL LANTERN,6,29-Nov-16,3.39,17850.0,United Kingdom,
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,29-Nov-16,2.75,17850.0,United Kingdom,
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,29-Nov-16,3.39,17850.0,United Kingdom,
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,29-Nov-16,3.39,17850.0,United Kingdom,


In [61]:
df_raw = df_raw.drop('Unnamed: 8', axis=1).copy()
df_raw.sample(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
510894,579470,22998,TRAVEL CARD WALLET KEEP CALM,3,27-Nov-17,0.42,16549.0,United Kingdom
208159,555096,23174,REGENCY SUGAR BOWL GREEN,4,29-May-17,4.15,12682.0,France
403837,571654,21108,FAIRY CAKE FLANNEL ASSORTED COLOUR,1,16-Oct-17,0.79,18118.0,United Kingdom


# Data Description

In [62]:
df1 = df_raw.copy()

## Rename Columns

In [63]:
df1.sample(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
459556,575875,23084,RABBIT NIGHT LIGHT,22,9-Nov-17,4.13,,United Kingdom
218217,556023,22672,FRENCH BATHROOM SIGN BLUE METAL,4,6-Jun-17,1.65,14527.0,United Kingdom
278951,561220,22431,WATERING CAN BLUE ELEPHANT,6,24-Jul-17,1.95,17734.0,United Kingdom


In [64]:
df1.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [65]:
df1.columns = ['invoice_no', 'stock_code', 'description', 'quantity', 'invoice_date',
       'unit_price', 'customer_id', 'country']

## Feature Description 

In [66]:
# Explain feature meanings
tab_meanings = [['Columns', 'Meaning'],
        ['invoice_no', 'unique identifier of each transaction'],
        ['stock_code', 'item code'],
        ['description', 'item name'],
        ['quantity', 'quantity of each item purchased per transaction'],
        ['invoice_date', 'the day the transaction took place'],
        ['unit_price', 'product price per unit'],
        ['customer_id', 'unique customer identifier'],
        ['country', 'customer\'s country of residence']
      ]
print(tabulate(tab_meanings, headers='firstrow', stralign='left', tablefmt='simple'))

Columns       Meaning
------------  -----------------------------------------------
invoice_no    unique identifier of each transaction
stock_code    item code
description   item name
quantity      quantity of each item purchased per transaction
invoice_date  the day the transaction took place
unit_price    product price per unit
customer_id   unique customer identifier
country       customer's country of residence


In [67]:
df1.sample(3)

Unnamed: 0,invoice_no,stock_code,description,quantity,invoice_date,unit_price,customer_id,country
248234,558808,84987,SET OF 36 TEATIME PAPER DOILIES,1,2-Jul-17,3.75,,United Kingdom
306558,563771,23232,WRAP VINTAGE LEAF DESIGN,25,17-Aug-17,0.42,17841.0,United Kingdom
248214,558778,23209,LUNCH BAG DOILEY PATTERN,10,2-Jul-17,1.65,17160.0,United Kingdom


## Data Dimensions

In [68]:
print(f'Number of rows: {df1.shape[0]}')
print(f'Number of columns: {df1.shape[1]}')

Number of rows: 541909
Number of columns: 8


In [69]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   invoice_no    541909 non-null  object 
 1   stock_code    541909 non-null  object 
 2   description   540455 non-null  object 
 3   quantity      541909 non-null  int64  
 4   invoice_date  541909 non-null  object 
 5   unit_price    541909 non-null  float64
 6   customer_id   406829 non-null  float64
 7   country       541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


## Check NA

In [70]:
df1.isna().sum()

invoice_no           0
stock_code           0
description       1454
quantity             0
invoice_date         0
unit_price           0
customer_id     135080
country              0
dtype: int64

## Replace NA

In [71]:
#remove rows with NA
df1 = df1.dropna( subset=['description','customer_id'] )

In [72]:
print(f'Removed data: {1 - (df1.shape[0] / df_raw.shape[0]):.2%}')

Removed data: 24.93%
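
The ratio computed above is a fraction, not a percentage, so it needs to be scaled by 100 before being labeled with `%`. As a small cross-check, the sketch below recomputes the removed share from the row counts reported in this section (541909 rows before and 406829 rows after `dropna`) and formats it with the `%` format specifier, which does the scaling itself.

```python
# Row counts taken from the outputs in this section.
rows_before = 541909   # df_raw.shape[0]
rows_after = 406829    # df1.shape[0] after dropna

removed = 1 - rows_after / rows_before   # fraction of rows removed
print(f'Removed data: {removed:.2%}')    # '%' multiplies by 100 when formatting
```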


In [78]:
df1.tail()

Unnamed: 0,invoice_no,stock_code,description,quantity,invoice_date,unit_price,customer_id,country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,7-Dec-17,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,7-Dec-17,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,7-Dec-17,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,7-Dec-17,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,7-Dec-17,4.95,12680.0,France


In [79]:
#reset index
df1 = df1.reset_index(drop=True)

In [80]:
#new num of rows
df1.shape

(406829, 8)

In [81]:
#check
df1.isna().sum()

invoice_no      0
stock_code      0
description     0
quantity        0
invoice_date    0
unit_price      0
customer_id     0
country         0
dtype: int64

## Change Types

In [None]:
# correct data types ensure correct calculations with these columns in the next sections

In [82]:
df1.dtypes

invoice_no       object
stock_code       object
description      object
quantity          int64
invoice_date     object
unit_price      float64
customer_id     float64
country          object
dtype: object

In [83]:
df1.sample(3)

Unnamed: 0,invoice_no,stock_code,description,quantity,invoice_date,unit_price,customer_id,country
255523,567484,23395,BELLE JARDINIERE CUSHION COVER,4,18-Sep-17,3.75,17340.0,United Kingdom
172410,558044,23206,LUNCH BAG APPLE DESIGN,10,22-Jun-17,1.65,13870.0,United Kingdom
112857,550483,22993,SET OF 4 PANTRY JELLY MOULDS,2,16-Apr-17,1.25,16033.0,United Kingdom


In [84]:
#invoice_date
df1['invoice_date'] = pd.to_datetime(df1['invoice_date'], format='%d-%b-%y')

In [85]:
#customer_id
df1['customer_id'] = df1['customer_id'].astype(int)

In [86]:
# invoice_no
#df1['invoice_no'] = df1['invoice_no'].astype(int)
# some invoice numbers contain letters, so keep the column as object (string). Ex: 'C536379', 'C554197'

In [87]:
# stock_code
#df1['stock_code'] = df1['stock_code'].astype(int)
# some stock codes contain letters, so keep the column as object (string). Ex: '85123A', '84406B'

In [88]:
df1.sample(3)

Unnamed: 0,invoice_no,stock_code,description,quantity,invoice_date,unit_price,customer_id,country
388470,579741,22402,MAGNETS PACK OF 4 VINTAGE COLLAGE,1,2017-11-28,0.39,16910,United Kingdom
231638,564958,23292,SPACEBOY CHILDRENS CUP,16,2017-08-29,1.25,13089,United Kingdom
12964,537871,22703,PINK CAT BOWL,1,2016-12-06,2.1,12748,United Kingdom


In [89]:
df1.dtypes

invoice_no              object
stock_code              object
description             object
quantity                 int64
invoice_date    datetime64[ns]
unit_price             float64
customer_id              int64
country                 object
dtype: object
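
The two conversions above can be tried in isolation. The sketch below uses a tiny made-up frame (the name `sample` is an assumption) with date strings in the same `%d-%b-%y` layout as the raw file and float customer ids. It assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Toy rows mimicking the raw columns: day-month-year strings and float ids.
sample = pd.DataFrame({
    'invoice_date': ['29-Nov-16', '7-Dec-17'],
    'customer_id':  [17850.0, 12680.0],
})

# %d = day, %b = abbreviated month name, %y = two-digit year
sample['invoice_date'] = pd.to_datetime(sample['invoice_date'], format='%d-%b-%y')

# Safe only because rows with missing customer_id were dropped beforehand;
# NaN cannot be cast to a plain integer dtype.
sample['customer_id'] = sample['customer_id'].astype(int)

print(sample.dtypes)
```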

## Descriptive Statistics

In [90]:
#pass on first cycle

In [91]:
ls ../data

external/  interim/   processed/ raw/


In [92]:
#save dataset
df1.to_csv("../data/interim/df1_data_description_done.csv")

# Feature Engineering

## Feature Creation

In [157]:
df2 = pd.read_csv("../data/interim/df1_data_description_done.csv", index_col=0, parse_dates=['invoice_date'])
df2.sample(3)

Unnamed: 0,invoice_no,stock_code,description,quantity,invoice_date,unit_price,customer_id,country
401767,581021,22960,JAM MAKING SET WITH JARS,6,2017-12-05,4.25,14769,United Kingdom
325947,574029,22090,PAPER BUNTING RETROSPOT,12,2017-10-31,2.95,12955,United Kingdom
176582,558620,82483,WOOD 2 DRAWER CABINET WHITE FINISH,2,2017-06-28,6.95,15584,United Kingdom


In [158]:
df2.dtypes

invoice_no              object
stock_code              object
description             object
quantity                 int64
invoice_date    datetime64[ns]
unit_price             float64
customer_id              int64
country                 object
dtype: object
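
Reading the interim file back needs `index_col=0` and `parse_dates` because CSV stores neither the index role nor the datetime type. The sketch below illustrates this with an in-memory buffer instead of a file. The names `buffer`, `plain` and `restored` are assumptions for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'invoice_date': pd.to_datetime(['2016-11-29', '2017-12-07']),
    'customer_id':  [17850, 12680],
})

buffer = io.StringIO()
df.to_csv(buffer)            # the index is written as a first, unnamed column

buffer.seek(0)
plain = pd.read_csv(buffer)  # index comes back as 'Unnamed: 0', dates as text

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=['invoice_date'])

print(plain.columns.tolist())
print(restored.dtypes)
```

An alternative would be to save with `index=False`, which avoids the extra column at read time.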

In [None]:
# our current granularity is: stock_code + invoice_date
# we need the customer as the granularity, with its features in a new table.
# let's start adjusting the dataset to build an RFM Segmentation Model (Recency, Frequency, Monetary Value)

In [160]:
#get the first invoice as a sample
df1.loc[ df1['invoice_no'] == '536365']

Unnamed: 0,invoice_no,stock_code,description,quantity,invoice_date,unit_price,customer_id,country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2016-11-29,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2016-11-29,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2016-11-29,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2016-11-29,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2016-11-29,3.39,17850,United Kingdom
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,2016-11-29,7.65,17850,United Kingdom
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,2016-11-29,4.25,17850,United Kingdom


In [161]:
#create the reference table with unique customer_id values (and reset index)
df_ref = df2.drop(['invoice_no', 'stock_code', 'description', 'quantity', 'invoice_date',
       'unit_price', 'country'], axis=1).drop_duplicates(ignore_index=True).copy()

In [162]:
df_ref.head()

Unnamed: 0,customer_id
0,17850
1,13047
2,12583
3,13748
4,15100


In [None]:
#let's now create the 3 variables needed to use the RFM Segmentation Model 

### gross_revenue

In [164]:
# Gross Revenue: (quantity * price of each purchase)
df2['gross_revenue'] = df2['quantity'] * df2['unit_price']
df2.sample(3)

Unnamed: 0,invoice_no,stock_code,description,quantity,invoice_date,unit_price,customer_id,country,gross_revenue
394273,580500,23418,LAVENDER TOILETTE BOTTLE,1,2017-12-02,2.08,17131,United Kingdom,2.08
263043,568183,21694,SMALL REGAL SILVER CANDLEPOT,6,2017-09-23,2.95,13991,United Kingdom,17.7
259553,C567903,M,Manual,-120,2017-09-20,0.03,16422,United Kingdom,-3.6


In [165]:
# df_monetary = gross_revenue by customer
df_monetary = df2[['customer_id','gross_revenue']].groupby('customer_id').sum().reset_index()
df_monetary

Unnamed: 0,customer_id,gross_revenue
0,12346,0.00
1,12347,4310.00
2,12348,1797.24
3,12349,1757.55
4,12350,334.40
...,...,...
4367,18280,180.60
4368,18281,80.82
4369,18282,176.60
4370,18283,2094.88


In [166]:
#merge df_ref + df_monetary
df_ref = pd.merge( df_ref, df_monetary, on='customer_id', how='left' )

In [167]:
#check possible join problems
df_ref.isna().sum()

customer_id      0
gross_revenue    0
dtype: int64
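
Besides counting NA values after the join, `pd.merge` can check the join itself. The sketch below is one possible way to do it, using three customers and revenues taken from the outputs in this notebook: `validate='one_to_one'` raises an error if a key is duplicated on either side, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

df_ref = pd.DataFrame({'customer_id': [17850, 13047, 12583]})
df_monetary = pd.DataFrame({
    'customer_id':   [12583, 13047, 17850],
    'gross_revenue': [7187.34, 3079.10, 5288.63],
})

merged = pd.merge(
    df_ref, df_monetary, on='customer_id', how='left',
    validate='one_to_one',   # fail loudly on duplicated customer_id
    indicator=True,          # adds '_merge': 'both', 'left_only' or 'right_only'
)

# Rows that found no match on the right side would show up here.
unmatched = (merged['_merge'] != 'both').sum()
print(f'Unmatched customers: {unmatched}')
```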

In [168]:
#reference table of customer_id with first var
df_ref.head()

Unnamed: 0,customer_id,gross_revenue
0,17850,5288.63
1,13047,3079.1
2,12583,7187.34
3,13748,948.25
4,15100,635.1


### recency

In [None]:
# Recency: number of days since last purchase

In [169]:
# get each customer's last day of purchase 
df_recency = df2[['customer_id','invoice_date']].groupby('customer_id').max().reset_index()
df_recency.head()

Unnamed: 0,customer_id,invoice_date
0,12346,2017-01-16
1,12347,2017-12-05
2,12348,2017-09-23
3,12349,2017-11-19
4,12350,2017-01-31


In [173]:
# '2017-12-07' is the last day with purchases in this dataset.
# Let's assume '2017-12-07' is today to avoid large distortions. With an up-to-date dataset the last purchase would be recent, but this one is 4+ years old.
# In real cases, this variable would be dt.date.today(), and could be updated daily.
today = df2['invoice_date'].max()
today

Timestamp('2017-12-07 00:00:00')

In [183]:
#Calculate number of days since last purchase
#"today" (being '2017-12-07') - invoice_date (last day of purchase of each customer)
df_recency['recency_days'] = (df2['invoice_date'].max() - df_recency['invoice_date']).dt.days
df_recency

Unnamed: 0,customer_id,invoice_date,recency_days
0,12346,2017-01-16,325
1,12347,2017-12-05,2
2,12348,2017-09-23,75
3,12349,2017-11-19,18
4,12350,2017-01-31,310
...,...,...,...
4367,18280,2017-03-05,277
4368,18281,2017-06-10,180
4369,18282,2017-11-30,7
4370,18283,2017-12-04,3


In [187]:
#check recency days (2017-12-07 -> 2017-01-16) = 325 days
df2['invoice_date'].max()

Timestamp('2017-12-07 00:00:00')
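
The check in the comment above can be reproduced with plain date arithmetic. This is only a sanity check for customer 12346, using the two dates shown in the outputs.

```python
import datetime as dt

reference_day = dt.date(2017, 12, 7)   # last purchase day in the dataset
last_purchase = dt.date(2017, 1, 16)   # last purchase of customer 12346

recency_days = (reference_day - last_purchase).days
print(recency_days)
```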

In [188]:
df_recency

Unnamed: 0,customer_id,invoice_date,recency_days
0,12346,2017-01-16,325
1,12347,2017-12-05,2
2,12348,2017-09-23,75
3,12349,2017-11-19,18
4,12350,2017-01-31,310
...,...,...,...
4367,18280,2017-03-05,277
4368,18281,2017-06-10,180
4369,18282,2017-11-30,7
4370,18283,2017-12-04,3


In [191]:
#drop invoice_date (max purchase date) from df_recency
df_recency = df_recency[['customer_id', 'recency_days']].copy()
df_recency

Unnamed: 0,customer_id,recency_days
0,12346,325
1,12347,2
2,12348,75
3,12349,18
4,12350,310
...,...,...
4367,18280,277
4368,18281,180
4369,18282,7
4370,18283,3


In [192]:
# merge df_recency w/ df_ref
df_ref = pd.merge( df_ref, df_recency, on='customer_id', how='left' )
df_ref

Unnamed: 0,customer_id,gross_revenue,recency_days
0,17850,5288.63,302
1,13047,3079.10,31
2,12583,7187.34,2
3,13748,948.25,95
4,15100,635.10,330
...,...,...,...
4367,13436,196.89,1
4368,15520,343.50,1
4369,13298,360.00,1
4370,14569,227.39,1


In [193]:
#check NA
df_ref.isna().sum()

customer_id      0
gross_revenue    0
recency_days     0
dtype: int64
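
The monetary and recency steps above each build a per-customer table and merge it into `df_ref`. As a sketch only, and not the approach used in this notebook, the same two features could be computed in a single pass with named aggregation. The toy table `tx` and the names `rfm` and `reference_date` are assumptions for the example.

```python
import pandas as pd

# Toy transactions at item level, including one return (negative quantity).
tx = pd.DataFrame({
    'customer_id':  [1, 1, 2, 2, 2],
    'invoice_date': pd.to_datetime(['2017-01-16', '2017-12-05', '2017-09-23',
                                    '2017-09-23', '2017-11-19']),
    'quantity':     [6, 2, 3, -1, 4],
    'unit_price':   [2.5, 10.0, 1.0, 1.0, 5.0],
})
tx['gross_revenue'] = tx['quantity'] * tx['unit_price']

# Same convention as above: the last purchase day in the data acts as "today".
reference_date = tx['invoice_date'].max()

rfm = (
    tx.groupby('customer_id')
      .agg(gross_revenue=('gross_revenue', 'sum'),
           last_purchase=('invoice_date', 'max'))
      .reset_index()
)
rfm['recency_days'] = (reference_date - rfm['last_purchase']).dt.days
rfm = rfm.drop(columns='last_purchase')
print(rfm)
```

Because every customer appears in the grouped result by construction, this variant needs no join and therefore no NA check afterwards.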

### frequency

In [None]:
# Frequency (number of purchases)

In [None]:
# Frequency - Average time between purchases (in days). Measures engagement.
# I need data on every line, to get the avg time between purchases
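
The frequency step is still open in this cycle. The sketch below shows one possible way to compute both definitions mentioned above, on a made-up item-level table: the number of distinct invoices per customer, and the average number of days between consecutive purchase days. The names `tx`, `n_purchases` and `avg_days_between_purchases` are assumptions, and the handling of cancelled invoices is left out.

```python
import pandas as pd

# Toy item-level rows: customer 1 has two lines on the same invoice.
tx = pd.DataFrame({
    'customer_id':  [1, 1, 1, 2, 2],
    'invoice_no':   ['A1', 'A1', 'A2', 'B1', 'B2'],
    'invoice_date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-11',
                                    '2017-03-01', '2017-03-31']),
})

# Definition 1: number of distinct invoices per customer.
df_freq = (
    tx.groupby('customer_id')['invoice_no'].nunique()
      .reset_index()
      .rename(columns={'invoice_no': 'n_purchases'})
)

# Definition 2: average gap in days between consecutive purchase days.
purchase_days = (
    tx[['customer_id', 'invoice_date']]
      .drop_duplicates()
      .sort_values(['customer_id', 'invoice_date'])
)
gaps = purchase_days.groupby('customer_id')['invoice_date'].diff().dt.days
avg_gap = (
    gaps.groupby(purchase_days['customer_id']).mean()
        .rename('avg_days_between_purchases')
        .reset_index()
)

df_freq = df_freq.merge(avg_gap, on='customer_id', how='left')
print(df_freq)
```

Customers with a single purchase day have no gap to average, so their `avg_days_between_purchases` would be NaN and would need a deliberate fill rule before clustering.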

# Variable Filtering

# EDA

# Data Preparation

# Feature Selection

# Model Training

# Cluster Analysis

# Deploy