# High Value Customers Identification

## The All In One Place company

The All In One Place company is a multibrand outlet. It sells second-line products from several brands at lower prices through an e-commerce platform.

In just over one year of operation, the marketing team realized that some customers buy more expensive products at a high frequency and end up contributing a significant portion of the company's revenue.

Based on this observation, the marketing team will launch a loyalty program, called Insiders, for the best customers in the base. However, the team lacks the data analysis expertise needed to select the program's participants.

For this reason, the marketing team asked the data team to select eligible customers for the program, using advanced data manipulation techniques.

---

## Project Objectives

You are part of All In One Place's team of data scientists, who need to determine which customers are eligible for Insiders. With this list in hand, the marketing team will carry out a sequence of personalized and exclusive actions aimed at the group in order to increase sales and purchase frequency.

As a result of this project, you are expected to submit a list of people eligible to participate in the Insiders program, along with a report answering the following questions:

1. Who are the people eligible to participate in the Insiders program?
2. How many customers will be part of the group?
3. What are the main characteristics of these customers?
4. What percentage of the company's revenue comes from Insiders?
5. What is this group's revenue expectation for the coming months?
6. What are the conditions for a person to be eligible for Insiders?
7. What are the conditions for a person to be removed from Insiders?
8. What is the guarantee that the Insiders program is better than the rest of the base?
9. What actions can the marketing team take to increase revenue?
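
Question 4 reduces to a simple ratio: revenue generated by the Insiders divided by total revenue. The following is a minimal sketch of that calculation, assuming a transactions table with a `CustomerID` column and a per-row revenue column (called `TotalValue` here). The function name `revenue_share`, the `insider_ids` set and the `toy` rows are hypothetical and only serve as illustration.

```python
import pandas as pd


def revenue_share(transactions: pd.DataFrame, insider_ids: set) -> float:
    """Return the fraction of total revenue generated by the given customers."""
    total = transactions['TotalValue'].sum()
    if total == 0:
        # Avoid dividing by zero on an empty or zero-revenue table
        return 0.0
    is_insider = transactions['CustomerID'].isin(insider_ids)
    return float(transactions.loc[is_insider, 'TotalValue'].sum() / total)


# Made-up rows, not taken from the dataset
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 3],
    'TotalValue': [50.0, 30.0, 15.0, 5.0],
})

share = revenue_share(toy, insider_ids={1})
print(f'{share:.0%}')
```

Customer 1 accounts for 80 of the 100 units of revenue in the toy table, so the sketch reports a share of 80%.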

---

## Data 

The dataset is available on the [Kaggle platform](https://www.kaggle.com/vik2012kvs/high-value-customers-identification).

Each line represents a sales transaction that took place between November 2016 and December 2017.

The dataset includes the following information:
* InvoiceNo: Invoice number (A 6-digit integer uniquely assigned to each transaction)
* StockCode: Product (item) code
* Description: Product (item) name
* Quantity: The quantities of each product (item) per transaction
* InvoiceDate: The day when each transaction was generated
* UnitPrice: Unit price (Product price per unit)
* CustomerID: Customer number (Unique ID assigned to each customer)
* Country: Country name (The name of the country where each customer resides)

# Summary
* [1. Data Information](#1.)
    * [1.1 Missing Values](#1.1)
    * [1.2 Negative Quantities](#1.2)
    * [1.3 New Features](#1.3)
    * [1.4 Data Analysis](#1.4)
* [2. Data Visualization](#2.)
* [3. Model](#3.)
    * [3.1 Model and Feature Selection](#3.1)
    * [3.2 Hyperparameter Tuning](#3.2)
    * [3.3 Evaluate Time](#3.3)
    * [3.4 Save Model](#3.4)
* [4. Conclusion](#4.)
    * [4.1 Payment](#4.1)
    * [4.2 Compliance](#4.2)
    * [4.3 Model](#4.3)

# Import the Python libraries

In [222]:
# data analysis
import pandas as pd
import numpy as np

# 1. Data Information <a class='anchor' id='1.'></a>

In [223]:
df = pd.read_csv('csv/Ecommerce.csv', encoding='ISO-8859-1')
df.head(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Unnamed: 8
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,29-Nov-16,2.55,17850.0,United Kingdom,
1,536365,71053,WHITE METAL LANTERN,6,29-Nov-16,3.39,17850.0,United Kingdom,
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,29-Nov-16,2.75,17850.0,United Kingdom,
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,29-Nov-16,3.39,17850.0,United Kingdom,
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,29-Nov-16,3.39,17850.0,United Kingdom,
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,29-Nov-16,7.65,17850.0,United Kingdom,
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,29-Nov-16,4.25,17850.0,United Kingdom,
7,536366,22633,HAND WARMER UNION JACK,6,29-Nov-16,1.85,17850.0,United Kingdom,
8,536366,22632,HAND WARMER RED POLKA DOT,6,29-Nov-16,1.85,17850.0,United Kingdom,
9,536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,29-Nov-16,1.69,13047.0,United Kingdom,


In [224]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
 8   Unnamed: 8   0 non-null       float64
dtypes: float64(3), int64(1), object(5)
memory usage: 37.2+ MB


In [225]:
df.describe(include='all')

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Unnamed: 8
count,541909.0,541909,540455,541909.0,541909,541909.0,406829.0,541909,0.0
unique,25900.0,4070,4223,,305,,,38,
top,573585.0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,,3-Dec-17,,,United Kingdom,
freq,1114.0,2313,2369,,5331,,,495478,
mean,,,,9.55225,,4.611114,15287.69057,,
std,,,,218.081158,,96.759853,1713.600303,,
min,,,,-80995.0,,-11062.06,12346.0,,
25%,,,,1.0,,1.25,13953.0,,
50%,,,,3.0,,2.08,15152.0,,
75%,,,,10.0,,4.13,16791.0,,


### Adjustments

* create a new column, 'TotalValue' for the total value of the row
* create a new column, 'InvoiceYear' for the year when the transaction was generated
* create a new column, 'InvoiceMonth' for the month when the transaction was generated
* create a new column, 'InvoiceSemester' for the semester when the transaction was generated

### Hypotheses

1. The Insiders could be defined as the CustomerIDs with the highest accumulated purchase value
    * analyze 'TotalValue' grouped by 'CustomerID'
2. The Insiders could be defined as the CustomerIDs with the highest accumulated purchase value within a date range
    * analyze 'TotalValue' grouped by 'CustomerID' and 'InvoiceYear'
    * analyze 'TotalValue' grouped by 'CustomerID', 'InvoiceYear' and 'InvoiceSemester'
3. The Insiders could be defined as the CustomerIDs with the highest value purchased in a single 'InvoiceNo' or on a single 'InvoiceDate'
    * analyze 'TotalValue' grouped by 'CustomerID' and 'InvoiceNo' or 'InvoiceDate'
4. The Insiders could be defined as the CustomerIDs who bought the most expensive products in the highest quantities
    * identify which products are the most expensive
    * analyze 'TotalValue' grouped by 'CustomerID' and ('StockCode' or 'Description')
5. The Insiders could be defined as the CustomerIDs who bought the most expensive products in the highest quantities within a date range
    * analyze 'TotalValue' grouped by 'CustomerID', ('StockCode' or 'Description') and 'InvoiceYear'
    * analyze 'TotalValue' grouped by 'CustomerID', ('StockCode' or 'Description'), 'InvoiceYear' and 'InvoiceSemester'
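
The grouped analyses listed above all follow the same pattern: group by the customer plus the chosen period or product columns, then sum 'TotalValue'. The following is a minimal sketch of the two groupings in hypothesis 2, run on made-up rows. The `toy` frame and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up rows that mimic the columns described in the adjustments list
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'InvoiceYear': ['16', '17', '17', '17', '17'],
    'InvoiceSemester': [2, 1, 2, 1, 1],
    'TotalValue': [10.0, 20.0, 5.0, 7.5, 2.5],
})

# Accumulated value per customer and year
by_year = (
    toy.groupby(['CustomerID', 'InvoiceYear'], as_index=False)['TotalValue']
    .sum()
    .sort_values('TotalValue', ascending=False)
)

# Accumulated value per customer, year and semester
by_semester = (
    toy.groupby(['CustomerID', 'InvoiceYear', 'InvoiceSemester'], as_index=False)['TotalValue']
    .sum()
    .sort_values('TotalValue', ascending=False)
)

print(by_year)
print(by_semester)
```

Swapping the grouping keys for 'InvoiceNo', 'StockCode' or 'Description' gives the analyses for the remaining hypotheses.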

## 1.1 Missing Values <a class='anchor' id='1.1'></a>

In [226]:
print('Features containing missing values:')
temp = df.isnull().sum()
print(temp.loc[temp!=0], '\n')

Features containing missing values:
Description      1454
CustomerID     135080
Unnamed: 8     541909
dtype: int64 



* Since we want to identify which customers are the most valuable to the company, rows with a missing CustomerID are not relevant and can be dropped
* We can drop the Unnamed: 8 column, since it contains no values
* We can ignore the missing values in the Description column, because the StockCode column identifies the same product and has no missing values

In [227]:
# Drop the empty 'Unnamed: 8' column and every row without a CustomerID
df.drop('Unnamed: 8', axis=1, inplace=True)
df.dropna(subset=['CustomerID'], inplace=True)
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,29-Nov-16,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,29-Nov-16,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,29-Nov-16,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,29-Nov-16,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,29-Nov-16,3.39,17850.0,United Kingdom


In [228]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    406829 non-null  object 
 1   StockCode    406829 non-null  object 
 2   Description  406829 non-null  object 
 3   Quantity     406829 non-null  int64  
 4   InvoiceDate  406829 non-null  object 
 5   UnitPrice    406829 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      406829 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 27.9+ MB


As we can see, all the missing values in the Description column were removed when we dropped the rows without a CustomerID.
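
The same relationship can be checked on the raw data before dropping anything: if every row with a missing Description also has a missing CustomerID, then removing the latter removes the former. The following is a small sketch of that check on made-up rows. `raw` is a hypothetical stand-in for the DataFrame as loaded from the CSV.

```python
import numpy as np
import pandas as pd

# Made-up rows: every missing Description coincides with a missing CustomerID
raw = pd.DataFrame({
    'Description': ['LANTERN', None, 'MUG', None],
    'CustomerID': [17850.0, np.nan, np.nan, np.nan],
})

# Rows where Description is missing but CustomerID is present
orphans = raw[raw['Description'].isnull() & raw['CustomerID'].notnull()]
description_covered = orphans.empty
print(description_covered)

# Dropping rows without a CustomerID then leaves no missing Description
cleaned = raw.dropna(subset=['CustomerID'])
print(cleaned['Description'].isnull().sum())
```

If `orphans` were not empty, the missing descriptions would need separate handling, for example by looking them up through StockCode.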

## 1.2 Negative Quantities <a class='anchor' id='1.2'></a>

In [229]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,406829.0,406829.0,406829.0
mean,12.061303,3.460471,15287.69057
std,248.69337,69.315162,1713.600303
min,-80995.0,0.0,12346.0
25%,2.0,1.25,13953.0
50%,5.0,1.95,15152.0
75%,12.0,3.75,16791.0
max,80995.0,38970.0,18287.0


In [230]:
df[df['Quantity'] < 0].head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,29-Nov-16,27.5,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,29-Nov-16,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,29-Nov-16,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,29-Nov-16,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,29-Nov-16,0.29,17548.0,United Kingdom


Before creating the new features, let's fix the negative values in the Quantity column for the transactions that do not represent discounts.

In [231]:
df[df['Quantity'] < 0].where(df['Description'] == 'Discount')['StockCode'].dropna().unique()

array(['D'], dtype=object)

In [232]:
sorted(df[df['Quantity'] < 0].where(df['Description'].str.startswith('D'))['Description'].dropna().unique())

['DAIRY MAID LARGE MILK JUG',
 'DAIRY MAID STRIPE MUG',
 'DAIRY MAID TOASTRACK',
 'DAIRY MAID TRADITIONAL TEAPOT ',
 'DAISY HAIR BAND',
 'DAISY HAIR COMB',
 'DANISH ROSE BEDSIDE CABINET',
 'DANISH ROSE DECORATIVE PLATE',
 'DANISH ROSE DELUXE COASTER',
 'DANISH ROSE PHOTO FRAME',
 'DANISH ROSE ROUND SEWING BOX',
 'DANISH ROSE TRINKET TRAYS',
 'DARK BIRD HOUSE TREE DECORATION',
 'DECORATION  BUTTERFLY  MAGIC GARDEN',
 'DECORATION  PINK CHICK MAGIC GARDEN',
 'DECORATION SITTING BUNNY',
 'DECORATION WHITE CHICK MAGIC GARDEN',
 'DECORATION WOBBLY CHICKEN',
 'DECORATIVE CATS BATHROOM BOTTLE',
 'DECORATIVE FLORE BATHROOM BOTTLE',
 'DECORATIVE HANGING SHELVING UNIT',
 'DECORATIVE PLANT POT WITH FRIEZE',
 'DECORATIVE WICKER HEART LARGE',
 'DECORATIVE WICKER HEART MEDIUM',
 'DELUXE SEWING KIT ',
 'DENIM PATCH PURSE PINK BUTTERFLY',
 'DIAMANTE HAIR GRIP PACK/2 BLACK DIA',
 'DIAMANTE HAIR GRIP PACK/2 CRYSTAL',
 'DIAMANTE HAIR GRIP PACK/2 LT ROSE',
 'DIAMANTE HAIR GRIP PACK/2 MONTANA',
 'DIAMANTE H

As we can see, there are no misspelled variants of the 'Discount' description, and discount transactions are identified solely by the StockCode 'D'.

In [233]:
# Make negative quantities positive, except on discount rows (StockCode 'D')
df['Quantity'] = df['Quantity'].where((df['Quantity'] >= 0) | (df['StockCode'] == 'D'), df['Quantity'].abs())

In [234]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,406829.0,406829.0,406829.0
mean,13.406409,3.460471,15287.69057
std,248.624487,69.315162,1713.600303
min,-720.0,0.0,12346.0
25%,2.0,1.25,13953.0
50%,5.0,1.95,15152.0
75%,12.0,3.75,16791.0
max,80995.0,38970.0,18287.0


In [238]:
df[df['Quantity'] < 0]['StockCode'].unique()

array(['D'], dtype=object)

As we can see, the only StockCode left with negative quantities is 'D', the discount code.

## 1.3 New Features <a class='anchor' id='1.3'></a>

In [239]:
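# Monetary value of each row
df['TotalValue'] = df['Quantity']*df['UnitPrice']
# Extract the month abbreviation and two-digit year from strings such as '29-Nov-16'
df = pd.concat([df, df['InvoiceDate'].str.extract(r'(?P<InvoiceMonth>[A-Za-z]{3})-(?P<InvoiceYear>\d{2})')], axis=1)
# Convert the month abbreviation to its number (1-12)
df['InvoiceMonth'] = pd.to_datetime(df['InvoiceMonth'], format='%b').dt.month
# Semester 1 covers January to June, semester 2 covers July to December
df['InvoiceSemester'] = df['InvoiceMonth'].apply(lambda x: 1 if x <= 6 else 2)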
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TotalValue,InvoiceMonth,InvoiceYear,InvoiceSemester
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,29-Nov-16,2.55,17850.0,United Kingdom,15.3,11,16,2
1,536365,71053,WHITE METAL LANTERN,6,29-Nov-16,3.39,17850.0,United Kingdom,20.34,11,16,2
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,29-Nov-16,2.75,17850.0,United Kingdom,22.0,11,16,2
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,29-Nov-16,3.39,17850.0,United Kingdom,20.34,11,16,2
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,29-Nov-16,3.39,17850.0,United Kingdom,20.34,11,16,2
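
An alternative to the regular expression, shown here only as a sketch, is to parse the whole `InvoiceDate` string once with `pd.to_datetime` and read the parts from the `.dt` accessor. This assumes every date follows the day-month-year pattern of values such as `29-Nov-16`, which corresponds to the format `'%d-%b-%y'`, and that month abbreviations are in English. Unlike the regex, it yields a four-digit integer year rather than a two-character string. The `toy` frame is illustrative.

```python
import pandas as pd

# Made-up rows using the same date layout as the dataset
toy = pd.DataFrame({'InvoiceDate': ['29-Nov-16', '3-Dec-17', '4-Jun-17']})

# Parse the full date once, then derive each feature from it
parsed = pd.to_datetime(toy['InvoiceDate'], format='%d-%b-%y')
toy['InvoiceYear'] = parsed.dt.year
toy['InvoiceMonth'] = parsed.dt.month
# Months 1-6 map to semester 1, months 7-12 to semester 2
toy['InvoiceSemester'] = (toy['InvoiceMonth'] > 6).astype(int) + 1

print(toy)
```

Keeping the parsed datetime column around also makes later date-range filters straightforward.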


## 1.4 Data Analysis <a class='anchor' id='1.4'></a>

In [240]:
df.describe(include='all')

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TotalValue,InvoiceMonth,InvoiceYear,InvoiceSemester
count,406829.0,406829,406829,406829.0,406829,406829.0,406829.0,406829,406829.0,406829.0,406829.0,406829.0
unique,22190.0,3684,3896,,305,,,37,,,2.0,
top,576339.0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,,4-Nov-17,,,United Kingdom,,,17.0,
freq,542.0,2077,2070,,3434,,,361878,,,379979.0,
mean,,,,13.406409,,3.460471,15287.69057,,23.379252,7.541259,,1.629046
std,,,,248.624487,,69.315162,1713.600303,,427.439262,3.409613,,0.483061
min,,,,-720.0,,0.0,12346.0,,-1867.86,1.0,,1.0
25%,,,,2.0,,1.25,13953.0,,4.68,5.0,,1.0
50%,,,,5.0,,1.95,15152.0,,11.8,8.0,,2.0
75%,,,,12.0,,3.75,16791.0,,19.8,11.0,,2.0


### Hypothesis 1
The Insiders could be defined by CustomerIDs who have the most accumulated value of purchases

In [241]:
df.groupby('CustomerID').agg({'TotalValue': 'sum'}).sort_values(by='TotalValue', ascending=False)

CustomerID,TotalValue
16446.0,336942.10
14646.0,280510.22
18102.0,262876.11
17450.0,201449.81
12346.0,154367.20
...,...
12943.0,3.75
16428.0,2.95
14679.0,2.55
16995.0,1.25
