# **SOURCE**
https://www.kaggle.com/code/mgmarques/customer-segmentation-and-market-basket-analysis/notebook
- Customer segmentation: Customer segmentation is the problem of uncovering information about a firm's customer base from its interactions with the business. In most cases these interactions are purchase behavior and patterns. We explore some of the ways in which this can be used.
- Market basket analysis: Market basket analysis is a method for gaining insight into the granular behavior of customers. It helps in devising strategies that draw on a deeper understanding of customers' purchase decisions. This is interesting because customers are often unaware of such biases or trends in their own purchasing behavior.

Let's see the description of each column:
- InvoiceNo: A unique identifier for the invoice. An invoice number shared across rows means that those transactions were performed in a single invoice (multiple purchases).
- StockCode: Identifier for items contained in an invoice.
- Description: Textual description of each stock item.
- Quantity: The quantity of the item purchased.
- InvoiceDate: Date and time of purchase.
- UnitPrice: Price per unit of the item.
- CustomerID: Identifier for customer making the purchase.
- Country: Country of customer.
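
To make the InvoiceNo description concrete, here is a minimal sketch of how rows that share an invoice number roll up into one invoice. It assumes a small hand-built frame rather than the real file: the first two rows mirror the preview shown later, while invoice `999999` and the names `toy`, `LineTotal`, and `invoice_summary` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy rows shaped like the dataset. Invoice 999999 is made up for illustration.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "999999"],
    "StockCode": ["85123A", "71053", "22633"],
    "Quantity": [6, 6, 4],
    "UnitPrice": [2.55, 3.39, 1.85],
})

# Value of each invoice line (hypothetical helper column)
toy["LineTotal"] = toy["Quantity"] * toy["UnitPrice"]

# One row per invoice: number of lines, total quantity, total value
invoice_summary = toy.groupby("InvoiceNo").agg(
    n_lines=("StockCode", "size"),
    total_quantity=("Quantity", "sum"),
    invoice_total=("LineTotal", "sum"),
)
print(invoice_summary)
```

Grouping by InvoiceNo like this is also the natural unit for a "basket" in market basket analysis.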

# **DATA UNDERSTANDING**

In [2]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
pd.options.mode.chained_assignment = None

path = './db/online-retail.xlsx'
df = pd.read_excel(path)

In [3]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [4]:
from IPython.display import display  # explicit import so the function also runs outside a notebook cell

line = '========================'

def dataProfile(data):
  """Print a quick profile of a DataFrame: shape, size, duplicates, nulls, and summary statistics."""
  dimension = data.shape
  dtype = data.dtypes
  countOfNull = data.isnull().sum()
  nullRatio = round(countOfNull/len(data)*100, 4)  # percentage of missing values per column
  countOfDistinct = data.nunique()
  # Collect the unique values column by column, which avoids relying on
  # how DataFrame.apply combines array results of different lengths
  distinctValue = pd.Series({col: data[col].unique() for col in data.columns})
  output = pd.DataFrame(list(zip(dtype, countOfNull, nullRatio, countOfDistinct, distinctValue)),
                        index=data.columns,
                        columns=['dtype', 'count_of_null', 'null_ratio', 'count_of_distinct', 'distinct_value'])
  print(f'Dimensions\t: {dimension}')
  print(f'Data Size\t: {round(data.memory_usage(deep=True).sum()/1000000, 2)} MB')
  print(line)
  # Fully duplicated rows (every occurrence after the first)
  duplicated = data[data.duplicated()]
  print(f'Duplicated Data\t: {len(duplicated)}')
  display(duplicated)
  print(line)
  print('REVIEW')
  display(output)
  print(line)
  print('Statistical Numerics')
  display(data.describe())
  print(line)
  print('Statistical Categorics')
  display(data.describe(include=['category', 'object']))
  print(line)
  print('PREVIEW head(3)')
  display(data.head(3))

In [5]:
dataProfile(df)

Dimensions	: (541909, 8)
Data Size	: 141.48 MB
Duplicated Data	: 5268


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
517,536409,21866,UNION JACK FLAG LUGGAGE TAG,1,2010-12-01 11:45:00,1.25,17908.0,United Kingdom
527,536409,22866,HAND WARMER SCOTTY DOG DESIGN,1,2010-12-01 11:45:00,2.10,17908.0,United Kingdom
537,536409,22900,SET 2 TEA TOWELS I LOVE LONDON,1,2010-12-01 11:45:00,2.95,17908.0,United Kingdom
539,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,2010-12-01 11:45:00,4.95,17908.0,United Kingdom
555,536412,22327,ROUND SNACK BOXES SET OF 4 SKULLS,1,2010-12-01 11:49:00,2.95,17920.0,United Kingdom
...,...,...,...,...,...,...,...,...
541675,581538,22068,BLACK PIRATE TREASURE CHEST,1,2011-12-09 11:34:00,0.39,14446.0,United Kingdom
541689,581538,23318,BOX OF 6 MINI VINTAGE CRACKERS,1,2011-12-09 11:34:00,2.49,14446.0,United Kingdom
541692,581538,22992,REVOLVER WOODEN RULER,1,2011-12-09 11:34:00,1.95,14446.0,United Kingdom
541699,581538,22694,WICKER STAR,1,2011-12-09 11:34:00,2.10,14446.0,United Kingdom


REVIEW


Unnamed: 0,dtype,count_of_null,null_ratio,count_of_distinct,distinct_value
InvoiceNo,object,0,0.0,25900,"[536365, 536366, 536367, 536368, 536369, 53637..."
StockCode,object,0,0.0,4070,"[85123A, 71053, 84406B, 84029G, 84029E, 22752,..."
Description,object,1454,0.2683,4223,"[WHITE HANGING HEART T-LIGHT HOLDER, WHITE MET..."
Quantity,int64,0,0.0,722,"[6, 8, 2, 32, 3, 4, 24, 12, 48, 18, 20, 36, 80..."
InvoiceDate,datetime64[ns],0,0.0,23260,"[2010-12-01T08:26:00.000000000, 2010-12-01T08:..."
UnitPrice,float64,0,0.0,1630,"[2.55, 3.39, 2.75, 7.65, 4.25, 1.85, 1.69, 2.1..."
CustomerID,float64,135080,24.9267,4372,"[17850.0, 13047.0, 12583.0, 13748.0, 15100.0, ..."
Country,object,0,0.0,38,"[United Kingdom, France, Australia, Netherland..."


Statistical Numerics


Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


Statistical Categorics


Unnamed: 0,InvoiceNo,StockCode,Description,Country
count,541909,541909,540455,541909
unique,25900,4070,4223,38
top,573585,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom
freq,1114,2313,2369,495478


PREVIEW head(3)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom


The preceding output shows negative values in both Quantity and UnitPrice, which suggests that the data include return (refund) transactions. Since our goal is customer segmentation and market basket analysis, these records should be removed. First, though, we check whether there are records where both values are negative, or where one is negative and the other is zero.
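
One way to run that check is to cross-tabulate the sign of Quantity against the sign of UnitPrice. The following is a minimal sketch on a hand-built frame, not the notebook's own code: the names `toy`, `sign_labels`, `sign_table`, and `both_negative` are hypothetical, and the rows are chosen only to cover the different sign combinations.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; they are not a sample of the real data.
toy = pd.DataFrame({
    "Quantity": [6, -1, -12, 4, 1],
    "UnitPrice": [2.55, 27.50, 1.65, 0.0, -11062.06],
})

sign_labels = {-1: "negative", 0: "zero", 1: "positive"}

# np.sign gives -1, 0, or 1 per value; map those to readable labels
qty_sign = np.sign(toy["Quantity"]).astype(int).map(sign_labels)
price_sign = np.sign(toy["UnitPrice"]).astype(int).map(sign_labels)

# Rows: sign of Quantity, columns: sign of UnitPrice
sign_table = pd.crosstab(qty_sign, price_sign,
                         rownames=["Quantity"], colnames=["UnitPrice"])
print(sign_table)

# Records where both values are negative
both_negative = toy[(toy["Quantity"] < 0) & (toy["UnitPrice"] < 0)]
print(len(both_negative))
```

On the real data, the same table would be built from `df["Quantity"]` and `df["UnitPrice"]`; every non-empty cell is a combination worth inspecting before rows are dropped.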

# **DATA CLEANSING**

## **Fix Datatypes**

In [27]:
data = df.copy()
# Treat every column except the numeric and datetime ones as categorical (object).
# This also covers CustomerID, so no separate cast is needed.
notCategoricalColumns = ['Quantity', 'InvoiceDate', 'UnitPrice']
for value in data.columns:
  if value not in notCategoricalColumns:
    data[value] = data[value].astype('object')
data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  object        
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 147.3 MB
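
Casting to `object` keeps CustomerID as float values such as `17850.0`, because the missing entries forced the column to float when the file was read. As an alternative, here is a minimal sketch that assumes pandas' nullable `Int64` dtype is acceptable for the identifier. It is not what the notebook does, and the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two IDs from the preview plus one missing value
toy = pd.DataFrame({"CustomerID": [17850.0, 13047.0, np.nan]})

# Nullable integer dtype: whole-number floats become integers, NaN becomes <NA>
toy["CustomerID"] = toy["CustomerID"].astype("Int64")

print(toy["CustomerID"].dtype)
print(toy["CustomerID"].isna().sum())
```

The IDs then display as `17850` rather than `17850.0`, and the missing rows can still be dropped with `dropna` later on.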


## **Drop Duplicates**

In [28]:
def dropDuplicates(df):
  # Report the shape before and after so the number of dropped rows is visible
  print(f'Dimensions before removing duplicates: {df.shape}')
  df = df.drop_duplicates()
  print(f'Dimensions after removing duplicates: {df.shape}')
  return df

In [29]:
data = dropDuplicates(data)
data

Dimensions before removing duplicates: (541909, 8)
Dimensions after removing duplicates: (536641, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France


## **Drop Rows with Missing CustomerID**

In [30]:
def dropNull(df, cols=None):
  # Drop rows with nulls in any column, or only in the columns listed in `cols`
  print(f'Dimensions before removing nulls: {df.shape}')
  if cols is None:
    df = df.dropna()
  else:
    df = df.dropna(subset=cols, axis=0)
  print(f'Dimensions after removing nulls: {df.shape}')
  return df

In [31]:
data = dropNull(data, cols=['CustomerID'])
data

Dimensions before removing nulls: (536641, 8)
Dimensions after removing nulls: (401604, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France


## **Explore**

In [32]:
dataProfile(data)

Dimensions	: (401604, 8)
Data Size	: 117.6 MB
Duplicated Data	: 0


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


REVIEW


Unnamed: 0,dtype,count_of_null,null_ratio,count_of_distinct,distinct_value
InvoiceNo,object,0,0.0,22190,"[536365, 536366, 536367, 536368, 536369, 53637..."
StockCode,object,0,0.0,3684,"[85123A, 71053, 84406B, 84029G, 84029E, 22752,..."
Description,object,0,0.0,3896,"[WHITE HANGING HEART T-LIGHT HOLDER, WHITE MET..."
Quantity,int64,0,0.0,436,"[6, 8, 2, 32, 3, 4, 24, 12, 48, 18, 20, 36, 80..."
InvoiceDate,datetime64[ns],0,0.0,20460,"[2010-12-01T08:26:00.000000000, 2010-12-01T08:..."
UnitPrice,float64,0,0.0,620,"[2.55, 3.39, 2.75, 7.65, 4.25, 1.85, 1.69, 2.1..."
CustomerID,object,0,0.0,4372,"[17850.0, 13047.0, 12583.0, 13748.0, 15100.0, ..."
Country,object,0,0.0,37,"[United Kingdom, France, Australia, Netherland..."


Statistical Numerics


Unnamed: 0,Quantity,UnitPrice
count,401604.0,401604.0
mean,12.183273,3.474064
std,250.283037,69.764035
min,-80995.0,0.0
25%,2.0,1.25
50%,5.0,1.95
75%,12.0,3.75
max,80995.0,38970.0


Statistical Categorics


Unnamed: 0,InvoiceNo,StockCode,Description,CustomerID,Country
count,401604,401604,401604,401604.0,401604
unique,22190,3684,3896,4372.0,37
top,576339,85123A,WHITE HANGING HEART T-LIGHT HOLDER,17841.0,United Kingdom
freq,542,2065,2058,7812.0,356728


PREVIEW head(3)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom


Duplicate rows and null values have been removed. What about the negative values in Quantity?

In [33]:
# Negative quantities: check which character their invoice numbers start with
print('negative quantity => refund?')
print(f'InvoiceNo starts with: {data[(data.Quantity<0)].InvoiceNo.apply(lambda x: str(x)[0]).unique()}\n{line}')
display(data[(data.Quantity<0)])
print(line)
# Zero unit price: count and inspect the affected rows
print('zero unitprice => free/bug/error?')
print(f'length: {len(data[(data.UnitPrice==0)])}\n{line}')
display(data[(data.UnitPrice==0)])

negative quantity => refund?
InvoiceNo starts with: ['C']


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.50,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
...,...,...,...,...,...,...,...,...
540449,C581490,23144,ZINC T-LIGHT HOLDER STARS SMALL,-11,2011-12-09 09:57:00,0.83,14397.0,United Kingdom
541541,C581499,M,Manual,-1,2011-12-09 10:28:00,224.69,15498.0,United Kingdom
541715,C581568,21258,VICTORIAN SEWING BOX LARGE,-5,2011-12-09 11:57:00,10.95,15311.0,United Kingdom
541716,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,-1,2011-12-09 11:58:00,1.25,17315.0,United Kingdom


zero unitprice => free/bug/error?
length: 40


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
9302,537197,22841,ROUND CAKE TIN VINTAGE GREEN,1,2010-12-05 14:02:00,0.0,12647.0,Germany
33576,539263,22580,ADVENT CALENDAR GINGHAM SACK,4,2010-12-16 14:36:00,0.0,16560.0,United Kingdom
40089,539722,22423,REGENCY CAKESTAND 3 TIER,10,2010-12-21 13:45:00,0.0,14911.0,EIRE
47068,540372,22090,PAPER BUNTING RETROSPOT,24,2011-01-06 16:41:00,0.0,13081.0,United Kingdom
47070,540372,22553,PLASTERS IN TIN SKULLS,24,2011-01-06 16:41:00,0.0,13081.0,United Kingdom
56674,541109,22168,ORGANISER WOOD ANTIQUE WHITE,1,2011-01-13 15:10:00,0.0,15107.0,United Kingdom
86789,543599,84535B,FAIRY CAKES NOTEBOOK A6 SIZE,16,2011-02-10 13:08:00,0.0,17560.0,United Kingdom
130188,547417,22062,CERAMIC BOWL WITH LOVE HEART DESIGN,36,2011-03-23 10:25:00,0.0,13239.0,United Kingdom
139453,548318,22055,MINI CAKE STAND HANGING STRAWBERY,5,2011-03-30 12:45:00,0.0,13113.0,United Kingdom
145208,548871,22162,HEART GARLAND RUSTIC PADDED,2,2011-04-04 14:42:00,0.0,14410.0,United Kingdom


In [34]:
# (StockCode, Description) pairs that appear at least once with a zero UnitPrice
zeroUP = data[data.UnitPrice==0][['StockCode', "Description"]]
# All rows for those items, to compare the zero price against the items' other prices
priceZero = pd.merge(data, zeroUP, on=['StockCode', 'Description'], how='inner')
priceZero.groupby(['StockCode', 'Description', 'UnitPrice'], as_index=False).agg(Count_=('UnitPrice', 'count'))

Unnamed: 0,StockCode,Description,UnitPrice,Count_
0,21208,PASTEL COLOUR HONEYCOMB FAN,0.000,1
1,21208,PASTEL COLOUR HONEYCOMB FAN,0.390,50
2,21208,PASTEL COLOUR HONEYCOMB FAN,1.450,2
3,21208,PASTEL COLOUR HONEYCOMB FAN,1.650,9
4,21786,POLKADOT RAIN HAT,0.000,1
...,...,...,...,...
316,M,Manual,4287.630,6
317,M,Manual,6930.000,6
318,M,Manual,38970.000,6
319,PADS,PADS TO MATCH ALL CUSHIONS,0.000,1
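
A caveat about the counts in this table: if the same (StockCode, Description) pair has more than one zero-priced row, an inner merge repeats every matching row once per key occurrence, which inflates `Count_`. The identical count of 6 across several Manual prices may be a sign of this, although the output alone cannot confirm it. The sketch below shows the effect on a hand-built frame and one possible guard, deduplicating the keys before merging. The names `toy`, `keys`, `naive`, and `dedup` are hypothetical.

```python
import pandas as pd

# Illustrative rows: "M" has two zero-priced rows, "21208" has one
toy = pd.DataFrame({
    "StockCode": ["M", "M", "M", "21208", "21208"],
    "Description": ["Manual", "Manual", "Manual",
                    "PASTEL COLOUR HONEYCOMB FAN", "PASTEL COLOUR HONEYCOMB FAN"],
    "UnitPrice": [0.0, 0.0, 38970.0, 0.0, 0.39],
})
on = ["StockCode", "Description"]

# Keys taken from zero-priced rows; ("M", "Manual") appears twice here
keys = toy.loc[toy["UnitPrice"] == 0, on]

# Merging on repeated keys duplicates every matching row
naive = toy.merge(keys, on=on, how="inner")

# Deduplicated keys keep one copy of each row; validate raises if keys repeat
dedup = toy.merge(keys.drop_duplicates(), on=on, how="inner",
                  validate="many_to_one")

print(len(toy), len(naive), len(dedup))
```

With deduplicated keys the merged frame has the same number of rows as the matching rows in the original, so the per-price counts reflect the data rather than the join.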


Only 40 records have a UnitPrice of zero, so they can be removed to avoid data inconsistencies.

### **Drop Zero UnitPrice**

In [35]:
data = data[data.UnitPrice > 0]
data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France


Next, explore the refunds (canceled invoices).

In [47]:
# One row per invoice and customer, with the net quantity of that invoice
cancel = data.groupby(['InvoiceNo', 'CustomerID'], as_index=False).Quantity.sum().sort_values('CustomerID').reset_index(drop=True)
# Invoice numbers starting with 'C' mark cancellations; purely numeric ones yield NaN, hence na=False
cancel['IsCanceled'] = np.where(cancel.InvoiceNo.str.startswith('C', na=False), 1, 0)

print(f'Total transactions\t\t: {len(cancel)}')
print(f'Total canceled transactions\t: {cancel.IsCanceled.sum()} => {round((cancel.IsCanceled.sum()/len(cancel)*100),2)}%')
print(line)
cancel

Total transactions		: 22186
Total canceled transactions	: 3654 => 16.47%


Unnamed: 0,InvoiceNo,CustomerID,Quantity,IsCanceled
0,541431,12346.0,74215,0
1,C541433,12346.0,-74215,1
2,549222,12347.0,483,0
3,537626,12347.0,319,0
4,562032,12347.0,277,0
...,...,...,...,...
22181,578262,18283.0,241,0
22182,579673,18283.0,132,0
22183,570715,18287.0,990,0
22184,554065,18287.0,488,0
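
The first two rows above show a purchase of 74215 units that is fully reversed by a cancellation from the same customer. A per-customer summary makes such cases easy to spot. The following is a minimal sketch, assuming a hand-built frame that copies the first four rows of the table; the names `cancel_toy`, `per_customer`, `net_quantity`, and `cancel_rate` are hypothetical and not part of the notebook.

```python
import pandas as pd

# First four rows of the table above, rebuilt by hand
cancel_toy = pd.DataFrame({
    "InvoiceNo": [541431, "C541433", 549222, 537626],
    "CustomerID": [12346.0, 12346.0, 12347.0, 12347.0],
    "Quantity": [74215, -74215, 483, 319],
})

# Cast to str first so numeric invoice numbers are handled uniformly
cancel_toy["IsCanceled"] = (
    cancel_toy["InvoiceNo"].astype(str).str.startswith("C").astype(int)
)

# Per customer: invoice count, canceled count, and net quantity
per_customer = cancel_toy.groupby("CustomerID").agg(
    n_invoices=("InvoiceNo", "size"),
    n_canceled=("IsCanceled", "sum"),
    net_quantity=("Quantity", "sum"),
)
per_customer["cancel_rate"] = per_customer["n_canceled"] / per_customer["n_invoices"]
print(per_customer)
```

A customer whose net quantity is zero, like 12346 here, contributes no actual purchases, which matters when building segmentation features.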
