# Client Segmentation Clustering

We have a dataset about the stock management of a company. It contains orders made by different clients as well as other types of transactions, such as shipment orders, bank transactions, and stock balancing transactions.

We need to focus only on the actual customer transactions and group the customers into clusters based on their behaviour.

### Importing the dataset

We need to change the default encoding because of some characters in the dataset.

In [205]:
import pandas as pd

df = pd.read_csv('data.csv', encoding='unicode_escape')
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
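
As a small illustration of why the default encoding can fail, the sketch below decodes a made-up byte string that is valid Latin-1 but not valid UTF-8. The sample bytes are an assumption for demonstration only; the actual offending characters in `data.csv` are not shown in this notebook. As far as we know, `unicode_escape` decodes non-ASCII bytes the same way Latin-1 does while also interpreting backslash escapes, so `latin-1` is used here as a stand-in.

```python
# Hypothetical bytes as they might appear in a Latin-1 encoded CSV:
# 0xA3 is the pound sign in Latin-1 but is not valid on its own in UTF-8.
raw = b"GIFT VOUCHER \xa3" + b"10"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

latin1_text = raw.decode("latin-1")

print(utf8_ok)      # False
print(latin1_text)  # GIFT VOUCHER £10
```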


We want to see how many rows and columns we have

- we have 541909 rows and 8 columns

In [206]:
df.shape

(541909, 8)

We want to see what types of data we have

1. Some columns that we would expect to be numeric, such as InvoiceNo and StockCode, are stored as objects because some of their values also contain letters. We will need to convert these to a numeric type.

2. InvoiceDate is also stored as an object, so we will convert it to a datetime type.

In [207]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
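
One possible way to find the values that keep a column from being numeric is sketched below on toy data. The sample values are made up to mimic codes that mix letters and digits; `pd.to_numeric` with `errors="coerce"` is used only as a detection trick here, not as the conversion applied later in the notebook.

```python
import pandas as pd

# Toy values; "C536379" and "85123A" mimic codes that mix letters and digits
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379"],
    "StockCode": ["85123A", "71053", "22632"],
})

# errors="coerce" turns unparseable values into NaN, so the NaN mask
# marks the values that block a numeric dtype
non_numeric_invoice = pd.to_numeric(sample["InvoiceNo"], errors="coerce").isna()
non_numeric_stock = pd.to_numeric(sample["StockCode"], errors="coerce").isna()

bad_invoices = sample.loc[non_numeric_invoice, "InvoiceNo"].tolist()
bad_stocks = sample.loc[non_numeric_stock, "StockCode"].tolist()

print(bad_invoices)  # ['C536379']
print(bad_stocks)    # ['85123A']
```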


## Data cleaning

### Handling null values:

We want to see the features with nulls

Since we want to observe the behaviour of customers, we need to get rid of the rows without a 'CustomerID'.

In [208]:
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Deleting all the rows where 'CustomerID' is null also removes all the rows where 'Description' is null, because every row with a missing 'Description' is also missing its 'CustomerID'.

In [209]:
df.dropna(subset=['CustomerID'], inplace=True, axis=0)
df.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64
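
The claim that the null rows overlap could be checked directly before dropping anything. The sketch below shows the idea on a toy frame with made-up values; on the real data the same two masks would be built from `df` before the `dropna` call.

```python
import numpy as np
import pandas as pd

# Toy frame: every row missing a Description is also missing a CustomerID
toy = pd.DataFrame({
    "Description": ["LANTERN", None, None, "MUG"],
    "CustomerID": [17850.0, np.nan, np.nan, np.nan],
})

desc_null = toy["Description"].isnull()
cust_null = toy["CustomerID"].isnull()

# Rows with a missing Description but a present CustomerID
uncovered = int((desc_null & ~cust_null).sum())

# True when dropping null CustomerID rows also removes all null Descriptions
desc_nulls_covered = uncovered == 0
print(desc_nulls_covered)  # True
```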

### Checking duplicates

We have duplicates, but we want to see whether they are legitimate orders or not.

In [210]:
duplicated_rows = df[df.duplicated(keep=False)]
duplicated_rows = duplicated_rows.sort_values(by=['InvoiceNo', 'StockCode'])
duplicated_rows.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
494,536409,21866,UNION JACK FLAG LUGGAGE TAG,1,12/1/2010 11:45,1.25,17908.0,United Kingdom
517,536409,21866,UNION JACK FLAG LUGGAGE TAG,1,12/1/2010 11:45,1.25,17908.0,United Kingdom
485,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,12/1/2010 11:45,4.95,17908.0,United Kingdom
539,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,12/1/2010 11:45,4.95,17908.0,United Kingdom
489,536409,22866,HAND WARMER SCOTTY DOG DESIGN,1,12/1/2010 11:45,2.1,17908.0,United Kingdom


We see that there are duplicates, but it seems that the system allows the same product to be added multiple times to an order.

For example, on InvoiceNo '536381' the same product '71270' was added with different quantities. This suggests that the same product can appear multiple times on the same order, with the same or a different 'Quantity'.

In [211]:
# Keep every row whose (InvoiceNo, StockCode) pair occurs more than once;
# the quantities on the matching rows may be equal or different
duplicate_orders = df[df.duplicated(subset=['InvoiceNo', 'StockCode'], keep=False)]

print("Orders with the same StockCode bought multiple times:")
duplicate_orders.head()

Orders with the same StockCode bought multiple times:


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
113,536381,71270,PHOTO CLIP LINE,1,12/1/2010 9:41,1.25,15311.0,United Kingdom
125,536381,71270,PHOTO CLIP LINE,3,12/1/2010 9:41,1.25,15311.0,United Kingdom
483,536409,90199C,5 STRAND GLASS NECKLACE CRYSTAL,3,12/1/2010 11:45,6.35,17908.0,United Kingdom
485,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,12/1/2010 11:45,4.95,17908.0,United Kingdom
489,536409,22866,HAND WARMER SCOTTY DOG DESIGN,1,12/1/2010 11:45,2.1,17908.0,United Kingdom
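
To separate repeated products with different quantities from fully identical rows, one option is to count distinct quantities per invoice and product pair. This is a sketch on toy rows modelled on the output above, not the check that was actually run in this notebook.

```python
import pandas as pd

# Toy rows modelled on the invoices shown above
toy = pd.DataFrame({
    "InvoiceNo": ["536381", "536381", "536409", "536409", "536409"],
    "StockCode": ["71270", "71270", "21866", "21866", "22111"],
    "Quantity": [1, 3, 1, 1, 1],
})

# Number of distinct quantities per (invoice, product) pair
qty_variants = toy.groupby(["InvoiceNo", "StockCode"])["Quantity"].nunique()

# Pairs that appear with more than one distinct quantity
different_qty = qty_variants[qty_variants > 1].index.tolist()
print(different_qty)  # [('536381', '71270')]

# Fully identical rows (same invoice, product and quantity)
exact_dupes = toy[toy.duplicated(keep=False)]
print(len(exact_dupes))  # 2
```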


Next, we remove the rows that are not product transactions.

First we check how many rows have a StockCode that does not start with a digit.

In [212]:
# Rows whose StockCode does not start with a digit
non_numeric_stocks = df[~df['StockCode'].str.contains(r'^\d', na=False)]
len(non_numeric_stocks)

1920
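
Before dropping anything, it can help to look at which codes are affected and how often each one occurs. The sketch below does this on toy data; the non-numeric codes used here are illustrative placeholders, not a listing of what the dataset contains.

```python
import pandas as pd

# Toy rows; the non-numeric codes are illustrative placeholders
toy = pd.DataFrame({"StockCode": ["85123A", "POST", "22111", "M", "POST"]})

# Rows whose StockCode does not start with a digit
mask = ~toy["StockCode"].str.contains(r"^\d", na=False)

# Frequency of each non-numeric code
code_counts = toy.loc[mask, "StockCode"].value_counts().to_dict()
print(code_counts)  # {'POST': 2, 'M': 1}
```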

Since these rows are not product transactions, we drop them.

In [213]:
print(df.shape)
df.drop(index=non_numeric_stocks.index, inplace=True)
print(df.shape)

(406829, 8)
(404909, 8)


We want to see the distribution of the UnitPrice values.

The minimum is 0, so some rows have a UnitPrice equal to 0. These rows do not help us.

In [214]:
df['UnitPrice'].describe()

count    404909.000000
mean          2.901129
std           4.430846
min           0.000000
25%           1.250000
50%           1.950000
75%           3.750000
max         649.500000
Name: UnitPrice, dtype: float64

There are 33 rows where UnitPrice is 0, so we can remove them.

In [215]:
# Rows with a unit price of exactly zero
zero_unit_price = df[df['UnitPrice'] == 0]
print(len(zero_unit_price))

33


In [216]:
print(df.shape)
df.drop(index=zero_unit_price.index, inplace=True)
print(df.shape)

(404909, 8)
(404876, 8)
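
An alternative to dropping by index is to keep the rows that pass a boolean mask. The sketch below shows this on toy data with made-up values. Note that `> 0` would also discard negative prices, which makes no difference here because the minimum UnitPrice is 0.

```python
import pandas as pd

# Toy rows with one zero-priced line
toy = pd.DataFrame({
    "StockCode": ["22111", "21866", "71270"],
    "UnitPrice": [4.95, 0.0, 1.25],
})

# Keep only rows with a strictly positive price; .copy() avoids
# chained-assignment warnings if the result is modified later
priced = toy[toy["UnitPrice"] > 0].copy()
print(priced.shape)  # (2, 2)
```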


We want to see what types of orders we have. It seems we have normal orders and canceled orders, with around 8500 rows belonging to canceled orders.

In [217]:
# Rows whose InvoiceNo does not start with a digit (e.g. the 'C' prefix)
non_numeric_invoices = df[~df['InvoiceNo'].str.contains(r'^\d', na=False)]
non_numeric_invoices.value_counts()


InvoiceNo  StockCode  Description                         Quantity  InvoiceDate       UnitPrice  CustomerID  Country       
C543611    82483      WOOD 2 DRAWER CABINET WHITE FINISH  -1        2/10/2011 14:38   4.95       17850.0     United Kingdom    4
C538341    22725      ALARM CLOCK BAKELIKE CHOCOLATE      -1        12/10/2010 14:03  3.75       15514.0     United Kingdom    3
           22976      CIRCUS PARADE CHILDRENS EGG CUP     -12       12/10/2010 14:03  1.25       15514.0     United Kingdom    3
           22730      ALARM CLOCK BAKELIKE IVORY          -1        12/10/2010 14:03  3.75       15514.0     United Kingdom    3
C570556    20971      PINK BLUE FELT CRAFT TRINKET BOX    -1296     10/11/2011 11:10  1.06       16029.0     United Kingdom    2
                                                                                                                              ..
C551285    82483      WOOD 2 DRAWER CABINET WHITE FINISH  -1        4/27/2011 14:07   6.95       15005

We need to handle the canceled orders in some way. Their invoice numbers start with the letter 'C' followed by an order id. What we can do is create a binary column 'CanceledOrder' and mark the canceled orders with 1 and the others with 0.

Afterwards we need to clean the 'InvoiceNo' column by removing the letter 'C' from the invoices that have it.

In [218]:
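# Flag canceled orders: their invoice numbers start with 'C'
df['CanceledOrder'] = df['InvoiceNo'].str.startswith('C').astype(int)
# Strip the 'C' so the remaining invoice number is purely numeric
df['InvoiceNo'] = df['InvoiceNo'].str.replace('C', '', regex=False)
df['CanceledOrder'].value_counts()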

CanceledOrder
0    396337
1      8539
Name: count, dtype: int64
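
The flag-then-strip step can be sketched on toy invoice numbers as follows. Anchoring the pattern with `^` is a slightly stricter variant that removes only a leading 'C'; this is a suggestion, not what the cell above does.

```python
import pandas as pd

# Toy invoice numbers; the 'C' prefix marks a cancellation
toy = pd.DataFrame({"InvoiceNo": ["536365", "C536379", "C543611"]})

# 1 for canceled orders, 0 otherwise
toy["CanceledOrder"] = toy["InvoiceNo"].str.startswith("C").astype(int)

# Anchor the pattern at the start so only a leading 'C' is removed
toy["InvoiceNo"] = toy["InvoiceNo"].str.replace(r"^C", "", regex=True)

print(toy["CanceledOrder"].tolist())  # [0, 1, 1]
print(toy["InvoiceNo"].tolist())      # ['536365', '536379', '543611']
```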

### Removing unneeded columns

The dataset has a 'Country' column, so we need to see how the samples are distributed over countries.

The majority of the orders are from the United Kingdom, representing ~90% of the total.

Because almost all orders come from a single country, the column adds little information, so we can remove the 'Country' column.

In [224]:
df['Country'].value_counts(normalize=True)
df.drop('Country', axis=1, inplace=True)
df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'CanceledOrder'],
      dtype='object')
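
A notebook cell only displays the value of its last expression, so country shares computed in the middle of a cell need an explicit `print` to be visible. The sketch below shows how the share of one country can be read off on a toy series; the 9-to-1 split is made up to mirror the ~90% figure.

```python
import pandas as pd

# Toy column: 9 UK rows and 1 French row
toy = pd.Series(["United Kingdom"] * 9 + ["France"], name="Country")

# normalize=True returns fractions instead of raw counts
shares = toy.value_counts(normalize=True)
uk_share = round(float(shares["United Kingdom"]), 2)

print(uk_share)  # 0.9
```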

We can also remove the 'Description' column, because it is redundant given that each product is already identified by its stock code.

In [225]:
df.drop('Description', axis=1, inplace=True)
df.columns

Index(['InvoiceNo', 'StockCode', 'Quantity', 'InvoiceDate', 'UnitPrice',
       'CustomerID', 'CanceledOrder'],
      dtype='object')

### Converting columns to proper type

#### Date column
The 'InvoiceDate' column is of object type. We need to convert it to a datetime type so that we can extract what we need from it.

In [226]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 404876 entries, 0 to 541908
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   InvoiceNo      404876 non-null  object        
 1   StockCode      404876 non-null  object        
 2   Quantity       404876 non-null  int64         
 3   InvoiceDate    404876 non-null  datetime64[ns]
 4   UnitPrice      404876 non-null  float64       
 5   CustomerID     404876 non-null  float64       
 6   CanceledOrder  404876 non-null  int32         
dtypes: datetime64[ns](1), float64(2), int32(1), int64(1), object(2)
memory usage: 23.2+ MB
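
Once the column is a datetime, its components can be pulled out through the `.dt` accessor. The sketch below uses toy dates and an explicit month/day/year format string; both the format and the choice of extracted fields (`Year`, `Month`, `Hour`) are assumptions for illustration, since the notebook does not yet say which parts it needs.

```python
import pandas as pd

# Toy dates in the same month/day/year layout as the rows shown earlier
toy = pd.DataFrame({"InvoiceDate": ["12/1/2010 8:26", "2/10/2011 14:38"]})

# An explicit format avoids ambiguity between day-first and month-first
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")

# Hypothetical derived columns
toy["Year"] = toy["InvoiceDate"].dt.year
toy["Month"] = toy["InvoiceDate"].dt.month
toy["Hour"] = toy["InvoiceDate"].dt.hour

print(toy["Year"].tolist())   # [2010, 2011]
print(toy["Month"].tolist())  # [12, 2]
print(toy["Hour"].tolist())   # [8, 14]
```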


#### InvoiceNo | StockCode columns

In [228]:
df = df.astype({'InvoiceNo':'int64'})
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 404876 entries, 0 to 541908
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   InvoiceNo      404876 non-null  int64         
 1   StockCode      404876 non-null  object        
 2   Quantity       404876 non-null  int64         
 3   InvoiceDate    404876 non-null  datetime64[ns]
 4   UnitPrice      404876 non-null  float64       
 5   CustomerID     404876 non-null  float64       
 6   CanceledOrder  404876 non-null  int32         
dtypes: datetime64[ns](1), float64(2), int32(1), int64(2), object(1)
memory usage: 23.2+ MB


We still need to handle the StockCodes that end with letters and see whether we can convert the column to int64, or whether we need this column at all.
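
One possible approach, sketched below on toy codes, is to split each StockCode into a numeric base and an optional letter suffix. The column names `StockBase` and `StockVariant` are hypothetical, and this assumes every remaining code consists of digits followed by optional letters, which has not been verified on the full dataset.

```python
import pandas as pd

# Toy codes modelled on the rows shown at the top of the notebook
toy = pd.DataFrame({"StockCode": ["85123A", "71053", "84029G"]})

# Named groups become the column names of the extracted frame
parts = toy["StockCode"].str.extract(
    r"^(?P<StockBase>\d+)(?P<StockVariant>[A-Za-z]*)$"
)

# Hypothetical columns: numeric base as int64, letter suffix as text
toy["StockBase"] = parts["StockBase"].astype("int64")
toy["StockVariant"] = parts["StockVariant"]

print(toy["StockBase"].tolist())     # [85123, 71053, 84029]
print(toy["StockVariant"].tolist())  # ['A', '', 'G']
```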