# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Customer Segmentation</p>

<img src="https://github.com/KarnikaKapoor/Files/blob/main/Colorful%20Handwritten%20About%20Me%20Blank%20Education%20Presentation.gif?raw=true">

In this final project, unsupervised clustering is performed on customer records extracted into [`online_retail.xlsx`](https://github.com/thuynh386/olist_ecommerce_dataset/blob/master/online_retail_II.xlsx?raw=true).
   <a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Requirements</p>   
    
### 1. Import libraries

Import libraries that support the analysis and visualization of data, e.g. pandas, matplotlib, seaborn, sklearn, etc.

| |Invoice|StockCode|Description|Quantity|InvoiceDate|Price|Customer ID|Country|
|-|-|-|-|-|-|-|-|-|
|0|489434|85048|15CM CHRISTMAS GLASS BALL 20 LIGHTS|12|2009-12-01 07:45:00|6.95|13085.0|United Kingdom|
|1|489434|79323P|PINK CHERRY LIGHTS|12|2009-12-01 07:45:00|6.75|13085.0|United Kingdom|

### 2. Load datasets

The dataset can be accessed [here](https://github.com/thuynh386/olist_ecommerce_dataset/blob/master/online_retail_II.xlsx?raw=true) and loaded from Excel using pandas.
Further analysis can then be performed to understand the features, the relationships between them, and any problems they contain.
E.g.:
- How many customers are there in the dataset?
- Describe the dataset with the necessary information.
- Does any feature have an abnormal data type?
- How many unique values are there in each feature?
- How many missing values are there in each feature?
- Are there any outliers in each feature?
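
The questions above map onto a handful of pandas calls. The sketch below runs them on a tiny made-up frame (`sample` and its values are hypothetical) that mimics the dataset's columns. The 1.5 × IQR rule used in the hypothetical helper `iqr_outliers` is one common convention for flagging outliers, not something the assignment prescribes.

```python
import pandas as pd

# Tiny made-up sample mimicking the dataset's columns (values are illustrative).
sample = pd.DataFrame({
    "Invoice": ["489434", "489434", "489435", "C489449", "489436"],
    "StockCode": ["85048", "79323P", "22350", "22087", "TEST001"],
    "Quantity": [12, 12, 6, -12, 5000],
    "Price": [6.95, 6.75, 2.55, 2.95, 1.00],
    "Customer ID": [13085.0, 13085.0, 13078.0, 16321.0, None],
})

n_customers = sample["Customer ID"].nunique()  # distinct customers (NaN excluded)
dtypes = sample.dtypes                         # spot unexpected data types
unique_counts = sample.nunique()               # unique values per feature
missing_counts = sample.isna().sum()           # missing values per feature


def iqr_outliers(series: pd.Series) -> pd.Series:
    """Boolean mask of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)


quantity_outliers = sample.loc[iqr_outliers(sample["Quantity"]), "Quantity"]
print(n_customers, missing_counts["Customer ID"], quantity_outliers.tolist())
```

On the real data, the same calls would be applied to `data` instead of `sample`.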
    
### 3. Data cleaning and preprocessing 
- Remove the outliers if any.
- Remove the missing values if any.
- Convert the date to datetime format.
- Convert the quantity to numeric.
- Convert the customer id to numeric.
- Remove duplicates and test data, i.e. rows where `StockCode` is 'TEST' or 'M'.
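
The cleaning steps listed above could look roughly like the following sketch. It runs on a made-up frame (`raw` and its values are hypothetical) and assumes an exact match on the stock codes 'TEST' and 'M'. Outlier removal is left out here because the right threshold depends on what the exploration step finds.

```python
import pandas as pd

# Made-up rows covering each cleaning case (values are illustrative).
raw = pd.DataFrame({
    "Invoice": ["489434", "489434", "489435", "489436", "489437"],
    "StockCode": ["85048", "85048", "TEST", "M", "22350"],
    "Quantity": ["12", "12", "1", "1", "6"],
    "InvoiceDate": ["2009-12-01 07:45:00"] * 4 + ["2009-12-01 09:06:00"],
    "Price": [6.95, 6.95, 1.0, 10.0, 2.55],
    "Customer ID": ["13085", "13085", "12346", None, "13078"],
})

clean = raw.dropna(subset=["Customer ID"]).copy()             # drop rows without a customer
clean["InvoiceDate"] = pd.to_datetime(clean["InvoiceDate"])   # date -> datetime
clean["Quantity"] = pd.to_numeric(clean["Quantity"])          # quantity -> numeric
clean["Customer ID"] = pd.to_numeric(clean["Customer ID"]).astype(int)
clean = clean.drop_duplicates()                               # remove exact duplicate rows
clean = clean[~clean["StockCode"].isin(["TEST", "M"])]        # remove 'TEST' and 'M' rows
clean = clean.reset_index(drop=True)
```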

### 4. Data visualization and analysis
- Visualize the data with the help of matplotlib and seaborn for the above analysis.
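
As an illustration, two of the earlier questions (outliers and missing values) can be answered visually. This is a minimal matplotlib sketch on made-up values (`sample` is hypothetical); seaborn offers higher-level equivalents if preferred.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "Quantity": [12, 12, 6, -12, 5000, 3, 8],
    "Price": [6.95, 6.75, 2.55, 2.95, 1.00, None, 4.25],
    "Customer ID": [13085.0, 13085.0, 13078.0, None, None, 15362.0, 18102.0],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Box plot: points beyond the whiskers are candidate outliers.
axes[0].boxplot(sample["Quantity"])
axes[0].set_title("Quantity distribution")
axes[0].set_ylabel("Quantity")

# Bar chart: number of missing values per feature.
missing = sample.isna().sum()
axes[1].bar(missing.index, missing.values)
axes[1].set_title("Missing values per feature")
axes[1].set_ylabel("Count")

fig.tight_layout()
```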

### 5. Feature creation
Create new features from the existing features to capture the RFM (Recency, Frequency, Monetary) of the customer.
- Create `StockValue` feature, which is the product of `Quantity` and `Price`.
- Create `Recency` feature, which is the difference between the customer's most recent `InvoiceDate` and the `InvoiceDate` of the last invoice in the dataset.
- Create `Frequency` feature, which is the number of invoices of the customer.
- Create `Monetary` feature, which is the sum of `StockValue` of the invoices.

Make sure that the features are on the same scale, with no missing values or outliers.
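
The RFM features described above can be sketched with a single `groupby`. The frame `invoices` and its values are made up, and the sketch assumes the reference date for `Recency` is the latest `InvoiceDate` in the data. The z-score at the end is one simple way to put the three features on the same scale; other scalers would work as well.

```python
import pandas as pd

# Made-up invoices for two customers (values are illustrative).
invoices = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "1003"],
    "Quantity": [2, 1, 5, 3],
    "Price": [10.0, 5.0, 2.0, 4.0],
    "InvoiceDate": pd.to_datetime(
        ["2010-01-01", "2010-01-01", "2010-01-11", "2010-01-06"]
    ),
    "Customer ID": [1, 1, 1, 2],
})

invoices["StockValue"] = invoices["Quantity"] * invoices["Price"]

# Assumption: the reference date is the latest invoice in the data.
reference_date = invoices["InvoiceDate"].max()

rfm = invoices.groupby("Customer ID").agg(
    Recency=("InvoiceDate", lambda d: (reference_date - d.max()).days),
    Frequency=("Invoice", "nunique"),   # number of distinct invoices
    Monetary=("StockValue", "sum"),     # total spend
)

# Z-score each column so all three features share the same scale.
rfm_scaled = (rfm - rfm.mean()) / rfm.std(ddof=0)
```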
    
### 6. Clustering with suitable algorithm
Perform clustering on the dataset using an algorithm that is suitable for the problem, e.g. K-means, GMM, DBSCAN, etc.
If you use K-means, make sure to find the optimal number of clusters using the elbow method.
Visualize the clusters in the same way as the analysis above, and examine the clusters formed via scatter plots.
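
For K-means, the elbow method amounts to fitting the model for a range of `k` and looking at where the inertia stops dropping sharply. The sketch below uses scikit-learn on synthetic data standing in for a scaled RFM table; the cluster centers and noise levels are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an RFM table: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[5, 20, 2000], [60, 5, 300], [200, 1, 50]])
X = np.vstack([c + rng.normal(scale=[2, 1, 20], size=(50, 3)) for c in centers])

X_scaled = StandardScaler().fit_transform(X)

# Elbow method: inertia (within-cluster sum of squares) for each k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

# Fit the final model with the k read off the elbow (3 for this synthetic data).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```

Plotting `inertias` against `k` gives the elbow curve, and a scatter plot of two RFM features colored by `labels` shows the clusters.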

### 7. Evaluate the clustering results and conclusion of the analysis (Important)
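
One common way to evaluate a clustering without ground-truth labels is the silhouette score, which ranges from -1 to 1 and is higher when clusters are compact and well separated. The sketch below is an assumption about how this step could be done, not a method the assignment prescribes. It reuses the same kind of synthetic data as a stand-in for scaled RFM features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an RFM table: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[5, 20, 2000], [60, 5, 300], [200, 1, 50]])
X = np.vstack([c + rng.normal(scale=[2, 1, 20], size=(50, 3)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Silhouette score for each candidate k (it is undefined for k=1).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```

A `k` that agrees with the elbow method gives some confidence in the choice. The conclusion should still describe each cluster in terms of its typical recency, frequency, and monetary values.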

In [1]:
### 1

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [8]:
pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10
Note: you may need to restart the kernel to use updated packages.


In [None]:
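# Local path to the workbook; a raw string keeps the backslashes literal.
url = r"D:\Python 2\HomeWork\FormGithub\PY38SA6L2\Lecture 08\online_retail.xlsx"
data1 = pd.read_excel(url, sheet_name='Year 2009-2010')
data2 = pd.read_excel(url, sheet_name='Year 2010-2011')
# pd.concat expects a list of DataFrames, not separate positional arguments.
data = pd.concat([data1, data2], ignore_index=True)
data.head(2)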

In [13]:
data.describe()

| |Quantity|Price|Customer ID|
|-|-|-|-|
|count|525461.0|525461.0|417534.0|
|mean|10.337667|4.688834|15360.645478|
|std|107.42411|146.126914|1680.811316|
|min|-9600.0|-53594.36|12346.0|
|25%|1.0|1.25|13983.0|
|50%|3.0|2.1|15311.0|
|75%|10.0|4.21|16799.0|
|max|19152.0|25111.09|18287.0|


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Invoice      525461 non-null  object        
 1   StockCode    525461 non-null  object        
 2   Description  522533 non-null  object        
 3   Quantity     525461 non-null  int64         
 4   InvoiceDate  525461 non-null  datetime64[ns]
 5   Price        525461 non-null  float64       
 6   Customer ID  417534 non-null  float64       
 7   Country      525461 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 32.1+ MB


In [18]:
data['Country'].unique()

array(['United Kingdom', 'France', 'USA', 'Belgium', 'Australia', 'EIRE',
       'Germany', 'Portugal', 'Japan', 'Denmark', 'Nigeria',
       'Netherlands', 'Poland', 'Spain', 'Channel Islands', 'Italy',
       'Cyprus', 'Greece', 'Norway', 'Austria', 'Sweden',
       'United Arab Emirates', 'Finland', 'Switzerland', 'Unspecified',
       'Malta', 'Bahrain', 'RSA', 'Bermuda', 'Hong Kong', 'Singapore',
       'Thailand', 'Israel', 'Lithuania', 'West Indies', 'Lebanon',
       'Korea', 'Brazil', 'Canada', 'Iceland'], dtype=object)

In [24]:
# How many customers are there in the dataset?
data['Customer ID'].unique()

array([13085., 13078., 15362., ..., 12942., 13369., 15211.])
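
`unique()` returns the distinct IDs themselves rather than how many there are. To get the count directly, `nunique()` can be used, as in this minimal sketch on a made-up series (`customer_ids` is a hypothetical stand-in for `data['Customer ID']`).

```python
import pandas as pd

# Hypothetical stand-in for data['Customer ID']: one repeat and one missing value.
customer_ids = pd.Series([13085.0, 13085.0, 13078.0, None, 15362.0], name="Customer ID")

n_customers = customer_ids.nunique()   # distinct IDs, NaN excluded by default
n_missing = customer_ids.isna().sum()  # rows with no customer attached
print(n_customers, n_missing)
```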