<a href="https://colab.research.google.com/github/prabirdeb/Customer-Segmentation/blob/main/Online_Retail_Customer_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Online Retail Customer Segmentation </u></b>

## <b> Problem Description </b>

### In this project, your task is to identify major customer segments in a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

## <b> Data Description </b>

### <b>Attribute Information: </b>

* ### InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
* ### StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* ### Description: Product (item) name. Nominal.
* ### Quantity: The quantities of each product (item) per transaction. Numeric.
* ### InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
* ### UnitPrice: Unit price. Numeric, Product price per unit in sterling.
* ### CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* ### Country: Country name. Nominal, the name of the country where each customer resides.

# **Steps of Exploratory Data Analysis (EDA)**

**1. Data or EXPERIENCE**

Here, we are provided with a dataset of online retail customers.

Thus, the dataset is actually a collection of experiences about online retail customers. 

Now, we need to **decode the set of experiences** to help in building a model for customer segmentation.

At first, we import the libraries or functions for **making our journey easy** and then **get connected** to the set of experiences.

In [1]:
#importing libraries
import numpy as np
import pandas as pd
from termcolor import colored

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
xls = pd.ExcelFile('/content/drive/MyDrive/Almabetter Assignments/Capstone projects/Customer Segmentation-Prabir Debnath/Online Retail.xlsx')

In [4]:
xls.sheet_names

['Online Retail']

In [5]:
# Reading the data as pandas dataframe
customer_df = pd.read_excel(xls, 'Online Retail')

**2. Features or "DHARMA":**

Here, the columns are the set of features that show the way to the final decoded experience, or conclusions.

Because the number of experiences is huge, we cannot inspect all of them, so we read the features from the **data head**.

In [6]:
customer_df.head(2)

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


**3. Statistics or MEASUREMENTS**

Because the number of experiences is huge, we use statistics to **measure** every feature along different dimensions and so, step by step, find the most important features, that is, the exact way to decode the experiences.

“**What gets measured gets done.**”

In [7]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [8]:
customer_df.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |


In [9]:
for column_name in customer_df.columns:
  print(colored(f"Unique values for {column_name}:",'blue', attrs=['bold']))
  print(customer_df[column_name].unique())

Unique values for InvoiceNo:
[536365 536366 536367 ... 581585 581586 581587]
Unique values for StockCode:
['85123A' 71053 '84406B' ... '90214U' '47591b' 23843]
Unique values for Description:
['WHITE HANGING HEART T-LIGHT HOLDER' 'WHITE METAL LANTERN'
 'CREAM CUPID HEARTS COAT HANGER' ... 'lost'
 'CREAM HANGING HEART T-LIGHT HOLDER' 'PAPER CRAFT , LITTLE BIRDIE']
Unique values for Quantity:
[     6      8      2     32      3      4     24     12     48     18
     20     36     80     64     10    120     96     23      5      1
     -1     50     40    100    192    432    144    288    -12    -24
     16      9    128     25     30     28      7     56     72    200
    600    480     -6     14     -2     11     33     13     -4     -5
     -7     -3     70    252     60    216    384    -10     27     15
     22     19     17     21     34     47    108     52  -9360    -38
     75    270     42    240     90    320   1824    204  

In [10]:
for column_name in customer_df.columns:
  print(colored(f"No. of unique values for {column_name}:",'green', attrs=['bold']))
  print(customer_df[column_name].nunique())

No. of unique values for InvoiceNo:
25900
No. of unique values for StockCode:
4070
No. of unique values for Description:
4223
No. of unique values for Quantity:
722
No. of unique values for InvoiceDate:
23260
No. of unique values for UnitPrice:
1630
No. of unique values for CustomerID:
4372
No. of unique values for Country:
38


In [11]:
for column_name in customer_df.columns:
  print(colored(f"No. of null values for {column_name}:",'magenta', attrs=['bold']))
  print(customer_df[column_name].isnull().sum())

No. of null values for InvoiceNo:
0
No. of null values for StockCode:
0
No. of null values for Description:
1454
No. of null values for Quantity:
0
No. of null values for InvoiceDate:
0
No. of null values for UnitPrice:
0
No. of null values for CustomerID:
135080
No. of null values for Country:
0


In [12]:
customer_df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [14]:
# Choosing the final important columns, considering the type of unique values and favouring fewer unique values and fewer null values
important_columns=['InvoiceNo', 'Description', 'Quantity', 'InvoiceDate','UnitPrice', 'Country']

**4. Data Cleaning**

Now we can create a clean experience set with the important features.

In this step, we treat missing values or "?" values through imputation, if required.

If any important feature contains a string holding a Python literal, we must evaluate it.

We may also create a new feature if one proves important during the analysis.

We then check the statistics again on the clean data.
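The cleaning steps listed above can be sketched on a small toy frame. This is only an illustration under stated assumptions, not the notebook's actual cleaning code: `toy_df`, `clean_transactions`, the `"UNKNOWN"` fill label, and the derived `IsCancelled` and `TotalPrice` columns are all hypothetical names and choices, and the toy values are made up.

```python
import ast

import numpy as np
import pandas as pd

# Toy stand-in for customer_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", 536370, 536371],
    "Description": ["WHITE METAL LANTERN", "Discount", "?", None],
    "Quantity": [6, -1, 24, 12],
    "UnitPrice": [3.39, 27.50, 1.25, 0.85],
})


def clean_transactions(df):
    """Return a cleaned copy of df with two derived columns (hypothetical helper)."""
    out = df.copy()
    # Treat the "?" placeholder as missing, then impute a constant label.
    out["Description"] = out["Description"].replace("?", np.nan).fillna("UNKNOWN")
    # InvoiceNo mixes integers and strings, so cast before testing the prefix.
    out["IsCancelled"] = out["InvoiceNo"].astype(str).str.startswith("C")
    # New feature: monetary value of each line item.
    out["TotalPrice"] = out["Quantity"] * out["UnitPrice"]
    return out


cleaned = clean_transactions(toy_df)

# A string that holds a Python literal can be parsed safely with ast.literal_eval.
parsed = ast.literal_eval("['gift', 'home']")
```

Imputing a constant label keeps every row; dropping rows with a missing description would be an equally reasonable choice, depending on how the column is used later.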

In [15]:
customer_df=customer_df[important_columns]

In [16]:
for column_name in customer_df.columns:
  print(colored(f"Value counts for {column_name}:", attrs=['bold']))
  print(customer_df[column_name].value_counts())

Value counts for InvoiceNo:
573585     1114
581219      749
581492      731
580729      721
558475      705
           ... 
C558095       1
563742        1
563740        1
C552863       1
568372        1
Name: InvoiceNo, Length: 25900, dtype: int64
Value counts for Description:
WHITE HANGING HEART T-LIGHT HOLDER    2369
REGENCY CAKESTAND 3 TIER              2200
JUMBO BAG RED RETROSPOT               2159
PARTY BUNTING                         1727
LUNCH BAG RED RETROSPOT               1638
                                      ... 
MINT DINER CLOCK                         1
?sold as sets?                           1
Incorrect stock entry.                   1
FLOWER SHOP DESIGN MUG                   1
CAKESTAND, 3 TIER, LOVEHEART             1
Name: Description, Length: 4223, dtype: int64
Value counts for Quantity:
 1       148227
 2        81829
 12       61063
 6        40868
 4        38484
          ...  
 1287         1
-5368         1
 267          1
-244   

**5. Data Visualization**

Once we know the important features of our experiences, we can go a step further and find the relationships among them. Here, we take the help of visualization because

**"A picture is worth a thousand words"**
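As one possible illustration of this step, the sketch below aggregates revenue per calendar month and draws it as a bar chart. It assumes matplotlib is available (it is not imported elsewhere in this notebook), and `toy_df`, `monthly_revenue`, and the choice of chart are hypothetical; the toy values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for customer_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26", "2010-12-15 10:00",
        "2011-01-04 09:30", "2011-01-20 14:45",
    ]),
    "Quantity": [6, 2, 10, 4],
    "UnitPrice": [2.5, 4.0, 1.5, 3.0],
})


def monthly_revenue(df):
    """Sum Quantity * UnitPrice per calendar month (hypothetical helper)."""
    revenue = df["Quantity"] * df["UnitPrice"]
    months = df["InvoiceDate"].dt.to_period("M")
    return revenue.groupby(months).sum()


revenue_by_month = monthly_revenue(toy_df)

# One bar per month, labelled with the period as text.
fig, ax = plt.subplots(figsize=(6, 3))
ax.bar([str(month) for month in revenue_by_month.index], revenue_by_month.values)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (sterling)")
ax.set_title("Revenue per month (toy data)")
fig.tight_layout()
```

The same pattern, grouping first and plotting the aggregate, would apply to other pairs of features such as `Country` and `Quantity`.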

**6. Anomaly Detection**

While deriving the **general formula** from the experiences, we should identify the outliers, or **exceptional observations**, for all the important features and set them aside during the analysis.
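A common way to flag exceptional observations is the interquartile-range (IQR) rule, sketched below. This is one possible technique rather than the method this notebook settles on: `iqr_outlier_mask` and the multiplier `k=1.5` are assumptions, and the toy series only borrows the extreme values ±80995 seen in the `describe()` output.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy Quantity values; only the two extremes come from the describe() output.
toy_quantity = pd.Series([1, 2, 3, 4, 6, 10, 12, 80995, -80995])

mask = iqr_outlier_mask(toy_quantity)
typical = toy_quantity[~mask]      # observations kept for the analysis
exceptional = toy_quantity[mask]   # observations set aside
```

Setting the flagged rows aside rather than deleting them keeps them available for separate inspection, for example to check whether a large negative quantity corresponds to a cancellation.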

**7. Conclusion**