# Everything Plus E-commerce Platform: Project Decomposition

by Jonathan Chan

### <u>Table of Contents</u><a id='back'></a>
* [1. Project Overview](#overview)
* [2. Preliminary Hypotheses](#preliminaryhypotheses)    
* [3. Deliverables](#deliverables)
    * [3.1. Data Preprocessing](#dp)
    * [3.2. Exploratory Data Analysis](#eda)
    * [3.3. Features Engineering - Hypotheses Exploration](#fe)
    * [3.4. User Clustering and Segmentation Strategies (ML)](#ml)
* [4. Resources](#resources)    

# 1. Project Overview <a id='overview' > <a/>

- **About our Client:**
<br><u>*Everything Plus: Plus a Little Bit More!*</u> is an e-commerce platform specializing in household goods of all kinds.


- **Problem Statement:**
<br>The client needs to identify distinct customer profiles from their web channel &ndash; a <u>Customer Segmentation Problem</u>.  


- **Primary Objective:**
<br>To run personalized offers for distinct customer profiles.


- **Success Indicator(s):**
<br>Increase in conversion rates &ndash; Running personalized offers should improve the user experience, which in turn should strengthen brand salience and retention while also driving word-of-mouth traffic. <u>Target Conversion Rates: 5 &ndash; 7%</u>


- **Time Period:**
<br>Approximately a year.


- **Gap Analysis:**
<br>Qualitative research has been done, but no quantitative analysis of the data has been carried out. The client holds no personal data on its customers; only transactional data is available.


- **Key End User:** 
<br>Our analysis will be used by the client's <u>product manager</u>.


- **Key Decisions:** 
<br> Unique personalized offerings will be made for each distinct customer profile we can identify.
<br>

[Back to Contents](#back)

# 2. Preliminary Hypotheses <a id='preliminaryhypotheses' > <a/>

1. **Seasonality factors** 
- Seasonality affects sales, with sales highest towards the end of the year. 

2. **Day of the week transactions** 
- Customers are more likely to make purchases on weekends than on weekdays. 

3. **Time elapsed between transactions** 
- Customers with shorter time deltas between transactions are more likely to exhibit higher repeat purchase behavior, indicating higher engagement/loyalty (and vice versa).

4. **Purchase frequency (first order month) and average purchase frequency (subsequent months) are likely dependent on each other.** 
- Customers who transact frequently during their first-order month are likely to be frequent purchasers in the future.

5. **Purchase Frequency has a higher impact on LTV compared to Average Purchase Value and Average Basket Size** 
- Customers who transact frequently are likely to generate higher Lifetime Value (LTV) and, subsequently, better loyalty and retention rates, irrespective of the invoice value or quantity of any individual purchase.

6. **Customer segments differ in terms of their average basket size (average quantity) and frequency of purchase** 
- For example, one segment may consist of bulk buyers who purchase in larger quantities but at low frequency, while another may consist of shoppers who purchase smaller quantities but more often. 


[Back to Contents](#back)

# 3. Deliverables <a id='deliverables' > <a/>

**<u>Transaction Data Requested from client:</u>** 

1. InvoiceNo - Unique order identifiers
2. StockCode - Unique item identifiers
3. Description - Item name
4. Quantity - Item quantities transacted
5. InvoiceDate - Order date
6. UnitPrice - Price per item
7. CustomerID - Unique customer identifiers


## 3.1. Data Preprocessing <a id='dp'> <a/>

Estimated delivery timeframe for Data Preprocessing: 1 day(s)

**<u>Key Check-points:</u>**
- Readability fixes (column names to snake_case)
- Check appropriate feature data types (especially datetime columns)
- Check for complete duplicates
- Check for missing values
- Investigate the statistical summary of all feature columns - Look for potential outliers and illogical values
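The check-points above could be sketched in pandas roughly as follows. This is a minimal sketch, not the final pipeline: the sample rows are made up for illustration, the `to_snake_case` helper is a hypothetical name, and only the column names come from the data request above.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a CamelCase column name such as 'InvoiceNo' to 'invoice_no'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# Made-up sample rows; the real data would come from the client's export.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536366"],
    "StockCode": ["85123A", "71053", "22633", "22633"],
    "Description": ["T-LIGHT HOLDER", None, "HAND WARMER", "HAND WARMER"],
    "Quantity": [6, 6, -2, -2],
    "InvoiceDate": ["2018-11-29 08:26:00"] * 2 + ["2018-11-29 08:28:00"] * 2,
    "UnitPrice": [2.55, 3.39, 1.85, 1.85],
    "CustomerID": [17850.0, None, 13047.0, 13047.0],
})

# Readability fix: snake_case column names.
df = raw.rename(columns=to_snake_case)

# Data type fix: parse the order date into a datetime column.
df["invoice_date"] = pd.to_datetime(df["invoice_date"])

# Complete duplicates and missing values.
n_duplicates = int(df.duplicated().sum())
missing_per_column = df.isna().sum()

# Statistical summary: negative quantities or prices show up in the min row.
summary = df[["quantity", "unit_price"]].describe()

# Drop complete duplicates once they have been reviewed.
df = df.drop_duplicates().reset_index(drop=True)
```

Whether illogical values such as negative quantities should be dropped or kept (they may be returns) is a decision for the preprocessing stage itself; the sketch only surfaces them.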

[Back to Contents](#back)

## 3.2. Exploratory Data Analysis <a id='eda'> <a/>

Estimated delivery timeframe for EDA: 1 day(s)

**Multivariate Feature Analysis** - Visualizing distribution of aggregated sales data across time
- Check data distribution across monthly periods.
- Plot a histogram and boxplot of PURCHASE VALUE for each unique invoice. Are there invoices with abnormally low/high values?
- Plot a boxplot of BASKET SIZE for each unique invoice. Are there invoices with abnormally low/high volumes?
- Plot a bar chart of TOTAL ORDER VOLUME by month and REVENUE by month. Which months perform better than others?
- Plot a bar chart of AVERAGE CUSTOMER REVENUE by month. Which months have higher-value customers?
- Plot a bar chart of AVERAGE BASKET SIZE by month. Are there months where customers buy above-average quantities?
- We can draw some intermediate conclusions on the Seasonality Factors hypothesis here.
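The aggregates behind these plots could be prepared along the following lines. This is a sketch on made-up data that assumes the snake_case column names from preprocessing; it also assumes that "total order volume" means the number of invoices, which is one possible reading.

```python
import pandas as pd

# Made-up, already-preprocessed sample.
df = pd.DataFrame({
    "invoice_no": ["A1", "A1", "A2", "A3", "A3"],
    "customer_id": [1, 1, 2, 1, 1],
    "invoice_date": pd.to_datetime(
        ["2019-01-05", "2019-01-05", "2019-01-20", "2019-02-10", "2019-02-10"]
    ),
    "quantity": [2, 1, 10, 3, 4],
    "unit_price": [5.0, 10.0, 1.0, 2.0, 1.0],
})
df["line_revenue"] = df["quantity"] * df["unit_price"]
df["month"] = df["invoice_date"].dt.strftime("%Y-%m")

# One row per invoice: purchase value and basket size.
invoices = df.groupby("invoice_no").agg(
    month=("month", "first"),
    customer_id=("customer_id", "first"),
    purchase_value=("line_revenue", "sum"),
    basket_size=("quantity", "sum"),
)

# One row per month: order volume, revenue and averages.
monthly = invoices.groupby("month").agg(
    order_volume=("purchase_value", "count"),
    revenue=("purchase_value", "sum"),
    n_customers=("customer_id", "nunique"),
    avg_basket_size=("basket_size", "mean"),
)
monthly["avg_customer_revenue"] = monthly["revenue"] / monthly["n_customers"]
```

From here, each column of `monthly` can be passed to a bar chart, and the `purchase_value` and `basket_size` columns of `invoices` to a histogram or boxplot.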

[Back to Contents](#back)

## 3.3. Features Engineering - Hypotheses Exploration <a id='fe'> <a/>

Estimated delivery timeframe for Features Engineering: 3 day(s)

1. **Study purchase frequency by days of the week:**
- Extract day of the week and plot frequency of invoices across the 7 days.
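A possible way to count invoices per day of the week is sketched below. The sample invoices are made up, and `weekend_share` is an illustrative summary rather than part of the plan.

```python
import pandas as pd

# Made-up sample: one row per invoice.
invoices = pd.DataFrame({
    "invoice_no": ["A1", "A2", "A3", "A4"],
    "invoice_date": pd.to_datetime(
        ["2019-01-05", "2019-01-06", "2019-01-07", "2019-01-12"]
    ),
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

invoices["day_of_week"] = invoices["invoice_date"].dt.day_name()

# Unique invoices per weekday, in calendar order, with empty days shown as 0.
invoices_per_day = (
    invoices.groupby("day_of_week")["invoice_no"]
    .nunique()
    .reindex(day_order, fill_value=0)
)

# Share of invoices placed on a weekend.
weekend_share = invoices_per_day[["Saturday", "Sunday"]].sum() / invoices_per_day.sum()
```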

2. **Investigate time elapsed between transactions and total purchase frequency:** 
- Group dataframe by customer id, invoice date and invoice number
- Create a column of previous invoice dates for each customer_id group 
- Calculate a time delta column as the difference between the invoice date column and the previous invoice date column
- Convert the time delta column to a timedelta type and calculate the mean per customer
- Fill any missing values with 0, as they likely belong to customers with no prior invoice date, i.e. customers who have purchased only once
- Merge the values into our main DataFrame and visualize them as a histogram
- Study the relationship between the time delta between transactions and total purchase frequency for individual customers, using a joint plot
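The time-delta steps above might look like this in pandas. It is a sketch on made-up data; the column names `prev_invoice_date`, `days_since_prev` and `mean_days_between` are illustrative choices, not fixed by the plan.

```python
import pandas as pd

# Made-up line items; customer 1 has two rows on invoice A1.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 3, 3],
    "invoice_no": ["A1", "A1", "A2", "A3", "B1", "C1", "C2"],
    "invoice_date": pd.to_datetime([
        "2019-01-01", "2019-01-01", "2019-01-11", "2019-01-31",
        "2019-02-01", "2019-03-01", "2019-03-05",
    ]),
})

# One row per customer and invoice, in chronological order.
invoices = (
    df.drop_duplicates(subset=["customer_id", "invoice_no"])
    .sort_values(["customer_id", "invoice_date"])
    .reset_index(drop=True)
)

# Previous invoice date within each customer, then the gap in days.
invoices["prev_invoice_date"] = invoices.groupby("customer_id")["invoice_date"].shift(1)
invoices["days_since_prev"] = (
    invoices["invoice_date"] - invoices["prev_invoice_date"]
).dt.days

# Mean gap and total purchase frequency per customer.
per_customer = invoices.groupby("customer_id").agg(
    purchase_frequency=("invoice_no", "nunique"),
    mean_days_between=("days_since_prev", "mean"),
)

# One-time buyers have no previous invoice, so their mean gap is missing.
per_customer["mean_days_between"] = per_customer["mean_days_between"].fillna(0)
```

Keeping `purchase_frequency` next to `mean_days_between` lets one-time buyers (gap filled with 0) be told apart from genuinely frequent buyers when the two are plotted against each other.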

3. **Purchase frequency (first order month) and average purchase frequency of subsequent months are likely dependent on each other - Customers' purchasing frequency over their lifetime is likely to resemble that of their first order month:**
- Create a column of first month purchase frequency
- Group customers into cohorts based on their purchase frequency in the first month; Assign categorical identifiers.
- Calculate the average purchase frequency for the remaining months for each customer within the cohort; Assign categorical identifiers.
- Create a contingency table that shows the frequencies of observed occurrences for each distinct combination of 'purchase frequency in the first month' and the 'average purchase frequency in the remaining months'.
- Perform a Chi-squared test of independence on our contingency table. No normality test is needed, as the test operates on categorical counts.
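The contingency table and test could be sketched as below, using `pandas.crosstab` and `scipy.stats.chi2_contingency`. The per-customer numbers, the bin edges and the cohort labels are all assumptions made up for illustration; the real cut-offs would be chosen after inspecting the data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up per-customer frequencies.
customers = pd.DataFrame({
    "first_month_freq": [1, 1, 1, 1, 2, 3, 4, 5, 1, 1, 3, 4],
    "avg_later_freq": [0.0, 0.5, 0.2, 0.0, 1.5, 2.0, 2.5, 3.0, 0.1, 1.2, 1.8, 0.3],
})

# Assign categorical identifiers (hypothetical bin edges and labels).
customers["first_month_cohort"] = pd.cut(
    customers["first_month_freq"],
    bins=[0, 1, float("inf")],
    labels=["single", "repeat"],
)
customers["later_cohort"] = pd.cut(
    customers["avg_later_freq"],
    bins=[-float("inf"), 1, float("inf")],
    labels=["low", "high"],
)

# Observed counts for each combination of the two categorical features.
contingency = pd.crosstab(customers["first_month_cohort"], customers["later_cohort"])

# Null hypothesis: the two categorical features are independent.
chi2, p_value, dof, expected = chi2_contingency(contingency)
```

A small `p_value` would suggest that first-month and later purchase frequency are dependent. The toy sample here is far too small for real inference (its expected counts fall below the usual rule of thumb of 5 per cell); it only shows the mechanics.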

4. **Purchase frequency, average invoice value, average basket size & LTV - Purchase Frequency has a higher impact on LTV compared to Average Invoice Value and Average Basket Size**
- Create cohorts of unique customer ids.
- Calculate n_buyers, n_quantities, total revenue, total purchase instances and lifetime age values by each customer id.
- Calculate LTV and average purchase frequency for each cohort.
- Calculate average purchase value for each cohort.
- Calculate average basket size for each cohort. 
- Create a correlation matrix and scatterplot to visualize the relationship.
- Assign categorical identifiers to each metric for each customer for segmentation.
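A rough sketch of the per-customer metrics and their correlation matrix follows. The plan does not fix an LTV formula, so the sketch assumes the simplest one, total revenue per customer over the observation period; the sample invoices are made up, and lifetime age is left out for brevity.

```python
import pandas as pd

# Made-up sample: one row per invoice.
invoices = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3, 3, 4, 4, 4, 4],
    "invoice_no": ["A1", "A2", "A3", "B1", "C1", "C2", "D1", "D2", "D3", "D4"],
    "purchase_value": [20.0, 30.0, 25.0, 200.0, 15.0, 10.0, 40.0, 35.0, 50.0, 45.0],
    "basket_size": [4, 6, 5, 50, 3, 2, 8, 7, 10, 9],
})

per_customer = invoices.groupby("customer_id").agg(
    purchase_frequency=("invoice_no", "nunique"),
    total_revenue=("purchase_value", "sum"),
    total_quantity=("basket_size", "sum"),
)
per_customer["avg_purchase_value"] = (
    per_customer["total_revenue"] / per_customer["purchase_frequency"]
)
per_customer["avg_basket_size"] = (
    per_customer["total_quantity"] / per_customer["purchase_frequency"]
)

# Assumption: LTV is taken as total revenue per customer.
per_customer["ltv"] = per_customer["total_revenue"]

# Pairwise Pearson correlations between the candidate drivers and LTV.
corr = per_customer[
    ["purchase_frequency", "avg_purchase_value", "avg_basket_size", "ltv"]
].corr()
```

Comparing the `ltv` row of `corr` across the three drivers gives a first read on the hypothesis; with a different LTV definition only the `ltv` line would need to change.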

[Back to Contents](#back)

## 3.4. User Clustering and Segmentation Strategies (ML) <a id='ml' > <a/>

Estimated delivery timeframe for Machine Learning: 5 day(s)

**RFM modelling**
- Create the RFM table
- Recency: Calculate the time delta between each customer's last purchase date and a reference date (possibly the last observed transaction date on record)
- Frequency: Calculate the total number of transactions made by each unique customer for the time period
- Monetary: Calculate the total monetary value of transactions made by each unique customer over the same time period
- Assign a scoring system for the RFM features, either manually or using quantiles.
- Visualize our groups in terms of their behaviours and answer questions on who different segments of customers are. For example, who are the biggest spenders, who are likely to churn, who are lost customers we should not be expending budget on?
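One way the RFM table and a quantile-based score could be built is sketched below. The sample invoices are made up, `quantile_score` is a hypothetical helper, and two bins are used only because the toy sample has four customers; real data would more likely use four or five.

```python
import pandas as pd

# Made-up sample: one row per invoice.
invoices = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3, 3, 3, 3, 4, 4],
    "invoice_no": ["A1", "A2", "A3", "B1", "C1", "C2", "C3", "C4", "D1", "D2"],
    "invoice_date": pd.to_datetime([
        "2019-01-10", "2019-02-10", "2019-03-01", "2019-01-15",
        "2019-02-01", "2019-02-20", "2019-03-10", "2019-03-30",
        "2019-01-05", "2019-01-20",
    ]),
    "purchase_value": [50.0, 40.0, 70.0, 30.0, 100.0, 50.0, 150.0, 200.0, 20.0, 25.0],
})

# Reference date: the last observed transaction date on record.
reference_date = invoices["invoice_date"].max()

rfm = invoices.groupby("customer_id").agg(
    recency=("invoice_date", lambda d: (reference_date - d.max()).days),
    frequency=("invoice_no", "nunique"),
    monetary=("purchase_value", "sum"),
)


def quantile_score(values: pd.Series, n_bins: int, higher_is_better: bool = True) -> pd.Series:
    """Score values from 1 (worst) to n_bins (best) using quantile bins."""
    labels = list(range(1, n_bins + 1))
    if not higher_is_better:
        labels = labels[::-1]
    # Ranking first avoids duplicate bin edges when many values are tied.
    ranked = values.rank(method="first")
    return pd.qcut(ranked, q=n_bins, labels=labels).astype(int)


rfm["r_score"] = quantile_score(rfm["recency"], n_bins=2, higher_is_better=False)
rfm["f_score"] = quantile_score(rfm["frequency"], n_bins=2)
rfm["m_score"] = quantile_score(rfm["monetary"], n_bins=2)
rfm["rfm_score"] = (
    rfm["r_score"].astype(str) + rfm["f_score"].astype(str) + rfm["m_score"].astype(str)
)
```

A low recency (a recent purchase) gets the higher score, so the best customers end up with the highest digit in every position.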

**K-Means Modelling**
- Run hierarchical clustering to get an estimation for K-clusters
- Plot silhouette and inertia scores on a dual-axis chart at various values of K to cross-check our results from hierarchical clustering
- Train our model 
- Assign model labels to our main report dataframe and aggregate each feature
- Visualize each feature with each segment and their corresponding distribution
- Provide recommendations
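The clustering steps could be sketched with SciPy and scikit-learn as follows. The features are synthetic (three made-up, well-separated groups), scaling with `StandardScaler` is an added assumption, and picking K by the highest silhouette score is just one simple rule; the plots themselves are omitted.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic RFM-like features: three made-up groups of 30 customers each.
rng = np.random.default_rng(0)
centers = np.array([[10, 20, 1000], [100, 2, 50], [40, 8, 300]])
features = pd.DataFrame(
    np.vstack([c + rng.normal(scale=[2, 1, 20], size=(30, 3)) for c in centers]),
    columns=["recency", "frequency", "monetary"],
)

# K-Means is distance-based, so the features are put on a common scale first.
X = StandardScaler().fit_transform(features)

# Ward linkage matrix; plotting it as a dendrogram gives a first estimate of K.
Z = linkage(X, method="ward")

# Inertia and silhouette scores at various K for the cross-check.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "inertia": model.inertia_,
        "silhouette": silhouette_score(X, model.labels_),
    }
scores = pd.DataFrame(scores).T

# Train the final model and aggregate each feature by segment.
best_k = int(scores["silhouette"].idxmax())
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
features["segment"] = final_model.labels_
segment_profile = features.groupby("segment").mean()
```

`segment_profile` holds the mean of each feature per segment, which is the table the visualizations and recommendations would be built on.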

[Back to Contents](#back)

# 4. Resources <a id='resources' > <a/>

1. On various general customer segmentation strategies
- https://www.qualtrics.com/au/experience-management/brand/customer-segmentation/?rid=ip&prevsite=en&newsite=au&geo=MY&geomatch=au
- https://amplitude.com/blog/customer-segmentation 
- https://amplitude.com/blog/user-behavior

2. Using data science for behavioural segmentation 
- https://towardsdatascience.com/setting-your-business-up-for-success-with-behaviour-segmentation-74cf675ef18b
- https://towardsdatascience.com/from-data-to-market-strategy-using-behavior-segmentation-d065da224262

3. On the intricacies of Chi-Squared analysis and contingency tables
- https://towardsdatascience.com/contingency-tables-chi-squared-and-cramers-v-ada4f93ec3fd

4. Studying the multitude of feature selection methods available for differing kinds of scenarios
- https://www.kaggle.com/code/prashant111/comprehensive-guide-on-feature-selection

5. On the intricacies of RFM Modelling
- https://aainabajaj39.medium.com/rfm-analysis-for-successful-customer-segmentation-using-python-6291decceb4b
- https://www.geeksforgeeks.org/rfm-analysis-analysis-using-python/
- https://towardsdatascience.com/an-rfm-customer-segmentation-with-python-cf7be647733d

6. On K-Means Clustering and dimensionality reduction
- https://365datascience.com/tutorials/python-tutorials/pca-k-means/

[Back to Contents](#back)