The project is about customer segmentation in an e-commerce website. Customer segmentation is the process of breaking a customer base into groups of people that are similar in areas that matter in marketing, such as age, gender, interests, and purchasing habits. Companies that use customer segmentation believe that each customer is unique and that their marketing efforts would be better served if they targeted, smaller groups with messaging that would be relevant to those customers and lead to a purchase. Companies also want to obtain a better understanding of their consumers' tastes and demands, with the goal of determining what each category values the most so that marketing materials can be more precisely tailored to that segment
The dataset is provided by The UCI Machine Learning Repository and contains actual e-commerce transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
Dataset has the following columns:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
Here is a snapshot of the data
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
---|---|---|---|---|---|---|---|
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
The data is incorrect format, so it needs some modification to process easily. Here is a snapshot after modifications:
CustomerID | Revenue | Purchases | Quantity | RevenuePercentage |
---|---|---|---|---|
14646.0 | 279489.02 | 2085 | 196719 | 0.028672 |
18102.0 | 256438.49 | 433 | 64122 | 0.026307 |
17450.0 | 187482.17 | 351 | 69029 | 0.019233 |
14911.0 | 132572.62 | 5903 | 77180 | 0.013600 |
12415.0 | 123725.45 | 778 | 77242 | 0.012693 |
To explore more about data, the following questions need to be answered:
- What was the total revenue?
- Which months had the higher sales?
- What products are in the top 10 in sales and revenue?
- Which products were returned more frequently?
- What are the top 10 countries that purchased the most?
- How much is the share of the revenue for each cluster?
- Set up the data with required columns
- Removing the outliers
- Correlation between the selected features
- Split data into train set & test set
- Scaling using minmax scaler
- Determine optimal cluster number with Elbow method
- Fitting KMeans Model
- Plotting the Result
- Determine optimal cluster number with Elbow method
- Fitting Agglomerative Model
- Plotting the result
- The Completeness Score
- Random Score
- The Normalized Mutual Info Score
- Heatmap