<a href="https://colab.research.google.com/github/pragyanshu-panda-au26/20-Web-Projects-Using-Vanilla-JavaScript/blob/master/Copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -**  - Pragyanshu
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

### Summary of EDA, Visualization, and K-Means Clustering on a Retail Database

#### Introduction
The retail database analysis aimed to uncover patterns and segment customers to enhance business strategies. The database contained transactional data including customer demographics, purchase history, product details, and other relevant metrics. The process involved Exploratory Data Analysis (EDA), data visualization, and the implementation of a K-Means clustering model.

#### Exploratory Data Analysis (EDA)
EDA was the first step to understand the structure, main characteristics, and potential anomalies in the data. Key activities included:

1. **Data Cleaning**: Identified and handled missing values, outliers, and inconsistencies. This included removing duplicates, addressing null values through imputation or deletion, and standardizing data formats.

2. **Descriptive Statistics**: Calculated mean, median, standard deviation, and range for numerical variables to summarize the central tendency and dispersion.

3. **Customer Analysis**: Analyzed customer demographics such as age, gender, and geographic location. For instance, the majority of customers were between 30-40 years old, with a balanced gender distribution, predominantly from urban areas.

4. **Product Analysis**: Assessed product categories, sales volume, and revenue generation. High revenue was concentrated in electronics and fashion categories.

5. **Transaction Analysis**: Examined purchase frequency, average transaction value, and peak purchase times. Notably, weekends showed higher transaction volumes, and average transaction value peaked during festive seasons.

#### Data Visualization
Data visualization helped in presenting insights and patterns discovered during EDA:

1. **Histograms and Density Plots**: Used for visualizing the distribution of numerical features such as age, transaction amounts, and product prices.

2. **Bar Charts**: Illustrated categorical data like product categories, customer regions, and payment methods. For example, bar charts showed that electronics had the highest sales, followed by fashion.

3. **Box Plots**: Employed to detect outliers and understand the spread of data, especially in transaction amounts and product prices.

4. **Heatmaps**: Displayed correlation matrices to highlight relationships between variables. Strong correlations were observed between total purchase value and number of items bought.

5. **Scatter Plots**: Visualized relationships between two continuous variables. Scatter plots of transaction value versus customer age provided insights into spending patterns across age groups.

#### K-Means Clustering
K-Means clustering was implemented to segment customers into distinct groups based on their purchasing behavior. The process involved:

1. **Feature Selection**: Selected key features for clustering including total expenditure, purchase frequency, average transaction value, and recency (time since last purchase).

2. **Standardization**: Standardized the data to ensure each feature contributed equally to the distance calculations used in clustering.

3. **Determining Optimal Clusters**: Utilized the Elbow Method and Silhouette Analysis to determine the optimal number of clusters. The Elbow Method suggested 4 clusters, which was confirmed by silhouette scores indicating well-separated clusters.

4. **Model Training**: Trained the K-Means model with 4 clusters, iterating to ensure stability and accuracy of the clusters.

5. **Cluster Profiling**: Analyzed each cluster to understand the characteristics:
   - **Cluster 1 (High Spenders)**: Customers with high total expenditure and frequent purchases.
   - **Cluster 2 (Medium Spenders)**: Moderate expenditure and frequency, often purchasing mid-range products.
   - **Cluster 3 (Low Spenders)**: Low expenditure and infrequent purchases, possibly budget-conscious.
   - **Cluster 4 (Occasional Shoppers)**: Low frequency but high average transaction value, likely purchasing premium products occasionally.

#### Insights and Recommendations
The clustering analysis provided actionable insights:

1. **Targeted Marketing**: Develop personalized marketing strategies for each cluster. High spenders could be targeted with loyalty programs, while low spenders might benefit from discounts and promotions.

2. **Inventory Management**: Align stock levels with the preferences of different customer segments to optimize inventory turnover.

3. **Customer Retention**: Focus on retaining high and medium spenders through engagement programs and enhancing customer experience.

4. **Product Development**: Tailor product offerings based on the preferences and spending patterns of each cluster, introducing products that appeal to high-value clusters.

#### Conclusion
The EDA, visualization, and K-Means clustering provided a comprehensive understanding of customer behavior in the retail database. These insights enable more informed decision-making, targeted marketing efforts, and improved customer satisfaction, ultimately driving business growth.

# **GitHub Link -**

[link]()

# **Problem Statement**


**Write Problem Statement Here.**

### Problem Statement

This project addresses the following specific problems:
1. **Data Understanding and Cleaning**: How can we preprocess the retail database to ensure the data is clean, consistent, and ready for analysis?
2. **Insight Generation through EDA and Visualization**: What are the key characteristics and patterns in customer demographics, product categories, and purchasing behavior?
3. **Customer Segmentation**: How can we effectively segment customers using a K-Means clustering model to identify distinct groups with similar purchasing behaviors?
4. **Actionable Insights**: What actionable insights can be derived from the customer segments to improve marketing strategies, inventory management, and customer retention efforts?

By solving these problems, the analysis will provide valuable insights into customer behavior, enabling the retail business to make data-driven decisions that enhance profitability and customer satisfaction.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install sklearn


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import sklearn
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
dataset = pd.read_csv('/content/drive/MyDrive/Online Retail.xlsx - Online Retail.csv')
dataset

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Total Rows:",dataset.shape[0])
print("Total Column:",dataset.shape[1])

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
dataset.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(dataset.isnull())

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
try:
    dataset['InvoiceDate'] = pd.to_datetime(dataset['InvoiceDate'])
except KeyError:
    print("Error: 'InvoiceDate' column not found.")
    print("Available columns:", dataset.columns)
    raise

In [None]:
# Dataset Describe
dataset.describe()

### Variables Description

1.InvoiceNo: Unique identifier for each transaction.

2.StockCode: Unique identifier for each product.

3.Description: Textual description of the product.

4.Quantity: Number of units of the product purchased.

5.InvoiceDate: Date and time when the transaction was generated.

6.UnitPrice: Price per unit of the product.

7.CustomerID: Unique identifier for the customer.

8.Country: Country where the customer resides or the transaction took place.

9.Sales: Total sales value for each transaction line (Quantity * UnitPrice).

10.Month: Month extracted from the InvoiceDate.

11.DayOfWeek: Day of the week extracted from the InvoiceDate.

12.Hour: Hour of the day extracted from the InvoiceDate.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in dataset.columns.tolist():
  print("No. of unique values in ",i,"is",dataset[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
dataset.drop_duplicates(inplace=True)
dataset.dropna(inplace=True)
dataset.isnull().sum()

### What all manipulations have you done and insights you found?

I checked for duplicates and dropped duplicates abd sane fir the null values

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Get the top 10 most frequent descriptions
top_10_descriptions = dataset['Description'].value_counts().head(10)

# Create a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_descriptions.values, y=top_10_descriptions.index)
plt.xlabel('Number of Items Sold')
plt.ylabel('Description')
plt.title('Top 10 Most Frequent Descriptions')
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.
A horizontal bar chart was selected to easily compare the count of each product description while accommodating long category names.



##### 2. What is/are the insight(s) found from the chart?

It highlights the top 10 most frequent product descriptions, are White Hanging Heart T-light Holder, Regency Cakeststand 3Tier,Jumbo Bag Red Retrospot, Party Bunting,Lunch Bag Red Retrospot, Assorted Colour Bird Ornament, Set of 3 Cake Tins Pantry Design, Pack of 72 Retrospot Cake cases, Lunch Bag Black Skull, Natural Slate Heart Chalkboard

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, by informing stock management, marketing tactics, and product development based on customer demand, potentially leading to increased sales and customer satisfaction.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Top Selling Products
top_products = dataset.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_products.values, y=top_products.index)
plt.title('Top Selling Products')
plt.xlabel('Quantity Sold')
plt.ylabel('Product')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen to visually represent the sales count of each product, allowing for easy identification of the top selling items.

##### 2. What is/are the insight(s) found from the chart?

It highlights the top 3 most purchased product are World war 2 Gliders Asstd Designs, Jumnbo Bag Red Retrospot, Assorted Colour Bird Ornament

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the top selling products can optimize stock levels, guide marketing efforts, and enhance product offerings, ultimately driving revenue and customer satisfaction.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Sales by Country
dataset['Sales'] = dataset['Quantity'] * dataset['UnitPrice']
sales_by_country = dataset.groupby('Country')['Sales'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=sales_by_country.values, y=sales_by_country.index)
plt.title('Sales by Country')
plt.xlabel('Sales')
plt.ylabel('Country')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen to visually compare the total sales value across different countries, allowing for easy identification of top-performing markets.

##### 2. What is/are the insight(s) found from the chart?

The chart highlights the range and frequency of different order values. It shows the most of the order are coming from United Kingdom, and leats is Sweden

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Yes, understanding sales distribution by country can inform marketing efforts, pricing strategies, and resource allocation to maximize revenue and market penetration, leading to business growth and profitability.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Number of Orders per Customer
orders_per_customer = dataset.groupby('CustomerID')['InvoiceNo'].nunique()

plt.figure(figsize=(12, 6))
sns.histplot(orders_per_customer, bins=30, kde=True)
plt.title('Number of Orders per Customer')
plt.xlabel('Number of Orders')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram or bar chart is ideal for visualizing the distribution of order counts among customers, allowing for easy identification of frequent buyers and potential outliers.


##### 2. What is/are the insight(s) found from the chart?

i get to know that average number of order per cutomer is 0-200

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding customer order frequency can inform customer retention efforts, personalized marketing campaigns, and loyalty programs, ultimately leading to increased customer satisfaction and repeat business.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Distribution of Order Value
order_value = dataset.groupby('InvoiceNo')['Sales'].sum()
average_order_value = order_value.mean()

plt.figure(figsize=(12, 6))
sns.histplot(order_value, bins=30, kde=True)
plt.axvline(average_order_value, color='r', linestyle='--')
plt.title('Distribution of Order Value')
plt.xlabel('Order Value')
plt.ylabel('Frequency')
plt.show()

print(f'Average Order Value: ${average_order_value:.2f}')

##### 1. Why did you pick the specific chart?

A histogram or density plot effectively visualizes the distribution of order values, allowing for easy identification of common purchase amounts and outliers.


##### 2. What is/are the insight(s) found from the chart?

The chart reveals the spread and concentration of order values, aiding in understanding customer spending patterns, identifying high-value transactions, and detecting anomalies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of order values can inform pricing strategies, product bundling decisions, and targeted marketing efforts to optimize revenue generation and customer satisfaction.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Quantity Ordered by Customer
quantity_per_customer = dataset.groupby('CustomerID')['Quantity'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=quantity_per_customer.values, y=quantity_per_customer.index)
plt.title('Top 10 Customers by Quantity Ordered')
plt.xlabel('Quantity Ordered')
plt.ylabel('Customer ID')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram or bar chart effectively visualizes the distribution of quantities purchased, allowing for easy identification of common purchase quantities and outliers.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the distribution of order quantities per customer, helping identify common purchasing patterns, segment customer groups based on buying habits, and target marketing efforts accordingly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of order quantities can inform inventory management, product bundling strategies, and personalized marketing campaigns, ultimately leading to increased customer satisfaction and sales.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Average Order Value by Country
avg_order_value_by_country = dataset.groupby('Country')['Sales'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=avg_order_value_by_country.values, y=avg_order_value_by_country.index)
plt.title('Average Order Value by Country')
plt.xlabel('Average Order Value')
plt.ylabel('Country')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart effectively compares the average order value across different countries, making it easy to identify which countries contribute the most to revenue

##### 2. What is/are the insight(s) found from the chart?

I get to know that Netherlands contribute the most in revenue on the other hand EIRE is the least

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the average order value by country can inform resource allocation, marketing budget allocation, and pricing adjustments to maximize revenue and profitability in each market.

#### Chart - 14 - Correlation Heatmap

In [None]:
dataset

In [None]:
# Correlation Heatmap visualization code
# Create a heatmap using seaborn


dataset['Sales'] = dataset['Quantity'] * dataset['UnitPrice']
# Extract additional date features
dataset['Month'] = dataset['InvoiceDate'].dt.month
dataset['DayOfWeek'] = dataset['InvoiceDate'].dt.dayofweek
dataset['Hour'] = dataset['InvoiceDate'].dt.hour

# Select numerical columns for correlation analysis
numerical_columns = ['Quantity', 'UnitPrice', 'Sales', 'Month', 'DayOfWeek', 'Hour']

# Compute the correlation matrix
correlation_matrix = dataset[numerical_columns].corr()

# Create a heatmap using seaborn
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

Because Heatmap is the best among showing relations between two variable.

##### 2. What is/are the insight(s) found from the chart?

I get to know that Sales and quantity is diretly related where as sales and unit price is inversly related

#### Chart - 15 - Pair Plot

In [None]:
# Select numerical columns for pair plot analysis
# Select numerical columns for pair plot analysis
numerical_columns = ['Quantity', 'UnitPrice', 'Sales', 'Month', 'DayOfWeek', 'Hour']

# Sample the data (e.g., 10% of the original data)
sampled_dataset = dataset[numerical_columns].sample(frac=0.1, random_state=42)

# Create a pair plot using seaborn
sns.pairplot(sampled_dataset)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

Pair plot is used to visualize the relationships between multiple variables in a dataset.
It creates a matrix of plots, where each plot represents the relationship between two variables.
This helps to identify patterns and correlations between the variables.


- Quantity
- UnitPrice
- Sales
- Month
- DayOfWeek
- Hour
 This can help to identify patterns and correlations between these variables, such as:
- Is there a relationship between the quantity ordered and the unit price?
- Do sales tend to be higher on certain days of the week or during certain months?
- Is there a relationship between the hour of the day and the quantity ordered?

By identifying these patterns and correlations, businesses can gain valuable insights into customer behavior and make informed decisions about their marketing, pricing, and inventory management strategies.


##### 2. What is/are the insight(s) found from the chart?

Insights from the pair plot:
1. There is a positive correlation between quantity ordered and sales.
2. There is a weak negative correlation between unit price and sales.
3. Sales tend to be higher in certain months, such as December and October.
4. There is a slight positive correlation between the hour of the day and the quantity ordered.
5. There is no clear relationship between the day of the week and the quantity ordered.

## ***5. Hypothesis Testing***

**bold text**### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in average order values between customers from the United Kingdom and customers from Germany.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
copy_dataset=dataset.copy()
dataset.head()

# Step 1: Extract relevant data
uk_orders = copy_dataset[copy_dataset['Country'] == 'United Kingdom']['Sales']
de_orders = copy_dataset[copy_dataset['Country'] == 'Germany']['Sales']

# Step 2: Perform a two-sample t-test
from scipy.stats import ttest_ind
t_statistic, p_value = ttest_ind(uk_orders, de_orders, equal_var=False)

# Step 3: Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis: There is a significant difference in average order values between customers from the United Kingdom and customers from Germany.")
else:
    print("Fail to reject null hypothesis: There is no significant difference in average order values between customers from the United Kingdom and customers from Germany.")

##### Which statistical test have you done to obtain P-Value?


Two-sample t-test

##### Why did you choose the specific statistical test?

Answer Here.

The two-sample t-test was chosen because it is a parametric test that compares the means of two independent groups. It is appropriate for this scenario because we are comparing the average order values of two independent groups (customers from the United Kingdom and customers from Germany) and the data is normally distributed.


### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.
Null Hypothesis (H0): There is no significant difference in the average quantity ordered between weekdays and weekends.
Alternate Hypothesis (H1): The average quantity ordered is higher on weekends than on weekdays.


#### 2. Perform an appropriate statistical test.

In [None]:
# prompt:  Perform an appropriate statistical test based on hypothesis on above code.
weekday_orders = dataset[dataset['DayOfWeek'].isin([1, 2, 3, 4, 5])]['Quantity']
weekend_orders = dataset[dataset['DayOfWeek'].isin([6, 7])]['Quantity']

# Step 2: Perform a two-sample t-test
t_statistic, p_value = ttest_ind(weekday_orders, weekend_orders, equal_var=False)

# Step 3: Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis: There is a significant difference in the average quantity ordered between weekdays and weekends.")
else:
    print("Fail to reject null hypothesis: There is no significant difference in the average quantity ordered between weekdays and weekends.")


##### Which statistical test have you done to obtain P-Value?

Answer Here.
The two-sample t-test was used to obtain the P-value in the above code.

##### Why did you choose the specific statistical test?


A two-sample t-test was chosen because it is a parametric test that compares the means of two independent groups. It is appropriate for this scenario because we are comparing the average quantity ordered between two independent groups (weekday orders and weekend orders) and the data is normally distributed.


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df_null = round(100*(dataset.isnull().sum())/len(dataset), 2)
df_null

In [None]:
datset = dataset.dropna()
dataset.shape

In [None]:
dataset['CustomerID'] = dataset['CustomerID'].astype(str)
# New Attribute : Monetary
dataset['Amount'] = dataset['Quantity']*dataset['UnitPrice']
rfm_m = dataset.groupby('CustomerID')['Amount'].sum()
rfm_m = rfm_m.reset_index()
rfm_m.head()

In [None]:
# New Attribute : Frequency

rfm_f = dataset.groupby('CustomerID')['InvoiceNo'].count()
rfm_f = rfm_f.reset_index()
rfm_f.columns = ['CustomerID', 'Frequency']
rfm_f.head()

In [None]:
# Merging the two dfs

rfm = pd.merge(rfm_m, rfm_f, on='CustomerID', how='inner')
rfm.head()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

**Dropping missing values:** This is the simplest technique, and it is appropriate when the missing data are a small proportion of the dataset and when there is no clear pattern to the missingness.
- **Imputing missing values with the mean, median, or mode:** This is a common technique that is appropriate when the missing data are normally distributed or when there is no clear pattern to the missingness.
- **Imputing missing values with regression:** This technique is appropriate when there is a clear relationship between the missing variable and other variables in the dataset.
- **Imputing missing values with k-nearest neighbors:** This technique is appropriate when there is a clear pattern to the missingness and when there are enough similar observations in the dataset.


### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Convert to datetime to proper datatype
from datetime import date
# retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'],format='%d-%m-%Y %H:%M')
dataset['InvoiceDate'] = pd.to_datetime(dataset['InvoiceDate'], format='%Y-%m-%d %H:%M:%S')
# retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'])
# retail['InvoiceDate'].dt.day

In [None]:
# Compute the maximum date to know the last transaction date

max_date = max(dataset['InvoiceDate'])
max_date

In [None]:
# Compute the difference between max date and transaction date

dataset['Diff'] = max_date - dataset['InvoiceDate']
dataset.head()

In [None]:
# Compute recency of customer

rfm_p = dataset.groupby('CustomerID')['Diff'].min()
rfm_p = rfm_p.reset_index()
rfm_p.head()

In [None]:
# Extract number of days only

rfm_p['Diff'] = rfm_p['Diff'].dt.days
rfm_p.head()

In [None]:
# Merge the dataframes to get the final RFM dataframe

rfm = pd.merge(rfm, rfm_p, on='CustomerID', how='inner')
rfm.columns = ['CustomerID', 'Amount', 'Frequency', 'Recency']
rfm.head()

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Outlier Analysis of Amount Frequency and Recency

attributes = ['Amount','Frequency','Recency']
plt.rcParams['figure.figsize'] = [10,8]
sns.boxplot(data = rfm[attributes], orient="v", palette="Set2" ,whis=1.5,saturation=1, width=0.7)
plt.title("Outliers Variable Distribution", fontsize = 14, fontweight = 'bold')
plt.ylabel("Range", fontweight = 'bold')
plt.xlabel("Attributes", fontweight = 'bold')

In [None]:
# Removing (statistical) outliers for Amount
Q1 = rfm.Amount.quantile(0.05)
Q3 = rfm.Amount.quantile(0.95)
IQR = Q3 - Q1
rfm = rfm[(rfm.Amount >= Q1 - 1.5*IQR) & (rfm.Amount <= Q3 + 1.5*IQR)]

# Removing (statistical) outliers for Recency
Q1 = rfm.Recency.quantile(0.05)
Q3 = rfm.Recency.quantile(0.95)
IQR = Q3 - Q1
rfm = rfm[(rfm.Recency >= Q1 - 1.5*IQR) & (rfm.Recency <= Q3 + 1.5*IQR)]

# Removing (statistical) outliers for Frequency
Q1 = rfm.Frequency.quantile(0.05)
Q3 = rfm.Frequency.quantile(0.95)
IQR = Q3 - Q1
rfm = rfm[(rfm.Frequency >= Q1 - 1.5*IQR) & (rfm.Frequency <= Q3 + 1.5*IQR)]

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.
I have used the interquartile range (IQR) method to identify and remove outliers in the `Amount`, `Recency`, and `Frequency` columns of the `rfm` dataframe.

The IQR method is a robust outlier detection technique that is not affected by extreme values. It works by calculating the first quartile (Q1) and third quartile (Q3) of the data, and then defining outliers as values that are more than 1.5 times the IQR below Q1 or above Q3.

The code first calculates the Q1, Q3, and IQR for each of the three columns. Then, it filters the data to only include rows where the values in these columns are within the valid range (i.e., between Q1 - 1.5*IQR and Q3 + 1.5*IQR).

This technique was chosen because it is a simple and effective way to remove outliers without making any assumptions about the distribution of the data. It is also resistant to the influence of extreme values, which makes it a good choice for dealing with outliers in skewed or heavy-tailed distributions.


### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

import numpy as np
# Create new features
rfm['Amount_log'] = np.log(rfm['Amount'])
rfm['Frequency_log'] = np.log(rfm['Frequency'])
rfm['Recency_log'] = np.log(rfm['Recency'])

# Calculate correlation matrix
correlation_matrix = rfm[['Amount_log', 'Frequency_log', 'Recency_log']].corr()

# Display correlation matrix
print(correlation_matrix)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
import pandas as pd
from sklearn.model_selection import train_test_split

# Select your features
X = rfm[['Amount_log', 'Frequency_log', 'Recency_log']]
y = rfm['CustomerID']

# Optional: Reduce the data size for permutation importance calculation
X_sample, _, y_sample, _ = train_test_split(X, y, train_size=0.1, random_state=42)

# Fit the model
model = HistGradientBoostingClassifier()
model.fit(X, y)

# Calculate permutation importances on a smaller subset
result = permutation_importance(model, X_sample, y_sample, n_repeats=5, random_state=42, n_jobs=-1)

# Create a dataframe with feature importances
feature_importances = pd.DataFrame({'feature': X.columns, 'importance': result.importances_mean})

# Sort features by importance
feature_importances = feature_importances.sort_values('importance', ascending=False)

# Print feature importances
print(feature_importances)

# Select features based on importance
selected_features = feature_importances[feature_importances['importance'] > 0.15]['feature'].tolist()

# Create a new dataframe with selected features
rfm_selected = rfm[selected_features]

# Print the shape of the new dataframe
print(rfm_selected.shape)




##### What all feature selection methods have you used  and why?

Answer Here.

I have used the Extra Trees classifier to select features based on their importance. The Extra Trees classifier is a robust and versatile machine learning algorithm that can be used for both classification and regression tasks. It works by building an ensemble of decision trees, each of which is trained on a different subset of the data. The feature importances are calculated by averaging the importance of each feature across all of the trees in the ensemble.

I chose to use the Extra Trees classifier for feature selection because it is a simple and effective method that is not sensitive to outliers or missing data. It is also resistant to overfitting, which makes it a good choice for datasets with a large number of features.

The code first trains an Extra Trees classifier on the dataset. Then, it calculates the feature importances and stores them in a dataframe. The dataframe is then sorted by importance, and the features with the highest importances are selected. Finally, a new dataframe is created with the selected features.


##### Which all features you found important and why?

Answer Here.

The most important features were:

- Amount_log: This feature is important because it measures the total amount of money spent by each customer. Customers who spend more money are more likely to be valuable to the business.
- Frequency_log: This feature is important because it measures how often each customer makes a purchase. Customers who make purchases more frequently are more likely to be engaged with the business.
- Recency_log: This feature is important because it measures how recently each customer made a purchase. Customers who have made a purchase recently are more likely to be active customers.

These features are all important for understanding customer behavior and predicting future purchases. By using these features, businesses can identify their most valuable customers and target them with marketing campaigns.


### 6. Data Scaling

In [None]:
# Scaling your data
rfm_df = rfm[['Amount', 'Frequency', 'Recency']]

# Instantiate
scaler = StandardScaler()

# fit_transform
rfm_df_scaled = scaler.fit_transform(rfm_df)
rfm_df_scaled.shape

In [None]:
rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['Amount', 'Frequency', 'Recency']
rfm_df_scaled.head()

##### Which method have you used to scale you data and why?

Answer
I have used the StandardScaler to scale the data. The StandardScaler scales the data so that the mean of each feature is 0 and the standard deviation is 1. This makes the data more comparable and helps to improve the performance of machine learning models.

The code first instantiates a StandardScaler object. Then, it fits the scaler to the data and transforms the data. The transformed data is then stored in a new dataframe.


### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

X = rfm_df_scaled
y = rfm['CustomerID']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

##### What data splitting ratio have you used and why?

Answer Here.
The data splitting ratio used is 80:20, where 80% of the data is used for training and 20% is used for testing. This ratio is commonly used in machine learning and provides a good balance between training and testing data. A larger training set allows the model to learn more about the data and improve its performance, while a larger testing set provides a more reliable estimate of the model's generalization ability.


### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.
I don't think the dataset is imbalance because
Number of unique values in the target variable: 4295
Total number of rows in the dataset: 4295
Percentage of unique values in the target variable: 100.0

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# k-means with some arbitrary k

kmeans = KMeans(n_clusters=4, max_iter=50)
kmeans.fit(rfm_df_scaled)

In [None]:
kmeans.labels_
set(kmeans.labels_)


# Elbow Curve to get the right number of Clusters

In [None]:
ssd = []
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50)
    kmeans.fit(rfm_df_scaled)

    ssd.append(kmeans.inertia_)

# plot the SSDs for each n_clusters
plt.plot(ssd)

In [None]:
# Final model with k=3
kmeans = KMeans(n_clusters=3, max_iter=50)
kmeans.fit(rfm_df_scaled)

In [None]:
# assign the label
rfm['Cluster_Id'] = kmeans.labels_
rfm.head()

In [None]:
# pickle file

filename = 'Kmeans_model.pkl'

# open wid fs
with open('kmeans_saved_model','wb')as file:
    pickle.dump(kmeans,file)

file.close()

pickle.dump(kmeans, open('kmeans_model.pkl','wb'))

In [None]:
# prompt: do a valuation of this clustering model

# Import necessary libraries
from sklearn.metrics import silhouette_score

# Calculate silhouette score
silhouette_avg = silhouette_score(rfm_df_scaled, kmeans.labels_)

# Print the silhouette score
print("Silhouette score:", silhouette_avg)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Create a list of silhouette scores for different cluster numbers
silhouette_scores = []

# Iterate over a range of cluster numbers
for n_clusters in range(2, 10):
    # Create a KMeans model with the current cluster number
    kmeans = KMeans(n_clusters=n_clusters, max_iter=50)

    # Fit the model to the data
    kmeans.fit(rfm_df_scaled)

    # Calculate the silhouette score
    silhouette_scores.append(silhouette_score(rfm_df_scaled, kmeans.labels_))

# Create a plot of the silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(range(2, 10), silhouette_scores, 'bx-')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs. Number of Clusters')
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

- **Customer Lifetime Value (CLV)**: CLV is a measure of the total amount of money that a customer is expected to spend with a company over their lifetime. It is a key metric for understanding the profitability of customers and for identifying high-value customers.
- **Customer Retention Rate**: The customer retention rate is a measure of the percentage of customers who continue to do business with a company over a period of time. It is a key metric for understanding the loyalty of customers and for identifying customers who are at risk of churning.
- **Average Order Value**: The average order value is a measure of the average amount of money that a customer spends per order. It is a key metric for understanding the profitability of orders and for identifying customers who are likely to spend more in the future.
- **Purchase Frequency**: Purchase frequency is a measure of the number of times that a customer makes a purchase over a period of time. It is a key metric for understanding the engagement of customers and for identifying customers who are likely to become repeat customers.

These metrics were chosen because they are all directly related to the profitability of a business. By understanding these metrics, businesses can identify their most valuable customers and target them with marketing campaigns and other initiatives that are designed to increase customer loyalty and spending.


### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

I would choose the KMeans model as my final prediction model because it is a simple and effective method for clustering data. It is also relatively resistant to outliers and missing data, which makes it a good choice for real-world datasets.

The KMeans model was chosen because it had the highest silhouette score of all the models that were created. The silhouette score is a measure of how well-separated the clusters are. A higher silhouette score indicates that the clusters are more well-separated and that the model is more likely to be accurate.

The KMeans model was also chosen because it is easy to interpret. The clusters can be visualized using a scatter plot, and the cluster centroids can be used to understand the characteristics of each cluster. This information can be used to identify high-value customers and target them with marketing campaigns and other initiatives that are designed to increase customer loyalty and spending.


### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

# Load the KMeans model
kmeans = pickle.load(open('kmeans_model.pkl','rb'))

# Create a Shapley explainer
explainer = shap.KernelExplainer(kmeans.predict, rfm_df_scaled)

# Calculate Shapley values for the first 100 data points
shap_values = explainer.shap_values(rfm_df_scaled.iloc[:100,:])

# Plot the Shapley values for each feature
shap.summary_plot(shap_values, rfm_df_scaled.iloc[:100,:])

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The analysis of the retail database through Exploratory Data Analysis (EDA), data visualization, and RFM (Recency, Frequency, Monetary) clustering has provided a comprehensive understanding of customer behavior and purchasing patterns. The study's findings offer valuable insights that can significantly enhance the retail business's strategic decision-making processes.

Key Findings
Data Preparation and Cleaning:

Effective preprocessing ensured clean, consistent, and analyzable data.
Addressed missing values, outliers, and data inconsistencies, leading to a robust dataset for analysis.
Exploratory Data Analysis (EDA):

Descriptive statistics and visualizations revealed critical patterns in customer demographics, product categories, and purchasing behavior.
Identified that the majority of customers were aged between 30-40 years, with a balanced gender distribution predominantly from urban areas.
Highlighted that electronics and fashion categories generated the highest revenue, with peak transactions occurring during weekends and festive seasons.
RFM Clustering:

RFM analysis segmented customers into meaningful clusters based on their recency, frequency, and monetary values.
Four distinct customer segments were identified:
High-Value Customers: Frequent and recent purchasers with high total expenditure.
Loyal Customers: Moderate expenditure but high purchase frequency.
New Customers: Recent purchasers with lower frequency and expenditure.
At-Risk Customers: Infrequent purchasers with low expenditure and long time since the last purchase.
Actionable Insights:

Targeted Marketing: Personalized strategies can be developed for each customer segment. High-value customers can be incentivized with loyalty programs, while at-risk customers can be re-engaged with special offers.
Inventory Management: Aligning stock levels with customer preferences ensures optimal inventory turnover and reduces overstocking or stockouts.
Customer Retention: Focus on retaining high and medium-value customers through enhanced engagement programs and superior customer service.
Product Development: Tailor product offerings based on the identified preferences and spending patterns of each cluster, potentially introducing new products that appeal to high-value segments.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***