This project details the analysis conducted to identify customer behaviour in an e-commerce dataset. It showcases the use of recency, frequency, and spend (RFS) to understand customer behaviour in the dataset.
Customer segments were also identified using the KMeans clustering algorithm.
Project description: This is a data analysis project that involves reading, cleaning, exploring, and performing advanced analyses on a retail dataset. The project uses the KMeans clustering algorithm to identify customer segments.
This project covers data cleaning, exploratory data analysis (EDA), data visualization, and customer segmentation using machine learning techniques.
I used a dataset named OnlineRetail.csv, which contains transactional data from an online retail store. The primary goal was to explore the dataset, clean it, visualize important patterns, and perform customer segmentation.
After removing the null values from the dataset, I created two additional columns, 'Month' and 'Day of Week', based on the invoice date column. The goal was to see whether the day of the week or the month revealed any further insights. Below are some visualizations showing trends and distributions.
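The date columns described above could be derived roughly as follows. This is a minimal sketch on toy data, assuming the invoice date is stored in a column named `InvoiceDate` (the name used in the recency code further down) and that it parses with `pd.to_datetime`; the sample rows are made up for illustration.

```python
import pandas as pd

# Toy rows standing in for OnlineRetail.csv (illustrative values only).
df = pd.DataFrame({"InvoiceDate": ["2011-01-03 10:00", "2011-11-27 15:30", None]})

# Drop rows with missing values, then parse the invoice date.
df = df.dropna().copy()
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

# Derive the two extra columns from the parsed date.
df["Month"] = df["InvoiceDate"].dt.month
df["Day of Week"] = df["InvoiceDate"].dt.day_name()

print(df)
```

Keeping the month as a number makes it easy to sort the bar plots chronologically, while the day name reads better on an axis.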
The bar plots show that there were no transactions on Saturday. Although the number of transactions was highest in the 11th month (November), the total amount spent was highest in the first month (January).
In addition, these visuals show the top 5 countries and the top stock items purchased.
The Recency, Frequency, Monetary (RFM) model is a behaviour-based analysis technique used to segment customers by examining their transaction history. Recency is calculated as the number of days between a customer's last purchase and the latest invoice date in the dataset, frequency is the number of transactions per customer, and spend is the total amount spent per customer.
last_transaction_date = df.groupby('CustomerID')['InvoiceDate'].max()
reference_date = max(df['InvoiceDate'])
days_difference = (reference_date - last_transaction_date).dt.days
days_difference = days_difference.reset_index().rename(columns={'InvoiceDate': 'recency'})
The recency, frequency, and spend metrics are then merged into a single dataframe.
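For illustration, the three metrics could be computed and combined along these lines. This is only a sketch on toy data: the column names `InvoiceNo`, `Quantity`, and `UnitPrice`, and the derived `Amount` column, are assumptions about the dataset layout and are not taken from the code above.

```python
import pandas as pd

# Toy transactions; InvoiceNo, Quantity and UnitPrice are assumed column names.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-10", "2011-01-05"]
    ),
    "Quantity": [2, 1, 3, 5],
    "UnitPrice": [1.5, 4.0, 2.0, 1.0],
})
df["Amount"] = df["Quantity"] * df["UnitPrice"]  # hypothetical helper column

by_customer = df.groupby("CustomerID")
reference_date = df["InvoiceDate"].max()

# Recency: days between each customer's last purchase and the reference date.
recency = (reference_date - by_customer["InvoiceDate"].max()).dt.days.rename("recency")
# Frequency: number of distinct invoices per customer.
frequency = by_customer["InvoiceNo"].nunique().rename("frequency")
# Spend: total amount per customer.
spend = by_customer["Amount"].sum().rename("spend")

# Combine the three series (all indexed by CustomerID) into one dataframe.
rfs = pd.concat([recency, frequency, spend], axis=1).reset_index()
print(rfs)
```

Counting distinct invoices rather than rows is one possible definition of frequency; counting rows would instead measure the number of line items purchased.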
The boxplots below show the outliers in the RFS data distributions.
The outliers were mostly in the frequency and spend columns and were removed before applying the KMeans algorithm.
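The write-up does not say which rule was used to remove the outliers. One common option, shown here purely as an assumption, is the interquartile range (IQR) rule that boxplot whiskers are based on. The helper `remove_iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd

# Toy RFS table with one obvious outlier in frequency and spend.
rfs = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "recency": [10, 20, 30, 40, 50, 60],
    "frequency": [1, 2, 2, 3, 3, 50],
    "spend": [100.0, 120.0, 130.0, 150.0, 160.0, 9000.0],
})

def remove_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

rfs_clean = remove_iqr_outliers(rfs, ["frequency", "spend"])
print(rfs_clean)
```

Removing extreme values first matters for KMeans because the algorithm minimises squared distances, so a few very large customers can pull centroids away from the bulk of the data.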
After the outliers were removed, the features in X were scaled with StandardScaler:
from sklearn.preprocessing import StandardScaler
X=rfs.iloc[:,1:]
scaler = StandardScaler()
X = scaler.fit_transform(X)
I used the Yellowbrick cluster module to visualize the KMeans distortion score elbow. As shown below, the elbow occurs at k = 4, indicating 4 clusters.
I then fitted KMeans and updated the RFS dataframe with the clusters identified.
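The idea behind the elbow plot can be sketched without Yellowbrick by fitting KMeans for a range of k and recording scikit-learn's `inertia_` (the within-cluster sum of squared distances), which is closely related to the distortion score. The synthetic data from `make_blobs` is an assumption standing in for the scaled RFS features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled RFS features (three columns, four groups).
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=42)
X = StandardScaler().fit_transform(X_raw)

ks = list(range(2, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

# The "elbow" is the k after which the inertia stops dropping sharply.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the same kind of curve as the elbow visualization; the choice of k remains a visual judgment.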
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, n_init='auto', random_state=42)
kmeans.fit(X)
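Updating the RFS dataframe with the cluster labels, and then summarising each cluster, might look like the sketch below. The toy table, the `cluster` column name, and the per-cluster mean summary are assumptions for illustration, not the project's actual output.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy RFS table with four loosely separated customer groups.
rfs = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 7, 8],
    "recency": [5, 6, 100, 110, 30, 32, 200, 210],
    "frequency": [10, 11, 1, 1, 5, 5, 2, 2],
    "spend": [1000.0, 1050.0, 50.0, 60.0, 400.0, 420.0, 100.0, 90.0],
})

# Scale the feature columns (everything except CustomerID), then cluster.
X = StandardScaler().fit_transform(rfs.iloc[:, 1:])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Attach the label of each row's cluster; row order matches X.
rfs["cluster"] = kmeans.labels_

# Average recency, frequency and spend per cluster to help describe each segment.
summary = rfs.groupby("cluster")[["recency", "frequency", "spend"]].mean()
print(summary)
```

The cluster numbers themselves are arbitrary, so segments are best described by their average recency, frequency, and spend rather than by label.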
These are the four clusters identified from the results.
In addition, these are the top items by amount spent.
For more details, see the repository.