- The goal of a retail purchase prediction is to accurately forecast the demand for products and services to better manage inventory, anticipate customer needs, and maximize profits.
- By leveraging data-driven models and predictive analytics, retailers can accurately forecast future sales and make more informed business decisions.
- Perform exploratory data analysis to understand customer buying patterns
- Build a regression model to predict each customer's purchase amount for various products
- Python, Pandas (data processing), Plotly Express, scikit-learn
The dataset has 550,069 rows and 12 columns
Column ID | Column Name | Data type | Description | Masked |
---|---|---|---|---|
0 | User_ID | int64 | Unique Id of customer | False |
1 | Product_ID | object | Unique Id of product | False |
2 | Gender | object | Sex of customer | False |
3 | Age | object | Age of customer | False |
4 | Occupation | int64 | Occupation code of customer | True |
5 | City_Category | object | City of customer | True |
6 | Stay_In_Current_City_Years | object | Number of years of stay in city | False |
7 | Marital_Status | int64 | Marital status of customer | False |
8 | Product_Category_1 | int64 | Category of product | True |
9 | Product_Category_2 | float64 | Category of product | True |
10 | Product_Category_3 | float64 | Category of product | True |
11 | Purchase | int64 | Purchase amount | False |
- There are 5,891 users and 3,631 unique products in the dataset
- 31% of Product_Category_2 values and 69% of Product_Category_3 values are missing.
- The average amount spent is about 8k for female customers and about 9k for male customers
- The 26-35 age group has the highest total purchase amount of all age groups
- Product category 0 has the highest purchase revenue, and product category 4 has the highest number of purchases
- Unmarried customers aged 26-35 have the highest total purchase amount compared with other customers
- Most purchases come from customers who have stayed in their current city for only 1 year
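
The findings above come down to a handful of pandas aggregations. The sketch below shows the kind of calls involved; it runs on a tiny made-up frame that only borrows the dataset's column names, so its numbers are illustrative and are not the results reported above.

```python
import pandas as pd

# Tiny made-up sample (NOT the real dataset); only the column names match.
df = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 4],
    "Product_ID": ["P1", "P2", "P1", "P3", "P2", "P1"],
    "Gender": ["F", "F", "M", "M", "M", "F"],
    "Age": ["26-35", "26-35", "36-45", "26-35", "26-35", "18-25"],
    "Marital_Status": [0, 0, 1, 0, 0, 1],
    "Product_Category_2": [2.0, None, 4.0, None, 6.0, 8.0],
    "Purchase": [8000, 7000, 9500, 9000, 8500, 6000],
})

# Number of distinct users and products
n_users = df["User_ID"].nunique()
n_products = df["Product_ID"].nunique()

# Percentage of missing values per column
missing_pct = df.isna().mean().mul(100).round(1)

# Average purchase amount by gender
avg_by_gender = df.groupby("Gender")["Purchase"].mean()

# Total purchase amount by age group, largest first
total_by_age = df.groupby("Age")["Purchase"].sum().sort_values(ascending=False)

# Total purchase amount by age group and marital status
total_by_age_marital = df.groupby(["Age", "Marital_Status"])["Purchase"].sum()

print(n_users, n_products)
print(missing_pct)
print(avg_by_gender)
print(total_by_age)
print(total_by_age_marital)
```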
- Product_Category_2 and Product_Category_3 have missing values; use SimpleImputer to fill them with the column median.
- Convert the categorical columns Gender, Age, City_Category, and Stay_In_Current_City_Years to numeric form
- Drop the User_ID and Product_ID columns
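
A minimal sketch of these preprocessing steps, run on a few made-up rows, might look like the following. The use of `OrdinalEncoder` is an assumption: the notes above do not say which encoding was applied to the categorical columns, and one-hot encoding would be an equally plausible choice.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# A few made-up rows (NOT the real dataset) with the same 12 columns.
df = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1000003, 1000004],
    "Product_ID": ["P001", "P002", "P003", "P004"],
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "26-35", "55+", "26-35"],
    "Occupation": [10, 16, 7, 1],
    "City_Category": ["A", "C", "B", "A"],
    "Stay_In_Current_City_Years": ["2", "4+", "1", "0"],
    "Marital_Status": [0, 1, 0, 0],
    "Product_Category_1": [3, 1, 8, 5],
    "Product_Category_2": [2.0, None, 4.0, 8.0],
    "Product_Category_3": [None, None, 5.0, 15.0],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Drop identifier columns, which are not used as features
df = df.drop(columns=["User_ID", "Product_ID"])

# Fill missing product categories with each column's median
missing_cols = ["Product_Category_2", "Product_Category_3"]
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(df[missing_cols])
for i, col in enumerate(missing_cols):
    df[col] = imputed[:, i]

# Assumption: encode categorical columns as integer codes
categorical_cols = ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years"]
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(df[categorical_cols])
for i, col in enumerate(categorical_cols):
    df[col] = encoded[:, i].astype(int)

print(df)
```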
- Features (X): Gender, Age, Occupation, City_Category, Stay_In_Current_City_Years, Marital_Status, Product_Category_1, Product_Category_2, Product_Category_3
- Label (y): Purchase
- Train/test split: 75% training set and 25% test set
- Train baseline models with LinearRegression, DecisionTreeRegressor, and RandomForestRegressor, and evaluate them using RMSE and RMSLE. RandomForestRegressor had the lowest RMSE and RMSLE.
- Apply GridSearchCV to find the best hyperparameters for RandomForestRegressor
- Product_Category_1 seems to have the strongest effect on purchase amount
- Surprisingly, Gender has the least effect on purchase amount
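
The modeling workflow above can be sketched roughly as follows. Everything here is illustrative: the data is synthetic (with a target that depends on Product_Category_1 by construction), only a subset of the features is included, and the parameter grid, `random_state` values, and number of CV folds are hypothetical choices, not the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed features (NOT the real dataset).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Gender": rng.integers(0, 2, n),
    "Age": rng.integers(0, 7, n),
    "Occupation": rng.integers(0, 21, n),
    "Product_Category_1": rng.integers(0, 20, n),
})
# Toy target: driven by Product_Category_1 by construction, plus noise.
y = 12000 - 500 * X["Product_Category_1"] + rng.normal(0, 300, n)

# 75% training set, 25% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def rmsle(y_true, y_pred):
    """Root mean squared log error; negative predictions are clipped to 0."""
    y_pred = np.clip(y_pred, 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Baseline models with default hyperparameters
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {"RMSE": rmse(y_test, pred), "RMSLE": rmsle(y_test, pred)}

# Hypothetical grid; the grid actually searched is not stated above.
param_grid = {"n_estimators": [20, 50], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# Feature importances of the tuned forest, largest first
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(pd.DataFrame(scores).T)
print(search.best_params_)
print(importances)
```

Impurity-based `feature_importances_` is one common way to rank features for a random forest; whether the project used it or another method (such as permutation importance) is not stated.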