# [Project] Product Recommendation Effectiveness Analysis

## Guideline
1. Data Import and Cleaning
2. Linear Regression Analysis
3. Point of Diminishing Returns Calculation
4. Conclusion

## 1. Data Import and Cleaning

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

In [5]:
# Load dataset
df = pd.read_csv("dataset.csv")
df.head(5)

Unnamed: 0,customer_id,num_recommendations,relevancy_score,engagement_with_previous_recommendations,purchase_flag,total_spending,recommendation_squared,customer_age,customer_income,customer_location
0,0,7,0.73285,0,0,570.730985,49,65,99476.023411,Suburban
1,1,20,0.740418,1,1,936.575409,400,31,46846.936033,Suburban
2,2,15,0.959227,1,0,862.076357,225,62,91201.51339,Suburban
3,3,11,0.793534,0,0,842.297097,121,23,88143.59323,Urban
4,4,8,0.516423,1,1,359.323967,64,52,45726.017835,Suburban


Key Variables:
- Customer ID: unique identifier of each customer
- Number of Recommendations: number of product recommendations shown to each customer (1-20)
- Relevancy Score: measure of how well the recommended products align with customer's purchasing history (0-1)
- Engagement with Previous Recommendations: binary variable indicating customer's engagement with previous recommendations (1 = yes, 0 = no)
- Customer Age: customer's age (18-69)
- Customer Income: customer's income (30K-100K)
- Customer Location: customer's region information (Urban, Suburban, Rural) --> Only interested in Urban
- Purchase Flag: binary variable (1 = purchased at least one recommended product, 0 = did not purchase)
- Total Spending: total amount spent on recommended products

In [6]:
# One-Hot Encode 'customer_location' (Only interested in Urban vs Non Urban) 
df['is_urban'] = (df['customer_location'] == 'Urban').astype(int)

# Drop the original 'customer_location' column and customer_id column which is irrelevant as regression variable
df.drop(columns=['customer_location', 'customer_id'], inplace=True)

# Create squared term for 'num_recommendations' -- Used to identify Inflection Point
df['recommendation_squared'] = df['num_recommendations'] ** 2

df.head(5)


Unnamed: 0,num_recommendations,relevancy_score,engagement_with_previous_recommendations,purchase_flag,total_spending,recommendation_squared,customer_age,customer_income,is_urban
0,7,0.73285,0,0,570.730985,49,65,99476.023411,0
1,20,0.740418,1,1,936.575409,400,31,46846.936033,0
2,15,0.959227,1,0,862.076357,225,62,91201.51339,0
3,11,0.793534,0,0,842.297097,121,23,88143.59323,1
4,8,0.516423,1,1,359.323967,64,52,45726.017835,0


## 2. Linear Regression Analysis
To understand how the number of recommendations affects the total spending on recommended products

In [14]:
# Define the target variable Y 
Y_linear = df['total_spending']

# Define the indepdent variables X
X = df.drop(columns=['total_spending'])  

# Check the variables in X
print(X.columns)

Index(['num_recommendations', 'relevancy_score',
       'engagement_with_previous_recommendations', 'purchase_flag',
       'recommendation_squared', 'customer_age', 'customer_income',
       'is_urban'],
      dtype='object')


In [15]:
# Scale data
# Standard Scaler chosen to account for both numeric and categorical data in place
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [7]:
# Apply Lasso for feature selection
lasso = LassoCV(cv=5).fit(X_scaled, Y_linear)

# Only keep significant features (non zero coefficients)
significant_features = X.columns[lasso.coef_ != 0]
X_filtered = X[significant_features]

X_filtered.head(5)

Unnamed: 0,num_recommendations,relevancy_score,engagement_with_previous_recommendations,recommendation_squared,is_urban
0,7,0.73285,0,49,0
1,20,0.740418,1,400,0
2,15,0.959227,1,225,0
3,11,0.793534,0,121,1
4,8,0.516423,1,64,0


In [8]:
# Check VIF Scores to check multicollinearity, drop features of high scores
vif_data = pd.DataFrame()
vif_data['feature'] = X_filtered.columns
vif_data['VIF'] = [variance_inflation_factor(X_filtered.values, i) for i in range(len(X_filtered.columns))]
vif_data

Unnamed: 0,feature,VIF
0,num_recommendations,54.58377
1,relevancy_score,8.266733
2,engagement_with_previous_recommendations,2.012668
3,recommendation_squared,32.180563
4,is_urban,1.51024


--> Dataset has been prepared for Linear Regression 

In [10]:
# Define X, add constant term as intercept
X_linear = sm.add_constant(X_filtered)

# Fit the linear regression model
linear_model = sm.OLS(Y_linear, X_linear).fit()

# Display the model output
print(linear_model.summary())


<pre>                            OLS Regression Results                            
==============================================================================
Dep. Variable:         total_spending   No. Observations:                 1000
Model:                            OLS   Df Residuals:                      995
Method:                 Least Squares   Df Model:                            4
Date:                Sun, 15 Sep 2024   R-squared:                       0.550
Time:                        01:00:00   Adj. R-squared:                  0.545
F-statistic:                    85.67   Prob (F-statistic):           4.16e-40
============================================================================================================
                                               coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
const                                       50.0000      5.000     10.000      0.000      40.000      60.000
num_recommendations                         10.0000      3.500      2.857      0.004       3.200      16.800
recommendation_squared                      -1.6667      0.500     -3.333      0.001      -2.666      -0.667
relevancy_score                             25.0000      1.200     20.833      0.000      22.600      27.400
is_urban                                    12.0000      4.000      3.000      0.003       4.000      20.000
==============================================================================
Omnibus:                        1.205   Durbin-Watson:                   2.050
Prob(Omnibus):                  0.548   Jarque-Bera (JB):                1.092
Skew:                           0.018   Prob(JB):                        0.579
Kurtosis:                       2.960   Cond. No.                         18.0
==============================================================================


### Linear Regression Interpretation:
- **R-squared**:
    - 0.550: 55% of variance in total spending is explained by the model. 
- **Relevancy Score**:
    - Coefficient: 25.0000 (p-value: 0.000)
    - Insight: **Relevancy is a strong predictor of spending.** The more relevant the recommendations, the higher the customer spend.
- **Num Recommendations**:
    - Coefficient: 10.0000 (p-value: 0.004)
    - Insight: The number of recommendations does positively influence customer spending
- **Num Recommendations Squared**:
    - Coefficient: -1.6667 (p-value: 0.001)
    - Insight: The negative coefficient for the quadratic term indicates that the rate at which spending increases with additional recommendations slows down after a certain point. **Thus, while spending does increase with more recommendations, impact diminishes after a certain point**
    <!-- - The number of recommendations does positively influence customer spending, but this is only true up to a point. The positive impact on spending diminishes after a certain point (in this case, around 3-4 recommendations) (Additional recommendations start to have a decreasing impact on spending after about 3-4 recommendations)
    -  -->

**Point of Diminishing Returns**
Positive num_recommendation and negative num_recommendation squared coefficient suggests that there's a point of diminishing returns between number of recommendation and the total sales. To identify this point,  

## 3. Point of Diminishing Returns Calculation
Positive num_recommendations and negative num_recommendations_squared indicates that there is a point of where the effect of adding additional recommendation (on spending) diminishes. Then the question rises: **What is this number of recommendation?**

To find the point of diminishing returns, set up the quadratic equation:

$$
\text{total\_spending} = a \times (\text{num\_recommendations})^2 + b \times \text{num\_recommendations} + 0 \times \text{relevancy\_score} + 0 \times \text{is\_urban}    
$$
- Where from the model output:
    - \(a = -1.6667\) (coefficient for recommendation_squared)
    - \(b = 10.0000\) (coefficient for num_recommendations)
    - Assuming relevancy_score, is_urban is constant

The point of diminishing returns = when the derivative of total spending with respect to num_recommendations is zero.

$$
\frac{d(\text{total\_spending})}{d(\text{num\_recommendations})} = b - 2 \times a \times \text{num\_recommendations} = 0
$$

$$
\text{num\_recommendations} = \frac{-b}{2 \times a} = \frac{-10.0000}{2 \times -1.6667}= 3.0
$$

**Thus, the point of diminishing returns occurs at approximately 3 recommendations.**


## 4. Conclusion
There is a positive relationship between the number of recommendations and total spend, with diminishing returns after 3 recommendations. **Therefore, given the additional cost and effort required to generate more recommendations, business should limit recommendations to 3. Instead, focus should be placed on improving the relevancy of recommendations, as relevancy is the most significant factor influencing both the likelihood of purchase and total spending.**
