# Predicting E-Commerce Revenue

## The Best Group
### Xi Yang, Yixin Sun, Ziyu Fan, Brian Chivers

# Given session-level data, can we predict a user's e-commerce revenue?

### Using data from Google's online merchandise store, we seek to accurately predict a consumer's spending

# Challenges

1) Imbalanced Data
    
98.7% of sessions did not make a transaction

98.6% of users did not make a transaction

2) Lack of continuous variables
   
Most web-traffic data is categorical (device info, location, time, redirect info, etc.)

3) User-level data

A business doesn't market its products to web sessions. How do we aggregate this data to the user level?
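
One possible way to roll sessions up to users is a pandas `groupby`. The sketch below is only an illustration: the table and its column names (`user_id`, `pageviews`, `hits`, `revenue`) are hypothetical stand-ins, not the actual schema of the store data, and summing is just one of several reasonable aggregation choices.

```python
import pandas as pd

# Hypothetical session-level table; column names are illustrative only.
sessions = pd.DataFrame({
    "user_id":   ["a", "a", "b", "c", "c", "c"],
    "pageviews": [3, 10, 1, 2, 4, 25],
    "hits":      [4, 14, 1, 2, 5, 40],
    "revenue":   [0.0, 0.0, 0.0, 0.0, 0.0, 59.9],
})

# One row per user: totals across sessions plus a session count.
users = sessions.groupby("user_id").agg(
    n_sessions=("pageviews", "count"),
    total_pageviews=("pageviews", "sum"),
    total_hits=("hits", "sum"),
    total_revenue=("revenue", "sum"),
)

# Binary label of the kind a classifier stage could be trained on.
users["made_purchase"] = users["total_revenue"] > 0
print(users)
```

Categorical session fields (device, location, source) would need a different rule, such as the most frequent value or the value from the latest session.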

# EDA - Feature Importance and Engineering

# Page Views

<img src="Final_Presentation_Images/pageviews.png">



# Hits

<img src="Final_Presentation_Images/hits.png">



# Next Session

<img src="Final_Presentation_Images/nextsession.png">



# Location: Americas

<img src="Final_Presentation_Images/america.png">


# Source: Googleplex

<img src="Final_Presentation_Images/googleplex.png">


## Modeling Approach

1) Regression Only Models

2) Stacking Models

3) Boosting Models

# Regression Only


### Regression Model


<img src="Final_Presentation_Images/reg_only.png">

# Stacking Models

#### Custom Estimator 1: Stacking Regressor  (the best) 
Step 1: fit a classifier on X_train  
Step 2: append the classifier's prediction to X_train as a __new feature__ -> X_train_new  
Step 3: fit a regressor on __X_train_new__  
Prediction: the regressor's prediction, with all negative values set to zero  

#### Custom Estimator 2: TrustClassfierRegressor 
Step 1: fit a classifier on X_train  
Step 2: fit a regressor on X_train  
Prediction: classifier prediction * regressor prediction

#### Custom Estimator 3: TrustClassfierRegressor_v2
Step 1: fit a classifier on X_train  
Step 2: fit a regressor on the rows of X_train where transaction_revenue > 0  
Prediction: classifier prediction * regressor prediction

In [None]:
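import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import precision_recall_fscore_support

# The estimator predicts revenue, so it uses RegressorMixin.
class StackedRegressor(BaseEstimator, RegressorMixin):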
    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor
        
    def fit(self, X, y):
        class_labels = pd.Series(np.where(y>0,1,0))
        
        self.classifier.fit(X,class_labels)

        pred_class_labels = self.classifier.predict(X)
        pred_class_labels_df = pd.DataFrame(
            pred_class_labels, columns = ['pred_class_label'])
        
        X = X.reset_index(drop=True)
        pred_class_labels_df = pred_class_labels_df.reset_index(drop=True)
        X = X.join(pred_class_labels_df)
        self.regressor.fit(X,y)

        print(self.classifier.__class__.__name__, ",", 
              self.regressor.__class__.__name__)
        # Return self so the estimator follows the scikit-learn fit convention.
        return self
        
    def predict(self, X):
        
        class_predict = self.classifier.predict(X)
        class_predict_df = pd.DataFrame(
             class_predict, columns = ['pred_class_label'])
        X = X.reset_index(drop=True)
        class_predict_df = class_predict_df.reset_index(drop=True)
        X = X.join(class_predict_df)
        regressor_predict = self.regressor.predict(X)
        regressor_predict = np.where(regressor_predict<0,0,regressor_predict)
        
        return regressor_predict
    
    def score(self, X, y):
        # Returns RMSE, so lower is better (unlike scikit-learn's default score).
        return np.sqrt(np.mean((y - self.predict(X))**2))
    
    def clf_score(self, X, y):
        y_true = pd.Series(np.where(y>0,1,0))
        y_pred = self.classifier.predict(X)
        return precision_recall_fscore_support(y_true, y_pred, 
                                               average='macro')
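
Custom Estimator 3 described above can be sketched in the same style. This is a minimal illustration, not the group's implementation: the class name `TwoStageRegressor`, the synthetic data, and the choice of `LogisticRegression` and `LinearRegression` are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import LinearRegression, LogisticRegression


class TwoStageRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical sketch: predicted class label times regressor output."""

    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        buyers = y > 0
        # Step 1: the classifier learns transaction vs. no transaction on all rows.
        self.classifier_ = clone(self.classifier).fit(X, buyers.astype(int))
        # Step 2: the regressor only sees rows with positive revenue.
        self.regressor_ = clone(self.regressor).fit(X[buyers], y[buyers])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # A predicted label of 0 zeroes out the revenue estimate.
        return self.classifier_.predict(X) * self.regressor_.predict(X)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
# Synthetic target: revenue is positive only when the first feature is large.
y_demo = np.where(X_demo[:, 0] > 1.0, 5.0 + X_demo[:, 1], 0.0)

model = TwoStageRegressor(LogisticRegression(), LinearRegression())
model.fit(X_demo, y_demo)
pred = model.predict(X_demo)
```

Fitting the regressor on all rows instead of `X[buyers]` would give the Custom Estimator 2 variant.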

# Model Selection and Parameter Search

A default classification model
<img src="Final_Presentation_Images/clf_only.png">

#### Search Space:
- __stacking estimator__: 1 vs. 2 vs. 3
- __classification algorithm__:
    - logistic regression
        - penalty score
    - SVC
        - kernels
    - random forest classification
        - n_estimators
- __regression algorithm__:
    - linear regression
    - random forest regressor
- __resampling training set__:
    - no resampling    
    - majority class downsampled to match the minority class (1:1)  
    - minority class upsampled to match the majority class (1:1)  
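
The two resampling options in the list above could be implemented roughly as follows. This is a sketch under assumptions: `resample_to_balance` is a hypothetical helper, the toy arrays are made up, and in practice the resampling would be applied to the training split only.

```python
import numpy as np


def resample_to_balance(X, y, mode="down", seed=0):
    """Return X, y with positive and zero-revenue rows in a 1:1 ratio.

    Hypothetical helper: mode="down" shrinks the majority class,
    mode="up" repeats minority rows with replacement.
    """
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    pos = np.flatnonzero(y > 0)
    neg = np.flatnonzero(y <= 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if mode == "down":
        majority = rng.choice(majority, size=len(minority), replace=False)
    elif mode == "up":
        minority = rng.choice(minority, size=len(majority), replace=True)
    else:
        raise ValueError("mode must be 'down' or 'up'")
    # Shuffle so the two classes are interleaved.
    idx = rng.permutation(np.concatenate([minority, majority]))
    return X[idx], y[idx]


# Toy data: 95 sessions without revenue, 5 with revenue.
y_toy = np.array([0.0] * 95 + [12.5] * 5)
X_toy = np.arange(100).reshape(-1, 1)

X_down, y_down = resample_to_balance(X_toy, y_toy, mode="down")
X_up, y_up = resample_to_balance(X_toy, y_toy, mode="up")
```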
    

In [None]:
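from sklearn.ensemble import BaggingClassifier, RandomForestRegressor
from sklearn.svm import SVC

# classifier
# (scikit-learn >= 1.2 names this argument `estimator`; older releases used `base_estimator`)
best_classifier = BaggingClassifier(
    estimator=SVC(tol=0.01, kernel='poly', verbose=False),
    bootstrap=True, bootstrap_features=False, max_features=1.0,
    max_samples=0.01, n_estimators=100)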
# regressor
best_regressor = RandomForestRegressor(
    n_estimators = 100, 
    min_samples_leaf = 15
)

# Results

#### Some Good Stacking Models

<img src="Final_Presentation_Images/stacking.png">

__Best Parameters:__   
- No resampling  
- Stacking Regressor with   
    - BaggingClassifier (of SVC)  
    - RandomForestRegressor


#### Feature Importance: classification label works!
<img src="Final_Presentation_Images/feature_importance_stacking.png">

### All Models
<img src="Final_Presentation_Images/all_models.png">

### Boosting

Iteratively trains weak learners

Each iteration focuses on the errors left by the previous iterations
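
The idea can be illustrated with a bare-bones boosting loop for squared error, where each depth-1 tree is fit to the residuals of the current ensemble. This is a teaching sketch on synthetic data with an assumed learning rate and round count; it is not how XGBoost or LightGBM are implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem, used only for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1                       # assumed value
prediction = np.full_like(y, y.mean())    # start from a constant model
rmse_history = [np.sqrt(np.mean((y - prediction) ** 2))]
stumps = []

for _ in range(50):
    residual = y - prediction                    # errors left by earlier rounds
    stump = DecisionTreeRegressor(max_depth=1)   # weak learner
    stump.fit(X, residual)
    prediction += learning_rate * stump.predict(X)
    stumps.append(stump)
    rmse_history.append(np.sqrt(np.mean((y - prediction) ** 2)))

print(f"training RMSE: {rmse_history[0]:.3f} -> {rmse_history[-1]:.3f}")
```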


XGBoost: Level-wise growth
<img src="Final_Presentation_Images/level-wise.png">

LightGBM: Leaf-wise growth

<img src="Final_Presentation_Images/leaf-wise.png">
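
The difference between the two growth strategies can be shown with a toy example. The split gains below are invented, and the two functions are simplified stand-ins for the real libraries' tree builders: with the same budget of splits, level-wise growth finishes each depth before moving on, while leaf-wise growth always splits the leaf with the largest gain.

```python
import heapq

# Hypothetical loss reduction from splitting each node of a binary tree.
# Nodes use heap numbering: children of node i are 2*i + 1 and 2*i + 2.
gain = {0: 10.0, 1: 2.0, 2: 8.0, 3: 0.5, 4: 0.4, 5: 6.0, 6: 1.0}


def level_wise(gain, n_splits):
    """Split nodes in breadth-first order, one depth at a time."""
    return sorted(gain)[:n_splits]


def leaf_wise(gain, n_splits):
    """Always split the current leaf with the largest gain."""
    order, frontier = [], [(-gain[0], 0)]
    while frontier and len(order) < n_splits:
        _, node = heapq.heappop(frontier)
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            if child in gain:
                heapq.heappush(frontier, (-gain[child], child))
    return order


level_order = level_wise(gain, 4)   # [0, 1, 2, 3]
leaf_order = leaf_wise(gain, 4)     # [0, 2, 5, 1]
print(sum(gain[n] for n in level_order))   # 20.5
print(sum(gain[n] for n in leaf_order))    # 26.0
```

In this toy case the leaf-wise order captures more total gain for the same number of splits, at the cost of a deeper, less balanced tree.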

## Summary

Best Model = Most Complicated Model

At chance: RMSE=2.105

Best Model: RMSE=1.6391

39.4% reduction in mean squared error (22.1% reduction in RMSE)
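
The reduction figure follows directly from the two RMSE values reported above, as this short check shows (variable names are illustrative):

```python
baseline_rmse = 2.105   # "at chance" RMSE from the summary
best_rmse = 1.6391      # best model RMSE from the summary

# Squared error scales with RMSE squared.
mse_reduction = 1 - (best_rmse / baseline_rmse) ** 2
rmse_reduction = 1 - best_rmse / baseline_rmse

print(f"MSE reduction:  {mse_reduction:.1%}")    # 39.4%
print(f"RMSE reduction: {rmse_reduction:.1%}")   # 22.1%
```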