# Using Machine Learning to predict revenue

It is finally time to model our dataset and predict the revenue generated per customer. The goal of this lesson is to learn how a target variable can be predicted using Machine Learning and how to evaluate the resulting model.

As a recap, we have completed the following processes so far:

1. Data Optimization
2. Data Cleaning
3. Data Analysis
4. Data Pre-processing for Machine Learning

Now, we will conclude this course by performing the following three processes in this lesson:

5. Data Splitting
6. Machine Learning
7. Model Evaluation

Before we start this lesson, please make sure to install the `lightgbm` library which contains helper functions to build and train a tree-based Machine Learning algorithm called Light Gradient Boosting Machine. 

LightGBM is a free and open-source distributed gradient boosting framework for Machine Learning, originally developed by Microsoft. You can install it by uncommenting and running the following command:

In [1]:
# !pip install lightgbm

Now, let us start by importing the necessary libraries:

In [2]:
import pandas as pd
import lightgbm as lgb
import datetime

Next, let us import the CSV file called `preprocessed_gstore_data.csv`, which contains pre-processed information about each user's website visit along with the revenue they generated for Google.

In [3]:
# Reading in the CSV file as a DataFrame
store_df = pd.read_csv('data/preprocessed_gstore_data.csv', low_memory=False)

In [4]:
# Looking at the first five rows
store_df.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,visitId,visitNumber,visitStartTime,device.browser,device.operatingSystem,device.isMobile,device.deviceCategory,...,totals.transactionRevenue,trafficSource.campaign,trafficSource.source,trafficSource.medium,trafficSource.keyword,trafficSource.isTrueDirect,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adwordsClickInfo.isVideoAd
0,4,2016-09-02,1131660440785968503,1472830385,1,1472830385,11,16,0,0,...,0.0,0,149,5,11,0,0,0,1,1
1,4,2016-09-02,377306020877927890,1472880147,1,1472880147,16,7,0,0,...,0.0,0,149,5,11,0,0,0,1,1
2,4,2016-09-02,3895546263509774583,1472865386,1,1472865386,11,16,0,0,...,0.0,0,149,5,11,0,0,0,1,1
3,4,2016-09-02,4763447161404445595,1472881213,1,1472881213,46,6,0,0,...,0.0,0,149,5,1098,0,0,0,1,1
4,4,2016-09-02,27294437909732085,1472822600,2,1472822600,11,1,1,1,...,0.0,0,149,5,11,1,0,0,1,1


In [5]:
# Printing the shape
store_df.shape

(903653, 31)

First of all, let us split the dataset in a roughly 70:30 ratio. About 70% of the rows will be used for training our LightGBM model and the remaining 30% will be used for evaluating it.

To do this, we use the visits dated before 2017/04/01 as the training dataset and the visits dated on or after 2017/04/01 as the evaluation dataset.

In [6]:
train_df = store_df[pd.to_datetime(store_df['date']).dt.date < datetime.date(2017,4,1)]
eval_df = store_df[pd.to_datetime(store_df['date']).dt.date >= datetime.date(2017,4,1)]

In [7]:
# Printing the shape of the training DataFrame
train_df.shape

(633210, 31)

In [8]:
# Printing the shape of the evaluation DataFrame
eval_df.shape

(270443, 31)
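
The same date-based split can also be written with a single boolean mask, which parses the `date` column only once and makes the training share easy to check. The following is a minimal sketch on a tiny made-up DataFrame (`demo_df` and its values are illustrative and not taken from the course data). With the shapes printed above, the same calculation gives 633210 / 903653, or about 0.70.

```python
import pandas as pd

# Hypothetical miniature dataset standing in for store_df
demo_df = pd.DataFrame({
    "date": ["2017-03-30", "2017-03-31", "2017-04-01", "2017-04-02"],
    "totals.transactionRevenue": [0.0, 1.5, 0.0, 2.0],
})

# Parse the dates once and build one mask for the training period
cutoff = pd.Timestamp("2017-04-01")
is_train = pd.to_datetime(demo_df["date"]) < cutoff

# Rows before the cutoff go to training, all remaining rows to evaluation
demo_train = demo_df[is_train]
demo_eval = demo_df[~is_train]

# Share of rows that ended up in the training set
train_share = len(demo_train) / len(demo_df)
print(f"train rows: {len(demo_train)}, eval rows: {len(demo_eval)}, train share: {train_share:.2f}")
```

Because the evaluation rows are defined as the complement of the training mask, every row lands in exactly one of the two sets.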

Next, let us get the target variable (y) and the features (X) from the split DataFrames. Note that we will be removing some columns, such as identifiers and timestamps, since they cannot be used for training the model.

In [9]:
# Getting the target (y) from the split DataFrames
train_y = train_df["totals.transactionRevenue"].astype(float).values
eval_y = eval_df["totals.transactionRevenue"].astype(float).values

# Getting the features (X) from the split DataFrames
train_X = train_df.drop(['date', 'fullVisitorId', 'visitId', 'visitStartTime', 'totals.transactionRevenue'], axis=1)
eval_X = eval_df.drop(['date', 'fullVisitorId', 'visitId', 'visitStartTime', 'totals.transactionRevenue'], axis=1)

Next, let us create a custom function to train the LightGBM model.

In [10]:
def train_lightgbm(train_X, train_y, eval_X, eval_y):
    
    # Initializing the training dataset
    lgtrain = lgb.Dataset(train_X, label=train_y)
    
    # Initializing the evaluation dataset
    lgeval = lgb.Dataset(eval_X, label=eval_y)
    
    # Hyper-parameters for the LightGBM model
    params = {
        "objective" : "regression",
        "metric" : "rmse", 
        "num_leaves" : 30,
        "min_child_samples" : 100,
        "learning_rate" : 0.1,
        # Note: bagging only takes effect when "bagging_freq" is also set to a non-zero value
        "bagging_fraction" : 0.7,
        "feature_fraction" : 0.5,
        "bagging_seed" : 2018,
        "verbosity" : -1
    }
    
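    # Training the LightGBM model. Early stopping and logging are passed as callbacks,
    # since the early_stopping_rounds and verbose_eval arguments were removed in LightGBM 4.0
    model = lgb.train(
        params,
        lgtrain,
        num_boost_round=1000,
        valid_sets=[lgeval],
        callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=100)],
    )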
    
    # Returning the model
    return model

# Training the model 
model = train_lightgbm(train_X, train_y, eval_X, eval_y)

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 1.77938
[200]	valid_0's rmse: 1.77289
[300]	valid_0's rmse: 1.77419
Early stopping, best iteration is:
[228]	valid_0's rmse: 1.77265


We've successfully trained our LightGBM model.
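
The `rmse` values in the training log are root mean squared errors measured on the evaluation set, on the same scale as the target variable. As a minimal sketch of what that metric computes, the hypothetical helper `rmse()` below reproduces the calculation on a few made-up numbers. In the notebook, calling it with `eval_y` and `model.predict(eval_X)` should give a value close to the one reported for the best iteration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally sized sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values for illustration only
demo_actual = [0.0, 0.0, 2.0]
demo_predicted = [0.0, 1.0, 0.0]
demo_rmse = rmse(demo_actual, demo_predicted)
print(f"demo RMSE: {demo_rmse:.4f}")
```

Because the errors are squared before averaging, a few large misses on revenue-generating visits can dominate the score even when most visits generate no revenue.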

Now, let us quickly evaluate the model by making an actual prediction with it. For this, let us select one row of features from our evaluation dataset along with the actual revenue for that row.

In [11]:
# Row position to test (for example, 0 or 8612)
index_val = 8612

# Selecting the row at that position from the evaluation features
actual_X_value = eval_X.reset_index(drop=True).iloc[index_val]

# Selecting the revenue from the target variable array
actual_y_value = eval_y[index_val]

In [12]:
# Printing the feature values
actual_X_value

channelGrouping                                   2
visitNumber                                       4
device.browser                                   11
device.operatingSystem                            7
device.isMobile                                   0
device.deviceCategory                             0
geoNetwork.continent                              2
geoNetwork.subContinent                          12
geoNetwork.country                              212
geoNetwork.region                               220
geoNetwork.metro                                 59
geoNetwork.city                                 393
geoNetwork.networkDomain                          0
totals.hits                                      29
totals.pageviews                                 24
totals.bounces                                    0
totals.newVisits                                  0
trafficSource.campaign                            0
trafficSource.source                              0
trafficSourc

In [13]:
# Printing the actual revenue
actual_y_value

17.84014772563146

Now, let us check whether our model's prediction is close to the actual generated revenue.

In [15]:
# Predicting the value
model.predict(actual_X_value.astype(float).to_frame().T)

array([10.27687219])

As a reminder, you can convert these revenue values back to their original scale by using the `expm1()` function (for example, NumPy's `np.expm1()`), which is the inverse of `log1p()`.
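
As a minimal sketch of that conversion, the snippet below applies `np.expm1()` to the two values printed above. It assumes the target was transformed with `log1p()` earlier in the course, which the reminder implies but this lesson does not show. The variable names are illustrative.

```python
import numpy as np

# Values copied from the outputs above (transformed scale)
actual_log_revenue = 17.84014772563146
predicted_log_revenue = 10.27687219

# expm1(x) computes exp(x) - 1, which undoes log1p(x) = log(1 + x)
actual_revenue = np.expm1(actual_log_revenue)
predicted_revenue = np.expm1(predicted_log_revenue)

print(f"actual revenue (original scale):    {actual_revenue:,.0f}")
print(f"predicted revenue (original scale): {predicted_revenue:,.0f}")
```

A gap that looks moderate on the transformed scale becomes a difference of several orders of magnitude on the original scale, which is worth keeping in mind when reading the conclusions.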

We can conclude the following from this small evaluation:

1. The model has been trained and, at least for this visit, correctly indicates that the customer generates revenue, since its prediction is well above zero.

2. The model is not able to accurately predict the revenue amount: it predicts about 10.28 against an actual value of about 17.84 on the transformed scale.

This is quite good, considering we built this model on a dataset with 30+ columns in about an hour. Now, it is your time to shine!

As an exercise, I encourage you to revisit all of the steps we have completed so far in the course and find different ways to improve the model.

Some things that you can do to increase model accuracy are as follows:

- Do not drop any of the columns and start with the unoptimized dataset. Then, individually go through all of the columns and only drop columns that are not helpful to the model.

- Engineer new features from the dataset based on the available data fields.

- Change the LightGBM model's hyper-parameters.

- Use another Machine Learning model or create an ensemble of Machine Learning algorithms for getting better results.

- Use K-Fold Cross Validation instead of simple data splitting for model evaluation.

- ... and much more. Research!
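
To make the K-Fold Cross Validation suggestion more concrete, here is a minimal sketch written with NumPy only. The helper names (`k_fold_rmse`, `least_squares_fit_predict`) and the synthetic data are assumptions made for illustration, and the least-squares model is only a stand-in: in practice you would pass a function that trains a LightGBM model on the training fold and predicts on the held-out fold.

```python
import numpy as np

def k_fold_rmse(X, y, fit_predict, n_splits=5, seed=2018):
    """Return one RMSE per fold for a fit_predict(train_X, train_y, test_X) callable."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions and cut them into n_splits roughly equal folds
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the current one is used for training
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        preds = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(float(np.sqrt(np.mean((preds - y[test_idx]) ** 2))))
    return scores

def least_squares_fit_predict(train_X, train_y, test_X):
    # Stand-in model (ordinary least squares), used only to keep the sketch self-contained
    coef, *_ = np.linalg.lstsq(train_X, train_y, rcond=None)
    return test_X @ coef

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

fold_scores = k_fold_rmse(X_demo, y_demo, least_squares_fit_predict)
print("RMSE per fold:", [round(s, 3) for s in fold_scores])
print("Mean RMSE:", round(float(np.mean(fold_scores)), 3))
```

Averaging the score over several folds gives a more stable estimate than a single split. Note that this sketch shuffles the rows, whereas the split used in this lesson respects the order of the dates, so a time-aware variant may be more appropriate for this dataset.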