## Google Analytics Customer Revenue Prediction

### Predict how much GStore customers will spend


### 1.Business/Real World Problem:
The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.


### 2.Objectives 
#### Objective:
We are challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for companies that choose to use data analysis on top of GA data.



### 3.Data Information
"train_v2.csv" and "test_v2.csv" contain the data necessary to make predictions for each "fullVisitorId" listed in "sample_submission_v2.csv".

### A little about our data
Both train_v2.csv and test_v2.csv contain the columns (features) listed under Data Fields below. Each row in the dataset is one visit to the store. Because we are predicting the log of the total revenue per user, not every row in test_v2.csv corresponds to a row in the submission, but every unique fullVisitorId does.
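
To make the row-versus-user distinction concrete, here is a minimal sketch on made-up data (the visitor IDs and revenue values are invented for illustration). It sums revenue per `fullVisitorId` and applies `np.log1p`, that is ln(1 + x), so that users with zero revenue stay well defined. This mirrors the `log1p` call used later in this notebook, but it is only an illustration, not the official scoring code.

```python
import numpy as np
import pandas as pd

# Toy session-level data: three visits by two users (values are invented).
sessions = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B"],
    "transactionRevenue": [0.0, 20.0, 0.0],
})

# One row per user: sum the revenue over all visits, then take ln(1 + total).
per_user = (
    sessions.groupby("fullVisitorId")["transactionRevenue"]
    .sum()
    .reset_index(name="totalRevenue")
)
per_user["logRevenue"] = np.log1p(per_user["totalRevenue"])
print(per_user)
```

Three visit rows collapse into two user rows, which is why the submission has fewer rows than the test file.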

There are multiple columns which contain JSON blobs of varying depth. In one of those JSON columns, totals, the sub-column transactionRevenue contains the revenue information we are trying to predict. This sub-column exists only for the training data.


####  Where the data comes from

The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

1. Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.

2. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.

3. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

In [None]:
#IMPORTING LIBRARIES needed for this problem 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import warnings 
warnings.simplefilter('ignore')

In [None]:
train_data = pd.read_csv('../input/train.csv')

In [None]:
train_data.shape

In [None]:
train_data.info()

In [None]:
train_data.head()

A few columns contain JSON objects. We will flatten them into regular columns so that we can perform all our operations on the data without much struggle.
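
As a small illustration of what flattening a JSON column means, the sketch below uses an invented two-row frame whose `totals` column holds JSON strings. It assumes pandas 1.0 or newer, where `pd.json_normalize` is available at the top level. The values and the `result` name are made up for the example.

```python
import json
import pandas as pd

# Toy frame with one JSON-string column, shaped like the 'totals' column (values invented).
raw = pd.DataFrame({
    "fullVisitorId": ["A", "B"],
    "totals": ['{"hits": "1", "pageviews": "1"}',
               '{"hits": "4", "transactionRevenue": "1000000"}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
parsed = raw["totals"].apply(json.loads)
flat = pd.json_normalize(parsed.tolist())
flat.columns = ["totals." + name for name in flat.columns]

# Replace the JSON column with its flattened sub-columns.
result = pd.concat([raw.drop(columns="totals"), flat], axis=1)
print(result.columns.tolist())
```

Keys that are absent from a row's JSON (here `transactionRevenue` for user A) become missing values in the flattened table.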

### Features (Data Fields)
Each row in the dataset is one visit to the store. We are predicting the natural log of the sum of all transactions per user.

#### Data Fields

**fullVisitorId** - A unique identifier for each user of the Google Merchandise Store.

**channelGrouping** - The channel via which the user came to the Store.

**date** - The date on which the user visited the Store.

**device** - The specifications for the device used to access the Store.

**geoNetwork** - This section contains information about the geography of the user.

**sessionId** - A unique identifier for this visit to the store.

**socialEngagementType** - Engagement type, either "Socially Engaged" or "Not Socially Engaged".

**totals** - This section contains aggregate values across the session.

**trafficSource** - This section contains information about the Traffic Source from which the session originated.

**visitId** - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
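
A minimal sketch of that combination, on invented IDs: two different users share a `visitId`, so only the pair is unique. The column name `sessionKey` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Toy data: two users who happen to share the same visitId (values invented).
visits = pd.DataFrame({
    "fullVisitorId": ["111", "222", "111"],
    "visitId": [1500000000, 1500000000, 1500003600],
})

# visitId alone collides across users; the pair (fullVisitorId, visitId) does not.
visits["sessionKey"] = visits["fullVisitorId"] + "_" + visits["visitId"].astype(str)
print(visits["visitId"].is_unique, visits["sessionKey"].is_unique)
```

This relies on `fullVisitorId` being read as a string, which is what the `dtype={'fullVisitorId': 'str'}` argument in the loading code ensures.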

**visitNumber** - The session number for this user. If this is the first session, then this is set to 1.

**visitStartTime** - The timestamp (expressed as POSIX time).
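
POSIX time counts seconds since 1970-01-01 UTC. As a hedged sketch (the timestamps are invented and the derived column names are hypothetical), it can be turned into a readable datetime and into features such as hour of day:

```python
import pandas as pd

# Toy POSIX timestamps in seconds (values invented).
visits = pd.DataFrame({"visitStartTime": [1472830385, 1483228800]})

# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC.
visits["visitStart"] = pd.to_datetime(visits["visitStartTime"], unit="s")
visits["hour"] = visits["visitStart"].dt.hour
visits["weekday"] = visits["visitStart"].dt.day_name()
print(visits)
```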

In [None]:
import json
from pandas import json_normalize  # pandas >= 1.0; older versions expose it as pandas.io.json.json_normalize

In [None]:

columns = ['device', 'geoNetwork', 'totals', 'trafficSource'] # Columns that have json format


# Transform the JSON-formatted columns into regular table columns
def json_read(df):
    # The df argument is ignored: the file is re-read from this path so that
    # the JSON columns can be parsed with converters while loading
    data_frame = '../input/train.csv'
    
    #Importing the dataset
    df = pd.read_csv(data_frame, 
                     converters={column: json.loads for column in columns}, # loading the json columns properly
                     dtype={'fullVisitorId': 'str'}, # transforming this column to string
                     nrows = None
                     )
    
    for column in columns: # expand each JSON column into its own set of columns
        # Normalize the parsed JSON dicts into a flat table
        column_as_df = json_normalize(df[column]) 
        # Name each new column "<json column>.<sub-column>"
        column_as_df.columns = ["{column}.{subcolumn}".format(column=column,subcolumn=subcolumn) for subcolumn in column_as_df.columns] 
        # Drop the original JSON column and attach the flattened columns by index
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
        
    return df # the loaded and flattened dataframe

In [None]:
train_df = json_read(train_data)

In [None]:
#let's see the shape after conversion of JSON values into columns
train_df.shape

In [None]:
train_df.head()

### About values in the above data in 'train_df' dataframe

**"(not set)"** means Google Analytics did not receive any information for that field. 
https://support.google.com/analytics/answer/2820717?hl=en

and

**"not available in demo dataset"** appears only in the sample dataset. It means that some of the real data has been removed. https://support.google.com/analytics/answer/7586738?hl=en
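
One possible way to handle these placeholders, shown here as a sketch on invented data, is to convert both strings to missing values and drop columns that end up entirely empty. Whether "(not set)" should be treated as missing or kept as its own category is an analysis choice, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame using the two placeholder strings described above (values invented).
df = pd.DataFrame({
    "geoNetwork.city": ["Mountain View", "(not set)", "not available in demo dataset"],
    "device.browserSize": ["not available in demo dataset"] * 3,
})

# Treat both placeholders as missing values.
placeholders = ["(not set)", "not available in demo dataset"]
cleaned = df.replace(placeholders, np.nan)

# Columns that are now entirely missing carry no information and can be dropped.
cleaned = cleaned.dropna(axis=1, how="all")
print(cleaned)
```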

### ----------------------------------------------------------------------------------------------------------

**Note:** We now apply the same process (converting JSON to columns, as done for the train dataset) to the test dataset.
We keep the test dataset unseen, but every transformation applied to the train dataset must also be applied to the test dataset.

In [None]:
test_df = pd.read_csv("../input/test.csv")

In [None]:
test_df.shape

In [None]:
test_df.info()

In [None]:

columns = ['device', 'geoNetwork', 'totals', 'trafficSource'] # Columns that have json format




# Transform the JSON-formatted columns into regular table columns
def json_read_test(df):
    # The df argument is ignored: the file is re-read from this path so that
    # the JSON columns can be parsed with converters while loading
    data_frame = '../input/test.csv'
    
    #Importing the dataset
    df = pd.read_csv(data_frame, 
                     converters={column: json.loads for column in columns}, # loading the json columns properly
                     dtype={'fullVisitorId': 'str'}, # transforming this column to string
                     nrows = None
                     )
    
    for column in columns: # expand each JSON column into its own set of columns
        # Normalize the parsed JSON dicts into a flat table
        column_as_df = json_normalize(df[column]) 
        # Name each new column "<json column>.<sub-column>"
        column_as_df.columns = ["{column}.{subcolumn}".format(column=column,subcolumn=subcolumn) for subcolumn in column_as_df.columns] 
        # Drop the original JSON column and attach the flattened columns by index
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
        
    return df # the loaded and flattened dataframe

In [None]:
test_df = json_read_test(test_df)

In [None]:
test_df.shape

#### Missing Values percentage in Train Dataset

Let's plot the percentage of missing values for the columns that have any.

The following graph shows only the columns with missing values; all other columns are complete.
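
For reference, the same percentages can be computed without an explicit loop. The sketch below uses an invented three-column frame; `missing_pct` is a hypothetical name for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (invented).
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, "x", "y", "z"],
})

# isna().mean() is the fraction of missing entries per column; scale to percent.
missing_pct = df.isna().mean() * 100

# Keep only columns that have missing values, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```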

In [None]:
missing_values_percentage = {}
for key, value in dict(train_df.isna().sum(axis=0)).items():
    if value == 0:
        continue
    missing_values_percentage[key] = 100 * float(value) / len(train_df)
    
# Sort the columns by missing percentage, largest first
sorted_x = sorted(missing_values_percentage.items(), key=lambda item: item[1], reverse=True)
print("There are " + str(len(missing_values_percentage)) + " columns with missing values")


In [None]:
#Using plotly to plot the missing-values percentage in train dataset

from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go



data = [go.Bar(                            #Bar chart
            x=list(missing_values_percentage.values()),
            y=list(missing_values_percentage.keys()),
            orientation = 'h'
)]

layout = go.Layout(title="Missing Values Percentage",           #Layout for bar-chart
                   xaxis=dict(title="Missing Percentage"), 
                   height=400, margin=dict(l=250, r=200))
figure = go.Figure(data = data , layout = layout)


py.iplot(figure,filename='missing values percentage')

From this plot we can see that some features have a very large number of missing values.

## 4 Data Exploration
### 4.1 Univariate Analysis
The dataset has many columns, and many of them are sub-columns of a single attribute. So we will analyze each attribute together with its sub-columns.


First, let's analyze the target variable, 'totals.transactionRevenue'.

Analysis: how the target is distributed.

#### 4.1.1 totals.transactionRevenue --- Target Variable

In [None]:
train_df['totals.transactionRevenue'].isnull().sum()

This shows that there is a huge number of null values in this column.

In [None]:
train_df['totals.transactionRevenue'] = train_df['totals.transactionRevenue'].astype('float')

In [None]:
type(train_df['totals.transactionRevenue'][0])

In [None]:
# Summary statistics over the sessions that produced revenue
positive_revenue = train_df.loc[train_df['totals.transactionRevenue'] > 0, 'totals.transactionRevenue']

print("Transaction Revenue Min Value: ", positive_revenue.min())
print("Transaction Revenue Mean Value: ", positive_revenue.mean())
print("Transaction Revenue Median Value: ", positive_revenue.median())
print("Transaction Revenue Max Value: ", positive_revenue.max())

# Set the figure size of the plot
plt.figure(figsize=(14,5))

# Plot the revenue of every session in ascending order
plt.scatter(range(train_df.shape[0]), np.sort(train_df['totals.transactionRevenue'].values))
plt.xlabel('Index', fontsize=15)
plt.ylabel('Revenue value', fontsize=15)
plt.title("Revenue Value Distribution", fontsize=20)

plt.show()


In the plot above, the largest transaction revenue values are so large that the rest of the distribution is hard to see. In the plot below, we take the log of the revenue values to get a more readable plot.

Also, since we are predicting the natural log of the sum of all transactions per user, we sum the transaction revenue at user level, take the log, and then draw the scatter plot.

In [None]:
grouped_df = train_df.groupby("fullVisitorId")["totals.transactionRevenue"].sum().reset_index()

plt.figure(figsize=(8,6))
plt.scatter(range(grouped_df.shape[0]), np.sort(np.log1p(grouped_df["totals.transactionRevenue"].values)))
plt.xlabel('index', fontsize=12)
plt.ylabel('TransactionRevenue', fontsize=12)
plt.title("Revenue Value Distribution", fontsize=20) # Setting Title and fontsize
plt.show()

In [None]:
#Filling the NaN values with 0 in the totals.transactionRevenue column because as per the given rules when there is no transaction , the revenue generated will be zero.
train_df['totals.transactionRevenue'] = train_df['totals.transactionRevenue'].fillna(0)

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(train_df['totals.transactionRevenue'])
plt.title("Distribution of Total TransactionRevenue");
plt.xlabel("totals.transactionRevenue");

This looks like a long-tailed distribution (such as a lognormal or power-law distribution): most visits generate no revenue and a few generate a lot. This is consistent with the 80/20 rule mentioned in the competition overview.

The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.
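
One way to check the 80/20 claim directly is to compute the share of total revenue produced by the top 20% of users. The sketch below does this on invented per-user revenue. The numbers are made up and say nothing about the real GStore data, and the 20% cutoff is simply the figure from the rule.

```python
import numpy as np
import pandas as pd

# Toy per-user revenue: most users spend nothing, a few spend a lot (invented).
revenue = pd.Series([0.0] * 14 + [5.0, 5.0, 10.0, 20.0, 60.0, 100.0])

# Sort users from highest to lowest spend and accumulate their share of the total.
ordered = revenue.sort_values(ascending=False).reset_index(drop=True)
cumulative_share = ordered.cumsum() / ordered.sum()

# Share of revenue produced by the top 20% of users.
top_n = int(np.ceil(0.2 * len(ordered)))
top_share = cumulative_share.iloc[top_n - 1]
print(f"Top {top_n} of {len(ordered)} users produce {top_share:.0%} of revenue")
```

Applied to the real data, `revenue` would be the per-user sum of `totals.transactionRevenue`.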


#### 4.1.2 Device Information

In [None]:
device_column = [col for col in train_df.columns if 'device' in col]

In [None]:
# Let's look at the device attributes of the dataset to see which features might be helpful
train_df[device_column].head()


Among all these, we will consider **device.browser, device.isMobile, device.deviceCategory, device.operatingSystem** for analysis, as the values of the remaining features are not available in the given dataset.

In [None]:
#selected specific features from device attributes 
device_cols = ['device.browser' , 'device.isMobile' , 'device.deviceCategory' , 'device.operatingSystem']

In [None]:
from plotly.offline import iplot
plots = []
colors = ["green","violet","blue","red"]
for color,column in enumerate(device_cols):
    each_column = train_df[column].value_counts()
    plots.append(go.Bar(marker=dict(opacity=0.5,color=colors[color]),orientation="h", y = each_column.index[:15][::-1], x = each_column.values[:15][::-1]))
# each_column.index[:15][::-1] takes the 15 most frequent values, in reversed order
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=["Visits: Browser", "Visits: Mobile", "Visits: Category" ,"Visits: OS"], print_grid=False)
fig.append_trace(plots[0], 1, 1)
fig.append_trace(plots[1], 1, 2)
fig.append_trace(plots[2], 2, 1)
fig.append_trace(plots[3], 2, 2)

fig['layout'].update(height=800, showlegend=False,xaxis=dict(title="no of users"), title="Visits by Device Attributes")
iplot(fig)


#### Calculating the mean transaction revenue for each category of each feature
This shows which category of each feature has the largest impact on revenue.

In [None]:
train_df["totals.transactionRevenue"] = train_df["totals.transactionRevenue"].astype('float')


fig = tools.make_subplots(rows=2, cols=2, subplot_titles=["Mean Transaction revenue: Browser", "Mean Transaction revenue: Mobile", "Mean Transaction revenue: Category", "Mean Transaction revenue: OS"], print_grid=False)

device_columns = ['device.browser' , 'device.isMobile' , 'device.deviceCategory' , 'device.operatingSystem']

colors = ["green","violet","blue","red"]
plts = []
for color, column in enumerate(device_columns):
    temporary_var = train_df.groupby(column).agg({"totals.transactionRevenue": "mean"}).reset_index().rename(columns={"totals.transactionRevenue" : "Mean Revenue"})
    temporary_var = temporary_var.dropna().sort_values("Mean Revenue", ascending = False)
    each_bar = go.Bar(x = temporary_var["Mean Revenue"][::-1], orientation="h", marker=dict(opacity=0.5, color=colors[color]), y = temporary_var[column][::-1])
    plts.append(each_bar)

fig.append_trace(plts[0], 1, 1)
fig.append_trace(plts[1], 1, 2)
fig.append_trace(plts[2], 2, 1)
fig.append_trace(plts[3], 2, 2)
fig['layout'].update(height=800, showlegend=False, title="Mean Transaction revenue by Device Attributes")
iplot(fig)

### Observations:
#### about device attributes

1. **Browser**:

   ----> This feature gives an interesting result.
  
   ----> 1. Chrome is the browser used by most visitors.
   
   ----> 2. Firefox has the highest mean transaction revenue. This is possible because not everyone who visits the website makes a purchase, so the most popular browser need not be the one with the highest revenue per visit. We can assume that a larger share of Firefox users made purchases.
   
2. **Mobile**:

   ----> 1. Most people do not use a mobile device to visit the GStore (which is an ecommerce store).
  
   ----> 2. Since most visitors do not use mobile, most of the revenue is not generated through mobile devices.
   
3. **Category**:

   ----> This shows which type of device most visitors use and which type of device generates revenue.
  
   ----> Desktop is used by most visitors, and it also generates more transaction revenue than the other device types.
   
4. **OS (Operating system)**:

   ----> This feature also gives an interesting result.
  
   ----> 1. Windows is the operating system used by most visitors.
   
   ----> 2. Chrome OS has the highest mean transaction revenue. Again, not everyone who visits the website makes a purchase. Our assumption is that Chrome OS users, who already use a Google product, trust Google more and are more inclined to buy from the GStore.
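
The gap between "most visits" and "highest mean revenue" is easier to read when visit counts, mean revenue, and the fraction of visits with a purchase are shown side by side. The following is a minimal sketch on invented data; the output column names `visits`, `mean_revenue` and `conversion_rate` are hypothetical.

```python
import pandas as pd

# Toy session data (invented): browser and revenue per visit.
sessions = pd.DataFrame({
    "device.browser": ["Chrome"] * 6 + ["Firefox"] * 2,
    "totals.transactionRevenue": [0, 0, 0, 0, 10, 20, 0, 90],
})

# For each browser: number of visits, mean revenue per visit,
# and the fraction of visits with a purchase.
summary = sessions.groupby("device.browser")["totals.transactionRevenue"].agg(
    visits="size",
    mean_revenue="mean",
    conversion_rate=lambda s: (s > 0).mean(),
)
print(summary.sort_values("visits", ascending=False))
```

In this toy example the browser with fewer visits has the higher mean revenue, which is the same pattern described in the observations above.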

#### 4.1.3 GeoNetwork attributes

In [None]:
geo_net_column = [col for col in train_df.columns if 'geoNetwork' in col]

In [None]:
geo_net_column

In [None]:
train_df[geo_net_column].head()

In [None]:
geo_net_cols = ['geoNetwork.continent' ,'geoNetwork.subContinent', 'geoNetwork.country']

Among all these, we will consider **geoNetwork.continent, geoNetwork.country, geoNetwork.subContinent** for analysis, as the values of the remaining features are not available in the given dataset.

In [None]:
from plotly.offline import iplot
plots = []
colors = ["green","blue","red"]
for color,column in enumerate(geo_net_cols):
    each_column = train_df[column].value_counts()
    plots.append(go.Bar(marker=dict(opacity=0.5,color=colors[color]),orientation="h", y = each_column.index[:15][::-1], x = each_column.values[:15][::-1]))
# each_column.index[:15][::-1] takes the 15 most frequent values, in reversed order
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=["Visits: Continent", "Visits: subContinent", "Visits: country" ], print_grid=False)
fig.append_trace(plots[0], 1, 1)
fig.append_trace(plots[1], 1, 2)
fig.append_trace(plots[2], 2, 1)

fig['layout'].update(height=800, showlegend=False,xaxis=dict(title="no of users"), title="Visits by GeoNetwork Attributes")
iplot(fig)


#### Calculating the mean transaction revenue for each category of each feature
This shows which category of each feature has the largest impact on revenue.

In [None]:
train_df["totals.transactionRevenue"] = train_df["totals.transactionRevenue"].astype('float')


fig = tools.make_subplots(rows=2, cols=2, subplot_titles=["Mean Transaction revenue: Continent", "Mean Transaction revenue: SubContinent", "Mean Transaction revenue: Country"], print_grid=False)


colors = ["green","blue","red"]
plts = []
for color, column in enumerate(geo_net_cols):
    temporary_var = train_df.groupby(column).agg({"totals.transactionRevenue": "mean"}).reset_index().rename(columns={"totals.transactionRevenue" : "Mean Revenue"})
    temporary_var = temporary_var.dropna().sort_values("Mean Revenue", ascending = False)
    each_bar = go.Bar(x = temporary_var["Mean Revenue"][::-1], orientation="h", marker=dict(opacity=0.5, color=colors[color]), y = temporary_var[column][::-1])
    plts.append(each_bar)

fig.append_trace(plts[0], 1, 1)
fig.append_trace(plts[1], 1, 2)
fig.append_trace(plts[2], 2, 1)

fig['layout'].update(height=800, showlegend=False, title="Mean Transaction revenue by GeoNetwork Attributes")


iplot(fig)

In [None]:
# Just a fancier visualization of one of the features above (nothing new beyond that)

# plotly globe credits - https://www.kaggle.com/arthurtok/generation-unemployed-interactive-plotly-visuals
temporary_var = train_df["geoNetwork.country"].value_counts()

colorscale = [[0, 'rgb(102,194,165)'], [0.005, 'rgb(102,194,165)'], 
              [0.01, 'rgb(171,221,164)'], [0.02, 'rgb(230,245,152)'], 
              [0.04, 'rgb(255,255,191)'], [0.05, 'rgb(254,224,139)'], 
              [0.10, 'rgb(253,174,97)'], [0.25, 'rgb(213,62,79)'], [1.0, 'rgb(158,1,66)']]

data = [ dict(
        type = 'choropleth',
        autocolorscale = False,
        colorscale = colorscale,
        showscale = True,
        locations = temporary_var.index,
        z = temporary_var.values,
        locationmode = 'country names',
        text = temporary_var.values,
        marker = dict(
            line = dict(color = '#fff', width = 2)) )           ]

layout = dict(
    height=500,
    title = 'Visits by Country',
    geo = dict(
        showframe = True,
        showocean = True,
        oceancolor = '#222',
        projection = dict(
        type = 'orthographic',
            rotation = dict(
                    lon = 60,
                    lat = 10),
        ),
        lonaxis =  dict(
                showgrid = False,
                gridcolor = 'rgb(102, 102, 102)'
            ),
        lataxis = dict(
                showgrid = False,
                gridcolor = 'rgb(102, 102, 102)'
                )
            ),
        )
fig = dict(data=data, layout=layout)
iplot(fig)




########################----------------------MEAN TRANSACTION REVENUE FOR COUNTRIES----------------------------################



# plotly globe credits - https://www.kaggle.com/arthurtok/generation-unemployed-interactive-plotly-visuals

temporary_var = train_df.groupby("geoNetwork.country").agg({"totals.transactionRevenue" : "mean"}).reset_index()

colorscale = [[0, 'rgb(102,194,165)'], [0.005, 'rgb(102,194,165)'], 
              [0.01, 'rgb(171,221,164)'], [0.02, 'rgb(230,245,152)'], 
              [0.04, 'rgb(255,255,191)'], [0.05, 'rgb(254,224,139)'], 
              [0.10, 'rgb(253,174,97)'], [0.25, 'rgb(213,62,79)'], [1.0, 'rgb(158,1,66)']]

data = [ dict(
        type = 'choropleth',
        autocolorscale = False,
        colorscale = colorscale,
        showscale = True,
        locations = temporary_var['geoNetwork.country'],
        z = temporary_var['totals.transactionRevenue'],
        locationmode = 'country names',
        text = temporary_var['totals.transactionRevenue'],
        marker = dict(
            line = dict(color = '#fff', width = 2)) ) ]

layout = dict(
    height=500,
    title = 'Mean Transaction Revenue by Countries',
    geo = dict(
        showframe = True,
        showocean = True,
        oceancolor = '#222',
        projection = dict(
        type = 'orthographic',
            rotation = dict(
                    lon = 60,
                    lat = 10),
        ),
        lonaxis =  dict(
                showgrid = False,
                gridcolor = 'rgb(102, 102, 102)'
            ),
        lataxis = dict(
                showgrid = False,
                gridcolor = 'rgb(102, 102, 102)'
                )
            ),
        )
fig = dict(data=data, layout=layout)
iplot(fig)

### Observations:
#### about GeoNetwork attributes

1. **Continent**:

   ----> This feature gives a very interesting result.
  
   ----> 1. Most visitors come from the Americas.
   
   ----> 2. The number of visits from the Americas is nearly 39 times the number from Africa. However, Africa has the highest mean transaction revenue, followed by Asia.
   
2. **Sub-Continent**:

   ----> These results are similar to the continent statistics, since each sub-continent belongs to a continent.
   
   ----> 1. North America, followed by Southeast Asia, are the two sub-continents with the most visitors.
   
   ----> 2. Eastern Africa and East Asia are the sub-continents with the highest mean transaction revenue.
   
3. **Country**:

   ----> As discussed, these results follow the same pattern, since each country belongs to a sub-continent.
  
   ----> 1. The USA has the highest number of visits, nearly 364k. India is next with 51k.
   
   ----> 2. Anguilla has the highest mean transaction revenue, followed by Curaçao. (Both are Caribbean territories rather than African countries, so this does not simply mirror the continent-level result.)
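
A mean computed over very few visits can be dominated by a single purchase, which may be what puts small territories at the top of the ranking. As a hedged sketch on invented data, the mean can be ranked only among countries with a minimum number of visits. The threshold `MIN_VISITS` is a hypothetical, arbitrary choice.

```python
import pandas as pd

# Toy data (invented): one small country with a single large purchase.
sessions = pd.DataFrame({
    "geoNetwork.country": ["United States"] * 5 + ["India"] * 3 + ["Anguilla"],
    "totals.transactionRevenue": [0, 0, 30, 0, 20, 0, 0, 0, 200],
})

by_country = sessions.groupby("geoNetwork.country")["totals.transactionRevenue"].agg(
    visits="size", mean_revenue="mean"
)

# Rank by mean revenue, but only among countries with enough visits
# for the mean to be meaningful. The threshold is an arbitrary choice.
MIN_VISITS = 3
reliable = by_country[by_country["visits"] >= MIN_VISITS].sort_values(
    "mean_revenue", ascending=False
)
print(reliable)
```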

#### 4.1.4 trafficSource attributes

In [None]:
trafficsource_columns = [col for col in train_df.columns if 'trafficSource' in col]

In [None]:
trafficsource_columns

In [None]:
train_df[trafficsource_columns].head()

In [None]:
trafficSource_cols = ['trafficSource.campaign','trafficSource.medium','trafficSource.source']

Among all these, we will consider **trafficSource.campaign, trafficSource.medium, trafficSource.source** for analysis, as the values of the remaining features are not available or do not look useful in the given dataset.

In [None]:
plots = []
colors = ["green","blue","red"]
for color,column in enumerate(trafficSource_cols):
    each_column = train_df[column].value_counts()
    plots.append(go.Bar(marker=dict(opacity=0.5,color=colors[color]),orientation="h", y = each_column.index[:15][::-1], x = each_column.values[:15][::-1]))
# each_column.index[:15][::-1] takes the 15 most frequent values, in reversed order
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=["trafficSource: Campaign", "trafficSource: Medium", "trafficSource: Source" ], print_grid=False)
fig.append_trace(plots[0], 1, 1)
fig.append_trace(plots[1], 1, 2)
fig.append_trace(plots[2], 2, 1)

fig['layout'].update(height=800, showlegend=False,xaxis=dict(title="no of users using trafficSource"), title="Traffic Source")
iplot(fig)


#### Calculating the mean transaction revenue for each category of each feature
This shows which category of each feature has the largest impact on revenue.

In [None]:
train_df["totals.transactionRevenue"] = train_df["totals.transactionRevenue"].astype('float')


fig = tools.make_subplots(rows=2, cols=2, subplot_titles=["Mean Transaction revenue: Campaign", "Mean Transaction revenue: Medium", "Mean Transaction revenue: Source"], print_grid=False)


colors = ["green","blue","red"]
plts = []
for color, column in enumerate(trafficSource_cols):
    temporary_var = train_df.groupby(column).agg({"totals.transactionRevenue": "mean"}).reset_index().rename(columns={"totals.transactionRevenue" : "Mean Revenue"})
    temporary_var = temporary_var.dropna().sort_values("Mean Revenue", ascending = False)
    each_bar = go.Bar(x = temporary_var["Mean Revenue"][::-1], orientation="h", marker=dict(opacity=0.5, color=colors[color]), y = temporary_var[column][::-1])
    plts.append(each_bar)

fig.append_trace(plts[0], 1, 1)
fig.append_trace(plts[1], 1, 2)
fig.append_trace(plts[2], 2, 1)

fig['layout'].update(height=800, showlegend=False, title="Mean Transaction revenue by Traffic_Source Attributes")


iplot(fig)

### Observations:
#### about trafficSource attributes

1. **Campaign**:

   ----> Most visits have the campaign '(not set)', meaning the campaign is unknown. This is expected, since every visit without campaign information is grouped under this value.
  
   ----> The highest mean transaction revenue comes from the campaign 'V-Accessories'.
   
   
2. **Medium**:

   ----> Organic is the medium through which most visitors reached the GStore.
   
   ----> However, the 'cpm' medium has the highest mean transaction revenue.
   
   
3. **Source**:

   ----> Most users reached the GStore through Google as the source, which accounts for a high volume of visits.
  
   ----> However, 'secamp.com' is the source with the highest mean transaction revenue. Surprisingly, Google does not lead here: it ranks tenth or lower.

#### 4.1.5 Channel Grouping.

In [None]:
channel_group_counts = train_df['channelGrouping'].value_counts()
values_channel_group = channel_group_counts.values 
index_channel_group = channel_group_counts.index
domain_channel_group = {'x': [0.2, 0.50], 'y': [0.0, 0.33]}
fig = {
  "data": [
    {
      "values": values_channel_group,
      "labels": index_channel_group,
      "domain": {"x": [0, .48]},
    "marker" : dict(colors=["#ef86a2" ,'#89edd4',  '#f7ee71']),
      "name": "Channel Grouping",
      "hoverinfo":"label+percent+name",
      "hole": .7,
      "type": "pie"
    }
   ],   
  "layout": {"title":"Channel Grouping",
      "annotations": [
            {
                "font": {
                    "size": 20
                },
                "showarrow": False,
                "text": "Channel Grouping",
                "x": 0.11,
                "y": 0.5
            }
        ]
    }
}
iplot(fig)
#took from pavansangapati kernel

### Observations:
#### about channelGrouping

1. Channel grouping is the channel through which the user came to the GStore.

2. 42.2% of users reached the GStore through Organic Search, followed by the Social channel (nearly 25% of users).
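
The percentages shown in the pie chart can also be computed as numbers. A minimal sketch, using invented channel counts rather than the real data:

```python
import pandas as pd

# Toy channel labels (invented counts).
channels = pd.Series(
    ["Organic Search"] * 4 + ["Social"] * 3 + ["Direct"] * 2 + ["Referral"],
    name="channelGrouping",
)

# normalize=True returns fractions instead of counts; scale to percent.
share_pct = channels.value_counts(normalize=True) * 100
print(share_pct.round(1))
```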

#### 4.1.6 Date feature

In [None]:
train_df['date'].head()

In [None]:
# The date is stored in YYYYMMDD form, so we first convert it to a datetime.date using a lambda function
import datetime
train_df['date'] = train_df['date'].apply(lambda x: datetime.date(int(str(x)[:4]), int(str(x)[4:6]), int(str(x)[6:])))

In [None]:
# Count visits per day. groupby returns the dates in ascending order and,
# unlike value_counts().to_frame(), gives the same column names across pandas versions
temporary_var = train_df.groupby('date').size().reset_index(name='visits')
temporary_var = temporary_var.rename(columns = {"date" : "dateX"})

pltt = go.Scatter(mode="lines", x = temporary_var["dateX"].astype(str), y = temporary_var["visits"])
layout = go.Layout(title="User Visits by date", height=400)
fig = go.Figure(data = [pltt], layout = layout)
iplot(fig)

# Mean transaction revenue per visit for each date

temporary_var = train_df.groupby("date").agg({"totals.transactionRevenue" : "mean"}).reset_index()
temporary_var = temporary_var.rename(columns = {"date" : "dateX", "totals.transactionRevenue" : "mean_revenue"})
pltt = go.Scatter(mode="lines", x = temporary_var["dateX"].astype(str), y = temporary_var["mean_revenue"])
layout = go.Layout(title="Mean TransactionRevenue by date", height=400)
fig = go.Figure(data = [pltt], layout = layout)
iplot(fig)

### Observations:
#### about the 'date' feature
An interesting observation: for the date feature, the number of visits and the transaction revenue show contrasting patterns.

1. From the plot, we can see that Nov 2016 had a high number of visits to the GStore.

2. Surprisingly, despite the high number of visits, the mean transaction revenue in Nov 2016 was among the lowest.

3. May 2017 generated the highest transaction revenue despite having the fewest visits to the store.
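
The observations above are stated per month, while the plots are drawn per day. As a sketch under the assumption that the dates are held as pandas datetimes (in this notebook they are `datetime.date` objects, so `pd.to_datetime` would be needed first), visits and mean revenue can be aggregated by calendar month. The data below is invented.

```python
import pandas as pd

# Toy daily data (invented): visit dates and revenue per visit.
sessions = pd.DataFrame({
    "date": pd.to_datetime(["2016-11-01", "2016-11-15", "2016-11-20", "2017-05-03"]),
    "totals.transactionRevenue": [0.0, 0.0, 30.0, 80.0],
})

# Group by calendar month and compute visits and mean revenue per visit.
monthly = sessions.groupby(sessions["date"].dt.to_period("M"))[
    "totals.transactionRevenue"
].agg(visits="size", mean_revenue="mean")
print(monthly)
```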

In [None]:
#Applying same transformation to test dataset also
test_df['date'] = test_df['date'].apply(lambda x: datetime.date(int(str(x)[:4]), int(str(x)[4:6]), int(str(x)[6:])))

#### 4.1.7 Number of Visitors and Common Visitors

In [None]:
print("Number of unique visitors in train set : ",train_df.fullVisitorId.nunique(), " out of rows : ",train_df.shape[0])
print("Number of unique visitors in test set : ",test_df.fullVisitorId.nunique(), " out of rows : ",test_df.shape[0])
print("Number of common visitors in train and test set : ",len(set(train_df.fullVisitorId.unique()).intersection(set(test_df.fullVisitorId.unique())) ))

In [None]:
test_df.shape

In [None]:
train_df.shape

In [None]:
train_df.columns

#### 4.1.8 Visitor Profile Attributes

In [None]:
total_cols = [col for col in train_df.columns if 'totals' in col]

In [None]:
total_cols

In [None]:
ttl_cols = ['totals.hits','totals.pageviews']

In [None]:
plots = []
colors = ["green","red"]
for idx, column in enumerate(ttl_cols):
    each_column = train_df[column].value_counts()
    # value_counts() sorts by frequency, so [:15] keeps the 15 most common values; [::-1] reverses them so the most common value is drawn at the top of the horizontal bar chart
    plots.append(go.Bar(marker=dict(opacity=0.5,color=colors[idx]),orientation="h", y = each_column.index[:15][::-1], x = each_column.values[:15][::-1]))
fig = tools.make_subplots(rows=1, cols=2, subplot_titles=["Visits: hits", "Visits: pageviews"], print_grid=False)
fig.append_trace(plots[0], 1, 1)
fig.append_trace(plots[1], 1, 2)

fig['layout'].update(height=800, showlegend=False,xaxis=dict(title="Visitor Profile hits & views"), title="Visitor profile")
iplot(fig)


#### Calculating the mean transaction revenue for each value of each feature
This shows which values of each feature are associated with the most revenue.

In [None]:
train_df["totals.transactionRevenue"] = train_df["totals.transactionRevenue"].astype('float')


fig = tools.make_subplots(rows=1, cols=2, subplot_titles=["Mean Transaction revenue:hits", "Mean Transaction revenue:PageViews"], print_grid=False)


colors = ["green","red"]
plts = []
for color, column in enumerate(ttl_cols):
    temporary_var = train_df.groupby(column).agg({"totals.transactionRevenue": "mean"}).reset_index().rename(columns={"totals.transactionRevenue" : "Mean Revenue"})
    temporary_var = temporary_var.dropna().sort_values("Mean Revenue", ascending = False)
    each_bar = go.Bar(x = temporary_var["Mean Revenue"][::-1], orientation="h", marker=dict(opacity=0.5, color=colors[color]), y = temporary_var[column][::-1])
    plts.append(each_bar)

fig.append_trace(plts[0], 1, 1)
fig.append_trace(plts[1], 1, 2)


fig['layout'].update(height=800, showlegend=False, title="Mean Transaction revenue by Visitor profile attributes (totals)")


iplot(fig)

### Observations:
#### about the 'totals' feature (visitorProfileAttributes)

1.Most visits have very few hits and page views (the most common value for both is one).

2.The count plots are decreasing: the count is very high for small numbers of hits and page views per visit, and it falls as the number of hits per visit increases.

3.We are unable to see any clear trend or pattern in transaction revenue as a function of hits or page views.

### 5.Creating a Baseline model

**5.1 Preprocessing** 

First, we remove the columns that are not useful for building the model:
 
------>Drop columns with constant values

------>Drop IDs and other non-relevant columns
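
To illustrate the constant-column check used in this step, here is a minimal sketch on a made-up frame. The column values are hypothetical. It also shows why `dropna=False` matters: a column that is either `1` or missing has two distinct states and is kept, whereas counting without NaN would make it look constant.

```python
import numpy as np
import pandas as pd

# Made-up frame: two constant columns, one flag column with missing values, one varying column
toy = pd.DataFrame({
    "socialEngagementType": ["Not Socially Engaged"] * 3,
    "all_missing": [np.nan] * 3,
    "totals.bounces": [1.0, np.nan, 1.0],
    "channelGrouping": ["Direct", "Referral", "Direct"],
})

# A column is constant when it has exactly one distinct value, counting NaN as a value
constant = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
reduced = toy.drop(columns=constant)

print(constant)
print(list(reduced.columns))
```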


In [None]:
#columns with constant values
constant_columns = [column for column in train_df.columns if train_df[column].nunique(dropna=False)==1 ]


In [None]:
constant_columns

In [None]:
## non relevant columns
non_relevant = ["visitNumber", "date", "fullVisitorId", "sessionId", "visitId", "visitStartTime"]

In [None]:
train_df_model_columns = train_df.drop(columns=constant_columns)

In [None]:
train_df_model_columns.head()

In [None]:
type(train_df_model_columns['date'][0])

In [None]:
#sorting by date in order to perform time-based slicing
sorted_by_date_train_df_model_columns = train_df_model_columns.sort_values(by='date',ascending=True)


In [None]:
sorted_by_date_train_df_model_columns.head()

In [None]:
train_df_model_columns = sorted_by_date_train_df_model_columns.drop(columns=non_relevant)

In [None]:
train_df_model_columns.head()

In [None]:
train_df_model_columns.shape

In [None]:
train_df_model_columns.columns

#### We will apply the same process to the test data
We remove the "constant_columns" and the non-relevant columns, then compare train_df and test_df to check whether any columns (features) present in one are missing from the other.

In [None]:
test_df_model_columns = test_df.drop(columns=constant_columns)

In [None]:
test_df_model_columns.head()

In [None]:
test_df_model_columns.shape

In [None]:
type(test_df_model_columns['fullVisitorId'][0])

In [None]:
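#take a copy so that later changes to the ID column do not also modify test_df_model_columns
test_df_model_columns_with_id = test_df_model_columns.copy()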

In [None]:
test_df_model_columns_with_id.shape

In [None]:
type(test_df_model_columns_with_id['fullVisitorId'][0])

In [None]:
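#caution: float64 is exact only for integers up to 2**53, so very long IDs may lose digits after this cast
test_df_model_columns_with_id['fullVisitorId'] = test_df_model_columns_with_id['fullVisitorId'].astype('float')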

In [None]:
test_df_model_columns = test_df_model_columns.drop(columns=non_relevant)

In [None]:
test_df_model_columns.shape

In [None]:
#We will look at the variable names which are there in train dataset and not in test dataset.
print("Variables not in test_df_model_columns but in train_df_model_columns : ", set(train_df_model_columns.columns).difference(set(test_df_model_columns.columns)))

'totals.transactionRevenue' is expected to be missing from test_df_model_columns because it is the target we need to predict.
'trafficSource.campaignCode', however, is present in train_df_model_columns but not in test_df_model_columns, so we remove it from the train dataset.
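
The comparison described here can be sketched with plain set arithmetic on the column names. The frames below are empty stand-ins with made-up column lists, not the real datasets. The idea is to drop from train every column that test lacks, except the target.

```python
import pandas as pd

target = "totals.transactionRevenue"

# Empty stand-in frames; only the column names matter for this check
train = pd.DataFrame(columns=["channelGrouping", "totals.hits", target, "trafficSource.campaignCode"])
test = pd.DataFrame(columns=["channelGrouping", "totals.hits"])

# Columns that exist only in train, ignoring the target itself
extra_in_train = sorted(set(train.columns) - set(test.columns) - {target})
aligned_train = train.drop(columns=extra_in_train)

print(extra_in_train)
print(list(aligned_train.columns))
```

Computing the list instead of typing the column name by hand keeps the step valid if a different export of the data has other train-only columns.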

In [None]:
train_df_model_columns = train_df_model_columns.drop(columns='trafficSource.campaignCode')

In [None]:
train_df_model_columns.shape

In [None]:
#We will look at the variable names which are there in test dataset and not in train dataset.
print("Variables not in train_df_model_columns but in test_df_model_columns : ", set(test_df_model_columns.columns).difference(set(train_df_model_columns.columns)))

In [None]:
test_df_model_columns.shape

#### 5.2 Handling Categorical variables

In [None]:
#Label encoding the Categorical variables (train_data)
from sklearn.preprocessing import LabelEncoder

categorical_columns = [column for column in train_df_model_columns.columns if not column.startswith('total')]
categorical_columns = [column for column in categorical_columns if column not in constant_columns + non_relevant]

for column in categorical_columns:

    le = LabelEncoder()
    train_values = list(train_df_model_columns[column].values.astype(str))
    test_values = list(test_df_model_columns[column].values.astype(str))
    
    le.fit(train_values + test_values)
    
    train_df_model_columns[column] = le.transform(train_values)
    test_df_model_columns[column] = le.transform(test_values)  
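
The loop above fits each `LabelEncoder` on the train and test values together. A minimal sketch of why, on made-up browser names: an encoder fitted on train alone raises a `ValueError` when the test data contains a category it has not seen, while a combined fit gives both frames a consistent integer code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column; "Firefox" appears only in the test data
train_col = pd.Series(["Chrome", "Safari", "Chrome"])
test_col = pd.Series(["Firefox", "Chrome"])

# Fit on the union of both columns so that every category gets a code
le = LabelEncoder()
le.fit(pd.concat([train_col, test_col]).astype(str))

train_codes = le.transform(train_col.astype(str))
test_codes = le.transform(test_col.astype(str))

print(list(le.classes_))
print(train_codes, test_codes)
```

Note that the integer codes follow the sorted order of the labels and carry no meaning of their own. Tree-based models such as LightGBM can usually work with them, but that is a modelling assumption rather than a property of the encoding.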

In [None]:
train_df_model_columns.head()

In [None]:
test_df_model_columns.head()

#### 5.3 Handling Numerical variables

In [None]:
#filling the NA values in totals column
train_df_model_columns['totals.bounces'] = train_df_model_columns['totals.bounces'].fillna(0.0)
train_df_model_columns['totals.newVisits'] = train_df_model_columns['totals.newVisits'].fillna(0.0)
test_df_model_columns['totals.bounces'] = test_df_model_columns['totals.bounces'].fillna(0.0)
test_df_model_columns['totals.newVisits'] = test_df_model_columns['totals.newVisits'].fillna(0.0)


In [None]:
def normalize_numerical_columns(dataframe, isTrainDataset = True):
    dataframe["totals.hits"] = dataframe["totals.hits"].astype(float)
    #dataframe["totals.hits"] = (dataframe["totals.hits"] - min(dataframe["totals.hits"])) / (max(dataframe["totals.hits"]) - min(dataframe["totals.hits"]))

    dataframe["totals.pageviews"] = dataframe["totals.pageviews"].astype(float)
    #dataframe["totals.pageviews"] = (dataframe["totals.pageviews"] - min(dataframe["totals.pageviews"])) / (max(dataframe["totals.pageviews"]) - min(dataframe["totals.pageviews"]))
    
    dataframe["totals.bounces"] = dataframe["totals.bounces"].astype(float)
    dataframe["totals.newVisits"] = dataframe["totals.newVisits"].astype(float)
    
    
    if isTrainDataset:
        dataframe["totals.transactionRevenue"] = dataframe["totals.transactionRevenue"].fillna(0.0)
    return dataframe 

In [None]:
train_df_model_columns = normalize_numerical_columns(train_df_model_columns)


In [None]:
train_df_model_columns.head()

In [None]:
test_df_model_columns = normalize_numerical_columns(test_df_model_columns,isTrainDataset=False)

In [None]:
test_df_model_columns.head()

#### 5.4 Generate Training and Validation Sets

Now let us create training and validation splits based on time to build the model. One option is to hold out the last two months as the validation sample; here we split by row count instead.

1. We split in the ratio 70:30.
    The split is time-based: the data was sorted by date earlier, so the older 70% of rows are used as train_data and the most recent 30% are used for cross validation. Performance on the most recent data gives a better estimate of how the model will behave on future data.
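
A minimal sketch of such a time-based split, assuming a frame with a `date` column. The helper name `time_based_split` and the toy data are hypothetical. The split index is computed from the frame length rather than typed in, and the same index is used for both slices so that every row lands in exactly one of the two sets.

```python
import datetime
import pandas as pd

def time_based_split(df, date_col="date", train_fraction=0.7):
    """Sort by date and return (older rows, most recent rows)."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    split_idx = int(len(ordered) * train_fraction)
    # One index for both slices: nothing is skipped and nothing is duplicated
    return ordered.iloc[:split_idx], ordered.iloc[split_idx:]

# Toy data: ten visits on ten consecutive days, deliberately out of order
toy = pd.DataFrame({
    "date": [datetime.date(2017, 1, d) for d in range(10, 0, -1)],
    "totals.hits": range(10),
})
train_part, valid_part = time_based_split(toy)
print(len(train_part), len(valid_part))
```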

In [None]:
train_df_model_columns.shape

In [None]:
num = 903653*70


In [None]:
num/100

In [None]:
train_df_model_columns_train = train_df_model_columns[:632557]

In [None]:
train_df_model_columns_train.shape

The remaining rows of train_df are used as validation data.

In [None]:
#the validation slice starts at 632557, the first row after the training slice [:632557]
train_df_model_columns_cv = train_df_model_columns[632557:]

In [None]:
train_df_model_columns_cv.shape

Separating the features (X) and the log-transformed target (Y) for the train and cross-validation data

In [None]:
train_X = train_df_model_columns_train.drop(columns='totals.transactionRevenue')

In [None]:
train_Y = np.log1p(train_df_model_columns_train['totals.transactionRevenue'].values)

In [None]:
train_X.shape

In [None]:
train_Y.shape

In [None]:
cv_X = train_df_model_columns_cv.drop(columns='totals.transactionRevenue')

In [None]:
cv_Y = np.log1p(train_df_model_columns_cv['totals.transactionRevenue'].values)

In [None]:
cv_X.shape

In [None]:
cv_Y.shape
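
The targets above are `np.log1p` of the revenue. A short worked sketch, on made-up revenue values, of why this transform is convenient: `log1p(0)` is exactly 0, so visits without a purchase stay at 0, large values are compressed, and `np.expm1` inverts the transform when raw revenue is needed again.

```python
import numpy as np

# Made-up revenue values, including a visit with no purchase
revenue = np.array([0.0, 1_000_000.0, 250_000_000.0])

log_target = np.log1p(revenue)    # log(1 + x), defined at x = 0
recovered = np.expm1(log_target)  # exp(y) - 1, the inverse of log1p

print(log_target)
print(np.allclose(recovered, revenue))
```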

#### 5.5 Training the model
Training the model using LightGBM.

In [None]:
import lightgbm as lightGBM 

#about parameters of lightgbm --- https://lightgbm.readthedocs.io/en/latest/Parameters.html

#bagging_freq = 9 re-samples the bagging subset every 9 iterations; feature_fraction is the column-sampling rate (colsample_bytree is an alias of it)
lightGBM_params = {"objective" : "regression", "metric" : "rmse",
              "num_leaves" : 100, "learning_rate" : 0.02, 
              "bagging_fraction" : 0.75, "feature_fraction" : 0.8, "bagging_freq" : 9, "bagging_seed" : 2019}


lightGBM_train = lightGBM.Dataset(train_X, label=train_Y)
lightGBM_crossVal = lightGBM.Dataset(cv_X, label=cv_Y)
#note: LightGBM 4.0+ takes early stopping and logging as callbacks (lightGBM.early_stopping, lightGBM.log_evaluation) instead of these two arguments
lightGBM_model = lightGBM.train(lightGBM_params, lightGBM_train, 700, valid_sets=[lightGBM_crossVal], early_stopping_rounds=150, verbose_eval=50)

In [None]:
test_df_model_columns.shape

In [None]:
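#predict log1p(revenue) for every test visit, using the best iteration found by early stopping
prediction = lightGBM_model.predict(test_df_model_columns, num_iteration=lightGBM_model.best_iteration)
final_df = pd.DataFrame({"fullVisitorId":test_df['fullVisitorId']})
#revenue cannot be negative, so negative predictions are set to 0
prediction[prediction<0] = 0

#convert back to raw revenue so that visits can be summed per visitor
final_df["PredictedLogRevenue"] = np.expm1(prediction)

final_df = final_df.groupby("fullVisitorId").agg({"PredictedLogRevenue" : "sum"}).reset_index()
final_df.columns = ["fullVisitorId", "PredictedLogRevenue"]

#take log1p of the per-visitor total to get the value expected in the submission
final_df["PredictedLogRevenue"] = np.log1p(final_df["PredictedLogRevenue"])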

final_df.head()
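
The aggregation in this cell is easy to get wrong, because per-visit predictions are on the log scale while the submission needs the log of the per-visitor sum. A minimal sketch on made-up values, using a hypothetical helper `aggregate_to_visitor`: predictions are clipped at 0, converted back to revenue with `expm1`, summed per visitor, and only then passed through `log1p` again.

```python
import numpy as np
import pandas as pd

def aggregate_to_visitor(visitor_ids, visit_log_predictions):
    """Turn per-visit log1p(revenue) predictions into one log1p(total) per visitor."""
    # Negative log-revenue predictions are not meaningful, so clip them to 0
    preds = np.clip(np.asarray(visit_log_predictions, dtype=float), 0, None)
    frame = pd.DataFrame({"fullVisitorId": list(visitor_ids), "revenue": np.expm1(preds)})
    # Sum raw revenue per visitor; summing the log values directly would be wrong
    per_visitor = frame.groupby("fullVisitorId", as_index=False)["revenue"].sum()
    per_visitor["PredictedLogRevenue"] = np.log1p(per_visitor["revenue"])
    return per_visitor[["fullVisitorId", "PredictedLogRevenue"]]

# Made-up example: visitor "a" has two visits, visitor "b" has one negative prediction
ids = ["a", "a", "b"]
log_preds = [np.log1p(10.0), np.log1p(20.0), -0.3]
result = aggregate_to_visitor(ids, log_preds)
print(result)
```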

In [None]:
final_df.to_csv("baseline_lightGBM_6.csv",index=False)

In [None]:
fig, ax = plt.subplots(figsize=(12,18))
lightGBM.plot_importance(lightGBM_model, max_num_features=30, height=0.8, ax=ax)
plt.title("LightGBM-Feature Importance", fontsize=10)
plt.show()