## Google Analytics Customer revenue prediction

In this kernel, I do exploratory data analysis for the [Google Analytics customer revenue prediction challenge](https://www.kaggle.com/c/ga-customer-revenue-prediction). Here, we need to analyze a [Google Merchandise Store](https://www.googlemerchandisestore.com) (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer.

The training and test datasets are available both as CSV files and as Google BigQuery datasets. Since I have experience using SQL for data analytics in my professional life, I tried my hand at __Google BigQuery__ for the exploratory data analysis here. I have used simple SELECT statements with GROUP BY and aggregate functions such as SUM, MIN, MAX and AVG.

This kernel is inspired by the [simple exploration kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue) from [SRK](https://www.kaggle.com/sudalairajkumar). His kernels have always been a great source of inspiration and learning for me!

In [None]:
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
color = sns.color_palette()
plt.style.use('bmh')
plt.set_cmap('spring')
%matplotlib inline
import bq_helper

In [None]:

#Here's how we can use the BQHelper library to pull datasets/tables from BigQuery
ga_bq_train = bq_helper.BigQueryHelper(active_project= "kaggle-public-datasets", 
                                       dataset_name = "ga_train_set")
ga_bq_test = bq_helper.BigQueryHelper(active_project= "kaggle-public-datasets", 
                                       dataset_name = "ga_test_set")


In [None]:
ga_bq_train.list_tables()[:10]

It seems the dataset on Kaggle is split into one table per date. Let's check which columns are available:

In [None]:
#columns in train dataset
ga_bq_train.table_schema((ga_bq_train.list_tables()[0]))['name'].tolist()

There are 186 columns in the train dataset, including the columns nested inside the JSON fields.
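
Nested fields are addressed with dotted names such as `totals.transactionRevenue` or `device.browser`, which is how they are queried later in this kernel. As a rough illustration of how nested records expand into such names, here is a minimal sketch. The `flatten_fields` helper and the toy schema (including its placeholder types) are hypothetical and are not part of `bq_helper` or the real table.

```python
def flatten_fields(schema, prefix=""):
    """Return dotted names for every leaf field in a nested dict schema."""
    names = []
    for name, value in schema.items():
        full_name = prefix + name
        if isinstance(value, dict):
            # A nested record: recurse and prepend the parent name.
            names.extend(flatten_fields(value, full_name + "."))
        else:
            names.append(full_name)
    return names


# Toy schema for illustration only; the real table has far more fields.
toy_schema = {
    "fullVisitorId": "STRING",
    "date": "STRING",
    "totals": {"hits": "INTEGER", "transactionRevenue": "INTEGER"},
    "device": {"browser": "STRING"},
}
flat_names = flatten_fields(toy_schema)
print(flat_names)
```

A handful of top-level records with many leaves each is enough to reach a count like the one reported above.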

In [None]:
#Let's check the size of train dataset total number of records
total_train_query = """SELECT  COUNT(*) AS COUNT
  FROM `kaggle-public-datasets.ga_train_set.ga_sessions_*` """
total_train = ga_bq_train.query_to_pandas(total_train_query)
print('Total number of records in training dataset :',total_train['COUNT'][0])

#Let's check the size of test dataset total number of records
total_test_query = """SELECT  COUNT(*) AS COUNT
  FROM `kaggle-public-datasets.ga_test_set.ga_sessions_*` """
total_test = ga_bq_test.query_to_pandas(total_test_query)
print('Total number of records in test dataset :',total_test['COUNT'][0])


In [None]:
#training data snapshot
ga_bq_train.head(ga_bq_train.list_tables()[0]) #this will show data in first table of bigquery dataset


## Exploratory Data Analysis

### transactionRevenue - the target variable

In [None]:
#exploration of Target Variable using BigQuery

totalrevenue_per_user_query = """SELECT  fullVisitorId, coalesce(SUM( totals.transactionRevenue ),0) AS totalrevenue_per_user
  FROM `kaggle-public-datasets.ga_train_set.ga_sessions_*` 
  GROUP BY fullVisitorId
"""
totalrevenue_per_user = ga_bq_train.query_to_pandas_safe(totalrevenue_per_user_query)
#plot distribution of transactionRevenue
plt.figure(figsize=(8,6))
#scatter plot of log1p (natural log of 1 + x) of total revenue per user, sorted ascending
#original code by : SRK kernel(https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue)

plt.scatter(range(totalrevenue_per_user.shape[0]), np.sort(np.log1p(totalrevenue_per_user["totalrevenue_per_user"].values)))
plt.xlabel('index', fontsize=12)
plt.ylabel('log1p(totalRevenue)', fontsize=12)
plt.title('Distribution of totalrevenue per user')
plt.show()

As mentioned in the competition overview and by [SRK](https://www.kaggle.com/sudalairajkumar) in his [kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue), it is true that
__"The 80/20 rule has proven true for many businesses - only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies."__ The 80/20 rule says that roughly 20% of customers produce about 80% of sales (transaction revenue in our case). From the plot above, the share of revenue-generating customers seems much lower than 20%.
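
To put a number on this kind of concentration, one could compute the share of revenue produced by the top fraction of customers. The sketch below uses synthetic revenue values, not figures from the dataset, and `revenue_share_of_top` is a hypothetical helper written for this illustration.

```python
def revenue_share_of_top(revenues, top_fraction=0.2):
    """Share of total revenue produced by the top `top_fraction` of customers."""
    ordered = sorted(revenues, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Always include at least one customer in the "top" group.
    top_n = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / total


# Synthetic per-user revenue: most users spend nothing.
toy_revenue = [0] * 95 + [10, 20, 30, 40, 900]
share_top20 = revenue_share_of_top(toy_revenue, 0.2)
paying_fraction = sum(1 for r in toy_revenue if r > 0) / len(toy_revenue)
print(share_top20, paying_fraction)
```

With the real data, the same function could be applied to the `totalrevenue_per_user` column returned by the query above.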

### date

As we saw above, the train and test datasets are split into tables by date. Let's check the time period covered by each, using the `date` column and BigQuery.

In [None]:
traindate_query = """SELECT MIN(date) as startdate,MAX(date) as enddate 
    FROM `kaggle-public-datasets.ga_train_set.ga_sessions_*`"""
traindate_result =  ga_bq_train.query_to_pandas_safe(traindate_query)
print('Data in training set is for date',traindate_result['startdate'].iloc[0],'to',traindate_result['enddate'].iloc[0])

One year of data is present in the training dataset (1st August 2016 to 1st August 2017).
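
The one-year span can be checked by parsing the two boundary dates. This is a minimal sketch, assuming the `date` column holds `YYYYMMDD` strings, so that the boundaries read `20160801` and `20170801`. `parse_yyyymmdd` is a hypothetical helper, not part of any library used in this kernel.

```python
from datetime import datetime


def parse_yyyymmdd(text):
    """Parse a 'YYYYMMDD' string into a date using the '%Y%m%d' format."""
    return datetime.strptime(text, "%Y%m%d").date()


start = parse_yyyymmdd("20160801")
end = parse_yyyymmdd("20170801")
span_days = (end - start).days
print(start.isoformat(), end.isoformat(), span_days)
```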

In [None]:
testdate_query = """SELECT MIN(date) as startdate,MAX(date) as enddate 
    FROM `kaggle-public-datasets.ga_test_set.ga_sessions_*`"""
testdate_result =  ga_bq_test.query_to_pandas_safe(testdate_query)
print('Data in test set is for date',testdate_result['startdate'].iloc[0],'to',testdate_result['enddate'].iloc[0])

About nine months of data is present in the test dataset, from August 2017 to April 2018.

Here, we first need to parse the date column into a proper format (yyyy-mm-dd) for visualization. BigQuery has a built-in function for this, [PARSE_DATE](https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions#parse_date). Let's use it to convert the date.

In [None]:
revenue_per_date_query = """SELECT  PARSE_DATE('%Y%m%d',date) AS DATE,COUNT(*) AS VISIT_COUNT ,coalesce(SUM( totals.transactionRevenue ),0) AS totalrevenue,
coalesce(AVG( totals.transactionRevenue ),0) AS avgrevenue
  FROM `kaggle-public-datasets.ga_train_set.ga_sessions_*` 
  GROUP BY date
"""
revenue_per_date = ga_bq_train.query_to_pandas_safe(revenue_per_date_query)
#the query has no ORDER BY, so sort by date before drawing the line plots
revenue_per_date = revenue_per_date.sort_values('DATE')

for col in ['VISIT_COUNT','totalrevenue','avgrevenue']:
    revenue_per_date.plot(x='DATE',y=col,figsize=(8,6))
    if col=='VISIT_COUNT' :
        plt.title('Visits count per day')
    else :
        plt.title('Distribution of ' + col + ' per date')
    plt.xlabel('DATE', fontsize=12)
    plt.ylabel(col, fontsize=12)

In the first plot, visit counts start increasing after October 2016 and reach their highest peak near Christmas. But this peak doesn't convert into revenue, as the total revenue and average revenue plots show. Instead, revenue peaks in the first months of 2017, in February and April.

In [None]:
def categorical_countplot(feature):
    #this function uses BigQuery to get the visit count and revenue for each value of the given feature
    #(passed as 'record.field', e.g. 'device.browser') and plots the top 10 values by visit count
    separate_feat = feature.split('.')[1]
    query = """SELECT """ + feature + """, COUNT(*) AS COUNT,coalesce(SUM( totals.transactionRevenue ),0) AS TotalRevenue,
    coalesce(AVG( totals.transactionRevenue ),0) AS AvgRevenue
      FROM `kaggle-public-datasets.ga_train_set.ga_sessions_*` 
      GROUP BY """ + feature + """
      ORDER BY COUNT(*) DESC"""
    feature_count = ga_bq_train.query_to_pandas_safe(query)
    print('Total number of ' ,separate_feat, ' :',len(feature_count[separate_feat]))
    #let's visualize the usage of top 10 feature categories using barplot
    plt.figure(figsize=(16,6))
    for i,col in enumerate(['COUNT','TotalRevenue','AvgRevenue']) :
        ax = plt.subplot(1,3,i+1)
        sns.barplot(x=separate_feat,y=col,data=feature_count.head(10))
        if col=='COUNT' :
            plt.title('Visits count per '  + separate_feat)
        else :
            plt.title(col + ' per ' + separate_feat)
        plt.xticks(rotation=90)
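
Because the function above splices the feature name straight into the SQL text, it can help to check the name before building the query. The following is only a sketch under that assumption. `build_feature_query` is a hypothetical helper that mirrors the query string used above and is not part of `bq_helper`.

```python
import re

TABLE = "`kaggle-public-datasets.ga_train_set.ga_sessions_*`"


def build_feature_query(feature):
    """Build the grouped count/revenue query for one dotted column name."""
    # Accept only names like 'device.browser' so that arbitrary text
    # cannot be spliced into the SQL string.
    if not re.fullmatch(r"[A-Za-z_]\w*\.[A-Za-z_]\w*", feature):
        raise ValueError("expected a column of the form 'record.field'")
    return (
        f"SELECT {feature}, COUNT(*) AS COUNT, "
        "COALESCE(SUM(totals.transactionRevenue), 0) AS TotalRevenue, "
        "COALESCE(AVG(totals.transactionRevenue), 0) AS AvgRevenue "
        f"FROM {TABLE} "
        f"GROUP BY {feature} "
        "ORDER BY COUNT(*) DESC"
    )


query = build_feature_query("device.browser")
print(query)
```

The returned string could then be passed to `query_to_pandas_safe` exactly as in the function above.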

### Browser 

In [None]:
# exploration of browser variable
categorical_countplot('device.browser')


The most widely used browser is __Google Chrome__, followed by Safari and Firefox. Other browsers have very low usage. Total revenue is highest for Google Chrome, but the __average revenue (over visits with non-zero revenue) is highest for Firefox__! It seems visits from Firefox are quality visits (generating revenue).
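
The "non-zero revenue" reading of the average follows from how SQL `AVG` treats missing values: rows where the value is NULL are skipped rather than counted as zero. The sketch below shows this with an in-memory SQLite table standing in for BigQuery. The rows are synthetic, and it assumes that a visit without a transaction has a NULL `transactionRevenue`, which the `coalesce` calls in the queries above suggest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (browser TEXT, revenue INTEGER)")
# Synthetic rows: NULL revenue means the visit had no transaction.
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("A", None), ("A", None), ("A", 10), ("A", 30), ("B", None), ("B", None)],
)
rows = conn.execute(
    """
    SELECT browser,
           COUNT(*) AS visits,
           COALESCE(SUM(revenue), 0) AS total_revenue,
           COALESCE(AVG(revenue), 0) AS avg_nonnull_revenue,
           AVG(COALESCE(revenue, 0)) AS avg_per_visit
    FROM sessions
    GROUP BY browser
    ORDER BY browser
    """
).fetchall()
conn.close()
print(rows)
```

For browser `A`, the average over paying visits is 20 while the average over all four visits is 10, so a feature value with few but valuable purchases can rank first on `AvgRevenue` even when its visit count is small.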

### Operating System

In [None]:
# exploration of operating system
categorical_countplot('device.operatingSystem')

__Microsoft Windows__ is the most __used__ OS here, but __Mac__ users are __generating more revenue__ than others. Average revenue (__non-zero revenue__) per visit is highest for __Chrome OS__.

### deviceCategory

In [None]:
categorical_countplot('device.deviceCategory')

The bar plots show the same pattern for usage, total revenue and average revenue. Most of the visits here are from __desktop computers__. So the previous graph for operating system also makes sense: usage of Windows and Macintosh is significantly higher compared to the others.

### Continent

In [None]:
categorical_countplot('geoNetwork.continent')

Visit counts and total revenue are highest from the __Americas__, but average revenue per visit is significantly higher from __Africa__. Also, the number of visits from Asia and Europe is high, but they are not contributing much towards generating revenue.

### Sub-continent

In [None]:
categorical_countplot('geoNetwork.subContinent')

Visits are coming mostly from North America. North America is also the major revenue contributor (seemingly more than 90%). But more quality visits (non-zero revenue) are coming from South America and Eastern Asia.

### Country

In [None]:
categorical_countplot('geoNetwork.country')

From the graphs above, we can say that the United States is the cash cow for Google. Revenue from the United States alone constitutes more than 90% of the total revenue generated by all countries.
Surprisingly, Japan stands out on non-zero revenue visits even though very few visits come from Japan.
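
A share such as "more than 90% of total revenue" can be computed by dividing each group's revenue by the grand total. The sketch below uses synthetic `(country, revenue)` pairs with placeholder country labels, not values from the dataset. `revenue_share` is a hypothetical helper for illustration.

```python
from collections import defaultdict


def revenue_share(rows):
    """Map each group to its fraction of total revenue."""
    totals = defaultdict(float)
    for group, revenue in rows:
        totals[group] += revenue
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {group: 0.0 for group in totals}
    return {group: value / grand_total for group, value in totals.items()}


# Synthetic (country, revenue) pairs, not taken from the dataset.
toy_rows = [("X", 90.0), ("Y", 6.0), ("X", 2.0), ("Z", 2.0)]
shares = revenue_share(toy_rows)
print(shares)
```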

### Traffic Source

In [None]:
categorical_countplot('trafficSource.source')

__Google and YouTube__ (also owned by Google) are themselves the major sources of visits to the merchandise store. However, __mail.googleplex.com__ (it seems to be Google's own mail server used by its employees; I tried visiting it!) is a major contributor of revenue. So, __folks working at Google itself are buying more from their merchandise store__. Also, Facebook and Baidu are not making any significant contribution to revenue (maybe Google is placing its ads on them!).

### Traffic medium

In [None]:
categorical_countplot('trafficSource.medium')

In the training dataset, 7 traffic mediums are used. Organic generates more visits than any other medium. __Organic__ traffic is traffic from search engine results that is earned, not paid! I got this from this [link](https://www.smartbugmedia.com/blog/what-is-the-difference-between-direct-and-organic-search-traffic-sources). But __Referral__ (traffic that occurs when a user finds you through a site other than a major search engine) traffic generates more revenue than the other traffic mediums, although __CPM__ (cost per mille, i.e. cost per thousand impressions) is generating most of the revenue for the Google Merchandise Store.

### hits

In [None]:
hits_query = """SELECT totals.hits as hits, COUNT(*) AS COUNT,coalesce(SUM( totals.transactionRevenue ),0) AS TotalRevenue,
coalesce(AVG( totals.transactionRevenue ),0) AS AvgRevenue
  FROM `kaggle-public-datasets.ga_train_set.ga_sessions_*` 
  GROUP BY totals.hits ORDER BY totals.hits"""

hits_count = ga_bq_train.query_to_pandas_safe(hits_query)

In [None]:
#visits per hit
plt.figure(figsize=(16,8))
sns.barplot(x='hits',y='COUNT',data=hits_count.head(50),color='green')
plt.title('Visits count per hits')

In [None]:
#effect of hits on total revenue and mean revenue
plt.figure(figsize=(16,8))
for i,col in enumerate(['TotalRevenue','AvgRevenue']) :
    ax = plt.subplot(1,2,i+1)
    sns.scatterplot(x='hits',y=col,data=hits_count)
    plt.title(col + ' per hits')
    scale_y = 1e6
    ticks_y = ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x/scale_y))
    ax.yaxis.set_major_formatter(ticks_y)
    plt.xticks(rotation=90)

We can see that as the number of hits increases, the number of visits decreases. Almost the same pattern is observed in the graphs for total revenue and average revenue.

Total revenue decreases steeply as the number of hits increases.
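
The plotting cell above divides each y-axis tick by `1e6` before labelling it. In the Google Analytics BigQuery export, revenue fields are stored multiplied by 10^6, so this rescaling probably brings the axis back to currency units, though that is an assumption for this Kaggle copy of the data. Here is a minimal sketch of the same formatting as a named function. `to_millions` is a hypothetical name.

```python
def to_millions(value, pos=None):
    """Format an axis value after dividing it by one million."""
    scale = 1e6
    return "{0:g}".format(value / scale)


labels = [to_millions(v) for v in (0, 2_500_000, 1_000_000_000)]
print(labels)
```

Since it takes `(value, pos)`, the function could be passed to `ticker.FuncFormatter` in place of the lambda.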

### pageviews

In [None]:
pageviews_query = """SELECT CAST(totals.pageviews as INT64) as pageviews, COUNT(*) AS COUNT,coalesce(SUM( totals.transactionRevenue ),0) AS TotalRevenue,
coalesce(AVG( totals.transactionRevenue ),0) AS AvgRevenue
  FROM `kaggle-public-datasets.ga_train_set.ga_sessions_*` 
  GROUP BY totals.pageviews ORDER BY totals.pageviews"""

pageviews_count = ga_bq_train.query_to_pandas_safe(pageviews_query)

In [None]:
#pageviews counts
plt.figure(figsize=(16,8))
sns.barplot(x='pageviews',y='COUNT',data=pageviews_count.head(30),color='blue')
plt.title('Visit count per pageviews')

In [None]:
#effect of pageviews on total revenue and mean revenue
plt.figure(figsize=(16,8))
for i,col in enumerate(['TotalRevenue','AvgRevenue']) :
    ax = plt.subplot(1,2,i+1)
    sns.scatterplot(x='pageviews',y=col,data=pageviews_count)
    plt.title(col + ' per pageviews')
    scale_y = 1e6
    ticks_y = ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x/scale_y))
    ax.yaxis.set_major_formatter(ticks_y)
    plt.xticks(rotation=90)

Here too, the same trend as for hits is observed, i.e. the visit count decreases as the number of pageviews increases. The same holds for the graphs showing the effect of pageviews on total revenue and average revenue.
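
The pageviews query casts `totals.pageviews` to `INT64`. One reason such a cast can matter, assuming the column might arrive as text (not verified for this dataset), is that text values sort lexicographically rather than numerically, which would scramble both the x-axis order and which rows `head(30)` picks. A small sketch with made-up values:

```python
# Made-up pageview counts held as text, as they might arrive from a string column.
as_text = ["1", "2", "10", "3", "25"]

text_order = sorted(as_text)                      # lexicographic order
numeric_order = sorted(int(v) for v in as_text)   # numeric order after casting

print(text_order)
print(numeric_order)
```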

Next, I will try some feature engineering using BigQuery SQL! Any suggestions and feedback are most welcome!