<h1><center><font size="6">Google Analytics Customer Revenue Extensive EDA</font></center></h1>

<img src="https://lh3.googleusercontent.com/JyFKFXvek5tgUMhZh4FhBrlSKKoq74s53I91nfXdMLJNHg8WzOPSS8DSog4V0FUJOA"></img>

# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Prepare the data analysis</a>  
    -<a href='#21'>Load packages</a>  
    -<a href='#22'>Load the data</a>  
- <a href='#3'>Data exploration</a>   
    -<a href='#31'>Missing data</a>  
    -<a href='#32'>Channel Grouping</a>  
    -<a href='#33'>Social Engagement Type</a>  
    -<a href='#34'>Device attributes</a>  
    -<a href='#35'>Geographical/Network attributes</a>  
    -<a href='#36'>Total attributes</a>  
    -<a href='#37'>Traffic source attributes</a>  
    -<a href='#38'>Date and time</a>  
- <a href='#4'>Conclusions</a>    
- <a href='#5'>References</a>    

# <a id="1">Introduction</a>  

## The competition

In this competition, Google Cloud and Kaggle partnered with [RStudio](http://www.rstudio.com) to challenge Kagglers to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset and predict revenue per customer.     

The data is provided both in *.csv format and in BigQuery format.

In this Kernel we will use the data in *.csv format.

The data fields are as follows:

* **fullVisitorId** - A unique identifier for each user of the Google Merchandise Store.
* **channelGrouping** - The channel via which the user came to the Store.
* **date** - The date on which the user visited the Store.
* **device** - The specifications for the device used to access the Store.
* **geoNetwork** - This section contains information about the geography of the user.
* **sessionId** - A unique identifier for this visit to the store.
* **socialEngagementType** - Engagement type, either "Socially Engaged" or "Not Socially Engaged".
* **totals** - This section contains aggregate values across the session.
* **trafficSource** - This section contains information about the Traffic Source from which the session originated.
* **visitId** - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
* **visitNumber** - The session number for this user. If this is the first session, then this is set to 1.
* **visitStartTime** - The timestamp (expressed as POSIX time).

Some of the data fields are blobs with multiple attributes: **device**, **geoNetwork**, **totals** and **trafficSource**.

We will need to predict the natural log of the sum of all transactions per user. For every user in the test set, the target is:

$$y_{user}=\sum_{i=1}^n {transaction_{user}}_i$$

$$target_{user}=\ln(y_{user}+1)$$
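
As a minimal sketch of how such a per-user target could be computed, the toy example below sums the revenue per user and then takes the log. The values are invented, the column names follow the flattened naming used later in this Kernel, and the use of `np.log1p` (that is, ln(1 + x), so that users without revenue get a target of 0) is an assumption consistent with the `np.log1p` calls further down.

```python
import numpy as np
import pandas as pd

# Toy sessions (invented values); a missing revenue means no transaction
sessions = pd.DataFrame({
    "fullVisitorId": ["001", "001", "002", "003"],
    "totals_transactionRevenue": [10.0, 5.0, np.nan, 20.0],
})

# Sum the revenue per user (missing values are skipped, so they count as 0)
revenue_per_user = (
    sessions.groupby("fullVisitorId")["totals_transactionRevenue"].sum()
)

# Apply ln(1 + x) to the per-user sum
target = np.log1p(revenue_per_user)
print(target)
```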


## This Kernel

The objective of this Kernel is to explore the dataset for the [Google Analytics Customer Revenue Prediction competition](https://www.kaggle.com/c/google-analytics-customer-revenue-prediction). 

We only use the data in *.csv format.

For the **predictive model**, a separate **Kernel** was developed: https://www.kaggle.com/gpreda/ga-customer-revenue-simple-lightgbm, with public **LB 1.6650**.


<a href="#0"><font size="1">Go to top</font></a>


# <a id="2">Prepare the data analysis</a>  



## <a id="21">Load the packages</a>  




In [None]:
import numpy as np 
import pandas as pd 
import json
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions
import datetime as dt
import seaborn as sns 
import matplotlib.pyplot as plt 
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import os
IS_LOCAL=False
if(IS_LOCAL):
    PATH="../google-analytics-customer-revenue-prediction/input/"    
else:
    PATH="../input/"

## <a id="22">Load the data</a>  


We first load the data. Let's see what data files we have.

In [None]:
print(os.listdir(PATH))

Let's first check the columns and types of `train.csv`.

In [None]:
onerow = pd.read_csv(PATH+'train.csv',nrows=1)
pd.concat([onerow.T, onerow.dtypes.T], axis=1, keys=['Example', 'Type'])

Columns **device**, **geoNetwork**, **totals** and **trafficSource** are of type **object** and store data in **json** format. We will create a function that reads the data and creates a separate column for every element in the jsons.    

We will also have to pay attention to the **fullVisitorId** field, which looks like an integer but has to be read as **str** so that we do not lose leading **0** digits.  
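
To illustrate why the `dtype` matters, here is a small sketch on an in-memory CSV. The identifier value is invented for the example.

```python
import io
import pandas as pd

# Toy CSV with an identifier that starts with zeros (invented value)
csv_text = "fullVisitorId,visitId\n0004567,1472830385\n"

# Default parsing infers an integer column and drops the leading zeros
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing `str` keeps the identifier exactly as written in the file
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"fullVisitorId": "str"})

print(as_int["fullVisitorId"][0])
print(as_str["fullVisitorId"][0])
```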

Let's load **train** data. We will use the procedure described in <a href='#5'>[1]</a> to flatten the json objects.

In [None]:
#the columns that will be parsed to extract the fields from the jsons
cols_to_parse = ['device', 'geoNetwork', 'totals', 'trafficSource']

def read_parse_dataframe(file_name):
    #full path for the data file
    path = PATH + file_name
    #read the data file, convert the columns in the list of columns to parse using json loader,
    #convert the `fullVisitorId` field as a string
    data_df = pd.read_csv(path, 
        converters={column: json.loads for column in cols_to_parse}, 
        dtype={'fullVisitorId': 'str'})
    #parse the json-type columns
    for col in cols_to_parse:
        #each column became a dataset, with the columns the fields of the Json type object
        json_col_df = json_normalize(data_df[col])
        json_col_df.columns = [f"{col}_{sub_col}" for sub_col in json_col_df.columns]
        #we drop the object column processed and we add the columns created from the json fields
        data_df = data_df.drop(col, axis=1).merge(json_col_df, right_index=True, left_index=True)
    return data_df
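
To make the flattening step easier to follow, here is a minimal sketch on a toy frame with a single json column. The values are invented, and the sketch assumes pandas >= 1.0, where `pd.json_normalize` is available at the top level.

```python
import json
import pandas as pd

# Toy frame with one JSON column (invented values)
toy = pd.DataFrame({
    "visitId": [1, 2],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse the JSON strings and expand every key into its own column
parsed = pd.json_normalize(toy["device"].apply(json.loads).tolist())

# Prefix the new columns with the name of the original column
parsed.columns = [f"device_{sub_col}" for sub_col in parsed.columns]

# Replace the JSON column with the flattened columns
flat = toy.drop(columns="device").join(parsed)
print(flat.columns.tolist())
```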

In [None]:
%%time
train_df = read_parse_dataframe('train.csv')

Let's check now the dataset shape.

In [None]:
print("Train set:",train_df.shape[0]," rows, ", train_df.shape[1],"columns")

In [None]:
train_df.head()

It seems that **sessionId** is the result of concatenating **fullVisitorId** with **visitId**. The field **visitStartTime** seems to be identical to **visitId** and is most probably a timestamp. Let's check whether the first **visitId** value can be read as a timestamp.



In [None]:
print(dt.datetime.fromtimestamp(train_df['visitId'][0]).isoformat())
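
Both observations could also be checked over a whole frame rather than by eye. The sketch below runs on toy rows with invented values; the `_` separator between the two identifiers is an assumption made for this example.

```python
import pandas as pd

# Toy rows (invented values); the "_" separator is an assumption
toy = pd.DataFrame({
    "fullVisitorId": ["0001", "0002"],
    "visitId": [1472830385, 1472880147],
    "visitStartTime": [1472830385, 1472880150],
    "sessionId": ["0001_1472830385", "0002_1472880147"],
})

# Rebuild sessionId from its assumed parts and compare with the stored value
rebuilt = toy["fullVisitorId"] + "_" + toy["visitId"].astype(str)
share_matching_session = (rebuilt == toy["sessionId"]).mean()

# Share of rows where visitId and visitStartTime are identical
share_same_time = (toy["visitId"] == toy["visitStartTime"]).mean()

print("sessionId rebuilt correctly:", share_matching_session)
print("visitId equals visitStartTime:", share_same_time)
```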

We will explore these in more detail in the following section. For now, let's just extract the date components from the date field.

In [None]:
def process_date_time(data_df):
    data_df['date'] = data_df['date'].astype(str)
    data_df["date"] = data_df["date"].apply(lambda x : x[:4] + "-" + x[4:6] + "-" + x[6:])
    data_df["date"] = pd.to_datetime(data_df["date"])   
    data_df["year"] = data_df['date'].dt.year
    data_df["month"] = data_df['date'].dt.month
    data_df["day"] = data_df['date'].dt.day
    data_df["weekday"] = data_df['date'].dt.weekday
    return data_df
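
As an alternative sketch, the string slicing can be replaced by passing an explicit format to `pd.to_datetime`. The toy dates below are invented, and the `YYYYMMDD` layout is assumed from the slicing used in `process_date_time`.

```python
import pandas as pd

# Toy dates stored as integers in YYYYMMDD layout (invented values)
toy = pd.DataFrame({"date": [20160902, 20170126]})

# Parse with an explicit format instead of slicing the string by hand
toy["date"] = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")

# Derive the same calendar features as process_date_time
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["day"] = toy["date"].dt.day
toy["weekday"] = toy["date"].dt.weekday  # Monday is 0
print(toy)
```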

In [None]:
train_df = process_date_time(train_df)

Let's check again the dataset shape.

In [None]:
print("Train set:",train_df.shape[0]," rows, ", train_df.shape[1],"columns")

Let's also load the test data and process it in the same way.

In [None]:
%%time
test_df = read_parse_dataframe('test.csv')
test_df = process_date_time(test_df)

In [None]:
print("Test set:",test_df.shape[0]," rows, ", test_df.shape[1],"columns")

# <a id="3">Data exploration</a>  


## <a id="31">Missing data</a>


Let's check if there are columns with missing data. We will only show the columns with missing data.

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    df = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return df.loc[~(df['Total']==0)]
missing_data(train_df)

Some of the columns in the train dataset (8) have more than 97% of their values missing; the majority of the columns with missing values come from **trafficSource**.   

These fields (especially the ones with a high percentage of missing values) are candidates to be dropped when we create a predictive model.    

Let's check the status for the test dataset.



In [None]:
missing_data(test_df)

Some of the features with a very high percentage of missing data in the **train** set have a lower percentage of missing data in the **test** set (~93%), and some do not appear at all in the list of fields with missing values in the **test** set.     

We can also see that there are fields that appear only in the **train** set, for example **trafficSource_campaignCode**. We will have to consider these aspects when we decide which features to drop and which to keep for the predictive model.
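
A quick way to list the fields that exist in only one of the two sets is a set difference on the column names. The sketch below uses two empty toy frames whose column lists are chosen only for illustration.

```python
import pandas as pd

# Toy frames; only the column names matter here (chosen for illustration)
train_toy = pd.DataFrame(columns=["channelGrouping",
                                  "trafficSource_campaignCode",
                                  "totals_transactionRevenue"])
test_toy = pd.DataFrame(columns=["channelGrouping"])

# Columns present in one frame but not in the other
only_in_train = sorted(set(train_toy.columns) - set(test_toy.columns))
only_in_test = sorted(set(test_toy.columns) - set(train_toy.columns))

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```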


<a href="#0"><font size="1">Go to top</font></a>

## <a id="32">Channel Grouping</a>

Let's check the channelGrouping data distribution.

In [None]:
fig, (ax) = plt.subplots(nrows=1,figsize=(8,4))
sns.countplot(x=train_df['channelGrouping'], ax=ax)
plt.title("Channel Grouping")
plt.show()

Let's look in more detail at the **channelGrouping** classes.

In [None]:
def get_feature_distribution(data, feature):
    # Get the count for each label
    label_counts = data[feature].value_counts()
    # Get total number of samples
    total_samples = len(data)
    # Count the number of items in each class
    for i in range(len(label_counts)):
        label = label_counts.index[i]
        count = label_counts.values[i]
        percent = int((count / total_samples) * 10000)/100
        print("{:<30s}:   {} or {}%".format(label, count, percent))

get_feature_distribution(train_df,'channelGrouping')
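
The same count and percentage summary can be sketched without an explicit loop, using `value_counts` with `normalize=True`. The toy series below contains invented values.

```python
import pandas as pd

# Toy feature (invented values)
toy = pd.Series(["Organic Search", "Organic Search", "Direct", "Referral"],
                name="channelGrouping")

# Counts and percentages per class, aligned on the class label
summary = pd.DataFrame({
    "count": toy.value_counts(),
    "percent": (toy.value_counts(normalize=True) * 100).round(2),
})
print(summary)
```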

## <a id="33">Social Engagement Type</a>

Let's check the Social engagement type data distribution.

In [None]:
get_feature_distribution(train_df,'socialEngagementType')

Only the **Not Socially Engaged** type is present. For prediction, this field is a very good candidate to be dropped.  

<a href="#0"><font size="1">Go to top</font></a>

## <a id="34">Device attributes</a>  

Let's check the device fields.

In [None]:
device_cols = train_df.columns[train_df.columns.str.contains('device')].T.tolist()
print("There are ",len(device_cols),"columns with device attributes:\n",device_cols)

Before plotting the number of visits per device attribute, let's check whether there are device attributes that have a single (constant) value.

In [None]:
const_device_cols = []
for i, col in enumerate(device_cols):
    if(len(train_df[col].value_counts())==1):
        const_device_cols.append(col)
print("There are ",len(const_device_cols),"columns with unique value for device attributes:\n",const_device_cols)

The columns with a constant value will be dropped for the model.

We will only show the number of visits for the device attributes that have more than one value. 
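
Since the same constant-column check is repeated for several attribute groups, it could be wrapped in a small helper. This is only a sketch: `constant_columns` is a hypothetical name, and the toy values are invented.

```python
import pandas as pd

def constant_columns(data, columns):
    """Return the columns that hold exactly one distinct non-missing value."""
    return [col for col in columns if data[col].nunique() == 1]

# Toy frame (invented values)
toy = pd.DataFrame({
    "device_browser": ["Chrome", "Safari", "Chrome"],
    "device_language": ["n/a", "n/a", "n/a"],
})

const_cols = constant_columns(toy, toy.columns)
reduced = toy.drop(columns=const_cols)
print(const_cols)
print(reduced.columns.tolist())
```

Like `len(value_counts()) == 1`, `nunique()` ignores missing values by default, so the helper flags the same columns as the loop above.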

In [None]:
def show_features(data,features,width=6,height=6):
    for i,feature in enumerate(features):
        f, ax = plt.subplots(1,1, figsize=(width,height))
        sns.countplot(x=data[feature], order=data[feature].value_counts().iloc[:50].index, ax=ax)
        plt.xticks(rotation=90)
        plt.show()

In [None]:
var_cols = [item for item in device_cols if item not in const_device_cols]
show_features(train_df,var_cols, width=12,height=6)

The majority of the visits use devices with Windows OS and the Chrome browser, from a desktop. Among the mobile devices, the majority are phones. 

The most used operating systems are Windows, Macintosh, Android, iOS and Linux.  
The most used browsers are Chrome, Safari, Firefox, Internet Explorer and Edge.    
Let's check how a few of these features relate to each other.


In [None]:
def plot_heatmap_count(data_df, feature1, feature2, feature3='channelGrouping', color="Greens", title="", height=16, width=16):
    tmp = data_df.groupby([feature1, feature2])[feature3].count()
    df1 = tmp.reset_index()
    matrix = df1.pivot(index=feature1, columns=feature2, values=feature3)
    fig, (ax1) = plt.subplots(ncols=1, figsize=(width,height))
    sns.heatmap(matrix, 
        xticklabels=matrix.columns,
        yticklabels=matrix.index,ax=ax1,linewidths=.1,linecolor='black',annot=True,cmap=color)
    plt.title(title, fontsize=14)
    plt.show()
    
def plot_heatmap_sum(data_df, feature1, feature2, feature3='channelGrouping', color="Greens", title="", height=16, width=16):
    tmp = data_df.groupby([feature1, feature2])[feature3].sum()
    df1 = tmp.reset_index()
    matrix = df1.pivot(index=feature1, columns=feature2, values=feature3)
    fig, (ax1) = plt.subplots(ncols=1, figsize=(width,height))
    sns.heatmap(matrix, 
        xticklabels=matrix.columns,
        yticklabels=matrix.index,ax=ax1,linewidths=.1,linecolor='black',annot=True,cmap=color)
    plt.title(title, fontsize=14)
    plt.show()
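
The two helpers differ only in the aggregation they apply before pivoting. As a sketch of that idea, `pivot_table` with an `aggfunc` argument builds the same kind of matrix in one step; the toy rows below are invented.

```python
import pandas as pd

# Toy visits (invented values)
toy = pd.DataFrame({
    "device_browser": ["Chrome", "Chrome", "Safari", "Chrome"],
    "device_operatingSystem": ["Windows", "Windows", "iOS", "Android"],
    "totals_transactionRevenue": [10.0, 5.0, 0.0, 2.0],
})

def build_matrix(data, feature1, feature2, feature3, aggfunc):
    # Rows are feature1, columns are feature2, cells aggregate feature3
    return data.pivot_table(index=feature1, columns=feature2,
                            values=feature3, aggfunc=aggfunc)

counts = build_matrix(toy, "device_browser", "device_operatingSystem",
                      "totals_transactionRevenue", "count")
sums = build_matrix(toy, "device_browser", "device_operatingSystem",
                    "totals_transactionRevenue", "sum")
print(counts)
print(sums)
```

Combinations that never occur stay empty (NaN) in the matrix, which the heatmap leaves blank.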

In [None]:
plot_heatmap_count(train_df, 'device_browser', 'device_operatingSystem',color='Reds',title="Device Browsers vs. Device OS")

Chrome on Windows is the most frequent combination, followed by Chrome on Macintosh, Chrome on Android, Safari on Macintosh and Safari on iOS.

In [None]:
plot_heatmap_count(train_df, 'device_browser','device_deviceCategory', color='Blues',title="Device Browser vs. Device Category", height=12, width=8)

Chrome on desktop is the most frequent browser and device category combination, followed by Chrome on mobile and then Safari on desktop and on mobile.

In [None]:
plot_heatmap_count(train_df, 'device_deviceCategory', 'device_isMobile', color='viridis',title="Device is mobile vs. Device Category", width=6, height=4)

We can observe some inconsistencies: there are desktop devices flagged as mobile (110), as well as tablet (14) and mobile (150) devices flagged as not mobile. 
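
Such inconsistent rows could be isolated with a boolean mask. The sketch below uses invented toy rows and assumes that the `mobile` and `tablet` categories are the ones expected to carry the mobile flag.

```python
import pandas as pd

# Toy rows (invented values)
toy = pd.DataFrame({
    "device_deviceCategory": ["desktop", "desktop", "mobile", "tablet", "mobile"],
    "device_isMobile": [False, True, True, False, True],
})

# Assumption: phones and tablets are expected to be flagged as mobile
expected_mobile = toy["device_deviceCategory"].isin(["mobile", "tablet"])

# Keep the rows where the flag disagrees with the category
inconsistent = toy[expected_mobile != toy["device_isMobile"]]
print(inconsistent.groupby(["device_deviceCategory", "device_isMobile"]).size())
```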

<a href="#0"><font size="1">Go to top</font></a>

## <a id="35">Geographical/Network attributes</a>

Let's check the geographical/network attributes. As we did with the devices attributes, we will first gather all columns with **geoNetwork** in the name.

In [None]:
geo_cols = train_df.columns[train_df.columns.str.contains('geoNetwork')].T.tolist()
print("There are ",len(geo_cols),"columns with geoNetwork attributes:\n",geo_cols)

Before plotting the number of visits per geoNetwork attribute, let's check whether there are geoNetwork attributes that have a single (constant) value.

In [None]:
const_geo_cols = []
for i, col in enumerate(geo_cols):
    if(len(train_df[col].value_counts())==1):
        const_geo_cols.append(col)
print("There are ",len(const_geo_cols),"columns with unique value for geoNetwork attributes:\n",const_geo_cols)

These columns are candidates to be dropped from the model.  For the rest of the columns, we show the number of the visits per each attribute. 
Note: We limit the number of shown values/categories to 50, showing the most numerous first.

In [None]:
var_cols = [item for item in geo_cols if item not in const_geo_cols]
show_features(train_df,var_cols,16,6)

Not all the cities, network domains, metropolitan areas and continents are set.   
For city, metropolitan area and network domain, the most frequent value is a placeholder marking the information as not available in the dataset.  
The continent with the largest number of visits is the Americas. The sub-continent with the largest number of visits is Northern America.

Let's also show the geographical features on a plotly map. We will show the country feature distribution.

In [None]:
tmp = train_df['geoNetwork_country'].value_counts()
country_visits = pd.DataFrame(data={'geoNetwork_country': tmp.values}, index=tmp.index).reset_index()
country_visits.columns = ['Country', 'Visits']

In [None]:
def plot_country_map(data, location, z, legend, title, colormap='Rainbow'):
    data = dict(type = 'choropleth', 
                colorscale = colormap,
                autocolorscale = False,
                reversescale = False,
               locations = data[location],
               locationmode = 'country names',
               z = data[z], 
               text = data[z],
               colorbar = {'title':legend})
    layout = dict(title = title, 
                 geo = dict(showframe = False, 
                         projection = {'type': 'natural earth'}))
    choromap = go.Figure(data = [data], layout=layout)
    iplot(choromap)

In [None]:
plot_country_map(country_visits, 'Country', 'Visits', 'Visits', 'Visits per country')



<a href="#0"><font size="1">Go to top</font></a>

## <a id="36">Totals attributes</a>  

Let's check the **totals** attributes. 

In [None]:
tot_cols = train_df.columns[train_df.columns.str.contains('totals')].T.tolist()
print("There are ",len(tot_cols),"columns with Totals attributes:\n",tot_cols)

Let's check if there are columns with a single value.

In [None]:
const_tot_cols = []
for i, col in enumerate(tot_cols):
    if(len(train_df[col].value_counts())==1):
        const_tot_cols.append(col)
print("There are ",len(const_tot_cols),"columns with unique value for Totals attributes:\n",const_tot_cols)

In [None]:
var_cols = [item for item in tot_cols if item not in const_tot_cols]
show_features(train_df,var_cols,12,4)

For **totals_transactionRevenue**, let's also show the values distribution.

In [None]:
train_df['totals_transactionRevenue'] = pd.to_numeric(train_df['totals_transactionRevenue'])
df = train_df[train_df['totals_transactionRevenue'] > 0]['totals_transactionRevenue']
f, ax = plt.subplots(1,1, figsize=(16,4))
plt.title("Distribution of totals: transaction revenue")
sns.kdeplot(df, color="green")
plt.tick_params(axis='both', which='major', labelsize=12)
plt.ylabel('Density plot', fontsize=12)
plt.xlabel('Transaction revenue', fontsize=12)
locs, labels = plt.xticks()
plt.show()

Let's check as well the log of the total transaction revenue.

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(np.log1p(df),color="darkgreen",bins=50)
plt.xlabel("Log(total transaction revenue)");
plt.title("Logarithmic distribution of total transaction revenue (non-zeros)");

Let's show the distribution of visits per country, taking into consideration only the visits with non-zero transactions.

In [None]:
# select the visits with non-zero transaction revenue
non_zero = train_df[train_df['totals_transactionRevenue']>0]
tmp = non_zero['geoNetwork_country'].value_counts()
country_visits = pd.DataFrame(data={'geoNetwork_country': tmp.values}, index=tmp.index).reset_index()
country_visits.columns = ['Country', 'Visits']

In [None]:
plot_country_map(country_visits, 'Country', 'Visits', 'Visits', 'Visits with non zero transactions')

In [None]:
# select the visits with non-zero transaction revenue and calculate the sums
non_zero = train_df[train_df['totals_transactionRevenue']>0]
tmp = non_zero.groupby(['geoNetwork_country'])['totals_transactionRevenue'].sum()
country_total = pd.DataFrame(data={'total': tmp.values}, index=tmp.index).reset_index()
country_total.columns = ['Country', 'Total']
country_total['Total']  = np.log1p(country_total['Total'])

In [None]:
plot_country_map(country_total, 'Country', 'Total', 'Total(log)', 'Total revenues per country (log scale)')

We can observe that the majority of the visits with non-zero transactions and most of the revenue come from the US.   

Let's check the top 10 of the transaction revenue.

In [None]:
non_zero[['fullVisitorId','visitNumber', 'totals_transactionRevenue', 'channelGrouping']].sort_values(['totals_transactionRevenue', 'fullVisitorId'], ascending=[0,0]).head(10)

We can see that in the top 10 by total transaction revenue there is one fullVisitorId (1957458976293878100) that appears 5 times.

Let's check also the top 10 of visit number.

In [None]:
non_zero[['fullVisitorId','visitNumber', 'totals_transactionRevenue', 'channelGrouping']].sort_values(['visitNumber', 'totals_transactionRevenue'], ascending=[0,0]).head(10)

One fullVisitorId (1957458976293878100) occupies all the positions in the top 10 (by visitNumber).

Let's calculate the sum of the transaction revenue per **channelGrouping**.

In [None]:
# select the visits with non-zero transaction revenue and calculate the sums
non_zero = train_df[train_df['totals_transactionRevenue']>0]
tmp = non_zero.groupby(['channelGrouping', 'geoNetwork_subContinent'])['totals_transactionRevenue'].sum()
channel_total = pd.DataFrame(data={'total': tmp.values}, index=tmp.index).reset_index()
channel_total.columns = ['Channel', 'Subcontinent', 'Total']

In [None]:
plot_heatmap_sum(non_zero, 'geoNetwork_subContinent','channelGrouping',  'totals_transactionRevenue','rainbow',"Total transactions per channel and subcontinent", width=16, height=6)

Most of the transaction revenue comes from Northern America, through the Referral, Direct and Organic Search channels.

<a href="#0"><font size="1">Go to top</font></a>


## <a id="37">Traffic Source attributes</a>  

In [None]:
ts_cols = train_df.columns[train_df.columns.str.contains('trafficSource')].T.tolist()
print("There are ",len(ts_cols),"columns with Traffic Source attributes:\n",ts_cols)

Let's check the **trafficSource** attributes that have a single value.

In [None]:
const_ts_cols = []
for i, col in enumerate(ts_cols):
    if(len(train_df[col].value_counts())==1):
        const_ts_cols.append(col)
print("There are ",len(const_ts_cols),"columns with unique value for Traffic Source attributes:\n",const_ts_cols)

We will plot only the attributes of Traffic Source with multiple categories.

In [None]:
var_cols = [item for item in ts_cols if item not in const_ts_cols]
show_features(train_df,var_cols,12,4)

Google Merchandise Collection is the most frequent **adContent** value.  

The majority of the campaign attributes are not set, and the majority of keywords are not provided. Organic and referral account for the majority of mediums.   
The most important traffic source is google, followed by youtube.com. 



<a href="#0"><font size="1">Go to top</font></a>

## <a id="38">Date and time</a>  


Let's check the distribution of values for date and time features. 

In [None]:
var_cols = ['year','month','day','weekday']
show_features(train_df,var_cols,12,4)

Let's plot the number of visits vs. date and the amount of transaction revenue vs. date for the train set.

First, let's show the total number of visits per day, including the visits without a transaction.

In [None]:
def plot_scatter_data(data, xtitle, ytitle, title, color='blue'):
    trace = go.Scatter(
        x = data.index,
        y = data.values,
        name=ytitle,
        marker=dict(
            color=color,
        ),
        mode='lines+markers'
    )
    data = [trace]
    layout = dict(title = title,
              xaxis = dict(title = xtitle), yaxis = dict(title = ytitle),
             )
    fig = dict(data=data, layout=layout)
    iplot(fig, filename='lines')

In [None]:
count_all = train_df.groupby('date')['totals_transactionRevenue'].agg(['size'])
count_all.columns = ["Total"]
count_all = count_all.sort_index()
plot_scatter_data(count_all['Total'],'Date', 'Total','Total count of visits (including zero transactions)','green')

Then, let's see the number of visits  with non-zero transactions per day.

In [None]:
count_nonzero = train_df.groupby('date')['totals_transactionRevenue'].agg(['count'])
count_nonzero.columns = ["Total"]
count_nonzero = count_nonzero.sort_index()
plot_scatter_data(count_nonzero['Total'],'Date', 'Total','Total non-zero transaction visits','darkblue')
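
The two plots above rely on the difference between `size` and `count`: `size` counts every row in a group, while `count` skips missing values. A small sketch on invented toy data, where a missing revenue stands for a visit without a transaction:

```python
import numpy as np
import pandas as pd

# Toy visits (invented values); NaN revenue means no transaction
toy = pd.DataFrame({
    "date": ["2016-09-02", "2016-09-02", "2016-09-02", "2016-09-03"],
    "totals_transactionRevenue": [np.nan, 10.0, np.nan, np.nan],
})

# `size` counts all visits, `count` only the visits with a revenue value
per_day = toy.groupby("date")["totals_transactionRevenue"].agg(["size", "count"])
print(per_day)
```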

Let's plot the total amount of transactions per day.

In [None]:
total_nonzero = train_df.groupby('date')['totals_transactionRevenue'].agg(['sum'])
total_nonzero.columns = ["Total"]
total_nonzero = total_nonzero.sort_index()
plot_scatter_data(total_nonzero['Total'],'Date', 'Total','Total non-zero transaction amounts','red')

Let's plot the total amount of non-zero transactions per day,  grouped by **channelGrouping**.

In [None]:
channels = list(train_df['channelGrouping'].unique())
data = []
for channel in channels:
    subset = train_df[train_df['channelGrouping']==channel]
    subset = subset.groupby('date')['totals_transactionRevenue'].agg(['sum'])
    subset.columns = ["Total"]
    subset = subset.sort_index()
    trace = go.Scatter(
        x = subset['Total'].index,
        y = subset['Total'].values,
        name=channel,
        mode='lines'
    )
    data.append(trace)
layout= go.Layout(
    title= 'Total amount of non-zero transactions per day, grouped by channel',
    xaxis = dict(title = 'Date'), yaxis = dict(title = 'Total'),
    showlegend=True,
)
fig = dict(data=data, layout=layout)
iplot(fig, filename='lines')

Let's plot the total amount of non-zero transactions per day,  grouped by **device_operatingSystem**.

In [None]:
opsys = list(train_df['device_operatingSystem'].unique())
data = []
# use `opsys_name` as loop variable to avoid shadowing the imported `os` module
for opsys_name in opsys:
    subset = train_df[train_df['device_operatingSystem']==opsys_name]
    subset = subset.groupby('date')['totals_transactionRevenue'].agg(['sum'])
    subset.columns = ["Total"]
    subset = subset.sort_index()
    trace = go.Scatter(
        x = subset['Total'].index,
        y = subset['Total'].values,
        name=opsys_name,
        mode='lines'
    )
    data.append(trace)
layout= go.Layout(
    title= 'Total amount of non-zero transactions per day, grouped by OS',
    xaxis = dict(title = 'Date'), yaxis = dict(title = 'Total'),
    showlegend=True,
)
fig = dict(data=data, layout=layout)
iplot(fig, filename='lines')


Let's plot now the number of visits per day in the test set.

In [None]:
total_test = test_df.groupby('date')['fullVisitorId'].agg(['count'])
total_test.columns = ["Total"]
total_test = total_test.sort_index()
plot_scatter_data(total_test['Total'],'Date', 'Total','Total count of visits per day (test set)','magenta')

Let's plot the average amount of transactions, grouped by **totals_pageviews**.
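
A minimal sketch of the aggregation behind such a plot, on toy data with invented values. The column name `totals_pageviews` is assumed from the `totals_` prefix produced by `read_parse_dataframe`, and treating a missing revenue as 0 is an assumption as well.

```python
import numpy as np
import pandas as pd

# Toy visits (invented values); pageviews arrive as strings after flattening
toy = pd.DataFrame({
    "totals_pageviews": ["1", "1", "12", "12", "30"],
    "totals_transactionRevenue": [np.nan, np.nan, 20.0, np.nan, 50.0],
})

# Convert the page view counts to numbers so that they sort correctly
toy["totals_pageviews"] = pd.to_numeric(toy["totals_pageviews"])

# Assumption: a visit without a transaction contributes 0 revenue
toy["revenue"] = toy["totals_transactionRevenue"].fillna(0)

# Average revenue per number of page views
mean_by_pageviews = toy.groupby("totals_pageviews")["revenue"].mean().sort_index()
print(mean_by_pageviews)
```

The resulting series has the same shape as the ones passed to `plot_scatter_data`, so it could be plotted in the same way.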

# <a id="4">Conclusions</a>  


Preliminary analysis shows that there is a considerable number of columns with missing content or with a single value. These columns are candidates to be dropped. @mlisovyi observed that some of the features with a single value also have missing values, so such features could be treated as binary (value present or missing) and we should reconsider whether they need to be dropped.  
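
As a sketch of that last idea, a column with a single value plus missing entries can be turned into a 0/1 indicator instead of being dropped. The column names and values below are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values, hypothetical column names)
toy = pd.DataFrame({
    "trafficSource_flagA": ["true", np.nan, "true", np.nan],
    "device_constB": ["n/a", "n/a", "n/a", "n/a"],
})

for col in list(toy.columns):
    single_value = toy[col].nunique() == 1   # missing values are ignored
    has_missing = toy[col].isnull().any()
    if single_value and has_missing:
        # value present -> 1, value missing -> 0
        toy[col + "_present"] = toy[col].notna().astype(int)

print(toy.columns.tolist())
```

A column such as `device_constB`, with one value and nothing missing, carries no information and remains a candidate to be dropped.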

For the predictive model, there is this Kernel available: https://www.kaggle.com/gpreda/ga-customer-revenue-simple-lightgbm, using the results of this data analysis.



# <a id="5">References</a>  

[1] [Julián Peller](https://www.kaggle.com/julian3833), 1 - Quick start: read csv and flatten json fields, https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields   
[2] [Shivam Bansal](https://www.kaggle.com/shivamb), Exploratory Analysis - GA Customer Revenue, https://www.kaggle.com/shivamb/exploratory-analysis-ga-customer-revenue/   
[3] [SRK](https://www.kaggle.com/sudalairajkumar), Simple Exploration+Baseline - GA Customer Revenue,  https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue   

<a href="#0"><font size="1">Go to top</font></a>