In [None]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Overview
This notebook automates the analysis and SQL coding needed to create an input dataset for a propensity model. More specifically, it is designed for binning and dimensionality reduction of categorical variables from the Google Analytics BigQuery Export.

The notebook lets the user pass standard and custom dimensions from the Google Analytics BigQuery Export into a function that returns a summary table of the top values in each dimension, along with SQL code to use when creating a propensity model's input dataset.

## Dataset
This notebook is designed to work with categorical variables, both standard and custom dimensions, from the Google Analytics BigQuery Export.

## Objective
The goal of the notebook is to speed up creation of the input dataset by automating summary tables and SQL code for each categorical variable in the Google Analytics BigQuery Export.

This allows the user to quickly determine which values within each categorical variable should and shouldn't be included in the input dataset, then easily port (copy/paste) the corresponding SQL code into their SQL environment to create the input dataset.
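
To illustrate that last step, the sketch below shows one way the copied lines might be assembled into an input dataset query. The flag lines follow the format this notebook prints; the `build_input_query` helper, the grouping by `fullVisitorId`, and the table name are assumptions made for illustration and are not part of the notebook.

```python
# Hypothetical helper: wraps copied CASE lines in a per-user aggregate query.
def build_input_query(table, flag_lines):
    # Drop the trailing comma on each line; commas are re-added by the join below.
    flags = [line.rstrip().rstrip(",") for line in flag_lines]
    select_list = ",\n  ".join(["fullVisitorId"] + flags)
    return (
        f"SELECT\n  {select_list}\n"
        f"FROM `{table}`\n"
        "GROUP BY fullVisitorId"
    )

# Example lines in the format printed by the notebook (values are illustrative).
flag_lines = [
    "Max(CASE WHEN device.browser='Chrome' THEN 1 ELSE 0 END) AS devicebrowser_Chrome,",
    "Max(CASE WHEN device.browser='Safari' THEN 1 ELSE 0 END) AS devicebrowser_Safari,",
]

# The table name is a placeholder, not a real dataset.
query = build_input_query("my-project.my_dataset.ga_sessions_*", flag_lines)
print(query)
```

A real input dataset query would also need the date and country filters used elsewhere in this notebook, plus any `UNNEST` clauses required by hit-level dimensions.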

## Cost
This tutorial uses billable components of Google Cloud Platform (GCP):

- Cloud AI Platform
- Cloud Storage

Learn about Cloud AI Platform pricing and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.




### Set up your local development environment

**If you are using Colab or AI Platform Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

3. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3.

4. Activate that environment and run `pip install jupyter` in a shell to install
   Jupyter.

5. Run `jupyter notebook` in a shell to launch Jupyter.

6. Open this notebook in the Jupyter Notebook Dashboard.

### Set up your GCP project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager) When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)

4. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

In [None]:
PROJECT_ID = "" #@param {type:"string"}
! gcloud config set project $PROJECT_ID

### Authenticate your GCP account

**If you are using AI Platform Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the GCP Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. From the **Service account** drop-down list, select **New service account**.

3. In the **Service account name** field, enter a name.

4. From the **Role** drop-down list, select
   **Machine Learning Engine > AI Platform Admin** and
   **Cloud Storage > Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

if 'google.colab' in sys.modules:
  from google.colab import auth as google_auth
  google_auth.authenticate_user()

# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
else:
  %env GOOGLE_APPLICATION_CREDENTIALS ''

## Imports & Inputs

In [None]:
# Install sidetable because it is not pre-installed in Colab
!pip install sidetable==0.6.0

sidetable documentation: https://pbpython.com/sidetable.html

In [None]:
# Python Library Imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import pandas as pd
import re
import sidetable

**Set Dataset Parameters Below**

In [None]:
#@title Dataset Parameters
# Google Analytics BigQuery Export Parameters
project_id_billing = "" #@param {type:"string"}
dataset_id = "bigquery-public-data.google_analytics_sample" #@param {type:"string"}
table_id = "ga_sessions_*" #@param {type:"string"}
start_date = "20170801" #@param {type:"string"}
end_date = "20170801" #@param {type:"string"}
country = "United States" #@param {type:"string"}



**Set Default Summary Table Parameters Below**

These are the default values for the functions. You can always specify different values in a function's arguments.
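
As a rough illustration of how `threshold` and `max_rows` interact, the following standard-library sketch mimics a thresholded frequency table. It is not sidetable's implementation; the `top_values` helper and the sample counts are made up for illustration, and sidetable's exact behavior at the cut-off boundary may differ.

```python
def top_values(counts, threshold=95, max_rows=10):
    """Return (value, count, cumulative_percent) rows for the top values."""
    total = sum(counts.values())
    rows, cumulative = [], 0
    # Walk values from most to least frequent.
    for value, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += count
        cumulative_percent = 100 * cumulative / total
        # Stop once a value would push past the threshold or the row limit is hit.
        if cumulative_percent > threshold or len(rows) == max_rows:
            break
        rows.append((value, count, round(cumulative_percent, 1)))
    return rows

# Hypothetical visit counts per browser.
counts = {"Chrome": 700, "Safari": 200, "Firefox": 60, "Edge": 30, "Opera": 10}

# The threshold trims the long tail: only Chrome and Safari stay within 95%.
print(top_values(counts, threshold=95, max_rows=10))
# With no effective threshold, max_rows caps the output instead.
print(top_values(counts, threshold=100, max_rows=3))
```

In this sketch the threshold decides which values count as long tail, while `max_rows` only limits how many of the remaining rows are shown.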

In [None]:
#@title Summary Table Parameters
#@markdown **Threshold**: Unique values that contribute to the top X%. Eliminates long-tail results
threshold = 95 #@param {type:"slider", min: 0, max: 100, step:5}
#@markdown **Max Rows**: Maximum number of rows to display. Helps when there are many results that fall under the threshold
max_rows = 10 #@param {type:"integer"}
#@markdown **Metric**: Calculations will be done based on Visits or Unique Users
metric = 'visits' #@param ["visits", "user_cnt"]

## Standard Dimension Function

Run the cell below to create the Feature Buckets function for standard dimensions.

In [None]:
def featureBuckets(dimension_list, threshold=threshold, max_rows=max_rows):

  for field in dimension_list:
    try:
      # Run SQL based on parameters
      sql = f"""
            SELECT
              '{field}' as dimension,
              {field} as value,
              count(distinct clientId) as user_cnt,
              count(distinct concat(fullVisitorId, visitId)) as visits
            FROM `{dataset_id}.{table_id}` AS visits
              ,UNNEST(visits.hits) as hits
            WHERE 
              _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
              AND geoNetwork.Country = '{country}'
            GROUP BY 1, 2
            """
      
      # Create frequency distribution DataFrame based on threshold and metric
      df = pd.read_gbq(sql, project_id=project_id_billing, dialect='standard') # Create DataFrame from SQL statement
      df = df.stb.freq(['value'], value=metric, thresh=threshold) # Use sidetable to create frequency distribution
      df['dimension'] = field # Add column specifying dimension
      # Select columns by the chosen metric so that 'user_cnt' works as well as 'visits'
      display(df[['dimension','value', metric, 'percent', f'cumulative_{metric}', 'cumulative_percent']][:max_rows])
      print("")

      # Print a SQL binary variable for each top value of the dimension, limited by threshold and max_rows
      i = 1
      for index, row in df[:-1].iterrows():
        # str() keeps re.sub from failing on non-string values such as booleans
        print("Max(CASE WHEN ",row['dimension'],"='",row['value'],"' THEN 1 ELSE 0 END) AS ",
              re.sub('[\\\\/:*?" <>|._()&,-]','',row['dimension']),'_',
              re.sub('[\\\\/:*?" <>|._()&,-]','',str(row['value'])),",",sep="")
        if i == max_rows:
          break
        i += 1
      print("")
    
    except Exception as e:
      # Report the failing dimension and the reason instead of hiding the error
      print('Could not run', field, '-', e)

In [None]:
# Example of Output
featureBuckets(['device.browser'])

### Run Function

- Pass lists of dimensions into the function to get summary tables and SQL code that you can copy/paste directly into your input data SQL code

- Note: in each function you can also pass arguments for `max_rows` and `threshold`, so you don't always have to use the default parameters set at the beginning of this notebook.

- Included already are standard dimensions for device, traffic source, page, geo, and ecommerce.
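
To make the printed SQL easier to read, here is a standalone sketch of how one line is built from a dimension and a value, using the same character-stripping pattern as `featureBuckets`. The `flag_line` helper name and the sample values are illustrative assumptions.

```python
import re

# Same pattern as featureBuckets: characters removed from the column alias.
ALIAS_STRIP = '[\\\\/:*?" <>|._()&,-]'

def flag_line(dimension, value):
    """Build one binary-flag SQL line for a dimension/value pair."""
    alias = re.sub(ALIAS_STRIP, '', dimension) + '_' + re.sub(ALIAS_STRIP, '', str(value))
    return f"Max(CASE WHEN {dimension}='{value}' THEN 1 ELSE 0 END) AS {alias},"

print(flag_line('device.browser', 'Chrome'))
print(flag_line('device.operatingSystem', 'Windows Phone'))
```

Dots, spaces, and similar characters are stripped only from the alias; the comparison inside the `CASE` expression keeps the original dimension path and value.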

#### Device

In [None]:
device = [
          'device.browser',
          'device.browserSize',
          'device.browserVersion',
          'device.deviceCategory',
          'device.mobileDeviceInfo',
          'device.mobileDeviceMarketingName',
          'device.mobileDeviceModel',
          'device.mobileInputSelector',
          'device.operatingSystem',
          'device.operatingSystemVersion',
          'device.mobileDeviceBranding',
          'device.flashVersion',
          'device.javaEnabled',
          'device.language',
          'device.screenColors',
          'device.screenResolution'
          ]

In [None]:
featureBuckets(device, threshold=95)

#### Traffic Source

In [None]:
traffic_source = [
                  'trafficSource.adContent',
                  'trafficSource.campaign',
                  'trafficSource.campaignCode',
                  'trafficSource.isTrueDirect',
                  'trafficSource.keyword',
                  'trafficSource.medium',
                  'trafficSource.referralPath',
                  'trafficSource.source',
                  'channelGrouping'
                  ]

In [None]:
featureBuckets(traffic_source, threshold=90)

#### Geo


In [None]:
geo = [
      'geoNetwork.subContinent',
      'geoNetwork.country',
      'geoNetwork.region',
      'geoNetwork.metro',
      'geoNetwork.city'
       ]

In [None]:
featureBuckets(geo, threshold=90, max_rows=100)

#### Page

In [None]:
page = [
        'hits.page.pagePath',
        'hits.page.pagePathLevel1',
        'hits.page.pagePathLevel2',
        'hits.page.pagePathLevel3',
        'hits.page.pagePathLevel4',
        'hits.page.hostname',
        'hits.page.pageTitle',
        'hits.page.searchKeyword'
        ]

In [None]:
featureBuckets(page, threshold=80)

#### Ecommerce

In [None]:
ecomm = [
         'hits.eCommerceAction.action_type',
         'hits.eCommerceAction.step',
         'hits.eCommerceAction.option'
         ]

In [None]:
featureBuckets(ecomm, threshold=90)

## Custom Dimension Function

If you do not know your custom dimension index values, run the cell below to return a list of the custom dimension indexes used in your dataset.

In [None]:
# Return all custom dimension indexes as a list of values to pass to the function
sql = f"""
      SELECT
        customDimensions.index
      FROM 
        `{dataset_id}.{table_id}` AS visits
        ,UNNEST(visits.hits) as hits
        ,UNNEST(hits.customDimensions) as customDimensions
      WHERE 
        _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
        AND geoNetwork.Country = '{country}'
      GROUP BY 1
      ORDER BY 1 ASC
      """

customDimension_list = list(pd.read_gbq(sql, project_id=project_id_billing, dialect='standard')['index'])
print(customDimension_list)

Run the cell below to create the Feature Buckets function for custom dimensions.

In [None]:
def CDfeatureBuckets(customDimension_list, threshold=threshold, max_rows=max_rows):

  for field in list(customDimension_list):
    try:
      # Run SQL based on parameters
      sql = f"""
            SELECT
              '{field}' as dimension,
              customDimensions.value as value,
              count(distinct clientId) as user_cnt,
              count(distinct concat(fullVisitorId, visitId)) as visits
            FROM `{dataset_id}.{table_id}` AS visits
              ,UNNEST(visits.hits) as hits
              ,UNNEST(hits.customDimensions) as customDimensions
            WHERE 
              _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
              AND geoNetwork.Country = '{country}'
              AND customDimensions.index = {field}
            GROUP BY 1, 2
            """
      
      # Create frequency distribution DataFrame based on threshold and metric
      df = pd.read_gbq(sql, project_id=project_id_billing, dialect='standard') # Create DataFrame from SQL statement
      df = df.stb.freq(['value'], value=metric, thresh=threshold) # Use sidetable to create frequency distribution
      df['dimension'] = field # Add column specifying dimension
      # Select columns by the chosen metric so that 'user_cnt' works as well as 'visits'
      display(df[['dimension','value', metric, 'percent', f'cumulative_{metric}', 'cumulative_percent']][:max_rows])
      print("")

      # Print a SQL binary variable for each top value of the custom dimension, limited by threshold and max_rows
      i = 1
      for index, row in df[:-1].iterrows():
        # str() keeps re.sub from failing on non-string values
        print("Max(CASE WHEN customDimensions.index = ",row['dimension']," AND customDimensions.value = '",row['value'],"' THEN 1 ELSE 0 END) AS cd",
              row['dimension'],'_',
              re.sub('[\\\\/:*?" <>|._()&,-]','',str(row['value'])),",",sep="")
        if i == max_rows:
          break
        i += 1
      print("")
    
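    except Exception as e:
      # Report the failing custom dimension and the reason instead of hiding the error
      print('Could not run', field, '-', e)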

### Run Function

- Pass lists of custom dimensions into the function to get summary tables and SQL code that you can copy/paste directly into your input data SQL code. You can use the `customDimension_list` created above or pass your own list of values.

- Note: in each function you can also pass arguments for `max_rows` and `threshold`, so you don't always have to use the default parameters set at the beginning of this notebook.
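
One caveat worth sketching: a custom dimension value that contains a single quote would end the SQL string literal early in the printed line. The hypothetical `cd_flag_line` helper below mirrors the line format printed by `CDfeatureBuckets` and escapes quotes with a backslash, which BigQuery standard SQL accepts inside string literals. The index and values are made up for illustration.

```python
import re

ALIAS_STRIP = '[\\\\/:*?" <>|._()&,-]'  # same pattern as CDfeatureBuckets

def cd_flag_line(index, value):
    """Build one binary-flag SQL line for a custom dimension index/value pair."""
    text = str(value)
    # Escape backslashes first, then single quotes, for a BigQuery string literal.
    literal = text.replace("\\", "\\\\").replace("'", "\\'")
    # Also drop quotes from the alias so it stays a valid column name.
    alias = f"cd{index}_" + re.sub(ALIAS_STRIP, '', text).replace("'", "")
    return (f"Max(CASE WHEN customDimensions.index = {index} "
            f"AND customDimensions.value = '{literal}' THEN 1 ELSE 0 END) AS {alias},")

print(cd_flag_line(4, "Gold Member"))
print(cd_flag_line(4, "Men's Apparel"))
```

If your custom dimension values never contain quotes, the lines printed by the function can be pasted as they are.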

In [None]:
CDfeatureBuckets(customDimension_list)