In [None]:
# Copyright 2021 Google LLC.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 1. EDA (Exploratory Data Analysis)

This notebook has the following objectives:

1. Check the feasibility of building a propensity model.
1. Inspect the GA fields to identify relevant information for features and targets.
1. Perform initial exploratory data analysis to identify insights that might help with building a well-performing propensity model.
1. Get a more "intimate" feel for the data before embarking on large-scale processing and automated modelling.

## Requirements

1. [Google Analytics dataset stored in BigQuery.](https://support.google.com/analytics/answer/3437618?hl=en)

## Install and import required modules

In [None]:
# Install gps_building_blocks package if not installed
# !pip install gps_building_blocks

In [None]:
import pandas as pd

from gps_building_blocks.cloud.utils import bigquery as bigquery_utils
from utils import exploratory_data_analysis

### Settings

In [None]:
# Print all outputs from a cell (instead of calling display each time)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Configuration

Configure the following variables:

In [None]:
# GCP Project ID
PROJECT_ID = 'project_id'
# BigQuery dataset name
DATASET_NAME = 'bigquery-public-data.google_analytics_sample'
# BigQuery table name containing Google Analytics session data.
TABLE_NAME = 'ga_sessions'

In [None]:
params = {
  'project': PROJECT_ID,
  'dataset': DATASET_NAME,
  'table': TABLE_NAME,
  'verbose': True
}

First, we initialize `Analysis` with the configuration parameters.

In [None]:
bq_client = bigquery_utils.BigQueryUtils(project_id=PROJECT_ID)
eda = exploratory_data_analysis.Analysis(bq_client=bq_client, params=params)

## 1. Problem statement

Objective:
1. Define the problem statement and the questions you are trying to address.

### Questions:
1. What is the business problem you are trying to solve?
1. What are the success criteria of the project?
1. What target do you want to predict?
1. What are the important fields to consider as potential features?

## 2. Schema and field descriptions

This is an example for GA360 data. You can see the field descriptions [here](https://support.google.com/analytics/answer/3437719?hl=en#).

In [None]:
# We can read the table with schema description
# into pandas dataframe for easy reference or joining.
df_schema = pd.read_html('https://support.google.com/analytics/answer/3437719?hl=en#')[0]
df_schema

## 3. Data Structure
<a id= 'data_structure'></a>

We can get information about the data directly from the metadata of the BigQuery dataset using [INFORMATION_SCHEMA](https://cloud.google.com/bigquery/docs/information-schema-tables). Note that this metadata is sometimes not populated.

Objective:

1. Understand the structure of the data.

Questions:

1. Is the dataset description available and what does it say?
1. How long a period does the dataset cover (how many daily tables)?
1. How big are the daily tables?
1. Are there any missing days?
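
As a rough illustration of the kind of metadata query that can answer the first question, the sketch below builds a query against the `INFORMATION_SCHEMA.TABLE_OPTIONS` view, which stores table-level options such as `description`. The helper name `build_table_options_query` is hypothetical, and this is not necessarily how `eda.get_ds_description()` obtains its results. The sketch only builds the SQL string, so it runs without a BigQuery connection.

```python
def build_table_options_query(dataset: str, table_prefix: str) -> str:
    """Builds SQL listing the 'description' option of tables with a prefix.

    Hypothetical helper for illustration only. `dataset` is expected in
    'project.dataset' form, like DATASET_NAME in the configuration cell.
    """
    return (
        "SELECT table_name, option_name, option_value\n"
        f"FROM `{dataset}.INFORMATION_SCHEMA.TABLE_OPTIONS`\n"
        "WHERE option_name = 'description'\n"
        f"  AND STARTS_WITH(table_name, '{table_prefix}')\n"
        "ORDER BY table_name"
    )


query = build_table_options_query(
    'bigquery-public-data.google_analytics_sample', 'ga_sessions')
print(query)
```

If the query returns no rows, the descriptions were most likely never set, which matches the note above that this metadata is sometimes not populated.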

In [None]:
table_options, description = eda.get_ds_description()
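
To check for missing days by hand, one option is to compare the date suffixes of the daily tables against a continuous calendar range. The sketch below assumes the standard GA360 export naming of `ga_sessions_YYYYMMDD`; the function name `find_missing_days` and the table names in the example are made up for illustration and are not part of the `exploratory_data_analysis` module.

```python
import datetime


def find_missing_days(table_names, prefix='ga_sessions_'):
    """Returns the dates with no daily table between the first and last day.

    Hypothetical helper. Names that do not match prefix + YYYYMMDD
    (for example intraday tables) are ignored.
    """
    days = set()
    for name in table_names:
        suffix = name[len(prefix):]
        if name.startswith(prefix) and len(suffix) == 8 and suffix.isdigit():
            days.add(datetime.datetime.strptime(suffix, '%Y%m%d').date())
    if not days:
        return []
    first, last = min(days), max(days)
    # Walk the full calendar range and keep the days without a table.
    all_days = (first + datetime.timedelta(days=i)
                for i in range((last - first).days + 1))
    return [day for day in all_days if day not in days]


# Illustrative table names only, not taken from a real dataset listing.
example_tables = [
    'ga_sessions_20170101',
    'ga_sessions_20170102',
    'ga_sessions_20170104',
    'ga_sessions_intraday_20170105',
]
missing = find_missing_days(example_tables)
print(missing)
```

In practice the list of table names would come from the dataset metadata rather than being typed in by hand.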

# References

**General**

[Google Analytics Glossary](https://support.google.com/analytics/topic/6083659?hl=en&ref_topic=3544906)

[Interactive visualization of the Google Analytics 360 BigQuery Export schema.](https://storage.googleapis.com/e-nor/visualizations/bigquery/ga360-schema.html#section-collapsible-tree)