#Integração Query BQ no Colab

## Referências

* **Tutorial conexão BQ **https://colab.research.google.com/notebooks/bigquery.ipynb#scrollTo=gisPvdr4Eaui-
* **Colunas exportação BQ** https://support.google.com/analytics/answer/3437719?hl=pt-br

## Setup

In [0]:
!pip install -U -q PyDrive

import pandas as pd

## Autenticação Google Drive

In [2]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Autentica e cria um cliente PyDrive
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

print('Authenticated')

Authenticated


## Extração Query BQ

A biblioteca **google.cloud.bigquery** permite rodar a query e mostrar o resultado ou salvar em uma dataframe.

In [3]:
%%bigquery --project propension df
SELECT 
  COUNT(*) as total_rows
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

Unnamed: 0,total_rows
0,903653


Declara o Cloud Project Id do BQ: https://console.cloud.google.com/bigquery?project=propension&page=savedqueries

In [0]:
project_id = 'propension'

Determina o tamanho do dataset do BQ

In [5]:
from google.cloud import bigquery

client = bigquery.Client(project=project_id)

sample_count = 2000
row_count = client.query('''
  SELECT
    COUNT(*) as total
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`''').to_dataframe().total[0]

print('O dataset tem %d linhas' % row_count)

O dataset tem 903653 linhas


Uso do BQ com o pandas-gbq para fazer a query para o modelo de propensão. 
[Documentação do Pandas GBQ](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html)

In [9]:
%%time
df = pd.io.gbq.read_gbq('''
  #standardSQL
  SELECT
      date,
      trafficSource.source AS source,
      trafficSource.source AS medium,
      fullVisitorId,
      channelGrouping,
      SUM(totals.pageviews) AS pageviews,
      SUM(totals.transactions) AS transactions,
      SUM(totals.timeOnSite) AS SUM_totals_timeOnSite,
      totals.totalTransactionRevenue AS receita,
      device.deviceCategory AS device,
      geoNetwork.country AS pais,
      MOD(ABS(FARM_FINGERPRINT(fullVisitorId)), 20) as farm_finger
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*` #, UNNEST(hits) AS hits
  GROUP BY
      date,
      source,
      medium,
      fullVisitorId,
      channelGrouping,
      receita,
      device,
      pais
    
  ''', project_id=project_id, verbose=False, dialect='standard')

df.head()
df.tail(1)

Unnamed: 0,date,source,medium,fullVisitorId,channelGrouping,pageviews,transactions,SUM_totals_timeOnSite,receita,device,pais,farm_finger
0,20170410,google,google,5141006232346734771,(Other),5,,94.0,,mobile,United States,17
1,20170415,(not set),(not set),1552260720366936947,(Other),2,,13.0,,desktop,United States,15
2,20170722,google,google,5302771102738976470,(Other),1,,,,desktop,United States,15
3,20170412,(not set),(not set),7452432726909106856,(Other),1,,,,desktop,United States,6
4,20170413,(not set),(not set),9869185717139963120,(Other),15,,399.0,,mobile,United States,11


Salva o dataframe da Query do BQ em um .csv. O id abaixo é o do folder de destino onde o arquivo será salvo.

In [0]:
%%time
df.to_csv("input_hash.csv", index=False)
file = drive.CreateFile({'parents':[{u'id': '1z8scY025WNyT_3p-1B4sjz_aidLPi1ht'}]})
file.SetContentFile("input_hash.csv")
file.Upload()