## What data are we going to play with today?

We are going to play with your *site*'s data from the Data Platform, specifically its *Pulse data*.
We use Athena to connect to it.

![](pictures/sqlaas.png)


[**Reference 1: SQLaaS**](https://confluence.schibsted.io/display/GD/SQLaaS+-+Onboarding+documentation)


In [9]:
# Required packages

from pyathena import connect
import pandas as pd
import os
from getpass import getpass

In [2]:
# Some parameters
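user = "maria.pelaez@schibsted.com/"  # your folder under the S3 staging area
provider = "avitoma"  # site identifier, used as the prefix of the Athena database name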

In [3]:
access_key = getpass(prompt="Enter your access key to databox: ")
secret_key = getpass(prompt="Enter your secret to databox: ")

Enter your access key to databox:  ····················
Enter your secret to databox:  ········································


In [5]:
# Establishing the connection
# TODO: read directly from a single path
conn = connect(aws_access_key_id=access_key,
               aws_secret_access_key=secret_key,
               s3_staging_dir="s3://schibsted-spt-common-dev/user-areas/"+ user,
               region_name="eu-west-1")
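
The cell above asks for the keys interactively on every run. As a minimal sketch of one alternative, the helper below first looks for the keys in the standard AWS environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) and only falls back to `getpass` when they are missing. The function names `get_credentials` and `staging_dir` are hypothetical and only for illustration; the staging root is the one used in the cell above.

```python
import os
from getpass import getpass

# Staging root taken from the connection cell above.
S3_STAGING_ROOT = "s3://schibsted-spt-common-dev/user-areas/"


def get_credentials(env=None):
    """Return (access_key, secret_key), preferring environment variables.

    Hypothetical helper: falls back to an interactive prompt when a
    variable is missing or empty.
    """
    env = os.environ if env is None else env
    access_key = env.get("AWS_ACCESS_KEY_ID") or getpass(
        prompt="Enter your access key to databox: ")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY") or getpass(
        prompt="Enter your secret to databox: ")
    return access_key, secret_key


def staging_dir(user):
    """Build the per-user staging directory with exactly one trailing slash."""
    return S3_STAGING_ROOT + user.strip("/") + "/"
```

With these in place, the connection could be opened as `connect(aws_access_key_id=access_key, aws_secret_access_key=secret_key, s3_staging_dir=staging_dir(user), region_name="eu-west-1")` after `access_key, secret_key = get_credentials()`.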

## A. First dataset that we are going to play with: simplified raw Pulse data


1.- **Dataset name to request access:** Insights-FactLayer-Leads

2.- **Athena (SQLaaS):** {provider}_databox.yellow_pulse_simple_7d

3.- **S3 path:** s3://schibsted-spt-common-prod/yellow/pulse-simple/version=1-alpha/*/client={provider}


In [6]:
# Doing a simple query
query_events = """
SELECT
  *
FROM
  {}_databox.yellow_pulse_simple_7d
LIMIT 10
"""
df_events = pd.read_sql(query_events.format(provider), conn)

In [7]:
df_events.head()

Unnamed: 0,category,name,objectlatitude,objectlongitude,objectid,url,items,objecttype,useragent,providerid,...,intent,objectintent,devicetype,version,year,month,day,hour,gen,client
0,Servicios > negocios y empleo > Ofertas de empleo,Chofer Colectivo,,,sdrn:yapocl:classified:57750622,https://tags.tiqcdn.com/utag/schibsted/yapo-ng...,"{""@id"":""sdrn:yapocl:job:57750622"",""@type"":""Job...",ClassifiedAd,Mozilla/5.0 (Linux; Android 7.1.1; SM-J510MN B...,sdrn:schibsted:client:yapocl,...,,,mobile,1-alpha,2018,10,25,1,0,yapocl
1,,Clasificados yapo.cl - Avisos Clasificados Gra...,,,sdrn:yapocl:page:https://m.yapo.cl/index.htm,https://m.yapo.cl/index.htm,,Page,Mozilla/5.0 (Linux; Android 7.0; SM-P585M Buil...,sdrn:schibsted:client:yapocl,...,,,tablet,1-alpha,2018,10,25,1,0,yapocl
2,,,,,sdrn:yapocl:page:https://tags.tiqcdn.com/utag/...,https://tags.tiqcdn.com/utag/schibsted/yapo-ng...,,Page,Mozilla/5.0 (Linux; Android 7.0; Moto G (5) Bu...,sdrn:schibsted:client:yapocl,...,,,mobile,1-alpha,2018,10,25,1,0,yapocl
3,,Clasificados yapo.cl - Avisos Clasificados Gra...,,,sdrn:yapocl:page:https://m.yapo.cl/index.htm,https://m.yapo.cl/index.htm,,Page,Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like...,sdrn:schibsted:client:yapocl,...,,,mobile,1-alpha,2018,10,25,1,0,yapocl
4,,yapo.cl,,,sdrn:yapocl:listing:https://m.yapo.cl/araucani...,https://m.yapo.cl/araucania?&o=3,,Listing,Mozilla/5.0 (Linux; Android 6.0; BLADE A602 Bu...,sdrn:schibsted:client:yapocl,...,,,mobile,1-alpha,2018,10,25,1,0,yapocl
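
The output above shows `year`, `month`, `day` and `hour` columns. Assuming these are partition columns (the S3 path layout suggests so, but this is not confirmed here), filtering on them and selecting only the columns you need should reduce the amount of data Athena scans. The sketch below builds such a query; `build_events_query` is a hypothetical helper, and it assumes the date columns are stored as integers (quote the values if they turn out to be strings).

```python
import re
from datetime import date


def build_events_query(provider, day, limit=10):
    """Build a query for one day of events (hypothetical helper).

    The provider is validated because it is interpolated into the
    table name and cannot be passed as a bound query parameter.
    """
    if not re.fullmatch(r"[a-z0-9_]+", provider):
        raise ValueError("unexpected provider name: {!r}".format(provider))
    return (
        "SELECT objecttype, devicetype, url\n"
        "FROM {p}_databox.yellow_pulse_simple_7d\n"
        "WHERE year = {y} AND month = {m} AND day = {d}\n"
        "LIMIT {n}"
    ).format(p=provider, y=day.year, m=day.month, d=day.day, n=int(limit))


# Example: one day of events for the site shown in the output above.
example_query = build_events_query("yapocl", date(2018, 10, 25))
```

The resulting string can be passed to `pd.read_sql(example_query, conn)` in the same way as `query_events`.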


## B. Second dataset that we are going to play with: Simple Pulse Data 

1.- **Dataset name to request access:** Insights-FactLayer-Leads

2.- **Athena (SQLaaS):** {provider}_databox.insights_leads_fact_layer_90d

3.- **S3 path:** s3://schibsted-spt-common-prod/yellow/insights/leads/

[More information](https://docs.schibsted.io/data-and-insight/insights-pipelines/10.Data%20Model/fact-layer/#sessions-user-behaviour)

In [None]:
# Doing a simple query
query_leads = """
SELECT
  *
FROM
  {}_databox.insights_leads_fact_layer_90d
LIMIT 10
"""
df_leads = pd.read_sql(query_leads.format(provider), conn)

In [9]:
df_leads.head()

Unnamed: 0,leadid,globalleadtype,environmentid,eventtype,eventobject,published,clientid,country,adid,adlocalcategory,...,devicetype,producttype,trackertype,source,version,year,month,day,gen,client
0,04d78866-13b5-4481-b22e-b6723655b774,Ad phone number called,sdrn:schibsted:environment:4b7859b8-2a82-4efc-...,Call,PhoneContact,2018-09-22T13:55:07+00:00,avitoma,ma,,,...,mobile,AndroidApp,Android,pulse,1,2018,9,22,0,avitoma
1,0598e219-0b42-47bb-9643-d6b9c9608526,Ad phone number called,sdrn:schibsted:environment:b398cd81-6738-4b73-...,Call,PhoneContact,2018-09-22T17:58:21+00:00,avitoma,ma,,,...,mobile,AndroidApp,Android,pulse,1,2018,9,22,0,avitoma
2,06378b70-b1bb-4090-8393-2a4d52e1cc35,Ad phone number called,sdrn:schibsted:environment:9fad9935-fb37-43e4-...,Call,PhoneContact,2018-09-22T13:08:46+00:00,avitoma,ma,,,...,mobile,AndroidApp,Android,pulse,1,2018,9,22,0,avitoma
3,073b56a9-4d28-4191-baa7-7500ce886c50,Ad phone number called,sdrn:schibsted:environment:0e64f409-21ab-4b42-...,Call,PhoneContact,2018-09-22T13:24:35+00:00,avitoma,ma,,,...,mobile,AndroidApp,Android,pulse,1,2018,9,22,0,avitoma
4,07d3d229-26fd-4ea3-a503-5896634a72f5,Ad SMS app opened,sdrn:schibsted:environment:ddf8b939-7a7f-4924-...,SMS,PhoneContact,2018-09-22T01:12:30+00:00,avitoma,ma,,,...,mobile,AndroidApp,Android,pulse,1,2018,9,22,0,avitoma


In [33]:
df_leads.columns

Index(['leadid', 'globalleadtype', 'environmentid', 'eventtype', 'eventobject',
       'published', 'clientid', 'country', 'adid', 'adlocalcategory',
       'adlocalmaincategory', 'adtype', 'adpublishertype', 'adlocation',
       'adlocalvertical', 'devicetype', 'producttype', 'trackertype', 'source',
       'version', 'year', 'month', 'day', 'gen', 'client'],
      dtype='object')
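
Once the leads are in a DataFrame, a first look usually means counting them by type and by time. The sketch below is one way to do that; to keep it self-contained it rebuilds a small frame from the five rows printed above (only four of the columns), so `sample` stands in for `df_leads` and is not real query output.

```python
import pandas as pd

# Stand-in for df_leads, copied from the five rows shown above.
sample = pd.DataFrame({
    "leadid": [
        "04d78866-13b5-4481-b22e-b6723655b774",
        "0598e219-0b42-47bb-9643-d6b9c9608526",
        "06378b70-b1bb-4090-8393-2a4d52e1cc35",
        "073b56a9-4d28-4191-baa7-7500ce886c50",
        "07d3d229-26fd-4ea3-a503-5896634a72f5",
    ],
    "globalleadtype": ["Ad phone number called"] * 4 + ["Ad SMS app opened"],
    "eventtype": ["Call"] * 4 + ["SMS"],
    "published": [
        "2018-09-22T13:55:07+00:00",
        "2018-09-22T17:58:21+00:00",
        "2018-09-22T13:08:46+00:00",
        "2018-09-22T13:24:35+00:00",
        "2018-09-22T01:12:30+00:00",
    ],
})

# 'published' arrives as text; convert it to timezone-aware timestamps.
sample["published"] = pd.to_datetime(sample["published"], utc=True)

# Distinct leads per lead type.
leads_per_type = sample.groupby("globalleadtype")["leadid"].nunique()

# Distinct leads per hour of the day (UTC).
leads_per_hour = sample.groupby(sample["published"].dt.hour)["leadid"].nunique()
```

Replacing `sample` with `df_leads` applies the same two summaries to the query result, assuming its `published` column has the same text format as shown above.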