## Extracting patterns in daily user behaviour

We are going to work with the following dataset:

1.- **Dataset name to request access:** Insights-FactLayer-Leads

2.- **Athena (SQLaaS):** {provider}_databox.insights_sessions_fact_layer_1d

3.- **S3 path:** schibsted-spt-common-prod/yellow/pulse-simple/version=1-alpha/*/client=${provider}


[Athena Query](https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.html) 


In [None]:
# Needed packages
from pyathena import connect
import pandas as pd
import os

In [None]:
from getpass import getpass
access_key = getpass(prompt="Enter your access key to databox: ")
secret_key = getpass(prompt="Enter your secret to databox: ")

# Some parameters (another way to extract the credentials)
user = "maria.pelaez@schibsted.com/"
provider = 'avitoma'

# Doing the connection
conn = connect(aws_access_key_id=access_key,
               aws_secret_access_key=secret_key,
               s3_staging_dir="s3://schibsted-spt-common-dev/user-areas/"+ user,
               region_name="eu-west-1")
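
As an alternative to typing the credentials on every run, they can be read from environment variables, with the prompt as a fallback. The sketch below is only an illustration: the helper name `read_credential` is hypothetical, and it assumes the keys are exported under the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

```python
import os
from getpass import getpass


def read_credential(env_name, prompt):
    """Return the value of the environment variable env_name if it is set,
    otherwise fall back to an interactive prompt (hypothetical helper)."""
    value = os.environ.get(env_name)
    if value:
        return value
    return getpass(prompt=prompt)
```

It could then be used as `access_key = read_credential("AWS_ACCESS_KEY_ID", "Enter your access key to databox: ")`, and likewise for the secret key.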

## Step 0: What information is available in this dataset?

In [None]:
describe_events = """
SELECT * FROM {}_databox.yellow_pulse_simple_1d LIMIT 1
"""
# read_sql needs the connection as its second argument
describe_df = pd.read_sql(describe_events.format(provider), conn)
# dtypes is an attribute, not a method
describe_df.dtypes

In [None]:
# Doing a simple query of the events in one hour
query_events = """
SELECT
  category,
  name,
  objectid,
  objecttype,
  type,
  environmentid,
  devicetype,
  providerproducttype,
  isloggedin,
  "hour"
FROM
  {}_databox.yellow_pulse_simple_1d
WHERE 
 "hour" = 20
LIMIT 2000
"""
df_events = pd.read_sql(query_events.format(provider), conn)

## What are we going to learn about Python?

We are mainly going to work with the following packages:
    
1. [pandas](https://pandas.pydata.org/pandas-docs/stable/) 

This is a package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” 
data both easy and intuitive. 
It aims to be the fundamental high-level building block for doing practical, 
real world data analysis in Python. Additionally, it has the broader goal of becoming the most 
powerful and flexible open source data analysis / manipulation tool available in any language. 
It is already well on its way towards this goal.

[cheat sheet pandas](http://datacamp-community.s3.amazonaws.com/9f0f2ae1-8bd8-4302-a67b-e17f3059d9e8)


2. [numpy](http://www.numpy.org/)

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object

- sophisticated (broadcasting) functions

- tools for integrating C/C++ and Fortran code

- useful linear algebra, Fourier transform, and random number capabilities

[cheat sheet numpy](http://datacamp-community.s3.amazonaws.com/e9f83f72-a81b-42c7-af44-4e35b48b20b7)


3. [matplotlib](https://matplotlib.org/)

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

[cheat sheet matplotlib](http://datacamp-community.s3.amazonaws.com/28b8210c-60cc-4f13-b0b4-5b4f2ad4790b)


### Exploration of our DataFrame

*head(), tail(), shape, columns, describe(), dtypes*

*selecting one row, one column, a sub-DataFrame*
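
The selection patterns listed above can be tried first on a small hand-made DataFrame. This is a minimal sketch: the column names mirror the query, but the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_events; the values are invented for illustration only.
toy = pd.DataFrame({
    "type": ["View", "Call", "View", "Send"],
    "objecttype": ["Listing", "PhoneContact", "ClassifiedAd", "Message"],
    "devicetype": ["mobile", "mobile", "desktop", "tablet"],
})

one_column = toy["type"]            # one column, returned as a Series
one_row = toy.iloc[0]               # first row by position, also a Series
views = toy[toy["type"] == "View"]  # boolean mask, returns a sub-DataFrame

# Rows and columns at once: mobile events, two columns only
subset = toy.loc[toy["devicetype"] == "mobile", ["type", "objecttype"]]
```

`iloc` selects by position and `loc` by label or boolean mask; both accept a second argument to restrict the columns.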

In [None]:
df_events.head(2)

In [None]:
df_events.tail(2)

In [None]:
df_events.shape

In [None]:
df_events.dtypes

In [None]:
df_events['category'][0:10]

In [None]:
df_events[df_events['type']=='View'].head(2)

In [None]:
df_events['combination'] = df_events['devicetype'] + ' - ' +df_events['providerproducttype']

In [None]:
df_events['combination']

In [None]:
df_events['has_category'] = False
# Use ~ to negate a boolean mask; unary minus is not supported on booleans
df_events.loc[~df_events['category'].isna(), 'has_category'] = True

In [None]:
df_events.head(4)

### Some counts
*count(), unique(), nunique(), value_counts(), groupby*

In [None]:
df_events['combination'].unique()

In [None]:
df_events['combination'].nunique(), len(df_events['objecttype'].unique())

In [None]:
df_events['combination'].value_counts()

In [None]:
df_events['type'].value_counts()

In [None]:
df_events['devicetype'].value_counts()

In [None]:
df_events.groupby?

In [None]:
df_events.groupby('devicetype').count()

In [None]:
df_events.groupby('devicetype')['devicetype'].count()

In [None]:
# Run several aggregations at once by passing a dictionary of column -> function
dic = {'devicetype':'count','environmentid':'nunique'}

In [None]:
dic

In [None]:
dic['devicetype']

In [None]:
dic.keys(),dic.values()

In [None]:
df_events.groupby('combination').agg(dic)

In [None]:
df_events.groupby('combination').agg(dic).plot()

In [None]:
# To display charts inline in a Jupyter notebook, add this line
%matplotlib inline

df_events.groupby('combination').agg(dic).plot(kind='bar')

> Later we are going to learn how to tune a chart!

## **EXERCISES**

Which browsers are the most active - desktop, mobile or tablet?

Here *most active* means those that generate the highest number of events.


In [None]:
# Count events and distinct users per device type
query = """
SELECT
  devicetype,
  count(devicetype) as nof_events,
  count(distinct environmentid) as nof_users
FROM
  {}_databox.yellow_pulse_simple_1d
GROUP BY 
  devicetype
"""
df = pd.read_sql(query.format(provider), conn)

In [None]:
df

In [None]:
# Use ~ to negate a boolean mask; unary minus is not supported on booleans
df_clean = df[~df['devicetype'].isna()]
df_clean

In [None]:
df['volumeEvents'] = 100*df['nof_events']/df['nof_events'].sum()
df['volumeUsers'] = 100*df['nof_users']/df['nof_users'].sum()
df['eventsPeruser'] = df['nof_events']/df['nof_users']

In [None]:
df.sort_values('volumeUsers', ascending=False)

# Now 

![](pictures/your_turn.png)



# Run the following query

Be patient! It might take a while!

In [None]:
# Count events per user, product type, device type, login state, event type and object type
query = """
SELECT
  environmentid,
  providerproducttype,
  devicetype,
  isloggedin,
  type,
  objecttype,
  count(*) as nof_events
FROM
  {provider}_databox.yellow_pulse_simple_1d
GROUP BY
  environmentid,
  providerproducttype,
  devicetype,
  isloggedin,
  type,
  objecttype
"""
# The query uses the named placeholder {provider}, so pass it by keyword
df = pd.read_sql(query.format(provider=provider), conn)

In [None]:
df.head()

## **Exercise 1**:

Write a function (use *def*) in Python that derives the platform associated with an event from the fields **devicetype** and **providerproducttype**. The resulting platform value must be one of: *iOS*, *Android*, *Web* or *Undefined*

Apply this function to create a new column called **platform** and answer the following questions:

1.1 How many users come from each platform? 

1.2 What is the percentage of Listings and ClassifiedAds for every platform?
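
One possible starting point is sketched below on toy data. It rests on assumptions that should be checked against the real values (for example with `value_counts()`): that **providerproducttype** contains the substrings `ios`, `android` or `web`, and that a `desktop` device with no product type counts as Web. The function name `get_platform` and the toy values are hypothetical.

```python
import pandas as pd


def get_platform(devicetype, providerproducttype):
    """Map one event to iOS, Android, Web or Undefined.

    Assumption: the product type contains 'ios', 'android' or 'web'
    (case-insensitive); check this against the real data.
    """
    product = "" if pd.isna(providerproducttype) else str(providerproducttype).lower()
    device = "" if pd.isna(devicetype) else str(devicetype).lower()
    if "ios" in product:
        return "iOS"
    if "android" in product:
        return "Android"
    if "web" in product or device == "desktop":
        return "Web"
    return "Undefined"


# Toy rows with made-up values, standing in for the query result
toy = pd.DataFrame({
    "devicetype": ["mobile", "mobile", "desktop", None],
    "providerproducttype": ["iOSApp", "AndroidApp", "Web", None],
})

# Apply the function row by row to build the new column
toy["platform"] = toy.apply(
    lambda row: get_platform(row["devicetype"], row["providerproducttype"]),
    axis=1,
)
```

Row-wise `apply` is easy to read but slow on large frames; on the aggregated result used here that is usually acceptable.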


> Please write the results in the cardboard of the site that you are studying

## **Exercise 2**:

There is a field called **isloggedin** which states whether the user was logged in or not.

2.1 What is the percentage of logged-in users?

> Please write the results in the cardboard of the site that you are studying

## **Exercise 3**:

We define the following roles:
    
**browser**: Active user with at least one Listing View or one Ad View in the session.
    
**buyer**: Active user who has *contacted* at least one Lister.

To *contact* means to do one of the following actions:
    
Call-->PhoneContact

Show-->PhoneContact

Send-->Message

SMS-->PhoneContact


**lister**: Active user who has tried to publish one ad. This refers to the Create-ClassifiedAd event.
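
The definitions above can be encoded as flags on each row and then combined per user. The sketch below uses toy data and assumes that each arrow reads as **type**-->**objecttype**, and that a Listing View or Ad View is a `View` event on a `Listing` or `ClassifiedAd` object; the flag column names are hypothetical.

```python
import pandas as pd

# (type, objecttype) pairs that count as a contact, copied from the list above
CONTACT_EVENTS = {
    ("Call", "PhoneContact"),
    ("Show", "PhoneContact"),
    ("Send", "Message"),
    ("SMS", "PhoneContact"),
}

# Toy events with made-up values, standing in for the query result
toy = pd.DataFrame({
    "environmentid": ["a", "a", "b", "c"],
    "type": ["View", "Call", "View", "Create"],
    "objecttype": ["Listing", "PhoneContact", "ClassifiedAd", "ClassifiedAd"],
})

# One boolean flag per role, computed row by row
toy["is_browse"] = (toy["type"] == "View") & toy["objecttype"].isin(["Listing", "ClassifiedAd"])
toy["is_contact"] = [pair in CONTACT_EVENTS for pair in zip(toy["type"], toy["objecttype"])]
toy["is_listing_attempt"] = (toy["type"] == "Create") & (toy["objecttype"] == "ClassifiedAd")

# A user has a role if at least one of their events carries the flag
roles = toy.groupby("environmentid")[["is_browse", "is_contact", "is_listing_attempt"]].any()
```

A user can hold several roles at once, so the percentages per role do not have to add up to 100.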


3.1. Compute the percentage of browsers, buyers and listers per platform.

