## Segmentation of users

We are going to work with the following dataset:

1.- **Dataset name to request access:** SQLaaSPulseAnonymizedSimple-1

2.- **Athena (SQLaaS):** {provider}_databox.insights_sessions_fact_layer_7d

3.- **S3 path:** schibsted-spt-common-prod/yellow/pulse-simple/version=1-alpha/*/client=${provider}


We are going to query the activity of each user and then segment the users by how active they are (number of active days) and by the type of activity they perform, as follows:

![](pictures/segmentation.png)

In [None]:
# Needed packages
from pyathena import connect
import pandas as pd
import os

In [None]:
from getpass import getpass
access_key = getpass(prompt="Enter your access key to databox: ")
secret_key = getpass(prompt="Enter your secret to databox: ")

# Some parameters: the user area used for the staging directory and the provider (site) to query
user = "pawel.tyszka@schibsted.com/"
provider = 'yapocl'

# Establishing the connection
conn = connect(aws_access_key_id=access_key,
               aws_secret_access_key=secret_key,
               s3_staging_dir="s3://schibsted-spt-common-dev/user-areas/"+ user,
               region_name="eu-west-1")
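
The `os` import is not used anywhere else in this notebook. One possible use, sketched here as an assumption rather than as the workshop's prescribed method, is to read the credentials from environment variables and fall back to the interactive prompt when they are not set. The helper `get_credential` and the variable names `DATABOX_ACCESS_KEY` and `DATABOX_SECRET_KEY` are hypothetical.

```python
import os
from getpass import getpass


def get_credential(env_var, prompt):
    """Return the value of env_var if it is set and non-empty, else prompt for it."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass(prompt=prompt)


# Hypothetical variable names; nothing in the workshop setup defines them.
# access_key = get_credential("DATABOX_ACCESS_KEY", "Enter your access key to databox: ")
# secret_key = get_credential("DATABOX_SECRET_KEY", "Enter your secret to databox: ")
```

With this approach the keys never appear in the notebook output or in a saved cell.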

## Step 0: What information is available in this dataset?

In [None]:
describe_events = """
SELECT * FROM {}_databox.yellow_pulse_simple_7d LIMIT 1
"""
pd.read_sql(describe_events.format(provider), conn).dtypes

In [None]:
%%time

query1 = """
SELECT
 environmentid,
 sum(nof_listings) as nof_listings,
 sum(nof_classifieds) as nof_classifieds,
 sum(nof_content) as nof_content,
 sum(nof_pages) as nof_pages,
 sum(nof_created) as nof_created,
 sum(nof_call) as nof_call,
 sum(nof_Show) as nof_Show,
 sum(nof_Send) as nof_Send,
 sum(nof_SMS) as nof_SMS
 FROM
 (
  SELECT
      environmentid,
      CASE WHEN (type='View' and objecttype = 'Listing') THEN SUM(1) else 0 end as nof_listings,
      CASE WHEN (type='View' and objecttype = 'ClassifiedAd') THEN SUM(1) else 0 end as nof_classifieds,
      CASE WHEN (type='View' and objecttype = 'Content')  THEN SUM(1) else 0 end as nof_content,
      CASE WHEN (type='View' and objecttype = 'Page') THEN SUM(1) else 0 end as nof_pages,
      CASE WHEN (type='Create' and objecttype = 'ClassifiedAd' ) THEN SUM(1) else 0 end as nof_created,
      CASE WHEN (type='Call' and objecttype = 'PhoneContact') THEN SUM(1) else 0 end as nof_call,
      CASE WHEN (type='Show' and objecttype = 'PhoneContact') THEN SUM(1) else 0 end as nof_Show,
      CASE WHEN (type='Send' and objecttype = 'Message') THEN SUM(1) else 0 end as nof_Send,
      CASE WHEN (type='SMS' and objecttype = 'PhoneContact') THEN SUM(1) else 0 end as nof_SMS
    FROM
      {}_databox.yellow_pulse_simple_7d
    GROUP BY
      environmentid, 
      type,
      objecttype
 )
GROUP BY
  environmentid
;
"""

query2 = """
SELECT
      environmentid,
      count(distinct day) as active_days,
      count(environmentid) as total_events,
      count(nullif(isloggedin = true, false)) as total_logged_events
    FROM
      {}_databox.yellow_pulse_simple_7d
    GROUP BY
      environmentid
;
"""
#df1 = pd.read_sql(query1.format(provider), conn)
#df2 = pd.read_sql(query2.format(provider), conn)
#df = df2.merge(df1,on='environmentid',how='left')
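
The commented-out merge keeps every user from `df2` and attaches the per-type counts from `df1`. A minimal sketch on invented toy frames (all values below are made up for illustration) shows what a left join does when a user has no matching row on the right-hand side: the count columns come back as `NaN`, which you may want to replace with 0.

```python
import pandas as pd

# Toy stand-ins for the two query results; all values are invented.
df2_toy = pd.DataFrame({
    "environmentid": ["a", "b", "c"],
    "active_days": [1, 3, 7],
    "total_events": [4, 20, 150],
})
df1_toy = pd.DataFrame({
    "environmentid": ["a", "b"],
    "nof_listings": [2, 10],
    "nof_classifieds": [1, 5],
})

# Left join: every row of df2_toy survives, unmatched counts become NaN.
merged = df2_toy.merge(df1_toy, on="environmentid", how="left")

# Treat "no matching row" as zero events of that type.
count_cols = ["nof_listings", "nof_classifieds"]
merged[count_cols] = merged[count_cols].fillna(0).astype(int)
print(merged)
```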

In [None]:
# In case the queries take too long to run, we are going to use this fake data instead
df = pd.read_csv("aggregatedDataset_7d.csv")

## **Exercise 1**:

Create a function to apply the previous segmentation. After this, please answer the following questions:

1.1 What is the distribution of the segmentation?

1.2 What is the volume of total events that every segment consumes?

1.3 What is the distribution of active days per segment?

1.4 What are the main characteristics of every segment?

> Please write the results in the cardboard of the site that you are studying
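
As a starting point for the questions above, here is a rough sketch of the pandas calls that could produce each summary once a `segment` column exists. The toy data and the labels `"low"` and `"high"` are placeholders, not the segments defined in the picture; your segmentation function should supply the real ones.

```python
import pandas as pd

# Toy data; the segment labels are placeholders, not the real segmentation.
toy = pd.DataFrame({
    "environmentid": ["a", "b", "c", "d"],
    "segment": ["low", "low", "high", "high"],
    "active_days": [1, 2, 6, 7],
    "total_events": [3, 5, 40, 52],
})

# 1.1 Share of users per segment.
segment_share = toy["segment"].value_counts(normalize=True)

# 1.2 Share of all events consumed by each segment.
event_share = toy.groupby("segment")["total_events"].sum() / toy["total_events"].sum()

# 1.3 Distribution of active days per segment (rows: segment, columns: days).
active_days_dist = pd.crosstab(toy["segment"], toy["active_days"], normalize="index")

# 1.4 A first look at segment characteristics via per-segment means.
profile = toy.groupby("segment")[["active_days", "total_events"]].mean()
```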


## **Exercise 2** (difficulty medium):

[Recency, frequency and monetary value (RFM)](https://en.wikipedia.org/wiki/RFM_(customer_value)) is another way to segment our users. Using `{}_databox.yellow_pulse_simple_7d`, create a query to extract the following for each user:
- number of active days
- day of the first visit
- day of the last visit
- total number of events

Build a weekly RFM segmentation of our users using the features above.

Any interesting conclusions?

**Hint**: use [pandas.cut](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html)
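
To illustrate the hint, the sketch below bins two invented per-user features into labelled buckets with `pandas.cut`. The column names, bin edges and labels are arbitrary choices for the example, not values prescribed by the exercise.

```python
import pandas as pd

# Toy per-user features; all values are invented.
rfm = pd.DataFrame({
    "environmentid": ["a", "b", "c", "d"],
    "active_days": [1, 2, 5, 7],            # frequency over the 7-day window
    "days_since_last_visit": [6, 3, 1, 0],  # recency
})

# Frequency: intervals are (0, 1], (1, 4], (4, 7].
rfm["f_score"] = pd.cut(
    rfm["active_days"],
    bins=[0, 1, 4, 7],
    labels=["low", "medium", "high"],
)

# Recency: intervals are (-1, 1], (1, 4], (4, 7]; fewer days means more recent.
rfm["r_score"] = pd.cut(
    rfm["days_since_last_visit"],
    bins=[-1, 1, 4, 7],
    labels=["recent", "mid", "old"],
)
print(rfm)
```

Because `pandas.cut` uses right-closed intervals by default, the first bin edge has to sit below the smallest value (here `-1` for a user who visited today), otherwise that user ends up with a missing score.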

## **Exercise 3** (difficulty high):

Add the category dimension to your previous analysis and try to extract new insights about the behaviour of our users.