# Calculate Mobile App Clicks, Views and Opens for Push Notifications

## Description 
This notebook counts clicks, views and app opens for push notifications by aggregating push actions per `action_date` and `type_minor`. Clicks are further broken down into registered vs. non-registered users (with or without a `yinzid`), and every action is reported as a total count and a unique count (distinct `device_id`).

Notes:
* a `non_registered_user` is a user whose `yinzid` is NULL
* `mlse_id = concat(device_id, yinzid)`
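
The two definitions above can be illustrated in plain Python. This is only a sketch: the helper names `is_registered` and `make_mlse_id` are hypothetical, and treating a missing `yinzid` as an empty string is an assumption, since the notebook does not show how `mlse_id` is built when `yinzid` is NULL. In Spark SQL, `concat` returns NULL if any argument is NULL, so a plain `concat` would leave `mlse_id` NULL for non-registered users.

```python
from typing import Optional


def is_registered(yinzid: Optional[str]) -> bool:
    """A user counts as registered when a yinzid is present (not NULL/None)."""
    return yinzid is not None


def make_mlse_id(device_id: str, yinzid: Optional[str]) -> str:
    """Concatenate device_id and yinzid.

    Assumption: a missing yinzid contributes nothing to the id.
    """
    return device_id + (yinzid or "")


# Illustrative values only; they do not come from the real data.
print(is_registered(None))
print(make_mlse_id("dev42", "yz7"))
```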

## Preprocessing

* Load the YinzCam Realtime API data & Metadata from the ADL
* Join `push` (filtered from actions) and `sessions` into a single table
* Aggregate `Push` actions (`click`, `view`, `open`)

## Output

A table written to MS SQL Server:
* `<team>_push`, where `team` is one of `nhl`, `tfc`, `nba`. Columns: action_date, type_minor, total_push_clicks, total_unique_push_clicks, total_push_views, total_unique_push_views, total_opened_app, total_unique_opened_app, total_reg_push_clicks, total_non_reg_push_clicks, app_id, send_time, tag, alert, has_link, subscriptions_targeted.

## QA 
* Erika Munoz, Data Scientist, Erika.Munoz@MLSE.com (Primary)
* Nicole Ridout, Data Engineer, Nicole.Ridout@MLSE.com
* Farah Bastien, Manager of Data Science, Farah.Bastien@MLSE.com

##### Load the necessary functions from `PySpark` and `Python`

In [0]:
# SQL-like functions from PySpark
# NOTE: sum, min, max and round shadow the Python built-ins of the same name
from pyspark.sql.functions import (
    col, date_format, from_utc_timestamp, from_unixtime, unix_timestamp, sum, count,
    countDistinct, collect_list, size, month, min, max, when, upper,
    lag, split, length, lit, mean, collect_set, concat, create_map, weekofyear,
    year, round, first, explode, isnan, regexp_replace, to_date, floor,
)
from pyspark.sql.types import TimestampType, IntegerType, DoubleType, StringType, DateType, StructType
from pyspark.sql.window import Window

#datetime from Python
from datetime import datetime,timedelta,date
import numpy as np
import os, re, glob
from itertools import chain

##### Load ADL credentials

In [0]:
url = "https://login.microsoftonline.com/{0}/oauth2/token".format(dbutils.secrets.get(scope = "adl_cred", key = "directory_id"))
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", dbutils.secrets.get(scope = "adl_cred", key = "client_id"))
spark.conf.set("dfs.adls.oauth2.credential", dbutils.secrets.get(scope = "adl_cred", key = "credential"))
spark.conf.set("dfs.adls.oauth2.refresh.url", url)

## Load the YinzCam Realtime API data & Metadata from the ADL

##### Define the team and urls from ADL

In [0]:
# Read the team code (nhl, tfc or nba) from a notebook widget
dbutils.widgets.text("team", "", "")
team = dbutils.widgets.get("team")
print("Working with {0} team".format(team))

In [0]:
adlurl = "adl://mlse1.azuredatalakestore.net/yinz_cam/"+team +"_tor/realtime_api/"
meta_adl = "adl://mlse1.azuredatalakestore.net/yinz_cam/cards_content/meta-push.csv"
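
Because the `team` widget defaults to an empty string, an unset widget would produce a path ending in `/_tor/realtime_api/`. A minimal sketch of a guard is shown below; the names `VALID_TEAMS` and `realtime_api_url` are hypothetical, the allowed codes are taken from the Output section, and lower-casing the input is an assumption.

```python
VALID_TEAMS = {"nhl", "tfc", "nba"}  # team codes listed in the Output section
ADL_ROOT = "adl://mlse1.azuredatalakestore.net/yinz_cam/"


def realtime_api_url(team: str) -> str:
    """Build the realtime_api folder URL, rejecting unknown or empty team codes."""
    team = team.strip().lower()  # assumption: team codes are lower-case
    if team not in VALID_TEAMS:
        raise ValueError(
            "Unknown team {0!r}; expected one of {1}".format(team, sorted(VALID_TEAMS))
        )
    return ADL_ROOT + team + "_tor/realtime_api/"


print(realtime_api_url("nhl"))
```

Failing early here avoids a less obvious "path not found" error when the CSV files are read.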

##### Load Realtime data (`actions` & `sessions`) and Meta data (`meta-push.csv`) from ADL 

In [0]:
# Actions: drop exact duplicates, convert UTC timestamps to Toronto time,
# and derive the local calendar date of each action
actions  = (spark.read.csv(adlurl+'actions',header=True)
            .drop_duplicates()
            .withColumnRenamed('id','action_id')
            .withColumn('request_date_time',from_utc_timestamp(col('request_date_time').cast(TimestampType()), "America/Toronto"))
            .withColumn('invisible_date_time',from_utc_timestamp(col('invisible_date_time').cast(TimestampType()), "America/Toronto"))
            .withColumn('action_date',date_format('request_date_time', 'yyyy-MM-dd'))
           )
# Sessions: one row per non-null ses_id.
# NOTE: Spark does not guarantee that drop_duplicates keeps the first row of
# the preceding orderBy, so the surviving row may not be the latest session.
sessions = (spark.read.csv(adlurl+'sessions',header=True)
            .withColumnRenamed('id','ses_id')
            .withColumn('start_date_time', from_utc_timestamp(col('start_date_time').cast(TimestampType()), "America/Toronto"))
            .withColumn('end_date_time', from_utc_timestamp(col('end_date_time').cast(TimestampType()), "America/Toronto"))
            .withColumn('session_date',date_format('start_date_time', 'yyyy-MM-dd'))
            .orderBy('end_date_time',ascending=False)
            .drop_duplicates(subset=['ses_id'])
            .where(col('ses_id').isNotNull())
            .withColumn("hardware_device_id",col("hardware_device_id").cast(IntegerType()))
           )
# Push metadata: convert the send time from UTC to Toronto time
meta = (spark.read.csv(meta_adl, header=True)
        .withColumn('send_time',from_utc_timestamp(col('send_time').cast(TimestampType()), "America/Toronto"))
       )
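
The `sessions` load appears intended to keep the most recent row per `ses_id`. The plain-Python sketch below shows that "latest row per key" logic on a few made-up rows so it can be checked without a cluster. The function name `latest_per_key` and the sample values are hypothetical. In Spark the same effect is usually obtained with `row_number()` over a `Window` partitioned by the key and ordered by the timestamp descending, which is a suggestion rather than what the notebook does.

```python
def latest_per_key(rows, key, order_by):
    """Keep, for each non-null value of `key`, the row with the largest `order_by`."""
    latest = {}
    for row in rows:
        k = row[key]
        if k is None:  # mirrors the ses_id IS NOT NULL filter
            continue
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


# Made-up sample rows; ISO-formatted timestamps compare correctly as strings.
sample_sessions = [
    {"ses_id": "a", "end_date_time": "2020-01-01 10:00:00", "device_id": "d1"},
    {"ses_id": "a", "end_date_time": "2020-01-01 11:30:00", "device_id": "d1"},
    {"ses_id": None, "end_date_time": "2020-01-01 09:00:00", "device_id": "d2"},
    {"ses_id": "b", "end_date_time": "2020-01-02 08:00:00", "device_id": "d3"},
]
deduped = latest_per_key(sample_sessions, "ses_id", "end_date_time")
print(len(deduped))
```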

##### Filter `actions` for Push notifications only

In [0]:
push = (actions[actions["type_major"].like('%PUSH%')]).cache()
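
The filter relies on SQL `LIKE` semantics, where `%` matches any run of characters and `_` matches exactly one character. The sketch below emulates that in plain Python to make the behaviour easy to check. The helper `sql_like` is hypothetical, it ignores `LIKE` escape characters, and `PAGE_VIEW` is a made-up value used only for contrast.

```python
import re


def sql_like(value: str, pattern: str) -> bool:
    """Evaluate a SQL LIKE pattern: % matches any run of characters, _ exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


print(sql_like("PUSH_CLICK", "%PUSH%"))      # substring match keeps push actions
print(sql_like("PUSHXCLICK", "PUSH_CLICK"))  # "_" also matches a non-underscore character
```

One consequence is that a pattern without `%`, such as `PUSH_CLICK`, is slightly broader than an equality test, because each `_` accepts any single character.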

## Join `push` (from actions) and `sessions` into a single table

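The left join adds the `device_id` of the matching session to each push action; push actions without a matching session are kept with a NULL `device_id`.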

In [0]:
push_devid =(push
        .select("action_date", "session_id", "type_major", "type_minor", "yinzid", "resource_major")
        .join(sessions.withColumn("session_id", sessions.device_generated_id)
                      .select("session_id","device_id"),
              "session_id",
              "left"
             )
       ).cache()
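
The left join above can be sketched in plain Python to show its two edge cases: a push row with no matching session keeps a NULL `device_id`, and a push row matching several sessions is repeated once per match. The function name `left_join_device_id` and the sample rows are hypothetical. Whether `device_generated_id` is unique in the real `sessions` data is not stated in this notebook, since the deduplication there is on `ses_id`.

```python
def left_join_device_id(push_rows, session_rows):
    """Left-join push rows to sessions on session_id, adding device_id."""
    by_session = {}
    for s in session_rows:
        if s["device_generated_id"] is not None:  # NULL keys never match in SQL
            by_session.setdefault(s["device_generated_id"], []).append(s["device_id"])
    joined = []
    for p in push_rows:
        # Unmatched rows are kept once with device_id=None;
        # rows with several matches are repeated once per match.
        for device_id in by_session.get(p["session_id"], [None]):
            joined.append(dict(p, device_id=device_id))
    return joined


# Made-up sample rows
sample_push = [
    {"session_id": "s1", "type_major": "PUSH_CLICK"},
    {"session_id": "s2", "type_major": "PUSH_VIEW"},
    {"session_id": "s3", "type_major": "PUSH_CLICK"},
]
sample_sessions = [
    {"device_generated_id": "s1", "device_id": "d1"},
    {"device_generated_id": "s3", "device_id": "d3"},
    {"device_generated_id": "s3", "device_id": "d4"},
]
joined = left_join_device_id(sample_push, sample_sessions)
print(len(joined))
```

If duplicates of the join key exist, the repeated rows inflate the total counts in the aggregation step, so checking key uniqueness before joining may be worthwhile.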

## Aggregate `Push` actions (`click`, `view`, `open`)

Registered and non-registered users are identified inside a single aggregation, using conditional sums:
* registered_user: `yinzid` is not NULL and `device_id` is not NULL
* non_registered_user: `yinzid` is NULL and `device_id` is not NULL

In [0]:
# Totals use sum(when(...).otherwise(0)); unique counts use countDistinct(when(...)),
# where when() without otherwise() yields NULL for non-matching rows, and NULLs
# (including NULL device_ids) are not counted.
# NOTE: in like(), "_" matches any single character, so like("PUSH_CLICK")
# is slightly broader than an equality test.
total_push = (push_devid
                .groupBy(["action_date", "type_minor"])
                .agg(sum(when(col("type_major").like("PUSH_CLICK"), 1).otherwise(0)).alias("total_push_clicks"),
                     countDistinct(when(col("type_major").like("PUSH_CLICK"), col("device_id"))).alias("total_unique_push_clicks"),
                     sum(when(col("type_major").like("PUSH_VIEW"), 1).otherwise(0)).alias("total_push_views"),
                     countDistinct(when(col("type_major").like("PUSH_VIEW"), col("device_id"))).alias("total_unique_push_views"),
                     sum(when(col("type_major").like("PUSH_OPENED_APP"), 1).otherwise(0)).alias("total_opened_app"),
                     countDistinct(when(col("type_major").like("PUSH_OPENED_APP"), col("device_id"))).alias("total_unique_opened_app"),
                     sum(when(col("yinzid").isNotNull() & col("device_id").isNotNull() & col("type_major").like("PUSH_CLICK"), 1).otherwise(0)).alias("total_reg_push_clicks"),
                     sum(when(col("yinzid").isNull() & col("device_id").isNotNull() & col("type_major").like("PUSH_CLICK"), 1).otherwise(0)).alias("total_non_reg_push_clicks")
                    )
               ).cache()
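
To make the difference between total, unique, registered and non-registered counts concrete, here is a plain-Python sketch of the click part of the aggregation on a few made-up rows. The function name `aggregate_push_clicks` and the sample values are hypothetical, the sketch uses an exact match on `PUSH_CLICK` rather than `LIKE`, and it only returns groups that contain at least one click.

```python
from collections import defaultdict


def aggregate_push_clicks(rows):
    """Count total, unique, registered and non-registered clicks per (action_date, type_minor)."""
    groups = defaultdict(lambda: {"total": 0, "devices": set(), "reg": 0, "non_reg": 0})
    for r in rows:
        if r["type_major"] != "PUSH_CLICK":  # views and opens are skipped in this sketch
            continue
        g = groups[(r["action_date"], r["type_minor"])]
        g["total"] += 1
        if r["device_id"] is None:  # no device: counted in the total only
            continue
        g["devices"].add(r["device_id"])
        if r["yinzid"] is not None:
            g["reg"] += 1
        else:
            g["non_reg"] += 1
    return {
        key: {
            "total_push_clicks": g["total"],
            "total_unique_push_clicks": len(g["devices"]),
            "total_reg_push_clicks": g["reg"],
            "total_non_reg_push_clicks": g["non_reg"],
        }
        for key, g in groups.items()
    }


# Made-up sample rows
sample_rows = [
    {"action_date": "2020-01-01", "type_minor": "294230", "type_major": "PUSH_CLICK", "device_id": "d1", "yinzid": "y1"},
    {"action_date": "2020-01-01", "type_minor": "294230", "type_major": "PUSH_CLICK", "device_id": "d1", "yinzid": "y1"},
    {"action_date": "2020-01-01", "type_minor": "294230", "type_major": "PUSH_CLICK", "device_id": "d2", "yinzid": None},
    {"action_date": "2020-01-01", "type_minor": "294230", "type_major": "PUSH_CLICK", "device_id": None, "yinzid": None},
    {"action_date": "2020-01-01", "type_minor": "294230", "type_major": "PUSH_VIEW", "device_id": "d3", "yinzid": None},
]
result = aggregate_push_clicks(sample_rows)
print(result[("2020-01-01", "294230")])
```

In this sample the registered and non-registered clicks add up to 3 while the total is 4, because a click with a NULL `device_id` falls into neither group. The same gap can appear in the Spark output whenever the left join leaves `device_id` empty.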


In [0]:
# display(meta)

In [0]:
# display(total_push.where(col('type_minor') == '294230'))

##### Join push and metadata

In [0]:
meta_push_ini = total_push.join(meta, total_push['type_minor'] == meta['id']).drop(col('id'))

In [0]:
# display(meta_push_ini)

In [0]:
# Cast subscriptions_targeted to int and replace NULLs (empty cells) with 0
meta_push = (meta_push_ini
             .withColumn('subscriptions_targeted', meta_push_ini.subscriptions_targeted.cast('int'))
             .na.fill({'subscriptions_targeted': 0})
            )
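
The cast-then-fill step can be pictured with the plain-Python sketch below. With Spark's default (non-ANSI) cast behaviour, a string that cannot be cast to an integer becomes NULL, which `na.fill` then replaces with 0, so both empty and malformed cells end up as 0. The helper `to_int_or_zero` is hypothetical and does not reproduce every detail of Spark's string-to-int cast.

```python
from typing import Optional


def to_int_or_zero(value: Optional[str]) -> int:
    """Return the integer value of a CSV cell, or 0 when it is missing or not an integer."""
    if value is None:
        return 0
    try:
        return int(value.strip())
    except ValueError:
        return 0


# Made-up sample cells
print([to_int_or_zero(v) for v in ["1500", None, "", "n/a"]])
```

Because malformed values are silently mapped to 0, a 0 in `subscriptions_targeted` may mean "unknown" rather than "no subscribers targeted".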

### Write to SQL

In [0]:
sqlserver = dbutils.secrets.get(scope = "jdbc", key = "sqlserver")
port = '1433'
database = 'mlse_freq'
user = dbutils.secrets.get(scope = "jdbc", key = "username")
pswd = dbutils.secrets.get(scope = "jdbc", key = "password")
url = 'jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database

In [0]:
(meta_push
   .coalesce(8)
   .write
   .option('user', user)
   .option('password', pswd)
   .jdbc(url, team + "_push", mode = 'overwrite' )
)