# Daily Active User Table

## Description 

This notebook calculates the daily active users (DAU) for a team from the YinzCam Realtime API data.

## Preprocessing

* Load the YinzCam Realtime API Data from the ADL
* Join the Realtime data (`actions`, `sessions`, `hardware`, `geoip`) into a single table
* Aggregate users (`yinzid` or `device_id`) by `session_date` to get the Daily Active Users
* Additional "Active User" calculation ideas not utilized
* Errors when grouping the active_users

## Output

One SQL table written to MS SQL:
* `team_tor_active_users`, where `team` is one of `nhl`, `tfc`, or `nba`. Columns: session_date, city_name, manufacturer, country_name, postal_code, subdivision1_name, active_users, active_devices.

## QA 
* Jose Nandez, Data Scientist, Jose.Nandez@MLSE.com (Primary)
* Nicole Ridout, Data Engineer, Nicole.Ridout@MLSE.com
* Farah Bastien, Manager of Data Science, Farah.Bastien@MLSE.com

##### Load the necessary functions from `PySpark`

In [0]:
from pyspark.sql.functions import col,date_format,from_utc_timestamp, unix_timestamp, sum, count, countDistinct, lit, abs
from pyspark.sql.types import TimestampType, IntegerType

##### Load ADL credentials
The following cell sets the `Spark` configuration for accessing the Azure Data Lake. It uses *masked* credentials stored in Databricks secrets.

In [0]:
url = "https://login.microsoftonline.com/{0}/oauth2/token".format(dbutils.secrets.get(scope = "adl_cred", key = "directory_id"))
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", dbutils.secrets.get(scope = "adl_cred", key = "client_id"))
spark.conf.set("dfs.adls.oauth2.credential", dbutils.secrets.get(scope = "adl_cred", key = "credential"))
spark.conf.set("dfs.adls.oauth2.refresh.url", url)

## Load the YinzCam Realtime API, User Profile & Metadata from the ADL

##### Define the team and url from ADL

In [0]:
# Create the "team" text widget and read its value in one step
dbutils.widgets.text("team", "", "")
team = dbutils.widgets.get("team")
print("Working with {0} team".format(team))

In [0]:
adlurl = "adl://mlse1.azuredatalakestore.net/yinz_cam/"+team +"_tor/realtime_api/"
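
Because the widget defaults to an empty string, an unset `team` would silently produce a path such as `.../yinz_cam/_tor/realtime_api/`. Below is a minimal sketch of a guard, assuming the valid values are the three listed in the Output section (`nhl`, `tfc`, `nba`). The names `ALLOWED_TEAMS` and `build_adl_url` are hypothetical and not part of this notebook.

```python
# Assumption: the valid teams are the ones listed in the Output section.
ALLOWED_TEAMS = {"nhl", "tfc", "nba"}


def build_adl_url(team):
    """Return the Realtime API folder for a team, rejecting unknown values."""
    team = team.strip()
    if team not in ALLOWED_TEAMS:
        raise ValueError(
            "Unknown team {0!r}; expected one of {1}".format(team, sorted(ALLOWED_TEAMS))
        )
    return "adl://mlse1.azuredatalakestore.net/yinz_cam/" + team + "_tor/realtime_api/"


print(build_adl_url("nhl"))
```

Failing early here is cheaper than letting `spark.read.csv` fail later on a path that does not exist.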

##### Load Realtime data from ADL

In [0]:
actions  = (spark.read.csv(adlurl+'actions',header=True)
            .drop_duplicates()
            .withColumnRenamed('id','action_id')
            .withColumn('request_date_time',col('request_date_time').cast(TimestampType()))
            .withColumn('request_date_time_est', from_utc_timestamp(col('request_date_time'), "America/Toronto"))
            .withColumn('action_date',date_format('request_date_time_est', 'yyyy-MM-dd'))
           )
sessions = (spark.read.csv(adlurl+'sessions',header=True)
            .withColumnRenamed('id','ses_id')
            .withColumn('start_date_time', from_utc_timestamp(col('start_date_time').cast(TimestampType()), "America/Toronto"))
            .withColumn('end_date_time', from_utc_timestamp(col('end_date_time').cast(TimestampType()), "America/Toronto"))
            .withColumn('session_date',date_format('start_date_time', 'yyyy-MM-dd'))
            .orderBy('end_date_time',ascending=False)
            .drop_duplicates(subset=['ses_id'])
            .where(col('ses_id').isNotNull())
            .withColumn("hardware_device_id",col("hardware_device_id").cast(IntegerType()))
           )
hardware = (spark.read.csv(adlurl+'hardware',header=True)
            .withColumnRenamed('id','hardware_id')
            .drop_duplicates(subset=['hardware_id'])
            .where(col('hardware_id').isNotNull())
            .withColumn("hardware_id",col("hardware_id").cast(IntegerType()))
           )
geoip    = (spark.read.csv(adlurl+'geoip',header=True)
            .withColumnRenamed('id','geoip_id')
            .withColumn('geoip_id',col('geoip_id').cast(IntegerType()))
            .where(col('geoip_id').isNotNull())
           )
geoip    =  (geoip
             .orderBy(geoip.columns, ascending=False)
             .drop_duplicates(subset=['geoip_id'])
            )
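
The `sessions` cell sorts by `end_date_time` and then drops duplicates on `ses_id`, which reads as "keep the most recent row per session". As far as I know, Spark does not guarantee which duplicate `drop_duplicates` keeps after a sort, so the intended rule is spelled out below in plain Python. This is only a sketch of the semantics, not the notebook's implementation; `latest_per_key` and the sample rows are hypothetical. In Spark, a common way to get the same result is `row_number()` over a `Window.partitionBy('ses_id')` ordered by `end_date_time` descending.

```python
def latest_per_key(rows, key, order_by):
    """Keep, for each key value, the row with the greatest order_by value."""
    latest = {}
    for row in rows:
        k = row[key]
        if k is None:
            continue  # mirrors the isNotNull filter on ses_id
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


# Hypothetical rows; timestamps in this fixed format compare correctly as strings.
sample_sessions = [
    {"ses_id": "a", "end_date_time": "2017-09-09 23:40:00"},
    {"ses_id": "a", "end_date_time": "2017-09-09 23:49:38"},
    {"ses_id": "b", "end_date_time": "2017-09-09 22:00:00"},
    {"ses_id": None, "end_date_time": "2017-09-09 21:00:00"},
]

deduplicated = latest_per_key(sample_sessions, "ses_id", "end_date_time")
print(sorted(deduplicated, key=lambda r: r["ses_id"]))
```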

In [0]:
# row = actions.where(col('session_id') == '7510548B-BC99-4020-A80C-2667EC758AB6')
# print(geoip)

action_id,in_venue,invisible_date_time,request_date_time,resource_major,resource_minor,session_id,sort_order,type_major,type_minor,yinzid,request_date_time_est,action_date
423,0,,2017-09-10T03:47:33.000+0000,HOME,,7510548B-BC99-4020-A80C-2667EC758AB6,5,AD_INLINE_IMP,c2833b28-6ef2-4d5d-89e2-32e4c831a964|INLINE,f3c36ea8-587c-4f7d-abe8-8899d77915e2,2017-09-09T23:47:33.000+0000,2017-09-09
432,0,2017-09-10 03:49:38.0,2017-09-10T03:47:17.000+0000,HOME,,7510548B-BC99-4020-A80C-2667EC758AB6,0,V,,f3c36ea8-587c-4f7d-abe8-8899d77915e2,2017-09-09T23:47:17.000+0000,2017-09-09
426,0,,2017-09-10T03:47:35.000+0000,HOME,,7510548B-BC99-4020-A80C-2667EC758AB6,6,AD_INLINE_IMP,c2833b28-6ef2-4d5d-89e2-32e4c831a964|INLINE,f3c36ea8-587c-4f7d-abe8-8899d77915e2,2017-09-09T23:47:35.000+0000,2017-09-09
410,0,,2017-09-10T03:47:18.000+0000,HOME,,7510548B-BC99-4020-A80C-2667EC758AB6,2,AD_SPO_BAR_IMP,d6c423a2-b0d1-4c51-be84-fbc06b959d5b,f3c36ea8-587c-4f7d-abe8-8899d77915e2,2017-09-09T23:47:18.000+0000,2017-09-09
401,0,,2017-09-10T03:47:18.000+0000,HOME,,7510548B-BC99-4020-A80C-2667EC758AB6,1,AD_INLINE_IMP,c2833b28-6ef2-4d5d-89e2-32e4c831a964|INLINE,f3c36ea8-587c-4f7d-abe8-8899d77915e2,2017-09-09T23:47:18.000+0000,2017-09-09
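
The sample rows above show `request_date_time` in UTC and `request_date_time_est` shifted to Toronto local time, which moves the 03:47 UTC requests back to the previous calendar day. The sketch below reproduces that conversion for one timestamp with the Python standard library, assuming Python 3.9+ with time zone data available. `to_toronto_date` is a hypothetical helper, not something the notebook defines.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_toronto_date(utc_iso):
    """Convert a UTC timestamp string to the Toronto calendar date (yyyy-MM-dd)."""
    utc_dt = datetime.strptime(utc_iso, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    local_dt = utc_dt.astimezone(ZoneInfo("America/Toronto"))
    return local_dt.strftime("%Y-%m-%d")


# First sample row: 2017-09-10 03:47:33 UTC falls on 2017-09-09 in Toronto.
print(to_toronto_date("2017-09-10T03:47:33"))
```

This is why `action_date` and `session_date` are derived from the converted timestamps: a late-evening game in Toronto would otherwise be split across two UTC dates.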


## Join the Realtime data (`actions`, `sessions`, `hardware`, `geoip`) into a single table

In [0]:
actions_sessions = (actions
                    .join(sessions,
                          col('session_id') == col('device_generated_id'))
                    .join(geoip,col('session_device_generated_id') == col('device_generated_id'))
                    .join(hardware,col('hardware_id')==col('hardware_device_id'))
                    .where(col('yinzid').isNotNull())
                   )


## Aggregate users (`yinzid` or `device_id`) by `session_date` to get the Daily Active Users

I am excluding `subdivision2_name` for the moment, since fewer than 3% of the records have it. Including it would store a column in the SQL DB that we could not break the counts down by, for lack of data.

In [0]:
active_users = (actions_sessions
                .where(col('session_date').isNotNull() & col('yinzid').isNotNull() & col('device_id').isNotNull() & col('session_id').isNotNull())
                .groupBy('session_date','city_name','manufacturer','country_name','postal_code','subdivision1_name')
                .agg(countDistinct('yinzid').alias('active_users'),
                     countDistinct('device_id').alias('active_devices')
                    )
                .orderBy('session_date')
                .withColumn('session_date',col('session_date').cast(TimestampType()))
               )
# print(actions_sessions.count())
# print(actions_sessions.drop_duplicates().count())


## Additional "Active User" calculation ideas not utilized


Possible Active User: excludes city, postal code and subdivision

```
possible_active_user = (actions_sessions
                        .where(col('session_date').isNotNull() & col('yinzid').isNotNull() & col('device_id').isNotNull())
                        .groupBy('session_date','manufacturer','country_name')
                        .agg(countDistinct('yinzid').alias('active_users'),
                             countDistinct('device_id').alias('active_devices')
                            )
                        .orderBy('session_date')
                       )
```


Display the total true count (distinct users and devices per date, with no other grouping)

```
display(actions_sessions
        .where(col('session_date').isNotNull() & col('yinzid').isNotNull() & col('device_id').isNotNull())
        .groupBy('session_date')
        .agg(countDistinct('yinzid').alias('active_users'),
             countDistinct('device_id').alias('active_devices')
            )
        .orderBy('session_date')
       )
```


Store the total true count as `active_user_true` for the error calculation below

```
active_user_true = (actions_sessions
                    .where(col('session_date').isNotNull() & col('yinzid').isNotNull() & col('device_id').isNotNull())
                    .groupBy('session_date')
                    .agg(countDistinct('yinzid').alias('active_users_x'),
                         countDistinct('device_id').alias('active_devices_x')
                        )
                    .orderBy('session_date')
                   )
``` 


## Errors when grouping the active_users

Geolocation in YinzCam is not very accurate, because it is derived from the IP address rather than from GPS. The same user can therefore fall into more than one location group on the same day, so aggregating the `active_users` table by date can give a different total than a distinct count per date. This error can be estimated as a percentage error against the true count.

The result is a mean error of 12% +/- 8%, with a maximum error of 30%. You can enable the following cell to recalculate the error; it needs `active_user_true` from the previous section.

```
display(active_users
        .groupBy('session_date')
        .agg(sum('active_users').alias('active_users_y'),
             sum('active_devices').alias('active_devices_y')
            )
        .join(active_user_true,'session_date')
        .withColumn('per_of_error',abs(col('active_users_x')-col('active_users_y'))/col('active_users_x')*100)
        .select('per_of_error').describe()
       )
```
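
To make the source of the error concrete, here is a small plain-Python sketch of the same comparison: distinct users per (date, city) group summed by date, against distinct users per date. The events are made up for illustration and `percent_error` is a hypothetical helper, not the notebook's code.

```python
from collections import defaultdict

# Hypothetical events: (session_date, city_name, yinzid)
events = [
    ("2017-09-09", "Toronto", "u1"),
    ("2017-09-09", "Mississauga", "u1"),  # same user, second IP-based location
    ("2017-09-09", "Toronto", "u2"),
    ("2017-09-09", "Toronto", "u3"),
    ("2017-09-09", "Ottawa", "u4"),
]


def percent_error(events):
    """Percentage error of summed per-group distinct counts vs. the true daily count."""
    per_group = defaultdict(set)  # (date, city) -> distinct users
    per_date = defaultdict(set)   # date -> distinct users (the "true" count)
    for date, city, user in events:
        per_group[(date, city)].add(user)
        per_date[date].add(user)

    summed = defaultdict(int)     # date -> sum of per-group distinct counts
    for (date, _city), users in per_group.items():
        summed[date] += len(users)

    return {
        date: abs(len(users) - summed[date]) / len(users) * 100
        for date, users in per_date.items()
    }


print(percent_error(events))
```

Here user `u1` is counted once in Toronto and once in Mississauga, so the summed count is 5 against a true count of 4, an error of 25%.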


## Writing to SQL

The following cells build the JDBC connection and write `active_users` to MS SQL, overwriting the existing table.

In [0]:
sqlserver = dbutils.secrets.get(scope = "jdbc", key = "sqlserver")
port = '1433'
database = 'mlse_sqldb'
user = dbutils.secrets.get(scope = "jdbc", key = "username")
pswd = dbutils.secrets.get(scope = "jdbc", key = "password")
url = 'jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database

In [0]:
(active_users
 .coalesce(8)
 .write
 .option('user', user)
 .option('password', pswd)
 .jdbc(url, team + "_tor_active_users", mode = 'overwrite' )
)