# ABA: Quiz 3

### Spring 2024

In [1]:
# Import required libraries :
import io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

This question tests your understanding of time-varying covariates in the Cox model. 
The Apple App Store requires every app in the store to display a privacy label. The goal of the privacy label is to inform users about an app's privacy posture and let them decide whether to download the app. There has been a lot of interest in whether privacy labels affect demand for apps. From December 14 onwards, the App Store mandated that every new and existing app display a privacy label. I am attaching a dataset on the date when apps display the label. Apps are followed for a certain number of days; if the label appears, it is recorded as 1, otherwise 0. One would expect labels for all apps to appear on or around Dec 14. However, that does not happen: existing apps do not always display privacy labels despite the App Store mandate. This leads to additional investigation into what factors cause apps to display labels.

There are many useful attributes that may influence the timing of the labels, for example the rank of the app on Dec 14 and the type of the app (free, paid, grossing). The App Store publishes the rank of apps based on how many downloads they receive. An app with a better rank (1, 2, 3, and so on) gets more downloads. We have reason to believe that top-ranked apps are more likely to disclose the label earlier.

I am attaching a dataset to answer this question. The data provides information on the number of days it takes for an app to display the label. The column "days_followed" lists the number of days the app was followed. The "Label" column shows whether the label appeared.

One can estimate a Cox model with the rank and app type as covariates to estimate the time it takes for an app to adopt the label. However, you suspect that what other apps in a category do plays a role in an app's incentive to adopt the label. If more apps in the focal app's category are displaying labels, the focal app may adopt the label faster. In short, you believe that the number of other apps that have adopted labels at a given time may affect the time an app takes to release the label.

**To test whether this has an impact on the hazard of adoption, you create a long-form version of the dataset in which, for every focal app and each day, you count the number of apps that have released the label. Show your steps and compute this variable. [10]**


In [2]:
# read data
filename = 'privacy_label.csv'
df_apps = pd.read_csv(filename)
df_apps

Unnamed: 0,app_type,days_followed,rank_on_14_Dec,app_category,app_name,Label
0,free,2,58,1,Bible Verses: Daily Devotional,0
1,free,15,52,1,Daily Bible Inspirations,0
2,free,2,85,1,Daily Bible Verse & Motivation,0
3,gross,2,70,1,Daily Devotional For Women App,0
4,gross,6,28,1,Dictionary.com: English Words,0
...,...,...,...,...,...,...
945,gross,4,48,87,RAID: Shadow Legends,1
946,gross,1,32,87,State of Survival Walking Dead,1
947,paid,24,68,87,Superimpose X,1
948,gross,8,68,87,World Series of Poker - WSOP,1


In [3]:
#keep columns for regression
cols_to_keep = ['days_followed','Label','app_category','rank_on_14_Dec','app_type']
df_cox = df_apps[cols_to_keep].reset_index(drop=True)
df_cox

Unnamed: 0,days_followed,Label,app_category,rank_on_14_Dec,app_type
0,2,0,1,58,free
1,15,0,1,52,free
2,2,0,1,85,free
3,2,0,1,70,gross
4,6,0,1,28,gross
...,...,...,...,...,...
945,4,1,87,48,gross
946,1,1,87,32,gross
947,24,1,87,68,paid
948,8,1,87,68,gross


In [4]:
#get dummies for app_type
#drop_first=True drops 'free', which becomes the reference category (a full set of dummies cannot be estimated)
df_cox = pd.get_dummies(df_cox, drop_first=True)
df_cox.head()

Unnamed: 0,days_followed,Label,app_category,rank_on_14_Dec,app_type_gross,app_type_paid
0,2,0,1,58,0,0
1,15,0,1,52,0,0
2,2,0,1,85,0,0
3,2,0,1,70,1,0
4,6,0,1,28,1,0


In [5]:
#transform to episodic format (long format)
from lifelines.utils import to_episodic_format

# time_gaps sets the length of each period (here, one day per row).
df_apps_long = to_episodic_format(
    df_cox,
    duration_col='days_followed',
    event_col='Label',
    time_gaps=1.,
)
df_apps_long.head(20)

Unnamed: 0,stop,start,Label,app_category,app_type_gross,app_type_paid,id,rank_on_14_Dec
0,1.0,0.0,0,1,0,0,0,58
1,2.0,1.0,0,1,0,0,0,58
2,1.0,0.0,0,1,0,0,1,52
3,2.0,1.0,0,1,0,0,1,52
4,3.0,2.0,0,1,0,0,1,52
5,4.0,3.0,0,1,0,0,1,52
6,5.0,4.0,0,1,0,0,1,52
7,6.0,5.0,0,1,0,0,1,52
8,7.0,6.0,0,1,0,0,1,52
9,8.0,7.0,0,1,0,0,1,52
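
To make the episodic transformation concrete, here is a minimal pandas-only sketch of the same idea on a toy table. It is a hand-rolled illustration, not the lifelines implementation: the two toy apps and their values are hypothetical, and `start`/`stop` come out as integers rather than floats.

```python
import pandas as pd

# Toy wide-format data (hypothetical values, not taken from privacy_label.csv)
wide = pd.DataFrame({
    "id": [0, 1],
    "days_followed": [2, 3],
    "Label": [0, 1],
})

# Repeat each app's row once per day it was followed
long = wide.loc[wide.index.repeat(wide["days_followed"])].reset_index(drop=True)

# Number the days within each app: stop = 1, 2, ..., days_followed
long["stop"] = long.groupby("id").cumcount() + 1
long["start"] = long["stop"] - 1

# The event can only be recorded in the app's final interval
is_last = long["stop"] == long["days_followed"]
long["Label"] = (is_last & (long["Label"] == 1)).astype(int)

long = long[["id", "start", "stop", "Label"]]
print(long)
```

Each app contributes one row per day followed, and only the last row of an app that adopted the label carries `Label = 1`.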


In [6]:
#describe data
df_apps_long.describe()

Unnamed: 0,stop,start,Label,app_category,app_type_gross,app_type_paid,id,rank_on_14_Dec
count,9156.0,9156.0,9156.0,9156.0,9156.0,9156.0,9156.0,9156.0
mean,9.568152,8.568152,0.051551,26.696702,0.391219,0.182722,455.273045,41.442005
std,7.416872,7.416872,0.221131,27.442728,0.48805,0.386459,270.548937,27.364702
min,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
25%,3.0,2.0,0.0,6.0,0.0,0.0,217.75,18.0
50%,7.0,6.0,0.0,20.0,0.0,0.0,459.0,39.0
75%,15.0,14.0,0.0,55.0,1.0,0.0,686.0,64.0
max,30.0,29.0,1.0,87.0,1.0,1.0,949.0,99.0


In [7]:
#get total apps with label and no label per period, per app category
cols_to_group = ['app_category','stop','start','Label']
df_count = df_apps_long.groupby(cols_to_group).size().to_frame('apps_in_period').reset_index()
df_count

Unnamed: 0,app_category,stop,start,Label,apps_in_period
0,1,1.0,0.0,0,24
1,1,1.0,0.0,1,1
2,1,2.0,1.0,0,22
3,1,2.0,1.0,1,1
4,1,3.0,2.0,0,17
...,...,...,...,...,...
1335,87,26.0,25.0,0,1
1336,87,27.0,26.0,0,1
1337,87,28.0,27.0,0,1
1338,87,29.0,28.0,0,1


In [8]:
#keep only the number of apps that released the label in each period
df_count = df_count[df_count.Label == 1]
df_count = df_count.drop(columns='Label').reset_index(drop=True)
df_count = df_count.rename(columns={'apps_in_period':'labeled_apps_in_period'})

In [9]:
df_count.head(10)

Unnamed: 0,app_category,stop,start,labeled_apps_in_period
0,1,1.0,0.0,1
1,1,2.0,1.0,1
2,1,9.0,8.0,1
3,1,15.0,14.0,1
4,1,16.0,15.0,1
5,1,17.0,16.0,1
6,2,1.0,0.0,4
7,2,2.0,1.0,1
8,2,4.0,3.0,2
9,2,5.0,4.0,1


In [10]:
#left join with episodic dataframe on the shared keys
df_apps_long = df_apps_long.merge(df_count, how='left', on=['app_category','stop','start'])
df_apps_long

Unnamed: 0,stop,start,Label,app_category,app_type_gross,app_type_paid,id,rank_on_14_Dec,labeled_apps_in_period
0,1.0,0.0,0,1,0,0,0,58,1.0
1,2.0,1.0,0,1,0,0,0,58,1.0
2,1.0,0.0,0,1,0,0,1,52,1.0
3,2.0,1.0,0,1,0,0,1,52,1.0
4,3.0,2.0,0,1,0,0,1,52,
...,...,...,...,...,...,...,...,...,...
9151,7.0,6.0,0,87,1,0,948,68,1.0
9152,8.0,7.0,1,87,1,0,948,68,1.0
9153,1.0,0.0,0,87,0,1,949,87,4.0
9154,2.0,1.0,0,87,0,1,949,87,


In [11]:
# and fill periods with no labeled apps with zero
df_apps_long['labeled_apps_in_period'] = df_apps_long['labeled_apps_in_period'].fillna(0)
df_apps_long

Unnamed: 0,stop,start,Label,app_category,app_type_gross,app_type_paid,id,rank_on_14_Dec,labeled_apps_in_period
0,1.0,0.0,0,1,0,0,0,58,1.0
1,2.0,1.0,0,1,0,0,0,58,1.0
2,1.0,0.0,0,1,0,0,1,52,1.0
3,2.0,1.0,0,1,0,0,1,52,1.0
4,3.0,2.0,0,1,0,0,1,52,0.0
...,...,...,...,...,...,...,...,...,...
9151,7.0,6.0,0,87,1,0,948,68,1.0
9152,8.0,7.0,1,87,1,0,948,68,1.0
9153,1.0,0.0,0,87,0,1,949,87,4.0
9154,2.0,1.0,0,87,0,1,949,87,0.0


In [12]:
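#get cumulative counts of labeled apps in the category, accumulated over each app's own follow-up days
#(rows are in time order within each id, so cumsum gives a running total)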
cols_to_group = ['app_category','id']
df_apps_long['cumulative_labeled_apps'] = df_apps_long.groupby(cols_to_group)['labeled_apps_in_period'].cumsum()
df_apps_long

Unnamed: 0,stop,start,Label,app_category,app_type_gross,app_type_paid,id,rank_on_14_Dec,labeled_apps_in_period,cumulative_labeled_apps
0,1.0,0.0,0,1,0,0,0,58,1.0,1.0
1,2.0,1.0,0,1,0,0,0,58,1.0,2.0
2,1.0,0.0,0,1,0,0,1,52,1.0,1.0
3,2.0,1.0,0,1,0,0,1,52,1.0,2.0
4,3.0,2.0,0,1,0,0,1,52,0.0,2.0
...,...,...,...,...,...,...,...,...,...,...
9151,7.0,6.0,0,87,1,0,948,68,1.0,9.0
9152,8.0,7.0,1,87,1,0,948,68,1.0,10.0
9153,1.0,0.0,0,87,0,1,949,87,4.0,4.0
9154,2.0,1.0,0,87,0,1,949,87,0.0,4.0
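
Note that the running total above counts every adoption recorded in the category up to and including the current interval, so on the row where the focal app itself adopts, its own label is part of the count. Since the question asks about *other* apps, one possible variant is sketched below. It assumes a different timing convention (only adoptions at or before the interval's `start` are counted) and uses hypothetical toy data and a hypothetical column name, `other_labeled_apps`. It is not the variable used in the rest of this notebook.

```python
import pandas as pd

# Toy wide-format data (hypothetical values): three apps in one category
wide = pd.DataFrame({
    "id": [0, 1, 2],
    "app_category": [1, 1, 1],
    "days_followed": [3, 1, 2],
    "Label": [0, 1, 1],
})

# One row per app-day, with integer (start, stop] intervals
long = wide.loc[wide.index.repeat(wide["days_followed"])].reset_index(drop=True)
long["stop"] = long.groupby("id").cumcount() + 1
long["start"] = long["stop"] - 1

# Apps that adopted the label, and the day on which they did
adopters = (
    wide.loc[wide["Label"] == 1, ["id", "app_category", "days_followed"]]
    .rename(columns={"id": "other_id", "days_followed": "adopt_day"})
)

# Pair every app-day with every adopter in the same category
pairs = long.merge(adopters, on="app_category", how="left")

# Count an adopter only if it is a different app and it adopted
# at or before the start of the current interval
pairs["counted"] = (pairs["other_id"] != pairs["id"]) & (
    pairs["adopt_day"] <= pairs["start"]
)
counts = (
    pairs.groupby(["id", "stop"])["counted"].sum()
    .rename("other_labeled_apps")
    .reset_index()
)

long = long.merge(counts, on=["id", "stop"], how="left")
print(long[["id", "start", "stop", "other_labeled_apps"]])
```

Under this convention the covariate for an interval depends only on what other apps did before the interval began, which keeps the focal app's own event out of its own covariate.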


In [13]:
#drop the per-period count and the category column (neither is used as a covariate)
df_apps_long.drop(columns=['labeled_apps_in_period','app_category'],inplace=True)

In [14]:
df_apps_long

Unnamed: 0,stop,start,Label,app_type_gross,app_type_paid,id,rank_on_14_Dec,cumulative_labeled_apps
0,1.0,0.0,0,0,0,0,58,1.0
1,2.0,1.0,0,0,0,0,58,2.0
2,1.0,0.0,0,0,0,1,52,1.0
3,2.0,1.0,0,0,0,1,52,2.0
4,3.0,2.0,0,0,0,1,52,2.0
...,...,...,...,...,...,...,...,...
9151,7.0,6.0,0,1,0,948,68,9.0
9152,8.0,7.0,1,1,0,948,68,10.0
9153,1.0,0.0,0,0,1,949,87,4.0
9154,2.0,1.0,0,0,1,949,87,4.0


**Use that as a covariate in your Cox regression along with the Rank and App type and report the results, along with the model that you estimated.  What is the impact of the number of other apps on the hazard? [15]**

We have created a time-varying covariate, "cumulative_labeled_apps" (call it $X(t)$). The hazard can then be defined as:

$$h(t \mid X(t)) = h_0(t) \exp(\beta' x + \gamma' X(t))$$

Here, $\beta$ holds the coefficients of the time-invariant covariates $x$. The coefficient $\gamma$ (for the time-varying covariate) indicates how the hazard changes as the value of that covariate increases over time.

More precisely:

$$h(t | \mathbf{X}(t)) = h_0(t) \exp(\beta_1 \times \text{app_type_gross} + \beta_2 \times \text{app_type_paid} + \beta_3 \times \text{rank_on_14_Dec} + \beta_4 \times \text{cumulative_labeled_apps}(t))$$

- $h(t | X(t))$ is the hazard at time $t$, given the covariate values at time $t$.
- $h_0(t)$ is the baseline hazard function, representing the hazard when all covariates are zero.
- $\text{app_type_gross}$, $\text{app_type_paid}$ and $\text{rank_on_14_Dec}$ are time-invariant covariates (they remain fixed for each app).
- $\text{cumulative_labeled_apps}(t)$ is the cumulative number of apps of the same category that have released the label up to time $t$, the only time-varying covariate in the model.
- $\beta_1, \beta_2, \beta_3, \beta_4$ are the coefficients for each covariate, which measure the effect of these covariates on the hazard rate, with $\beta_4$ specifically associated with the time-varying covariate.



This equation gives the hazard at any time $t$, given the app's covariate values at that time and the model's estimated coefficients.
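
To see why $\beta_4$ can be read as a log hazard ratio, compare the hazard at the same time $t$ for two values of the time-varying covariate that differ by one, holding the other covariates fixed. Write $\eta = \beta_1 \times \text{app_type_gross} + \beta_2 \times \text{app_type_paid} + \beta_3 \times \text{rank_on_14_Dec}$ for the time-invariant part and $c$ for an arbitrary count (both symbols are introduced here only for illustration). The baseline hazard and $\eta$ then cancel:

$$\frac{h(t \mid \text{cumulative_labeled_apps}(t) = c + 1)}{h(t \mid \text{cumulative_labeled_apps}(t) = c)} = \frac{h_0(t) \exp(\eta + \beta_4 (c + 1))}{h_0(t) \exp(\eta + \beta_4 c)} = \exp(\beta_4)$$

So $\exp(\beta_4)$ is the multiplicative change in the hazard for one additional labeled app in the category, whatever the shape of $h_0(t)$.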

In [17]:
# Estimate model with time varying covariate
from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()

ctv.fit(df_apps_long,
        id_col='id', #identifies app
        event_col='Label', #indicates event
        start_col='start', #start time window
        stop_col='stop', #stop time of window
       )

<lifelines.CoxTimeVaryingFitter: fitted with 9156 periods, 950 subjects, 472 events>

In [18]:
ctv.print_summary(3, model="time-varying covariates")

0,1
model,lifelines.CoxTimeVaryingFitter
event col,'Label'
number of subjects,950
number of periods,9156
number of events,472
partial log-likelihood,-2787.079
time fit was run,2024-02-21 16:09:33 UTC
model,time-varying covariates

Unnamed: 0,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,z,p,-log2(p)
app_type_gross,0.271,1.312,0.109,0.058,0.485,1.06,1.623,2.497,0.013,6.321
app_type_paid,-0.12,0.887,0.141,-0.396,0.156,0.673,1.169,-0.852,0.394,1.343
rank_on_14_Dec,-0.009,0.991,0.002,-0.012,-0.005,0.988,0.995,-4.974,<0.0005,20.54
cumulative_labeled_apps,0.037,1.038,0.009,0.019,0.054,1.02,1.056,4.114,<0.0005,14.649

0,1
Partial AIC,5582.159
log-likelihood ratio test,45.141 on 4 df
-log2(p) of ll-ratio test,28.003


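Each additional labeled app in the same category multiplies the hazard of adopting the label by exp(0.037) ≈ 1.038, i.e. the hazard increases by about 3.8% for every one-unit increase in cumulative_labeled_apps, holding rank and app type fixed. The effect is statistically significant (p < 0.0005; 95% confidence interval for the hazard ratio: 1.020 to 1.056).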
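
As a rough illustration of how these coefficients combine, the sketch below plugs the rounded values from the summary table into the hazard-ratio formula. The helper names and the two example apps are hypothetical, and because the coefficients are rounded to three decimals the results are approximate.

```python
import math

# Rounded coefficients copied from the print_summary table above
coefs = {
    "app_type_gross": 0.271,
    "app_type_paid": -0.120,
    "rank_on_14_Dec": -0.009,
    "cumulative_labeled_apps": 0.037,
}

def hazard_ratio_for_increase(k):
    """Hazard ratio for k additional labeled apps, all else equal."""
    return math.exp(coefs["cumulative_labeled_apps"] * k)

def relative_hazard(app_a, app_b):
    """Hazard of app_a relative to app_b at the same time t (h_0(t) cancels)."""
    diff = sum(coefs[name] * (app_a[name] - app_b[name]) for name in coefs)
    return math.exp(diff)

hr_1 = hazard_ratio_for_increase(1)    # one more labeled app
hr_10 = hazard_ratio_for_increase(10)  # ten more labeled apps

# Two hypothetical apps: a top-grossing app ranked 10 with 5 labeled peers,
# versus a free app ranked 50 with no labeled peers
app_a = {"app_type_gross": 1, "app_type_paid": 0,
         "rank_on_14_Dec": 10, "cumulative_labeled_apps": 5}
app_b = {"app_type_gross": 0, "app_type_paid": 0,
         "rank_on_14_Dec": 50, "cumulative_labeled_apps": 0}
ratio = relative_hazard(app_a, app_b)

print(f"1 more labeled app:   HR = {hr_1:.3f}")
print(f"10 more labeled apps: HR = {hr_10:.3f}")
print(f"app_a vs app_b:       HR = {ratio:.3f}")
```

Because the effect is multiplicative, ten additional labeled apps raise the hazard by roughly 45% rather than 10 × 3.8% = 38%.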