#  Track Your Tinnitus (TYT) Dataset

## Purpose of this Notebook
- [ ] Clean the dataset and save cleaned version
- [x] Get a statistical overview
    - [x] How many users?
    - [x] How many assessments (= filled out questionnaires)?
    - [x] Date range of the dataset?
    - [x] User-assessment distribution
- [x] Potential target for classification?
- [ ] Potential features for classification?
    
    

In [1]:
# imports
import pandas as pd
from datetime import date, datetime
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# read data
tyt_raw = pd.read_csv("../../data/d01_raw/tyt/22-01-17_standardanswers.csv")
KEEP_COLUMNS = ['user_id','created_at','question1','question2','question3','question4','question5','question6','question7']
tyt_raw = tyt_raw[KEEP_COLUMNS].copy()  # work on a copy so later column assignments do not target a slice of the raw frame
tyt_raw['created_at'] = pd.to_datetime(tyt_raw.created_at, format="%Y-%m-%d %H:%M:%S")
tyt_raw.head()

Unnamed: 0,user_id,created_at,question1,question2,question3,question4,question5,question6,question7
0,1.0,2013-08-13 12:33:11,1.0,0.293594,0.540925,1.0,0.5,0.346975,0.713523
1,1.0,2013-08-13 12:33:53,1.0,0.284698,0.653025,0.75,0.25,0.30605,0.88968
2,1.0,2013-08-13 12:34:22,1.0,0.816726,1.0,0.25,0.75,0.768683,0.209964
3,1.0,2013-08-13 12:36:23,0.0,0.0,0.0,0.5,0.625,0.208185,0.270463
4,1.0,2013-08-13 12:38:09,1.0,0.407473,0.766904,0.625,0.625,0.307829,0.736655


###  How many users?

In [6]:
print("No. of unique users:",tyt_raw.user_id.nunique())

No. of unique users: 3269


###  How long does the dataset span?

In [35]:
print("The dataset spans", (tyt_raw.created_at.max() - tyt_raw.created_at.min()).days, "days, starting on", tyt_raw.created_at.min().date(), "and ending on", tyt_raw.created_at.max().date())

The dataset spans 3078 days, starting on 2013-08-13 and ending on 2022-01-17
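
One detail worth keeping in mind when reading this number: `Timedelta.days` counts only complete 24-hour periods, so it can be one less than the difference between the two calendar dates. The sketch below illustrates this. The first timestamp is taken from the `head()` output above, while the time of day of the last record is a made-up value, since the notebook only shows its date.

```python
import pandas as pd

# First timestamp, as shown in the head() output above.
first = pd.Timestamp("2013-08-13 12:33:11")
# Hypothetical time of day for the last record; only its date (2022-01-17) is known.
last = pd.Timestamp("2022-01-17 08:00:00")

# Number of complete 24-hour periods between the two timestamps.
full_days = (last - first).days
# Difference between the two calendar dates, ignoring the time of day.
calendar_days = (last.date() - first.date()).days

print(full_days, calendar_days)
```

With these inputs the two values differ by one (3078 complete days versus 3079 calendar days), so either figure can be reported as long as the definition is stated.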


### How intensely do users engage with the app?

In [5]:
tyt_raw.loc[:,'assessment_quantile'] = tyt_raw.loc[:,'user_id'].map(pd.qcut(tyt_raw.user_id.value_counts(), 5, duplicates='drop').to_dict())

print("Number of users per range of submitted assessments:")
print(pd.DataFrame(tyt_raw.groupby('assessment_quantile')['user_id'].nunique()).reset_index().rename({'user_id':'n_users'}, axis=1))

print("As you can see, unlike the UNITI dataset, the TYT dataset shows a much sharper drop-off in how long users keep submitting assessments before they give up." + 
      " This is probably because most UNITI App users are recruited by doctors for the UNITI RCT.")

Number of users per range of submitted assessments:
  assessment_quantile  n_users
0        (0.999, 2.0]     1447
1          (2.0, 5.0]      543
2         (5.0, 24.0]      630
3      (24.0, 6075.0]      649
As you can see, unlike the UNITI dataset, the TYT dataset shows a much sharper drop-off in how long users keep submitting assessments before they give up. This is probably because most UNITI App users are recruited by doctors for the UNITI RCT.
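
The table has four rows although five quantile bins were requested. This is what `duplicates='drop'` does when several quantile edges coincide, which happens when many users submit only one or two assessments. A minimal sketch on made-up counts (the values and user labels below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy per-user assessment counts (made-up values): most users have a single assessment.
counts = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 10, 50],
                   index=[f"u{i}" for i in range(10)])

# Ask for 5 quantile bins; identical bin edges are merged by duplicates='drop'.
bins = pd.qcut(counts, 5, duplicates="drop")

# Fewer than 5 bins remain because the lowest quantile edges are all equal to 1.
n_bins = bins.cat.categories.size
print(n_bins)
print(bins.value_counts(sort=False))
```

The resulting bins are therefore not equally sized groups of users; the lowest bin absorbs everyone who falls on the repeated edge.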


#### What about at the user level?

In [31]:
tyt_raw['date'] = tyt_raw['created_at'].map(lambda x: x.date())
tyt_interaction_intensity_userlevel = pd.DataFrame(tyt_raw.groupby('user_id').agg({'date':['min','max','nunique'], 'user_id':'count'}).reset_index().values, columns = ['user_id','date_min','date_max','n_unique_days', 'n_assessments'])
tyt_interaction_intensity_userlevel['date_min'] = pd.to_datetime(tyt_interaction_intensity_userlevel.date_min, format='%Y-%m-%d')
tyt_interaction_intensity_userlevel['date_max'] = pd.to_datetime(tyt_interaction_intensity_userlevel.date_max, format='%Y-%m-%d')
tyt_interaction_intensity_userlevel['n_unique_days'] = tyt_interaction_intensity_userlevel['n_unique_days'].astype(int)
tyt_interaction_intensity_userlevel['n_assessments'] = tyt_interaction_intensity_userlevel['n_assessments'].astype(int)
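# Note: datetime_is_numeric is only accepted by pandas 1.x (from 1.1); pandas 2.0 removed it and always summarises datetimes numerically
tyt_interaction_desc = tyt_interaction_intensity_userlevel.describe(datetime_is_numeric=True)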

In [36]:
print("Min. number of unique days of data from a user is:", tyt_interaction_desc['n_unique_days']['min'],
      "days \n25% of the users have <=",tyt_interaction_desc['n_unique_days']['25%'],
      "days \n50% of the users have <=",tyt_interaction_desc['n_unique_days']['50%'],
      "days \n75% of the users have <=",tyt_interaction_desc['n_unique_days']['75%'],
      "days, and \nMax. number of unique days of data from a user is:",tyt_interaction_desc['n_unique_days']['max'])

Min. number of unique days of data from a user is: 1.0 days 
25% of the users have <= 1.0 days 
50% of the users have <= 2.0 days 
75% of the users have <= 8.0 days, and 
Max. number of unique days of data from a user is: 1710.0


In [37]:
print("Min. number of submitted assessments from a user is:", tyt_interaction_desc['n_assessments']['min'],
      "assessments \n25% of the users have <=",tyt_interaction_desc['n_assessments']['25%'],
      "assessments \n50% of the users have <=",tyt_interaction_desc['n_assessments']['50%'],
      "assessments \n75% of the users have <=",tyt_interaction_desc['n_assessments']['75%'],
      "assessments, and \nMax. number of submitted assessments from a user is:",tyt_interaction_desc['n_assessments']['max'])

Min. number of submitted assessments from a user is: 1.0 assessments 
25% of the users have <= 1.0 assessments 
50% of the users have <= 3.0 assessments 
75% of the users have <= 16.0 assessments, and 
Max. number of submitted assessments from a user is: 6075.0
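
The per-user table used for these summaries could also be built with named aggregation, which avoids rebuilding the frame from `.values` and casting the columns back afterwards. This is a sketch on made-up rows, not the notebook's own code: the `toy` frame is hypothetical, `dt.normalize()` is swapped in for `.date()` to keep a datetime dtype, and it assumes `created_at` has already been parsed as datetimes.

```python
import pandas as pd

# Toy stand-in for tyt_raw (made-up rows).
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "created_at": pd.to_datetime([
        "2013-08-13 12:33:11", "2013-08-13 12:33:53",
        "2013-08-14 09:00:00", "2013-09-01 10:00:00",
    ]),
})
# Midnight timestamps: one value per calendar day, still datetime64.
toy["date"] = toy["created_at"].dt.normalize()

# One row per user: first day, last day, number of distinct days, number of assessments.
per_user = toy.groupby("user_id").agg(
    date_min=("date", "min"),
    date_max=("date", "max"),
    n_unique_days=("date", "nunique"),
    n_assessments=("created_at", "count"),
).reset_index()

print(per_user)
print(per_user[["n_unique_days", "n_assessments"]].describe())
```

Because the aggregated columns keep their dtypes, no `astype(int)` or `pd.to_datetime` round trip is needed before calling `describe()`.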


# Pointers for Target variable

###  A candidate target variable (for regression) is "question3"


This is because question3 measures the distress caused by the disease. Since there is no clear treatment that reliably reduces symptom severity, treating the distress (as in the case of chronic pain) is considered the appropriate approach, rather than treating the symptom severity itself.

###  For classification, the target variable ("question3") can be discretised:

(target within mean +/- a user-defined noise threshold is "no change", 
target > mean + threshold is "worse", 
target < mean - threshold is "better")
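
The rule above can be sketched as follows. The helper name `discretise_target`, the threshold of 0.1, and the toy values are assumptions for illustration. The text does not say which mean is meant, so this sketch uses the mean of the series it is given.

```python
import numpy as np
import pandas as pd

def discretise_target(values: pd.Series, threshold: float) -> pd.Series:
    """Label each value relative to the mean of `values` (hypothetical helper)."""
    mean = values.mean()  # NaN answers are skipped when computing the mean
    labels = np.select(
        [values > mean + threshold, values < mean - threshold],
        ["worse", "better"],
        default="no change",
    )
    # Keep missing answers missing instead of labelling them "no change".
    return pd.Series(labels, index=values.index, dtype="object").where(values.notna())

# Made-up question3 answers; their mean is 0.4375.
question3 = pd.Series([0.10, 0.40, 0.45, 0.80, np.nan])
labels = discretise_target(question3, threshold=0.1)
print(labels.tolist())
```

Applying the same helper separately for each user (for example within a `groupby('user_id')`) would compare every answer with that user's own mean rather than the global one; which variant is appropriate depends on the modelling goal.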

### Candidate features are all other questions, excluding the target:

[question1, question2, question4, question5, question6, question7]

###  Misc. tips

#### It might be useful to exclude the single binary variable question1, which asks whether the user hears tinnitus right now. Users are observed to report nonzero loudness and distress even when they answer question1 with "NO".

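For example, the first table below summarises the other 6 questions for the assessments where question1 was answered "NO".

The second table, which summarises the remaining assessments, shows however that some of these values are much lower than usual in the "NO" group: the means of question2 and question3 are roughly half as large (0.25 vs. 0.53 and 0.19 vs. 0.42), while the other four questions differ much less. So this decision is not clear-cut, and it is possible to argue for either choice (including or excluding this variable).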

In [40]:
tyt_raw[tyt_raw.question1 == 0].drop('user_id', axis=1).describe()

Unnamed: 0,question1,question2,question3,question4,question5,question6,question7
count,21973.0,20659.0,20549.0,21438.0,21348.0,20839.0,20912.0
mean,0.0,0.250289,0.188913,0.614874,0.262954,0.215334,0.589202
std,0.0,0.246518,0.204156,0.192227,0.226456,0.206481,0.312787
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.05,0.03,0.5,0.12,0.05,0.327206
50%,0.0,0.176471,0.126838,0.62,0.25,0.161765,0.63
75%,0.0,0.38,0.279412,0.75,0.375,0.316177,0.87
max,0.0,1.0,1.0,1.0,1.0,1.0,1.0


In [41]:
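# Note: `!= 0` also keeps rows where question1 is missing (NaN), which is why some counts below exceed the question1 count
tyt_raw[tyt_raw.question1 != 0].drop('user_id', axis=1).describe()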

Unnamed: 0,question1,question2,question3,question4,question5,question6,question7
count,82966.0,84078.0,79088.0,82474.0,81423.0,77792.0,80959.0
mean,1.0,0.533405,0.420729,0.54167,0.237398,0.293439,0.590715
std,0.0,0.299429,0.285912,0.213529,0.223359,0.246238,0.318637
min,1.0,-0.01,-0.01,-0.01,-0.01,-0.01,-0.01
25%,1.0,0.28,0.189338,0.5,0.0,0.09,0.32
50%,1.0,0.51,0.38,0.5,0.25,0.24,0.608456
75%,1.0,0.79,0.63,0.75,0.37,0.44,0.91
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0
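
The two tables above can also be compared side by side by grouping on question1. The sketch below uses a made-up frame (`toy` and its values are illustrative only). It also shows how a `!= 0` filter treats a missing question1 answer differently from a `groupby`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tyt_raw (made-up values), including one row with question1 missing.
toy = pd.DataFrame({
    "question1": [0.0, 0.0, 1.0, 1.0, np.nan],
    "question2": [0.2, 0.3, 0.5, 0.7, 0.9],
    "question3": [0.1, 0.2, 0.4, 0.6, 0.8],
})

# groupby drops rows whose key is NaN by default, so the missing answer
# does not end up in either group.
group_means = toy.groupby("question1")[["question2", "question3"]].mean()

# With a `!= 0` mask, the NaN row is treated as "not NO" and shifts the mean.
not_no_mean = toy.loc[toy["question1"] != 0, "question2"].mean()

print(group_means)
print(not_no_mean)
```

If rows with a missing question1 should be excluded from the "YES" group, filtering with `== 1` instead of `!= 0` gives the same groups as the `groupby`.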
