#  Track Your Tinnitus (TYT) Dataset

## Purpose of this Notebook
- [ ] Clean the dataset and save cleaned version
- [x] Get a statistical overview
    - [x] How many users?
    - [x] How many assessments (= filled out questionnaires)?
    - [x] Date range of the dataset?
    - [x] User-assessment distribution
- [x] Potential target for classification?
- [ ] Potential features for classification?
    
    

In [1]:
# imports
import pandas as pd
from datetime import date, datetime
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# local imports
import sys
sys.path.insert(0, "../..")

from src.d00_utils import helpers

In [3]:
# read data
tyt_raw = pd.read_csv("../../data/d01_raw/tyt/22-10-24_standardanswers.csv")
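KEEP_COLUMNS = ['user_id', 'created_at', 'question1', 'question2', 'question3',
                'question4', 'question5', 'question6', 'question7']
# copy the column subset so later assignments do not act on a view of the raw frame
tyt_raw = tyt_raw[KEEP_COLUMNS].copy()
# parse timestamps; direct column assignment makes sure the column dtype becomes datetime64
tyt_raw['created_at'] = pd.to_datetime(tyt_raw['created_at'], format="%Y-%m-%d %H:%M:%S")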
tyt_raw.head()

Unnamed: 0,user_id,created_at,question1,question2,question3,question4,question5,question6,question7
0,,2013-07-18 14:14:32,0.0,0.0,0.170818,0.666667,0.25,0.241993,0.343416
1,,2013-07-18 14:14:34,0.0,0.0,0.170818,0.666667,0.25,0.241993,0.343416
2,,2013-07-18 14:14:35,0.0,0.0,0.170818,0.666667,0.25,0.241993,0.343416
3,,2013-07-18 14:14:35,0.0,0.0,0.170818,0.666667,0.25,0.241993,0.343416
4,,2013-07-26 07:40:23,0.0,0.0,0.170818,0.666667,0.25,0.241993,0.343416


In [4]:
# drop testusers
tyt_raw = helpers.drop_test_users('tyt', tyt_raw)

###  How many users?

In [5]:
print("No. of unique users:",tyt_raw.user_id.nunique())

No. of unique users: 3303


###  How long does the dataset span?

In [6]:
print("The dataset spans", (tyt_raw.created_at.max() - tyt_raw.created_at.min()).days, "days, starting on", tyt_raw.created_at.min().date(), "and ending on", tyt_raw.created_at.max().date())

The dataset spans 3360 days, starting on 2013-07-18 and ending on 2022-09-30


### How intensely do users engage with the app?

In [7]:
tyt_raw.loc[:,'assessment_quantile'] = tyt_raw.loc[:,'user_id'].map(pd.qcut(tyt_raw.user_id.value_counts(), 5, duplicates='drop').to_dict())

print("Number of users per range of submitted assessments:")
print(pd.DataFrame(tyt_raw.groupby('assessment_quantile')['user_id'].nunique()).reset_index().rename({'user_id':'n_users'}, axis=1))

print("Unlike the UNITI dataset, the TYT dataset shows a much sharper drop-off in how long users keep submitting assessments." +
      " This is probably because most UNITI app users are recruited by doctors for the UNITI RCT.")

Number of users per range of submitted assessments:
  assessment_quantile  n_users
0        (0.999, 2.0]     1473
1          (2.0, 5.0]      546
2         (5.0, 24.0]      627
3      (24.0, 6815.0]      657
Unlike the UNITI dataset, the TYT dataset shows a much sharper drop-off in how long users keep submitting assessments. This is probably because most UNITI app users are recruited by doctors for the UNITI RCT.
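
The cell above asks `pd.qcut` for 5 quantiles, yet the table has only 4 rows. This is the effect of `duplicates='drop'`: when many users share the same low assessment count, several quantile edges coincide and the repeated edges are merged. A minimal sketch on made-up counts (the numbers are illustrative, not taken from the TYT data):

```python
import pandas as pd

# made-up assessment counts per user; half of the users have exactly one assessment
counts = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 5, 8, 13, 40])

# without duplicates='drop' the repeated quantile edges raise a ValueError
try:
    pd.qcut(counts, 5)
    raised = False
except ValueError:
    raised = True

# with duplicates='drop' the identical edges are merged, so fewer than 5 bins remain
bins = pd.qcut(counts, 5, duplicates='drop')
n_bins = len(bins.cat.categories)
print(bins.value_counts().sort_index())
```

The resulting bins therefore no longer hold 20% of the users each, which is worth keeping in mind when reading the table.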


#### What about at the user level?

In [8]:
tyt_raw['date'] = tyt_raw['created_at'].map(lambda x: x.date())
tyt_interaction_intensity_userlevel = pd.DataFrame(tyt_raw.groupby('user_id').agg({'date':['min','max','nunique'], 'user_id':'count'}).reset_index().values, columns = ['user_id','date_min','date_max','n_unique_days', 'n_assessments'])
tyt_interaction_intensity_userlevel['date_min'] = pd.to_datetime(tyt_interaction_intensity_userlevel.date_min, format='%Y-%m-%d')
tyt_interaction_intensity_userlevel['date_max'] = pd.to_datetime(tyt_interaction_intensity_userlevel.date_max, format='%Y-%m-%d')
tyt_interaction_intensity_userlevel['n_unique_days'] = tyt_interaction_intensity_userlevel['n_unique_days'].astype(int)
tyt_interaction_intensity_userlevel['n_assessments'] = tyt_interaction_intensity_userlevel['n_assessments'].astype(int)
tyt_interaction_desc = tyt_interaction_intensity_userlevel.describe(datetime_is_numeric=True)

In [9]:
print("Min. number of unique days of data from a user is:", tyt_interaction_desc['n_unique_days']['min'],
      "days \n25% of the users have <=",tyt_interaction_desc['n_unique_days']['25%'],
      "days \n50% of the users have <=",tyt_interaction_desc['n_unique_days']['50%'],
      "days \n75% of the users have <=",tyt_interaction_desc['n_unique_days']['75%'],
      "days, and \nMax. number of unique days of data from a user is:",tyt_interaction_desc['n_unique_days']['max'])

Min. number of unique days of data from a user is: 1.0 days 
25% of the users have <= 1.0 days 
50% of the users have <= 2.0 days 
75% of the users have <= 8.0 days, and 
Max. number of unique days of data from a user is: 1849.0


In [10]:
print("Min. number of submitted assessments from a user is:", tyt_interaction_desc['n_assessments']['min'],
      "assessments \n25% of the users have <=",tyt_interaction_desc['n_assessments']['25%'],
      "assessments \n50% of the users have <=",tyt_interaction_desc['n_assessments']['50%'],
      "assessments \n75% of the users have <=",tyt_interaction_desc['n_assessments']['75%'],
      "assessments, and \nMax. number of submitted assessments from a user is:",tyt_interaction_desc['n_assessments']['max'])

Min. number of submitted assessments from a user is: 1.0 assessments 
25% of the users have <= 1.0 assessments 
50% of the users have <= 3.0 assessments 
75% of the users have <= 16.0 assessments, and 
Max. number of submitted assessments from a user is: 6815.0
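
As an aside, the per-user summary could also be built with pandas named aggregation, which names the output columns directly and keeps their dtypes, so no conversion afterwards is needed. This is only a sketch on toy data with the same column names as `tyt_raw`; it is not the code that produced the numbers above.

```python
import pandas as pd

# toy stand-in for tyt_raw with the two columns the summary needs
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "created_at": pd.to_datetime([
        "2013-07-18 14:14:32", "2013-07-18 18:00:00",
        "2013-07-20 09:30:00", "2014-01-05 08:00:00",
    ]),
})
# truncate to midnight; the column stays datetime64
toy["date"] = toy["created_at"].dt.normalize()

user_level = (
    toy.groupby("user_id")
    .agg(
        date_min=("date", "min"),            # first day with data
        date_max=("date", "max"),            # last day with data
        n_unique_days=("date", "nunique"),   # number of distinct days
        n_assessments=("created_at", "count"),  # number of submitted assessments
    )
    .reset_index()
)
print(user_level)
```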


# Baseline Statistics

### Age Distribution

The following table shows the number of users for each age.

In [None]:
# 'alter' is the age column of the baseline data
df_baseline.groupby('alter')['user_id'].count()

Now we compute the mean age and the standard deviation.

In [None]:
avg_age = df_baseline['alter'].mean()
std_age = df_baseline['alter'].std()

result['avg_age'] = avg_age
result['std_age'] = std_age

In [None]:
print("Mean age: %4.2f years" % avg_age)
print("Standard deviation of age: %4.2f years" % std_age)

### Sex Distribution

We calculate the number of users who self-identified their sex as male (0), female (1) or other (2).

In [None]:
df_sex = pd.DataFrame(df_baseline.groupby('geschlecht')['user_id'].count())
df_sex['label'] = ('male', 'female', 'other')

result['n_users_male'] = df_sex['user_id'][0.0]
result['n_users_female'] = df_sex['user_id'][1.0]
result['n_users_other'] = df_sex['user_id'][2.0]

print(df_sex)
print("\n{} users without submitted sex".format(result['n_users'] - (result['n_users_male'] + result['n_users_female'] + result['n_users_other'])))

### Country Statistics

We calculate how many users participated by country.
We also calculate the percentage of Germany-based users in the dataset.
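
A minimal sketch of the per-country count and the share of Germany-based users, on toy data. The column name `country` and the code `"DE"` are assumptions made for illustration; the baseline data may name and encode the country differently.

```python
import pandas as pd

# toy stand-in for df_baseline; `country` and its codes are hypothetical
toy_baseline = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "country": ["DE", "DE", "CH", "US", None],
})

# number of distinct users per country (users without a country are not counted here)
users_per_country = (
    toy_baseline.groupby("country")["user_id"]
    .nunique()
    .sort_values(ascending=False)
)

# percentage of Germany-based users among all users, including those without a country
share_germany = (toy_baseline["country"] == "DE").mean() * 100

print(users_per_country)
print("Share of Germany-based users: %4.2f %%" % share_germany)
```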

# Pointers for Target variable

###  A candidate for target variable (Regression) is "question3"


question3 measures the distress caused by tinnitus. Because no treatment reliably reduces symptom severity, treating the distress rather than the severity itself is considered the appropriate approach, as in the case of chronic pain.

###  For classification, the target variable ("question3") can be discretised:

(target within mean +/- a user-defined noise threshold is "no change", 
target > mean + threshold is "worse", 
target < mean - threshold is "better")
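
The discretisation rule can be sketched as follows. The text does not say which mean is meant, so this sketch assumes the mean of each user's own answers; the function name `discretise_target` and the default threshold of 0.05 are likewise illustrative choices.

```python
import numpy as np
import pandas as pd

def discretise_target(df, target="question3", threshold=0.05, group="user_id"):
    """Label each assessment 'better', 'no change' or 'worse' relative to the user's mean."""
    # assumption: the reference mean is computed per user
    user_mean = df.groupby(group)[target].transform("mean")
    deviation = df[target] - user_mean
    labels = np.select(
        [deviation > threshold, deviation < -threshold],
        ["worse", "better"],
        default="no change",
    )
    # assessments with a missing target stay unlabeled
    return pd.Series(labels, index=df.index).where(df[target].notna())

# toy data with the same column names as tyt_raw
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "question3": [0.2, 0.5, 0.8, 0.4, None],
})
labels = discretise_target(toy, threshold=0.1)
print(labels)
```

A larger threshold moves more assessments into "no change", so the class balance depends directly on this choice.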

### Candidate features are all other questions, excluding the target:

[question1, question2, question4, question5, question6, question7]

###  Misc. tips

#### It might be useful to exclude the single binary variable question1, which asks whether the user hears tinnitus right now.

Users report nonzero loudness and distress even when they answer question1 with "NO". For example, the first table below summarises the other six questions for assessments where question1 was answered "NO".

The second table, for all remaining assessments, shows that question2 and question3 are much lower on average after a "NO" (means of 0.25 and 0.19 versus 0.53 and 0.42), while the means of the other four questions differ by less than 0.1. So the decision is not clear-cut, and either choice (including or excluding this variable) can be argued for.

In [11]:
tyt_raw[tyt_raw.question1 == 0].drop('user_id', axis=1).describe()

Unnamed: 0,question1,question2,question3,question4,question5,question6,question7
count,22184.0,20867.0,20753.0,21679.0,21583.0,21053.0,21153.0
mean,0.0,0.248225,0.187638,0.614473,0.263132,0.21486,0.590614
std,0.0,0.245586,0.203526,0.191691,0.22662,0.20562,0.312521
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.05,0.03,0.5,0.12,0.05,0.33
50%,0.0,0.172794,0.125,0.62,0.25,0.161765,0.63
75%,0.0,0.378676,0.272059,0.75,0.375,0.316177,0.87
max,0.0,1.0,1.0,1.0,1.0,1.0,1.0


In [12]:
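# note: `!= 0` also keeps rows where question1 is missing (NaN), so other columns can show higher counts than question1
tyt_raw[tyt_raw.question1 != 0].drop('user_id', axis=1).describe()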

Unnamed: 0,question1,question2,question3,question4,question5,question6,question7
count,87160.0,88161.0,82311.0,86477.0,85133.0,80979.0,84686.0
mean,1.0,0.532222,0.418216,0.54128,0.23676,0.291427,0.591452
std,0.0,0.302344,0.286249,0.214534,0.223614,0.245846,0.320795
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.28,0.183823,0.5,0.0,0.09,0.32
50%,1.0,0.51,0.378677,0.5,0.25,0.24,0.61
75%,1.0,0.79,0.63,0.75,0.37,0.433824,0.913603
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0
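
The two tables can also be compared side by side with a single `groupby` on question1. This is a sketch on toy values, not on the TYT data; note that `groupby` drops rows with a missing question1 by default, unlike the `!= 0` filter.

```python
import pandas as pd

# toy data with the same column names as tyt_raw
toy = pd.DataFrame({
    "question1": [0, 0, 1, 1, None],
    "question2": [0.1, 0.3, 0.6, 0.8, 0.5],
    "question3": [0.0, 0.2, 0.4, 0.6, 0.5],
})

# mean of each question for question1 == 0 (NO) and question1 == 1 (YES)
group_means = toy.groupby("question1")[["question2", "question3"]].mean()

# how much higher the answers are when tinnitus is heard (YES minus NO)
diff = group_means.loc[1.0] - group_means.loc[0.0]

print(group_means)
print(diff)
```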
