# Corona Health - Physiological Health Adults (`Heart`)

## Purpose of this Notebook
- [ ] Clean the dataset and save cleaned version
- [x] Get a statistical overview
    - [x] How many users?
    - [x] How many assessments (= filled out questionnaires)?
    - [x] Date range of the dataset?
    - [x] User-assessment distribution
- [ ] Potential target for classification?
- [ ] Potential features for classification?
    
    

## Preparation

In [1]:
# imports
import pandas as pd
from datetime import date, datetime

In [2]:
# workaround to import our own modules - assumes the current working directory is the folder your/local/path/UsAs/src/d01_analysis
import sys
sys.path.insert(0, "../..")

from src.d00_utils import helpers
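
The relative `sys.path` entry above only works when the notebook is started from `src/d01_analysis`. A minimal sketch of a sturdier alternative, assuming the project root is the first parent directory that contains a `src` folder (the helper names `find_project_root` and `add_project_root_to_path` are hypothetical and not part of the repository):

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_project_root_to_path(start: Path) -> Path:
    """Put the project root at the front of sys.path (once) and return it."""
    root = find_project_root(start)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

Calling `add_project_root_to_path(Path.cwd())` before `from src.d00_utils import helpers` would then work from any folder inside the project.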

In [3]:
# read in dataframe
# assuming current working directory is the folder (your/local/path/UsAs/src/d01_analysis)
df = pd.read_csv('../../data/d01_raw/ch/22-10-05_rki_stress_followup.csv')
df_baseline = pd.read_csv('../../data/d01_raw/ch/22-10-05_rki_stress_baseline.csv')

## Statistical overview
Here we calculate statistics for both the overall dataset and the baseline questionnaires.

In [4]:
# create result dict to save to disk
result = dict()

### Number of users

In [5]:
# number of unique users
print('Number of users:\t', df.user_id.nunique())
result['n_users'] = df.user_id.nunique()

Number of users:	 374


### Number of assessments

In [6]:
# number of unique answers
print('Number of assessments:\t', df.answer_id.nunique())
result['n_assessments'] = df.answer_id.nunique()

Number of assessments:	 3845


### Date Range of assessments

In [7]:
# date range
form = '%Y-%m-%d %H:%M:%S'
date_start = df.created_at.min()
result['First assessment from'] = date_start
date_start = datetime.strptime(date_start, form)
date_end = df.created_at.max()
result['Latest assessment from'] = date_end
date_end = datetime.strptime(date_end, form)



print('Start:\t', date_start)
print('End:\t', date_end)

delta = date_end.date() - date_start.date()

print('\nDate Range in')
print('Years:\t', delta.days / 365)
# average month length of 365/12 days, consistent with the 365-day year above
print('Months:\t', round(delta.days / (365 / 12), 2))
print('Days:\t', delta.days)


result['Time range in days'] = delta.days

Start:	 2020-12-19 07:59:15
End:	 2022-09-30 17:11:43

Date Range in
Years:	 1.7808219178082192
Months:	 21.37
Days:	 650
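
Taking `min()` and `max()` of the raw strings works here only because the `%Y-%m-%d %H:%M:%S` format sorts lexicographically in chronological order. A hedged sketch of the same computation on parsed timestamps, using a small frame in place of `df` (the first and last timestamps are the ones printed above, the middle row is invented, and the helper name `date_range_summary` is an assumption):

```python
import pandas as pd


def date_range_summary(frame: pd.DataFrame, date_col: str = "created_at") -> dict:
    """Return the first/last timestamp and the calendar-day span of `date_col`."""
    stamps = pd.to_datetime(frame[date_col], format="%Y-%m-%d %H:%M:%S")
    start, end = stamps.min(), stamps.max()
    # compare calendar dates only, as the notebook does
    days = (end.date() - start.date()).days
    return {"start": start, "end": end, "days": days, "years": days / 365}


toy = pd.DataFrame({
    "created_at": [
        "2020-12-19 07:59:15",
        "2021-06-01 12:00:00",  # invented row
        "2022-09-30 17:11:43",
    ]
})
summary = date_range_summary(toy)
print(summary["days"])
```

Parsing first also makes the code safe for timestamp formats that do not sort correctly as plain strings.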


### Schedule pattern

In [9]:
helpers.find_schedule_pattern(df)

{'Median hours between two assessments': 168.54972222222221,
 'Median days between two assessments': 7.022905092592592,
 'std_hours': 359.7016961567075,
 'std_days': 14.987570673196146}
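
The implementation of `helpers.find_schedule_pattern` is not shown in this notebook. As a rough sketch of how such figures could be derived, assuming the helper summarises the gaps between consecutive assessments of the same user (an assumption, not the helper's confirmed logic), one might write the following on invented data:

```python
import pandas as pd


def schedule_pattern_sketch(frame, date_col="created_at", user_col="user_id"):
    """Median and std of the time gaps between consecutive assessments per user."""
    data = frame[[user_col, date_col]].copy()
    data[date_col] = pd.to_datetime(data[date_col], format="%Y-%m-%d %H:%M:%S")
    data = data.sort_values([user_col, date_col])
    # diff() within each user; a user's first assessment has no predecessor (NaT)
    gaps = data.groupby(user_col)[date_col].diff().dropna()
    hours = gaps.dt.total_seconds() / 3600
    return {
        "median_hours": hours.median(),
        "median_days": hours.median() / 24,
        "std_hours": hours.std(),
        "std_days": hours.std() / 24,
    }


toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "created_at": [
        "2021-01-01 08:00:00", "2021-01-08 08:00:00", "2021-01-15 08:00:00",
        "2021-02-01 10:00:00", "2021-02-15 10:00:00",
    ],
})
pattern = schedule_pattern_sketch(toy)
```

In the toy data the gaps are 7, 7 and 14 days, so the median is one week, which is the same order of magnitude as the roughly 7-day median reported above.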

### Distribution of filled out questionnaires

In [10]:
# bin users by their number of filled-out questionnaires (right-closed intervals)
bins = [0, 1, 2, 3, 5, 10, 100, 1000]
ser = pd.cut(df.user_id.value_counts(), bins=bins).value_counts().sort_index()
result.update(dict(ser))
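
`pd.cut` uses right-closed intervals by default, so the bin `(0, 1]` holds users with exactly one assessment and `(10, 100]` holds users with 11 to 100. A small illustration with invented user IDs (not taken from the dataset):

```python
import pandas as pd

bins = [0, 1, 2, 3, 5, 10, 100, 1000]

# one row per assessment; user 5 filled out 11 questionnaires
toy_user_ids = pd.Series([1, 2, 2, 3, 3, 3, 3, 4] + [5] * 11)

per_user = toy_user_ids.value_counts()  # assessments per user
binned = pd.cut(per_user, bins=bins).value_counts().sort_index()
print(binned)
```

Empty bins are kept with a count of zero because `pd.cut` returns a categorical, which is why the `(100, 1000]` interval still shows up in the summary.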

### Average period length between two filled out questionnaires

In [11]:
# median and standard deviation of the gap between two assessments, merged into result
res = helpers.find_schedule_pattern(df, form='%Y-%m-%d %H:%M:%S', date_col_name='created_at')
result.update(res)

### Age Distribution

The following table shows the number of users for each age.

In [12]:
df_baseline.groupby('alter')['user_id'].count()

alter
18    22
19    12
20     9
21    17
22    13
23    11
24    10
25    11
26    12
27    11
28    14
29     4
30    13
31    15
32    13
33    13
34    13
35    19
36    13
37    21
38    22
39    12
40    15
41    18
42    14
43     8
44    16
45     8
46    13
47    13
48    12
49    13
50    16
51    14
52     3
53     7
54    21
55     9
56    14
57     6
58     8
59    12
60     8
61    12
62     5
63     9
64     9
65     3
66     5
67     2
68     4
70     4
71     2
73     2
75     1
Name: user_id, dtype: int64

Now we compute the mean age and its standard deviation.

In [13]:
avg_age = df_baseline['alter'].mean()
std_age = df_baseline['alter'].std()

result['avg_age'] = avg_age
result['std_age'] = std_age

print("Average age: %4.2f years" % avg_age)
print("Standard deviation of age: %4.2f years" % std_age)

Average age: 40.68 years
Standard deviation of age: 13.87 years
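
`Series.std()` in pandas defaults to `ddof=1`, i.e. the sample standard deviation, so the 13.87 above is the sample estimate. A brief sketch on made-up ages showing how the two variants differ:

```python
import pandas as pd

toy_ages = pd.Series([18, 25, 40, 40, 73])  # invented values

sample_std = toy_ages.std()            # ddof=1 (pandas default)
population_std = toy_ages.std(ddof=0)  # divides by n instead of n - 1

print("mean: %4.2f" % toy_ages.mean())
print("sample std: %4.2f, population std: %4.2f" % (sample_std, population_std))
```

With several hundred baseline users the difference between the two is small, but it is worth stating which one is reported.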


### Sex Distribution

We count the users who self-identified their sex as male (0), female (1) or other (2).

In [14]:
df_sex = pd.DataFrame(df_baseline.groupby('geschlecht')['user_id'].count())
df_sex['label'] = ('male', 'female', 'other')

result['n_users_male'] = df_sex['user_id'][0.0]
result['n_users_female'] = df_sex['user_id'][1.0]
result['n_users_other'] = df_sex['user_id'][2.0]

print(df_sex)
print("\n{} users without submitted sex".format(df_baseline['user_id'].count() - (result['n_users_male'] + result['n_users_female'] + result['n_users_other'])))

            user_id   label
geschlecht                 
0               392    male
1               206  female
2                 8   other

0 users without submitted sex
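
Assigning the labels as a tuple relies on all three codes being present and sorted in the grouped result. A sketch of an order-independent variant that maps codes to labels explicitly and counts missing values directly, shown on invented data (the `SEX_LABELS` dict mirrors the coding stated above and is otherwise an assumption):

```python
import pandas as pd

SEX_LABELS = {0: "male", 1: "female", 2: "other"}

toy_baseline = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "geschlecht": [0, 1, 0, 2, None, 1],  # invented values, one missing
})

# drop missing entries, then translate integer codes into labels
codes = toy_baseline["geschlecht"].dropna().astype(int)
counts = codes.map(SEX_LABELS).value_counts()
# reindex guarantees every label appears, even with zero users
counts = counts.reindex(list(SEX_LABELS.values()), fill_value=0)

n_missing = int(toy_baseline["geschlecht"].isna().sum())
print(counts)
print("\n{} users without submitted sex".format(n_missing))
```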


### Country Statistics

We calculate how many users participated by country.

In [15]:
country_series = df_baseline.groupby('country')['user_id'].count()
country_series

country
AND      1
ATG      1
AUT      2
DEU    596
DZA      1
FRA      1
GBR      1
IND      1
MEX      1
NLD      1
Name: user_id, dtype: int64

We also calculate the percentage of German-based users in the dataset.

In [16]:
result['n_users_german'] = country_series['DEU']
# non-German users are the total minus the German-based users
result['n_users_non_german'] = country_series.sum() - country_series['DEU']
print('{:.2f}% German-based users in dataset'.format(country_series['DEU'] / country_series.sum() * 100))

98.35% German-based users in dataset


### Statistical Overview

In [17]:
result

{'n_users': 374,
 'n_assessments': 3845,
 'First assessment from': '2020-12-19 07:59:15',
 'Latest assessment from': '2022-09-30 17:11:43',
 'Time range in days': 650,
 Interval(0, 1, closed='right'): 76,
 Interval(1, 2, closed='right'): 53,
 Interval(2, 3, closed='right'): 33,
 Interval(3, 5, closed='right'): 58,
 Interval(5, 10, closed='right'): 55,
 Interval(10, 100, closed='right'): 99,
 Interval(100, 1000, closed='right'): 0,
 'Median hours between two assessments': 168.54972222222221,
 'Median days between two assessments': 7.022905092592592,
 'std_hours': 359.7016961567075,
 'std_days': 14.987570673196146,
 'avg_age': 40.67821782178218,
 'std_age': 13.871324394410022,
 'n_users_male': 392,
 'n_users_female': 206,
 'n_users_other': 8,
 'n_users_german': 596,
 'n_users_non_german': 10}
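
The notebook collects `result` in order to save it to disk, but the `Interval` keys and NumPy scalar values it contains are not JSON-serializable as they stand. A minimal sketch of one way to write it out, assuming JSON as the target format and a hypothetical output path (neither is specified in the notebook; `to_jsonable` and `save_result` are illustrative names):

```python
import json
from pathlib import Path


def to_jsonable(result: dict) -> dict:
    """Convert keys to strings and NumPy scalars to plain Python numbers."""
    clean = {}
    for key, value in result.items():
        # NumPy scalars expose .item(), which returns the native Python value
        if hasattr(value, "item"):
            value = value.item()
        clean[str(key)] = value
    return clean


def save_result(result: dict, path: Path) -> None:
    """Write the cleaned result dict as indented JSON, creating folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(to_jsonable(result), indent=2), encoding="utf-8")
```

A call such as `save_result(result, Path("heart_overview.json"))` (file name hypothetical) would then store the overview with the interval keys written as strings.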

In [18]:
# read in codebook and reduce to columns and rows of interest
cb = pd.read_excel('../../data/d00_helpers/codebooks/ch/rki_stress.xlsx', sheet_name='FollowUp', header=5)
cb = cb[cb.elementtype=='question']
cb = cb.iloc[:, :30]

