# Corona Health - Psychological Health Adults (`Children`)

## Purpose of this Notebook
- [ ] Clean the dataset and save cleaned version
- [x] Get a statistical overview
    - [x] How many users?
    - [x] How many assessments (= filled out questionnaires)?
    - [x] Date range of the dataset?
    - [x] User-assessment distribution
- [ ] Potential target for classification?
- [ ] Potential features for classification?
    
    

## Preparation
Import modules and load data for later use.

In [1]:
# imports
import pandas as pd
from datetime import date, datetime

In [2]:
# workaround to import our own modules, assuming the current working directory is the folder your/local/path/UsAs/src/d01_analysis
import sys
sys.path.insert(0, "../..")

from src.d00_utils import cc_helpers, helpers
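
The relative `"../.."` entry only points at the repository root when the notebook is started from `src/d01_analysis`. The following is a minimal sketch of making that assumption explicit by resolving an absolute path first; the helper name `project_root` is hypothetical and not part of the repository.

```python
import sys
from pathlib import Path


def project_root(cwd: Path, levels_up: int = 2) -> Path:
    """Return the directory `levels_up` levels above `cwd` (hypothetical helper)."""
    root = cwd
    for _ in range(levels_up):
        root = root.parent  # the parent of the filesystem root is the root itself
    return root


# Assumption: the notebook runs from <repo>/src/d01_analysis,
# so the repository root is two levels up.
root = project_root(Path.cwd().resolve())
if str(root) not in sys.path:
    sys.path.insert(0, str(root))
```

Inserting an absolute path keeps the import working even if the working directory changes later in the session.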

In [3]:
# read in dataframe
# assuming current working directory is the folder (your/local/path/UsAs/src/d01_analysis)
df = pd.read_csv('../../data/d01_raw/ch/22-10-05_rki_children_followup.csv')
df_baseline = pd.read_csv('../../data/d01_raw/ch/22-10-05_rki_children_baseline.csv')

## Statistical overview
Here we calculate statistics for both the dataset overall and the baseline questionnaires.

In [4]:
# create result dict to save to disk
result = dict()

### Number of users

In [5]:
# Number of unique users
print('Number of users:\t', df.user_id.nunique())
result['n_users'] = df.user_id.nunique()

Number of users:	 111


### Number of assessments

In [6]:
# number of unique answers
print('Number of assessments:\t', df.answer_id.nunique())
result['n_assessments'] = df.answer_id.nunique()

Number of assessments:	 630


### Date Range of assessments

In [7]:
# date range
form = '%Y-%m-%d %H:%M:%S'
date_start = df.created_at.min()
result['First assessment from'] = date_start
date_start = datetime.strptime(date_start, form)
date_end = df.created_at.max()
result['Latest assessment from'] = date_end
date_end = datetime.strptime(date_end, form)

print('Start:\t', date_start)
print('End:\t', date_end)

delta = date_end.date()-date_start.date()

print('\nDate Range in')
print('Years:\t', delta.days/365)
# approximate: average month length of 365/12 days
print('Months:\t', round(delta.days/(365/12), 2))
print('Days:\t', delta.days)

result['Time range in days'] = delta.days

Start:	 2020-08-08 14:38:32
End:	 2022-09-29 12:24:12

Date Range in
Years:	 2.1424657534246574
Months:	 25.71
Days:	 782
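
As a cross-check, the same span can be computed with `pd.to_datetime` instead of `datetime.strptime`. This is a sketch on a toy frame that holds only the two timestamps printed above; the conversion to months assumes an average month length of 365/12 days and is therefore an approximation.

```python
import pandas as pd

# Toy frame holding only the first and last timestamps reported above.
toy = pd.DataFrame({"created_at": ["2020-08-08 14:38:32", "2022-09-29 12:24:12"]})

created = pd.to_datetime(toy["created_at"], format="%Y-%m-%d %H:%M:%S")
# Compare calendar dates only, ignoring the time of day.
span_days = (created.max().normalize() - created.min().normalize()).days

print("Days:", span_days)
print("Months (approx.):", round(span_days / (365 / 12), 2))
print("Years (approx.):", round(span_days / 365, 2))
```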


### Schedule pattern

In [8]:
helpers.find_schedule_pattern(df)

{'Median hours between two assessments': 174.29722222222222,
 'Median days between two assessments': 7.262384259259259,
 'std_hours': 194.8483989912335,
 'std_days': 8.118683291301394}
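
The implementation of `helpers.find_schedule_pattern` is not shown in this notebook. The sketch below illustrates one way such a median gap could be computed; it is not the helper's actual code, and `median_gap_hours` as well as the toy data are hypothetical. It assumes that gaps are measured between consecutive assessments of the same user.

```python
import pandas as pd


def median_gap_hours(df: pd.DataFrame, date_col: str = "created_at") -> float:
    """Median time in hours between consecutive assessments of the same user."""
    ts = pd.to_datetime(df[date_col])
    gaps = (
        df.assign(**{date_col: ts})
        .sort_values(["user_id", date_col])
        .groupby("user_id")[date_col]
        .diff()    # NaT for each user's first assessment
        .dropna()
    )
    return gaps.dt.total_seconds().median() / 3600


# Hypothetical toy data: user 1 has gaps of 168 h and 180 h, user 2 of 168 h.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "created_at": [
            "2021-01-01 08:00:00",
            "2021-01-08 08:00:00",
            "2021-01-15 20:00:00",
            "2021-02-01 12:00:00",
            "2021-02-08 12:00:00",
        ],
    }
)
print(median_gap_hours(toy))
```

A median of roughly 174 hours, as reported above, corresponds to a schedule of about one assessment per week.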

### Distribution of filled out questionnaires

In [9]:
# bin users by their number of filled-out questionnaires (right-closed intervals)
bins = [0, 1, 2, 3, 5, 10, 100, 1000]
ser = pd.cut(df.user_id.value_counts(), bins=bins).value_counts().sort_index()
result.update(dict(ser))
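
To make the binning concrete, here is a small sketch on hypothetical toy data. Because `pd.cut` produces right-closed intervals by default, the bin `(0, 1]` counts users with exactly one questionnaire, `(1, 2]` users with exactly two, and so on.

```python
import pandas as pd

# Toy data: user 1 filled out 1 questionnaire, user 2 three, user 3 twelve.
toy_ids = pd.Series([1] + [2] * 3 + [3] * 12, name="user_id")

per_user = toy_ids.value_counts()        # questionnaires per user
bins = [0, 1, 2, 3, 5, 10, 100, 1000]    # same edges as in the cell above
binned = pd.cut(per_user, bins=bins).value_counts().sort_index()

print(binned)
```

Empty bins are kept with a count of 0 because the binned values are categorical.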

### Average period length between two filled out questionnaires

In [10]:
# compute the schedule statistics and merge them into the result dict
res = helpers.find_schedule_pattern(df, form='%Y-%m-%d %H:%M:%S', date_col_name='created_at')
result.update(res)

### Age distribution
The following table shows the number of users in each age group.

In [11]:
age_groups = df_baseline.groupby('kj_age').size()
age_groups

kj_age
12    18
13    31
14    39
15    44
16    71
17    75
dtype: int64

Next, we calculate the mean age and the standard deviation.

In [12]:
result['user_age_mean'] = df_baseline['kj_age'].mean()
result['user_age_standard_deviation'] = df_baseline['kj_age'].std()
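
Both statistics can also be recomputed from the age-group table alone, which is a useful sanity check. The sketch below copies the counts from the table above and uses the sample standard deviation (`ddof=1`), which is the pandas default for `.std()`.

```python
import math

# Age-group counts copied from the table above (age -> number of users).
age_counts = {12: 18, 13: 31, 14: 39, 15: 44, 16: 71, 17: 75}

n = sum(age_counts.values())
mean_age = sum(age * k for age, k in age_counts.items()) / n
# Sample variance: divide by n - 1, matching pandas' default ddof=1.
var_age = sum(k * (age - mean_age) ** 2 for age, k in age_counts.items()) / (n - 1)
std_age = math.sqrt(var_age)

print(n, round(mean_age, 2), round(std_age, 2))
```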

### Sex Distribution
We count the users who self-identified their sex as male, female, or diverse, or who gave no answer.

In [13]:
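# count users per sex code (the result keys below map 1 = male, 2 = female, 3 = diverse, 4 = no answer)
df_sex = pd.DataFrame(df_baseline.groupby('kj_sex')['user_id'].count())

# label-based .loc makes explicit that the lookup goes by sex code, not by position
result['n_users_male'] = df_sex.loc[1, 'user_id']
result['n_users_female'] = df_sex.loc[2, 'user_id']
result['n_users_diverse'] = df_sex.loc[3, 'user_id']
result['n_users_no_answer'] = df_sex.loc[4, 'user_id']

print(df_sex)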

        user_id
kj_sex         
1           139
2           129
3             4
4             6
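
Looking up each code individually raises a `KeyError` if a code never occurs in the data. A slightly more defensive variant is sketched below on hypothetical toy data; the code-to-label mapping is the one implied by the result keys above, and `SEX_LABELS` is an illustrative name.

```python
import pandas as pd

# Assumed coding, taken from the result keys above:
# 1 = male, 2 = female, 3 = diverse, 4 = no answer.
SEX_LABELS = {1: "male", 2: "female", 3: "diverse", 4: "no_answer"}

# Hypothetical toy baseline in which code 3 does not occur.
toy_baseline = pd.DataFrame({"user_id": [10, 11, 12, 13], "kj_sex": [1, 2, 2, 4]})

counts = toy_baseline["kj_sex"].value_counts()
# reindex fills codes that never occur with 0 instead of raising a KeyError.
counts = counts.reindex(list(SEX_LABELS), fill_value=0)

sex_result = {f"n_users_{label}": int(counts[code]) for code, label in SEX_LABELS.items()}
print(sex_result)
```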


### Statistical Overview

In [14]:
result

{'n_users': 111,
 'n_assessments': 630,
 'First assessment from': '2020-08-08 14:38:32',
 'Latest assessment from': '2022-09-29 12:24:12',
 'Time range in days': 782,
 Interval(0, 1, closed='right'): 48,
 Interval(1, 2, closed='right'): 21,
 Interval(2, 3, closed='right'): 7,
 Interval(3, 5, closed='right'): 7,
 Interval(5, 10, closed='right'): 10,
 Interval(10, 100, closed='right'): 18,
 Interval(100, 1000, closed='right'): 0,
 'Median hours between two assessments': 174.29722222222222,
 'Median days between two assessments': 7.262384259259259,
 'std_hours': 194.8483989912335,
 'std_days': 8.118683291301394,
 'user_age_mean': 15.237410071942445,
 'user_age_standard_deviation': 1.5671848513528142,
 'n_users_male': 139,
 'n_users_female': 129,
 'n_users_diverse': 4,
 'n_users_no_answer': 6}
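
The `result` dict is meant to be saved to disk, but its `Interval` keys and NumPy scalar values are not JSON-serializable as they are. The following sketch shows one possible conversion; `to_json_safe` is a hypothetical helper, and the toy dict stands in for the real `result`.

```python
import json

import numpy as np
import pandas as pd


def to_json_safe(result: dict) -> dict:
    """Convert keys to strings and NumPy scalars to plain Python values."""
    safe = {}
    for key, value in result.items():
        # NumPy scalars expose .item(); plain Python values are kept as-is.
        safe[str(key)] = value.item() if hasattr(value, "item") else value
    return safe


# Toy stand-in for the result dict above.
toy_result = {"n_users": 111, pd.Interval(0, 1, closed="right"): np.int64(48)}
print(json.dumps(to_json_safe(toy_result), indent=2))
```

The serialized string can then be written to a file of your choice with `json.dump`.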

In [16]:
# read in codebook and reduce to columns and rows of interest
cb = pd.read_excel('../../data/d00_helpers/codebook/ch/rki_children.xlsx', sheet_name='FollowUp', header=4)
cb = cb[cb.elementtype=='question']
cb = cb.iloc[:, :30]

FileNotFoundError: [Errno 2] No such file or directory: '../../data/d00_helpers/codebook/ch/rki_children.xlsx'
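
The error above means the codebook is not present at the expected relative path. One way to keep the notebook running in that case is to check for the file first, as sketched below; `load_codebook` is a hypothetical wrapper around the same `pd.read_excel` call and filtering steps used in the cell.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_codebook(path: Path, sheet_name: str = "FollowUp") -> Optional[pd.DataFrame]:
    """Return the question rows of the codebook, or None if the file is missing."""
    if not path.exists():
        print(f"Codebook not found at {path.resolve()}; skipping.")
        return None
    cb = pd.read_excel(path, sheet_name=sheet_name, header=4)
    cb = cb[cb.elementtype == "question"]  # keep question rows only
    return cb.iloc[:, :30]                 # keep the first 30 columns


cb = load_codebook(Path("../../data/d00_helpers/codebook/ch/rki_children.xlsx"))
```

Downstream cells can then test `cb is None` before using the codebook.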