# Corona Check - Get Covid estimation based on reported symptoms for yourself and others (`CoronaCheck`)

## Purpose of this Notebook
- [ ] Clean the dataset and save cleaned version
- [x] Get a statistical overview
    - [x] How many users?
    - [x] How many assessments (= filled out questionnaires)?
    - [x] Date range of the dataset?
    - [x] User-assessment distribution
- [ ] Potential target for classification?
- [ ] Potential features for classification?
    
    

In [2]:
# imports
import pandas as pd
from datetime import date, datetime

In [3]:
# workaround to import our own modules; assumes the current working directory is your/local/path/UsAs/src/d01_analysis
import sys
sys.path.insert(0, "../..")

from src.d00_utils import cc_helpers, helpers

In [4]:
# read in dataframe
# assuming current working directory is the folder (your/local/path/UsAs/src/d01_analysis)
df = pd.read_csv('../../data/d01_raw/cc/22-10-05_corona-check-data.csv')




#### <font color='red'>Problem with the user_id</font> 
A `user_id` does not necessarily refer to one person, because the baseline and follow-up questionnaires are combined in ONE questionnaire, which can also be filled out for others. We therefore have to make assumptions about when a `user_id` refers to one specific person.
These assumptions are: 
- Do you fill out this questionnaire for yourself? == `YES`
- `Age` must not vary
    - If `Age` varies within the `Author==YES` filtered answers, we take the mode age and drop other assessments.
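
The two assumptions above can be sketched on toy data as follows. This is only an illustration, not the code in `cc_helpers`: the column names `author` and `age`, the value `"YES"`, and the function name are hypothetical stand-ins.

```python
import pandas as pd


def keep_self_reports_with_mode_age(df: pd.DataFrame) -> pd.DataFrame:
    """Keep self-reported assessments whose age equals the user's most frequent age."""
    # Assumption 1: only assessments the user filled out for themselves
    own = df[df["author"] == "YES"].copy()
    # Assumption 2: per user, take the mode of `age` (ties resolve to the smallest value)
    mode_age = own.groupby("user_id")["age"].agg(lambda s: s.mode().iloc[0])
    # Drop assessments whose age differs from that mode
    return own[own["age"] == own["user_id"].map(mode_age)]


# Hypothetical toy data: user 1 has one deviating age and one assessment for someone else
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "author": ["YES", "YES", "YES", "NO", "YES", "YES"],
    "age": ["20-29", "20-29", "30-39", "60-69", "40-49", "40-49"],
})
cleaned = keep_self_reports_with_mode_age(toy)
print(cleaned)
```

In the toy data, the `NO` row and the `30-39` row of user 1 are dropped; everything else is kept.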

In [5]:
print('No of assessments at start:\t', df.shape[0])
df = cc_helpers.drop_one_time_users(df)
print('No of assessments without one time users:\t', df.shape[0])
df = cc_helpers.drop_ambiguous_users(df)
print('No of assessments without ambiguous users:\t', df.shape[0])

No of assessments at start:	 89659
No of assessments without one time users:	 50223


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(index=assessments_to_drop, inplace=True)
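
The message above is pandas' `SettingWithCopyWarning`: `drop(..., inplace=True)` was called on a DataFrame that was itself derived from a slice of another one. The helper's code is not shown in this notebook, so the following is only a sketch of one common way to avoid the warning, on hypothetical toy data: take an explicit copy of the subset and use the non-in-place form of `drop`.

```python
import pandas as pd

# Hypothetical toy data standing in for the questionnaire answers
df = pd.DataFrame({"user_id": [1, 1, 2, 3], "answer_id": [10, 11, 12, 13]})

# An explicit .copy() makes the subset independent of the parent frame,
# so later modifications do not trigger the warning
subset = df[df["user_id"] != 3].copy()

# Hypothetical stand-in for the index labels named `assessments_to_drop` in the warning
assessments_to_drop = [1]

# Non-in-place drop: returns a new DataFrame instead of mutating `subset`
subset = subset.drop(index=assessments_to_drop)
print(subset)
```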

#### Statistical overview

In [6]:
# create result dict to save to disk
result = dict()

In [7]:
# Number of unique users
print('Number of users:\t', df.user_id.nunique())
result['n_users'] = df.user_id.nunique()

Number of users:	 13830


In [8]:
# number of unique answers
print('Number of assessments:\t', df.answer_id.nunique())
result['n_assessments'] = df.answer_id.nunique()

Number of assessments:	 49535


In [9]:
# date range
form = '%Y-%m-%d %H:%M:%S'
date_start = df.created_at.min()
result['First assessment from'] = date_start
date_start = datetime.strptime(date_start, form)
date_end = df.created_at.max()
result['Latest assessment from'] = date_end
date_end = datetime.strptime(date_end, form)



print('Start:\t', date_start)
print('End:\t', date_end)

delta = date_end.date()-date_start.date()

print('\nDate Range in')
print('Years:\t', delta.days/365)
print('Months:\t', round(delta.days / 365 * 12, 1))  # 12 months per 365-day year
print('Days:\t', delta.days)


result['Time range in days'] = delta.days

Start:	 2020-04-08 13:48:43
End:	 2022-09-30 14:25:32

Date Range in
Years:	 2.4794520547945207
Months:	 29.8
Days:	 905
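
Dividing a day count by a fixed number only approximates the number of months. As a cross-check, here is a minimal standard-library sketch that counts whole calendar months between the two timestamps printed above; the variable names are illustrative only.

```python
from datetime import datetime

form = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2020-04-08 13:48:43", form)
end = datetime.strptime("2022-09-30 14:25:32", form)

# Day difference between the two calendar dates
days = (end.date() - start.date()).days

# Whole calendar months: year/month difference, minus one if the end
# day-of-month has not yet reached the start day-of-month
months = (end.year - start.year) * 12 + (end.month - start.month)
if end.day < start.day:
    months -= 1

print(days, months)
```

For these two timestamps this gives 905 days and 29 whole calendar months.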


#### Distribution of filled out questionnaires

In [10]:
bins = [0, 1, 2, 3, 5, 10, 100, 1000]
ser = pd.cut(df.user_id.value_counts(), bins = bins).value_counts().sort_index()
ser
result.update(dict(ser))
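
`pd.cut` assigns each user's assessment count to a right-closed interval such as `(3, 5]`, so the keys written into `result` are `Interval` objects. The sketch below repeats the same binning on hypothetical toy data and converts the keys to strings, which is an assumption about what is convenient when the result dict is later saved to disk.

```python
import pandas as pd

# Toy stand-in for df.user_id: user "a" has 2 assessments, "b" has 1, "c" has 4
user_ids = pd.Series(["a", "a", "b", "c", "c", "c", "c"])

bins = [0, 1, 2, 3, 5, 10, 100, 1000]
per_user = user_ids.value_counts()  # number of assessments per user
ser = pd.cut(per_user, bins=bins).value_counts().sort_index()

# Interval keys are awkward to serialize, so convert them to plain strings
binned = {str(interval): int(count) for interval, count in ser.items()}
print(binned)
```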

### Average period length between two filled out questionnaires

In [11]:
# res = result
res = helpers.find_schedule_pattern(df, form='%Y-%m-%d %H:%M:%S', date_col_name='created_at')
result.update(res)

### Age Distribution

Age is recorded in groups with a step size of 10 years (e.g. ages 20 to 29). To work with the data, we assume that each user in a given age group (e.g. 20-29) has the age of the group's midpoint (25). For users in the age group 80+, we assume an age of 85.

The following table shows the number of users in each age group.

In [12]:
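# Assumption: the table below is the per-group count of `user_id` over the `age` column of df
df_age_groups = pd.DataFrame(df.groupby('age')['user_id'].count())
df_age_groups['mean_age'] = (5, 15, 25, 35, 45, 55, 65, 75, 85)
df_age_groups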

| age   | user_id | mean_age |
|-------|---------|----------|
| 00-09 | 1726    | 5        |
| 10-19 | 9614    | 15       |
| 20-29 | 9570    | 25       |
| 30-39 | 8015    | 35       |
| 40-49 | 6663    | 45       |
| 50-59 | 4769    | 55       |
| 60-69 | 4110    | 65       |
| 70-79 | 2991    | 75       |
| 80+   | 1170    | 85       |


Next we calculate the mean age and the standard deviation

In [14]:
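# count-weighted mean and population standard deviation of the group midpoints
w, x = df_age_groups['user_id'], df_age_groups['mean_age']
mean_age = (w * x).sum() / w.sum()
mean_age, ((w * (x - mean_age) ** 2).sum() / w.sum()) ** 0.5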
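
For grouped data, the mean and standard deviation are weighted by the group counts. The following plain-Python sketch does not depend on the notebook state: it copies the counts from the age table above and derives each midpoint from the group label, following the convention described earlier (lower bound plus 5, so `80+` maps to 85). The helper name `midpoint` is illustrative only.

```python
import math

# Counts per age group, copied from the table above
counts = {
    "00-09": 1726, "10-19": 9614, "20-29": 9570,
    "30-39": 8015, "40-49": 6663, "50-59": 4769,
    "60-69": 4110, "70-79": 2991, "80+": 1170,
}


def midpoint(label: str) -> int:
    """Assumed age for a group label: lower bound + 5, e.g. '20-29' -> 25, '80+' -> 85."""
    return int(label[:2]) + 5


n = sum(counts.values())
# Count-weighted mean of the midpoints
mean_age = sum(midpoint(g) * c for g, c in counts.items()) / n
# Count-weighted (population) variance and standard deviation
var_age = sum(c * (midpoint(g) - mean_age) ** 2 for g, c in counts.items()) / n
std_age = math.sqrt(var_age)

print("n:", n)
print("mean age:", round(mean_age, 2))
print("std age:", round(std_age, 2))
```

With these counts, `n` is 48628 and the weighted mean age is about 37.54. The standard deviation understates the true spread slightly, because all ages within a group are collapsed to one value.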

### Sex Distribution

We calculate the number of users that self identified their sex as male (0), female (1) or other (2)

In [None]:
df_sex = pd.DataFrame(df_baseline.groupby('geschlecht')['user_id'].count())
df_sex['label'] = ('male', 'female', 'other')

result['n_users_male'] = df_sex['user_id'][0.0]
result['n_users_female'] = df_sex['user_id'][1.0]
result['n_users_other'] = df_sex['user_id'][2.0]

print(df_sex)
print("\n{} users without submitted sex".format(df_baseline['user_id'].count() - (result['n_users_male'] + result['n_users_female'] + result['n_users_other'])))
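
Assigning the label tuple directly only works if all three codes occur in the data; otherwise the lengths do not match. A slightly more defensive sketch on hypothetical toy data is shown below. It assumes `geschlecht` is coded 0.0/1.0/2.0 with `NaN` for users who did not submit a value, and the column name `n_users` is illustrative.

```python
import pandas as pd

# Toy stand-in for the baseline data
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "geschlecht": [0.0, 1.0, 1.0, None, 0.0],
})

labels = {0.0: "male", 1.0: "female", 2.0: "other"}

# reindex guarantees one row per code, with 0 for codes that never occur
counts = toy["geschlecht"].value_counts().reindex(list(labels), fill_value=0)
df_sex = pd.DataFrame({"n_users": counts, "label": pd.Series(labels)})

# Users without a submitted value are the NaN entries
n_missing = int(toy["geschlecht"].isna().sum())

print(df_sex)
print(f"\n{n_missing} users without submitted sex")
```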

### Statistical Overview

In [None]:
result

In [None]:
# read in codebook and reduce to columns and rows of interest
cb = pd.read_excel('../../data/d00_helpers/codebook/cc/codebook_cc.xlsx', sheet_name='Sheet1')
cb