# Corona Check - Get Covid estimation based on reported symptoms for yourself and others (`CoronaCheck`)

## Purpose of this Notebook
- [ ] Clean the dataset and save cleaned version
- [x] Get a statistical overview
    - [x] How many users?
    - [x] How many assessments (= filled out questionnaires)?
    - [x] Date range of the dataset?
    - [x] User-assessment distribution
- [ ] Potential target for classification?
- [ ] Potential features for classification?
    
    

In [15]:
# imports
import pandas as pd
from datetime import date, datetime

In [16]:
# workaround to import our own modules; assumes the current working directory is the folder your/local/path/UsAs/src/d01_analysis
import sys
sys.path.insert(0, "../..")

from src.d00_utils import cc_helpers

In [17]:
# read in dataframe
# assuming current working directory is the folder (your/local/path/UsAs/src/d01_analysis)
df = pd.read_csv('../../data/d01_raw/cc/22-06-29_corona-check-data.csv')

#### <font color='red'>Problem with the user_id</font> 
A `user_id` does not necessarily refer to one person: the baseline and follow-up questions are part of ONE questionnaire, and a user can fill it out for themselves or for others. We therefore have to make assumptions about when a `user_id` refers to one specific person.
These assumptions are:
- The answer to "Do you fill out this questionnaire for yourself?" (`author`) is `YES`
- `age` must not vary
    - If `age` varies within the answers filtered by `author == YES`, we keep the assessments with the mode age and drop the others.
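
The assumptions above can be sketched in pandas as follows. This is a minimal illustration on toy data, not the implementation of `cc_helpers.drop_ambiguous_users`, which is not shown in this notebook. The column names `user_id`, `author` and `age` follow the codebook; the function name `keep_self_reports_with_mode_age` and the toy values are hypothetical.

```python
import pandas as pd


def keep_self_reports_with_mode_age(df: pd.DataFrame) -> pd.DataFrame:
    """Keep self-reported assessments whose age equals the user's most frequent age."""
    # Assumption 1: only assessments the user filled out for themselves
    own = df[df["author"] == "YES"]
    # Assumption 2: per user, the most frequent age group (ties resolve to the smallest value)
    mode_age = own.groupby("user_id")["age"].transform(lambda s: s.mode().iloc[0])
    return own[own["age"] == mode_age]


# Hypothetical toy data: user 1 reports two age groups and one assessment for someone else
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "author": ["YES", "YES", "YES", "NO", "YES", "YES"],
    "age": ["20-29", "20-29", "30-39", "20-29", "40-49", "40-49"],
})
cleaned = keep_self_reports_with_mode_age(toy)
print(cleaned)
```

In the toy data, the `NO` row and the single `30-39` row of user 1 are dropped, while user 2 is kept unchanged.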

In [18]:
print('No of assessments at start:\t', df.shape[0])
df = cc_helpers.drop_one_time_users(df)
print('No of assessments without one time users:\t', df.shape[0])
df = cc_helpers.drop_ambiguous_users(df)
print('No of assessments without ambiguous users:\t', df.shape[0])

No of assessments at start:	 500
No of assessments without one time users:	 280
No of assessments without ambiguous users:	 245


#### Statistical overview

In [19]:
# create result dict to save to disk
result = dict()

In [20]:
# Number of unique users
print('Number of users:\t', df.user_id.nunique())
result['n_users'] = df.user_id.nunique()

Number of users:	 68


In [21]:
# number of unique answers
print('Number of assessments:\t', df.answer_id.nunique())
result['n_assessments'] = df.answer_id.nunique()

Number of assessments:	 245


In [22]:
# date range
form = '%Y-%m-%d %H:%M:%S'
date_start = df.created_at.min()
result['First assessment from'] = date_start
date_start = datetime.strptime(date_start, form)
date_end = df.created_at.max()
result['Latest assessment from'] = date_end
date_end = datetime.strptime(date_end, form)



print('Start:\t', date_start)
print('End:\t', date_end)

delta = date_end.date()-date_start.date()

print('\nDate Range in')
print('Years:\t', delta.days/365)
print('Months:\t', round(delta.days / 30.44, 2))  # 30.44 = average days per month (365.25 / 12)
print('Days:\t', delta.days)


result['Time range in days'] = delta.days

Start:	 2022-01-11 18:04:19
End:	 2022-02-03 21:15:01

Date Range in
Years:	 0.06301369863013699
Months:	 0.76
Days:	 23
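
The same date range can be computed without manual `strptime` calls by parsing the column with `pd.to_datetime`. The sketch below uses only the two timestamps from the output above; on the real data the series would be `df["created_at"]`.

```python
import pandas as pd

# The first and latest timestamps from the output above
created_at = pd.Series(["2022-01-11 18:04:19", "2022-02-03 21:15:01"])
ts = pd.to_datetime(created_at, format="%Y-%m-%d %H:%M:%S")

# normalize() drops the time of day, so this matches date_end.date() - date_start.date()
span_days = (ts.max().normalize() - ts.min().normalize()).days
print("Days:", span_days)  # Days: 23
```

Parsing first also makes `min()` and `max()` independent of the string format, whereas comparing raw strings only works for sortable formats such as `%Y-%m-%d %H:%M:%S`.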


#### Distribution of filled out questionnaires

In [23]:
# bin users by their number of assessments; intervals are right-closed, e.g. (3, 5] = 4 or 5 assessments
bins = [0, 1, 2, 3, 5, 10, 100, 1000]
ser = pd.cut(df.user_id.value_counts(), bins=bins).value_counts().sort_index()
result.update(dict(ser))
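
To illustrate how this binning works, here is a small sketch on made-up user IDs. The first `value_counts()` yields the number of assessments per user, `pd.cut` assigns each count to a right-closed interval, and the second `value_counts()` counts the users per interval.

```python
import pandas as pd

# Hypothetical user IDs: user "a" has 1 assessment, "b" has 2, "c" has 4
user_id = pd.Series(["a", "b", "b", "c", "c", "c", "c"])
per_user = user_id.value_counts()

bins = [0, 1, 2, 3, 5, 10, 100, 1000]
users_per_bin = pd.cut(per_user, bins=bins).value_counts().sort_index()
print(users_per_bin)
```

Empty intervals are kept with a count of 0 because `pd.cut` returns a categorical, which is why `(100, 1000]` still shows up in the results with 0 users.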

#### Average period length between two filled-out questionnaires

In [24]:
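# NOTE: `helpers` is not imported in any cell above. This assumes a module `src.d00_utils.helpers`
# exists next to `cc_helpers`; adjust the import if the function lives elsewhere.
from src.d00_utils import helpers

res = helpers.find_mode_period_length(df, form='%Y-%m-%d %H:%M:%S', date_col_name='created_at')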
result.update(res)
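
The implementation of `find_mode_period_length` is not shown in this notebook. As a rough sketch of how the mean and standard deviation of the time between consecutive assessments could be computed, assuming gaps are measured per user, something like the following would work. The function name `gap_stats_hours` and the toy data are hypothetical.

```python
import pandas as pd


def gap_stats_hours(df, date_col="created_at", user_col="user_id"):
    """Mean and std of the gaps (in hours) between consecutive assessments of the same user."""
    ts = pd.to_datetime(df[date_col], format="%Y-%m-%d %H:%M:%S")
    ordered = df.assign(**{date_col: ts}).sort_values([user_col, date_col])
    # diff() within each user; a user's first assessment has no predecessor (NaT) and is dropped
    gaps = ordered.groupby(user_col)[date_col].diff().dropna()
    hours = gaps.dt.total_seconds() / 3600
    return {"avg_hours": hours.mean(), "std_hours": hours.std()}


# Hypothetical toy data: gaps of 24 h and 12 h for user 1, 24 h for user 2
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "created_at": [
        "2022-01-11 08:00:00", "2022-01-12 08:00:00", "2022-01-12 20:00:00",
        "2022-01-15 10:00:00", "2022-01-16 10:00:00",
    ],
})
stats = gap_stats_hours(toy)
print(stats)
```

Grouping by user before calling `diff()` keeps the gap between one user's last assessment and the next user's first assessment out of the statistics.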

#### Statistical overview: collected results

In [25]:
result

{'n_users': 68,
 'n_assessments': 245,
 'First assessment from': '2022-01-11 18:04:19',
 'Latest assessment from': '2022-02-03 21:15:01',
 'Time range in days': 23,
 Interval(0, 1, closed='right'): 9,
 Interval(1, 2, closed='right'): 32,
 Interval(2, 3, closed='right'): 10,
 Interval(3, 5, closed='right'): 8,
 Interval(5, 10, closed='right'): 4,
 Interval(10, 100, closed='right'): 5,
 Interval(100, 1000, closed='right'): 0,
 'avg hours between two assessments': 24.054444444444446,
 'avg days between two assessments': 1.0022685185185185,
 'std_hours': 19.987742040049646,
 'std_days': 0.8328225850020685}
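
Since the `result` dict is meant to be saved to disk, note that its `pd.Interval` keys cannot be written as JSON keys directly. A minimal sketch of one way to handle this, assuming JSON is the target format, is to stringify the keys first. The dict below is a shortened stand-in for `result`, and the output file name is hypothetical.

```python
import json

import pandas as pd

# Shortened stand-in for the `result` dict above, including one pd.Interval key
result = {
    "n_users": 68,
    "n_assessments": 245,
    pd.Interval(0, 1, closed="right"): 9,
}

# json accepts only str/int/float/bool/None as keys, so stringify every key;
# default=str is a fallback for values json cannot encode natively (e.g. NumPy scalars)
serializable = {str(key): value for key, value in result.items()}
payload = json.dumps(serializable, indent=2, default=str)
print(payload)

# Hypothetical output path; adjust to the project's layout
# with open("cc_overview.json", "w") as fh:
#     fh.write(payload)
```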

In [11]:
# read in codebook and reduce to columns and rows of interest
cb = pd.read_excel('../../data/d00_helpers/codebook/cc/codebook_cc.xlsx', sheet_name='Sheet1')
cb

| Unnamed: 0 | variablename | variablemeaning | code / measuring unit | codemeaning |
|---|---|---|---|---|
| 0 | age | Age of the user | agegroups in decades | 00-09, 10-19, 20-29, … |
| 1 | altitude | GPS altitude location | float | mearsuring unit unknown |
| 2 | answer_id | Index col - unique | ID number | One user can give several answers |
| 3 | author | Do you fill out this questionnaire for yourself? | YES / NO | |
| 4 | corona_result_api | result that is reported to a user after fillin... | 1 = Contact YES, Symptoms YES, \n2 = Contact N... | 1 Suspected coronavirus (COVID-19) case\n2... |
| 5 | cough | Wether user reported this symptom | TRUE FALSE | |
| 6 | country_code | Two letters ISO2 country code | | location of user while filling out the questio... |
| 7 | created_at | datetime when questionnaire was filled out | datetime | |
| 8 | device | device used to fill out the questionnaire | | |
| 9 | diarrhea | Wether user reported this symptom | TRUE FALSE | |
