# Covid-19 Spanish Dataset Ingestion

## TODO LIST

- Add a section on execution dependencies on previous notebooks
- Complete the comments and translate them into English
- Add a check on the file to download
- Modify the coherence check between num_cases, num_cases_pcr and num_cases_qtest so that it receives a parameter and corrects the inconsistency,
  showing the corrections (4 options via the parameter: correct the total, correct the breakdown, adjust upward, adjust downward)
- Modify the coherence check between num_cases, num_cases_pcr and num_cases_qtest so that it receives a parameter that decides what to do when
  num_cases is reported but num_cases_pcr and num_cases_qtest are not (2 options via the parameter: assign everything to PCR, split between PCR and rapid tests)
- Refactor standardization, validation and completion by creating classes and subclasses

In [1]:
import numpy as np
import pandas as pd
import requests

import matplotlib.pyplot as plt
import seaborn as sns
import re
import sys
import locale

locale_ingest_str = 'es_ES.UTF-8'
locale.setlocale(locale.LC_ALL, locale_ingest_str)

date_format_ingest_raw = '%d/%m/%Y'
date_format_ingest_std = '%Y-%m-%d'
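
The two format strings above describe the date layout in the raw file and the layout used in the standard datasets. A minimal sketch of the conversion between them, using only the standard library; the helper name `raw_to_std_date` is hypothetical and not part of the notebook:

```python
from datetime import datetime

date_format_ingest_raw = '%d/%m/%Y'
date_format_ingest_std = '%Y-%m-%d'

def raw_to_std_date(raw_date: str) -> str:
    """Parse a date written as day/month/year and return it as year-month-day."""
    parsed = datetime.strptime(raw_date, date_format_ingest_raw)
    return parsed.strftime(date_format_ingest_std)

print(raw_to_std_date('20/2/2020'))  # 2020-02-20
```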

## Dataset Description

Cumulative Covid-19 figures for Spain by region and date: total cases, cases confirmed by PCR, cases confirmed by rapid test, hospitalizations, ICU admissions, deaths and recoveries.

## Aggregated Raw Dataset Download

The raw dataset is available [here](https://covid19.isciii.es/resources/serie_historica_acumulados.csv).


In [2]:
# Download Raw Dataset file
url = 'https://covid19.isciii.es/resources/serie_historica_acumulados.csv'
req = requests.get(url, allow_redirects = True)
# The context manager closes the file even if writing fails
with open('../../data/raw/Covid19AggSPWithNotes.csv', 'wb') as f:
    f.write(req.content)

# TODO: Check for file not found and empty file 
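
One possible way to address the TODO above is sketched here. It assumes that an empty or whitespace-only body should abort the ingestion; the helper name `save_raw_file` is hypothetical. HTTP errors such as a missing file can be detected before saving by calling `raise_for_status()` on the `requests` response.

```python
from pathlib import Path

def save_raw_file(content: bytes, path: str) -> int:
    """Write downloaded bytes to `path`, refusing empty downloads.

    Returns the number of bytes written.
    """
    if not content or not content.strip():
        raise ValueError('Downloaded file is empty: {0}'.format(path))
    Path(path).write_bytes(content)
    return len(content)

# Possible usage with the request shown above:
# req = requests.get(url, allow_redirects=True)
# req.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
# save_raw_file(req.content, '../../data/raw/Covid19AggSPWithNotes.csv')
```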

In [3]:
# Remove notes from Raw Dataset files 
f1 = open( '../../data/raw/Covid19AggSPWithNotes.csv', 'r')
  
in_notes = False
file_notes = ''

with open('../../data/raw/Covid19AggSP.csv', 'w') as f2:
    
    while True: 
        # Get next line from original file 
        line = f1.readline() 

        # If line is empty end of file is reached 
        if not line: 
            break
        else:
            if line.startswith('NOTA'):
                 in_notes = True
    
        # While in the notes, strip trailing commas and collect the line;
        # otherwise write the line to the new file
        if in_notes:
            while line[-2:-1] == ',': line = line[:-2] + '\n'
            file_notes = file_notes + line
        else:
            f2.write(line)

    f1.close() 

# Showing file notes.
print('FILE NOTES')
print('==========')
print(file_notes)

FILE NOTES
NOTA: El objetivo de los datos que se publican en esta web es saber el número de casos acumulados a la fecha y que por tanto no se puede deducir que la diferencia entre un día y el anterior es el número de casos nuevos ya que esos casos pueden haber sido recuperados de fechas anteriores. Cualquier inferencia que se haga sobre las diferencias de un día para otro deben hacerse con precaución y son únicamente la responsabilidad del autor.
Los datos de estas comunidades son datos de prevalencia (personas ingresadas a fecha de hoy). No reflejan el total de personas que han sido hospitalizadas o ingresadas en UCI  a lo largo del periodo de notificación(CL(UCIs)-GA(UCIS)-CM*-MD)
* Desde el día 11/04/2020 las cifras de hospitalizados de CM son casos acumulados. Previamente se refieren a personas ingresadas ese día.
* Desde el día 12/04/2020 las cifras de UCIs de CM son casos acumulados. Previamente se refieren a personas ingresadas ese día.
* Desde el día 26/04/2020 las cifras de ho
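
The note-stripping step can also be written as a pure function over a list of lines, which makes it easy to test without touching the file system. This is a sketch under the same assumption as the cell above (everything from the first line starting with `NOTA` onward is a note); the name `split_data_and_notes` is hypothetical.

```python
def split_data_and_notes(lines):
    """Split raw file lines into (data_lines, note_lines).

    Notes start at the first line beginning with 'NOTA' and run to the end
    of the file. Trailing commas left by the CSV layout are removed from notes.
    """
    data, notes = [], []
    in_notes = False
    for line in lines:
        if line.startswith('NOTA'):
            in_notes = True
        if in_notes:
            notes.append(line.rstrip('\n').rstrip(',') + '\n')
        else:
            data.append(line)
    return data, notes

sample = ['CCAA,FECHA\n', 'AN,20/2/2020\n', 'NOTA: texto,,,\n', '* otra nota,,\n']
data, notes = split_data_and_notes(sample)
print(''.join(notes))
```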

In [4]:
# Columns names for renaming.
all_columns = ['region_raw', 'date_raw', 'num_cases_raw', 'num_cases_pcr_raw', 'num_cases_qtest_raw', 'num_hosp_raw', 'num_icu_raw', 'num_deaths_raw', 'num_recov_raw']

# Load the raw dataset
df_covid19_agg = pd.read_csv('../../data/raw/Covid19AggSP.csv', encoding = 'ISO-8859-1', sep = ',', index_col = False, header = 0, names = all_columns, quotechar = "\"", na_filter = False, low_memory = False)

In [5]:
df_covid19_agg

Unnamed: 0,region_raw,date_raw,num_cases_raw,num_cases_pcr_raw,num_cases_qtest_raw,num_hosp_raw,num_icu_raw,num_deaths_raw,num_recov_raw
0,AN,20/2/2020,,,,,,,
1,AR,20/2/2020,,,,,,,
2,AS,20/2/2020,,,,,,,
3,IB,20/2/2020,1,,,,,,
4,CN,20/2/2020,1,,,,,,
...,...,...,...,...,...,...,...,...,...
1268,ML,26/4/2020,,110,11,44,3,2,87
1269,MC,26/4/2020,,1474,297,627,106,128,990
1270,NC,26/4/2020,,4733,7530,1942,130,432,1918
1271,PV,26/4/2020,,12513,1899,6457,538,1241,9840


In [6]:
# Analysis of which case columns are reported (num_cases - num_cases_pcr - num_cases_qtest):

columns = ['region_raw', 'date_raw', 'num_cases_raw', 'num_cases_pcr_raw', 'num_cases_qtest_raw']

x0 = df_covid19_agg.shape
df1 = df_covid19_agg[(df_covid19_agg['num_cases_raw'] == '') & (df_covid19_agg['num_cases_pcr_raw'] == '') & (df_covid19_agg['num_cases_qtest_raw'] == '')][columns]
df2 = df_covid19_agg[(df_covid19_agg['num_cases_raw'] == '') & (df_covid19_agg['num_cases_pcr_raw'] == '') & (df_covid19_agg['num_cases_qtest_raw'] != '')][columns]
df3 = df_covid19_agg[(df_covid19_agg['num_cases_raw'] == '') & (df_covid19_agg['num_cases_pcr_raw'] != '') & (df_covid19_agg['num_cases_qtest_raw'] == '')][columns]
df4 = df_covid19_agg[(df_covid19_agg['num_cases_raw'] == '') & (df_covid19_agg['num_cases_pcr_raw'] != '') & (df_covid19_agg['num_cases_qtest_raw'] != '')][columns]
df5 = df_covid19_agg[(df_covid19_agg['num_cases_raw'] != '') & (df_covid19_agg['num_cases_pcr_raw'] == '') & (df_covid19_agg['num_cases_qtest_raw'] == '')][columns]
df6 = df_covid19_agg[(df_covid19_agg['num_cases_raw'] != '') & (df_covid19_agg['num_cases_pcr_raw'] == '') & (df_covid19_agg['num_cases_qtest_raw'] != '')][columns]
df7 = df_covid19_agg[(df_covid19_agg['num_cases_raw'] != '') & (df_covid19_agg['num_cases_pcr_raw'] != '') & (df_covid19_agg['num_cases_qtest_raw'] == '')][columns]
df8 = df_covid19_agg[(df_covid19_agg['num_cases_raw'] != '') & (df_covid19_agg['num_cases_pcr_raw'] != '') & (df_covid19_agg['num_cases_qtest_raw'] != '')][columns]

x1 = len(df1)
x2 = len(df2)
x3 = len(df3)
x4 = len(df4)
x5 = len(df5)
x6 = len(df6)
x7 = len(df7)
x8 = len(df8)

print('TOTAL ROWS: {0}'.format(x0[0]))
print('----------------')
print('- Total cases breakdown NO - NO - NO: {0}'.format(x1))
print(df1)
print('- Total cases breakdown NO - NO - YES: {0}'.format(x2))
print(df2)
print('- Total cases breakdown NO - YES - NO: {0}'.format(x3))
print(df3)
print('- Total cases breakdown NO - YES - YES: {0}'.format(x4))
print(df4)
print('- Total cases breakdown YES - NO - NO: {0}'.format(x5))
print(df5)
print('- Total cases breakdown YES - NO - YES: {0}'.format(x6))
print(df6)
print('- Total cases breakdown YES - YES - NO: {0}'.format(x7))
print(df7)
print('- Total cases breakdown YES - YES - YES: {0}'.format(x8))
print(df8)

TOTAL ROWS: 1273
----------------
- Total cases breakdown NO - NO - NO: 136
    region_raw   date_raw num_cases_raw num_cases_pcr_raw num_cases_qtest_raw
0           AN  20/2/2020                                                    
1           AR  20/2/2020                                                    
2           AS  20/2/2020                                                    
5           CB  20/2/2020                                                    
6           CM  20/2/2020                                                    
..         ...        ...           ...               ...                 ...
202         GA   1/3/2020                                                    
204         ML   1/3/2020                                                    
205         MC   1/3/2020                                                    
351         CE   9/3/2020                                                    
356         ML   9/3/2020                                          
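
The eight filters above enumerate every combination of empty and non-empty values in the three case columns. A more compact sketch of the same count, assuming empty strings mark missing values as in the raw load; the function name `count_presence_patterns` and the sample rows are illustrative only:

```python
import pandas as pd

def count_presence_patterns(df, columns):
    """Count rows for each observed combination of empty / non-empty values."""
    # True where the raw value is reported, False where it is the empty string
    presence = df[columns].ne('')
    return presence.groupby(columns).size()

sample = pd.DataFrame({
    'num_cases_raw':       ['', '1', '', ''],
    'num_cases_pcr_raw':   ['', '', '110', '1474'],
    'num_cases_qtest_raw': ['', '', '11', '297'],
})
counts = count_presence_patterns(sample, list(sample.columns))
print(counts)
```

Only combinations that actually occur appear in the result, so a pattern with zero rows is simply absent.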

## Standard Datasets Generation

### Aggregated Raw Dataset Validation

Remove invalid bottom rows due to dataset comments. 

In [7]:
df_covid19_agg

Unnamed: 0,region_raw,date_raw,num_cases_raw,num_cases_pcr_raw,num_cases_qtest_raw,num_hosp_raw,num_icu_raw,num_deaths_raw,num_recov_raw
0,AN,20/2/2020,,,,,,,
1,AR,20/2/2020,,,,,,,
2,AS,20/2/2020,,,,,,,
3,IB,20/2/2020,1,,,,,,
4,CN,20/2/2020,1,,,,,,
...,...,...,...,...,...,...,...,...,...
1268,ML,26/4/2020,,110,11,44,3,2,87
1269,MC,26/4/2020,,1474,297,627,106,128,990
1270,NC,26/4/2020,,4733,7530,1942,130,432,1918
1271,PV,26/4/2020,,12513,1899,6457,538,1241,9840


In [8]:
# Number validation function
# Returns [converted value, validation code]. Codes: 0 = OK, 1 = not a number
def validate_func_num(value_str, empty_val):
    # empty_val must be a real number (NaN is a float, so it is accepted too)
    if not isinstance(empty_val, (int, float)) or isinstance(empty_val, bool):
        raise TypeError('\'empty_val\' parameter must be int, float or NaN. Value \'{0}\' found instead'.format(empty_val))
    
    if value_str == '':
        value_val = [empty_val, 0]
    else:
        try:
            value_val = [float(str(value_str)), 0]
        except ValueError:
            value_val = [None, 1]
    return value_val

# Region validation function
# Returns [converted value, validation code].
# Codes: 0 = OK, 1 = empty value, 2 = value not in list, 3 = unexpected error
def validate_func_inlist(value_str, list_values):
    if value_str == '':
        return [None, 1]
    try:
        if value_str in list_values:
            value_val = [str(value_str), 0]
        else:
            value_val = [None, 2]
    except Exception:
        value_val = [None, 3]
    return value_val

# Date validation function
# Returns [converted value, validation code]. Codes: 0 = OK, 1 = invalid date
def validate_func_date(value_str, date_format):
    try:
        value_val = [pd.to_datetime(str(value_str), format = date_format), 0]
    except Exception:
        value_val = [None, 1]
    return value_val

def validator_df_column(row, validate_func, column_str, column_cnv, column_val):
    value_str = row[column_str]
    value_val = validate_func(value_str)
    row[column_cnv] = value_val[0]
    row[column_val] = value_val[1]
    return row    
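
The validators above work value by value and return a `[value, code]` pair. As an alternative sketch, the numeric case can be handled for a whole column at once with `pd.to_numeric(..., errors='coerce')`. The function name `validate_num_column` is hypothetical, and unlike the row-wise version it marks invalid values with NaN rather than `None`.

```python
import pandas as pd

def validate_num_column(raw: pd.Series, empty_val: float):
    """Return (converted values, validation codes) for a column of raw strings.

    Codes: 0 = OK (including empty strings, replaced by empty_val),
           1 = value could not be parsed as a number.
    """
    is_empty = raw.eq('')
    # Unparseable strings become NaN instead of raising
    converted = pd.to_numeric(raw.where(~is_empty), errors='coerce')
    invalid = converted.isna() & ~is_empty
    converted = converted.where(~is_empty, empty_val)
    return converted, invalid.astype(int)

values, codes = validate_num_column(pd.Series(['12', '', 'abc']), 0)
print(values.tolist(), codes.tolist())
```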

In [9]:
# Validating Dataset.

# Get list of regions for validation
list_regions = list(dict.fromkeys(df_covid19_agg['region_raw'].tolist()))
if '' in list_regions: list_regions.remove('')
    
# Applying validation functions to dataset.
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x: validate_func_inlist(x, list_regions), 'region_raw', 'region', 'region_val'), axis = 1)
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x : validate_func_date(x, date_format_ingest_raw), 'date_raw', 'date', 'date_val'), axis = 1)
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x: validate_func_num(x, float('NaN')), 'num_cases_raw', 'num_cases', 'num_cases_val'), axis = 1)
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x: validate_func_num(x, float('NaN')), 'num_cases_pcr_raw', 'num_cases_pcr', 'num_cases_pcr_val'), axis = 1)
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x: validate_func_num(x, float('NaN')), 'num_cases_qtest_raw', 'num_cases_qtest', 'num_cases_qtest_val'), axis = 1)
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x: validate_func_num(x, 0), 'num_hosp_raw', 'num_hosp', 'num_hosp_val'), axis = 1)
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x: validate_func_num(x, 0), 'num_icu_raw', 'num_icu', 'num_icu_val'), axis = 1)
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x: validate_func_num(x, 0), 'num_deaths_raw', 'num_deaths', 'num_deaths_val'), axis = 1)
df_covid19_agg = df_covid19_agg.apply(lambda x: validator_df_column(x, lambda x: validate_func_num(x, 0), 'num_recov_raw', 'num_recov', 'num_recov_val'), axis = 1)

In [10]:
# Complete and validate num_cases, num_cases_pcr and num_cases_qtest.
# Check that num_cases = num_cases_pcr + num_cases_qtest. Complete missing values using the formula.
# On error -> an ArithmeticError is raised

def validate_and_complete_num_cases(row):
    import math
    def _is_nan(x):
        return math.isnan(x)
    if _is_nan(row['num_cases']):
        if _is_nan(row['num_cases_pcr']):
            if _is_nan(row['num_cases_qtest']):
                # No values. Set all to 0.
                row['num_cases'] = 0
                row['num_cases_pcr'] = 0
                row['num_cases_qtest'] = 0
            else:
                row['num_cases'] = row['num_cases_qtest']
                row['num_cases_pcr'] = 0
        else:
            if _is_nan(row['num_cases_qtest']):
                row['num_cases'] = row['num_cases_pcr']
                row['num_cases_qtest'] = 0
            else:
                row['num_cases'] = row['num_cases_pcr'] + row['num_cases_qtest']                    
    else:
        if _is_nan(row['num_cases_pcr']):
            if _is_nan(row['num_cases_qtest']):
                # Only the total is known, so it cannot be split reliably between
                # num_cases_pcr and num_cases_qtest. Educated guess: all are PCRs.
                row['num_cases_pcr'] = row['num_cases'] 
                row['num_cases_qtest'] = 0
            else:
                row['num_cases_pcr'] = row['num_cases'] - row['num_cases_qtest']
        else:
            if _is_nan(row['num_cases_qtest']):
                row['num_cases_qtest'] = row['num_cases'] - row['num_cases_pcr']
            else:
                # All values are present. Nothing to complete.
                pass

    if (not _is_nan(row['num_cases'])) and (not _is_nan(row['num_cases_pcr'])) and (not _is_nan(row['num_cases_qtest'])):
        if row['num_cases'] != row['num_cases_pcr'] + row['num_cases_qtest']:
            raise ArithmeticError ('Case numbers are not coherent: \'{0}\' \'{1}\' -> \'{2}\' ¿=? \'{3}\' \'{4}\''.format(row['region'], row['date'], row['num_cases'], row['num_cases_pcr'], row['num_cases_qtest'])) 

    if (row['num_cases'] < 0) or (row['num_cases_pcr'] < 0) or (row['num_cases_qtest'] < 0):
        raise ArithmeticError ('Case numbers must be equal to or greater than 0') 

    return row
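
The completion rule above rests on the identity `num_cases = num_cases_pcr + num_cases_qtest`. The sketch below restates the same branches as a pure function on three numbers, which may be easier to test in isolation; the name `complete_cases` is hypothetical, and the "all PCR" choice when only the total is known is the same educated guess made above.

```python
import math

def complete_cases(total, pcr, qtest):
    """Fill NaN entries so that total == pcr + qtest; return (total, pcr, qtest)."""
    nan = math.isnan
    if nan(total):
        # Missing parts count as 0 and the total is their sum
        pcr = 0 if nan(pcr) else pcr
        qtest = 0 if nan(qtest) else qtest
        total = pcr + qtest
    elif nan(pcr) and nan(qtest):
        # Only the total is known. Assumption: all cases are PCR
        pcr, qtest = total, 0
    elif nan(pcr):
        pcr = total - qtest
    elif nan(qtest):
        qtest = total - pcr
    return total, pcr, qtest

NAN = float('nan')
print(complete_cases(NAN, 110, 11))   # (121, 110, 11)
print(complete_cases(1, NAN, NAN))    # (1, 1, 0)
```

When all three values are present the function returns them unchanged, so the coherence check still has to be applied afterwards.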

In [11]:
## DMM QUICK FIX DUE TO INCOHERENT VALUES

#df_covid19_agg.loc[(df_covid19_agg['region'] == 'GA') & (df_covid19_agg['date'] == pd.to_datetime('2020-04-24', format = date_format_ingest_std)), 'num_cases_pcr'] = 9196
df_covid19_agg

Unnamed: 0,region_raw,date_raw,num_cases_raw,num_cases_pcr_raw,num_cases_qtest_raw,num_hosp_raw,num_icu_raw,num_deaths_raw,num_recov_raw,region,...,num_cases_qtest,num_cases_qtest_val,num_hosp,num_hosp_val,num_icu,num_icu_val,num_deaths,num_deaths_val,num_recov,num_recov_val
0,AN,20/2/2020,,,,,,,,AN,...,,0,0.0,0,0.0,0,0.0,0,0.0,0
1,AR,20/2/2020,,,,,,,,AR,...,,0,0.0,0,0.0,0,0.0,0,0.0,0
2,AS,20/2/2020,,,,,,,,AS,...,,0,0.0,0,0.0,0,0.0,0,0.0,0
3,IB,20/2/2020,1,,,,,,,IB,...,,0,0.0,0,0.0,0,0.0,0,0.0,0
4,CN,20/2/2020,1,,,,,,,CN,...,,0,0.0,0,0.0,0,0.0,0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1268,ML,26/4/2020,,110,11,44,3,2,87,ML,...,11.0,0,44.0,0,3.0,0,2.0,0,87.0,0
1269,MC,26/4/2020,,1474,297,627,106,128,990,MC,...,297.0,0,627.0,0,106.0,0,128.0,0,990.0,0
1270,NC,26/4/2020,,4733,7530,1942,130,432,1918,NC,...,7530.0,0,1942.0,0,130.0,0,432.0,0,1918.0,0
1271,PV,26/4/2020,,12513,1899,6457,538,1241,9840,PV,...,1899.0,0,6457.0,0,538.0,0,1241.0,0,9840.0,0


In [12]:
df_covid19_agg = df_covid19_agg.apply(validate_and_complete_num_cases, axis = 1)

In [13]:
df_covid19_agg

Unnamed: 0,region_raw,date_raw,num_cases_raw,num_cases_pcr_raw,num_cases_qtest_raw,num_hosp_raw,num_icu_raw,num_deaths_raw,num_recov_raw,region,...,num_cases_qtest,num_cases_qtest_val,num_hosp,num_hosp_val,num_icu,num_icu_val,num_deaths,num_deaths_val,num_recov,num_recov_val
0,AN,20/2/2020,,,,,,,,AN,...,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0
1,AR,20/2/2020,,,,,,,,AR,...,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0
2,AS,20/2/2020,,,,,,,,AS,...,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0
3,IB,20/2/2020,1,,,,,,,IB,...,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0
4,CN,20/2/2020,1,,,,,,,CN,...,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1268,ML,26/4/2020,,110,11,44,3,2,87,ML,...,11.0,0,44.0,0,3.0,0,2.0,0,87.0,0
1269,MC,26/4/2020,,1474,297,627,106,128,990,MC,...,297.0,0,627.0,0,106.0,0,128.0,0,990.0,0
1270,NC,26/4/2020,,4733,7530,1942,130,432,1918,NC,...,7530.0,0,1942.0,0,130.0,0,432.0,0,1918.0,0
1271,PV,26/4/2020,,12513,1899,6457,538,1241,9840,PV,...,1899.0,0,6457.0,0,538.0,0,1241.0,0,9840.0,0


In [14]:
# Applying extra int conversion.

df_covid19_agg['num_cases'] = df_covid19_agg['num_cases'].apply(int)
df_covid19_agg['num_cases_pcr'] = df_covid19_agg['num_cases_pcr'].apply(int)
df_covid19_agg['num_cases_qtest'] = df_covid19_agg['num_cases_qtest'].apply(int)
df_covid19_agg['num_hosp'] = df_covid19_agg['num_hosp'].apply(int)
df_covid19_agg['num_icu'] = df_covid19_agg['num_icu'].apply(int)
df_covid19_agg['num_deaths'] = df_covid19_agg['num_deaths'].apply(int)
df_covid19_agg['num_recov'] = df_covid19_agg['num_recov'].apply(int)
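
The same conversion can be applied to several columns in a single call with `astype`. This is a minimal sketch with a hypothetical helper `to_int_columns`; like `apply(int)`, it raises an error if any NaN remains in the selected columns.

```python
import pandas as pd

def to_int_columns(df, columns):
    """Return a copy of df with the given columns converted to integers."""
    out = df.copy()
    out[columns] = out[columns].astype(int)
    return out

sample = pd.DataFrame({'num_cases': [1.0, 121.0], 'num_hosp': [0.0, 44.0]})
converted = to_int_columns(sample, ['num_cases', 'num_hosp'])
print(converted.dtypes)
```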

In [15]:
# A non-zero validation code marks a row with standardization problems.
# Check the validation column of every standardized column.

checked_columns = ['region', 'date', 'num_cases', 'num_hosp', 'num_icu', 'num_deaths', 'num_recov']

for column in checked_columns:
    # Count the rows whose validation code reports an error
    ko_count = len(df_covid19_agg[df_covid19_agg[column + '_val'] != 0])
    if ko_count != 0: 
        sys.exit('Found {0} rows with incorrect values on \'{1}_raw\' column.'.format(ko_count, column))

In [16]:
df_covid19_agg

Unnamed: 0,region_raw,date_raw,num_cases_raw,num_cases_pcr_raw,num_cases_qtest_raw,num_hosp_raw,num_icu_raw,num_deaths_raw,num_recov_raw,region,...,num_cases_qtest,num_cases_qtest_val,num_hosp,num_hosp_val,num_icu,num_icu_val,num_deaths,num_deaths_val,num_recov,num_recov_val
0,AN,20/2/2020,,,,,,,,AN,...,0,0,0,0,0,0,0,0,0,0
1,AR,20/2/2020,,,,,,,,AR,...,0,0,0,0,0,0,0,0,0,0
2,AS,20/2/2020,,,,,,,,AS,...,0,0,0,0,0,0,0,0,0,0
3,IB,20/2/2020,1,,,,,,,IB,...,0,0,0,0,0,0,0,0,0,0
4,CN,20/2/2020,1,,,,,,,CN,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1268,ML,26/4/2020,,110,11,44,3,2,87,ML,...,11,0,44,0,3,0,2,0,87,0
1269,MC,26/4/2020,,1474,297,627,106,128,990,MC,...,297,0,627,0,106,0,128,0,990,0
1270,NC,26/4/2020,,4733,7530,1942,130,432,1918,NC,...,7530,0,1942,0,130,0,432,0,1918,0
1271,PV,26/4/2020,,12513,1899,6457,538,1241,9840,PV,...,1899,0,6457,0,538,0,1241,0,9840,0


### Conversion to Aggregated Standard Dataset  

In [17]:
# Remove raw columns
df_covid19_agg.drop(all_columns, axis = 1, inplace = True)

In [18]:
# Remove validation columns
validation_columns = [x for x in df_covid19_agg.columns.tolist() if x.endswith('_val')]
df_covid19_agg.drop(validation_columns, axis = 1, inplace = True)

In [19]:
df_covid19_agg

Unnamed: 0,region,date,num_cases,num_cases_pcr,num_cases_qtest,num_hosp,num_icu,num_deaths,num_recov
0,AN,2020-02-20,0,0,0,0,0,0,0
1,AR,2020-02-20,0,0,0,0,0,0,0
2,AS,2020-02-20,0,0,0,0,0,0,0
3,IB,2020-02-20,1,1,0,0,0,0,0
4,CN,2020-02-20,1,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
1268,ML,2020-04-26,121,110,11,44,3,2,87
1269,MC,2020-04-26,1771,1474,297,627,106,128,990
1270,NC,2020-04-26,12263,4733,7530,1942,130,432,1918
1271,PV,2020-04-26,14412,12513,1899,6457,538,1241,9840


In [20]:
import math
df_covid19_agg[df_covid19_agg['num_cases_pcr'].isnull()]

Unnamed: 0,region,date,num_cases,num_cases_pcr,num_cases_qtest,num_hosp,num_icu,num_deaths,num_recov


In [21]:
import math
df_covid19_agg[df_covid19_agg['num_cases_qtest'].isnull()]

Unnamed: 0,region,date,num_cases,num_cases_pcr,num_cases_qtest,num_hosp,num_icu,num_deaths,num_recov


In [22]:
# Order dataset by 'region' and 'date'
df_covid19_agg.sort_values(['region','date'], inplace = True)

# Saving standard dataset with aggregated values
df_covid19_agg.to_csv('../../data/standard/Covid19AggSP.csv', index = False)

### Incremental Standard Dataset Generation

In [23]:
# Get list of regions
list_regions = sorted(list(df_covid19_agg['region'].unique()))
df_covid19_inc = pd.DataFrame(columns = df_covid19_agg.columns)

for region in list_regions:
    df_covid19_agg_region = df_covid19_agg[df_covid19_agg['region'] == region].drop('region', axis = 1)
    df_covid19_inc_region = df_covid19_agg_region.diff(axis = 0)    
    df_covid19_inc_region['date'] = df_covid19_agg_region['date']
    df_covid19_inc_region = df_covid19_inc_region.iloc[1:]    
    df_covid19_inc_region.insert(0, 'region', region)
    
    df_covid19_inc = pd.concat([df_covid19_inc, df_covid19_inc_region])
    
# Convert values from float to int
df_covid19_inc['num_cases'] = df_covid19_inc['num_cases'].apply(int)
df_covid19_inc['num_cases_pcr'] = df_covid19_inc['num_cases_pcr'].apply(int)
df_covid19_inc['num_cases_qtest'] = df_covid19_inc['num_cases_qtest'].apply(int)
df_covid19_inc['num_hosp'] = df_covid19_inc['num_hosp'].apply(int)
df_covid19_inc['num_icu'] = df_covid19_inc['num_icu'].apply(int)
df_covid19_inc['num_deaths'] = df_covid19_inc['num_deaths'].apply(int)
df_covid19_inc['num_recov'] = df_covid19_inc['num_recov'].apply(int)

In [24]:
df_covid19_inc

Unnamed: 0,region,date,num_cases,num_cases_pcr,num_cases_qtest,num_hosp,num_icu,num_deaths,num_recov
19,AN,2020-02-21,0,0,0,0,0,0,0
38,AN,2020-02-22,0,0,0,0,0,0,0
57,AN,2020-02-23,0,0,0,0,0,0,0
76,AN,2020-02-24,0,0,0,0,0,0,0
95,AN,2020-02-25,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
1188,VC,2020-04-22,164,44,120,21,1,18,377
1207,VC,2020-04-23,171,70,101,28,1,23,421
1226,VC,2020-04-24,217,127,90,33,2,25,224
1245,VC,2020-04-25,249,94,155,35,1,14,210


In [25]:
# Order dataset by 'region' and 'date'
df_covid19_inc.sort_values(['region','date'], inplace = True)

# Saving standard dataset with incremental values
df_covid19_inc.to_csv('../../data/standard/Covid19IncSP.csv', index = False)