# Cleaning the Data

In [1]:
import os
import sys
#sys.path.insert(0, "../../")

import warnings
warnings.filterwarnings('ignore')

from prediksicovidjatim import util, config, database
from prediksicovidjatim.data.raw import RawDataRepo
from prediksicovidjatim.data.raw.entities import RawData

database.init()

In [2]:
kabko = RawDataRepo.fetch_kabko()
len(kabko)

38

In [3]:
selected_kabko = kabko[0]
selected_kabko

'KAB. BANGKALAN'

In [4]:
data = RawDataRepo.fetch_data(selected_kabko)
len(data)

129

## Wrong Input Field

Some values appear to have been entered into the wrong field. For example, field a and field b follow a steady trend, then on one day their values are swapped, and they return to normal the next day. These rows were fixed manually from a database client. There were also death counts that suddenly appeared and then vanished within a day. Check the other field groups before changing anything, because the jump might be a legitimate flow between groups. If a correction affects the row's totals, make sure to recalculate them.
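
A swap like this can be flagged programmatically before it is fixed by hand. The following is a minimal sketch, not the method used here: `looks_swapped` is a hypothetical helper, the rows are assumed to be plain dicts keyed by field name, and the values are made up. It checks whether exchanging two fields brings a row closer to the average of its neighboring days.

```python
def looks_swapped(prev_row, row, next_row, field_a, field_b):
    """Hypothetical helper: True if swapping the two fields fits the neighbors better."""
    # Expected values are the averages of the previous and the next day.
    expected_a = (prev_row[field_a] + next_row[field_a]) / 2
    expected_b = (prev_row[field_b] + next_row[field_b]) / 2

    # Total distance from the expectation, with the row as-is and with the fields swapped.
    as_is = abs(row[field_a] - expected_a) + abs(row[field_b] - expected_b)
    swapped = abs(row[field_b] - expected_a) + abs(row[field_a] - expected_b)
    return swapped < as_is


# Made-up values for illustration only.
day_before = {"pdp_sehat": 3, "pdp_meninggal": 13}
suspect = {"pdp_sehat": 13, "pdp_meninggal": 3}
day_after = {"pdp_sehat": 3, "pdp_meninggal": 14}

print(looks_swapped(day_before, suspect, day_after, "pdp_sehat", "pdp_meninggal"))
```

A flagged row still needs a manual look, since the change might be a legitimate flow rather than an input error.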

## Defective Data

Some values are clearly far too high or far too low, as if they had gained an extra digit or lost one. Values with extra digits were fixed by removing the last digits. Values with missing digits were fixed by appending trailing zeros. Make sure to recalculate the row's totals afterwards. These fixes were done by hand.
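
The digit fix can be illustrated with a small sketch. This is not the procedure used here (which was manual): `fix_digit_defect` is a hypothetical helper, it only handles a single extra or missing digit, and the `expected` value is assumed to come from somewhere sensible such as the average of the neighboring days.

```python
def fix_digit_defect(value, expected):
    """Hypothetical helper: pick the candidate closest to the expected value."""
    candidates = [
        value,        # leave unchanged
        value // 10,  # drop the last digit (extra digit case)
        value * 10,   # append a trailing zero (missing digit case)
    ]
    return min(candidates, key=lambda candidate: abs(candidate - expected))


# Made-up values for illustration only.
print(fix_digit_defect(5420, 540))  # extra digit
print(fix_digit_defect(54, 540))    # missing digit
print(fix_digit_defect(542, 540))   # already plausible
```

Appending a zero cannot recover the lost digit, so the result is only an approximation of the true value.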

## Missing Data

In some cases the data follow a steady trend, then on one day every field is zero, and the next day the values return to roughly where they were. These gaps should be filled with values interpolated from the neighboring days. For simplicity, linear interpolation was used.
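
Linear interpolation over a gap can be sketched as below. This is an illustration under assumptions, not the implementation of `util.lerp_missing_data`: `lerp_rows` is a hypothetical helper, the rows are assumed to be plain dicts, and the values are made up.

```python
def lerp_rows(before, after, missing_count, fields):
    """Hypothetical helper: interpolate `missing_count` rows between two known rows."""
    rows = []
    for i in range(1, missing_count + 1):
        # Fraction of the way from `before` to `after` for the i-th missing day.
        t = i / (missing_count + 1)
        # Counts are integers, so round each interpolated value.
        # Note that Python's round() rounds exact halves to the nearest even integer.
        rows.append({f: round(before[f] + (after[f] - before[f]) * t) for f in fields})
    return rows


# Made-up values for illustration only.
before = {"odr": 100, "otg": 10}
after = {"odr": 130, "otg": 40}
filled = lerp_rows(before, after, missing_count=2, fields=["odr", "otg"])
print(filled)
```

Because each field is rounded independently, the interpolated components may no longer add up to the interpolated total, so the totals are worth rechecking afterwards.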

In [5]:
start_index = util.get_date_index(data, "2020-05-29")
start_index

70

In [6]:
missing_count = 1

In [7]:
missing = util.get_missing_data(data, start_index, missing_count)  # last parameter is the number of missing days in the span
missing  # if the data is truly missing, all fields except kabko and tanggal should be 0

[{'kabko': 'KAB. BANGKALAN',
  'tanggal': '2020-05-29',
  'odr': 19587,
  'otg': 696,
  'odp_total': 911,
  'odp_belum': 0,
  'odp_selesai': 369,
  'odp_meninggal': 0,
  'odp_rawat_total': 542,
  'odp_rawat_rumah': 527,
  'odp_rawat_gedung': 0,
  'odp_rawat_rs': 15,
  'pdp_total': 22,
  'pdp_belum': 0,
  'pdp_sehat': 3,
  'pdp_meninggal': 13,
  'pdp_rawat_total': 6,
  'pdp_rawat_rumah': 0,
  'pdp_rawat_gedung': 0,
  'pdp_rawat_rs': 6,
  'pos_total': 39,
  'pos_sembuh': 6,
  'pos_meninggal': 3,
  'pos_rawat_total': 30,
  'pos_rawat_rumah': 4,
  'pos_rawat_gedung': 0,
  'pos_rawat_rs': 26}]

In [8]:
interpolated = util.lerp_missing_data(data, start_index, missing_count)
interpolated

[{'kabko': 'KAB. BANGKALAN',
  'tanggal': '2020-05-29',
  'odr': 19575,
  'otg': 694,
  'odp_total': 913,
  'odp_belum': 0,
  'odp_selesai': 369,
  'odp_meninggal': 0,
  'odp_rawat_total': 544,
  'odp_rawat_rumah': 529,
  'odp_rawat_gedung': 0,
  'odp_rawat_rs': 15,
  'pdp_total': 23,
  'pdp_belum': 0,
  'pdp_sehat': 3,
  'pdp_meninggal': 14,
  'pdp_rawat_total': 6,
  'pdp_rawat_rumah': 0,
  'pdp_rawat_gedung': 0,
  'pdp_rawat_rs': 6,
  'pos_total': 40,
  'pos_sembuh': 6,
  'pos_meninggal': 3,
  'pos_rawat_total': 31,
  'pos_rawat_rumah': 5,
  'pos_rawat_gedung': 0,
  'pos_rawat_rs': 26}]

#RawDataRepo.save_data(interpolated, upsert=True)

## Incorrect Totals

Logically, a total field should equal the sum of its component fields. However, the components are a categorization of the total, and the two may be inconsistent with each other. Therefore, I made total_opt() and total_max(). total_opt() returns total_calc() if it is nonzero. total_max() returns the larger of the total field and total_calc(). While total_opt() might be wrong, it should be safe to fix the data with total_max(), although some values may still be incorrect.
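
The two strategies can be sketched as free functions. This is a simplified illustration, not the actual methods on `RawData`: the signatures are hypothetical, rows are assumed to be plain dicts, the grouping of `pos_total` into `pos_sembuh`, `pos_meninggal` and `pos_rawat_total` is inferred from the sample row shown earlier, and falling back to the stored total when the calculated sum is zero is an assumption.

```python
def total_calc(row, components):
    """Sum of the component fields."""
    return sum(row[name] for name in components)


def total_opt(row, total_field, components):
    """Calculated total if it is nonzero, otherwise the stored total (assumed fallback)."""
    calculated = total_calc(row, components)
    return calculated if calculated != 0 else row[total_field]


def total_max(row, total_field, components):
    """Larger of the stored total and the calculated total."""
    return max(row[total_field], total_calc(row, components))


components = ["pos_sembuh", "pos_meninggal", "pos_rawat_total"]

# Consistent row, taken from the 2020-05-29 sample.
consistent = {"pos_total": 39, "pos_sembuh": 6, "pos_meninggal": 3, "pos_rawat_total": 30}
# Made-up inconsistent row: the stored total exceeds the sum of the components.
inconsistent = {"pos_total": 45, "pos_sembuh": 6, "pos_meninggal": 3, "pos_rawat_total": 30}

print(total_opt(inconsistent, "pos_total", components))
print(total_max(inconsistent, "pos_total", components))
```

For the inconsistent row, the calculated total drops cases that were counted but not yet categorized, while the maximum keeps the larger figure.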

#data_full = [x for k in kabko for x in RawDataRepo.fetch_data(k)]
#len(data_full)

#RawDataRepo.save_data([d.to_db_row(option="max") for d in data_full], upsert=True)

#RawDataRepo.trim_early_zeros()