# Pre-processing data

This notebook documents how the data was pre-processed for the project's machine learning model.

## Getting the data

The data originates from the NoSQL database of the related project, FeverFriend. A backup of that database was created on 20 December 2022 with a custom tool and then parsed into a local SQLite3 database. From this database a CSV file was exported containing all relevant fields of the measurement objects, plus the user IDs in case records need to be traced back. Overall the database held 19709 measurement entries. The export query applied the following filters:

- It must have an illnessId
- The user has not opted out of research
- The user's email is not a test email
- The temperature value is not null
- ~~The respiratory rate value is not null~~
- ~~The pulse value is not null~~

(Note: the last two constraints were later removed; rows with missing values were imputed instead.)
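The filters above can be sketched as a single SQL query run from Python. This is only an illustration on an in-memory toy database: the table names (`users`, `measurements`), the `researchOptOut` column and the test-email pattern are hypothetical and are not taken from the FeverFriend schema.

```python
import sqlite3

# Toy in-memory database; all table and column names are hypothetical
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT, researchOptOut INTEGER);
CREATE TABLE measurements (id TEXT PRIMARY KEY, userId TEXT, illnessId TEXT, temperature REAL);
INSERT INTO users VALUES
  ('u1', 'parent@example.org', 0),
  ('u2', 'qa@test.example', 0),
  ('u3', 'other@example.org', 1);
INSERT INTO measurements VALUES
  ('m1', 'u1', 'i1', 38.4),
  ('m2', 'u1', NULL, 37.9),
  ('m3', 'u1', 'i1', NULL),
  ('m4', 'u2', 'i2', 39.0),
  ('m5', 'u3', 'i3', 38.0);
""")

query = """
SELECT m.id
FROM measurements AS m
JOIN users AS u ON u.id = m.userId
WHERE m.illnessId IS NOT NULL            -- must have an illnessId
  AND u.researchOptOut = 0               -- user has not opted out of research
  AND u.email NOT LIKE '%@test.example'  -- assumed pattern for test emails
  AND m.temperature IS NOT NULL          -- temperature must be present
ORDER BY m.id
"""
rows = [r[0] for r in conn.execute(query)]
print(rows)
```

Only `m1` passes all four filters in this toy data; the other rows each violate exactly one condition.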

After exporting the CSV, a small Python notebook was created to preprocess the data.


In [84]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [85]:
df = pd.read_csv('./preprocessing/rawAnalysisData.csv', header=0)

print(f'The dataset has {len(df)} entries and {len(df.columns)} columns')
df.shape


The dataset has 18682 entries and 74 columns


(18682, 74)

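Filtering for numerical variables (lower bounds inclusive, upper bounds exclusive, unless noted otherwise)

- Filter temperature 34.5 - 43.5 (°C)
- Filter age 0 - 1200 months, i.e. up to 100 years (lower bound exclusive, upper bound inclusive)
- Filter respiratory rate 10 - 80 (breaths/min)
- Filter pulse 40 - 220 (bpm)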


In [86]:
def set_boundaries(dataset):
    ds = dataset.copy()
    ds = ds[(ds['temperature'] >= 34.5) & (ds['temperature'] < 43.5)]
    ds = ds[(ds['ageInMonths'] > 0) & (ds['ageInMonths'] <= 100*12)]

    ds = ds[(ds['respiratoryRate'] >= 10) & (ds['respiratoryRate'] < 80)]
    ds = ds[(ds['pulse'] >= 40) & (ds['pulse'] < 220)]
    return ds
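
One property of this kind of boolean filtering is worth keeping in mind: comparisons against `NaN` evaluate to `False`, so any row with a missing value in a filtered column is dropped. This is consistent with the notebook calling `set_boundaries` only on the imputed data. A minimal sketch on toy values (not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'temperature': [38.2, np.nan, 45.0, 34.5, 43.5]})

# Same half-open interval as in set_boundaries: 34.5 <= t < 43.5
mask = (toy['temperature'] >= 34.5) & (toy['temperature'] < 43.5)
kept = toy[mask]

# The NaN row, the out-of-range 45.0 and the excluded upper bound 43.5 are dropped
print(kept.index.tolist())
```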


In [87]:
df3 = df.copy()
df3.shape


(18682, 74)

In [88]:
ordinal_categories = {
    'antibiotics': {
        'antibiotics': 0,
        'antibiotics-01-No': 0,
        'antibiotics-02-Yes': 1,
    },
    'antibioticsHowMany': {
        'antibioticsHowMany': 0,
        'antibioticsHowManyTimes01-1': 1,
        'antibioticsHowManyTimes02-2': 2,
        'antibioticsHowManyTimes03-3': 3,
        'antibioticsHowManyTimes04-4': 4,
        'antibioticsHowManyTimes05-5': 5,
        'antibioticsHowManyTimes06-MoreThan5': 6,
    },
    'antibioticsHowMuch': {
        'antibioticsHowMuch': 0,
        'antibioticsHowMuch01-50mg': 50,
        'antibioticsHowMuch02-75mg': 75,
        'antibioticsHowMuch03-100mg': 100,
        'antibioticsHowMuch04-125mg': 125,
        'antibioticsHowMuch05-150mg': 150,
        'antibioticsHowMuch06-175mg': 175,
        'antibioticsHowMuch07-200mg': 200,
        'antibioticsHowMuch08-225mg': 225,
        'antibioticsHowMuch09-250mg': 250,
        'antibioticsHowMuch10-300mg': 300,
        'antibioticsHowMuch11-350mg': 350,
        'antibioticsHowMuch12-400mg': 400,
        'antibioticsHowMuch13-450mg': 450,
        'antibioticsHowMuch14-500mg': 500,
        'antibioticsHowMuch15-MoreThan500': 600,
    },
    'antipyretic': {
        'antipyretic': 0,
        'antipyreticMedication-01-No': 0,
        'antipyreticMedication-02-Yes': 1,
    },
    'antipyreticHowMany': {
        'antipyreticHowMany': 0,
        'antipyreticMedicationHowManyTimes01-1': 1,
        'antipyreticMedicationHowManyTimes02-2': 2,
        'antipyreticMedicationHowManyTimes03-3': 3,
        'antipyreticMedicationHowManyTimes04-4': 4,
        'antipyreticMedicationHowManyTimes05-5': 5,
        'antipyreticMedicationHowManyTimes06-MoreThan5': 6,
    },
    'antipyreticHowMuch': {
        'antipyreticHowMuch': 0,
        'antipyreticMedicationHowMuch01-50mg': 50,
        'antipyreticMedicationHowMuch02-75mg': 75,
        'antipyreticMedicationHowMuch03-100mg': 100,
        'antipyreticMedicationHowMuch04-125mg': 125,
        'antipyreticMedicationHowMuch05-150mg': 150,
        'antipyreticMedicationHowMuch06-175mg': 175,
        'antipyreticMedicationHowMuch07-200mg': 200,
        'antipyreticMedicationHowMuch08-225mg': 225,
        'antipyreticMedicationHowMuch09-250mg': 250,
        'antipyreticMedicationHowMuch10-300mg': 300,
        'antipyreticMedicationHowMuch11-350mg': 350,
        'antipyreticMedicationHowMuch12-400mg': 400,
        'antipyreticMedicationHowMuch13-450mg': 450,
        'antipyreticMedicationHowMuch14-500mg': 500,
        'antipyreticMedicationHowMuch15-MoreThan500': 600,
    },
    'bulgingFontanelleMax18MOld': {
        'bulgingFontanelleMax18MOld': 0,
        'bulgingFontanelleMax18MOld-01-No': 0,
        'bulgingFontanelleMax18MOld-02-Yes': 1,
    },
    'diarrhea': {
        'diarrhea': 0,
        'diarrhea-01-NoOrSlight': 0,
        'diarrhea-02-Frequent': 1,
        'diarrhea-03-FrequentAndBloody': 2,
    },
    'crying': {
        'crying': 0,
        'crying-01-DoesntCry': 0,
        'crying-02-NormalBoldCrying': 1,
        'crying-03-ContinuousWithUnusuallyHighPitch': 2,
        'crying-04-Weak': 3,
    },
    'drinking': {
        'drinking': 0,
        'drinking-01-Normal': 0,
        'drinking-02-LessThanNormal': 1,
        'drinking-03-NotFor12Hours': 3,
    },
    'dyspnea': {
        'dyspnea': 1,
        'dyspnea-01-1': 1,
        'dyspnea-02-2': 2,
        'dyspnea-03-3': 3,
        'dyspnea-04-4': 4,
        'dyspnea-05-5': 5,
    },
    'exoticTrip': {
        'exoticTrip': 0,
        'exoticTripInTheLast12Months-01-No': 0,
        'exoticTripInTheLast12Months-02-Yes': 1,
    },
    'seizure': {
        'seizure': 0,
        'febrileSeizure-01-No': 0,
        'febrileSeizure-02-Yes': 1,
    },
    'feverDuration': {
        'feverDuration': 1,
        'feverDuration-01-3>days': 1,
        'feverDuration-02-5>=days>3': 3,
        'feverDuration-03-days>=5': 5,
    },
    'glassTest': {
        'glassTest': 0,
        'glassTest-01-RedDisappears': 0,
        'glassTest-02-RedRemains': 1,
    },
    'lastTimeEating': {
        'lastTimeEating': 0,
        'lastTimeEating-01-<12hours': 0,
        'lastTimeEating-02-12<=<24hours': 12,
        'lastTimeEating-03->24hours': 24,
    },
    'lastUrination': {
        'lastUrination': 0,
        'lastUrination-01-6>hours': 0,
        'lastUrination-02-6<=hours<12': 6,
        'lastUrination-01-12<hours': 12,  # ! the '01' prefix is an error in the source key (presumably '03')
    },
    'painfulUrination': {
        'painfulUrination': 0,
        'painfulUrination-01-No': 0,
        'painfulUrination-02-Yes': 1,
    },
    'rash': {
        'rash': 0,
        'rash-01-No': 0,
        'rash-02-Yes': 1,
    },
    'skinColor': {
        'skinColor': 0,
        'skinColor-01-NormalSlightlyPale': 0,
        'skinColor-02-Pale': 1,
        'skinColor-03-GreyBlueCyanotic': 2,
    },
    'skinTurgor': {
        'skinTurgor': 0,
        'skinTurgor-01-Normal': 0,
        'skinTurgor-02-SomewhatDecreased': 1,
        'skinTurgor-03-SeverelyDecreased': 2,
    },
    'smellyUrine': {
        'smellyUrine': 0,
        'smellyUrine-01-No': 0,
        'smellyUrine-02-Yes': 1,
    },
    'tearsWhenCrying': {
        'tearsWhenCrying': 0,
        'tearsWhenCrying-01-Yes': 0,
        'tearsWhenCrying-02-NotSoMuch': 1,
        'tearsWhenCrying-03-No': 2,
    },
    'tongue': {
        'tongue': 0,
        'tongue-01-Wet': 0,
        'tongue-02-Dry': 1,
    },
    'vaccinationIn14days': {
        'vaccinationIn14days': 0,
        'vaccinationsWithIn14days-01-No': 0,
        'vaccinationsWithIn14days-02-Yes': 1,
    },
    'vaccinationHowManyHoursAgo': {
        'vaccinationHowManyHoursAgo': 0,
        'vaccinationsHowManyHoursAgo-01-Within48h': 1,
        'vaccinationsHowManyHoursAgo-02-Beyond48h': 2,
    },
    'wheezing': {
        'wheezing': 0,
        'wheezing-01-No': 0,
        'wheezing-02-SomewhatYes': 1,
        'wheezing-03-Stridor': 2,
    },
    'wryNeck': {
        'wryNeck': 0,
        'wryNeck-01-No': 0,
        'wryNeck-02-Yes': 1,
    }
}
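
In each mapping, the entry keyed by the bare field name holds the default that is used for missing or unrecognised values. As a rough sketch of how such a mapping turns raw strings into numbers, the same lookup can be written with `Series.map` followed by `fillna`. The input values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Copy of the 'diarrhea' mapping from the notebook
diarrhea_map = {
    'diarrhea': 0,  # bare field name: default value
    'diarrhea-01-NoOrSlight': 0,
    'diarrhea-02-Frequent': 1,
    'diarrhea-03-FrequentAndBloody': 2,
}

# Made-up raw column with a missing value and an unknown category
raw = pd.Series(['diarrhea-02-Frequent', np.nan,
                 'diarrhea-03-FrequentAndBloody', 'unexpected'])

default = diarrhea_map['diarrhea']
# Values without a matching key become NaN and are then replaced by the default
encoded = raw.map(diarrhea_map).fillna(default)
print(encoded.tolist())
```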


## Questions

- Should I discard entries with chronic disease? - no
- Which values to include?
- Should I exclude `State` columns? - yes
- Should I exclude the `thermometerUsed` column? - yes
- `vaccinationsHowManyHoursAgo`: which one is better? - default set
- Defaults for all? If there is no default, how to indicate empty? - 0 for all
- `vaccinationsUsedVaccination` excluded because it is a user entry, ok? - yes

### One-hot encoded variables

- pain
- antipyreticMedicationWhat (later excluded as irrelevant)
- awareness
- patientState
- vomit


In [89]:
one_hot_categories = {
    'pain': [
        'pain-01-No',
        'pain-02-FeelingBad',
        'pain-03-Headache',
        'pain-04-SwollenPainful',
        'pain-05-StrongBellyacheAche'
    ],
    # Excluded because irrelevant
    # 'antipyreticWhat': [
    #     'antipyreticMedicationWhat-01-Paracetamol',
    #     'antipyreticMedicationWhat-02-Ibuprofen',
    #     'antipyreticMedicationWhat-03-Aminophenason',
    #     'antipyreticMedicationWhat-04-Diclofenac',
    #     'antipyreticMedicationWhat-05-Metamizol',
    #     'antipyreticMedicationWhat-06-Other',
    # ],
    'awareness': [
        'awareness-01-Normal',
        'awareness-02-SleepyOddOrFeverishNightmares',
        'awareness-03-NoReactionsNoAwareness',
    ],
    'patientState': [
        'good',
        'caution',
        'danger',
    ],
    'vomit': [
        'vomit-01-No',
        'vomit-02-Slight',
        'vomit-03-Frequent',
        'vomit-04-Yellow',
        'vomit-05-5<hours',
    ]
}


In [90]:
def one_hot_array_encoder(dataset, categories, column, default, use_column_name=False, replace=True):
    # The column holds comma-separated category lists; missing values become empty lists
    ds = dataset.copy()
    ds_col = ds[column].apply(lambda s: s.split(',') if isinstance(s, str) else [])
    for cat in categories:
        # Match on the raw category; only the output column name gets the prefix
        cat_name = column + '_' + cat if use_column_name else cat
        ds[cat_name] = [1 if cat in arr or (len(arr) == 0 and cat == default) else 0 for arr in ds_col]
    return ds.drop(columns=[column]) if replace else ds


def one_hot_encoder(dataset, categories, column, default, use_column_name=False, replace=True):
    ds = dataset.copy()
    for cat in categories:
        cat_name = column + '_' + cat if use_column_name else cat
        # pd.isna is needed here: `val == np.nan` is always False
        ds[cat_name] = [1 if cat == val or (
            pd.isna(val) and cat == default) else 0 for val in ds[column]]
    return ds.drop(columns=[column]) if replace else ds


def ordinal_encoder(dataset, categories, column, default, replace=True):
    ds = dataset.copy()
    new_col = np.empty(ds[column].shape)
    keys = categories.keys()
    for i, entry in enumerate(ds[column]):
        new_col[i] = categories[entry] if entry in keys else default
    ds[column+'_new' if not replace else column] = new_col
    return ds
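
Two details of the encoders are easy to get wrong, so here is a small sketch on made-up values. First, `NaN` never compares equal to anything, including itself, so missing values have to be detected with `pd.isna`. Second, a comma-separated multi-value column such as `pain` can also be one-hot encoded with pandas' built-in `Series.str.get_dummies`. This is shown as an alternative, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself; use pd.isna to test for missing values
print(np.nan == np.nan)  # False
print(pd.isna(np.nan))   # True

# Made-up 'pain' column: comma-separated categories, one missing entry
pain = pd.Series(['pain-02-FeelingBad,pain-03-Headache', np.nan, 'pain-01-No'])
categories = ['pain-01-No', 'pain-02-FeelingBad', 'pain-03-Headache']
default = categories[0]

# Fill missing entries with the default, split on ',', and build indicator columns;
# reindex guarantees one column per known category even if it never occurs
dummies = (
    pain.fillna(default)
        .str.get_dummies(sep=',')
        .reindex(columns=categories, fill_value=0)
)
print(dummies)
```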


## SectionState generation

All question fields in the dataset have a matching `State` variable. These `State` fields are grouped by form section, so that each section gets a single `State` of its own.
For now, the section `State` is generated by taking the highest (worst) state among the fields of that section:

$$\text{good} < \text{caution} < \text{danger}$$
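
A minimal sketch of this "worst state wins" rule, assuming the ordering above is represented as integer ranks. `STATE_RANK` and `RANK_STATE` are names introduced here for illustration, and the toy values are made up. Missing states are treated as `good`, which matches the fallback branch of the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical helper tables encoding good < caution < danger
STATE_RANK = {'good': 0, 'caution': 1, 'danger': 2}
RANK_STATE = {rank: state for state, rank in STATE_RANK.items()}

# Toy State columns of the 'respiration' section
states = pd.DataFrame({
    'dyspneaState': ['good', 'caution', np.nan],
    'respiratoryRateState': ['danger', 'good', np.nan],
    'wheezingState': ['good', np.nan, np.nan],
})

# Map every state to its rank, treat missing as 'good' (rank 0),
# then take the row-wise maximum and map it back to a label
ranks = states.apply(lambda col: col.map(STATE_RANK)).fillna(0)
section_state = ranks.max(axis=1).astype(int).map(RANK_STATE)
print(section_state.tolist())
```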


In [91]:
sections = {
    'fever': [
        'feverDuration',
        # 'measurementLocation', # excluded because effect is reflected in temperature
        'temperature',
        # 'thermometerUsed', # excluded because effect is reflected in temperature
    ],
    'medication': [
        'antibiotics',
        'antibioticsHowMany',
        # 'antibioticsHowMuch', # excluded because no corresponding state field
        # 'antibioticsWhat', # excluded because arbitrary string
        'antipyretic',
        'antipyreticHowMany',
        # 'antipyreticHowMuch', # excluded because no corresponding state field
        # 'antipyreticReason', # excluded because irrelevant
        # 'antipyreticWhat', # excluded because irrelevant
    ],
    'hydration': [
        'crying',
        'diarrhea',
        'drinking',
        'lastUrination',
        'skinTurgor',
        'tearsWhenCrying',
        'tongue',
        'vomit',
    ],
    'respiration': [
        'dyspnea',
        'respiratoryRate',
        'wheezing',
    ],
    'skin': [
        'glassTest',
        'rash',
        'skinColor',
    ],
    # 'pulse': ['pulse'], # already done
    'general': [
        'awareness',
        'bulgingFontanelleMax18MOld',
        'exoticTrip',
        'lastTimeEating',
        'pain',
        'painfulUrination',
        'seizure',
        'smellyUrine',
        'vaccinationIn14days',
        'vaccinationHowManyHoursAgo',
        # 'vaccinationWhat', # excluded because arbitrary text
        'wryNeck',
    ],
    # Excluded because irrelevant for measurement
    # 'caregiver': [
    #     'caregiverConfident',
    #     'caregiverFeel',
    #     'caregiverThink',
    # ]
}


In [92]:
def generate_section_state(dataset: pd.DataFrame):
    '''
    Generate a State value (good, caution, danger) for each section in each row.
    '''
    for section in sections:
        keys = [key + 'State' for key in sections[section]]
        res = np.array([cols for cols in dataset[keys].to_numpy()])

        result = np.empty((len(res),), dtype=object)
        for i, row in enumerate(res):
            if 'danger' in row:
                result[i] = 'danger'
            elif 'caution' in row:
                result[i] = 'caution'
            else:
                result[i] = 'good'
        dataset = dataset.drop(columns=keys)
        dataset[section+'State'] = result
    return dataset


In [93]:
print(df3.shape)
df_with_section_states = generate_section_state(df3.copy())
df_with_section_states['pulseState'].value_counts()


(18682, 74)


good       1763
caution    1320
danger      410
neutral     207
Name: pulseState, dtype: int64

In [94]:
# ---------------------------------------------------------------------------- #
#                                   Encoding                                   #
# ---------------------------------------------------------------------------- #
from sklearn.impute import KNNImputer

# ----------------------------- One hot encoding ----------------------------- #
df_enc = one_hot_array_encoder(
    df_with_section_states, one_hot_categories['pain'], 'pain', one_hot_categories['pain'][0])
df_enc = one_hot_encoder(
    df_enc, one_hot_categories['awareness'], 'awareness', one_hot_categories['awareness'][0])
df_enc = one_hot_encoder(
    df_enc, one_hot_categories['vomit'], 'vomit', one_hot_categories['vomit'][0])
df_enc = one_hot_encoder(df_enc, one_hot_categories['patientState'], 'patientState', None, True)

# ---------------------- Section state one hot encoding ---------------------- #
for key in sections.keys():
    k = key + 'State'
    df_enc = one_hot_encoder(df_enc, ['good', 'caution', 'danger'], k, None, True)

# NOTE: pulseState already exists and therefore is not processed in sectionState
# generator, so it is added here
df_enc = one_hot_encoder(df_enc, ['good', 'caution', 'danger'], 'pulseState', 'good', True)

# ----------------------------- Ordinal encoding ----------------------------- #
for key in ordinal_categories.keys():
    df_enc = ordinal_encoder(df_enc, ordinal_categories[key], key, ordinal_categories[key][key])

# Dropping irrelevant or unused fields
droppables = [
    'userId',
    'id',
    'illnessId',
    'patientId',
    'timestampOfCreation',
    'timestampOfCreationOnDevice',
]
df_parsed = df_enc.drop(columns=droppables)

# checking which column has the most missing values
# if it returns the first one there might not be missing values at all
df_parsed.count().idxmin()
df_parsed


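| | ageInMonths | feverDuration | temperature | antibiotics | antibioticsHowMany | antibioticsHowMuch | antipyretic | antipyreticHowMany | antipyreticHowMuch | crying | ... | respirationState_danger | skinState_good | skinState_caution | skinState_danger | generalState_good | generalState_caution | generalState_danger | pulseState_good | pulseState_caution | pulseState_danger |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26.812914 | 1.0 | 38.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 26.833762 | 1.0 | 38.1 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 10.822871 | 1.0 | 37.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 57.098591 | 3.0 | 36.8 | 1.0 | 2.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 57.080370 | 3.0 | 36.6 | 1.0 | 2.0 | 50.0 | 1.0 | 2.0 | 50.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18677 | 6.161488 | 1.0 | 38.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 50.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 18678 | 6.179882 | 3.0 | 39.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 50.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 18679 | 38.600158 | 1.0 | 37.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 18680 | 38.591001 | 1.0 | 37.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |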


- redo the core algorithm for temperature, respiratoryRate and pulse
- encode categorical data with ordinal and one-hot encoding
- allow entries with null values for temperature, respiratoryRate and pulse, with an imputation strategy to replace the nulls


## Imputing

- Item non-response -> Missing At Random (MAR) model -> single imputation (I did not find a compelling reason to use multiple imputation) -> `KNNImputer`

Reference: [sklearn docs](https://scikit-learn.org/stable/modules/impute.html)
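
To illustrate what `KNNImputer` does, here is a small sketch on a toy matrix (the column names `a`, `b`, `c` and the values are made up, not taken from the dataset). Each missing entry is replaced by the mean of that feature over the nearest rows, where distance is a Euclidean distance computed only on the coordinates both rows have. The sketch uses `n_neighbors=2` to keep the arithmetic easy to follow; the notebook relies on the default of 5 neighbours.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with two missing entries
toy = pd.DataFrame(
    [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]],
    columns=['a', 'b', 'c'],
)

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Row 0, column 'c': nearest donors are rows 1 and 2 -> mean(3, 5) = 4.0
# Row 2, column 'a': nearest donors are rows 1 and 3 -> mean(3, 8) = 5.5
print(filled)
```

Because the distance is computed on the raw feature values, columns with a large numeric range (for example dosages in mg) can dominate the neighbour search unless the features are scaled first.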


In [95]:
knn_i = KNNImputer()
imp_arr = knn_i.fit_transform(df_parsed)


In [96]:
df_imp = pd.DataFrame(imp_arr, columns=knn_i.get_feature_names_out())
df_imp.columns


Index(['ageInMonths', 'feverDuration', 'temperature', 'antibiotics',
       'antibioticsHowMany', 'antibioticsHowMuch', 'antipyretic',
       'antipyreticHowMany', 'antipyreticHowMuch', 'crying', 'diarrhea',
       'drinking', 'lastUrination', 'skinTurgor', 'tearsWhenCrying', 'tongue',
       'dyspnea', 'respiratoryRate', 'wheezing', 'glassTest', 'rash',
       'skinColor', 'pulse', 'bulgingFontanelleMax18MOld', 'exoticTrip',
       'lastTimeEating', 'painfulUrination', 'seizure', 'smellyUrine',
       'vaccinationIn14days', 'vaccinationHowManyHoursAgo', 'wryNeck',
       'pain-01-No', 'pain-02-FeelingBad', 'pain-03-Headache',
       'pain-04-SwollenPainful', 'pain-05-StrongBellyacheAche',
       'awareness-01-Normal', 'awareness-02-SleepyOddOrFeverishNightmares',
       'awareness-03-NoReactionsNoAwareness', 'vomit-01-No', 'vomit-02-Slight',
       'vomit-03-Frequent', 'vomit-04-Yellow', 'vomit-05-5<hours',
       'patientState_good', 'patientState_caution', 'patientState_danger',
   

In [97]:
df_done = set_boundaries(df_imp)
df_done.shape


(18471, 69)

In [98]:
df_done.to_csv('processed_data.csv', index=False)
