# Processing
As we saw in the `data_exploration` notebook, the raw data contains many `NaN` values. For now, we'll impute them using `sklearn.impute.KNNImputer`; we plan to experiment with other methods later.

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.impute import KNNImputer

df = pd.read_csv('../data/raw/dengue_features_train.csv')
df_labels = pd.read_csv('../data/raw/dengue_labels_train.csv')
df_test = pd.read_csv('../data/raw/dengue_features_test.csv')
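
To put a number on "many `NaN` values", the missing entries can be counted per column. The sketch below runs on a small made-up frame (the values are invented for illustration; only the column names are borrowed from the dataset), so it shows the technique rather than the real counts.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the raw features; the values are illustrative only.
toy = pd.DataFrame({
    'city': ['sj', 'sj', 'iq', 'iq'],
    'reanalysis_min_air_temp_k': [295.9, np.nan, np.nan, 293.1],
    'precipitation_amt_mm': [12.4, 22.8, np.nan, 0.0],
})

# Number of missing entries per column, and the same as a fraction of rows.
nan_counts = toy.isna().sum()
nan_share = toy.isna().mean()

# Keep only the columns that actually need imputing, worst first.
missing = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(missing)
```

Running the same three lines on `df` and `df_test` would list the columns the imputer has to fill.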

Define our imputation function.

In [8]:
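def impute_df(df):
    # Impute the numeric columns jointly: KNNImputer finds neighbours using the
    # other features, so fitting it on one column at a time gives the column mean.
    numeric_cols = df.select_dtypes(include='number').columns
    nan_cols = [col for col in numeric_cols if df[col].isna().any()]
    if nan_cols:
        imputed = pd.DataFrame(KNNImputer().fit_transform(df[numeric_cols]),
                               columns=numeric_cols, index=df.index)
        # Write back only the columns that had gaps, so other dtypes are untouched.
        df[nan_cols] = imputed[nan_cols]
    return df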
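
Which columns are passed to `KNNImputer` matters, and a toy frame makes this visible. The data below is made up for illustration, and `n_neighbors=1` is chosen only to keep the arithmetic easy to follow. When the imputer sees both columns, it fills the gap from the nearest row. When it sees only the column with the gap, there is nothing to measure distance on, and it falls back to the column mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: row 1 is close to row 0 in 'a' and is missing 'b'.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 10.0, 11.0],
    'b': [1.0, np.nan, 10.0, 11.0],
})

# Both columns: the nearest neighbour of row 1 (by 'a') is row 0, so 'b' becomes 1.0.
joint = KNNImputer(n_neighbors=1).fit_transform(toy)
joint_b = joint[1, 1]

# Only column 'b': no observed feature to compare on, so the column mean is used.
single = KNNImputer(n_neighbors=1).fit_transform(toy[['b']])
single_b = single[1, 0]

print(joint_b, single_b, toy['b'].mean())
```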

Impute each data frame, then drop the columns that are linearly dependent on others, as shown in the correlation matrix in the `data_exploration` notebook.

In [9]:
df = impute_df(df)
df_labels = impute_df(df_labels)
df_test = impute_df(df_test)

redundant_cols = ['precipitation_amt_mm', 'reanalysis_specific_humidity_g_per_kg', 'reanalysis_min_air_temp_k']
df.drop(columns=redundant_cols, inplace=True)
df_test.drop(columns=redundant_cols, inplace=True)
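
The dropped columns were picked by reading the correlation matrix. One way to list such candidates programmatically is sketched below. The helper name `highly_correlated_pairs`, the `0.95` threshold, and the toy data are all assumptions made for illustration; they are not taken from the `data_exploration` notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.95):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include='number').corr().abs()
    # Keep the strict upper triangle so each pair appears once and the
    # diagonal (a column against itself) is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold]

# Made-up data: 'y' is an exact linear function of 'x', 'z' is unrelated.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [3.0, 5.0, 7.0, 9.0, 11.0],
    'z': [5.0, 1.0, 4.0, 2.0, 3.0],
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for dropping. Which one to keep remains a modelling decision.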

In [10]:
df.to_csv('../data/clean/full/dengue_features_train.csv', index=False)
df_labels.to_csv('../data/clean/full/dengue_labels_train.csv', index=False)

df_test.to_csv('../data/clean/full/dengue_features_test.csv', index=False)

Split the data by city and write a separate .csv file for each, for easier modeling.

In [11]:
sj_train_features = df[df['city'] == 'sj']
iq_train_features = df[df['city'] == 'iq']

sj_train_labels = df_labels[df_labels['city'] == 'sj']
iq_train_labels = df_labels[df_labels['city'] == 'iq']

sj_test_features = df_test[df_test['city'] == 'sj']
iq_test_features = df_test[df_test['city'] == 'iq']

In [12]:
sj_train_features.to_csv('../data/clean/sj/sj_train_features.csv', index=False)
sj_train_labels.to_csv('../data/clean/sj/sj_train_labels.csv', index=False)

iq_train_features.to_csv('../data/clean/iq/iq_train_features.csv', index=False)
iq_train_labels.to_csv('../data/clean/iq/iq_train_labels.csv', index=False)

sj_test_features.to_csv('../data/clean/sj/sj_test_features.csv', index=False)
iq_test_features.to_csv('../data/clean/iq/iq_test_features.csv', index=False)
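
The per-city split and export above repeat the same pattern for each frame, so they could also be written as a loop over `groupby('city')`. The helper `split_by_city` below is a hypothetical sketch, not part of the original notebook. It is demonstrated on a made-up frame and writes to a temporary directory instead of `../data/clean/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def split_by_city(frame, out_dir, suffix):
    """Write one CSV per city as <out_dir>/<city>/<city>_<suffix>.csv."""
    written = {}
    for city, city_frame in frame.groupby('city'):
        city_dir = Path(out_dir) / city
        city_dir.mkdir(parents=True, exist_ok=True)
        path = city_dir / f'{city}_{suffix}.csv'
        city_frame.to_csv(path, index=False)
        written[city] = path
    return written

# Made-up labels frame with the same 'city' codes as the dataset.
toy = pd.DataFrame({
    'city': ['sj', 'sj', 'iq'],
    'year': [1990, 1991, 2000],
    'total_cases': [4, 5, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    paths = split_by_city(toy, tmp, 'train_labels')
    sj_rows = len(pd.read_csv(paths['sj']))
    iq_rows = len(pd.read_csv(paths['iq']))

print(sj_rows, iq_rows)
```

With the real frames, a call such as `split_by_city(df, '../data/clean', 'train_features')` would write to the same paths as the cells above.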