# Data preparation and transformation exercise

## Part III - Missing values and Imputation

The objective of this exercise is to practice various steps of data preprocessing and feature engineering.

The scenario is the preparation of data for an ML multilinear regression model.

The dataset used is the "Climate Weather Surface of Brazil - Hourly", which is available at <a href="https://www.kaggle.com/PROPPG-PPG/hourly-weather-surface-brazil-southeast-region?select=make_dataset.py">Kaggle</a>.

It contains hourly climate data taken from 122 weather stations in Brazil between 2000 and 2021.

**Steps:**
1. Load data
2. Inspect data
3. Format features
4. Clean messy data
5. Remove duplicate values
6. <a href="#Treat-missing-values">Treat missing values</a>
7. <a href="#Imputation">Imputation</a>
8. Remove strongly correlated features
9. Remove outliers
10. Aggregate features
11. Encode categorical features
12. Feature scaling
13. Dimensionality reduction and feature decomposition
14. Sample and balance

In [2]:
import pandas as pd
import numpy as np
import pickle
# use a context manager so the file handle is closed after loading
with open("clean_dataset.pkl", "rb") as f:
    dataset = pickle.load(f)

## Treat missing values

As seen above, missing values were replaced with -9999 (solar_radiation) or -9999.0 (other features).

If all sensor values are simultaneously missing, we will discard the row.

In [17]:
# index of the rows in which all of the checked sensors hold the missing-value sentinel
empty_rows = dataset.loc[
    (dataset.precipitation == -9999.0) &
    (dataset.pressure == -9999.0) &
    (dataset.pressure_max == -9999.0) &
    (dataset.pressure_min == -9999.0) &
    (dataset.solar_radiation == -9999) &
    (dataset.air_temperature == -9999.0) &
    (dataset.dp_temperature == -9999.0) &
    (dataset.air_temp_max == -9999.0) &
    (dataset.air_temp_min == -9999.0) &
    (dataset.wind_gust == -9999.0) &
    (dataset.wind_speed == -9999.0)].index
print("The dataset has {} empty measurements".format(len(empty_rows)))

The dataset has 1468158 empty measurements


In [22]:
# Drop the rows in which all of the checked sensors are missing
dataset.drop(index=empty_rows, inplace=True)
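
The row filter above compares each column to the sentinel one by one. As a minimal sketch of the same idea, the comparison can also be applied to a list of columns at once. The toy frame and its values are made up for illustration; only the column names and the -9999 sentinel come from the dataset.

```python
import pandas as pd

MISSING = -9999  # sentinel used in the dataset for missing readings

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "precipitation": [0.0, -9999.0, -9999.0],
    "pressure": [950.1, -9999.0, 948.7],
    "solar_radiation": [1200, -9999, -9999],
    "station_name": ["A", "A", "B"],
})
sensor_cols = ["precipitation", "pressure", "solar_radiation"]

# True for the rows in which every listed sensor column holds the sentinel.
all_missing = toy[sensor_cols].eq(MISSING).all(axis=1)

# Keep only the rows that have at least one real reading.
toy_clean = toy.loc[~all_missing]
print("Dropped {} empty rows".format(int(all_missing.sum())))
```

With this form, extending the check to more sensors only requires adding their names to `sensor_cols`.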

The empty readings are gone. Next we will replace the remaining individual missing readings (-9999) with np.nan, to facilitate their treatment.

In [23]:
def replace_nines(col_name: str):
    """
    Replace the sentinel value -9999 in the given column of the (global) dataset with np.nan
    """
    # assigning through .loc mutates the global object in place,
    # so no `global` statement is needed
    dataset.loc[dataset[col_name] == -9999.0, col_name] = np.nan

In [24]:
numeric_features = ["precipitation",
                    "pressure",
                    "pressure_max",
                    "pressure_min",
                    "solar_radiation",
                    "air_temperature",
                    "dp_temperature",
                    "air_temp_max",
                    "air_temp_min",
                    "dp_temp_max",
                    "dp_temp_min",
                    "rel_hum_max",
                    "rel_hum_min",
                    "Rel_humidity",
                    "wind_direction",
                    "wind_gust",
                    "wind_speed"]

for i in numeric_features:
    replace_nines(i)
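
As an alternative sketch, `DataFrame.replace` can swap the sentinel for `np.nan` in several columns with a single call, without a helper function or a loop. The toy frame below uses made-up values and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; one float column and one integer column.
toy = pd.DataFrame({
    "pressure": [950.1, -9999.0, 948.7],
    "solar_radiation": [1200, -9999, 35],
    "station_name": ["A", "A", "B"],
})
cols = ["pressure", "solar_radiation"]

# Replace the sentinel in all listed columns at once.
# The integer column becomes float, because NaN is a float value.
toy[cols] = toy[cols].replace(-9999, np.nan)
print(toy.isna().sum())
```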

In [25]:
# use a context manager so the file is flushed and closed after writing
with open("with_nan_dataset.pkl", "wb") as f:
    pickle.dump(dataset, f)

## Imputation

Next we will use the mean value of the feature in each station to replace the missing values of that feature.

**Warning:** the means used to impute the training dataset should not be calculated with values from the test dataset. However, in order to keep this exercise simple, we will ignore this and impute the missing values before splitting the dataset.
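
For reference, a leakage-free variant could look like the sketch below: the per-station means are computed from the training rows only and then applied to both splits. The toy data and the hand-picked split are hypothetical, and this is not the approach taken in the rest of this exercise.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "station_name": ["A", "A", "A", "B", "B", "B"],
    "pressure": [950.0, 952.0, np.nan, 900.0, np.nan, 904.0],
})

# Hypothetical split chosen by hand; a real split would come from a proper train/test procedure.
train = toy.iloc[[0, 1, 3, 4]].copy()
test = toy.iloc[[2, 5]].copy()

# Per-station means learned from the training rows only.
station_means = train.groupby("station_name")["pressure"].mean()

# Fill the gaps in both splits with the training means of the matching station.
train["pressure"] = train["pressure"].fillna(train["station_name"].map(station_means))
test["pressure"] = test["pressure"].fillna(test["station_name"].map(station_means))
```

Here the missing test reading of station A is filled with 951.0, the mean of the two training readings of that station, so no test value influences the imputation.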

In [26]:
# count missing values per feature
print(dataset.isnull().sum())

full_time                0
precipitation       569786
pressure             81912
pressure_max         86194
pressure_min         87165
solar_radiation    4606596
air_temperature      19866
dp_temperature      221481
air_temp_max         26916
air_temp_min         27768
dp_temp_max         224436
dp_temp_min         231134
rel_hum_max         214905
rel_hum_min         220026
Rel_humidity        209843
wind_direction      230632
wind_gust           154709
wind_speed          142705
station_name             0
dtype: int64
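
Absolute counts are hard to compare without the total number of rows. A small sketch of how the share of missing values per feature could be computed, using a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "pressure": [950.0, np.nan, 954.0, 900.0],
    "solar_radiation": [np.nan, np.nan, 35.0, np.nan],
})

# isna() yields booleans; their column mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A feature with a large share of missing values will end up dominated by the imputed mean, which is worth keeping in mind when interpreting it later.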


In [27]:
# we'll generate a dataset (frame) with the data of each station
# and use each station's mean to impute that station's missing values
frames = []

# number of numeric features present in the dataset
num_features = 17

# position offset of the first numeric feature within the dataset
feature_offset = 1

# list with the names of all the stations
stations = dataset['station_name'].unique().tolist()

for a, i in enumerate(stations, start=1): # iterate stations
    # create dataset with data of that station only
    df_station = dataset[dataset['station_name'] == i].copy()
    print('processing station {} {}'.format(a, i))
    for j in range(feature_offset, feature_offset + num_features): # iterate features
        col = dataset.columns[j]
        # assign the result back: fillna(inplace=True) on a column selection
        # may act on a temporary copy and leave df_station unchanged
        df_station[col] = df_station[col].fillna(df_station[col].mean())
    frames.append(df_station)

processing station 1 PARANOA (COOPA-DF)
processing station 2 BELA VISTA
processing station 3 NOVA UBIRATA
processing station 4 GOIANIA
processing station 5 CARLINDA
processing station 6 SIDROLANDIA
processing station 7 ITAQUIRAI
processing station 8 COTRIGUACU
processing station 9 SAO MIGUEL DO ARAGUAIA
processing station 10 NHUMIRIM
processing station 11 QUERENCIA
processing station 12 RIO VERDE
processing station 13 APIACAS
processing station 14 SALTO DO CEU
processing station 15 COXIM
processing station 16 SAO GABRIEL DO OESTE
processing station 17 PONTA PORA
processing station 18 MARACAJU
processing station 19 ALTO PARAISO DE GOIAS
processing station 20 BRASNORTE (NOVO MUNDO)
processing station 21 JATAI
processing station 22 GAUCHA DO NORTE
processing station 23 SAO FELIX  DO ARAGUAIA
processing station 24 SANTO ANTONIO DO LESTE
processing station 25 PARANATINGA
processing station 26 SORRISO
processing station 27 DOURADOS
processing station 28 GUARANTA DO NORTE
processing station 2

In [28]:
len(frames)

115

In [29]:
# recreate the dataset from the cleaned frames
dataset = pd.concat(frames)
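
The per-station loop followed by `pd.concat` can also be expressed with `groupby(...).transform("mean")`, which returns the station mean aligned to every row. The sketch below uses a toy frame with made-up values and is an alternative formulation, not the code that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "station_name": ["A", "A", "A", "B", "B"],
    "pressure": [950.0, np.nan, 954.0, 900.0, np.nan],
    "wind_speed": [1.0, 3.0, np.nan, np.nan, 2.0],
})
cols = ["pressure", "wind_speed"]

# For every row, the mean of each feature within that row's station.
station_means = toy.groupby("station_name")[cols].transform("mean")

# Fill only the missing cells; existing readings are left untouched.
toy[cols] = toy[cols].fillna(station_means)
print(toy)
```

This version keeps the original row order. In either formulation, a station with no valid reading at all for a feature has a NaN mean, so its gaps for that feature would remain unfilled.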

In [30]:
# use a context manager so the file is flushed and closed after writing
with open("imputed_dataset.pkl", "wb") as f:
    pickle.dump(dataset, f)

In [31]:
# confirm that we do not have missing values
print(dataset.isnull().sum())

full_time          0
precipitation      0
pressure           0
pressure_max       0
pressure_min       0
solar_radiation    0
air_temperature    0
dp_temperature     0
air_temp_max       0
air_temp_min       0
dp_temp_max        0
dp_temp_min        0
rel_hum_max        0
rel_hum_min        0
Rel_humidity       0
wind_direction     0
wind_gust          0
wind_speed         0
station_name       0
dtype: int64
