# Data preparation and transformation exercise

## Part II - Format, Clean and Remove duplicates

The objective of this exercise is to practice various steps of data preprocessing and feature engineering.

The scenario is the preparation of data for a multiple linear regression model.

The dataset used is the "Climate Weather Surface of Brazil - Hourly", which is available at <a href="https://www.kaggle.com/PROPPG-PPG/hourly-weather-surface-brazil-southeast-region?select=make_dataset.py">Kaggle</a>.

It contains hourly climate data taken from 122 weather stations in Brazil between 2000 and 2021.

**Steps:**
1. Load data
2. Inspect data
3. <a href="#Format-features">Format features</a>
4. <a href="#Clean-messy-data">Clean messy data</a>
5. <a href="#Remove-duplicate-values">Remove duplicate values</a>
6. Treat missing values
7. Imputation
8. Remove strongly correlated features
9. Remove outliers
10. Aggregate features
11. Encode categorical features
12. Feature scaling
13. Dimensionality reduction and feature decomposition
14. Sample and balance

In [None]:
import pandas as pd
import numpy as np
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset saved in the previous part; the context manager closes the file.
with open("newindex_dataset.pkl", "rb") as f:
    dataset = pickle.load(f)

## Format features

Let's look at the first row to get a feel for how the features are presented.

In [30]:
dataset.iloc[0]

full_time          2017-12-20 14:00:00
precipitation                      0.0
pressure                         899.6
pressure_max                     900.0
pressure_min                     899.6
solar_radiation                   3391
air_temperature                   26.5
dp_temperature                    17.7
air_temp_max                      26.5
air_temp_min                      24.4
dp_temp_max                       18.3
dp_temp_min                       16.5
rel_hum_max                         65
rel_hum_min                         57
Rel_humidity                        59
wind_direction                      39
wind_gust                          9.6
wind_speed                         3.9
region                              CO
state                               DF
station_name        PARANOA (COOPA-DF)
station_id                        A047
Latitude                    -16.011111
Longitude                     -47.5575
Elevation                       1043.0
year                     

Let's check the maximum and minimum values of each feature to see if they are within reasonable bounds.

In [31]:
dataset.max()

full_time                       2021-04-30 23:00:00
precipitation                                  96.0
pressure                                     1028.8
pressure_max                                 1030.6
pressure_min                                 1028.1
solar_radiation                               48898
air_temperature                                45.0
dp_temperature                                 44.8
air_temp_max                                   45.0
air_temp_min                                   45.0
dp_temp_max                                    44.9
dp_temp_min                                    44.7
rel_hum_max                                     100
rel_hum_min                                     100
Rel_humidity                                    100
wind_direction                                  360
wind_gust                                      49.4
wind_speed                                     20.0
region                                           CO
state       

The maximum values seem reasonable for these feature types.

In [32]:
dataset.min()

full_time          2000-05-07 00:00:00
precipitation                  -9999.0
pressure                       -9999.0
pressure_max                   -9999.0
pressure_min                   -9999.0
solar_radiation                  -9999
air_temperature                -9999.0
dp_temperature                 -9999.0
air_temp_max                   -9999.0
air_temp_min                   -9999.0
dp_temp_max                    -9999.0
dp_temp_min                    -9999.0
rel_hum_max                      -9999
rel_hum_min                      -9999
Rel_humidity                     -9999
wind_direction                   -9999
wind_gust                      -9999.0
wind_speed                     -9999.0
region                              CO
state                               DF
station_name                  AGUA BOA
station_id                        A001
Latitude                    -23.966944
Longitude                   -59.873056
Elevation                          5.0
year                     

Every numeric feature has a minimum of -9999, so this value is clearly being used as a placeholder for missing values.
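As a minimal sketch of how such a placeholder could be turned into a proper missing value, the snippet below uses a small hypothetical DataFrame (`toy`, with made-up rows) rather than the real dataset. It only illustrates the idea; the actual treatment of missing values belongs to a later step of this exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the -9999 placeholder seen above.
toy = pd.DataFrame({
    "air_temperature": [26.5, -9999.0, 24.4],
    "rel_hum_max": [65, -9999, 57],
    "station_name": ["A", "B", "C"],
})

# Replace the placeholder with NaN so pandas treats it as missing.
toy_clean = toy.replace(-9999, np.nan)

# Count missing values per column after the replacement.
missing_per_column = toy_clean.isna().sum()
print(missing_per_column)
```

Once the placeholder is NaN, aggregations such as `min()` skip it by default, so the minimum values become meaningful again.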

## Clean messy data

As we saw before, the values of Latitude, Longitude and Elevation are not unique within each station. Even so, they act as categorical features tied to a single station, like 'region' and 'state'. We'll therefore discard them, along with station_id, and keep the categorical feature station_name (to be one-hot encoded).

This still leaves us with two different values of station_name for the same station. So, before discarding station_id, we'll unify the name of the station with station_id == 'A927' as "BRASNORTE (NOVO MUNDO)".

In [34]:
dataset.loc[dataset.station_id == 'A927', 'station_name'] = "BRASNORTE (NOVO MUNDO)"

In [35]:
dataset.station_name.loc[dataset.station_id == 'A927'].unique()

array(['BRASNORTE (NOVO MUNDO)'], dtype=object)
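This kind of inconsistency can also be detected programmatically. The sketch below counts the distinct names recorded for each station id on a small hypothetical DataFrame (`toy`); the station names in it are placeholders invented for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: station A927 appears under two different names.
toy = pd.DataFrame({
    "station_id": ["A001", "A001", "A927", "A927"],
    "station_name": ["STATION X", "STATION X", "OLD NAME", "NEW NAME"],
})

# Number of distinct names recorded for each station_id.
names_per_id = toy.groupby("station_id")["station_name"].nunique()

# Keep only the ids that map to more than one name.
inconsistent_ids = names_per_id[names_per_id > 1].index.tolist()
print(inconsistent_ids)
```

Running the same check on the real dataset after the fix should presumably return an empty list.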

In [36]:
dataset.drop(columns = ['station_id', 
                        'Latitude', 
                        'Longitude', 
                        'Elevation', 
                        'year', 
                        'region', 
                        'state'], inplace=True)

In [37]:
# Save the cleaned dataset; the context manager closes the file.
with open("clean_dataset.pkl", "wb") as f:
    pickle.dump(dataset, f)

## Remove duplicate values

In [39]:
dataset.duplicated().value_counts()

False    11427120
dtype: int64

There are no duplicates in the dataset.
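Had duplicates been present, they could have been removed with `drop_duplicates`. The sketch below shows this on a small hypothetical DataFrame (`toy`) whose rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame in which the first two rows are identical.
toy = pd.DataFrame({
    "full_time": ["2017-12-20 14:00:00", "2017-12-20 14:00:00", "2017-12-20 15:00:00"],
    "station_name": ["PARANOA (COOPA-DF)"] * 3,
    "air_temperature": [26.5, 26.5, 27.0],
})

# duplicated() flags every repeat of an earlier row; summing counts them.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a contiguous index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

By default both methods compare all columns and keep the first occurrence; the `subset` and `keep` parameters change that behaviour.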