# 02b-Preprocess the `development` dataset - Columns

__Goal__:

1. Read the dataset `weather_dataset_raw_development_timestamp.pkl`
2. Preprocess the columns
   - Rename the columns
   - Drop the columns `S_No`, `Location`, `Apparent_temperature`
   - Refactor the column `Weather`
3. Statistical analysis of the numerical variables
4. Statistical analysis of the categorical variable `Weather`
5. Save the preprocessed data as `weather_dataset_raw_development_columns.pkl`, and remove `weather_dataset_raw_development_timestamp.pkl`.

### Import

In [1]:
import os

import numpy as np
import pandas as pd
from IPython.display import display
from pathlib import Path

### Utilities

In [2]:
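def statistical_analysis_of_numerical_variables(df):
    """Print the names, NaN counts, summary statistics, and Pearson
    correlation matrix of the numerical columns of `df`."""
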
    # Print the numerical variables
    numerical_variables = list(df.select_dtypes(include= "number").columns)
    print('Numerical variables of "df":')
    print("-"*len('Numerical variables of "df":')+"\n")
    print(', '.join(numerical_variables)+"\n")

    # Print the number of NaN's per numerical variables
    print("Number of NaN's per numerical variables:")
    print("-"*len("Number of NaN's per numerical variables:")+"\n")
    display(df[numerical_variables].isnull().sum())

    # Display the statistics of numerical variables
    print("\nStatistics of numerical variables:")
    print("-"*len("Statistics of numerical variables:")+"\n")
    display(df[numerical_variables].describe())

    # Display the correlation matrix of numerical variables
    print("\nCorrelation matrix of numerical variables:")
    print("-"*len("Correlation matrix of numerical variables:")+"\n")
    display(df[numerical_variables].corr('pearson'))

# 1. Read the dataset

In [3]:
df = pd.read_pickle(Path('datasets')/'weather_dataset_raw_development_timestamp.pkl')
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43824 entries, 2006-01-01 00:00:00+00:00 to 2010-12-31 23:00:00+00:00
Freq: H
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   S_No                    43824 non-null  int64  
 1   Location                43824 non-null  object 
 2   Temperature_C           43824 non-null  float64
 3   Apparent_Temperature_C  43824 non-null  float64
 4   Humidity                43824 non-null  float64
 5   Wind_speed_kmph         43824 non-null  float64
 6   Wind_bearing_degrees    43824 non-null  int64  
 7   Visibility_km           43824 non-null  float64
 8   Pressure_millibars      43824 non-null  float64
 9   Weather_conditions      43819 non-null  object 
dtypes: float64(6), int64(2), object(2)
memory usage: 3.7+ MB


In [4]:
df.head()

| | S_No | Location | Temperature_C | Apparent_Temperature_C | Humidity | Wind_speed_kmph | Wind_bearing_degrees | Visibility_km | Pressure_millibars | Weather_conditions |
|---|---|---|---|---|---|---|---|---|---|---|
| 2006-01-01 00:00:00+00:00 | 2881 | Port of Turku, Finland | 1.161111 | -3.238889 | 0.85 | 16.6152 | 139 | 9.9015 | 1016.15 | rain |
| 2006-01-01 01:00:00+00:00 | 2882 | Port of Turku, Finland | 1.666667 | -3.155556 | 0.82 | 20.2538 | 140 | 9.9015 | 1015.87 | rain |
| 2006-01-01 02:00:00+00:00 | 2883 | Port of Turku, Finland | 1.711111 | -2.194444 | 0.82 | 14.49 | 140 | 9.9015 | 1015.56 | rain |
| 2006-01-01 03:00:00+00:00 | 2884 | Port of Turku, Finland | 1.183333 | -2.744444 | 0.86 | 13.9426 | 134 | 9.9015 | 1014.98 | rain |
| 2006-01-01 04:00:00+00:00 | 2885 | Port of Turku, Finland | 1.205556 | -3.072222 | 0.85 | 15.9068 | 149 | 9.982 | 1014.08 | rain |


# 2. Preprocess the columns

## A. Rename the columns

In [5]:
df.rename(columns={"Temperature_C": "Temperature", 
                   "Apparent_Temperature_C": "Apparent_temperature",
                   "Wind_speed_kmph": "Wind_speed",
                   "Wind_bearing_degrees": "Wind_bearing",
                   "Visibility_km": "Visibility",
                   "Pressure_millibars": "Pressure",
                   "Weather_conditions": "Weather"}, inplace=True)

In [6]:
df.head()

| | S_No | Location | Temperature | Apparent_temperature | Humidity | Wind_speed | Wind_bearing | Visibility | Pressure | Weather |
|---|---|---|---|---|---|---|---|---|---|---|
| 2006-01-01 00:00:00+00:00 | 2881 | Port of Turku, Finland | 1.161111 | -3.238889 | 0.85 | 16.6152 | 139 | 9.9015 | 1016.15 | rain |
| 2006-01-01 01:00:00+00:00 | 2882 | Port of Turku, Finland | 1.666667 | -3.155556 | 0.82 | 20.2538 | 140 | 9.9015 | 1015.87 | rain |
| 2006-01-01 02:00:00+00:00 | 2883 | Port of Turku, Finland | 1.711111 | -2.194444 | 0.82 | 14.49 | 140 | 9.9015 | 1015.56 | rain |
| 2006-01-01 03:00:00+00:00 | 2884 | Port of Turku, Finland | 1.183333 | -2.744444 | 0.86 | 13.9426 | 134 | 9.9015 | 1014.98 | rain |
| 2006-01-01 04:00:00+00:00 | 2885 | Port of Turku, Finland | 1.205556 | -3.072222 | 0.85 | 15.9068 | 149 | 9.982 | 1014.08 | rain |
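
By default, `DataFrame.rename` silently ignores mapping keys that match no column, so a misspelled key would leave the old name in place without any warning. A minimal sketch of a stricter rename using the `errors="raise"` option, on a small made-up frame (the frame `toy` and the deliberately misspelled key `Temperature_c` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Toy frame with two of the original column names (values are illustrative).
toy = pd.DataFrame({"Temperature_C": [1.16], "Humidity": [0.85]})

# A correct mapping renames the column as expected.
renamed = toy.rename(columns={"Temperature_C": "Temperature"}, errors="raise")
print(list(renamed.columns))

# A misspelled key raises a KeyError instead of being silently ignored.
caught = None
try:
    toy.rename(columns={"Temperature_c": "Temperature"}, errors="raise")
except KeyError as exc:
    caught = exc
print("KeyError raised:", caught is not None)
```

Whether the stricter behaviour is wanted depends on how often the raw column names are expected to change upstream.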


## B. Drop the columns `S_No`, `Location`, and `Apparent_temperature`

#### `S_No`

In [7]:
len(df["S_No"].unique())

43824

As `S_No` (serial number) takes a distinct value on each of the 43824 rows and seems to be an incremental index, we discard it.

In [8]:
df.drop(['S_No'], axis=1, inplace=True)
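
The unique count alone shows that `S_No` never repeats, not that it increases by one per row. A minimal sketch of a direct check, run on a small made-up series (the helper name `is_incremental_index` and the toy values are assumptions for illustration):

```python
import pandas as pd


def is_incremental_index(series: pd.Series) -> bool:
    """Return True if the values are unique and increase by exactly 1 per row."""
    if series.empty:
        return False
    # diff() yields NaN for the first row, which carries no step information.
    steps = series.diff().dropna()
    return bool(series.is_unique and (steps == 1).all())


toy = pd.DataFrame({"S_No": [2881, 2882, 2883, 2884]})
print(is_incremental_index(toy["S_No"]))
```

On the real data, the check would have to run before the column is dropped.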

#### `Location`

In [9]:
df["Location"].value_counts()

Location
Port of Turku, Finland    43824
Name: count, dtype: int64

As `Location` takes a single value (`Port of Turku, Finland`) on every row, it carries no information, so we discard it.

In [10]:
df.drop(['Location'], axis=1, inplace=True)
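
Columns that hold a single value, like `Location` here, can also be detected programmatically instead of being inspected one by one. A minimal sketch on a made-up frame (the helper name `constant_columns` and the toy values are assumptions):

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """List the columns that hold at most one distinct value (NaN counts as a value)."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]


toy = pd.DataFrame({
    "Location": ["Port of Turku, Finland"] * 3,
    "Temperature": [1.16, 1.67, 1.71],
})
print(constant_columns(toy))
```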

#### `Apparent_temperature`

In [11]:
df[["Temperature", "Apparent_temperature"]].corr('pearson')

| | Temperature | Apparent_temperature |
|---|---|---|
| Temperature | 1.0 | 0.992251 |
| Apparent_temperature | 0.992251 | 1.0 |

As `Apparent_temperature` is almost perfectly correlated with `Temperature` (Pearson coefficient of 0.992), it is redundant, so we discard it.


In [12]:
df.drop(['Apparent_temperature'], axis=1, inplace=True)
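
The same redundancy check can be applied to every pair of numerical columns at once. A minimal sketch, assuming a cut-off of 0.95 on the absolute Pearson coefficient (the threshold, the helper name `highly_correlated_pairs`, and the toy values are assumptions chosen for illustration):

```python
from itertools import combinations

import pandas as pd


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return (column_a, column_b, r) for each pair with |Pearson r| above the threshold."""
    corr = frame.corr("pearson")
    return [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) > threshold
    ]


# Made-up values: the first two columns move together, the third does not.
toy = pd.DataFrame({
    "Temperature": [1.0, 2.0, 3.0, 4.0],
    "Apparent_temperature": [-3.1, -2.0, -1.2, 0.1],
    "Humidity": [0.9, 0.5, 0.8, 0.6],
})
for a, b, r in highly_correlated_pairs(toy):
    print(f"{a} / {b}: r = {r:.3f}")
```

Which column of a flagged pair to keep remains a judgment call; the notebook keeps the measured `Temperature`.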

## C. Refactor the column `Weather`

In [13]:
df["Weather"].value_counts(dropna=False)

Weather
rain     36840
snow      5184
clear     1795
NaN          5
Name: count, dtype: int64

In [14]:
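# Assign the result back: calling replace(..., inplace=True) on a selected column
# is a chained assignment, which recent pandas versions warn about.
df["Weather"] = df["Weather"].replace({"snow": "no_rain", "clear": "no_rain"})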

In [15]:
df["Weather"].value_counts(dropna=False)

Weather
rain       36840
no_rain     6979
NaN            5
Name: count, dtype: int64

In [16]:
df["Weather"] = df["Weather"].map({'rain': 0, 'no_rain': 1})

In [17]:
df["Weather"].value_counts(dropna=False)

Weather
0.0    36840
1.0     6979
NaN        5
Name: count, dtype: int64
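
Because of the 5 missing values, the encoded column is stored as floats (`0.0`, `1.0`, `NaN`). A minimal sketch of one alternative, which merges the two steps into a single mapping and uses the nullable `Int64` dtype so that the labels stay integers. The toy series and the name `encoding` are assumptions, and the notebook itself keeps the float column:

```python
import pandas as pd

# Toy series with the three raw labels and one missing value.
weather = pd.Series(["rain", "snow", "clear", None, "rain"], name="Weather")

# "snow" and "clear" both map to 1 (no rain); "rain" maps to 0.
encoding = {"rain": 0, "snow": 1, "clear": 1}

# map() leaves unmapped or missing entries as NaN; Int64 stores them as <NA>.
encoded = weather.map(encoding).astype("Int64")
print(encoded.value_counts(dropna=False))
```

Downstream code that expects a plain NumPy float array would need the float version, so this is a trade-off rather than a strict improvement.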

# 3. Statistical analysis of the numerical variables

In [18]:
statistical_analysis_of_numerical_variables(df)

Numerical variables of "df":
----------------------------

Temperature, Humidity, Wind_speed, Wind_bearing, Visibility, Pressure, Weather

Number of NaN's per numerical variables:
----------------------------------------



Temperature     0
Humidity        0
Wind_speed      0
Wind_bearing    0
Visibility      0
Pressure        0
Weather         5
dtype: int64


Statistics of numerical variables:
----------------------------------



| | Temperature | Humidity | Wind_speed | Wind_bearing | Visibility | Pressure | Weather |
|---|---|---|---|---|---|---|---|
| count | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 43819.0 |
| mean | 11.789543 | 0.732492 | 10.972127 | 189.951556 | 9.914277 | 1001.865363 | 0.159269 |
| std | 9.527718 | 0.191495 | 7.024639 | 107.132753 | 3.793477 | 121.552295 | 0.365931 |
| min | -16.666667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4.855556 | 0.61 | 5.957 | 118.0 | 8.1949 | 1011.39 | 0.0 |
| 50% | 11.777778 | 0.78 | 10.143 | 185.0 | 9.982 | 1016.21 | 0.0 |
| 75% | 18.75 | 0.89 | 14.3129 | 290.0 | 11.27 | 1021.0 | 0.0 |
| max | 39.905556 | 1.0 | 63.8526 | 359.0 | 16.1 | 1046.38 | 1.0 |



Correlation matrix of numerical variables:
------------------------------------------



| | Temperature | Humidity | Wind_speed | Wind_bearing | Visibility | Pressure | Weather |
|---|---|---|---|---|---|---|---|
| Temperature | 1.0 | -0.626261 | -0.000865 | 0.011999 | 0.348283 | -0.036823 | -0.382796 |
| Humidity | -0.626261 | 1.0 | -0.225751 | 0.013195 | -0.32146 | 0.004298 | 0.134289 |
| Wind_speed | -0.000865 | -0.225751 | 1.0 | 0.121207 | 0.122641 | -0.038202 | -0.106498 |
| Wind_bearing | 0.011999 | 0.013195 | 0.121207 | 1.0 | 0.051812 | -0.00657 | -0.051289 |
| Visibility | 0.348283 | -0.32146 | 0.122641 | 0.051812 | 1.0 | 0.014031 | -0.240793 |
| Pressure | -0.036823 | 0.004298 | -0.038202 | -0.00657 | 0.014031 | 1.0 | -0.016011 |
| Weather | -0.382796 | 0.134289 | -0.106498 | -0.051289 | -0.240793 | -0.016011 | 1.0 |
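
To read the `Weather` row of the correlation matrix at a glance, its coefficients can be ranked by absolute value. A small sketch that copies the values from the output above into a Series (the names `weather_corr` and `ranked` are illustrative assumptions):

```python
import pandas as pd

# Pearson coefficients between each variable and Weather, copied from the matrix.
weather_corr = pd.Series({
    "Temperature": -0.382796,
    "Humidity": 0.134289,
    "Wind_speed": -0.106498,
    "Wind_bearing": -0.051289,
    "Visibility": -0.240793,
    "Pressure": -0.016011,
})

# Order by strength of the linear association, keeping the original signs.
order = weather_corr.abs().sort_values(ascending=False).index
ranked = weather_corr.reindex(order)
print(ranked)
```

On the notebook's `df`, the same Series could be obtained from the `Weather` column of the correlation matrix with the `Weather` entry itself removed. A Pearson coefficient only captures linear association, so a low rank here does not by itself rule a variable out.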


# 4. Statistical analysis of `Weather`

In [19]:
df['Weather'].value_counts(dropna=False) 

Weather
0.0    36840
1.0     6979
NaN        5
Name: count, dtype: int64
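
The counts are easier to compare as shares: roughly 84% of the rows are labelled rain (`0`), so the two classes are clearly imbalanced. A minimal sketch that recomputes the shares from the counts printed above (the index labels and the name `shares` are illustrative assumptions):

```python
import pandas as pd

# Counts copied from the value_counts output above.
counts = pd.Series({"rain (0)": 36840, "no_rain (1)": 6979, "missing": 5}, name="count")

# Share of each class among all 43824 rows.
shares = counts / counts.sum()
print(shares.round(4))
```

On the notebook's `df`, `df["Weather"].value_counts(normalize=True, dropna=False)` would give the same shares directly.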

# 5. Save the preprocessed dataset, and remove the previous one

In [20]:
df.to_pickle(Path('datasets')/'weather_dataset_raw_development_columns.pkl')

In [21]:
os.remove(Path('datasets')/'weather_dataset_raw_development_timestamp.pkl')
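
Deleting the previous pickle is irreversible, so it can be worth confirming that the new file reads back correctly first. A minimal sketch under that assumption, demonstrated in a temporary directory with a made-up frame (the helper name `save_then_remove` and the file names `old.pkl` and `new.pkl` are hypothetical):

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_then_remove(frame: pd.DataFrame, new_path: Path, old_path: Path) -> None:
    """Pickle `frame` to `new_path`, verify it reads back, then delete `old_path`."""
    frame.to_pickle(new_path)
    restored = pd.read_pickle(new_path)
    if not restored.equals(frame):
        raise ValueError(f"{new_path} does not round-trip; keeping {old_path}")
    # missing_ok=True (Python 3.8+) makes re-running the cell harmless.
    Path(old_path).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    old_path = Path(tmp) / "old.pkl"
    new_path = Path(tmp) / "new.pkl"
    toy = pd.DataFrame({"Temperature": [1.16, 1.67], "Weather": [0.0, None]})
    toy.to_pickle(old_path)
    save_then_remove(toy, new_path, old_path)
    new_exists, old_exists = new_path.exists(), old_path.exists()

print(new_exists, old_exists)
```

Unlike `os.remove`, `unlink(missing_ok=True)` does not raise if the old file is already gone, which matters when the notebook is run a second time.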