# Weather Data
We build our weather dataset from the Kaggle dataset at https://www.kaggle.com/datasets/selfishgene/historical-hourly-weather-data


## Cleaning Kaggle (OpenWeather) Datasets


The data was originally procured from https://openweathermap.org/

Overview of the units for the city of Chicago:
https://openweathermap.org/city/4887398

### Temperature

The raw temperature is given in kelvins; we convert it to degrees Celsius during cleaning.

In [None]:
import pandas as pd

# Load data
df_temp = pd.read_csv("temperature.csv", parse_dates=['datetime'])
df_temp.head()

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,309.1,,,
1,2012-10-01 13:00:00,284.63,282.08,289.48,281.8,291.87,291.53,293.41,296.6,285.12,...,285.63,288.22,285.83,287.17,307.59,305.47,310.58,304.4,304.4,303.5
2,2012-10-01 14:00:00,284.629041,282.083252,289.474993,281.797217,291.868186,291.533501,293.403141,296.608509,285.154558,...,285.663208,288.247676,285.83465,287.186092,307.59,304.31,310.495769,304.4,304.4,303.5
3,2012-10-01 15:00:00,284.626998,282.091866,289.460618,281.789833,291.862844,291.543355,293.392177,296.631487,285.233952,...,285.756824,288.32694,285.84779,287.231672,307.391513,304.281841,310.411538,304.4,304.4,303.5
4,2012-10-01 16:00:00,284.624955,282.100481,289.446243,281.782449,291.857503,291.553209,293.381213,296.654466,285.313345,...,285.85044,288.406203,285.860929,287.277251,307.1452,304.238015,310.327308,304.4,304.4,303.5


The procedure is more or less the same for all Kaggle datasets:
* Remove all cities except for Chicago
* Remove data that is not from the year 2013
* Add a unique index
* Rename the column "Chicago" to the respective type of weather data
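
The steps above can be sketched as one reusable helper. This is a minimal illustration rather than the notebook's own code: the function name `extract_city_year` and the tiny inline CSV are assumptions made for the example.

```python
import io

import pandas as pd


def extract_city_year(df, city, year, new_name):
    """Keep one city's column for one calendar year and rename it.

    Hypothetical helper: it mirrors the four cleaning steps listed above.
    """
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year + 1, month=1, day=1)
    # Half-open interval [start, end) keeps exactly one calendar year
    mask = (df["datetime"] >= start) & (df["datetime"] < end)
    out = df.loc[mask, ["datetime", city]].reset_index(drop=True)
    return out.rename(columns={city: new_name})


# Made-up sample in the same wide layout as the Kaggle files
csv_text = """datetime,Chicago,Boston
2012-12-31 23:00:00,270.5,271.0
2013-01-01 00:00:00,272.5,273.0
2013-01-01 01:00:00,273.25,274.0
2014-01-01 00:00:00,260.0,261.0
"""
raw = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])
sample = extract_city_year(raw, "Chicago", 2013, "temperature")
print(sample)
```

Returning a new dataframe instead of using `inplace=True` makes the helper safe to call once per weather variable.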

In [None]:
import datetime

# Drop superfluous cities
df_temp.drop(df_temp.columns.difference(['datetime','Chicago']), axis=1, inplace=True)

# Drop superfluous weather data
d1 = datetime.datetime(2013, 1, 1)
d2 = datetime.datetime(2014, 1, 1)
df_temp = df_temp[df_temp.datetime >= d1]
df_temp = df_temp[df_temp.datetime < d2]

# Add unique index to weather
df_temp.reset_index(drop=True, inplace=True)

# Rename column Chicago
df_temp.rename(columns={"Chicago": "temperature"}, inplace = True)

# Convert from Kelvin to Celsius
df_temp['temperature'] = df_temp['temperature'] - 273.15
df_temp

Unnamed: 0,datetime,temperature
0,2013-01-01 00:00:00,-0.19
1,2013-01-01 01:00:00,0.28
2,2013-01-01 02:00:00,0.33
3,2013-01-01 03:00:00,0.12
4,2013-01-01 04:00:00,0.04
...,...,...
8755,2013-12-31 19:00:00,-11.27
8756,2013-12-31 20:00:00,-10.60
8757,2013-12-31 21:00:00,-10.98
8758,2013-12-31 22:00:00,-11.24


In [None]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     8760 non-null   datetime64[ns]
 1   temperature  8758 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB


We can see that there are two temperature values missing.

In [None]:
df_temp.describe()

Unnamed: 0,datetime,temperature
count,8760,8758.0
mean,2013-07-02 11:30:00,9.942069
min,2013-01-01 00:00:00,-17.92
25%,2013-04-02 05:45:00,0.8
50%,2013-07-02 11:30:00,10.364167
75%,2013-10-01 17:15:00,19.423
max,2013-12-31 23:00:00,35.33
std,,11.210484


The temperatures seem to be consistent with what can be found about the weather (extremes) in Chicago: https://www.weather.gov/lot/Chicago_Temperature_Records

In [None]:
# Show Null values
df_temp[df_temp['temperature'].isna()]

Unnamed: 0,datetime,temperature
1663,2013-03-11 07:00:00,
1664,2013-03-11 08:00:00,


The two missing values are adjacent, so we look at the temperatures immediately before and after the gap:

In [None]:
# Print values before and after missing values
print(df_temp[df_temp['datetime']=='2013-03-11 06:00:00'])
print(df_temp[df_temp['datetime']=='2013-03-11 09:00:00'])

                datetime  temperature
1662 2013-03-11 06:00:00         7.54
                datetime  temperature
1665 2013-03-11 09:00:00          5.7


In [None]:
# Fill missing values with average of values that are before and after missing ones
fill = (7.54 + 5.7) / 2
df_temp.at[1663,'temperature']= fill
df_temp.at[1664,'temperature']= fill
print(df_temp[df_temp['datetime']=='2013-03-11 07:00:00'])
print(df_temp[df_temp['datetime']=='2013-03-11 08:00:00'])

                datetime  temperature
1663 2013-03-11 07:00:00         6.62
                datetime  temperature
1664 2013-03-11 08:00:00         6.62
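
The cell above assigns the same midpoint (6.62) to both missing hours. As an alternative sketch, not what the notebook does, the gap could be filled by linear interpolation so that the two values step evenly from 7.54 down to 5.70. The helper name `linear_fill` is an assumption for illustration.

```python
def linear_fill(before, after, n_missing):
    """Return n_missing evenly spaced values strictly between two readings."""
    step = (after - before) / (n_missing + 1)
    return [before + step * (i + 1) for i in range(n_missing)]


# Known readings at 06:00 and 09:00 on 2013-03-11, in degrees Celsius
filled = linear_fill(7.54, 5.70, n_missing=2)
print([round(v, 2) for v in filled])
```

With two missing hours this gives roughly 6.93 and 6.31. In pandas, `Series.interpolate()` performs the same linear fill by default.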


In [None]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     8760 non-null   datetime64[ns]
 1   temperature  8760 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB


In [None]:
df_temp.duplicated(keep='first')

0       False
1       False
2       False
3       False
4       False
        ...  
8755    False
8756    False
8757    False
8758    False
8759    False
Length: 8760, dtype: bool

There are no Null values or duplicates left.
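
The truncated boolean output of `duplicated()` only shows ten rows, so a single count is easier to verify. The following sketch uses a small made-up dataframe; the variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2013-01-01 00:00", "2013-01-01 01:00", "2013-01-01 02:00",
    ]),
    "temperature": [-0.19, 0.28, 0.33],
})

# Count fully duplicated rows, missing values, and repeated timestamps
n_duplicates = int(df.duplicated(keep="first").sum())
n_missing = int(df["temperature"].isna().sum())
n_repeated_timestamps = int(df["datetime"].duplicated().sum())
print(n_duplicates, n_missing, n_repeated_timestamps)
```

Checking the `datetime` column on its own also catches rows that share a timestamp but differ in value, which a full-row comparison would miss.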

In [None]:
df_temp.head()

Unnamed: 0,datetime,temperature
0,2013-01-01 00:00:00,-0.19
1,2013-01-01 01:00:00,0.28
2,2013-01-01 02:00:00,0.33
3,2013-01-01 03:00:00,0.12
4,2013-01-01 04:00:00,0.04


### Weather Description

We continue with the weather description.

In [None]:
# Load data
df_desc = pd.read_csv("weather_description.csv", parse_dates=['datetime'])
df_desc.head()

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,haze,,,
1,2012-10-01 13:00:00,mist,scattered clouds,light rain,sky is clear,mist,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,overcast clouds,sky is clear,sky is clear,sky is clear,haze,sky is clear,sky is clear,sky is clear
2,2012-10-01 14:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,sky is clear,sky is clear,broken clouds,overcast clouds,sky is clear,overcast clouds
3,2012-10-01 15:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
4,2012-10-01 16:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds


We proceed analogously to before:

In [None]:
# Drop superfluous cities
df_desc.drop(df_desc.columns.difference(['datetime','Chicago']), axis=1, inplace=True)

# Drop superfluous weather data
d1 = datetime.datetime(2013, 1, 1)
d2 = datetime.datetime(2014, 1, 1)
df_desc = df_desc[df_desc.datetime >= d1]
df_desc = df_desc[df_desc.datetime < d2]

# Add unique index to weather
df_desc.reset_index(drop=True, inplace=True)


# Rename column Chicago
df_desc.rename(columns={"Chicago": "weather_description"}, inplace = True)
df_desc

Unnamed: 0,datetime,weather_description
0,2013-01-01 00:00:00,overcast clouds
1,2013-01-01 01:00:00,broken clouds
2,2013-01-01 02:00:00,overcast clouds
3,2013-01-01 03:00:00,overcast clouds
4,2013-01-01 04:00:00,broken clouds
...,...,...
8755,2013-12-31 19:00:00,broken clouds
8756,2013-12-31 20:00:00,broken clouds
8757,2013-12-31 21:00:00,light snow
8758,2013-12-31 22:00:00,snow


We can see that the weather is described with words, but we need numerical values for our tasks. Let us have a look at the frequency of different weather descriptions.

In [None]:
df_desc['weather_description'].value_counts()

weather_description
sky is clear                    2139
broken clouds                   1908
overcast clouds                 1329
scattered clouds                 902
few clouds                       639
light rain                       625
mist                             572
moderate rain                    170
heavy snow                       142
haze                              72
light snow                        59
heavy intensity rain              57
fog                               33
thunderstorm with light rain      22
light intensity drizzle           19
proximity thunderstorm            18
very heavy rain                   13
thunderstorm                      13
thunderstorm with rain             9
snow                               7
thunderstorm with heavy rain       5
drizzle                            4
light rain and snow                2
heavy intensity drizzle            1
Name: count, dtype: int64

In [None]:
df_desc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   datetime             8760 non-null   datetime64[ns]
 1   weather_description  8760 non-null   object        
dtypes: datetime64[ns](1), object(1)
memory usage: 137.0+ KB


In [None]:
df_desc.duplicated(keep='first')

0       False
1       False
2       False
3       False
4       False
        ...  
8755    False
8756    False
8757    False
8758    False
8759    False
Length: 8760, dtype: bool

There are no Null values or duplicates.

In [None]:
print(df_desc['weather_description'].unique())

['overcast clouds' 'broken clouds' 'sky is clear' 'scattered clouds'
 'few clouds' 'heavy snow' 'haze' 'mist' 'light rain'
 'light rain and snow' 'moderate rain' 'heavy intensity rain'
 'light intensity drizzle' 'fog' 'snow' 'light snow'
 'thunderstorm with rain' 'thunderstorm with light rain' 'drizzle'
 'proximity thunderstorm' 'thunderstorm with heavy rain' 'very heavy rain'
 'thunderstorm' 'heavy intensity drizzle']


There are 24 distinct weather descriptions. To quantify them, we reduce them to a binary indicator: precipitation or no precipitation.

In [None]:
# Distinguish between precipitation and no precipitation (we consider proximity thunderstorm as precipitation)

df_desc.loc[(df_desc['weather_description'] == 'scattered clouds')
            |(df_desc['weather_description'] == 'sky is clear')
            |(df_desc['weather_description'] == 'overcast clouds')
            |(df_desc['weather_description'] == 'broken clouds')
            |(df_desc['weather_description'] == 'few clouds')
            |(df_desc['weather_description'] == 'fog')
            |(df_desc['weather_description'] == 'mist')
            |(df_desc['weather_description'] == 'haze'), 'weather_description'] = 0
df_desc.loc[(df_desc['weather_description'] != 0), 'weather_description'] = 1
df_desc

Unnamed: 0,datetime,weather_description
0,2013-01-01 00:00:00,0
1,2013-01-01 01:00:00,0
2,2013-01-01 02:00:00,0
3,2013-01-01 03:00:00,0
4,2013-01-01 04:00:00,0
...,...,...
8755,2013-12-31 19:00:00,0
8756,2013-12-31 20:00:00,0
8757,2013-12-31 21:00:00,1
8758,2013-12-31 22:00:00,1


1 means precipitation, 0 means no precipitation.
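
The same mapping can be sketched more compactly with `Series.isin`. This is an illustrative alternative on made-up sample values, not the notebook's code; the name `DRY` is an assumption. Casting to an integer dtype also avoids the `object` dtype that the column keeps after the `.loc` assignments.

```python
import pandas as pd

# Descriptions treated as "no precipitation" in the cell above
DRY = [
    "scattered clouds", "sky is clear", "overcast clouds", "broken clouds",
    "few clouds", "fog", "mist", "haze",
]

desc = pd.Series([
    "sky is clear", "light rain", "fog", "proximity thunderstorm", "heavy snow",
])

# Everything that is not in the dry list counts as precipitation (1)
precip = (~desc.isin(DRY)).astype("int64")
print(precip.tolist())
```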

In [None]:
# Rename column weather_description to precip(itation)
df_desc.rename(columns={"weather_description": "precip"}, inplace = True)
df_desc.nunique()

datetime    8760
precip         2
dtype: int64

In [None]:
# Count number of hours with precipitation
df_desc['precip'].value_counts()

precip
0    7594
1    1166
Name: count, dtype: int64

In [None]:
df_desc.head()

Unnamed: 0,datetime,precip
0,2013-01-01 00:00:00,0
1,2013-01-01 01:00:00,0
2,2013-01-01 02:00:00,0
3,2013-01-01 03:00:00,0
4,2013-01-01 04:00:00,0


### Humidity

The humidity is given in percent.

In [None]:
# Load data
df_humid = pd.read_csv("humidity.csv", parse_dates=['datetime'])
df_humid.head()

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,25.0,,,
1,2012-10-01 13:00:00,76.0,81.0,88.0,81.0,88.0,82.0,22.0,23.0,50.0,...,71.0,58.0,93.0,68.0,50.0,63.0,22.0,51.0,51.0,50.0
2,2012-10-01 14:00:00,76.0,80.0,87.0,80.0,88.0,81.0,21.0,23.0,49.0,...,70.0,57.0,91.0,68.0,51.0,62.0,22.0,51.0,51.0,50.0
3,2012-10-01 15:00:00,76.0,80.0,86.0,80.0,88.0,81.0,21.0,23.0,49.0,...,70.0,57.0,87.0,68.0,51.0,62.0,22.0,51.0,51.0,50.0
4,2012-10-01 16:00:00,77.0,80.0,85.0,79.0,88.0,81.0,21.0,23.0,49.0,...,69.0,57.0,84.0,68.0,52.0,62.0,22.0,51.0,51.0,50.0


We proceed analogously to before:

In [None]:
# Drop superfluous cities
df_humid.drop(df_humid.columns.difference(['datetime','Chicago']), axis=1, inplace=True)

# Drop superfluous weather data
d1 = datetime.datetime(2013, 1, 1)
d2 = datetime.datetime(2014, 1, 1)
df_humid = df_humid[df_humid.datetime >= d1]
df_humid = df_humid[df_humid.datetime < d2]

# Add unique index to weather
df_humid.reset_index(drop=True, inplace=True)


# Rename column Chicago
df_humid.rename(columns={"Chicago": "humidity"}, inplace = True)
df_humid

Unnamed: 0,datetime,humidity
0,2013-01-01 00:00:00,
1,2013-01-01 01:00:00,64.0
2,2013-01-01 02:00:00,69.0
3,2013-01-01 03:00:00,
4,2013-01-01 04:00:00,68.0
...,...,...
8755,2013-12-31 19:00:00,89.0
8756,2013-12-31 20:00:00,89.0
8757,2013-12-31 21:00:00,89.0
8758,2013-12-31 22:00:00,89.0


In [None]:
df_humid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   datetime  8760 non-null   datetime64[ns]
 1   humidity  8132 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB


In [None]:
df_humid.duplicated(keep='first')

0       False
1       False
2       False
3       False
4       False
        ...  
8755    False
8756    False
8757    False
8758    False
8759    False
Length: 8760, dtype: bool

There are no duplicates, but 628 rows have missing humidity values (only 8132 of 8760 are non-null).

In [None]:
df_humid[df_humid['humidity'].isna()]

Unnamed: 0,datetime,humidity
0,2013-01-01 00:00:00,
3,2013-01-01 03:00:00,
6,2013-01-01 06:00:00,
8,2013-01-01 08:00:00,
9,2013-01-01 09:00:00,
...,...,...
3725,2013-06-05 05:00:00,
3727,2013-06-05 07:00:00,
3729,2013-06-05 09:00:00,
3730,2013-06-05 10:00:00,


In [None]:
df_humid['humidity'].describe()

count    8132.000000
mean       75.193925
std        16.704538
min        17.000000
25%        64.000000
50%        78.000000
75%        89.000000
max       100.000000
Name: humidity, dtype: float64

We carry out imputation using the overall mean humidity to replace the missing values.

In [None]:
df_humid = df_humid.fillna(df_humid['humidity'].mean())
df_humid.head()

Unnamed: 0,datetime,humidity
0,2013-01-01 00:00:00,75.193925
1,2013-01-01 01:00:00,64.0
2,2013-01-01 02:00:00,69.0
3,2013-01-01 03:00:00,75.193925
4,2013-01-01 04:00:00,68.0
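
Mean imputation keeps the mean unchanged but shrinks the spread, which is why the humidity standard deviation drops from 16.70 before imputation to 16.09 in the merged data. A minimal sketch with made-up readings shows the effect; the numbers are assumptions for illustration only.

```python
import statistics

# Five observed humidity readings and three missing ones (made-up values)
observed = [64.0, 69.0, 68.0, 89.0, 50.0]
n_missing = 3

mean = statistics.fmean(observed)
imputed = observed + [mean] * n_missing

std_before = statistics.stdev(observed)
std_after = statistics.stdev(imputed)
print(mean, round(std_before, 2), round(std_after, 2))
```

The imputed values sit exactly at the mean, so they add no deviation while increasing the sample size.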


### Wind Speed

The wind speed is given in m/s (meters per second).

In [None]:
# Load data
df_wind = pd.read_csv("wind_speed.csv", parse_dates=['datetime'])
df_wind.head(5)

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,8.0,,,
1,2012-10-01 13:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,...,4.0,7.0,4.0,3.0,1.0,0.0,8.0,2.0,2.0,2.0
2,2012-10-01 14:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,...,4.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0
3,2012-10-01 15:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,...,3.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0
4,2012-10-01 16:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,...,3.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0


We proceed analogously to before:

In [None]:
# Drop superfluous cities
df_wind.drop(df_wind.columns.difference(['datetime','Chicago']), axis= 1, inplace=True)

# Drop superfluous weather data
d1 = datetime.datetime(2013, 1, 1)
d2 = datetime.datetime(2014, 1, 1)
df_wind = df_wind[df_wind.datetime >= d1]
df_wind = df_wind[df_wind.datetime < d2]

# Add unique index to weather
df_wind.reset_index(drop=True, inplace=True)

# Rename column Chicago
df_wind.rename(columns={"Chicago": "wind_speed"}, inplace = True)
df_wind

Unnamed: 0,datetime,wind_speed
0,2013-01-01 00:00:00,4.0
1,2013-01-01 01:00:00,3.0
2,2013-01-01 02:00:00,6.0
3,2013-01-01 03:00:00,7.0
4,2013-01-01 04:00:00,7.0
...,...,...
8755,2013-12-31 19:00:00,0.0
8756,2013-12-31 20:00:00,0.0
8757,2013-12-31 21:00:00,3.0
8758,2013-12-31 22:00:00,1.0


In [None]:
df_wind.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    8760 non-null   datetime64[ns]
 1   wind_speed  8760 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB


In [None]:
df_wind.duplicated(keep='first')

0       False
1       False
2       False
3       False
4       False
        ...  
8755    False
8756    False
8757    False
8758    False
8759    False
Length: 8760, dtype: bool

There are no Null values or duplicates.

In [None]:
df_wind['wind_speed'].describe()

count    8760.000000
mean        3.066667
std         2.238570
min         0.000000
25%         1.000000
50%         3.000000
75%         4.000000
max        18.000000
Name: wind_speed, dtype: float64

In [None]:
df_wind.head()

Unnamed: 0,datetime,wind_speed
0,2013-01-01 00:00:00,4.0
1,2013-01-01 01:00:00,3.0
2,2013-01-01 02:00:00,6.0
3,2013-01-01 03:00:00,7.0
4,2013-01-01 04:00:00,7.0


### Pressure

The pressure is given in hPa (hectopascal).

In [None]:
# Load data
df_press = pd.read_csv("pressure.csv", parse_dates=['datetime'])
df_press.head()

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,1011.0,,,
1,2012-10-01 13:00:00,,1024.0,1009.0,1027.0,1013.0,1013.0,1018.0,1013.0,1024.0,...,1014.0,1012.0,1001.0,1014.0,984.0,1012.0,1010.0,1013.0,1013.0,990.0
2,2012-10-01 14:00:00,,1024.0,1009.0,1027.0,1013.0,1013.0,1018.0,1013.0,1024.0,...,1014.0,1012.0,986.0,1014.0,984.0,1012.0,1010.0,1013.0,1013.0,990.0
3,2012-10-01 15:00:00,,1024.0,1009.0,1028.0,1013.0,1013.0,1018.0,1013.0,1024.0,...,1014.0,1012.0,945.0,1014.0,984.0,1012.0,1010.0,1013.0,1013.0,990.0
4,2012-10-01 16:00:00,,1024.0,1009.0,1028.0,1013.0,1013.0,1018.0,1013.0,1024.0,...,1014.0,1012.0,904.0,1014.0,984.0,1012.0,1010.0,1013.0,1013.0,990.0


We proceed analogously to before.

In [None]:
# Drop superfluous cities
df_press.drop(df_press.columns.difference(['datetime','Chicago']), axis=1, inplace=True)

# Drop superfluous weather data
d1 = datetime.datetime(2013, 1, 1)
d2 = datetime.datetime(2014, 1, 1)
df_press = df_press[df_press.datetime >= d1]
df_press = df_press[df_press.datetime < d2]

# Add unique index to weather
df_press.reset_index(drop=True, inplace=True)

# Rename column Chicago
df_press.rename(columns={"Chicago": "pressure"}, inplace = True)
df_press

Unnamed: 0,datetime,pressure
0,2013-01-01 00:00:00,1024.0
1,2013-01-01 01:00:00,1022.0
2,2013-01-01 02:00:00,1022.0
3,2013-01-01 03:00:00,1021.0
4,2013-01-01 04:00:00,1021.0
...,...,...
8755,2013-12-31 19:00:00,1026.0
8756,2013-12-31 20:00:00,1026.0
8757,2013-12-31 21:00:00,1026.0
8758,2013-12-31 22:00:00,1026.0


In [None]:
df_press.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   datetime  8760 non-null   datetime64[ns]
 1   pressure  8203 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB


In [None]:
df_press.duplicated(keep='first')

0       False
1       False
2       False
3       False
4       False
        ...  
8755    False
8756    False
8757    False
8758    False
8759    False
Length: 8760, dtype: bool

Here, 557 pressure values are missing (8203 of 8760 are non-null), but there are no duplicates.

In [None]:
df_press[df_press['pressure'].isna()]

Unnamed: 0,datetime,pressure
6,2013-01-01 06:00:00,
71,2013-01-03 23:00:00,
107,2013-01-05 11:00:00,
108,2013-01-05 12:00:00,
143,2013-01-06 23:00:00,
...,...,...
3722,2013-06-05 02:00:00,
3725,2013-06-05 05:00:00,
3727,2013-06-05 07:00:00,
3729,2013-06-05 09:00:00,


In [None]:
df_press['pressure'].describe()

count    8203.000000
mean     1016.571864
std         8.615725
min       979.000000
25%      1011.000000
50%      1016.000000
75%      1022.000000
max      1047.000000
Name: pressure, dtype: float64

We carry out imputation using the overall mean pressure to replace the missing values.

In [None]:
df_press = df_press.fillna(df_press['pressure'].mean())
# Show first 7 values because #6 was filled with mean
df_press.head(7)

Unnamed: 0,datetime,pressure
0,2013-01-01 00:00:00,1024.0
1,2013-01-01 01:00:00,1022.0
2,2013-01-01 02:00:00,1022.0
3,2013-01-01 03:00:00,1021.0
4,2013-01-01 04:00:00,1021.0
5,2013-01-01 05:00:00,1020.0
6,2013-01-01 06:00:00,1016.571864


## Merging Kaggle Data

Now, we merge all the aforementioned dataframes using the datetime column.

In [None]:
# Merge on datetime
df_hourly = df_desc.merge(df_temp, on='datetime')
df_hourly = df_hourly.merge(df_humid, on='datetime')
df_hourly = df_hourly.merge(df_wind, on='datetime')
df_hourly = df_hourly.merge(df_press, on='datetime')
df_hourly.head()

Unnamed: 0,datetime,precip,temperature,humidity,wind_speed,pressure
0,2013-01-01 00:00:00,0,-0.19,75.193925,4.0,1024.0
1,2013-01-01 01:00:00,0,0.28,64.0,3.0,1022.0
2,2013-01-01 02:00:00,0,0.33,69.0,6.0,1022.0
3,2013-01-01 03:00:00,0,0.12,75.193925,7.0,1021.0
4,2013-01-01 04:00:00,0,0.04,68.0,7.0,1021.0
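
Since every dataframe should contain each timestamp exactly once, the merge can be made to fail loudly if that assumption is violated. The sketch below uses small made-up frames and chains the merges with `functools.reduce`; the `validate="one_to_one"` check is an addition for illustration, not part of the notebook.

```python
from functools import reduce

import pandas as pd

times = pd.to_datetime([
    "2013-01-01 00:00", "2013-01-01 01:00", "2013-01-01 02:00",
])
frames = [
    pd.DataFrame({"datetime": times, "precip": [0, 0, 1]}),
    pd.DataFrame({"datetime": times, "temperature": [-0.19, 0.28, 0.33]}),
    pd.DataFrame({"datetime": times, "wind_speed": [4.0, 3.0, 6.0]}),
]

# validate raises MergeError if a timestamp appears twice on either side
merged = reduce(
    lambda left, right: left.merge(right, on="datetime", validate="one_to_one"),
    frames,
)
print(merged)
```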


In [None]:
df_hourly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     8760 non-null   datetime64[ns]
 1   precip       8760 non-null   object        
 2   temperature  8760 non-null   float64       
 3   humidity     8760 non-null   float64       
 4   wind_speed   8760 non-null   float64       
 5   pressure     8760 non-null   float64       
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 410.8+ KB


In [None]:
df_hourly.describe()

Unnamed: 0,datetime,temperature,humidity,wind_speed,pressure
count,8760,8760.0,8760.0,8760.0,8760.0
mean,2013-07-02 11:30:00,9.941311,75.193925,3.066667,1016.571864
min,2013-01-01 00:00:00,-17.92,17.0,0.0,979.0
25%,2013-04-02 05:45:00,0.8,65.0,1.0,1011.0
50%,2013-07-02 11:30:00,10.361,76.0,3.0,1016.571864
75%,2013-10-01 17:15:00,19.423,89.0,4.0,1021.0
max,2013-12-31 23:00:00,35.33,100.0,18.0,1047.0
std,,11.209317,16.094563,2.23857,8.337281


There appears to be no problem with that dataframe, so we save it as a pickle.

In [None]:
# Save as pickle (the context manager closes the file handle)
import pickle
with open("weather_hourly.pkl", "wb") as f:
    pickle.dump(df_hourly, f)
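
A quick round trip confirms that a saved file can be read back unchanged. This is a minimal sketch on a made-up dataframe written to a temporary directory; the file name `weather_sample.pkl` is an assumption for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2013-01-01 00:00", "2013-01-01 01:00"]),
    "precip": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weather_sample.pkl")
    # Write and read with context managers so the handles are closed
    with open(path, "wb") as f:
        pickle.dump(df, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

round_trip_ok = restored.equals(df)
print(round_trip_ok)
```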

## Creating Additional Weather Datasets

Since we require different temporal intervals for our task, we now create additional weather datasets using the merged hourly data. First, we create a 4-hourly dataset.

### 4-Hourly

In [None]:
# Create duplicate
df_duplicated = df_hourly.copy()

# Set the 'datetime' column as the index
df_duplicated.set_index('datetime', inplace=True)

# Resample the data into 4-hour intervals and average every column
df_4hourly = df_duplicated.resample('4H').mean()

# Reset the index to make the 'datetime' column a regular column again
df_4hourly.reset_index(inplace=True)
df_4hourly

Unnamed: 0,datetime,precip,temperature,humidity,wind_speed,pressure
0,2013-01-01 00:00:00,0.0,0.13500,70.846963,5.00,1022.250000
1,2013-01-01 04:00:00,0.0,-0.03750,69.798481,5.25,1018.892966
2,2013-01-01 08:00:00,0.0,0.74000,70.846963,5.75,1017.500000
3,2013-01-01 12:00:00,0.0,1.37000,59.250000,5.50,1017.000000
4,2013-01-01 16:00:00,0.0,0.96750,61.500000,5.25,1017.000000
...,...,...,...,...,...,...
2185,2013-12-31 04:00:00,0.5,-12.73000,89.000000,0.50,1022.000000
2186,2013-12-31 08:00:00,0.0,-14.52000,86.500000,1.25,1024.000000
2187,2013-12-31 12:00:00,0.0,-15.36125,89.000000,1.75,1023.750000
2188,2013-12-31 16:00:00,0.25,-11.89225,90.250000,1.00,1028.250000


We want precipitation to be either 0 or 1, so we round the 4-hour mean, with 0.5 rounding up to 1. An interval with precipitation during at least 2 of its 4 hours therefore counts as an interval with precipitation.
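
The explicit half-up rule matters because Python's built-in `round` rounds ties to the nearest even integer, so `round(0.5)` is 0. The sketch below compares both on the possible 4-hour means; the function name `round_half_up` is an assumption for illustration.

```python
import math


def round_half_up(x):
    """Round a share in [0, 1] to 0 or 1, sending 0.5 up to 1."""
    return math.ceil(x) if x >= 0.5 else math.floor(x)


# Possible means of four hourly 0/1 flags
shares = [0.0, 0.25, 0.5, 0.75, 1.0]
custom = [round_half_up(x) for x in shares]
builtin = [round(x) for x in shares]
print(custom)
print(builtin)
```

The two lists differ only at 0.5, which is exactly the case of 2 wet hours out of 4.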

In [None]:
import math

# Round the values in the column (0.5 rounds up to 1)
df_4hourly['precip'] = df_4hourly['precip'].apply(lambda x: math.ceil(x) if x >= 0.5 else math.floor(x))
df_4hourly

Unnamed: 0,datetime,precip,temperature,humidity,wind_speed,pressure
0,2013-01-01 00:00:00,0,0.13500,70.846963,5.00,1022.250000
1,2013-01-01 04:00:00,0,-0.03750,69.798481,5.25,1018.892966
2,2013-01-01 08:00:00,0,0.74000,70.846963,5.75,1017.500000
3,2013-01-01 12:00:00,0,1.37000,59.250000,5.50,1017.000000
4,2013-01-01 16:00:00,0,0.96750,61.500000,5.25,1017.000000
...,...,...,...,...,...,...
2185,2013-12-31 04:00:00,1,-12.73000,89.000000,0.50,1022.000000
2186,2013-12-31 08:00:00,0,-14.52000,86.500000,1.25,1024.000000
2187,2013-12-31 12:00:00,0,-15.36125,89.000000,1.75,1023.750000
2188,2013-12-31 16:00:00,0,-11.89225,90.250000,1.00,1028.250000


In [None]:
df_4hourly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2190 entries, 0 to 2189
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     2190 non-null   datetime64[ns]
 1   precip       2190 non-null   int64         
 2   temperature  2190 non-null   float64       
 3   humidity     2190 non-null   float64       
 4   wind_speed   2190 non-null   float64       
 5   pressure     2190 non-null   float64       
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 102.8 KB


In [None]:
df_4hourly.describe()

Unnamed: 0,datetime,precip,temperature,humidity,wind_speed,pressure
count,2190,2190.0,2190.0,2190.0,2190.0,2190.0
mean,2013-07-02 10:00:00,0.148402,9.941311,75.193925,3.066667,1016.571864
min,2013-01-01 00:00:00,0.0,-17.5275,27.5,0.0,989.5
25%,2013-04-02 05:00:00,0.0,0.81875,65.5,1.5,1011.5
50%,2013-07-02 10:00:00,0.0,10.228625,76.0,2.75,1016.571864
75%,2013-10-01 15:00:00,0.0,19.470297,87.75,4.0,1021.026949
max,2013-12-31 20:00:00,1.0,34.31625,100.0,12.5,1045.25
std,,0.355579,11.136285,14.718522,1.994813,7.913293


In [None]:
df_4hourly['precip'].value_counts()

precip
0    1865
1     325
Name: count, dtype: int64

In [None]:
# Save as pickle
with open("weather_4hourly.pkl", "wb") as f:
    pickle.dump(df_4hourly, f)

### Daily

Now, we create a daily dataset.

In [None]:
# Create duplicate
df_duplicated = df_hourly.copy()

# Set the 'datetime' column as the index
df_duplicated.set_index('datetime', inplace=True)

# Resample the data into daily intervals and average every column
df_daily = df_duplicated.resample('1D').mean()

# Reset the index to make the 'datetime' column a regular column again
df_daily.reset_index(inplace=True)
df_daily

Unnamed: 0,datetime,precip,temperature,humidity,wind_speed,pressure
0,2013-01-01,0.0,0.324167,67.306562,5.083333,1018.190494
1,2013-01-02,0.041667,-6.267500,55.250000,3.416667,1020.708333
2,2013-01-03,0.0,-6.540833,57.341414,4.083333,1018.440494
3,2013-01-04,0.291667,-3.306667,67.125000,4.625000,1020.583333
4,2013-01-05,0.0,-4.232448,67.549747,5.708333,1024.214322
...,...,...,...,...,...,...
360,2013-12-27,0.291667,1.209167,89.000000,1.375000,1020.041667
361,2013-12-28,0.708333,4.933125,89.000000,1.958333,1015.750000
362,2013-12-29,0.208333,1.697500,88.833333,2.375000,1009.833333
363,2013-12-30,0.166667,-11.748333,87.083333,2.000000,1021.125000


Once more, we round to 0 or 1, with 0.5 rounding up to 1. Thus, a day with at least 12 hours of precipitation is considered to be a day with precipitation.
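
The daily rule can be checked on synthetic data: one day with exactly 12 wet hours and one with 11. This is a minimal sketch with made-up flags; it uses a vectorized comparison in place of the `apply` call, which gives the same result for shares between 0 and 1.

```python
import pandas as pd

# 48 hourly timestamps covering 2013-01-01 and 2013-01-02
hours = pd.to_datetime(
    [f"2013-01-0{1 + h // 24} {h % 24:02d}:00" for h in range(48)]
)
# Day 1: 12 wet hours, day 2: 11 wet hours (made-up values)
flags = [1] * 12 + [0] * 12 + [1] * 11 + [0] * 13
hourly = pd.Series(flags, index=hours, name="precip")

# Share of wet hours per day, then threshold at one half
daily_share = hourly.resample("D").mean()
daily_flag = (daily_share >= 0.5).astype("int64")
print(daily_flag.tolist())
```

Only the first day reaches the 12-hour threshold, so only it is flagged as a day with precipitation.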

In [None]:
# Round the values in the column
df_daily['precip'] = df_daily['precip'].apply(lambda x: math.ceil(x) if x >= 0.5 else math.floor(x))
df_daily

Unnamed: 0,datetime,precip,temperature,humidity,wind_speed,pressure
0,2013-01-01,0,0.324167,67.306562,5.083333,1018.190494
1,2013-01-02,0,-6.267500,55.250000,3.416667,1020.708333
2,2013-01-03,0,-6.540833,57.341414,4.083333,1018.440494
3,2013-01-04,0,-3.306667,67.125000,4.625000,1020.583333
4,2013-01-05,0,-4.232448,67.549747,5.708333,1024.214322
...,...,...,...,...,...,...
360,2013-12-27,0,1.209167,89.000000,1.375000,1020.041667
361,2013-12-28,1,4.933125,89.000000,1.958333,1015.750000
362,2013-12-29,0,1.697500,88.833333,2.375000,1009.833333
363,2013-12-30,0,-11.748333,87.083333,2.000000,1021.125000


In [None]:
df_daily.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     365 non-null    datetime64[ns]
 1   precip       365 non-null    int64         
 2   temperature  365 non-null    float64       
 3   humidity     365 non-null    float64       
 4   wind_speed   365 non-null    float64       
 5   pressure     365 non-null    float64       
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 17.2 KB


In [None]:
df_daily.describe()

Unnamed: 0,datetime,precip,temperature,humidity,wind_speed,pressure
count,365,365.0,365.0,365.0,365.0,365.0
mean,2013-07-02 00:00:00,0.060274,9.941311,75.193925,3.066667,1016.571864
min,2013-01-01 00:00:00,0.0,-14.945,43.466414,0.291667,994.208333
25%,2013-04-02 00:00:00,0.0,0.9405,67.871203,1.875,1012.041667
50%,2013-07-02 00:00:00,0.0,9.773514,75.056562,2.625,1016.5
75%,2013-10-01 00:00:00,0.0,19.381469,84.474494,4.0,1020.875
max,2013-12-31 00:00:00,1.0,30.16625,100.0,10.25,1040.75
std,,0.23832,10.72018,11.210224,1.642333,6.979278


In [None]:
df_daily['precip'].value_counts()

precip
0    343
1     22
Name: count, dtype: int64

In [None]:
# Save as pickle
with open("weather_daily.pkl", "wb") as f:
    pickle.dump(df_daily, f)