## Weather Data Preparation

In this notebook, we inspect and clean the hourly weather dataset provided to us.

In [1]:
import pandas as pd

In [2]:
weather_df = pd.read_csv('../00_data/weather_hourly_la.csv')

In [3]:
# check for null values
weather_df.isna().sum()

date_time    92
max_temp     92
min_temp     92
precip       90
dtype: int64

In [4]:
# drop rows where 'date_time' is null
weather_df = weather_df[weather_df['date_time'].notna()]

In [5]:
# check whether there are still null values in other columns
weather_df.isna().sum()

date_time    0
max_temp     0
min_temp     0
precip       0
dtype: int64

In [6]:
weather_df.describe()

statistic,max_temp,min_temp,precip
count,43756.0,43756.0,43756.0
mean,17.928581,17.88525,0.019814
std,4.198326,4.20856,0.139364
min,2.8,2.8,0.0
25%,15.0,15.0,0.0
50%,17.8,17.8,0.0
75%,20.6,20.6,0.0
max,39.4,39.4,1.0


Next, we convert the ```date_time``` column to a datetime type, which is more convenient to work with. Afterwards, we inspect the time range covered by the weather data.

In [7]:
weather_df['date_time'] = pd.to_datetime(weather_df['date_time'])

In [8]:
datetime_format = '%d.%m.%Y %H:%M:%S'
print(f"Earliest observation: {format(weather_df['date_time'].min(), datetime_format)}")
print(f"Latest observation: {format(weather_df['date_time'].max(), datetime_format)}")

Earliest observation: 01.01.2015 09:00:00
Latest observation: 02.01.2020 08:00:00
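
As a minimal sketch of the conversion and range check above, the following uses a small made-up sample. The timestamp layout `%Y-%m-%d %H:%M:%S` and the sample values are assumptions for illustration, since the raw CSV contents are not shown in this notebook.

```python
import pandas as pd

# Hypothetical sample; the real CSV's timestamp layout may differ.
sample = pd.DataFrame(
    {"date_time": ["2015-01-01 09:00:00", "2020-01-02 08:00:00", None]}
)

# An explicit format avoids ambiguous parsing; errors="coerce" turns
# anything unparseable into NaT instead of raising an exception.
sample["date_time"] = pd.to_datetime(
    sample["date_time"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# min() and max() skip NaT values by default.
earliest = sample["date_time"].min()
latest = sample["date_time"].max()

datetime_format = "%d.%m.%Y %H:%M:%S"
print(f"Earliest observation: {earliest.strftime(datetime_format)}")
print(f"Latest observation: {latest.strftime(datetime_format)}")
```

Passing an explicit format is optional, but it makes the parsing independent of what pandas would otherwise infer from the first values.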


In [9]:
# drop all entries that have a 'date_time' earlier than 01.01.2019 or later than 31.12.2019
weather_df = weather_df[
    (weather_df["date_time"] >= "2019-01-01 00:00:00")
    & (weather_df["date_time"] <= "2019-12-31 23:59:59")
]
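
An alternative way to write the same year filter is a half-open interval with an exclusive upper bound, which avoids spelling out the last second of the year. The sketch below runs on a few made-up timestamps, not on the real dataset, and the names `toy`, `start`, and `end` are illustrative.

```python
import pandas as pd

# Toy timestamps around the year boundaries (illustrative only).
toy = pd.DataFrame(
    {
        "date_time": pd.to_datetime(
            [
                "2018-12-31 23:00:00",
                "2019-01-01 00:00:00",
                "2019-12-31 23:00:00",
                "2020-01-01 00:00:00",
            ]
        )
    }
)

start = pd.Timestamp("2019-01-01")
end = pd.Timestamp("2020-01-01")  # exclusive upper bound

# Keep rows with start <= date_time < end, i.e. everything inside 2019.
in_2019 = toy[(toy["date_time"] >= start) & (toy["date_time"] < end)]
print(in_2019)
```

For hourly data both variants select the same rows; the half-open form also stays correct if the timestamps ever carry sub-second precision.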

In [10]:
# check whether there are duplicate entries
weather_df[weather_df.duplicated('date_time')].sort_index().head(4)

index,date_time,max_temp,min_temp,precip
35162,2019-01-06 02:00:00,12.8,12.2,1.0
35165,2019-01-06 05:00:00,10.0,10.0,0.0
35177,2019-01-05 17:00:00,12.8,12.2,0.0
35180,2019-01-05 20:00:00,14.4,14.4,0.0


In [11]:
# example of a duplicate entry
weather_df[weather_df['date_time'] == '2019-01-06 02:00:00']

index,date_time,max_temp,min_temp,precip
35161,2019-01-06 02:00:00,12.8,12.8,0.0
35162,2019-01-06 02:00:00,12.8,12.2,1.0


There are some duplicate timestamps in the data. We resolve them by taking the mean of the temperature columns and the maximum of the precipitation column. We do not use the mean for precipitation because it would produce a value of 0.5, whereas all other values are 0 or 1, and we want to preserve this boolean interpretation of the data. We choose the maximum rather than the minimum because precipitation is rare in Los Angeles: the output of ```describe``` above shows that at least 75% of entries report no precipitation, so when duplicates disagree we keep the reported precipitation rather than discard it.

In [12]:
weather_df = weather_df.groupby('date_time').agg({'max_temp': 'mean', 'min_temp': 'mean', 'precip': 'max'})
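
To illustrate the effect of this aggregation, here is a small sketch that applies it to the duplicate pair for `2019-01-06 02:00:00` shown earlier. The frame `dupes` is rebuilt by hand for the example and is not read from the dataset.

```python
import pandas as pd

# The two conflicting rows for the same timestamp, copied from the output above.
dupes = pd.DataFrame(
    {
        "date_time": pd.to_datetime(["2019-01-06 02:00:00"] * 2),
        "max_temp": [12.8, 12.8],
        "min_temp": [12.8, 12.2],
        "precip": [0.0, 1.0],
    }
)

# Mean for temperatures, max for the 0/1 precipitation flag.
merged = dupes.groupby("date_time").agg(
    {"max_temp": "mean", "min_temp": "mean", "precip": "max"}
)
print(merged)
```

The two rows collapse into one, with the temperatures averaged and the precipitation flag kept at 1. Note that after `groupby`, ```date_time``` becomes the index of the resulting frame; calling `reset_index()` would turn it back into a regular column if needed.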

In [13]:
pd.to_pickle(weather_df, "../00_data/weather.pkl")