# Cleaning of Weather Data 

Import data and libraries

In [50]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# import data set 
df = pd.read_csv('data/weather_hourly_sf.csv')

Select the time frame that matches the bike sharing data set

In [51]:
# sort by date_time 
df.sort_values(by='date_time', ascending = True, inplace = True)
# convert date_time to datetime format and shift by -8 hours to adjust for the time difference
df['date_time'] = pd.to_datetime(df['date_time']) + timedelta(hours=-8)
# keep entries from 2019 only (the weather data itself starts at 2019-01-31)
df = df[(df['date_time'] >= '2019-01-01 00:00:00') & (df['date_time'] <= '2019-12-31 23:00:00')]
# reset index
df = df.reset_index(drop=True)
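The shift and the date filter interact: a row can move into or out of 2019 once 8 hours are subtracted. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the weather file):

```python
import pandas as pd

# Hypothetical toy data standing in for the weather file
toy = pd.DataFrame({
    "date_time": ["2019-01-01 03:00:00", "2019-01-01 09:00:00", "2020-01-01 07:00:00"],
    "max_temp": [10, 12, 11],
})

# Convert to datetime and shift by -8 hours, as in the cell above
toy["date_time"] = pd.to_datetime(toy["date_time"]) - pd.Timedelta(hours=8)

# Keep only timestamps that fall inside 2019 after the shift (both bounds inclusive)
in_2019 = toy["date_time"].between("2019-01-01 00:00:00", "2019-12-31 23:00:00")
toy_2019 = toy[in_2019].reset_index(drop=True)
print(toy_2019)
```

The first row lands in 2018 after the shift and is dropped, while the last row lands on 2019-12-31 23:00 and is kept.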

Check for NAs and duplicates 

In [54]:
df.duplicated().sum()

79

In [8]:
df.isnull().values.sum()

0

Since there are no NAs left after slicing the dataframe, only the duplicates need to be handled.

Let's take a closer look at the duplicates, especially duplicates in the date_time column.



In [56]:
df['date_time'].duplicated().sum()

371
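The two counts measure different things: `df.duplicated()` flags rows that repeat in every column, while `df['date_time'].duplicated()` flags repeated timestamps regardless of the other values. A small sketch on hypothetical data (the `toy` frame is made up for illustration) shows the difference and how `keep=False` lists every row of a duplicate group for inspection:

```python
import pandas as pd

# Hypothetical toy data: one fully duplicated row, one timestamp repeated with a new value
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2019-02-01 00:00", "2019-02-01 01:00", "2019-02-01 01:00",
        "2019-02-01 02:00", "2019-02-01 02:00",
    ]),
    "max_temp": [10.0, 11.0, 11.0, 12.0, 12.5],
})

full_dupes = toy.duplicated().sum()               # rows identical in every column
time_dupes = toy["date_time"].duplicated().sum()  # repeated timestamps only

# keep=False marks every member of a duplicate group, not just the later repeats
shared = toy[toy["date_time"].duplicated(keep=False)]
print(full_dupes, time_dupes)
print(shared)
```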

There are 79 duplicated rows across all columns of the dataframe and 371 duplicated entries in the date_time column.
We do not want to drop rows hastily here. Instead, we count how often the temperature rises from one row to the next across the whole data set.


In [73]:
# check whether max_temp increases from one row to the next
max_temp_change = df['max_temp'].diff().gt(0)
max_temp_change.value_counts()

False    5301
True     2735
Name: max_temp, dtype: int64

In [74]:
# check whether min_temp increases from one row to the next
min_temp_change = df['min_temp'].diff().gt(0)
min_temp_change.value_counts()

False    5296
True     2740
Name: min_temp, dtype: int64

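The temperature rises from one row to the next in roughly one third of all rows (about 2,700 of 8,036), so the odds of no increase are roughly 2:1. Note that `.gt(0)` counts only increases, so the `False` group contains both falling and unchanged temperatures.
Duplicated rows in the dataframe can therefore plausibly result from duplicated entries in the date_time column co-occurring with a temperature that stays steady from one hour to the next.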
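The counts above come from `.diff().gt(0)`, which flags only increases. To separate rising, falling, and unchanged values, the sign of the difference can be counted instead. A minimal sketch on a made-up series (`temps` is hypothetical; on the real data it would be `df['max_temp']` or `df['min_temp']`):

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings
temps = pd.Series([10.0, 10.0, 11.0, 9.0, 9.0, 12.0], name="max_temp")

# Difference to the previous row; the first entry is NaN because it has no predecessor
step = temps.diff()

# Map the sign of each step to a label; the leading NaN stays NaN
direction = np.sign(step).map({1.0: "up", -1.0: "down", 0.0: "same"})

# value_counts ignores NaN by default
counts = direction.value_counts()
print(counts)
```

Only the "same" share speaks directly to how often the temperature stays constant between consecutive rows.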



Next we determine the expected number of entries (the dataframe starts at 2019-01-31) and check that the actual number of entries does not considerably exceed it.

In [39]:
# expected entries: (365 - 30) days from 2019-01-31 through 2019-12-31, 24 hourly entries per day
exp_length = (365 - 30) * 24
print(exp_length)

8040


We check the actual number of entries.

In [40]:
act_length = len(df)
print(act_length)

8036
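The expected count can also be derived from an explicit hourly range, which additionally makes it possible to list the timestamps that are missing. A sketch, assuming the data should cover every hour from 2019-01-31 00:00 through 2019-12-31 23:00 (the toy timestamps are made up):

```python
import pandas as pd

hour = pd.Timedelta(hours=1)

# Every hour the data is assumed to cover
full_range = pd.date_range("2019-01-31 00:00:00", "2019-12-31 23:00:00", freq=hour)
print(len(full_range))  # 8040, matching (365 - 30) * 24

# Hypothetical toy example: six expected hours, five observed timestamps with one repeat
toy_expected = pd.date_range("2019-01-31 00:00:00", periods=6, freq=hour)
toy_actual = pd.Series(pd.to_datetime([
    "2019-01-31 00:00", "2019-01-31 01:00", "2019-01-31 01:00",
    "2019-01-31 03:00", "2019-01-31 05:00",
]))

# Expected timestamps that never appear in the observed data
missing = toy_expected.difference(pd.DatetimeIndex(toy_actual))
print(missing)
```

On the real data, `toy_actual` would be `df['date_time']` and `toy_expected` would be `full_range`.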


The actual and expected numbers match except for 4 entries. Furthermore, temperature values tend not to change over the course of an hour for non-duplicated entries as well, so we cannot declare the duplicates redundant. We will keep the entries but adjust the date_time column. Dropping duplicates and interpolating new values is not necessary.

The following loop finds duplicated date_time entries and shifts these timestamps forward by one hour at a time until no date_time duplicates remain.

In [None]:
# make sure date_time is in datetime format
df['date_time'] = pd.to_datetime(df['date_time'])
# shift duplicated timestamps forward by one hour, repeating while any duplicate exists
while df['date_time'].duplicated().any():
    df['date_time'] = np.where(~df['date_time'].duplicated(), df['date_time'], df['date_time'] + timedelta(hours=1))
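To see how the shifting behaves, here is a small sketch of the same idea on a made-up series. The helper name `shift_duplicate_hours` and the toy timestamps are hypothetical. A shifted timestamp can collide with the next existing one, which is why the step is repeated until all values are unique:

```python
import pandas as pd

def shift_duplicate_hours(timestamps: pd.Series) -> pd.Series:
    """Move repeated timestamps forward by one hour until all are unique."""
    shifted = timestamps.copy()
    dup = shifted.duplicated()  # flags second and later occurrences
    while dup.any():
        shifted.loc[dup] = shifted.loc[dup] + pd.Timedelta(hours=1)
        dup = shifted.duplicated()
    return shifted

# Hypothetical input: 01:00 appears twice, and 02:00 is already taken
toy = pd.Series(pd.to_datetime([
    "2019-02-01 00:00", "2019-02-01 01:00", "2019-02-01 01:00", "2019-02-01 02:00",
]))
fixed = shift_duplicate_hours(toy)
print(fixed)
```

In the first pass the repeated 01:00 moves to 02:00 and collides with the existing 02:00. In the second pass the later 02:00 moves to 03:00, leaving four distinct hours.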

Finally, we save the data.

In [None]:
#save
df.to_csv('data/weather_hourly_sf_prepared.csv', index=False)