# Data-Processing
In this notebook the data will be cleaned.

We are using two different datasets for this project: the Bikesharing Dataset and the Weather Dataset. We analysed the datasets regarding null values, added additional data, addressed missing data, and dropped outliers and duplicates.

This notebook is structured as follows:

**Bikesharing Dataset**
* Importing the Dataset and the Libraries needed for processing
* Additional date related Features
* Additional Datasets
* Missing Data in the extension set
* Dropping outliers and redundant features

**Weather Dataset**
* Extending the given weather related dataframe and filling missing values
* Additional date related Features

## Bikesharing Dataset

### Importing the Dataset and the libraries needed for processing

In [2]:
import pandas as pd
import numpy as np
import math
from datetime import datetime, timedelta, date, time
from haversine import haversine
import pickle

import warnings
warnings.filterwarnings("ignore")

la_2018_set = pd.read_csv("data/la_2018.csv")

In [3]:
la_2018_set.head()

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,bike_id,user_type,start_station_name,end_station_name
0,2018-01-01 00:04:00,2018-01-01 00:25:00,3063,3018,5889,Walk-up,Pershing Square,Grand & Olympic
1,2018-01-01 00:05:00,2018-01-01 00:25:00,3063,3018,6311,Walk-up,Pershing Square,Grand & Olympic
2,2018-01-01 00:06:00,2018-01-01 00:25:00,3063,3018,5753,Walk-up,Pershing Square,Grand & Olympic
3,2018-01-01 00:13:00,2018-01-01 00:35:00,3018,3031,6220,Monthly Pass,Grand & Olympic,7th & Spring
4,2018-01-01 00:14:00,2018-01-01 00:59:00,4204,4216,12436,Monthly Pass,Washington & Abbot Kinney,17th St / SMC E Line Station


In [4]:
# Rename some columns for the sake of convenience
la_2018_set.rename(columns = {'start_station_id':'start_station', 'end_station_id':'end_station'}, inplace = True)

In [5]:
la_2018_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311894 entries, 0 to 311893
Data columns (total 8 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   start_time          311894 non-null  object
 1   end_time            311894 non-null  object
 2   start_station       311894 non-null  int64 
 3   end_station         311894 non-null  int64 
 4   bike_id             311894 non-null  int64 
 5   user_type           311894 non-null  object
 6   start_station_name  311894 non-null  object
 7   end_station_name    311894 non-null  object
dtypes: int64(3), object(5)
memory usage: 19.0+ MB


### Additional date related Features
As we want to gain deep insight into the Bikesharing business at specific timeframes, we want to generate additional date related features. We can do this by casting the start and end times to datetime objects and adding date related columns for the hour, weekday, day, and month.

In [6]:
# Cast start_time & end_time to datetime objects
la_2018_set["start_time"] = pd.to_datetime(la_2018_set["start_time"])
la_2018_set["end_time"] = pd.to_datetime(la_2018_set["end_time"])

# Add some date related columns to the dataset
la_2018_set["hour"] = la_2018_set["start_time"].dt.hour
la_2018_set["week_day"] = la_2018_set["start_time"].dt.weekday   # Monday = 0, ..., Sunday = 6
la_2018_set["day"] = la_2018_set["start_time"].dt.strftime("%d/%m/%Y")
la_2018_set["month"] = la_2018_set["start_time"].dt.month

### Additional Datasets
Additional data was available on the [Los Angeles Bikeshare Metro website](https://bikeshare.metro.net/about/data/). We will use this data to improve our ability to visualize the data and predict future demand. Since the data was split into quarters, we must first combine it into a single dataframe.

In [7]:
# Import all four datasets for every quarter
la_q1_set = pd.read_csv("data/metro-bike-share-trips-2018-q1.csv")
la_q2_set = pd.read_csv("data/metro-bike-share-trips-2018-q2.csv")
la_q3_set = pd.read_csv("data/metro-bike-share-trips-2018-q3.csv")
la_q4_set = pd.read_csv("data/metro-bike-share-trips-2018-q4.csv")

# Combine all four datasets into one dataframe. It serves as an extension to our main dataframe
la_2018_extension = pd.concat([la_q1_set,la_q2_set,la_q3_set,la_q4_set])
la_2018_extension = la_2018_extension.reset_index()
# Note: drop() returns a new dataframe; the result is not assigned here, so the "index" column is kept
la_2018_extension.drop("index", axis=1)
la_2018_extension.head()

Unnamed: 0,index,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type
0,0,65406367,21,2018-01-01 00:04:00,2018-01-01 00:25:00,3063,34.049198,-118.252831,3018,34.043732,-118.260139,5889,0,One Way,Walk-up,
1,1,65406366,20,2018-01-01 00:05:00,2018-01-01 00:25:00,3063,34.049198,-118.252831,3018,34.043732,-118.260139,6311,0,One Way,Walk-up,
2,2,65406365,19,2018-01-01 00:06:00,2018-01-01 00:25:00,3063,34.049198,-118.252831,3018,34.043732,-118.260139,5753,0,One Way,Walk-up,
3,3,65406364,22,2018-01-01 00:13:00,2018-01-01 00:35:00,3018,34.043732,-118.260139,3031,34.044701,-118.252441,6220,30,One Way,Monthly Pass,
4,4,65406362,45,2018-01-01 00:14:00,2018-01-01 00:59:00,4204,33.988419,-118.45163,4216,34.023392,-118.479637,12436,30,One Way,Monthly Pass,
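
The `index` column in the output above is the old per-quarter index that `reset_index()` turned into a regular column. As a minimal sketch on toy data (the two small frames are hypothetical stand-ins for the quarterly files), `pd.concat` can build a fresh running index directly via `ignore_index=True`, so no helper column appears:

```python
import pandas as pd

# Two toy "quarters" with hypothetical values, standing in for the quarterly CSV files
q1 = pd.DataFrame({"trip_id": [1, 2], "duration": [21, 20]})
q2 = pd.DataFrame({"trip_id": [3], "duration": [45]})

# ignore_index=True discards the per-file indexes and numbers the rows 0..n-1
combined = pd.concat([q1, q2], ignore_index=True)

print(combined)
```

With this variant neither `reset_index()` nor a subsequent `drop("index", axis=1)` is needed.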


### Missing Data in the extension set
As we can see, there are some columns with missing data (null values). The affected columns include: start_lat, start_lon, end_lat, end_lon and bike_type.

In [8]:
la_2018_extension.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311894 entries, 0 to 311893
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   index                311894 non-null  int64  
 1   trip_id              311894 non-null  int64  
 2   duration             311894 non-null  int64  
 3   start_time           311894 non-null  object 
 4   end_time             311894 non-null  object 
 5   start_station        311894 non-null  int64  
 6   start_lat            311146 non-null  float64
 7   start_lon            311146 non-null  float64
 8   end_station          311894 non-null  int64  
 9   end_lat              306771 non-null  float64
 10  end_lon              306771 non-null  float64
 11  bike_id              311894 non-null  int64  
 12  plan_duration        311894 non-null  int64  
 13  trip_route_category  311894 non-null  object 
 14  passholder_type      311894 non-null  object 
 15  bike_type        

In [9]:
# We can add the duration & trip_id feature directly to our main dataset la_2018_set, since they don't contain null values.
la_2018_set["trip_id"] = la_2018_extension["trip_id"]
la_2018_set["duration"] = la_2018_extension["duration"]
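
Copying columns this way relies on both dataframes listing the same trips in the same order, because pandas aligns the assignment on the index. A minimal sketch of a sanity check follows; the helper `rows_aligned` and the toy frames are hypothetical, and the key columns `bike_id` and `start_station` are assumed to exist in both frames:

```python
import pandas as pd

def rows_aligned(main, extension, keys):
    """Return True if both frames have equal length and identical values in `keys`, row by row."""
    if len(main) != len(extension):
        return False
    return all(
        main[k].reset_index(drop=True).equals(extension[k].reset_index(drop=True))
        for k in keys
    )

# Toy data (hypothetical): the same two trips, once in matching and once in reversed order
main = pd.DataFrame({"bike_id": [5889, 6311], "start_station": [3063, 3063]})
ext_ok = pd.DataFrame({"bike_id": [5889, 6311], "start_station": [3063, 3063], "duration": [21, 20]})
ext_reversed = ext_ok.iloc[::-1]

print(rows_aligned(main, ext_ok, ["bike_id", "start_station"]))
print(rows_aligned(main, ext_reversed, ["bike_id", "start_station"]))
```

If such a check fails on real data, merging on a shared key would be the safer route than positional assignment.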

A look at the entries with missing coordinates reveals that they belong to a specific bike station (3000). According to the [Los Angeles Bikeshare Metro website](https://bikeshare.metro.net/about/data/), this station is a "Virtual Station" used by employees to check in or check out a bike remotely for a special event or in a situation in which a bike could not otherwise be checked in or out to a station. As we want to gain knowledge about popular routes and distances ridden, we need the start/end coordinates of a trip as a tuple. No coordinates are specified for entries that start or end at station 3000, so no distance can be computed for them directly. The same holds for trips whose start and end coordinates are identical, since haversine would only return a distance of 0. In both cases we estimate the distance from the trip duration, using the mean distance per minute of all trips with a valid distance.

In [10]:
# Combine longitude/latitude
la_2018_extension["start_coordinates"] = list(zip(la_2018_extension["start_lat"].round(4),la_2018_extension["start_lon"].round(4)))
la_2018_extension["end_coordinates"] = list(zip(la_2018_extension["end_lat"].round(4),la_2018_extension["end_lon"].round(4)))

# Add columns to our main dataframe
la_2018_set["start_coordinates"] = la_2018_extension["start_coordinates"]
la_2018_set["end_coordinates"] = la_2018_extension["end_coordinates"]

# Calculating the distances ridden (using Haversine if the coordinates are available) OTHERWISE -1 IS RETURNED!
# And adding the distance to our main dataframe
def calculateDistance(x):
    start, end = x["start_coordinates"], x["end_coordinates"]

    # Missing coordinates or identical start/end point: no distance can be computed
    if any(math.isnan(c) for c in start + end) or start == end:
        return -1
    return haversine(start, end)

la_2018_set["distance"] = la_2018_set.apply(calculateDistance, axis=1)

# Next, we want to get rid of the -1 distances. Since most entries now have a valid distance, we can calculate
# the mean distance per minute over these entries. Multiplied by the trip duration, it replaces the -1 values
valid = la_2018_set["distance"] > -1
mean_distance_per_minute = (la_2018_set.loc[valid, "distance"] / la_2018_set.loc[valid, "duration"]).mean()

la_2018_set["distance"] = np.where(
    la_2018_set["distance"] > 0,
    la_2018_set["distance"],
    mean_distance_per_minute * la_2018_set["duration"],
).round(2)

# Calculate the speed and add it to our main dataframe as "km/h"
la_2018_set["km/h"] = la_2018_set.apply(lambda x: np.round(x["distance"]/(x["duration"]/60),1), axis=1)
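
The distance returned by `haversine` is the great-circle distance between two latitude/longitude points. A minimal sketch of the formula in plain Python, assuming a mean Earth radius of 6371.0088 km; the helper name `haversine_km` is hypothetical and not part of the notebook. The two points are the rounded coordinates of the first trip shown above:

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumption of this sketch: mean Earth radius in kilometres

def haversine_km(start, end):
    """Great-circle distance in km between two (lat, lon) tuples given in degrees."""
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Rounded start/end coordinates of the first trip in the dataset
d = haversine_km((34.0492, -118.2528), (34.0437, -118.2601))
print(round(d, 2))
```

For this trip the sketch yields about 0.91 km. Identical start and end points give 0, which is why such trips are handled separately in the cell above.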


Another factor of interest is the type of bike ridden, as prices differ between standard and electric bikes. For some entries the missing bike_type can be inferred: if the same bike_id appears in another entry with a specified bike_type, we can adopt that type. Unfortunately, this is not the case for all entries, so we assume that the remaining missing values correspond to the more common standard bike.

In [11]:
# The unique IDs of bikes for which the type is specified
bike_ids_with_type = la_2018_extension[la_2018_extension["bike_type"].notna()]["bike_id"].unique()

# The unique IDs of bikes for which the type isn't specified
bike_ids_without_type = la_2018_extension[la_2018_extension["bike_type"].isna()]["bike_id"].unique()

# If both arrays were equal, we could infer every missing bike_type from its bike_id.
# Otherwise, setdiff1d lists the IDs for which no bike_type is known at all
print(np.array_equal(bike_ids_with_type, bike_ids_without_type))
np.setdiff1d(bike_ids_without_type, bike_ids_with_type, assume_unique=False)

False


array([ 5100,  5746,  5756,  5763,  5772,  5785,  5827,  5828,  5832,
        5850,  5857,  5862,  5864,  5873,  5880,  5891,  5897,  5933,
        5956,  5957,  5959,  5968,  5983,  6007,  6015,  6021,  6035,
        6040,  6041,  6056,  6057,  6063,  6065,  6074,  6078,  6080,
        6081,  6083,  6088,  6138,  6154,  6155,  6171,  6180,  6182,
        6212,  6215,  6217,  6224,  6237,  6246,  6251,  6254,  6264,
        6277,  6297,  6306,  6311,  6312,  6325,  6361,  6364,  6365,
        6383,  6386,  6396,  6397,  6439,  6440,  6442,  6464,  6477,
        6479,  6482,  6485,  6490,  6494,  6500,  6512,  6524,  6527,
        6534,  6535,  6545,  6548,  6562,  6565,  6570,  6576,  6585,
        6590,  6599,  6604,  6605,  6606,  6610,  6619,  6666,  6702,
        6705,  6711,  6727, 11200, 11319, 11981, 11985, 11988, 11995,
       12012, 12023, 12031, 12057, 12058, 12068, 12073, 12083, 12084,
       12087, 12093, 12096, 12105, 12113, 12118, 12122, 12127, 12132,
       12137, 12144,
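
Note that `np.array_equal` compares the two arrays element by element, so it is also `False` when both contain the same IDs in a different order. What matters for the inference is whether each ID without a type also occurs with a type. A minimal sketch with hypothetical toy IDs:

```python
import numpy as np

# Hypothetical toy IDs, not taken from the dataset
ids_with_type = np.array([5889, 6220, 12436])
ids_without_type = np.array([6220, 6311, 5889])

# True where an ID without a type also occurs with a type somewhere else
inferable = np.isin(ids_without_type, ids_with_type)

# IDs for which no type is known at all (sorted, unique)
not_inferable = np.setdiff1d(ids_without_type, ids_with_type)

# array_equal is order-sensitive: same set of IDs, different order
same_set_other_order = np.array_equal(np.array([1, 2]), np.array([2, 1]))

print(inferable, not_inferable, same_set_other_order)
```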

In [12]:
# Combine the Bike IDs and the Bike Type
bike_type_by_id = la_2018_extension[la_2018_extension["bike_type"].notna()][["bike_id","bike_type"]]

# Function for getting the Bike Types recorded for a given Bike ID
def getBikeType(bike_id):
    return bike_type_by_id[bike_type_by_id["bike_id"]==bike_id]["bike_type"].unique()

# Function for filling out the missing Bike Types. If a Bike Type is known for the Bike ID, that Bike Type is returned.
# If no Bike Type is known for the Bike ID, the "standard" Bike Type is returned
def fillMissingBikeTypes(x):

    if(pd.isna(x["bike_type"])):
        
        if(x["bike_id"] in bike_ids_with_type):
            
             return getBikeType(x["bike_id"])[0]
            
        else: return "standard"
        
    return x["bike_type"]

# Filling out the missing Bike Types in our extension dataframe
la_2018_extension["bike_type"] = la_2018_extension.apply(lambda x: fillMissingBikeTypes(x), axis=1)
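
Applying a Python function to every row can be slow on a dataframe of this size. As a hedged alternative sketch on toy data (the frame `ext` and the lookup `known` are hypothetical), the same fill can be expressed with a lookup Series and `map`: take the first known type per bike ID, fill missing values from it, and fall back to "standard":

```python
import pandas as pd

# Toy data (hypothetical): bike 1 has a known type in one row, bike 2 has none at all
ext = pd.DataFrame({
    "bike_id":   [1, 1, 2, 3],
    "bike_type": ["electric", None, None, "standard"],
})

# Lookup: first known bike_type per bike_id
known = (
    ext.dropna(subset=["bike_type"])
       .drop_duplicates(subset="bike_id")
       .set_index("bike_id")["bike_type"]
)

# Fill from the lookup where possible, otherwise assume "standard"
ext["bike_type"] = ext["bike_type"].fillna(ext["bike_id"].map(known)).fillna("standard")

print(ext)
```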


In [13]:
la_2018_extension.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311894 entries, 0 to 311893
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   index                311894 non-null  int64  
 1   trip_id              311894 non-null  int64  
 2   duration             311894 non-null  int64  
 3   start_time           311894 non-null  object 
 4   end_time             311894 non-null  object 
 5   start_station        311894 non-null  int64  
 6   start_lat            311146 non-null  float64
 7   start_lon            311146 non-null  float64
 8   end_station          311894 non-null  int64  
 9   end_lat              306771 non-null  float64
 10  end_lon              306771 non-null  float64
 11  bike_id              311894 non-null  int64  
 12  plan_duration        311894 non-null  int64  
 13  trip_route_category  311894 non-null  object 
 14  passholder_type      311894 non-null  object 
 15  bike_type        

Since the "bike_type" column does not contain null values anymore, we can now add it to the original dataset.

In [14]:
la_2018_set["bike_type"] = la_2018_extension["bike_type"]

### Dropping outliers and redundant features

In [15]:
la_2018_set.head()

Unnamed: 0,start_time,end_time,start_station,end_station,bike_id,user_type,start_station_name,end_station_name,hour,week_day,day,month,trip_id,duration,start_coordinates,end_coordinates,distance,km/h,bike_type
0,2018-01-01 00:04:00,2018-01-01 00:25:00,3063,3018,5889,Walk-up,Pershing Square,Grand & Olympic,0,0,01/01/2018,1,65406367,21,"(34.0492, -118.2528)","(34.0437, -118.2601)",0.91,2.6,standard
1,2018-01-01 00:05:00,2018-01-01 00:25:00,3063,3018,6311,Walk-up,Pershing Square,Grand & Olympic,0,0,01/01/2018,1,65406366,20,"(34.0492, -118.2528)","(34.0437, -118.2601)",0.91,2.7,standard
2,2018-01-01 00:06:00,2018-01-01 00:25:00,3063,3018,5753,Walk-up,Pershing Square,Grand & Olympic,0,0,01/01/2018,1,65406365,19,"(34.0492, -118.2528)","(34.0437, -118.2601)",0.91,2.9,standard
3,2018-01-01 00:13:00,2018-01-01 00:35:00,3018,3031,6220,Monthly Pass,Grand & Olympic,7th & Spring,0,0,01/01/2018,1,65406364,22,"(34.0437, -118.2601)","(34.0447, -118.2524)",0.72,2.0,standard
4,2018-01-01 00:14:00,2018-01-01 00:59:00,4204,4216,12436,Monthly Pass,Washington & Abbot Kinney,17th St / SMC E Line Station,0,0,01/01/2018,1,65406362,45,"(33.9884, -118.4516)","(34.0234, -118.4796)",4.67,6.2,standard


In [16]:
#Drop entries with start- & end station 3000 as they don't represent customer rides
la_2018_set.drop(la_2018_set[(la_2018_set["start_station"]==3000)&(la_2018_set["end_station"]==3000)].index, inplace=True)

#Drop rides with > 25 km/h 
la_2018_set.drop(la_2018_set[la_2018_set["km/h"]>25].index, inplace=True)

#Drop start_station_name & end_station_name as we can identify the station by its ID 
la_2018_set.drop(["start_station_name", "end_station_name"], axis=1, inplace=True)
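
The two row filters above can also be read as boolean masks. A minimal sketch on hypothetical toy rides, which also shows that a ride is only removed for the virtual station if both its start and its end station are 3000:

```python
import pandas as pd

# Toy rides (hypothetical values)
rides = pd.DataFrame({
    "start_station": [3063, 3000, 3018, 3000],
    "end_station":   [3018, 3000, 3031, 3018],
    "km/h":          [2.6, 5.0, 31.4, 4.0],
})

# Rides that start AND end at the virtual station 3000
virtual_round_trip = (rides["start_station"] == 3000) & (rides["end_station"] == 3000)

# Rides faster than 25 km/h are treated as outliers
too_fast = rides["km/h"] > 25

# Keep every ride that matches neither condition
cleaned = rides[~(virtual_round_trip | too_fast)]

print(cleaned)
```

In this toy example the last ride starts at station 3000 but ends elsewhere, so it is kept.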


In [17]:
la_2018_set.head()

Unnamed: 0,start_time,end_time,start_station,end_station,bike_id,user_type,hour,week_day,day,month,trip_id,duration,start_coordinates,end_coordinates,distance,km/h,bike_type
0,2018-01-01 00:04:00,2018-01-01 00:25:00,3063,3018,5889,Walk-up,0,0,01/01/2018,1,65406367,21,"(34.0492, -118.2528)","(34.0437, -118.2601)",0.91,2.6,standard
1,2018-01-01 00:05:00,2018-01-01 00:25:00,3063,3018,6311,Walk-up,0,0,01/01/2018,1,65406366,20,"(34.0492, -118.2528)","(34.0437, -118.2601)",0.91,2.7,standard
2,2018-01-01 00:06:00,2018-01-01 00:25:00,3063,3018,5753,Walk-up,0,0,01/01/2018,1,65406365,19,"(34.0492, -118.2528)","(34.0437, -118.2601)",0.91,2.9,standard
3,2018-01-01 00:13:00,2018-01-01 00:35:00,3018,3031,6220,Monthly Pass,0,0,01/01/2018,1,65406364,22,"(34.0437, -118.2601)","(34.0447, -118.2524)",0.72,2.0,standard
4,2018-01-01 00:14:00,2018-01-01 00:59:00,4204,4216,12436,Monthly Pass,0,0,01/01/2018,1,65406362,45,"(33.9884, -118.4516)","(34.0234, -118.4796)",4.67,6.2,standard


In [18]:
# Save dataset as pickle
la_2018_set.to_pickle("Data/la_2018_set.pickle")

## Weather Dataset
First, we import the weather-related data. Since we are only interested in the year 2018, we will exclude all data outside this period. For this purpose, the "date_time" column is cast to datetime objects.

In [19]:
# Import weather dataset and cast to datetime objects
weather_set = pd.read_csv("Data/weather_hourly_la.csv")
weather_set["date_time"] = pd.to_datetime(weather_set["date_time"])

# Only keep data from the year 2018
weather_set = weather_set[(weather_set['date_time']>=datetime(year=2018, month=1, day=1))&(weather_set['date_time']<datetime(year=2019, month=1, day=1))]
weather_set.head()

Unnamed: 0,date_time,max_temp,min_temp,precip
26280,2018-01-01 00:00:00,14.4,13.9,0.0
26281,2018-01-01 02:00:00,14.4,14.4,0.0
26282,2018-01-01 03:00:00,14.4,14.4,0.0
26283,2018-01-01 04:00:00,14.4,13.9,0.0
26284,2018-01-01 05:00:00,13.9,13.9,0.0


In [20]:
weather_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8729 entries, 26280 to 35062
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date_time  8729 non-null   datetime64[ns]
 1   max_temp   8729 non-null   float64       
 2   min_temp   8729 non-null   float64       
 3   precip     8729 non-null   float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 341.0 KB


### Extending the given weather related dataframe and filling missing values
A non-leap year such as 2018 has 8760 hours, so some entries are missing in the given dataset. Since inserting rows at specific positions of a dataframe is cumbersome, we instead extend the dataframe to a full hourly timeline and reassign the values to the right place.

In [21]:
# For the sake of convenience we will reindex the dataset 
weather_set.index=range(8729)

In [22]:
# Create a copy of the dataframe, which can then be expanded while the original dataframe remains untouched
weather_copy = weather_set.copy()

In [23]:
# Overwrite the timestamps with consecutive hourly values so that there are no gaps in the dataset.

t1 = timedelta(hours=1, minutes=0)

for i in range(1,8729):
    
    weather_set.loc[i, "date_time"]=weather_set.loc[i-1, "date_time"]+t1

Now that there are no more gaps in the dataset, we only need to extend it to a full year.

In [24]:
t1 = timedelta(hours=1, minutes=0)
for i in range(8729,8760):
    
    weather_set.loc[i]=None
    weather_set.loc[i, "date_time"]=weather_set.loc[i-1, "date_time"]+t1

In [25]:
weather_copy[weather_copy.duplicated()]

Unnamed: 0,date_time,max_temp,min_temp,precip
23,2018-01-01 14:00:00,10.0,10.0,0.0
137,2018-01-06 08:00:00,14.4,14.4,0.0
141,2018-01-06 12:00:00,14.4,13.9,0.0
144,2018-01-06 15:00:00,14.4,14.4,0.0
223,2018-01-09 22:00:00,13.9,13.3,1.0
...,...,...,...,...
8527,2018-12-24 05:00:00,15.0,15.0,0.0
8545,2018-12-24 00:00:00,15.6,15.6,0.0
8555,2018-12-24 09:00:00,14.4,14.4,0.0
8561,2018-12-24 15:00:00,12.8,12.8,0.0


As we can see, there are some duplicates in the dataset. Since duplicate rows carry no additional information, we can delete them.

In [26]:
# Drop duplicates
weather_copy.drop_duplicates(subset=["date_time"],keep="first",inplace=True)
weather_copy.index=range(8305)

In [27]:
# Sort dataframe after dropping duplicates
weather_copy.sort_values(by=["date_time"],inplace=True)
weather_copy.index=range(8305)

Since we overwrote the timestamps of the original dataframe with a gap-free hourly timeline, we now need to reassign the measured values to the matching entries. Entries that have no matching row in the deduplicated copy are filled with null values.

In [28]:
# If the timestamp of the current row in the copy matches the current row in the original weather dataframe, its values are reassigned to the original weather dataframe
# If not, the corresponding row in the original weather dataframe is filled with null values
indexCopy=0
indexOrig=0
while indexCopy<8305 and indexOrig<8760:
    
    if (weather_copy.loc[indexCopy, "date_time"]==weather_set.loc[indexOrig, "date_time"]):
        weather_set.loc[indexOrig, "max_temp"]=weather_copy.loc[indexCopy, "max_temp"]
        weather_set.loc[indexOrig, "min_temp"]=weather_copy.loc[indexCopy, "min_temp"]
        weather_set.loc[indexOrig, "precip"]=weather_copy.loc[indexCopy, "precip"]
        indexCopy=indexCopy+1
        indexOrig=indexOrig+1
        
    else:
        weather_set.loc[indexOrig, "max_temp"]=None
        weather_set.loc[indexOrig, "min_temp"]=None
        weather_set.loc[indexOrig, "precip"]=None
        indexOrig=indexOrig+1
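
The cells above build a gap-free hourly timeline and then move the measured values onto it. As an alternative sketch under stated assumptions (toy data; the names `raw`, `full_hours` and `grid` are hypothetical), the same result can be obtained by dropping duplicate timestamps and reindexing against a complete hourly range, which leaves null values wherever an hour is missing:

```python
import pandas as pd

# Toy data (hypothetical): 01:00 is missing and 02:00 occurs twice
raw = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2018-01-01 00:00", "2018-01-01 02:00", "2018-01-01 02:00", "2018-01-01 03:00",
    ]),
    "max_temp": [14.4, 14.4, 14.4, 13.9],
})

# Complete hourly timeline for the period of interest
full_hours = pd.date_range(start="2018-01-01 00:00", end="2018-01-01 04:00",
                           freq=pd.Timedelta(hours=1))

grid = (
    raw.drop_duplicates(subset=["date_time"], keep="first")
       .set_index("date_time")
       .reindex(full_hours)          # hours without a measurement become null
       .rename_axis("date_time")
       .reset_index()
)

# A full non-leap year yields 8760 hourly timestamps
hours_2018 = len(pd.date_range("2018-01-01 00:00", "2018-12-31 23:00",
                               freq=pd.Timedelta(hours=1)))

print(grid)
```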

We can then fill in the remaining null values with average values. This is especially easy for gaps (entries with null values) of one hour, since we can simply use the average of the entries before and after the corresponding row. For the precipitation we reuse the value of the preceding hour.

In [29]:
# Fill gaps of one hour with the average values of the entries before and after
# (the first and the last row are skipped, since they lack a neighbour on one side)
for i in range(1, 8759):
    if pd.isna(weather_set.loc[i, "max_temp"]):
        weather_set.loc[i, "max_temp"]=((weather_set.loc[i+1, "max_temp"]+weather_set.loc[i-1, "max_temp"])/2)
        weather_set.loc[i, "min_temp"]=((weather_set.loc[i+1, "min_temp"]+weather_set.loc[i-1, "min_temp"])/2)
        weather_set.loc[i, "precip"]=weather_set.loc[i-1, "precip"]

In [30]:
# Indexes in the original dataframe where the max_temp is null
null_indexes = weather_set[weather_set["max_temp"].isnull()].index
null_indexes

Int64Index([2404, 2405, 3076, 3077, 3078, 3079, 3101, 3102, 4965, 4966, 5336,
            5337, 5338, 5339, 5340, 5672, 5673, 5674, 5675, 5676, 7473, 7474,
            7475, 7476, 7477, 7478, 7479, 7480],
           dtype='int64')

In [31]:
# For the rows with null values: Replace the null values by interpolating linearly between the closest entries before and after the gap that have actual values
for index in null_indexes:
    
    if(pd.isna(weather_set.loc[index, "max_temp"])):
        
        not_nan_index = index
        
        while(pd.isna(weather_set.loc[not_nan_index, "max_temp"])):
            not_nan_index = not_nan_index+1
            
        distance = not_nan_index - index
        
        max_temp_before = weather_set.loc[index-1, "max_temp"]
        max_temp_after = weather_set.loc[index+distance, "max_temp"]
        max_temp_step = (max_temp_before-max_temp_after)/(distance+1)
        
        min_temp_before = weather_set.loc[index-1, "min_temp"]
        min_temp_after = weather_set.loc[index+distance, "min_temp"]
        min_temp_step = (min_temp_before-min_temp_after)/(distance+1)
        
        precip = weather_set.loc[index-1, "precip"]
        
        for indexes in range(distance):
            weather_set.loc[index+indexes, "max_temp"] = np.round(max_temp_before - max_temp_step * (indexes+1), 1)
            weather_set.loc[index+indexes, "min_temp"] = np.round(min_temp_before - min_temp_step * (indexes+1), 1)
            weather_set.loc[index+indexes, "precip"] = precip

In [32]:
weather_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 0 to 8759
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date_time  8760 non-null   datetime64[ns]
 1   max_temp   8760 non-null   float64       
 2   min_temp   8760 non-null   float64       
 3   precip     8760 non-null   float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 662.2 KB
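
The gap filling above amounts to linear interpolation for the temperatures and carrying the last known value forward for the precipitation. A minimal sketch of the same idea with pandas built-ins on hypothetical toy values (an alternative formulation, not the code used in this notebook):

```python
import pandas as pd

# Toy series (hypothetical values) with a two-hour gap and a one-hour gap
temps = pd.Series([14.0, None, None, 11.0, None, 12.0])
precip = pd.Series([0.0, None, None, 1.0])

# Linear interpolation between the closest known neighbours, rounded to one decimal
temps_filled = temps.interpolate(method="linear").round(1)

# Precipitation: reuse the last known value
precip_filled = precip.ffill()

print(temps_filled.tolist())
print(precip_filled.tolist())
```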


Now we can see that the weather dataframe has 8760 rows as desired and does not contain null values anymore.

### Additional date related Features
As we want to gain deep insight into the Bikesharing business at specific timeframes we want to generate additional date related features.

In [33]:
#Add some date related columns to the dataset
weather_set["hour"] = weather_set["date_time"].dt.hour
weather_set["week_day"] = weather_set["date_time"].dt.weekday   # Monday = 0, ..., Sunday = 6
weather_set["day"] = weather_set["date_time"].dt.strftime("%d/%m/%Y")
weather_set["month"] = weather_set["date_time"].dt.month

weather_set.head()

Unnamed: 0,date_time,max_temp,min_temp,precip,hour,week_day,day,month
0,2018-01-01 00:00:00,14.4,13.9,0.0,0,0,01/01/2018,1
1,2018-01-01 01:00:00,14.4,14.15,0.0,1,0,01/01/2018,1
2,2018-01-01 02:00:00,14.4,14.4,0.0,2,0,01/01/2018,1
3,2018-01-01 03:00:00,14.4,14.4,0.0,3,0,01/01/2018,1
4,2018-01-01 04:00:00,14.4,13.9,0.0,4,0,01/01/2018,1


In [34]:
# Save dataset as pickle
weather_set.to_pickle("Data/weather_set.pickle")