### Step 1: Choose the data you want
This notebook processes shared bike trip data from Helsinki and Espoo. The first step asks for the month and year you want to use. To run a cell in Jupyter, press Shift + Enter.

In [18]:
# Importing libraries
import pandas as pd
import geopandas as gpd
import requests, csv

In [19]:
# Asking for the wanted month
month = input('Write the number of the month as for example "05" or "09". Then press Enter.')

Write the number of the month as for example "05" or "09". Then press Enter. 06


In [20]:
# Asking for the wanted year
year = input('Write the year as for example "2019". Then press Enter.')

Write the year as for example "2019". Then press Enter. 2019


In [21]:
# Creating a URL to the requested data
url = f"http://dev.hsl.fi/citybikes/od-trips-{year}/{year}-{month}.csv"
# Reading the data from the URL to a dataframe
data = pd.read_csv(url)

In [22]:
# Printing out the number of trips made
print(f"{len(data)} trips were made in {month}/{year}.")

715227 trips were made in 06/2019.
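
The month and year typed above go straight into the URL, so a typo such as `6` instead of `06` only shows up as a failed download. Below is a minimal sketch of a check that could run before `pd.read_csv`. The helper name `build_trip_url` is hypothetical, and the URL pattern is simply the one used in the cell above.

```python
def build_trip_url(year, month):
    """Return the trip CSV URL for a year and month given as strings or ints."""
    year_num = int(year)    # raises ValueError for non-numeric input
    month_num = int(month)
    if not 1 <= month_num <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month!r}")
    # Zero-pad the month so that "6" and "06" give the same URL
    return f"http://dev.hsl.fi/citybikes/od-trips-{year_num}/{year_num}-{month_num:02d}.csv"


print(build_trip_url("2019", "6"))
```

The check only validates the format; whether a file exists for the chosen month still depends on the server.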


### Step 2: Dividing the bike trip data into night and day trips

In [23]:
# Converting departure and return times to datetime format
data['departure_time'] = pd.to_datetime(data['Departure'], format='%Y-%m-%dT%H:%M:%S')
data['return_time'] = pd.to_datetime(data['Return'], format='%Y-%m-%dT%H:%M:%S')

# New dataframe with departure time as DatetimeIndex (a copy, so that `data` stays unchanged)
dep_index_data = data.copy()
dep_index_data['departure_time_2'] = dep_index_data['departure_time']
dep_index_data.set_index(['departure_time'], inplace=True)

# Selecting rows with departure time between 00:00 and 05:00
departures = dep_index_data.between_time('00:00','05:00')

# New dataframe with return time as datetimeindex
ret_index_data = departures.set_index(['return_time'])

# Selecting rows with return time between 00:00 and 05:00
returns = ret_index_data.between_time('00:00','05:00')

# New dataframe with only trips between 00:00 and 05:00
night = returns.reset_index()

# Dropping unnecessary columns
night.drop(columns=['Departure','Return'], inplace=True)

In [9]:
# If you want a CSV file of night-time bike trips, uncomment the following line
# (the `output` folder must already exist)

#night.to_csv(f'output/night_{month}_{year}.csv')

We continue with the same procedure for getting the day data.

In [24]:
# Same procedure as above, but for days
d_deps = dep_index_data.between_time('05:00','00:00')
d_index_data = d_deps.set_index(['return_time'])
d_rets = d_index_data.between_time('05:00','00:00')

# Dropping unnecessary columns for day data
day = d_rets.reset_index()
day.drop(columns=['Departure','Return'], inplace=True)

In [12]:
# If you want a CSV file of daytime bike trips, uncomment the following line
# (the `output` folder must already exist)

#day.to_csv(f'output/day_{month}_{year}.csv')

In [27]:
# Printing out general info of the data
print('Day trips: ' + str(len(day)))
print('Night trips: ' + str(len(night)))
print(str(len(day)-len(night))+ ' more trips during the day than the night!')

Day trips: 677875
Night trips: 32680
645195 more trips during the day than the night!
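
The day and night counts add up to 710555, which is 4672 fewer than the 715227 trips read in Step 1. One likely reason is that only trips which both depart and return inside the same window are kept, so a trip leaving at 04:50 and returning at 05:10 lands in neither set. A small sketch with made-up timestamps illustrates the effect. The frame `trips` and the helper `in_window` are illustrative and not part of the notebook.

```python
import pandas as pd

trips = pd.DataFrame({
    "departure_time": pd.to_datetime([
        "2019-06-01 01:10:00",  # night trip
        "2019-06-01 04:50:00",  # starts at night, ends during the day
        "2019-06-01 12:00:00",  # day trip
        "2019-06-01 23:55:00",  # starts during the day, ends at night
    ]),
    "return_time": pd.to_datetime([
        "2019-06-01 01:30:00",
        "2019-06-01 05:10:00",
        "2019-06-01 12:20:00",
        "2019-06-02 00:05:00",
    ]),
})


def in_window(frame, start, end):
    """Keep rows whose departure AND return both fall between start and end."""
    by_departure = frame.set_index("departure_time", drop=False).between_time(start, end)
    by_return = by_departure.set_index("return_time", drop=False).between_time(start, end)
    return by_return.reset_index(drop=True)


night_demo = in_window(trips, "00:00", "05:00")
# A start time later than the end time makes the window wrap around midnight
day_demo = in_window(trips, "05:00", "00:00")

print(len(trips), len(night_demo), len(day_demo))
```

Two of the four sample trips cross a window boundary and are dropped from both sets. Note also that `between_time` includes both endpoints by default, so a timestamp of exactly 05:00:00 matches both windows; pandas 1.4 and later accept an `inclusive` argument to change this.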


### Step 3: Reading in the bike station data

The data for the shared bike stations can be downloaded from [this link](https://public-transport-hslhrt.opendata.arcgis.com/datasets/helsingin-ja-espoon-kaupunkipy%C3%B6r%C3%A4asemat/data). In this notebook we used the Shapefile format.

In [29]:
# Reading in bike station shapefile
stations = gpd.read_file('data/Helsingin_ja_Espoon_kaupunkipy%C3%B6r%C3%A4asemat.shp')

# Using only some columns, setting the station ID as index
# (.copy() avoids a SettingWithCopyWarning when the ID column is converted)
stat = stations[['ID','Nimi','Osoite','geometry']].copy()
stat['ID'] = stat['ID'].astype(int)
stat = stat.set_index('ID')

### Step 3.1: Make a geopackage file with the stations, including all the night or day trips (optional)

This step attaches the departure and return station geometry to every trip (night and day separately), matched on the station id. If you do not need this data, you may skip this step.

In [35]:
# We'll start with the night data

# Setting the departure or return station id as index for night data
night_dep_stat_id = night.set_index('Departure station id')
night_ret_stat_id = night.set_index('Return station id')

# Joining night trip data with departure and return station geometries
night_dep_geo = night_dep_stat_id.join(stat)
night_ret_geo = night_ret_stat_id.join(stat)

# Make GeoDataFrames
night_dep_gpd = gpd.GeoDataFrame(night_dep_geo, crs=stations.crs, geometry='geometry')
night_ret_gpd = gpd.GeoDataFrame(night_ret_geo, crs=stations.crs, geometry='geometry')

# Uncomment the following two lines to create GeoPackage files
#night_dep_gpd.to_file('night_trips_departure_geometry.gpkg', driver="GPKG")
#night_ret_gpd.to_file('night_trips_return_geometry.gpkg', driver="GPKG")


In [22]:
# This is the same as above, but for day time data

# Setting the departure or return station id as index for day data
day_dep_stat_id = day.set_index('Departure station id')
day_ret_stat_id = day.set_index('Return station id')

# Joining day trip data with departure and return station geometries
day_dep_geo = day_dep_stat_id.join(stat)
day_ret_geo = day_ret_stat_id.join(stat)

# Make GeoDataFrames
day_dep_gpd = gpd.GeoDataFrame(day_dep_geo, crs=stations.crs, geometry='geometry')
day_ret_gpd = gpd.GeoDataFrame(day_ret_geo, crs=stations.crs, geometry='geometry')

# Uncomment the following two lines to create GeoPackage files
#day_dep_gpd.to_file('day_trips_departure_geometry.gpkg', driver="GPKG")
#day_ret_gpd.to_file('day_trips_return_geometry.gpkg', driver="GPKG")
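
`DataFrame.join` is a left join by default, so a trip whose station id is missing from the station table keeps its row but gets an empty geometry. Below is a quick sketch of how such rows could be counted before writing a file. The two small frames are made up, and plain coordinate tuples stand in for the point geometries.

```python
import pandas as pd

# Made-up station table indexed by station id (tuples stand in for Point geometries)
stat_demo = pd.DataFrame(
    {"Nimi": ["Station A", "Station B"], "geometry": [(0.0, 0.0), (1.0, 1.0)]},
    index=pd.Index([1, 2], name="ID"),
)

# Made-up trips; station id 999 does not exist in stat_demo
trips_demo = pd.DataFrame({
    "Departure station id": [1, 2, 999],
    "Covered distance (m)": [1200.0, 800.0, 1500.0],
})

joined_demo = trips_demo.set_index("Departure station id").join(stat_demo)

# Rows without a matching station have a missing geometry after the left join
unmatched = joined_demo[joined_demo["geometry"].isna()]
print(f"{len(unmatched)} of {len(joined_demo)} trips have no station geometry")
```

The same check should carry over to `night_dep_geo`, `night_ret_geo`, `day_dep_geo` and `day_ret_geo`, assuming their `geometry` column comes from `stat`.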


### Step 3.2: Get the average data for each bike station (optional)

In this step, you can get a GeoPackage file with aggregated departure or return station data for night-time or daytime trips. The finished file contains the average distance covered, the average trip duration and the trip count for each bike station.

The first option is night-time departure data.

In [37]:
# Grouping night data by departure station id
dep_grouped_night = night.groupby('Departure station id')

# Getting the grouped data averages over distance and duration, adding trip count
# (DataFrame.append was removed in pandas 2.0, so the groupby object is aggregated directly)
mean_cols = ['Covered distance (m)', 'Duration (sec.)']
night_mean_data = dep_grouped_night[mean_cols].mean()
night_mean_data['trip_count'] = dep_grouped_night.size()

# The result is already indexed by departure station id

# Joining night mean data with geometries
dep_night_stat = night_mean_data.join(stat)

dep_night_stat_gpd = gpd.GeoDataFrame(dep_night_stat, crs=stations.crs, geometry='geometry')
# Uncomment following line to write to file
#dep_night_stat_gpd.to_file('data/night_departure_stations.gpkg', driver="GPKG")

The second option is night-time return data.

In [38]:
# Grouping night data by return station id
ret_grouped_night = night.groupby('Return station id')

# Getting the grouped data averages over distance and duration, adding trip count
# (DataFrame.append was removed in pandas 2.0, so the groupby object is aggregated directly)
mean_cols = ['Covered distance (m)', 'Duration (sec.)']
night_mean_ret_data = ret_grouped_night[mean_cols].mean()
night_mean_ret_data['trip_count'] = ret_grouped_night.size()

# The result is already indexed by return station id

# Joining night mean data with geometries
ret_night_stat = night_mean_ret_data.join(stat)

ret_night_stat_gpd = gpd.GeoDataFrame(ret_night_stat, crs=stations.crs, geometry='geometry')
# Uncomment following line to write to file
#ret_night_stat_gpd.to_file('data/night_return_stations.gpkg', driver="GPKG")

The third option is daytime departure data.

In [28]:
# Grouping day data by departure station id
dep_grouped_day = day.groupby('Departure station id')

# Getting the grouped data averages over distance and duration, adding trip count
# (DataFrame.append was removed in pandas 2.0, so the groupby object is aggregated directly)
mean_cols = ['Covered distance (m)', 'Duration (sec.)']
day_mean_dep_data = dep_grouped_day[mean_cols].mean()
day_mean_dep_data['trip_count'] = dep_grouped_day.size()

# The result is already indexed by departure station id

# Joining day mean data with geometries
dep_day_stat = day_mean_dep_data.join(stat)

dep_day_stat_gpd = gpd.GeoDataFrame(dep_day_stat, crs=stations.crs, geometry='geometry')
# Uncomment following line to write to file
#dep_day_stat_gpd.to_file('data/day_departure_stations.gpkg', driver="GPKG")

The fourth option is daytime return data.

In [27]:
# Grouping day data by return station id
ret_grouped_day = day.groupby('Return station id')

# Getting the grouped data averages over distance and duration, adding trip count
# (DataFrame.append was removed in pandas 2.0, so the groupby object is aggregated directly)
mean_cols = ['Covered distance (m)', 'Duration (sec.)']
day_mean_ret_data = ret_grouped_day[mean_cols].mean()
day_mean_ret_data['trip_count'] = ret_grouped_day.size()

# The result is already indexed by return station id

# Joining day mean data with geometries
ret_day_stat = day_mean_ret_data.join(stat)

ret_day_stat_gpd = gpd.GeoDataFrame(ret_day_stat, crs=stations.crs, geometry='geometry')
# Uncomment following line to write to file
#ret_day_stat_gpd.to_file('data/day_return_stations.gpkg', driver="GPKG")
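
The four cells above differ only in the input dataframe and the station id column. Below is a minimal sketch of a single helper built on `groupby(...).agg(...)` that returns the same three values per station. The name `station_summary` and the sample frame are hypothetical.

```python
import pandas as pd


def station_summary(trips, station_col):
    """Mean distance, mean duration and trip count per station id."""
    return trips.groupby(station_col).agg(**{
        "Covered distance (m)": ("Covered distance (m)", "mean"),
        "Duration (sec.)": ("Duration (sec.)", "mean"),
        # "size" counts every row in the group, including rows with missing values
        "trip_count": ("Duration (sec.)", "size"),
    })


# Made-up trips: two from station 1, one from station 2
sample = pd.DataFrame({
    "Departure station id": [1, 1, 2],
    "Covered distance (m)": [1000.0, 3000.0, 500.0],
    "Duration (sec.)": [300.0, 900.0, 120.0],
})

summary = station_summary(sample, "Departure station id")
print(summary)
```

The result is already indexed by the station id, so it could be passed to `.join(stat)` in the same way as in the cells above, for example `station_summary(night, 'Return station id').join(stat)`.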

### Step 3.4: Making a .gpkg file with Euclidean lines between departure and return stations (optional)

In [39]:
# Creating a dataframe with only station geometry and ID for departures
# (.copy() avoids a SettingWithCopyWarning when the new column is added)
stat_line_dep = stations[['ID','geometry']].copy()
stat_line_dep['Departure station id'] = stations['ID'].astype(int)

# Creating a dataframe with only station geometry and ID for returns
stat_line_ret = stations[['ID','geometry']].copy()
stat_line_ret['Return station id'] = stations['ID'].astype(int)


In [40]:
# Importing shapely 
from shapely.geometry import LineString

The first code cell makes the lines from the night data. The day data is too large for this approach, so a faster alternative is used later. The approach used here lets us keep more information (the departure and return station ids) in the final GeoPackage output. The code is quite slow, so please be patient.

In [42]:
# Creating dataframe with only the departure and return station id numbers from the night data
night_line = night[['Departure station id', 'Return station id']].copy()
night_line['Return station id'] = night_line['Return station id'].astype(int)

# Merging the departure data with geometries. Current pandas does not allow `on` together with
# `right_index=True`; a left merge on the id column keeps one row per trip in the original order
# (assuming station ids are unique in the station table)
merged_dep_night = pd.merge(left=night_line, right=stat_line_dep, on='Departure station id', how='left')

# Selecting only some columns for departure data with geometries and renaming geometry column
merged_dep_night = merged_dep_night[['Departure station id','geometry']]
merged_dep_night = merged_dep_night.rename(columns={'geometry':'dep_point'})

# Merging the return data with geometries
merged_ret_night = pd.merge(left=night_line, right=stat_line_ret, on='Return station id', how='left')

# Selecting only some columns for return data with geometries and renaming geometry column
merged_ret_night = merged_ret_night[['Return station id', 'geometry']]
merged_ret_night = merged_ret_night.rename(columns={'geometry':'ret_point'})

# Joining the merged departure and return data, dropping trips without a matching station
joined = merged_dep_night.join(merged_ret_night).dropna(subset=['dep_point', 'ret_point'])

# Creating a new dataframe for the night points with an empty geometry column
night_points = pd.DataFrame()
night_points['geometry'] = None

# Looping through the joined data and adding a LineString to each bike trip, with the departure and return id
for i, row in joined.iterrows():
    line = LineString([row['dep_point'], row['ret_point']])
    night_points.at[i, 'dep_id'] = row['Departure station id']
    night_points.at[i, 'ret_id'] = row['Return station id']
    night_points.at[i, 'geometry'] = line
    
# Creating a geopandas dataframe with lines as the geometry
night_gpd = gpd.GeoDataFrame(night_points, geometry='geometry', crs=stations.crs)

# Writing the lines to a GeoPackage file (comment out the line below to skip this)
night_gpd.to_file('data/night_lines_with_station_id.gpkg', driver="GPKG")


The following code makes line data for the daytime bike trips. We use a different, faster approach because of the large amount of data.

In [43]:
# Choosing only id numbers for the day line data
day_line = day[['Departure station id', 'Return station id']]

# Merging the day departures with geometries. Current pandas does not allow `on` together with
# `right_index=True`; a left merge on the id column keeps one row per trip in the original order
# (assuming station ids are unique in the station table)
merged_dep_day = pd.merge(left=day_line, right=stat_line_dep, on='Departure station id', how='left')

# Selecting only some columns for departure data with geometries and renaming geometry column
merged_dep_day = merged_dep_day[['Departure station id','geometry']]
merged_dep_day = merged_dep_day.rename(columns={'geometry':'dep_point'})

# Merging the return data with geometries
merged_ret_day = pd.merge(left=day_line, right=stat_line_ret, on='Return station id', how='left')

# Selecting only some columns for return data with geometries and renaming geometry column
merged_ret_day = merged_ret_day[['Return station id', 'geometry']]
merged_ret_day = merged_ret_day.rename(columns={'geometry':'ret_point'})

# Joining the merged departure and return data
joined = merged_dep_day.join(merged_ret_day)

# Dropping NaN values and resetting the index of joined. Changing the id of return station to integer
joined.dropna(inplace=True)
joined.reset_index(inplace=True)
joined['Return station id'] = joined['Return station id'].astype(int)

# Making lists of the departure and return points
dep_list = joined['dep_point'].tolist()
ret_list = joined['ret_point'].tolist()

# Looping through the zipped point lists to build each LineString,
# skipping trips that start and end at the same station
line_list = []
for dep, ret in zip(dep_list, ret_list):
    if dep == ret:
        continue
    line_list.append(LineString([dep, ret]))

# Creating a geodataframe of the list of lines
day_lines = gpd.GeoDataFrame(crs=stations.crs, geometry=line_list)

# Writing the list of lines to a .gpkg file (comment out the line below to skip this)
day_lines.to_file('data/day_lines.gpkg', driver="GPKG")
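
Because the loop above skips same-station trips with `continue`, the resulting lines can no longer be matched row by row to the station ids in `joined`. Below is a sketch of one way to keep them aligned: filter the round trips out with a boolean mask first, then build the endpoints from the filtered table. The frame is made up, and coordinate tuples stand in for the station points; in the notebook, `LineString([dep, ret])` would be built where the endpoint pair is formed.

```python
import pandas as pd

# Made-up joined table: one row per trip, tuples stand in for station points
joined_demo = pd.DataFrame({
    "Departure station id": [1, 1, 2],
    "Return station id": [2, 1, 1],
    "dep_point": [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)],
    "ret_point": [(1.0, 1.0), (0.0, 0.0), (0.0, 0.0)],
})

# Dropping round trips up front instead of skipping them inside the loop
one_way = joined_demo[
    joined_demo["Departure station id"] != joined_demo["Return station id"]
].reset_index(drop=True)

# One endpoint pair per remaining row, in the same order as the ids
endpoints = list(zip(one_way["dep_point"], one_way["ret_point"]))

print(len(endpoints) == len(one_way))
```

With the ids and the lines in the same order, the id columns of `one_way` could be passed as data next to the line list when the GeoDataFrame is created. This does not change the speed of the list-based approach, since the rows are still processed with `zip`.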