# Taxi Driver Data - Exploration in Jupyter

## Context

This data set was created to help Kaggle users in the New York City Taxi Trip Duration competition. New features were generated using the Wolfram Mathematica system.
We hope that this data set will help both young and experienced researchers on their path to mastering data.

### Content

The dataset combines features from the initial dataset with features generated by the Wolfram Mathematica computational system. The features can be split into the following groups:

* Initial features (extracted from initial data),
* Calendar features (season, day name, and day period),
* Weather features (information about temperature, snow, and rain),
* Travel features (geo distance with estimated driving distance and time).

#### The dataset contains the following columns:
* `id` - a unique identifier for each trip,
* `vendorId` - a code indicating the provider associated with the trip record,
* `passengerCount` - the number of passengers in the vehicle (driver entered value),
* `year`,
* `month`,
* `day`,
* `hour`,
* `minute`,
* `second`,
* `season`,
* `dayName`,
* `dayPeriod` - the period of the day, e.g. late night, morning, etc.,
* `temperature`,
* `rain`,
* `snow`,
* `startLatitude`,
* `startLongitude`,
* `endLatitude`,
* `endLongitude`,
* `flag` - this flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip,
* `drivingDistance` - driving distance, estimated via Wolfram Mathematica system,
* `drivingTime` - driving time, estimated via Wolfram Mathematica system,
* `geoDistance` - distance between starting and ending points,
* `tripDuration` - duration of the trip in seconds (value -1 indicates test rows).

This first block of code imports the required modules and uses the `openml` API to download the dataset from the OpenML website. The data are stored as a pandas DataFrame in the variable `X`.

In [None]:
import openml
import folium
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns
from folium import plugins
from folium.plugins import HeatMap
import numpy as np
import warnings
warnings.filterwarnings('ignore')
sns.set_style('darkgrid')

# List all datasets and their properties
openml.datasets.list_datasets(output_format="dataframe")

# Get dataset by ID
dataset = openml.datasets.get_dataset(43584)

# Get the data itself as a dataframe (or otherwise)
X, y, _, _ = dataset.get_data(dataset_format="dataframe")

We can first check that the data look as expected by printing the first five rows of the DataFrame.

In [None]:
X.head()

In [None]:
X.dtypes

> 🚧 `.dtypes` is an attribute, not a method, so it takes no parentheses.

Running this also confirms that each column has the data type we expect.

In [None]:
len(X[X['tripDuration'] == -1]) / len(X)

The description mentioned that the value of `-1` indicates test rows. We can see that $30\%$ of the rows are specified as test rows.
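
The same boolean mask can also separate the two kinds of row. The sketch below is only illustrative: it uses a small made-up frame called `toy` in place of `X`, and the names `is_test`, `train_rows`, and `test_rows` are assumptions that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for X: tripDuration == -1 marks a test row.
toy = pd.DataFrame({"id": [1, 2, 3, 4], "tripDuration": [455, -1, 2124, -1]})

is_test = toy["tripDuration"] == -1

# The mean of a boolean mask is the share of True values,
# which matches len(X[mask]) / len(X).
test_fraction = is_test.mean()

train_rows = toy[~is_test]  # rows with a known trip duration
test_rows = toy[is_test]    # rows whose duration is withheld
```

Keeping the mask in a variable avoids recomputing the comparison each time one of the subsets is needed.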

In [None]:
X = X.assign(date=pd.to_datetime(X[["year", "month", "day", "hour", "minute", "second"]]))

We have date and time data, but each component is held in its own column. We combine them into a single `date` column in datetime format.

In [None]:
X.plot('date', 'drivingTime', kind='scatter')
plt.xlabel('Date')
plt.ylabel('Driving time (s)')
plt.show()

This first plot is a simple scatter of driving time against the full datetime. A very small number of extreme outliers obscures any trend in the main body of the data. Almost every journey takes less than one hour, and most take less than half an hour, so stretching the $y$-axis to 13 hours is not helpful.
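
One possible way to make the main body of the data visible is to drop the extreme journeys before plotting. The sketch below assumes a one-hour cut-off, taken from the observation above, and uses a made-up series in place of `X['drivingTime']`; the names `driving_time`, `ONE_HOUR`, and `under_hour` are illustrative only.

```python
import pandas as pd

# Toy stand-in for X['drivingTime'] in seconds:
# mostly short trips plus one extreme outlier.
driving_time = pd.Series([300, 420, 610, 250, 900, 480, 46000])

ONE_HOUR = 3600  # assumed cut-off in seconds

# Keep only journeys shorter than the cut-off.
under_hour = driving_time[driving_time < ONE_HOUR]

# An alternative that keeps every row is to leave the data alone and
# limit the axis instead, e.g. plt.ylim(0, ONE_HOUR) after plotting.
```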

In [None]:
X[X['drivingTime'] > 40000]

We can slice the dataset to isolate this one point above 40000 seconds. By plotting the latitude and longitude of the start and end points, we can see that the journey is from the centre of New York to Quebec. Google Maps roughly agrees with the driving time, so the data are likely correct, though ridiculous.

![map](map.png)

We can see that this outlier data point is a 13-hour trip from New York to the north of Quebec.

In [None]:
mean_driving_time = X.groupby('hour')['drivingTime'].mean()/60

We then compute the mean driving time, in minutes, for each hour of the day so that it can be plotted.

In [None]:
mean_driving_time

In [None]:
plt.bar(range(0,24), mean_driving_time)
plt.xlabel('Hour')
plt.ylabel('Mean Driving Time (minutes)')
plt.title('Journey time at different times of day')
plt.show()

This bar chart shows the mean driving time for each hour of the day, so we can compare how long journeys take at different times.

In [None]:
# Compute the median driving time (in minutes) for each hour of the day.

median_driving_time = X.groupby('hour')['drivingTime'].median()/60
median_driving_time

In [None]:
plt.bar(range(0,24), median_driving_time)
plt.xlabel('Hour')
plt.ylabel('Median Driving Time (minutes)')
plt.title('Median Driving Time vs. Hour')
plt.show()


The same plot, but with the median plotted instead.
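
The mean and the median can differ a lot when a group contains an outlier, which is why both are worth plotting. As a rough illustration on made-up numbers (the frame `toy` and the name `summary` are assumptions, not part of the notebook), both statistics can be computed in one pass with `agg`:

```python
import pandas as pd

# Toy data: hour 8 contains one very long journey (6000 s).
toy = pd.DataFrame({
    "hour": [8, 8, 8, 9, 9],
    "drivingTime": [300, 360, 6000, 240, 480],
})

# Mean and median driving time per hour, converted from seconds to minutes.
summary = toy.groupby("hour")["drivingTime"].agg(["mean", "median"]) / 60
```

In this toy case the mean for hour 8 is 37 minutes while the median is 6 minutes, because the median ignores how extreme the single long journey is.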

In [None]:
base_date = X['date'][0].date()

# Update the 'date' column with the same date for all rows
X['date'] = X['date'].apply(lambda x: x.replace(year=base_date.year, month=base_date.month, day=base_date.day))

# Output the DataFrame with updated 'date' column
X.head()


As a temporary hack, every journey is given the same calendar date, so that driving time can be plotted against the time of day alone.
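
A vectorised alternative to the row-by-row `apply` is sketched below. It is not the approach used in this notebook: it subtracts midnight from each timestamp to get the time of day, then adds that to one arbitrary anchor date. The frame `toy`, the anchor date, and the names `time_of_day` and `same_day` are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for X['date'].
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-14 17:24:55", "2016-06-12 00:43:35"])
})

# Time elapsed since midnight on each row's own day (a Timedelta).
time_of_day = toy["date"] - toy["date"].dt.normalize()

# Re-anchor every timestamp on one arbitrary day (assumed here).
anchor = pd.Timestamp("2016-01-01")
same_day = anchor + time_of_day
```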

In [None]:
X.tail()

In [None]:
X['date'] = X['date'].dt.floor('10min')

The time is then floored into 10-minute buckets.

In [None]:
mean_driving_time_by_minute = X.groupby('date')['drivingTime'].mean()/60
mean_driving_time_by_minute_std = X.groupby('date')['drivingTime'].std()/60

Two new series are created, holding the mean and the standard deviation of the driving time, in minutes, for each 10-minute interval.

In [None]:
from datetime import datetime, timedelta

start_time = datetime.strptime("00:00", "%H:%M")
end_time = datetime.strptime("23:59", "%H:%M")
step = timedelta(minutes=10)
times = []

current_time = start_time
date_prefix = "2016-01-01"

while current_time <= end_time:
    formatted_time = current_time.strftime("%H:%M")
    times.append(f"{date_prefix} {formatted_time}:00")
    current_time += step




In [None]:
times = pd.DataFrame(times)
times = pd.to_datetime(times[0])

In [None]:
times

These cells create a series containing the start of each 10-minute interval in the day.
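
For comparison, the same 144 timestamps could probably be produced in one call with `pd.date_range`. This is a sketch rather than a replacement for the cells above; the name `times_alt` is an assumption, and the date `2016-01-01` simply mirrors the `date_prefix` used in the loop.

```python
import pandas as pd

# One timestamp every 10 minutes from 00:00 to 23:50 on an arbitrary day.
times_alt = pd.Series(
    pd.date_range("2016-01-01 00:00", "2016-01-01 23:50", freq="10min")
)
```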

In [None]:
plt.figure(figsize=(10,10))
plt.bar(times, mean_driving_time_by_minute, width=0.007, edgecolor='none', color='red')
plt.ylim(3, 6.5)
plt.xlabel('Time of day, in 10-minute buckets')
plt.ylabel('Mean driving time (minutes)')
plt.title('Length of taxi drives throughout the day')
plt.show()

This plot is a bar chart showing the mean driving time for each 10-minute interval of the day.

In [None]:
plt.figure(figsize=(10,10))
plt.plot(times.to_numpy(), mean_driving_time_by_minute.to_numpy())
plt.ylim(3, 6.5)
plt.xlabel('Time of day, in 10-minute buckets')
plt.ylabel('Mean driving time (minutes)')
plt.title('Length of taxi drives throughout the day')
plt.show()

As a line graph instead of a bar chart.

In [None]:
def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)


def plotWithStd(x, y, stds, ax, title, y_label):
    # Shade a band of +/- stds around the line, then draw the line on top
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))


fig, ax1 = plt.subplots(ncols=1, figsize=(7, 3), dpi=300)
# The standard deviation is divided by 5 so that the band does not swamp the mean
plotWithStd(times.to_numpy(), mean_driving_time_by_minute.to_numpy(), mean_driving_time_by_minute_std.to_numpy()/5, ax1, 'Mean driving time with standard deviation', 'Driving time (minutes)')
fig.tight_layout()

Graph with standard deviation added. We can see that it's fairly constant throughout the day. Standard deviation is divided by 5 for clarity.


## Folium - using maps to display geographical data

In [None]:
i = 0
m = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
while i < len(X):
    folium.Marker([X['startLatitude'][i], X['startLongitude'][i]], icon=folium.Icon(color="green", icon="play")).add_to(m)
    i += 100
i = 0
while i < len(X):
    folium.Marker([X['endLatitude'][i], X['endLongitude'][i]], icon=folium.Icon(color="red", icon="stop")).add_to(m)
    i += 100

This code uses the folium module to create a map centred on New York City. Using `while` loops, it adds a green marker at each plotted start point and a red marker at each plotted end point, based on the latitude and longitude stored in the dataset. There are so many data points that only 1% are plotted (every 100th row), because rendering more markers in folium is heavy enough to crash the notebook.

In [None]:
# DO NOT RUN THIS CELL UNLESS ALL WORK IS SAVED AND JUPYTER IS RUNNING IN THE BROWSER. MOST LIKELY THIS WILL CRASH VSCODE AND PREVENT YOU FROM REOPENING THE PROJECT.

# m

I can't display the interactive map here as it crashes the system, but here is a screenshot of the map with a full 1% of the data points plotted.

![map](nymap.png)

### Heatmap

In [None]:
start_points = folium.Map(location=[40.7128, -74.0060], zoom_start=12, tiles='Stamen Toner')
end_points = folium.Map(location=[40.7128, -74.0060], zoom_start=12, tiles='Stamen Toner')

This code makes folium maps (using the 'Stamen Toner' tiles) centred on New York. One is designated for the start points of taxi rides, and one for the end points.

In [None]:
heat_data_start = X[["startLatitude","startLongitude"]].to_dict(orient='tight')["data"]
heat_data_end = X[["endLatitude","endLongitude"]].to_dict(orient='tight')["data"]

We then pull the latitude and longitude of the start points out of the `X` dataframe into one variable, and the end points into another. Each of these variables is a list of lists, with each sublist containing the latitude and longitude of one point.
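
To make the list-of-lists shape concrete, here is a small sketch on made-up coordinates. It also shows `.values.tolist()`, which should give the same result as `to_dict(orient='tight')["data"]` for purely numeric columns; the frame `toy` and the names `tight` and `direct` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for two rows of X.
toy = pd.DataFrame({
    "startLatitude": [40.767, 40.738],
    "startLongitude": [-73.982, -73.980],
})

cols = ["startLatitude", "startLongitude"]

# The approach used in this notebook (needs a pandas version with orient='tight').
tight = toy[cols].to_dict(orient="tight")["data"]

# A shorter route to the same list of [latitude, longitude] pairs.
direct = toy[cols].values.tolist()
```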

In [None]:
HeatMap(heat_data_start).add_to(start_points)
HeatMap(heat_data_end).add_to(end_points)

The `HeatMap` plugin takes these lists and adds a heatmap layer to each of the maps we created earlier.

In [None]:
# Only the last expression in a cell is displayed, so show one map at a time

start_points
# end_points

### Cells of $0.001^\circ$ (roughly $111 \text{m}$)

In [None]:
Y = X.copy()  # a real copy, so that flooring the coordinates leaves X unchanged
Y['startLatitude'] = np.floor(Y['startLatitude']*1000)/1000
Y['endLatitude'] = np.floor(Y['endLatitude']*1000)/1000
Y['startLongitude'] = np.floor(Y['startLongitude']*1000)/1000
Y['endLongitude'] = np.floor(Y['endLongitude']*1000)/1000

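Creating a copy of the dataset called `Y`, this code floors each latitude and longitude value to three decimal places (0.001°). A step of 0.001° is about 111 m from north to south; from east to west it is nearer 84 m at New York's latitude.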
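
The cell size can be checked with a little arithmetic. The sketch below assumes roughly 111,320 m per degree of latitude (a common approximation) and scales the east-west distance by the cosine of the latitude; the constant and the variable names are assumptions for the example.

```python
import math

METRES_PER_DEGREE_LAT = 111_320  # approximate value, assumed for this sketch
latitude_deg = 40.7128           # latitude used to centre the maps above
step_deg = 0.001                 # grid step produced by the flooring

# North-south size of one cell.
north_south_m = step_deg * METRES_PER_DEGREE_LAT

# East-west size shrinks with the cosine of the latitude.
east_west_m = north_south_m * math.cos(math.radians(latitude_deg))
```

Under these assumptions a cell is a little over 111 m tall and a little over 84 m wide, so the cells are rectangles rather than squares.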

In [None]:
Ystart_points = folium.Map(location=[40.7128, -74.0060], zoom_start=12, tiles='Stamen Toner')
Yend_points = folium.Map(location=[40.7128, -74.0060], zoom_start=12, tiles="Stamen Toner")
Yheat_data_start = Y[["startLatitude","startLongitude"]].to_dict(orient='tight')["data"]
Yheat_data_end = Y[["endLatitude","endLongitude"]].to_dict(orient='tight')["data"]
HeatMap(Yheat_data_start).add_to(Ystart_points)
HeatMap(Yheat_data_end).add_to(Yend_points)

We then repeat the above process of creating a heatmap, but using the new dataset `Y` instead of `X`. This is only a test to show that the data points are in the new, rounded grid.

In [None]:
Ystart_points
# Yend_points

#### Counting how many journeys start and end in each cell

> ❗ Parts of the following code may be redundant, but they work.

In [None]:
Y['startPos'] = Y['startLatitude'].astype(str) + ',' + Y['startLongitude'].astype(str)
Y['endPos'] = Y['endLatitude'].astype(str) + ',' + Y['endLongitude'].astype(str)
start_point_counts = Y.groupby('startPos').size().reset_index(name='startCount')
end_point_counts = Y.groupby('endPos').size().reset_index(name='endCount')

This code creates two new columns in `Y`, `startPos` and `endPos`, each a comma-separated string combining a latitude and a longitude. Two new dataframes then hold the number of journeys that start in each cell and the number that end in each cell.

In [None]:
start_point_counts['latitude'] = start_point_counts['startPos'].str.split(',').str[0].astype(float)
sorted_start_point_counts = start_point_counts.sort_values(by='latitude', ascending=True)
start_point_counts = sorted_start_point_counts.drop('latitude', axis=1)

end_point_counts['latitude'] = end_point_counts['endPos'].str.split(',').str[0].astype(float)
sorted_end_point_counts = end_point_counts.sort_values(by='latitude', ascending=True)
end_point_counts = sorted_end_point_counts.drop('latitude', axis=1)

start_point_counts = start_point_counts.rename(columns={'startPos': 'pos'})
end_point_counts = end_point_counts.rename(columns={'endPos': 'pos'})

This cell then sorts the cells by latitude and renames the position columns to `pos` in both dataframes.
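
As a possible simplification, the counts could be obtained by grouping on the two coordinate columns directly, which skips the string round trip and returns the groups already sorted by latitude. This is a sketch on made-up data, not the method used in this notebook; `toy` and `start_counts` are illustrative names.

```python
import pandas as pd

# Toy stand-in for Y after flooring: two journeys share a cell.
toy = pd.DataFrame({
    "startLatitude": [40.767, 40.767, 40.738],
    "startLongitude": [-73.983, -73.983, -73.981],
})

# groupby sorts by the key columns by default, so rows come out
# ordered by latitude and then by longitude.
start_counts = (
    toy.groupby(["startLatitude", "startLongitude"])
    .size()
    .reset_index(name="startCount")
    .rename(columns={"startLatitude": "latitude", "startLongitude": "longitude"})
)
```

The result keeps latitude and longitude as floats, so they would not need to be split out of a string again later.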

In [None]:
start_point_counts

We can see that the dataframe is working as intended.

In [None]:
full_points = pd.merge(
    start_point_counts, end_point_counts, how="outer"
)

full_points = full_points.fillna(0)
full_points['netCount'] = full_points['startCount'] - full_points['endCount']

Importantly, we merge both dataframes to create a new dataframe containing every cell (to the nearest 0.001°, roughly 111 m) where a journey either starts or ends. It contains columns for the number of journeys that start and end in that cell, plus the difference between the two. That difference is positive if more journeys start there and negative if more journeys end there.
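
A toy example may make the outer merge easier to follow. The frames `starts` and `ends` below are made up, with short labels standing in for the `pos` strings; a cell that appears in only one frame gets `NaN` for the missing count, which `fillna(0)` turns into zero before the subtraction.

```python
import pandas as pd

# Toy counts: cell "a" only has starts, cell "c" only has ends.
starts = pd.DataFrame({"pos": ["a", "b"], "startCount": [3, 1]})
ends = pd.DataFrame({"pos": ["b", "c"], "endCount": [4, 2]})

# An outer merge keeps every cell that appears in either frame.
merged = pd.merge(starts, ends, on="pos", how="outer").fillna(0)

# Positive: more journeys start here. Negative: more journeys end here.
merged["netCount"] = merged["startCount"] - merged["endCount"]

net = merged.set_index("pos")["netCount"]
```

Here `on="pos"` is spelled out for clarity; when `pos` is the only shared column, leaving it out gives the same join.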

In [None]:
full_points

The dataframe looks as expected.

In [None]:
full_points['latitude'] = full_points['pos'].str.split(',').str[0].astype(float)
full_points['longitude'] = full_points['pos'].str.split(',').str[1].astype(float)
full_points

Next, the position column is split into latitude and longitude floats that we can plot.

In [None]:
import geopandas as gpd
from shapely.geometry import Polygon
ny = gpd.GeoDataFrame([], geometry=[Polygon([[-74.25, 40.5], [-73.5, 40.5], [-73.5, 41.1], [-74.25, 41.1]])])
gdf = gpd.GeoDataFrame(full_points, geometry=gpd.points_from_xy(full_points.longitude, full_points.latitude))
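# `predicate` replaces the older `op` argument, which recent geopandas versions no longer accept
join: gpd.GeoDataFrame = gpd.sjoin(gdf, ny, how="inner", predicate='intersects')
ax = join.plot(markersize=0.1, column="netCount", legend=True, figsize=(10,10), vmin=-5, vmax=5)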
plt.title('New York City Taxis: Balance of Journeys Starting and Ending')
plt.savefig('taximap.png', dpi=1000)
plt.close()

Finally, we use the `geopandas` module to plot our latitude and longitude points using `matplotlib`. We create a polygon, a rectangle containing all the values we want to see, and spatially join it with our dataset to exclude points far outside the New York area. The data points are then plotted using a colour scale to indicate whether more journeys start or end at each point.

![hi-res plot](taximap.png)