In this file we explore the data that we loaded through the enviroCar API for Münster (2000 tracks).

# Import packages and load data

In [None]:
import geopandas as gpd
import contextily as cx
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import os
from datetime import datetime

In [None]:
# load data (may take some time due to dataset size)
filepath = os.path.join(os.getcwd(), "data", "envirocar_muenster", "envirocar_muenster_2000.shp")
tracks = gpd.read_file(filepath)

In [None]:
tracks.set_crs("EPSG:4326", inplace = True)

# First glance
We will look at several summaries to familiarize ourselves with the data.

In [None]:
tracks.shape

In [None]:
tracks.head()

In [None]:
tracks.describe(include='all')

We have quite a lot of features. They are split into general features (like "id" or "date"), driving-specific features (such as "speed" or "throttle position"), GPS-specific features, sensor-specific features and some calculated features such as CO2 emission. An overview can be found [here](https://github.com/enviroCar/envirocar-py/blob/master/examples/enviroCar_variable_description.ipynb).

We can also see that 2000 tracks amount to a large dataset of more than 700,000 rows. Why is that?

# Spatial and temporal resolution and coverage

## Plotting all and individual tracks
We can use contextily to plot our data against a background map. This is fairly straightforward, but the coordinate reference system of our enviroCar data has to match that of the contextily background tiles.
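
To make the coordinate system mismatch concrete: contextily's background tiles are in Web Mercator (EPSG:3857), while our tracks are in WGS84 longitude/latitude (EPSG:4326). Passing `crs=` to `add_basemap` lets contextily warp the tiles to our data; the alternative is to reproject the data itself, for example with `GeoDataFrame.to_crs`. The sketch below shows roughly what such a reprojection computes for a single point. The helper `lonlat_to_web_mercator` is a hypothetical name introduced only for illustration, and the coordinates are an approximate location of Münster.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 longitude/latitude (degrees) to Web Mercator (metres)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# approximate coordinates of Münster, for illustration only
x, y = lonlat_to_web_mercator(7.63, 51.96)
print(f"x = {x:.0f} m, y = {y:.0f} m")
```

In practice there is no need to write this by hand; the point is only that the same location has very different numeric coordinates in the two systems, which is why the basemap and the data must agree.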

In [None]:
print(tracks.crs)

In [None]:
ax = tracks.plot(figsize=(20, 15))
cx.add_basemap(ax, crs=tracks.crs.to_string())

Interestingly, although we requested tracks for our Münster bounding box, the tracks reach destinations all over Germany. Apparently the API returns every track that lies within the bounding box at some point, not only tracks that are fully contained in it.
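
One way to check this observation is to compute, per track, whether any, all, or only some of its points fall inside the bounding box. The following is a minimal sketch on toy data: the bounding box values and the `lon`/`lat` columns are assumptions made for illustration, not the box used in the API request or the column names of our dataset (where the coordinates live in the `geometry` column).

```python
import pandas as pd

# Hypothetical bounding box (min_lon, min_lat, max_lon, max_lat), illustration only
BBOX = (7.5, 51.9, 7.7, 52.0)

# Toy points standing in for track measurements
points = pd.DataFrame({
    "track.id": ["a", "a", "a", "b", "b"],
    "lon": [7.60, 7.65, 8.50, 7.61, 7.62],
    "lat": [51.95, 51.96, 52.30, 51.95, 51.96],
})

min_lon, min_lat, max_lon, max_lat = BBOX
points["inside"] = (
    points["lon"].between(min_lon, max_lon)
    & points["lat"].between(min_lat, max_lat)
)

# per track: does it touch the box, is it fully inside, which share of points is inside
summary = points.groupby("track.id")["inside"].agg(["any", "all", "mean"])
summary.columns = ["touches_bbox", "fully_inside", "share_inside"]
print(summary)
```

Here track "a" touches the box but leaves it, while track "b" stays inside. If our reading of the API behaviour is right, every real track should touch the box, but many will not be fully inside it.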

In [None]:
tracks["time"] = pd.to_datetime(tracks["time"])

In [None]:
# let's plot ten tracks side by side to get a feeling for the spatial distribution

track_ids = tracks['track.id'].unique()
fig, ax = plt.subplots(2, 5, figsize=(50,30))
for count, canvas in enumerate(ax.flatten()):
  # plot the actual track first (plot() returns the matplotlib axes)
  single_track = tracks[tracks['track.id'] == track_ids[count]]
  track_ax = single_track.plot(ax = canvas)

  # the automatically chosen zoom level can lead to an HTTPError while fetching the background tiles
  # in that case, catch the exception and retry with a fixed zoom level
  try:
    cx.add_basemap(track_ax, crs=tracks.crs.to_string(), source=cx.providers.Stamen.TonerLite)
  except Exception:
    cx.add_basemap(track_ax, zoom=15, crs=tracks.crs.to_string(), source=cx.providers.Stamen.TonerLite)

  # for better orientation plot date of track as title
  track_ax.set_title("Track on {}".format(single_track["time"].dt.date.iloc[0]))

In [None]:
# a single track for higher resolution:
fig, ax = plt.subplots(1, 1, figsize=(50,30))
track_ax = tracks[tracks['track.id'] == tracks['track.id'].unique()[1]].plot(ax = ax, markersize = 200)
try:
  cx.add_basemap(track_ax, crs=tracks.crs.to_string(), source=cx.providers.Stamen.TonerLite)
except Exception:
  # retry with a fixed zoom level if fetching the tiles at the automatic zoom fails
  cx.add_basemap(track_ax, zoom=15, crs=tracks.crs.to_string(), source=cx.providers.Stamen.TonerLite)

## Track length and measurement intervals

In [None]:
# summary statistics of the number of entries per track
tracks.groupby("track.id")["id"].count().describe()

In [None]:
tracks.groupby("track.id")["id"].count().plot.box(vert=False, title="Number of entries per track")

In [None]:
# what are the intervals between the entries? just looking at one track
tracks[tracks['track.id'] == '61bf3b387b277d59bd102f26'].head(15)

The number of entries per track varies greatly. Measurements are taken every 5-6 s, so the longer a track lasts, the more entries it has: a track with ~60 entries covers about 5 minutes. This also explains why 2000 tracks add up to more than 700,000 rows.
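
Instead of reading the intervals off the first rows of one track, they can be computed for all tracks at once by taking the time difference between consecutive entries within each track. The sketch below uses toy data with the same `track.id` and `time` column names as our dataset; the timestamps themselves are made up for illustration.

```python
import pandas as pd

# Toy measurements: track "a" has 4 entries, track "b" has 3
samples = pd.DataFrame({
    "track.id": ["a", "a", "a", "a", "b", "b", "b"],
    "time": pd.to_datetime([
        "2021-12-19 10:00:00", "2021-12-19 10:00:05",
        "2021-12-19 10:00:10", "2021-12-19 10:00:16",
        "2021-12-20 08:30:00", "2021-12-20 08:30:06",
        "2021-12-20 08:30:11",
    ]),
})

# sort so that consecutive rows of a track are consecutive in time
samples = samples.sort_values(["track.id", "time"])

# seconds since the previous entry of the same track (NaN for the first entry)
samples["dt_seconds"] = samples.groupby("track.id")["time"].diff().dt.total_seconds()

# typical interval and total duration per track
median_interval = samples.groupby("track.id")["dt_seconds"].median()
grouped_time = samples.groupby("track.id")["time"]
duration_seconds = (grouped_time.max() - grouped_time.min()).dt.total_seconds()

print(median_interval)
print(duration_seconds)
```

Applied to `tracks`, the same `groupby(...).diff()` pattern would also reveal gaps in the recordings, which show up as unusually large `dt_seconds` values.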

## Distribution of tracks over time

In [None]:
tracks["time"] = pd.to_datetime(tracks["time"])
tracks.groupby(tracks["time"].dt.year)["track.id"].nunique().plot(kind="bar", title="Number of tracks per year")

In [None]:
print(f"Timespan: {tracks['time'].min():%Y-%m-%d} - {tracks['time'].max():%Y-%m-%d}")

# Missing data

In [None]:
msno.matrix(tracks, labels=True)

The data has large parts missing, especially among the calculated values concerning CO2 etc. GPS, track and sensor data as well as geometry and time are mostly complete.
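
To back the visual impression from the matrix plot with numbers, the share of missing values per column can be computed directly with pandas. This is a minimal sketch on a toy table; the column names `speed`, `co2` and `consumption` are placeholders and not necessarily the names in our shapefile.

```python
import pandas as pd

nan = float("nan")

# Toy table: the measured value is complete, the calculated values have gaps
toy = pd.DataFrame({
    "speed": [30.0, 42.5, 50.1, 12.0],
    "co2": [nan, nan, 7.2, nan],
    "consumption": [nan, 3.1, 2.9, nan],
})

# fraction of missing entries per column, most incomplete columns first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `tracks` would give a ranked list of the most incomplete features, which can help to decide which columns are usable for further analysis.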