# Data Pre-Processing

Is your data too messy to be utilized in 02? Look no further! This notebook walks through the data pre-processing methodology for our datasets, particularly BDD100K. We also include some helpful tips to make your data more compatible with these notebooks.


In [28]:
import networkx as nx
import osmnx as ox 
import time
from shapely.geometry import Polygon
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from algorithms import mm_utils

%matplotlib inline
#ox.__version__

# Importing your Data

If you are seriously testing your algorithm against data, chances are your dataset is huge. Blindly trying to import it into a Pandas (Geo)DataFrame is going to cause some issues, because it will attempt to load it all into memory (which is likely impossible).

In our case, we use Dask to handle this.

(In general, if you are exclusively using Pandas, Modin might be easier, since it is a drop-in replacement. But Dask can handle GeoDataFrames, unlike Modin, so we use it here.)
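To make the memory concern concrete, here is a minimal standard-library sketch of loading files a few at a time instead of all at once. It only illustrates the idea behind partitioned loading and is not how Dask is implemented; the helper name `iter_json_batches`, the batch size, and the throwaway demo files are assumptions made for illustration.

```python
import glob
import json
import os
import tempfile


def iter_json_batches(pattern, batch_size):
    """Yield lists of parsed JSON documents, at most batch_size files at a time."""
    paths = sorted(glob.glob(pattern))
    for start in range(0, len(paths), batch_size):
        batch = []
        for path in paths[start:start + batch_size]:
            with open(path) as f:
                batch.append(json.load(f))
        # Only this batch is held in memory; the caller decides what to keep.
        yield batch


# Demo on throwaway files; the rideID values here are made up.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        with open(os.path.join(tmp, f"ride_{i}.json"), "w") as f:
            json.dump({"rideID": f"ride-{i}", "gps": []}, f)

    pattern = os.path.join(tmp, "*.json")
    batch_sizes = [len(batch) for batch in iter_json_batches(pattern, 2)]

print(batch_sizes)
```

Because the function is a generator, each batch can be processed and discarded before the next one is read, which is the property that keeps memory use bounded.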

In [64]:
# Import the Dask libraries we need
import dask.dataframe as dd
import dask_geopandas as dgpd

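# Load all the JSON files lazily; orient='index' puts the field names on the index
df = dd.read_json('BDD100K/train/*.json', orient='index')
#df = dd.read_json('BDD100K/train/*.json')

# Materialize only the first partition to inspect it
df.partitions[0].compute()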

| | 0 |
|---|---|
| rideID | 0a006b7b99b3d335d0e5371422d15482 |
| accelerometer | [{'y': -0.011, 'timestamp': 1504332296645, 'z'... |
| gyro | [{'y': 0.18710000000000002, 'timestamp': 15043... |
| timelapse | False |
| locations | [{'timestamp': 1504332296000, 'longitude': -73... |
| filename | 3e22a8a2-ade4-4516-a901-f3e2524fc4d9.mov |
| startTime | 1504332296633 |
| endTime | 1504332336703 |
| id | fb468f0e9d117bbf76965c6df604926b |
| gps | [{'timestamp': 1504332297000, 'altitude': 8.57... |


Great, the data is now loaded into the notebook. You will have to repeat this step in the other notebooks: just copy the cell and put it at the start. But there's still a lot to do before we can plug the data into our algorithms. First, we need to reformat it into a standard data format.

In [33]:
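## TODO: Parallelize the following loop
## Also, maybe don't write to JSON? Whatever format works best for SQLite database

# With orient='index' the field names are row labels, so the list of GPS fixes
# for the first ride sits at row 'gps', column 0.
gps_fixes = df.partitions[0].compute().loc['gps', 0]

# Build one GeoJSON Feature per GPS fix
features = []
for row in gps_fixes:
    features.append({
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [row["longitude"], row["latitude"]],
        },
        "properties": {
            "timestamp": row["timestamp"],
            "altitude": row["altitude"],
            "speed": row["speed"],
            "vertical accuracy": row["vertical accuracy"],
            "horizontal accuracy": row["horizontal accuracy"],
        },
    })

# GPS coordinates are WGS84 longitude/latitude (EPSG:4326)
geodf = dgpd.from_geopandas(
    gpd.GeoDataFrame.from_features(features, crs="EPSG:4326"),
    npartitions=1,
)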

#geofp = {"type": "FeatureCollection", "features": geofp}
        
#geojsonfp = json.dumps(geofp)
#with open('data.json', 'w') as f:
#  f.write(geojsonfp)

# Enable GeoJSON driver
#fiona.drvsupport.supported_drivers["GeoJSON"] = "r"

#tripdata1 = gpd.read_file("data.json", enabled_drivers="GeoJSON")

That looks better. Let's export this to a database so we don't have to repeat this process every time.

In [None]:
# In our case, it makes more sense to store it into a SQLite database
# Then when we are ready to use it, we can load it smartly and call individual GeoDataFrames from the database
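The export cell above is still a placeholder. As one possible shape for it, here is a minimal sketch using only the standard-library `sqlite3` module: it stores one row per GPS fix, keyed by ride, so that individual rides can be pulled back out later. The table name `gps_fixes`, the column choice, the helper names `save_fixes` and `load_fixes`, and the sample values are all assumptions for illustration, not the schema this project settled on.

```python
import sqlite3

# Hypothetical GPS fixes, shaped like the GeoJSON Feature dictionaries built in the loop above.
features = [
    {"geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
     "properties": {"timestamp": 1504332297000, "speed": 3.2}},
    {"geometry": {"type": "Point", "coordinates": [-73.98, 40.74]},
     "properties": {"timestamp": 1504332298000, "speed": 3.5}},
]


def save_fixes(conn, ride_id, features):
    """Store one row per GPS fix, keyed by ride so rides can be loaded individually."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gps_fixes ("
        "ride_id TEXT, timestamp INTEGER, longitude REAL, latitude REAL, speed REAL)"
    )
    rows = [
        (ride_id,
         feat["properties"]["timestamp"],
         feat["geometry"]["coordinates"][0],   # longitude
         feat["geometry"]["coordinates"][1],   # latitude
         feat["properties"]["speed"])
        for feat in features
    ]
    conn.executemany("INSERT INTO gps_fixes VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def load_fixes(conn, ride_id):
    """Return the fixes of a single ride, ordered by time."""
    cur = conn.execute(
        "SELECT timestamp, longitude, latitude, speed FROM gps_fixes "
        "WHERE ride_id = ? ORDER BY timestamp",
        (ride_id,),
    )
    return cur.fetchall()


# In-memory database for the demo; pass a file path instead to persist the data.
conn = sqlite3.connect(":memory:")
save_fixes(conn, "ride-a", features)
loaded = load_fixes(conn, "ride-a")
print(loaded)
```

Keeping longitude and latitude as plain numeric columns avoids any dependency on a spatial extension; the geometry can be rebuilt from them when a ride is loaded.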


Much better. But there are still other things to check. For example: is your data fused?

In [None]:
## Display data and see if fused
import sqlite3
from sqlalchemy import create_engine

BDD100K_train = create_engine('sqlite:///BDD100K_train.db')


In our case, our data is already fused. But often you will have several datasets with asynchronous data that you will have to fuse first. We implemented a barebones method in mm_utils to handle this; here is an example of how to apply it.

Note that your data needs to be a (Geo)DataFrame. Also, the first column of every dataset needs to be the time, and all datasets must share the same time format. If you aren't sure your time format will work, we recommend converting everything to Unix time (most languages have a built-in method for this).
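The two requirements above can be illustrated with a small standard-library sketch. This is not the `mm_utils` method; it is a generic stand-in that first converts ISO 8601 strings to Unix time in milliseconds and then, for each row of one stream, attaches the latest reading from a second stream at or before that time. The helper names `to_unix_ms` and `fuse_latest`, the treatment of naive timestamps as UTC, and the sample readings are assumptions.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_ms(iso_string):
    """Convert an ISO 8601 timestamp to Unix time in milliseconds."""
    dt = datetime.fromisoformat(iso_string)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def fuse_latest(primary, secondary):
    """Attach to each (time, value) in primary the latest secondary value at or before that time.

    Both inputs must be sorted by time; rows with no earlier secondary reading get None.
    """
    secondary_times = [t for t, _ in secondary]
    fused = []
    for t, value in primary:
        i = bisect_right(secondary_times, t) - 1
        fused.append((t, value, secondary[i][1] if i >= 0 else None))
    return fused


# Made-up readings: time comes first in every row, in the same unit (milliseconds).
gps = [(1000, "fix-a"), (2000, "fix-b"), (3000, "fix-c")]
gyro = [(1500, 0.18), (2900, 0.21)]
fused = fuse_latest(gps, gyro)
print(fused)
print(to_unix_ms("1970-01-01T00:00:01.500+00:00"))
```

Matching on the latest earlier reading is only one possible policy; nearest-in-time matching or interpolation may suit some sensors better.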