# Data Pre-Processing

Is your data too messy to be utilized in 02? Look no further! This notebook walks through the data pre-processing methodology for our datasets, particularly BDD100K. We also include some helpful tips to make your data more compatible with these notebooks.

## BDD100K


In [None]:
import networkx as nx
import osmnx as ox 
import time
from shapely.geometry import Polygon
import os, io, sys
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from algorithms import mm_utils

%matplotlib inline
#ox.__version__



### Importing your Data

If you are seriously testing your algorithm against data, chances are your dataset is huge. Blindly trying to import it into a Pandas (Geo)DataFrame is going to cause some issues, because it will attempt to load it all into memory (which is likely impossible).

In our case, we use Dask to handle this.

(In general, if you are exclusively using Pandas, Modin might be easier, as it is drop-in compatible. But Dask can handle GeoDataFrames, unlike Modin, so we will use it here.)
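
The memory problem described above comes from materializing every record at once. As a minimal sketch of the alternative idea (this is plain standard-library Python, not Dask, and `iter_records` is a hypothetical helper), records can be parsed lazily, one line at a time, so only the current record is held in memory:

```python
import io
import json

def iter_records(lines):
    """Yield one parsed JSON object per non-empty line, lazily."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical in-memory stand-in for a large line-delimited JSON file.
sample = io.StringIO('{"speed": 4.0}\n{"speed": 6.0}\n\n{"speed": 5.0}\n')

total = 0.0
count = 0
for record in iter_records(sample):
    # Only the current record is in memory at this point.
    total += record["speed"]
    count += 1

mean_speed = total / count
```

Dask Bags apply the same one-record-at-a-time idea, but also split the work across partitions and cores.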

In [None]:
# Fast JSON library
import ujson as json

## Note for future self: If you need to muck around with JSON formatting, jq might be the way to go
## Transforming to/from line-delimited, for example, is far simpler

# Import the Dask libraries we need
import dask.bag as db

In [None]:
# If you try to load the BDD100K files directly into Dask, you'll run into some issues.
# Dask Bags assume every line in a json file is a distinct json object.
# So because BDD100K uses pretty json formatting, it will not load properly.
# So we have to first remove all the newline characters in all of the files.

# Note that the database size is huge, so make sure you have adequate disk space

path = 'BDD100K/train/'

for filename in os.listdir(path):
    if 'json' in filename and 'processed' not in filename:
        filepath = os.path.join(path, filename)
        # Fun fact: the BDD100K info dataset has a corrupted (empty) JSON file.
        # This caused an enormous headache on my end while debugging,
        # so now we only process a file if it is non-empty.
        # Otherwise, we skip the processing part, but still delete the file.
        # (A more robust method would be to `try: json.loads`, but this would greatly increase processing time)
        if os.path.getsize(filepath) > 0:
            with open(filepath, 'r') as f:
                text = f.read().replace('\n', '')
            # Some of the files don't have GPS data.
            # These are unusable to us, so we don't want them.
            # In BDD100K, there are about 20000 files like this... quite unfortunate
            if 'gps' in text:
                with open(os.path.join(path, 'processed-' + filename), 'w') as fp:
                    print(text, file=fp)
        # These files take up a lot of space on my hard drive, so I will remove them here.
        # You may wish not to do this
        os.remove(filepath)
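
If processing time is less of a concern than robustness, the newline stripping can be replaced by parsing each file and re-serializing it compactly. The sketch below is one possible way to do that with the standard `json` module; `to_single_line` is a hypothetical helper, not part of this notebook, and it rejects empty or invalid files in the same step:

```python
import json

def to_single_line(pretty_text):
    """Return compact one-line JSON, or None if the text is empty or invalid."""
    if not pretty_text.strip():
        return None
    try:
        obj = json.loads(pretty_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    # Compact separators keep the output on a single line with no padding.
    return json.dumps(obj, separators=(",", ":"))

# Hypothetical pretty-printed input resembling a BDD100K info file.
pretty = '{\n  "gps": [\n    {"latitude": 1.5, "longitude": 2.5}\n  ]\n}'
line = to_single_line(pretty)
```

Parsing every file is slower than a plain string replace, which is the trade-off noted in the comments of the cell above.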


We went ahead and removed newline characters from our files, and threw away incompatible JSON files. Now we load the JSON files into our Dask bag, and initialize our reformatting function.

This function is customized to the BDD100K dataset to pull the necessary info out and reformat it into GeoJSON. You will have to write your own function customized for whatever dataset you choose to use.
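
As a rough template for writing your own, here is a sketch for a hypothetical dataset whose records carry `lat`, `lon`, and `t` keys (these key names and both function names are assumptions for illustration). The one detail that is easy to get wrong is that GeoJSON stores coordinates as longitude first, then latitude:

```python
def record_to_feature(record):
    """Convert one hypothetical record into a GeoJSON Point feature."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON orders coordinates as [longitude, latitude].
            "coordinates": [record["lon"], record["lat"]],
        },
        "properties": {"timestamp": record["t"]},
    }

def records_to_collection(records):
    """Wrap all converted records in a single FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [record_to_feature(r) for r in records],
    }

# Hypothetical sample records.
records = [
    {"lat": 40.7, "lon": -74.0, "t": 1503834000000},
    {"lat": 40.8, "lon": -74.1, "t": 1503834001000},
]
collection = records_to_collection(records)
```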

In [None]:
# Now we load all the JSON files

dfbag = db.read_text('BDD100K/train/processed-*.json').map(json.loads)

# This is a helper function for reformatting

def bdd_reformat(jsonf):
    # Records without a 'gps' entry are returned unchanged
    if jsonf.get('gps') is None:
        return jsonf
    listf = []
    for item in jsonf['gps']:
        listf.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [item["longitude"], item["latitude"]]
            },
            "properties": {
                "timestamp": item["timestamp"],
                "altitude": item["altitude"],
                "speed": item["speed"],
                "vertical accuracy": item["vertical accuracy"],
                "horizontal accuracy": item["horizontal accuracy"]
            }
        })
    return {"type": "FeatureCollection", "features": listf}


While our data is now in GeoJSON, it is still stored as a Python dictionary. So we will export the files so we don't have to repeat this process later.

In [None]:
path = 'BDD100K/train/'
files = 'postprocessed-*.geojson'

# create a text trap and redirect stdout
text_trap = io.StringIO()
sys.stdout = text_trap

### TODO: Consider outputting to better file format, e.g. Avro, Parquet

dfbag = dfbag.map(bdd_reformat)
dfbag.map(json.dumps).to_textfiles(path + files)

# now restore stdout function
sys.stdout = sys.__stdout__

# I don't have the harddrive space to store both the intermediate and final files
# So I delete the old ones here
# You may prefer to not do this
for filename in os.listdir(path):
    if (not 'postprocessed' in filename):
        filepath = os.path.join(path, filename)
        os.remove(filepath)

# Note-- now may be a good time to zip and compress the processed files, in case something happens


Now our data has been post-processed into a format that is compatible with map matching algorithms. The simplest way to use it is to load it all into a Dask Bag and `take(n, npartitions=n)` as needed (alternatively, you can load each GeoJSON file as a partition in a Dask GeoDataFrame, but this has complications). However, if you wish to do more in-depth analysis of the dataset, thousands of JSON files aren't exactly optimal. We could try to apply functions on Dask Bags, but the simpler solution is to store the files in a SQLite database, which we can then query for filtered data quickly.

Note: if you have no interest in utilizing the GeoJSON structure, you should create a database from the unprocessed files.
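
To make the "query filtered data quickly" idea concrete, here is a minimal sketch using only the standard `sqlite3` module. It is not the schema that gpx2spatialite produces and it uses no spatial extension; the `points` table, its columns, and the `track-0` identifier are assumptions for illustration:

```python
import sqlite3

# Hypothetical FeatureCollection in the shape produced by the reformatting step.
collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-74.0, 40.7]},
         "properties": {"timestamp": 1503834000000, "speed": 5.67}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-74.1, 40.8]},
         "properties": {"timestamp": 1503834001000, "speed": 4.61}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-74.2, 40.9]},
         "properties": {"timestamp": 1503834002000, "speed": 4.39}},
    ],
}

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    "CREATE TABLE points "
    "(track_id TEXT, timestamp INTEGER, lon REAL, lat REAL, speed REAL)"
)
rows = [
    ("track-0",
     f["properties"]["timestamp"],
     f["geometry"]["coordinates"][0],
     f["geometry"]["coordinates"][1],
     f["properties"]["speed"])
    for f in collection["features"]
]
conn.executemany("INSERT INTO points VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Filter on the database side instead of scanning every JSON file.
fast = conn.execute(
    "SELECT timestamp, speed FROM points WHERE speed > ? ORDER BY timestamp",
    (4.5,),
).fetchall()
conn.close()
```

A spatially aware database adds geometry columns and spatial indexes on top of this, which is what the external tools in the following cells are for.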

In [None]:
# In our case, it makes more sense to store it into a SQLite database, but MySQL, MariaDB, or other formats work perfectly well.

# Fortunately, there are a lot of tools to convert GeoJSON to a spatially informed database
# So instead of trying to do it ourselves, we will use an external tool to do the heavy lifting
# Aren't you glad we processed the data into a more standard format?

# Run this only once
#! pip install geojson-to-sqlite
#! sudo pamac install spatialite-gui # Optional, but improves our database
#dfbag = db.read_text('BDD100K/train/postprocessed-*.geojson').map(json.loads)

In [None]:
path = 'BDD100K/train/'

# I am not adept at SQL, so I don't know the best way to store these files in an SQL table
# As a result, I will use an external program which converts GPS tracks to an SQL database
# However, it requires our files to be GPX, so let's do that first...

! bash geojson_to_gpx.sh

In [None]:
! gpx2spatialite create_db BDD100K/postprocessed_BDD100K.db
! gpx2spatialite import -d BDD100K/postprocessed_BDD100K.db -u Gabe BDD100K/train/*.gpx

Let's run a query to make sure it works.

In [None]:
### WIP


#import sqlite3
#conn = sqlite3.connect('BDD100K/postprocessed_BDD100K.db')
#conn.enable_load_extension(True)

# Now we load spatialite
#conn.execute('SELECT load_extension("mod_spatialite")')
#conn.execute('SELECT InitSpatialMetaData(1);')

# libspatialite
#conn.execute('SELECT load_extension("libspatialite")')
#conn.execute('SELECT InitSpatialMetaData();')

#cur = conn.cursor()
#cur.execute('SELECT ')

#conn.commit()
#conn.close()
#del conn

All done, right? Not quite. For example: is your data fused?

In [None]:
## Display data and see if fused

# Note: If you chose to store all your data in one SQL Table,
# Dask DataFrames can import from that.
gdfbag = dfbag.map(gpd.GeoDataFrame.from_features)

In our case, our data is already fused. But often you will have several datasets with asynchronous data that you will have to fuse first. We implemented a barebones method in mm_utils to handle this; here is an example of how to apply it.

Note that your data needs to be a (Geo)DataFrame or GeoJSON. Also, the first column of every dataset needs to be the time, and all datasets must share the same time format. If you aren't sure your time format will work, we recommend converting it all to Unix time (most languages have a built-in method to do this).
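
As a sketch of such a conversion in Python, assuming the timestamps are UTC strings in a `YYYY-MM-DD HH:MM:SS` layout (the format string and the helper name `to_unix_ms` are assumptions, so adjust them to your data), the standard `datetime` module is enough. The result is in milliseconds, which is what the timestamps in this notebook's output appear to use:

```python
from datetime import datetime, timezone

def to_unix_ms(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a UTC timestamp string and return Unix time in milliseconds."""
    # strptime returns a naive datetime, so attach UTC explicitly;
    # otherwise timestamp() would interpret it in the local time zone.
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

unix_ms = to_unix_ms("2017-08-27 11:40:00")

# Round trip back to a datetime to sanity-check the conversion.
round_trip = datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)
```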

In [None]:
# We create simulated asynchronous data

# Get columns of data
origdf = gdfbag.take(1)[0]



In [None]:
speed = origdf[['timestamp','speed']]

# Create rng to create noisy data
rng = np.random.default_rng()
rows = []
for i in range(len(speed) - 1):
    t0 = speed['timestamp'].iloc[i]
    s0 = speed['speed'].iloc[i]
    s1 = speed['speed'].iloc[i + 1]
    # Generate speed values based on original, at random timestamps:
    # midpoint of neighbouring speeds plus Gaussian noise scaled by their gap
    rows.append({'timestamp': t0 + rng.random() * 1000,
                 'speed': (s0 + s1) / 2 + rng.normal(0, np.abs(s0 - s1) / 4)})
# Build the DataFrame in one go; the index is already 0..n-1
unfused_spd = pd.DataFrame(rows, columns=['timestamp', 'speed'])
unfused_spd

Unnamed: 0,timestamp,speed
0,1503834000000.0,4.695797
1,1503834000000.0,4.47984
2,1503834000000.0,4.365951
3,1503834000000.0,4.174768
4,1503834000000.0,3.970857
5,1503834000000.0,3.870786
6,1503834000000.0,3.882143
7,1503834000000.0,4.311895
8,1503834000000.0,4.999623
9,1503834000000.0,5.771902


Now we wish to fuse this DataFrame with our original dataframe (excluding the original speed column)

In [None]:
df1 = origdf[['timestamp', 'altitude', 'vertical accuracy', 'horizontal accuracy', 'geometry']]

df_prox = mm_utils.fuse(df1,unfused_spd,'timestamp','nearest neighbor')
df_avg = mm_utils.fuse(df1,unfused_spd,'timestamp','average')

Now let's see what the speed columns look like side-by-side

In [None]:
# This cell sets up styling to facilitate comparison
from IPython.display import display_html

df_sty = origdf[['speed']].style.set_table_attributes("style='display:inline'").set_caption('original df')
df_prox_sty = df_prox[['speed']].style.set_table_attributes("style='display:inline'").set_caption('proximity fuse')
df_avg_sty = df_avg[['speed']].style.set_table_attributes("style='display:inline'").set_caption('average fuse')

In [None]:
space = "\xa0" * 10
display_html(df_sty._repr_html_() + space
             + df_prox_sty._repr_html_() + space
             + df_avg_sty._repr_html_(), raw=True)


index,original df,proximity fuse,average fuse
0,5.67,4.695797,4.695797
1,4.61,4.695797,4.695797
2,4.39,4.47984,4.695797
3,4.35,4.365951,4.695797
4,4.1,4.174768,4.695797
5,3.89,3.970857,3.970857
6,3.86,3.870786,3.970857
7,3.89,3.882143,3.882143
8,4.5,4.999623,3.882143
9,5.61,5.771902,3.882143


Because these are undersampled data points with heavy noise, the averaging method performs rather poorly. If the sample size is far larger than the main dataset, averaging performs a lot better, whereas nearest neighbor becomes sensitive to high variance (it only considers the closest point, which may be an outlier).
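
To illustrate the difference between the two strategies, here is a small standard-library sketch. It is not the implementation in `mm_utils.fuse`; the function names, the `window` parameter, and the definition of "average" as the mean of all samples within a time window are assumptions made for this example:

```python
from bisect import bisect_left

def fuse_nearest(base_times, sample_times, sample_values):
    """For each base time, take the value of the closest sample.

    sample_times must be sorted in ascending order.
    """
    fused = []
    for t in base_times:
        i = bisect_left(sample_times, t)
        # Candidates are the neighbors on either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sample_times)]
        best = min(candidates, key=lambda j: abs(sample_times[j] - t))
        fused.append(sample_values[best])
    return fused

def fuse_average(base_times, sample_times, sample_values, window):
    """For each base time, average all samples within +/- window.

    Falls back to the nearest sample when the window is empty.
    """
    nearest = fuse_nearest(base_times, sample_times, sample_values)
    fused = []
    for t, fallback in zip(base_times, nearest):
        inside = [v for s, v in zip(sample_times, sample_values)
                  if abs(s - t) <= window]
        fused.append(sum(inside) / len(inside) if inside else fallback)
    return fused

# Hypothetical data: base timestamps and asynchronous speed samples (ms).
base_times = [0, 1000, 2000]
sample_times = [100, 950, 1100, 1900]
sample_values = [4.0, 5.0, 7.0, 6.0]

nearest = fuse_nearest(base_times, sample_times, sample_values)
averaged = fuse_average(base_times, sample_times, sample_values, window=200)
```

At the middle timestamp, the nearest-neighbor variant picks the single closest sample, while the averaging variant blends both nearby samples, which damps an outlier but needs enough samples per window to be meaningful.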

## Large-Scale Dataset

First, we import the data and write it as GeoJSON.
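
For readers without GeoPandas at hand, the track conversion can also be sketched with the standard library alone. This assumes, as the track-processing cell in this section does, that each row is tab-separated with x and y in the first two columns and a timestamp in the third; `track_to_geojson` is a hypothetical helper name:

```python
import csv
import io
import json

def track_to_geojson(lines):
    """Convert tab-separated (x, y, timestamp) rows into a FeatureCollection."""
    features = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [float(row[0]), float(row[1])],
            },
            # The timestamp is kept as a string, as read from the file.
            "properties": {"timestamp": row[2]},
        })
    return {"type": "FeatureCollection", "features": features}

# Hypothetical two-point track standing in for a .track file.
sample = io.StringIO("13.40\t52.52\t0.0\n13.41\t52.53\t1.0\n")
collection = track_to_geojson(sample)
text = json.dumps(collection)
```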

In [None]:
# Processing the coordinate track first

import csv
from shapely.geometry import Point
from shapely.geometry import LineString
path = 'map-matching-dataset/'


In [None]:

for filename in os.listdir(path):
    if '.track' in filename:
        gdf = gpd.GeoDataFrame([])
        
        filepath = os.path.join(path, filename)
        with open(filepath,newline='') as csvf:
            data = csv.reader(csvf,delimiter = '\t')
            # To be honest this is a lazy approach, but we only have 100 tracks anyway, so it's fine...
            for row in data:
                pt = Point(float(row[0]), float(row[1]))
                gdf = pd.concat([gdf, gpd.GeoDataFrame([{'timestamp':row[2], 'geometry':pt}])])
        gdf.to_file(filepath[:-6] + 'track' + '.geojson', driver = 'GeoJSON')

In [None]:
# Now we construct our network

def string_to_coords(string):
    return tuple(map(float,string.split('\t')))

def string_to_ints(string):
    return tuple(map(int,string.split('\t')))


def helper_fn1(df,ls):
    return df.iloc[ls]

def helper_fn2(ls):
    return LineString(list(ls.loc[:,'geometry']))

for filename in os.listdir(path):
    if '.nodes' in filename:
        filepath = os.path.join(path, filename)
        with open(filepath,newline='') as f:
            fdata = f.read()
            flist = fdata.split('\n')
            flist = list(filter(None, flist))
            flist = map(Point,list(map(string_to_coords,flist)))
            ngdf = gpd.GeoDataFrame(flist, columns=['geometry'])
            # We need to do arcs while we do nodes, b.c. it is indexed by rows in .nodes
            with open(filepath[:-6] + '.arcs', newline='') as f2:
                arcdata = f2.read()
                arclist = arcdata.split('\n')
                arclist = list(filter(None,arclist))
                arclist = list(map(string_to_ints,arclist))
                # We need to store the source and target nodes
                arclistu = [arc[0] for arc in arclist]
                arclistv = [arc[1] for arc in arclist]
                # Materialize as lists: a bare map() iterator would be exhausted after one pass
                arclist = list(map(list, arclist))
                arcpts = list(map(lambda x: helper_fn1(ngdf,x), arclist))
                arclist = list(map(helper_fn2, arcpts))
                agdf = gpd.GeoDataFrame({'geometry' : arclist, 'u' : arclistu, 'v' : arclistv})
        ngdf.to_file(filepath[:-6] + 'nodes' + '.geojson', driver = 'GeoJSON')
        agdf.to_file(filepath[:-6] + 'arcs' + '.geojson', driver = 'GeoJSON')

In [None]:
# Ground Truth
# This is the RIGHT way to do this... much faster. When I get time, rewrite the previous code block
for filename in os.listdir(path):
    if 'arcs.geojson' in filename:
        filepath = os.path.join(path, filename)
        agdf = gpd.read_file(filepath)
        with open(filepath[:-12] + '.route','r') as f:
            data = f.read()
            dlist = data.split("\n")
            dlist = list(filter(None, dlist))
            dlist = list(map(int,dlist))
            rgdf = agdf.iloc[dlist]
        rgdf.to_file(filepath[:-12] + 'route' + '.geojson', driver = 'GeoJSON', index = True)


In [None]:
for filename in os.listdir(path):
    if ('geojson' in filename):
        filepath = os.path.join(path, filename)
        with open(filepath, 'r+') as f:
            text = f.read().replace('\n', '')
            f.seek(0)
            f.write(text)
            f.truncate()

# Data Pre-Processing

Is your data too messy to be utilized in 02? Look no further! This notebook walks through the data pre-processing methodology for our datasets, particularly BDD100K. We also include some helpful tips to make your data more compatible with these notebooks.

## BDD100K


In [None]:
import networkx as nx
import osmnx as ox 
import time
from shapely.geometry import Polygon
import os, io, sys
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from algorithms import mm_utils

%matplotlib inline
#ox.__version__



### Importing your Data

If you are seriously testing your algorithm against data, chances are your dataset is huge. Blindly trying to import it into a Pandas (Geo)DataFrame is going to cause some issues, because it will attempt to load it all into memory (which is likely impossible).

In our case, we use Dask to handle this.

(In general, if you are exclusively using Pandas, Modin might be easier, as it is drop-in compatible. But Dask can handle GeoDataFrames (unlike Modin), so we will use that here) 

In [None]:
# Fast JSON library
import ujson as json

## Note for future self: If you need to muck around with JSON formatting, jq might be the way to go
## Transforming to/from line-delimited, for example, is far simpler

# Import the Dask libraries we need
import dask.bag as db

In [None]:
# If you try to load the BDD100K files directly into Dask, you'll run into some issues.
# Dask Bags assume every line in a json file is a distinct json object.
# So because BDD100K uses pretty json formatting, it will not load properly.
# So we have to first remove all the newline characters in all of the files.

# Note that the database size is huge, so make sure you have adequate disk space

path = 'BDD100K/train/'

for filename in os.listdir(path):
    if ('json' in filename and not 'processed' in filename):
        filepath = os.path.join(path, filename)
        f = open(filepath, 'r')
        f = f.read().replace('\n', '')
        if 'gps' in f.read(): # Some of the files don't have GPS data.
# These are unusuable to us, so we don't want them
# In BDD100K, there's about 20000 files like this... quite unfortunate
            if os.path.getsize(os.path.join(path,filename)) == 0: 
    # Fun fact: the BDD100K info dataset has a corrupted (empty) JSON file
    # This caused an enormous headache on my end while debugging
    # So now we will only process it if it's non-empty.
    # Otherwise, we skip the processing part, but still delete the file
    # (A more robust method would be to `try: json.loads`, but this would greatly increase processing time)
                with open(path + 'processed-' + filename, 'w') as fp:
                    print(f, file=fp)
        # These files take up a lot of space on my harddrive, so I will remove them here.
        # You may wish not to do this
        os.remove(filepath)


We went ahead and removed newline characters from our files, and threw away incompatible JSON files. Now we load the JSON files into our Dask bag, and initialize our reformatting function.

This function is customized to the BDD100K dataset to pull the necessary info out and reformat it into GeoJSON. You will have to write your own function customized for whatever dataset you choose to use.

In [None]:
# Now we load all the JSON files

dfbag = db.read_text('BDD100K/train/processed-*.json').map(json.loads)

# This is a helper function for reformatting

def bdd_reformat(jsonf):        
    listf = []
    if jsonf.get('gps') != None:
        for item in jsonf['gps']:
            listf.append({"type": "Feature",
          "geometry": {
            "type": "Point",
            "coordinates": [item["longitude"], item["latitude"]]
          },
          "properties": {
            "timestamp": item["timestamp"],
            "altitude": item["altitude"],
            "speed": item["speed"],
            "vertical accuracy": item["vertical accuracy"],
            "horizontal accuracy": item["horizontal accuracy"]
          }})
        geojsonf = {"type": "FeatureCollection", "features": listf}
        return geojsonf
    else:
        return jsonf

ValueError: ('No files found', 'BDD100K/train/processed-*.json')

While our data is now in GeoJSON, it is still stored as a Python dictionary. So we will export the files so we don't have to repeat this process later.

In [None]:
path = 'BDD100K/train/'
files = 'postprocessed-*.geojson'

# create a text trap and redirect stdout
text_trap = io.StringIO()
sys.stdout = text_trap

### TODO: Consider outputting to better file format, e.g. Avro, Parquet

dfbag = dfbag.map(bdd_reformat)
dfbag.map(json.dumps).to_textfiles(path + files)

# now restore stdout function
sys.stdout = sys.__stdout__

# I don't have the harddrive space to store both the intermediate and final files
# So I delete the old ones here
# You may prefer to not do this
for filename in os.listdir(path):
    if (not 'postprocessed' in filename):
        filepath = os.path.join(path, filename)
        os.remove(filepath)

# Note-- now may be a good time to zip and compress the processed files, in case something happens

NameError: name 'dfbag' is not defined

Now our data has been post-processed to a format that is compatible with map matching algorithms. The simplest way to utilize the data is to load it all into a Dask Bag, and `take(n,npartitions=n)` as needed (alternatively, you can load each GeoJSON as a partition in a Dask GeoDataFrame-- but this has complications). However, if you wish to do more in-depth data analysis on the dataset, thousands of JSON files aren't exactly optimal. We could try to apply functions on Dask Bags, but the simpler solution is to store the files into a SQLite database. Then we can access the database as needed and access filtered data quickly. 

Note: if you have no interest in utilizing the GeoJSON structure, you should create a database from the unprocessed files

In [None]:
# In our case, it makes more sense to store it into a SQLite database, but MySQL, MariaDB, or other formats work perfectly well.

# Fortunately, there are a lot of tools to convert GeoJSON to a spatially informed database
# So instead of trying to do it ourselves, we will use an external tool to do the heavy lifting
# Aren't you glad we processed the data into a more standard format?

# Run this only once
#! pip install geojson-to-sqlite
#! sudo pamac install spatialite-gui # Optional, but improves our database
#dfbag = db.read_text('BDD100K/train/postprocessed-*.json').map(json.loads)

In [None]:
path = 'BDD100K/train/'

# I am not adept at SQL, so I don't know the best way to store these files in an SQL table
# As a result, I will use an external program which converts GPS tracks to an SQL database
# However, it requires our files to be GPX, so let's do that first...

! bash geojson_to_gpx.sh

In [None]:
! gpx2spatialite create_db BDD100K/postprocessed_BDD100K.d
! gpx2spatialite import -d BDD100K/postprocessed_BDD100K.db -u Gabe BDD100K/train/*.gpx

Let's run a query to make sure it works.

In [None]:
### WIP


#import sqlite3
#conn = sqlite3.connect('BDD100K/postprocessed-BDD100K.db')
#conn.enable_load_extension(True)

# Now we load spatialite
#conn.execute('SELECT load_extension("mod_spatialite")')
#conn.execute('SELECT InitSpatialMetaData(1);')

# libspatialite
#conn.execute('SELECT load_extension("libspatialite")')
#conn.execute('SELECT InitSpatialMetaData();')

#cur = conn.cursor()
#cur.execute('SELECT ')

#conn.commit()
#conn.close()
#del conn

All done, right? Not quite. For example: is your data fused?

In [None]:
## Display data and see if fused

# Note: If you chose to store all your data in one SQL Table,
# Dask DataFrames can import from that.
gdfbag = dfbag.map(gpd.GeoDataFrame.from_features)

In our case, our data is already fused. But often you will have several datasets with asynchronous data that you will have to fuse first. We implemented a barebones method in mm_utils to handle this; here is an example of how to apply it.

Note that your data needs to be a (Geo)DataFrame or GeoJSON. Also, the first column of all the datasets needs to be the time, and must all share the same time formatting. If you aren't sure your time format will work, we recommmend converting it all to Unix time (most languages have a built-in method to do this)

In [None]:
# We create simulated asynchronous data

# Get columns of data
origdf = gdfbag.take(1)[0]



In [None]:
speed = origdf[['timestamp','speed']]

# Create rng to create noisy data
rng = np.random.default_rng()
unfused_spd = pd.DataFrame([])
for i in speed.index[:-1]:
    # Generate speed values based on original, at random timestamps
    row1 = pd.DataFrame([[speed.iloc[i][0] + rng.random()*1000,(speed.iloc[i][1] + speed.iloc[i+1][1])/2 + np.random.normal(0, (np.abs(speed.iloc[i][1] - speed.iloc[i+1][1]))/4)]], columns = ['timestamp','speed'], index = [i])
    unfused_spd = pd.concat([unfused_spd, row])
# Reset indices
unfused_spd = unfused_spd.sort_index().reset_index(drop=True)
unfused_spd

Unnamed: 0,timestamp,speed
0,1503834000000.0,4.695797
1,1503834000000.0,4.47984
2,1503834000000.0,4.365951
3,1503834000000.0,4.174768
4,1503834000000.0,3.970857
5,1503834000000.0,3.870786
6,1503834000000.0,3.882143
7,1503834000000.0,4.311895
8,1503834000000.0,4.999623
9,1503834000000.0,5.771902


Now we wish to fuse this DataFrame with our original dataframe (excluding the original speed column)

In [None]:
df1 = origdf[['timestamp', 'altitude', 'vertical accuracy', 'horizontal accuracy', 'geometry']]

df_prox = mm_utils.fuse(df1,unfused_spd,'timestamp','nearest neighbor')
df_avg = mm_utils.fuse(df1,unfused_spd,'timestamp','average')

Now let's see what the speed columns look like side-by-side.

In [None]:
# This cell sets up styling to facilitate comparison
from IPython.display import display_html

df_sty = origdf[['speed']].style.set_table_attributes("style='display:inline'").set_caption('original df')
df_prox_sty = df_prox[['speed']].style.set_table_attributes("style='display:inline'").set_caption('proximity fuse')
df_avg_sty = df_avg[['speed']].style.set_table_attributes("style='display:inline'").set_caption('average fuse')

In [None]:
space = "\xa0" * 10
display_html(df_sty._repr_html_() + space
             + df_prox_sty._repr_html_() + space
             + df_avg_sty._repr_html_(), raw=True)


index,speed (original df),speed (proximity fuse),speed (average fuse)
0,5.67,4.695797,4.695797
1,4.61,4.695797,4.695797
2,4.39,4.47984,4.695797
3,4.35,4.365951,4.695797
4,4.1,4.174768,4.695797
5,3.89,3.970857,3.970857
6,3.86,3.870786,3.970857
7,3.89,3.882143,3.882143
8,4.5,4.999623,3.882143
9,5.61,5.771902,3.882143


Because the simulated speed readings are undersampled and heavily noisy, the averaging method performs rather poorly here. If the dataset being fused is sampled far more densely than the main dataset, averaging performs a lot better, whereas nearest neighbor becomes sensitive to high variance (it only considers the closest point, which may be an outlier).
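
To make the difference between the two strategies concrete, here is a minimal sketch in plain pandas. It is not the implementation behind `mm_utils.fuse`. The helpers `fuse_nearest` and `fuse_average`, their parameters, and the toy data are hypothetical and only illustrate the idea: nearest neighbor keeps the single closest sample, while averaging pools every sample that falls closest to a given row of the main dataset.

```python
import numpy as np
import pandas as pd

def fuse_nearest(main, other, on):
    """For each row of `main`, attach the row of `other` with the closest `on` value."""
    return pd.merge_asof(main.sort_values(on), other.sort_values(on),
                         on=on, direction="nearest")

def fuse_average(main, other, on, value):
    """For each row of `main`, average the `other` samples lying closest to it."""
    out = main.sort_values(on).reset_index(drop=True)
    m = out[on].to_numpy()
    o = other[on].to_numpy()
    # Locate the two main-dataset neighbours of every sample (needs >= 2 main rows)
    idx = np.clip(np.searchsorted(m, o), 1, len(m) - 1)
    left, right = m[idx - 1], m[idx]
    # Assign each sample to whichever neighbour is closer in time
    nearest = np.where(o - left <= right - o, idx - 1, idx)
    means = other.groupby(nearest)[value].mean()
    # Main rows that received no samples are left as NaN
    out[value] = means.reindex(out.index).to_numpy()
    return out

# Toy data: three main rows, four asynchronous speed samples
main = pd.DataFrame({"timestamp": [0.0, 10.0, 20.0], "altitude": [5.0, 6.0, 7.0]})
other = pd.DataFrame({"timestamp": [1.0, 2.0, 9.0, 19.0], "speed": [1.0, 3.0, 5.0, 7.0]})

nearest_df = fuse_nearest(main, other, "timestamp")
average_df = fuse_average(main, other, "timestamp", "speed")
```

In this toy case the first main row receives the single sample at `timestamp` 1.0 under nearest neighbor, but the mean of the samples at 1.0 and 2.0 under averaging. With many samples per main row the mean smooths out noise; with few, it has little to work with.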

## Large-Scale Dataset

First, we import the data and write it out as GeoJSON.

In [None]:
# Processing the coordinate track first

import csv
from shapely.geometry import Point
from shapely.geometry import LineString
path = 'map-matching-dataset/'


In [None]:

for filename in os.listdir(path):
    if '.track' in filename:
        gdf = gpd.GeoDataFrame([])
        
        filepath = os.path.join(path, filename)
        with open(filepath,newline='') as csvf:
            data = csv.reader(csvf,delimiter = '\t')
            # To be honest this is a lazy approach, but we only have 100 tracks anyway, so it's fine...
            for row in data:
                pt = Point(float(row[0]), float(row[1]))
                gdf = pd.concat([gdf, gpd.GeoDataFrame([{'timestamp':row[2], 'geometry':pt}])])
        gdf.to_file(filepath[:-6] + 'track' + '.geojson', driver = 'GeoJSON')

In [None]:
# Now we construct our network

def string_to_coords(string):
    return tuple(map(float,string.split('\t')))

def string_to_ints(string):
    return tuple(map(int,string.split('\t')))


def helper_fn1(df,ls):
    return df.iloc[ls]

def helper_fn2(ls):
    return LineString(list(ls.loc[:,'geometry']))

for filename in os.listdir(path):
    if '.nodes' in filename:
        filepath = os.path.join(path, filename)
        with open(filepath,newline='') as f:
            fdata = f.read()
            flist = fdata.split('\n')
            flist = list(filter(None, flist))
            flist = map(Point,list(map(string_to_coords,flist)))
            ngdf = gpd.GeoDataFrame(flist, columns=['geometry'])
            # We need to do arcs while we do nodes, b.c. it is indexed by rows in .nodes
            with open(filepath[:-6] + '.arcs', newline='') as f2:
                arcdata = f2.read()
                arclist = arcdata.split('\n')
                arclist = list(filter(None,arclist))
                arclist = list(map(string_to_ints,arclist))
                # We need to store the source and target nodes
                arclistu = [arc[0] for arc in arclist]
                arclistv = [arc[1] for arc in arclist]
                # Look up the node rows of each arc (iloc needs a list, not a tuple)
                arcpts = [helper_fn1(ngdf, list(arc)) for arc in arclist]
                # Turn each group of node points into a LineString
                arclist = list(map(helper_fn2, arcpts))
                agdf = gpd.GeoDataFrame({'geometry' : arclist, 'u' : arclistu, 'v' : arclistv})
        ngdf.to_file(filepath[:-6] + 'nodes' + '.geojson', driver = 'GeoJSON')
        agdf.to_file(filepath[:-6] + 'arcs' + '.geojson', driver = 'GeoJSON')

In [None]:
# Ground Truth
# This is the RIGHT way to do this... much faster. When I get time, rewrite the previous code block
for filename in os.listdir(path):
    if 'arcs.geojson' in filename:
        filepath = os.path.join(path, filename)
        agdf = gpd.read_file(filepath)
        with open(filepath[:-12] + '.route','r') as f:
            data = f.read()
            dlist = data.split("\n")
            dlist = list(filter(None, dlist))
            dlist = list(map(int,dlist))
            rgdf = agdf.iloc[dlist]
        rgdf.to_file(filepath[:-12] + 'route' + '.geojson', driver = 'GeoJSON', index = True)


In [None]:
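# Strip newlines so each GeoJSON file sits on a single line
# (Dask Bags assume every line in a json file is a distinct json object)
for filename in os.listdir(path):
    if 'geojson' in filename: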
        filepath = os.path.join(path, filename)
        with open(filepath, 'r+') as f:
            text = f.read().replace('\n', '')
            f.seek(0)
            f.write(text)
            f.truncate()
