# Summary

The KITTI Detection and Tracking Benchmarks contain copies of subsets of the Raw Data.
The Detection Devkit includes `train_mapping.txt` and `train_rand.txt` files that map
Detection Benchmark files to Raw Data drives, but this mapping covers only training
examples.  The Tracking Devkit contains `seqmap` files that denote the number of
frames in the Tracking Benchmark train and test sequences, but these `seqmap` files
do not map back to the Raw Data.  Upon writing the KITTI authors, we discovered that
the Detection and Tracking benchmarks use the same (or similar) splits, and moreover
that the Benchmark <=> Raw Data mapping was not recorded when the Benchmark files
were assembled.

Below we reconstruct the Benchmark <=> Raw Data (Sync) mapping in order to:
 * allow us to join Benchmark labels against the Raw Data.
 * study the density of Benchmark labels.

Detection Benchmark: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d  
Tracking Benchmark: http://www.cvlibs.net/datasets/kitti/eval_tracking.php  
Raw Data: http://www.cvlibs.net/datasets/kitti/raw_data.php  
Detection Devkit: https://s3.eu-central-1.amazonaws.com/avg-kitti/devkit_object.zip  
Tracking Devkit: https://s3.eu-central-1.amazonaws.com/avg-kitti/devkit_tracking.zip  

## Setup

To run this notebook locally, start a Jupyter instance in the `oarphpy/full` environment:

```
docker run \
    --name=psegs-kitti-ext -it --net=host \
    -v/:/outer_root -v `pwd`:/opt/psegs-kitti-ex -w /opt/psegs-kitti-ex \
      oarphpy/full:0.0.4 jupyter notebook --allow-root
```

Copies of assets that this notebook creates are included in the repo under the `assets` subdirectory.  You can run this notebook using those included assets.

### Recreating Assets: Download KITTI

In your clone of this repo, copy of symlink the root directory of your KITTI data to `./kitti_root`.  In `./kitti_root`, we expect to find the zipfiles for the Benchmarks (see constants `OBJECT_BENCHMARK_FNAMES` and `TRACKING_BENCHMARK_FNAMES` below) as well as the Raw Sync data (zip files that look like `2011_09_26_drive_0017_sync.zip`).  You do **NOT** need to decompress the zip files.

To download KITTI, you might consider using their [download script](http://www.cvlibs.net/download.php?file=raw_data_downloader.zip) or you might want to write a simple `wget`-based solution [as demonstrated in monodepth2](https://github.com/nianticlabs/monodepth2/blob/d1c5f03c38305cae4e68917e472d2f9d4eda0b98/README.md#-kitti-training-data).  Downloading the KITTI data requires explicit acceptance of their license, so we don't include here an automated download tool.


## Configuration And Constants

In [21]:
import os

# Local path to all Benchmark and Raw Sync zipfiles downloaded from the aforementioned URLs.
# Unless you use pre-computed Achive DataFrame data, we'll be reading these files in parallel
# and computing file hashes of their contents.
# Read-only spinning-disk storage OK.  We won't decompress the zips on-disk.
KITTI_DATA_BASE_DIR = './kitti_root'

# We'll save (or re-read) Parquet-formatted Archive DataFrame data from this directory.
# TODO
ARCHIVE_DF_SAVE_DIR = './assets/'

BENCHMARK_DF_PATH = os.path.join(ARCHIVE_DF_SAVE_DIR, 'benchmark_df')
RAW_SYNC_DF_PATH = os.path.join(ARCHIVE_DF_SAVE_DIR, 'raw_sync_df')
BENCH_TO_RAW_DF_PATH = os.path.join(ARCHIVE_DF_SAVE_DIR, 'bench_to_raw_df')

OBJECT_BENCHMARK_FNAMES = (
    'data_object_image_2.zip',
    'data_object_image_3.zip',
    'data_object_prev_2.zip',
    'data_object_prev_3.zip',
    'data_object_velodyne.zip',
)

TRACKING_BENCHMARK_FNAMES = (
    'data_tracking_image_2.zip',
    'data_tracking_image_3.zip',
    'data_tracking_velodyne.zip',
    'data_tracking_oxts.zip',
)

def is_raw_data_zip(path):
    return path.endswith('sync.zip')

def is_benchmark_zip(path):
    return os.path.basename(path) in (set(OBJECT_BENCHMARK_FNAMES) | set(TRACKING_BENCHMARK_FNAMES))

# Mined using a regex tool provided at the end of this notebook.
KITTI_CATEGORY_TO_SEGMENTS = {
    'city': ('2011_09_26_drive_0001', '2011_09_26_drive_0002', '2011_09_26_drive_0005', '2011_09_26_drive_0009', '2011_09_26_drive_0011', '2011_09_26_drive_0013', '2011_09_26_drive_0014', '2011_09_26_drive_0017', '2011_09_26_drive_0018', '2011_09_26_drive_0048', '2011_09_26_drive_0051', '2011_09_26_drive_0056', '2011_09_26_drive_0057', '2011_09_26_drive_0059', '2011_09_26_drive_0060', '2011_09_26_drive_0084', '2011_09_26_drive_0091', '2011_09_26_drive_0093', '2011_09_26_drive_0095', '2011_09_26_drive_0096', '2011_09_26_drive_0104', '2011_09_26_drive_0106', '2011_09_26_drive_0113', '2011_09_26_drive_0117', '2011_09_28_drive_0001', '2011_09_28_drive_0002', '2011_09_29_drive_0026', '2011_09_29_drive_0071'),
    'residential': ('2011_09_26_drive_0019', '2011_09_26_drive_0020', '2011_09_26_drive_0022', '2011_09_26_drive_0023', '2011_09_26_drive_0035', '2011_09_26_drive_0036', '2011_09_26_drive_0039', '2011_09_26_drive_0046', '2011_09_26_drive_0061', '2011_09_26_drive_0064', '2011_09_26_drive_0079', '2011_09_26_drive_0086', '2011_09_26_drive_0087', '2011_09_30_drive_0018', '2011_09_30_drive_0020', '2011_09_30_drive_0027', '2011_09_30_drive_0028', '2011_09_30_drive_0033', '2011_09_30_drive_0034', '2011_10_03_drive_0027', '2011_10_03_drive_0034'),
    'road': ('2011_09_26_drive_0015', '2011_09_26_drive_0027', '2011_09_26_drive_0028', '2011_09_26_drive_0029', '2011_09_26_drive_0032', '2011_09_26_drive_0052', '2011_09_26_drive_0070', '2011_09_26_drive_0101', '2011_09_29_drive_0004', '2011_09_30_drive_0016', '2011_10_03_drive_0042', '2011_10_03_drive_0047'),
    'campus': ('2011_09_28_drive_0016', '2011_09_28_drive_0021', '2011_09_28_drive_0034', '2011_09_28_drive_0035', '2011_09_28_drive_0037', '2011_09_28_drive_0038', '2011_09_28_drive_0039', '2011_09_28_drive_0043', '2011_09_28_drive_0045', '2011_09_28_drive_0047'),
    'person': ('2011_09_28_drive_0053', '2011_09_28_drive_0054', '2011_09_28_drive_0057', '2011_09_28_drive_0065', '2011_09_28_drive_0066', '2011_09_28_drive_0068', '2011_09_28_drive_0070', '2011_09_28_drive_0071', '2011_09_28_drive_0075', '2011_09_28_drive_0077', '2011_09_28_drive_0078', '2011_09_28_drive_0080', '2011_09_28_drive_0082', '2011_09_28_drive_0086', '2011_09_28_drive_0087', '2011_09_28_drive_0089', '2011_09_28_drive_0090', '2011_09_28_drive_0094', '2011_09_28_drive_0095', '2011_09_28_drive_0096', '2011_09_28_drive_0098', '2011_09_28_drive_0100', '2011_09_28_drive_0102', '2011_09_28_drive_0103', '2011_09_28_drive_0104', '2011_09_28_drive_0106', '2011_09_28_drive_0108', '2011_09_28_drive_0110', '2011_09_28_drive_0113', '2011_09_28_drive_0117', '2011_09_28_drive_0119', '2011_09_28_drive_0121', '2011_09_28_drive_0122', '2011_09_28_drive_0125', '2011_09_28_drive_0126', '2011_09_28_drive_0128', '2011_09_28_drive_0132', '2011_09_28_drive_0134', '2011_09_28_drive_0135', '2011_09_28_drive_0136', '2011_09_28_drive_0138', '2011_09_28_drive_0141', '2011_09_28_drive_0143', '2011_09_28_drive_0145', '2011_09_28_drive_0146', '2011_09_28_drive_0149', '2011_09_28_drive_0153', '2011_09_28_drive_0154', '2011_09_28_drive_0155', '2011_09_28_drive_0156', '2011_09_28_drive_0160', '2011_09_28_drive_0161', '2011_09_28_drive_0162', '2011_09_28_drive_0165', '2011_09_28_drive_0166', '2011_09_28_drive_0167', '2011_09_28_drive_0168', '2011_09_28_drive_0171', '2011_09_28_drive_0174', '2011_09_28_drive_0177', '2011_09_28_drive_0179', '2011_09_28_drive_0183', '2011_09_28_drive_0184', '2011_09_28_drive_0185', '2011_09_28_drive_0186', '2011_09_28_drive_0187', '2011_09_28_drive_0191', '2011_09_28_drive_0192', '2011_09_28_drive_0195', '2011_09_28_drive_0198', '2011_09_28_drive_0199', '2011_09_28_drive_0201', '2011_09_28_drive_0204', '2011_09_28_drive_0205', '2011_09_28_drive_0208', '2011_09_28_drive_0209', '2011_09_28_drive_0214', '2011_09_28_drive_0216', '2011_09_28_drive_0220', '2011_09_28_drive_0222'),
    'calibration': ('2011_09_26_drive_0119', '2011_09_28_drive_0225', '2011_09_29_drive_0108', '2011_09_30_drive_0072', '2011_10_03_drive_0058'),
}

# Segments listed at http://www.cvlibs.net/datasets/kitti/raw_data.php
# that have trackets in XML format.  This list of segments was collected
# manually from the link above.  Note: we find below that this set is
# *larger* than the set of segments referenced in the Tracking Benchmark.
SEGMENTS_WITH_TRACKLETS = (
  # City category
  '2011_09_26_drive_0001', '2011_09_26_drive_0002', '2011_09_26_drive_0005', '2011_09_26_drive_0009', '2011_09_26_drive_0011', '2011_09_26_drive_0013', '2011_09_26_drive_0014', '2011_09_26_drive_0017', '2011_09_26_drive_0018', '2011_09_26_drive_0048', '2011_09_26_drive_0051', '2011_09_26_drive_0056', '2011_09_26_drive_0057', '2011_09_26_drive_0059', '2011_09_26_drive_0060', '2011_09_26_drive_0084', '2011_09_26_drive_0091', '2011_09_26_drive_0093',

  # Residential category
  '2011_09_26_drive_0019', '2011_09_26_drive_0020', '2011_09_26_drive_0022', '2011_09_26_drive_0023', '2011_09_26_drive_0035', '2011_09_26_drive_0036', '2011_09_26_drive_0039', '2011_09_26_drive_0046', '2011_09_26_drive_0061', '2011_09_26_drive_0064', '2011_09_26_drive_0079', '2011_09_26_drive_0086', '2011_09_26_drive_0087',

  # Road category
  '2011_09_26_drive_0015', '2011_09_26_drive_0027', '2011_09_26_drive_0028', '2011_09_26_drive_0029', '2011_09_26_drive_0032', '2011_09_26_drive_0052', '2011_09_26_drive_0070',
)

## Setup And Utils

Declare until functions and create Spark session.  This cell is idempotent.

In [2]:
import glob
import os
import pprint

import numpy as np
import pandas as pd
from pyspark.sql import functions as F

from oarphpy import util
from oarphpy.spark import NBSpark
from oarphpy import spark as S

def to_archive_df(spark, archive_path, fname_whitelister=None):    
    
    def to_row(fw):
        # NB: we've tried decoding the image files using imagio and the mapping
        # of Benchmark <=> Raw Sync files remains the same.  Thus we'll just
        # use a binary hash (which works with velodyne files and is a bit
        # faster too)
        
        import hashlib
        digest = hashlib.sha1(fw.data).hexdigest()
                
        from pyspark.sql import Row
        return Row(filename=fw.name, digest=digest)

    fw_rdd = S.archive_rdd(spark, archive_path)
    if fname_whitelister:
        fw_rdd = fw_rdd.filter(lambda fw: fname_whitelister(fw.name))
    fw_rdd = fw_rdd.repartition(os.cpu_count())
    archive_df = spark.createDataFrame(fw_rdd.map(to_row))
    
    # Add the archive name
    archive_name = os.path.basename(archive_path)
    archive_df = archive_df.withColumn('archive_name', F.lit(archive_name))
    
    # Add the topic
    # E.g. 2011_09_28/2011_09_28_drive_0108_sync/oxts/data/0000000033.txt -> oxts
    #      2011_09_28/2011_09_28_drive_0108_sync/image_00/data/0000000023.png -> image_00
    topic_clause = F.when(archive_df.filename.like('%oxts%'), 'oxts')
    topic_clause = topic_clause.when(archive_df.filename.like('%velodyne%'), 'velodyne')
    for i in range(4):
        topic = 'image_0%s' % i
        
        # Convention used in Raw Sync and Tracking Benchmark
        key = '/image_0%s/' % i
        topic_clause = topic_clause.when(archive_df.filename.like('%' + key + '%'), topic)

        # Convention used in Benchmarks
        key = '/image_%s/' % i
        topic_clause = topic_clause.when(archive_df.filename.like('%' + key + '%'), topic)

        key = '/prev_%s/' % i
        topic_clause = topic_clause.when(archive_df.filename.like('%' + key + '%'), topic)
        
        
    topic_clause = topic_clause.otherwise('')
    archive_df = archive_df.withColumn('topic', topic_clause)
    return archive_df

def is_image_or_scan_or_oxt(path):
    return (
        path.endswith('.png') or
        path.endswith('.bin') or
        ('oxt' in path and path.endswith('.txt') and (not path.endswith('dataformat.txt'))))

spark = NBSpark.getOrCreate()

def show_query(q, max_rows=1000, truncate=False):
    spark.sql(q).show(max_rows, truncate=truncate)

2020-03-04 20:59:41,880	oarph 42 : Using source root /usr/local/lib/python3.6/dist-packages/IPython/core 
2020-03-04 20:59:41,881	oarph 42 : Using source root /usr/local/lib/python3.6/dist-packages/IPython 
2020-03-04 20:59:41,919	oarph 42 : Generating egg to /tmp/tmp1i570uc__oarphpy_eggbuild ...
2020-03-04 20:59:42,126	oarph 42 : ... done.  Egg at /tmp/tmp1i570uc__oarphpy_eggbuild/core-0.0.0-py3.6.egg


## Create Archive DataFrames

Now we'll create the `benchmark_df` and `raw_data_df` tables used for our analysis. The process of computing the SHA-1 for all data files is expensive, so this job can take 4.5hrs to run on a 12-thread machine.  Thanks to the licensing of KITTI, we include pre-computed copies of the tables in the `assets` directory of this repo.  In the cells below, we'll use those assets if present or recompute them if necessary.

In [3]:
paths = glob.glob(os.path.join(KITTI_DATA_BASE_DIR, '*'))

# Compute the tables if necssary; save them to disk to make the data persistent for our analysis below.
if not os.path.exists(BENCHMARK_DF_PATH):
    benchmark_dfs = [
        to_archive_df(spark, path, fname_whitelister=is_image_or_scan_or_oxt)
        for path in paths
        if is_benchmark_zip(path)
    ]
    benchmark_df = S.union_dfs(*benchmark_dfs)
    benchmark_df.write.parquet(BENCHMARK_DF_PATH, compression='gzip')

if not os.path.exists(RAW_SYNC_DF_PATH):
    raw_data_dfs = [
        to_archive_df(spark, path, fname_whitelister=is_image_or_scan_or_oxt)
        for path in paths
        if is_raw_data_zip(path)
    ]
    raw_data_df = S.union_dfs(*raw_data_dfs)
    raw_data_df.write.parquet(RAW_SYNC_DF_PATH, compression='gzip')

In [5]:
# Read the tables, which we may have (re-)computed above.
benchmark_df = spark.read.parquet(BENCHMARK_DF_PATH).persist()
raw_data_df = spark.read.parquet(RAW_SYNC_DF_PATH).persist()
benchmark_df.createOrReplaceTempView('benchmark_df')
raw_data_df.createOrReplaceTempView('raw_data_df')

### Sanity Check
Do the tables have the expected data, broken down by topic?  Some observations:
 * The Object Benchmark has 1005 Velodyne coverage, but the Tracking Benchmark is missing 4 Velodyne scans, and the Raw Sync data is missing 4 scans.
 * The Object Benchmark preceding image sets (e.g. `data_object_prev_2.zip`) are missing 102 frames.  That's more than the number of `SEGMENTS_WITH_TRACKLETS`, but less than the number of Raw Sync segments overall.

In [6]:
print("Benchmarks")
show_query("""
    SELECT
        archive_name benchmark,
        topic,
        COUNT(*) num
    FROM benchmark_df
    GROUP BY benchmark, topic
    ORDER BY benchmark, topic
""")

print("Raw Sync")
show_query("""
    SELECT topic, COUNT(*) n
    FROM raw_data_df
    GROUP BY topic
    ORDER BY topic
""")

# Expected
# Benchmarks
# +--------------------------+--------+-----+
# |benchmark                 |topic   |num  |
# +--------------------------+--------+-----+
# |data_object_image_2.zip   |image_02|14999|
# |data_object_image_3.zip   |image_03|14999|
# |data_object_prev_2.zip    |image_02|44895|
# |data_object_prev_3.zip    |image_03|44895|
# |data_object_velodyne.zip  |velodyne|14999|
# |data_tracking_image_2.zip |image_02|19103|
# |data_tracking_image_3.zip |image_03|19103|
# |data_tracking_oxts.zip    |oxts    |50   |
# |data_tracking_velodyne.zip|velodyne|19099|
# +--------------------------+--------+-----+

# Raw Sync
# +--------+-----+
# |topic   |n    |
# +--------+-----+
# |image_00|47889|
# |image_01|47889|
# |image_02|47889|
# |image_03|47889|
# |oxts    |48040|
# |velodyne|47885|
# +--------+-----+

Benchmarks
+--------------------------+--------+-----+
|benchmark                 |topic   |num  |
+--------------------------+--------+-----+
|data_object_image_2.zip   |image_02|14999|
|data_object_image_3.zip   |image_03|14999|
|data_object_prev_2.zip    |image_02|44895|
|data_object_prev_3.zip    |image_03|44895|
|data_object_velodyne.zip  |velodyne|14999|
|data_tracking_image_2.zip |image_02|19103|
|data_tracking_image_3.zip |image_03|19103|
|data_tracking_oxts.zip    |oxts    |50   |
|data_tracking_velodyne.zip|velodyne|19099|
+--------------------------+--------+-----+

Raw Sync
+--------+-----+
|topic   |n    |
+--------+-----+
|image_00|47889|
|image_01|47889|
|image_02|47889|
|image_03|47889|
|oxts    |48040|
|velodyne|47885|
+--------+-----+



Are we missing any expected segments?

In [8]:
import itertools
raw_segs = set(r[0].replace('_sync.zip', '') for r in raw_data_df.select('archive_name').collect())
segs_overall = set(
    itertools.chain.from_iterable(
        v
        for k, v in KITTI_CATEGORY_TO_SEGMENTS.items()
        if k != 'calibration'))
print('Missing segments: %s' % (segs_overall - num_raw_segs))

Missing segments: set()


### Do the Benchmarks and Raw Sync Segments have Duplicate Data?

We want to ensure two things:
 * No image/velodyne scan is repeated within each of the two Benchmarks and within the Raw Sync data.  We expect there to be some overlap between these three groups, but each group should be distinct.
 * We don't have any duplicate files in `KITTI_DATA_BASE_DIR`.

Thus the queries below should result in empty output.

In [9]:
show_query("""
    SELECT
        digest,
        COLLECT_LIST(filename),
        COUNT(*) n
    FROM benchmark_df
    WHERE archive_name = 'data_object_image_2.zip'
    GROUP BY digest
    HAVING n >= 2
""")
show_query("""
    SELECT
        digest,
        COLLECT_LIST(filename),
        COUNT(*) n
    FROM benchmark_df
    WHERE archive_name = 'data_tracking_image_2.zip'
    GROUP BY digest
    HAVING n >= 2
""")

show_query("""
    SELECT
        digest,
        COLLECT_LIST(archive_name),
        COLLECT_LIST(filename),
        COUNT(*) n
    FROM raw_data_df
    GROUP BY digest
    HAVING n >= 2
""")

+------+----------------------+---+
|digest|collect_list(filename)|n  |
+------+----------------------+---+
+------+----------------------+---+

+------+----------------------+---+
|digest|collect_list(filename)|n  |
+------+----------------------+---+
+------+----------------------+---+

+------+--------------------------+----------------------+---+
|digest|collect_list(archive_name)|collect_list(filename)|n  |
+------+--------------------------+----------------------+---+
+------+--------------------------+----------------------+---+



### Train/Test Splits
Let's infer the split for all files to study the Benchmark / Raw Sync overlap.  We expect the Raw Sync to only include training data.

In [11]:
benchmark_df_with_split = benchmark_df.withColumn(
    'split',
    F.when(benchmark_df.filename.like('%train%'), 'train').when(
        benchmark_df.filename.like('%test%'), 'test').otherwise('unknown'))
benchmark_df_with_split.createOrReplaceTempView('benchmark_df_with_split')

raw_data_df_with_split = raw_data_df.withColumn('split', F.lit('raw'))
raw_data_df_with_split.createOrReplaceTempView('raw_data_df_with_split')

### How do the Benchmarks and Raw Sync data compare?
Here are our main results:
 * **The `test` data in the Benchmarks is not available in the Raw Sync data release.**
 * The train and test sets are distinct.  The sets have no overlapping data.  Therefore, exist Raw Data segments that have never been made public, and the `test` data was draw from those segments.  (If you email Professor Geiger, he has copies of the scripts used to create the Benchmarks, and the content of those scripts substantiates this finding).
 * The Tracking and Object Benchmarks are drawn from partially overlapping sets of Raw Data segments.
 * There is slightly more `test` than `train` data for both Benchmarks.

The first result is the main warrant for this project.  Since the `test` Raw Sync segments are hidden, and since the `test` data is made available **exclusively** through the Benchmark zip files, researchers probably want to focus on the Benchmarks data release and ignore the Raw Sync data release.  Firstly, the Benchmark zips require less data to be downloaded.  Secondly, the Benchmarks define the `train` split; KITTI does not publish separately how to derive the `train` split from the public Raw Sync data made available.  Thirdly, if a researcher wants more training data today, they might look to a larger dataset like [NuScenes](https://www.nuscenes.org/), [Argoverse](https://www.argoverse.org/), or [Waymo Open Dataset](https://waymo.com/open/).  The extra Raw Sync data is certainly useful, but not as relevant today as when it was originally released in 2012.  Therefore, we aim to contribute two things in this project:
 * We recover the Benchmark <-> Raw Sync data mapping that shows how the Benchmarks were derived from the Raw Data. 
 * The Benchmark data files contain all the sensor data, but not the timestamp data.  Using the mapping above, we recover the timestamp data.  This data allows us to make the KITTI Benchmarks compatible with `avsegs`.

In [11]:
show_query("""
    SELECT
        b.split AS split,
        b.archive_name benchmark,
        COUNT(DISTINCT r.digest) AS n_rdigest,
        COUNT(*) AS n
    FROM benchmark_df_with_split AS b FULL OUTER JOIN raw_data_df_with_split r ON b.digest = r.digest
    GROUP BY b.split, benchmark
    ORDER BY benchmark, b.split DESC
""")

+-----+--------------------------+---------+------+
|split|benchmark                 |n_rdigest|n     |
+-----+--------------------------+---------+------+
|null |null                      |240646   |240646|
|train|data_object_image_2.zip   |7481     |7481  |
|test |data_object_image_2.zip   |0        |7518  |
|train|data_object_image_3.zip   |7481     |7481  |
|test |data_object_image_3.zip   |0        |7518  |
|train|data_object_prev_2.zip    |13904    |22394 |
|test |data_object_prev_2.zip    |0        |22501 |
|train|data_object_prev_3.zip    |13904    |22394 |
|test |data_object_prev_3.zip    |0        |22501 |
|train|data_object_velodyne.zip  |7481     |7481  |
|test |data_object_velodyne.zip  |0        |7518  |
|train|data_tracking_image_2.zip |8008     |8008  |
|test |data_tracking_image_2.zip |0        |11095 |
|train|data_tracking_image_3.zip |8008     |8008  |
|test |data_tracking_image_3.zip |0        |11095 |
|train|data_tracking_oxts.zip    |0        |21    |
|test |data_

### How well does the labeled data cover the segments?
We drill down a little further on the Benchmark <-> Raw Sync mapping and see that:
 * The Tracking Benchmark contain fully-labeled segments.
 * The Object Benchmark samples are *not taken uniformly at random*, likely in order to achieve a target number of examples of each class.
 * Given the two points above, we can deduce that the Tracking Benchmark actually contains additional labels that could be useful to the Object Benchmark.

In [17]:
# Append a `category` column to the Raw Sync table to show the category of segment.
seg_cat_clause = F.when(raw_data_df.archive_name.like(''), '')
for category, segments in KITTI_CATEGORY_TO_SEGMENTS.items():
    for segment in segments:
        seg_cat_clause = seg_cat_clause.when(
            raw_data_df.archive_name.like('%' + segment + '%'),
            category)
seg_cat_clause = seg_cat_clause.otherwise('')
raw_data_df_with_cat = raw_data_df.withColumn('category', seg_cat_clause)
raw_data_df_with_cat.createOrReplaceTempView('raw_data_df_with_category')

show_query("""
    WITH
      bb AS (  
          SELECT
            IF(filename LIKE '%test%', 'test', 'train') AS split,
            digest,
            archive_name benchmark
          FROM benchmark_df
          WHERE topic = 'image_02'  ),

      segment_count AS (
          SELECT
            archive_name segment,
            FIRST(category) category,
            COUNT(*) n_frames
          FROM raw_data_df_with_category
          WHERE topic = 'image_02'
          GROUP BY archive_name  ),

      label_counts AS (
        SELECT
            r.archive_name segment,
            FIRST(r.category) category,
            bb.split split,
            bb.benchmark benchmark,
            COUNT(*) n
        FROM raw_data_df_with_category r INNER JOIN bb
          ON r.digest = bb.digest
        GROUP BY split, benchmark, segment )
    
    SELECT 
       l.split,
       l.benchmark,
       l.category,
       l.segment,
       l.n n_labeled,
       s.n_frames n_total,
       l.n / s.n_frames frac_labeled
    FROM 
      label_counts l, segment_count s
    WHERE l.segment = s.segment
    ORDER BY l.split, l.benchmark, frac_labeled DESC, l.segment
       
""")

+-----+-------------------------+-----------+------------------------------+---------+-------+--------------------+
|split|benchmark                |category   |segment                       |n_labeled|n_total|frac_labeled        |
+-----+-------------------------+-----------+------------------------------+---------+-------+--------------------+
|train|data_object_image_2.zip  |city       |2011_09_26_drive_0014_sync.zip|235      |314    |0.7484076433121019  |
|train|data_object_image_2.zip  |road       |2011_09_26_drive_0015_sync.zip|215      |297    |0.7239057239057239  |
|train|data_object_image_2.zip  |city       |2011_09_26_drive_0056_sync.zip|197      |294    |0.6700680272108843  |
|train|data_object_image_2.zip  |city       |2011_09_26_drive_0009_sync.zip|287      |447    |0.6420581655480985  |
|train|data_object_image_2.zip  |city       |2011_09_26_drive_0095_sync.zip|166      |268    |0.6194029850746269  |
|train|data_object_image_2.zip  |road       |2011_09_26_drive_0101_sync.

### How much do the Object and Tracking Benchmarks overlap?
Above we showed that there is a complex overlap between the Tracking and Object Benchmarks.  Let's dive a little deeper.  We find that about 18% of the `train` sets overlap, and about 20% of the `test` sets overlap. 

In [13]:
show_query("""
    SELECT 
        archive_name,
        COUNT(DISTINCT digest)
    FROM benchmark_df
    GROUP BY archive_name
""")

show_query("""
    WITH 
      obj AS (  
          SELECT
            IF(filename LIKE '%test%', 'test', 'train') AS split,
            digest,
            archive_name benchmark
          FROM benchmark_df  
          WHERE archive_name = 'data_object_image_2.zip'   ),

      track AS (  
          SELECT
            IF(filename LIKE '%test%', 'test', 'train') AS split,
            digest,
            archive_name benchmark
          FROM benchmark_df  
          WHERE archive_name = 'data_tracking_image_2.zip'   )
    
    SELECT
        obj.split obj_split,
        track.split track_split,
        COUNT(*) n,
        IF(track.split is not null, ROUND(COUNT(*) / 19103, 3), 0) frac_tracking,
        IF(obj.split is not null, ROUND(COUNT(*) / 14999, 3), 0) frac_object
    FROM obj FULL OUTER JOIN track ON obj.digest = track.digest
    GROUP BY obj_split, track_split
    ORDER BY obj_split, track_split
""")


+--------------------------+----------------------+
|archive_name              |count(DISTINCT digest)|
+--------------------------+----------------------+
|data_tracking_image_2.zip |19103                 |
|data_tracking_oxts.zip    |50                    |
|data_object_prev_3.zip    |27328                 |
|data_object_velodyne.zip  |14999                 |
|data_tracking_velodyne.zip|19099                 |
|data_object_image_3.zip   |14999                 |
|data_object_image_2.zip   |14999                 |
|data_tracking_image_3.zip |19103                 |
|data_object_prev_2.zip    |27328                 |
+--------------------------+----------------------+

+---------+-----------+----+-------------+-----------+
|obj_split|track_split|n   |frac_tracking|frac_object|
+---------+-----------+----+-------------+-----------+
|null     |test       |7542|0.395        |0.0        |
|null     |train      |4910|0.257        |0.0        |
|test     |null       |3965|0.0          |0.264 

### The Raw Sync Tracklets are a superset of the Tracking Benchmark
Below we can see that the `train` split of the Tracking benchmark does not include all Raw Sync segments that have `tracklets.zip` files posted publicly.  We are unsure of this discrepency, but perhaps the excluded segments don't help with car/pedestrian class balance.

In [14]:
tracking_train_segments_df = spark.sql("""
    SELECT DISTINCT r.archive_name AS segment
    FROM benchmark_df b INNER JOIN raw_data_df r on b.digest = r.digest
    WHERE b.archive_name = 'data_tracking_image_2.zip'
""")

tracking_train_segments = set(
    r.segment.replace('.zip', '')
    for r in tracking_train_segments_df.collect())

print('Raw Sync segments excluded from Tracking Benchmark:')
pprint.pprint(set(SEGMENTS_WITH_TRACKLETS) - tracking_train_segments)
print('%s with tracklets vs %s Tracking Benchmark' % (len(SEGMENTS_WITH_TRACKLETS), tracking_train_segments_df.count()))

Raw Sync segments excluded from Tracking Benchmark:
{'2011_09_26_drive_0001',
 '2011_09_26_drive_0002',
 '2011_09_26_drive_0005',
 '2011_09_26_drive_0009',
 '2011_09_26_drive_0011',
 '2011_09_26_drive_0013',
 '2011_09_26_drive_0014',
 '2011_09_26_drive_0015',
 '2011_09_26_drive_0017',
 '2011_09_26_drive_0018',
 '2011_09_26_drive_0019',
 '2011_09_26_drive_0020',
 '2011_09_26_drive_0022',
 '2011_09_26_drive_0023',
 '2011_09_26_drive_0027',
 '2011_09_26_drive_0028',
 '2011_09_26_drive_0029',
 '2011_09_26_drive_0032',
 '2011_09_26_drive_0035',
 '2011_09_26_drive_0036',
 '2011_09_26_drive_0039',
 '2011_09_26_drive_0046',
 '2011_09_26_drive_0048',
 '2011_09_26_drive_0051',
 '2011_09_26_drive_0052',
 '2011_09_26_drive_0056',
 '2011_09_26_drive_0057',
 '2011_09_26_drive_0059',
 '2011_09_26_drive_0060',
 '2011_09_26_drive_0061',
 '2011_09_26_drive_0064',
 '2011_09_26_drive_0070',
 '2011_09_26_drive_0079',
 '2011_09_26_drive_0084',
 '2011_09_26_drive_0086',
 '2011_09_26_drive_0087',
 '2011_09_26

### Preceding Image Files
*Are there any oddities with the preceding image files? Nope*

The object benchmark files with preceding images (e.g. `data_object_prev_2.zip`) likely contain some duplicates.  For example, the images at time `T` and `T-1` may have two identical preceding images.  When digging deeper, we observed that indeed there are duplicates; in fact, archives like `data_object_prev_2.zip` contain duplicate files (with differnet file names).  There are a few other observations:
 * The `train` and `test` sets are distinct (as expected) and have a similar distributions of duplicate image files.
 * There are no `train` preceding images that are not also represented in the `raw` data.

In [17]:
show_query("""
    SELECT 
        COUNT(*) distinct_files, splits, n_dupes
    FROM (
        SELECT
            digest,
            COLLECT_LIST(filename),
            COLLECT_LIST(archive_name),
            COLLECT_SET(split) AS splits,
            COUNT(*) n_dupes
        FROM (SELECT * FROM benchmark_df_with_split UNION SELECT * FROM raw_data_df_with_split)
        GROUP BY digest
        ORDER BY n_dupes DESC
    )
    GROUP BY n_dupes, splits
    ORDER BY distinct_files DESC
""")

+--------------+------------+-------+
|distinct_files|splits      |n_dupes|
+--------------+------------+-------+
|240646        |[raw]       |1      |
|25608         |[test]      |1      |
|20643         |[train, raw]|2      |
|15682         |[train, raw]|3      |
|14023         |[test]      |2      |
|6950          |[test]      |3      |
|5276          |[train, raw]|4      |
|3634          |[test]      |4      |
|2754          |[train, raw]|5      |
|2480          |[train, raw]|6      |
|2366          |[test]      |5      |
|21            |[train]     |1      |
+--------------+------------+-------+



# Sensor Sample Rate Study
One of the [KITTI papers](http://www.cvlibs.net/publications/Geiger2013IJRR.pdf) reports that the cameras and lidar are synchronized to record at 10Hz, and that the IMU (`oxts`) readings are a few milliseconds off.  Below we study this claim in detail using the timestamp data embedded in the Raw Sync datasets.

This study also helps us make the Benchmarks compatible with `avsegs`.  Since the Benchmarks (**in particular the `test` split data**) do not include timestamps, but `avsegs` assumes timestamped data, we can use the results of this study to synthesize timestamps for the `test` data in `avsegs`.

In [7]:
def to_topic_time_df(spark, archive_paths):
    
    def to_rows(fw):
        topic = os.path.basename(os.path.dirname(fw.name))
        
        ts_lines = fw.data.decode("utf-8").split('\n')
        def to_nanostamp(line):
            import datetime
            import time
            sec_str, ns_str = line.split('.')
            t_sec = time.mktime(
                datetime.datetime.strptime(sec_str, '%Y-%m-%d %H:%M:%S').timetuple())
            t_nsec = int(ns_str)
            return int(t_sec * 1e9) + t_nsec
            
        ts = [to_nanostamp(line) for line in ts_lines if line]
        
        segment = os.path.basename(fw.archive.archive_path).replace('.zip', '')
        for frame_id, t in enumerate(ts):
            # Deduce the path to the data file for this frame
            if topic == 'oxts':
                extension = '.txt'
            elif topic == 'velodyne_points':
                extension = '.bin'
            else:
                extension = '.png'
            N_DIGITS = 10
            frame_fname = str(frame_id)
            frame_fname = ('0' * (10 - len(frame_fname))) + frame_fname
                # Left-pad with 0s; 17 -> 0000000017
            frame_fname = fw.name.replace('timestamps.txt', 'data/' + frame_fname + extension)
            
            from pyspark.sql import Row
            yield Row(
                segment=segment,
                topic=topic,
                frame=frame_id,
                nanostamp=t,
                filename=frame_fname)

    def to_fws(path):
        fws = util.ArchiveFileFlyweight.fws_from(path)
        return [fw for fw in fws if fw.name.endswith('timestamps.txt')]
    
    path_rdd = spark.sparkContext.parallelize(archive_paths)
    fw_rdd = path_rdd.flatMap(to_fws)
    df = spark.createDataFrame(fw_rdd.flatMap(to_rows))
    return df

rpaths = [
    path
    for path in glob.glob(os.path.join(KITTI_DATA_BASE_DIR, '*'))
    if is_raw_data_zip(path)
]

topic_time_df = to_topic_time_df(spark, rpaths)

In [8]:
# Quick sanity check
topic_time_df = topic_time_df.persist()
topic_time_df.filter('topic like "%image%"').show(truncate=False)
topic_time_df.count()
topic_time_df.createOrReplaceTempView('topic_time_df')

+------------------------------------------------------------------+-----+-------------------+--------------------------+--------+
|filename                                                          |frame|nanostamp          |segment                   |topic   |
+------------------------------------------------------------------+-----+-------------------+--------------------------+--------+
|2011_09_26/2011_09_26_drive_0051_sync/image_00/data/0000000000.png|0    |1317046695077571328|2011_09_26_drive_0051_sync|image_00|
|2011_09_26/2011_09_26_drive_0051_sync/image_00/data/0000000001.png|1    |1317046695179708672|2011_09_26_drive_0051_sync|image_00|
|2011_09_26/2011_09_26_drive_0051_sync/image_00/data/0000000002.png|2    |1317046695283594496|2011_09_26_drive_0051_sync|image_00|
|2011_09_26/2011_09_26_drive_0051_sync/image_00/data/0000000003.png|3    |1317046695389738496|2011_09_26_drive_0051_sync|image_00|
|2011_09_26/2011_09_26_drive_0051_sync/image_00/data/0000000004.png|4    |131704669

In [9]:
# Generate a nice report
segments = [r.segment for r in topic_time_df.select('segment').distinct().collect()]
for segment in segments:
    df = topic_time_df.filter(topic_time_df.segment == segment).toPandas()
    
    from datetime import datetime
    dt = datetime.utcfromtimestamp(df['nanostamp'].min() * 1e-9)
    start = dt.strftime('%Y-%m-%d %H:%M:%S')
    duration = 1e-9 * (df['nanostamp'].max() - df['nanostamp'].min())
    
    n_images = len(df[df['topic'] == 'image_02'])
    n_lidars = len(df[df['topic'] == 'velodyne_points'])
    n_frames = df['frame'].max() + 1
    
    # Print a report
    print('---')
    print('---')

    print('Segment %s' % segment)
    print('Start %s \tDuration %s sec' % (start, duration))
    print('Num Images %s (Lidars %s)' % (n_images, n_lidars))
    import pandas as pd
    from collections import OrderedDict
    rows = []
    for name in sorted(df['topic'].unique()):
        def get_series(name):
            srows = df[df['topic'] == name]
            fids = pd.DataFrame({'frame': range(n_frames)}, columns=['frame'])
            srows = srows.set_index('frame')
            fids = fids.set_index('frame')
            joined = fids.join(srows, on='frame', how='left').fillna(0)
            return 1e-9 * joined['nanostamp'].to_numpy()

        series = get_series(name)
        freqs = series[1:] - series[:-1]

        image_series = get_series('image_02')
        all_diff_ms = 1e3 * np.abs(image_series - series)
        diff_image_ms = np.mean(all_diff_ms)
        diff_image_ms_stdev = np.std(all_diff_ms)
        
        
        rows.append(OrderedDict((
          ('Series',             name),
          ('Freq Hz',            1. / np.mean(freqs)),
          ('Diff img (msec)',    diff_image_ms),
          ('std (msec)',         diff_image_ms_stdev),
          ('Duration',           (series[-1] - series[0])),
          ('Support',            len(series)),
        )))
    print(pd.DataFrame(rows))

    print()
    print()
    

---
---
Segment 2011_09_26_drive_0104_sync
Start 2011-09-26 15:30:02 	Duration 32.19432861 sec
Num Images 312 (Lidars 312)
            Series   Freq Hz  Diff img (msec)  std (msec)   Duration  Support
0         image_00  9.665068         6.635582    0.968101  32.177737      312
1         image_01  9.665040         6.418097    0.697128  32.177829      312
2         image_02  9.665039         0.000000    0.000000  32.177831      312
3         image_03  9.665047         0.544529    0.071898  32.177805      312
4             oxts  9.666975         7.907448    4.793579  32.171387      312
5  velodyne_points  9.665116        10.512416    0.053337  32.177574      312


---
---
Segment 2011_09_28_drive_0102_sync
Start 2011-09-28 13:42:00 	Duration 4.65559889 sec
Num Images 46 (Lidars 46)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.685711         6.131856    0.032090  4.646019       46
1         image_01  9.685618         6.128866    0.0310

---
---
Segment 2011_09_28_drive_0208_sync
Start 2011-09-28 14:14:41 	Duration 5.474233005 sec
Num Images 54 (Lidars 54)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.695847         6.128567    0.037494  5.466258       54
1         image_01  9.695844         6.129826    0.028417  5.466259       54
2         image_02  9.695843         0.000000    0.000000  5.466260       54
3         image_03  9.695846         0.561608    0.060171  5.466259       54
4             oxts  9.688291         6.265976    2.899946  5.470521       54
5  velodyne_points  9.695853         1.598866    0.040474  5.466254       54


---
---
Segment 2011_09_26_drive_0014_sync
Start 2011-09-26 13:11:15 	Duration 32.429831132000004 sec
Num Images 314 (Lidars 314)
            Series   Freq Hz  Diff img (msec)  std (msec)   Duration  Support
0         image_00  9.656624         6.442693    0.648236  32.412983      314
1         image_01  9.657144         6.362273    0.5

---
---
Segment 2011_09_28_drive_0034_sync
Start 2011-09-28 12:46:15 	Duration 4.980687883 sec
Num Images 49 (Lidars 49)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.651233         6.170000    0.147200  4.973458       49
1         image_01  9.651271         6.164955    0.143738  4.973438       49
2         image_02  9.651269         0.000000    0.000000  4.973439       49
3         image_03  9.651028         0.548353    0.084506  4.973563       49
4             oxts  9.657027         6.197603    2.876082  4.970474       49
5  velodyne_points  9.651146         1.657316    0.477587  4.973503       49


---
---
Segment 2011_09_26_drive_0051_sync
Start 2011-09-26 14:18:15 	Duration 45.217125954000004 sec
Num Images 438 (Lidars 438)
            Series   Freq Hz  Diff img (msec)  std (msec)   Duration  Support
0         image_00  9.668678         6.511785    0.784826  45.197495      438
1         image_01  9.668667         6.516293    0.8

---
---
Segment 2011_09_28_drive_0209_sync
Start 2011-09-28 14:14:49 	Duration 8.773894912000001 sec
Num Images 86 (Lidars 86)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.695150         6.138469    0.016203  8.767270       86
1         image_01  9.695151         6.134607    0.014029  8.767270       86
2         image_02  9.695152         0.000000    0.000000  8.767268       86
3         image_03  9.695151         0.545701    0.059385  8.767269       86
4             oxts  9.702322         6.186704    2.875036  8.760790       86
5  velodyne_points  9.695145         1.609018    0.038476  8.767275       86


---
---
Segment 2011_09_28_drive_0179_sync
Start 2011-09-28 14:09:16 	Duration 4.339706280000001 sec
Num Images 43 (Lidars 43)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.693684         6.138386    0.021530  4.332718       43
1         image_01  9.693604         6.133418    0.

---
---
Segment 2011_09_28_drive_0087_sync
Start 2011-09-28 13:32:28 	Duration 8.390746779 sec
Num Images 82 (Lidars 82)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.666034         6.141381    0.009185  8.379859       82
1         image_01  9.666034         6.137293    0.004403  8.379859       82
2         image_02  9.666043         0.000000    0.000000  8.379851       82
3         image_03  9.666182         0.551142    0.059435  8.379730       82
4             oxts  9.664964         5.938728    2.882223  8.380786       82
5  velodyne_points  9.666080         1.578750    0.033370  8.379819       82


---
---
Segment 2011_09_28_drive_0002_sync
Start 2011-09-28 12:26:56 	Duration 38.889539688 sec
Num Images 376 (Lidars 376)
            Series   Freq Hz  Diff img (msec)  std (msec)   Duration  Support
0         image_00  9.645428         6.138344    0.019915  38.878525      376
1         image_01  9.645427         6.134213    0.018395 

---
---
Segment 2011_09_26_drive_0048_sync
Start 2011-09-26 14:14:10 	Duration 2.1934633210000003 sec
Num Images 22 (Lidars 22)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.677214         6.599340    1.025023  2.170046       22
1         image_01  9.670676         6.342953    0.643937  2.171513       22
2         image_02  9.670674         0.000000    0.000000  2.171514       22
3         image_03  9.670101         0.502695    0.058095  2.171642       22
4             oxts  9.676287         6.235816    3.182645  2.170254       22
5  velodyne_points  9.670562        10.502772    0.028223  2.171539       22


---
---
Segment 2011_09_28_drive_0078_sync
Start 2011-09-28 13:30:12 	Duration 3.732179968 sec
Num Images 37 (Lidars 37)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.663385         6.230619    0.373261  3.725403       37
1         image_01  9.663447         6.224632    0.36739

---
---
Segment 2011_09_26_drive_0087_sync
Start 2011-09-26 15:07:08 	Duration 75.404508143 sec
Num Images 729 (Lidars 729)
            Series   Freq Hz  Diff img (msec)  std (msec)   Duration  Support
0         image_00  9.657069         6.467910    0.653809  75.385190      729
1         image_01  9.657073         6.427661    0.618981  75.385162      729
2         image_02  9.657079         0.000000    0.000000  75.385117      729
3         image_03  9.657068         0.543578    0.066353  75.385198      729
4             oxts  9.657307         6.774005    3.981861  75.383335      729
5  velodyne_points  9.657073        10.506749    0.088593  75.385158      729


---
---
Segment 2011_09_28_drive_0146_sync
Start 2011-09-28 14:02:09 	Duration 7.2328701010000005 sec
Num Images 71 (Lidars 71)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.692316         6.130289    0.032475  7.222216       71
1         image_01  9.692317         6.126938 

            Series   Freq Hz  Diff img (msec)  std (msec)    Duration  Support
0         image_00  9.628397         6.167639    0.197054  109.883299     1059
1         image_01  9.628397         6.162090    0.189966  109.883298     1059
2         image_02  9.628396         0.000000    0.000000  109.883312     1059
3         image_03  9.628396         0.554551    0.062637  109.883307     1059
4             oxts  9.628152         6.131676    2.962392  109.886087     1059
5  velodyne_points  9.628397         1.151734    0.341468  109.883291     1059


---
---
Segment 2011_09_28_drive_0141_sync
Start 2011-09-28 14:01:03 	Duration 7.229592576000001 sec
Num Images 71 (Lidars 71)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.691361         6.222433    0.313118  7.222928       71
1         image_01  9.691360         6.174893    0.149368  7.222929       71
2         image_02  9.691304         0.000000    0.000000  7.222970       71
3         

            Series   Freq Hz  Diff img (msec)  std (msec)   Duration  Support
0         image_00  9.687527         6.493127    0.858035  40.670854      395
1         image_01  9.687509         6.488397    0.778139  40.670930      395
2         image_02  9.687507         0.000000    0.000000  40.670939      395
3         image_03  9.687537         0.541233    0.065477  40.670812      395
4             oxts  9.687293         6.544950    3.116384  40.671837      395
5  velodyne_points  9.687523        10.491073    0.114318  40.670871      395


---
---
Segment 2011_09_28_drive_0183_sync
Start 2011-09-28 14:10:01 	Duration 3.92920661 sec
Num Images 39 (Lidars 39)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.695534         6.130793    0.027674  3.919330       39
1         image_01  9.695534         6.128892    0.027182  3.919330       39
2         image_02  9.695527         0.000000    0.000000  3.919333       39
3         image_03  9.69

---
---
Segment 2011_09_28_drive_0098_sync
Start 2011-09-28 13:35:05 	Duration 4.450670829 sec
Num Images 44 (Lidars 44)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.683281         6.182064    0.148344  4.440644       44
1         image_01  9.683239         6.178769    0.153686  4.440663       44
2         image_02  9.683300         0.000000    0.000000  4.440635       44
3         image_03  9.683571         0.531836    0.060680  4.440511       44
4             oxts  9.684127         6.357258    2.891090  4.440256       44
5  velodyne_points  9.683383         1.614923    0.128168  4.440597       44


---
---
Segment 2011_09_28_drive_0185_sync
Start 2011-09-28 14:10:21 	Duration 8.155077376000001 sec
Num Images 80 (Lidars 80)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.695241         6.139785    0.009675  8.148328       80
1         image_01  9.695242         6.135884    0.007723

            Series   Freq Hz  Diff img (msec)  std (msec)   Duration  Support
0         image_00  9.653434         6.312019    0.478171  11.705679      114
1         image_01  9.651207         6.470013    0.684287  11.708380      114
2         image_02  9.651208         0.000000    0.000000  11.708379      114
3         image_03  9.651111         0.534353    0.064022  11.708497      114
4             oxts  9.657663         6.235953    2.926369  11.700553      114
5  velodyne_points  9.651144        10.513651    0.040327  11.708457      114


---
---
Segment 2011_09_28_drive_0153_sync
Start 2011-09-28 14:03:37 	Duration 9.190994855000001 sec
Num Images 90 (Lidars 90)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.693460         6.137199    0.025184  9.181448       90
1         image_01  9.693468         6.132269    0.022939  9.181440       90
2         image_02  9.693485         0.000000    0.000000  9.181424       90
3         image_0

---
---
Segment 2011_09_28_drive_0177_sync
Start 2011-09-28 14:08:54 	Duration 7.950938682 sec
Num Images 78 (Lidars 78)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.692790         6.140370    0.014802  7.944049       78
1         image_01  9.692790         6.134250    0.014545  7.944049       78
2         image_02  9.692785         0.000000    0.000000  7.944053       78
3         image_03  9.692640         0.548467    0.059956  7.944172       78
4             oxts  9.696876         6.153629    2.883202  7.940702       78
5  velodyne_points  9.692684         1.607143    0.034620  7.944136       78


---
---
Segment 2011_09_28_drive_0122_sync
Start 2011-09-28 13:47:21 	Duration 4.345460211000001 sec
Num Images 43 (Lidars 43)
            Series   Freq Hz  Diff img (msec)  std (msec)  Duration  Support
0         image_00  9.689967         6.134022    0.027964  4.334380       43
1         image_01  9.689964         6.130939    0.027160

# Benchmark to Raw Mapping
We now aggregate most of the data explored in this notebook in order to produce a final table asset useful for `avsegs` and other projects.  In particular, we produce a table that helps map Benchmark assets to Raw Sync data assets.

In [23]:
bench_to_raw_df = spark.sql("""
    WITH
      bench_to_raw AS (
          SELECT
            b.digest AS b_digest,
            r.digest AS r_digest,
            b.archive_name AS benchmark,
            r.category AS segment_category,
            b.filename AS b_filename,
            r.filename AS r_filename,
            b.split AS split
          FROM
            benchmark_df_with_split b FULL OUTER JOIN
            raw_data_df_with_category r ON b.digest = r.digest )
    
    SELECT *
    FROM
      bench_to_raw br FULL OUTER JOIN topic_time_df ttdf
      ON br.r_filename = ttdf.filename
        
    
    """)

# Reduce cardinality for faster I/O
bench_to_raw_df = bench_to_raw_df.repartition(20)

# We want to save as parquet.  When Pandas / pyarrow reads numeric columns that have nulls, 
# Pandas casts to float, which is really NOT what we want for timestamps (and other cols)
COLS_DEFAULT_ZERO = ('nanostamp', 'frame')
for col in COLS_DEFAULT_ZERO:
    bench_to_raw_df = bench_to_raw_df.withColumn(
                        col,
                        F.when(bench_to_raw_df[col].isNull(), 0).otherwise(bench_to_raw_df[col]))


In [24]:
# Sanity Check
bench_to_raw_df.limit(50).toPandas()

Unnamed: 0,b_digest,r_digest,benchmark,segment_category,b_filename,r_filename,split,filename,frame,nanostamp,segment,topic
0,,71c5f62767ea9164e1598c93d08491520bc1c809,,residential,,2011_09_30/2011_09_30_drive_0020_sync/oxts/dat...,,2011_09_30/2011_09_30_drive_0020_sync/oxts/dat...,618,1317384570631488023,2011_09_30_drive_0020_sync,oxts
1,befdaae80adbb863093db4b1ff255f863584b6c4,befdaae80adbb863093db4b1ff255f863584b6c4,data_object_prev_2.zip,person,training/prev_2/003675_03.png,2011_09_28/2011_09_28_drive_0119_sync/image_02...,train,2011_09_28/2011_09_28_drive_0119_sync/image_02...,31,1317217607975304704,2011_09_28_drive_0119_sync,image_02
2,,2cfeeb792b9dc613c2b44e47a5e7a51cdbcf100a,,city,,2011_09_26/2011_09_26_drive_0009_sync/image_01...,,2011_09_26/2011_09_26_drive_0009_sync/image_01...,433,1317042549794779392,2011_09_26_drive_0009_sync,image_01
3,,aa0f95675666d1edfe010ec1c24af1da3d0da2b5,,residential,,2011_09_30/2011_09_30_drive_0028_sync/oxts/dat...,,2011_09_30/2011_09_30_drive_0028_sync/oxts/dat...,3525,1317386928113621401,2011_09_30_drive_0028_sync,oxts
4,750243d3ca3b6e063c6339975ec3cc42a64414b4,750243d3ca3b6e063c6339975ec3cc42a64414b4,data_object_prev_2.zip,city,training/prev_2/005376_03.png,2011_09_26/2011_09_26_drive_0117_sync/image_02...,train,2011_09_26/2011_09_26_drive_0117_sync/image_02...,582,1317051703639338240,2011_09_26_drive_0117_sync,image_02
5,675eddc56c291e553e2cb1090aa9bf813032c777,675eddc56c291e553e2cb1090aa9bf813032c777,data_object_prev_2.zip,city,training/prev_2/001241_03.png,2011_09_26/2011_09_26_drive_0093_sync/image_02...,train,2011_09_26/2011_09_26_drive_0093_sync/image_02...,419,1317050356383268864,2011_09_26_drive_0093_sync,image_02
6,971bb49ab0ba7dde4f0ef83f67fa794e15df5fab,971bb49ab0ba7dde4f0ef83f67fa794e15df5fab,data_object_velodyne.zip,person,training/velodyne/007372.bin,2011_09_28/2011_09_28_drive_0090_sync/velodyne...,train,2011_09_28/2011_09_28_drive_0090_sync/velodyne...,23,1317216793006907907,2011_09_28_drive_0090_sync,velodyne_points
7,7a9a3537cd4d22324bf7545433b7e94b40f741e3,7a9a3537cd4d22324bf7545433b7e94b40f741e3,data_tracking_image_2.zip,city,training/image_02/0001/000434.png,2011_09_26/2011_09_26_drive_0009_sync/image_02...,train,2011_09_26/2011_09_26_drive_0009_sync/image_02...,434,1317042549892206336,2011_09_26_drive_0009_sync,image_02
8,,a7f33be4f9b888928dcf8d5c41b6d1c1df36a312,,residential,,2011_09_26/2011_09_26_drive_0039_sync/oxts/dat...,,2011_09_26/2011_09_26_drive_0039_sync/oxts/dat...,157,1317045952223358134,2011_09_26_drive_0039_sync,oxts
9,d43e12a724baec56b3b3fa8695f97b0e9ecdd004,d43e12a724baec56b3b3fa8695f97b0e9ecdd004,data_object_prev_2.zip,road,training/prev_2/005949_03.png,2011_09_26/2011_09_26_drive_0101_sync/image_02...,train,2011_09_26/2011_09_26_drive_0101_sync/image_02...,912,1317050856388794368,2011_09_26_drive_0101_sync,image_02


In [25]:
if not os.path.exists(BENCH_TO_RAW_DF_PATH):
    bench_to_raw_df.write.parquet(BENCH_TO_RAW_DF_PATH, compression='gzip')

# Misc

In [None]:
# Tool for extracting segment names from KITTI website
# http://www.cvlibs.net/datasets/kitti/raw_data.php
# Example extracted from calibration page
page_raw_text = """Data Category: Calibration
Before browsing, please wait some moments until this page is fully loaded.
2011_09_26_drive_0119 (0.1 GB)
Length: 17 frames (00:01 minutes)
Image resolution: 1392 x 512 pixels
Labels: 0 Cars, 0 Vans, 0 Trucks, 0 Pedestrians, 0 Sitters, 0 Cyclists, 0 Trams, 0 Misc
Downloads: [unsynced+unrectified data]
2011_09_28_drive_0225 (0.1 GB)
Length: 19 frames (00:01 minutes)
Image resolution: 1392 x 512 pixels
Labels: 0 Cars, 0 Vans, 0 Trucks, 0 Pedestrians, 0 Sitters, 0 Cyclists, 0 Trams, 0 Misc
Downloads: [unsynced+unrectified data]
2011_09_29_drive_0108 (0.1 GB)
Length: 20 frames (00:02 minutes)
Image resolution: 1392 x 512 pixels
Labels: 0 Cars, 0 Vans, 0 Trucks, 0 Pedestrians, 0 Sitters, 0 Cyclists, 0 Trams, 0 Misc
Downloads: [unsynced+unrectified data]
2011_09_30_drive_0072 (0.0 GB)
Length: 11 frames (00:01 minutes)
Image resolution: 1392 x 512 pixels
Labels: 0 Cars, 0 Vans, 0 Trucks, 0 Pedestrians, 0 Sitters, 0 Cyclists, 0 Trams, 0 Misc
Downloads: [unsynced+unrectified data]
2011_10_03_drive_0058 (0.1 GB)
Length: 35 frames (00:03 minutes)
Image resolution: 1392 x 512 pixels
Labels: 0 Cars, 0 Vans, 0 Trucks, 0 Pedestrians, 0 Sitters, 0 Cyclists, 0 Trams, 0 Misc
Downloads: [unsynced+unrectified data]"""

import re
import pprint
pprint.pprint(sorted(re.findall('\d\d\d\d_\d\d_\d\d_drive_\d\d\d\d', page_raw_text)))