# pNeuma: Understanding Urban Traffic Monitoring with Massive Drone Data

## Summary: 
Traffic monitoring and analysis are among the most important tools for working with transportation data. Traffic data can be collected with several techniques, such as cameras, sensors, GPS, and Unmanned Aerial Systems (UAS), commonly called drones.

pNeuma (New Era of Urban traffic Monitoring with Aerial footage) is an urban dataset for studying congestion, collected with drones. The goal of the experiment is to record traffic streams in a congested, multimodal urban environment using UAS, allowing in-depth investigation of critical traffic phenomena.

Drones have several advantages: they remove the need for expensive satellites, they can be equipped with communication systems to inform commuters in real time, and they have strong data-acquisition capabilities. The term "swarm of drones" refers to a coordinated team of drones that fly together without colliding to perform a task, which makes a swarm well suited to monitoring traffic congestion in different parts of a congested city.

The experiment took place in the central district of Athens, Greece, selected as a busy, multimodal urban environment in which different kinds of transportation phenomena can be examined. Six types of vehicles are recorded: Car, Taxi, Bus, Medium Vehicle, Heavy Vehicle, and Motorcycle. pNeuma used a swarm of 10 drones hovering over the city for five days to record traffic streams in a congested 1.3 km² area with more than 100 km-lanes of road network, around 100 busy intersections (signalized or not), more than 30 bus stops, and close to half a million trajectories.

Drone footage makes it possible to analyze many traffic parameters, such as speed, flow, density, shockwaves, signal cycle length, queue length, queue dissipation time, and capacity, and to generate origin-destination (OD) matrices for urban roundabouts and four-legged intersections.

Because of city regulations, the data was captured during the morning peak (8:00-10:30) on each working day of one week. Drones cannot record the traffic stream for 2.5 hours continuously, so the swarm was flown in sequential sessions with 'blind' gaps in between. The expectation was that 10 minutes without data in every 30 minutes would cause no significant issues.

The official pNeuma page provides the datasets captured by each drone over its respective area. They can be downloaded by selecting the date and the time of the capture. Each .csv file is organized so that each row holds the data of a single vehicle. The first row contains the names of the first 10 columns. The first 4 columns describe the trajectory: the unique track ID, the type of vehicle, the distance traveled, and the average speed of the vehicle. The remaining 6 columns (lat, lon, speed, lon_acc, lat_acc, time) are then repeated once per recorded time step, so rows have different lengths.
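The row layout described above (4 trajectory fields, then groups of 6 values) can be illustrated with a minimal sketch. The function name `split_trajectory_row` and the sample row are hypothetical, the numbers are made up rather than taken from the dataset, and the sketch assumes fields are separated by `;` with a trailing delimiter.

```python
def split_trajectory_row(line):
    """Split one pNeuma-style row into trajectory metadata and per-time-step samples."""
    # Rows end with a trailing ';', so empty fields are dropped after stripping whitespace
    fields = [f.strip() for f in line.strip().split(";") if f.strip()]
    track_id, vehicle_type, traveled_d, avg_speed = fields[:4]
    values = [float(v) for v in fields[4:]]
    if len(values) % 6 != 0:
        raise ValueError("expected a multiple of 6 values after the first 4 columns")
    # Each sample is (lat, lon, speed, lon_acc, lat_acc, time)
    samples = [tuple(values[i:i + 6]) for i in range(0, len(values), 6)]
    meta = {
        "track_id": int(track_id),
        "type": vehicle_type,
        "traveled_d": float(traveled_d),
        "avg_speed": float(avg_speed),
    }
    return meta, samples


# Made-up row with two time steps (illustrative values only)
demo_row = ("1; Car; 48.85; 9.77; "
            "37.977; 23.737; 4.9; 0.05; -0.02; 0.0; "
            "37.978; 23.738; 5.1; 0.04; -0.01; 0.04; \n")
meta, samples = split_trajectory_row(demo_row)
```

Here `meta` holds the four per-trajectory fields and `samples` holds one 6-tuple per time step, which is why two vehicles tracked for different durations produce rows of different lengths.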


## Data Dictionary:

| Data       | Data Type | Description | Attributes/Measures |
|------------|------|-------------|---------|
| track_id   |Integer|Unique ID of the tracked vehicle|Unique "Id" that identifies one vehicle trajectory|
| Type       |String|Type of vehicle|Car, Taxi, Bus, Medium Vehicle, Heavy Vehicle, Motorcycle|
| traveled_d |Float|Distance traveled by the vehicle|Meters (m)|
| lat        |Float|Latitude: north-south geographic coordinate of a point on the Earth's surface|Degrees (°)|
| lon        |Float|Longitude: east-west geographic coordinate of a point on the Earth's surface|Degrees (°)|
| speed      |Float|Speed of the vehicle|Kilometers per hour (km/h)|
| lon_acc    |Float|Longitudinal acceleration|m/s²|
| lat_acc    |Float|Lateral acceleration|m/s²|
| time       |Float|Time of the observation, which appears to be measured from the start of the recording session|Seconds (s)|

Note: Working with this data requires geopandas and movingpandas. On Windows, I recommend installing them with conda, because a manual installation is usually complicated. On Linux you can install them manually; you may still run into problems, but they are easier to solve.

The pNeuma dataset files are fairly large, and their rows do not all have the same number of columns, so a file cannot be imported directly into a dataframe. Instead, we process each file line by line and build a dict with the path object as a collection of points. Finally, we create a geopandas dataframe from the list of dicts.

A possible problem is that the whole dataset may not fit into RAM, in which case we need a way to store the information directly on disk.
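One way to keep memory use bounded is to stream the input in batches and write each batch to disk before reading the next one. The following is a minimal standard-library sketch of that idea, not the approach this notebook ends up using; the helper names `iter_batches` and `spill_batches`, the `part_*.txt` naming, and the demo lines are all hypothetical.

```python
import itertools
import os
import tempfile


def iter_batches(lines, batch_size):
    """Yield lists of at most batch_size items from any iterable of lines."""
    it = iter(lines)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


def spill_batches(lines, out_dir, batch_size=1000):
    """Write each batch to its own file so only one batch is in memory at a time."""
    paths = []
    for k, batch in enumerate(iter_batches(lines, batch_size)):
        path = os.path.join(out_dir, f"part_{k:05d}.txt")
        with open(path, "w") as out:
            out.writelines(batch)
        paths.append(path)
    return paths


# Tiny demonstration with made-up lines standing in for CSV rows
demo_lines = [f"row {i}\n" for i in range(5)]
demo_dir = tempfile.mkdtemp()
demo_paths = spill_batches(demo_lines, demo_dir, batch_size=2)
```

Because an open file object is itself an iterable of lines, it can be passed to `spill_batches` directly, so the full file is never loaded at once.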

In [1]:
# Import all the libraries used in this notebook
import geopandas as gpd
from datetime import datetime, timedelta
import re
import movingpandas as mpd
import numpy as np
from shapely.geometry import Point
import pandas as pd
import matplotlib.pyplot as plt
#from pyproj import CRS
import dask.bag as db
import dask.dataframe as dd
from dask.delayed import delayed
import dask_geopandas
import os

## Preprocessing

In [2]:
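def get_date_cols(fname):
    """ Parse a file name like '20181029_d1_0800_0830.csv' into
    (ymd, start_hour, start_min, drone). """
    pat = re.compile(r'^([0-9]{8})_d([0-9]{1,2}|X)_([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})\.csv$')
    ymd, drone, start_hour, start_min, end_hour, end_min = pat.findall(fname)[0]
    # The pattern accepts the literal 'X' as drone id, which int() cannot convert
    drone = drone if drone == 'X' else int(drone)
    return int(ymd), int(start_hour), int(start_min), drone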
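As a standalone illustration of the file-name convention that the pattern encodes (date, drone id, session start, session end), here is a minimal sketch that turns a name into start and end datetimes. The function `parse_session_fname` and the returned dict keys are hypothetical; they are not part of the notebook's own helpers.

```python
import re
from datetime import datetime

# Same structure as the file names listed in this notebook: YYYYMMDD_d<drone>_HHMM_HHMM.csv
FNAME_PAT = re.compile(r"^(\d{8})_d(\d{1,2}|X)_(\d{2})(\d{2})_(\d{2})(\d{2})\.csv$")


def parse_session_fname(fname):
    """Return the drone id and the session start/end encoded in a file name."""
    m = FNAME_PAT.match(fname)
    if m is None:
        raise ValueError(f"unexpected file name: {fname!r}")
    ymd, drone, sh, sm, eh, em = m.groups()
    day = datetime.strptime(ymd, "%Y%m%d")
    return {
        "drone": drone,  # kept as a string because it may be 'X'
        "start": day.replace(hour=int(sh), minute=int(sm)),
        "end": day.replace(hour=int(eh), minute=int(em)),
    }


info = parse_session_fname("20181029_d1_0830_0900.csv")
```

Raising `ValueError` on a non-matching name makes stray directory entries such as `.ipynb_checkpoints` fail loudly instead of producing an `IndexError`.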

In [3]:
def line_to_df(line, date_cols):
    """ Line is a string delimited by ; with trailing newline.
    Create a small df (one row per recorded time step) from the line and return it. """
    
    line = line.replace(" ","").split(';')[:-1]
    track_id, type_, traveled_d, avg_speed = line[:4]
    # After the 4 trajectory fields, every time step contributes 6 values:
    # lat, lon, speed, lon_acc, lat_acc, time
    lat = np.array(line[4::6], dtype=float)
    lon = np.array(line[5::6], dtype=float)
    speed = np.array(line[6::6], dtype=float)
    lon_acc = np.array(line[7::6], dtype=float)
    lat_acc = np.array(line[8::6], dtype=float)
    time_s = np.array(line[9::6], dtype=float)
    
    # Date information
    n = len(lon)
    ymd, hour, mins, drone = date_cols
    ymd_s = str(ymd)
    year = int(ymd_s[0:4])
    month = int(ymd_s[4:6])
    day = int(ymd_s[6:8])
    
    # Start of the recording session, taken from the file name
    start = datetime(year, month, day, hour, mins)
    
    # if the year is different
    # ID = np.full(n, int(str(ymd) + str(drone) + track_id + str(hour) + str(mins)), dtype=np.int64)
    # if the year is the same
    ID = np.full(n, int(str(ymd)[4:] + str(hour) + str(mins) + str(drone) + track_id), dtype=np.int64)
    # It seems time is in secs from the start of the experiment chunk,
    # given with two decimals. Convert it to integer milliseconds, rounding
    # instead of truncating so that float noise cannot drop a millisecond.
    ms = np.rint(time_s * 1000).astype(np.int64)
    # Milliseconds since 08:00 (despite the column name 'secs'), including the
    # minutes of the session start
    secs = ms + ((hour - 8)*3600 + mins*60)*1000
    # Absolute timestamp: session start plus the offset within the session only
    # (adding 'secs' here would count the hours since 08:00 twice)
    dates_to_df = pd.Timestamp(start) + pd.to_timedelta(ms, unit='ms')
    
    # points = gpd.points_from_xy(lon, lat, crs='EPSG:4326')
    df = pd.DataFrame({'ID': ID, 'type': type_, 'lon_4326': lon, 'lat_4326': lat, 'secs': secs, 'speed':speed, 'long_acc':lon_acc, 'lat_acc':lat_acc, 'date':dates_to_df})
    
    # %f is the fractional-second field; '%MS' would print the minutes followed by a literal 'S'
    df['date'] = df['date'].dt.strftime('%Y-%m-%d %H:%M:%S.%f')
    return df
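The timestamp arithmetic can be checked in isolation with the standard library. This sketch assumes, as the comments in `line_to_df` suggest, that the time column is an offset in seconds from the start of the session encoded in the file name; the helper name `sample_timestamp` and the example values are hypothetical.

```python
from datetime import datetime, timedelta


def sample_timestamp(session_start, offset_s):
    """Absolute timestamp of a sample recorded offset_s seconds into a session."""
    # round() rather than int(): int() truncates, so float noise in the product
    # could otherwise lose one millisecond
    offset_ms = round(offset_s * 1000)
    return session_start + timedelta(milliseconds=offset_ms)


# Session starting at 09:00 on 2018-10-29, sample taken 12.04 s later
session_start = datetime(2018, 10, 29, 9, 0)
ts = sample_timestamp(session_start, 12.04)
label = ts.strftime("%Y-%m-%d %H:%M:%S.%f")
```

The offset is added once to the session start; the hours elapsed since 08:00 are already part of `session_start` and must not be added again.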

In [4]:
os.getcwd()

'C:\\Users\\artur\\code\\estancias\\Neuma'

In [12]:
os.chdir("..")

In [5]:
os.chdir("drone_1")

In [6]:
fnames = os.listdir("./")
fnames

['.ipynb_checkpoints',
 '20181029_d1_0800_0830.csv',
 '20181029_d1_0830_0900.csv',
 '20181029_d1_0900_0930.csv',
 '20181029_d1_0930_1000.csv',
 '20181029_d1_1000_1030.csv']

In [7]:
# Keep only the CSV files, which skips '.ipynb_checkpoints' without relying on list positions
fnames = sorted(f for f in fnames if f.endswith('.csv'))
fnames

['20181029_d1_0800_0830.csv',
 '20181029_d1_0830_0900.csv',
 '20181029_d1_0900_0930.csv',
 '20181029_d1_0930_1000.csv',
 '20181029_d1_1000_1030.csv']

In [11]:
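i = 1
for fname in fnames:
    date_cols = get_date_cols(fname)
    # The context manager closes the file even if a line fails to parse
    with open(fname) as f:
        f.readline()  # skip the header row
        # One delayed task per line (one vehicle each); the lines are read here, while the file is open
        dfs = [delayed(line_to_df)(l, date_cols) for l in f]
    df = dd.from_delayed(dfs)
    # Write dask df to parquet, appending every file to the same dataset
    df.to_parquet('drone_uno.parquet', append=True, engine="pyarrow",  write_index=False)
    # df.to_parquet(f'hora_{i}.parquet', engine="pyarrow",  write_index=False)
    i += 1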
    
    #df = dd.read_parquet(fname.split('.')[0] + '-points.parquet')
    #df = dask_geopandas.from_dask_dataframe(df)
    #df = df.set_geometry(
     #   dask_geopandas.points_from_xy(df, 'lon', 'lat')
    #)