# Uber Driver Data Analysis

## Data processing pipeline
The data processing pipeline consists of two parts, which must be executed in order.
Each part is explained in the cells below.

## File structure
### Required input files
- `data`
  - `raw`
    - `02 - Driver Lifetime Trips.csv`
    - `05 - Driver Online Offline.csv`
    - `08 - Driver Dispatches Offered and Accepted.csv`

### Created output files
- `data`
  - `processed`
    - `02-events_df.csv`
    - `02-period_df.csv`
    - `05-events_df.csv`
    - `05-period_df.csv`
    - `08-period_df.csv`
    - `08-events_df.csv`
    - `fusion_df.csv`

Note: in the new data returned by SAR, the files are numbered differently:
- `01 - Driver Lifetime Trips.csv`
- `02 - Driver Online OffLine.csv`
- `03 - Driver Dispatches Offered and Accepted.csv`
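
One possible way to cope with both numbering schemes is to look each file up under either name. The sketch below is only an illustration: `candidate_names` and `find_raw_file` are hypothetical names that do not exist in this notebook, and the exact spelling of each filename should be checked against the files actually received.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table: pipeline key -> filename under the old and the new numbering
candidate_names: dict[str, list[str]] = {
    '02': ['02 - Driver Lifetime Trips.csv', '01 - Driver Lifetime Trips.csv'],
    '05': ['05 - Driver Online Offline.csv', '02 - Driver Online OffLine.csv'],
    '08': ['08 - Driver Dispatches Offered and Accepted.csv',
           '03 - Driver Dispatches Offered and Accepted.csv'],
}


def find_raw_file(folder: Path, key: str) -> Optional[Path]:
    """Returns the first candidate file for {key} that exists in {folder}, else None"""
    for name in candidate_names[key]:
        path = folder / name
        if path.is_file():
            return path
    return None
```

With such a helper, the pipeline key (`02`, `05`, `08`) could stay the same whichever numbering the raw files use.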

In [1]:
import os
import uuid
from functools import reduce
from pathlib import Path, PurePath
from typing import TypedDict, Callable, Optional

import numpy as np
import pandas as pd
from ncls import NCLS

In [2]:
# a concrete Path is needed here: PurePath has no mkdir(), which first_pipeline calls on processed_folder
data_folder = Path.cwd() / 'data'
raw_folder = data_folder / 'raw'
processed_folder = data_folder / 'processed'

In [3]:
SeriesMapping = Callable[[pd.Series], pd.Series]

In [None]:
class Config(TypedDict):
    key: str
    filename: str
    datetime_columns: list[str]
    duration_columns: dict[str, str]
    columns_to_merge: list[list[str]]
    value_names: list[str]
    label_names: list[str]


configs: list[Config] = [
    {
        'key': '02',
        'filename': '02 - Driver Lifetime Trips.csv',
        'datetime_columns': [
            'request_timestamp_local',
            'request_timestamp_utc',
            'begintrip_timestamp_local',
            'begintrip_timestamp_utc',
            'dropoff_timestamp_local',
            'dropoff_timestamp_utc',
            'rewindtrip_timestamp_local',
            'rewindtrip_timestamp_utc'
        ],
        'duration_columns': {
            'request_to_begin_duration_seconds': 'second',
            'trip_duration_seconds': 'second',
            'fare_duration_minutes': 'minute',
            'wait_duration_minutes': 'minute'
        },
        'columns_to_merge': [
            [
                'request_timestamp_utc',
                'begintrip_timestamp_utc',
                'dropoff_timestamp_utc'
            ],
            [
                'request_lat',
                'begintrip_lat',
                'dropoff_lat'
            ],
            [
                'request_lng',
                'begintrip_lng',
                'dropoff_lng'
            ]
        ],
        'value_names': [
            'event_timestamp_utc',
            'event_lat',
            'event_lng'
        ],
        'label_names': [
            'request',
            'begintrip',
            'dropoff'
        ]
    },
    {
        'key': '05',
        'filename': '05 - Driver Online Offline.csv',
        'datetime_columns': [
            'begin_timestamp',
            'end_timestamp',
            'begin_timestamp_local',
            'end_timestamp_local'
        ],
        'duration_columns': {
            'duration_seconds': 'second'
        },
        'columns_to_merge': [
            [
                'begin_timestamp',
                'end_timestamp'
            ],
            [
                'begin_lat',
                'end_lat'
            ],
            [
                'begin_lng',
                'end_lng'
            ]
        ],
        'value_names': [
            'event_timestamp_utc',
            'event_lat',
            'event_lng'
        ],
        'label_names': [
            'start',
            'end'
        ]
    },
    {
        'key': '08',
        'filename': '08 - Driver Dispatches Offered and Accepted.csv',
        'datetime_columns': [
            'start_timestamp_utc',
            'end_timestamp_utc',
            'start_timestamp_local',
            'end_timestamp_local'
        ],
        'duration_columns': {
            'minutes_online': 'minute',
            'minutes_active': 'minute',
            'minutes_on_trip': 'minute'
        },
        'columns_to_merge': [
            [
                'start_timestamp_utc',
                'end_timestamp_utc'
            ],
            [
                'start_timestamp_local',
                'end_timestamp_local'
            ]
        ],
        'value_names': [
            'event_timestamp_utc',
            'event_timestamp_local'
        ],
        'label_names': [
            'start',
            'end'
        ]
    }
]

## Part 1
### Pivoting events by type

This pipeline is applied to the three raw files described above.

For each file, the main operation is the pivoting of columns according to the type of the event in the row.
Each event is thus given an id and some extra information such as its type.

The data is then split into two tables:
- `period`: the same as the original table but extended with an id for each event type and with a period id
- `events`: the list of events and their properties

Emmanuel's description:
> Separates "events" and "periods" in Uber data files, with UUIDs matching events to start or end limits of intervals
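
The pivoting described above can be sketched on a toy table. This is a minimal illustration with made-up values, not the notebook's own code: it uses only two timestamp columns and the labels `start` and `end`.

```python
import uuid

import pandas as pd

# Toy "period" table (made-up values): one row per online session
periods = pd.DataFrame({
    'begin_timestamp': pd.to_datetime(['2021-01-01 08:00', '2021-01-01 12:00']),
    'end_timestamp': pd.to_datetime(['2021-01-01 09:30', '2021-01-01 12:45']),
})

# Wide -> long: each timestamp column becomes one event row per period
events = (
    periods.reset_index()
    .melt(id_vars='index', value_vars=['begin_timestamp', 'end_timestamp'],
          var_name='event_type', value_name='event_timestamp_utc')
    .rename(columns={'index': 'period_id'})
)
# Replace the source column names by short event labels
events['event_type'] = events['event_type'].replace(
    {'begin_timestamp': 'start', 'end_timestamp': 'end'})
# One fresh identifier per event
events['event_UUID'] = [uuid.uuid4() for _ in events.index]

# Copy the identifiers back so that each period points at its start and end events
for label in ['start', 'end']:
    periods[label + '_event_UUID'] = events.loc[events['event_type'] == label, 'event_UUID'].tolist()

print(events[['period_id', 'event_type', 'event_timestamp_utc']])
```

The two-row table yields four events, and each period row ends up holding the UUIDs of the events that delimit it.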

In [5]:
def replace_NaN(df: pd.DataFrame, NaN_expressions: list[str]) -> pd.DataFrame:
    """Replaces all occurrences of {NaN_expressions} by {np.nan} in {df}"""
    for NaN_expression in NaN_expressions:
        df = df.replace({NaN_expression: np.nan})
    return df


def apply_mappings(
        df: pd.DataFrame,
        column_mappings: dict[str, SeriesMapping]
) -> pd.DataFrame:
    """Applies functions in {columns_mappings} to {df}"""
    remaining = list(set(df.columns) - set(column_mappings.keys()))
    return pd.concat([df.transform(column_mappings), df[remaining]], axis=1)


def load_data(
        file_path: Path,
        column_mappings: Optional[dict[str, SeriesMapping]] = None,
        NaN_expressions: Optional[list[str]] = None
) -> pd.DataFrame:
    df = pd.read_csv(file_path)
    df = replace_NaN(df, (NaN_expressions or []) + ['NaN', 'NA', 'N/A', r'\N'])
    if column_mappings is not None:
        df = apply_mappings(df, column_mappings)
    return df


def duration_mapping(shorthand: str) -> SeriesMapping:
    return lambda s: pd.to_timedelta(s.astype(float), unit=shorthand)


datetime64: SeriesMapping = lambda s: s.astype('datetime64[ns]')
get_duration: dict[str, SeriesMapping] = {'second': duration_mapping('s'), 'minute': duration_mapping('m')}

In [6]:
def pivot_events(
        df: pd.DataFrame,
        columns_to_merge: list[list[str]],
        value_names: list[str],
        label_names: list[str]
) -> pd.DataFrame:
    assert len({len(c) for c in columns_to_merge}) == 1, \
        'pivot_events: columns_to_merge must have lines of equal lengths'
    assert len(value_names) == len(columns_to_merge), \
        'pivot_events: value_names must have as many items as the number of lines in columns_to_merge'
    assert len(label_names) == len(columns_to_merge[0]), \
        'pivot_events: label_names must have as many items as each line of columns_to_merge'

    df_no_index = df.reset_index()
    dfs = []
    for (column_to_merge, value_name) in zip(columns_to_merge, value_names):
        # the pivoting of events
        pivoted = df_no_index.melt(id_vars='index', value_vars=column_to_merge, var_name='event_type',
                                   value_name=value_name).rename(columns={'index': 'period_id'})
        # e.g. replaces instances of {begin_timestamp} with {start} in column {event_type}
        for (column, label) in zip(column_to_merge, label_names):
            pivoted = pivoted.replace(column, label)
        dfs.append(pivoted)

    merged = reduce(lambda l, r: pd.merge(l, r, how='left', on=['period_id', 'event_type']), dfs)

    merged['event_UUID'] = [uuid.uuid4() for _ in merged.index]
    merged['period_UUID'] = [uuid.uuid4() for _ in df.index] * len(label_names)
    return merged


def extend_event_info(df: pd.DataFrame, events_df: pd.DataFrame, label_names: list[str]) -> pd.DataFrame:
    df = df.copy()  # to avoid in-place operations
    for label in label_names:
        df[label + '_event_UUID'] = events_df[events_df['event_type'] == label]['event_UUID'].tolist()
    df['period_UUID'] = events_df[events_df['event_type'] == label_names[0]]['period_UUID'].tolist()
    return df

In [7]:
def first_pipeline(
        key: str,
        filename: str,
        datetime_columns: list[str],
        duration_columns: dict[str, str],
        columns_to_merge: list[list[str]],
        value_names: list[str],
        label_names: list[str]
) -> tuple[pd.DataFrame, pd.DataFrame]:
    datetime_columns = {c: datetime64 for c in datetime_columns}
    duration_columns = {k: get_duration[v] for k, v in duration_columns.items()}

    filepath = raw_folder / filename

    data_df = load_data(filepath, {**datetime_columns, **duration_columns})

    events_df = pivot_events(data_df, columns_to_merge, value_names, label_names)
    period_df = extend_event_info(data_df, events_df, label_names)

    print(filename)
    display(data_df.head(1))
    print(f'{key}-events_df.csv')
    display(events_df.head(1))
    print(f'{key}-period_df.csv')
    display(period_df.head(1))

    processed_folder.mkdir(parents=True, exist_ok=True)
    events_df.to_csv(processed_folder / f'{key}-events_df.csv', index=False)
    period_df.to_csv(processed_folder / f'{key}-period_df.csv', index=False)

    return events_df, period_df

#### 02 - Driver Lifetime Trips.csv
For this file, turns rows of
```
[request_time, request_lng, request_lat,
 begintrip_time, begintrip_lng, begintrip_lat,
 dropoff_time, dropoff_lng, dropoff_lat]
```
into rows of
`[event_id, event_type, event_time, event_lng, event_lat]`
where `event_type` is one of `[request, begintrip, dropoff]`.
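
Timestamps, latitudes and longitudes are melted as separate column groups and then aligned on the period and the event type. The sketch below illustrates that alignment with one made-up trip and only two of the three event types to keep the table small. `melt_group` is a hypothetical helper, not a function of this notebook.

```python
from functools import reduce

import pandas as pd

# One made-up trip, reduced to the request and dropoff events
trips = pd.DataFrame({
    'request_timestamp_utc': pd.to_datetime(['2021-01-01 08:00']),
    'dropoff_timestamp_utc': pd.to_datetime(['2021-01-01 08:20']),
    'request_lat': [46.19], 'dropoff_lat': [46.21],
    'request_lng': [6.15], 'dropoff_lng': [6.14],
})
labels = ['request', 'dropoff']


def melt_group(df: pd.DataFrame, columns: list[str], value_name: str) -> pd.DataFrame:
    """Hypothetical helper: melts one group of columns and labels each row by event type"""
    long = df.reset_index().melt(id_vars='index', value_vars=columns,
                                 var_name='event_type', value_name=value_name)
    long['event_type'] = long['event_type'].replace(dict(zip(columns, labels)))
    return long.rename(columns={'index': 'period_id'})


groups = {
    'event_timestamp_utc': ['request_timestamp_utc', 'dropoff_timestamp_utc'],
    'event_lat': ['request_lat', 'dropoff_lat'],
    'event_lng': ['request_lng', 'dropoff_lng'],
}
parts = [melt_group(trips, columns, name) for name, columns in groups.items()]
# Align the three long tables on the period and the event type
events = reduce(lambda l, r: pd.merge(l, r, how='left', on=['period_id', 'event_type']), parts)
print(events)
```

Each resulting row carries the time and the coordinates of a single event, which is the shape of the `events` table described above.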

In [8]:
_, _ = first_pipeline(**configs[0])

02 - Driver Lifetime Trips.csv


Unnamed: 0,request_timestamp_local,request_timestamp_utc,begintrip_timestamp_local,begintrip_timestamp_utc,dropoff_timestamp_local,dropoff_timestamp_utc,rewindtrip_timestamp_local,rewindtrip_timestamp_utc,request_to_begin_duration_seconds,trip_duration_seconds,...,has_driver_upfront_fare,is_cash_trip,wait_time_fare_local,is_on_time,earnings_boost_usd,wait_time_fare_usd,service_fee_usd,trip_distance_miles,rounding_down_amount_local,long_distance_surcharge_local
0,2017-11-02 14:27:40,2017-11-02 13:27:40,2017-11-02 14:36:01,2017-11-02 13:36:01,2017-11-02 14:48:40,2017-11-02 13:48:40,NaT,NaT,0 days 00:08:21,0 days 00:12:40,...,False,False,,,,,,1.820242,0.0,


02-events_df.csv


Unnamed: 0,period_id,event_type,event_timestamp_utc,event_lat,event_lng,event_UUID,period_UUID
0,0,request,2017-11-02 13:27:40,46.191392,6.153364,988b4af7-6e80-4a39-aa69-5a7aa18b76b6,41e7cec2-4f31-4fb2-af12-93970582678d


02-period_df.csv


Unnamed: 0,request_timestamp_local,request_timestamp_utc,begintrip_timestamp_local,begintrip_timestamp_utc,dropoff_timestamp_local,dropoff_timestamp_utc,rewindtrip_timestamp_local,rewindtrip_timestamp_utc,request_to_begin_duration_seconds,trip_duration_seconds,...,earnings_boost_usd,wait_time_fare_usd,service_fee_usd,trip_distance_miles,rounding_down_amount_local,long_distance_surcharge_local,request_event_UUID,begintrip_event_UUID,dropoff_event_UUID,period_UUID
0,2017-11-02 14:27:40,2017-11-02 13:27:40,2017-11-02 14:36:01,2017-11-02 13:36:01,2017-11-02 14:48:40,2017-11-02 13:48:40,NaT,NaT,0 days 00:08:21,0 days 00:12:40,...,,,,1.820242,0.0,,988b4af7-6e80-4a39-aa69-5a7aa18b76b6,7b9a07f1-fec1-471b-8a2a-f0c086c3be38,b1a62f65-f4e1-4f80-9012-5599148ad89b,41e7cec2-4f31-4fb2-af12-93970582678d


#### 05 - Driver Online Offline.csv
For this file, turns rows of
```
[begin_timestamp, end_timestamp,
 begin_lat, end_lat,
 begin_lng, end_lng]
```
into rows of
`[event_id, event_type, event_timestamp_utc, event_lng, event_lat]`
where `event_type` is one of `[start, end]`.

In [9]:
_, _ = first_pipeline(**configs[1])

05 - Driver Online Offline.csv


Unnamed: 0,begin_timestamp,end_timestamp,begin_timestamp_local,end_timestamp_local,duration_seconds,end_lng,vehicle_uuid,end_lat,begin_lat,status,begin_lng,city_id
0,2017-11-01 15:36:41,2017-11-01 15:36:49,2017-11-01 16:36:41,2017-11-01 16:36:49,0 days 00:00:08,6.13493,63d9a727-1ee4-4cc0-998c-69d321dd8028,46.177765,46.177765,open,6.13493,266


05-events_df.csv


Unnamed: 0,period_id,event_type,event_timestamp_utc,event_lat,event_lng,event_UUID,period_UUID
0,0,start,2017-11-01 15:36:41,46.177765,6.13493,eea8b59c-709e-428e-bfce-00e49ad3bd66,84da5ddb-1716-4ef6-960b-68174852019b


05-period_df.csv


Unnamed: 0,begin_timestamp,end_timestamp,begin_timestamp_local,end_timestamp_local,duration_seconds,end_lng,vehicle_uuid,end_lat,begin_lat,status,begin_lng,city_id,start_event_UUID,end_event_UUID,period_UUID
0,2017-11-01 15:36:41,2017-11-01 15:36:49,2017-11-01 16:36:41,2017-11-01 16:36:49,0 days 00:00:08,6.13493,63d9a727-1ee4-4cc0-998c-69d321dd8028,46.177765,46.177765,open,6.13493,266,eea8b59c-709e-428e-bfce-00e49ad3bd66,e90a3355-dc6e-463d-a67b-74408d55bb40,84da5ddb-1716-4ef6-960b-68174852019b


#### 08 - Driver Dispatches Offered and Accepted.csv
For this file, turns rows of
```
[start_timestamp_utc, end_timestamp_utc,
 start_timestamp_local, end_timestamp_local]
```
into rows of
`[event_id, event_type, event_timestamp_utc, event_timestamp_local]`
where `event_type` is one of `[start, end]`.

In [10]:
_, _ = first_pipeline(**configs[2])

08 - Driver Dispatches Offered and Accepted.csv


Unnamed: 0,start_timestamp_utc,end_timestamp_utc,start_timestamp_local,end_timestamp_local,minutes_online,minutes_active,minutes_on_trip,driver_adjusted_fares,partner_uuids,rider_cancellations,...,driver_cancellations,rejections,flow_type,accepts,expireds,dispatches,city_id,completed_trips,trip_fares,vehicle_uuids
0,2021-11-18 12:00:00,2021-11-18 13:00:00,2021-11-18 13:00:00,2021-11-18 14:00:00,0 days 01:00:00,0 days,0 days,0.0,"[""f0699b53-7acb-48ba-9cea-e872a1de9fb9""]",0,...,0,0,UberX,0,0,0,266,0,0.0,"[""63d9a727-1ee4-4cc0-998c-69d321dd8028""]"


08-events_df.csv


Unnamed: 0,period_id,event_type,event_timestamp_utc,event_timestamp_local,event_UUID,period_UUID
0,0,start,2021-11-18 12:00:00,2021-11-18 13:00:00,3e43d97e-39d9-438c-9365-35e29f9fe212,7573c8bc-b0da-4e43-9809-bfff49aec160


08-period_df.csv


Unnamed: 0,start_timestamp_utc,end_timestamp_utc,start_timestamp_local,end_timestamp_local,minutes_online,minutes_active,minutes_on_trip,driver_adjusted_fares,partner_uuids,rider_cancellations,...,accepts,expireds,dispatches,city_id,completed_trips,trip_fares,vehicle_uuids,start_event_UUID,end_event_UUID,period_UUID
0,2021-11-18 12:00:00,2021-11-18 13:00:00,2021-11-18 13:00:00,2021-11-18 14:00:00,0 days 01:00:00,0 days,0 days,0.0,"[""f0699b53-7acb-48ba-9cea-e872a1de9fb9""]",0,...,0,0,0,266,0,0.0,"[""63d9a727-1ee4-4cc0-998c-69d321dd8028""]",3e43d97e-39d9-438c-9365-35e29f9fe212,853539f5-daca-4e8d-8566-45ee7a5a2d00,7573c8bc-b0da-4e43-9809-bfff49aec160


## Part 2
### Merging time intervals

*This part is not yet fully understood.*

It merges time intervals from two period files, but the purpose of the merge is not yet clear.

Emmanuel's description:
> merges Period files to find intersections between periods events recorded automatically by the Uber application ("Driver Online Offline") and trips performed by the human driver ("Driver Lifetime Trips")
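
To make the idea of intersecting periods concrete, the sketch below pairs made-up online sessions with made-up trips whenever their time intervals overlap. It is a brute-force comparison in plain Python, not the NCLS-based merge used in this notebook, and it assumes that intervals which merely touch at an endpoint do not count as overlapping. `overlapping_pairs` is a hypothetical name.

```python
from datetime import datetime

# Made-up intervals: (start, end) of online sessions and of trips
sessions = [
    (datetime(2021, 1, 1, 8, 0), datetime(2021, 1, 1, 10, 0)),
    (datetime(2021, 1, 1, 12, 0), datetime(2021, 1, 1, 13, 0)),
]
trips = [
    (datetime(2021, 1, 1, 8, 30), datetime(2021, 1, 1, 8, 50)),
    (datetime(2021, 1, 1, 9, 50), datetime(2021, 1, 1, 10, 10)),
    (datetime(2021, 1, 1, 11, 0), datetime(2021, 1, 1, 11, 30)),
]


def overlapping_pairs(left, right):
    """Brute-force comparison: returns (i, j) whenever left[i] and right[j] share some time"""
    pairs = []
    for i, (start1, end1) in enumerate(left):
        for j, (start2, end2) in enumerate(right):
            # two intervals overlap when each one starts before the other ends
            if start1 < end2 and start2 < end1:
                pairs.append((i, j))
    return pairs


print(overlapping_pairs(sessions, trips))  # [(0, 0), (0, 1)]
```

The first session overlaps the first two trips, the second trip only partially, while the third trip falls outside every session and therefore produces no pair. The nested loops compare every session with every trip, which is fine for a toy example but slow on large tables.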

In [18]:
def fix_timestamp(df: pd.DataFrame, col_to_replace: str, col_rescue: str):
    # If {col_to_replace} is null (NaT), replace with {col_rescue} of the next row
    df[col_to_replace] = np.where(df[col_to_replace].isnull(),
                                  df[col_rescue].shift(-1),
                                  df[col_to_replace])
    return df


def load_and_convert_dates(filepath: PurePath, datetime_columns: Optional[list[str]] = None):
    df = pd.read_csv(filepath)
    for i in (datetime_columns or []):
        df[i] = pd.to_datetime(df[i])
    return df

In [11]:
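def merge_over_time_intervals(df1, df2, start1, end1, start2, end2):
    """Pairs each row of {df1} with each row of {df2} whose time interval overlaps it"""
    # work on copies so that the caller's dataframes are not modified
    df1, df2 = df1.copy(), df2.copy()

    # NCLS needs int64 arrays: view the datetimes as integers (nanoseconds since the epoch)
    df1['start_unixts'] = df1[start1].to_numpy().view('int64')
    df1['end_unixts'] = df1[end1].to_numpy().view('int64')
    df2['start_unixts'] = df2[start2].to_numpy().view('int64')
    df2['end_unixts'] = df2[end2].to_numpy().view('int64')

    ncls = NCLS(df1['start_unixts'].values, df1['end_unixts'].values, df1.index.values)

    # x1: index labels of the rows of df2, x2: index labels of the overlapping rows of df1
    x1, x2 = ncls.all_overlaps_both(df2['start_unixts'].values, df2['end_unixts'].values, df2.index.values)

    df1 = df1.reindex(x2).reset_index(drop=True)
    df2 = df2.reindex(x1).reset_index(drop=True)

    df = df1.join(df2, rsuffix='2')

    # drop the helper columns of both tables (those coming from df2 carry the '2' suffix)
    df.drop(['start_unixts', 'end_unixts', 'start_unixts2', 'end_unixts2'], axis=1, inplace=True)
    return df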

In [19]:
def second_pipeline() -> pd.DataFrame:
    trips_df = load_and_convert_dates(processed_folder / '02-period_df.csv',
                                      datetime_columns=next(c for c in configs if c['key'] == '02')['datetime_columns'])
    app_connection_full_df = load_and_convert_dates(processed_folder / '05-period_df.csv',
                                                    datetime_columns=next(c for c in configs if c['key'] == '05')[
                                                        'datetime_columns'])

    app_connection_full_df = fix_timestamp(app_connection_full_df, 'end_timestamp', 'begin_timestamp')
    app_connection_full_df = app_connection_full_df.dropna(subset=['begin_timestamp', 'end_timestamp'])
    trips_df = fix_timestamp(trips_df, 'begintrip_timestamp_utc', 'begintrip_timestamp_utc')
    trips_df = fix_timestamp(trips_df, 'dropoff_timestamp_utc', 'begintrip_timestamp_utc')

    merged = merge_over_time_intervals(app_connection_full_df, trips_df,
                                       'begin_timestamp', 'end_timestamp',
                                       'request_timestamp_utc', 'dropoff_timestamp_utc')

    merged.to_csv(processed_folder / 'fusion_df.csv', index=False)

    return merged

In [20]:
fusion_df = second_pipeline()