### 0.1 Case Study

#### Scenario
At BMW, we reimagine the future of mobility. Let's fast forward to 2030: flying taxis are roaming above our cities, bringing people to
their desired destinations. You work for Duoro Hawk, a company that is pioneering the world's first large fleet of fully electric, self-piloting
autonomous flying taxis. The company wants to deploy the first network of autonomous air taxis in the coming year. As part of our data
science and engineering team, you are responsible for predicting the destination of our fleet of autonomous flying taxis based on the
manned test flights that have been performed.

#### About the Dataset
A fictional dataset describing a complete year (from 01/07/2013 to 30/06/2014) of all the trajectories for all 442 of our flying taxis that
were simulated in the city of Porto. Our autonomous fleet of taxis flies from a central ground station.

There are three different types of rides: A) phone call-based, B) stand-based, where people wait at a stand for their flying taxi, or C)
random place. For type A, we provide an anonymized ID to represent the telephone call. Categories B and C refer to cases where the
taxis were directly called by the customer.

#### Dataset
##### train.csv
Each data sample corresponds to one completed trip. It contains a total of 9 (nine) features, described as follows:

- TRIP_ID: (String) It contains a unique identifier for each trip;

- CALL_TYPE: (char) It identifies the way used to demand this service. It may contain one of three possible values:
     - ‘A’ if this trip was dispatched from the central;
     - ‘B’ if this trip was demanded directly to a taxi driver on a specific stand;
     - ‘C’ otherwise (i.e. a trip demanded on a random street).
     
- ORIGIN_CALL: (integer) It contains a unique identifier for each phone number which was used to demand at least one service. It identifies the trip's customer if CALL_TYPE='A'. Otherwise, it assumes a NULL value;

- ORIGIN_STAND: (integer) It contains a unique identifier for the taxi stand. It identifies the starting point of the trip if CALL_TYPE='B'. Otherwise, it assumes a NULL value;

- WEATHER: (String) Information on the weather that day. Unique values include: Sunny, Rainy, Cloudy, Windy, and Foggy;
- TAXI_ID: (integer) It contains a unique identifier for the flying taxi that performed each trip;
- TIMESTAMP: (integer) Unix Timestamp (in seconds). It identifies the trip's start;
- MISSING_DATA: (Boolean) It is FALSE when the GPS data stream is complete and TRUE whenever one (or more) locations are missing;
- POLYLINE: (String) It contains a list of GPS coordinates (i.e. WGS84 format) mapped as a string. The beginning and the end of the string are identified with brackets (i.e. [ and ], respectively). Each pair of coordinates is also identified by the same brackets as [LONGITUDE, LATITUDE]. This list contains one pair of coordinates for each 15 seconds of trip. The last list item corresponds to the trip's destination while the first one represents its start.


##### test.csv
Personal records for the remaining one-third (~110) of the trips, to be used as test data. Your task is to predict the coordinates of each trip's destination.

##### sample_submission.csv 
A submission file in the correct format.
- TripId - Id for each trip in the test set
- Longitude - the longitude of the destination of the flying taxi
- Latitude - the latitude of the destination of the flying taxi

The total travel time of a trip is defined as (number of points - 1) x 15 seconds. For example, a trip with 101 data points in POLYLINE has a length of (101 - 1) x 15 = 1500 seconds. Note that the prediction target of this case study is the trip's destination, not its travel time. Some trips have missing data points in POLYLINE, indicated by the MISSING_DATA column, and deciding how to use this knowledge is part of the challenge.
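The travel-time formula can be checked with a short sketch. The function name `travel_time_seconds` and the constant `SAMPLING_INTERVAL_S` are illustrative and not part of the dataset or the project code; the sketch assumes POLYLINE has already been parsed into a list of coordinate pairs.

```python
SAMPLING_INTERVAL_S = 15  # one GPS point every 15 seconds, per the dataset description


def travel_time_seconds(polyline):
    """Return (number of points - 1) * 15 s; 0 for empty or single-point trips."""
    return max(len(polyline) - 1, 0) * SAMPLING_INTERVAL_S


# A toy trip with 101 identical points, only to reproduce the example in the text
trip = [[-8.618643, 41.141412]] * 101
print(travel_time_seconds(trip))  # 1500
```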

### 0.2 Imports

In [28]:
!pip install awswrangler
!pip install geojson



In [29]:
import os
import sys
import pandas as pd
import numpy as np
from datetime import datetime
import time
import json
import matplotlib.pyplot as plt
import seaborn as sns
import awswrangler as aw

In [30]:
utils_path = os.path.join('/home/ec2-user/SageMaker/thinktank_casestudy/src/utils/')
pp_path = os.path.join('/home/ec2-user/SageMaker/thinktank_casestudy/src/preprocessing/')

sys.path.append(utils_path)
sys.path.append(pp_path)

In [31]:
from utils import *
from preprocessing import *

### 0.3 Load Data

In [32]:
train_data = pd.read_parquet('s3://think-tank-casestudy/train_df.parquet')

In [33]:
test_data = pd.read_parquet('s3://think-tank-casestudy/test_df.parquet')

In [34]:
train_data.columns

Index(['Unnamed: 0', 'TRIP_ID', 'CALL_TYPE', 'ORIGIN_CALL', 'ORIGIN_STAND',
       'TAXI_ID', 'TIMESTAMP', 'DAY_TYPE', 'MISSING_DATA', 'POLYLINE',
       'WEATHER'],
      dtype='object')

In [35]:
test_data.columns

Index(['TRIP_ID', 'CALL_TYPE', 'ORIGIN_CALL', 'ORIGIN_STAND', 'TAXI_ID',
       'TIMESTAMP', 'DAY_TYPE', 'MISSING_DATA', 'POLYLINE', 'WEATHER'],
      dtype='object')

In [36]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,TRIP_ID,CALL_TYPE,ORIGIN_CALL,ORIGIN_STAND,TAXI_ID,TIMESTAMP,DAY_TYPE,MISSING_DATA,POLYLINE,WEATHER
0,0,1372636858620000589,C,,,20000589,1372636858,A,False,"[[-8.618643,41.141412],[-8.618499,41.141376],[...",Rainy
1,1,1372637303620000596,B,,7.0,20000596,1372637303,A,False,"[[-8.639847,41.159826],[-8.640351,41.159871],[...",Foggy
2,2,1372636951620000320,C,,,20000320,1372636951,A,False,"[[-8.612964,41.140359],[-8.613378,41.14035],[-...",Rainy
3,3,1372636854620000520,C,,,20000520,1372636854,A,False,"[[-8.574678,41.151951],[-8.574705,41.151942],[...",Cloudy
4,4,1372637091620000337,C,,,20000337,1372637091,A,False,"[[-8.645994,41.18049],[-8.645949,41.180517],[-...",Windy


In [37]:
train_data = train_data.drop(['Unnamed: 0'], axis=1)

The prediction target is the destination longitude and latitude for each trip in test_data.

DAY_TYPE is not mentioned in the list of attributes, and its distribution is the same in both train and test data, so it carries no information gain and can be dropped.

In [38]:
train_data = train_data.drop(['DAY_TYPE'], axis=1)
test_data = test_data.drop(['DAY_TYPE'], axis=1)
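One way to back the decision to drop DAY_TYPE is to check how many distinct values a column takes. The sketch below runs on toy data and assumes the column holds a single value, as the preview rows suggest; the helper name `constant_columns` is hypothetical and not part of the project's utilities.

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns with at most one distinct value (NaN counted as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Toy frame that only mimics the layout of the real data
toy = pd.DataFrame({"CALL_TYPE": ["C", "B", "C"], "DAY_TYPE": ["A", "A", "A"]})
print(constant_columns(toy))  # ['DAY_TYPE']
```

Running the same check on both train_data and test_data would show whether the column is constant in each set.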

In [41]:
train_data = convert_polyline_to_geojson_format(data=train_data,
                                                name_column='POLYLINE')

NameError: name 'convert_polyline_to_geojson_format' is not defined
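The helper called above is not defined in this session and its implementation is not shown. As a rough sketch of what such a step might involve, assuming POLYLINE is stored as a JSON-style string like the one in the preview, `json.loads` can turn it into a list of [longitude, latitude] pairs. The function name `parse_polyline_column` is hypothetical and is not the project's helper.

```python
import json

import pandas as pd


def parse_polyline_column(data, name_column="POLYLINE"):
    """Return a copy of `data` with the string column parsed into lists of [lon, lat] pairs."""
    out = data.copy()
    out[name_column] = out[name_column].apply(json.loads)
    return out


toy = pd.DataFrame(
    {"POLYLINE": ["[[-8.618643,41.141412],[-8.618499,41.141376]]", "[]"]}
)
parsed = parse_polyline_column(toy)
print(parsed.POLYLINE.iloc[0][-1])  # [-8.618499, 41.141376]
```

Null entries would need separate handling, since `json.loads` only accepts strings.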

In [None]:
type(x)

In [42]:
test_data = adjust_datatypes(data=test_data)

In [None]:
train_data = adjust_datatypes(data=train_data)

In [None]:
print(train_data.TIMESTAMP_DT.min())
print(train_data.TIMESTAMP_DT.max())
print(test_data.TIMESTAMP_DT.min())
print(test_data.TIMESTAMP_DT.max())

All dates are in previously defined valid ranges
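`adjust_datatypes` is defined elsewhere, so how TIMESTAMP_DT is derived is not visible here. A plausible sketch, assuming it is the Unix TIMESTAMP converted to a datetime, followed by a range check. The bounds `start` and `end` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"TIMESTAMP": [1372636858, 1372637303]})

# unit="s" interprets the integers as seconds since the Unix epoch (UTC)
toy["TIMESTAMP_DT"] = pd.to_datetime(toy["TIMESTAMP"], unit="s")

# Hypothetical bounds, for illustration only
start, end = pd.Timestamp("2013-07-01"), pd.Timestamp("2014-07-01")
in_range = toy["TIMESTAMP_DT"].between(start, end)

print(toy["TIMESTAMP_DT"].min())  # 2013-07-01 00:00:58
print(in_range.all())  # True
```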

### 0.4 Sanity Checks

In [None]:
# Assert that train and test data have the same columns and unique TRIP_IDs
def perform_sanity_checks():
    try:
        assert train_data.shape[1] == test_data.shape[1], "column count differs"
        print("Column shape train vs test passed")
        assert (train_data.columns == test_data.columns).all(), "column names differ"
        print("Column naming train vs test passed")
        assert train_data.TRIP_ID.nunique() == train_data.shape[0], "duplicated TRIP_IDs in train data"
        print("Check for unique trips passed - train data")
        assert test_data.TRIP_ID.nunique() == test_data.shape[0], "duplicated TRIP_IDs in test data"
        print("Check for unique trips passed - test data")
        print('All checks passed!')
    except AssertionError as err:
        # Catch only failed assertions so that unrelated errors still surface
        print(f"Sanity Check failed: {err}")

In [None]:
perform_sanity_checks()

### 0.5 Cleaning Data

#### 0.5.1 NAN/Null Missing values

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

ORIGIN_CALL and ORIGIN_STAND have null values, which is expected because each is only populated for its matching call type.

#### 0.5.2 Duplicated TRIP_IDs
The Sanity Checks in 0.4 showed that the TRIP_IDs are not unique. 

In [None]:
vc = train_data.TRIP_ID.value_counts().reset_index()

In [None]:
vc

In [None]:
DUPLICATED_IDs = vc[vc['count'] > 1]['TRIP_ID'].unique()
print(f'{len(DUPLICATED_IDs)} TRIP_IDs are duplicated')
print(f'{(len(DUPLICATED_IDs)/train_data.TRIP_ID.nunique()*100)} % out of all unique TRIPs.')

**Findings**:
- 159 cases
- MISSING_DATA == False for all cases
- 80 TRIP_IDs are duplicated
- The affected data is insignificant (less than 1% of all trips)

**Assumptions**:
- Potential reasons include cancellation by the dispatcher after a person called, failed flight attempts, or a broken flying taxi.
- For each duplicated TRIP_ID, the trip with the longest POLYLINE is kept, as it is assumed to be the valid trip. Additionally, trips with no POLYLINE or only one coordinate point are assumed invalid and filtered from the dataset.
- It is also assumed that only POLYLINEs with at least 10 coordinate points are sufficient.
- Further investigation should be done together with sensor/technical data from the flying taxi. Another possible reason is that a flight was interrupted and restarted, which could be analyzed by plotting the POLYLINE and comparing the start and end points of the duplicated trips. This will be part of further optimization.

To apply these assumptions, the number of points in each POLYLINE needs to be calculated. The total flight time is calculated at the same step.
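The rule of keeping the trip with the longest POLYLINE per TRIP_ID could look like the sketch below. It assumes POLYLINE is already a list of coordinate pairs; the function name `keep_longest_trip` and the temporary column `_N_POINTS` are illustrative and not taken from the project's preprocessing module.

```python
import pandas as pd


def keep_longest_trip(df, min_points=10):
    """Drop short trips, then keep the row with the most points per TRIP_ID."""
    out = df.copy()
    out["_N_POINTS"] = out["POLYLINE"].apply(len)
    out = out[out["_N_POINTS"] >= min_points]
    # After a descending sort, the first row per TRIP_ID is the longest one
    out = out.sort_values("_N_POINTS", ascending=False, kind="stable")
    out = out.drop_duplicates(subset="TRIP_ID", keep="first")
    return out.drop(columns="_N_POINTS").sort_index()


toy = pd.DataFrame({
    "TRIP_ID": ["t1", "t1", "t2", "t3"],
    "POLYLINE": [[[0, 0]] * 12, [[0, 0]] * 30, [[0, 0]] * 15, [[0, 0]] * 3],
})
result = keep_longest_trip(toy)
print(result.TRIP_ID.tolist())  # ['t1', 't2']
```

Here `t3` is removed for having fewer than 10 points, and of the two `t1` rows only the 30-point one survives.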

#### 0.5.3 Data Cleaning POLYLINE
To clean the POLYLINE data, a few more attributes are calculated:
- N_COORDINATE_POINTS - total number of points
- TOTAL_FLIGHT_TIME_SECONDS, TOTAL_FLIGHT_TIME_MINUTES - total flight time
- START_POINT - starting point of each trip
- DEST_POINT - last point of each trip
- TOTAL_DISTANCE_KM - total distance of the trip in km, computed with the haversine formula
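The distance helper itself lives in the preprocessing module and is not shown here. As a minimal sketch of the haversine approach, assuming the total distance is the sum of great-circle distances between consecutive points and a mean Earth radius of 6371 km, it could look like this. The names `haversine_km` and `polyline_distance_km` are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def polyline_distance_km(polyline):
    """Sum the haversine distances between consecutive [lon, lat] points."""
    pts = np.asarray(polyline, dtype=float)
    if len(pts) < 2:
        return 0.0
    lon, lat = pts[:, 0], pts[:, 1]
    return float(haversine_km(lon[:-1], lat[:-1], lon[1:], lat[1:]).sum())


# Two one-degree steps along the equator, roughly 111.19 km each
print(round(polyline_distance_km([[0, 0], [1, 0], [2, 0]]), 2))  # 222.39
```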

In [None]:
train_data.head()

In [None]:
train_data = calculate_POLYLINE_features(train_data)
test_data = calculate_POLYLINE_features(test_data)

Based on the assumption above, only polylines with at least 10 coordinate points are kept.

In [None]:
train_data = filter_invalid_trips(train_data, n_points=10)
test_data = filter_invalid_trips(test_data, n_points=10)

Calculating the total distance of the trip in km

In [None]:
train_data = calculate_total_distance(train_data)
test_data = calculate_total_distance(test_data)

In [None]:
train_data['SEQUENCE'] = train_data.POLYLINE.apply(lambda row: np.hstack(row))
test_data['SEQUENCE'] = test_data.POLYLINE.apply(lambda row: np.hstack(row))

In [None]:
train_data = train_data.drop(['POLYLINE'],axis=1)
test_data = test_data.drop(['POLYLINE'],axis=1)

#### 0.5.4 MISSING_DATA == TRUE

In [None]:
# Count the True values directly instead of indexing value_counts() with [1]
print(f'Train Data: {(train_data.MISSING_DATA == True).sum()} Trips with MISSING_DATA == True')

In [None]:
test_data.MISSING_DATA.value_counts()

In [None]:
train_data[train_data.MISSING_DATA == True]

- The amount of data with missing values is insignificant compared to the total amount of data.
- The majority of these trips have WEATHER == Rainy; however, the number of trips is not large enough to draw a conclusion or make an assumption.
- The number of points per POLYLINE varies in general, so it gives no indication of how much data is missing. There is also no information on whether data is missing at the start, middle, or end of the POLYLINE.

Based on these findings, I would simply drop these trips, mainly because their effect is expected to be very small. If the number of affected samples were higher, I would try to impute the missing coordinates using the nearest neighbour. However, the problem of not knowing whether the coordinates are missing at the start, middle, or end would remain. If very similar trips can be found through additional logic (difference in n_coordinate_points <= 5 and overall_distance between points < threshold), this problem could be reduced. These tasks could be part of further optimization.

In [None]:
train_data = train_data[train_data.MISSING_DATA != True]

#### 0.5.5 OUTLIER
To handle the outliers, we look at statistical indicators and plot the boxplot.

In [None]:
train_data.head()

In [None]:
plt.figure(figsize=(10, 10))
sns.boxplot(data=train_data[['N_COORDINATE_POINTS','TOTAL_FLIGHT_TIME_MINUTES','TOTAL_DISTANCE_KM']])
plt.show()

The continuous attributes show a high number of outliers, with the number of coordinate points having the widest spread.
To avoid losing too much data, keeping the rows up to the 90% quantile of N_COORDINATE_POINTS and TOTAL_DISTANCE_KM seems to be the best choice.

In [None]:
train_data = train_data[(train_data.N_COORDINATE_POINTS <= train_data.N_COORDINATE_POINTS.quantile(0.90))
                 & (train_data.TOTAL_DISTANCE_KM <= train_data.TOTAL_DISTANCE_KM.quantile(0.90))]
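The quantile filter can be wrapped in a small reusable function. This is a sketch on toy data, not the project's code; the name `filter_upper_quantile` is hypothetical. As in the cell above, the thresholds are computed on the unfiltered data before any row is removed.

```python
import pandas as pd


def filter_upper_quantile(df, columns, q):
    """Keep rows whose value is at or below the q-quantile in every listed column."""
    thresholds = df[columns].quantile(q)  # one threshold per column
    mask = (df[columns] <= thresholds).all(axis=1)
    return df[mask]


toy = pd.DataFrame({
    "N_COORDINATE_POINTS": range(1, 101),
    "TOTAL_DISTANCE_KM": [0.1 * i for i in range(1, 101)],
})
kept = filter_upper_quantile(
    toy, ["N_COORDINATE_POINTS", "TOTAL_DISTANCE_KM"], q=0.90
)
print(len(toy), len(kept))
```

Because both conditions must hold, the share of rows kept can fall below `q` when the two columns are not perfectly correlated.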

In [None]:
plt.figure(figsize=(10, 10))
sns.boxplot(data=train_data[['N_COORDINATE_POINTS','TOTAL_FLIGHT_TIME_MINUTES','TOTAL_DISTANCE_KM']])
plt.show()

Some outliers remain, but the spread is significantly reduced. Outliers in the test data are kept to avoid reducing the test set too much.

In [None]:
sns.set()
plt.hist(train_data.TOTAL_FLIGHT_TIME_MINUTES, 
         label=f'Post invalid trips N={train_data.shape[0]}')
plt.title('Distribution - total flight time in minutes (after outlier filtering)')
plt.legend()
plt.show()

In [None]:
sns.set()
plt.hist(train_data.TOTAL_DISTANCE_KM, 
         label=f'Post invalid trips N={train_data.shape[0]}')
plt.title('Distribution - Count total distance km')
plt.legend()
plt.show()

The reduction of the training data does not have a major effect on the data distribution. A possible optimization is to compare model performance with and without outliers.

In [None]:
perform_sanity_checks()

#### 0.5.6 CALL_TYPE LOGIC

In [None]:
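def check_call_type(data):
    # NaN never compares equal to anything, so use isna()/notna() instead of == np.NaN
    # Type A trips that lack a caller ID
    data_A = data[(data.CALL_TYPE == 'A') & (data.ORIGIN_CALL.isna())]
    # Type B trips that lack a stand ID
    data_B = data[(data.CALL_TYPE == 'B') & (data.ORIGIN_STAND.isna())]
    # Number of distinct stand IDs recorded on type C trips
    data_C = data[(data.CALL_TYPE == 'C') & (data.ORIGIN_STAND.notna())].ORIGIN_STAND.nunique()
    return data_A, data_B, data_C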

In [None]:
check_call_type(train_data)

In [None]:
check_call_type(test_data)
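A note on the null checks in this step: NaN never compares equal to anything, including itself, so an equality comparison against NaN cannot detect missing values. The pandas methods `isna()` and `notna()` are the usual tools. A small illustration on toy data (the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CALL_TYPE": ["A", "A", "B"],
    "ORIGIN_CALL": [2002.0, np.nan, np.nan],
})

print(np.nan == np.nan)  # False

# An equality mask against NaN is False everywhere, so it selects nothing
eq_mask = toy.ORIGIN_CALL == np.nan
print(eq_mask.sum())  # 0

# isna() finds the type A trip without a caller ID
violations = toy[(toy.CALL_TYPE == "A") & toy.ORIGIN_CALL.isna()]
print(len(violations))  # 1
```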

In [None]:
train_data.drop(['TIMESTAMP','MISSING_DATA','TOTAL_FLIGHT_TIME_SECONDS'],
                axis=1, inplace=True)
test_data.drop(['TIMESTAMP','MISSING_DATA','TOTAL_FLIGHT_TIME_SECONDS'],
                axis=1, inplace=True)

In [None]:
train_data.info()

In [None]:
aw.s3.to_parquet(df=train_data, path='s3://think-tank-casestudy/preprocessed_data/train_data_preprocess.parquet')
aw.s3.to_parquet(df=test_data, path='s3://think-tank-casestudy/preprocessed_data/test_data_preprocess.parquet')