# Trip Duration Prediction
*By Michael Florip, Spring 2024*

**Project Goal**
The goal of this project is to explore use cases of telematics data using the *v2x_columbus_trips/v2x_columbus_trips_summary_clean* dataset provided by the 99P Labs team at the Honda Research Institute USA. 

In [None]:
# Import necessary libraries
import pandas as pd
from datetime import datetime

In [None]:
# Explore overall dataset
df = pd.read_csv("/Users/michaelflorip/Desktop/Berkeley/hri99p/v2x_columbus_trips/v2x_columbus_trips_summary_clean.csv", encoding='ascii')
print(df.head())
print(df.describe(include='all'))

### Observations and Potential Predictive Modeling Ideas

**Temporal Data**
The dataset contains several time-related fields (timestamp, startlocaltime, endlocaltime). These could be used to predict traffic patterns or trip durations.

**Geospatial Data**
Latitude and longitude fields (startlatitude, startlongitude, endlatitude, endlongitude) are present, which could be useful for predicting trip destinations or analyzing spatial patterns.

**Vehicle and Trip Metrics** 
Fields like numbsmtx (number of Basic Safety Message, or BSM, transmissions), numnormalbsmrx (number of normal BSM receptions), and numintersectionencounters could be used to predict vehicle behavior or safety-related metrics.

**Device and Configuration**
Fields like device, firmwareversionstring, and configsversionstring might be useful for predicting device performance or maintenance needs.

### Potential Project Scopes
**Predicting Trip Duration**
Derive each trip's duration from its start and end times, then predict that duration from information available when the trip begins. This could be useful for route planning and traffic management.

**Predicting Destination**
Use starting coordinates and other trip data to predict the end coordinates. This could be useful for navigation and ride-sharing applications.

**Predicting Vehicle Performance Metrics**
Use device data and trip metrics to predict performance-related outcomes like fuel efficiency or maintenance needs.

### Predictive Modeling Options
**Regression Model**
Appropriate when the target is a continuous variable, such as trip duration or fuel efficiency.

**Classification Model**
Appropriate when the target is categorical, such as whether a trip will encounter a certain number of intersections or safety incidents.

## Game Plan for Model Creation
**Data Preparation**
* Convert *startlocaltime* and *endlocaltime* from their current format to a more usable datetime format
* Calculate the trip duration by subtracting *startlocaltime* from *endlocaltime*
* Handle any missing or erroneous data

**Feature Selection**
* Decide on which features might influence trip duration
* Potential features include: *startlatitude*, *startlongitude*, *endlatitude*, *endlongitude*, *timestamp*, *filedate*, *device*, *firmwareversionstring*
* Create new features if necessary (example: time of day from *timestamp*)

**Model Selection**
Since trip duration is a continuous variable, I will take a regression approach. Potential models include:
* Linear regression
* Decision Tree regressor
* Random Forest regressor
* Gradient Boosting machines

**Model Training**
* Split the data into training and testing sets (80/20 split)
* Train the model on the training set

**Model Evaluation**
* Evaluate the model on the testing set using metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE)
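
To make the three metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The arrays are made-up durations in minutes, not values from the dataset. scikit-learn provides `mean_absolute_error` and `mean_squared_error` in `sklearn.metrics` for the same calculations.

```python
import numpy as np

# Hypothetical trip durations in minutes (illustrative values, not from the dataset)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 36.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # average absolute error, in minutes
mse = np.mean(errors ** 2)      # average squared error, in minutes squared
rmse = np.sqrt(mse)             # square root of MSE, back on the minutes scale

print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")
```

MAE and RMSE are in the same unit as the target (minutes), which makes them easier to interpret than MSE. RMSE penalizes large errors more heavily than MAE does.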

### Data Preparation

In [None]:
# Convert timestamps to datetime
try:
    df['startlocaltime'] = pd.to_datetime(df['startlocaltime'], unit='ms')
    df['endlocaltime'] = pd.to_datetime(df['endlocaltime'], unit='ms')
except Exception as e:
    print('Error converting to datetime:', e)

# Calculate trip duration in minutes
try:
    df['trip_duration'] = (df['endlocaltime'] - df['startlocaltime']).dt.total_seconds() / 60
except Exception as e:
    print('Error calculating trip duration:', e)

# Display the updated dataframe
print(df[['startlocaltime', 'endlocaltime', 'trip_duration']].head())

**Observations**
* The *startlocaltime* and *endlocaltime* have been successfully converted to datetime format
* The *trip_duration* column has been added, representing the duration of each trip in minutes

### Next Steps
**Data Cleaning**
Check for any anomalies or outliers in the trip durations, such as extremely long or short trips, which might affect the model's performance

**Feature Engineering**
Create additional features that might be relevant for predicting trip duration

**Model Training**
Begin training a simple regression model

In [None]:
# Check for anomalies or outliers in trip durations
print('Descriptive statistics for trip duration:\n', df['trip_duration'].describe())

# Identify extreme outliers
outliers = df[df['trip_duration'] > df['trip_duration'].quantile(0.99)]
print('\nOutliers (above 99th percentile):\n', outliers[['trip_duration', 'startlocaltime', 'endlocaltime']].head())

# Check for any trips with negative durations (errors)
negative_durations = df[df['trip_duration'] < 0]
print('\nNegative trip durations (errors):\n', negative_durations[['trip_duration', 'startlocaltime', 'endlocaltime']].head())

**Observations**
* The mean trip duration is about -5.35 minutes. Since a duration cannot be negative, this points to errors in the data
* The standard deviation is quite high, indicating significant variability in trip durations
* The minimum trip duration is -1438.49 minutes, confirming that there is erroneous data

**Next Steps**
* The negative durations are clearly errors and will be removed from the dataset
* To handle outliers, I will cap the outliers at the 99th percentile

In [None]:
# Remove negative trip durations
df = df[df['trip_duration'] >= 0].copy()  # copy so later assignments modify this frame, not a view
print('Removed negative trip durations.')

# Cap outliers at the 99th percentile
percentile_99 = df['trip_duration'].quantile(0.99)
df.loc[df['trip_duration'] > percentile_99, 'trip_duration'] = percentile_99
print('Capped outliers at the 99th percentile.')
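
As an alternative way to write the capping step above, pandas offers `Series.clip`, which replaces everything above a threshold in one call. This is a small sketch on made-up durations, not the notebook's data.

```python
import pandas as pd

# Hypothetical durations in minutes (illustrative values, not from the dataset)
durations = pd.Series([5.0, 8.0, 12.0, 15.0, 500.0])

# Threshold at the 99th percentile, as in the cell above
cap = durations.quantile(0.99)

# Values above the cap are replaced by the cap; everything else is unchanged
capped = durations.clip(upper=cap)

print(capped)
```

Because `clip` returns a new Series, it sidesteps in-place assignment on a filtered DataFrame.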

### Next Steps
**Feature Engineering**
Create additional features that might be relevant for predicting trip duration

**Model Training**
Begin training a simple regression model

In [None]:
# Create time-based features
df['start_hour'] = df['startlocaltime'].dt.hour
df['day_of_week'] = df['startlocaltime'].dt.dayofweek

print('Added time-based features: start_hour and day_of_week.')

Time-based features have been added to the dataset:
* *start_hour*: Represents the hour of the day the trip started
* *day_of_week*: Indicates the day of the week the trip started (0 = Monday, 6 = Sunday)

These features can help capture daily and weekly patterns in trip data.
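
One caveat with a raw hour value is that 23 and 0 look far apart numerically even though they are adjacent on the clock. A common workaround, which this notebook does not use, is a sine/cosine (cyclical) encoding. The sketch below uses made-up hours, and the column names `hour_sin` and `hour_cos` are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical start hours (illustrative values, not from the dataset)
demo = pd.DataFrame({"start_hour": [0, 6, 12, 18, 23]})

# Map the hour onto a circle so that 23:00 and 00:00 end up close together
angle = 2 * np.pi * demo["start_hour"] / 24
demo["hour_sin"] = np.sin(angle)
demo["hour_cos"] = np.cos(angle)

print(demo.round(3))
```

Tree-based models such as Random Forest can split on the raw hour directly, so this encoding tends to matter more for linear models.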

### Next Steps
**Data Split for Training & Testing**

Split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

# Define features and target variable
X = df[['start_hour', 'day_of_week']]  # Assuming these are the only features for now
y = df['trip_duration']

# Split the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Data split into training and testing sets.')

**Choosing a model for regression tasks**
* Linear regression: simple and fast
* Decision Tree: handles non-linear data well
* Random Forest: an ensemble of decision trees that is more robust than a single tree, good for complex data
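
One way to compare the three candidates before committing to one is cross-validation. The sketch below runs on synthetic stand-in data rather than the trip dataset; the names `X_demo`, `y_demo`, and `candidates`, and the hyperparameter values, are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: duration depends non-linearly on the start hour
rng = np.random.default_rng(42)
hours = rng.integers(0, 24, size=300)
days = rng.integers(0, 7, size=300)
X_demo = np.column_stack([hours, days])
y_demo = 20 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, size=300)

candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated MAE per model (scikit-learn reports it negated)
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    results[name] = -scores.mean()
    print(f"{name}: MAE = {results[name]:.2f}")
```

Cross-validation gives each model several train/test splits, so the comparison depends less on one lucky or unlucky split.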

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Initialize the Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

print(f'Model trained. Mean Squared Error on test set: {mse:.2f}')

### Next Steps
**Feature Importance Analysis**

In [None]:
# Get feature importances from the model
feature_importances = pd.Series(model.feature_importances_, index=X.columns)

# Sort the features by importance
sorted_importances = feature_importances.sort_values(ascending=False)

print('Feature Importance:\n', sorted_importances)

**Insights**
The feature importance analysis from the Random Forest model indicates the following:

*start_hour*: This feature has a relative importance of 1.0, suggesting it is the most influential in predicting trip duration.

*day_of_week*: This feature has a relative importance of 0.0, meaning the model's splits made essentially no use of it in the current setup.

Given that *start_hour* is highly influential, it might be beneficial to further explore or engineer additional time-related features or consider other aspects of the data that could enhance the model's predictive power.


**Feature Importance**

The graph below shows the importance of each feature in predicting trip duration. The *start_hour* has a significantly higher importance compared to *day_of_week*.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetic style of the plots
sns.set(style='whitegrid')

# Feature Importance Plot
plt.figure(figsize=(10, 6))
feature_importances.plot(kind='bar')
plt.title('Feature Importance')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**Prediction vs Actual**

The scatter plot below compares the actual trip durations against the model's predicted values. The dashed line marks perfect agreement (predicted = actual), so the closer the points lie to it, the more accurate the predictions.

In [None]:
# Prediction vs Actual Plot
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.title('Prediction vs Actual')
plt.xlabel('Actual Duration')
plt.ylabel('Predicted Duration')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
plt.tight_layout()
plt.show()

### Next Steps
**Model Evaluation**

To evaluate how well the model fits, we can use the coefficient of determination, commonly known as the R² score. It measures the share of the variance in the target that the model explains, which indicates how well unseen samples are likely to be predicted. Let's calculate that next.

In [None]:
from sklearn.metrics import r2_score

# Calculate R^2 score
r2 = r2_score(y_test, y_pred)
print(f'R^2 Score: {r2:.2f}')

The R² score for the model is **0.15**. 

This score indicates that the model explains about 15% of the variance in trip duration based on the current features (start_hour and day_of_week). This suggests that while the model has some predictive power, there is significant room for improvement, possibly by incorporating more relevant features or using more complex modeling techniques.
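
To show what "explains X% of the variance" means, here is a small sketch that computes R² by hand as 1 minus the ratio of residual to total sum of squares. The numbers are made up for illustration and unrelated to the 0.15 result above.

```python
import numpy as np

# Hypothetical actual and predicted durations in minutes (illustrative values)
y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_hat = np.array([14.0, 18.0, 33.0, 37.0, 48.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Baseline: always predicting the mean explains none of the variance
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

print(f"Model R^2: {r2_manual:.3f}, mean-only baseline R^2: {r2_baseline:.3f}")
```

An R² near 0 means the model does little better than always predicting the average duration, and a negative value means it does worse.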

### Proposed Next Steps
**Calculate Distance**
Use the latitude and longitude to calculate the haversine distance for each trip

**Refine Time Features**
Create more granular time bins (e.g., morning, afternoon) and possibly consider the effect of holidays

**Incorporate Traffic Indicators**
Use *numwarnings* and *numinforms* as proxies for traffic conditions
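
The time-feature refinement proposed above could be sketched as follows. The bin edges, the labels, and the `time_of_day` and `is_weekend` column names are assumptions for illustration, and the rows are made up. Holidays and the traffic indicators are not covered here.

```python
import pandas as pd

# Hypothetical trips (illustrative values); day_of_week uses 0 = Monday
demo = pd.DataFrame({
    "start_hour": [2, 7, 13, 18, 23],
    "day_of_week": [0, 2, 4, 5, 6],
})

# Assumed bins: night 0-5, morning 6-11, afternoon 12-17, evening 18-23
demo["time_of_day"] = pd.cut(
    demo["start_hour"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,  # intervals are [start, end), so hour 6 counts as morning
)

# Weekend flag: 5 and 6 are Saturday and Sunday
demo["is_weekend"] = demo["day_of_week"] >= 5

print(demo)
```

The `time_of_day` column is categorical, so it would need to be encoded (for example with `pd.get_dummies`) before being passed to a scikit-learn model.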

In [None]:
import numpy as np 
# Define a function to calculate the haversine distance between two points on the earth
def haversine(lat1, lon1, lat2, lon2):
    # Convert decimal degrees to radians 
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])

    # Haversine formula 
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a)) 
    km = 6371 * c  # Mean radius of the earth in kilometers. Use 3959 for miles
    return km

# Calculate the distance for each trip
# haversine is built on NumPy operations, so it accepts whole columns without a row-wise apply
df['trip_distance'] = haversine(df['startlatitude'], df['startlongitude'], df['endlatitude'], df['endlongitude'])

# Display the head of the dataframe to confirm the new column
print(df[['trip_distance']].head())
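
As a quick sanity check on the haversine formula used above, the self-contained sketch below tests it against a case with a known answer: one degree of latitude along a meridian should equal 6371 × π / 180 km. The function name `haversine_km` and the coordinates are illustrative.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers, using a mean Earth radius of 6371 km."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 6371 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# One degree of latitude along a meridian
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
expected = 6371 * np.pi / 180
print(f"{one_degree:.2f} km computed vs {expected:.2f} km expected")

# Identical start and end points give zero distance
same_point = haversine_km(40.0, -83.0, 40.0, -83.0)
print(f"Same start and end point: {same_point:.2f} km")
```

The second case is worth keeping in mind: haversine measures the straight-line distance between start and end points, so a round trip that ends where it began gets a distance near zero regardless of how far the vehicle actually drove.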

## Summary of Insights and Analysis

### Workflow Summary
**Data Preparation and Cleaning**
Removed trips with negative durations and capped outliers at the 99th percentile to improve data quality for model training

**Feature Engineering**
* Added time-based features such as *start_hour* and *day_of_week* to capture temporal patterns in trip durations
* Calculated the haversine distance of each trip from its start and end coordinates. Distance is a direct factor in trip duration, but this feature has not yet been added to the model

**Model Building and Evaluation**
* Trained a Random Forest Regressor model with an R² score of **0.15**, meaning it explains about 15% of the variance in trip durations.
* Feature importance analysis showed that the *start_hour* was significantly more important than *day_of_week*, suggesting that the time of day has a greater impact on trip duration

**Visual Insights**
The feature importance plot highlighted the relative importance of different features in predicting trip duration

The prediction vs actual plot provided a visual assessment of the model's performance, showing how close the predictions were to the actual values

### Recommendations for Further Improvement
**Incorporate Additional Features**
Including weather conditions, real-time traffic data, and distinguishing between types of days (weekdays vs weekends) could further enhance the model's accuracy

**Advanced Modeling Techniques**
Trying different algorithms and tuning model parameters could yield better results
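
A minimal sketch of what parameter tuning could look like with scikit-learn's `GridSearchCV`. It runs on synthetic stand-in data, and the grid values are arbitrary assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: duration depends non-linearly on the start hour
rng = np.random.default_rng(0)
hours = rng.integers(0, 24, size=200)
X_demo = hours.reshape(-1, 1)
y_demo = 20 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, size=200)

# Small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated MAE: {-search.best_score_:.2f}")
```

The search fits one model per parameter combination and fold, so the grid should stay small on a large dataset.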

**Increase Data Size and Quality**
This project used the *v2x_columbus_trips/v2x_columbus_trips_summary_clean* dataset, which is not the most granular dataset available. I chose it because the full *v2x_columbus_trips* dataset is overwhelmingly large. Moving to the more granular data is a natural next step.