
<h1>1. Introduction/Data Background</h1>

(add an introduction to our project and a background on the data we are using, this includes some data visualizations and whatnot. [Jeff mentioned he will do this.])

<h1>2. Feature Engineering</h1>

To improve the predictive power of our model, we engineered additional features for our dataset. These fall into three distinct groups: datetime, distance, and weather.

<h2>2.0 Datetime</h2>

Much of our dataset revolves around the passenger pickup time, originally stored as a string in the format `YYYY-MM-DD HH:MM:SS`. Because a model cannot use this string directly, we derived several features from it: `pickup_month`, `pickup_day` (one-hot encoded for each day of the week), `pickup_hour`, and `pickup_minute`, all extracted with Pandas datetime functionality.
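A minimal sketch of how these features might be extracted, assuming the raw column is named `pickup_datetime` (as in the normalization code later in this section). The two sample rows are made up for illustration, and the dummy-column naming (`pickup_day_Monday`, and so on) is an assumption rather than our exact schema.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    'pickup_datetime': ['2016-03-14 17:24:55', '2016-06-12 00:43:35'],
})

# Parse the 'YYYY-MM-DD HH:MM:SS' strings into datetimes.
pickup = pd.to_datetime(df['pickup_datetime'], format='%Y-%m-%d %H:%M:%S')

# Simple integer features taken straight from the datetime accessor.
df['pickup_month'] = pickup.dt.month
df['pickup_hour'] = pickup.dt.hour
df['pickup_minute'] = pickup.dt.minute

# One-hot encode the day of the week (column names are an assumption).
day_dummies = pd.get_dummies(pickup.dt.day_name(), prefix='pickup_day')
df = pd.concat([df, day_dummies], axis=1)
```

On the full dataset every weekday appears, so this would yield seven `pickup_day_*` columns; the toy frame above only produces columns for the weekdays it contains.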

In addition to the aforementioned features, we introduced four others: `pickup_period` (one-hot encoded), `pickup_hour_sin`, `pickup_hour_cos`, and `pickup_datetime_norm`. We will now go into detail about how and why we created these features.

<h3>2.0.0 Pickup Period</h3>

The `pickup_period` feature captures the time of day at which passengers were picked up. This one-hot encoded feature divides the day into four periods: night (12:00 AM up to, but not including, 6:00 AM), morning (6:00 AM up to 12:00 PM), afternoon (12:00 PM up to 6:00 PM), and evening (6:00 PM up to 12:00 AM). These divisions align intuitively with significant periods of the day for taxi services, such as the morning rush hours and evening nightlife.

The implementation involved binning pickup hours and subsequent one-hot encoding:

```python
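# Bin hours into half-open intervals [0, 6), [6, 12), [12, 18), [18, 24)
# so that, e.g., hour 6 falls in 'morning' rather than 'night'.
df['pickup_period'] = pd.cut(df['pickup_hour'], bins=[0, 6, 12, 18, 24], right=False,
                             labels=['night', 'morning', 'afternoon', 'evening'])

# One-hot encode; drop_first=True drops the first category ('night') as the baseline.
df = pd.get_dummies(df, columns=['pickup_period'], drop_first=True)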
```

<h3>2.0.1 Pickup Hour Sine/Cosine</h3>

Because time of day is cyclical, we applied circular encoding to the pickup hour, creating `pickup_hour_sin` and `pickup_hour_cos` with sine and cosine transformations. With the raw hour, 23:00 and 00:00 appear 23 units apart even though they are only one hour apart; mapping the hour onto a circle removes this discontinuity at the day boundary and preserves the closeness of neighboring times. Both properties fit well with the goals of our project, so we added these features to our dataset using the following formulas:

\begin{align}
    \begin{split}
        \text{pickup\_hour\_sin} &= \sin \left( \frac{2 \pi \cdot \text{pickup\_hour}}{24} \right) \\
        \text{pickup\_hour\_cos} &= \cos \left( \frac{2 \pi \cdot \text{pickup\_hour}}{24} \right).
    \end{split}
\end{align}
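The formulas above can be sketched in a few lines of NumPy. The toy `pickup_hour` values are made up for illustration; the comparison at the end is only there to show that hours 23 and 0 end up close together in the encoded space, while hours 0 and 12 end up as far apart as possible.

```python
import numpy as np
import pandas as pd

# Toy hours chosen to illustrate the wrap-around at midnight.
df = pd.DataFrame({'pickup_hour': [0, 6, 12, 23]})

# Map each hour to an angle on the 24-hour circle.
angle = 2 * np.pi * df['pickup_hour'] / 24
df['pickup_hour_sin'] = np.sin(angle)
df['pickup_hour_cos'] = np.cos(angle)

# Euclidean distance between encoded points on the unit circle.
points = df[['pickup_hour_sin', 'pickup_hour_cos']].to_numpy()
gap_23_to_0 = np.linalg.norm(points[3] - points[0])   # adjacent hours
gap_0_to_12 = np.linalg.norm(points[0] - points[2])   # opposite sides of the day
```

Both columns are needed together: the sine alone cannot distinguish, for example, hour 0 from hour 12, since both map to a sine of zero.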

<h3>2.0.2 Pickup Datetime Norm</h3>

The final feature in the datetime group is `pickup_datetime_norm`, the normalized pickup datetime. To compute it, we converted the pickup datetime to an integer number of nanoseconds and divided by $10^9$ to convert from nanoseconds to seconds. We then applied min-max scaling, which places all values between 0 and 1. This feature was created using the following code:

```python
# Convert to integer nanoseconds since the epoch (assumes datetime64[ns]), then to whole seconds.
seconds = pd.to_datetime(df['pickup_datetime']).astype('int64') // 10**9

# Min-max scale so that all values lie in [0, 1].
df['pickup_datetime_norm'] = (seconds - seconds.min()) / (seconds.max() - seconds.min())
```

<h2>2.1 Distance</h2>

In the distance feature grouping, we created two features involving the estimated distance between the pickup and dropoff locations. The first feature estimates the distance between the points using the Manhattan distance, while the second feature estimates the distance using a modified version of Dijkstra's algorithm. <span style="color:maroon">(Dallin, make sure I am correct about this.)</span> We will now go into detail about how and why we created these features.

<h3>2.1.0 Manhattan Distance</h3>

The Manhattan distance, also known as the $L_1$ distance or taxicab distance, is a metric used to calculate the distance between two points in a grid-based system. We chose it for two main reasons. First, it is well suited to grid-based navigation. While New York City is not a perfect grid, many of its streets are laid out in a grid-like fashion, so the Manhattan distance should approximate the driving distance better than the Euclidean distance does. Second, it is simple and interpretable: the Manhattan distance is the sum of the absolute differences between the $x$ and $y$ coordinates of the two points, which makes it easy to calculate and understand.

In our dataset, we have the pickup and dropoff latitude and longitude coordinates. We can use these coordinates to calculate the Manhattan distance between the pickup and dropoff locations (in kilometers). We calculated the Manhattan distance by first converting the pickup and dropoff coordinate points into radians and using the following formula:

\begin{equation}
    \text{Manhattan Distance} = R \cdot \left( \left| \text{pickup\_latitude} - \text{dropoff\_latitude} \right| + \left| \text{pickup\_longitude} - \text{dropoff\_longitude} \right| \right)
\end{equation}

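where $R$ is the radius of the Earth in kilometers (6371 km). This gives us the approximate distance between the pickup and dropoff locations in kilometers. The estimate is not exact: besides ignoring the actual street layout, the formula treats a radian of longitude as the same length as a radian of latitude, which overstates east-west distances at New York's latitude. Even so, we found it to be a decent alternative to using an expensive and time-consuming API to get the actual distance between the two points.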
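A minimal sketch of this calculation, assuming the coordinate columns are named `pickup_latitude`, `pickup_longitude`, `dropoff_latitude`, and `dropoff_longitude` as in the formula. The helper name `manhattan_distance_km` and the sample trip are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # radius of the Earth in kilometers


def manhattan_distance_km(df):
    """Approximate Manhattan distance in km (hypothetical helper name)."""
    # Convert all coordinates from degrees to radians.
    lat1 = np.radians(df['pickup_latitude'])
    lon1 = np.radians(df['pickup_longitude'])
    lat2 = np.radians(df['dropoff_latitude'])
    lon2 = np.radians(df['dropoff_longitude'])

    # R * (|delta latitude| + |delta longitude|), as in the formula above.
    return EARTH_RADIUS_KM * ((lat1 - lat2).abs() + (lon1 - lon2).abs())


# Made-up trip: 0.01 degrees north and 0.01 degrees east.
trips = pd.DataFrame({
    'pickup_latitude': [40.75], 'pickup_longitude': [-73.99],
    'dropoff_latitude': [40.76], 'dropoff_longitude': [-73.98],
})
trips['manhattan_distance'] = manhattan_distance_km(trips)
```

Because the computation is vectorized over whole columns, it should scale to the full dataset without an explicit loop.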

<h3>2.1.1 Modified Dijkstra's Algorithm</h3>

<span style='color:maroon'>(Dallin will add a little something-something here about his Dijkstra's stuff.)</span>


<h2>2.2 Weather</h2>

When considering potential features to add to our dataset, one of the first things we thought of was the weather. We thought it reasonable that the weather could have a significant impact on the duration of a taxi ride. For example, if it is raining or snowing, traffic is likely to be slower, thus increasing the duration of the taxi ride. We found a dataset created by Kaggle user [@Aadam](https://www.kaggle.com/aadimator) that contains a myriad of weather data for New York City between 2016 and 2022 on an hour-by-hour basis. This dataset includes features such as temperature (in Celsius), precipitation (in mm), cloud cover (low, mid, high, and total), wind speed (in km/h), and wind direction. We decided to use temperature, precipitation, and total cloud cover as features in our dataset. This was accomplished by a simple join on the pickup datetime, rounded to the nearest whole hour.
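The join described above might look like the following sketch. The weather column names (`time`, `temperature`, `precipitation`, `cloud_cover_total`), the helper column `pickup_rounded`, and all values are hypothetical stand-ins, not the actual schema or contents of the Kaggle dataset.

```python
import pandas as pd

# Toy trips (made-up timestamps).
trips = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2016-03-14 17:24:55',
                                       '2016-03-14 17:40:10']),
})

# Toy hourly weather table (column names and values are assumptions).
weather = pd.DataFrame({
    'time': pd.to_datetime(['2016-03-14 17:00:00', '2016-03-14 18:00:00']),
    'temperature': [8.1, 7.6],        # degrees Celsius
    'precipitation': [0.0, 0.4],      # mm
    'cloud_cover_total': [70, 95],
})

# Round each pickup time to the nearest whole hour.
trips['pickup_rounded'] = trips['pickup_datetime'].dt.round('60min')

# Left join so every trip is kept even if no weather row matches.
merged = trips.merge(weather, how='left',
                     left_on='pickup_rounded', right_on='time')
merged = merged.drop(columns=['pickup_rounded', 'time'])
```

A left join is used here so that trips without a matching weather row are kept (with missing weather values) rather than silently dropped.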


<h1>3. Modeling and Results</h1>

(Add some nonsense here about models and stuff.)

<h1>4. Interpretation</h1>

(Add some nonsense here about interpretation and stuff.)