<h1>1. Introduction/Data Background</h1>

We aim to tackle the challenge of predicting taxi ride durations in New York City based on fundamental situational information. This research question holds significant relevance for urban commuters and transportation systems, delving into the myriad factors influencing taxi ride times, including traffic patterns, congestion, time of day, and external variables such as events or weather conditions. Our primary objective is to develop models capable of accurately forecasting the duration of taxi rides. The diverse data available further enhances the feasibility of answering the question, laying a solid foundation for training models to discern patterns and make precise predictions. The strengths of machine learning methods align well with the intricacies of this problem and allow us to uncover more nuanced relationships within the data. Beyond individual convenience, the ability to predict taxi ride times carries practical implications for optimizing taxi fleet management, resource allocation, and enhancing overall traffic flow in the city. 

Since our data comes from a Kaggle competition, over one thousand other groups have investigated this problem. Successful teams performed feature engineering to create fields such as month, day, hour, day of the week, and haversine/Manhattan distances. They used techniques including Random Forest Regression, Extra Trees Regression, PCA, XGBoost, linear regression, and LightGBM.

The competition tasks developers with predicting New York City taxi cab travel times from 2016 based primarily on pickup and drop off locations. However, the dataset also includes fields for the taxi driver, the number of passengers, and whether the trip time was recorded in real time. The second dataset contains weather information for the same time period and is sourced directly from the Open Meteo website. The available fields include timestamps, temperature, precipitation, cloud cover, and wind information. The NYC taxi cab dataset is well documented and densely populated with over one million data points. It was published by the NYC Taxi and Limousine Commission (TLC) in BigQuery on Google Cloud Platform. The weather dataset is much more sparse, especially in the wind and cloud fields, and covers a much larger time range than needed to match the NYC taxi data. These datasets provide a good source for addressing our research questions because they extensively cover NYC taxi travel for a significant time period. By combining this taxi data with meteorological information, we can examine various factors influencing trip duration in an urban landscape and employ machine learning techniques such as XGBoost. In short, the data allows us to identify significant features impacting taxi trip duration and develop a robust predictive model for accurate estimations.

While our primary investigation is focused on determining taxi cab trip duration, we are also interested in several other questions. What dates, days of the week, and times of day are busiest? Where are the most popular destinations? What factors influence taxi trip time the most? The fusion of NYC taxi data and meteorological information to understand factors influencing trip duration in a bustling metropolis offers a compelling inquiry into urban transportation dynamics. Furthermore, exploring temporal, spatial, and environmental influences on travel durations in NYC, complemented by the weather dataset, gives a well-rounded view of travel behaviors. Our data is well suited to our primary research question of predicting taxi trip duration, which is precise and has a single, clear answer. The dataset allows us to employ machine learning techniques like XGBoost to identify the significant features that impact taxi trip duration. This process helps us go one step further and develop a robust predictive model to estimate and understand trip durations accurately.

<h1>2. Feature Engineering</h1>

In our pursuit to enhance the predictive power of our model, we identified the necessity for additional features in our dataset. This realization led to the development of three distinct feature groups: datetime, distance, and weather. 

<h2>2.0 Data Cleaning</h2>

The taxi cab duration dataset from Kaggle was already cleaned and prepared for the competition. However, after some basic data visualization, we found outliers that needed to be removed. For example, according to <code>trip_duration</code>, some trips lasted 1 second and some lasted about 980 days. To address this, we removed any data points where <code>trip_duration</code> was less than 60 seconds or where <code>trip_duration</code> fell in the extreme $0.005$ quantile.

In addition to outliers in <code>trip_duration</code>, we found outliers in some of the pickup and drop off locations, meaning they were far outside New York City. We fixed this problem by removing any point outside of city limits.
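The two filters described above can be sketched as follows. This is a minimal illustration, not the code in our pipeline: the helper name `drop_outliers` and the bounding box values are hypothetical, and the quantile filter is assumed to trim the longest trips.

```python
import pandas as pd

# Hypothetical bounding box roughly covering New York City (assumed values)
LON_MIN, LON_MAX = -74.3, -73.7
LAT_MIN, LAT_MAX = 40.5, 40.95

def drop_outliers(df, min_seconds=60, upper_quantile=0.995):
    """Drop very short trips, the longest trips, and points outside the box."""
    max_seconds = df['trip_duration'].quantile(upper_quantile)
    keep = df['trip_duration'].between(min_seconds, max_seconds)
    for prefix in ('pickup', 'dropoff'):
        keep &= df[f'{prefix}_longitude'].between(LON_MIN, LON_MAX)
        keep &= df[f'{prefix}_latitude'].between(LAT_MIN, LAT_MAX)
    return df[keep].reset_index(drop=True)

# small made-up example: one 1-second trip and one pickup far from the city
example = pd.DataFrame({
    'trip_duration': [1, 600, 900, 1200],
    'pickup_longitude': [-73.98, -73.98, -73.99, -60.0],
    'pickup_latitude': [40.75, 40.75, 40.76, 40.75],
    'dropoff_longitude': [-73.97, -73.97, -73.95, -73.97],
    'dropoff_latitude': [40.76, 40.76, 40.78, 40.76],
})
cleaned = drop_outliers(example)
```

In this toy example only the two middle rows survive: the first is too short and the last starts outside the assumed bounding box.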

<h2>2.1 Datetime</h2>

One of the most important features in our dataset is the passenger pickup time, originally represented as a string in the format `YYYY-MM-DD HH:MM:SS`. We created multiple time features from this column, including `pickup_month`, one-hot encoded `pickup_day`, `pickup_hour`, and `pickup_minute`. We also added other versions of these data points, including one-hot encoded `pickup_period`, `pickup_hour_sin`, `pickup_hour_cos`, and `pickup_datetime_norm`.

<h3>2.0.0 Pickup Period</h3>

The feature `pickup_period` captures the time of day when passengers were picked up in one of four periods: morning (6:00 AM to 12:00 PM), afternoon (12:00 PM to 6:00 PM), evening (6:00 PM to 12:00 AM), and night (12:00 AM to 6:00 AM). These divisions align intuitively with significant periods of the day for taxi services, such as morning rush hours and evening nightlife.

```python
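# left-closed bins so that hour 6 is 'morning', hour 12 is 'afternoon', and hour 18 is 'evening'
df['pickup_period'] = pd.cut(df['pickup_hour'], bins=[0, 6, 12, 18, 24], right=False,
                                labels=['night', 'morning', 'afternoon', 'evening'])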
df = pd.get_dummies(df, columns=['pickup_period'], drop_first=True)
```

<h3>2.0.1 Pickup Hour Sine/Cosine</h3>

We applied a circular encoding to the pickup hour to account for the cyclical nature of the hours of the day. We created the `pickup_hour_sin` and `pickup_hour_cos` features using sine and cosine transformations, which avoid discontinuities such as the jump between the end of one day and the start of the next.

\begin{align}
    \begin{split}
        \text{hour\_sin} = \sin \left( \frac{2 \pi \cdot \text{pickup\_hour}}{24} \right) \quad
        \text{hour\_cos} = \cos \left( \frac{2 \pi \cdot \text{pickup\_hour}}{24} \right).
    \end{split}
\end{align}
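As a small illustration of why this encoding helps, the sketch below applies the formulas above to single hour values. The function name `encode_hour` is hypothetical and is not part of our feature code; on a DataFrame column the same transformation can be applied with `np.sin` and `np.cos`.

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# hours 23 and 0 are adjacent on the circle even though 23 - 0 = 23
sin_23, cos_23 = encode_hour(23)
sin_0, cos_0 = encode_hour(0)
gap_23_to_0 = math.hypot(sin_23 - sin_0, cos_23 - cos_0)

# the same one-hour step in the middle of the day
sin_11, cos_11 = encode_hour(11)
sin_12, cos_12 = encode_hour(12)
gap_11_to_12 = math.hypot(sin_11 - sin_12, cos_11 - cos_12)
```

Both gaps come out equal, so a one-hour step across midnight is treated the same as any other one-hour step.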

<h3>2.0.2 Pickup Datetime Norm</h3>

The final feature we created in the datetime feature grouping was `pickup_datetime_norm`, which represents the normalized pickup datetime. This feature converts the pickup datetime from nanoseconds to seconds, then applies min-max scaling to place all the values between 0 and 1.

```python
# convert the pickup datetime from nanoseconds to whole seconds since the Unix epoch
seconds = pd.to_datetime(df['pickup_datetime']).astype('int64') // 10**9
# min-max scale so that all values lie between 0 and 1
df['pickup_datetime_norm'] = (seconds - seconds.min()) / (seconds.max() - seconds.min())
```

<h2>2.1 Distance</h2>

We created three features related to the distance between pickup and drop off locations: the Manhattan distance, the shortest path along New York roads according to Dijkstra's algorithm, and the average trip duration between local coordinate clusters.

<h3>2.1.0 Manhattan Distance</h3>

We include the Manhattan distance feature because it is a grid-based metric. Many of the streets of New York are laid out in a grid-like fashion, so this metric can better approximate road distances than the Euclidean distance. The Manhattan distance is also simpler and more interpretable, since it computes the sum of the absolute values of the differences between the $x$ and $y$ coordinates of the two points. We calculated the Manhattan distance by first converting the pickup and dropoff coordinates into radians and then using the following formula:

\begin{equation}
    \text{Manhattan Distance} = R \cdot \left( \left| \text{pickup\_latitude} - \text{dropoff\_latitude} \right| + \left| \text{pickup\_longitude} - \text{dropoff\_longitude} \right| \right)
\end{equation}

where $R$ is the radius of the Earth in kilometers (6371 km). This equation approximates the distance between the pickup and dropoff locations in kilometers. 
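The formula above translates directly into a few lines of Python. This is a sketch for a single pair of points; the function name `manhattan_distance_km` is hypothetical and the inputs are assumed to be in degrees.

```python
import math

EARTH_RADIUS_KM = 6371

def manhattan_distance_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Sum of absolute latitude and longitude differences (in radians), scaled by R."""
    dlat = abs(math.radians(pickup_lat) - math.radians(dropoff_lat))
    dlon = abs(math.radians(pickup_lon) - math.radians(dropoff_lon))
    return EARTH_RADIUS_KM * (dlat + dlon)

# coordinates of the first row in the feature table shown later in this section
d = manhattan_distance_km(40.767937, -73.982155, 40.765602, -73.96463)
```

For these coordinates the result is about 2.21 km, which is close to the `distance_km` value displayed for that row.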

<h3>2.1.1 Computing Distance with Dijkstra's Algorithm</h3>

With over 1.4 million data points spread over a relatively small geographic area, the pickup and drop off locations provide an effective proxy for the actual road map of New York. We use this information to estimate the driving distance between two points with a modified version of Dijkstra's algorithm, which we chose because it is a well-known and well-documented algorithm for finding the shortest path between two nodes in a graph. In our version, the nodes of the graph are the pickup and drop off locations, and the edges represent the roads of New York City, with weights equal to their geographic length.

We can reduce the size of the graph by removing nodes that are too close together, removing the few hundred nodes that lie well outside the metropolitan area, and cropping the graph for each distance calculation to only include nodes that lie roughly between the pick up and drop off locations. This reduces the size of the graph from over 1.4 million nodes to a few thousand nodes. We can then use Dijkstra's algorithm to find the shortest path between the pick up and drop off locations. The resulting cost of the shortest path is an estimate of the actual driving distance required to travel between the two points, which is one of the strongest predictors of taxi ride duration.
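For reference, the core shortest-path step can be sketched as below. This is the standard textbook form of Dijkstra's algorithm on a small hand-made graph, not our modified version; the graph construction, node pruning, and cropping described above are not shown, and the node names and edge lengths are made up.

```python
import heapq

def dijkstra(graph, source, target):
    """Return the length of the shortest path from source to target.

    graph maps each node to a list of (neighbor, edge_length) pairs.
    Returns float('inf') if target is unreachable.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float('inf')):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, length in graph.get(node, []):
            candidate = dist + length
            if candidate < best.get(neighbor, float('inf')):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float('inf')

# toy graph: nodes are hypothetical locations, weights are lengths in km
toy_graph = {
    'A': [('B', 1.0), ('C', 4.0)],
    'B': [('A', 1.0), ('C', 2.0), ('D', 5.0)],
    'C': [('A', 4.0), ('B', 2.0), ('D', 1.0)],
    'D': [('B', 5.0), ('C', 1.0)],
}
shortest = dijkstra(toy_graph, 'A', 'D')
```

Here the shortest route from `A` to `D` goes through `B` and `C` for a total length of 4.0 km, which is shorter than either route that uses a single intermediate node.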

<h3>2.1.2 KMeans Clustering Average Duration</h3>

Along with incorporating distance metrics and weather information, we also added a duration feature that acts as an initial estimate for the model's actual trip duration prediction. To do this, we fit separate KMeans clustering models on the pickup and dropoff coordinates, each with 200 clusters. Images of these are shown in section `4. Data Visualization`. We then labeled each pickup location and drop off location in the data with its respective cluster label. By grouping the data by these cluster pairs, we computed the average `trip_duration` between each cluster pair and merged this onto the original dataframe. See the Appendix for the full code implementation.

<h2>2.2 Weather</h2>

When considering potential features to add to our dataset, accounting for the effect of local weather on taxi ride time was one of the most obvious additions. For example, if it is raining or snowing, there will be more traffic on the roads, and more people who would normally walk would prefer a taxi. A dataset created by Kaggle user [@Aadam](https://www.kaggle.com/aadimator) contains a wide range of weather data for New York City between 2016 and 2022 on an hourly basis. This dataset includes features such as temperature (in Celsius), precipitation (in mm), cloud cover (low, mid, high, and total), wind speed (in km/h), and wind direction. We decided to use temperature, precipitation, and total cloud cover as features in our dataset with a simple join on the pickup datetime, rounded to the nearest whole hour.
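A minimal sketch of such a join is shown below. The helper name `add_weather`, the weather timestamp column name `time`, and the second temperature value are assumptions made for illustration; the actual implementation lives in `generate_features`.

```python
import pandas as pd

def add_weather(trips, weather, columns):
    """Left-join hourly weather columns onto trips by the nearest whole hour."""
    trips = trips.copy()
    # round each pickup time to the nearest whole hour
    trips['weather_hour'] = pd.to_datetime(trips['pickup_datetime']).dt.round('60min')
    weather = weather[['time'] + columns].copy()
    weather['time'] = pd.to_datetime(weather['time'])
    merged = trips.merge(weather, how='left', left_on='weather_hour', right_on='time')
    return merged.drop(columns=['weather_hour', 'time'])

trips = pd.DataFrame({'pickup_datetime': ['2016-03-14 17:24:55', '2016-03-14 17:40:10']})
weather = pd.DataFrame({
    'time': ['2016-03-14 17:00:00', '2016-03-14 18:00:00'],
    'temperature_2m (°C)': [6.4, 6.1],  # second value is made up for the example
})
joined = add_weather(trips, weather, ['temperature_2m (°C)'])
```

The first trip rounds down to 17:00 and the second rounds up to 18:00, so each picks up the temperature of its nearest hour. A left join keeps every trip even when no weather row matches.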

In [3]:
# All feature engineering is done in the function get_X_y, found in appendix
X, y = get_X_y(force_clean=True) # force clean drops the outliers that don't resemble reasonable data points
feature_X = generate_features(X) # generate features adds more features to the data, including weather data
feature_X.shape, y.shape

((1442663, 27), (1442663,))

In [13]:
feature_X # show an example of the data, you can't see all of the features on this page
# the labels (y) are the trip duration for each id
feature_X.head(5)

| | id | vendor_id | pickup_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | pickup_month | pickup_day_Monday | ... | pickup_period_morning | pickup_period_afternoon | pickup_period_evening | pickup_hour_sin | pickup_hour_cos | pickup_datetime_norm | distance_km | temperature_2m (°C) | precipitation (mm) | cloudcover (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 1 | 2016-03-14 17:24:55 | 1 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | 3 | 1 | ... | 0 | 1 | 0 | -0.965926 | -0.258819 | 0.405086 | 2.208255 | 6.4 | 0.2 | 100.0 |
| 1 | id3758366 | 1 | 2016-03-14 17:19:42 | 1 | -73.993095 | 40.747917 | -74.00634 | 40.734406 | 3 | 1 | ... | 0 | 1 | 0 | -0.965926 | -0.258819 | 0.405066 | 2.975163 | 6.4 | 0.2 | 100.0 |
| 2 | id2886671 | 1 | 2016-03-14 16:57:39 | 5 | -73.979507 | 40.785347 | -73.970268 | 40.799091 | 3 | 1 | ... | 0 | 1 | 0 | -0.866025 | -0.5 | 0.404982 | 2.555654 | 6.4 | 0.2 | 100.0 |
| 3 | id0790528 | 1 | 2016-03-14 17:01:41 | 1 | -73.984245 | 40.749043 | -73.999893 | 40.734074 | 3 | 1 | ... | 0 | 1 | 0 | -0.965926 | -0.258819 | 0.404997 | 3.404428 | 6.4 | 0.2 | 100.0 |
| 4 | id2704865 | 1 | 2016-03-14 17:16:07 | 1 | -74.00676 | 40.705559 | -73.980568 | 40.78754 | 3 | 1 | ... | 0 | 1 | 0 | -0.965926 | -0.258819 | 0.405052 | 12.02833 | 6.4 | 0.2 | 100.0 |


<h1>3. Feature Selection</h1>

As seen in section 2, we created quite a few new features in addition to the ones that were already in the dataset. It is important to consider which of them, if any, are not needed and do not help the model.

<h2>3.1 $L^1$ Regularization</h2>

To determine which of our features are most important, we used $L^1$ regularization, since it naturally sets the coefficients of unneeded features to 0 and typically outperforms step-wise feature removal.
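The zeroing behavior can be seen on a small synthetic example. This sketch uses scikit-learn's plain `Lasso` with a fixed `alpha` rather than the `LassoLarsIC` selection used in our analysis, and the data are randomly generated so that only the first of three features affects the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))          # three synthetic features
noise = rng.normal(scale=0.1, size=200)
y_toy = 3.0 * X_toy[:, 0] + noise          # only the first feature matters

# the L1 penalty shrinks all coefficients and sets the unneeded ones to zero
model = Lasso(alpha=0.5).fit(X_toy, y_toy)
selected = np.flatnonzero(model.coef_)     # indices of nonzero coefficients
```

Only the first feature keeps a nonzero coefficient, and that coefficient is shrunk below its true value of 3, which is the trade-off the penalty makes.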

In [41]:
# lasso_feature_selection performs feature selection using the LassoLarsIC method
lasso_feature_selection(X, y)

{'Optimal Alpha': 0.7226868052357491,
 'Optimal BIC': 18077554.89883142,
 'Lasso Coeffs': array([-0.    , -0.    ,  0.    , -0.    , -0.    ,  0.0778, -0.    ,
         0.    ,  0.    , -0.    , -0.    ,  0.    ,  0.0162,  0.0159,
        -0.    ,  0.    , -0.    , -0.    , -0.    ,  0.    , -0.0367,
        -0.    ,  0.    ,  0.013 ]),
 'Important Features': array(['pickup_month', 'pickup_hour', 'pickup_minute', 'distance_km',
        'cloudcover (%)'], dtype=object)}

<h2>3.2 Grid Search on LightGBM</h2>

In addition to finding the most important features, we also wanted to find the best hyperparameters for our model. To do this, we used a grid search on LightGBM. We chose to use LightGBM over XGBoost because of its speed with high dimensional data.

In [55]:
# The above function takes 45 minutes to run and produces the following results
lightgbm_hyperparameter_search(X, y)

{'Best parameters from grid search': {'boosting_type': 'gbdt', 'learning_rate': 0.01, 'max_depth': 20,
				      'n_estimators': 100, 'num_leaves': 30, 'reg_alpha': 0.1, 'reg_lambda': 0.1},
 'Best MSE': 609.5768}


##### Training result after updating features
The selected hyperparameters are the same except for `reg_lambda`, which changed from 0.1 to 0.5, and the reported error (`Best MSE`) improved by about 2.

{'Best parameters from grid search': {'boosting_type': 'gbdt', 'learning_rate': 0.01, 'max_depth': 20,
				      'n_estimators': 100, 'num_leaves': 30, 'reg_alpha': 0.1, 'reg_lambda': 0.5},
 'Best MSE': 607.5797035091151}

<h1>4. Data Visualization</h1>

In the following figure, the top two graphs visualize the pickup and dropoff locations overlaid on a map of NYC. The bottom two graphs show the pickup and dropoff locations clustered into groups using K-means clustering. The pickup locations are more heavily clustered around downtown (Manhattan), while the dropoff locations are more evenly distributed throughout the city.

<table>
    <tr>
        <td><img src="images/pickup_locations.png" alt="pickup_locations" width=400 height=300></td>
        <td><img src="images/dropoff_locations.png" alt="dropoff_locations" width=400 height=300></td>
    </tr>
    <tr>
        <td><img src="images/kmeans_200_pickup.png" alt="kmeans_200_pickup" width=400 height=300></td>
        <td><img src="images/kmeans_200_dropoff.png" alt="kmeans_200_dropoff" width=400 height=300></td>
    </tr>
</table>

We can also learn from the data about the factors that influence a New Yorker's decision to take a taxi. For example, in the figure below, the graph on the left displays the most popular times of day to hail a taxi, which peak around 6:00 PM. We also see that poor weather encourages more people to take a taxi during the day, when people are more likely to be returning from their daily activities, but makes people less likely to go out at night in the first place. The graph on the right shows the most popular days of the week for taxis, with demand peaking on Friday and Saturday and during weekday evenings.

<table>
    <tr>
        <td><img src="images/pickup_rain_freq.png" alt="pickup_rain_freq" width=400 height=300></td>
        <td><img src="images/day_hour.png" alt="day_hour" width=600 height=300></td>
    </tr>
</table>

<h1>5. Modeling and Results</h1>


<h1>6. Interpretation</h1>


<h2>6.1 Ethical Implications</h2>

Our research involves analyzing a large dataset created by tracking some of the life-style patterns of real people living in New York during 2016, raising concerns about privacy and responsible data usage. The taxi dataset contains information about drivers, passengers, and their travel patterns which necessitates careful handling to prevent unintended consequences. 
Individual behaviors, locations, and travel patterns are sensitive information. This dataset and our model protect this private information by excluding all personally identifiable information so that the data is anonymized. This step ensures that only aggregate information is meaningful and that the patterns of individuals remain indiscernible. This privacy is crucial and requires anonymization and secure data storage practices.

There is a risk of misinterpretation or misuse of our predictive models. Users could misunderstand predictions, leading to inappropriate decision-making. Our predictive model may contain and inadvertently perpetuate biases that we are unaware of, such as inappropriate associations between certain neighborhoods and taxis. Furthermore, users might misunderstand the predictive nature of the model and treat its estimates as certainties. For instance, a passenger might assume the predicted duration is a guaranteed travel time, leading to frustration or dissatisfaction if actual conditions deviate. If taxi companies or transportation authorities were to make decisions solely based on the model's predictions without considering broader traffic management strategies, they could inadvertently concentrate traffic and worsen congestion in certain areas. If the model is integrated into pricing algorithms, there is a risk of surge pricing being triggered based on predicted high demand, potentially disadvantaging users in specific locations during peak times.

To address these issues, clear communication about the model's limitations, potential biases, and intended use is crucial. Providing educational resources for users and implementing fairness-aware algorithms can contribute to responsible and ethical deployment. Implementing user-friendly interfaces, transparent model documentation, and accessible educational resources can also mitigate the risk of misuse.


If our model were deployed in conjunction with algorithms influencing taxi availability, the system might inadvertently create self-fulfilling feedback loops, disproportionately affecting certain areas or demographics. Regular assessments, periodic audits, and interventions are necessary to avoid reinforcing existing biases. By regularly updating models based on real-world outcomes, developers can refine and improve fairness while mitigating the risk of destructive feedback loops. We have also considered ethical responsibilities such as the responsible disclosure of findings, ensuring the public benefits from the research, and avoiding any unintentional harm. Active engagement with potential stakeholders and the community can help address concerns and foster ethical practices.

<h1>Appendix</h1>

<h2>A.1 Code</h2>

In [None]:
# python imports
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from copy import deepcopy
from sklearn.cluster import KMeans

# Native imports
from py_files.features import generate_features
from py_files.data_manager import get_X_y, get_nyc_gdf

In [None]:
# KMeans Clustering code

# fit the kmeans models and label each pickup and dropoff location by its cluster
n_clusters = 200  # number of clusters used for both models
kmeans_pickup = (KMeans(n_clusters=n_clusters)
    .fit(df.loc[:, ['pickup_longitude', 'pickup_latitude']].values))
kmeans_dropoff = (KMeans(n_clusters=n_clusters)
    .fit(df.loc[:, ['dropoff_longitude', 'dropoff_latitude']].values))
df['pickup_cluster'] = kmeans_pickup.predict(df[['pickup_longitude', 'pickup_latitude']].values)
df['dropoff_cluster'] = kmeans_dropoff.predict(df[['dropoff_longitude', 'dropoff_latitude']].values)

# compute the average duration between each cluster and merge this onto the original dataframe
group_durations = (df
    .groupby(['pickup_cluster', 'dropoff_cluster'])['trip_duration']
    .mean()
    .reset_index()
    .rename(columns={'trip_duration': 'avg_cluster_duration'}))
df = pd.merge(
    left=df, right=group_durations, how='left',
    left_on=['pickup_cluster', 'dropoff_cluster'], right_on=['pickup_cluster', 'dropoff_cluster'])

# fill the missing values with the mean of the average duration from cluster to cluster
df['avg_cluster_duration'] = df['avg_cluster_duration'].fillna(df['avg_cluster_duration'].mean())
# drop the helper cluster labels created above and the target column
df.drop(columns=['pickup_cluster', 'dropoff_cluster', 'trip_duration'], inplace=True)

In [None]:
# lasso feature selection
from sklearn import linear_model
from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def lasso_feature_selection(X, y):
    # fit LassoLarsIC on scaled features and record the BIC for each alpha on the path
    # (the `normalize` argument is omitted; the pipeline handles the scaling)
    lasso_lars_ic = make_pipeline(StandardScaler(with_mean=False), LassoLarsIC(criterion="bic")).fit(X, y)

    results = pd.DataFrame(
        {
            "alphas": lasso_lars_ic[-1].alphas_,
            "BIC criterion": lasso_lars_ic[-1].criterion_,
        }
    ).set_index("alphas")

    # the alpha with the smallest BIC, as a scalar
    optimal_alpha = results['BIC criterion'].idxmin()

    # Train a Lasso model with the optimal alpha for feature selection
    lasso = linear_model.Lasso(alpha=optimal_alpha)
    lasso.fit(X, y)

    return {'Optimal Alpha': optimal_alpha, 'Optimal BIC': results['BIC criterion'].min(),
            'Lasso Coeffs': lasso.coef_.round(4), 'Important Features': X.columns[lasso.coef_ != 0].values}

In [None]:
# LightGBM Hyperparameter Selection
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

def lightgbm_hyperparameter_search(X, y):
    # Create param grid
    param_grid = {
        'boosting_type': ['gbdt', 'dart'],
        'num_leaves': [30, 40],
        'learning_rate': [0.01, 0.05],
        'n_estimators': [100, 200],
        'max_depth': [10, 20],
        'reg_alpha': [0.1, 0.5],
        'reg_lambda': [0.1, 0.5],
    }

    # LightGBM
    lgb_train = lgb.LGBMRegressor()

    # Grid search with 3-fold cross-validation over every combination in the grid
    grid_search = GridSearchCV(estimator=lgb_train, param_grid=param_grid, cv=3, scoring='neg_root_mean_squared_error', verbose=1)
    grid_search.fit(X, y)

    # the scoring above is negated RMSE, so the value stored under 'Best MSE' is a root mean squared error
    return {'Best parameters from grid search': grid_search.best_params_, 'Best MSE': -grid_search.best_score_}

In [None]:
# constants/parameters for this code cell
SHOW_PLOTS = True
LOAD_SAVED_KMEANS_MODELS = True

# load in the cleaned training data and the NYC geopandas dataframe
# with all of the NYC streets
X, y = get_X_y(force_clean=True)
nyc_gdf = get_nyc_gdf()

#########################
# PLOT PICKUP LOCATIONS #
#########################

# plot the nyc streets
plt.gcf().set_dpi(500)
nyc_gdf.plot(linewidth=0.1, edgecolor='black', figsize=(12, 12), alpha=0.5, label="NYC Streets")

# plot the pickup locations as a scatter plot on top of the nyc streets
plt.scatter(X['pickup_longitude'], X['pickup_latitude'], c='red', alpha=0.75, s=0.1, label="Pickup Locations")
leg = plt.legend(loc='upper left')
for lh in leg.legend_handles: 
    lh.set_alpha(1)
plt.title("Pickup Locations")
plt.xlabel("Longitude")
plt.ylabel("Latitude")

# save the plot
plt.savefig("images/pickup_locations_save.png")
plt.show() if SHOW_PLOTS else plt.clf()

##########################
# PLOT DROPOFF LOCATIONS #
##########################

# plot the nyc streets
plt.gcf().set_dpi(500)
nyc_gdf.plot(linewidth=0.1, edgecolor='black', figsize=(12, 12), alpha=0.5, label="NYC Streets")

# plot the dropoff locations as a scatter plot on top of the nyc streets
plt.scatter(X['dropoff_longitude'], X['dropoff_latitude'], c='green', alpha=0.75, s=0.1, label="Dropoff Locations")
leg = plt.legend(loc='upper left')
for lh in leg.legend_handles: 
    lh.set_alpha(1)
plt.title("Dropoff Locations")
plt.xlabel("Longitude")
plt.ylabel("Latitude")

# save the plot
plt.savefig("images/dropoff_locations_save.png")
plt.show() if SHOW_PLOTS else plt.clf()

#####################
# KMEANS CLUSTERING #
#####################

df = deepcopy(X)

# load kmeans_pickup and kmeans_dropoff from the models folder using pickle
if LOAD_SAVED_KMEANS_MODELS:
    with open("models/kmeans_200_pickup.pkl", "rb") as file:
        kmeans_200_pickup = pickle.load(file)
    with open("models/kmeans_200_dropoff.pkl", "rb") as file:
        kmeans_200_dropoff = pickle.load(file)
    
        
# fit kmeans_200_pickup and kmeans_200_dropoff with 200 clusters
else:
    n_clusters = 200
    kmeans_200_pickup = (KMeans(n_clusters=n_clusters)
        .fit(df.loc[:, ['pickup_longitude', 'pickup_latitude']].values))
    kmeans_200_dropoff = (KMeans(n_clusters=n_clusters)
        .fit(df.loc[:, ['dropoff_longitude', 'dropoff_latitude']].values))
    
    # save the models to pickle files for loading later
    with open("models/kmeans_200_pickup.pkl", "wb") as file:
        pickle.dump(kmeans_200_pickup, file)
    with open("models/kmeans_200_dropoff.pkl", "wb") as file:
        pickle.dump(kmeans_200_dropoff, file)

# predict the clusters for each pickup and dropoff location
df['pickup_200_cluster'] = kmeans_200_pickup.predict(df[['pickup_longitude', 'pickup_latitude']].values)
df['dropoff_200_cluster'] = kmeans_200_dropoff.predict(df[['dropoff_longitude', 'dropoff_latitude']].values)

# get the centers
pickup_200_centers = kmeans_200_pickup.cluster_centers_
dropoff_200_centers = kmeans_200_dropoff.cluster_centers_

#######################################
# PLOT PICKUP LOCATIONS WITH CLUSTERS #
#######################################

# plot the nyc streets
plt.gcf().set_dpi(500)
nyc_gdf.plot(linewidth=0.1, edgecolor='black', figsize=(12, 12), alpha=0.5, label="NYC Streets")

# plot the cluster locations and the pickup locations color-coded
# to their associated cluster
plt.scatter(df['pickup_longitude'], df['pickup_latitude'], c=df['pickup_200_cluster'], cmap='magma', alpha=1.0, s=0.1, label="Pickup Locations")
plt.scatter(pickup_200_centers[:, 0], pickup_200_centers[:, 1], c='red', alpha=1, s=10, label="Cluster Centers")
leg = plt.legend(loc='upper left')
for lh in leg.legend_handles: 
    lh.set_alpha(1)
plt.title("200-KMeans Clustering for Pickup Locations")
plt.xlabel("Longitude")
plt.ylabel("Latitude")

# save the plot
plt.savefig("images/kmeans_200_pickup_save.png")
plt.show() if SHOW_PLOTS else plt.clf()

########################################
# PLOT DROPOFF LOCATIONS WITH CLUSTERS #
########################################

# plot the nyc streets
plt.gcf().set_dpi(500)
nyc_gdf.plot(linewidth=0.1, edgecolor='black', figsize=(12, 12), alpha=0.5, label="NYC Streets")

# plot the cluster locations and the dropoff locations color-coded
# to their associated cluster
plt.scatter(df['dropoff_longitude'], df['dropoff_latitude'], c=df['dropoff_200_cluster'], cmap='viridis', alpha=1.0, s=0.1, label="Dropoff Locations")
plt.scatter(dropoff_200_centers[:, 0], dropoff_200_centers[:, 1], c='blue', alpha=1, s=10, label="Cluster Centers")
leg = plt.legend(loc='upper left')
for lh in leg.legend_handles: 
    lh.set_alpha(1)
plt.title("200-KMeans Clustering for Dropoff Locations")
plt.xlabel("Longitude")
plt.ylabel("Latitude")

# save the plot
plt.savefig("images/kmeans_200_dropoff_save.png")
plt.show() if SHOW_PLOTS else plt.clf()
