# 2.5 Advanced Geospatial Plotting

## This script contains the following:
#### [1. Import Libraries](#import-libraries)
#### [2. Import Data](#import-data)
#### [3. Data Preprocessing](#preprocessing)
#### [4. Geospatial Plotting](#plotting)
#### [5. Export Visualization](#export)

### 1. Import Libraries<a id='import-libraries'></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%%capture

import pandas as pd
import os
try:
  from keplergl import KeplerGl
except ImportError:
  !pip install keplergl
  from keplergl import KeplerGl
from pyproj import CRS
import numpy as np
from matplotlib import pyplot as plt

In [None]:
from google.colab import output
output.enable_custom_widget_manager()

### 2. Import Data<a id='import-data'></a>

In [None]:
folderpath = r'/content/drive/MyDrive/CAREER FOUNDRY'

df = pd.read_pickle(os.path.join(folderpath, 'cleaned_nyc_bike_weather_data.pkl'))

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.head()

### 3. Data Preprocessing<a id='preprocessing'></a>

#### FINDING TOTAL TRIPS FROM AND TO EACH STATION

In [None]:
# Create a value column and group by start and end station
df['value'] = 1
df_group = df.groupby(['start_station_name', 'end_station_name'])['value'].count().reset_index()

In [None]:
# Check the output
df_group

In [None]:
# Rename the value column for clarity
df_group.rename(columns = {'value' : 'trips'}, inplace = True)

In [None]:
# Check that the trip total equals the number of rows in the original dataframe
print(df_group['trips'].sum())
print(df.shape[0])

In [None]:
df_group['trips'].describe()

The median number of trips between any two stations is 4, but the mean is ~30. The distribution is strongly right-skewed: most routes are rarely used, while a small number of very popular routes pull the mean up.

This is good for our purposes, because it suggests that most riders follow a common pattern. It is likely that ride density is concentrated around a select number of stations.

#### ADDING GEOSPATIAL COORDINATES TO TOTAL TRIPS

In [None]:
# Isolate the start and end coordinates of routes taken
stations = df[['start_station_name', 'end_station_name', 'start_lat', 'start_lng', 'end_lat', 'end_lng']].drop_duplicates().reset_index(drop=True)

In [None]:
stations.head()

In [None]:
stations.shape

#### NOTE:
  Most (if not all) of the stations have more than one latitude/longitude pair associated with them. This makes sense because a station is not a single point on a map, and may extend up to 10 meters depending on the station. However, we will not be able to visualize the data properly if an individual station appears on the map more than once.

#### TESTING UNIQUE COORDINATES

In [None]:
test = stations.loc[stations['start_station_name'] == 'Flatbush Ave & Ocean Ave']
test

In [None]:
test['start_lat'].unique().shape

In [None]:
test['start_lat'].describe()

In [None]:
test['start_lat'].max() - test['start_lat'].min()

Looking only at the Flatbush Ave & Ocean Ave station, there are 1,390 UNIQUE START LATITUDES recorded for it as a start station (the coordinates could differ again when it is an end station). Across these 1,390 values, the max is 40.665249 and the min is 40.658378, a difference of 0.006871 degrees.

Although the station has different coordinates, they are approximating the same place. Thus, I feel that the MEDIAN of these values would be an acceptable way to measure a station's location.
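To put the latitude spread quoted above into more familiar units, it can be converted to meters. This is a rough sketch that assumes one degree of latitude is about 111 km (a rounded constant that I am introducing here, not something from the dataset).

```python
# Approximate length of one degree of latitude in meters
# (rounded assumption; the true value varies slightly with latitude)
METERS_PER_DEGREE_LAT = 111_000

lat_max = 40.665249  # max start_lat quoted above
lat_min = 40.658378  # min start_lat quoted above

spread_deg = lat_max - lat_min
spread_m = spread_deg * METERS_PER_DEGREE_LAT
print(f"{spread_deg:.6f} degrees is roughly {spread_m:.0f} m north-south")
```

Under that assumption the full range comes to several hundred meters, much larger than a dock's footprint. A few outlying records may account for this, which is one more reason to prefer a robust statistic such as the median over the mean or the extremes.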

#### USING THE MEDIAN TO APPLY COORDINATES

In [None]:
# Group the dataframe by start and end station name, and take the median of the coordinates
stations = stations.groupby(['start_station_name', 'end_station_name'])[['start_lat', 'start_lng', 'end_lat', 'end_lng']].median().reset_index()

In [None]:
stations.shape

After grouping the dataframe by start and end station name and taking the median of the latitude and longitude coordinates, we have a dataframe with the same number of rows as the df_group dataframe. In other words, 1,013,397 different routes were taken in 2022.
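Taking the median per route means the same station can still end up with slightly different coordinates on different routes. One possible alternative, sketched below on made-up values, is to compute a single median coordinate per station and map it back onto both ends of every route. The frame and the helper names (`routes`, `station_coords`, `fixed`) are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy route table (hypothetical values); station "A" has slightly
# different coordinates depending on the route it appears in
routes = pd.DataFrame({
    "start_station_name": ["A", "A", "B"],
    "end_station_name":   ["B", "C", "A"],
    "start_lat": [40.7010, 40.7012, 40.7100],
    "start_lng": [-73.9900, -73.9902, -73.9800],
    "end_lat":   [40.7100, 40.7200, 40.7011],
    "end_lng":   [-73.9800, -73.9700, -73.9901],
})

# Stack start and end observations into one long table of (name, lat, lng)
starts = routes[["start_station_name", "start_lat", "start_lng"]].rename(
    columns={"start_station_name": "name", "start_lat": "lat", "start_lng": "lng"}
)
ends = routes[["end_station_name", "end_lat", "end_lng"]].rename(
    columns={"end_station_name": "name", "end_lat": "lat", "end_lng": "lng"}
)

# One median coordinate per station, regardless of route or direction
station_coords = pd.concat([starts, ends]).groupby("name")[["lat", "lng"]].median()

# Map the single per-station coordinate back onto both ends of each route
fixed = routes[["start_station_name", "end_station_name"]].copy()
for side in ("start", "end"):
    for axis in ("lat", "lng"):
        fixed[f"{side}_{axis}"] = fixed[f"{side}_station_name"].map(station_coords[axis])

print(fixed)
```

With this approach every arc that touches a station starts or ends at exactly the same point, at the cost of one extra grouping step.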

In [None]:
# Merge the two dataframes (inner join, so only routes present in both are kept)
df_final = df_group.merge(stations, how='inner', on=['start_station_name', 'end_station_name'], indicator = 'merge_flag')

In [None]:
df_final

In [None]:
df_final['merge_flag'].value_counts()
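Because the merge above is an inner join, `merge_flag` will be `both` for every row, so `value_counts()` cannot reveal routes that failed to match. A minimal sketch of a stricter check, on hypothetical toy frames, uses an outer join together with the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for df_group and stations (hypothetical values)
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B"],
    "end_station_name":   ["B", "C", "A"],
    "trips": [6, 1, 1],
})
coords = pd.DataFrame({
    "start_station_name": ["A", "A", "B"],
    "end_station_name":   ["B", "C", "A"],
    "start_lat": [40.701, 40.701, 40.710],
})

keys = ["start_station_name", "end_station_name"]

# how="outer" keeps unmatched rows so the indicator can actually flag them;
# validate="one_to_one" raises MergeError if either side has duplicate keys
merged = trips.merge(
    coords, how="outer", on=keys,
    indicator="merge_flag", validate="one_to_one",
)

# True only if every route was found in both frames
all_matched = (merged["merge_flag"] == "both").all()
print(all_matched)
```

If `all_matched` is true, the outer and inner joins give the same rows and the inner join is safe to use for the map.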

### 4. Geospatial Plotting<a id='plotting'></a>

In [None]:
# Create KeplerGl instance

m = KeplerGl(height = 700, data={"data_1": df_final})
m

#### MAP FORMATTING NOTES:
  
  Color - Stations are in red, and the routes go from light blue (start station) to dark blue (end station). These colors were chosen because they are the colors used in Citibank's branding.

  Layer Blending - The layers are additive, creating a white color in especially dense areas. Against the dark map, this makes it very easy to see which areas of the city were most popular.

  Filter - A filter based on the number of trips was added, and it is tentatively set to 1,000 trips. So, the above map only shows the routes that had 1,000+ trips taken in 2022.
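The trip filter described above is applied inside the Kepler.gl interface. The same selection could also be made in pandas before the map is built, which may keep the exported HTML smaller. This is a sketch on a toy frame. The name `MIN_TRIPS` and the data are hypothetical.

```python
import pandas as pd

MIN_TRIPS = 1000  # same threshold as the map filter described above

# Toy stand-in for df_final (hypothetical values)
toy_final = pd.DataFrame({
    "start_station_name": ["A", "A", "B"],
    "end_station_name":   ["B", "C", "A"],
    "trips": [2500, 40, 1000],
})

# Keep only routes at or above the threshold
popular = toy_final[toy_final["trips"] >= MIN_TRIPS].reset_index(drop=True)
print(popular)
```

Filtering up front trades interactivity for size: the slider in the map can no longer go below the threshold, because those routes are not in the data.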

#### OBSERVATIONS:

  Manhattan is the clear winner for Citi Bike usage. The further from Manhattan we go, the fewer trips were taken (to or from). Additionally, the number of trips taken appears to track the number of stations available. People living in Queens and the Bronx have fewer options available to start or end a trip, so it makes sense that they take fewer trips overall.

  Within Manhattan, the pattern of most trips follows MTA routes (particularly the subway locations). If we were to add a layer of subway entrances to this map, it would likely mirror the areas of highest trip density. It seems New Yorkers who use the subway are also likely to bike to/from their subway stop of choice.

  Corresponding to the point above, most of the trips were short. We reached that conclusion in a previous visualization, but we can now understand more clearly why. Travelers don't need to ride far from their home or destination to get to the nearest subway stop. However, the longer trips are almost all on the east and west edges of the island (along the Hudson and East Rivers). No subway lines run along these waterfronts, so that makes sense.

  Finally, there is a high concentration of rides along the perimeter of Central Park. This suggests that Central Park is a popular destination, with the lower half of the park seeing more use than the upper half.

### 5. Export Visualization<a id='export'></a>

In [None]:
config = m.config

In [None]:
import json
with open("config.json", "w") as outfile:
    json.dump(config, outfile)

In [None]:
m.save_to_html(file_name = 'Citi_Bike_Trips_Aggregated.html', read_only = False, config = config)
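The saved `config.json` can be read back later to rebuild the map with the same layers and filters. This is a minimal round-trip sketch. The `config` dict is a heavily simplified, hypothetical stand-in for `m.config`, and a temporary directory is used so that nothing in the working folder is overwritten.

```python
import json
import os
import tempfile

# Stand-in for m.config (hypothetical, heavily simplified structure)
config = {"version": "v1", "config": {"visState": {"filters": []}}}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "config.json")

    # Write the configuration, as in the export cell above
    with open(path, "w") as outfile:
        json.dump(config, outfile)

    # Read it back into a plain dict
    with open(path) as infile:
        loaded_config = json.load(infile)

# In the notebook, the loaded dict could then be passed back in, e.g.
# m = KeplerGl(height=700, data={"data_1": df_final}, config=loaded_config)
print(loaded_config == config)
```

The `KeplerGl` call is left as a comment so the sketch runs without the library installed.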