In [1]:
import cabi_functions
import pandas as pd

# Creating trip summary table
Our goal is to create a table showing how many trips were taken between each pair of stations, regardless of direction.
## Loading trip data
First we load the table of all trips, remove any rows with incomplete data, and convert each column to the most appropriate datatype.

In [2]:
df = cabi_functions.return_trip_datatable()
# clean the NA values out
df = df.dropna()
# convert them to appropriate datatypes
df = df.convert_dtypes()
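
As a rough illustration of what these two calls do, here is a minimal sketch on a made-up frame. The values are invented and only the column names match the trip table; the real output of `cabi_functions.return_trip_datatable()` may look different.

```python
import numpy as np
import pandas as pd

# Toy data: a missing value forces the first column to float64
toy = pd.DataFrame({
    "start_station_id": [31000.0, 31001.0, np.nan],
    "end_station_id": [31001.0, 31000.0, 31002.0],
})

# dropna() removes the row with the missing start station;
# convert_dtypes() then picks nullable dtypes, turning whole-number floats into Int64
cleaned = toy.dropna().convert_dtypes()
print(cleaned.dtypes)
```

Converting to integers here matters later, because the station ids are turned into strings and concatenated.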


## Removing invalid trips
Some trips have a start or end station listed as 0. These are invalid, so we keep only trips where both station ids are positive.

In [3]:
df = df[df.start_station_id>0]
df = df[df.end_station_id>0]


## Removing trips from removed stations
We also need to know the location of every station involved in a trip, so we load the list of current station names. If a table with the locations of removed stations turns up in the future, those stations could be added to the visualization as well.
### Loading our current stations

In [4]:
# Load the short names of the current stations from the station lookup table
cabi_stations = 'https://raw.githubusercontent.com/mlinds/cabi-data/main/data/stationLookup.csv'
station_names_list = list(pd.read_csv(cabi_stations).short_name)

### Selecting only trips involving extant stations
We filter the dataframe to keep only trips whose start and end stations both exist in our location lookup table.

In [5]:
df = df[df.end_station_id.isin(station_names_list)]
df = df[df.start_station_id.isin(station_names_list)]

## Merging based on which stations are involved
Now we group the trips by the *pairing* of stations involved (i.e. we do not care which station is the origin and which is the destination).

In [6]:
# sort each (start, end) pair so that A->B and B->A give the same pair
sorted_stations = [sorted([int(x),int(y)]) for x,y in zip(df.start_station_id,df.end_station_id)]
# concatenate the two ids into a single integer key (relies on every id having exactly five digits)
sorted_stations_combined = [int(str(x)+str(y)) for x,y in sorted_stations]
# assign the key to a column in the dataframe and group by the unique station combo; the results go into a separate dataframe below
grouped = df.assign(sorted_stations=sorted_stations_combined).groupby('sorted_stations')

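# count the rows in each group; after dropna() every column gives the same count, so we keep just one below
route_popularity = grouped.count().reset_index()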
route_popularity = route_popularity[['sorted_stations','end_station_id']]
route_popularity.columns=['sorted','popularity']
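
The same order-independent count can also be sketched without building a combined integer key. The example below is an alternative under stated assumptions, not the method used in this notebook: the trip data is invented, and it groups on two separate columns (the smaller and the larger station id) instead of a concatenated key.

```python
import numpy as np
import pandas as pd

# Invented trips: two between 31000 and 31001 (one in each direction),
# one between 31000 and 31200, and one round trip at 31200
trips = pd.DataFrame({
    "start_station_id": [31000, 31001, 31000, 31200],
    "end_station_id": [31001, 31000, 31200, 31200],
})

# Element-wise min/max puts the smaller id first, whatever the direction of travel
lo = np.minimum(trips["start_station_id"], trips["end_station_id"])
hi = np.maximum(trips["start_station_id"], trips["end_station_id"])

pair_counts = (
    trips.assign(st=lo, en=hi)
    .groupby(["st", "en"])
    .size()                      # number of trips per unordered pair
    .reset_index(name="popularity")
    .sort_values("popularity", ascending=False)
)
print(pair_counts)
```

Grouping on two columns avoids having to split the key apart again afterwards, and it does not depend on how many digits a station id has.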

We will now split the combined key back into its two station ids so that we can plot them separately later.

In [7]:
# the first five digits of the key are the lower station id, the last five the higher one
a = [int(str(val)[0:5]) for val in route_popularity.sorted]
b = [int(str(val)[5:10]) for val in route_popularity.sorted]

route_popularity = route_popularity.assign(st=a,en=b)
route_popularity.sort_values('popularity',ascending=False,inplace=True)
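
The slicing above only recovers the original stations when both ids are exactly five digits long. The sketch below shows the round trip and how it breaks for a shorter id; `encode_pair` and `decode_pair` are hypothetical helper names introduced for illustration, and the station ids are invented.

```python
def encode_pair(a, b):
    """Build the combined key the same way as the notebook: sort, then concatenate as text."""
    lo, hi = sorted([int(a), int(b)])
    return int(str(lo) + str(hi))


def decode_pair(key):
    """Split the key back into two ids, assuming five digits each."""
    text = str(key)
    return int(text[0:5]), int(text[5:10])


# Two five-digit ids survive the round trip, in sorted order
good_key = encode_pair(31001, 31000)
good_pair = decode_pair(good_key)

# A three-digit id shifts the split point, so the decoded pair is wrong
bad_key = encode_pair(999, 31000)
bad_pair = decode_pair(bad_key)

print(good_pair, bad_pair)
```

If the station lookup table ever contains ids of a different length, the two ids would need to be kept in separate columns instead.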

In [8]:
route_popularity.to_csv('data/connections_csv.csv',columns=['st','en','popularity'],index=False)