# Adding Pickup Locations, Drones, and Using BART


## Context

Adding more pickup locations may help to grow the customer base and increase the frequency at which customers purchase meals. This would necessarily entail renting or purchasing property and/or renovating space to open these additional pickup locations.

Since the business would be considering longer-term leases or purchases with potentially costly renovations, we need to choose locations that are future-proof.

Locations near BART stations would be good choices because riders could easily pick up meals at or near the stations they travel through on the way to or from work.

[add stuff for drones]

[add stuff for public transit]


## Methodology

We will cluster the BART stations to identify which stations naturally group together. For each cluster, we will examine each station's degree centrality, betweenness centrality, and PageRank. Degree centrality indicates the number of stations directly connected to the station of interest. Betweenness centrality indicates how many shortest paths between other stations pass through that station. Finally, PageRank indicates the overall influence of that station in the BART graph.
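
To make the first two measures concrete, here is a minimal sketch in plain Python that computes degree and betweenness centrality for a small, made-up undirected network. The station labels and the `toy_graph` structure are illustrative assumptions, not the actual BART graph, and the analysis itself relies on the Neo4j Graph Data Science procedures rather than this code.

```python
from collections import deque

# Hypothetical toy network: a short trunk line with one branch (not real BART data)
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D", "E"],
    "D": ["C"],
    "E": ["C"],
}

def degree_centrality(graph):
    """Number of direct neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    scores = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from source
        order = []
        predecessors = {node: [] for node in graph}
        path_count = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate dependencies, farthest nodes first
        dependency = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each undirected pair was counted once from each endpoint
    return {node: score / 2 for node, score in scores.items()}

degree = degree_centrality(toy_graph)
betweenness = betweenness_centrality(toy_graph)
```

In this toy network, node `C` has the most neighbors and also lies on the most shortest paths, so it scores highest on both measures.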

# Included Modules and Packages

In [1]:
import neo4j

import csv

import math
import numpy as np
import pandas as pd

import psycopg2
from geographiclib.geodesic import Geodesic

import warnings
warnings.filterwarnings("ignore")

# Supporting Code

In [2]:
# Connect to Neo4j

driver = neo4j.GraphDatabase.driver(uri="neo4j://neo4j:7687", auth=("neo4j","w205"))
session = driver.session(database="neo4j")

# Connect to PostgreSQL

connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

cursor = connection.cursor()

In [3]:
# function to run a select query and return rows in a pandas dataframe
# pandas loads numeric columns from postgres as float64
# if every value in a column is a whole number, convert the column to integer

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return(df)

In [4]:
def my_calculate_box(point, miles):
    "Given a point and miles, calculate the box in form left, right, top, bottom"
    
    geod = Geodesic.WGS84

    kilometers = miles * 1.60934
    meters = kilometers * 1000

    g = geod.Direct(point[0], point[1], 270, meters)
    left = (g['lat2'], g['lon2'])

    g = geod.Direct(point[0], point[1], 90, meters)
    right = (g['lat2'], g['lon2'])

    g = geod.Direct(point[0], point[1], 0, meters)
    top = (g['lat2'], g['lon2'])

    g = geod.Direct(point[0], point[1], 180, meters)
    bottom = (g['lat2'], g['lon2'])
    
    return(left, right, top, bottom)
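
The box is found by travelling the given distance due west, east, north, and south of the point on the WGS84 ellipsoid. As a rough cross-check that does not need `geographiclib`, the sketch below approximates the same box with a spherical-Earth formula. The function name `approximate_box` and the mean Earth radius constant are assumptions for illustration, and the sample point uses Powell Street's coordinates as reported in this notebook; it is not a replacement for `my_calculate_box`.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; only approximates the WGS84 ellipsoid

def approximate_box(point, miles):
    """Approximate (min_lat, max_lat, min_lon, max_lon) around point = (lat, lon) in degrees."""
    latitude, longitude = point
    # Angular distance covered by `miles` along a meridian
    delta_lat = math.degrees(miles / EARTH_RADIUS_MILES)
    # Circles of latitude shrink by cos(latitude), so the same distance spans more longitude
    delta_lon = math.degrees(miles / (EARTH_RADIUS_MILES * math.cos(math.radians(latitude))))
    return (latitude - delta_lat, latitude + delta_lat,
            longitude - delta_lon, longitude + delta_lon)

# Box extending 1.5 miles in each direction from the sample point
min_lat, max_lat, min_lon, max_lon = approximate_box((37.784, -122.408), 1.5)
```

At Bay Area latitudes the longitude span comes out wider in degrees than the latitude span, which is why the two are computed separately.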

In [5]:
def my_station_get_zips(station, miles):
    "given a station, sum the population of all zip codes within miles distance of it"
    
    connection.rollback()
    
    # pass the station name as a query parameter instead of concatenating it into the SQL
    query = "select latitude, longitude from stations where station = %s"
    
    cursor.execute(query, (station,))
    
    row = cursor.fetchone()
    
    connection.rollback()
    
    if row is None:
        raise ValueError("station not found: " + station)
    
    point = (row[0], row[1])
        
    (left, right, top, bottom) = my_calculate_box(point, miles)
    
    query = "select zip, population from zip_codes "
    query += " where latitude >= %s and latitude <= %s "
    query += " and longitude >= %s and longitude <= %s "
    query += " order by 1 "

    cursor.execute(query, (bottom[0], top[0], left[1], right[1]))
    
    rows = cursor.fetchall()
    
    connection.rollback()
    
    total_population = 0
    
    for row in rows:
        total_population += row[1]
    
    return float(total_population)

In [15]:
def my_station_get_zip_list(station, miles):
    "given a station, return the list of zip codes within miles distance of it"
    
    connection.rollback()
    
    # pass the station name as a query parameter instead of concatenating it into the SQL
    query = "select latitude, longitude from stations where station = %s"
    
    cursor.execute(query, (station,))
    
    row = cursor.fetchone()
    
    connection.rollback()
    
    if row is None:
        raise ValueError("station not found: " + station)
    
    point = (row[0], row[1])
        
    (left, right, top, bottom) = my_calculate_box(point, miles)
    
    query = "select zip, population from zip_codes "
    query += " where latitude >= %s and latitude <= %s "
    query += " and longitude >= %s and longitude <= %s "
    query += " order by 1 "

    cursor.execute(query, (bottom[0], top[0], left[1], right[1]))
    
    rows = cursor.fetchall()
    
    connection.rollback()
    
    zip_list = [row[0] for row in rows]
    
    return zip_list
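
Values such as the station name and the box edges are best handed to the database as query parameters rather than pasted into the SQL text. The sketch below illustrates the idea with Python's built-in `sqlite3` and an in-memory table of made-up zip codes, so it can run without the PostgreSQL container. The table contents and the `zips_in_box` name are hypothetical; with `psycopg2` the placeholder is `%s` rather than `?`.

```python
import sqlite3

# In-memory stand-in for the zip_codes table, filled with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("create table zip_codes (zip text, latitude real, longitude real, population integer)")
conn.executemany(
    "insert into zip_codes values (?, ?, ?, ?)",
    [
        ("00001", 37.78, -122.41, 1000),   # inside the box queried below
        ("00002", 37.79, -122.40, 2500),   # inside the box queried below
        ("00003", 38.50, -121.00, 9999),   # outside the box queried below
    ],
)

def zips_in_box(conn, bottom, top, left, right):
    """Return (zip list, total population) for zip codes whose coordinates lie in the box."""
    query = """
        select zip, population from zip_codes
        where latitude between ? and ?
          and longitude between ? and ?
        order by zip
    """
    # The driver binds the values, so no quoting or string building is needed
    rows = conn.execute(query, (bottom, top, left, right)).fetchall()
    zip_list = [row[0] for row in rows]
    total_population = sum(row[1] for row in rows)
    return zip_list, total_population

zip_list, total_population = zips_in_box(conn, 37.76, 37.81, -122.44, -122.38)
```

Returning both the list and the total from one query would also let a single helper serve both purposes, though that is a design suggestion rather than what the notebook does.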

In [6]:
def cleanse_stations(df):
    """Returns a data frame with unique station names cleansed of line(s) and depart, arrive"""
    
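    words = ["blue", "green", "orange", "red", "yellow", "gray", "depart", "arrive"]
    regex_pattern = r'\b(?:{})\b'.format('|'.join(words))
    # regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal by default
    df["name"] = df["name"].str.replace(regex_pattern, '', regex=True)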
    return df
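
To see what the pattern does, the following sketch applies the same word-boundary regex to a few plain strings with the standard `re` module. The sample node names are guesses at the naming scheme (a line color or `depart`/`arrive` followed by the station name), not values taken from the graph.

```python
import re

words = ["blue", "green", "orange", "red", "yellow", "gray", "depart", "arrive"]
pattern = re.compile(r'\b(?:{})\b'.format('|'.join(words)))

# Hypothetical node names; the real graph may label nodes differently
samples = ["depart Powell Street", "red Powell Street", "arrive Bay Fair"]

# Removing the keyword leaves a leading space, hence the strip()
cleaned = [pattern.sub('', name).strip() for name in samples]

# \b only matches whole words, so "Fred" keeps its "red"
untouched = pattern.sub('', "Fred Street")
```

The leftover leading space is also why the clustering cell calls `str.lstrip()` on the cleaned names before aligning them with the station index.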

In [7]:
def my_neo4j_run_query_pandas(query, **kwargs):
    "run a query and return the results in a pandas dataframe"
    
    result = session.run(query, **kwargs)
    
    df = pd.DataFrame([r.values() for r in result], columns=result.keys())
    
    return df

# Generate Data Frame for Analysis

In [8]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select station,
        latitude,
        longitude
from stations
order by station

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

##### Add population within 1.5 miles of each station, which is the delivery range for a drone

In [9]:
df["pop_1_5"] = [my_station_get_zips(station, 1.5) for station in df["station"]]

##### Add degree centrality, which measures the number of incoming and outgoing connections. High degree centrality indicates that the station connects with many others.

In [10]:
# Degree centrality for the connected graph

query = """

CALL gds.degree.stream('ds_graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score as degree
ORDER BY degree DESC, name

"""

deg_df = my_neo4j_run_query_pandas(query)

# Remove the line and depart / arrive designations

deg_df = cleanse_stations(deg_df)

# Keep the entry for each station with the maximum degree centrality

deg_df = deg_df.groupby(["name"])["degree"].max()
deg_df = deg_df.to_frame()

# Add degree centrality to df

df.set_index("station", inplace=True)
df["degree_centrality"] = deg_df["degree"].values

##### Add betweenness centrality, which measures how many shortest paths between other nodes (stations) pass through a node. A station with high betweenness centrality lies on many of the shortest routes through the network.

In [12]:
# Betweenness centrality

query = """

CALL gds.betweenness.stream('ds_graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score as betweenness
ORDER BY betweenness DESC

"""

bet_df = my_neo4j_run_query_pandas(query)

# Remove the line and depart / arrive designations

bet_df = cleanse_stations(bet_df)

# Keep the entry for each station with the maximum betweenness centrality

bet_df = bet_df.groupby(["name"])["betweenness"].max()
bet_df = bet_df.to_frame()

# Add betweenness centrality to df

df["bet_centrality"] = bet_df["betweenness"].values

##### Add PageRank for each station, which measures the influence of that station in the graph. High PageRank indicates an influential station in the BART map.

In [13]:
# PageRank for each station

query = """

CALL gds.pageRank.stream('ds_graph',
                         { maxIterations: $max_iterations,
                           dampingFactor: $damping_factor}
                         )
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score as page_rank
ORDER BY page_rank DESC, name ASC

"""

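max_iterations = 20
# note: GDS defaults to a damping factor of 0.85; a value as low as 0.05 keeps every score close to 1
damping_factor = 0.05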

pr_df = my_neo4j_run_query_pandas(query, max_iterations=max_iterations, damping_factor=damping_factor)

# Remove the line and depart / arrive designations

pr_df = cleanse_stations(pr_df)

# Keep the entry for each station with the maximum page rank

pr_df = pr_df.groupby(["name"])["page_rank"].max()
pr_df = pr_df.to_frame()

# Add PageRank to df

df["page_rank"] = pr_df["page_rank"].values
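
Because the damping factor strongly affects how far the scores spread, a small power-iteration sketch may help when interpreting the `page_rank` column. It uses the non-normalized form PR(v) = (1 - d) + d * sum(PR(u) / outdegree(u)), which, to the best of our understanding, is the form the GDS documentation describes. The toy graph and the function name are assumptions, and this is not the GDS implementation.

```python
def page_rank(graph, damping_factor, max_iterations=20):
    """Non-normalized PageRank by power iteration on a dict of node -> outgoing neighbors."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iterations):
        new_scores = {}
        for node in graph:
            # Sum the share of score each in-neighbor passes along its outgoing links
            incoming = sum(
                scores[other] / len(graph[other])
                for other in graph
                if node in graph[other]
            )
            new_scores[node] = (1 - damping_factor) + damping_factor * incoming
        scores = new_scores
    return scores

# Hypothetical hub-and-spoke network stored as bidirectional links (not real BART data)
toy_graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}

low = page_rank(toy_graph, damping_factor=0.05)
high = page_rank(toy_graph, damping_factor=0.85)

# Gap between the hub and a spoke under each damping factor
spread_low = low["hub"] - low["a"]
spread_high = high["hub"] - high["a"]
```

With a damping factor of 0.05 every score stays close to 1 and the hub is barely distinguishable from the spokes, which is consistent with the narrow range of `page_rank` values (roughly 1.00 to 1.03) reported in this notebook.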

##### Impute population values for Antioch, Milpitas, OAK, and Pittsburg

In [29]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select *
from zip_codes

"""

temp = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

In [55]:
# Using the zip_codes table, find the population for each of the four corresponding zip codes

antioch_station_zip = "94509"
milpitas_station_zip = "95035"
OAK_station_zip = "94621"
pittsburg_station_zip = "94565"

# .iloc[0] picks the single matching value; calling int() on a whole Series is deprecated in recent pandas

antioch_pop = int(temp.loc[temp["zip"] == antioch_station_zip, "population"].iloc[0])
milpitas_pop = int(temp.loc[temp["zip"] == milpitas_station_zip, "population"].iloc[0])
OAK_pop = int(temp.loc[temp["zip"] == OAK_station_zip, "population"].iloc[0])
pittsburg_pop = int(temp.loc[temp["zip"] == pittsburg_station_zip, "population"].iloc[0])

In [63]:
# Assign the population values back to the data frame

df.loc[df.index=="Antioch", "pop_1_5"] = antioch_pop
df.loc[df.index=="Milpitas", "pop_1_5"] = milpitas_pop
df.loc[df.index=="OAK", "pop_1_5"] = OAK_pop
df.loc[df.index=="Pittsburg", "pop_1_5"] = pittsburg_pop

##### Cluster the stations

In [217]:
query = """

CALL gds.louvain.stream('ds_graph')
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).name AS name, communityId as community, intermediateCommunityIds as intermediate_community
ORDER BY community, name ASC

"""

# Clean the results

temp = my_neo4j_run_query_pandas(query)
temp = cleanse_stations(temp)
temp["name"] = temp["name"].str.lstrip()
temp = temp.groupby("name")["community"].max().to_frame()
temp["cluster"] = 0

# Refine the clusters - Louvain returns too many communities, so bin the community IDs into quartiles

q1, q2, q3 = temp["community"].quantile([0.25, 0.5, 0.75])

temp.loc[temp["community"] < q1, "cluster"] = 1
temp.loc[(temp["community"] >= q1) & (temp["community"] < q2), "cluster"] = 2
temp.loc[(temp["community"] >= q2) & (temp["community"] < q3), "cluster"] = 3
temp.loc[temp["community"] >= q3, "cluster"] = 4

# Drop community

temp.drop("community", axis=1, inplace=True)

# Append to df

df["cluster"] = temp["cluster"]
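
The refinement step assigns each station to one of four groups according to where its Louvain community ID falls among the quartiles of all community IDs. A minimal sketch of that binning in plain Python, using made-up community IDs, is shown below. `statistics.quantiles` with `method="inclusive"` is used on the assumption that it matches the linear interpolation pandas applies by default; the station labels and `assign_cluster` are hypothetical.

```python
import statistics

# Hypothetical community IDs keyed by station label (not real Louvain output)
communities = {"s1": 3, "s2": 3, "s3": 17, "s4": 17, "s5": 42, "s6": 42, "s7": 88, "s8": 88}

# Three cut points that split the IDs into quartiles
q1, q2, q3 = statistics.quantiles(communities.values(), n=4, method="inclusive")

def assign_cluster(community_id):
    """Map a community ID to cluster 1-4 using half-open intervals between the cut points."""
    if community_id < q1:
        return 1
    if community_id < q2:
        return 2
    if community_id < q3:
        return 3
    return 4

clusters = {station: assign_cluster(cid) for station, cid in communities.items()}
```

Community IDs are labels rather than measurements, so this binning groups communities by ID order; it may be worth checking that stations sharing a cluster are in fact neighbors on the map.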


# Analysis

## Identify which stations in each cluster look like good candidates for a pickup location

##### Start by finding which stations in each cluster have higher values than Berkeley's

In [232]:
# Create values for Downtown Berkeley

# df is indexed by station name, so a label lookup returns the scalar directly

berk_pop_1_5 = df.loc["Downtown Berkeley", "pop_1_5"]
berk_deg_cent = df.loc["Downtown Berkeley", "degree_centrality"]
berk_bet_cent = df.loc["Downtown Berkeley", "bet_centrality"]
berk_page_rank = df.loc["Downtown Berkeley", "page_rank"]

In [263]:
df[(df["cluster"] == 1) & (df["bet_centrality"] > df["bet_centrality"].quantile(0.66))]

| station | latitude | longitude | pop_1_5 | degree_centrality | bet_centrality | page_rank | cluster |
|---|---|---|---|---|---|---|---|
| Powell Street | 37.784 | -122.408 | 207857.0 | 6.0 | 3339.4838 | 1.003696 | 1 |


In [264]:
df[(df["cluster"] == 2) & (df["bet_centrality"] > df["bet_centrality"].quantile(0.66))]

| station | latitude | longitude | pop_1_5 | degree_centrality | bet_centrality | page_rank | cluster |
|---|---|---|---|---|---|---|---|
| Bay Fair | 37.697 | -122.1265 | 71001.0 | 5.0 | 3348.740208 | 1.013135 | 2 |
| Coliseum | 37.753611 | -122.196944 | 69880.0 | 6.0 | 4306.942363 | 1.012288 | 2 |
| Fruitvale | 37.7748 | -122.2241 | 90602.0 | 5.0 | 4641.959661 | 1.004389 | 2 |
| Lake Merritt | 37.797773 | -122.266588 | 85861.0 | 5.0 | 5155.831877 | 1.006025 | 2 |


In [265]:
df[(df["cluster"] == 3) & (df["bet_centrality"] > df["bet_centrality"].quantile(0.66))]

| station | latitude | longitude | pop_1_5 | degree_centrality | bet_centrality | page_rank | cluster |
|---|---|---|---|---|---|---|---|
| Embarcadero | 37.793056 | -122.397222 | 170877.0 | 6.0 | 3648.987775 | 1.00371 | 3 |
| Lafayette | 37.893186 | -122.124614 | 29639.0 | 3.0 | 4469.0 | 1.031895 | 3 |
| Montgomery Street | 37.789355 | -122.401942 | 178168.0 | 6.0 | 3492.402727 | 1.003696 | 3 |
| Orinda | 37.878427 | -122.18374 | 19341.0 | 3.0 | 4997.0 | 1.031779 | 3 |
| Pleasant Hill | 37.928399 | -122.055992 | 22734.0 | 3.0 | 3365.0 | 1.031897 | 3 |
| Rockridge | 37.844452 | -122.252083 | 75154.0 | 3.0 | 5509.0 | 1.024828 | 3 |
| Walnut Creek | 37.905724 | -122.067332 | 22734.0 | 3.0 | 3925.0 | 1.031897 | 3 |
| West Oakland | 37.8049 | -122.2951 | 42316.0 | 6.0 | 3942.135136 | 1.005418 | 3 |


In [266]:
df[(df["cluster"] == 4) & (df["bet_centrality"] > df["bet_centrality"].quantile(0.66))]

| station | latitude | longitude | pop_1_5 | degree_centrality | bet_centrality | page_rank | cluster |
|---|---|---|---|---|---|---|---|
| 12th Street | 37.803608 | -122.272006 | 54365.0 | 5.0 | 5139.715461 | 1.006042 | 4 |
| 19th Street | 37.807869 | -122.26898 | 85861.0 | 5.0 | 4820.250748 | 1.006131 | 4 |
| MacArthur | 37.82826 | -122.267275 | 100658.0 | 5.0 | 5999.809223 | 1.01315 | 4 |
| SFO | 37.6164 | -122.391 | 22845.0 | 5.0 | 3596.811474 | 1.004461 | 4 |
