# Option: Adding Pickup Locations

## Context

Adding more pickup locations may help to grow the customer base and increase the frequency at which customers purchase meals. This would necessarily entail renting or purchasing property and/or renovating space to open these additional pickup locations.

Because the business would be committing to longer-term leases or purchases, potentially with costly renovations, we need to choose locations that are future-proof.

Locations near BART stations would be good choices because riders could easily pick up meals at or near the stations they travel through on the way to or from work.

## Methodology

We will use graph centrality algorithms to identify stations at or near which we could open new pickup locations. More specifically, we will examine degree centrality for each station, which indicates how well connected (by number of connections) each station is to the others. Additionally, we will examine betweenness centrality, which indicates how many shortest paths pass through each station. Finally, we will examine each station's PageRank, which indicates how influential that station is within the overall BART network.
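To make the first two measures concrete, here is a minimal pure-Python sketch on a hypothetical five-station line (A to E), not the real BART graph. It treats the graph as undirected and counts each station pair once, so the scale will differ from what the graph database reports for `ds_graph`; the names `edges`, `adjacency`, and `bfs` are illustrative only.

```python
from collections import deque
from itertools import combinations

# Hypothetical five-station line, not the real BART graph: A - B - C - D - E
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

def bfs(source):
    """Return (distance, shortest-path count) from source to each reachable node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                paths[nbr] = 0
                queue.append(nbr)
            if dist[nbr] == dist[node] + 1:
                paths[nbr] += paths[node]
    return dist, paths

# Degree centrality: number of direct connections per station
degree = {node: len(nbrs) for node, nbrs in adjacency.items()}

# Betweenness centrality: share of shortest s-t paths that pass through v
info = {node: bfs(node) for node in adjacency}
betweenness = {node: 0.0 for node in adjacency}
for s, t in combinations(sorted(adjacency), 2):
    dist_s, paths_s = info[s]
    if t not in dist_s:
        continue
    for v in adjacency:
        if v in (s, t):
            continue
        dist_v, paths_v = info[v]
        if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
            betweenness[v] += paths_s[v] * paths_v[t] / paths_s[t]

print(degree)
print(betweenness)
```

In this toy line, the middle station C has the highest betweenness while the end stations A and E have none, which is the pattern the end-of-line criterion below relies on.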

In order to select which stations would be good candidates, we will:

* Identify which stations have higher measures on all of the dimensions listed above compared to the Downtown Berkeley station (current pickup location). We are assuming that higher measures on these dimensions relative to Downtown Berkeley are indicative of higher sales.
* Identify stations that have low betweenness but dense surrounding populations. These stations are likely to be at the ends of the BART lines, which will enable us to expand further into the suburbs and capture market share from commuters to and from the Peninsula, South Bay, etc.
* Identify stations that have the densest surrounding populations. These stations are likely to be within San Francisco, which would enable us to expand into the Peninsula and capture customers who may not even ride on BART.

## Included Modules and Packages

In [1]:
import neo4j

import csv

import math
import numpy as np
import pandas as pd

import psycopg2
from geographiclib.geodesic import Geodesic

import warnings
warnings.filterwarnings("ignore")

## Supporting Code

In [2]:
driver = neo4j.GraphDatabase.driver(uri="neo4j://neo4j:7687", auth=("neo4j","w205"))

In [3]:
session = driver.session(database="neo4j")

In [4]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

In [5]:
cursor = connection.cursor()

In [6]:
# function to run a select query and return rows in a pandas dataframe
# pandas may load numeric postgres columns as float64
# if every non-null value in a column is a whole number, convert it to nullable Int64

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return(df)
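The whole-number test at the heart of this function can be tried in isolation. The sketch below uses plain Python lists instead of a DataFrame; `is_whole_number_column` and the sample values are hypothetical and only illustrate the check, not the conversion to `Int64`.

```python
import math

def is_whole_number_column(values):
    """True when every non-missing value has no fractional part."""
    return all(float(v).is_integer() for v in values if not math.isnan(v))

counts = [3.0, 7.0, float("nan"), 12.0]  # hypothetical column containing a NULL
prices = [3.0, 7.5, 12.0]                # hypothetical column containing a fraction

print(is_whole_number_column(counts))  # True
print(is_whole_number_column(prices))  # False
```

Missing values are skipped rather than treated as fractions, which is why the nullable `Int64` type (capital I) is needed for the converted column.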

In [7]:
def my_calculate_box(point, miles):
    "Given a point and miles, calculate the box in form left, right, top, bottom"
    
    geod = Geodesic.WGS84

    kilometers = miles * 1.60934
    meters = kilometers * 1000

    g = geod.Direct(point[0], point[1], 270, meters)
    left = (g['lat2'], g['lon2'])

    g = geod.Direct(point[0], point[1], 90, meters)
    right = (g['lat2'], g['lon2'])

    g = geod.Direct(point[0], point[1], 0, meters)
    top = (g['lat2'], g['lon2'])

    g = geod.Direct(point[0], point[1], 180, meters)
    bottom = (g['lat2'], g['lon2'])
    
    return(left, right, top, bottom)
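As a rough sanity check on the box size, the same idea can be approximated without `geographiclib` by assuming a spherical Earth. This is a sketch only: `approximate_box`, the radius constant, and the sample coordinates (roughly downtown Berkeley) are assumptions, and it returns bounds rather than the four corner points produced above.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean radius; spherical approximation, not WGS84

def approximate_box(point, miles):
    """Return (south, north, west, east) bounds about `miles` from `point`."""
    lat, lon = point
    # One radian of latitude spans one Earth radius
    dlat = math.degrees(miles / EARTH_RADIUS_MILES)
    # Longitude lines converge toward the poles, so scale by cos(latitude)
    dlon = math.degrees(miles / (EARTH_RADIUS_MILES * math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)

# Illustrative coordinates, not taken from the stations table
south, north, west, east = approximate_box((37.87, -122.27), 5)
print(south, north, west, east)
```

At Bay Area latitudes, 5 miles works out to roughly 0.07 degrees of latitude and 0.09 degrees of longitude, so a box much larger or smaller than that would suggest a units problem.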

In [42]:
def my_station_get_zips(station, miles):
    "given a station, find all zip codes within miles distance and return their summed population"
    
    connection.rollback()
    
    # parameterized query: psycopg2 handles quoting of the station name
    query = "select latitude, longitude from stations "
    query += "where station = %s"
    
    cursor.execute(query, (station,))
    
    rows = cursor.fetchall()
    
    connection.rollback()
    
    for row in rows:
        latitude = row[0]
        longitude = row[1]
        
    point = (latitude, longitude)
        
    (left, right, top, bottom) = my_calculate_box(point, miles)
    
    # select zip codes whose coordinates fall inside the bounding box
    query = "select zip, population from zip_codes "
    query += " where latitude >= %s"
    query += " and latitude <= %s"
    query += " and longitude >= %s"
    query += " and longitude <= %s"
    query += " order by 1 "

    cursor.execute(query, (bottom[0], top[0], left[1], right[1]))
    
    rows = cursor.fetchall()
    
    connection.rollback()
    
    total_population = 0
    
    for row in rows:
        # zip_code avoids shadowing the built-in zip()
        zip_code, population = row[0], row[1]
        total_population += population
    return float(total_population)

In [43]:
def cleanse_stations(df):
    """Returns a data frame with station names cleansed of line(s) and depart, arrive"""
    
    words = ["blue", "green", "orange", "red", "yellow", "gray", "depart", "arrive"]
    regex_pattern = r'\b(?:{})\b'.format('|'.join(words))
    # regex=True is required in pandas 2.x, where str.replace defaults to literal matching
    df["name"] = df["name"].str.replace(regex_pattern, '', regex=True)
    return df
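The pattern built in `cleanse_stations` can be tried on a few strings with the standard `re` module. The sample names below are hypothetical, since the actual node naming scheme in the graph is not shown here; the point is only to show what the pattern removes and what it leaves behind.

```python
import re

words = ["blue", "green", "orange", "red", "yellow", "gray", "depart", "arrive"]
pattern = r"\b(?:{})\b".format("|".join(words))

# Hypothetical node names; the real graph may format them differently
samples = ["depart Downtown Berkeley", "red Downtown Berkeley", "arrive Fruitvale"]

cleaned = [re.sub(pattern, "", s) for s in samples]
stripped = [c.strip() for c in cleaned]

print(cleaned)
print(stripped)
```

The `\b` word boundaries keep the pattern from matching inside longer words, and matching is case-sensitive. Removing a word leaves its neighboring space in place, so any later step that matches on names (rather than on row position) would probably want to strip whitespace first.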

In [44]:
def my_neo4j_run_query_pandas(query, **kwargs):
    "run a query and return the results in a pandas dataframe"
    
    result = session.run(query, **kwargs)
    
    df = pd.DataFrame([r.values() for r in result], columns=result.keys())
    
    return df

## Generate Data Frame for Analysis

In [45]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select station
from stations
order by station

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

##### Add population within 5 miles of the station. Based on prior analysis, we found that customers who sign up for delivery live within 5 miles.

In [46]:
df["pop_5"] = [my_station_get_zips(station, 5) for station in df["station"]]

##### Add degree centrality, which measures the number of incoming and outgoing connections. High degree centrality indicates that the station connects with many others.

In [47]:
# Degree centrality for the connected graph

query = """

CALL gds.degree.stream('ds_graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score as degree
ORDER BY degree DESC, name

"""

deg_df = my_neo4j_run_query_pandas(query)

In [48]:
# Remove the line and depart / arrive designations

deg_df = cleanse_stations(deg_df)

# Keep the entry for each station with the maximum degree centrality

deg_df = deg_df.groupby(["name"])["degree"].max()
deg_df = deg_df.to_frame()

# Add degree centrality to df

df.set_index("station", inplace=True)
df["degree_centrality"] = deg_df["degree"].values

##### Add betweenness centrality, which measures the number of shortest paths that pass through a node (station). High betweenness centrality for a station indicates that many shortest paths pass through that station.

In [49]:
# Betweenness centrality

query = """

CALL gds.betweenness.stream('ds_graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score as betweenness
ORDER BY betweenness DESC

"""

bet_df = my_neo4j_run_query_pandas(query)

In [50]:
# Remove the line and depart / arrive designations

bet_df = cleanse_stations(bet_df)

# Keep the entry for each station with the maximum betweenness centrality

bet_df = bet_df.groupby(["name"])["betweenness"].max()
bet_df = bet_df.to_frame()

# Add betweenness centrality to df

df["bet_centrality"] = bet_df["betweenness"].values

##### Add PageRank for each station, which measures the influence of that station in the graph. High PageRank indicates an influential station in the BART map.

In [51]:
# PageRank for each station

query = """

CALL gds.pageRank.stream('ds_graph',
                         { maxIterations: $max_iterations,
                           dampingFactor: $damping_factor}
                         )
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score as page_rank
ORDER BY page_rank DESC, name ASC

"""

max_iterations = 20
damping_factor = 0.05

pr_df = my_neo4j_run_query_pandas(query, max_iterations=max_iterations, damping_factor=damping_factor)
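The effect of the damping factor can be seen on a toy graph. The sketch below assumes the non-normalized formulation PR(n) = (1 - d) + d * Σ PR(src) / out_degree(src), which is our understanding of what the graph library computes; the `page_rank` helper and the `toy_links` graph are hypothetical and are not the BART data.

```python
def page_rank(out_links, damping_factor, max_iterations):
    """Iterate PR(n) = (1 - d) + d * sum(PR(src) / out_degree(src)) over incoming links."""
    scores = {node: 1.0 for node in out_links}
    for _ in range(max_iterations):
        updated = {node: 1.0 - damping_factor for node in out_links}
        for src, targets in out_links.items():
            for target in targets:
                updated[target] += damping_factor * scores[src] / len(targets)
        scores = updated
    return scores

def spread(scores):
    """Difference between the highest and lowest score."""
    return max(scores.values()) - min(scores.values())

# Hypothetical toy graph: B is a hub, C and D have no incoming links
toy_links = {"A": ["B"], "B": ["A"], "C": ["B"], "D": ["B"]}

low = page_rank(toy_links, damping_factor=0.05, max_iterations=20)
high = page_rank(toy_links, damping_factor=0.85, max_iterations=20)

print(round(spread(low), 3), round(spread(high), 3))
```

Under this assumption, a damping factor as low as 0.05 keeps every score in a narrow band around 1, so differences between stations are small in absolute terms even when the ranking is meaningful. A more conventional value such as 0.85 spreads the scores out much further.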

In [52]:
# Remove the line and depart / arrive designations

pr_df = cleanse_stations(pr_df)

# Keep the entry for each station with the maximum page rank

pr_df = pr_df.groupby(["name"])["page_rank"].max()
pr_df = pr_df.to_frame()

# Add PageRank to df

df["page_rank"] = pr_df["page_rank"].values

## Analysis

In [53]:
df.describe()

| | pop_5 | degree_centrality | bet_centrality | page_rank |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 468471.8 | 4.16 | 2505.701251 | 1.014235 |
| std | 228075.111885 | 1.251285 | 1600.224563 | 0.010822 |
| min | 152632.0 | 2.0 | 179.812881 | 1.003167 |
| 25% | 305343.0 | 3.0 | 1142.5775 | 1.00557 |
| 50% | 447007.5 | 4.0 | 2435.005109 | 1.011688 |
| 75% | 543847.25 | 5.0 | 3570.709287 | 1.016111 |
| max | 989138.0 | 6.0 | 5999.809223 | 1.040071 |


In [54]:
# Create values for Downtown Berkeley

# df is indexed by station name, so a label lookup returns the scalar directly
berk_pop_5 = df.loc["Downtown Berkeley", "pop_5"]
berk_deg_cent = df.loc["Downtown Berkeley", "degree_centrality"]
berk_bet_cent = df.loc["Downtown Berkeley", "bet_centrality"]
berk_page_rank = df.loc["Downtown Berkeley", "page_rank"]

### Which stations have higher values on all measures than Downtown Berkeley?

Only three stations score higher than Downtown Berkeley on all measures: Bay Fair, Coliseum, and MacArthur. All three are East Bay stations and are relatively close to one another. It would make sense to open a pickup location near one of them, most likely Coliseum, given its proximity to the Oakland airport (OAK) and its location between Bay Fair and MacArthur.

Potential customers could pick up meals not only on their way home from work, but also on their way home from the airport. This could be especially appealing to people who travel frequently for work, such as consultants. Additionally, these stations are located in densely populated areas. Given our record during the POC with Peak Deliveries, we could fall back on delivery, using the new store as the base, if customers did not pick up meals from our Coliseum location at the rate we foresee.

In [55]:
# Generate a data frame with values greater than Downtown Berkeley

berk_df = df[(df["pop_5"] > berk_pop_5) &
        (df["degree_centrality"] > berk_deg_cent) &
        (df["bet_centrality"] > berk_bet_cent) & 
        (df["page_rank"] > berk_page_rank)]
berk_df

| station | pop_5 | degree_centrality | bet_centrality | page_rank |
|---|---|---|---|---|
| Bay Fair | 457901.0 | 5.0 | 3348.740208 | 1.013135 |
| Coliseum | 495903.0 | 6.0 | 4306.942363 | 1.012288 |
| MacArthur | 524629.0 | 5.0 | 5999.809223 | 1.01315 |


### Which locations would help us to expand further from the East Bay?

Looking for stations that have relatively low betweenness centrality but dense populations will help us to expand geographically further from the East Bay while still ensuring that we have an addressable market. The Berryessa station in particular shows promise given its role as a gateway to the South Bay.

BART commuters traveling between the East Bay and the South Bay must pass through Berryessa. Additionally, this is a relatively long commute, so the convenience of being able to pick up meals on the way home will likely appeal to these potential customers.

In [22]:
edges_df = df[(df["bet_centrality"] < df["bet_centrality"].quantile(0.25)) &
            (df["pop_5"] > df["pop_5"].quantile(0.75))]
edges_df

| station | pop_5 | degree_centrality | bet_centrality | page_rank |
|---|---|---|---|---|
| Berryessa | 559010.0 | 3.0 | 179.812881 | 1.003167 |


### Which location(s) have the densest surrounding populations?

The station with the largest surrounding population (within 5 miles) of all BART stations is 24th Street Mission. However, fewer paths pass through it than through Powell Street, which has a higher betweenness centrality.

Opening a pickup location near the Powell Street station would help us to capture potential customers within San Francisco, even if those customers are not commuters. Depending on the success of that location, we could then assess whether opening another pickup location near the 24th Street Mission station makes financial sense.

In [56]:
df.sort_values("pop_5", ascending=False).head(5)

| station | pop_5 | degree_centrality | bet_centrality | page_rank |
|---|---|---|---|---|
| 24th Street Mission | 989138.0 | 6.0 | 2829.403538 | 1.003696 |
| Glen Park | 986074.0 | 6.0 | 2637.248955 | 1.003709 |
| Balboa Park | 936912.0 | 6.0 | 2437.338289 | 1.005317 |
| Powell Street | 870044.0 | 6.0 | 3339.4838 | 1.003696 |
| Civic Center | 870044.0 | 6.0 | 3180.147417 | 1.003696 |
