# Project 1, Part 6, Best Recommendation

University of California, Berkeley
Master of Information and Data Science (MIDS) program
w205 - Fundamentals of Data Engineering

Student: Landon Morin

Year: 2022

Semester: Spring

Section: 9


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [65]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter
import matplotlib.ticker as mticker

import json

import os

import gmaps
import gmaps.geojson_geometries

from geographiclib.geodesic import Geodesic




import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  Remember you can use any code from the labs.

In [66]:
# pandas converts numeric values from postgres to float;
# if every value in a column is a whole number, change the column to integer

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    """Run a select query and return the rows in a pandas dataframe."""
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # convert float columns that hold only whole numbers (or NaN) to nullable Int64
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True
                        break  # one fractional value is enough to keep the column as float

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return df

In [67]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

In [68]:
cursor = connection.cursor()

## The executives would like your best recommendation for the business. 

## Create an executive summary giving and explaining your best recommendation for the business. 

## You must support your summary with data, in the form of output of queries, data visualization, etc. There is a 1 query minimum.

In [69]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select c.customer_id,
       s.city as store,
       z.density, 
       z.longitude, 
       z.latitude, 
       z.zip
from stores as s
    join customers as c
        on s.store_id = c.closest_store_id
    join zip_codes as z
        on c.zip = z.zip
group by c.customer_id, z.zip, s.city, z.longitude, z.latitude, z.density
order by c.customer_id

"""

df1=my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df1

Unnamed: 0,customer_id,store,density,longitude,latitude,zip
0,1,Berkeley,12056.40,-122.2643,37.8343,94609
1,2,Berkeley,12056.40,-122.2643,37.8343,94609
2,3,Berkeley,12056.40,-122.2643,37.8343,94609
3,4,Berkeley,12056.40,-122.2643,37.8343,94609
4,5,Berkeley,12056.40,-122.2643,37.8343,94609
...,...,...,...,...,...,...
31077,31078,Nashville,220.15,-86.9277,35.8116,37179
31078,31079,Nashville,220.15,-86.9277,35.8116,37179
31079,31080,Nashville,220.15,-86.9277,35.8116,37179
31080,31081,Nashville,220.15,-86.9277,35.8116,37179


In [70]:
min_value = df1['density'].min()
max_value = df1['density'].max()

bins = np.linspace(min_value,max_value,10)

df1['bins'] = pd.cut(df1['density'], bins=bins, include_lowest=True)
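
The equal-width density bins can also be summarized per store, which makes it easier to see where each store's customers fall on the density scale. The following is a minimal sketch, not part of the original analysis: the `sample` dataframe and the helper `customers_per_density_bin` are hypothetical, and only the two density values 12056.40 and 220.15 are taken from the query output above. It assumes a dataframe shaped like `df1` with `store` and `density` columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like df1; only 12056.40 and 220.15 come from the real output.
sample = pd.DataFrame({
    "store": ["Berkeley", "Berkeley", "Berkeley", "Nashville", "Nashville"],
    "density": [12056.40, 12056.40, 850.0, 220.15, 3100.0],
})

def customers_per_density_bin(df, n_edges=10):
    """Count customers per store within equal-width density bins.

    n_edges bin edges produce n_edges - 1 bins, matching np.linspace(min, max, 10).
    """
    edges = np.linspace(df["density"].min(), df["density"].max(), n_edges)
    binned = df.assign(bins=pd.cut(df["density"], bins=edges, include_lowest=True))
    counts = (
        binned.groupby(["store", "bins"], observed=True)  # skip empty bins
        .size()
        .rename("customers")
        .reset_index()
    )
    return counts

counts = customers_per_density_bin(sample)
print(counts)
```

With the real data, the same call on `df1[['store', 'density']]` would return one row per store and occupied density bin.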



In [71]:
df1

Unnamed: 0,customer_id,store,density,longitude,latitude,zip,bins
0,1,Berkeley,12056.40,-122.2643,37.8343,94609,"(11350.398, 17025.467]"
1,2,Berkeley,12056.40,-122.2643,37.8343,94609,"(11350.398, 17025.467]"
2,3,Berkeley,12056.40,-122.2643,37.8343,94609,"(11350.398, 17025.467]"
3,4,Berkeley,12056.40,-122.2643,37.8343,94609,"(11350.398, 17025.467]"
4,5,Berkeley,12056.40,-122.2643,37.8343,94609,"(11350.398, 17025.467]"
...,...,...,...,...,...,...,...
31077,31078,Nashville,220.15,-86.9277,35.8116,37179,"(0.259, 5675.329]"
31078,31079,Nashville,220.15,-86.9277,35.8116,37179,"(0.259, 5675.329]"
31079,31080,Nashville,220.15,-86.9277,35.8116,37179,"(0.259, 5675.329]"
31080,31081,Nashville,220.15,-86.9277,35.8116,37179,"(0.259, 5675.329]"


In [72]:
# read the API key and strip any trailing newline so the key is passed cleanly
with open('gmap_api_key.txt', 'r') as f:
    my_api_key = f.read().strip()

gmaps.configure(api_key=my_api_key)

In [73]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select z.latitude, z.longitude
from customers as cu
     join zip_codes as z
         on cu.zip = z.zip
order by 1,2

"""

df2 = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df2

Unnamed: 0,latitude,longitude
0,25.4837,-80.4136
1,25.5021,-80.3997
2,25.5021,-80.3997
3,25.5021,-80.3997
4,25.5302,-80.3919
...,...,...
31077,47.9581,-122.4047
31078,47.9581,-122.4047
31079,47.9581,-122.4047
31080,47.9581,-122.4047


In [74]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select latitude, longitude
from stores
order by 1

"""

df3 = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df_markers = df3[['latitude','longitude']]

In [75]:
with open('geojson_data/customer_zip_geojson.json') as f:
    customer_zip_geojson = json.load(f)

kansas = (38.4942,-98.3132)

fig = gmaps.figure(center=kansas, zoom_level=4, map_type = 'HYBRID')
          
geojson_layer = gmaps.geojson_layer(customer_zip_geojson)

heatmap_layer = gmaps.heatmap_layer(df2)

marker_layer = gmaps.marker_layer(df_markers)

fig.add_layer(marker_layer)
fig.add_layer(heatmap_layer)
fig.add_layer(geojson_layer)

fig

Figure(layout=FigureLayout(height='420px'))

# Business Considerations and Action Plan


The previous documents contain a detailed analysis of AGM sales, AGM's customer base, and performance by store. Key highlights of these analyses are: 
1. August, October, and March are the months with the strongest sales. 
2. The vast majority of customers live within 15 miles of their closest store. 
3. Most customers who make frequent purchases and who drive the most revenue live within six miles of their closest store. 
4. Spending activity is positively correlated with the density and population of a customer's zip code. 
5. Major holidays mostly correspond with higher-than-average spending in the week leading up to the holiday, but day-of-holiday sales are weak. 

These analyses support the hypothesis that proximity to a store is the most important factor in increasing customer engagement and sales. Furthermore, we should target customers in higher-density, more populous zip codes. Higher densities and larger populations increase the potency of the network effect: targeting customers in denser areas increases the chances of our brand spreading by word of mouth.

Therefore, we propose the following to the executive committee: 
1. Provide delivery services or drop locations in the outer zip codes surrounding our stores. The map above shows the density of customers in each zip code and their proximity to their nearest store. We can refer to this map when determining efficient drop zones or delivery routes for our drivers. 
2. On holidays and in our busiest months, create pop-up stores on the outskirts of each store's core area. Target zip codes that are between 5 and 15 miles from the store. These zip codes contain a large portion of our known customers, and additional touchpoints closer to their homes will likely help us retain these customers and encourage them to visit our stores. Few customers live more than 15 miles from a store, and we may not wish to spend money targeting customers who are unlikely to visit the main store frequently. 
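
The 5 to 15 mile band in the second proposal can be sketched as a simple distance classification. This is an illustrative sketch under stated assumptions, not the method used in the earlier parts of the project. It uses the haversine great-circle formula to stay dependency-free, rather than the `Geodesic` class imported above. The helper names, the band labels, and the store coordinates are hypothetical; the zip code centroid for 94609 is taken from the query output.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def distance_band(miles, inner=5.0, outer=15.0):
    """Label a zip code by its distance from the store (labels are hypothetical)."""
    if miles < inner:
        return "core"
    if miles <= outer:
        return "pop-up target"
    return "beyond reach"

# Hypothetical store location; replace with latitude/longitude from the stores table.
store = (37.8716, -122.2727)
# Centroid of zip code 94609, taken from the query output above.
zip_centroid = (37.8343, -122.2643)

miles = haversine_miles(*store, *zip_centroid)
print(round(miles, 2), distance_band(miles))
```

Applying `distance_band` to every customer zip code would give a list of candidate pop-up zip codes per store. The 5 and 15 mile cutoffs are parameters, so the bands are easy to adjust.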
