# Where to Go When You Want to Eat Pizza?

## Data Science Capstone - IBM Data Science Professional Certificate on Coursera


## Introduction: problem identification
Let's say you have never been to the US and you want to eat nothing but pizza while you are there, so you want to stay somewhere with a high density of pizza places around you. The problem we aim to solve is to analyze the locations of pizza places in major US cities and find the best spot for our tourist to enjoy some good pizza tourism.

## Data section
I will use the Foursquare API to collect the locations of pizza places in five major US cities: New York, NY; San Francisco, CA; Jersey City, NJ; Boston, MA; and Chicago, IL. These include some of the most populous US cities, and I am hopeful that they will contain the best pizza places in the US.

## Import necessary libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis; pd.json_normalize (used below) flattens JSON into a dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import requests # library to handle requests
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## Use Foursquare credentials to find the places

In [2]:
CLIENT_ID = 'HBCPIHXZLD3ZFDBL5PX5D5UZVF0R2O341HRVSYP0PQ0CGNXA' # your Foursquare ID
CLIENT_SECRET = 'AZTDWXC244KCGRSSLHC1JZEA3BPY4VD1NSBTO1GQKLTNSLKU' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)

Your credentials:
CLIENT_ID: HBCPIHXZLD3ZFDBL5PX5D5UZVF0R2O341HRVSYP0PQ0CGNXA
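
Hard-coding the client ID and secret in a notebook exposes them to anyone who reads it. Here is a minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are hypothetical and not part of the Foursquare API:

```python
import os


def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are illustrative assumptions, not Foursquare conventions.
    """
    env = os.environ if env is None else env
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


def mask(secret, visible=4):
    """Keep only the first few characters so a credential can be printed safely."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)
```

With the variables set, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would replace the assignments in the cell above, and `print(mask(CLIENT_ID))` would confirm which key is loaded without revealing it.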


In [3]:
LIMIT = 500 # requested number of results; the API returns at most 100
cities = ["New York, NY", 'Chicago, IL', 'San Francisco, CA', 'Jersey City, NJ', 'Boston, MA']
results = {}
for city in cities:
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&near={}&limit={}&categoryId={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        city,
        LIMIT,
        "4bf58dd8d48988d1ca941735") # PIZZA PLACE CATEGORY ID
    results[city] = requests.get(url).json()
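
The request above interpolates the city name straight into the query string. Here is a sketch of building the same query with the standard library's `urllib.parse.urlencode`, which percent-encodes the spaces and commas explicitly. `build_explore_url` is a hypothetical helper, and the credential values in the usage note are placeholders:

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"
PIZZA_CATEGORY_ID = "4bf58dd8d48988d1ca941735"  # pizza place category, as in the cell above


def build_explore_url(client_id, client_secret, version, city, limit):
    """Build the explore URL with every query parameter percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "near": city,
        "limit": limit,
        "categoryId": PIZZA_CATEGORY_ID,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"
```

For example, `build_explore_url("MY_ID", "MY_SECRET", "20180605", "New York, NY", 100)` encodes the city as `near=New+York%2C+NY`. `requests.get` also accepts a `params=` dictionary and performs equivalent encoding itself.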

In [4]:
df_venues={}
for city in cities:
    venues = pd.json_normalize(results[city]['response']['groups'][0]['items'])
    df_venues[city] = venues[['venue.name', 'venue.location.address', 'venue.location.lat', 'venue.location.lng']]
    df_venues[city].columns = ['Name', 'Address', 'Lat', 'Lng']

The Foursquare API only gives us up to 100 venues per city.

Let's first inspect their densities visually.

In [5]:
maps = {}
for city in cities:
    city_lat = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    city_lng = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[city] = folium.Map(location=[city_lat, city_lng], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(df_venues[city]['Lat'], df_venues[city]['Lng'], df_venues[city]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[city])  
    print(f"Total number of pizza places in {city} = ", results[city]['response']['totalResults'])
    print("Showing Top 100")

Total number of pizza places in New York, NY =  289
Showing Top 100
Total number of pizza places in Chicago, IL =  220
Showing Top 100
Total number of pizza places in San Francisco, CA =  166
Showing Top 100
Total number of pizza places in Jersey City, NJ =  123
Showing Top 100
Total number of pizza places in Boston, MA =  184
Showing Top 100


In [6]:
maps[cities[0]]

In [7]:
maps[cities[1]]

In [8]:
maps[cities[2]]

In [9]:
maps[cities[3]]

In [10]:
maps[cities[4]]

We can see that New York and Jersey City have the densest clusters of pizza places. Better still, the two cities are right across the water from each other.

However, let's put a concrete measure on this density.

For this I will use some basic statistics. I will compute the mean location of the pizza places in each city; if the venues are densely packed, most of them will lie close to this point, and if they are spread out, they will lie far from it.

Next, I will take the average distance from the venues to the mean coordinates.
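
Written out, with $p_i = (\text{lat}_i, \text{lng}_i)$ denoting the coordinates of the $i$-th of $n$ venues in a city (these symbols are introduced here for illustration only), the score computed in the next cell is:

$$\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i, \qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} \lVert p_i - \bar{p} \rVert_2$$

A smaller $\bar{d}$ means the venues are packed more tightly around their mean location. Because the coordinates are in degrees, $\bar{d}$ is in degrees as well.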

In [11]:
maps = {}
for city in cities:
    city_lat = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    city_lng = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[city] = folium.Map(location=[city_lat, city_lng], zoom_start=11)
    venues_mean_coor = [df_venues[city]['Lat'].mean(), df_venues[city]['Lng'].mean()] 
    # add markers to map
    for lat, lng, label in zip(df_venues[city]['Lat'], df_venues[city]['Lng'], df_venues[city]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[city])
        folium.PolyLine([venues_mean_coor, [lat, lng]], color="green", weight=1.5, opacity=0.5).add_to(maps[city])
    
    label = folium.Popup("Mean Co-ordinate", parse_html=True)
    folium.CircleMarker(
        venues_mean_coor,
        radius=10,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(maps[city])

    print(city)
    print("Mean Distance from Mean coordinates")
    # Euclidean distance on raw (lat, lng) pairs, so the result is in degrees; smaller means denser
    print(np.mean(np.apply_along_axis(lambda x: np.linalg.norm(x - venues_mean_coor),1,df_venues[city][['Lat','Lng']].values)))

New York, NY
Mean Distance from Mean coordinates
0.02302771699157773
Chicago, IL
Mean Distance from Mean coordinates
0.06104907789221913
San Francisco, CA
Mean Distance from Mean coordinates
0.028457270934640372
Jersey City, NJ
Mean Distance from Mean coordinates
0.020002106838024277
Boston, MA
Mean Distance from Mean coordinates
0.035652470366179606
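
The scores above are Euclidean distances computed directly on latitude and longitude, so they are in degrees, and away from the equator a degree of longitude covers less ground than a degree of latitude. Here is a minimal sketch of the same score in kilometres using the haversine formula. `haversine_km` and `mean_distance_to_centroid_km` are hypothetical helpers, the Earth is approximated as a sphere of radius 6371 km, and this is an alternative to the notebook's calculation rather than a reproduction of it:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_distance_to_centroid_km(points):
    """Average haversine distance from each (lat, lng) point to the mean coordinate."""
    mean_lat = sum(lat for lat, _ in points) / len(points)
    mean_lng = sum(lng for _, lng in points) / len(points)
    total = sum(haversine_km(lat, lng, mean_lat, mean_lng) for lat, lng in points)
    return total / len(points)
```

With the notebook's data, this could be called as `mean_distance_to_centroid_km(list(zip(df_venues[city]['Lat'], df_venues[city]['Lng'])))` for each city.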


In [12]:
maps[cities[0]]

In [13]:
maps[cities[1]]

In [14]:
maps[cities[2]]

In [15]:
maps[cities[3]]

In [16]:
maps[cities[4]]

By this measure, where a smaller value means a denser cluster, Jersey City comes out as his best option (about 0.0200), with New York a close second (about 0.0230). As a plus, the two cities sit on opposite shores of the same river. Our tourist's best interest would be to book a hotel near the mean coordinate to surround himself with the 100 pizza places there!

Another observation is that Jersey City has one pizza place that is really far away from the rest, which possibly inflates its score. So let's remove it and calculate the score again.

In [17]:
city = 'Jersey City, NJ'
venues_mean_coor = [df_venues[city]['Lat'].mean(), df_venues[city]['Lng'].mean()] 

print(city)
print("Mean Distance from Mean coordinates")
dists = np.apply_along_axis(lambda x: np.linalg.norm(x - venues_mean_coor),1,df_venues[city][['Lat','Lng']].values)
dists.sort()
print(np.mean(dists[:-1]))  # ignore the largest distance

Jersey City, NJ
Mean Distance from Mean coordinates
0.019607008271043076
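
The cell above drops the largest distance but still measures from a mean coordinate that was computed with the outlier included. Here is a sketch of a variation that recomputes the mean coordinate after removing the farthest venues. `centroid` and `mean_distance_without_farthest` are hypothetical helpers working on plain `(lat, lng)` tuples in degrees, and this is not the calculation that produced the number above:

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (lat, lng) tuples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mean_distance_without_farthest(points, drop=1):
    """Mean distance to the centroid after discarding the `drop` farthest points.

    Distances are Euclidean on raw coordinates, matching the notebook's metric.
    """
    if not 0 <= drop < len(points):
        raise ValueError("drop must be between 0 and len(points) - 1")
    c = centroid(points)
    # Rank points by distance to the original centroid, nearest first.
    ranked = sorted(points, key=lambda p: math.hypot(p[0] - c[0], p[1] - c[1]))
    kept = ranked[:len(ranked) - drop]
    # Recompute the centroid so the discarded points no longer pull on it.
    c_kept = centroid(kept)
    return sum(math.hypot(p[0] - c_kept[0], p[1] - c_kept[1]) for p in kept) / len(kept)
```

With the notebook's data, the input would be `list(zip(df_venues[city]['Lat'], df_venues[city]['Lng']))`.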


With the outlier removed, Jersey City's mean distance drops to about 0.0196, the lowest of the five cities, which makes our tourist happy.

Happy Pizza-lover!!