# Capstone Project - The Battle of the Neighborhoods

#### Applied Data Science Capstone by IBM/Coursera

## Which is the Best U.S. City to Visit for Culture?

### Table of contents ###
##### 1. Introduction
##### 2. Data Section
##### 3. Methodology
##### 4. Results
##### 5. Discussion
##### 6. Conclusion

### 1. Introduction

Many factors go into making a great cultural city. Living in a city that values art and entertainment matters to locals, and it is also a main attraction for tourists. Excellent art galleries and museums are a good starting point. Then there need to be venues for theater and music of all types, from symphony halls to jazz clubs. Whether you are cheering for your favorite team or dancing to your favorite artists, there’s an activity for everyone.

Let’s assume now that you want to move or travel from abroad to the U.S. and you want to find yourself in a city with a high density of arts and entertainment places around you. The purpose of this project is to explore the Arts and Entertainment locations in five major U.S. cities and find the city which best meets your varied cultural, entertainment, and recreational interests.

### 2. Data Section

The Arts and Entertainment category includes a wide range of establishments that operate facilities or provide services to meet varied cultural, entertainment, and recreational interests of their patrons. This category comprises (1) establishments that are involved in producing, promoting, or participating in live performances, events, or exhibits intended for public viewing; (2) establishments that preserve and exhibit objects and sites of historical, cultural, or educational interest; and (3) establishments that operate facilities or provide services that enable patrons to participate in recreational activities or pursue amusement, hobby, and leisure-time interests. 

I use the Foursquare API to collect data on the Art and Entertainment locations in five major U.S. cities: New York, NY; San Francisco, CA; Jersey City, NJ; Boston, MA; and Chicago, IL. These are among the more populous U.S. cities, and I expect them to contain some of the best Art and Entertainment places in the country.

### 3. Methodology

The main purpose of this project is to find the city with the highest Art and Entertainment locations density.

I use the venues endpoint of the Foursquare API with the near parameter to get the venues in each city, and the categoryId parameter to restrict the results to Art & Entertainment places.

An example of such a request is: https://api.foursquare.com/v2/venues/explore?client_id=&client_secret=&v=20200427&near=New+York,+NY&limit=100&categoryId=4d4b7104d754a06370d81259

Here 4d4b7104d754a06370d81259 is the ID of the Art & Entertainment category, and Foursquare returns at most 100 venues per query.
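As a rough illustration of how such a request URL can be assembled, the sketch below builds the query string with the standard library so that city names with spaces and commas are encoded correctly. The helper name `build_explore_url` and the placeholder credentials are hypothetical; only the endpoint, the parameter names, and the category ID come from the example above.

```python
from urllib.parse import urlencode

# Placeholder credentials (assumption): real values come from a Foursquare developer account.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"

ARTS_CATEGORY_ID = "4d4b7104d754a06370d81259"  # Art & Entertainment category ID


def build_explore_url(city, limit=100, category_id=ARTS_CATEGORY_ID):
    """Return a venues/explore URL for one city (hypothetical helper)."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": "20200427",
        "near": city,
        "limit": limit,
        "categoryId": category_id,
    }
    # urlencode escapes spaces and commas in the city name
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


url = build_explore_url("New York, NY")
print(url)
```

Building the query from a dictionary avoids hand-written mistakes such as a missing `near=` key.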

I repeat this request for each of the five cities to get their top 100 venues. I then keep only the venue data from the results and plot it on a map for visual inspection.

Afterwards, I build an indicator of the density of Art & Entertainment locations. First, I calculate the center coordinate of the venues, that is, the mean latitude and longitude. Then I calculate the mean of the Euclidean distances from each venue to that center. This mean distance to the mean coordinate (MDMC) is my indicator: the lower it is, the more tightly the venues are clustered.
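The indicator can be sketched on a handful of made-up points before applying it to real venues. The function name and the toy coordinates below are illustrative assumptions; the calculation follows the description above: take the mean coordinate, then average the Euclidean distances to it.

```python
import numpy as np


def mean_distance_to_mean_coordinate(coords):
    """Return (center, MDMC) for a sequence of (lat, lng) pairs."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)                         # mean latitude and longitude
    distances = np.linalg.norm(coords - center, axis=1)  # Euclidean distance per venue
    return center, distances.mean()


# Toy data (not real venues): the four corners of a 2 x 2 square
toy = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
center, mdmc = mean_distance_to_mean_coordinate(toy)
print(center, mdmc)
```

For the square, the center is (1, 1) and every corner lies at distance sqrt(2) from it, so the MDMC is sqrt(2) as well.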

### 4. Results

Let's import the libraries.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from IPython.display import display_html
from pandas import json_normalize # transform JSON into a pandas DataFrame

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


In [2]:
CLIENT_ID = 'PACCZALESBFXG4ZVMWO3LGWE1ST5MKTEBEZZ2BEHGTBOPHM4'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # keep the real secret out of shared notebooks
VERSION = '20200427' # Foursquare API version

print('Panagiotis Mouzoukos')
print('CLIENT_ID: ' + CLIENT_ID)

Panagiotis Mouzoukos
CLIENT_ID: PACCZALESBFXG4ZVMWO3LGWE1ST5MKTEBEZZ2BEHGTBOPHM4


In [3]:
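LIMIT = 100 # Foursquare returns at most 100 venues per query
CATEGORY_ID = "4d4b7104d754a06370d81259" # ARTS & ENTERTAINMENT CATEGORY ID
cities = ["New York, NY", 'Chicago, IL', 'San Francisco, CA', 'Jersey City, NJ', 'Boston, MA']
results = {}
for city in cities:
    # Pass the query parameters as a dict so that requests URL-encodes the city names
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'near': city,
        'limit': LIMIT,
        'categoryId': CATEGORY_ID,
    }
    response = requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
    response.raise_for_status() # stop early on an HTTP error
    results[city] = response.json()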

In [4]:
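df_venues = {}
for city in cities:
    # Flatten the nested JSON items into one row per venue
    venues = json_normalize(results[city]['response']['groups'][0]['items'])
    # Copy the selected columns so that renaming does not act on a view of `venues`
    df_venues[city] = venues[['venue.name', 'venue.location.address', 'venue.location.lat', 'venue.location.lng']].copy()
    df_venues[city].columns = ['Name', 'Address', 'Lat', 'Lng']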
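To make the flattening step easier to follow, here is a small sketch that runs the same idea on a mock payload. The venues in `mock_response` are invented for illustration and only mimic the part of the response structure used in the cell above; a real response contains many more fields.

```python
import pandas as pd

# Mock payload with made-up venues, shaped like the nested items used above
mock_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Museum",
                               "location": {"address": "1 Example St", "lat": 40.0, "lng": -74.0}}},
                    {"venue": {"name": "Example Theater",
                               "location": {"lat": 40.1, "lng": -74.1}}},  # no address given
                ]
            }
        ]
    }
}

items = mock_response["response"]["groups"][0]["items"]
flat = pd.json_normalize(items)  # nested keys become dotted column names

# Keep the four columns of interest and give them short names
columns = {
    "venue.name": "Name",
    "venue.location.address": "Address",
    "venue.location.lat": "Lat",
    "venue.location.lng": "Lng",
}
venues = flat[list(columns)].rename(columns=columns)
print(venues)
```

A venue without an address simply gets a missing value in the `Address` column, so the latitude and longitude used later are unaffected.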

In [18]:
maps = {}
for city in cities:
    city_lat = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    city_lng = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[city] = folium.Map(location=[city_lat, city_lng], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(df_venues[city]['Lat'], df_venues[city]['Lng'], df_venues[city]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[city])  
    print(f"Total number of Arts & Entertainment locations in {city} = ", results[city]['response']['totalResults'])
    print("Showing Top 100")

Total number of Arts & Entertainment locations in New York, NY =  280
Showing Top 100
Total number of Arts & Entertainment locations in Chicago, IL =  244
Showing Top 100
Total number of Arts & Entertainment locations in San Francisco, CA =  232
Showing Top 100
Total number of Arts & Entertainment locations in Jersey City, NJ =  58
Showing Top 100
Total number of Arts & Entertainment locations in Boston, MA =  172
Showing Top 100


All five cities have many Art & Entertainment locations, and probably more than Foursquare records. The totals above show that New York, NY has more Art and Entertainment locations (280) than the other cities. The following city maps were generated with folium:

In [6]:
maps[cities[0]]

In [7]:
maps[cities[1]]

In [8]:
maps[cities[2]]

In [9]:
maps[cities[3]]

In [10]:
maps[cities[4]]

In the next step, I calculated the mean coordinate of the venues in each city and the mean distance to the mean coordinate (MDMC).

In [11]:
maps = {}
for city in cities:
    city_lat = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    city_lng = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[city] = folium.Map(location=[city_lat, city_lng], zoom_start=11)
    venues_mean_coor = [df_venues[city]['Lat'].mean(), df_venues[city]['Lng'].mean()] 
    # add markers to map
    for lat, lng, label in zip(df_venues[city]['Lat'], df_venues[city]['Lng'], df_venues[city]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[city])
        folium.PolyLine([venues_mean_coor, [lat, lng]], color="green", weight=1.5, opacity=0.5).add_to(maps[city])
    
    label = folium.Popup("Mean Co-ordinate", parse_html=True)
    folium.CircleMarker(
        venues_mean_coor,
        radius=10,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(maps[city])

    print(city)
    print("Mean Distance from Mean coordinates")
    print(np.mean(np.apply_along_axis(lambda x: np.linalg.norm(x - venues_mean_coor),1,df_venues[city][['Lat','Lng']].values)))

New York, NY
Mean Distance from Mean coordinates
0.020300833123245213
Chicago, IL
Mean Distance from Mean coordinates
0.042248519554010995
San Francisco, CA
Mean Distance from Mean coordinates
0.023756444877047363
Jersey City, NJ
Mean Distance from Mean coordinates
0.016686012999399178
Boston, MA
Mean Distance from Mean coordinates
0.022279557094590566
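The values above are expressed in degrees, because the Euclidean distance is taken directly on latitude and longitude. Away from the equator a degree of longitude covers less ground than a degree of latitude, so a distance in kilometres may be easier to compare across cities. The sketch below is one possible variant, not the method used in this notebook: it swaps in the haversine great-circle distance, and the function names and the Earth radius constant are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mdmc_km(coords):
    """Mean haversine distance (km) from each point to the mean coordinate."""
    mean_lat = sum(lat for lat, _ in coords) / len(coords)
    mean_lng = sum(lng for _, lng in coords) / len(coords)
    return sum(haversine_km(lat, lng, mean_lat, mean_lng) for lat, lng in coords) / len(coords)


# One degree of latitude versus one degree of longitude at 60 degrees north
print(haversine_km(0.0, 0.0, 1.0, 0.0))
print(haversine_km(60.0, 0.0, 60.0, 1.0))
```

With the venue tables from this notebook, `mdmc_km(list(zip(df['Lat'], df['Lng'])))` would give a per-city value in kilometres.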


On the city maps below, the large green circle marks the mean coordinate, and each green line connects a venue to it. The MDMC is the average length of these lines.

In [12]:
maps[cities[0]]

In [13]:
maps[cities[1]]

In [14]:
maps[cities[2]]

In [15]:
maps[cities[3]]

In [16]:
maps[cities[4]]

### 5. Discussion

New York, NY has more Art and Entertainment locations than the other cities, but they are not as close to each other as they are in Jersey City, NJ. That is why the MDMC of New York (0.0203) is higher than that of Jersey City (0.0167).

Another observation is that several of New York's Art and Entertainment locations lie far from the mean coordinate, which may inflate its MDMC.

Let's try to remove some of the furthest Art and Entertainment locations and calculate MDMC of New York, NY again.

In [17]:
city = 'New York, NY'
venues_mean_coor = [df_venues[city]['Lat'].mean(), df_venues[city]['Lng'].mean()] 

print(city)
print("Mean Distance from Mean coordinates")
dists = np.apply_along_axis(lambda x: np.linalg.norm(x - venues_mean_coor),1,df_venues[city][['Lat','Lng']].values)
dists.sort()
print(np.mean(dists[:-15])) # Ignore the 15 largest distances

New York, NY
Mean Distance from Mean coordinates
0.016266849525873532


The new MDMC of New York, NY city is: 0.016266849525873532.
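The trimming step can be written as a small reusable function, which makes it straightforward to apply the same cut to every city. This is a sketch under assumptions: the name `trimmed_mdmc` and the toy points are invented, and, as in the cell above, the mean coordinate is computed from all points and is not recomputed after the furthest ones are dropped.

```python
import math

import numpy as np


def trimmed_mdmc(coords, n_drop):
    """MDMC after ignoring the n_drop points furthest from the mean coordinate."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)  # computed on all points, as in the cell above
    distances = np.sort(np.linalg.norm(coords - center, axis=1))
    kept = distances[: len(distances) - n_drop]  # also works when n_drop is 0
    return kept.mean()


# Toy data (not real venues): a tight square plus one distant outlier
toy = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (11.0, 11.0)]
print(trimmed_mdmc(toy, 0))  # outlier included
print(trimmed_mdmc(toy, 1))  # outlier ignored
```

Dropping the outlier lowers the toy MDMC, which mirrors the effect seen for New York.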

After removing the 15 venues furthest from the mean, New York's MDMC (0.0163) falls just below Jersey City's untrimmed value (0.0167). On this adjusted measure, New York is not only the city with the most Art and Entertainment locations among the five, but also the one with the highest concentration of them.

One further refinement would be to move the location of the Foursquare API query across each city until all of its Art and Entertainment locations are collected, and then repeat the calculations.
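One way this refinement might be organized is to tile a city's bounding box with a grid of query points and merge the results. The sketch below only covers the bookkeeping and makes no API calls. The helper names are hypothetical, and it assumes that each venue record carries a unique `id` field that can be used to discard duplicates returned by neighbouring queries.

```python
def grid_points(south, west, north, east, rows, cols):
    """Return the centers of a rows x cols grid covering the bounding box."""
    lat_step = (north - south) / rows
    lng_step = (east - west) / cols
    return [
        (south + (r + 0.5) * lat_step, west + (c + 0.5) * lng_step)
        for r in range(rows)
        for c in range(cols)
    ]


def dedupe_by_id(batches):
    """Merge several result batches, keeping the first record seen for each venue id."""
    seen = {}
    for batch in batches:
        for venue in batch:
            seen.setdefault(venue["id"], venue)
    return list(seen.values())


# Toy bounding box and made-up venue records
points = grid_points(0.0, 0.0, 2.0, 2.0, rows=2, cols=2)
merged = dedupe_by_id([[{"id": "a"}, {"id": "b"}], [{"id": "b"}, {"id": "c"}]])
print(points)
print([venue["id"] for venue in merged])
```

Each grid point would then serve as the center of one query, and the merged, de-duplicated venues would feed the same MDMC calculation as before.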

### 6. Conclusion

New York, NY has more Art & Entertainment locations than the other four cities. At the same time, Jersey City, NJ has the lowest MDMC before any trimming, so it could be characterized as the most convenient location, with a higher concentration of Art & Entertainment locations.

If you are planning your next cultural trip, or you want to live in a city that values art and entertainment, the cultural capital of America and one of the great cultural cities of the world is New York, NY, which is only a bridge or tunnel away from Jersey City, NJ.