# Clustering of the Neighborhoods of Cities Popular in Media
Heejoon Ahn  
February 20, 2021

## 1. Introduction and Business Problem

London and New York have recently become popular tourist destinations for fans of many kinds of media. People travel to London not only for sightseeing but also to visit famous locations from novels, movies, and television shows. Popular examples include the original *Sherlock Holmes* stories, the television adaptation *Sherlock*, *Doctor Who*, *Harry Potter*, and many more. Given the widespread fame of *Sherlock Holmes*, *Doctor Who*, and *Harry Potter*, the entertainment industry in England has catered to fans with events such as escape rooms based on these stories, drawing fans from across the globe to London.

People have mostly traveled to New York City for sightseeing in Manhattan, at places such as Times Square. However, the city itself has featured in many stories and movies. The most popular recent series set in New York City are Marvel's *Avengers* and *Spider-Man* series. Earlier movies from the 2000s, such as *The Devil Wears Prada* and the Christmas family movie *Elf*, have also been successful in bringing people to the city.

Given the influence of such media and their fans, the contribution of media to tourism in both cities cannot be discounted. With this in mind, the goal of this project is to help tourists visiting either city choose their destinations based on the experiences and services available in each city and its neighborhoods.

## 2. Data Description

### 2.1 London Data
The data for London comes from two sources. The first is Wikipedia's list of the areas within the city, which we scrape from https://en.wikipedia.org/wiki/List_of_areas_of_London.

This page provides the following sets of information in the relevant table we will be scraping:
1. Neighborhood 
2. Borough
3. Town 
4. Postal Code District

However, the table we scrape lacks latitude and longitude, which we need to generate the maps for our clustering methods. To retrieve them, we use the second source, the **ArcGIS API**. ArcGIS Online allows users to connect data about people and locations with interactive maps, and we use it to look up the latitude and longitude for each entry in this dataset.
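The lookup can be sketched roughly as follows. This is a minimal illustration rather than the project's exact code: the helper names `extract_lat_lon` and `get_lat_lon` and the address format in the usage comment are hypothetical, and the sketch assumes that `geocode` returns a list of candidate dictionaries whose `location` entry holds `x` (longitude) and `y` (latitude).

```python
def extract_lat_lon(candidates):
    """Return (latitude, longitude) from the first geocoding candidate.

    Returns (None, None) when the geocoder found no match.
    """
    if not candidates:
        return None, None
    location = candidates[0]["location"]
    # ArcGIS reports x as longitude and y as latitude
    return location["y"], location["x"]


def get_lat_lon(address):
    """Geocode an address string with ArcGIS (hypothetical helper)."""
    # imported lazily so extract_lat_lon works without the arcgis package
    from arcgis.geocoding import geocode

    return extract_lat_lon(geocode(address))


# Usage (requires the arcgis package and network access):
#   from arcgis.gis import GIS
#   gis = GIS()  # anonymous connection to ArcGIS Online
#   lat, lon = get_lat_lon("Abbey Wood, London, United Kingdom")
```

Separating the parsing step from the network call keeps the coordinate handling easy to check without contacting the service.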

### 2.2 New York Data
The data for New York City comes from previous coursework that introduced clustering methods for New York City. The data retrieval therefore follows the code from the ungraded lab **Segmenting and Clustering Neighborhoods in New York City**. This dataset provides the following information:

1. Neighborhood
2. Borough
3. Latitude
4. Longitude


### 2.3 Foursquare API
To retrieve information about the venues within each neighborhood, we use the **Foursquare API**. Foursquare is a location data provider with information on venue names, locations, menus, and more. For our purposes, we connect to the API to retrieve the list of venues for every neighborhood in each city, using a search radius of 500 meters.

The data retrieved from Foursquare contains the venues within the specified distance of each neighborhood's latitude and longitude. The information obtained per venue is as follows:
1. Neighbourhood : Name of the Neighbourhood
2. Neighbourhood Latitude : Latitude of the Neighbourhood
3. Neighbourhood Longitude : Longitude of the Neighbourhood
4. Venue : Name of the Venue
5. Venue Latitude : Latitude of Venue
6. Venue Longitude : Longitude of Venue
7. Venue Category : Category of Venue
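
As a rough sketch of how such a request and its result could be handled, the code below builds a query URL and flattens a response into the seven fields listed above. It assumes the Foursquare v2 `venues/explore` endpoint and its nested `response -> groups -> items -> venue` layout, as used in the course lab; the function names, the `version` date, the `limit` of 100, and the credentials are placeholders, not values taken from this project.

```python
from urllib.parse import urlencode

# Assumed v2 endpoint; newer Foursquare API versions use different URLs.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lon, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the query URL for venues within `radius` meters of (lat, lon)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lon}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


def parse_venues(name, lat, lon, payload):
    """Flatten one explore response into rows with the seven fields above."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "Neighbourhood": name,
            "Neighbourhood Latitude": lat,
            "Neighbourhood Longitude": lon,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0]["name"] if categories else None,
        })
    return rows
```

In practice the URL would be fetched with `requests.get(url).json()` and the rows from all neighborhoods collected into a single pandas dataframe.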

## 3. Methodology
The first step is to load the packages needed for the analysis, as shown in the code below:

```python
# loading packages
import pandas as pd
import numpy as np
import requests
import json
from geopy.geocoders import Nominatim
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
from arcgis.geocoding import geocode
from arcgis.gis import GIS
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from sklearn.cluster import KMeans
```

The loaded packages serve the following purposes:

* **Pandas**: handle data analysis
* **Numpy**: handle data in a vectorized manner
* **requests**: handle HTTP requests
* **json**: handle JSON files
* **Nominatim**: convert an address into latitude and longitude values
* **json_normalize**: transform a JSON file into a pandas dataframe
* **geocode** and **GIS** (ArcGIS): retrieve latitude and longitude values through the ArcGIS API
* **matplotlib**: supply the color maps and color utilities used when generating the maps in our project
* **folium**: map rendering library
* **KMeans**: the machine learning model that we are using

The main approach was to explore each city individually, plot maps showing its neighborhoods, build the models by clustering similar neighborhoods together, and finally plot new maps showing the clustered neighborhoods. We then discuss the findings.
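
The mapping step in this workflow might look roughly like the sketch below. It is illustrative only: the helper names `map_center` and `plot_neighborhoods`, the marker styling, and the zoom level are assumptions, and the input is taken to be a list of `(name, latitude, longitude)` tuples rather than the project's actual dataframes.

```python
from statistics import mean


def map_center(points):
    """Return [mean latitude, mean longitude] for (name, lat, lon) tuples."""
    return [mean(p[1] for p in points), mean(p[2] for p in points)]


def plot_neighborhoods(points, zoom_start=10):
    """Draw one circle marker per neighborhood on a folium map."""
    import folium  # imported lazily; only needed when a map is drawn

    city_map = folium.Map(location=map_center(points), zoom_start=zoom_start)
    for name, lat, lon in points:
        folium.CircleMarker(
            location=[lat, lon],
            radius=5,
            popup=name,        # neighborhood name shown on click
            color="blue",
            fill=True,
            fill_opacity=0.7,
        ).add_to(city_map)
    return city_map
```

Centering the map on the mean coordinate is a simple default; a fixed city-center coordinate would work equally well.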

### 3.1 Data Collection 
To start the data collection, we first scraped the data required for London and New York City. For London, we required information including postal codes, neighborhoods, and boroughs. Because most of this information is provided on the Wikipedia page, we scraped it using the following steps:

```python
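from io import StringIO  # wraps the HTML text for pd.read_html

london_wiki = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
london_url = requests.get(london_wiki)  # HTTP response for the page
# parse every table on the page, then keep the table of areas (index 1)
london_df = pd.read_html(StringIO(london_url.text))
london_df = london_df[1]
london_df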
```

For New York City, the information was provided in a previous lab in this course, so the data was retrieved in a similar fashion to what was provided in the lab notebook. 

```python
# retrieving NYC data from provided json file
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
# getting the neighborhoods data
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood
# (GeoJSON stores coordinates as [longitude, latitude])
rows = []
for data in neighborhoods_data:
    neighborhood_latlon = data['geometry']['coordinates']
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_latlon[1],
                 'Longitude': neighborhood_latlon[0]})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)

neighborhoods.head()
```

Once that is done, the workflow consists of data cleaning, mapping, and further preparation before we apply the K-means algorithm to cluster the neighborhoods and address our business problem.
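
To make the clustering step more concrete, here is one possible sketch. The text does not specify the feature representation at this point, so describing each neighborhood by the share of its venues in each category is an assumption. The function names and the default of 5 clusters are likewise hypothetical, and the input rows are assumed to use the Foursquare field names listed in Section 2.3.

```python
from collections import Counter, defaultdict


def category_frequencies(venue_rows):
    """Return (neighborhoods, categories, matrix) of category shares.

    matrix[i][j] is the fraction of venues in neighborhoods[i]
    that belong to categories[j].
    """
    counts = defaultdict(Counter)
    for row in venue_rows:
        counts[row["Neighbourhood"]][row["Venue Category"]] += 1
    neighborhoods = sorted(counts)
    categories = sorted({c for counter in counts.values() for c in counter})
    matrix = []
    for name in neighborhoods:
        total = sum(counts[name].values())
        matrix.append([counts[name][c] / total for c in categories])
    return neighborhoods, categories, matrix


def cluster_neighborhoods(venue_rows, n_clusters=5):
    """Assign each neighborhood a K-means cluster label (assumed setup)."""
    from sklearn.cluster import KMeans  # imported lazily

    neighborhoods, _, matrix = category_frequencies(venue_rows)
    model = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(matrix)
    return dict(zip(neighborhoods, model.labels_.tolist()))
```

Using shares rather than raw counts keeps neighborhoods with many venues from dominating the distance calculation, though other feature choices are equally plausible.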