# Introduction/Business Problem - A description of the problem and a discussion of the background

Previously, in the Applied Data Science Capstone project, we compared neighbourhoods within both New York and Toronto. In this project we are going to compare neighbourhoods between two different cities: New York, USA and London, UK.

In New York there are 306 different neighbourhoods across 5 different boroughs. There are similarities between the neighbourhoods within each of these boroughs.

In particular, I want to be able to take an area from another city and decide which New York borough it is most similar to. For example, is Toronto's Scarborough Village neighbourhood most similar to neighbourhoods found in the Bronx, Brooklyn, Manhattan, Queens or Staten Island?

As I am from the UK, I will be comparing areas in London, UK to New York, USA.

<em> Scenario </em>

Imagine I am a Data Scientist working for Google in New York. I work at the Google offices in Chelsea, Manhattan and live in the borough of Manhattan.

I have recently been asked to relocate for work to Google's offices near Kings Cross in London, UK. However, I currently know nothing about London or what it is like as a place to live. Before deciding where to move to, I would like to narrow down my options.

Even though I don't know anything about London, I know that I enjoy living in the Manhattan area of New York, so I would like to know which areas of London are most similar to Manhattan, using available Foursquare API data. In addition, I'd prefer not to be too far from my office near Kings Cross.

We first look at all areas of London to establish which New York borough each is most similar to. Then, for the areas most similar to Manhattan, we work out how far each is from the new Google offices near Kings Cross, giving a shortlist of 3 areas I should consider living in when I move to London.

<em> Notes </em>

This could be a comparison of any two cities where the data is readily available. London and New York are just a demonstration of it being successfully applied.

Moreover, this general idea could be used in a different context, for example by a business that has many successful stores in Manhattan, is looking to expand to London, and is unsure of the best area in which to locate.

# Data - A description of the data and how it will be used to solve the problem

### New York Neighbourhood Data

The New York borough and neighbourhood data comes from the following file, which was used in the previous practicals in the Applied Data Science Capstone:

https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json

This dataset contains every neighbourhood in New York, the borough it is in, and its longitude and latitude.

The dataframe has 5 boroughs and 306 neighborhoods.
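As a minimal sketch, the JSON could be flattened into rows as shown below. It assumes the file is a GeoJSON-style feature collection in which each feature carries `properties.borough`, `properties.name` and `geometry.coordinates` ordered as `[longitude, latitude]`; those field names are an assumption and should be checked against the actual download. The helper name `neighbourhoods_from_geojson` and the two sample features are hypothetical.

```python
def neighbourhoods_from_geojson(data):
    """Flatten a GeoJSON-style feature collection into a list of row dicts."""
    rows = []
    for feature in data["features"]:
        # Assumed layout: point coordinates are ordered [longitude, latitude]
        longitude, latitude = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighbourhood": feature["properties"]["name"],
            "Latitude": latitude,
            "Longitude": longitude,
        })
    return rows


# Tiny illustrative sample in the assumed layout (not the real file)
sample = {
    "features": [
        {
            "properties": {"borough": "Bronx", "name": "Wakefield"},
            "geometry": {"coordinates": [-73.847201, 40.894705]},
        },
        {
            "properties": {"borough": "Manhattan", "name": "Marble Hill"},
            "geometry": {"coordinates": [-73.910660, 40.876551]},
        },
    ]
}

rows = neighbourhoods_from_geojson(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` to obtain the dataframe described above.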

Population data scraped from
https://en.wikipedia.org/wiki/Boroughs_of_New_York_City

In [78]:
import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/Boroughs_of_New_York_City')

#Select correct dataframe (0th) from list of dataframes
df = df[0]

#Drop multiindex names from columns
df.columns = df.columns.droplevel().droplevel()

#Remove bottom 3 rows (summary and comment rows)
df = df[:-3]

#Rename the population estimate column
df.rename(columns={'Estimate (2019)[3]':'Population Estimate (2019)'}, inplace=True)

print('Total population of New York : '+ str(df['Population Estimate (2019)'].astype(int).sum()))
print('Average population of New York Borough: '+ str(df['Population Estimate (2019)'].astype(int).mean()))
#Divide total population by number of neighbourhoods
print('Average population of New York Neighbourhood : '+ str(df['Population Estimate (2019)'].astype(int).sum()/306))

#Show the relevant columns
df[['Borough', 'Population Estimate (2019)']]

Total population of New York : 8336817
Average population of New York Borough: 1667363.4
Average population of New York Neighbourhood : 27244.5


|  | Borough | Population Estimate (2019) |
|---|---|---|
| 0 | The Bronx | 1418207 |
| 1 | Brooklyn | 2559903 |
| 2 | Manhattan | 1628706 |
| 3 | Queens | 2253858 |
| 4 | Staten Island | 476143 |


<strong> The average population is approximately 1,700,000 people per borough

The average population is approximately 27,000 people per neighbourhood</strong>

### London Borough Data

The data for London's boroughs will be scraped from Wikipedia (as we did with Toronto):

https://en.wikipedia.org/wiki/List_of_London_boroughs

This dataset contains every borough in London along with its latitude and longitude coordinates (which have to be wrangled into the correct form). It also contains other features that are not required.

The dataframe has 32 boroughs (and we do not include the City of London, as it is not a borough)
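The coordinate wrangling could look something like the sketch below. It assumes the scraped coordinate strings contain a degrees/minutes/seconds pair such as `51°33′39″N 0°09′21″E`; that format is an assumption about the Wikipedia table, and the helper name `dms_to_decimal` is hypothetical.

```python
import re

# Matches one degrees/minutes/seconds component such as 51°33′39″N
DMS_PATTERN = re.compile(r"(\d+)°(\d+)′(\d+)″([NSEW])")


def dms_to_decimal(text):
    """Convert a 'D°M′S″N D°M′S″E' string to (latitude, longitude) in decimal degrees."""
    values = []
    for degrees, minutes, seconds, hemisphere in DMS_PATTERN.findall(text)[:2]:
        value = int(degrees) + int(minutes) / 60 + int(seconds) / 3600
        # South and West are negative by convention
        if hemisphere in "SW":
            value = -value
        values.append(value)
    if len(values) != 2:
        raise ValueError(f"Expected two coordinates in {text!r}")
    latitude, longitude = values
    return latitude, longitude


lat, lon = dms_to_decimal("51°33′39″N 0°09′21″E")
print(round(lat, 4), round(lon, 4))
```

Applied to the coordinate column with `Series.apply`, a helper like this would yield separate latitude and longitude columns in the same form as the New York data.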

Data scraped from https://en.wikipedia.org/wiki/List_of_London_boroughs

In [76]:
df2 = pd.read_html('https://en.wikipedia.org/wiki/List_of_London_boroughs')

#Select correct dataframe (0th) from list of dataframes
df2 = df2[0]

#Rename the population estimate column
df2.rename(columns={'Population (2013 est)[1]':'Population Estimate (2013)'}, inplace=True)

print('Total population of London : '+str(df2['Population Estimate (2013)'].sum()))
print('Average population of London Borough: '+str(df2['Population Estimate (2013)'].mean()))

#Show the relevant columns
df2[['Borough', 'Population Estimate (2013)']]

Total population of London : 8408887
Average population of London Borough: 262777.71875


|  | Borough | Population Estimate (2013) |
|---|---|---|
| 0 | Barking and Dagenham [note 1] | 194352 |
| 1 | Barnet | 369088 |
| 2 | Bexley | 236687 |
| 3 | Brent | 317264 |
| 4 | Bromley | 317899 |
| 5 | Camden | 229719 |
| 6 | Croydon | 372752 |
| 7 | Ealing | 342494 |
| 8 | Enfield | 320524 |
| 9 | Greenwich [note 2] | 264008 |


<strong>The average population is approximately 260,000 people per borough</strong>

### New York 'Boroughs' and London 'Boroughs' are not the same - would be like comparing apples and oranges

We can see by comparing average population sizes that New York 'boroughs' and London 'boroughs' are not the same size. In particular, a New York borough has approximately 6 to 7 times the population of a London borough (~1,700,000 and ~260,000 respectively). As such, the boroughs are not directly comparable: a New York borough is much larger than a London borough.

However, comparing the average population sizes of New York neighbourhoods and London boroughs shows that they are not the same size either. In particular, a London borough has approximately 10 times the population of a New York neighbourhood (~260,000 and ~27,000 respectively). A London borough is larger than a New York neighbourhood.

In particular, it is worth noting that a London borough is somewhere in between the size of a New York neighbourhood and a New York borough. Consequently, it will be crucial to normalise features to allow for a fair comparison.
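The size ratios quoted in this section can be checked directly from the averages printed earlier. The variable names below are just illustrative labels for those printed values.

```python
# Average populations as printed by the cells above
ny_borough_mean = 1667363.4
ny_neighbourhood_mean = 27244.5
london_borough_mean = 262777.71875

# How many London boroughs "fit" into one New York borough
ny_borough_vs_london = ny_borough_mean / london_borough_mean
# How many New York neighbourhoods "fit" into one London borough
london_vs_ny_neighbourhood = london_borough_mean / ny_neighbourhood_mean

print(f"New York borough / London borough: {ny_borough_vs_london:.1f}")
print(f"London borough / New York neighbourhood: {london_vs_ny_neighbourhood:.1f}")
```

Both ratios are far from 1, which is why raw venue counts would not be comparable between the two cities.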

### Similarity data - Foursquare API

We will be utilizing the Foursquare API to establish the features of the neighborhoods in New York. We will collect the types of venues within each neighbourhood. We also use the Foursquare API to establish the features of Boroughs in London. We will collect the types of venues within each borough.

To encode a venue list from the Foursquare API into features, we will use one-hot encoding over the venue types in both London boroughs and New York neighbourhoods. This will ensure the features of the two data sets are the same (crucial for comparison later).

After one-hot encoding, we group by New York neighbourhood and by London borough, taking the mean of the frequency of occurrence of each category to normalise the features. We can call this data our feature space (X).
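The encoding and grouping steps might be sketched as follows. The venue rows, the column names `Area` and `Venue Category`, and the helper `area_features` are all hypothetical stand-ins; real venue lists would come from the Foursquare API.

```python
import pandas as pd

# Hypothetical venue lists: one row per venue returned for an area
ny_venues = pd.DataFrame({
    "Area": ["Chelsea", "Chelsea", "Wakefield"],
    "Venue Category": ["Coffee Shop", "Art Gallery", "Pharmacy"],
})
london_venues = pd.DataFrame({
    "Area": ["Camden", "Camden", "Barnet"],
    "Venue Category": ["Pub", "Coffee Shop", "Park"],
})


def area_features(venues, categories):
    """One-hot encode venue categories, then average per area."""
    onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
    # Reindex so every dataset has the same category columns in the same order
    onehot = onehot.reindex(columns=categories, fill_value=0.0)
    # The mean of one-hot rows is the share of an area's venues in each category
    return onehot.groupby(venues["Area"]).mean()


# Shared category list built from both cities
categories = sorted(
    set(ny_venues["Venue Category"]) | set(london_venues["Venue Category"])
)
X_ny = area_features(ny_venues, categories)
X_london = area_features(london_venues, categories)
print(X_ny)
print(X_london)
```

Building the category list from both cities before encoding is one way to guarantee identical columns. Because each row holds proportions rather than counts, areas with very different numbers of venues remain comparable.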

At this point we should clearly label our New York neighbourhood data as train and cross-validation data, and our London borough data as test data (X_test).

To the New York neighbourhood feature data from Foursquare (X) we can add New York borough labels which have been given a numerical encoding (y), as well as splitting it into train (X_train, y_train) and cross-validation (X_cv, y_cv) datasets
(stratified by borough, 80% train and 20% cross-validation).
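A sketch of the label encoding and stratified split, assuming scikit-learn is used (the text does not name a library). The feature matrix here is random stand-in data, not Foursquare features, and `random_state=0` is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data: 50 neighbourhoods, 4 venue-frequency features
rng = np.random.default_rng(0)
X = rng.random((50, 4))
boroughs = np.repeat(
    ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"], 10
)

# Encode borough names as integers 0..4
encoder = LabelEncoder()
y = encoder.fit_transform(boroughs)

# 80% train, 20% cross-validation, keeping borough proportions in both parts
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_cv.shape)
```

Stratifying matters here because the boroughs contain different numbers of neighbourhoods, and a purely random split could leave a small borough under-represented in the cross-validation set.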

I then use the K-Nearest Neighbour classification model on the New York neighbourhood data, with borough labels, to train a borough classification model
(i.e. given a neighbourhood's features, it should be able to classify which borough it belongs to).

We will use the 20% cross-validation set from New York data to establish the best value for K (the number of nearest neighbours), by finding the best accuracy for a range of possible K values.
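The search for K could be sketched like this, again assuming scikit-learn. The data is synthetic (three well-separated clusters standing in for boroughs), and the candidate range of 1 to 10 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data: three well-separated clusters play the role of boroughs
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 30)
X = centres[y] + rng.normal(scale=0.5, size=(90, 2))

X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit one model per candidate K and record its cross-validation accuracy
accuracy_by_k = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracy_by_k[k] = model.score(X_cv, y_cv)

# On ties, max() keeps the first (smallest) K encountered
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
best_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, accuracy_by_k[best_k])
```

With the real data, the London feature matrix would then be passed to `best_model.predict` to obtain a New York borough label for each London borough.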

I will then use the test data for London Boroughs to classify them according to what New York Boroughs they are most similar to.

It is then simple to restrict the data to the London boroughs most similar to Manhattan.

### Distance from new Office

The current Google office in London can be found on Google Maps
https://www.google.com/maps/place/Google+UK/@51.5332609,-0.1281919,17z/data=!3m1!4b1!4m5!3m4!1s0x48761b3c54efa6e1:0xc7053ab04745950d!8m2!3d51.5332609!4d-0.1260032

Right-clicking on the pop-up and clicking "What's here" reveals the latitude and longitude of the building:
Latitude = 51.5332609
Longitude = -0.1260032

Using the longitude and latitude in the London data set allows us to calculate the straight-line (bird's-eye) distance from each borough to the office. We will then be able to sort our list of Manhattan-like London boroughs by this distance to give a shortlist of 3 London boroughs which are most like Manhattan and are close to the Google office in London.
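One common way to compute a straight-line distance from latitude and longitude is the haversine formula, sketched below. The text does not say which formula will be used, so this is an assumption. The office coordinates are the ones given above, while the three candidate boroughs and their approximate coordinates are placeholders, not the outcome of the classification.

```python
from math import asin, cos, radians, sin, sqrt

OFFICE_LAT, OFFICE_LON = 51.5332609, -0.1260032  # Google office, from the text above
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Placeholder candidates with approximate coordinates (illustrative only)
candidates = {
    "Westminster": (51.4973, -0.1372),
    "Islington": (51.5416, -0.1022),
    "Camden": (51.5290, -0.1255),
}

distances = {
    name: haversine_km(lat, lon, OFFICE_LAT, OFFICE_LON)
    for name, (lat, lon) in candidates.items()
}

# Sort by distance and keep the three closest
shortlist = sorted(distances, key=distances.get)[:3]
for name in shortlist:
    print(f"{name}: {distances[name]:.2f} km")
```

Note that each borough is reduced to a single coordinate, so the result is the distance from that reference point and not from any particular street or home within the borough.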