# Capstone Project - The Battle of Neighborhoods 
## 'The Best Walkable Spots for Students in NY'

### Final Report

## Table of Contents

__1. Introduction - Business Problem Definition__

__2. Data__

__3. Methodology__

__4. Results__

__5. Discussion__

__6. Conclusion__

# Introduction - Business Problem Definition

##  Study Context

There is a category of students and young adults who would like to settle temporarily in New York City. They often have limited financial resources and cannot afford to own and maintain a car, with all the expenses it generates. Some also go without a car by choice, preferring a greener mode of transportation to reduce their carbon footprint.

To help cover their daily expenses or to supplement their basic income, they often hold jobs in restaurants, shops, bars, etc. Moreover, like most young people, they enjoy lively neighborhoods where they can meet friends and other people.

We can therefore expect this population to look for rental housing in lively areas that offer both leisure outings and opportunities to find odd jobs.

Since they do not have a personal vehicle, whether for lack of financial means or out of ecological conviction, these students and young people will look for rooms or studios to rent as close as possible to places where they can easily find extra work.

One selection criterion will therefore be the proximity of the residence to these employment zones: the ability to get there as quickly as possible, first on foot, then by skateboard, bicycle, or public transport such as bus or metro lines.

## Problem Statement


As a data scientist, I would like to study the neighborhoods of New York and classify them according to the number of places for shopping, outings and leisure, and the accessibility of these places on foot (in a first study).

Later, we can refine the study by also considering other means of transportation, such as bicycles or public transport (buses and metro).

Finally, to be closer to the reality of these low-income people, we can add a rental price criterion (the neighborhood average) and study its influence on the classification.

Note: in this case study, the important factor of "income required per tenant" to be able to rent a room has not been considered. It would, of course, have to be taken into account in real life.

## Target Audience

Who could be interested in this problem?

At a minimum, we can list the following target populations:

1. Students who would like to find a job close to their place of residence

2. People who do not own a personal vehicle, whether out of ecological conviction or because of low income, but who love lively places with lots of restaurants, cafes, shops, cultural venues, etc. In other words, those interested in living somewhere with a high walkability index.

3. Companies that build search engines for rental accommodation agencies

4. Reception and support centers for students

# Data Set Definition

A summary of the data needed can be established from the problem definition above.

As an output of the solution, we would like to provide an overall picture of New York showing where it is possible to rent a room or studio, with a score calculated from the proximity of restaurants, bars, coffee shops and other businesses fully accessible to pedestrians: something like a heat map of places in New York City.

So we will need to retrieve the following information:

1. The list and location of New York neighborhoods
2. The list and location of venues such as coffee shops, restaurants, miscellaneous shops, theaters, etc. around a specific place in New York
3. The list of places with their walkability measure
4. The list of places with their average rental price

In the next sections, each data set is described: its origin, the type of data, and the way it will be used to solve the problem defined above.

##  1. New York Neighborhoods

We follow the same process as in the lab _"Segmenting and Clustering Neighborhoods in New York City"_.

As a reminder, New York has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset is freely available on the web. We will use the dataset at the link we were given: https://geo.nyu.edu/catalog/nyu_2451_34572.

The web page displays the following information:

***
> _2014 New York City Neighborhood Names_

> _Description:_
>> _This New York City Neighborhood Names point file was created as a guide to New York City’s neighborhoods
>> that appear on the web resource, “New York: A City of Neighborhoods.” Best estimates of label centroids were 
>> established at a 1:1,000 scale, but are ideally viewed at a 1:50,000 scale._

> _Publisher: New York (City). Department of City Planning_

> _Collection: Bytes of the Big Apple_

> _Place(s): New York, New York, United States_

> _Subject(s): Neighborhoods, Neighborhood planning, and Communities_

> _Format(s): Shapefile_

> _Year: 2014_

> _Held by: NYU_

> _Preservation record: http://hdl.handle.net/2451/34572_
***

On this web page, we select the format of the file we want to generate. After selecting the _'json'_ format, the following message indicates that the formatted file is ready to be downloaded locally:
>> _'Your file nyu-2451-34572-geojson.json is ready for download'_

After that, we can parse the json features array of 306 items to extract all the information about the neighborhoods of New York.

This step leads us to build a dataframe, _**neighborhoods**_, containing the basic data needed to solve our problem, i.e. neighborhoods with their location coordinates.

Using the pandas library, we map the json data to the _**neighborhoods**_ dataframe, which contains the 4 columns listed below:

>> _**Borough, Neighborhood, Latitude, Longitude**_
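
The parsing step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes each feature stores the neighborhood name and borough under `properties.name` and `properties.borough` and a point geometry under `geometry.coordinates`, and it runs on a tiny made-up stand-in for the downloaded file. The function name `features_to_rows` is hypothetical.

```python
import json

# Made-up stand-in for nyu-2451-34572-geojson.json (the real file has 306 features).
# The property names "name" and "borough" are assumptions about the file layout.
sample_geojson = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-73.91, 40.88]},
     "properties": {"name": "Example Neighborhood A", "borough": "Example Borough"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-73.95, 40.70]},
     "properties": {"name": "Example Neighborhood B", "borough": "Example Borough"}}
  ]
}
""")


def features_to_rows(geojson):
    """Flatten GeoJSON point features into one dict per neighborhood."""
    rows = []
    for feature in geojson["features"]:
        # GeoJSON stores point coordinates as [longitude, latitude]
        lon, lat = feature["geometry"]["coordinates"][:2]
        props = feature["properties"]
        rows.append({
            "Borough": props["borough"],
            "Neighborhood": props["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return rows


rows = features_to_rows(sample_geojson)
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame` would then give a dataframe with the four columns listed above.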

##  2. Venues in New York Neighborhoods

For these data, we also follow the same process as in the lab _"Segmenting and Clustering Neighborhoods in New York City"_, i.e. using the _**Foursquare API**_ to explore the neighborhoods.

This is a requirement of this assignment.

The Foursquare API lets us request the top venues in each neighborhood within a defined radius. In Python, a GET request to _**Foursquare**_ looks like this:

***
~~~~
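import requests

# CLIENT_ID, CLIENT_SECRET and VERSION (a 'YYYYMMDD' date string) must be
# defined beforehand with your own Foursquare credentials.
# neighborhood_latitude and neighborhood_longitude are the coordinates of the
# neighborhood being explored.

# Build the url
radius = 500   # search radius in meters
LIMIT = 100    # maximum number of venues to return

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET,
        neighborhood_latitude, neighborhood_longitude,
        VERSION,
        radius,
        LIMIT)

# make the GET request and decode the JSON body
results = requests.get(url).json()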
~~~~ 
***

After that, the 'results' data is ready to be cleaned and structured into a pandas dataframe containing at least the needed columns and rows, as displayed below:
***
~~~~
 	name                    	categories          	lat     	lng
0 	Arturo's                	Pizza Place        	40.874412 	-73.910271
1 	Bikram Yoga             	Yoga Studio        	40.876844 	-73.906204
2 	Tibbett Diner           	Diner              	40.880404 	-73.908937
3 	Starbucks               	Coffee Shop        	40.877531 	-73.905582
4 	Land & Sea Restaurant	Seafood Restaurant    	40.877885 	-73.905873
~~~~
***
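
The cleaning step can be illustrated with a short sketch. It assumes the decoded response nests venues under `response -> groups -> items -> venue`, with the category name in `categories[0].name` and the coordinates in `location.lat` / `location.lng`. The sample dictionary is hand-built from the first two rows of the table above, and `venues_to_rows` is a hypothetical helper name.

```python
# Hand-built stand-in for the decoded JSON of an explore request.
# The nesting used here is an assumption about the response layout.
sample_results = {
    "response": {
        "groups": [
            {"items": [
                {"venue": {"name": "Arturo's",
                           "categories": [{"name": "Pizza Place"}],
                           "location": {"lat": 40.874412, "lng": -73.910271}}},
                {"venue": {"name": "Bikram Yoga",
                           "categories": [{"name": "Yoga Studio"}],
                           "location": {"lat": 40.876844, "lng": -73.906204}}},
            ]}
        ]
    }
}


def venues_to_rows(results):
    """Flatten the nested response into one dict per venue."""
    rows = []
    for group in results.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories", [])
            rows.append({
                "name": venue["name"],
                # keep only the first category name, if there is one
                "categories": categories[0]["name"] if categories else None,
                "lat": venue["location"]["lat"],
                "lng": venue["location"]["lng"],
            })
    return rows


venue_rows = venues_to_rows(sample_results)
for row in venue_rows:
    print(row)
```

The resulting list of dicts has the same four columns as the table above and can be handed to `pandas.DataFrame` directly.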

This is how the Foursquare API is used to build this data set, which will be combined for each neighborhood to identify the venues around places of interest.

This will help rank each place by the number of available venues.
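
As a rough illustration of this ranking, the sketch below counts venues per neighborhood and sorts the neighborhoods by that count. The neighborhood and venue names are made up for the example; in practice the pairs would come from repeating the request for every neighborhood.

```python
from collections import Counter

# Hypothetical (neighborhood, venue name) pairs collected from the API
venue_pairs = [
    ("Neighborhood A", "Pizza Place 1"),
    ("Neighborhood A", "Coffee Shop 1"),
    ("Neighborhood A", "Diner 1"),
    ("Neighborhood B", "Yoga Studio 1"),
    ("Neighborhood C", "Coffee Shop 2"),
    ("Neighborhood C", "Seafood Restaurant 1"),
]

# Count how many venues were returned for each neighborhood
venue_counts = Counter(neighborhood for neighborhood, _ in venue_pairs)

# most_common() returns (neighborhood, count) pairs, highest count first
ranking = venue_counts.most_common()
for neighborhood, count in ranking:
    print(neighborhood, count)
```

Note that with `LIMIT = 100` the count is capped at 100 venues per request, so very dense neighborhoods would all tie at the cap.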

##  3. Walkability Measures of New York Places

These measures will help identify the places in New York neighborhoods that are most suitable for people who do not own a vehicle and want to travel only on foot or by public transportation.
 
Fortunately, as with _**Foursquare**_, there is a website, _**"Walk Score"**_ (https://www.walkscore.com/), that provides an API to compute a score for each place identified by its location coordinates.

As stated on the website:
> _"Walk Score is a number between 0 and 100 that measures the walkability of any address."_

It uses the following methodology:
> _"Walk Score measures the walkability of a location based on its distance from amenities, density of population, block length and pedestrian friendliness. The annual ranking identifies the most walkable U.S. cities with populations of more than 300,000."_ 

This leads to a 5-level evaluation scale:

***
>>> <img src="WalkScoreIndex.PNG" width="400"/>
***
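
The scale can also be expressed as a small lookup. The sketch below assumes the five bands published on the Walk Score website (90-100, 70-89, 50-69, 25-49 and 0-24) and that they match the image above; the names `WALK_SCORE_BANDS` and `walk_score_band` are hypothetical.

```python
# Lower bound, label and meaning of each band, from highest to lowest.
# These bands are assumed to match the scale shown in WalkScoreIndex.PNG.
WALK_SCORE_BANDS = [
    (90, "Walker's Paradise", "Daily errands do not require a car"),
    (70, "Very Walkable", "Most errands can be accomplished on foot"),
    (50, "Somewhat Walkable", "Some errands can be accomplished on foot"),
    (25, "Car-Dependent", "Most errands require a car"),
    (0, "Car-Dependent", "Almost all errands require a car"),
]


def walk_score_band(score):
    """Return (label, meaning) for a Walk Score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("a Walk Score must be between 0 and 100")
    for lower_bound, label, meaning in WALK_SCORE_BANDS:
        if score >= lower_bound:
            return label, meaning


print(walk_score_band(98))
```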

As with _**Foursquare**_, a user can register to get a _wsapikey_ and use API calls to retrieve the score for a place.

An example is given below:

***
~~~~
import requests

# Build the url (adjacent string literals are joined into one string)
url = ('http://api.walkscore.com/score?format=json'
       '&address=1119%208th%20Avenue%20Seattle%20WA%2098101&lat=47.6085'
       '&lon=-122.3295&transit=1&bike=1&wsapikey=<YOUR-WSAPIKEY>')

# make the GET request and decode the JSON body
results = requests.get(url).json()
~~~~ 
***

The **Walk Score** API will return the following:

***
~~~~
{
"status": 1
, "walkscore": 98
, "description": "Walker's Paradise"
, "updated": "2016-11-17 04:40:31.218250"
, "logo_url": "https://cdn.walk.sc/images/api-logo.png"
, "more_info_icon": "https://cdn.walk.sc/images/api-more-info.gif"
, "more_info_link": "https://www.walkscore.com/how-it-works/"
, "ws_link":
"https://www.walkscore.com/score/1119-8th-Avenue-Seattle-WA-98101/lat=47.6085/lng=-122.3295/?utm_source=walkscore.com&utm_medium=ws_api&utm_campaign=ws_api"
, "help_link": "https://www.walkscore.com/how-it-works/"
, "snapped_lat": 47.6085
, "snapped_lon": -122.3295
, "transit" : {"score": 100, "description": "Rider's Paradise", "summary": "115 nearby routes: 103 bus, 6 rail, 6 other"}
, "bike" : {"score": 68, "description": "Bikeable"}
}
~~~~
***
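
Only a few of these fields are needed for the ranking. The sketch below pulls them out of a trimmed copy of the response shown above. It treats `status` equal to 1 as success, as in the sample, and treating any other value as "no score available" is an assumption; `extract_scores` is a hypothetical helper name.

```python
# Trimmed copy of the sample response above, as a Python dict
sample_response = {
    "status": 1,
    "walkscore": 98,
    "description": "Walker's Paradise",
    "transit": {"score": 100, "description": "Rider's Paradise"},
    "bike": {"score": 68, "description": "Bikeable"},
}


def extract_scores(response):
    """Return the walk, transit and bike scores, or None if no score is available."""
    if response.get("status") != 1:
        return None
    return {
        "walkscore": response["walkscore"],
        "description": response["description"],
        # transit and bike are only present when requested and available
        "transit": response.get("transit", {}).get("score"),
        "bike": response.get("bike", {}).get("score"),
    }


scores = extract_scores(sample_response)
print(scores)
```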

It is now possible to rank each place. The free version is limited to 5,000 calls per day, which comfortably covers one call for each of the 306 neighborhoods.


##  4. Average Rental Price of New York Places

These measures will help refine the choice of places in New York neighborhoods that are most suitable for people with low incomes, such as the student population.
 
We will use the website _**"renthop"**_ (https://www.renthop.com/), a specialized search engine that allows users to search for apartments in New York City, Boston, Chicago, and other major metropolitan ... 

On this site, the page _https://www.renthop.com/average-rent-in/new-york-city-ny_ displays the _**"Rental Stats and Trends"**_, a collection of information that gives:

1. *Historical Prices and Trends*: curves of rental prices by housing category (studio, 1 bedroom or 2 bedrooms) over the last 3 years

2. *Median Rents*: table showing prices broken down by housing category and quartile (bottom 25%, median and top 25%)

3. *Average Rents by Neighborhoods*: table of average prices broken down by neighborhood, giving a price for each housing category and a classification of the budget needed (Cheap, Average, Pricey)

A screenshot of this page is displayed below:

***
>>> <img style="float:left;" src="Renthop_NYC.PNG" width="600"/>
***

We are mainly interested in the data stored in the table *Average Rents by Neighborhoods*.

To extract the average prices by neighborhood and by housing category, I will follow the method used in the previous assignment on _"Toronto Neighborhoods"_, where we extracted postal codes from a Wikipedia HTML page.

The steps are explained below:

1. Download the HTML file at the given link: _'https://www.renthop.com/average-rent-in/new-york-city-ny'_

2. Save the file locally

3. Open the file and, using the _'BeautifulSoup'_ library:
> - locate the table titled *"Average Rents by Neighborhoods"*, as displayed above
> - iterate through the following HTML elements to extract the column names
> - iterate through the following HTML elements to extract the values for each 'Neighborhood' row

4. Build the _pandas DataFrame_ from the retrieved data

Below is an excerpt of the HTML file to be parsed with the _'BeautifulSoup'_ library.

***
~~~~
<table id="data-table" class="stripe" style="clear: none; margin: 0; width: 640px; border: 1px solid #bbbbbb; border-top: none;">
<thead>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 300px; padding-left: 30px; text-align: left">Neighborhood</th>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 80px; padding-left: 30px; text-align: left">Studio</th>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 80px; padding-left: 30px; text-align: left">1BR</th>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 80px; padding-left: 30px; text-align: left">2BR</th>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 100px; padding-left: 30px; text-align: left">Budget</th>
</thead>
<tr>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;"><a href="/average-rent-in/alphabet-city/nyc">Alphabet City</a></td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$2,200</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$2,905</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$3,500</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">Average</td>
</tr>
<tr>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;"><a href="/average-rent-in/astoria/nyc">Astoria</a></td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$1,875</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$2,125</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$2,400</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">Cheap</td>
</tr>
~~~~ 
***
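
The extraction can be sketched on a shortened copy of this excerpt. To stay dependency-free, the sketch below uses the standard library's `html.parser` instead of BeautifulSoup; with BeautifulSoup the same steps would typically go through `soup.find('table', id='data-table')` and `find_all`. The class name `RentTableParser` and the helper `parse_price` are hypothetical.

```python
from html.parser import HTMLParser

# Shortened copy of the excerpt above (style attributes removed)
SAMPLE_HTML = """
<table id="data-table" class="stripe">
<thead>
<th>Neighborhood</th><th>Studio</th><th>1BR</th><th>2BR</th><th>Budget</th>
</thead>
<tr>
<td><a href="/average-rent-in/alphabet-city/nyc">Alphabet City</a></td>
<td>$2,200</td><td>$2,905</td><td>$3,500</td><td>Average</td>
</tr>
<tr>
<td><a href="/average-rent-in/astoria/nyc">Astoria</a></td>
<td>$1,875</td><td>$2,125</td><td>$2,400</td><td>Cheap</td>
</tr>
</table>
"""


class RentTableParser(HTMLParser):
    """Collect header names and row values from the table with id 'data-table'."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.cell = None   # text pieces of the cell being read, or None
        self.headers = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == "data-table":
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag in ("th", "td"):
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif self.in_table and tag in ("th", "td") and self.cell is not None:
            text = "".join(self.cell).strip()
            if tag == "th":
                self.headers.append(text)
            else:
                self.rows[-1].append(text)
            self.cell = None


def parse_price(text):
    """Convert a price string such as '$2,200' to an integer number of dollars."""
    return int(text.replace("$", "").replace(",", ""))


parser = RentTableParser()
parser.feed(SAMPLE_HTML)

rents = []
for values in parser.rows:
    if not values:   # skip rows without data cells
        continue
    record = dict(zip(parser.headers, values))
    for column in ("Studio", "1BR", "2BR"):
        record[column] = parse_price(record[column])
    rents.append(record)

for record in rents:
    print(record)
```

Each record maps the column names to the row values, with prices converted to integers, so the list can be passed to `pandas.DataFrame` as is.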

The goal of this parsing is to transform the retrieved data into a _pandas dataframe_.

This dataframe, combined with the other information, will allow us to refine our evaluation of the neighborhoods and answer part of our initial problem.
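
One possible way to combine the data sets is to join them on the neighborhood name. The sketch below is only an illustration under stated assumptions: the venue counts are made up, the rents are the studio prices from the HTML excerpt above, and the helpers `normalize` and `join_by_neighborhood` are hypothetical. Names are lower-cased and stripped of extra whitespace before matching, since the two sources may not spell or list neighborhoods identically.

```python
def normalize(name):
    """Lower-case a neighborhood name and collapse whitespace for matching."""
    return " ".join(name.lower().split())


def join_by_neighborhood(venue_counts, studio_rents):
    """Attach the studio rent to each neighborhood, or None when no match exists."""
    rent_lookup = {normalize(name): rent for name, rent in studio_rents.items()}
    joined = []
    for name, count in venue_counts.items():
        joined.append({
            "Neighborhood": name,
            "Venues": count,
            "StudioRent": rent_lookup.get(normalize(name)),
        })
    return joined


# Made-up venue counts; studio rents taken from the HTML excerpt
venue_counts = {"Astoria": 40, "Example Neighborhood": 12}
studio_rents = {"astoria ": 1875, "Alphabet City": 2200}

joined = join_by_neighborhood(venue_counts, studio_rents)
for record in joined:
    print(record)
```

Neighborhoods left with `StudioRent` set to `None` show where the two sources do not line up and would need manual mapping.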

### Thank you for this lab!

This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson/) and [Polong Lin](https://www.linkedin.com/in/polonglin/). I hope you found this lab interesting and educational. Feel free to contact us if you have any questions!

This notebook is part of a course on **Coursera** called *Applied Data Science Capstone*. If you accessed this notebook outside the course, you can take this course online by clicking [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2).

<hr>

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).