# Capstone Project - The Battle of Neighborhoods 
## 'The Best Walkable Spots for Students in NY'

### Final Report


## Table of Contents

__1. Introduction - Business Problem Definition__

__2. Data__

__3. Methodology__

__4. Analysis__

__5. Results__

__6. Discussion__

__7. Conclusion__

# Introduction - Business Problem Definition

## 1. Study Context

Some students and young adults would like to settle temporarily in New York City. They often have limited financial resources and cannot afford to own and maintain a car, with all the expenses it generates. Some also prefer a greener way of getting around in order to reduce their carbon footprint.

To help with daily expenses or to supplement a basic income, they often hold jobs in restaurants, shops, bars, etc. Moreover, like most young people, they enjoy lively neighborhoods where they can meet friends and other people.

We can therefore expect this population to look for rental housing in lively areas that offer both leisure outings and opportunities to find odd jobs.

Since they have no personal vehicle, whether for lack of financial means or out of ecological conviction, these students and young people will look for rooms or studios to rent as close as possible to places where they can easily find extra work.

One selection criterion will be the proximity of the residence to these employment zones: the ability to get there as quickly as possible, first on foot, then by skateboard or bicycle, or by public transport such as bus or subway lines.

## 2. Problem Statement


As a Data Scientist, I would like to study the neighborhoods of New York and classify them according to the number of places for shopping, outings and leisure, and according to how accessible these places are on foot (in a first study).

Later, the study can be refined by also considering other means of transportation, such as bicycles or public transport (buses and subway).

Finally, to be closer to the reality of these low-income people, we can add the rental price criterion (neighborhood average) and study its influence on the classification.

Nota: in this case study, the important factor of "income required per tenant" to be able to rent a room has not been considered. Of course, it must be taken into account in real life.

## 3. Target Audience

Who could be interested in this problem?

We can list at least the following target populations:

1. Students who would like to find a job close to their place of residence

2. People who do not own a personal vehicle, either out of ecological conviction or because of low income, but who love lively places with lots of restaurants, cafes, shops, cultural venues, etc. In other words, those who are interested in living in a place with a high walkability index.

3. Companies that build search engines for rental accommodation agencies

4. Reception and support centers for students

# Data Set Definition

## 1. Data Sources

A summary of needed data can be established from the problem definition above. 

As an output of the solution, we would like to provide a global picture of New York places indicating where it is possible to rent a room or studio, together with a score based on the proximity of restaurants, bars, coffee shops or other businesses fully accessible to pedestrians: something like a heat-map of places in New York City.

So we will need to retrieve the following information:

1. The list and location of New York neighborhoods
2. The list and location of places such as coffee shops, restaurants, miscellaneous shops, theaters, etc. around a specific place in New York
3. The list of places with their walkability measure
4. The list of places with their average rental price

In the next sections, each data set is described: its origin, the type of data and the way it will be used to solve the problem defined above.

###  a. New York neighborhoods

We follow the same process as given in the labs _"Segmenting and Clustering Neighborhoods in New York City"_. 

As a reminder of what we learned, New York has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset is freely available on the web. We will use the dataset at the link we have been given: https://geo.nyu.edu/catalog/nyu_2451_34572.

The Web page displays the following information : 

***
> _2014 New York City Neighborhood Names_

> _Description:_
>> _This New York City Neighborhood Names point file was created as a guide to New York City’s neighborhoods
>> that appear on the web resource, “New York: A City of Neighborhoods.” Best estimates of label centroids were 
>> established at a 1:1,000 scale, but are ideally viewed at a 1:50,000 scale._

> _Publisher: New York (City). Department of City Planning_

> _Collection: Bytes of the Big Apple_

> _Place(s): New York, New York, United States_

> _Subject(s): Neighborhoods, Neighborhood planning, and Communities_

> _Format(s): Shapefile_

> _Year: 2014_

> _Held by: NYU_

> _Preservation record: http://hdl.handle.net/2451/34572_
***

On this Web page, we select the format of the file we want to generate. After selecting the _'json'_ format, the following message indicates that the resulting file is ready to be downloaded locally:
>> _'Your file nyu-2451-34572-geojson.json is ready for download'_

After that, we can parse the json features array of 306 items to extract all the information about the neighborhoods of New York.

This step leads us to build a dataframe, _**neighborhoods**_, containing the elementary data needed to solve our problem, i.e. the neighborhoods with their location coordinates. 

Using the pandas library, we map the json data to the _**neighborhoods**_ dataframe, which contains the 4 columns listed below:

>> _**Borough, Neighborhood, Latitude, Longitude**_
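As a minimal sketch of this mapping, the function below flattens a GeoJSON-like dictionary into rows holding the four columns. The property names `borough` and `name`, the `[longitude, latitude]` coordinate order and the one-feature sample are assumptions made for illustration and should be checked against the downloaded file. The resulting list of rows can be passed directly to `pandas.DataFrame`.

```python
def features_to_rows(geojson):
    """Flatten GeoJSON point features into neighborhood rows."""
    rows = []
    for feature in geojson["features"]:
        props = feature["properties"]
        # GeoJSON stores point coordinates as [longitude, latitude]
        longitude, latitude = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": props["borough"],      # assumed property name
            "Neighborhood": props["name"],    # assumed property name
            "Latitude": latitude,
            "Longitude": longitude,
        })
    return rows


# Hypothetical one-feature sample following the assumed layout
sample = {
    "features": [
        {
            "properties": {"borough": "Bronx", "name": "Wakefield"},
            "geometry": {"type": "Point", "coordinates": [-73.847, 40.895]},
        }
    ]
}
rows = features_to_rows(sample)
```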

###  b. Venues in New York Neighborhoods

For these data, we also follow the same process as given in the labs _"Segmenting and Clustering Neighborhoods in New York City"_, i.e. using the _**Foursquare API**_ to explore the neighborhoods.

This is a requirement of this assignment. 

The Foursquare API allows us to request a number of top venues located within a defined radius of each neighborhood. In Python, a GET request to _**Foursquare**_ looks like this:

***
~~~~
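import requests

# CLIENT_ID, CLIENT_SECRET and VERSION (Foursquare credentials and API version
# date) are assumed to be defined beforehand, as are neighborhood_latitude and
# neighborhood_longitude (coordinates of the neighborhood to explore).

# Build the url
radius = 500   # search radius in meters
LIMIT = 100    # maximum number of venues requested

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET,
        neighborhood_latitude, neighborhood_longitude,
        VERSION,
        radius,
        LIMIT)

# make the GET request and decode the JSON body
results = requests.get(url).json()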
~~~~ 
***

After that, the 'results' data is ready to be cleaned and structured into a pandas dataframe containing at least the needed columns and rows, as displayed below:
***
~~~~
 	name                    	categories          	lat     	lng
0 	Arturo's                	Pizza Place        	40.874412 	-73.910271
1 	Bikram Yoga             	Yoga Studio        	40.876844 	-73.906204
2 	Tibbett Diner           	Diner              	40.880404 	-73.908937
3 	Starbucks               	Coffee Shop        	40.877531 	-73.905582
4 	Land & Sea Restaurant	Seafood Restaurant    	40.877885 	-73.905873
~~~~
***

This is how the Foursquare API is used to build this data set, which will be combined for each neighborhood to identify venues around places of interest.

This will help to rank each place by its number of available venues.
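The cleaning step can be sketched as follows. The nesting `response -> groups -> items -> venue` is the layout I assume for the `venues/explore` response, and the sample dictionary is a hand-written stand-in rather than a real API answer. The function keeps only the four columns shown in the table above, and its output can be loaded with `pandas.DataFrame`.

```python
def venues_from_explore(results):
    """Extract name, first category and coordinates of each venue."""
    items = results["response"]["groups"][0]["items"]  # assumed response layout
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "name": venue["name"],
            # Keep only the first category name, or None when there is none
            "categories": categories[0]["name"] if categories else None,
            "lat": venue["location"]["lat"],
            "lng": venue["location"]["lng"],
        })
    return rows


# Hand-written sample mimicking the assumed response layout
sample_results = {"response": {"groups": [{"items": [
    {"venue": {"name": "Arturo's",
               "categories": [{"name": "Pizza Place"}],
               "location": {"lat": 40.874412, "lng": -73.910271}}},
    {"venue": {"name": "Unnamed spot",
               "categories": [],
               "location": {"lat": 40.876844, "lng": -73.906204}}},
]}]}}
venue_rows = venues_from_explore(sample_results)
```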

###  c. Walkability Measures of New York Places

These measures will help to define the places in New York neighborhoods that are the most suitable for people who do not own a vehicle and want to travel only on foot or by public transportation.
 
Fortunately, as with _**Foursquare**_, there is a Web site, _**"Walk Score"**_ (https://www.walkscore.com/), that provides an API to compute a score for each place identified by its location coordinates.

As stated on the Web page:
> _"Walk Score is a number between 0 and 100 that measures the walkability of any address."_

The score is computed with the following methodology:
> _"Walk Score measures the walkability of a location based on its distance from amenities, density of population, block length and pedestrian friendliness. The annual ranking identifies the most walkable U.S. cities with populations of more than 300,000."_ 

This leads to defining a 5-level evaluation scale:

***
> <img src="WalkScoreIndex.PNG" width="400"/>
***
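For later grouping, a score can be mapped to its level with a small helper. The thresholds below (90, 70, 50) and the labels are the ones I understand Walk Score to publish for this scale; treat them as an assumption and check them against the figure above. The two lowest levels share the "Car-Dependent" label.

```python
def walk_category(score):
    """Map a Walk Score (0-100) to its category label (assumed thresholds)."""
    if not 0 <= score <= 100:
        raise ValueError("Walk Score must be between 0 and 100")
    if score >= 90:
        return "Walker's Paradise"
    if score >= 70:
        return "Very Walkable"
    if score >= 50:
        return "Somewhat Walkable"
    # The two lowest levels (25-49 and 0-24) carry the same label
    return "Car-Dependent"


labels = [walk_category(s) for s in (98, 75, 68, 10)]
```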

As with _**Foursquare**_, a user can register to get _wsapikey_ to use API calls to retrieve the score for a place.

An example is given below :

***
~~~~
import requests

# Build the url (replace <YOUR-WSAPIKEY> with your own key)
url = ('http://api.walkscore.com/score?format=json&'
       'address=1119%208th%20Avenue%20Seattle%20WA%2098101&lat=47.6085&'
       'lon=-122.3295&transit=1&bike=1&wsapikey=<YOUR-WSAPIKEY>')

# make the GET request and decode the JSON body
results = requests.get(url).json()
~~~~ 
***

The **Walk Score** API will return the following:

***
~~~~
{
"status": 1
, "walkscore": 98
, "description": "Walker's Paradise"
, "updated": "2016-11-17 04:40:31.218250"
, "logo_url": "https://cdn.walk.sc/images/api-logo.png"
, "more_info_icon": "https://cdn.walk.sc/images/api-more-info.gif"
, "more_info_link": "https://www.walkscore.com/how-it-works/"
, "ws_link":
"https://www.walkscore.com/score/1119-8th-Avenue-Seattle-WA-98101/lat=47.6085/lng=-122.3295/?utm_source=walkscore.com&utm_medium=ws_api&utm_campaign=ws_api"
, "help_link": "https://www.walkscore.com/how-it-works/"
, "snapped_lat": 47.6085
, "snapped_lon": -122.3295
, "transit" : {"score": 100, "description": "Rider's Paradise", "summary": "115 nearby routes: 103 bus, 6 rail, 6 other"}
, "bike" : {"score": 68, "description": "Bikeable"}
}
~~~~
***

So it is now possible to rank each place. Note that the free version is limited to 5,000 calls per day.
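A minimal sketch of how such a response could be reduced to the values we need is given below. The helper name `extract_scores` is hypothetical; the `-1` sentinel for a missing score follows the convention used in the data cleaning step of this report, and treating any `status` other than 1 as "no score" is an assumption.

```python
def extract_scores(response):
    """Return walk, transit and bike scores; -1 marks a missing score."""
    if response.get("status") != 1:  # assumption: only status 1 carries a score
        return {"walkscore": -1, "transit": -1, "bike": -1}
    return {
        "walkscore": response.get("walkscore", -1),
        "transit": response.get("transit", {}).get("score", -1),
        "bike": response.get("bike", {}).get("score", -1),
    }


# Subset of the response shown above
sample_response = {
    "status": 1,
    "walkscore": 98,
    "description": "Walker's Paradise",
    "transit": {"score": 100, "description": "Rider's Paradise"},
    "bike": {"score": 68, "description": "Bikeable"},
}
scores = extract_scores(sample_response)
```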


###  d. Average Rental Price of New York Places

These measures will help to refine the choice of places in New York neighborhoods that are the most suitable for people with low incomes, such as the student population.
 
We will use the Web site _**"renthop"**_ (https://www.renthop.com/), a specialized search engine that allows users to search for apartments in New York City, Boston, Chicago, and other major metropolitan ... 

Browsing this site, you can find the page _https://www.renthop.com/average-rent-in/new-york-city-ny_, which displays the _**"Rental Stats and Trends"**_. This is a collection of information that gives:

1. *Historical Prices and Trends* : curves of the rental prices of housing by category (studio, 1 or 2 bedrooms) over the past 3 years

2. *Median Rents* : table showing prices broken down by housing category and quartile (bottom 25%, median and top 25%)

3. *Average Rents by Neighborhoods* : table of average prices broken down by neighborhood, giving for each housing category a price, plus a classification of the needed budget (Cheap, Average, Pricey).

A screenshot of this page is displayed below :

***
>> <img src="Renthop_NYC.PNG" width="600"/>
***


We are mainly interested in the data stored in the table *Average Rents by Neighborhoods*.

To extract the average prices by neighborhood and by housing category, I will follow the method used in the previous assignment on the study of _"Toronto Neighborhoods"_, where we extracted postal codes from a Wikipedia HTML file.

The steps are explained below :

1. Download the HTML file at the given link : _'https://www.renthop.com/average-rent-in/new-york-city-ny'_

2. Save the file locally

3. Open the file and, using the _'BeautifulSoup'_ library :
> - retrieve the table with the title displayed above, *"Average Rents by Neighborhoods"*
> - iterate through the next HTML elements to extract the names of the columns 
> - iterate through the next HTML elements to extract the values for each 'Neighborhood' row 

4. Build the _pandas Dataframe_ with the retrieved data

Below, I present a piece of the HTML file to be parsed with the _'BeautifulSoup'_ library.

***
~~~~
<table id="data-table" class="stripe" style="clear: none; margin: 0; width: 640px; border: 1px solid #bbbbbb; border-top: none;">
<thead>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 300px; padding-left: 30px; text-align: left">Neighborhood</th>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 80px; padding-left: 30px; text-align: left">Studio</th>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 80px; padding-left: 30px; text-align: left">1BR</th>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 80px; padding-left: 30px; text-align: left">2BR</th>
<th class="font-size-9 bold" style="padding-top: 10px; padding-bottom: 10px; border-top: 1px solid #bbbbbb; border-bottom: 1px solid #bbbbbb; width: 100px; padding-left: 30px; text-align: left">Budget</th>
</thead>
<tr>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;"><a href="/average-rent-in/alphabet-city/nyc">Alphabet City</a></td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$2,200</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$2,905</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$3,500</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">Average</td>
</tr>
<tr>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;"><a href="/average-rent-in/astoria/nyc">Astoria</a></td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$1,875</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$2,125</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">$2,400</td>
<td class="font-size-9" style="padding-bottom: 5px; padding-top: 3px; padding-left: 10px;">Cheap</td>
</tr>
~~~~ 
***
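The extraction steps above can be sketched as follows. To keep the example free of third-party dependencies, it swaps in `html.parser.HTMLParser` from the Python standard library instead of BeautifulSoup, so it is an illustration of the idea and not the code used in this study. The class name `RentTableParser` is hypothetical and the HTML sample is a shortened version of the excerpt above, with the style attributes removed.

```python
from html.parser import HTMLParser


class RentTableParser(HTMLParser):
    """Collect the header names and the row values of a simple HTML table."""

    def __init__(self):
        super().__init__()
        self.headers, self.rows = [], []
        self._cell = None   # tag of the cell being read: 'th', 'td' or None
        self._text = []     # text fragments of the current cell
        self._row = []      # values of the current row

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell, self._text = tag, []
        elif tag == "tr":
            self._row = []

    def handle_data(self, data):
        if self._cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._cell:
            value = "".join(self._text).strip()
            (self.headers if tag == "th" else self._row).append(value)
            self._cell = None
        elif tag == "tr" and self._row:
            # Pair each value with its column name
            self.rows.append(dict(zip(self.headers, self._row)))
            self._row = []


html = """
<table id="data-table">
<thead><th>Neighborhood</th><th>Studio</th><th>1BR</th><th>2BR</th><th>Budget</th></thead>
<tr><td><a href="/average-rent-in/astoria/nyc">Astoria</a></td>
<td>$1,875</td><td>$2,125</td><td>$2,400</td><td>Cheap</td></tr>
</table>
"""
parser = RentTableParser()
parser.feed(html)
```

After parsing, `parser.rows` holds one dictionary per neighborhood, which can be handed to `pandas.DataFrame` to build the dataframe of step 4.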

## 2. Data Processing

#### a) Data Merging

I split the API calls into several steps to get the scores for all 11311 venues.

I then merged all the resulting dataframes and files into a single dataframe and file, so that the analysis considers all the information on the venues.


#### b) Data Cleaning

Now let us look at the data and check whether all the information is ready to be used for our analysis.

If any data is missing, a strategy has to be adopted, for example:
 - drop the row
 - replace the value by the value of the surrounding venue with the same latitude and longitude
 - replace the value by the min of the values of the surrounding venues
 - replace the value by the mean of the values of the surrounding venues 
 - etc.

So how many venues have an empty walk score (equal to -1)?

There are 32 venues without a walk score.

This represents only 32/11311 ≈ 0.0028, i.e. about 0.3% of the venues.

We simply dropped these rows before the analysis, which gives a new dataframe with 11311 - 32 = 11279 rows.
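The dropping strategy can be sketched in plain Python as shown below. The helper name `drop_missing_scores` and the three sample venues are hypothetical; on a pandas dataframe the same filter would be a boolean mask on the walk score column.

```python
def drop_missing_scores(venues, sentinel=-1):
    """Remove venues whose walk score equals the sentinel.

    Returns the kept venues and the share of venues that was dropped.
    """
    kept = [v for v in venues if v["walkscore"] != sentinel]
    dropped_share = (len(venues) - len(kept)) / len(venues) if venues else 0.0
    return kept, dropped_share


# Hypothetical sample: one venue out of three has no walk score
venues = [
    {"name": "A", "walkscore": 98},
    {"name": "B", "walkscore": -1},
    {"name": "C", "walkscore": 75},
]
kept, dropped_share = drop_missing_scores(venues)
```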


#### c) Data Processing

Sometimes data are not in the right format: they need to be processed or aggregated, created when missing, or reshaped to be more usable or informative.

Some examples are given below.

######  Data rewriting

- Rental amount strings with a dollar sign have to be converted to plain numbers 
    - **'$4,542 -> 4542'**
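A minimal sketch of this conversion, using a hypothetical helper name `parse_rent`:

```python
def parse_rent(text):
    """Convert a rent string such as '$4,542' to the integer 4542."""
    return int(text.replace("$", "").replace(",", "").strip())


amount = parse_rent("$4,542")
```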
 

######  Missing Data 

- Missing rental data for neighborhoods: no price values
    - Only 68 of the 306 neighborhoods of New York City have been retrieved from the Renthop data files. To complete the missing values for all neighborhoods, we should adopt a filling strategy.
        - Use the "Median Rents" values or the Budget attribute value from the same data source to complete the missing values :
            - 'Cheap' then use 'Bot25%' value
            - 'Pricey' then use 'Top25%' value
            - 'Average' then use 'Median' value
        - Compute a ratio :
            - Use 'Budget' attribute value
            - find Median[1BR]
            - define Ratio = (Average[row][1BR] / Median[1BR]) 
            - define Average[row][Studio] = Median[Studio] * Ratio


-  Missing neighborhood references in the price data set
    - After localisation, we found 7 references in the rental neighborhoods without location coordinates. To find the missing data, we :
        - used the Web page https://www.latlong.net/convert-address-to-lat-long.html to manually retrieve the coordinates of 4 of these neighborhoods.
        - translated sectors of neighborhoods into the corresponding list of official neighborhoods and again retrieved coordinates for them.
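The ratio-based filling strategy listed above can be sketched as follows. The function name and the three amounts are hypothetical values chosen for illustration; they are not taken from the Renthop tables.

```python
def estimate_studio_rent(average_1br, median_1br, median_studio):
    """Estimate a missing studio rent from the 1BR rent of the same row.

    Ratio = Average[row][1BR] / Median[1BR]
    Average[row][Studio] = Median[Studio] * Ratio
    """
    ratio = average_1br / median_1br
    return round(median_studio * ratio)


# Hypothetical amounts: a row whose 1BR rent is 80% of the city median
estimate = estimate_studio_rent(average_1br=2400, median_1br=3000, median_studio=2500)
```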

######  Duplicated Data 

- Several duplicated names have been found in the neighborhoods data. The strategy can be :
    - rename the duplicates if it makes sense
    - simply drop one of the duplicated names (first or last occurrence)

###### Data Reworking 

Looking at the Budget column and its 3 values, a 1BR amount is sometimes labelled 'Pricey', for example, while another, higher amount is labelled 'Average'. This is not very consistent.

This is why it is more relevant to rescale this column by associating a more understandable label with the '1BR' price value.

So we will rework the Budget column so that the new categories follow this table :  

|  Label    |    Inf    |  Sup  |
|:---------:|:---------:|:-----:|
| *Cheap*   |     0     | 2149  |
| *Average* |    2150   | 3149  |
| *Pricey*  |    3150   |  '>'  |


So, a new categorical label has been computed and associated with each registered Budget value that was out of the "normal" scale.   
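The relabelling rule of the table above can be written as a short function. This is a sketch with a hypothetical name, `budget_label`, applied to the '1BR' price.

```python
def budget_label(price_1br):
    """Map a 1BR rent to the reworked Budget category."""
    if price_1br < 0:
        raise ValueError("rent cannot be negative")
    if price_1br <= 2149:
        return "Cheap"
    if price_1br <= 3149:
        return "Average"
    return "Pricey"


labels = [budget_label(p) for p in (1625, 2150, 3149, 3150)]
```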

### End of Data Processing 

At this step, all neighborhood and venue information has been collected and processed: data cleaning, missing data, duplicates, etc.

For convenience, these data will be stored in 2 files that can be reloaded into a dataframe at any time for future operations, saving the time spent on all the data processing steps.

## 3. Data Summary 

Now we can summarize what we know about the new dataframe with the added scores for each venue.

- Total number of venues                     = 3508
- Total number of venue attributes           = 12
- Total number of represented neighborhoods  = 81
- Total number of unique venue categories    = 307


## 4. Conclusions : Data Set Preparation


This section ends the data gathering and processing phases.

Hard work and a lot of trials were necessary to retrieve, with APIs and sometimes manually, the full information needed for our analysis.

So now we have all the venues in the retrieved set of 3508 samples, representing 81 neighborhoods in the New York area, with walk scores and rental data.

In the present study, due to lack of time, we will only consider the walk score, not the bike or transit scores, which would also be useful to refine the results we would like to present to the interested stakeholders.

We could also extend the information to all 306 neighborhoods of NYC, but it would take more effort to retrieve, for example, exhaustive rental information from other sources.

The same holds for the number of venues, limited for our profile to 50 per search call, and for the walkability scores, limited to 5000 per day, which is clearly not sufficient to cover in a reasonable time all potential venues existing in all neighborhoods of NYC.  

The strategy concerning missing data and duplicates could also be refined: dropping is not always the most efficient action! We also tried to calculate approximate values but, from my point of view, the results are not entirely satisfactory. Developing more elaborate algorithms needs more time than I was able to dedicate to this study.

This concludes our data gathering phase. 

Let's continue to the next step: how this data will be used for an analysis leading to the best places, in the 81 neighborhoods of NYC, for a student or a person wanting to rent a 1BR apartment in a neighborhood with potential workplaces easily accessible on foot!

# Methodology

In this project we will direct our efforts towards detecting areas of New York City that have the highest density of venues, located in the places with the highest walk scores and the lowest rental rates.

As said in the _**Data**_ section, we will limit our analysis to the areas contained in the 81 neighborhoods of NYC for which we have information about walkability scores and rental rates.

Below, I detail the methodology I used to carry out this project.

- Step 1:

    Previously, we have already collected the required **data: location, category, walk score and average rent for each potential workplace located in 81 neighborhoods of New York City. Each potential workplace has been identified and classified according to Foursquare categorization**.


- Step 2:

    The analysis will be based on the calculation and exploration of '**venues**' across the different neighborhoods of NYC. We will use **heatmaps** to identify a few promising areas with the highest density of venues, the highest walkability scores and the lowest rental values, and focus our attention on those areas.


- Step 3:

    In this final step we will focus on the most promising areas and, within those, create **clusters of locations that meet the requirements** established with the stakeholders at the beginning of the project.
    
    The locations targeted are those with the highest '**walk scores**' and the '**lowest rates of rent**'. These areas will be classified according to the density of venues in the vicinity having the highest rate of **pedestrian friendliness in a radius of 500 meters**, and we want locations **with the lowest rates of rent in a radius of 500 meters**.
    
    A map will be presented to the stakeholders, displaying all such locations grouped in clusters (created using **k-means clustering**). These clusters identify specific target areas in the neighborhoods and constitute a starting directory of addresses for the stakeholders to explore: the best places to find the cheapest home with work in the nearest vicinity.
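To make the clustering step concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is a stand-in for a library implementation, not the code used in this study, and clustering on latitude/longitude pairs only is an assumption made for illustration. The six sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Cluster points with Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn among the points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, labels


# Hypothetical (latitude, longitude) pairs forming two distant groups
points = [(40.70, -74.00), (40.71, -74.01), (40.70, -74.01),
          (40.85, -73.90), (40.86, -73.91), (40.85, -73.91)]
centers, labels = kmeans(points, k=2)
```

In a fuller version, the walk score and the rental price could be added as extra (rescaled) coordinates of each point, so that clusters also reflect the requirements and not only the geography.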

# Analysis

### Basic Exploratory Data Analysis
After data collection, completion and cleaning, a main dataframe gathers all the available data.

For convenience, this dataframe has been split into two others to facilitate data manipulation and analysis :

- **Neighborhoods dataframe**
    - 53 neighborhoods in the range [0, 52] (up to Garment District) each have all 50 retrieved venues and account for venues 0 to 2650, out of a total of 3508 for all 81 neighborhoods.
    - Neighborhoods in the last range [53, 80] (Battery Park City to Forest Park) account for venues 2651 to 3508.

- **Venues dataframe**
    - 3508 retrieved venues, all completed with rental data
  
Nota
------

Unfortunately, due to API limitations, we have no information on the real distribution of venues, because most neighborhoods that contain more than 50 venues are capped at 50 retrieved venues.

In the next section, we focus on graphical analysis and derive some useful tables that tell us about the links between the different properties of the studied areas of NYC.

## 1. Graphical Data Analysis 

Based on the graphical visualisations displayed below, we can define and shape the interesting properties of our dataset relative to the requirements of this study.

###  a. Venue Density 

#### Most neighborhoods (87%, 71 of 81) reach the maximum venue count of 50 

Screenshots of the venue density are displayed below :
***
>>> <table><tr><td><img src="VenueDensity.png" width="400" height="400"/></td><td><img src="NumberVenuesNeighborhood.png" width="400" height="400"/></td></tr></table>
***



###  b. Rental Prices  

#### Neighborhoods 

For a coarse-grained exploration, we first consider neighborhoods.

We explore the data through statistical calculations on rents at neighborhood level.

That could be useful information for the stakeholders.

- Minimum: Smallest number in the dataset                            : min  = 1625
- First quartile: Middle number between the minimum and the median   : 25%  = 2250
- Second quartile (Median): Middle number of the (sorted) dataset    : 50%  = 2950
- Third quartile: Middle number between median and maximum           : 75%  = 3312
- Maximum: Highest number in the dataset                             : max  = 4391

These statistics are computed on 81 values, and the mean of the average rental values is around 2798 (see below) :

***
>>> <img src="AvgRental1BR_boxplot.png" width="400" height="400"/>
***
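Such a summary can be computed with the `statistics` module of the Python standard library, as sketched below. The list `rents` is a toy sample made of the five summary values quoted above, not the 81 actual neighborhood rents, so its mean is not the mean of the study. The `inclusive` method interpolates linearly between data points and treats the sample minimum and maximum as the 0th and 100th percentiles.

```python
import statistics

# Toy sample (assumption): the five summary values, not the real 81 rents
rents = [1625, 2250, 2950, 3312, 4391]

q1, median, q3 = statistics.quantiles(rents, n=4, method="inclusive")
summary = {
    "min": min(rents),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(rents),
    "mean": statistics.mean(rents),
}
```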

That seems to be quite a high amount of money for a student! 

We can also see how the neighborhoods behave relative to the '1BR' rental price or to the 'Budget' values, the categories indicating how the observed rental price compares to what it should be.

***
>>> <table><tr><td><img src="AvgRentalbyNgbh.png" width="400" height="400"/></td><td><img src="AvgRentbyNghb_2.png" width="400" height="400"/></td><td><img src="RentalPriceRepartitionByBudget_catplot.png" width="400" height="400"/></td></tr></table>
***

###  c. Walkability Scores 

For a more accurate, fine-grained exploration, we look at the behaviour of venues.

We explore the data through statistical calculations on rents in areas defined by the walkability scores at venue level, and also take the mean at neighborhood level, as displayed below.

***
>>> <table><tr><td><img src="NgbhMeanWSper1BRPrice_boxplot.png" width="400" height="400"/></td><td><img src="NgbhMeanWSperBudget_catplot.png" width="400" height="400"/></td></tr></table>
***

A more descriptive view is given by this one :
 
>>> <img src="PercVenperWSperBudget_barplot.png" width="600" height="500"/>

## 2. Statistical Exploration Summary 

Below we present a summary of the tables built during the statistical exploration of our datasets on neighborhoods and venues, which helped us to select the data complying with the requirements.

The budget and price levels compare well with the walkability score levels: the areas ranked by budget or price overlap with those ranked by walkability score.

The areas with the highest walkability scores are also the areas with the highest rental prices. 

###  a. Neighborhoods by Walkability Scores and Rental Price Categories  

We can clearly see that more than half of all neighborhoods (45 of 81, i.e. 56%) have relatively affordable prices. 

All 'Pricey' neighborhoods (33 of 81) are highly walkable and represent about 41% of the total.

Let's repeat the same calculation to explore the number of venues, as potential workplaces, relative to the walkability scores.
>>>  <table><tr><td><img src="NumberNgbhbyBudgetByWalkScore.png" width="400" height="500"/></td><td><img src="PercNbyWSbyBudget.png" width="400" height="500"/></td></tr></table>


###  b. Venues Targeted by Walkability Scores and Rental Price Categories 

We can see that almost all areas are highly walkable, and that the higher the walkability score, the higher the rental price.

So, the numbers calculated for venues correlate quite well with those for neighborhoods (not a surprise: the opposite would have been surprising, of course!). About 53% of all venues are located in 'affordable' areas with high walkability scores.

As previously observed, all venues located in 'Pricey' areas are also highly walkable and represent about 46% of the total.

>>>  <table><tr><td><img src="VenueperWSperBudget.png" width="400" height="500"/></td></tr></table>

##  3. Cartographic Localisation Exploration 

####  a. Target Neighborhoods in New York City 

The map below shows the location of the neighborhoods we have in the dataset. It gives a good starting point for comparison with the final results after clustering.

 <table><tr><td><img src="Map_NYC_Neighborhoods.png" width="500" height="500"/></td></tr></table>

####  b. Target Places in New York City 

The next map is the same as above, but shows the venue locations, with colors indicating the category of each place as a function of walk score and pricing.

<img src="Map_NYC_Venues.png" width="500" height="500"/>


####  c. Heatmap of Places in Walk Scored Areas in New York City 

For fun, we explore the venues in a heatmap where areas are colored according to their walkability scores.

<img src="HeatMap_WS_NYC_Venues_2.png" width="500" height="500"/>

####  d. Cartographic Exploration Summary

We have about two thousand venues (1825) located in New York City and a good visualization tool to locate these places, rated according to their walkability, or to observe the rental price of a one-bedroom apartment.

Graphically, several areas stand out (the lists below are not exhaustive) :

- Good candidate places where the rental price is 'Cheap' and where one can live and work without owning a car 
    - Justice Ave (btwn Broadway & 52nd Ave), Elmhurst, NY 11373 (Walker’s Paradise/Cheap:1900)
    - Woodhaven, NY 11421 (Walker’s Paradise/Cheap:1625)
    - Ave (Jamaica Av), Richmond Hill, NY 11418 (Walker’s Paradise/Cheap:1663)
    - Lefferts Blvd, South Ozone Park, NY 11420 (Very Walkable/Cheap:1663)
    - etc...

but :
- Places to avoid if you have limited financial means, even if walkable
    - Ave (at Metropolitan Ave), Brooklyn, NY 11211 (Walker’s Paradise/Pricey:3300)
    - Flatiron District New York, NY 10010 (Walker’s Paradise/Pricey:3552)
    - Grand Ave, Maspeth, NY 11378 (Car-Dependent/Average:2331)
    - (Bld. 292), Brooklyn, NY 11205 (Car-Dependent/Pricey:3312)
    - etc...

Finally, we could continue along the list of addresses provided thanks to Foursquare for the venues, Walk Score for pedestrian friendliness and Renthop for the rental pricing.

All the data used for the analysis come from manual data manipulations.

The challenge now is to automate the composition of the groups of places fitting the requirements.

We will use K-Means clustering for this purpose.

Let us now cluster those locations to create a list of areas around venues that meet all our conditions, i.e. highly walkable, with low rental prices and containing the maximum number of potential workplaces. The list resulting from this processing is the one that can be provided to stakeholders; it is useful for people looking for cheap places to live with a good quality of life. 

Those zones, their centers and addresses will be the final result of our analysis. 

##  3. Analysis : Conclusions 

Interesting information for future tenants!

- Unfortunately, the largest share of the rents is 'Pricey': 41 %
- 1BRs with a 'Cheap' rent are few: 22 %
- Even at 'Average' prices, only a little more than 1/3 can be found: 41 %

We already found that rental prices are distributed as follows:
- Minimum: smallest number in the dataset: **min = 1625**
- First quartile: middle number between the minimum and the median: **25% = 2250**
- Second quartile (median): middle number of the (sorted) dataset: **50% = 2950**
- Third quartile: middle number between the median and the maximum: **75% = 3312**
- Maximum: highest number in the dataset: **max = 4391**

This distribution was computed over 81 neighborhoods, and the mean rental price is around **2798**.
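
A five-number summary like the one above can be computed along these lines. This is only a minimal sketch with the Python standard library: the `rents` list is a toy stand-in made of the five summary values themselves, not the actual 81-neighborhood data set, and the `method="inclusive"` choice is an assumption about how the quartiles were interpolated.

```python
import statistics

# Toy stand-in: the five summary values reported above, NOT the real data set.
rents = [1625, 2250, 2950, 3312, 4391]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # n=4 asks for the three cut points that split the data into quartiles.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


summary = five_number_summary(rents)
print(summary)
```

On the real data, the same function would be applied to the full column of 1BR rents rather than to this five-value list.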

Good news: most of the neighborhoods (78 out of 81) are located in areas with high walk scores (Walker’s Paradise or Very Walkable), and 53 of these 78 reach the maximum number of venues (50), which gives a better chance of finding a job.

So we will concentrate on these areas when proposing addresses. In fact, we would like to build a multi-dimensional table of areas where the best profile would be:
- Highest walkability scores
- Highest number of venues
- Lowest rental prices


|Budget     | Lower bound | Upper bound    |
|-----------|:-----------:|:--------------:|
| *Cheap*   |      0      |      2149      |
| *Average* |    2150     |      3149      |
| *Pricey*  |    3150     | no upper bound |
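
The budget bands in this table can be applied to any rent with a small lookup. The sketch below is one possible way to do it; the names `BOUNDS`, `LABELS` and `budget_category` are illustrative and do not come from the project code.

```python
from bisect import bisect_right

# Lower bounds of the "Average" and "Pricey" bands, taken from the table above.
BOUNDS = [2150, 3150]
LABELS = ["Cheap", "Average", "Pricey"]


def budget_category(rent):
    """Map a monthly 1BR rent to 'Cheap', 'Average' or 'Pricey'."""
    # bisect_right counts how many band boundaries the rent has reached.
    return LABELS[bisect_right(BOUNDS, rent)]


for rent in (1625, 2149, 2150, 3149, 3150, 4391):
    print(rent, budget_category(rent))
```

Keeping the boundaries in one list makes it easy to test other band definitions without touching the function.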


**Summary** 

- Most neighborhoods are _Walker’s Paradise_
- Most neighborhoods can offer the maximum number of workplaces (at least **50**)
- But neighborhoods are expensive: half of the rents range from **2950** (the median) to **4391**
- Finding a cheap rent is difficult, and it costs between **1625** and **2125**
- And finally, the mean rent is quite high, although lower than the median rent: **2798** vs. **2950**

This is why finding a job nearby, in an area with a high walk score, makes it possible to avoid spending money on a car.
This is the next step of the analysis.

So the discriminating factor is clearly the price, and the outcome of this study will be to provide the areas with the highest level of walkability, the highest density of venues and the lowest prices, which can be summarized as below:


  | Budget         | Very Walkable | Walker’s Paradise | Prices            |
  |----------------|:-------------:|:-----------------:|:-----------------:|
  | Cheap          |       6       |        13         | Y < 2150          |
  | Average        |       5       |        30         | 2149 < Y < 3150   |
  | **Walk Score** |  69 < X < 90  |      X >= 90      |                   |
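
A cross-count of this kind (budget band against walkability band) could be produced as sketched below. The thresholds are the ones given in the table; the `rows` list and the neighborhood names `N1` to `N6` are invented for illustration and are not part of the data set.

```python
from collections import Counter


def walk_category(score):
    """Walk Score bands from the table above (X >= 90, 69 < X < 90)."""
    if score >= 90:
        return "Walker's Paradise"
    if score > 69:
        return "Very Walkable"
    return "Other"


def budget_category(rent):
    """Price bands from the table above (Y < 2150, 2149 < Y < 3150)."""
    if rent < 2150:
        return "Cheap"
    if rent < 3150:
        return "Average"
    return "Pricey"


# Invented (neighborhood, walk score, 1BR rent) rows, for illustration only.
rows = [
    ("N1", 95, 1900),
    ("N2", 85, 1663),
    ("N3", 97, 2800),
    ("N4", 60, 2331),
    ("N5", 99, 3552),
    ("N6", 92, 2000),
]

# Count only the combinations the study targets: walkable and not Pricey.
counts = Counter(
    (budget_category(rent), walk_category(score))
    for _, score, rent in rows
    if walk_category(score) != "Other" and budget_category(rent) != "Pricey"
)

for (budget, walk), n in sorted(counts.items()):
    print(f"{budget:8s} {walk:18s} {n}")
```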


# Results

#####  This last phase consists in building the proposal to the stakeholders: a set of neighborhoods and places in New York City combining the criteria set out in the problem statement, i.e. places with no car dependence, a high potential for finding workplaces and the lowest rental prices.

This led us to define the clusters matching our criteria:
- First: at neighborhood level, to get a high-level map of the areas of interest
- Second: at venue level, to define area addresses for a more accurate map

We retrieved and prepared the raw data needed for clustering (Walk Score, number of venues and rental prices):

- for neighborhoods and
- for venues

##  1. Data Sets 

### a. Target Neighborhoods  Data Set

We kept only the 45 neighborhoods (out of 81) where prices are relatively affordable and walkability is high.

### b. Targeted Venues Data Set 

We also selected the venues belonging to the 45 neighborhoods that fit the requirements.

## 2. Clustering New York City Neighborhoods and Venues

We use the same steps to build the clusters for neighborhoods and for venues.

### a. Pre-processing

Before clustering we pre-process the data :
- Removing columns that cannot be used for clustering (categorical, string, etc.)
- Normalizing the data so that features with different magnitudes and distributions are treated equally. We use StandardScaler() to normalize our dataset.

### b. Modeling

To perform neighborhood clustering, we build our model with KMeans, using 4 clusters because this is the number of different groups that fit the requirements. The code used to fit the model is shown below.

> *clusterNum = 4*

> *k_means_v = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)*

> *k_means_v.fit(X_v)*

> *labels_v = k_means_v.labels_*

> *print(len(labels_v), labels_v)*
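
For readers who want to run these steps end to end, here is a self-contained sketch of the pre-processing and modeling described above. The `raw` array is an invented stand-in for the real feature table (walk score, number of venues, 1BR rent), and `random_state=0` is added only to make the run repeatable; neither comes from the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented rows: [walk score, number of venues, 1BR rent]. Not the report's data.
raw = np.array([
    [95, 50, 1900],
    [97, 50, 1625],
    [85, 30, 1663],
    [88, 35, 1700],
    [96, 50, 2800],
    [98, 48, 2950],
    [80, 20, 2600],
    [83, 25, 2500],
], dtype=float)

# Pre-processing: put every feature on a comparable scale.
X_v = StandardScaler().fit_transform(raw)

# Modeling: same settings as quoted above, plus a fixed seed (assumption).
clusterNum = 4
k_means_v = KMeans(init="k-means++", n_clusters=clusterNum, n_init=12,
                   random_state=0)
k_means_v.fit(X_v)

labels_v = k_means_v.labels_
print(len(labels_v), labels_v)
```

Each entry of `labels_v` is the cluster index assigned to the row at the same position in `raw`.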

### c. Clusters Centroids

The results are displayed in the tables below.

***
><table><tr><td><img src="Centroid_Venue.png" width="400" height="500"/></td><td><img src="Centroid_Neighborhood.png" width="400" height="500"/></td></tr></table>
***
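
Because the model is fitted on standardized data, the centroids it returns are expressed in standardized units. The sketch below shows one way they could be mapped back to walk score, venue count and rent so that they can be read directly. It assumes a `StandardScaler` was used; the four-row `raw` array and the use of 2 clusters (instead of the 4 used in this report) are invented to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented rows: [walk score, number of venues, 1BR rent]. Not the report's data.
raw = np.array([
    [95, 50, 1900],
    [97, 50, 1700],
    [72, 20, 3000],
    [74, 22, 3200],
], dtype=float)

scaler = StandardScaler()
X = scaler.fit_transform(raw)

# 2 clusters only because the toy data has 4 rows; random_state is an assumption.
model = KMeans(init="k-means++", n_clusters=2, n_init=12, random_state=0).fit(X)

# Undo the scaling so each centroid reads as walk score / venues / rent.
centroids = scaler.inverse_transform(model.cluster_centers_)
centroids = centroids[np.argsort(centroids[:, 2])]  # order by rent, cheapest first
print(np.round(centroids, 1))
```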


### d. Clusters
Finally, we obtain the clusters displayed below. Once the clusters are built, it is possible to extract the locations in New York City of the areas that meet the different criteria or requirements.

***
>>> <table><tr><td><img src="Clusters_Ngbh_wo_cat.png" width="400" height="400"/></td><td><img src="Clusters_Venues_wo_cat.png" width="400" height="400"/></td></tr></table>
***

### e. Results : Interactive Maps and Heatmaps

Even more powerful is an interactive map, the main result of this project, where each neighborhood can be identified by a color associated with the retained criteria, and a specific place by the group of venues displayed on top of it on the New York City map.

Our result is presented below: an interactive heatmap of places identified by a specific color, with a text popup giving the location and price, in areas also scored by walkability level.

***
>>> <table><tr><td><img src="HeatMap_WS_NYC_Venues_Cluster.png" width="600" height="600"/></td></tr></table>
***

# Discussion

Using the data available for a non-professional project, and therefore limited in number and quality, our analysis shows that we were able to retrieve several thousand pieces of basic information on places (about 3500 venues in New York), representing only 81 of the 306 neighborhoods existing in New York City.

So, unfortunately, these data do not cover the whole of New York City.

However, we succeeded in setting up a tool that collects data, analyses them and finally selects the most appropriate places matching a number of criteria.

All this work was mainly conducted using data science tools available to all data scientists, from beginner to expert level.

Most of the work performed here was to collect and process the data so that it could be used for analysis.

Once the data set was cleaned and prepared in the right format, we were able to extract the points of interest meeting all the criteria defined in the requirements of the problem statement.

These data were filtered from the data collected initially so that they constitute a good starting point for considering the areas fitting the following criteria:

- pedestrian possibilities, or walkability, of a place within a radius of 500 m from the center of the visualized circle
- degree of opportunity to find a job and
- amount of money that tenants are willing to dedicate to renting a 1-bedroom apartment.

Procedures and functions have been developed so that it is possible to select a set of places in New York City that correspond to the criteria presented above.

The selection of places is based on the **K-Means Clustering** algorithm, which avoids many manual, ad hoc operations on the submitted data set.

However, we can observe that the underlying decision process adopted by K-Means is not so automatic and sometimes seems inefficient.

I realized that the main reason is tied to the data:
- first: which data should be provided to the clustering algorithm?
- second: depending on the criteria we want to prioritize (here, for example, walkability, number of potential workplaces or rental amounts), it is also necessary to provide multiplicative coefficients to adapt the weight, or relative importance, attached to each criterion. These coefficients ultimately guide K-Means in its decision making.

Therefore, the process should be improved by trying several values of the coefficients and retaining those which give the best result compared to what is expected given the importance of the chosen criteria.
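
The idea of multiplicative coefficients can be illustrated with a small experiment. This is a sketch under stated assumptions rather than the method used in this project: the helper `weighted_kmeans`, the weight vectors and the four-row `raw` array are all hypothetical. Each standardized column is multiplied by a weight before clustering, so a column with a larger weight contributes more to the distances that K-Means minimizes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def weighted_kmeans(raw, weights, n_clusters, seed=0):
    """Standardize the columns, scale each one by its weight, then cluster."""
    X = StandardScaler().fit_transform(raw) * np.asarray(weights, dtype=float)
    model = KMeans(init="k-means++", n_clusters=n_clusters, n_init=12,
                   random_state=seed).fit(X)
    return model.labels_


# Invented rows: [walk score, number of venues, 1BR rent].
# Rows 0 and 1 are cheap, rows 0 and 2 are the most walkable.
raw = np.array([
    [95, 50, 1700],
    [72, 48, 1750],
    [96, 20, 3100],
    [73, 22, 3150],
], dtype=float)

# Prioritizing rent groups the rows by price band ...
by_rent = weighted_kmeans(raw, weights=[0.1, 0.1, 1.0], n_clusters=2)
# ... while prioritizing walk score groups them by walkability.
by_walk = weighted_kmeans(raw, weights=[1.0, 0.1, 0.1], n_clusters=2)

print("rent first:", by_rent)
print("walk first:", by_walk)
```

On this toy data the two weight vectors produce different groupings of the same rows, which is the effect described above: the coefficients decide which criterion drives the clusters.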

Finally, the results can be consulted in 2 formats:
 - a textual list of addresses located around the venues chosen to mark an area of interest
 - visual interactive maps or heatmaps that directly guide the user to the points of interest, colored according to the K-Means clustering results.

# Conclusion

We finally succeeded in manually partitioning the candidate places into 4 categories.

We tried to reproduce this categorisation using clustering methods. The clusters obtained do not exactly match those obtained manually.

The clustering parameters should be tuned more carefully. In particular, it should be possible to adjust them to reflect the importance given to each criterion, which can change the nature and composition of the clusters.

But it is a promising track.

This process could surely be automated by using more advanced data science tools, such as neural networks, to learn the needed parameters, but this was not the purpose of this work.

As a result, we can already propose a list of addresses of places, summarized on interactive maps of New York City, that would meet the needs of customers.

Finally, based on the results of this work, we can build an automatic recommendation tool that could interest the following populations:

- Students who would like to find a job close to their place of residence
- People who do not own a personal vehicle, either because of ecological convictions or low income, but who love lively places with lots of restaurants, cafes, shops, cultural places, etc., and who are therefore interested in living in a place with a high walkability index.
- Companies that build search engines for rental accommodation agencies
- Reception and support centers for students

This concludes this Data Science application project.