![alt text](https://raw.githubusercontent.com/icsouza68/Coursera_Capstone/master/header.jpg "Logo Title Text 1")

# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

Very often, people need to move from one place to another. A new job, raising a family and college admission, for example, are some of the most common reasons for moving around. A change can also take place due to external problems: armed conflicts, poverty, natural disasters, political persecution, etc. But, when arriving at the new location, a question that may eventually arise is: where is the best place to live in this city?

Each individual has unique preferences and needs, which may vary over time, but it's reasonable and legitimate to think that everyone would like to live in a neighborhood that best suits their current expectations. A married person with children may prefer to live closer to where there are more schools and parks, for example. A young single person may prefer somewhere better served by public transport. A couple with no children may prefer to live near restaurants. If the couple is Italian, probably their favourite restaurants would be Italian, not Indian, for instance. The desired combinations of amenities are virtually endless.

Living close to places that match one's current needs and preferences helps maximize personal and family happiness. From a philosophical point of view, the pursuit of happiness has long been a subject of study for great philosophers, from Aristotle through Kant to John Stuart Mill, to name a few.

Beyond the philosophical aspect, and taking a more practical approach, moving from one place to another can cause personal or family problems of varying severity. Lower work productivity (or even unemployment), emotional instability, and disagreements with a spouse or children, among other problems, can stem from failing to adapt to a location that lacks the essential amenities an individual or family needs. On the other hand, good adaptation means greater personal and family fulfillment.

It is true, however, that happiness depends on many other aspects that are not related to living in a good neighborhood, but all these aspects are beyond the scope of this work.
Having said that, the question to be answered by this project is: **Considering a person's needs, which New York City neighborhoods would be most compatible with him/her?**

### People who might be interested in this kind of information
The answer to this question may be of interest not only to those seeking a place to live for themselves or their families, but also to public offices that help immigrants or refugees settle, and to private businesses such as real estate offices or companies recruiting professionals from abroad. Any entity responsible for advising someone on finding a residence in New York City may be interested in this project.


## Data <a name="data"></a>

### Required Data
For the execution of this project, we will need:
- Data on the New York City boroughs and their neighborhoods, which was made available during the course;
- **Foursquare** venues category list, which can be obtained by simply calling an endpoint (https://api.foursquare.com/v2/venues/categories);
- List of places mapped by **Foursquare** in each New York City neighborhood, obtained through **Foursquare API** calls;
- List of priorities for **Foursquare** categories, provided by the user, which will be used to generate a score for each neighborhood.
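
As a rough illustration of the category-list request mentioned above, the sketch below builds the URL for the categories endpoint. It assumes the v2 userless authentication scheme (`client_id`, `client_secret`, and a `v` version date passed as query parameters). The function names and the default version date are placeholders chosen for this example, not part of the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CATEGORIES_ENDPOINT = "https://api.foursquare.com/v2/venues/categories"


def build_categories_url(client_id, client_secret, version="20180605"):
    """Return the categories endpoint URL with the query parameters attached."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,  # API version date in YYYYMMDD form (placeholder default)
    }
    return f"{CATEGORIES_ENDPOINT}?{urlencode(params)}"


def fetch_categories(client_id, client_secret):
    """Download the category tree (performs a network call)."""
    with urlopen(build_categories_url(client_id, client_secret)) as response:
        payload = json.load(response)
    # Assumed response shape: {"response": {"categories": [...]}}
    return payload["response"]["categories"]
```

Calling `fetch_categories` requires valid Foursquare credentials; `build_categories_url` can be checked offline.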

### How data will be used

- The user will specify the categories he or she considers important in a neighborhood and assign each a priority:

   - 3: very important
   - 2: important
   - 1: not so important
 

- The user can choose as many categories as he or she likes, giving each a priority;
- The system will obtain the list of boroughs of the city and, for each borough, its neighborhoods;
- For each neighborhood, the system will get the list of venues within the categories that the user chose;
- For each venue, the system will calculate a weight based on the priority the user gave to its category;
- The system will then group the venues by neighborhood and sum their weights to obtain a score for each neighborhood;
- With these scores, it will be possible to cluster the neighborhoods;
- A city map will be plotted showing the neighborhoods in colors that indicate their scores;
- By color, the user can identify the neighborhoods that best match what he or she considers relevant and choose the preferred one.
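
The weighting and grouping steps above can be sketched in a few lines. This is a minimal illustration that assumes a venue's weight is simply the priority of its category. The neighborhoods, categories, and the `score_neighborhoods` helper are hypothetical sample data and names, not the project's actual implementation.

```python
from collections import defaultdict

# Hypothetical user priorities: category name -> weight (3, 2 or 1).
priorities = {"School": 3, "Park": 2, "Italian Restaurant": 1}

# Hypothetical venues as (neighborhood, category) pairs.
venues = [
    ("Astoria", "Park"),
    ("Astoria", "Italian Restaurant"),
    ("Astoria", "School"),
    ("Chelsea", "Italian Restaurant"),
    ("Chelsea", "Bakery"),  # not selected by the user, so it adds nothing
]


def score_neighborhoods(venues, priorities):
    """Sum, per neighborhood, the weights of venues in the chosen categories."""
    scores = defaultdict(int)
    for neighborhood, category in venues:
        scores[neighborhood] += priorities.get(category, 0)
    return dict(scores)


scores = score_neighborhoods(venues, priorities)
print(scores)  # {'Astoria': 6, 'Chelsea': 1}
```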

### Important notice about Foursquare venue categories

- Each venue in Foursquare has at least a main category and a subcategory. For example, the "Metro Station" category is a subcategory of the "Travel & Transport" main category;
- Foursquare advises that the list of categories may change slightly over time;
- Some subcategories are further subdivided into narrower categories. The "Food" main category has a subcategory "Italian Restaurant", which is subdivided into more than 20 different cuisines ("Calabria Restaurant", "Veneto Restaurant", "Puglia Restaurant" and so on);
- In an extreme case, categories can reach up to four levels: "Outdoors & Recreation" -> "Athletics & Sports" -> "Gym / Fitness Center" -> ("Boxing Gym", "Climbing Gym", "Cycle Studio", "Gym Pool", etc.);
- When we use Foursquare to search for venues near a location using category criteria, it may return categories that were not specified but are related to one of those selected. For instance, if you ask Foursquare for all "Italian Restaurant" venues, it will return them along with all "Pizza Place" venues, even though "Pizza Place" does not belong to the "Italian Restaurant" category. It is an independent category that Foursquare nonetheless relates to "Italian Restaurant";
- In this project, the user can select only subcategories (level 2) of main categories (level 1), not the narrower categories (level 3 and below) nested under them. However, the system will handle the entire category chain returned by Foursquare by assigning every category at the second, third, or fourth level to its second-level ancestor.
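
One possible way to collapse the category chain is a recursive walk over the category tree. The sketch below assumes each category is a dictionary with a `name` and a nested `categories` list, which mirrors the general shape of the Foursquare response. The sample tree is a hand-written excerpt, and `map_to_second_level` is a hypothetical helper name.

```python
def map_to_second_level(categories, level=1, ancestor=None, mapping=None):
    """Map every category at level >= 2 to the name of its level-2 ancestor."""
    if mapping is None:
        mapping = {}
    for category in categories:
        # At level 2 the category is its own ancestor; deeper levels inherit it.
        current = category["name"] if level == 2 else ancestor
        if level >= 2:
            mapping[category["name"]] = current
        map_to_second_level(
            category.get("categories", []), level + 1, current, mapping
        )
    return mapping


# Hand-written excerpt of the category tree, for illustration only.
tree = [
    {"name": "Outdoors & Recreation", "categories": [
        {"name": "Athletics & Sports", "categories": [
            {"name": "Gym / Fitness Center", "categories": [
                {"name": "Boxing Gym", "categories": []},
                {"name": "Climbing Gym", "categories": []},
            ]},
        ]},
    ]},
    {"name": "Food", "categories": [
        {"name": "Italian Restaurant", "categories": [
            {"name": "Puglia Restaurant", "categories": []},
        ]},
        {"name": "Pizza Place", "categories": []},
    ]},
]

second_level = map_to_second_level(tree)
print(second_level["Boxing Gym"])         # Athletics & Sports
print(second_level["Puglia Restaurant"])  # Italian Restaurant
```

Keying the mapping by name keeps the example short. Keying by category `id` would be safer if two categories in the tree ever share a name.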

### Example of how data will be used

- The first step is for the user to select which boroughs will be part of the search;
- The second step is to select all subcategories he or she considers relevant and set a "weight" for each one of them:

   - 3: very important
   - 2: important
   - 1: not so important
 

- Next, the system will get all the categories from Foursquare using the endpoint (https://api.foursquare.com/v2/venues/categories), which is a basic authenticated call to the Foursquare API, and process them to build a data structure that lets us look up any subcategory throughout the execution of the system;
- Then we will create a dataframe with one column for each category that the user selected and one row for each neighborhood within the selected boroughs;
- We will group the dataframe by neighborhood and count how many venues of each category exist in each neighborhood; then we will normalize the data and compute the final rating for each category;
- We will also create dedicated columns for clustering and rating;
- The system will plot two Folium maps:

 1. a map with the clusters
 2. a map with the final rating of each neighborhood, grouped by a range of the ratings
 
- The rating of each neighborhood measures how close it is to the user's ideal place to live: the higher, the better.
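
The normalization and rating steps are not specified in detail here, so the following is only a sketch under stated assumptions: venue counts are min-max normalized per category, and the neighborhood rating is the weighted average of the normalized values. The function names and sample counts are hypothetical.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are equal, so the category cannot discriminate.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def rate_neighborhoods(counts, weights):
    """Compute a rating per neighborhood.

    counts:  neighborhood -> {category: venue count}
    weights: category -> user weight (1, 2 or 3)
    """
    names = list(counts)
    ratings = {name: 0.0 for name in names}
    for category, weight in weights.items():
        column = [counts[name].get(category, 0) for name in names]
        for name, value in zip(names, min_max_normalize(column)):
            ratings[name] += weight * value
    total_weight = sum(weights.values())
    return {name: rating / total_weight for name, rating in ratings.items()}


# Hypothetical venue counts for three neighborhoods.
counts = {
    "A": {"Park": 4, "School": 2},
    "B": {"Park": 0, "School": 2},
    "C": {"Park": 2, "School": 0},
}
weights = {"Park": 2, "School": 3}

ratings = rate_neighborhoods(counts, weights)
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(name, round(rating, 2))
```

With this scheme every rating falls between 0 and 1, which makes it straightforward to group neighborhoods into rating ranges for the map.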

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing restaurants in the neighborhood (any type of restaurant)
* number of and distance to Italian restaurants in the neighborhood, if any
* distance of neighborhood from city center

We decided to use a regularly spaced grid of locations, centered around the city center, to define our neighborhoods.

The following data sources will be needed to extract or generate the required information:
* centers of candidate areas will be generated algorithmically, and approximate addresses of those centers will be obtained using **Google Maps API reverse geocoding**
* the number of restaurants in every neighborhood, along with their type and location, will be obtained using the **Foursquare API**
* the coordinates of the Berlin city center will be obtained using **Google Maps API geocoding** of a well-known Berlin location (Alexanderplatz)

### Neighborhood Candidates

Let's create latitude and longitude coordinates for the centroids of our candidate neighborhoods. We will create a grid of cells covering our area of interest, which is approximately 12x12 kilometers centered around the Berlin city center.

Let's first find the latitude and longitude of the Berlin city center, using a specific, well-known address and the Google Maps geocoding API.

In this project we will try to find an optimal location for a restaurant. Specifically, this report will be targeted to stakeholders interested in opening an **Italian restaurant** in **Berlin**, Germany.

Since there are lots of restaurants in Berlin, we will try to detect **locations that are not already crowded with restaurants**. We are also particularly interested in **areas with no Italian restaurants in the vicinity**. We would also prefer locations **as close to the city center as possible**, assuming that the first two conditions are met.

We will use our data science powers to generate a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by stakeholders.