<h1>Data Science Capstone Project</h1>
<h3><em>by: Klyde Jasper Jose</em></h3>

<h2>Capstone Instructions</h2>
<p>
Now that you have been equipped with the skills and the tools to use location data to explore a geographical location, over the course of two weeks, you will have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve. If you cannot think of an idea or a problem, here are some ideas to get you started:
<ol>
    <li>In Module 3, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. One interesting idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city? I will leave it to you to refine this idea.</li>
    <li>In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they set up their office?</li>
</ol>
These are just a couple of many ideas and problems that can be solved using location data in addition to other datasets. No matter what you decide to do, make sure to provide sufficient justification of why you think what you want to do or solve is important and why would a client or a group of people be interested in your project.
</p>

<h2>Review Criteria</h2>
<p>
This capstone project will be graded by your peers. This capstone project is worth 70% of your total grade. The project will be completed over the course of 2 weeks. Week 1 submissions will be worth 30% whereas Week 2 submissions will be worth 40% of the total grade.

In Week 1, the following are required to be submitted:
<ol>
<li>A description of the problem and a discussion of the background. (15 marks)</li>
<li>A description of the data and how it will be used to solve the problem. (15 marks)</li>
</ol>

For the second week, the final deliverables of the project will be:
<ol>
    <li>A link to the Notebook on its respective GitHub repository, showing the code. (15 marks)</li>
    <li>A full report consisting of all of the following components (15 marks):</li>
    <ul>
        <li>Introduction where you discuss the business problem and who would be interested in this project.</li>
        <li>Data where you describe the data that will be used to solve the problem and the source of the data.</li>
        <li>Methodology section, which represents the main component of the report, where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.</li>
        <li>Results section where you discuss the results.</li>
        <li>Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.</li>
        <li>Conclusion section where you conclude the report.</li>
    </ul>
    <li>Your choice of a presentation or blogpost. (10 marks)</li>
</ol>
</p>

<h2>Introduction/Business problem</h2>

<p>Metro Manila, often simply called Manila, is the National Capital Region and the prime tourist destination of the Philippines. It comprises 17 cities and municipalities, including the capital, the City of Manila. Though it is the smallest region in the country, Metro Manila is the most populous of the twelve defined metropolitan areas in the Philippines and the 19th most populous in the world (<a href="https://www.visualcapitalist.com/most-populous-cities-in-the-world/">Koop, 2021</a>). As the capital region, Metro Manila is considered the center of commerce, education, and entertainment in the country.


Having lived in the south of Manila for most of my life, I want to know the most common leisure activities available in the metro. Additionally, I want to know the most common cuisines in the city so that I can determine the most commonly used vegetables and fruits, with the goal of starting a farm one day.
</p>


<h2>Description of the data</h2>

<p>I will, as requested by the assignment task, use Foursquare data about venues in Metro Manila. Foursquare is a US tech company from New York focusing on location data. Its technology and data power apps such as Apple Maps, Uber, Twitter and many other household names. I will use Foursquare data such as the venue name, location, and category (for example, Coffee Shop or Donut Shop).

I will also use a spreadsheet listing the cities of Metro Manila with their 2015 population (the notebook references the Philippine Statistics Authority, https://psa.gov.ph, as its source), and the Google Geocoding API to obtain the latitude and longitude of each city.</p>
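
<p>To make the data description concrete, the sketch below shows how items of a Foursquare-style venue response could be flattened into the fields mentioned above. The sample payload and the helper <code>items_to_frame</code> are hypothetical: the venue values are made up, and only the nesting mirrors what the notebook's code reads later.</p>

```python
import pandas as pd

# Hypothetical, trimmed payload: values are illustrative, only the nesting matters.
sample_items = [
    {"venue": {"id": "abc123", "name": "Sample Cafe",
               "location": {"lat": 14.60, "lng": 120.98},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"id": "def456", "name": "Sample Park",
               "location": {"lat": 14.58, "lng": 121.03},
               "categories": [{"name": "Park"}]}},
]

def items_to_frame(items):
    """Flatten venue items into one row per venue."""
    rows = [{
        "Venue ID": it["venue"]["id"],
        "Venue": it["venue"]["name"],
        "Venue Latitude": it["venue"]["location"]["lat"],
        "Venue Longitude": it["venue"]["location"]["lng"],
        "Venue Category": it["venue"]["categories"][0]["name"],
    } for it in items]
    return pd.DataFrame(rows)

venues = items_to_frame(sample_items)
print(venues)
```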

<h2>Methodology</h2>

<b>This notebook will be used to complete the IBM Data Science Professional Certificate on Coursera.</b>

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


In [3]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import urllib.request
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library
from IPython.display import Image
from IPython.display import HTML
import ipython_config # local module used to keep API keys and other sensitive data out of the notebook

In [4]:
data_url="https://psa.gov.ph/sites/default/files/attachments/hsd/pressrelease/2015_Table%201_Legislative%20Districts.xlsx"
cities_borough_url = r"D:\Desktop\pythonLearning\Coursera_Capstone\NCR.xlsx"
agriculutral_accounts = "https://psa.gov.ph/system/files/01Summary_2018PSNA_Q42020_8.xlsx"

#References:
#https://psa.gov.ph/national-accounts/base-2018/estimates

Load Data

In [5]:
NCR_data = pd.read_excel(cities_borough_url)

In [6]:
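# name the coordinate columns and drop the unused trailing column
NCR_data=NCR_data.rename(columns={'Unnamed: 2': 'Latitude', 'Unnamed: 3':'Longtitude'})
NCR_data=NCR_data.drop(['Unnamed: 4'], axis=1)
# drop rows 17 to 25 so that only the first 17 rows (one per city) remain
NCR_data.drop(index=list(range(17,26)), inplace=True)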

In [7]:
NCR_data

Unnamed: 0,City,2015 Population,Latitude,Longtitude
0,CITY OF MANILA,1780148.0,,
1,CITY OF MANDALUYONG,386276.0,,
2,CITY OF MARIKINA,450741.0,,
3,CITY OF PASIG,755300.0,,
4,QUEZON CITY,2936116.0,,
5,CITY OF SAN JUAN,122180.0,,
6,CALOOCAN CITY,1583978.0,,
7,CITY OF MALABON,365525.0,,
8,CITY OF NAVOTAS,249463.0,,
9,CITY OF VALENZUELA,620422.0,,


In [8]:
#API Key
geocoders_APIkey = ipython_config.geocoders_APIkey
foursquare_ID = ipython_config.foursquare_ID
foursquare_secret= ipython_config.foursquare_secret
foursquare_version = '20210315'
foursquare_limit = 100
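
The credentials above come from the local `ipython_config` module, which is not shown in this notebook. As a rough sketch of one alternative way to keep keys out of the code, the helper below reads them from environment variables. The function `load_credentials` and the variable names are hypothetical, not part of the original setup.

```python
import os

def load_credentials(env=os.environ):
    """Collect API credentials from environment variables (names are hypothetical)."""
    names = ["GEOCODERS_APIKEY", "FOURSQUARE_ID", "FOURSQUARE_SECRET"]
    missing = [n for n in names if n not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {n.lower(): env[n] for n in names}

# Demonstration with a stand-in mapping instead of the real environment
demo = load_credentials({"GEOCODERS_APIKEY": "g", "FOURSQUARE_ID": "i", "FOURSQUARE_SECRET": "s"})
print(sorted(demo))
```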

In [9]:
for ind in NCR_data.index:
    # build a geocoding query such as "CITY OF MANILA, Philippines"
    address = str(NCR_data.at[ind, 'City']) + ", Philippines"
    parameters = {
        "key": geocoders_APIkey,
        "address": address
    }
    response = requests.get("https://maps.googleapis.com/maps/api/geocode/json", params=parameters)

    # take the first match and read its coordinates
    data = response.json()["results"][0]["geometry"]
    lat = data["location"]["lat"]
    lng = data["location"]["lng"]

    # write the coordinates back into the matching row
    NCR_data.at[ind, 'Latitude'] = lat
    NCR_data.at[ind, 'Longtitude'] = lng
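
The geocoding loop indexes `["results"][0]` directly, which raises an `IndexError` when an address has no match. A minimal sketch of a more defensive parser is shown below. The helper `extract_lat_lng` and both sample payloads are hypothetical; they only imitate the shape of the response parsed in the loop.

```python
def extract_lat_lng(payload):
    """Return (lat, lng) from a Geocoding-style JSON payload, or (None, None)."""
    results = payload.get("results", [])
    if payload.get("status") != "OK" or not results:
        return None, None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hypothetical payloads shaped like the response handled in the loop
ok_payload = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 14.599512, "lng": 120.984219}}}],
}
empty_payload = {"status": "ZERO_RESULTS", "results": []}

print(extract_lat_lng(ok_payload))
print(extract_lat_lng(empty_payload))
```

With a helper like this, a city that cannot be geocoded keeps empty coordinates instead of stopping the loop.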

In [10]:
NCR_data.head()

Unnamed: 0,City,2015 Population,Latitude,Longtitude
0,CITY OF MANILA,1780148.0,14.599512,120.984219
1,CITY OF MANDALUYONG,386276.0,14.579444,121.035917
2,CITY OF MARIKINA,450741.0,14.65073,121.102855
3,CITY OF PASIG,755300.0,14.576377,121.08511
4,QUEZON CITY,2936116.0,14.676041,121.0437


In [11]:
# create map of Metro Manila, centered on the first city in the table
NCR_map = folium.Map(location=[NCR_data.at[0, 'Latitude'], NCR_data.at[0, 'Longtitude']], zoom_start=11)

# add markers to map
for lat, lng, city in zip(NCR_data['Latitude'], NCR_data['Longtitude'], NCR_data['City']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(NCR_map)  
    
NCR_map

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius):
    
    venues_list=[]
    short_cat_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            foursquare_ID, 
            foursquare_secret,  
            foursquare_version, 
            lat, 
            lng, 
            radius, 
            foursquare_limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
        # derive a broad category (e.g. "food") from the sixth '/'-separated part of the icon URL prefix
        for w in results:
            prefix = w['venue']['categories'][0]['icon']['prefix']
            short_cat = prefix.split(sep='/')
            short_cat_list.append(short_cat[5])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    nearby_venues['Short Category']=short_cat_list
    
    return nearby_venues
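
The function derives the short category by splitting the icon prefix on `/` and taking a fixed position, which depends on the exact URL layout. The sketch below looks for the segment after `categories_v2` instead. The helper `short_category` is hypothetical, and the example prefix is an assumed format rather than confirmed API output.

```python
from urllib.parse import urlparse

def short_category(icon_prefix):
    """Return the path segment after 'categories_v2', or None if absent."""
    segments = [s for s in urlparse(icon_prefix).path.split("/") if s]
    if "categories_v2" in segments:
        idx = segments.index("categories_v2")
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None

# Assumed prefix format; the real API response may differ.
example = "https://ss3.4sqi.net/img/categories_v2/food/donuts_"
print(short_category(example))
```

For a prefix in this assumed format, the result matches the fixed-index approach used in the function, while unexpected URLs yield `None` rather than a wrong label or an `IndexError`.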

In [15]:
NCR_venues = getNearbyVenues(NCR_data['City'], NCR_data['Latitude'], NCR_data['Longtitude'], 3000) # radius of 3000 m
NCR_venues.shape

CITY OF MANILA
CITY OF MANDALUYONG
CITY OF MARIKINA
CITY OF PASIG
QUEZON CITY
CITY OF SAN JUAN
CALOOCAN CITY 
CITY OF MALABON
CITY OF NAVOTAS
CITY OF VALENZUELA
CITY OF LAS PIÑAS
CITY OF MAKATI
CITY OF MUNTINLUPA
CITY OF PARAÑAQUE
PASAY CITY
PATEROS
TAGUIG CITY


(1617, 8)

In [16]:
NCR_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Short Category
0,CITY OF MANILA,14.599512,120.984219,Krispy Kreme,14.601195,120.982774,Donut Shop,food
1,CITY OF MANILA,14.599512,120.984219,BonChon Chicken,14.601194,120.982791,Fried Chicken Joint,food
2,CITY OF MANILA,14.599512,120.984219,98B,14.598836,120.979435,Public Art,arts_entertainment
3,CITY OF MANILA,14.599512,120.984219,The Den,14.598827,120.97945,Coffee Shop,food
4,CITY OF MANILA,14.599512,120.984219,Minor Basilica of St. Lorenzo Ruiz of Manila (...,14.599935,120.974646,Church,building


In [18]:
NCR_venues[['City','Venue Category']].groupby('City').count()

Unnamed: 0_level_0,Venue Category
City,Unnamed: 1_level_1
CALOOCAN CITY,100
CITY OF LAS PIÑAS,100
CITY OF MAKATI,100
CITY OF MALABON,88
CITY OF MANDALUYONG,100
CITY OF MANILA,100
CITY OF MARIKINA,90
CITY OF MUNTINLUPA,100
CITY OF NAVOTAS,50
CITY OF PARAÑAQUE,100


In [30]:
NCR_venues[['Short Category']].value_counts()

Short Category    
food                  1078
shops                  323
arts_entertainment      54
building                45
parks_outdoors          42
travel                  39
nightlife               33
education                3
dtype: int64
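
To put these counts in proportion, the sketch below converts them to percentage shares. The numbers are copied from the output above; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "food": 1078, "shops": 323, "arts_entertainment": 54, "building": 45,
    "parks_outdoors": 42, "travel": 39, "nightlife": 33, "education": 3,
})

# Share of each short category among all venues, in percent
share = (counts / counts.sum() * 100).round(1)
print(share)
```

Food venues make up about two thirds of the 1,617 venues returned.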

In [23]:
# one hot encoding
NCR_onehot = pd.get_dummies(NCR_venues[['Short Category']], prefix="", prefix_sep="")

# add city column back to dataframe
NCR_onehot['City'] = NCR_venues['City']

# move city column to the first column
fixed_columns = [NCR_onehot.columns[-1]] + list(NCR_onehot.columns[:-1])
NCR_onehot = NCR_onehot[fixed_columns]

NCR_onehot.shape

(1617, 9)

In [24]:
NCR_onehot.head()

Unnamed: 0,City,arts_entertainment,building,education,food,nightlife,parks_outdoors,shops,travel
0,CITY OF MANILA,0,0,0,1,0,0,0,0
1,CITY OF MANILA,0,0,0,1,0,0,0,0
2,CITY OF MANILA,1,0,0,0,0,0,0,0
3,CITY OF MANILA,0,0,0,1,0,0,0,0
4,CITY OF MANILA,0,1,0,0,0,0,0,0
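
A possible next step with a one-hot table like this is to average the indicator columns per city, which gives the share of each category among that city's venues. The sketch below uses a small toy frame (`toy_onehot`, with made-up rows) as a stand-in for `NCR_onehot`; it is an illustration, not the notebook's own result.

```python
import pandas as pd

# Toy stand-in for NCR_onehot: one row per venue, one indicator column per category
toy_onehot = pd.DataFrame({
    "City": ["CITY OF MANILA"] * 4 + ["QUEZON CITY"] * 2,
    "food": [1, 1, 0, 1, 0, 1],
    "shops": [0, 0, 0, 0, 1, 0],
    "arts_entertainment": [0, 0, 1, 0, 0, 0],
})

# Mean of 0/1 indicators per city = share of that city's venues in each category
toy_grouped = toy_onehot.groupby("City").mean().reset_index()
print(toy_grouped)
```

Each row of the grouped frame sums to 1 across the category columns, so cities with different venue counts become comparable.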


In [19]:
print('There are {} unique categories.'.format(len(NCR_venues['Venue Category'].unique())))

There are 198 unique categories.


NCR_venue_cat = NCR_venues['Venue Category'].value_counts().to_frame()

In [20]:
#pd.set_option('display.max_rows', 200)
#pd.reset_option('all')

NCR_venue_cat = pd.read_excel(ipython_config.venue_cat_data_xlsx)
NCR_venue_cat.head()

In [21]:
venue_map = folium.Map(location=[NCR_data.at[0, 'Latitude'], NCR_data.at[0, 'Longtitude']], zoom_start=11)

# add markers to map
for lat, lng, venue in zip(NCR_venues['Venue Latitude'], NCR_venues['Venue Longitude'], NCR_venues['Venue Category']):
    label = '{}'.format(venue)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(venue_map)  
    
venue_map