# The battle of the neighborhoods

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## 1. Introduction: Business Problem <a name='introduction'></a>
Philadelphia, Pennsylvania’s largest city, is one of the most historic cities in America. It has been at the forefront of the nation's intellectual, economic, and humanitarian development for more than three hundred years. Philadelphia offers the advantages of living in a big city while maintaining a small-town atmosphere and preserving reminders of its dignified past. The Greater Philadelphia area has appeared on numerous best-city lists as a good place to balance work and family life.

The objective of this capstone project is to use data science methodology and machine learning techniques such as clustering to answer one question: which neighborhood of Philadelphia would be a better place to live?

## 2. Data <a name='data'></a>

The **[OpenDataPhilly site](https://www.opendataphilly.org/dataset)** is a great repository of information about the city. The most informative datasets I found regarding Philadelphia are the following:

* **[Neighborhoods Data](https://www.opendataphilly.org/dataset/philadelphia-neighborhoods)**: This dataset contains the 158 neighborhoods of Philadelphia and their geographical data. From the [Neighborhoods Shapefile](https://github.com/azavea/geo-data) repository, I downloaded `Neighborhoods_Philadelphia.geojson` and use it as my location dataset.


* **[Crime Incidents](https://www.opendataphilly.org/dataset/crime-incidents)**: This dataset contains information about crime incidents and their geographical locations. It can be collected through an [open API](https://cityofphiladelphia.github.io/carto-api-explorer/#incidents_part1_part2). The data comes from the Philadelphia Police Department. Crimes range from violent offenses such as aggravated assault, rape, and arson to offenses such as simple assault, prostitution, gambling, and fraud, along with other non-violent offenses.


* **[Venues Data](https://developer.foursquare.com/docs/api-reference/venues/search/)**: I use the Foursquare API to get venue data around the geolocation of each Philadelphia neighborhood.

    
**Note:** To reduce the number of API calls, I export the collected data to CSV files in the `data` folder and reload them from there for later use.
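
The caching idea in the note can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: `load_or_fetch_rows` and `fake_fetch` are hypothetical names, and the notebook itself uses `DataFrame.to_csv` and `pd.read_csv` for the same purpose.

```python
import csv
import os
import tempfile


def load_or_fetch_rows(path, fetch_rows):
    """Return rows cached at `path`; call `fetch_rows()` only when no cache exists."""
    if not os.path.exists(path):
        rows = fetch_rows()
        if not rows:
            return []  # nothing to cache
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    # always read back from the file so both code paths return the same types
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))


# hypothetical stand-in for an expensive API call
calls = []

def fake_fetch():
    calls.append(1)
    return [{'neighborhood': 'Academy Gardens', 'venue': 'Crown Deli'}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'data', 'df_venues.csv')
    first = load_or_fetch_rows(cache_path, fake_fetch)   # fetches and writes the file
    second = load_or_fetch_rows(cache_path, fake_fetch)  # served from the CSV
```

Values read back from a CSV are strings, so numeric columns would need converting after loading.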

In [1]:
import requests
import urllib.request
from urllib.parse import unquote

import pandas as pd
import geopandas as gpd
import numpy as np
import io
import json

from datetime import datetime, date

from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
from geopy.geocoders import GoogleV3

from sklearn.cluster import KMeans

import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries imported.')

Libraries imported.


### 2.1 Get Locations Data

In [8]:
# get locator of Philadelphia to generate map
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent = 'Philadelphia')
geo_philly = geolocator.geocode('Philadelphia')

In [10]:
# load geographical data
df_philly = gpd.read_file('data/Neighborhoods_Philadelphia.geojson')
# preprocess data, get center point latitude and longitude
df_philly['center_longitude'] = df_philly['geometry'].centroid.x
df_philly['center_latitude'] = df_philly['geometry'].centroid.y
# sort by name, rename the name column, and keep only the columns used later
df_philly = df_philly.sort_values('mapname').reset_index(drop=True)
df_philly = df_philly.rename(columns={'mapname': 'neighborhood'})
df_philly = df_philly[['neighborhood', 'geometry', 'center_latitude', 'center_longitude']]
print(df_philly.shape)
df_philly.head(2)


(158, 4)


| | neighborhood | geometry | center_latitude | center_longitude |
|---|---|---|---|---|
| 0 | Academy Gardens | MULTIPOLYGON (((-75.00719 40.06923, -75.00290 ... | 40.061186 | -75.003104 |
| 1 | Airport | MULTIPOLYGON (((-75.25759 39.87626, -75.25174 ... | 39.883098 | -75.218305 |
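
The center point of each neighborhood is the centroid of its polygon. As a rough sketch of what that means for a single simple ring, the shoelace-based centroid formula can be written in plain Python. The function name `polygon_centroid` and the rectangle below are illustrative assumptions; geopandas handles multipolygons and holes, and it treats longitude/latitude as planar coordinates in the same way, so the result is an approximation for geographic data.

```python
def polygon_centroid(ring):
    """Centroid (x, y) of a simple polygon given as a list of (x, y) vertices."""
    # drop a repeated closing vertex if the ring is explicitly closed
    if ring[0] == ring[-1]:
        ring = ring[:-1]
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3.0 * area2), cy / (3.0 * area2)


# hypothetical rectangle in (longitude, latitude) order
toy_ring = [(-75.0, 40.0), (-74.0, 40.0), (-74.0, 41.0), (-75.0, 41.0), (-75.0, 40.0)]
center_longitude, center_latitude = polygon_centroid(toy_ring)
```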


In [7]:
# build Philly map
style_function = lambda x: {
    'fillColor': '#b0d5ff',
    'color': '#006be6',
    'weight': 2,
    
    'fillOpacity': 0.5
}

map_philly = folium.Map(location=[geo_philly.latitude, geo_philly.longitude], 
                        zoom_start=12, tiles="Stamen Terrain")
folium.GeoJson(
    df_philly,
    style_function=style_function,
    tooltip=folium.GeoJsonTooltip(
        fields=['neighborhood'],
        labels=False,
        localize=True
    ),
    popup=folium.GeoJsonPopup(fields=['neighborhood'], labels=False)
).add_to(map_philly)


# add markers to map
for lat, lng, neighborhood in zip(df_philly['center_latitude'], df_philly['center_longitude'], df_philly['neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    icon = folium.DivIcon(html=f"""<div style="font-weight: bold; color: #c7580e;">{neighborhood}</div>""")

    folium.Marker(
        [lat, lng],
        popup=label,
        icon=icon).add_to(map_philly)
map_philly

### 2.2 Get Venues Data

In [44]:
# Foursquare api
from config_file import CLIENT_ID
from config_file import CLIENT_SECRET
LIMIT = 100 # A default Foursquare API limit value
VERSION = '20210605' 

# get all nearby venues
def get_nearby_venues(neighborhoods, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for neighborhood, lat, lng in zip(neighborhoods, latitudes, longitudes):            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            neighborhood, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['neighborhood', 
                  'venue', 
                  'venue_latitude', 
                  'venue_longitude', 
                  'venue_category']
    
    return(nearby_venues)

# get venues category
def get_venue_categories():
    url = 'https://api.foursquare.com/v2/venues/categories'
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
    }
    return requests.get(url, params=params).json()["response"]["categories"]


In [14]:
# get nearby venues
df_venues = get_nearby_venues(
    neighborhoods=df_philly['neighborhood'],
    latitudes=df_philly['center_latitude'],
    longitudes=df_philly['center_longitude'],
    radius=2000
)
# clean duplicate data
df_venues.drop_duplicates(keep='first', inplace=True)
print(df_venues.shape)
df_venues.head(1)

(13843, 5)


| | neighborhood | venue | venue_latitude | venue_longitude | venue_category |
|---|---|---|---|---|---|
| 0 | Academy Gardens | Cannstatter Volksfest-Verein | 40.054602 | -75.007887 | Event Space |
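
`get_nearby_venues` assembles its request URL by string formatting. A possible alternative, shown here only as a sketch, is to let `urllib.parse.urlencode` escape the query parameters. The endpoint and parameter names are taken from the code above; `build_explore_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore request URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',  # latitude,longitude of the neighborhood center
        'radius': radius,
        'limit': limit,
    }
    return f'{EXPLORE_ENDPOINT}?{urlencode(params)}'


# placeholder credentials, for illustration only
url = build_explore_url('MY_ID', 'MY_SECRET', '20210605', 40.061186, -75.003104, radius=2000)
query = parse_qs(urlsplit(url).query)  # decode again to inspect what would be sent
```

With `requests`, passing the same dictionary as `params=` achieves the same escaping, as `get_venue_categories` already does.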


In [19]:
# get all categories list as dictionary
def get_sub_categories(nodes):
    cat_list = {}
    for node in nodes:
        if not node['categories']:
            cat_list[node['name']] = None
        else:
            cat_list[node['name']] = get_sub_categories(node['categories'])
    return cat_list

# search the nested dictionary for search_key and return its immediate parent key
# (or the key itself when it is a top-level category)
def find_key(p_dict, search_key, parent_key=None):
    for key in p_dict:
        if search_key == key:
            return parent_key if parent_key is not None else key
        elif isinstance(p_dict[key], dict):
            result = find_key(p_dict[key], search_key, key)
            if result is not None:
                return result

In [21]:
# get all venues categories to dictionary
venue_categories = get_sub_categories(get_venue_categories())

In [22]:
# store all categories list for later use
with open('data/venue_categories.json', 'w', encoding='utf-8') as json_file:
    json.dump(venue_categories, json_file, ensure_ascii=False)

In [23]:
# load categories from saved file
with open('data/venue_categories.json', encoding='utf-8') as data:
    venue_categories = json.load(data)

In [24]:
# get parent category for venues data
df_venues['venue_general_category'] = df_venues.apply(lambda x: find_key(venue_categories, x['venue_category']), axis=1)
df_venues.head()

| | neighborhood | venue | venue_latitude | venue_longitude | venue_category | venue_general_category |
|---|---|---|---|---|---|---|
| 0 | Academy Gardens | Cannstatter Volksfest-Verein | 40.054602 | -75.007887 | Event Space | Professional & Other Places |
| 1 | Academy Gardens | Crown Deli | 40.064613 | -74.987099 | Deli / Bodega | Food |
| 2 | Academy Gardens | Asian World of Martial Arts | 40.06853 | -75.016359 | Martial Arts School | Gym / Fitness Center |
| 3 | Academy Gardens | Fluehr Park | 40.055616 | -74.987111 | Park | Outdoors & Recreation |
| 4 | Academy Gardens | Dagwood's Pub | 40.050632 | -74.999784 | Seafood Restaurant | Food |
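
The parent lookup can be checked on a small tree. The sketch below repeats `find_key` from the cell above so that it runs on its own; `sample_categories` is a hypothetical, hand-written fragment shaped like the dictionary that `get_sub_categories` produces, not the real Foursquare hierarchy.

```python
def find_key(p_dict, search_key, parent_key=None):
    """Return the immediate parent of search_key, or the key itself at the top level."""
    for key in p_dict:
        if search_key == key:
            return parent_key if parent_key is not None else key
        elif isinstance(p_dict[key], dict):
            result = find_key(p_dict[key], search_key, key)
            if result is not None:
                return result
    return None


# hypothetical fragment of the category tree (leaf categories map to None)
sample_categories = {
    'Food': {'Deli / Bodega': None, 'Seafood Restaurant': None},
    'Outdoors & Recreation': {
        'Park': None,
        'Gym / Fitness Center': {'Martial Arts School': None},
    },
}

parent_of_park = find_key(sample_categories, 'Park')
parent_of_martial_arts = find_key(sample_categories, 'Martial Arts School')
parent_of_food = find_key(sample_categories, 'Food')
parent_of_unknown = find_key(sample_categories, 'Bowling Alley')
```

Because the recursion passes the current key down as `parent_key`, a category nested two levels deep resolves to its direct parent rather than to the top-level group, which is why `Martial Arts School` maps to `Gym / Fitness Center` in the table above.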


In [None]:
len(df_venues['venue_general_category'].unique())

In [26]:
# export to csv file for later use
df_venues.to_csv('data/df_venues.csv', index=False)

In [27]:
# import data from file
df_venues = pd.read_csv('data/df_venues.csv')
print(df_venues.shape)
df_venues.head(1)

(13843, 6)


| | neighborhood | venue | venue_latitude | venue_longitude | venue_category | venue_general_category |
|---|---|---|---|---|---|---|
| 0 | Academy Gardens | Cannstatter Volksfest-Verein | 40.054602 | -75.007887 | Event Space | Professional & Other Places |


### 2.3 Get Crime Data

In [2]:
# change the range of data collected
NUMBER_OF_YEAR = 5

# get api url
current_year = datetime.now().year
crime_url = "https://phl.carto.com/api/v2/sql?filename=incidents_part1_part2&format=csv&q=SELECT * , ST_Y(the_geom) AS lat, ST_X(the_geom) AS lng, round(ST_Area(the_geom::geography)) AS geography_area, round(ST_Area(the_geom_webmercator)) AS area FROM incidents_part1_part2 WHERE dispatch_date_time >= '{}-01-01' AND dispatch_date_time < '{}-01-01'".format(current_year - NUMBER_OF_YEAR + 1, current_year + 1)

# download and load data
crime_response = requests.get(crime_url).content
df_crime_raw = pd.read_csv(io.StringIO(crime_response.decode('utf-8')), dtype={"dc_dist": int})
print(df_crime_raw.shape)
df_crime_raw.head(2)

(672121, 20)


| | the_geom | cartodb_id | the_geom_webmercator | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour_ | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | lat | lng | geography_area | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0101000020E61000002FD31F2F1ECE52C07129BE0C0CF5... | 11 | 0101000020110F0000401FFA8143F15FC1160AD2D2C283... | 117 | 12 | 1 | 2018-01-06 10:56:00 | 2018-01-06 | 10:56:00 | 10.0 | 201812001185 | 6600 BLOCK ESSINGTON AVE | 600 | Thefts | -75.220592 | 39.91443 | 39.91443 | -75.220592 | 0.0 | 0.0 |
| 1 | 0101000020E61000002FD31F2F1ECE52C07129BE0C0CF5... | 12 | 0101000020110F0000401FFA8143F15FC1160AD2D2C283... | 118 | 12 | 1 | 2018-06-21 22:57:00 | 2018-06-21 | 22:57:00 | 22.0 | 201812045738 | 6600 BLOCK ESSINGTON AVE | 300 | Robbery Firearm | -75.220592 | 39.91443 | 39.91443 | -75.220592 | 0.0 | 0.0 |
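
The date range in the query follows from `NUMBER_OF_YEAR` and the current year. The sketch below isolates that arithmetic and builds an escaped request URL. The endpoint, table name, and column names come from the cell above; the helper names are hypothetical, and the `SELECT` list is shortened (the two area columns are omitted) to keep the example small.

```python
from urllib.parse import urlencode

CARTO_SQL_ENDPOINT = 'https://phl.carto.com/api/v2/sql'


def crime_date_window(current_year, number_of_years):
    """Return (start, end) dates covering the last `number_of_years` calendar years."""
    start_year = current_year - number_of_years + 1
    # the end date is exclusive, so it is the first day of next year
    return f'{start_year}-01-01', f'{current_year + 1}-01-01'


def build_crime_url(current_year, number_of_years):
    """Build the CSV export URL for incidents inside the date window."""
    start, end = crime_date_window(current_year, number_of_years)
    query = (
        "SELECT *, ST_Y(the_geom) AS lat, ST_X(the_geom) AS lng "
        "FROM incidents_part1_part2 "
        f"WHERE dispatch_date_time >= '{start}' AND dispatch_date_time < '{end}'"
    )
    params = {'filename': 'incidents_part1_part2', 'format': 'csv', 'q': query}
    return f'{CARTO_SQL_ENDPOINT}?{urlencode(params)}'


window = crime_date_window(2021, 5)
crime_url_example = build_crime_url(2021, 5)
```

For example, a five-year window evaluated in 2021 starts on 2017-01-01 and ends before 2022-01-01, so it includes the partial current year.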


In [4]:
# store crime data for later use
df_crime_raw.to_csv('data/df_crime_raw.csv', index=False)

In [5]:
# load crime data from saved file
df_crime_raw = pd.read_csv('data/df_crime_raw.csv')
print(df_crime_raw.shape)
df_crime_raw.head(2)

(672121, 20)


| | the_geom | cartodb_id | the_geom_webmercator | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour_ | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | lat | lng | geography_area | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0101000020E61000002FD31F2F1ECE52C07129BE0C0CF5... | 11 | 0101000020110F0000401FFA8143F15FC1160AD2D2C283... | 117 | 12 | 1 | 2018-01-06 10:56:00 | 2018-01-06 | 10:56:00 | 10.0 | 201812001185 | 6600 BLOCK ESSINGTON AVE | 600 | Thefts | -75.220592 | 39.91443 | 39.91443 | -75.220592 | 0.0 | 0.0 |
| 1 | 0101000020E61000002FD31F2F1ECE52C07129BE0C0CF5... | 12 | 0101000020110F0000401FFA8143F15FC1160AD2D2C283... | 118 | 12 | 1 | 2018-06-21 22:57:00 | 2018-06-21 | 22:57:00 | 22.0 | 201812045738 | 6600 BLOCK ESSINGTON AVE | 300 | Robbery Firearm | -75.220592 | 39.91443 | 39.91443 | -75.220592 | 0.0 | 0.0 |


In [6]:
# preprocess crime data
df_crime = gpd.GeoDataFrame(
    df_crime_raw, geometry=gpd.points_from_xy(df_crime_raw.lng, df_crime_raw.lat))
# keep only the columns used later and give them shorter names
df_crime = df_crime[['cartodb_id', 'dispatch_date', 'text_general_code', 'geometry']]
df_crime = df_crime.rename(columns={'cartodb_id': 'id', 'dispatch_date': 'date', 'text_general_code': 'type'})
print(df_crime.shape)
df_crime.head(1)

(672121, 4)


| | id | date | type | geometry |
|---|---|---|---|---|
| 0 | 11 | 2018-01-06 | Thefts | POINT (-75.22059 39.91443) |


In [14]:
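%%time
# join each incident to the neighborhood polygon that contains it
# the crime points are plain lon/lat, so tag them with the same CRS as df_philly first
df_crime = df_crime.set_crs(df_philly.crs)
# `predicate` needs geopandas >= 0.10; older releases call this argument `op`
df_crime = gpd.sjoin(df_crime, df_philly, how='left', predicate='within')
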
CPU times: user 5.71 s, sys: 225 ms, total: 5.93 s
Wall time: 6.03 s
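
Conceptually, the spatial join asks for every incident which neighborhood polygon contains its point. The sketch below shows that test with a plain ray-casting check on a single ring. It is an illustration under simplifying assumptions, not how geopandas implements the join (geopandas relies on a geometry engine and a spatial index); the function names and the toy polygon are hypothetical.

```python
def point_in_ring(x, y, ring):
    """Ray casting: True if (x, y) lies inside the polygon ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        # does a horizontal ray going right from (x, y) cross this edge?
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def assign_neighborhood(x, y, neighborhoods):
    """Left-join behaviour: return the containing neighborhood, or None if there is none."""
    for name, ring in neighborhoods.items():
        if point_in_ring(x, y, ring):
            return name
    return None


# hypothetical rectangular "neighborhood" in (longitude, latitude) order
toy_neighborhoods = {
    'Toy Neighborhood': [(-75.3, 39.8), (-75.1, 39.8), (-75.1, 40.0), (-75.3, 40.0)],
}

matched = assign_neighborhood(-75.22059, 39.91443, toy_neighborhoods)
unmatched = assign_neighborhood(-75.0, 39.9, toy_neighborhoods)
```

With `how='left'`, points that fall in no polygon keep a missing `neighborhood` value, which is what the later null-removal step drops.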


In [17]:
df_crime = df_crime[['id', 'date', 'type', 'neighborhood', 'geometry']]
print(df_crime.shape)
df_crime.head(1)

| | id | date | type | neighborhood | geometry |
|---|---|---|---|---|---|
| 0 | 11 | 2018-01-06 | Thefts | Industrial | POINT (-75.22059 39.91443) |


In [19]:
# remove null value rows
df_crime = df_crime.dropna(how='any',axis=0)
print(df_crime.shape)
df_crime.head(1)

(667882, 5)


| | id | date | type | neighborhood | geometry |
|---|---|---|---|---|---|
| 0 | 11 | 2018-01-06 | Thefts | Industrial | POINT (-75.22059 39.91443) |


In [20]:
# store crime data for later use
df_crime.to_csv('data/df_crime.csv', index=False)

In [21]:
# load crime data from saved file
df_crime = pd.read_csv('data/df_crime.csv')
print(df_crime.shape)
df_crime.head(2)

(667882, 5)


| | id | date | type | neighborhood | geometry |
|---|---|---|---|---|---|
| 0 | 11 | 2018-01-06 | Thefts | Industrial | POINT (-75.22059229 39.91443023) |
| 1 | 12 | 2018-06-21 | Robbery Firearm | Industrial | POINT (-75.22059229 39.91443023) |
