# Segmenting and Clustering Neighborhoods in DC
## Applied Data Science Capstone Week 5 Peer-Graded Project Report

By Hernan Labastie


## Introduction to the opportunity

Washington, D.C., formally the District of Columbia and commonly referred to as D.C., Washington, or The District, is the capital of the United States.
The U.S. Census Bureau estimates that the District's population was 705,749 as of July 2019, an increase of more than 100,000 people since the 2010 United States Census.
This continues a growth trend since 2000, following a half-century of population decline.
The city was the 24th most populous place in the United States as of 2010.
According to data from 2010, commuters from the suburbs increase the District's daytime population to over a million.
Crime in Washington, D.C., is concentrated in areas associated with poverty, drug abuse, and gangs. A 2010 study found that 5% of city blocks accounted for more than 25% of the District's total crimes.

Developers, investors, policy makers and city planners have an interest in answering the following questions as the need for
additional services and citizen protection grows:

1. What neighbourhoods have the highest crime?
2. Is population density correlated to crime level?
3. Using Foursquare data, what venues are most common in different locations within the city?
4. Which locations really need a coffee shop?

Does the Open Data project have data that is specific and detailed enough to support decisions, or is it too
aggregated to provide value at its current level of detail? Let's find out.

In [None]:
# from PIL import Image
import requests
from PIL import Image

url = 'https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTb-Z15PkiR2QFm6NSI1ZnFx_fgzR3DYcQbbT6AmW4IG7VL02c3'
im = Image.open(requests.get(url, stream=True).raw)
im

## Data

#### To understand and explore we will need the following District of Columbia Open Data:

1. Open Data Site: https://dcatlas.dcgis.dc.gov/crimecards/
2. Neighbourhoods: https://en.wikipedia.org/wiki/Washington,_D.C.#Demographics
3. Foursquare Developers Access to venue data: https://foursquare.com/
4. Open Data Site:  https://opendata.dc.gov/datasets/

Using this data will allow exploration and examination to answer the questions. The neighbourhood data will enable us to
properly group crime by neighbourhood. The Census data will enable us to then compare the population density to examine if
areas of highest crime are also most densely populated. Locations of interest will then allow us to cluster and
quantitatively understand the venues most common to that location.

# Methodology
All steps are referenced below in the Appendix: Analysis section.

The methodology will include:

1. Loading each data set
2. Examine the crime frequency by neighbourhood
3. Study the crime types and then pivot analysis of crime type frequency by neighbourhood
4. Understand correlation between crimes and population density
5. Perform k-means clustering on venues by location of interest, based on findings from the crime and neighbourhood analysis
6. Determine which venues are statistically most common in the region of greatest crime count, then in all other locations of interest.
7. Determine if an area, such as the Knowledge Park, needs a coffee shop.
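
Step 4 above, relating crime counts to population density, could be sketched with a Pearson correlation coefficient. The snippet below is a minimal illustration in plain Python, not the analysis actually run in the appendix; the `pearson` helper and the sample numbers are hypothetical and are not taken from the DC datasets.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical per-neighbourhood values (illustration only, not real data)
crime_counts = [120, 340, 560, 210]
pop_density = [1500, 4200, 6100, 3000]

r = pearson(crime_counts, pop_density)
print(round(r, 3))
```

A value of `r` close to 1 would suggest that denser neighbourhoods also record more crime, while a value near 0 would suggest little linear relationship. Correlation alone would not show that density causes crime.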



### Loading the data
After loading the applicable libraries, the referenced GeoJSON neighbourhood data was loaded from the DC Open Data site.


### Exploring the data
Exploring the count of crimes by neighbourhood gives us the first glimpse into the distribution.

One caveat is that neighbourhood names could change over time. The crime dataset did not state which neighbourhood naming dataset it used, so we assumed the neighbourhood data provided aligned with the neighbourhoods used in the crime data. It may be beneficial for the City to note and timestamp neighbourhood naming in the future, or simply to reference which neighbourhood naming file was used for the crime dataset.
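
The alignment assumption described here can be checked rather than taken on faith. As a rough sketch, the helper below compares the names in the crime data with the names in the GeoJSON features and reports any that appear in only one source. The function name `unmatched_names`, the property key `NAME` and the sample values are assumptions made for illustration; the real property key has to be read from the GeoJSON file.

```python
def unmatched_names(crime_names, geo_features, prop="NAME"):
    """Return (names only in the crime data, names only in the GeoJSON)."""
    crime_set = {str(name).strip() for name in crime_names}
    geo_set = {str(f["properties"][prop]).strip() for f in geo_features}
    return sorted(crime_set - geo_set), sorted(geo_set - crime_set)

# Hypothetical sample values, not taken from the real datasets
features = [
    {"properties": {"NAME": "Cluster 1"}},
    {"properties": {"NAME": "Cluster 2"}},
]
only_in_crime, only_in_geo = unmatched_names(["Cluster 1", "Cluster 3"], features)
print(only_in_crime, only_in_geo)
```

If both lists come back empty, every neighbourhood in the crime data has a matching polygon. Otherwise the unmatched names will show up as uncoloured areas on the choropleth.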



#### First Visualization of Crime
Once the data was prepared, a choropleth map was created to view the crime count by neighbourhood.  As expected the region of greatest crime count was found in the downtown and Platt neighbourhoods.
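
The colour scale of the choropleth depends on how the crime counts are split into bins. The sketch below mirrors the approach used in the appendix: evenly spaced integer edges between the minimum and maximum count, with the top edge raised by one so that the maximum value falls inside the last bin. The helper name `choropleth_bins` and the sample counts are hypothetical.

```python
import numpy as np

def choropleth_bins(counts, n_edges=6):
    """Evenly spaced integer bin edges that cover every value in counts."""
    edges = np.linspace(min(counts), max(counts), n_edges, dtype=int).tolist()
    edges[-1] += 1  # raise the top edge so the maximum lies inside the last bin
    return edges

# Hypothetical crime counts for three neighbourhoods
print(choropleth_bins([3, 50, 103]))  # [3, 23, 43, 63, 83, 104]
```

Equal-width bins are easy to read but can hide detail when a few neighbourhoods dominate the counts. Quantile-based edges are a common alternative in that case.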

Examining the crime types shows which crimes occur most frequently; we then plot these as a bar chart to see the most frequent types.
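
As an illustration of the counting and pivoting involved, the snippet below tallies offences overall and by neighbourhood on a tiny made-up table. The column names `NEIGHBORHOOD_CLUSTER` and `OFFENSE` follow the appendix; the rows themselves are hypothetical, and `pd.crosstab` is used here as one possible way to build the pivot, not necessarily the one the appendix uses.

```python
import pandas as pd

# Hypothetical records (illustration only, not real crime data)
records = pd.DataFrame({
    'NEIGHBORHOOD_CLUSTER': ['Cluster 1', 'Cluster 1', 'Cluster 2', 'Cluster 2', 'Cluster 2'],
    'OFFENSE': ['BURGLARY', 'ROBBERY', 'BURGLARY', 'BURGLARY', 'ROBBERY'],
})

# Overall frequency of each crime type
type_counts = records['OFFENSE'].value_counts()

# Crime type count by neighbourhood: one row per cluster, one column per offence
by_area = pd.crosstab(records['NEIGHBORHOOD_CLUSTER'], records['OFFENSE'])

print(type_counts)
print(by_area)
```

Calling `type_counts.plot(kind='barh')` would then produce a bar chart of the kind described above.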



### Look at specific locations to understand the connection to venues using Foursquare data
Loading the "DC" data enables us to perform a statistical analysis on the most common venues by location.

Plotting the latitude and longitude coordinates of the locations of interest onto the crime choropleth map enables us to now study the most common venues by using the Foursquare data.

#### Analysing each Location
Grouping rows by location and taking the mean of the frequency of occurrence of each venue category, we study the top five most common venues.

Putting this data into a pandas dataframe we can then determine the most common venues by location and plot onto a map.
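
The encoding and grouping steps can be illustrated on a very small table. The sketch below one-hot encodes the venue category, averages the indicator columns per location to obtain category frequencies, and then picks the most frequent categories for a location. The data and the `top_venues` helper are hypothetical and stand in for the Foursquare results.

```python
import pandas as pd

# Hypothetical venue rows: one row per venue found near a location
venues = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Park', 'Cafe', 'Park'],
})

# One-hot encode the category, then average per location to get frequencies
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Location', venues['Location'])
freq = onehot.groupby('Location').mean()

def top_venues(freq_table, location, n=2):
    """Return the n most frequent venue categories for one location."""
    row = freq_table.loc[location].sort_values(ascending=False)
    return row.head(n).index.tolist()

print(top_venues(freq, 'A'))  # ['Bar', 'Park']
```

Because each row of `freq` sums to 1, the values can be read as the share of a location's venues that fall into each category. These rows are also the feature vectors that k-means later clusters.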

## Results
The analysis enabled us to discover and describe visually and quantitatively:

1. Neighbourhoods in DC
2. Crime frequency by neighbourhood
3. Crime type frequency and statistics
4. Crime type count by neighbourhood

While it is not valid, consistent, reliable or sufficient to assume that a higher concentration of coffee shops, bars and clubs in combination predicts the amount of crime in DC, such venue data may be part of the model needed to make that prediction in the future.

5. We were able to determine the top 10 most common venues by location of interest.
6. Statistically, we determined there are no coffee shops.

## Discussion and Recommendations

The DC Open Data enables us to gain an understanding of the crime volume by type and by area, but it is not specific enough to understand the distribution properties. Valuable questions such as "are these crimes occurring more often in a specific area and at a certain time by a specific demographic of people?" cannot be answered or explored, because the data needed is reasonably assumed to be personal and private information with associated legal risks.

There is value to the city in exploring the detailed crime data using data science to predict frequency, location, timing and conditions, in order to best allocate resources for the benefit of its citizens and its police force. However, human behaviour is complex, requiring thick profile data by individual and the conditions surrounding the event(s). To be sufficient for reliable future prediction, the data would need to demonstrate validity, currency, reliability and sufficiency.

A note of caution is that neighbourhood names could change. The crime dataset did not state which neighbourhood naming dataset it used, so we assumed the neighbourhood data provided aligned with the neighbourhoods used in the crime data. It may be beneficial for the City to note and timestamp neighbourhood naming in the future, or simply to reference which neighbourhood naming file was used for the crime dataset.





## Conclusion
Using a combination of datasets from the DC Open Data project and Foursquare venue data, we were able to analyse, discover and describe neighbourhoods, crime and population density, and to describe venues quantitatively by location of interest.

While the DC Open Data is interesting overall, it lacks the detail required for truly valuable quantitative analysis and predictive analytics, which investors and developers would value most for making appropriate investments and minimizing risk.

The Open Data project is a great start and highlights the need for a "Citizens Like Me" model, in which citizens of a digital DC are able to share their data as they wish for detailed analysis that enables the creation of valued services.


# APPENDIX:  Analysis


### Load Libraries

In [None]:
pip install geopy

In [None]:
!conda config --add channels conda-forge
!conda config --add channels matsci
!conda config --add channels abinit

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# for webscraping import Beautiful Soup 
from bs4 import BeautifulSoup

import xml

!conda install folium --yes 
import folium # map rendering library

print('Libraries imported.')

In [None]:
r = requests.get('https://opendata.arcgis.com/datasets/823d86e17a6d47808c6e4f1c2dd97928_0.geojson')
dc_geo = r.json()

In [None]:
neighborhoods_data = dc_geo['features']

In [None]:
neighborhoods_data[0]

In [None]:
g = requests.get('https://opendata.arcgis.com/datasets/6179d35eacb144a5b5fdcc869f86dfb5_0.geojson')
demog_geo = g.json()

In [None]:
demog_data = demog_geo['features']
demog_data[0]

In [None]:
import os
#os.listdir('.')

In [None]:
opencrime = 'DC.xlsx'


In [None]:
workbook = pd.ExcelFile(opencrime)
print(workbook.sheet_names)

In [None]:
crime_df = workbook.parse('Hoja1')
crime_df.head()

In [None]:
#crime_df.drop(['From_Date', 'To_Date'], axis=1,inplace=True)

## What is the crime count by neighbourhood?

In [None]:
crime_data = crime_df.groupby(['NEIGHBORHOOD_CLUSTER']).size().to_frame(name='Count').reset_index()
crime_data

In [None]:
crime_data.describe()

In [None]:
crime_data.rename(index=str, columns={'NEIGHBORHOOD_CLUSTER':'Neighbourh','Count':'Crime_Count'}, inplace=True)
crime_data

In [None]:
# the columns were already renamed in the previous cell, so no further renaming is needed here
crime_data

In [None]:
address = 'District of Columbia, USA'

geolocator = Nominatim(user_agent='dc_explorer') # recent geopy versions require a user_agent; this name is arbitrary
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of DC are {}, {}.'.format(latitude, longitude))

In [None]:
world_geo = r'world_countries.json' # geojson file

dc_1_map = folium.Map(location=[38.89, -76.98], width=1000, height=750,zoom_start=12)

dc_1_map

In [None]:
dc_geo = r.json()

threshold_scale = np.linspace(crime_data['Crime_Count'].min(),crime_data['Crime_Count'].max(), 6,dtype=int)
threshold_scale = threshold_scale.tolist()
threshold_scale[-1] = threshold_scale[-1]+1

# key_on must name the GeoJSON property whose values match the 'Neighbourh' column
folium.Choropleth(geo_data=dc_geo, data=crime_data, columns=['Neighbourh', 'Crime_Count'],
    key_on='feature.properties.Neighbourh', bins=threshold_scale, fill_color='YlOrRd', fill_opacity=0.7,
    line_opacity=0.1, legend_name='Neighbourhoods').add_to(dc_1_map)

dc_1_map

## Examine Crime Types 

In [None]:
crimetype_data = crime_df.groupby(['OFFENSE']).size().to_frame(name='Count').reset_index()
crimetype_data

In [None]:
crimetype_data.describe()

In [None]:
# count the rows for each neighbourhood/offence pair
crimepivot = crime_df.pivot_table(index='NEIGHBORHOOD_CLUSTER', columns='OFFENSE', aggfunc='size', fill_value=0)
crimepivot

In [None]:
crimetype_data.plot(x='OFFENSE', y='Count', kind='barh')


## Let's examine theft from vehicles

In [None]:
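# keep only theft-from-vehicle records; assumes the dataset labels them 'THEFT F/AUTO' in OFFENSE
mvcrime_df = crime_df[crime_df['OFFENSE'] == 'THEFT F/AUTO']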
mvcrime_df

In [None]:
mvcrime_data = mvcrime_df.groupby(['NEIGHBORHOOD_CLUSTER']).size().to_frame(name='Count').reset_index()
mvcrime_data

In [None]:
mvcrime_data.describe()

In [None]:
mvcrime_data.rename(index=str, columns={'NEIGHBORHOOD_CLUSTER':'Neighbourh','Count':'MVCrime_Count'}, inplace=True)
mvcrime_data

In [None]:
world_geo = r'world_countries.json' # geojson file

dc_c_map = folium.Map(location=[38.89, -76.98], width=1000, height=750,zoom_start=12)

dc_c_map

In [None]:
# Theft from vehicles: count by neighbourhood
dc_geo = r.json()
threshold_scale = np.linspace(mvcrime_data['MVCrime_Count'].min(), mvcrime_data['MVCrime_Count'].max(),6,dtype=int)
threshold_scale = threshold_scale.tolist()
threshold_scale[-1] = threshold_scale[-1]+1

# key_on must name the GeoJSON property whose values match the 'Neighbourh' column
folium.Choropleth(geo_data=dc_geo,data=mvcrime_data,columns=['Neighbourh', 'MVCrime_Count'],key_on='feature.properties.Neighbourh',
    bins=threshold_scale, fill_color='YlOrRd',fill_opacity=0.7,line_opacity=0.1,legend_name='Neighbourhoods').add_to(dc_c_map)
dc_c_map

## Is it possible the higher rate of crime in the downtown area is due to population density?

In [None]:
opendemog = 'DC_Census_Tract_Demographics.xlsx'

workbook = pd.ExcelFile(opendemog)
print(workbook.sheet_names)

In [None]:
demog_df = workbook.parse('DC_Census_Tract_Demogr')
demog_df.head()

In [None]:
# Population Density 
world_geo = r'world_countries.json' # geojson file
dc_d_map = folium.Map(location=[38.89, -76.98], width=1200, height=750,zoom_start=12)
dc_d_map

threshold_scale = np.linspace(demog_df['DBpop2011'].min(),demog_df['DBpop2011'].max(),6,dtype=int)
threshold_scale = threshold_scale.tolist()
threshold_scale[-1] = threshold_scale[-1]+1

folium.Choropleth(geo_data=demog_geo,data=demog_df,columns=['OBJECTID','DBpop2011'],key_on='feature.properties.OBJECTID',
    bins=threshold_scale,fill_color='PuBuGn',fill_opacity=0.7, line_opacity=0.1,legend_name='DC Population Density').add_to(dc_d_map)
dc_d_map

## Let's look at specific locations in DC

In [None]:
pointbook = 'DCPoints.xlsx'

workbook_2 = pd.ExcelFile(pointbook)
print(workbook_2.sheet_names)

In [None]:
location_df = workbook_2.parse('Hoja1')
location_df.head()

In [None]:
location_df = location_df.head(100)
location_df

### Add location markers to map

In [None]:
# use the LATITUDE/LONGITUDE columns (the same ones passed to getNearbyVenues below) so markers land on the map
for lat, lng, point in zip(location_df['LATITUDE'], location_df['LONGITUDE'], location_df['FULLADDRESS']):
    label = '{}'.format(point)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng],radius=1,popup=label,color='blue',fill=True,fill_color='#3186cc',
        fill_opacity=0.7).add_to(dc_c_map)
dc_c_map

In [None]:
import gc
gc.collect()

## Explore DC Neighbourhoods
#### Define Foursquare Credentials and Version

In [None]:
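import os
# read credentials from the environment instead of hard-coding them; the variable names are arbitrary
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret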
VERSION = '20181201' # Foursquare API version



## Let's take a look at nearby venues

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=10, LIMIT=10):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng,            
            v['venue']['name'], 
            v['venue']['id'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Location', 
                  'Location Latitude', 
                  'Location Longitude', 
                  'Venue',
                  'Venue id',                
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'        
                   ]
    
    return(nearby_venues)

In [None]:
dc_data_venues = getNearbyVenues(names=location_df['OBJECTID_12'],
                                   latitudes=location_df['LATITUDE'],
                                   longitudes=location_df['LONGITUDE']
                                  )

In [None]:
print(dc_data_venues.shape)
dc_data_venues

print('There are {} unique venue categories.'.format(len(dc_data_venues['Venue Category'].unique())))

In [None]:
print('There are {} unique venues.'.format(len(dc_data_venues['Venue id'].unique())))

univen = dc_data_venues.groupby('Location')['Venue Category'].nunique() # unique categories per location
univen

dc_data_venues.groupby('Venue Category').nunique()

## Analyze each Location

In [None]:
# one hot encoding
dc_onehot = pd.get_dummies(dc_data_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
dc_onehot['Location'] = dc_data_venues['Location'] 

# move neighbourhood column to the first column
fixed_columns = [dc_onehot.columns[-1]] + list(dc_onehot.columns[:-1])
dc_onehot = dc_onehot[fixed_columns]

dc_onehot.head()

In [None]:
dc_onehot.shape

### Group rows by location and by the mean of the frequency of occurrence of each category

In [None]:
dc_grouped = dc_onehot.groupby('Location').mean().reset_index()
dc_grouped

In [None]:
dc_grouped.shape

### Print each Location with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in dc_grouped['Location']:
    print("----"+str(hood)+"----") # str() keeps this working when the location id is numeric
    temp = dc_grouped[dc_grouped['Location'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

### Now into a pandas dataframe

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Location']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
location_venues_sorted = pd.DataFrame(columns=columns)
location_venues_sorted['Location'] = dc_grouped['Location']

for ind in np.arange(dc_grouped.shape[0]):
    location_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dc_grouped.iloc[ind, :], num_top_venues)

location_venues_sorted

## Cluster DC Locations

### Run k-means to cluster Locations into 5 clusters

In [None]:
# set number of clusters
kclusters = 5

dc_grouped_clustering = dc_grouped.drop(columns='Location')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dc_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

### Now creating a new dataframe including the cluster as well as the top 10 venues for each Location

In [None]:
# attach the cluster labels to the per-location venue table (its rows align with dc_grouped)
clustered = location_venues_sorted.assign(**{'Cluster Labels': kmeans.labels_})

# merge with location_df to add latitude/longitude for each location;
# 'Location' holds the OBJECTID_12 values that were passed to getNearbyVenues
dc_merged = location_df.join(clustered.set_index('Location'), on='OBJECTID_12', how='inner')

dc_merged # check the last columns!

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dc_merged['LATITUDE'], dc_merged['LONGITUDE'], dc_merged['OBJECTID_12'], dc_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon], radius=5,popup=label,color=rainbow[int(cluster)-1],fill=True,fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters

In [None]:
del crime_df
del crime_data
del crimetype_data
del mvcrime_df 
del demog_df
del location_df 
del dc_merged
