# Comparing the Neighborhoods of Chicago & Houston
By: Nick Adelberger

## Introduction

Chicago, on Lake Michigan in Illinois, is the third-largest city in the United States and is home to more than 2.7 million people. Known as "The Windy City", Chicago is an international hub for finance, culture, commerce, industry, education, technology, telecommunications, and transportation (reference: https://en.wikipedia.org/wiki/Chicago). Houston, popularly known as "The Bayou City", is the fourth-largest city in the United States. Houston's economy has a broad industrial base in energy, manufacturing, aeronautics, and transportation. It has become a global city, with strengths in culture, medicine, and research (reference: https://en.wikipedia.org/wiki/Houston).

## 1. Business Problem

As a current resident of Chicago, I have been offered a job in Houston as a Data Analyst. Before accepting the position, I would like to learn more about Houston and how its neighborhoods resemble or differ from Chicago's. The analysis should also help anyone else considering a move to Houston understand the city's different areas and which of them offer venues and lifestyles similar to those in Chicago.

## 2. Data

The data used to compare Chicago and Houston starts with a data frame of the neighborhoods within each city's limits (for Chicago, also the community area each neighborhood belongs to, labeled "Borough" in the code below). The Chicago list comes from Wikipedia and the Houston list from houston.org, both found through a Google search.
Chicago Neighborhoods - https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Chicago
Houston Neighborhoods - https://www.houston.org/living-in-houston/neighborhoods-communities

We also need the geographical coordinates (latitude and longitude) of each neighborhood so that we can use the Foursquare API to pull location data on the types of venues within each neighborhood. We will use Python's geopy package to obtain them.
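
The geocoding step can be sketched as follows. This is a minimal illustration rather than the exact code used later in this report: `geocode_places` and its `geocode` parameter are hypothetical names, and the function accepts any callable that behaves like geopy's `Nominatim.geocode` (returning an object with `latitude` and `longitude` attributes, or `None` when nothing matches). Misses are recorded as `None` rather than 0 so they are easy to spot afterwards.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Optional[Tuple[float, float]]


def geocode_places(names: Iterable[str], city: str,
                   geocode: Callable[[str], object]) -> Dict[str, Coords]:
    """Look up each "<name>, <city>" query once and collect (lat, lng) pairs.

    `geocode` is assumed to behave like geopy's Nominatim.geocode: it returns
    an object with .latitude and .longitude, or None when there is no match.
    """
    results: Dict[str, Coords] = {}
    for name in names:
        location = geocode(f"{name}, {city}")
        if location is None:
            results[name] = None  # keep the miss visible instead of storing 0
        else:
            results[name] = (location.latitude, location.longitude)
    return results
```

With geopy installed, `geocode` could be `Nominatim(user_agent='hou_explorer').geocode`. Wrapping it in `geopy.extra.rate_limiter.RateLimiter` with `min_delay_seconds=1` is one way to keep requests at or below one per second.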

The Foursquare API location data will be our main source for exploration, giving insight into the types of venues located in each neighborhood. Foursquare describes itself as the most trusted, independent location data platform for understanding how people move through the real world (reference: https://foursquare.com). With a developer account, we can use the platform to extract real-world data on each neighborhood in Chicago and Houston, including venues such as restaurants, hotels, businesses, parks, bars, and shopping centers. This will be done in the Methodology and Analysis sections of this report.

In [1]:
# Import Libraries
import requests
import urllib.request
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

!conda install -c anaconda beautifulsoup4 --yes
from bs4 import BeautifulSoup # package used for web scraping

# function for transforming a JSON response into a pandas dataframe (top-level import requires pandas 1.0 or later)
from pandas import json_normalize

print('BeautifulSoup installed')
print('Libraries imported.')


### Getting Houston Neighborhood Data

In [3]:
# specify the URL of the webpage to scrape
url = 'https://www.houston.org/living-in-houston/neighborhoods-communities'

# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

# parse the HTML into a BeautifulSoup tree, naming the parser explicitly
soup = BeautifulSoup(page, 'html.parser')

In [8]:
# use 'find_all' to collect every 'div' tag with class 'text-timeline-year' and store them in 'houston_table'
houston_table = soup.find_all('div', class_='text-timeline-year')

# collect the text of each div
houston_neighborhood = [div.text for div in houston_table]

# strip surrounding whitespace (including \n) and assign to final_houston_neighborhood
final_houston_neighborhood = [hood.strip() for hood in houston_neighborhood]

In [10]:
# Create the dataframe, drop the empty row at position 19, and reset the index
hou_df = pd.DataFrame(final_houston_neighborhood, columns=['Neighborhood'])
houston_df = hou_df.drop(hou_df.index[19]).reset_index(drop=True)
houston_df.head()

| | Neighborhood |
|---|---|
| 0 | Ballpark District |
| 1 | Civic Center District |
| 2 | Convention District |
| 3 | Historic District |
| 4 | Medical District |
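
The cell above removes the empty entry by its position (index 19), which would silently drop a real neighborhood if the page layout changed. One alternative, sketched here under the assumption that the only unwanted entries are blank strings, is to filter on content instead. The function name `clean_names` is hypothetical.

```python
from typing import Iterable, List


def clean_names(raw_names: Iterable[str]) -> List[str]:
    """Strip surrounding whitespace and drop entries that are empty afterwards."""
    stripped = (name.strip() for name in raw_names)
    return [name for name in stripped if name]


# Example with scraped-style input: newlines around names and one blank entry
print(clean_names(["\nBallpark District\n", "  ", "\nBellaire\n"]))
# ['Ballpark District', 'Bellaire']
```

The cleaned list could then be passed straight to `pd.DataFrame(..., columns=['Neighborhood'])` with no positional drop.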


Now that we have the neighborhoods in Houston, we need to add the latitude and longitude coordinates for each neighborhood using geopy's Nominatim geocoder.

In [12]:
# Create User_agent for Geopy Locator
geolocator = Nominatim(user_agent='hou_explorer')

In [13]:
# Geocode each neighborhood once; fall back to (0.0, 0.0) when Nominatim finds no match
def get_coords(name):
    location = geolocator.geocode(name + ', Houston')
    return (location.latitude, location.longitude) if location is not None else (0.0, 0.0)

# Add the lat and lng coordinates as columns of the dataframe
coords = houston_df['Neighborhood'].apply(get_coords).tolist()
houston_df['Latitude'] = [lat for lat, lng in coords]
houston_df['Longitude'] = [lng for lat, lng in coords]

# View first 10 rows of Dataframe
houston_df.head(10)

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Ballpark District | 0.0 | 0.0 |
| 1 | Civic Center District | 54.398267 | -126.647708 |
| 2 | Convention District | 29.754999 | -95.356291 |
| 3 | Historic District | 38.805139 | -77.046931 |
| 4 | Medical District | 30.275061 | -97.738481 |
| 5 | Shopping District | 54.398267 | -126.647708 |
| 6 | Skyline District | 0.0 | 0.0 |
| 7 | Theater District | 29.761077 | -95.366364 |
| 8 | Warehouse District | 38.80426 | -77.042155 |
| 9 | Bellaire | 29.69662 | -95.575222 |


We can see that some coordinates are missing (Ballpark District has 0.0 for both latitude and longitude) and others are clearly wrong (Shopping District's latitude of 54.398267 is far north of Texas). This can happen with geopy when a query has no match or matches a different place, so we will look up the missing and incorrect coordinates on Google and update the data frame manually.

In [17]:
# Update Latitude Values in houston_df
new_df = pd.DataFrame({'Latitude': [29.7050,29.5420,29.8707,29.7167,29.7329,29.6459,29.6726,29.7771,29.6851,29.8016,29.8327,29.7363,29.6733,29.7598,29.7542,30.1580,29.5763,29.6681,29.7552]}, index=[60,52,25,24,23,22,21,17,16,15,14,11,10,6,0,39,41,3,5])
houston_df.update(new_df)

# Update Longitude Values in houston_df
new_df1 = pd.DataFrame({'Longitude': [-95.5453,-95.0170,-95.4365,-95.4169,-95.4334,-95.2769,-95.4201,-95.4355,-95.3993,-95.4381,-95.4448,-95.3043,-95.4399,-95.3633,-95.3533,-95.4894,-95.5370,-95.2802,-95.3627]}, index=[60,52,25,24,23,22,21,17,16,15,14,11,10,6,0,39,41,3,5])
houston_df.update(new_df1)

houston_df.head(10)

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Ballpark District | 29.7542 | -95.3533 |
| 1 | Civic Center District | 54.398267 | -126.647708 |
| 2 | Convention District | 29.754999 | -95.356291 |
| 3 | Historic District | 29.6681 | -95.2802 |
| 4 | Medical District | 30.275061 | -97.738481 |
| 5 | Shopping District | 29.7552 | -95.3627 |
| 6 | Skyline District | 29.7598 | -95.3633 |
| 7 | Theater District | 29.761077 | -95.366364 |
| 8 | Warehouse District | 38.80426 | -77.042155 |
| 9 | Bellaire | 29.69662 | -95.575222 |
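
In the updated table above, rows 1, 4, and 8 (Civic Center District, Medical District, and Warehouse District) still carry coordinates far from Houston, because their indices were not in the update lists. A quick way to catch such rows is to test each point against a rough bounding box around the city. The sketch below assumes an approximate box for the greater Houston area; the bounds are illustrative guesses, not official limits, and the names `HOUSTON_BOX` and `outside_box` are hypothetical.

```python
from typing import Dict, List, Tuple

# Rough, assumed bounds for the greater Houston area (illustrative only)
HOUSTON_BOX = {"lat_min": 29.3, "lat_max": 30.3, "lng_min": -96.0, "lng_max": -94.8}


def outside_box(points: Dict[str, Tuple[float, float]],
                box: Dict[str, float]) -> List[str]:
    """Return the names whose (lat, lng) pair falls outside the bounding box."""
    flagged = []
    for name, (lat, lng) in points.items():
        in_lat = box["lat_min"] <= lat <= box["lat_max"]
        in_lng = box["lng_min"] <= lng <= box["lng_max"]
        if not (in_lat and in_lng):
            flagged.append(name)
    return flagged


# Values copied from the table above
sample = {
    "Ballpark District": (29.7542, -95.3533),
    "Civic Center District": (54.398267, -126.647708),
    "Medical District": (30.275061, -97.738481),
    "Warehouse District": (38.80426, -77.042155),
    "Bellaire": (29.69662, -95.575222),
}
print(outside_box(sample, HOUSTON_BOX))
# ['Civic Center District', 'Medical District', 'Warehouse District']
```

Any flagged rows would need the same manual lookup as the other corrected neighborhoods before the coordinates are sent to Foursquare.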


### Getting Chicago Neighborhood Data

In [18]:
# specify the URL of the Wikipedia page to scrape
url = 'https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Chicago'

# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

# parse the HTML into a BeautifulSoup tree, naming the parser explicitly
soup = BeautifulSoup(page, 'html.parser')

In [20]:
# use the 'find' function to bring back the 'table' tag with class 'wikitable sortable' in the HTML and store in 'chicago_table' variable
chicago_table = soup.find('table', class_='wikitable sortable')

# Make empty lists to store data from table loop
chicago_neighborhood = []
chicago_borough = []
for row in chicago_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 2:
        # take the first text node of each cell ('string' replaces the deprecated 'text' argument)
        chicago_neighborhood.append(cells[0].find(string=True))
        chicago_borough.append(cells[1].find(string=True))

# remove the \n from the neighborhood list and assign to final_chicago_neighborhood
final_chicago_neighborhood = [hood.strip() for hood in chicago_neighborhood]

# remove the \n from the borough list and assign to final_chicago_borough
final_chicago_borough = [bor.strip() for bor in chicago_borough]

In [21]:
# Create Dataframe
chi_df = pd.DataFrame(final_chicago_neighborhood, columns=['Neighborhood'])
chi_df['Borough'] = final_chicago_borough

# Group by Borough, separating the Neighborhoods by a comma on the same row
chicago_df = chi_df.groupby(['Borough'])['Neighborhood'].apply(', '.join).reset_index()
chicago_df.head()

| | Borough | Neighborhood |
|---|---|---|
| 0 | Albany Park | Albany Park, Mayfair, North Mayfair, Ravenswoo... |
| 1 | Archer Heights | Archer Heights |
| 2 | Armour Square | Armour Square, Chinatown, Wentworth Gardens |
| 3 | Ashburn | Ashburn, Ashburn Estates, Beverly View, Crestl... |
| 4 | Auburn Gresham | Auburn Gresham, Gresham |


In [22]:
# Turn the Borough column of chicago_df into a list for computing latitude/longitude coordinates
borough_list = chicago_df['Borough'].to_list()

# Create empty lists for lat and lng coordinates
chi_lat = []
chi_lng = []

# Create the geocoder once, outside the loop
geolocator = Nominatim(user_agent='chi_explorer')

# Find the latitude and longitude of each borough and append to the lists
for bor in borough_list:
    location = geolocator.geocode(bor + ', Chicago')
    # geocode returns None when nothing matches; store None instead of raising AttributeError
    chi_lat.append(location.latitude if location is not None else None)
    chi_lng.append(location.longitude if location is not None else None)

In [23]:
# add chi_lat and chi_lng coordinates to chicago_df and view final dataframe
chicago_df['Latitude'] = chi_lat
chicago_df['Longitude'] = chi_lng

chicago_df.head(10)

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Albany Park | Albany Park, Mayfair, North Mayfair, Ravenswoo... | 41.971937 | -87.716174 |
| 1 | Archer Heights | Archer Heights | 41.811422 | -87.726165 |
| 2 | Armour Square | Armour Square, Chinatown, Wentworth Gardens | 41.840033 | -87.633107 |
| 3 | Ashburn | Ashburn, Ashburn Estates, Beverly View, Crestl... | 41.747533 | -87.711163 |
| 4 | Auburn Gresham | Auburn Gresham, Gresham | 41.743387 | -87.656042 |
| 5 | Austin | Galewood, The Island, North Austin, South Aust... | 41.887876 | -87.764851 |
| 6 | Avalon Park | Avalon Park, Marynook, Stony Island Park | 41.745035 | -87.588658 |
| 7 | Avondale | Avondale, Jackowo, Polish Village, Wacławowo | 41.938921 | -87.711168 |
| 8 | Belmont Cragin | Belmont Central, Brickyard, Cragin, Hanson Park | 41.931698 | -87.76867 |
| 9 | Beverly | Beverly, East Beverly, West Beverly | 41.718153 | -87.671767 |


In [24]:
# Save the dataframes as CSV files
chicago_df.to_csv('chicago_df.csv')
houston_df.to_csv('houston_df.csv')

Now that we have the above datasets for Chicago and Houston, we can use the location coordinates to obtain venue data for those neighborhoods from the Foursquare API and start our analysis.
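
As a preview of that step, the sketch below builds a venue-search URL for a single coordinate pair without sending it. It assumes Foursquare's legacy v2 `venues/explore` endpoint with client ID and secret credentials; newer versions of the Foursquare API use different endpoints and authentication, so treat the endpoint, the parameter names, and the default values for `version`, `radius`, and `limit` as assumptions to check against the current documentation. The function name `build_explore_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; newer Foursquare API versions differ
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20200101", radius=500, limit=100):
    """Build a venue-search URL for one (lat, lng) point without sending it."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date in YYYYMMDD form
        "ll": f"{lat},{lng}",  # latitude,longitude of the search center
        "radius": radius,      # search radius in meters
        "limit": limit,        # maximum number of venues to return
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder credentials; real values come from a Foursquare developer account
url = build_explore_url(29.69662, -95.575222, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The resulting URL could be passed to `requests.get(url).json()`, and the nested response flattened with `json_normalize`, once real credentials are in place.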