# Introduction

While we were examining the different types of venues in Toronto, it occurred to me that an overabundance of coffee shops in the downtown core of a major city is not all that surprising. But what happens farther out? What happens when other levers change, such as the median household income of an area? In which parts of a city are establishments serving alcohol more common than those serving coffee, and how do the economics of those areas differ?

The District of Columbia (within its strict city limits) is both a relatively small urban area and a prime example of the ever-increasing wealth inequality in the United States. Household income at the granular census tract level tells one part of the story, but I think examining the types of commercial and public venues in a neighborhood will make the picture even clearer. With these two elements, I can see which venue types point to wealth, which point to poverty or to barely staying above it, and which may mark neighborhoods in transition as development and gentrification alter the status quo. I may then be able to flag neighborhoods likely to transition in the future, since development is often a long game that many do not notice until it is in full swing.

Using census-tract-level median household income data, I want to see, through clustering analysis, which types of venues are more common in different income areas of the nation's capital. I will then build a classification model to predict which areas of the city are more likely to gentrify in the near future.

# Data Section

_Data Sets_

1. Census Tract Coordinates
    1. I downloaded a shapefile from the US Census Bureau of all census tracts in the District of Columbia. I then converted that shapefile to JSON and uploaded it to my notebook.
    2. I extracted the relevant data from the JSON: each census tract's numerical name and FIPS (unique identifier) code. I also extracted and averaged all the geographical coordinates that make up a census tract's boundary to approximate the centroid of each tract for my analysis.
2. Census Tract Level Income
    1. I downloaded a CSV of the median household income per census tract from the District of Columbia government and uploaded it to my notebook as a pandas dataframe.
3. Foursquare Nearby Venues
    1. Using the census tract centroids, I retrieved the venues within a fixed radius of each centroid. There will be considerable overlap among the venues, because census tracts are delineated by population rather than by area, so a fixed search radius can reach into neighboring tracts. This will not be a problem, as the analysis focuses on the most prevalent types of venues at different income levels.
       1. For example, one restaurant may fall within the radii of a low-, a medium-, and a high-income census tract. What matters to the analysis is that the restaurant is present near each of those tracts, not the total number of restaurants.
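
To make the overlap concrete, the sketch below checks whether a single venue can sit inside the search radius of two tracts at once. The two centroids are the values this notebook computes for tracts 37 and 36, and 500 m is the default radius used in the venue search. The venue location (the midpoint of the two centroids) and the helper name `haversine_m` are illustrative assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Centroids of tracts 37 and 36 as computed in this notebook.
centroids = {
    "37": (38.922393, -77.034116),
    "36": (38.923503, -77.030032),
}

# Hypothetical venue placed halfway between the two centroids.
venue = (
    (centroids["37"][0] + centroids["36"][0]) / 2,
    (centroids["37"][1] + centroids["36"][1]) / 2,
)

radius_m = 500
tracts_in_range = [
    tract
    for tract, (lat, lng) in centroids.items()
    if haversine_m(lat, lng, *venue) <= radius_m
]
print(tracts_in_range)  # tracts whose search radius contains the venue
```

Because the two centroids are less than 500 m apart, the same venue is returned for both tracts, which is the overlap described above.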
   
_Data Preparation and Execution Plan_

1. Group the Foursquare venue types into fewer, broader categories.
2. Determine appropriate bins for income levels.
3. Cluster the census tracts using venue category and income level.
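
Step 1 of this plan could be sketched as a simple lookup table. The category names, the group labels, and the `broaden` helper below are all hypothetical; the real mapping would be built from the categories that actually appear in the venue data.

```python
# Hypothetical mapping from fine-grained venue categories to broader groups.
# Both the keys and the group labels are illustrative, not a final scheme.
BROAD_CATEGORY = {
    "Coffee Shop": "Coffee",
    "Café": "Coffee",
    "Bar": "Alcohol",
    "Wine Bar": "Alcohol",
    "Brewery": "Alcohol",
    "Pizza Place": "Restaurant",
    "Sandwich Place": "Restaurant",
}


def broaden(category, default="Other"):
    """Return the broad group for a venue category, or `default` if unmapped."""
    return BROAD_CATEGORY.get(category, default)


sample = ["Coffee Shop", "Wine Bar", "Pizza Place", "Dog Run"]
print([broaden(c) for c in sample])
# ['Coffee', 'Alcohol', 'Restaurant', 'Other']
```

Keeping an explicit "Other" bucket makes unmapped categories visible, so the mapping can be extended as the data is reviewed.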
    

### Census Tract Coordinates JSON

The hidden cell below contains my credentials and uploads the JSON conversion of a US Census shapefile, which holds the geographical coordinates of census tracts in Washington, DC.

In [1]:
# The code was removed by Watson Studio for sharing.

In [2]:
import json
import pandas as pd  # used by the DataFrame cells below

census_data = json.loads(body)

In [3]:
tract_data = census_data['features']

tract_data[0]

{'type': 'Feature',
 'geometry': {'type': 'Polygon',
  'coordinates': [[[-77.0365, 38.919284],
    [-77.03649399999999, 38.919861],
    [-77.036496, 38.920313],
    [-77.03649399999999, 38.920364],
    [-77.036489, 38.920964],
    [-77.03649, 38.921673],
    [-77.036487, 38.922609],
    [-77.03648799999999, 38.922731999999996],
    [-77.036491, 38.923179999999995],
    [-77.036492, 38.923207999999995],
    [-77.036491, 38.923249999999996],
    [-77.036489, 38.92342],
    [-77.036489, 38.924226999999995],
    [-77.036484, 38.924765],
    [-77.03648299999999, 38.924825999999996],
    [-77.036486, 38.925604],
    [-77.036486, 38.925793],
    [-77.036487, 38.92588],
    [-77.03648799999999, 38.926183],
    [-77.036489, 38.926272],
    [-77.036335, 38.926294],
    [-77.035988, 38.92631],
    [-77.035484, 38.926325],
    [-77.034943, 38.926353999999996],
    [-77.033605, 38.926435999999995],
    [-77.033114, 38.926463999999996],
    [-77.032721, 38.926488],
    [-77.032629, 38.92650099999999

In [4]:
column_names = ['Fips Code', 'Census Tract', 'Latitude', 'Longitude']

washDC = pd.DataFrame(columns=column_names)

washDC

Unnamed: 0,Fips Code,Census Tract,Latitude,Longitude


In [5]:
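rows = []
for data in tract_data:
    geoid = data['properties']['GEOID']
    tractno = data['properties']['NAME']
    # coordinates[0] is the polygon's outer ring, stored as [lng, lat] pairs
    ring = data['geometry']['coordinates'][0]
    lats = [point[1] for point in ring]
    lngs = [point[0] for point in ring]

    # approximate the centroid as the mean of the boundary vertices
    rows.append({'Fips Code': geoid,
                 'Census Tract': tractno,
                 'Latitude': sum(lats) / len(lats),
                 'Longitude': sum(lngs) / len(lngs)})

# DataFrame.append was removed in pandas 2.0, so build the frame from the row list
washDC = pd.DataFrame(rows, columns=column_names)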
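
The loop above takes the mean of the boundary vertices, which pulls the result toward stretches of boundary that happen to be drawn with many closely spaced points. For comparison, here is a sketch of an area-weighted centroid using the shoelace formula. The helper name `polygon_centroid` and the sample ring are illustrative, and treating longitude and latitude as planar coordinates is an assumption that is only reasonable over a small area such as a single city.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [x, y] pairs.

    The ring may be open or closed (first point repeated at the end).
    """
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3 * twice_area), cy / (3 * twice_area)


# Illustrative ring: a unit square with extra vertices crowded near one corner.
ring = [[0, 0], [1, 0], [1, 1], [0.2, 1], [0.1, 1], [0, 1], [0, 0]]

mean_x = sum(p[0] for p in ring) / len(ring)
mean_y = sum(p[1] for p in ring) / len(ring)
print((mean_x, mean_y))        # vertex mean, skewed toward the crowded corner
print(polygon_centroid(ring))  # area-weighted centroid of the square
```

For compact tracts the two methods give similar points, so the simpler vertex mean may be adequate for a 500 m venue search; the difference matters more for elongated or irregular tracts.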

In [6]:
washDC.head()

Unnamed: 0,Fips Code,Census Tract,Latitude,Longitude
0,11001003700,37.0,38.922393,-77.034116
1,11001003800,38.0,38.920951,-77.039831
2,11001004001,40.01,38.920295,-77.046168
3,11001004002,40.02,38.918421,-77.043794
4,11001003600,36.0,38.923503,-77.030032


### Census Tract Income Data

The hidden cell below contains my credentials and uploads a CSV file of median household income from opendata.dc.gov.

In [7]:
# The code was removed by Watson Studio for sharing.

In [8]:
median_inc = pd.read_csv(body2)
median_inc.head()

Unnamed: 0,OBJECTID,GEOID,ALAND,AWATER,NAME,State,County,B19049_001E,B19049_001M,B19049_002E,...,B19053_002E,B19053_002M,B19053_003E,B19053_003M,B19053_calc_pctSelfempE,B19053_calc_pctSelfempM,Shape__Area,Shape__Length,Shape__Area_2,Shape__Length_2
0,1,11001000100,1907610,512798,Census Tract 1,District of Columbia,District of Columbia,191146.0,25411.0,,...,358,97,1993,197,15.2,3.9461,3157970.0,16275.593084,3157970.0,16275.593084
1,2,11001000201,503312,0,Census Tract 2.01,District of Columbia,District of Columbia,,,,...,0,12,0,12,,,832414.2,4265.956241,832414.2,4265.956241
2,3,11001000202,776437,428754,Census Tract 2.02,District of Columbia,District of Columbia,170987.0,28290.0,,...,313,125,1250,179,20.0,7.753475,1284189.0,13196.755434,1284189.0,13196.755434
3,4,11001000300,1010802,2334,Census Tract 3,District of Columbia,District of Columbia,152120.0,21528.0,36047.0,...,245,99,2210,149,10.0,4.007263,1675991.0,5244.314206,1675991.0,5244.314206
4,5,11001000400,1542759,69,Census Tract 4,District of Columbia,District of Columbia,126731.0,38147.0,,...,115,34,503,54,18.6,5.265075,2552695.0,7468.467697,2552695.0,7468.467697


In [9]:
dc_income = median_inc[['GEOID','NAME','B19049_001E','B19053_001E']]
dc_income.head()

Unnamed: 0,GEOID,NAME,B19049_001E,B19053_001E
0,11001000100,Census Tract 1,191146.0,2351
1,11001000201,Census Tract 2.01,,0
2,11001000202,Census Tract 2.02,170987.0,1563
3,11001000300,Census Tract 3,152120.0,2455
4,11001000400,Census Tract 4,126731.0,618


In [10]:
# rename returns a new DataFrame; reassigning it avoids pandas' SettingWithCopyWarning
dc_income = dc_income.rename(columns={'B19049_001E': 'Median_Household_Income',
                                      'B19053_001E': 'Total_Households'})
dc_income.head()


Unnamed: 0,GEOID,NAME,Median_Household_Income,Total_Households
0,11001000100,Census Tract 1,191146.0,2351
1,11001000201,Census Tract 2.01,,0
2,11001000202,Census Tract 2.02,170987.0,1563
3,11001000300,Census Tract 3,152120.0,2455
4,11001000400,Census Tract 4,126731.0,618


In [11]:
# drop tracts with missing values and reassign, rather than mutating a slice in place
dc_income = dc_income.dropna().reset_index(drop=True)
dc_income.head()


Unnamed: 0,GEOID,NAME,Median_Household_Income,Total_Households
0,11001000100,Census Tract 1,191146.0,2351
1,11001000202,Census Tract 2.02,170987.0,1563
2,11001000300,Census Tract 3,152120.0,2455
3,11001000400,Census Tract 4,126731.0,618
4,11001000501,Census Tract 5.01,116303.0,1888
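
Clustering on both venues and income will eventually require joining this table to the tract coordinates. The sketch below shows one way to do that, assuming the GeoJSON `GEOID` values are strings while the CSV `GEOID` column was read as integers; that dtype mismatch is an assumption to check against the real frames. The two small frames are stand-ins whose identifiers and incomes are copied from rows shown in this notebook.

```python
import pandas as pd

# Stand-in for washDC: identifiers assumed to be strings, as read from JSON.
tracts = pd.DataFrame({
    "Fips Code": ["11001003700", "11001000100"],
    "Census Tract": ["37", "1"],
})

# Stand-in for dc_income: identifiers assumed to be integers, as read from CSV.
income = pd.DataFrame({
    "GEOID": [11001000100, 11001000202],
    "Median_Household_Income": [191146.0, 170987.0],
})

# Align the key dtypes first; a string key will not match an integer key.
income = income.assign(**{"Fips Code": income["GEOID"].astype(str)})

merged = tracts.merge(
    income[["Fips Code", "Median_Household_Income"]],
    on="Fips Code",
    how="left",      # keep every tract, even one without income data
    indicator=True,  # adds a _merge column that flags unmatched rows
)
print(merged)
```

The `_merge` column makes it easy to list tracts that have coordinates but no income record before deciding whether to drop them.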


In [12]:
dc_income.describe()

Unnamed: 0,GEOID,Median_Household_Income,Total_Households
count,177.0,177.0,177.0
mean,11001010000.0,87428.0,1589.169492
std,3264.192,46242.09342,729.377096
min,11001000000.0,13750.0,32.0
25%,11001000000.0,45278.0,1101.0
50%,11001010000.0,84375.0,1428.0
75%,11001010000.0,115667.0,1875.0
max,11001010000.0,250001.0,4811.0
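
The quartiles in this summary suggest one possible set of income bins for step 2 of the execution plan. The sketch below is a minimal version, assuming quartile cut points are an acceptable binning; the labels and the `income_bin` helper are hypothetical.

```python
from bisect import bisect_right

# Quartile cut points copied from the describe() output (25%, 50%, 75%).
CUTS = [45278.0, 84375.0, 115667.0]
LABELS = ["low", "lower-middle", "upper-middle", "high"]  # illustrative labels


def income_bin(median_income):
    """Map a median household income to a quartile-based label.

    A value exactly equal to a cut point falls into the higher bin.
    """
    return LABELS[bisect_right(CUTS, median_income)]


for income in (13750.0, 84375.0, 191146.0):
    print(income, income_bin(income))
```

On the full column, `pd.cut` or `pd.qcut` would apply the same idea in a single call; the explicit cut points here just make the bin edges easy to inspect.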


### Foursquare Location Data

My Foursquare API credentials are in the hidden cell below.

In [13]:
# The code was removed by Watson Studio for sharing.

In [14]:
import requests

limit = 500

In [38]:
def getNearbyVenues(codes, names, latitudes, longitudes, radius=500):
    """Return one row per venue found within `radius` metres of each tract centroid."""
    venues_list = []
    for code, name, lat, lng in zip(codes, names, latitudes, longitudes):
        print(name)  # progress indicator

        # let requests build and encode the query string
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': limit,
        }
        response = requests.get('https://api.foursquare.com/v2/venues/explore',
                                params=params, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        results = response.json()['response']['groups'][0]['items']

        # keep the tract identifiers next to each venue's name, location and first category
        venues_list.append([(
            code,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-tract lists into a single frame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Fips Code',
                             'Tract Number',
                             'Tract Latitude',
                             'Tract Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return nearby_venues
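
The function above indexes `categories[0]` for every item, so a venue returned with an empty category list would raise an `IndexError`. Below is a sketch of a more defensive extraction step, run against a mocked payload so it needs no network access or credentials. The payload shape mirrors the keys accessed in `getNearbyVenues`; the helper name `extract_venue_rows`, the sample venues, and the "Uncategorized" label are hypothetical.

```python
def extract_venue_rows(payload, code, name, lat, lng, fallback="Uncategorized"):
    """Flatten an explore-style response into row tuples, tolerating gaps."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else fallback
        rows.append((
            code, name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Mocked payload with made-up venues, shaped like the keys accessed above.
mock = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 38.9220, "lng": -77.0340},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Example Lot",
               "location": {"lat": 38.9230, "lng": -77.0350},
               "categories": []}},
]}]}}

rows = extract_venue_rows(mock, "11001003700", "37", 38.922393, -77.034116)
print([row[-1] for row in rows])
# ['Coffee Shop', 'Uncategorized']
```

An empty or malformed response yields an empty list instead of an exception, so one bad tract does not stop a long run over every tract.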

In [17]:
washDC.columns

Index(['Fips Code', 'Census Tract', 'Latitude', 'Longitude'], dtype='object')

In [21]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Folium installed')

In [22]:
from geopy.geocoders import Nominatim
address = 'Washington, DC'
geolocator = Nominatim(user_agent='lincoln')
location = geolocator.geocode(address)
DC_lat = location.latitude
DC_lng = location.longitude

In [30]:
map_DC = folium.Map(location=[DC_lat,DC_lng], zoom_start=11)

for lat, lng, tract in zip(washDC['Latitude'], washDC['Longitude'], washDC['Census Tract']):
    label = 'Census Tract {}'.format(tract)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_DC)

map_DC

In [39]:
DC_venues = getNearbyVenues(washDC['Fips Code'],washDC['Census Tract'],washDC['Latitude'],washDC['Longitude'])

37
38
40.01
40.02
36
42.01
42.02
33.02
74.07
68.01
107
96.04
2.01
3
4
5.02
44
46
48.01
48.02
49.02
74.01
74.03
6
7.01
7.02
74.04
73.01
27.02
83.01
16
105
92.04
73.04
30
56
75.02
75.03
9.02
15
13.02
55
96.03
5.01
9.01
84.02
96.02
8.02
13.01
77.08
20.01
62.02
8.01
14.02
101
47.02
52.01
53.01
76.01
75.04
99.02
26
96.01
39
2.02
43
68.04
41
79.01
49.01
59
99.01
1
35
18.03
74.06
33.01
76.04
76.05
76.03
50.02
47.01
81
82
83.02
84.10
106
87.01
80.02
87.02
77.03
77.07
77.09
78.03
78.04
88.03
88.04
89.03
89.04
88.02
98.10
108
109
104
99.03
99.04
99.05
99.06
21.02
22.01
22.02
23.01
64
65
66
67
102
110
68.02
74.08
74.09
78.09
79.03
80.01
94
95.01
95.03
95.04
95.05
99.07
25.01
23.02
24
25.02
70
71
69
72
95.07
95.08
95.09
97
98.01
27.01
28.01
28.02
111
90
91.02
98.11
10.01
10.02
78.07
78.06
78.08
92.01
92.03
93.01
93.02
11
12
14.01
98.04
17.02
103
18.04
19.01
19.02
58
50.01
98.07
98.02
98.03
20.02
21.01
29
31
32
34


In [42]:
DC_venues.columns

Index(['Fips Code', 'Tract Number', 'Tract Latitude', 'Tract Longitude',
       'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'],
      dtype='object')

In [46]:
DC_venues.groupby('Tract Number').count()

Tract Number,Fips Code,Tract Latitude,Tract Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1,77,77,77,77,77,77,77
10.02,15,15,15,15,15,15,15
101,100,100,100,100,100,100,100
102,89,89,89,89,89,89,89
103,11,11,11,11,11,11,11
104,5,5,5,5,5,5,5
105,20,20,20,20,20,20,20
106,52,52,52,52,52,52,52
107,100,100,100,100,100,100,100
108,89,89,89,89,89,89,89
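
These counts show how many venues each tract returned, but clustering needs the mix of categories within each tract. One way to get there is a row-normalized cross-tabulation, sketched below. The small `venues` frame is a stand-in for `DC_venues` with made-up rows, and using category shares as clustering features is an assumption about the next step rather than something this notebook has done yet.

```python
import pandas as pd

# Stand-in for DC_venues with made-up rows; only the two columns used here.
venues = pd.DataFrame({
    "Tract Number": ["101", "101", "101", "101", "104", "104", "104"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bar", "Hotel",
                       "Bar", "Bar", "Park"],
})

# Share of each category within a tract; every row sums to 1.
category_share = pd.crosstab(
    venues["Tract Number"], venues["Venue Category"], normalize="index"
)
print(category_share.round(2))

# Most common category in each tract.
top_category = category_share.idxmax(axis=1)
print(top_category)
```

Working with shares rather than raw counts keeps a tract that returned 100 venues comparable with one that returned 5.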
