# 1. Introduction/Business Problem

<span style="font-family:verdana"><span style="font-size:18px">
Parks, gardens and green open spaces are fundamental to the health and well-being of citizens, so much so that these areas are called the "green lungs" of cities. For this reason, it is interesting to study how green a city is, especially in the case of the large metropolises around the world. For this purpose, I chose to compare an American metropolis, New York, with a European one, London.
The goal is to understand which of these two cities is greener and how green areas are distributed within each city.
To compare New York and London fairly, it is also important to take into account the size of each city, in terms of both population and land area.







# 2. Data

<span style="font-family:verdana"><span style="font-size:18px">
To solve the problem I will consider the following data:
 

+ <span style="font-family:verdana"><span style="font-size:18px"> A list of London's parks and open spaces, with the corresponding area measured in hectares and acres, found on [Wikipedia](https://en.wikipedia.org/wiki/Parks_and_open_spaces_in_London).
    
+ <span style="font-family:verdana"><span style="font-size:18px"> Since the above dataframe contains no location data, I use the **Foursquare API** to get the precise coordinates of each park in London.
    
+ <span style="font-family:verdana"><span style="font-size:18px"> From [NYC OpenData](https://nycopendata.socrata.com/City-Government/Parks-Zones/4j29-i5ry), a .json file containing a complete and detailed dataset of New York's green zones. I will clean this dataset by extracting the data on park names, location and size.
    
+ <span style="font-family:verdana"><span style="font-size:18px">
The [**London region**](https://en.wikipedia.org/wiki/Greater_London#:~:text=The%20region%20covers%201%2C572%20km,areas%20outside%20the%20administrative%20region) forms the administrative boundaries of London and is organised into 33 local government districts, the 32 London boroughs and the City of London. The region covers **1,572 km2** and has a population of **8,982,000**.

+ <span style="font-family:verdana"><span style="font-size:18px">
[**New York City**](https://en.wikipedia.org/wiki/New_York_City) is composed of 5 boroughs, each of which is a county of the State of New York. The boroughs are: Brooklyn, Queens, Manhattan, the Bronx, and Staten Island. New York covers **1,213.37 km2** and has a population of **8,399,000**.

<span style="font-family:verdana"><span style="font-size:18px">
Using the *land-area data* and some simple statistics, one can calculate the total green area of London and New York.
The figures above show that London and New York are of fairly similar size. However, to be more precise, the total green area of each city has to be considered in proportion to the size of that city.
    
<span style="font-family:verdana"><span style="font-size:18px">
*Location data* will help to understand how green areas are spread across New York and London. I think the spatial distribution is important because, if green areas are well distributed around the city, more people can access them.
To visualize this I will use the Folium library.

<span style="font-family:verdana"><span style="font-size:18px">
Finally, by comparing the results, one can determine which city has the larger extent of green areas (in proportion to its size), how much green area is available per citizen, and how green areas are distributed within each city.
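
The comparison described above can be sketched in a few lines. This is a minimal illustration, not the analysis itself: the function name `green_metrics` is hypothetical, the area and population figures are the ones quoted above, and the green totals are placeholder values (an assumption) to be replaced with the sums computed from the park data.

```python
def green_metrics(green_hectares, city_area_km2, population):
    """Return (share of the city area that is green, green m² per inhabitant)."""
    green_km2 = green_hectares / 100.0  # 100 hectares = 1 km²
    share = green_km2 / city_area_km2
    m2_per_person = green_km2 * 1_000_000 / population
    return share, m2_per_person


# Area and population come from the text above;
# the 'green_ha' values are placeholders (assumptions), not results.
cities = {
    'London': {'area_km2': 1572.0, 'population': 8_982_000, 'green_ha': 10_000.0},
    'New York': {'area_km2': 1213.37, 'population': 8_399_000, 'green_ha': 10_000.0},
}

for city, c in cities.items():
    share, per_person = green_metrics(c['green_ha'], c['area_km2'], c['population'])
    print(f"{city}: {share:.1%} green, {per_person:.1f} m² per inhabitant")
```

Normalising by both area and population keeps the two cities comparable even though their sizes differ slightly.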
<span style="font-family:verdana"><span style="font-size:18px">
In this section I collect the above data.

### Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# BeautifulSoup
from bs4 import BeautifulSoup

#!conda install -c conda-forge folium=0.5.0 --yes 
#import folium # map rendering library

print('Libraries imported.')


### Parks in London

<span style="font-family:verdana"><span style="font-size:18px">
Let's import a dataframe containing some of London's green spaces from the [Wikipedia page](https://en.wikipedia.org/wiki/Parks_and_open_spaces_in_London).
For this purpose I use BeautifulSoup.

In [2]:
req = requests.get("https://en.wikipedia.org/wiki/Parks_and_open_spaces_in_London")

soup = BeautifulSoup(req.content,'lxml')

table = soup.find_all('table')[0]

df = pd.read_html(str(table))

london_parks = pd.DataFrame(df[0]) 

In [3]:
london_parks

Unnamed: 0,name,hectares,acres
0,Thames Chase,9842,"24,320[8]"
1,Epping Forest,2476,"6,118[9]"
2,Wildspace Conservation Park,645,"1,593[10]"
3,Wimbledon Common,460,"1,136[11]"
4,Hampstead Heath,320,790[12]
5,Walthamstow Wetlands,211,520[13]
6,Mitcham Common,182,450[14]
7,Trent Park,169,418[15]
8,Hainault Forest Country Park,136,336[16]
9,Clapham Common,89,220[17]


<span style="font-family:verdana"><span style="font-size:18px">
The above dataframe has three columns: name, hectares and acres. The name column lists the parks, while the hectares and acres columns give the size of each park in hectares and acres respectively.
Unfortunately, this dataframe is not complete, so I need to add the remaining data from Wikipedia manually.

In [4]:
# Add data from Wikipedia
# (DataFrame.append was removed in pandas 2.0, so collect the rows and use pd.concat)

extra_parks = [
    # Royal Parks
    {'name': 'Richmond Park', 'hectares': 955, 'acres': 2359.85},
    {'name': 'Bushy Park', 'hectares': 450, 'acres': 1112},
    {'name': 'Regent Park', 'hectares': 197, 'acres': 486.79},
    {'name': 'Hyde Park', 'hectares': 140, 'acres': 346},
    {'name': 'Kensington Gardens', 'hectares': 111, 'acres': 274},
    {'name': 'Greenwich Park', 'hectares': 73, 'acres': 180},
    {'name': 'St. James Park', 'hectares': 34, 'acres': 84},
    {'name': 'Green Park', 'hectares': 16, 'acres': 39.5},
    # Council parks
    {'name': 'Victoria Park', 'hectares': 86.18, 'acres': 213},
    {'name': 'Battersea Park', 'hectares': 83, 'acres': 205},
    {'name': 'Crystal Palace Park', 'hectares': 80, 'acres': 200},
    {'name': 'Alexandra Park', 'hectares': 80, 'acres': 197.68},
    {'name': 'Brockwell Park', 'hectares': 51, 'acres': 126},
]

# Append all the extra rows at once and renumber the index
london_parks = pd.concat([london_parks, pd.DataFrame(extra_parks)], ignore_index=True)

london_parks

Unnamed: 0,name,hectares,acres
0,Thames Chase,9842.0,"24,320[8]"
1,Epping Forest,2476.0,"6,118[9]"
2,Wildspace Conservation Park,645.0,"1,593[10]"
3,Wimbledon Common,460.0,"1,136[11]"
4,Hampstead Heath,320.0,790[12]
5,Walthamstow Wetlands,211.0,520[13]
6,Mitcham Common,182.0,450[14]
7,Trent Park,169.0,418[15]
8,Hainault Forest Country Park,136.0,336[16]
9,Clapham Common,89.0,220[17]


<span style="font-family:verdana"><span style="font-size:18px">
To avoid problems when searching for coordinates, let's simplify the names of Wildspace Conservation Park and Hainault Forest Country Park.

In [5]:
# Use .loc to assign in place; chained indexing triggers a SettingWithCopyWarning
london_parks.loc[2, 'name'] = 'Wildspace Park'
london_parks.loc[8, 'name'] = 'Hainault Park'


In [6]:
london_parks

Unnamed: 0,name,hectares,acres
0,Thames Chase,9842.0,"24,320[8]"
1,Epping Forest,2476.0,"6,118[9]"
2,Wildspace Park,645.0,"1,593[10]"
3,Wimbledon Common,460.0,"1,136[11]"
4,Hampstead Heath,320.0,790[12]
5,Walthamstow Wetlands,211.0,520[13]
6,Mitcham Common,182.0,450[14]
7,Trent Park,169.0,418[15]
8,Hainault Park,136.0,336[16]
9,Clapham Common,89.0,220[17]
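
The acres column in the table above still carries Wikipedia footnote markers and thousands separators (for example `24,320[8]`), so it cannot be summed as it stands. A minimal sketch of one way to clean such values follows; the helper name `parse_acres` is hypothetical and the regular expression assumes the markers always have the form `[number]`.

```python
import re


def parse_acres(raw):
    """Convert a scraped value such as '24,320[8]' to a float."""
    text = re.sub(r'\[\d+\]', '', str(raw))  # drop footnote markers like [8]
    return float(text.replace(',', ''))      # drop thousands separators


print(parse_acres('24,320[8]'))
print(parse_acres('790[12]'))
print(parse_acres(2359.85))  # values that are already numeric pass through
```

Applied to the dataframe, this could be used as `london_parks['acres'] = london_parks['acres'].map(parse_acres)`.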


## Foursquare to get the coordinates

In [7]:
import os

# Read the Foursquare credentials from the environment rather than hard-coding them
# (the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are assumptions)
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Credentials loaded:', bool(CLIENT_ID and CLIENT_SECRET))


<span style="font-family:verdana"><span style="font-size:18px">
Let's create a new dataframe containing the coordinates of each park, obtained here with the Nominatim geocoder from geopy.

In [10]:
# define the dataframe columns
column_names = ['name','latitude','longitude'] 

# instantiate the dataframe
parks_coordinates = pd.DataFrame(columns=column_names)

In [12]:
# Fill the dataframe

# Create the geocoder once and reuse it for every park
geolocator = Nominatim(user_agent="foursquare_agent")

rows = []
for name in london_parks['name']:
    location = geolocator.geocode(name + ', London')
    # geocode() returns None when nothing is found; keep the row with missing coordinates
    latitude = location.latitude if location is not None else None
    longitude = location.longitude if location is not None else None
    rows.append({'name': name, 'latitude': latitude, 'longitude': longitude})

parks_coordinates = pd.DataFrame(rows, columns=column_names)

 

In [13]:
parks_coordinates

Unnamed: 0,name,latitude,longitude
0,Thames Chase,51.609674,-0.115491
1,Epping Forest,51.655222,0.17204
2,Wildspace Park,51.631335,-0.106465
3,Wimbledon Common,51.427074,-0.244198
4,Hampstead Heath,51.563982,-0.167187
5,Walthamstow Wetlands,51.586531,-0.049506
6,Mitcham Common,51.393623,-0.137502
7,Trent Park,51.659165,-0.141826
8,Hainault Park,51.603316,0.09336
9,Clapham Common,51.462075,-0.137359


<span style="font-family:verdana"><span style="font-size:18px">
I now merge the two dataframes above, *london_parks* and *parks_coordinates*.

In [14]:
df_london_parks = london_parks.merge(parks_coordinates)

In [15]:
df_london_parks

Unnamed: 0,name,hectares,acres,latitude,longitude
0,Thames Chase,9842.0,"24,320[8]",51.609674,-0.115491
1,Epping Forest,2476.0,"6,118[9]",51.655222,0.17204
2,Wildspace Park,645.0,"1,593[10]",51.631335,-0.106465
3,Wimbledon Common,460.0,"1,136[11]",51.427074,-0.244198
4,Hampstead Heath,320.0,790[12],51.563982,-0.167187
5,Walthamstow Wetlands,211.0,520[13],51.586531,-0.049506
6,Mitcham Common,182.0,450[14],51.393623,-0.137502
7,Trent Park,169.0,418[15],51.659165,-0.141826
8,Hainault Park,136.0,336[16],51.603316,0.09336
9,Clapham Common,89.0,220[17],51.462075,-0.137359
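
With coordinates attached to each park, how spread out the parks are can be quantified as well as mapped. The sketch below computes the mean pairwise great-circle distance with the haversine formula; the function names are hypothetical, the three points are taken from the table above, and the choice of mean pairwise distance as a spread measure is an assumption rather than the method used in this notebook.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def mean_pairwise_distance_km(points):
    """Average distance over all pairs of points (needs at least two points)."""
    pairs = list(combinations(points, 2))
    return sum(haversine_km(*p, *q) for p, q in pairs) / len(pairs)


parks = [
    (51.563982, -0.167187),  # Hampstead Heath
    (51.462075, -0.137359),  # Clapham Common
    (51.427074, -0.244198),  # Wimbledon Common
]
print(round(mean_pairwise_distance_km(parks), 1), 'km')
```

A larger value suggests parks that lie farther apart from one another, although it says nothing about how evenly they cover the city.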


### New York Parks

<span style="font-family:verdana"><span style="font-size:18px">
Let's download the New York parks dataset and load the .json file from [NYC Open Data](https://nycopendata.socrata.com/City-Government/Parks-Zones/4j29-i5ry).

In [16]:
!wget -q -O 'newyork_parks.json' https://data.cityofnewyork.us/resource/4j29-i5ry.json
print('Data downloaded!')

Data downloaded!
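
One caveat with the download above: to my knowledge, Socrata endpoints such as this one return only a limited number of rows per request unless a `$limit` query parameter is supplied, so the file may not contain every park. The sketch below only builds such a URL; the helper name `build_query_url` and the value 5000 are assumptions.

```python
from urllib.parse import urlencode

BASE_URL = 'https://data.cityofnewyork.us/resource/4j29-i5ry.json'


def build_query_url(base_url, limit):
    """Append a Socrata-style '$limit' parameter to the resource URL."""
    # safe='$' keeps the dollar sign unescaped in the parameter name
    return base_url + '?' + urlencode({'$limit': limit}, safe='$')


print(build_query_url(BASE_URL, 5000))
```

The resulting URL can be passed to `wget` or `requests.get` in place of the bare resource URL.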


In [17]:
with open('newyork_parks.json') as json_data:
    newyork_parks = json.load(json_data)

In [18]:
# let's explore the first record

newyork_parks[0]

{'acres': '2.27347393',
 'borough': 'B',
 'communityboard': '313',
 'councildistrict': '43',
 'department': 'B-13',
 'description': 'Dreier Offerman Park-Calvert Vaux',
 'gispropnum': 'B125',
 'location': 'Bay 46 & Cropsey Ave',
 'nys_assembly': '47',
 'nys_senate': '23',
 'omppropid': 'B125-01',
 'precinct': '60',
 'propname': 'Calvert Vaux Park',
 'retired': False,
 'sitename': 'Calvert Vaux Park Playground',
 'subcategory': 'Plgd Within Park',
 'us_congress': '11',
 'zipcode': '11214',
 'multipolygon': {'type': 'MultiPolygon',
  'coordinates': [[[[-73.99140943598985, 40.58775845097591],
     [-73.99140923879449, 40.587757908846136],
     [-73.99140824815159, 40.58775474163144],
     [-73.99140685824237, 40.58775166263823],
     [-73.99139483451481, 40.587733390119894],
     [-73.99138481376232, 40.587718164220924],
     [-73.99138219078924, 40.587714034224184],
     [-73.99137525839429, 40.58770312116221],
     [-73.99134501334551, 40.58765545328694],
     [-73.99131475652865, 40.58

<span style="font-family:verdana"><span style="font-size:18px"> I will transform the above data into a dataframe called *newyork_parks_2*.
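
GeoJSON geometries such as the `multipolygon` field above store each vertex as `[longitude, latitude]`, in that order. The sketch below shows one way to derive a single (latitude, longitude) pair per park by averaging the vertices of the first outer ring. The helper name `representative_point` is hypothetical, the square is made-up test data, and a vertex average is only a rough stand-in for a true centroid.

```python
def representative_point(multipolygon):
    """Return (latitude, longitude) averaged over the first outer ring."""
    ring = multipolygon['coordinates'][0][0]  # first polygon, outer ring
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]                      # drop the repeated closing vertex
    lon = sum(pt[0] for pt in ring) / len(ring)
    lat = sum(pt[1] for pt in ring) / len(ring)
    return lat, lon


# Made-up square around (40.5, -73.5), in GeoJSON [lon, lat] order
square = {
    'type': 'MultiPolygon',
    'coordinates': [[[[-74.0, 40.0], [-73.0, 40.0], [-73.0, 41.0],
                      [-74.0, 41.0], [-74.0, 40.0]]]],
}
print(representative_point(square))
```

Compared with taking only the first vertex, this gives a point nearer the middle of the park, which matters for large parks.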

In [19]:
# define the dataframe columns
column_names = ['name','acres', 'latitude', 'longitude'] 

# instantiate the dataframe
newyork_parks_2 = pd.DataFrame(columns=column_names)

In [22]:
# Collect one row per park and build the dataframe in a single step
rows = []
for data in newyork_parks:
    site = data['sitename']
    acres = data['acres']
    # GeoJSON stores each vertex as [longitude, latitude];
    # the first vertex of the first polygon serves as the park location
    first_vertex = data['multipolygon']['coordinates'][0][0][0]
    lon = first_vertex[0]
    lat = first_vertex[1]
    rows.append({'name': site, 'acres': acres, 'latitude': lat, 'longitude': lon})

newyork_parks_2 = pd.DataFrame(rows, columns=column_names)

In [23]:
# Let's see the first 5 rows of the dataframe

newyork_parks_2.head()

Unnamed: 0,name,acres,latitude,longitude
0,Calvert Vaux Park Playground,2.27347393,40.587758,-73.991409
1,World's Fair Playground,2.68940812,40.736499,-73.845859
2,Charybdis Playground,1.05413028,40.779626,-73.922926
3,Northwest Corner,6.37607361,40.79994,-73.956657
4,Mullaly Plgd (2),0.29690903,40.834063,-73.923352
