# Cleaning the Tree Census Data

**Author: Inga Silkworth <br>
Date: 08/10/2017**

## Methodology

Three dataframes are examined: New York City street tree census data from 1995, 2005, and 2015. The dataframes are cleaned and initial exploratory data analysis is performed. Some information relating to trees in NYC is added and initial observations are recorded.

## Get the Data and the Basic Info

### Resources for the 2015 census

The main page of the 2015 census can be found here: https://www.nycgovparks.org/trees/treescount

<blockquote>Number of Volunteers. The 2,241 volunteers is double the number that participated in 2006. Volunteers completed 34 percent of the census. Innovative Mapping Technology. The use of innovative geospatial technology and a strong quality review process has yielded an exceptionally accurate inventory of street trees.<div></blockquote>

The censuses don't include trees planted on private property. https://www.nycgovparks.org/trees/treescount/past-censuses

A nice map with all NYC trees plotted for every street and marked with colors by species and circle size by diameter: 
https://tree-map.nycgovparks.org/ <br>

Benefits of trees: https://tree-map.nycgovparks.org/learn/benefits <br>
<blockquote>
Stormwater intercepted each year: 1,095,211,388 gallons Value: \$10,842,587.27 <br>
Energy conserved each year: 671,779,096 kWh Value: \$84,808,673.34 <br>
Air pollutants removed each year: 641 tons Value: \$6,700,060.27 <br>
Carbon dioxide reduced each year: 623,193 tons Value: \$4,162,900.81 <br>
Total Value of Annual Benefits \$110,677,149.92 <br> <div></blockquote>

The number for CO2 reduced each year cannot possibly be right, since one tree can only absorb ~40 lbs of CO2 a year. The estimates use the equations from http://www.itreetools.org/, which also count the CO2 avoided at power plants because tree shade lowers AC usage, but the figure still seems way too high. A tree can sequester a ton of CO2 over 40 years of its life, but that is a lifetime total, not an annual measure. <br>
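
A rough back-of-the-envelope check of this suspicion, as a sketch only. It assumes the reported figure is in short tons (2,000 lbs) and uses the roughly 600,000 street trees from the 2006 estimate quoted later in this notebook as a stand-in for the tree count; neither number comes from the tree map itself.

```python
# Rough sanity check of the reported CO2 figure.
# Assumptions (not from the tree map): short tons of 2,000 lbs, and about
# 600,000 street trees (the 2006 estimate quoted later in this notebook).
co2_tons_per_year = 623_193
lbs_per_ton = 2_000
street_trees = 600_000

implied_lbs_per_tree = co2_tons_per_year * lbs_per_ton / street_trees
ratio_to_40_lbs = implied_lbs_per_tree / 40

print(f'Implied CO2 per tree: {implied_lbs_per_tree:,.0f} lbs/year')
print(f'That is about {ratio_to_40_lbs:.0f}x the ~40 lbs/year figure')
```

Under these assumptions the reported total implies about a ton of CO2 per tree per year, which is consistent with the figure including more than direct absorption.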

Percent change map from 1995 to 2015: https://www.nycgovparks.org/pagefiles/109/tree-census-population-change-lg__583319dce4432.gif <br>
There's already a map of street tree density by census tract https://www.nycgovparks.org/pagefiles/109/tree-census-density-lg__58330128a3d6b.jpg

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
df95 = pd.read_csv('../RawData/1995_Street_Tree_Census.csv')
df05 = pd.read_csv('../RawData/2005_Street_Tree_Census.csv')
df15 = pd.read_csv('../RawData/2015_Street_Tree_Census_-_Tree_Data.csv')

In [None]:
print(df95.info())
print(df05.info())
print(df15.info())

## Clean Up the Zip Codes

Don't include trees from areas such as Yonkers, Mt. Vernon, etc., since they are not part of New York City.

In [None]:
zips_to_remove = [10550, 10704, 10803, 11005, 11096, 11251, 11359, 11559, 11580, 83]

df95 = df95[~df95.Zip_New.isin(zips_to_remove)]
df05 = df05[~df05.zipcode.isin(zips_to_remove)]
df15 = df15[~df15.zipcode.isin(zips_to_remove)]

# in df95 change zip codes of some buildings to those of the nearby neighborhoods
# since they are not used in 05 and 15
df95.Zip_New = df95.Zip_New.replace(10103, 10019)
df95.Zip_New = df95.Zip_New.replace(10041, 10004)
df95.Zip_New = df95.Zip_New.replace(10119, 10001)
df95.Zip_New = df95.Zip_New.replace(10153, 10019)
df95.Zip_New = df95.Zip_New.replace(10162, 10075)
df95.Zip_New = df95.Zip_New.replace(10129, 10029)
df95.Zip_New = df95.Zip_New.replace(10112, 10020)
df95.Zip_New = df95.Zip_New.replace(10107, 10019)

# a building near Riverside Church
df15.zipcode = df15.zipcode.replace(10115, 10027)
# Laguardia airport
df15.zipcode = df15.zipcode.replace(11371, 11370)
# york college
df15.zipcode = df15.zipcode.replace(11451, 11433)
# there's no area information on 10281, so I'll change it to 10280
df15.zipcode = df15.zipcode.replace(10281, 10280)

# change the zip code of the world trade center to the currently used one
df95.Zip_New = df95.Zip_New.replace(10048, 10007)
df05.zipcode = df05.zipcode.replace(10048, 10007)
df15.zipcode = df15.zipcode.replace(10048, 10007)
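
The same clean-up can be written more compactly with one dictionary of old-to-new zip codes. The sketch below uses a small made-up dataframe (`toy`) and a shortened `zip_fixes` mapping, both hypothetical, and ends with a check that no removed or replaced code is left behind.

```python
import pandas as pd

# Toy stand-in for a census dataframe; the real column is Zip_New (1995) or zipcode.
toy = pd.DataFrame({'zipcode': [10103, 10048, 10550, 10019, 83]})

zips_to_remove = [10550, 83]
# One dictionary of old -> new zip codes instead of one replace() call per code.
zip_fixes = {10103: 10019, 10048: 10007}

toy = toy[~toy['zipcode'].isin(zips_to_remove)].copy()
# Bracket assignment always targets the column; a misspelled attribute such as
# toy.Zipcode = ... would only set a Python attribute and leave the data unchanged.
toy['zipcode'] = toy['zipcode'].replace(zip_fixes)

# Sanity check: none of the replaced or removed codes should remain.
leftover = set(toy['zipcode']) & (set(zip_fixes) | set(zips_to_remove))
print(sorted(toy['zipcode']), leftover)
```

An empty `leftover` set is a cheap way to confirm that every replacement actually reached the column.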

**The 1995 (2005) dataset has 23,299 (8,911) trees in zip code 0. What should I do about it?** <br>
Either there are no trees on Roosevelt Island (zip = 10044) in 2015 or they didn't do the census there that year. In df95 they consider it part of Manhattan and technically it is. <br>
In the 2015 dataset, there are 935 trees included from Central Park. Zip code 83 will be excluded for that reason. <br>
In the 2005 dataset, zip code 10023 is used for the area of 10069 and zip code 11211 is used for 11249.

In [None]:
# Import areas for zip codes (areas are in sq. miles)
# I had to guesstimate the area for 11249 since I couldn't find it anywhere.
zip_areas = pd.read_csv('zip_code_areas.csv')
print(len(zip_areas))
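
Before computing densities it may help to list the zip codes that have trees but no area on file. A minimal sketch with made-up data (`toy_trees`, `toy_areas` and their values are hypothetical), using the `indicator` option of `merge`:

```python
import pandas as pd

# Made-up data; the real inputs would be a census dataframe and zip_areas.
toy_trees = pd.DataFrame({'zipcode': [10001, 10001, 11249, 10280]})
toy_areas = pd.DataFrame({'zip_code': [10001, 10280], 'area': [0.62, 0.2]})

# One row per zip code that has at least one tree.
tree_zips = pd.DataFrame({'zip_code': toy_trees['zipcode'].unique()})

# indicator=True adds a _merge column saying where each row was found.
check = tree_zips.merge(toy_areas, on='zip_code', how='left', indicator=True)
no_area = check.loc[check['_merge'] == 'left_only', 'zip_code'].tolist()
print(no_area)
```

Any zip code in `no_area` would end up with a missing density after a left merge.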

## Check the Conditions of the Trees

In [None]:
print(df95.Condition.value_counts(), '\n', '************************')
print(df05.status.value_counts(), '\n', '************************')
print(df15.status.value_counts())
print(df15.health.value_counts())

Could there be a jump in dead tree numbers in 2015 because of Hurricane Sandy? Sandy was in 2012. <br>
http://www.theepochtimes.com/n3/1328435-sandy-is-still-killing-nyc-trees/ <br>

<blockquote>In the immediate aftermath of Sandy, almost 11,000 street trees and 9,000 park trees were destroyed. That’s $28 million in day-of-storm tree damages. <div></blockquote>

By the time the survey was taken in 2015, though, most of the dead trees might have been cleared out.
<blockquote> In Brooklyn alone, 48,000 trees have been inspected, once in the summer of 2013, and again in the summer of 2014, resulting in the removal of more than 2,500 storm-impacted trees. <div></blockquote>

Maybe that's why there were 17,000 stumps in 2015. It would be interesting to plot the stumps and see if they're close to the coast or scattered everywhere around the city. 
* The updated evacuation map after Hurricane Sandy https://www.huffingtonpost.com/2013/06/18/nyc-hurricane-evacuation-zones-map_n_3460565.html
* Hurricane Sandy inundation zones https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6342a4.htm

**NOTE: 10,761 trees of unknown status from 1995 will effectively be excluded from all of the plots.**

## Prepare the dataframe for dead trees

Three zip codes appear in the 1995 and 2005 dataframes but not in the 2015 one (and thus not in the final dead tree dataset): 0, 10044 (Roosevelt Island), and 11430 (JFK Airport).
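
One way to find such zip codes is a set difference. The sketch below uses short made-up series in place of `df95.Zip_New`, `df05.zipcode`, and `df15.zipcode`; the values are chosen only to illustrate the idea.

```python
import pandas as pd

# Toy zip code columns standing in for df95.Zip_New, df05.zipcode and df15.zipcode.
zips95 = pd.Series([0, 10044, 11430, 10001, 10002])
zips05 = pd.Series([0, 10044, 11430, 10001, 10002])
zips15 = pd.Series([10001, 10002, 10003])

# Zip codes present in both earlier censuses but absent from 2015.
missing_in_15 = (set(zips95) & set(zips05)) - set(zips15)
missing_sorted = sorted(int(z) for z in missing_in_15)
print(missing_sorted)
```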

Sites that will help with the vis: 
* https://stackoverflow.com/questions/42408265/plot-new-york-neighborhoods-with-d3-js
* http://www.d3noob.org/2013/03/a-simple-d3js-map-explained.html

In [None]:
# save a dataset to plot dead trees later. Only want numbers of dead trees by zip codes for the three datasets
# and density per zip code
d_15 = df15.zipcode[df15.status.isin(['Dead', 'Stump'])].value_counts()
dead = pd.DataFrame(d_15.reset_index())
dead.columns = ['zip_code', 'count_15']
dead = pd.merge(dead, zip_areas, on='zip_code', how='left')

d_05 = df05.zipcode[df05.status.isin(['Dead'])].value_counts()
dead05 = pd.DataFrame(d_05.reset_index())
dead05.columns = ['zip_code', 'count_05']
dead = pd.merge(dead, dead05, on='zip_code', how='left')

d_95 = df95.Zip_New[df95.Condition.isin(['Dead', 'Stump', 'Shaft'])].value_counts()
dead95 = pd.DataFrame(d_95.reset_index())
dead95.columns = ['zip_code', 'count_95']
dead = pd.merge(dead, dead95, on='zip_code', how='left')

dead['density_15'] = dead['count_15'] / dead['area']
dead['density_05'] = dead['count_05'] / dead['area']
dead['density_95'] = dead['count_95'] / dead['area']
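
The three count-and-merge steps follow the same pattern, so they could be folded into a small helper. This is only a sketch: `dead_counts_by_zip` is a hypothetical function, and the toy data and areas are made up.

```python
import pandas as pd

def dead_counts_by_zip(df, zip_col, status_col, dead_labels, year):
    """Count rows whose status is in dead_labels, grouped by zip code."""
    counts = df.loc[df[status_col].isin(dead_labels), zip_col].value_counts()
    return counts.rename_axis('zip_code').reset_index(name=f'count_{year}')

# Toy data; column names mirror the 2015 dataframe, values are made up.
toy15 = pd.DataFrame({
    'zipcode': [10001, 10001, 10002, 10002, 10003],
    'status': ['Dead', 'Stump', 'Alive', 'Dead', 'Alive'],
})
toy_areas = pd.DataFrame({'zip_code': [10001, 10002], 'area': [0.5, 2.0]})

toy_dead = dead_counts_by_zip(toy15, 'zipcode', 'status', ['Dead', 'Stump'], '15')
toy_dead = toy_dead.merge(toy_areas, on='zip_code', how='left')
toy_dead['density_15'] = toy_dead['count_15'] / toy_dead['area']
print(toy_dead)
```

Note that with left merges, a zip code with no dead trees in an earlier year ends up as NaN rather than 0; whether to fill those with 0 depends on how the plot should treat them.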

In [None]:
dead.to_csv('/Users/ingasilk/Projects/NYCTreeAnalysis/nyc-tree-census/dead_tree_densities.csv', index=False)

## Keep only Alive Trees and Columns of Interest

Get a reduced dataframe for the main plots of live trees.

In [None]:
df95a = df95[df95.Condition.isin(['Good', 'Excellent', 'Poor', 'Fair', 'Critical'])]
df05a = df05[df05.status.isin(['Good', 'Excellent', 'Poor'])]
df15a = df15[df15.status == 'Alive']

# remove non-trees from 1995
df95a = df95a[df95a.Spc_Common != 'Hedge']
df95a = df95a[df95a.Spc_Common != 'Unknown Stump']
df95a = df95a[df95a.Spc_Common != 'Shrub']

In [None]:
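# .copy() makes each reduced dataframe independent of its parent,
# so later column edits do not operate on a view
df95as = df95a[['Borough', 'Zip_New', 'Spc_Common']].copy()
df95as.columns = ['boroname', 'zipcode', 'spc_common']

df05as = df05a[['boroname', 'zipcode', 'spc_common']].copy()

df15as = df15a[['boroname', 'zipcode', 'spc_common']].copy()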

## Check the numbers by Boroughs

In [None]:
print(df95as.boroname.value_counts())
print(df05as.boroname.value_counts())
print(df15as.boroname.value_counts())

In [None]:
# fix Staten Island Entries for 2005
df05as.boroname = df05as.boroname.replace('5', 'Staten Island')
df05as.boroname = df05as.boroname.replace(5, 'Staten Island')

## Make Tree Names the Same Across Datasets

In [None]:
df95as.spc_common = df95as.spc_common.map(lambda x: ' '.join(reversed(x.split(', '))).title())
df05as.spc_common = df05as.spc_common.map(lambda x: ' '.join(reversed(x.split(', '))).title())
df15as.spc_common = df15as.spc_common.map(lambda x: str(x).title())

old95 = ['Unknown Live Trees', 'Willow Species', 'Euro. Mountain-Ash', 'Golden-Chain Tree', 'Trumpet Tree Sp', 
         'S Goldenrain Tree', 'Norway-Cr Kng Maple', 'Callery-Aristo Pear', 'Red-Red Sunst Maple', 'Privet Species', 
         'Crabapple-Ind.Summer', 'Higan-Pendla Cherry', 'Eur. Smoke Tree', 'Crabapple-Harv. Gold', 
         'Norway-Schwed Maple', 'White-Aut Purpl Ash', 'Green-Mars Seed Ash', 'Red-Oct Glory Maple', 
         'Sugar-Grn Mtn Maple', 'Fla. Strangler Fig', 'Amer. Mountain-Ash', 'Holly Species','Honeylocust', 
         'American Arborvitae']
new95 = ['Unknown', 'Other Willow', 'European Mountain Ash', 'Golden Chain Tree', 'Trumpet Tree', 'Goldenrain Tree', 
         'Crimson King Maple', 'Callery Pear', 'Red Sunset Maple', 'Other Privet', 'Indian Summer Crabapple', 
         'Weeping Higan Cherry', 'Smoketree', 'Harvest Gold Crabapple', 'Schwedleri Maple', 'Autumn Purple White Ash', 
         'Green Ash', 'October Glory Red Maple', 'Sugar Maple', 'Florida Strangler Fig', 'American Mountain Ash', 
         'Holly', 'Honey Locust', 'Eastern Arborvitae']

old05 = ['Norway-Cr Kng Maple', 'Holly Species', 'Hickory', 'Willow Species', 'Silverbell', 'Dogwood Spp.', 
         'Maackia,Amur', 'Larch', 'American Mountainash', 'Golden-Chain Tree', 'Pondcypress', 'Juniper Spp.', 
         'Baldcypress Species', 'Willow ?', 'American Smoketree', 'Korean Mountainash', 
         'Japanese Falsecypress', 'Atlantic Whitecedar', 'Honeylocust']
new05 = ['Crimson King Maple', 'Holly', 'Other Hickory', 'Other Willow', 'Other Silverbell', 'Other Dogwood', 
         'Amur Maackia', 'Common Larch', 'American Mountain Ash', 'Golden Chain Tree', 'Pond Cypress', 'Other Juniper', 
         'Baldcypress', 'Other Willow', 'Smoketree', 'Korean Mountain Ash', 
         'Japanese False Cypress', 'Atlantic White Cedar', 'Honey Locust', 'Eastern Arborvitae']

old15 = ['Ash', 'Crab Apple', 'Tulip-Poplar', 'Douglas-Fir', 'Spruce', 'Littleleaf Linden', 'Schumard\'S Oak', 
         'Purple-Leaf Plum', '\'Schubert\' Chokecherry', 'Magnolia', 'American Larch', 'Common Hackberry', 'Maple', 
         'Serviceberry', 'Pine', 'Nan', 'Honeylocust', 'Cherry', 'Arborvitae']
new15 = ['Other Ash', 'Crabapple', 'Tulip Tree', 'Douglas Fir', 'Other Spruce', 'Little Leaf Linden', 'Schumard Oak', 
         'Purpleleaf Plum', 'Shubert Chokecherry', 'Other Magnolia', 'Common Larch', 'Hackberry', 'Other Maple', 
         'Other Serviceberry', 'Other Pine', 'Unknown', 'Honey Locust', 'Other Cherry', 'Eastern Arborvitae']

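# Assign the result back to the column: calling replace(..., inplace=True)
# on a column selected from the dataframe may not modify it in newer pandas.
for old, new in zip(old95, new95):
    df95as['spc_common'] = df95as['spc_common'].replace(old, new)

# new05 has one more entry than old05; zip() stops at the shorter list,
# just as looping over range(len(old05)) did
for old, new in zip(old05, new05):
    df05as['spc_common'] = df05as['spc_common'].replace(old, new)

for old, new in zip(old15, new15):
    df15as['spc_common'] = df15as['spc_common'].replace(old, new)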
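
To see what the name normalization at the top of this cell does, here is a minimal sketch on made-up inputs. `flip_and_title` is a hypothetical helper, and mapping missing values to 'Unknown' is an assumption, not something the cell above does.

```python
def flip_and_title(name):
    """Turn 'MAPLE, NORWAY' into 'Norway Maple'; map missing values to 'Unknown'."""
    if not isinstance(name, str):
        return 'Unknown'  # assumption: treat NaN/None as an unknown species
    return ' '.join(reversed(name.split(', '))).title()

flipped = flip_and_title('MAPLE, NORWAY')
single = flip_and_title('HONEYLOCUST')
missing = flip_and_title(float('nan'))

# str.title() capitalizes the letter after an apostrophe, and str(nan) becomes
# 'Nan', which is why entries like "Schumard'S Oak" and 'Nan' appear in old15.
titled_apostrophe = "schumard's oak".title()
titled_nan = str(float('nan')).title()

print(flipped, single, missing, titled_apostrophe, titled_nan, sep=' | ')
```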

## Get a Quick Look at Most Popular Species

From Epoch Times http://www.theepochtimes.com/n3/1328435-sandy-is-still-killing-nyc-trees/ 
<blockquote>As of 2006, New York City had an estimated 2.6 million public trees—600,000 on the streets, 2 million in parks.<div></blockquote>

The 1995 dataset has the greatest variety of trees and the 2015 dataset the least, but 2015 also has the fewest unknowns. Volunteers could've been using an app to recognize trees. From https://www.nycgovparks.org/trees/treescount/about about the 2015 census:
<blockquote>The TreeKIT mapping method and the accompanying mobile app are the foundation of TreesCount! 2015. NYC Parks chose TreeKIT for TreesCount! 2015 because it is easy to use and generates a representative map of the urban forest that places the tree exactly where it is located along the curb. <div></blockquote> 

<blockquote>This year, we trained New Yorkers to be expert tree counters by providing extensive training and tree guides to make sure that our voluntreers were confident and that measurements were as accurate as possible. <div></blockquote>

Why did 70,000 Norway Maples disappear in 20 years? And why were 30,000 Honeylocusts planted?
* Norway maples were diseased in 1996
* Since it's an invasive species, there's been an effort to eradicate it
* Honey locusts are resilient to flooding

http://www.nytimes.com/1996/06/02/nyregion/diseased-norway-maple-trees-leaving-some-streets-bare.html 1996 <br> 
http://www.nytimes.com/2002/06/30/nyregion/environment-unfortunately-these-maples-are-spreading.html 2002 <br> 
https://patch.com/new-york/tarrytown/the-norway-maple-new-york-s-ultimate-weed <br>
https://www.change.org/p/new-york-state-department-of-environmental-conservation-ban-the-norway-maple-in-new-york-3 <br>
https://www.nycgovparks.org/trees/treescount

Invasive species: <br>
* Norway Maple
* Tree of Heaven
* Russian Olive
* Smooth Buckthorn
* Black Locust

The Epoch Times:
<blockquote>According to the latest estimate from 2005, over half of the trees on New York City streets belong to five species: London planetree, known for its camouflage-patterned bark; Norway maple with its low and bushy foliage; callery pear, which infamously smells like semen when it blooms; the thorny honey locust, and pin oak, with its incised leaves. Of these, the best performing trees post-Sandy are **honey locust, pin oak, and callery pear**. Expect to see more of them on the streets as replanting continues. <div></blockquote> 

In [None]:
print(df95as.spc_common.value_counts())
print(df05as.spc_common.value_counts())
print(df15as.spc_common.value_counts())

In [None]:
df15as.spc_common.unique()

In [None]:
df05as.spc_common.unique()

In [None]:
df95as.spc_common.unique()

## Prepare the Dataframes for Alive Trees

### Get numbers for total tree counts

In [None]:
print('Number of trees in 1995:', len(df95as))
print('Number of trees in 2005:', len(df05as))
print('Number of trees in 2015:', len(df15as))

### Get df for borough counts and densities

Get the file with borough areas, adjusted by subtracting the areas of the 10 biggest parks in NYC and the land areas of LaGuardia and JFK airports. Area information was found at:
* https://en.wikipedia.org/wiki/Boroughs_of_New_York_City
* https://www.nycgovparks.org/about/faq
* https://en.wikipedia.org/wiki/LaGuardia_Airport
* https://www.panynj.gov/airports/jfk-facts-info.html

In [None]:
borough_areas = pd.read_csv('borough_areas.csv')
borough_areas = borough_areas[['borough', 'area_adjusted']]

In [None]:
borough_areas

In [None]:
boroughs = pd.DataFrame(df15as.boroname.value_counts().reset_index())
boroughs.columns = ['borough', 'count_15']
boroughs = pd.merge(boroughs, borough_areas, on='borough', how='left')

boroughs05 = pd.DataFrame(df05as.boroname.value_counts().reset_index())
boroughs05.columns = ['borough', 'count_05']
boroughs = pd.merge(boroughs, boroughs05, on='borough', how='left')

boroughs95 = pd.DataFrame(df95as.boroname.value_counts().reset_index())
boroughs95.columns = ['borough', 'count_95']
boroughs = pd.merge(boroughs, boroughs95, on='borough', how='left')

boroughs['density_15'] = boroughs['count_15'] / boroughs['area_adjusted']
boroughs['density_05'] = boroughs['count_05'] / boroughs['area_adjusted']
boroughs['density_95'] = boroughs['count_95'] / boroughs['area_adjusted']

In [None]:
boroughs

In [None]:
boroughs.to_csv('/Users/ingasilk/Projects/NYCTreeAnalysis/nyc-tree-census/borough_counts.csv', index=False)

Manhattan is now the greenest borough in NYC, but this wasn't the case 10 and 20 years ago. Queens used to be the queen of trees, with the Bronx at the bottom. Now Staten Island is the least lush borough.
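
A ranking like this can be read off the density columns directly. The sketch below uses made-up densities purely for illustration (the real values are in the `boroughs` dataframe); rank 1 is the densest borough in that year.

```python
import pandas as pd

# Made-up densities (trees per square mile) for illustration only;
# the real values come from the boroughs dataframe.
toy = pd.DataFrame({
    'borough': ['Manhattan', 'Queens', 'Bronx', 'Staten Island'],
    'density_95': [2000.0, 2500.0, 1500.0, 1800.0],
    'density_15': [3200.0, 3000.0, 2600.0, 2400.0],
}).set_index('borough')

# Rank 1 = densest borough in that year.
ranks = toy.rank(ascending=False).astype(int)
print(ranks)
```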

### Get df for popular tree species

In [None]:
populars = pd.DataFrame(df15as.spc_common.value_counts().head(15).reset_index())
populars.columns = ['species_15', 'count_15']
populars['rank'] = populars.index + 1

populars05 = pd.DataFrame(df05as.spc_common.value_counts().head(15).reset_index())
populars05.columns = ['species_05', 'count_05']
populars05['rank'] = populars05.index + 1
populars = pd.merge(populars, populars05, on='rank', how='left')

populars95 = pd.DataFrame(df95as.spc_common.value_counts().head(15).reset_index())
populars95.columns = ['species_95', 'count_95']
populars95['rank'] = populars95.index + 1
populars = pd.merge(populars, populars95, on='rank', how='left')

populars

## Plot Tree Diameters

In [None]:
import matplotlib
from matplotlib import pyplot as plt
import seaborn
%matplotlib inline

In [None]:
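# df95as keeps only boroname, zipcode, and spc_common, so plot from df95a instead.
# The column name 'Diameter' is an assumption; check df95.info() for the actual name.
df95a['Diameter'].hist(bins=50)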

## Plots I Want to Make

* Density of dead trees per zip code for the 3 years.
* Total number of trees for NYC from 3 dfs. Need 3 numbers of tree counts.
* Percent growth chart from 3 dfs for each borough. Need tree counts per borough for 3 dfs. Also add a column of area of boroughs and tree density for 3 years. This can be one dataframe.
* List of most popular trees for 3 years. Do value count for each year and save top 15: rank, name, and count
* Density of trees for each zip in 2015. Need zips, counts per zip, area, density.
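
For the percent growth chart, the per-borough counts are enough. A minimal sketch with made-up counts (the real ones are saved in borough_counts.csv; the `growth_*` column names are hypothetical):

```python
import pandas as pd

# Made-up counts for illustration; the real ones are in borough_counts.csv.
toy = pd.DataFrame({
    'borough': ['Bronx', 'Queens'],
    'count_95': [50_000, 200_000],
    'count_05': [60_000, 220_000],
    'count_15': [75_000, 231_000],
})

# Percent growth relative to the earlier census.
toy['growth_95_05'] = (toy['count_05'] / toy['count_95'] - 1) * 100
toy['growth_05_15'] = (toy['count_15'] / toy['count_05'] - 1) * 100
print(toy[['borough', 'growth_95_05', 'growth_05_15']].round(1))
```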