# Module 4


## Overview

In this module we’ll be looking at data from the New York City tree census, the [2015 Street Tree Census](https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh). This data is collected by volunteers across the city, and is meant to catalog information about every single tree in the city. We'll be accessing it via the Socrata API.

### Task 

Build a Dash app for an arborist studying the health of various tree species (as defined by the variable ‘spc_common’) across each borough (defined by the variable ‘borough’; the column is named `boroname` in the data used below). This arborist would like to answer the following two questions for each species and in each borough:

1. What proportion of trees are in good, fair, or poor health according to the ‘health’ variable? 

2. Are stewards (steward activity measured by the ‘steward’ variable) having an impact on the health of trees? 

### Getting Started 


```python
import pandas as pd
import numpy as np
```

The data is conveniently available in JSON format, so we should be able to read it directly into pandas:

```python
url = 'https://data.cityofnewyork.us/resource/nwxe-4ae8.json'
trees = pd.read_json(url)
trees.head(5)
```

|  | address | bbl | bin | block_id | boro_ct | borocode | boroname | brch_light | brch_other | brch_shoe | ... | tree_dbh | tree_id | trnk_light | trnk_other | trunk_wire | user_type | x_sp | y_sp | zip_city | zipcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 108-005 70 AVENUE | 4022210000.0 | 4052307.0 | 348711 | 4073900 | 4 | Queens | No | No | No | ... | 3 | 180683 | No | No | No | TreesCount Staff | 1027431.148 | 202756.7687 | Forest Hills | 11375 |
| 1 | 147-074 7 AVENUE | 4044750000.0 | 4101931.0 | 315986 | 4097300 | 4 | Queens | No | No | No | ... | 21 | 200540 | No | No | No | TreesCount Staff | 1034455.701 | 228644.8374 | Whitestone | 11357 |
| 2 | 390 MORGAN AVENUE | 3028870000.0 | 3338310.0 | 218365 | 3044900 | 3 | Brooklyn | No | No | No | ... | 3 | 204026 | No | No | No | Volunteer | 1001822.831 | 200716.8913 | Brooklyn | 11211 |
| 3 | 1027 GRAND STREET | 3029250000.0 | 3338342.0 | 217969 | 3044900 | 3 | Brooklyn | No | No | No | ... | 10 | 204337 | No | No | No | Volunteer | 1002420.358 | 199244.2531 | Brooklyn | 11211 |
| 4 | 603 6 STREET | 3010850000.0 | 3025654.0 | 223043 | 3016500 | 3 | Brooklyn | No | No | No | ... | 21 | 189565 | No | No | No | Volunteer | 990913.775 | 182202.426 | Brooklyn | 11215 |


By default, Socrata returns at most 1,000 rows per API request. Raw data is meant to be "paged" through by applications, on the expectation that a user interface wouldn't be able to handle a full dataset. We can see this limit by examining the shape of the data:

```python
trees.shape
```

(1000, 45)
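
One way to get past the first 1,000 rows is to page through the data with the SoQL `$limit` and `$offset` parameters. The sketch below only builds the page URLs and makes no network calls. The helper name `page_urls`, the page size, and the choice of `tree_id` as a sort key are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://data.cityofnewyork.us/resource/nwxe-4ae8.json'


def page_urls(total_rows, page_size=1000, base_url=BASE_URL):
    """Yield one request URL per page of `page_size` rows (hypothetical helper)."""
    for offset in range(0, total_rows, page_size):
        params = {
            '$limit': page_size,
            '$offset': offset,
            '$order': 'tree_id',  # assumed sort key, so pages do not overlap
        }
        # safe='$' keeps the SoQL parameter names readable in the URL.
        yield base_url + '?' + urlencode(params, safe='$')


# Three pages are enough to cover 2,500 rows.
urls = list(page_urls(total_rows=2500))
for u in urls:
    print(u)
```

Each URL could then be read with `pd.read_json` and the resulting frames combined with `pd.concat`, although for this dataset that would mean many requests.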

The goal of using the Socrata API is to force you to think about where your data operations are happening, rather than pulling in all the data and performing every operation in local memory. Querying with SoQL is a good way to work within the limits of the API, because filtering and aggregation happen on the server and only the summarized rows are returned.
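
As a sketch of pushing the work to the server, the helper below builds a SoQL query that counts trees by `health` and `steward` for a single borough and species. The function name `soql_url` and the example values `'Bronx'` and `'red maple'` are assumptions for illustration; `spc_common` is the species column named in the task.

```python
from urllib.parse import urlencode

BASE_URL = 'https://data.cityofnewyork.us/resource/nwxe-4ae8.json'


def soql_url(borough, species, base_url=BASE_URL):
    """Build an aggregated query for one borough and species (hypothetical helper)."""
    # SoQL string literals use single quotes; a quote inside a value is doubled.
    borough = borough.replace("'", "''")
    species = species.replace("'", "''")
    params = {
        '$select': 'health, steward, count(tree_id)',
        '$where': f"boroname = '{borough}' AND spc_common = '{species}'",
        '$group': 'health, steward',
    }
    # urlencode percent-encodes spaces, quotes, and parentheses in the values.
    return base_url + '?' + urlencode(params, safe='$')


url = soql_url('Bronx', 'red maple')
print(url)
```

Because the grouping happens on the server, the response holds at most one row per `health` and `steward` combination, well under the 1,000-row limit.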

## App Development

The app was developed using data from the following query:  

```python
# Count trees for every steward / health / borough combination on the server.
soql_url = ('https://data.cityofnewyork.us/resource/nwxe-4ae8.json?'
            '$select=steward, health, boroname, count(tree_id)'
            '&$group=steward, health, boroname').replace(' ', '%20')
df = pd.read_json(soql_url)
print('Dimensions of dataframe : ' + str(df.shape))
```

Dimensions of dataframe : (66, 4)


We can better view this data and use it in the app by turning it into a pivot table. 

```python
# Rows: borough and steward level; columns: health; cells: summed tree counts.
pv = pd.pivot_table(df,
                    index=['boroname', 'steward'],
                    columns=['health'],
                    values=['count_tree_id'],
                    aggfunc=[np.sum],
                    fill_value=0,
                    margins=True,
                    margins_name='Total count')
pv
```

Sum of `count_tree_id` by `health`:

| boroname | steward | Fair | Good | Poor | Total count |
|---|---|---|---|---|---|
| Bronx | 1or2 | 2130 | 12038 | 640 | 14808 |
| Bronx | 3or4 | 125 | 689 | 41 | 855 |
| Bronx | 4orMore | 7 | 62 | 2 | 71 |
| Bronx | None | 8625 | 53814 | 2412 | 64851 |
| Brooklyn | 1or2 | 6490 | 35749 | 1638 | 43877 |
| Brooklyn | 3or4 | 760 | 5147 | 143 | 6050 |
| Brooklyn | 4orMore | 59 | 464 | 10 | 533 |
| Brooklyn | None | 17764 | 96852 | 4668 | 119284 |
| Manhattan | 1or2 | 4471 | 18241 | 1463 | 24175 |
| Manhattan | 3or4 | 1415 | 5974 | 428 | 7817 |
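
To read the first question directly off these counts, each row can be divided by its own total. The sketch below does that for three Bronx rows, with the counts copied from the output above; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the Bronx rows of the pivot table above.
counts = pd.DataFrame(
    {'Fair': [2130, 125, 7], 'Good': [12038, 689, 62], 'Poor': [640, 41, 2]},
    index=pd.MultiIndex.from_tuples(
        [('Bronx', '1or2'), ('Bronx', '3or4'), ('Bronx', '4orMore')],
        names=['boroname', 'steward'],
    ),
)

# Divide every count by its row total so that each row sums to 1.
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions.round(3))
```

The same `div` call could be applied to the health columns of the full pivot table once the margin row and column are set aside.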


As seen in the [app.py](https://github.com/jemceach/608/blob/master/module4/app.py) file, this data was passed to Dash to build an interactive bar graph that displays the data above by borough.
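
The linked `app.py` is not reproduced here. As a rough idea of the kind of figure such an app could serve, the sketch below builds a grouped bar chart as a plain figure dictionary, using the `data`/`layout` structure that Plotly figures follow, from a few records shaped like the query result. The function name `health_bar_figure` and the sample records are assumptions, not the author's implementation.

```python
def health_bar_figure(records, borough):
    """Build a grouped bar chart as a Plotly-style figure dict (hypothetical helper)."""
    rows = [r for r in records if r['boroname'] == borough]
    stewards = sorted({r['steward'] for r in rows})
    traces = []
    for level in ['Good', 'Fair', 'Poor']:
        # Map steward group -> tree count for this health level.
        counts = {r['steward']: r['count_tree_id'] for r in rows if r['health'] == level}
        traces.append({
            'type': 'bar',
            'name': level,
            'x': stewards,
            'y': [counts.get(s, 0) for s in stewards],
        })
    layout = {
        'barmode': 'group',
        'title': {'text': f'Tree health by steward activity: {borough}'},
        'xaxis': {'title': {'text': 'steward'}},
        'yaxis': {'title': {'text': 'number of trees'}},
    }
    return {'data': traces, 'layout': layout}


# Sample records with counts taken from the Bronx rows of the pivot table above.
records = [
    {'boroname': 'Bronx', 'steward': '1or2', 'health': 'Good', 'count_tree_id': 12038},
    {'boroname': 'Bronx', 'steward': '1or2', 'health': 'Fair', 'count_tree_id': 2130},
    {'boroname': 'Bronx', 'steward': '1or2', 'health': 'Poor', 'count_tree_id': 640},
    {'boroname': 'Bronx', 'steward': '3or4', 'health': 'Good', 'count_tree_id': 689},
    {'boroname': 'Bronx', 'steward': '3or4', 'health': 'Fair', 'count_tree_id': 125},
    {'boroname': 'Bronx', 'steward': '3or4', 'health': 'Poor', 'count_tree_id': 41},
]
figure = health_bar_figure(records, 'Bronx')
```

In a Dash layout, a dictionary like this could be passed as `dcc.Graph(figure=figure)`, with a dropdown callback supplying the `borough` argument.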

## Conclusion 

The application displays the health status of all tree species across each NYC borough, showing the number of trees for each health and steward status.

The data queried shows that approximately 77% of all trees are in good health and 71% of all trees are not assigned a steward. In the future, I would consider adding an interactive pivot table to the application as well, so that the user could better visualize the proportion of health and steward values as shown below:

```python
# Express each steward / health count as a share of all trees in the query result.
pv2 = pd.pivot_table(df,
                     index=['steward'],
                     columns=['health'],
                     values=['count_tree_id'],
                     aggfunc=[lambda x: x.sum() / df['count_tree_id'].sum()],
                     fill_value=0,
                     margins=True,
                     margins_name='Total count')
pv2
```

Share of all trees (`count_tree_id`) by `health`:

| steward | Fair | Good | Poor | Total count |
|---|---|---|---|---|
| 1or2 | 0.03203 | 0.168725 | 0.009189 | 0.209944 |
| 3or4 | 0.004139 | 0.022823 | 0.001092 | 0.028054 |
| 4orMore | 0.000292 | 0.00199 | 7.2e-05 | 0.002355 |
| None | 0.10467 | 0.579874 | 0.028867 | 0.713411 |
| Total count | 0.141131 | 0.773412 | 0.03922 | 0.953763 |
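
The shares in this table are fractions of all trees, so they mostly reflect how common each steward group is. One way to approach the second question is to rescale each row by its own total, which gives the health mix within each steward group. The sketch below does this with the rounded values printed above. The row for trees with no steward (about 71% of trees) is labelled `None` here, and the variable names are illustrative.

```python
# Shares of all trees, copied from the table above, as (Fair, Good, Poor).
shares = {
    '1or2': (0.03203, 0.168725, 0.009189),
    '3or4': (0.004139, 0.022823, 0.001092),
    '4orMore': (0.000292, 0.00199, 0.000072),
    'None': (0.10467, 0.579874, 0.028867),
}

within_group = {}
for steward, (fair, good, poor) in shares.items():
    total = fair + good + poor
    # Rescale so the three health shares sum to 1 within this steward group.
    within_group[steward] = {
        'Fair': fair / total,
        'Good': good / total,
        'Poor': poor / total,
    }

for steward, mix in within_group.items():
    print(steward, {level: round(value, 3) for level, value in mix.items()})
```

Comparing these within-group shares is descriptive only; a difference between groups would not by itself show that steward activity causes better tree health.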


Application uses:

1. What proportion of trees are in good, fair, or poor health according to the ‘health’ variable?
2. Are stewards (steward activity measured by the ‘steward’ variable) having an impact on the health of trees?