# NYC Tree Census and Income
Codecademy Portfolio Project by Leah Fulmer ([GitHub](https://github.com/leahmfulmer), [Tableau](https://public.tableau.com/app/profile/leahmfulmer/vizzes))<br>
With gratitude to David Belyaev ([Tableau](https://public.tableau.com/app/profile/david.belyaev/vizzes))

#### Project Objectives:

* Explore a given data set.
* Form questions for analysis.
* Create several visualizations.
* Combine visualizations in a Tableau Dashboard.
* Present an interactive visual dashboard through [Tableau Public](https://public.tableau.com/app/profile/leahmfulmer/viz/NYCTreesIncome_17195152146380/NYCTreesIncome).

#### Table of Contents:
[Section 1: Loading and Examining the Data](#data)<br>
[Section 2: Wrangling and Tidying the Data](#tidy)<br>
[Section 3: Defining Questions for Analysis](#questions)<br>
[Section 4: Transitioning to Tableau Public](#tableau)<br>

### Section 1: Loading and Examining the Data <a id="data"></a>

In [1]:
# Import modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data: tree_census
# Data can be found here: https://github.com/Codecademy-Curriculum/Learn-Tableau-for-Data-Viz/tree/main/datasets
tree_census = pd.read_csv("./datasets/tree-census-NYC_2015.csv")

print("This dataset contains {} rows and {} columns. \n".format(tree_census.shape[0], tree_census.shape[1]))
print("The columns are called ...{}.".format(tree_census.columns))

This dataset contains 683788 rows and 16 columns. 

The columns are called ...Index(['Unnamed: 0', 'tree_id', 'tree_dbh', 'stump_diam', 'status', 'health',
       'spc_latin', 'spc_common', 'address', 'zipcode', 'borocode', 'boroname',
       'nta_name', 'state', 'Latitude', 'longitude'],
      dtype='object').


In [3]:
# Examine data: tree_census
tree_census.head()

Unnamed: 0.1,Unnamed: 0,tree_id,tree_dbh,stump_diam,status,health,spc_latin,spc_common,address,zipcode,borocode,boroname,nta_name,state,Latitude,longitude
0,0,180683,3,0,Alive,Fair,Acer rubrum,red maple,108-005 70 AVENUE,11375,4,Queens,Forest Hills,New York,40.723092,-73.844215
1,1,200540,21,0,Alive,Fair,Quercus palustris,pin oak,147-074 7 AVENUE,11357,4,Queens,Whitestone,New York,40.794111,-73.818679
2,2,204026,3,0,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,390 MORGAN AVENUE,11211,3,Brooklyn,East Williamsburg,New York,40.717581,-73.936608
3,3,204337,10,0,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,1027 GRAND STREET,11211,3,Brooklyn,East Williamsburg,New York,40.713537,-73.934456
4,4,189565,21,0,Alive,Good,Tilia americana,American linden,603 6 STREET,11215,3,Brooklyn,Park Slope-Gowanus,New York,40.666778,-73.975979


In [4]:
# Load data: income
income = pd.read_csv("./datasets/income-NYC_2015.csv")

print("This dataset contains {} rows and {} columns. \n".format(income.shape[0], income.shape[1]))
print("The columns are called ...{}.".format(income.columns))

This dataset contains 218 rows and 8 columns. 

The columns are called ...Index(['GEO_ID', 'zipcode', 'HouseholdsEstimateTotal',
       'HouseholdsMargin of ErrorTotal',
       'HouseholdsEstimateMedian income (dollars)',
       'HouseholdsMargin of ErrorMedian income (dollars)',
       'HouseholdsEstimateMean income (dollars)',
       'HouseholdsMargin of ErrorMean income (dollars)'],
      dtype='object').


In [5]:
# Examine data: income
income.head()

Unnamed: 0,GEO_ID,zipcode,HouseholdsEstimateTotal,HouseholdsMargin of ErrorTotal,HouseholdsEstimateMedian income (dollars),HouseholdsMargin of ErrorMedian income (dollars),HouseholdsEstimateMean income (dollars),HouseholdsMargin of ErrorMean income (dollars)
0,8600000US10451,10451,18140,405,26048,2140.0,40836.0,3424.0
1,8600000US10452,10452,25432,368,24790,1337.0,36083.0,1578.0
2,8600000US10453,10453,26802,409,23095,1605.0,33354.0,1416.0
3,8600000US10454,10454,12790,247,20210,1930.0,31533.0,2272.0
4,8600000US10455,10455,14023,329,23253,1598.0,32854.0,2127.0


### Section 2: Wrangling and Tidying the Data<a id="tidy"></a>

After mapping the boroughs in Tableau Public, I noticed that some data points were labeled with the wrong borough. The zipcode 11234 is in Brooklyn, but some entries in the original dataset place it in Queens, which misrepresents the distribution of trees across boroughs. Let's correct this.

In [6]:
# What is the error?

zip_11234 = tree_census[tree_census.zipcode == 11234]
zip_11234.borocode.unique()
zip_11234.boroname.unique()

array(['Brooklyn', 'Queens'], dtype=object)
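The same check can be run across every zipcode at once instead of one at a time. Here is a minimal sketch on a small made-up frame: the `sample` rows are hypothetical, and only the column names `zipcode` and `boroname` come from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from tree_census.
sample = pd.DataFrame({
    "zipcode":  [11234, 11234, 11234, 11375, 11211],
    "boroname": ["Brooklyn", "Queens", "Brooklyn", "Queens", "Brooklyn"],
})

# Count the distinct borough names recorded for each zipcode.
boros_per_zip = sample.groupby("zipcode")["boroname"].nunique()

# Keep only the zipcodes recorded under more than one borough.
conflicting = boros_per_zip[boros_per_zip > 1]
print(conflicting)
```

A flagged zipcode is not necessarily an error, since a zipcode can in principle straddle a borough boundary, so each one still deserves a manual look.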

In [7]:
# Which points are labeled incorrectly?

incorrect_label = tree_census[(tree_census.zipcode == 11234) & (tree_census.boroname == 'Queens')]
incorrect_label.head()

Unnamed: 0.1,Unnamed: 0,tree_id,tree_dbh,stump_diam,status,health,spc_latin,spc_common,address,zipcode,borocode,boroname,nta_name,state,Latitude,longitude
428387,428387,654303,11,0,Alive,Fair,Platanus x acerifolia,London planetree,5031 FLATBUSH AVENUE,11234,4,Queens,Georgetown-Marine Park-Bergen Beach-Mill Basin,New York,40.582074,-73.891625
429222,429222,654305,7,0,Alive,Good,Prunus,cherry,5031 FLATBUSH AVENUE,11234,4,Queens,Georgetown-Marine Park-Bergen Beach-Mill Basin,New York,40.582407,-73.891952
429721,429721,654307,5,0,Alive,Fair,Prunus,cherry,5031 FLATBUSH AVENUE,11234,4,Queens,Georgetown-Marine Park-Bergen Beach-Mill Basin,New York,40.582577,-73.892141
430241,430241,654309,48,0,Alive,Fair,Morus,mulberry,5031 FLATBUSH AVENUE,11234,4,Queens,Georgetown-Marine Park-Bergen Beach-Mill Basin,New York,40.582633,-73.892204
430399,430399,654304,9,0,Alive,Good,Prunus,cherry,5031 FLATBUSH AVENUE,11234,4,Queens,Georgetown-Marine Park-Bergen Beach-Mill Basin,New York,40.58213,-73.891643


Only seven trees carry an incorrect `boroname` and `borocode` (`head()` shows just the first five), so they are easy to correct individually.

In [8]:
# Edit borough for individual trees

# Row labels of the seven mislabeled trees
mislabeled_rows = [428387, 429222, 429721, 430241, 430399, 432730, 432731]

# Borough code 3 corresponds to Brooklyn
tree_census.loc[mislabeled_rows, 'boroname'] = 'Brooklyn'
tree_census.loc[mislabeled_rows, 'borocode'] = 3
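Hard-coded row labels work for a handful of trees, but a boolean mask does not depend on the index and scales to larger corrections. A minimal sketch, assuming the same column names and borough codes as the dataset (3 for Brooklyn, 4 for Queens); the `sample` rows are hypothetical:

```python
import pandas as pd

# Hypothetical sample rows; column names and borough codes follow tree_census.
sample = pd.DataFrame({
    "zipcode":  [11234, 11234, 11375],
    "borocode": [3, 4, 4],
    "boroname": ["Brooklyn", "Queens", "Queens"],
})

# Boolean mask selecting rows in zipcode 11234 that are labeled Queens.
mask = (sample["zipcode"] == 11234) & (sample["boroname"] == "Queens")

# Overwrite both borough columns for the selected rows only.
sample.loc[mask, "boroname"] = "Brooklyn"
sample.loc[mask, "borocode"] = 3
print(sample)
```

Rows outside the mask, such as the genuine Queens entry for 11375, are left untouched.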

In [9]:
# Check that Brooklyn is the only borough associated with zipcode 11234

check = tree_census[(tree_census.zipcode == 11234)]
check.borocode.unique()
check.boroname.unique()

array(['Brooklyn'], dtype=object)

In [10]:
# Keep only trees whose health is 'Good' (every other value is dropped) to reduce size of dataset

tree_census = tree_census[(tree_census.health == 'Good')]
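Before a filter like this, it can help to see how many rows each `health` value accounts for, including missing values. A minimal sketch on a hypothetical frame: the values below are made up, and the real categories should be read from `tree_census` itself.

```python
import pandas as pd

# Hypothetical sample; the real health categories come from tree_census.
sample = pd.DataFrame({"health": ["Good", "Fair", "Good", None, "Good", "Fair"]})

# Tally each health value, counting missing values as well.
health_counts = sample["health"].value_counts(dropna=False)
print(health_counts)

# Apply the same kind of filter and report how many rows survive.
kept = sample[sample["health"] == "Good"]
print("Kept {} of {} rows.".format(len(kept), len(sample)))
```

Note that the comparison `== 'Good'` also drops rows where `health` is missing, not just rows with another label.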

In [None]:
# Save data

tree_census.to_csv("datasets/tree-census-NYC-2015-short.csv")
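By default, `to_csv` writes the row index as an extra column with no name, which shows up as `Unnamed: 0` when the file is read back. That is probably the origin of the `Unnamed: 0` column seen in Section 1, though this is an assumption about how the source file was produced. A minimal sketch of the difference, using a hypothetical two-row frame and an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for tree_census.
sample = pd.DataFrame({"tree_id": [180683, 200540], "health": ["Good", "Good"]})

# Default behavior: the index is written as an extra unnamed column.
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# With index=False, only the real columns are written.
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` when saving keeps the extra column out of the file that Tableau reads.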

### Section 3: Defining Questions for Analysis<a id="questions"></a>

* Is there a correlation between household income and tree characteristics (e.g., size, health)?
    * *This is our driving question. It's why we're bringing these datasets together in the first place.*
* Which borough has the most trees?
* Where in New York City do we find the highest and lowest incomes?
    * Is there variation within a borough, or is there income clustering? 
    * Do all wealthy households live in the same borough?
* Where in New York City do we find the largest trees? The smallest?
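
The driving question relies on joining the two datasets on `zipcode`. The analysis itself happens in Tableau; what follows is only a rough pandas sketch of that join. The income figures are the first three rows shown in Section 1, the `tree_dbh` values are hypothetical, and averaging diameter per zipcode is one possible choice rather than the method used in the dashboard.

```python
import pandas as pd

# Hypothetical tree rows; column names follow tree_census.
trees = pd.DataFrame({
    "zipcode":  [10451, 10451, 10452, 10452, 10453],
    "tree_dbh": [4, 8, 10, 14, 20],
})

# Median incomes copied from the first three rows of the income table.
income = pd.DataFrame({
    "zipcode": [10451, 10452, 10453],
    "HouseholdsEstimateMedian income (dollars)": [26048, 24790, 23095],
})

# Average trunk diameter per zipcode.
mean_dbh = trees.groupby("zipcode", as_index=False)["tree_dbh"].mean()

# Attach the median household income for each zipcode.
merged = mean_dbh.merge(income, on="zipcode", how="inner")

# Pearson correlation between mean diameter and median income.
r = merged["tree_dbh"].corr(merged["HouseholdsEstimateMedian income (dollars)"])
print(merged)
print(r)
```

With only three made-up points the coefficient says nothing about New York; the sketch just shows the shape of the computation.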

### Section 4: Transitioning to Tableau Public<a id="tableau"></a>

All of these questions can be explored visually. It's time to take ourselves over to Tableau Public for some [*interactive visualizations!*](https://public.tableau.com/app/profile/leahmfulmer/viz/NYCTreesIncome_17195152146380/NYCTreesIncome?publish=yes)