# <div style="text-align: center"> Vancouver Street Trees </div>
### <div style="text-align: center"> Data Visualization Analysis | Pankti Shah | December 2021 </div> 
- - -
- - -

## Introduction 

### Motivation:

City trees are important: they purify the air, reduce heat islands, help regulate the water cycle, and provide immense health benefits. Trees play an important role in increasing urban biodiversity, providing plants and animals with a favourable habitat, food, and protection.
A mature tree absorbs more CO2 per year than a young one. As a result, trees play an important role in climate change mitigation. Especially in cities with high levels of pollution, trees can improve air quality, making cities healthier places to live in.

Large trees are excellent filters for urban pollutants and fine particulates. They absorb pollutant gases (such as carbon monoxide, nitrogen oxides, ozone, and sulfur oxides) and filter dust, dirt, or smoke out of the air by trapping them on leaves and bark. Living in close proximity to urban green spaces and having access to them can improve physical and mental health. This, in turn, contributes to the well-being of urban communities.

Trees also help to reduce carbon emissions by helping to conserve energy. For example, the correct placement of trees around buildings can reduce the need for air conditioning, and reduce winter heating bills. Not to mention planning urban landscapes with trees can increase property value, and attract tourism and businesses.


### Given these motivations, the questions I will be exploring are: 
1. Evaluate the relationship between a tree's height and diameter and its age (planting date), to understand whether an older tree has a larger diameter or height. Understanding how consistently trees are planted, and how age, height, and diameter are distributed across Vancouver, can show whether certain neighbourhoods are preferred for living and can improve the city's tree planting strategy. 
2. Evaluate the species distribution across Vancouver neighbourhoods (i.e., the popular species in each neighbourhood), to understand the biodiversity of trees across the city. 

## Analysis

In [1]:
# Importing in required libraries
import pandas as pd
import altair as alt
alt.data_transformers.enable('default', max_rows=1000000)
import json

Importing the data and dropping columns that will not help answer the questions stated in the introduction. Converting the date_planted column into year, month, and day. Dropping all rows that have a missing value. 

In [2]:
# Importing the data
url = 'https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_vancouver_trees.csv'
df = pd.read_csv(url, parse_dates=['date_planted'])

# Dropping columns that are not going to be used in analysis or helpful in answering questions as stated in introduction
df = df.drop(columns=['std_street','on_street', 'civic_number', 'tree_id' , 'cultivar_name', 'genus_name', 'assigned', 'plant_area', 'common_name' ,'on_street_block','root_barrier'])

# Converting date_planted column into separate year, month, and day columns. 
datetimes = pd.to_datetime(df['date_planted'])
df[['year','month','day']] = datetimes.dt.date.astype(str).str.split('-',expand=True)

# Removing all the rows that are missing values to ensure analysis doesn't add any unnecessary bias 
df = df.dropna()

# Adding a column in the dataset that will determine height and diameter ratio of a tree
df = df.assign(height_diameter_ratio = df['height_range_id']/df['diameter'])

df.head()

Unnamed: 0.1,Unnamed: 0,species_name,neighbourhood_name,date_planted,diameter,street_side_name,curb,height_range_id,latitude,longitude,year,month,day,height_diameter_ratio
9,13029,GRANDIFLORA X,Renfrew-Collingwood,2013-01-21,3.0,ODD,N,1,49.250114,-123.039156,2013,1,21,0.333333
10,14062,ROBUR,Kitsilano,1995-03-15,13.0,EVEN,Y,3,49.259133,-123.155318,1995,3,15,0.230769
12,3515,SYLVATICA,Renfrew-Collingwood,2001-05-01,3.0,ODD,Y,1,49.241922,-123.046271,2001,5,1,0.333333
16,14533,PENNSYLVANICA,Hastings-Sunrise,2003-01-06,8.0,ODD,Y,2,49.262,-123.036142,2003,1,6,0.25
18,13410,KOUSA,Marpole,1993-11-29,6.25,ODD,Y,2,49.211428,-123.125269,1993,11,29,0.32
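
As an alternative to splitting the date string, the date parts can be read through the pandas `.dt` accessor, which gives integer columns rather than text. The sketch below uses a small made-up frame (`toy`), with two dates copied from the output above and one invented missing value; it is an illustration, not the cell that produced the table.

```python
import pandas as pd

# Toy stand-in for the tree data: two dates from the output above plus an
# invented missing value to show how gaps behave.
toy = pd.DataFrame({"date_planted": ["2013-01-21", "1995-03-15", None]})
toy["date_planted"] = pd.to_datetime(toy["date_planted"])

# Drop missing dates first so the extracted parts can be stored as integers.
toy = toy.dropna(subset=["date_planted"]).copy()
toy["year"] = toy["date_planted"].dt.year
toy["month"] = toy["date_planted"].dt.month
toy["day"] = toy["date_planted"].dt.day

print(toy[["year", "month", "day"]])
```

One design note: Altair infers a quantitative type for integer columns, so a chart built on an integer `year` would likely need an explicit type such as `year:O` to keep one bar per year.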


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2338 entries, 9 to 4999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Unnamed: 0             2338 non-null   int64         
 1   species_name           2338 non-null   object        
 2   neighbourhood_name     2338 non-null   object        
 3   date_planted           2338 non-null   datetime64[ns]
 4   diameter               2338 non-null   float64       
 5   street_side_name       2338 non-null   object        
 6   curb                   2338 non-null   object        
 7   height_range_id        2338 non-null   int64         
 8   latitude               2338 non-null   float64       
 9   longitude              2338 non-null   float64       
 10  year                   2338 non-null   object        
 11  month                  2338 non-null   object        
 12  day                    2338 non-null   object        
 13  hei

In [4]:
df.describe(include='all')



Unnamed: 0.1,Unnamed: 0,species_name,neighbourhood_name,date_planted,diameter,street_side_name,curb,height_range_id,latitude,longitude,year,month,day,height_diameter_ratio
count,2338.0,2338,2338,2338,2338.0,2338,2338,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0
unique,,109,22,1466,,3,2,,,,31.0,11.0,31.0,
top,,PLATANOIDES,Renfrew-Collingwood,2006-11-21 00:00:00,,ODD,Y,,,,1998.0,2.0,4.0,
freq,,192,230,8,,1173,2163,,,,127.0,447.0,97.0,
first,,,,1989-11-15 00:00:00,,,,,,,,,,
last,,,,2019-04-16 00:00:00,,,,,,,,,,
mean,10232.198033,,,,6.231801,,,1.79213,49.246617,-123.098752,,,,0.341225
std,5880.952995,,,,4.351155,,,0.945447,0.021065,0.04889,,,,0.197366
min,10.0,,,,0.5,,,0.0,49.201366,-123.22344,,,,0.0
25%,4905.25,,,,3.0,,,1.0,49.229326,-123.136656,,,,0.25


We are using about 2,300 datapoints in this analysis. The cleaned dataset contains 109 unique species and 22 unique neighbourhoods in Vancouver. The raw data also records when each tree was planted, its diameter, height, and genus name, which side of the street it is planted on, and various other miscellaneous information. The latitude and longitude of each tree are also provided. All rows with null values have been removed before the analysis. Additional columns were added to the original dataset: the height to diameter ratio of each tree, and the date_planted column split into year, month, and day. 

To explore the first question, I will be using the columns year, height_range_id, diameter, height_diameter_ratio, latitude, and longitude. Information from these columns will help evaluate the relationship between the height and diameter of a tree and its age (planting date).

To explore the second question, I will be using the columns species_name, neighbourhood_name, latitude, and longitude. Information from these columns will help evaluate the species distribution across the city of Vancouver. 

In [5]:
click = alt.selection_multi()

chart1 = (alt.Chart(df).mark_bar().encode(
    alt.X('year', title='Year'),
    alt.Y('count()', title='Total Trees Planted in the Year', sort='x'),
    alt.Color('year', title="Year"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title = "Figure 1: Total Trees Planted from 1989 to 2019 across Vancouver")

chart1

Figure 1 above shows that roughly 80 or more trees were planted each year from 1995 to 2013. Outside this period, the number of trees planted drops sharply. In the most recent years, fewer than 25 trees per year appear in the data. 
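
The yearly counts behind a chart like Figure 1 can also be tabulated directly, which makes it easy to list the low-planting years. This is a minimal sketch on invented planting years (the `toy` frame and the threshold of 2 are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical planting years, not taken from the real dataset.
toy = pd.DataFrame({"year": ["1995", "1995", "1996", "2018", "1996", "1995"]})

# Number of rows (trees) per planting year, in chronological order.
trees_per_year = toy["year"].value_counts().sort_index()

# Years falling below an arbitrary illustrative threshold.
low_years = trees_per_year[trees_per_year < 2].index.tolist()

print(trees_per_year)
print(low_years)
```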

In [6]:
click = alt.selection_multi()

chart2 = (alt.Chart(df).mark_bar().encode(
    alt.X('year', title='Year'),
    alt.Y('mean(height_range_id)', title='Mean height of trees', sort='x'),
    alt.Color('year', title="Year"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title = 'Figure 2: Mean Height of Trees Planted from 1989 to 2019')

chart2

Figure 2 shows that, on average, the older the tree, the greater its height. This makes sense, since trees grow taller over time. Because an unequal number of trees was planted each year, some years deviate from the trend. Overall, however, trees tend to be taller with age. 

In [7]:
click = alt.selection_multi()

chart3 = (alt.Chart(df).mark_bar().encode(
    alt.X('year', title='Year'),
    alt.Y('mean(diameter)', title='Mean diameter of the tree', sort='x'),
    alt.Color('year', title="Year"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title = 'Figure 3: Mean Diameter of Trees Planted from 1989 to 2019')

chart3

Figure 3 shows that, on average, the older the tree, the greater its diameter. This makes sense, since trees grow wider over time. Because an unequal number of trees was planted each year, some years deviate from the trend. Overall, however, trees tend to be wider with age. 
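
The visual trend in Figures 2 and 3 could also be summarised with a correlation coefficient between tree age and each size measure. The sketch below assumes age is measured relative to 2019 (the last planting year in the summary table above) and runs on invented rows, so the printed values say nothing about the real data.

```python
import pandas as pd

# Made-up rows that mimic the columns used above (year stored as a string).
toy = pd.DataFrame({
    "year": ["1990", "1995", "2000", "2005", "2010", "2015"],
    "diameter": [14.0, 12.0, 9.0, 8.0, 4.0, 3.0],
    "height_range_id": [4, 3, 3, 2, 1, 1],
})

# Assumption: age is counted back from the last planting year in the data.
REFERENCE_YEAR = 2019
toy["age"] = REFERENCE_YEAR - toy["year"].astype(int)

# Pearson correlation of age with each size measure; values near +1 mean
# older trees tend to be larger.
corr_diameter = toy["age"].corr(toy["diameter"])
corr_height = toy["age"].corr(toy["height_range_id"])

print(round(corr_diameter, 2), round(corr_height, 2))
```

Because `height_range_id` is a height class rather than a measured height, a rank-based correlation may be a better fit for that column; Pearson is used here only to keep the sketch dependency-free.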

In [8]:
click = alt.selection_multi()

chart4 = (alt.Chart(df).mark_bar().encode(
    alt.X('year', title='Year'),
    alt.Y('mean(height_diameter_ratio)', title='Mean height to diameter ratio of the tree', sort='x'),
    alt.Color('year', title="Year"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title = 'Figure 4: Mean Height:Diameter of Trees Planted from 1989 to 2019')

chart4

Figure 4 shows that the height to diameter ratio is more or less consistent across planting years, with yearly averages mostly between 0.30 and 0.35. This suggests that most tree species grow taller and wider at a similar rate. 
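
One way to put a number on "more or less consistent" is to compare the spread of the yearly mean ratios with their overall level. The following is a sketch on invented values; treating `height_range_id` as a numeric height follows the analysis above and is itself an assumption.

```python
import pandas as pd

# Illustrative values only; height_range_id is treated as a numeric height class.
toy = pd.DataFrame({
    "year": ["1995", "1995", "2005", "2005", "2015", "2015"],
    "height_range_id": [3, 4, 2, 2, 1, 1],
    "diameter": [10.0, 12.0, 6.0, 7.0, 3.0, 3.5],
})
toy["height_diameter_ratio"] = toy["height_range_id"] / toy["diameter"]

# Mean ratio per planting year, then its relative spread across years
# (standard deviation divided by mean; smaller means more consistent).
ratio_by_year = toy.groupby("year")["height_diameter_ratio"].mean()
relative_spread = ratio_by_year.std() / ratio_by_year.mean()

print(ratio_by_year.round(3))
print(round(relative_spread, 3))
```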

In [9]:
# Next, combining Figures 1 to 4 in a specific layout. 
# Adding a column selection such that selecting one year will highlight bars of the same year across all the charts. 

click = alt.selection_multi()

chart1 = (alt.Chart(df).mark_bar().encode(
    alt.X('year', title='Year'),
    alt.Y('count()', title='Total Trees Planted in the Year', sort='x'),
    alt.Color('year', title="Year"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title = "Figure 1: Total Trees Planted from 1989 to 2019 across Vancouver")

chart2 = (alt.Chart(df).mark_bar().encode(
    alt.X('year', title='Year'),
    alt.Y('mean(height_range_id)', title='Mean height of trees', sort='x'),
    alt.Color('year', title="Year"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title = 'Figure 2: Mean Height of Trees Planted from 1989 to 2019')

chart3 = (alt.Chart(df).mark_bar().encode(
    alt.X('year', title='Year'),
    alt.Y('mean(diameter)', title='Mean diameter of the tree', sort='x'),
    alt.Color('year', title="Year"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title = 'Figure 3: Mean Diameter of Trees Planted from 1989 to 2019')

chart4 = (alt.Chart(df).mark_bar().encode(
    alt.X('year', title='Year'),
    alt.Y('mean(height_diameter_ratio)', title='Mean height to diameter ratio of the tree', sort='x'),
    alt.Color('year', title="Year"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title = 'Figure 4: Mean Height:Diameter of Trees Planted from 1989 to 2019')

combined = (chart1) & (chart2 | chart3) & (chart4)

In [10]:
# Plots with slider filter  

# A slider filter 1
# This plots diameter of a tree vs year it was planted. The slider lets you explore the data in increments of 5.
slider = alt.binding_range(min=0, max=60, step=5, name='Diameter')
selector = alt.selection_single(name="SelectorName", fields=['diameter'],
                                bind=slider, init={'diameter': 0})

filter_year2 = alt.Chart(df).mark_point().encode(
    x=alt.X('year', title='Year'),
    y=alt.Y('diameter', title='Diameter'),
    color=alt.condition(
        alt.datum.diameter < selector.diameter,
        alt.value('red'), alt.value('blue')
    )
).add_selection(
    selector
).properties(width=400, title = 'Figure 5: Diameter of Trees Planted from 1989 to 2019')


# A slider filter 2
# This plots height of a tree vs year it was planted. The slider lets you explore the data in increments of 0.5.
# This selection gets its own name so it cannot clash with the diameter slider when the two charts are concatenated.
slider = alt.binding_range(min=0, max=8, step=0.5, name='Height')
selector = alt.selection_single(name="HeightSelector", fields=['height_range_id'],
                                bind=slider, init={'height_range_id': 0})

filter_year3 = alt.Chart(df).mark_point().encode(
    x=alt.X('year', title='Year'),
    y=alt.Y('height_range_id', title='Height'),
    color=alt.condition(
        alt.datum.height_range_id < selector.height_range_id,
        alt.value('red'), alt.value('blue')
    )
).add_selection(
    selector
).properties(width=400, title = 'Figure 6: Height of Trees Planted from 1989 to 2019')

# Layout for slider 1 and 2 plots
points_combined = filter_year2 | filter_year3
points_combined

Figures 5 and 6 show the diameter and height of individual trees planted from 1989 to 2019, respectively. They let us look past the averages used in the previous figures. The outliers are clearly visible, which helps explain why the averages for some years did not follow the trend. For example, the data points from 1998 are more spread out and include clear outliers, which pushed that year's averages slightly higher than anticipated. For the purpose of this analysis, however, we will not remove any outliers from the dataset.   
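
The outliers visible in the scatter plots could also be flagged numerically. The sketch below applies the common 1.5 × IQR rule within each planting year; the rule, the multiplier, and the `toy` values are assumptions for illustration, and nothing is removed from the data.

```python
import pandas as pd

# Made-up diameters for a single planting year, with one obviously large value.
toy = pd.DataFrame({
    "year": ["1998"] * 8,
    "diameter": [3.0, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 30.0],
})

# Quartiles are computed per planting year so each year is judged on its own.
grouped = toy.groupby("year")["diameter"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag values more than 1.5 * IQR below Q1 or above Q3.
toy["is_outlier"] = (toy["diameter"] < q1 - 1.5 * iqr) | (toy["diameter"] > q3 + 1.5 * iqr)

print(toy[toy["is_outlier"]])
```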

Next, we will explore distribution of height and diameter of the trees across various Vancouver neighbourhoods.

In [11]:
# Import relevant data
url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))

# Create Map of Vancouver
vancouver_map = alt.Chart(data_geojson_remote).mark_geoshape(
    color = 'gray', opacity= 0.5, stroke='white').encode(
).project(type='identity', reflectY=True)

# Median diameter per neighbourhood (numeric_only=True keeps text columns out of the median)
median_df = df.groupby('neighbourhood_name'
                      ).median(numeric_only=True).reset_index(
).rename(columns={'neighbourhood_name':'name'})[['name',
                                                 'diameter', 
                                                 'latitude', 
                                                 'longitude']]

# Add ability to explore the data via hovering over the map
hover = alt.selection_single(fields=['name'], on='mouseover')

# Map to show tree diameter distribution across Vancouver
chart5 = alt.Chart(data_geojson_remote).mark_geoshape().transform_lookup(
    lookup='properties.name',
    from_=alt.LookupData(median_df, 'name', ['diameter', 'name'])).encode(
    color=alt.Color('diameter:Q', title='Diameter'),
    opacity=alt.condition(hover, alt.value(1),alt.value(0.4)),
    tooltip=['name:N', alt.Tooltip('diameter:Q', title='Diameter')]).project(type='identity', reflectY=True).properties(title='Map 1: Diameter of Trees across Vancouver Neighbourhood').add_selection(hover)



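# Median height per neighbourhood (numeric_only=True keeps text columns out of the median)
median_dfs = df.groupby('neighbourhood_name'
                      ).median(numeric_only=True).reset_index(
).rename(columns={'neighbourhood_name':'name'})[['name',
                                                 'height_range_id', 
                                                 'latitude', 
                                                 'longitude']]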

# Map to show tree height distribution across Vancouver
chart6 = alt.Chart(data_geojson_remote).mark_geoshape().transform_lookup(
    lookup='properties.name',
    from_=alt.LookupData(median_dfs, 'name', ['height_range_id', 'name'])).encode(
    color=alt.Color('height_range_id:Q', title = 'Height'),
    opacity=alt.condition(hover, alt.value(1),alt.value(0.4)),
    tooltip=['name:N', alt.Tooltip('height_range_id:Q', title='Height')]).project(type='identity', reflectY=True).properties(title='Map 2: Height of Trees across Vancouver Neighbourhood').add_selection(hover)

chart6


# Layout to effectively combine both charts 
map_combined = chart5 | chart6
map1 = map_combined.resolve_scale(color='independent')
map1

Previously, we found that larger diameters and heights are associated with older trees. The Hastings-Sunrise neighbourhood has the greatest number of older trees. The age of trees is distributed quite unevenly across the city. In general, however, the northern side of the city has bigger or older trees than the southern side. 
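
The neighbourhood comparison shown on the maps can be checked as a ranked table of medians. This is a minimal sketch on invented rows (the neighbourhood names occur in the dataset, the numbers do not); selecting the numeric columns before calling `median()` is one way to keep text columns out of the aggregation.

```python
import pandas as pd

# Hypothetical rows; the numbers are invented for illustration.
toy = pd.DataFrame({
    "neighbourhood_name": ["Kitsilano", "Kitsilano", "Marpole", "Marpole", "Marpole"],
    "species_name": ["ROBUR", "RUBRUM", "KOUSA", "KOUSA", "RUBRUM"],
    "diameter": [13.0, 9.0, 6.0, 4.0, 5.0],
    "height_range_id": [3, 2, 2, 1, 1],
})

# Select the numeric columns explicitly so text columns never reach median().
medians = (
    toy.groupby("neighbourhood_name")[["diameter", "height_range_id"]]
    .median()
    .sort_values("diameter", ascending=False)
)

print(medians)
```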

In [12]:
#Interim Layout for Question # 1 
Layout_diameter_height = (map1 & combined) & (filter_year2 | filter_year3)
Layout_diameter_height

In [13]:
# Since there are 100+ species in the dataset, we will only look at top 10 most popular species found in Vancouver area. 
dt=df.groupby('species_name').count()

#Find top 10 species
top_10 = dt.sort_values(by='Unnamed: 0', ascending=False).head(10)

#make a table that only contains top 10
top = df.query('species_name in ["SERRULATA","PLATANOIDES","CERASIFERA", "RUBRUM", "SYLVATICA", "AMERICANA","EUCHLORA   X", "BETULUS","CAMPESTRE","FREEMANI   X"]')

top.head()

Unnamed: 0.1,Unnamed: 0,species_name,neighbourhood_name,date_planted,diameter,street_side_name,curb,height_range_id,latitude,longitude,year,month,day,height_diameter_ratio
12,3515,SYLVATICA,Renfrew-Collingwood,2001-05-01,3.0,ODD,Y,1,49.241922,-123.046271,2001,5,1,0.333333
30,14946,CAMPESTRE,Dunbar-Southlands,2002-03-28,5.0,EVEN,Y,3,49.23946,-123.18166,2002,3,28,0.6
31,2503,CAMPESTRE,Grandview-Woodland,1989-11-24,8.0,EVEN,Y,3,49.272019,-123.06091,1989,11,24,0.375
34,1782,FREEMANI X,Hastings-Sunrise,2006-11-21,4.0,MED,Y,2,49.269494,-123.03576,2006,11,21,0.5
43,13617,RUBRUM,Victoria-Fraserview,1996-11-07,7.5,EVEN,Y,2,49.212181,-123.058075,1996,11,7,0.266667
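
The species list in the query above is typed out by hand. A possible alternative is to derive the top species from the counts and filter with `isin`, which avoids retyping names that contain unusual spacing. The sketch uses an invented frame and `TOP_K = 2`; both are assumptions for illustration.

```python
import pandas as pd

# Invented species column; the names are real species codes from the output above.
toy = pd.DataFrame({
    "species_name": ["RUBRUM", "KOUSA", "RUBRUM", "ROBUR", "KOUSA", "RUBRUM", "SYLVATICA"]
})

TOP_K = 2  # the notebook uses 10; 2 keeps the toy example readable

# value_counts() sorts by frequency, so the first TOP_K index entries are the most common species.
top_species = toy["species_name"].value_counts().head(TOP_K).index.tolist()

# Keep only the rows whose species is in that list.
top_rows = toy[toy["species_name"].isin(top_species)]

print(top_species)
print(len(top_rows))
```

If several species tie at the cutoff, `head` keeps whichever comes first in the sorted counts, so the exact membership of the list can depend on ordering.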


In [14]:
# Create bar chart of the top 10 species by number of trees
click = alt.selection_multi()
chart7 = (alt.Chart(top).mark_bar().encode(
    alt.X('count()', title='Number of Trees'),
    alt.Y('species_name', title='Species Name', sort='x'),
    alt.Color('species_name', title="Species Name"),
    opacity=alt.condition(click, alt.value(0.9), alt.value(0.2)))
.add_selection(click)).properties(width=400, title= 'Figure 7: Top 10 Species')

chart7

Figure 7 shows that the following are the most popular tree species planted across Vancouver: Platanoides, Rubrum, Sylvatica, Cerasifera, Campestre, Betulus, Freemani X, Americana, Serrulata, and Euchlora X. Of the 109 unique species in the dataset, these 10 account for a large share of the trees. 
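
How much of the dataset the top species cover can be computed as a share of all rows. The sketch below uses placeholder species labels and counts that are purely hypothetical, so the resulting fraction is not a statement about the Vancouver data.

```python
import pandas as pd

# Placeholder species labels with invented counts: 5, 3, 1, and 1 trees.
toy = pd.DataFrame({
    "species_name": ["SPECIES_A"] * 5 + ["SPECIES_B"] * 3 + ["SPECIES_C"] + ["SPECIES_D"]
})

# normalize=True returns each species' fraction of all rows, sorted from largest to smallest.
shares = toy["species_name"].value_counts(normalize=True)

# Combined share of the two most common species.
top_share = shares.head(2).sum()

print(shares)
print(round(top_share, 2))
```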

In [15]:
heatmap1 = alt.Chart(top).mark_rect().encode(
    alt.Color('count()'),
    alt.X('species_name', title='Species Name'),
    alt.Y('neighbourhood_name', sort='color', title='Neighbourhood')).properties(width=400, title = 'Figure 8: Heat Map of Residence of Popular Species in Vancouver Neighbourhood')
heatmap1

In [16]:
#A dropdown filter
combined2 = (chart7 & heatmap1)
combined2

species = ["SERRULATA","PLATANOIDES","CERASIFERA", "RUBRUM", "SYLVATICA", "AMERICANA","EUCHLORA   X", "BETULUS","CAMPESTRE","FREEMANI   X"]
neighbourhood = sorted(top['neighbourhood_name'].unique())

species_dropdown = alt.binding_select(options=species)
neighbourhood_dropdown = alt.binding_select(options=neighbourhood)

species_select = alt.selection_single(fields=['species_name'], bind=species_dropdown, name="species_name")
neighbourhood_select = alt.selection_single(fields=['neighbourhood_name'], bind=neighbourhood_dropdown, name="neighbourhood_name")

filter_species = combined2.add_selection(species_select).transform_filter(species_select)
filter_species2 = filter_species.add_selection(neighbourhood_select).transform_filter(neighbourhood_select)

filter_species2

## Dashboard

In [17]:
panel_layout = Layout_diameter_height & filter_species2

Description of the dashboard panel below. 

- Maps of Vancouver neighbourhoods that show the median diameter and height of the trees. Both maps are interactive (by hovering over neighbourhoods). Additional tooltips show the name of the neighbourhood and the median diameter or height of its trees. 

- Figure 1 shows the total number of trees planted in each year. Figures 2 and 3 show the average tree height and diameter for each planting year. Figure 4 shows the average height to diameter ratio for each year. All 4 plots are interactive through selection of a year. 

- Figures 5 and 6 show the diameter and height of the trees for every datapoint available. Height and diameter slider widgets are included to help explore the data. As the slider value increases, points below it change from blue to red, which helps keep track of and better visualize the data. 

- Figure 7 shows the top 10 species found in the Vancouver area. This is an interactive bar plot. The heat map, Figure 8, shows the number of trees of each popular species in each Vancouver neighbourhood. Both Figures 7 and 8 are interactive through two filter widgets. The species widget filters both Figures 7 and 8. The neighbourhood widget then narrows the view to the trees that satisfy both widget criteria. 

In total we have 2 interactive maps, 4 interactive bar plots, 2 scatter plots controlled by slider widgets, and a heat map and bar plot that are filtered together through dropdown widgets. 

In [18]:
panel_layout

## Discussion

### Summary
Trees now have a fundamental place in many big cities around the world. Large trees are excellent filters for urban pollutants and fine particulates. They absorb pollutant gases and filter dust, dirt, or smoke out of the air by trapping them on leaves and bark. Living in close proximity to urban green spaces and having access to them can improve physical and mental health. This, in turn, contributes to the well-being of urban communities. Trees also help to reduce carbon emissions by helping to conserve energy, can increase property value, and attract tourism and business.

Given these motivations, I used the 'Vancouver trees' dataset to evaluate the relationship between the height and diameter of a tree and its age. I also looked at the distribution of tree species, age, height, and diameter across Vancouver. 

I used about 2,300 datapoints; the dataset contained 109 unique species and 22 unique neighbourhoods in Vancouver. All rows with null values were removed before the analysis. 

From the visualizations, the following was determined:
- A similar number of trees was planted each year from 1995 to 2013 (~80/year). New planting has dropped significantly in the last couple of years (fewer than 25 new trees per year).
- The older the tree, the greater its height and diameter on average. The height to diameter ratio is more or less consistent across the years, which makes sense if trees grow wider and taller at a similar rate. 
- The top 10 most popular species across Vancouver are Platanoides, Rubrum, Sylvatica, Cerasifera, Campestre, Betulus, Freemani X, Americana, Serrulata, and Euchlora X. 
- The Hastings-Sunrise neighbourhood has the largest number of large trees and the greatest species diversity.  
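
A claim about species diversity per neighbourhood can be checked by counting distinct species in each one. The following sketch uses invented rows and the simplest possible measure (number of unique species); richer diversity indices exist but are not used in this notebook.

```python
import pandas as pd

# Hypothetical rows; neighbourhood and species names appear in the dataset,
# but these combinations are invented.
toy = pd.DataFrame({
    "neighbourhood_name": ["Hastings-Sunrise"] * 4 + ["Marpole"] * 3,
    "species_name": ["RUBRUM", "KOUSA", "ROBUR", "RUBRUM", "KOUSA", "KOUSA", "RUBRUM"],
})

# Number of distinct species per neighbourhood, largest first.
richness = (
    toy.groupby("neighbourhood_name")["species_name"]
    .nunique()
    .sort_values(ascending=False)
)

print(richness)
```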

### Conclusion
Trees are an important part of urbanization. Given the numerous benefits of having trees in a neighbourhood, it is important for cities to keep evaluating tree biodiversity and to keep up with new planting as required. The benefits of older trees are especially important. Typically, older trees have greater height and diameter. However, we can look into investing in tree species that grow quicker than others. The most popular tree species in Vancouver is Platanoides; it is most popular in the Hastings-Sunrise neighbourhood. 

In the future, it would be interesting to evaluate the data to answer the following additional questions: 
- Determine the average property value of each neighbourhood and find its correlation with tree biodiversity, number of trees, and/or age of trees.
- Determine the species that have a higher relative growth rate per year. These species can then be planted in newer neighbourhoods to increase their biodiversity. 
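
As a starting point for the growth question, one rough proxy is diameter divided by age, summarised per species. The sketch below is only an illustration under stated assumptions: the reference year of 2021, the `toy` rows, and the choice of the median are all hypothetical, and the proxy ignores the size of a tree at planting.

```python
import pandas as pd

# Invented rows with the same column names as the notebook's data.
toy = pd.DataFrame({
    "species_name": ["RUBRUM", "RUBRUM", "KOUSA", "KOUSA"],
    "year": ["2001", "2011", "2001", "2011"],
    "diameter": [10.0, 4.0, 5.0, 3.0],
})

# Assumption: age is counted back from the year this analysis was written.
REFERENCE_YEAR = 2021
toy["age"] = REFERENCE_YEAR - toy["year"].astype(int)

# Skip trees with zero age to avoid dividing by zero.
grown = toy[toy["age"] > 0].copy()
grown["diameter_per_year"] = grown["diameter"] / grown["age"]

# Median growth proxy per species, fastest first.
growth = (
    grown.groupby("species_name")["diameter_per_year"]
    .median()
    .sort_values(ascending=False)
)

print(growth)
```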

## References

Not all the work in this notebook is original. Parts that were borrowed from other resources are as follows:

- Importance of Tree Plantation [Trees](https://www.ecowatch.com/trees-climate-cities-2646806706.html)
- Programming in Python for Data Science sample final project for inspiration [Data Source](https://www.kaggle.com/rtatman/lego-database)
- Altair documentation including, but not limited to, 
    - [Top K Items](https://altair-viz.github.io/gallery/top_k_items.html)
    - [Top-K plot with Others](https://altair-viz.github.io/gallery/top_k_with_others.html)
    - [Custom Color Mapping](https://altair-viz.github.io/user_guide/customization.html#color-domain-and-range)