# Analysis Report of Vancouver Street Trees Dataset

<div style="text-align: right"> November 7, 2021 </div>

<div style="text-align: right"> Final Project Analysis Report by Peng Zhang </div>

## INTRODUCTION

The City of Vancouver is one of the leading cities in Canada in advocating Open Data in the public sector. It has developed and maintains a robust street trees database that is freely available to the public under the terms of the <a href="https://opendata.vancouver.ca/pages/licence/" target="_blank">Open Government Licence – Vancouver</a>.

As a Data Science practitioner and a believer in Data Liberation, I am pleased to explore a <a href="https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/vancouver_trees.csv" target="_blank">subset</a> of the Vancouver Street Trees dataset introduced in the Data Visualization course at UBC. **The intention of the analysis is to gain an overall view of Vancouver street trees in terms of their biological diversity, growth and distribution within neighbourhoods**. The findings may provide reference and insights for residents of Vancouver, botanists, urban greening project workers, and potential house buyers who are interested in the street trees of the city.

### Questions of Interest

Following an Exploratory Data Analysis (EDA) of the Vancouver Street Trees subset noted above, the following questions were defined in line with the intention of the analysis.

1. Based on the distribution of street trees planted in Vancouver by genus, which genus is the most popular one?
2. In the past 30 years, how many trees have been planted every single year? And what is the number of trees planted by genus each year? 
3. Based on the growth rates of each genus, which genus is the fastest one within the Vancouver street trees dataset?
4. How many street trees are growing in each neighbourhood and what are the average age and average diameter of trees in each neighbourhood?

## DATA IMPORTS AND WRANGLING

### Data imports

In [1]:
# Import libraries needed for this assignment

import altair as alt
import pandas as pd
import json

alt.data_transformers.enable("data_server")

# Read the data from the remote URL
URL = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/vancouver_trees.csv"
trees_df_original = pd.read_csv(URL)

# Glance at the original df
trees_df_original

Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,civic_number,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,W 13TH AV,MAPLE ST,PSEUDOPLATANUS,Kitsilano,,9.00,EVEN,ACER,N,1996,10,Y,13310,SYCAMORE MAPLE,4,2900,,N,49.259856,-123.150586
1,WALES ST,WALES ST,PLATANOIDES,Renfrew-Collingwood,2018-11-28,3.00,ODD,ACER,N,5291,7,Y,259084,PRINCETON GOLD MAPLE,1,5200,PRINCETON GOLD,N,49.236650,-123.051831
2,W BROADWAY,W BROADWAY,RUBRUM,Kitsilano,1996-04-19,14.00,EVEN,ACER,N,3618,C,Y,167986,KARPICK RED MAPLE,3,3600,KARPICK,N,49.264250,-123.184020
3,PENTICTON ST,PENTICTON ST,CALLERYANA,Renfrew-Collingwood,2006-03-06,3.75,EVEN,PYRUS,N,2502,5,Y,213386,CHANTICLEER PEAR,1,2500,CHANTICLEER,Y,49.261036,-123.052921
4,RHODES ST,RHODES ST,GLYPTOSTROBOIDES,Renfrew-Collingwood,2001-11-01,3.00,ODD,METASEQUOIA,N,5639,N,Y,189223,DAWN REDWOOD,2,5600,,N,49.233354,-123.050249
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,ROBSON ST,ROBSON ST,CAMPESTRE,West End,,7.00,ODD,ACER,N,1015,c,Y,122814,HEDGE MAPLE,2,1000,,N,49.283666,-123.123231
29996,OSLER ST,CONNAUGHT DRIVE,PLATANOIDES,Shaughnessy,2007-04-16,8.00,ODD,ACER,N,4690,10,Y,132211,NORWAY MAPLE,1,1000,,Y,49.243636,-123.129480
29997,BEATRICE ST,BEATRICE ST,CERASIFERA,Victoria-Fraserview,,17.30,EVEN,PRUNUS,N,6218,9,Y,59355,PISSARD PLUM,3,6200,ATROPURPUREUM,N,49.227406,-123.066936
29998,ANGUS DRIVE,ANGUS DRIVE,BILOBA,Shaughnessy,2006-02-17,4.00,ODD,GINKGO,N,1551,9,Y,207753,GINKGO OR MAIDENHAIR TREE,1,1500,,Y,49.254431,-123.140382


### Identify and drop irrelevant columns

In [2]:
# Check columns of the original df

trees_df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   std_street          30000 non-null  object 
 1   on_street           30000 non-null  object 
 2   species_name        30000 non-null  object 
 3   neighbourhood_name  30000 non-null  object 
 4   date_planted        14085 non-null  object 
 5   diameter            30000 non-null  float64
 6   street_side_name    30000 non-null  object 
 7   genus_name          30000 non-null  object 
 8   assigned            30000 non-null  object 
 9   civic_number        30000 non-null  int64  
 10  plant_area          29722 non-null  object 
 11  curb                30000 non-null  object 
 12  tree_id             30000 non-null  int64  
 13  common_name         30000 non-null  object 
 14  height_range_id     30000 non-null  int64  
 15  on_street_block     30000 non-null  int64  
 16  cult

Based on the above data information and the dataset schema from <a href="https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id" target="_blank">City of Vancouver Open Data Portal - Street Trees</a>, the **columns** can be preliminarily identified as four **groups**:

1. Trees biological classifications and names, such as `genus_name`, `species_name`, `common_name`, `cultivar_name`
2. Trees growth related characteristics, such as `date_planted`, `diameter`, `height_range_id`
3. Trees coordinates and areas, such as `latitude`, `longitude`, `neighbourhood_name`
4. Other specific location / orientation / identification information

Given the questions of interest, the columns in the fourth group and the tree coordinates are irrelevant and will be dropped. To narrow the focus to the highest level of tree classification available in the dataset, the columns `species_name`, `common_name` and `cultivar_name` will also be dropped, keeping only `genus_name`.

In [3]:
# Transform the column 'date_planted' to a datetime64 dtype and drop irrelevant columns
trees_df_dropped = pd.read_csv(URL,parse_dates = ['date_planted']
                                    ).drop(columns=['std_street',
                                                    'on_street',
                                                    'street_side_name',
                                                    'assigned',
                                                    'civic_number',
                                                    'plant_area',
                                                    'curb',
                                                    'on_street_block',
                                                    'tree_id',
                                                    'root_barrier',
                                                    'latitude',
                                                    'longitude',
                                                    'species_name',
                                                    'common_name',
                                                    'cultivar_name'])

trees_df_dropped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   neighbourhood_name  30000 non-null  object        
 1   date_planted        14085 non-null  datetime64[ns]
 2   diameter            30000 non-null  float64       
 3   genus_name          30000 non-null  object        
 4   height_range_id     30000 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 1.1+ MB


### Examine null values

Based on the above dataframe information, more than half of the values in the `date_planted` column are missing (15,915 of 30,000). Since reviewing tree growth is one of the tasks in this analysis, observations without a planting date are considered uninformative and are dropped from the dataframe. An additional check performed in the EDA indicated that dropping the observations with null values in `date_planted` does not cause a data representativeness issue.

In [4]:
# Drop observations without value of date_planted

trees_df_dropped_dropna = trees_df_dropped.dropna(subset=['date_planted'])

trees_df_dropped_dropna.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14085 entries, 1 to 29998
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   neighbourhood_name  14085 non-null  object        
 1   date_planted        14085 non-null  datetime64[ns]
 2   diameter            14085 non-null  float64       
 3   genus_name          14085 non-null  object        
 4   height_range_id     14085 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 660.2+ KB
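
The representativeness check mentioned above is not reproduced in this report. As a rough sketch of how such a check might look, the snippet below compares the genus shares of rows with and without a planting date. The toy data and the helper name `genus_share_gap` are assumptions for illustration and are not part of the original EDA.

```python
import pandas as pd

def genus_share_gap(df, date_col="date_planted", group_col="genus_name"):
    """Return per-genus shares for dated vs undated rows and their absolute gap."""
    has_date = df[date_col].notna()
    dated = df.loc[has_date, group_col].value_counts(normalize=True)
    undated = df.loc[~has_date, group_col].value_counts(normalize=True)
    # Align the two share series side by side; a genus absent from one side gets 0
    shares = pd.concat([dated, undated], axis=1, keys=["dated", "undated"]).fillna(0.0)
    shares["gap"] = (shares["dated"] - shares["undated"]).abs()
    return shares.sort_values("gap", ascending=False)

# Hypothetical toy data, not taken from the Vancouver dataset
toy = pd.DataFrame({
    "genus_name": ["ACER", "ACER", "PRUNUS", "ACER", "PRUNUS", "ACER", "ACER", "PRUNUS"],
    "date_planted": ["2001-05-01", None, "1999-03-12", "2010-07-30",
                     None, None, "2005-01-15", "2012-11-02"],
})
toy["date_planted"] = pd.to_datetime(toy["date_planted"])

gaps = genus_share_gap(toy)
print(gaps)
```

Small gaps across all genera would suggest that the dated rows resemble the undated ones in genus composition; the same comparison could be repeated for `neighbourhood_name`.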


### Further examination for erroneous data

In [5]:
# Examine the remaining columns for more details

trees_df_dropped_dropna.describe(include='all',datetime_is_numeric=True)

Unnamed: 0,neighbourhood_name,date_planted,diameter,genus_name,height_range_id
count,14085,14085,14085.0,14085,14085.0
unique,22,,,68,
top,Renfrew-Collingwood,,,ACER,
freq,1323,,,3970,
mean,,2003-09-20 17:40:42.172523904,6.352586,,1.822932
min,,1989-10-27 00:00:00,0.0,,0.0
25%,,1997-12-15 00:00:00,3.0,,1.0
50%,,2003-04-01 00:00:00,5.0,,2.0
75%,,2009-11-13 00:00:00,8.0,,2.0
max,,2019-06-03 00:00:00,317.0,,9.0


The **minimum values of `diameter` and `height_range_id`** are both **zero**. For `height_range_id`, 0 represents a height range of 0 to 10 ft, so it is a valid value. However, since `diameter` is the diameter of the tree at breast height, it should not be 0. These observations are therefore considered invalid and removed from the dataframe.

In [6]:
zero_diameter_index = trees_df_dropped_dropna[trees_df_dropped_dropna['diameter'] == 0].index

# To simplify the object names afterwards, make a copy of df named trees_df
trees_df = trees_df_dropped_dropna.copy()

trees_df.drop(zero_diameter_index, inplace=True)

trees_df.describe(include='all',datetime_is_numeric=True)

Unnamed: 0,neighbourhood_name,date_planted,diameter,genus_name,height_range_id
count,14083,14083,14083.0,14083,14083.0
unique,22,,,68,
top,Renfrew-Collingwood,,,ACER,
freq,1323,,,3970,
mean,,2003-09-20 23:57:38.893701504,6.353489,,1.822978
min,,1989-10-27 00:00:00,0.5,,0.0
25%,,1997-12-15 12:00:00,3.0,,1.0
50%,,2003-04-01 00:00:00,5.0,,2.0
75%,,2009-11-13 00:00:00,8.0,,2.0
max,,2019-06-03 00:00:00,317.0,,9.0


## ANALYSIS

### Dataset Description

The cleaned target dataframe **trees_df** has 5 columns. It contains 14,083 trees (observations) from 68 distinct genera. Brief descriptions of the columns, based on the <a href="https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id" target="_blank">City of Vancouver Open Data Portal - Street Trees</a> where the dataset was originally obtained, are listed below:

* **Categorical columns**

`neighbourhood_name`: City's defined local area in which the tree is located.

`genus_name`: Genus name of trees.

* **Quantitative columns**

`diameter`: DBH in inches (DBH stands for diameter of tree at breast height).

`height_range_id`: 0-10 for every 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft).

* **Datetime columns**

`date_planted`: The date of planting.

#### Q1 - Based on distribution of street trees planted in Vancouver by genus, which genus is the most popular one?

Biological diversity is one of the interests of the analysis. A quick guess at the answer would be Acer, the genus commonly known as maples: a maple leaf appears on the Canadian flag, and the maple is a national symbol of Canada.

In [7]:
# Visualize the distribution of genus for the whole df
plot_1_title = alt.TitleParams(
    "Figure 1 Number of street trees planted per genus",
     subtitle = "Acer is the most popular genus of street trees")

plot_1_genus = alt.Chart(trees_df).mark_bar().encode(
    alt.X('count():Q',title='Number of Trees'),
    alt.Y('genus_name:N',title='Genus',sort='x')
)

# Add text annotation for the number of trees for each genus
text_1_genus = plot_1_genus.mark_text(align='left',dx=2).encode(text='count():Q')

plot_1_genus = (plot_1_genus + text_1_genus).properties(title=plot_1_title,width=550)

plot_1_genus

Figure 1 confirms that Acer is the most popular genus in the dataframe. There are more than twice as many Acer trees as Prunus trees, the second most numerous genus among Vancouver street trees. All remaining genera have fewer than 1,000 trees in the dataframe, and 47 of the 68 genera have fewer than 100 trees.
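
The counts read off Figure 1 can also be tallied directly from the genus column. The following is a minimal sketch on a made-up series; the genus labels and the cut-off of 3 are assumptions for illustration (the report itself uses cut-offs of 100 and 1,000 trees).

```python
import pandas as pd

# Hypothetical genus labels standing in for trees_df['genus_name']
genus = pd.Series(["ACER"] * 5 + ["PRUNUS"] * 2 + ["TILIA"] * 2 + ["GINKGO"])

counts = genus.value_counts()               # trees per genus, largest first
top_genus = counts.idxmax()                 # most common genus
threshold = 3                               # illustrative cut-off
n_small = int((counts < threshold).sum())   # genera with fewer trees than the cut-off

print(top_genus, n_small, len(counts))
```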

#### Q2 - In the past 30 years, how many trees have been planted every single year? And what is the number of trees planted by genus each year?

The dataframe provides the specific planting date of each tree. From the available data, the age of each tree is calculated first, and the number of trees planted per year is then plotted. Since precision is not a priority in this case, only the planting year is extracted and the age is calculated as of 2021.
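
To gauge how much precision the year-only shortcut gives up, the sketch below compares it with an age computed from the full planting date. The reference date of 7 November 2021 (the date of this report) and the three sample planting dates are assumptions chosen for illustration.

```python
import pandas as pd

planted = pd.Series(pd.to_datetime(["1989-10-27", "2003-04-01", "2019-06-03"]))
reference = pd.Timestamp("2021-11-07")  # assumed reference date

# Year-only age, as used in this report
age_year_only = 2021 - planted.dt.year

# Fractional age in years computed from the full date
age_exact = (reference - planted).dt.days / 365.25

# Largest absolute difference between the two definitions
max_gap = (age_year_only - age_exact).abs().max()

print(pd.DataFrame({"year_only": age_year_only, "exact": age_exact.round(2)}))
```

With a reference date inside 2021, the two definitions differ by less than one year for any planting date, which matters most for the youngest trees when the age is used as a divisor.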

In [8]:
# Extract year of planted and calculate age of trees
trees_df_yr = trees_df.assign(year=trees_df['date_planted'].dt.year)

# Calculate the age of trees till 2021.
trees_df_age = trees_df_yr.assign(age=(2021-trees_df_yr['year']))

# Plot distribution of number of trees planted by year
# It has been tested that if the dtype of column 'year' is specified as quantitative (Q), the x-axis labels are
# formatted with a thousands separator, e.g. 1,989. To avoid the ",", the dtype of 'year' is specified as nominal (N)
plot_2_title = alt.TitleParams(
    "Figure 2 Number of street trees planted each year",
     subtitle = "(Data available from 1989 to 2019)")

plot_2_year = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:N',title=None),
    alt.Y('count():Q',title='Number of trees planted')).properties(title=plot_2_title)

plot_2_year

Figure 2 indicates that the City of Vancouver's street tree planting peaked between 1995 and 2013. Within this period, the highest annual numbers of trees were planted in 1998 and 2013. Before 1995 and after 2014, the number of trees planted was relatively low, especially in 2016, when fewer than 50 new trees in the dataset were planted on public boulevards in Vancouver. Urban forestry is a systemic undertaking: how many trees are planted depends on a group of factors, such as the public budget and tree replacement plans driven by species distribution, insects, diseases, or environmental stress. The figure suggests that the City of Vancouver has maintained a dynamic public tree planting program that benefits the wellbeing of its residents.

Next, the number of trees planted each year is plotted again, this time with a dropdown selection by genus.

In [9]:
# Plot number of trees planted by year and add dropdown selection by genus

# Specify the subtitle color and bold it to draw attention
plot_3_title = alt.TitleParams(
    "Figure 3 Number of street trees planted each year by genus(from 1989 to 2019)",
    subtitle = "Dropdown selection is available by genus",
    subtitleColor='steelblue', subtitleFontWeight='bold')

genus = sorted(trees_df_age['genus_name'].unique())

dropdown_genus = alt.binding_select(name='Genus', options=genus)

select_genus = alt.selection_single(fields=['genus_name'], bind=dropdown_genus)

# Since objective is to see number of trees for each genus, for y axis, specify stack=False
plot_3_genus_year_bar = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:N',title=None),
    alt.Y('count():Q',stack=False,title='Number of trees planted per genus'),
    alt.Color('genus_name:N',title='Genus name')
).add_selection(select_genus).encode(
    opacity=alt.condition(select_genus, alt.value(0.9), alt.value(0.0))).properties(title=plot_3_title)

plot_3_genus_year_bar

Figure 3, with its dropdown selection, combines the required information in one plot and lets audiences efficiently explore the number of trees planted from 1989 to 2019 for each genus.

The answer to question 2 is a useful reference for members of the public who are interested in the tree planting history of Vancouver. For researchers, it also offers a first look at urban forestry and street tree replacement planning.

#### Q3 - Based on the growth rates of each genus, which genus is the fastest one within the Vancouver street trees dataset?

The insights from this question may not only confirm or verify some current scientific conclusions on tree growth patterns, but also add evidence for optimizing street tree selection in cities with an ecological environment similar to Vancouver's.

In [10]:
# Calculate growth rate in diameter, growth rate in height and save them into df

trees_df_rate = trees_df_age.assign(rate_dia=trees_df_age['diameter']/trees_df_age['age'],
                                    rate_height=trees_df_age['height_range_id']/trees_df_age['age']
                                   ).round(2)

# Plot two growth rates using scatter chart
# Add zooming and panning to solve the issue that the plot is saturated
# Add dropdown widget to select genus
plot_4_title = alt.TitleParams(
    "Figure 4 Trees growth rate in diameter & height per genus",
    subtitle = ["Dropdown selection is available by genus", "Zooming and Panning are available"],
    subtitleColor='steelblue', subtitleFontWeight='bold')

plot_growth_rates = alt.Chart(trees_df_rate).mark_point(size=5).encode(
    alt.X('rate_dia:Q',title="Growth rate in diameter (inch/yr)"),
    alt.Y('rate_height:Q',title="Growth rate in height (unit/yr)")
).interactive()

plot_4_growth_rates_genus = plot_growth_rates.add_selection(select_genus).encode(
    opacity=alt.condition(select_genus, alt.value(0.9), alt.value(0.0))).properties(title=plot_4_title,width=600)

plot_4_growth_rates_genus

Note that the growth rate in height is expressed in **unit/yr**. Here a "unit" is one step of `height_range_id`, which runs from 0 to 10 in 10-foot steps (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft).

With zooming, panning and the dropdown selection, audiences can check the two growth rates for the genus they are interested in.

**It must be noted that** in the original dataframe, the height of trees was collected as `height_range_id`. Audiences should be aware that the calculated growth rate in height is not based on an absolute height but on a binned, relative variable. Typically, if the number of observations for a particular genus is sufficient (further research is required to define "sufficient"), the result will be more reliable from a statistical perspective. If a genus has only a few observations in the dataframe, audiences should assess the results with more caution. Again, further research is required in future analysis.
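
Because `height_range_id` is a bin index rather than a measurement, any height growth rate is only known up to an interval. The sketch below assumes the 10-foot bins listed in the Dataset Description and uses two hypothetical helpers, `height_bounds_ft` and `height_rate_bounds`, to turn a bin index and an age into lower and upper bounds on the growth rate in feet per year.

```python
def height_bounds_ft(height_range_id):
    """Map a height_range_id to (lower, upper) bounds in feet.

    The upper bound is None for the open-ended top bin (10 = 100+ ft).
    """
    lower = height_range_id * 10
    upper = None if height_range_id >= 10 else lower + 10
    return lower, upper


def height_rate_bounds(height_range_id, age_years):
    """Bounds on the average height growth (ft/yr) implied by a bin index and an age."""
    lower, upper = height_bounds_ft(height_range_id)
    upper_rate = None if upper is None else upper / age_years
    return lower / age_years, upper_rate


# A tree in bin 2 (20-30 ft) that was planted 18 years ago
print(height_rate_bounds(2, 18))
```

The width of the resulting interval illustrates why the height growth rate in this report is best read as a relative indicator.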

To compare the growth rates of all genera effectively, a boxplot is created in addition to a bar chart that shows only the mean value. The boxplot includes a few key summary statistics and visualizes the distribution of the data. This combination provides more information to audiences, particularly researchers who may care about more than the mean.

In [11]:
# Plot average growth rates in diameter per genus

plot_5_title = alt.TitleParams(
    "Figure 5 Trees average growth rate in diameter per genus",
    subtitle = "TSUGA, PSEUDOTSUGA & POPULUS are top three fastest growing genera")

plot_ave_growth_rate_dia_bar = alt.Chart(trees_df_rate).mark_bar().encode(
    alt.X('mean(rate_dia)', title='Average growth rate in diameter (inch/yr)'),
    alt.Y('genus_name:N',title='Genus',sort='x'),
    alt.Color('count()',scale=alt.Scale(scheme='blues'),title='Number of trees')
).properties(width=130)

# Sort boxplot by average growth rate in diameter to keep consistent with bar chart
ave_growth_rate_dia_order = trees_df_rate.groupby('genus_name')['rate_dia'].mean().sort_values().index.tolist()

# To distinguish the boxplot from the bar chart, specify the color as green
plot_ave_growth_rate_dia_box = alt.Chart(trees_df_rate).mark_boxplot(color='green').encode(
    alt.X('rate_dia:Q', title='Growth rate in diameter (inch/yr)'),
    alt.Y('genus_name:N',title='Genus',sort=ave_growth_rate_dia_order)
).properties(width=310)

# Modify the domain of the axis’ scale for boxplot to make the plot more readable
# With clamp=True, values above 4 inch/year are pinned to the edge of the axis instead of extending it
plot_ave_growth_rate_dia_box_scale = plot_ave_growth_rate_dia_box.encode(alt.X('rate_dia:Q', 
                                                                               title='Growth rate in diameter (inch/yr)',
                                                                               scale=alt.Scale(domain=[0, 4],clamp=True)))

# Use top-level title configuration to locate title in the middle
plot_5_ave_growth_rates_dia = (plot_ave_growth_rate_dia_bar | plot_ave_growth_rate_dia_box_scale
                              ).properties(title=plot_5_title).configure_title(anchor='middle')

plot_5_ave_growth_rates_dia

In [12]:
# Plot average growth rates in height per genus

plot_6_title = alt.TitleParams(
    "Figure 6 Trees average growth rate in height per genus",
    subtitle = "TSUGA, POPULUS & PSEUDOTSUGA are top three fastest growing genera")

# To distinguish the plot from the above growth rates in diameter chart, change alternative color scheme
plot_ave_growth_rate_height_bar = alt.Chart(trees_df_rate).mark_bar().encode(
    alt.X('mean(rate_height)', title='Average growth rate in height (unit/yr)'),
    alt.Y('genus_name:N',title='Genus',sort='x'),
    alt.Color('count()',scale=alt.Scale(scheme='teals'),title='Number of trees')
).properties(width=130)

# Sort boxplot by average growth rate in height to keep consistent with bar chart
ave_growth_rate_height_order = trees_df_rate.groupby('genus_name')['rate_height'].mean().sort_values().index.tolist()

# To distinguish the boxplot from the bar chart, specify the color as grey
plot_ave_growth_rate_height_box = alt.Chart(trees_df_rate).mark_boxplot(color='grey').encode(
    alt.X('rate_height:Q', title='Growth rate in height (unit/yr)'),
    alt.Y('genus_name:N',title='Genus',sort=ave_growth_rate_height_order)
).properties(width=310)

# Modify the domain of the axis’ scale for boxplot to make the plot more readable
# With clamp=True, values above 0.6 unit/year are pinned to the edge of the axis instead of extending it
plot_ave_growth_rate_height_box_scale = plot_ave_growth_rate_height_box.encode(alt.X('rate_height:Q', 
                                                                               title='Growth rate in height (unit/yr)',
                                                                               scale=alt.Scale(domain=[0, 0.6],clamp=True)))

# Use top-level title configuration to locate title in the middle
plot_6_ave_growth_rates_height = (plot_ave_growth_rate_height_bar | plot_ave_growth_rate_height_box_scale
                                 ).properties(title=plot_6_title).configure_title(anchor='middle')

plot_6_ave_growth_rates_height

Note that the growth rate in height is expressed in unit/yr. Here a "unit" is one step of `height_range_id`, which runs from 0 to 10 in 10-foot steps (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft).

Based on Figures 5 and 6, TSUGA, POPULUS and PSEUDOTSUGA are identified as the three fastest growing genera. However, as the plots and Figure 1 indicate, the sample size of these three genera is very limited: the dataframe contains only 2, 3, and 8 trees (observations) of TSUGA, POPULUS and PSEUDOTSUGA, respectively. Since the sample size affects how much confidence can be placed in comparisons of growth rates among genera, further statistical analysis should be considered in the future to support this conclusion.
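
One simple way to keep very small samples from dominating the ranking is to report the count and the standard error next to each mean, and to rank only genera above a minimum sample size. The sketch below uses made-up rates, and the threshold `min_trees` is an arbitrary assumption rather than a recommended value.

```python
import pandas as pd

# Hypothetical growth rates (inch/yr), not taken from the dataset
toy = pd.DataFrame({
    "genus_name": ["TSUGA", "TSUGA", "ACER", "ACER", "ACER", "ACER",
                   "PRUNUS", "PRUNUS", "PRUNUS"],
    "rate_dia": [1.2, 0.8, 0.4, 0.5, 0.3, 0.4, 0.6, 0.5, 0.7],
})

# Count, mean and standard error of the mean for each genus
summary = toy.groupby("genus_name")["rate_dia"].agg(["count", "mean", "sem"])

min_trees = 3  # assumed cut-off for illustration only
ranked = summary[summary["count"] >= min_trees].sort_values("mean", ascending=False)

print(ranked)
```

In this toy example the genus with the highest mean is left out of the ranking because it has only two observations, which mirrors the caution raised above.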

#### Q4 - How many street trees are growing in each neighbourhood and what are the average age and average diameter of trees in each neighbourhood?

The last question may interest residents of Vancouver, as well as potential home buyers who want to choose a neighbourhood with the distribution and characteristics of its trees in mind.

In [13]:
# To make a base map of Vancouver by using the geojson url saved in url_geojson

url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'

# Wrap the GeoJSON URL with alt.Data() and read the 'features' property as JSON

data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))

# Then make base Vancouver Altair map using the data_geojson_remote object
# Use the identity projection with reflectY=True; without reflectY the map of Vancouver is upside down

base_map = alt.Chart(data_geojson_remote).mark_geoshape(
    color='white',stroke="grey").encode(
).project(type='identity', reflectY=True)

Perform calculation needed and create a new dataframe.

In [14]:
# Rename neighbourhood_name to name since that is what it is called in the GeoJSON file
trees_nbhd_num_df = pd.DataFrame(trees_df_age['neighbourhood_name'].value_counts()
                            ).reset_index().rename(columns={'index':'name','neighbourhood_name':'trees_number'})

trees_nbhd_age_df = trees_df_age.groupby('neighbourhood_name').mean().round(1).reset_index(
).rename(columns={'neighbourhood_name':'name','diameter':'ave_diameter','age':'ave_age'})[['name','ave_diameter','ave_age']]

trees_nbhd_df = trees_nbhd_num_df.merge(trees_nbhd_age_df, left_on='name', right_on='name')

trees_nbhd_df

Unnamed: 0,name,trees_number,ave_diameter,ave_age
0,Renfrew-Collingwood,1323,5.6,17.5
1,Hastings-Sunrise,1285,8.1,19.7
2,Kensington-Cedar Cottage,1169,6.5,18.7
3,Sunset,936,6.0,17.3
4,Victoria-Fraserview,908,5.5,17.7
5,Dunbar-Southlands,762,6.1,16.7
6,Marpole,689,5.6,17.2
7,Riley Park,684,5.8,17.2
8,Grandview-Woodland,632,6.6,18.0
9,Killarney,615,6.2,17.7


In [15]:
# Add tooltips and hover selection on map

hover = alt.selection_single(fields=['name'], on='mouseover')

plot_trees_nbhd = alt.Chart(data_geojson_remote).mark_geoshape().transform_lookup(
    lookup='properties.name',
    from_=alt.LookupData(trees_nbhd_df, 'name', ['trees_number','name'])).encode(
    alt.Color('trees_number:Q',title='Number of trees'),
    opacity=alt.condition(hover, alt.value(1), alt.value(0.1)),
    tooltip=['name:N',alt.Tooltip('trees_number:Q', title='Number of trees')]
).add_selection(hover).project(type='identity', reflectY=True)

plot_trees_nbhd_map = base_map + plot_trees_nbhd

# Plot ave_age per ave_diameter
# Scale the x and y axis by specifying zero=False

plot_age_dia = alt.Chart(trees_nbhd_df).mark_point().encode(
    alt.X('ave_diameter:Q',scale=alt.Scale(zero=False),title='Average diameter (inch)'),
    alt.Y('ave_age:Q',scale=alt.Scale(zero=False),title='Average age (yr)'),
    stroke=alt.condition(hover, alt.value('black'), alt.value('#ffffff')))

plot_7_title = alt.TitleParams(
    "Figure 7 Trees distribution per neighbourhood (number/age/diameter)",
    subtitle = ["Hover selection on map is available by neighbourhood", "Tooltips are available on map"],
    subtitleColor='steelblue', subtitleFontWeight='bold')

plot_7_nbhd_map = (plot_trees_nbhd_map & plot_age_dia
                  ).properties(title=plot_7_title).configure_title(anchor='middle')

plot_7_nbhd_map

Based on `trees_nbhd_df` and Figure 7, the average age of trees across neighbourhoods ranges from 16.3 to 19.7 years, and the average diameter ranges from 5.1 to 8.1 inches.

Because the map selection is linked to the scatter plot in Figure 7, audiences can hover over a neighbourhood on the map to quickly obtain the number of trees there and see how old and how large those trees are on average.

## DISCUSSIONS

The analysis set out to gain an overall view of Vancouver street trees in terms of their biological diversity, growth and distribution within neighbourhoods. Figures 1 to 7 and the related tables answer all four pre-defined questions. Among 68 genera and a total of 14,083 trees, Acer is the most popular genus in the dataset. This is not surprising, as the maple is Canada's official arboreal emblem. The historical data show that Vancouver has maintained a dynamic public tree planting program; between 1995 and 2013 in particular, the city added a large number of trees to its streets. The plots give audiences more detail about the yearly number of trees planted by genus, the growth rates by genus, and the distribution and selected characteristics of trees by neighbourhood. In addition, comparing the growth rates across the 68 genera suggests that TSUGA, POPULUS and PSEUDOTSUGA are the three fastest growing genera in the dataset, although this ranking rests on very few observations.

Beyond the four questions discussed, **further questions** could be considered in the future:

1. What is the genus distribution in each neighbourhood? This could also be shown on the Vancouver map. The answer would help residents learn about the distribution of fast-growing trees or shade trees in their area.
2. What is the average height range of trees in each neighbourhood? Some house buyers may prefer taller trees or shade trees in their neighbourhood.
3. Since genus is the highest level of tree classification in the dataframe, analysts may extend the analysis to the next levels (`species_name`, `common_name` or `cultivar_name`) to explore further patterns in the data.
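
As a possible starting point for the first two questions, a cross-tabulation and a grouped mean would give the genus counts and the average height range per neighbourhood. The following is a minimal sketch on made-up rows that reuse the column names of `trees_df`; the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows using the column names of trees_df
toy = pd.DataFrame({
    "neighbourhood_name": ["Kitsilano", "Kitsilano", "Sunset", "Sunset", "Sunset"],
    "genus_name": ["ACER", "PRUNUS", "ACER", "ACER", "TILIA"],
    "height_range_id": [4, 2, 1, 3, 2],
})

# Number of trees of each genus in each neighbourhood
genus_by_nbhd = pd.crosstab(toy["neighbourhood_name"], toy["genus_name"])

# Average height range (in height_range_id units) per neighbourhood
ave_height_by_nbhd = toy.groupby("neighbourhood_name")["height_range_id"].mean()

print(genus_by_nbhd)
print(ave_height_by_nbhd)
```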

The **assumptions and limitations of the analysis** should be highlighted for audiences, including:

1. The original data only includes the public trees on boulevards in the City of Vancouver. Park trees and private trees are not included.
2. More than half of the observations in the original dataset have been dropped because they lack a planting date. The analysis is therefore based on the remaining observations only, which may limit the conclusions or insights drawn from it.
3. In order to simplify the analysis, the calculation of the age of trees has not included the specific month and day.
4. The information related to the height of trees was collected as `height_range_id`, so audiences should be aware that the height figures in this report are not absolute measurements of height.
5. Due to the limited sample size, more research is required to verify the conclusion of the fastest growing trees in this report.

**Final notes**:
1. The top priority of this analysis is to demonstrate the learning outcomes from a Data Visualization perspective. The analyst (author) recognizes that, judged by a higher data analysis standard, the narrative of this report should be more cohesive, and the column analysis and data wrangling should be supported by more domain expertise.
2. More statistical tools could be applied to make the report more convincing. This will be considered for future work.

## DASHBOARD 1

In [16]:
# Resize plots to accommodate the combined panel
# Create new object to avoid overwriting previous plots
plot_1 = plot_1_genus.properties(height=1100,width=250)
plot_2 = plot_2_year.properties(height=300,width=350)
plot_3 = plot_3_genus_year_bar.properties(height=300,width=350)
plot_4 = plot_4_growth_rates_genus.properties(height=300,width=350)

plot_1 | (plot_2 & plot_3 & plot_4)

## DASHBOARD 2

In [17]:
# Since objects with "config" attribute cannot be used within VConcatChart, remove the previous "config"
# Create new object to avoid overwriting previous plots
plot_5 = (plot_ave_growth_rate_dia_bar | plot_ave_growth_rate_dia_box_scale).properties(title=plot_5_title)
plot_6 = (plot_ave_growth_rate_height_bar | plot_ave_growth_rate_height_box_scale).properties(title=plot_6_title)
plot_7 = (plot_trees_nbhd_map & plot_age_dia).properties(title=plot_7_title)

(plot_5 & plot_6) & plot_7

## REFERENCES

### Resources Used

Not all the work in this project is original. The following resources have been used as references.

* <font color='blue'>Data Source</font>

The Vancouver street trees dataset provided by UBC Data Visualization course. The dataset schema from <a href="https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id" target="_blank">City of Vancouver Open Data Portal - Street Trees</a>.

* <font color='blue'>Sample Report</font>

The sample reports, *Evolution of LEGO* EDA report and analysis report provided by UBC Data Visualization course.

* <font color='blue'>Report Writing</font>

Markdown formatting for the text cells in Jupyter Notebook, *Basic Syntax - The Markdown elements outlined in John Gruber's design document*, retrieved on November 1, 2021 from <a href="https://www.markdownguide.org/basic-syntax/" target="_blank">Markdown Guide</a>.

* <font color='blue'>Python and Data Vis Techniques</font>

Learned from UBC Data Visualization course, Programming in Python for Data Science course, Piazza group discussions, <a href="https://altair-viz.github.io/" target="_blank">Altair Documentation</a>, and other self-learning resources from the internet.

<h2><center>**End of the Report**</center></h2>