
# Exploratory Data Analysis of Vancouver Street Trees Dataset


<div style="text-align: right"> November 7, 2021 </div>

<div style="text-align: right"> Final Project EDA Report by Peng Zhang </div>

## Pre-analysis - Narrow down the "focus" and data wrangling

### Read and review original df

In [1]:
# Import libraries needed for this assignment

import altair as alt
import pandas as pd
import json

alt.data_transformers.enable("data_server")

DataTransformerRegistry.enable('data_server')

In [2]:
URL = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/vancouver_trees.csv"
trees_df_original = pd.read_csv(URL)

trees_df_original

Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,civic_number,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,W 13TH AV,MAPLE ST,PSEUDOPLATANUS,Kitsilano,,9.00,EVEN,ACER,N,1996,10,Y,13310,SYCAMORE MAPLE,4,2900,,N,49.259856,-123.150586
1,WALES ST,WALES ST,PLATANOIDES,Renfrew-Collingwood,2018-11-28,3.00,ODD,ACER,N,5291,7,Y,259084,PRINCETON GOLD MAPLE,1,5200,PRINCETON GOLD,N,49.236650,-123.051831
2,W BROADWAY,W BROADWAY,RUBRUM,Kitsilano,1996-04-19,14.00,EVEN,ACER,N,3618,C,Y,167986,KARPICK RED MAPLE,3,3600,KARPICK,N,49.264250,-123.184020
3,PENTICTON ST,PENTICTON ST,CALLERYANA,Renfrew-Collingwood,2006-03-06,3.75,EVEN,PYRUS,N,2502,5,Y,213386,CHANTICLEER PEAR,1,2500,CHANTICLEER,Y,49.261036,-123.052921
4,RHODES ST,RHODES ST,GLYPTOSTROBOIDES,Renfrew-Collingwood,2001-11-01,3.00,ODD,METASEQUOIA,N,5639,N,Y,189223,DAWN REDWOOD,2,5600,,N,49.233354,-123.050249
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,ROBSON ST,ROBSON ST,CAMPESTRE,West End,,7.00,ODD,ACER,N,1015,c,Y,122814,HEDGE MAPLE,2,1000,,N,49.283666,-123.123231
29996,OSLER ST,CONNAUGHT DRIVE,PLATANOIDES,Shaughnessy,2007-04-16,8.00,ODD,ACER,N,4690,10,Y,132211,NORWAY MAPLE,1,1000,,Y,49.243636,-123.129480
29997,BEATRICE ST,BEATRICE ST,CERASIFERA,Victoria-Fraserview,,17.30,EVEN,PRUNUS,N,6218,9,Y,59355,PISSARD PLUM,3,6200,ATROPURPUREUM,N,49.227406,-123.066936
29998,ANGUS DRIVE,ANGUS DRIVE,BILOBA,Shaughnessy,2006-02-17,4.00,ODD,GINKGO,N,1551,9,Y,207753,GINKGO OR MAIDENHAIR TREE,1,1500,,Y,49.254431,-123.140382


### Identify and drop irrelevant columns

In [3]:
# Glance at the original df

trees_df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   std_street          30000 non-null  object 
 1   on_street           30000 non-null  object 
 2   species_name        30000 non-null  object 
 3   neighbourhood_name  30000 non-null  object 
 4   date_planted        14085 non-null  object 
 5   diameter            30000 non-null  float64
 6   street_side_name    30000 non-null  object 
 7   genus_name          30000 non-null  object 
 8   assigned            30000 non-null  object 
 9   civic_number        30000 non-null  int64  
 10  plant_area          29722 non-null  object 
 11  curb                30000 non-null  object 
 12  tree_id             30000 non-null  int64  
 13  common_name         30000 non-null  object 
 14  height_range_id     30000 non-null  int64  
 15  on_street_block     30000 non-null  int64  
 16  cult

Based on the above data information and the dataset schema from <a href="https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id" target="_blank">Vancouver Street Trees</a>, the **columns** can be preliminarily identified as four **categories**:

1. Tree biological classifications and names, such as genus_name, species_name, common_name, cultivar_name

2. Tree growth-related characteristics, such as date_planted, diameter, height_range_id

3. Tree coordinates and areas, such as latitude, longitude, neighbourhood_name

4. Other specific location, orientation and identification information

**The intention of this analysis is to give an overall view of Vancouver street trees: their biological diversity, growth and distribution across neighbourhoods**. Therefore, the irrelevant columns under the fourth category will be dropped to make the analysis more efficient.

In addition, to narrow the focus to the highest level of tree classification available in the df, the columns species_name, common_name and cultivar_name will also be dropped, keeping only genus_name.

In [4]:
# Transform the column 'date_planted' to a datetime64 dtype and drop irrelevant columns
trees_df_dropped = pd.read_csv(URL,parse_dates = ['date_planted']
                                    ).drop(columns=['std_street',
                                                    'on_street',
                                                    'street_side_name',
                                                    'assigned',
                                                    'civic_number',
                                                    'plant_area',
                                                    'curb',
                                                    'on_street_block',
                                                    'tree_id',
                                                    'root_barrier',
                                                    'species_name',
                                                    'common_name',
                                                    'cultivar_name'])

trees_df_dropped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   neighbourhood_name  30000 non-null  object        
 1   date_planted        14085 non-null  datetime64[ns]
 2   diameter            30000 non-null  float64       
 3   genus_name          30000 non-null  object        
 4   height_range_id     30000 non-null  int64         
 5   latitude            30000 non-null  float64       
 6   longitude           30000 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(2)
memory usage: 1.6+ MB


### Deal with null values

Based on the df information, the **date_planted** column has **more than half of its values missing** (15,915 of 30,000). Since reviewing tree growth is one of the objectives of this analysis, observations without a planting date are considered uninformative and will be dropped from the df.

Before dropping such a large number of rows, it is important to check whether removing them could significantly affect the analysis. In this case, if some neighbourhoods or genera have no non-null planting dates at all, they would be excluded entirely from the later analysis once those rows are removed, which could make the remaining data unrepresentative.

In [5]:
# Use a quick repeated plot to visualize missing 'date_planted' values against each categorical variable
# Data types are stated explicitly in the encodings because the data is read via the data server URL

trees_df_dropped_nans = trees_df_dropped.assign(nan=trees_df_dropped['date_planted'].isna()).reset_index()

col_categorical = trees_df_dropped_nans.select_dtypes('object').columns.tolist()

plot_nan = alt.Chart(trees_df_dropped_nans).mark_rect(height=10).encode(
    alt.X('index:O'),
    alt.Y(alt.repeat(),type='nominal'),
    alt.Color('nan'),
    alt.Stroke('nan')).properties(width=400).repeat(col_categorical)

plot_nan

The above exploratory plot confirms that all 22 neighbourhoods have representative planting-date data.

Most genera also have representative planting-date data, with a few exceptions such as AILANTHUS, ALNUS and ARAUCARIA. However, the sample sizes of these genera are very small compared with those of the genera that do have planting dates.

Based on the above, the analyst is confident in narrowing the focus by dropping observations without a planting date.
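
The visual check above could be complemented by a tabular one. The following is a minimal sketch, not part of the original analysis: `toy_df` and its values are made up, and `planted_coverage` is a hypothetical helper. It counts, per group, how many rows have a planting date and lists the groups that would disappear entirely after the nulls are dropped.

```python
import pandas as pd

# Toy stand-in for trees_df_dropped; the values are made up for illustration.
toy_df = pd.DataFrame({
    "genus_name": ["ACER", "ACER", "ACER", "ALNUS", "ALNUS", "PRUNUS"],
    "date_planted": pd.to_datetime(
        ["2001-05-01", None, "2010-03-15", None, None, "1999-11-20"]
    ),
})


def planted_coverage(df, group_col, date_col="date_planted"):
    """Count total and dated rows per group, plus the share that is dated."""
    grouped = df.groupby(group_col)[date_col]
    # size() counts all rows in a group, count() only the non-null ones
    summary = pd.DataFrame({"total": grouped.size(), "dated": grouped.count()})
    summary["share_dated"] = summary["dated"] / summary["total"]
    return summary.sort_values("dated")


coverage = planted_coverage(toy_df, "genus_name")

# Groups that would vanish entirely after dropna(subset=['date_planted'])
lost_groups = coverage.index[coverage["dated"] == 0].tolist()

print(coverage)
print(lost_groups)
```

Applied to the real df, the same helper could be called once with `'genus_name'` and once with `'neighbourhood_name'` to put numbers on what the plot shows.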

In [6]:
# Further drop observations without value of date_planted

trees_df_dropped_dropna = trees_df_dropped.dropna(subset=['date_planted'])

trees_df_dropped_dropna.info()
# Verify that the number of rows of dropped df should be 14085.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14085 entries, 1 to 29998
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   neighbourhood_name  14085 non-null  object        
 1   date_planted        14085 non-null  datetime64[ns]
 2   diameter            14085 non-null  float64       
 3   genus_name          14085 non-null  object        
 4   height_range_id     14085 non-null  int64         
 5   latitude            14085 non-null  float64       
 6   longitude           14085 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(2)
memory usage: 880.3+ KB


### Examine the df in more detail

In [7]:
# Examine the remaining columns in more detail

trees_df_dropped_dropna.describe(include='all',datetime_is_numeric=True)

Unnamed: 0,neighbourhood_name,date_planted,diameter,genus_name,height_range_id,latitude,longitude
count,14085,14085,14085.0,14085,14085.0,14085.0,14085.0
unique,22,,,68,,,
top,Renfrew-Collingwood,,,ACER,,,
freq,1323,,,3970,,,
mean,,2003-09-20 17:40:42.172523904,6.352586,,1.822932,49.24657,-123.10016
min,,1989-10-27 00:00:00,0.0,,0.0,49.200732,-123.22344
25%,,1997-12-15 00:00:00,3.0,,1.0,49.229239,-123.136819
50%,,2003-04-01 00:00:00,5.0,,2.0,49.24641,-123.0968
75%,,2009-11-13 00:00:00,8.0,,2.0,49.263314,-123.057646
max,,2019-06-03 00:00:00,317.0,,9.0,49.293881,-123.018258


Note that the **minimum values of diameter and height_range_id** are both **zero**. For height_range_id, 0 represents a height range of 0 to 10 ft, so it is a valid value. Diameter, however, is measured at breast height and should not be 0. Observations with a diameter of 0 are therefore treated as invalid and removed from the df.

In [8]:
# Index labels of the rows with an invalid diameter of 0
zero_diameter_index = trees_df_dropped_dropna[trees_df_dropped_dropna['diameter'] == 0].index

# To simplify the object names afterwards, make a copy of the df named trees_df
trees_df = trees_df_dropped_dropna.copy()

trees_df.drop(zero_diameter_index, inplace=True)

trees_df.describe(include='all',datetime_is_numeric=True)

Unnamed: 0,neighbourhood_name,date_planted,diameter,genus_name,height_range_id,latitude,longitude
count,14083,14083,14083.0,14083,14083.0,14083.0,14083.0
unique,22,,,68,,,
top,Renfrew-Collingwood,,,ACER,,,
freq,1323,,,3970,,,
mean,,2003-09-20 23:57:38.893701504,6.353489,,1.822978,49.246574,-123.100166
min,,1989-10-27 00:00:00,0.5,,0.0,49.200732,-123.22344
25%,,1997-12-15 12:00:00,3.0,,1.0,49.229246,-123.136823
50%,,2003-04-01 00:00:00,5.0,,2.0,49.24641,-123.096806
75%,,2009-11-13 00:00:00,8.0,,2.0,49.263317,-123.057649
max,,2019-06-03 00:00:00,317.0,,9.0,49.293881,-123.018258


After data wrangling, the target df has 14,083 observations and 7 columns. The 14,083 observations (street trees) are classified into 68 genera and grow in 22 neighbourhoods in Vancouver.

## Start EDA on cleaned df

### Distribution of number of trees planted by genus

Biological diversity is one of the interests of this analysis. The **initial question** is how the street trees planted in Vancouver are distributed by genus, and which genus is the most common in the df. From the df description, there are 68 distinct genera among the 14,083 trees. A quick guess would be Acer, the genus commonly known as maples: a maple leaf is on the Canadian flag, and the maple is a national symbol of Canada.

In [9]:
# Visualize the distribution of genus for the whole df
plot_1_genus = alt.Chart(trees_df).mark_bar().encode(
    alt.X('count():Q',title='Number of Trees'),
    alt.Y('genus_name:N',title='Genus',sort='x')
)

# Add text annotation for the number of trees for each genus
text_1_genus = plot_1_genus.mark_text(align='left',dx=2).encode(text='count():Q')

plot_1_genus = plot_1_genus + text_1_genus

plot_1_genus

The above bar chart confirms that Acer is the most common genus in the df. The number of Acer trees is more than double that of Prunus, the second most common genus among Vancouver street trees. All the remaining genera have fewer than 1,000 trees each in the df, and 47 of the 68 genera have fewer than 100 trees.

The **plot_1_genus** will be included in the analysis report to answer the **1st question**.

### Distribution of number of trees planted by year

The next interest is the history of street tree planting in Vancouver. The df provides the specific planting date of each tree, from which the current age of the trees can be calculated. Since precision is not a priority here, only the planting year is extracted and used to calculate the age as of 2021.

In [10]:
# Extract year of planted and calculate age of trees
trees_df_yr = trees_df.assign(year=trees_df['date_planted'].dt.year)

# Calculate the age of trees till 2021.
trees_df_age = trees_df_yr.assign(age=(2021-trees_df_yr['year']))

trees_df_age

Unnamed: 0,neighbourhood_name,date_planted,diameter,genus_name,height_range_id,latitude,longitude,year,age
1,Renfrew-Collingwood,2018-11-28,3.00,ACER,1,49.236650,-123.051831,2018,3
2,Kitsilano,1996-04-19,14.00,ACER,3,49.264250,-123.184020,1996,25
3,Renfrew-Collingwood,2006-03-06,3.75,PYRUS,1,49.261036,-123.052921,2006,15
4,Renfrew-Collingwood,2001-11-01,3.00,METASEQUOIA,2,49.233354,-123.050249,2001,20
6,Kitsilano,1994-12-12,7.00,GLEDITSIA,2,49.267281,-123.149326,1994,27
...,...,...,...,...,...,...,...,...,...
29979,West End,1997-06-17,8.00,GLEDITSIA,2,49.282286,-123.132499,1997,24
29981,Victoria-Fraserview,1990-03-29,11.00,ACER,2,49.227675,-123.061773,1990,31
29989,Oakridge,2013-02-19,3.00,STEWARTIA,1,49.228650,-123.134332,2013,8
29996,Shaughnessy,2007-04-16,8.00,ACER,1,49.243636,-123.129480,2007,14


To plot the number of trees planted in each year from 1989 to 2019, a bar chart is preferred because it clearly shows the number of trees against the axis and reveals the trend over the period.

In [11]:
# Plot distribution of number of trees planted by year

# If 'year' is encoded as quantitative (Q), the x-axis labels are formatted with a thousands
# separator, i.e. 1,989. To avoid the ",", 'year' is encoded as nominal (N) instead
plot_2_year = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:N'),
    alt.Y('count():Q'))

plot_2_year

The above bar chart indicates a peak period of street tree planting by the City of Vancouver from 1995 to 2013. Within this period, the highest single-year totals occurred in 1998 and 2013. Before 1995 and after 2014, the number of trees planted was relatively low, especially in 2016, when fewer than 50 new trees in the df were planted on public boulevards. Urban forestry is a systemic undertaking: the number of trees planted each year depends on many factors, such as the public budget and tree replacement plans driven by species distribution, insects, diseases or environmental stress. The data also show that the City of Vancouver has developed and maintained a robust public tree database, which benefits the wellbeing of its residents.

### Distribution of number of trees planted by year by genus

From a scientific perspective, it is meaningful to know how many trees of each genus were planted during the period. Three types of plot are tried below, and the best one is selected for the analysis report.

**Option 1** - Creating subplots for each genus via faceting

In [12]:
# Create subplots for each genus planted by year

# If 'year' is encoded as nominal (N), the subplots produce no output; to be investigated in the future
# With 'year' encoded as quantitative (Q), the subplots are generated normally
# Since this is an EDA plot, the thousands separator in the x-axis labels (e.g. 1,989) is not a big concern
plot_3_genus_year_facet = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:Q'),
    alt.Y('count():Q')
).facet('genus_name:N').resolve_scale(y='independent')

plot_3_genus_year_facet

**Option 2** - Scatter chart

In [13]:
# Plot genus of trees vs year planted. Use color & size channel to vis quantity of trees

# Sort genera by frequency (ascending); the index of value_counts() holds the genus names
order_genus = trees_df_age['genus_name'].value_counts(ascending=True).index.tolist()

plot_4_genus_year_circle = alt.Chart(trees_df_age).mark_circle().encode(
    alt.X('year:N'),
    alt.Y('genus_name:N',sort=order_genus),
    alt.Color('count()',scale=alt.Scale(scheme='tealblues')),
    alt.Size('count()'))

plot_4_genus_year_circle

**Option 3** - Bar chart with selection feature - **dropdown widget**

In [14]:
# Plot number of trees planted by year and add dropdown selection by genus

genus = sorted(trees_df_age['genus_name'].unique())

dropdown_genus = alt.binding_select(name='Genus', options=genus)

select_genus = alt.selection_single(fields=['genus_name'], bind=dropdown_genus)

# Since objective is to see number of trees for each genus, for y axis, specify stack=False
plot_5_genus_year_bar = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:N'),
    alt.Y('count():Q',stack=False),
    alt.Color('genus_name:N')
).add_selection(select_genus).encode(
    opacity=alt.condition(select_genus, alt.value(0.9), alt.value(0.0)))

plot_5_genus_year_bar

Comparing the three options above, the analyst picks the third one, the bar chart with a dropdown widget, for the analysis report. It combines the required information in one plot and lets the audience efficiently explore the number of trees planted from 1989 to 2019 for each genus.

At this point, the **2nd question** can be defined as: over the past 30 years, how many trees were planted in each year, and how many of them belong to each genus? The answers would be a valuable reference for members of the public who are interested in the history of tree planting in Vancouver. For researchers, they also offer first-hand insight into urban forestry and street tree replacement planning.

The **plot_2_year** together with **plot_5_genus_year_bar** will be utilized in the analysis report for the **2nd question**.

### Calculate & Plot trees growth rates

The df provides information on 14,083 street trees in 68 genera. Botanists and urban greening workers, in particular, could use this open data in research or projects related to tree growth. Therefore, the **3rd question** focuses on the growth rate of each genus, and on which genus grows fastest within the Vancouver street trees dataset.

The insights from this question could not only confirm or challenge existing scientific conclusions on tree growth patterns, but also provide evidence for optimizing street tree selection in cities with an ecological environment similar to Vancouver's.

In [15]:
# Calculate growth rate in diameter, growth rate in height and save them into df

trees_df_rate = trees_df_age.assign(rate_dia=trees_df_age['diameter']/trees_df_age['age'],
                                    rate_height=trees_df_age['height_range_id']/trees_df_age['age']
                                   ).round(2)
trees_df_rate.describe()

Unnamed: 0,diameter,height_range_id,latitude,longitude,year,age,rate_dia,rate_height
count,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0
mean,6.353489,1.822978,49.246638,-123.100185,2003.312433,17.687567,0.383389,0.112437
std,5.273568,0.98352,0.0215,0.04918,7.126035,7.126035,0.276021,0.067645
min,0.5,0.0,49.2,-123.22,1989.0,2.0,0.02,0.0
25%,3.0,1.0,49.23,-123.14,1997.0,12.0,0.23,0.08
50%,5.0,2.0,49.25,-123.1,2003.0,18.0,0.33,0.1
75%,8.0,2.0,49.26,-123.06,2009.0,24.0,0.45,0.12
max,317.0,9.0,49.29,-123.02,2019.0,32.0,11.32,2.0


**It must be noted that** in the original df, tree height is recorded as **height_range_id**, a code in which each step covers 10 feet (i.e. 0 = 0-10 ft, 1 = 10-20 ft). The calculated growth rate in height is therefore not an absolute height per year but a relative measure in range steps per year. In general, the more observations a genus has (further research is required to define how many are "sufficient"), the more reliable its result is statistically. For genera with few observations in the df, the audience should assess the results with more care. Again, further research is required in future analysis.
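
To make the relative nature of the height figures concrete, the sketch below converts height_range_id into approximate feet. It is an illustration only: `toy_trees` and its values are made up, and taking the midpoint of each 10 ft range as the height of a tree is an assumption of this sketch, not something the dataset provides.

```python
import pandas as pd

FEET_PER_STEP = 10  # each height_range_id step spans 10 ft, per the text above

# Toy stand-in for trees_df_age; the values are made up for illustration.
toy_trees = pd.DataFrame({
    "genus_name": ["ACER", "PRUNUS", "TSUGA"],
    "height_range_id": [0, 2, 4],
    "age": [5, 20, 10],
})

toy_trees = toy_trees.assign(
    # Lower bound of each range, in feet
    height_ft_low=toy_trees["height_range_id"] * FEET_PER_STEP,
    # Midpoint of each range, in feet (assumed representative height)
    height_ft_mid=toy_trees["height_range_id"] * FEET_PER_STEP + FEET_PER_STEP / 2,
)

# Approximate growth in feet per year, based on the range midpoint
toy_trees["rate_height_ft"] = (toy_trees["height_ft_mid"] / toy_trees["age"]).round(2)

print(toy_trees)
```

Even after this conversion the true height can lie anywhere within the 10 ft range, so the resulting rates remain approximate.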

In [16]:
# Plot two growth rates using scatter chart
# Add zooming and panning to solve the issue that the plot is saturated
# Add dropdown widget to select genus

plot_growth_rates = alt.Chart(trees_df_rate).mark_point(size=5).encode(
    alt.X('rate_dia:Q',title="Growth rate in diameter"),
    alt.Y('rate_height:Q',title="Growth rate in height")
).interactive()

plot_6_growth_rates_genus = plot_growth_rates.add_selection(select_genus).encode(
    opacity=alt.condition(select_genus, alt.value(0.9), alt.value(0.0)))

plot_6_growth_rates_genus

By adding zooming, panning and dropdown widget selection features, audiences can effectively check the two growth rates based on their interested genus.

**Tooltips** were originally meant to be added to the above plot so that audiences could read the growth rates when hovering over points. However, once the points are filtered by the dropdown widget, the tooltip may show misleading values. The opacity condition in the code only hides the points of the other genera; they are still present in the plot. When two points overlap or are too close to be distinguished by the mouse, hovering over a visible point can display the values of a hidden point from another genus. For example, with Acer selected, moving the pointer over a point that visually belongs to Acer may bring up a tooltip for a hidden point of another genus. The issue is most likely in areas where points overlap or lie very close together. Therefore, the analyst has decided not to apply tooltips at this time and leaves this as a future technical question on Altair plotting.

The above scatter plot shows the distribution of growth rates for each genus, but it is not effective for comparing rates across all genera. So the average growth rate of each genus will be calculated, and a bar chart will list the genera sorted by their average rate.

Alongside the bar chart, which shows only the mean, a boxplot is created to show key summary statistics and the distribution of the data. This combination gives audiences more information, particularly researchers who may care about more than a mean value.

In [17]:
# Plot average growth rates in diameter per genus

plot_ave_growth_rate_dia_bar = alt.Chart(trees_df_rate).mark_bar().encode(
    alt.X('mean(rate_dia)', title='Average growth rate - diameter'),
    alt.Y('genus_name:N',title='Genus',sort='x'),
    alt.Color('count()',scale=alt.Scale(scheme='blues'))
).properties(width=150)

# Sort boxplot by average growth rate in diameter to keep consistent with bar chart
ave_growth_rate_dia_order = trees_df_rate.groupby('genus_name')['rate_dia'].mean().sort_values().index.tolist()

# To distinguish the boxplot from the bar chart, specify the color as green
plot_ave_growth_rate_dia_box = alt.Chart(trees_df_rate).mark_boxplot(color='green').encode(
    alt.X('rate_dia:Q', title='Growth rate - diameter'),
    alt.Y('genus_name:N',title='Genus',sort=ave_growth_rate_dia_order)
).properties(width=350)

plot_7_ave_growth_rates_dia = plot_ave_growth_rate_dia_bar | plot_ave_growth_rate_dia_box

plot_7_ave_growth_rates_dia

In [18]:
# Restrict the domain of the x-axis scale of the boxplot to make the plot more readable
# With clamp=True, values above 4 inch/year are drawn at the edge of the domain rather than removed

plot_ave_growth_rate_dia_box_scale = plot_ave_growth_rate_dia_box.encode(alt.X('rate_dia:Q', 
                                                                               title='Growth rate - diameter',
                                                                               scale=alt.Scale(domain=[0, 4],clamp=True)))

plot_7_ave_growth_rates_dia = plot_ave_growth_rate_dia_bar | plot_ave_growth_rate_dia_box_scale

plot_7_ave_growth_rates_dia

In [19]:
# Plot average growth rates in height per genus

# To distinguish the plot from the above growth rates in diameter chart, change alternative color scheme
plot_ave_growth_rate_height_bar = alt.Chart(trees_df_rate).mark_bar().encode(
    alt.X('mean(rate_height)', title='Average growth rate - height'),
    alt.Y('genus_name:N',title='Genus',sort='x'),
    alt.Color('count()',scale=alt.Scale(scheme='teals'))
).properties(width=150)

# Sort boxplot by average growth rate in height to keep consistent with bar chart
ave_growth_rate_height_order = trees_df_rate.groupby('genus_name')['rate_height'].mean().sort_values().index.tolist()

# To distinguish the boxplot from the bar chart, specify the color as grey
plot_ave_growth_rate_height_box = alt.Chart(trees_df_rate).mark_boxplot(color='grey').encode(
    alt.X('rate_height:Q', title='Growth rate - height'),
    alt.Y('genus_name:N',title='Genus',sort=ave_growth_rate_height_order)
).properties(width=350)

plot_8_ave_growth_rates_height = plot_ave_growth_rate_height_bar | plot_ave_growth_rate_height_box

plot_8_ave_growth_rates_height

In [20]:
# Restrict the domain of the x-axis scale of the boxplot to make the plot more readable
# With clamp=True, values above 0.6 unit/year are drawn at the edge of the domain rather than removed

plot_ave_growth_rate_height_box_scale = plot_ave_growth_rate_height_box.encode(alt.X('rate_height:Q', 
                                                                               title='Growth rate - height',
                                                                               scale=alt.Scale(domain=[0, 0.6],clamp=True)))

plot_8_ave_growth_rates_height = plot_ave_growth_rate_height_bar | plot_ave_growth_rate_height_box_scale

plot_8_ave_growth_rates_height

Based on plots 7 and 8 above, TSUGA, POPULUS and PSEUDOTSUGA are identified as the three fastest-growing genera. However, as the plots and plot 1 indicate, the sample sizes of these three genera are very small: the df contains only 2, 3 and 8 trees of TSUGA, POPULUS and PSEUDOTSUGA, respectively. Since sample size affects the confidence with which growth rates can be compared among genera, **a further statistical analysis should be considered in the future to justify the conclusion in this case**.
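
One simple way to keep small samples from dominating the ranking is to report the number of trees next to each average and to rank only the genera above a minimum count. The sketch below illustrates the idea; `toy_rates` and its values are made up, and the `MIN_TREES` cut-off is hypothetical rather than a statistically justified threshold.

```python
import pandas as pd

MIN_TREES = 3  # hypothetical cut-off, chosen only for this toy example

# Toy stand-in for trees_df_rate; the values are made up for illustration.
toy_rates = pd.DataFrame({
    "genus_name": ["ACER"] * 4 + ["PRUNUS"] * 3 + ["TSUGA"] * 2,
    "rate_dia": [0.30, 0.35, 0.40, 0.35, 0.50, 0.45, 0.55, 1.20, 1.40],
})

# Sample size, mean and median growth rate per genus, fastest first
genus_summary = (
    toy_rates.groupby("genus_name")["rate_dia"]
    .agg(n_trees="count", mean_rate="mean", median_rate="median")
    .sort_values("mean_rate", ascending=False)
)

# Keep only the genera with enough observations to be ranked
reliable = genus_summary[genus_summary["n_trees"] >= MIN_TREES]

print(genus_summary)
print(reliable)
```

In the toy data the genus with the highest mean has only two trees and drops out of the filtered ranking, which mirrors the concern raised above about TSUGA and POPULUS.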

To answer the **3rd question**, the combination of **plot_6_growth_rates_genus**, **plot_7_ave_growth_rates_dia** and **plot_8_ave_growth_rates_height** will be included in the formal analysis report.

### Plot trees distribution on map

The **4th question** (the last one) is how many street trees grow in each neighbourhood, and what the average age and average diameter of the trees in each neighbourhood are. The answer may interest Vancouver residents, as well as potential home buyers who consider the distribution and characteristics of trees when choosing a neighbourhood.

#### Make base map of Vancouver

In [21]:
# To make a base map of Vancouver by using the geojson url saved in url_geojson

url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'

# Wrap the GeoJSON URL in alt.Data(), reading the 'features' property of the JSON file

data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))

# Then make base Vancouver Altair map using the data_geojson_remote object
# Use an identity type and need to reflectY=True. Without this second argument the map of Vancouver is upside down

base_map = alt.Chart(data_geojson_remote).mark_geoshape(
    color='white',stroke="grey").encode(
).project(type='identity', reflectY=True)

base_map

#### Plot the number of trees for each neighbourhood on map

Perform calculation needed and create a new df.

In [22]:
# Name the neighbourhood column 'name' since that is what it is called in the geojson
# rename_axis/reset_index(name=...) avoid relying on the default column names of value_counts()
trees_nbhd_num_df = trees_df_age['neighbourhood_name'].value_counts(
                            ).rename_axis('name').reset_index(name='trees_number')

# Average only the two numeric columns that are needed
trees_nbhd_age_df = trees_df_age.groupby('neighbourhood_name')[['diameter','age']].mean().round(1).reset_index(
).rename(columns={'neighbourhood_name':'name','diameter':'ave_diameter','age':'ave_age'})

trees_nbhd_df = trees_nbhd_num_df.merge(trees_nbhd_age_df, on='name')

trees_nbhd_df

Unnamed: 0,name,trees_number,ave_diameter,ave_age
0,Renfrew-Collingwood,1323,5.6,17.5
1,Hastings-Sunrise,1285,8.1,19.7
2,Kensington-Cedar Cottage,1169,6.5,18.7
3,Sunset,936,6.0,17.3
4,Victoria-Fraserview,908,5.5,17.7
5,Dunbar-Southlands,762,6.1,16.7
6,Marpole,689,5.6,17.2
7,Riley Park,684,5.8,17.2
8,Grandview-Woodland,632,6.6,18.0
9,Killarney,615,6.2,17.7


Add tooltips and hover selection on map.

In [23]:
hover = alt.selection_single(fields=['name'], on='mouseover')

plot_trees_nbhd = alt.Chart(data_geojson_remote).mark_geoshape().transform_lookup(
    lookup='properties.name',
    from_=alt.LookupData(trees_nbhd_df, 'name', ['trees_number','name'])).encode(
    color='trees_number:Q',
    opacity=alt.condition(hover, alt.value(1), alt.value(0.1)),
    tooltip=['name:N',alt.Tooltip('trees_number:Q', title='Number of trees')]
).add_selection(hover).project(type='identity', reflectY=True)

plot_trees_nbhd_map = base_map + plot_trees_nbhd

plot_trees_nbhd_map

#### Make scatter plot for ave_age vs ave_diameter for each neighbourhood
#### Link map selection to scatter plot

In [24]:
# Plot ave_age per ave_diameter
# Scale the x and y axis by specifying zero=False

plot_age_dia = alt.Chart(trees_nbhd_df).mark_point().encode(
    alt.X('ave_diameter:Q',scale=alt.Scale(zero=False)),
    alt.Y('ave_age:Q',scale=alt.Scale(zero=False)),
    stroke=alt.condition(hover, alt.value('black'), alt.value('#ffffff')))

plot_9_nbhd_map = plot_trees_nbhd_map & plot_age_dia

plot_9_nbhd_map

By linking the map selection to the scatter plot, audiences can hover over a neighbourhood on the map to quickly see its number of trees and, on average, how old and how big its trees are. So **plot_9_nbhd_map** will be used to answer the **4th question** in the analysis report.

#### Further dropping irrelevant columns

Since no plot requires the tree coordinates, the two columns **latitude** and **longitude** could also be dropped during data wrangling.

## To close the report

At this point, four questions have been defined and a set of plots has been selected to answer them. The analysis of the Vancouver street trees dataset stops here. However, at the end of the analysis report, **further questions** will be listed for future consideration:

1. What is the genus distribution in each neighbourhood? This could also be shown on the Vancouver map. The answer would tell residents how fast-growing trees or shade trees are distributed in their area.
2. What is the average height range of trees in each neighbourhood? Some house buyers may prefer taller trees or shade trees in their neighbourhood.
3. Since genus is the highest level of tree classification in the df, the analysis could be extended to the next levels (species, common names or cultivar names) to explore further patterns in the data.

In the analysis report, the **assumptions and limitations of the analysis** should be highlighted for the audience, including:

1. The original data include only public trees on boulevards in the City of Vancouver. Park trees and private trees are not included.
2. More than half of the observations in the original dataset have been dropped because they lack a planting date. The analysis is therefore based on the remaining observations only, which may limit the conclusions or insights drawn from it.
3. To simplify the analysis, the age of trees is calculated from the planting year only, ignoring the month and day.
4. Tree height was collected as height_range_id, so audiences should be aware that the height figures in this report are not absolute measurements of height.
5. Due to the limited sample sizes, more research is required to verify the conclusion about the fastest-growing genera in this report.
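
If the third limitation needed to be relaxed, the age could be computed from the full planting date instead of the year alone. The sketch below is illustrative only: `toy_dates` is made up, and using the report date as the reference date and 365.25 days per year are assumptions of this sketch.

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2021-11-07")  # assumed: the date of this report
DAYS_PER_YEAR = 365.25                       # assumed average year length

# Toy stand-in for trees_df; the dates are made up for illustration.
toy_dates = pd.DataFrame({
    "date_planted": pd.to_datetime(["2018-11-28", "1996-04-19", "2006-03-06"]),
})

toy_dates = toy_dates.assign(
    # Year-only age, as used in the report
    age_year_only=2021 - toy_dates["date_planted"].dt.year,
    # Age from the full planting date, in fractional years
    age_exact=(
        (REFERENCE_DATE - toy_dates["date_planted"]).dt.days / DAYS_PER_YEAR
    ).round(2),
)

print(toy_dates)
```

The two ages can differ by up to about a year, which matters most for young trees, where a one-year difference changes the calculated growth rate considerably.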

**Final important notes**: 
1. The top priority of this analysis is to demonstrate learning outcomes from a data visualization perspective. The analyst (author) recognizes that, judged by a higher standard of data analysis, the narrative of this report should be more cohesive, and the column analysis and data wrangling should be supported by more domain expertise.

2. More statistical tools could be applied to make the report more convincing. This will be considered for future work.

<h2><center>**End of EDA Report**</center></h2>