# Exploratory Data Visualization of Vancouver Trees Dataset

---

April 27, 2025   
Nadim Khan

## Introduction

This project presents an exploratory data visualization report based on a subset of the [Vancouver Street Trees dataset](https://opendata.vancouver.ca/explore/dataset/public-trees/information/?disjunctive.neighbourhood_name&disjunctive.on_street&disjunctive.species_name&disjunctive.common_name), guided by the questions laid out below.

### Question(s) of Interest

1. What are the most common tree species around Vancouver?
2. Is there any correlation between the diameter and height of the tree?
3. How does the distribution of tree species vary across neighbourhoods?
4. How has the total number of trees planted changed over the years?

## Analysis

### Data Import

In [1]:
# Importing the libraries needed for the analysis

import pandas as pd
import numpy as np
import altair as alt

In [2]:
# Read in the required dataset

trees_df = pd.read_csv('https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv')

trees_df.head()

Unnamed: 0.1,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,10747,W 20TH AV,W 20TH AV,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,N,...,15,Y,21421,NORWAY MAPLE,4,0,,N,49.252711,-123.106323
1,12573,W 18TH AV,W 18TH AV,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,N,...,7,Y,129645,CHANTICLEER PEAR,2,2300,CHANTICLEER,N,49.25635,-123.158709
2,29676,ROSS ST,ROSS ST,NIGRA,Sunset,,12.0,ODD,PINUS,N,...,7,Y,154675,AUSTRIAN PINE,4,7800,,N,49.213486,-123.083254
3,8856,DOMAN ST,DOMAN ST,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,N,...,7,Y,180803,AUTUMN APPLAUSE ASH,4,6900,AUTUMN APPLAUSE,N,49.220839,-123.036721
4,21098,EAST BOULEVARD,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,,15.5,ODD,AESCULUS,Y,...,N,Y,74364,COMMON HORSECHESTNUT,4,5200,,N,49.238514,-123.154958


### Dataset Description & Review of Data


In [3]:
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          5000 non-null   int64  
 1   std_street          5000 non-null   object 
 2   on_street           5000 non-null   object 
 3   species_name        5000 non-null   object 
 4   neighbourhood_name  5000 non-null   object 
 5   date_planted        2363 non-null   object 
 6   diameter            5000 non-null   float64
 7   street_side_name    5000 non-null   object 
 8   genus_name          5000 non-null   object 
 9   assigned            5000 non-null   object 
 10  civic_number        5000 non-null   int64  
 11  plant_area          4950 non-null   object 
 12  curb                5000 non-null   object 
 13  tree_id             5000 non-null   int64  
 14  common_name         5000 non-null   object 
 15  height_range_id     5000 non-null   int64  
 16  on_str

#### Dataset Observations

The `trees_df` dataset consists of $5000$ rows and $21$ columns. Each tree ID is associated with detailed information about its location, size and name.
- Although the `date_planted` column has a significant number of null values (only $2363$ of $5000$ entries are non-null), we will retain it because it is relevant to question 4. Its datatype needs to be converted to `datetime` format.
- The `Unnamed: 0` and `cultivar_name` columns will be dropped: the former is an index column imported from the original dataset, and the latter has too many missing values.
- Columns associated with location information, excluding `neighbourhood_name`, `latitude` and `longitude`, will be dropped as they are redundant.
- Additionally, the `assigned`, `curb`, `plant_area` and `root_barrier` columns will be dropped as well, since they are irrelevant to our analysis.
- The remaining columns are relevant to the questions and will be retained for further exploration.
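
As a rough illustration of how the missing-value observations above could be checked numerically, the sketch below computes the share of nulls per column. The `toy_df` frame and its values are made up for illustration and only borrow a few column names from the dataset.

```python
import pandas as pd

# Toy stand-in for trees_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "tree_id": [1, 2, 3, 4],
    "date_planted": ["2000-02-23", None, None, "1999-11-12"],
    "cultivar_name": [None, "CHANTICLEER", None, None],
    "diameter": [28.5, 6.0, 12.0, 11.0],
})

# isna() marks missing cells; the column-wise mean is the share of nulls.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `trees_df` would give a single table from which candidates for dropping can be read off.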

### Data Wrangling

Let's start by dropping the columns identified above as irrelevant to the analysis, rearranging the remaining columns and fixing the datatypes.

In [4]:
# Drop and reorder the columns using iloc

trees_df = trees_df.iloc[:, [13, 8, 3, 14, 4, 15, 6, 5, 19, 20]]

# Preview the changes

trees_df.head(10)

Unnamed: 0,tree_id,genus_name,species_name,common_name,neighbourhood_name,height_range_id,diameter,date_planted,latitude,longitude
0,21421,ACER,PLATANOIDES,NORWAY MAPLE,Riley Park,4,28.5,2000-02-23,49.252711,-123.106323
1,129645,PYRUS,CALLERYANA,CHANTICLEER PEAR,Arbutus-Ridge,2,6.0,1992-02-04,49.25635,-123.158709
2,154675,PINUS,NIGRA,AUSTRIAN PINE,Sunset,4,12.0,,49.213486,-123.083254
3,180803,FRAXINUS,AMERICANA,AUTUMN APPLAUSE ASH,Killarney,4,11.0,1999-11-12,49.220839,-123.036721
4,74364,AESCULUS,HIPPOCASTANUM,COMMON HORSECHESTNUT,Shaughnessy,4,15.5,,49.238514,-123.154958
5,233622,PARROTIA,PERSICA,VANESSA PERSIAN IRONWOOD,West End,1,3.0,2012-04-05,49.281906,-123.133076
6,105171,ACER,CAMPESTRE,HEDGE MAPLE,Victoria-Fraserview,3,12.0,,49.217522,-123.071311
7,187792,MAGNOLIA,OFFICINALIS,CHINESE MAGNOLIA,Kensington-Cedar Cottage,2,3.0,2001-04-02,49.251127,-123.071912
8,104016,QUERCUS,PALUSTRIS,PIN OAK,Downtown,1,8.0,1999-12-17,49.281303,-123.108253
9,102612,MALUS,ZUMI,REDBUD CRABAPPLE,Renfrew-Collingwood,1,3.0,2008-03-13,49.257272,-123.030023
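
The positional `iloc` selection above depends on the column order of the CSV staying the same. A possible alternative, sketched here on a made-up `toy_df`, is to select by column name; `keep_cols` is a hypothetical helper list and not part of the original notebook.

```python
import pandas as pd

# Toy frame with a few of the original column names; the values are made up.
toy_df = pd.DataFrame({
    "Unnamed: 0": [10747, 12573],
    "species_name": ["PLATANOIDES", "CALLERYANA"],
    "diameter": [28.5, 6.0],
    "tree_id": [21421, 129645],
    "curb": ["Y", "Y"],
})

# Hypothetical helper list: the columns to keep, in the desired order.
keep_cols = ["tree_id", "species_name", "diameter"]

# Label-based selection raises a KeyError if a column is missing,
# rather than silently picking whatever sits at a given position.
subset_df = toy_df.loc[:, keep_cols]
print(subset_df.columns.to_list())
```

Both approaches drop and reorder in one step; the name-based version is simply easier to read back later.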


In [5]:
# Fix datatype of tree_id, and date_planted

trees_df = trees_df.assign(tree_id = trees_df['tree_id'].astype('str'),
                          date_planted = trees_df['date_planted'].astype('datetime64[ns]'))

# Preview the changes
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   tree_id             5000 non-null   object        
 1   genus_name          5000 non-null   object        
 2   species_name        5000 non-null   object        
 3   common_name         5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   height_range_id     5000 non-null   int64         
 6   diameter            5000 non-null   float64       
 7   date_planted        2363 non-null   datetime64[ns]
 8   latitude            5000 non-null   float64       
 9   longitude           5000 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(5)
memory usage: 390.8+ KB
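
The `astype('datetime64[ns]')` conversion above works when every non-null value is a well-formed date. If that cannot be assumed, one option is `pd.to_datetime` with `errors="coerce"`, which turns unparseable entries into `NaT`. The sketch below uses a made-up `raw_dates` series, including a deliberately malformed entry, purely for illustration.

```python
import pandas as pd

# Made-up date strings, including a missing value and a malformed entry.
raw_dates = pd.Series(["2000-02-23", None, "not a date", "1999-11-12"])

# errors="coerce" converts anything that cannot be parsed into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed)
print("unparsed or missing:", parsed.isna().sum())
```

Comparing the null count before and after conversion is a quick way to see whether any values were malformed rather than simply missing.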


### Exploratory Visualizations

To kick off this step, let's look at the summary statistics of `trees_df`.

In [6]:
# Summary statistics

trees_df.describe(include = 'all')

Unnamed: 0,tree_id,genus_name,species_name,common_name,neighbourhood_name,height_range_id,diameter,date_planted,latitude,longitude
count,5000.0,5000,5000,5000,5000,5000.0,5000.0,2363,5000.0,5000.0
unique,5000.0,67,171,361,22,,,,,
top,21421.0,ACER,SERRULATA,KWANZAN FLOWERING CHERRY,Renfrew-Collingwood,,,,,
freq,1.0,1218,463,383,384,,,,,
mean,,,,,,2.7344,12.340888,2003-09-06 04:03:08.912399488,49.247349,-123.107128
min,,,,,,0.0,0.0,1989-10-31 00:00:00,49.202783,-123.22056
25%,,,,,,2.0,4.0,1997-11-06 00:00:00,49.230152,-123.144178
50%,,,,,,2.0,10.0,2003-02-12 00:00:00,49.247981,-123.105861
75%,,,,,,4.0,18.0,2009-11-17 00:00:00,49.263275,-123.063484
max,,,,,,9.0,71.0,2019-05-07 00:00:00,49.29393,-123.023311


**Key Observations**

1. The `tree_id` column contains a unique value for each observation, with no duplicates.
2. There are $171$ unique species of trees in the dataset, with **SERRULATA** being the most common. For simplicity, I will only consider the top 10 most common species in the analysis.
3. The **Renfrew-Collingwood** neighbourhood has the highest number of trees among the $22$ distinct neighbourhoods in the dataset.

In [7]:
# Saving top 10 species in a list

top_10_species = trees_df['species_name'].value_counts().nlargest(10).index.to_list()

# Filtering the dataset for top 10 species

filtered_trees_df = trees_df[trees_df['species_name'].isin(top_10_species)]

filtered_trees_df.describe(include = 'all')

Unnamed: 0,tree_id,genus_name,species_name,common_name,neighbourhood_name,height_range_id,diameter,date_planted,latitude,longitude
count,2497.0,2497,2497,2497,2497,2497.0,2497.0,1053,2497.0,2497.0
unique,2497.0,9,10,75,22,,,,,
top,21421.0,ACER,SERRULATA,KWANZAN FLOWERING CHERRY,Kensington-Cedar Cottage,,,,,
freq,1.0,956,463,383,197,,,,,
mean,,,,,,2.770525,13.133556,2003-09-15 17:35:43.589743616,49.246715,-123.106956
min,,,,,,0.0,0.0,1989-11-06 00:00:00,49.202986,-123.21782
25%,,,,,,2.0,5.5,1997-04-30 00:00:00,49.230067,-123.14218
50%,,,,,,3.0,12.0,2003-02-24 00:00:00,49.247247,-123.105327
75%,,,,,,4.0,18.5,2009-12-18 00:00:00,49.26253,-123.06772
max,,,,,,9.0,56.0,2019-03-29 00:00:00,49.29393,-123.023611


#### Question 1: What are the most common tree species around Vancouver?

Since we are interested in visualizing the total number of trees per species in our dataset, I will plot a horizontal bar chart. It is well suited here because it emphasizes the magnitude of quantitative values while keeping the species names in a readable orientation.

The chart highlights the 10 most common species present in Vancouver.

In [8]:
top_10_species_plot = alt.Chart(filtered_trees_df).mark_bar().encode(
    alt.X('count()'), 
    alt.Y('species_name', sort = 'x'))

top_10_species_plot

The most common tree species found in the dataset is **SERRULATA** followed by **PLATANOIDES** and **CERASIFERA**.

#### Question 2: How does tree diameter vary across different neighbourhoods?

Let's first look at the distribution of the data within the `diameter` and `height_range_id` columns.

In [9]:
alt.Chart(filtered_trees_df).mark_bar().encode(
    alt.X(alt.repeat(), type = 'quantitative', bin = alt.Bin(maxbins = 15)), 
    alt.Y('count()')
).repeat(['height_range_id', 'diameter'], columns = 2)

The distribution in both columns is skewed to the right, with a few very large values. This means that the majority of the trees fall between $1$ and $4$ in height range ID and between $0$ and $20$ inches in diameter. Let's further examine whether the two columns are related to each other, using a 2D histogram to avoid any saturation issues.

I'd expect the two columns to be positively correlated, since taller trees would be expected to have a larger diameter.

In [10]:
alt.Chart(filtered_trees_df).mark_rect().encode(
    alt.X('height_range_id', bin = alt.Bin(maxbins = 15)), 
    alt.Y('diameter', bin = alt.Bin(maxbins = 25)),
    alt.Color('count()')
)

It does seem like the two columns are positively correlated, i.e. as the height of the tree increases, the diameter increases as well. So we could also assume that areas with a high proportion of large-diameter trees will also have taller trees.
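
The visual impression of a positive relationship could be backed up with a number. Because `height_range_id` is an ordinal bucket rather than a measured height, a rank-based coefficient seems like a reasonable choice; the sketch below computes Spearman's rho as the Pearson correlation of the ranks. The `toy_df` values are made up, and this check is not part of the original analysis.

```python
import pandas as pd

# Made-up values that loosely mimic the two columns of interest.
toy_df = pd.DataFrame({
    "height_range_id": [1, 1, 2, 2, 3, 4],
    "diameter": [3.0, 4.0, 6.0, 8.0, 12.0, 20.0],
})

# Spearman's rho is the Pearson correlation of the ranks;
# rank() assigns average ranks to tied values by default.
height_rank = toy_df["height_range_id"].rank()
diameter_rank = toy_df["diameter"].rank()
rho = height_rank.corr(diameter_rank)

print(round(rho, 3))
```

A value close to 1 would support the positive association seen in the 2D histogram, while a value near 0 would suggest the pattern is weaker than it looks.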

In addition to the relationship between numerical columns, now let's explore how they are related to the categorical columns like `species_name` and `neighbourhood_name`.

In [12]:
diameter_order = []

for groupby_col in ['species_name', 'neighbourhood_name']:
    diameter_order.extend(
        filtered_trees_df
        .groupby(groupby_col)['diameter']
        .median()
        .sort_values().index.to_list())


alt.Chart(filtered_trees_df).mark_boxplot().encode(
    alt.X(alt.repeat('column'), type = 'quantitative'), 
    alt.Y(alt.repeat('row'), type = 'nominal', sort = diameter_order)
).repeat(column = ['height_range_id', 'diameter'], row = ['species_name', 'neighbourhood_name'])

It looks like Mount Pleasant and West Point Grey have the trees with the largest diameter, and Serrulata and Platanoides are the two species with the largest diameter. Given this information, I would expect Mount Pleasant and West Point Grey to contain a high proportion of Serrulata and Platanoides trees.
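
Reading rankings off a grid of boxplots can be error-prone, so the ordering could also be checked as a small table. The sketch below ranks groups by median diameter, the same statistic used for `diameter_order` above. The `toy_df` values are made up, so the result says nothing about the real neighbourhoods.

```python
import pandas as pd

# Made-up diameters for three neighbourhood names.
toy_df = pd.DataFrame({
    "neighbourhood_name": [
        "Mount Pleasant", "Mount Pleasant", "Sunset", "Sunset", "Downtown",
    ],
    "diameter": [20.0, 30.0, 10.0, 12.0, 8.0],
})

# Median diameter per neighbourhood, largest first.
median_diameter = (
    toy_df.groupby("neighbourhood_name")["diameter"]
    .median()
    .sort_values(ascending=False)
)

print(median_diameter.head(2))
```

Swapping `neighbourhood_name` for `species_name` gives the equivalent ranking by species.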

#### Question 3: How does the distribution of tree species vary across neighbourhoods?

Let's start by visualizing the total count of trees in each neighbourhood to get a sense of how trees are distributed across Vancouver's neighbourhoods. A bar chart is well suited to this, since it emphasizes the magnitude of a number.

In [None]:
top_neighbourhood = alt.Chart(filtered_trees_df).mark_bar().encode(
    alt.X('count()'), 
    alt.Y('neighbourhood_name', sort = 'x'))

top_neighbourhood

As the chart shows, the three neighbourhoods with the most trees in Vancouver are **Renfrew-Collingwood, Kensington-Cedar Cottage and Hastings-Sunrise**.

Now, to look at the distribution of the top 10 tree species in each neighbourhood, I'll plot the proportion of each species per neighbourhood on a heatmap-like plot using `mark_circle`.

In [None]:
# Normalizing the data

normalized_df = filtered_trees_df.groupby('neighbourhood_name')['species_name'].value_counts(normalize = True).reset_index(name = 'proportion')

normalized_df.head(10)
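
One way to sanity-check the normalization is to build the same proportions in wide form with `pd.crosstab(..., normalize="index")` and confirm that every neighbourhood's row sums to 1. The sketch below does this on a made-up `toy_df`; it is an alternative view for checking, not the method used in this notebook.

```python
import pandas as pd

# Made-up species records for two neighbourhood names.
toy_df = pd.DataFrame({
    "neighbourhood_name": ["Sunset", "Sunset", "Sunset", "Downtown", "Downtown"],
    "species_name": [
        "SERRULATA", "SERRULATA", "PLATANOIDES", "PLATANOIDES", "PLATANOIDES",
    ],
})

# normalize="index" divides each row by its total, so each neighbourhood's
# species proportions add up to 1.
proportions = pd.crosstab(
    toy_df["neighbourhood_name"], toy_df["species_name"], normalize="index"
)
row_sums = proportions.sum(axis=1)

print(proportions)
print(row_sums)
```

Unlike the long-form `value_counts` result, the wide table also shows species that are absent from a neighbourhood as explicit zeros.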

In [None]:
normalized_plot = alt.Chart(normalized_df).mark_circle().encode(
    alt.X('species_name', sort = '-size'), 
    alt.Y('neighbourhood_name'),
    color = 'proportion',
    size = 'proportion'
)

normalized_plot

Here, we can confirm that Serrulata is the most prevalent tree species in Mount Pleasant, and Platanoides in West Point Grey; these are the two neighbourhoods and the two species with the largest diameter. Additionally, Kensington-Cedar Cottage, Killarney and Renfrew-Collingwood appear to have the most even distribution of tree species.


#### Question 4: How has the total number of trees planted changed over the years?

This is a comparatively straightforward question. It can be answered by counting the number of trees planted in each year and plotting a line graph, which is a generally accepted way to plot time-series data.

In [None]:
trees_per_year = alt.Chart(trees_df).mark_line().encode(
    alt.X('year(date_planted)'),
    alt.Y('count()', scale = alt.Scale(domain = [0, 140]))).properties(width = 800)
    
trees_per_year + trees_per_year.mark_circle()

The trend in tree planting can be divided into three periods: a growth period between 1990 and 1996, followed by a series of fluctuations with no particular trend until 2014, and finally a steep decline until 2016. The underlying data needs to be examined further to find the cause of this; it could be due to the high number of null values in the dataset.
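
As a starting point for that follow-up, the yearly counts and the share of missing planting dates can be computed directly in pandas. The sketch below uses a made-up `toy_df`, and `planted_per_year` is a hypothetical name chosen to avoid clashing with the `trees_per_year` chart above.

```python
import pandas as pd

# Made-up planting dates, including two missing values.
toy_df = pd.DataFrame({
    "date_planted": pd.to_datetime(
        ["1995-03-01", "1995-11-20", "2003-02-12", None, None]
    ),
})

# Count trees per planting year, ignoring rows without a date.
planted_per_year = (
    toy_df["date_planted"].dropna().dt.year
    .value_counts()
    .sort_index()
)

# Share of rows that cannot contribute to the line chart at all.
missing_share = toy_df["date_planted"].isna().mean()

print(planted_per_year)
print(missing_share)
```

If the missing dates are concentrated among older or more recent trees, the shape of the line chart would reflect record-keeping as much as actual planting activity.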

## Concluding Remarks

The four key types of graph I will include in the final report:
- A **bar chart** displaying the most common species prevalent in Metro Vancouver.
- A **boxplot** highlighting the distribution of tree diameter across neighbourhoods.
- A **geographical plot**, replacing the circle plot above, to display the distribution of tree species across neighbourhoods.
- A **line graph** showing the trend of tree planting in Vancouver.

I will include the dataset description and highlight that the analysis is limited to the subset of the data used, to share the limitations of the dataset with the audience. I will also add interactivity to the charts instead of using a top 10 filtered chart, allowing the audience to control the level of detail they would like to see.

## References

- [Data Source](https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv)
- Data Visualization Sample Project for inspiration
- [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/) book to determine the right graph for visualization