# Exploratory Data Analysis for Trees in Vancouver

By Jenna Le Noble

### Introduction

The Vancouver Trees dataset includes information about various types of trees planted around the city of Vancouver, including the species name, the planted date, the tree's diameter, the tree's height, the street address, the neighborhood where the tree is planted and more. The data is found on the [City of Vancouver website](https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name). Since it is such a large dataset, we will be using a subset of the original Trees data that includes 5000 rows.

We are interested in finding out which tree species are the largest in terms of diameter and height. Then, we want to find the neighborhoods with the highest counts of these trees.

### EDA

First, we import the necessary libraries and read in the data set from the internet.

In [1]:
import pandas as pd
import altair as alt

In [2]:
trees_data = pd.read_csv('https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv')
trees_data.head()

| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.25635 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | | N | 49.238514 | -123.154958 |


**Table 1: Raw Data**

To answer our original questions, we are interested in the following columns:
* `species_name`: name of the tree species
* `neighbourhood_name`: name of the neighborhood where the tree is planted
* `diameter`: diameter of the tree (inches)
* `genus_name`: genus name of the tree
* `common_name`: common name of the tree
* `height_range_id`: height range of the tree, coded from 0 to 10 in 10-foot bins (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft)
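
To make the `height_range_id` coding concrete, here is a minimal sketch of a helper that turns an id into a readable range in feet. The function name `height_range_label` is hypothetical and not part of the dataset or the notebook; it only encodes the mapping described in the list above.

```python
def height_range_label(height_range_id: int) -> str:
    """Translate a height_range_id (0-10) into a readable range in feet.

    Hypothetical helper: assumes each id covers a 10-foot bin and that
    id 10 means "100 ft or more", as described in the column list.
    """
    if not 0 <= height_range_id <= 10:
        raise ValueError("height_range_id must be between 0 and 10")
    if height_range_id == 10:
        return "100+ ft"
    low = height_range_id * 10
    return f"{low}-{low + 10} ft"


print(height_range_label(0))   # 0-10 ft
print(height_range_label(2))   # 20-30 ft
print(height_range_label(10))  # 100+ ft
```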

We will keep these columns in the dataset and drop the rest, as they are not needed to answer our questions. We can further explore the data by using `.info()`.

In [3]:
trees_data = trees_data.loc[:, ['species_name', 'neighbourhood_name', 'diameter',
                               'genus_name', 'common_name', 'height_range_id']]

trees_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   species_name        5000 non-null   object 
 1   neighbourhood_name  5000 non-null   object 
 2   diameter            5000 non-null   float64
 3   genus_name          5000 non-null   object 
 4   common_name         5000 non-null   object 
 5   height_range_id     5000 non-null   int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 234.5+ KB


None of the columns we are interested in contain any `NA` values. We can also note that `diameter` is a continuous numeric column, `height_range_id` is a discrete numeric column, and the rest of the columns are categorical.
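
The missing-value check can also be made explicit rather than read off the `.info()` output. The sketch below uses a small made-up DataFrame (`toy`, not the real `trees_data`) that deliberately contains gaps, so the counts are visible.

```python
import pandas as pd

# Toy stand-in for trees_data; the values are made up for illustration
# and include missing entries on purpose.
toy = pd.DataFrame({
    "species_name": ["RUBRA", "NIGRA", None],
    "diameter": [12.0, None, 6.5],
    "height_range_id": [4, 2, 1],
})

# Count missing values per column. For the real subset, every count
# should be 0 to agree with the .info() output above.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running `trees_data.isna().sum()` on the real data would give the same kind of per-column summary.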

We can further explore the columns by using `.describe()`:

In [4]:
trees_data.describe(include='all')

| | species_name | neighbourhood_name | diameter | genus_name | common_name | height_range_id |
|---|---|---|---|---|---|---|
| count | 5000 | 5000 | 5000.0 | 5000 | 5000 | 5000.0 |
| unique | 171 | 22 | | 67 | 361 | |
| top | SERRULATA | Renfrew-Collingwood | | ACER | KWANZAN FLOWERING CHERRY | |
| freq | 463 | 384 | | 1218 | 383 | |
| mean | | | 12.340888 | | | 2.7344 |
| std | | | 9.2666 | | | 1.56957 |
| min | | | 0.0 | | | 0.0 |
| 25% | | | 4.0 | | | 2.0 |
| 50% | | | 10.0 | | | 2.0 |
| 75% | | | 18.0 | | | 4.0 |


**Table 2: Summary Statistics by Column**

Table 2 tells us some interesting facts about the tree data that relate to our questions:

* The maximum tree diameter is 71 inches and the mean is 12.34 inches
* The maximum height id is 9 (90-100 ft) and the mean height id is 2.7
* There are 171 unique tree species

Further analysis is required in order to answer our original questions. We can start by creating some visualizations to get a better sense of the data. Since there are 171 different species, we wish to look at the top 20 species with the largest mean diameter.

In [5]:
# Mean diameter per species, then keep the 20 species with the largest means
mean_diameter = trees_data.groupby('species_name')['diameter'].mean().reset_index()
top_species = mean_diameter.sort_values(by='diameter', ascending=False).head(20)

# Horizontal bar chart; sort='x' orders the species by mean diameter (ascending)
diameter_plot = alt.Chart(top_species).mark_bar().encode(
    alt.X('diameter'),
    alt.Y('species_name', sort='x')
).properties(
    title='Top 20 Species with Largest Mean Diameter')
diameter_plot

The figure above shows the 20 tree species with the largest mean diameter. The mean diameters range from around 23 to just below 40 inches. Next, we will explore the 20 tree species with the highest mean height range, and compare those species to the largest diameter species.  

In [6]:
# Mean height range id per species, then keep the 20 species with the highest means
mean_height = trees_data.groupby('species_name')['height_range_id'].mean().reset_index()
top_species_height = mean_height.sort_values(by='height_range_id', ascending=False).head(20)

# Horizontal bar chart; sort='x' orders the species by mean height range id (ascending)
height_plot = alt.Chart(top_species_height).mark_bar().encode(
    alt.X('height_range_id'),
    alt.Y('species_name', sort='x')
).properties(
    title='Top 20 Species with Largest Mean Height Range')
height_plot

Now we can see the 20 species with the highest mean height range. The mean height id ranges from around 4 to 7 (roughly 40-50 ft to 70-80 ft tall). Some tree species appear in both the 20 largest mean diameters and the 20 highest mean height range ids, including Trichocarpa, Cinerea, Rubra, Procera, Saccharinum and more.
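
Instead of comparing the two bar charts by eye, the shared species can be listed directly with a set intersection. The sketch below is one way to do it, using made-up data and placeholder species names (`A` to `D`); the helper `top_n_by_mean` is hypothetical and simply wraps the same group-by-and-sort steps used in the cells above.

```python
import pandas as pd

# Toy data with placeholder species names; all values are made up.
toy = pd.DataFrame({
    "species_name": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "diameter": [30.0, 34.0, 20.0, 22.0, 5.0, 7.0, 28.0, 26.0],
    "height_range_id": [6, 7, 2, 3, 5, 6, 4, 5],
})


def top_n_by_mean(df: pd.DataFrame, value_col: str, n: int) -> pd.Series:
    """Return the n species with the largest mean of value_col."""
    means = df.groupby("species_name")[value_col].mean()
    return means.sort_values(ascending=False).head(n)


top_diameter = top_n_by_mean(toy, "diameter", n=2)
top_height = top_n_by_mean(toy, "height_range_id", n=2)

# Species that appear in both top-n lists
shared = sorted(set(top_diameter.index) & set(top_height.index))
print(shared)
```

On the real data, the same intersection with `n=20` would return the species that belong to both groups.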

Our next visualization is a scatterplot that shows both size measurements for each species that belongs to both the largest diameter group and the highest height range id group.

In [7]:
# Keep only the trees whose species is in both top-20 lists
largest_species = trees_data[trees_data['species_name'].isin(top_species['species_name']) &
                             trees_data['species_name'].isin(top_species_height['species_name'])]

# One point per species: Altair computes both means within each color group
largest_species_plot = alt.Chart(largest_species).mark_circle(size=75).encode(
    alt.X('mean(height_range_id)'),
    alt.Y('mean(diameter)'),
    alt.Color('species_name'))
largest_species_plot

The plot above shows the 14 tree species that are found to be within the top 20 largest mean diameter group as well as the top 20 highest mean height id group. We can also see the Trichocarpa species has the largest mean diameter and the highest mean height range id out of all of the species.

Now we want to discover which neighborhoods contain the highest counts of these 14 largest species.

In [8]:
alt.Chart(largest_species).mark_bar().encode(
    alt.X('species_name'),
    alt.Y('count()'),
    alt.Color('species_name')).properties(width=150, height=75).facet(
    'neighbourhood_name', columns=4)

The bar charts above give us a sense of the counts of species for each neighborhood. To further determine which neighborhood has the highest count of these species, we can create a heat map.

In [9]:
alt.Chart(largest_species).mark_rect().encode(
    alt.Color('count()'),
    alt.X('species_name'),
    alt.Y('neighbourhood_name', sort='color'))

We can also create a similar visualization that uses both size and color to determine the counts. 

In [10]:
alt.Chart(largest_species).mark_circle().encode(
    alt.Color('count()'),
    alt.X('species_name'),
    alt.Y('neighbourhood_name', sort='color'),
    alt.Size('count()'))

The plots above show that Kitsilano, Dunbar-Southlands and Shaughnessy appear to have the highest counts of the largest tree species. The most common of these species are Hippocastanum and Rubra, as they have the highest counts across all neighborhoods.
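
Reading counts off color and size is approximate, so a numeric tally can back up the plots. The sketch below uses a tiny made-up stand-in (`toy_largest`) for `largest_species`; the rows are invented for illustration and do not reflect the real counts.

```python
import pandas as pd

# Toy stand-in for largest_species; the rows are made up for illustration.
toy_largest = pd.DataFrame({
    "neighbourhood_name": ["Kitsilano", "Kitsilano", "Sunset",
                           "Kitsilano", "Sunset", "Killarney"],
    "species_name": ["RUBRA", "HIPPOCASTANUM", "RUBRA",
                     "RUBRA", "NIGRA", "RUBRA"],
})

# value_counts() returns counts sorted from most to least frequent
neighbourhood_counts = toy_largest["neighbourhood_name"].value_counts()
species_counts = toy_largest["species_name"].value_counts()

print(neighbourhood_counts.head(3))
print(species_counts.head(3))
```

Applying the same two `value_counts()` calls to the real `largest_species` frame would rank the neighborhoods and species behind the heat map.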

### Conclusion

After exploring different visualizations, I will include the following 4 plots in my final report and make the following changes:

1. Bar graph of the top 20 species with largest mean diameter and top 20 species with highest mean height range id
    * This plot can potentially become a facet plot so that the 2 plots are side by side, or a layered plot
    * Remove 0 from the axis so that the bars become shorter
    * Add labels and title

2. Scatterplot of the 14 species that are part of both largest species groups
    * Update the axis so that it only reflects values closest to the points (remove 0 from the axis)
    * Add labels and title

3. Facet bar graph of the counts of the largest species by neighborhood
    * Adjust the size of each plot
    * Possibly make each y scale independent
    * Add labels and title

4. Circle plot that demonstrates the counts of species by neighborhood
    * Look into better colouring schemes
    * Add labels and title
    
    
The first 2 plots are useful for showing which tree species are the largest (in terms of diameter, height and both diameter & height). The last 2 plots are helpful for answering our question: which neighborhoods contain the highest counts of the largest species. 

### References

* https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name