# Exploratory Data Analysis

## Data Import

In this analysis I will be using the following packages to explore the data.

- Altair
- Pandas
- Vega Datasets
- Numpy

Below is a brief view of the data.

In [6]:
# Import libraries needed for this assignment
from hashlib import sha1
import altair as alt
import pandas as pd
from vega_datasets import data
import numpy as np

# Disable Altair's row limit so the full dataset can be plotted
alt.data_transformers.enable("default", max_rows=None)

# Import the data (the file is semicolon-delimited)
url = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/street_trees.csv'
street_trees_df = pd.read_csv(url, sep=';')

# Drop the columns that are not needed and keep only the ACER and PRUNUS genera
drop_cols = ['TREE_ID', 'CIVIC_NUMBER', 'STD_STREET', 'CULTIVAR_NAME', 'ASSIGNED', 'ROOT_BARRIER',
             'PLANT_AREA', 'ON_STREET_BLOCK', 'ON_STREET', 'STREET_SIDE_NAME', 'CURB', 'Geom']
trees_df = street_trees_df.drop(columns=drop_cols)
trees_df = trees_df[trees_df['GENUS_NAME'].isin(['ACER', 'PRUNUS'])]

# YEAR_PLANTED: first four characters of DATE_PLANTED; missing dates stay NaN
trees_df = trees_df.assign(YEAR_PLANTED=trees_df['DATE_PLANTED'].str[:4])

# TREE_TYPE: simplified name based on COMMON_NAME; the first matching condition wins
common_name = trees_df['COMMON_NAME']
trees_df['TREE_TYPE'] = np.select(
    [common_name.str.contains('MAPLE'), common_name.str.contains('CHERRY'), common_name.str.contains('PLUM')],
    ['MAPLE', 'CHERRY', 'PLUM'],
    default='OTHER',
)

# Rows with a known planting year, kept for the time-based analysis
no_nulls_df = trees_df[trees_df['YEAR_PLANTED'].notnull()]
trees_df

| | GENUS_NAME | SPECIES_NAME | COMMON_NAME | NEIGHBOURHOOD_NAME | HEIGHT_RANGE_ID | DIAMETER | DATE_PLANTED | YEAR_PLANTED | TREE_TYPE |
|---|---|---|---|---|---|---|---|---|---|
| 4 | ACER | PALMATUM | OSAKAZUKI JAPANESE MAPLE | KITSILANO | 2 | 14.00 | NaN | NaN | MAPLE |
| 13 | PRUNUS | PADUS | EUROPEAN BIRDCHERRY | MOUNT PLEASANT | 3 | 28.70 | NaN | NaN | CHERRY |
| 14 | ACER | PLATANOIDES | NORWAY MAPLE | MOUNT PLEASANT | 2 | 19.00 | NaN | NaN | MAPLE |
| 15 | PRUNUS | PADUS | EUROPEAN BIRDCHERRY | MOUNT PLEASANT | 4 | 20.90 | NaN | NaN | CHERRY |
| 16 | ACER | PLATANOIDES | NORWAY MAPLE | MOUNT PLEASANT | 2 | 21.40 | NaN | NaN | MAPLE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 147480 | ACER | PLATANOIDES | NORWAY MAPLE | SHAUGHNESSY | 5 | 22.75 | NaN | NaN | MAPLE |
| 147484 | ACER | PLATANOIDES | GLOBEHEAD NORWAY MAPLE | RENFREW-COLLINGWOOD | 1 | 3.00 | 2008-03-13 | 2008 | MAPLE |
| 147489 | ACER | RUBRUM | RED MAPLE | GRANDVIEW-WOODLAND | 3 | 8.00 | NaN | NaN | MAPLE |
| 147490 | ACER | RUBRUM | RED MAPLE | GRANDVIEW-WOODLAND | 3 | 14.00 | NaN | NaN | MAPLE |


## Dataset Description

The Street Trees dataset used in this analysis is loaded from GitHub; the original data is available from the City of Vancouver's [Open Data Portal](https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id).

Below is a quick look at the dataset schema.

| Column             | Description                                                                                                      |
|--------------------|:-----------------------------------------------------------------------------------------------------------------|
| TREE_ID            | Numerical ID                                                                                                     |
| CIVIC_NUMBER       | Street address of the site with which the tree is associated                                                     |
| STD_STREET         | Street name of the site with which the tree is associated                                                        |
| GENUS_NAME         | Genus name                                                                                                       |
| SPECIES_NAME       | Species name                                                                                                     |
| CULTIVAR_NAME      | Cultivar name                                                                                                    |
| COMMON_NAME        | Common name                                                                                                      |
| ASSIGNED           | Indicates whether the address is made up to associate the tree with a nearby lot (Y = Yes, N = No)               |
| ROOT_BARRIER       | Root barrier installed (Y = Yes, N = No)                                                                         |
| PLANT_AREA         | B = behind sidewalk, G = in tree grate, N = no sidewalk, C = cutout, a number indicates boulevard width in feet  |
| ON_STREET_BLOCK    | The street block on which the tree is physically located                                                         |
| ON_STREET          | The name of the street on which the tree is physically located                                                   |
| NEIGHBOURHOOD_NAME | City's defined local area in which the tree is located. For more information, see the Local Area Boundary Data page. |
| STREET_SIDE_NAME   | The street side on which the tree is physically located (Even, Odd, or Median (Med))                             |
| HEIGHT_RANGE_ID    | 0-10 for every 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft)                         |
| DIAMETER           | DBH in inches (DBH stands for diameter of tree at breast height)                                                 |
| CURB               | Curb presence (Y = Yes, N = No)                                                                                  |
| DATE_PLANTED       | The date of planting in YYYYMMDD format. Data for this field may not be available for all trees.                 |
| Geom               | Spatial representation of feature                                                                                |
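
The HEIGHT_RANGE_ID encoding described in the schema can be decoded into a readable label. The following is a minimal sketch based only on the examples given in the table (0 = 0-10 ft, 10 = 100+ ft); the function name `height_range_label` is hypothetical and is not part of the dataset or of any library.

```python
def height_range_label(height_range_id: int) -> str:
    """Translate a HEIGHT_RANGE_ID (0-10) into a height range in feet.

    Assumes each ID covers a 10 ft band, with 10 meaning 100 ft or more,
    as described in the schema table.
    """
    if not 0 <= height_range_id <= 10:
        raise ValueError(f"HEIGHT_RANGE_ID must be between 0 and 10, got {height_range_id}")
    if height_range_id == 10:
        return "100+ ft"
    lower = height_range_id * 10
    return f"{lower}-{lower + 10} ft"


# Example: label a few IDs
for height_id in (0, 2, 10):
    print(height_id, "->", height_range_label(height_id))
```

A column of labels could then be produced with something like `trees_df['HEIGHT_RANGE_ID'].map(height_range_label)`, although the analysis here keeps the numeric ID.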


## Data Summary Tables and Methods

In [4]:
street_trees_df.info()
print("\n")
street_trees_df[['HEIGHT_RANGE_ID', 'DIAMETER']].describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147493 entries, 0 to 147492
Data columns (total 19 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   TREE_ID             147493 non-null  int64  
 1   CIVIC_NUMBER        147493 non-null  int64  
 2   STD_STREET          147493 non-null  object 
 3   GENUS_NAME          147493 non-null  object 
 4   SPECIES_NAME        147493 non-null  object 
 5   CULTIVAR_NAME       79786 non-null   object 
 6   COMMON_NAME         147493 non-null  object 
 7   ASSIGNED            147493 non-null  object 
 8   ROOT_BARRIER        147493 non-null  object 
 9   PLANT_AREA          145988 non-null  object 
 10  ON_STREET_BLOCK     147493 non-null  int64  
 11  ON_STREET           147493 non-null  object 
 12  NEIGHBOURHOOD_NAME  147493 non-null  object 
 13  STREET_SIDE_NAME    147493 non-null  object 
 14  HEIGHT_RANGE_ID     147493 non-null  int64  
 15  DIAMETER            147493 non-nul

| | HEIGHT_RANGE_ID | DIAMETER |
|---|---|---|
| count | 147493.0 | 147493.0 |
| mean | 2.627148 | 11.605081 |
| std | 1.544236 | 9.187469 |
| min | 0.0 | 0.0 |
| 25% | 1.0 | 4.0 |
| 50% | 2.0 | 9.0 |
| 75% | 4.0 | 16.5 |
| max | 10.0 | 435.0 |


In [3]:
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66770 entries, 4 to 147492
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   GENUS_NAME          66770 non-null  object 
 1   SPECIES_NAME        66770 non-null  object 
 2   COMMON_NAME         66770 non-null  object 
 3   NEIGHBOURHOOD_NAME  66770 non-null  object 
 4   HEIGHT_RANGE_ID     66770 non-null  int64  
 5   DIAMETER            66770 non-null  float64
 6   DATE_PLANTED        28208 non-null  object 
 7   YEAR_PLANTED        28208 non-null  object 
 8   TREE_TYPE           66770 non-null  object 
dtypes: float64(1), int64(1), object(7)
memory usage: 5.1+ MB


The initial table, street_trees_df, shows the complete GitHub dataset. For this analysis, I use a subset of its columns, restricted to the ACER and PRUNUS genera, and I have created two new columns: YEAR_PLANTED, which isolates the year from DATE_PLANTED, and TREE_TYPE, which is a simplified name such as maple, cherry, or plum.
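
To make the two derived columns concrete, here is a small sketch on toy data. The rows in `toy_df` are invented for illustration and are not taken from the real dataset; the logic mirrors the description above under the assumption that the first matching keyword (MAPLE, then CHERRY, then PLUM) decides the type.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only (assumed values, not real records)
toy_df = pd.DataFrame({
    "COMMON_NAME": ["NORWAY MAPLE", "EUROPEAN BIRDCHERRY", "PISSARD PLUM", "BOXELDER"],
    "DATE_PLANTED": ["2008-03-13", None, "1999-11-02", "2001-05-20"],
})

# YEAR_PLANTED: first four characters of the date string; missing dates stay missing
toy_df["YEAR_PLANTED"] = toy_df["DATE_PLANTED"].str[:4]

# TREE_TYPE: the first condition that matches wins, anything else becomes OTHER
name = toy_df["COMMON_NAME"]
conditions = [
    name.str.contains("MAPLE"),
    name.str.contains("CHERRY"),
    name.str.contains("PLUM"),
]
labels = ["MAPLE", "CHERRY", "PLUM"]
toy_df["TREE_TYPE"] = np.select(conditions, labels, default="OTHER")

print(toy_df)
```

Because the match is a substring test, a name such as EUROPEAN BIRDCHERRY is classified as CHERRY, which agrees with the output shown earlier.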

Note that some values are missing in DATE_PLANTED and therefore in the newly created YEAR_PLANTED: only 28,208 of the 66,770 rows in trees_df have a planting date. Records without a date will be removed in the parts of the analysis that involve time.
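
Removing the undated records can be sketched as follows. The frame `sample_df` is a toy example with assumed values, and `dropna(subset=...)` is one of several equivalent ways to filter; it is shown as an illustration rather than as the exact step used later.

```python
import pandas as pd

# Toy data for illustration only: two of the four rows lack a planting year
sample_df = pd.DataFrame({
    "TREE_TYPE": ["MAPLE", "CHERRY", "MAPLE", "PLUM"],
    "YEAR_PLANTED": ["2008", None, None, "1999"],
})

# Share of rows without a planting year
missing_share = sample_df["YEAR_PLANTED"].isna().mean()

# Keep only rows with a known planting year for time-based analysis
dated_df = sample_df.dropna(subset=["YEAR_PLANTED"])

print(f"Missing share: {missing_share:.0%}")
print(dated_df)
```

Filtering only where time is involved keeps the full set of trees available for the analyses that do not depend on the planting year.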

$$
\sum_{i=1}^{n} x_i = x_{\text{maple}} + x_{\text{cherry}} + x_{\text{plum}} + x_{\text{other}}
$$
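
One way to read the identity above is that the total number of trees equals the sum of the counts per tree type. That reading is an assumption, since the text does not define $x_i$ explicitly. Under it, the identity can be checked on toy data as follows (the values in `tree_types` are invented for illustration):

```python
import pandas as pd

# Toy TREE_TYPE values for illustration only
tree_types = pd.Series(
    ["MAPLE", "MAPLE", "CHERRY", "PLUM", "OTHER", "CHERRY"],
    name="TREE_TYPE",
)

# Count of trees per type (the x_maple, x_cherry, x_plum, x_other terms)
counts = tree_types.value_counts()

# The per-type counts add up to the total number of rows
total = counts.sum()
print(counts)
print("Total:", total)
```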