In [54]:
import pandas as pd
import altair as alt

Next, I will read in the data, which is hosted on UBC's GitHub page.

The format for reading in data in Python using `pandas` is:

```python
pd.read_csv(csv_name)
```

It's also a good idea to assign the data to a variable with a name that is easy to remember, like `trees`, or easy to type, like `df`.

In [55]:
df = pd.read_csv("https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv")

In [56]:
df.describe()

| | Unnamed: 0 | diameter | civic_number | tree_id | height_range_id | on_street_block | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 14861.9204 | 12.340888 | 2975.7076 | 128682.5846 | 2.7344 | 2960.227 | 49.247349 | -123.107128 |
| std | 8680.023278 | 9.2666 | 2078.580429 | 75412.260406 | 1.56957 | 2086.861052 | 0.021251 | 0.049137 |
| min | 2.0 | 0.0 | 2.0 | 36.0 | 0.0 | 0.0 | 49.202783 | -123.22056 |
| 25% | 7192.75 | 4.0 | 1300.5 | 61321.5 | 2.0 | 1300.0 | 49.230152 | -123.144178 |
| 50% | 14870.0 | 10.0 | 2639.0 | 130130.5 | 2.0 | 2600.0 | 49.247981 | -123.105861 |
| 75% | 22366.75 | 18.0 | 4123.0 | 191332.0 | 4.0 | 4100.0 | 49.263275 | -123.063484 |
| max | 29992.0 | 71.0 | 9113.0 | 270750.0 | 9.0 | 9100.0 | 49.29393 | -123.023311 |


After reading in the data, it's a good idea to quickly look at its overall structure by using:

```python
df.describe()
```

Using this approach, we can see the mean and the standard deviation for all of the numerical columns.

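Mathematically, the mean can be expressed as:

$$\mu = \frac{\Sigma x_i}{n}$$

and the standard deviation can be expressed as:

$$\sigma = \sqrt{\frac{\Sigma (x_i - \mu)^2}{n}}$$

Note that `df.describe()` reports the sample standard deviation, which divides by $n - 1$ instead of $n$.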
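As a quick sanity check, the mean and standard deviation can be computed by hand and compared with what pandas returns. The sketch below uses a small made-up Series (the values are illustrative and not taken from the trees data). By default, `Series.std()` uses `ddof=1`, which divides by $n - 1$; passing `ddof=0` divides by $n$.

```python
import math

import pandas as pd

# Hypothetical toy values, not taken from the trees data set
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Mean: sum of the values divided by the number of values
mu = x.sum() / n

# Sum of squared deviations from the mean
squared_dev = ((x - mu) ** 2).sum()

# Population standard deviation (divide by n)
pop_sd = math.sqrt(squared_dev / n)

# Sample standard deviation (divide by n - 1), the pandas default
sample_sd = math.sqrt(squared_dev / (n - 1))

print("mean:", mu, "matches pandas:", math.isclose(mu, x.mean()))
print("population sd:", pop_sd, "matches ddof=0:", math.isclose(pop_sd, x.std(ddof=0)))
print("sample sd:", sample_sd, "matches default:", math.isclose(sample_sd, x.std()))
```

The `std` row in the `describe()` output corresponds to the last of these quantities.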

Now let's make some charts. First, here are the columns we have to work with:

In [57]:
df.columns

Index(['Unnamed: 0', 'std_street', 'on_street', 'species_name',
       'neighbourhood_name', 'date_planted', 'diameter', 'street_side_name',
       'genus_name', 'assigned', 'civic_number', 'plant_area', 'curb',
       'tree_id', 'common_name', 'height_range_id', 'on_street_block',
       'cultivar_name', 'root_barrier', 'latitude', 'longitude'],
      dtype='object')

In [58]:
genus_chart = alt.Chart(df).mark_bar().encode(
    x=alt.X('count()', title="Count"),
    y=alt.Y('genus_name', title="Genus name", sort='-x')
).properties(height=600, title="Population of tree type in Vancouver")

genus_chart

It looks like the most common trees in this sample of Vancouver trees belong to the genus Acer, the maples. One well-known species in this genus is the Japanese maple (Acer palmatum).

Here is a picture of a Japanese maple:

<img src="https://upload.wikimedia.org/wikipedia/commons/9/9b/Acer_palmatum_img.jpg" width=200 height=100 alt="Japanese Nara Maple"/>

Photo source: [Wikipedia](https://en.wikipedia.org/wiki/Acer_palmatum)
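The ranking read off the bar chart can also be checked numerically with `value_counts()`, which returns counts sorted from most to least frequent. This is a minimal sketch on a hypothetical toy frame that stands in for the `genus_name` column; the genus values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the `genus_name` column
toy = pd.DataFrame(
    {"genus_name": ["ACER", "PRUNUS", "ACER", "FRAXINUS", "ACER", "PRUNUS"]}
)

# Count the rows per genus, most frequent first
genus_counts = toy["genus_name"].value_counts()

# Label of the most frequent genus
top_genus = genus_counts.idxmax()

print(genus_counts)
print("Most common genus:", top_genus)
```

On the real data, the same idea would be `df["genus_name"].value_counts().head(10)`.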



Next, let's look at the total number of trees planted in each year.

In [59]:
planted_year = pd.DatetimeIndex(df['date_planted']).year.to_list()

In [60]:
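df['planted_year'] = planted_year
# Note: dropna() with no arguments drops rows with a missing value in any column,
# not only rows with a missing planting date
df = df.dropna()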
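To see how much the choice of `dropna()` arguments matters, the sketch below compares dropping rows with a gap in any column against dropping only rows without a planting year. The toy frame and its values are hypothetical. The year is extracted with `pd.to_datetime(...).dt.year`, an alternative to `pd.DatetimeIndex(...).year`.

```python
import pandas as pd

# Hypothetical toy frame; the real data has many more columns
toy = pd.DataFrame(
    {
        "date_planted": ["1999-01-13", None, "2004-11-02", "2010-03-30"],
        "cultivar_name": ["BOWHALL", "RED SUNSET", None, None],
    }
)

# Extract the planting year; missing dates become NaN
toy["planted_year"] = pd.to_datetime(toy["date_planted"]).dt.year

# Drops every row that has a missing value in ANY column
all_cols = toy.dropna()

# Drops only the rows that have no planting year
dates_only = toy.dropna(subset=["planted_year"])

print("rows before:", len(toy))
print("rows after dropna():", len(all_cols))
print("rows after dropna(subset=['planted_year']):", len(dates_only))
```

If only the planting year is needed for the chart, the `subset` form keeps trees whose other columns happen to be incomplete.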

In [61]:
alt.Chart(df).mark_line().encode(
    x=alt.X('planted_year'),
    y=alt.Y('count()')
)

It looks like very few trees were planted before 1995 or after 2015. Could this be because we are missing data, or because truly few trees were planted in those periods?
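One way to start answering this is to count how many trees have no recorded planting date at all, and to look at the raw counts per year. The following is a rough sketch on hypothetical toy data; the variable names `n_missing`, `share_missing`, and `per_year` are illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the `date_planted` column
toy = pd.DataFrame(
    {
        "date_planted": [
            "1994-05-01",
            "1999-01-13",
            "1999-06-20",
            None,
            "2016-02-11",
            None,
        ]
    }
)

# Planting year per tree; missing dates become NaN
years = pd.to_datetime(toy["date_planted"]).dt.year

# Number and share of trees with no recorded planting date
n_missing = int(years.isna().sum())
share_missing = n_missing / len(toy)

# Trees planted per year, in chronological order
per_year = years.dropna().astype(int).value_counts().sort_index()

print("missing planting dates:", n_missing, "of", len(toy))
print(per_year)
```

On the real data, this should be run on a freshly read copy of the CSV, because `df` has already had its incomplete rows dropped. A large share of missing dates would suggest that the dip reflects gaps in the records and not necessarily fewer plantings.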