<a href="https://colab.research.google.com/github/nolgalindo/Node/blob/main/Copy_of_intro_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# City of San Francisco Trees

Imagine you've been commissioned by the City of San Francisco to tackle a problem they've been having with local flora. The parks department has taken extensive documentation of the city's trees since the 1970s - what species are growing, where they are, who they're maintained by - amassing a dataset of over 200K trees in that time. 

The funding for that project has recently been called into question, and the City Board needs to see its value in reapproving funds for the following year. Stakeholders have raised several concerns over the past few years, and your job is to use the data to answer them. Good luck!

## Jupyter Notebook

First things first, let's get some terminology straight.
- The *language* we're working in is Python 3.7
- The *editor* we're using is Google Colab: the code runs on Google's servers, and the results show up in our browser
- This file is an interactive Python notebook, a `.ipynb` file. These are pretty special, and are also known as **Jupyter notebooks**.

Jupyter notebooks have a few special properties that make them ideal for working with data:
 - Code is organized into cells, which can be **code** or **markdown**
 - We can run the cells in **any order**, try it out!
 - The value of the last expression in a cell is displayed automatically, no need to wrap it with `print()`

In [None]:
# Set a variable 
x = 'Answer to the ultimate question of life'

In [None]:
# Return it without print
x

'Answer to the ultimate question of life'


Anything you can do in Python, you can do here! 

1. Write a function that takes a string as input, and does something to it 
2. In a new cell, call the function and test it out

In [None]:
# Write a function here
def func(computername):
  # The leading space keeps the name and the message apart
  return computername + ' is thinking...'


In [None]:
# Call it here
func('OKcomputer')

'OKcomputer is thinking...'

## Importing packages

We use the `pandas` package to easily work with data as tables.
<br>The `numpy` package allows us to work with some other special data types, like missing values.
<br><br>We'll alias these as `pd` and `np`, just so it's easier to refer to them later on.

In [None]:
# Import packages
import pandas as pd
import numpy as np
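
As a rough illustration of the "missing values" mentioned above, here is a minimal sketch using `np.nan`. The diameters are made up for the example and are not taken from the trees dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical diameters, with one measurement missing
diameters = pd.Series([16.0, np.nan, 4.0], name='dbh')

# isna() flags the missing entries; sum() counts them
n_missing = diameters.isna().sum()

# By default, pandas skips missing values when averaging
mean_diameter = diameters.mean()

print(n_missing, mean_diameter)
```

Because the missing value is skipped, the mean here is taken over the two recorded diameters only.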


## Importing data

For this semester, we'll typically work with data in *tabular* format, the type you'd be used to in an Excel spreadsheet. Data files saved in this format will usually have a `.csv` file extension, short for comma-separated values.

For example, a CSV file could look something like...

```
tree_number, species_name, address
312, Magnolia grandiflora, 2828 Divisadero St
124, Melaleuca quinquenervia, 485 Union St
912, Pittosporum undulatum, 47 Vicksburg St
```

To import this, let's use the `pd.read_csv()` function:

In [None]:
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-1/workshop/trees.csv'

In [None]:
# Read into dataframe
trees = pd.read_csv(url)

Here, we've saved the data to a `DataFrame` object named `trees`.

In [None]:
# Check the type
type(trees)

pandas.core.frame.DataFrame

DataFrames hold our data in little "spreadsheet"-like structures. Whatever manipulation you can think of doing to the data, you can likely find out how to do it with a quick search.
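
To see how CSV text turns into a DataFrame without downloading anything, here is a small sketch that parses the three-row sample from earlier in this section. Reading from a string through `io.StringIO` and passing `skipinitialspace=True` are choices made for this illustration, not part of the workshop's own import step.

```python
import io
import pandas as pd

# The sample CSV from above, held in a string instead of a file
csv_text = """tree_number, species_name, address
312, Magnolia grandiflora, 2828 Divisadero St
124, Melaleuca quinquenervia, 485 Union St
912, Pittosporum undulatum, 47 Vicksburg St
"""

# skipinitialspace=True drops the space that follows each comma
sample = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(sample.columns.tolist())
print(sample.shape)
```

Each line of text becomes one row, and the first line supplies the column names.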

## Exploring dataframes

Let's take a look at the data. We'll use the `.head()` method to show the first 5 rows.

In [None]:
trees.head()

Unnamed: 0,tree_id,legal_status,caretaker,dbh,plot_size,species_name,common_name,date,site_location,site_type,latitude,longitude,address
0,30314,DPW Maintained,Private,16.0,,Pittosporum undulatum,Victorian Box,1955-10-20,Sidewalk,Cutout,37.759772,-122.398109,501 Arkansas St
1,30321,DPW Maintained,Private,2.0,,Magnolia grandiflora,Southern Magnolia,1956-01-06,Sidewalk,Cutout,37.795718,-122.44186,2828 Divisadero St
2,30334,DPW Maintained,Private,4.0,,Ginkgo biloba,Maidenhair Tree,1956-02-06,Sidewalk,Cutout,37.743222,-122.433634,601 29th St
3,30335,DPW Maintained,Private,2.0,,Ginkgo biloba,Maidenhair Tree,1956-02-06,Sidewalk,Cutout,37.743226,-122.433565,601 29th St
4,30333,DPW Maintained,Private,1.0,,Arbutus 'Marina',Hybrid Strawberry Tree,1956-02-06,Sidewalk,Cutout,37.743217,-122.433721,601 29th St


How big is the dataset? `.shape` returns a tuple with the dimensions as (rows, columns)

In [None]:
# Show shape
trees.shape

(36073, 13)
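
Since `.shape` is a plain tuple, it can be unpacked into two variables. The sketch below does this on a tiny made-up table, so the names and numbers are only illustrative.

```python
import pandas as pd

# A toy table with three trees and two columns (illustrative values)
toy = pd.DataFrame({
    'common_name': ['Victorian Box', 'Southern Magnolia', 'Maidenhair Tree'],
    'dbh': [16.0, 2.0, 4.0],
})

# Unpack the (rows, columns) tuple
n_rows, n_cols = toy.shape

# The column names are available as a list, too
column_names = toy.columns.tolist()

print(n_rows, n_cols, column_names)
```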

Let's try to understand our data a bit better. 
- How many different tree species are in the dataset? 

In [None]:
# number of unique
trees.species_name.nunique()

367

- Which tree shows up the most frequently?

In [None]:
# value counts
trees.common_name.value_counts()


Swamp Myrtle              2781
Brisbane Box              2751
Hybrid Strawberry Tree    1968
Victorian Box             1604
Southern Magnolia         1602
                          ... 
Quaking aspen                1
Cabada palm                  1
Tree aloe                    1
Seedless Lime                1
Apricot                      1
Name: common_name, Length: 365, dtype: int64
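
If we only want the single most frequent name rather than the whole table of counts, one option is to call `.idxmax()` on the result. The sketch below uses a short made-up list of names, and `normalize=True` is shown as an optional extra for getting shares instead of counts.

```python
import pandas as pd

# A short, made-up list of common names
names = pd.Series(['Swamp Myrtle', 'Brisbane Box', 'Swamp Myrtle', 'Apricot'])

# Counts per name, sorted from most to least frequent
counts = names.value_counts()

# Label of the largest count
most_common = counts.idxmax()

# Same idea, but as a fraction of all rows
shares = names.value_counts(normalize=True)

print(most_common, shares[most_common])
```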

Show the biggest trees by sorting the dataframe:
<br>Note: `dbh` records the tree's diameter at breast height, a standard measure of trunk size

In [None]:
# sort values
trees.sort_values(by='dbh', ascending=False)

Unnamed: 0,tree_id,legal_status,caretaker,dbh,plot_size,species_name,common_name,date,site_location,site_type,latitude,longitude,address
34738,14513,DPW Maintained,DPW,100.0,4X4,Fraxinus uhdei,Shamel Ash: Evergreen Ash,2018-06-18,Sidewalk,Cutout,37.776560,-122.446728,501 Masonic Ave
28183,12738,DPW Maintained,DPW,100.0,4x4,Tristaniopsis laurina 'Elegant',Small-leaf Tristania 'Elegant',2013-07-12,Sidewalk,Cutout,37.786183,-122.477196,1630 Lake St
5025,4768,DPW Maintained,DPW,100.0,3X3,Corymbia ficifolia,Red Flowering Gum,1993-01-05,Sidewalk,Cutout,37.732715,-122.385231,26 Commer Ct
17964,24961,DPW Maintained,DPW,90.0,20,Phoenix canariensis,Canary Island Date Palm,2005-04-21,Median,Cutout,37.767709,-122.426675,100 Dolores St
5581,13104,DPW Maintained,DPW,90.0,3X3,Ficus retusa nitida,Banyan Fig,1993-10-26,Sidewalk,Cutout,37.801143,-122.426724,1530 Lombard St
...,...,...,...,...,...,...,...,...,...,...,...,...,...
14101,78518,DPW Maintained,Private,0.0,,Prunus cerasifera,Cherry Plum,2000-10-21,Sidewalk,Cutout,37.710295,-122.450931,75 Laura St
14114,78567,DPW Maintained,Private,0.0,,Arbutus 'Marina',Hybrid Strawberry Tree,2000-10-21,Sidewalk,Cutout,37.710306,-122.453138,40 Sears St
14763,44728,DPW Maintained,Private,0.0,,Melaleuca quinquenervia,Cajeput,2001-04-03,Sidewalk,Cutout,37.748648,-122.477643,1144 Quintara St
14796,44797,DPW Maintained,Private,0.0,,Prunus serrulata,Ornamental Cherry,2001-04-12,Sidewalk,Cutout,37.765145,-122.480368,1206 22nd Ave
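
When only the top few rows matter, `nlargest` is a possible shortcut for sorting and then taking the head. The sketch below uses a toy table with illustrative values, and also shows that `sort_values` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Toy table (illustrative values)
toy = pd.DataFrame({
    'common_name': ['Banyan Fig', 'Red Flowering Gum', 'Cherry Plum', 'Victorian Box'],
    'dbh': [90.0, 100.0, 0.0, 16.0],
})

# The two largest diameters, largest first
top2 = toy.nlargest(2, 'dbh')

# sort_values gives back a sorted copy; toy itself keeps its order
sorted_toy = toy.sort_values(by='dbh', ascending=False)

print(top2)
```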


### Subsetting

Subsetting is a super helpful tool. We'll take a look at this in more depth next week, but for now, here are the basics:

We can filter rows from a dataframe based on some condition

- Show only `Cherry Plum` trees

In [None]:
# subset
trees[trees.common_name == 'Cherry Plum']

Unnamed: 0,tree_id,legal_status,caretaker,dbh,plot_size,species_name,common_name,date,site_location,site_type,latitude,longitude,address
149,53700,Permitted Site,Private,14.0,,Prunus cerasifera,Cherry Plum,1970-03-04,Sidewalk,Cutout,37.746081,-122.426025,263 Duncan St
198,54020,DPW Maintained,Private,13.0,,Prunus cerasifera,Cherry Plum,1972-04-07,Sidewalk,Cutout,37.772780,-122.494875,862 35th Ave
208,54057,DPW Maintained,Private,8.0,,Prunus cerasifera,Cherry Plum,1972-04-21,Sidewalk,Cutout,37.772551,-122.494860,874 35th Ave
265,54255,Permitted Site,Private,10.0,3x3,Prunus cerasifera,Cherry Plum,1972-07-03,Sidewalk,Cutout,37.759509,-122.442802,191 Caselli Ave
364,221734,DPW Maintained,Private,12.0,Width 4ft,Prunus cerasifera,Cherry Plum,1972-08-17,Sidewalk,Cutout,37.765292,-122.452934,203 Carl St
...,...,...,...,...,...,...,...,...,...,...,...,...,...
35535,55973,DPW Maintained,Private,3.0,,Prunus cerasifera,Cherry Plum,2019-06-10,Sidewalk,Cutout,37.791259,-122.432719,2221 Webster St
35571,236272,DPW Maintained,Private,3.0,Width 3ft,Prunus cerasifera,Cherry Plum,2019-07-26,Sidewalk,Cutout,37.766989,-122.416495,99 Shotwell St
35572,236271,DPW Maintained,Private,3.0,Width 3ft,Prunus cerasifera,Cherry Plum,2019-07-26,Sidewalk,Cutout,37.767032,-122.416501,99 Shotwell St
35700,246210,DPW Maintained,Private,3.0,Width 0ft,Prunus cerasifera,Cherry Plum,2019-10-01,Sidewalk,Cutout,37.767967,-122.443800,725 Buena Vista Ave West


How would you show only trees north of Golden Gate Park (latitude > `37.77285`)?

Hint: Use a comparison operator, just as you would in a Python `if` statement, mirroring the syntax above.

In [None]:
# You try!
trees[trees.latitude > 37.77285]

Unnamed: 0,tree_id,legal_status,caretaker,dbh,plot_size,species_name,common_name,date,site_location,site_type,latitude,longitude,address
1,30321,DPW Maintained,Private,2.0,,Magnolia grandiflora,Southern Magnolia,1956-01-06,Sidewalk,Cutout,37.795718,-122.441860,2828 Divisadero St
...,...,...,...,...,...,...,...,...,...,...,...,...,...
36068,144227,DPW Maintained,Private,0.0,Width 4ft,Agonis flexuosa,Peppermint Willow,2020-01-25,Sidewalk,Cutout,37.773933,-122.503557,782 43rd Ave
36069,144230,DPW Maintained,Private,0.0,Width 4ft,Melaleuca quinquenervia,Cajeput,2020-01-25,Sidewalk,Cutout,37.775598,-122.503676,696 43rd Ave
36070,261517,DPW Maintained,Private,3.0,Width 3ft,Agonis flexuosa,Peppermint Willow,2020-01-25,Sidewalk,Yard,37.775886,-122.501730,679 41st Ave
36071,144157,DPW Maintained,Private,0.0,Width 4ft,Tristaniopsis laurina,Swamp Myrtle,2020-01-25,Sidewalk,Cutout,37.774642,-122.501452,746 41st Ave
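
Conditions like the ones in this section can also be combined. Here is a minimal sketch on a three-row toy table, with values loosely based on rows shown earlier. Note that pandas expects `&` (not `and`) and parentheses around each condition.

```python
import pandas as pd

# Toy table (three rows, for illustration only)
toy = pd.DataFrame({
    'common_name': ['Cherry Plum', 'Cherry Plum', 'Maidenhair Tree'],
    'dbh': [14.0, 13.0, 4.0],
    'address': ['263 Duncan St', '862 35th Ave', '601 29th St'],
})

# Each comparison produces a column of True/False values;
# & keeps only the rows where both are True
mask = (toy.common_name == 'Cherry Plum') & (toy.dbh > 13)
big_cherry_plums = toy[mask]

print(big_cherry_plums)
```

Use `|` in place of `&` to keep rows that match either condition.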


## Data Manipulation

What is the average diameter of the `Evergreen Pear` tree?

In [None]:
# You try!
trees[trees.common_name == 'Evergreen Pear'].dbh.mean()

5.306595365418895
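
To get the average for every kind of tree at once, rather than one name at a time, `groupby` is one possible approach. The sketch below runs on a made-up table, so the numbers are not from the real dataset.

```python
import pandas as pd

# Toy table (made-up diameters)
toy = pd.DataFrame({
    'common_name': ['Evergreen Pear', 'Evergreen Pear', 'Cherry Plum'],
    'dbh': [4.0, 6.0, 10.0],
})

# Group the rows by name, then average the diameter within each group
mean_by_name = toy.groupby('common_name')['dbh'].mean()

print(mean_by_name)
```

The result is indexed by name, so `mean_by_name['Evergreen Pear']` looks up a single tree's average.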

## Visualization

First things first, let's import the package to help us visualize the data, `plotly`.

If this package isn't yet included, we can install it using `!pip install plotly`. More on this in week 5.

In [None]:
# Import
import plotly.express as px

Note that we're using a subpackage of the broader `plotly` package, called `plotly.express`. It simplifies a lot of the more difficult steps.

In [None]:
px.scatter?


In [None]:
trees_sample = trees.sample(frac=.2)
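
The sampling step above draws a different random 20% of the rows each time it runs. If you'd like the same sample on every run, one option is to pass a `random_state`. The seed value `0` and the toy table below are arbitrary choices for this sketch.

```python
import pandas as pd

# Toy table with ten rows
toy = pd.DataFrame({'tree_id': range(10)})

# Same seed, same sample
first = toy.sample(frac=0.2, random_state=0)
second = toy.sample(frac=0.2, random_state=0)

print(len(first), first.equals(second))
```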

In [None]:
fig = px.scatter(trees_sample, x='date', y='dbh')
fig.show()

There aren't any obvious trends from this view. Let's add in some more parameters:

In [None]:
fig = px.scatter(trees_sample, x='date', y='dbh', 
                 opacity=.15, color='site_location', 
                 hover_name='common_name', hover_data=['site_location','site_type','address'],
                 marginal_x = 'histogram', marginal_y = 'histogram',
                 color_discrete_sequence = px.colors.qualitative.Prism[4:],
                 labels={'site_location':'Site Location', 'dbh':'Tree Diameter', 'date':'Date Recorded'}
                )
fig.show()

Plotly Express has a broad range of options to play with. Let's take a look at the documentation.
<br>Do a quick Google search to pull up the documentation for `px.scatter`, OR run `px.scatter?` in a Jupyter cell

In [None]:
# Help
px.scatter?


### Geographic Plots

The transportation department wants to track any trees sitting on a road median, in order to quickly remove debris after a bad storm.

- Is there a general area in which there are more roadside / median trees?
- Could you show the address, caretaker, and name of the tree on hover?

In [None]:
fig = px.scatter_mapbox(trees_sample, lat='latitude', lon='longitude', mapbox_style="stamen-terrain", zoom=11, 
                        color='site_location', size='dbh', opacity=.3,
                        color_discrete_sequence=['orange','red','orange','orange','orange','orange'],
                        hover_name='address',hover_data=['site_location','caretaker'],
                        labels={'site_location':'Site Location', 'dbh':'Tree Diameter', 
                                'date':'Date Recorded', 'caretaker':'Caretaker'}
                       )
fig.show()
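
Alongside the map, a rough numeric check of where median trees cluster could look like the sketch below: keep only the `Median` rows, round the coordinates onto a coarse grid, and count trees per grid cell. The toy rows are mostly hypothetical, and the two-decimal rounding is an arbitrary choice for this illustration.

```python
import pandas as pd

# Toy rows (mostly hypothetical coordinates)
toy = pd.DataFrame({
    'site_location': ['Median', 'Sidewalk', 'Median', 'Median'],
    'latitude': [37.767709, 37.776560, 37.767712, 37.801143],
    'longitude': [-122.426675, -122.446728, -122.426680, -122.426724],
})

# Keep only the trees on a road median
medians = toy[toy.site_location == 'Median']

# Round coordinates to two decimals to form coarse grid cells
cells = medians.assign(
    lat_bin=medians.latitude.round(2),
    lon_bin=medians.longitude.round(2),
)

# Count trees per cell, busiest cell first
counts = cells.groupby(['lat_bin', 'lon_bin']).size().sort_values(ascending=False)

print(counts)
```

Cells near the top of `counts` are candidate areas to look at first on the map.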