# City of San Francisco Trees

Imagine you've been commissioned by the City of San Francisco to tackle a problem they've been having with local flora. The parks department has taken extensive documentation of the city's trees since the 1970s - what species are growing, where they are, who they're maintained by - amassing a dataset of over 200K trees in that time. 

The funding for that project has recently been called into question, and the City Board needs to see its value in reapproving funds for the following year. Stakeholders have raised several concerns over the past few years, and your job is to use the data to answer them. Good luck!

## Jupyter Notebook

First things first, let's get some terminology straight.
- The *language* we're working in is Python 3.7
- The *editor* we're using is Google Colab: the code runs on Google's servers, and the results show up in our browser
- This file is an interactive Python notebook, a `.ipynb` file. These are pretty special, and are also known as **Jupyter notebooks**.

Jupyter notebooks have a few special properties that make them ideal for working with data:
 - Code is organized into cells, which can be **code** or **markdown**
 - We can run the cells in **any order**. Try it out!
 - The value of the last expression in a cell is displayed automatically, no need to wrap it with `print()`

In [None]:
x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'

In [None]:
print(x) # Run this cell after running the one above, and again after running the one below

In [None]:
x = 42

In [None]:
def UltimateQuestion(computer_name):
    return computer_name + ' is thinking...'

In [None]:
UltimateQuestion('DeepThought')

## Importing packages

We use the `pandas` package to easily work with data as tables.
<br>The `numpy` package allows us to work with some other special data types, like missing values.
<br><br>We'll rename these as `pd` and `np`, just so it's easier to refer to them later on.

In [1]:
# The `as` keyword lets us give each package a shorter name
import pandas as pd
import numpy as np

In [2]:
pd.options.display.max_rows = 5 # Just to shorten output

## Importing data

For this semester, we'll typically work with data in *tabular* format, the type you'd be used to in an Excel spreadsheet. Data files saved in this format will usually have a `.csv` file extension, short for comma-separated values.
<br><br>To import this, let's use the `pd.read_csv()` function:

In [3]:
# URL of the CSV file to load
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-1/workshop/trees.csv'
trees = pd.read_csv(url)

Here, we've saved the data to a `DataFrame` object named `trees`.
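To see what `pd.read_csv()` does without depending on the network, here is a minimal sketch that parses a few lines of CSV text held in memory. The values are copied from the first records of the tree data; the names `csv_text` and `mini` are made up for this illustration.

```python
import io
import pandas as pd

# Three records in comma-separated form; the first line holds the column names
csv_text = """tree_id,common_name,dbh,latitude
30314,Victorian Box,16.0,37.759772
30321,Southern Magnolia,2.0,37.795718
30334,Maidenhair Tree,4.0,37.743222
"""

# read_csv accepts a URL, a file path, or a file-like object such as StringIO
mini = pd.read_csv(io.StringIO(csv_text))

print(type(mini))
print(mini.shape)
print(list(mini.columns))
```

Each comma marks a column boundary and each line becomes one row, which is all a `.csv` file really is.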

In [4]:
type(trees)

pandas.core.frame.DataFrame

DataFrames hold our data in a "spreadsheet"-like structure. Whatever manipulation of the data you can think of, a quick search will likely show you how to do it.
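As a rough picture of that structure, the sketch below builds a tiny DataFrame by hand and inspects its parts. The table `mini` is a stand-in made up for this example, using a few values from the tree data.

```python
import pandas as pd

# A hand-built table: each dictionary key becomes a column
mini = pd.DataFrame({
    'tree_id': [30314, 30321, 30334],
    'common_name': ['Victorian Box', 'Southern Magnolia', 'Maidenhair Tree'],
    'dbh': [16.0, 2.0, 4.0],
})

print(mini.columns.tolist())   # the column names
print(mini.dtypes)             # one data type per column
print(mini['dbh'])             # a single column is a pandas Series
```

The same three attributes work on `trees`, which is handy for checking what was actually loaded.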

## Exploring dataframes

Let's take a look at the data. We'll use the methods `.head()` and `.tail()`, which show the first and last few rows.

In [5]:
trees.head()

Unnamed: 0,tree_id,legal_status,caretaker,dbh,plot_size,species_name,common_name,date,site_location,site_type,latitude,longitude,address
0,30314,DPW Maintained,Private,16.0,,Pittosporum undulatum,Victorian Box,1955-10-20,Sidewalk,Cutout,37.759772,-122.398109,501 Arkansas St
1,30321,DPW Maintained,Private,2.0,,Magnolia grandiflora,Southern Magnolia,1956-01-06,Sidewalk,Cutout,37.795718,-122.44186,2828 Divisadero St
2,30334,DPW Maintained,Private,4.0,,Ginkgo biloba,Maidenhair Tree,1956-02-06,Sidewalk,Cutout,37.743222,-122.433634,601 29th St
3,30335,DPW Maintained,Private,2.0,,Ginkgo biloba,Maidenhair Tree,1956-02-06,Sidewalk,Cutout,37.743226,-122.433565,601 29th St
4,30333,DPW Maintained,Private,1.0,,Arbutus 'Marina',Hybrid Strawberry Tree,1956-02-06,Sidewalk,Cutout,37.743217,-122.433721,601 29th St
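Both `.head()` and `.tail()` take an optional number of rows. Here is a small sketch on a toy table (`toy` is a made-up name, not part of the tree data) so the result is easy to check by eye:

```python
import pandas as pd

# Toy table: ten rows numbered 0 to 9
toy = pd.DataFrame({'row': range(10)})

first_three = toy.head(3)    # first 3 rows
last_two = toy.tail(2)       # last 2 rows
default_head = toy.head()    # no argument: first 5 rows

print(first_three)
print(last_two)
```

Running `trees.tail()` in the same way shows the most recently recorded trees at the bottom of the table.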


How big is the dataset? `.shape` returns a tuple with the dimensions as (rows, columns)

In [6]:
trees.shape

(36073, 13)

Let's try to understand our data a bit better. 
- How many unique tree species are in the dataset? 

In [7]:
trees.species_name.nunique()

367

- Which is the most common?

In [8]:
trees.common_name.value_counts()

Swamp Myrtle         2781
Brisbane Box         2751
                     ... 
Black Mission Fig       1
Pindo Palm              1
Name: common_name, Length: 365, dtype: int64
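Since `.value_counts()` sorts from most to least frequent, the most common value sits in the first row. If we only want its name, one option is `.idxmax()`. A minimal sketch on a toy column (the counts here are invented, not the real ones):

```python
import pandas as pd

# Toy column of tree names (hypothetical, for illustration only)
names = pd.Series(['Swamp Myrtle', 'Brisbane Box', 'Swamp Myrtle',
                   'Pindo Palm', 'Swamp Myrtle', 'Brisbane Box'])

counts = names.value_counts()    # sorted from most to least frequent
most_common = counts.idxmax()    # label with the highest count

print(most_common, counts.max())
```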

- Find the biggest tree
<br>Note: `dbh` stands for diameter at breast height, a standard measure of trunk size

In [9]:
trees.sort_values(by='dbh', ascending=False)

Unnamed: 0,tree_id,legal_status,caretaker,dbh,plot_size,species_name,common_name,date,site_location,site_type,latitude,longitude,address
34738,14513,DPW Maintained,DPW,100.0,4X4,Fraxinus uhdei,Shamel Ash: Evergreen Ash,2018-06-18,Sidewalk,Cutout,37.776560,-122.446728,501 Masonic Ave
28183,12738,DPW Maintained,DPW,100.0,4x4,Tristaniopsis laurina 'Elegant',Small-leaf Tristania 'Elegant',2013-07-12,Sidewalk,Cutout,37.786183,-122.477196,1630 Lake St
...,...,...,...,...,...,...,...,...,...,...,...,...,...
14796,44797,DPW Maintained,Private,0.0,,Prunus serrulata,Ornamental Cherry,2001-04-12,Sidewalk,Cutout,37.765145,-122.480368,1206 22nd Ave
36072,144192,DPW Maintained,Private,0.0,Width 4ft,Lophostemon confertus,Brisbane Box,2020-01-25,Sidewalk,Cutout,37.776940,-122.502697,618 42nd Ave
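Sorting shows the whole table, but often we just want the top row. The sketch below shows two ways to pull it out, using a toy table built from a few values in the tree data (the name `toy` is made up for this example):

```python
import pandas as pd

toy = pd.DataFrame({
    'common_name': ['Victorian Box', 'Southern Magnolia', 'Maidenhair Tree'],
    'dbh': [16.0, 2.0, 4.0],
})

# Sort from largest to smallest, then take the first row by position
biggest = toy.sort_values(by='dbh', ascending=False).iloc[0]
print(biggest['common_name'], biggest['dbh'])

# nlargest is a shortcut when only the top few rows are needed
top_two = toy.nlargest(2, 'dbh')
print(top_two)
```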


### Subsetting

Subsetting is a super helpful tool. We'll take a look at this in more depth next week, but for now, here are the basics:

We can filter rows from a dataframe based on some condition
- Show only trees north of Golden Gate Park (latitude > `37.77285`)
- Show only `Cherry Plum` trees (What's the mean diameter?)

In [10]:
trees[trees.latitude > 37.77285]

Unnamed: 0,tree_id,legal_status,caretaker,dbh,plot_size,species_name,common_name,date,site_location,site_type,latitude,longitude,address
1,30321,DPW Maintained,Private,2.0,,Magnolia grandiflora,Southern Magnolia,1956-01-06,Sidewalk,Cutout,37.795718,-122.441860,2828 Divisadero St
5,30339,DPW Maintained,Private,11.0,,Platanus x hispanica,Sycamore: London Plane,1956-02-15,Sidewalk,Cutout,37.793189,-122.441380,2560 Divisadero St
...,...,...,...,...,...,...,...,...,...,...,...,...,...
36071,144157,DPW Maintained,Private,0.0,Width 4ft,Tristaniopsis laurina,Swamp Myrtle,2020-01-25,Sidewalk,Cutout,37.774642,-122.501452,746 41st Ave
36072,144192,DPW Maintained,Private,0.0,Width 4ft,Lophostemon confertus,Brisbane Box,2020-01-25,Sidewalk,Cutout,37.776940,-122.502697,618 42nd Ave
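The filter above really happens in two steps: the comparison produces a True/False value for every row, and the square brackets keep only the True rows. Here is a small sketch of that idea on a toy table (values taken from the first few trees; `toy`, `mask` and the other names are made up for this example):

```python
import pandas as pd

toy = pd.DataFrame({
    'common_name': ['Victorian Box', 'Southern Magnolia', 'Maidenhair Tree'],
    'dbh': [16.0, 2.0, 4.0],
    'latitude': [37.759772, 37.795718, 37.743222],
})

# Step 1: the comparison yields one True/False value per row
mask = toy.latitude > 37.77285
print(mask.tolist())

# Step 2: indexing with the mask keeps only the True rows
north = toy[mask]
print(north)

# Conditions combine with & (and) or | (or); each needs its own parentheses
small_and_south = toy[(toy.latitude <= 37.77285) & (toy.dbh < 10)]
print(small_and_south)
```

Comparisons with `==` work the same way for text columns.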


In [11]:
# You try!

### Chaining

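Another common task is to find patterns based on groups. Here we *chain* several methods together, so that each one works on the result of the one before it.
- Which tree type, on average, has the largest diameter?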

In [12]:
# Select `dbh` before averaging so non-numeric columns are never passed to mean()
trees.groupby(by='common_name')['dbh'].mean().sort_values(ascending=False).head()

common_name
Date palm (species unknown)    70.000000
False Avocado                  35.000000
Canary Island Date Palm        30.912664
Flooded Box: Coolibah          30.000000
Morton Bay Fig                 29.000000
Name: dbh, dtype: float64
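A one-line chain like the one above packs several steps together. To make each step visible, here is a sketch that unpacks an equivalent chain on a toy table (the rows and the intermediate names `grouped`, `mean_dbh` and `ranked` are made up for this example):

```python
import pandas as pd

toy = pd.DataFrame({
    'common_name': ['Maidenhair Tree', 'Maidenhair Tree',
                    'Victorian Box', 'Southern Magnolia'],
    'dbh': [4.0, 2.0, 16.0, 2.0],
})

grouped = toy.groupby('common_name')              # one group per tree type
mean_dbh = grouped['dbh'].mean()                  # average diameter per group
ranked = mean_dbh.sort_values(ascending=False)    # largest average first
print(ranked.head())

# The same steps written as a single chain
chained = toy.groupby('common_name')['dbh'].mean().sort_values(ascending=False).head()
print(chained.equals(ranked.head()))
```

Chaining saves us from naming every intermediate result, at the cost of a longer line to read.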

## Visualization

First things first, let's import the package to help us visualize the data, `plotly`.

If this package isn't yet installed, we can install it using `!pip install plotly`. More on this in week 5.

In [None]:
import plotly.express as px

## Uncomment & run the following if graphs don't show
# import plotly.io as pio
# pio.renderers.default='notebook'

Note that we're using a subpackage of the broader `plotly` package, called `plotly.express`. It simplifies a lot of the more difficult steps.

Plotly Express has a broad range of options to play with, so let's take a look at the documentation.
<br>Do a quick Google search to pull up the documentation for `px.scatter`, OR run `px.scatter?` in a Jupyter cell.

In [None]:
px.scatter?

In [None]:
trees_sample = trees.sample(frac=.2)
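The cell above keeps a random 20% of the rows, which makes for lighter plots. Because the draw is random, every run picks different trees. If we'd like the same sample each time, `.sample()` accepts a `random_state` argument. A minimal sketch on a toy table (`toy` and the seed value `0` are arbitrary choices for this example):

```python
import pandas as pd

toy = pd.DataFrame({'row': range(100)})

# frac=.2 keeps a random 20% of the rows; random_state fixes the draw
sample_a = toy.sample(frac=.2, random_state=0)
sample_b = toy.sample(frac=.2, random_state=0)

print(len(sample_a))
print(sample_a.equals(sample_b))
```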

In [None]:
fig = px.scatter(trees_sample, x='date', y='dbh')
fig.show('notebook')

In [None]:
# You try!

### Geographic Plots

The transportation department wants to track any trees sitting on a road median, in order to quickly remove debris after a bad storm.
<br> Is there a general area in which there are more roadside / median trees?

In [None]:
# 'open-street-map' tiles don't need a Mapbox access token
fig = px.scatter_mapbox(trees_sample, lat='latitude', lon='longitude',
                        mapbox_style='open-street-map')
fig.show('notebook')

In [None]:
# You try!