# Downloading data

- The data for this project is the [Street Tree List](https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq) from the San Francisco Department of Public Works. I downloaded it on Nov. 6, 2022. At that time, the data set had also last been updated on Nov. 6, 2022.
- I made a copy of the data set, named it `original_street_tree_list.csv`, and put it in the `raw-data` folder.


# Explore the data

In [8]:
# import packages
import pandas as pd
import altair as alt

In [9]:
# read csv file
sf_trees_original = pd.read_csv('street_tree_list.csv')
sf_trees_original.head()

| | TreeID | qLegalStatus | qSpecies | qAddress | SiteOrder | qSiteInfo | PlantType | qCaretaker | qCareAssistant | PlantDate | ... | XCoord | YCoord | Latitude | Longitude | Location | Fire Prevention Districts | Police Districts | Supervisor Districts | Zip Codes | Neighborhoods (old) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 217365 | Section 806 (d) | Ceanothus 'Ray Hartman' :: California Lilac 'R... | 707 Rockdale Dr | 1.0 | Sidewalk: Property side : Yard | Tree | Private | | 10/14/2021 12:00:00 AM | ... | 5997488.0 | 2098235.0 | 37.741209 | -122.451285 | (37.74120925101712, -122.45128526411095) | 9.0 | 7.0 | 4.0 | 59.0 | 40.0 |
| 1 | 92771 | DPW Maintained | Tristaniopsis laurina :: Swamp Myrtle | 11X Blanken Ave | 4.0 | Sidewalk: Curb side : Cutout | Tree | Private | | 10/14/2021 12:00:00 AM | ... | 6011718.0 | 2087394.0 | 37.712247 | -122.40132 | (37.712246915438215, -122.40132023435935) | 10.0 | 3.0 | 8.0 | 309.0 | 1.0 |
| 2 | 23904 | DPW Maintained | Prunus subhirtella 'Pendula' :: Weeping Cherry | 1600X Webster St | 6.0 | Median : Cutout | Tree | DPW | | | ... | 6003596.0 | 2114195.0 | 37.78538 | -122.431304 | (37.78537959802679, -122.43130418097743) | 13.0 | 9.0 | 11.0 | 29490.0 | 13.0 |
| 3 | 28646 | DPW Maintained | Prunus subhirtella 'Pendula' :: Weeping Cherry | 1600X Webster St | 7.0 | Median : Cutout | Tree | DPW | | | ... | 6003558.0 | 2114375.0 | 37.785872 | -122.431449 | (37.78587163716589, -122.43144931782685) | 13.0 | 9.0 | 11.0 | 29490.0 | 13.0 |
| 4 | 229807 | DPW Maintained | Jacaranda mimosifolia :: Jacaranda | 2560 Bryant St | 1.0 | Sidewalk: Curb side : Cutout | Tree | Private | | | ... | 6009700.0 | 2102427.0 | 37.753411 | -122.409355 | (37.75341142310638, -122.40935530851043) | 2.0 | 4.0 | 7.0 | 28859.0 | 19.0 |


In [12]:
sf_trees_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196590 entries, 0 to 196589
Data columns (total 23 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   TreeID                     196590 non-null  int64  
 1   qLegalStatus               196533 non-null  object 
 2   qSpecies                   196590 non-null  object 
 3   qAddress                   195097 non-null  object 
 4   SiteOrder                  194796 non-null  float64
 5   qSiteInfo                  196590 non-null  object 
 6   PlantType                  196590 non-null  object 
 7   qCaretaker                 196590 non-null  object 
 8   qCareAssistant             24707 non-null   object 
 9   PlantDate                  70878 non-null   object 
 10  DBH                        153021 non-null  float64
 11  PlotSize                   146229 non-null  object 
 12  PermitNotes                53367 non-null   object 
 13  XCoord                     19

In [14]:
# Copy the original dataframe

sf_trees = sf_trees_original.copy()

In [20]:
# check if TreeID is unique
sf_trees['TreeID'].nunique()

196590

In [21]:
len(sf_trees)

196590

There are no duplicated `TreeID` values: the number of unique IDs equals the number of rows. The data set lists 196,590 trees in SF as of Nov. 6, 2022.

In [22]:
# make sure the length of dataframe matches the number of unique IDs

assert len(sf_trees) == sf_trees['TreeID'].nunique()
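The assertion only reports that a duplicate exists, not which rows are affected. If it ever fails on a newer download, something like the sketch below could list the offending rows. The `toy` frame and its values are made up for illustration; only the column names come from the data set.

```python
import pandas as pd

# Hypothetical stand-in for sf_trees with one repeated TreeID
toy = pd.DataFrame({
    "TreeID": [101, 102, 102, 103],
    "qSpecies": ["a", "b", "c", "d"],
})

# True only when no ID repeats
ids_unique = toy["TreeID"].is_unique

# keep=False marks every row of a repeated ID, not just the later copies
dupes = toy[toy["TreeID"].duplicated(keep=False)]

print(ids_unique)
print(dupes)
```

Using `keep=False` shows all copies side by side, which makes it easier to judge whether they are true duplicates or conflicting records.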

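- There are 196,590 `TreeID` values, but only 70,878 rows (about 36%) have a `PlantDate`.
- `PlantDate` is stored as `object`; it should be a datetime Dtype.
- The `Zip Codes` column looks off: some values have only 2 or 3 digits (for example 59 and 309 in the first rows), so they cannot be actual 5-digit ZIP codes.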
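The observations above could be quantified with a check along these lines. This is a minimal sketch on a toy frame, not the real data: the `toy` name and the `TreeID` values are made up, while the three `Zip Codes` values are taken from the `head()` output.

```python
import pandas as pd

# Toy stand-in for sf_trees
toy = pd.DataFrame({
    "TreeID": [1, 2, 3, 4],
    "PlantDate": ["10/14/2021 12:00:00 AM", None, None, None],
    "Zip Codes": [59.0, 309.0, 29490.0, None],
})

# Share of rows with no plant date
missing_share = toy["PlantDate"].isna().mean()

# Number of digits in each non-missing "Zip Codes" value
zip_digits = (
    toy["Zip Codes"]
    .dropna()
    .astype(int)
    .astype(str)
    .str.len()
)

# How many values have 2, 3, ... digits
digit_counts = zip_digits.value_counts().sort_index()

print(missing_share)
print(digit_counts)
```

Run against `sf_trees`, the digit counts would show how widespread the short values are before deciding whether the column is usable as a ZIP code at all.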

In [15]:
# convert the `PlantDate` Column

sf_trees['PlantDate'] = pd.to_datetime(sf_trees['PlantDate'])
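The call above lets pandas infer the date layout. A possible variant, sketched here on sample strings rather than the real column, passes the layout explicitly. The format string is an assumption based on the `10/14/2021 12:00:00 AM` values visible in `head()`, and `errors="coerce"` is an optional choice that turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Sample values: one in the layout seen in head(), one missing, one malformed
raw = pd.Series(["10/14/2021 12:00:00 AM", None, "not a date"])

parsed = pd.to_datetime(
    raw,
    format="%m/%d/%Y %I:%M:%S %p",  # month/day/year, 12-hour clock with AM/PM
    errors="coerce",                # unparseable values become NaT
)

print(parsed)
```

Because `errors="coerce"` hides malformed values, it is worth comparing the non-null count before and after the conversion (70,878 in the `info()` output) to confirm nothing was silently dropped.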


In [7]:
sf_trees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196590 entries, 0 to 196589
Data columns (total 23 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   TreeID                     196590 non-null  int64         
 1   qLegalStatus               196533 non-null  object        
 2   qSpecies                   196590 non-null  object        
 3   qAddress                   195097 non-null  object        
 4   SiteOrder                  194796 non-null  float64       
 5   qSiteInfo                  196590 non-null  object        
 6   PlantType                  196590 non-null  object        
 7   qCaretaker                 196590 non-null  object        
 8   qCareAssistant             24707 non-null   object        
 9   PlantDate                  70878 non-null   datetime64[ns]
 10  DBH                        153021 non-null  float64       
 11  PlotSize                   146229 non-null  object  

In [18]:
sf_trees['PlantDate'].max()

Timestamp('2022-11-12 00:00:00')

In [19]:
sf_trees['PlantDate'].min()

Timestamp('1955-09-19 00:00:00')
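The latest `PlantDate` (2022-11-12) falls after the download date (Nov. 6, 2022), so at least one row may hold a scheduled or mistyped date. A minimal sketch for isolating such rows follows. The cutoff and the sample dates are taken from this notebook, while the `dates` series itself is a toy stand-in for `sf_trees['PlantDate']`.

```python
import pandas as pd

# Assumed cutoff: the download date stated at the top of the notebook
download_date = pd.Timestamp("2022-11-06")

# Toy dates: the min and max reported above, one value from head(), one missing
dates = pd.Series(pd.to_datetime(["1955-09-19", "2021-10-14", "2022-11-12", None]))

# Comparisons with NaT evaluate to False, so missing dates are excluded
after_download = dates[dates > download_date]

print(len(after_download))
print(after_download)
```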

In [None]:
# check the columns

