# Downloading data

- The data I'm using for this project is the [Street Tree List](https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq) from the San Francisco Department of Public Works. I downloaded the data on Nov. 6, 2022. At the time, the data set had last been updated on Nov. 6, 2022 as well.
- I created a copy of the data set and named it `original_street_tree_list.csv`. I then put the data set in the `raw-data` folder.


# Import the data

In [None]:
# import packages
import pandas as pd
import altair as alt

In [None]:
# read csv file
sf_trees_original = pd.read_csv('street_tree_list.csv')
sf_trees_original.head()

In [None]:
sf_trees_original.info()

In [None]:
# Copy the original dataframe

sf_trees = sf_trees_original.copy()

# Data Cleaning

## Check duplicates

In [None]:
# check if TreeID is unique
sf_trees['TreeID'].nunique()

In [None]:
len(sf_trees)

There are no duplicated `TreeID` values: the number of unique IDs matches the number of rows. The data set lists 196,590 trees in SF as of Nov. 6, 2022.
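If the two counts had differed, the next step would be to look at the offending rows. A minimal sketch of how that could be done with `duplicated`, using a made-up toy dataframe (the `species` column and all values are hypothetical, not taken from the Street Tree List):

```python
import pandas as pd

# Toy data for illustration only; not taken from the Street Tree List
toy = pd.DataFrame({
    'TreeID': [1, 2, 2, 3],
    'species': ['A', 'B', 'B', 'C'],  # hypothetical column
})

# keep=False flags every row whose TreeID appears more than once
dupes = toy[toy['TreeID'].duplicated(keep=False)]
print(dupes)

# is_unique gives the same answer as comparing nunique() with len()
print(toy['TreeID'].is_unique)
```

On the real data, the same two calls would run against `sf_trees['TreeID']`.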

In [None]:
# make sure the length of dataframe matches the number of unique IDs

assert len(sf_trees) == sf_trees['TreeID'].nunique()

- There are 196,590 `TreeID` values, but only 70,878 rows have a plant date.
- `PlantDate` should be a datetime Dtype.
- The `Zip Codes` column looks off: some values have only 2 or 3 digits.
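The observations above come from reading `info()` and `head()`, but they can also be checked directly. A rough sketch on toy data, assuming `Zip Codes` is stored as a number with missing values (the values below are made up, and the real column may be typed differently):

```python
import pandas as pd

# Toy data for illustration only; all values are made up
toy = pd.DataFrame({
    'PlantDate': ['2001-01-01', None, '2020-03-15', None],
    'Zip Codes': [94110.0, 54.0, 309.0, None],
})

# How many rows have a plant date at all
n_with_date = toy['PlantDate'].notna().sum()
print(n_with_date)

# Number of digits in each non-missing zip code value
zip_digits = toy['Zip Codes'].dropna().astype(int).astype(str).str.len()

# How many values there are for each digit length
digit_counts = zip_digits.value_counts().sort_index()
print(digit_counts)
```

Any digit length other than 5 in `digit_counts` points at values that are not real zip codes.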

## Manage column names

In [None]:
# change all column names to lower case
sf_trees.columns = sf_trees.columns.str.lower()

In [None]:
sf_trees.columns = sf_trees.columns.str.replace(' ', '_')

In [None]:
sf_trees.info()

## Convert Dtype

In [None]:
# convert the `plantdate` column (renamed from `PlantDate` above) to datetime

sf_trees['plantdate'] = pd.to_datetime(sf_trees['plantdate'])

In [None]:
# treat treeid as an identifier rather than a number
sf_trees['treeid'] = sf_trees['treeid'].astype(object)

In [None]:
sf_trees.info()

In [None]:
sf_trees.head()

In [None]:
sf_trees['plantdate'].max()

In [None]:
sf_trees['plantdate'].min()

So the earliest plant date is in 1955, which makes that tree about 67 years old as of 2022.
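The age is just the difference between the two years. A small sketch of the arithmetic, where only the year 1955 and the download date come from the text above; the month and day of `earliest` are placeholders:

```python
import pandas as pd

# Placeholder month/day: only the year 1955 is known from the text
earliest = pd.Timestamp(year=1955, month=1, day=1)
# Date the data was downloaded
as_of = pd.Timestamp(year=2022, month=11, day=6)

# Difference in calendar years; ignores whether the anniversary has passed
age_years = as_of.year - earliest.year
print(age_years)
```

On the real data, `earliest` would be `sf_trees['plantdate'].min()`.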

## Sort dataframe

In [None]:
# sort by plantdate

sf_trees.sort_values(by=['plantdate'],ascending=True).head().reset_index(drop=True)

For the rows that have a `plantdate`, I want to create some new columns showing the year, month and day of the week those trees were planted.

In [None]:
sf_trees['_day'] = sf_trees['plantdate'].dt.day_of_week
sf_trees['_month'] = sf_trees['plantdate'].dt.month
sf_trees['_year'] = sf_trees['plantdate'].dt.year
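Two things are worth knowing about these `.dt` accessors: `day_of_week` counts from Monday as 0 to Sunday as 6, and rows with a missing `plantdate` come out as `NaN`, which turns the new columns into floats. A minimal sketch on a two-row toy dataframe (the dates are chosen for illustration only):

```python
import pandas as pd

# Toy data: one real date (Nov. 6, 2022 was a Sunday) and one missing date
toy = pd.DataFrame({'plantdate': pd.to_datetime(['2022-11-06', None])})

toy['_day'] = toy['plantdate'].dt.day_of_week   # Monday=0 ... Sunday=6
toy['_month'] = toy['plantdate'].dt.month
toy['_year'] = toy['plantdate'].dt.year

print(toy)
print(toy.dtypes)
```

If whole-number columns are preferred, casting to pandas' nullable `Int64` dtype is one option, though that step is not part of the original notebook.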


In [None]:
sf_trees.head(10)

In [None]:
sf_trees.info()

In [None]:
sf_trees

# Export the clean data

In [None]:
sf_trees.to_csv('sf_trees_clean.csv', index=False)