# Homework 1024

In class, we took a closer look at Berkeley's 311 data. As you did last week, create a similar Jupyter notebook using Oakland's data (which you cleaned up for homework). I did not create a homework notebook for you; you will create your own and upload it here.

You can copy a lot of the code we used in class. You'll be graded with the following rubric:

- You'll get 10 points total for documentation. You should describe what you're doing with your code using Markdown cells within your Jupyter notebook. Tell me what looks interesting or doesn't look right about the Oakland 311 data.
- You'll get 10 points total for correctly using as many of the methods we learned during lecture as you can.

In [None]:
import pandas as pd
import altair as alt

## Import previously cleaned CSV

In [None]:
oakland_311 = pd.read_csv(
    'oakland_311_cleaned.csv', 
    dtype={
        'REQUESTID': object
    },
    parse_dates=['DATETIMEINIT', 'DATETIMECLOSED']
)
oakland_311.tail()

In [None]:
# Convert timedelta
oakland_311['ELAPSED_TIME'] = pd.to_timedelta(oakland_311['ELAPSED_TIME']) 
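Why the conversion is needed: a CSV stores durations as plain text, so `read_csv` brings `ELAPSED_TIME` back as strings. Here is a minimal sketch with made-up rows. The names `csv_text`, `toy` and `dtype_before` are only for illustration, and I'm assuming the column was saved in the `2 days 03:00:00` style.

```python
import io
import pandas as pd

# Made-up rows standing in for the cleaned CSV (not real Oakland data)
csv_text = (
    "REQUESTID,ELAPSED_TIME\n"
    "1,2 days 03:00:00\n"
    "2,0 days 00:45:00\n"
)

toy = pd.read_csv(io.StringIO(csv_text), dtype={'REQUESTID': object})

# Right after reading, the column holds text, not durations
dtype_before = toy['ELAPSED_TIME'].dtype

# Convert the text into real timedeltas so we can do math on it
toy['ELAPSED_TIME'] = pd.to_timedelta(toy['ELAPSED_TIME'])
print(dtype_before, '->', toy['ELAPSED_TIME'].dtype)
```

Once the column is a timedelta, methods like `.median()` treat it as a duration.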

POLL 1: When do you use Markdown vs # comments? Answer at [pollev.com/soooh](https://pollev.com/soooh).

## Explore the data

In [None]:
oakland_311

In [None]:
oakland_311.columns

In [None]:
oakland_311['DATETIMEINIT'].min()

### Count of incidents by year

I'm not going to filter by complete years because I want to show a chart by year-month.

In [None]:
oakland_311.info()

In [None]:
# Make sure every 'REQUESTID' is unique so I can rely on it when aggregating (counts, medians, etc.)
assert len(oakland_311) == oakland_311['REQUESTID'].nunique()
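If an assertion like the one above ever fails, you will want to see which IDs repeat. This is a sketch on a tiny made-up frame (`toy` and its values are invented, not taken from the Oakland data).

```python
import pandas as pd

# Invented rows: request '101' appears twice on purpose
toy = pd.DataFrame({
    'REQUESTID': ['100', '101', '101', '102'],
    'REQCATEGORY': ['ILLDUMP', 'ILLDUMP', 'ILLDUMP', 'OTHER'],
})

# Same idea as the assert, expressed as a boolean
is_unique = toy['REQUESTID'].is_unique

# keep=False marks every copy of a repeated ID, not just the later ones
dupes = toy[toy['REQUESTID'].duplicated(keep=False)]
print(is_unique)
print(dupes)
```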

### Checking calls by year and month

In [None]:
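# `axis=0` is the default, so it is omitted. On pandas 2.2+, use freq='YE' (the 'A' alias is deprecated there).
calls_by_year = oakland_311.groupby([pd.Grouper(key='DATETIMEINIT', freq='A')]).count()[['REQUESTID']].reset_index()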
calls_by_year

In [None]:
# Rename `REQUESTID` to `COUNT`
calls_by_year.rename(columns={'REQUESTID': 'COUNT'}, inplace=True)
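The count and rename steps can also be folded into one chain. This is a sketch on a made-up frame (`toy` is not the Oakland data), and it groups on `.dt.year` instead of `pd.Grouper`. That is a swapped-in technique, not what we did in class.

```python
import pandas as pd

# Invented rows: two requests in 2019, two in 2020
toy = pd.DataFrame({
    'REQUESTID': ['1', '2', '3', '4'],
    'DATETIMEINIT': pd.to_datetime(
        ['2019-03-01', '2019-11-15', '2020-01-02', '2020-06-30']
    ),
})

calls_by_year_toy = (
    toy.groupby(toy['DATETIMEINIT'].dt.year)['REQUESTID']  # group on the year number
       .count()                                            # rows per year
       .reset_index(name='COUNT')                          # name the count column directly
)
print(calls_by_year_toy)
```

There are two differences to keep in mind. Here the `DATETIMEINIT` column holds plain year numbers rather than timestamps. Also, as far as I know, a year with no rows is simply left out of this result, whereas `pd.Grouper` with `freq` lists it with a count of 0.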

#### Chart: Calls by Year

In [None]:
alt.Chart(calls_by_year).mark_bar().encode(
    x='DATETIMEINIT:O', # try flipping O to T to see what happens
    y='COUNT'
)

In [None]:
# On pandas 2.2+, use freq='ME' (the 'M' alias is deprecated there).
calls_by_month = oakland_311.groupby([pd.Grouper(key='DATETIMEINIT', freq='M')]).count()[['REQUESTID']].reset_index()

# Rename `REQUESTID` to `COUNT`
calls_by_month.rename(columns={'REQUESTID': 'COUNT'}, inplace=True)

#### Chart: Calls by Month

In [None]:
alt.Chart(calls_by_month).mark_bar().encode(
    x='DATETIMEINIT:T',
    y='COUNT'
)

# We'll learn how to add titles and other information during this lecture

### Incident types

In [None]:
oakland_311[['DESCRIPTION','REQCATEGORY']].drop_duplicates().sort_values(by=['REQCATEGORY']).reset_index(drop=True)

In [None]:
oakland_311['REQCATEGORY'].value_counts()

POLL 2: What are you noticing here? Answer at [pollev.com/soooh](https://pollev.com/soooh).

## Look at illegal dumps

In [None]:
illegal_dumps = oakland_311[oakland_311['REQCATEGORY'] == 'ILLDUMP'].reset_index(drop=True)
illegal_dumps

In [None]:
illegal_dumps_by_month = illegal_dumps.groupby([pd.Grouper(key='DATETIMEINIT', freq='M')]).count()[['REQUESTID']].reset_index()
illegal_dumps_by_month

In [None]:
illegal_dumps_by_month.columns = ['DATETIMEINIT', 'COUNT']

In [None]:
alt.Chart(illegal_dumps_by_month).mark_bar().encode(
    x='DATETIMEINIT:T',
    y='COUNT'
)

In [None]:
illegal_dumps['ELAPSED_TIME'].median()

Across the whole dataset (mid-2010 to today), the median time to close an illegal dump case has been about 2-3 days.
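The median above comes back as a `Timedelta`. Here is a sketch, on invented durations (`elapsed` is not real data), of how to turn that into a plain number of days and why I lean on the median rather than the mean: one very slow case pulls the mean up a lot.

```python
import pandas as pd

# Invented durations: mostly quick, one very slow case, one missing value
elapsed = pd.Series(pd.to_timedelta(['1 days', '2 days', '3 days', '60 days', None]))

median_elapsed = elapsed.median()  # missing values are skipped
mean_elapsed = elapsed.mean()      # dragged upward by the 60-day case

# Dividing by a one-day Timedelta gives a plain float number of days
median_days = median_elapsed / pd.Timedelta(days=1)
print(median_elapsed, mean_elapsed, median_days)
```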

## Look at illegal dumps more closely...

In [None]:
illegal_dumps.tail()

I want to group up incidents by month based on the description.

In [None]:
illegal_dump_types_by_month = illegal_dumps.groupby([pd.Grouper(key='DATETIMEINIT', freq='M'), 'DESCRIPTION']).count()[['REQUESTID']].reset_index()
illegal_dump_types_by_month.columns = ['YEARMONTH', 'DESCRIPTION', 'COUNT']
illegal_dump_types_by_month
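Grouping on two keys gives one row per (month, description) pair, which is what lets the chart color each bar segment by `DESCRIPTION`. Here is a sketch on invented rows. The description values are made up, and I build the month key with `strftime` instead of `pd.Grouper`, so treat it as an illustration rather than the class method.

```python
import pandas as pd

# Invented rows (the DESCRIPTION values are placeholders, not real Oakland categories)
toy = pd.DataFrame({
    'REQUESTID': ['1', '2', '3', '4', '5'],
    'DATETIMEINIT': pd.to_datetime(
        ['2019-01-05', '2019-01-20', '2019-01-25', '2019-02-02', '2019-02-10']
    ),
    'DESCRIPTION': ['Mattress', 'Mattress', 'Debris', 'Debris', 'Debris'],
})

# A 'YYYY-MM' text label for each row, used as the first grouping key
year_month = toy['DATETIMEINIT'].dt.strftime('%Y-%m').rename('YEARMONTH')

types_by_month_toy = (
    toy.groupby([year_month, 'DESCRIPTION'])['REQUESTID']
       .count()
       .reset_index(name='COUNT')
)
print(types_by_month_toy)
```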

In [None]:
illegal_dump_types_by_month.info()

In [None]:
alt.Chart(illegal_dump_types_by_month).mark_bar().encode(
    x='YEARMONTH',
    y='COUNT',
    color='DESCRIPTION',
    tooltip='DESCRIPTION'
).interactive()

This is super ugly but it's good for exploration!

## P.S. Found a weird error in groupby aggregations!

It looks like `df.groupby()` may include NaN fields when calculating medians and other aggregations, so be careful. To be totally honest, I don't know what's going on here.

In [None]:
# Found a solution for median() of timedeltas! Use `numeric_only=False`
illegal_dumps_resolution_by_year = illegal_dumps.groupby([pd.Grouper(key='DATETIMEINIT', axis=0, freq='A')]).median(numeric_only=False)
illegal_dumps_resolution_by_year

### What are all those warnings?

```
FutureWarning: Dropping invalid columns in DataFrameGroupBy.median is deprecated. In a future version, a TypeError will be raised. Before calling .median, select only columns which should be valid for the function.
  illegal_dumps_resolution_by_year = illegal_dumps.groupby([pd.Grouper(key='DATETIMEINIT', axis=0, freq='A')]).median(numeric_only=False)
```

The warnings are for future versions of pandas. The software contributors are letting you know that they're going to change things up in the future. So don't get too used to this code!

We can fix this future warning (not required) by following the instructions in the message. Specifically, "Before calling .median, select only columns which should be valid for the function."

Below, I'm going to subset the dataframe to `illegal_dumps[['DATETIMEINIT','ELAPSED_TIME']]`. Those are the only 2 columns I'm using. If I run this code, I won't get a warning (I also don't need to use `numeric_only=False` anymore).

In [None]:
illegal_dumps_resolution_by_year = illegal_dumps[['DATETIMEINIT','ELAPSED_TIME']].groupby([pd.Grouper(key='DATETIMEINIT', freq='A')]).median()

illegal_dumps_resolution_by_year

In [None]:
illegal_dumps_2019 = illegal_dumps[
    (illegal_dumps['DATETIMEINIT'] >= '2019-01-01') &    
    (illegal_dumps['DATETIMEINIT'] <  '2020-01-01')     
].copy()
illegal_dumps_2019['ELAPSED_TIME'].median()
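The filter above uses a half-open range: `>=` the first day of 2019 and `<` the first day of 2020. Here is a sketch on invented timestamps (`toy` is not the Oakland data) showing that this keeps the last second of 2019 and excludes midnight on New Year's Day 2020, and that it matches a filter on `.dt.year`.

```python
import pandas as pd

# Invented timestamps sitting right on the year boundaries
toy = pd.DataFrame({
    'DATETIMEINIT': pd.to_datetime([
        '2018-12-31 23:59:59',
        '2019-01-01 00:00:00',
        '2019-12-31 23:59:59',
        '2020-01-01 00:00:00',
    ])
})

# Half-open range, same shape as the filter used for illegal_dumps_2019
by_range = toy[
    (toy['DATETIMEINIT'] >= '2019-01-01') &
    (toy['DATETIMEINIT'] < '2020-01-01')
]

# Alternative: compare the year number directly
by_year = toy[toy['DATETIMEINIT'].dt.year == 2019]
print(by_range)
print(by_range.equals(by_year))
```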