# Lecture 1030

More pandas and charts with Altair using Berkeley 311 call data.

## Import modules
As usual, we'll import modules at the top of the notebook. This time, we don't need the `requests` module since we're not going to re-download the data from the Internet.

### What is Altair?

[Altair](https://altair-viz.github.io/) is a data visualization library for Python. `matplotlib` is usually the first data viz module Python programmers learn, but Altair's declarative syntax tends to be easier to pick up. The Altair community uses the alias `alt` when importing.

In [1]:
import pandas as pd
import altair as alt

## Import data

We did a lot of work last week cleaning up the Berkeley 311 calls. We don't need to redo that work since we exported a clean version called `berkeley_311_clean.csv`. 

Remember that a `csv` file is just a plain-text file. That means that the file, just as it is, cannot retain the **dtype** of a column.

So this time when we import the data, we'll want to make sure that we set up the dtypes we do know and parse `datetime` dtypes correctly.

I also want to set **Case_ID** to an `object` dtype instead of an `int` dtype. Why would I want to do this? You can't operate on **Case_ID** like it's a number. You aren't going to add up the Case_IDs. So it's better to import that column as an `object`.
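
To make the dtype point concrete, here is a minimal sketch using a tiny made-up CSV held in memory. The two rows and the `csv_text` variable are stand-ins for illustration, not the real file. It compares the dtypes pandas guesses on its own with the dtypes we get when we ask for them:

```python
import io
import pandas as pd

# Two made-up rows standing in for berkeley_311_clean.csv (illustration only).
csv_text = (
    "Case_ID,Date_Opened\n"
    "121000877593,2021-09-16 06:23:23\n"
    "121000876647,2021-09-13 10:50:00\n"
)

# With no hints, pandas guesses: Case_ID becomes an integer, Date_Opened stays text.
guessed = pd.read_csv(io.StringIO(csv_text))

# With hints, Case_ID is kept as text and Date_Opened is parsed as a datetime.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'Case_ID': object},
    parse_dates=['Date_Opened'],
)

print(guessed.dtypes)
print(typed.dtypes)
```

Keeping **Case_ID** as text also means pandas won't offer to sum or average it by accident.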

Last thing: where you saved this notebook file matters. Where does the file `berkeley_311_clean.csv` exist locally on your computer?

In [2]:
berkeley_311 = pd.read_csv('../1023/berkeley_311_clean.csv', 
    dtype={
        'Case_ID': object,
    },
    parse_dates=['Date_Opened', 'Date_Closed']  # Close_Time is a duration, not a date
)


In [3]:
berkeley_311.head()

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
0,121000877593,2021-09-16 06:23:23,Closed,2021-09-20 11:22:22,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,,4 days 04:58:59
1,121000876647,2021-09-13 10:50:00,Open,NaT,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)",
2,121000809740,2020-11-06 16:51:00,Closed,2020-11-09 01:52:57,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,,2 days 09:01:57
3,121000809739,2020-11-06 16:38:00,Closed,2020-11-09 01:41:12,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)",2 days 09:03:12
4,121000793663,2020-09-01 11:32:00,Closed,2020-09-01 11:36:00,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,,0 days 00:04:00


In [4]:
berkeley_311.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 704673 entries, 0 to 704672
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   Case_ID              704673 non-null  object        
 1   Date_Opened          704673 non-null  datetime64[ns]
 2   Case_Status          704673 non-null  object        
 3   Date_Closed          667173 non-null  datetime64[ns]
 4   Request_Category     704673 non-null  object        
 5   Request_SubCategory  704673 non-null  object        
 6   Request_Detail       704673 non-null  object        
 7   Object_Type          704673 non-null  object        
 8   APN                  413081 non-null  object        
 9   Street_Address       455047 non-null  object        
 10  City                 704673 non-null  object        
 11  State                704673 non-null  object        
 12  Neighborhood         704673 non-null  object        
 13  Latitude      

The **Close_Time** column didn't get typed as `timedelta`. It doesn't look like it's possible to do so with `pd.read_csv()`. (There's [an open issue](https://github.com/pandas-dev/pandas/issues/8185) on the pandas repo as of today's lecture.) So we'll just set it this way:

In [5]:
berkeley_311['Close_Time'] = pd.to_timedelta(berkeley_311['Close_Time']) 
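
As a quick illustration of what `pd.to_timedelta()` does, here is a small sketch on made-up values written in the same text format the CSV stores. The `close_time_text` series is an assumption for the example, not the real column:

```python
import pandas as pd

# Made-up values in the same text format the CSV stores (illustration only).
close_time_text = pd.Series(['4 days 04:58:59', None, '0 days 00:04:00'])

# Convert the text to timedeltas; the missing value becomes NaT.
close_time = pd.to_timedelta(close_time_text)

print(close_time.dtype)
# Timedeltas support arithmetic, for example converting to hours.
print(close_time.dt.total_seconds() / 3600)
```

Once the column is a real timedelta, you can sort it, average it, or compare it against a threshold like `pd.Timedelta(days=7)`.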

In [6]:
berkeley_311.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 704673 entries, 0 to 704672
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype          
---  ------               --------------   -----          
 0   Case_ID              704673 non-null  object         
 1   Date_Opened          704673 non-null  datetime64[ns] 
 2   Case_Status          704673 non-null  object         
 3   Date_Closed          667173 non-null  datetime64[ns] 
 4   Request_Category     704673 non-null  object         
 5   Request_SubCategory  704673 non-null  object         
 6   Request_Detail       704673 non-null  object         
 7   Object_Type          704673 non-null  object         
 8   APN                  413081 non-null  object         
 9   Street_Address       455047 non-null  object         
 10  City                 704673 non-null  object         
 11  State                704673 non-null  object         
 12  Neighborhood         704673 non-null  object         
 13 

In [7]:
berkeley_311

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
0,121000877593,2021-09-16 06:23:23,Closed,2021-09-20 11:22:22,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,,4 days 04:58:59
1,121000876647,2021-09-13 10:50:00,Open,NaT,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)",NaT
2,121000809740,2020-11-06 16:51:00,Closed,2020-11-09 01:52:57,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,,2 days 09:01:57
3,121000809739,2020-11-06 16:38:00,Closed,2020-11-09 01:41:12,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)",2 days 09:03:12
4,121000793663,2020-09-01 11:32:00,Closed,2020-09-01 11:36:00,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,,0 days 00:04:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
704668,121001011148,2023-06-12 13:06:00,Closed,2023-07-06 13:34:09,Refuse and Recycling,Commercial,Commercial Site Inspection,Property,058 217100500,1924 CEDAR ST,Berkeley,CA,Berkeley,37.877781,-122.272751,"(37.87778105, -122.27275061)",24 days 00:28:09
704669,121001014556,2023-06-29 14:01:00,Closed,2023-07-06 15:26:38,Refuse and Recycling,Request,Cart Repair,Property,053 167101600,1513 OREGON ST,Berkeley,CA,Berkeley,37.856370,-122.279024,"(37.85636976, -122.27902376)",7 days 01:25:38
704670,121001005559,2023-05-17 11:56:00,Closed,2023-07-06 12:07:40,Refuse and Recycling,Residential,Residential Site Inspection,Property,055 191404100,2326 SPAULDING AVE,Berkeley,CA,Berkeley,37.864922,-122.280641,"(37.86492171, -122.28064106)",50 days 00:11:40
704671,121000883511,2021-10-13 10:59:00,Closed,2023-07-06 06:55:14,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,060 244500103,1700 HOPKINS ST PARK,Berkeley,CA,Berkeley,37.882542,-122.277477,"(37.88254238, -122.27747688)",630 days 19:56:14


## Explore data

What do I do if I don't have a question yet? I'm not really sure what to look into with this 311 data. So I'm going to explore it a little bit. I might do some analysis, I might not.

### Categories of incidents in 2022

I'm curious about the different categories of incidents in the year 2022.

First, I'll create a new dataframe `berkeley_311_2022` that subsets the `berkeley_311` data to just the cases that were opened in 2022. (We discussed this last week, but subsetting data is a way to filter data.)

In [15]:
berkeley_311_2022 = berkeley_311.loc[
    (berkeley_311['Date_Opened'] >= '2022-01-01') &
    (berkeley_311['Date_Opened'] < '2023-01-01') 
    # Why don't I use `berkeley_311['Date_Opened'] <= '2022-12-31'`?
].copy()

In [16]:
berkeley_311_2022

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
7,121000923118,2022-04-22 11:06:07,Closed,2022-04-23 01:57:00,"Streets, Utilities, and Transportation",Clean City Program,Illegal Dumping - City Property,Property,,"Intersection of Henry and Cedar, BERKELEY, CA",Berkeley,CA,Berkeley,,,,0 days 14:50:53
8,121000923067,2022-04-22 08:35:00,Closed,2022-04-23 01:56:37,"Streets, Utilities, and Transportation",Clean City Program,Illegal Dumping - City Property,Property,060 239107301,1313 CURTIS ST,Berkeley,CA,Berkeley,37.880943,-122.289982,"(37.88094318, -122.28998212)",0 days 17:21:37
9,121000923227,2022-04-22 15:28:00,Closed,2022-04-23 01:57:41,"Streets, Utilities, and Transportation",Clean City Program,Illegal Dumping - City Property,Property,053 165200105,1020 HEINZ AVE,Berkeley,CA,Berkeley,37.853063,-122.288187,"(37.85306275, -122.28818668)",0 days 10:29:41
10,121000923168,2022-04-22 13:21:00,Closed,2022-04-23 01:59:03,"Streets, Utilities, and Transportation",Clean City Program,Illegal Dumping - City Property,Property,058 215601000,1635 VIRGINIA ST,Berkeley,CA,Berkeley,37.875586,-122.278920,"(37.87558556, -122.27891974)",0 days 12:38:03
11,121000923276,2022-04-23 10:17:39,Closed,2022-04-23 05:07:20,Graffiti and Vandalism,Graffiti,Graffiti Abatement - Internet Request,Individual,,,Berkeley,CA,Berkeley,,,,-1 days +18:49:41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
703260,121000971323,2022-12-02 11:33:00,Closed,2023-06-29 13:53:39,Refuse and Recycling,Request,Roll Off Bin,Property,057 201900103,2246 MILVIA ST,Berkeley,CA,Berkeley,37.867572,-122.271419,"(37.8675722, -122.27141874)",209 days 02:20:39
703856,121000948662,2022-08-15 10:24:00,Closed,2023-06-30 14:55:20,Refuse and Recycling,Residential,Residential Lost or Stolen Cart,Property,054 173700500,2741 DOHR ST,Berkeley,CA,Berkeley,37.857214,-122.281251,"(37.85721361, -122.28125113)",319 days 04:31:20
703922,121000973880,2022-12-16 10:28:00,Closed,2023-06-30 09:45:00,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,056 199705600,5 ACTON CIR,Berkeley,CA,Berkeley,37.868318,-122.283428,"(37.86831774, -122.28342802)",195 days 23:17:00
704571,121000961524,2022-10-13 09:53:00,Closed,2023-07-06 15:14:28,Refuse and Recycling,Commercial,Commercial Bin Size Increase,Property,054 170500801,2800 FOREST AVE,Berkeley,CA,Berkeley,37.861189,-122.250483,"(37.86118907, -122.25048336)",266 days 05:21:28


Let's look at that expression above. 
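- I used `df.loc[ expression ]` because it makes it explicit that we're selecting rows, and it's the form pandas recommends when you later assign values to a subset. With a boolean expression like this one, plain `df[ expression ]` returns the same rows, so either way is fine for your work. There are many ways to subset data in pandas; here's some more information about [those ways](https://pandas.pydata.org/docs/user_guide/indexing.html).
- We're using `&` (instead of `and`). Remember our first lectures: `&` is a [bitwise operator](https://docs.python.org/3/reference/expressions.html#binary-bitwise-operations), while `and` is a [logical or boolean operator](https://docs.python.org/3/reference/expressions.html#boolean-operations). pandas applies `&` row by row to the two boolean series, whereas `and` would try to treat each whole series as a single `True` or `False` and raise an error. Each comparison also needs its own parentheses because `&` binds more tightly than `>=` and `<`.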
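
Here is a small sketch of both points, using four made-up timestamps rather than the real data. It shows how `&` combines two comparisons row by row, what happens with `and`, and why the upper bound is written as `< '2023-01-01'`:

```python
import pandas as pd

# Four made-up timestamps around the edges of 2022 (illustration only).
opened = pd.Series(pd.to_datetime([
    '2021-12-31 23:59:00',
    '2022-01-01 00:00:00',
    '2022-12-31 15:30:00',
    '2023-01-01 00:00:00',
]))

# `&` compares the two boolean series row by row.
in_2022 = (opened >= '2022-01-01') & (opened < '2023-01-01')

# '2022-12-31' means midnight at the start of that day,
# so `<=` drops anything opened later on December 31.
misses_dec_31 = (opened >= '2022-01-01') & (opened <= '2022-12-31')

# `and` asks for one True/False per series, which pandas refuses to guess.
try:
    (opened >= '2022-01-01') and (opened < '2023-01-01')
except ValueError as err:
    and_error = str(err)

print(in_2022.tolist())
print(misses_dec_31.tolist())
print(and_error)
```

The third timestamp (3:30 p.m. on December 31) is kept by the first mask and lost by the second, which is the reason for comparing against the first moment of the next year.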


One thing I'm seeing immediately is that the index of this new dataframe `berkeley_311_2022` looks kind of weird. It's no longer sequential. I can reset the index to make it sequential by using `df.reset_index(drop=True)`.

```python
berkeley_311_2022 = berkeley_311_2022.reset_index(drop=True)
```

Alternatively, instead of copying the original dataframe with `df.copy()`, we can reset the index at the same time we subset the data:

In [19]:
berkeley_311_2022 = berkeley_311.loc[
    (berkeley_311['Date_Opened'] >= '2022-01-01') &
    (berkeley_311['Date_Opened'] < '2023-01-01') 
].reset_index(drop=True)

In [20]:
berkeley_311_2022

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
0,121000923118,2022-04-22 11:06:07,Closed,2022-04-23 01:57:00,"Streets, Utilities, and Transportation",Clean City Program,Illegal Dumping - City Property,Property,,"Intersection of Henry and Cedar, BERKELEY, CA",Berkeley,CA,Berkeley,,,,0 days 14:50:53
1,121000923067,2022-04-22 08:35:00,Closed,2022-04-23 01:56:37,"Streets, Utilities, and Transportation",Clean City Program,Illegal Dumping - City Property,Property,060 239107301,1313 CURTIS ST,Berkeley,CA,Berkeley,37.880943,-122.289982,"(37.88094318, -122.28998212)",0 days 17:21:37
2,121000923227,2022-04-22 15:28:00,Closed,2022-04-23 01:57:41,"Streets, Utilities, and Transportation",Clean City Program,Illegal Dumping - City Property,Property,053 165200105,1020 HEINZ AVE,Berkeley,CA,Berkeley,37.853063,-122.288187,"(37.85306275, -122.28818668)",0 days 10:29:41
3,121000923168,2022-04-22 13:21:00,Closed,2022-04-23 01:59:03,"Streets, Utilities, and Transportation",Clean City Program,Illegal Dumping - City Property,Property,058 215601000,1635 VIRGINIA ST,Berkeley,CA,Berkeley,37.875586,-122.278920,"(37.87558556, -122.27891974)",0 days 12:38:03
4,121000923276,2022-04-23 10:17:39,Closed,2022-04-23 05:07:20,Graffiti and Vandalism,Graffiti,Graffiti Abatement - Internet Request,Individual,,,Berkeley,CA,Berkeley,,,,-1 days +18:49:41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58929,121000971323,2022-12-02 11:33:00,Closed,2023-06-29 13:53:39,Refuse and Recycling,Request,Roll Off Bin,Property,057 201900103,2246 MILVIA ST,Berkeley,CA,Berkeley,37.867572,-122.271419,"(37.8675722, -122.27141874)",209 days 02:20:39
58930,121000948662,2022-08-15 10:24:00,Closed,2023-06-30 14:55:20,Refuse and Recycling,Residential,Residential Lost or Stolen Cart,Property,054 173700500,2741 DOHR ST,Berkeley,CA,Berkeley,37.857214,-122.281251,"(37.85721361, -122.28125113)",319 days 04:31:20
58931,121000973880,2022-12-16 10:28:00,Closed,2023-06-30 09:45:00,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,056 199705600,5 ACTON CIR,Berkeley,CA,Berkeley,37.868318,-122.283428,"(37.86831774, -122.28342802)",195 days 23:17:00
58932,121000961524,2022-10-13 09:53:00,Closed,2023-07-06 15:14:28,Refuse and Recycling,Commercial,Commercial Bin Size Increase,Property,054 170500801,2800 FOREST AVE,Berkeley,CA,Berkeley,37.861189,-122.250483,"(37.86118907, -122.25048336)",266 days 05:21:28


#### Let's view all the unique values of **Request_Category**

You can call `series.unique()` on a column:

In [23]:
berkeley_311_2022['Request_Category'].unique()

array(['Streets, Utilities, and Transportation', 'Graffiti and Vandalism',
       'Other Account Services and Billing', 'Refuse and Recycling',
       'General Questions/information', 'Traffic and Transportation',
       'Parks, Trees and Vegetation', 'Government Activity',
       'Facilities, Electrical & Property Management',
       'Equipment Maintenance', 'Public Records Act'], dtype=object)

I'm interested in getting a count of those categories for 2022. How can I achieve this? We'll use the method `series.value_counts()`.

In [24]:
berkeley_311_2022['Request_Category'].value_counts()

Request_Category
Refuse and Recycling                            24496
General Questions/information                   16839
Streets, Utilities, and Transportation           7397
Other Account Services and Billing               5052
Parks, Trees and Vegetation                      2073
Traffic and Transportation                        974
Government Activity                               777
Facilities, Electrical & Property Management      745
Graffiti and Vandalism                            559
Equipment Maintenance                              21
Public Records Act                                  1
Name: count, dtype: int64
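
To see the difference between the two methods side by side, here is a sketch on a four-row made-up dataframe (the `calls` dataframe is an assumption for the example):

```python
import pandas as pd

# Made-up rows (illustration only).
calls = pd.DataFrame({
    'Request_Category': [
        'Refuse and Recycling',
        'Refuse and Recycling',
        'Refuse and Recycling',
        'Graffiti and Vandalism',
    ],
})

# unique() lists each distinct value once, in order of first appearance.
unique_categories = calls['Request_Category'].unique()

# value_counts() tallies each value and sorts from most to least common.
category_counts = calls['Request_Category'].value_counts()

print(unique_categories)
print(category_counts)
```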

You can also get the `value_counts()` for two columns:

In [25]:
berkeley_311_2022[['Request_Category', 'Request_SubCategory']].value_counts()

Request_Category                              Request_SubCategory           
General Questions/information                 Miscellaneous                     16839
Refuse and Recycling                          Residential                        9317
                                              Commercial                         8505
Streets, Utilities, and Transportation        Clean City Program                 5650
Refuse and Recycling                          Account Services and Billing       4452
Other Account Services and Billing            Marina                             2803
Refuse and Recycling                          Request                            2222
Parks, Trees and Vegetation                   Trees                              1652
Other Account Services and Billing            Rental Housing Safety Program      1082
Streets, Utilities, and Transportation        Sidewalk/Street Maintenance        1053
Government Activity                           Inquiry          

OK, so let's just look at the major topline categories.

In [26]:
category_counts_2022 = berkeley_311_2022['Request_Category'].value_counts()
category_counts_2022

Request_Category
Refuse and Recycling                            24496
General Questions/information                   16839
Streets, Utilities, and Transportation           7397
Other Account Services and Billing               5052
Parks, Trees and Vegetation                      2073
Traffic and Transportation                        974
Government Activity                               777
Facilities, Electrical & Property Management      745
Graffiti and Vandalism                            559
Equipment Maintenance                              21
Public Records Act                                  1
Name: count, dtype: int64

In [27]:
type(category_counts_2022)

pandas.core.series.Series

#### Convert series to dataframe
The `.value_counts()` method creates a series, not a dataframe. We'll convert that to a pandas dataframe with `to_frame()`.

In [28]:
category_counts_2022 = category_counts_2022.to_frame()
category_counts_2022

Request_Category,count
Refuse and Recycling,24496
General Questions/information,16839
"Streets, Utilities, and Transportation",7397
Other Account Services and Billing,5052
"Parks, Trees and Vegetation",2073
Traffic and Transportation,974
Government Activity,777
"Facilities, Electrical & Property Management",745
Graffiti and Vandalism,559
Equipment Maintenance,21


#### Resetting the index

In this dataframe, the index is no longer a series of sequential integers like we've seen before. We'll turn **Request_Category** from the index into a regular column. That will make the dataframe easier to use later.

We're going to use `df.reset_index()`. This time, we're not going to use the `drop=True` argument, because `drop=True` would throw the old index away and we want to keep it as a column.

In [29]:
category_counts_2022 = category_counts_2022.reset_index()

In [30]:
category_counts_2022

Unnamed: 0,Request_Category,count
0,Refuse and Recycling,24496
1,General Questions/information,16839
2,"Streets, Utilities, and Transportation",7397
3,Other Account Services and Billing,5052
4,"Parks, Trees and Vegetation",2073
5,Traffic and Transportation,974
6,Government Activity,777
7,"Facilities, Electrical & Property Management",745
8,Graffiti and Vandalism,559
9,Equipment Maintenance,21
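
The two forms of `reset_index()` are easy to mix up, so here is a short sketch comparing them on a made-up two-row table (the `counts` dataframe is an assumption for the example):

```python
import pandas as pd

# Made-up counts indexed by category (illustration only).
counts = pd.DataFrame(
    {'count': [3, 1]},
    index=pd.Index(
        ['Refuse and Recycling', 'Graffiti and Vandalism'],
        name='Request_Category',
    ),
)

# Without drop: the old index becomes a regular column.
kept = counts.reset_index()

# With drop=True: the old index is thrown away.
dropped = counts.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

In both cases the new index is a fresh run of integers starting at 0; the only difference is whether the old index survives as a column.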


Looks like `Refuse and Recycling`, along with `General Questions/information` and `Streets, Utilities, and Transportation` were among the top issues in 2022. Might be worth looking into some of the sub-categories later.

In [36]:
subcategories_2022 = berkeley_311_2022[
    ['Request_Category', 'Request_SubCategory']
].value_counts().to_frame().reset_index().sort_values(
    by=['Request_Category']
).reset_index(drop=True)
subcategories_2022

Unnamed: 0,Request_Category,Request_SubCategory,count
0,Equipment Maintenance,City Vehicles,20
1,Equipment Maintenance,Equipment,1
2,"Facilities, Electrical & Property Management",Parks/Marina Building Services,84
3,"Facilities, Electrical & Property Management",Electrical Services,661
4,General Questions/information,Miscellaneous,16839
5,Government Activity,COVID-19,42
6,Government Activity,Inquiry,735
7,Graffiti and Vandalism,Graffiti,558
8,Graffiti and Vandalism,Vandalism,1
9,Other Account Services and Billing,Miscellaneous,635


In [37]:
category_counts_2022

Unnamed: 0,Request_Category,count
0,Refuse and Recycling,24496
1,General Questions/information,16839
2,"Streets, Utilities, and Transportation",7397
3,Other Account Services and Billing,5052
4,"Parks, Trees and Vegetation",2073
5,Traffic and Transportation,974
6,Government Activity,777
7,"Facilities, Electrical & Property Management",745
8,Graffiti and Vandalism,559
9,Equipment Maintenance,21


#### Rename columns

Let's change the column names, while we're at it.

You can rename _all_ the columns in a dataframe in one sweep by assigning a list of new names:

```python
category_counts_2022.columns = ['category', 'cases']
```

The list needs one name for every column, in order, so it gets long if the dataframe is wide. It's still the easier route when you're renaming most of the columns. If you have only one column to rename out of many columns, you'll want to use the following code:

```python
category_counts_2022.rename(columns={'Request_Category': 'category'}, inplace=True)
```

The first argument we pass to the `df.rename()` method is `columns`. And what do we set columns to? We set it to a Python dictionary where the "key" is the original column name and the "value" is the new column name: `{'Request_Category': 'category'}`.

The second argument is `inplace=True`. That tells pandas to change `category_counts_2022` "in place," without our having to reassign the dataframe variable. A lot of the methods in pandas return a new dataframe instead of altering the original dataframe. An alternative to using `inplace` is to reassign the result:

```python
category_counts_2022 = category_counts_2022.rename(columns={'Request_Category': 'category'})
```
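
The sketch below, on a made-up one-row dataframe, shows why the reassignment matters: `rename()` hands back a new dataframe and leaves the original alone. The `counts` dataframe is an assumption for the example.

```python
import pandas as pd

# Made-up one-row summary table (illustration only).
counts = pd.DataFrame({
    'Request_Category': ['Refuse and Recycling'],
    'count': [24496],
})

# rename() returns a new dataframe; `counts` itself is untouched.
renamed = counts.rename(columns={'Request_Category': 'category'})

# Assigning to .columns replaces every name at once, in order.
all_renamed = counts.copy()
all_renamed.columns = ['category', 'cases']

print(counts.columns.tolist())
print(renamed.columns.tolist())
print(all_renamed.columns.tolist())
```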

In [39]:
category_counts_2022.columns = ['category', 'cases']
category_counts_2022

Unnamed: 0,category,cases
0,Refuse and Recycling,24496
1,General Questions/information,16839
2,"Streets, Utilities, and Transportation",7397
3,Other Account Services and Billing,5052
4,"Parks, Trees and Vegetation",2073
5,Traffic and Transportation,974
6,Government Activity,777
7,"Facilities, Electrical & Property Management",745
8,Graffiti and Vandalism,559
9,Equipment Maintenance,21


#### Let's visualize this summary table!

Before we run the Altair code below, let's take a closer look:

```python
alt.Chart(category_counts_2022).mark_bar().encode(
    x='cases',
    y='category'
).properties(
    title='Berkeley 311 cases in 2022'
)
```
The first part of the code `alt.Chart()` requires you to fill the first argument with a dataframe, in this case `category_counts_2022`.

The next part of the code `mark_bar()` specifies a bar chart. (If you want a line chart, you'd use `mark_line()`.)

After that, `.encode()` tells Altair which columns to use for the `x` and `y` axes.

If you want to add a title, you'd use Altair's `.properties()` method.

In [43]:
alt.Chart(category_counts_2022).mark_bar().encode(
    x='cases',
    y='category'
).properties(
    title='Berkeley 311 cases in 2022'
)

Annoyingly, this doesn't sort the chart in descending order, which I prefer. This is the code to do it; it's a little more complicated:

```python
alt.Chart(category_counts_2022).mark_bar().encode(
    x='cases',
    y=alt.Y('category', sort='-x')
).properties(
    title='Berkeley 311 cases in 2022'
)
```

Basically, you have to create a custom Y encoding with the format: `alt.Y('column_name', sort='-x')`. Here `'-x'` means "sort the bars by the value on the x-axis, in descending order" (plain `'x'` would sort ascending). This is not intuitive, I think; it's just something you'd have to look up in the documentation.

In [48]:
alt.Chart(category_counts_2022).mark_bar().encode(
    x='cases',
    y=alt.Y('category', sort='-x')
).properties(
    title='Berkeley 311 cases in 2022'
)

### Count how many incidents per year

The next thing I'd like to do is get a count of all the incidents by year. However, I know from the last notebook that the data for 2010 and 2023 are not complete. So I need to subset.

Below, I'm creating a new dataframe called `berkeley_311_complete` that limits the `berkeley_311` dataframe to rows in which the **Date_Opened** value falls on or after January 1, 2011 and before January 1, 2023.

In [49]:
berkeley_311_complete = berkeley_311.loc[
    (berkeley_311['Date_Opened'] >= '2011-01-01') &
    (berkeley_311['Date_Opened'] < '2023-01-01')
].reset_index(drop=True)

#### Aggregate with `df.groupby()`

To aggregate the data, we're going to use a method called `df.groupby()`. Normally, when we group data, we'll group them by columns, like so:

```python
df.groupby(['Column 1', 'Column 2'])
```

You can also just group by a single column, like we're doing below:

In [50]:
berkeley_311_complete.groupby(['Request_Category'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb151548a90>

Running a `df.groupby()` doesn't do anything on its own; it just creates a pandas DataFrameGroupBy object. You have to follow it up with some kind of other method. Below, we're calling `.count()` on the DataFrameGroupBy object, which counts the non-null values in each column for each group.

In [51]:
berkeley_311_complete.groupby(['Request_Category']).count()

Request_Category,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
Business License,3265,3265,3265,2998,3265,3265,3265,1705,1705,3265,3265,3265,1683,1683,1683,2998
Disability Compliance,30,30,30,30,30,30,30,6,8,30,30,30,6,6,6,30
Environmental Services and Programs,1799,1799,1799,1798,1799,1799,1799,1024,1460,1799,1799,1799,1016,1016,1016,1798
Equipment Maintenance,472,472,472,420,472,472,472,204,326,472,472,472,203,203,203,420
"Facilities, Electrical & Property Management",9027,9027,9027,8401,9027,9027,9027,3880,6435,9027,9027,9027,3860,3860,3860,8401
General Questions/information,146351,146351,146351,143106,146351,146351,146351,56570,64124,146351,146351,146351,55867,55867,55867,143106
Government Activity,4779,4779,4779,4734,4779,4779,4779,2196,4095,4779,4779,4779,2163,2163,2163,4734
Graffiti and Vandalism,4270,4270,4270,4059,4270,4270,4270,1893,3776,4270,4270,4270,1877,1877,1877,4059
Other Account Services and Billing,64760,64760,64760,64703,64760,64760,64760,1037,1051,64760,64760,64760,1022,1022,1022,64703
Outside Agencies,1,1,1,0,1,1,1,0,0,1,1,1,0,0,0,0


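It's kind of like getting `value_counts()` on a column, except that `count()` reports the number of non-null values in every other column for each group. That's why the numbers differ across a row: columns like **APN** and **Date_Closed** have missing values.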

OK! So that's a new dataframe, with a little too much info. We're not going to do anything with this particular dataframe; I just wanted to show you how `groupby()` works so we can look specifically at how to use it for datetimes.
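
Here is a small sketch of that behavior on three made-up rows, one of which has no closing date. The `calls` dataframe is an assumption for the example.

```python
import pandas as pd

# Three made-up cases; the second one is still open (illustration only).
calls = pd.DataFrame({
    'Request_Category': [
        'Refuse and Recycling',
        'Refuse and Recycling',
        'Graffiti and Vandalism',
    ],
    'Case_ID': ['1', '2', '3'],
    'Date_Closed': pd.to_datetime(['2022-01-05', None, '2022-02-01']),
})

# count() tallies non-null values per column within each group.
per_category = calls.groupby(['Request_Category']).count()

# size() counts rows per group, whether or not values are missing.
rows_per_category = calls.groupby(['Request_Category']).size()

print(per_category)
print(rows_per_category)
```

For "Refuse and Recycling", **Case_ID** counts 2 but **Date_Closed** counts only 1, because the open case has no closing date.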

#### Use df.groupby() with datetimes

Now that we know a little bit about the `groupby()` method, let's figure out how to use this with dates.

It's a little tricky to group by datetimes. Instead of grouping by just a column name, we're going to have to use a helper class called `pd.Grouper`.

Before we run the code below, let's look at the different arguments we pass to it:

```python
pd.Grouper(key='Date_Opened', axis=0, freq='A')
```

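The `key` argument names the column to group on. The `axis` argument is `0`, which is also the default, so you can leave it out. In pandas, axis 0 means rows and axis 1 means columns. That means you can do column-wise calculations if your data is shaped differently.

The `freq` argument is `A`, which stands for "annual" or year (`Y` also works). Each group is labeled with the last day of its year, such as `2011-12-31`. Newer pandas releases (2.2 and later) deprecate `A` and `Y` here in favor of `YE`. You can see other [frequency arguments](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) in the official pandas documentation.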
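
As a rough sketch of how the pieces fit together, the example below groups three made-up cases by year. The `calls` dataframe and the `year_end` variable are assumptions for the example, and the `try`/`except` is only there so the sketch runs on both older and newer pandas releases, which spell the year-end alias differently.

```python
import pandas as pd

# Three made-up cases across two years (illustration only).
calls = pd.DataFrame({
    'Case_ID': ['1', '2', '3'],
    'Date_Opened': pd.to_datetime([
        '2021-03-01 09:00',
        '2021-11-15 14:30',
        '2022-06-20 08:15',
    ]),
})

# Newer pandas spells the year-end alias 'YE'; older releases only know 'A'.
try:
    pd.tseries.frequencies.to_offset('YE')
    year_end = 'YE'
except ValueError:
    year_end = 'A'

# Group rows into calendar years, then count the cases in each year.
per_year = calls.groupby(
    pd.Grouper(key='Date_Opened', freq=year_end)
)['Case_ID'].count()

print(per_year)
```

Each row of the result is labeled with December 31 of its year, which matches the labels in the lecture's output.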

In [52]:
berkeley_311_complete.groupby([pd.Grouper(key='Date_Opened', axis=0, freq='A')])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb139ed1e50>

Remember that running a `df.groupby()` doesn't do anything on its own; you have to chain that command with some kind of other method. Below, we're calling `.count()` on the DataFrameGroupBy object. Finally, we're naming our new dataframe `annual_cases`.

In [53]:
annual_cases = berkeley_311_complete.groupby([pd.Grouper(key='Date_Opened', axis=0, freq='A')]).count()

In [54]:
annual_cases

Unnamed: 0_level_0,Case_ID,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
Date_Opened,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2011-12-31,39708,39708,39703,39708,39708,39708,39708,17405,18215,39708,39708,39708,17266,17266,17266,39703
2012-12-31,46643,46643,46636,46643,46643,46643,46643,20919,22488,46643,46643,46643,20705,20705,20705,46636
2013-12-31,50309,50309,50233,50309,50309,50309,50309,25063,27362,50309,50309,50309,24791,24791,24791,50233
2014-12-31,53531,53531,53239,53531,53531,53531,53531,28423,31356,53531,53531,53531,28148,28148,28148,53239
2015-12-31,50334,50334,50031,50334,50334,50334,50334,27432,30466,50334,50334,50334,27087,27087,27087,50031
2016-12-31,50573,50573,50086,50573,50573,50573,50573,28281,31870,50573,50573,50573,27896,27896,27896,50086
2017-12-31,51325,51325,49428,51325,51325,51325,51325,29776,34188,51325,51325,51325,29395,29395,29395,49428
2018-12-31,57636,57636,54520,57636,57636,57636,57636,37882,42171,57636,57636,57636,37361,37361,37361,54520
2019-12-31,58633,58633,54732,58633,58633,58633,58633,39485,43863,58633,58633,58633,38952,38952,38952,54732
2020-12-31,56145,56145,50108,56145,56145,56145,56145,35340,38875,56145,56145,56145,34776,34776,34776,50108
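
The counts differ from column to column because `count()` tallies only non-missing values. A small sketch on invented data (not the real 311 records) shows the difference between `count()` and `size()`. It groups on `.dt.year` instead of `pd.Grouper` only to keep the example short.

```python
import pandas as pd

# Hypothetical toy data: one 2020 case is still open, so Date_Closed is missing
toy = pd.DataFrame({
    'Case_ID': ['a', 'b', 'c'],
    'Date_Opened': pd.to_datetime(['2019-05-01', '2020-01-15', '2020-06-30']),
    'Date_Closed': pd.to_datetime(['2019-05-03', '2020-01-20', None]),
})

by_year = toy.groupby(toy['Date_Opened'].dt.year)

counts = by_year.count()   # per column; missing values are skipped
sizes = by_year.size()     # one number per group: total rows

print(counts)
print(sizes)
```

That is why **Case_ID**, which is never missing, is a safe column to use as the number of cases.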


Now let's subset just the one column, **Case_ID**, from `annual_cases`, then reset the index so that `Date_Opened` becomes a regular column:

In [62]:
# annual_cases[
#     ['Case_ID']
# ]

In [56]:
annual_cases = annual_cases[['Case_ID']].reset_index()
annual_cases

Unnamed: 0,Date_Opened,Case_ID
0,2011-12-31,39708
1,2012-12-31,46643
2,2013-12-31,50309
3,2014-12-31,53531
4,2015-12-31,50334
5,2016-12-31,50573
6,2017-12-31,51325
7,2018-12-31,57636
8,2019-12-31,58633
9,2020-12-31,56145


#### Rename columns

In [57]:
annual_cases.rename(columns={'Case_ID': 'cases'}, inplace=True)

Let's take a look at our nicely named summary table:

In [58]:
annual_cases

Unnamed: 0,Date_Opened,cases
0,2011-12-31,39708
1,2012-12-31,46643
2,2013-12-31,50309
3,2014-12-31,53531
4,2015-12-31,50334
5,2016-12-31,50573
6,2017-12-31,51325
7,2018-12-31,57636
8,2019-12-31,58633
9,2020-12-31,56145


Let's create a new column in `annual_cases` called **year**, pulled out of **Date_Opened** with `.dt.year`.

In [59]:
annual_cases['year'] = annual_cases['Date_Opened'].dt.year

In [60]:
annual_cases

Unnamed: 0,Date_Opened,cases,year
0,2011-12-31,39708,2011
1,2012-12-31,46643,2012
2,2013-12-31,50309,2013
3,2014-12-31,53531,2014
4,2015-12-31,50334,2015
5,2016-12-31,50573,2016
6,2017-12-31,51325,2017
7,2018-12-31,57636,2018
8,2019-12-31,58633,2019
9,2020-12-31,56145,2020


At this point, I don't need the **Date_Opened** column anymore. So I can subset the dataframe with just the two columns I need. 

In [63]:
annual_cases = annual_cases[['year', 'cases']].copy()

In [64]:
annual_cases

Unnamed: 0,year,cases
0,2011,39708
1,2012,46643
2,2013,50309
3,2014,53531
4,2015,50334
5,2016,50573
6,2017,51325
7,2018,57636
8,2019,58633
9,2020,56145
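
The steps in this section (count, reset the index, rename, extract the year, subset) can also be written as one chain. This is a sketch of an alternative, not the method used above: it works on made-up rows and groups on `.dt.year` directly, so the year arrives as a column without a separate `pd.Grouper` step.

```python
import pandas as pd

# Hypothetical stand-in for berkeley_311_complete
toy = pd.DataFrame({
    'Case_ID': ['a', 'b', 'c', 'd'],
    'Date_Opened': pd.to_datetime(
        ['2019-02-01', '2019-11-30', '2020-03-15', '2020-07-04']
    ),
})

annual = (
    toy
    .groupby(toy['Date_Opened'].dt.year.rename('year'))['Case_ID']
    .count()                 # cases per year, indexed by year
    .rename('cases')         # name the resulting Series
    .reset_index()           # turn the year index into a column
)
print(annual)
```

Going step by step, as we did above, is easier to debug; the chain is handy once you know each step works.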


#### Visualize

In [65]:
alt.Chart(annual_cases).mark_bar().encode(
    x='year',
    y='cases'
)

That's pretty cool, but **year** shows up kind of weird. Let's make a very small alteration to the code.

Before you run the code below, notice that after `year` there's a colon and an `O`. The `O` is shorthand for "ordinal," and tells Altair to treat `year` as a set of ordered, discrete categories (one bar slot per year) rather than as a continuous quantity (a number that can take any value, decimals included).

In [70]:
alt.Chart(annual_cases.sort_values(by=['year'], ascending=True)).mark_bar().encode(
    x='year:O',
    y='cases'
).properties(
    title='Berkeley 311 calls: Number of cases'
)

You can read about more [Altair encoding types](https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types) in the documentation. It's helpful to get familiar with those encoding types in the event your chart doesn't look quite right. Try adjusting the encoding types on your own to see what happens.

### Median Close_Time by year

Now I'd like to try to get the median length of time it takes to close a case by year. I'm going to try something I think will work...

In [83]:
# median_close_time = berkeley_311_complete.groupby(
#     [pd.Grouper(key='Date_Opened', axis=0, freq='A') ]
# ).median(numeric_only=True)

It looks like that didn't work! Sometimes pandas doesn't behave the way you want it to. The problem is that we have too many columns that don't support calculating a median (for example, a bunch of text-only columns). So we'll subset the dataframe to just the two columns we want, then run the `groupby()` operation.

In [72]:
berkeley_311_complete.tail()

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
634098,121000890402,2021-11-12 08:17:00,Closed,2023-07-06 06:48:45,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,053 166400700,2810 MABEL ST,Berkeley,CA,Berkeley,37.854729,-122.284972,"(37.85472929, -122.28497197)",600 days 22:31:45
634099,121000880514,2021-09-29 10:06:00,Closed,2023-07-06 06:50:28,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,056 194401100,2422 FIFTH ST,Berkeley,CA,Berkeley,37.860555,-122.296828,"(37.86055493, -122.2968276)",644 days 20:44:28
634100,121000885374,2021-10-21 08:58:00,Closed,2023-07-06 01:04:50,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,052 156400900,2835 PRINCE ST,Berkeley,CA,Berkeley,37.855386,-122.248593,"(37.85538632, -122.24859265)",622 days 16:06:50
634101,121000888075,2021-11-01 15:52:00,Closed,2023-07-06 06:51:04,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,054 174208700,2770 MABEL ST,Berkeley,CA,Berkeley,37.855744,-122.285166,"(37.8557436, -122.28516553)",611 days 14:59:04
634102,121000883511,2021-10-13 10:59:00,Closed,2023-07-06 06:55:14,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,060 244500103,1700 HOPKINS ST PARK,Berkeley,CA,Berkeley,37.882542,-122.277477,"(37.88254238, -122.27747688)",630 days 19:56:14


In [75]:
berkeley_311_complete[
    ['Date_Opened', 'Close_Time']
]

Unnamed: 0,Date_Opened,Close_Time
0,2021-09-16 06:23:23,4 days 04:58:59
1,2021-09-13 10:50:00,NaT
2,2020-11-06 16:51:00,2 days 09:01:57
3,2020-11-06 16:38:00,2 days 09:03:12
4,2020-09-01 11:32:00,0 days 00:04:00
...,...,...
634098,2021-11-12 08:17:00,600 days 22:31:45
634099,2021-09-29 10:06:00,644 days 20:44:28
634100,2021-10-21 08:58:00,622 days 16:06:50
634101,2021-11-01 15:52:00,611 days 14:59:04


In [84]:
median_close_time = berkeley_311_complete[
    ['Date_Opened', 'Close_Time']
].groupby([pd.Grouper(key='Date_Opened', axis=0, freq='A') ]).median()

median_close_time

Unnamed: 0_level_0,Close_Time
Date_Opened,Unnamed: 1_level_1
2011-12-31,86 days 04:51:42
2012-12-31,15 days 23:56:43
2013-12-31,7 days 22:44:52
2014-12-31,10 days 06:55:19
2015-12-31,4 days 06:12:28.500000
2016-12-31,3 days 20:44:05
2017-12-31,3 days 01:18:33
2018-12-31,2 days 22:13:32.500000
2019-12-31,2 days 22:39:11
2020-12-31,1 days 08:17:00
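
An equivalent way to limit the calculation to one column is to select it after grouping. The sketch below uses invented close times and groups on `.dt.year` for brevity; both choices are assumptions for illustration rather than the approach used above.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Date_Opened': pd.to_datetime(
        ['2019-01-05', '2019-06-01', '2019-09-09', '2020-02-02']
    ),
    'Request_Category': ['Trees', 'Graffiti', 'Trees', 'Trees'],  # text: no median
    'Close_Time': pd.to_timedelta(['1 days', '3 days', '10 days', '2 days']),
})

# Picking the column after groupby() keeps text columns out of the median
median_by_year = (
    toy.groupby(toy['Date_Opened'].dt.year)['Close_Time'].median()
)
print(median_by_year)
```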


Below, I'm creating a new column called **year**, as we did before.

In [85]:
median_close_time['year'] = median_close_time['Date_Opened'].dt.year

KeyError: 'Date_Opened'

Oops! That didn't work because I forgot to reset the index. (Please don't copy these "mistakes" into your homework, lol.)

In [86]:
median_close_time = median_close_time.reset_index()
median_close_time['year'] = median_close_time['Date_Opened'].dt.year

In [87]:
median_close_time

Unnamed: 0,Date_Opened,Close_Time,year
0,2011-12-31,86 days 04:51:42,2011
1,2012-12-31,15 days 23:56:43,2012
2,2013-12-31,7 days 22:44:52,2013
3,2014-12-31,10 days 06:55:19,2014
4,2015-12-31,4 days 06:12:28.500000,2015
5,2016-12-31,3 days 20:44:05,2016
6,2017-12-31,3 days 01:18:33,2017
7,2018-12-31,2 days 22:13:32.500000,2018
8,2019-12-31,2 days 22:39:11,2019
9,2020-12-31,1 days 08:17:00,2020
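
The `KeyError` happens because `groupby()` moves the grouping key into the index, so it is no longer a column. A small sketch with invented values makes that visible:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'year': [2019, 2019, 2020],
    'days_to_close': [2.0, 4.0, 1.0],
})

grouped = toy.groupby('year').median()
print('year' in grouped.columns)   # False: 'year' is the index now
print(grouped.index.name)          # year

flat = grouped.reset_index()
print('year' in flat.columns)      # True: back to an ordinary column
```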


Renaming the columns:

In [88]:
median_close_time.columns = ['date_opened', 'close_time', 'year']

In [89]:
median_close_time

Unnamed: 0,date_opened,close_time,year
0,2011-12-31,86 days 04:51:42,2011
1,2012-12-31,15 days 23:56:43,2012
2,2013-12-31,7 days 22:44:52,2013
3,2014-12-31,10 days 06:55:19,2014
4,2015-12-31,4 days 06:12:28.500000,2015
5,2016-12-31,3 days 20:44:05,2016
6,2017-12-31,3 days 01:18:33,2017
7,2018-12-31,2 days 22:13:32.500000,2018
8,2019-12-31,2 days 22:39:11,2019
9,2020-12-31,1 days 08:17:00,2020


Subsetting the dataframe:

In [90]:
median_close_time = median_close_time[['year', 'close_time']].copy()

In [91]:
median_close_time

Unnamed: 0,year,close_time
0,2011,86 days 04:51:42
1,2012,15 days 23:56:43
2,2013,7 days 22:44:52
3,2014,10 days 06:55:19
4,2015,4 days 06:12:28.500000
5,2016,3 days 20:44:05
6,2017,3 days 01:18:33
7,2018,2 days 22:13:32.500000
8,2019,2 days 22:39:11
9,2020,1 days 08:17:00


Let's make a chart!

In [92]:
# alt.Chart(median_close_time).mark_bar().encode(
#     x='year:O',
#     y='close_time',
# ).properties(
#     title='Berkeley 311 calls: Median resolution time'
# )

ValueError: Field "close_time" has type "timedelta64[ns]" which is not supported by Altair. Please convert to either a timestamp or a numerical value.


ARRGHHH! That didn't work. Let's look at the error: 
```
ValueError: Field "close_time" has type "timedelta64[ns]" which is not supported by Altair. Please convert to either a timestamp or a numerical value.
```
It sounds like I need to convert the `timedelta` values to plain numbers. Let's try, er, nanoseconds, which is what we get when we cast a `timedelta64[ns]` column to `int`.

In [93]:
median_close_time['close_time_nanoseconds'] = median_close_time['close_time'].astype(int)
median_close_time

Unnamed: 0,year,close_time,close_time_nanoseconds
0,2011,86 days 04:51:42,7447902000000000
1,2012,15 days 23:56:43,1382203000000000
2,2013,7 days 22:44:52,686692000000000
3,2014,10 days 06:55:19,888919000000000
4,2015,4 days 06:12:28.500000,367948500000000
5,2016,3 days 20:44:05,333845000000000
6,2017,3 days 01:18:33,263913000000000
7,2018,2 days 22:13:32.500000,252812500000000
8,2019,2 days 22:39:11,254351000000000
9,2020,1 days 08:17:00,116220000000000
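
Nanoseconds are hard to read on a chart axis. As an alternative sketch (not what the cell above does), dividing a timedelta column by `pd.Timedelta(days=1)` gives a plain float number of days, and `.dt.total_seconds()` gives seconds. The two values below are copied from the table above.

```python
import pandas as pd

close_time = pd.Series(pd.to_timedelta(['86 days 04:51:42', '1 days 08:17:00']))

days = close_time / pd.Timedelta(days=1)      # float64 days
seconds = close_time.dt.total_seconds()       # float64 seconds

print(days.round(6).tolist())
print(seconds.tolist())
```

Either result is an ordinary numeric column, which is the kind of value the Altair error message asks for.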


Let's try this again! I'm going to pass a subset of the dataframe straight into `alt.Chart()` instead of creating a whole new dataframe. The subset is necessary because Altair won't accept a dataframe at all if any of its columns has a dtype it can't support. Use your discretion for when you want to do something like this.

In [101]:
# median_close_time[['year', 'close_time_nanoseconds']]

In [95]:
alt.Chart(median_close_time[['year', 'close_time_nanoseconds']]).mark_bar().encode(
    x='year:O',
    y='close_time_nanoseconds',
).properties(
    title='Berkeley 311 calls: Median resolution time'
)

### Merge two dataframes

Now I'd like to merge `median_close_time` and `annual_cases`. Why? Mostly because I'd like to teach you how to merge dataframes. But you can get a neat summary table this way. Let's look at both dataframes again:

In [96]:
annual_cases

Unnamed: 0,year,cases
0,2011,39708
1,2012,46643
2,2013,50309
3,2014,53531
4,2015,50334
5,2016,50573
6,2017,51325
7,2018,57636
8,2019,58633
9,2020,56145


In [97]:
median_close_time

Unnamed: 0,year,close_time,close_time_nanoseconds
0,2011,86 days 04:51:42,7447902000000000
1,2012,15 days 23:56:43,1382203000000000
2,2013,7 days 22:44:52,686692000000000
3,2014,10 days 06:55:19,888919000000000
4,2015,4 days 06:12:28.500000,367948500000000
5,2016,3 days 20:44:05,333845000000000
6,2017,3 days 01:18:33,263913000000000
7,2018,2 days 22:13:32.500000,252812500000000
8,2019,2 days 22:39:11,254351000000000
9,2020,1 days 08:17:00,116220000000000


Let's look at the arguments in `pd.merge()` before we run it:

```python
pd.merge(
    df1,
    df2,
    how='outer', # other options: 'inner', 'left', 'right'
    on='year',
    validate='1:1' # options: '1:m', 'm:m', 'm:1'
)
```
1. The first argument is the left-hand dataframe. The second argument is the right-hand dataframe. Why is it important that there's an order to dataframes? 

2. The `how` argument tells pandas how we'll merge the two dataframes. In this case, we'll use `outer`. But we could also use `left`, `right`, or `inner`. What does this mean? [Here are some visual examples of how joins work.](https://docs.google.com/spreadsheets/d/1SYukPLfuIkiqhIEPeXWXDBqClife8SoEgvHyBxw_ehs/edit) In this case, it doesn't matter which value we use for `how` because both dataframes have 10 rows with matching years. 

3. The `on` argument tells pandas which column key we're going to match on. In this case, we want the years to match up.

4. The `validate` argument is optional, but I recommend you learn how to use it. The value we used, `'1:1'`, means that each row in the left-hand dataframe should match at most 1 row in the right-hand dataframe, and vice versa. Pandas checks that the `on` key is unique in both dataframes and raises an error if it isn't. The option `'1:m'` means that 1 row in the left-hand dataframe could match up to **many** rows in the right-hand dataframe. (Any time you use `m`, you're telling pandas that there _might_ be multiple matches.)
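
To make `how` and `validate` concrete, here is a minimal sketch with two invented tables whose years only partly overlap (unlike our real dataframes, where every year matches). The column `median_days` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables with only partly overlapping years
cases = pd.DataFrame({'year': [2018, 2019, 2020], 'cases': [10, 12, 9]})
medians = pd.DataFrame({'year': [2019, 2020, 2021], 'median_days': [3.0, 2.5, 1.0]})

# 'inner' keeps only years found in both tables; 'outer' keeps every year
inner = pd.merge(cases, medians, on='year', how='inner', validate='1:1')
outer = pd.merge(cases, medians, on='year', how='outer', validate='1:1')
print(len(inner), len(outer))

# A repeated key on the right breaks the 1:1 promise, so pandas raises
duplicated = pd.concat([medians, medians.iloc[[0]]], ignore_index=True)
try:
    pd.merge(cases, duplicated, on='year', how='inner', validate='1:1')
except pd.errors.MergeError as err:
    print('MergeError:', err)
```

In the outer result, years that exist in only one table get `NaN` in the columns that came from the other table.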


In [98]:
annual_summary = pd.merge(
    annual_cases,
    median_close_time,
    on='year',
    how='outer',
    validate='1:1'
)

In [99]:
annual_summary

Unnamed: 0,year,cases,close_time,close_time_nanoseconds
0,2011,39708,86 days 04:51:42,7447902000000000
1,2012,46643,15 days 23:56:43,1382203000000000
2,2013,50309,7 days 22:44:52,686692000000000
3,2014,53531,10 days 06:55:19,888919000000000
4,2015,50334,4 days 06:12:28.500000,367948500000000
5,2016,50573,3 days 20:44:05,333845000000000
6,2017,51325,3 days 01:18:33,263913000000000
7,2018,57636,2 days 22:13:32.500000,252812500000000
8,2019,58633,2 days 22:39:11,254351000000000
9,2020,56145,1 days 08:17:00,116220000000000


In [102]:
# 86,400 seconds per day x 1,000,000,000 nanoseconds per second = nanoseconds per day
annual_summary['close_time_days'] = annual_summary['close_time_nanoseconds'] / 86400000000000
annual_summary

Unnamed: 0,year,cases,close_time,close_time_nanoseconds,close_time_days
0,2011,39708,86 days 04:51:42,7447902000000000,86.202569
1,2012,46643,15 days 23:56:43,1382203000000000,15.99772
2,2013,50309,7 days 22:44:52,686692000000000,7.947824
3,2014,53531,10 days 06:55:19,888919000000000,10.288414
4,2015,50334,4 days 06:12:28.500000,367948500000000,4.258663
5,2016,50573,3 days 20:44:05,333845000000000,3.863947
6,2017,51325,3 days 01:18:33,263913000000000,3.054549
7,2018,57636,2 days 22:13:32.500000,252812500000000,2.926071
8,2019,58633,2 days 22:39:11,254351000000000,2.943877
9,2020,56145,1 days 08:17:00,116220000000000,1.345139


In [104]:
alt.Chart(annual_summary[['year', 'close_time_days']]).mark_bar().encode(
    x='year:O',
    y='close_time_days',
).properties(
    title='Berkeley 311 calls: Median resolution time'
)

In [100]:
annual_summary.to_csv('berkeley_311_annual_summary.csv', index=False)
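
Because a CSV stores only text, the `close_time` column will not come back as a timedelta when the summary is read again. A minimal sketch of restoring it, using an in-memory buffer and a two-row stand-in instead of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for annual_summary
summary = pd.DataFrame({
    'year': [2019, 2020],
    'close_time': pd.to_timedelta(['2 days 22:39:11', '1 days 08:17:00']),
})

buffer = io.StringIO()
summary.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded['close_time'].dtype)  # text, not timedelta

# Parse the text back into timedeltas
reloaded['close_time'] = pd.to_timedelta(reloaded['close_time'])
print(reloaded['close_time'].dtype)
```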