# Lecture 1023

Introduction to pandas using Berkeley 311 call data.

## Modules

A **module** is a way for Python to store functions and variables so they can be reused. You `import` modules to use those functions and variables. Typically, all the `import` statements are at the top of a file or notebook.

In [1]:
import pandas as pd
import requests

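The `requests` module is not part of Python's standard library, but it comes preinstalled in many Python environments. DataHub already has `pandas` installed; if you're working on your own computer, you will need to install `pandas` yourself (and `requests`, if it's missing).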

A **package** is a kind of module that uses "dotted module names." That just means you can use methods with the name of the package followed by a period and the method, e.g. `requests.get()`.

When you `import` a module, you can give it an alias using `as`. Above, we gave the pandas module the alias of `pd`. That means we can call the library’s methods using `pd.read_csv()` instead of `pandas.read_csv()`. We always `import pandas as pd` because that's what the pandas-using community has decided as a convention.

Sometimes we call modules like pandas "libraries." That's a general software term; there's no special Python thing called "library."

## What is pandas?

It's a Python Data Analysis library.

[Pandas](https://pandas.pydata.org/) is a library that allows you to view table data and perform lots of different kinds of operations on that table. In pandas, a table is called a **dataframe**. If you’ve used Excel or Google Sheets, a dataframe should look familiar to you. There are rows and columns. You have column headers. You have discrete rows.

## Download data

We're going to download [311 call data](https://data.cityofberkeley.info/311/311-Cases-COB/bscu-qpbu) from the City of Berkeley's Open Data Portal. You can run the cell below. It will download the data into a file called `berkeley_311.csv`.

If you want to download something straight from the Internet again, you can copy this code but swap out the url and the file name. (But don't forget to `import requests` at the top of your notebook.)

In [2]:
url = 'https://data.cityofberkeley.info/api/views/bscu-qpbu/rows.csv?accessType=DOWNLOAD'
r = requests.get(url, allow_redirects=False)

# write the content of the request into a file called `berkeley_311.csv`
open('berkeley_311.csv', 'wb').write(r.content)

155874542

## Import the csv into a `pandas` dataframe

We use a method called `pd.read_csv()` to import a csv file into a dataframe.

Make sure you assign the dataframe into a variable. Below, we're calling the dataframe `berkeley_311_original`.

In [3]:
berkeley_311_original = pd.read_csv('berkeley_311.csv')

In these notes, I'm going to use the term `df` to stand for 'dataframe' — you will see `df` when you're searching the Internet for answers. In your actual work, you'll replace `df` with whatever you called your dataframe; in this case, that's `berkeley_311_original`.
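
If you want to try `pd.read_csv()` without downloading the full 311 file, here's a minimal sketch. The tiny CSV text and the name `sample_df` are made up for illustration; `sample_df` plays the role of `df`.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for a real file on disk
csv_text = """Case_ID,Case_Status,City
101,Closed,Berkeley
102,Open,Berkeley
"""

# pd.read_csv() accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)           # (number of rows, number of columns)
print(list(sample_df.columns))   # the column headers
```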

## View data

### `df.head()` and `df.tail()`

Use `df.head()` to view the first 5 rows of the dataframe.

In [4]:
berkeley_311_original.head()

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location
0,121000877593,09/16/2021 06:23:23 AM,Closed,09/20/2021 11:22:22 AM,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,
1,121000876647,09/13/2021 10:50:00 AM,Open,,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)"
2,121000809740,11/06/2020 04:51:00 PM,Closed,11/09/2020 01:52:57 AM,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,
3,121000809739,11/06/2020 04:38:00 PM,Closed,11/09/2020 01:41:12 AM,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)"
4,121000793663,09/01/2020 11:32:00 AM,Closed,09/01/2020 11:36:00 AM,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,


You'll notice that there are numbers `0`, `1`, `2`, `3`, and `4` added to the dataframe to the left. That's called the **index** of the dataframe. An index is basically a row id.

You can also use `df.head(20)` to view the first 20 rows. 

How would you view the last 5 rows? Use `df.tail()`. (`head` and `tail` are also the names of common command-line tools for viewing the start and end of a file.)

In [5]:
berkeley_311_original.tail()

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location
704676,121001011148,06/12/2023 01:06:00 PM,Closed,07/06/2023 01:34:09 PM,Refuse and Recycling,Commercial,Commercial Site Inspection,Property,058 217100500,1924 CEDAR ST,Berkeley,CA,Berkeley,37.877781,-122.272751,"(37.87778105, -122.27275061)"
704677,121001014556,06/29/2023 02:01:00 PM,Closed,07/06/2023 03:26:38 PM,Refuse and Recycling,Request,Cart Repair,Property,053 167101600,1513 OREGON ST,Berkeley,CA,Berkeley,37.85637,-122.279024,"(37.85636976, -122.27902376)"
704678,121001005559,05/17/2023 11:56:00 AM,Closed,07/06/2023 12:07:40 PM,Refuse and Recycling,Residential,Residential Site Inspection,Property,055 191404100,2326 SPAULDING AVE,Berkeley,CA,Berkeley,37.864922,-122.280641,"(37.86492171, -122.28064106)"
704679,121000883511,10/13/2021 10:59:00 AM,Closed,07/06/2023 06:55:14 AM,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,060 244500103,1700 HOPKINS ST PARK,Berkeley,CA,Berkeley,37.882542,-122.277477,"(37.88254238, -122.27747688)"
704680,121001013680,06/26/2023 12:42:00 PM,Closed,07/06/2023 11:47:40 AM,Refuse and Recycling,Residential,Residential Cart Size Decrease,Property,052 156509000,2946 MAGNOLIA ST,Berkeley,CA,Berkeley,37.857004,-122.249211,"(37.8570041, -122.24921146)"
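
Both methods take an optional number of rows. Here's a small sketch on a made-up 12-row dataframe (`toy_df` is a hypothetical name, not part of the 311 data):

```python
import pandas as pd

# Hypothetical toy dataframe with 12 rows, indexed 0 through 11
toy_df = pd.DataFrame({'n': range(12)})

first_three = toy_df.head(3)   # rows with index 0, 1, 2
last_two = toy_df.tail(2)      # rows with index 10, 11

print(first_three)
print(last_two)
```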


If you call the dataframe on its own, you'll get both the first 5 rows and the last 5 rows.

In [6]:
berkeley_311_original

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location
0,121000877593,09/16/2021 06:23:23 AM,Closed,09/20/2021 11:22:22 AM,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,
1,121000876647,09/13/2021 10:50:00 AM,Open,,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)"
2,121000809740,11/06/2020 04:51:00 PM,Closed,11/09/2020 01:52:57 AM,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,
3,121000809739,11/06/2020 04:38:00 PM,Closed,11/09/2020 01:41:12 AM,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)"
4,121000793663,09/01/2020 11:32:00 AM,Closed,09/01/2020 11:36:00 AM,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
704676,121001011148,06/12/2023 01:06:00 PM,Closed,07/06/2023 01:34:09 PM,Refuse and Recycling,Commercial,Commercial Site Inspection,Property,058 217100500,1924 CEDAR ST,Berkeley,CA,Berkeley,37.877781,-122.272751,"(37.87778105, -122.27275061)"
704677,121001014556,06/29/2023 02:01:00 PM,Closed,07/06/2023 03:26:38 PM,Refuse and Recycling,Request,Cart Repair,Property,053 167101600,1513 OREGON ST,Berkeley,CA,Berkeley,37.856370,-122.279024,"(37.85636976, -122.27902376)"
704678,121001005559,05/17/2023 11:56:00 AM,Closed,07/06/2023 12:07:40 PM,Refuse and Recycling,Residential,Residential Site Inspection,Property,055 191404100,2326 SPAULDING AVE,Berkeley,CA,Berkeley,37.864922,-122.280641,"(37.86492171, -122.28064106)"
704679,121000883511,10/13/2021 10:59:00 AM,Closed,07/06/2023 06:55:14 AM,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,060 244500103,1700 HOPKINS ST PARK,Berkeley,CA,Berkeley,37.882542,-122.277477,"(37.88254238, -122.27747688)"


### `df.info()`

Use `df.info()` to get more information on the dataframe. In particular, this method is useful in that it shows us the column names and what `dtype` the column is. (I'll explain `dtype` below.)

In [7]:
berkeley_311_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 704681 entries, 0 to 704680
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Case_ID              704681 non-null  int64  
 1   Date_Opened          704681 non-null  object 
 2   Case_Status          704681 non-null  object 
 3   Date_Closed          667181 non-null  object 
 4   Request_Category     704681 non-null  object 
 5   Request_SubCategory  704681 non-null  object 
 6   Request_Detail       704681 non-null  object 
 7   Object_Type          704681 non-null  object 
 8   APN                  413089 non-null  object 
 9   Street_Address       455055 non-null  object 
 10  City                 704681 non-null  object 
 11  State                704681 non-null  object 
 12  Neighborhood         704681 non-null  object 
 13  Latitude             407748 non-null  float64
 14  Longitude            407748 non-null  float64
 15  Location         

### What is dtype?

You remember when we talked about Python **types**, like `int`, `float`, and `str`? Above, we have `int64` instead of `int`, and `float64` instead of `float`. (`object` is pretty close to `str`, but not exactly.)

Here, **dtype** stands for **data type** and comes from a module called `numpy`. Even though we did not `import numpy`, the pandas module imported `numpy` within its own code.

Side note on dtypes: Sometimes, data doesn't import correctly, and you have to call `pd.read_csv()` again while simultaneously specifying the dtype. We're not going to do that today because it looks like most of the data imported OK. But we will convert the dtype of two columns so we can perform certain calculations.
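
Here's a sketch of what specifying the dtype on import can look like. The ZIP code data is invented for illustration; it's a classic case where the default import loses information.

```python
import io
import pandas as pd

# Hypothetical CSV: one ZIP code starts with a zero
csv_text = "name,zip\nAlice,02139\nBob,94704\n"

# Default import: pandas guesses that `zip` is a number
default_df = pd.read_csv(io.StringIO(csv_text))

# Import again, telling pandas to keep `zip` as text
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={'zip': str})

print(default_df['zip'].tolist())   # numbers, so the leading zero is lost
print(typed_df['zip'].tolist())     # strings, so the leading zero is kept
```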

Let's take a look at a bit of the dataframe again.

In [8]:
berkeley_311_original.head()

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location
0,121000877593,09/16/2021 06:23:23 AM,Closed,09/20/2021 11:22:22 AM,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,
1,121000876647,09/13/2021 10:50:00 AM,Open,,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)"
2,121000809740,11/06/2020 04:51:00 PM,Closed,11/09/2020 01:52:57 AM,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,
3,121000809739,11/06/2020 04:38:00 PM,Closed,11/09/2020 01:41:12 AM,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)"
4,121000793663,09/01/2020 11:32:00 AM,Closed,09/01/2020 11:36:00 AM,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,


There are 2 columns that we want to convert. **Date_Opened** and **Date_Closed** are both `object` dtype, but we want to change them to a `datetime64` dtype. That allows us to do some math operations, like finding the earliest date in the dataframe or sorting by date.

In [9]:
berkeley_311_original['Date_Opened'].min()

'01/01/2014 02:44:55 PM'

In [10]:
berkeley_311_original['Date_Opened'].max()

'12/31/2022 12:27:00 PM'

The results above are wrong: the values are being compared as strings, not as actual dates!
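
To see why, here's a small sketch with two made-up dates in the same format as the 311 data. It previews `pd.to_datetime()`, which is introduced in the next section; the `format` string is my reading of the sample rows, so treat it as an assumption.

```python
import pandas as pd

# Hypothetical dates: the first one is actually the earlier date
dates = pd.Series(['12/31/2013 09:00:00 AM', '01/01/2014 02:44:55 PM'])

# Compared as strings: '01/...' comes before '12/...', whatever the year
string_min = dates.min()

# Compared as dates: 2013 comes before 2014
parsed = pd.to_datetime(dates, format='%m/%d/%Y %I:%M:%S %p')
date_min = parsed.min()

print(string_min)
print(date_min)
```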

## Properly type the data

### Copy the original dataframe

Before we start converting the 2 columns, let's copy the original dataframe into a new dataframe. Below, I'm going to use `df.copy()` to create a copy of the original dataframe. We're not going to alter the original dataframe at all. That way, if we run into any problems later, we can compare our edited dataframe with the original dataframe.

In [11]:
berkeley_311 = berkeley_311_original.copy()
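
Why not just write `berkeley_311 = berkeley_311_original`? Because that only gives the same dataframe a second name. Here's a sketch with a made-up two-row dataframe (the names `original`, `alias`, and `independent` are hypothetical):

```python
import pandas as pd

original = pd.DataFrame({'status': ['Open', 'Closed']})

alias = original                 # same dataframe, just a second name
independent = original.copy()    # a separate dataframe

# Edit the copy only
independent.loc[0, 'status'] = 'Edited'

print(alias is original)           # the alias is the very same object
print(original.loc[0, 'status'])   # the original is untouched
```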

### Convert columns to `datetime`

Let's convert **Date_Opened** first so we can contrast the two columns. The syntax for this conversion is:

```python
df['column_name'] = pd.to_datetime(df['column_name'])
```

Note: This might take a long time.

In [12]:
berkeley_311['Date_Opened'] = pd.to_datetime(berkeley_311['Date_Opened'])

By the way, `berkeley_311['Date_Opened']` is a pandas **series**. We don't have to worry too much about that right now, but I want you to have the right terminology. We're converting a series to a version of itself that passed through the `pd.to_datetime()` method.
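
If the conversion is slow, one option is to tell pandas the date format up front with the `format` argument, so it doesn't have to guess. This is a sketch, not what the cell above does: the format string is my reading of the sample rows, and the `raw` series is a small stand-in built from two values shown earlier plus one missing value.

```python
import pandas as pd

# Stand-in series: two dates from the sample rows, plus a missing value
raw = pd.Series(['09/16/2021 06:23:23 AM', '11/06/2020 04:51:00 PM', None])

# %I is the 12-hour clock and %p is AM/PM
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')

print(parsed.dtype)
print(parsed)   # the missing value becomes NaT ("not a time")
```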

Look at the dataframe now and compare **Date_Opened** and **Date_Closed**.

In [13]:
berkeley_311.head()

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location
0,121000877593,2021-09-16 06:23:23,Closed,09/20/2021 11:22:22 AM,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,
1,121000876647,2021-09-13 10:50:00,Open,,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)"
2,121000809740,2020-11-06 16:51:00,Closed,11/09/2020 01:52:57 AM,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,
3,121000809739,2020-11-06 16:38:00,Closed,11/09/2020 01:41:12 AM,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)"
4,121000793663,2020-09-01 11:32:00,Closed,09/01/2020 11:36:00 AM,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,


See how they look different? You can also call `df.info()` again.

In [14]:
berkeley_311.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 704681 entries, 0 to 704680
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   Case_ID              704681 non-null  int64         
 1   Date_Opened          704681 non-null  datetime64[ns]
 2   Case_Status          704681 non-null  object        
 3   Date_Closed          667181 non-null  object        
 4   Request_Category     704681 non-null  object        
 5   Request_SubCategory  704681 non-null  object        
 6   Request_Detail       704681 non-null  object        
 7   Object_Type          704681 non-null  object        
 8   APN                  413089 non-null  object        
 9   Street_Address       455055 non-null  object        
 10  City                 704681 non-null  object        
 11  State                704681 non-null  object        
 12  Neighborhood         704681 non-null  object        
 13  Latitude      

Let's now convert the 2nd column `Date_Closed`:

In [15]:
berkeley_311['Date_Closed'] = pd.to_datetime(berkeley_311['Date_Closed'])

And let's take a peek at the change:

In [16]:
berkeley_311.head()

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location
0,121000877593,2021-09-16 06:23:23,Closed,2021-09-20 11:22:22,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,
1,121000876647,2021-09-13 10:50:00,Open,NaT,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)"
2,121000809740,2020-11-06 16:51:00,Closed,2020-11-09 01:52:57,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,
3,121000809739,2020-11-06 16:38:00,Closed,2020-11-09 01:41:12,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)"
4,121000793663,2020-09-01 11:32:00,Closed,2020-09-01 11:36:00,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,


By the way, you might notice that we call `df.head()` and `df.tail()` a lot to check our work. That's OK. Later on, you will be able to run more code at once before calling one of those methods, but for now, it's good to check your work often.

### .min() and .max()

Now we can find the earliest date and the latest date of both columns, by performing `series.min()` and `series.max()` on these series. (There's no equivalent of `df`/dataframe for **series**, unfortunately.)

In [17]:
berkeley_311['Date_Opened'].min()

Timestamp('2010-02-05 11:46:34')

In [18]:
berkeley_311['Date_Opened'].max()

Timestamp('2023-10-22 23:42:17')

The min and max tell us that the year 2010 is not complete (and for that matter, neither is the current year, although we wouldn't need pandas to tell us that). If we analyze the data by year later, we might want to exclude 2010 and 2023 data.

### Get the difference of the two date columns

Pandas allows you to get the difference of two dates by literally subtracting one datetime column from another. We'll create a new column called **Close_Time** that shows us how long it took for a case to be closed.

In [19]:
berkeley_311['Close_Time'] = berkeley_311['Date_Closed'] - berkeley_311['Date_Opened']

The resulting column will not be a `datetime` dtype. It will be a `timedelta` dtype. The term "delta" is often used to mean "change." Observe the last column:

In [20]:
berkeley_311.head()

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
0,121000877593,2021-09-16 06:23:23,Closed,2021-09-20 11:22:22,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,,4 days 04:58:59
1,121000876647,2021-09-13 10:50:00,Open,NaT,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)",NaT
2,121000809740,2020-11-06 16:51:00,Closed,2020-11-09 01:52:57,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,,2 days 09:01:57
3,121000809739,2020-11-06 16:38:00,Closed,2020-11-09 01:41:12,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)",2 days 09:03:12
4,121000793663,2020-09-01 11:32:00,Closed,2020-09-01 11:36:00,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,,0 days 00:04:00
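
A `timedelta` column is nice to read, but sometimes you want a plain number. Here's a sketch using two rows' worth of dates from the output above; the column names `Close_Days` and `Close_Hours` are my own additions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Date_Opened': pd.to_datetime(['2021-09-16 06:23:23', '2020-09-01 11:32:00']),
    'Date_Closed': pd.to_datetime(['2021-09-20 11:22:22', '2020-09-01 11:36:00']),
})
toy['Close_Time'] = toy['Date_Closed'] - toy['Date_Opened']

# .dt.days keeps only the whole days
toy['Close_Days'] = toy['Close_Time'].dt.days

# .dt.total_seconds() gives the full duration, converted here to hours
toy['Close_Hours'] = toy['Close_Time'].dt.total_seconds() / 3600

print(toy[['Close_Time', 'Close_Days', 'Close_Hours']])
```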


## A brief detour into data analysis

Now that we've converted the columns, we can do some interesting operations on them.

### Mean

Get the mean of a column by calling `series.mean()`.

In [21]:
berkeley_311['Close_Time'].mean()

Timedelta('54 days 23:47:00.934140811')

The average time the city took to close a case was around 55 days. This is for the whole dataset, from early 2010 to now.

### Median

In [22]:
berkeley_311['Close_Time'].median()

Timedelta('4 days 16:58:12')

But the median time was around 5 days.

### Min

In [23]:
berkeley_311['Close_Time'].min()

Timedelta('-110 days +02:40:00')

The shortest amount of time is negative: roughly -110 days. That means at least one case has a **Date_Closed** that is earlier than its **Date_Opened**, which is probably a data-entry error. We can check on that later.
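
The minimum in the output above is a negative `Timedelta`. One way to find rows like that is to compare the column against a zero-length `pd.Timedelta`. Here's a sketch on made-up data (three hypothetical cases, one closed before it was opened and one still open):

```python
import pandas as pd

toy = pd.DataFrame({
    'Case_ID': [1, 2, 3],
    'Date_Opened': pd.to_datetime(['2020-01-10', '2020-05-01', '2020-03-01']),
    'Date_Closed': pd.to_datetime(['2020-01-12', '2020-01-12', None]),
})
toy['Close_Time'] = toy['Date_Closed'] - toy['Date_Opened']

# Rows whose close date is earlier than their open date.
# Open cases (NaT) are not matched by the comparison.
closed_before_opened = toy[toy['Close_Time'] < pd.Timedelta(0)]

print(closed_before_opened)
```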

### Max

In [24]:
berkeley_311['Close_Time'].max()

Timedelta('3373 days 22:56:59')

One case seems to have taken 3,373 days! That's more than 9 years. That seems like way too long. There might be an error here.

Let's take a quick detour into _subsetting_ the data. That means to take a smaller set of the data based on some conditions. I'll explain how to subset more later, but for now, check out the following code:

In [25]:
berkeley_311[berkeley_311['Close_Time'] >= '3373 days']

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
20842,121000033912,2010-12-08 12:52:11,Closed,2020-03-04 11:49:10,Refuse and Recycling,Account Services and Billing,Payment Collection - Refuse and Recycling,Individual,,,Berkeley,CA,Berkeley,,,,3373 days 22:56:59


It's hard to know why it took so long to close this case without asking the city for more information.

### Sort data

We can even look at the top 10 cases that took the longest time to resolve. You'll use the `df.sort_values()` method.

Let's break down the below code before we run it. You can see there are 2 options within the parentheses for `.sort_values()`:
```python
by=['Close_Time'], ascending=False
```

The `by` argument tells us which column we will sort the dataframe by. You always need to include this argument. You can sort by multiple columns, too.

The optional `ascending` argument tells us if we want the dataframe to sort from smallest to largest or earliest to latest. By default, `ascending` is set to `True`, so we're going to change it here so it's `False`.

Next, I don't want to see the entire dataframe, just the first 10 rows. So I'm going to call `df.head(10)`.

In [26]:
berkeley_311.sort_values(by=['Close_Time'], ascending=False).head(10)

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
20842,121000033912,2010-12-08 12:52:11,Closed,2020-03-04 11:49:10,Refuse and Recycling,Account Services and Billing,Payment Collection - Refuse and Recycling,Individual,,,Berkeley,CA,Berkeley,,,,3373 days 22:56:59
18295,121000020729,2010-08-01 09:46:00,Closed,2019-08-25 05:45:00,"Streets, Utilities, and Transportation",Sidewalk/Street Maintenance,Potholes,Street,,,Berkeley,CA,Berkeley,,,,3310 days 19:59:00
659660,121000195202,2014-05-22 18:28:00,Closed,2023-03-16 10:28:22,General Questions/information,Miscellaneous,Reclassified Test Records DUMMY,Property,057 202100100,2180 MILVIA ST,Berkeley,CA,Berkeley,37.869453,-122.270949,"(37.86945314, -122.27094931)",3219 days 16:00:22
176579,121000170060,2013-11-25 16:19:58,Closed,2021-12-27 09:39:10,Refuse and Recycling,Residential,Residential Missed Pickup Integration,Property,056 198300200,1114 COWPER ST,Berkeley,CA,Berkeley,37.867101,-122.29084,"(37.86710133, -122.29083996)",2953 days 17:19:12
21462,121000073543,2011-12-02 13:46:57,Closed,2020-01-02 12:43:50,Refuse and Recycling,Residential,Residential Service Start,Property,054 179101800,2617 MABEL ST,Berkeley,CA,Berkeley,37.859285,-122.284746,"(37.85928516, -122.28474645)",2952 days 22:56:53
666308,121000239766,2015-03-27 16:12:07,Closed,2022-11-14 09:04:50,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,055 190400200,1808 BANCROFT WAY,Berkeley,CA,Berkeley,37.866532,-122.274255,"(37.8665324, -122.27425479)",2788 days 16:52:43
666204,121000245510,2015-05-12 09:05:50,Closed,2022-11-14 09:05:47,General Questions/information,Miscellaneous,Miscellaneous Service Request,Street,,,Berkeley,CA,Berkeley,,,,2742 days 23:59:57
179587,121000235500,2015-02-26 13:17:17,Closed,2022-03-28 12:59:16,Refuse and Recycling,Account Services and Billing,Account Adjustment Research - Refuse and Recyc...,Individual,,,Berkeley,CA,Berkeley,,,,2586 days 23:41:59
194781,121000258817,2015-08-13 08:26:19,Closed,2022-07-14 15:38:18,"Streets, Utilities, and Transportation",Sidewalk/Street Maintenance,Potholes,Property,,"Intersection of Grizzly Peak and Marin, BERKEL...",Berkeley,CA,Berkeley,,,,2527 days 07:11:59
194997,121000258965,2015-08-13 15:09:25,Closed,2022-07-14 15:38:51,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,,"Intersection of Euclid and Marin, BERKELEY, CA",Berkeley,CA,Berkeley,,,,2527 days 00:29:26
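
Since `by` takes a list, you can sort by more than one column, and `ascending` can take a matching list of `True`/`False` values. A sketch on invented data (the shortened category names are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    'Request_Category': ['Trees', 'Refuse', 'Trees', 'Refuse'],
    'Close_Time': pd.to_timedelta(['3 days', '10 days', '7 days', '1 days']),
})

# Category from A to Z, then the longest close time first within each category
sorted_toy = toy.sort_values(
    by=['Request_Category', 'Close_Time'],
    ascending=[True, False],
)

print(sorted_toy)
```

Notice that the index values travel with their rows, so after sorting they're no longer in numeric order.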


So that's a preview of an interesting analysis we can do. I showed you one fun part before we moved onto the harder part, checking and vetting the data. Usually, we need to do that first. But we did convert the columns to datetime, which is part of making sure the data was valid.

## Clean data, part 2: Check and vet the data

### Unique identifier for every row?

First, I want to see if `Case_ID` has a unique ID for every row. Why? When you're doing a data analysis, every row should have its own unique ID. Hopefully, the agency that gave you the data has provided a unique ID. Sometimes, though, they don't. In those cases, you want to create a unique ID for every row.

I get a count of unique values by calling `.nunique()` on a column.

In [27]:
berkeley_311['Case_ID'].nunique()

704673

How many rows do we have again? We can use `df.info()` to get the number of rows, or we can scroll up to see again. I'm feeling lazy, so let's just call `len(df)`. (Do you remember that we learned `len()` for both strings and data structures earlier?)

In [28]:
len(berkeley_311)

704681

There might be duplicates or there could be missing data. Let's check for both.
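
Here's a quick sketch of how those two explanations can be told apart, using a made-up ID column. By default, `.nunique()` ignores missing values, which is why missing data could also explain a gap between the two counts.

```python
import pandas as pd

# Hypothetical IDs: 102 appears twice and 103 appears three times
toy = pd.DataFrame({'Case_ID': [101, 102, 102, 103, 103, 103]})

n_rows = len(toy)
n_unique = toy['Case_ID'].nunique()
n_missing = toy['Case_ID'].isna().sum()          # count of missing IDs
n_repeats = toy['Case_ID'].duplicated().sum()    # rows that repeat an earlier ID

print(n_rows, n_unique, n_missing, n_repeats)
```

With no missing values, the number of repeats accounts for the whole gap between the row count and the unique count.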

We're going to check by **subsetting** the data.

### Subsetting

This is the general structure of how you subset data in pandas.

```python
df[ expression ]
```

That's not very descriptive. What's the _expression_? There are lots of different ways we write these expressions in pandas. I'm going to show you a handful of different kinds today, but know there are a bunch more!
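
One common kind of expression is a comparison, which produces a series of `True`/`False` values with one value per row. A minimal sketch on made-up data (`mask` and `open_cases` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    'Case_ID': [1, 2, 3, 4],
    'Case_Status': ['Open', 'Closed', 'Closed', 'Open'],
})

mask = toy['Case_Status'] == 'Open'   # a series of True/False values
open_cases = toy[mask]                # keeps only the rows where mask is True

print(mask)
print(open_cases)
```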

Ultimately, what I want is a list of the duplicate `Case_ID`s. I'll then subset the dataframe to show any row that has a Case_ID that is on that list.


We'll first check to see which rows in `berkeley_311` are exact duplicates.

In [29]:
berkeley_311[berkeley_311.duplicated()]

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
186417,121000910652,2022-02-22 16:38:00,Closed,2022-02-23 09:18:38,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,060 247503700,1328 LA LOMA AVE,Berkeley,CA,Berkeley,37.884153,-122.255634,"(37.88415345, -122.25563374)",0 days 16:40:38
701476,121001035218,2023-10-15 12:44:00,Closed,2023-10-16 10:50:44,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,060 247503700,1328 LA LOMA AVE,Berkeley,CA,Berkeley,37.884153,-122.255634,"(37.88415345, -122.25563374)",0 days 22:06:44


That returned 2 rows, so the dataframe does contain exact duplicates: each of those rows matches an earlier row in every column. Now let's check specifically to see rows in which `Case_ID`s are duplicated.

In [30]:
berkeley_311[berkeley_311['Case_ID'].duplicated()]

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
65650,121000771810,2020-05-28 12:52:00,Closed,2020-05-28 14:32:46,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",0 days 01:40:46
186417,121000910652,2022-02-22 16:38:00,Closed,2022-02-23 09:18:38,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,060 247503700,1328 LA LOMA AVE,Berkeley,CA,Berkeley,37.884153,-122.255634,"(37.88415345, -122.25563374)",0 days 16:40:38
308596,121000168293,2013-11-13 16:56:22,Closed,2013-11-18 16:16:10,Refuse and Recycling,Request,Cart Repair,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",4 days 23:19:48
334433,121000212784,2014-09-13 19:24:04,Closed,2014-09-22 08:59:39,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",8 days 13:35:35
350841,121000056432,2011-07-11 09:26:45,Closed,2011-07-14 12:56:18,Refuse and Recycling,Residential,Residential Service Start,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",3 days 03:29:33
451739,121000345232,2017-04-28 09:11:28,Closed,2017-05-04 09:32:37,Refuse and Recycling,Residential,Residential Lost or Stolen Cart,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",6 days 00:21:09
496161,121000212850,2014-09-15 09:37:45,Closed,2014-09-18 18:17:05,Refuse and Recycling,Request,Stray Cart Removal,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",3 days 08:39:20
701476,121001035218,2023-10-15 12:44:00,Closed,2023-10-16 10:50:44,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,060 247503700,1328 LA LOMA AVE,Berkeley,CA,Berkeley,37.884153,-122.255634,"(37.88415345, -122.25563374)",0 days 22:06:44


What's annoying about this is that it only shows ONE instance of the Case_ID. What I want is a list of those Case IDs. How do I make a list of the values in one column?
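
As an aside, and not the approach these notes take: `.duplicated()` accepts a `keep` argument. By default it flags only the later repeats, but `keep=False` flags every instance. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Case_ID': [101, 102, 102, 103],
    'Street_Address': ['A ST', 'B ST', 'C ST', 'D ST'],
})

# Default: only the second (and later) appearance of an ID is flagged
later_only = toy[toy['Case_ID'].duplicated()]

# keep=False: every row that shares its ID with another row is flagged
all_instances = toy[toy['Case_ID'].duplicated(keep=False)]

print(later_only)
print(all_instances)
```

The list-based approach that follows is still worth learning, because making a list from a column is useful well beyond duplicates.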

First, I'm going to create a new dataframe that has the duplicated IDs. We're going to create a copy.

In [31]:
dupe_cases = berkeley_311[berkeley_311['Case_ID'].duplicated()].copy()
dupe_cases

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
65650,121000771810,2020-05-28 12:52:00,Closed,2020-05-28 14:32:46,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",0 days 01:40:46
186417,121000910652,2022-02-22 16:38:00,Closed,2022-02-23 09:18:38,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,060 247503700,1328 LA LOMA AVE,Berkeley,CA,Berkeley,37.884153,-122.255634,"(37.88415345, -122.25563374)",0 days 16:40:38
308596,121000168293,2013-11-13 16:56:22,Closed,2013-11-18 16:16:10,Refuse and Recycling,Request,Cart Repair,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",4 days 23:19:48
334433,121000212784,2014-09-13 19:24:04,Closed,2014-09-22 08:59:39,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",8 days 13:35:35
350841,121000056432,2011-07-11 09:26:45,Closed,2011-07-14 12:56:18,Refuse and Recycling,Residential,Residential Service Start,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",3 days 03:29:33
451739,121000345232,2017-04-28 09:11:28,Closed,2017-05-04 09:32:37,Refuse and Recycling,Residential,Residential Lost or Stolen Cart,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",6 days 00:21:09
496161,121000212850,2014-09-15 09:37:45,Closed,2014-09-18 18:17:05,Refuse and Recycling,Request,Stray Cart Removal,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",3 days 08:39:20
701476,121001035218,2023-10-15 12:44:00,Closed,2023-10-16 10:50:44,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,060 247503700,1328 LA LOMA AVE,Berkeley,CA,Berkeley,37.884153,-122.255634,"(37.88415345, -122.25563374)",0 days 22:06:44
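
To see what `duplicated()` is actually marking, here's a minimal sketch on a made-up table (the `toy` dataframe and its values are invented for illustration, not taken from the 311 data). By default, pandas flags every repeat after the first occurrence, which is why `dupe_cases` only has one row per repeated ID. Passing `keep=False` flags every row that shares an ID.

```python
import pandas as pd

# Hypothetical toy data: case 102 appears twice, case 103 three times.
toy = pd.DataFrame({
    "Case_ID": [101, 102, 102, 103, 103, 103],
    "Street_Address": ["A ST", "B ST", "B ST", "C ST", "C ST", "C ST"],
})

# Default (keep='first'): the first occurrence is NOT flagged, later repeats are.
later_repeats = toy["Case_ID"].duplicated()

# keep=False: every row whose Case_ID appears more than once is flagged.
all_repeats = toy["Case_ID"].duplicated(keep=False)

print(later_repeats.to_list())  # [False, False, True, False, True, True]
print(all_repeats.to_list())    # [False, True, True, True, True, True]
print(toy[all_repeats])
```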


Now we create a `list` or `array` (the numpy version of a list) of those IDs:

In [32]:
dupe_case_ids = dupe_cases['Case_ID'].to_list()

# to create a numpy array of the unique IDs, you can use this instead:
# dupe_case_ids = dupe_cases['Case_ID'].unique()

Now we'll call `dupe_case_ids`, so we can see what's in it.

In [33]:
dupe_case_ids

[121000771810,
 121000910652,
 121000168293,
 121000212784,
 121000056432,
 121000345232,
 121000212850,
 121001035218]
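
If you're wondering how `.to_list()` and `.unique()` differ, here's a small sketch with a made-up column (the IDs are invented). `.to_list()` gives you a plain Python list and keeps repeats, while `.unique()` gives you a numpy array with repeats removed. `.to_numpy()` is the option to reach for if you want an array that keeps everything.

```python
import pandas as pd

# Hypothetical column with one repeated value (not real case IDs).
ids = pd.Series([101, 102, 102, 103], name="Case_ID")

as_list = ids.to_list()         # plain Python list, repeats kept
as_array = ids.unique()         # numpy array, repeats removed, order of first appearance
as_full_array = ids.to_numpy()  # numpy array, repeats kept

print(type(as_list), as_list)
print(type(as_array), as_array)
print(type(as_full_array), as_full_array)
```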

Now we'll subset our edited dataframe `berkeley_311`, keeping only the rows whose **Case_ID** is in that list.

In [34]:
berkeley_311[berkeley_311['Case_ID'].isin(dupe_case_ids)]

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
65565,121000771810,2020-05-28 12:52:00,Closed,2020-05-28 14:32:46,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",0 days 01:40:46
65650,121000771810,2020-05-28 12:52:00,Closed,2020-05-28 14:32:46,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",0 days 01:40:46
186262,121000910652,2022-02-22 16:38:00,Closed,2022-02-23 09:18:38,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,060 247503700,1328 LA LOMA AVE,Berkeley,CA,Berkeley,37.884153,-122.255634,"(37.88415345, -122.25563374)",0 days 16:40:38
186417,121000910652,2022-02-22 16:38:00,Closed,2022-02-23 09:18:38,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,060 247503700,1328 LA LOMA AVE,Berkeley,CA,Berkeley,37.884153,-122.255634,"(37.88415345, -122.25563374)",0 days 16:40:38
262080,121000168293,2013-11-13 16:56:22,Closed,2013-11-18 16:16:10,Refuse and Recycling,Request,Cart Repair,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",4 days 23:19:48
273389,121000056432,2011-07-11 09:26:45,Closed,2011-07-14 12:56:18,Refuse and Recycling,Residential,Residential Service Start,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",3 days 03:29:33
280373,121000212784,2014-09-13 19:24:04,Closed,2014-09-22 08:59:39,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",8 days 13:35:35
308596,121000168293,2013-11-13 16:56:22,Closed,2013-11-18 16:16:10,Refuse and Recycling,Request,Cart Repair,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",4 days 23:19:48
322900,121000212850,2014-09-15 09:37:45,Closed,2014-09-18 18:17:05,Refuse and Recycling,Request,Stray Cart Removal,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",3 days 08:39:20
334433,121000212784,2014-09-13 19:24:04,Closed,2014-09-22 08:59:39,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",8 days 13:35:35
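
Here's a stripped-down sketch of what `.isin()` is doing, using an invented table and an invented list of IDs. It builds a column of `True`/`False` values (often called a mask), and putting that mask inside the square brackets keeps only the `True` rows.

```python
import pandas as pd

# Hypothetical toy data and a hypothetical list of IDs to look for.
toy = pd.DataFrame({
    "Case_ID": [101, 102, 103, 102, 104],
    "Request_Category": ["Trees", "Refuse", "Trees", "Refuse", "Streets"],
})
wanted_ids = [102, 104]

# isin() returns one True/False per row ...
mask = toy["Case_ID"].isin(wanted_ids)

# ... and the mask inside the brackets keeps only the True rows.
matches = toy[mask]

print(mask.to_list())  # [False, True, False, True, True]
print(matches)
```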


OK, this is weird. It looks like most of the cases are for the same address. Still, it's hard to tell what's going on, so I'm going to sort that dataframe by **Case_ID**.

In [35]:
berkeley_311[berkeley_311['Case_ID'].isin(dupe_case_ids)].sort_values(by=['Case_ID'])

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
273389,121000056432,2011-07-11 09:26:45,Closed,2011-07-14 12:56:18,Refuse and Recycling,Residential,Residential Service Start,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",3 days 03:29:33
350841,121000056432,2011-07-11 09:26:45,Closed,2011-07-14 12:56:18,Refuse and Recycling,Residential,Residential Service Start,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",3 days 03:29:33
262080,121000168293,2013-11-13 16:56:22,Closed,2013-11-18 16:16:10,Refuse and Recycling,Request,Cart Repair,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",4 days 23:19:48
308596,121000168293,2013-11-13 16:56:22,Closed,2013-11-18 16:16:10,Refuse and Recycling,Request,Cart Repair,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",4 days 23:19:48
280373,121000212784,2014-09-13 19:24:04,Closed,2014-09-22 08:59:39,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",8 days 13:35:35
334433,121000212784,2014-09-13 19:24:04,Closed,2014-09-22 08:59:39,General Questions/information,Miscellaneous,Miscellaneous Internet Request,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",8 days 13:35:35
322900,121000212850,2014-09-15 09:37:45,Closed,2014-09-18 18:17:05,Refuse and Recycling,Request,Stray Cart Removal,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",3 days 08:39:20
496161,121000212850,2014-09-15 09:37:45,Closed,2014-09-18 18:17:05,Refuse and Recycling,Request,Stray Cart Removal,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",3 days 08:39:20
375704,121000345232,2017-04-28 09:11:28,Closed,2017-05-04 09:32:37,Refuse and Recycling,Residential,Residential Lost or Stolen Cart,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888518,-122.271781,"(37.88851766, -122.27178098)",6 days 00:21:09
451739,121000345232,2017-04-28 09:11:28,Closed,2017-05-04 09:32:37,Refuse and Recycling,Residential,Residential Lost or Stolen Cart,Property,061 256400200,20 TERRACE WALK,Berkeley,CA,Berkeley,37.888615,-122.27221,"(37.88861517, -122.27221016)",6 days 00:21:09


What differs between the two rows in most of these pairs is the geocoding: the `Latitude`, `Longitude`, and `Location` values are slightly different.
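
Eyeballing rows works for a handful of cases, but you could also ask pandas which columns disagree. This is one possible sketch, not something we ran in class: the `toy` dataframe is invented, and the idea is to count the distinct values in each column for every **Case_ID** and then look for counts above 1.

```python
import pandas as pd

# Hypothetical duplicates: case 101 has two different latitudes, case 102 does not.
toy = pd.DataFrame({
    "Case_ID": [101, 101, 102, 102],
    "Street_Address": ["20 TERRACE WALK"] * 2 + ["1328 LA LOMA AVE"] * 2,
    "Latitude": [37.888518, 37.888615, 37.884153, 37.884153],
})

# For each Case_ID, count the distinct values in every other column.
distinct_per_case = toy.groupby("Case_ID").nunique()

# A column "differs" if any case has more than one distinct value in it.
differs = distinct_per_case.gt(1).any()

print(distinct_per_case)
print(differs[differs].index.to_list())  # ['Latitude']
```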

FYI, none of the code we ran above changed the original dataframe. We were subsetting, but we never assigned a subset back to `berkeley_311`.
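
If that feels abstract, here's a tiny sketch with an invented dataframe. Filtering on its own only produces a result to look at. The variable changes only when you assign the result back to it.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Case_ID": [101, 102, 102]})

# Filtering returns a new dataframe; without an assignment, toy is untouched.
toy[toy["Case_ID"] == 102]
rows_after_filter = len(toy)        # still 3

# Assigning to the same name replaces what the variable points to.
toy = toy[toy["Case_ID"] == 102]
rows_after_assignment = len(toy)    # now 2

print(rows_after_filter, rows_after_assignment)
```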

Now we're going to drop the duplicated cases and reassign the variable `berkeley_311`.

In [36]:
berkeley_311 = berkeley_311.drop_duplicates(subset=['Case_ID'])

In [37]:
berkeley_311

Unnamed: 0,Case_ID,Date_Opened,Case_Status,Date_Closed,Request_Category,Request_SubCategory,Request_Detail,Object_Type,APN,Street_Address,City,State,Neighborhood,Latitude,Longitude,Location,Close_Time
0,121000877593,2021-09-16 06:23:23,Closed,2021-09-20 11:22:22,"Facilities, Electrical & Property Management",Parks/Marina Building Services,Keys / Locks,Property,,"Intersection of Browning and Addison, BERKELEY...",Berkeley,CA,Berkeley,,,,4 days 04:58:59
1,121000876647,2021-09-13 10:50:00,Open,NaT,Refuse and Recycling,Residential,Residential Bulky Pickup,Property,054 180702800,1722 DWIGHT WAY,Berkeley,CA,Berkeley,37.862656,-122.275461,"(37.86265624, -122.27546088)",NaT
2,121000809740,2020-11-06 16:51:00,Closed,2020-11-09 01:52:57,General Questions/information,Miscellaneous,Miscellaneous Service Request,Individual,,,Berkeley,CA,Berkeley,,,,2 days 09:01:57
3,121000809739,2020-11-06 16:38:00,Closed,2020-11-09 01:41:12,General Questions/information,Miscellaneous,Miscellaneous Service Request,Property,060 249305600,1411 GRIZZLY PEAK BLVD,Berkeley,CA,Berkeley,37.884799,-122.247874,"(37.88479918, -122.24787412)",2 days 09:03:12
4,121000793663,2020-09-01 11:32:00,Closed,2020-09-01 11:36:00,Other Account Services and Billing,Marina,Payment Collection - Marina,Individual,,,Berkeley,CA,Berkeley,,,,0 days 00:04:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
704676,121001011148,2023-06-12 13:06:00,Closed,2023-07-06 13:34:09,Refuse and Recycling,Commercial,Commercial Site Inspection,Property,058 217100500,1924 CEDAR ST,Berkeley,CA,Berkeley,37.877781,-122.272751,"(37.87778105, -122.27275061)",24 days 00:28:09
704677,121001014556,2023-06-29 14:01:00,Closed,2023-07-06 15:26:38,Refuse and Recycling,Request,Cart Repair,Property,053 167101600,1513 OREGON ST,Berkeley,CA,Berkeley,37.856370,-122.279024,"(37.85636976, -122.27902376)",7 days 01:25:38
704678,121001005559,2023-05-17 11:56:00,Closed,2023-07-06 12:07:40,Refuse and Recycling,Residential,Residential Site Inspection,Property,055 191404100,2326 SPAULDING AVE,Berkeley,CA,Berkeley,37.864922,-122.280641,"(37.86492171, -122.28064106)",50 days 00:11:40
704679,121000883511,2021-10-13 10:59:00,Closed,2023-07-06 06:55:14,"Parks, Trees and Vegetation",Trees,Tree Pruning,Property,060 244500103,1700 HOPKINS ST PARK,Berkeley,CA,Berkeley,37.882542,-122.277477,"(37.88254238, -122.27747688)",630 days 19:56:14
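
Which of the two duplicate rows survives? Here's a sketch on invented data. By default `drop_duplicates()` keeps the first row it sees for each **Case_ID**, and `keep='last'` keeps the last one instead. Notice that the surviving rows keep their original row labels, so the index ends up with gaps. That would explain why `df.info()` can report fewer entries than the largest label suggests.

```python
import pandas as pd

# Hypothetical toy data: case 101 appears twice with different latitudes.
toy = pd.DataFrame({
    "Case_ID": [101, 101, 102],
    "Latitude": [37.888518, 37.888615, 37.884153],
})

# Default keep='first': the earliest row for each Case_ID survives.
first_kept = toy.drop_duplicates(subset=["Case_ID"])

# keep='last' keeps the latest row instead.
last_kept = toy.drop_duplicates(subset=["Case_ID"], keep="last")

print(first_kept)  # row labels 0 and 2
print(last_kept)   # row labels 1 and 2
```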


By the way, you might have noticed there's something called `NaT` in one of the rows above. `NaT` stands for _not a time_ and is kind of like `None` or an empty cell in Google Sheets. For blank cells that aren't time-related, you'll see `NaN` (not a number) instead of `NaT`.

The difference between `None` and `NaN`/`NaT` is that the latter lets you perform calculations that skip any blank cells. That means you should check how many `NaN`/`NaT` cells exist in your dataframe. If there are a lot of them, your analysis might not be valid. You can quickly check with the `df.info()` method we learned earlier: look at the column called `Non-Null Count`.

In [38]:
berkeley_311.info()

<class 'pandas.core.frame.DataFrame'>
Index: 704673 entries, 0 to 704680
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype          
---  ------               --------------   -----          
 0   Case_ID              704673 non-null  int64          
 1   Date_Opened          704673 non-null  datetime64[ns] 
 2   Case_Status          704673 non-null  object         
 3   Date_Closed          667173 non-null  datetime64[ns] 
 4   Request_Category     704673 non-null  object         
 5   Request_SubCategory  704673 non-null  object         
 6   Request_Detail       704673 non-null  object         
 7   Object_Type          704673 non-null  object         
 8   APN                  413081 non-null  object         
 9   Street_Address       455047 non-null  object         
 10  City                 704673 non-null  object         
 11  State                704673 non-null  object         
 12  Neighborhood         704673 non-null  object         
 13  Lati

Which columns have a lot of null values?
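
`df.info()` counts the cells that are *not* null. If you'd rather count the blanks directly, one option is `isna()`, sketched here on an invented dataframe. The same sketch also shows a calculation quietly skipping the blank cells.

```python
import pandas as pd

# Hypothetical toy data with one missing date and two missing latitudes.
toy = pd.DataFrame({
    "Date_Closed": pd.to_datetime(["2021-09-20", None, "2020-11-09"]),
    "Latitude": [37.862656, None, None],
})

# Number of missing cells per column (NaN and NaT both count).
missing_counts = toy.isna().sum()

# Share of missing cells per column, from 0 to 1.
missing_share = toy.isna().mean()

# Calculations skip missing cells by default: this is the mean of one value.
mean_latitude = toy["Latitude"].mean()

print(missing_counts)
print(missing_share)
print(mean_latitude)
```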

### Assert
The keyword `assert` is a good way for us to check if the length of the dataframe now matches the number of unique IDs.

In [39]:
assert len(berkeley_311) == berkeley_311['Case_ID'].nunique()

If the condition is `True`, nothing happens. But if the condition is `False`, you'll get an `AssertionError`. You might want to use these kinds of assertions when you have to re-run your notebooks or have to import updated datasets.
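
Here's what a failing assertion looks like, sketched on an invented dataframe that still has a duplicate. The message after the comma is optional, but it makes the error easier to understand later. The `try`/`except` is only there so the example keeps running after the failure. In a notebook you'd normally let the error stop the cell.

```python
import pandas as pd

# Hypothetical toy data: 3 rows but only 2 unique IDs.
toy = pd.DataFrame({"Case_ID": [101, 102, 102]})

try:
    # The condition is False here, so Python raises an AssertionError.
    assert len(toy) == toy["Case_ID"].nunique(), "Case_ID values are not unique"
    outcome = "passed"
except AssertionError as err:
    outcome = f"failed: {err}"

print(outcome)  # failed: Case_ID values are not unique

# pandas can answer the same question directly.
print(toy["Case_ID"].is_unique)  # False
```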

## Export a clean version of the data to a csv

We can use the `df.to_csv()` method to export a clean copy of the data as a csv. That way, you can instantly import the clean data in a new notebook instead of rerunning the code in this notebook.

Before we run the code below, let's take a closer look:

```python
berkeley_311.to_csv('berkeley_311_clean.csv', index=False)
```

The first argument in `df.to_csv()` is the name of the file we're going to export our dataframe into. In this case, that's `berkeley_311_clean.csv`.

The second argument is `index=False`. This means that I don't want pandas to export the index, those row labels (0, 1, 2, 3, etc.) that show up at the very left-hand side of the dataframe.

In [40]:
berkeley_311.to_csv('berkeley_311_clean.csv', index=False)

Try removing `index=False` and see what happens.
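
If you want to check your answer without writing files to disk, here's a sketch that exports an invented dataframe to in-memory text buffers instead (the `io.StringIO` buffers stand in for the csv files). With the index left in, the first column of the csv has no header, and reading it back produces an extra column that pandas names `Unnamed: 0`.

```python
import io
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Case_ID": [101, 102], "Case_Status": ["Closed", "Open"]})

# Write to in-memory text buffers instead of files on disk.
with_index = io.StringIO()
toy.to_csv(with_index)                  # index=True is the default
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())

# Reading the first version back gives an extra column named "Unnamed: 0".
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(reloaded.columns.to_list())
```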