In [1]:
# HIDDEN
Base.displaysize() = (5, 80)
using DataFrames
using CSV

## Cleaning The Stops Dataset

The Stops dataset ([webpage](https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Stop-Data/6e9j-pj9p)) records police stops of pedestrians and vehicles. Let's prepare it for further analysis.

We can use the `head` command to display the first few lines of the file.

In [2]:
;head data/stops.json

{
  "meta" : {
    "view" : {
      "id" : "6e9j-pj9p",
      "name" : "Berkeley PD - Stop Data",
      "attribution" : "Berkeley Police Department",
      "averageRating" : 0,
      "category" : "Public Safety",
      "createdAt" : 1444171604,
      "description" : "This data was extracted from the Department’s Public Safety Server and covers the data beginning January 26, 2015.  On January 26, 2015 the department began collecting data pursuant to General Order B-4 (issued December 31, 2014).  Under that order, officers were required to provide certain data after making all vehicle detentions (including bicycles) and pedestrian detentions (up to five persons).  This data set lists stops by police in the categories of traffic, suspicious vehicle, pedestrian and bicycle stops.  Incident number, date and time, location and disposition codes are also listed in this data.\r\n\r\nAddress data has been changed from a specific address, where applicable, and listed as the block where the incid

The `stops.json` file is clearly not a CSV file. Instead, it contains data in the JSON (JavaScript Object Notation) format, a commonly used format that records data as nested dictionaries and lists. Julia's [JSON package](https://github.com/JuliaIO/JSON.jl) makes it simple to read this file as a dictionary.

In [4]:
# Note that this could cause our computer to run out of memory if the file
# is large. In this case, we've verified that the file is small enough to
# read in beforehand.

import JSON
stops_dict = JSON.parsefile("data/stops.json")

print(keys(stops_dict))

["meta", "data"]

Note that `stops_dict` is a Julia dictionary, so displaying it will display the entire dataset in the notebook. This could cause the browser to crash, so we only display the keys of the dictionary above. To peek at the data without potentially crashing the browser, we can print the dictionary to a string and only output some of the first characters of the string.

In [7]:
function print_dict(dictionary, num_chars=500)
    io = IOBuffer(maxsize=num_chars)
    JSON.print(io, dictionary, 2)
    println(String(take!(io)))
end

print_dict(stops_dict["meta"])

{
  "view": {
    "publicationDate": 1496269828,
    "rowsUpdatedAt": 1496269698,
    "hideFromDataJson": false,
    "publicationAppendEnabled": false,
    "viewType": "tabular",
    "attribution": "Berkeley Police Department",
    "createdAt": 1444171604,
    "viewCount": 3725,
    "hideFromCatalog": false,
    "name": "Berkeley PD - Stop Data",
    "oid": 26097722,
    "provenance": "official",
    "totalTimesRated": 0,
    "rowsUpdatedBy": "cd35-9pq2",
    "averageRating": 0,
    "viewLastMod


In [9]:
print_dict(stops_dict["data"], 300)

[
  [
    1,
    "29A1B912-A0A9-4431-ADC9-FB375809C32E",
    1,
    1444146408,
    "932858",
    1444146408,
    "932858",
    null,
    "2015-00004825",
    "2015-01-26T00:10:00",
    "SAN PABLO AVE / MARIN AVE",
    "T",
    "M",
    null,
    null
  ],
  [
    2,
    "1644D161-1113-4C4F-BB2E-BF7


We can deduce that the `"meta"` key in the dictionary contains a description of the data and its columns, and that the `"data"` key contains a list of data rows. We can use this information to initialize a DataFrame.

In [102]:
columns = [Symbol(c["name"]) for c in stops_dict["meta"]["view"]["columns"]]

# Wrap each row in a one-row DataFrame (as a 1×n matrix) and append it to `stops`.
stops = DataFrame()
for row in stops_dict["data"]
    append!(stops, DataFrame(reshape(row, 1, :), columns))
end
stops

Unnamed: 0_level_0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,Incident Number,Call Date/Time
Unnamed: 0_level_1,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any
1,1,29A1B912-A0A9-4431-ADC9-FB375809C32E,1,1444146408,932858,1444146408,932858,,2015-00004825,2015-01-26T00:10:00
2,2,1644D161-1113-4C4F-BB2E-BF780E7AE73E,2,1444146408,932858,1444146408,932858,,2015-00004829,2015-01-26T00:50:00
3,3,5338ABAB-1C96-488D-B55F-6A47AC505872,3,1444146408,932858,1444146408,932858,,2015-00004831,2015-01-26T01:03:00
4,4,21B6CBE4-9865-460F-97BC-6B26C6EF2FDB,4,1444146408,932858,1444146408,932858,,2015-00004848,2015-01-26T07:16:00
5,5,0D85FA92-80E9-48C2-B409-C3270251CD12,5,1444146408,932858,1444146408,932858,,2015-00004849,2015-01-26T07:43:00
6,6,B472A15F-6FEE-457A-872F-4B4174D5CC41,6,1444146408,932858,1444146408,932858,,2015-00004865,2015-01-26T09:46:00
7,7,0A9C0082-94FE-4B0B-9F2B-27F21C6BFD94,7,1444146408,932858,1444146408,932858,,2015-00004870,2015-01-26T10:05:00
8,8,679142FB-85E1-4AC1-9CA0-C6393D329111,8,1444146408,932858,1444146408,932858,,2015-00004876,2015-01-26T10:21:00
9,9,9CA07931-B049-4127-A7D4-5A3F4A7F6A3B,9,1444146408,932858,1444146408,932858,,2015-00004882,2015-01-26T10:49:00
10,10,F2BCCCF9-D3AE-40EE-AE8F-2B52132EC677,10,1444146408,932858,1444146408,932858,,2015-00004887,2015-01-26T11:12:00
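
The deduction that `"meta"` describes the columns of `"data"` can be checked before building the table. The sketch below uses a small made-up dictionary (`toy_dict`, not the real file) with the same nesting, and tests that every row has one entry per listed column:

```julia
# Toy stand-in for `stops_dict`; the values are made up for illustration.
toy_dict = Dict(
    "meta" => Dict("view" => Dict("columns" => [
        Dict("name" => "sid"),
        Dict("name" => "Location"),
    ])),
    "data" => [Any[1, "SAN PABLO AVE / MARIN AVE"], Any[2, nothing]],
)

column_names = [c["name"] for c in toy_dict["meta"]["view"]["columns"]]

# Distinct row lengths found in the data; ideally there is exactly one.
row_lengths = unique(length.(toy_dict["data"]))

# Every row should have exactly one entry per documented column.
rows_match_columns = row_lengths == [length(column_names)]
println(column_names, " -> ", rows_match_columns)
```

If the check fails on real data, the rows and the column metadata are out of step and the table should not be built from them as-is.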


In [103]:
# Prints column names
print(names(stops))

Symbol[:sid, :id, :position, :created_at, :created_meta, :updated_at, :updated_meta, :meta, Symbol("Incident Number"), Symbol("Call Date/Time"), :Location, Symbol("Incident Type"), :Dispositions, Symbol("Location - Latitude"), Symbol("Location - Longitude")]

The website contains documentation about the following columns:

| Column | Description | Type |
| ------ | ----------- | ---- |
| Incident Number | Number of incident created by Computer Aided Dispatch (CAD) program | Plain Text |
| Call Date/Time  | Date and time of the incident/stop | Date & Time |
| Location  | General location of the incident/stop | Plain Text |
| Incident Type | This is the occurred incident type created in the CAD program. A code signifies a traffic stop (T), suspicious vehicle stop (1196), pedestrian stop (1194) and bicycle stop (1194B). | Plain Text |
| Dispositions  | Ordered in the following sequence: 1st Character = Race, as follows: A (Asian) B (Black) H (Hispanic) O (Other) W (White) 2nd Character = Gender, as follows: F (Female) M (Male) 3rd Character = Age Range, as follows: 1 (Less than 18) 2 (18-29) 3 (30-39), 4 (Greater than 40) 4th Character = Reason, as follows: I (Investigation) T (Traffic) R (Reasonable Suspicion) K (Probation/Parole) W (Wanted) 5th Character = Enforcement, as follows: A (Arrest) C (Citation) O (Other) W (Warning) 6th Character = Car Search, as follows: S (Search) N (No Search) Additional dispositions may also appear. They are: P - Primary case report M - MDT narrative only AR - Arrest report only (no case report submitted) IN - Incident report FC - Field Card CO - Collision investigation report MH - Emergency Psychiatric Evaluation TOW - Impounded vehicle 0 or 00000 – Officer made a stop of more than five persons | Plain Text |
| Location - Latitude | General latitude of the call. This data is only uploaded after January 2017. | Number |
| Location - Longitude  | General longitude of the call. This data is only uploaded after January 2017. | Number |

Notice that the website doesn't contain descriptions for the first 8 columns of the `stops` table. Since these columns appear to contain metadata that we're not interested in analyzing this time, we drop them from the table.

In [108]:
columns_to_drop = [:sid, :id, :position, :created_at, :created_meta, :updated_at, :updated_meta, :meta]
select!(stops, Not(columns_to_drop))
stops

Unnamed: 0_level_0,Incident Number,Call Date/Time,Location,Incident Type,Dispositions,Location - Latitude,Location - Longitude
Unnamed: 0_level_1,Any,Any,Any,Any,Any,Any,Any
1,2015-00004825,2015-01-26T00:10:00,SAN PABLO AVE / MARIN AVE,T,M,missing,missing
2,2015-00004829,2015-01-26T00:50:00,SAN PABLO AVE / CHANNING WAY,T,M,missing,missing
3,2015-00004831,2015-01-26T01:03:00,UNIVERSITY AVE / NINTH ST,T,M,missing,missing
4,2015-00004848,2015-01-26T07:16:00,2000 BLOCK BERKELEY WAY,1194,BM4ICN,missing,missing
5,2015-00004849,2015-01-26T07:43:00,1700 BLOCK SAN PABLO AVE,1194,BM4ICN,missing,missing
6,2015-00004865,2015-01-26T09:46:00,M L KING JR WAY / UNIVERSITY AVE,T,OF4TCN,missing,missing
7,2015-00004870,2015-01-26T10:05:00,M L KING JR WAY / UNIVERSITY AVE,T,OM4TCN,missing,missing
8,2015-00004876,2015-01-26T10:21:00,UNIVERSITY AVE / M L KING JR WAY,T,OF2TCN,missing,missing
9,2015-00004882,2015-01-26T10:49:00,HASTE ST / ELLSWORTH ST,T,OM2TCN,missing,missing
10,2015-00004887,2015-01-26T11:12:00,ADELINE ST / OREGON ST,T,OM2TCN,missing,missing


As with the Calls dataset, we will answer the following three questions about the Stops dataset:

1. Are there missing values in the dataset?
1. Are there any missing values that were filled in (e.g. a 999 for unknown age or 12:00am for unknown date)?
1. Which parts of the data were entered by a human?

### Are there missing values?

We can clearly see that many latitudes and longitudes are missing. The data description states that these two columns are only filled in after January 2017.
JSON `null` values were parsed as Julia's `nothing`, so we convert them to `missing` for further processing:

In [109]:
function replace_nothing_values(df)
    for col in names(df)
        replace!(df[!, col], nothing=>missing)
    end
end

replace_nothing_values(stops)
stops

Unnamed: 0_level_0,Incident Number,Call Date/Time,Location,Incident Type,Dispositions,Location - Latitude,Location - Longitude
Unnamed: 0_level_1,Any,Any,Any,Any,Any,Any,Any
1,2015-00004825,2015-01-26T00:10:00,SAN PABLO AVE / MARIN AVE,T,M,missing,missing
2,2015-00004829,2015-01-26T00:50:00,SAN PABLO AVE / CHANNING WAY,T,M,missing,missing
3,2015-00004831,2015-01-26T01:03:00,UNIVERSITY AVE / NINTH ST,T,M,missing,missing
4,2015-00004848,2015-01-26T07:16:00,2000 BLOCK BERKELEY WAY,1194,BM4ICN,missing,missing
5,2015-00004849,2015-01-26T07:43:00,1700 BLOCK SAN PABLO AVE,1194,BM4ICN,missing,missing
6,2015-00004865,2015-01-26T09:46:00,M L KING JR WAY / UNIVERSITY AVE,T,OF4TCN,missing,missing
7,2015-00004870,2015-01-26T10:05:00,M L KING JR WAY / UNIVERSITY AVE,T,OM4TCN,missing,missing
8,2015-00004876,2015-01-26T10:21:00,UNIVERSITY AVE / M L KING JR WAY,T,OF2TCN,missing,missing
9,2015-00004882,2015-01-26T10:49:00,HASTE ST / ELLSWORTH ST,T,OM2TCN,missing,missing
10,2015-00004887,2015-01-26T11:12:00,ADELINE ST / OREGON ST,T,OM2TCN,missing,missing
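
The conversion matters because Julia treats `nothing` and `missing` differently: helpers such as `ismissing` and `skipmissing` only recognize `missing`. A minimal sketch on made-up latitude values (`raw` is hypothetical, not taken from the dataset):

```julia
# Made-up values standing in for a latitude column parsed from JSON,
# where JSON `null` becomes Julia `nothing`.
raw = Any[37.87, nothing, 37.90, nothing]

# Non-mutating counterpart of the `replace!` call used above.
cleaned = replace(raw, nothing => missing)

# `missing`-aware helpers now recognize the gaps.
n_missing = count(ismissing, cleaned)
known = collect(skipmissing(cleaned))
mean_known = sum(known) / length(known)

println("missing: ", n_missing, ", mean of known values: ", mean_known)
```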


Now we can use `describe()` to see the number of missing values in each column.

In [110]:
describe(stops)

Unnamed: 0_level_0,variable,mean,min,median,max,nunique,nmissing,eltype
Unnamed: 0_level_1,Symbol,Nothing,String,Nothing,String,Int64,Int64,DataType
1,Incident Number,,2015-00004825,,2017-00024254,29208,0,Any
2,Call Date/Time,,2015-01-26T00:10:00,,2017-04-30T23:38:34,28848,0,Any
3,Location,,10TH/ADDISON,,WOOLSEY ST/HARPER ST,6393,0,Any
4,Incident Type,,1194,,T,4,0,Any
5,Dispositions,,"000000, AR",,zz do not use Assist;,2757,63,Any
6,Location - Latitude,,37.8372150110001,,37.900832366,1026,25063,Any
7,Location - Longitude,,-122.239325453,,-122.319669759703,1026,25063,Any


We can check all rows in the DataFrame that contain missing values:

In [111]:
stops[.!completecases(stops), :]

Unnamed: 0_level_0,Incident Number,Call Date/Time,Location,Incident Type,Dispositions,Location - Latitude,Location - Longitude
Unnamed: 0_level_1,Any,Any,Any,Any,Any,Any,Any
1,2015-00004825,2015-01-26T00:10:00,SAN PABLO AVE / MARIN AVE,T,M,missing,missing
2,2015-00004829,2015-01-26T00:50:00,SAN PABLO AVE / CHANNING WAY,T,M,missing,missing
3,2015-00004831,2015-01-26T01:03:00,UNIVERSITY AVE / NINTH ST,T,M,missing,missing
4,2015-00004848,2015-01-26T07:16:00,2000 BLOCK BERKELEY WAY,1194,BM4ICN,missing,missing
5,2015-00004849,2015-01-26T07:43:00,1700 BLOCK SAN PABLO AVE,1194,BM4ICN,missing,missing
6,2015-00004865,2015-01-26T09:46:00,M L KING JR WAY / UNIVERSITY AVE,T,OF4TCN,missing,missing
7,2015-00004870,2015-01-26T10:05:00,M L KING JR WAY / UNIVERSITY AVE,T,OM4TCN,missing,missing
8,2015-00004876,2015-01-26T10:21:00,UNIVERSITY AVE / M L KING JR WAY,T,OF2TCN,missing,missing
9,2015-00004882,2015-01-26T10:49:00,HASTE ST / ELLSWORTH ST,T,OM2TCN,missing,missing
10,2015-00004887,2015-01-26T11:12:00,ADELINE ST / OREGON ST,T,OM2TCN,missing,missing


By browsing through the table above, we can see that all other missing values are in the Dispositions column. Unfortunately, the data description does not explain why these dispositions might be missing. Since there are only 63 missing values among more than 29,000 rows in the original table, we can proceed with the analysis while being mindful that these missing values could impact results.
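
Instead of browsing, the rows with a missing disposition can be selected directly. A minimal sketch on a made-up table (`toy` and its values are illustrative, not from the dataset):

```julia
using DataFrames

# Made-up table with the same kind of gaps as `stops`.
toy = DataFrame(
    Dispositions = ["M", missing, "BM4ICN", missing],
    Latitude = [missing, missing, 37.87, 37.88],
)

# Keep only the rows whose disposition is missing.
missing_disp = toy[ismissing.(toy.Dispositions), :]

# Share of rows affected, to judge how much the gaps could matter.
share = nrow(missing_disp) / nrow(toy)
println(nrow(missing_disp), " rows, share = ", share)
```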

### Are there any missing values that were filled in?

It doesn't seem like any previously missing values were filled in for us. Unlike in the Calls dataset where the date and time were in separate columns, the Call Date/Time column in the Stops dataset contains both date and time.
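
One way to probe for filled-in placeholders is to look for an implausible number of stops recorded at exactly midnight. The sketch below runs on three made-up timestamps in the same format as the Call Date/Time column. The variable names are illustrative, and a high share would only be a hint of placeholder values, not proof.

```julia
using Dates

# Made-up sample in the same ISO format as the Call Date/Time column.
timestamps = ["2015-01-26T00:10:00", "2015-01-26T07:16:00", "2015-01-27T00:00:00"]

parsed = DateTime.(timestamps)

# Count entries whose time of day is exactly 00:00:00.
n_midnight = count(dt -> Time(dt) == Time(0), parsed)
midnight_share = n_midnight / length(parsed)

println(n_midnight, " of ", length(parsed), " timestamps fall exactly on midnight")
```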

### Which parts of the data were entered by a human?

As with the Calls dataset, it looks like most of the columns in this dataset were recorded by a machine or were a category selected by a human (e.g. Incident Type).

However, the Location column doesn't have consistently entered values. Sure enough, the most common values show several different formats:

In [112]:
sort!(combine(groupby(stops, :Location), :Location => length), :Location_length, rev=true)

Unnamed: 0_level_0,Location,Location_length
Unnamed: 0_level_1,Any,Int64
1,2200 BLOCK SHATTUCK AVE,229
2,37.8693028530001~-122.272234021,213
3,UNIVERSITY AVE / SAN PABLO AVE,202
4,UNIVERSITY AVE / SIXTH ST,182
5,80 BLOCK BOLIVAR DR,163
6,SAN PABLO AVE / UNIVERSITY AVE,142
7,ASHBY AVE / SAN PABLO AVE,136
8,ASHBY AVE / SEVENTH ST,131
9,UNIVERSITY AVE / ACTON ST,129
10,SHATTUCK AVE / KITTREDGE ST,126


What a mess! It looks like sometimes an address was entered, sometimes a cross-street, and other times a latitude-longitude pair. Unfortunately, we don't have very complete latitude-longitude data to use in place of this column. We may have to manually clean this column if we want to use locations for future analysis.
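
As a first step toward cleaning, each entry could be labeled by the format it appears to use. This is a rough sketch, not a complete cleaner: `location_kind` is a hypothetical helper, and its patterns are guesses based only on the values shown above.

```julia
# Hypothetical helper: label a Location entry by the format it appears to use.
function location_kind(loc::AbstractString)
    # Two decimal numbers joined by "~", e.g. "37.86~-122.27"
    occursin(r"^-?\d+\.\d+~-?\d+\.\d+$", loc) && return :coordinates
    # Cross-streets are separated by a slash.
    occursin("/", loc) && return :intersection
    # Block-level addresses contain the word BLOCK.
    occursin("BLOCK", loc) && return :block
    return :other
end

samples = [
    "2200 BLOCK SHATTUCK AVE",
    "37.8693028530001~-122.272234021",
    "UNIVERSITY AVE / SIXTH ST",
    "10TH/ADDISON",
]
for loc in samples
    println(rpad(loc, 34), location_kind(loc))
end
```

Counting the labels over the whole column would show how much of it each format covers, and how many entries fall into `:other` and need manual review.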

We can also check the Dispositions column:

In [113]:
dispositions = sort!(combine(groupby(stops, :Dispositions), :Dispositions => length), :Dispositions_length, rev=true)

Unnamed: 0_level_0,Dispositions,Dispositions_length
Unnamed: 0_level_1,Any,Int64
1,M,1683
2,WM4TCN,875
3,BM4TWN,681
4,BM2TWN,674
5,WM4TWN,547
6,WF4TCN,537
7,P,493
8,WM2TWN,487
9,BM3TWN,426
10,WM3TCN,400


The Dispositions column also contains inconsistencies. For example, some dispositions start with a space, some end with a semicolon, and some contain multiple entries. The variety of values suggests that this field contains human-entered values and should be treated with caution.

In [114]:
# Strange values...
dispositions[[1, 21, 31, 267, 1028], :]

Unnamed: 0_level_0,Dispositions,Dispositions_length
Unnamed: 0_level_1,Any,Int64
1,M,1683
2,M;,238
3,M,176
4,IN,14
5,"WF2RCN, WM2RCN, WM4RCS, WM4RAS, HM2RAS",1


In addition, the most common disposition is `M`, which isn't a permitted first character in the Dispositions column. This could mean that the format of the column changed over time or that officers are allowed to enter a disposition without matching the format in the data description. In any case, the column will be challenging to work with.
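
To measure how often entries follow the documented six-character format, the description in the table above can be translated into a regular expression. This is a sketch under the assumption that the documentation lists every permitted character; `follows_format` is a hypothetical helper.

```julia
# Pattern built from the documented positions:
# race, gender, age range, reason, enforcement, car search.
disposition_format = r"^[ABHOW][FM][1-4][ITRKW][ACOW][SN]$"

# Hypothetical helper: true if `code` is exactly one documented six-character code.
follows_format(code::AbstractString) = occursin(disposition_format, code)

for code in ["BM4ICN", "OF4TCN", "M", "IN"]
    println(rpad(code, 8), follows_format(code))
end
```

Entries such as `M` and `IN` fail the check because they belong to the additional dispositions listed in the description, not to the six-character code.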

We can take some simple steps to clean the Dispositions column by stripping leading and trailing whitespace, semicolons, and commas from each entry.

In [116]:
stops.Dispositions = map(x -> passmissing(strip)(x, [';', ',', ' ']), stops.Dispositions)
stops

Unnamed: 0_level_0,Incident Number,Call Date/Time,Location,Incident Type,Dispositions,Location - Latitude,Location - Longitude
Unnamed: 0_level_1,Any,Any,Any,Any,SubStri…⍰,Any,Any
1,2015-00004825,2015-01-26T00:10:00,SAN PABLO AVE / MARIN AVE,T,M,missing,missing
2,2015-00004829,2015-01-26T00:50:00,SAN PABLO AVE / CHANNING WAY,T,M,missing,missing
3,2015-00004831,2015-01-26T01:03:00,UNIVERSITY AVE / NINTH ST,T,M,missing,missing
4,2015-00004848,2015-01-26T07:16:00,2000 BLOCK BERKELEY WAY,1194,BM4ICN,missing,missing
5,2015-00004849,2015-01-26T07:43:00,1700 BLOCK SAN PABLO AVE,1194,BM4ICN,missing,missing
6,2015-00004865,2015-01-26T09:46:00,M L KING JR WAY / UNIVERSITY AVE,T,OF4TCN,missing,missing
7,2015-00004870,2015-01-26T10:05:00,M L KING JR WAY / UNIVERSITY AVE,T,OM4TCN,missing,missing
8,2015-00004876,2015-01-26T10:21:00,UNIVERSITY AVE / M L KING JR WAY,T,OF2TCN,missing,missing
9,2015-00004882,2015-01-26T10:49:00,HASTE ST / ELLSWORTH ST,T,OM2TCN,missing,missing
10,2015-00004887,2015-01-26T11:12:00,ADELINE ST / OREGON ST,T,OM2TCN,missing,missing
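
The cell above only strips characters from the ends of each entry. If interior semicolons should also become commas, a small helper could handle both steps. This is a sketch: `clean_disposition` is a hypothetical name, and the sample strings are made up.

```julia
# Hypothetical helper: trim separators and spaces from both ends,
# then turn any remaining semicolons into commas.
clean_disposition(x::AbstractString) = replace(strip(x, [';', ',', ' ']), ";" => ",")

# Leave missing entries untouched.
clean_disposition(::Missing) = missing

samples = [" M;", "BM2TWN; AR", "WF2RCN, WM2RCN", missing]
cleaned = clean_disposition.(samples)
println(cleaned)
```

Defining a separate method for `Missing` plays the same role as `passmissing` in the cell above.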


## Conclusion

As these two datasets have shown, data cleaning can often be both difficult and tedious. Cleaning 100% of the data often takes too long, but not cleaning the data at all results in faulty conclusions; we have to weigh our options and strike a balance each time we encounter a new dataset.

The decisions made during data cleaning impact all future analyses. For example, we chose not to clean the Location column of the Stops dataset, so we should treat that column with caution. Each decision made during data cleaning should be carefully documented for future reference, preferably in a notebook so that both code and explanations appear together.

In [None]:
# HIDDEN
# Save data to CSV for other chapters
# CSV.write("data/calls_stops.csv", stops)