In [13]:
# HIDDEN
Base.displaysize() = (5, 80)
using DataFrames
using CSV

## Investigating Berkeley Police Data

We will use the Berkeley Police Department's publicly available datasets to demonstrate data cleaning techniques. We have downloaded the [Calls for Service dataset][calls] and [Stops dataset][stops].

We can use the `ls` shell command with the `-lh` flags to see more details about the files:

[calls]: https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Calls-for-Service/k2nh-s5h5
[stops]: https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Stop-Data/6e9j-pj9p

In [2]:
;ls -lh data/

total 13936
-rw-r--r--  1 irinabchan  staff   979K Sep 14 12:53 Berkeley_PD_-_Calls_for_Service.csv
-rw-r--r--  1 irinabchan  staff    81B Sep 14 12:53 cvdow.csv
-rw-r--r--  1 irinabchan  staff   5.8M Sep 14 12:53 stops.json


The command above shows the data files and their sizes. This is especially useful because we now know the files are small enough to load into memory. As a rule of thumb, it is usually safe to load a file whose size is at most around one fourth of the computer's total memory. For example, if a computer has 4GB of RAM, we should be able to load a 1GB CSV file into a DataFrame. To handle larger datasets we will need additional computational tools that we will cover later in this book.
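As a rough illustration, the rule of thumb can be written as a small check. This is only a sketch: `fits_in_memory` is a hypothetical helper (it is not part of Julia or DataFrames), and the one-fourth fraction is simply the heuristic stated above. `Sys.total_memory` and `filesize` are standard Julia functions.

```julia
# Hypothetical helper, not a library function: applies the one-fourth rule of thumb.
function fits_in_memory(nbytes::Integer; total::Integer = Sys.total_memory(), fraction = 0.25)
    return nbytes <= fraction * total
end

gb = 2^30  # number of bytes in one gigabyte (binary convention)

# With 4 GB of RAM, a 1 GB file passes the check and a 2 GB file does not.
small_ok = fits_in_memory(1 * gb; total = 4 * gb)
large_ok = fits_in_memory(2 * gb; total = 4 * gb)
println((small_ok, large_ok))
```

For a real file we could pass its size on disk, for example `fits_in_memory(filesize("data/stops.json"))`, which compares the file against the memory of the machine running the notebook.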

Notice the semicolon before `ls`, which activates Julia's shell mode. This tells Jupyter that what follows is a shell command, not a Julia expression. Another option is the `run()` function, which also lets us run shell commands:

In [1]:
# The `wc` shell command shows us how many lines each file has.
# For example, we can inspect the number of lines of the `stops.json` file
run(`wc -l data/stops.json`);

   29852 data/stops.json
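If we prefer to stay in Julia rather than call out to the shell, the built-in `countlines` function gives the same kind of information. The sketch below uses a small temporary file (made up for illustration) so that it runs anywhere; in the notebook we would pass `"data/stops.json"` instead.

```julia
# Write a small temporary file so the example is self-contained.
path, io = mktemp()
write(io, "first\nsecond\nthird\n")
close(io)

# Count the lines of the file without leaving Julia.
n_lines = countlines(path)
println(n_lines)

rm(path)  # clean up the temporary file
```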


### Understanding the Data Generation

We will state important questions you should ask of all datasets before data cleaning or processing. These questions are related to how the data were generated, so data cleaning will usually **not** be able to resolve issues that arise here.

**What do the data contain?** The website for the Calls for Service data states that the dataset describes "crime incidents (not criminal reports) within the last 180 days". Further reading reveals that "not all calls for police service are included (e.g. Animal Bite)".

The website for the Stops data states that the dataset contains data on all "vehicle detentions (including bicycles) and pedestrian detentions (up to five persons)" since January 26, 2015.

**Are the data a census?** This depends on our population of interest. For example, if we are interested in calls for service within the last 180 days for crime incidents then the Calls dataset is a census. However, if we are interested in calls for service within the last 10 years the dataset is clearly not a census. We can make similar statements about the Stops dataset since the data collection started on January 26, 2015.

**If the data form a sample, is it a probability sample?** If we are investigating a period of time that the data do not have entries for, the data do not form a probability sample since there is no randomness involved in the data collection process — we have all data for certain time periods but no data for others.

**What limitations will this data have on our conclusions?** Although we will ask this question at each step of our data processing, we can already see that our data impose important limitations. The most important limitation is that we cannot make unbiased estimations for time periods not covered by our datasets.

## Cleaning The Calls Dataset

Let's now clean the Calls dataset. The `head` shell command prints the first ten lines of the file by default.

In [14]:
;head data/Berkeley_PD_-_Calls_for_Service.csv

CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
17091420,BURGLARY AUTO,07/23/2017 12:00:00 AM,06:00,BURGLARY - VEHICLE,0,08/29/2017 08:28:05 AM,"2500 LE CONTE AVE
Berkeley, CA
(37.876965, -122.260544)",2500 LE CONTE AVE,Berkeley,CA
17020462,THEFT FROM PERSON,04/13/2017 12:00:00 AM,08:45,LARCENY,4,08/29/2017 08:28:00 AM,"2200 SHATTUCK AVE
Berkeley, CA
(37.869363, -122.268028)",2200 SHATTUCK AVE,Berkeley,CA
17050275,BURGLARY AUTO,08/24/2017 12:00:00 AM,18:30,BURGLARY - VEHICLE,4,08/29/2017 08:28:06 AM,"200 UNIVERSITY AVE
Berkeley, CA
(37.865491, -122.310065)",200 UNIVERSITY AVE,Berkeley,CA


It appears to be a comma-separated values (CSV) file, though it's hard to tell whether the entire file is formatted properly. We can use `CSV.read` to read in the file as a DataFrame. If `CSV.read` errors, we will have to dig deeper and manually resolve formatting issues. Fortunately, `CSV.read` successfully returns a DataFrame:

In [3]:
calls = CSV.read("data/Berkeley_PD_-_Calls_for_Service.csv")

Row,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND
,Int64,String,String,Dates…,String
1,17091420,BURGLARY AUTO,07/23/2017 12:00:00 AM,06:00:00,BURGLARY - VEHICLE
2,17020462,THEFT FROM PERSON,04/13/2017 12:00:00 AM,08:45:00,LARCENY
3,17050275,BURGLARY AUTO,08/24/2017 12:00:00 AM,18:30:00,BURGLARY - VEHICLE
4,17019145,GUN/WEAPON,04/06/2017 12:00:00 AM,17:30:00,WEAPONS OFFENSE
5,17044993,VEHICLE STOLEN,08/01/2017 12:00:00 AM,18:00:00,MOTOR VEHICLE THEFT
⋮,⋮,⋮,⋮,⋮,⋮


Based on the output above, the resulting DataFrame looks reasonably well-formed since the columns are properly named and the data in each column seems to be entered consistently. What data does each column contain? We can look at the dataset website:

| Column         | Description                            | Type        |
| ------         | -----------                            | ----        |
| CASENO         | Case Number                            | Number      |
| OFFENSE        | Offense Type                           | Plain Text  |
| EVENTDT        | Date Event Occurred                    | Date & Time |
| EVENTTM        | Time Event Occurred                    | Plain Text  |
| CVLEGEND       | Description of Event                   | Plain Text  |
| CVDOW          | Day of Week Event Occurred             | Number      |
| InDbDate       | Date dataset was updated in the portal | Date & Time |
| Block_Location | Block level address of event           | Location    |
| BLKADDR        |                                        | Plain Text  |
| City           |                                        | Plain Text  |
| State          |                                        | Plain Text  |

On the surface the data look easy to work with. However, before starting data analysis we must answer the following questions:

1. **Are there missing values in the dataset?** This question is important because missing values can represent many different things. For example, missing addresses could mean that locations were removed to protect anonymity, or that some respondents chose not to answer a survey question, or that a recording device broke.
1. **Are there any missing values that were filled in (e.g. a 999 for unknown age or 12:00am for unknown date)?** These will clearly impact analysis if we ignore them.
1. **Which parts of the data were entered by a human?** As we will soon see, human-entered data is filled with inconsistencies and misspellings.

Although there are plenty more checks to go through, these three will suffice for many cases. See the [Quartz bad data guide](https://github.com/Quartz/bad-data-guide) for a more complete list of checks.
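To make the second check concrete, here is a minimal sketch of how a filled-in placeholder could be detected and converted to `missing`. The `ages` vector and the choice of 999 as the placeholder are hypothetical and do not come from the Calls dataset.

```julia
# Hypothetical ages; 999 is assumed to be the placeholder for "unknown".
ages = [23, 999, 41, 999, 35]

# Count how often the placeholder appears before deciding how to handle it.
n_sentinel = count(a -> a == 999, ages)

# Replace the placeholder with `missing` so it cannot silently skew statistics.
cleaned = replace(ages, 999 => missing)

println(n_sentinel)
println(cleaned)
```

Leaving the 999s in place would inflate any average computed from this column, which is why such values need to be found before analysis.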

### Are there missing values?

In Julia, missing values are represented via the `missing` object (equivalent to `NULL` in SQL and `NA` in R). `missing` values propagate automatically when passed to standard operators and functions, which means that an operation involving a `missing` value generally returns `missing`. This makes it really important to inspect your data so you can deal with `missing` values accordingly. See the [documentation](https://docs.julialang.org/en/v1/manual/missing/) for more information.
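The propagation behavior described above can be seen on a small vector. The values below are made up for illustration; all of the functions used (`ismissing`, `skipmissing`, `coalesce`) are part of Base Julia.

```julia
x = [1, 2, missing, 4]

# Arithmetic and reductions that touch a `missing` value return `missing`.
plus_is_missing = ismissing(1 + missing)
sum_is_missing = ismissing(sum(x))

# `skipmissing` ignores missing entries; `count(ismissing, x)` tallies them.
total = sum(skipmissing(x))
n_missing = count(ismissing, x)

# `coalesce` substitutes a default value for each missing entry.
filled = coalesce.(x, 0)

println((plus_is_missing, sum_is_missing, total, n_missing, filled))
```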

Using `describe()` to inspect a DataFrame will display the number of `missing` objects present in each column, as well as other useful information:

In [25]:
# HIDDEN
Base.displaysize() = (11, 140)

In [20]:
describe(calls)

Row,variable,mean,min,median,max,nunique,nmissing
,Symbol,Union…,Any,Union…,Any,Union…,Union…
1,CASENO,17043700.0,17000233,17036500.0,17091670,,
2,OFFENSE,,2ND RESPONSE,,VICE,31.0,
3,EVENTDT,,03/02/2017 12:00:00 AM,,08/28/2017 12:00:00 AM,180.0,
4,EVENTTM,,00:00,,23:59,1033.0,
5,CVLEGEND,,ALL OTHER OFFENSES,,WEAPONS OFFENSE,23.0,
6,CVDOW,3.07662,0,3.0,6,,
7,InDbDate,,08/29/2017 08:27:58 AM,,08/29/2017 08:28:06 AM,9.0,
8,Block_Location,,"0\nBerkeley, CA\n",,"WOOLSEY STREET & ELLIS ST\nBerkeley, CA\n",1835.0,
9,BLKADDR,,0 <UNKNOWN>,,WOOLSEY STREET & ELLIS ST,1834.0,27.0
10,City,,Berkeley,,Berkeley,1.0,


It looks like 27 calls didn't have a recorded address in BLKADDR. Unfortunately, the data description isn't very clear on how the locations were recorded. We know that all of these calls were made for events in Berkeley, so we can likely assume that the addresses for these calls were originally somewhere in Berkeley.

### Are there any missing values that were filled in?

From the missing value check above we can see that the Block_Location column still has `Berkeley, CA` recorded even when the location was missing.

In addition, an inspection of the `calls` table shows us that the EVENTDT column has the correct dates but records 12am for all of its times. Instead, the times are in the EVENTTM column.

In [22]:
# Show the first 7 rows of the table again for reference
first(calls, 7)

Row,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate
,Int64,String,String,String,String,Int64,String
1,17091420,BURGLARY AUTO,07/23/2017 12:00:00 AM,06:00,BURGLARY - VEHICLE,0,08/29/2017 08:28:05 AM
2,17020462,THEFT FROM PERSON,04/13/2017 12:00:00 AM,08:45,LARCENY,4,08/29/2017 08:28:00 AM
3,17050275,BURGLARY AUTO,08/24/2017 12:00:00 AM,18:30,BURGLARY - VEHICLE,4,08/29/2017 08:28:06 AM
4,17019145,GUN/WEAPON,04/06/2017 12:00:00 AM,17:30,WEAPONS OFFENSE,4,08/29/2017 08:27:59 AM
5,17044993,VEHICLE STOLEN,08/01/2017 12:00:00 AM,18:00,MOTOR VEHICLE THEFT,2,08/29/2017 08:28:05 AM
6,17037319,BURGLARY RESIDENTIAL,06/28/2017 12:00:00 AM,12:00,BURGLARY - RESIDENTIAL,3,08/29/2017 08:28:03 AM
7,17030791,BURGLARY RESIDENTIAL,05/30/2017 12:00:00 AM,08:45,BURGLARY - RESIDENTIAL,2,08/29/2017 08:28:00 AM
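Rather than relying on a visual scan, we could confirm the placeholder time programmatically. The sketch below copies three EVENTDT values from the rows shown above; on the full table we would pass `calls.EVENTDT` instead, assuming the column is stored as strings as in this output.

```julia
# Three EVENTDT values copied from the rows displayed above.
eventdt = ["07/23/2017 12:00:00 AM", "04/13/2017 12:00:00 AM", "08/24/2017 12:00:00 AM"]

# True only if every entry carries the placeholder time of midnight.
all_midnight = all(s -> endswith(s, "12:00:00 AM"), eventdt)
println(all_midnight)
```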


As a data cleaning step, we want to merge the EVENTDT and EVENTTM columns to record both date and time in one field. If we define each step as a function that takes in a DataFrame and returns a DataFrame, we can later use the pipe operator `|>` to apply all transformations in one go. For more information on function composition and piping, check the [documentation](https://docs.julialang.org/en/v1/manual/functions/#Function-composition-and-piping-1).
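As a quick illustration of piping, consider two toy functions on plain vectors. The names `add_one` and `double` are hypothetical and serve only to show how `|>` chains single-argument functions from left to right.

```julia
# Two hypothetical single-argument transformations.
add_one(xs) = xs .+ 1
double(xs) = xs .* 2

# `|>` feeds the value on its left into the function on its right.
piped = [1, 2, 3] |> add_one |> double

# The same computation written with function composition.
composed = (double ∘ add_one)([1, 2, 3])

println(piped)
println(composed)
```

The piped form reads in the order in which the steps are applied, which is convenient for a sequence of cleaning steps.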

String concatenation in Julia is done with the `*` operator. To broadcast the operation over all items in an array, we combine it with the dot syntax used previously, writing `.*`:

In [24]:
function combine_event_datetimes(calls)
    calls.EVENTTM = string.(calls.EVENTTM)
    calls[!, :EVENTDTTM] = calls.EVENTDT .* " " .* calls.EVENTTM
    return calls
end
combine_event_datetimes(calls)

Row,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND
,Int64,String,String,String,String
1,17091420,BURGLARY AUTO,07/23/2017 12:00:00 AM,06:00:00,BURGLARY - VEHICLE
2,17020462,THEFT FROM PERSON,04/13/2017 12:00:00 AM,08:45:00,LARCENY
3,17050275,BURGLARY AUTO,08/24/2017 12:00:00 AM,18:30:00,BURGLARY - VEHICLE
4,17019145,GUN/WEAPON,04/06/2017 12:00:00 AM,17:30:00,WEAPONS OFFENSE
5,17044993,VEHICLE STOLEN,08/01/2017 12:00:00 AM,18:00:00,MOTOR VEHICLE THEFT
⋮,⋮,⋮,⋮,⋮,⋮
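The combined column is still a string. As a possible variation (not the approach used in this section), we could drop the placeholder time from EVENTDT and parse the result into a `DateTime` with the `Dates` standard library. The sketch assumes the times are in `HH:MM` form, as in the raw file; the two sample rows are copied from the table above.

```julia
using Dates

# Sample values copied from the first two rows of the table.
eventdt = ["07/23/2017 12:00:00 AM", "04/13/2017 12:00:00 AM"]
eventtm = ["06:00", "08:45"]

# Keep only the date part of EVENTDT, dropping the placeholder "12:00:00 AM".
date_part = first.(split.(eventdt, " "))

# Broadcast `*` to concatenate element-wise.
combined = date_part .* " " .* eventtm

# Parse each combined string; the format assumes "mm/dd/yyyy HH:MM".
fmt = dateformat"mm/dd/yyyy HH:MM"
timestamps = [DateTime(s, fmt) for s in combined]
println(timestamps)
```

Having real `DateTime` values makes later operations such as sorting by time or extracting the hour straightforward.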


### Which parts of the data were entered by a human?

It looks like most of the data columns are machine-recorded, including the date, time, day of week, and location of the event.

In addition, the OFFENSE and CVLEGEND columns appear to contain consistent values. We can check the unique values in each column to see if anything was misspelled:

In [30]:
print(unique(calls.OFFENSE))

["BURGLARY AUTO", "THEFT FROM PERSON", "GUN/WEAPON", "VEHICLE STOLEN", "BURGLARY RESIDENTIAL", "VANDALISM", "DISTURBANCE", "THEFT MISD. (UNDER \$950)", "THEFT FROM AUTO", "DOMESTIC VIOLENCE", "THEFT FELONY (OVER \$950)", "ALCOHOL OFFENSE", "MISSING JUVENILE", "ROBBERY", "IDENTITY THEFT", "ASSAULT/BATTERY MISD.", "2ND RESPONSE", "BRANDISHING", "MISSING ADULT", "NARCOTICS", "FRAUD/FORGERY", "ASSAULT/BATTERY FEL.", "BURGLARY COMMERCIAL", "MUNICIPAL CODE", "ARSON", "SEXUAL ASSAULT FEL.", "VEHICLE RECOVERED", "SEXUAL ASSAULT MISD.", "KIDNAPPING", "VICE", "HOMICIDE"]

In [31]:
print(unique(calls.CVLEGEND))

["BURGLARY - VEHICLE", "LARCENY", "WEAPONS OFFENSE", "MOTOR VEHICLE THEFT", "BURGLARY - RESIDENTIAL", "VANDALISM", "DISORDERLY CONDUCT", "LARCENY - FROM VEHICLE", "FAMILY OFFENSE", "LIQUOR LAW VIOLATION", "MISSING PERSON", "ROBBERY", "FRAUD", "ASSAULT", "NOISE VIOLATION", "DRUG VIOLATION", "BURGLARY - COMMERCIAL", "ALL OTHER OFFENSES", "ARSON", "SEX CRIME", "RECOVERED VEHICLE", "KIDNAPPING", "HOMICIDE"]

Since each value in these columns appears to be spelled correctly, we won't have to perform any corrections on these columns.
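When a column has too many categories to inspect by eye, one option is to compare the number of unique values before and after a simple normalization. The sketch below uses hypothetical entries (the stray lowercase and trailing space are invented, not taken from the Calls data) to show the idea.

```julia
# Hypothetical entries: the second one differs only in case and whitespace.
offenses = ["BURGLARY AUTO", "burglary auto ", "VANDALISM", "VANDALISM"]

# Normalize by trimming whitespace and converting to upper case.
normalized = uppercase.(strip.(offenses))

n_raw = length(unique(offenses))
n_normalized = length(unique(normalized))
println((n_raw, n_normalized))
```

If the two counts differ, some categories are duplicates that differ only in formatting. This check does not catch genuine misspellings, which still require reading the values.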

We also check the BLKADDR column for inconsistencies and find that sometimes a block address is recorded (e.g. 2500 LE CONTE AVE) but other times a cross street is recorded (e.g. ALLSTON WAY & FIFTH ST). This suggests that a human entered these values, so this column will be difficult to use for analysis. Fortunately, we can use the latitude and longitude of the event instead of the street address.

In [32]:
calls[[1, 5002], :BLKADDR]

2-element Array{Union{Missing, String},1}:
 "2500 LE CONTE AVE"     
 "ALLSTON WAY & FIFTH ST"
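To get a sense of how often each form occurs, we could label every address with a rough category. The function `address_kind` below is a hypothetical helper, and its rules (an `&` marks a cross street, a leading number marks a block address) are assumptions based only on the two examples above.

```julia
# Hypothetical helper: assigns a rough category to a BLKADDR value.
function address_kind(addr)
    ismissing(addr) && return :missing
    occursin("&", addr) && return :intersection   # e.g. "ALLSTON WAY & FIFTH ST"
    occursin(r"^\d+\s", addr) && return :block    # e.g. "2500 LE CONTE AVE"
    return :other
end

sample = ["2500 LE CONTE AVE", "ALLSTON WAY & FIFTH ST", missing]
kinds = address_kind.(sample)
println(kinds)
```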

### Final Touchups

This dataset seems almost ready for analysis. The Block_Location column seems to contain strings that record address, latitude, and longitude. We will want to separate the latitude and longitude for easier use.

In [25]:
function split_lat_lon(calls)
    coordinates = map(x -> last(split(x, ['\n', '(', ')'], keepempty=false)), calls.Block_Location)
    df = DataFrame([(Latitude=x, Longitude=y) for (x,y) in split.(coordinates, ",")])
    return hcat(calls, df)
end

split_lat_lon (generic function with 1 method)
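Note that `split_lat_lon` keeps the coordinates as strings. As a sketch of a possible variation, the hypothetical helper `parse_lat_lon` below uses a regular expression to pull the coordinates out of a single Block_Location string as `Float64` values, and returns `missing` when no coordinates are present. The two inputs are taken from values shown earlier in this section.

```julia
# Hypothetical helper: extract "(lat, lon)" from one Block_Location string.
function parse_lat_lon(location::AbstractString)
    m = match(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", location)
    # No parenthesized coordinate pair found: report both values as missing.
    m === nothing && return (Latitude = missing, Longitude = missing)
    return (Latitude = parse(Float64, m[1]), Longitude = parse(Float64, m[2]))
end

with_coords = parse_lat_lon("2500 LE CONTE AVE\nBerkeley, CA\n(37.876965, -122.260544)")
without_coords = parse_lat_lon("0\nBerkeley, CA\n")
println(with_coords)
println(without_coords)
```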

Then, we can match the day of week number with its weekday:

In [26]:
# This DF contains the day for each number in CVDOW
day_of_week = CSV.read("data/cvdow.csv")

Row,CVDOW,Day
,Int64,String
1,0,Sunday
2,1,Monday
3,2,Tuesday
4,3,Wednesday
5,4,Thursday
⋮,⋮,⋮


In [27]:
function match_weekday(calls)
    # Inner join on CVDOW (in DataFrames 0.21 and later, `join` is replaced by `innerjoin`)
    return join(day_of_week, calls, on = :CVDOW)
end

match_weekday (generic function with 1 method)
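Since the table also contains the event date, we can sanity-check the CVDOW coding against the calendar using the `Dates` standard library. The sketch below assumes the coding runs from 0 for Sunday through 6 for Saturday, which is consistent with the rows shown; the three (date, CVDOW) pairs are copied from the table displayed earlier. The helper name `cvdow_from_date` is hypothetical.

```julia
using Dates

# (date, CVDOW) pairs copied from rows shown earlier in this section.
rows = [("07/23/2017", 0), ("04/13/2017", 4), ("08/01/2017", 2)]

# `dayofweek` returns 1 for Monday through 7 for Sunday, so taking the
# remainder modulo 7 maps Sunday to 0 and Monday through Saturday to 1-6.
cvdow_from_date(s) = dayofweek(Date(s, dateformat"mm/dd/yyyy")) % 7

consistent = all(cvdow_from_date(d) == dow for (d, dow) in rows)
first_day = dayname(Date("07/23/2017", dateformat"mm/dd/yyyy"))
println((consistent, first_day))
```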

We'll drop columns we no longer need:

In [28]:
function drop_uneeded_cols(calls)
    return select!(calls, Not([:CVDOW, :InDbDate, :Block_Location, :City,
                               :State, :EVENTDT, :EVENTTM]))
end

drop_uneeded_cols (generic function with 1 method)

Finally, we'll pipe the `calls` DataFrame through all the functions we've defined by using the `|>` operator:

In [49]:
calls_final = combine_event_datetimes(calls) |> split_lat_lon |> match_weekday |> drop_uneeded_cols

Row,Day,CASENO,OFFENSE,CVLEGEND,BLKADDR
,String,Int64,String,String,String⍰
1,Sunday,17091420,BURGLARY AUTO,BURGLARY - VEHICLE,2500 LE CONTE AVE
2,Sunday,17038302,BURGLARY AUTO,BURGLARY - VEHICLE,BOWDITCH STREET & CHANNING WAY
3,Sunday,17049346,THEFT MISD. (UNDER $950),LARCENY,2900 CHANNING WAY
4,Sunday,17091319,THEFT MISD. (UNDER $950),LARCENY,2100 RUSSELL ST
5,Sunday,17044238,DISTURBANCE,DISORDERLY CONDUCT,TELEGRAPH AVENUE & DURANT AVE
⋮,⋮,⋮,⋮,⋮,⋮


The Calls dataset is now ready for further data analysis. In the next section, we will clean the Stops dataset.

In [54]:
# HIDDEN
# Save data to CSV for other chapters
# CSV.write("data/calls_julia.csv", calls_final)