In [2]:
# HIDDEN
Base.displaysize() = (5, 110)
using DataFrames
using CSV

In [3]:
# HIDDEN
# Recent CSV.jl versions require the sink type as the second argument
calls = CSV.read("data/calls_julia.csv", DataFrame)
stops = CSV.read("data/stops_julia.csv", DataFrame);

## Scope

The scope of a dataset is how much of the topic we want to analyze the data actually covers. We seek to answer the following question about our data scope:

**Does the data cover the topic of interest?**

For example, the Calls and Stops datasets record police calls and stops made in Berkeley. If we are interested in crime incidents across the state of California, however, these datasets will be too limited in scope.

In general, a larger scope is more useful than a smaller one, since we can filter a broad dataset down to a narrower one but usually cannot go the other way. For example, if we had a dataset of police stops in the United States, we could subset it to investigate Berkeley.
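
Narrowing scope is a plain filter. Here is a minimal sketch with a hypothetical nationwide table; the `city` field and the sample records are assumptions made for illustration and are not part of the Calls or Stops datasets:

```julia
# Hypothetical records from a nationwide stops dataset (illustrative only)
us_stops = [
    (city = "Berkeley", incident = "T"),
    (city = "Oakland",  incident = "1194"),
    (city = "Berkeley", incident = "1194"),
    (city = "Chicago",  incident = "T"),
]

# Narrow the scope from the whole country to a single city
berkeley_stops = filter(r -> r.city == "Berkeley", us_stops)

println(length(berkeley_stops), " of ", length(us_stops), " records are in Berkeley")
```

The reverse direction has no equivalent: no operation on `berkeley_stops` can recover the Oakland or Chicago records.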

Keep in mind that scope is a broad term that does not always describe geographic location. For example, it can also refer to time coverage: the Calls dataset only contains data for a 180-day period.

We will often address the scope of the dataset during the investigation of the data generation process and confirm the dataset's scope during EDA. Let's confirm the geographic and time scope of the Calls dataset.

In [3]:
calls

Row,Day,CASENO,OFFENSE,CVLEGEND,BLKADDR
Type,String,Int64,String,String,String⍰
1,Sunday,17091420,BURGLARY AUTO,BURGLARY - VEHICLE,2500 LE CONTE AVE
2,Sunday,17038302,BURGLARY AUTO,BURGLARY - VEHICLE,BOWDITCH STREET & CHANNING WAY
3,Sunday,17049346,THEFT MISD. (UNDER $950),LARCENY,2900 CHANNING WAY
4,Sunday,17091319,THEFT MISD. (UNDER $950),LARCENY,2100 RUSSELL ST
5,Sunday,17044238,DISTURBANCE,DISORDERLY CONDUCT,TELEGRAPH AVENUE & DURANT AVE
⋮,⋮,⋮,⋮,⋮,⋮


The EVENTDTTM column is of type `String`. To perform date operations on it, we extract the date part into a new column of type `Date`:

In [17]:
using Dates
# Split the EVENTDTTM string and select only the date
calls.DATE = first.(split.(calls.EVENTDTTM))
# Transform it into Date
calls.DATE = Date.(calls.DATE, dateformat"m/d/y")
calls

Row,Day,CASENO,OFFENSE,CVLEGEND,BLKADDR,EVENTDTTM
Type,String,Int64,String,String,String⍰,String
1,Sunday,17091420,BURGLARY AUTO,BURGLARY - VEHICLE,2500 LE CONTE AVE,07/23/2017 12:00:00 AM 06:00
2,Sunday,17038302,BURGLARY AUTO,BURGLARY - VEHICLE,BOWDITCH STREET & CHANNING WAY,07/02/2017 12:00:00 AM 22:00
3,Sunday,17049346,THEFT MISD. (UNDER $950),LARCENY,2900 CHANNING WAY,08/20/2017 12:00:00 AM 23:20
4,Sunday,17091319,THEFT MISD. (UNDER $950),LARCENY,2100 RUSSELL ST,07/09/2017 12:00:00 AM 04:15
5,Sunday,17044238,DISTURBANCE,DISORDERLY CONDUCT,TELEGRAPH AVENUE & DURANT AVE,07/30/2017 12:00:00 AM 01:16
⋮,⋮,⋮,⋮,⋮,⋮,⋮
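
The conversion above throws an error on the first string that does not match the format. As a hedged alternative, the sketch below uses `tryparse` from the `Dates` standard library, which returns `nothing` instead of throwing, so malformed values can be counted before the column is converted. The helper name `parse_event_date` and the sample strings are hypothetical:

```julia
using Dates

# Return the calendar date of an EVENTDTTM-style string, or `nothing` when the
# string is empty or its first token is not a date in month/day/year form.
function parse_event_date(s::AbstractString)
    parts = split(s)
    isempty(parts) && return nothing
    return tryparse(Date, first(parts), dateformat"m/d/y")
end

# Hypothetical values: two well-formed strings and two malformed ones
samples = ["07/23/2017 12:00:00 AM", "08/20/2017 12:00:00 AM", "not a date", ""]
parsed = parse_event_date.(samples)

n_bad = count(isnothing, parsed)
println(n_bad, " of ", length(samples), " values could not be parsed")
```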


In [33]:
# Shows earliest and latest dates in calls
sorted_dates = sort(calls, :DATE)
println(first(sorted_dates, 5).DATE)
println(last(sorted_dates, 5).DATE)

Date[2017-03-02, 2017-03-02, 2017-03-02, 2017-03-02, 2017-03-02]
Date[2017-08-27, 2017-08-27, 2017-08-28, 2017-08-28, 2017-08-28]
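
Sorting the whole table is not required just to read the endpoints. A minimal sketch, using a small hypothetical vector in place of `calls.DATE`:

```julia
using Dates

# Hypothetical date column standing in for calls.DATE
dates = [Date(2017, 7, 23), Date(2017, 3, 2), Date(2017, 8, 28), Date(2017, 7, 9)]

# extrema scans the values once and returns (earliest, latest)
earliest, latest = extrema(dates)
println("earliest: ", earliest, "  latest: ", latest)
```

If the column could contain `missing` values, wrapping it as `extrema(skipmissing(dates))` would skip them.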


In [36]:
sorted_dates[end, :DATE] - sorted_dates[1, :DATE]

179 days

The earliest and latest dates are 179 days apart, so the table covers 180 calendar days when both endpoints are counted (March 2nd through August 28th, 2017). This matches the 180-day period given in the data description.
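
The distance between the endpoints does not tell us whether every day in between has records. The sketch below summarizes the coverage of a date column and lists the calendar days with no entries; `date_coverage` is a hypothetical helper and the sample dates are made up:

```julia
using Dates

# Summarize the time coverage of a vector of dates: endpoints, distance between
# them, number of calendar days in the range, and days without any record.
function date_coverage(dates::AbstractVector{Date})
    lo, hi = extrema(dates)
    observed = Set(dates)
    all_days = lo:Day(1):hi
    missing_days = [d for d in all_days if !(d in observed)]
    return (earliest = lo, latest = hi, span = hi - lo,
            calendar_days = length(all_days), missing_days = missing_days)
end

# Hypothetical sample: a five-day range with one day that has no records
sample = [Date(2017, 3, 2), Date(2017, 3, 3), Date(2017, 3, 3),
          Date(2017, 3, 5), Date(2017, 3, 6)]
coverage = date_coverage(sample)

println("span: ", coverage.span, ", calendar days: ", coverage.calendar_days)
println("days without records: ", coverage.missing_days)
```

Subtracting the endpoints counts the steps between them, so a range covering `n` calendar days has a difference of `n - 1` days. On the real data the helper could be called as `date_coverage(calls.DATE)`.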

To check the geographic scope, we can use a map:

In [18]:
# We obtain the coordinates to plot our heatmap
locs = stops[completecases(stops), [Symbol("Location - Latitude"), Symbol("Location - Longitude")]]
locs_tuples = collect(zip(locs[:, Symbol("Location - Latitude")], locs[:, Symbol("Location - Longitude")]));

In [19]:
# We'll use the Folium library for Python by calling it via PyCall

using PyCall

matplotlib_cm = pyimport("matplotlib.cm")
matplotlib_colors = pyimport("matplotlib.colors")
cmap = matplotlib_cm.get_cmap("prism")

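flm = pyimport("folium")
# Import the plugins submodule explicitly so that HeatMap is available
flm_plugins = pyimport("folium.plugins")

# Map center: Berkeley, given as (latitude, longitude)
BERKELEY_COORDINATES = (37.87, -122.28)
sf_map = flm.Map(location=BERKELEY_COORDINATES, zoom_start=13)
heatmap = flm_plugins.HeatMap(locs_tuples, radius=10)
sf_map.add_child(heatmap)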

sf_map

With a few exceptions, the Calls dataset covers the Berkeley area. Most police calls happened in Downtown Berkeley and in the area south of the UC Berkeley campus.
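
A map shows the pattern at a glance, and a bounding-box count can put a number on "a few exceptions". The sketch below is only illustrative: the box limits are a rough assumption, not official city boundaries, and the points are hypothetical coordinates in the same `(latitude, longitude)` layout as `locs_tuples`:

```julia
# Rough bounding box around Berkeley. These limits are an assumption chosen
# for illustration, not official city boundaries.
berkeley_box = (lat = (37.84, 37.91), lon = (-122.33, -122.23))

# True when a (latitude, longitude) pair falls inside the box
function in_box(point, box)
    lat, lon = point
    return box.lat[1] <= lat <= box.lat[2] && box.lon[1] <= lon <= box.lon[2]
end

# Hypothetical points; the third one lies south of the box
points = [(37.8716, -122.2727), (37.8590, -122.2590),
          (37.8044, -122.2712), (37.8800, -122.3000)]

inside = count(p -> in_box(p, berkeley_box), points)
share = inside / length(points)
println(inside, " of ", length(points), " points fall inside the box")
```

A rectangle is a coarse stand-in for a city outline, so points near the border deserve a look on the map before they are treated as out of scope.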

Let's now confirm the temporal and geographic scope for the Stops dataset:

In [37]:
stops

Row,Incident Number,Call Date/Time,Location,Incident Type,Dispositions,Location - Latitude
Type,String,DateTime,String,String,String⍰,Float64⍰
1,2015-00004825,2015-01-26T00:10:00,SAN PABLO AVE / MARIN AVE,T,M,missing
2,2015-00004829,2015-01-26T00:50:00,SAN PABLO AVE / CHANNING WAY,T,M,missing
3,2015-00004831,2015-01-26T01:03:00,UNIVERSITY AVE / NINTH ST,T,M,missing
4,2015-00004848,2015-01-26T07:16:00,2000 BLOCK BERKELEY WAY,1194,BM4ICN,missing
5,2015-00004849,2015-01-26T07:43:00,1700 BLOCK SAN PABLO AVE,1194,BM4ICN,missing
⋮,⋮,⋮,⋮,⋮,⋮,⋮


In [42]:
sort(stops, Symbol("Call Date/Time"))

Row,Incident Number,Call Date/Time,Location,Incident Type,Dispositions,Location - Latitude
Type,String,DateTime,String,String,String⍰,Float64⍰
1,2015-00004825,2015-01-26T00:10:00,SAN PABLO AVE / MARIN AVE,T,M,missing
2,2015-00004829,2015-01-26T00:50:00,SAN PABLO AVE / CHANNING WAY,T,M,missing
3,2015-00004831,2015-01-26T01:03:00,UNIVERSITY AVE / NINTH ST,T,M,missing
4,2015-00004848,2015-01-26T07:16:00,2000 BLOCK BERKELEY WAY,1194,BM4ICN,missing
5,2015-00004849,2015-01-26T07:43:00,1700 BLOCK SAN PABLO AVE,1194,BM4ICN,missing
⋮,⋮,⋮,⋮,⋮,⋮,⋮
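
The truncated display shows only the earliest rows of the sorted table. To see how records are spread over the whole period, one option is to count them per month. A minimal sketch, where `monthly_counts` is a hypothetical helper and the timestamps are made-up stand-ins for the `Call Date/Time` column:

```julia
using Dates

# Count records per (year, month), returned in chronological order
function monthly_counts(timestamps)
    counts = Dict{Tuple{Int,Int},Int}()
    for t in timestamps
        key = (year(t), month(t))
        counts[key] = get(counts, key, 0) + 1
    end
    return sort(collect(counts))
end

# Hypothetical timestamps (illustrative only)
sample = [DateTime(2015, 1, 26, 0, 10), DateTime(2015, 1, 26, 7, 16),
          DateTime(2015, 2, 3, 14, 0), DateTime(2017, 4, 30, 23, 5)]

for ((y, m), n) in monthly_counts(sample)
    println(y, "-", lpad(m, 2, '0'), ": ", n)
end
```

Months that are absent from the result, or that have unusually low counts, point to gaps in the time scope that the endpoints alone would hide.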


As promised, the data collection begins on January 26th, 2015. It looks like the data were downloaded somewhere around the beginning of May 2017 since the dates stop on April 30th, 2017. Let's draw a map to see the geographic data:

In [20]:
sf_map

We can confirm that the police stops in the dataset happened in Berkeley, and that most police stops happened in the Downtown Berkeley and West Berkeley areas.