# Stats with the VicPD crime data

* Today we will explore the VicPD crime data file
* Load it into Python memory
* Start exploring some basic features of the data
* Apply some basic statistics

The <a href="https://vicpd.ca/crime-reports">Victoria Police Department Crime Statistics</a> page provides an excellent data resource.  You will want to click on the VicPD badge and then set the filter so that you see some crime data.  

**EDIT** I have included my download of the VicPD crime stats in the course repo. 

In [None]:
import json as js
import pprint as pp

with open('../data/vic_crimereports.json') as data_file:    
    pdata = js.load(data_file)

In [None]:
print(pdata.keys())
#pp.pprint(pdata['meta'])
print("There are ",len(pdata['data']), " records.", sep='')

### Be careful!

This is a very large file.  As we saw in Monday's class, a simple command like **print(pdata['data'])** froze your instructor's poor laptop. 
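One way to peek inside a large nested structure without printing all of it is to take only the first few records and cut each printed line to a fixed width. A minimal sketch, using a small made-up stand-in for `pdata` (the helper name `preview` is hypothetical and not part of the course code):

```python
import itertools

def preview(records, n=3, width=80):
    """Return at most n records, each rendered as a string of at most width characters."""
    lines = []
    for rec in itertools.islice(records, n):
        text = repr(rec)
        lines.append(text if len(text) <= width else text[:width - 3] + "...")
    return lines

## toy stand-in for pdata['data']; the real file has far more rows
toy = {'data': [[i, "x" * 200] for i in range(1000)]}
for line in preview(toy['data']):
    print(line)
```

The same call should work on `pdata['data']`, since it only ever touches the first `n` rows.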

In [None]:
for x in range(28):
    print(pdata['meta']['view']['columns'][x]['name'], " ", end='')
print('\n')
## decide what information to store, and how to store it. 
## we will use a namedtuple type, as it has a low memory profile
#print(len(pdata['data']),'\n')

## reverse-lookup for defining our reduced data set. 
RLU = dict([(pdata['meta']['view']['columns'][x]['name'], x) for x in range(28)])
#print(RLU, '\n')

keepFields = ['latitude', 'longitude', 'incident_type_primary', 'case_number', 'incident_datetime',\
             'address_1', 'created_at', 'updated_at', 'parent_incident_type']

for x in keepFields:
    print(x, " ", pdata['data'][0][RLU[x]], " ", sep='')

from collections import namedtuple

pdatt = namedtuple('pdatt', keepFields)
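For anyone who has not met `namedtuple` before, here is a small sketch of how one behaves. The two-field `Point` type and its values are made up for illustration; `pdatt` above is built the same way, just with more fields.

```python
from collections import namedtuple

## hypothetical two-field record, standing in for pdatt
Point = namedtuple('Point', ['latitude', 'longitude'])

row = {'latitude': 48.43, 'longitude': -123.37}
p = Point(**row)          ## keyword-unpack the dict into the namedtuple

print(p.latitude, p[1])   ## fields are reachable by name or by position
print(p._asdict())        ## and can be turned back into a dict
```

Like an ordinary tuple, a namedtuple is immutable, so any conversions (floats, dates) have to happen before the record is built.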

In [None]:
import datetime as dt

def isfloat(value):
  try:
    float(value)
    return True
  except ValueError:
    return False

cdata = []
for x in pdata['data']:
    ## convert to dictionary
    tdict = dict([(y, x[RLU[y]]) for y in keepFields])
    
    ## check to see if any of the kept fields are undefined
    nexists = any(value is None for value in tdict.values())
    ## let's ignore the records with missing data (e.g. no location). 
    if nexists: 
        continue
            
    ## convert the numbers to floats
    if isfloat(tdict['latitude']):
        tdict['latitude'] = float(tdict['latitude'])
    else:
        print("Invalid lat.")
        continue
    if isfloat(tdict['longitude']):
        tdict['longitude'] = float(tdict['longitude'])
    else:
        print("Invalid long.")
        continue
    
    ## and the dates to python datetime objects
    tdict['incident_datetime'] = dt.datetime.strptime(tdict['incident_datetime'],\
                                            '%Y-%m-%dT%H:%M:%S')
    tdict['created_at'] = dt.datetime.strptime(tdict['created_at'],\
                                            '%Y-%m-%dT%H:%M:%S')
    tdict['updated_at'] = dt.datetime.strptime(tdict['updated_at'],\
                                            '%Y-%m-%dT%H:%M:%S')
   
    ## convert dict to pdatt
    pdat = pdatt(**tdict)
    cdata.append(pdat)

#print(len(cdata))
print(cdata[0])


In [None]:
from operator import attrgetter


## let's get the earliest and most recent records, respectively. 
date_cdate = sorted(cdata, key = attrgetter('incident_datetime'))

print(date_cdate[0], '\n')
print(date_cdate[-1], '\n')
## woot!  6 years of data!

print(date_cdate[-1].incident_datetime - date_cdate[0].incident_datetime)

In [None]:
from collections import defaultdict

## that's about 5 years, 150 days of data!

## let's tabulate the "Crime Types" as a structure. 

ctypes = defaultdict(set)
for x in cdata:
    ## defaultdict creates the empty set the first time a parent type is seen
    ctypes[x.parent_incident_type].add(x.incident_type_primary)
        
pp.pprint(ctypes)

### Let's try a heat map with a map of the city

For this you will need to have **Folium** installed.  In your Virtual Machine (VM), at the terminal type **sudo pip install folium** if the "import" call below throws an error. 

*WENDI* at present does not have Folium installed, so this will only work on your VM. 

In [None]:
import sys
## you might need to tell Python where your folium library is.  On my machine, external libraries
## are installed in two locations. 

## these are where my non-anaconda python libraries are located.  I need these for
## folium and such
expaths = ["/usr/lib/python3/dist-packages", "/usr/local/lib/python3.5/dist-packages"]
for xp in expaths:
    if (xp not in sys.path):
        sys.path.append(xp)

import folium as fo
from folium import plugins as fpl

In [None]:
hdata = []
for x in cdata:
    if (x.parent_incident_type=="Assault with Deadly Weapon"):
        newpt = [x.latitude, x.longitude, 0.02 ] ## the last argument is the "heat"
        hdata.append( newpt )
mapa = fo.Map([48.4323, -123.3720], tiles='Stamen Terrain', zoom_start=13)
mapa.add_child(fpl.HeatMap(hdata))
#mapa.create_map(path='assault.wdw.heatmap.html')
mapa

In [None]:
## Heatmap, drugs on Fridays
##  datetime weekday(), 0=Mon, 1=Tues, 2=Wed, 3=Thur, 4=Fri, 5=Sat, 6=Sun. 
hdata = []
for x in cdata:
    if (x.parent_incident_type=="Drugs") and\
       (x.incident_datetime.weekday()==4):
        newpt = [x.latitude, x.longitude, 0.02 ] ## the last argument is the "heat"
        hdata.append( newpt )
mapa = fo.Map([48.4323, -123.3720], tiles='Stamen Terrain', zoom_start=13)
mapa.add_child(fpl.HeatMap(hdata))
#mapa.create_map(path='drugs.fri.heatmap.html')
mapa

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#%matplotlib nbagg

In [None]:
## Pie chart of all the crime types. 
## Let's first do it on the "parent" crime type

cnames = [p for p, q in ctypes.items()]

tots = defaultdict(int)
for x in cdata:
    tots[x.parent_incident_type]+=1
tot = 0
for x,y in tots.items():
    tot += y

fractions = [100*y/tot for x,y in tots.items()]
names = [x for x in tots.keys()]
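The tally-then-normalize pattern above can also be written with `collections.Counter`. This is only a sketch on a made-up list of incident types, not a replacement for the cell above:

```python
from collections import Counter

## toy list of parent incident types (made up for illustration)
toy_types = ['Theft', 'Traffic', 'Theft', 'Disorder', 'Theft', 'Traffic']

counts = Counter(toy_types)
total = sum(counts.values())
percent = {name: 100 * c / total for name, c in counts.items()}

## most_common() lists the types from most to least frequent
for name, c in counts.most_common():
    print(name, c, "%.1f%%" % percent[name])
```

Keeping the percentages in a dict keyed by name avoids relying on two separate lists staying in the same order.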

In [None]:
with plt.xkcd(): ## this enables the xkcd style.
    
    fig=plt.figure()
    fig.set_size_inches(10,10) 
        
    plt.pie(fractions, labels=names, autopct='%1.1f%%', shadow=False)
    plt.title('Relative frequency of incident types', fontsize=20)

In [None]:
## that's a little ugly, let's sort it so that small incidents are only beside "big" ones. . .
## so we will sort the (name, fraction) pairs on size, then re-shuffle them together. . .
from operator import itemgetter

dpairs = list(zip(names, fractions))
sdpairs = sorted(dpairs, key=itemgetter(1))

nnames = []
nfracs = []

## build the new names and fractions by alternately taking the first (smallest) and
## last (largest) elements of the sorted list, popping them off as we go
while (len(sdpairs)>0):
    name, frac = sdpairs.pop(0) if (len(sdpairs) % 2 == 0) else sdpairs.pop()
    nnames.append(name)
    nfracs.append(frac)

In [None]:
## same thing, but with nnames and nfracs
with plt.xkcd(): ## this enables the xkcd style.
    
    fig=plt.figure()
    fig.set_size_inches(10,10) 
        
    plt.pie(nfracs, labels=nnames, autopct='%1.1f%%', shadow=False)
    plt.title('Relative frequency of incident types', fontsize=20)

In [None]:
## Let's do a more detailed frequency analysis of the crime types.  Let's store the frequencies 
## for the parent types in a dict, and similarly store frequencies for the incident_type_primary
##  as a dict. 

tot = 0
all_tots = defaultdict(int)
for x in cdata:
    tot += 1
    all_tots[x.parent_incident_type] += 1
    all_tots[(x.parent_incident_type, x.incident_type_primary)] += 1
#pp.pprint(all_tots)

## compute the parent incident type frequencies
all_freq = defaultdict(float)

for x in ctypes.keys():
    all_freq[x] = 100*all_tots[x] / tot
    for y in ctypes[x]:
        all_freq[(x,y)] = 100*all_tots[(x,y)] / all_tots[x]

#pp.pprint(all_freq)

In [None]:
## now let's list the major parent_incident_type items (say, over 5%)

## and the major  incident_type_primary
threshold = 5.0

for x in ctypes.keys():
    if all_freq[x]>threshold:
        print(x, "%.1f" % all_freq[x])
        for y in ctypes[x]:
            if all_freq[(x,y)]>threshold:
                print(" -- ", y, "%.1f" % all_freq[(x,y)])


In [None]:
## let's write something that counts the occurrences of a crime type by the day of the week
## crtype can be a string, or a pair of strings as in the above code
def weekdaycount(crtype):
    daycount = [0]*7
    if (isinstance(crtype, str)):
        for x in cdata:
            if x.parent_incident_type == crtype:
                daycount[x.incident_datetime.weekday()] += 1
    elif (isinstance(crtype, tuple)) and (len(crtype)==2) and\
         (isinstance(crtype[0], str)) and (isinstance(crtype[1], str)):
        for x in cdata:
            if x.parent_incident_type == crtype[0] and x.incident_type_primary == crtype[1]:
                daycount[x.incident_datetime.weekday()] += 1
    return daycount

def weekdaypct(crtype):
    wdk = weekdaycount(crtype)
    T = all_tots[crtype]
    return ['{:.1f}'.format(100*x/T) for x in wdk]

def presentBDWeek(crtype):
    retval = "Mon, Tue, Wed, Thu, Fri, Sat, Sun\n"
    retval += str(weekdaypct(crtype))
    return retval
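To see the weekday bookkeeping in isolation, here is a small sketch on three made-up timestamps. It keeps the percentages as numbers and only formats them when printing, which is one possible design and not the one used in the functions above.

```python
import datetime as dt

## made-up timestamps for illustration
stamps = [dt.datetime(2016, 1, 1, 23, 30),   ## a Friday
          dt.datetime(2016, 1, 2, 1, 15),    ## a Saturday
          dt.datetime(2016, 1, 8, 22, 0)]    ## a Friday

daycount = [0] * 7
for s in stamps:
    daycount[s.weekday()] += 1     ## weekday(): 0=Mon, ..., 6=Sun

labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
pct = [100 * c / len(stamps) for c in daycount]
for lab, p in zip(labels, pct):
    print(lab, "%.1f" % p)
```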
    

In [None]:
for x in ctypes.keys():
    print(x, ' (tot ', all_tots[x], ')', sep='')
    print(presentBDWeek(x), '\n')
    for y in ctypes[x]:
        print(x, " - ", y, ' (tot ', all_tots[(x,y)], ')', sep='')
        print(presentBDWeek( (x, y) ), '\n')
    

#### Weather and crime relations. . .

We want to peek a little deeper into the data.  Let's consider the relationship between weather and crime. For example, we could plot the daily number of incidents of a given crime type (say, traffic violations) against the average daily temperature.  For this, we will need temperature data, going back to **2011**. 

As in [Week 3](../Week.3/Lecture.2.ipynb) we will want to download the appropriate data from the Environment Canada [Historical Weather Database](http://climate.weather.gc.ca/historical_data/search_historic_data_e.html).  We have done so. 
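The daily files are comma-separated with quoted fields. One way to read such a file is Python's `csv` module, which strips the quotes for us. The sketch below runs on a tiny made-up table; its column names and layout are assumptions for illustration and may not match the real weather files.

```python
import csv
import io
import datetime as dt

## toy text imitating quoted CSV rows; the column layout is an assumption
toy_csv = io.StringIO(
    '"Date/Time","Year","Month","Day","Max Temp","Min Temp","Mean Temp"\n'
    '"2011-01-01","2011","01","01","4.5","-1.0","1.8"\n'
    '"2011-01-02","2011","01","02","","",""\n'
)

temps = {}
for row in csv.DictReader(toy_csv):
    try:
        triple = (float(row["Max Temp"]), float(row["Min Temp"]), float(row["Mean Temp"]))
    except ValueError:
        continue      ## skip days with missing readings
    day = dt.date(int(row["Year"]), int(row["Month"]), int(row["Day"]))
    temps[day] = triple

print(temps)
```

Looking columns up by name is a little more robust than counting positions, at the cost of needing the header row to be present.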

In [None]:
import os
import fnmatch as fn

files = fn.filter(os.listdir('../data'), "eng-daily*.csv")

## let's store the weather data as a dictionary. 

wdatlist = {}

for f in files:
    ## use a handle name other than fo, which is already our alias for folium
    with open("../data/"+f) as wfile:
        #print("loading ", f)
        content = wfile.readlines()
        for j in range(26, len(content)):
            ab = content[j].split(',')
            for k in range(len(ab)):
                ab[k] = ab[k].translate({ord(c): None for c in '"'})
            date = dt.date(int(ab[1]), int(ab[2]), int(ab[3]))
            if len(ab[5])>0 and len(ab[7])>0 and len(ab[9])>0:
                wdatlist[date] = (float(ab[5]), float(ab[7]), float(ab[9]))
            
## wdatlist[date] = (max, min, mean) temps
print(len(wdatlist.keys()) // 365, " years of data and ", len(wdatlist.keys()) % 365, " days", sep='')

## let's find all the common dates with data, and put into one big array. 

## we have wdatlist dict dates -> triples
## and cdata a list of pdatt objects, that have datetimes. . .
## so we need to convert cdata to an object indexed by dates, containing counts of 
## everything that occurred on those dates. 
comdates = []

ccdata = defaultdict(int)
for xd in cdata:
    ## only record if we have weather data
    if xd.incident_datetime.date() in wdatlist.keys():
        comdates.append(xd.incident_datetime.date())
        ccdata[(xd.incident_datetime.date(),xd.parent_incident_type, xd.incident_type_primary)] += 1

## takes as input parent_incident_type and incident_type primary, 
## builds list x,y coordinates of weather data, counts of crimes. 
## k = 0, 1, 2 for max min mean temps
def xyplot(pit, itp, k):
    x = [wdatlist[date][k] for date in comdates]
    y = [ccdata[(date, pit, itp)] for date in comdates]
    return x,y

In [None]:
x,y = xyplot('Disorder','CAUSE A DISTURBANCE', 0)

plt.xlabel('max temp')
plt.ylabel('disorder')
plt.title('disorder citation counts per day vs. max temp')

plt.plot( x,y, 'ro')

In [None]:
## plot of min temp (x-axis) vs traffic collisions

x,y = xyplot('Traffic','COLLISION-DAMAGE OVER $1000', 1)

plt.xlabel('min temp')
plt.ylabel('num collision')
plt.title('collision counts per day vs. min temp')

plt.plot( x,y, 'ro')

### Preliminary Analysis

One might have expected there to be a stronger relationship between weather and traffic accidents.  

This gives us a good question to ask, perhaps for the next lab or homework assignment.

**Question:** Is there a relationship between weather and car accidents?  Perhaps we are not seeing it because we have *chosen* to look at the data in a way that does not bring true insight into the question?

**What might we have done wrong?**
 - We are looking at accidents during day n, but the freezing temperatures are on night n.  Perhaps we should be looking at accidents in the morning of day n+1? 
 - If it's been below freezing for an entire day, most of the time one would expect the roads are clear and free of ice.  So max/min/mean temperatures might not be important *on their own*.  
 - Perhaps we should look for the scenario where on day n it is above freezing, rains, and it freezes during the night, then consider accidents on (?morning?) day n+1. 
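The day n versus day n+1 idea can be sketched as a simple date shift. The temperatures and collision counts below are made up, and the dictionaries `min_temp` and `collisions` are hypothetical stand-ins for the weather and crime tables.

```python
import datetime as dt

## made-up minimum temperatures and collision counts, keyed by date
min_temp = {dt.date(2015, 1, 1): 2.0,
            dt.date(2015, 1, 2): -3.0,
            dt.date(2015, 1, 3): 1.0}
collisions = {dt.date(2015, 1, 2): 4,
              dt.date(2015, 1, 3): 9}

one_day = dt.timedelta(days=1)

## pair the temperature on day n with the collision count on day n+1,
## keeping only days whose following day has a recorded count
pairs = [(min_temp[d], collisions[d + one_day])
         for d in sorted(min_temp)
         if (d + one_day) in collisions]

print(pairs)
```

A further refinement, not shown here, would be to keep only incidents from the morning of day n+1.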


In [None]:
## this raises the question, what is the relation between max and min temperatures? 

x = [wdatlist[date][1] for date in comdates] #min
y = [wdatlist[date][0] for date in comdates] #max
plt.xlabel('min temp (night)')
plt.ylabel('max temp (day)')
plt.title('max vs. min daily temperatures')
plt.plot(x,y,'ro')

###  We appear to have a rough correlation!

A technique to quantify correlation that we learn about in **Math 211** is called the *least squares* technique. 

The idea is that, given a situation like the above where one has a list of $x$ and $y$ coordinates, one could try to **fit** a trend-line to the data:

$$y = b + ax$$

As we can see from the above plot, no single line will be a perfect fit for all the data, but we can ask for a *best fit*. 

Given numbers $(a,b) \in \mathbb R^2$ the **individual error** of the fit for a data point $(x_i, y_i)$ is defined to be

$$E_i = y(x_i) - y_i = b + ax_i - y_i$$

The **total error** (squared) is

$$E^2 = \sum_i E_i^2 = \sum_i (b + ax_i - y_i)^2$$

We use squares because:
 - It is simple algebraically
 - If we used the first power, positive and negative errors would cancel, so a poor fit could still have a total error near zero
 - If we used the first power and absolute values, i.e.
$$E = \sum_i |b + ax_i - y_i|$$
we do get a useful concept, although it is more effort to work with. 
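A tiny made-up example shows the difference between these three choices. The points and the candidate line are arbitrary, chosen so that the signed errors cancel exactly.

```python
## three made-up data points and a candidate line y = b + a*x
xs = [0.0, 1.0, 2.0]
ys = [1.0, 0.0, 5.0]
a, b = 1.0, 1.0

errs = [b + a * xi - yi for xi, yi in zip(xs, ys)]

signed_total = sum(errs)                    ## positive and negative errors cancel
abs_total = sum(abs(e) for e in errs)       ## sum of |E_i|
sq_total = sum(e ** 2 for e in errs)        ## sum of E_i^2

print(errs, signed_total, abs_total, sq_total)
```

Here the line misses two of the three points by $2$ each, yet the signed total is $0$; the absolute and squared totals ($4$ and $8$) both register the misses.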


### Least squares

The idea is to consider the error squared function as a real-valued function of the two real variables $(a,b)$, i.e. 

$$E^2 : \mathbb R^2 \to \mathbb R$$

Using calculus we can check that this function has a unique minimum, which occurs when 

$$\frac{\partial (E^2)}{\partial a} = \frac{\partial (E^2)}{\partial b} = 0$$

which is

$$2 \sum_i (b + ax_i - y_i)x_i = 2 \sum_i (b + ax_i - y_i) = 0$$

a system of two linear equations in the two variables $(a,b)$. We can re-write it as:




$$bn + a \sum_i x_i = \sum_i y_i$$

$$b \sum_i x_i + a \sum_i x_i^2 = \sum_i y_i x_i$$

With $n$ the number of points in the data set.  Linear systems can be written as matrix equations. . .

$$\pmatrix{  n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2} \pmatrix{ b \\ a} = \pmatrix{ \sum_i y_i \cr \sum_i y_i x_i} $$

Although in principle this is enough to answer the question (multiply by the inverse of the square matrix!) people typically observe that the matrix on the left is the product of two matrices more directly associated to the data.

$$X = \pmatrix{ 1 & 1 & 1 & \cdots & 1 \\ x_1 & x_2 & x_3 & \cdots & x_n }$$

Notice 
$$ \pmatrix{  n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2}= XX^T$$

Our linear algebra problem has become

$$XX^T \pmatrix{ b \\ a} = X \vec y$$
where $$\vec y = \pmatrix{y_1 \cr y_2 \cr \vdots \cr y_n}$$


The solution to this problem is therefore 

$$\pmatrix{b \cr a} = (XX^T)^{-1} X \vec y$$
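Because the matrix is only $2 \times 2$, the solution can also be written out by hand. As a sketch, applying Cramer's rule to the two linear equations in $(a,b)$ gives

$$a = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad b = \frac{\sum_i y_i - a \sum_i x_i}{n}$$

The denominator is $n \sum_i (x_i - \bar x)^2$ with $\bar x$ the mean of the $x_i$, so it is nonzero as long as the $x_i$ are not all equal. 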

Let's code it up in Python. The **numpy** library has convenient matrix-algebra objects and routines. 

In [None]:
import numpy as np

X = np.matrix([[x[i]**j for i in range(len(x))] for j in range(2)])

bavec = ((X*(X.T)).I)*X*(np.matrix(y).T)
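A quick way to gain confidence in the formula is to try it on made-up points and compare with one of numpy's built-in fitters. The sketch below uses plain arrays and `np.linalg.solve` in place of an explicit inverse, which is a different (but equivalent) route to the same $(b, a)$.

```python
import numpy as np

## made-up points lying near the line y = 1 + 2x
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 2.9, 5.2, 6.8])

## normal equations, with X the 2 x n matrix whose rows are 1's and the x_i
X = np.vstack([np.ones_like(xs), xs])
b, a = np.linalg.solve(X @ X.T, X @ ys)

## the same fit from numpy's polynomial fitter (highest power first)
a2, b2 = np.polyfit(xs, ys, 1)

print(b, a)
print(b2, a2)
```

The two printed pairs should agree up to rounding error.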


In [None]:
x = [wdatlist[date][1] for date in comdates] #min
y = [wdatlist[date][0] for date in comdates] #max
plt.xlabel('min temp (night)')
plt.ylabel('max temp (day)')
plt.title('max vs. min daily temperatures')
plt.plot(x,y,'ro', label='incidences')

xd = np.linspace(-10.0, 20.0)
yv = bavec[0,0] + bavec[1,0]*xd
plt.plot(xd,yv,'b-', label='best linear fit y = %.1fx + %.1f' % (bavec[1,0], bavec[0,0]) )
plt.legend()

The defining feature of least squares is that it minimizes the total $E^2$. We compute the square root of this minimal value,
$$E = \sqrt{E^2}$$


In [None]:
import math as ma

## sum (b+axi-yi)^2
E = ma.sqrt(sum([(bavec[0,0] + bavec[1,0]*x[i] - y[i])**2 for i in range(len(x))] ))
print("minimal E = ", E)

### Notice:

From this we can conclude that the maximum daily temperature $M$ in Victoria is **typically** 
$$M \simeq 1.2 m + 7.7$$
where $m$ is the minimum daily temperature.  That is, the gap between the daily maximum and minimum widens on warmer days (the slope is above $1$), on top of a shift of about $7.7$ degrees that is independent of the temperature. 

Let's compute the average error among all our samples, to get a better sense for what the $\sqrt{E^2}$ computation above means. 

The average error is
$$avg.E = \frac{1}{n} \sum_i |b + ax_i - y_i|$$

In [None]:
avE = sum([ abs(bavec[0,0] + bavec[1,0]*x[i] - y[i]) for i in range(len(x))])/len(x)
print("avg.E = ", avE)

So if one uses the
$$ M \simeq 1.2 m + 7.7$$
formula, one will be wrong by about $2.2$ degrees on *average*. 

In [None]:
## how about the max error?
print("Max error: ", max([abs(bavec[0,0] + bavec[1,0]*x[i] - y[i])\
                          for i in range(len(x))]))
## the min error?
print("Min error: ", min([abs(bavec[0,0] + bavec[1,0]*x[i] - y[i])\
                          for i in range(len(x))]))

In [None]:
## percentage of samples with errors bigger than 3? etc. 
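One possible way to answer that last question is to compare the absolute errors against a cutoff and take the mean of the resulting True/False values. The residuals below are made up for illustration; with the real data they would be `b + a*x[i] - y[i]` for each sample.

```python
import numpy as np

## made-up residuals b + a*x_i - y_i, for illustration only
resid = np.array([0.5, -3.5, 2.0, 4.0, -1.0])
cutoff = 3.0

## the mean of a boolean array is the fraction of True entries
frac_big = np.mean(np.abs(resid) > cutoff)
print("%.1f%% of samples have error above %.1f" % (100 * frac_big, cutoff))
```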
