# Crime stats and matplotlib

This last lecture will start with an overview of **matplotlib**, using our crime and weather data as a source of information to plot. 

## Plotting in two dimensions with matplotlib.

A useful resource for the basics of matplotlib is the [matplotlib FAQ "General Concepts" section](http://matplotlib.org/devdocs/faq/usage_faq.html#general-concepts).  It outlines the primary structures and terminology used by matplotlib.   A more encyclopedic (but very good) introduction is [available here](http://www.labri.fr/perso/nrougier/teaching/matplotlib/).  We give a quick summary of matplotlib below. 

The **matplotlib.pyplot** module is the core library for producing 2-dimensional plots in matplotlib.  A display produced by matplotlib is called a **figure**, and figures have potentially many parts, called **axes**. 

<img src="../images/fig_map.png" alt="matplotlib figure map, taken from http://matplotlib.org/devdocs/faq/usage_faq.html#general-concepts" width=300 height=300>

The **axes** belong to the figure.  matplotlib has a high-level graphics-display object called an **artist** and all objects (figures, axes, axis, text, etc) are artist objects. 
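
The figure/axes/artist hierarchy can be inspected directly through matplotlib's object-oriented interface. The following is a minimal sketch and not part of the lecture's own code: the non-interactive `Agg` backend is chosen only so that it runs outside a notebook, and the plotted numbers are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: a static backend, so no display is needed
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

# One figure containing two axes, side by side.
fig, (ax_left, ax_right) = plt.subplots(1, 2)
ax_left.plot([0, 1, 2], [0, 1, 4])      # made-up data
ax_right.plot([0, 1, 2], [4, 1, 0])     # made-up data
title = ax_left.set_title("left axes")

# The axes belong to the figure ...
print(fig.axes == [ax_left, ax_right])
# ... and the figure, the axes, each axis and the title text are all artists.
parts = [fig, ax_left, ax_left.xaxis, ax_left.yaxis, title]
print(all(isinstance(p, Artist) for p in parts))
```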

matplotlib expects array inputs to be **numpy** arrays.  Other array-like types may work, but be aware that they can cause problems. 

matplotlib documentation distinguishes between the **backend** and the **frontend**. 
    
 - The **frontend** refers to the way in which you generate code for matplotlib.  For us, this is the IPython notebook.  
 - The **backend** refers to how one turns the code into graphics, or potentially an interactive environment. 
    
There are two primary backend types for matplotlib:

 - Hardcopy backends.  These generate static image files from your code. The code **%matplotlib inline** loads a standard static backend for matplotlib.
 - Interactive backends.  These generate code (some generate and execute the code) for interactive graphics.  For example, some backends generate JavaScript code that can be integrated with a web-page to render your application on-line. The code **%matplotlib nbagg** loads a standard interactive backend for matplotlib.

A [list](http://matplotlib.org/faq/usage_faq.html#what-is-a-backend) of matplotlib's built-in backends is available in the FAQ. 
        
For most tasks we will use the default backends (inline and nbagg) for matplotlib.  This requires no special actions. We will also explore applications that use other backends. 
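
Outside the notebook, a hardcopy backend can also be selected explicitly. The sketch below is only an illustration under that assumption: it picks the static `Agg` backend, draws some made-up data, and renders the figure to PNG bytes in memory instead of displaying it.

```python
import io
import matplotlib
matplotlib.use("Agg")  # select a hardcopy (static) backend explicitly
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])  # made-up data

# A hardcopy backend renders the figure to an image; here into memory.
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()

print(matplotlib.get_backend())
print(png_bytes[:8] == b"\x89PNG\r\n\x1a\n")  # PNG data begins with this signature
```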

### Simple plot types with matplotlib

Let's start off making a basic figure. Let's plot the max and minimum temperatures for a range of days. 

In [None]:
import vicpd as vpd
import matplotlib.pyplot as plt
%matplotlib inline
#%matplotlib nbagg

In [None]:
# order the weather data!
from collections import OrderedDict

owdat = OrderedDict(sorted(vpd.wdatlist.items()))

t = [k for k,v in owdat.items()][50:62]
y1 = [v[0] for k,v in owdat.items()][50:62]
plt.plot(t,y1, label="max temps (c)")

y2 = [v[1] for k,v in owdat.items()][50:62]
plt.plot(t,y2, label="min temps (c)")

y3 = [v[2] for k,v in owdat.items()][50:62]
plt.plot(t,y3, label="mean temps (c)")
plt.xlabel('date')
plt.ylabel('temperatures')
plt.title('Daily max, min, mean temperatures')

plt.legend()

### Criticisms:

 - The text on the x-axis is difficult to read.  We could use smoother fonts, such as **LaTeX** rendering, or simply enlarge the font.  If you get an error code when using the usetex command, check the error message.  It likely means you are missing an application to convert dvi files to png files.  A quick **sudo apt install dvipng** command should fix that. 
 - Figure could be larger. The **set_size_inches** command works for this. 
 - The legend is obstructing the figure. The [legend locator](http://matplotlib.org/users/legend_guide.html) is good for this. 
 

In [None]:
plt.figure().set_size_inches(10,10)
plt.rc('text', usetex=True) ## smoother fonts!

plt.plot(t,y1, label="max temps (c)")
plt.plot(t,y3, label="mean temps (c)")
plt.plot(t,y2, label="min temps (c)")

plt.xlabel('date')
plt.ylabel('temperatures')
plt.title('Daily max, min, mean temperatures', fontsize=14)

plt.tick_params(axis='both', labelsize=14) ## size of dates and numbers on axis

locs, labels = plt.xticks() ## rotation of x-axis labels
plt.setp(labels, rotation=60)

plt.legend(loc=2)

Let's revisit our plots of daily crime counts, one vs. the other, and try to make them more informative. Our original plots just had one dot for each day.  Let's start by plotting pairs **(num crime type x, num crime type y)** by days, over our full crime data set.  

We create a function analogous to **vpd.xyplot**, but one that uses **ccdata** for both the **x** and **y** axes.

In [None]:
def xyplot(pit1, itp1, pit2, itp2):
    x = [vpd.ccdata[(date, pit1, itp1)] for date in vpd.comdates]
    y = [vpd.ccdata[(date, pit2, itp2)] for date in vpd.comdates]
    return x,y

def makeccplot(pit1, itp1, pit2, itp2):
    ## escape dollar signs so they are not read as math delimiters
    plt.xlabel( (pit1+" "+itp1).translate({ord(c): '\\$' for c in '$'}) )
    plt.ylabel( (pit2+" "+itp2).translate({ord(c): '\\$' for c in '$'}) )
    plt.title( "Daily count comparison" )
    x,y = xyplot(pit1, itp1, pit2, itp2)
    plt.plot(x,y,'ro', label='daily incidences')

makeccplot('Traffic','COLLISION-DAMAGE OVER $1000',\
           'Traffic','COLLISION-DAMAGE UNDER $1000')

As we can see, this is not particularly informative, as some of these dots are likely printed multiple times.  

We have a few options:

 - Make a dot larger if it occurs more often. 
 - For each point on the x-axis plot the average of the y-axis values, and perhaps indicate the variation in y-values with error bars. 
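
Both options amount to a little bookkeeping on the list of daily pairs. The sketch below uses a handful of made-up pairs (not the crime data) to show the two computations: counting how often each point occurs, and computing, for each x-value, the average y-value together with the mean absolute deviation from that average.

```python
from collections import Counter, defaultdict

# Made-up daily counts: (count of type x, count of type y) for eight days.
pairs = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 2), (0, 0), (1, 2), (2, 1)]

# Option 1: how often does each point occur?  This can drive the dot size.
multiplicity = Counter(pairs)

# Option 2: group the y-values by x-value, then summarize each group.
ys_by_x = defaultdict(list)
for xv, yv in pairs:
    ys_by_x[xv].append(yv)

summary = {}
for xv, ys in sorted(ys_by_x.items()):
    mean = sum(ys) / len(ys)
    spread = sum(abs(yv - mean) for yv in ys) / len(ys)  # mean absolute deviation
    summary[xv] = (mean, spread)

print(multiplicity[(0, 0)])
print(summary)
```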

## Let's try the dot size.

In [None]:
from collections import defaultdict
import math as ma

def xyrplot(pit1, itp1, pit2, itp2):
    x = [vpd.ccdata[(date, pit1, itp1)] for date in vpd.comdates]
    y = [vpd.ccdata[(date, pit2, itp2)] for date in vpd.comdates]
    ## count how many days share each (x, y) pair
    pcount = defaultdict(int)
    for pair in zip(x, y):
        pcount[pair] += 1
    return pcount

def makeccrplot(pit1, itp1, pit2, itp2):
    ## escape dollar signs so they are not read as math delimiters
    plt.xlabel( (pit1+" "+itp1).translate({ord(c): '\\$' for c in '$'}) )
    plt.ylabel( (pit2+" "+itp2).translate({ord(c): '\\$' for c in '$'}) )
    plt.title( "Daily count comparison" )
    xyr = xyrplot(pit1, itp1, pit2, itp2)
    x = [d[0] for d,c in xyr.items()]
    y = [d[1] for d,c in xyr.items()]
    r = [ma.sqrt(c) for d,c in xyr.items()]
    ## and to fix the positioning of the axis...
    xd = (max(x)-min(x))*0.04
    yd = (max(y)-min(y))*0.04
    plt.axis([min(x)-xd, max(x)+xd, min(y)-yd, max(y)+yd])

    plt.scatter(x,y, s=r, label='daily incidences')

makeccrplot('Traffic','COLLISION-DAMAGE OVER $1000',\
            'Traffic','COLLISION-DAMAGE UNDER $1000')

In [None]:
makeccrplot('Assault with Deadly Weapon', 'ASSAULT-W/WEAPON OR CBH',\
            'Weapons Offense', 'WEAPONS-POSSESSION')

The collision plot would indicate that a day with no collisions of either type is the *default*, and that collisions over or under $\$1000$ in damage are unrelated.

Let's check if we would *see* the same relation by plotting averages. 

In [None]:
def ccavyplot(pit1, itp1, pit2, itp2):
    x = [vpd.ccdata[(date, pit1, itp1)] for date in vpd.comdates]
    y = [vpd.ccdata[(date, pit2, itp2)] for date in vpd.comdates]
    pcount = defaultdict(int)  ## number of days with a given x-value
    psum = defaultdict(int)    ## sum of the y-values over those days
    
    for xi, yi in zip(x, y):
        pcount[xi] += 1
        psum[xi] += yi
    ## average y-value for each x-value: the sum divided by the count
    avg = {xi: psum[xi]/pcount[xi] for xi in psum}

    ## mean absolute deviation of the y-values from that average
    dev = defaultdict(float)
    for xi, yi in zip(x, y):
        dev[xi] += abs(yi-avg[xi])
    ## return everything in increasing order of x, so the plot reads left to right
    xs = sorted(psum.keys())
    return xs, [avg[xi] for xi in xs], [dev[xi]/pcount[xi] for xi in xs]

## let's compute the deviation from the average as well, and put that
## in as error bars. 
xv, yv, dv = ccavyplot('Traffic','COLLISION-DAMAGE OVER $1000',\
            'Traffic','COLLISION-DAMAGE UNDER $1000')
plt.title('Average number of collisions under 1000, for a given\n number of collisions over. With avg errors')
plt.errorbar(xv, yv, yerr=dv, ecolor='r')


The plot of the averages looks quite different.  This is about what we would expect, though: there is no real trend, other than there being slightly more minor collisions than major collisions, on average.

In [None]:
## few assaults on snowy days
xv, yv = vpd.xyplot('Assault', 'ASSAULT-COMMON OR TRESPASS', 4)
plt.plot(xv, yv, 'ro')

In [None]:
## few (but more) assaults on rainy days
xv, yv = vpd.xyplot('Assault', 'ASSAULT-COMMON OR TRESPASS', 3)
plt.plot(xv, yv, 'ro')

In [None]:
## Assaults vs. max temp
xv, yv = vpd.xyplot('Assault', 'ASSAULT-COMMON OR TRESPASS', 0)
plt.plot(xv, yv, 'ro')

In [None]:
makeccrplot('Other', 'SUSPICIOUS PERS/VEH/OCCURRENCE',\
            'Theft from Vehicle', 'THEFT FROM MV UNDER $5000')

In [None]:
makeccrplot('Other', 'SUSPICIOUS PERS/VEH/OCCURRENCE',\
            'Liquor', 'LIQUOR-INTOX IN PUBLIC PLACE')

**Least squares again**

Although we have not seen many examples where the data appears to fit a linear trend, we should mention that the least squares technique is more general than what we described in Week 6. 

Assume you have some data points $(x_i, y_i)$ for $i = 1, 2, \cdots, n$, and you want to find a *best* fit function of the form

$$F(x) = \sum_{j=1}^m c_j f_j(x)$$

with the functions $f_j : \mathbb R \to \mathbb R$ being given ahead of time, i.e. we want to approximate our data by a linear combination of some functions that we have decided upon *ahead of time*.  In our example from [Week 6](../Week.6/Lecture.2.ipynb), we had $m=2$, with $f_1(x) = x$ and $f_2(x) = 1$. 

As in Week 6, given a choice of constants $\{c_j : j = 1, 2, \cdots, m\}$ the **Total Error** (squared) of the approximation is defined as:

$$E^2 = \sum_{i=1}^n (F(x_i) - y_i)^2 $$

Since the $f_j$, $x_i$, and $y_i$ are given, we can think of $E^2$ as a function of the coefficients $(c_1, c_2, \cdots, c_m)$.  By calculus the minimum occurs at a critical point, which is when:

$$\frac{\partial E^2}{\partial c_1} = \frac{\partial E^2}{\partial c_2} = \cdots = \frac{\partial E^2}{\partial c_m} = 0 \hskip 1cm \star$$

which is a system of $m$ linear equations in the $m$ variables $\{c_1, c_2, \cdots, c_m\}$. 

Specifically, 

$$\frac{\partial E^2}{\partial c_k} = \sum_{i=1}^n 2(F(x_i)-y_i)f_k(x_i), \hskip 1cm k=1,2,\cdots, m$$
which allows us to express the linear system $\star$ as

$$\sum_{i=1}^n \sum_{j=1}^m c_jf_j(x_i)f_k(x_i) = \sum_{i=1}^n y_i f_k(x_i), \hskip 1cm k=1,2,\cdots,m$$
which in turn is the matrix equation

$$AA^T \vec c = A \vec y$$
where $A$ is the $m \times n$ matrix
$$A = \pmatrix{f_1(x_1) & f_1(x_2) & \cdots & f_1(x_n) \cr f_2(x_1) & f_2(x_2) & \cdots & f_2(x_n) \cr . & . & & . \cr  . & . & & . \cr f_m(x_1) & f_m(x_2) & \cdots & f_m(x_n)}$$
$\vec y = \pmatrix{y_1 \cr y_2 \cr . \cr . \cr y_n}$, and 
$\vec c = \pmatrix{c_1 \cr c_2 \cr . \cr . \cr c_m}$.
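
The matrix equation above translates almost directly into numpy. The sketch below is one possible way to do it, on made-up data that lies exactly on a line so the answer is known in advance; the helper name `least_squares_coeffs` is hypothetical. The matrix $A$ has one row per function and one column per data point.

```python
import numpy as np

def least_squares_coeffs(funcs, xs, ys):
    """Solve the normal equations (A A^T) c = A y for the coefficients c."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Row j of A holds the j-th function evaluated at every data point.
    A = np.array([[f(xi) for xi in xs] for f in funcs], dtype=float)
    return np.linalg.solve(A @ A.T, A @ ys)

# Made-up data lying exactly on y = 1 + 2x, fitted with f_1(x) = x and f_2(x) = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
funcs = [lambda x: x, lambda x: 1.0]

c = least_squares_coeffs(funcs, xs, ys)
print(c)  # coefficients of x and of 1, in that order
```

Here the system is solved with `np.linalg.solve` rather than by forming the inverse of $AA^T$; both express the same equation, and solving directly is usually the more numerically stable choice.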

**Example:**

Let's take the max vs. min temperature example from Week 6, but interpolate with more than linear functions.

In [None]:
import numpy as np

## A reminder of the data. 
x = [vpd.wdatlist[date][1] for date in vpd.comdates] #min
y = [vpd.wdatlist[date][0] for date in vpd.comdates] #max
plt.xlabel('min temp (night)')
plt.ylabel('max temp (day)')
plt.title('max vs. min daily temperatures')
plt.plot(x,y,'ro')

## implementing a linear interpolation
A = np.matrix([[x[i]**j for i in range(len(x))] for j in range(2)])
cvec = ((A*(A.T)).I)*A*(np.matrix(y).T)

xd = np.linspace(-10.0, 20.0)
yd = cvec[0,0] + cvec[1,0]*xd
plt.plot(xd, yd, 'b-')

## and the average error.
avE = sum([ abs(cvec[0,0] + cvec[1,0]*x[i] - y[i]) for i in range(len(x))])/len(x)
print("average error = ", avE)

In [None]:
## interpolate using a+bx+cx**2
import numpy as np

A = np.matrix([[x[i]**j for i in range(len(x))] for j in range(3)])

cvec = ((A*(A.T)).I)*A*(np.matrix(y).T)

plt.xlabel('min temp (night)')
plt.ylabel('max temp (day)')
plt.title('max vs. min daily temperatures')
plt.plot(x,y,'ro', label='incidences')

xd = np.linspace(-10.0, 20.0)
yv = cvec[0,0] + cvec[1,0]*xd + cvec[2,0]*(xd**2)
plt.plot(xd,yv,'b-', label='best fit y = %.2f$x^2$ + %.1f$x$ + %.1f' % (cvec[2,0], cvec[1,0], cvec[0,0]) )
plt.legend()

## and we should expect the total error squared to be smaller, but how about average error?
avE = sum([ abs(cvec[0,0] + cvec[1,0]*x[i] + cvec[2,0]*(x[i]**2) - y[i]) for i in range(len(x))])/len(x)
print("average error = ", avE)

In [None]:
## let's try a more flexible interpolation, let's use 1, x, x^2, sin(x) and cos(x)
flist = [lambda x: 1.0, lambda x: x, lambda x: x**2,
         lambda x: np.sin(x), lambda x: np.cos(x)]
## raw strings, so the backslashes reach LaTeX untouched
fname = ['', '$x$', '$x^2$', r'$\sin(x)$', r'$\cos(x)$']

A = np.matrix([[F(x[i]) for i in range(len(x))] for F in flist])

cvec = ((A*(A.T)).I)*A*(np.matrix(y).T)

plt.xlabel('min temp (night)')
plt.ylabel('max temp (day)')
plt.title('max vs. min daily temperatures')
plt.plot(x,y,'ro', label='incidences')

xd = np.linspace(-10.0, 20.0)
yd = [sum([flist[j](xi)*cvec[j,0] for j in range(len(flist))]) for xi in xd]

fitstring=''
for i in range(len(flist)):
    if i!=0: fitstring+="+"
    fitstring+=("%.2f" % cvec[i,0])+fname[i]
plt.plot(xd,yd,'b-', label='best fit '+fitstring )
plt.legend()

## and the average error.
avE = sum([ abs(sum([flist[j](x[i])*cvec[j,0] for j in range(len(flist))]) - y[i]) for i in range(len(x))])/len(x)
print("average error = ", avE)

In [None]:
## just for kicks, let's try 1, sin(x/6), cos(x/6), sin(x/3), cos(x/3)

flist = [lambda x: 1.0, lambda x: np.sin(x/6), lambda x: np.cos(x/6), lambda x: np.sin(x/3), lambda x: np.cos(x/3)]
## raw strings, so the backslashes reach LaTeX untouched
fname = ['', r'$\sin(x/6)$', r'$\cos(x/6)$', r'$\sin(x/3)$', r'$\cos(x/3)$']

A = np.matrix([[F(x[i]) for i in range(len(x))] for F in flist])

cvec = ((A*(A.T)).I)*A*(np.matrix(y).T)

plt.xlabel('min temp (night)')
plt.ylabel('max temp (day)')
plt.title('max vs. min daily temperatures')
plt.plot(x,y,'ro', label='incidences')

xd = np.linspace(-10.0, 20.0)
yd = [sum([flist[j](xi)*cvec[j,0] for j in range(len(flist))]) for xi in xd]

fitstring=''
for i in range(len(flist)):
    if i!=0: fitstring+="+"
    fitstring+=("%.2f" % cvec[i,0])+fname[i]
plt.plot(xd,yd,'b-', label='best fit '+fitstring )
plt.legend()

## and the average error.
avE = sum([ abs(sum([flist[j](x[i])*cvec[j,0] for j in range(len(flist))]) - y[i]) for i in range(len(x))])/len(x)
print("average error = ", avE)

Least-squares as a technique is extremely flexible in fitting one set of variables $y_i$ to a linear combination of functions of another variable $x_i$.   One often wants to know **what** objects to compare, or if there are any interesting comparisons to make **at all**.  

The technique of [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), roughly speaking, attempts to find a *best possible* ellipsoidal shape fitting a collection of data. The data cloud can be a finite set of vectors in $\mathbb R^n$, thus it allows the simultaneous comparison of many different aspects of a data set. 

## The PCA technique

Rather than implement PCA from scratch, we will start with a library implementation. For this we will use the **sklearn** library. 

In [None]:
from sklearn.decomposition import PCA

## and let's start with something we can know about ahead of time. 
## multivariate_normal( mean, covariance matrix, number of points )
mvn = np.random.multivariate_normal( [0,0], [[1,0], [0,5]], 4000 )
pca = PCA(n_components=2)
pca.fit(mvn)
print("PCA components: \n",pca.components_, "\n")
print("Magnitude of PCA components: \n",pca.explained_variance_)

fig = plt.figure()
plt.axis([-9,9,-9,9])
plt.scatter(*zip(*mvn))


## What PCA is doing

Given $k$ *data points* in $\mathbb R^n$ (the previous example has $n=2$ and $k=4000$)

$$\{ \vec x_i \in \mathbb R^n : i = 1, 2, \cdots, k\}$$

*Principal Component Analysis* is the process of constructing the matrix

$$ A = \left[ \vec x_1, \vec x_2, \cdots, \vec x_k \right]$$

(thinking of $\mathbb R^n$ as consisting of column vectors). The $i$-th coordinate of the vector $\vec x_j$ is called the $i$-th *feature* of the vector.  One then observes that the matrix

$$ A \cdot A^T $$

is symmetric. Not only that, this is the matrix of dot products of the various *rows* of $A$. 

* * *

**Theorem (Spectral Theorem)**: A real symmetric $n \times n$ matrix $X$ is diagonalizable, that is, there exists a real invertible $n \times n$ matrix $B$ such that
$$B^{-1} X B$$
is diagonal.  Moreover, the theorem tells us that one can assume that $B$ is *orthogonal*, $B^T B = I$. 

* * *

The *PCA* method computes the matrix $B$ as well as the eigenvalues of $A \cdot A^T$. PCA lists the eigenvalues in *decreasing* order.  So the first column of the matrix $B$ gives the weighting of the features along which the data has the highest amount of variance.  The second column vector represents the second-highest amount of variation among the features (technically the direction of the highest amount of variation once one projects the data to the orthogonal complement of the first vector), etc. The rows of **PCA.components_** are the columns of $B$ (so **PCA.components_** is $B^T$), while **PCA.explained\_variance\_** holds the eigenvalues in decreasing order, divided by $k-1$ so that they are sample variances. 

*Technical Issue*: the **sklearn** library subtracts the *mean* of every feature before it computes $B$, i.e. let

$$\tilde A = \pmatrix{ 
a_{11} - avg_i\{a_{1i}\} & a_{12} - avg_i\{a_{1i}\} & \cdots & a_{1k} - avg_i\{a_{1i}\} \cr
a_{21} - avg_i\{a_{2i}\} & a_{22} - avg_i\{a_{2i}\} & \cdots & a_{2k} - avg_i\{a_{2i}\} \cr
. & . & & . \cr
a_{n1} - avg_i\{a_{ni}\} & a_{n2} - avg_i\{a_{ni}\} & \cdots & a_{nk} - avg_i\{a_{ni}\} }$$

The **sklearn** PCA algorithm replaces the matrix $A$ with $\tilde A$ before computing $B$.

*Fact*: Matrices of the form $A \cdot A^T$ always have non-negative eigenvalues. 
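
The steps described in this section can be carried out by hand with numpy. The sketch below is not the **sklearn** implementation; it is an illustration on a made-up sample with a fixed random seed, using `np.linalg.eigh` to diagonalize the centred matrix $\tilde A \cdot \tilde A^T$. Dividing the eigenvalues by $k-1$ gives sample variances, which should be comparable to what **sklearn** reports.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for repeatability
# Made-up sample: 4000 points in R^2 with variance 1 along x and 5 along y.
points = rng.multivariate_normal([0, 0], [[1, 0], [0, 5]], 4000)

A = points.T                                   # n x k: one column per data point
A_tilde = A - A.mean(axis=1, keepdims=True)    # subtract each feature's mean
X = A_tilde @ A_tilde.T                        # symmetric n x n matrix

# eigh is meant for symmetric matrices; it returns eigenvalues in increasing order.
eigvals, B = np.linalg.eigh(X)
order = np.argsort(eigvals)[::-1]              # reorder to decreasing
eigvals, B = eigvals[order], B[:, order]

print(np.allclose(B.T @ B, np.eye(2)))             # B is orthogonal
print(np.allclose(B.T @ X @ B, np.diag(eigvals)))  # B diagonalizes X
print(eigvals / (A.shape[1] - 1))                  # sample variances along each direction
```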

* * *

### Let us now look for elementary relationships in the crime data, of the sort PCA can see. 

Let's look for relations in crime occurrences, by their types and the days on which they occur. 

Let's construct a list of vectors.  Our vectors will be in $\mathbb R^n$ where $n = CT$.

* $CT$ is the total number of crime types listed.  
* $k$ is the number of vectors, i.e. the total number of days appearing in the database.

Each row of our matrix will represent a day.  And the column entries will be the number of crimes of various types on that day. 
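
The layout of this matrix is easiest to see on a tiny example. The sketch below uses a few made-up incident records (not `vpd.cdata`) and builds the day-by-type count matrix with two reverse-lookup dictionaries, one for rows and one for columns.

```python
import datetime as dt
import numpy as np

# Made-up incident records: (date, crime type).
records = [
    (dt.date(2016, 1, 1), "theft"),
    (dt.date(2016, 1, 1), "theft"),
    (dt.date(2016, 1, 1), "assault"),
    (dt.date(2016, 1, 2), "assault"),
    (dt.date(2016, 1, 3), "theft"),
]

days = sorted({d for d, _ in records})
types = sorted({t for _, t in records})
day_index = {d: i for i, d in enumerate(days)}     # reverse lookup: date -> row
type_index = {t: j for j, t in enumerate(types)}   # reverse lookup: type -> column

# One row per day, one column per crime type.
counts = np.zeros((len(days), len(types)))
for d, t in records:
    counts[day_index[d], type_index[t]] += 1

print(types)
print(counts)
```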

In [None]:
# list of crime types
ctnl = []
for a,b in vpd.ctypes.items():
    for c in b:
        ctnl.append((a,c))

## reverse-lookup dictionary, to get the index of the crime type.
rev_ctnl = dict([(ctnl[i], i) for i in range(len(ctnl))])

## cdata dates as a set
cdays = set([c.incident_datetime.date() for c in vpd.cdata])
cdayl = list(cdays)

## reverse-lookup a date
rdaylook = dict([(cdayl[i], i) for i in range(len(cdayl))])

## build the data matrix. Every day will have a row consisting of the counts
##  of the crime types on that day. 
A = np.zeros( (len(cdayl), len(ctnl)) )
for c in vpd.cdata:
    A[rdaylook[c.incident_datetime.date()], rev_ctnl[(c.parent_incident_type, c.incident_type_primary)]] += 1.0

pca = PCA(n_components=len(ctnl))
pca.fit(A)

C = pca.components_

#print("PCA components: \n",pca.components_, "\n")
print("Magnitude of PCA components: \n",pca.explained_variance_)


In [None]:
## takes as input the row number of the PCA analysis, and prints short string explaining
## what it means
def exp_row_pca(C, r):
    ## list of entries w/index
    Cl = [(100*C[r,i], i) for i in range(C.shape[1])]
    Cs = sorted(Cl)
    Cs.reverse()
    Cp = [c for c in Cs if c[0]>0.0]
    Cn = [c for c in Cs if c[0]<0.0]
    Cn.reverse()
    return (Cp, Cn)

def text_corr(C, r):
    Cp, Cn = exp_row_pca(C,r)
    print("+corr: ")
    for x in Cp:
        if (x[0]>5.0):
            print(" ", ctnl[x[1]], " pct %.1f" % x[0])
    print("-corr: ")
    for x in Cn:
        if (x[0]<-5.0):
            print(" ", ctnl[x[1]], " pct %.1f" % x[0])

text_corr(C,2)
