# Topic of Labs



## Lab 1

In [Lecture 2, Week 8](../Week.8/Lecture.2.ipynb) we ran a PCA on the crime data using daily counts of crimes, *binned* by crime type.  In this lab we ask you to start an alternative PCA of the same data.

Rather than *bin* the daily data only by crime type, create bins for **crime type**, as well as **weather** and **time of day**.  There are $86$ crime types.  Use two bins for weather: *warm* days ($10^\circ\mathrm{C}$ or more) vs. *cool* days (less than $10^\circ\mathrm{C}$).  Use two bins for *time of day* as well: *working hours*, 6am--6pm, and *non-working hours*, midnight--6am and 6pm--midnight.  This gives us a total of $86 \times 2 \times 2 = 344$ bins for the daily data. 
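
One way to picture the $344$ bins is as a flat index built from the three bin coordinates. The sketch below is only an illustration: the helper names `time_bin`, `temp_bin` and `bin_index` are hypothetical, and it assumes the crime types have already been numbered $0$ to $85$. The solution further down uses a dictionary lookup rather than arithmetic, but the time-of-day formula is the same.

```python
N_CRIME_TYPES = 86  # number of crime types stated in the lab text


def time_bin(hour):
    """Return 1 for working hours (6am-6pm), 0 for midnight-6am and 6pm-midnight."""
    return ((hour + 6) % 24) // 12


def temp_bin(temp_c):
    """Return 0 for a cool day (below 10 C), 1 for a warm day (10 C or more)."""
    return 0 if temp_c < 10.0 else 1


def bin_index(crime_idx, hour, temp_c):
    """Flatten (crime type, time-of-day bin, temperature bin) into one column index.

    crime_idx is a hypothetical integer label in range(N_CRIME_TYPES).
    """
    return (crime_idx * 2 + time_bin(hour)) * 2 + temp_bin(temp_c)


n_bins = N_CRIME_TYPES * 2 * 2
print(n_bins)
```

Each (crime type, time bin, temperature bin) triple maps to exactly one column, so the columns of the data matrix can be labelled by these triples.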

Step 1) Construct a $k \times 344$ matrix whose $(i,j)$-th entry represents the number of crimes on *day i* in *bin j*.  Here $k$ is the number of days in the crime database for which we also have weather data. 

Step 2) Perform PCA on this matrix. 

Step 3) For the eigenvectors with the four largest eigenvalues (i.e. the top four rows of the **pca.components_** output), look for any strong correlations or inverse correlations.  Can you explain them? 

 - Do the crimes happen in similar locations? Try a heat map. 
 - At similar times?  Try a plot of the time of the crime in one bin, vs. the time of the crime in the other.
 - Are the crime types related? 
 
If you notice a strong correlation or inverse correlation, attempt a least-squares fit of the data.   
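
As a rough sketch of what such a fit could look like, the example below fits a straight line to the daily counts of one bin against another using `np.polyfit`. The data here is synthetic (Poisson counts plus noise, invented purely for illustration), and the names `x`, `y`, `slope` and `intercept` are not from the lab; with the real data, `x` and `y` would be two columns of the matrix from Step 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the daily counts in two bins (NOT the VicPD data).
x = rng.poisson(lam=4.0, size=300).astype(float)
y = 3.0 - 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Degree-1 least-squares fit; polyfit returns the highest power first.
slope, intercept = np.polyfit(x, y, 1)

# Pearson correlation coefficient between the two bins.
r = np.corrcoef(x, y)[0, 1]

print("slope %.3f  intercept %.3f  r %.3f" % (slope, intercept, r))
```

A negative slope together with a correlation coefficient near $-1$ would indicate an inverse correlation between the two bins.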

**This is only a solution to Steps 1 and 2**

In [1]:
## A little code to load the vicpd library from the Week.8 directory.
##  we could of course just move the vicpd.py file. . . but we could also do this.
import os, sys
dirn, modn = os.path.split("../Week.8/vicpd.py")
modn = os.path.splitext(modn)[0]
path = list(sys.path)
sys.path.insert(0, dirn)
try:
    vpd = __import__(modn)
finally:
    sys.path[:] = path # restore
    
import matplotlib.pyplot as plt
%matplotlib inline
#%matplotlib nbagg
import numpy as np
from sklearn.decomposition import PCA

Loading the VicPD library.
[cdata] 5 years and 150 days of crime data. 86607 records total.
[ctypes] tree structure of crime types
[all_tots] totals for crime types
[all_freq] relative frequencies of crime types
[weekdaycount] loaded
[weekdaypct] loaded
[presentBDWeek] loaded
[wdatlist] 5 years and 177 days of weather data, dict of (max c, min c, mean c, rain cm, snow cm) indexed on date
VicPD library loaded.


In [7]:
## Let's set up the binning for Problem 1. 
import itertools as it

tbinstr = ['0am-6am, 6pm--12am', '6am-6pm']

# list of crime types, time periods and temperature range.  The binning indices.
ctnl = []
for a,b in vpd.ctypes.items():
    for c,j,k in it.product(b, range(2), range(2)):
        ctnl.append((a,c,j,k))
        
## reverse-lookup dictionary, to get the index of the crime type and time chunk.
rev_ctnl = dict([(ctnl[i], i) for i in range(len(ctnl))])

## cdata dates as a set
cdays = set([c.incident_datetime.date() for c in vpd.cdata])
cdayl = list(cdays)

## reverse-lookup a date
rdaylook = dict([(cdayl[i], i) for i in range(len(cdayl))])

## get the temperature bin
def tBin(T):
    if T<10.0: return 0
    return 1

A = np.zeros( (len(cdayl), len(ctnl) ) )
for c in vpd.cdata:
    day = c.incident_datetime.date()
    if day not in vpd.wdatlist:
        continue
    tod = ((c.incident_datetime.hour+6) % 24)//12  # 1 for 6am-6pm, 0 otherwise
    temp = tBin(vpd.wdatlist[day][2])              # bin on the daily mean temperature
    A[rdaylook[day],
      rev_ctnl[(c.parent_incident_type, c.incident_type_primary, tod, temp)]] += 1.0

## A is now the data matrix. Every day has a row consisting of the counts
##  of crimes in each bin on that day. 

pca = PCA(n_components=len(ctnl))
pca.fit(A)

C = pca.components_

print(" * * * PCA eig-val mag * * * \n")

for i in range(5):
    print("ev %.8f " %pca.explained_variance_[i], end='')

 * * * PCA eig-val mag * * * 

ev 29.20761721 ev 5.12632851 ev 3.26364180 ev 2.52058023 ev 1.89542664 
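
The numbers printed above are the entries of `pca.explained_variance_`, i.e. the variance of the data along each principal direction. To make that concrete, here is a minimal NumPy-only sketch of the same computation on a small synthetic count matrix. The Poisson data and the variable names are assumptions made for illustration; the sketch uses the textbook definition (center the columns, take an SVD, divide the squared singular values by $n-1$) and is not a claim about how the scikit-learn solver is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the day-by-bin count matrix (NOT the VicPD data).
A_demo = rng.poisson(lam=2.0, size=(200, 8)).astype(float)

# Center each column, then take the singular value decomposition.
A_centered = A_demo - A_demo.mean(axis=0)
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)

# Rows of Vt are the principal directions; s**2/(n-1) are their variances.
components = Vt
explained_variance = s**2 / (A_demo.shape[0] - 1)

for i in range(5):
    print("ev %.8f " % explained_variance[i], end='')
```

The variances come out sorted from largest to smallest, and their sum equals the total variance of the columns of the matrix.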

In [9]:
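## takes as input the PCA components matrix C and a row number r; returns the
## positive and the negative entries of that row (scaled by 100 and paired with
## their bin index), each list sorted by decreasing magnitude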
def exp_row_pca(C, r):
    ## list of entries w/index
    Cl = [(100*C[r,i], i) for i in range(C.shape[1])]
    Cs = sorted(Cl)
    Cs.reverse()
    Cp = [c for c in Cs if c[0]>0.0]
    Cn = [c for c in Cs if c[0]<0.0]
    Cn.reverse()
    return (Cp, Cn)

## names of the two temperature bins, indexed by the output of tBin.
tBinName = ['Cool', 'Warm']

def text_corr(C, r):
    Cp, Cn = exp_row_pca(C,r)
    print("+corr: ")
    for x in Cp:
        if (x[0]>15.0):
            print(" ", ctnl[x[1]][0]+"--"+ctnl[x[1]][1]+" "+tbinstr[ctnl[x[1]][2]],\
                  ' '+tBinName[ctnl[x[1]][3]], " pct %.1f" % x[0])
    print("-corr: ")
    for x in Cn:
        if (x[0]<-15.0):
            print(" ", ctnl[x[1]][0]+"--"+ctnl[x[1]][1]+" "+tbinstr[ctnl[x[1]][2]],\
                  ' '+tBinName[ctnl[x[1]][3]]," pct %.1f" % x[0])
            
for i in range(2):
    if (i!=0): print("\n")
    print("Eigenvalue ", i+1, " variance %.1f" % pca.explained_variance_[i])
    text_corr(C,i)
    

Eigenvalue  1  variance 29.2
+corr: 
  Other--SUSPICIOUS PERS/VEH/OCCURRENCE 6am-6pm  Cool  pct 23.7
  Theft from Vehicle--THEFT FROM MV UNDER $5000 6am-6pm  Cool  pct 21.7
  Other--SUSPICIOUS PERS/VEH/OCCURRENCE 0am-6am, 6pm--12am  Cool  pct 20.3
  Theft--THEFT-OTHER UNDER $5000 6am-6pm  Cool  pct 19.4
  Liquor--LIQUOR-INTOX IN PUBLIC PLACE 0am-6am, 6pm--12am  Cool  pct 18.8
  Property Crime--MISCHIEF $5000 OR UNDER 6am-6pm  Cool  pct 16.2
-corr: 
  Other--SUSPICIOUS PERS/VEH/OCCURRENCE 6am-6pm  Warm  pct -32.5
  Other--SUSPICIOUS PERS/VEH/OCCURRENCE 0am-6am, 6pm--12am  Warm  pct -31.5
  Theft from Vehicle--THEFT FROM MV UNDER $5000 6am-6pm  Warm  pct -27.5
  Liquor--LIQUOR-INTOX IN PUBLIC PLACE 0am-6am, 6pm--12am  Warm  pct -25.0
  Theft--THEFT-OTHER UNDER $5000 6am-6pm  Warm  pct -22.6
  Other--BYLAW-NOISE 0am-6am, 6pm--12am  Warm  pct -21.3
  Property Crime--MISCHIEF $5000 OR UNDER 6am-6pm  Warm  pct -17.5
  Theft--THEFT-SHOPLIFTING UNDER $5000 6am-6pm  Warm  pct -15.7
  Theft--THE