# Finding downgraded crimes using machine learning

Using a machine learning algorithm, The Los Angeles Times found that the LAPD had for years been downgrading serious assaults to the less serious "simple assault" category. We're going to reproduce the setup by deliberately downgrading a random 15% of the serious assaults in a database, then trying to see if we can detect which ones we edited.

We'll be using actual assault reports from the LAPD, filed between 2008 and 2012.


### Prep work: Downloading necessary files
Before we get started, we need to download all of the data we'll be using.
* **2008-2012.csv:** cleaned crime reports - a selection of partially scrubbed reports from 2008 through 2012


In [1]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/latimes-crime-classification/data/2008-2012.csv.zip -P data
!unzip -n -d data data/2008-2012.csv.zip

File ‘data/2008-2012.csv.zip’ already there; not retrieving.

Archive:  data/2008-2012.csv.zip


## Imports and setup

First we'll set a few display options. The assault descriptions can be quite long, and by default pandas truncates text after a few words.

In [2]:
import pandas as pd

pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)

%matplotlib inline

## Read in our data

Our dataset is going to be a database of crimes committed between 2008 and 2012. The data has been cleaned and filtered a bit, though, so we're only left with two columns:

* `CCDESC`, what criminal code was violated
* `DO_NARRATIVE`, a short text description of what happened

We're going to use this description to see if we can separate serious cases of assault from non-serious ones.

In [3]:
df = pd.read_csv("data/2008-2012.csv")
df.head(10)

Unnamed: 0,CCDESC,DO_NARRATIVE
0,SHOPLIFTING - PETTY THEFT ($950 & UNDER),DO-SUSP WAS SEEN THROUGH SURVAILANCE CONCEALING SEVERAL ITEMS INTO HER SHOPPING AND PERSONAL BAG LEAVING WITHOUT PAYING DEPT STORE
1,VIOLATION OF COURT ORDER,DO-SUSP ARRIVED AT VICTS RESID AND ENTERED VICTS RESID IN VIOLATION OF RESTRAINING ORDER
2,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES
3,THEFT PLAIN - PETTY ($950 & UNDER),DO-UNK SUSP TOOK VICT PREPAID GIFT CARD SUSP PURCHASED PRODUCTS WITH ITEM
4,BATTERY - SIMPLE ASSAULT,DO-SUSP USED RIGHT FIST TO PUNCH VICT IN THE HEAD ONCE N PULL VICT HAIR FOR APPRX 15 SECONDS
5,THEFT OF IDENTITY,DO-UNK SUSP USED VICTS PERSONAL INFO FOR GAIN WITHOUT THE VICTS CONSENT ORKNOWLEDGE
6,SHOPLIFTING - PETTY THEFT ($950 & UNDER),DO-SUSP ENTERED MKT AND SEL ITEMS SUSP CONCEALED ITEMS AND EXITED STORE WOPAYING
7,BURGLARY,DO-UNK SUSP ENTERED VICTS RESIDENCE BY UNLOCKED FRONT DOOR SUSP REMOVED VCTICTS PROPERTY SUSP FLED LOC
8,OTHER MISCELLANEOUS CRIME,DO-SUSP ADMITTED TO PLACING 2010 REG TAG HE ILLEGALLY OBTAINED ON HIS LIC PLATE HIS VEH REG WAS STILL EXP
9,BATTERY - SIMPLE ASSAULT,DO-S APPROACHED V IN VEH S SLAPPED AND LUNGGED AT V


How much data do we have?

In [4]:
df.shape

(830218, 2)

## Clean the data

We don't get to use all 830,000 rows, though! We're just going to stick to assaults. First we'll filter our dataset to only include crimes whose `CCDESC` includes the word `ASSAULT`.

In [5]:
df = df[df.CCDESC.str.contains("ASSAULT")].copy()
df.shape

(165965, 2)
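As a small illustration of what that filter does, here is the same `str.contains` call on a made-up four-row frame. The toy rows and the `na=False` argument (which treats a missing description as a non-match) are assumptions added for this sketch; the notebook's own cell does not pass it.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are invented for illustration
toy = pd.DataFrame({
    "CCDESC": ["BATTERY - SIMPLE ASSAULT", "BURGLARY", "OTHER ASSAULT", None],
})

# True wherever the text contains the substring ASSAULT
# na=False (an assumption here) makes missing values count as "no match"
mask = toy.CCDESC.str.contains("ASSAULT", na=False)

# Keep only matching rows; .copy() avoids modifying a view of the original
assaults = toy[mask].copy()

print(mask.tolist())
print(assaults.shape)
```

The match is case-sensitive by default, which works here because the category names are all upper case.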

Assaults come in two forms:

* Serious or Part I assaults, which are aggravated assaults and assaults with a deadly weapon
* Non-serious or Part II assaults, which are simple assaults

Let's make a new column called `serious` where we save whether the assault is serious/Part I or not.

In [6]:
df['serious'] = df.CCDESC.str.contains("AGGRAVATED") | df.CCDESC.str.contains("DEADLY")
df['serious'] = df['serious'].astype(int)
df.groupby('serious').CCDESC.value_counts()

serious  CCDESC                                        
0        BATTERY - SIMPLE ASSAULT                          71951
         INTIMATE PARTNER - SIMPLE ASSAULT                 42102
         CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT            4297
         OTHER ASSAULT                                       394
1        ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT    43385
         INTIMATE PARTNER - AGGRAVATED ASSAULT              1606
         CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT        1481
         ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER        749
Name: CCDESC, dtype: int64
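To sanity-check the flag, the sketch below applies it to the eight category names shown in the output above. It also shows an equivalent one-call form using a regular-expression alternation (`str.contains` treats its pattern as a regex by default). The one-call variant is an alternative offered here, not what the notebook uses.

```python
import pandas as pd

# The eight assault categories listed in the output above
categories = pd.Series([
    "BATTERY - SIMPLE ASSAULT",
    "INTIMATE PARTNER - SIMPLE ASSAULT",
    "CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT",
    "OTHER ASSAULT",
    "ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",
    "INTIMATE PARTNER - AGGRAVATED ASSAULT",
    "CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT",
    "ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER",
])

# Same logic as the notebook cell: two substring tests combined with OR
two_step = categories.str.contains("AGGRAVATED") | categories.str.contains("DEADLY")

# Alternative: one regex where | means "either pattern"
one_step = categories.str.contains("AGGRAVATED|DEADLY")

serious = two_step.astype(int)
print(serious.tolist())
print(two_step.equals(one_step))
```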

How many are serious vs simple assaults?

In [7]:
df.serious.value_counts()

0    118744
1     47221
Name: serious, dtype: int64

We have about 2.5x as many simple assaults as we do serious assaults. Typically you want to have roughly equal numbers of both, but we'll see how it goes for now.
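The imbalance can be worked out directly from the two counts printed above. This is plain arithmetic on those numbers; the variable names are chosen just for this sketch.

```python
# Counts copied from the value_counts() output above
simple = 118744   # serious == 0
serious = 47221   # serious == 1
total = simple + serious

# How many simple assaults there are for every serious one
ratio = simple / serious

# Share of all assault reports that are serious
serious_share = serious / total

print(total)
print(round(ratio, 2))
print(round(serious_share, 3))
```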

## Downgrading some serious assaults

The Los Angeles Times looked for (and found) Part I crimes that the LAPD had downgraded to Part II. We don't have access to the original classifications, though, so we'll need to randomly select serious crimes to downgrade.

Let's take 15% of the serious crimes and downgrade them to Part II. I'd rather not save this in another file because I don't want to imply it's real - **it's just us faking the downgrade for the purposes of the exercise.**

In [8]:
# Select a random sample of 15% of the Part I crimes
# (no random_state is set, so each run picks different rows)
serious_subset = df[df.serious == 1].sample(frac=0.15)

# So we can flag the ones we're downgrading
df['downgraded'] = 0

# Update the original dataframe: flag the sampled rows, then relabel them as Part II
df.loc[serious_subset.index, 'downgraded'] = 1
df.loc[serious_subset.index, 'serious'] = 0

How many did we downgrade?

In [9]:
df.downgraded.sum()

7083

Before, we had 118,744 simple assaults and 47,221 serious assaults. What do those numbers look like now?

In [10]:
df.serious.value_counts()

0    125827
1     40138
Name: serious, dtype: int64
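The same downgrade steps can be tried on a tiny made-up frame to see the bookkeeping at work. This is only a sketch: the toy data is invented, and the `random_state=0` seed is an addition so the example picks the same rows every time. The last lines check that the real counts above are consistent with moving 7,083 rows from serious to simple.

```python
import pandas as pd

# Toy frame: 20 serious rows and 30 simple rows (invented for illustration)
toy = pd.DataFrame({"serious": [1] * 20 + [0] * 30})

# 15% of 20 serious rows is 3 rows; the seed is an assumption for repeatability
subset = toy[toy.serious == 1].sample(frac=0.15, random_state=0)

# Flag the sampled rows, then relabel them as not serious
toy["downgraded"] = 0
toy.loc[subset.index, "downgraded"] = 1
toy.loc[subset.index, "serious"] = 0

print(toy.downgraded.sum())
print(toy.serious.value_counts().to_dict())

# Consistency check on the real counts reported in the notebook output
downgraded = 7083
simple_after = 118744 + downgraded
serious_after = 47221 - downgraded
print(simple_after, serious_after)
```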

And now we'll take a look at some of the downgraded assaults. Bear in mind that **we selected the assaults to downgrade randomly.**

In [11]:
df[df.downgraded == 1].head(10)

Unnamed: 0,CCDESC,DO_NARRATIVE,serious,downgraded
2,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES,0,1
243,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-SUSP DROVE UP ALONG SIDE VICTS DRIVER SIDE WINDOW AND STATED BITCH YOU TRIED TO HIT MY SON SUSP REMOVED BOTTLE WITH RIGHT HAND AND THREW IT AT VICT,0,1
349,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S PUNCHED V IN FACE AND STATED I DONT WANT YOU HERE V FELL TO GROUND S STRUCK V WITH TWO BY FOUR WOOD FOUR TO FIVE TIMES IN FACE AND BODY,0,1
368,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-SUSP AND VICT BECAME INVOLVED IN ARGUMENT SUSP PICKED UP METAL STAND AND STRUCK VICT CAUSING VISIBLE INJURIES,0,1
450,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-ARMED WITH A KNIFE SUSP LUNGED AT V2 AND MISSED SUSP THEN STABBED V1 ON THE RIGHT UPPER ARM,0,1
597,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S1 S2 CONFRONTED V ON THE STREETS S1 PUNCHED V IN THE BACK OF THE HEAD WITH UNK METAL OBJ S2 THEN KICKED V TWICE IN THE FACE AND BOTHE SUSPS FLED IN V,0,1
893,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S AND V WERE DRIVING IN VEH S GOT ANGRY AND STRUCK V THEN S GRABS SCISSORS AND ATT TO STAB V V EXITS VEH AND TAKES KEYS S FLEES TO UNK LOC,0,1
1069,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S1 FIRED AT UNK MOTORIST AND STRUCK VICT VEH S2 STOOD BY AS A LOOKOUT,0,1
1406,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-SUSP APPROACHED VICT IN VAN SUSP GOT OUT OF VAN PULLED GUN FROM WAISTBAND AND POINTED IT AT VICT SUSP CHASED AFTER VICT WGUN SUSP FLED UNK DIR,0,1
1438,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-SUPS APPRCHD VICT FROM IN A VEH SUSP EXITED VEH AND HIT VICT 4 TO 5S WITH A BASEBALL BAT SUSP FLED EB WESTERN AV,0,1


In [12]:
df.to_csv('assault.csv')

## Split out a training and testing set

In [13]:
from sklearn.utils import shuffle

# shuffle the data into a random order
# only run this cell once, otherwise it will re-shuffle 😮
df = shuffle(df, random_state=42)
df.head(3)

Unnamed: 0,CCDESC,DO_NARRATIVE,serious,downgraded
285803,BATTERY - SIMPLE ASSAULT,DO-SUSP SECURITY GUARD AT HOTEL GOT INTO VIS BED W VIC WHILE V ASLEEP WO VIC PERMISSION S THEN SPOONED VIC VIC WAS GUEST AT HOTEL AT TIME OF INCIDENT,0,0
420596,BATTERY - SIMPLE ASSAULT,DO-SUSP AND VICT BECAME INVOLVED IN ARGUMENT SUSP HELD A PLASTIC CHAIR IN FRONT OF HIM AND PUSHED VICT WITH CHAIR AS VICT APPROACHED SUSP TO CONFRONT HIM,0,0
115161,BATTERY - SIMPLE ASSAULT,DO-VICT IS PARENT OF SUSPECT STRUGGLED ENSUED FROM AN ARGUMENT AND SUSPECTELBOWED VICT IN THE CHEST SUSP THEN FLED LOCATION,0,0


In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=100, random_state=42)

print(X_train.shape)
print(X_test.shape)

(165865, 4)
(100, 4)
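Passing an integer as `test_size` asks `train_test_split` for that exact number of test rows, rather than a fraction of the data. The sketch below shows this on an invented 1,000-row frame; the column contents are placeholders, not real reports.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 1,000 placeholder reports with alternating labels
toy = pd.DataFrame({
    "DO_NARRATIVE": [f"report {i}" for i in range(1000)],
    "serious": [i % 2 for i in range(1000)],
})

# An integer test_size means "this many rows", not a proportion
train, test = train_test_split(toy, test_size=100, random_state=42)

print(train.shape)
print(test.shape)

# Every row lands in exactly one of the two sets
no_overlap = set(train.index).isdisjoint(test.index)
print(no_overlap)
```

Note that `train_test_split` shuffles rows by default, so the split is random even without a separate shuffle step.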


In [15]:
X_train.to_csv('assaults_downgraded_train.csv')
print(X_train.shape)
X_train.head(3)

(165865, 4)


Unnamed: 0,CCDESC,DO_NARRATIVE,serious,downgraded
823014,BATTERY - SIMPLE ASSAULT,DO-SUSP AND VICT INVOLVED IN ARGUMENT SUSP GRABBED VICT BY NECK PUSHED HERAGAINST WALL AND CHOKED HER,0,0
660053,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,DO-V1 AND V2 WERE DISCIPLINED BY STEPFATHER FRIDAY AFTER SCHOOL WITH A BELT V1 HAD SOME MINOR BRUSING ON BODY V2 HAD BRUISING ON HIS BODY,0,0
736013,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-UNK SUSPS DROVE UP ALONSIDE VICTS S YELLED FUCK 38 S1 THEN SHOT APPROX 4-5 RONDS FROM BLUE STEEL REVOLVER STRIKING V1 V2,1,0


In [16]:
X_test.to_csv('assaults_downgraded_test_with_answers.csv')
print(X_test.shape)
X_test.head(3)

(100, 4)


Unnamed: 0,CCDESC,DO_NARRATIVE,serious,downgraded
483580,INTIMATE PARTNER - SIMPLE ASSAULT,DO- S AND V BECAME INVOLV IN AN ARGUMENT S BECAME UPSET AND STRUCK V IN THE FACE WITH A CLOSED FIST FIVE TIMES,0,0
745059,BATTERY - SIMPLE ASSAULT,DO-VICT AND SUSP INVOLVED IN A VERBAL ARGUMENT SUSP SPIT ONCE IN THE VICTS FACE SUSP FLED ON BICYCLE,0,0
644873,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-SUSP AND VIC WERE INVLD IN A VERBAL ARGUMENT SUSP STRUCK VIC IN HAND WITH UNK OBJECT CAUSING HALF INCH LACERATION TO HIS LEFT THUMB,1,0


In [17]:
X_test_no_answers = X_test.copy()
X_test_no_answers['CCDESC'] = ''
X_test_no_answers['serious'] = ''
X_test_no_answers['downgraded'] = ''

X_test_no_answers.to_csv('assaults_downgraded_test.csv')
print(X_test_no_answers.shape)
X_test_no_answers.head(3)

(100, 4)


Unnamed: 0,CCDESC,DO_NARRATIVE,serious,downgraded
483580,,DO- S AND V BECAME INVOLV IN AN ARGUMENT S BECAME UPSET AND STRUCK V IN THE FACE WITH A CLOSED FIST FIVE TIMES,,
745059,,DO-VICT AND SUSP INVOLVED IN A VERBAL ARGUMENT SUSP SPIT ONCE IN THE VICTS FACE SUSP FLED ON BICYCLE,,
644873,,DO-SUSP AND VIC WERE INVLD IN A VERBAL ARGUMENT SUSP STRUCK VIC IN HAND WITH UNK OBJECT CAUSING HALF INCH LACERATION TO HIS LEFT THUMB,,


## You are the classifier 👈

Just like a classifier, you must learn the rules from the data.

Open up `assaults_downgraded_test.csv` and start classifying each item as 0 (simple) or 1 (serious). You may refer to `assaults_downgraded_train.csv` to learn the patterns of how these assaults are generally classified. Remember, the police may have gotten some wrong, and some of the rows labeled 0 are serious assaults we downgraded ourselves. You don't have a "golden set".
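Once you have made your guesses, one way to score them is sketched below. It assumes your answers end up in a column aligned with the rows of `assaults_downgraded_test_with_answers.csv`; the `my_guess` name and the four toy rows are hypothetical, used here in place of the real files. Because downgraded rows carry `serious == 0`, the original label is recovered as "serious or downgraded".

```python
import pandas as pd

# Toy stand-in for the answers file (values invented for illustration)
answers = pd.DataFrame({
    "serious":    [0, 0, 1, 0],
    "downgraded": [0, 1, 0, 0],
})

# Hypothetical hand labels, one per row, in the same order
guesses = pd.Series([0, 1, 1, 1], name="my_guess")

# The original classification: serious now, or serious before we downgraded it
truth = answers.serious | answers.downgraded

# Agreement with the original labels vs. with the edited labels
accuracy_vs_truth = (guesses == truth).mean()
accuracy_vs_recorded = (guesses == answers.serious).mean()

# Of the rows we downgraded, how many did the guesses flag as serious?
caught = int(((answers.downgraded == 1) & (guesses == 1)).sum())

print(accuracy_vs_truth)
print(accuracy_vs_recorded)
print(caught, "of", int(answers.downgraded.sum()), "downgrades caught")
```

A guess of 1 on a row recorded as 0 is not necessarily a mistake: it is exactly the kind of disagreement that points to a possible downgrade.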


Let's take a look at a few together.

Training set:
https://docs.google.com/spreadsheets/d/1OfaxnczYFEXxngQRn86EyDBUB0gMsYLMpqVPXDPyh9A/edit#gid=1666854851

Testing set:
https://docs.google.com/spreadsheets/d/1jNAZ11-rt3ix-Peu86IQ2wvhqfD3OiUJz_Gf4ERiLIM/edit#gid=2046706400

