# Natural Language Processing for Crime Analysis

This workbook will take you through some common Natural Language Processing (NLP) tasks. Ultimately it will analyse some crime notes data in order to try to identify distinct categories of crime. I.e., from the crime notes can we distinguish different types of crime?

**Required libraries**

The code below requires the following third-party libraries:

 - pandas
 - nltk
 - gensim
 - matplotlib

These can be installed using most python package managers, e.g.:

`conda install pandas nltk gensim matplotlib`

## Preparation

The code in this document has been designed to read a comma-separated-values (csv) file that contains a column with some crime notes in it. There can be other columns in the data, and these will be ignored, but we need to tell Python what the name of that column is. (Leave the following line alone unless you are using your own data.)

In [1]:
crime_column = "Crime Notes"

We also need to load the necessary python libraries. These are basically pieces of code that other people have written that do useful jobs (like reading files, analysing data, drawing graphs, etc.).

In [2]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import nltk # for the natural language processing

# Prepare a list of stopwords:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend([",", "."]) # Also include comma and full stop
stop_words = set(stop_words)

stemmer = nltk.PorterStemmer() # A 'stemmer' for finding roots of words

import string # Useful things to do with strings

# For doing topic modelling:
from gensim import corpora, models, similarities
from itertools import chain

# For making graphs:
import matplotlib.pyplot as plt




## Reading the Data

Now we can read the data. The `pandas` library (which we have abbreviated to `pd`) has some very useful functions for reading and writing data, including one called `read_csv`. We are going to store all of the data in a big 'DataFrame', called simply `df` (we could use any word to refer to the data, but 'df' is common, partly because it is quick to type!).

In [3]:
df = pd.read_csv("data/taxis-after_whitelisting.csv")

Python now has access to those data through the variable called `df`. Let's do a couple of things to check that everything is as expected.

In [4]:
print("The columns in the data are called:", df.columns)

print("There are", len(df), "rows")

print("Here is some more information about the data:")
df.info()

The columns in the data are called: Index(['Unnamed: 0', 'Crime Category', 'Crime Notes'], dtype='object')
There are 1059 rows
Here is some more information about the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1059 entries, 0 to 1058
Data columns (total 3 columns):
Unnamed: 0        1059 non-null int64
Crime Category    1059 non-null object
Crime Notes       1059 non-null object
dtypes: int64(1), object(2)
memory usage: 24.9+ KB


## Preparing the Text

Before actually doing any analysis of the text, it is first necessary to prepare it. The following will go through some of the typical procedures, but there are others that might be useful as well.

To prepare the text, the following code defines a new _function_. This function will be given the notes of a crime event as input, and it will return a clean version as its output.

In [5]:
def clean_crime_text(text):
    
    # Convert all of the text to lower case
    text = text.lower()
       
    # Tokenize the text (split it into its constituent words)
    tokens = nltk.word_tokenize(text)
    
    # Remove 'stop words' (words like 'and', 'but', etc)
    tokens = [word for word in tokens if word not in stop_words]
    
    # 'Stem' the words
    #tokens = [stemmer.stem(t) for t in tokens]

    
    # Finish by taking all the tokens,  putting them back together
    # into a single long string of text, and returning this.
    output = " ".join(tokens)
    return output
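
To make the steps concrete, here is a minimal sketch of the same idea applied to a single made-up crime note. It assumes a tiny hand-written stop word set (`toy_stop_words`) and uses `str.split` in place of `nltk.word_tokenize`, so it runs without any NLTK data. The real function above handles punctuation more carefully and uses the full NLTK stop word list.

```python
# Illustrative only: a tiny stop word set and a made-up crime note.
toy_stop_words = {"the", "a", "and", "to", "of", "then", ",", "."}

def clean_text_sketch(text, stop_words=toy_stop_words):
    # Lower-case the text, split it on whitespace, then drop stop words
    tokens = text.lower().split()
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

example_note = "The suspect refused to pay the fare and then left the taxi"
cleaned_example = clean_text_sketch(example_note)
print(cleaned_example)
```

Only the words that carry some meaning about the event remain, which is what the later analysis works with.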

The `clean_crime_text` function is ready. Now we will create a new column in our DataFrame called `clean_crime`, using the raw crime notes as input.

In [6]:
df['clean_crime'] = df[crime_column].apply(clean_crime_text)

Finally, sometimes it is useful to create a huge list that contains every word, regardless of which crime notes it is part of. Create that list now.

In [7]:
all_words = [] # a list that will store every word
# Run through every row and add the individual words to the all_words list
for row in df['clean_crime']:
    all_words.extend(nltk.word_tokenize(row))

# Convert this list into a format that the natural language processing
# toolkit (NLTK) understands
all_words = nltk.Text(all_words)

print("Found {} words in total".format(len(all_words)))

Found 36751 words in total


## Preliminary Analysis

Now that the text is ready, we can do some exploratory analysis.

### Common words

Create a frequency distribution showing the most commonly occurring words:

In [8]:
# Count the frequencies of the words
fd = nltk.FreqDist(all_words)

# Show the most common twenty words
fd.most_common(20)

[('taxi', 2624),
 ('suspect', 1967),
 ('driver', 1679),
 ('victim', 1461),
 ('complainant', 874),
 ('fare', 816),
 ('vehicle', 540),
 ('suspects', 436),
 ('male', 378),
 ('stated', 351),
 ('phone', 350),
 ('get', 348),
 ('comp', 323),
 ('money', 322),
 ('pay', 314),
 ('police', 305),
 ('address', 299),
 ('female', 294),
 ('times', 293),
 ('causing', 283)]
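
If it helps to see what a frequency distribution is doing, the sketch below counts words with Python's built-in `collections.Counter` (which `nltk.FreqDist` extends). The token list is made up for illustration and is not taken from the crime data.

```python
from collections import Counter

# Made-up tokens standing in for the cleaned crime notes
toy_tokens = ["taxi", "driver", "taxi", "fare", "driver", "taxi", "phone"]

# Count every word, then ask for the two most frequent
toy_counts = Counter(toy_tokens)
top_two = toy_counts.most_common(2)
print(top_two)
```

The argument to `most_common` controls how many words are returned.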

**Activity**: In the cell below, write some code that will show the **50** most common words.

### Collocations

Look for bigrams (pairs of words) that occur more frequently than expected. Find the 20 most common collocations.

In [9]:
all_words.collocations(20)

taxi driver; times stated; city centre; driver picks; leeds city;
mobile phone; without paying; home address; pay fare; good escape; get
money; private hire; cash machine; argument ensues; causing damage;
attempt pay; picks fare; calls police; time date; making attempt
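
As a rough illustration of what a bigram is, the sketch below pairs each word with the one that follows it and counts the pairs. This is a simplification: NLTK ranks pairs using a statistical association measure rather than raw counts, so its results on real data will differ. The token list is made up.

```python
from collections import Counter

# Made-up tokens standing in for the cleaned crime notes
toy_tokens = ["taxi", "driver", "picks", "fare", "city", "centre",
              "taxi", "driver", "drives", "city", "centre"]

# Pair each word with the one that follows it and count the pairs
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

# Keep only the pairs that occur more than once
repeated = [pair for pair, n in bigram_counts.items() if n > 1]
print(repeated)
```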


**Activity**: In the cell below, write some code that will show the **50** most common collocations.

### Concordance

Concordance allows us to look at the words that surround a particular word (i.e. the context).

In [10]:
# Look at the text that occurs around the word 'phone'
all_words.concordance('phone')

Displaying 25 of 350 matches:
 makes pay taxi driver returned taxi phone calls called address taxi picked sus
ey taxi driver made bank person left phone taxi later contacted taxi company sa
 company said taxi driver taxi found phone victim gets taxi leeds date stated n
ds date stated next realises 's left phone taxi company report 4 later call mal
 report 4 later call male stating 's phone taxi driver wants get give money get
taxi driver wants get give money get phone back victim refuses calls police sus
taxi driver works amber cars victims phone dropped pocket taxi victim taxi comp
5 minutes found suspect made victims phone taxi driver picks fare male driven h
houlder causing injury suspect taken phone payment taxi payment fare taxi drive
 taxi payment fare taxi driver given phone back driven phone caller taxi compan
 taxi driver given phone back driven phone caller taxi company complainant taxi
ek side nose head neck victim leaves phone taxi taxi driver witnesses next pass
driver wit
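
The idea behind a concordance can be sketched in a few lines: find every occurrence of a word and show a window of words around it. The function name `context_sketch` and the token list are made up for illustration; NLTK's own `concordance` works on character widths and formats its output differently.

```python
def context_sketch(tokens, target, width=2):
    # Collect `width` words either side of every occurrence of `target`
    windows = []
    for i, word in enumerate(tokens):
        if word == target:
            start = max(0, i - width)
            windows.append(" ".join(tokens[start:i + width + 1]))
    return windows

# Made-up tokens standing in for the cleaned crime notes
toy_tokens = ["victim", "left", "phone", "taxi", "later", "driver",
              "given", "phone", "back"]
phone_contexts = context_sketch(toy_tokens, "phone")
for line in phone_contexts:
    print(line)
```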

**Activity**: In the cell below, write some code that will show the context of the word '**take**'

### Common contexts

Find the contexts that the specified words share, where a context is the pair of words immediately before and after the word.

In [11]:
all_words.common_contexts(['phone', 'driver'])

taxi_calls taxi_handed taxi_leaving taxi_victim taxi_got taxi_suspect
taxi_gone hands_driver taxi_found taxi_customer taxi_turned
taxi_complainant taxi_victims


**Activity**: In the cell below, write some code that will show the common contexts of two words that you're interested in.

## Analysis by Crime Type

### Word Frequencies by Crime Type

You might have noticed that the data contain the crime type as an additional column. Let's have a quick look to see whether the most common words are different for the different crime types.

_Note that if you are using your own data then the following might not work (either because your crime type column is called something other than '`Crime Category`', which is easy to fix, or because you don't have the crime categories at all)._

The data contains the following unique categories:

In [12]:
print(set(df['Crime Category']))

{'Violent Crime', 'Criminal Damage', 'Theft From Motor Vehicle', 'Robbery', 'Fraud & Forgery', 'Other Theft'}


Let's see if there is a difference between 'Criminal Damage' and 'Other Theft'.

First create two lists of all the words associated with reports of 'Criminal Damage' and 'Other Theft' (as we did before for _all_ words).

In [13]:
# Create two new bags of words, as we did before, but this time only including
# the words associated with reports of 'Criminal Damage' and 'Other Theft'.

crim_damage = []
other_theft = []
for (i, row) in df.iterrows():
    if row['Crime Category'] == 'Criminal Damage':
        crim_damage.extend(nltk.word_tokenize(row['clean_crime']))
    elif row['Crime Category'] == 'Other Theft':
        other_theft.extend(nltk.word_tokenize(row['clean_crime']))
crim_damage = nltk.Text(crim_damage)
other_theft = nltk.Text(other_theft)

print("Found {} words in Criminal Damage and {} with Other Theft".format(\
    len(crim_damage), len(other_theft)))

Found 4250 words in Criminal Damage and 14164 with Other Theft


Now see what the frequency distributions of the words associated with those crime categories look like.

In [14]:
print("The most common words for Criminal Damage are:\n", \
      nltk.FreqDist(crim_damage).most_common(10) )

print("\nThe most common words for Other Theft are:\n", \
      nltk.FreqDist(other_theft).most_common(10) )

The most common words for Criminal Damage are:
 [('taxi', 254), ('suspect', 244), ('driver', 180), ('victim', 124), ('vehicle', 108), ('complainant', 96), ('causing', 87), ('fare', 86), ('male', 55), ('suspects', 53)]

The most common words for Other Theft are:
 [('taxi', 1277), ('driver', 745), ('suspect', 584), ('fare', 401), ('victim', 386), ('complainant', 336), ('phone', 280), ('pay', 214), ('money', 186), ('address', 161)]


**Question**: what do you notice about the differences in the words that commonly appear in 'Criminal Damage' and 'Other Theft' crimes? Are these differences as you would expect?

### Collocations by Crime Type

Briefly repeat the collocations analysis to see if word combinations are different for the two crime types.

In [15]:
print("Collocations for Criminal Damage")
crim_damage.collocations(10)

print("\nCollocations for Other Theft")
other_theft.collocations(10)

Collocations for Criminal Damage
taxi driver; causing damage; times stated; city centre; wing mirror;
leeds city; private hire; petrol station; argument ensues; suspect
makes

Collocations for Other Theft
taxi driver; city centre; without paying; get money; driver picks;
times stated; leeds city; mobile phone; pay fare; home address


**Question**: again, do you notice any differences in the words that commonly appear together for the different crime types? Are these differences as you would expect?

**Question**: Could the analysis of collocations and/or word frequencies be useful for other investigatory work?

## Clustering / Topic Modelling

One of the most popular uses of natural language processing (and machine learning in general) is _clustering_. Clustering is an unsupervised form of machine learning that involves grouping events that are similar.

Here, we can cluster the crime notes to look for those that contain similar combinations of words. As a proof of concept, we will then compare these clusters to their underlying crime type to see if there are any differences. I.e. is the algorithm able to distinguish, based purely on the crime notes, the different types of crime?

For the model used below, we need to tell it how many clusters we want to create (see `NUM_TOPICS` in the code below). We use six in this case because there are six different crime categories. To do this rigorously, we would experiment with different numbers of clusters to try to find an optimal number. Also, it is worth noting that, strictly speaking, the algorithm doesn't actually assign an entry to a topic. Instead, it returns a _distribution_ of topics (i.e. the probability of the crime belonging to each topic). If the algorithm has worked well, then the probability of one topic will be particularly high. If the probabilities are all similar, then the algorithm is struggling to find any particularly strong features that it can use to assign the particular crime to a single topic.

Note that we will use a method called 'Latent Dirichlet Allocation', which actually does something called 'topic modelling'. For our purposes, though, it behaves much like clustering. Don't worry too much about the code below. It isn't doing anything particularly complicated, but looks like it is.
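
Before building the model, it may help to see what 'a distribution of topics' looks like in practice. The sketch below uses two hypothetical distributions (they are not output from the model) in the `(topic_id, probability)` format, and picks out the most likely topic from each.

```python
def most_likely_topic(topic_distribution):
    # topic_distribution is a list of (topic_id, probability) pairs;
    # return the pair with the highest probability
    return max(topic_distribution, key=lambda pair: pair[1])

# Hypothetical distributions, not output from the model
confident = [(0, 0.05), (1, 0.80), (2, 0.15)]
uncertain = [(0, 0.34), (1, 0.33), (2, 0.33)]

print(most_likely_topic(confident))
print(most_likely_topic(uncertain))
```

In the first case one topic clearly dominates. In the second, a 'most likely' topic still exists, but its probability is barely higher than the others, so the assignment should not be trusted much.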

In [16]:
# First, we need a list of words. The all_words variable that we've been
# using so far is actually an nltk.Text object, which we can't use.
# Instead, create a new 'list of lists' that stores the notes for each 
# individual crime separately as inner lists.
all_words_list = []
# Run through every row and add its words, as a separate list, to all_words_list
for row in df['clean_crime']:
    all_words_list.append(nltk.word_tokenize(row))

# Create a dictionary. This assigns a number to every unique word.
id2word = corpora.Dictionary(all_words_list)

# Creates the Bag of Word corpus (convert each offence (document) to bag of words).
mm = [id2word.doc2bow(text) for text in all_words_list]

# Trains the LDA model.
NUM_TOPICS = 6
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word,
    num_topics=NUM_TOPICS, update_every=1, chunksize=10000, passes=1)

# Assigns the topics to the documents in corpus
lda_corpus = lda[mm]
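
The 'dictionary' and 'bag of words' steps above can be sketched with plain Python. This is not how gensim implements them (the ids it assigns may differ), and the names `word_to_id` and `to_bow_sketch` are made up, but the idea is the same: each unique word gets a number, and each document becomes a list of (word id, count) pairs.

```python
from collections import Counter

# Made-up documents, each already split into words
toy_docs = [["taxi", "driver", "fare"], ["taxi", "phone", "taxi"]]

# Assign an integer id to each unique word, in order of first appearance
word_to_id = {}
for doc in toy_docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

def to_bow_sketch(doc):
    # Count known words only; words without an id are skipped
    counts = Counter(word_to_id[w] for w in doc if w in word_to_id)
    return sorted(counts.items())

bow = to_bow_sketch(["taxi", "phone", "taxi", "unseen"])
print(bow)
```

Note that the word 'unseen' is dropped because it has no id. By default, gensim's `doc2bow` likewise ignores tokens that are not in the dictionary, which matters when you pass new text to the model.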

Now we have created a model called `lda`. For information, you can use the `help` function to learn more about what it can do (although these documents are often difficult to understand!). If you're feeling brave, run the chunk below to see what help the authors of the `lda` model have provided.

In [17]:
help(lda)

Help on LdaModel in module gensim.models.ldamodel object:

class LdaModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)
 |  The constructor estimates Latent Dirichlet Allocation model parameters based
 |  on a training corpus:
 |  
 |  >>> lda = LdaModel(corpus, num_topics=10)
 |  
 |  You can then infer topic distributions on new, unseen documents, with
 |  
 |  >>> doc_lda = lda[doc_bow]
 |  
 |  The model can be updated (trained) with new documents via
 |  
 |  >>> lda.update(other_corpus)
 |  
 |  Model persistency is achieved through its `load`/`save` methods.
 |  
 |  Method resolution order:
 |      LdaModel
 |      gensim.interfaces.TransformationABC
 |      gensim.utils.SaveLoad
 |      gensim.models.basemodel.BaseTopicModel
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, bow, eps=None)
 |      Return topic distribution for the given document `bow`, as a list of
 |      (topic_id, topic_probability) 2-tuples.
 | 

We now have a topic model that is able to take some crime notes and put them into one of six different clusters. Let's test it with an arbitrary crime. The model will show us how likely the crime is to fit into each of the six topics.

In [18]:
new_entry = ['moving vehicle', 'car', 'over 18', 'taxi driver', 'employee working in vehicle', 'stranger']
print(lda[id2word.doc2bow(new_entry)])

[(0, 0.084209171663971197), (1, 0.084272755222934953), (2, 0.08388900831525499), (3, 0.084071947391653024), (4, 0.084588201379439204), (5, 0.57896891602674672)]


**Question**: which topic is the new arbitrary crime most strongly associated with?

Now let's see which words or phrases characterise each topic:

In [19]:
lda.print_topics()

[(0,
  '0.075*"taxi" + 0.050*"suspect" + 0.048*"driver" + 0.032*"complainant" + 0.027*"victim" + 0.019*"fare" + 0.016*"vehicle" + 0.014*"stated" + 0.011*"suspects" + 0.011*"times"'),
 (1,
  '0.077*"taxi" + 0.068*"victim" + 0.043*"driver" + 0.038*"suspect" + 0.019*"fare" + 0.016*"comp" + 0.015*"vehicle" + 0.011*"gets" + 0.011*"male" + 0.010*"times"'),
 (2,
  '0.046*"taxi" + 0.045*"complainant" + 0.038*"driver" + 0.030*"suspect" + 0.029*"comp" + 0.029*"suspects" + 0.024*"fare" + 0.019*"male" + 0.017*"vehicle" + 0.014*"pay"'),
 (3,
  '0.090*"taxi" + 0.056*"driver" + 0.052*"victim" + 0.033*"suspect" + 0.024*"fare" + 0.019*"phone" + 0.018*"complainant" + 0.015*"money" + 0.012*"male" + 0.010*"get"'),
 (4,
  '0.077*"suspect" + 0.063*"taxi" + 0.041*"driver" + 0.028*"victim" + 0.026*"fare" + 0.024*"complainant" + 0.018*"vehicle" + 0.014*"suspects" + 0.011*"stated" + 0.011*"police"'),
 (5,
  '0.086*"suspect" + 0.068*"victim" + 0.051*"taxi" + 0.031*"driver" + 0.019*"vehicle" + 0.018*"fare" + 0.01

**Question**: do you notice any obvious differences in the six topics? Might these tell you something about the type of crime that the notes are describing?

It is also possible to visualise the topics, but we don't do that here.

Optional: run the chunk below to see the topics more clearly (this makes quite a long list).

In [20]:
for topicid in range(NUM_TOPICS):
    terms = lda.get_topic_terms(topicid, topn=20)
    print("TOPIC {}".format(topicid))
    for word_id, prob in terms:
        print("\t{} : {}".format(id2word[word_id],prob))
    print()

TOPIC 0
	taxi : 0.0746817749598921
	suspect : 0.04958591861435974
	driver : 0.0480272933223971
	complainant : 0.03153131306498431
	victim : 0.027148823883387758
	fare : 0.01893639108298777
	vehicle : 0.015719338367294527
	stated : 0.013685527354738544
	suspects : 0.011075976292190447
	times : 0.010837603331839939
	address : 0.01072175438312481
	causing : 0.010241457941916311
	phone : 0.009952260815519375
	leeds : 0.009574274326914125
	taken : 0.009481883511945954
	get : 0.008732344588713553
	male : 0.00859156469437841
	pay : 0.008552096521288216
	passenger : 0.008475084704211227
	home : 0.007941126189019435

TOPIC 1
	taxi : 0.0773437162593795
	victim : 0.06758166840565706
	driver : 0.04297416244960859
	suspect : 0.037908463004449715
	fare : 0.019070670305904848
	comp : 0.015900203642021753
	vehicle : 0.015013212776471765
	gets : 0.010694955633882555
	male : 0.010646298270124607
	times : 0.009678928843963475
	drives : 0.009435219701798721
	get : 0.009180282316383114
	stated : 0.00887509

Now link the results of the topic models back to the original data so that we can see which topic has been assigned to each individual crime.

In [21]:
# In the same way that we created the 'clean' column of crime text,
# we define a function and then apply it to the crime notes

def find_topic(crime_note):
    words = nltk.word_tokenize(crime_note)
    # Find the distribution of topics over this crime note
    topic_distribution = lda[id2word.doc2bow(words)]
    # Find the most likely topic. The model can leave out very unlikely
    # topics, so take the topic id from the best (topicid, probability)
    # pair rather than relying on its position in the list.
    max_topic = max(topic_distribution, key=lambda pair: pair[1])[0]
    # That is the topic to return
    return max_topic
    ## Now add all of the topic probabilities
    #for topicid, probability in topic_distribution:
    #    s += (","+str(probability))

df['topic'] = df['clean_crime'].apply(find_topic)

Let's see which topics are the most common:

In [41]:
# Make a pivot table
pivot1 = pd.pivot_table(df, index=['topic'], aggfunc='sum')

pivot1
# Draw a bar chart
#plt.bar(x=range(NUM_TOPICS), height=pivot1.iloc[:,0].values.tolist(),\
#        width=1/NUM_TOPICS, color="blue")

Unnamed: 0_level_0,Unnamed: 0
topic,Unnamed: 1_level_1
0,355431
1,144622
2,24274
3,27790
4,6613
5,1481
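
Note that with `aggfunc='sum'` the pivot table above adds up the values in the `Unnamed: 0` column for each topic, so the figures are not the number of crimes per topic. If counts are what you want, one possible approach is to use `value_counts` on the topic column. The sketch below uses a small made-up DataFrame (`toy`) in place of `df`.

```python
import pandas as pd

# Made-up topic assignments standing in for df['topic']
toy = pd.DataFrame({"topic": [0, 0, 1, 0, 2, 1]})

# Count the rows assigned to each topic, ordered by topic number
topic_counts = toy["topic"].value_counts().sort_index()
print(topic_counts)
```

With the real data this would be `df['topic'].value_counts().sort_index()`.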


Finally, let's see whether the topics correspond to the crime types by creating a table that shows how the different types of crime are spread across the topics.

In [42]:
# Make a pivot table that counts the number of crime categories per topic

pivot2 = pd.pivot_table(df, index = ['topic'], columns = ['Crime Category'], aggfunc='sum')

# Calculate proportions
_sum = sum(pivot2.sum()) # This is the sum of all cells
pivot2 = pivot2.applymap(lambda x: round((100*x)/_sum,1) if x >0  else 0)

# Show the table
pivot2

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0
Crime Category,Criminal Damage,Fraud & Forgery,Other Theft,Robbery,Theft From Motor Vehicle,Violent Crime
topic,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
0,7.3,2.1,29.4,1.8,1.0,21.9
1,2.1,0.3,12.1,1.0,0.7,9.7
2,0.4,0.0,1.5,0.4,0.0,2.0
3,0.3,0.0,4.0,0.1,0.3,0.3
4,0.0,0.0,0.9,0.0,0.0,0.3
5,0.0,0.0,0.1,0.0,0.0,0.1
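
With `aggfunc='sum'`, the pivot table above sums the `Unnamed: 0` column, so the percentages are shares of that sum rather than shares of the crimes. A possible alternative, sketched below on a small made-up DataFrame (`toy`), is `pd.crosstab`, which counts rows and can convert the counts to proportions.

```python
import pandas as pd

# Made-up data standing in for df
toy = pd.DataFrame({
    "topic": [0, 0, 1, 1],
    "Crime Category": ["Robbery", "Other Theft", "Other Theft", "Other Theft"],
})

# Percentage of all rows that fall in each (topic, category) cell
share = (pd.crosstab(toy["topic"], toy["Crime Category"],
                     normalize="all") * 100).round(1)
print(share)
```

Passing `normalize="index"` instead would give the percentage of each topic that belongs to each crime category, which can be easier to read when topics differ a lot in size.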


**Question**: Has it 'worked'? Are the topics adequately distinguishable?

**Question**: Can you see any uses for clustering / topic modelling in your analysis?

## Classification

Another popular use of natural language processing (and machine learning in general) is _classification_. Classification is a supervised form of machine learning. It reads data that have already been classified and tries to learn the patterns that lead to a particular classification. This is useful for classifying new data that we don't already have a classification for.

Classification could be useful in the analysis of crime data by attempting to identify crimes, from their notes, that don't have their own classification already. E.g. trying to find crimes that are associated with a journey in a taxi. A human could begin by manually identifying and classifying a few hundred individual crimes, and the algorithm could then run through the rest of the data looking for crimes with similar characteristics.

Another use, as discussed during the presentation, could be to classify text on social media by whether it is _hateful_ or not.

We don't have the time to run through an example of classification as well, but there are plenty of examples online.
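
For a flavour of what this involves, here is a very small sketch of one common approach (a Naive Bayes classifier with add-one smoothing) written in plain Python. The training notes, labels, and function names are all made up for illustration. A real analysis would use far more hand-labelled examples and an established library.

```python
import math
from collections import Counter, defaultdict

def train_sketch(labelled_notes):
    # Count how often each word appears under each label
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for words, label in labelled_notes:
        word_counts[label].update(words)
        label_counts[label] += 1
    return word_counts, label_counts

def classify_sketch(words, word_counts, label_counts):
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total = sum(word_counts[label].values())
        # Log prior plus log likelihood of each word, with add-one smoothing
        score = math.log(n_docs / sum(label_counts.values()))
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up, hand-labelled examples
training = [
    (["taxi", "driver", "fare", "unpaid"], "taxi"),
    (["taxi", "fare", "argument"], "taxi"),
    (["shop", "window", "smashed"], "other"),
    (["shop", "theft", "till"], "other"),
]
model = train_sketch(training)
prediction = classify_sketch(["driver", "fare"], *model)
print(prediction)
```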

## Appendix A: Whitelisting Crime Notes

Without manually editing crime notes, it can be difficult to ensure that they are anonymous. It is easy to accidentally miss someone's name, an address, or some unique details. One way to reduce the risks of releasing identifying information is to **whitelist** the notes. In effect, this means looking at all of the unique words that appear in the data and removing all but the most common ones. As it happens, the few most common words often account for a very large portion of the total text, so removing the others shouldn't greatly affect the natural language processing.

The following code demonstrates how to do some simple whitelisting.

In [None]:
# The name of the column with crime notes in
crime_col = 'Crime Notes'

# Read the data:
raw_data = pd.read_csv('data/taxis-before_whitelisting.csv')

# Tokenize and add to a big bag of all words
all_words = []
for index, row in raw_data.iterrows():
    text = row[crime_col].lower() # Get the crime notes for this row and make lower case
    tokens = nltk.word_tokenize(text) # Tokenize the crime notes
    all_words.extend(tokens) # Add them to the big list of words

# Create a big bag of words
text = nltk.Text(all_words)

# Count the frequencies of the words
fd = nltk.FreqDist(text)

# Display the most common words, their count, and their proportion,
# stopping when the list of words accounts for 90% of all words
# Also store these words in a 'whitelist'
whitelist = set()
cumulative = 0.0 # keep track of the cumulative percentage
for i, (word, count) in enumerate(fd.most_common(1000)):
    whitelist.add(word)
    prop = count/len(all_words)*100
    cumulative += prop
    print("{i} {word} -> {count}, {proportion}, {cumulative}".format(\
      i=i, word=word, count=count, proportion=prop, cumulative=cumulative))
    if cumulative > 90:
        break

We now have a potential whitelist of words. Go through and make sure that they're OK, removing any that are sensitive.

In [None]:
words_to_remove = [] # Add any extra words here
# Keep the whitelist as a set so that membership checks stay fast
whitelist = {word for word in whitelist if word not in words_to_remove}

Finally, go back to the original data and remove any words that are not in the whitelist. Then save the csv file.

In [None]:
# Create a new DataFrame which will have only the whitelist words
white_data = pd.DataFrame().reindex_like(raw_data)

# Go through each row of the original data, clean the crime notes column,
# and then add the new row to the white_data
for index, row in raw_data.iterrows():
    text = row[crime_col].lower() # Get the crime notes for this row
    tokens = nltk.word_tokenize(text) # Tokenize the crime notes
    tokens = [t for t in tokens if t in whitelist]
    white_text = " ".join(tokens)
    white_data.loc[index] = row.values.tolist()
    white_data.loc[index,crime_col] = white_text
    
white_data.to_csv('data/taxis-after_whitelisting.csv')
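
To see the effect of whitelisting on a small scale, the sketch below applies the same idea to a made-up list of tokens, where 'smith' stands in for a rare, potentially identifying word. The function name `build_whitelist_sketch` and the 80% threshold are chosen purely for illustration.

```python
from collections import Counter

def build_whitelist_sketch(tokens, coverage=0.9):
    # Add words, most frequent first, until they cover more than
    # `coverage` of all tokens
    counts = Counter(tokens)
    whitelist, covered = set(), 0
    for word, count in counts.most_common():
        whitelist.add(word)
        covered += count
        if covered / len(tokens) > coverage:
            break
    return whitelist

# Made-up tokens; 'smith' stands in for a rare, identifying word
toy_tokens = ["taxi"] * 6 + ["driver"] * 3 + ["smith"]
toy_whitelist = build_whitelist_sketch(toy_tokens, coverage=0.8)

# Remove every token that is not on the whitelist
filtered = [t for t in toy_tokens if t in toy_whitelist]
print(sorted(toy_whitelist), len(filtered))
```

The rare word is removed while nine of the ten tokens survive. A common word can still be sensitive, though, which is why the manual check of the whitelist remains important.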