# Natural Language Processing for Crime Analysis

This workbook will take you through some common Natural Language Processing (NLP) tasks. Ultimately it will analyse some crime notes data in order to try to identify distinct categories of crime. I.e., from the crime notes can we distinguish different types of crime?

**Required libraries**

The code below requires the following third-party libraries:

 - pandas
 - nltk
 - gensim
 - matplotlib

These can be installed using most Python package managers, e.g.:

`conda install pandas nltk gensim matplotlib`

The easiest way to install all of the required packages is to import the [n8prp-environment.yml](../n8prp-environment.yml) file into Anaconda as per the instructions outlined in the [slides](../machine-learning-slides.pdf).

## Preparation

The code in this document has been designed to read a comma-separated-values (csv) file that contains a column with some crime notes in it. There can be other columns in the data, and these will be ignored, but we need to tell Python what the name of the crime notes column is.

**YOU CAN PROVIDE YOUR OWN DATA FOR THIS TASK.** If you want to do this, put the csv file in the `data` directory, and then change the two variables below.

In [None]:
# INSERT THE NAME OF THE COLUMN WITH YOUR CRIME NOTES ON THE LINE BELOW:
crime_column  = "Crime Notes"

# INSERT THE NAME OF THE CSV FILE ON THE LINE BELOW. IT SHOULD BE IN THE 'data' DIRECTORY
csv_file_name = "taxis-after_whitelisting.csv" 

We also need to load the necessary python libraries. These are basically pieces of code that other people have written that do useful jobs (like reading files, analysing data, drawing graphs, etc.).

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import nltk # for the natural language processing

# Prepare a list of stopwords (may need to download these from the Internet)
from nltk.corpus import stopwords
try:
    stop_words = stopwords.words('english')
except LookupError:
    print("First time running, will download the NLTK stopwords")
    nltk.download('stopwords')
    stop_words = stopwords.words('english')

stop_words.extend([",", "."]) # Also include comma and full stop
stop_words = set(stop_words)

stemmer = nltk.PorterStemmer() # A 'stemmer' for finding roots of words

import string # Useful things to do with strings

# For doing topic modelling:
from gensim import corpora, models, similarities
from itertools import chain

# For making graphs:
import matplotlib.pyplot as plt


## Reading the Data

Now we can read the data. The `pandas` library (which we have abbreviated to `pd`) has some very useful functions for reading and writing data, including one called `read_csv`. We are going to store all of the data in a big 'DataFrame', called simply `df` (we could use any word to refer to the data, but 'df' is common, partly because it is quick to type!).

In [None]:
df = pd.read_csv("data/"+csv_file_name)

Python now has access to those data through the variable called `df`. Let's do a couple of things to check that everything is as expected.

In [None]:
print("The columns in the data are called:", df.columns)

print("There are", len(df), "rows")

print("Here is some more information about the data:")
df.info()
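
Because the later cells call string methods on every note, it can also help to confirm that the chosen column exists and that no notes are missing. The following is a minimal sketch on a made-up two-row DataFrame; the names `example_df` and `example_column`, and the data themselves, are illustrative assumptions and not part of the workshop data.

```python
import pandas as pd

# Hypothetical stand-in for the real data (illustration only)
example_df = pd.DataFrame({
    "Crime Notes": ["Passenger left without paying", None],
    "Crime Category": ["Other Theft", "Criminal Damage"],
})
example_column = "Crime Notes"

# 1. Is the column there at all?
column_exists = example_column in example_df.columns

# 2. How many rows have no notes? (these would break text.lower() later)
n_missing = int(example_df[example_column].isna().sum())

print("Column found:", column_exists)
print("Rows with missing notes:", n_missing)

# One option is to replace missing notes with an empty string
example_df[example_column] = example_df[example_column].fillna("")
```

With your own data you would swap `example_df` and `example_column` for `df` and `crime_column`.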

## Preparing the Text

Before actually doing any analysis of the text, it is first necessary to prepare it. The following cells will go through some of the typical procedures, but there are others that might be useful as well.

To prepare the text, the following code defines a new _function_. This function will be given the notes of a crime event as input, and it will return a clean version as its output.

In [None]:
def clean_crime_text(text):
    
    # Convert all of the text to lower case
    text = text.lower()
       
    # Tokenize the text (split it into its constituent words)
    tokens = nltk.word_tokenize(text)
    
    # Remove 'stop words' (words like 'and', 'but', etc)
    tokens = [word for word in tokens if word not in stop_words]
    
    # 'Stem' the words
    #tokens = [stemmer.stem(t) for t in tokens]

    
    # Finish by taking all the tokens,  putting them back together
    # into a single long string of text, and returning this.
    output = " ".join(tokens)
    return output
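
To make the steps concrete, here is a dependency-free sketch of the same idea applied to one made-up note. It uses `str.split` in place of `nltk.word_tokenize` and a tiny hand-written stop word set rather than NLTK's list; both are simplifying assumptions, so the real function may produce slightly different output.

```python
# A tiny, hand-picked stop word set (assumption: NLTK's real list is much longer)
example_stop_words = {"the", "a", "and", "was", "by", "of"}

def clean_text_sketch(text):
    """Lower-case, split on whitespace, drop stop words, and re-join."""
    tokens = text.lower().split()
    tokens = [t.strip(",.") for t in tokens]  # trim commas and full stops
    tokens = [t for t in tokens if t and t not in example_stop_words]
    return " ".join(tokens)

example_note = "The window of the taxi was smashed by a passenger."
print(clean_text_sketch(example_note))
```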

Also check that the 'punkt tokenizer' has been downloaded. Don't worry about this; it's needed to convert words into 'tokens'. If we don't download it now, we will get an error later telling us that we need to download it.

In [None]:
# Also see if the 'punkt' tokenizer has been downloaded (this is required to turn the words into 'tokens')
try: 
    nltk.word_tokenize("testing")
    print("Punkt tokenizer already downloaded")
except LookupError:
    print("Downloading the punkt tokenizer")
    nltk.download('punkt')

The `clean_crime_text` function is ready. Now we will create a new column in our DataFrame called `clean_crime` using the raw crime notes as input.

In [None]:
df['clean_crime'] = df[crime_column].apply(clean_crime_text)

Now we have a new column for the 'clean' crime notes. Let's look at a few rows to see what they're like.

In [None]:
for i in range(3): # Loop over the first three rows (0, 1 and 2)
    print("Row "+str(i)+":")
    print("\t"+str(df.loc[i, 'Crime Category']))
    print("\t"+str(df.loc[i, crime_column]))
    print("\t"+str(df.loc[i, 'clean_crime']))

# By the way, this is a slightly easier way to show a few rows from the top and bottom
# for the three main columns:
#df.loc[:,['Crime Category', crime_column, 'clean_crime']]

Finally, sometimes it is useful to create a huge list that contains every word, regardless of which crime notes it is part of. Create that list now.

In [None]:
all_words = [] # a list that will store every word
# Run through every row and add the individual words to the all_words list
for row in df['clean_crime']:
    all_words.extend(nltk.word_tokenize(row))

# Convert this list into a format that the natural language processing
# toolkit (NLTK) understands
all_words = nltk.Text(all_words)

print("Found {} words in total".format(len(all_words)))

## Preliminary Analysis

Now that the text is ready, we can do some exploratory analysis.

### Common words

Create a frequency distribution showing the most commonly occurring words:

In [None]:
# Count the frequencies of the words
fd = nltk.FreqDist(all_words)

# Show the most common twenty words
fd.most_common(20)
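
A frequency distribution is simply a count of how often each token occurs. As a rough illustration using only the standard library, `collections.Counter` (which `nltk.FreqDist` builds on) gives the same kind of result; the token list `example_tokens` is made up for this sketch.

```python
from collections import Counter

# A made-up list of tokens (illustration only)
example_tokens = ["taxi", "driver", "phone", "taxi", "fare", "taxi", "phone"]
example_counts = Counter(example_tokens)

# The two most frequent tokens, as (word, count) pairs
top_two = example_counts.most_common(2)
print(top_two)
```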

**Activity**: In the cell below, write some code that will show the **50** most common words.

### Collocations

Look for bigrams (pairs of words) that occur more frequently than expected. Find the 20 most common collocations.

In [None]:
all_words.collocations(20)

_Note: the code above might not work. You might get an error about "too many values to unpack". This is a known bug with the current version of the NLTK library, as documented [here](https://github.com/nltk/nltk/issues/2299). It should be fixed in the next release_.
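
If `collocations()` fails with your NLTK version, one possible workaround is to rank the bigrams directly with `BigramCollocationFinder`. The sketch below runs on a made-up token list; the choice of the likelihood-ratio score and the minimum frequency of 2 are assumptions made for illustration, not settings taken from this workbook.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# A made-up stream of tokens (illustration only)
example_tokens = ("taxi driver assaulted passenger taxi driver refused fare "
                  "mobile phone stolen taxi driver mobile phone taken").split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(example_tokens)
finder.apply_freq_filter(2)  # ignore pairs seen fewer than 2 times

# Up to five highest-scoring pairs under the likelihood-ratio measure
top_pairs = finder.nbest(bigram_measures.likelihood_ratio, 5)
print(top_pairs)
```

In the notebook, passing `all_words` to `from_words` in place of the toy list should work in the same way.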

**Activity**: In the cell below, write some code that will show the **50** most common collocations.

### Concordance

Concordance allows us to look at the words that surround a particular word (i.e. the context).

In [None]:
# Look at the text that occurs around the word 'phone'
all_words.concordance('phone')

**Activity**: In the cell below, write some code that will show the context of the word '**take**'

### Common contexts

Find the contexts that the specified words have in common.

In [None]:
all_words.common_contexts(['phone', 'driver'])

**Activity**: In the cell below, write some code that will show the common contexts of two words that you're interested in.

## Analysis by Crime Type

### Word Frequencies by Crime Type

You might have noticed that the data contain the crime type as an additional column. Let's have a quick look to see whether the most common words are different for the different crime types.

_Note that if you are using your own data then the following might not work, either because your crime type column is called something other than '`Crime Category`' (which is easy to fix by replacing '`Crime Category`' in the lines below with whatever your column is called), or because you don't have the crime categories at all._

The data contains the following unique categories:

In [None]:
print(set(df['Crime Category']))

Let's see if there is a difference between 'Criminal Damage' and 'Other Theft'.

_If you are using your own data then you might want to replace these two types in the code below with two crime types that appear in your data._

First create two lists of all the words associated with reports of 'Criminal Damage' and 'Other Theft' (as we did before for _all_ words).

In [None]:
# Create two new bags of words, as we did before, but this time only including
# the words associated with reports of 'Criminal Damage' and 'Other Theft'.

crim_damage = []
other_theft = []
for (i, row) in df.iterrows():
    if row['Crime Category'] == 'Criminal Damage':
        crim_damage.extend(nltk.word_tokenize(row['clean_crime']))
    elif row['Crime Category'] == 'Other Theft':
        other_theft.extend(nltk.word_tokenize(row['clean_crime']))
crim_damage = nltk.Text(crim_damage)
other_theft = nltk.Text(other_theft)

print("Found {} words in Criminal Damage and {} with Other Theft".format(\
    len(crim_damage), len(other_theft)))

Now see what the frequency distributions of the words associated with those crime categories look like.

In [None]:
print("The most common words for Criminal Damage are:\n", \
      nltk.FreqDist(crim_damage).most_common(10) )

print("\nThe most common words for Other Theft are:\n", \
      nltk.FreqDist(other_theft).most_common(10) )

**Question**: what do you notice about the differences in the words that commonly appear in 'Criminal Damage' and 'Other Theft' crimes? Are these differences as you would expect?

### Collocations by Crime Type

Briefly repeat the collocations analysis to see if word combinations are different for the two crime types.

_As before, the collocations might not work at the moment._

In [None]:
print("Collocations for Criminal Damage")
crim_damage.collocations(10)

print("\nCollocations for Other Theft")
other_theft.collocations(10)

**Question**: again, do you notice any differences in the words that commonly appear together for the different crime types? Are these differences as you would expect?

**Question**: Could the analysis of collocations and/or word frequencies be useful for other investigatory work?

## Clustering / Topic Modelling

One of the most popular uses of natural language processing (and machine learning in general) is _clustering_. Clustering is an unsupervised form of machine learning that involves grouping events that are similar.

Here, we can cluster the crime notes to look for those that contain similar combinations of words. As a proof of concept, we will then compare these clusters to their underlying crime type to see if there are any differences. I.e. is the algorithm able to distinguish, based purely on the crime notes, the different types of crime?

For the model used below, we need to tell it how many clusters we want to create (`NUM_TOPICS`). We use six in this case because there are six different crime categories. To do this rigorously, we would experiment with different numbers of clusters to try to find an optimal number. Also, it is worth noting that, strictly speaking, the algorithm doesn't actually assign an entry to a topic. Instead, it returns a _distribution_ of topics (i.e. the probability of the crime belonging to each topic). If the algorithm has worked well, then the probability of one topic will be particularly high. If the probabilities are all similar, then the algorithm is struggling to find any particularly strong features that it can use to assign the particular crime to a single topic.
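
To illustrate what a 'strong' or 'weak' assignment looks like, the sketch below takes a made-up topic distribution in the list-of-pairs form `(topic id, probability)` and measures the gap between the most likely topic and the runner-up. The numbers and the variable names are invented for illustration; note that very unlikely topics may be left out of such a list, so the ids need not be consecutive.

```python
# Made-up output in (topic_id, probability) form (illustration only)
example_distribution = [(0, 0.05), (2, 0.80), (4, 0.15)]

# Sort from the most likely to the least likely topic
ranked = sorted(example_distribution, key=lambda pair: pair[1], reverse=True)
best_topic, best_prob = ranked[0]
runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0

# A large gap suggests a confident assignment; a small gap suggests ambiguity
margin = best_prob - runner_up_prob
print("Most likely topic:", best_topic)
print("Probability: {:.2f}, margin over runner-up: {:.2f}".format(best_prob, margin))
```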

Note that we will use a method called 'Latent Dirichlet Allocation', which actually does something called 'topic modelling'. For our purposes, it behaves much like clustering. Don't worry too much about the code below. It isn't doing anything particularly complicated, but looks like it is.

In [None]:
# First, we need a list of words. The all_words variable that we've been
# using so far is actually an nltk.Text object, which we can't use.
# Instead, create a new 'list of lists' that stores the notes for each 
# individual crime separately as inner lists.
all_words_list = []
# Run through every row and add the individual words to the all_words list
for row in df['clean_crime']:
    all_words_list.append(nltk.word_tokenize(row))

# Create a dictionary. This assigns a number to every unique word.
id2word = corpora.Dictionary(all_words_list)

# Creates the Bag of Word corpus (convert each offence (document) to bag of words).
mm = [id2word.doc2bow(text) for text in all_words_list]

# Trains the LDA model.
NUM_TOPICS = 6
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word,
                               num_topics=NUM_TOPICS, update_every=1,
                               chunksize=10000, passes=1)

# Assigns the topics to the documents in corpus
lda_corpus = lda[mm]
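
The 'dictionary' and 'bag of words' steps can look opaque, so here is a rough standard-library sketch of the same idea on two made-up documents. The names `word_to_id` and `to_bow` are hypothetical, and gensim's own id numbering may differ from the one produced here.

```python
from collections import Counter

# Two made-up, already-tokenized documents (illustration only)
example_docs = [["taxi", "driver", "assaulted"],
                ["phone", "stolen", "taxi", "taxi"]]

# Assign an integer id to every unique word (the 'dictionary')
word_to_id = {}
for doc in example_docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

def to_bow(doc):
    """Convert a document to a sorted list of (word_id, count) pairs."""
    counts = Counter(word_to_id[w] for w in doc if w in word_to_id)
    return sorted(counts.items())

example_bows = [to_bow(doc) for doc in example_docs]
print(word_to_id)
print(example_bows)
```

Words that are not in the dictionary are silently skipped, which is why new text should be tokenized in the same way as the training data.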

Now we have created a model called `lda`. For information, you can use the `help` function to learn more about what it can do (although this documentation is often difficult to understand!). If you're feeling brave, run the chunk below to see what help the authors of the `lda` model have provided.

In [None]:
help(lda)

We now have a topic model that is able to take some crime notes and put them into one of six different clusters. Let's test it with an arbitrary crime. The model will show us how likely the crime is to fit into each of the six topics.

In [None]:
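# Clean and tokenize the new notes in the same way as the training data,
# so that the words match those in the dictionary
new_note = "moving vehicle, car, over 18, taxi driver, employee working in vehicle, stranger"
new_entry = nltk.word_tokenize(clean_crime_text(new_note))
print(lda[id2word.doc2bow(new_entry)])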

**Question**: which topic is the new arbitrary crime most strongly associated with?

Now let's see which words or phrases characterise each topic:

In [None]:
lda.print_topics()

**Question**: do you notice any obvious differences in the six topics? Might these tell you something about the type of crime that the notes are describing?

Optional: run the chunk below to see the topics more clearly (this makes quite a long list).

It is also possible to visualise the topics, but we don't do that here.

In [None]:
for topicid in range(NUM_TOPICS):
    terms = lda.get_topic_terms(topicid, topn=20)
    print("TOPIC {}".format(topicid))
    for word_id, prob in terms:
        print("\t{} : {}".format(id2word[word_id],prob))
    print()

Now link the results of the topic models back to the original data so that we can see which topic has been assigned to each individual crime.

In [None]:
# In the same way that we created the 'clean' column of crime text,
# we define a function and then apply it to the crime notes

def find_topic(crime_note):
    words = nltk.word_tokenize(crime_note)
    # Find the distribution of topics over this crime note.
    # This is a list of (topic id, probability) pairs. Very unlikely
    # topics can be left out, so a topic's position in the list is not
    # necessarily the same as its id.
    topic_distribution = lda[id2word.doc2bow(words)]
    # Find the (topic id, probability) pair with the highest probability
    max_topic, max_prob = max(topic_distribution, key=lambda pair: pair[1])
    # That is the topic to return
    return max_topic

df['topic'] = df['clean_crime'].apply(find_topic)

Let's see which topics are the most common:

In [None]:
# Count the number of crimes assigned to each topic
pivot1 = df.groupby('topic').size()
pivot1

In [None]:
# Draw a bar chart of the number of crimes per topic
plt.bar(x=pivot1.index, height=pivot1.values,
        width=1/NUM_TOPICS, color="blue")
plt.xlabel("Topic")
plt.ylabel("Number of crimes")

Finally, let's see whether the topics correspond to the crime types by creating a table that shows the percentage of crimes that fall into each combination of topic and crime type.

In [None]:
# Make a table that counts the number of crimes in each category per topic

pivot2 = pd.crosstab(df['topic'], df['Crime Category'])

# Convert the counts to percentages of all crimes
_sum = pivot2.values.sum() # This is the sum of all cells
pivot2 = (100 * pivot2 / _sum).round(1)

# Show the table
pivot2
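
The table above expresses each cell as a share of all crimes. An alternative that can be easier to read is to make each topic's row sum to 100%, so that you can see the mix of crime types within a topic. The following is a sketch on made-up data using `pd.crosstab` with `normalize="index"`; the DataFrame `example` is an illustrative assumption.

```python
import pandas as pd

# Made-up topic assignments and categories (illustration only)
example = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1],
    "Crime Category": ["Other Theft", "Other Theft", "Criminal Damage",
                       "Criminal Damage", "Criminal Damage"],
})

# normalize="index" divides each row by its own total
row_pct = (pd.crosstab(example["topic"], example["Crime Category"],
                       normalize="index") * 100).round(1)
print(row_pct)
```

If one crime type dominates each row, the topics line up well with the crime categories.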

**Question**: Has it 'worked'? Are the topics adequately distinguishable?

**Question**: Can you see any uses for clustering / topic modelling in your analysis?

## Classification

Another popular use of natural language processing (and machine learning in general) is _classification_. Classification is a supervised form of machine learning. It reads data that have already been classified and tries to learn the patterns that lead to a particular classification. This is useful for classifying new data that we don't already have a classification for.

Classification could be useful in the analysis of crime data by attempting to identify crimes, from their notes, that don't have their own classification already. E.g. trying to find crimes that are associated with a journey in a taxi. A human could begin by manually identifying and classifying a few hundred individual crimes, and the algorithm could then run through the rest of the data looking for crimes with similar characteristics.

Another use, as discussed during the presentation, could be to classify text on social media by whether it is _hateful_ or not.

We don't have the time to run through an example of classification as well, but there are plenty of examples online.
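
For readers who want a starting point, the idea can be sketched in a few lines with NLTK's Naive Bayes classifier. Everything in this example is an assumption made for illustration: the four hand-labelled notes are invented, the labels `taxi` and `not taxi` are hypothetical, and a real application would need far more training data and a proper evaluation.

```python
import nltk

def to_features(text):
    """Represent a note as a {word: True} dictionary (a simple bag of words)."""
    return {word: True for word in text.lower().split()}

# Hand-labelled examples (entirely invented)
labelled_notes = [
    ("passenger refused to pay taxi fare", "taxi"),
    ("taxi driver assaulted by passenger", "taxi"),
    ("shop window smashed overnight", "not taxi"),
    ("bicycle stolen from garden", "not taxi"),
]
training_set = [(to_features(text), label) for text, label in labelled_notes]

# Learn which words are associated with which label
classifier = nltk.NaiveBayesClassifier.train(training_set)

# Classify a note that the model has not seen before
predicted = classifier.classify(to_features("taxi fare not paid"))
print(predicted)
```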

## Appendix A: Whitelisting Crime Notes

Without manually editing crime notes, it can be difficult to ensure that they are anonymous. It is easy to accidentally miss someone's name, an address, or some unique details. One way to reduce the risks of releasing identifying information is to **whitelist** the notes. In effect, this means looking at all of the unique words that appear in the data and removing all but the most common ones. As it happens, the few most common words often account for a very large portion of the total text, so removing the others shouldn't greatly affect the natural language processing.

The following code demonstrates how to do some simple whitelisting.

In [None]:
# As before, you need to insert the name of your csv file and the column name below:
crime_col = "XXXX"   
crime_file = "XXXX"

# Read the data:
raw_data = pd.read_csv('data/'+crime_file)

# Tokenize and add to a big bag of all words
all_words = []
for index, row in raw_data.iterrows():
    text = row[crime_col].lower() # Get the crime notes for this row and make lower case
    tokens = nltk.word_tokenize(text) # Tokenize the crime notes
    all_words.extend(tokens) # Add them to the big list of words

# Create a big bag of words
text = nltk.Text(all_words)

# Count the frequencies of the words
fd = nltk.FreqDist(text)

# Display the most common words, their count, and their proportion,
# stopping when the list of words accounts for 90% of all words
# Also store these words in a 'whitelist'
whitelist = set()
cumulative = 0.0 # keep track of the cumulative percentage
for i, (word, count) in enumerate(fd.most_common(1000)):
    whitelist.add(word)
    prop = count/len(all_words)*100
    cumulative += prop
    print("{i} {word} -> {count}, {proportion}, {cumulative}".format(\
      i=i, word=word, count=count, proportion=prop, cumulative=cumulative))
    if cumulative > 90:
        break
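
To see the cut-off logic in isolation, here is a small standard-library sketch on a made-up list of tokens. It uses an 80% threshold purely so that the toy example is easy to follow (the cell above uses 90%), and the token list, including the rare surname, is invented.

```python
from collections import Counter

# Made-up tokens: two common words, plus a rare name and a rare word
example_tokens = ["taxi"] * 5 + ["driver"] * 3 + ["smith"] + ["fare"]
example_counts = Counter(example_tokens)
total = len(example_tokens)

example_whitelist = set()
covered = 0  # number of tokens accounted for so far
for word, count in example_counts.most_common():
    example_whitelist.add(word)
    covered += count
    if covered / total >= 0.8:  # stop once 80% of all tokens are covered
        break

print(example_whitelist)
```

The rare tokens, which are the ones most likely to be identifying, never make it into the whitelist.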

We now have a potential whitelist of words. Go through and make sure that they're OK, removing any that are sensitive.

In [None]:
words_to_remove = [] # Add any extra words here
# Keep the whitelist as a set so that looking words up in it stays fast
whitelist = {word for word in whitelist if word not in words_to_remove}

Finally, go back to the original data and remove any words that are not in the whitelist. Then save the csv file.

In [None]:
# Create a copy of the original data that will hold only the whitelisted words
white_data = raw_data.copy()

# Define a function that removes any words that are not in the whitelist
def whitelist_text(text):
    tokens = nltk.word_tokenize(text.lower()) # Tokenize the crime notes
    tokens = [t for t in tokens if t in whitelist]
    return " ".join(tokens)

# Apply it to the crime notes column
white_data[crime_col] = white_data[crime_col].apply(whitelist_text)

# index=False stops pandas writing the row numbers as an extra column
white_data.to_csv('data/taxis-after_whitelisting.csv', index=False)