# Mississauga News Analysis 

We will be using tweets from the Mississauga News to conduct our analysis. 
The focus of this analysis is crime in Mississauga: we want to see whether it increases during certain months and whether it is increasing year over year.

By: Kanika Chopra

## Importing Libraries

We begin by importing the necessary libraries for the analysis.

In [1]:
import numpy as np
import pandas as pd

import spacy
from spacy import displacy
import string

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

from collections import Counter

import contractions
from string import punctuation

from nlp import *
from similarity import *

## Importing Data
We scraped the tweets from the Mississauga News Twitter account and saved them to a CSV file, which we now import.

In [2]:
df = pd.read_csv('MissiNewsRoom_tweets.csv')

# Let's take a look at our data 
df.head()

Unnamed: 0.1,Unnamed: 0,id,Date,Tweet
0,0,1194057385817587714,2019-11-12 01:00:39,A badly broken leg could have ended #Brampton ...
1,1,1194042230106931206,2019-11-12 00:00:26,NEW BUSINESS: Align Custom Fit Footwear &amp; ...
2,2,1193998304163565571,2019-11-11 21:05:53,‘Right time for him to immediately step down’:...
3,3,1193981803742289921,2019-11-11 20:00:19,Where were you 40 years ago during the #Missis...
4,4,1193980972355723264,2019-11-11 19:57:01,Canadian Broadcast Standards Council says it c...


In [3]:
# Columns of our CSV file
df.columns

Index(['Unnamed: 0', 'id', 'Date', 'Tweet'], dtype='object')

## Data Pre-Processing
Let's get our data into the form we need for the NLP work we'll be doing.

In [4]:
# Let's drop the unnecessary columns
df.drop(columns=['Unnamed: 0', 'id'], inplace=True)
df.columns = ['Timestamp', 'Tweet']

df.head()

Unnamed: 0,Timestamp,Tweet
0,2019-11-12 01:00:39,A badly broken leg could have ended #Brampton ...
1,2019-11-12 00:00:26,NEW BUSINESS: Align Custom Fit Footwear &amp; ...
2,2019-11-11 21:05:53,‘Right time for him to immediately step down’:...
3,2019-11-11 20:00:19,Where were you 40 years ago during the #Missis...
4,2019-11-11 19:57:01,Canadian Broadcast Standards Council says it c...


### Dates
We begin with the overall timestamp. We want to break it down into Date and Time, and then separate out the month and year in case we want to compare monthly and yearly data.

In [5]:
# Break the Date apart into Date and Time 
df['Time'] = df['Timestamp'].apply(lambda x: x[11:])  # start after the space that follows the date
df['Date'] = df['Timestamp'].apply(lambda x: x[:10])
df['Month'] = df['Date'].apply(lambda x: x[5:7])
df['Year'] = df['Date'].apply(lambda x: x[:4])

# Reorder the columns
df = df[['Date', 'Year', 'Month', 'Time', 'Tweet']]

df.head()

Unnamed: 0,Date,Year,Month,Time,Tweet
0,2019-11-12,2019,11,01:00:39,A badly broken leg could have ended #Brampton ...
1,2019-11-12,2019,11,00:00:26,NEW BUSINESS: Align Custom Fit Footwear &amp; ...
2,2019-11-11,2019,11,21:05:53,‘Right time for him to immediately step down’:...
3,2019-11-11,2019,11,20:00:19,Where were you 40 years ago during the #Missis...
4,2019-11-11,2019,11,19:57:01,Canadian Broadcast Standards Council says it c...
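
As an alternative to slicing the string by position, the timestamp can be parsed once and its parts read from the datetime accessor. The following is a minimal sketch on two made-up rows (the `demo` frame is hypothetical); it assumes every timestamp follows the 'YYYY-MM-DD HH:MM:SS' layout seen above.

```python
import pandas as pd

# Two made-up rows in the same layout as the Timestamp column
demo = pd.DataFrame({'Timestamp': ['2019-11-12 01:00:39', '2019-01-31 22:05:25']})

# Parse the strings once; a malformed timestamp raises an error here
parsed = pd.to_datetime(demo['Timestamp'], format='%Y-%m-%d %H:%M:%S')

demo['Date'] = parsed.dt.strftime('%Y-%m-%d')
demo['Year'] = parsed.dt.year       # integer, e.g. 2019
demo['Month'] = parsed.dt.month     # integer 1-12, not the zero-padded string '01'
demo['Time'] = parsed.dt.strftime('%H:%M:%S')
```

Note that Year and Month come out as integers with this approach, so a later filter would compare against `1` rather than `'01'`.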


### Links
Each tweet includes a link to the actual Mississauga News article. We are going to separate this link from the body of the tweet; this way, if we want to look at an article for more information, we still have access to its link.

In [6]:
# We need to separate the link from the tweet
df['Link'] = df['Tweet'].apply(lambda x: 'http' + x.split('http')[1] if 'http' in x else '')
df['Tweet'] = df['Tweet'].apply(lambda x: x.split('http')[0] if 'http' in x else x)

# Break apart all contractions (except name possession e.g. Sarah's)
df['Tweet'] = df['Tweet'].apply(lambda x: contractions.fix(x))
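
Splitting on 'http' keeps only the first link and discards everything after it. If a tweet can carry more than one link, a regular expression is one way to collect them all. This is a sketch rather than the approach used in this notebook; `split_links` is a hypothetical helper and the URLs in the example are made up.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r'https?://\S+')

def split_links(tweet):
    """Return (text_without_links, list_of_links) for a single tweet."""
    links = URL_PATTERN.findall(tweet)
    text = URL_PATTERN.sub('', tweet).strip()
    return text, links

# One link, two links, and no link at all
text_one, links_one = split_links('Police ID suspect: https://t.co/abc123')
text_two, links_two = split_links('Storm update https://t.co/abc123 https://t.co/xyz789')
text_none, links_none = split_links('No link in this tweet')
```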

In [7]:
# Let's create another DataFrame with the original tweets in case we need to view them
original = df.copy()

original.head()

Unnamed: 0,Date,Year,Month,Time,Tweet,Link
0,2019-11-12,2019,11,01:00:39,A badly broken leg could have ended #Brampton ...,https://t.co/mgndlYyBhz
1,2019-11-12,2019,11,00:00:26,NEW BUSINESS: Align Custom Fit Footwear &amp; ...,https://t.co/hFvGVGsF8n
2,2019-11-11,2019,11,21:05:53,‘Right time for him to immediately step down’:...,https://t.co/x4oxb4LaDD
3,2019-11-11,2019,11,20:00:19,Where were you 40 years ago during the #Missis...,https://t.co/UmHGqFToqh
4,2019-11-11,2019,11,19:57:01,Canadian Broadcast Standards Council says it c...,https://t.co/6l2RIg8EpP


### Tokenization and Lemmatization
We are going to tokenize all the words, lemmatize them and store them in a list for the bag of words approach.

We're going to compare nltk and spaCy to see which works better for this use case. For nltk, we tokenize the words using a RegexpTokenizer to exclude punctuation, and then lemmatize each token. spaCy will parse the sentences for each tweet and then lemmatize each token.

#### nltk Lemmatization

To use nltk's WordNet lemmatizer with the appropriate POS tag, we import a helper function, get_pos, that returns the POS tag for a word. The tag is passed as the second argument to lemmatize() once the lemmatizer is initialized.
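
The helper itself lives in a separate module and is not shown in this notebook. As an illustration only, a helper of this kind usually maps the Penn Treebank tag produced by nltk's `pos_tag` to one of the four WordNet POS letters. The function name `penn_to_wordnet` below is hypothetical and is not the author's implementation.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'  # adjective, same value as nltk.corpus.wordnet.ADJ
    if tag.startswith('V'):
        return 'v'  # verb, same value as nltk.corpus.wordnet.VERB
    if tag.startswith('R'):
        return 'r'  # adverb, same value as nltk.corpus.wordnet.ADV
    return 'n'      # noun, which is also lemmatize()'s default

# With nltk installed, a get_pos-style helper could then be written as:
#   from nltk import pos_tag
#   def get_pos(word):
#       return penn_to_wordnet(pos_tag([word])[0][1])
```

Tagging one word at a time, as the cell below does, gives the tagger no sentence context, so tags for ambiguous words may differ from what a full-sentence tagging would produce.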

In [8]:
# Initialize the tokenizer so that it doesn't include punctuation 
tokenizer = RegexpTokenizer(r'\w+')

In [9]:
# Change everything to lowercase words
df['Tweet'] = df['Tweet'].apply(lambda x: x.lower())

# Tokenization 
df['nltk_Tweet'] = [tokenizer.tokenize(x) for x in df['Tweet']]

In [10]:
# Remove stop words
stop_words = set(stopwords.words('english'))

df['nltk_Tweet'] = df['nltk_Tweet'].apply(lambda x: [word for word in x if word not in stop_words])

In [11]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\RQ\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\RQ\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [13]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer() 

In [14]:
# Convert words to their root word (e.g. 'shooting' to 'shoot')
df['nltk_Tweet'] = df['nltk_Tweet'].apply(lambda x: [lemmatizer.lemmatize(word, get_pos(word)) for word in x])

# Let's take a look at our data now 
df.head()

Unnamed: 0,Date,Year,Month,Time,Tweet,Link,nltk_Tweet
0,2019-11-12,2019,11,01:00:39,a badly broken leg could have ended #brampton ...,https://t.co/mgndlYyBhz,"[badly, broken, leg, could, end, brampton, nat..."
1,2019-11-12,2019,11,00:00:26,new business: align custom fit footwear &amp; ...,https://t.co/hFvGVGsF8n,"[new, business, align, custom, fit, footwear, ..."
2,2019-11-11,2019,11,21:05:53,‘right time for him to immediately step down’:...,https://t.co/x4oxb4LaDD,"[right, time, immediately, step, sportsnet, cu..."
3,2019-11-11,2019,11,20:00:19,where were you 40 years ago during the #missis...,https://t.co/UmHGqFToqh,"[40, year, ago, mississauga, miracle, train, d..."
4,2019-11-11,2019,11,19:57:01,canadian broadcast standards council says it c...,https://t.co/6l2RIg8EpP,"[canadian, broadcast, standard, council, say, ..."


#### spaCy Lemmatization

In [15]:
df2 = original.copy()

# Initialize spacy 'en_core_web_sm' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Lemmatize the tweet
df2['spacy_Tweet'] = df2['Tweet'].apply(lambda x: spacy_lemmatize(x))

# Remove punctuation from spacy_Tweets
df2['spacy_Tweet'] = df2['spacy_Tweet'].apply(lambda x: [word for word in x if word.isalpha()])

# Let's view our data now 
df2.head()

Unnamed: 0,Date,Year,Month,Time,Tweet,Link,spacy_Tweet
0,2019-11-12,2019,11,01:00:39,A badly broken leg could have ended #Brampton ...,https://t.co/mgndlYyBhz,"[a, badly, broken, leg, could, have, end, bram..."
1,2019-11-12,2019,11,00:00:26,NEW BUSINESS: Align Custom Fit Footwear &amp; ...,https://t.co/hFvGVGsF8n,"[new, business, align, custom, fit, footwear, ..."
2,2019-11-11,2019,11,21:05:53,‘Right time for him to immediately step down’:...,https://t.co/x4oxb4LaDD,"[right, time, for, to, immediately, step, spor..."
3,2019-11-11,2019,11,20:00:19,Where were you 40 years ago during the #Missis...,https://t.co/UmHGqFToqh,"[where, be, year, ago, during, the, mississaug..."
4,2019-11-11,2019,11,19:57:01,Canadian Broadcast Standards Council says it c...,https://t.co/6l2RIg8EpP,"[canadian, broadcast, standards, council, say,..."


### nltk vs. spaCy
We'll compare the first and last tweet tokenization and lemmatization using both libraries.

In [16]:
# nltk Tweet
' '.join(df['nltk_Tweet'][0])

'badly broken leg could end brampton native tyler ennis career surgery countless rehab hour play raptors905 mississauga'

In [17]:
# spaCy Tweet
' '.join(df2['spacy_Tweet'][0])

'a badly broken leg could have end brampton native tyler career but after surgery and countless rehab hour be play again with in mississauga'

As we can see, the spaCy output keeps words such as 'a', 'but', and 'with', since no stop words were removed on the spaCy path shown here. These words are not useful for our bag of words approach if we are trying to get the common topics in the news. Thus, we will use the nltk approach to tokenize and lemmatize our tweets and continue with it.

## Bag of Words Approach
For a given month, we pool the tokens from all of that month's tweets and take the 10 most common tokens along with their frequencies.

In [18]:
# Let's combine the sentences back together for later use
df['cleaned_Tweet'] = df['nltk_Tweet'].apply(lambda x: ' '.join(x))

df.head()

Unnamed: 0,Date,Year,Month,Time,Tweet,Link,nltk_Tweet,cleaned_Tweet
0,2019-11-12,2019,11,01:00:39,a badly broken leg could have ended #brampton ...,https://t.co/mgndlYyBhz,"[badly, broken, leg, could, end, brampton, nat...",badly broken leg could end brampton native tyl...
1,2019-11-12,2019,11,00:00:26,new business: align custom fit footwear &amp; ...,https://t.co/hFvGVGsF8n,"[new, business, align, custom, fit, footwear, ...",new business align custom fit footwear amp foo...
2,2019-11-11,2019,11,21:05:53,‘right time for him to immediately step down’:...,https://t.co/x4oxb4LaDD,"[right, time, immediately, step, sportsnet, cu...",right time immediately step sportsnet cut tie ...
3,2019-11-11,2019,11,20:00:19,where were you 40 years ago during the #missis...,https://t.co/UmHGqFToqh,"[40, year, ago, mississauga, miracle, train, d...",40 year ago mississauga miracle train derailment
4,2019-11-11,2019,11,19:57:01,canadian broadcast standards council says it c...,https://t.co/6l2RIg8EpP,"[canadian, broadcast, standard, council, say, ...",canadian broadcast standard council say take c...


In [19]:
# Let's try this with the first month (Month == '01')
Jan = df[df['Month'] == '01'].reset_index(drop=True)

Jan.head()

Unnamed: 0,Date,Year,Month,Time,Tweet,Link,nltk_Tweet,cleaned_Tweet
0,2019-01-31,2019,1,22:05:25,it is #throwbackthursday time! is not this the...,https://t.co/kmtxpQOWv9,"[throwbackthursday, time, year, pig, tbt]",throwbackthursday time year pig tbt
1,2019-01-31,2019,1,20:55:37,rt @bramptonfirees: @bramptontransit @1010traf...,,"[rt, bramptonfirees, bramptontransit, 1010traf...",rt bramptonfirees bramptontransit 1010traffic ...
2,2019-01-31,2019,1,14:44:31,update: another multi-vehicle crash in #missis...,https://t.co/bwT1FFKe77,"[update, another, multi, vehicle, crash, missi...",update another multi vehicle crash mississauga...
3,2019-01-31,2019,1,14:06:06,have you seen her? aliaiyda temiz was last see...,https://t.co/6xoldbdNnm,"[see, aliaiyda, temiz, last, see, late, yester...",see aliaiyda temiz last see late yesterday aft...
4,2019-01-31,2019,1,12:28:11,warning: police are cautioning drivers about i...,https://t.co/f5pVuXo69t,"[warn, police, caution, driver, icy, road, con...",warn police caution driver icy road condition ...


We need to take every tweet's list of words and combine them into one list to be used in the bag of words.

In [20]:
# Flatten every tweet's token list into one list
all_words = [word for tokens in Jan['nltk_Tweet'] for word in tokens]
    
# How many words do we have?
len(all_words)

2693

In [21]:
# Bag of Words with all the tweets for the first month
bagofwords = Counter(all_words)

bagofwords.most_common(10)

[('mississauga', 132),
 ('police', 35),
 ('peel', 33),
 ('year', 31),
 ('news', 28),
 ('man', 25),
 ('break', 23),
 ('brampton', 18),
 ('new', 18),
 ('rt', 16)]
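
The same counting can be extended from one month to every month by keeping one Counter per month. This is a minimal sketch on made-up rows; the `rows` list is a hypothetical stand-in for the Month and nltk_Tweet columns.

```python
from collections import Counter, defaultdict

# Made-up (month, tokens) pairs standing in for df[['Month', 'nltk_Tweet']]
rows = [
    ('01', ['police', 'mississauga', 'crash']),
    ('01', ['police', 'warn', 'driver']),
    ('02', ['mississauga', 'new', 'business']),
]

# One Counter per month, updated with each tweet's tokens
monthly_counts = defaultdict(Counter)
for month, tokens in rows:
    monthly_counts[month].update(tokens)

# The 10 most common tokens for each month
top_by_month = {month: counts.most_common(10) for month, counts in monthly_counts.items()}
```

With the real data, the pairs could come from `zip(df['Month'], df['nltk_Tweet'])`.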

We can see that 'police' is the second most common token. So let's take a look at some of the tweets that mention the police.

In [22]:
# Let's take the first month of original tweets 
Jan_original = original[original['Month'] == '01']
Jan_original.reset_index(inplace=True, drop=True)

Jan_police = Jan_original[Jan_original['Tweet'].str.contains('police')]['Tweet']
Jan_police.reset_index(inplace=True, drop=True)

police_tweets = list(Jan_police)
    
police_tweets

['His company was supposed to help people scammed by door-to-door sales. But police have now charged him with fraud. #Brampton #Mississauga ',
 '#Peel police are reminding residents what is allowed and what is not when it comes to recreational #cannabis after #Mississauga’s neighbour #Brampton gives green light to retail stores\n',
 'BREAKING NEWS: Police ID and name suspect in vicious sexual assault inside Square One mall @shopSQUAREONE in #Mississauga that left a young woman traumatized. A #warrant is out for his arrest and he is known to police: ',
 'BREAKING NEWS: Heavy police and paramedic presence after #bomb threat made at Credit Valley Hospital in #Mississauga this evening. @Trillium_Health : ',
 'THIS JUST IN: Young woman #sexually assaulted inside Square One in #Mississauga and police allege this man is responsible for the #attack. he is still on the run: ',
 'BREAKING NEWS: A #Mississauga man serving time for #guns and drug trafficking #crimes has disappeared and police need

## Police and Crime Analysis
We are going to take the tweets that are related to police, assault, shootings, stabbings and robberies to compare those crimes first.

In [23]:
# Set the crime_words that we are focusing on first
crime_words = ['police', 'shoot',  'stab', 'robs', 'robbery', 'assault']

In [24]:
# Collect the tweets matching each crime word, tagging each copy with the word it matched
subsets = [
    Jan[Jan['cleaned_Tweet'].str.contains(word)].assign(word=word)
    for word in crime_words
]

# pd.concat replaces DataFrame.append (removed in pandas 2.0); the columns are
# sorted alphabetically so they match the order shown in the output below
Jan_crime = pd.concat(subsets, ignore_index=True).sort_index(axis=1)
len(Jan_crime)

50

In [25]:
Jan_crime

Unnamed: 0,Date,Link,Month,Time,Tweet,Year,cleaned_Tweet,nltk_Tweet,word
0,2019-01-31,https://t.co/bwT1FFKe77,1,14:44:31,update: another multi-vehicle crash in #missis...,2019,update another multi vehicle crash mississauga...,"[update, another, multi, vehicle, crash, missi...",police
1,2019-01-31,https://t.co/f5pVuXo69t,1,12:28:11,warning: police are cautioning drivers about i...,2019,warn police caution driver icy road condition ...,"[warn, police, caution, driver, icy, road, con...",police
2,2019-01-30,https://t.co/Gg8d60xV51,1,18:36:50,rise in mental health calls to @peelpolicemedi...,2019,rise mental health call peelpolicemedia sympto...,"[rise, mental, health, call, peelpolicemedia, ...",police
3,2019-01-30,https://t.co/zboaK5xpSS,1,11:54:41,have you seen this man? peel regional police a...,2019,see man peel regional police search matthew ho...,"[see, man, peel, regional, police, search, mat...",police
4,2019-01-29,https://t.co/3lB7CGKLQb,1,14:57:43,his company was supposed to help people scamme...,2019,company suppose help people scammed door door ...,"[company, suppose, help, people, scammed, door...",police
5,2019-01-28,https://t.co/iV9UPOgc6c,1,18:37:16,breaking news: massive #snowstorm hits #missis...,2019,break news massive snowstorm hit mississauga c...,"[break, news, massive, snowstorm, hit, mississ...",police
6,2019-01-25,https://t.co/nTiv9MC75V,1,19:43:00,warning: police hunting 2 distraction thieves ...,2019,warn police hunt 2 distraction thief target mi...,"[warn, police, hunt, 2, distraction, thief, ta...",police
7,2019-01-25,https://t.co/txQ7wLWYvN,1,16:05:58,‘this guy has a death wish’: this #audi was ca...,2019,guy death wish audi caught go crazy speed hwy4...,"[guy, death, wish, audi, caught, go, crazy, sp...",police
8,2019-01-25,https://t.co/1j3XFkekcy,1,13:02:47,just in: police identify and name #mississauga...,2019,police identify name mississauga man accuse dr...,"[police, identify, name, mississauga, man, acc...",police
9,2019-01-25,https://t.co/by08kU6yzf,1,01:30:25,this just in: police looking for surveillance ...,2019,police look surveillance footage pedestrian hi...,"[police, look, surveillance, footage, pedestri...",police


We have 50 rows in January relating to crime. We've added a 'word' column to show which crime word matched each tweet; a tweet that matches more than one crime word appears once per matching word.
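
Because the matching above is done on substrings, a word like 'police' also matches a token such as 'peelpolicemedia' (row 2 in the table). Whether that should count is a judgment call. If only whole tokens should match, one option is to test membership in the token list instead. This is a sketch, and `matched_crime_words` is a hypothetical helper.

```python
crime_words = ['police', 'shoot', 'stab', 'robs', 'robbery', 'assault']

def matched_crime_words(tokens, vocabulary=crime_words):
    """Return the crime words that appear as whole tokens in one tweet."""
    token_set = set(tokens)
    return [word for word in vocabulary if word in token_set]

# 'peelpolicemedia' contains 'police' as a substring but is not the token 'police'
handle_only = matched_crime_words(['rise', 'mental', 'health', 'call', 'peelpolicemedia'])

# 'establish' contains 'stab' as a substring; only 'police' is a whole-token match
whole_token = matched_crime_words(['police', 'establish', 'command', 'post'])
```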

In [26]:
# Let's take a look at our data
Jan_crime['word'].value_counts()

police     34
assault     7
robbery     4
shoot       3
stab        2
Name: word, dtype: int64

In [27]:
# Let's see what our time period was (our minimum and maximum date)
min_date, max_date = min(Jan_crime['Date']), max(Jan_crime['Date'])

min_date, max_date

('2019-01-02', '2019-01-31')

So we can see that from January 2nd, 2019 until January 31st, 2019, there were 34 police-related tweets, 7 about assaults, 4 about robberies, 3 about shootings, and 2 about stabbings. Let's take a deeper look at this! 

## Tweet Similarity

One issue our dataset could have is multiple tweets based on the same news piece; counting near-duplicate tweets would inflate our numbers. We are going to check the text similarity between tweets using Jaccard similarity and cosine similarity, and compare the two methods.
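
To make the two measures concrete, here is a minimal sketch of both on whitespace-separated tokens. It is not the `cosine` function imported from the similarity module (whose implementation is not shown here); in particular, this cosine works on raw word counts rather than word vectors, and both function names are hypothetical.

```python
import math
from collections import Counter

def jaccard_similarity(text_a, text_b):
    """Shared unique tokens divided by all unique tokens."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_similarity_counts(text_a, text_b):
    """Cosine of the angle between the two token-count vectors."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

a = 'police search missing man'
b = 'police search missing woman'
jaccard_score = jaccard_similarity(a, b)       # 3 shared tokens out of 5 unique
cosine_score = cosine_similarity_counts(a, b)  # dot product 3, both norms 2
```

For this pair, Jaccard gives 3/5 and the count-based cosine gives 3/4, which shows that the two measures can rank the same pair differently.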

In [None]:
word_vectors = spacy.load('en_vectors_web_lg')

In [None]:
# Take a series of the cleaned tweets in January crime
ser = Jan_crime['cleaned_Tweet']

# Convert it to an array 
crimes = ser.values

In [None]:
# Create an empty similarity matrix
similarity_matrix = np.zeros([len(crimes), len(crimes)])

similarity_matrix

In [None]:
# Implement a similarity matrix
for i, sentence_i in enumerate(crimes):
    for j, sentence_j in enumerate(crimes):
        if i!=j:
            similarity_matrix[i][j] = cosine(sentence_i, sentence_j)
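
Once pairwise scores are available, near-duplicate tweets can be flagged by comparing each pair once against a cutoff. The sketch below assumes a simple token-overlap score and a cutoff of 0.8; both are placeholders for illustration, not values taken from this analysis.

```python
from itertools import combinations

def token_overlap(text_a, text_b):
    """Jaccard overlap of whitespace-separated tokens (stand-in similarity)."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def near_duplicates(texts, similarity=token_overlap, threshold=0.8):
    """Return (i, j, score) for every pair i < j scoring at or above threshold."""
    pairs = []
    for (i, text_i), (j, text_j) in combinations(enumerate(texts), 2):
        score = similarity(text_i, text_j)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

sample = [
    'police search missing man',
    'police search missing man',
    'new business open mississauga',
]
flagged = near_duplicates(sample)
```

Since similarity is symmetric, visiting each pair once does half the work of the double loop above.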

In [None]:
# crimes is already a NumPy array, so we inspect it directly
crimes

In [None]:
sentence_i = crimes[0]
sentence_j = crimes[1]


In [None]:
cosine(sentence_i, sentence_j)