# Document Similarity

The first step in determining the validity of a message is to compare it to a database of known hoax messages.

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

The `messages.csv` file contains 82 known hoax WhatsApp messages that we collected over the course of the two-day hackathon. We don't need a large volume of data for this step, as we simply compare incoming messages for similarity against a known corpus of hoaxes. Over time this corpus will grow.

In [2]:
df = pd.read_csv('../data/messages.csv', header=None)
df.columns = ['quote']

In [3]:
df.sample(10)

Unnamed: 0,quote
32,1st time in the world history a 101year old la...
66,\nCarte Blanche: E-toll. Forward on to as many...
16,40 Year Old Man Is Diagnosed With Eye Cancer B...
75,"\nHi guys, Just want to tell you about this cr..."
14,"ISIS recruiter, traitor and runaway criminal Z..."
81,\nWhatsApp is going to cost us money soon. The...
53,\nPlease note that there won't be electricity ...
57,\nWhatsapp is shutting down on 28th jan Messag...
9,PAKISTAN: MAN SENTENCED TO DEATH FOR FARTING I...
45,Whatsapp celebrating 10 years by giving exciti...


In [4]:
# Create the vectorizer
tfidf = TfidfVectorizer()
# Transform all of our known documents into vector space
tfidf_matrix = tfidf.fit_transform(df['quote'].values)

In [5]:
# Create a helper function to return similarity of a given string
def getSimilarity(text_string):
    # Convert the input text into a TF-IDF vector
    test_matrix = tfidf.transform([text_string])
    # Dot products of the L2-normalized TF-IDF rows, i.e. cosine similarities
    sim = (tfidf_matrix * test_matrix.T).A
    # Return the similarity score of the most similar document
    return sim[np.argmax(sim)][0]

In [6]:
# Test with an exact match string from our corpus
test = """
Please note that there won’t be electricity tomorrow from 06:00- 12:00 through out the whole of South Africa. Please pass on the message and make everyone aware.
"""
print(getSimilarity(test))

1.0
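
The helper treats a plain dot product as a cosine similarity. That holds because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). The following is a minimal sketch of that equivalence, assuming a made-up toy corpus and query rather than the contents of `messages.csv`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and query, invented for illustration (not from messages.csv)
corpus = [
    "there will be no electricity tomorrow please pass on the message",
    "whatsapp is going to cost us money soon forward this message",
    "free airline tickets for everyone who shares this link",
]
query = "no electricity tomorrow please tell everyone"

vectorizer = TfidfVectorizer()  # norm='l2' is the default
corpus_matrix = vectorizer.fit_transform(corpus)
query_matrix = vectorizer.transform([query])

# Plain sparse dot product, the same operation getSimilarity performs
dot_scores = (corpus_matrix @ query_matrix.T).toarray().ravel()
# Explicit cosine similarity for comparison
cos_scores = cosine_similarity(corpus_matrix, query_matrix).ravel()

best = int(np.argmax(dot_scores))
print("best match index:", best)
print("scores agree:", np.allclose(dot_scores, cos_scores))
```

If the vectorizer were created with `norm=None`, the dot product would no longer be bounded by 1.0 and the two score arrays would differ.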


So an exact match gives us a score of 1.0, which means it's working as expected. Now we need to test how sensitive the score is to changes in the wording of the test string. We'll create some variations of the string, and then iterate through them to see how they fare.

In [7]:
tests = {
    'exact': 'Please note that there won’t be electricity tomorrow from 06:00- 12:00 through out the whole of South Africa. Please pass on the message and make everyone aware.',
    'timechange': 'Please note that there won’t be electricity tomorrow from 08:00- 14:00 through out the whole of South Africa. Please pass on the message and make everyone aware.',
    'wordchange': 'Please note that there won’t be electricity tomorrow from 06:00- 12:00 through out the whole of Zimbabwe. Please pass on the message and make everyone aware.',
    'truncate': 'Please note that there won’t be electricity tomorrow from 06:00- 12:00 through out the whole of South Africa. Please pass on the message.',
    'truncatechange': 'Please note that there won’t be electricity tomorrow from 06:00- 12:00 through out the whole of Zimbabwe. Please pass on the message.'
}

In [8]:
for k, v in tests.items():
    print(k, getSimilarity(v))

timechange 0.947136805033
wordchange 0.922108320412
exact 1.0
truncatechange 0.864348231688
truncate 0.946688357645


The similarity drops away quite quickly if you make big changes (`truncatechange`). If we wanted to test for exact matches, this would be a good measure, but these messages might get changed slightly.

This is working with unprocessed text.  Let's see if we can get better results by processing the text.

In [11]:
# First, try stripping accents
tfidf = TfidfVectorizer(strip_accents = 'unicode')
# Transform all of our known documents into vector space
tfidf_matrix = tfidf.fit_transform(df['quote'].values)

for k, v in tests.items():
    print(k, getSimilarity(v))

timechange 0.947136805033
wordchange 0.922108320412
exact 1.0
truncatechange 0.864348231688
truncate 0.946688357645


In [12]:
# Now, work on characters instead of words
tfidf = TfidfVectorizer(analyzer = 'char')
# Transform all of our known documents into vector space
tfidf_matrix = tfidf.fit_transform(df['quote'].values)

for k, v in tests.items():
    print(k, getSimilarity(v))

timechange 0.9894413548
wordchange 0.992490253004
exact 0.998185613117
truncatechange 0.984808925703
truncate 0.988944173976


These scores are perhaps too high for our liking, and there is not enough spread between hits and misses.

In [13]:
# Let's add in 2 character n-grams as a middle-ground between characters and words
tfidf = TfidfVectorizer(analyzer = 'char',  ngram_range=(1, 2))
# Transform all of our known documents into vector space
tfidf_matrix = tfidf.fit_transform(df['quote'].values)

for k, v in tests.items():
    print(k, getSimilarity(v))

timechange 0.979670146393
wordchange 0.972088116539
exact 0.996801513474
truncatechange 0.950934466419
truncate 0.979433514806


Those numbers are still pretty high. What happens if we just put random stuff in there?

In [17]:
test_matrix = tfidf.transform(['Those numbers are still pretty high, what happens if we just put random stuff in there?'])
sim = (tfidf_matrix * test_matrix.T).A
print(sim[np.argmax(sim)][0])
print(df['quote'].values[np.argmax(sim)])

0.821485827105
So beautiful. Cannot resist not to share.
TRUE STORY…PLEASE DO NOT DELETE, RETURN IF YOU CAN’T FORWARD TO AT LEAST ONE PERSON!!!
At the prodding of my friends I am writing this story. My name is Mildred Honor. I am a former elementary school Music Teacher from Des Moines, Iowa.



     


I have always supplemented my income by Teaching Piano Lessons…Something I have done for over 30 years. During those years, I found that Children have many levels of musical ability, and even though I have never had the prodigy, I have taught some very talented students. However, I have also had my share of what I call ‘Musically Challenged Pupils.
One such Pupil being Robby. Robby was 11 years old when his Mother (a Single Mom) dropped him off for his first Piano Lesson.
I prefer that Students (especially Boys) begin at an earlier age, which I explained to Robby. But Robby said that it had always been his Mother’s Dream to hear him play the Piano, so I took him as a Student.
At the end

0.82 feels a bit high for a string that is not even close to anything in the list of documents, especially when we look at what the best match is.
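
One likely reason for the high score is that any two pieces of English text share most of their characters, so character-level vectors overlap even when no words do. The sketch below illustrates this on a made-up two-message corpus; the corpus, the query and the `best_score` helper are all hypothetical and only serve the comparison:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus and an unrelated query that shares no words with it
corpus = [
    "there will be no electricity tomorrow please pass on the message",
    "whatsapp is going to cost us money soon forward this message",
]
unrelated = "my cat sleeps under a big green chair"


def best_score(vectorizer, documents, text):
    """Fit on documents and return the highest similarity of text to any of them."""
    matrix = vectorizer.fit_transform(documents)
    scores = (matrix @ vectorizer.transform([text]).T).toarray()
    return float(scores.max())


word_score = best_score(TfidfVectorizer(), corpus, unrelated)
char_score = best_score(TfidfVectorizer(analyzer="char"), corpus, unrelated)

print("word analyzer:", word_score)
print("char analyzer:", char_score)
```

With no shared words the word-level score is exactly zero, while the character-level score is clearly positive, which matches the lack of spread seen above.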

What if we take the ngrams and use them on words instead of characters?

In [18]:
tfidf = TfidfVectorizer(ngram_range=(1, 2), strip_accents = 'unicode')
# Transform all of our known documents into vector space
tfidf_matrix = tfidf.fit_transform(df['quote'].values)

for k, v in tests.items():
    print(k, getSimilarity(v))

timechange 0.931334741122
wordchange 0.937623399575
exact 1.0
truncatechange 0.863901510335
truncate 0.931156472286


Results are not that much different from using single words. This is probably the best option.
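
To make the effect of `ngram_range=(1, 2)` on the word analyzer concrete, here is a small sketch that lists the features extracted from a single sentence. The sentence is just an example fragment; `vocabulary_` is the fitted term-to-index mapping on the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentence = ["Please pass on the message"]

# Single words only (the default)
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(sentence)
# Single words plus pairs of adjacent words
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(sentence)

unigram_features = sorted(unigrams.vocabulary_)
uni_bi_features = sorted(uni_bi.vocabulary_)

print(unigram_features)
print(uni_bi_features)
```

The bigram features keep some word order, so a message has to share phrases, not just vocabulary, to score highly.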

And when we put in the nonsensical message?

In [19]:
test_matrix = tfidf.transform(['Those numbers are still pretty high, what happens if we just put random stuff in there?'])
sim = (tfidf_matrix * test_matrix.T).A
print(sim[np.argmax(sim)][0])
print(df['quote'].values[np.argmax(sim)])

0.0893570257251
There is a ‘Floating Rock’ in Jerusalem, floating in air from thousand’s of years. After many researches, still there is no explaination of it.


I think this is a far better result.

## Does it scale?

In [20]:
%time
test = """
Please note that there won’t be electricity tomorrow from 06:00- 12:00 through out the whole of South Africa. Please pass on the message and make everyone aware.
"""
getSimilarity(test)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs


0.99999999999999845

And if instead of 82 messages, we had 20 000?

In [39]:
import random
# https://pypi.python.org/pypi/RandomWords/0.1.5
from random_words import RandomWords

In [22]:
documents = []
rw = RandomWords()
for i in range(0, 20000):
    documents.append(' '.join(rw.random_words(count=random.randint(10, 50))))

In [23]:
scale_tfidf = TfidfVectorizer(ngram_range=(1, 2), strip_accents = 'unicode')
scale_tfidf_matrix = scale_tfidf.fit_transform(documents)

In [26]:
%time
test = """
laboratory classrooms camp forearm regulation dates bunches surprise orifices
"""
test_matrix = scale_tfidf.transform([test])
sim = (scale_tfidf_matrix * test_matrix.T).A
print(sim[np.argmax(sim)][0])

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 9.06 µs
0.692516021808


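The 8 to 9 microseconds reported above is not the lookup time. `%time` on a line of its own times an empty statement rather than the rest of the cell; `%%time` as the first line of the cell times the whole cell. The same applies to the other `%time` cells in this notebook. The lookup is a single sparse matrix product against 20 000 rows, so it should still be fast, but it has to be measured properly before we can say how fast.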
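
For a measurement that does not depend on notebook magics, one option is `time.perf_counter` around the lookup itself. This is a rough sketch only: the corpus is synthetic, built from a made-up vocabulary of `word0` … `word1999` instead of the `random_words` package, and the timing will vary by machine.

```python
import time

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic corpus (an assumption for illustration): 20 000 documents of 10-50 random words
rng = np.random.default_rng(0)
vocab = np.array([f"word{i}" for i in range(2000)])
documents = [
    " ".join(rng.choice(vocab, size=rng.integers(10, 51)))
    for _ in range(20000)
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), strip_accents="unicode")
matrix = vectorizer.fit_transform(documents)

query = documents[123]  # an exact copy of one document in the corpus

# Time only the transform of the query and the similarity lookup
start = time.perf_counter()
scores = (matrix @ vectorizer.transform([query]).T).toarray().ravel()
elapsed = time.perf_counter() - start

print(f"{len(documents)} documents searched in {elapsed * 1000:.2f} ms")
```

Repeating the timed section several times and taking the median would give a steadier number than a single run.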

In [27]:
documents[np.argmax(sim)]

'verb thread setup rain laboratory classrooms camp forearm regulation dates bunches surprise orifices perforation consideration fatigues input committees'

We can decide at a lookup level how much leeway we are willing to give for changes within the message.
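
As a sketch of what such a lookup could look like, the function below returns the best-matching message only when its score clears a cutoff. The `find_hoax` name, the toy corpus and the `0.8` threshold are all assumptions for illustration; the right cutoff would have to be tuned against real messages.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-in for the hoax corpus
corpus = [
    "there will be no electricity tomorrow please pass on the message",
    "whatsapp is going to cost us money soon forward this message",
    "free airline tickets for everyone who shares this link",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), strip_accents="unicode")
matrix = vectorizer.fit_transform(corpus)


def find_hoax(text, threshold=0.8):
    """Return (message, score) for the best match, or (None, score) if below threshold."""
    scores = (matrix @ vectorizer.transform([text]).T).toarray().ravel()
    best = int(np.argmax(scores))
    score = float(scores[best])
    return (corpus[best] if score >= threshold else None, score)


known_match, known_score = find_hoax(corpus[0])
unknown_match, unknown_score = find_hoax("my cat sleeps under a big green chair")

print(known_match, known_score)
print(unknown_match, unknown_score)
```

A lower threshold tolerates more rewording at the cost of more false positives.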

## Save Data

Let's save the vectorizer, the vectorized matrix and the list of messages.

In [28]:
# Dump the vectorizer and the matrix:
joblib.dump(tfidf, '../cache/prod_tfidf.pkl')
joblib.dump(tfidf_matrix, '../cache/prod_tfidf_matrix.pkl')
joblib.dump(df['quote'].values, '../cache/prod_messages.pkl')

['../cache/prod_messages.pkl']

In [34]:
# In case we want to work with the messages again later, Feather is an easier format than CSV
df.to_feather('../data/messages.feather')

## Instantiate and test

When we come to work with this data, we'll want to load it up and test that it works as expected:

In [37]:
%time
# Load all the bits and test
tfidf = joblib.load('../cache/prod_tfidf.pkl')
tfidf_matrix = joblib.load('../cache/prod_tfidf_matrix.pkl')
messages = joblib.load('../cache/prod_messages.pkl')

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 9.06 µs


In [38]:
%time
test = """
Kzn chicken has bird flu. Its not safe to buy chicken - 90000 of chicken is contaminated. Please do not purchase any chicken. Please send to family and friends urgently.
"""
test_matrix = tfidf.transform([test])
sim = (tfidf_matrix * test_matrix.T).A
print(sim[np.argmax(sim)][0])

CPU times: user 6 µs, sys: 1 µs, total: 7 µs
Wall time: 13.8 µs
1.0
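
A round-trip check can confirm that the reloaded objects behave the same as the originals. The sketch below assumes a toy corpus and a temporary directory in place of `../cache`, and compares scores before and after `joblib.dump` / `joblib.load`:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-in for the hoax corpus
corpus = [
    "there will be no electricity tomorrow please pass on the message",
    "whatsapp is going to cost us money soon forward this message",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), strip_accents="unicode")
matrix = vectorizer.fit_transform(corpus)

# Dump to a temporary directory and load everything back
with tempfile.TemporaryDirectory() as cache_dir:
    vec_path = os.path.join(cache_dir, "tfidf.pkl")
    mat_path = os.path.join(cache_dir, "tfidf_matrix.pkl")
    joblib.dump(vectorizer, vec_path)
    joblib.dump(matrix, mat_path)
    loaded_vectorizer = joblib.load(vec_path)
    loaded_matrix = joblib.load(mat_path)

query = "no electricity tomorrow please pass on the message"
before = (matrix @ vectorizer.transform([query]).T).toarray()
after = (loaded_matrix @ loaded_vectorizer.transform([query]).T).toarray()

print("scores identical after reload:", np.allclose(before, after))
```

Pickled vectorizers are generally only safe to load with the same scikit-learn version that wrote them, so the version is worth recording alongside the cache files.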
