
Upon a new tweet's arrival, its short text is split into words so that partial counting can be applied. To reduce the lexical variation that occurs across documents, a small normalization set of out-of-vocabulary (OOV) words is used: words like “coz” are replaced with “because” and “sureeeeeeeeeeeeeee” with “sure”. The Porter stemming algorithm was considered as an extra step here but was not applied, as it did not improve the accuracy of the final results. After the OOV words have been replaced, URLs and mentions (@) are removed from the tweet text.
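To make the preprocessing concrete, here is a minimal Python sketch of this step. The `OOV_REPLACEMENTS` dictionary, the regex for collapsing repeated letters, and the `normalize` helper are illustrative assumptions, not the project's actual code:

```python
import re

# Tiny illustrative sample of an OOV normalization set (assumed).
OOV_REPLACEMENTS = {"coz": "because", "u": "you", "im": "i'm"}

def normalize(tweet):
    words = tweet.lower().split()
    cleaned = []
    for w in words:
        if w.startswith("@") or w.startswith("http"):
            continue                          # drop mentions and URLs
        w = re.sub(r"(.)\1{2,}", r"\1", w)    # "sureeeee" -> "sure"
        cleaned.append(OOV_REPLACEMENTS.get(w, w))
    return cleaned

print(normalize("sureeeeee coz @bob said so http://t.co/x"))
# ['sure', 'because', 'said', 'so']
```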

The algorithm continues by representing each tweet in the vector space using TF-IDF (Term Frequency – Inverse Document Frequency) weighting. Under this weighting, words that appear in many documents receive lower weights than rarer ones. Each vector is also normalized using the Euclidean norm. Given a term t and a collection of documents D:

$$\mathrm{idf}(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$$

where $|D|$ is the total number of documents and the denominator is the number of documents that the term $t$ appears in.
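As a rough illustration of the weighting and unit-length normalization, here is a small Python sketch. The `doc_freq` mapping (term to document count) is bookkeeping the streaming setting would need to maintain elsewhere; none of the names come from the project itself:

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """Build a unit-length TF-IDF vector (term -> weight) for one tweet.
    doc_freq maps each term to the number of documents containing it."""
    tf = Counter(tokens)
    vec = {t: count * math.log(num_docs / doc_freq.get(t, 1))
           for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))  # Euclidean norm
    return {t: w / norm for t, w in vec.items()} if norm > 0 else vec
```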

Having converted the tweet to a vector, we apply the Locality Sensitive Hashing (LSH) algorithm to find nearest-neighbor candidates. LSH reduces the number of comparisons required to find the approximate (1 + ε)-near neighbors among a set of documents.

The idea behind it is to use hash tables, called buckets, to group similar points (tweets). Whenever a new point (tweet) arrives, its hash values are computed and it is stored in several buckets. Inside the same bucket, the probability of collision with similar documents is much higher. In other words, tweets whose hash is identical to the arriving tweet's are nearest-neighbor candidates. Using this method, we will not compare a "weather" tweet with a "music" one, as they are most probably unrelated. I should remind you that we need to find the tweet with the shortest distance to the arriving tweet. The figure below illustrates the core LSH mechanism by showing what happens in each bucket.

*(Figure: the Locality Sensitive Hashing bucket mechanism.)*
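A minimal sketch of this bucket lookup might look as follows; `LSHIndex`, the per-bucket hash functions, and storing tweet ids as values are illustrative assumptions, not the project's design:

```python
from collections import defaultdict

class LSHIndex:
    """Several hash tables ('buckets'), each with its own hash function.
    Tweets whose hash collides with the arriving tweet's are candidates."""

    def __init__(self, hash_functions):
        # one (hash function, table) pair per bucket
        self.tables = [(h, defaultdict(list)) for h in hash_functions]

    def candidates(self, vec):
        """Union of all stored tweets colliding with vec in any bucket."""
        found = set()
        for h, table in self.tables:
            found.update(table[h(vec)])
        return found

    def add(self, tweet_id, vec):
        # store the tweet in every bucket, under that bucket's hash of it
        for h, table in self.tables:
            table[h(vec)].append(tweet_id)
```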

By increasing the number of buckets used, you repeat the hashing process and find more nearest-neighbor candidates. More comparisons are therefore made, and the probability of finding a closer neighbor increases. Different buckets give different hashes for the same tweet, because each hash value is created using the random projection method. (Other methods, such as k-means or minHash, could be used to create the hash instead.) Under the random projection method, k random vectors are assigned to each bucket, with components drawn from a Gaussian distribution N(0, 1). The binary hash value consists of k bits, each calculated using the function below.

$$h_r(u) = \begin{cases} 1 & \text{if } r \cdot u \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

where r is the random vector and u the tweet vector; each hash bit is a dot product. The concatenation of all hash bits forms the hash, which is stored inside each bucket. In this example, 13 bits were used as the length of the hash. The fewer bits you use, the more collisions you will get, and thus the more comparisons you have to make; on the other hand, the effort to compute the dot products for each bucket is reduced. Limitation: storing tweets in all buckets would require unbounded memory to hash all previously seen tweets. To prevent this, we keep up to 20 tweets with the same hash per bucket.
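Here is one possible sketch of the random-projection hash together with the 20-tweet cap, using dense NumPy vectors; the seed, the helper names, and the evict-the-oldest policy are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict, deque

K = 13              # hash length in bits, as in the example above
MAX_PER_HASH = 20   # cap on tweets stored under one hash value

rng = np.random.default_rng(0)

def make_random_vectors(dim, k=K):
    """k random vectors with components drawn from N(0, 1)."""
    return rng.normal(0.0, 1.0, size=(k, dim))

def lsh_hash(random_vectors, u):
    """Concatenate k bits; each bit is the sign of the dot product r . u."""
    return "".join("1" if r @ u >= 0 else "0" for r in random_vectors)

# A bounded bucket: deque(maxlen=...) silently evicts the oldest tweet,
# keeping memory finite.
bucket = defaultdict(lambda: deque(maxlen=MAX_PER_HASH))

def store(tweet_id, u, random_vectors):
    bucket[lsh_hash(random_vectors, u)].append(tweet_id)
```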

The beauty of applying LSH lies in this hashing property: an incoming tweet is compared only with the tweets that have the same hash, as opposed to all previously seen ones. This greatly reduces the comparisons required to find the closest neighbor.

Having gathered the nearest-neighbor candidates from all buckets, we compare the distance between the tweet and each candidate using the cosine similarity measure. The cosine similarity between two vectors u and v is defined as:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

As an extra step, the tweet is also compared with a fixed number of the most recently seen tweets. The neighbor with the shortest distance (the highest cosine score) is assigned as the nearest neighbor, and that distance becomes the tweet's score. If the score is above a certain threshold, the tweet is identified as a first story.
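Putting this last step together, here is a hedged sketch of the scoring decision; the threshold value and the `1 - cosine` distance convention are assumptions for illustration, not the project's actual settings:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (term -> weight dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def first_story_score(tweet_vec, neighbor_vecs, threshold=0.5):
    """Score the tweet by its distance to the closest neighbor; a score
    above the threshold flags the tweet as a first story."""
    best_sim = max((cosine(tweet_vec, n) for n in neighbor_vecs), default=0.0)
    score = 1.0 - best_sim          # cosine distance (assumed convention)
    return score, score > threshold
```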

You can read about the code logic here.
