Capstone Project for Galvanize Data Science Immersive: This project provides a real-time solution for finding similarity amongst a growing database of billions of tweets.

minoobeyzavi/Twinkle

Twinkle: Most Similar Tweets in Real-Time

This project presents a natural language processing model capable of processing a growing database of billions of tweets and finding the most similar tweet in real-time as a new tweet is being typed. This model computes similarities by using a hashing vectorizer and a locality sensitive hashing forest to achieve sub-linear scaling of computation time.

This Repo Contains:

  • Project Description
  • Code
  • Capstone presentation
  • Future Direction
  • References

Motivations

When we adopt a tool, we also adopt the management philosophy embedded in that tool. Twitter's influence is transforming the way society communicates about everything; it has become a place where thoughts get purified and voices get amplified. Twitter users practice conciseness and focus. Twitter is equipped with unique powers to quickly spread information, identify what really matters, and host extremely effective conversations, improving our characters and relationships and creating more integrity. The most valuable social teachings are now happening on Twitter at light speed, and hashtags are advancing the evolution of society beyond all expectations.

  • Growth of Positive Interactions: Having the most similar tweets available would enhance the user experience by creating a strong sense of connection, encouraging creativity, and inspiring communication, which leads to more exciting interactions on Twitter. While hashtags focus on extracts, abbreviations, and creative expressions, or offer alternative variations on the actual context of the tweet, the resemblance of the tweet itself reveals connections amongst users, and offering this feature could quickly increase positive interactions.
  • User Growth: Twitter is live: live commentary, live conversations, live videos, live connections. Growth of positive real-time interactions could be the best opportunity to attract and increase the number of monthly active users. Twitter is well-positioned to benefit from attracting users by maintaining and growing positivity on the platform.

Challenge

The application has to find the most similar text amongst billions of tweets as a new tweet is being typed. In order to implement this feature in real-time we must create a solution to significantly lower the computation time.

Real-Time Solution

Sublinear scaling of run time with sequential use of Hashing Vectorizer and Locality Sensitive Hashing Forest:

  • LSH Forest generates hash trees and uses hashing to define neighborhoods;
  • Scikit-learn implementation of LSH provides access to Nearest Neighbors algorithm methods;
  • Similar items map to the same "buckets" with high probability (the number of buckets is much smaller than the universe of possible inputs);
  • Search space is limited to a bucket and we find the most similar tweets by computing similarities to a small portion of tweets.

As the database of tweets grows larger, we continue to find the most similar tweets in real time!
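The bucketing idea above can be sketched in plain Python. This is only an illustration of random-hyperplane hashing, not the project's code or scikit-learn's implementation; the helper names (`make_hyperplanes`, `signature`, `build_buckets`, `candidates`) and the toy vectors are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_bits, n_dims, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in n_dims dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """Pack one bit per hyperplane: 1 if the vector lies on its positive side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def build_buckets(vectors, hyperplanes):
    """Group vector indices by signature; each distinct signature is a bucket."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, hyperplanes)].append(idx)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Return only the indices that share the query's bucket."""
    return buckets.get(signature(query, hyperplanes), [])


# Toy usage: the query points in the same direction as vector 0,
# so it falls on the same side of every hyperplane and shares its bucket.
planes = make_hyperplanes(n_bits=8, n_dims=4, seed=1)
vectors = [[1.0, 0.0, 2.0, 0.5], [-1.0, 0.0, -2.0, -0.5], [0.3, 0.9, 0.1, 0.0]]
buckets = build_buckets(vectors, planes)
print(candidates([2.0, 0.0, 4.0, 1.0], buckets, planes))
```

Only the candidates returned for the query need an exact similarity computation, which is what keeps the work per query small compared with scanning every stored vector.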

LSH Forest

Locality Sensitive Hashing forest is an alternative to vanilla approximate nearest neighbor search methods. In scikit-learn, the LSH forest data structure is implemented using sorted arrays, binary search, and 32-bit fixed-length hashes. Random projection is used as the hash family, which approximates cosine distance.

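# Requires scikit-learn < 0.21: LSHForest and the non_negative argument
# were deprecated in 0.19 and removed in 0.21.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import LSHForest

# Hash each tweet into a fixed-size feature vector (no vocabulary is stored)
hv = HashingVectorizer(n_features=10000, non_negative=True)
docs_vectorized = hv.fit_transform(docs_joined).toarray()

# Locality Sensitive Hashing Forest
lshf = LSHForest(random_state=42, radius_cutoff_ratio=0.4)
lshf.fit(docs_vectorized)

# Hashing new input (new_doc must be an iterable of strings, e.g. [tweet_text])
new_doc_vectorized = hv.transform(new_doc).toarray()

# Distances & indices of the neighbors within the default radius
dist, ind = lshf.radius_neighbors(new_doc_vectorized, return_distance=True)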

Internally, LSH Forest uses random hyperplanes to index the samples into buckets, and cosine similarities are only computed for samples that collide with the query, hence achieving sublinear scaling of computation time.
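To make the "sorted arrays, binary search and 32-bit fixed-length hashes" description more concrete, here is a minimal sketch of a prefix lookup over a sorted array of hashes. It assumes the hashes have already been computed; the names `build_index`, `prefix_range`, and `prefix_candidates` are hypothetical, and the real scikit-learn implementation differs in its details.

```python
from bisect import bisect_left, bisect_right

HASH_BITS = 32  # fixed hash length, as in the description above


def build_index(hashes):
    """Sort the hashes once; keep the original ids in the same order."""
    order = sorted(range(len(hashes)), key=lambda i: hashes[i])
    return [hashes[i] for i in order], order


def prefix_range(sorted_hashes, query_hash, prefix_len):
    """Binary-search the slice whose top prefix_len bits match the query."""
    shift = HASH_BITS - prefix_len
    low = (query_hash >> shift) << shift   # prefix followed by all zeros
    high = low | ((1 << shift) - 1)        # prefix followed by all ones
    return bisect_left(sorted_hashes, low), bisect_right(sorted_hashes, high)


def prefix_candidates(sorted_hashes, ids, query_hash, min_candidates=1):
    """Shorten the prefix until enough candidates share it with the query."""
    for prefix_len in range(HASH_BITS, -1, -1):
        lo, hi = prefix_range(sorted_hashes, query_hash, prefix_len)
        if hi - lo >= min_candidates:
            return [ids[i] for i in range(lo, hi)], prefix_len
    return list(ids), 0  # fewer items than requested: return everything


# Toy usage with three hand-made 32-bit hashes
hashes = [0b1010 << 28, 0b1011 << 28, 0b0001 << 28]
sorted_hashes, ids = build_index(hashes)
print(prefix_candidates(sorted_hashes, ids, (0b1010 << 28) | 1, min_candidates=2))
```

Each lookup costs two binary searches per prefix length, so the cost grows with the logarithm of the index size rather than with the number of stored items.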

Data

  • Twitter Streaming API
  • The NLTK word_tokenize function and WordNet lemmatizer are applied to tokenize and lemmatize tweets prior to training the model.

Future Direction

History is driven by the spoken word. Once a speech is given or a book is published, one powerful sentence sets the course of evolution: "I have a dream." The next step is to offer positive, transformative suggestions as a new tweet is being typed.

"JFK: I would never lie to you"  ->  "I always speak to you with integrity"

Suggestions will be generated based on:

  "Good Morning" => Literal
  "JFK: I would never lie to you." => Conceptual
  • And giving weight-power to words, pertaining to focus and expressivity.
  "Audacity of Hope" vs. "Make America Great Again"
  • Beyond sentiment analysis: Simple words and phrases that signal wisdom and genius.
  "You know that it takes time but maybe it also takes iterations."

References
