Twinkle: Most Similar Tweets in Real-Time

This project presents a natural language processing model capable of processing a growing database of billions of tweets and finding the most similar tweet in real-time as a new tweet is being typed. This model computes similarities by using a hashing vectorizer and a locality sensitive hashing forest to achieve sub-linear scaling of computation time.

This Repo Contains:

Project Description
Code
Capstone presentation
Future Direction
References

Motivations

When we adopt a tool, we also adopt the management philosophy embedded in that tool. Twitter's influence is transforming the way society communicates about everything; it has become a place where thoughts get purified and voices get amplified. Twitter users practice conciseness and focus. Twitter is equipped with unique powers to quickly spread information, identify what really matters, and host extremely effective conversations, improving our characters and relationships and creating more integrity. The most valuable social teachings are now happening on Twitter at light speed, and hashtags are advancing the evolution of society beyond all expectations.

Growth of Positive Interactions: Having the most similar tweets available would enhance the user experience by internalizing a strong sense of connection, encouraging creativity, and inspiring communication, all of which lead to more exciting interactions on Twitter. While hashtags focus on extracts, abbreviations, and creative expressions, or offer alternative variations on the actual context of a tweet, the resemblance of the tweet itself reveals exciting connections among users, and offering this feature could quickly increase positive interactions.
User Growth: Twitter is live: live commentary, live conversations, live videos, live connections. Growth of positive real-time interactions could be the best opportunity to attract and increase the number of monthly active users. Twitter is well-positioned to benefit from attracting users by maintaining and growing positivity on the platform.

Challenge

The application has to find the most similar text amongst billions of tweets as a new tweet is being typed. A brute-force comparison against every stored tweet scales linearly with the size of the database, so to implement this feature in real time we must significantly lower the computation time.

Real-Time Solution

Sublinear scaling of run time with sequential use of Hashing Vectorizer and Locality Sensitive Hashing Forest:

LSH Forest generates hash trees and uses hashing to define neighborhoods;
Scikit-learn implementation of LSH provides access to Nearest Neighbors algorithm methods;
Similar items map to the same “buckets” with high probability (the number of buckets is much smaller than the universe of possible inputs);
Search space is limited to a bucket and we find the most similar tweets by computing similarities to a small portion of tweets.

As the database of tweets grows larger, we continue to find the most similar tweets in real time!
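The bucketing idea above can be illustrated with a minimal, standard-library-only sketch. It is not the project's code: the tiny feature width, the number of hyperplanes, the SHA-256 token hash, the example tweets, and all function names are assumptions chosen for illustration.

```python
import hashlib
import math
import random
from collections import defaultdict

N_FEATURES = 64  # illustrative only; the vectorizer in this README uses 10000
N_PLANES = 8     # hypothetical number of random hyperplanes (bits per bucket key)


def vectorize(text, n_features=N_FEATURES):
    """Hashing trick: each token is hashed straight to a column index."""
    vec = [0.0] * n_features
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % n_features] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bucket_key(vec, planes):
    """One bit per hyperplane: the side of the plane the vector falls on."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)


rng = random.Random(42)
planes = [[rng.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(N_PLANES)]

# Made-up example tweets standing in for the real database.
tweets = [
    "coffee first then the rest of the day",
    "nothing beats coffee first thing in the morning",
    "the game last night was unbelievable",
    "new phone arrived and the camera is amazing",
]

# Index step: every stored tweet is placed in exactly one bucket.
buckets = defaultdict(list)
for i, tweet in enumerate(tweets):
    buckets[bucket_key(vectorize(tweet), planes)].append(i)


def most_similar(new_tweet):
    """Compare the query only against tweets that share its bucket."""
    query = vectorize(new_tweet)
    candidates = buckets.get(bucket_key(query, planes), [])
    if not candidates:
        return None  # no collision; a real index would probe more tables
    best = max(candidates, key=lambda i: cosine(query, vectorize(tweets[i])))
    return tweets[best]
```

Calling `most_similar(tweets[0])` always finds a match, because a tweet that is already indexed lands in its own bucket. A near-duplicate is found only with high probability, which is why practical LSH indexes build several hash tables or trees rather than one.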

LSH Forest

The Locality Sensitive Hashing forest is an alternative to vanilla approximate nearest neighbor search methods. In Scikit-learn, the LSH forest data structure is implemented using sorted arrays, binary search, and 32-bit fixed-length hashes. Random projection is used as the hash family, which approximates cosine distance.
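To make the "sorted arrays, binary search and fixed-length hashes" description concrete, here is a small sketch of how candidates sharing the longest hash prefix with a query can be located by binary search. It is not scikit-learn's internal code; the function names, the `min_candidates` parameter, and the example hash values are hypothetical.

```python
import bisect

HASH_BITS = 32  # fixed-length hashes, as in the description above


def prefix_range(sorted_hashes, query_hash, prefix_len):
    """Return [lo, hi) of entries whose top `prefix_len` bits match the query."""
    shift = HASH_BITS - prefix_len
    low = (query_hash >> shift) << shift   # query prefix followed by zeros
    high = low | ((1 << shift) - 1)        # query prefix followed by ones
    lo = bisect.bisect_left(sorted_hashes, low)
    hi = bisect.bisect_right(sorted_hashes, high)
    return lo, hi


def candidates(sorted_hashes, query_hash, min_candidates=1):
    """Shorten the shared prefix until enough stored hashes match it."""
    for prefix_len in range(HASH_BITS, -1, -1):
        lo, hi = prefix_range(sorted_hashes, query_hash, prefix_len)
        if hi - lo >= min_candidates:
            return prefix_len, sorted_hashes[lo:hi]
    return 0, []


# Hypothetical 32-bit hashes of five stored items, kept in sorted order.
hashes = sorted([0xA0000001, 0xA0F00000, 0xA1000000, 0x10000000, 0xFFFF0000])
```

Each lookup costs two binary searches per prefix length, so the work grows with the logarithm of the index size rather than with the number of stored items. Asking for more candidates simply relaxes the prefix: `candidates(hashes, 0xA0000000, 2)` has to fall back from a 31-bit match to an 8-bit one.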

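# Requires scikit-learn < 0.21: LSHForest and the non_negative argument were removed in 0.21.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import LSHForest

# Hashing Vectorizer: fixed-width vectors without storing a vocabulary
# (docs_joined is the collection of preprocessed tweet strings)
hv = HashingVectorizer(n_features=10000, non_negative=True)
docs_vectorized = hv.fit_transform(docs_joined).toarray()

# Locality Sensitive Hashing Forest
lshf = LSHForest(random_state=42, radius_cutoff_ratio=0.4)
lshf.fit(docs_vectorized)

# Hashing new input (new_doc must be an iterable of strings, not a bare string)
new_doc_vectorized = hv.transform(new_doc).toarray()

# Distances & indices
dist, ind = lshf.radius_neighbors(new_doc_vectorized, return_distance=True)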

Internally, LSH Forest uses random hyperplanes to index the samples into buckets, and cosine similarities are computed only for samples that collide with the query, hence achieving sublinear scaling of computation time.
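Why random hyperplanes go together with cosine similarity can be checked numerically. For a single random hyperplane, two vectors separated by an angle theta fall on the same side with probability 1 - theta/pi, so the fraction of matching sign bits gives an estimate of the angle. The sketch below uses only the standard library; the function names, the seed, and the number of hyperplanes are assumptions for illustration.

```python
import math
import random


def signature(vec, planes):
    """Sign bit of the projection onto each random hyperplane normal."""
    return [sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes]


def estimated_cosine(u, v, planes):
    """Estimate cos(u, v) from the fraction of matching signature bits."""
    su, sv = signature(u, planes), signature(v, planes)
    agree = sum(a == b for a, b in zip(su, sv)) / len(planes)
    # For one random hyperplane, P[bits match] = 1 - theta / pi,
    # so theta is approximately pi * (1 - agree).
    return math.cos(math.pi * (1.0 - agree))


rng = random.Random(0)
planes = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(4000)]

u, v = [1.0, 0.0], [1.0, 1.0]  # true cosine is 1/sqrt(2), about 0.707
approx = estimated_cosine(u, v, planes)
```

With a few thousand hyperplanes the estimate should sit close to the true value. An index uses far fewer bits per hash, trading accuracy of the estimate for speed, and then computes exact cosine similarities only on the colliding candidates.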

Data

Twitter Streaming API
NLTK WordNet lemmatizer and word_tokenize functions are applied to tokenize and lemmatize tweets prior to training the model.

Future Direction

History is driven by the spoken word. Once a speech is given or a book is published, one powerful sentence sets the course of evolution: "I have a dream." The next step for this project is offering Positive Transformative Suggestions as a new tweet is being typed.

"JFK: I would never lie to you"  ->  "I always speak to you with integrity"

Suggestions will be generated based on:

Word-Sense Induction
Word-Sense Disambiguation
Grammar
Intertextuality
Sentiment analysis
Conceptual vs. literal alacrity

  "Good Morning" => Literal
  "JFK: I would never lie to you." => Conceptual

Suggestions will also assign weight-power to words according to their focus and expressivity.

  "Audacity of Hope" vs. "Make America Great Again"

Beyond sentiment analysis: Simple words and phrases that signal wisdom and genius.

"You know that it takes time but maybe it also takes iterations."
