Recommending News Articles to Twitter Users based on their Tweets
src
website
README.md
readlikeyoutweet_schematic.png
underthehood.ipynb

Read Like You Tweet

A New York Times Article Recommendation System Based On Your Twitter Timeline

http://readlikeyoutweet.herokuapp.com/

Author: Karsten Kreis

Overview (a very detailed description of the recommender can be found in underthehood.ipynb)

This project started as my final project for General Assembly's Data Science class in New York City in summer 2015. The idea is to recommend New York Times articles to Twitter users based on their tweets. The model behind the recommendations was built as follows:

  • I downloaded over 100,000 article snippets from the New York Times Article Search API and categorized them according to their sections
  • I vectorized the text and created text features with a term frequency-inverse document frequency vectorizer
  • I trained a multiclass Logistic Regression classifier to identify the classes
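
The offline steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: it assumes scikit-learn is used, the toy snippets and section labels are invented for the example, and the parameters (English stop words, `max_iter=1000`) are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the article snippets and their section labels
# (invented for illustration; the real data comes from the Article Search API).
snippets = [
    "The team won the championship game after a late goal",
    "The striker scored twice as the club won the match",
    "Stocks fell as investors worried about interest rates",
    "The central bank raised interest rates to curb inflation",
    "Scientists discovered a new planet orbiting a distant star",
    "Researchers published a study on the genome of bacteria",
]
sections = ["Sports", "Sports", "Business", "Business", "Science", "Science"]

# TF-IDF features followed by a multiclass logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # assumed stop-word handling
    LogisticRegression(max_iter=1000),
)
model.fit(snippets, sections)

# Predict the section of an unseen piece of text.
predicted = model.predict(["The bank cut interest rates and stocks rose"])[0]
print(predicted)
```

Bundling the vectorizer and the classifier in one pipeline keeps the vocabulary learned from the articles tied to the classifier, so new text is always transformed in the same way as the training data.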

The above happened "offline". Just as the words in an article indicate the section it belongs to, the same words in tweets are likely to indicate that the Twitter user is interested in news from this section. Therefore, the trained model can be used to predict a Twitter user's interests.

The program/website does the following:

  • A Twitter user provides her or his Twitter handle
  • The user's 100 latest tweets are downloaded through the Twitter API
  • These tweets are processed and vectorized in the same way as the article data and fed into the Logistic Regression model
  • This should, hopefully, yield the section the user may want to read news from
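
One possible way to turn a batch of tweets into a single text document before vectorizing is sketched below. The exact cleaning rules used by the website are described in underthehood.ipynb; the rules here (dropping links, @mentions, the retweet marker and punctuation) and the function names `clean_tweet` and `tweets_to_document` are assumptions made for illustration.

```python
import re

def clean_tweet(text):
    """Strip links, @mentions, the retweet marker and punctuation; lowercase the rest."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)          # mentions
    text = re.sub(r"\bRT\b", " ", text)        # retweet marker
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # punctuation, including '#'
    return " ".join(text.split())              # collapse whitespace

def tweets_to_document(tweets):
    """Join all cleaned tweets into one string that can be vectorized like an article."""
    return " ".join(clean_tweet(t) for t in tweets)

# Invented example tweets.
tweets = [
    "RT @nytimes: Fed raises rates again https://t.co/abc123",
    "Great game last night! #WorldSeries",
]
document = tweets_to_document(tweets)
print(document)
```

The resulting string can then be passed to the same fitted vectorizer and classifier that were trained on the article snippets.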

The final step:

  • Connect to the New York Times Top Stories API
  • Fetch the top story articles from the section predicted by the classifier. This usually yields about 30 articles
  • Calculate the Jaccard distance between each of these articles and the user's tweets
  • Recommend the closest article to the Twitter user
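
The ranking step can be illustrated with a small sketch. The Jaccard distance between two token sets A and B is 1 - |A ∩ B| / |A ∪ B|, so texts that share more words are closer. The whitespace tokenization, the dictionary keys `title` and `snippet`, and the example texts are assumptions for illustration, not the format returned by the Top Stories API.

```python
def tokens(text):
    """Split text into a set of lowercase word tokens (simple whitespace split)."""
    return set(text.lower().split())

def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def closest_article(tweet_text, articles):
    """Pick the article whose snippet has the smallest Jaccard distance to the tweets."""
    tweet_tokens = tokens(tweet_text)
    return min(
        articles,
        key=lambda art: jaccard_distance(tweet_tokens, tokens(art["snippet"])),
    )

# Invented example data.
tweet_text = "fed raises rates again markets react"
articles = [
    {"title": "Markets react as Fed raises rates",
     "snippet": "markets react as the fed raises rates"},
    {"title": "New museum opens downtown",
     "snippet": "a new museum opens downtown this weekend"},
]
best = closest_article(tweet_text, articles)
print(best["title"])
```

Here the first article shares five of the eight distinct words with the tweets (distance 0.375), while the second shares none (distance 1.0), so the first one is recommended.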

Possible further modifications

There are many possible improvements and extensions:

  • Try to fit a stronger model, possibly using other classifiers
  • Use dimensionality reduction or clustering techniques to gain further insights and/or reduce features
  • Predict several probable labels and do not only recommend from one section but from several probable ones
  • Scrape whole articles with web scraping tools like beautifulsoup instead of using only headlines, snippets and keywords. This might help both when training the model and when calculating the Jaccard distances
  • Include newspapers other than the New York Times, both for model training and for recommendation (for example the Guardian, which also has a great API framework)
  • Check where the user comes from (UK, US, Australia) and recommend either from NYT/Guardian US, Guardian UK, or Guardian Australia
  • Extend the system beyond targeting only English twitterers and recommending only English newspaper articles
  • Try to get even more user information, for example from Facebook, LinkedIn, etc., to make even better recommendations

Files

Note that I did not upload the actual datasets, the pickled logistic regression model, the pickled tfidf vectorizer or the pickled stopwords (the website needs the stopwords to be pickled as well). However, with the code the data can be downloaded again and the models retrained.

Furthermore, note that the whole code naturally requires API keys for all involved APIs to work.