Gensim document similarity demonstration using RSS feeds as document sources

rsscluster

About

Gensim is a Python library used to perform "topic modeling". The most practical purpose of topic modeling is finding similar documents. This simple script serves as an example of how the library is used and what it can do. Given a list of RSS feeds (in OPML format), the script creates a database of all the stories contained in those feeds and shows some sample clusters of similar stories.

For example, if an OPML file contains both The Washington Post and New York Times feeds, a good cluster would include stories from both papers on the same event, such as a recent presidential tour of Africa.
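
To make the idea of "similar stories" concrete, here is a minimal sketch that compares stories by the cosine similarity of their word counts. It is a stand-in using only the standard library, not gensim's API and not necessarily what rsscluster.py does internally. The function names and the headlines are hypothetical; the 0.6 cutoff matches the script's default threshold.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters (0.0 to 1.0)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_stories(stories, query_index, threshold=0.6):
    """Return indexes of stories whose similarity to the query exceeds threshold."""
    vectors = [Counter(tokenize(story)) for story in stories]
    query = vectors[query_index]
    return [
        i for i, vec in enumerate(vectors)
        if i != query_index and cosine_similarity(query, vec) > threshold
    ]


# Hypothetical headlines standing in for stories from different feeds.
stories = [
    "President begins tour of Africa with visit to Senegal",
    "President visits Senegal to begin tour of Africa",
    "Local team wins the championship game in overtime",
]
print(similar_stories(stories, 0))  # prints [1]
```

The first two headlines share six words, which puts their similarity at about 0.71, above the threshold; the third shares none. A topic model such as the ones gensim provides can go further and match stories that describe the same event in different words.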

Installation

First off, you'll need virtualenv. Depending on your operating system and your tolerance for system packages, you have a choice of installation methods. Install it whichever way you prefer (or just follow the virtualenv installation directions).

Once you're done with that, you can set up your environment. From this directory, run:

$ virtualenv --no-site-packages .
$ bin/pip install -r requirements.txt

At this point, you'll probably see some errors. NumPy and SciPy (two dependencies of gensim) are notoriously difficult to build on the first try. The first thing to check is that you have all of the build dependencies installed (e.g., gcc). With luck, the error message will indicate what's missing.

I also sometimes have trouble installing numpy from requirements.txt. In that case, it can help to install numpy separately first:

$ bin/pip install numpy==1.6.2

Then install the rest with:

$ bin/pip install -r requirements.txt

That can move the build process along. You may also get things working by installing the system packages for numpy and scipy, but that can bring its own problems. If none of the above helps, I apologize, and can only say that Google is your friend.

Usage

First, you'll need an OPML file. Most feed readers allow you to export your feeds as OPML. If yours can't, or you don't have enough feeds to generate interesting output, feel free to use the sample file in this directory, named "feeds.opml".
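
For readers who have not seen OPML before: it is an XML outline format in which each feed is an outline element carrying an xmlUrl attribute. The sketch below pulls the feed URLs out of such a document with the standard library. The function name and the sample document (with placeholder example.com URLs) are hypothetical, and rsscluster.py may read the file differently.

```python
import xml.etree.ElementTree as ET


def feed_urls(opml_text):
    """Return the xmlUrl of every outline element in an OPML document."""
    root = ET.fromstring(opml_text)
    # Feeds may be nested inside folder outlines, so search the whole tree.
    return [
        outline.attrib["xmlUrl"]
        for outline in root.iter("outline")
        if "xmlUrl" in outline.attrib
    ]


# A tiny hypothetical OPML document; the URLs are placeholders.
SAMPLE_OPML = """<opml version="1.0">
  <head><title>Sample feeds</title></head>
  <body>
    <outline text="News">
      <outline type="rss" text="Paper A" xmlUrl="http://example.com/a.rss"/>
      <outline type="rss" text="Paper B" xmlUrl="http://example.com/b.rss"/>
    </outline>
  </body>
</opml>
"""

print(feed_urls(SAMPLE_OPML))
```

Folder outlines (like "News" above) have no xmlUrl attribute, so they are skipped while the feeds nested inside them are still found.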

The basic usage of rsscluster.py is:

Usage: rsscluster.py [options] OPML_FILE

Options:
  -h, --help            show this help message and exit
  -t THRESHOLD, --threshold=THRESHOLD
                        Documents whose similarity is larger than this
                        threshold will be considered similar (0-1,
                        default=0.6)
  -d DATE, --date=DATE  Publication date of stories to base clusters around
                        (format=YYYY-MM-DD, default=today)
  -s, --skip-training   Skip training (if you already have an existing
                        database, you may want to skip the training step)
  -m, --html            HTML output
  -f OUTPUT_FILE, --output-file=OUTPUT_FILE
                        Output file (default=stdout)
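
The help text above has the layout that Python's optparse module produces, so an interface like this could be declared roughly as follows. This is a reconstruction from the help text only, under the assumption that optparse is used; the function name build_parser and the exact defaults for --date and --output-file are guesses, not taken from rsscluster.py.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser mirroring the options listed in the help text."""
    parser = OptionParser(usage="%prog [options] OPML_FILE", prog="rsscluster.py")
    parser.add_option("-t", "--threshold", type="float", default=0.6,
                      help="Documents whose similarity is larger than this "
                           "threshold will be considered similar (0-1, default=0.6)")
    # None stands for "today"; the caller resolves it when the script runs.
    parser.add_option("-d", "--date", default=None,
                      help="Publication date of stories to base clusters around "
                           "(format=YYYY-MM-DD, default=today)")
    parser.add_option("-s", "--skip-training", action="store_true", default=False,
                      help="Skip training")
    parser.add_option("-m", "--html", action="store_true", default=False,
                      help="HTML output")
    # None stands for stdout.
    parser.add_option("-f", "--output-file", default=None,
                      help="Output file (default=stdout)")
    return parser


parser = build_parser()
options, args = parser.parse_args(
    ["--html", "--output-file=feeds.html", "feeds.opml"]
)
print(options.html, options.output_file, options.threshold, args)
```

optparse derives the attribute names from the long option names, so --skip-training becomes options.skip_training and --output-file becomes options.output_file. The positional OPML_FILE argument ends up in args.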

For example, to cluster the feeds listed in feeds.opml and write the result as HTML to feeds.html, you would run:

$ bin/python rsscluster.py --html --output-file=feeds.html feeds.opml

By default, rsscluster generates clusters around stories published on the current day. To generate clusters around stories published on another day, pass that date with --date:

$ bin/python rsscluster.py --date=2013-06-29 feeds.opml
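
As a rough illustration of what selecting stories by publication date involves, the sketch below parses a YYYY-MM-DD string (falling back to today) and keeps only the stories published on that day. The function names and the story dictionaries are hypothetical; the script's own data model is not shown in this README.

```python
from datetime import date, datetime


def parse_cluster_date(value=None):
    """Parse a YYYY-MM-DD string; fall back to today when no value is given."""
    if value is None:
        return date.today()
    return datetime.strptime(value, "%Y-%m-%d").date()


def stories_published_on(stories, day):
    """Keep the stories whose 'published' datetime falls on the given day."""
    return [story for story in stories if story["published"].date() == day]


# Hypothetical stories; a real run would take these from the feeds.
stories = [
    {"title": "Story A", "published": datetime(2013, 6, 29, 8, 30)},
    {"title": "Story B", "published": datetime(2013, 6, 30, 9, 0)},
]
day = parse_cluster_date("2013-06-29")
print([s["title"] for s in stories_published_on(stories, day)])  # prints ['Story A']
```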

Also, rsscluster keeps the database around between runs. This way, stories that have since fallen off their RSS feeds can still be indexed for similarity in the future. Because the database sticks around, you don't really need to retrain on each run; you just need to index the new documents. To skip the training phase, run:

$ bin/python rsscluster.py --skip-training feeds.opml
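
The benefit of a persistent database can be sketched with the standard library's sqlite3 module. This README does not say which database rsscluster.py uses, so sqlite3, the table layout, and the function names here are all assumptions made for illustration. The point is only that stories seen on an earlier run stay available after they disappear from the feed, and that a later run adds just the new ones.

```python
import sqlite3


def open_database(path=":memory:"):
    """Open (or create) a story database; ':memory:' keeps this demo self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories ("
        "link TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn


def add_stories(conn, stories):
    """Insert stories, skipping links already stored; return how many were new."""
    before = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO stories (link, title, body) VALUES (?, ?, ?)",
        stories,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
    return after - before


conn = open_database()
# First run: the feed contains stories 1 and 2 (placeholder URLs).
first = add_stories(conn, [
    ("http://example.com/1", "Story 1", "text of story 1"),
    ("http://example.com/2", "Story 2", "text of story 2"),
])
# Later run: story 1 has fallen off the feed and story 3 has appeared.
second = add_stories(conn, [
    ("http://example.com/2", "Story 2", "text of story 2"),
    ("http://example.com/3", "Story 3", "text of story 3"),
])
total = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
print(first, second, total)  # prints 2 1 3
```

With a file path instead of ":memory:", the table survives between invocations, which is what makes skipping the training step on later runs possible.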