natural language processing of the writings of Christopher Hitchens

Table of Contents

Description
Documentation
Stack
License

## Description

This project collects the writings of Christopher Hitchens for natural language processing. The goal is to enable fans of Hitchens to analyze the corpus of his writing and discover things like:

- trends in the topics he wrote about
- favored words and phrases
- the size of his vocabulary

Currently, the collection contains only Hitchens' writings for Slate from 2002 to 2011 (his Fighting Words column). Words that Hitchens used only once in this corpus, known as hapax legomena, are tweeted by @HitchensHapaxes.
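As a rough illustration of what "vocabulary size" and "hapax legomena" mean in practice, here is a minimal sketch that works on token lists. It is not the code from bot.py: the function name `corpus_stats` and the sample documents are made up for this example, and lowercasing every token before counting is an assumption about how words should be compared.

```python
from collections import Counter


def corpus_stats(documents):
    """Return (vocabulary size, sorted hapaxes) for a list of token lists."""
    # Assumption: case is ignored, so "The" and "the" count as one word.
    counts = Counter(
        token.lower() for tokens in documents for token in tokens
    )
    # A hapax legomenon is a word that occurs exactly once in the corpus.
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return len(counts), hapaxes


# Two tiny made-up "documents", already tokenized.
docs = [
    ["The", "casuistry", "of", "the", "argument"],
    ["the", "argument", "of", "the", "book"],
]
vocab_size, hapaxes = corpus_stats(docs)
print(vocab_size)
print(hapaxes)
```

With nltk installed, `nltk.FreqDist(tokens).hapaxes()` gives the same kind of list for a single flat token sequence.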

Contributions are welcome! Hitchens' writings from a number of places remain to be scraped and added to the database:

- Vanity Fair
- The Atlantic
- London Review of Books
- The Nation
- New Statesman
- misc.

To complete the collection, his books should also be added. An interesting extension of the project would be to add the text of his speeches.

## Documentation

slate_scraper.py scrapes the Slate archive of Hitchens' columns and stores their text in string and tokenized form in the database.
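The storage step can be sketched as follows. This is an assumption-laden outline, not the contents of slate_scraper.py: `tokenize` is a simple regex stand-in for whatever tokenizer the scraper really uses, `store_column` and `INSERT_SQL` are hypothetical names, and `cursor` is assumed to be a DB-API cursor (for example from psycopg2, which adapts a Python list to a PostgreSQL array). The column names come from the `documents` table defined in this section.

```python
import re

# Parameterized insert for the documents table; values are passed separately
# so the database driver handles quoting.
INSERT_SQL = (
    "INSERT INTO documents "
    "(url, title, subtitle, publication_date, content, content_tokenized) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def tokenize(text):
    """Split text into word tokens, keeping internal apostrophes and hyphens."""
    # Stand-in tokenizer (assumption): letters only, so numbers are dropped.
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)


def store_column(cursor, url, title, subtitle, publication_date, content):
    """Insert one column as both raw text and a token list."""
    tokens = tokenize(content)
    cursor.execute(
        INSERT_SQL,
        (url, title, subtitle, publication_date, content, tokens),
    )
    return tokens
```

Storing the token list alongside the raw text means later analysis does not have to re-tokenize every column on each run.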

bot.py powers the @HitchensHapaxes Twitter bot. It loads all of Hitchens' columns from the database, identifies the hapaxes, picks one at random, fetches its definition from Google, retrieves the context in which Hitchens used it, and tweets the result.
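The "pick a hapax, find its context, compose a tweet" part of that pipeline might look roughly like the sketch below. Everything here is illustrative: the function names are hypothetical, the sentence splitting is a naive regex, the 140-character limit is an assumption, and the definition lookup and the actual posting to Twitter are left out entirely.

```python
import random
import re


def pick_hapax(hapaxes, rng=random):
    """Choose one hapax at random from a non-empty list."""
    return rng.choice(hapaxes)


def find_context(word, content):
    """Return the first sentence of content containing word, or None."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", content)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for sentence in sentences:
        if pattern.search(sentence):
            return sentence.strip()
    return None


def compose_tweet(word, context, limit=140):
    """Build '<word>: "<context>"', trimming the context to fit limit."""
    prefix = word + ': "'
    room = limit - len(prefix) - 1  # reserve one character for the closing quote
    if len(context) > room:
        context = context[: room - 3].rstrip() + "..."
    return prefix + context + '"'


# Made-up column text for demonstration only.
content = "He was wrong. The casuistry was plain! Nobody cared."
word = pick_hapax(["casuistry"])
tweet = compose_tweet(word, find_context(word, content))
print(tweet)
```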

tweet.sh is the shell script for executing bot.py as a cron job.

setup.sh is the shell script for configuring the Ubuntu 12.04 LTS cloud server on which the bot runs.

The database table in which writings are stored is created with the following command in the psql client:

CREATE TABLE documents (id serial primary key, url text, title text, subtitle text, publication_date date, content text, content_tokenized text[]);
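Once rows are in this table, a question such as "how often did a given word appear each year?" can be answered from the `publication_date` and `content_tokenized` columns. The sketch below is one possible approach, not code from this repository: `yearly_counts` and `SELECT_SQL` are hypothetical names, and `cursor` is assumed to be a DB-API cursor (such as psycopg2's) that returns `date` columns as `datetime.date` objects and `text[]` columns as Python lists.

```python
from collections import Counter

SELECT_SQL = (
    "SELECT publication_date, content_tokenized FROM documents "
    "ORDER BY publication_date"
)


def yearly_counts(cursor, term):
    """Count case-insensitive occurrences of term per publication year."""
    term = term.lower()
    counts = Counter()
    cursor.execute(SELECT_SQL)
    for publication_date, tokens in cursor.fetchall():
        # Years with no occurrences still get an entry with a count of 0.
        counts[publication_date.year] += sum(
            1 for token in tokens if token.lower() == term
        )
    return dict(counts)
```

Raw counts like these would need to be normalized by the number of words published each year before being read as a trend.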

## Stack

Hitchens' writings are stored in a PostgreSQL database and analyzed using the Python library nltk.

## License

See LICENSE.md.
