ihinsdale/hitchens-lexicon


Natural language processing of the writings of Christopher Hitchens

## Description

This project collects the writings of Christopher Hitchens for natural language processing. The goal is to enable fans of Hitchens to analyze the corpus of his writing and discover things like:

  • trends in the topics he wrote about
  • favored words and phrases
  • the size of his vocabulary
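As a rough illustration of the last two questions, the sketch below counts distinct words and the most frequent n-grams in a piece of text. It uses only the Python standard library rather than nltk, and the regex tokenizer, the function names, and the sample sentence are all assumptions made for this example, not code from the repository.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def vocabulary_size(tokens):
    """Number of distinct word forms in the token list."""
    return len(set(tokens))


def top_ngrams(tokens, n, k):
    """Return the k most frequent n-grams as (ngram_tuple, count) pairs."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)


# Invented sample text, not taken from the corpus.
sample = "the cat sat on the mat and the cat slept on the mat"
tokens = tokenize(sample)

print(vocabulary_size(tokens))   # distinct words
print(top_ngrams(tokens, 1, 3))  # favored words
print(top_ngrams(tokens, 3, 1))  # favored three-word phrase
```

On the real corpus, the `content_tokenized` column could be fed to `vocabulary_size` and `top_ngrams` directly, which would skip the ad hoc tokenizer.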

Currently, the collection contains only Hitchens' writings for Slate from 2002 to 2011 (his Fighting Words column). Words that Hitchens used only once in this corpus, known as hapax legomena, are being tweeted by @HitchensHapaxes.
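To make the definition concrete, here is a minimal sketch of finding hapaxes across several tokenized documents. The function name and the toy documents are hypothetical; the key point is that counts are taken over the whole corpus, so a word that appears once in each of two columns is not a hapax.

```python
from collections import Counter


def find_hapaxes(tokenized_documents):
    """Return, in sorted order, the words that occur exactly once across all documents."""
    counts = Counter()
    for tokens in tokenized_documents:
        counts.update(tokens)
    return sorted(word for word, n in counts.items() if n == 1)


# Toy documents standing in for tokenized columns.
docs = [
    ["a", "rare", "word"],
    ["a", "common", "word"],
]
hapaxes = find_hapaxes(docs)
print(hapaxes)
```

With nltk available, `nltk.FreqDist(tokens).hapaxes()` gives the same kind of result for a single flat token list.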

Contributions are welcome! Hitchens' writings from a number of places remain to be scraped and added to the database:

  • Vanity Fair
  • The Atlantic
  • London Review of Books
  • The Nation
  • New Statesman
  • misc.

To complete the collection, his books should also be added. An interesting extension of the project would be to add the text of his speeches.

## Documentation

slate_scraper.py scrapes the Slate archive of Hitchens' columns and stores each column's text in the database, both as a single string and in tokenized form.

bot.py powers the @HitchensHapaxes Twitter bot. It loads all of Hitchens' columns from the database, identifies the hapaxes, picks one at random, fetches its definition from Google, finds the context in which Hitchens used it, and tweets the result.
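The later steps of that pipeline might look roughly like the sketch below. It is not the code in bot.py: the function names, the sentence-splitting regex, the tweet layout, and the 140-character default limit are all assumptions for illustration, and the definition lookup is left out because the source does not say how it is done.

```python
import random
import re


def pick_hapax(hapaxes, rng=random):
    """Choose one hapax at random; rng can be swapped for a seeded random.Random."""
    return rng.choice(hapaxes)


def find_context(word, text):
    """Return the first sentence containing the word as a whole word, or None."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for sentence in sentences:
        if pattern.search(sentence):
            return sentence.strip()
    return None


def compose_tweet(word, definition, context, max_len=140):
    """Join word, definition and context, truncating with '...' past max_len (assumed limit)."""
    tweet = f'{word}: {definition} "{context}"'
    if len(tweet) > max_len:
        tweet = tweet[: max_len - 3].rstrip() + "..."
    return tweet


# Invented text and definition, for illustration only.
text = "This is common. A rare word appears here. The end."
word = pick_hapax(["rare"])
context = find_context(word, text)
print(compose_tweet(word, "not occurring often", context))
```

Keeping the random choice, context search, and formatting in separate functions makes each step easy to test without touching the database or Twitter.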

tweet.sh is the shell script for executing bot.py as a cron job.

setup.sh is the shell script for configuring the Ubuntu 12.04 LTS cloud server on which the bot runs.

The database table in which writings are stored is created with the following command in the psql client:

CREATE TABLE documents (id serial primary key, url text, title text, subtitle text, publication_date date, content text, content_tokenized text[]);
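A row for this table could be written with a parameterized insert along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the whitespace tokenizer is a stand-in for whatever the scraper actually uses, and the `%s` placeholders assume a DB-API driver such as psycopg2, which adapts Python lists to PostgreSQL arrays and `datetime.date` to `date`.

```python
import datetime

# Column order follows the CREATE TABLE statement; id is filled in by the serial default.
INSERT_SQL = (
    "INSERT INTO documents "
    "(url, title, subtitle, publication_date, content, content_tokenized) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def build_row(url, title, subtitle, publication_date, content):
    """Assemble the parameter tuple for INSERT_SQL; tokens go in as a Python list."""
    tokens = content.split()  # placeholder tokenizer (assumption)
    return (url, title, subtitle, publication_date, content, tokens)


def insert_document(cursor, row):
    """Run the insert on any DB-API cursor that uses the %s parameter style."""
    cursor.execute(INSERT_SQL, row)


# Hypothetical values, for illustration only.
row = build_row(
    "https://example.com/column",
    "Example title",
    "Example subtitle",
    datetime.date(2011, 1, 1),
    "Some column text",
)
print(row[-1])
```

In practice `insert_document` would be called with a cursor obtained from an open connection, followed by a commit on that connection.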

## Stack

Hitchens' writings are stored in a PostgreSQL database and analyzed using the Python library nltk.

## License

See LICENSE.md.
