Python script to archive Tweets



A Python script to archive a user's tweets (including responses,
retweets, tweets responded to, etc.).

The script uses the Twitter API and stores tweets in Twitter's XML
format. A separate script generates an html version, divided into
monthly archives. Ultimately it will produce pdfs as well.

Currently it archives these streams:

* user_timeline (your tweets)

* mentions

* direct_messages

* direct_messages_sent

For any tweet that is a response to another tweet (from any source), it
archives the original in a file called "references".
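The reply-chasing step above amounts to reading each status's
in_reply_to_status_id field and, when it is non-empty, fetching that
original tweet into "references". A rough sketch, using the stdlib
xml.etree here purely for illustration (the script itself depends on
lxml), with element names taken from Twitter's XML status format:

```python
import xml.etree.ElementTree as ET

def replied_to_id(status_xml):
    """Return the id of the tweet this <status> replies to, or None.

    In Twitter's XML format the <in_reply_to_status_id> element is
    present but empty for tweets that are not replies.
    """
    node = status_xml.find("in_reply_to_status_id")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()

# A minimal <status> fragment: tweet 2 replying to tweet 1.
sample = ET.fromstring(
    "<status><id>2</id>"
    "<in_reply_to_status_id>1</in_reply_to_status_id></status>"
)
```

Any id returned this way would then be fetched once and appended to the
"references" file.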

Dependencies: these Python modules are required:

* oauth2

* lxml


First you must authorize the script to access your Twitter account using
OAuth. Follow the instructions in this tutorial:

Note that you'll need to give your application permission to access direct messages.

Copy the properties file to
and insert the values that were generated by the authentication process,
as well as the Twitter ID you wish to archive. Make sure that file is
kept secure (e.g. "chmod go-r").
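A properties file of this kind is typically just key=value lines. The
parsing could look roughly like the sketch below; the key names shown
(consumer_key, user_id) are illustrative stand-ins, not necessarily the
names the real file uses:

```python
def load_properties(path):
    """Parse simple key=value lines; blank lines and '#' comments
    are skipped. Values keep everything after the first '='."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props
```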


Now run the script from the command line. It will create
a directory "archive" and download an XML file into it. The XML
downloads are placed in timestamped directories, e.g.
"archive/masters/2011/02/2011-01-16-112031.xml". A new properties file is also created containing the most recent tweet's id in
each stream. This is used in the next run to create "since_id"
parameters, to allow incremental updating. After the master download,
monthly dump files are created in "archive/xml", e.g.
"archive/xml/2011/02/2011-02-24-135642.xml". These dump files are never
altered; if a subsequent fetch gets more tweets for a given month,
another timestamped dump file will be created.
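The timestamped layout and the "since_id" bookkeeping described above
can be sketched as follows. These helper names are hypothetical and the
exact naming in the script may differ, but the path shape matches the
examples given:

```python
import os
from datetime import datetime

def master_path(when, root="archive"):
    """Timestamped path for a master download, e.g.
    archive/masters/2011/02/2011-02-24-135642.xml."""
    return os.path.join(
        root, "masters",
        when.strftime("%Y"), when.strftime("%m"),
        when.strftime("%Y-%m-%d-%H%M%S") + ".xml",
    )

def newest_ids(stream_ids):
    """Map each stream name to the highest status id seen, for use as
    that stream's since_id parameter on the next run. Twitter ids are
    numeric, so they are compared as integers."""
    return {stream: max(int(i) for i in ids)
            for stream, ids in stream_ids.items() if ids}
```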

(To do a fresh full fetch, simply delete the generated properties file
and the archives.)

A separate script can then be run to create the
html. It creates yearly directories in "archive/html" and converts the
most recent dump files into monthly html files, e.g.
"archive/html/2011/2011_02.html", using statuses2html.xsl. Subsequent
runs will generate new html archives if there are
new monthly dumps, or if new dumps have been added to previously
processed months. All html archives will be rebuilt if the xsl
stylesheet changes.
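The "regenerate only what changed" rule described above comes down to
comparing file modification times. A minimal sketch of that decision,
with a hypothetical helper name:

```python
import os

def needs_rebuild(dump_path, html_path, xsl_path):
    """True if the monthly html must be (re)generated: the html file is
    missing, older than the newest dump for that month, or older than
    the xsl stylesheet."""
    if not os.path.exists(html_path):
        return True
    html_mtime = os.path.getmtime(html_path)
    return (os.path.getmtime(dump_path) > html_mtime
            or os.path.getmtime(xsl_path) > html_mtime)
```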

To do:

*	update to API 1.1 before 1.0 is removed. This will take some work,
since 1.1 only returns json.

*	deal with fetches that exceed the 200 requests/hour limit (which
can easily happen, since each responded-to tweet has to be fetched)
*	generate pdfs from html using Firefox Command Line Print plug-in

*	incorporate generating of html and pdfs into main script

*	look up shortened urls, store the real url

*	maybe fetch referenced pictures? capture snapshot of referenced

*	allow for customized external css

*	allow for tracking replies across monthly boundaries: currently
each month is self-contained

*	check whether any of this works on Windows; if not, learn about
cross-platform file separators in Python

*	handle timezones consistently
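The rate-limit item above could be handled with simple retry-and-wait
logic. This is only a sketch: RateLimitError is a stand-in for however
the script would actually detect a rate-limited response from the API,
and the delays are arbitrary:

```python
import time

class RateLimitError(Exception):
    """Illustrative stand-in for a detected rate-limited response."""

def fetch_with_backoff(fetch, max_tries=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call `fetch` until it succeeds, sleeping with exponential
    backoff between attempts; re-raise after max_tries failures."""
    delay = base_delay
    for attempt in range(max_tries):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_tries - 1:
                raise
            sleep(delay)
            delay *= 2  # double the wait after each rate-limited try
```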