GitHub - pbinkley/Tweet-Archive: Python script to archive Tweets

Tweet-Archive

A Python script to archive a user's tweets (including responses,
retweets, tweets responded to, etc.).

The script uses the Twitter API and stores tweets in Twitter's XML
format. A separate script generates an html version, divided into
monthly archives. Ultimately it will produce pdfs as well.

Currently it archives these streams:

* user_timeline (your tweets)

* mentions

* direct_messages

* direct_messages_sent

For any tweet that is a response to another tweet (from any source), it
archives the original in a file called "references".

Dependencies (required Python modules):

* oauth2

* lxml

Installation:

First you must authorize the script to access your Twitter account using
OAuth. Follow the instructions in this tutorial:

http://jeffmiller.github.com/2010/05/31/twitter-from-the-command-line-in-python-using-oauth

Note that you'll need to give your application permission to access direct messages.

Copy the properties file secrets.SAMPLE.properties to secrets.properties
and insert the values that were generated by the authentication process,
as well as the Twitter ID you wish to archive. Make sure that
secrets.properties is secure (e.g. "chmod go-r secrets.properties").
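
As an illustration of the properties format, here is a minimal sketch of
reading simple key=value lines. It is written for Python 3 and is not
taken from fetch-oauth2.py; the function name load_properties and the key
names in the sample are assumptions, so check secrets.SAMPLE.properties
for the real keys.

```python
def load_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no "=" at all
            props[key.strip()] = value.strip()
    return props


# Hypothetical key names, for illustration only.
sample = """
# values produced by the OAuth authorization step
consumer_key = abc123
user_id=12345
"""

props = load_properties(sample)
print(props)
```

Values are kept as strings; anything that must be numeric has to be
converted by the caller.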

Running:

Now run the script fetch-oauth2.py from the command line. It will create
a directory "archive" and download an XML file into it. The XML
downloads are stored as timestamped files in year/month directories, e.g.
"archive/masters/2011/02/2011-01-16-112031.xml". A new properties file
ids.properties is also created containing the most recent tweet's id in
each stream. This is used in the next run to create "since_id"
parameters, to allow incremental updating. After the master download,
monthly dump files are created in "archive/xml", e.g.
"archive/xml/2011/02/2011-02-24-135642.xml". These dump files are never
altered; if a subsequent fetch gets more tweets for a given month,
another timestamped dump file will be created.

(To do a fresh full fetch, simply delete ids.properties and the archive
directory.)
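
The incremental logic described above can be sketched roughly as follows.
This is a Python 3 illustration under stated assumptions, not the code in
fetch-oauth2.py: the function names since_params and dump_path, the shape
of the last_ids dictionary, and the sample id are all hypothetical.

```python
import os
from datetime import datetime


def since_params(stream, last_ids):
    """Return request parameters for one stream.

    last_ids maps stream names to the newest tweet id already archived.
    A stream with no recorded id gets no since_id, i.e. a full fetch.
    """
    params = {}
    if stream in last_ids:
        params["since_id"] = last_ids[stream]
    return params


def dump_path(root, kind, when):
    """Build a path like archive/xml/2011/02/2011-02-24-135642.xml."""
    filename = when.strftime("%Y-%m-%d-%H%M%S") + ".xml"
    return os.path.join(root, kind, when.strftime("%Y"),
                        when.strftime("%m"), filename)


# Hypothetical contents of ids.properties after an earlier run.
last_ids = {"user_timeline": "40812345"}

print(since_params("user_timeline", last_ids))
print(since_params("mentions", last_ids))
print(dump_path("archive", "xml", datetime(2011, 2, 24, 13, 56, 42)))
```

With no ids.properties there are no recorded ids, so no since_id is sent
and every stream is fetched in full. Building paths with os.path.join
rather than literal "/" characters uses the platform's own separator,
which bears on the Windows question in the to-do list.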

A separate script, static-archive.py, can then be run to create the
html. It creates yearly directories in "archive/html" and converts the
most recent dump files into monthly html files, e.g.
"archive/html/2011/2011_02.html", using statuses2html.xsl. Subsequent
runs of static-archive.py will generate new html archives if there are
new monthly dumps, or if new dumps have been added to previously
processed months. All html archives will be rebuilt if the xsl
stylesheet changes.
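
One way to express the rebuild rule above is to compare file modification
times. Whether static-archive.py actually decides this way is an
assumption; the sketch below (Python 3, with the hypothetical name
needs_rebuild) only shows the idea.

```python
import os


def needs_rebuild(html_path, dump_paths, xsl_path):
    """Return True if the monthly html file is missing or out of date.

    The html is considered stale when any dump file for that month, or
    the xsl stylesheet, was modified after the html was last written.
    """
    if not os.path.exists(html_path):
        return True
    built = os.path.getmtime(html_path)
    sources = list(dump_paths) + [xsl_path]
    return any(os.path.getmtime(path) > built for path in sources)
```

A caller would pass the html path for one month, the dump files found for
that month, and the path to statuses2html.xsl, and run the transform only
when the function returns True.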

To do:

* update to API 1.1 before 1.0 is removed. This will take some work,
since 1.1 only returns json.

* deal with fetches that exceed the 200 requests/hour limit (which
can easily happen, since each responded-to tweet has to be fetched
individually)

* generate pdfs from html using Firefox Command Line Print plug-in
(http://sites.google.com/site/torisugari/commandlineprint2)

* incorporate generating of html and pdfs into main script

* look up shortened urls, store the real url

* maybe fetch referenced pictures? capture snapshot of referenced
pages?

* allow for customized external css

* allow for tracking replies across monthly boundaries: currently
each month is self-contained

* check whether any of this works on Windows; if not, learn about
cross-platform file separators in Python

* handle timezones consistently
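
For the timezone item, one possible approach is to convert every tweet
timestamp to UTC before deciding which month it belongs to. The sketch
below is Python 3 and rests on assumptions: that timestamps look like
"Thu Feb 24 13:56:42 +0000 2011", that the process runs with English
day and month names, and that UTC is the right reference zone. The names
parse_created_at and month_key are hypothetical.

```python
from datetime import datetime, timezone

# Assumed timestamp layout, e.g. "Thu Feb 24 13:56:42 +0000 2011".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a timestamp string into a timezone-aware UTC datetime."""
    return datetime.strptime(value, TWITTER_FORMAT).astimezone(timezone.utc)


def month_key(value):
    """Return the monthly archive key (e.g. "2011_02") in UTC."""
    return parse_created_at(value).strftime("%Y_%m")


print(month_key("Thu Feb 24 13:56:42 +0000 2011"))
print(month_key("Mon Feb 28 23:30:00 -0500 2011"))
```

The second timestamp falls in February local time but in March in UTC,
which is the kind of boundary case a consistent rule has to settle one
way or the other.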