Tweet-Archive

A Python script to archive a user's tweets (including replies, retweets, tweets replied to, etc.). It uses the Twitter API and stores tweets in Twitter's XML format. A separate script generates an HTML version, divided into monthly archives. Ultimately it will produce PDFs as well.

Currently it archives these streams:

* user_timeline (your tweets)
* mentions
* direct_messages
* direct_messages_sent

For any tweet that is a reply to another tweet (from any source), it archives the original in a file called "references".

Dependencies: these Python modules are required:

* oauth2
* lxml

Installation:

First you must authorize the script to access your Twitter account using OAuth. Follow the instructions in this tutorial: http://jeffmiller.github.com/2010/05/31/twitter-from-the-command-line-in-python-using-oauth

Note that you'll need to give your application permission to access direct messages.

Copy the properties file secrets.SAMPLE.properties to secrets.properties and insert the values generated by the authentication process, as well as the Twitter ID you wish to archive. Make sure that secrets.properties is not readable by others (e.g. "chmod go-r secrets.properties").

Running:

Now run the script fetch-oauth2.py from the command line. It creates a directory "archive" and downloads XML into it. The XML downloads are stored as timestamped files under year/month directories, e.g. "archive/masters/2011/02/2011-01-16-112031.xml". A new properties file, ids.properties, is also created, containing the most recent tweet id in each stream. On the next run these ids are used as "since_id" parameters, allowing incremental updating.

After the master download, monthly dump files are created in "archive/xml", e.g. "archive/xml/2011/02/2011-02-24-135642.xml". These dump files are never altered; if a subsequent fetch gets more tweets for a given month, another timestamped dump file is created.
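The ids.properties bookkeeping described above can be sketched as follows. This is a minimal illustration, not the actual code in fetch-oauth2.py; the helper names are hypothetical, though the stream names match those listed above.

```python
# Sketch of the incremental-fetch bookkeeping: remember the newest tweet id
# per stream, and pass it back as since_id on the next run.
import os

STREAMS = ["user_timeline", "mentions", "direct_messages", "direct_messages_sent"]

def load_ids(path="ids.properties"):
    """Read the most recent tweet id per stream from a properties file."""
    ids = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    ids[key.strip()] = value.strip()
    return ids

def fetch_params(stream, ids, count=200):
    """Build request parameters; since_id is included only on incremental runs."""
    params = {"count": count}
    if stream in ids:
        params["since_id"] = ids[stream]
    return params
```

Deleting ids.properties therefore makes every stream look like a first run, which is why removing it forces a full re-fetch.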
(To do a fresh full fetch, simply delete ids.properties and the archive directory.)

A separate script, static-archive.py, can then be run to create the HTML. It creates yearly directories in "archive/html" and converts the most recent dump files into monthly HTML files, e.g. "archive/html/2011/2011_02.html", using statuses2html.xsl. Subsequent runs of static-archive.py generate new HTML archives if there are new monthly dumps, or if new dumps have been added to previously processed months. All HTML archives are rebuilt if the XSL stylesheet changes.

To do:

* update to API 1.1 before 1.0 is retired; this will take some work, since 1.1 returns only JSON
* deal with fetches that exceed the 200-requests/hour rate limit (which can easily happen, since each replied-to tweet has to be fetched individually)
* generate PDFs from the HTML using the Firefox Command Line Print plug-in (http://sites.google.com/site/torisugari/commandlineprint2)
* incorporate HTML and PDF generation into the main script
* look up shortened URLs and store the real URL
* maybe fetch referenced pictures? capture snapshots of referenced pages?
* allow for customized external CSS
* track replies across monthly boundaries: currently each month is self-contained
* check whether any of this works on Windows; if not, learn about cross-platform file separators in Python
* handle timezones consistently
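The XML-to-HTML step that static-archive.py performs can be sketched with lxml's XSLT support (lxml is already a dependency). The stylesheet below is a stand-in for illustration, not the project's statuses2html.xsl, and the element names in it are assumptions.

```python
# Apply an XSL stylesheet to a statuses dump and return the resulting HTML,
# as static-archive.py does for each monthly dump file.
from lxml import etree

def statuses_to_html(xml_bytes, xsl_bytes):
    """Transform a statuses XML dump into HTML using the given stylesheet."""
    doc = etree.fromstring(xml_bytes)
    transform = etree.XSLT(etree.fromstring(xsl_bytes))
    return str(transform(doc))

# A toy stylesheet: one <p> per status text.
TOY_XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/statuses">
    <html><body>
      <xsl:for-each select="status">
        <p><xsl:value-of select="text"/></p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>"""
```

Because the transform is driven entirely by the stylesheet, a change to the XSL invalidates all previously generated pages, which is why every HTML archive is rebuilt when the stylesheet changes.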