XPORT is the binary file format used by a bunch of United States government agencies for publishing data sets. It made a lot of sense if you were trying to read data files on your IBM mainframe back in 1988.
Let's make this short and sweet:
    import xport

    with xport.XportReader(xport_file) as reader:
        for row in reader:
            print(row)
Each row will be a dict with a key for each field in the dataset. Values will be either a unicode string, a float or an int, depending on the type specified in the file for that field.
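Because each row is a plain dict, rows can be handed straight to standard-library tools. Here is a minimal sketch that dumps rows to CSV, assuming every row has the same keys. `write_rows_as_csv` is a hypothetical helper name, not part of this library, and the sample rows are made up for illustration:

```python
import csv
import io


def write_rows_as_csv(rows, out):
    """Write an iterable of dict rows to `out` as CSV; return the row count.

    Hypothetical helper: `rows` can be any iterable of dicts that share
    the same keys, such as an XportReader.
    """
    writer = None
    count = 0
    for row in rows:
        if writer is None:
            # Take the column order from the first row's keys.
            writer = csv.DictWriter(out, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


# Stand-in rows; with a real file, pass the reader itself instead.
sample_rows = [
    {"NAME": "alpha", "SCORE": 1.5},
    {"NAME": "beta", "SCORE": 2.0},
]
buffer = io.StringIO()
written = write_rows_as_csv(sample_rows, buffer)
```

The header is built lazily from the first row so the helper never needs to know the field list up front.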
Once you have an XportReader object, there are a few properties and methods that will give you details about the file:
reader.file: the underlying Python file object (see next section).
reader.record_start: the position (in bytes) in the file where records start (see next section).
reader.record_length: the length (in bytes) of each record (see next section).
reader.record_count(): number of records in file. (Warning: this will seek to the end of the file to determine file length.)
reader.file_info and reader.member_info: dicts containing information about when and how the dataset was created.
reader.fields: list of fields in the dataset. Each field is a dict containing the following keys, copied from the spec:
    struct NAMESTR {
        short   ntype;          /* VARIABLE TYPE: 1=NUMERIC, 2=CHAR    */
        short   nhfun;          /* HASH OF NNAME (always 0)            */
      * short   field_length;   /* LENGTH OF VARIABLE IN OBSERVATION   */
        short   nvar0;          /* VARNUM                              */
      * char8   name;           /* NAME OF VARIABLE                    */
      * char40  label;          /* LABEL OF VARIABLE                   */
        char8   nform;          /* NAME OF FORMAT                      */
        short   nfl;            /* FORMAT FIELD LENGTH OR 0            */
      * short   num_decimals;   /* FORMAT NUMBER OF DECIMALS           */
        short   nfj;            /* 0=LEFT JUSTIFICATION, 1=RIGHT JUST  */
        char    nfill[2];       /* (UNUSED, FOR ALIGNMENT AND FUTURE)  */
        char8   niform;         /* NAME OF INPUT FORMAT                */
        short   nifl;           /* INFORMAT LENGTH ATTRIBUTE           */
        short   nifd;           /* INFORMAT NUMBER OF DECIMALS         */
        long    npos;           /* POSITION OF VALUE IN OBSERVATION    */
        char    rest[52];       /* remaining fields are irrelevant     */
    };
NOTE: items with stars have been renamed from the short names given in the spec. Since this is an alpha release, other items may be renamed in the future, if someone tells me what they're for.
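As a rough illustration of working with the field list, the sketch below turns field dicts into one-line summaries. It assumes `ntype` is stored as the integer 1 or 2, as in the spec comment, and it only touches the `name`, `label`, `ntype` and `field_length` keys. `describe_fields` and the sample data are hypothetical, not part of this library:

```python
def describe_fields(fields):
    """Return one summary line per field dict.

    Hypothetical helper; it relies only on the keys
    name, label, ntype and field_length.
    """
    # Assumption: ntype holds the integer codes from the spec.
    type_names = {1: "numeric", 2: "char"}
    lines = []
    for field in fields:
        kind = type_names.get(field["ntype"], "unknown")
        lines.append(
            "{0} ({1}, {2} bytes): {3}".format(
                field["name"], kind, field["field_length"], field["label"]
            )
        )
    return lines


# Stand-in for reader.fields; real values come from the file header.
sample_fields = [
    {"name": "AGE", "label": "Age in years", "ntype": 1, "field_length": 8},
    {"name": "SEX", "label": "Sex", "ntype": 2, "field_length": 1},
]
summary = describe_fields(sample_fields)
```

With a real file you would call `describe_fields(reader.fields)` instead of passing the stand-in list.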
If you want to access specific records, instead of iterating, you can use Python's standard file access functions and a little math.
Get the 1000th record (records are counted from zero, so skip 999 of them):

    reader.file.seek(reader.record_start + reader.record_length * 999, 0)
    reader.next()   # Python 2.5 and earlier
    next(reader)    # Python 2.6+ and Python 3
Get record before most recent one fetched:
    reader.file.seek(-reader.record_length * 2, 1)
    reader.next()   # Python 2.5 and earlier
    next(reader)    # Python 2.6+ and Python 3
Get last record:
    reader.file.seek(reader.record_start + reader.record_length * (reader.record_count() - 1), 0)
    reader.next()   # Python 2.5 and earlier
    next(reader)    # Python 2.6+ and Python 3
(In this last example, note that we can't seek from the end of the file, because there may be padding bytes. Good old fixed-width binary file formats.)
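The offset arithmetic in these examples can be wrapped in one small function. The sketch below is an assumption-laden illustration, not part of this library: `seek_to_record` is a hypothetical helper, and `FakeReader` is a stand-in that only mimics the `file`, `record_start`, `record_length` and `record_count()` attributes described above so the example runs without a real XPORT file.

```python
import io


def seek_to_record(reader, index):
    """Position reader.file at the start of record `index` (0-based).

    Hypothetical helper. Negative indexes count back from the last
    record, which calls reader.record_count() and therefore seeks to
    the end of the file first.
    """
    if index < 0:
        index += reader.record_count()
    if index < 0:
        raise IndexError("record index out of range")
    offset = reader.record_start + reader.record_length * index
    reader.file.seek(offset, 0)
    return offset


class FakeReader(object):
    """Stand-in with the same attribute names as XportReader, for the demo."""

    def __init__(self, data, record_start, record_length):
        self.file = io.BytesIO(data)
        self.record_start = record_start
        self.record_length = record_length

    def record_count(self):
        # Simplified for the demo: floor division drops a partial
        # trailing chunk, which is how the padding byte is ignored here.
        size = len(self.file.getvalue())
        return (size - self.record_start) // self.record_length


# 4-byte header, three 2-byte records, then one padding byte.
demo = FakeReader(b"HEADaabbcc ", record_start=4, record_length=2)
seek_to_record(demo, 1)
second = demo.file.read(demo.record_length)
seek_to_record(demo, -1)
last = demo.file.read(demo.record_length)
```

With a real reader, calling `seek_to_record(reader, 0)` and then `next(reader)` should return the first record again. Counting from `record_start` rather than from the end of the file is what keeps the padding from mattering.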
I wrote this up because it seemed ridiculous that there was no easy way to read a standard government data format in most programming languages. I may have gotten things wrong. If you find a file that doesn't decode properly, send a pull request. The official spec is here. It's surprisingly straightforward for a binary file format from the 80s.
Please also feel free to use this code as a base to write your own library for your favorite programming language. Government data should be accessible, man.