# Introduction

Let's look at some data from the Federal Election Commission (FEC). Some of the links point to sites that are no longer maintained, though they are still good for research. This site is an example of an organization sharing public data without investing much in making it user-friendly. That's great for us: less glitter, more data.

Paper Reports http://classic.fec.gov/finance/disclosure/ftppaper.shtml


## What are .inc files, and how can we read them?
Visit https://www.fec.gov/files/bulk-downloads/index.html?prefix=bulk-downloads/ and you will see a couple of files with this extension.
1. Let's download one and see whether it has any data.
2. If so, how can we save that data in a data table?

In [None]:
%%sh
wget "https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/leadership_pacs.inc"

In [None]:
%%sh
head leadership_pacs.inc

## Traversing the site
When visiting https://www.fec.gov/files/bulk-downloads/index.html?prefix=bulk-downloads/ the page looks pretty OK. But is it?

## What secrets do *these* files hold?

http://classic.fec.gov/finance/disclosure/ftppaper.shtml

A paper report submitted to the Commission is a committee's official campaign finance disclosure filing for paper filers. If there are any discrepancies between the paper report and the electronic data file received from our data entry contractor, the paper report takes precedence. 
The summary financial information and transactions disclosed in paper reports are data entered into an electronic format, and then stored in a downloadable file similar to electronically filed reports, known as a .fec file. Paper filings are data entered in a batch process; therefore, the Commission does not receive converted paper filings every day from our contractor. On days the Commission does not receive converted paper filings, an empty file (YYYYMMDD.nofiles.zip) will be placed on the FTP server. On days the Commission receives converted paper filings, a file (YYYYMMDD.zip) will contain that day's .fec files.
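
The naming convention quoted above can be checked mechanically. Here is a minimal sketch, assuming archive names follow exactly the two patterns mentioned (`YYYYMMDD.zip` and `YYYYMMDD.nofiles.zip`). The helper name `parse_archive_name` is made up for illustration and is not part of any FEC tooling.

```python
import re
from datetime import datetime

# The two patterns quoted above: YYYYMMDD.zip and YYYYMMDD.nofiles.zip
_ARCHIVE_RE = re.compile(r"^(\d{8})(\.nofiles)?\.zip$")

def parse_archive_name(name):
    """Return (filing_date, has_filings), or None if the name does not match."""
    m = _ARCHIVE_RE.match(name)
    if m is None:
        return None
    try:
        day = datetime.strptime(m.group(1), "%Y%m%d").date()
    except ValueError:
        return None  # eight digits that are not a real calendar date
    return day, m.group(2) is None

print(parse_archive_name("20170103.zip"))
print(parse_archive_name("20170104.nofiles.zip"))
```

The boolean tells us whether the archive is worth downloading at all: placeholders for days without filings can be skipped.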

https://www.fec.gov/files/bulk-downloads/index.html?prefix=bulk-downloads/paper/

### Task 1: convert .fec files into data tables
1. download a zip file from the site,
2. extract it,
3. figure out how to read a single `.fec` file, and
4. save data into a proper `.csv` file
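
One possible shape for these four steps is sketched below. It assumes that fields are separated by the ASCII file separator (`\x1C`), as in the reading loop later in this notebook, and that decoding bytes as `latin-1` is acceptable (a guess, since the encoding is not documented here). The function names are illustrative, and `url` is whatever link you copy from the listing page.

```python
import csv
import urllib.request
import zipfile

FIELD_SEP = b"\x1c"  # assumed delimiter: ASCII file separator

def download(url, dest):
    """Step 1: fetch one archive to a local path."""
    urllib.request.urlretrieve(url, dest)

def extract(zip_path, dest_dir):
    """Step 2: unpack the archive and return the member names."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()

def fec_to_csv(fec_path, csv_path, encoding="latin-1"):
    """Steps 3 and 4: split each line on the separator and write it as a CSV row."""
    with open(fec_path, "rb") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for raw in src:
            fields = raw.rstrip(b"\r\n").split(FIELD_SEP)
            writer.writerow([f.decode(encoding) for f in fields])
```

The file is read in binary mode on purpose: in Python 3, text-mode helpers such as `str.splitlines()` treat `\x1C` as a line boundary, which would break records apart. The `csv` module takes care of quoting fields that contain commas.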

### Task 2: automatic download
Now, let's build a crawler to download and extract those `.fec` files:
1. traverse site and download zip files,
2. unpack zip files,
3. extract data tables from the `.fec` files.

The result should be a directory tree full of header-less CSV files. Keep the file names intact, and keep the YYYYMMDD sub-directories.
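
How to traverse the site depends on how the listing page is built, so the sketch below covers only steps 2 and 3, for archives that are already in a local directory. It rests on several assumptions: "header-less" means dropping the first line of each `.fec` file, the `.fec` extension is swapped for `.csv`, fields are separated by `\x1C`, and `latin-1` decoding is acceptable. The function name `archives_to_csv_tree` is made up for illustration.

```python
import csv
import zipfile
from pathlib import Path

def archives_to_csv_tree(zip_dir, out_dir, encoding="latin-1"):
    """Write out_dir/YYYYMMDD/<name>.csv for every .fec member of every archive."""
    written = []
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        if zip_path.name.endswith(".nofiles.zip"):
            continue  # placeholder for a day without filings
        day_dir = Path(out_dir) / zip_path.stem  # "20170103.zip" -> "20170103"
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.namelist():
                if not member.lower().endswith(".fec"):
                    continue
                target = day_dir / (Path(member).stem + ".csv")
                target.parent.mkdir(parents=True, exist_ok=True)
                # bytes.splitlines() splits only on \n, \r and \r\n, not on \x1c
                lines = zf.read(member).splitlines()
                with open(target, "w", newline="", encoding="utf-8") as dst:
                    writer = csv.writer(dst)
                    for raw in lines[1:]:  # assumption: the first line is the header
                        writer.writerow([f.decode(encoding) for f in raw.split(b"\x1c")])
                written.append(target)
    return written
```

Reading members straight from the archive avoids leaving extracted `.fec` files on disk. If you want to keep them, `ZipFile.extractall` into the same YYYYMMDD directory works as well.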

In [None]:
%%sh
ls -l *.fec
wc -l sample.fec

In [None]:
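# Peek at the first four raw lines; the context manager closes the file for us
with open('sample.fec', 'rb') as f:
    first_lines = f.readlines()[:4]
first_lines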

Read the file line by line, splitting each record on the ASCII file separator (`\x1C`):

In [None]:
with open('sample.fec', 'rb') as f:
    header = f.readline()  # the first line is the file header; set it aside
    for line in f:
        # The file is opened in binary mode, so the separator must be bytes too
        vals = line.strip().split(b'\x1c')
        print(len(vals), vals)