404 errors in some results files for Washington #179

Closed
ghing opened this issue Aug 8, 2014 · 15 comments

ghing commented Aug 8, 2014

The fetch task appears to get 404 errors for some WA URLs and is writing the error response body to the output files.

The fetch logic should probably be updated to do more robust error handling, but the immediate need is to update the Washington datasource's url_paths.csv so we don't get the 404s.

ghing@Latitude-E6500:~/workspace/openelex-core/openelex/us/wa/cache$ find . -type f -size -2
./20080819__wa__primary__snohomish__precinct.csv
./20080219__wa__primary__grays_harbor__precinct
./20080819__wa__primary__thurston__precinct
./20101102__wa__general__congressional_district.csv
./20101102__wa__general__state_legislative.csv
./20080219__wa__primary__thurston__precinct
./20080219__wa__primary__lincoln__precinct.csv
./20080219__wa__primary__jefferson__precinct
./20080219__wa__primary__clallam__precinct.csv
./20080219__wa__primary__san_juan__precinct.csv
./20111108__wa__general__state_legislative.csv
./20080219__wa__primary__pend_oreille__precinct
./20080219__wa__primary__whitman__precinct
./20080219__wa__primary__walla_walla__precinct.csv
./20080219__wa__primary__pacific__precinct.csv
./20080219__wa__primary__pierce__precinct.csv
./20080219__wa__primary__cowlitz__precinct
./20080219__wa__primary__spokane__precinct
./20080219__wa__primary__columbia__precinct
./20080219__wa__primary__snohomish__precinct.csv
./20080219__wa__primary__klickitat__precinct.csv
./20111108__wa__general__congressional_district.csv
./20080219__wa__primary__grant__precinct

This was originally reported via email by @EricLagerg

ghing added the bug label Aug 8, 2014
ghing self-assigned this Aug 8, 2014

ghing commented Aug 8, 2014

I looked into this a little and one issue is that the datasource builds URLs for the expected CSV conversions of PDF files (conversions that will eventually live in https://github.com/openelections/openelections-data-wa). But the CSV conversions don't exist yet, so we get a 404.

An example of this is the file with the original URL of https://wei.sos.wa.gov/agency/osos/en/press_and_research/PreviousElections/2008/2008PrimaryPrecinctData/Documents/Snohomish prec results with RV, cast.pdf

We could add temporary logic that doesn't try to download these files, but in spite of my initial comment I think the best solution is better handling of 404s in the fetcher.

I'm still looking into the reason for the empty files without filename extensions.

ericlagergren (Member) commented

In this section of /base/fetch.py you're simply getting the URL without actually checking for 404 errors:

else:
    name, response = urlretrieve(url, local_file_name)
    print "Added to cache: %s" % local_file_name

My $0.02 would be to put in a simple try/except statement to pass over invalid URLs until you can safely convert .pdf to .csv. Additionally, you could use urllib2 (although I don't know if changing modules is allowed), which is much nicer to work with.

Something like:

else:
    try:
        name, response = urlretrieve(url, local_file_name)
        print "Added to cache: %s" % local_file_name
    except IOError as e:
        # skip the invalid URL, log it, etc.
        print "Error downloading %s: %s" % (url, e)

Also, invoke clear.cache --state=wa fails because directories are being made in the cache. I don't know if this is a result of the invalid URLs or not.


ghing commented Aug 8, 2014

@EricLagerg Makes sense re: better exception handling in the base fetcher. I'm going to try to get to it this afternoon.

Can you open a separate issue for clear.cache failing? I don't think this is due to the invalid URLs. It's more likely a result of the artifacts of extracting the zipped results files. That was a pretty quick-and-dirty implementation and could stand to be cleaned up a bit, I bet.


ghing commented Aug 8, 2014

Empty files without filename extensions are due to 404s caused by bad URLs in url_paths.csv. For example https://wei.sos.wa.gov/agency/osos/en/press_and_research/PreviousElections/2008/2008PP/2008PPPrecinctData/Documents/Jefferson prec with RV, cast

I'm working on cleaning this up.

For reference, here's some csvkit foo to find these bad lines:

$ csvgrep -c 7 -r '^.*\.(csv|xls|xlsx|pdf|zip)$' -i mappings/url_paths.csv
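For reference, a pure-Python equivalent of that csvkit one-liner, in case csvkit isn't handy (a sketch; it assumes, as the command does, that column 7 of mappings/url_paths.csv holds the URL path and that the first row is a header):

```python
import csv
import io
import re

# Paths ending in a recognized extension are considered good.
GOOD_EXT = re.compile(r'\.(csv|xls|xlsx|pdf|zip)$')

def bad_url_rows(csv_text):
    """Return rows whose 7th column lacks a recognized file extension,
    mirroring csvgrep -c 7 -r '^.*\.(csv|xls|xlsx|pdf|zip)$' -i."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row, as csvgrep does by default
    return [row for row in reader
            if len(row) >= 7 and not GOOD_EXT.search(row[6])]
```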

ericlagergren (Member) commented

I'm a little confused.

This URL: "https://wei.sos.wa.gov/agency/osos/en/press_and_research/PreviousElections/2008/2008PrimaryPrecinctData/Documents/Snohomish prec results with RV, cast.pdf" from line 130 in url_paths.csv is a valid URL that urlretrieve accepts.

That file is then given this name: 20080219__wa__primary__snohomish__precinct.csv, but look at this:

eric@crunchbang /home/eric/sbdmn/core/openelex/us/wa/cache $ cat 20080219__wa__primary__snohomish__precinct.csv 
Not Found

Now, if you run urlretrieve on the original .pdf URL (https://gist.github.com/EricLagerg/162f30804dca8aa6ea85), you'll get a .pdf document. If you change the file extension on the original URL to .csv, you'll get a .csv file with the contents 404 NOT FOUND, yet the fetch... command results in different file contents than my example.

Consider this:

0 ;) eric@crunchbang ~/sbdmn/core $ grep -r '404 NOT FOUND\|Not Found' .
./openelex/us/wa/cache/20080219__wa__primary__snohomish__precinct.csv:Not Found
./openelex/us/wa/cache/20011106__wa__general.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__pierce__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__klickitat__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__king__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__clallam__precinct.csv:Not Found
./openelex/us/wa/cache/20080819__wa__primary__thurston__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__columbia__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__cowlitz__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__jefferson__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__pend_oreille__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__walla_walla__precinct.csv:Not Found
./openelex/us/wa/cache/20111108__wa__general__congressional_district.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__whitman__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__spokane__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__grays_harbor__precinct:404 NOT FOUND
./openelex/us/wa/cache/20000919__wa__primary.csv:Not Found
./openelex/us/wa/cache/20040914__wa__primary.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__san_juan__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__thurston__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__lincoln__precinct.csv:Not Found
./openelex/us/wa/cache/20071106__wa__general__pacific__precinct.csv:Not Found
./openelex/us/wa/cache/20001107__wa__general.csv:Not Found
./openelex/us/wa/cache/20080819__wa__primary__snohomish__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__grant__precinct:404 NOT FOUND
./openelex/us/wa/cache/20041102__wa__general.csv:Not Found
./openelex/us/wa/cache/20111108__wa__general__state_legislative.csv:Not Found
./openelex/us/wa/cache/20101102__wa__general__congressional_district.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__pacific__precinct.csv:Not Found
./openelex/us/wa/cache/20101102__wa__general__state_legislative.csv:Not Found
0 ;) eric@crunchbang ~/sbdmn/core $ 

What's the difference between Not Found and 404 NOT FOUND?


ghing commented Aug 8, 2014

@EricLagerg The fetcher tries to get CSV versions of the PDF files from the GitHub repo openelections-data-wa (see lib.build_github_url()). If it's not a PDF file, it tries to fetch the file from the Washington SOS website.

I believe the response body from GitHub's webserver is "Not Found" and from the SOS's website it's "404 NOT FOUND". I'm in the process of implementing a fix similar to what you suggested to more accurately handle 404s.

The behavior of urllib.urlretrieve() is a little annoying because it doesn't raise exceptions for HTTP errors, but I think it will be pretty easy to just check the response and clean up the local file if needed. If this continues to be a pain point we might want to use a different library (requests?) for doing the downloads.
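That "check the response and clean up the local file" idea might look something like this (a hypothetical helper, not the project's actual fix; written in Python 3 syntax even though the codebase at the time was Python 2):

```python
import os

def clean_up_if_error(status_code, local_file_name):
    """Delete the junk file urlretrieve wrote if the server returned
    an HTTP error. Python 2's urlretrieve saves the error body (e.g.
    "Not Found") to disk instead of raising, so the caller has to
    check the status itself. Returns True if the download is usable."""
    if status_code >= 400:
        if os.path.exists(local_file_name):
            os.remove(local_file_name)
        return False
    return True
```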

Again, thanks for looking into this.

ghing added a commit that referenced this issue Aug 8, 2014
Add filename extensions to URLs defined in Washington's
url_paths.csv that were missing the extension.

Addresses #179
ghing added a commit that referenced this issue Aug 8, 2014
When using the default urllib.urlretrieve() method, the method
happily creates a local file containing the webserver's error
response rather than indicating that there was an HTTP error.

The fix involves a custom subclass of urllib.FancyURLopener that
raises an exception, as suggested by
http://stackoverflow.com/a/1308846/386210

Currently, it only cares about 404 errors.  If we need more
error handling, it might be worthwhile to use something other
than urllib.urlretrieve().  The way that you do it with requests
is mentioned here: http://stackoverflow.com/a/14114741/386210.

Addresses #179
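For what it's worth, Python 3's urllib gives the behavior this commit wants without a custom opener: urllib.request.urlopen raises urllib.error.HTTPError on a 404 instead of returning the error page. A minimal fetch built that way (a sketch under that assumption, not this repo's code):

```python
import shutil
import urllib.request

def fetch_to_cache(url, local_file_name):
    """Download url to local_file_name. urlopen raises HTTPError on a
    404, so no "Not Found" junk file ever lands in the cache."""
    with urllib.request.urlopen(url) as response, \
            open(local_file_name, "wb") as out:
        shutil.copyfileobj(response, out)
```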

ghing commented Aug 8, 2014

Fixed in df9cfda. @EricLagerg, you'll probably want to implement your loader so that it gracefully handles the case where an expected local file doesn't exist.

@ghing ghing closed this as completed Aug 8, 2014
ericlagergren (Member) commented

I made a list of all the 404 errors by running invoke fetch --state=wa | grep "Error" > badurls

I think it'd be good to have a list of the bad URLs created when the fetch command is run so that you can manually get the files in question.
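Recording the failures during the run could be as simple as this (a sketch; fetch_one is a hypothetical stand-in for the real download call, assumed to raise IOError — which HTTPError subclasses in Python 3 — on a 404):

```python
def fetch_all(url_pairs, fetch_one, bad_urls_path="badurls"):
    """Fetch each (url, local_file_name) pair with fetch_one, and
    write every URL that errors out to bad_urls_path so the files
    in question can be grabbed manually later."""
    failed = []
    for url, local_file_name in url_pairs:
        try:
            fetch_one(url, local_file_name)
        except IOError:
            failed.append(url)
    with open(bad_urls_path, "w") as f:
        for url in failed:
            f.write(url + "\n")
    return failed
```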


ghing commented Aug 12, 2014

@EricLagerg can you post your list to a gist somewhere? I'm interested in whether the bulk of the URLs are for PDFs that we haven't preprocessed yet, or whether they represent typos or changed URLs in the datasource wiring.

ericlagergren (Member) commented

@ghing
Copy link
Contributor Author

ghing commented Aug 12, 2014

@EricLagerg Thanks. So all the nonexistent URLs here represent the expected URLs of PDF files that we plan to preprocess and put in http://github.com/openelections/openelections-data-wa/. If you want to grab the raw PDF versions of these files, you can run:

inv fetch --state=wa --unprocessed


ghing commented Aug 12, 2014

@EricLagerg Also, the docs for our preprocessing workflow/repo layout are at http://docs.openelections.net/guide/preprocessing/

ericlagergren (Member) commented

@ghing Where did the .pdf files come from? The first file on that list can be found here: http://www.sos.wa.gov/elections/results_report.aspx?e=20&c=&c2=&t=&t2=&p=&p2=&y=. (And I assume that's the same for the rest of the URLs.) Forgive me if I'm getting myself all mixed up here, but wouldn't it be easier to parse the raw HTML of the website than scrape a .pdf?


ghing commented Aug 12, 2014

@EricLagerg Sorry for the confusion. I've given you the breakdown of Washington's data piecemeal, mostly because I hadn't looked at the Washington information in a while. I'm going to try to summarize everything here.

tl;dr The first URL in your list isn't from a PDF; it's extracted from a database dump, and it's not found because of a discrepancy in election dates.

For most states, the openelections-data-{abbrev} repos contain preprocessed CSVs extracted from PDFs. For Washington, there will be some files like this. For example, the URL in the log line Error downloading https://raw.githubusercontent.com/openelections/openelections-data-wa/master/20071106__wa__general__pacific__precinct.csv represents a not-yet-created CSV conversion of the precinct-level results found in https://wei.sos.wa.gov/agency/osos/en/press_and_research/PreviousElections/2007/General/Data/Documents/Precinct%20Results/Pacific%20Gen07%20Reg%20Voters.pdf. The URLs for these were entered by volunteers who looked through the state's websites. If you know of better sources of precinct-level data that aren't in PDF, that would be awesome.

However, for pre-2007 elections, openelections-data-wa also contains CSV files extracted from a database dump.

The first URL in your 404 error list represents one of these files.

The reason we get the 404 error is that our elections API (what gets called when you run inv datasource.elections --state=wa --datefilter=2000) is used to help build the URLs that get fetched in openelex.us.wa.datasource. The data that drives the API was populated by research and data entry by volunteers.

The API has a record for the 2000 general election on 2000-11-07. The database dump has the entry as 2000-11-04. The script that I used to flatten the database dumps used the dates from the data to name the output files, so the file was created as 20001104__wa__general.csv.
From your knowledge of WA elections, do you know whether 2000-11-04 or 2000-11-07 is the correct date? Wikipedia seems to suggest that it's 2000-11-07.

I opened an issue, #148, to follow up with someone at the state to get more insight into the date discrepancy and other data weirdness in the database dump, but never heard back. I need to try to pursue this further with our contact at the state.

Finally, in general, you're right in thinking that we want to avoid scraping PDFs like the plague, so if there is CSV, XLS, text, or HTML that we can parse, we definitely want to use that.

ericlagergren (Member) commented

Unfortunately I think we're stuck with PDF scraping for precinct-level results. In 2000, the general was held on November 7. The February primary was held on February 29. The September primary was held on September 19.

I'd check out http://www.thegreenpapers.com/. I haven't found anything incorrect with their data so far. A lot of weird things have happened with Washington's voting system in the last 20 years (http://www.sos.wa.gov/elections/timeline/time5.htm), so things can be a bit wacky.
