404 errors in some results files for Washington #179
I looked into this a little bit. One issue is that the datasource creates URLs for the expected CSV conversions (which will eventually live in https://github.com/openelections/openelections-data-wa) of PDF files. But the CSV conversions don't exist yet, so we get a 404. An example of this is the file with the original URL https://wei.sos.wa.gov/agency/osos/en/press_and_research/PreviousElections/2008/2008PrimaryPrecinctData/Documents/Snohomish prec results with RV, cast.pdf. We could add temporary logic that doesn't try to download these files, but in spite of my initial comment I think the best solution is better handling of 404s in the fetcher. I'm still looking into the reason for the empty files without filename extensions.
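To make the failure mode concrete, here is a rough sketch of how such an expected-CSV URL gets built. This is illustrative only: the real logic lives in the datasource code, and `build_expected_csv_url` is a made-up name.

```python
def build_expected_csv_url(standardized_filename):
    """Hypothetical sketch: build the raw-GitHub URL where the CSV
    conversion of a PDF results file is expected to eventually live."""
    base = ("https://raw.githubusercontent.com/"
            "openelections/openelections-data-wa/master")
    return "%s/%s" % (base, standardized_filename)

# Until the PDF is actually converted and committed to the repo,
# fetching this URL returns a 404.
url = build_expected_csv_url("20080219__wa__primary__snohomish__precinct.csv")
```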
In this section of the fetcher:

```python
else:
    name, response = urlretrieve(url, local_file_name)
    print "Added to cache: %s" % local_file_name
```

My $0.02 would be to put in a simple try/except. Something like:

```python
else:
    try:
        name, response = urlretrieve(url, local_file_name)
        print "Added to cache: %s" % local_file_name
    except Exception:  # generic error
        pass  # or do something, etc.
```

`invoke clear.cache --state=wa` also fails because directories are being made. I don't know if this is a result of the invalid URLs or not.
@EricLagerg Makes sense re: better exception handling in the base fetcher. I'm going to try to get to this this afternoon. Can you open a separate issue for clear.cache failing? I don't think this is due to the invalid URLs; it's more likely a result of the artifacts of extracting the zipped results files. That was a pretty quick-and-dirty implementation and could stand to be cleaned up a bit, I bet.
Empty files without filename extensions are due to 404s caused by bad URLs in url_paths.csv. I'm working on cleaning this up. For reference, here's some csvkit foo to find these bad lines:
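The original one-liner didn't survive the copy, but a plain-Python stand-in for the same check might look like this. The column name `url` is an assumption about url_paths.csv's layout.

```python
import csv
import io
import re

# Flag rows of url_paths.csv whose URL lacks a recognizable file
# extension -- these cache as extensionless "Not Found" files.
HAS_EXT = re.compile(r"\.(pdf|csv|xlsx?|txt|html?)$", re.IGNORECASE)

def bad_rows(fileobj, url_field="url"):
    """Yield rows whose URL field does not end in a known extension."""
    for row in csv.DictReader(fileobj):
        if not HAS_EXT.search(row[url_field].strip()):
            yield row

# Tiny inline sample standing in for the real url_paths.csv:
sample = io.StringIO(
    "date,url\n"
    "2008-02-19,https://example.com/results.pdf\n"
    "2008-02-19,https://example.com/results\n"
)
flagged = list(bad_rows(sample))
```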
I'm a little confused. This URL: [the Snohomish PDF URL above]. That file is then given this name:

```
eric@crunchbang /home/eric/sbdmn/core/openelex/us/wa/cache $ cat 20080219__wa__primary__snohomish__precinct.csv
Not Found
```

Now, if you run urlretrieve on the original .pdf URL (https://gist.github.com/EricLagerg/162f30804dca8aa6ea85), you'll get a .pdf document. If you change the file extension on the original URL to .csv, you'll get a .csv file with the contents `Not Found`. Consider this:

```
0 ;) eric@crunchbang ~/sbdmn/core $ grep -r '404 NOT FOUND\|Not Found' .
./openelex/us/wa/cache/20080219__wa__primary__snohomish__precinct.csv:Not Found
./openelex/us/wa/cache/20011106__wa__general.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__pierce__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__klickitat__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__king__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__clallam__precinct.csv:Not Found
./openelex/us/wa/cache/20080819__wa__primary__thurston__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__columbia__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__cowlitz__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__jefferson__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__pend_oreille__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__walla_walla__precinct.csv:Not Found
./openelex/us/wa/cache/20111108__wa__general__congressional_district.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__whitman__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__spokane__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__grays_harbor__precinct:404 NOT FOUND
./openelex/us/wa/cache/20000919__wa__primary.csv:Not Found
./openelex/us/wa/cache/20040914__wa__primary.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__san_juan__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__thurston__precinct:404 NOT FOUND
./openelex/us/wa/cache/20080219__wa__primary__lincoln__precinct.csv:Not Found
./openelex/us/wa/cache/20071106__wa__general__pacific__precinct.csv:Not Found
./openelex/us/wa/cache/20001107__wa__general.csv:Not Found
./openelex/us/wa/cache/20080819__wa__primary__snohomish__precinct.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__grant__precinct:404 NOT FOUND
./openelex/us/wa/cache/20041102__wa__general.csv:Not Found
./openelex/us/wa/cache/20111108__wa__general__state_legislative.csv:Not Found
./openelex/us/wa/cache/20101102__wa__general__congressional_district.csv:Not Found
./openelex/us/wa/cache/20080219__wa__primary__pacific__precinct.csv:Not Found
./openelex/us/wa/cache/20101102__wa__general__state_legislative.csv:Not Found
0 ;) eric@crunchbang ~/sbdmn/core $
```

What's the difference between `Not Found` and `404 NOT FOUND`?
@EricLagerg The fetcher tries to get CSV versions of the PDF files from the GitHub repo openelections-data-wa (see lib.build_github_url()). If it's not a PDF file, it tries to fetch the file from the Washington SoS website. I believe the response body from GitHub's webserver is "Not Found" and from the SoS website it's "404 NOT FOUND". I'm in the process of implementing a fix similar to what you suggested to more accurately handle 404s. The behavior of urllib.urlretrieve() is a little annoying because it doesn't raise exceptions for HTTP errors, but I think it will be pretty easy to just check the response and clean up the local file if needed. If this continues to be a pain point, we might want to use a different library (requests?) for doing the downloads. Again, thanks for looking into this.
Add filename extensions to URLs defined in Washington's url_paths.csv that were missing the extension. Addresses #179
When using the default urllib.urlretrieve() method, the method happily creates a local file containing the webserver's error response rather than indicating that there was an HTTP error. The fix involves a custom subclass of urllib.FancyURLopener that raises an exception, as suggested by http://stackoverflow.com/a/1308846/386210. Currently it only cares about 404 errors. If we need more error handling, it might be worthwhile to use something other than urllib.urlretrieve(); a requests-based approach is described at http://stackoverflow.com/a/14114741/386210. Addresses #179
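For reference, a minimal Python 3 sketch of the same idea. In Python 3, urllib.request.urlretrieve() already raises HTTPError on a 404, so the fetcher mainly needs to catch it and make sure no error-page file lingers in the cache. The function name is illustrative, not the repo's actual API.

```python
import os
import urllib.error
import urllib.request

def fetch_to_cache(url, local_file_name):
    """Download url to local_file_name; on a 404, make sure no stale
    error-body file is left behind masquerading as results data."""
    try:
        urllib.request.urlretrieve(url, local_file_name)
        print("Added to cache: %s" % local_file_name)
    except urllib.error.HTTPError as e:
        # Defensive cleanup: remove any partially written file.
        if os.path.exists(local_file_name):
            os.remove(local_file_name)
        if e.code == 404:
            print("Skipping %s: 404" % url)
        else:
            raise
```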
Fixed in df9cfda. @EricLagerg, you'll probably want to implement your loader so that it gracefully handles the case where an expected local file doesn't exist.
I made a list of all the 404 errors by running [the fetch task]. I think it'd be good to have a list of the bad URLs created when [it runs].
@EricLagerg Can you post your list to a gist somewhere? I'm interested in whether the bulk of the URLs are for PDFs that we haven't preprocessed yet, or whether they represent typos or changed URLs in the datasource wiring.
@EricLagerg Thanks. So all the nonexistent URLs here represent the expected URLs of PDF files that we plan to preprocess and put in http://github.com/openelections/openelections-data-wa/. If you want to grab the raw PDF versions of these files, you can run:
@EricLagerg Also, the docs for our preprocessing workflow/repo layout are at http://docs.openelections.net/guide/preprocessing/
@ghing Where did the .pdf files come from? The first file on that list can be found here: |
@EricLagerg Sorry for the confusion. I've given you the breakdown of Washington's data piecemeal, mostly because I hadn't looked at the Washington information in a while. I'm going to try to summarize everything here. tl;dr: the first URL in your list isn't from a PDF, it's extracted from a database dump, and it's not found because of a discrepancy in election dates.

For most states, the openelections-data-{abbrev} repos contain preprocessed CSVs extracted from PDFs. For Washington, there will be some files like this. For example, the URL https://raw.githubusercontent.com/openelections/openelections-data-wa/master/20071106__wa__general__pacific__precinct.csv represents a CSV conversion, which doesn't yet exist, of the precinct-level results found in https://wei.sos.wa.gov/agency/osos/en/press_and_research/PreviousElections/2007/General/Data/Documents/Precinct%20Results/Pacific%20Gen07%20Reg%20Voters.pdf. The URLs for these were entered by volunteers who looked through the state's websites. If you know of better sources of precinct-level data that aren't in PDF, that would be awesome.

However, for pre-2007 elections, openelections-data-wa also contains CSV files extracted from a database dump. The first URL in your 404 error list represents one of these files. The reason we get the 404 error is a date discrepancy: our elections API (what gets called when you run `inv datasource.elections --state=wa --datefilter=2000`) has a record for the 2000 general election on 2000-11-07, but the database dump has the entry as 2000-11-04. The script that I used to flatten the database dumps used the dates from the data to name the output files, so the file was created as 20001104__wa__general.csv. I opened an issue, #148, to follow up with someone at the state to get more insight into the date discrepancy and other data weirdness in the database dump, but never heard back. I need to try to pursue this further with our contact at the state.
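The filename mismatch is easy to see if you sketch the naming convention (illustrative code; the real flattening script may differ):

```python
from datetime import date

def standardized_filename(election_date, state, race_type):
    """Illustrative sketch of OpenElections-style file naming:
    YYYYMMDD__{state}__{race_type}.csv"""
    return "%s__%s__%s.csv" % (
        election_date.strftime("%Y%m%d"), state, race_type)

# The API says the 2000 general was on 2000-11-07, but the database
# dump says 2000-11-04 -- so the expected and actual filenames disagree,
# and fetching the expected name 404s.
expected = standardized_filename(date(2000, 11, 7), "wa", "general")
actual = standardized_filename(date(2000, 11, 4), "wa", "general")
```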
Finally, in general, you're right in thinking that we want to avoid scraping PDFs like the plague, so if there is CSV, XLS, text, or HTML that we can parse, we definitely want to use that.
Unfortunately, I think we're stuck with PDF scraping for precinct-level results. In 2000, the general was held on November 7, the February primary was held on February 29, and the September primary was held on September 19. I'd check out http://www.thegreenpapers.com/; I haven't found anything incorrect with their data so far. A lot of weird things happened with Washington's voting system in the last 20 years (http://www.sos.wa.gov/elections/timeline/time5.htm), so things can be a bit wacky.
The fetch task appears to get 404 errors for some WA URLs and writes the error message into the output files.
The fetch logic should probably be updated to do more robust error handling, but the immediate need is to update the Washington Datasource/url_paths.csv so we don't get the 404s.
This was originally reported via email by @EricLagerg