Implement name conversion strategy for raw results files #5

Closed
zstumgoren opened this issue Jul 26, 2013 · 5 comments
@zstumgoren

Create a module to standardize names of raw result files. Raw results will be stored on S3 using the standardized name.

Standardized names should:

  • be resolvable back to raw file names
  • encapsulate enough information about the contained results to link up to metadata via API

Naming Convention

See #4 for details on naming convention

Standardization should generate a composite file name that reflects metadata captured in our data admin.

File name components should include:

  • election date - YYYYMMDD
  • state - postal code
  • race type - general, primary-dem, primary runoff-dem, etc.
  • jurisdiction - OCD id of the jurisdiction, or geographic area, for which results are provided. For example, a file for MD that contains precinct-level results for Anne Arundel County could use a slugified version of the plain OCD name, such as anne_arundel_county.
  • race_code - denotes the type of race covered in the data file. Optional element that should only be used when a state provides data for a single race in a distinct file. For example, Louisiana provides precinct-level results, by parish, for each race. This field could also be expanded, on a state-by-state basis, to handle arbitrary groupings of results (e.g. separate files for state leg., federal, local).
  • reporting level - precinct, city, county, state, etc.
  • file type extension - db, csv, html, json, xml, etc.

Format

File name components separated by double underscores; component sub-parts separated by single underscores.

<YYYYMMDD>__<state>__<race_type>__<jurisdiction>__[<race_code>__]<level>.<ext>
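A minimal sketch of how the format above could be assembled, not the project's actual implementation. The function names `slugify` and `standardized_filename` are hypothetical, and the sketch assumes the caller already passes the date as YYYYMMDD and the race type in its final form (e.g. primary-dem).

```python
import re


def slugify(text):
    """Lowercase text and collapse runs of non-alphanumerics into single underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def standardized_filename(election_date, state, race_type, jurisdiction,
                          level, ext, race_code=None):
    """Join the components with double underscores; race_code is optional.

    Hypothetical helper: election_date is assumed to already be YYYYMMDD,
    and race_type is used as given (only lowercased) so hyphens survive.
    """
    parts = [election_date, state.lower(), race_type.lower(), slugify(jurisdiction)]
    if race_code:
        parts.append(slugify(race_code))
    parts.append(level.lower())
    return "__".join(parts) + "." + ext.lower()


print(standardized_filename("20121106", "MD", "general",
                            "Allegany County", "precinct", "csv"))
# 20121106__md__general__allegany_county__precinct.csv
```

Sub-parts of a component (such as the words of a jurisdiction name) end up separated by single underscores, so the double underscore stays unambiguous as the component separator.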

Examples

Louisiana Congressional District 1 precinct level results, Jefferson Davis Parish
20121106__la__general__jefferson_davis_parish__cd_1__precinct.html

Allegany County precinct results for general election (contains multiple race types)
20121106__md__general__allegany_county__precinct.csv
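Going the other way, a name in this format can be split back into its components so it can be matched against metadata. The sketch below is illustrative only: `parse_standardized_filename` is a hypothetical name, and it assumes that five components mean no race_code and six components mean race_code is present.

```python
def parse_standardized_filename(name):
    """Split a standardized file name into a dict of its components.

    Hypothetical helper: assumes exactly 5 components (no race_code)
    or 6 components (with race_code) before the extension.
    """
    stem, _, ext = name.rpartition(".")
    parts = stem.split("__")
    if len(parts) == 5:
        date, state, race_type, jurisdiction, level = parts
        race_code = None
    elif len(parts) == 6:
        date, state, race_type, jurisdiction, race_code, level = parts
    else:
        raise ValueError("unexpected number of components in %r" % name)
    return {
        "election_date": date,
        "state": state,
        "race_type": race_type,
        "jurisdiction": jurisdiction,
        "race_code": race_code,
        "level": level,
        "ext": ext,
    }


info = parse_standardized_filename(
    "20121106__md__general__allegany_county__precinct.csv")
print(info["jurisdiction"], info["level"])
# allegany_county precinct
```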

Implementation

The standardized name should be generated during the file download process (in state-specific fetch.py modules).

Each state directory should have a 2-column mappings.txt file that contains the standardized name and a link to the raw result file. The raw link should point to the result file located at the source agency or to a copy of the raw file archived on S3. The latter would be used in cases where result files are not scrapable (e.g. if an agency provided a database dump).

## mappings.txt ##
standard_name, raw_source_name
20121106__md__general__anne_arundel_county__precinct.csv, http://www.elections.state.md.us/elections/2012/election_data/Anne_Arundel_By_Precinct_2012_General.csv
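
A rough sketch of how a file laid out like the sample above could be read into a lookup from standardized name to raw source. `load_mappings` is a hypothetical name. The sketch assumes that lines starting with "#" are comments, that the "standard_name" row is a header, and that raw links contain no commas.

```python
import csv


def load_mappings(lines):
    """Build a {standard_name: raw_source} dict from mappings.txt lines.

    Hypothetical helper. Assumes '#' lines are comments, the
    'standard_name' row is a header, and there are two columns per row.
    """
    mappings = {}
    # skipinitialspace drops the space that follows each comma in the sample
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        if row[0].strip() == "standard_name":
            continue
        mappings[row[0].strip()] = row[1].strip()
    return mappings


sample = [
    "## mappings.txt ##\n",
    "standard_name, raw_source_name\n",
    "20121106__md__general__anne_arundel_county__precinct.csv, "
    "http://www.elections.state.md.us/elections/2012/election_data/"
    "Anne_Arundel_By_Precinct_2012_General.csv\n",
]
print(load_mappings(sample))
```

In practice the `lines` argument could be an open file handle for a state's mappings.txt, since `csv.reader` accepts any iterable of strings.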
@ghost ghost assigned zstumgoren Jul 26, 2013
@dwillis

dwillis commented Jul 26, 2013

Would specify that race_code should only be used when files are specific to a single race.

@zstumgoren

Yep, for the time being that's the way to go. If necessary down the road, we could expand its usage to account for arbitrary partitioning of results, for example if a state partitioned results by race type into separate files for state leg, federal, local. Can't think of any examples of that right now, so we can deal with that on a state-by-state basis if the need arises. In the meantime, I'll tweak the note next to the race_code field.

@dwillis

dwillis commented Aug 9, 2013

What about where a file contains not a single jurisdiction but multiple ones? For example, MD's results by state legislative district are in a single file with all districts. Proposing something like: 20121106__md__general__state_legislative.csv

@dwillis

dwillis commented Aug 9, 2013

Where in the process should the writing to mappings.txt occur? What happens when we run the fetcher again - should the script check the file to see if the mapping is already in there? If so, should we think about moving away from text files?
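
To make the re-run question concrete, here is one way the check could look if the mapping stays in a text file. This is only a sketch under assumptions, not something settled in this thread: `record_mapping` is a hypothetical name, and it treats the first column as the unique key.

```python
from pathlib import Path


def record_mapping(path, standard_name, raw_source):
    """Append a mapping unless standard_name is already listed.

    Hypothetical helper: returns True if a line was written, False if
    the name was already present, so re-running a fetcher is harmless.
    """
    path = Path(path)
    text = path.read_text() if path.exists() else ""
    # First column of every existing line is treated as the key
    existing = {line.split(",", 1)[0].strip() for line in text.splitlines()}
    if standard_name in existing:
        return False
    with path.open("a") as handle:
        if text and not text.endswith("\n"):
            handle.write("\n")  # keep the new entry on its own line
        handle.write("%s, %s\n" % (standard_name, raw_source))
    return True
```

Calling it twice with the same standardized name writes the line once. It rereads the whole file on every call, which is fine for small files but is part of why a different store might be worth considering.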

@ghing

ghing commented Oct 3, 2014

@zstumgoren, @dwillis It seems like this can be closed as it's been addressed by the Datasource API.

@dwillis dwillis closed this as completed Oct 3, 2014