Implement name conversion strategy for raw results files #5

Closed
zstumgoren opened this Issue Jul 26, 2013 · 5 comments

Contributor

zstumgoren commented Jul 26, 2013

Create a module to standardize names of raw result files. Raw results will be stored on S3 using the standardized name.

Standardized names should:

  • be resolvable back to raw file names
  • encapsulate enough information about the contained results to link up to metadata via API

Naming Convention

See #4 for details on naming convention

Standardization should generate a composite file name that reflects metadata captured in our data admin.

File name components should include:

  • election date - YYYYMMDD
  • state - postal code
  • race type - general, primary-dem, primary runoff-dem, etc.
  • jurisdiction - OCD (Open Civic Data) id of the jurisdiction, or geographic area, for which results are provided. For example, a file for MD that contains precinct-level results for Anne Arundel County could use a slugified version of the plain OCD name (anne_arundel_county).
  • race_code - denotes the race covered in the data file. Optional element that should only be used when a state provides data for a single race in a distinct file. For example, Louisiana provides precinct-level results, by parish, for each race. This field could also be expanded, on a state-by-state basis, to handle arbitrary groupings of results (e.g. separate files for state leg., federal, local).
  • reporting level - precinct, city, county, state, etc.
  • file type extension - db, csv, html, json, xml, etc.

Format

File name components separated by double underscores; component sub-parts separated by single underscores.

<YYYYMMDD>__<state>__<race_type>__<jurisdiction>__[<race_code>__]<level>.<ext>
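A minimal sketch of how the format above could be assembled in Python. The function names `slugify` and `standardized_filename` are hypothetical, not part of any existing module, and the slug rule (lowercase, runs of non-alphanumerics collapsed to one underscore) is an assumption based on the examples in this issue.

```python
import re


def slugify(text):
    """Lowercase and collapse runs of non-alphanumerics to a single underscore (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def standardized_filename(election_date, state, race_type, jurisdiction,
                          level, ext, race_code=None):
    """Join components with double underscores; race_code is optional."""
    # race_type is passed through unchanged so values like "primary-dem" keep their hyphen.
    parts = [election_date, state.lower(), race_type, slugify(jurisdiction)]
    if race_code:
        parts.append(slugify(race_code))
    parts.append(level)
    return "__".join(parts) + "." + ext
```

Because `slugify` never emits two underscores in a row, the double underscore stays unambiguous as the component separator.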

Examples

Louisiana Congressional District 1 precinct level results, Jefferson Davis Parish
20121106__la__general__jefferson_davis_parish__cd_1__precinct.html

Allegany County precinct results for general election (contains multiple race types)
20121106__md__general__allegany_county__precinct.csv
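Since standardized names need to carry enough information to link up to metadata, here is a rough sketch of splitting a name back into its components. `parse_standardized_name` is a hypothetical helper, and it assumes names strictly follow the five- or six-component format described in this issue.

```python
import os


def parse_standardized_name(filename):
    """Split a standardized file name into its components (race_code may be None)."""
    stem, ext = os.path.splitext(filename)
    parts = stem.split("__")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected number of components: %r" % filename)
    election_date, state, race_type, jurisdiction = parts[:4]
    # A sixth component means the optional race_code sits before the reporting level.
    race_code = parts[4] if len(parts) == 6 else None
    return {
        "election_date": election_date,
        "state": state,
        "race_type": race_type,
        "jurisdiction": jurisdiction,
        "race_code": race_code,
        "level": parts[-1],
        "ext": ext.lstrip("."),
    }
```

For the Louisiana example this would give `race_code` of `cd_1` and `level` of `precinct`; for the Maryland example `race_code` would be `None`.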

Implementation

The standardized name should be generated during the file download process (in the state-specific fetch.py modules).

Each state directory should have a 2-column mappings.txt file that contains the standardized name and a link to the raw result file. The raw link should point to the result file located at the source agency or to a copy of the raw file archived on S3. The latter would be used in cases where result files are not scrapable (e.g. if the agency provided a database dump).

## mappings.txt ##
standard_name, raw_source_name
20121106__md__general__anne_arundel_county__precinct.csv, http://www.elections.state.md.us/elections/2012/election_data/Anne_Arundel_By_Precinct_2012_General.csv
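As an illustration, a file laid out like the sample above could be loaded with Python's csv module. `load_mappings` is a hypothetical name, and treating the `## mappings.txt ##` line as a comment and the `standard_name` line as a header row are assumptions about the file layout.

```python
import csv


def load_mappings(fileobj):
    """Return a dict of standardized name -> raw source link."""
    mappings = {}
    # skipinitialspace drops the blank that follows each comma in the sample layout.
    for row in csv.reader(fileobj, skipinitialspace=True):
        if len(row) < 2 or row[0].startswith("#") or row[0] == "standard_name":
            continue  # skip blank lines, comment lines, and the header row
        mappings[row[0].strip()] = row[1].strip()
    return mappings
```

Usage would be along the lines of `with open("mappings.txt") as f: mappings = load_mappings(f)`. A raw link that itself contains a comma would need quoting in the file for this to work.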

ghost assigned zstumgoren Jul 26, 2013

Contributor

dwillis commented Jul 26, 2013

Would specify that race_code should only be used when files are specific to single race.

Contributor

zstumgoren commented Jul 26, 2013

Yep, for the time being that's the way to go. If necessary down the road, we could expand its usage to account for arbitrary partitioning of results, for example if a state partitioned results by race type into separate files for state leg, federal, local. Can't think of any examples of that right now, so we can deal with it on a state-by-state basis if the need arises. In the meantime, I'll tweak the note next to the race_code field.

Contributor

dwillis commented Aug 9, 2013

What about where a file contains not a single jurisdiction but multiple ones? For example, MD's results by state legislative district are in a single file with all districts. Proposing something like: 20121106__md__general__state_legislative.csv

Contributor

dwillis commented Aug 9, 2013

Where in the process should the writing to mappings.txt occur? What happens when we run the fetcher again - should the script check the file to see if the mapping is already in there? If so, should we think about moving away from text files?
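To make the re-run question concrete, here is one possible sketch of a check-before-write step that a fetcher could call after each download. `add_mapping` is a hypothetical helper, and the plain-text, comma-separated layout is assumed from the mappings.txt sample in the issue description; this is not a decision on whether text files are the right store.

```python
import os


def add_mapping(path, standard_name, raw_source):
    """Append a mapping unless standard_name is already present.

    Returns True if a line was written, False if the mapping already existed.
    """
    existing = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                # The first column holds the standardized name.
                name = line.split(",", 1)[0].strip()
                if name:
                    existing.add(name)
    if standard_name in existing:
        return False
    with open(path, "a") as f:
        f.write("%s, %s\n" % (standard_name, raw_source))
    return True
```

Running the fetcher twice would then leave a single line per standardized name. The whole file is re-read on every call, which is the kind of cost that might argue for moving away from text files as the number of mappings grows.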

Contributor

ghing commented Oct 3, 2014

@zstumgoren, @dwillis It seems like this can be closed as it's been addressed by the Datasource API.

dwillis closed this Oct 3, 2014
