Updates for 2019 #38

Closed
jamesturk opened this issue Dec 18, 2018 · 21 comments

@jamesturk jamesturk commented Dec 18, 2018

Akin to openstates/openstates#2681, this is the master ticket for 2019 updates.
Right now the process is a bit rough, but it is described here: https://github.com/openstates/people#updating-an-entire-state-via-a-scrape

Please reference this issue in any PRs 🙂

Remaining States as of 1/16:

  • Alaska
  • Connecticut
  • Florida
  • New Mexico
  • North Carolina - legislator pages 500ing as of 1/3
  • Oregon
  • Washington

Completed:

  • Alabama
  • Arizona
  • Arkansas
  • California
  • Colorado
  • Delaware
  • District of Columbia
  • Georgia
  • Hawaii
  • Idaho
  • Illinois
  • Indiana
  • Iowa
  • Kansas
  • Kentucky
  • Louisiana
  • Maine
  • Maryland
  • Massachusetts
  • Michigan
  • Minnesota
  • Mississippi
  • Missouri
  • Montana
  • Nebraska
  • Nevada
  • New Hampshire
  • New Jersey
  • New York
  • North Dakota
  • Ohio
  • Oklahoma
  • Pennsylvania
  • Puerto Rico
  • Rhode Island
  • South Carolina
  • South Dakota
  • Tennessee
  • Texas
  • Utah
  • Vermont
  • Virginia
  • West Virginia
  • Wisconsin - 2017 data as of 1/3
  • Wyoming

@jamesturk jamesturk commented Dec 18, 2018

worth noting:
2018 House-only: KS, MN, NM, SC
2018 No elections: LA, MS, NJ, VA
(we should still check them as there are likely a few specials)

@jamesturk jamesturk commented Jan 2, 2019

DC is all incumbents, all good

@jamesturk jamesturk commented Jan 2, 2019

Thanks a ton for all the help so far @csnardi. If you're working on any states, commenting here to claim them would be great so we don't duplicate effort 😄

@nickoneill nickoneill commented Jan 3, 2019

Hi there - if I wanted to get started helping to scrape some states, which state should I start with so I don't step on anyone else's toes?

@csnardi csnardi commented Jan 3, 2019

I don't have any state in progress right now, so you're probably good to start wherever!

@nickoneill nickoneill commented Jan 3, 2019

Cool, I will take a look at Nevada this afternoon.

@nickoneill nickoneill commented Jan 3, 2019

NV is throwing 500 errors on the endpoints that were previously used. The website seems to display the latest reps though, so the scraper might need an update. I'm going to try to find a first state that works before I dive into editing a scraper.

WA seems to run with docker-compose run --rm scrape wa --fastmode --scrape, but it only produces empty JSON files:

Starting openstates_database_1 ... done
loaded Open States pupa settings...
wa (scrape)
19:32:35 INFO pupa: save jurisdiction Washington as jurisdiction_ocd-jurisdiction-country:us-state:wa-government.json
19:32:35 INFO pupa: save organization Washington State Legislature as organization_52aa27f8-0f8e-11e9-8c2b-0242ac140003.json
19:32:35 INFO pupa: save organization Senate as organization_52aa2e2e-0f8e-11e9-8c2b-0242ac140003.json
19:32:35 INFO pupa: save organization House as organization_52aa3162-0f8e-11e9-8c2b-0242ac140003.json
wa (scrape)
jurisdiction scrape:
  duration:  0:00:00.022996
  objects:
    jurisdiction: 1
    organization: 3

The scraper contributing guide says that after this output, "And then the actual data scraping begins", with logged GET requests and so on. That doesn't seem to be happening at all. Any idea where to start debugging this?

@jamesturk jamesturk commented Jan 3, 2019

@nickoneill WA might need scraper updates too. It might be a small change to the page that is causing us to miss whatever data is there. I can take a look at one of those if you want to move on and try to find one that runs without issue for now.

@jamesturk jamesturk commented Jan 3, 2019

Also, possibly of interest to you guys (and anyone else looking to take a crack at this): I added a few experimental command line flags to the merge script.

  --remove-identical / --no-remove-identical
                                  In incoming mode, remove identical files.
  --copy-new / --no-copy-new      In incoming mode, copy brand new files over.
  --interactive / --no-interactive
                                  Do interactive merges.

These are all fairly rough right now, but Colorado was a lot faster when I ran with --copy-new --remove-identical the first time and then --interactive after that to take care of the rest (you should hit a to abort when the merge candidates get really bad).

it got me about 85% of the way there without a ton of manual file moving/etc.
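
The first pass described above can be pictured roughly as follows. This is only a sketch of the idea, not the merge script's actual code: the function names, the dict layout, and the rule that a record is "brand new" when no existing record shares its name are all assumptions made for illustration, and the sample people are made up.

```python
# Hypothetical sketch of a "remove identical / copy new" first pass.
# Not taken from the openstates/people merge script.

def comparable(record):
    """Copy of the record without 'id', since each scrape mints fresh IDs."""
    return {key: value for key, value in record.items() if key != "id"}


def first_pass(existing, incoming):
    """Split incoming records into identical, brand new, and needs-review.

    Assumption: records are dicts with at least 'id' and 'name', and a
    record counts as brand new when no existing record has the same name.
    """
    existing_by_name = {rec["name"]: rec for rec in existing}
    identical, brand_new, needs_review = [], [], []
    for rec in incoming:
        old = existing_by_name.get(rec["name"])
        if old is None:
            brand_new.append(rec)        # candidate for copying over as-is
        elif comparable(old) == comparable(rec):
            identical.append(rec)        # nothing changed, safe to drop
        else:
            needs_review.append(rec)     # left for an interactive pass
    return identical, brand_new, needs_review


# Made-up sample data, for illustration only.
existing = [
    {"id": "old-1", "name": "Pat Example", "district": "1"},
    {"id": "old-2", "name": "Sam Sample", "district": "2"},
]
incoming = [
    {"id": "new-1", "name": "Pat Example", "district": "1"},
    {"id": "new-2", "name": "Sam Sample", "district": "3"},
    {"id": "new-3", "name": "Lee Newcomer", "district": "4"},
]
identical, brand_new, needs_review = first_pass(existing, incoming)
```

Whatever falls into the last bucket is what would remain for the interactive pass.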

@nickoneill nickoneill commented Jan 3, 2019

@jamesturk Thanks, I mostly wanted to check that I wasn't missing something really obvious. I will dig around for something that works on its own for starters.

@jamesturk jamesturk commented Jan 3, 2019

np, working on SD now (just a heads up)

@nickoneill nickoneill commented Jan 3, 2019

Ah, I didn't realize I had to manually request people to get it to fetch them. I'm moving forward on AZ, will open a PR with more questions about how the person merge process works.

@nickoneill nickoneill commented Jan 3, 2019

Actually more questions here first: I ran the merge helper and have some assumptions about what to do next:

For people who are the same (like this with a 0.70 score), update the file in the data directory.
0.70 data/az/people/Ben-Toma-6d71c8d3-c677-4f4b-96eb-411fbc464c2c.yml incoming/az/people/Ben-Toma-74b10855-2da0-4a63-bf6a-620c8ab4a6e8.yml

  • Why do they have new IDs? Which one should I keep?

For people who are not the same (like this with a 0.10 score), move the unused file to retired and add the new file?
data/az/people/Ken-Clark-a4d3ecd4-8bfc-4e6c-8290-a557bc347680.yml incoming/az/people/Amish-Shah-0147aaf4-bd4a-4a55-8003-684e3e60f522.yml

@csnardi csnardi commented Jan 3, 2019

The new IDs are just GUIDs; there's no central database of legislator IDs like OCD has for Divisions/etc. So when you run the scraper, it creates a new GUID since it doesn't know any better. The old ID should be kept, since it'll match up with old data.

For people who are not the same, you're likely correct: you'd just want to retire the old legislator and bring in the new one.

I've created a simple script that can automate a lot of the process. It's not as complex or smart as merge.py, but it can largely perform all of the tasks as long as you check over the result at the end and update the end/start dates: https://gist.github.com/csnardi/518cf39c0d0e909132f8ddec6f3817e9
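
To make the two cases concrete, here is a minimal sketch, assuming each person is a plain dict with an "id" and a list of "roles". It is neither merge.py nor the gist above; the function names, field names, and sample values are hypothetical.

```python
import copy

# Hypothetical helpers illustrating the two cases; the field layout
# ("id", "roles", "end_date") is an assumption for this sketch.

def merge_same_person(old, new):
    """Same person: take the freshly scraped fields but keep the old ID."""
    merged = copy.deepcopy(new)
    merged["id"] = old["id"]  # the old ID matches up with older data
    return merged


def retire_person(old, end_date):
    """Different person: close out any open roles on the old record."""
    retired = copy.deepcopy(old)
    for role in retired.get("roles", []):
        role.setdefault("end_date", end_date)  # only fill in missing dates
    return retired


# Made-up records, for illustration only.
old = {"id": "old-guid", "name": "Pat Example", "roles": [{"district": "1"}]}
new = {"id": "new-guid", "name": "Pat Example", "roles": [{"district": "2"}]}

merged = merge_same_person(old, new)
retired = retire_person(old, "2019-01-14")
```

Both helpers return copies, so the original records stay available for a manual check of the end and start dates afterwards.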

@nickoneill nickoneill commented Jan 6, 2019

Washington has not been updated; they convene on January 14th, 2019.

@jamesturk jamesturk commented Jan 9, 2019

taking another swing at NH next

@jamesturk jamesturk commented Feb 7, 2019

AK finishes the batch, still tweaks to make but closing this in favor of individual issues as we go

@jamesturk jamesturk closed this Feb 7, 2019
@jamesturk jamesturk unpinned this issue Feb 7, 2019