
Updates for 2019 #38

Closed
52 tasks done
jamesturk opened this issue Dec 18, 2018 · 21 comments

@jamesturk (Member) commented Dec 18, 2018:

Akin to openstates/openstates-scrapers#2681, this is the master ticket for 2019 updates.
Right now the process is a bit rough, but it's described here: https://github.com/openstates/people#updating-an-entire-state-via-a-scrape

Please reference this issue in any PRs 🙂

Remaining States as of 1/16:

  • Alaska
  • Connecticut
  • Florida
  • New Mexico
  • North Carolina - legislator pages 500ing as of 1/3
  • Oregon
  • Washington

Completed:

  • Alabama
  • Arizona
  • Arkansas
  • California
  • Colorado
  • Delaware
  • District of Columbia
  • Georgia
  • Hawaii
  • Idaho
  • Illinois
  • Indiana
  • Iowa
  • Kansas
  • Kentucky
  • Louisiana
  • Maine
  • Maryland
  • Massachusetts
  • Michigan
  • Minnesota
  • Mississippi
  • Missouri
  • Montana
  • Nebraska
  • Nevada
  • New Hampshire
  • New Jersey
  • New York
  • North Dakota
  • Ohio
  • Oklahoma
  • Pennsylvania
  • Puerto Rico
  • Rhode Island
  • South Carolina
  • South Dakota
  • Tennessee
  • Texas
  • Utah
  • Vermont
  • Virginia
  • West Virginia
  • Wisconsin - 2017 data as of 1/3
  • Wyoming
jamesturk mentioned this issue Dec 18, 2018
@jamesturk (Member Author) commented Dec 18, 2018:

Worth noting:
2018 House-only elections: KS, MN, NM, SC
2018 no elections: LA, MS, NJ, VA
(We should still check them, as there are likely a few specials.)

@jamesturk (Member Author) commented:

DC is all incumbents, all good

@jamesturk (Member Author) commented:

Thanks a ton for all the help so far @csnardi - if you're working on any states and want to comment here to claim them, that'd be great so we don't duplicate effort 😄

@nickoneill (Contributor) commented:

Hi there - if I wanted to get started helping out by scraping some states, which state should I start with so as not to step on anyone else's toes?

@csnardi (Contributor) commented Jan 3, 2019:

I don't have any state in progress right now, so you're probably good to start wherever!

@nickoneill (Contributor) commented:

Cool, I will take a look at Nevada this afternoon.

@nickoneill (Contributor) commented:

NV is throwing 500 errors on the endpoints that were previously used. The website seems to display the latest reps, though, so the scraper might need an update. I'm going to try to find a first state that works before I dive into editing a scraper.

WA seems like it runs using docker-compose run --rm scrape wa --fastmode --scrape, but only produces empty json files:

Starting openstates_database_1 ... done
loaded Open States pupa settings...
wa (scrape)
19:32:35 INFO pupa: save jurisdiction Washington as jurisdiction_ocd-jurisdiction-country:us-state:wa-government.json
19:32:35 INFO pupa: save organization Washington State Legislature as organization_52aa27f8-0f8e-11e9-8c2b-0242ac140003.json
19:32:35 INFO pupa: save organization Senate as organization_52aa2e2e-0f8e-11e9-8c2b-0242ac140003.json
19:32:35 INFO pupa: save organization House as organization_52aa3162-0f8e-11e9-8c2b-0242ac140003.json
wa (scrape)
jurisdiction scrape:
  duration:  0:00:00.022996
  objects:
    jurisdiction: 1
    organization: 3

Going through the scraper contributing guide, it says that after this output "And then the actual data scraping begins", with logged GET requests and so on. It doesn't seem to be doing that at all. Any idea where to start debugging this?

jamesturk mentioned this issue Jan 3, 2019
@jamesturk (Member Author) commented:

@nickoneill WA might need scraper updates too; it might be a small change to the page that is causing us to miss whatever data is there. I can take a look at one of those if you want to move on and find one that runs without issue for now.

@jamesturk (Member Author) commented Jan 3, 2019:

Also, possibly of interest to you two (& anyone else looking to take a crack at this): I added a few experimental command-line flags to the merge script:

  --remove-identical / --no-remove-identical
                                  In incoming mode, remove identical files.
  --copy-new / --no-copy-new      In incoming mode, copy brand new files over.
  --interactive / --no-interactive
                                  Do interactive merges.

These are all fairly rough right now, but Colorado was a lot faster when I ran with --copy-new --remove-identical the first time, and then --interactive after that to take care of the rest (you should hit a to abort when the merge candidates get really bad). That got me about 85% of the way there without a ton of manual file moving/etc.

@nickoneill (Contributor) commented:

@jamesturk Thanks, I mostly wanted to check that I wasn't missing something really obvious. I will dig around for something that works on its own for starters.

@jamesturk (Member Author) commented:

np, working on SD now (just a heads up)

@nickoneill (Contributor) commented:

Ah, I didn't realize I had to manually request people to get it to fetch them. I'm moving forward on AZ, will open a PR with more questions about how the person merge process works.
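
For anyone else hitting the empty WA output above: pupa only runs the scrapers named on the command line, which is why the earlier run saved only jurisdiction/organization objects. A hedged example of requesting the people scraper explicitly, assuming the docker-compose scrape service forwards its arguments to pupa update and that the jurisdiction registers a scraper named people:

```shell
# Hypothetical invocation: name the "people" scraper explicitly so pupa
# actually runs it (with no scraper named, only jurisdiction/org
# objects get saved, as in the WA output above)
docker-compose run --rm scrape wa people --fastmode --scrape
```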

@nickoneill (Contributor) commented:

Actually, more questions here first: I ran the merge helper and have some assumptions about what to do next.

For people who are the same (like this with a 0.70 score), update the file in the data directory.
0.70 data/az/people/Ben-Toma-6d71c8d3-c677-4f4b-96eb-411fbc464c2c.yml incoming/az/people/Ben-Toma-74b10855-2da0-4a63-bf6a-620c8ab4a6e8.yml

  • Why do they have new IDs? Which one should I keep?

For people who are not the same (like this pair with a 0.10 score), should I move the unused file to retired and add the new file?
data/az/people/Ken-Clark-a4d3ecd4-8bfc-4e6c-8290-a557bc347680.yml incoming/az/people/Amish-Shah-0147aaf4-bd4a-4a55-8003-684e3e60f522.yml

@csnardi (Contributor) commented Jan 3, 2019:

The new IDs are because they're just GUIDs; there's no central database of legislator IDs like OCD has for Divisions, etc. So when you run the scraper, it creates a new GUID, since it doesn't know any better. The old ID should be kept, since it'll match up with old data.

For people who are not the same, you're likely correct: you'd just want to retire the old legislator and bring in the new one.

I've created a simple script that can automate a lot of the process. It's not as complex and smart as merge.py, but it can largely perform all of the tasks, as long as you check over the results at the end and update the end/start dates: https://gist.github.com/csnardi/518cf39c0d0e909132f8ddec6f3817e9
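
To make the ID question above concrete, a minimal Python sketch of the keep-the-old-GUID rule; name_similarity is a hypothetical stand-in for merge.py's match score (the real scoring weighs more than the name), and the illustrative field values are made up:

```python
import difflib
import uuid

def name_similarity(a: str, b: str) -> float:
    # Hypothetical stand-in for merge.py's match score:
    # plain string similarity between the two names.
    return difflib.SequenceMatcher(None, a, b).ratio()

def merge_person(existing: dict, incoming: dict) -> dict:
    # Take the freshly scraped fields, but keep the existing GUID so the
    # record still lines up with historical data -- the scraper minted a
    # random new one because it has no way to know the old ID.
    merged = dict(incoming)
    merged["id"] = existing["id"]
    return merged

# Each scrape run generates fresh GUIDs, hence the differing filenames/IDs
old = {"id": "ocd-person/" + str(uuid.uuid4()), "name": "Ben Toma"}
new = {"id": "ocd-person/" + str(uuid.uuid4()), "name": "Ben Toma", "district": "22"}
merged = merge_person(old, new)
assert merged["id"] == old["id"]  # old ID wins; scraped fields come along
```

A high-scoring pair like the 0.70 Ben Toma match gets merged this way; a low-scoring pair like Ken Clark / Amish Shah gets the retire-and-add treatment instead.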

jamesturk mentioned this issue Jan 3, 2019
@nickoneill (Contributor) commented:

Washington has not been updated yet; they convene on January 14th, 2019.

@jamesturk (Member Author) commented:

taking another swing at NH next

@jamesturk (Member Author) commented:

AK finishes the batch. There are still tweaks to make, but I'm closing this in favor of individual issues as we go.

jamesturk unpinned this issue Feb 7, 2019