TX not parsing legislators' names for all votes #1782

Closed
scichelli opened this issue Jun 12, 2017 · 7 comments

Comments

Contributor

@scichelli scichelli commented Jun 12, 2017

State: Texas

I'm working on finding the root cause; just logging a ticket here to make it easy to communicate status.

When scraping votes, the scraper does not always separate the list of names into individual records, and instead creates one record with all the legislators' names in a list.

For example, download the TX CSV from Open States Downloads and find the rows for vote TXV00004597. There are only three rows, one for each vote type, instead of one for each legislator, and the name on the 'yes' row is a single list: 'Bettencourt, Birdwell, Buckingham, Campbell, Creighton...' On the web, https://openstates.org/tx/votes/TXV00004597/ cannot display the list of votes with links to the legislators' pages; contrast with https://openstates.org/tx/votes/TXV00000001/, which correctly shows a table of votes.
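
To make the expected shape concrete, here is a minimal sketch of turning one combined row into one row per legislator. The `(vote_id, vote_type, names)` tuple layout and the function name `expand_vote_rows` are assumptions for illustration, not the actual CSV schema or scraper code.

```python
def expand_vote_rows(rows):
    """Yield one (vote_id, vote_type, name) tuple per legislator.

    Assumes each input row is (vote_id, vote_type, names), where names
    may hold several comma-separated last names (hypothetical layout).
    """
    for vote_id, vote_type, names in rows:
        for name in names.split(","):
            name = name.strip()
            if name:
                yield (vote_id, vote_type, name)


# One combined 'yes' row, as seen in the download (names truncated here).
rows = [("TXV00004597", "yes", "Bettencourt, Birdwell, Buckingham")]
expanded = list(expand_vote_rows(rows))
for row in expanded:
    print(row)
```

A plain comma split is only a starting point: it would wrongly break names that themselves contain a comma, such as a name with a "Jr." suffix.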

Contributor

@estaub estaub commented Sep 3, 2017

Is this fixed, or has it become even worse? The referenced vote (https://openstates.org/tx/votes/TXV00004597/) is no longer present, which may be good or bad.

Contributor Author

@scichelli scichelli commented Sep 4, 2017

Hi @estaub. I don't think it's worse, although I too noticed that some IDs that were previously present are now gone, which told me I don't yet understand the overall system well enough. Also, I've been struggling to get up to speed with Docker and Docker Compose so that I can isolate the Texas scraper's problems locally. That's why this issue is languishing.

The Texas scraper could certainly be improved, but perhaps it would be best to close this particular GitHub issue and open separate ones as we identify specific problems.

Flaws I'm aware of:

  • In many House journals, R. Anderson's name has the wrong punctuation, which makes it split into separate names. You'll see Anderson, C.; Anderson; R.;, where the semicolon between Anderson and R. ought to be a comma. Our scraper is tripped up by this typo.
  • When the Senate has duplicate last names, the journal writes them like Taylor of Galveston, and the scraper doesn't handle this.
  • It doesn't match a vote event back to the legislator when the legislator's name contains Unicode characters, e.g., Rodr\u00edguez.
  • I've been told that Eddie Lucio, Jr., and Eddie Lucio, III, are not getting treated as separate people.
  • A subject matter expert on our team told me the Bill Subjects do not contain the text she expects. I haven't dug into this.
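
A rough sketch of how the first three flaws might be handled, using only the standard library. All function names here are hypothetical and none of this is the scraper's existing code; it assumes House names are separated by semicolons and that a fragment consisting of a single initial belongs to the name before it.

```python
import re
import unicodedata

INITIAL_ONLY = re.compile(r"^[A-Z]\.$")
OF_PLACE = re.compile(r"^(?P<last>.+?) of (?P<place>.+)$")


def split_house_names(text):
    """Split a semicolon-separated list, repairing 'Anderson; R.' typos."""
    names = []
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        if INITIAL_ONLY.match(part) and names:
            # A lone initial is treated as the tail of the previous name.
            names[-1] = f"{names[-1]}, {part}"
        else:
            names.append(part)
    return names


def parse_senate_name(name):
    """Return (last_name, place); place is None when there is no 'of' part."""
    m = OF_PLACE.match(name)
    if m:
        return m.group("last"), m.group("place")
    return name, None


def match_key(name):
    """Build an accent-insensitive, case-insensitive key for comparisons."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


print(split_house_names("Anderson, C.; Anderson; R.;"))
# ['Anderson, C.', 'Anderson, R.']
print(parse_senate_name("Taylor of Galveston"))
# ('Taylor', 'Galveston')
print(match_key("Rodr\u00edguez") == match_key("Rodriguez"))
# True
```

Stripping accents for the comparison key is one possible design choice; matching on the exact Unicode string after NFC normalization would be stricter and may be preferable if two legislators differ only by an accent.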

I'm okay with closing this issue for the sake of housekeeping until more specific bugfix issues can be created.

Contributor

@estaub estaub commented Sep 4, 2017

@scichelli How about creating the point issues (or a bucket list issue for them) and closing this? (I'm not a project member, don't look to me for direction!)
@jamesturk ?

@mscarey mscarey commented Oct 10, 2017

I looked through the JSON bulk data downloads for names with "null" in the leg_id field. The top 3 names that failed in the 85th Legislature were Turner, White, and Miller. Each failed over 1500 times. Those could refer to these leg_id codes:

TXL000354: Chris Turner
TXL000355: Sylvester Turner
TXL000410: James White
TXL000508: 'White, Molly'
TXL000449: 'Miller, Rick'
TXL000315: Doug Miller
TXL000316: Sid Miller

"Taylor of Collin" (TXL000350: Van Taylor) and "Taylor of Galveston" (TXL000349: Larry Taylor) also failed over 300 times each.
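
A count like this can be reproduced with a short script. The sketch below assumes each bill JSON file has a `votes` list whose entries carry `yes_votes`, `no_votes`, and `other_votes` lists of `{"leg_id": ..., "name": ...}` objects; that layout, the helper names, and the sample data are assumptions for illustration.

```python
import json
from collections import Counter
from pathlib import Path

VOTE_KEYS = ("yes_votes", "no_votes", "other_votes")


def count_unmatched(bills):
    """Count how often each voter name appears with a null leg_id."""
    counts = Counter()
    for bill in bills:
        for vote in bill.get("votes", []):
            for key in VOTE_KEYS:
                for voter in vote.get(key, []):
                    if voter.get("leg_id") is None:
                        counts[voter.get("name")] += 1
    return counts


def load_bills(root):
    """Yield parsed JSON documents from every .json file under root."""
    for path in Path(root).rglob("*.json"):
        with open(path, encoding="utf-8") as fh:
            yield json.load(fh)


# Tiny in-memory sample standing in for the bulk download.
sample = [
    {
        "votes": [
            {
                "yes_votes": [
                    {"leg_id": None, "name": "Turner"},
                    {"leg_id": "TXL000410", "name": "White"},
                ],
                "no_votes": [{"leg_id": None, "name": "Turner"}],
                "other_votes": [{"leg_id": None, "name": "Taylor of Collin"}],
            }
        ]
    }
]
unmatched = count_unmatched(sample)
print(unmatched.most_common(3))
```

Against a real download, `count_unmatched(load_bills("path/to/bills"))` would take the place of the in-memory sample; the path is a placeholder.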

Right now, I don't understand the scrapers well enough to fix the issue.

Member

@jamesturk jamesturk commented Feb 14, 2018

we currently don't have any votes in the pupa DB, might be a regression


@jmillxyz jmillxyz commented Sep 5, 2018

I tried to dig into this a little bit today. That being said, I'm also very new to the project.

Could it be due to the TX FTP server rate-limiting whatever system feeds data into Open States? I hit that issue today doing a full scrape (bills, committees, and people; not even counting votes: #2499).

Member

@jamesturk jamesturk commented Jan 14, 2020

rolling into #3102

@jamesturk jamesturk closed this Jan 14, 2020