Improve parser detection of unhandled content #80

Open
wants to merge 17 commits into
base: master

@struan struan commented Mar 6, 2017

The parser now tracks all the tags it sees as it goes using tag IDs and then compares those to a list of IDs extracted using XPath. If there is a difference between the lists it throws an Exception.

There are also a number of parser improvements in here, found while making sure that it parsed things correctly:

  • Fix for the parser failing to pick up all the text if there is more than one hs_Para element inside a Question tag
  • Fixes broken table parsing code
  • Fixes missing some content inside division tags
  • Correctly handles clause tags to be part of the immediately following Amendment
  • Makes hs_2cDebatedMotion a major heading
  • Fixes missing some content inside new debate tags.

It also adds a script to make re-parsing easier.

Fixes #54
Fixes #66
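
The seen-ID check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `UID` attribute name is taken from the commit messages, while `TrackingParser`, `HANDLED` and `UnhandledContentError` are hypothetical names, and the standard library's `xml.etree.ElementTree` stands in for whatever XML library the parser actually uses.

```python
import xml.etree.ElementTree as ET


class UnhandledContentError(Exception):
    """Raised when the document contains tag IDs the parser never visited."""


class TrackingParser:
    # Hypothetical set of tag names this sketch knows how to handle.
    HANDLED = {"Question", "hs_Para"}

    def __init__(self):
        self.seen_ids = set()

    def parse(self, root):
        for el in root:
            self.handle(el)
        self.check_all_seen(root)

    def handle(self, el):
        if el.tag not in self.HANDLED:
            return  # silently skipped: the case the final check should catch
        uid = el.get("UID")
        if uid is not None:
            self.seen_ids.add(uid)
        for child in el:
            self.handle(child)

    def check_all_seen(self, root):
        # Every ID in the document, gathered independently of the parse.
        all_ids = {el.get("UID") for el in root.findall(".//*[@UID]")}
        missing = all_ids - self.seen_ids
        if missing:
            raise UnhandledContentError(
                "unhandled tag IDs: %s" % sorted(missing)
            )
```

Collecting the full ID list with a separate query, rather than during the parse itself, is what makes the comparison useful: a tag the parser never reaches still shows up in the second list.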

@@ -617,6 +617,15 @@ def parse_question(self, question):

        p.text = re.sub('\n', ' ', text)
        tag.append(p)

        if len(para) > 1:
            for p in para:

Does this not double the output in the cases it's trying to catch? e.g.

<Question><hs_Para><Number>Q2</Number>.
<Uin>[908984]</Uin>
<Member><B>Mr
Steve Reed</B> (Croydon North) (Lab):</Member>
<QuestionText></QuestionText>I
add my condolences to those already expressed about the former Father
of the House, and I welcome
my<hs_TimeCode time="2017-03-01T12:21:31"></hs_TimeCode> new hon.
Friend the Member for Stoke-on-Trent Central (Gareth Snell) to his
place.</hs_Para><hs_Para>Young
black men who use mental health services are more likely than other
people to be subject to detention, extreme forms of medication and
severe physical restraint, and, in extreme cases, this has led to
death, including that of my constituent Seni Lewis. Too many black
people with mental ill health are afraid to seek treatment from a
service they fear will not treat them fairly. Will the Prime Minister
meet me and some of the affected families to discuss the need for an
inquiry into institutional racism in the mental health
service<hs_TimeCode time="2017-03-01T12:22:18"></hs_TimeCode>?</hs_Para></Question>

The following-sibling would catch the first para after QuestionText, and then this loop would catch it again.
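
The concern can be reproduced in isolation with a small sketch. It is a reconstruction under assumptions, not the code under review: the first pass stands in for the following-sibling lookup (the actual XPath expression is not shown here), `collect_text` and `SAMPLE` are hypothetical, and the `dedupe` flag shows one possible guard.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the example above: two hs_Para in one Question.
SAMPLE = (
    "<Question>"
    "<hs_Para><QuestionText></QuestionText>First part.</hs_Para>"
    "<hs_Para>Second part.</hs_Para>"
    "</Question>"
)


def collect_text(question, dedupe):
    paras = question.findall("hs_Para")
    out = []
    emitted = set()  # id() of paragraphs already written out

    def emit(p, text):
        if dedupe and id(p) in emitted:
            return
        emitted.add(id(p))
        out.append(text.strip())

    # Pass 1: the text after QuestionText, then the paragraphs that follow.
    first = paras[0]
    emit(first, first.find("QuestionText").tail or "")
    for p in paras[1:]:
        emit(p, "".join(p.itertext()))

    # Pass 2: the extra loop for questions with more than one paragraph.
    if len(paras) > 1:
        for p in paras[1:]:
            emit(p, "".join(p.itertext()))
    return out
```

With `dedupe=False` the second paragraph appears twice in the result; with `dedupe=True` each paragraph is emitted once.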

Fix for the parser failing to pick up all the text if there is more than
one hs_Para element inside a Question tag
Store the UID and HRSContentID of handled tags so we can later compare
to a list of all IDs in the document
Get a list of all tag IDs in the document and compare to the list we've
processed and throw an exception if they don't match.
@struan struan force-pushed the question-parsing-missing-paras branch from 9f0e0a4 to d511b3e Compare March 16, 2017 09:48
@struan struan changed the title from "handle mutiple para tags in a debate question" to "Improve parser detection of unhandled content" Mar 16, 2017
Copes with tags that are mostly processed from inside another tag
@struan struan force-pushed the question-parsing-missing-paras branch 2 times, most recently from c4476de to f96f8f3 Compare March 17, 2017 10:14
There are lots of tags that we don't directly parse, as we're interested in
sub tags or they are parsed as part of the parent. Mark these as seen.
We didn't use namespaces before so they weren't being parsed properly.
Correct this and track the tags.
Make sure we are coping with questions where part of the question isn't
in the tail of QuestionText but is in following tags.

Also cope with oddities like multiple question number tags.
Clause tags actually relate to the text after them so ignore them at the
top level and then go back and parse them as part of the following
heading tag. Then add them as the first part of the first speech under
the heading.

Fixes #53
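
A rough sketch of the clause handling described in this commit message. The commit goes back from the heading to pick up the clause tags; this illustration swaps in a simpler forward buffer that gives the same ordering. `attach_clauses` and the `(kind, text)` tuples are hypothetical stand-ins for the real tag objects.

```python
def attach_clauses(tags):
    """Group (kind, text) pairs into (heading, paragraphs) sections.

    Clause text is held back until the next heading and then becomes
    the first content of that heading's first speech.
    """
    pending = []   # clause text waiting for its heading
    sections = []  # list of (heading, [paragraph, ...])
    for kind, text in tags:
        if kind == "clause":
            pending.append(text)
        elif kind == "heading":
            sections.append((text, list(pending)))
            pending = []
        elif kind == "para":
            if not sections:
                raise ValueError("paragraph before any heading")
            sections[-1][1].append(text)
    return sections
```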
If there's more than one heading or procedure in a new debate tag then
make those into paragraphs in the first speech of the debate.
Rather than just parsing it all into a single line of text, parse all the
paragraphs and indents so that we try to retain a bit more structure.
Scans the list of seen files and then picks out the latest one
and then re-parses that. Assumes that the files are ordered in
date order in the list.
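
The selection step of that script might look something like this. It is a sketch only: `latest_seen_file` is a hypothetical name, and it assumes the seen list is a sequence of lines with one file per line, oldest first.

```python
def latest_seen_file(seen_lines):
    """Return the last non-blank entry, assuming the list is in date order."""
    entries = [line.strip() for line in seen_lines if line.strip()]
    if not entries:
        raise ValueError("no seen files recorded")
    return entries[-1]
```

Because it relies on list order rather than parsing dates out of the entries, it picks the wrong file if the list is ever written out of order, which is the assumption the commit message calls out.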
@struan struan force-pushed the question-parsing-missing-paras branch from f96f8f3 to 97d679c Compare March 17, 2017 12:40
        )
        for t in following_tags:
            tag_name = self.get_tag_name_no_ns(t)
            self.handle_tag(tag_name, t)

I've adapted part of this commit in master to fix a recent issue. Note this doesn't fully work, in that any subsequent paragraphs would become a new no-speaker speech. What I've done in e8acc13 is make sure this uses new_speech() so current_speech is set and then they'll be attached correctly. This simplifies the function a bit too.
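
The effect of setting `current_speech` can be shown with a small sketch. Only the names `new_speech` and `current_speech` come from the comment above; `SpeechBuilder`, `add_para` and the output element names are hypothetical, and this is not the code in e8acc13.

```python
import xml.etree.ElementTree as ET


class SpeechBuilder:
    def __init__(self):
        self.root = ET.Element("debate")
        self.current_speech = None

    def new_speech(self, speaker=None):
        """Start a speech and remember it, so later paragraphs attach to it."""
        speech = ET.SubElement(self.root, "speech")
        if speaker:
            speech.set("by", speaker)
        self.current_speech = speech
        return speech

    def add_para(self, text):
        # Only open a no-speaker speech when nothing is in progress.
        if self.current_speech is None:
            self.new_speech()
        p = ET.SubElement(self.current_speech, "p")
        p.text = text
```

If the speech were created without recording it in `current_speech`, each later call to `add_para` would open its own no-speaker speech, which is the behaviour described in the comment.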

@dracos dracos force-pushed the master branch 6 times, most recently from bc05e4e to cf4da9e Compare March 20, 2023 09:46
Development

Successfully merging this pull request may close these issues.

Improve robustness of parsing Missing Lords speech
2 participants