Improve parser detection of unhandled content #80

struan · 2017-03-06T17:30:10Z

The parser now records the ID of every tag it handles as it goes and then compares those to a list of all IDs extracted from the document using XPath. If the two lists differ, it throws an exception.
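As a rough illustration of the check described above, here is a minimal sketch. It is not the code in this PR: it uses the standard library's ElementTree rather than whatever XML library the parser uses, tracks only a `UID` attribute (the commits also mention `HRSContentID`), and `TrackingParser`, `mark_seen`, `all_tag_ids` and `UnhandledTagsError` are hypothetical names.

```python
import xml.etree.ElementTree as ET


class UnhandledTagsError(Exception):
    """Raised when the document contains tag IDs that were never marked as seen."""


def all_tag_ids(root):
    # Collect the UID of every descendant element that has one,
    # using a simple XPath query.
    return {el.get("UID") for el in root.iterfind(".//*[@UID]")}


class TrackingParser:
    """Toy parser (hypothetical) that only knows how to handle hs_Para."""

    def __init__(self):
        self.seen_ids = set()

    def mark_seen(self, el):
        uid = el.get("UID")
        if uid is not None:
            self.seen_ids.add(uid)

    def parse(self, root):
        for el in root:
            if el.tag == "hs_Para":
                self.mark_seen(el)
        # Anything in the document that we did not mark is unhandled content.
        missed = all_tag_ids(root) - self.seen_ids
        if missed:
            raise UnhandledTagsError(
                "unhandled tag IDs: " + ", ".join(sorted(missed))
            )
```

Tags that are processed as part of a parent rather than directly would need to call something like `mark_seen` as well, otherwise they show up as false positives.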

There are also a number of parser improvements in here, found in the process of making sure that it parsed things correctly:

Fixes the parser failing to pick up all the text when there is more than one hs_Para element inside a Question tag
Fixes broken table parsing code
Fixes some content inside division tags being missed
Correctly handles clause tags as part of the immediately following Amendment
Makes hs_2cDebatedMotion a major heading
Fixes some content inside new debate tags being missed

It also adds a script to make re-parsing easier.

Fixes #54
Fixes #66

dracos · 2017-03-07T12:45:36Z

pyscraper/new_hansard.py

@@ -617,6 +617,15 @@ def parse_question(self, question):

        p.text = re.sub('\n', ' ', text)
        tag.append(p)
+
+        if len(para) > 1:
+            for p in para:


Does this not double the output in the cases it's trying to catch? e.g.

<Question><hs_Para><Number>Q2</Number>. <Uin>[908984]</Uin> <Member><B>Mr Steve Reed</B> (Croydon North) (Lab):</Member> <QuestionText></QuestionText>I add my condolences to those already expressed about the former Father of the House, and I welcome my<hs_TimeCode time="2017-03-01T12:21:31"></hs_TimeCode> new hon. Friend the Member for Stoke-on-Trent Central (Gareth Snell) to his place.</hs_Para><hs_Para>Young black men who use mental health services are more likely than other people to be subject to detention, extreme forms of medication and severe physical restraint, and, in extreme cases, this has led to death, including that of my constituent Seni Lewis. Too many black people with mental ill health are afraid to seek treatment from a service they fear will not treat them fairly. Will the Prime Minister meet me and some of the affected families to discuss the need for an inquiry into institutional racism in the mental health service<hs_TimeCode time="2017-03-01T12:22:18"></hs_TimeCode>?</hs_Para></Question>

The following-sibling would catch the first para after QuestionText, and then this loop would catch it again.


dracos · 2017-12-13T13:24:51Z

pyscraper/new_hansard.py

+        )
+        for t in following_tags:
+            tag_name = self.get_tag_name_no_ns(t)
+            self.handle_tag(tag_name, t)


I've adapted part of this commit in master to fix a recent issue. Note this doesn't fully work, in that any subsequent paragraphs would become a new no-speaker speech. What I've done in e8acc13 is make sure this uses new_speech() so current_speech is set and then they'll be attached correctly. This simplifies the function a bit too.
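To illustrate why going through new_speech() matters, here is a toy sketch. It is not the code in e8acc13; `SpeechBuilder` and `add_para` are hypothetical, and only the names `new_speech` and `current_speech` come from the comment above. The point is simply that once `current_speech` is set, later paragraphs attach to it instead of starting a new no-speaker speech.

```python
class SpeechBuilder:
    """Toy model (hypothetical) of speech bookkeeping in a parser."""

    def __init__(self):
        self.speeches = []
        self.current_speech = None

    def new_speech(self, speaker=None):
        # Start a speech and remember it so later paragraphs attach to it.
        self.current_speech = {"speaker": speaker, "paras": []}
        self.speeches.append(self.current_speech)
        return self.current_speech

    def add_para(self, text):
        # Only fall back to a no-speaker speech when nothing is open.
        if self.current_speech is None:
            self.new_speech()
        self.current_speech["paras"].append(text)
```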

struan added the Reviewing label Mar 6, 2017

dracos reviewed Mar 7, 2017


struan added 3 commits March 10, 2017 12:30

handle multiple para tags in a debate question

681f228

Fix for the parser failing to pick up all the text if there is more than one hs_Para element inside a Question tag

Keep track of all the tags we've processed

12eef35

Store the UID and HRSContentID of handled tags so we can later compare to a list of all IDs in the document

throw an exception if it looks like we missed a tag

391caef

Get a list of all tag IDs in the document and compare to the list we've processed and throw an exception if they don't match.

struan force-pushed the question-parsing-missing-paras branch from 9f0e0a4 to d511b3e Compare March 16, 2017 09:48

struan changed the title ~~handle multiple para tags in a debate question~~ Improve parser detection of unhandled content Mar 16, 2017

mark as seen tags not always accessed via handle_tag

b7b303a

Copes with tags that are mostly processed from inside another tag

struan force-pushed the question-parsing-missing-paras branch 2 times, most recently from c4476de to f96f8f3 Compare March 17, 2017 10:14

struan added 12 commits March 17, 2017 10:52

parse english and welsh only division counts

4b90d64

Fixes #63

mark all the paragraph and time tags in a division as seen

e7cc27e

mark various bits of structure as seen

519f3c3

There are lots of tags that we don't directly parse, either because we're only interested in their sub tags or because they are parsed as part of the parent. Mark these as seen.

correctly parse tables

810f5b1

We didn't use namespaces before, so tables weren't being parsed properly. Correct this and track the tags.
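A small sketch of the kind of problem this describes, using the standard library's ElementTree. The namespace URI and the `row`/`entry` element names are made up for illustration and are not taken from the Hansard XML; `tag_name_no_ns` is a hypothetical helper in the spirit of `get_tag_name_no_ns`.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, for illustration only.
NS = {"t": "http://example.org/table"}

XML = """
<Table xmlns="http://example.org/table">
  <row><entry>Ayes</entry><entry>300</entry></row>
  <row><entry>Noes</entry><entry>250</entry></row>
</Table>
"""


def tag_name_no_ns(el):
    # "{http://example.org/table}row" -> "row"
    return el.tag.rsplit("}", 1)[-1]


def parse_table(table):
    # Namespace-aware lookup: the "t:" prefix is resolved through NS.
    rows = []
    for row in table.findall("t:row", NS):
        rows.append([entry.text for entry in row.findall("t:entry", NS)])
    return rows


table = ET.fromstring(XML)
```

With this input, `table.findall("row")` finds nothing because the elements live in a namespace, while `parse_table(table)` picks up both rows.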

improve question parsing and track the tags

c8fe25f

Make sure we are coping with questions where part of the question isn't in the tail of QuestionText but is in following tags. Also cope with oddities like multiple question number tags.

handle clause tags in the following heading

f0425a8

Clause tags actually relate to the text after them, so ignore them at the top level and then go back and parse them as part of the following heading tag. Then add them as the first part of the first speech under the heading. Fixes #53
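The deferral described here can be sketched roughly as follows. This is a simplified model and not the parser's code: it works on a flat list of `(kind, text)` tuples instead of XML elements, and `attach_clauses` is a hypothetical name.

```python
def attach_clauses(items):
    """Group items in document order into sections, one per heading.

    items is a list of (kind, text) tuples where kind is "clause",
    "heading" or "para". Clauses are held back until the next heading
    and become the first part of that heading's first speech.
    """
    pending = []
    sections = []
    for kind, text in items:
        if kind == "clause":
            pending.append(text)  # skip at the top level for now
        elif kind == "heading":
            sections.append({"heading": text, "speech": list(pending)})
            pending = []
        elif kind == "para" and sections:
            sections[-1]["speech"].append(text)
    return sections
```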

cope with multiple debate heading and procedure tags in new debate

cf011ad

If there's more than one heading or procedure in a new debate tag then make those into paragraphs in the first speech of the debate.

parse time tags inside a division tag

17b220b

gather up all the text inside a division tag

fb88a06

hs_2cDebatedMotion is actually a major heading

08af3bc

better parsing for Lords Amendments

dcb33e5

Rather than just parsing it all into a single line of text, parse all the paragraphs and indents so that we retain a bit more structure.

Script to reparse all hansard zip file contents

8c26cad

Scans the list of seen files, picks out the latest one and re-parses it. Assumes that the files in the list are in date order.

struan mentioned this pull request Mar 17, 2017

parse_opposition/parse_debated_motion assume 0 or 1 following #57

Open

handle times with a newline between hours and minutes

97d679c
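A possible way to tolerate a line break inside a time is to allow arbitrary whitespace around the separator. The sketch below assumes times look like "12.21 pm"; that format, `TIME_RE` and `parse_time` are assumptions for illustration and not what the commit does.

```python
import re

# \s also matches newlines, so "4\n.05 pm" is accepted as well as "4.05 pm".
TIME_RE = re.compile(r"(\d{1,2})\s*[.:]\s*(\d{2})\s*(am|pm)", re.IGNORECASE)


def parse_time(text):
    """Return the time as HH:MM (24-hour clock), or None if none is found."""
    m = TIME_RE.search(text)
    if m is None:
        return None
    hour, minute = int(m.group(1)), int(m.group(2))
    meridiem = m.group(3).lower()
    if meridiem == "pm" and hour != 12:
        hour += 12
    elif meridiem == "am" and hour == 12:
        hour = 0
    return "%02d:%02d" % (hour, minute)
```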

struan force-pushed the question-parsing-missing-paras branch from f96f8f3 to 97d679c Compare March 17, 2017 12:40

dracos reviewed Dec 13, 2017


dracos force-pushed the master branch from cb2569b to 9b900c5 Compare September 5, 2020 16:43

dracos force-pushed the master branch 3 times, most recently from 403ee7b to 0c4983b Compare March 12, 2023 09:43

dracos force-pushed the master branch 6 times, most recently from bc05e4e to cf4da9e Compare March 20, 2023 09:46
