missing 43 sentences from EWT #7

Open
arademaker opened this issue Jan 31, 2020 · 6 comments

Comments

@arademaker
Contributor

https://catalog.ldc.upenn.edu/LDC2012T13 says that EWT has 16,624 sentences. Counting the lines of the .tree files in the release gives 16,622 instead:

% wc -l `find . -name '*.tree'` | tail
     ...
      10 ./reviews/penntree/278775.xml.tree
       5 ./reviews/penntree/389136.xml.tree
       3 ./reviews/penntree/374604.xml.tree
       2 ./reviews/penntree/137883.xml.tree
       8 ./reviews/penntree/382073.xml.tree
       2 ./reviews/penntree/022273.xml.tree
       3 ./reviews/penntree/211933.xml.tree
       2 ./reviews/penntree/332068.xml.tree
       1 ./reviews/penntree/289763.xml.tree
   16622 total

This number matches the number of sentences in the https://github.com/universaldependencies/UD_English-EWT treebank:

ud-english-ewt % grep sent_id *.conllu | wc -l
   16622

But this propbank-release contains only 16,579 sentences, so the following 43 sentences are missing:

  • reviews-052884-0001 : unique gifts and cards
  • reviews-172245-0001 : Great store great products
  • reviews-258042-0001 : Lovley food and fab chips
  • reviews-018268-0001 : best square slice around.
  • reviews-253807-0001 : Cheapest drinks in Keene!
  • reviews-105719-0001 : Over priced for Mexican food
  • reviews-190389-0001 : very miss informed people!!
  • reviews-035932-0001 : Simple, Quick take away.
  • reviews-173758-0001 : best place for snowboard eva.
  • reviews-211844-0001 : Favorite DD spot in the area!
  • reviews-189171-0001 : A most outstanding, professional firm.
  • reviews-228154-0001 : Good food and coffee with a nice atmosphere
  • reviews-208180-0001 : Good quality Indian food in a pleasant environment
  • reviews-242303-0001 : Awesome bacon egg and cheese sandwich for breakfast.
  • reviews-317480-0001 : Great atmosphere, great food.
  • reviews-317480-0002 : Definitely a must.
  • reviews-107292-0001 : awesome bagels
  • reviews-107292-0002 : long lines on the weekends but worth it
  • reviews-330275-0001 : Some of the nicest people and very good work standards
  • reviews-235462-0001 : Hobbs on Mass.
  • reviews-235462-0002 : Absolutely my favorite store in Lawrence, KS
  • reviews-341435-0001 : Nice and quiet place with cosy living room just outside the city.
  • reviews-203196-0001 : VINGAS
  • reviews-203196-0002 : VISAKHA INDUSTRIAL GASES PVT. LTD., location at google maps.
  • reviews-008635-0001 : Good food and very friendly staff.
  • reviews-008635-0002 : Very good with my 5 year old daughter.
  • reviews-008635-0003 : Interesting good value wine list to.
  • reviews-008635-0004 : Beer a bit expensive.
  • answers-20090203211448AAoG2yX_ans-0001 : Green Tea Or White Tea?
  • answers-20090203211448AAoG2yX_ans-0002 : Green
  • answers-20090203211448AAoG2yX_ans-0003 : Green Tea.
  • answers-20090203211448AAoG2yX_ans-0004 : Green tea
  • reviews-327867-0001 : Good clean store nice car wash
  • reviews-081116-0001 : Best fried shrimp in the state!
  • reviews-314938-0001 : The best pilates on the Gold Coast!
  • reviews-184290-0001 : wow wow wow.
  • reviews-184290-0002 : the bast cab in minneapolis
  • reviews-388121-0001 : Too many kids, too many knifings, too many taserings.
  • reviews-058878-0001 : Nice little locally owned greek bar and grill.
  • reviews-058878-0002 : Good food.
  • reviews-058878-0003 : Great wings!
  • reviews-046500-0001 : Mens and Boys Barbers, on the number 9 Bus route.
  • reviews-046500-0002 : Ladies room, Open Sundays
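A list like the one above can be produced by diffing sentence identifiers. The following is only a minimal sketch: it assumes the UD side is read from CoNLL-U files, where each sentence carries a `# sent_id = ...` comment, and that the identifiers present in this release have already been collected into a set using the same naming scheme. The helper names `sent_ids` and `missing_ids` and the identifier `example-doc-0001` are hypothetical and used for illustration only.

```python
import re
from typing import Iterable, List, Set

# CoNLL-U declares the sentence identifier in a comment line: "# sent_id = <id>"
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)\s*$")


def sent_ids(lines: Iterable[str]) -> Set[str]:
    """Collect every sent_id declared in an iterable of CoNLL-U lines."""
    ids = set()
    for line in lines:
        match = SENT_ID_RE.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def missing_ids(ud_ids: Iterable[str], release_ids: Iterable[str]) -> List[str]:
    """Return, in sorted order, the identifiers found in UD but not in the release."""
    return sorted(set(ud_ids) - set(release_ids))


# Tiny in-memory example; "example-doc-0001" is a made-up identifier.
ud_sample = [
    "# sent_id = reviews-107292-0001",
    "# text = awesome bagels",
    "# sent_id = reviews-107292-0002",
    "# text = long lines on the weekends but worth it",
    "# sent_id = example-doc-0001",
    "# text = placeholder sentence",
]
release_sample = {"example-doc-0001"}

missing = missing_ids(sent_ids(ud_sample), release_sample)
print(missing)
```

How the release identifiers are extracted depends on the file format of this repository, which the sketch deliberately leaves out.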
@timjogorman
Member

Thanks for noting this! While it would be good for us to include these, this does not mean that there is missing SRL data. I'll need to look into it more, but I'm pretty sure that each of these sentences is from a document that had zero predicates to annotate, and our pipeline ended up simply not preparing documents with zero annotations. I think that's an error in our pipeline: while dropping them would have no effect on standard SRL training (where you have gold predicate identification), it would be more accurate to have these documents included. I'll look into adding them in.

@arademaker
Contributor Author

The strange thing here is that there are other sentences without predicate and argument annotations that are still in the corpus.

@manning

manning commented Oct 30, 2020

Random additional comment: There are a number of failures to divide sentences in the LDC EWT data. Up until now, we have kept the sentence divisions consistent between LDC EWT and UD EWT, but I have an intention of some day fixing the erroneous sentence divisions and giving the results back to LDC....

@arademaker
Contributor Author

arademaker commented Nov 11, 2020

Can I help somehow? I really would like to see the data more consistent between the LDC EWT, UD EWT, and Propbank. Do you have the list of errors in the division of sentences?

@arademaker
Contributor Author

Hi @manning, I have just noticed that LDC EWT does not define a train/dev/test split. So maybe the split used in UD EWT was based on the sets defined here in this repository?

@arademaker
Contributor Author

> The strange thing here is that there are other sentences without predicate and argument annotations that are still in the corpus.

I missed one important detail in @timjogorman's explanation above. He actually repeated what he said in #2 (comment). Only files in which no sentence has an annotated predicate are omitted. So my comment above can be ignored: we do have files in which some sentences lack SRL annotation.
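To make the rule concrete, here is a small sketch of the file-level filter as described in this thread. It is not the actual pipeline code: the function names, the input mapping from sentence identifier to predicate count, and the `doc-a` / `doc-b` identifiers are all hypothetical. It assumes the document name is the sentence identifier without its trailing counter, as in `reviews-008635-0003`.

```python
from collections import defaultdict
from typing import Dict, List


def doc_id(sent_id: str) -> str:
    """Strip the trailing sentence counter: 'reviews-008635-0003' -> 'reviews-008635'."""
    return sent_id.rsplit("-", 1)[0]


def omitted_documents(predicate_counts: Dict[str, int]) -> List[str]:
    """Return documents in which no sentence has any annotated predicate."""
    per_doc = defaultdict(int)
    for sid, count in predicate_counts.items():
        per_doc[doc_id(sid)] += count
    return sorted(doc for doc, total in per_doc.items() if total == 0)


# Hypothetical counts: doc-a has no predicates at all, doc-b has some.
counts = {
    "doc-a-0001": 0,
    "doc-a-0002": 0,
    "doc-b-0001": 0,  # no predicates here, but the file is still kept
    "doc-b-0002": 2,
}
print(omitted_documents(counts))
```

Under this rule `doc-b-0001` stays in the corpus without SRL annotation, because another sentence in the same file has predicates, while every sentence of `doc-a` is dropped.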
