Extraction Issue #27

Open
vinylrichie opened this issue Jan 31, 2021 · 1 comment

Comments

@vinylrichie

Hello @kohlschuetter,

First off, I have to say, Boilerpipe is AMAZING! Thank you for your work on this.

In a few cases, I am running into an extraction issue. With the code from GitHub, there are some articles where the extracted text starts late. For example, on https://en.wikipedia.org/wiki/New_York_City the output starts at "Further information: Police surveillance in New York City and Crime in New York City". However, when I check that same article on https://boilerpipe-web.appspot.com/, the web API always returns the full text. I've been banging my head against the wall trying to figure out what I was doing wrong, and figured I should just message the inventor. The only two explanations I could think of are: 1) I am totally missing something, or 2) the web API might be running a slightly different version. Do you know what might be going on here?

Hope you are having a great weekend!

Best,
Kevin
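
To pin down how much of an article is being skipped, it may help to locate the first extracted line inside a reference copy of the full text (for example, the text returned by the web API). The following is only a minimal sketch using plain Java strings; the class and method names (`ExtractionStart`, `firstLine`, `startOffset`) are hypothetical helpers and not part of Boilerpipe.

```java
final class ExtractionStart {
    private ExtractionStart() {}

    // Returns the first non-blank line of the extracted text, trimmed,
    // or an empty string if the extraction contains no visible text.
    static String firstLine(String extracted) {
        for (String line : extracted.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed;
            }
        }
        return "";
    }

    // Returns the character offset in the reference text at which the
    // extracted output begins, or -1 if the first extracted line is
    // empty or does not occur verbatim in the reference text.
    static int startOffset(String reference, String extracted) {
        String first = firstLine(extracted);
        return first.isEmpty() ? -1 : reference.indexOf(first);
    }
}
```

An offset of 0 would mean the local extraction starts where the reference does; a large offset indicates how much leading text was dropped, and -1 means the first extracted line could not be matched verbatim (whitespace differences alone can cause this).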

@RenanMoreiraDK

I'm facing some issues with the ArticleExtractor producing completely different results for two pages that have really similar HTML:

https://www.posb.com.sg/personal/deposits/savings-accounts/emysavings-account
https://www.dbs.com.sg/personal/deposits/savings-accounts/mysavings-account

When I use the DefaultExtractor, the two outputs are 96% similar. With ArticleExtractor, however, the outputs are completely different. Any ideas why?
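
The comment does not say how the 96% figure was measured. As one possible way to make such a comparison reproducible, here is a minimal sketch that computes the Jaccard similarity over the sets of distinct words in two extracted texts. This is an assumed metric chosen for illustration, not necessarily the one used above, and `ExtractionSimilarity` is a hypothetical helper, not part of Boilerpipe.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ExtractionSimilarity {
    private ExtractionSimilarity() {}

    // Lowercases the text and splits it on runs of characters that are
    // neither letters nor digits, returning the distinct word tokens.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
    // Two texts with no tokens at all are treated as identical (1.0).
    static double jaccard(String a, String b) {
        Set<String> tokensA = tokens(a);
        Set<String> tokensB = tokens(b);
        if (tokensA.isEmpty() && tokensB.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        return (double) intersection.size() / union.size();
    }
}
```

Calling `ExtractionSimilarity.jaccard(textFromPosb, textFromDbs)` once with each extractor's output (the two argument names are placeholders for the extracted strings) would give comparable numbers for both extractors. Because the metric ignores word order and repetition, it is only a rough indicator of how much content the two extractions share.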
