different in number of headers vs. the number of generic headers #15

Closed
davidrapoport opened this issue Aug 20, 2014 · 5 comments

@davidrapoport

I run "iconv -f utf-8 -t utf-8 -c " on each paper before it is extracted.
Email me at drapoport847 at gmail dot com for a list of papers that cause this error.

log.txt:

184.175.2.245 - - [19/Aug/2014 11:58:51] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in'
Die: SectLabel::Controller::getGenericHeaders different in number of headers 38 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:00:49] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:02:16] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:03:18] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in'
Die: SectLabel::Controller::getGenericHeaders different in number of headers 13 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:04:20] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:05:27] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in'
Die: SectLabel::Controller::getGenericHeaders different in number of headers 23 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:06:39] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:08:09] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:10:21] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:12:57] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in'
Die: SectLabel::Controller::getGenericHeaders different in number of headers 15 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:14:18] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:16:18] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:17:58] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:19:15] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in'
Die: SectLabel::Controller::getGenericHeaders different in number of headers 9 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:20:37] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:21:43] "POST /pc/upload HTTP/1.1" 200 -
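
The ArgumentError in the log says Ruby treated the input as US-ASCII, so any byte at or above 0x80 in the extracted text could plausibly trigger it. Below is a minimal Python sketch for locating such lines in a pdftotext output file. The function name and the sample bytes are hypothetical and not part of ParsCit.

```python
from __future__ import annotations


def non_ascii_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Return (1-based line number, raw line) for each line holding a byte >= 0x80."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        if any(byte >= 0x80 for byte in line):
            hits.append((number, line))
    return hits


# Hypothetical sample: only the second line contains a non-ASCII byte (0xad).
sample = b"1 Introduction\nco\xadoperation\n2 Methods\n"
for number, line in non_ascii_lines(sample):
    print(number, line)
```

Running this over the raw bytes of a failing paper (for example, the result of open(path, "rb").read()) would show which header lines carry the suspect bytes.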

@knmnyn
Owner

knmnyn commented Aug 20, 2014

It'd be good to have some source files after iconv to test with. David, can you provide these?

@cmkumar87
Contributor

Hi David

"Before running ParsCit I run "iconv -f utf-8 -t utf-8 -c " because
otherwise I would get a UTF error."

I notice that your iconv command specifies utf-8 as both the source and the
target encoding. If your input is already in utf-8, why would you convert it
to utf-8? Could you please check what your input encoding is, and whether
iconv is converting anything at all?
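
One way to check this is to compare a file before and after invalid sequences are dropped. The rough Python sketch below assumes that decoding with errors="ignore" approximates what iconv -c does; the two are not guaranteed to be byte-for-byte identical. The function name is made up for illustration.

```python
def utf8_report(data: bytes) -> dict:
    """Report whether data is valid UTF-8 and how many bytes a lossy cleanup drops."""
    try:
        data.decode("utf-8")
        valid, first_bad = True, None
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded.
        valid, first_bad = False, err.start
    # Drop undecodable bytes, then re-encode so the lengths are comparable.
    cleaned = data.decode("utf-8", errors="ignore").encode("utf-8")
    return {
        "valid": valid,
        "first_bad_offset": first_bad,
        "bytes_dropped": len(data) - len(cleaned),
    }
```

If "valid" is True and "bytes_dropped" is 0 for a paper, the cleanup step is changing nothing for that file.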

Thanks!

Muthu


@davidrapoport
Author

Hi Muthu,
Before sending each paper to the webservice I run pdftotext -raw. The version is:

pdftotext version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

I run iconv this way because, if I do not, some papers give me this error:

Malformed UTF-8 character (unexpected continuation byte 0xad, with no preceding start byte) in pattern match (m//) at /Users/logp/ParsCit/bin/../lib/SectLabel/Tr2crfpp.pm line 216.
Malformed UTF-8 character (unexpected non-continuation byte 0x61, immediately after start byte 0xe9) in pattern match (m//) at /Users/logp/ParsCit/bin/../lib/SectLabel/Tr2crfpp.pm line 216.

However, I have since disabled all preprocessing on the papers, and I still receive the original error with certain papers. I will email a list of the papers that caused it.
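
The two bytes named in the Perl warnings, 0xad and 0xe9, are not valid on their own in UTF-8 but are ordinary characters in Latin-1 (soft hyphen and é). Whether the failing papers actually contain Latin-1 text is only a guess and is not confirmed anywhere in this thread. A small Python sketch of the check, using made-up sample bytes and a hypothetical helper name:

```python
from __future__ import annotations


def try_decodings(raw: bytes) -> tuple[bool, str]:
    """Return (is_valid_utf8, text_as_latin1) for a byte string."""
    try:
        raw.decode("utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False
    # Latin-1 maps every byte value to a code point, so this decode never fails.
    return utf8_ok, raw.decode("latin-1")


# Hypothetical samples mirroring the two warnings: a lone 0xad byte,
# and 0xe9 followed directly by 0x61 ("a").
for raw in (b"co\xadoperate", b"r\xe9al"):
    print(raw, try_decodings(raw))
```

If real lines from a failing paper read sensibly when decoded as Latin-1, converting with iconv -f latin1 -t utf-8 instead of utf-8 to utf-8 might be worth trying.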

@davidrapoport
Author

Attached are 3 papers (pdf and result after running pdftotext -raw).


@cmkumar87
Contributor

David's files work on our webservice at http://aye.comp.nus.edu.sg/parsCit/. The download we provide on the same page is a replica of the codebase that runs our webservice, so we aren't sure what is causing the reported error on David's end. Please get in touch with us if you have any more specific error logs.

Thanks!

@cmkumar87 cmkumar87 self-assigned this Oct 4, 2014