different in number of headers vs. the number of generic headers #15
Comments
It'd be good to have some source files after iconv to test with. David, can you provide these?
Hi David,
"Before running ParsCit I run "iconv -f utf-8 -t utf-8 -c " because I notice that your command for iconv specifies both your from and to file
Thanks!
Muthu
Hi Muthu, I am using pdftotext version 0.24.5. I run iconv this way because otherwise some papers give me this error: Malformed UTF-8 character (unexpected continuation byte 0xad, with no preceding start byte) in pattern match (m//) at /Users/logp/ParsCit/bin/../lib/SectLabel/Tr2crfpp.pm line 216. However, even with all preprocessing of the papers disabled, I still receive the original error on certain papers. I will email a list of the papers that caused the error.
Attached are 3 papers (PDF and the result of running pdftotext -raw).
David's files work on our webservice at http://aye.comp.nus.edu.sg/parsCit/. The download we provide on the same page is a replica of the codebase that runs our webservice, so we aren't sure what's causing the reported error on David's end. Please get in touch with us if you have any more specific error logs. Thanks!
I run "iconv -f utf-8 -t utf-8 -c " before each paper is extracted.
Email me at drapoport847 at gmail dot com for a list of papers which cause this error
GNU nano 2.0.6 File: log.txt
184.175.2.245 - - [19/Aug/2014 11:58:51] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in `split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
Die: SectLabel::Controller::getGenericHeaders different in number of headers 38 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:00:49] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:02:16] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:03:18] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in `split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
Die: SectLabel::Controller::getGenericHeaders different in number of headers 13 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:04:20] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:05:27] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in `split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
Die: SectLabel::Controller::getGenericHeaders different in number of headers 23 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:06:39] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:08:09] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:10:21] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:12:57] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in `split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
Die: SectLabel::Controller::getGenericHeaders different in number of headers 15 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:14:18] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:16:18] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:17:58] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:19:15] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in `split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
Die: SectLabel::Controller::getGenericHeaders different in number of headers 9 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:20:37] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:21:43] "POST /pc/upload HTTP/1.1" 200 -
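The Ruby error in the log above can be reproduced in isolation. Ruby raises ArgumentError when a regex-based split is applied to a string that is tagged US-ASCII but contains bytes above 0x7F. One common way strings end up tagged that way is a process started without a UTF-8 locale (for example from a web server), which would also fit the maintainers not seeing the problem on their own service. Whether that is what happens on this machine is an assumption, not something the thread confirms. The "number of generic headers 0" in each Die line presumably follows from the Ruby script aborting before it prints anything. The sample line and the regex below are made up; the actual pattern on line 35 of extractFeature.rb is not shown in the log.

```ruby
# Made-up sample line containing an en dash (valid UTF-8 bytes E2 80 93).
# force_encoding mimics input that Ruby has tagged as US-ASCII, which is an
# assumption about how the text is read in the failing setup.
line = "Introduction \xE2\x80\x93 Overview".dup.force_encoding(Encoding::US_ASCII)

error = nil
begin
  # Regex matching on a string whose bytes are invalid for its encoding raises.
  line.split(/\s+/)
rescue ArgumentError => e
  error = e
  puts "split failed: #{e.message}"
end

# Possible workaround (a sketch, not the ParsCit fix): re-tag the bytes as
# UTF-8 before splitting. The bytes themselves are not changed.
tokens = line.dup.force_encoding(Encoding::UTF_8).split(/\s+/)
puts tokens.length
```

If the locale is indeed the cause, other options would be to start the process with LANG or LC_ALL set to a UTF-8 locale, or to set Encoding.default_external = Encoding::UTF_8 early in the script, so the tagging is right from the start.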