Unicode error thrown when parsing references #22

Closed
david-caro opened this issue Jan 18, 2017 · 3 comments

@david-caro
Contributor

Got a UnicodeEncodeError when parsing the references of arXiv:1701.04322:

[2017-01-18 14:56:30,362: ERROR/MainProcess] Task invenio_workflows.tasks.start[c031ad62-e178-43d7-a6a5-c288ca3a1da0] raised unexpected: UnicodeEncodeError('ascii', u'* Unknown citation found. Searching for book title in: , General topology. Mathematical Monographs, Vol. 60, PWN\u2014 Polish Scientific Publishers, Warsaw, (1977).', 112, 113, 'ordinal not in range(128)')
Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/inspire/src/flask-celeryext/flask_celeryext/app.py", line 52, in __call__
    return Task.__call__(self, *args, **kwargs)
  File "/opt/inspire/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/tasks.py", line 77, in start
    return text_type(run_worker(workflow_name, data, **kwargs).uuid)
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/worker_engine.py", line 52, in run_worker
    engine.process(objects, **kwargs)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 390, in process
    self._process(objects)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 547, in _process
    obj, self, callbacks, exc_info
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/engine.py", line 364, in Exception
    obj, eng, callbacks, exc_info
  File "/opt/inspire/src/workflow/workflow/engine.py", line 970, in Exception
    reraise(*exc_info)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 529, in _process
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 481, in run_callbacks
    self.execute_callback(callback_func, obj)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 564, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/arxiv.py", line 134, in arxiv_refextract
    mapped_references = extract_references(pdf.file.uri)
  File "/opt/inspire/src/inspire/inspirehep/modules/refextract/tasks.py", line 90, in extract_references
    reference_format="{title},{volume},{page}",
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 140, in extract_references_from_file
    override_kbs_files=override_kbs_files,
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 1358, in parse_references
    parse_references_elements(reference_lines, kbs, linker_callback)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 819, in parse_references_elements
    ref_line, kbs, bad_titles_count, linker_callback)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 630, in parse_reference_line
    look_for_undetected_books(splitted_citations, kbs)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 659, in look_for_undetected_books
    search_for_book_in_misc(citation, kbs)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 668, in search_for_book_in_misc
    citation_element['misc_txt'])
  File "/usr/lib64/python2.7/socket.py", line 316, in write
    data = str(data) # XXX Should really reject non-string non-buffers
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 112: ordinal not in range(128)

Using refextract==0.1.0.
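
The last frame of the traceback shows `str(data)` being applied to a unicode object. On Python 2 that conversion goes through the default ASCII codec, which cannot represent the em dash (U+2014) that follows "PWN". Below is a minimal sketch of the failure and of one possible workaround (encoding explicitly before writing), using the message from the log. `encode_ascii_or_error` and `encode_for_stream` are hypothetical helpers, not part of refextract, and this is not necessarily how #16 addressed the problem.

```python
# The log message from the traceback above; U+2014 (em dash) follows "PWN".
message = (
    u"* Unknown citation found. Searching for book title in: , General topology. "
    u"Mathematical Monographs, Vol. 60, PWN\u2014 Polish Scientific Publishers, "
    u"Warsaw, (1977)."
)


def encode_ascii_or_error(text):
    """Encode with the ASCII codec (what Python 2's str() does implicitly).

    Returns the UnicodeEncodeError if encoding fails, otherwise None.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return err
    return None


def encode_for_stream(text, encoding="utf-8"):
    """Hypothetical helper: encode explicitly before writing to a byte stream."""
    return text.encode(encoding)


err = encode_ascii_or_error(message)
print(repr(err))                      # the same kind of error as in the log
print(repr(message[err.start]))       # the offending character

encoded = encode_for_stream(message)  # succeeds: UTF-8 can represent U+2014
```

Encoding at the point where text leaves the program keeps the rest of the code working with unicode strings, which is why the sketch encodes only inside `encode_for_stream`.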

@david-caro david-caro self-assigned this Jan 18, 2017
@michamos
Contributor

Probably fixed by #16, but no release has been made since then.

@michamos
Contributor

Indeed, it works on master:

In [2]: refextract.extract_references_from_url("http://arxiv.org/pdf/1701.04322")
...
Out[2]: 
[{'author': [u'D. Chodounsk\xfd'],
  'linemarker': [u'1'],
  'misc': [u'Non-normality and relative normality of Niemytzki plane. Acta Univ. Carolin. Math. Phys. 48 , no. 2, 37-41'],
  'year': [u'2007']},
 {'author': [u'K. Ch'],
  'linemarker': [u'2'],
  'misc': [u'Ciesielski and J. Wojciechowski, Cardinality of regular spaces admitting only constant continuous functions. Topology Proc. 47 , 313-329'],
  'year': [u'2016']},
 {'author': [u'R. Engelking'],
  'linemarker': [u'3'],
  'misc': [u'General topology. Mathematical Monographs, Vol. 60, PWN\u2014 Polish Scientific Publishers, Warsaw'],
  'year': [u'1977']},
 {'author': [u'R. Engelking'],
  'linemarker': [u'4'],
  'misc': [u'Topologia og\xf3lna I, Pa\u0144stwowe Wydawnictwo Naukowe, Warszawa'],
  'year': [u'1989']},
 {'author': [u'F. Hern\xe1ndez-Hern\xe1ndez and M. Hru\u0161\xe1k'],
  'linemarker': [u'5'],
  'misc': [u'Q-sets and normality of \u03a8-spaces. Spring Topology and Dynamical Systems Conference. Topology Proc. 29 , no. 1, 155-165'],
  'year': [u'2005']},
 {'author': [u'F. B. Jones'],
  'linemarker': [u'6'],
  'misc': [u'Hereditarily separable, non-completely regular spaces, Proceedings of the Blacksburg Virginia Topological Conference, March'],
  'year': [u'1973']},
 {'author': [u'K. Kuratowski'],
  'linemarker': [u'7'],
  'misc': [u'Topology-Volume I. Transl. by J. Jaworowski',
   u'Press, New York-London'],
  'publisher': [u'Academic']},
 {'linemarker': [u'7'],
  'misc': [u'Pa\xf1stwowe Wydawnictwo Naukowe Polish Scientific Publishers, Warsaw'],
  'year': [u'1966']},
 {'author': [u'N. Luzin'],
  'linemarker': [u'8'],
  'misc': [u'On subsets of the series of natural numbers, Isv. Akad. Nauk. SSSR Ser. Mat. 11 , 403-411'],
  'year': [u'1947']},
 {'author': [u'S. Mr\xf3wka'],
  'journal_page': [u'105-106'],
  'journal_reference': [u'Fundam. Math. 41 (1954) 105-106'],
  'journal_title': [u'Fundam. Math.'],
  'journal_volume': [u'41'],
  'journal_year': [u'1954'],
  'linemarker': [u'9'],
  'misc': [u'On completely regular spaces'],
  'year': [u'1954']},
 {'author': [u'A. Mysior'],
  'journal_page': [u'652-653'],
  'journal_reference': [u'Proc. Am. Math. Soc. 81 (1981) 652-653'],
  'journal_title': [u'Proc. Am. Math. Soc.'],
  'journal_volume': [u'81'],
  'journal_year': [u'1981'],
  'linemarker': [u'10'],
  'misc': [u'A regular space which is not completely regular'],
  'year': [u'1981']},
 {'author': [u'L.A. Steen'],
  'linemarker': [u'11'],
  'misc': [u'and J.A.jun. Seebach, Counterexamples in topology. New York etc.: Holt, Rinehart and Winston, Inc., XIII'],
  'year': [u'1970']},
 {'author': [u'W. Sierpi\u0144ski'],
  'linemarker': [u'12'],
  'misc': [u'Introduction to General Topology. Lectures in Mathematics at the University of Toronto. The University of Toronto Press . Piotr Kalemba, Institute of Mathematics, University of Silesia, ul. Bankowa 14, 40-007 Katowice E-mail address: piotr.kalemba@us.edu.pl Szymon Plewik, Institute of Mathematics, University of Silesia, ul. Bankowa 14, 40-007 Katowice E-mail address: plewik@math.us.edu.pl'],
  'year': [u'1934']}]
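
To see which extracted fields carry non-ASCII characters (the ones that would trip an implicit ASCII conversion), the returned list can be scanned directly. This is a rough sketch using two entries copied from the output above; `non_ascii_fields` is a hypothetical helper, not part of refextract.

```python
# Two entries copied from the output above.
references = [
    {'author': [u'R. Engelking'],
     'linemarker': [u'3'],
     'misc': [u'General topology. Mathematical Monographs, Vol. 60, '
              u'PWN\u2014 Polish Scientific Publishers, Warsaw'],
     'year': [u'1977']},
    {'author': [u'F. B. Jones'],
     'linemarker': [u'6'],
     'misc': [u'Hereditarily separable, non-completely regular spaces, '
              u'Proceedings of the Blacksburg Virginia Topological Conference, March'],
     'year': [u'1973']},
]


def non_ascii_fields(reference):
    """Return the sorted field names whose values contain a character above U+007F."""
    return sorted(
        field
        for field, values in reference.items()
        if any(ord(ch) > 127 for value in values for ch in value)
    )


# Map each reference's line marker to the fields that contain non-ASCII text.
flagged = {ref['linemarker'][0]: non_ascii_fields(ref) for ref in references}
print(flagged)
```

Here only reference 3 is flagged, in its `misc` field, which is the same text that appeared in the original error message.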

@david-caro
Contributor Author

👍
