extractor: enforce utf8 also for non PDF files
Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>
kaplun committed May 19, 2017
1 parent 19029e2 commit 217ba94
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion invenio_classifier/extractor.py
@@ -94,7 +94,8 @@ def text_lines_from_local_file(document, remote=False):
                 # We are in Python 2. We need to cast to unicode
                 lines = [line.decode('utf8', 'replace') for line in lines]
         else:
-            filestream = codecs.open(document, "r", errors="replace")
+            filestream = codecs.open(document, "r", encoding="utf8",
+                                     errors="replace")
             # FIXME - we assume it is utf-8 encoded / that is not good
             lines = [line for line in filestream]
             filestream.close()
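
The effect of the change can be illustrated with a small standalone sketch. It is not part of the commit: the helper name `read_lines_utf8` and the sample file are hypothetical, chosen only for illustration. When `codecs.open` is called without an `encoding`, it returns the plain built-in file object, so the `errors="replace"` argument has no effect. Passing `encoding="utf8"` makes the replacement policy apply, so undecodable bytes become U+FFFD rather than raising an error or leaking through as raw bytes.

```python
import codecs
import os
import tempfile


def read_lines_utf8(path):
    """Read a file as text, decoding as UTF-8 and replacing invalid bytes.

    Hypothetical helper mirroring the non-PDF branch of the patched code.
    """
    filestream = codecs.open(path, "r", encoding="utf8", errors="replace")
    try:
        return [line for line in filestream]
    finally:
        filestream.close()


# Write a sample file containing a byte that is not valid UTF-8 here:
# 0xE9 is "é" in Latin-1, but in UTF-8 it starts a multi-byte sequence
# that the following newline cannot complete.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"caf\xe9\nplain ascii\n")

sample_lines = read_lines_utf8(sample_path)
os.remove(sample_path)

# The invalid byte is replaced by U+FFFD; the valid line is unchanged.
print(sample_lines)
```

As the FIXME in the patched code notes, this still assumes the input is UTF-8. A file in another encoding is read without errors, but its non-ASCII characters may come out as replacement characters.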