Pdffulltext #1103

stuartf · 2015-01-07T18:37:26Z

Requesting a pull of this from the oaeproject repo so that Travis will upload the logs when the build fails.

simong · 2015-01-07T22:24:20Z

From those logs

[2015-01-07T18:43:51.511Z] ERROR: oae-preview-processor/10850 on testing-worker-linux-8-1-9724-linux-20-46226010: Could not convert the PDF to plain text (contentId=c:localhost:Qkmbo8nJE, stdout="")
    Error: Command failed: /bin/sh: 1: pdftotext: not found

        at ChildProcess.exithandler (child_process.js:648:15)
        at ChildProcess.emit (events.js:98:17)
        at maybeClose (child_process.js:756:16)
        at Process.ChildProcess._handle.onexit (child_process.js:823:5)
    --
    stderr: /bin/sh: 1: pdftotext: not found

Presumably it needs the full path to the binary (we seem to do something similar for pdf2htmlex and pdftk).
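
A minimal sketch of what a configurable binary path could look like. The `config.pdftotext.binary` setting and both function names are hypothetical and not taken from the Hilary code base; only the `pdftotext <pdf> <txt>` invocation comes from poppler-utils itself.

```javascript
var execFile = require('child_process').execFile;

// Hypothetical config shape; the real setting name in Hilary may differ
var config = {
    'pdftotext': {
        'binary': '/usr/bin/pdftotext'
    }
};

/**
 * Build the binary and argument list for a PDF to plain text conversion.
 * Falls back to the bare command name when no path is configured.
 */
var buildPdfToTextCommand = function(config, pdfPath, txtPath) {
    var binary = (config.pdftotext && config.pdftotext.binary) || 'pdftotext';
    return {'binary': binary, 'args': [pdfPath, txtPath]};
};

/**
 * Run the conversion. execFile does not spawn a shell, so a missing
 * binary is reported as an ENOENT error on `err` instead of a
 * "/bin/sh: pdftotext: not found" message on stderr.
 */
var convertToText = function(config, pdfPath, txtPath, callback) {
    var cmd = buildPdfToTextCommand(config, pdfPath, txtPath);
    execFile(cmd.binary, cmd.args, function(err) {
        if (err) {
            return callback(new Error('Could not convert the PDF to plain text: ' + err.message));
        }
        return callback(null, txtPath);
    });
};
```

On Travis the binary still has to be installed (poppler-utils); the configurable path only helps when it lives outside the PATH of the worker process.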

coveralls · 2015-01-07T23:24:51Z

Changes Unknown when pulling 9048ba4 on pdffulltext into master.

coveralls · 2015-01-07T23:49:47Z

Changes Unknown when pulling e23fa44 on pdffulltext into master.

simong · 2015-01-08T12:08:22Z

We might need to play with the scoring/boosting of documents. The cutoff point is currently set at 0.09, which might still be too high. The has_child query generally yields low document scores, as the content bodies tend to be quite large.
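
For reference, a rough sketch of the kind of search body being discussed, assuming the extracted text is indexed as a child document. The `content_body` type and field names and the `buildFullTextQuery` helper are assumptions for illustration, not the actual Hilary query; `min_score` and `has_child` are standard Elasticsearch query DSL, though the scoring option name has varied across releases, so check the docs for the deployed version.

```javascript
/**
 * Build a search body that matches either the parent document or the
 * extracted text of its child document, dropping hits below `minScore`.
 */
var buildFullTextQuery = function(q, minScore) {
    return {
        // Hits scoring below this cutoff are excluded from the results
        'min_score': minScore,
        'query': {
            'bool': {
                'should': [
                    // Match on the parent document's own fields
                    {'query_string': {'query': q}},
                    // Match on the child document holding the extracted text
                    // (type and field names are hypothetical)
                    {
                        'has_child': {
                            'type': 'content_body',
                            'score_mode': 'max',
                            'query': {'match': {'content_body': q}}
                        }
                    }
                ]
            }
        }
    };
};

var body = buildFullTextQuery('annual report', 0.09);
```

Lowering the cutoff or boosting the `has_child` clause are the two obvious knobs to experiment with here.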

mrvisser · 2015-01-08T12:17:15Z

Discussing search analysis requirements sounds remarkably similar to a weather forecast :)

nicolaasmatthijs · 2015-01-08T14:48:30Z

node_modules/oae-preview-processor/tests/test-previews.js

@@ -525,6 +525,13 @@ describe('Preview processor', function() {
                                assert.ok(_.find(previews.files, function(file) { return file.filename === 'page.1.html'; }));
                                assert.ok(!_.find(previews.files, function(file) { return file.filename === 'page.2.html'; }));

+                                // The PDF has 1 page, there should only be 1 corresponding txt file


Shouldn't we have a test for a multi-page PDF where we verify that multiple page.x.txt files are generated?
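
Something along these lines could back such a test. This is only a sketch: `countTextPages` is a hypothetical helper, and the sample `files` array stands in for the `previews.files` of a three-page PDF fixture that does not exist in the test suite yet.

```javascript
/**
 * Count the per-page plain text previews (page.1.txt, page.2.txt, ...)
 * in a list of preview file objects.
 */
var countTextPages = function(files) {
    return files.filter(function(file) {
        return /^page\.\d+\.txt$/.test(file.filename);
    }).length;
};

// Stand-in for `previews.files` of a hypothetical three-page PDF
var files = [
    {'filename': 'page.1.html'},
    {'filename': 'page.1.txt'},
    {'filename': 'page.2.html'},
    {'filename': 'page.2.txt'},
    {'filename': 'page.3.html'},
    {'filename': 'page.3.txt'}
];

var textPageCount = countTextPages(files);
```

In the real test the count would be asserted against the known number of pages of the fixture, which also guards against a trailing empty text page being stored.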

stuartf added 7 commits December 17, 2014 11:13

add a plain text preview for pdfs

3e6c43f

use the plain text preview for search indexing

9bcc518

fix error messages in _convertToText

15e9280

save individual page plain text

7942149

push content_body as a string not a buffer

537ac46

move _.after call outside of _.each

e2da025

rename html_content analyzer to text_content

2b4b9e5

stuartf mentioned this pull request Jan 7, 2015

Full text indexing is not returning expected results #1100

Closed

simong force-pushed the pdffulltext branch from 9048ba4 to e23fa44 Compare January 7, 2015 23:31

simong added 3 commits January 8, 2015 11:14

Made pdftotext path configurable

f066a7e

Install poppler-utils on travis ci

Ensure that a final empty text page is not stored

ed04957

Slightly refactored content body search indexing

82b90bf

simong force-pushed the pdffulltext branch from e23fa44 to 82b90bf Compare January 8, 2015 11:15

simong added a commit that referenced this pull request Jan 8, 2015

Merge pull request #1103 from oaeproject/pdffulltext

fceb46e

Pdffulltext

simong merged commit fceb46e into master Jan 8, 2015

simong deleted the pdffulltext branch January 8, 2015 12:42

nicolaasmatthijs reviewed Jan 8, 2015

stuartf mentioned this pull request Jan 8, 2015

Preserve page-based text output of pdf2htmlEX documents #1102

Closed

simong mentioned this pull request May 6, 2015

Hilary errors when deploying 10.0 #1114

Closed
