Add PDF Normalization #681

lw7360 · 2014-10-16T02:59:01Z

#566. I still need to fix some CoffeeScript syntax, use Random.id(), add a database migration, change files to cached, and update comments.

lw7360 · 2014-10-27T19:27:55Z

Proper random IDs now, and I also did some work trying to fix the migration, which half works now. New random IDs seem to be generated for the files, but the files aren't being moved to their new directories.

I also moved normalization to its own separate job. I'll also probably add another job that normalizes all currently unnormalized PDFs, kind of like ProcessPublicationsJob.

Also, there's now a weird bug where PDFs never seem to get rendered on the client, which is strange because I don't think I've changed any client-side code.

I didn't get as much done as I would have liked, partially because I upgraded to OS X Yosemite, which broke a lot of things on my computer. I really want to finish this by next week.

mitar · 2014-10-27T23:23:17Z

> I'll also probably add another job that normalizes all currently unnormalized PDFs, kind of like ProcessPublicationsJob.

Good.

Who calls your normalization job? If you call it from ProcessPublicationJob, then ProcessPublicationsJob would also call it? But yes, you can make one special job for renormalizing. This is always good to have to fix things.

> Also, there's now a weird bug where PDFs never seem to get rendered on the client, which is strange because I don't think I've changed any client-side code.

Maybe paths used on the client are wrong? You should check everywhere where Storage is used and check if you have to modify something, and where publication URLs and thumbnails are accessed.

How much time did you spend?

mitar · 2014-10-27T23:26:26Z

lib/documents/publication.coffee

@@ -114,13 +114,15 @@ class @Publication extends BasicAccessDocument
  @_filenamePrefix: ->
    'publication' + Storage._path.sep

-  cachedFilename: =>
+  cachedFilename: (id) =>
    throw new Error "Cached filename not available" unless @cachedId and @mediaType


Isn't the media type per file now?

The check should probably now be whether there is a cachedId and whether files is non-empty.

Yes it is, but because of the way I'm currently storing each file, there isn't really a great way to get it. I'd have to search the file list for the specified fileId to get the media type. I guess I could just do that for now though.

files: [
  fileId: Random.id()
  createdAt: createdAt
  updatedAt: createdAt
  sha256: sha256
  mediaType: 'pdf'
  type: 'original'
]

Then search the list. :-) What you in fact want to do is verify that:

cachedId exists

files contains a fileId entry, or at least that it is non-empty

We probably don't have to check for mediaType; I'm not sure when it would not exist (only as an error).
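A minimal sketch of the check described above, for illustration only. PublicationSketch is a hypothetical stand-in for the real Publication class, and the directory layout (publication/cache/<cachedId>/<fileId>.<mediaType>) is an assumption; the thread does not spell out the new cache structure. Node's path module stands in for Storage._path.

```coffeescript
path = require 'path'

# Hypothetical stand-in for the Publication document; only the fields
# discussed in this thread (cachedId, files) are modelled.
class PublicationSketch
  constructor: (@cachedId, @files = []) ->

  cachedFilename: (fileId) ->
    throw new Error "Cached filename not available" unless @cachedId and @files.length

    # The media type now lives on each file entry, so search the list for it.
    file = null
    for candidate in @files when candidate.fileId is fileId
      file = candidate
      break
    throw new Error "File '#{fileId}' not found" unless file

    # Assumed layout: publication/cache/<cachedId>/<fileId>.<mediaType>
    path.join 'publication', 'cache', @cachedId, "#{file.fileId}.#{file.mediaType}"

publication = new PublicationSketch 'abc', [fileId: 'f1', mediaType: 'pdf', type: 'original']
console.log publication.cachedFilename 'f1'
```

With this shape no top-level mediaType is consulted; a missing mediaType on a file entry would surface as a malformed filename rather than a separate check.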

mitar · 2014-11-03T08:23:35Z

One thing. One important thing in PDF processing is to remove all metadata in the PDF, which is sometimes there and contains the name of whoever imported the PDF, not the author. So I would just remove all metadata (we will later on extract it into MongoDB data). And do the same with highlights, annotations, links, and everything else in the PDF. Can Ghostscript do that? So remove everything which is not strictly PDF content?

lw7360 · 2014-11-03T16:53:34Z

You could do that by first converting the PDF to PS with Ghostscript, and then converting it back. I don't think there's a less roundabout way though.
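The round trip could be sketched roughly as follows, using the pdf2ps and ps2pdf wrapper scripts that ship with Ghostscript. The function names and the intermediate file name are hypothetical, and the sketch assumes a Node version that provides child_process.execFileSync (the PR itself uses its own execFileSync helper). Whether this strips everything that should be stripped would still have to be checked on real PDFs.

```coffeescript
{execFileSync} = require 'child_process'

# Build the two conversion steps; the intermediate .ps path is hypothetical.
roundTripSteps = (inputPdf, outputPdf) ->
  intermediate = "#{outputPdf}.ps"
  [
    ['pdf2ps', [inputPdf, intermediate]]
    ['ps2pdf', [intermediate, outputPdf]]
  ]

# Run each step in order; execFileSync throws if a command fails.
roundTrip = (inputPdf, outputPdf) ->
  for [command, args] in roundTripSteps(inputPdf, outputPdf)
    execFileSync command, args
  return
```

Keeping the step list separate from the code that runs it makes it possible to log or inspect the commands without invoking Ghostscript.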

lw7360 · 2014-11-03T19:26:59Z

server/publication.coffee

@@ -160,6 +160,7 @@ class @Publication extends Publication
      adminPersons: 1
      adminGroups: 1
      cached: 1
+      files: 1


Does this give the client access to a publication's files? Right now, when I open a publication through my browser, it seems like the client calls url() twice. The first time, everything is fine and it returns the proper URL for the default PDF, but the second time it is called, it doesn't seem to have access to files, so nothing is returned and the PDF doesn't get rendered.

No. This is just used in the method when uploading. No need to add it here, if you do not use files inside this method.

Sorry, I was looking at the wrong thing.

This is correct, this would make it push for normal publish endpoints.

mitar · 2014-11-03T19:31:46Z

> You could do that by first converting the PDF to PS with Ghostscript, and then converting it back. I don't think there's a less roundabout way though.

You can check what happens then. Is text flow preserved? And what else do we remove together with annotations? We would not want to remove fonts, vector graphics, or other such content.

mitar · 2014-11-03T19:36:07Z

server/publication.coffee

@@ -281,6 +290,7 @@ Meteor.methods
        sha256: 1
        cachedId: 1
        mediaType: 1
+        files: 1


This one is unnecessary if you are not using this data in the method.

lw7360 · 2014-11-03T19:38:58Z

I've actually tested it out quite a bit. Everything is preserved quite well. The only difference I saw was that, in one case, the font color of one formula seemed a bit lighter after the conversion. It doesn't completely remove metadata, though. It just replaces everything with stuff like Author: GPL GhostScript.

mitar · 2014-11-03T19:48:06Z

server/jobs/normalize.coffee

+
+        future.wait()
+
+      result = execFileSync 'gs', ['-sDEVICE=pdfwrite', '-dNOPAUSE', '-dQUIET', '-dBATCH', '-dFastWebView=true' , "-sOutputFile=#{path}/#{fileId}.pdf", Storage._fullPath publication.cachedFilename()]


Maybe use -dUseCIEColor -sProcessColorModel=DeviceCMYK as well?
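One way to make it easier to experiment with extra switches is to build the argument list in a small helper. This is only a sketch: normalizeArgs and its extraSwitches parameter are hypothetical names, and the base switches are simply the ones quoted in the diff above. Whether any additional switch is a good idea for a given Ghostscript version is not something this sketch settles.

```coffeescript
path = require 'path'

# Assemble the argument list for `gs`. The base switches are copied from the
# diff; `extraSwitches` is a hypothetical hook for trying suggestions such as
# '-dUseCIEColor' without editing the base list.
normalizeArgs = (inputFile, outputDir, fileId, extraSwitches = []) ->
  base = ['-sDEVICE=pdfwrite', '-dNOPAUSE', '-dQUIET', '-dBATCH', '-dFastWebView=true']
  outputFile = path.join outputDir, "#{fileId}.pdf"
  # Switches come first and the input file last, as in the diff.
  base.concat extraSwitches, ["-sOutputFile=#{outputFile}", inputFile]

console.log normalizeArgs('in.pdf', 'cache', 'f1', ['-dUseCIEColor']).join ' '
```

The resulting array can be passed as the second argument to execFileSync 'gs', so the call site stays a single line.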

lw7360 · 2014-11-03T19:50:36Z

~11 hours. Fixed most of the issues in the pull request. PDF rendering is still bugged because, for some reason, the client doesn't seem to have access to files even though it's being published to it. The last task, which I have already worked on a bit, is to replace the cached timestamp with the files list and to remove file-specific values from the publication itself, keeping them only in the files list. I also might use pdf2ps/ps2pdf instead of gs to get rid of the metadata in each PDF.

mitar · 2014-11-03T19:52:35Z

> It doesn't completely remove metadata though. It just replaces everything with stuff like Author: GPL GhostScript.

Can't you configure what it is set to?

Maybe you could also use -dPDFA and -sPDFACompatibilityPolicy=1 switches?

mitar · 2014-11-03T19:53:53Z

lib/documents/publication.coffee


-    Publication._filenamePrefix() + 'cache' + Storage._path.sep + @cachedId + '.' + @mediaType
+  cachedFilename: (fileId) =>
+    throw new Error "Cached filename not available" unless @cachedId and @mediaType and @files.length


Why is mediaType still here?

mitar · 2014-11-03T21:02:41Z

I think you are a bit sloppy. You should search all around the codebase for things like mediaType, cachedId and other relevant keywords/terms and see if there are things you have to update as well.

mitar · 2014-11-03T21:03:51Z

You should not need top-level mediaType.

mitar · 2014-11-03T21:11:48Z

> You should check everywhere where Storage is used and check if you have to modify something, and where publication URLs and thumbnails are accessed.

Did you look into this? For example, check this as well.

mitar · 2014-11-03T21:25:30Z

So the reason why things don't work for you is that you didn't update all code paths. files is available on the client, as it should be, but we limit which fields we request when doing find on the client, to limit reactivity. The point is that if you limit the fields returned by find, then the context gets rerun only when one of those fields changes, and not when any of the document's fields changes.
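To illustrate the point about limited fields, here is a toy model. It is not Meteor's implementation, and shouldRerun is a hypothetical name; it only mimics the rule that a context which requested an inclusion list of fields is rerun when one of those fields is among the changed ones.

```coffeescript
# Toy model, not Meteor's implementation: a context that requested only some
# fields is rerun only when one of those fields is among the changed ones.
shouldRerun = (requestedFields, changedFields) ->
  for field in changedFields when requestedFields[field]
    return true
  false

# A context that asked only for `cached` does not react to updates of `files`...
console.log shouldRerun {cached: 1}, ['files']            # false
# ...until `files` is added to the requested fields.
console.log shouldRerun {cached: 1, files: 1}, ['files']  # true
```

In the real code the equivalent change would be adding files to the fields option of each client-side find that feeds url() or cachedFilename().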

mitar · 2014-11-11T10:30:37Z

What's the status here?

lw7360 added 3 commits October 11, 2014 21:00

Changed cache file structure

41446bf

Added GhostScript integration

f04f149

Check Meteor.settings for ghostScript

7c25d6d

mitar added this to the Parallel milestone Oct 16, 2014

lw7360 added 20 commits October 18, 2014 00:13

Changed syntax

69b52b1

Use underscore.js and string arguments

14cbc9a

Publish files field and removed default parameters in cachedFilename and url

6edfaf5

Simplified logic in cachedFilename

12fefcf

Use random ids

f471e45

Check for ghostScript setting

638fca3

Removed some logging

a81f518

Added migration

225cf7d

Changed some names

5205213

Moved normalization to a seperate job

d0f28a9

Use randomID for normalized filenames

aefdffb

Merge branch 'development' of https://github.com/peerlibrary/peerlibrary into normalize

afc7220

Fixed syntax

b3e2643

Proper id handling

39224ac

Check for GhostScript before running job

59ded35

Updated settings

610cc4d

Removed originalid

370195d

removed trailing whitespace

9d73236

Updated migration

9e5539b

Normalize migration

6de309d

mitar reviewed Oct 27, 2014

lw7360 added 3 commits November 2, 2014 05:55

Moved update to update object

7398719

Stashing Stuff

22884e6

Fixed id

f1dc2b4

lw7360 added 2 commits November 3, 2014 08:11

Use proper mediatypei n cachedFileName

44b54ef

Changed comment

c214070

Calculate cachedFilename instead of calling it

3616aa6

lw7360 reviewed Nov 3, 2014

mitar reviewed Nov 3, 2014

lw7360 added 6 commits November 4, 2014 21:49

Use underscore

09ad5d0

Abstracted execfilesync to its own function

5432f5b

Fixed display bug

3764680

Fix execFileSync

3ae5ab9

Added job that normalizes all currently unnormalized publications

a18f43a

Single line

27e2056
