Add PDF Normalization #681

Open
wants to merge 35 commits into
base: development

@lw7360 lw7360 commented Oct 16, 2014

#566. I still need to fix some CoffeeScript syntax, use Random.id, add a database migration, change files to cached, and update comments.

@mitar added this to the Parallel milestone Oct 16, 2014
@lw7360
Author

lw7360 commented Oct 27, 2014

Proper random IDs now. I also did some work on the migration, which half works: new random IDs seem to be generated for the files, but the files aren't being moved to their new directories.

I also moved normalization to its own separate job. I'll also probably add another job that normalizes all currently unnormalized PDFs, kind of like ProcessPublicationsJob.

Also, there's now a weird bug where PDFs don't ever seem to get rendered on the client, which is strange because I don't think I've changed any client-side code.

I didn't get as much done as I would have liked, partially because I upgraded to OS X Yosemite, which broke a lot of things on my computer. I really want to finish this by next week.

@mitar
Member

mitar commented Oct 27, 2014

> I'll also probably add another job that normalizes all currently unnormalized PDFs, kind of like ProcessPublicationsJob.

Good.

Who calls your normalization job? If you call it from ProcessPublicationJob, then ProcessPublicationsJob would also end up calling it, right? But yes, you can make one special job for renormalizing. It is always good to have one for fixing things.

> Also, there's now a weird bug where PDFs don't ever seem to get rendered on the client, which is strange because I don't think I've changed any client-side code.

Maybe the paths used on the client are wrong? You should check everywhere where Storage is used and check if you have to modify something, and where publication URLs and thumbnails are accessed.

How much time did you spend?

@@ -114,13 +114,15 @@ class @Publication extends BasicAccessDocument
   @_filenamePrefix: ->
     'publication' + Storage._path.sep

-  cachedFilename: =>
+  cachedFilename: (id) =>
     throw new Error "Cached filename not available" unless @cachedId and @mediaType
Member

Isn't the media type per file now?

Member

The check should probably now be whether cachedId exists and whether files is non-empty.

Author

Yes it is, but because of the way I'm currently storing each file, there isn't really a great way to get it. I'd have to search the file list for the specified fileId to get the media type. I guess I could just do that for now though.

files: [
  fileId: Random.id()
  createdAt: createdAt
  updatedAt: createdAt
  sha256: sha256
  mediaType: 'pdf'
  type: 'original'
]

Member

Then search the list. :-) What you in fact want to do is:

  • verify that cachedId exists
  • verify that files contains an entry for fileId, or at least that files is non-empty

We probably don't have to check for mediaType. I'm not sure when it would not exist (only as an error).
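
The two checks above could be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the `findFile` helper, the standalone `cachedFilename` function that takes a plain publication object, and the `publication/cache/<fileId>.<mediaType>` layout are all assumptions, and Node's `path.sep` stands in for `Storage._path.sep`.

```coffeescript
path = require 'path'

# Hypothetical helper: return the entry for fileId from the files list, or null.
findFile = (publication, fileId) ->
  files = publication.files ? []
  for file in files
    return file if file.fileId is fileId
  null

# Sketch of the suggested check: cachedId must exist and files must contain
# the requested fileId. The media type is read from the matching entry, so no
# top-level mediaType is needed. The path layout is assumed for illustration.
cachedFilename = (publication, fileId) ->
  file = findFile publication, fileId
  throw new Error "Cached filename not available" unless publication.cachedId and file
  ['publication', 'cache', "#{file.fileId}.#{file.mediaType}"].join path.sep
```

Searching the list is linear in the number of files, which should be fine while a publication only holds a handful of entries (original, normalized, and so on).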

@mitar
Member

mitar commented Nov 3, 2014

One thing. An important part of PDF processing is removing all metadata from the PDF. It is sometimes present and contains the name of whoever imported the PDF, not the author. So I would just remove all metadata (we will later extract it into MongoDB data). Do the same with highlights, annotations, links, and everything else in the PDF. Can Ghostscript do that, that is, remove everything which is not strictly PDF content?

@lw7360
Author

lw7360 commented Nov 3, 2014

You could do that by first converting the PDF to PS with Ghostscript, and then converting it back. I don't think there's a less roundabout way though.
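
A rough sketch of that round trip, assuming Ghostscript's `ps2write` and `pdfwrite` devices are used for the two passes. The function name `roundTripCommands` and the file paths are hypothetical. The sketch only builds the argument lists and leaves running them (for example with `execFileSync`) to the caller. Whether the round trip really drops annotations and metadata would still need to be checked on real files.

```coffeescript
# Flags shared by both Ghostscript passes (non-interactive, quiet).
COMMON_FLAGS = ['-dNOPAUSE', '-dBATCH', '-dQUIET']

# Hypothetical helper: build the argument lists for PDF -> PostScript -> PDF.
# The caller chooses the input, intermediate, and output paths.
roundTripCommands = (inputPdf, tempPs, outputPdf) ->
  toPs = ['-sDEVICE=ps2write'].concat COMMON_FLAGS, ["-sOutputFile=#{tempPs}", inputPdf]
  toPdf = ['-sDEVICE=pdfwrite'].concat COMMON_FLAGS, ["-sOutputFile=#{outputPdf}", tempPs]
  [
    {command: 'gs', args: toPs}
    {command: 'gs', args: toPdf}
  ]
```

Keeping the command construction separate from execution makes it easy to log or test the exact arguments before any file is touched.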

@@ -160,6 +160,7 @@ class @Publication extends Publication
   adminPersons: 1
   adminGroups: 1
   cached: 1
+  files: 1
Author

Does this give the client access to a publication's files? Right now, when I open a publication through my browser, it seems like the client calls url() twice. The first time, everything is fine and it returns the proper URL for the default PDF, but the second time it doesn't seem to have access to files, so nothing is returned and the PDF doesn't get rendered.

Member

No. This is just used in the method when uploading. No need to add it here, if you do not use files inside this method.

Member

Sorry, I was looking at the wrong thing.

Member

This is correct, this would make it push for normal publish endpoints.

@mitar
Member

mitar commented Nov 3, 2014

> You could do that by first converting the PDF to PS with Ghostscript, and then converting it back. I don't think there's a less roundabout way though.

You can check what happens then. Is text flow preserved? What else do we remove together with annotations? We would not want to remove fonts, vector graphics, or other such content.

@@ -281,6 +290,7 @@ Meteor.methods
   sha256: 1
   cachedId: 1
   mediaType: 1
+  files: 1
Member

This one is unnecessary if you are not using this data in the method.

@lw7360
Author

lw7360 commented Nov 3, 2014

I've tested it out quite a bit, and everything is preserved quite well. The only difference I saw was that the font color of one formula seemed a bit lighter after the conversion. It doesn't completely remove metadata though. It just replaces everything with stuff like Author: GPL GhostScript.


future.wait()

result = execFileSync 'gs', ['-sDEVICE=pdfwrite', '-dNOPAUSE', '-dQUIET', '-dBATCH', '-dFastWebView=true' , "-sOutputFile=#{path}/#{fileId}.pdf", Storage._fullPath publication.cachedFilename()]
Member

Maybe use -dUseCIEColor -sProcessColorModel=DeviceCMYK as well?

@lw7360
Author

lw7360 commented Nov 3, 2014

~11 hours. Fixed most of the issues in the pull request. PDF rendering is still broken because, for some reason, the client doesn't seem to have access to files even though it's being published to it. The last task, which I have already worked on a bit, is to replace the cached timestamp with the files list, remove file-specific values from the publication itself, and keep them only in the files list. I also might use pdf2ps/ps2pdf instead of gs to get rid of the metadata in each PDF.

@mitar
Member

mitar commented Nov 3, 2014

> It doesn't completely remove metadata though. It just replaces everything with stuff like Author: GPL GhostScript.

You cannot configure what they set it to?

Maybe you could also use the -dPDFA and -dPDFACompatibilityPolicy=1 switches?


Publication._filenamePrefix() + 'cache' + Storage._path.sep + @cachedId + '.' + @mediaType
cachedFilename: (fileId) =>
throw new Error "Cached filename not available" unless @cachedId and @mediaType and @files.length
Member

Why is mediaType still here?

@mitar
Member

mitar commented Nov 3, 2014

I think you are a bit sloppy. You should search all around the codebase for things like mediaType, cachedId and other relevant keywords/terms and see if there are things you have to update as well.

@mitar
Member

mitar commented Nov 3, 2014

You should not need top-level mediaType.

@mitar
Member

mitar commented Nov 3, 2014

> You should check everywhere where Storage is used and check if you have to modify something, and where publication URLs and thumbnails are accessed.

Did you look into this? For example, check this as well.

@mitar
Member

mitar commented Nov 3, 2014

So the reason things don't work for you is that you didn't update all code paths. files is available on the client as it should be, but we limit which fields we request when doing find on the client, to limit reactivity. The point is that if you limit the fields returned by find, then the context gets rerun only when one of those fields changes, and not when any of the document's fields changes. Any client-side find that feeds url() therefore has to request files as well.
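
To illustrate the point about limited fields, here is a self-contained sketch. It is not Meteor code: `project` is a hypothetical stand-in for what a `fields` option does to a document returned by find, and the `url` helper with its path format is made up for the example. It shows why a code path that requests a document without `files` gets nothing back from a helper that depends on `files`.

```coffeescript
# Hypothetical stand-in for a field-limited find: keep _id plus the requested fields.
project = (doc, fields) ->
  result = {_id: doc._id}
  for name of fields
    result[name] = doc[name] if name of doc
  result

# Made-up URL helper that depends on the files list being present.
url = (publication) ->
  return null unless publication.files?.length
  "/storage/publication/cache/#{publication.files[0].fileId}.pdf"

publication =
  _id: 'p1'
  cachedId: 'c1'
  files: [{fileId: 'f1', mediaType: 'pdf'}]

# One code path requests files, the other was never updated to do so.
withFiles = project publication, {cachedId: 1, files: 1}
withoutFiles = project publication, {cachedId: 1}
```

Here `url(withFiles)` yields a path while `url(withoutFiles)` yields `null`, which would match the symptom of the second `url()` call returning nothing.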

@mitar
Member

mitar commented Nov 11, 2014

What's the status here?
