New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PDF Normalization #681
base: development
Are you sure you want to change the base?
Conversation
Proper random id's now, and also some work was done trying to fix the migration, which kind of half works now. New random id's seem to be generated for the files, but they aren't being moved to their new directories. I also moved normalization to its own separate job. I'll also probably add another job that normalizes all currently unnormalized PDFs, kind of like ProcessPublicationsJob. Also, there's now a weird bug where PDF's don't ever seem to get rendered on the client, which is strange because I don't think I've changed any clientside code. I didn't get as much done as I would have liked, partially because I upgraded to OS X Yosemite, which broke a lot of things on my computer. I really want to finish this by next week. |
Good. Who calls your normalization job? If you call it from
Maybe paths used on the client are wrong? You should check everywhere where How much time you spend? |
@@ -114,13 +114,15 @@ class @Publication extends BasicAccessDocument | |||
@_filenamePrefix: -> | |||
'publication' + Storage._path.sep | |||
|
|||
cachedFilename: => | |||
cachedFilename: (id) => | |||
throw new Error "Cached filename not available" unless @cachedId and @mediaType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is not media type now per file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably check should be now if there is cachedId
and if files
is non-empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is, but because of the way I'm currently storing each file, there isn't really a great way to get it. I'd have to search the file list for the specified fileId to get the media type. I guess I could just do that for now though.
files: [
fileId: Random.id()
createdAt: createdAt
updatedAt: createdAt
sha256: sha256
mediaType: 'pdf'
type: 'original'
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then search the list. :-) What in fact you want to do is:
- verify that
cachedId
exists files
contain anfileId
entry, or that it is non empty
Probably we don't have to check for mediaType
, not sure when it would not exist (only as an error).
One thing. One important thing at PDF processing is that you remove all metadata in the PDF which is sometimes there and has names of an importer of a PDF, not author. So I would just remove all metadata (we will later on extract it into MongoDB data). And do the same with highlights and annotations and links and everything else in the PDF. Can GhostScript do that? So remove everything which is not strict PDF content? |
You could do that by first converting the PDF to PS with Ghostscript, and then converting it back. I don't think there's a less roundabout way though. |
@@ -160,6 +160,7 @@ class @Publication extends Publication | |||
adminPersons: 1 | |||
adminGroups: 1 | |||
cached: 1 | |||
files: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this gives the client access to a publication's files
? Right now, when I open a publication though my browser, it seems like the client calls url()
twice. The first time, everything is fine and it returns the proper url for the default pdf, but the 2nd time it calls it, it doesn't seem to have access to files
, so nothing is returned and the PDF doesn't get rendered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This is just used in the method when uploading. No need to add it here, if you do not use files
inside this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was looking at the wrong thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, this would make it push for normal publish endpoints.
You can check what happens then. Is text flow preserved? So what else do we remove together with annotations? We would not like to remove fonts, or vector graphics or other stuff. |
@@ -281,6 +290,7 @@ Meteor.methods | |||
sha256: 1 | |||
cachedId: 1 | |||
mediaType: 1 | |||
files: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is unnecessary if you are not using this data in the method.
I've actually tested it out quite a bit. Everything is preserved quite well actually. The only time I saw a difference was one time the font color for one formula seemed to be a bit lighter after the conversion. It doesn't completely remove metadata though. It just replaces everything with stuff like |
|
||
future.wait() | ||
|
||
result = execFileSync 'gs', ['-sDEVICE=pdfwrite', '-dNOPAUSE', '-dQUIET', '-dBATCH', '-dFastWebView=true' , "-sOutputFile=#{path}/#{fileId}.pdf", Storage._fullPath publication.cachedFilename()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use -dUseCIEColor -sProcessColorModel=DeviceCMYK
as well?
~11 hours. Fixed most of the issues in the pull request. PDF rendering still bugged, because for some reason, the client doesn't seem to have access to |
You cannot configure what they set it to? Maybe you could also use |
|
||
Publication._filenamePrefix() + 'cache' + Storage._path.sep + @cachedId + '.' + @mediaType | ||
cachedFilename: (fileId) => | ||
throw new Error "Cached filename not available" unless @cachedId and @mediaType and @files.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is mediaType still here?
I think you are a bit sloppy. You should search all around the codebase for things like |
You should not need top-level |
Did you look into this? For example, check this as well. |
So the reason why things don't work for you is because you didn't update all codepaths. |
What's the status here? |
#566. Still need to fix some coffeescript syntax, use random.Id, add a database migration, change
files
tocached
, and update comments.