PDF Bug - The Sequel #21

athityakumar · 2018-08-04T18:01:44Z

Le Conspiracy Theory

The code changes in the previous PR actually reflected in no apparent changes to MFTP as a service, because the issue was in the mailer function (hooks.py). The mailer was fetching the PDF's attachment URL and then reading it before sending to MailGun API.

Well, due to the recent update of the PDF portal, the PDF URL is no longer available directly without SSO (ERP login). (Yup, previously - the PDF URL was THE only non-SSO openly available URL being used in our script)

Hence, without providing any session details in the mailer script, requests.get(pdf_url) would simply return an auth error. I think this is what's been being written into our PDF files - error codes that are obviously not PDF content. Hence, the PDFs failed to open.

athityakumar · 2018-08-05T10:24:31Z

Ping @amrav @ghostwriternr @icyflame

amrav · 2018-08-05T12:44:19Z

update.py

-            for chunk in iter(lambda: r.raw.read(4096), b""):
-                hash_.update(chunk)
-            notice['attachment_md5'] = hash_.hexdigest()
+            notice['attachment_raw'] = r


We weren't storing the raw attachment because over time this would consume all our storage (hence md5-ing it and storing just that).

A better approach here is to keep the md5, and ignore the attachment_raw field when doing db writes/reads.

Hmm, but wouldn't we actually require the raw attachment for directly using from the mailer (in hooks.py) rather than the md5?

Yes, you could pass attachment_raw through to the mailer, but remove it when interacting with the database. There are two interactions: the first when we search for any notices that match this one, and the second when we write a new notice into the database. We basically want to skip over the attachment_raw key for both these interactions (you can do this by constructing a new dict with only the required keys).

icyflame · 2018-08-07T13:37:06Z

On a separate matter, you can check if the error codes are being written to the attachments by downloading them and trying to open them in a text editor. (Or binary editor)

If they are error codes, the size of each PDF attachment will be same to the last byte. This is generally a good indication that these are error codes.

icyflame · 2018-08-07T13:39:05Z

@athityakumar

From your description of the issue, it seems like the problem with "GET"-ing the PDF file is that we are not authenticating when we do requests.get. In that same function, we already have an authenticated session. So, another approach might be to do the same thing as we used to do before, but instead of getting the PDF from a plain session, get it from the already authenticated session.

Of course, we only need to do this if the raw PDF data isn't already there in the notices API call (which is what your PR seems to suggest)

icyflame · 2018-08-07T13:35:50Z

hooks.py

-            r = requests.get(notice['attachment_url'], stream=True,
-                             **req_args)
-            r.raw.decode_content = True
+            r = notice['attachment_raw']


Is the notice now returned as part of the notices call on ERP? i.e. In every notice, the data for the raw PDF as well as a URL to the PDF is included. This doesn't seem optimal or required.

@athityakumar can you explain what has changed in the response to the notices call? :)

You got it absolutely right. They've shifted from a non-SSO to an SSO URL, and now require auth to read the PDFs. So, session.get works whereas requests.get doesn't (it used to, previously).

So, I'm storing the raw attachment text to the notice variable (db?) and using it while calling the mailer function.

AH! I see, I understand your approach now! I was reading the code in the reverse order, because I was confused about the control flow. Thanks for explaining! :)

athityakumar · 2018-08-07T20:09:03Z

@icyflame - Yup, they're all consistently 4.1kb files. So, definitely error codes - thanks for pointing to that clue. 😄

athityakumar · 2018-08-07T20:21:10Z

@amrav @icyflame - I'm pretty new to this codebase and tornado. So, can you please verify once if this is what you mean in the requested changes?

Instead of storing the raw attachment text to notice (in update.py), store the pdf url and fetch the raw attachment text from this url from the mailer function before calling mailgun API. Is that right?

Also, why are we currently storing the md5-ed attachment? I can't see it being called from anywhere else in the codebase.

amrav · 2018-08-08T00:05:33Z

@athityakumar I meant something else. Sorry, let me explain more clearly.

This is the way the code currently works. Suppose you have a notice:

{
  "subject": "...",
  "company": "...",
  "text": "..."
  "time": "..."
  "attachment_url": "...",
  "attachment_md5": "..."
}

you can see that these are the only fields being set in update.py. So this is the structure of the notice we're currently passing around.

So every time the updater runs, we fetch the last NUM_NOTICES_DIFFED notices (currently 50), and we populate the above structure for each notice by scraping the appropriate part of the html. Now we have the top N notices in memory.

In the same file, in handle_notices_diff, you can see how we figure out which are the new notices. Here's the function, annotated with what each line is doing

def handle_notices_diff(notices):
    # get the database connector for mongodb
    notices_coll = mc.get_default_database().notices

    # initialise an empty list of different notices; we will populate this
    different_notices = []
    print 'Checking ', len(notices), 'notices'
    # loop through the notices fetched
    for notice in notices:
        # Fetch an *exact* match, if it exists, from the database. This will match a stored document
        # only if *every single field* matches
        db_notice = notices_coll.find_one(notice)
        if db_notice is None:
            # We didn't find any matches, so at least one field on this notice must be different. This could
            # happen if the whole notice is new, but it also happens when the text or attachment url or
            # attachment content changes.
            #
            # This the point of storing the attachment_md5 as one of the fields - if the PDF 
            # content changes, even if the attachment_url doesn't change, we'll still send out 
            # an update because we'll detect a difference. We could do the same thing by 
            # storing the raw attachment content itself, but  that takes more space in the database,
            # and we're on a free tier :)

            # This notice is different, add it to our list
            different_notices.append(notice)
    print 'Different notices: ', different_notices
    if len(different_notices) > 0:
        for notice in different_notices:
            hooks.notices_updated([notice])
            # store this notice into the database so we know we've processed it the next time
            # the updater runs
            notices_coll.insert_one(notice)

Hopefully this makes it clear why we store the attachment md5, and also why we don't want to store the attachment raw content. Really these functions should be commented much more, the fault is entirely mine :) Please add comments wherever you think is appropriate.

The change I'm suggesting is to add attachment_raw into the notice structure above (in addition to attachment_md5 already present, and then add a function like:

def sanitise_notice_for_database(notice):
  # return copy of notice with raw attachment content removed

and use this function in handle_notices_diff when searching for or storing notices. This approach avoids having to fetch the notice content twice (once in update.py and once in the mailer), and avoids needing to establish yet another authenticated session in the mailer (which involves several requests to ERP) or pass the authenticated session to the mailer (breaks separation of concerns).

Let me know if anything is still unclear :) And thanks for taking the time to fix!

athityakumar · 2018-08-08T10:12:37Z

@amrav - Thanks a lot for spending your time on explaining this. I think I've understood the MFTP flow now. 😅

Review again, please? I'll also add another PR with inline comments. 😄

amrav · 2018-08-08T10:26:38Z

update.py

@@ -63,7 +64,9 @@ def check_notices(session, sessionData):
            notice['attachment_url'] = ERP_ATTACHMENT_URL + m.group(1)
            r = session.get(notice['attachment_url'], stream=True)
            r.raw.decode_content = True
+            notice['attachment_raw'] = r


This actually adds the request object itself rather than the raw data. What is the type of request.raw, and can we use that directly?

request.raw is a urllib3.response.HTTPResponse object, and we can use it directly too. 👍

amrav · 2018-08-08T10:30:11Z

update.py

            hash_ = hashlib.md5()
+
            for chunk in iter(lambda: r.raw.read(4096), b""):


If r.raw is a stream, are we allowed to read it twice (once here and once in the mailer), or does it get consumed the first time you read it?

It gets consumed only after calling r.raw.read() - r.raw can be used twice.

amrav · 2018-08-08T16:11:17Z

Hmm, pretty sure the mailer needs to call `r.raw.read` too, in order to populate the attachment. Have you tried running locally/on heroku with your credentials and seeing if you get an email correctly? You can use the mailgun secret key from the heroku app, but make sure to set your own email address.

…

On Wed, 8 Aug 2018 at 12:33, Athitya Kumar ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In update.py <#21 (comment)>: > hash_ = hashlib.md5() + for chunk in iter(lambda: r.raw.read(4096), b""): It gets consumed only after calling r.raw.read() - r.raw can be used twice. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#21 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABIZ_5wUskVr4bi_OMZI0AV54sDoCWlAks5uOsyQgaJpZM4VvFcW> .

athityakumar · 2018-08-10T00:14:01Z

@amrav @icyflame - And it now finally works, locally! 🎉

MFTP works locally! PDFs are properly parsed and written now.

amrav

Looks great! Couple of minor issues. We're streaming the attachment twice - not a big deal, but can be avoided by storing the byte stream in memory.

While you're at it, would be good to use r.iter_content instead of r.raw as the docs suggest.

Out of curiosity, why did you remove deepcopy (shallow copy works too)?

athityakumar · 2018-08-10T01:40:13Z

The iter_content approach actually made it easier to prevent streaming the attachment twice. 😄

While trying to setup MFTP locally, the deepcopy line threw a couple of errors without debuggable error messages. So, I just wanted to workaround the errors - and focus on getting the PDFs right. And I forgot to revert it back afterwards. Seems like the error then, was just due to the missing if 'attachment_raw' in sanitised_notice check and I had overlooked it. 😓

Yes, shallow copy is enough for our notice structure and works fine. 👍

amrav · 2018-08-10T09:11:54Z

update.py

@@ -64,7 +65,9 @@ def check_notices(session, sessionData):
            r = session.get(notice['attachment_url'], stream=True)
            r.raw.decode_content = True
            hash_ = hashlib.md5()
-            for chunk in iter(lambda: r.raw.read(4096), b""):
+            notice['attachment_raw'] = ""


Hmm, this seems to be appending bytes to a string. Have you checked whether this actually works? I think it's better to use an explicit byte stream?

Have you checked whether this actually works?

Yup, it does. By appending the bytes to the string, we're getting back what we would have got anyway previously from r.raw 😄

amrav

Looks good, thanks for doing this fix!

amrav · 2018-08-10T10:10:34Z

Oh you could also use b"" instead of "" for the empty string, just to make it even more explicit that it's a string of bytes.

athityakumar · 2018-08-11T06:40:17Z

I'm (kinda hurriedly, sorry!) merging this PR now, so that the bug fix gets deployed at least from this weekend for MFTP users. 😅

I'll document the MFTP flow too and send a PR soon.

Thanks for reviewing, @amrav @icyflame 🎉

Directly uses PDF content rather than PDF URL

50524e0

amrav requested changes Aug 5, 2018

View reviewed changes

icyflame reviewed Aug 7, 2018

View reviewed changes

Sanitises notice for db purposes

d68c865

amrav reviewed Aug 8, 2018

View reviewed changes

Passes on requests.raw object rather than requests object

27867d1

Removes deepcopy and raw.read() bugs

36a9b41

MFTP works locally! PDFs are properly parsed and written now.

athityakumar force-pushed the really-trying-hard-to-fix-pdf-issue branch from 48781fc to 36a9b41 Compare August 10, 2018 00:42

amrav reviewed Aug 10, 2018

View reviewed changes

Reduces request calls back to 1, and uses iter_content

052ca58

amrav reviewed Aug 10, 2018

View reviewed changes

amrav approved these changes Aug 10, 2018

View reviewed changes

Makes it explicit that raw_attachment is a string of bytes

47fa0d3

athityakumar merged commit bf6e651 into metakgp:master Aug 11, 2018

athityakumar mentioned this pull request Aug 11, 2018

Notice PDFs not opening #19

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDF Bug - The Sequel #21

PDF Bug - The Sequel #21

athityakumar commented Aug 4, 2018

athityakumar commented Aug 5, 2018

amrav Aug 5, 2018

athityakumar Aug 5, 2018

amrav Aug 6, 2018

icyflame commented Aug 7, 2018

icyflame commented Aug 7, 2018

icyflame Aug 7, 2018

athityakumar Aug 7, 2018

icyflame Aug 8, 2018

athityakumar commented Aug 7, 2018

athityakumar commented Aug 7, 2018

amrav commented Aug 8, 2018

athityakumar commented Aug 8, 2018

amrav Aug 8, 2018

athityakumar Aug 8, 2018

amrav Aug 8, 2018

athityakumar Aug 8, 2018

amrav commented Aug 8, 2018 via email

athityakumar commented Aug 10, 2018

amrav left a comment

athityakumar commented Aug 10, 2018

amrav Aug 10, 2018

athityakumar Aug 10, 2018

amrav left a comment

amrav commented Aug 10, 2018

athityakumar commented Aug 11, 2018

		hash_ = hashlib.md5()

		for chunk in iter(lambda: r.raw.read(4096), b""):

PDF Bug - The Sequel #21

PDF Bug - The Sequel #21

Conversation

athityakumar commented Aug 4, 2018

Le Conspiracy Theory

athityakumar commented Aug 5, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

icyflame commented Aug 7, 2018

icyflame commented Aug 7, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

athityakumar commented Aug 7, 2018

athityakumar commented Aug 7, 2018

amrav commented Aug 8, 2018

athityakumar commented Aug 8, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amrav commented Aug 8, 2018 via email

athityakumar commented Aug 10, 2018

amrav left a comment

Choose a reason for hiding this comment

athityakumar commented Aug 10, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amrav left a comment

Choose a reason for hiding this comment

amrav commented Aug 10, 2018

athityakumar commented Aug 11, 2018