Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF Bug - The Sequel #21

Merged

Conversation

athityakumar
Copy link
Member

Le Conspiracy Theory

The code changes in the previous PR actually reflected in no apparent changes to MFTP as a service, because the issue was in the mailer function (hooks.py). The mailer was fetching the PDF's attachment URL and then reading it before sending to MailGun API.

Well, due to the recent update of the PDF portal, the PDF URL is no longer available directly without SSO (ERP login). (Yup, previously - the PDF URL was THE only non-SSO openly available URL being used in our script)

Hence, without providing any session details in the mailer script, requests.get(pdf_url) would simply return an auth error. I think this is what's been being written into our PDF files - error codes that are obviously not PDF content. Hence, the PDFs failed to open.

@athityakumar
Copy link
Member Author

Ping @amrav @ghostwriternr @icyflame

update.py Outdated
for chunk in iter(lambda: r.raw.read(4096), b""):
hash_.update(chunk)
notice['attachment_md5'] = hash_.hexdigest()
notice['attachment_raw'] = r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We weren't storing the raw attachment because over time this would consume all our storage (hence md5-ing it and storing just that).

A better approach here is to keep the md5, and ignore the attachment_raw field when doing db writes/reads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, but wouldn't we actually require the raw attachment for directly using from the mailer (in hooks.py) rather than the md5?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you could pass attachment_raw through to the mailer, but remove it when interacting with the database. There are two interactions: the first when we search for any notices that match this one, and the second when we write a new notice into the database. We basically want to skip over the attachment_raw key for both these interactions (you can do this by constructing a new dict with only the required keys).

@icyflame
Copy link
Member

icyflame commented Aug 7, 2018

On a separate matter, you can check if the error codes are being written to the attachments by downloading them and trying to open them in a text editor. (Or binary editor)

If they are error codes, the size of each PDF attachment will be same to the last byte. This is generally a good indication that these are error codes.

@icyflame
Copy link
Member

icyflame commented Aug 7, 2018

@athityakumar

From your description of the issue, it seems like the problem with "GET"-ing the PDF file is that we are not authenticating when we do requests.get. In that same function, we already have an authenticated session. So, another approach might be to do the same thing as we used to do before, but instead of getting the PDF from a plain session, get it from the already authenticated session.

Of course, we only need to do this if the raw PDF data isn't already there in the notices API call (which is what your PR seems to suggest)

hooks.py Outdated
r = requests.get(notice['attachment_url'], stream=True,
**req_args)
r.raw.decode_content = True
r = notice['attachment_raw']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the notice now returned as part of the notices call on ERP? i.e. In every notice, the data for the raw PDF as well as a URL to the PDF is included. This doesn't seem optimal or required.

@athityakumar can you explain what has changed in the response to the notices call? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got it absolutely right. They've shifted from a non-SSO to an SSO URL, and now require auth to read the PDFs. So, session.get works whereas requests.get doesn't (it used to, previously).

So, I'm storing the raw attachment text to the notice variable (db?) and using it while calling the mailer function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AH! I see, I understand your approach now! I was reading the code in the reverse order, because I was confused about the control flow. Thanks for explaining! :)

@athityakumar
Copy link
Member Author

@icyflame - Yup, they're all consistently 4.1kb files. So, definitely error codes - thanks for pointing to that clue. 😄

@athityakumar
Copy link
Member Author

@amrav @icyflame - I'm pretty new to this codebase and tornado. So, can you please verify once if this is what you mean in the requested changes?

Instead of storing the raw attachment text to notice (in update.py), store the pdf url and fetch the raw attachment text from this url from the mailer function before calling mailgun API. Is that right?

Also, why are we currently storing the md5-ed attachment? I can't see it being called from anywhere else in the codebase.

@amrav
Copy link
Member

amrav commented Aug 8, 2018

@athityakumar I meant something else. Sorry, let me explain more clearly.

This is the way the code currently works. Suppose you have a notice:

{
  "subject": "...",
  "company": "...",
  "text": "..."
  "time": "..."
  "attachment_url": "...",
  "attachment_md5": "..."
}

you can see that these are the only fields being set in update.py. So this is the structure of the notice we're currently passing around.

So every time the updater runs, we fetch the last NUM_NOTICES_DIFFED notices (currently 50), and we populate the above structure for each notice by scraping the appropriate part of the html. Now we have the top N notices in memory.

In the same file, in handle_notices_diff, you can see how we figure out which are the new notices. Here's the function, annotated with what each line is doing

def handle_notices_diff(notices):
    # get the database connector for mongodb
    notices_coll = mc.get_default_database().notices

    # initialise an empty list of different notices; we will populate this
    different_notices = []
    print 'Checking ', len(notices), 'notices'
    # loop through the notices fetched
    for notice in notices:
        # Fetch an *exact* match, if it exists, from the database. This will match a stored document
        # only if *every single field* matches
        db_notice = notices_coll.find_one(notice)
        if db_notice is None:
            # We didn't find any matches, so at least one field on this notice must be different. This could
            # happen if the whole notice is new, but it also happens when the text or attachment url or
            # attachment content changes.
            #
            # This the point of storing the attachment_md5 as one of the fields - if the PDF 
            # content changes, even if the attachment_url doesn't change, we'll still send out 
            # an update because we'll detect a difference. We could do the same thing by 
            # storing the raw attachment content itself, but  that takes more space in the database,
            # and we're on a free tier :)

            # This notice is different, add it to our list
            different_notices.append(notice)
    print 'Different notices: ', different_notices
    if len(different_notices) > 0:
        for notice in different_notices:
            hooks.notices_updated([notice])
            # store this notice into the database so we know we've processed it the next time
            # the updater runs
            notices_coll.insert_one(notice)

Hopefully this makes it clear why we store the attachment md5, and also why we don't want to store the attachment raw content. Really these functions should be commented much more, the fault is entirely mine :) Please add comments wherever you think is appropriate.

The change I'm suggesting is to add attachment_raw into the notice structure above (in addition to attachment_md5 already present, and then add a function like:

def sanitise_notice_for_database(notice):
  # return copy of notice with raw attachment content removed

and use this function in handle_notices_diff when searching for or storing notices. This approach avoids having to fetch the notice content twice (once in update.py and once in the mailer), and avoids needing to establish yet another authenticated session in the mailer (which involves several requests to ERP) or pass the authenticated session to the mailer (breaks separation of concerns).

Let me know if anything is still unclear :) And thanks for taking the time to fix!

@athityakumar
Copy link
Member Author

@amrav - Thanks a lot for spending your time on explaining this. I think I've understood the MFTP flow now. 😅

Review again, please? I'll also add another PR with inline comments. 😄

update.py Outdated
@@ -63,7 +64,9 @@ def check_notices(session, sessionData):
notice['attachment_url'] = ERP_ATTACHMENT_URL + m.group(1)
r = session.get(notice['attachment_url'], stream=True)
r.raw.decode_content = True
notice['attachment_raw'] = r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually adds the request object itself rather than the raw data. What is the type of request.raw, and can we use that directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request.raw is a urllib3.response.HTTPResponse object, and we can use it directly too. 👍

update.py Outdated
hash_ = hashlib.md5()

for chunk in iter(lambda: r.raw.read(4096), b""):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If r.raw is a stream, are we allowed to read it twice (once here and once in the mailer), or does it get consumed the first time you read it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets consumed only after calling r.raw.read() - r.raw can be used twice.

@amrav
Copy link
Member

amrav commented Aug 8, 2018 via email

@athityakumar
Copy link
Member Author

@amrav @icyflame - And it now finally works, locally! 🎉

image

MFTP works locally! PDFs are properly parsed and written now.
@athityakumar athityakumar force-pushed the really-trying-hard-to-fix-pdf-issue branch from 48781fc to 36a9b41 Compare August 10, 2018 00:42
Copy link
Member

@amrav amrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Couple of minor issues. We're streaming the attachment twice - not a big deal, but can be avoided by storing the byte stream in memory.

While you're at it, would be good to use r.iter_content instead of r.raw as the docs suggest.

Out of curiosity, why did you remove deepcopy (shallow copy works too)?

@athityakumar
Copy link
Member Author

The iter_content approach actually made it easier to prevent streaming the attachment twice. 😄

While trying to setup MFTP locally, the deepcopy line threw a couple of errors without debuggable error messages. So, I just wanted to workaround the errors - and focus on getting the PDFs right. And I forgot to revert it back afterwards. Seems like the error then, was just due to the missing if 'attachment_raw' in sanitised_notice check and I had overlooked it. 😓

Yes, shallow copy is enough for our notice structure and works fine. 👍

update.py Outdated
@@ -64,7 +65,9 @@ def check_notices(session, sessionData):
r = session.get(notice['attachment_url'], stream=True)
r.raw.decode_content = True
hash_ = hashlib.md5()
for chunk in iter(lambda: r.raw.read(4096), b""):
notice['attachment_raw'] = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this seems to be appending bytes to a string. Have you checked whether this actually works? I think it's better to use an explicit byte stream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked whether this actually works?

Yup, it does. By appending the bytes to the string, we're getting back what we would have got anyway previously from r.raw 😄

Copy link
Member

@amrav amrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for doing this fix!

@amrav
Copy link
Member

amrav commented Aug 10, 2018

Oh you could also use b"" instead of "" for the empty string, just to make it even more explicit that it's a string of bytes.

@athityakumar
Copy link
Member Author

I'm (kinda hurriedly, sorry!) merging this PR now, so that the bug fix gets deployed at least from this weekend for MFTP users. 😅

I'll document the MFTP flow too and send a PR soon.

Thanks for reviewing, @amrav @icyflame 🎉

@athityakumar athityakumar merged commit bf6e651 into metakgp:master Aug 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants