-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDF Bug - The Sequel #21
PDF Bug - The Sequel #21
Conversation
update.py
Outdated
for chunk in iter(lambda: r.raw.read(4096), b""): | ||
hash_.update(chunk) | ||
notice['attachment_md5'] = hash_.hexdigest() | ||
notice['attachment_raw'] = r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We weren't storing the raw attachment because over time this would consume all our storage (hence md5-ing it and storing just that).
A better approach here is to keep the md5, and ignore the attachment_raw
field when doing db writes/reads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, but wouldn't we actually require the raw attachment for directly using from the mailer (in hooks.py
) rather than the md5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you could pass attachment_raw
through to the mailer, but remove it when interacting with the database. There are two interactions: the first when we search for any notices that match this one, and the second when we write a new notice into the database. We basically want to skip over the attachment_raw
key for both these interactions (you can do this by constructing a new dict with only the required keys).
On a separate matter, you can check if the error codes are being written to the attachments by downloading them and trying to open them in a text editor. (Or binary editor) If they are error codes, the size of each PDF attachment will be same to the last byte. This is generally a good indication that these are error codes. |
From your description of the issue, it seems like the problem with "GET"-ing the PDF file is that we are not authenticating when we do Of course, we only need to do this if the raw PDF data isn't already there in the |
hooks.py
Outdated
r = requests.get(notice['attachment_url'], stream=True, | ||
**req_args) | ||
r.raw.decode_content = True | ||
r = notice['attachment_raw'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the notice now returned as part of the notices
call on ERP? i.e. In every notice, the data for the raw PDF as well as a URL to the PDF is included. This doesn't seem optimal or required.
@athityakumar can you explain what has changed in the response to the notices call? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You got it absolutely right. They've shifted from a non-SSO to an SSO URL, and now require auth to read the PDFs. So, session.get
works whereas requests.get
doesn't (it used to, previously).
So, I'm storing the raw attachment text to the notice
variable (db?) and using it while calling the mailer function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH! I see, I understand your approach now! I was reading the code in the reverse order, because I was confused about the control flow. Thanks for explaining! :)
@icyflame - Yup, they're all consistently 4.1kb files. So, definitely error codes - thanks for pointing to that clue. 😄 |
@amrav @icyflame - I'm pretty new to this codebase and tornado. So, can you please verify once if this is what you mean in the requested changes? Instead of storing the raw attachment text to Also, why are we currently storing the md5-ed attachment? I can't see it being called from anywhere else in the codebase. |
@athityakumar I meant something else. Sorry, let me explain more clearly. This is the way the code currently works. Suppose you have a notice:
you can see that these are the only fields being set in So every time the updater runs, we fetch the last In the same file, in def handle_notices_diff(notices):
# get the database connector for mongodb
notices_coll = mc.get_default_database().notices
# initialise an empty list of different notices; we will populate this
different_notices = []
print 'Checking ', len(notices), 'notices'
# loop through the notices fetched
for notice in notices:
# Fetch an *exact* match, if it exists, from the database. This will match a stored document
# only if *every single field* matches
db_notice = notices_coll.find_one(notice)
if db_notice is None:
# We didn't find any matches, so at least one field on this notice must be different. This could
# happen if the whole notice is new, but it also happens when the text or attachment url or
# attachment content changes.
#
# This the point of storing the attachment_md5 as one of the fields - if the PDF
# content changes, even if the attachment_url doesn't change, we'll still send out
# an update because we'll detect a difference. We could do the same thing by
# storing the raw attachment content itself, but that takes more space in the database,
# and we're on a free tier :)
# This notice is different, add it to our list
different_notices.append(notice)
print 'Different notices: ', different_notices
if len(different_notices) > 0:
for notice in different_notices:
hooks.notices_updated([notice])
# store this notice into the database so we know we've processed it the next time
# the updater runs
notices_coll.insert_one(notice) Hopefully this makes it clear why we store the attachment md5, and also why we don't want to store the attachment raw content. Really these functions should be commented much more, the fault is entirely mine :) Please add comments wherever you think is appropriate. The change I'm suggesting is to add
and use this function in Let me know if anything is still unclear :) And thanks for taking the time to fix! |
@amrav - Thanks a lot for spending your time on explaining this. I think I've understood the MFTP flow now. 😅 Review again, please? I'll also add another PR with inline comments. 😄 |
update.py
Outdated
@@ -63,7 +64,9 @@ def check_notices(session, sessionData): | |||
notice['attachment_url'] = ERP_ATTACHMENT_URL + m.group(1) | |||
r = session.get(notice['attachment_url'], stream=True) | |||
r.raw.decode_content = True | |||
notice['attachment_raw'] = r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually adds the request object itself rather than the raw data. What is the type of request.raw
, and can we use that directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
request.raw
is a urllib3.response.HTTPResponse
object, and we can use it directly too. 👍
update.py
Outdated
hash_ = hashlib.md5() | ||
|
||
for chunk in iter(lambda: r.raw.read(4096), b""): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If r.raw
is a stream, are we allowed to read it twice (once here and once in the mailer), or does it get consumed the first time you read it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets consumed only after calling r.raw.read()
- r.raw
can be used twice.
Hmm, pretty sure the mailer needs to call `r.raw.read` too, in order to
populate the attachment. Have you tried running locally/on heroku with your
credentials and seeing if you get an email correctly? You can use the
mailgun secret key from the heroku app, but make sure to set your own email
address.
…On Wed, 8 Aug 2018 at 12:33, Athitya Kumar ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In update.py
<#21 (comment)>:
> hash_ = hashlib.md5()
+
for chunk in iter(lambda: r.raw.read(4096), b""):
It gets consumed only after calling r.raw.read() - r.raw can be used
twice.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#21 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABIZ_5wUskVr4bi_OMZI0AV54sDoCWlAks5uOsyQgaJpZM4VvFcW>
.
|
MFTP works locally! PDFs are properly parsed and written now.
48781fc
to
36a9b41
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Couple of minor issues. We're streaming the attachment twice - not a big deal, but can be avoided by storing the byte stream in memory.
While you're at it, would be good to use r.iter_content
instead of r.raw
as the docs suggest.
Out of curiosity, why did you remove deepcopy (shallow copy works too)?
The While trying to setup MFTP locally, the Yes, shallow copy is enough for our |
update.py
Outdated
@@ -64,7 +65,9 @@ def check_notices(session, sessionData): | |||
r = session.get(notice['attachment_url'], stream=True) | |||
r.raw.decode_content = True | |||
hash_ = hashlib.md5() | |||
for chunk in iter(lambda: r.raw.read(4096), b""): | |||
notice['attachment_raw'] = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this seems to be appending bytes to a string. Have you checked whether this actually works? I think it's better to use an explicit byte stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you checked whether this actually works?
Yup, it does. By appending the bytes to the string, we're getting back what we would have got anyway previously from r.raw
😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for doing this fix!
Oh you could also use |
Le Conspiracy Theory
The code changes in the previous PR actually reflected in no apparent changes to MFTP as a service, because the issue was in the mailer function (
hooks.py
). The mailer was fetching the PDF's attachment URL and then reading it before sending to MailGun API.Well, due to the recent update of the PDF portal, the PDF URL is no longer available directly without SSO (ERP login). (Yup, previously - the PDF URL was THE only non-SSO openly available URL being used in our script)
Hence, without providing any session details in the mailer script,
requests.get(pdf_url)
would simply return an auth error. I think this is what's been being written into our PDF files - error codes that are obviously not PDF content. Hence, the PDFs failed to open.