filter_entries() w/ content_type ignores entries with no mimeType instead of crashing #36

Open
wants to merge 1 commit into
base: master

@betsy betsy commented Aug 23, 2020

match_content_type checks if mimeType exists in an entry before trying to access it, and does not match if it doesn't find it.
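The behavior described above can be sketched roughly as follows. This is a standalone illustration, not the library's actual method: the function is written as a plain function rather than a class method, and the sample entries are made up.

```python
import re


def match_content_type(entry, content_type, regex=True):
    """Return True if the entry's response mimeType matches content_type.

    Illustrative sketch only: entries with no mimeType are treated as a
    non-match instead of raising KeyError.
    """
    content = entry.get("response", {}).get("content", {})
    if "mimeType" not in content:
        # Nothing to compare against, so do not match.
        return False
    mime_type = content["mimeType"]
    if regex:
        return re.search(content_type, mime_type, flags=re.IGNORECASE) is not None
    return content_type == mime_type


# Made-up entries: the second one has no mimeType key in its content.
entries = [
    {"response": {"content": {"mimeType": "text/html", "size": 10}}},
    {"response": {"content": {"size": 10, "encoding": "base64", "text": ""}}},
]
matching = [e for e in entries if match_content_type(e, "html")]
```

With this guard, filtering skips the entry that lacks mimeType rather than failing on it.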


     :param entry: ``dict`` of a single entry from a HarPage
     :param content_type: ``str`` of regex to use for finding content type
     :param regex: ``bool`` indicating whether to use regex or exact match.
     """
-    mimeType = entry['response']['content']['mimeType']
+    content = entry['response']['content']
+    if 'mimeType' not in content:
Collaborator

Question. In the HAR file that caused this issue, do you see an equivalent to the mimeType in any other keys? Like contentType or something maybe? Just wondering if the issue is not necessarily that it is MISSING but rather just stored somewhere else.

Author

The entries that are causing the problem have content keys ['size', 'encoding', 'text'], while other entries in the har file have content keys ['mimeType', 'size', 'encoding', 'text']. The response status of the problematic entries is listed as 0, but I'm not sure entirely what that means in this context (the content text is still present).

There is an equivalent 'Content-Type' field in all of the response headers in this har file. I haven't had exposure to enough different har formats to know if this is a consistent thing, but if so we could possibly just match on response->headers->Content-Type rather than response->content->mimeType.

A somewhat less elegant but safer possibility is to check for both of these fields.

Author

Though the entries missing mimeType also are missing things like serverIPAddress so I'm not sure what type of special state this is.

Collaborator

Sorry for the delay! Interesting, is it possible that you could provide a redacted example of the HAR file? If we could determine that there is an equivalent key to pull from, or determine exactly what is unique about these requests, we might be able to do better than just ignoring them. Also, could you add a test please? Thanks!

Author

@betsy betsy Sep 2, 2020

Haha apologies for /my/ delay, just moved and started this semester's classes.
I don't think there's anything particularly redact-worthy in there (the most interesting things are some netflix links and my OS/browser type). Can't attach .har files here so have to send to you separately (is there an email I can use?)
Will add a test to check the match functions on artificial entries once we decide what expected behavior should be for the different formats.

Collaborator

Sounds good! Depending on the size of the entries, you could also post just a few example entries here as opposed to the entire HAR file.

Author

Collaborator

Thanks for this! Taking a look at one of the entries with no mimeType property in the response, I do see Content-Type in the response headers. Theoretically these should be interchangeable, so I think the best bet here would be to fall back to the Content-Type header in the absence of mimeType. Here is an example of one of the entries in question:

>>> my_entry['response']['headers']
[{'name': 'Server', 'value': 'nginx'},
 {'name': 'Date', 'value': 'Wed, 19 Aug 2020 21:46:11 GMT'},
 {'name': 'Content-Type', 'value': 'application/octet-stream'},
 {'name': 'Content-Length', 'value': '261971'},
 {'name': 'Last-Modified', 'value': 'Tue, 28 Jul 2020 13:03:17 GMT'},
 {'name': 'Connection', 'value': 'keep-alive'},
 {'name': 'Timing-Allow-Origin', 'value': '*'},
 {'name': 'Cache-Control', 'value': 'no-store'},
 {'name': 'Pragma', 'value': 'no-cache'},
 {'name': 'Access-Control-Allow-Origin', 'value': '*'},
 {'name': 'Access-Control-Expose-Headers', 'value': 'X-TCP-Info'},
 {'name': 'X-TCP-Info', 'value': 'addr=108.24.111.88;port=55018'}]
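One way the fallback could look, as a rough sketch: the helper name `effective_content_type` is hypothetical and not part of the library, and the sample entry only reuses a few of the headers shown above. Header names are compared case-insensitively, which is an assumption about how tolerant the lookup should be.

```python
def effective_content_type(entry):
    """Return the response content type for a HAR entry, or None.

    Hypothetical helper: prefers response.content.mimeType and falls back
    to the Content-Type response header when mimeType is absent.
    """
    response = entry.get("response", {})
    mime_type = response.get("content", {}).get("mimeType")
    if mime_type:
        return mime_type
    # HAR stores headers as a list of {'name': ..., 'value': ...} dicts.
    for header in response.get("headers", []):
        if header.get("name", "").lower() == "content-type":
            return header.get("value")
    return None


# Sample entry shaped like the problematic ones: no mimeType in content.
my_entry = {
    "response": {
        "status": 0,
        "content": {"size": 261971, "encoding": "base64", "text": ""},
        "headers": [
            {"name": "Server", "value": "nginx"},
            {"name": "Content-Type", "value": "application/octet-stream"},
            {"name": "Content-Length", "value": "261971"},
        ],
    }
}
content_type = effective_content_type(my_entry)
```

A header value may carry parameters (for example a charset suffix), so an exact-match comparison against it may need the value trimmed first; a regex search would be less sensitive to that.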
