Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor fbscraper.py to remove facepy dependency #73

Merged
merged 2 commits into from Feb 11, 2017

Conversation

ghostwriternr
Copy link
Member

facepy's fork that has been in active use in naarad has not been updated
in a while, and falls way behind vanilla Graph API in terms of
functionality as well as documentation. Directly interfacing with the
Graph API makes it easier to work with the API, and helps keeping the
code simple enough for future contributors by not adding another
abstraction layer in the middle.

Notes:

  • Regenerated requirements.txt using 'pipreqs' (Please do check if the
    list is exhaustive)
  • [WIP] Cleaned up code to be more in sync with PEP 8 style guide

@athityakumar
Copy link
Member

athityakumar commented Feb 5, 2017

@ghostwriternr - Scraping part is working so far. 👍

@DefCon-007 - There are warnings like Insecure Platform Warning and SNI Missing Warning.
Did these warnings appear before too?

image

Copy link
Member

@athityakumar athityakumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghostwriternr - Just came across this error :

image

@DefCon-007
Copy link
Member

@athityakumar No, These warnings were not there before. And I think they are because of requests module and not due to facebook.

@ghostwriternr
Copy link
Member Author

ghostwriternr commented Feb 5, 2017

@athityakumar I will look into the reason for those warnings. Funnily enough, I didn't encounter this when I was testing the code. Could this be because of difference in Python (and pip package) versions?

UPDATE: Looks like that is indeed the issue. A quick google search says this is a security issue observed in Python versions < 2.7.9 and is commonly fixed by using pip install requests[security] (instead of just pip install requests), which installs all required packages for certificate verification on Python 2. Might modifying this in requirements.txt do the trick?

Also, I'm not able to replicate the error you have pointed out in the screenshot. The line in question is the exception handler for file opening, and I have not edited any of the code in the function call stack till that point. Can you please tell me when does this occur? I might have missed a test case.

@athityakumar
Copy link
Member

athityakumar commented Feb 5, 2017

@ghostwriternr - I'm using python 2.7, and both - the warning & error were due to different python versions.

No FileNotFoundError for Python 2.7 : http://stackoverflow.com/questions/26745283/how-do-i-import-filenotfounderror-from-python-3
We have to setup OSError instead of FileNotFoundError for making it work across different Python versions, or simply just have it as exception: in that line 158.

The requests[security] didn't solve the warnings. Let me check if anything else solves the warning.
Source for future reference : http://stackoverflow.com/questions/19559247/requirements-txt-depending-on-python-version

UPDATE : pip install pyopenssl ndg-httpsclient pyasn1 solved the warning for python 2.7. And no warnings / errors for 3.4 version. 😄

@athityakumar
Copy link
Member

@icyflame @DefCon-007 : Python 2.x and 3.x are quite incompatible. Do we need naarad to be supported by both? Anyways, the server has 3.x currently.

@DefCon-007
Copy link
Member

@athityakumar : It will be better if naarad can become cross compatible with python 2 and python 3. How about we create a new issue for it ?

@athityakumar
Copy link
Member

athityakumar commented Feb 5, 2017

@DefCon-007 : Sure. But we don't want the server executing the cron to be confused about this.

@athityakumar
Copy link
Member

Opened issue #74 . Do test this PR on 3.x version, @DefCon-007.

@ghostwriternr
Copy link
Member Author

ghostwriternr commented Feb 5, 2017

If this is working fine, should I go ahead with fixing issue #49 ? Or do we fix issue #74 first before moving forward?

It should be pretty straightforward now since graph API's posts edge has an updated_time field that can be used to easily check for updates to posts in the feed.

And we could also decide on a rough metric to decide how far back the posts do we traverse to check for edits. Such a simple traversal to, say 10-15 posts, should be enough for our purpose. A comparison based on hash values using ETags might be an overkill (and also ineffective) for our current scraping style.

@athityakumar
Copy link
Member

@ghostwriternr - #49 is the thing that's going to matter for a user. So, let's do that first. Range of 10-15 posts seems logical - no sane group would edit a post that is older than that. And we would also need a script that deletes duplicate posts in all existing json files.

@ghostwriternr
Copy link
Member Author

@athityakumar : Noted. Will incorporate both when I'm done with my classes tomorrow.

@icyflame
Copy link
Member

icyflame commented Feb 6, 2017

Python 2.x and 3.x are quite incompatible. Do we need naarad to be supported by both? Anyways, the server has 3.x currently.

There's not much sense in making this cross compatible. A better choice would be to use Python 3.

If you are having to add dependencies (like python version wise requirements.txt) or go out of your way to support Python 2, then I would suggest that you do NOT do it, and stick to Python 3.

Copy link
Member

@icyflame icyflame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good PR, with a very good premise. The diff right now is too cluttered with these identical blocks appearing in the diff because of changed indentation, and it's hard to look at what has changed, and what has been moved forward by one space. Fixing this should be the way to go forward.

fbscraper.py Outdated
DateTime[0]['real_date'],
event['place']['name'])
return message
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the exact same code that's appearing in the diff because this file was apparently using 3 spaces (??) as it's previous default indent.
@ghostwriternr: In any case, this has cluttered the diff too much, and I can't really figure what are your changes, and what changes are because of this diff.
3 spaces is not a good idea, but I would suggest that you keep that PR different from this. i.e. merge this PR in and follow it up with a PR that fixes all the indentation and does NOT touch any of the code.

The problem with the current approach is that git blame will falsely accuse @ghostwriternr of writing these lines of code when he actually only shifted them forward by 1 space.

fbscraper.py Outdated
item['real_date'] = local_date.strftime('%d-%m-%Y')
item['real_time'] = local_date.strftime('%I:%M%p')
return data
for item in data:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the indentation has caused these identical blocks to appear in the diff.

facepy's fork that has been in active use in naarad has not been updated
in a while, and falls way behind vanilla Graph API in terms of
functionality as well as documentation. Directly interfacing with the
Graph API makes it easier to work with the API, and helps keeping the
code simple enough for future contributors by not adding another
abstraction layer in the middle.

Notes:

  - Regenerated requirements.txt using 'pipreqs' (Please do check if the
    list is exhaustive)
@ghostwriternr
Copy link
Member Author

@athityakumar @icyflame @DefCon-007 : In case you missed the notification for the updated commit 3 days back. Do let me know if it needs any more changes 😄

@raivivek
Copy link
Member

raivivek commented Feb 9, 2017

Just to echo @icyflame's comment: Supporting Python 2 or cross compatibility is not a (big) priority.

fbscraper.py Outdated

graph = GraphAPI(access_token)
s = requests.Session()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't feel like a good name for a session that's being used throughout the file. Maybe you should use something like session or req_session, that way the req_session.get call would make more sense at first glance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@icyflame : Aye aye sir! That's a very valid point. Updated accordingly to req_session.

@icyflame
Copy link
Member

icyflame commented Feb 9, 2017

@ghostwriternr Patch LGTM. One pretty minor aesthetic change, the diff is a lot cleaner now, and it was easier to review.

@DefCon-007 you should ensure that this builds and updates the page properly before merging of course.

Copy link
Member

@DefCon-007 DefCon-007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is working fine.

@DefCon-007 DefCon-007 merged commit 23db28c into metakgp:master Feb 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants