Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor fbscraper.py to remove facepy dependency #73

Merged
merged 2 commits into from Feb 11, 2017

Conversation

@ghostwriternr
Copy link
Member

commented Feb 4, 2017

facepy's fork that has been in active use in naarad has not been updated
in a while, and falls way behind vanilla Graph API in terms of
functionality as well as documentation. Directly interfacing with the
Graph API makes it easier to work with the API, and helps keeping the
code simple enough for future contributors by not adding another
abstraction layer in the middle.

Notes:

  • Regenerated requirements.txt using 'pipreqs' (Please do check if the
    list is exhaustive)
  • [WIP] Cleaned up code to be more in sync with PEP 8 style guide
@athityakumar

This comment has been minimized.

Copy link
Member

commented Feb 5, 2017

@ghostwriternr - Scraping part is working so far. 👍

@DefCon-007 - There are warnings like Insecure Platform Warning and SNI Missing Warning.
Did these warnings appear before too?

image

@athityakumar
Copy link
Member

left a comment

@ghostwriternr - Just came across this error :

image

@DefCon-007

This comment has been minimized.

Copy link
Member

commented Feb 5, 2017

@athityakumar No, These warnings were not there before. And I think they are because of requests module and not due to facebook.

@ghostwriternr

This comment has been minimized.

Copy link
Member Author

commented Feb 5, 2017

@athityakumar I will look into the reason for those warnings. Funnily enough, I didn't encounter this when I was testing the code. Could this be because of difference in Python (and pip package) versions?

UPDATE: Looks like that is indeed the issue. A quick google search says this is a security issue observed in Python versions < 2.7.9 and is commonly fixed by using pip install requests[security] (instead of just pip install requests), which installs all required packages for certificate verification on Python 2. Might modifying this in requirements.txt do the trick?

Also, I'm not able to replicate the error you have pointed out in the screenshot. The line in question is the exception handler for file opening, and I have not edited any of the code in the function call stack till that point. Can you please tell me when does this occur? I might have missed a test case.

@athityakumar

This comment has been minimized.

Copy link
Member

commented Feb 5, 2017

@ghostwriternr - I'm using python 2.7, and both - the warning & error were due to different python versions.

No FileNotFoundError for Python 2.7 : http://stackoverflow.com/questions/26745283/how-do-i-import-filenotfounderror-from-python-3
We have to setup OSError instead of FileNotFoundError for making it work across different Python versions, or simply just have it as exception: in that line 158.

The requests[security] didn't solve the warnings. Let me check if anything else solves the warning.
Source for future reference : http://stackoverflow.com/questions/19559247/requirements-txt-depending-on-python-version

UPDATE : pip install pyopenssl ndg-httpsclient pyasn1 solved the warning for python 2.7. And no warnings / errors for 3.4 version. 😄

@athityakumar

This comment has been minimized.

Copy link
Member

commented Feb 5, 2017

@icyflame @DefCon-007 : Python 2.x and 3.x are quite incompatible. Do we need naarad to be supported by both? Anyways, the server has 3.x currently.

@DefCon-007

This comment has been minimized.

Copy link
Member

commented Feb 5, 2017

@athityakumar : It will be better if naarad can become cross compatible with python 2 and python 3. How about we create a new issue for it ?

@athityakumar

This comment has been minimized.

Copy link
Member

commented Feb 5, 2017

@DefCon-007 : Sure. But we don't want the server executing the cron to be confused about this.

@athityakumar

This comment has been minimized.

Copy link
Member

commented Feb 5, 2017

Opened issue #74 . Do test this PR on 3.x version, @DefCon-007.

@ghostwriternr

This comment has been minimized.

Copy link
Member Author

commented Feb 5, 2017

If this is working fine, should I go ahead with fixing issue #49 ? Or do we fix issue #74 first before moving forward?

It should be pretty straightforward now since graph API's posts edge has an updated_time field that can be used to easily check for updates to posts in the feed.

And we could also decide on a rough metric to decide how far back the posts do we traverse to check for edits. Such a simple traversal to, say 10-15 posts, should be enough for our purpose. A comparison based on hash values using ETags might be an overkill (and also ineffective) for our current scraping style.

@athityakumar

This comment has been minimized.

Copy link
Member

commented Feb 5, 2017

@ghostwriternr - #49 is the thing that's going to matter for a user. So, let's do that first. Range of 10-15 posts seems logical - no sane group would edit a post that is older than that. And we would also need a script that deletes duplicate posts in all existing json files.

@ghostwriternr

This comment has been minimized.

Copy link
Member Author

commented Feb 5, 2017

@athityakumar : Noted. Will incorporate both when I'm done with my classes tomorrow.

@icyflame

This comment has been minimized.

Copy link
Member

commented Feb 6, 2017

Python 2.x and 3.x are quite incompatible. Do we need naarad to be supported by both? Anyways, the server has 3.x currently.

There's not much sense in making this cross compatible. A better choice would be to use Python 3.

If you are having to add dependencies (like python version wise requirements.txt) or go out of your way to support Python 2, then I would suggest that you do NOT do it, and stick to Python 3.

@icyflame
Copy link
Member

left a comment

It's a good PR, with a very good premise. The diff right now is too cluttered with these identical blocks appearing in the diff because of changed indentation, and it's hard to look at what has changed, and what has been moved forward by one space. Fixing this should be the way to go forward.

DateTime[0]['real_date'],
event['place']['name'])
return message
"""

This comment has been minimized.

Copy link
@icyflame

icyflame Feb 6, 2017

Member

This is the exact same code that's appearing in the diff because this file was apparently using 3 spaces (??) as it's previous default indent.
@ghostwriternr: In any case, this has cluttered the diff too much, and I can't really figure what are your changes, and what changes are because of this diff.
3 spaces is not a good idea, but I would suggest that you keep that PR different from this. i.e. merge this PR in and follow it up with a PR that fixes all the indentation and does NOT touch any of the code.

The problem with the current approach is that git blame will falsely accuse @ghostwriternr of writing these lines of code when he actually only shifted them forward by 1 space.

item['real_date'] = local_date.strftime('%d-%m-%Y')
item['real_time'] = local_date.strftime('%I:%M%p')
return data
for item in data:

This comment has been minimized.

Copy link
@icyflame

icyflame Feb 6, 2017

Member

Again, the indentation has caused these identical blocks to appear in the diff.

Refactor fbscraper.py to remove facepy dependency
facepy's fork that has been in active use in naarad has not been updated
in a while, and falls way behind vanilla Graph API in terms of
functionality as well as documentation. Directly interfacing with the
Graph API makes it easier to work with the API, and helps keeping the
code simple enough for future contributors by not adding another
abstraction layer in the middle.

Notes:

  - Regenerated requirements.txt using 'pipreqs' (Please do check if the
    list is exhaustive)

@ghostwriternr ghostwriternr force-pushed the ghostwriternr:refactor branch from 52933b9 to 6591e7b Feb 6, 2017

@ghostwriternr

This comment has been minimized.

Copy link
Member Author

commented Feb 9, 2017

@athityakumar @icyflame @DefCon-007 : In case you missed the notification for the updated commit 3 days back. Do let me know if it needs any more changes 😄

@raivivek

This comment has been minimized.

Copy link
Member

commented Feb 9, 2017

Just to echo @icyflame's comment: Supporting Python 2 or cross compatibility is not a (big) priority.


graph = GraphAPI(access_token)
s = requests.Session()

This comment has been minimized.

Copy link
@icyflame

icyflame Feb 9, 2017

Member

This doesn't feel like a good name for a session that's being used throughout the file. Maybe you should use something like session or req_session, that way the req_session.get call would make more sense at first glance.

This comment has been minimized.

Copy link
@ghostwriternr

ghostwriternr Feb 10, 2017

Author Member

@icyflame : Aye aye sir! That's a very valid point. Updated accordingly to req_session.

@icyflame

This comment has been minimized.

Copy link
Member

commented Feb 9, 2017

@ghostwriternr Patch LGTM. One pretty minor aesthetic change, the diff is a lot cleaner now, and it was easier to review.

@DefCon-007 you should ensure that this builds and updates the page properly before merging of course.

@DefCon-007
Copy link
Member

left a comment

It is working fine.

@DefCon-007 DefCon-007 merged commit 23db28c into metakgp:master Feb 11, 2017

@ghostwriternr ghostwriternr deleted the ghostwriternr:refactor branch Feb 11, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.