use Py-StackExchange rather than requests and bs4 #13
This was a big thing for us, and hopefully it will improve speed as well. I'll take a look at this later; there was some discussion in an issue on the Py-StackExchange repo about getting just answer bodies using the API. There's a good chance you've implemented that; either way, it sounds like a great way to eliminate unnecessary data transfer (the rest of the HTML on the page, as with requests/bs4). Just dropping this here for my own reference when reviewing. Cheers!
I hadn't seen that issue before, but I read the StackExchange API docs and Py-StackExchange's source code before implementing this feature, so yes, it's implemented indeed ;-) And yes, I think it is more efficient to deal only with the JSON API rather than full HTML requests.
@WnP This looks suuuuper clean. Starting testing, hopefully will merge by EOD.
@WnP I dig it, merging. I will make some small modifications to the way the output is printed myself, just because I think it is easier to implement those changes than to communicate them. Very minor, just adding some newlines here and there. Will do a new release with those changes. Also, I notice an anecdotal speed difference... Do you? Thanks!
@lukasschwab yes, the speed difference is anecdotal from the client side. Let's compare them with this simple script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from timeit import timeit

import bs4
import html2text
import requests
import stackexchange
from stackexchange import Sort

h = html2text.HTML2Text()
term = 'python flask'
API_KEY = "3GBT2vbKxgh*ati7EBzxGA(("

so = stackexchange.Site(stackexchange.StackOverflow, app_key=API_KEY, impose_throttling=True)
questions = so.search_advanced(
    q=term,
    sort=Sort.Votes)

# Pick the first search result that has an accepted answer
question = None
for q in questions:
    if 'accepted_answer_id' in q.json:
        question = q
        break
else:
    raise Exception('No question found')


def old_way_query(question):
    """Fetch the full question page and scrape the accepted answer out of the HTML."""
    questionurl = question.json['link']
    answerid = question.json['accepted_answer_id']
    response = requests.get(questionurl)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # Focuses on the single div with the matching answerid--necessary b/c bs4 is quirky
    for answerdiv in soup.find_all('div', attrs={'id': 'answer-' + str(answerid)}):
        answertext = h.handle(answerdiv.find('div', attrs={'class': 'post-text'}).prettify())


def new_way_query(question):
    """Fetch only the question and answer bodies through the JSON API."""
    answerid = question.json['accepted_answer_id']
    questiontext = h.handle(so.question(question.id, body=True).body)
    answer = h.handle(so.answer(answerid, body=True).body)


print('old way: %s' % timeit("old_way_query(question)", "from __main__ import question, old_way_query", number=20))
print('new way: %s' % timeit("new_way_query(question)", "from __main__ import question, new_way_query", number=20))

on my laptop using
So in this case (20 executions) it's 22 times faster, and the more executions you run, the bigger the speedup gets. For a single execution the difference really is anecdotal: 1.11 times faster ^^ However, these tests are highly dependent on the network connection.
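Since the numbers depend so much on the network, one way to make the comparison a bit steadier is to repeat each measurement and keep the best round. This is only a rough sketch: `speedup` is a hypothetical helper that is not part of the PR, and `slow` / `fast` are stand-ins that sleep instead of hitting the network.

```python
import time
from timeit import repeat


def speedup(old, new, number=20, rounds=3):
    """Return how many times faster `new` ran than `old`.

    Each callable is timed `rounds` times for `number` calls, and the
    fastest round is kept to dampen network or scheduling noise.
    """
    old_best = min(repeat(old, number=number, repeat=rounds))
    new_best = min(repeat(new, number=number, repeat=rounds))
    return old_best / new_best


# Stand-ins for the two query functions (assumption: the real ones hit the network)
def slow():
    time.sleep(0.005)


def fast():
    time.sleep(0.0005)


if __name__ == "__main__":
    print('speedup: %.2fx' % speedup(slow, fast, number=5, rounds=2))
```

With the script above, the call would look like `speedup(lambda: old_way_query(question), lambda: new_way_query(question))`.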
Answers are now limited to 140 characters in the listing and followed by ... if they are longer.
Let me know if you think it's a good idea or not.
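For reference, the truncation could be sketched roughly like this. The helper name `truncate` and its parameters are hypothetical, and it assumes the 140 characters do not count the trailing `...`; the actual change may differ.

```python
def truncate(text, limit=140, suffix="..."):
    """Return text unchanged if it fits, else its first `limit` characters plus suffix."""
    if len(text) <= limit:
        return text
    # Assumption: the suffix is appended after the limit, not counted inside it
    return text[:limit] + suffix


if __name__ == "__main__":
    print(truncate("x" * 200))
```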