9.4 + 9.7 Twitter Search + Saving to MongoDB #212

Open · curtiswallen opened this issue Aug 10, 2014 · 8 comments

@curtiswallen

For some reason, no matter what value I pass for max_results, it always collects 200 tweets; no more, no less.

Code:

import twitter
import json
import io
import pymongo

def oauth_login():
    
    CONSUMER_KEY = 'XXXXXXXXXXXXXXXXXXXXXXXXx'
    CONSUMER_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
    OAUTH_TOKEN = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
    OAUTH_TOKEN_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
    
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    
    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

def twitter_search(twitter_api, q, max_results=1000, **kw):
 
    # Fetch the first page of results (count=100 is the per-request maximum)
    search_results = twitter_api.search.tweets(q=q, count=100, **kw)
    
    statuses = search_results['statuses']

    max_results = min(1000, max_results)
    tweet_count = 0

    # Fetch up to 10 more pages of 100 tweets each (~1,000 tweets maximum)
    for _ in range(10):
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError:  # no more results when next_results doesn't exist
            break
            
        # next_results is a query string like '?max_id=...&q=...';
        # parse it into keyword args for the next request
        kwargs = dict([ kv.split('=') 
                        for kv in next_results[1:].split("&") ])
        
        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']
        
        tweet_count += 100
        print tweet_count
        
        if len(statuses) > max_results: 
            break
            
    return statuses

def save_to_mongo(data, mongo_db, mongo_db_coll, **mongo_conn_kw):
    
    # Connect to MongoDB and write the documents to the named collection
    client = pymongo.MongoClient(**mongo_conn_kw)
    db = client[mongo_db]
    coll = db[mongo_db_coll]
    
    return coll.insert(data)

twitter_api = oauth_login()
print "Authed to Twitter. Searching now..."

q = "#ISIS"
results = twitter_search(twitter_api, q, max_results=1000)
print "Results retrieved. Saving to MongoDB..." 

save_to_mongo(results, 'search_results', q)

In the terminal I get:

Authed to Twitter. Searching now...
100
200
Results retrieved. Saving to MongoDB...

Then when I check the DB, 200 results. Every time.
I've tried passing "10" for max_results, still 200.
I've tried passing "1000" for max_results (as shown), still 200.

Thoughts?

@ptwobrussell (Owner)

In terms of why you never get more than 200 results: it's entirely possible that Twitter is limiting search results to a maximum of 200 at this point in time (all subject to their platform's operational capacity). Per their own API docs [1], the code looks for the 'next_results' node in the response and bails out when it doesn't find it, since that's how you're supposed to navigate to the next batch of results.

In terms of why you always get 200 results instead of fewer (say, 10 results, or 100, or 142 as specified by the max_results parameter): I just noticed that the twitter_search function returns statuses, which should technically have been written as statuses[:max_results] so as to slice off just what you asked for instead of returning whatever it happened to get (which is optimized for maximum volume).

Does that help? In the former case, I think it's just a current (possibly semi-permanent -- who knows?) limitation of the Search API, where Twitter has been known to adjust API responses as needed to maintain platform performance. I can't see a problem with the code as written, though maybe I just have a blind spot... In the latter case, it's a mostly harmless bug where the list slice is missing.
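
In code, the fix is just a one-line change to the end of twitter_search (a sketch of the suggested edit, not what's currently published in the repo):

    # Slice off just what the caller asked for instead of returning
    # whatever the loop happened to accumulate
    return statuses[:max_results]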

[1] https://dev.twitter.com/docs/api/1.1/get/search/tweets


@curtiswallen (Author)

That makes sense. Thanks!

So then, a follow-up question: if I run the request multiple times (scraping 200 tweets at a time), how can I prevent collecting duplicate results?

Is there a way to pull a 'next_results' node from the last tweet stored in the DB, so that I could crawl back through the history of the query?

Or is that something I'll need to figure out on my own? ;-)

@ptwobrussell (Owner)

The best advice I can offer at this very moment is to carefully review the official Search API docs at https://dev.twitter.com/docs/api/1.1/get/search/tweets, since the API client used in the code is literally just a thin Pythonic wrapper around that API. In other words, that API doc is the authority, and we'd need to do the same tinkering and experimenting that it sounds like you're already doing to get to the bottom of some of these things.

I think your best bet is probably to make sure that tweets are keyed on their tweet id, so that you can trivially avoid duplicate results by effectively overwriting any pre-existing info you'd get in subsequent batches. Or filter out duplicates at query time. Whichever is easiest for you.
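
As a rough illustration of the first approach, here's a minimal sketch (the helper name and upsert-by-_id logic are mine, not from the book) of how the save step could key documents on the tweet's id so that re-fetched tweets overwrite rather than duplicate:

import pymongo

def save_to_mongo_dedup(statuses, mongo_db, mongo_db_coll, **mongo_conn_kw):

    client = pymongo.MongoClient(**mongo_conn_kw)
    coll = client[mongo_db][mongo_db_coll]

    for status in statuses:
        # Use the tweet's id as MongoDB's _id so that saving the same
        # tweet twice replaces the old copy instead of duplicating it
        status['_id'] = status['id']
        coll.save(status)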


@curtiswallen (Author)

Cheers! Thanks so much, Matthew.

Love love love the book, and I tremendously admire/appreciate both your activity on GitHub and all the work you've done to make the concepts and content so accessible. Can't wait to see what's next!

@ptwobrussell (Owner)

Thanks! So glad to hear it. Once you work through things some more, I'd love to hear more about your work and what helped/didn't help. Amazon reviews are also a luxury these days if you have a few moments to leave one of those at some point. Thanks again for the encouraging words.


@LisaCastellano

Thank you, Matthew, for your amazing work.
I bought both Mining the Social Web, 2nd Edition, and Dojo: The Definitive Guide! I love your books and the way you've organized the content and the exercises.
I followed your instructions for setting up the VirtualBox VM with Vagrant, Python, etc.: great, it works!
In addition, I installed Django and I'm working with Python via the web.

I had exactly the same issue as Curtis with the Twitter API exercises: 200 statuses returned, every time.

Now it seems to be working. My problem was that next_results was URL-encoded twice, so the hashtag #Obama became:
1st time: %23Obama
2nd time: %25%23Obama
and the third API call returned no statuses at all, which is why I ended up with only 200 results.

So I replaced the statement below:

    kwargs = dict([ kv.split('=') 
                    for kv in next_results[1:].split("&") ])

with the following (after importing urlparse at the top of my .py file):

    next_results = urlparse.parse_qsl(next_results[1:])
    kwargs = dict(next_results)
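
To see why this matters, here's a tiny standalone demo (Python 2; the example next_results value is made up) showing that a naive split keeps the value percent-encoded, while parse_qsl decodes it so it only gets encoded once per request:

import urlparse

next_results = '?max_id=498000000000000000&q=%23Obama&include_entities=1'

# Naive split keeps the value percent-encoded, so the next API call
# encodes it again: '%23Obama' -> '%2523Obama'
naive = dict([ kv.split('=') for kv in next_results[1:].split("&") ])
print naive['q']    # %23Obama

# parse_qsl decodes the value back to '#Obama'
decoded = dict(urlparse.parse_qsl(next_results[1:]))
print decoded['q']  # #Obama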

Hope it can help. It's a pity that I can't test again for a while, since I've reached the Twitter Search API rate limits :(

Waiting for your next books!

@ptwobrussell (Owner)

Thanks so much for this update. I'll take a closer look and update the code in the repo soon.

@nietzschetmh

Thanks a lot, LisaCastellano. Your solution works great for me!
