# Using Twitter API for social research purposes
Setting objectives (just an example; the pipeline has since been modified):
1. Start from a basic hashtag about smoking
    * use the _GET search/tweets_ method in the __REST API__
    * start pulling data into MongoDB using the same hashtags and the __streaming API__
2. Look for co-occurrence of hashtags (count the other hashtags that appear most often with the searched ones)
3. Start "tracking" those hashtags as well --> list of hashtags
4. Find the hashtags that are most correlated with smoking
5. Build a network of hashtags

(Repeat the process using keywords instead of hashtags.
Eventually we could use a mixed approach.)
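
Steps 2 and 5 can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: it is written for Python 3, the function names (`hashtags_of`, `cooccurring_hashtags`, `hashtag_edges`) are hypothetical, the sample tweets are hand-made, and it assumes each tweet dictionary carries its hashtags under `entities` > `hashtags` > `text`, as tweet objects returned by the API normally do.

```python
from collections import Counter
from itertools import combinations


def hashtags_of(tweet):
    """Return the sorted, lower-cased, de-duplicated hashtags of one tweet dict."""
    tags = tweet.get("entities", {}).get("hashtags", [])
    return sorted({tag["text"].lower() for tag in tags})


def cooccurring_hashtags(tweets, seed):
    """Count the hashtags that appear in the same tweet as `seed` (step 2)."""
    seed = seed.lower()
    counts = Counter()
    for tweet in tweets:
        tags = hashtags_of(tweet)
        if seed in tags:
            counts.update(tag for tag in tags if tag != seed)
    return counts


def hashtag_edges(tweets):
    """Count every unordered pair of hashtags; each pair is a network edge (step 5)."""
    edges = Counter()
    for tweet in tweets:
        edges.update(combinations(hashtags_of(tweet), 2))
    return edges


# Hand-made tweets for illustration only (not real data).
sample_tweets = [
    {"entities": {"hashtags": [{"text": "tobacco"}, {"text": "smoke"}]}},
    {"entities": {"hashtags": [{"text": "Tobacco"}, {"text": "vaping"}, {"text": "smoke"}]}},
    {"entities": {"hashtags": [{"text": "vaping"}]}},
]

top = cooccurring_hashtags(sample_tweets, "tobacco")
edges = hashtag_edges(sample_tweets)
print(top.most_common())
print(edges.most_common(1))
```

The most frequent entries of `top` are natural candidates for the tracking list of step 3, and the weighted pairs in `edges` can be loaded into any graph library to build the network.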

#### Empirical research on twitter.com 
**_#tobacco_** is among the first two suggestions when we start the search.
It could be a good starting point, as it seems to carry a more neutral sentiment than the other choices that appear (e.g. _#tobaccofree_, _#smoke_, _#smokefree_; hopefully we will find these hashtags in the network).


# REST and streaming API
Both APIs return the same kind of object (tweets), though hopefully not the same individual tweets.
In MongoDB, create one collection for each API: `search` and `stream`.

#### REST API
The _GET search/tweets_ method has a __limit on the number of queries per time interval__.
We have to use it wisely and be careful not to collect the same tweet twice (see indexing).
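
One way to avoid storing the same tweet twice is to key every tweet on its `id_str`. The sketch below does this with a plain dictionary (Python 3; the helper name `add_new_tweets` and the sample data are hypothetical). In MongoDB the same effect is commonly obtained with a unique index on `id_str`, which is an assumption about how the collections would be set up rather than something this notebook shows.

```python
def add_new_tweets(collected, batch):
    """Add the tweets in `batch` to `collected` (a dict keyed by id_str).

    Tweets whose id_str is already present are skipped.
    Returns the number of tweets that were actually new.
    """
    added = 0
    for tweet in batch:
        key = tweet["id_str"]
        if key not in collected:
            collected[key] = tweet
            added += 1
    return added


# Two hand-made batches that overlap on tweet "2".
collected = {}
first = [{"id_str": "1", "text": "a"}, {"id_str": "2", "text": "b"}]
second = [{"id_str": "2", "text": "b"}, {"id_str": "3", "text": "c"}]

new_first = add_new_tweets(collected, first)
new_second = add_new_tweets(collected, second)
print(new_first, new_second, len(collected))
```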

WHAT ARE WE USING HERE?

- API type
[REST APIs](https://dev.twitter.com/rest/public)

- API sub-type
[The Search API](https://dev.twitter.com/rest/public/search)

- Method of the API
[GET search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets)

- Parameters callable on this method (in the REST API, parameters vary from method to method)
[Search parameters](https://dev.twitter.com/rest/reference/get/search/tweets)

__Note__: Unlike the streaming API, which has far fewer methods and shares the same parameters across all of them, the REST API documents parameters separately for each method because they vary.

Rate limits for this API explained:
[How do rate limits work for the REST API? (Stack Overflow)](http://stackoverflow.com/questions/21305547/how-rate-limit-works-in-twitter-in-search-api)

#### Streaming API
Start streaming tweets into MongoDB and update the hashtag list (keyword list) as new evidence is found.

WHAT ARE WE USING HERE?

- API type
[The Streaming APIs](https://dev.twitter.com/streaming/overview)

- API sub-type
[Public streams](https://dev.twitter.com/streaming/public)

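- Method of the API
[GET statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) (note that the streaming code in this notebook calls `statuses.filter`, i.e. the _statuses/filter_ method, in order to track keywords)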

- Parameters callable on this method (parameters are similar for all methods within API type)
[Statuses parameters](https://dev.twitter.com/streaming/overview/request-parameters)

__Note__: The streaming API has common parameters that can be called on each of its methods (follow the last link).

---------------------------------------------------------------------------------------------

# Accessing Twitter for any API requests
An application named "soton_uni" was created through [Application Management](https://apps.twitter.com/).

In [18]:
import twitter

def oauth_login():
    
    CONSUMER_KEY = '6YyEfFZtKh3qoGJ3DTy35ToFl'
    CONSUMER_SECRET = 'dvhZX8j3kp5sPcDNivj8BGLoylJUOUbQkVG3qbICNA81R86kh8'
    OAUTH_TOKEN = '4061210361-6OWiTmHf6JpMBdjWnk3GHzNo57M1AtAXxF1gxdt'
    OAUTH_TOKEN_SECRET = 'QwvuefhvzzSFbuECijEyj1hPMA2jelF1sdpFD7hDmhZZl'
    
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    
    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

# test
twitter_api = oauth_login()    

# twitter_api is now a defined variable
print twitter_api

<twitter.api.Twitter object at 0x10a0311d0>


# use GET search/tweets method in the REST API
Search for the **_#tobacco_** hashtag. We use it as a starting point.

In [21]:
import json

def twitter_search(twitter_api, q, max_results=100000, **kw):
    
    search_results = twitter_api.search.tweets(q=q, count=100, **kw)
    
    statuses = search_results['statuses']
    # Iterate through batches of results by following the cursor until we
    # reach the desired number of results. Maximum count is 100, keeping in mind that OAuth users
    # can "only" make 180 search queries per 15-minute interval, reasonable number of results is ~1000, although
    # that number of results may not exist for all queries.
    
    # Enforce a reasonable limit
    max_results = min(18000, max_results)
    
    for _ in range(179): # at most 180 queries per 15 minutes (the first one is issued outside the loop)
        print "Length of statuses", len(statuses)
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError, e: # No more results when next_results doesn't exist
            break
            
        # Create a dictionary from next_results, which has the following form:
        # ?max_id=313519052523986943&q=NCAA&include_entities=1
        kwargs = dict([ kv.split('=') 
                        for kv in next_results[1:].split("&") ])
        
        search_results = twitter_api.search.tweets(**kwargs)

        statuses.extend(search_results['statuses'])
        
        if len(statuses) > max_results: 
            break
        
    return statuses


# test
twitter_api = oauth_login()
q = "uk"

results = twitter_search(twitter_api, q)

# Number of tweets collected with the query
print len(results)
print
print

# Show 15 tweets: empirical exploration
for i in range(15):
    print json.dumps(results[i]["text"], indent=1)
    print

Length of statuses 100
Length of statuses 200
Length of statuses 300
Length of statuses 400
Length of statuses 500
Length of statuses 600
Length of statuses 700
Length of statuses 800
Length of statuses 900
Length of statuses 1000
Length of statuses 1100
Length of statuses 1200
Length of statuses 1300
Length of statuses 1400
Length of statuses 1500
Length of statuses 1600
Length of statuses 1693
1693


"RT @BBCSport: Olympic and Wimbledon champion Andy Murray will carry the flag for @TeamGB at #Rio2016 \ud83c\uddec\ud83c\udde7\n\nhttps://t.co/z7SBWIWP93 https://t.\u2026"

"RT @_dogandbone_: #RETWEET #FOLLOW @LilEddies 4 a chance to #WIN a PHONECASE #BIGSALE #SUMMER #WednesdayWisdom https://t.co/8JayfDruVY http\u2026"

"#LFC Star https://t.co/lbVIbqdMfe Liverpool ace: This England star is an inspiration to me"

"RT @AutoExpress: New #RAC report reveals pothole damage to UK cars has doubled in a decade... https://t.co/u1vB5GHoNp https://t.co/whEPsMjR\u2026"

"Interview: @jaredyung of UK 

__Note__:
Although we're just passing in a hashtag to the Search API at this point, it's well worth noting that it contains a number of [powerful operators](https://dev.twitter.com/rest/public/search) that allow you to filter queries according to the existence or nonexistence of various keywords, originator of the tweet, location associated with the tweet, etc.
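
As a small illustration of the operators, the sketch below joins several hashtags with the `OR` operator and checks the result against the 500-character query limit documented for this method. It is written for Python 3, and `build_or_query` is a hypothetical helper, not part of the `twitter` library. The query is built as plain text (with a literal `#`), on the assumption that the library URL-encodes parameters itself.

```python
def build_or_query(hashtags, max_length=500):
    """Join hashtags into a single 'OR' search query.

    Raises ValueError if the query exceeds `max_length` characters
    (operators included).
    """
    # Accept both 'tobacco' and '#tobacco'; ignore empty entries.
    terms = ["#" + tag.lstrip("#") for tag in hashtags if tag.strip("#")]
    query = " OR ".join(terms)
    if len(query) > max_length:
        raise ValueError("query is %d characters long, limit is %d"
                         % (len(query), max_length))
    return query


query = build_or_query(["tobacco", "#smoke", "smokefree"])
print(query)
```

The resulting string can be passed as the `q` argument of the search function defined above.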

__Note:__ when we issue a search query to the REST API, what we actually get back is the following object. The tweets matching our query are inside _statuses_, but we also get search metadata, which we use to continue the search and retrieve more than 100 tweets (the per-query limit).
    
--------------------------------------------------------------------------------------------------------------    
    {
    
    "search_metadata": {
        "completed_in": 0.092, 
        "count": 100, 
        "max_id": 749828837471584259, 
        "max_id_str": "749828837471584259", 
        "next_results": "?max_id=749671693128531967&q=%2523tobacco&count=100&include_entities=1", 
        "query": "%2523tobacco", 
        "refresh_url": "?since_id=749828837471584259&q=%2523tobacco&include_entities=1", 
        "since_id": 0, 
        "since_id_str": "0"
    }, 
    
    "statuses": [
        {},
        {},
        {},
        ...
        ## tweets ##
        ...
        {},
        {},
        {}
        ]
        
     }
     
--------------------------------------------------------------------------------------------------------------     


##### The search_metadata and the cursoring system

In essence, all the previous search code does is __repeatedly make requests to the Search API__ (enforcing a cap on the number of requests so as not to exceed Twitter's limits).
One thing that might initially catch you off guard if you've worked with other web APIs (including version 1 of Twitter's API) is that there's no explicit concept of pagination in the Search API itself. Reviewing the API documentation reveals that this is an intentional decision, and there are some good reasons for taking a [cursoring approach](https://dev.twitter.com/rest/public/timelines) instead, given the highly dynamic state of Twitter resources. The best practices for cursoring vary a bit throughout the Twitter developer platform, with the Search API providing a slightly simpler way of navigating search results than other resources such as timelines.
Search results contain a special __search_metadata__ node that embeds a __next_results__ field with __a query string that provides the basis of a subsequent query__. If we weren't using a library like twitter to make the HTTP requests for us, this preconstructed query string would just be appended to the Search API URL, and we'd update it with additional parameters for handling OAuth. However, __since we are not making our HTTP requests directly, we must parse the query string into its constituent key/value pairs and provide them as keyword arguments.__

In Python parlance, we are unpacking the values in a dictionary into keyword arguments that the API search function receives (`twitter_api.search.tweets(**kwargs)`). In other words, the function call inside the for loop ultimately invokes a function such as `twitter_api.search.tweets(q='%23Tobacco', include_entities=1, max_id=313519052523986943)` even though it appears in the source code as `twitter_api.search.tweets(**kwargs)`, with kwargs being a dictionary of key/value pairs ([*Args, **Kwargs](https://www.youtube.com/watch?v=WWm5DxTzLuk)).
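
Splitting on `&` and `=` leaves the values URL-encoded, so a hashtag query stays as `%23tobacco`. If the library encodes parameters again on the next request, this may be one source of doubly encoded queries such as the `%2523tobacco` visible in the metadata shown earlier. A possible alternative, sketched here for Python 3 (in Python 2 `parse_qsl` lives in the `urlparse` module), is to let the standard library parse and decode the string. The helper name `next_results_to_kwargs` is hypothetical.

```python
from urllib.parse import parse_qsl


def next_results_to_kwargs(next_results):
    """Turn a next_results query string into a dict of keyword arguments.

    parse_qsl splits the key/value pairs and URL-decodes each value,
    so '%23tobacco' becomes '#tobacco'.
    """
    return dict(parse_qsl(next_results.lstrip("?")))


example = "?max_id=313519052523986943&q=%23tobacco&count=100&include_entities=1"
kwargs = next_results_to_kwargs(example)
print(kwargs)
```

The resulting dictionary can then be unpacked in the same way, as `twitter_api.search.tweets(**kwargs)`.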

__Note__:
The search_metadata field also contains a __refresh_url__ value that can be used if you'd like to maintain and periodically update your collection of results with new information that's become available since the previous query.

--------------------------------------------------------------------------------------------------------------

In this example we use the GET search/tweets method of the Search API to perform a custom query against the entire twitterverse. Similar to the way that search engines work, __Twitter's Search API returns results in batches, and you can configure the number of results per batch to a maximum value of 100 by using the count keyword (count: The number of tweets to return per page, up to a maximum of 100. Defaults to 15. This was formerly the “rpp” parameter in the old Search API; from documentation).__ It is possible that more than 100 results (or the count specified) may be available for a given query and, in the parlance of Twitter's API, you'll need to use a **_cursor_** to __navigate to the next batch of results__.

[Cursors](https://dev.twitter.com/rest/public/timelines) are a new enhancement to Twitter's v1.1 API and provide a more robust scheme than the pagination paradigm offered by the v1.0 API, which involved specifying a page number and a results-per-page constraint. The essence of the cursor paradigm is that it is better able to accommodate the dynamic and real-time nature of the Twitter platform. For example, Twitter's API cursors are designed to inherently take into account the possibility that updated information may become available in real time while you are navigating a batch of query results. In other words, while you are navigating a batch of query results, relevant information may become available that you would want included in your current results, rather than needing to dispatch a new query.

With the code in the previous function, we use the Search API and navigate the cursor that is included in a response (in the search_metadata field) to fetch more than one batch of results.
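
The idea behind the cursor can be illustrated without calling Twitter at all. In the sketch below (Python 3), `fake_search` is a hypothetical stand-in for the Search API that returns the newest IDs not greater than `max_id`, and `collect_all` walks backwards in time batch by batch. Because `max_id` is inclusive, the next request asks for IDs strictly below the oldest one already seen, which avoids fetching the same tweet twice.

```python
def fake_search(all_ids, max_id=None, count=3):
    """Stand-in for the Search API: newest first, IDs <= max_id, at most `count`."""
    eligible = [i for i in sorted(all_ids, reverse=True)
                if max_id is None or i <= max_id]
    return eligible[:count]


def collect_all(all_ids, count=3):
    """Walk through the results batch by batch, moving max_id backwards."""
    collected = []
    max_id = None
    while True:
        batch = fake_search(all_ids, max_id=max_id, count=count)
        if not batch:
            break  # no more results, like a missing next_results key
        collected.extend(batch)
        # max_id is inclusive, so step just below the oldest ID seen so far
        max_id = min(batch) - 1
    return collected


ids = [101, 105, 110, 120, 131, 140, 150]  # made-up tweet IDs
walked = collect_all(ids)
print(walked)
```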


 

# use same query within streaming API
Search for the **_#tobacco_** hashtag.

__Note:__ Check the parameters of the stream method: [Streaming API request parameters](https://dev.twitter.com/streaming/overview/request-parameters)

In [14]:
# Find topics of interest by using the filtering capabilities the streaming API offers.

import sys

# Query terms

q = 'tobacco' # Comma-separated list of terms

print >> sys.stderr, 'Filtering the public timeline for track="%s"' % (q,)

# Returns an instance of twitter.Twitter
twitter_api = oauth_login()

# Reference the self.auth parameter
twitter_stream = twitter.TwitterStream(auth=twitter_api.auth)

# See https://dev.twitter.com/docs/streaming-apis
stream = twitter_stream.statuses.filter(track=q)

# For illustrative purposes, when all else fails, search for Justin Bieber
# and something is sure to turn up (at least, on Twitter)

print stream 

for tweet in stream:
    print tweet

Filtering the public timeline for track="tobacco"


<generator object __iter__ at 0x10cf27f50>
{u'favorited': False, u'contributors': None, u'truncated': False, u'text': u'RT @mihotep: #harmless #vaping\n\u2705eliminates the #Tobacco\n\u2705*ENDS* the Con\nin #TobaccoConTrol\n\n#WorldLungCancerDay https://t.co/y7o0iDDEQg', u'possibly_sensitive': False, u'is_quote_status': False, u'in_reply_to_status_id': None, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 119625872, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/583067608838094849/IdqLtH7G_normal.jpg', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 382, u'profile_sidebar_border_color': u'000000', u'id_str': u'119625872', u'profile_background_color': u'000000', u'listed_count': 40, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme9/bg.gif', u'utc_offset': -14400, u'statuses_count': 4827, u'descr

KeyboardInterrupt: 

__IMPORTANT Note__: when we access the Twitter streaming API we obtain objects of type __tweet__. Read the streaming API documentation here: [STREAMING API](https://dev.twitter.com/streaming/overview).
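
A stream can also deliver messages that are not tweets (for example deletion notices), so it is worth filtering before storing anything. The sketch below (Python 3) is a minimal illustration: `store_tweets` is a hypothetical helper, the stream is faked with a hand-made iterator, and `save` is any callable that stores one tweet. With MongoDB it could be something like a collection's `insert_one` method, which is an assumption about the storage layer rather than code from this notebook.

```python
def store_tweets(stream, save, limit=None):
    """Pass each tweet from `stream` to `save`; return how many were saved.

    Messages without a 'text' field (e.g. deletion notices) are skipped.
    Stops after `limit` saved tweets when a limit is given.
    """
    saved = 0
    for message in stream:
        if not isinstance(message, dict) or "text" not in message:
            continue
        save(message)
        saved += 1
        if limit is not None and saved >= limit:
            break
    return saved


# Hand-made stand-in for the streaming iterator.
fake_stream = iter([
    {"id_str": "1", "text": "first #tobacco tweet"},
    {"delete": {"status": {"id_str": "1"}}},  # a deletion notice, not a tweet
    {"id_str": "2", "text": "second tweet"},
    {"id_str": "3", "text": "third tweet"},
])

stored = []
count = store_tweets(fake_stream, stored.append, limit=2)
print(count, [t["id_str"] for t in stored])
```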

# Search API VS Streaming API
The Twitter Search API is part of Twitter’s REST API. It allows queries against the indices of recent or popular Tweets and behaves similarly to, but not exactly like the Search feature available in Twitter mobile or web clients, such as Twitter.com search. __The Twitter Search API searches against a sampling of recent Tweets published in the past 7 days.__

Before getting involved, it’s important to know that __the Search API is focused on relevance and not completeness.__ This means that some Tweets and users may be missing from search results. __If you want to match for completeness you should consider using a Streaming API instead.__

# Summary of the documents types (objects) returned by the methods we used

__Note:__ all documents returned by Twitter APIs are in JSON (JavaScript Object Notation) format; information is organised through dictionaries {} (characterised by key-value pairs) and lists [] (ordered structures). These two basic data structures can be nested in one another.
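
The nesting can be seen on a small hand-made example (Python 3). The JSON below is not real data; it only mimics a few fields of a tweet object (`text`, `user`, `entities`) to show how dictionaries and lists are traversed once the document has been parsed.

```python
import json

# A hand-made, heavily shortened tweet-like document (illustration only).
raw = """
{
  "id_str": "1",
  "text": "RT example #Tobacco",
  "user": {"screen_name": "someone", "followers_count": 382},
  "entities": {"hashtags": [{"text": "Tobacco", "indices": [11, 19]}]}
}
"""

tweet = json.loads(raw)

# Dictionary inside a dictionary: tweet -> user -> screen_name
screen_name = tweet["user"]["screen_name"]

# List of dictionaries inside a dictionary: tweet -> entities -> hashtags -> text
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]

print(screen_name, hashtags)
```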

##### GET search/tweets

Returns a dictionary containing the "search_metadata" and "statuses" keys; the actual tweets are stored in statuses (a list of tweet objects).

In [15]:
%%HTML
<iframe width="100%" height="1000" src="http://www.jsoneditoronline.org/"></iframe>

##### GET statuses/sample

Returns a generator object (printed as "<generator object __iter__ at 0x10cf27f50>"), an iterable that yields each new tweet object in real time as it is streamed; the same holds for the `statuses.filter` call used above. The tweets can be accessed with:

for tweet in stream:
    print tweet
    # or do something else with the tweet like saving it in MongoDB

A __tweet JSON object__ contains the text of the tweet along with a lot of other metadata about it (see the following example).

In [16]:
%%HTML
<iframe width="100%" height="1000" src="http://www.jsoneditoronline.org/"></iframe>

----------

----------

# _GET search/tweets_ method clarification
## How to query using the Search API (GET search/tweets method)

To understand the cursoring system we execute a sample query using the __GET search/tweets__ method. 
Any method that works with a cursoring system follows the same principle.

Consider the following code to understand:

In [None]:
def twitter_search(twitter_api, q, max_results=200, **kw):

    search_results = twitter_api.search.tweets(q=q, count=100, **kw)
    print 'search_results out of loop'
    print json.dumps(search_results, indent=1)
    
    statuses = search_results['statuses']
    print 'statuses out of loop'
    print json.dumps(statuses, indent=1)

    # Iterate through batches of results by following the cursor until we
    # reach the desired number of results, keeping in mind that OAuth users
    # can "only" make 180 search queries per 15-minute interval. 
    # A reasonable number of results is ~1000, although
    # that number of results may not exist for all queries.
    
    # Enforce a reasonable limit
    max_results = min(1000, max_results)
    
    for _ in range(10): # 10*100 = 1000
        try:
            next_results = search_results['search_metadata']['next_results']
            print 'key-value pairs used to recall the function'
            print json.dumps(next_results, indent=1)
        except KeyError, e: # No more results when next_results doesn't exist
            break
            
        # Create a dictionary from next_results, which has the following form:
        # ?max_id=313519052523986943&q=NCAA&include_entities=1
        kwargs = dict([ kv.split('=') 
                        for kv in next_results[1:].split("&") ])
        print '### ### ###'
        print 'kwargs'
    
        search_results = twitter_api.search.tweets(**kwargs)
        print '### ### ###'
        print 'search_results'
    
        statuses.extend(search_results['statuses']) 
        print '### ### ###'
        print 'statuses'
        
        if len(statuses) > max_results: 
            break
            
    return statuses

    
statuses_final = twitter_search(twitter_api, 'tobacco')
print '### ### ###'
print 'statuses_final'
print json.dumps(statuses_final, indent=1)

We are going to explain the above snippet: 

__We initially give a general overview of the steps it goes through.__ 

__Then we examine each of those reporting the output of the steps.__

--------------------------------

## General overview

INPUT: 
- twitter_api: instance of the twitter_api class that establishes the connection with Twitter
- q: the query (it may be possible to search for several hashtags or keywords at the same time, see [Using GET search/tweets to query multiple hashtags](https://twittercommunity.com/t/using-get-search-tweets-to-query-multiple-hashtags/12830))
- max_results: maximum number of tweets we want
- `**kw`: the other query parameters, given as key-value pairs (parameters of the API method, see documentation)

OUTPUT:
- statuses: list of tweet objects (the part of each response that comes alongside search_metadata), accumulated over all queries

## Steps

### first query to the REST API
We issue our first query to the API. It takes the __query__, the __count__ (how many tweets we want each query to return; the maximum allowed is 100) and other __method-specific arguments__ (in this case we are using the [GET search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) method of the REST API).

__Note:__ the limit on _queries per time interval_ is the same for most methods of the REST API.

'search_results out of loop'

In [1]:
%%HTML
<iframe width="100%" height="1000" src="http://www.jsoneditoronline.org/"></iframe>

statuses_out_of_loop
It is the list of tweets returned by the first "direct" query.

In [2]:
%%HTML
<iframe width="100%" height="1000" src="http://www.jsoneditoronline.org/"></iframe>

### Enforce a reasonable limit

We build the function so that the initial __max_results__ argument sets the maximum number of tweets the whole multi-query process can return, but inside the function we cap max_results at 1000.
Twitter's limits allow 180 queries per 15-minute interval, so we could theoretically raise max_results to 180x100=18000 tweets.

We set a more cautious limit because:
- the Twitter search does not return tweets older than one week from the day the query is issued;
- there might not be enough tweets matching our query in the past week (if it is too specific).

__Note__: another constraint is imposed on the length of the query, which can be at most 500 characters, including operators.

So, for example, if we query for "#tobacco" (a very specific string), we are unlikely to obtain more than 1000 tweets, considering that the search only goes back one week.

As a general benchmark, consider that querying "uk" (a very general term) returned 2699 tweets.
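
The arithmetic behind these limits can be written down explicitly. The sketch below (Python 3) uses only the figures quoted in this notebook (180 queries per 15-minute window, at most 100 tweets per query); the constant and function names are hypothetical.

```python
import math

QUERIES_PER_WINDOW = 180  # search queries allowed per 15-minute window
TWEETS_PER_QUERY = 100    # maximum value of the count parameter


def max_tweets_per_window():
    """Upper bound on tweets retrievable in one 15-minute window."""
    return QUERIES_PER_WINDOW * TWEETS_PER_QUERY


def queries_needed(n_tweets):
    """Number of queries needed to page through n_tweets results."""
    return math.ceil(n_tweets / TWEETS_PER_QUERY)


def windows_needed(n_tweets):
    """Number of 15-minute windows needed to page through n_tweets results."""
    return math.ceil(queries_needed(n_tweets) / QUERIES_PER_WINDOW)


print(max_tweets_per_window())
print(queries_needed(2699), windows_needed(2699))
```

For the "uk" benchmark above, 2699 tweets fit comfortably within a single window, so the rate limit only becomes the binding constraint for queries with far more matches.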


### Build key-value pairs from search_metadata information and issue new query

When we do

```python
for _ in range(179):
```
(as in the first version of the function; the instrumented version above uses `range(10)`), we make at most 179 further queries (plus the one issued outside the loop), building the arguments of each from the search_metadata information.
In particular, we automatically send a __max_id__ parameter so that the next query returns "results with an ID less than (that is, older than) or equal to the specified ID" (from documentation).


The next_results key inside search_metadata looks like
```python
next_results: ?max_id=757661781355626495&q=tobacco&count=100&include_entities=1
```
that's why we do
```python
kwargs = dict([ kv.split('=') 
                        for kv in next_results[1:].split("&") ])
```                        
to build the dictionary that follows and pass its key-value pairs as parameters to the GET search/tweets method through
```python
search_results = twitter_api.search.tweets(**kwargs)
```
we finally extend our list of tweets with the new tweets by doing
```python
statuses.extend(search_results['statuses'])
```
__Note:__ in subsequent iterations the max_id parameter may remain the same; this happens because there are more than 100 tweets (the maximum that can be returned by a query) up to that id.

In [3]:
%%HTML
<iframe width="100%" height="1000" src="http://www.jsoneditoronline.org/"></iframe>

Finally, the result is obtained through successive extensions of the statuses list with each search_results['statuses']. In the following case we retrieved a total of 300 tweets.

In [4]:
%%HTML
<iframe width="100%" height="1000" src="http://www.jsoneditoronline.org/"></iframe>

-------

-------

# Conclusions
### Final note
This method is meant to be complete as far as the use of queries is concerned, but it only allows us to go back in time one week at most.
If we identify interesting users for whom we want to collect more tweets, we can use

__GET statuses/user_timeline__

which allows us to retrieve up to the last 3200 tweets of that user.

### List of useful methods
- __REST APIs__:
    1. method with link
        * brief description
    2. method with link
        * brief description


- __The Streaming APIs__:
    1. method with link
        * brief description