## Tutorial contents
* [Different ways of collecting Twitter data](#Different-ways-of-collecting-Twitter-data)
* [Providing authorization to the Twitter API](#Providing-authorization-to-the-Twitter-API)
* [Collecting tweets](#Collecting-tweets)
* [Getting information about a user account](#Getting-information-about-a-user-account)
* [Getting follower IDs](#Getting-follower-IDs) 
* [Getting the IDs of users being followed by a specified account](#Getting-the-IDs-of-users-being-followed-by-a-specified-account)
* [Getting tweets favorited by a user](#Getting-tweets-favorited-by-a-user)
* [Getting info on friendship relations](#Getting-info-on-friendship-relations)
* [Getting retweets of a certain status](#Getting-retweets-of-a-certain-status)
* [Searching for tweets](#Searching-for-tweets)
* [Rate limits and cursor](#Rate-limits-and-cursor)
* [Collecting Brexit tweets to analyse sentiment](#Collecting-Brexit-tweets-to-analyse-sentiment)
* [Data processing](#Data-processing)
* [Designing crowdsourcing job](#Designing-crowdsourcing-job)


## Different ways of collecting Twitter data
![collection_methods](collecting_tweets_options.png)

## Twitter APIs
API stands for Application Programming Interface. APIs allow developers to build tools and applications on top of the data a service stores, and they give researchers interested in data collection an easy way to access that data.

## Providing authorization to the Twitter API

The first step is to become a Twitter developer. For this you need a Twitter account yourself, and [to create a new app](https://apps.twitter.com/).


You need a unique name. You can fill in anything as the website and description. 
![create_app](create_app.png)

Once you're a developer, you will find your access credentials under the Keys and Access Tokens tab of your new app. You will need to copy the following four fields:

1. Consumer Key (API Key)
2. Consumer Secret (API Secret)
3. Access Token
4. Access Token Secret


To get the consumer key and secret: 
![consumer_key](consumer_key.png)

To get the access token and token secret: 
![token_action](token_actions.png)

![access_token](access_token.png)


Now we have access. 

The next thing we need to do is install Tweepy, a Python library for accessing the Twitter API. You can install it by typing:

    pip install tweepy

in your Anaconda Prompt. 
    
This provides a convenient front-end for the Twitter API, giving us easy access without having to venture outside of our Python environment.

In [None]:
import tweepy

CONSUMER_KEY        = 'vRoZs3Z27vyb7FFAmNl2EJ7Ei'
CONSUMER_KEY_SECRET = 'RLcA5LBkaZHgTFzGTuiOXn4SEgoLh62HFw7gmYcsfoWOTgNgmX'
ACCESS_TOKEN        = '1904917267-5J95nCkfnY8E8XMdNZeDKk0wpIbdo767TunsCAw'
ACCESS_TOKEN_SECRET = '9YtZyqN0rZrcPkOokDZPRw7VwOQXl9nQbtUukVXBQiU8I'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_KEY_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)
print(api)
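
Hard-coding credentials in a notebook makes it easy to share them by accident. One common alternative is to read them from environment variables. The sketch below assumes four variable names (`TWITTER_CONSUMER_KEY` and so on) that are purely illustrative; they are not defined by Twitter or Tweepy.

```python
import os

# Hypothetical variable names; pick whatever names suit your own setup.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}
```

The returned values could then be passed to `tweepy.OAuthHandler` and `auth.set_access_token` in place of the string literals.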


Now that we have access to the Twitter API, there is a range of different requests we can make. We can use GET to retrieve information about any public users or tweets, and even POST to make changes to the account we used to authorize, such as following accounts and posting tweets. All functions of the API are [thoroughly documented](https://dev.twitter.com/rest/reference), so below we will only go over a few examples of the most common tasks.

## Collecting tweets

Statuses posted by a specified user can be collected with a [GET statuses/user_timeline](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) request. We need to specify either the ID, user ID or screen name of the user, and we can include other options such as the number of statuses to retrieve, the first and last status to be collected, and whether retweets should be included or not. `count` sets how many statuses a single request returns: the default is 20 and the maximum is 200. Retweets count towards this number even when they are excluded from the output, so you may receive fewer statuses than you asked for. See the section on [Rate limits and cursor](#Rate-limits-and-cursor) to learn how to handle rate limits and get more tweets using the cursor.

The search returns a list of *Status* objects. 

In [None]:
search = api.user_timeline(screen_name = 'theresa_may', count = 100, include_rts = True) 
search

Each `Status` object contains a number of relevant fields, which can be accessed with `status.[field_name]`. 

In [None]:
status=search[0]
print("Tweet text:", status.text)

In [None]:
status.text

You can get a list of all the field names:  

In [None]:
for key,value in status.__dict__.items():  #same thing as `vars(status)`
    print(key)

Printing the entire content of the request is not very informative, since it contains a large amount of metadata. And while it is useful to know how to access particular fields, often what we want is to retrieve all the information and store it somewhere for later processing. We will therefore write our search output to a file, where each line corresponds to one tweet in JSON format. The raw JSON is itself one of the fields of the `Status` object (`status._json`).

In [None]:
# First change the working directory to the location where you downloaded the files for today
import os
os.chdir("C:/Code/teaching")
import json
F_NAME = 'theresa_may_timeline.json'
with open(F_NAME,'w') as f_out:
    for status in search:
        json.dump(status._json, f_out)
        f_out.write('\n')

We can open this file in a text editor such as Notepad to inspect it.
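
To work with the saved tweets in Python again, the file can be read back line by line. This is a minimal sketch; the helper name `read_json_lines` is our own and not part of Tweepy.

```python
import json

def read_json_lines(path):
    """Load a file with one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f_in:
        for line in f_in:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Example use, assuming the file written above exists:
# tweets = read_json_lines('theresa_may_timeline.json')
# print(tweets[0]['text'])
```

Each element of the returned list is a plain dictionary, so fields are accessed with `tweet['text']` rather than `status.text`.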

## Getting information about a user account

We can also get detailed information about an account, such as the account description, number of followers, number of users followed, the date the account was created, location, number of tweets, a link to the profile image, number of favorites, etc. The argument needed is either `id`, `user_id` or `screen_name`. The output is a User object. We will again save the output object as a .json file.  

In [None]:
user_info = api.get_user(screen_name = 'theresa_may')
print("Account description:", user_info.description)
print("Following:", user_info.friends_count)
print("Followers:", user_info.followers_count)

#for key,value in user_info.__dict__.items():  #same thing as `vars(user_info)`
#    print(key)

F_NAME = 'theresa_may_user_info.json'
with open(F_NAME,'w') as f_out:
    json.dump(user_info._json, f_out)

## Getting follower IDs

We can get a list of the IDs of users following a certain account with `api.followers_ids([id/screen_name/user_id])`. A single request returns up to 5000 IDs; the example below limits this to the first 100 with `count`. See the [Rate limits and cursor](#Rate-limits-and-cursor) section to find out how to get more IDs than a single request returns.

In [None]:
followers = api.followers_ids(screen_name = 'theresa_may', count=100)

#Save list of followers:
F_NAME = 'theresa_may_followers.txt'
with open(F_NAME,'w') as f_out:
    for follower in followers:
        f_out.write("%s\n" % follower)

## Getting the IDs of users being followed by a specified account

We can also get the IDs of users being followed by the specified user:

In [None]:
friends = api.friends_ids(screen_name = 'theresa_may')
print("May follows", len(friends), "users.")

F_NAME = 'followed_by_theresa_may.txt'
with open(F_NAME,'w') as f_out:
    for friend in friends:
        f_out.write("%s\n" % friend)
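
Because both lists are saved with one ID per line, they are easy to compare afterwards. The sketch below finds accounts that appear in both files; the helper names are our own, and the result is only as complete as the files (the followers file above holds just the first 100 IDs).

```python
def read_ids(path):
    """Read one numeric ID per line into a set of ints."""
    with open(path) as f_in:
        return {int(line) for line in f_in if line.strip()}

def mutual_ids(follower_ids, friend_ids):
    """IDs that appear both among the followers and among the friends."""
    return set(follower_ids) & set(friend_ids)

# Example use, assuming the two files written above exist:
# mutual = mutual_ids(read_ids('theresa_may_followers.txt'),
#                     read_ids('followed_by_theresa_may.txt'))
# print(len(mutual), "accounts follow May and are followed back")
```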

## Getting tweets favorited by a user
We can get a list of tweets favorited by a user:

In [None]:
favorites = api.favorites(screen_name = 'theresa_may')
print("Number of likes:", len(favorites))

F_NAME = 'theresa_may_favorites.json'
with open(F_NAME,'w') as f_out:
    for favorite in favorites:
        json.dump(favorite._json, f_out)
        f_out.write('\n')

## Getting info on friendship relations

We can get information about the existence of a friendship between two users (a `source` and a `target`), and other characteristics of the relation, with `api.show_friendship(source_id/source_screen_name, target_id/target_screen_name)`.

In [None]:
friendship=api.show_friendship(source_screen_name="theresa_may", target_screen_name="ExeterQStep")
print("Source(May) followed by target(Exeter Q-Step)?", friendship[0].followed_by)
print("Target(Exeter Q-Step) followed by source(May)?", friendship[1].followed_by)

## Getting retweets of a certain status
`api.retweets(id[, count])` returns up to 100 of the most recent retweets of a given tweet.

In [None]:
retweets = api.retweets(id = 701057384869969921, count=100)
retweets
F_NAME = 'status_retweets.json'
with open(F_NAME,'w') as f_out:
    for retweet in retweets:
        json.dump(retweet._json, f_out)
        f_out.write('\n')

## Searching for tweets
Twitter offers different options for searches. 

![search_options](search_options.png)

"Standard" means free (and incomplete), and this is what we get unless we contact Twitter/Gnip sales and get a paid account. 
Let's find 10 tweets that mention both "Brexit" and "NHS" and print their text:

In [None]:
query="Brexit AND NHS"
max_tweets = 10
searched_tweets = [status for status in tweepy.Cursor(api.search, q=query).items(max_tweets)]
print([status.text for status in searched_tweets[0:10]])

Now let's say we want to restrict our search to tweets in English that were posted within 500 km of a point in the UK (the `geocode` argument takes `latitude,longitude,radius`). We also want to get up to 100 tweets, and write the results out to a file:

In [None]:
max_tweets = 100
searched_tweets = [status for status in tweepy.Cursor(api.search, q=query, 
                                                      geocode="54.323486,-3.396256,500km", 
                                                      lang="en").items(max_tweets)]
F_NAME = 'Search_tweets.json'
with open(F_NAME,'w') as f_out:
    for status in searched_tweets:  
        json.dump(status._json, f_out)
        f_out.write('\n')
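
The `geocode` string is easy to mistype, so it can help to build it from its parts. The helper below is a hypothetical convenience function, not part of Tweepy; it assumes the radius unit is either `km` or `mi`.

```python
def make_geocode(latitude, longitude, radius, unit="km"):
    """Build a 'latitude,longitude,radius' string for the search API."""
    if unit not in ("km", "mi"):
        raise ValueError("unit must be 'km' or 'mi'")
    if not -90 <= latitude <= 90 or not -180 <= longitude <= 180:
        raise ValueError("latitude or longitude out of range")
    return "{},{},{}{}".format(latitude, longitude, radius, unit)

# The value used in the search above:
print(make_geocode(54.323486, -3.396256, 500))
```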

## Rate limits and cursor

Twitter [API rate limits](https://dev.twitter.com/rest/public/rate-limiting) restrict the number of requests you can make in a certain time frame. Tweepy can help handle these limitations. 
First, you can set a number of additional parameters when creating the `tweepy.API` instance: 
* `retry_count` – default number of retries to attempt when an error occurs
* `retry_delay` – number of seconds to wait between retries
* `retry_errors` – which HTTP status codes to retry
* `wait_on_rate_limit` – whether or not to automatically wait for rate limits to replenish
* `wait_on_rate_limit_notify` – whether or not to print a notification when Tweepy is waiting for rate limits to replenish

Setting the last two parameters to `True` usually handles the rate limits. 
So we can redefine our API instance with these parameters: 

In [None]:
api = tweepy.API(auth, 
                 retry_count=5,
                 retry_delay=10,
                 retry_errors=set([401, 404, 500, 503, 429]),
                 wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
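
To see what waiting for the rate limit involves, note that Twitter's responses include the headers `x-rate-limit-remaining` and `x-rate-limit-reset` (the reset time in Unix epoch seconds). The function below is a simplified sketch of the waiting logic, not Tweepy's actual implementation; the name `seconds_to_wait` is our own.

```python
import time

def seconds_to_wait(headers, now=None):
    """Return how long to sleep before the next request, given response headers."""
    now = time.time() if now is None else now
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    if remaining > 0:
        return 0.0  # requests are still left in the current window
    reset_at = float(headers["x-rate-limit-reset"])  # epoch seconds
    return max(0.0, reset_at - now)
```

With `wait_on_rate_limit=True` this bookkeeping is done for us, so the function is only meant to illustrate the idea.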

To handle pagination, Tweepy has the extremely helpful Cursor object. Instead of manually iterating through the pages of a user timeline, we can use the cursor: 

In [None]:
F_NAME = 'ExeterQStep_timeline_all.json'
with open(F_NAME,'w') as f_out:
    # The cursor requests page after page until the timeline is exhausted
    for status in tweepy.Cursor(api.user_timeline, screen_name = 'ExeterQStep', include_rts = True).items():
        json.dump(status._json, f_out)
        f_out.write('\n')
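
Conceptually, paging through a timeline means asking for one page, noting the oldest ID received, and then asking for statuses older than that. The sketch below imitates this with an in-memory list instead of the real API; `fetch_page` and `FAKE_TIMELINE` are hypothetical stand-ins, and this is not Tweepy's actual code.

```python
def iterate_timeline(fetch_page, page_size=200):
    """Yield statuses page by page until an empty page comes back.

    `fetch_page(max_id, count)` is a hypothetical stand-in for an API call that
    returns up to `count` statuses with id <= max_id, newest first.
    """
    max_id = None
    while True:
        page = fetch_page(max_id=max_id, count=page_size)
        if not page:
            return
        for status in page:
            yield status
        max_id = page[-1]["id"] - 1  # continue just below the oldest ID seen


# A fake timeline of 7 statuses (newest first) standing in for the real API.
FAKE_TIMELINE = [{"id": i} for i in range(7, 0, -1)]

def fake_fetch_page(max_id=None, count=200):
    older = [s for s in FAKE_TIMELINE if max_id is None or s["id"] <= max_id]
    return older[:count]

collected = [s["id"] for s in iterate_timeline(fake_fetch_page, page_size=3)]
print(collected)
```

Here the seven fake statuses are retrieved in three pages of at most three items each, which is the bookkeeping that `tweepy.Cursor` saves us from writing.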

This is going to take even longer, so don't run it now. 

In [None]:
#F_NAME = 'theresa_may_followers_all.txt'
#Save list of followers:
#with open(F_NAME,'w') as f_out:
#    for follower in tweepy.Cursor(api.followers_ids, screen_name = 'theresa_may').items():
#        f_out.write("%s\n" % follower)

## Collecting Brexit tweets to analyse sentiment

**Exercise:**

Your task is to analyse the sentiment in recent tweets about Brexit. Think about how you might do that, what type of data you need and what the first steps in the analysis should be. 


Let's start by collecting the most recent tweets that the Standard search API gives us access to. The code below is commented out because the full collection takes a few hours. Uncomment and edit it if you want to run your own Twitter search.

In [None]:
#count = 0  # number of tweets written so far
#with open("brexit_tweets_april_1.json",  "w", encoding="utf-8") as f_out:
#    # Rate limits are handled by the `api` instance defined above (wait_on_rate_limit=True)
#    for tweets in tweepy.Cursor(api.search,
#                                q='brexit OR BREXIT OR Brexit',  # OR must be upper case to act as an operator
#                                tweet_mode="extended",
#                                since="2019-03-31", 
#                                until="2019-04-01",
#                                lang="en",
#                                result_type='recent',
#                                geocode="54.323486,-3.396256,500km",
#                                include_entities=True).items():
#        # convert from Python dict to JSON: 
#        data = json.dumps(tweets._json, ensure_ascii=False)
#        f_out.write(data)
#        f_out.write('\n')
#        count=count+1
#        if count % 1000 == 0:
#            print(count)
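
A likely first step in the analysis is to pull the text out of each collected tweet. With `tweet_mode="extended"` the text is stored under `full_text` instead of `text`, and for retweets the complete text sits in the embedded `retweeted_status`. The helper below is a minimal sketch that operates on the saved JSON dictionaries; its name is our own.

```python
def full_text(tweet):
    """Return the untruncated text of a tweet dict collected in extended mode."""
    # For a retweet, the complete text sits in the embedded original tweet.
    original = tweet.get("retweeted_status")
    if original is not None:
        return original.get("full_text", original.get("text", ""))
    # Fall back to 'text' for tweets collected without extended mode.
    return tweet.get("full_text", tweet.get("text", ""))
```

Applied to every line of the output file, this gives a list of strings that can be passed on to a sentiment analysis tool.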

## Extra resources on Twitter data collection

Twitter developer resources and documentation, including tutorials: 
https://developer.twitter.com/en/docs

This list of all API methods is particularly useful:
https://developer.twitter.com/en/docs/api-reference-index
