# SCC.413 Applied Data Mining
# NLP: Week 16
# Twitter Data Collection

## Contents
- [Introduction](#intro)
- [Packages & imports](#packages)
- [Authentication](#authentication)
- [User timelines](#user)
- [Searching for tweets](#searching)
    - [Search operators](#search_ops)
- [Outputting tweets](#outputting)
- [Exercise](#exercise)
- [Further tasks](#tasks)
- [Advanced tasks](#advanced)

<a name="intro"></a>
## Introduction

In this lab exercise you will interact with the [Twitter REST API](https://developer.twitter.com/en/docs) using Python code to download tweets for future analysis. Data collected via APIs are generally much cleaner than web scraped data, and also structured nicely (here, via JSON) for easy querying.

To collect data from Twitter you need to have a Twitter account, and also create an authorised application. If you do not want to do this, you could skip most of this lab and just use pre-collected data, although it is useful to see how to collect your own data. One option would be to partner with a neighbour and use a single Twitter account. As a minimum, you should observe how you can read in previously collected Twitter data, and output this in a different format (see [Outputting tweets](#outputting), and [Exercise](#exercise)).

Ensure you have downloaded the code from the git repository (as described on Moodle), and place it in a folder for this lab. Your h-drive is probably the best place, although keep an eye on the space available with the various data files you will be creating in the lab.

The code provides functions for collecting Twitter data via the [Twitter REST API](https://developer.twitter.com/en/docs). We use the [Twython](https://github.com/ryanmcgrath/twython) Python package to assist us, although others are available, most notably [Tweepy](https://github.com/tweepy/tweepy).

Ensure you have completed the separate instructions for creating a developer account and Twitter app.

<a name="packages"></a>
## Packages & Imports

The Twython package will need installing on Google Colab, as below. Non-standard packages are also included in `requirements.txt`, if you need to install them on your own machine.

In [None]:
!pip install twython

You should upload all of the provided files to a Google Drive folder; you can then access these files from your Python code. See also the files tab.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

The code below adds a Google Drive folder to the Python path, so that the provided modules can be imported. You may need to edit the path to match your own.

In [10]:
import sys
sys.path.append('/content/gdrive/MyDrive/413/wk16')

This particular lab uses the Twython package. Other imports are included in one cell here for convenience.

To interact with the Twitter API, you need developer credentials: the *Consumer Key (API Key)*, *Consumer Secret (API Secret)*, *Access Token*, and *Access Token Secret*. These should be copied and saved into the relevant variables in `twitter_auth.py` before running the cells below. The authorisation variables are read in with the import below. You can replace the import and add the variables directly here (though this is not good practice, as it reveals your credentials).
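
As a rough sketch, `twitter_auth.py` could read the four values from environment variables rather than hard-coding them. The Python variable names match those passed to Twython in this notebook; the environment variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_credentials` helper are illustrative assumptions, not necessarily what the provided file contains.

```python
# Hypothetical contents for twitter_auth.py (a sketch, not the provided file).
# The environment variable names below are assumptions chosen for illustration.
import os

def load_credentials(env=os.environ):
    """Return the four credentials as a dict, with empty strings for any that are unset."""
    return {
        "consumer_key": env.get("TWITTER_CONSUMER_KEY", ""),
        "consumer_secret": env.get("TWITTER_CONSUMER_SECRET", ""),
        "access_token": env.get("TWITTER_ACCESS_TOKEN", ""),
        "access_secret": env.get("TWITTER_ACCESS_SECRET", ""),
    }

_creds = load_credentials()
consumer_key = _creds["consumer_key"]
consumer_secret = _creds["consumer_secret"]
access_token = _creds["access_token"]
access_secret = _creds["access_secret"]

# Names of any credentials that are still missing, useful for a quick sanity check.
missing = [name for name, value in _creds.items() if not value]
```

Keeping the secrets out of the notebook this way means the notebook itself can be shared without revealing them.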

In [53]:
from twython import Twython, TwythonError, TwythonRateLimitError #https://github.com/ryanmcgrath/twython
import sys
import time
import json

from twitter_auth import *

<a name="authentication"></a>
## Authentication

Our hook into the Twitter API is via [Twython](https://github.com/ryanmcgrath/twython), and we make API calls via functions on a Twitter authenticated Twython object. We create and authorise this below, with supplied credentials (from `twitter_auth.py`).

In [None]:
twitter = Twython(consumer_key, consumer_secret, access_token, access_secret)

<a name="user"></a>
## User timelines

The Twitter API allows for the downloading of any (unprotected) user's tweets, limited to their last 3,200. Collecting a user's tweets can be useful for various research questions, such as comparing language usage across individuals/organisations, and for performing various authorship analysis tasks (as we'll see later in the module).

A function is provided below for collecting a given user's Tweets. Review the code and check your understanding of how it is collecting tweets. The function is also available from `user_tweets.py`.

The Twitter API throttles the downloading of data, here allowing for 200 tweets per request, and 1,500 requests per 15-minute window (and 100,000 requests per day). Therefore we need to collect 200 tweets at a time, with an older starting point each time, until there are no more tweets available. If we hit the [rate limit](https://developer.twitter.com/en/docs/basics/rate-limiting), we pause the collection until the rate-limit window resets. Other options are available for collecting user tweets: see the [user timeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html) reference. How would you discard tweets which are replies to other users?
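
The paging scheme described above can be tried without touching the API. The sketch below uses a hypothetical `fake_timeline` function standing in for the real timeline request (the tweet ids and the page size of 3 are made up); only the `max_id` logic mirrors the real collection.

```python
# Stand-in data: 7 fake tweets, newest first, identified only by id.
FAKE_TWEETS = [{"id": i} for i in (70, 60, 50, 40, 30, 20, 10)]

def fake_timeline(count, max_id=None):
    """Hypothetical replacement for an API call: up to `count` tweets with id <= max_id."""
    eligible = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return eligible[:count]

def collect_all(page_size=3):
    """Page backwards through the fake timeline using max_id."""
    collected = []
    new_tweets = fake_timeline(count=page_size)
    while len(new_tweets) > 0:
        collected.extend(new_tweets)
        # The oldest id seen so far, less one, so the next page starts strictly before it.
        oldest = collected[-1]["id"] - 1
        new_tweets = fake_timeline(count=page_size, max_id=oldest)
    return collected

all_tweets = collect_all()
print([t["id"] for t in all_tweets])
```

Subtracting one from the oldest id matters because `max_id` is inclusive; without it, the boundary tweet would be returned twice.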

In [55]:
def get_user_tweets(twitter, screen_name, **kwargs):
    """uses twitter (Twython object) to collect tweets from user with screen_name (no @), can include extra search parameters for get_user_timeline, returns list of tweets"""
    
    #initialize a list to hold all the tweets
    user_tweets = []
    try:
        #make initial request for most recent tweets (200 is the maximum allowed count).
        #We normally don't want retweets, so we set include_rts to false.
        #tweet_mode="extended" allows for full text tweets, rather than truncated (i.e. over 140 chars)
        #https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html
        new_tweets = twitter.get_user_timeline(screen_name=screen_name,count=200,include_rts=False,tweet_mode="extended", **kwargs)

        #add to list
        user_tweets.extend(new_tweets)

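        #stop early if nothing was returned, otherwise user_tweets[-1] below raises an IndexError
        if len(user_tweets) == 0:
            return user_tweets

        #save the id of the oldest tweet less one, this is the starting point for collecting further tweets.
        oldest = user_tweets[-1]['id'] - 1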
        #keep grabbing tweets until there are no tweets left to grab. Twitter limits us to 3,200 (including retweets)
        while len(new_tweets) > 0:
            try:
                #all subsequent requests use the max_id param to prevent duplicates
                new_tweets = twitter.get_user_timeline(screen_name=screen_name,count=200,include_rts=False,tweet_mode="extended",max_id=oldest, **kwargs)
                user_tweets.extend(new_tweets)
                oldest = user_tweets[-1]['id'] - 1
                print("...%s tweets downloaded so far" % (len(user_tweets)))
            except TwythonRateLimitError as e:
                #We have hit the rate limit, so we need to take a break.
                #find how much time need to sleep for.
                #clamp at zero, as time.sleep raises a ValueError for negative values.
                remainder = max(float(twitter.get_lastfunction_header(header='x-rate-limit-reset')) - time.time(), 0)
                print("sleeping for %d seconds" % remainder)
                #Pause until we can go again.
                time.sleep(remainder)
                continue
                
    except TwythonRateLimitError as e:
        #We have hit the rate limit on first call, so we need to take a break, and start again.
        #find how much time need to sleep for.
        #clamp at zero, as time.sleep raises a ValueError for negative values.
        remainder = max(float(twitter.get_lastfunction_header(header='x-rate-limit-reset')) - time.time(), 0)
        print("sleeping for %d seconds" % remainder)
        #Pause until we can go again.
        time.sleep(remainder)
        #start again
        return get_user_tweets(twitter, screen_name, **kwargs)

    except TwythonError as e:
        print(e)

    return user_tweets
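
The rate-limit handling computes the same wait in two places. One way this could be factored out is sketched below; `seconds_until_reset` is a hypothetical helper, not part of the provided code or of Twython. It takes the value of the `x-rate-limit-reset` header (an epoch timestamp in seconds) and never returns a negative number, since `time.sleep` rejects negative values.

```python
import time

def seconds_until_reset(reset_header, now=None):
    """Seconds to wait until the rate-limit window resets (hypothetical helper).

    reset_header: value of the 'x-rate-limit-reset' header, an epoch time in seconds.
    now: current epoch time; defaults to time.time(). Exposed to make testing easy.
    """
    if now is None:
        now = time.time()
    # Clamp at zero: the reset time may already have passed by the time we check.
    return max(float(reset_header) - now, 0.0)

# Example with made-up timestamps: the window resets 90 seconds after "now".
wait = seconds_until_reset("1700000090", now=1700000000)
print("sleeping for %d seconds" % wait)
```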

To collect tweets for a user, we simply call the function, providing our authenticated Twython object (`twitter`) and a Twitter user's screen name (without the @). Below we collect [Lancaster University's Twitter timeline](https://twitter.com/LancasterUni).

In [None]:
tweets = get_user_tweets(twitter, "LancasterUni")

This returns a list of tweet dictionary objects. A lot of information is provided per Tweet. The attributes are detailed for [Tweet objects in the API](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object). Review the information that is available.

The first tweet will be the latest tweet. You can see the list of keys, and a full tweet below. We can also view individual attributes, such as the tweet text.

In [None]:
print(tweets[0].keys())

In [None]:
print(tweets[0])

In [None]:
tweets = get_user_tweets(twitter, "UCREL_NLP")
print(tweets[0]['full_text'])
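
Tweets are nested dictionaries, so some attributes sit a level or two down. The example below uses a small hand-made dictionary that follows the layout of a tweet object (the values are invented); real tweets contain many more keys.

```python
# A hand-made, heavily trimmed tweet for illustration only (values are invented).
sample_tweet = {
    "id_str": "1",
    "full_text": "Reading group today #NLProc",
    "user": {"screen_name": "example_user", "followers_count": 42},
    "entities": {"hashtags": [{"text": "NLProc"}], "user_mentions": []},
}

# Top-level attribute.
text = sample_tweet["full_text"]
# Nested attributes: the author, and the hashtags already extracted in the entities.
author = sample_tweet["user"]["screen_name"]
hashtags = [tag["text"] for tag in sample_tweet["entities"]["hashtags"]]
# .get avoids a KeyError for attributes that are missing from a given tweet.
place = sample_tweet.get("place")

print(author, hashtags, place)
```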

<a name="searching"></a>
## Searching for tweets

The Twitter API also allows searching for tweets, albeit only over a sample from the last 7 days (unless you pay). This could be useful for collecting a selection of tweets on a specific topic, or tweets mentioning particular people.

A function is provided below for performing searches, in a similar manner to extracting a user’s tweets. Review and check your understanding of the code. The function is also available from `search_tweets.py`.

In [58]:
def search_twitter(twitter, search_term, limit, **kwargs):
    """uses twitter (Twython object) to collect tweets returned from given search_term, up to limit, can include extra search parameters, returns list of tweets"""
    
    #initialise list of tweets
    tweets = []

    try:
        #count=100 is the maximum allowed
        #tweet_mode="extended" allows for full text tweets, rather than truncated (i.e. over 140 chars)
        #https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
        search_results = twitter.search(q=search_term,tweet_mode="extended",count=100, **kwargs)
        tweets.extend(search_results['statuses'])

        if len(tweets) == 0:
            print("No results found")
            return tweets
        
        #save the id of the oldest tweet less one, this is the starting point for collecting further tweets.
        oldest = tweets[-1]['id'] - 1
        #keep grabbing tweets until there are no tweets left to grab.
        while len(search_results['statuses']) > 0 and len(tweets) < limit:
            try:
                #all subsequent requests use the max_id param to prevent duplicates
                search_results = twitter.search(q=search_term,tweet_mode="extended",count=100,max_id=oldest, **kwargs)
                tweets.extend(search_results['statuses'])
                oldest = tweets[-1]['id'] - 1
                print("...%s tweets downloaded so far" % (len(tweets)))
            except TwythonRateLimitError as e:
                #We have hit the rate limit, so we need to take a break.
                #find how much time need to sleep for.
                #clamp at zero, as time.sleep raises a ValueError for negative values.
                remainder = max(float(twitter.get_lastfunction_header(header='x-rate-limit-reset')) - time.time(), 0)
                print("sleeping for %d seconds" % remainder)
                #Pause until we can go again.
                time.sleep(remainder)
                continue

    except TwythonRateLimitError as e:
        #We have hit the rate limit on first call, so we need to take a break, and start again.
        #find how much time need to sleep for.
        #clamp at zero, as time.sleep raises a ValueError for negative values.
        remainder = max(float(twitter.get_lastfunction_header(header='x-rate-limit-reset')) - time.time(), 0)
        print("sleeping for %d seconds" % remainder)
        #Pause until we can go again.
        time.sleep(remainder)
        #start again
        return search_twitter(twitter, search_term, limit, **kwargs)
                
    except TwythonError as e:
        print(e)

    return tweets[:limit]

To search, simply provide a search string and a limit on the number of tweets to return, e.g. to get 500 tweets with the hashtag #NLProc (a common hashtag for NLP-related posts):

In [None]:
tweets = search_twitter(twitter, "#NLProc", 500)

In [None]:
print(tweets[0]["full_text"])

<a name="search_ops"></a>
### Search operators

There are a number of search operators available, allowing for quite complex searches: [search operators](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators). Ignore the instructions on URL encoding the search string; *Twython* takes care of this for us.

Search strings can be built up with multiple parameters. For example, to search for tweets from the *@LancasterUni* account mentioning *research*:

In [None]:
tweets = search_twitter(twitter, "from:LancasterUni research", 10)
for tweet in tweets:
    print("----")
    print(tweet['full_text'])
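
Since a search string is just text, it can be assembled from parts. The helpers below are hypothetical conveniences (not part of the provided code): `build_query` joins parts with spaces, which the search treats as "all of these", and `phrase` adds the double quotes needed for an exact phrase. The `from:` operator and the `-` negation come from the operator list linked above; the account and terms are only examples.

```python
def build_query(*parts):
    """Join non-empty query parts with single spaces (hypothetical helper)."""
    return " ".join(part.strip() for part in parts if part and part.strip())

def phrase(text):
    """Wrap text in double quotes so it is matched as an exact phrase."""
    return '"' + text + '"'

# Tweets from one account, containing an exact phrase, but not a given word.
query = build_query("from:UCREL_NLP", phrase("corpus linguistics"), "-workshop")
print(query)
```

The resulting string can be passed as the search term in the same way as the hand-written queries.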

There are also different parameters available for the search request itself: [search parameters](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html).

These can be passed through the `search_twitter` function (using [kwargs](https://book.pythontips.com/en/latest/args_and_kwargs.html)). For example, to restrict a search to only tweets in English:

In [None]:
tweets = search_twitter(twitter, "\"faux pas\"", 10, lang='en')
for tweet in tweets:
    print("----")
    print(tweet['full_text'])

Try this with "fr", and without specifying the language to see the difference.
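
How the extra keyword arguments travel can be seen with a stand-in for the API call. In this sketch `fake_search` and `search_sketch` are hypothetical; `fake_search` simply returns the parameters it received. `lang` and `until` are real search parameters, but nothing is sent to Twitter here and the date is made up.

```python
def fake_search(**params):
    """Hypothetical stand-in for an API call: returns the parameters it was given."""
    return params

def search_sketch(search_term, **kwargs):
    """Shows how fixed and extra parameters are combined into a single request."""
    # Fixed parameters are written out; anything else the caller supplied arrives via **kwargs.
    return fake_search(q=search_term, tweet_mode="extended", count=100, **kwargs)

sent = search_sketch('"faux pas"', lang="fr", until="2021-03-01")
print(sent)
```

One consequence of this design is that a keyword which duplicates a fixed one (for example `count=50` here) raises a `TypeError`, because the same argument would be supplied twice.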

<a name="outputting"></a>
## Outputting tweets

To create a corpus for later use, you may want to save tweets to a file. A series of functions are provided below to output to JSON, plain text, and also to read in JSON saved tweets. Review and check your understanding of these functions. These are also provided in `tweets_json.py`.

In [31]:
def to_full_json(tweets, filepath="tweets.json"):
    """Saves the provided tweets to filepath with all attributes, in JSON."""
    with open(filepath, 'w') as f:
        #Dump json file. indent=4 prints the output prettier, but will increase disk space.
        json.dump(tweets, f, indent=4)

def to_minimal_json(tweets, filepath="tweets.json"):
    """Saves the provided tweets to filepath, reduced to a minimal set of attributes, in JSON."""
    #This reduces each tweet to the set of keys (attributes) listed.
    #Other attributes can be used here, see https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
    atts=['id_str', 'full_text']
    minimal_tweets = [{k:tweet[k] for k in atts} for tweet in tweets]
    with open(filepath, 'w') as f:
        #Dump json file. indent=4 prints the output prettier, but will increase disk space.
        json.dump(minimal_tweets, f, indent=4)

def to_just_text(tweets, filepath="tweets.txt"):
    """Saves the provided tweets to filepath in plaintext, one tweet per line"""
    #utf-8 is set explicitly, as tweets often contain emoji and accented characters.
    with open(filepath, 'w', encoding='utf-8') as f:
        for tweet in tweets:
            #Linebreaks are replaced so we have one tweet per line.
            f.write("{}\n".format(tweet['full_text'].replace("\n", "  ").replace("\r", "  ")))
            
def load_json_tweets(filepath):
    """Loads a JSON file into a list of tweet dictionary objects"""
    #the with block closes the file once it has been read.
    with open(filepath) as f:
        tweets = json.load(f)
    return tweets

The full JSON can be saved, although note that this will take up some space. You can save to minimal JSON, and plain text by using the different functions.

In [63]:
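#tweets holds whatever was collected most recently, so re-collect the timeline before saving under this name.
tweets = get_user_tweets(twitter, "LancasterUni")
to_minimal_json(tweets, "@LancasterUni-min.json")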

To load previously saved tweets, use `load_json_tweets`. The cell below loads the provided UCREL tweets.

In [None]:
ucrel_tweets = load_json_tweets("@UCREL_NLP-full.json")
print(ucrel_tweets[0])
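
A quick way to check that saving and loading behave as expected is a round trip through a temporary file. The sketch below uses invented tweets and plain `json` calls in the same spirit as the functions above, so it runs without any collected data; the file name is arbitrary.

```python
import json
import os
import tempfile

# Invented tweets for illustration; real ones have many more attributes.
sample_tweets = [
    {"id_str": "1", "full_text": "First line\nsecond line", "lang": "en"},
    {"id_str": "2", "full_text": "Caf\u00e9 visit", "lang": "en"},
]

# Keep only the chosen attributes, as a minimal JSON export would.
atts = ["id_str", "full_text"]
minimal = [{k: tweet[k] for k in atts} for tweet in sample_tweets]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample-min.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(minimal, f, indent=4)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

# The reloaded list should equal what was written, line breaks and accents included.
print(reloaded == minimal)
```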

<a name="exercise"></a>
## Exercise

Using the [list of attributes available](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object), produce a list of tweets containing the time each tweet was written and its text. The output can be in JSON format or plain text, e.g. tab-separated values (TSV). Of course, there are numerous ways the tweets could be outputted, e.g. into a database or a CSV file. Feel free to experiment with different outputs that might be useful to you. You can use the provided UCREL_NLP tweets (`@UCREL_NLP-full.json`) for this, or any other collection of tweets you have made.

In [65]:
# Exercise 1



<a name="tasks"></a>
## Further Tasks

Come up with your own searches, and discuss them with your neighbours. A few you can try:

1. Find tweets mentioning *paper* and the *#NLProc* hashtag, which are not retweets.
2. Find 10 positive tweets about *rain*, and 10 negative tweets about *rain*.
3. Find Tweets **to or from** *@LancasterUni* mentioning *storm*, *wet* or *rain*.
4. Find recent tweets mentioning *snow*.
5. Find tweets mentioning *rain* from the Lancaster area.

Also, think about what useful tweet attributes are available to output for the above searches.

In [None]:
# optional 1. Find tweets mentioning paper and the #NLProc hashtag, which are not retweets.



In [None]:
# optional 2. Find 10 positive tweets about rain, and 10 negative tweets about rain.


In [None]:
# optional 3. Find Tweets to or from @LancasterUni mentioning storm, wet or rain.


In [None]:
# optional 4. Find recent tweets mentioning snow.


In [None]:
# optional 5. Find tweets mentioning rain from the Lancaster area.


In [None]:
# optional -- additional searches



<a name="advanced"></a>
## Optional advanced tasks


The Twitter API provides access to further information about tweets, users, and plenty more. Here’s a list of things you can try if you have time. Please feel free to make other suggestions.

1. Many other methods are available from Twython linked to Twitter API requests: https://twython.readthedocs.io/en/latest/api.html. One potentially useful task you should be able to perform by adapting the available code is to collect the user details of a given account (see *show_user*).
2. Expanding on 1., you could collect a list of users (e.g. a user's followers) and then collect all of their user information and tweets.
3. You can use the Twitter Streaming API to collect tweets in real-time as they are posted. See the instructions for implementing this with Twython: https://twython.readthedocs.io/en/latest/usage/streaming_api.html, and attempt to collect all tweets mentioning a word of interest.

In [None]:
# Advanced 1

In [None]:
# Advanced 2 

In [1]:
# Advanced 3 