# tweepy

The `tweepy` library has very good [documentation](http://docs.tweepy.org/en/latest/), which is worth consulting if you want to use it.

The examples in this notebook are adapted from the workshop [*Fundamentals of Data Analysis with Python*](https://github.com/mclevey/GESIS_2020_Fundamentals) by [John McLevey](https://github.com/mclevey) and [Jillian Anderson](https://github.com/jillianderson8).

## Import libraries

In [None]:
import tweepy
import pandas as pd
import time

## Authentication

To use `tweepy`, you need to have registered a Twitter application. Once you have done so, you can find the necessary information in the *Keys & Tokens* menu for your app. If you need guidance or a reminder on where to find that information, have a look at the documentation of the `rtweet` package, which has a [section on this topic](https://rtweet.info/articles/auth.html). Keep in mind that the layout of the Twitter developer pages might change.

**NB**: You should treat all information relating to your API key like a password and never share it or post it publicly anywhere.
For the purpose of this notebook, we will store the keys in a separate file. To do this, open the file [config_twitter.py](./config_twitter.py), enter the information for your app, and save the file by pressing CTRL + S (Windows) or CMD + S (macOS). After that, you can run the code cell below to import the file and authenticate.

Although nobody except you should be able to access your personal instance of this notebook (and your edits will also not be persistent if you do not have/use a *GESIS Notebooks* user account), if you want to be extra cautious, you can delete your API access information from the [config_twitter.py file](./config_twitter.py) (and save the file again) after running the following cell once.

In [None]:
import config_twitter

auth = tweepy.OAuthHandler(config_twitter.API_KEY, config_twitter.API_KEY_SECRET)
auth.set_access_token(config_twitter.ACCESS_TOKEN, config_twitter.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)

The arguments `wait_on_rate_limit=True` and `wait_on_rate_limit_notify=True` tell `tweepy` to pause automatically whenever a Twitter API rate limit is reached and to print a notification while it waits, so that we stay within the rate limits.
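The authentication cell reads four names from `config_twitter`: `API_KEY`, `API_KEY_SECRET`, `ACCESS_TOKEN`, and `ACCESS_TOKEN_SECRET`. As a minimal sketch, a check like the one below could catch an entry that was left empty before authenticating. The helper `missing_credentials` and the stand-in object `example_config` are hypothetical and only serve as an illustration; they are not part of `tweepy` or of the workshop materials.

```python
from types import SimpleNamespace

# The four names the authentication cell reads from config_twitter.
REQUIRED = ("API_KEY", "API_KEY_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

def missing_credentials(config):
    """Return the required names that are missing or empty on `config`."""
    return [name for name in REQUIRED if not getattr(config, name, "")]

# Hypothetical stand-in for the imported module, with one value left empty on purpose.
example_config = SimpleNamespace(
    API_KEY="xxxx",
    API_KEY_SECRET="xxxx",
    ACCESS_TOKEN="",
    ACCESS_TOKEN_SECRET="xxxx",
)

print(missing_credentials(example_config))
```

In the notebook itself, the same function could be called as `missing_credentials(config_twitter)` after the import.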

## The `REST API`

The file [twitter_accounts.csv](./data/twitter_accounts.csv) in the `data` folder of this repository contains a few Twitter screen names which we will use in the following examples.

In [None]:
accounts = pd.read_csv('data/twitter_accounts.csv')
accounts = accounts['Screen Name'].tolist()
accounts

### Account information

Before we collect information about the accounts, we first look up each screen name with `api.get_user`. The returned user objects include the user ID, which is the more reliable identifier because screen names can change, whereas user IDs remain the same.

In [None]:
ids = [api.get_user(i) for i in accounts]

Now we can use the user objects stored in `ids` to gather the account information. We will store the results as a [`pandas`](https://pandas.pydata.org/) dataframe.

In [None]:
account_info = [[i.name, i.screen_name, i.id, i.description, i.location, i.followers_count, i.friends_count, i.protected] for i in ids]
account_info = pd.DataFrame(account_info, columns = ['Name', 'Handle', 'Twitter ID Number', 'Description', 'Location', 'Number of Followers', 'Follows', 'Protected'])
account_info

If you want to, you can store the result as a `CSV` file by executing the following cell.

In [None]:
account_info.to_csv('data/twitter_accounts_workshop_orgs.csv', index = False)

### Accounts followed

The following function retrieves the IDs of the accounts a user follows, in pages of up to 5000 IDs per request. Again, we use the Twitter user ID instead of the screen name.

In [None]:
def get_followed(user_id):
    friends = []
    cursor = tweepy.Cursor(api.friends_ids, id=user_id, count=5000) 
    for page in cursor.pages():
        for friend in page:
            friends.append(friend)
    return friends
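To illustrate what `tweepy.Cursor` handles for us, here is a rough sketch of cursor-based paging that runs without any API access. Twitter's cursored endpoints start at cursor `-1` and signal the last page with a `next_cursor` of `0`. The function `fake_friends_ids` and its data are made up for this example and only imitate that behaviour.

```python
# Hypothetical stand-in for a cursored endpoint: 12 fake IDs served 5 per page.
FAKE_IDS = list(range(1001, 1013))
PAGE_SIZE = 5

def fake_friends_ids(cursor=-1):
    """Return (page_of_ids, next_cursor); a next_cursor of 0 means no more pages."""
    start = 0 if cursor == -1 else cursor
    end = start + PAGE_SIZE
    next_cursor = end if end < len(FAKE_IDS) else 0
    return FAKE_IDS[start:end], next_cursor

def collect_all(fetch_page):
    """Follow the next_cursor values until the endpoint reports 0."""
    collected, cursor = [], -1
    while cursor != 0:
        page, cursor = fetch_page(cursor)
        collected.extend(page)
    return collected

all_ids = collect_all(fake_friends_ids)
print(len(all_ids))
```

The loop in `collect_all` plays the role of `cursor.pages()` in `get_followed`: it keeps requesting pages until there are none left.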

We can now use this function to collect the IDs of all the users followed by *@gesis_org*. From our previous call to the `REST API` we know that the User ID for the *GESIS* account is 145554242.

In [None]:
gesis_followed = get_followed('145554242')

In [None]:
len(gesis_followed)

We can use these IDs to collect information about these accounts. We could, of course, do this for all accounts followed by *@gesis_org*. However, this would take some time and we might hit our Twitter API rate limit with this. Hence, we'll just do this for the first 10 accounts here.

In [None]:
gesis_followed_test = gesis_followed[:10]

gesis_followed_info = [api.get_user(i) for i in gesis_followed_test]

We can now turn the result into a `pandas` dataframe.

In [None]:
gesis_followed_info_cut = [[i.name, i.screen_name, i.id, i.description, i.location, i.followers_count, i.friends_count, i.protected] for i in gesis_followed_info]
pd.DataFrame(gesis_followed_info_cut, columns = ['Person', 'Handle', 'Twitter ID Number', 'Description', 'Location', 'Number of Followers', 'Follows', 'Protected'])
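Calling `api.get_user` once per ID uses one request per account. If you wanted information on all followed accounts, one option might be to look them up in batches: Twitter's `users/lookup` endpoint accepts up to 100 user IDs per request, and `tweepy` exposes it as `api.lookup_users` (the argument name differs between `tweepy` versions, so check the documentation for the version you have installed). The helper `chunked` below is a hypothetical sketch that only prepares the batches and makes no API calls.

```python
def chunked(items, size=100):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up IDs standing in for a list such as gesis_followed.
example_ids = list(range(250))
batches = list(chunked(example_ids))

print([len(batch) for batch in batches])
```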

### Historical tweets & metadata

The following cell defines the function `get_tweet_data` that takes a screen name and requests the tweets from the user timeline via the Twitter REST API. It also collects some of the metadata for each tweet.

In [None]:
def get_tweet_data(user, user_meta=False):
    tweets = []
    
    for tw in tweepy.Cursor(api.user_timeline, screen_name=user, exclude_replies=False, count = 200, tweet_mode = 'extended').items():
        tdict = {}
        
        tdict['text'] = tw.full_text.replace('\n', '').strip() # remove newline characters + leading and trailing whitespace
        tdict['tweet_id'] = tw.id
        tdict['retweet_count'] = tw.retweet_count
        tdict['fav_count'] = tw.favorite_count
        tdict['user_id'] = tw.user.id        
        tdict['user_screen_name'] = tw.user.screen_name
        tdict['time'] = tw.created_at
        tdict['hashtags'] = [hashtag['text'] for hashtag in tw.entities['hashtags']]
        tdict['user_mentions'] = [user['screen_name'] for user in tw.entities['user_mentions']]
        
        if user_meta is True:
            tdict['location'] = tw.user.location
            tdict['user_description'] = tw.user.description
            tdict['user_url'] = tw.user.url
        
        tweets.append(tdict)
    
    return tweets
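One detail of the text cleaning in `get_tweet_data`: replacing `'\n'` with an empty string joins the words on either side of a line break. If that is not what you want, a possible alternative is to collapse all whitespace into single spaces. The helper `clean_text` below is a hypothetical sketch of that idea and not part of the original function.

```python
def clean_text(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return ' '.join(text.split())

sample = "First line\nsecond line\r\n  third line  "

print(clean_text(sample))
```

Inside `get_tweet_data`, this would correspond to `tdict['text'] = clean_text(tw.full_text)`.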

We can now use this function to collect historical tweets (plus some metadata) for the accounts from our list. Note that running the next cell might take a few seconds. As the resulting object is a list of lists, we need to flatten it before we can create a `pandas` dataframe.

In [None]:
account_tweets = [get_tweet_data(i) for i in accounts]

account_tweets = [y for x in account_tweets for y in x]
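The nested list comprehension above is one way to flatten a list of lists. An equivalent that some readers find easier to parse is `itertools.chain.from_iterable` from the standard library. The data in this sketch is made up and only mimics the shape of the per-account results.

```python
from itertools import chain

# Made-up stand-in: one list of tweet dicts per account.
nested = [
    [{'tweet_id': 1}, {'tweet_id': 2}],
    [],                      # an account without tweets
    [{'tweet_id': 3}],
]

flat = list(chain.from_iterable(nested))

print(flat)
```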

In [None]:
account_tweets_df = pd.DataFrame(account_tweets)

account_tweets_df['hashtags'] = account_tweets_df['hashtags'].apply(', '.join)
account_tweets_df['user_mentions'] = account_tweets_df['user_mentions'].apply(', '.join)

account_tweets_df.head()

As before, if you want to, you can store the result as a `CSV` file by executing the following cell.

In [None]:
account_tweets_df.to_csv('data/tweets_tweepy.csv', index = False)

## The `Streaming API`

If you want to use `tweepy` to collect data from the Twitter `Streaming API`, you need to set up a `StreamListener`. To learn more about this, you can read the [corresponding section of the `tweepy` documentation](http://docs.tweepy.org/en/latest/streaming_how_to.html). The `MyStreamListener` defined in the cell below collects data from the `Streaming API` for 10 seconds and appends new tweet data to a file called `tweepy_stream.csv`, which is stored in the `data` folder.

In [None]:
SEP = ';'
csv = open('data/tweepy_stream.csv', 'a', encoding='utf-8')
csv.write('Date' + SEP + 'Tweet' + SEP + 'Number of Followers' + SEP + 'Follows' + SEP + 'Handle' + '\n')

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, time_limit=10): # you can set the default timeout for this StreamListener here (in seconds)
        self.start_time = time.time()
        self.limit = time_limit
        super(MyStreamListener, self).__init__()

    def on_status(self, status):
        if (time.time() - self.start_time) < self.limit:
            if hasattr(status, 'retweeted_status'):
                try:
                    tweet = status.retweeted_status.extended_tweet["full_text"]
                except AttributeError:
                    tweet = status.retweeted_status.text
            else:
                try:
                    tweet = status.extended_tweet["full_text"]
                except AttributeError:
                    tweet = status.text
        
            date = status.created_at.strftime("%Y-%m-%d-%H:%M:%S")
            follower = str(status.user.followers_count)
            friend = str(status.user.friends_count)
            name = status.user.screen_name
        
            csv.write(date + SEP + tweet.strip().replace("\n","").replace('\r','').replace(';',',') + SEP + follower + SEP + friend + SEP + name + '\n')
            return True
        else:
            csv.close()
            return False
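The listener above builds each row by hand and replaces semicolons in the tweet text so that they do not break the columns. As an alternative sketch, Python's built-in `csv` module can quote such fields instead, which keeps the text unchanged. Note that the cell above uses `csv` as the name of the file handle, so the handle would need a different name if you import the module. The helper `write_row` and the sample row are hypothetical.

```python
import csv
import io

def write_row(handle, row, sep=';'):
    """Write one row; fields containing the separator or quotes get quoted."""
    csv.writer(handle, delimiter=sep, lineterminator='\n').writerow(row)

buffer = io.StringIO()  # in-memory stand-in for the open output file
write_row(buffer, ['Date', 'Tweet', 'Handle'])
write_row(buffer, ['2020-09-01-12:00:00', 'Stay safe; wash hands', 'example_user'])

print(buffer.getvalue())
```

Reading the file back with `csv.reader(..., delimiter=';')` then returns the tweet text with its semicolon intact.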

When you run the two following cells, you will start to stream all tweets mentioning "COVID-19" or "Corona" for 10 seconds. If you change the time limit and want to manually stop the data collection, you need to interrupt the `Python` kernel in this notebook. You can do so by clicking the square stop button in the notebook menu on top or by pressing the i-key on your keyboard twice.

In [None]:
myStreamListener = MyStreamListener() # to change the duration of the collection, you can add the argument time_limit = value_in_seconds here
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)

In [None]:
myStream.filter(track=['COVID-19', 'Corona'])

You can check the results by opening the [tweepy_stream.csv](./data/tweepy_stream.csv) file.

You can also stream tweets by specific users. To do this, pass a list of user IDs (not screen names) to the `follow` parameter, for example `myStream.filter(follow=['145554242'])`.
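Since the IDs collected earlier in this notebook are numeric, converting them to strings before passing them to `follow` is a safe choice. The sketch below shows that conversion with a short list: the first ID is the *GESIS* account ID mentioned above, and the second one is made up for illustration.

```python
# 145554242 is the GESIS account ID from above; 12345 is a made-up placeholder.
user_ids = [145554242, 12345]

# Convert the numeric IDs to strings for the follow parameter.
follow_ids = [str(user_id) for user_id in user_ids]

print(follow_ids)
```

The resulting list could then be used as `myStream.filter(follow=follow_ids)`.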