<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Melanie Walsh](https://melaniewalsh.org/) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email melwalsh@uw.edu
____

# Working with Twitter Data (Lesson 3) — 6/24/2022

This is lesson **3** of 3 in the educational series on **Working with Twitter Data**. This notebook will demonstrate how researchers can collect tweets from a user's timeline (or multiple users' timelines), how to find out information about who a particular Twitter user is following and who is following that user in turn, and how to work with the new "context annotations" metadata, which provides extra contextual information about tweets.

**Audience:** Teachers / Learners / Researchers

**Use case:** Tutorial / How-To

**Difficulty:** Intermediate

**Completion time:** 30 minutes to 1 hour

**Knowledge Required/Recommended:** 

* Command line knowledge
* Python basics (variables, functions, lists, dictionaries)
* Pandas basics (Python library for data manipulation and analysis)


**Learning Objectives:**
After this lesson, learners will be able to:

1. Collect tweets from a specific Twitter user's timeline
2. Collect data about the Twitter accounts that a specific user is following
3. Collect data about the Twitter accounts that are following a specific user
4. Work with the new "context annotations" metadata

___

# Required Python Libraries
* [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/) for collecting Twitter data
* [plotly](https://plotly.com/python/) for making interactive plots
* [pandas](https://pandas.pydata.org/) for manipulating and cleaning data

## Install Required Libraries

In [None]:
### Install Libraries ###
!pip install twarc --upgrade
!pip install twarc-csv --upgrade
!pip install plotly

In [None]:
### Import Libraries ###
import plotly.express as px
import pandas as pd
# Set max column width
pd.options.display.max_colwidth = 400
# Set max number of columns
pd.options.display.max_columns = 95

# Twitter API Setup

*This lesson presumes that you've already installed and configured twarc, which was covered in [a previous lesson](Twitter-API-Setup).*

## Configure Twarc

Once twarc is installed, you need to configure it with your API keys and/or bearer token so that you can actually access the API. 

To configure twarc, you would typically run `twarc2 configure` from the command line. This will prompt twarc to ask for your bearer token, which you can copy and paste into the blank after the colon, and then press enter. You can optionally enter your API keys, as well.

<div class="admonition attention" name="html-admonition" style="background: orange; padding: 10px">
<p class="title">Note</p>
   To get your Bearer Token, go to your Twitter Developer portal: <a href= "https://developer.twitter.com/en/portal/dashboard">https://developer.twitter.com/en/portal/dashboard</a>

</div>

However, when working in a Jupyter notebook in the cloud, it is easiest to configure twarc and enter your Bearer Token in a single command. Please paste your Bearer Token between the quotation marks below and run the cell.

In [None]:
!printf '%s\n' "YOUR BEARER TOKEN HERE" "no" | twarc2 configure

Now you're ready to collect and analyze tweets!

## Get a User's Timeline (3200 Tweets)

To get the most recent tweets from a Twitter user's timeline (up to 3200 tweets), we will use [`twarc2 timeline username`](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#timeline_1). We could also get tweets for multiple users by passing a text file of usernames to the plural `timelines` command, e.g., [`twarc2 timelines usernames.txt`](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#timeline_1).

If you have access to the Academic Research track of the Twitter API, you can actually get all tweets from a user by including the flag `--use-search`.

Let's collect tweets from President Joe Biden's timeline: https://twitter.com/POTUS 🧐 What do you think the topic of the most retweeted tweets will be...?

In [None]:
!twarc2 timeline potus potus-tweets.jsonl
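
Before converting the file, it can help to confirm how many tweets were actually collected. The sketch below assumes that each line of the twarc2 output is one JSON response page whose `data` key holds a list of tweets. The helper name `count_tweets` and the tiny demo file are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def count_tweets(jsonl_path):
    """Count tweets in a twarc2-style JSONL file (assumed: one response page per line)."""
    total = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            page = json.loads(line)
            data = page.get("data", [])
            # "data" may be a list of tweets or a single tweet object
            total += len(data) if isinstance(data, list) else 1
    return total


# Demo on a made-up two-page file (placeholder content, not real tweets)
demo_pages = [
    {"data": [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}]},
    {"data": [{"id": "3", "text": "third"}]},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for page in demo_pages:
        tmp.write(json.dumps(page) + "\n")
    demo_path = tmp.name

demo_count = count_tweets(demo_path)
os.remove(demo_path)
print(demo_count)
```

To try it on the real data, call `count_tweets("potus-tweets.jsonl")` after the collection cell has finished.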

Let's convert these tweets to a CSV file.

In [None]:
!twarc2 csv potus-tweets.jsonl potus-tweets.csv

Let's read in the CSV file.

In [None]:
tweets_df = pd.read_csv('potus-tweets.csv', parse_dates = ['created_at'])

Let's apply our helper functions and create new columns for type of tweet and tweet URL.

In [None]:
# Find the type of tweet
def find_type(tweet):
    
    # Check to see if tweet contains retweet, quote tweet, or reply tweet info
    contains_retweet = tweet['referenced_tweets.retweeted.id']
    contains_quote = tweet['referenced_tweets.quoted.id']
    contains_reply = tweet['referenced_tweets.replied_to.id']
    
    # Does tweet contain retweet info? (Is this category not NA or empty?)
    if pd.notna(contains_retweet):
        return "retweet"
    # Does tweet contain quote and reply info?
    elif pd.notna(contains_quote) and pd.notna(contains_reply):
        return "quote/reply"
    # Does tweet contain quote info? 
    elif pd.notna(contains_quote):
        return "quote"
    # Does tweet contain reply info? 
    elif pd.notna(contains_reply):
        return "reply"
    # If it doesn't contain any of this info, it must be an original tweet
    else:
        return "original"

# Make Tweet URL
def make_tweet_url(tweets):
    # Get username (look it up by column label rather than by position)
    username = tweets['author.username']
    # Get tweet ID
    tweet_id = tweets['id']
    # Make tweet URL
    tweet_url = f"https://twitter.com/{username}/status/{tweet_id}"
    return tweet_url

In [None]:
# Create tweet type column
tweets_df['type'] = tweets_df.apply(find_type, axis =1)
# Create tweet URL column
tweets_df['tweet_url'] = tweets_df[['author.username', 'id']].apply(make_tweet_url, axis='columns')
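
Once a `type` column exists, a quick tally shows how the timeline splits between original tweets, replies, retweets, and quotes. This is a minimal sketch on a made-up stand-in dataframe (`toy_df` is hypothetical); with the real data, the same call would be `tweets_df['type'].value_counts()`.

```python
import pandas as pd

# Made-up stand-in for the `type` column created above
toy_df = pd.DataFrame({"type": ["original", "reply", "original", "retweet", "quote"]})

# Count how many tweets fall into each category
type_counts = toy_df["type"].value_counts()
print(type_counts)
```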

Let's select and rename only the columns we're interested in.

In [None]:
# Select columns of interest
clean_tweets_df = tweets_df[['created_at', 'author.username', 'author.name', 'author.description',
                             'author.verified', 'type', 'text', 'public_metrics.retweet_count', 
                             'public_metrics.like_count', 'public_metrics.reply_count', 'public_metrics.quote_count',
                             'tweet_url', 'lang', 'source', 'geo.full_name']]

# Rename columns for convenience
clean_tweets_df = clean_tweets_df.rename(columns={'created_at': 'date', 'public_metrics.retweet_count': 'retweets', 
                          'author.username': 'username', 'author.name': 'name', 'author.verified': 'verified', 
                          'public_metrics.like_count': 'likes', 'public_metrics.quote_count': 'quotes', 
                          'public_metrics.reply_count': 'replies', 'author.description': 'user_bio'})

clean_tweets_df

We can also create a date column that does not have hour/minute/second information, like so:

In [None]:
clean_tweets_df['formatted_date'] = clean_tweets_df['date'].dt.date
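
A date-only column makes it easy to count tweets per day. The sketch below uses a made-up dataframe (`toy_df` and its timestamps are illustrative, not collected data) and groups on a `formatted_date` column built the same way as above.

```python
import pandas as pd

# Made-up timestamps standing in for the `date` column
toy_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-06-01 09:15:00", "2022-06-01 17:40:00", "2022-06-02 12:00:00",
    ])
})

# Drop the time of day, then count rows for each calendar date
toy_df["formatted_date"] = toy_df["date"].dt.date
tweets_per_day = toy_df.groupby("formatted_date").size()
print(tweets_per_day)
```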

## Code Tweet Data By Keyword

In the previous lesson, we saw how Kevin McElwee was able to produce a [really cool Twitter analysis](https://www.kmcelwee.com/fortune-100-blm-report/site/index.html) by qualitatively coding whether Fortune 100 tweets were discussing racial justice or not.

I wanted to show a quick example of how we can use a Python function to do something similar: code whether or not a tweet contains certain keywords.

The function below will check to see whether a tweet contains any words that are included in the list `keywords`. In this example, we're coding whether or not the tweet is discussing COVID. 

In [None]:
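def check_for_keywords(text):
    
    # Pick your own keywords!
    keywords = ["COVID", "virus"]
    
    # Return True as soon as any keyword is found in the text
    for word in keywords:
        if word in text:
            return True
    # Only return False after every keyword has been checked
    return False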

In [None]:
check_for_keywords("The COVID-19 crisis is serious")
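
Note that the `in` check is case-sensitive, so a tweet that mentions "covid" in lowercase would not match the keyword "COVID". One possible variant, sketched below, lowercases both sides before comparing and returns `False` for missing text. The name `check_for_keywords_ci` and its `keywords` parameter are hypothetical additions, not part of the original lesson.

```python
def check_for_keywords_ci(text, keywords=("covid", "virus")):
    """Return True if any keyword appears in the text, ignoring case."""
    # Missing tweet text shows up as NaN (a float) in pandas, so guard against non-strings
    if not isinstance(text, str):
        return False
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


print(check_for_keywords_ci("The covid-19 crisis is serious"))
print(check_for_keywords_ci("Infrastructure week"))
```

Because this is substring matching, a keyword like "virus" would also match inside a longer word such as "antivirus", which may or may not be what you want.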

We can create a new column (which could be named whatever we want) by applying this function to the text column.

In [None]:
clean_tweets_df['COVID?'] = clean_tweets_df['text'].apply(check_for_keywords)
#clean_tweets_df['your own column name'] = clean_tweets_df['text'].apply(check_for_keywords)

Now we can use this new column to filter and examine only the tweets that are explicitly discussing COVID.

In [None]:
clean_tweets_df[clean_tweets_df["COVID?"] == True]
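
A True/False column can also be summarized directly. As a rough sketch on made-up numbers (`toy_df` is hypothetical), the mean of a boolean column gives the share of coded tweets, and `groupby` compares average engagement between the two groups.

```python
import pandas as pd

# Made-up values standing in for the coded column and the retweet counts
toy_df = pd.DataFrame({
    "COVID?": [True, False, True, False],
    "retweets": [100, 40, 300, 60],
})

# True counts as 1 and False as 0, so the mean is the share of coded tweets
share_covid = toy_df["COVID?"].mean()

# Average retweets for coded vs. uncoded tweets
avg_by_flag = toy_df.groupby("COVID?")["retweets"].mean().to_dict()

print(share_covid)
print(avg_by_flag)
```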

## Save Tweets as Spreadsheet

Anytime we want to save a dataframe as a spreadsheet, we can use the `.to_csv()` method.

In [None]:
clean_tweets_df.to_csv("clean-potus-tweets.csv", 
                       # remove the index
                       index=False)

## Datawrapper

With only the data that we just collected and coded, we can make a sophisticated data visualization — either in Python or with a different data visualization platform.

For example, if we drop our CSV file into Datawrapper (https://www.datawrapper.de/), we can create something that looks like this:

In [None]:
%%html
<iframe title="President Joe Biden's Most Recent Tweets " aria-label="Scatter Plot" id="datawrapper-chart-zeOuo" src="https://datawrapper.dwcdn.net/zeOuo/1/" scrolling="no" frameborder="0" style="border: none;" width="692" height="495"></iframe>

Be sure to check out these tips for customizing Datawrapper tooltips with HTML: https://academy.datawrapper.de/article/237-i-want-to-change-how-my-data-appears-in-tooltips

## Get Who a User Is Following

We can also use the Twitter API to find out who a Twitter user is following and who is following that user. Researchers and journalists have used follower/following data in a number of ways, such as examining [how conservative vs. liberal politicians gained or lost followers](https://www.theverge.com/2022/4/27/23045005/conservative-twitter-follower-boost-musk-acquisition-data) after Elon Musk finalized his deal to buy Twitter (via The Verge). 

To get information about all the Twitter accounts that a particular Twitter user is following, we will use [`twarc2 following username`](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#following_1).



Let's see who Joe Biden is following on Twitter.

In [None]:
!twarc2 following potus potus_following.jsonl 

To convert this user data into a CSV file, we can use `twarc2 csv`, but we have to include a special flag, `--input-data-type users`, which specifies that this is user data, not tweet data.

In [None]:
!twarc2 csv potus_following.jsonl --input-data-type users potus_following.csv

Let's see what this data looks like:

In [None]:
following_df = pd.read_csv('potus_following.csv', parse_dates = ['created_at'])

In [None]:
following_df = following_df.rename(columns={'public_metrics.following_count': 'following', 
                                            'public_metrics.followers_count': 'followers', 
                                            'public_metrics.tweet_count': 'tweets',
                                           })
following_df = following_df[["created_at", "username", "name", "description", "location", "followers",
              "following", "tweets", "url", "verified"]]

Which of the Twitter accounts that Joe Biden is following has the most followers, the most total tweets, and the most accounts that they themselves are following?

In [None]:
following_df.sort_values("followers", ascending=False)
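
Sorting answers the question one metric at a time. To answer all three at once, one option is `idxmax()`, which returns the row label of the largest value in a column. The sketch below runs on made-up accounts and counts (`toy_following` is a hypothetical stand-in for `following_df`).

```python
import pandas as pd

# Made-up accounts and counts, standing in for `following_df`
toy_following = pd.DataFrame({
    "username": ["account_a", "account_b", "account_c"],
    "followers": [500, 12000, 300],
    "following": [50, 10, 900],
    "tweets": [7000, 150, 2200],
})

# For each metric, look up the username in the row holding the maximum value
top_accounts = {
    metric: toy_following.loc[toy_following[metric].idxmax(), "username"]
    for metric in ["followers", "following", "tweets"]
}
print(top_accounts)
```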

We could imagine that we might want to collect tweets for all of these Twitter accounts. To do so, we might write all these usernames to a text file.

In [None]:
following_df['username']

Write the usernames to a text file:

In [None]:
following_df['username'].to_csv("usernames.txt", index=False, header=False)

Get the timelines for all of those users:

In [None]:
#!twarc2 timelines usernames.txt all_timelines.jsonl

In [None]:
#!twarc2 csv all_timelines.jsonl all_timelines.csv

## Get a User's Followers

To get information about all the Twitter accounts that follow a particular Twitter user, we will use [`twarc2 followers username`](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#following_1).

Joe Biden has too many followers for a quick example, so let's see who's following my William Carlos Williams Twitter bot.

In [None]:
!twarc2 followers sosweetbot sosweetbot_followers.jsonl 

In [None]:
!twarc2 csv sosweetbot_followers.jsonl --input-data-type users sosweetbot_followers.csv

In [None]:
followers_df = pd.read_csv('sosweetbot_followers.csv', parse_dates = ['created_at'])

In [None]:
clean_followers_df = followers_df.rename(columns={'public_metrics.following_count': 'following', 
                                            'public_metrics.followers_count': 'followers', 
                                            'public_metrics.tweet_count': 'tweets',
                                           })
clean_followers_df = clean_followers_df[["created_at", "username", "name", "description", "location", "followers",
              "following", "tweets", "url", "verified"]]

In [None]:
clean_followers_df
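
If both the following list and the followers list are collected for the same account, a set intersection shows which accounts follow each other. This is only a sketch on made-up usernames. With real data, the two sets could be built from the `username` columns of the corresponding dataframes.

```python
# Made-up usernames standing in for the `username` columns of the two dataframes
following_usernames = {"account_a", "account_b", "account_c"}
followers_usernames = {"account_b", "account_c", "account_d"}

# Accounts that appear in both sets are followed by the user and follow the user back
mutuals = sorted(following_usernames & followers_usernames)
print(mutuals)
```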

## Context Annotations

Twitter recently introduced a new piece of metadata for tweets: **context annotations**. These annotations are supposed to help document the contextual topic of a tweet, even if the topic itself is not explicitly mentioned in the tweet.

> How does Twitter context annotations work?

> Twitter classifies Tweets semantically, meaning that we curate lists of keywords, hashtags, and @handles that are relevant to a given topic. If a Tweet contains the text we’ve specified, it will be labeled appropriately. This differs from a machine learning approach where a model is trained specifically to classify text (in this case, Tweets) and produce a probability score alongside the output/classification.

> How do I know that your data is complete and trustworthy?

> Twitter's annotations are curated by domain experts using research and QA processes that have been refined over the course of several years. The process is supported by custom tooling to scale data tracking as far as we are able to maintain excellent precision and recall. In addition, our data is audited regularly by an internal team, having received a precision score of ~80% for the past several quarters.

> -[Twitter Context Annotation FAQ](https://developer.twitter.com/en/docs/twitter-api/annotations/faq)

Twitter has also provided a list of all the [currently existing context annotations](https://developer.twitter.com/en/docs/twitter-api/annotations/faq).

In [None]:
all_context_annotations = pd.read_csv("https://raw.githubusercontent.com/twitterdev/twitter-context-annotations/main/files/evergreen-context-entities-20220601.csv")
all_context_annotations

In [None]:
all_context_annotations[['entity_name']].sample(50)

Let's check out context annotations for a couple of tweets! Suhem Parack has created a small web application where we can insert any tweet URL and get that tweet's context annotations: https://tweet-entity-extractor.glitch.me/

Tweet 1: https://twitter.com/POTUS/status/1532057523347689472

Tweet 2: https://twitter.com/POTUS/status/1397595270582505474

In [None]:
tweets_df['text'][130]

In [None]:
tweets_df['context_annotations'][130]
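
To make the structure easier to see, here is a small sketch that parses one cell by hand. It assumes the cell is a string representation of a list of dictionaries, each with a `domain` and an `entity` that carry a `name`. The sample string below is illustrative only: the IDs are placeholders and the domain names are examples, not values copied from the collected data.

```python
from ast import literal_eval

# Illustrative string in the assumed shape of one `context_annotations` cell
# (IDs are placeholders, not real Twitter IDs)
sample_cell = (
    "[{'domain': {'id': '0', 'name': 'Politician'}, "
    "'entity': {'id': '0', 'name': 'Joe Biden'}}, "
    "{'domain': {'id': '0', 'name': 'Political Body'}, "
    "'entity': {'id': '0', 'name': 'The White House'}}]"
)

# Turn the string into a real Python list of dictionaries
annotations = literal_eval(sample_cell)

# Pair each broad domain with the specific entity it labels
pairs = [(a["domain"]["name"], a["entity"]["name"]) for a in annotations]
for domain_name, entity_name in pairs:
    print(f"{domain_name}: {entity_name}")
```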

As you can see, the "context_annotations" column is dense, and extracting this information is a bit tricky.

Here are two Python functions that can help us count up all the annotations and extract the annotations as a new column in the data.

In [None]:
from ast import literal_eval
from collections import Counter

context_counter = Counter()

# Count up all context annotations
def count_context(annotations):
    # if not NaN
    if type(annotations) != float:
        # Convert to an actual Python list, not just a string
        annotations = literal_eval(annotations)
        names = []
        # for every annotation in the tweet
        for annotation in annotations:
            # grab the name
            name = annotation['entity']['name']
            # only keep each name once per tweet
            if name not in names:
                names.append(name)
        # add names to counter
        context_counter.update(names)

# Extract context annotations
def extract_context(annotations):
    # if not NaN
    if type(annotations) != float:
        # Convert to an actual Python list, not just a string
        annotations = literal_eval(annotations)
        names = []
        # for every annotation in the tweet
        for annotation in annotations:
            # grab the name
            name = annotation['entity']['name']
            names.append(name)
        return names

Let's count up all the context annotations from Joe Biden's most recent tweets.

In [None]:
# Apply function to column
tweets_df['context_annotations'].apply(count_context)
# Pull out list of most common annotations
context_counter.most_common()

We can make a dataframe of the annotation counts easily with `pd.DataFrame()` (see the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)) and then we can plot the top 10 annotations (other than Joe Biden and The White House) with `.plot()` (see the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)).

In [None]:
%matplotlib inline
# Make a DataFrame
context_df = pd.DataFrame(context_counter.most_common(), columns = ['context', 'count'])
# Skip the top 2 annotations (rows 0 and 1), then make a bar plot of the next 10
context_df[2:12].plot(kind = 'barh', x = "context", y = "count", title = "Joe Biden Tweets",
                   figsize = (10, 7)).invert_yaxis()

Let's extract the context annotations from Joe Biden's most recent tweets and add them as a new column.

In [None]:
tweets_df['context'] = tweets_df['context_annotations'].apply(extract_context)
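
With the annotations stored as a list in each row, pandas can also count them without a separate counter. The sketch below, run on a made-up dataframe (`toy_df` is hypothetical), uses `explode()` to give every annotation its own row and then tallies the names.

```python
import pandas as pd

# Made-up rows standing in for tweets with a `context` list column;
# None represents a tweet that has no annotations
toy_df = pd.DataFrame({
    "text": ["tweet one", "tweet two", "tweet three"],
    "context": [["Joe Biden", "Politics"], ["Joe Biden"], None],
})

# One row per (tweet, annotation) pair; tweets without annotations are dropped
exploded = toy_df.explode("context").dropna(subset=["context"])
annotation_counts = exploded["context"].value_counts()
print(annotation_counts)
```

Unlike a per-tweet count that keeps each name only once, this tally counts a name every time it appears, so repeated names within one tweet would be counted more than once.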

In [None]:
# Select columns of interest
clean_tweets_df = tweets_df[['created_at', 'author.username', 'author.name', 'author.description',
                             'author.verified', 'type', 'text', 'context', 'public_metrics.retweet_count', 
                             'public_metrics.like_count', 'public_metrics.reply_count', 'public_metrics.quote_count',
                             'tweet_url', 'lang', 'source', 'geo.full_name']]

# Rename columns for convenience
clean_tweets_df = clean_tweets_df.rename(columns={'created_at': 'date', 'public_metrics.retweet_count': 'retweets', 
                          'author.username': 'username', 'author.name': 'name', 'author.verified': 'verified', 
                          'public_metrics.like_count': 'likes', 'public_metrics.quote_count': 'quotes', 
                          'public_metrics.reply_count': 'replies', 'author.description': 'user_bio'})

clean_tweets_df

## Further Resources

- Suhem Parack's ["Getting started with the Twitter API v2 for academic research"](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research)
- Melanie Walsh's [chapter on Twitter data](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/12-Twitter-Data.html) from *Introduction to Cultural Analytics & Python*
- Twitter's [forum for academic research](https://twittercommunity.com/c/academic-research/62)
- Twitter's [Community space for academic researchers](https://twitter.com/i/communities/1494750019467063297)