<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Melanie Walsh](https://melaniewalsh.org/) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email melwalsh@uw.edu
____

# Tweet Collection (Workbook) — June 2022

Here's a streamlined workbook for collecting tweet counts and tweets.
___

# Install Libraries

In [None]:
### Install Libraries ###
!pip install twarc --upgrade
!pip install twarc-csv --upgrade
!pip install plotly
!pip install wordcloud

In [1]:
### Import Libraries ###
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import pandas as pd
# Set max column width
pd.options.display.max_colwidth = 400
# Set max number of columns
pd.options.display.max_columns = 95

# Twitter API Setup

## Configure Twarc

Once twarc is installed, you need to configure it with your API keys and/or bearer token so that you can actually access the API. 

To configure twarc, you would typically run `twarc2 configure` from the command line. twarc then prompts for your bearer token, which you paste into the blank after the colon before pressing Enter. You can optionally enter your API keys as well. The cell below answers these prompts from inside the notebook: it pipes in your bearer token, followed by "no" to decline the optional API keys step.

In [None]:
!printf '%s\n' "YOUR BEARER TOKEN HERE" "no" | twarc2 configure

If you've entered your information correctly, you should get a congratulatory message.

# Twitter Data Collection & Visualization
## Get Tweet Counts and Save as Spreadsheet

In [13]:
!twarc2 counts "your query" --csv --archive --granularity day > query-tweet-counts.csv

In [None]:
# Change the title of the plot here
my_title = "YOUR TITLE HERE"
# Change the file name here
filename = "YOUR FILENAME HERE"

# Read in CSV as DataFrame
tweet_counts_df = pd.read_csv(filename, parse_dates=['start', 'end'])

# Set start time as DataFrame index
tweet_counts_df.set_index('start', inplace=True)

# Regroup, or resample, tweets by month, day, or year
# (pandas 2.2+ prefers the aliases 'ME' and 'YE' over 'M' and 'Y')
grouped_count = tweet_counts_df.resample('M')['day_count'].sum().reset_index() # Month
#grouped_count = tweet_counts_df.resample('D')['day_count'].sum().reset_index() # Day
#grouped_count = tweet_counts_df.resample('Y')['day_count'].sum().reset_index() # Year

# Make a line plot from the DataFrame and specify x and y axes, axes titles, and plot title
px.line(grouped_count, x = 'start', y = 'day_count',
    labels = {'start': 'Time', 'day_count': '# of Tweets'},
    title = my_title)
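
To see what the regrouping step does before running it on real data, here is a minimal sketch on a made-up table shaped like the counts CSV (a `start` column and a `day_count` column). The numbers and the `toy_counts_df` name are invented for illustration. Instead of `resample`, it groups on the calendar month with `to_period`, which gives the same monthly totals and avoids differences in frequency aliases between pandas versions.

```python
import pandas as pd

# Synthetic daily counts (made-up numbers), shaped like the twarc2 counts CSV
toy_counts_df = pd.DataFrame({
    "start": pd.date_range("2022-01-30", periods=4, freq="D"),
    "day_count": [10, 20, 30, 40],
}).set_index("start")

# Group the daily rows by calendar month and add up the counts
monthly = toy_counts_df.groupby(toy_counts_df.index.to_period("M"))["day_count"].sum()

print(monthly)
```

The two January days (10 + 20) and the two February days (30 + 40) are each collapsed into one row per month.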

## Get Tweets and Save as Spreadsheet

To actually collect tweets and their associated metadata, we can use the command `twarc2 search` and insert a query.

Here we're going to search for any tweets that mention certain words and were tweeted by verified accounts. By default, `twarc2 search` will use the essential track of the Twitter API, which only collects tweets from the past week.
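
If it helps to assemble a query step by step, the sketch below builds one in Python. `build_query` is a hypothetical helper written for this workbook, not part of twarc. It assumes the Twitter API v2 search operators `OR`, `is:verified`, and `-is:retweet`, and uses "plums" and "icebox" only as example words.

```python
import shlex

def build_query(words, verified_only=False, exclude_retweets=False):
    """Hypothetical helper: join words with OR and append optional operators."""
    # Wrap multi-word phrases in double quotes so they are matched exactly
    terms = [f'"{w}"' if " " in w else w for w in words]
    query = "(" + " OR ".join(terms) + ")"
    if verified_only:
        query += " is:verified"
    if exclude_retweets:
        query += " -is:retweet"
    return query

query = build_query(["plums", "icebox"], verified_only=True)

# shlex.quote makes the query safe to paste into a shell command
command = f"twarc2 search --limit 5000 {shlex.quote(query)} > my_tweets.jsonl"

print(query)
print(command)
```

The printed command can be pasted into a notebook cell after a `!`, with `--archive` added only if you have Academic Research access.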

<div class="admonition attention" name="html-admonition" style="background: lightyellow; padding: 10px">
<p class="title">Attention 🚨</p>
    Remember that the <code>--archive</code> flag and full-archive search functionality are only available to those who have an <a href="https://developer.twitter.com/en/products/twitter-api/academic-research">Academic Research account</a>.
    Students with Essential API access should remove the <code>--archive</code> flag from the code below.

</div>

You might want to limit your search to 5,000 tweets or fewer, which is what the `--limit 5000` flag below does.

In [30]:
!twarc2 search --archive --limit 5000 "YOUR QUERY" > my_tweets.jsonl

In [31]:
!twarc2 csv my_tweets.jsonl my_tweets.csv

100%|██████████████| Processed 3.36M/3.36M of input file [00:00<00:00, 7.44MB/s]

ℹ️
Parsed 1550 tweets objects from 17 lines in the input file.
Wrote 1550 rows and output 74 columns in the CSV.
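
The message above reports far more tweets than lines because, as far as I understand the twarc2 output, each line of the JSONL file holds one page of API results with its tweets in a `data` list. The sketch below counts lines and tweets under that assumption. It writes a tiny made-up file first so it can run on its own, and `count_tweets` is a hypothetical helper, not a twarc function.

```python
import json
import os
import tempfile

# Two made-up response pages, mimicking the assumed twarc2 layout
pages = [
    {"data": [{"id": "1"}, {"id": "2"}, {"id": "3"}]},
    {"data": [{"id": "4"}, {"id": "5"}]},
]

def count_tweets(jsonl_path):
    """Return (number of non-empty lines, number of tweet objects under 'data')."""
    n_lines = 0
    n_tweets = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            n_lines += 1
            n_tweets += len(json.loads(line).get("data", []))
    return n_lines, n_tweets

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_tweets.jsonl")
    with open(toy_path, "w", encoding="utf-8") as f:
        for page in pages:
            f.write(json.dumps(page) + "\n")
    n_lines, n_tweets = count_tweets(toy_path)

print(n_lines, n_tweets)
```

To try it on your own data, call `count_tweets('my_tweets.jsonl')`. The total may differ from what `twarc2 csv` reports if the converter counts tweets differently.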



Now we're ready to explore the data!

In [None]:
my_tweets_df = pd.read_csv('my_tweets.csv', parse_dates = ['created_at'])

Let's define two helper functions and then apply them to create new columns.

In [36]:
# Find the type of tweet
def find_type(tweet):
    
    # Check to see if tweet contains retweet, quote tweet, or reply tweet info
    contains_retweet = tweet['referenced_tweets.retweeted.id']
    contains_quote = tweet['referenced_tweets.quoted.id']
    contains_reply = tweet['referenced_tweets.replied_to.id']
    
    # Does tweet contain retweet info? (Is this category not NA or empty?)
    if pd.notna(contains_retweet):
        return "retweet"
    # Does tweet contain quote and reply info?
    elif pd.notna(contains_quote) and pd.notna(contains_reply):
        return "quote/reply"
    # Does tweet contain quote info? 
    elif pd.notna(contains_quote):
        return "quote"
    # Does tweet contain reply info? 
    elif pd.notna(contains_reply):
        return "reply"
    # If it doesn't contain any of this info, it must be an original tweet
    else:
        return "original"

# Make Tweet URL
def make_tweet_url(tweets):
    # Get username (look up by column label rather than by position)
    username = tweets['author.username']
    # Get tweet ID
    tweet_id = tweets['id']
    # Make tweet URL
    tweet_url = f"https://twitter.com/{username}/status/{tweet_id}"
    return tweet_url

In [37]:
# Create tweet type column
my_tweets_df['type'] = my_tweets_df.apply(find_type, axis=1)
# Create tweet URL column
my_tweets_df['tweet_url'] = my_tweets_df[['author.username', 'id']].apply(make_tweet_url, axis='columns')
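
For larger collections, applying a function row by row can be slow. As an alternative (not the approach used in this workbook), the sketch below reproduces the same tweet-type rules with `numpy.select`, which evaluates the conditions in order for the whole column at once. The toy DataFrame and its ID values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each case: retweet, quote/reply, quote, reply, original
toy_df = pd.DataFrame({
    "referenced_tweets.retweeted.id": [111, np.nan, np.nan, np.nan, np.nan],
    "referenced_tweets.quoted.id": [np.nan, 222, 333, np.nan, np.nan],
    "referenced_tweets.replied_to.id": [np.nan, 444, np.nan, 555, np.nan],
})

is_retweet = toy_df["referenced_tweets.retweeted.id"].notna()
is_quote = toy_df["referenced_tweets.quoted.id"].notna()
is_reply = toy_df["referenced_tweets.replied_to.id"].notna()

# Conditions are checked in order, so the first match wins (as with if/elif)
toy_df["type"] = np.select(
    [is_retweet, is_quote & is_reply, is_quote, is_reply],
    ["retweet", "quote/reply", "quote", "reply"],
    default="original",
)

print(toy_df["type"].tolist())
```

Because the order of the conditions matters, the combined quote-and-reply check has to come before the separate quote and reply checks.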

Let's select and rename only the columns we're interested in.

Pick another column that you find intriguing and add it to the list below!

In [None]:
# Select columns of interest
my_clean_tweets_df = my_tweets_df[['created_at', 'author.username', 'author.name', 'author.description',
                             'author.verified', 'type', 'text', 'public_metrics.retweet_count', 
                             'public_metrics.like_count', 'public_metrics.reply_count', 'public_metrics.quote_count',
                             'tweet_url', 'lang', 'source', 'geo.full_name']]

# Rename columns for convenience
my_clean_tweets_df = my_clean_tweets_df.rename(columns={'created_at': 'date', 'public_metrics.retweet_count': 'retweets', 
                          'author.username': 'username', 'author.name': 'name', 'author.verified': 'verified', 
                          'public_metrics.like_count': 'likes', 'public_metrics.quote_count': 'quotes', 
                          'public_metrics.reply_count': 'replies', 'author.description': 'user_bio'})

my_clean_tweets_df

Let's get an overview of some of these columns.

In [None]:
my_clean_tweets_df['type'].value_counts()

In [None]:
# Create stopwords list (set.update() returns None, so build a new set with union)
stopwords = STOPWORDS.union(["plums", "icebox"])

# Set up wordcloud
wc = WordCloud(background_color="white", max_words=50,
               stopwords=stopwords, contour_width=5, contour_color='steelblue')

# Strip line breaks (both real newlines and literal "\n" sequences)
tweet_texts = my_clean_tweets_df['text'].str.replace(r"\\n|\n", " ", regex=True)
# Join all tweet texts together
tweet_texts = ' '.join(tweet_texts)
# Generate word cloud
wc.generate(tweet_texts)

# Create and save word cloud
plt.figure(figsize = (10,5))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.savefig("my_tweet_word-cloud.png", dpi=300)
plt.show()
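
To check which words a stopword list filters out, it can help to count words directly. The sketch below is a rough stand-in using only the standard library. It is not how `wordcloud` counts words internally, and the tiny stopword list and example texts are made up.

```python
import re
from collections import Counter

# Tiny made-up base list standing in for wordcloud's STOPWORDS
base_stopwords = {"the", "in", "and", "is"}
# union returns a new set and leaves base_stopwords unchanged
stopwords = base_stopwords.union(["plums", "icebox"])

toy_texts = [
    "Plums in the icebox again",
    "So sweet and so cold",
    "The icebox is cold",
]

# Join the texts, lowercase them, and split into simple word tokens
all_text = " ".join(toy_texts).lower()
words = re.findall(r"[a-z']+", all_text)

# Count only the words that are not stopwords
counts = Counter(w for w in words if w not in stopwords)

print(counts.most_common())
```

Words in the stopword list never reach the counter, which is why the query terms themselves do not dominate the result.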