# Week 3: the Twitter API

In this lesson, we're going to learn how to analyze and explore Twitter data with the Python/command line tool [twarc](https://twarc-project.readthedocs.io/en/latest/). We're specifically going to work with [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2/), which is designed for version 2 of the Twitter API (released in 2020) and the Academic Research track of the Twitter API (released in 2021), which enables researchers to collect tweets from the entire Twitter archive for free.

Twarc was developed by a project called [Documenting the Now](https://www.docnow.io/). The DocNow team develops tools and ethical frameworks for social media research.

> This notebook builds upon the work of [Melanie Walsh](https://melaniewalsh.github.io/Intro-Cultural-Analytics/), released under a [Creative Commons BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) License. This notebook shares the same license.

<div class="alert-warning alert">

**Note1:** The workflow throughout this exercise differs in one major way, because of how `twarc2` is designed: it is a command-line tool that needs to be run from a terminal shell. In a Jupyter notebook, we can run these shell commands directly from the notebook by prefixing them with `!`.

For example, you might have used the `ls` command in the terminal to list the contents of a directory. Try running `!ls` from this notebook now.
</div>



In [None]:
!ls

week-3-twitter.ipynb  week-3-twitter-solutions.ipynb
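For readers curious about what the `!` prefix does, the sketch below runs the same `ls` command from plain Python with the standard `subprocess` module. It is only an illustration of the idea and assumes a Unix-like system where `ls` is available; it is not how Jupyter implements `!` internally.

```python
import subprocess

# Run `ls` in the current directory and capture what it prints.
result = subprocess.run(["ls"], capture_output=True, text=True, check=True)

# Each line of the captured output is one file or folder name.
listing = result.stdout.splitlines()
print(listing)
```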


<div class="alert-warning alert">

**Note2:** The commands that we run using `!twarc2` will result in a file being saved in the local directory. You can check this file through the file browser towards the left hand side. We will use `pandas` to read the resulting file. 

</div>

### Workflow

We will need to install three command line tools:
- the program `twarc2` to access the Twitter API, in order to archive tweets, counts, and user metadata.
- `twarc-csv`, a plugin to convert the tweets archived by `twarc2` into a CSV format.
- `twarc-hashtags`, a plugin to analyze hashtag counts in the tweets retrieved by `twarc2`.

Note: the program `twarc2` requires the Python package `twarc` (with no 2). The number ‘2’ is for the Twitter API v2.


In [None]:
%%capture
!pip install twarc --upgrade
!pip install twarc-csv --upgrade
!pip install twarc-hashtags --upgrade

## Configure and set up twarc2

Once twarc2 is installed, you need to configure it with your API keys and/or bearer token so that you can actually access the API.

If you are running this notebook in DeepNote, select the terminal icon on the left-hand side. If you are running this notebook on your local computer or on a server, you will need to open a terminal.

Run the following command in the terminal:
```
twarc2 configure

```


Twarc will ask for your bearer token, which you can copy and paste into the blank after the colon, and then press enter. You can optionally enter your API keys, as well.

If you’ve entered your information correctly, you should get a congratulatory message that looks something like this:

```
Your keys have been written to /Users/<your username>/Library/Application Support/twarc/config

✨ ✨ ✨  Happy twarcing! ✨ ✨ ✨

```

<br />

Now you’re ready to collect and analyze tweets!

## Archive tweets matching a query

To collect tweets from the Twitter API, we need to form a query and ask twarc2 to download all the tweets that match it: `twarc2 search *query*`. The simplest kind of query is a keyword search, such as the phrase “Oxford Internet Institute,” which should return any tweet that contains all of these words in any order — `twarc2 search "Oxford Internet Institute"`.

<div class="alert-info alert">

**Exercise 0.1:** Retrieve the tweets that contain the keywords (in any order) "Oxford Internet Institute" and save them to `tweets-oii.jsonl`

</div>


To output Twitter data to a file, we include a filename with the “.jsonl” file extension, which stands for JSON lines, a special kind of JSON file:

In [None]:
!twarc2 search "Oxford Internet Institute" tweets-oii.jsonl


👋  Hi I don't see a configuration file yet, so let's make one.

Please follow these steps:

1. visit https://developer.twitter.com/en/portal/
2. create a project and an app
3. go to your Keys and Tokens and generate your keys

Please enter your Bearer Token (leave blank to skip to API key configuration): 

KernelInterrupted: Execution interrupted by the Jupyter kernel.
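To make the JSON lines format concrete, here is a minimal sketch that parses a tiny, made-up `.jsonl` snippet with Python's standard `json` module. The two records and their fields (`id`, `text`) are invented for illustration; the lines that twarc2 actually writes are larger and contain more structure.

```python
import io
import json

# Two made-up records, one JSON document per line (hypothetical content).
sample = io.StringIO(
    '{"id": "1", "text": "Hello from Oxford"}\n'
    '{"id": "2", "text": "Internet Institute seminar today"}\n'
)

# A JSON lines file is read line by line; each line is parsed on its own.
records = [json.loads(line) for line in sample if line.strip()]

for record in records:
    print(record["id"], record["text"])
```

Replacing `sample` with an opened file, for instance `open("tweets-oii.jsonl")`, would apply the same loop to a file on disk.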

<div class="alert-info alert">

**Exercise 0.2:** Convert these tweets into a `csv` file named `tweets-oii.csv`.

Use `pandas` in Python to load that CSV file and count the number of tweets returned by the search.
</div>


In [None]:
!twarc2 csv tweets-oii.jsonl tweets-oii.csv

In [None]:
import pandas as pd 
tweets = pd.read_csv("tweets-oii.csv")

tweets.head()

<div class="alert-info alert">

**Exercise 0.3:** Next, we are going to rename a number of columns to make the data more readable. Look at the columns and find which ones represent `date of the tweet`, `number of retweets`, `number of likes`, `number of quotes`, `number of replies`, `author twitter handle`, `author name`, `author's twitter bio`, `text of the tweet`, `whether the author is verified or not`.

</div>

In [None]:
print(tweets.columns)

In [None]:
def rename_dataframe_tweets(dataframe):
    """ Rename the columns of a dataset archived by twarc2. """
    return dataframe.rename(columns={
        'created_at': 'date',
        'public_metrics.retweet_count': 'retweets', 
        'author.username': 'username', 
        'author.name': 'name',
        'author.verified': 'verified', 
        'public_metrics.like_count': 'likes', 
        'public_metrics.quote_count': 'quotes', 
        'public_metrics.reply_count': 'replies',
        'author.description': 'user_bio'
    })

tweets = rename_dataframe_tweets(tweets)

# note we are keeping the column `text` as it is. 
tweets = tweets[['date', 'retweets', 'username', 'name', 'verified', 'likes', 'quotes', 'replies', 'user_bio', 'text']]

Now we can view our more focused DataFrame!

In [None]:
tweets.head()
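The original column names, such as `public_metrics.like_count`, contain dots because the tweet data is nested: a `like_count` field sits inside a `public_metrics` object. The sketch below shows one simple way to flatten such nesting into dotted names. The `flatten` helper and the sample tweet are hypothetical and only illustrate the naming pattern; this is not the code that `twarc-csv` uses.

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# A made-up tweet with the same nesting pattern as the real data.
tweet = {
    "text": "An example tweet",
    "public_metrics": {"like_count": 3, "retweet_count": 1},
    "author": {"username": "example_user", "verified": False},
}

flat_tweet = flatten(tweet)
print(sorted(flat_tweet))
```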

<div class="alert-info alert">

**Exercise 0.4:** Let's have a look at the text of these tweets. Print 10 tweets from the dataframe. 

</div>

In [None]:
for t in tweets.text[:10]:
    print(t)

<div class="alert-info alert">

**Exercise 0.5:** Analyze the hashtags in the above tweets and save the resulting hashtag counts in a CSV file named `hashtags-oii.csv`. Read this CSV file with `pandas` and check the counts of these hashtags.

</div>

In [None]:
!twarc2 hashtags tweets-oii.jsonl hashtags-oii.csv

In [None]:
hashtags_oii = pd.read_csv('hashtags-oii.csv')
hashtags_oii
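Conceptually, a hashtag summary is a frequency count. The sketch below counts hashtags in a handful of made-up tweet texts using `collections.Counter` and a simple regular expression. Both the sample texts and the regex-based extraction are assumptions made for illustration; the `twarc-hashtags` plugin may identify and normalise hashtags differently.

```python
import re
from collections import Counter

# Made-up tweet texts (hypothetical data).
texts = [
    "New seminar at the #OII on #AI governance",
    "Reading list for #ai and society",
    "Welcome to the new cohort! #OII",
]

# Extract words that follow a '#', lower-casing them so #AI and #ai match.
hashtag_counts = Counter(
    tag.lower() for text in texts for tag in re.findall(r"#(\w+)", text)
)

for tag, count in hashtag_counts.most_common():
    print(tag, count)
```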

## Get Tweets (Academic Track, Full Twitter Archive)

So far we have been querying the standard Twitter API that limits the search results to the past 7 days. In order to search the full Twitter archive, we only need to add a single flag of `--archive` to our `twarc2 search` command so that the command now looks like `twarc2 search *query* --archive filename`. 

<div class="alert-info alert">

**Exercise 1:** Now repeat Exercises 0.1, 0.2, and 0.3 by running the search query on the entire archive.

For the sake of clarity, add `-archive` to the name of the .jsonl output file. For example, save the tweets in `tweets-oii-archive.jsonl`.

</div>




In [None]:
# To limit the time your query will take to run, let's add --start-time 2020-01-01 so that we only collect tweets from 2020 onwards.

!twarc2 search --archive --start-time 2020-01-01 "Oxford Internet Institute" tweets-oii-archive.jsonl 
!twarc2 csv tweets-oii-archive.jsonl tweets-oii-archive.csv

In [None]:
tweets = pd.read_csv("tweets-oii-archive.csv")

tweets = rename_dataframe_tweets(tweets)  # We reuse the function we defined above

tweets = tweets[['date', 'retweets', 'username', 'name', 'verified', 'likes', 'quotes', 'replies', 'user_bio']]
print(f"We collected {tweets.shape[0]} tweets!")

In [None]:
!twarc2 hashtags tweets-oii-archive.jsonl hashtags-oii-archive.csv

In [None]:
hashtags_oii = pd.read_csv('hashtags-oii-archive.csv')
hashtags_oii

## Advanced Search of Twitter API





The Twitter API is very complex and not everything can be taught in this class. Our aim is to show you the basics and to get you comfortable with constructing more complex queries.

There are many other operators that we can add to a query, which allow us to collect tweets only from specific Twitter users or locations, or to only collect tweets that meet certain conditions, such as containing an image or being authored by a verified Twitter user. Here’s a table of the main search operators taken from the Twitter documentation (https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list). If you're interested, look also at:

- Twarc's own documentation: https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/
- Some advanced examples from `twitterdev` Github account: https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/5-how-to-write-search-queries.md

<hr />
<br />

| Search Operator      | Explanation | Example |
|:---------------------|:------------|:------- |
| `keyword`              | Matches a keyword within the body of a Tweet. |`so sweet and so cold` |                    
| `"exact phrase match"` | Matches the exact phrase within the body of a Tweet. | `"so sweet and so cold" OR "plums in the icebox"` |
| `-`                    | Do NOT match a keyword or operator | `birthday -happy`, `oxford -university` |
| `#`                    | Matches any Tweet containing a recognized hashtag.  | `#arthistory` |
| `from:`, `to:`         | Matches any Tweet from or to a specific user. | `from:cynddl` `to:cynddl` |
| `place:`               | Matches Tweets tagged with the specified location or Twitter place ID. | `place:"new york city" OR place:london` |
| `is:reply`, `is:quote` | Returns only replies or quote tweets. | `thank you is:reply`, `from:OxfordUCU is:quote` |
| `is:verified`          | Returns only Tweets whose authors are verified by Twitter. | `buy NFT is:verified` |
| `has:media`            | Matches Tweets that contain a media object, such as a photo, GIF, or video, as determined by Twitter. | `My new song is out has:media` |
| `has:images`, `has:videos` | Matches Tweets that contain a recognized image (`has:images`) or a video (`has:videos`). | `the view from my window has:images` |
| `has:geo`              | Matches Tweets that have Tweet-specific geolocation data provided by the Twitter user. | `honeymoon has:geo` |



### Query construction

- To construct a query of two keywords with **logical AND**, use a space. So if the search terms are `corona` and `coronavirus`, our query will be `"corona coronavirus"`
- To construct a query of two keywords with **logical OR**, we use OR as a keyword. So if the search term is `corona` or `coronavirus`, our query will be `"corona OR coronavirus"`
- To construct a query that **matches an exact phrase**, we enclose that phrase in quotation marks. So if the search term is `covid` OR `corona virus`, our query will be `'covid OR "corona virus"'`. **NOTE** how we have enclosed the query in single quotes because the inner phrase needs to use double quotes.
- To construct a **complex query** which has `keyword1` and either `keyword2` or `keyword3`, we can enclose the OR condition in parentheses. Thus, our query will be `"keyword1 (keyword2 OR keyword3)"`.
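
The rules above can be sketched as a few small Python helpers that build query strings. The helper names (`phrase`, `any_of`, `all_of`) are hypothetical and are not part of twarc; they only assemble text that you could then pass to `twarc2 search`.

```python
def phrase(words):
    """Wrap a multi-word phrase in double quotes for an exact match."""
    return f'"{words}"'


def any_of(*terms):
    """Join terms with OR and wrap them in parentheses (logical OR)."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Join terms with spaces (logical AND)."""
    return " ".join(terms)


# corona AND coronavirus
print(all_of("corona", "coronavirus"))
# covid OR the exact phrase "corona virus"
print(any_of("covid", phrase("corona virus")))
# keyword1 AND (keyword2 OR keyword3)
complex_query = all_of("keyword1", any_of("keyword2", "keyword3"))
print(complex_query)
```

Wrapping every OR group in parentheses is a deliberate choice in this sketch: it keeps the grouping explicit when the group is later combined with other terms.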

### Filters

The table above presented a few filters you can add to your queries. For a full list, please refer to the API documentation. For instance:

- To filter the tweets by authors who are verified by Twitter (journalists, artists, politicians, etc.), add `is:verified` to your query. For example, to search for `keyword1` only from authors who are verified, our query will be `"keyword1 is:verified"`
- To filter by tweets that have some form of media, our query will look like `"keyword1 has:media"`

### Other command-line options

There are two options that are specific to `twarc`, so you will have to add them to your command rather than to your query.

- **Search limit:** if you want to limit your query to just 500 tweets, the command will look like `twarc2 search "query" --limit 500 output.jsonl`

- **Time range:** if you want to limit your search results to a time range, the command will look like `twarc2 search --start-time 2014-07-17 --end-time 2014-07-24 "query" output.jsonl`
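
To see how the query, the filters and the command-line options fit together, the following sketch assembles a complete `twarc2 search` command as a list of arguments and prints it in a form you could paste into a terminal. The `build_search_command` helper is hypothetical; it only uses the `--limit`, `--start-time` and `--end-time` options described above, and it prints the command rather than running it.

```python
import shlex


def build_search_command(query, output, limit=None, start_time=None, end_time=None):
    """Assemble the arguments of a `twarc2 search` command (without running it)."""
    command = ["twarc2", "search"]
    if limit is not None:
        command += ["--limit", str(limit)]
    if start_time is not None:
        command += ["--start-time", start_time]
    if end_time is not None:
        command += ["--end-time", end_time]
    # The query is a single argument, even though it contains spaces.
    command += [query, output]
    return command


command = build_search_command(
    "keyword1 is:verified has:media",
    "output.jsonl",
    limit=500,
    start_time="2014-07-17",
    end_time="2014-07-24",
)

# shlex.join quotes the query so the shell treats it as one argument.
print(shlex.join(command))
```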


Let's get used to all this through an example. 

<div class="alert-info alert">

**Exercise 2.1:** We will search for posts related to the Russian invasion of Ukraine and the conflict in Ukraine. Since the volume of tweets containing `Ukraine` or `Russia` is too large to handle within the duration of the class, we will narrow down our search using keywords and filters.

Let's construct a query with the following specifications:
- contains exact words/phrases: either `Russia` or `Россия` and either `Ukraine` or `Украина`
- is tweeted only by verified authors
- contains hashtags
- has media
- is not a retweet (`-is:retweet`).

We will further limit our search results based on the following criteria:
- Maximum 50,000 tweets
- Start date: 2022-02-27
- End date: 2022-02-28

Store the output in `tweets-ukr-rus.jsonl`.
</div>


In [None]:
!twarc2 search --archive --limit 50000 --start-time 2022-02-27 --end-time 2022-02-28 "(Russia OR Россия) (Ukraine OR Украина) is:verified -is:retweet has:media has:hashtags" tweets-ukr-rus.jsonl

<div class="alert-info alert">

**Exercise 2.2:** Write the `twarc2` command to convert the `jsonl` file to a CSV file.
</div>

In [None]:
!twarc2 csv tweets-ukr-rus.jsonl tweets-ukr-rus.csv

<div class="alert-info alert">

**Exercise 2.3:** Read the above CSV file using pandas

- Look at the columns and find which columns represent `date of the tweet`,`number of retweets`, `number of likes`, `number of quotes`,  `number of replies`, `author twitter handle`, `author name`, `author's twitter bio`, `whether the author is verified or not`
- Let's rename to `date`, `retweets`, `likes`, `quotes`, `replies`, `username`, `name`, `user_bio`, `verified`.
- Let's keep only these columns as well as the main text of the tweet.

</div>

In [None]:
tweets = pd.read_csv("tweets-ukr-rus.csv")

In [None]:
tweets = rename_dataframe_tweets(tweets)  # We reuse the function we defined above

tweets = tweets[['date', 'retweets', 'username', 'name', 'verified', 'likes', 'quotes', 'replies', 'user_bio', 'text']].copy()

tweets.head()

<div class="alert-info alert">

**Exercise 2.4:**  What are the most popular tweets? Let's sort them:

- in descending order of the number of  likes;
- in descending order of the number of  retweets.

</div>

In [None]:
tweets.sort_values("likes", ascending=False)

In [None]:
tweets.sort_values("retweets", ascending=False)

<div class="alert-info alert">

**Exercise 2.5:**  Let's plot the hourly frequency of tweets.

- Check the type of the elements in the `date` column. Are they `str` or `datetime` objects? If they are `str`, we need to convert them to `datetime` objects so that we can do the necessary aggregation. We will use `pd.to_datetime` to do that.
- We will use a handy functionality in pandas to do a frequency count on datetime objects. To do that, we need to set the index of dataframe as `date` column.
- We will use `resample` functionality of `pandas` to resample the datetime index into bins according to the specification. For example, if you want to plot daily frequency, pass the argument `'D'`, and if you want to plot hourly frequency, pass the argument `'H'`. Call `.resample('H')` on the dataframe obtained above.
- In order to aggregate these results, we will use `.size()` function on the resampled dataframe.
- Finally, we will plot the resulting counts using `.plot()`, a handy pandas method that calls the matplotlib library.

</div>



In [None]:
# What does the date column contain? A series of string objects.
print(tweets['date'].loc[0])
print(type(tweets['date'].loc[0])) 

In [None]:
# Let's convert it to a pandas datetime series:
tweets['date'] = pd.to_datetime(tweets['date'])

In [None]:
# We convert the dataframe to have the date column as an ‘index’
tweets_by_date = tweets.copy()
tweets_by_date.set_index('date', inplace=True)

# We now make use of pandas' advanced timeseries processing tools: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
hourly_tweets = tweets_by_date.resample('H')
hourly_frequency_tweets = hourly_tweets.size()

In [None]:
hourly_frequency_tweets.plot(title="Hourly volume of tweets", rot=45);
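
The steps above can be tried on a small, made-up dataset before running them on real tweets. The sketch below builds a toy DataFrame with hypothetical timestamps and like counts, converts the strings to datetimes, sets them as the index, and counts rows per hour. It uses the frequency string `'60min'` rather than `'H'`, on the assumption that it is accepted across pandas versions; recent pandas releases prefer the lower-case alias `'h'` and may warn about `'H'`.

```python
import pandas as pd

# Made-up tweets: timestamps as strings, as they would appear in a CSV.
toy = pd.DataFrame({
    "date": [
        "2022-02-27T10:05:00Z",
        "2022-02-27T10:40:00Z",
        "2022-02-27T11:15:00Z",
        "2022-02-27T13:30:00Z",
    ],
    "likes": [10, 2, 7, 1],
})

# Convert strings to datetimes, then use the dates as the index.
toy["date"] = pd.to_datetime(toy["date"])
toy_by_date = toy.set_index("date")

# Count the number of rows (tweets) falling into each one-hour bin.
hourly_counts = toy_by_date.resample("60min").size()
print(hourly_counts)
```

Hours without any tweets are kept as bins with a count of zero, which is what makes the resulting series suitable for a time-series plot.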

<div class="alert-warning alert">

Although pandas takes time to learn, you eventually save a lot of time thanks to the advanced functions it provides.

This **resampling technique is easy to transfer** to other kinds of plots:
- you can adjust the window from years to minutes. For instance, resampling with the rule `'D'` returns days, `'10T'` returns bins of 10 minutes…
- you can use it to compute statistics on any column, such as the average number of likes.

</div>

In [None]:
hourly_tweets = tweets_by_date.resample('H')
hourly_tweets.likes.mean().plot(title="Hourly average number of likes", rot=45);

<div class="alert-info alert">


**Exercise 2.7:** Plot the volume of retweets every 30 minutes.

</div>

In [None]:
tweets_by_date.resample('30T').retweets.sum().plot(title="# of retweets every 30 minutes", rot=45)

<div class="alert-info alert">

**Exercise 2.8:** Use `twarc2` to output a summary of the hashtags in the retrieved tweets and save the final output in CSV format as `hashtags-ukr-rus.csv`

</div>

In [None]:
!twarc2 hashtags tweets-ukr-rus.jsonl hashtags-ukr-rus.csv

<div class="alert-info alert">

**Exercise 2.9:** Read the above CSV file and retrieve the top 10 hashtags from it.

</div>

In [None]:
hashtags_ukraine = pd.read_csv("hashtags-ukr-rus.csv")
hashtags_ukraine.columns

In [None]:
# Sort by number of tweets and keep only the 10 most frequent hashtags
hashtags_ukraine.sort_values("tweets", ascending=False).head(10)

## Optional: archiving only counts of tweets

Some of the queries we tested above can take a long time to run. If you only want to collect aggregated counts of tweets, disaggregated by day for instance, there's an easier way.

<div class="alert alert-info">

**Exercise 3.1:** Execute the twarc2 command below to count the number of tweets that match a given query.

We will directly obtain the counts by day, in a CSV format.
</div>


**Note:** To call `twarc` for counts, we execute the following command:

```
!twarc2 counts "query" --csv --granularity day > filename.csv
```

<br />

Here, `--csv` specifies the desired output format, `--granularity` specifies the aggregation level, and
the shell operator `>` redirects the output of the command on its left to the file named on its right.

In [None]:
!twarc2 counts '(Russia OR Россия) (Ukraine OR Украина) is:verified has:media has:hashtags' --csv --granularity day > counts-ukr-rus.csv
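
The `>` redirection is a feature of the shell, not of twarc. As a rough Python analogy, the sketch below runs a command with `subprocess` and sends its standard output to a file. To keep the example runnable without Twitter credentials, it runs a stand-in command (a one-line Python script that prints CSV text) instead of `twarc2 counts`; the file name `example-counts.csv` and the values in it are made up.

```python
import subprocess
import sys

# Stand-in for `twarc2 counts ... --csv`: a command that prints CSV text.
command = [sys.executable, "-c", "print('start,end,day_count'); print('a,b,3')"]

# Equivalent of `command > example-counts.csv`: stdout goes to the file.
with open("example-counts.csv", "w") as output_file:
    subprocess.run(command, stdout=output_file, check=True)

# Read the file back to confirm that the output was saved.
with open("example-counts.csv") as saved:
    content = saved.read()
print(content)
```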

<div class="alert-info alert">

**Exercise 3.2:** Now use pandas to read the CSV file that you saved above. 

- How many columns are there? What do these columns mean?
- Plot the daily counts using pandas plot function : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

</div>

In [None]:
counts_ukr_rus = pd.read_csv("counts-ukr-rus.csv")
print(counts_ukr_rus.columns)

In [None]:
counts_ukr_rus.head()

In [None]:
# Let's convert the start and end series to datetime
counts_ukr_rus['start'] = pd.to_datetime(counts_ukr_rus['start'])
counts_ukr_rus['end'] = pd.to_datetime(counts_ukr_rus['end'])

In [None]:
counts_ukr_rus.plot(y="day_count", x="start", rot=45, marker='x', legend=False, ylabel="Daily tweet volume", xlabel="Day");

<div class="alert-warning alert">

**For more examples** of accessing data from the Twitter API, check out `twarc`'s documentation: https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/

You can notably use it to access the followers and friends (the accounts a user follows) of a given user.

If you're interested in mapping networks, have a look at another plugin, twarc-network: https://github.com/DocNow/twarc-network

</div>

## This week's datasheet questions

Throughout this course, we will aim to build on the practice of documenting our datasets, using the Datasheet for Datasets framework (here is an <a href="https://github.com/zykls/folktables/blob/main/datasheet.md">example of a datasheet</a>). In the notebook for this week, you designed a small dataset of tweets related to the conflict in Ukraine. Let's assume you plan to release this dataset online.

How would you structure your Datasheet for this small dataset? For this week's homework, please answer the following questions:

>**How was the data associated with each instance acquired?** Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

...

>**If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?**

...

>**Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)?** If not, please describe the timeframe in which the data associated with the instances was created.

...

>**Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?** If so, please describe why.

...

>**Did the individuals in question consent to the collection and use of their data?** If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

...

