# Collecting Twitter Data

[Download relevant files here](https://melaniewalsh.org/Collecting-Twitter-Data.zip) or run `git pull` from the command line in the "Intro-Cultural-Analytics-Notebooks" directory.

<img src="https://cfcdnpull-creativefreedoml.netdna-ssl.com/wp-content/uploads/2017/06/Twitter-featured.png" width=100%>

In this lesson, we're going to learn how to collect Twitter data with the Python/command line tool [twarc](https://github.com/DocNow/twarc). This tool was developed by a project called [Documenting the Now](https://www.docnow.io/). The DocNow team develops tools and ethical frameworks for social media research.

Because twarc relies on Twitter's API, we need to apply for a Twitter developer account and create a Twitter application before we use it. You can find instructions for the application process and for installing and configuring twarc here: [Twitter Collection Setup](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Collecting-Cultural-Data/Twitter-Collection-Setup.html).

# Twitter API (Free Version)

With the free version of the Twitter API, there are basically two ways to collect your own Twitter data: in real time or from roughly the past 7 days. Collecting data from any further in the past requires a paid version of the Twitter API. Twarc allows you to collect tweets both in real time and from the past ~7 days.

# Collect Tweets in Real Time

## Twarc From the Command Line

The easiest way to collect tweets with twarc is to use the command line. To collect tweets in real time, you can use the command `twarc filter`, followed by a search query, then the output operator `>` and a filename of your choosing with the ".jsonl" file extension (which outputs your Twitter data to this JSONL file).

`twarc filter "search term" > my_file.jsonl`

For example, to collect tweets in real time that include the word "coronavirus," you would run:

`twarc filter "coronavirus" > coronavirus_filter.jsonl`

### Starting and Stopping Twarc

If you run `twarc filter` from your command line, `twarc` will keep running until you explicitly stop the process. You can stop a process on the command line by typing `Ctrl + C`.

As you may recall, we can run command line functions in Jupyter notebooks by putting an exclamation point `!` at the beginning of a cell. For some reason, however, `!twarc filter` and `!twarc search` don't play very well in Jupyter notebooks (or at least they don't play well consistently). Sometimes when you start running them, they won't stop—even when you hit the stop button or try to interrupt the kernel (the equivalent of `Ctrl + C`).

Because of this unpredictability, I strongly recommend that you open your Terminal or PowerShell and experiment with the twarc code below by copying it and pasting it into your command line, where you can more easily stop the processes.

To give just a taste of twarc in this notebook and to make sure it stops properly, however, we're going to add the `timeout` command, which will stop the process after a certain length of time (seconds `s`, minutes `m`, or hours `h`). Unfortunately, `timeout` only works for Unix command lines (Mac/Chrome OS).

Run a live collection of tweets that include the word "coronavirus" for 10 seconds:

In [35]:
!timeout 10s twarc filter "coronavirus" > coronavirus_filter.jsonl

Run a live collection of tweets that include the word "Shakespeare" for 10 seconds:

In [33]:
!timeout 10s twarc filter "Shakespeare" > shakespeare_filter.jsonl
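
If `timeout` isn't available on your command line, one possible workaround is to let Python enforce the time limit instead. The sketch below is only an illustration: `run_with_time_limit` is a hypothetical helper name (it is not part of twarc), and it relies on the `timeout` argument of `subprocess.run` from Python's standard library.

```python
import subprocess

def run_with_time_limit(command, output_filename, seconds):
    """Run a command, save its output to a file, and stop it after `seconds`.

    Returns True if the command finished on its own and False if it was
    stopped because the time limit ran out.
    """
    # Open in binary mode because the command writes raw bytes to the file
    with open(output_filename, "wb") as outfile:
        try:
            # subprocess.run() stops the command once the timeout expires
            subprocess.run(command, stdout=outfile, timeout=seconds)
        except subprocess.TimeoutExpired:
            return False
    return True
```

Assuming the `twarc` command is installed and configured, a call such as `run_with_time_limit(["twarc", "filter", "coronavirus"], "coronavirus_filter.jsonl", 10)` should play roughly the same role as the `timeout 10s` cells above.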

## Twarc From Python/Jupyter Notebooks

Though I recommend collecting tweets from the command line, you can also use twarc as a Python library and run it in a Jupyter notebook. To import twarc, run `from twarc import Twarc` (as in the cell below). We're also going to import Python's built-in `json` library to help us output a JSON file.

In [42]:
from twarc import Twarc
import json

To use Twarc as a Python library, you'll once again need to configure twarc with your [API keys](https://developer.twitter.com/en/apps) (\*sigh\*). Copy and paste them into the quotation marks below.

In [44]:
consumer_key= ""
consumer_secret = ""
access_token = ""
access_token_secret= ""

Quick tip! If you've already set up your Twitter API keys with `twarc configure`, you can find your API keys by running `open ~/.twarc` (Mac/Chrome OS) or `Invoke-Item ~/.twarc` (Windows) from the command line:

 Mac/Chrome OS

In [1]:
!open ~/.twarc

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Windows_logo_-_2012_derivative.svg/1024px-Windows_logo_-_2012_derivative.svg.png width=20 align='left'> Windows 

In [None]:
!Invoke-Item ~/.twarc

These commands will open the ".twarc" file that stores your API keys, and you can simply copy and paste the correct information into the variables in the cell above.

### Configure Twarc

In [45]:
twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

Below I've written a Python function called `collect_live_tweets()` that uses `twarc.filter()`. This function accepts a search query, the number of tweets that you want to collect, and a filename with a .jsonl extension. This function will output your Twitter data to a file with this filename.

### Make Live Tweet Collection Function

In [63]:
def collect_live_tweets(search_query, number_of_desired_tweets, filename):
    # Connect to the Twitter API with the keys defined above
    twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

    tweet_count = 0
    with open(filename, 'w', encoding='utf-8') as outfile:
        # twarc.filter() yields matching tweets as they are posted
        for tweet in twarc.filter(search_query):
            # Write each tweet as one line of JSON
            json.dump(tweet, outfile)
            outfile.write('\n')
            tweet_count += 1
            # Stop as soon as we have collected enough tweets
            if tweet_count >= number_of_desired_tweets:
                break

### Run Live Tweet Collection Function

In [52]:
collect_live_tweets("coronavirus", 100, "coronavirus_filter.jsonl")

## Check Number of Tweets Collected

 Mac/Chrome OS

In [None]:
!wc -l coronavirus_filter.jsonl

In [None]:
!wc -l shakespeare_filter.jsonl

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Windows_logo_-_2012_derivative.svg/1024px-Windows_logo_-_2012_derivative.svg.png width=20 align='left'> Windows 

In [None]:
!find /v /c "" coronavirus_filter.jsonl

In [None]:
!find /v /c "" shakespeare_filter.jsonl
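
Because a JSONL file stores one tweet per line, you can also count tweets with a few lines of Python, which should work the same way on any operating system. This is a minimal sketch, and `count_tweets` is a hypothetical helper name rather than anything provided by twarc.

```python
def count_tweets(jsonl_filename):
    """Count the non-empty lines (one tweet per line) in a JSONL file."""
    with open(jsonl_filename, encoding="utf-8") as infile:
        # Skip blank lines so a trailing newline does not inflate the count
        return sum(1 for line in infile if line.strip())
```

For example, `count_tweets("coronavirus_filter.jsonl")` would return the number of tweets saved in that file.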

# Collect Tweets From Past 7 Days

## Twarc From the Command Line

To collect tweets from approximately 7 days in the past, you can use the command `twarc search`, followed by a search query, then the output operator `>` and a filename of your choosing with the ".jsonl" file extension (which outputs your Twitter data to this JSONL file).

Run a collection of tweets from the past ~7 days that include the word "coronavirus" for 10 seconds:

In [49]:
!timeout 10s twarc search "coronavirus" > coronavirus_search.jsonl 

Run a collection of tweets from the last ~7 days that include the word "Shakespeare" for 10 seconds:

In [8]:
!timeout 10s twarc search "Shakespeare" > shakespeare_search.jsonl 


## Twarc From Python/Jupyter Notebooks

Below I've written a Python function called `collect_past_tweets()` that uses `twarc.search()`. This function accepts a search query, the maximum number of tweets that you want to collect, and a filename with a .jsonl extension. This function will output your Twitter data to a file with this filename. 

### Make Past Tweet Collection Function

In [57]:
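def collect_past_tweets(search_query, number_of_max_tweets, filename):
    # Connect to the Twitter API with the keys defined above
    twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

    tweet_count = 0
    with open(filename, 'w', encoding='utf-8') as outfile:
        # twarc.search() yields matching tweets from the past ~7 days
        for tweet in twarc.search(search_query):
            # Write each tweet as one line of JSON
            json.dump(tweet, outfile)
            outfile.write('\n')
            tweet_count += 1
            # Stop once we reach the maximum number of tweets
            if tweet_count >= number_of_max_tweets:
                break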

### Run Past Tweet Collection Function

In [60]:
collect_past_tweets("coronavirus", 1000, "coronavirus_search.jsonl")
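
Since the live and past collection functions differ only in which twarc method they loop over, the shared logic could also be pulled into one helper. The sketch below is one possible way to do that, not the approach this lesson depends on: `save_first_tweets` is a hypothetical name, and it uses `itertools.islice` from the standard library to cap how many items are taken from any iterator of tweets.

```python
import json
from itertools import islice

def save_first_tweets(tweet_iterator, max_tweets, filename):
    """Write at most `max_tweets` items from `tweet_iterator` to a JSONL file."""
    saved = 0
    with open(filename, "w", encoding="utf-8") as outfile:
        # islice() stops asking the iterator for tweets after max_tweets items
        for tweet in islice(tweet_iterator, max_tweets):
            json.dump(tweet, outfile)
            outfile.write("\n")
            saved += 1
    # Report how many tweets were actually written
    return saved
```

Assuming the configured `twarc` object from above, `save_first_tweets(twarc.search("coronavirus"), 1000, "coronavirus_search.jsonl")` would mirror the cell above, and passing `twarc.filter("coronavirus")` instead would collect live tweets.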

## Check Number of Tweets Collected

 Mac/Chrome OS

In [None]:
!wc -l coronavirus_search.jsonl

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Windows_logo_-_2012_derivative.svg/1024px-Windows_logo_-_2012_derivative.svg.png width=20 align='left'> Windows 

In [None]:
!find /v /c "" coronavirus_search.jsonl

# Crafting a Good Twitter Query

We made relatively simple queries to Twitter's API in the examples above. But there are more specific and more complex ways to make queries.

To craft a good Twitter search query, it's important to understand and explore these myriad ways. A researcher named Igor Brigadir has compiled a wonderful resource that details many of the Twitter API search operators: https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md

## Search for Exact Phrases

`twarc search "\"an exact phrase\""`

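You can search for an *exact* phrase in a tweet by wrapping the phrase in quotation marks escaped with backslashes (`\"`), as above. The backslashes keep the inner quotation marks from ending the outer quoted string early.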
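
When you build the query in Python rather than on the command line, you can skip the backslashes by using single quotation marks for the outer string. Here is a small sketch of that idea; `exact_phrase` is a hypothetical helper name, not a twarc function.

```python
def exact_phrase(phrase):
    """Wrap a phrase in double quotation marks so it is searched for exactly."""
    return f'"{phrase}"'

query = exact_phrase("an exact phrase")
print(query)  # prints "an exact phrase" (including the quotation marks)
```

The resulting string is identical to the escaped version `"\"an exact phrase\""`, so it can be passed as the search query in the same way.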

### "Not with a bang but with a..."

The first phrase that we're going to search for comes from the conclusion of T.S. Eliot's 1925 [poem "The Hollow Men"](https://msu.edu/~jungahre/transmedia/the-hollow-men.html):

>This is the way the world ends<br>
>This is the way the world ends<br>
>This is the way the world ends<br>
>**Not with a bang but with a whimper.**

You've probably heard these lines before, even if you didn't know that they were written by the modernist poet T.S. Eliot. This phrase is a striking example of a bit of literary, poetic language that has gone "viral" in 21st-century American culture, both on and off the internet.

We don't need to include the `timeout` command below because there aren't many tweets from the past 7 days that include "not with a bang but with a". Unlike a live, real-time tweet collection, the number of past tweets to be collected is finite. The search will thus complete in a relatively short amount of time, and it should be safe to run even if you have trouble stopping twarc commands in Jupyter notebooks.

In [13]:
!twarc search "\"not with a bang but with a\"" > bang.jsonl

In [69]:
collect_past_tweets("\"not with a bang but with a\"", 1000, "bang.jsonl")

## Search for General Phrases

### "Touch my face"

The other phrase we're going to search for comes from public health recommendations about preventing the spread of the coronavirus: that people should avoid touching their faces. Many people are, in light of these recommendations, discovering that it's actually very difficult not to touch your own face.

Now the avoidance of touching one's face has sprouted up as a funny Twitter meme. These various "touch my face" memes serve as an interesting example of how online communities produce comedy and levity even in times of stress and crisis.

In [960]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Working on not touching my face :) <a href="https://t.co/qfyNdrDReh">pic.twitter.com/qfyNdrDReh</a></p>&mdash; Hannah (@McBBQSauce) <a href="https://twitter.com/McBBQSauce/status/1235700933801242626?ref_src=twsrc%5Etfw">March 5, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [14]:
!twarc search "touch my face min_retweets:10" > face.jsonl 

In [71]:
collect_past_tweets("touch my face min_retweets:10", 2000, "face.jsonl")

Great! Now we have some Twitter data. But before we dive into analysis, we need to complete one more step. We need to convert this JSON data to CSV data, which will be easier for us to work with. Luckily, there's a twarc "utility" for this very purpose.

# Get Twarc Utilities

There are a number of twarc "utilities" that enable you to manipulate and analyze Twitter data. With these utilities, you can do things such as convert JSON data to CSV data, count up the most frequent emojis used in tweets, make a network visualization of tweets and Twitter users, and more.

These utilities are not available from the `pip install twarc` installation. To access the twarc utilities, you'll need to `git clone` the [twarc GitHub repository](https://github.com/DocNow/twarc) or download it as a zip file.

The twarc repository should already be downloaded in your relevant files, but if you uncomment the line below, you can also clone the repository with this line of code.

In [80]:
#!git clone https://github.com/DocNow/twarc.git

Cloning into 'twarc'...
remote: Enumerating objects: 37, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 3464 (delta 16), reused 23 (delta 9), pack-reused 3427
Receiving objects: 100% (3464/3464), 906.45 KiB | 9.96 MiB/s, done.
Resolving deltas: 100% (2180/2180), done.


# Use Twarc Utilities

`python twarc/utils/your_desired_util.py tweets.jsonl`

To use a twarc utility, you need to call `python` from the command line and then include the utility's file path (they all should be in the "twarc/utils" subfolder).

Note that if your Jupyter notebook is in exactly the same directory as the "twarc" repository, then you can run the code as above. However, if your Jupyter notebook is somewhere else, you will have to direct it to the correct location of "twarc/utils". For example: `python /Users/melaniewalsh/twarc/utils/your_desired_util.py tweets.jsonl`
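
If you would rather not type out the full path by hand, the command can also be assembled in Python. This is just a sketch under a few assumptions: `build_utility_command` is a hypothetical helper name, and the folder you pass in should be wherever you cloned or unzipped the twarc repository.

```python
import sys
from pathlib import Path

def build_utility_command(twarc_folder, utility_name, jsonl_filename):
    """Return the command (as a list) that runs a twarc utility on a JSONL file."""
    # pathlib joins the pieces with the right separator for your operating system
    utility_path = Path(twarc_folder) / "utils" / utility_name
    # sys.executable is the Python that is running this notebook
    return [sys.executable, str(utility_path), jsonl_filename]

command = build_utility_command("twarc", "json2csv.py", "tweets.jsonl")
```

The resulting list can then be handed to `subprocess.run()` from the standard library, or simply printed so that you can check the path before running it on the command line.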

## Convert JSON to CSV

To convert a JSON file to a CSV file, you can run `python twarc/utils/json2csv.py` followed by the JSONL filename, the output operator `>` and your desired filename for the CSV file.

`python twarc/utils/json2csv.py json_file.jsonl > csv_file.csv`

> <img src=https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Windows_logo_-_2012_derivative.svg/1024px-Windows_logo_-_2012_derivative.svg.png width=20 align='left'> Heads up Windows users! The twarc utility json2csv.py will probably not work on your computer by default. You'll likely get a UnicodeEncodeError because Windows computers do not use Unicode (UTF-8) by default. However, you can make UTF-8 your default by following [these instructions](https://scholarslab.github.io/learn-twarc/08-win-region-settings) and restarting your computer. Then json2csv.py should work.

Make "bang.jsonl" into "bang.csv" (with an extra field added for the full version of the original retweeted text):

In [72]:
!python twarc/utils/json2csv.py --extra-field rt_text retweeted_status.full_text bang.jsonl > bang.csv

Make "face.jsonl" into "face.csv" (with an extra field added for the full version of the original retweeted text):

In [73]:
!python twarc/utils/json2csv.py --extra-field rt_text retweeted_status.full_text face.jsonl > face.csv
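
If json2csv.py gives you trouble, the same basic idea can be sketched with Python's built-in `json` and `csv` libraries. This is not a replacement for the twarc utility, which outputs many more columns. `jsonl_to_csv` is a hypothetical helper name, and the field names it reads (`id_str`, `created_at`, `user.screen_name`, `full_text`) are assumptions about how each tweet's JSON is laid out, so adjust them to match your own data. Opening both files with `encoding="utf-8"` is meant to sidestep the default-encoding problem described in the Windows note above.

```python
import csv
import json

def jsonl_to_csv(jsonl_filename, csv_filename):
    """Write a few fields from each tweet in a JSONL file to a CSV file."""
    with open(jsonl_filename, encoding="utf-8") as infile, \
         open(csv_filename, "w", encoding="utf-8", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["id", "created_at", "screen_name", "text", "rt_text"])
        for line in infile:
            # Skip blank lines
            if not line.strip():
                continue
            tweet = json.loads(line)
            writer.writerow([
                tweet.get("id_str", ""),
                tweet.get("created_at", ""),
                tweet.get("user", {}).get("screen_name", ""),
                # Fall back to "text" if "full_text" is missing (assumed field names)
                tweet.get("full_text", tweet.get("text", "")),
                # Full text of the original tweet when this one is a retweet
                tweet.get("retweeted_status", {}).get("full_text", ""),
            ])
```

For example, `jsonl_to_csv("bang.jsonl", "bang.csv")` would write one row per collected tweet, with the `rt_text` column left empty for tweets that are not retweets.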