![Cornell Day of Data](https://data.research.cornell.edu/sites/default/files/CornellDayOfData2019-hr.jpg)

# **Working with Twitter Data**

This interactive Jupyter notebook is a companion to "Working with Twitter Data," a workshop led by Melanie Walsh at Cornell's 2019 Day of Data. This notebook is designed to allow participants to use Twarc and Twint without prior set-up or installation.

Many of the cells below include code that should be run on the command line. These cells all begin with an exclamation point `!`, which tells Jupyter to pass the rest of the line to the system shell instead of running it as Python.
![the command line](images/command-line.png)


# **Installation**

### Python

I recommend installing Python with Anaconda: https://docs.continuum.io/anaconda/install/

### [Twarc](https://github.com/DocNow/twarc)

In [None]:
!pip install twarc

Or download and open twarc as a zip file: https://github.com/DocNow/twarc/archive/master.zip

### [Twint](https://github.com/twintproject/twint)

In [None]:
!pip install twint

# **Tweet Collection**

## Collect tweets based on keyword and output to a CSV file with Twint

The `--search` or `-s` flag scrapes tweets that include a specific keyword.

The `--output` or `-o` flag saves the tweets to a file.

The `--csv` flag writes the file in CSV format.

In [None]:
!twint --search demdebate --output dem-debate-2019-10.csv --csv

### Check how many tweets have been collected

An easy way to check how many tweets have been collected is to use the `wc` command with the `-l` flag, which returns the number of lines in a file. Note that the count includes the CSV header row.

In [12]:
!wc -l dem-debate-2019-10.csv

    7411 dem-debate-2019-10.csv
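
Because `wc -l` counts lines rather than records, it includes the header row and would count a record more than once if one of its fields contained a line break. The sketch below shows one way to count records instead, using Python's built-in `csv` module. The function name `count_csv_rows` and the sample data are hypothetical, and the sketch assumes the file starts with a single header row.

```python
import csv
import os
import tempfile


def count_csv_rows(path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# Tiny made-up file (not real tweet data): four lines, but only two records,
# because the first record has a line break inside a quoted field.
sample = 'id,tweet\n1,"first line\nsecond line"\n2,single line\n'
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write(sample)

demo_count = count_csv_rows(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(demo_count)
```

With the file collected above, the call would be `count_csv_rows('dem-debate-2019-10.csv')`.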


The cell below uses the Python library Pandas, which we will not discuss in depth today. It's included here because it's an easy way to quickly show what the CSV file looks like.

In [90]:
import pandas
dem_debate_tweets = pandas.read_csv('dem-debate-2019-10.csv')
dem_debate_tweets

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,...,quote_url,video,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
0,1183913331276570629,1183913331276570629,1571101908000,2019-10-14,21:11:48,EDT,101342744,bmackiswack,Bewack,,...,https://twitter.com/KyleKulinski/status/118390...,0,,,,,,,"[{'user_id': '101342744', 'username': 'bmackis...",
1,1183913274426961920,1183913274426961920,1571101895000,2019-10-14,21:11:35,EDT,1529917477,solon594bce,Matthew Kracht,,...,,0,,,,,,,"[{'user_id': '1529917477', 'username': 'solon5...",
2,1183912991084961792,1183912991084961792,1571101827000,2019-10-14,21:10:27,EDT,1282194938,m4progress,Mom4justice,,...,,0,,,,,,,"[{'user_id': '1282194938', 'username': 'M4prog...",
3,1183912966883819520,1183912966883819520,1571101822000,2019-10-14,21:10:22,EDT,270394054,mboteach,Melissa Boteach,,...,,0,,,,,,,"[{'user_id': '270394054', 'username': 'mboteac...",
4,1183912933111283715,1183912933111283715,1571101813000,2019-10-14,21:10:13,EDT,35221521,lensjockey,Holly Van Voast,,...,https://twitter.com/LaMottJackson/status/11839...,0,,,,,,,"[{'user_id': '35221521', 'username': 'lensjock...",
5,1183912781780787201,1183912781780787201,1571101777000,2019-10-14,21:09:37,EDT,2687613918,jarheadjenny,Jenny Hawkes Jenkins,,...,,0,,,,,,,"[{'user_id': '2687613918', 'username': 'Jarhea...",
6,1183912653347053568,1183912653347053568,1571101747000,2019-10-14,21:09:07,EDT,1529917477,solon594bce,Matthew Kracht,,...,,0,,,,,,,"[{'user_id': '1529917477', 'username': 'solon5...",
7,1183912541510164481,1183912541510164481,1571101720000,2019-10-14,21:08:40,EDT,1155256970653446145,pappledawn,"Dawn Papple, Independent Writer",,...,,0,,,,,,,"[{'user_id': '1155256970653446145', 'username'...",
8,1183912380478169088,1183912380478169088,1571101682000,2019-10-14,21:08:02,EDT,17006036,naral,NARAL,,...,,0,,,,,,,"[{'user_id': '17006036', 'username': 'NARAL'},...",
9,1183912277583417344,1183910821140807687,1571101657000,2019-10-14,21:07:37,EDT,730898818443530241,brendanreid20,Brendan Reid,,...,,0,,,,,,,"[{'user_id': '730898818443530241', 'username':...",


## Collect tweets from a specific user and output to a JSON file with Twint

The `--username` or `-u` flag scrapes a specific user's tweets.

The `--output` or `-o` flag saves the tweets to a file.

The `--json` flag writes the file in JSON format.

### Collect tweets from @washingtonpost

In [None]:
!twint --username washingtonpost --output washington-post-tweets.json --json
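
To look inside a JSON file like this one from Python, a minimal sketch follows. It assumes the file holds one JSON object per line, and that each object has `username` and `tweet` fields; both are assumptions about Twint's output that you should check against your own file. The function name `read_json_lines` and the sample records are hypothetical.

```python
import io
import json


def read_json_lines(lines):
    """Yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up records standing in for a file such as washington-post-tweets.json.
sample_file = io.StringIO(
    '{"username": "example_user", "tweet": "First example tweet"}\n'
    '{"username": "example_user", "tweet": "Second example tweet"}\n'
)

tweets = list(read_json_lines(sample_file))
for tweet in tweets:
    # The field names here are assumed, not confirmed by this notebook.
    print(tweet["username"], "|", tweet["tweet"])
```

With a real file, pass an open file object instead of `sample_file`, for example `open('washington-post-tweets.json', encoding='utf-8')`.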

### Check how many @washingtonpost tweets have been collected

In [92]:
#Your code here

### Collect tweets from @FoxNews

In [None]:
!twint -u foxnews --output fox-news-tweets.json --json

### Check how many @FoxNews tweets have been collected

In [92]:
#Your code here

### Collect tweets from a Twitter account of your choice

In [None]:
!twint -u YOUR-TWITTER-ACCOUNT-HERE --output YOUR-FILENAME-HERE.json --json

Twarc won't work in this notebook unless you configure it with your own consumer key, consumer secret, access token, and access token secret. However, the code below shows how you would use Twarc for the same purposes.

### Collect tweets based on keyword and output to a JSON file with Twarc

In [None]:
!twarc search demdebate > dem-debate-2019-10-twarc.json

### Collect most recent tweets from a specific user and output to a JSON file with Twarc

In [19]:
!twarc timeline washingtonpost > wa-po-recent.json

# **Tweet Analysis**

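Twarc comes with a handy set of utilities for basic Twitter analysis. The commands below call these scripts by path, so they assume the Twarc repository has been downloaded into a folder named `twarc` alongside this notebook.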

To explore these utilities, we're going to use a small portion of the September 2019 #demdebate tweet dataset collected by Ed Summers and Matthew Salzano. The tweet IDs for this dataset can be found on Doc Now's catalog: https://www.docnow.io/catalog/

### Twarc utilities `twarc/utils`

### Identify Top Hashtags `twarc/utils/tags.py`

In [46]:
!python twarc/utils/tags.py dem-debate-2019-09.json | head -n 20

 4492 demdebate
  779 yanggang
  692 myyangstory
  609 yangbeatstrump
  514 yangsdebatesurprise
  479 isupportyang
  233 yang2020
  202 yangmediablackout
  191 democraticdebate
  189 demdebate3
  173 googleandrewyang
  136 teamtrump
  118 bigtruth
  115 texas
  115 beto2020
   99 humanityfirst
   83 phasethree
   68 medicareforall
   68 demdebatetsu
   60 election2020
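
The idea behind a count like this can be sketched in a few lines of Python. This is an illustration rather than the code in `tags.py`: it assumes one tweet per line in the Twitter API v1.1 format, where hashtags are listed under `entities` and `hashtags`. The function name `count_hashtags` and the sample tweets are hypothetical.

```python
import json
from collections import Counter


def count_hashtags(json_lines):
    """Count lowercased hashtags across line-delimited tweet JSON."""
    counts = Counter()
    for line in json_lines:
        if not line.strip():
            continue  # ignore blank lines
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts


# Made-up tweets containing only the fields this sketch needs.
sample_lines = [
    json.dumps({"entities": {"hashtags": [{"text": "DemDebate"},
                                          {"text": "YangGang"}]}}),
    json.dumps({"entities": {"hashtags": [{"text": "demdebate"}]}}),
    json.dumps({"entities": {"hashtags": []}}),
]

hashtag_counts = count_hashtags(sample_lines)
for tag, count in hashtag_counts.most_common(20):
    print(f"{count:5d} {tag}")
```

Lowercasing the tags means `#DemDebate` and `#demdebate` are counted together, which matches the all-lowercase output shown above.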


### Create a Word Cloud `twarc/utils/wordcloud.py`

In [87]:
!python twarc/utils/wordcloud.py dem-debate-2019-09.json > dem-debate-2019-09.html

View your word cloud:

[dem-debate-2019-09.html](dem-debate-2019-09.html)
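
A word cloud is driven by word frequencies: the more often a word appears, the larger it is drawn. The sketch below shows one simple way to compute those frequencies from tweet text. It is not the code in `wordcloud.py`; the function name `word_frequencies`, the short stop-word list, and the sample texts are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# A deliberately short, illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "for", "on", "rt"}


def word_frequencies(texts):
    """Count words across texts, skipping URLs, stop words, and short words."""
    counts = Counter()
    for text in texts:
        text = re.sub(r"https?://\S+", " ", text.lower())  # drop URLs
        for word in re.findall(r"[a-z']+", text):
            if word not in STOP_WORDS and len(word) > 2:
                counts[word] += 1
    return counts


# Made-up tweet texts.
sample_texts = [
    "The debate is on tonight https://example.com/x",
    "RT Watching the debate tonight",
]

frequencies = word_frequencies(sample_texts)
for word, count in frequencies.most_common(10):
    print(f"{count:5d} {word}")
```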

### Identify Top Emojis `twarc/utils/emojis.py`

In [50]:
!python twarc/utils/emojis.py dem-debate-2019-09.json | head -n 20

👍   499
😂   175
🔥    91
🇺🇸    76
🎉    70
😀    68
🤣    65
😅    58
✨    54
🚨    53
📺    52
👏    50
🤠    39
⬇    38
✅    37
❤    29
🤡    28
☀    20
👇    20
📢    20
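
For a rough sense of how emoji can be counted, the sketch below treats any character in the Unicode category `So` ("Symbol, other") as an emoji. This is a crude approximation and not the method used by `emojis.py`: the category also includes non-emoji symbols such as ©, and a flag like 🇺🇸 is made of two code points that this sketch would count separately. The function name `count_emoji` and the sample texts are hypothetical.

```python
import unicodedata
from collections import Counter


def count_emoji(texts):
    """Count characters whose Unicode category is 'So' (Symbol, other)."""
    counts = Counter()
    for text in texts:
        for char in text:
            if unicodedata.category(char) == "So":
                counts[char] += 1
    return counts


# Made-up tweet texts.
sample_texts = ["Great answer 👍👍", "😂 so true 👍"]

emoji_counts = count_emoji(sample_texts)
for char, count in emoji_counts.most_common(20):
    print(f"{char}   {count}")
```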


# **Further Resources**

Deen Freelon, Charlton D. McIlwain, and Meredith D. Clark, [“Beyond the hashtags: #Ferguson, #Blacklivesmatter, and the online struggle for offline justice.”](http://cmsimpact.org/resource/beyond-hashtags-ferguson-blacklivesmatter-online-struggle-offline-justice/)

Freelon et al. purchased 40,815,975 tweets published between June 1, 2014 and May 31, 2015 that matched 45 keywords: the words or hashtags #BlackLivesMatter and #Ferguson, and the names of 20 Black individuals who were killed by the police during that year. The researchers then generously shared [the tweet IDs for this dataset](http://dfreelon.org/2017/01/03/beyond-the-hashtags-twitter-data/) for free online.