![Cornell Day of Data](https://data.research.cornell.edu/sites/default/files/CornellDayOfData2019-hr.jpg)

# **Working with Twitter Data**

This interactive Jupyter notebook is a companion to "Working with Twitter Data," a workshop led by Melanie Walsh at Cornell's 2019 Day of Data. The notebook was designed to allow participants to experiment with Twarc and Twint without prior set-up or installation.

Many of the cells below include code that would normally be run on the command line. These cells all begin with an exclamation point `!`, which tells a Jupyter notebook to pass the rest of the line to a shell.
![the command line](images/command-line.png)
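As a rough illustration of what the `!` prefix does, the sketch below runs a command as a child process from plain Python and captures its output. It uses the standard `subprocess` module and, to stay portable, launches the Python interpreter itself rather than a shell utility. Jupyter's own mechanism differs in its details, so treat this as an analogy rather than a description of how the notebook works internally.

```python
import subprocess
import sys

# Run a small command as a child process, much as `!` hands a line to the shell.
# sys.executable is the path of the Python interpreter running this code.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,  # collect stdout/stderr instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise an error if the command fails
)
print(result.stdout.strip())
```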


# **Run a Jupyter cell**

In [None]:
print('Nice! You did it. You just ran a cell.')

# **Installation**

### Python

I recommend installing Python with Anaconda: https://docs.continuum.io/anaconda/install/

### [Twarc](https://github.com/DocNow/twarc)

In [None]:
!pip install twarc

Or download and open twarc as a zip file: https://github.com/DocNow/twarc/archive/master.zip

### [Twint](https://github.com/twintproject/twint)

In [None]:
!pip install twint

# **Set up a Twitter developer account**

*Twarc won't work in this notebook unless you configure it with your own consumer key, consumer secret, access token, and access token secret.*

1. Create a Twitter developer account: https://developer.twitter.com/en/apps 

2. Get your consumer key, consumer secret, access token, and access token secret.

Outside this notebook, you would configure twarc by entering `twarc configure` at the command line and following the prompts. To explore twarc within this notebook, however, simply enter your tokens between the single quotation marks below and then run the cell:

In [None]:
%env\
consumer_key = '',\
consumer_secret = '',\
access_token = '',\
access_token_secret = ''
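An alternative sketch in plain Python sets the same four values through `os.environ`, which commands started with `!` inherit from the notebook kernel. The upper-case variable names are an assumption about what twarc reads from the environment; check them against the documentation for your installed version. The `credentials` and `missing` names are illustrative only.

```python
import os

# Placeholder values; replace the empty strings with your own credentials.
credentials = {
    "CONSUMER_KEY": "",
    "CONSUMER_SECRET": "",
    "ACCESS_TOKEN": "",
    "ACCESS_TOKEN_SECRET": "",
}

# Assumption: twarc looks for these upper-case names in the environment.
os.environ.update(credentials)

# Report which values are still empty without printing the secrets themselves.
missing = [name for name, value in credentials.items() if not value]
print("Still missing:", missing)
```

Printing only the names of the missing values avoids leaving keys and tokens in the saved notebook output.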

# **Tweet Collection**

## Collect tweets based on keyword and output to a CSV file with Twint

The `--search` or `-s` flag tells Twint to scrape all tweets that include a specific keyword.

The `--output` or `-o` flag saves the tweets to a file.

The `--csv` flag writes the file in CSV format.

Search for tweets with the keyword "demdebate":

In [None]:
!twint --search demdebate --output dem-debate-2019-10.csv --csv
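If you want to reuse this command with different keywords, one option is to assemble it in Python first. The sketch below builds the argument list from the three flags described above and prints a shell-safe command string; `build_twint_search` is a hypothetical helper written for this illustration, not part of Twint.

```python
import shlex

def build_twint_search(keyword, output_path):
    """Return the argument list for a Twint keyword search saved as CSV."""
    # Only the flags described above are used: --search, --output, --csv.
    return ["twint", "--search", keyword, "--output", output_path, "--csv"]

command = build_twint_search("demdebate", "dem-debate-2019-10.csv")

# shlex.join quotes arguments where needed, so the result can be pasted after `!`.
print(shlex.join(command))
```

Building the command as a list and quoting it with `shlex.join` keeps multi-word keywords intact when the string is handed to a shell.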

### Check how many tweets have been collected

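An easy way to check how many tweets have been collected is to use the `wc` command with the `-l` flag, which returns the number of lines in a file. Because the CSV file begins with a header row, the number of tweets is one less than the number of lines reported.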

In [None]:
!wc -l dem-debate-2019-10.csv
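Line counts can overstate the number of tweets if any tweet text contains a line break inside a quoted CSV field. The sketch below counts records with Python's `csv` module instead. It runs on a small made-up sample rather than on real Twint output, and `count_csv_records` is a hypothetical helper name.

```python
import csv
import os
import tempfile

def count_csv_records(path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)

# Made-up sample data: the second tweet contains a line break inside quotes.
sample = 'id,tweet\n1,"first tweet"\n2,"second tweet\nwith a line break"\n'
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write(sample)

line_count = sample.count("\n")             # what `wc -l` would report: 4
record_count = count_csv_records(tmp.name)  # actual number of tweets: 2
print(line_count, record_count)
os.remove(tmp.name)
```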

The cells below use the Python library [Pandas](https://pandas.pydata.org/), which we will not discuss in depth today. It's included here because it's an easy way to quickly show what the CSV file looks like.

In [None]:
import pandas
#Set column width
pandas.options.display.max_colwidth = 1000

#Read CSV file
dem_debate_tweets = pandas.read_csv('dem-debate-2019-10.csv')

#Display columns in a Twint-produced CSV
dem_debate_tweets.columns

Show a preview of the CSV file:

In [None]:
dem_debate_tweets[['date', 'time', 'username', 'tweet', 'mentions', 'retweets_count', 'hashtags', 'link']].head(100)

## Collect tweets from a specific user and output to a JSON file with Twint

The `--username` or `-u` flag tells Twint to scrape a specific user's tweets.

The `--output` or `-o` flag saves the tweets to a file.

The `--json` flag writes the file in JSON format.

### Collect tweets from @washingtonpost

In [None]:
!twint --username washingtonpost --output washington-post-tweets.json --json

### Check how many @washingtonpost tweets have been collected

In [None]:
#Your code here

### Collect tweets from @FoxNews

In [None]:
!twint -u foxnews --output fox-news-tweets.json --json

### Check how many @FoxNews tweets have been collected

In [None]:
#Your code here

### ✨ Collect tweets from a Twitter Account of your Choice ✨

In [None]:
!twint -u YOUR-TWITTER-ACCOUNT-HERE --output YOUR-FILENAME-HERE.json --json

## Collect tweets based on keyword and output to a JSON file with Twarc

In [None]:
!twarc search demdebate > dem-debate-2019-10-twarc.json
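Twarc writes its output as line-oriented JSON, one tweet per line, which is why the pandas cell in this notebook passes `lines=True` when reading a Twarc file. A minimal sketch of reading such a file with only the standard library follows. The sample tweets are made up, and `read_json_lines` is a hypothetical helper name.

```python
import json
import os
import tempfile

def read_json_lines(path):
    """Yield one parsed tweet (a dict) per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

# Made-up stand-in for a Twarc output file: two tiny tweet-like objects.
sample_tweets = [{"id_str": "1", "full_text": "hello"},
                 {"id_str": "2", "full_text": "world"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    for tweet in sample_tweets:
        tmp.write(json.dumps(tweet) + "\n")

tweets = list(read_json_lines(tmp.name))
print(len(tweets), "tweets read")
os.remove(tmp.name)
```

Because each line holds exactly one tweet, the number of lines in a Twarc file equals the number of tweets collected.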

## Collect most recent tweets from a specific user and output to a JSON file with Twarc

In [None]:
!twarc timeline washingtonpost > wa-po-recent.json

In [None]:
!twarc timeline foxnews > fox-news-recent.json

In [None]:
import pandas
fox_news_tweets = pandas.read_json('fox-news-recent.json', lines=True)

#Display columns in a Twarc-produced JSON
fox_news_tweets.columns

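Show a preview of the JSON file with the newest tweets first. Tweet IDs increase over time, so sorting by `id_str` in descending order approximates sorting by date: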

In [None]:
fox_news_tweets[['created_at', 'full_text', 'user', 'geo', 'retweet_count', 'id_str']].sort_values(by='id_str', ascending=False).head(10)
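When working with the raw JSON rather than a pandas DataFrame, you can also sort by the `created_at` timestamp directly. The sketch below assumes the date format used in Twitter's v1.1 JSON (for example `Wed Oct 16 01:30:00 +0000 2019`). The sample tweets are made up, and `parse_created_at` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed format of `created_at` in Twitter's v1.1 JSON.
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(value):
    """Convert a Twitter `created_at` string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)

# Made-up tweets, deliberately out of chronological order.
sample = [
    {"id_str": "a", "created_at": "Tue Oct 15 23:59:00 +0000 2019"},
    {"id_str": "b", "created_at": "Wed Oct 16 01:30:00 +0000 2019"},
    {"id_str": "c", "created_at": "Tue Oct 15 12:00:00 +0000 2019"},
]

# Sort newest first by the parsed timestamp.
newest_first = sorted(sample,
                      key=lambda tweet: parse_created_at(tweet["created_at"]),
                      reverse=True)
print([tweet["id_str"] for tweet in newest_first])
```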

# **Tweet Analysis**

### Twarc utilities `twarc/utils`

### Identify Top Hashtags `twarc/utils/tags.py`

In [None]:
!python twarc/utils/tags.py dem-debate-2019-10-twarc.json
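To give a sense of the kind of counting involved, here is a minimal sketch that tallies hashtags from tweet dictionaries. It assumes each tweet stores its hashtags under `entities` → `hashtags` → `text`, as in Twitter's v1.1 JSON. It is an illustration on made-up data, not necessarily how `tags.py` is implemented, and `count_hashtags` is a hypothetical name.

```python
from collections import Counter

def count_hashtags(tweets):
    """Tally lower-cased hashtags across a list of tweet dictionaries."""
    counts = Counter()
    for tweet in tweets:
        # Assumed layout: tweet["entities"]["hashtags"] is a list of {"text": ...}.
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts

# Made-up tweets for illustration.
sample = [
    {"entities": {"hashtags": [{"text": "DemDebate"}, {"text": "Ohio"}]}},
    {"entities": {"hashtags": [{"text": "demdebate"}]}},
    {"entities": {"hashtags": []}},
]

hashtag_counts = count_hashtags(sample)
for tag, count in hashtag_counts.most_common(10):
    print(count, tag)
```

Lower-casing the tags first means `#DemDebate` and `#demdebate` are counted together.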

### Create a Word Cloud `twarc/utils/wordcloud.py`

In [None]:
!python twarc/utils/wordcloud.py dem-debate-2019-10-twarc.json > dem-debate-2019-10-twarc.html

[dem-debate-2019-10-twarc.html](dem-debate-2019-10-twarc.html)
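A word cloud is, at bottom, a picture of word frequencies. The sketch below computes those frequencies from tweet text after dropping URLs and a few common words. The stop-word list, the sample texts, and the `word_frequencies` helper are all made up for illustration, and `wordcloud.py` may process text differently.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def word_frequencies(texts):
    """Count words across texts, ignoring URLs and stop words."""
    counts = Counter()
    for text in texts:
        text = re.sub(r"https?://\S+", " ", text)  # remove links
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts

# Made-up tweet texts for illustration.
texts = [
    "The debate is on tonight",
    "Debate night! https://example.com",
    "Watching the debate",
]

frequencies = word_frequencies(texts)
print(frequencies.most_common(5))
```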

In [None]:
!python twarc/utils/wordcloud.py wa-po-recent.json > wa-po-recent.html

View your word cloud:

[wa-po-recent.html](wa-po-recent.html)

In [None]:
!python twarc/utils/wordcloud.py fox-news-recent.json > fox-news-recent.html

View your word cloud:

[fox-news-recent.html](fox-news-recent.html)

### Identify Top Emojis `twarc/utils/emojis.py`

In [None]:
!python twarc/utils/emojis.py dem-debate-2019-10-twarc.json | head -n 20
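The standard library has no emoji detector, but a rough approximation is to count characters in the Unicode category `So` ("Symbol, other"), which covers most single-character emoji. The sketch below does that on made-up text. It will miss emoji built from several code points and will also count non-emoji symbols such as ©, so it is not a substitute for `emojis.py`. The `count_emoji_like` name is hypothetical.

```python
import unicodedata
from collections import Counter

def count_emoji_like(texts):
    """Tally characters whose Unicode category is 'So' (Symbol, other)."""
    counts = Counter()
    for text in texts:
        for char in text:
            if unicodedata.category(char) == "So":
                counts[char] += 1
    return counts

# Made-up tweet texts; \U0001F44F is clapping hands, \U0001F525 is fire.
texts = [
    "Great answer \U0001F44F\U0001F44F",
    "\U0001F525 debate tonight",
    "No emoji here",
]

emoji_counts = count_emoji_like(texts)
# Show the 20 most common, mirroring `head -n 20` in the command above.
for char, count in emoji_counts.most_common(20):
    print(count, char)
```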