# **Collecting Twitter Data**

This interactive Jupyter notebook will allow you to experiment with Twarc without prior set-up or installation.

Most of the cells below include code that should be run on the command line. These cells all begin with an exclamation point `!`. The `!` allows a Jupyter notebook to run code from a shell.
![the command line](images/command-line.png)


# **Run a Jupyter cell**

In [119]:
print('Nice! You did it. You just ran a cell.')

Nice! You did it. You just ran a cell.


# **Installation**

### Python

I recommend installing Python with Anaconda:

https://docs.continuum.io/anaconda/install/

### Twarc 

To install twarc, run `pip install twarc` on the command line. The command below pins version 1.7.5, the latest release when this notebook was written.

In [1]:
!pip install 'twarc == 1.7.5'



Or you can download and open twarc as a zip file: https://github.com/DocNow/twarc/archive/master.zip

More detailed instructions about twarc and its installation can be found at https://github.com/DocNow/twarc

# **Set up a Twitter developer account**

*Twarc won't work in this notebook unless you configure it with your own consumer key, consumer secret, access token, and access token secret.*

1. Create a Twitter developer account and Twitter application. If you haven't done so yet, you can follow the instructions on our GitHub repo: https://github.com/melaniewalsh/Humanities-Data-Society/blob/master/TwitterFirstSteps.md

2. Record consumer key, consumer secret, access token, and access token secret

3. Open a terminal

![](images/terminal.png)

4. Configure twarc by entering `twarc configure` and following the prompts

![](images/twarc-configure.png)

Now you should be able to use twarc in this notebook!

# **Collecting Twitter Data**

>**“Ok boomer”** has become Generation Z’s endlessly repeated retort to the problem of older people who just don’t get it, a rallying cry for millions of fed up kids. Teenagers use it to reply to cringey YouTube videos, Donald Trump tweets, and basically any person over 30 who says something condescending about young people — and the issues that matter to them."

> -Taylor Lorenz, ["‘OK Boomer’ Marks the End of Friendly Generational Relations"](https://www.nytimes.com/2019/10/29/style/ok-boomer.html)

## Filter realtime (live)

In [None]:
!twarc filter "ok boomer" > ok_boomer_filter.jsonl

## Search (last 7 days)

In [2]:
!twarc search "morissey" > moz.jsonl

## Check how many tweets have been collected

*The command "wc" with the "-l" flag tells you how many lines are in a file*

In [3]:
!wc -l moz.jsonl

121 moz.jsonl
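The same count can be taken from Python, which also confirms that every line parses as JSON. This is a minimal sketch: the helper name `count_tweets` is hypothetical and is not part of twarc.

```python
import json

def count_tweets(path):
    """Count the non-empty lines in a JSONL file, parsing each one as JSON."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError if the line is not valid JSON
                count += 1
    return count
```

For a file with no blank lines, `count_tweets("moz.jsonl")` should agree with `wc -l`.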


## Convert JSON file to CSV file

In [4]:
!python twarc/utils/json2csv.py moz.jsonl > moz.csv
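To see roughly what a JSON-to-CSV conversion involves, here is a minimal standard-library sketch. It is not the `json2csv.py` implementation: it keeps only four columns, the function names `tweet_to_row` and `jsonl_to_csv` are hypothetical, and it assumes the tweets use the Twitter API v1.1 field names `id_str`, `created_at`, `user.screen_name`, and `full_text` or `text`.

```python
import csv
import json

# Assumed subset of columns; json2csv.py writes many more than these.
FIELDS = ["id", "created_at", "user_screen_name", "text"]

def tweet_to_row(tweet):
    """Flatten one tweet dict into the columns listed in FIELDS."""
    return {
        "id": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "user_screen_name": tweet.get("user", {}).get("screen_name"),
        # Extended-mode tweets carry 'full_text'; older payloads use 'text'.
        "text": tweet.get("full_text") or tweet.get("text"),
    }

def jsonl_to_csv(jsonl_path, csv_path):
    """Write one CSV row per JSON line and return the number of rows written."""
    rows = 0
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(tweet_to_row(json.loads(line)))
                rows += 1
    return rows
```

The tweet ID is read from `id_str` rather than `id` so that it stays a string and is not rounded when a spreadsheet or pandas reads it as a float.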

## Import the Python library "pandas" and read in tweet CSV files

In [5]:
import pandas
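# Use the full option names; short forms such as 'max_columns' can match
# more than one option in newer pandas versions and raise an error
pandas.set_option('display.max_colwidth', 2000)
pandas.set_option('display.max_columns', 2000)
pandas.set_option('display.max_rows', 100)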
moz = pandas.read_csv('moz.csv')

## See all columns for the first 10 rows

In [6]:
moz.head(10)

Unnamed: 0,id,tweet_url,created_at,parsed_created_at,user_screen_name,text,tweet_type,coordinates,hashtags,media,urls,favorite_count,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,lang,place,possibly_sensitive,retweet_count,retweet_or_quote_id,retweet_or_quote_screen_name,retweet_or_quote_user_id,source,user_id,user_created_at,user_default_profile_image,user_description,user_favourites_count,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_statuses_count,user_time_zone,user_urls,user_verified
0,1197898902801977344,https://twitter.com/TheRichardM/status/1197898902801977344,Fri Nov 22 15:25:28 +0000 2019,2019-11-22 15:25:28+00:00,TheRichardM,Guess I'll see what this Morissey guy is all about.,original,,,,,1,,,,en,,,0,,,,"<a href=""https://about.twitter.com/products/tweetdeck"" rel=""nofollow"">TweetDeck</a>",15329910,Sun Jul 06 02:23:46 +0000 2008,False,"Working to boost Tulsa, OK's tech & game dev scene. Former Reviews Editor at Joystiq (RIP). Member of the Oregon Trail Generation.",5490,4352,348,263,"Tulsa, OK",Richard Mitchell,17559,,http://8bitninja.com,False
1,1197896641719783424,https://twitter.com/PlayingPolitix/status/1197896641719783424,Fri Nov 22 15:16:29 +0000 2019,2019-11-22 15:16:29+00:00,PlayingPolitix,@MarieResists52 There's are Morissey and Ramones versions too,reply,,,,,0,MarieResists52,1.197881e+18,2260900000.0,en,,,0,,,,"<a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>",744970093646381057,Mon Jun 20 19:08:06 +0000 2016,False,Resister of Tyrants and Purveyor of the World's Greatest Playing Cards. Check them out at the link below.,22315,29709,31694,16,"Des Moines, IA",PLAYING POLITICS,27914,,http://www.playingpoliticscards.com,False
2,1197892196445368320,https://twitter.com/tinycalamities/status/1197892196445368320,Fri Nov 22 14:58:50 +0000 2019,2019-11-22 14:58:50+00:00,tinycalamities,"Based on my playlist, Spotify thinks I’m interested in Morissey #imnot #fixyouralgorithms",original,,imnot fixyouralgorithms,,,0,,,,en,,,0,,,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",116732050,Tue Feb 23 12:12:03 +0000 2010,False,slippery slipstream,649,30,54,1,Copenhagen(-ish),based hogling,681,,,False
3,1197889982284914688,https://twitter.com/RealMathead/status/1197889982284914688,Fri Nov 22 14:50:02 +0000 2019,2019-11-22 14:50:02+00:00,RealMathead,"@jeffhorowitzPMP @victoryrhoad @pattonoswalt Yeah, sadly. Even more sadly, Morissey is a total bellend.",reply,,,,,0,jeffhorowitzPMP,1.197515e+18,31305460.0,en,,,0,,,,"<a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>",155591933,Mon Jun 14 15:20:08 +0000 2010,False,free and cheap forever,2861,39,389,0,Aachen,mathead,1181,,http://blog.headsign.de,False
4,1197853998851796993,https://twitter.com/NeuroBrick/status/1197853998851796993,Fri Nov 22 12:27:03 +0000 2019,2019-11-22 12:27:03+00:00,NeuroBrick,"@TheEbonyMaw I can't remember, but I think it was a woman. \n\nThere were several lefty celebs including Jack Morissey who tweeted that they should be fed to a woodchipper.\n\nLOL\n\nSue them all.",reply,,,,,3,TheEbonyMaw,1.197852e+18,1.034265e+18,en,,,0,,,,"<a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>",1146714432803753984,Thu Jul 04 09:36:40 +0000 2019,False,Traumatized into becoming a full-time shitposter. Neuroscience grad. Hi. DMs open. Follow me for egirl feet pics.,40151,457,166,5,,Brick,13790,,,False
5,1197841189577166848,https://twitter.com/inceltown/status/1197841189577166848,Fri Nov 22 11:36:09 +0000 2019,2019-11-22 11:36:09+00:00,inceltown,@GrogsGamut There are a lot of ways to mock this but the Morissey song rules them all,reply,,,,,0,GrogsGamut,1.197839e+18,50560530.0,en,,,0,,,,"<a href=""http://mvilla.it/fenix"" rel=""nofollow"">Fenix 2</a>",757413154280181760,Mon Jul 25 03:12:23 +0000 2016,False,"You ask some dumb questions, bro",2418,26,60,0,Your Yard,Parasocial Anomaly,2136,,,False
6,1197814959129759745,https://twitter.com/Eli_LetMeIn/status/1197814959129759745,Fri Nov 22 09:51:55 +0000 2019,2019-11-22 09:51:55+00:00,Eli_LetMeIn,"Let the right one in,\nLet the old dreams die.\n\n-Let the right one slip in,#Morissey",original,,Morissey,,,0,,,,en,,,0,,,,"<a href=""http://twittbot.net/"" rel=""nofollow"">twittbot.net</a>",2653020834,Thu Jul 17 06:24:24 +0000 2014,False,"Let the right one in (렛미인) by John Ajvide Lindqvist (2004 novel), Tomas Alfredson (2008 film). 영화와 원작을 기반으로 한 98% 자동봇입니다. with @Oskar_LetMeIn",595,27,1,2,wherever Oskar goes,엘리,33467,,,False
7,1197790909598855168,https://twitter.com/bclelandgt/status/1197790909598855168,Fri Nov 22 08:16:21 +0000 2019,2019-11-22 08:16:21+00:00,bclelandgt,@resultprospect @bwilky23 @SP_Duckworth @history_of_punk @PunkArt1977 @johnnyramone @MarkyRamone @punkrochelle @punkrockscience Morissey wrote this song,reply,,,,,0,resultprospect,1.19776e+18,2293899000.0,en,,,0,,,,"<a href=""http://klinkerapps.com"" rel=""nofollow"">Talon Android</a>",47385962,Mon Jun 15 17:14:58 +0000 2009,False,"Veteran, historian. Research focuses on Confederacy's ties to British colonies, filibustering (the fun kind), and the seamy side of Civil War diplomacy",970,223,412,5,"Calgary, Alberta",Beau Cleland,7754,,http://filibusteringhistory.wordpress.com,False
8,1197745687447797760,https://twitter.com/TheScarlixAct/status/1197745687447797760,Fri Nov 22 05:16:39 +0000 2019,2019-11-22 05:16:39+00:00,TheScarlixAct,RT @itssophiepia: artists who have never won a single grammy:\nbackstreet boys\nnicki minaj\nkaty perry\nbob marley\ntupac\nguns n roses\nsnoop do…,retweet,,,,,0,,,,en,,,6,1.197532e+18,itssophiepia,31686780.0,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",700867512745865216,Sat Feb 20 02:20:11 +0000 2016,False,"""Find Your Freedom In The Music"" -Lady Gaga, Dance In the Dark, 2009 [Fan account]",33405,553,1462,0,Canada,G A G A • LG6,58337,,,False
9,1197699913477279744,https://twitter.com/voidflavour/status/1197699913477279744,Fri Nov 22 02:14:46 +0000 2019,2019-11-22 02:14:46+00:00,voidflavour,morissey however..... ❌ https://t.co/87R2VD1L86,quote,,,,https://twitter.com/voidflavour/status/1197699693725155328,0,,,,en,,False,0,1.1977e+18,voidflavour,7.351593e+17,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",735159319818162176,Tue May 24 17:23:35 +0000 2016,False,"love, you shine like verdigris 💚",39532,112,210,5,,iero chaddi is bacc,38541,,,False


## See only select columns

In [7]:
moz[['created_at', 'tweet_type', 'text', 'user_name', 'user_screen_name' , 'user_location', 'hashtags', 'urls', 'retweet_count']].head(100)

Unnamed: 0,created_at,tweet_type,text,user_name,user_screen_name,user_location,hashtags,urls,retweet_count
0,Fri Nov 22 15:25:28 +0000 2019,original,Guess I'll see what this Morissey guy is all about.,Richard Mitchell,TheRichardM,"Tulsa, OK",,,0
1,Fri Nov 22 15:16:29 +0000 2019,reply,@MarieResists52 There's are Morissey and Ramones versions too,PLAYING POLITICS,PlayingPolitix,"Des Moines, IA",,,0
2,Fri Nov 22 14:58:50 +0000 2019,original,"Based on my playlist, Spotify thinks I’m interested in Morissey #imnot #fixyouralgorithms",based hogling,tinycalamities,Copenhagen(-ish),imnot fixyouralgorithms,,0
3,Fri Nov 22 14:50:02 +0000 2019,reply,"@jeffhorowitzPMP @victoryrhoad @pattonoswalt Yeah, sadly. Even more sadly, Morissey is a total bellend.",mathead,RealMathead,Aachen,,,0
4,Fri Nov 22 12:27:03 +0000 2019,reply,"@TheEbonyMaw I can't remember, but I think it was a woman. \n\nThere were several lefty celebs including Jack Morissey who tweeted that they should be fed to a woodchipper.\n\nLOL\n\nSue them all.",Brick,NeuroBrick,,,,0
5,Fri Nov 22 11:36:09 +0000 2019,reply,@GrogsGamut There are a lot of ways to mock this but the Morissey song rules them all,Parasocial Anomaly,inceltown,Your Yard,,,0
6,Fri Nov 22 09:51:55 +0000 2019,original,"Let the right one in,\nLet the old dreams die.\n\n-Let the right one slip in,#Morissey",엘리,Eli_LetMeIn,wherever Oskar goes,Morissey,,0
7,Fri Nov 22 08:16:21 +0000 2019,reply,@resultprospect @bwilky23 @SP_Duckworth @history_of_punk @PunkArt1977 @johnnyramone @MarkyRamone @punkrochelle @punkrockscience Morissey wrote this song,Beau Cleland,bclelandgt,"Calgary, Alberta",,,0
8,Fri Nov 22 05:16:39 +0000 2019,retweet,RT @itssophiepia: artists who have never won a single grammy:\nbackstreet boys\nnicki minaj\nkaty perry\nbob marley\ntupac\nguns n roses\nsnoop do…,G A G A • LG6,TheScarlixAct,Canada,,,6
9,Fri Nov 22 02:14:46 +0000 2019,quote,morissey however..... ❌ https://t.co/87R2VD1L86,iero chaddi is bacc,voidflavour,,,https://twitter.com/voidflavour/status/1197699693725155328,0


## Finding Keywords/Features

Applying a few basic heuristics can help surface ideas for relevant keywords.

### Sample Word Frequencies:

In [23]:
import random
import re
from collections import Counter

In [24]:
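# Token pattern: a run of word characters that may contain internal hyphens or apostrophes
wp_Mimno = re.compile(r"\w[\w\-']*\w|\w")
# Draw 10 tweets at random (requires at least 10 rows in moz)
samples = random.sample(list(moz["text"]), 10)
bag_of_words = []
for item in samples:
    # Lowercase every token so the counts are case-insensitive
    for token in wp_Mimno.findall(item):
        bag_of_words.append(token.lower())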

In [25]:
bag_of_words

['donttrythis',
 'it',
 'turns',
 'out',
 'that',
 'morissey',
 'is',
 'massively',
 'racist',
 'so',
 'this',
 'might',
 'be',
 'more',
 'appropriate',
 'than',
 'you',
 'might',
 'think',
 'bludazza17',
 'i',
 'like',
 'the',
 'pogues',
 'and',
 'the',
 'dubliners',
 'etc',
 'great',
 'stuff',
 'imo',
 'so',
 'we',
 'have',
 'to',
 'agree',
 'if',
 'you',
 'look',
 'good',
 'and',
 "can't",
 'play',
 'an',
 'instrument',
 'at',
 'least',
 'make',
 'the',
 'lyrics',
 'good',
 'bob',
 'dylan',
 'lennon',
 'mccartney',
 'morissey',
 'esq',
 'even',
 'though',
 'your',
 'only',
 'fkin',
 '19',
 'rt',
 'mefbama',
 'democrats',
 'claim',
 'they',
 'care',
 'about',
 'children',
 'but',
 'then',
 'elect',
 'child',
 'predators',
 'joe',
 'morissey',
 'think',
 'they',
 'really',
 'care',
 'about',
 'your',
 'child',
 'um',
 'bocado',
 'de',
 'gente',
 'respondendo',
 'aquele',
 'tweet',
 'sobre',
 'n',
 'ter',
 'arte',
 'de',
 'direita',
 'e',
 'até',
 'agr',
 'n',
 'vi',
 'nada',
 'de',
 ...

In [26]:
word_count = dict(Counter(bag_of_words))

In [29]:
word_count

{'donttrythis': 1,
 'it': 1,
 'turns': 1,
 'out': 1,
 'that': 2,
 'morissey': 9,
 'is': 1,
 'massively': 1,
 'racist': 1,
 'so': 2,
 'this': 2,
 'might': 2,
 'be': 1,
 'more': 1,
 'appropriate': 1,
 'than': 1,
 'you': 2,
 'think': 2,
 'bludazza17': 1,
 'i': 2,
 'like': 1,
 'the': 4,
 'pogues': 1,
 'and': 3,
 'dubliners': 1,
 'etc': 1,
 'great': 1,
 'stuff': 1,
 'imo': 1,
 'we': 1,
 'have': 4,
 'to': 1,
 'agree': 1,
 'if': 1,
 'look': 1,
 'good': 2,
 "can't": 1,
 'play': 1,
 'an': 1,
 'instrument': 1,
 'at': 1,
 'least': 1,
 'make': 1,
 'lyrics': 1,
 'bob': 3,
 'dylan': 1,
 'lennon': 1,
 'mccartney': 1,
 'esq': 1,
 'even': 1,
 'though': 1,
 'your': 2,
 'only': 1,
 'fkin': 1,
 '19': 1,
 'rt': 2,
 'mefbama': 1,
 'democrats': 1,
 'claim': 1,
 'they': 3,
 'care': 2,
 'about': 2,
 'children': 1,
 'but': 1,
 'then': 1,
 'elect': 1,
 'child': 2,
 'predators': 1,
 'joe': 1,
 'really': 1,
 'um': 2,
 'bocado': 1,
 'de': 3,
 'gente': 1,
 'respondendo': 1,
 'aquele': 1,
 'tweet': 1,
 'sobre': 1,
 ...
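Raw counts like these are dominated by function words, so a natural next step is to drop stopwords and rank what remains. The sketch below reuses the token pattern from the cell above and `Counter.most_common`; the stopword set is a tiny hypothetical list chosen for illustration, and the two example tweets are taken from the table earlier in this notebook.

```python
import re
from collections import Counter

# Same token pattern as in the sampling cell, written as a raw string.
TOKEN = re.compile(r"\w[\w\-']*\w|\w")
# A tiny, hypothetical stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "i", "this", "that"}

def top_words(texts, n=5):
    """Return the n most common lowercase tokens, skipping stopwords."""
    counts = Counter(
        token.lower()
        for text in texts
        for token in TOKEN.findall(text)
        if token.lower() not in STOPWORDS
    )
    return counts.most_common(n)

example = [
    "Guess I'll see what this Morissey guy is all about.",
    "Morissey wrote this song",
]
print(top_words(example, n=1))  # [('morissey', 2)]
```

Calling `top_words(moz["text"], n=20)` would rank tokens across the whole collection rather than a random sample of 10.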

### Guess demographic

In [39]:
!pip install gender-guesser

Collecting gender-guesser
  Downloading https://files.pythonhosted.org/packages/13/fb/3f2aac40cd2421e164cab1668e0ca10685fcf896bd6b3671088f8aab356e/gender_guesser-0.4.0-py2.py3-none-any.whl (379kB)
Installing collected packages: gender-guesser
Successfully installed gender-guesser-0.4.0


In [40]:
import gender_guesser.detector as gender

In [45]:
samples = random.sample([tweet for tweet in moz["user_name"]], 10)

In [46]:
# As the sample shows, display names are often too messy for name-based guessing to be useful
samples

['Arya H.H. Achsan',
 'Michael Greer',
 'eduardadedidodu',
 'Brittney Barlett',
 'Mme Z (Goose in the streets, Aughra in the sheets)',
 'Aylin ✨',
 'Sam H',
 'avant-garbage',
 'Steve, Term 4 & surviving',
 'Ze Bestfriend']

In [53]:
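import gender_guesser.detector as gender

detector = gender.Detector()
for item in samples:
    # Guess from the first whitespace-separated token of the display name.
    # Punctuation is not stripped, so a token like "Steve," comes back 'unknown'.
    first_token = item.split()[0]
    print(first_token, detector.get_gender(first_token))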

Arya unknown
Michael male
eduardadedidodu unknown
Brittney female
Mme unknown
Aylin female
Sam mostly_male
avant-garbage unknown
Steve, unknown
Ze andy


In [48]:
detector.get_gender("Malcolm")

'male'

In [49]:
detector.get_gender("Malcolmina")

'unknown'
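One reason for the `unknown` results above is punctuation attached to the first token, as in `Steve,`. The sketch below shows one way to clean the token before passing it to the detector. The helper `first_name_token` is hypothetical, uses only the standard library, and does nothing about the deeper problem that many display names are not personal names at all.

```python
import re

def first_name_token(display_name):
    """Return the first token of a display name with leading and trailing
    punctuation, digits, and symbols removed, or '' if nothing is left."""
    parts = display_name.split()
    if not parts:
        return ""
    # Strip non-letter characters from both ends; internal hyphens are kept.
    return re.sub(r"^[\W\d_]+|[\W\d_]+$", "", parts[0])

print(first_name_token("Steve, Term 4 & surviving"))  # Steve
```

The cleaned token can then be passed to `detector.get_gender(...)` in place of the raw first token.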

### Time of tweet

In [56]:
samples = random.sample([tweet for tweet in moz["created_at"]], 10)

In [57]:
samples

['Wed Nov 20 18:04:42 +0000 2019',
 'Tue Nov 19 01:11:48 +0000 2019',
 'Thu Nov 21 16:20:08 +0000 2019',
 'Sun Nov 17 15:38:53 +0000 2019',
 'Fri Nov 15 13:37:14 +0000 2019',
 'Fri Nov 22 14:50:02 +0000 2019',
 'Wed Nov 20 15:26:36 +0000 2019',
 'Thu Nov 21 20:17:59 +0000 2019',
 'Mon Nov 18 21:02:07 +0000 2019',
 'Thu Nov 21 15:22:25 +0000 2019']

In [63]:
early_birds = 0
ok_nooners = 0
night_owls = 0

for item in samples:
    # created_at looks like 'Wed Nov 20 18:04:42 +0000 2019';
    # field 3 is HH:MM:SS and the +0000 offset means the time is UTC
    hour = int(item.split()[3].split(":")[0])
    if hour >= 18 or hour <= 3:
        night_owls += 1
    elif hour >= 11:
        ok_nooners += 1
    else:
        # hours 4 through 10
        early_birds += 1

In [64]:
early_birds

0

In [65]:
ok_nooners

6

In [66]:
night_owls

4
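Splitting the string by hand works, but the standard library can parse the whole timestamp. The sketch below uses `datetime.strptime` with a format string matching the `created_at` layout shown above, then buckets by hour: 18:00 to 03:59 counts as night, 11:00 to 17:59 as midday, and everything else as early. The function name `time_bucket` is hypothetical, and the hours are UTC, not the local time of the person tweeting.

```python
from collections import Counter
from datetime import datetime

# Layout of the created_at field, e.g. 'Wed Nov 20 18:04:42 +0000 2019'
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def time_bucket(created_at):
    """Label a timestamp by its hour: night (18-3), midday (11-17), or early (4-10)."""
    hour = datetime.strptime(created_at, CREATED_AT_FORMAT).hour
    if hour >= 18 or hour <= 3:
        return "night_owls"
    if hour >= 11:
        return "ok_nooners"
    return "early_birds"

samples = [
    "Wed Nov 20 18:04:42 +0000 2019",
    "Tue Nov 19 01:11:48 +0000 2019",
    "Thu Nov 21 16:20:08 +0000 2019",
    "Sun Nov 17 15:38:53 +0000 2019",
    "Fri Nov 15 13:37:14 +0000 2019",
    "Fri Nov 22 14:50:02 +0000 2019",
    "Wed Nov 20 15:26:36 +0000 2019",
    "Thu Nov 21 20:17:59 +0000 2019",
    "Mon Nov 18 21:02:07 +0000 2019",
    "Thu Nov 21 15:22:25 +0000 2019",
]
buckets = Counter(time_bucket(s) for s in samples)
print(buckets["early_birds"], buckets["ok_nooners"], buckets["night_owls"])  # 0 6 4
```

Returning a label instead of incrementing three separate counters makes it easy to add the bucket as a new column, for example with `moz["created_at"].map(time_bucket)`.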

# **Basic Tweet Analysis**

### Twarc utilities `twarc/utils`

## Identify Top Hashtags `twarc/utils/tags.py`

In [None]:
!python twarc/utils/tags.py ok_boomer_search.jsonl
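For a sense of what a hashtag tally involves, here is a minimal standard-library sketch. It is not the `tags.py` implementation: the function name `count_hashtags` is hypothetical, it assumes the Twitter API v1.1 layout in which each tweet has an `entities.hashtags` list of objects with a `text` key, and it does not look inside retweeted or quoted tweets.

```python
import json
from collections import Counter

def count_hashtags(lines):
    """Tally lowercase hashtags across an iterable of JSON-encoded tweets."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        # Assumed v1.1 layout: entities.hashtags is a list of {"text": ...} dicts.
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts

demo = [
    '{"entities": {"hashtags": [{"text": "imnot"}, {"text": "fixyouralgorithms"}]}}',
    '{"entities": {"hashtags": [{"text": "ImNot"}]}}',
]
print(count_hashtags(demo).most_common(1))  # [('imnot', 2)]
```

Because the function accepts any iterable of lines, an open file handle on a `.jsonl` file can be passed to it directly.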

## Create a Word Cloud `twarc/utils/wordcloud.py`

In [118]:
!python twarc/utils/wordcloud.py ok_boomer_search.jsonl > ok_boomer_search.html

[ok_boomer_search.html](ok_boomer_search.html)

In [121]:
!python twarc/utils/wordcloud.py ok_boomer_filter.jsonl > ok_boomer_filter.html

View your word cloud:

[ok_boomer_filter.html](ok_boomer_filter.html)

## Identify Top Emojis `twarc/utils/emojis.py`

In [None]:
!python twarc/utils/emojis.py ok_boomer_search.jsonl | head -n 20