Mine tweets with Python and parse/analyze with R and a Shiny Dashboard
Latest commit c8f49ae Mar 13, 2018



Collect tweets with Python, parse them with R/TidyVerse, and generate a Shiny dashboard for high-level analysis and visualization.

For dashboard examples, see:

See my blog post, "Mining Twitter data with R, TidyText, and TAGS", for analysis examples (from an older version).

Instructions (Python & Tweepy)


twitter_stream.py is a Python script that uses Tweepy to fetch a live stream of tweets matching a search string. Register for a Twitter developer account and add your authentication details to twitter_authentication.py. To start, edit the following lines with your search query and output file name:

search_query = ['Twitter','@twitter', '#ilovehashtags']
filename = 'data/sources/stream-' + str(datetime.datetime.now()).replace(' ', '_').split('.')[0] + '.csv'

Then run the script locally, or on a virtual private server for extended collection periods (using nohup to allow the process to continue after you close the connection).

Press CTRL-C to stop the script (on your local computer). Use pgrep and kill to stop a process running with nohup on your virtual private server.
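To see what the default filename expression shown earlier evaluates to, here is a minimal sketch that wraps it in a function. The function names stream_filename and stream_filename_strftime are hypothetical and are not part of twitter_stream.py; the second one is only an assumed, equivalent way to build the same string.

```python
import datetime

def stream_filename(now=None):
    """Build the output path with the same expression used in the script."""
    if now is None:
        now = datetime.datetime.now()
    # str(now) looks like '2018-03-13 14:30:05.123456':
    # replace the space with '_' and drop the microseconds after the '.'.
    return 'data/sources/stream-' + str(now).replace(' ', '_').split('.')[0] + '.csv'

def stream_filename_strftime(now=None):
    """Hypothetical alternative: the same path built with strftime."""
    if now is None:
        now = datetime.datetime.now()
    return 'data/sources/stream-' + now.strftime('%Y-%m-%d_%H:%M:%S') + '.csv'

example = datetime.datetime(2018, 3, 13, 14, 30, 5, 123456)
print(stream_filename(example))  # data/sources/stream-2018-03-13_14:30:05.csv
```

Note that the resulting name contains colons, which are not allowed in Windows file names. If you collect on Windows, a format such as '%Y-%m-%d_%H-%M-%S' in the strftime variant would avoid them.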

Searching Twitter history

twitter_search.py works like twitter_stream.py, except that it searches backwards in time, as far as the Twitter API will allow. Add your Twitter developer credentials to twitter_authentication.py, then run it from the command line with the same command as above. Used together with twitter_stream.py, it should collect as many tweets as possible for a given search query, going both backwards and forwards in time.

Following this example, it uses AppAuthHandler to increase the maximum number of tweets that can be downloaded per 15 minutes. It is set to pause when the API times out and then resume, so it can keep running until Twitter's historical limits kick in, potentially returning millions of tweets, depending on the popularity of the search terms.
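The backward search with pauses described above can be sketched roughly as follows. This is not the code in twitter_search.py: it assumes that "times out" means hitting the rate limit, and it assumes results are paged by passing a decreasing max_id, which is how the standard search API is commonly walked backwards. RateLimited and fetch_page are hypothetical stand-ins for whatever the API client raises and calls, so the sketch runs without Tweepy or network access.

```python
import time

class RateLimited(Exception):
    """Hypothetical stand-in for the error raised when the rate limit is hit."""

def collect_backwards(fetch_page, pause_seconds=15 * 60, sleep=time.sleep):
    """Page backwards through search results until nothing more is returned.

    fetch_page(max_id) is a hypothetical callable that returns a list of
    dicts with an integer 'id', containing only tweets with id <= max_id
    (or the most recent tweets when max_id is None).
    """
    tweets = []
    max_id = None
    while True:
        try:
            page = fetch_page(max_id)
        except RateLimited:
            # Wait out the rate-limit window, then retry the same page.
            sleep(pause_seconds)
            continue
        if not page:
            break  # no older tweets available
        tweets.extend(page)
        # Ask for strictly older tweets on the next request.
        max_id = min(tweet['id'] for tweet in page) - 1
    return tweets
```

Passing sleep as a parameter is a design choice made only so the pause can be replaced in a dry run; a real script would simply call time.sleep.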

Analyzing Python/Tweepy results with R

mine_tweets.R is an R script that parses the output of the Python/Tweepy scripts. It generates a number of files in the data folder that summarize the most used words, bigrams, trigrams, and hashtags; the most frequently linked domains and URLs; the most prolific accounts; the most retweeted tweets; and so on.
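To illustrate what summaries such as "most used hashtags" and "most used bigrams" mean, here is a small Python sketch. It is not how mine_tweets.R works (that script is written in R), and the function names and the simple regex tokenization are assumptions made for the example.

```python
import re
from collections import Counter

def top_hashtags(texts, n=3):
    """Count hashtags case-insensitively and return the n most common."""
    tags = Counter()
    for text in texts:
        tags.update(tag.lower() for tag in re.findall(r'#\w+', text))
    return tags.most_common(n)

def top_bigrams(texts, n=3):
    """Count adjacent word pairs within each tweet and return the n most common."""
    pairs = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        pairs.update(zip(words, words[1:]))
    return pairs.most_common(n)

sample = ['Mining tweets with #rstats', 'More mining tweets, more #RStats and #python']
print(top_hashtags(sample, 1))  # [('#rstats', 2)]
print(top_bigrams(sample, 1))   # [(('mining', 'tweets'), 2)]
```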

The dashboard folder contains code for a Shiny dashboard that will take the output of mine_tweets.R and create a dashboard like the ones linked above for high-level analysis of trends in the collected tweets. It can be run locally, or deployed to a Shiny Server.

Happy mining!