# DSI Summer Workshops Series

## July 26, 2018

Peggy Lindner<br>
Center for Advanced Computing & Data Science (CACDS)<br>
Data Science Institute (DSI)<br>
University of Houston  
plindner@uh.edu 

Please make sure you have JupyterHub running with support for R and all the required packages installed. Data for this and other tutorials can be found in the GitHub repository for the Summer 2018 DSI Workshops: https://github.com/peggylind/Materials_Summer2018

## Data Mining Twitter Data

Understand the basics of Twitter data mining using R.

### Twitter

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/twitter-start.png)



### Techniques

* Text Mining
* Topic Modeling
* Sentiment Analysis

### Tools

* Twitter API
* R and specifically the following packages

* [twitteR] Twitter data extraction
* [tm](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) Text cleaning and mining
* [topicmodels](https://www.tidytextmining.com/topicmodeling.html) Topic Modeling

...

Visualization

* [ggplot2](http://ggplot2.tidyverse.org/) Modern R visualizations
* [wordcloud](https://cran.r-project.org/package=wordcloud) Make some nice word clouds
* [RColorBrewer](https://cran.r-project.org/package=RColorBrewer) Get color into your visualizations

...


### Process
1. Extract tweets and followers from the Twitter website with R and the twitteR package
2. With the tm package, clean the text by removing punctuation, numbers, hyperlinks and stop words, followed by stemming and stem completion
3. Build a term-document matrix
4. Analyse topics with the topicmodels package
5. Analyse sentiment with the sentiment140 package
6. Analyse following/followed and retweeting relationships with the igraph package

### Using existing Twitter data within this tutorial

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/twitter-account.png)

`# you could download Twitter data manually from site
#url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
#download.file(url, destfile = "RDataMining-Tweets-20160212.rds") `



### Retrieve Tweets

In [None]:
#load the twitteR library
library(twitteR)

#### A) using the Twitter API

The following code is merely an abstract example. You will have to learn more about the Twitter API and how to use it at: [Twitter Developer](https://developer.twitter.com/en.html)

And prepare your Twitter account: https://towardsdatascience.com/setting-up-twitter-for-text-mining-in-r-bcfc5ba910f4


In [None]:
# This code will not run!
# Change the next four lines based on your own consumer_key, consumer_secret, access_token, and access_secret.
consumer_key <- "dfgbfdbhe"
consumer_secret <- "fdbdbh"
access_token <- "dfbhdf"
access_secret <- "fbhfd"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
# search for up to 10,000 tweets and convert the result list to a data frame
tw <- twitteR::searchTwitter('#realDonaldTrump + #HillaryClinton', n = 1e4, since = '2016-11-08', retryOnRateLimit = 1e3)
d <- twitteR::twListToDF(tw)

#### B) load from file

In [None]:


#read the data from a file
tweets <- readRDS("dataJuly26th/RDataMining-Tweets-20160212.rds")

# let's look at what we got
tweets

#### Let's explore

In [None]:
# number of tweets in dataset
(n.tweet <- length(tweets))

In [None]:
# convert to data frame
tweets.df <- twListToDF(tweets)

In [None]:
# look at tweet #190
tweets.df[190, c("id", "created", "screenName", "replyToSN", "favoriteCount", "retweetCount", "longitude", "latitude", "text")]

#### Text Cleaning 

In [None]:
# we will use the tm library
library(tm) 

In [None]:
# build a corpus, and specify the source to be character vectors
myCorpus <- Corpus(VectorSource(tweets.df$text))

#what did we just create?
myCorpus

In [None]:
# print tweet #3 and make text fit for slide width
writeLines(strwrap(tweets.df$text[3], 60))

In [None]:
# print tweet #190 and make text fit for slide width
writeLines(strwrap(tweets.df$text[190], 60))

# convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

writeLines(strwrap(myCorpus[[190]]$content, 60))

In [None]:
# remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))

writeLines(strwrap(myCorpus[[190]]$content, 60))

In [None]:
# remove anything other than English letters or space
removePunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removePunct))
myCorpus <- tm_map(myCorpus, removeNumbers)

writeLines(strwrap(myCorpus[[190]]$content, 60))

In [None]:
# remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
                 "use", "see", "used", "via", "amp")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

writeLines(strwrap(myCorpus[[190]]$content, 60))

In [None]:
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)

writeLines(strwrap(myCorpus[[190]]$content, 60))
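
To see what the individual cleaning steps do, it can help to run the same regular expressions on a single string in base R, without tm. The tweet and the tiny stop word list below are made up for illustration; the two gsub() patterns are the ones used in the cells above.

```r
# Hypothetical example tweet, not taken from the workshop data
tweet <- "Check out 3 NEW R packages for #DataMining: http://t.co/abc123 &amp; more!"

x <- tolower(tweet)                          # lower case
x <- gsub("http[^[:space:]]*", "", x)        # drop URLs
x <- gsub("[^[:alpha:][:space:]]*", "", x)   # keep letters and whitespace only

# tiny stand-in for the tm stop word list (assumption for this sketch)
stop_words <- c("for", "out", "amp", "more")
stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
x <- gsub(stop_pattern, "", x, perl = TRUE)  # drop whole-word stop words

x <- trimws(gsub("[[:space:]]+", " ", x))    # collapse runs of whitespace
print(x)
```

Note that the order matters: URLs have to go before punctuation is stripped, otherwise "http://t.co/abc123" would be turned into the word "httptcoabc" and survive the cleaning.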

In [None]:
# work on a copy so that myCorpus itself stays unstemmed
myCorpusCopy <- myCorpus

In [None]:
myCorpusCopy <- tm_map(myCorpusCopy, stemDocument) # stem words
writeLines(strwrap(myCorpusCopy[[190]]$content, 60))

In [None]:
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary=dictionary)
  x <- paste(x, sep="", collapse=" ")
  PlainTextDocument(stripWhitespace(x))
}
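# complete the stems against the unstemmed corpus; the stemmed copy contains no full words to complete to
myCorpusCopy <- lapply(myCorpusCopy, stemCompletion2, dictionary=myCorpus)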
myCorpusCopy <- Corpus(VectorSource(myCorpusCopy))
writeLines(strwrap(myCorpusCopy[[190]]$content, 60))
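
To get a feeling for what stem completion does, here is a rough base R sketch of the idea behind the default "prevalent" strategy: each stem is replaced by the most frequent dictionary word that starts with it. The helper complete_stem() and the toy dictionary are made up for illustration; this is not the tm implementation.

```r
# Hypothetical helper: complete each stem with the most frequent
# dictionary word that starts with it; keep the stem if nothing matches.
complete_stem <- function(stems, dictionary) {
  counts <- sort(table(dictionary), decreasing = TRUE)
  vapply(stems, function(s) {
    hits <- names(counts)[startsWith(names(counts), s)]
    if (length(hits) > 0) hits[1] else s
  }, character(1), USE.NAMES = FALSE)
}

# toy dictionary of unstemmed words (assumption for this sketch)
dictionary <- c("mining", "mining", "miner", "science", "analysis")
completed <- complete_stem(c("mine", "min", "scienc", "xyz"), dictionary)
print(completed)
```

The stem "mine" is completed to "miner", because "mining" does not start with "mine". Since the English stemmer typically reduces "mining" to "mine", completion can hand back a different word than the one you started with.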

#### Issues in Stem completion

In [None]:
# let's count the tweets that contain a word starting with `word` to see what is going on with this stemming
wordFreq <- function(corpus, word) {
  results <- lapply(corpus,
                    function(x) { grep(as.character(x), pattern=paste0("\\<",word)) }
  )
  sum(unlist(results))
}
n.miner <- wordFreq(myCorpusCopy, "miner")
n.mining <- wordFreq(myCorpusCopy, "mining")
cat(n.miner, n.mining)

In [None]:
# solution: replace oldword with newword (to fix stemming issue)
# \\< and \\> anchor the match to whole words, so "scienc" does not also match inside "science"
replaceWord <- function(corpus, oldword, newword) {
  tm_map(corpus, content_transformer(gsub),
         pattern=paste0("\\<", oldword, "\\>"), replacement=newword)
}
myCorpus <- replaceWord(myCorpus, "miner", "mining")
myCorpus <- replaceWord(myCorpus, "universidad", "university")
myCorpus <- replaceWord(myCorpus, "scienc", "science")

writeLines(strwrap(myCorpus[[190]]$content, 60))
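
A note on the patterns passed to grep() and gsub(): a plain pattern matches anywhere, including inside longer words, while the anchors `\\<` and `\\>` tie it to the start and end of a word. A small base R sketch with made-up strings:

```r
words <- c("miner", "data miners", "examiner", "mining")

# "\\<miner" only matches where a word starts with "miner"
starts_with_miner <- grepl("\\<miner", words)
print(starts_with_miner)

# without anchors the pattern also matches inside the full word
unanchored <- gsub("scienc", "science", "data science")
print(unanchored)

# with both anchors only the bare stem is replaced
anchored <- gsub("\\<scienc\\>", "science", c("data scienc", "data science"))
print(anchored)
```

The unanchored call turns "science" into "sciencee", which is easy to miss in a large corpus.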

#### Finally! Ready to build a term-document matrix

In [None]:
tdm <- TermDocumentMatrix(myCorpus,
                          control = list(wordLengths = c(1, Inf)))
tdm

# look at the term-document matrix (rows are terms, columns are tweets)
idx <- which(dimnames(tdm)$Terms %in% c("r", "data", "mining"))
as.matrix(tdm[idx, 21:30])
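
If the structure of a term-document matrix is new to you, the following base R sketch builds one by hand for three made-up documents: one row per term, one column per document, each cell holding a count. The names docs and tdm_toy are only used for this illustration. By default tm drops terms shorter than three characters, which is why the call above sets wordLengths = c(1, Inf) to keep "r".

```r
# three toy documents (assumption for this sketch)
docs <- c("r data mining", "data science with r", "mining big data data")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# count every term in every document: rows = terms, columns = documents
tdm_toy <- vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = terms))),
  integer(length(terms))
)
dimnames(tdm_toy) <- list(Terms = terms, Docs = as.character(seq_along(docs)))

print(tdm_toy)
print(rowSums(tdm_toy))  # total frequency of each term
```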

#### Let's look at the Top Frequent Terms

In [None]:
(freq.terms <- findFreqTerms(tdm, lowfreq = 20))

# sum up the term-document matrix by rows
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 20)
# prepare sums for plotting
df <- data.frame(term = names(term.freq), freq = term.freq)

df

#### And visualize those results

In [None]:
# create a bar chart of word frequencies
library(ggplot2)
ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text=element_text(size=7))
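
The bars above appear in alphabetical order of the terms. If you would rather have them sorted by count, one option is to turn the term column into a factor whose levels are ordered by frequency, since ggplot2 draws discrete axes in level order. A minimal sketch with a made-up data frame (df_toy is not part of the workshop data):

```r
# toy frequency table (assumption for this sketch)
df_toy <- data.frame(term = c("data", "r", "mining"),
                     freq = c(40, 25, 30),
                     stringsAsFactors = FALSE)

# order the factor levels by frequency, lowest first
df_toy$term <- reorder(df_toy$term, df_toy$freq)
print(levels(df_toy$term))
```

With coord_flip() the last level ends up at the top, so the most frequent term would be the top bar.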

#### Want something more colorful and playful?

In [None]:
#prep
m <- as.matrix(tdm)
# calculate the frequency of words and sort it by frequency
word.freq <- sort(rowSums(m), decreasing = TRUE)

word.freq
# colors
library(RColorBrewer)
pal <- brewer.pal(9, "BuGn")[-(1:4)]

pal

In [None]:
# plot word cloud
library(wordcloud)
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3,
          random.order = FALSE, colors = pal)

#### Word Associations

Another way to explore word relationships is the findAssocs() function in the tm package. For a given term, findAssocs() calculates its correlation with every other term in a TDM or DTM and returns those at or above the limit you pass (0.2 below), so the reported scores lie between that limit and 1. A score of 1 means that two terms always appear together in documents, while a score approaching 0 means that the occurrence of one term says little about the occurrence of the other.

Keep in mind that findAssocs() works at the document level: terms are compared through their counts in each document (here, each tweet), not through their position within a document.

In [None]:
# which words are associated with 'r'?
findAssocs(tdm, "r", 0.2)

findAssocs(tdm, "data", 0.2)
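
The idea behind these scores can be reproduced in base R on a small hand-made matrix: correlate the row of one term with the rows of all other terms and keep the values at or above the limit. This is only a sketch of the concept with invented counts, not the tm implementation.

```r
# toy term-document matrix: 4 terms x 5 documents (assumption for this sketch)
tdm_toy <- rbind(
  r      = c(1, 1, 0, 0, 1),
  code   = c(1, 1, 0, 0, 1),
  data   = c(1, 0, 1, 1, 1),
  tweets = c(0, 0, 1, 1, 0)
)

# correlation of "r" with every other term, computed across documents
assoc_r <- cor(t(tdm_toy))["r", ]
assoc_r <- assoc_r[names(assoc_r) != "r"]
print(round(assoc_r, 2))

# keep only the associations at or above the limit, as findAssocs() does
print(assoc_r[assoc_r >= 0.2])
```

Here "code" occurs in exactly the same documents as "r" and gets a score of 1, while "tweets" never occurs together with "r" and ends up with a negative correlation that falls below the limit.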

#### Network of terms

Once a few interesting term correlations have been identified, it can be useful to draw them as a network with the plot() function. If no terms are given, plot() picks a random sample of terms; an edge is drawn between two terms whose correlation reaches the chosen threshold (corThreshold), e.g.:

In [None]:
# network of terms (graph and Rgraphviz are Bioconductor packages)
library(graph)
library(Rgraphviz)
plot(tdm, terms = freq.terms, corThreshold = 0.1, weighting = TRUE)

In [None]:
plot(tdm, terms = names(findAssocs(tdm, terms = "data", 0.2)[["data"]]), corThreshold = 0.3)

#### Topic Modeling

In [None]:
dtm <- as.DocumentTermMatrix(tdm)
library(topicmodels)
lda <- LDA(dtm, k = 8) # find 8 topics
term <- terms(lda, 7) # first 7 terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
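
One thing to watch out for: LDA() may stop with an error when a row of the document-term matrix contains only zeros, which can happen if cleaning removed every word of a tweet. The sketch below shows the filtering idea on a plain base R matrix with invented counts. For the real dtm a sparse-aware row sum such as slam::row_sums() is probably the better choice, and the same filter would have to be applied to tweets.df so that dates and topics still line up.

```r
# toy document-term matrix: rows = documents, columns = terms (assumption)
dtm_toy <- rbind(
  "1" = c(data = 2, mining = 1, r = 0),
  "2" = c(data = 0, mining = 0, r = 0),  # tweet emptied by the cleaning steps
  "3" = c(data = 1, mining = 0, r = 3)
)

# keep only documents that still contain at least one term
keep <- rowSums(dtm_toy) > 0
dtm_nonempty <- dtm_toy[keep, , drop = FALSE]
print(dtm_nonempty)
```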

In [None]:
library(data.table)
topics <- topics(lda) # 1st topic identified for every document (tweet)
topics <- data.frame(date=as.IDate(tweets.df$created), topic=topics)
ggplot(topics, aes(date, fill = term[topic])) +
  geom_density(position = "stack")

#### Sentiment Analysis

In [None]:
# different way to install a package: directly from GitHub
#require(devtools)
#install_github("okugami79/sentiment140")

library(sentiment)

sentiments <- sentiment(tweets.df$text)
table(sentiments$polarity)

In [None]:
# sentiment plot
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
sentiments$date <- as.IDate(tweets.df$created)
result <- aggregate(score ~ date, data = sentiments, sum)
plot(result, type = "l")
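
The scoring step above maps each polarity label to a number and then sums the numbers per day. The same logic on a small made-up data frame, using only base R (sent_toy and score_map are names invented for this sketch):

```r
# toy sentiment results (assumption for this sketch)
sent_toy <- data.frame(
  date = as.Date(c("2016-02-10", "2016-02-10", "2016-02-11",
                   "2016-02-11", "2016-02-12")),
  polarity = c("positive", "negative", "positive", "positive", "neutral"),
  stringsAsFactors = FALSE
)

# positive = 1, neutral = 0, negative = -1
score_map <- c(positive = 1, neutral = 0, negative = -1)
sent_toy$score <- unname(score_map[sent_toy$polarity])

# net sentiment per day
daily <- aggregate(score ~ date, data = sent_toy, FUN = sum)
print(daily)
```

A daily sum of 0 can mean "only neutral tweets" or "as many positive as negative tweets", so it is worth looking at the table of polarities as well.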

#### Top retweeted tweets

In [None]:
# select top retweeted tweets
table(tweets.df$retweetCount)
selected <- which(tweets.df$retweetCount >= 9)
# plot them
dates <- strptime(tweets.df$created, format="%Y-%m-%d")
plot(x=dates, y=tweets.df$retweetCount, type="l", col="grey",
     xlab="Date", ylab="Times retweeted")
colors <- rainbow(10)[seq_along(selected)]
points(dates[selected], tweets.df$retweetCount[selected],
       pch=19, col=colors)
text(dates[selected], tweets.df$retweetCount[selected],
     tweets.df$text[selected], col=colors, cex=.9)

#### Many more things that one wants to explore with Twitter data

e.g. Retrieve User Info and Followers

This Tutorial is based on:
Yanchang Zhao http://www.rdatamining.com/docs/twitter-analysis-with-r