**Warning!** The following packages take several minutes to load. 

In [None]:
install.packages(c("tm", "wordcloud", "proxy", "qdap", "rpart.plot", "SnowballC"))
library(tm)
library(wordcloud)
library(proxy)
library(stringr)
library(rpart)
library(rpart.plot)
library(SnowballC)
library(tidyverse)
tryCatch(library(qdap), error=function(x)"")
options(repr.matrix.max.cols=500)

# Twitter

<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Twitter-logo.svg/300px-Twitter-logo.svg.png" width="200"/>
</div>

In this class we will use data from Twitter to explore **natural language processing (NLP)**, a subfield of computer science and linguistics that is concerned with the statistical analysis of human language data. In this notebook we will retrieve a sample of tweets from Twitter and process the language of the tweets into features that can be used in machine learning algorithms. Then we will use these features to build a model that can predict how many retweets a given tweet will receive based on the language of the tweet.

---

The code cell below reads in tweets based on the following keywords: **Netflix** and **Squid Game**. 

In [None]:
system("gdown --id 1VfiKb3U9GCJ8JGali5ImCsnj6qVwImbD")
tweets <- read_csv("tweets.csv") %>% filter(language=="en")
head(tweets)

## Regular Expressions & Data Cleaning

To prepare our data for the predictive model, we want to clean the tweets by removing any non-language text that is not helpful in predicting the number of retweets. For example, tweets often contain hyperlinks to external webpages in the form of URLs that begin with $\texttt{http}$. Although all hyperlinks start with a common set of characters, the characters that follow $\texttt{http}$ are unique to each URL, so we cannot do a simple search and replace. Therefore, we need some other method to remove hyperlinks. 

---


### Introduction to Regular Expressions

A **regular expression (or regex)** is a sequence of characters that defines a general search pattern, as opposed to a specific search string. Imagine that we were to search all of our tweets for the sequence $\texttt{http}$. This would find the starting position of each URL we would like to remove, but how could we actually use this information to remove the URL completely? URLs vary in length, so we could not write a general rule like "remove $\texttt{http}$ and the next $n$ characters". Instead we need a more flexible rule of the form "remove $\texttt{http}$ and any characters that follow it *up until the next word in the text*". This is the purpose of regular expressions. 

#### Step 1: Basic regular expressions

To start simple, let's imagine we're searching for all instances of the pattern "at" within the tweet "@AlecBaldwin hates The Cat in the Hat". Here we show this string with each character labeled by its position:

|@|A|l|e|c|B|a|l|d|w|i|n|&nbsp;|h|a|t|e|s|&nbsp;|T|h|e|&nbsp;|C|a|t|&nbsp;|i|n|&nbsp;|t|h|e|&nbsp;|H|a|t|
|-|-|-|-|-|-|-|-|-|-|-|-|------|-|-|-|-|-|------|-|-|-|------|-|-|-|------|-|-|------|-|-|-|------|-|-|-|
|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|

The $\texttt{gregexpr}$($\texttt{pattern, string}$) function will find the starting index of all instances of the regular expression $\texttt{pattern}$ within $\texttt{string}$:

In [None]:
stringS1 = '@AlecBaldwin hates The Cat in the Hat'
gregexpr('at', stringS1)
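
The value returned by $\texttt{gregexpr}$() is a list with one element per input string, and each element stores the match lengths as an attribute. As a minimal sketch in base R (the names `matchInfo`, `matchStarts`, and `matchLengths` are ours, chosen for illustration), the plain positions can be pulled out like this:

```r
# Same example string as above
stringS1 <- "@AlecBaldwin hates The Cat in the Hat"

# gregexpr() returns a list; take the element for our single string
matchInfo <- gregexpr("at", stringS1)[[1]]

matchStarts  <- as.integer(matchInfo)            # starting index of each match
matchLengths <- attr(matchInfo, "match.length")  # number of characters in each match

print(matchStarts)   # 15 25 36
print(matchLengths)  # 2 2 2
```

The three starting positions line up with the "a" in "hates", "Cat", and "Hat" in the table above.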

In the above example our regular expression is not very interesting, as we are just searching for specific alphanumeric characters within the string. Imagine that instead of searching for the starting position of "at", we want the starting position of the entire word that contains "at". In our simple example, this means that our regular expression needs to match patterns of "at" preceded by a single letter. 

The regular expression $\texttt{[[:alpha:]]}$ will match any single alphabetic character (a letter), lowercase or capitalized. This means that the regular expression $\texttt{[[:alpha:]]at}$ will match "hat" (inside "hates"), "Cat", and "Hat":

In [None]:
gregexpr('[[:alpha:]]at', stringS1)

Instead of just getting the starting indices of the matches, we can get the matches themselves by wrapping our call to $\texttt{gregexpr}$() with $\texttt{regmatches}$($\texttt{string, gregexpr}$($\texttt{pattern, string}$)):

In [None]:
regmatches(stringS1, gregexpr('[[:alpha:]]at', stringS1))
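
To see how the choice of character class changes what is matched, here is a small sketch that compares three POSIX classes on a made-up string (`classString` is a hypothetical example, not part of the tweet data):

```r
# Hypothetical string with a letter and a digit in front of "at"
classString <- "cat 4at bat"

# Letters only
alphaMatches <- regmatches(classString, gregexpr("[[:alpha:]]at", classString))[[1]]
# Digits only
digitMatches <- regmatches(classString, gregexpr("[[:digit:]]at", classString))[[1]]
# Letters or digits
alnumMatches <- regmatches(classString, gregexpr("[[:alnum:]]at", classString))[[1]]

print(alphaMatches)  # "cat" "bat"
print(digitMatches)  # "4at"
print(alnumMatches)  # "cat" "4at" "bat"
```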

#### Step 2: Regular expressions with quantifiers and wildcards

Now imagine the full tweet we are processing is:
> @AlecBaldwin hates The Cat in the Hat, see https://www.rottentomatoes.com/m/cat_in_the_hat#contentReviews and http://www.rogerebert.com/reviews/dr-seuss-the-cat-in-the-hat-2003

Because we would like to remove the hyperlinks from the text, we need to compose a regular expression that will match both URLs at the end of the tweet.

We can start with the characters $\texttt{http}$, as these are common to our hyperlinks. However, some addresses follow the *secure* hypertext transfer protocol, so the $\texttt{http}$ is followed by an $\texttt{s}$. Other sites do not follow the secure version of the protocol, so $\texttt{http}$ is followed directly by  $\texttt{://www.}$. We can account for this in our regular expression by using the **quantifier** {n,m}, which means "match between n and m repetitions of the previous character". This means that the regular expression $\texttt{https$\texttt{\{}$0,1$\texttt{\}}$}$ will search for "http" with either zero or one "s" on the end. 

In [None]:
stringS2 = "@AlecBaldwin hates The Cat in the Hat, see https://www.rottentomatoes.com/m/cat_in_the_hat#contentReviews and http://www.rogerebert.com/reviews/dr-seuss-the-cat-in-the-hat-2003"
regmatches(stringS2, gregexpr('https{0,1}', stringS2))
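
The quantifier {0,1} is common enough that regular expressions also offer the shorthand ?, which means the same thing. The sketch below checks both spellings against three made-up addresses (`exampleUrls` is hypothetical; the leading ^ pins the pattern to the start of each string):

```r
# Hypothetical addresses: plain, secure, and one with a stray extra "s"
exampleUrls <- c("http://www.example.com",
                 "https://www.example.com",
                 "httpss://www.example.com")

# "s{0,1}" allows zero or one "s" before "://"
withBraces   <- grepl("^https{0,1}://", exampleUrls)
# "s?" is shorthand for the same quantifier
withQuestion <- grepl("^https?://", exampleUrls)

print(withBraces)    # TRUE TRUE FALSE
print(withQuestion)  # TRUE TRUE FALSE
```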

After the $\texttt{s}$, our URLs are the same up until the website name (ignore the fact that both websites start with $\texttt{ro}$). This means that the beginning of our regular expression could be $\texttt{https{0,1}://www.}$ 

There are several different ways we could match the rest of the URL. The simplest way would be to use the **wildcard** character "." (a period). In the context of a regular expression, a period is used to match *any possible character* (except the newline delimiter $\texttt{\\n}$). Because we are not sure what will come after $\texttt{www.}$ - it could be letters, numbers, underscores, dashes, etc. - the period can be used to match any and all of these characters.

A single period in a regular expression will match a single character of any type. However, this is not what we want; because URLs vary in length, we want our regular expression to match an undefined number of characters of any type. We cannot use the $\texttt{{n,m}}$ quantifier we saw before because we would have to specify $\texttt{n}$ and $\texttt{m}$, which are unknown. Instead we can use the \* (an asterisk) quantifier, which matches *zero or more* of the character that comes before it. By combining . and \*, the regular expression .\* means "match an arbitrarily-long sequence of any characters". 

Note that URLs always feature an actual period after $\texttt{www}$. Because the period character has a special meaning within regular expressions, if we want to match an actual period we need to "escape" its wildcard behavior with two backslashes. The regular expression $\texttt{\\\\.}$ will match an actual period. 
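
A short sketch of the difference between the wildcard and an escaped period, using two made-up strings (`candidates` is a hypothetical vector):

```r
# One string with a real period and one with an "X" in the same position
candidates <- c("www.example", "wwwXexample")

# Unescaped "." is a wildcard, so both strings match
wildcardHits <- grepl("www.example", candidates)
# Escaped "\\." only matches a literal period
escapedHits  <- grepl("www\\.example", candidates)

print(wildcardHits)  # TRUE TRUE
print(escapedHits)   # TRUE FALSE
```

The backslash is doubled because R first reads the string literal (turning `\\` into a single backslash) before the regular expression engine sees it.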

Putting all of this together, our regular expression so far is $\texttt{https{0,1}://www\\\\..*}$ Let's try applying this regular expression to our tweet:

In [None]:
regmatches(stringS2, gregexpr("https{0,1}://www\\..*", stringS2))

#### Step 3: Greedy matching

What happened? Instead of matching the two URLs separately, the regular expression matched both URLs and the characters in between as one big string. This is because we did not define the boundary of our wildcard sequence $\texttt{.*}$. This sequence will match any set of characters of any length, so after $\texttt{https://www.}$ in the first URL the expression just matched the rest of the tweet. 

To solve this we need to define the boundary of our URLs. Let's ignore the second URL for a minute and focus on the first URL, which is followed by a space. In regular expressions the character class $\texttt{[[:space:]]}$ matches any single whitespace character (such as a space, tab, or newline), so the expression $\texttt{https{0,1}://www\\\\..*[[:space:]]}$ will match $\texttt{https://www.}$ followed by any sequence of characters *up until a space*. Let's apply this to our tweet:

In [None]:
regmatches(stringS2, gregexpr("https{0,1}://www\\..*[[:space:]]", stringS2))

We are getting closer! The issue with the current expression is that it matched our first URL *plus* " and ". Why did this happen? 

By default, the asterisk \* quantifier is **greedy**, meaning it will match the longest string possible. This means that $\texttt{.*[[:space:]]}$ will match any sequence of characters up until *the last space it can find*. The last space in our string is after the "and", so the expression matches up until that point. The first space is captured in the wildcard expression .\*, which matches any type of character *including spaces*. 

We can turn greedy matching off by adding a question mark (?) directly after the asterisk (\*). This means that $\texttt{.*?[[:space:]]}$ will match any sequence of characters up until *the first space it can find*, which is what we want:

In [None]:
regmatches(stringS2, gregexpr("https{0,1}://www\\..*?[[:space:]]", stringS2))
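
The same greedy versus non-greedy behavior is easier to see on a very short string. In this sketch (`shortString` is a made-up example), both patterns start at "a" and must end at an "X", but they disagree about which "X":

```r
# Hypothetical string with two possible stopping points
shortString <- "aXbXc"

# Greedy: ".*" grabs as much as possible, so the match ends at the last "X"
greedyMatch <- regmatches(shortString, regexpr("a.*X", shortString))
# Non-greedy: ".*?" grabs as little as possible, so the match ends at the first "X"
lazyMatch   <- regmatches(shortString, regexpr("a.*?X", shortString))

print(greedyMatch)  # "aXbX"
print(lazyMatch)    # "aX"
```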

#### Step 4: Anchors and "or"

Now that we've captured the first URL, we need to modify our expression so that it also captures the second one. The issue with the current expression is that it only captures URLs that end with a space, but our second URL is not followed by a space because it comes at the end of the tweet. We can solve this by changing $\texttt{[[:space:]]}$ to the **anchor** character $\texttt{\$}$, which matches the end of the string. This means the regular expression $\texttt{https{0,1}://www\\\\..*?\$}$ will match a URL that runs to the end of the tweet. On its own, however, this expression no longer stops at spaces, so in our tweet it would match everything from the first URL to the end of the string. 

To capture both URLs, we want the end of the expression to match *either* a space *or* the end of the string. We can include **or statements** in our regular expression by separating the alternatives with the pipe character (|) and enclosing them in parentheses. The expression $\texttt{([[:space:]]}$ | $\texttt{\$)}$ will match either a space or the end of the string:

In [None]:
regmatches(stringS2, gregexpr("https{0,1}://www\\..*?([[:space:]]|$)", stringS2))

#### Step 5: Removing matches

Ultimately our goal is to remove the URLs from our tweets as part of the data cleaning process. Now that we have come up with a regular expression that matches URLs, we can remove them using the $\texttt{gsub}$($\texttt{pattern, replacement, string}$) function. This function replaces any instances of $\texttt{pattern}$ within $\texttt{string}$ with $\texttt{replacement}$. If $\texttt{pattern}$ is our regular expression for URLs and $\texttt{replacement}$ is the empty string "", $\texttt{gsub}$() will effectively remove the URLs from our tweet:

In [None]:
gsub("https{0,1}://www\\..*?([[:space:]]|$)", "", stringS2)
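
As an alternative sketch (not the approach taken in the steps above), the boundary can be expressed with a negated character class: $\texttt{[^[:space:]]}$ matches any character that is *not* whitespace, so a URL is simply "the prefix followed by as many non-space characters as possible". The name `urlPattern` is ours:

```r
stringS2 <- "@AlecBaldwin hates The Cat in the Hat, see https://www.rottentomatoes.com/m/cat_in_the_hat#contentReviews and http://www.rogerebert.com/reviews/dr-seuss-the-cat-in-the-hat-2003"

# Prefix, then any run of non-whitespace characters (greedy is fine here,
# because the run cannot cross a space)
urlPattern <- "https{0,1}://www\\.[^[:space:]]*"

urlMatches <- regmatches(stringS2, gregexpr(urlPattern, stringS2))[[1]]
print(urlMatches)

# Remove the URLs; the spaces around them are left in place
withoutUrls <- gsub(urlPattern, "", stringS2)
print(withoutUrls)
```

With this pattern the matches do not include the trailing space, so removing them leaves the surrounding spaces behind; whether that matters depends on the later processing steps.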

### Clean Tweets with Regular Expressions

Below we use the $\texttt{gsub}$() function to remove the URLs from our tweets. Do not worry about following the logic of the regular expression - this pattern was designed to match many possible configurations of web addresses (we retrieved it from [here](https://stackoverflow.com/a/26498790)).

In [None]:
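urlRegex = "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
# Remove URLs from the raw tweet text
tweets$tweet_cleaned = gsub(urlRegex, "", tweets$tweet)
# Remove every character outside the printable ASCII range (space through tilde)
tweets$tweet_cleaned = gsub('[^\x20-\x7E]', '', tweets$tweet_cleaned)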
head(tweets)
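
The second $\texttt{gsub}$() call in the cell above uses a negated range: it deletes every character that does not fall between the space character and the tilde, which is the printable part of ASCII. A small sketch on a made-up string (`messyText` is hypothetical) shows the effect on control characters such as tabs and newlines:

```r
# Hypothetical text containing a tab and a newline
messyText <- "Squid\tGame\nis great!"

# Keep only characters from space (\x20) to tilde (\x7E)
asciiOnly <- gsub('[^\x20-\x7E]', '', messyText)

print(asciiOnly)  # "SquidGameis great!"
```

Note that the removed characters are deleted rather than replaced with a space, so words separated only by a tab or newline end up joined together.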

### Drop Duplicates

Some tweets, such as retweets, will appear several times, so we drop the duplicates by using the $\texttt{duplicated}$() function, which flags every repeat after the first occurrence.

In [None]:
tweets = tweets[!duplicated(tweets[,c("tweet")]),]
head(tweets)
dim(tweets)

### Divide Data

In this notebook we will separately analyze tweets with no retweets and tweets that were retweeted. Then at the end we will use both types of tweets to build a predictive model that estimates the number of retweets a given tweet will receive. Below we separate our tweet data into $\texttt{noRetweets}$ and $\texttt{retweets}$.   

In [None]:
noRetweets = tweets[tweets$retweets_count == 0,]
dim(noRetweets)

In [None]:
retweets = tweets[tweets$retweets_count > 0,]
dim(retweets)

## Analyze Tweets with No Retweets

In this section we will analyze tweets that have no retweets. In the next section, you will apply the same analyses to the tweets that do have retweets. 

---

### Process Text Corpus <a class="anchor" id="process-text-corpus"></a>

Within NLP, a **corpus** refers to a collection of texts or documents that are being analyzed. In order to analyze our corpus of tweets, we will be using the $\texttt{tm}$ package ("tm" stands for text mining).  


#### Step 1: Create Corpus object

The first step to analyzing text with the $\texttt{tm}$ package is to create a $\texttt{Corpus}$ object using $\texttt{Corpus}$($\texttt{VectorSource}$()). By passing all of our tweets into this function, the $\texttt{tm}$ package simply organizes them into a format that makes them easier to process. This step does not manipulate or analyze our tweets in any way - it just stores the tweets in an object that the $\texttt{tm}$ package is designed to work with. 

In [None]:
noRetweetCorpus = Corpus(VectorSource(noRetweets$tweet_cleaned))
noRetweetCorpus

The output of $\texttt{noRetweetCorpus}$ indicates how many documents (or tweets) our corpus has; this should equal the number of rows of $\texttt{noRetweets}$:

In [None]:
nrow(noRetweets)

#### Step 2: Create term document matrix <a class="anchor" id="term-document-matrix"></a>

Up until now we have ignored the question of how exactly machine learning algorithms use language data to make predictions. After all, these algorithms are only able to manipulate numeric data, not strings. Therefore, we need some method of converting our texts into numbers. 

The simplest method for converting a corpus into a numeric format is to simply count the number of times each word in the corpus appears in each document. We start this process by collecting the set of all words used in the corpus. In our example, this means we collect every unique word used across all of the tweets we are analyzing. Then for each document (or tweet) we count the number of times each of those words occurs. Because a given tweet only contains a small fraction of the words used in the entire corpus, most of the entries for each document will be zero.

One way to represent this data is with a **term-document matrix**. The rows of this matrix represent each word in the corpus, and the columns represent each document. Each entry ($i$, $j$) represents the number of times word $i$ appears in document $j$. We can easily create this matrix with the $\texttt{tm}$ package by applying the $\texttt{TermDocumentMatrix}$() function to our corpus:

In [None]:
noRetweetTDM = TermDocumentMatrix(noRetweetCorpus)
as.matrix(noRetweetTDM)
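
To make the counting procedure concrete, here is a sketch that builds a tiny term-document matrix by hand in base R. It is only an illustration of the idea, not how the $\texttt{tm}$ package is implemented, and the two documents in `toyDocs` are made up:

```r
# Two hypothetical documents
toyDocs <- c(doc1 = "the cat sat", doc2 = "the cat saw the hat")

# Split each document into words
toyTokens <- strsplit(toyDocs, " ", fixed = TRUE)

# The vocabulary is every unique word used anywhere in the corpus
vocabulary <- sort(unique(unlist(toyTokens)))

# For each document, count how often each vocabulary word occurs
toyTDM <- sapply(toyTokens, function(words) {
    tabulate(match(words, vocabulary), nbins = length(vocabulary))
})
rownames(toyTDM) <- vocabulary

print(toyTDM)
```

Each row is a word, each column is a document, and "hat" gets a zero for the first document because it never appears there.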

Note that this method of representing our documents completely ignores the syntax and grammatical structures of our text. The only information it captures is the number of times each word appears in a given document. This type of representation is known as a **bag-of-words model**. 

#### Step 3: Cleaning our term document matrix

In order to improve the quality of our analyses, there is some further text cleaning we can do. 

##### Remove stop words 

Within NLP, **stop words** are any words that we want to filter out of our text data before conducting our analyses. For example, consider the word "and", which likely appears in the term document matrix from the previous step. Although "and" plays an important grammatical role in the English language, it is more functional than it is semantically substantive. Because we are using a bag-of-words model in which grammar is effectively ignored, "and" is not meaningful for the type of analyses we are conducting. Therefore, we will consider "and" a stop word and remove it from our tweets. 

Although there is not one agreed upon list of stop words, for our purposes we will use the default English list provided by the $\texttt{tm}$ package. We can access this list with $\texttt{stopwords("english")}$:

In [None]:
stopwords("english")
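
Filtering stop words amounts to dropping every token that appears in the list. The sketch below uses a deliberately tiny, made-up list (`toyStopWords`) instead of the full list printed above:

```r
# Hypothetical three-word stop list, for illustration only
toyStopWords <- c("the", "and", "in")

# Tokens from a made-up tweet
toyWords <- c("the", "cat", "and", "the", "hat")

# Keep only the tokens that are not stop words
keptWords <- toyWords[!(toyWords %in% toyStopWords)]

print(keptWords)  # "cat" "hat"
```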

##### Word stemming 

Another important pre-processing step that is common in NLP is **stemming**, which attempts to reduce words to their root forms. 

For example, consider the words "finance", "financial", and "financing". The affixes on these words are used to mark their parts-of-speech and indicate their roles within sentences. "finance" is a noun referring to investment management; "financial" is an adjective that marks something as being related to finance; "financing" is a present participle or gerund that refers to the act of providing money. 

These differences are important syntactically, but our bag-of-words approach is more concerned with semantics than syntax. Therefore, we can stem these words into a common root (*e.g.* "financ") so that they are recognized as referring to the same concept despite their syntactic differences. As you will see below, the $\texttt{tm}$ package offers a pre-built stemmer that is designed to remove affixes from words, reducing them to their root forms.

##### Miscellaneous preprocessing

Finally, we also want to:
+ Remove punctuation and numbers from our tweets, as these are not relevant for our bag-of-words model.
+ Convert all of our words to lower case so that capitalization does not affect our analyses.
    * For example, "coronavirus" and "Coronavirus" should be treated as the same word.
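
These steps can be sketched by hand with the regular expression tools from earlier in the notebook. This is a rough base R illustration on a made-up sentence (`rawText`), not the code that $\texttt{tm}$ runs internally:

```r
# Hypothetical text with capitalization, punctuation, and a number
rawText <- "Coronavirus cases: 100 and rising!"

noPunct  <- gsub("[[:punct:]]", "", rawText)   # drop punctuation
noDigits <- gsub("[[:digit:]]", "", noPunct)   # drop numbers
lowered  <- tolower(noDigits)                  # convert to lower case

# Split on runs of whitespace to get the individual words
cleanTokens <- strsplit(lowered, "[[:space:]]+")[[1]]

print(cleanTokens)  # "coronavirus" "cases" "and" "rising"
```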

##### Apply cleaning steps

Fortunately, the $\texttt{TermDocumentMatrix}$() function from the $\texttt{tm}$ package has a $\texttt{control}$ parameter that allows us to apply all of these cleaning steps automatically. First we create a list called $\texttt{processingSettings}$ with the following parameters, which are applied when the term document matrix is created:
+ $\texttt{stopwords = stopwords("english")}$
+ $\texttt{stemming = TRUE}$
+ $\texttt{removePunctuation = TRUE}$
+ $\texttt{removeNumbers = TRUE}$
+ $\texttt{tolower = TRUE}$

In [None]:
processingSettings = list(stopwords = stopwords("english"), stemming=TRUE, 
                          removePunctuation = TRUE, removeNumbers = TRUE, 
                          tolower = TRUE)

Now we apply the $\texttt{TermDocumentMatrix}$() function as before, but this time set the $\texttt{control}$ parameter equal to our $\texttt{processingSettings}$ list:

In [None]:
noRetweetTDMClean = TermDocumentMatrix(noRetweetCorpus, control=processingSettings)
as.matrix(noRetweetTDMClean)

Compare this term document matrix to the one from the previous step, which did not include any preprocessing. 

### Create Word Cloud <a class="anchor" id="create-word-cloud"></a>

Using the term document matrix we created in the last section, we can create a **word cloud** that visualizes the most common words in our corpus. First we apply the $\texttt{rowSums}$() function to our term document matrix to get the count of each word in the corpus across all of the documents:

In [None]:
noRetweetTokenCounts = rowSums(as.matrix(noRetweetTDMClean))
noRetweetTokenCounts

Then we just need to pass $\texttt{names(noRetweetTokenCounts)}$ and $\texttt{noRetweetTokenCounts}$ into the $\texttt{wordcloud}$() function from the $\texttt{wordcloud}$ package. The $\texttt{min.freq}$ parameter sets the minimum number of times a word must appear in the corpus to be included in the word cloud; here, words that appear fewer than 20 times are left out:

In [None]:
wordcloud(names(noRetweetTokenCounts), noRetweetTokenCounts, min.freq = 20)
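
To preview which words clear the frequency threshold before plotting, the counts can be filtered and sorted directly. The sketch below uses a made-up named vector (`toyTokenCounts`) in place of the real counts:

```r
# Hypothetical word counts, standing in for the real token counts
toyTokenCounts <- c(squid = 42, game = 40, netflix = 25, watch = 7, actor = 3)

# Keep words that appear at least 20 times, most frequent first
frequentWords <- sort(toyTokenCounts[toyTokenCounts >= 20], decreasing = TRUE)

print(frequentWords)
```

Only "squid", "game", and "netflix" survive the filter, which mirrors the words a threshold of 20 would allow into the cloud.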

## Analyze Tweets with Retweets

In this section, you will repeat the analyses from the previous section on the set of tweets that were retweeted. 

---

### Process Text Corpus

First, create a processed corpus of tweets that were retweeted. 

In [None]:
retweetCorpus = Corpus(VectorSource(retweets$tweet_cleaned))

Then use this corpus to create a term document matrix as we did before:

In [None]:
retweetTDM = TermDocumentMatrix(retweetCorpus, control=processingSettings)
as.matrix(retweetTDM)

### Create Word Cloud

Finally, use the term document matrix from the previous step to create a word cloud. Are there any differences in the frequently used words in the non-retweeted and retweeted tweets?

In [None]:
retweetTokenCounts = rowSums(as.matrix(retweetTDM))
wordcloud(names(retweetTokenCounts), retweetTokenCounts, min.freq = 20)

## Build Predictive Model

Now we will use what we have learned to build a machine learning model that predicts how many retweets a given tweet will receive. Here we will return to the full data set with all of the tweets we collected ($\texttt{tweets}$). 

---

### Import Sentiment Data

One common task within NLP is known as **sentiment analysis**. The goal of sentiment analysis is to use natural language processing to analyze the tone and attitude of a text. This is often used in business settings to analyze online customer reviews and open-ended responses from product surveys. 

We can use sentiment analysis to help build our predictive model by creating a "happiness" feature that measures the positivity of each tweet. The Computational Story Lab at the University of Vermont offers a corpus of English words and their associated happiness scores as part of its Hedonometer project (*see* [here](http://hedonometer.org/about.html)). As explained on their website:

> To quantify the happiness of the atoms of language, we merged the 5,000 most frequent words from a collection of four corpora: Google Books, New York Times articles, Music Lyrics, and Twitter messages, resulting in a composite set of roughly 10,000 unique words. Using Amazon’s Mechanical Turk service, we had each of these words scored on a nine point scale of happiness: (1) sad to (9) happy. You can explore the average scores of each word on our words page, or download the entire list from the publication supplement [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752).

We can import this data set from the $\texttt{qdap}$ package by running $\texttt{data("labMT")}$. Each row represents a unique word in their corpus, and $\texttt{happiness}$\_$\texttt{average}$ is the average happiness score of each word from the survey.

In [None]:
data("labMT")
head(labMT)

For a given tweet, we can create a happiness score by taking the average of $\texttt{happiness}$\_$\texttt{average}$ for each word that appears in the tweet. Recall that as part of our pre-processing we stem the words in our text data. To ensure that the words in our tweets match up to the words in Hedonometer's happiness corpus, we need to stem the words in $\texttt{labMT}$ using the $\texttt{stemDocument}$() function:

In [None]:
labMT$word = stemDocument(labMT$word)
head(labMT)

Note that after stemming the $\texttt{word}$ column there are now duplicate entries in $\texttt{labMT}$ (for example, $\texttt{happiness}$ and $\texttt{happy}$ have both been stemmed to $\texttt{happi}$). To correct this, we need to collapse duplicate rows together and take the average value of all the columns over those duplicate rows. We can do this using the $\texttt{aggregate}$() function:

In [None]:
labMT = aggregate(. ~ word, data=labMT, FUN=mean)
labMT = labMT[order(labMT$happiness_rank),]
head(labMT)
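
The effect of $\texttt{aggregate}$() is easier to see on a tiny table. In this sketch, `toyLexicon` is a made-up data frame with a duplicated stem and invented scores:

```r
# Hypothetical lexicon in which "happi" appears twice after stemming
toyLexicon <- data.frame(word = c("happi", "happi", "sad"),
                         happiness_average = c(8.0, 8.4, 2.4))

# Collapse rows that share a word, averaging every other column
collapsed <- aggregate(. ~ word, data = toyLexicon, FUN = mean)

print(collapsed)
```

The two "happi" rows are merged into a single row whose score is the mean of 8.0 and 8.4.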

### Document Term Matrix

In the previous section we worked with term-document matrices in which the rows represented the words in our corpus and the columns represented the documents. As you might suspect, a **document-term matrix** is simply the transpose of a term-document matrix; the rows represent the documents and the columns represent the words in the corpus. 

As with the term-document matrix, we can easily create a document-term matrix using the $\texttt{DocumentTermMatrix}$() function from $\texttt{tm}$. Note that we pass in the same $\texttt{control}$ parameter as before, so we are applying all of the same pre-processing steps. We convert the document-term matrix to a data frame so it is easier to work with. 

In [None]:
# Create corpus from all tweets
tweetCorpus = Corpus(VectorSource(tweets$tweet_cleaned))

# Create document term matrix based on corpus
tweetDTM = DocumentTermMatrix(tweetCorpus, control=processingSettings)

# Convert to data frame
tweetDTM = as.data.frame(as.matrix(tweetDTM))
head(tweetDTM)

### Feature Engineering

Now we will create some features that will be used in our predictive model. 

#### Step 1: Calculate average happiness score for each tweet

In this step we will calculate the average happiness score for each document in our corpus using the happiness scores from the Hedonometer data. The function we define below accepts a document-term matrix and calculates the average happiness score for each row (or document) in that matrix. Words that are missing from the Hedonometer data contribute a score of zero.



In [None]:
calculate_happiness = function(DTM){
    # Keep a copy of the raw word counts for the denominator of the average
    DTMCount <- DTM
    # For each word in the corpus (i.e. each column in the document-term matrix):
    for (col in names(DTM)){
        
        # 1. Look up that word's happiness score in Hedonometer (or use 0 if the word is missing)
        wordHappinessScore = max(0, labMT[labMT$word==col,]$happiness_average)
        
        # 2. Multiply the word's happiness score by the count of that word in each document
        DTM[[col]] = wordHappinessScore * DTM[[col]]
    }
    
    # For each tweet, divide the summed happiness by the number of words to get the average
    # (a tweet with no remaining words gives 0/0, i.e. NaN)
    DTM$happiness_total = rowSums(DTM) / rowSums(DTMCount)
    
    # Return the average happiness score for each tweet
    return(DTM$happiness_total)
}

We can then use this function to create a column with the happiness score for each tweet:

In [None]:
tweets$happiness_score <- calculate_happiness(tweetDTM)
head(tweets)
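
The same calculation can be written without a loop, because multiplying a matrix of word counts by a vector of word scores gives the summed happiness of each document in one step. The sketch below is an alternative formulation under the same assumptions as the function above (missing words score zero); `toyDTM` and `toyScores` are made-up stand-ins for the real data:

```r
# Hypothetical document-term counts: two tweets, three words
toyDTM <- data.frame(happi = c(2, 0), sad = c(1, 1), xyz = c(0, 1))

# Hypothetical lexicon; "xyz" is deliberately missing
toyScores <- data.frame(word = c("happi", "sad"),
                        happiness_average = c(8, 2))

# Look up a score for every column of the matrix; missing words get 0
wordScores <- toyScores$happiness_average[match(names(toyDTM), toyScores$word)]
wordScores[is.na(wordScores)] <- 0

# Summed happiness per tweet, divided by the number of words in the tweet
countMatrix  <- as.matrix(toyDTM)
toyHappiness <- as.vector(countMatrix %*% wordScores) / unname(rowSums(countMatrix))

print(toyHappiness)  # 6 1
```

For the first tweet the sum is 2 × 8 + 1 × 2 = 18 over 3 words, giving 6; for the second it is 2 over 2 words, giving 1.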

#### Step 2: Handle count for each tweet

Next we create a feature called $\texttt{handle}$\_$\texttt{count}$ that contains the number of user handles included in each tweet. We can count the number of times that "@" appears in each tweet with the $\texttt{str}$\_$\texttt{count}$() function from the $\texttt{stringr}$ package:

In [None]:
tweets$handle_count = str_count(tweets$tweet_cleaned, "@")
head(tweets)
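
Counting every "@" also counts symbols that are not handles, such as the one inside an email address. As a hedged sketch of a stricter alternative (not what the notebook uses), the pattern below only counts an "@" that starts a word and is followed by letters, digits, or underscores. The vector `toyTweets` and the pattern `handlePattern` are ours:

```r
# Hypothetical tweets: two handles, an email address, and no handles
toyTweets <- c("@a hi @b", "email me at x@y.com", "no handles")

# Count every "@" character, using base R instead of stringr
atCounts <- lengths(regmatches(toyTweets, gregexpr("@", toyTweets, fixed = TRUE)))

# Count only "@" at the start of the text or after whitespace, followed by a name
handlePattern <- "(^|[[:space:]])@[[:alnum:]_]+"
handleCounts  <- lengths(regmatches(toyTweets, gregexpr(handlePattern, toyTweets)))

print(atCounts)      # 2 1 0
print(handleCounts)  # 2 0 0
```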

### Build Model

Now that we have cleaned our tweets and done some feature engineering, we will fit a regression tree that predicts the number of retweets. By applying $\texttt{summary}$() to our model we can see the importance of each feature (under "Variable importance").

In [None]:
model = rpart(retweets_count ~ happiness_score + handle_count,
              data=tweets)
rpart.plot(model, branch=0.3, tweak=1.5)
# Print the model details, including the "Variable importance" section
summary(model)