---
title: "Introduction to automated text analysis"
author: Pablo Barbera
date: June 28, 2017
output: html_document
---
### String manipulation with R
We will start with basic string manipulation with R.
Our running example will be a random sample of 10,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We'll save the text of these tweets as a vector called `text`.
```{r}
tweets <- read.csv("data/EP-elections-tweets.csv", stringsAsFactors=F)
head(tweets)
text <- tweets$text
```
R stores strings in character vectors. `length` returns the number of elements in a vector, while `nchar` returns the number of characters in each element.
```{r}
length(text)
text[1]
nchar(text[1])
```
Note that we can work with multiple strings at once.
```{r}
nchar(text[1:10])
sum(nchar(text[1:10]))
max(nchar(text[1:10]))
```
We can merge different strings into one using `paste`:
```{r}
paste(text[1], text[2], sep='--')
```
Character vectors can be compared using the `==` and `%in%` operators:
```{r}
tweets$screen_name[1]=="martinwedge"
"DavidCoburnUKip" %in% tweets$screen_name
```
As we will see later, it is often convenient to convert all words to lowercase or uppercase.
```{r}
tolower(text[1])
toupper(text[1])
```
We can grab substrings with `substr`. The first argument is the string, the second is the beginning index (starting from 1), and the third is the final index.
```{r}
substr(text[1], 1, 2)
substr(text[1], 1, 10)
```
This is useful when working with date strings as well:
```{r}
dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
substr(dates, 6, 7) # months
```
We can split up strings by a separator using `strsplit`. If we choose space as the separator, this is in most cases equivalent to splitting into words.
```{r}
strsplit(text[1], " ")
```
Let's dig into the data a little bit more. Given how the dataset was constructed, we can expect many tweets to mention the names of the candidates, such as @Nigel_Farage. We can use the `grep` command to identify these. `grep` returns the indices of the elements that contain the pattern.
```{r}
grep('@Nigel_Farage', text[1:10])
```
`grepl` returns `TRUE` or `FALSE`, indicating whether each element of the character vector contains that particular pattern.
```{r}
grepl('@Nigel_Farage', text[1:10])
```
Going back to the full dataset, we can use the results of `grep` to get particular rows. First, check how many tweets mention the handle "@Nigel_Farage".
```{r}
nrow(tweets)
grep('@Nigel_Farage', tweets$text[1:10])
length(grep('@Nigel_Farage', tweets$text))
```
It is important to note that matching is case-sensitive by default. You can use the `ignore.case` argument to make the matching case-insensitive.
```{r}
nrow(tweets)
length(grep('@Nigel_Farage', tweets$text))
length(grep('@Nigel_Farage', tweets$text, ignore.case = TRUE))
```
### Regular expressions
Another useful tool for working with text data is regular expressions. You can learn more about regular expressions [here](http://www.zytrax.com/tech/web/regex.htm). Regular expressions let us build complex rules for matching strings and for extracting elements from them.
For example, we could look at tweets that mention more than one handle using the operator "|" (equivalent to "OR").
```{r}
nrow(tweets)
length(grep('@Nigel_Farage|@UKIP', tweets$text, ignore.case=TRUE))
```
We can also use a question mark to indicate that a character is optional.
```{r}
nrow(tweets)
length(grep('MEPs?', tweets$text, ignore.case=TRUE))
```
A pattern like `MEPs?` will match both MEP and MEPs, because the `?` makes the preceding `s` optional.
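To build intuition, here is a toy illustration of the optional-character idea on a small made-up vector (not the tweets):

```{r}
# "?" makes the preceding "s" optional, so "MEPs?" matches both forms
x <- c("An MEP spoke", "Two MEPs spoke", "A member spoke")
grepl("MEPs?", x)
```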
Other common expression patterns are:
- `.` matches any character, `^` and `$` match the beginning and end of a string.
- A character followed by `{3}`, `*`, or `+` is matched exactly 3 times, 0 or more times, or 1 or more times, respectively.
- `[0-9]`, `[a-zA-Z]`, and `[[:alnum:]]` match any digit, any letter, and any alphanumeric character, respectively.
- Special characters such as `.`, `\`, `(` or `)` must be escaped with a backslash (written as a double backslash inside an R string, e.g. `\\.`).
- See `?regex` for more details.
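A few quick illustrations of these patterns on made-up strings (not the tweets):

```{r}
# "^" anchors the match to the start of the string
grepl("^RT", c("RT @user hello", "hello RT"))
# "+" requires one or more of the preceding character class
grepl("[0-9]+", c("vote2014", "no digits here"))
# a literal dot must be escaped (doubled backslash inside an R string)
gsub("\\.", "!", "The end.")
```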
For example, how many tweets are direct replies to @Nigel_Farage? How many tweets are retweets? How many tweets mention any username?
```{r}
length(grep('^@Nigel_Farage', tweets$text, ignore.case=TRUE))
length(grep('^RT @', tweets$text, ignore.case=TRUE))
length(grep('@[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
```
Another function that we will use is `gsub`, which replaces a pattern (or a regular expression) with another string:
```{r}
gsub('@[0-9_A-Za-z]+', 'USERNAME', text[1])
```
To extract the matched pattern rather than replace it, wrap it in parentheses (a capture group) and use `"\\1"` as the replacement:
```{r}
gsub('.*@([0-9_A-Za-z]+) .*', "\\1", text[1])
```
To extract every match rather than just the first, use `gregexpr`, which returns the locations of all matches, together with `regmatches`:
```{r}
handles <- gregexpr('@([0-9_A-Za-z]+)', text)
handles <- regmatches(text, handles)
handles <- unlist(handles)
head(sort(table(handles), decreasing=TRUE), n=25)
# now with hashtags...
hashtags <- regmatches(text, gregexpr("#(\\d|\\w)+",text))
hashtags <- unlist(hashtags)
head(sort(table(hashtags), decreasing=TRUE), n=25)
```
Now let's try to identify which tweets are related to UKIP and extract them. How would we do it? First, let's create a new column in the data frame that takes the value `TRUE` for tweets that mention this keyword and `FALSE` otherwise. Then we can keep the rows with value `TRUE`.
```{r}
tweets$ukip <- grepl('ukip|farage', tweets$text, ignore.case=TRUE)
table(tweets$ukip)
ukip.tweets <- tweets[tweets$ukip==TRUE, ]
```
### Preprocessing text with quanteda
As we discussed earlier, before we can do any type of automated text analysis, the text needs to go through several "preprocessing" steps so that it can be passed to a statistical model. We'll use the [quanteda](https://github.com/kbenoit/quanteda) package here.
The basic unit of work for the `quanteda` package is called a `corpus`, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use `summary` to get some information about your corpus.
```{r}
library(quanteda)
twcorpus <- corpus(tweets$text)
summary(twcorpus)
```
A useful feature of corpus objects is _keywords in context_, which returns all the appearances of a word (or combination of words) in its immediate context.
```{r}
kwic(twcorpus, "brexit", window=10)
kwic(twcorpus, "miliband", window=10)
kwic(twcorpus, "eu referendum", window=10)
```
We can then convert a corpus into a document-feature matrix using the `dfm` function.
```{r}
twdfm <- dfm(twcorpus, verbose=TRUE)
twdfm
```
`dfm` has many useful options. Let's actually use it to stem the text, extract n-grams, remove punctuation, keep Twitter features...
```{r}
?dfm
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, ngrams=1:3, verbose=TRUE)
twdfm
```
Note that here we use ngrams -- this will extract all combinations of one, two, and three words (e.g. it will include "human", "rights", and "human rights" as features in the matrix).
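For intuition, here is a rough base-R sketch of the 1- to 3-grams generated for a short sentence. quanteda does this internally, joining the words of each n-gram with `_`; the `ngrams` helper below is purely illustrative, not part of any package:

```{r}
# Hypothetical helper: build all n-grams of length n from a word vector
ngrams <- function(w, n) {
  if (length(w) < n) return(character(0))
  sapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = "_"))
}
words <- strsplit("human rights are universal", " ")[[1]]
# all 1-, 2-, and 3-grams of the sentence
unlist(lapply(1:3, function(n) ngrams(words, n)))
```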
Stemming relies on the `SnowballC` package's implementation of the Porter stemmer:
```{r}
tokenize(tweets$text[1])
tokens_wordstem(tokenize(tweets$text[1]))
```
In a large corpus like this, many features often appear in only one or two documents. In some cases it's a good idea to remove those features, to speed up the analysis or because they're not relevant. We can trim the dfm with `dfm_trim`:
```{r}
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
```
It's often a good idea to take a look at a wordcloud of the most frequent features to see if there's anything weird.
```{r}
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)
```
What is going on? We probably want to remove words and symbols that are not of interest, such as "http" here. Very common words that act as connectors in a given language (e.g. "a", "the", "is") are called stopwords and are usually not relevant either. We can inspect the most frequent features with `topfeatures`:
```{r}
topfeatures(twdfm, 25)
```
We can remove the stopwords when we create the `dfm` object:
```{r}
twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), verbose=TRUE)
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)
```
One nice feature of quanteda is that we can easily add metadata to the corpus object.
```{r}
docvars(twcorpus) <- data.frame(screen_name=tweets$screen_name, polite=tweets$polite)
summary(twcorpus)
```
We can then use this metadata to subset the dataset:
```{r}
polite.tweets <- corpus_subset(twcorpus, polite=="impolite")
```
And then extract the text:
```{r}
mytexts <- texts(polite.tweets)
```
We'll come back to this dataset later.
### Importing text with quanteda
There are different ways to read text into `R` and create a `corpus` object with `quanteda`. We have already seen the most common way, importing the text from a csv file and then adding the metadata, but `quanteda` has a built-in function to help with this:
```{r}
library(readtext)
tweets <- readtext(file='data/EP-elections-tweets.csv')
twcorpus <- corpus(tweets)
```
This function also works with text stored in multiple files. To read them all at once, we pass `readtext` a pattern that uses the 'glob' operator '*':
```{r}
myCorpus <- readtext(file='data/inaugural/*.txt')
inaugCorpus <- corpus(myCorpus)
```