Introduction and Workflow

Kenneth Benoit

Date: 20 April 2017

This file demonstrates a basic workflow to take some pre-loaded texts and perform elementary text analysis tasks quickly. The quanteda package comes with a built-in set of inaugural addresses from US Presidents. We begin by loading quanteda and examining these texts. The summary command will output the name of each text along with the number of types, tokens and sentences contained in the text. Below we use R’s indexing syntax to selectively use the summary command on the first five texts.

## Corpus consisting of 58 documents:
##             Text Types Tokens Sentences Year  President       FirstName
##  1789-Washington   625   1538        23 1789 Washington          George
##  1793-Washington    96    147         4 1793 Washington          George
##       1797-Adams   826   2578        37 1797      Adams            John
##   1801-Jefferson   717   1927        41 1801  Jefferson          Thomas
##   1805-Jefferson   804   2381        45 1805  Jefferson          Thomas
##     1809-Madison   535   1263        21 1809    Madison           James
##     1813-Madison   541   1302        33 1813    Madison           James
##      1817-Monroe  1040   3680       121 1817     Monroe           James
##      1821-Monroe  1259   4886       129 1821     Monroe           James
##       1825-Adams  1003   3152        74 1825      Adams     John Quincy
##     1829-Jackson   517   1210        25 1829    Jackson          Andrew
##     1833-Jackson   499   1269        29 1833    Jackson          Andrew
##    1837-VanBuren  1315   4165        95 1837  Van Buren          Martin
##    1841-Harrison  1896   9144       210 1841   Harrison   William Henry
##        1845-Polk  1334   5193       153 1845       Polk      James Knox
##      1849-Taylor   496   1179        22 1849     Taylor         Zachary
##      1853-Pierce  1165   3641       104 1853     Pierce        Franklin
##    1857-Buchanan   945   3086        89 1857   Buchanan           James
##     1861-Lincoln  1075   4006       135 1861    Lincoln         Abraham
##     1865-Lincoln   360    776        26 1865    Lincoln         Abraham
##       1869-Grant   485   1235        40 1869      Grant      Ulysses S.
##       1873-Grant   552   1475        43 1873      Grant      Ulysses S.
##       1877-Hayes   831   2716        59 1877      Hayes   Rutherford B.
##    1881-Garfield  1021   3212       111 1881   Garfield        James A.
##   1885-Cleveland   676   1820        44 1885  Cleveland          Grover
##    1889-Harrison  1352   4722       157 1889   Harrison        Benjamin
##   1893-Cleveland   821   2125        58 1893  Cleveland          Grover
##    1897-McKinley  1232   4361       130 1897   McKinley         William
##    1901-McKinley   854   2437       100 1901   McKinley         William
##   1905-Roosevelt   404   1079        33 1905  Roosevelt        Theodore
##        1909-Taft  1437   5822       159 1909       Taft  William Howard
##      1913-Wilson   658   1882        68 1913     Wilson         Woodrow
##      1917-Wilson   549   1656        59 1917     Wilson         Woodrow
##     1921-Harding  1169   3721       148 1921    Harding       Warren G.
##    1925-Coolidge  1220   4440       196 1925   Coolidge          Calvin
##      1929-Hoover  1090   3865       158 1929     Hoover         Herbert
##   1933-Roosevelt   743   2062        85 1933  Roosevelt     Franklin D.
##   1937-Roosevelt   725   1997        96 1937  Roosevelt     Franklin D.
##   1941-Roosevelt   526   1544        68 1941  Roosevelt     Franklin D.
##   1945-Roosevelt   275    647        26 1945  Roosevelt     Franklin D.
##      1949-Truman   781   2513       116 1949     Truman        Harry S.
##  1953-Eisenhower   900   2757       119 1953 Eisenhower       Dwight D.
##  1957-Eisenhower   621   1931        92 1957 Eisenhower       Dwight D.
##     1961-Kennedy   566   1566        52 1961    Kennedy         John F.
##     1965-Johnson   568   1723        93 1965    Johnson   Lyndon Baines
##       1969-Nixon   743   2437       103 1969      Nixon Richard Milhous
##       1973-Nixon   544   2012        68 1973      Nixon Richard Milhous
##      1977-Carter   527   1376        52 1977     Carter           Jimmy
##      1981-Reagan   902   2790       128 1981     Reagan          Ronald
##      1985-Reagan   925   2921       123 1985     Reagan          Ronald
##        1989-Bush   795   2681       141 1989       Bush          George
##     1993-Clinton   642   1833        81 1993    Clinton            Bill
##     1997-Clinton   773   2449       111 1997    Clinton            Bill
##        2001-Bush   621   1808        97 2001       Bush       George W.
##        2005-Bush   773   2319       100 2005       Bush       George W.
##       2009-Obama   938   2711       110 2009      Obama          Barack
##       2013-Obama   814   2317        88 2013      Obama          Barack
##       2017-Trump   582   1660        88 2017      Trump       Donald J.
## Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes:
##    Length     Class      Mode 
##         5 character character

## [1] 58
ntoken(data_corpus_inaugural[1:7], remove_punct = TRUE)
One of the most fundamental text analysis tasks is tokenization. To tokenize a text is to split it into units, most commonly words, which can be counted and form the basis of a quantitative analysis. The quanteda package has a function for tokenization: tokens, which constructs a quanteda tokens object consisting of the texts segmented by their terms (and by default, other elements such as punctuation, numbers, symbols, etc.). Examine the manual page at ?tokens for the details of this function.
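As a rough illustration of what tokenization means, the base R sketch below splits a string on whitespace and then counts tokens and types. It is only a toy: the helper name toy_tokenize is made up for this example, and quanteda's tokens() applies much more careful word-boundary rules than a whitespace split.

```r
# Toy whitespace tokenizer in base R (hypothetical helper, not part of quanteda).
toy_tokenize <- function(txt) {
  # Split a single string on runs of whitespace; strsplit() returns a list.
  toks <- strsplit(txt, "[[:space:]]+")[[1]]
  # Drop empty strings, which can appear when the text starts with whitespace.
  toks[nzchar(toks)]
}

toks <- toy_tokenize("the cat sat on the mat")
n_tokens <- length(toks)          # every occurrence counts as a token
n_types  <- length(unique(toks))  # types are the distinct tokens
```

Here "the" occurs twice, so the number of types is one less than the number of tokens. This is the same distinction that the Types and Tokens columns of the corpus summary report.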


quanteda’s tokens function can be used on a single character string, a character vector with several elements, or a corpus. Here are some examples:

tokens("Today is Thursday in Canberra. It is yesterday in London.")
vec <- c(one = "This is text one", 
         two = "This, however, is the second text")
tokens(vec)
Consider the default arguments to the tokens() function. To remove punctuation, you should set the remove_punct argument to be TRUE. We can combine this with the char_tolower() function to get a cleaned and tokenized version of our text.

tokens(char_tolower(vec), remove_punct = TRUE)
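To make the two cleaning steps concrete, here is a minimal base R sketch of the same idea: lower-case the text, delete punctuation, then split on whitespace. It is an illustration under simplifying assumptions, not how quanteda implements char_tolower() or remove_punct, and the object names (vec_demo, toy_toks) are made up for the example.

```r
# Toy clean-and-split step in base R (illustration only, not quanteda's implementation).
vec_demo <- c(one = "This is text one",
              two = "This, however, is the second text")

lowered  <- tolower(vec_demo)                   # lower-case every text
no_punct <- gsub("[[:punct:]]", "", lowered)    # delete punctuation characters
toy_toks <- strsplit(no_punct, "[[:space:]]+")  # split on whitespace, one list element per text
```

The second element of toy_toks then holds the six lower-cased words of the second text, with the commas gone.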
The way that char_tolower() is named reflects the logic of quanteda’s function grammar. The first part (before the underscore _) names both the class of object that is input to the function and the class that the function returns. To lowercase an R character class object, for instance, you use char_tolower(), and to lowercase a quanteda tokens class object, you use tokens_tolower(). Some object classes are defined in base R, and some have been defined by packages that extend R’s functionality. quanteda is one example of such a package; there are well over 10,000 contributed packages on the CRAN archive alone. CRAN stands for Comprehensive R Archive Network and is where the quanteda package is published.
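The naming pattern can be mimicked in a few lines of base R. In the sketch below, every function and class name (char_shout, wordlist_shout, as_wordlist, the "wordlist" class) is hypothetical and exists only to show the "class prefix, underscore, verb" scheme; none of them is part of quanteda.

```r
# Hypothetical functions that follow a "class prefix + underscore + verb" naming scheme.
# None of these names exist in quanteda; they only mimic the pattern.

# char_*: takes a character vector and returns a character vector
char_shout <- function(x) {
  stopifnot(is.character(x))
  toupper(x)
}

# wordlist_*: takes a toy "wordlist" object and returns a "wordlist" object
as_wordlist <- function(x) structure(list(words = x), class = "wordlist")
wordlist_shout <- function(x) {
  stopifnot(inherits(x, "wordlist"))
  as_wordlist(toupper(x$words))
}

out_char <- char_shout(c("hello", "world"))
out_wl   <- wordlist_shout(as_wordlist(c("hello", "world")))
class(out_char)  # "character"
class(out_wl)    # "wordlist"
```

In both cases the prefix tells you what goes in and what comes out, so the name alone is enough to pick the right function for the object you have.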

Using this function with the inaugural addresses:

inaugTokens <- tokens(data_corpus_inaugural, remove_punct = TRUE)
Here, we supplied one of the optional arguments to the tokens() function: remove_punct. This argument takes a “logical” type value (TRUE or FALSE) and specifies whether punctuation characters should be removed or not. The help page for tokens(), which you can access using the command ?tokens, details all of the function arguments and their valid values.

Every function in R and its contributed packages has a help page, and this is the first place to look when examining a function. Well-written help pages will also contain examples that you can run to see how a function operates. For quanteda, the main functions also have help pages published online with the results of executing their examples.

Returning to tokenization: Once each text has been split into words, we can use the dfm function to create a matrix of counts of the occurrences of each word in each document:

inaugDfm <- dfm(inaugTokens)
trimmedInaugDfm <- dfm_trim(inaugDfm, min_doc = 5, min_count = 10)
weightedTrimmedDfm <- dfm_tfidf(trimmedInaugDfm)
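The three steps above (count, trim, weight) can be pictured on a toy example in base R. This is a sketch under stated assumptions, not quanteda's implementation: the documents and thresholds are invented for the toy data, and the weighting uses one common tf-idf definition (count times log10 of the number of documents divided by the document frequency). Whether that matches the defaults of dfm_tfidf() in your quanteda version is something to check in its help page.

```r
# Toy document-feature matrix, trimming and tf-idf weighting in base R.
# Illustration only; thresholds are chosen for the toy data.
docs <- list(d1 = c("a", "b", "a", "c"),
             d2 = c("a", "b", "b"),
             d3 = c("a", "d"))

features <- sort(unique(unlist(docs)))

# One row per document, one column per feature; cells hold counts.
counts <- t(vapply(docs,
                   function(toks) tabulate(match(toks, features), nbins = length(features)),
                   integer(length(features))))
colnames(counts) <- features

# Trim: keep features found in at least 2 documents and at least 3 times overall.
doc_freq   <- colSums(counts > 0)
total_freq <- colSums(counts)
trimmed <- counts[, doc_freq >= 2 & total_freq >= 3, drop = FALSE]

# Weight: count * log10(number of documents / document frequency).
idf <- log10(nrow(trimmed) / colSums(trimmed > 0))
weighted <- sweep(trimmed, 2, idf, `*`)
```

Only features "a" and "b" survive the trim. Because "a" appears in every document its inverse document frequency is zero, so its weighted column is all zeros, while "b" keeps positive weights in the documents that contain it.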

inaugDfm2 <- dfm(inaugTokens) %>% 
    dfm_trim(min_doc = 5, min_count = 10) %>% 
    dfm_tfidf()
Note that dfm() works on a variety of object types, including character vectors, corpus objects, and tokenized text objects. This gives the user maximum flexibility and power, while also making it easy to achieve similar results by going directly from texts to a document-by-feature matrix.

To see the object classes for which any particular method (function) is defined, you can use the methods() function:

Likewise, you can also figure out what methods are defined for any given class of object, using the same function:

methods(class = "tokens")
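Both uses of methods() can be tried without quanteda loaded. The sketch below uses the base R generic summary() and the data.frame class purely as stand-ins; the object names summary_methods and df_methods are made up for the example.

```r
# methods() queried in both directions, using base R objects as stand-ins.

# 1. For which classes is a summary() method defined?
summary_methods <- as.character(methods(summary))

# 2. Which methods are defined for objects of class "data.frame"?
df_methods <- as.character(methods(class = "data.frame"))

# The data frame method for summary() shows up in both listings.
any(grepl("data.frame", summary_methods, fixed = TRUE))
any(grepl("summary", df_methods, fixed = TRUE))
```

Replacing summary with dfm, or "data.frame" with "tokens", gives the corresponding listings for quanteda once the package is loaded.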
If we are interested in analysing the texts with respect to some other variables, we can create a corpus object to associate the texts with this metadata. For example, consider six recent inaugural addresses, documents 52 to 57 of the corpus (1993-Clinton through 2013-Obama):

We can use the docvars option to the corpus command to record the party with which each text is associated:

dv <- data.frame(Party = c("dem", "dem", "rep", "rep", "dem", "dem"))
recentCorpus <- corpus(data_corpus_inaugural[52:57], docvars = dv)
We can use this metadata to combine features across documents when creating a document-feature matrix:

partyDfm <- dfm(recentCorpus, groups = "Party", remove = stopwords("english"))
textplot_wordcloud(partyDfm, comparison = TRUE)
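What grouping does to the matrix can be sketched in base R: rows that share a value of the document variable are added together, leaving one row per group. The counts, feature names and party labels below are invented for the illustration, and this is not how quanteda implements the groups argument.

```r
# Toy grouped document-feature matrix in base R (illustration only).
counts <- matrix(c(2, 0, 1,
                   1, 1, 0,
                   0, 3, 1,
                   1, 0, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3", "doc4"),
                                 c("freedom", "economy", "peace")))
party <- c("dem", "rep", "rep", "dem")   # hypothetical document variable

# rowsum() adds up the rows that share a group label, giving one row per party.
by_party <- rowsum(counts, group = party)
```

The result has two rows, "dem" and "rep", each holding the summed feature counts of that party's documents. A comparison word cloud then contrasts those two rows.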