Introduction and Workflow

Kenneth Benoit

Date: 20 April 2017

This file demonstrates a basic workflow to take some pre-loaded texts and perform elementary text analysis tasks quickly. The quanteda package comes with a built-in set of inaugural addresses from US Presidents. We begin by loading quanteda and examining these texts. The summary command will output the name of each text along with the number of types, tokens and sentences contained in the text. Below we use R’s indexing syntax to selectively use the summary command on the first five texts.

require(quanteda)
## Loading required package: quanteda
## Package version: 1.1.4
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

summary(data_corpus_inaugural)
## Corpus consisting of 58 documents:
## 
##             Text Types Tokens Sentences Year  President       FirstName
##  1789-Washington   625   1538        23 1789 Washington          George
##  1793-Washington    96    147         4 1793 Washington          George
##       1797-Adams   826   2578        37 1797      Adams            John
##   1801-Jefferson   717   1927        41 1801  Jefferson          Thomas
##   1805-Jefferson   804   2381        45 1805  Jefferson          Thomas
##     1809-Madison   535   1263        21 1809    Madison           James
##     1813-Madison   541   1302        33 1813    Madison           James
##      1817-Monroe  1040   3680       121 1817     Monroe           James
##      1821-Monroe  1259   4886       129 1821     Monroe           James
##       1825-Adams  1003   3152        74 1825      Adams     John Quincy
##     1829-Jackson   517   1210        25 1829    Jackson          Andrew
##     1833-Jackson   499   1269        29 1833    Jackson          Andrew
##    1837-VanBuren  1315   4165        95 1837  Van Buren          Martin
##    1841-Harrison  1896   9144       210 1841   Harrison   William Henry
##        1845-Polk  1334   5193       153 1845       Polk      James Knox
##      1849-Taylor   496   1179        22 1849     Taylor         Zachary
##      1853-Pierce  1165   3641       104 1853     Pierce        Franklin
##    1857-Buchanan   945   3086        89 1857   Buchanan           James
##     1861-Lincoln  1075   4006       135 1861    Lincoln         Abraham
##     1865-Lincoln   360    776        26 1865    Lincoln         Abraham
##       1869-Grant   485   1235        40 1869      Grant      Ulysses S.
##       1873-Grant   552   1475        43 1873      Grant      Ulysses S.
##       1877-Hayes   831   2716        59 1877      Hayes   Rutherford B.
##    1881-Garfield  1021   3212       111 1881   Garfield        James A.
##   1885-Cleveland   676   1820        44 1885  Cleveland          Grover
##    1889-Harrison  1352   4722       157 1889   Harrison        Benjamin
##   1893-Cleveland   821   2125        58 1893  Cleveland          Grover
##    1897-McKinley  1232   4361       130 1897   McKinley         William
##    1901-McKinley   854   2437       100 1901   McKinley         William
##   1905-Roosevelt   404   1079        33 1905  Roosevelt        Theodore
##        1909-Taft  1437   5822       159 1909       Taft  William Howard
##      1913-Wilson   658   1882        68 1913     Wilson         Woodrow
##      1917-Wilson   549   1656        59 1917     Wilson         Woodrow
##     1921-Harding  1169   3721       148 1921    Harding       Warren G.
##    1925-Coolidge  1220   4440       196 1925   Coolidge          Calvin
##      1929-Hoover  1090   3865       158 1929     Hoover         Herbert
##   1933-Roosevelt   743   2062        85 1933  Roosevelt     Franklin D.
##   1937-Roosevelt   725   1997        96 1937  Roosevelt     Franklin D.
##   1941-Roosevelt   526   1544        68 1941  Roosevelt     Franklin D.
##   1945-Roosevelt   275    647        26 1945  Roosevelt     Franklin D.
##      1949-Truman   781   2513       116 1949     Truman        Harry S.
##  1953-Eisenhower   900   2757       119 1953 Eisenhower       Dwight D.
##  1957-Eisenhower   621   1931        92 1957 Eisenhower       Dwight D.
##     1961-Kennedy   566   1566        52 1961    Kennedy         John F.
##     1965-Johnson   568   1723        93 1965    Johnson   Lyndon Baines
##       1969-Nixon   743   2437       103 1969      Nixon Richard Milhous
##       1973-Nixon   544   2012        68 1973      Nixon Richard Milhous
##      1977-Carter   527   1376        52 1977     Carter           Jimmy
##      1981-Reagan   902   2790       128 1981     Reagan          Ronald
##      1985-Reagan   925   2921       123 1985     Reagan          Ronald
##        1989-Bush   795   2681       141 1989       Bush          George
##     1993-Clinton   642   1833        81 1993    Clinton            Bill
##     1997-Clinton   773   2449       111 1997    Clinton            Bill
##        2001-Bush   621   1808        97 2001       Bush       George W.
##        2005-Bush   773   2319       100 2005       Bush       George W.
##       2009-Obama   938   2711       110 2009      Obama          Barack
##       2013-Obama   814   2317        88 2013      Obama          Barack
##       2017-Trump   582   1660        88 2017      Trump       Donald J.
## 
## Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes: http://www.presidency.ucsb.edu/inaugurals.php
summary(data_corpus_inaugural[1:5])
##    Length     Class      Mode 
##         5 character character

data_corpus_inaugural[1]
## 1789-Washington
## "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. 
All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated.\n\nSuch being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. 
Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.\n\nBy the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. 
In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.\n\nBesides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. 
Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.\n\nTo the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. 
From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.\n\nHaving thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "
cat(data_corpus_inaugural[2])
## Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.
## 
## Previous to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.
## 
## 

ndoc(data_corpus_inaugural)
## [1] 58
docnames(data_corpus_inaugural)
##  [1] "1789-Washington" "1793-Washington" "1797-Adams"     
##  [4] "1801-Jefferson"  "1805-Jefferson"  "1809-Madison"   
##  [7] "1813-Madison"    "1817-Monroe"     "1821-Monroe"    
## [10] "1825-Adams"      "1829-Jackson"    "1833-Jackson"   
## [13] "1837-VanBuren"   "1841-Harrison"   "1845-Polk"      
## [16] "1849-Taylor"     "1853-Pierce"     "1857-Buchanan"  
## [19] "1861-Lincoln"    "1865-Lincoln"    "1869-Grant"     
## [22] "1873-Grant"      "1877-Hayes"      "1881-Garfield"  
## [25] "1885-Cleveland"  "1889-Harrison"   "1893-Cleveland" 
## [28] "1897-McKinley"   "1901-McKinley"   "1905-Roosevelt" 
## [31] "1909-Taft"       "1913-Wilson"     "1917-Wilson"    
## [34] "1921-Harding"    "1925-Coolidge"   "1929-Hoover"    
## [37] "1933-Roosevelt"  "1937-Roosevelt"  "1941-Roosevelt" 
## [40] "1945-Roosevelt"  "1949-Truman"     "1953-Eisenhower"
## [43] "1957-Eisenhower" "1961-Kennedy"    "1965-Johnson"   
## [46] "1969-Nixon"      "1973-Nixon"      "1977-Carter"    
## [49] "1981-Reagan"     "1985-Reagan"     "1989-Bush"      
## [52] "1993-Clinton"    "1997-Clinton"    "2001-Bush"      
## [55] "2005-Bush"       "2009-Obama"      "2013-Obama"     
## [58] "2017-Trump"

nchar(data_corpus_inaugural[1:7])
## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson 
##            8618             790           13876           10136 
##  1805-Jefferson    1809-Madison    1813-Madison 
##           12907            7000            7156
ntoken(data_corpus_inaugural[1:7])
## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson 
##            1538             147            2578            1927 
##  1805-Jefferson    1809-Madison    1813-Madison 
##            2381            1263            1302
ntoken(data_corpus_inaugural[1:7], remove_punct = TRUE)
## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson 
##            1430             135            2318            1726 
##  1805-Jefferson    1809-Madison    1813-Madison 
##            2166            1175            1210
ntype(data_corpus_inaugural[1:7])
## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson 
##             625              96             826             717 
##  1805-Jefferson    1809-Madison    1813-Madison 
##             804             535             541
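To make the distinction between tokens and types concrete, here is a minimal sketch on a made-up sentence. The text and the object name `toy_toks` are illustrative assumptions, not part of the inaugural corpus. Tokens are all the word occurrences in a text; types are the distinct words among them.

```r
library(quanteda)

# Made-up example text in which "the" occurs twice
toy_toks <- tokens("the cat and the hat")

ntoken(toy_toks)  # number of word occurrences
ntype(toy_toks)   # number of distinct words
```

Because "the" is repeated, this sentence has five tokens but only four types.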

One of the most fundamental text analysis tasks is tokenization. To tokenize a text is to split it into units, most commonly words, which can be counted and form the basis of a quantitative analysis. The quanteda package has a function for tokenization: tokens, which constructs a quanteda tokens object consisting of the texts segmented by their terms (and by default, other elements such as punctuation, numbers, symbols, etc.). Examine the manual page at ?tokens for details about this function:

?tokens

quanteda’s tokens function can be used on a simple character vector, a vector of character vectors, or a corpus. Here are some examples:

tokens("Today is Thursday in Canberra. It is yesterday in London.")
## tokens from 1 document.
## text1 :
##  [1] "Today"     "is"        "Thursday"  "in"        "Canberra" 
##  [6] "."         "It"        "is"        "yesterday" "in"       
## [11] "London"    "."

vec <- c(one = "This is text one", 
         two = "This, however, is the second text")
tokens(vec)
## tokens from 2 documents.
## one :
## [1] "This" "is"   "text" "one" 
## 
## two :
## [1] "This"    ","       "however" ","       "is"      "the"     "second" 
## [8] "text"

Consider the default arguments to the tokens() function. To remove punctuation, you should set the remove_punct argument to be TRUE. We can combine this with the char_tolower() function to get a cleaned and tokenized version of our text.

tokens(char_tolower(vec), remove_punct = TRUE)
## tokens from 2 documents.
## one :
## [1] "this" "is"   "text" "one" 
## 
## two :
## [1] "this"    "however" "is"      "the"     "second"  "text"

The way that char_tolower() is named reflects the logic of quanteda’s function grammar. The first part (before the underscore _) names both the class of object that is input to the function and the class of object that it returns. To lowercase an R character class object, for instance, you use char_tolower(), and to lowercase a quanteda tokens class object, you use tokens_tolower(). Some object classes are defined in base R, and some have been defined by packages that extend R’s functionality. quanteda is one example of such a package; there are well over 10,000 contributed packages on the CRAN archive alone. (CRAN stands for Comprehensive R Archive Network and is where the quanteda package is published.)
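As a small sketch of this naming grammar, the two functions can be applied to a toy text in either order. The text and the object names below are hypothetical, chosen only for illustration.

```r
library(quanteda)

# Hypothetical one-document character vector
txt <- c(d1 = "Fellow Citizens of the Senate")

# char_*: character in, character out, so lowercase first and then tokenize
lower_then_tokenize <- tokens(char_tolower(txt))

# tokens_*: tokens in, tokens out, so tokenize first and then lowercase
tokenize_then_lower <- tokens_tolower(tokens(txt))

as.character(lower_then_tokenize)
as.character(tokenize_then_lower)
```

For this simple text, both routes should yield the same lowercased tokens; which one to use depends only on the class of object you already have.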

Using tokens_tolower() with the inaugural addresses:

inaugTokens <- tokens(data_corpus_inaugural, remove_punct = TRUE)
tokens_tolower(inaugTokens[2])
## tokens from 1 document.
## 1793-Washington :
##   [1] "fellow"         "citizens"       "i"              "am"            
##   [5] "again"          "called"         "upon"           "by"            
##   [9] "the"            "voice"          "of"             "my"            
##  [13] "country"        "to"             "execute"        "the"           
##  [17] "functions"      "of"             "its"            "chief"         
##  [21] "magistrate"     "when"           "the"            "occasion"      
##  [25] "proper"         "for"            "it"             "shall"         
##  [29] "arrive"         "i"              "shall"          "endeavor"      
##  [33] "to"             "express"        "the"            "high"          
##  [37] "sense"          "i"              "entertain"      "of"            
##  [41] "this"           "distinguished"  "honor"          "and"           
##  [45] "of"             "the"            "confidence"     "which"         
##  [49] "has"            "been"           "reposed"        "in"            
##  [53] "me"             "by"             "the"            "people"        
##  [57] "of"             "united"         "america"        "previous"      
##  [61] "to"             "the"            "execution"      "of"            
##  [65] "any"            "official"       "act"            "of"            
##  [69] "the"            "president"      "the"            "constitution"  
##  [73] "requires"       "an"             "oath"           "of"            
##  [77] "office"         "this"           "oath"           "i"             
##  [81] "am"             "now"            "about"          "to"            
##  [85] "take"           "and"            "in"             "your"          
##  [89] "presence"       "that"           "if"             "it"            
##  [93] "shall"          "be"             "found"          "during"        
##  [97] "my"             "administration" "of"             "the"           
## [101] "government"     "i"              "have"           "in"            
## [105] "any"            "instance"       "violated"       "willingly"     
## [109] "or"             "knowingly"      "the"            "injunctions"   
## [113] "thereof"        "i"              "may"            "besides"       
## [117] "incurring"      "constitutional" "punishment"     "be"            
## [121] "subject"        "to"             "the"            "upbraidings"   
## [125] "of"             "all"            "who"            "are"           
## [129] "now"            "witnesses"      "of"             "the"           
## [133] "present"        "solemn"         "ceremony"

Here, we supplied one of the optional arguments to the tokens() function: remove_punct. This argument takes a “logical” type value (TRUE or FALSE) and specifies whether punctuation characters should be removed. The help page for tokens(), which you can access using the command ?tokens, details all of the function arguments and their valid values.
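The effect of this logical argument can be checked by counting tokens with and without it. This is a minimal sketch that reuses the second example sentence from earlier; the object names are assumptions made for illustration.

```r
library(quanteda)

txt <- "This, however, is the second text"

# Keep punctuation: the two commas are counted as tokens
with_punct <- tokens(txt, remove_punct = FALSE)

# Drop punctuation: only the six words remain
without_punct <- tokens(txt, remove_punct = TRUE)

ntoken(with_punct)
ntoken(without_punct)
```

The difference between the two counts is the number of punctuation tokens, here the two commas.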

Every function in R and its contributed packages has a help page, and this is the first place to look when examining a function. Well-written help pages will also contain examples that you can run to see how a function operates. For quanteda, the main functions also have help pages with the results of executing their examples on http://quanteda.io/reference/.

Returning to tokenization: Once each text has been split into words, we can use the dfm function to create a matrix of counts of the occurrences of each word in each document:

inaugDfm <- dfm(inaugTokens)
trimmedInaugDfm <- dfm_trim(inaugDfm, min_docfreq = 5, min_termfreq = 10)
weightedTrimmedDfm <- dfm_tfidf(trimmedInaugDfm)

require(magrittr)
## Loading required package: magrittr
inaugDfm2 <- dfm(inaugTokens) %>% 
    dfm_trim(min_docfreq = 5, min_termfreq = 10) %>% 
    dfm_tfidf()
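The trimming and weighting steps are easier to follow on a toy example. The sketch below uses three made-up documents and hypothetical object names, with much lower thresholds than the ones applied to the inaugural addresses.

```r
library(quanteda)

# Three made-up documents
toy_txt <- c(a = "the cat sat", b = "the cat ran", c = "the dog ran")
toy_dfm <- dfm(tokens(toy_txt))

# Keep features that occur at least twice overall and in at least two documents
toy_trimmed <- dfm_trim(toy_dfm, min_termfreq = 2, min_docfreq = 2)
featnames(toy_trimmed)

# Weight the remaining counts by tf-idf, using the function's defaults
toy_weighted <- dfm_tfidf(toy_trimmed)
as.matrix(toy_weighted)
```

Here "sat" and "dog" are dropped because each occurs only once. With the default inverse document frequency, a feature that appears in every document (here "the") receives a weight of zero, which is why tf-idf downweights very common words.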

Note that dfm() works on a variety of object types, including character vectors, corpus objects, and tokenized text objects. This gives the user maximum flexibility and power, while also making it easy to achieve similar results by going directly from texts to a document-by-feature matrix.

To see the classes of object for which a particular method (function) is defined, you can use the methods() function:

methods(dfm)
## [1] dfm.character* dfm.corpus*    dfm.default*   dfm.dfm*      
## [5] dfm.tokens*   
## see '?methods' for accessing help and source code

Likewise, you can also figure out what methods are defined for any given class of object, using the same function:

methods(class = "tokens")
##  [1] [                     [[                    [[<-                 
##  [4] [<-                   +                     $                    
##  [7] as.character          as.list               c                    
## [10] dfm                   docnames              docnames<-           
## [13] docvars               docvars<-             fcm                  
## [16] kwic                  lengths               metadoc              
## [19] ndoc                  nsentence             nsyllable            
## [22] ntoken                ntype                 phrase               
## [25] print                 textstat_collocations tokens_compound      
## [28] tokens_lookup         tokens_ngrams         tokens_replace       
## [31] tokens_segment        tokens_select         tokens_skipgrams     
## [34] tokens_subset         tokens_tolower        tokens_toupper       
## [37] tokens_wordstem       tokens                types                
## [40] unlist               
## see '?methods' for accessing help and source code

If we are interested in analysing the texts with respect to some other variables, we can create a corpus object to associate the texts with this metadata. For example, consider six recent inaugural addresses, documents 52 to 57 (1993 to 2013):

summary(data_corpus_inaugural[52:57])
##    Length     Class      Mode 
##         6 character character

We can use the docvars argument to the corpus() function to record the party with which each text is associated:

dv <- data.frame(Party = c("dem", "dem", "rep", "rep", "dem", "dem"))
recentCorpus <- corpus(data_corpus_inaugural[52:57], docvars = dv)
summary(recentCorpus)
## Corpus consisting of 6 documents:
## 
##          Text Types Tokens Sentences Party
##  1993-Clinton   642   1833        81   dem
##  1997-Clinton   773   2449       111   dem
##     2001-Bush   621   1808        97   rep
##     2005-Bush   773   2319       100   rep
##    2009-Obama   938   2711       110   dem
##    2013-Obama   814   2317        88   dem
## 
## Source: /Users/kbenoit/GitHub/ITAUR/1_demo/* on x86_64 by kbenoit
## Created: Wed Mar 28 14:16:05 2018
## Notes:
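The same pattern can be sketched on a few made-up texts, which makes it easy to confirm that the document variables are attached in the order given. The texts, party labels and object names below are illustrative assumptions.

```r
library(quanteda)

# Made-up texts and a matching data.frame of document variables
toy_txt <- c(a = "We invest in the people",
             b = "Freedom is the cause",
             c = "The people act")
toy_dv <- data.frame(Party = c("dem", "rep", "dem"),
                     stringsAsFactors = FALSE)

# One row of docvars per document, in the same order as the texts
toy_corpus <- corpus(toy_txt, docvars = toy_dv)

ndoc(toy_corpus)
docvars(toy_corpus, "Party")
```

The docvars data.frame needs one row per document; docvars() then retrieves the stored values for later use, such as grouping.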

We can use this metadata to combine features across documents when creating a document-feature matrix:

partyDfm <- dfm(recentCorpus, groups = "Party", remove = stopwords("english"))
textplot_wordcloud(partyDfm, comparison = TRUE)
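To see what grouping does to the counts, here is a minimal sketch on made-up texts. It takes a slightly different route from the call above: it removes stopwords with tokens_remove() and then combines documents with dfm_group(), passing the grouping as a factor. The texts, party labels and object names are assumptions for illustration.

```r
library(quanteda)

# Made-up texts with a party label for each document
toy_txt <- c(a = "We invest in the people",
             b = "Freedom is the cause",
             c = "The people act")
toy_party <- factor(c("dem", "rep", "dem"))

# Tokenize, drop English stopwords, and build a document-feature matrix
toy_toks <- tokens_remove(tokens(toy_txt), stopwords("english"))
toy_dfm <- dfm(toy_toks)

# Sum the feature counts of all documents that share a party label
toy_grouped <- dfm_group(toy_dfm, groups = toy_party)

docnames(toy_grouped)
as.matrix(toy_grouped)
```

After grouping, there is one row per party rather than one per speech, and the count for "people" in the "dem" row is the sum over the two "dem" documents.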