The textsampler R-Package samples texts from a predefined text source. This implementation uses tidy data principles and works seamlessly with existing text mining packages such as tm, tidytext, and rvest. In addition, it supplies multiple built-in text datasets for a hassle-free sampling of words, sentences, and texts.
You can easily install the latest development version of textsampler via GitHub.
# Install the development version from GitHub: # install.packages("devtools") devtools::install_github("nproellochs/textsampler")
This section shows the basic functionality of how to sample text from a predefined text source. First, load the corresponding package textsampler.
The following example shows how to sample sentences from a built-in database of texts. The result is a data frame containing five random sentences.
# Sample five sentences sample_text(n = 5, type = "sentences") #> # A tibble: 5 x 3 #> Id Text Length #> <int> <chr> <int> #> 1 897 the pizza selections are good. 5 #> 2 264 good service, very clean, and inexpensive, to boot! 8 #> 3 368 would come back again if i had a sushi craving while in veg~ 13 #> 4 569 an hour... seriously? 3 #> 5 904 and the drinks are weak, people! 6
Example: Sampling text from built-in text source
The following example shows how to sample words from a built-in text source (“english_words”). The result is a data frame containing five random words.
# Sample five words from english_words sample_text(n = 5, type = "words", source = "english_words") #> # A tibble: 5 x 3 #> Id Text Length #> <int> <chr> <int> #> 1 9440 cuisin 1 #> 2 42046 trojan 1 #> 3 44211 upper 1 #> 4 30925 prediagnost 1 #> 5 29442 peter 1
Example: Sampling text from website
The textsampler R-package works with tidy tools and can easily be combined with existing packages such as the rvest R-package. The following example shows how to sample texts from a website. Specifically, the example samples 15 famous quotes by Julius Ceasar.
library(rvest) read_html("https://www.brainyquote.com/authors/julius-caesar-quotes/") %>% html_nodes(xpath = ".//a[contains(@class, 'b-qt qt_')]") %>% html_text() %>% enframe() %>% sample_text(n = 15, source = ., input = "value", min_length = 1, max_length = 40, shuffle = F, clean = T) #> # A tibble: 15 x 3 #> Id Text Length #> <int> <chr> <int> #> 1 1 experience is the teacher of all things. 7 #> 2 2 it is easier to find men who will volunteer to die, than t~ 23 #> 3 3 it was the wont of the immortal gods sometimes to grant pr~ 38 #> 4 4 cowards die many times before their actual deaths. 8 #> 5 5 if you must break the law, do it to seize power: in all ot~ 17 #> 6 7 i came, i saw, i conquered. 6 #> 7 8 it is not these well-fed long-haired men that i fear, but ~ 19 #> 8 9 i have lived long enough both in years and in accomplishme~ 11 #> 9 10 i had rather be first in a village than second at rome. 12 #> 10 11 i love the name of honor, more than i fear death. 11 #> 11 12 no one is so brave that he is not disturbed by something u~ 13 #> 12 13 men willingly believe what they wish. 6 #> 13 14 i have lived long enough to satisfy both nature and glory. 11 #> 14 15 i have always reckoned the dignity of the republic of firs~ 16 #> 15 16 as a rule, men worry more about what they can't see than a~ 16
Example: Sampling text from vector source
The textsamplr R-package can be used to sample text from a vector source. The following example samples five random sentences from a book downloaded by the gutenbergr R-Package.
library(gutenbergr) full_text <- gutenberg_download(5314) textsampler::sample_text(n = 5, source = full_text$text[1:1000], type = "sentences", shuffle = T) #> # A tibble: 5 x 3 #> Id Text Length #> <int> <chr> <int> #> 1 90 59 frederick and catherine (der frieder und das catherliesc~ 9 #> 2 281 "thou wilt have, dear frog,\" said she--\"my clothes, my pe~ 13 #> 3 245 legend 4 poverty and humility lead to heaven (armut und dem~ 14 #> 4 736 "\"one of this kind has never come my way before.\"" 10 #> 5 453 they set 2
Example: Sampling text data with specific text characteristics
The textsamplr R-package allows one to sample texts with specific text characteristics. The following example samples three sentences from Amazon reviews, all of which have a maximum length of 5 words and contain the word ‘great’.
sample_text(n = 5, source = "amazon_sentences", type = "sentences", max_length = 5, word_list = c("great")) #> # A tibble: 5 x 3 #> Id Text Length #> <int> <chr> <int> #> 1 557 great product for the price!. 5 #> 2 291 great phone. 2 #> 3 474 great software for motorolas. 4 #> 4 793 great phone. 2 #> 5 234 great sound and service. 4
If you experience any difficulties with the package, or have suggestions, or want to contribute directly, you have the following options:
- Contact the maintainer by email.
- Issues and bug reports: File a GitHub issue.
- Fork the source code, modify, and issue a pull request through the project GitHub page.
textsampler is released under the MIT License
Copyright (c) 2019 Nicolas Pröllochs