Author: Nicolas Pröllochs
License: MIT
The textsampler R package samples texts from a predefined text source. The implementation follows tidy data principles and works with existing R packages such as tm, tidytext, and rvest. It also ships with several built-in text datasets, so words, sentences, and texts can be sampled without any extra setup.
You can install the latest development version of textsampler from GitHub.
# Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("nproellochs/textsampler")
This section shows how to sample text from a predefined text source. First, load the textsampler package.
library(textsampler)
The following example shows how to sample sentences from a built-in database of texts. The result is a data frame containing five random sentences.
# Sample five sentences
sample_text(n = 5, type = "sentences")
#> # A tibble: 5 x 3
#> Id Text Length
#> <int> <chr> <int>
#> 1 897 the pizza selections are good. 5
#> 2 264 good service, very clean, and inexpensive, to boot! 8
#> 3 368 would come back again if i had a sushi craving while in veg~ 13
#> 4 569 an hour... seriously? 3
#> 5 904 and the drinks are weak, people! 6
The following example shows how to sample words from a built-in text source (“english_words”). The result is a data frame containing five random words.
# Sample five words from english_words
sample_text(n = 5, type = "words", source = "english_words")
#> # A tibble: 5 x 3
#> Id Text Length
#> <int> <chr> <int>
#> 1 9440 cuisin 1
#> 2 42046 trojan 1
#> 3 44211 upper 1
#> 4 30925 prediagnost 1
#> 5 29442 peter 1
The textsampler R package works with tidy tools and can easily be combined with existing packages such as rvest. The following example shows how to sample texts from a website. Specifically, it samples 15 famous quotes by Julius Caesar.
library(rvest)
library(tibble)  # provides enframe()
read_html("https://www.brainyquote.com/authors/julius-caesar-quotes/") %>%
  html_nodes(xpath = ".//a[contains(@class, 'b-qt qt_')]") %>%
  html_text() %>%
  enframe() %>%
  sample_text(n = 15, source = ., input = "value", min_length = 1, max_length = 40,
              shuffle = FALSE, clean = TRUE)
#> # A tibble: 15 x 3
#> Id Text Length
#> <int> <chr> <int>
#> 1 1 experience is the teacher of all things. 7
#> 2 2 it is easier to find men who will volunteer to die, than t~ 23
#> 3 3 it was the wont of the immortal gods sometimes to grant pr~ 38
#> 4 4 cowards die many times before their actual deaths. 8
#> 5 5 if you must break the law, do it to seize power: in all ot~ 17
#> 6 7 i came, i saw, i conquered. 6
#> 7 8 it is not these well-fed long-haired men that i fear, but ~ 19
#> 8 9 i have lived long enough both in years and in accomplishme~ 11
#> 9 10 i had rather be first in a village than second at rome. 12
#> 10 11 i love the name of honor, more than i fear death. 11
#> 11 12 no one is so brave that he is not disturbed by something u~ 13
#> 12 13 men willingly believe what they wish. 6
#> 13 14 i have lived long enough to satisfy both nature and glory. 11
#> 14 15 i have always reckoned the dignity of the republic of firs~ 16
#> 15 16 as a rule, men worry more about what they can't see than a~ 16
The textsampler R package can also sample text from a vector source. The following example samples five random sentences from a book downloaded with the gutenbergr R package.
library(gutenbergr)
full_text <- gutenberg_download(5314)
textsampler::sample_text(n = 5, source = full_text$text[1:1000], type = "sentences", shuffle = TRUE)
#> # A tibble: 5 x 3
#> Id Text Length
#> <int> <chr> <int>
#> 1 90 59 frederick and catherine (der frieder und das catherliesc~ 9
#> 2 281 "thou wilt have, dear frog,\" said she--\"my clothes, my pe~ 13
#> 3 245 legend 4 poverty and humility lead to heaven (armut und dem~ 14
#> 4 736 "\"one of this kind has never come my way before.\"" 10
#> 5 453 they set 2
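A character vector of raw lines, such as the one returned by gutenbergr, usually does not line up with sentence boundaries. The following base R sketch illustrates the general idea of turning such a vector into sentences before sampling. It is not textsampler's actual implementation: the helper `split_sentences()` and the toy input are hypothetical, and the splitting rule (break after `.`, `!`, or `?` followed by whitespace) is a simplifying assumption.

```r
# Hypothetical helper for illustration only; not part of textsampler.
split_sentences <- function(lines) {
  # Drop empty lines and join the rest into one running text
  text <- paste(lines[nzchar(trimws(lines))], collapse = " ")
  # Assumed rule: split on whitespace that follows ., ! or ?
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  trimws(sentences)
}

# Toy input: sentences broken across lines, as in a plain-text book
lines <- c("The frog sat by the well. The princess",
           "dropped her golden ball! Did she",
           "get it back? Nobody knew.",
           "")

sentences <- split_sentences(lines)

# Draw two sentences at random; set.seed() makes the draw repeatable
set.seed(42)
picked <- sample(sentences, 2)
```

A rule this simple will split incorrectly at abbreviations and numbered headings, which is one reason sampled sentences from real books can look fragmentary.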
The textsampler R package can sample texts with specific characteristics. The following example samples five sentences from Amazon reviews, each of which has a maximum length of 5 words and contains the word ‘great’.
sample_text(n = 5, source = "amazon_sentences", type = "sentences",
max_length = 5, word_list = c("great"))
#> # A tibble: 5 x 3
#> Id Text Length
#> <int> <chr> <int>
#> 1 557 great product for the price!. 5
#> 2 291 great phone. 2
#> 3 474 great software for motorolas. 4
#> 4 793 great phone. 2
#> 5 234 great sound and service. 4
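To make the two filters concrete, here is a minimal base R sketch of length and word-list filtering followed by random sampling. It is an illustration under stated assumptions, not the package's implementation: the helper `sample_filtered()` and the toy reviews are hypothetical, and length is assumed to mean the number of whitespace-separated tokens.

```r
# Hypothetical helper for illustration only; not part of textsampler.
sample_filtered <- function(texts, n, max_length = Inf, word_list = character(0)) {
  texts <- tolower(texts)
  # Assumption: a text's length is its number of whitespace-separated tokens
  tokens <- strsplit(trimws(texts), "\\s+")
  n_words <- lengths(tokens)
  # Strip leading/trailing punctuation so "great!" still matches "great"
  has_word <- vapply(tokens, function(tok) {
    tok <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tok)
    length(word_list) == 0 || any(tok %in% word_list)
  }, logical(1))
  # Keep only texts that pass both filters, then sample among them
  keep <- which(n_words <= max_length & has_word)
  idx <- keep[sample.int(length(keep), min(n, length(keep)))]
  data.frame(Id = idx, Text = texts[idx], Length = n_words[idx],
             stringsAsFactors = FALSE)
}

# Toy input, made up for this example
reviews <- c("Great phone.",
             "The battery died after two days of light use.",
             "Great sound and service.",
             "Not great, not terrible, just an average case overall.",
             "Works fine.")

set.seed(1)
result <- sample_filtered(reviews, n = 2, max_length = 5, word_list = "great")
```

Only the first and third reviews are both short enough and contain the word, so `result` holds those two rows in random order.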
If you run into difficulties with the package, have suggestions, or want to contribute directly, you have the following options:
- Contact the maintainer by email.
- Issues and bug reports: File a GitHub issue.
- Fork the source code, modify, and issue a pull request through the project GitHub page.
textsampler is released under the MIT License.
Copyright (c) 2019 Nicolas Pröllochs