Create standard texts and a dictionary for examples #592

koheiw · 2017-03-10T22:16:29Z

This is from dfm_select's example. Texts and dictionaries in examples should be more meaningful (these look almost like unit tests).

myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.", 
                "Does the United_States or Sweden have more progressive taxation?"),
              tolower = FALSE, verbose = FALSE)
mydict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                           wordsEndingInY = c("by", "my"),
                           notintext = "blahblah"))

I like the examples in dfm, but they are all about taxes. It would be better to have more place names.

# with the thesaurus feature
mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
              "New York City has raised taxes: an income tax and a sales tax.")
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))

We should also have a naming rule for examples: dfm, mydfm, or mat? texts, mytexts, or txts? Standard texts and consistent naming will make the manual easier to understand.
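As a rough sketch of what a standardized example could look like: the texts below are invented for illustration, and the object names (txts, dict, toks, dfmat) are only one possible convention, not an agreed rule. It assumes a quanteda version that provides tokens(), dfm(), dictionary() and dfm_lookup().

```r
library(quanteda)

# Illustrative texts with place names (invented for this sketch)
txts <- c(d1 = "London and Paris raised taxes.",
          d2 = "Tokyo cut taxes; Paris did not.")

# Small dictionary: a place-name key and a glob pattern for tax terms
dict <- dictionary(list(cities = c("london", "paris", "tokyo"),
                        tax = "tax*"))

# Tokenize, build the document-feature matrix, then count dictionary matches
toks <- tokens(txts, remove_punct = TRUE)
dfmat <- dfm(toks)
dfmat_dict <- dfm_lookup(dfmat, dictionary = dict)
dfmat_dict
```

Using one short prefix per object type (txts, toks, dfmat, dict) keeps the examples readable across help pages, whichever exact names are chosen.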

kbenoit · 2017-09-12T11:46:34Z

It would be a nice addition to include a data_dictionary_sentiment object that could be used for examples but also for meaningful, dictionary-based sentiment analysis.

stefan-mueller · 2017-09-12T12:13:45Z

I agree. The tidytext package contains the English NRC sentiment dictionary. I could ask the dictionary maintainers whether we would be allowed to add it to quanteda as well (or maybe just the positive (4672 words) and negative (10461 words) sentiment categories). This subset would already enable users to conduct a meaningful sentiment analysis. The problem with most other sentiment dictionaries is that they are not free or require an official download request from each user.

library(tidytext)

tidytext::sentiments
# > tidytext::sentiments
# # A tibble: 27,314 x 4
#           word sentiment lexicon score
#          <chr>     <chr>   <chr> <int>
#  1      abacus     trust     nrc    NA
#  2     abandon      fear     nrc    NA
#  3     abandon  negative     nrc    NA
#  4     abandon   sadness     nrc    NA
#  5   abandoned     anger     nrc    NA
#  6   abandoned      fear     nrc    NA
#  7   abandoned  negative     nrc    NA
#  8   abandoned   sadness     nrc    NA
#  9 abandonment     anger     nrc    NA
# 10 abandonment      fear     nrc    NA
# # ... with 27,304 more rows

table(sentiments$sentiment)
# > table(sentiments$sentiment)
#        anger anticipation constraining      disgust
#         1247          839          184         1058
#         fear          joy    litigious     negative
#         1476          689          903        10461
#     positive      sadness  superfluous     surprise
#         4672         1191           56          534
#        trust  uncertainty
#         1231          297

kbenoit · 2017-09-12T12:56:26Z

I have asked them, but have not received a reply yet. I will try again.

kbenoit · 2017-09-18T14:46:03Z

Have Lexicoder now, just need some texts. Why not use the inaugural texts?
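A minimal sketch of what such an example might look like, assuming the installed quanteda version ships both data_corpus_inaugural and data_dictionary_LSD2015, and that the first two keys of the dictionary are "negative" and "positive". The object names toks and dfmat_sent are illustrative only.

```r
library(quanteda)

# Tokenize the inaugural addresses bundled with quanteda
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# Count matches for the first two Lexicoder keys (assumed: negative, positive)
dfmat_sent <- dfm_lookup(dfm(toks), dictionary = data_dictionary_LSD2015[1:2])

# One row per address, one column per sentiment key
head(dfmat_sent)
```

The resulting counts could then be normalized by document length before comparing speeches, since the addresses differ considerably in size.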

stefan-mueller · 2017-09-18T15:16:01Z

Yes, that's probably the easiest way. We could also consider adding the positive and negative movie reviews from the readtext package. With this corpus we could show to what degree the estimated sentiment is in line with the movie evaluation. But adding more texts would increase the size of the package, which might be problematic for CRAN.

stefan-mueller · 2017-09-18T18:53:19Z

I thought a bit more about it, and some form of a sentiment corpus might be useful for the manuals of several functions:

textmodel_NB only uses the Manning et al. (2008) example (five sentences), but it would be good to classify a "real" dfm. We could use the Naive Bayes classifier to predict the category (positive or negative) or movie (in case the corpus contains several movies) of an unclassified review. Moreover, we could compare the performances of Naive Bayes and a sentiment dictionary (based on data_dictionary_LSD2015).
In the textstat_keyness examples we use Obama vs. Trump, but "positive" vs. "negative" reviews might be more comprehensible.
For textstat_frequency, it would also be nice to look at the most common words in "negative" and "positive" reviews using groups = "category".
The negative and positive reviews could also be used to exemplify the merits of tfidf in dfm_weight.

We could add data_corpus_reviews consisting of five positive and five negative reviews about one or more movies. The following docvars might be reasonable:

movie: name of movie
stars: number of stars/evaluation attached to review
category: positive/negative

kbenoit · 2017-09-18T20:13:30Z

That's a good idea, but we'd want to include the whole set of movie reviews. I thought of this before, but moved the n=2000 Pang and Lee set to quantedaData because it was too large to distribute with the package. Once we pare down the package for v1.0 we could revisit this, however.

data_corpus_movies would provide good examples for dictionaries, classifiers, and keyness. Topic models actually work great on this set too.

stefan-mueller · 2017-09-18T20:22:58Z

Sounds good, @kbenoit – and yes, a large set of movie reviews would be even better if possible. Once data_corpus_movies is added to quanteda, I am happy to update and/or extend the examples in the documentation for the functions mentioned above.

kbenoit · 2017-10-16T07:51:41Z

We have the data_dictionary_LSD2015 now, and we can improve the documentation (examples) in completing #992.

koheiw added the documentation label Mar 10, 2017

koheiw assigned kbenoit Mar 10, 2017

kbenoit modified the milestone: v1.0 Mar 16, 2017

stefan-mueller self-assigned this Sep 18, 2017

kbenoit closed this as completed Oct 16, 2017
