Create standard texts and a dictionary for examples #592

Closed
koheiw opened this issue Mar 10, 2017 · 9 comments

@koheiw
Collaborator

koheiw commented Mar 10, 2017

This is from dfm_select's example. Texts and dictionaries should be more meaningful in examples (these look almost like unit tests).

myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.", 
                "Does the United_States or Sweden have more progressive taxation?"),
              tolower = FALSE, verbose = FALSE)
mydict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                           wordsEndingInY = c("by", "my"),
                           notintext = "blahblah"))
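
For context, a minimal sketch of how these objects would feed into dfm_select. It assumes quanteda version 3 or later, where dfm() takes a tokens object and the selection argument is named pattern; the object names txts, dict, dfmat, and dfmat_sel are chosen here for illustration and are not from the package manual.

```r
library(quanteda)

# Texts and dictionary copied from the example above
txts <- c("My Christmas was ruined by your opposition tax plan.",
          "Does the United_States or Sweden have more progressive taxation?")
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))

# Assumption: quanteda >= 3, so the texts are tokenized before dfm()
dfmat <- dfm(tokens(txts), tolower = FALSE)

# Keep only the features that match a dictionary value
# (matching is case-insensitive by default)
dfmat_sel <- dfm_select(dfmat, pattern = dict, selection = "keep")
featnames(dfmat_sel)
```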

I like the examples in dfm, but they are all about taxes. It would be better to have more place names.

# with the thesaurus feature
mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
              "New York City has raised taxes: an income tax and a sales tax.")
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))

We should also have a naming rule for examples: dfm, mydfm, or mat? texts, mytexts, or txts? Standard texts and consistent naming will make the manual easier to understand.

@kbenoit kbenoit modified the milestone: v1.0 Mar 16, 2017
@kbenoit
Collaborator

kbenoit commented Sep 12, 2017

It would be a nice addition to include a data_dictionary_sentiment object that could be used for examples but also for meaningful, dictionary-based sentiment analysis.

@stefan-mueller
Collaborator

I agree. The tidytext package contains the English NRC sentiment dictionary. I could ask the dictionary maintainers whether we would be allowed to add it to quanteda as well, or perhaps just the positive (4672 words) and negative (10461 words) sentiment categories. This subset would already enable users to conduct a meaningful sentiment analysis. The problem with most other sentiment dictionaries is that they are not free or require an official download request from each user.

library(tidytext)

tidytext::sentiments
# > tidytext::sentiments
# # A tibble: 27,314 x 4
#           word sentiment lexicon score
#          <chr>     <chr>   <chr> <int>
#  1      abacus     trust     nrc    NA
#  2     abandon      fear     nrc    NA
#  3     abandon  negative     nrc    NA
#  4     abandon   sadness     nrc    NA
#  5   abandoned     anger     nrc    NA
#  6   abandoned      fear     nrc    NA
#  7   abandoned  negative     nrc    NA
#  8   abandoned   sadness     nrc    NA
#  9 abandonment     anger     nrc    NA
# 10 abandonment      fear     nrc    NA
# # ... with 27,304 more rows

table(sentiments$sentiment)
# > table(sentiments$sentiment)
#        anger anticipation constraining      disgust 
#         1247          839          184         1058 
#         fear          joy    litigious     negative 
#         1476          689          903        10461 
#     positive      sadness  superfluous     surprise 
#         4672         1191           56          534 
#        trust  uncertainty 
#         1231          297 
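
As a rough illustration of the positive/negative subset idea, the sketch below turns a word/sentiment table into a quanteda dictionary. The data frame lex is a hypothetical stand-in with made-up rows in the same shape as the table above, not the real lexicon, and the names lex_posneg and dict_sent are also chosen only for this example.

```r
library(quanteda)

# Hypothetical stand-in for a word/sentiment lexicon table
lex <- data.frame(
  word = c("abacus", "abandon", "abandon", "abandoned", "happy", "good"),
  sentiment = c("trust", "fear", "negative", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

# Keep only the positive and negative categories
lex_posneg <- lex[lex$sentiment %in% c("positive", "negative"), ]

# split() yields a named list of character vectors, one entry per category,
# which dictionary() accepts directly
dict_sent <- dictionary(split(lex_posneg$word, lex_posneg$sentiment))
names(dict_sent)
```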

@kbenoit
Collaborator

kbenoit commented Sep 12, 2017

I have asked them, but have not received a reply yet. I will try again.

@kbenoit
Collaborator

kbenoit commented Sep 18, 2017

Have Lexicoder now, just need some texts. Why not use the inaugural texts?
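
A minimal sketch of what such an example might look like, assuming a quanteda version that ships both data_corpus_inaugural and data_dictionary_LSD2015. The object names toks, toks_lsd, and dfmat_lsd are illustrative, and restricting the dictionary to its first two keys (negative and positive) is a simplification made here.

```r
library(quanteda)

# Tokenize the inaugural addresses
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# Look up only the plain negative and positive keys of the Lexicoder dictionary
toks_lsd <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015[1:2])

# One row per speech, one column per sentiment category
dfmat_lsd <- dfm(toks_lsd)
head(dfmat_lsd)
```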

@stefan-mueller
Collaborator

Yes, that's probably the easiest way. We could also consider adding the positive and negative movie reviews from the readtext package. With this corpus we could show to what degree the estimated sentiment is in line with the movie evaluation. However, adding more text would increase the size of the package, which might be problematic for CRAN.

@stefan-mueller
Collaborator

I thought a bit more about it, and some form of a sentiment corpus might be useful for the manuals of several functions:

  • textmodel_NB only uses the Manning et al. (2008) example (five sentences), but it would be good to classify a "real" dfm. We could use the Naive Bayes classifier to predict the category (positive or negative) or the movie (in case the corpus contains several movies) of an unclassified review. Moreover, we could compare the performance of Naive Bayes with that of a sentiment dictionary (based on data_dictionary_LSD2015).
  • In the textstat_keyness examples we use Obama vs. Trump, but "positive" vs. "negative" reviews might be more comprehensible.
  • For textstat_frequency, it would also be nice to look at the most common words in "negative" and "positive" reviews using groups = "category".
  • The negative and positive reviews could also be used to exemplify the merits of tfidf in dfm_weight.

We could add data_corpus_reviews consisting of five positive and five negative reviews about one or more movies. The following docvars might be reasonable:

  • movie: name of movie
  • stars: number of stars/evaluation attached to review
  • category: positive/negative
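
To make the proposed structure concrete, here is a small sketch of a corpus with these three docvars, grouped by category. The four review texts and the movie names are hypothetical placeholders invented for illustration, and groups = category assumes quanteda version 3 or later, where groups is evaluated against the docvars.

```r
library(quanteda)

# Hypothetical toy reviews; a real data_corpus_reviews would hold actual reviews
reviews <- data.frame(
  text = c("A wonderful, moving film with a great cast.",
           "Brilliant direction and a great script.",
           "A dull plot and terrible acting.",
           "Boring, predictable, and far too long."),
  movie = c("Movie A", "Movie B", "Movie A", "Movie B"),
  stars = c(5, 4, 1, 2),
  category = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# The remaining columns (movie, stars, category) become docvars
corp <- corpus(reviews, text_field = "text")

# Build a dfm and sum the feature counts within each category
dfmat <- dfm(tokens(corp, remove_punct = TRUE))
dfmat_grp <- dfm_group(dfmat, groups = category)

# Most frequent features among the positive reviews
topfeatures(dfmat_grp["positive", ], 5)
```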

@kbenoit
Collaborator

kbenoit commented Sep 18, 2017

That's a good idea, but we'd want to include the whole set of movie reviews. I thought of this before, but moved the n=2000 Pang and Lee set to quantedaData because it was too large to distribute with the package. Once we pare down the package for v1.0 we could revisit this, however.

data_corpus_movies would provide good examples for dictionaries, classifiers, and keyness. Topic models actually work great on this set too.

@stefan-mueller
Collaborator

Sounds good, @kbenoit – and yes, a large set of movie reviews would be even better if possible. Once data_corpus_movies is added to quanteda, I am happy to update and/or extend the examples in the documentation for the functions mentioned above.

@stefan-mueller stefan-mueller self-assigned this Sep 18, 2017
@kbenoit
Collaborator

kbenoit commented Oct 16, 2017

We have the data_dictionary_LSD2015 now, and we can improve the documentation (examples) in completing #992.

@kbenoit kbenoit closed this as completed Oct 16, 2017