Add support for other languages than English in get_sentiment for NRC… #18

Open

@denrou
denrou commented Jan 5, 2017

The NRC lexicon supports many languages, so this should partially solve issue #13.
sysdata.rda has been moved to data-raw/sysdata-old.rda, and a new sysdata.rda has been created with the new NRC lexicon. Details about how to get this new lexicon are stored in data-raw/data.R.

Some changes have been introduced to make sentiment analysis work with multiple languages. In particular, I added dplyr and tidyr as dependencies to make the data frame tidier.
Because the NRC lexicon is quite large, every word with a sentiment set to 0 has been removed.
The data frame now has around 500,000 observations (10 times smaller than the original one).
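
For readers following along, here is a minimal sketch of the kind of reshaping and filtering described above. It is not the code from data-raw/data.R: the column names (`lang`, `word`, `sentiment`, `value`) and the sample rows are assumptions made up for illustration, and `pivot_longer()` requires tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide lexicon: one row per word, one column per sentiment.
# Words and scores are illustrative only, not taken from the NRC lexicon.
wide <- data.frame(
  lang     = c("english", "english", "french"),
  word     = c("abandon", "table", "joie"),
  fear     = c(1, 0, 0),
  joy      = c(0, 0, 1),
  negative = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# Reshape to one row per (lang, word, sentiment), then drop zero-valued rows.
tidy_lexicon <- wide %>%
  pivot_longer(cols = -c(lang, word),
               names_to = "sentiment",
               values_to = "value") %>%
  filter(value != 0)

print(tidy_lexicon)
```

In this toy table, "table" has no non-zero sentiment, so it disappears entirely after filtering. The same effect is what shrinks the full data frame.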

@mjockers
Owner
mjockers commented Jan 6, 2017

This looks like a great addition, thanks. Before I can merge it, I will need to think about how to update the get_sentences() function to include a "language" argument. get_sentences() uses openNLP, which has different language models. From the openNLP R package docs: "For languages other than English, these can conveniently be made available to R by installing the respective openNLPmodels.language package from the repository at http://datacube.wu.ac.at" It looks like openNLP has models for da, de, en, es, it, nl, pt, and sv, which is not the same set as the languages in the NRC lexicon, so I need to deal with that as well.
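
One possible way to reason about the mismatch is to compute the languages covered by both sides and fail early for anything else. The sketch below uses base R only. The openNLP codes are the ones listed in the comment above; `lexicon_langs` is a hypothetical subset written as two-letter codes, and `check_language()` is a hypothetical helper, not part of syuzhet or openNLP.

```r
# Language codes with openNLP models, as listed in the comment above.
opennlp_langs <- c("da", "de", "en", "es", "it", "nl", "pt", "sv")

# Hypothetical subset of lexicon languages; the real list would be read
# from the lexicon data itself.
lexicon_langs <- c("en", "fr", "de", "es", "ja")

# Languages for which both sentence tokenization and sentiment lookup
# could work end to end.
common_langs <- intersect(opennlp_langs, lexicon_langs)

# Hypothetical helper: stop with a clear message for unsupported languages.
check_language <- function(language) {
  if (!language %in% common_langs) {
    stop("Language '", language, "' is not covered by both openNLP and the ",
         "lexicon; choose one of: ", paste(common_langs, collapse = ", "))
  }
  invisible(language)
}

check_language("de")  # passes silently
```

An alternative design would be to let get_sentiment() accept any lexicon language and restrict only get_sentences() to the openNLP set. Which option fits better depends on how the two functions are meant to be combined.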

@denrou
denrou commented Jan 9, 2017

Thanks for being willing to consider the PR. I agree that if other languages are integrated into sentiment analysis, they should also be integrated with the other functions of the package.
How would you like to handle the fact that openNLPmodels.language and the NRC lexicon don't cover the same set of languages?