Add support for other languages than English in get_sentiment for NRC… #18

wants to merge 2 commits into from


denrou commented Jan 5, 2017

The NRC lexicon supports many languages, so this should partially solve issue #13.
sysdata.rda has been moved to data-raw/sysdata-old.rda, and a new sysdata.rda has been created with the new NRC lexicon. Details about how to get this new lexicon are stored in data-raw/data.R.

Some changes have been introduced to make sentiment analysis work with multiple languages. In particular, I added dplyr and tidyr as dependencies to make the data frame tidier.
Because the NRC lexicon is quite large, every word with a sentiment set to 0 has been removed.
The data frame is now around 500,000 observations (10 times smaller than the original one).
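
The zero-removal step described above can be sketched in base R as follows. This is only an illustration: the toy data frame, its column names (`lang`, `word`, `sentiment`, `value`), and the example rows are assumptions, not the actual contents of sysdata.rda or the code in data-raw/data.R, which relies on dplyr and tidyr.

```r
# Toy long-format lexicon; column names and rows are made up for illustration.
lexicon <- data.frame(
  lang      = c("english", "english", "english", "french", "french"),
  word      = c("abandon", "abandon", "table", "abandon", "table"),
  sentiment = c("fear", "joy", "joy", "fear", "joy"),
  value     = c(1, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Keep only rows where the word actually carries the sentiment.
lexicon_nonzero <- lexicon[lexicon$value != 0, ]

# A lookup for one word in one language now only scans the non-zero rows.
lookup <- lexicon_nonzero[lexicon_nonzero$lang == "french" &
                          lexicon_nonzero$word == "abandon", ]
```

With this layout, a word that has no row for a given sentiment is implicitly scored as zero, which is why dropping the zero rows loses no information.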

mjockers commented Jan 6, 2017

This looks like a great addition, thanks. Before I can merge it, I will need to think about how to update the get_sentences() function to include a "language" argument. get_sentences() uses openNLP, which has different language models. From the openNLP R package docs: "For languages other than English, these can conveniently be made available to R by installing the respective openNLPmodels.language package from the repository at" It looks like openNLP has models for da, de, en, es, it, nl, pt, and sv, which does not match the full set of languages in the NRC lexicon, so that needs to be dealt with as well.
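
To make the mismatch concrete, here is a rough base R sketch comparing the two language sets. The openNLP codes are the ones listed above, and the lexicon names are the ones this PR imports in its `lang_ascii` vector. The `code_to_name` mapping between ISO codes and lexicon names is an assumption added for illustration; it is not part of either package.

```r
# Language codes with openNLP models, as listed in the comment above.
opennlp_codes <- c("da", "de", "en", "es", "it", "nl", "pt", "sv")

# Hypothetical mapping from those codes to the lexicon's language names.
code_to_name <- c(da = "danish", de = "german", en = "english",
                  es = "spanish", it = "italian", nl = "dutch",
                  pt = "portuguese", sv = "swedish")

# Languages imported by this PR (the lang_ascii vector).
nrc_langs <- c("basque", "catalan", "danish", "dutch", "english",
               "esperanto", "finnish", "french", "german", "irish",
               "italian", "latin", "portuguese", "romanian", "somali",
               "spanish", "sudanese", "swahili", "swedish", "turkish",
               "welsh", "zulu")

opennlp_langs <- unname(code_to_name[opennlp_codes])

# Languages covered by both the lexicon and an openNLP model.
both <- intersect(nrc_langs, opennlp_langs)

# Lexicon languages that have no openNLP sentence model in the list above.
lexicon_only <- setdiff(nrc_langs, opennlp_langs)
```

Under these assumptions, `both` holds the languages for which sentence splitting and sentiment lookup could share one "language" argument, while `lexicon_only` holds the ones that would need some other handling.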

denrou commented Jan 9, 2017

Thanks for being willing to consider the PR. I agree that if other languages are integrated into sentiment analysis, they should also be integrated into the other functions of the package.
How would you like to handle the fact that openNLPmodels.language and the NRC lexicon do not cover the same set of languages?

Hi Denrou,
I checked the code "", and there are 41 languages in total in "". May I know why you import only some of the languages listed in NRC-Emotion-Lexicon-v0.92-InManyLanguages-web.xlsx instead of all of them?
code ""
lang_ascii <- c("basque", "catalan", "danish", "dutch", "english", "esperanto", "finnish", "french", "german", "irish", "italian", "latin", "portuguese", "romanian", "somali", "spanish", "sudanese", "swahili", "swedish", "turkish", "welsh", "zulu")

denrou commented May 11, 2017

Hi strategist922,

There are indeed other languages that could be imported. The problem I faced was that these other languages had a different encoding, and the get_nrc_value function crashed when they were included. I didn't spend much time on it, so maybe there is a quick fix.
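
The exact cause of the crash is not diagnosed in this thread. One common way to approach this kind of problem is to normalize every lexicon entry to UTF-8 before any lookup, which can be sketched in base R as below. The sample word and the assumption that the source bytes are Latin-1 are purely illustrative; the real lexicon columns may use other encodings.

```r
# A Latin-1 byte string standing in for a lexicon entry (made-up example).
raw_word <- "caf\xe9"
Encoding(raw_word) <- "latin1"

# Convert to UTF-8 so that all entries share a single encoding.
utf8_word <- iconv(raw_word, from = "latin1", to = "UTF-8")

# validUTF8() flags any strings that would still need attention.
is_ok <- validUTF8(utf8_word)
```

Running `validUTF8()` over a whole column would be a cheap way to find which languages still contain problematic entries.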

Hi denrou!

I tried to use get_nrc_sentiment with lang = "portuguese" and it always returns zero for all sentiments. I was wondering if I should do something more to use it, like importing a Portuguese lexicon or something else.

mjockers commented Aug 2, 2017

Merged, and revised the NRC-related functions for efficiency in v 1.0.3 and for expandability with user-defined lexicons.

mjockers closed this Aug 2, 2017