Add support for other languages than English in get_sentiment for NRC… #18

Closed
wants to merge 2 commits into from

Conversation


@denrou denrou commented Jan 5, 2017

The NRC lexicon supports many languages, so this should partially address issue #13.
sysdata.rda has been moved to data-raw/sysdata-old.rda, and a new sysdata.rda has been created from the new NRC lexicon. Details on how to obtain the new lexicon are in data-raw/data.R.

Some changes have been introduced to make sentiment analysis work with multiple languages. In particular, I added dplyr and tidyr as dependencies to make the data frame tidier.
Because the NRC lexicon is quite large, every word with a sentiment value of 0 has been removed.
The data frame now has around 500,000 observations (10 times smaller than the original one).
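
To illustrate the zero-filtering step described above, here is a minimal base-R sketch on a made-up toy lexicon. It is not the code from data-raw/data.R; the data frame and its column names (lang, word, sentiment, value) are assumptions chosen for the example.

```r
# Hypothetical long-format lexicon; names and values are illustrative only.
lexicon <- data.frame(
  lang      = c("english", "english", "french", "french"),
  word      = c("happy", "table", "heureux", "table"),
  sentiment = c("joy", "joy", "joy", "joy"),
  value     = c(1L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Keep only the rows that carry a non-zero sentiment value.
lexicon_nonzero <- lexicon[lexicon$value != 0, , drop = FALSE]

nrow(lexicon)          # 4 rows before filtering
nrow(lexicon_nonzero)  # 2 rows after filtering
```

A word that is missing from the filtered table is then treated as scoring 0, so the lookup has to handle absent words explicitly.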

@mjockers
Owner

mjockers commented Jan 6, 2017

This looks like a great addition, thanks. Before I can merge it, I will need to think about how to update the get_sentences() function to include a "language" argument. get_sentences() uses openNLP, which has different language models. From the openNLP R package docs: "For languages other than English, these can conveniently be made available to R by installing the respective openNLPmodels.language package from the repository at http://datacube.wu.ac.at". It looks like openNLP has models for da, de, en, es, it, nl, pt, and sv, which does not cover all of the languages in the NRC lexicon, so I need to deal with that as well.
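
One way to see the size of that mismatch is to compare the two language sets directly. The sketch below is only an illustration: the mapping from the two-letter codes to language names is my assumption, and the lexicon side uses the lang_ascii vector quoted from data-raw/data.R further down in this thread.

```r
# openNLP model languages named above, mapped to full names (mapping assumed).
opennlp_langs <- c(da = "danish", de = "german", en = "english", es = "spanish",
                   it = "italian", nl = "dutch", pt = "portuguese", sv = "swedish")

# Languages imported by data-raw/data.R, as quoted later in this thread.
nrc_langs <- c("basque", "catalan", "danish", "dutch", "english", "esperanto",
               "finnish", "french", "german", "irish", "italian", "latin",
               "portuguese", "romanian", "somali", "spanish", "sudanese",
               "swahili", "swedish", "turkish", "welsh", "zulu")

# Languages with both a lexicon and a sentence-detection model.
supported_by_both <- intersect(nrc_langs, unname(opennlp_langs))

# Languages with a lexicon but no openNLP model; these would need a fallback.
lexicon_only <- setdiff(nrc_langs, unname(opennlp_langs))

supported_by_both
lexicon_only
```

Under these assumptions all eight openNLP languages are in the imported lexicon list, and the remaining fourteen lexicon languages would need some other sentence tokenizer.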

@denrou
Author

denrou commented Jan 9, 2017

Thanks for being willing to consider the PR. I agree that if other languages are integrated into sentiment analysis, they should also be integrated into the other functions of the package.
How would you like to handle the fact that the openNLPmodels.language packages and the NRC lexicon do not cover the same set of languages?

@strategist922

Hi Denrou,
I checked the code at https://github.com/denrou/syuzhet/blob/master/data-raw/data.R. There are 41 languages in total in http://www.saifmohammad.com/WebDocs/NRC-Emotion-Lexicon-v0.92-InManyLanguages-web.xlsx. May I ask why you import only some of the languages listed in that file instead of all of them?
For reference, from https://github.com/denrou/syuzhet/blob/master/data-raw/data.R:
lang_ascii <- c("basque", "catalan", "danish", "dutch", "english", "esperanto", "finnish", "french", "german", "irish", "italian", "latin", "portuguese", "romanian", "somali", "spanish", "sudanese", "swahili", "swedish", "turkish", "welsh", "zulu")

@denrou
Author

denrou commented May 11, 2017

Hi strategist922,

There are indeed other languages that could be imported. The problem I faced was that these other languages had a different encoding, and the get_nrc_value function crashed when they were included. I didn't spend much time on it, so maybe there is a quick fix.
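
The root cause of the crash is not identified in this thread. Assuming it came from strings in a non-UTF-8 encoding, one common approach is to normalise every word to UTF-8 before building the lexicon. The base-R sketch below uses two made-up Latin-1 words; it is not taken from data-raw/data.R.

```r
# Two example words stored as Latin-1 bytes (illustrative input only).
words <- c("caf\xe9", "na\xefve")
Encoding(words) <- "latin1"   # declare how the bytes should be interpreted

# Convert to UTF-8 so that all lexicon entries share one encoding.
words_utf8 <- enc2utf8(words)

# Check that every entry is now valid UTF-8 before using it in lookups.
Encoding(words_utf8)
all(validUTF8(words_utf8))
```

If the source encoding is not declared on the strings, iconv() with an explicit `from` argument would be needed instead of enc2utf8().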

@hichemmkhalyd

Hi denrou!

I tried to use get_nrc_sentiment with lang = "portuguese" and it always returns zero for all sentiments. I was wondering if I should do something more to use it, like import a Portuguese lexicon or something else.

@mjockers
Owner

mjockers commented Aug 2, 2017

Merged, and revised the arc-related functions in v1.0.3 for efficiency and for expandability with user-defined lexicons.

@mjockers mjockers closed this Aug 2, 2017

4 participants