
Adding StopWordsRemover #59

Closed
clayms opened this issue Dec 11, 2017 · 4 comments


clayms commented Dec 11, 2017

I want to add the pyspark.ml.feature StopWordsRemover as a class in the annotator.py file so I can use that function in the same pipeline as the other sparknlp functions.

I have tried the code below, but I get the error `TypeError: 'JavaPackage' object is not callable`.

What am I doing wrong?

```python
from pyspark.ml.feature import StopWordsRemover as sparkml_StopWordsRemover

stopwordList = sparkml_StopWordsRemover.loadDefaultStopWords("english")


class StopWordsRemover(AnnotatorTransformer):

    caseSensitive = Param(Params._dummy(),
                          "caseSensitive",
                          "whether to do a case sensitive comparison over the stop words",
                          typeConverter=TypeConverters.toBoolean)

    stopWords = Param(Params._dummy(),
                      "stopWords",
                      "The words to be filtered out",
                      typeConverter=TypeConverters.toListString)

    @keyword_only
    def __init__(self):
        super(StopWordsRemover, self).__init__()
        self._java_obj = self._new_java_obj("com.johnsnowlabs.nlp.annotators.StopWordsRemover", self.uid)
        self._setDefault(caseSensitive=False, stopWords=stopwordList)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, caseSensitive=False, stopWords=stopwordList):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setCaseSensitive(self, value):
        return self._set(caseSensitive=value)

    def setStopWords(self, value):
        return self._set(stopWords=value)
```

clayms commented Dec 12, 2017

How do I add the pyspark.ml.feature StopWordsRemover to the spark-nlp_2.11-1.2.3.jar file?

aleksei-ai (Contributor) commented

@clayms As I can see, you have the line `self._java_obj = self._new_java_obj("com.johnsnowlabs.nlp.annotators.StopWordsRemover", self.uid)`, but there is no StopWordsRemover class in the package com.johnsnowlabs.nlp.annotators. I think that is why you get this exception.

I suggest adding the pyspark StopWordsRemover as the first stage of your pipeline.


clayms commented Dec 17, 2017

Thank you @aleksei-ai, I see that now. I had previously gotten the Spark ML StopWordsRemover working together with the Spark ML RegexTokenizer, but was having trouble getting them to work in the same pipeline as the John Snow Labs annotators. I ended up putting those Spark ML stages after the John Snow Labs annotator stages, and the end results are what I am after. Thank you.

maziyarpanahi (Member) commented

In the upcoming release of Spark NLP 2.3.0, we will have a native StopWordsCleaner annotator.
