Add text cleaner node #75

Closed
lalitpagaria opened this issue Apr 4, 2021 · 4 comments · Fixed by #110

Comments

@lalitpagaria
Collaborator

lalitpagaria commented Apr 4, 2021

Idea: add a configurable text cleaning node.
This node should also ship with predefined templates for cleaning tweets, Facebook feeds, app reviews, etc.

For details, refer to #75 (comment)

lalitpagaria added the enhancement and medium priority labels on Apr 4, 2021
@shahrukhx01
Collaborator

@lalitpagaria Could you list the text cleaning features we are looking for here, e.g. stemming, lemmatisation, stop word removal, etc.?

@lalitpagaria
Collaborator Author

lalitpagaria commented May 27, 2021

Thank you @shahrukhx01. Please find the requested details below.

A few possible cleaning features (not an exhaustive list; if you have more ideas, please add them as well):

  1. Lowercasing (if possible, with extra care for named entities: for example, "Bush" is a person and "bush" is a common word, so there is no point in lowercasing "Bush" when it appears in a sentence)
  2. Stop word removal, stemming, and punctuation removal (but the output should stay a sentence, i.e. it should not return tokens)
  3. Excessive whitespace removal
  4. Link removal, or replacing links with a filler
  5. Hashtags: remove the hashtag, remove only the #, or replace it with a filler
  6. User tagging (@someuser): remove the user tag, remove only the @, or replace it with a filler
  7. Spelling/grammar correction (if it needs a heavy model such as a transformer, it should go to the Analyzer)
  8. A user-provided list of regexes (with corresponding substitutes), in case the user wants to customise cleaning
  9. Decoding Unicode characters into a normalized form, such as UTF-8
  10. Handling of domain-specific words, phrases, and acronyms
  11. Handling or removing numbers, such as dates and amounts
  12. Locating and correcting common typos and misspellings
  13. Transliteration of characters from other languages into one fixed language (if it needs a heavy model such as a transformer, it should go to the Analyzer)
  14. Cleaning up of alphanumeric words
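
A few of the lighter items above (3, 4, 5, 6, 8 and 9) can be sketched with the standard library alone. This is only an illustration, not a proposed implementation: every function name, the regex patterns, and the choice of NFKC normalization are assumptions made for the example, and the patterns are deliberately simple (for instance, the mention pattern would also match the part of an email address after the @).

```python
import re
import unicodedata
from typing import Iterable, Tuple

# Illustrative patterns only; real-world URLs and handles are messier.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_unicode(text: str) -> str:
    """Item 9: fold compatibility characters (e.g. ligatures) via NFKC."""
    return unicodedata.normalize("NFKC", text)


def replace_links(text: str, filler: str = "") -> str:
    """Item 4: remove links, or replace each one with a filler string."""
    return URL_RE.sub(lambda _: filler, text)


def strip_hashtag_symbol(text: str) -> str:
    """Item 5: keep the hashtag word but drop the leading '#'."""
    return HASHTAG_RE.sub(r"\1", text)


def replace_mentions(text: str, filler: str = "") -> str:
    """Item 6: remove user tags, or replace each one with a filler string."""
    return MENTION_RE.sub(lambda _: filler, text)


def apply_user_patterns(text: str, patterns: Iterable[Tuple[str, str]]) -> str:
    """Item 8: apply user-supplied (regex, substitute) pairs in order."""
    for pattern, substitute in patterns:
        text = re.sub(pattern, substitute, text)
    return text


def collapse_whitespace(text: str) -> str:
    """Item 3: squeeze runs of whitespace into single spaces and trim."""
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_text(text: str, patterns: Iterable[Tuple[str, str]] = ()) -> str:
    """Run the steps in a fixed order; the result is still a sentence, not tokens."""
    text = normalize_unicode(text)
    text = replace_links(text)
    text = strip_hashtag_symbol(text)
    text = replace_mentions(text)
    text = apply_user_patterns(text, patterns)
    return collapse_whitespace(text)
```

Whitespace is collapsed last so that gaps left behind by removed links and mentions are cleaned up as well.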

Design considerations:

  1. Until we finalise "Add DAG support and fix inconsistent naming" (#8), let's create a preprocessor module and add text_cleaner.py
  2. We can have a BaseTextCleaner class with clean_input as an @abstractmethod
  3. Similarly, we can have a BaseTextCleanerConfig to store the cleaner's configuration
  4. clean_input can have the following signature:
    def clean_input(
        self,
        input_list: List[AnalyzerRequest],
        config: BaseTextCleanerConfig,
        **kwargs
    ) -> List[AnalyzerRequest]:
        # AnalyzerRequest will change in the future, but for now let's stick with it
        pass
  5. Every cleaning feature/config option should be optional. The user can select any of them via the config.
  6. We could then have library-specific cleaner classes such as NLTKTextCleaner, SpacyTextCleaner, and TextBlobTextCleaner. I am still not sure about this and we can discuss it: for the end user it does not matter which library does what, as long as the job gets done. I prefer less cognitive load on the end user; for them a text cleaner is just a text cleaning step, and how it works is not very interesting. So I am open to suggestions.
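
To make the shape of this design concrete, here is a minimal sketch of how the pieces could fit together. It is not the final API: the AnalyzerRequest below is a hypothetical stand-in with a single assumed processed_text field, the two config flags are examples only, and SimpleTextCleaner is an invented name for a library-agnostic cleaner.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List


@dataclass
class AnalyzerRequest:
    # Hypothetical stand-in for the project's AnalyzerRequest;
    # the single text field is an assumption for this sketch.
    processed_text: str


@dataclass
class BaseTextCleanerConfig:
    # Every feature is optional and off by default (example flags only).
    lowercase: bool = False
    collapse_whitespace: bool = False


class BaseTextCleaner(ABC):
    @abstractmethod
    def clean_input(
        self,
        input_list: List[AnalyzerRequest],
        config: BaseTextCleanerConfig,
        **kwargs: Any,
    ) -> List[AnalyzerRequest]:
        """Return cleaned copies of the incoming requests."""


class SimpleTextCleaner(BaseTextCleaner):
    """Illustrative cleaner that hides which library does the work."""

    def clean_input(
        self,
        input_list: List[AnalyzerRequest],
        config: BaseTextCleanerConfig,
        **kwargs: Any,
    ) -> List[AnalyzerRequest]:
        cleaned = []
        for request in input_list:
            text = request.processed_text
            if config.lowercase:
                text = text.lower()
            if config.collapse_whitespace:
                text = re.sub(r"\s+", " ", text).strip()
            # Build a new request instead of mutating the caller's object.
            cleaned.append(AnalyzerRequest(processed_text=text))
        return cleaned
```

With this shape, the user only picks flags on the config, and a single cleaner class can decide internally which library performs each step.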

I know this is a very extensive list; writing it out helped me express what I have in mind. We do not need to implement all of it. We can start by adding basic cleaning and then enhance it.

Please let me know whether you would like to work on it and create a PR.

@shahrukhx01
Collaborator

Sounds good. I'll start working on it and create a PR with the first draft I come up with.

@lalitpagaria
Collaborator Author

Closed with #110.
