How to provide multilingual support #84

Open · jbesomi opened this issue Jul 14, 2020 · 19 comments
Labels: discussion (To discuss new improvements)

jbesomi (Owner) commented Jul 14, 2020

Text preprocessing might be very language-dependent.

It would be great for Texthero to offer text preprocessing in all different languages.

There are probably two ways to support multiple languages:

# Option 1: language-specific modules
from texthero.lang import hero_lang

# Option 2: a global language setting
import texthero as hero
hero.config.set_lang("lang")

We might also have cases where a dataset is composed of many languages. What would be the best solution in that case?

The first question we probably have to solve is: do different languages require very different preprocessing pipelines, and therefore different functions?

jbesomi added the "discussion" label on Jul 14, 2020
jbesomi changed the title from "How to support multilingual" to "How to provide multilingual support" on Jul 14, 2020
ryangawei commented

For Asian languages (Chinese, Japanese, ...), word segmentation is an essential step in preprocessing. We usually remove non-textual characters from the corpus, making it look like naturally written text, and then segment the text with NLP tools. The difference from the current texthero.preprocessing.clean is that we need to keep punctuation and digits to ensure the correctness of the word segmentation, and only then proceed to the following steps. Despite the difference, these steps can be done by language-specific functions and custom pipelines, e.g.

from texthero import preprocessing
from texthero import nlp
import texthero as hero
hero.config.set_lang("cn")

custom_pipeline = [preprocessing.fillna,
                   preprocessing.remove_non_natural_text,
                   nlp.word_segment,
                   preprocessing.remove_stopwords,
                   ...]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

It seems hero.config.set_lang("lang") could work properly and avoid redundant code.

guilhermelowa commented Jul 14, 2020

Just used Texthero for the first time yesterday, on Portuguese text. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

jbesomi (Owner) commented Jul 15, 2020

> Just used Texthero for the first time yesterday, on Portuguese text. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

Hey @Jkasnese, if you look at the getting-started page, you will see that you can define your own custom pipeline:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

Is that what you were looking for?

jbesomi (Owner) commented Jul 15, 2020

> For Asian languages (Chinese, Japanese, ...), word segmentation is an essential step in preprocessing. We usually remove non-textual characters from the corpus, making it look like naturally written text, and then segment the text with NLP tools. The difference from the current texthero.preprocessing.clean is that we need to keep punctuation and digits to ensure the correctness of the word segmentation, and only then proceed to the following steps. Despite the difference, these steps can be done by language-specific functions and custom pipelines, e.g.
>
> from texthero import preprocessing
> from texthero import nlp
> import texthero as hero
> hero.config.set_lang("cn")
>
> custom_pipeline = [preprocessing.fillna,
>                    preprocessing.remove_non_natural_text,
>                    nlp.word_segment,
>                    preprocessing.remove_stopwords,
>                    ...]
> df['clean_text'] = hero.clean(df['text'], custom_pipeline)
>
> It seems hero.config.set_lang("lang") could work properly and avoid redundant code.

Great!

guilhermelowa commented

> Just used Texthero for the first time yesterday, on Portuguese text. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

> Hey @Jkasnese, if you look at the getting-started page, you will see that you can define your own custom pipeline:
>
> import texthero as hero
> from texthero import preprocessing
>
> custom_pipeline = [preprocessing.fillna,
>                    preprocessing.lowercase,
>                    preprocessing.remove_whitespace]
> df['clean_text'] = hero.clean(df['text'], custom_pipeline)
>
> Is that what you were looking for?

Yes, I think so! Thanks! I'll check later how to pass arguments to the remove_stopwords function. It would be nice if it were possible to do something like:

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_stopwords(my_stopwords),
                   preprocessing.remove_whitespace]

But I'm guessing it's done with **kwargs, right? I don't know, I'm a newbie hahaha, gonna check this later. Thanks again!

jbesomi (Owner) commented Jul 16, 2020

You can solve it like this:

import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)

All items in the pipeline must be callable (i.e. functions).

I agree that this is not trivial. We will make sure this is well explained both in the docstring of the clean function and in the (soon to arrive) "Getting started: preprocessing" part. By the way, @Jkasnese, would you be interested in explaining that in the clean docstring (your comment would then appear there)?
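
Another way to express the same step, as a small sketch using only functools.partial from the standard library, is to pre-bind the stopwords argument so the result is again a one-argument callable:

import functools

import pandas as pd
import texthero as hero

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

# partial freezes the stopwords argument of remove_stopwords,
# returning a callable that takes only the Series.
pipeline = [
    functools.partial(hero.remove_stopwords, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)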

@guilhermelowa
Copy link

I'd like that! Can't make guarantees though, since I'm already involved in many projects ): I'll try to do it by Saturday.

I still have some questions, which might be good since I can address them in the docstring. Should I create a new issue (since they're a bit off-topic for this issue) or message you privately?

jbesomi (Owner) commented Jul 16, 2020

Both solutions work; either open an issue or send me an email: jonathanbesomi__AT__gmail.com

cedricconol (Contributor) commented
> You can solve it like this:
>
> import texthero as hero
> import pandas as pd
>
> s = pd.Series(["is is a stopword"])
> custom_set_of_stopwords = ['is']
>
> pipeline = [
>     lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
> ]
>
> s.pipe(hero.clean, pipeline=pipeline)
>
> All items in the pipeline must be callable (i.e. functions).
>
> I agree that this is not trivial. We will make sure this is well explained both in the docstring of the clean function and in the (soon to arrive) "Getting started: preprocessing" part. By the way, @Jkasnese, would you be interested in explaining that in the clean docstring (your comment would then appear there)?

@jbesomi I think it would also be good to add a language parameter to remove_stopwords. I've used a solution similar to your suggestion above for Spanish and Indonesian. We could use the stop-words library to load stop words for different languages. What do you think?
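
A quick sketch of how that could plug into the current API (assuming the stop-words package is installed; the language names here follow that library, not Texthero):

# pip install stop-words
from stop_words import get_stop_words

import pandas as pd
import texthero as hero

# Load Spanish stopwords from the stop-words package.
spanish_stopwords = set(get_stop_words("spanish"))

s = pd.Series(["el perro está en la casa"])
s = hero.remove_stopwords(s, stopwords=spanish_stopwords)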

jbesomi (Owner) commented Jul 17, 2020

I perfectly agree with what you are proposing, i.e. allowing the removal of stopwords for a specific language. The only big question (and the main purpose of this discussion tab) is to understand how.

There are different alternatives:

  1. Add a language parameter (as you were proposing).
  2. Automatically detect the language (see Add hero.infer_lang(s) #3) and remove the stopwords for that language; a rough sketch follows this list.
  3. By default, use one very big stopwords set that contains the stopwords of all languages. The main drawback is that a stopword in one language might not be one in another ... (👎)
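
For alternative 2, a rough sketch of what hero.infer_lang(s) could do, here using the langdetect package (an assumption for illustration, not a settled choice; detection can be unreliable on very short strings):

# pip install langdetect
import pandas as pd
from langdetect import detect

def infer_lang(s: pd.Series) -> pd.Series:
    # Detect an ISO 639-1 language code for each cell.
    return s.apply(detect)

s = pd.Series(["Hello world", "Hola mundo, ¿cómo estás?"])
infer_lang(s)  # e.g. 0: en, 1: es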

In general, I'm against adding too many arguments to functions, as this generally makes them more complex to understand and use ...

Also, something we always need to keep in mind is that, from a multilingual perspective, Texthero is composed of two kinds of functions:

  1. Functions that need to distinguish the language of the Series (or cell)
  2. Functions that work independently of the underlying language

Only some of the preprocessing functions fall under (1) (tell me if you think this is wrong). Such functions are, for instance, remove_stopwords and tokenize. It might be redundant to specify the language parameter for each of these functions, and that's why @AlfredWGA's solution of having a global setting hero.config.set_lang("zh") is probably a great idea.

Your opinions?

Technically, do you think it is feasible, and not too complex, to have a global setting for the language at the module level?
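
A minimal sketch of what such a module-level setting could look like (a hypothetical config module; all names are placeholders):

# texthero/config.py (hypothetical)
_lang = "en"  # module-level default

def set_lang(lang: str) -> None:
    # Set the language used by language-dependent functions.
    global _lang
    _lang = lang

def get_lang() -> str:
    # Return the currently configured language.
    return _lang

Language-dependent functions such as remove_stopwords could then consult config.get_lang() whenever no explicit language or stopword set is passed in.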

cedricconol (Contributor) commented
I think @AlfredWGA's solution of having a global setting is a better idea than adding a language parameter to every function as I suggested, and it is also more feasible.

#3 is also very interesting and might be an even better idea, as it automates the process. It aligns perfectly with Texthero's purpose of being easy to use and understand.

ryangawei commented
I found a problem with using a global language setting. Some functions cannot be applied to Asian languages, e.g. remove_diacritics and stem. Also, remove_punctuation is integrated into remove_stopwords after tokenization. When the user selects a certain language, I think we shouldn't expose APIs that they cannot use, as that might lead to confusion. Any idea how to solve this?

jbesomi (Owner) commented Jul 21, 2020

Hey @AlfredWGA!

Apologies, what do you mean by "integrated"? ("Also, remove_punctuation is integrated into remove_stopwords after tokenization")

I agree.

To better understand the problem, we should create a document (by opening a new issue, or using a Google Doc or similar) and, for each language, make a list of the necessary functions. Presumably, except for some functions in preprocessing (i.e. remove_diacritics for Asian languages, ...), all other functions will be useful for all languages.

Once we have this, it will be easier to see how to solve the problem. Hopefully, we will notice some patterns and be able to group together languages that share a common preprocessing pipeline. Do you agree?

Then, a simple idea might be to have a select menu on the API page that lets the user show only the functions relevant to a given language. This, coupled with getting-started tutorials in all the different languages (or families of languages), should reduce the confusion.

What are your thoughts?

ryangawei commented Jul 21, 2020

> Hey @AlfredWGA!
>
> Apologies, what do you mean by "integrated"? ("Also, remove_punctuation is integrated into remove_stopwords after tokenization")
>
> I agree.
>
> To better understand the problem, we should create a document (by opening a new issue, or using a Google Doc or similar) and, for each language, make a list of the necessary functions. Presumably, except for some functions in preprocessing (i.e. remove_diacritics for Asian languages, ...), all other functions will be useful for all languages.
>
> Once we have this, it will be easier to see how to solve the problem. Hopefully, we will notice some patterns and be able to group together languages that share a common preprocessing pipeline. Do you agree?
>
> Then, a simple idea might be to have a select menu on the API page that lets the user show only the functions relevant to a given language. This, coupled with getting-started tutorials in all the different languages (or families of languages), should reduce the confusion.
>
> What are your thoughts?

Sorry for the confusion. The default pipeline for preprocessing Chinese text should look like this:

from typing import Callable, List

import pandas as pd

# fillna, remove_whitespace, remove_digits and remove_stopwords are assumed
# to come from texthero.preprocessing; tokenize would be a Chinese-specific
# word-segmentation version.
def get_default_pipeline() -> List[Callable[[pd.Series], pd.Series]]:
    return [
        fillna,
        remove_whitespace,
        tokenize,
        remove_digits,
        remove_stopwords
    ]

Punctuation and stopwords should be removed after tokenization (as they might otherwise affect the word segmentation results). We can put the punctuation marks into the list of stopwords and remove them all together using remove_stopwords, so a series of remove_* functions might be unnecessary (see the sketch below). Plus, all functions in preprocessing.py currently deal with Series of str; with tokenize as a prior step, a lot of functions would have to require a Series of lists as input.
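
A minimal sketch of merging punctuation into the stopword set (note that Python's string.punctuation covers only ASCII, so full-width CJK punctuation has to be added explicitly; the stopwords here are placeholders):

import string

import pandas as pd

stopwords = {"的", "了"}  # placeholder Chinese stopwords
# ASCII punctuation plus a few full-width CJK marks.
to_remove = stopwords | set(string.punctuation) | {"。", "，", "！", "？"}

s = pd.Series([["这", "是", "一个", "例子", "。"]])  # already tokenized
s.apply(lambda tokens: [t for t in tokens if t not in to_remove])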

In this case, if we use hero.config.set_lang("lang"), how can we make the unnecessary functions invisible when the user calls hero.*? On the other hand, from texthero.lang import hero_lang can import only the necessary functions for a certain language.
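
A sketch of how such a language module could be laid out (a purely hypothetical file layout; the point is that only the applicable functions get re-exported):

# texthero/lang/hero_cn/__init__.py (hypothetical)
# Re-export only the functions that make sense for Chinese, so that
# `from texthero.lang import hero_cn as hero` exposes a reduced API.
from texthero.preprocessing import (
    fillna,
    remove_whitespace,
    remove_stopwords,
)
from .preprocessing import tokenize  # Chinese-specific word segmentation
# Deliberately not re-exported: remove_diacritics, stem, ...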

jbesomi (Owner) commented Jul 24, 2020

Hey @AlfredWGA, sorry for the late reply.

I agree that a series of remove_* functions might be unnecessary. On the other hand, someone might need to apply just remove_punctuation for some reason, and in that case such a function might be handy.

Regarding tokenize, what you say is super interesting. For you to know, in the next version all representation functions will require a TokenSeries (a tokenized series) as input.

In the case of preprocessing, if we consider Western languages, the current approach is that tokenization is done once, at the end of the preprocessing phase. remove_punctuation and remove_stopwords do not necessarily require an already-tokenized string as input, as we might just apply str.replace.

The main reason we have not, until now, required preprocessing functions to receive a tokenized series as input is performance. For example, remove_punctuation uses str.replace + a regex; I assumed that this was faster than iterating over every token and removing the ones that are punctuation. A rough sketch of the two strategies is shown below.
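
A rough illustration of the two strategies (the relative speed is an assumption here, not a measured benchmark):

import re
import string

import pandas as pd

s = pd.Series(["well, this is (just) an example!"])

# Vectorized approach on raw strings: one regex pass via pandas' str.replace.
s.str.replace(rf"[{re.escape(string.punctuation)}]+", " ", regex=True)

# Token-level approach: split first, then strip punctuation from every token.
s.str.split().apply(lambda tokens: [t.strip(string.punctuation) for t in tokens])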

For Asian languages, must tokenization strictly be done before applying remove_*? If yes, I'm fine with reconsidering this and having the first step consist of tokenizing the Series. As we aim for simplicity and unification, it would not make sense to have two different approaches for different languages (when a universal solution exists).

One more thing regarding what you were proposing (remove_*):

As an alternative to multiple remove_* functions, we might have a universal function remove(s, tokens) that removes all the given tokens from the (tokenized) Series. Then, through a module or something similar, we might provide such collections of tokens:

from tokens import stopwords
from tokens import punctuation
from tokens import ...

s = pd.Series([...])
s = hero.tokenize(s)
hero.remove(s, stopwords.union(punctuation).union(...))
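
A minimal sketch of what that universal function could look like on a tokenized Series (a hypothetical implementation, for illustration):

import pandas as pd

def remove(s: pd.Series, tokens: set) -> pd.Series:
    # Drop every token contained in `tokens` from each tokenized cell.
    return s.apply(lambda cell: [t for t in cell if t not in tokens])

s = pd.Series([["this", "is", "an", "example", "!"]])
remove(s, {"is", "an", "!"})  # 0: [this, example]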

Looking forward to hearing from you! 👍

ryangawei commented Jul 24, 2020

> For Asian languages, must tokenization strictly be done before applying remove_*?

From my perspective, yes, except for some strings that won't interfere with word segmentation (URLs, HTML tags, \n and \t, etc.).

TokenSeries and a tokens module are a great idea. If we implement that, the cleaning pipeline for Asian languages could look like this:

from texthero.lang import hero_cn as hero

[hero.preprocessing.fillna,
 hero.preprocessing.remove_whitespace,
 ...,
 hero.preprocessing.tokenize,
 hero.remove(s, hero.tokens.stopwords.union(punctuation).union(...))]

Then the cleaning process for Western and Asian languages will be unified. What do you think?

jbesomi (Owner) commented Jul 24, 2020

Sounds good!

We might call the tokens module collections or something similar:

from texthero.collections import stopwords
from texthero.collections import punctuation
...

Yes, TokenSeries seems a promising idea. @henrifroese is working on it; if you are interested, have a look at #60.

@AlfredWGA, how do you suggest we proceed, in relation to how you plan to contribute?

ryangawei commented
I'll start implementing texthero.lang.hero_cn to make all the functions support Chinese. If #60 is completed, I'll refactor the code to accommodate that feature. Is that OK?

jbesomi (Owner) commented Jul 27, 2020

OK!
