How to provide multilingual support #84

Open · jbesomi opened this issue Jul 14, 2020 · 19 comments
Labels: discussion (To discuss new improvements)

jbesomi (Owner) commented Jul 14, 2020

Text preprocessing might be very language-dependent.

It would be great for Texthero to offer text preprocessing in all different languages.

There are probably two ways to support multiple languages:

# Option 1: language-specific modules
from texthero.lang import hero_lang

# Option 2: a global language setting
import texthero as hero
hero.config.set_lang("lang")

We might also have cases where a dataset is composed of many languages. What would be the best solution in that case?

The first question we probably have to solve is: do different languages require very different preprocessing pipelines, and therefore different functions?

jbesomi added the "discussion" label on Jul 14, 2020
jbesomi changed the title from "How to support multilingual" to "How to provide multilingual support" on Jul 14, 2020
ryangawei commented

For Asian languages (Chinese, Japanese, ...), word segmentation is an essential step in preprocessing. We usually remove non-textual characters from the corpus, making it look like naturally written text, and then segment the text with NLP tools. The difference from the current texthero.preprocessing.clean is that we need to keep punctuation and digits to ensure the correctness of the word segmentation, and only then proceed to the following steps. Despite the difference, these steps can be done by language-specific functions and custom pipelines, e.g.

from texthero import preprocessing
from texthero import nlp
import texthero as hero
hero.config.set_lang("cn")

custom_pipeline = [preprocessing.fillna,
                   preprocessing.remove_non_natural_text,
                   nlp.word_segment,
                   preprocessing.remove_stopwords,
                   ...]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

It seems hero.config.set_lang("lang") could work properly and avoid redundant code.

guilhermelowa commented Jul 14, 2020

Just used Texthero for the first time yesterday, on Portuguese text. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

jbesomi (Owner) commented Jul 15, 2020

> Just used Texthero for the first time yesterday, on Portuguese text. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

Hey @Jkasnese, if you look at the getting-started page, you will see that you can define your own custom pipeline:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

Is that what you were looking for?

jbesomi (Owner) commented Jul 15, 2020

> For Asian languages (Chinese, Japanese, ...), word segmentation is an essential step in preprocessing. We usually remove non-textual characters from the corpus, making it look like naturally written text, and then segment the text with NLP tools. The difference from the current texthero.preprocessing.clean is that we need to keep punctuation and digits to ensure the correctness of the word segmentation, and only then proceed to the following steps. Despite the difference, these steps can be done by language-specific functions and custom pipelines, e.g.
>
> from texthero import preprocessing
> from texthero import nlp
> import texthero as hero
> hero.config.set_lang("cn")
>
> custom_pipeline = [preprocessing.fillna,
>                    preprocessing.remove_non_natural_text,
>                    nlp.word_segment,
>                    preprocessing.remove_stopwords,
>                    ...]
> df['clean_text'] = hero.clean(df['text'], custom_pipeline)
>
> It seems hero.config.set_lang("lang") could work properly and avoid redundant code.

Great!

guilhermelowa commented

> Just used Texthero for the first time yesterday, on Portuguese text. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

> Hey @Jkasnese, if you look at the getting-started page, you will see that you can define your own custom pipeline:
>
> import texthero as hero
> from texthero import preprocessing
>
> custom_pipeline = [preprocessing.fillna,
>                    preprocessing.lowercase,
>                    preprocessing.remove_whitespace]
> df['clean_text'] = hero.clean(df['text'], custom_pipeline)
>
> Is that what you were looking for?

Yes, I think so! Thanks! I'll check later how to pass arguments to the remove_stopwords function. It would be nice if it were possible to do something like:

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_stopwords(my_stopwords),
                   preprocessing.remove_whitespace]

But I'm guessing it's done with **kwargs, right? I don't know, I'm a newbie hahaha, gonna check this later. Thanks again!

jbesomi (Owner) commented Jul 16, 2020

You can solve it like this:

import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)

All items in the pipeline must be callable (i.e. functions).

I agree that this is not trivial. We will make sure this is well explained both in the docstring of the clean function and in the (soon to arrive) "Getting started: preprocessing" part. By the way, @Jkasnese, would you be interested in explaining that in the clean docstring (your comment would then appear there)?
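
Another way to express the same step, as a small sketch using only functools.partial from the standard library, is to pre-bind the stopwords argument so the result is again a one-argument callable:

import functools

import pandas as pd
import texthero as hero

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

# partial freezes the stopwords argument of remove_stopwords,
# returning a callable that takes only the Series.
pipeline = [
    functools.partial(hero.remove_stopwords, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)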

@guilhermelowa
Copy link

I'd like that! Can't make guarantees though, since I'm already involved in many projects ): I'll try to do it by Saturday.

I still have some questions, which might be good since I can address them in the docstring. Should I create a new issue (since they're a bit off-topic for this issue) or message you privately?

jbesomi (Owner) commented Jul 16, 2020

Both solutions work; either open an issue or send me an email: jonathanbesomi__AT__gmail.com

cedricconol (Contributor) commented
> You can solve it like this:
>
> import texthero as hero
> import pandas as pd
>
> s = pd.Series(["is is a stopword"])
> custom_set_of_stopwords = ['is']
>
> pipeline = [
>     lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
> ]
>
> s.pipe(hero.clean, pipeline=pipeline)
>
> All items in the pipeline must be callable (i.e. functions).
>
> I agree that this is not trivial. We will make sure this is well explained both in the docstring of the clean function and in the (soon to arrive) "Getting started: preprocessing" part. By the way, @Jkasnese, would you be interested in explaining that in the clean docstring (your comment would then appear there)?

@jbesomi I think it would also be good to add a language parameter to remove_stopwords. I've used a solution similar to your suggestion above for Spanish and Indonesian. We could use the stop-words library to load stop words for different languages. What do you think?
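
A quick sketch of how that could plug into the current API (assuming the stop-words package is installed; the language names here follow that library, not Texthero):

# pip install stop-words
from stop_words import get_stop_words

import pandas as pd
import texthero as hero

# Load Spanish stopwords from the stop-words package.
spanish_stopwords = set(get_stop_words("spanish"))

s = pd.Series(["el perro está en la casa"])
s = hero.remove_stopwords(s, stopwords=spanish_stopwords)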

jbesomi (Owner) commented Jul 17, 2020

I perfectly agree with what you are proposing, i.e. allowing the removal of stopwords for a specific language. The only big question (and the main purpose of this discussion tab) is to understand how.

There are different alternatives:

  1. Add a language parameter (as you were proposing).
  2. Automatically detect the language (see Add hero.infer_lang(s) #3) and remove the stopwords for that language; a rough sketch follows this list.
  3. By default, use one very big stopwords set that contains the stopwords of all languages. The main drawback is that a stopword in one language might not be one in another ... (👎)
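
For alternative 2, a rough sketch of what hero.infer_lang(s) could do, here using the langdetect package (an assumption for illustration, not a settled choice; detection can be unreliable on very short strings):

# pip install langdetect
import pandas as pd
from langdetect import detect

def infer_lang(s: pd.Series) -> pd.Series:
    # Detect an ISO 639-1 language code for each cell.
    return s.apply(detect)

s = pd.Series(["Hello world", "Hola mundo, ¿cómo estás?"])
infer_lang(s)  # e.g. 0: en, 1: es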

In general, I'm against adding too many arguments to functions, as this generally makes them more complex to understand and use ...

Also, something we always need to keep in mind is that, from a multilingual perspective, Texthero is composed of two kinds of functions:

  1. Functions that need to distinguish the language of the Series (or cell)
  2. Functions that work independently of the underlying language

Only some of the preprocessing functions fall under (1) (tell me if you think this is wrong). Such functions are, for instance, remove_stopwords and tokenize. It might be redundant to specify the language parameter for each of these functions, and that's why @AlfredWGA's solution of having a global setting hero.config.set_lang("zh") is probably a great idea.

Your opinions?

Technically, do you think it is feasible, and not too complex, to have a global setting for the language at the module level?
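
A minimal sketch of what such a module-level setting could look like (a hypothetical config module; all names are placeholders):

# texthero/config.py (hypothetical)
_lang = "en"  # module-level default

def set_lang(lang: str) -> None:
    # Set the language used by language-dependent functions.
    global _lang
    _lang = lang

def get_lang() -> str:
    # Return the currently configured language.
    return _lang

Language-dependent functions such as remove_stopwords could then consult config.get_lang() whenever no explicit language or stopword set is passed in.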

cedricconol (Contributor) commented
I think @AlfredWGA's solution of having a global setting is a better idea than adding a language parameter to every function as I suggested, and it is also more feasible.

#3 is also very interesting and might be an even better idea, as it automates the process. It aligns perfectly with Texthero's purpose of being easy to use and understand.

ryangawei commented
I found a problem with using a global language setting. Some functions cannot be applied to Asian languages, e.g. remove_diacritics and stem. Also, remove_punctuation is integrated into remove_stopwords after tokenization. When the user selects a certain language, I think we shouldn't expose APIs that they cannot use, as that might lead to confusion. Any idea how to solve this?

jbesomi (Owner) commented Jul 21, 2020

Hey @AlfredWGA!

Apologies, what do you mean by "integrated"? ("Also, remove_punctuation is integrated into remove_stopwords after tokenization")

I agree.

To better understand the problem, we should create a document (by opening a new issue, or using a Google Doc or similar) and, for each language, make a list of the necessary functions. Presumably, except for some functions in preprocessing (i.e. remove_diacritics for Asian languages, ...), all other functions will be useful for all languages.

Once we have this, it will be easier to see how to solve the problem. Hopefully, we will notice some patterns and be able to group together languages that share a common preprocessing pipeline. Do you agree?

Then, a simple idea might be to have a select menu on the API page that lets the user show only the functions relevant to a given language. This, coupled with getting-started tutorials in all the different languages (or families of languages), should reduce the confusion.

What are your thoughts?

ryangawei commented Jul 21, 2020

> Hey @AlfredWGA!
>
> Apologies, what do you mean by "integrated"? ("Also, remove_punctuation is integrated into remove_stopwords after tokenization")
>
> I agree.
>
> To better understand the problem, we should create a document (by opening a new issue, or using a Google Doc or similar) and, for each language, make a list of the necessary functions. Presumably, except for some functions in preprocessing (i.e. remove_diacritics for Asian languages, ...), all other functions will be useful for all languages.
>
> Once we have this, it will be easier to see how to solve the problem. Hopefully, we will notice some patterns and be able to group together languages that share a common preprocessing pipeline. Do you agree?
>
> Then, a simple idea might be to have a select menu on the API page that lets the user show only the functions relevant to a given language. This, coupled with getting-started tutorials in all the different languages (or families of languages), should reduce the confusion.
>
> What are your thoughts?

Sorry for the confusion. The default pipeline for preprocessing Chinese text should look like this:

from typing import Callable, List

import pandas as pd

# fillna, remove_whitespace, remove_digits and remove_stopwords are assumed
# to come from texthero.preprocessing; tokenize would be a Chinese-specific
# word-segmentation version.
def get_default_pipeline() -> List[Callable[[pd.Series], pd.Series]]:
    return [
        fillna,
        remove_whitespace,
        tokenize,
        remove_digits,
        remove_stopwords
    ]

Punctuation and stopwords should be removed after tokenization (as they might otherwise affect the word segmentation results). We can put the punctuation marks into the list of stopwords and remove them all together using remove_stopwords, so a series of remove_* functions might be unnecessary (see the sketch below). Plus, all functions in preprocessing.py currently deal with Series of str; with tokenize as a prior step, a lot of functions would have to require a Series of lists as input.
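
A minimal sketch of merging punctuation into the stopword set (note that Python's string.punctuation covers only ASCII, so full-width CJK punctuation has to be added explicitly; the stopwords here are placeholders):

import string

import pandas as pd

stopwords = {"的", "了"}  # placeholder Chinese stopwords
# ASCII punctuation plus a few full-width CJK marks.
to_remove = stopwords | set(string.punctuation) | {"。", "，", "！", "？"}

s = pd.Series([["这", "是", "一个", "例子", "。"]])  # already tokenized
s.apply(lambda tokens: [t for t in tokens if t not in to_remove])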

In this case, if we use hero.config.set_lang("lang"), how can we make the unnecessary functions invisible when the user calls hero.*? On the other hand, from texthero.lang import hero_lang can import only the necessary functions for a certain language.
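
A sketch of how such a language module could be laid out (a purely hypothetical file layout; the point is that only the applicable functions get re-exported):

# texthero/lang/hero_cn/__init__.py (hypothetical)
# Re-export only the functions that make sense for Chinese, so that
# `from texthero.lang import hero_cn as hero` exposes a reduced API.
from texthero.preprocessing import (
    fillna,
    remove_whitespace,
    remove_stopwords,
)
from .preprocessing import tokenize  # Chinese-specific word segmentation
# Deliberately not re-exported: remove_diacritics, stem, ...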

jbesomi (Owner) commented Jul 24, 2020

Hey @AlfredWGA, sorry for the late reply.

I agree that a series of remove_* functions might be unnecessary. On the other hand, someone might need to apply just remove_punctuation for some reason, and in that case such a function might be handy.

Regarding tokenize, what you say is super interesting. For you to know, in the next version all representation functions will require a TokenSeries (a tokenized series) as input.

In the case of preprocessing, if we consider Western languages, the current approach is that tokenization is done once, at the end of the preprocessing phase. remove_punctuation and remove_stopwords do not necessarily require an already-tokenized string as input, as we might just apply str.replace.

The main reason we have not, until now, required preprocessing functions to receive a tokenized series as input is performance. For example, remove_punctuation uses str.replace + a regex; I assumed that this was faster than iterating over every token and removing the ones that are punctuation. A rough sketch of the two strategies is shown below.
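
A rough illustration of the two strategies (the relative speed is an assumption here, not a measured benchmark):

import re
import string

import pandas as pd

s = pd.Series(["well, this is (just) an example!"])

# Vectorized approach on raw strings: one regex pass via pandas' str.replace.
s.str.replace(rf"[{re.escape(string.punctuation)}]+", " ", regex=True)

# Token-level approach: split first, then strip punctuation from every token.
s.str.split().apply(lambda tokens: [t.strip(string.punctuation) for t in tokens])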

For Asian languages, must tokenization strictly be done before applying remove_*? If yes, I'm fine with reconsidering this and having the first step consist of tokenizing the Series. As we aim for simplicity and unification, it would not make sense to have two different approaches for different languages (when a universal solution exists).

One more thing regarding what you were proposing (remove_*):

As an alternative to multiple remove_* functions, we might have a universal function remove(s, tokens) that removes all the given tokens from the (tokenized) Series. Then, through a module or something similar, we might provide such collections of tokens:

from tokens import stopwords
from tokens import punctuation
from tokens import ...

s = pd.Series([...])
s = hero.tokenize(s)
hero.remove(s, stopwords.union(punctuation).union(...))
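
A minimal sketch of what that universal function could look like on a tokenized Series (a hypothetical implementation, for illustration):

import pandas as pd

def remove(s: pd.Series, tokens: set) -> pd.Series:
    # Drop every token contained in `tokens` from each tokenized cell.
    return s.apply(lambda cell: [t for t in cell if t not in tokens])

s = pd.Series([["this", "is", "an", "example", "!"]])
remove(s, {"is", "an", "!"})  # 0: [this, example]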

Looking forward to hearing from you! 👍

ryangawei commented Jul 24, 2020

> For Asian languages, must tokenization strictly be done before applying remove_*?

From my perspective, yes, except for some strings that won't interfere with word segmentation (URLs, HTML tags, \n and \t, etc.).

TokenSeries and a tokens module are a great idea. If we implement that, the cleaning pipeline for Asian languages could look like this:

from texthero.lang import hero_cn as hero

[hero.preprocessing.fillna,
 hero.preprocessing.remove_whitespace,
 ...,
 hero.preprocessing.tokenize,
 hero.remove(s, hero.tokens.stopwords.union(punctuation).union(...))]

Then the cleaning process for Western and Asian languages will be unified. What do you think?

jbesomi (Owner) commented Jul 24, 2020

Sounds good!

We might call the tokens module collections or something similar:

from texthero.collections import stopwords
from texthero.collections import punctuation
...

Yes, TokenSeries seems a promising idea. @henrifroese is working on it; if you are interested, have a look at #60.

@AlfredWGA, how do you suggest we proceed, in relation to how you plan to contribute?

ryangawei commented
I'll start implementing texthero.lang.hero_cn to make all the functions support Chinese. If #60 is completed, I'll refactor the code to accommodate that feature. Is that OK?

jbesomi (Owner) commented Jul 27, 2020

OK!
