
TokenSeries as input to every representation function #44

Closed
jbesomi opened this issue Jul 8, 2020 · 1 comment
Labels: enhancement (New feature or request)


jbesomi commented Jul 8, 2020

One of the principles of Texthero is to give the NLP developer more control.

Motivation

A simple example is the TfidfVectorizer object from scikit-learn. It's fast and well tested, but it has many parameters, and before applying TF-IDF it actually preprocesses the text data. I just discovered that TfidfVectorizer even L2-normalizes the output by default, which is easy to overlook.

With Texthero's tfidf we just want the code to apply TF-IDF. That's it: no stopword removal, no tokenization, no normalization. All these essential steps can be done by the NLP developer in the pipeline (the drawback is that it might be less efficient, but the advantage is clear, predictable behavior).
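To make the contrast concrete, here is a minimal pure-Python sketch (not Texthero or scikit-learn code) of what "just TF-IDF" means on an already-tokenized corpus: plain term frequency times inverse document frequency, with no smoothing, no stopword removal, and no normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF over already-tokenized documents.

    Each document is a list of tokens; the result is one
    {term: weight} dict per document. No preprocessing,
    no smoothing, no normalization.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["cat", "sat"], ["cat", "ran"]]
vecs = tfidf(docs)
# "cat" appears in every document, so its idf is log(2/2) = 0
```

This is only an illustration of the principle; scikit-learn's TfidfVectorizer uses a smoothed idf by default, so its numbers differ.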

Solution

All representation functions will require the input Pandas Series to be already tokenized. In the beginning, we can still accept a text Pandas Series; in that case the default hero.tokenize function will be applied, but a warning message will be printed (see example below).

Interested in working on this task?
For the tfidf and term_frequency functions, most of the code already exists. The body of the function would look like this:

from sklearn.feature_extraction.text import TfidfVectorizer

if not isinstance(s.iloc[0], list):
    raise ValueError(
        "🤔 It seems like the given Pandas Series is not tokenized. "
        "Have you tried passing the Series through `hero.tokenize(s)`?"
    )

# The Series is already tokenized, so tokenizer and preprocessor are
# identity functions: TfidfVectorizer must not touch the tokens.
tfidf = TfidfVectorizer(
    use_idf=True,
    max_features=max_features,
    min_df=min_df,
    max_df=max_df,
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
)

If you are interested in helping out, just leave a comment!

@jbesomi jbesomi added the enhancement New feature or request label Jul 8, 2020
@jbesomi jbesomi pinned this issue Jul 14, 2020
@henrifroese (Collaborator)

I'm working on this with @mk2510

@jbesomi jbesomi changed the title All representation function to accept as input an already tokenized Series TokenSeries as input to every representation function Jul 14, 2020
henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue Jul 15, 2020: "…input. Closes jbesomi#44"

Now checks for all text-based functions in representation.py whether the input is already tokenized. If not, it prints a warning and tokenizes. Docstrings and tests are changed accordingly.
@jbesomi jbesomi unpinned this issue Jul 15, 2020