
TokenSeries as input to every representation function #44

Closed
jbesomi opened this issue Jul 8, 2020 · 1 comment
Labels: enhancement (New feature or request)


jbesomi commented Jul 8, 2020

One of the principles of Texthero is to give the NLP developer more control.

Motivation

A simple example is the TfidfVectorizer object from scikit-learn. It's fast and well tested, but it has many parameters, and before applying TF-IDF it actually preprocesses the text data. I just discovered that TfidfVectorizer even L2-normalizes the output by default, which is easy to overlook.

With Texthero's tfidf we just want the code to apply TF-IDF. That's it: no stopword removal, no tokenization, no normalization. All these essential steps can be done by the NLP developer in the pipeline (the drawback is that it might be less efficient, but the advantage is clear, predictable behavior).
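To make the contrast concrete, here is a minimal pure-Python sketch (not Texthero or scikit-learn code) of what "just TF-IDF" means on an already-tokenized corpus: plain term frequency times inverse document frequency, with no smoothing, no stopword removal, and no normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF over already-tokenized documents.

    Each document is a list of tokens; the result is one
    {term: weight} dict per document. No preprocessing,
    no smoothing, no normalization.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["cat", "sat"], ["cat", "ran"]]
vecs = tfidf(docs)
# "cat" appears in every document, so its idf is log(2/2) = 0
```

This is only an illustration of the principle; scikit-learn's TfidfVectorizer uses a smoothed idf by default, so its numbers differ.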

Solution

All representation functions will require the input Pandas Series to be already tokenized. In the beginning, we can still accept a text Pandas Series; in that case the default hero.tokenize function will be applied, but a warning message will be printed (see example below).

Interested in working on this task?
For the tfidf and term_frequency functions, most of the code already exists. The body of the function would look like this:

from sklearn.feature_extraction.text import TfidfVectorizer

if not isinstance(s.iloc[0], list):
    raise ValueError(
        "🤔 It seems like the given Pandas Series is not tokenized. "
        "Have you tried passing the Series through `hero.tokenize(s)`?"
    )

# The Series is already tokenized, so tokenizer and preprocessor are
# identity functions: TfidfVectorizer must not touch the tokens.
tfidf = TfidfVectorizer(
    use_idf=True,
    max_features=max_features,
    min_df=min_df,
    max_df=max_df,
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
)

If you are interested in helping out, just leave a comment!

@jbesomi jbesomi added the enhancement New feature or request label Jul 8, 2020
@jbesomi jbesomi pinned this issue Jul 14, 2020
@henrifroese (Collaborator)

I'm working on this with @mk2510

@jbesomi jbesomi changed the title All representation function to accept as input an already tokenized Series TokenSeries as input to every representation function Jul 14, 2020
henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue Jul 15, 2020: "…input. Closes jbesomi#44"

Now checks for all text-based functions in representation.py whether the input is already tokenized. If not, it prints a warning and tokenizes. Docstrings and tests are changed accordingly.
@jbesomi jbesomi unpinned this issue Jul 15, 2020