One of the principles of Texthero is to give the NLP developer more control.
Motivation
A simple example is the TfidfVectorizer object from scikit-learn. It's fast and great, but it has too many parameters, and before applying TF-IDF it actually preprocesses the text data. I just discovered that TfidfVectorizer even L2-normalizes the output by default; the normalization is only skipped if norm=None is passed explicitly.
With Texthero's tf-idf, we just want the code to apply TF-IDF. That's it. No stopword removal, no tokenization, no normalization. All these essential steps can be done by the NLP developer in the pipeline (the drawback is that it might be less efficient, but the advantage is clear and expected behavior).
Solution
All representation functions will require the Pandas Series to be already tokenized. In the beginning, we can still accept a text Pandas Series; in this case the default hero.tokenize function will be applied, but a warning message will be output (see example below).
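To make the transition behavior concrete, here is a minimal sketch of the "warn and tokenize" fallback. It is not Texthero's actual code: `ensure_tokenized` and `_simple_tokenize` are hypothetical names, and the whitespace split only stands in for `hero.tokenize` so that the example runs with pandas alone.

```python
import warnings

import pandas as pd


def _simple_tokenize(s: pd.Series) -> pd.Series:
    # Hypothetical stand-in for hero.tokenize: plain whitespace splitting.
    return s.str.split()


def ensure_tokenized(s: pd.Series) -> pd.Series:
    """Return `s` unchanged if its cells are token lists, else warn and tokenize."""
    if len(s) > 0 and not isinstance(s.iloc[0], list):
        warnings.warn(
            "The given Pandas Series is not tokenized; applying the default "
            "tokenizer. Consider tokenizing the Series yourself beforehand.",
            UserWarning,
        )
        return _simple_tokenize(s)
    return s


if __name__ == "__main__":
    # A text Series triggers the warning and comes back as lists of tokens.
    print(ensure_tokenized(pd.Series(["hello world", "texthero"])))
```

Like the snippet in the issue, the check only inspects the first cell, which keeps it cheap but assumes the Series is homogeneous.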
Interested in working on this task?
For the tfidf and term_frequency functions, the code is already (almost) written. The body of the function would look like this:
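# Only the first cell is inspected; the Series is assumed to be homogeneous.
if not isinstance(s.iloc[0], list):
    raise ValueError(
        "🤔 It seems like the given Pandas Series is not tokenized. Have you tried passing the Series through `hero.tokenize(s)`?"
    )

# Identity tokenizer and preprocessor: the tokens are used exactly as given.
tfidf = TfidfVectorizer(
    use_idf=True,
    max_features=max_features,
    min_df=min_df,
    max_df=max_df,
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
)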
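For anyone who wants to try this out, the fragment above can be wrapped into a small runnable sketch. The function name `tfidf_sketch`, its default arguments, and the returned Series of vectors are assumptions made for illustration, not the final Texthero API. `token_pattern=None` is added only to silence scikit-learn's warning about an unused token pattern, and `norm` is left at its scikit-learn default, as in the fragment.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    # Pass tokens through untouched; a named function replaces the lambdas.
    return tokens


def tfidf_sketch(s: pd.Series, max_features=None, min_df=1, max_df=1.0):
    """Hypothetical wrapper: TF-IDF on an already tokenized Series."""
    if not isinstance(s.iloc[0], list):
        raise ValueError(
            "It seems like the given Pandas Series is not tokenized. "
            "Have you tried passing the Series through `hero.tokenize(s)`?"
        )
    vectorizer = TfidfVectorizer(
        use_idf=True,
        max_features=max_features,
        min_df=min_df,
        max_df=max_df,
        tokenizer=identity,
        preprocessor=identity,
        token_pattern=None,  # unused because a custom tokenizer is supplied
    )
    matrix = vectorizer.fit_transform(s)
    # One dense vector per document, aligned with the input index.
    vectors = pd.Series(matrix.toarray().tolist(), index=s.index)
    return vectors, vectorizer


if __name__ == "__main__":
    tokens = pd.Series([["hello", "world"], ["hello", "texthero"]])
    vectors, vectorizer = tfidf_sketch(tokens)
    print(sorted(vectorizer.vocabulary_))
    print(vectors)
```

Because the preprocessor is replaced by the identity function, scikit-learn's built-in lowercasing and accent stripping are skipped, so the vocabulary contains the tokens exactly as they were passed in.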
If you are interested in helping out, just leave a comment!
jbesomi changed the title from "All representation function to accept as input an already tokenized Series" to "TokenSeries as input to every representation function" on Jul 14, 2020
henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue on Jul 15, 2020
input. Closes jbesomi#44
Now checks for all text-based functions in representation.py whether the input is already tokenized. If not, it prints a warning and tokenizes. Docstrings and tests are changed accordingly.