Add tfidf retriever parameters #3291

go5paopao
Contributor

I would like to add some new features to the TfidfRetriever in the retriever library.

The TfidfRetriever currently uses TfidfVectorizer from scikit-learn, which has many optional parameters that affect the TF-IDF weighting and therefore the retrieval results.

For instance, if we want to use the TfidfRetriever with a different language, we may need to supply a custom tokenization step. In the case of Japanese, we need to pass a tokenizer function to TfidfVectorizer, as shown below:

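import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the tagger once; "-Owakati" makes MeCab output space-separated tokens.
mecab = MeCab.Tagger("-Owakati")

def mecab_tokenizer(text):
    # Split the space-separated MeCab output into a list of tokens.
    return mecab.parse(text).split()

# Use the MeCab-based tokenizer instead of the default regex tokenizer.
vectorizer = TfidfVectorizer(tokenizer=mecab_tokenizer)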

I have submitted this pull request so that we can support this feature.
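As a rough illustration of what such a parameter dictionary might look like when MeCab is not installed, here is a minimal sketch. The character-bigram tokenizer and the `make_vectorizer` helper are hypothetical names made up for this example; they are not part of this pull request. Only `TfidfVectorizer` and its `tokenizer` and `token_pattern` parameters come from scikit-learn.

```python
from typing import Any, Dict, List


def char_bigram_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in for a MeCab tokenizer: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Zero or one character: return it as a single token (or nothing).
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# Keyword arguments intended for TfidfVectorizer.
# token_pattern=None is set because the regex pattern is unused
# once a custom tokenizer is supplied.
tfidf_params: Dict[str, Any] = {
    "tokenizer": char_bigram_tokenizer,
    "token_pattern": None,
}


def make_vectorizer(params: Dict[str, Any]):
    """Hypothetical helper: build a TfidfVectorizer from a parameter dict."""
    # Imported lazily so the tokenizer above can be tried without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer

    return TfidfVectorizer(**params)
```

A dictionary like `tfidf_params` is the kind of value the new argument is meant to carry through to the vectorizer.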

In addition, I have added a simple unit test. Since there was previously no test code for TfidfRetriever, I have created a new test file.

This is my first time submitting a pull request, so if there is anything insufficient or incorrect, please let me know.

@go5paopao changed the title from "Add tfidf retriever params" to "Add tfidf retriever parameters" Apr 21, 2023
Comment on lines 32 to 35
if tfidf_params is None:
    vectorizer = TfidfVectorizer()
else:
    vectorizer = TfidfVectorizer(**tfidf_params)
Contributor

tfidf_params = tfidf_params or {}
vectorizer = TfidfVectorizer(**tfidf_params)
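
To show why the two-line form behaves the same as the original `if`/`else`, here is a small self-contained sketch. `FakeVectorizer` and `build_vectorizer` are hypothetical stand-ins used only so the example runs without scikit-learn; they are not names from the pull request.

```python
from typing import Any, Dict, Optional


class FakeVectorizer:
    """Hypothetical stand-in for TfidfVectorizer that records its keyword arguments."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


def build_vectorizer(tfidf_params: Optional[Dict[str, Any]] = None) -> FakeVectorizer:
    # None (and an empty dict) are falsy, so both fall back to {}.
    tfidf_params = tfidf_params or {}
    # Unpacking an empty dict is the same as calling with no arguments.
    return FakeVectorizer(**tfidf_params)


default = build_vectorizer()
custom = build_vectorizer({"lowercase": False})
```

Here `default.kwargs` is an empty dict and `custom.kwargs` holds the supplied option, so a single constructor call covers both branches.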

Contributor Author

@dev2049
Thank you for the review comment.
That is a smart approach. I have just updated the code!

@dev2049
Contributor

dev2049 commented Apr 23, 2023

looks great, thanks @go5paopao!

@go5paopao
Contributor Author

I just fixed the scikit-learn import error by adding scikit-learn to the Poetry dependencies.
I think it is now resolved.

@hwchase17 hwchase17 changed the base branch from master to harrison/tfidf-parameters April 25, 2023 02:48
@hwchase17 hwchase17 merged commit 1ddbf28 into langchain-ai:harrison/tfidf-parameters Apr 25, 2023