Add tfidf retriever parameters #3291
langchain/retrievers/tfidf.py
if tfidf_params is None:
    vectorizer = TfidfVectorizer()
else:
    vectorizer = TfidfVectorizer(**tfidf_params)
tfidf_params = tfidf_params or {}
vectorizer = TfidfVectorizer(**tfidf_params)
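To illustrate why the two forms behave the same, here is a minimal sketch. `StubVectorizer`, `build_branching`, and `build_compact` are hypothetical names used only for this illustration; the stub stands in for `TfidfVectorizer` and simply records the keyword arguments it receives.

```python
from typing import Any, Dict, Optional


class StubVectorizer:
    """Hypothetical stand-in for TfidfVectorizer; it only records its kwargs."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


def build_branching(tfidf_params: Optional[Dict[str, Any]] = None) -> StubVectorizer:
    # Original form: branch explicitly on None.
    if tfidf_params is None:
        return StubVectorizer()
    return StubVectorizer(**tfidf_params)


def build_compact(tfidf_params: Optional[Dict[str, Any]] = None) -> StubVectorizer:
    # Suggested form: fall back to an empty dict, then unpack unconditionally.
    tfidf_params = tfidf_params or {}
    return StubVectorizer(**tfidf_params)


# Both forms pass the same keyword arguments for None, an empty dict,
# and a populated dict.
for params in (None, {}, {"ngram_range": (1, 2)}):
    assert build_branching(params).kwargs == build_compact(params).kwargs
```

Unpacking an empty dict passes no keyword arguments, so the `None` case still yields a default-constructed vectorizer.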
@dev2049
Thank you for the review comment.
That is a smart approach. I just updated it!
looks great, thanks @go5paopao!
I just fixed the scikit-learn import error by adding the scikit-learn library to poetry.
I would like to add some new features to the TfidfRetriever in the retriever library.
The TfidfRetriever currently uses TfidfVectorizer from scikit-learn, which has many optional parameters that can affect the TF-IDF weighting and the retrieval results.
For instance, to use the TfidfRetriever with a different language, we may need to supply a custom tokenization step. In the case of Japanese, which does not separate words with spaces, we need to pass a tokenizer parameter to TfidfVectorizer.
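As a rough illustration of what such a parameter dict might look like, the sketch below builds a `tfidf_params` dict with a custom tokenizer. It is not the example from the original description: `char_bigram_tokenizer` is a hypothetical helper, and character bigrams are a simple stand-in for a real Japanese morphological analyzer, which a production setup would more likely use. The `tokenizer` key is the `TfidfVectorizer` parameter of the same name.

```python
from typing import Any, Dict, List


def char_bigram_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: overlapping character bigrams, whitespace ignored."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Zero or one character: return it as a single token, or nothing.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


# Parameters intended to be forwarded to the vectorizer,
# i.e. TfidfVectorizer(**tfidf_params).
tfidf_params: Dict[str, Any] = {"tokenizer": char_bigram_tokenizer}

tokens = char_bigram_tokenizer("東京都 に住む")
print(tokens)
```

With a dict like this, the retriever would build its vectorizer via `TfidfVectorizer(**tfidf_params)`, so any other vectorizer option could be supplied the same way.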
I have submitted this pull request so that we can support this feature.
In addition, I have added a simple unit test. Since there was no test for TfidfRetriever previously, I have created a new test file.
This is my first time submitting a pull request, so if there is anything insufficient or incorrect, please let me know.