[Websearch] update #427

Merged
mishig25 merged 48 commits into main from websearch_update on Sep 13, 2023

Conversation

mishig25
Collaborator

@mishig25 mishig25 commented Sep 4, 2023

TLDR: update websearch demo

Specifically, there are 2 parts:

  1. Improve/fix Google query generation and improve information retrieval from websites
  2. Use embeddings to find relevant texts from sources

Testing locally

  1. Make sure you are using meta-llama/Llama-2-70b-chat-hf as your model
  2. Run npm ci, since there is a new dependency on the Gradio client

Detailed description

  1. The request that generates a Google search query from the user query was broken (fixed in 0a38bec)
  2. Update webSearchQueryPromptTemplate (9145a3a, fd8f654, and 19ef30b). The new Google query generator prompt template is: "My question is: {{USER QUESTION}}. Based on the conversation history, give me an appropriate query to answer my question for google search. You should not say more than query. You should not say any words except the query. For the context, today is {{CURRENT DATE}}." Adding {{CURRENT DATE}} has been very useful for questions such as "what happened yesterday?", which becomes "news September 3, 2023" (overall, the prompt is the same as the one used in https://huggingface.co/spaces/chansung/llama2-with-gradio-chat, except that the current date is added)
  3. From the Google results, keep only top stories and organic results (discard the knowledge graph and answer box)
  4. From the web pages, only get text from <p> paragraph elements (this seems to remove a lot of "noise" and actually get "useful" text information from websites)
  5. Find relevant information on those webpages using this Gradio space: https://huggingface.co/spaces/mishig/embeddings-similarity (the space is inspired by https://huggingface.co/spaces/smangrul/PEFT-Docs-QA-Chatbot)
    a. use intfloat/e5-large-v2 for embeddings
    b. use hnswlib for finding relevant text chunks
    c. use the Gradio JS client
  6. Inject the relevant info into the messages
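
For readers who want a feel for steps 2, 4, and 5, here is a minimal TypeScript sketch. It is not the code from this PR: every function name is hypothetical, the <p> extraction uses a regular expression where a real scraper would use an HTML parser, and the ranking is a brute-force cosine similarity standing in for the hnswlib index mentioned above. The embeddings are assumed to have been computed already (for example with intfloat/e5-large-v2).

```typescript
// Hypothetical helpers for illustration only; not taken from the PR.

/** Step 2: fill the query-generation prompt, including the current date. */
function buildSearchQueryPrompt(userQuestion: string, today: Date): string {
  // Assumption: an ISO date (YYYY-MM-DD) is precise enough for the model.
  const currentDate = today.toISOString().slice(0, 10);
  return (
    `My question is: ${userQuestion}. ` +
    "Based on the conversation history, give me an appropriate query to answer my question for google search. " +
    "You should not say more than query. You should not say any words except the query. " +
    `For the context, today is ${currentDate}.`
  );
}

/** Step 4: keep only the text found inside <p> elements. */
function extractParagraphs(html: string): string[] {
  const paragraphs: string[] = [];
  // \b keeps tags such as <pre> from matching.
  const pattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const text = (match[1] ?? "")
      .replace(/<[^>]+>/g, "") // drop nested tags such as <b> or <a>
      .replace(/\s+/g, " ")
      .trim();
    if (text.length > 0) paragraphs.push(text);
  }
  return paragraphs;
}

/** Cosine similarity between two vectors of equal length. */
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

/** Step 5 (simplified): indices of the k chunks closest to the query. */
function findClosestChunks(
  queryEmbedding: number[],
  chunkEmbeddings: number[][],
  k: number
): number[] {
  return chunkEmbeddings
    .map((embedding, index) => ({
      index,
      score: cosineSimilarity(queryEmbedding, embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.index);
}
```

The brute-force ranking is linear in the number of chunks, which is likely fine for a handful of scraped pages; an approximate index such as hnswlib trades exactness for speed on larger collections.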

Video cast

Screen.Recording.2023-09-04.at.14.57.42.mov

Notes

I'd say it is working quite well. However, I'm currently tracking down a potential bug that results in completely unrelated text being generated. When I asked "what happened yesterday", it once answered "Why AWS is a good cloud provider". I'm checking whether the bug is in the prompt, the frontend, or the server. (Update: 5bfab64 fixed the error)

Examples

[Six example screenshots]

Next steps

@mishig25 mishig25 changed the title use hosted gradio app [Websearch] update Sep 4, 2023
@nsarrazin nsarrazin self-requested a review September 13, 2023 08:40
Collaborator

@nsarrazin nsarrazin left a comment

Overall this is clearly much better! I tried it with Serper and it worked great. I also think we should switch to Serper for prod tbh.

I like the sources UI, plus it's much faster and more accurate. I left a few minor comments but really nothing blocking, just notes for the future.

Minor nit is that the update collapse menu is getting pretty big and I feel like some updates could be combined/removed imo.

I think it makes sense to keep the ones that contain extra info (like which query is used, which webpages are visited) but the other ones could be dropped/merged into fewer updates imo.

src/lib/server/websearch/generateQuery.ts (outdated, resolved)
src/lib/server/models.ts (outdated, resolved)
src/lib/buildPrompt.ts (resolved)
src/routes/conversation/[id]/web-search/+server.ts (outdated, resolved)
@xenova
Contributor

xenova commented Sep 13, 2023

For the transformers.js integration, I would also recommend the following:

  1. Making copies of the models that will be used, and freezing them. For example, this is what Supabase/gte-small does.
  2. Host the WASM files on Hugging Face instead of jsDelivr. This isn't a massive issue, but it's always good to know it can never break in the future :) You can then set the path to the WASM files with (see docs):
    import { env } from '@xenova/transformers';
    
    // Set location of .wasm files. Defaults to use a CDN.
    env.backends.onnx.wasm.wasmPaths = '/path/to/files/'; // or HTTP(S) link

@mishig25
Collaborator Author

@xenova based on this codebase, do you have a suggestion for the path? And I guess I should put that path in .gitignore?

@xenova
Contributor

xenova commented Sep 13, 2023

> @xenova based on this codebase, do you have a suggestion for the path? And I guess I should put that path in .gitignore?

These can also be HTTP(S) links, so we could host them on the Hub if needed. Alternatively, if you wish to serve these locally, they can be copied from node_modules/@xenova/transformers/dist into whatever the public/static folder is. You can add them to .gitignore if they are copied during the build/dev process.
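
As a rough illustration of the local option, the copy step could look like the Node script below. This is a sketch under assumptions: `copyWasmFiles` is a hypothetical helper, and the `static/wasm` target in the example call is a guess at the project's static folder rather than a path taken from this codebase.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

/**
 * Copy every .wasm file from sourceDir into targetDir (created if missing)
 * and return the names of the copied files in sorted order.
 */
function copyWasmFiles(sourceDir: string, targetDir: string): string[] {
  fs.mkdirSync(targetDir, { recursive: true });
  const wasmFiles = fs
    .readdirSync(sourceDir)
    .filter((name) => name.endsWith(".wasm"));
  for (const name of wasmFiles) {
    fs.copyFileSync(path.join(sourceDir, name), path.join(targetDir, name));
  }
  return wasmFiles.sort();
}

// Example call (both paths are assumptions; adjust to the project layout):
// copyWasmFiles("node_modules/@xenova/transformers/dist", "static/wasm");
```

If the files were served from `/wasm/`, `env.backends.onnx.wasm.wasmPaths` would then be set to that path, as in the snippet above.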

@mishig25
Collaborator Author

@xenova in this PR, transformers.js is running server-side, not in the frontend. Therefore, is import { env } from '@xenova/transformers'; needed?

@gary149 gary149 self-requested a review September 13, 2023 13:46
@xenova
Contributor

xenova commented Sep 13, 2023

> @xenova in this PR, transformers.js is running server-side, not in the frontend. Therefore, is import { env } from '@xenova/transformers'; needed?

I guess it depends on whether you want to modify the default configuration at all (e.g., updating the path to the WASM files). It also depends on how much extra configuration you want users to face. However, as you say, it's running server-side, so it doesn't actually use the WASM files; it uses the onnxruntime-node backend.

TLDR: Probably not necessary! Check these options just in case you want some custom behaviour.

@mishig25 mishig25 merged commit ebac87f into main Sep 13, 2023
2 checks passed
@mishig25 mishig25 deleted the websearch_update branch September 13, 2023 13:57
nsarrazin pushed a commit that referenced this pull request Sep 20, 2023
* Fix request body

* update webSearchQueryPromptTemplate

* update generate google query parser

* Add today's date to google search query creator

* crawl top stories if exists; remove answer_box & knowledgeGraph

* Create paragraph chunks from top articles

* flattened paragraph chunks

* update status texts

* add gradio client

* call gradio app for RAG

* Web scrape only "p, li, span" els

* add MAX_N_CHUNKS

* gradio result typing

* parse only <p> elements

* rm dev change

* update typing WebSearch

* build RAG prompt

* Rm dev change

* change websearch context msg from user to assistant type

* use hosted gradio app

* fix lint

* prompt engineering

* more prompt engineering

* MAX_N_PAGES_SCRAPE = 10

* better error msg

* more prompt engineering

* revert websearch prompt to previous

* rm `top_stories` from websearch as the results are not good

* Stop using gradio client, use regular fetch

* chore

* Rm websearchsummary references as it is no longer used

* update readme

* Apply suggestions from code review

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Use tfjs to do embeddings in server node

* fix websearch component disappearing after finishing generation

* Show sources of closest embeddings used in RAG

* fix prompting and also add current date

* add comment

* comment for search query

* sources

* hide www

* using hostname directly

* Show successful web pages instead of failed ones

* rm noisy messages

* google query generation using previous messages as context

* handle falcon generation

* bring back Browsing webpage msg

---------

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Victor Mustar <victor.mustar@gmail.com>
nsarrazin added a commit that referenced this pull request Sep 20, 2023
* Bump mongodb from 5.3.0 to 5.8.0

Bumps [mongodb](https://github.com/mongodb/node-mongodb-native) from 5.3.0 to 5.8.0.
- [Release notes](https://github.com/mongodb/node-mongodb-native/releases)
- [Changelog](https://github.com/mongodb/node-mongodb-native/blob/v5.8.0/HISTORY.md)
- [Commits](mongodb/node-mongodb-native@v5.3.0...v5.8.0)

---
updated-dependencies:
- dependency-name: mongodb
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Store IP in messageEvents

* IP based rate limit

* Revert "IP based rate limit"

This reverts commit 87c6937.

* ip rate limit

* move rate limit event to top

* Add rate limiting to websearch and title summary (#433)

* [Websearch] update (#427)

* bump to 0.6.0 (#434)

* Update README.md (#435)

* Update README.md

* add description of websearch on readme

* Apply suggestions from code review

Co-authored-by: Victor Muštar <victor.mustar@gmail.com>

* Update README.md

---------

Co-authored-by: Mishig Davaadorj <dmishig@gmail.com>
Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>

* Mobile: fix model selection (#448)

* adjustments and mobile modal

* use dvh unit

* margin

* fix lint on main

* Add latex support with marked-katex-extension (#450)

* Add latex support with marked-katex-extension

* Add renderer

* Fix marked default option problem

* Fix linting error

* Fix lock error

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nathan Sarrazin <sarrazin.nathan@gmail.com>
Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Victor Mustar <victor.mustar@gmail.com>
Co-authored-by: Mishig Davaadorj <dmishig@gmail.com>
Co-authored-by: Blanchon <julien@blanchon.cc>
@Pixel-Panda

"through something hacky like selenium, Google won't like it (and they'll miss ad revenue)."
@nsarrazin who was this quoting on HF?

I'm trying not to implode with cringe and laughter fighting to make sense of such a ridiculous naivety and ill-logic.

Google owns searching the internet? We must not do anything Google doesn't like. Their feelings could get huwwrrrrt. We are not responsible for the eggs Google's business has restricted its success to. We are not responsible for prioritizing the Google-gets-a-check-for any web searchy things any creature partakes in. Cuz, duh, pay Google. I mean just like, they did a thing so pay up humanity! 🤦🏻‍♂️

@xenova
Contributor

xenova commented Oct 2, 2023

@pixelpandacreative I believe this is the thread you are referring to? Your questions will be answered there :)

@ShadowDawg

Just curious, how does web search for this really work? I went through LangChain's web search tutorial, and according to it, the web search API usually returns HTML, which we have to filter manually by relevant tags (span, p, etc.) to get the exact text information. And of course, this doesn't really seem to be applicable to every website. So how does HuggingChat or Perplexity manage to extract the necessary information from any web page?

@mishig25
Collaborator Author
