[Websearch] update #427

mishig25 · 2023-09-04T11:12:12Z

TLDR: update websearch demo

Specifically, there are 2 parts:

1. Improve/fix google query generation, better information retrieval from websites
2. Use embeddings to find relevant texts from sources

Testing locally

Make sure you are using meta-llama/Llama-2-70b-chat-hf as your model
~~Run npm ci since there is a new dependency to gradio client~~

Detailed description

1. The request that generates a google search query from the user query was broken (fixed in 0a38bec)
2. Update webSearchQueryPromptTemplate (9145a3a, fd8f654, and 19ef30b). The new google query generator prompt template is: "My question is: {{USER QUESTION}}. Based on the conversation history, give me an appropriate query to answer my question for google search. You should not say more than query. You should not say any words except the query. For the context, today is {{CURRENT DATE}}." Adding {{CURRENT DATE}} has been very useful for questions such as "what happened yesterday?", which becomes "news September 3, 2023". (Overall, the prompt is the same as the one used in https://huggingface.co/spaces/chansung/llama2-with-gradio-chat, except that the current date is added.)
3. From the google results, get top stories and organic results only (discard knowledge graph and answer box)
4. From the web pages, only get text from <p> paragraph elements (this seems to remove a lot of "noise" and actually get "useful" text information from websites)
5. Find relevant information on those web pages using this gradio space https://huggingface.co/spaces/mishig/embeddings-similarity (the space is inspired by https://huggingface.co/spaces/smangrul/PEFT-Docs-QA-Chatbot)
   a. use intfloat/e5-large-v2 for embeddings here
   b. use hnswlib for finding relevant text chunks here
   c. and use gradio js client here (docs here)
6. Inject relevant info into the messages here
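
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`buildQueryPrompt`, `selectLinks`, `topKChunks`) and the response field names (`organic_results`, `top_stories`) are assumptions, embeddings are taken as plain number arrays however they were produced, and a brute-force cosine-similarity scan stands in for the hnswlib index.

```typescript
// Hypothetical helper names throughout; this is not the chat-ui source.

const MONTHS = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

/** Format a date as e.g. "September 4, 2023" (UTC, to stay deterministic). */
function formatDate(date: Date): string {
  return `${MONTHS[date.getUTCMonth()]} ${date.getUTCDate()}, ${date.getUTCFullYear()}`;
}

/** Fill the query-generation prompt quoted in the description. */
function buildQueryPrompt(userQuestion: string, today: Date): string {
  return (
    `My question is: ${userQuestion}. ` +
    "Based on the conversation history, give me an appropriate query to answer my question for google search. " +
    "You should not say more than query. You should not say any words except the query. " +
    `For the context, today is ${formatDate(today)}.`
  );
}

// Assumed response shape; the field names are illustrative.
interface SearchResponse {
  organic_results?: { link: string }[];
  top_stories?: { link: string }[];
  answer_box?: unknown; // ignored
  knowledge_graph?: unknown; // ignored
}

/** Keep links from top stories and organic results only, without duplicates. */
function selectLinks(response: SearchResponse): string[] {
  const links: string[] = [];
  const candidates = (response.top_stories ?? []).concat(response.organic_results ?? []);
  for (const { link } of candidates) {
    if (links.indexOf(link) === -1) links.push(link);
  }
  return links;
}

/** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

/** Indices of the k chunks whose embeddings are closest to the query (exact scan). */
function topKChunks(queryEmbedding: number[], chunkEmbeddings: number[][], k: number): number[] {
  return chunkEmbeddings
    .map((embedding, index) => ({ index, score: cosineSimilarity(queryEmbedding, embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.index);
}
```

An exact scan like this is linear in the number of chunks, which is usually acceptable for a few scraped pages; an approximate index such as hnswlib trades a little accuracy for speed on larger sets.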

Video cast

Screen.Recording.2023-09-04.at.14.57.42.mov

Notes

I'd say it is working quite well. However, right now I'm looking for a potential bug that results in generating completely unrelated text. When I asked "what happened yesterday", it once answered "Why AWS is a good cloud provider". I'm checking whether the bug is in the prompt, frontend, or server. (Update: 5bfab64 fixed the error)

Examples

Open the examples:

Next steps

get @gary149 feedback and decide

nsarrazin

Overall this is clearly much better! I tried it with Serper and it worked great. I also think we should switch to serper for prod tbh.

I like the sources UI, plus it's much faster and more accurate. I left a few minor comments but really nothing blocking, just notes for the future.

Minor nit is that the update collapse menu is getting pretty big and I feel like some updates could be combined/removed imo.

I think it makes sense to keep the ones that contain extra info (like which query is used, which webpages are visited) but the other ones could be dropped/merged into fewer updates imo.

src/lib/server/websearch/generateQuery.ts

src/lib/server/models.ts

src/lib/buildPrompt.ts

src/routes/conversation/[id]/web-search/+server.ts

src/lib/server/websearch/sentenceSimilarity.ts

xenova · 2023-09-13T11:50:45Z

For the transformers.js integration, I would also recommend the following:

Making copies of the models that will be used, and freezing them. For example, this is what Supabase/gte-small does.
Host the WASM files on Hugging Face instead of jsdelivr. This isn't a massive issue, but it's always good to know it can never break in the future :) You can then set the path to the WASM files with (see docs):
```
import { env } from '@xenova/transformers';

// Set location of .wasm files. Defaults to use a CDN.
env.backends.onnx.wasm.wasmPaths = '/path/to/files/'; // or HTTP(S) link
```

mishig25 · 2023-09-13T12:01:21Z

@xenova based on this codebase, do you have suggestion for the path? And I guess I should put that path in gitignore?

xenova · 2023-09-13T12:03:27Z

> @xenova based on this codebase, do you have suggestion for the path? And I guess I should put that path in gitignore?

These can also be HTTP(s) links, so we could host them on the hub if needed. Alternatively, if you wish to serve these locally, they can be copied from node_modules/@xenova/transformers/dist into whatever the public/static folder is. You can ignore them if they are copied during the build/dev process.

src/routes/conversation/[id]/web-search/+server.ts

mishig25 · 2023-09-13T13:40:49Z

@xenova in this PR, tfjs is running on server-side, not frontend. Therefore, is import { env } from '@xenova/transformers'; needed?

xenova · 2023-09-13T13:50:00Z

> @xenova in this PR, tfjs is running on server-side, not frontend. Therefore, is import { env } from '@xenova/transformers'; needed?

I guess it depends if you want to modify the default configuration at all (e.g., updating the path to the WASM files). It also depends how much extra configuration you want users to face. However, as you say, it's running server-side, so it doesn't actually use the WASM files - it uses their onnxruntime-node backend.

TLDR: Probably not necessary! Check these options just in case you want some custom behaviour.

* Fix reuqest body
* update webSearchQueryPromptTemplate
* update generate google query parser
* Add today's date to google search query creator
* crawl top stories if exts; remove answer_box & knowledgeGraph
* Create paragraph chunks from top articles
* flattened paragprah chunks
* update status texts
* add gradio client
* call gradio app for RAG
* Web scrape only "p, li, span" els
* add MAX_N_CHUNKS
* gradio result typing
* parse only <p> elements
* rm dev change
* update typing WebSearch
* buld RAG prompt
* Rm dev change
* change websearch context msg from user to assisntat type
* use hosted gradio app
* fix lint
* prompt engineering
* more prompt engineering
* MAX_N_PAGES_SCRAPE = 10
* better error msg
* more prompt engineering
* revert websearch prompt to previous
* rm `top_stories` from websearch as the results are not good
* Stop using gradio client, use regular fetch
* chore
* Rm websearchsummary references as it is no longer used
* update readme
* Apply suggestions from code review
  Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Use tfjs to do embeddings in server node
* fix websearch component disapperar after finishing generation
* Show sources of closest embeddings used in RAG
* fix prompting and also add current date
* add comment
* comment for search query
* sources
* hide www
* using hostname direclty
* Show successful web pages instead of failed ones
* rm noisy messages
* google query generation using previous messaages as context
* handle falcon generation
* bring back Browsing webpage msg

---------

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Victor Mustar <victor.mustar@gmail.com>


Pixel-Panda · 2023-10-01T22:44:28Z

“ through something hacky like selenium, Google won't like it (and they'll miss ad revenue).“
@nsarrazin of whom was this quoting on HF?

I'm trying not to implode with cringe and laughter fighting to make sense of such a ridiculous naivety and ill-logic.

Google owns searching the internet? We must not do anything Google doesn't like. Their feelings could get huwwrrrrt. We are not responsible for the eggs Google's business has restricted its success to. We are not responsible for prioritizing the Google-gets-a-check-for any web searchy things any creature partakes in. Cuz, duh, pay Google. I mean just like, they did a thing so pay up humanity! 🤦🏻‍♂️

xenova · 2023-10-02T09:52:22Z

@pixelpandacreative I believe this is the thread you are referring to? Your questions will be answered there :)

ShadowDawg · 2024-02-22T13:24:04Z

Just curious, how does web search for this really work? I went through langchain's web search tutorial and according to it, the web search api usually returns html and we have to manually filter it by relevant tags (span, p, etc) to get the exact text information. And of course, this doesn't really seem to be applicable to every website. So how does huggingchat or perplexity manage to extract the necessary information from any web page?

mishig25 · 2024-02-22T13:50:26Z

the overall process is described in https://github.com/huggingface/chat-ui?tab=readme-ov-file#web-search

you can see the code for the details: https://github.com/huggingface/chat-ui/tree/main/src/lib/server/websearch
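
For a rough idea of the `<p>`-only extraction described in this PR, here is a dependency-free sketch. It assumes a regex is good enough for illustration; the function names are hypothetical, and a real HTML parser would be more robust on malformed or nested markup.

```typescript
// Hypothetical helpers for illustration; not the chat-ui implementation.

/** Decode a handful of HTML entities that commonly appear in body text. */
function decodeEntities(text: string): string {
  return text
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" is not decoded twice
}

/** Return the visible text of every non-empty <p> element in an HTML string. */
function extractParagraphs(html: string): string[] {
  const paragraphs: string[] = [];
  // \b keeps <pre>, <param>, etc. from matching as <p>.
  const pattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const withoutTags = match[1].replace(/<[^>]+>/g, ""); // drop inline tags such as <b> or <a>
    const text = decodeEntities(withoutTags).replace(/\s+/g, " ").trim();
    if (text.length > 0) paragraphs.push(text);
  }
  return paragraphs;
}
```

Text in headings, `<span>` elements, navigation, and scripts is skipped simply because it never sits inside a `<p>`, which is the noise reduction the PR description refers to.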

mishig25 added 20 commits September 1, 2023 10:33

Fix reuqest body

0a38bec

update webSearchQueryPromptTemplate

9145a3a

update generate google query parser

fd8f654

Add today's date to google search query creator

19ef30b

crawl top stories if exts; remove answer_box & knowledgeGraph

aee8168

Create paragraph chunks from top articles

3025d0b

flattened paragprah chunks

d6c89b8

update status texts

21c94b0

add gradio client

8b79e70

call gradio app for RAG

940016e

Web scrape only "p, li, span" els

1157507

add MAX_N_CHUNKS

74af0b3

gradio result typing

80c74dd

parse only <p> elements

2dcc92f

rm dev change

84593f9

update typing WebSearch

c8223bd

buld RAG prompt

5f0daf4

Rm dev change

251c28c

change websearch context msg from user to assisntat type

1bfa0c1

use hosted gradio app

7fcf22d

mishig25 changed the title ~~use hosted gradio app~~ [Websearch] update Sep 4, 2023

fix lint

2f0b2fa

mishig25 force-pushed the websearch_update branch from c050b95 to 2f0b2fa Compare September 4, 2023 12:49

prompt engineering

57e52bf

mishig25 requested a review from gary149 September 4, 2023 14:03

mishig25 added 5 commits September 4, 2023 15:26

more prompt engineering

5bfab64

MAX_N_PAGES_SCRAPE = 10

43b3dc1

better error msg

881b9f4

more prompt engineering

00f3094

revert websearch prompt to previous

c72a77b

nsarrazin self-requested a review September 13, 2023 08:40

gary149 added 4 commits September 13, 2023 12:11

sources

8305b25

hide www

4b70c82

using hostname direclty

9c1a26e

Show successful web pages instead of failed ones

e4d313c

nsarrazin approved these changes Sep 13, 2023


julien-c reviewed Sep 13, 2023

src/lib/server/websearch/sentenceSimilarity.ts (resolved)

mishig25 added 2 commits September 13, 2023 14:11

rm noisy messages

dd15579

google query generation using previous messaages as context

87a62af

gary149 reviewed Sep 13, 2023

src/routes/conversation/[id]/web-search/+server.ts (outdated, resolved)

mishig25 added 2 commits September 13, 2023 14:49

handle falcon generation

ce30878

bring back Browsing webpage msg

199826e

gary149 self-requested a review September 13, 2023 13:46

gary149 approved these changes Sep 13, 2023


Merge branch 'main' into websearch_update

ebb0ed9

mishig25 merged commit ebac87f into main Sep 13, 2023
2 checks passed

mishig25 deleted the websearch_update branch September 13, 2023 13:57

mishig25 mentioned this pull request Sep 13, 2023

Update embedding model for WebSearch #437

Merged

irthomasthomas mentioned this pull request Oct 5, 2023

Chat-UI with RAG websearch from huggingface. irthomasthomas/undecidability#81

Open
