diff --git a/docs/extras/use_cases/web_scraping.ipynb b/docs/extras/use_cases/web_scraping.ipynb
index 41bb28703edfd6..07a571aaf20dc5 100644
--- a/docs/extras/use_cases/web_scraping.ipynb
+++ b/docs/extras/use_cases/web_scraping.ipynb
@@ -453,11 +453,11 @@
     "\n",
     "Related to scraping, we may want to answer specific questions using searched content.\n",
     "\n",
-    "We can automate the process of [web research](https://blog.langchain.dev/automating-web-research/) using a retriver, such as the `WebResearchRetriever` ([docs](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)).\n",
+    "We can automate the process of [web research](https://blog.langchain.dev/automating-web-research/) using a retriever, such as the `WebResearchRetriever` ([docs](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)).\n",
     "\n",
     "![Image description](/img/web_research.png)\n",
     "\n",
-    "Copy requirments [from here](https://github.com/langchain-ai/web-explorer/blob/main/requirements.txt):\n",
+    "Copy requirements [from here](https://github.com/langchain-ai/web-explorer/blob/main/requirements.txt):\n",
     "\n",
     "`pip install -r requirements.txt`\n",
     " \n",
@@ -573,13 +573,70 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ff62e5f5",
    "metadata": {},
    "source": [
     "### Going deeper \n",
     "\n",
-    "* Here's a [app](https://github.com/langchain-ai/web-explorer/tree/main) that wraps this retriver with a lighweight UI."
+    "* Here's an [app](https://github.com/langchain-ai/web-explorer/tree/main) that wraps this retriever with a lightweight UI."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "312c399e",
+   "metadata": {},
+   "source": [
+    "## Question answering over a website\n",
+    "\n",
+    "To answer questions over a specific website, you can use Apify's [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor, which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",
+    "and extract text content from the web pages.\n",
+    "\n",
+    "In the example below, we will deeply crawl the Python documentation of LangChain's Chat LLM models and answer a question over it.\n",
+    "\n",
+    "First, install the requirements:\n",
+    "`pip install apify-client openai langchain chromadb tiktoken`\n",
+    " \n",
+    "Next, set `OPENAI_API_KEY` and `APIFY_API_TOKEN` in your environment variables.\n",
+    "\n",
+    "The full code follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "9b08da5e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Yes, LangChain offers integration with OpenAI chat models. \nYou can use the ChatOpenAI class to interact with OpenAI models.\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.docstore.document import Document\n",
+    "from langchain.indexes import VectorstoreIndexCreator\n",
+    "from langchain.utilities import ApifyWrapper\n",
+    "\n",
+    "apify = ApifyWrapper()\n",
+    "# Call the Actor to obtain text from the crawled webpages\n",
+    "loader = apify.call_actor(\n",
+    "    actor_id=\"apify/website-content-crawler\",\n",
+    "    run_input={\"startUrls\": [{\"url\": \"https://python.langchain.com/docs/integrations/chat/\"}]},\n",
+    "    dataset_mapping_function=lambda item: Document(\n",
+    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
+    "    ),\n",
+    ")\n",
+    "\n",
+    "# Create a vector store based on the crawled data\n",
+    "index = VectorstoreIndexCreator().from_loaders([loader])\n",
+    "\n",
+    "# Query the vector store\n",
+    "query = \"Are any OpenAI chat models integrated in LangChain?\"\n",
+    "result = index.query(query)\n",
+    "print(result)"
+   ]
   }
  ],
  "metadata": {
@@ -598,7 +655,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.1"
+   "version": "3.9.16"
   }
  },
  "nbformat": 4,