<a href="https://colab.research.google.com/github/nicoduchR/chain-bluider-api/blob/master/Copie_de_langchain_API_doc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Use case

[Web research](https://blog.langchain.dev/automating-web-research/) is one of the killer LLM applications:

* Users have [highlighted it](https://twitter.com/GregKamradt/status/1679913813297225729?s=20) as one of their top desired AI tools.
* OSS repos like [gpt-researcher](https://github.com/assafelovic/gpt-researcher) are growing in popularity.

![Image description](https://github.com/langchain-ai/langchain/blob/v0.1/docs/static/img/web_scraping.png?raw=1)

## Overview

Gathering content from the web has a few components:

* `Search`: Query to URL (e.g., using `GoogleSearchAPIWrapper`).
* `Loading`: URL to HTML (e.g., using `AsyncHtmlLoader`, `AsyncChromiumLoader`, etc.).
* `Transforming`: HTML to formatted text (e.g., using `HTML2Text` or `Beautiful Soup`).

## Quickstart

In [None]:
!pip install -q langchain-openai langchain playwright beautifulsoup4
!playwright install

# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()

╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Missing libraries:                                   ║
║     libwoff2dec.so.1.0.2                             ║
║     libgstgl-1.0.so.0                                ║
║     libgstcodecparsers-1.0.so.0                      ║
║     libharfbuzz-icu.so.0                             ║
║     libenchant-2.so.2                                ║
║     libsecret-1.so.0                                 ║
║     libhyphen.so.0                                   ║
║     libmanette-0.2.so.0                              ║
╚══════════════════════════════════════════════════════╝
    at validateDependenciesLinux (/usr/local/lib/python3.10/dist-packages/playwright/driver/package/lib/server/registry/dependencies.js:216:9)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Registry._validateHostRequirements (/usr/local/lib/python3.10/dist-p

Scraping HTML content using a headless instance of Chromium.

* The async nature of the scraping process is handled using Python's asyncio library.
* The actual interaction with the web pages is handled by Playwright.
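
As a rough illustration of the concurrency pattern only (not of Playwright itself), the sketch below runs several simulated page loads at once with `asyncio.gather`. The `fake_fetch` coroutine, the helper names, and the example URLs are hypothetical stand-ins for real browser calls.

```python
import asyncio


async def fake_fetch(url: str, delay: float) -> str:
    """Hypothetical stand-in for a browser page load: waits, then returns fake HTML."""
    await asyncio.sleep(delay)
    return f"<html><body>{url}</body></html>"


async def fetch_all(urls: list[str]) -> list[str]:
    # gather() runs all coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_fetch(u, 0.01) for u in urls))


def run_fetch_all(urls: list[str]) -> list[str]:
    # asyncio.run() starts a fresh event loop; it fails if a loop is already running
    return asyncio.run(fetch_all(urls))
```

A notebook kernel already runs an event loop, so `asyncio.run` raises a `RuntimeError` there. That is the reason `nest_asyncio.apply()` is called in the next cell; alternatively, `await fetch_all(urls)` can be used directly in a notebook cell.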

In [None]:
!pip install nest-asyncio
import nest_asyncio

nest_asyncio.apply()



In [None]:
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer

# Load HTML
loader = AsyncChromiumLoader(["https://openweathermap.org/current"])
html = loader.load()

Extract the text inside content tags such as `<p>`, `<li>`, `<div>`, `<a>`, and `<span>` from the HTML:

* `<p>`: The paragraph tag. It defines a paragraph in HTML and is used to group together related sentences and/or phrases.

* `<li>`: The list item tag. It is used within ordered (`<ol>`) and unordered (`<ul>`) lists to define individual items within the list.

* `<div>`: The division tag. It is a block-level element used to group other inline or block-level elements.

* `<a>`: The anchor tag. It is used to define hyperlinks.

* `<span>`: The span tag. It is an inline container used to mark up part of a text or part of a document.

For many news websites (e.g., WSJ, CNN), headlines and summaries are all in `<span>` tags.
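
To make the idea of tag-based extraction concrete, here is a minimal sketch that uses only the standard library's `html.parser`. It is not how `BeautifulSoupTransformer` is implemented; the `TagTextExtractor` class, the `extract_text` helper, and the sample HTML are hypothetical, and the sketch assumes every requested tag is explicitly closed.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collects text that appears inside any of the requested tags."""

    def __init__(self, tags_to_extract):
        super().__init__()
        self.tags = set(tags_to_extract)
        self.depth = 0   # number of requested tags currently open
        self.parts = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only while we are inside at least one requested tag
        if self.depth > 0 and text:
            self.parts.append(text)


def extract_text(html, tags_to_extract):
    parser = TagTextExtractor(tags_to_extract)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<div><h1>Title</h1><p>First paragraph.</p>"
    "<span>Headline</span><script>var x = 1;</script></div>"
)
print(extract_text(sample, ["p", "span"]))  # prints: First paragraph. Headline
```

The heading and the script body are dropped because neither `<h1>` nor `<script>` is in the list of tags to extract.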

In [None]:
# Transform
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(html, tags_to_extract=["span", "code", "p"])

In [None]:
# Result
docs_transformed[0].page_content[0:500]

'OpenWeather OpenWeather Ltd. GET - on Google Play × Access current weather data for any location on Earth! We collect and process weather data from different sources such as global and local weather models, satellites, radars and a vast network of weather stations. Data is available in JSON, XML, or HTML format. API call https://api.openweathermap.org/data/2.5/weather?lat= {lat} &lon= {lon} &appid= {API key} lat required lon required appid required mode optional xml html mode units optional stan'

## Splitting the document into chunks

In [None]:
!pip install --quiet langchain_experimental langchain_openai langchain_chroma

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from api_key import OPENAI_KEY
from langchain_text_splitters import CharacterTextSplitter


In [None]:
#text_splitter = SemanticChunker(
#    OpenAIEmbeddings(openai_api_key=OPENAI_KEY),
#    breakpoint_threshold_type="percentile"
#)
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=1000,
    chunk_overlap=200,
    is_separator_regex=False,
)

In [None]:
len(docs_transformed[0].page_content)

22070

In [None]:
docs_chunk = text_splitter.create_documents([docs_transformed[0].page_content])
print(docs_chunk[0].page_content)

OpenWeather OpenWeather Ltd. GET - on Google Play × Access current weather data for any location on Earth! We collect and process weather data from different sources such as global and local weather models, satellites, radars and a vast network of weather stations. Data is available in JSON, XML, or HTML format. API call https://api.openweathermap.org/data/2.5/weather?lat= {lat} &lon= {lon} &appid= {API key} lat required lon required appid required mode optional xml html mode units optional standard metric imperial units standard lang optional Please use Geocoder API if you need automatic convert city names and zip-codes to geo coordinates and the other way around. Please note that built-in geocoder has been deprecated. Although it is still available for use, bug fixing and updates are no longer available for this functionality. Examples of API calls https://api.openweathermap.org/data/2.5/weather?lat=44.34&lon=10.99&appid= {API key} (https://home.openweathermap.org/api_keys)  { "coord

In [None]:
docs_chunk

[Document(page_content='OpenWeather OpenWeather Ltd. GET - on Google Play × Access current weather data for any location on Earth! We collect and process weather data from different sources such as global and local weather models, satellites, radars and a vast network of weather stations. Data is available in JSON, XML, or HTML format. API call https://api.openweathermap.org/data/2.5/weather?lat= {lat} &lon= {lon} &appid= {API key} lat required lon required appid required mode optional xml html mode units optional standard metric imperial units standard lang optional Please use Geocoder API if you need automatic convert city names and zip-codes to geo coordinates and the other way around. Please note that built-in geocoder has been deprecated. Although it is still available for use, bug fixing and updates are no longer available for this functionality. Examples of API calls https://api.openweathermap.org/data/2.5/weather?lat=44.34&lon=10.99&appid= {API key} (https://home.openweathermap

In [None]:
len(docs_chunk)

28
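
The chunk count above can be sanity-checked with a plain sliding-window splitter. This is a simplified sketch, not LangChain's `CharacterTextSplitter` (which may, for example, strip whitespace at chunk boundaries); the function name `sliding_window_chunks` and the placeholder text are assumptions made for illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


text = "x" * 22070  # placeholder with the same length as the page content above
chunks = sliding_window_chunks(text)
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # prints: 28 1000 470
```

With a window of 1000 characters and a step of 800, a text of 22,070 characters yields 28 windows, which is consistent with the `len(docs_chunk)` result shown above.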

## Vector Store

In [None]:
from langchain_chroma import Chroma

In [None]:
def reset_chroma_vector_store(vector_store):
    # Get all document ids in the vector store (get() returns a dict with an "ids" list)
    doc_ids = vector_store._collection.get()["ids"]
    # Delete all documents in one call; skip it when the store is already empty
    if doc_ids:
        vector_store._collection.delete(ids=doc_ids)

In [None]:
db = Chroma.from_documents(
    docs_chunk,
    OpenAIEmbeddings(openai_api_key=OPENAI_KEY)
)

## Retriever

In [None]:
#retriever = db.as_retriever(
#    search_type="similarity_score_threshold",
#    search_kwargs={"k": 2, "score_threshold": 0.5}
#)

retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 3})
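
The `mmr` search type stands for maximal marginal relevance: each pick trades relevance to the query against similarity to the documents already selected. The sketch below shows the idea in plain Python on toy vectors. It is not the implementation used by Chroma or LangChain; `mmr_select`, the `lambda_mult` weight as used here, and the vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize candidates that resemble something already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 1.0, 0.0]
docs = [
    [1.0, 0.1, 0.0],  # most relevant to the query
    [1.0, 0.0, 0.0],  # near-duplicate of document 0
    [0.0, 1.0, 0.5],  # slightly less relevant, but different from document 0
]
print(mmr_select(query, docs, k=2))  # prints: [0, 2]
```

A plain top-2 similarity search would return documents 0 and 1 here. MMR skips the near-duplicate and returns document 2 instead, which is why it can give more varied context for a query that asks about two things at once.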

In [None]:
docs_answer = retriever.invoke("How to convert city name to geo coordinates? And what are the possible response format for the API?")

In [None]:
len(docs_answer)

3

In [None]:
docs_answer[0].page_content

'There is no need to call an API to do this. More information is on the Bulk page (/bulk) . Examples of bulk files http://bulk.openweathermap.org/sample/ (http://bulk.openweathermap.org/sample/) Requesting API calls by geographical coordinates is the most accurate way to specify any location. If you need to convert city names and zip-codes to geo coordinates and the other way around automatically, please use our Geocoding API . Please use Geocoder API if you need automatic convert city names and zip-codes to geo coordinates and the other way around. Please note that API requests by city name (#name) , zip-codes (#zip) and city id (#cityid) have been deprecated. Although they are still available for use, bug fixing and updates are no longer available for this functionality. You can call by city name or city name, state code and country code. Please note that searching by states available only for the USA locations. API call https://api.openweathermap.org/data/2.5/weather?q= {city name}'

## Connecting to ChatGPT

In [None]:
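from langchain_openai import ChatOpenAI

# gpt-4-turbo is a chat model, so it needs the chat wrapper rather than the completion-style OpenAI class
llm = ChatOpenAI(model_name="gpt-4-turbo", openai_api_key=OPENAI_KEY)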

In [None]:
!pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.30.1
    Uninstalling openai-1.30.1:
      Successfully uninstalled openai-1.30.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-openai 0.1.7 requires openai<2.0.0,>=1.24.0, but you have openai 0.28.0 which is incompatible.
Successfully installed openai-0.28.0


In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

prompt_template = PromptTemplate.from_template(
    "Answer to this query: {query}. Based on this context: {context}. Ignore irrelevant information."
)

# Completion-style LLM with the library's default model
llm = OpenAI(api_key=OPENAI_KEY)

def end2end_call(query):
    # Retrieve the chunks most relevant to the query
    docs_query = retriever.invoke(query)
    # Concatenate the retrieved chunks into a single context string
    context = ""
    for doc in docs_query:
        context += doc.page_content + "\n\n"
    # Fill the prompt template and send it to the LLM
    prompt = prompt_template.format(query=query, context=context)
    res = llm.invoke(prompt)
    return res

In [None]:
res = end2end_call(query="How to convert city name to geo coordinates?")

Answer to this query: How to convert city name to geo coordinates?. Based on this context: There is no need to call an API to do this. More information is on the Bulk page (/bulk) . Examples of bulk files http://bulk.openweathermap.org/sample/ (http://bulk.openweathermap.org/sample/) Requesting API calls by geographical coordinates is the most accurate way to specify any location. If you need to convert city names and zip-codes to geo coordinates and the other way around automatically, please use our Geocoding API . Please use Geocoder API if you need automatic convert city names and zip-codes to geo coordinates and the other way around. Please note that API requests by city name (#name) , zip-codes (#zip) and city id (#cityid) have been deprecated. Although they are still available for use, bug fixing and updates are no longer available for this functionality. You can call by city name or city name, state code and country code. Please note that searching by states available only for t

In [None]:
res

'.\n\nTo convert a city name to geo coordinates, you can use the Bulk page on the OpenWeather website. However, please note that the API requests by city name, zip-codes, and city id have been deprecated, so it is recommended to use the Geocoding API for automatic conversion of city names and zip-codes to geo coordinates and vice versa. You can also call the API directly with the city name in the API call. For example, the API call for London would be: https://api.openweathermap.org/data/2.5/weather?q=London. You can also use JSONP callback by adding "&callback=test" to the end of the API call and defining a callback function in your JavaScript code. The API response will include relevant information such as coordinates, weather conditions, temperature, wind speed, and more. OpenWeather is a team of IT experts and data scientists who specialize in weather data science, so you can trust the accuracy and reliability of the data provided.'