<a href="https://colab.research.google.com/github/hongqin/AI4Health/blob/main/RAG_wikipedia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Tutorial: Chat with a Document using an Open-Source LLM on Google Colab


In [1]:
!pip install langchain-community


In [2]:

# Step 1: Install Required Libraries
# To get started, we'll use the `langchain` library for building language model applications and the `transformers` library for loading open-source LLMs.

# Install libraries
!pip install langchain transformers sentence_transformers chromadb beautifulsoup4 requests protobuf==3.20.3

# Step 2: Load the Open-Source Language Model
# We'll use the pre-trained `tiiuae/falcon-7b` model. Its checkpoint is roughly 14 GB, so loading and generation are slow on a CPU-only runtime; choose a smaller model or a GPU runtime if you need faster responses.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# `langchain.llms` is a deprecated import path; the class is provided by `langchain_community`
from langchain_community.llms import HuggingFacePipeline

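# Load the model and tokenizer, then wrap them in a LangChain-compatible LLM
def load_model(max_new_tokens=50):
    model_name = "tiiuae/falcon-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # device_map="auto" places the weights on a GPU when one is available, otherwise on the CPU
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
    # max_new_tokens caps the length of each generated answer
    text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=max_new_tokens)
    llm = HuggingFacePipeline(pipeline=text_generator)
    return llm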

llm = load_model()



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cpu
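
When this notebook was run, the two safetensors shards of `tiiuae/falcon-7b` were reported as 9.95 GB and 4.48 GB. The arithmetic below is a back-of-the-envelope sketch of what that implies. The shard sizes come from the run; the 2 bytes per parameter (16-bit weights) and the helper name `estimate_params` are assumptions made for illustration.

```python
# Back-of-the-envelope estimate of model size from the checkpoint shards.
SHARD_SIZES_GB = [9.95, 4.48]  # shard sizes reported when the notebook ran
BYTES_PER_PARAM = 2            # assumption: 16-bit (bfloat16/float16) weights


def estimate_params(shard_sizes_gb, bytes_per_param):
    """Return (total size in GB, approximate parameter count in billions)."""
    total_gb = sum(shard_sizes_gb)
    # GB -> bytes -> parameters -> billions of parameters
    params_billion = total_gb * 1e9 / bytes_per_param / 1e9
    return total_gb, params_billion


total_gb, params_billion = estimate_params(SHARD_SIZES_GB, BYTES_PER_PARAM)
print(f"Checkpoint size: {total_gb:.2f} GB")
print(f"Approximate parameters: {params_billion:.1f} billion")
```

Holding about 14 GB of weights in memory and generating on the CPU is likely to be slow, which is consistent with the interrupted query cells further down.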


In [3]:
# Step 3: Fetch and Index Text from Wikipedia
# We'll scrape text from a Wikipedia page, process it, and index it for querying.

import requests
from bs4 import BeautifulSoup
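# `langchain.embeddings` and `langchain.vectorstores` are deprecated import paths
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma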
from langchain.text_splitter import RecursiveCharacterTextSplitter

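# Fetch the paragraph text of a Wikipedia page
def fetch_wikipedia_text(url):
    # Identify the client; this User-Agent value is an illustrative placeholder
    headers = {"User-Agent": "rag-wikipedia-tutorial/0.1"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of indexing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = soup.find_all('p')
    # Join with newlines so paragraph boundaries survive even without trailing newlines
    text = "\n".join(para.get_text() for para in paragraphs)
    return text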

# Index the document
def index_document_from_text(text):
    # Split text into manageable chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    docs = text_splitter.split_text(text)

    # Use embeddings for indexing
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    db = Chroma.from_texts(docs, embeddings)
    return db

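# Example: Fetch and index a Wikipedia page
# Note: the "#History" fragment is not sent to the server, so the whole article is fetched and indexed.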
wiki_url = "https://en.wikipedia.org/wiki/Computer#History"
wikipedia_text = fetch_wikipedia_text(wiki_url)
db = index_document_from_text(wikipedia_text)
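
To make the `chunk_size=500, chunk_overlap=50` settings concrete, here is a minimal sketch of overlapping chunking. It is a simplified stand-in: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, whereas the hypothetical `chunk_text` helper below cuts at fixed character offsets.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# A 1200-character sample made of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
for i, chunk in enumerate(chunks):
    print(i, len(chunk))
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.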



In [4]:
# Step 4: Build a Chat Interface
# Create a function to interact with your document using the LLM and the indexed data.

from langchain.chains import RetrievalQA

# Define a query-answering chain
def create_qa_chain(llm, db):
    retriever = db.as_retriever()
    qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
    return qa_chain

qa_chain = create_qa_chain(llm, db)
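
Conceptually, the chain retrieves the chunks most similar to the question and, with the default `stuff` chain type, places them in a single prompt for the LLM. The sketch below illustrates that flow with standard-library code only. The word-count "embedding", the helper names, and the prompt wording are all assumptions for illustration; they are not LangChain's encoder or prompt template.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Place the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


chunks = [
    "The abacus was an early aid for arithmetic",
    "Charles Babbage designed the analytical engine",
    "Modern computers use integrated circuits",
]
query = "who designed the analytical engine"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Because every retrieved chunk is added to the prompt, both the number of chunks and the chunk size affect how much text the model has to process per question.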

In [11]:
# Time each query to see how long the chain takes to answer

import time

# Step 5: Query the Document
# Finally, input a question and get an answer based on your document.

# Ask a question
def ask_question(question):
    print(f"(ask_question) Processing the question: {question}")
    start_time = time.time()
    # Pass only the "query" key: the chain does not use a "max_length" input,
    # so the answer length has to be limited on the generation pipeline instead.
    response = qa_chain.invoke({"query": question})["result"]
    end_time = time.time()
    print(f"(ask_question) Question processed in {end_time - start_time:.2f} seconds.")
    print("(ask_question) Answer received successfully.")
    return response

question = "What is the history of computers?"
print("(main) Asking the question to the model...")
answer = ask_question(question)
print("(main) Answer:", answer)


Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


(main) Asking the question to the model...
(ask_question) Processing the question: What is the history of computers?


KeyboardInterrupt: 
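
The timing code in the cell above can be factored into a reusable helper that still reports the elapsed time when a cell is interrupted. This is a minimal sketch: `timed` and `slow_answer` are hypothetical names, and `slow_answer` only stands in for the QA chain call so that the example runs without loading a model.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, even if it raises or is interrupted."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"({label}) finished in {elapsed:.2f} seconds.")


def slow_answer(question):
    # Hypothetical stand-in for the QA chain call
    time.sleep(0.1)
    return f"echo: {question}"


with timed("ask_question"):
    answer = slow_answer("What is the history of computers?")
print(answer)
```

Using `time.perf_counter()` rather than `time.time()` avoids surprises if the system clock is adjusted while a long generation is running.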

In [8]:
import time

# Step 5: Query the Document
# Finally, input a question and get an answer based on your document.

# Ask a question
def ask_question(question, max_response_length=50):
    print(f"(ask_question) Processing the question: {question}")
    start_time = time.time()
    # This attempt fails: `run` accepts positional or keyword arguments, not both.
    # `chain_type_kwargs` is also an option of `RetrievalQA.from_chain_type`,
    # not a per-call setting, so it cannot limit the response length here.
    response = qa_chain.run(question, chain_type_kwargs={"max_length": max_response_length})
    end_time = time.time()
    print(f"(ask_question) Question processed in {end_time - start_time:.2f} seconds.")
    print("(ask_question) Answer received successfully.")
    return response

question = "What is the history of computers?"
print("(main) Asking the question to the model...")
answer = ask_question(question)
print("(main) Answer:", answer)

(main) Asking the question to the model...
(ask_question) Processing the question: What is the history of computers?


ValueError: `run` supported with either positional arguments or keyword arguments but not both. Got args: ('What is the history of computers?',) and kwargs: {'chain_type_kwargs': {'max_length': 50}}.

In [6]:
# Step 5: Query the Document
# Finally, input a question and get an answer based on your document.

# Ask a question
def ask_question(question):
    print("(ask_questions):", question)
    response = qa_chain.run(question)
    return response

question = "What is the history of computers?"
answer = ask_question(question)
print("Answer:", answer)

# Step 6: Optional Enhancements
# - **Streamlit UI:** Build a web interface for interaction.
# - **Model Fine-Tuning:** Customize the model for your specific data.
# - **Cloud Integration:** Store and index large documents using cloud storage.

# This completes the tutorial on using an open-source LLM to chat with a Wikipedia page on Google Colab.


Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


(ask_questions): What is the history of computers?


KeyboardInterrupt: 