# Lab | LangChain Med

## Objectives

- Continue with the lesson 2 example, using different datasets to test what we did in class. Some datasets are suggested in the notebook, but feel free to scout other datasets on Hugging Face or Kaggle.
- Find another model on Hugging Face and compare it.
- Modify the prompt to fit your selected dataset.

In [1]:
import numpy as np
import pandas as pd

## Load the Dataset
As you can see, the notebook is ready to work with three different datasets. Just uncomment the lines of the dataset you want to use.

I selected datasets with news. Two of them contain just a brief description of each news item, while the third contains the full text.

As we are working in a free and limited environment, I limited the number of news items with the variable MAX_NEWS. Feel free to pull more if you have memory available.

The name of the field containing the text of the news item is stored in the variable *DOCUMENT*, and the name of the metadata field in *TOPIC*.
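Instead of commenting and uncommenting lines, the per-dataset settings could be kept in one lookup table. This is only a sketch of one way to organize it: the `DATASETS` dictionary, its keys, and the `select_dataset` helper are hypothetical names, while the paths, limits, and column names are the ones listed in this notebook.

```python
# Hypothetical lookup table: one entry per dataset suggested in this notebook.
DATASETS = {
    "newscatcher": {
        "path": "/kaggle/input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv",
        "sep": ";",
        "max_news": 1000,
        "document": "title",
        "topic": "topic",
    },
    "bbc": {
        "path": "/kaggle/input/bbc-news/bbc_news.csv",
        "sep": ",",
        "max_news": 1000,
        "document": "description",
        "topic": "title",
    },
    "mit_ai": {
        "path": "/kaggle/input/mit-ai-news-published-till-2023/articles.csv",
        "sep": ",",
        "max_news": 100,
        "document": "Article Body",
        "topic": "Article Header",
    },
}


def select_dataset(name):
    """Return (path, sep, MAX_NEWS, DOCUMENT, TOPIC) for the chosen dataset."""
    if name not in DATASETS:
        raise KeyError(f"Unknown dataset {name!r}; choose one of {sorted(DATASETS)}")
    cfg = DATASETS[name]
    return cfg["path"], cfg["sep"], cfg["max_news"], cfg["document"], cfg["topic"]


path, sep, MAX_NEWS, DOCUMENT, TOPIC = select_dataset("mit_ai")
print(MAX_NEWS, DOCUMENT, TOPIC)
```

The returned values can then be passed on, for example as `pd.read_csv(path, sep=sep)`.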

In [5]:
# news = pd.read_csv('/kaggle/input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv', sep=';')
# MAX_NEWS = 1000
# DOCUMENT="title"
# TOPIC="topic"

#news = pd.read_csv('/kaggle/input/bbc-news/bbc_news.csv')
#MAX_NEWS = 1000
#DOCUMENT="description"
#TOPIC="title"

#news = pd.read_csv('/kaggle/input/mit-ai-news-published-till-2023/articles.csv')
#MAX_NEWS = 100
#DOCUMENT="Article Body"
#TOPIC="Article Header"

news = "articles.csv" #Ideally pick one from the commented ones above

ChromaDB requires every record to have a unique identifier. We can create one with the statement below, which adds a new column called **id**.


In [11]:
import pandas as pd

# Make sure the file is read correctly
news = pd.read_csv("articles.csv")

# Add a new column 'id' based on the index
news["id"] = news.index

# Display the first 5 rows of the DataFrame
print(news.head())


   Unnamed: 0 Published Date                Author  \
0           0   July 7, 2023             Adam Zewe   
1           1   July 6, 2023           Alex Ouyang   
2           2  June 30, 2023  Jennifer Michalowski   
3           3  June 30, 2023   Mary Beth Gallagher   
4           4  June 30, 2023             Adam Zewe   

                                              Source  \
0                                    MIT News Office   
1  Abdul Latif Jameel Clinic for Machine Learning...   
2              McGovern Institute for Brain Research   
3                              School of Engineering   
4                                    MIT News Office   

                                      Article Header  \
0  Learning the language of molecules to predict ...   
1  MIT scientists build a system that can generat...   
2  When computer vision works more like a brain, ...   
3  Educating national security leaders on artific...   
4  Researchers teach an AI to write better chart ...   

 

In [13]:
# Because this is just a course exercise, we use only a small portion of the news.
MAX_NEWS = 500  # You can change this number to fit your needs
subset_news = news.head(MAX_NEWS)


## Import and configure the Vector Database
I'm going to use ChromaDB, a popular open-source embedding database.

First we need to import ChromaDB, and after that the **Settings** class from the **chromadb.config** module. This class allows us to change the settings of the ChromaDB system and customize its behavior.

In [7]:
!pip install chromadb

Collecting chromadb
  Downloading chromadb-1.0.6-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi==0.115.9 (from chromadb)
  Downloading fastapi-0.115.9-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.2-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.25.0-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.21.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentele

In [14]:
import chromadb
from chromadb.config import Settings

In older ChromaDB releases the client was configured through a settings object, created by calling the **Settings** class imported previously and stored in the variable **settings_chroma**.

Two parameters had to be provided:
* chroma_db_impl: the database implementation and the format used to store the data. I chose ***duckdb*** because of its high performance; it operates primarily in memory and is fully compatible with SQL. The ***parquet*** storage format works well for tabular data, with good compression rates and performance.

* persist_directory: the directory where the data will be stored. It is possible to work without a directory, in which case the data is kept in memory without persistence, but Kaggle doesn't support that.

ChromaDB 0.4 and later no longer use the duckdb+parquet backend. Persistence is configured by passing the directory to **chromadb.PersistentClient**, as in the next cell.

In [None]:
chroma_client = chromadb.PersistentClient(path="/path/to/persist/directory")

## Filling and Querying the ChromaDB Database
The data in ChromaDB is stored in collections. If the collection already exists, we need to delete it before creating it again.

In the next lines, we create the collection by calling the ***create_collection*** function of the ***chroma_client***.

In [16]:
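from chromadb import Client

# In-memory client; this replaces the persistent client created above.
chroma_client = Client()

collection_name = "news_collection"

# Delete the collection if it already exists. Depending on the ChromaDB
# version, list_collections() returns Collection objects or plain names.
existing = [getattr(c, "name", c) for c in chroma_client.list_collections()]
if collection_name in existing:
    chroma_client.delete_collection(name=collection_name)

collection = chroma_client.create_collection(name=collection_name)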


It's time to add the data to the collection. Using the ***add*** function, we pass ***documents***, ***metadatas*** and ***ids***.
* In **documents** we store the long text; it's a different column in each dataset.
* In **metadatas** we can pass a list of topics, one dictionary per document.
* In **ids** we need to pass a unique identifier for each row. It MUST be unique! I'm creating the IDs from a running row number.
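The three arguments are parallel lists: entry *i* of each list describes the same news item. As a minimal sketch, assuming the rows are available as plain dictionaries, they could be built like this. The helper `build_chroma_payload` and the toy rows are hypothetical and not part of the original notebook.

```python
def build_chroma_payload(rows, document_field, topic_field):
    """Turn a list of row dicts into the three parallel lists that add() expects."""
    documents, metadatas, ids = [], [], []
    for position, row in enumerate(rows):
        documents.append(str(row[document_field]))
        metadatas.append({"topic": str(row[topic_field])})
        ids.append(f"id{position}")
    # Every document needs exactly one metadata dict and one unique ID.
    assert len(documents) == len(metadatas) == len(ids)
    assert len(set(ids)) == len(ids)
    return documents, metadatas, ids


# Toy rows standing in for the news DataFrame (e.g. news.to_dict("records")).
rows = [
    {"Article Header": "first toy headline", "Article Body": "first toy article text"},
    {"Article Header": "second toy headline", "Article Body": "second toy article text"},
]
documents, metadatas, ids = build_chroma_payload(rows, "Article Body", "Article Header")
print(ids)  # prints ['id0', 'id1']
```

The result can then be passed on as `collection.add(documents=documents, metadatas=metadatas, ids=ids)`.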


In [23]:
print(subset_news.columns)


Index(['Unnamed: 0', 'Published Date', 'Author', 'Source', 'Article Header',
       'Sub_Headings', 'Article Body', 'Url', 'id'],
      dtype='object')


In [25]:
collection.add(
    documents=subset_news["Article Body"].tolist(),
    metadatas=[{"source": "course"} for _ in range(len(subset_news))],
    # One ID per row actually present, so all three lists have the same length
    ids=[f"id{x}" for x in range(len(subset_news))],
)


/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 76.2MiB/s]


In [26]:
results = collection.query(query_texts=["laptop"], n_results=10 )

print(results)

{'ids': [['id237', 'id109', 'id9', 'id100', 'id334', 'id218', 'id231', 'id144', 'id400', 'id103']], 'embeddings': None, 'documents': [["['When humans look at a scene, they see objects and the relationships between them. On top of your desk, there might be a laptop that is sitting to the left of a phone, which is in front of a computer monitor.', '', 'Many deep learning models struggle to see the world this way because they don’t understand the entangled relationships between individual objects. Without knowledge of these relationships, a robot designed to help someone in a kitchen would have difficulty following a command like “pick up the spatula that is to the left of the stove and place it on top of the cutting board.”', '', 'In an effort to solve this problem, MIT researchers have developed a model that understands the underlying relationships between objects in a scene. Their model represents individual relationships one at a time, then combines these representations to describe t

## Vector MAP

In [27]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [28]:

getado = collection.get(ids="id141",
                       include=["documents", "embeddings"])


In [29]:
word_vectors = getado["embeddings"]
word_list = getado["documents"]
word_vectors

array([[-3.73729430e-02, -1.82920452e-02,  1.48837995e-02,
        -6.51341975e-02,  2.60852501e-02, -3.37915793e-02,
        -8.85920227e-02, -2.32951622e-02,  3.66715305e-02,
        -5.46598248e-02, -4.64128628e-02, -8.24273191e-03,
         1.19425273e-02,  4.93168868e-02,  3.10725272e-02,
        -3.82354259e-02,  2.31101066e-02,  1.22701988e-01,
        -6.19672947e-02,  5.81677221e-02,  1.47276628e-03,
        -7.65489414e-02,  3.21513750e-02, -8.23823139e-02,
         6.63918778e-02,  4.48568985e-02,  4.21497077e-02,
        -8.53938982e-02, -6.01443984e-02, -6.61355108e-02,
        -2.66331788e-02,  6.44640326e-02, -4.93826345e-02,
         4.94234189e-02, -1.08341053e-02,  5.14731668e-02,
        -3.51315998e-02,  6.42413571e-02, -9.27531570e-02,
        -3.59810181e-02, -8.51940811e-02,  6.69028834e-02,
        -2.12268773e-02, -9.54463799e-03,  7.42490366e-02,
         7.68522471e-02,  8.58911593e-03, -2.83467062e-02,
         3.36568616e-02,  1.72231649e-03, -8.33549798e-0

Once we have our information inside the database we can query it and ask for data that matches our needs. The search runs over the content of the documents and doesn't look for an exact word or phrase; the results are based on the similarity between the search terms and the content of the documents.
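To make "similarity" concrete, here is a toy sketch that ranks texts by the cosine similarity of their vectors. The 3-dimensional vectors are made up for illustration, and the metric ChromaDB actually applies depends on how the collection is configured, so treat this as the general idea and not as ChromaDB's exact computation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_embeddings = {
    "notebook computer review": [0.9, 0.1, 0.0],
    "new laptop released": [0.8, 0.2, 0.1],
    "football match report": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # pretend this is the embedding of "laptop"

# Sort the texts from most to least similar to the query vector.
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
print(ranked)
```

Note that the top hit does not contain the word "laptop" at all: its vector simply points in nearly the same direction as the query.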

The metadata is not used in the search, but it can be used to filter or refine the results after the initial search.
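A minimal sketch of such a post-search filter, assuming the result dictionary has the nested-list shape printed earlier and also includes `metadatas`. The helper `filter_results_by_metadata` and the hand-written results are hypothetical.

```python
def filter_results_by_metadata(results, key, value):
    """Keep only the hits of the first query whose metadata[key] equals value."""
    ids = results["ids"][0]
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    return [
        (doc_id, document)
        for doc_id, document, metadata in zip(ids, documents, metadatas)
        if metadata.get(key) == value
    ]


# Hand-written stand-in for the dictionary returned by collection.query().
fake_results = {
    "ids": [["id1", "id2", "id3"]],
    "documents": [["text about robots", "text about chips", "text about vision"]],
    "metadatas": [[{"topic": "robotics"}, {"topic": "hardware"}, {"topic": "robotics"}]],
}
print(filter_results_by_metadata(fake_results, "topic", "robotics"))
```

ChromaDB's `query` also accepts a `where` argument, which applies a metadata filter inside the database instead of afterwards.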


## Loading the model and creating the prompt
TRANSFORMERS!!
Time to use the **transformers** library, the best-known library from [Hugging Face](https://huggingface.co/) for working with language models.

We are importing:
* **AutoTokenizer**: a utility class for tokenizing text inputs in a way that is compatible with various pre-trained language models.
* **AutoModelForCausalLM**: an interface to pre-trained language models designed for text generation using causal language modeling (e.g., GPT models), such as the model used in this notebook, ***databricks/dolly-v2-3b***.
* **pipeline**: a simple interface for performing various natural language processing (NLP) tasks, such as text generation (our case) or text classification.

The model selected is [dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b), the smallest Dolly model. It has 3 billion parameters, more than enough for our sample, and works much better than GPT-2.

Please feel free to test [different models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending); you need to search for NLP models trained for text generation. My recommendation is to choose "small" models, or we will run out of memory in Kaggle.
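When trying several models, it helps to send the same prompt to each one and record the results side by side. The harness below is a rough sketch of that idea: `compare_generators` is a hypothetical helper, and the two stub generators only stand in for real models so the example runs without downloading anything.

```python
import time


def compare_generators(generators, prompt):
    """Run the same prompt through several text generators and collect basic stats."""
    report = []
    for name, generate in generators.items():
        start = time.perf_counter()
        text = generate(prompt)
        elapsed = time.perf_counter() - start
        report.append({
            "model": name,
            "seconds": round(elapsed, 3),
            # Characters produced beyond the prompt itself
            "new_chars": max(len(text) - len(prompt), 0),
            "text": text,
        })
    return report


# Stand-in generators so the sketch runs without downloading any model.
stub_generators = {
    "stub-short": lambda prompt: prompt + " Yes.",
    "stub-long": lambda prompt: prompt + " It depends on what the context says.",
}

for row in compare_generators(stub_generators, "Can I buy a Toshiba laptop?"):
    print(row["model"], row["seconds"], row["new_chars"])
```

With real models, each entry could wrap one text-generation pipeline, for example `lambda p: pipe(p)[0]["generated_text"]`, with a separate `pipe` created per model id.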


In [30]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/450 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/228 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/819 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/5.68G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/5.68G [00:00<?, ?B/s]

The next step is to initialize the pipeline using the objects created above.

The model's response is limited to 256 tokens. For this project I'm not interested in a longer response, but the limit can easily be extended to whatever length you want.

Setting ***device_map*** to ***auto*** instructs the pipeline to automatically select the most appropriate device, CPU or GPU, for text generation.

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "gpt2"

lm_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu


## Creating the extended prompt
To create the prompt we use the result of querying the vector database and the sentence entered by the user.

The prompt has two parts: the **relevant context**, which is the information retrieved from the database, and the **user's question**.

We only need to join the two parts together to create the prompt that we are going to send to the model.

You can limit the length of the context passed to the model, because one of the datasets contains really long texts in the document field and can cause memory problems.
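One simple way to cap the context is to cut it at a fixed number of characters before building the prompt. The sketch below assumes the same prompt layout as this notebook; the `build_prompt` helper and its `max_context_chars` parameter are hypothetical, and a character limit is only a rough proxy for the model's token limit.

```python
def build_prompt(documents, question, max_context_chars=1000):
    """Join retrieved documents, cap the context length, and append the question."""
    context = " ".join(f"#{doc}" for doc in documents)
    if len(context) > max_context_chars:
        # Keep only the beginning of the context (assumed budget, in characters)
        context = context[:max_context_chars]
    return f"Relevant context: {context}\n\nThe user's question: {question}"


# Artificially long stand-ins for the documents returned by the query.
retrieved = ["first retrieved article " * 50, "second retrieved article " * 50]
prompt = build_prompt(retrieved, "Can I buy a Toshiba laptop?", max_context_chars=200)
print(len(prompt))
```

In the notebook, `documents` would be `results["documents"][0]` from the similarity search.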

In [7]:
# Import Chroma and Transformers libraries
from chromadb import Client
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

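# Initialize Chroma client and create/get the collection.
# Note: Client() is in-memory, so in a fresh session this collection is empty
# until the documents are added again; an empty collection yields an empty context.
chroma_client = Client()
collection_name = "news_collection"
collection = chroma_client.get_or_create_collection(name=collection_name)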

# Example question from user
question = "Can I buy a Toshiba laptop?"

# Run a similarity search using the question
results = collection.query(
    query_texts=[question],
    n_results=5
)

# Build the context from the retrieved documents
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])

# Create the prompt for generation
prompt_template = f"Relevant context: {context}\n\nThe user's question: {question}"

# Load language model and tokenizer
model_name = "gpt2"  # You can change this to any supported model
lm_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto"
)

# Generate the response using the prompt
response = pipe(prompt_template)[0]["generated_text"]

# Print the result
print(response)


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Relevant context: 

The user's question: Can I buy a Toshiba laptop? I know it doesn't need a case anymore but does it also fit into my current laptop? There's an opportunity to see what type of laptop could be sold.

From: Muhlenberg [mailto:"Muhlenberg Sent: Saturday, March 24, 2013 4:11 AM] Subject: Toshiba laptop: Can I buy a Toshiba laptop? [Mailto:Muhlenberg@goop.net (Muhlenberg) ] Date: Fri, 23 Feb 2013 11:35:41 +0800 (UTC)

You can't buy a Chromebook. You can buy them. You can buy something else. It makes no sense to give a full-blown computer such as a laptop to a company that doesn't do any of the things that the business people do to their PCs. I would say that it's not appropriate to give a laptop to Lenovo for free and they are being a little out of touch about how it works.

The question: Can I buy a Dell laptop?

From: Jeff Kowalski [mailto:"JDKowalski@goop.net (Kowalski) : I have a Dell C64 and I just purchased one for $20.


Now all that remains is to send the prompt to the model and wait for its response!


In [8]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Relevant context: 

The user's question: Can I buy a Toshiba laptop?

Note, "can I buy a Toshiba laptop?" It's not the same question for many people, so I won't bother to answer as everyone knows what the term actually means. Nonetheless, it's a really important question and one that goes to the heart of the matter for most users.

So let's look at the context and see what happens when buying a Toshiba laptop. First let's begin with how they're priced.

The value of every Toshiba laptop

A good Toshiba laptop is rated at Rs 100,000. That's the highest price you can pay for your PC. The difference from one to two pennies would put a smartphone up to Rs 25,000. So it's not worth it for us to buy 2, 3 or 4 laptops for a one time price (even though we might go over Rs 500 for 2 laptops but our prices are so high for so very small that even they didn't consider a laptop, they would actually choose to have 2 laptops for their annual home costs).

As for the rest that goes on the Toshiba lapt