# Lab | LangChain Med

## Objectives

- Continue with the example from lesson 2, using different datasets to test what we did in class. Some datasets are suggested in the notebook, but feel free to look for other datasets on Hugging Face or Kaggle.
- Find another model on Hugging Face and compare it.
- Modify the prompt to fit your selected dataset.

In [1]:
import numpy as np 
import pandas as pd

## Load the Dataset
As you can see, the notebook is ready to work with three different datasets. Just uncomment the lines of the dataset you want to use.

I selected news datasets. Two of them contain only a brief description of each news item, while the third contains the full text.

As we are working in a free and limited environment, I limited the number of news items with the variable MAX_NEWS. Feel free to load more if you have memory available.

The name of the field containing the text of each news item is stored in the variable *DOCUMENT*, and the name of the metadata field in *TOPIC*.
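
One way to avoid commenting and uncommenting lines is to keep the three configurations in a dictionary and pick one by key. The sketch below is only an assumption about how this could be organised; the keys `newscatcher`, `bbc`, and `mit_ai` and the helper `get_dataset_config` are hypothetical names, while the paths, limits, and column names are copied from the loading cell of this notebook.

```python
# Hypothetical registry of the three datasets used in this notebook.
# Paths, limits, and column names mirror the loading cell; the keys are made up.
DATASETS = {
    "newscatcher": {
        "path": "labelled_newscatcher_dataset/labelled_newscatcher_dataset.csv",
        "sep": ";",
        "max_news": 5000,
        "document": "title",
        "topic": "topic",
    },
    "bbc": {
        "path": "/kaggle/input/bbc-news/bbc_news.csv",
        "sep": ",",
        "max_news": 1000,
        "document": "description",
        "topic": "title",
    },
    "mit_ai": {
        "path": "/kaggle/input/mit-ai-news-published-till-2023/articles.csv",
        "sep": ",",
        "max_news": 100,
        "document": "Article Body",
        "topic": "Article Header",
    },
}


def get_dataset_config(name):
    """Return the configuration for one dataset, failing loudly on a typo."""
    try:
        return DATASETS[name]
    except KeyError:
        raise ValueError(
            f"Unknown dataset {name!r}; choose from {sorted(DATASETS)}"
        ) from None


cfg = get_dataset_config("newscatcher")
MAX_NEWS, DOCUMENT, TOPIC = cfg["max_news"], cfg["document"], cfg["topic"]
# news = pd.read_csv(cfg["path"], sep=cfg["sep"])  # same call as in the loading cell
```

Switching datasets then means changing a single string instead of editing several lines.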

In [2]:
news = pd.read_csv('labelled_newscatcher_dataset/labelled_newscatcher_dataset.csv', sep=';')
MAX_NEWS = 5000
DOCUMENT="title"
TOPIC="topic"

#news = pd.read_csv('/kaggle/input/bbc-news/bbc_news.csv')
#MAX_NEWS = 1000
#DOCUMENT="description"
#TOPIC="title"

#news = pd.read_csv('/kaggle/input/mit-ai-news-published-till-2023/articles.csv')
#MAX_NEWS = 100
#DOCUMENT="Article Body"
#TOPIC="Article Header"

#news = "PICK A DATASET" #Ideally pick one from the commented ones above

ChromaDB requires each record to have a unique identifier. The statement below creates a new column called **id** from the DataFrame index.


In [3]:
news["id"] = news.index
news.head()

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


In [4]:
#Because it is just a course we select a small portion of News.
subset_news = news.head(MAX_NEWS)

In [5]:
subset_news.head()

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


## Import and configure the Vector Database
I'm going to use ChromaDB, a popular open-source embedding database.

First we need to import ChromaDB, and after that import the **Settings** class from the **chromadb.config** module. This class allows us to change the settings of the ChromaDB system and customize its behavior.

In [6]:
import chromadb
from chromadb.config import Settings

Now we need to create the settings object by calling the **Settings** class imported previously. We store the object in the variable **settings_chroma**.

Two parameters need to be set:
* chroma_db_impl: the database implementation and the storage format. I chose ***duckdb*** for its high performance; it operates primarily in memory and is fully compatible with SQL. The ***parquet*** storage format suits tabular data, with good compression rates and performance.

* persist_directory: the directory where the data will be stored. It is possible to work without a directory, in which case the data is kept in memory without persistence, but Kaggle doesn't support that.

Note that these two parameters belong to older ChromaDB releases. The cells below run version 0.6.3 and create the client with **chromadb.PersistentClient**, passing the storage directory as ***path***.

In [7]:
import chromadb
print(chromadb.__version__)

0.6.3


In [8]:
chroma_client = chromadb.PersistentClient(path="Model")

⚠️ It looks like you upgraded from a version below 0.5.6 and could benefit from vacuuming your database. Run chromadb utils vacuum --help for more information.


## Filling and Querying the ChromaDB Database
Data in ChromaDB is stored in collections. If the collection already exists, we need to delete it before creating it again.

In the next lines, we create the collection by calling the ***create_collection*** function of the ***chroma_client*** created above.

In [9]:
# Handle collection creation/deletion
collection_name = "news_collection"

# Get list of collection names
existing_collections = chroma_client.list_collections()

# Delete if exists
if collection_name in existing_collections:
    chroma_client.delete_collection(name=collection_name)

# Create new collection
collection = chroma_client.create_collection(name=collection_name)

It's time to add the data to the collection. When calling the ***add*** function we provide ***documents***, ***metadatas*** and ***ids***.
* In **documents** we store the full text; it's a different column in each dataset.
* In **metadatas** we can provide a list of topics.
* In **ids** we need to provide a unique identifier for each row. It MUST be unique! I'm creating the IDs using the range of MAX_NEWS.
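
Since the ids must be unique and there must be exactly one per document, it can help to derive them from the number of documents and check both properties before calling ***add***. The following is a minimal sketch on toy data; the helper `make_ids` is a hypothetical name, and the two short lists merely stand in for `subset_news[DOCUMENT]` and `subset_news[TOPIC]`.

```python
def make_ids(documents, prefix="id"):
    """Build one string id per document, derived from the document count."""
    ids = [f"{prefix}{i}" for i in range(len(documents))]
    # Sanity checks: one id per document, and no repeated ids.
    assert len(ids) == len(documents), "one id per document"
    assert len(set(ids)) == len(ids), "ids must be unique"
    return ids


# Toy stand-ins for subset_news[DOCUMENT] and subset_news[TOPIC].
documents = ["Glaciers could have sculpted Mars valleys", "Perseid meteor shower 2020"]
topics = ["SCIENCE", "SCIENCE"]

ids = make_ids(documents)
metadatas = [{"topic": t} for t in topics]
print(ids)  # ['id0', 'id1']
```

Deriving the ids from `len(documents)` rather than from a fixed constant keeps the three lists the same length even when the dataset has fewer rows than expected.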


In [20]:
try:
    collection.add(
        documents=subset_news[DOCUMENT].tolist(),
        metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()],
        ids=[f"id{x}" for x in range(MAX_NEWS)],
    )
except ValueError as ve:
    print(f"Error with data validation: {ve}")
except Exception as e:
    print(f"An error occurred while adding data to collection: {e}")

Insert of existing embedding ID: id0
Insert of existing embedding ID: id1
Insert of existing embedding ID: id2
Insert of existing embedding ID: id3
Insert of existing embedding ID: id4
Insert of existing embedding ID: id5
Insert of existing embedding ID: id6
Insert of existing embedding ID: id7
Insert of existing embedding ID: id8
Insert of existing embedding ID: id9
Insert of existing embedding ID: id10
Insert of existing embedding ID: id11
Insert of existing embedding ID: id12
Insert of existing embedding ID: id13
Insert of existing embedding ID: id14
Insert of existing embedding ID: id15
Insert of existing embedding ID: id16
Insert of existing embedding ID: id17
Insert of existing embedding ID: id18
Insert of existing embedding ID: id19
Insert of existing embedding ID: id20
Insert of existing embedding ID: id21
Insert of existing embedding ID: id22
Insert of existing embedding ID: id23
Insert of existing embedding ID: id24
Insert of existing embedding ID: id25
Insert of existing emb

In [21]:
results = collection.query(query_texts=["AI"], n_results=10 )

print(results)



## Vector Map

In [22]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [23]:

getado = collection.get(ids="id198", 
                       include=["documents", "embeddings"])


In [24]:
word_vectors = getado["embeddings"]
word_list = getado["documents"]
word_vectors

array([[-5.59539311e-02, -5.86840287e-02,  2.46963911e-02,
         1.86365470e-03,  3.25582474e-02, -7.10237632e-03,
         3.60154882e-02,  1.71795557e-03,  7.60566269e-04,
         1.59953237e-02, -1.75690409e-02,  1.57342628e-02,
        -2.47866493e-02, -3.29301110e-04, -1.07590919e-02,
         3.11088543e-02, -6.93508089e-02,  3.84741696e-03,
        -1.98889561e-02,  3.72963870e-04, -1.50958560e-02,
         2.60766819e-02, -2.12972034e-02, -6.71948195e-02,
         1.59701910e-02,  1.61412895e-01,  7.42058409e-03,
        -1.59859911e-01,  2.07737368e-03, -2.12099478e-02,
         1.22177474e-01,  1.17519721e-02,  1.57717038e-02,
        -4.55725975e-02, -4.45696786e-02,  5.15297055e-02,
        -2.44198740e-02, -1.64127983e-02,  7.80728832e-02,
        -1.06360301e-01,  1.06747961e-02, -8.30580071e-02,
        -1.57140438e-02,  9.78870038e-03,  4.67230566e-02,
         9.87039432e-02, -1.08455401e-02, -7.33674839e-02,
         2.47516632e-02, -7.39795947e-03, -1.14180140e-0
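
The imports in this section suggest projecting the embeddings to two dimensions so they can be plotted. As a rough sketch of that idea, the block below computes a PCA projection with plain NumPy (an SVD of the centered data, used here in place of scikit-learn's `PCA`). The function `project_to_2d` is a hypothetical helper, and the random vectors with an assumed 384 dimensions only stand in for real embeddings retrieved from the collection.

```python
import numpy as np


def project_to_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# Random stand-ins for a handful of embeddings (assumed 384 dimensions).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 384))

points = project_to_2d(fake_embeddings)
print(points.shape)  # (10, 2)
```

The two columns of `points` can then be passed to `plt.scatter` as x and y coordinates, with the document titles used as labels.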

Once our information is inside the database we can query it and ask for data that matches our needs. The search runs over the content of the documents, and it doesn't look for the exact word or phrase. The results are based on the similarity between the search terms and the content of the documents.
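
To make "similarity" concrete, here is a small sketch that ranks documents by cosine similarity between vectors. The three-dimensional vectors are made up for illustration, and cosine similarity is used only as a familiar example; the distance metric ChromaDB actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
docs = {
    "AI will know your mood": [0.9, 0.1, 0.0],
    "Glaciers on Mars": [0.0, 0.2, 0.9],
    "Machine learning warning": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # pretend this is the embedding of "AI"

ranked = sorted(
    docs,
    key=lambda title: cosine_similarity(docs[title], query_vector),
    reverse=True,
)
print(ranked[0])  # AI will know your mood
```

Note that the second-ranked title never mentions "AI": it ranks high because its vector points in a similar direction, which is the behavior described above.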

The metadata is not used in the search, but it can be used to filter or refine the results after the initial search.
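
A minimal sketch of such post-filtering is shown below. It assumes the result dictionary has the same shape as the one returned by `collection.query`, with one inner list per query text; `filter_by_metadata` is a hypothetical helper and `fake_results` is hand-written stand-in data, not real output.

```python
def filter_by_metadata(results, key, value):
    """Keep only the hits of the first query whose metadata[key] equals value."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    return [doc for doc, meta in zip(docs, metas) if meta.get(key) == value]


# Hand-written stand-in shaped like a query result (one inner list per query text).
fake_results = {
    "documents": [["AI will know your mood", "Stocks rally on AI hype", "Robots learn to walk"]],
    "metadatas": [[{"topic": "SCIENCE"}, {"topic": "BUSINESS"}, {"topic": "SCIENCE"}]],
}

science_only = filter_by_metadata(fake_results, "topic", "SCIENCE")
print(science_only)  # ['AI will know your mood', 'Robots learn to walk']
```

ChromaDB's `query` also accepts a `where` argument for metadata conditions, which may be preferable when the filter is known before the search runs.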


## Loading the model and creating the prompt
TRANSFORMERS!!
Time to use the **transformers** library, the best-known library from [Hugging Face](https://huggingface.co/) for working with language models.

We are importing: 
* **AutoTokenizer**: a utility class that tokenizes text inputs in the format expected by various pre-trained language models.
* **AutoModelForCausalLM**: provides an interface to pre-trained language models designed for text generation using causal language modeling, such as GPT models or ***databricks/dolly-v2-3b***.
* **pipeline**: provides a simple interface for performing various natural language processing (NLP) tasks, such as text generation (our case) or text classification. 

The model described here is [dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b), the smallest Dolly model. It has 3 billion parameters, more than enough for our sample, and works much better than GPT-2. The cell below loads ***gpt2*** by default and leaves dolly-v2-3b commented out, so change `model_id` if you have enough memory.

Please feel free to test [different models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending); you need to search for NLP models trained for text generation. My recommendation is to choose "small" models, or we will run out of memory in Kaggle.


In [25]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "gpt2"  # alternative: "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)



The next step is to initialize the pipeline using the objects created above. 

The model's response is limited to 256 new tokens. For this project I'm not interested in a longer response, but the limit can easily be raised.

By setting ***device_map*** to ***auto*** we instruct the pipeline to automatically select the most appropriate device, CPU or GPU, for text generation.

In [26]:
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto", # Changed to "auto" to let it detect available devices
)

Device set to use cpu


## Creating the extended prompt
To create the prompt we use the result of querying the vector database and the sentence entered by the user.

The prompt has two parts: the **relevant context**, which is the information retrieved from the database, and the **user's question**.

We only need to join the two parts together to create the prompt that we are going to send to the model.

You can limit the length of the context passed to the model, because we can run into memory problems with one of the datasets, which contains a really long text in the document field.
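
An alternative to cutting the joined context at a fixed character position is to drop whole documents once a budget is exceeded, so no document is cut in half. The sketch below illustrates that idea under stated assumptions: `build_prompt` and its `max_chars` parameter are hypothetical names, the budget is counted in characters rather than tokens, and the two short strings stand in for `results["documents"][0]`.

```python
def build_prompt(documents, question, max_chars=5120):
    """Join retrieved documents and the question, dropping documents that overflow the budget."""
    kept, used = [], 0
    for doc in documents:
        piece = f"#{doc}"
        # +1 accounts for the space that joins consecutive pieces.
        cost = len(piece) + (1 if kept else 0)
        if used + cost > max_chars:
            break  # stop at the first document that does not fit
        kept.append(piece)
        used += cost
    context = " ".join(kept)
    return f"Relevant context: {context}\n\n The user's question: {question}"


# Toy documents standing in for results["documents"][0].
docs = ["AI will know your mood", "Glaciers could have sculpted Mars valleys"]
print(build_prompt(docs, "Can I trust AI?", max_chars=30))
```

With a budget of 30 characters only the first document fits, so the second one is left out of the context entirely.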

In [27]:
question = "Can i trust AI?"
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
#context = context[0:5120]
prompt_template = f"Relevant context: {context}\n\n The user's question: {question}"
prompt_template



Now all that remains is to send the prompt to the model and wait for its response!


In [28]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



 The user's question: Can i trust AI? You can, with artificial intelligence to 'accidentally' 'accidentally' lose your money

The Australian study, published on Friday in Current Biology, found that research on the potential impacts of human-driven automation, including machine learning and artificial intelligence, are "almost completely out of the question".

The paper, entitled 'Human-generated Autonomous Robots on Earth Could Be More Incompatible Than the Human-Driven Automated Human-Made Robots' cites several studies from earlier this month showing that humans also are more likely to perform work than machines.

However, the researchers argue that as automation advances and people have access to technology, the skills required to be human will become redundant.

In the case of machine vision and deeplearning, researchers believe humans can actually overcome these obstacles.

The study finds that artificial intelligence will not always be capable of predicting or controlling the ou

In [29]:
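# Set maximum context length (in characters) to avoid memory issues
MAX_CONTEXT_LENGTH = 5120  # Adjust this based on the model's context window

question = "Tell me about sustainable energy?"
# Query the collection again so the context matches the new question
results = collection.query(query_texts=[question], n_results=10)
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])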

# Truncate context if it exceeds maximum length while preserving complete sentences
if len(context) > MAX_CONTEXT_LENGTH:
    # Find last period before max length to avoid cutting mid-sentence
    last_period = context[:MAX_CONTEXT_LENGTH].rfind('.')
    if last_period != -1:
        context = context[:last_period + 1]
    else:
        # If no period found, do hard truncation
        context = context[:MAX_CONTEXT_LENGTH]
    print(f"Context was truncated to {len(context)} characters")

prompt_template = f"Relevant context: {context}\n\n The user's question: {question}"

try:
    lm_response = pipe(prompt_template)
    print(lm_response[0]["generated_text"])
except Exception as e:
    print(f"Error generating response: {str(e)}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



 The user's question: Tell me about sustainable energy? The answers: "This is just the beginning. It's about as close to zero emissions as we possibly get." Here is "Why the heck do we have emissions?" I had the answer: The only way our current fossil fuel technology might reduce greenhouse gases is to reduce our reliance on carbon monoxide. The latest research shows that the same can be done in the next 10 years. It has nothing to do with carbon dioxide's emissions that cause global temperature to hit about 2C. Our current energy system is an amalgam of all sorts of things — gas, electricity and coal. So today fossil fuels must pay for themselves in some form, but not the way they do in other sectors. They provide some of the most advanced, efficient ways to reduce carbon dioxide, but are required in part due to a number of factors, which don't seem to be at issue here. For instance, if fossil fuels become less polluting, less energy is lost to our economy. That's because carbon emis