<a href="https://colab.research.google.com/github/mrhamedani/How-to-apply-popular-LLMs/blob/main/%E2%80%8C%E2%80%8C4.%20BBCNews_Chromadb_Liama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers --upgrade
!pip install sentence-transformers --upgrade
!pip install chromadb --upgrade

In [None]:
import numpy as np
import pandas as pd
from google.colab import files
import transformers
import sentence_transformers
import chromadb
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

## Download and prepare the Dataset
Log in to Kaggle and open https://www.kaggle.com/datasets/gpreda/bbc-news to download the dataset.

Using the Kaggle API JSON file (`kaggle.json`), I saved the dataset directly to Google Colab's temporary storage.

In [None]:
print("Transformers version:", transformers.__version__)
print("Sentence-Transformers version:", sentence_transformers.__version__)
print("ChromaDB version:", chromadb.__version__)

Transformers version: 4.48.1
Sentence-Transformers version: 3.4.0
ChromaDB version: 0.6.3


In [None]:
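# Upload kaggle.json (the Kaggle API token) from your computer.
files.upload()
# The Kaggle CLI looks for the token in ~/.kaggle, so copy it there first.
!mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d gpreda/bbc-news
!unzip bbc-news.zip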

ChromaDB requires every record to have a unique identifier. The statement below provides one by creating a new column called **id**, filled from the DataFrame index.

In [None]:
news = pd.read_csv('./bbc_news.csv')
MAX_NEWS = 1000
DOCUMENT="description"
TOPIC="title"
news["id"] = news.index
news.head(3)

Unnamed: 0,title,pubDate,guid,link,description,id
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...,0
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as...",1
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...,2


In [None]:
# Because this is just an example, we select a small portion of the news.
subset_news = news.head(MAX_NEWS)
# Create a persistent (not in-memory) ChromaDB client that stores its data under ./chromadb.
chroma_client = chromadb.PersistentClient(path="./chromadb")


## Filling and Querying the ChromaDB Database
Data in ChromaDB is stored in collections. If a collection with the same name already exists, it must be deleted before it can be created again.

In the next lines, the collection is created by calling the ***create_collection*** method of the ***chroma_client*** created above.

In [None]:
# Create a unique name for the collection by appending the current Unix timestamp.
# (strftime("%s") is not portable, so the timestamp is computed explicitly.)
collection_name = "news_collection" + str(int(datetime.now().timestamp()))
# Get the existing collections (in ChromaDB 0.6.x this is a list of names).
collection_names = chroma_client.list_collections()
# If the collection already exists, delete it.
if collection_name in collection_names:
    chroma_client.delete_collection(name=collection_name)
# Create the new collection.
collection = chroma_client.create_collection(name=collection_name)

It's time to add the data to the collection. When calling ***add*** you should provide at least ***documents***, ***metadatas*** and ***ids***.
* **documents** holds the full news text. Remember that it lives in a different column in each dataset.
* **metadatas** holds a list of dictionaries, here one topic per document.
* **ids** holds a unique identifier for each row. It MUST be unique! I'm creating the IDs from a running row number.
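
As a minimal sketch of how the three arguments line up, the snippet below builds them from plain Python dictionaries instead of the DataFrame. The rows and the helper name `build_add_arguments` are made up for illustration; only the column names `title` and `description` come from this notebook.

```python
# Toy rows standing in for the DataFrame (hypothetical data).
rows = [
    {"title": "Toy title A", "description": "First toy description."},
    {"title": "Toy title B", "description": "Second toy description."},
    {"title": "Toy title C", "description": "Third toy description."},
]


def build_add_arguments(rows, document_key="description", topic_key="title"):
    """Return the three parallel lists passed to collection.add."""
    documents = [row[document_key] for row in rows]
    metadatas = [{topic_key: row[topic_key]} for row in rows]
    # Derive the ids from the real number of rows so the lists always match.
    ids = [f"id{i}" for i in range(len(rows))]
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    return documents, metadatas, ids


documents, metadatas, ids = build_add_arguments(rows)
print(ids)
print(metadatas[0])
```

All three lists must have the same length, and position `i` in each list describes the same news item.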

In [None]:
collection.add(
    documents=subset_news[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()],
    # One ID per row actually present, so the three lists always have the same length.
    ids=[f"id{x}" for x in range(len(subset_news))],
)
# Each news description is added as a document, with its title as metadata and a unique ID.

In [None]:
results = collection.query(query_texts=["laptop"], n_results=10)
# For each query text in the list, ChromaDB returns the 10 closest documents.
print(results)

{'ids': [['id775', 'id707', 'id310', 'id587', 'id444', 'id751', 'id701', 'id862', 'id191', 'id740']], 'embeddings': None, 'documents': [['Photography student Thorsten Mjölnir captures the way students decorate their laptops.', 'Why sales of very basic mobile phones, without apps and internet connection, are increasing.', "What do you do when your collection of millions of books keeps growing but your bookshelves don't?", 'The developers of a powerful mini aircraft hope it will be used by the armed forces.', 'How tech is helping young families and couples regain their busy social lives after Covid.', 'Watch as Lee Zii Jia of Malaysia records a speed of 372km/h on his backhand point against Lakshya Sen of India in the All England Badminton Championships.', 'The Royal Mint has found a way to turn old circuit boards from phones, computers and TVs into gold.', 'A van was reportedly hijacked and driven to the venue, and a controlled explosion has since been carried out.', 'The Ukrainian pres
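
The dictionary above is nested: every field holds one inner list per query text. The sketch below shows one possible way to flatten it into one record per hit. It runs on a small made-up dictionary with the same shape, and it assumes that the `metadatas` and `distances` fields are present, which I believe is the default for `query`. The helper name `flatten_query_results` is hypothetical.

```python
# Toy stand-in with the same nesting as a query result (made-up values).
toy_results = {
    "ids": [["id0", "id1"]],
    "documents": [["first toy document", "second toy document"]],
    "metadatas": [[{"title": "toy title 1"}, {"title": "toy title 2"}]],
    "distances": [[0.9, 1.1]],
}


def flatten_query_results(results, query_index=0):
    """Pair up ids, documents, metadata and distances for one query text."""
    return [
        {"id": doc_id, "document": doc, "title": meta.get("title"), "distance": dist}
        for doc_id, doc, meta, dist in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
    ]


hits = flatten_query_results(toy_results)
for hit in hits:
    print(hit["id"], hit["distance"], hit["document"])
```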

In [None]:
getado = collection.get(ids="id141",include=["documents", "embeddings"])
print(getado)

## Loading the model and creating the prompt
TRANSFORMERS!!
Time to use the **transformers** library, the best-known library from [Hugging Face](https://huggingface.co/) for working with language models.

We are importing:
* **AutoTokenizer**: a utility class that loads the tokenizer matching a given pre-trained language model.
* **AutoModelForCausalLM**: provides an interface to pre-trained language models designed for text generation through causal language modeling (e.g., GPT models), such as the model used in this notebook, ***TinyLlama-1.1B-Chat-v1.0***.
* **pipeline**: provides a simple interface for performing various natural language processing (NLP) tasks, such as text generation (our case) or text classification.

The model I have selected is [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0), a capable model for its size. Although it counts as a Small Language Model, it still has 1.1 billion parameters.

Please feel free to test [different models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending); look for NLP models trained for text generation. My recommendation is to choose "small" models, or the notebook environment will run out of memory.

In [None]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# model_id = "databricks/dolly-v2-3b"  # A larger, more powerful alternative; look it up if you are curious.
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
# Load the tokenizer and the language model from the Hugging Face Hub so they can be used to generate text.


The next step is to initialize the pipeline using the objects created above.

The model's response is limited to 256 new tokens. For this project I'm not interested in a longer response, but the limit can easily be raised to whatever length you want.

By setting ***device_map*** to ***auto*** we instruct the library to automatically select the most appropriate device, CPU or GPU, for text generation.
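
If you want to know in advance which device is available, a small check like the one below can help. It is only a rough sketch and not what `device_map="auto"` does internally: it assumes PyTorch is installed (as it is on Colab) and falls back to CPU otherwise. The helper name `pick_device` is hypothetical.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumed to be installed, as on Colab
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Generation would run on:", pick_device())
```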

In [None]:
# "text-generation" is a predefined task in the Transformers library, designed for
# text generation models: the model generates new text based on the given input.
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",
)

Device set to use cpu


## Creating the extended prompt
To create the prompt, you can use the result of querying the vector database together with the sentence entered by the user.

The prompt has two parts: the **relevant context**, which is the information retrieved from the database, and the **user's question**.

You only need to join the two parts together to create the prompt sent to the model.

You can limit the length of the context passed to the model, because documents that contain very long texts can cause memory problems.
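
One possible way to combine both ideas, joining the parts and capping the context, is sketched below. Instead of cutting the context string at a fixed position, it stops adding documents once a character budget is reached, so no document is cut mid-sentence. The helper name `build_prompt`, the parameter `max_context_chars` and the toy documents are assumptions made for illustration.

```python
def build_prompt(documents, question, max_context_chars=5120):
    """Join retrieved documents into a context block and append the question."""
    parts = []
    used = 0
    for doc in documents:
        piece = f"#{doc}"
        # The extra 1 accounts for the space that joins consecutive pieces.
        extra = len(piece) + (1 if parts else 0)
        if used + extra > max_context_chars:
            break  # stop at a document boundary
        parts.append(piece)
        used += extra
    context = " ".join(parts)
    return (
        f"\nRelevant context: {context}\n"
        "Considering the relevant context, answer the question.\n"
        f"Question: {question}\n"
        "Answer: "
    )


toy_docs = ["short document one.", "short document two.", "x" * 100]
prompt = build_prompt(toy_docs, "Can I buy a new Toshiba laptop?", max_context_chars=50)
print(prompt)
```

With a budget of 50 characters, the first two toy documents fit and the third, much longer one is left out.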

In [None]:
question = "Can I buy a new Toshiba laptop?"
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
# Optional: uncomment the next line to cap the context at 5120 characters and avoid memory problems.
# context = context[0:5120]
prompt_template = f"""
Relevant context: {context}
Considering the relevant context, answer the question.
Question: {question}
Answer: """
prompt_template

"\nRelevant context: #Photography student Thorsten Mjölnir captures the way students decorate their laptops. #Why sales of very basic mobile phones, without apps and internet connection, are increasing. #What do you do when your collection of millions of books keeps growing but your bookshelves don't? #The developers of a powerful mini aircraft hope it will be used by the armed forces. #How tech is helping young families and couples regain their busy social lives after Covid. #Watch as Lee Zii Jia of Malaysia records a speed of 372km/h on his backhand point against Lakshya Sen of India in the All England Badminton Championships. #The Royal Mint has found a way to turn old circuit boards from phones, computers and TVs into gold. #A van was reportedly hijacked and driven to the venue, and a controlled explosion has since been carried out. #The Ukrainian president reveals his location in Kyiv in a new video shared on social media. #The Royal Mint has found a way to turn old circuit boards

In [None]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])


Relevant context: #Photography student Thorsten Mjölnir captures the way students decorate their laptops. #Why sales of very basic mobile phones, without apps and internet connection, are increasing. #What do you do when your collection of millions of books keeps growing but your bookshelves don't? #The developers of a powerful mini aircraft hope it will be used by the armed forces. #How tech is helping young families and couples regain their busy social lives after Covid. #Watch as Lee Zii Jia of Malaysia records a speed of 372km/h on his backhand point against Lakshya Sen of India in the All England Badminton Championships. #The Royal Mint has found a way to turn old circuit boards from phones, computers and TVs into gold. #A van was reportedly hijacked and driven to the venue, and a controlled explosion has since been carried out. #The Ukrainian president reveals his location in Kyiv in a new video shared on social media. #The Royal Mint has found a way to turn old circuit boards f
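
By default the text-generation pipeline returns the prompt followed by the completion, which is why the output starts with the whole context again. A minimal sketch for keeping only the model's answer is shown below, using made-up strings. The helper name `extract_answer` is hypothetical.

```python
def extract_answer(generated_text, prompt):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: keep whatever follows the last "Answer:" marker.
    return generated_text.rsplit("Answer:", 1)[-1].strip()


toy_prompt = "Question: toy question?\nAnswer: "
toy_output = toy_prompt + "This is a made-up completion."
print(extract_answer(toy_output, toy_prompt))
```

In the notebook this would be called as `extract_answer(lm_response[0]["generated_text"], prompt_template)`. Alternatively, passing `return_full_text=False` when calling the pipeline should make it return the completion only.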

## Note
Here we used the default embedding function built into ChromaDB, which is the easier option.
The language model has its own embeddings, but we did not use them for retrieval.
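
To make the role of the embedding more concrete, here is a toy sketch of the ranking step behind a vector search: every document is a vector, and the query returns the documents whose vectors lie closest to the query vector. The three-dimensional vectors are made up, and squared Euclidean distance is used as an assumption. ChromaDB's actual metric is configurable, and real embeddings have hundreds of dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up embeddings keyed by document id.
toy_embeddings = {
    "id0": [0.9, 0.1, 0.0],
    "id1": [0.1, 0.8, 0.1],
    "id2": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]


def nearest(query, embeddings, n_results=2):
    """Return the ids of the n_results vectors closest to the query."""
    ranked = sorted(embeddings, key=lambda doc_id: squared_l2(query, embeddings[doc_id]))
    return ranked[:n_results]


print(nearest(query_vector, toy_embeddings))
```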