<a href="https://colab.research.google.com/github/hosein9574/My-agents/blob/main/%E2%80%8C%E2%80%8C4.%20BBCNews_Chromadb_Liama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers --upgrade
!pip install sentence-transformers --upgrade
!pip install chromadb --upgrade



In [2]:
import numpy as np
import pandas as pd
from google.colab import files
import transformers
import sentence_transformers
from sentence_transformers import SentenceTransformer
import chromadb
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

## Download and prepare the Dataset
Log in to Kaggle and open https://www.kaggle.com/datasets/gpreda/bbc-news to download the dataset.

Using the Kaggle API JSON file (`kaggle.json`), I downloaded the dataset directly into the temporary storage of the Google Colab session.

In [3]:
print("Transformers version:", transformers.__version__)
print("Sentence-Transformers version:", sentence_transformers.__version__)
print("ChromaDB version:", chromadb.__version__)

Transformers version: 4.51.3
Sentence-Transformers version: 4.1.0
ChromaDB version: 1.0.7


In [5]:
from google.colab import files
uploaded = files.upload()


Saving kaggle.json to kaggle.json


In [6]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
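
The same three shell steps can also be done from Python, which may be useful outside Colab. This is a minimal sketch, assuming the uploaded file is named `kaggle.json`; the helper name `install_kaggle_credentials` is hypothetical, and the `dest_dir` parameter exists only so the function can be tried against a scratch directory.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", dest_dir=None):
    """Copy a Kaggle API token into place and restrict it to the owner.

    Hypothetical helper mirroring `mkdir -p`, `cp`, and `chmod 600`.
    """
    dest_dir = Path(dest_dir) if dest_dir else Path.home() / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)                    # cp kaggle.json ~/.kaggle/
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)   # chmod 600 ~/.kaggle/kaggle.json
    return dest
```

Calling `install_kaggle_credentials()` with no arguments targets `~/.kaggle`, matching the shell commands above.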


In [7]:
!kaggle datasets download -d gpreda/bbc-news
!unzip bbc-news.zip


Dataset URL: https://www.kaggle.com/datasets/gpreda/bbc-news
License(s): CC0-1.0
Archive:  bbc-news.zip
  inflating: bbc_news.csv            


ChromaDB requires every record to have a unique identifier. The cell below creates one by copying the DataFrame index into a new column called **id**.

In [8]:
news = pd.read_csv('./bbc_news.csv')
MAX_NEWS = 1000
DOCUMENT="description"
TOPIC="title"
news["id"] = news.index
news.head(3)

Unnamed: 0,title,pubDate,guid,link,description,id
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...,0
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as...",1
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...,2


In [9]:
subset_news = news.head(MAX_NEWS)     # This is only an example, so we keep a small portion of the news.
chroma_client = chromadb.PersistentClient(path="./chromadb")   # Persistent client: the database is stored on disk at this path rather than only in memory.

## Filling and Querying the ChromaDB Database
Data in ChromaDB is stored in collections. If a collection with the same name already exists, it must be deleted before it can be created again.

In the next lines, the collection is created by calling the ***create_collection*** function on the ***chroma_client*** created above.

In [10]:
collection_name = "news_collection" + str(int(datetime.now().timestamp()))      # unique name: Unix timestamp suffix (portable replacement for strftime("%s"))
existing_names = [getattr(c, "name", c) for c in chroma_client.list_collections()]   # list_collections() returns names or Collection objects, depending on the ChromaDB version
if collection_name in existing_names:           # if the collection already exists, delete it
    chroma_client.delete_collection(name=collection_name)

collection = chroma_client.create_collection(name=collection_name)          # create a new collection

Data is added to the collection with the ***add*** function.
This notebook passes four arguments:

**documents** → full text of each news item (stored in a specific column of the dataset)

**metadatas** → meta information, such as the title or category of the news

**ids** → a unique identifier for each data row

**embeddings** → the vector representation of each document; ChromaDB searches by comparing these vectors, so every text has to be converted into an embedding before it can be queried
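
The arguments are parallel lists: position *i* of `documents`, `metadatas`, `ids`, and `embeddings` must all describe the same news item. Below is a minimal sketch of that alignment in plain Python, with no ChromaDB calls. The helper `build_add_payload`, the placeholder rows, and the two-dimensional toy vectors are hypothetical and only illustrate the expected shape.

```python
def build_add_payload(rows, vectors, text_key="description", topic_key="title"):
    """Assemble keyword arguments for collection.add from row dicts and vectors.

    Hypothetical helper: it only builds and checks plain Python lists.
    """
    if len(rows) != len(vectors):
        raise ValueError("need exactly one embedding per row")

    payload = {
        "documents": [row[text_key] for row in rows],
        "metadatas": [{topic_key: row[topic_key]} for row in rows],
        "ids": [f"id{row['id']}" for row in rows],
        "embeddings": [list(vec) for vec in vectors],
    }

    # Duplicate ids would make two rows collide in the collection.
    if len(set(payload["ids"])) != len(payload["ids"]):
        raise ValueError("ids must be unique")
    return payload


# Placeholder rows and toy vectors, not real dataset values.
rows = [
    {"id": 0, "title": "title A", "description": "text A"},
    {"id": 1, "title": "title B", "description": "text B"},
]
vectors = [[0.1, 0.2], [0.3, 0.4]]

payload = build_add_payload(rows, vectors)
print(payload["ids"])
```

With a real collection, the result could then be passed on as `collection.add(**payload)`.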

In [11]:
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")     # choose a sentence transformer model
embeddings = embedding_model.encode(subset_news[DOCUMENT].tolist(), convert_to_numpy=True)   # create embeddings

# add documents, metadata, ids, and the precomputed embeddings to the ChromaDB collection
collection.add(
    documents=subset_news[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()],
    ids=[f"id{x}" for x in range(len(subset_news))],   # one id per row, even if the dataset has fewer than MAX_NEWS rows
    embeddings=embeddings.tolist(),
)

documents = collection.get(ids=[f"id{x}" for x in range(len(subset_news))])


In [12]:
results = collection.query(query_texts=["laptop"], n_results=10 )   # query_texts takes a list of queries; ChromaDB embeds each one and returns its 10 nearest documents
print(results)



{'ids': [['id775', 'id707', 'id310', 'id587', 'id444', 'id751', 'id701', 'id862', 'id191', 'id740']], 'embeddings': None, 'documents': [['Photography student Thorsten Mjölnir captures the way students decorate their laptops.', 'Why sales of very basic mobile phones, without apps and internet connection, are increasing.', "What do you do when your collection of millions of books keeps growing but your bookshelves don't?", 'The developers of a powerful mini aircraft hope it will be used by the armed forces.', 'How tech is helping young families and couples regain their busy social lives after Covid.', 'Watch as Lee Zii Jia of Malaysia records a speed of 372km/h on his backhand point against Lakshya Sen of India in the All England Badminton Championships.', 'The Royal Mint has found a way to turn old circuit boards from phones, computers and TVs into gold.', 'A van was reportedly hijacked and driven to the venue, and a controlled explosion has since been carried out.', 'The Ukrainian pres
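
The printed dictionary is nested one level deeper than it first appears: each field holds one inner list per query text, so the hits for the single query `"laptop"` sit at index 0. Below is a minimal sketch of flattening that structure. It uses a hand-written dictionary that copies the first two hits from the output above; the helper `hits_for_query` is hypothetical, and the remaining fields are left out because the printed output is cut off.

```python
def hits_for_query(results, query_index=0):
    """Pair each returned id with its document for one query text."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    return list(zip(ids, docs))


# Hand-written stand-in with the same shape as the query result.
sample_results = {
    "ids": [["id775", "id707"]],
    "embeddings": None,
    "documents": [[
        "Photography student Thorsten Mjölnir captures the way students decorate their laptops.",
        "Why sales of very basic mobile phones, without apps and internet connection, are increasing.",
    ]],
}

for doc_id, text in hits_for_query(sample_results):
    print(doc_id, "->", text)
```

The same indexing is what `results["documents"][0]` relies on when the prompt context is built later in the notebook.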

In [14]:
print(collection.get(ids="id775",include=["documents", "embeddings"]))  # sanity check: fetch one stored document together with its embedding


{'ids': ['id775'], 'embeddings': array([[-2.67221183e-02,  9.37914997e-02,  3.02346353e-03,
        -5.78389168e-02,  2.97618173e-02, -2.34934054e-02,
         2.72504054e-02,  4.50752899e-02,  4.75690402e-02,
         5.35774417e-02,  1.00971334e-01, -1.01442235e-02,
         8.24281350e-02,  6.61812946e-02,  6.21390063e-03,
         4.01778030e-04,  9.82936751e-03,  1.11560654e-02,
         3.24836560e-02,  2.27334816e-02,  2.07442846e-02,
        -5.90932034e-02,  2.73000635e-02, -6.01511896e-02,
         4.16708216e-02,  2.75906343e-02,  6.32560924e-02,
        -1.07277416e-01, -5.16070612e-02, -6.56247810e-02,
        -2.74685696e-02, -8.95415060e-03, -3.35778370e-02,
         7.96491429e-02, -2.32424978e-02, -4.83128149e-03,
         1.03186453e-02,  6.27650246e-02, -2.56560799e-02,
        -1.88748445e-02, -1.29533350e-01, -7.17133358e-02,
         3.10858637e-02, -5.89844510e-02,  3.43758389e-02,
        -8.83892998e-02,  2.81154886e-02, -4.83900346e-02,
         2.88590323e-02
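
A vector like the one printed above is what ChromaDB compares against the embedded query. Below is a minimal sketch of such a nearest-neighbour ranking in plain Python. The squared Euclidean distance is an assumption made for illustration (the metric is configurable per collection), the helper names are hypothetical, and the three-dimensional vectors are toy values, whereas `all-MiniLM-L6-v2` produces 384 dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, stored, n_results=2):
    """Return the ids of the n_results stored vectors closest to query_vec."""
    ranked = sorted(stored.items(), key=lambda item: squared_l2(query_vec, item[1]))
    return [doc_id for doc_id, _ in ranked[:n_results]]


# Toy vectors, not real embeddings.
stored = {
    "id0": [1.0, 0.0, 0.0],
    "id1": [0.0, 1.0, 0.0],
    "id2": [0.9, 0.1, 0.0],
}

print(nearest([1.0, 0.0, 0.0], stored))
```

A real vector database avoids this exhaustive scan by using an approximate index, but the ranking idea is the same.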


## Working with language models (LLMs) using Transformers
The three main tools used here are:

1️⃣ AutoTokenizer → An automatic tokenizer that converts text into tokens suitable for the model.

2️⃣ AutoModelForCausalLM → Language models based on Causal Language Modeling (like GPT) for text generation.

3️⃣ pipeline → A simple interface for performing NLP tasks such as text generation or text classification.  

In [13]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"        # alternative, larger model: "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)   # TinyLlama uses the Llama architecture built into Transformers, so trust_remote_code is not needed


The model's response is limited to 256 new tokens (`max_new_tokens=256`).

**"text-generation"** is a predefined task in the Transformers library, designed specifically for text generation models. It lets the model generate new text that continues the given input.

In [None]:
pipe = pipeline("text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",)  # CPU or GPU selection

Device set to use cpu


## Creating the prompt
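
Before the prompt is assembled, the retrieved documents are joined into a single context string, and the cell below mentions an optional cap of 5120 characters. A plain slice can cut a news item mid-sentence, so here is a minimal sketch of an alternative that drops whole documents once a character budget is reached. The helper `build_context` is hypothetical, and 5120 is simply the figure from that cell, not a measured limit of the model.

```python
def build_context(documents, max_chars=5120):
    """Join documents with a '#' prefix, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for doc in documents:
        piece = f"#{doc}"
        extra = len(piece) + (1 if parts else 0)  # +1 for the joining space
        if used + extra > max_chars:
            break  # keep only whole documents
        parts.append(piece)
        used += extra
    return " ".join(parts)


# Toy documents, not real query results.
print(build_context(["aaa", "bbb", "ccc"], max_chars=9))
```

In the notebook this would be called as `build_context(results["documents"][0])`.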


In [None]:
question = "Can I buy a new Toshiba laptop?"
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
# Optional: uncomment to cap the context at 5120 characters so the prompt stays within the model's input limit.
# context = context[0:5120]
prompt_template = f"""
Relevant context: {context}
Considering the relevant context, answer the question.
Question: {question}
Answer: """

prompt_template

"\nRelevant context: #Photography student Thorsten Mjölnir captures the way students decorate their laptops. #Why sales of very basic mobile phones, without apps and internet connection, are increasing. #What do you do when your collection of millions of books keeps growing but your bookshelves don't? #The developers of a powerful mini aircraft hope it will be used by the armed forces. #How tech is helping young families and couples regain their busy social lives after Covid. #Watch as Lee Zii Jia of Malaysia records a speed of 372km/h on his backhand point against Lakshya Sen of India in the All England Badminton Championships. #The Royal Mint has found a way to turn old circuit boards from phones, computers and TVs into gold. #A van was reportedly hijacked and driven to the venue, and a controlled explosion has since been carried out. #The Ukrainian president reveals his location in Kyiv in a new video shared on social media. #The Royal Mint has found a way to turn old circuit boards

In [None]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])


Relevant context: #Photography student Thorsten Mjölnir captures the way students decorate their laptops. #Why sales of very basic mobile phones, without apps and internet connection, are increasing. #What do you do when your collection of millions of books keeps growing but your bookshelves don't? #The developers of a powerful mini aircraft hope it will be used by the armed forces. #How tech is helping young families and couples regain their busy social lives after Covid. #Watch as Lee Zii Jia of Malaysia records a speed of 372km/h on his backhand point against Lakshya Sen of India in the All England Badminton Championships. #The Royal Mint has found a way to turn old circuit boards from phones, computers and TVs into gold. #A van was reportedly hijacked and driven to the venue, and a controlled explosion has since been carried out. #The Ukrainian president reveals his location in Kyiv in a new video shared on social media. #The Royal Mint has found a way to turn old circuit boards f
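
As the output shows, `generated_text` starts with the full prompt, and the model's answer only follows it. Below is a minimal sketch of separating the two in plain Python. The helper `extract_answer` is hypothetical, and the generated string is a placeholder rather than real model output. The text-generation pipeline also accepts `return_full_text=False`, which should return only the newly generated text.

```python
def extract_answer(generated_text, prompt):
    """Return only the text the model appended after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the marker that ends the notebook's prompt template.
    marker = "Answer:"
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


prompt = "\nQuestion: Can I buy a new Toshiba laptop?\nAnswer: "
generated = prompt + "(model output would appear here)"  # placeholder, not a real response
print(extract_answer(generated, prompt))
```

In the notebook this would be called as `extract_answer(lm_response[0]["generated_text"], prompt_template)`.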