<a href="https://colab.research.google.com/github/mahdihayan/LLM-Agents/blob/main/4_BBCNews_Chromadb_Liama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers --upgrade
!pip install sentence-transformers --upgrade
!pip install chromadb --upgrade

In [2]:
import numpy as np
import pandas as pd
from google.colab import files
import transformers
import sentence_transformers
from sentence_transformers import SentenceTransformer
import chromadb
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

## Download and prepare the Dataset
Log into Kaggle and download the dataset from https://www.kaggle.com/datasets/everydaycodings/global-news-dataset.

Using the Kaggle API JSON file, I saved the dataset directly to Google Colab's temporary storage.

In [3]:
print("Transformers version:", transformers.__version__)
print("Sentence-Transformers version:", sentence_transformers.__version__)
print("ChromaDB version:", chromadb.__version__)

Transformers version: 4.51.3
Sentence-Transformers version: 4.1.0
ChromaDB version: 1.0.8


In [5]:
files.upload()


Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"mahdihayan","key":"<redacted>"}'}

In [7]:
# Step 2: copy the API key to the correct folder and set its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [18]:
# Download the global news dataset
!kaggle datasets download -d everydaycodings/global-news-dataset



Dataset URL: https://www.kaggle.com/datasets/everydaycodings/global-news-dataset
License(s): CC0-1.0


In [20]:
# Extract the ZIP file
import zipfile  # needed to work with ZIP archives

with zipfile.ZipFile('global-news-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall()

print("Dataset downloaded and extracted successfully.")



Dataset downloaded and extracted successfully.


ChromaDB requires every record to have a unique identifier. The cell below provides one by creating a new column called **id** from the DataFrame index.
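A minimal sketch of how such identifiers can be built and checked, using a plain row count instead of the real DataFrame. The helper name `make_ids` and the `id` prefix are illustrative assumptions, not part of ChromaDB; the only point is that each row gets its own distinct string.

```python
def make_ids(n_rows, prefix="id"):
    """Return one string identifier per row: id0, id1, id2, ..."""
    return [f"{prefix}{i}" for i in range(n_rows)]


ids = make_ids(5)

# A quick sanity check before inserting: no identifier may appear twice.
has_duplicates = len(ids) != len(set(ids))
print(ids, "duplicates:", has_duplicates)
```

Deriving the identifier from the row position keeps it stable as long as the rows are not reordered or filtered between runs.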

In [None]:
news = pd.read_csv('./raw-data.csv')
MAX_NEWS = 1200
DESCRIPTION="description"
TITLE="title"
news["id"] = news.index
news

In [27]:
subset_news = news.head(MAX_NEWS)     # This is only an example, so we keep a small portion of the news.
chroma_client = chromadb.PersistentClient(path="./chromadb")   # Create a persistent client; the database is stored on disk under ./chromadb rather than in memory.

## Filling and Querying the ChromaDB Database
The data in ChromaDB is stored in collections. If a collection with the same name already exists, it must be deleted first.

In the next lines, the collection is created by calling the ***create_collection*** function in the ***chroma_client*** created above.

In [28]:
collection_name = "news_collection" + str(int(datetime.now().timestamp()))      # build a unique collection name from the current Unix timestamp
existing_names = [getattr(c, "name", c) for c in chroma_client.list_collections()]   # names of the existing collections (entries may be Collection objects or plain strings, depending on the ChromaDB version)
if collection_name in existing_names:           # if the collection already exists, delete it
    chroma_client.delete_collection(name=collection_name)

collection = chroma_client.create_collection(name=collection_name)          # create a new collection

The data is added to the collection with the `add` function. This notebook passes four arguments:

**documents** → the full text of each news item (here, the `description` column of the dataset)

**metadatas** → meta information, such as the title or category of the news

**ids** → a unique identifier for each data row

**embeddings** → the numeric vector representation of each document, which ChromaDB compares to find similar items; here they are computed with a sentence-transformer model
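To make the role of the vectors concrete, here is a small sketch of similarity search over made-up three-dimensional vectors. The toy vectors, the document strings, and the choice of cosine similarity are assumptions for illustration only; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and the distance metric ChromaDB actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings": one vector per document.
doc_vectors = {
    "cricket match in Delhi": [0.9, 0.1, 0.0],
    "new laptop released": [0.1, 0.9, 0.2],
}
query_vector = [0.8, 0.2, 0.1]

# Rank the documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda doc: cosine_similarity(query_vector, doc_vectors[doc]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same kind of ranking, but over many more vectors and with an index that avoids comparing the query against every stored item.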

In [29]:
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")     # choose a sentence transformer model

# Drop rows with a missing description or title: ChromaDB only accepts strings as documents,
# otherwise add() fails with "ValueError: Expected document to be a str, got nan in add."
valid_news = subset_news.dropna(subset=[DESCRIPTION, TITLE])
embeddings = embedding_model.encode(valid_news[DESCRIPTION].tolist(), convert_to_numpy=True)   # create embeddings

# add documents, metadata, ids and embeddings to the ChromaDB collection
collection.add(
    documents=valid_news[DESCRIPTION].tolist(),
    metadatas=[{TITLE: title} for title in valid_news[TITLE].tolist()],
    ids=[f"id{x}" for x in range(len(valid_news))],
    embeddings=embeddings.tolist(),
)
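The `add` call expects every document to be a string, while a missing value read by pandas arrives as a float `nan`. A small diagnostic sketch, in plain Python, for locating such entries before inserting; the helper name `find_invalid_documents` and the sample list are hypothetical.

```python
import math


def find_invalid_documents(documents):
    """Return the positions of entries that are not non-empty strings."""
    return [
        i
        for i, doc in enumerate(documents)
        if not isinstance(doc, str) or not doc.strip()
    ]


# Hypothetical sample: two valid descriptions and three unusable entries.
sample = ["Markets rally after rate cut", math.nan, "", None, "Storm hits the coast"]
bad_positions = find_invalid_documents(sample)
print(bad_positions)  # [1, 2, 3]
```

Running a check like this on the description column shows how many rows would be rejected and where they are.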

In [None]:
results = collection.query(query_texts=["India"], n_results=10 )   # for each query text in the list ChromaDB searches, we get the top 10 results
print(results)

In [None]:
print(collection.get(ids=["id775"], include=["documents", "embeddings"]))  # quick check of a single stored item



## Using Transformers to work with language models (LLMs)
The three main tools used here are:

1️⃣ AutoTokenizer → An automatic tokenizer that converts text into tokens suitable for the model.

2️⃣ AutoModelForCausalLM → Language models based on Causal Language Modeling (like GPT) for text generation.

3️⃣ pipeline → A simple interface for performing NLP tasks such as text generation or text classification.  

In [None]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"        # a larger alternative: model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

The model's response is limited to 256 new tokens (`max_new_tokens=256`).

**"text-generation"** is a predefined task in the Transformers library, designed specifically for text generation models. It lets the model generate new text based on the given input.

In [None]:
pipe = pipeline("text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",)  # CPU or GPU selection

Device set to use cpu


## Creating the prompt


In [None]:
question = "Can I buy a new Toshiba laptop?"
context = " ".join([f"#{doc}" for doc in results["documents"][0]])   # prefix each retrieved document with '#' and join them into one string
# Optional: context = context[0:5120] limits the context to 5120 characters so the prompt stays within the model's input limit.
prompt_template = f"""
Relevant context: {context}
Considering the relevant context, answer the question.
Question: {question}
Answer: """

prompt_template

"\nRelevant context: #Photography student Thorsten Mjölnir captures the way students decorate their laptops. #Why sales of very basic mobile phones, without apps and internet connection, are increasing. #What do you do when your collection of millions of books keeps growing but your bookshelves don't? #The developers of a powerful mini aircraft hope it will be used by the armed forces. #How tech is helping young families and couples regain their busy social lives after Covid. #Watch as Lee Zii Jia of Malaysia records a speed of 372km/h on his backhand point against Lakshya Sen of India in the All England Badminton Championships. #The Royal Mint has found a way to turn old circuit boards from phones, computers and TVs into gold. #A van was reportedly hijacked and driven to the venue, and a controlled explosion has since been carried out. #The Ukrainian president reveals his location in Kyiv in a new video shared on social media. #The Royal Mint has found a way to turn old circuit boards

In [None]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])


Relevant context: #Photography student Thorsten Mjölnir captures the way students decorate their laptops. #Why sales of very basic mobile phones, without apps and internet connection, are increasing. #What do you do when your collection of millions of books keeps growing but your bookshelves don't? #The developers of a powerful mini aircraft hope it will be used by the armed forces. #How tech is helping young families and couples regain their busy social lives after Covid. #Watch as Lee Zii Jia of Malaysia records a speed of 372km/h on his backhand point against Lakshya Sen of India in the All England Badminton Championships. #The Royal Mint has found a way to turn old circuit boards from phones, computers and TVs into gold. #A van was reportedly hijacked and driven to the venue, and a controlled explosion has since been carried out. #The Ukrainian president reveals his location in Kyiv in a new video shared on social media. #The Royal Mint has found a way to turn old circuit boards f
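The prompt-building cell above joins every retrieved document into the context and mentions a 5120-character cap in a comment. A minimal sketch of a helper that applies such a cap at document boundaries, so that no description is cut in half. The function name `build_prompt` and the parameter `max_context_chars` are assumptions for illustration; the template text mirrors the one used in the notebook.

```python
def build_prompt(question, documents, max_context_chars=5120):
    """Build the prompt, adding whole documents until the character budget is used up."""
    parts = []
    used = 0
    for doc in documents:
        piece = f"#{doc}"
        # Count one extra character for the space that joins consecutive pieces.
        extra = len(piece) + (1 if parts else 0)
        if used + extra > max_context_chars:
            break
        parts.append(piece)
        used += extra
    context = " ".join(parts)
    return (
        f"\nRelevant context: {context}\n"
        "Considering the relevant context, answer the question.\n"
        f"Question: {question}\n"
        "Answer: "
    )


# Hypothetical retrieved documents and a deliberately tiny budget.
example_docs = ["aaaa", "bbbb", "cccc"]
example_prompt = build_prompt("Can I buy a new Toshiba laptop?", example_docs, max_context_chars=11)
print(example_prompt)
```

Cutting at document boundaries keeps each retrieved description intact, at the cost of possibly leaving part of the budget unused.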