In [21]:
import chromadb

In [22]:

client = chromadb.Client()

## Creating a collection
Documents are stored in a collection.

In [29]:
collection3 = client.create_collection(name="kolekcija3")

## Adding documents to the collection
Here we add the documents we want to vectorize. In a real case each entry would be the full text of a document rather than a short label.

In [31]:
collection3.add(
   ids=["1", "2", "3"],
   documents=[
      "Dokument o bananama","Dokument o narancama",
      "Dokument o jabukama"
   ]
)

In [32]:
print(collection3)

Collection(name=kolekcija3)


## Searching the database with a user query
When the user writes a question, it is sent to the collection, which runs a semantic search and returns the documents that best match the question, closest first.

This is the foundation of everything we will do, because the AI model will read its data from the documents selected here.

In [48]:
user_query = "Koliko dugo banana moze stajati vani"

context = collection3.query(
    query_texts=[user_query]
)

print(context)

{'ids': [['1', '2', '3']], 'embeddings': None, 'documents': [['Dokument o bananama', 'Dokument o narancama', 'Dokument o jabukama']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None, None, None]], 'distances': [[0.6421152353286743, 1.0225074291229248, 1.0967178344726562]]}
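
The result above is nested one level per query: each field holds one inner list for every entry in `query_texts`. A minimal sketch of unpacking it follows, using the values printed above. The helper name `ranked_matches` is an assumption introduced here for illustration and is not part of ChromaDB.

```python
# Fields copied from the printed result above (only the ones used here).
context = {
    "ids": [["1", "2", "3"]],
    "documents": [["Dokument o bananama", "Dokument o narancama", "Dokument o jabukama"]],
    "distances": [[0.6421152353286743, 1.0225074291229248, 1.0967178344726562]],
}


def ranked_matches(result, query_index=0):
    """Return (id, document, distance) tuples for one query, closest first."""
    rows = zip(
        result["ids"][query_index],
        result["documents"][query_index],
        result["distances"][query_index],
    )
    # Sort explicitly by distance so the helper does not rely on input order.
    return sorted(rows, key=lambda row: row[2])


matches = ranked_matches(context)
best_id, best_document, best_distance = matches[0]
print(best_id, best_document, round(best_distance, 3))
```

The smallest distance belongs to the banana document, which is the one we would hand to the AI model as context.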


## Adding a dataset from Kaggle
I downloaded the 'News Articles' dataset; it is located in the ./archive folder.

### Showing the first few rows of the dataset

In [44]:
import polars as pl

# with_row_index just adds an index column at the start of the table (offset=1 starts the numbering at 1)
articles = pl.read_csv('./archive/Articles.csv', encoding='ISO-8859-1').with_row_index(offset=1)
articles.head()

index,Article,Date,Heading,NewsType
u32,str,str,str,str
1,"""KARACHI: The Sindh government …","""1/1/2015""","""sindh govt decides to cut publ…","""business"""
2,"""HONG KONG: Asian markets start…","""1/2/2015""","""asia stocks up in new year tra…","""business"""
3,"""HONG KONG: Hong Kong shares o…","""1/5/2015""","""hong kong stocks open 0.66 per…","""business"""
4,"""HONG KONG: Asian markets tumbl…","""1/6/2015""","""asian stocks sink euro near ni…","""business"""
5,"""NEW YORK: US oil prices Monday…","""1/6/2015""","""us oil prices slip below 50 a …","""business"""


## Embeddings
An embedding model takes a piece of text and converts it into a sequence of numbers, i.e. a vector.
Those numbers represent the meaning of the text, which lets us search large bodies of text more effectively because we focus on the meaning rather than on specific words.

Comparing two texts, e.g. the user's question "Tell me something about bananas" against every document in the database, gives us the distance between the question vector and each document vector. Based on that distance we pick the document that is semantically closest to the question. That document then serves as the source of knowledge.
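
To make the idea of "distance between vectors" concrete, here is a small sketch with made-up 3-dimensional vectors. Real embeddings have far more dimensions, and both the numbers and the choice of metric are assumptions for illustration only. It shows two common measures, squared Euclidean distance and cosine distance, and picks the document closest to the question.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors, NOT real embeddings.
question = [0.9, 0.1, 0.0]
documents = {
    "bananas": [0.8, 0.2, 0.1],
    "oranges": [0.1, 0.9, 0.2],
    "apples": [0.0, 0.3, 0.9],
}

# The semantically closest document is the one with the smallest distance.
closest = min(documents, key=lambda name: cosine_distance(question, documents[name]))
print(closest)
```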

Several models exist for the embedding process, and each has its own use case. We will explore this later; for the purposes of this test I will pick just one.

For now I will use an OpenAI embedding model and may compare it with others later.

This requires creating an OpenAI API key and putting it in a .env file as API_KEY=<key>

In [47]:
from dotenv import load_dotenv
import os
load_dotenv()

KEY=os.getenv('API_KEY')
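
`os.getenv` returns `None` when the variable is missing, so a typo in the .env file may only surface later, when the key is first used. A small guard like the one below fails earlier with a clearer message. The helper name `require_env` and the `DEMO_API_KEY` variable are assumptions made up for this sketch.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file")
    return value


# Demo with a placeholder variable so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook this would be used as `KEY = require_env('API_KEY')` after `load_dotenv()`.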


### OpenAI embedding model
ChromaDB has an integration with OpenAI, so we can talk to the OpenAI servers directly through it.

In [None]:
import chromadb.utils.embedding_functions as embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=KEY,
    model_name="text-embedding-3-small"
)

In [51]:
len(articles)

2692

In [52]:
N = 50
articles = articles[:N] # Select only the first 50 rows instead of all 2692

In [53]:
articles

index,Article,Date,Heading,NewsType
u32,str,str,str,str
1,"""KARACHI: The Sindh government …","""1/1/2015""","""sindh govt decides to cut publ…","""business"""
2,"""HONG KONG: Asian markets start…","""1/2/2015""","""asia stocks up in new year tra…","""business"""
3,"""HONG KONG: Hong Kong shares o…","""1/5/2015""","""hong kong stocks open 0.66 per…","""business"""
4,"""HONG KONG: Asian markets tumbl…","""1/6/2015""","""asian stocks sink euro near ni…","""business"""
5,"""NEW YORK: US oil prices Monday…","""1/6/2015""","""us oil prices slip below 50 a …","""business"""
…,…,…,…,…
46,"""Karachi: Microsoft Devices Pak…","""2/12/2015""","""nokia 215 dual sim launched in…","""business"""
47,"""ISLAMABAD: Federal Finance Min…","""2/12/2015""","""cnic number now tax number onl…","""business"""
48,"""ISLAMABAD: Government has put …","""2/12/2015""","""govt imposes new taxes of rs4 …","""business"""
49,"""Singapore: Oil prices edged hi…","""2/12/2015""","""oil prices rise in asian trad""","""business"""


In [55]:
# Try embedding just the first row of the 'Article' column
vectors = openai_ef([articles['Article'][0]])
print(vectors)

[array([-0.00787773, -0.02999097,  0.03325776, ...,  0.03130676,
       -0.03847556,  0.00330366], shape=(1536,), dtype=float32)]


In [57]:
collectionArticles = client.get_or_create_collection(name="articles")

In [None]:
# As a test, add only one article
collectionArticles.add(
    documents=[articles['Article'][0]],
    ids=["id1"],
    embeddings=vectors # When embeddings are passed, Chroma uses them instead of generating its own automatically, as it did at the start of the notebook
    # It would also be good to add metadata here, e.g. the article headings
)
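
The comment above suggests attaching metadata such as the article heading. A minimal sketch of preparing that is shown below: it builds the `ids`, `documents` and `metadatas` lists from row dictionaries. The helper name `build_add_payload`, the metadata key names, and the two sample rows (with placeholder text) are assumptions for illustration; in the notebook the rows could come from `articles.to_dicts()`.

```python
def build_add_payload(rows):
    """Turn row dicts into parallel ids/documents/metadatas lists."""
    return {
        "ids": [f"id{row['index']}" for row in rows],
        "documents": [row["Article"] for row in rows],
        "metadatas": [
            {"heading": row["Heading"], "date": row["Date"], "news_type": row["NewsType"]}
            for row in rows
        ],
    }


# Placeholder rows with the same column names as the dataset; the text is made up.
rows = [
    {"index": 1, "Article": "placeholder article one", "Date": "1/1/2015",
     "Heading": "placeholder heading one", "NewsType": "business"},
    {"index": 2, "Article": "placeholder article two", "Date": "1/2/2015",
     "Heading": "placeholder heading two", "NewsType": "business"},
]

payload = build_add_payload(rows)
print(payload["ids"], payload["metadatas"][0])
```

The result could then be passed on as `collectionArticles.add(**payload, embeddings=vectors)`, provided `vectors` holds one embedding per row in the same order.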

In [None]:
articles['Article'][0] # This is the article we created the embedding for

'KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling.Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.                        \n\n\n\n\n\n\n\n\n\n\n'

In [None]:
query = 'public transport fares by 7 per cent'
query_embeddings = openai_ef([query]) # Convert the query into an embedding
collectionArticles.query(
    query_embeddings=query_embeddings, # We pass the embedding ourselves so the default embedding process is not triggered
    n_results=1
)
# There is only one document in the collection right now, so naturally it is the one returned

{'ids': [['id1']],
 'embeddings': None,
 'documents': [['KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling.Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.                        \n\n\n\n\n\n\n\n\n\n\n']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None]],
 'distances': [[0.886054515838623]]}

In [67]:
# Now create embeddings for the remaining 49 articles so they can be added to the collection
remaining = articles['Article'][1:].to_list()
vectors = openai_ef(remaining)
remaining

['HONG KONG: Asian markets started 2015 on an upswing in limited trading on Friday, with mainland Chinese stocks surging in Hong Kong on speculation Beijing may ease monetary policy to boost slowing growth.Hong Kong rose 1.07 percent, closing 252.78 points higher at 23857.82.Seoul closed up 0.57 percent, rising 10.85 points to 1,926.44, while Sydney gained 0.46 percent, or 24.89 points, to close at 5,435.9.Singapore edged up 0.19 percent, gaining 6.39 points to 3,371.54.Markets in mainland China, Japan, Taiwan, New Zealand, the Philippines, and Thailand remained closed for holidays.With mainland bourses shut until January 5, shares in Chinese developers and financial companies surged in Hong Kong, stoked by hopes that Beijing could ease monetary policy to support lagging growth in the world´s second-largest economy.China Vanke, the country´s biggest developer by sales, leapt 10.8 percent and the People´s Insurance Company (Group) of China Ltd. was up 5.51 percent in afternoon trading.T

In [63]:
# Now we only need to generate the ids for those articles
ids = [f"id{x}" for x in articles['index'][1:].to_list()] # For each value in the 'index' column, add id<index> to the ids list
ids


['id2',
 'id3',
 'id4',
 'id5',
 'id6',
 'id7',
 'id8',
 'id9',
 'id10',
 'id11',
 'id12',
 'id13',
 'id14',
 'id15',
 'id16',
 'id17',
 'id18',
 'id19',
 'id20',
 'id21',
 'id22',
 'id23',
 'id24',
 'id25',
 'id26',
 'id27',
 'id28',
 'id29',
 'id30',
 'id31',
 'id32',
 'id33',
 'id34',
 'id35',
 'id36',
 'id37',
 'id38',
 'id39',
 'id40',
 'id41',
 'id42',
 'id43',
 'id44',
 'id45',
 'id46',
 'id47',
 'id48',
 'id49',
 'id50']

In [65]:
collectionArticles.add(
    documents=remaining,
    ids=ids,
    embeddings=vectors
)

In [None]:
collectionArticles.count() # Check that all the remaining articles were added

50
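
Here all 49 remaining articles were embedded in a single call. For the full dataset of 2692 rows it may be safer to send the texts in smaller batches; whether and where request size limits apply is an assumption, not something tested here. The sketch below shows one way to do that. The names `batched`, `embed_in_batches` and `fake_embed` are made up for illustration, and `fake_embed` stands in for `openai_ef` so the example runs offline.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed, batch_size=16):
    """Call embed on each batch and concatenate the returned vectors in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed(batch))
    return vectors


def fake_embed(batch):
    """Offline stand-in for an embedding function: one number per text (its length)."""
    return [[float(len(text))] for text in batch]


texts = [f"article {n}" for n in range(1, 50)]  # 49 placeholder texts
vectors = embed_in_batches(texts, fake_embed, batch_size=16)
print(len(vectors))
```

With the real embedding function the call would look like `embed_in_batches(remaining, openai_ef)`.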

### Testing on all 50 articles
I will try to target this article: 'New York: Oil prices rebounded Tuesday from six-year lows as the dollar weakened...


In [None]:
query = 'Oil prices tuesday'
query_embeddings = openai_ef([query])

collectionArticles.query(
    query_embeddings=query_embeddings,
    n_results=3 # Show the 3 closest documents
)

{'ids': [['id26', 'id36', 'id50']],
 'embeddings': None,
 'documents': [['New York: Oil prices rebounded Tuesday from six-year lows as the dollar weakened after disappointing US economic data.The US benchmark, West Texas Intermediate (WTI) for March delivery, rose $1.08 (2.4 percent) to close at $45.16 a barrel.Brent North Sea crude for March settled at $49.60 a barrel in London, up $1.44 from Monday´s closing level."The market has found a bottom in the mid-40 range," said Kyle Cooper of IAF Advisors.Crude futures sank Monday to their lowest closing levels since early 2009. Crude has shed nearly 60 percent of its value in an almost uninterrupted slide since June due to a supply glut, largely boosted by robust US shale-oil production and weaker global economic growth.The greenback has been strengthening for months, making dollar-priced oil relatively more expensive, adding to the pressure on the oil market.A slight easing in the dollar Tuesday against major rival currencies such as the 

In [73]:
# A somewhat more complicated question
query = 'Give me stats about the Asian market'
query_embeddings = openai_ef([query])

collectionArticles.query(
    query_embeddings=query_embeddings,
    n_results=3 # Show the 3 closest documents
)

{'ids': [['id2', 'id18', 'id25']],
 'embeddings': None,
 'documents': [['HONG KONG: Asian markets started 2015 on an upswing in limited trading on Friday, with mainland Chinese stocks surging in Hong Kong on speculation Beijing may ease monetary policy to boost slowing growth.Hong Kong rose 1.07 percent, closing 252.78 points higher at 23857.82.Seoul closed up 0.57 percent, rising 10.85 points to 1,926.44, while Sydney gained 0.46 percent, or 24.89 points, to close at 5,435.9.Singapore edged up 0.19 percent, gaining 6.39 points to 3,371.54.Markets in mainland China, Japan, Taiwan, New Zealand, the Philippines, and Thailand remained closed for holidays.With mainland bourses shut until January 5, shares in Chinese developers and financial companies surged in Hong Kong, stoked by hopes that Beijing could ease monetary policy to support lagging growth in the world´s second-largest economy.China Vanke, the country´s biggest developer by sales, leapt 10.8 percent and the People´s Insurance C

# Saving all the embeddings and data we created
So far in this file we used chromadb.Client(), which means all the documents and embeddings we created were kept in memory. If the application crashed, that data would be lost.

ChromaDB has a solution for this: using PersistentClient().

This stores all the data in an SQLite database on our disk. Later we can extend this so the data is stored in a different database that is not on our machine.


In [None]:
client2 = chromadb.PersistentClient(path='./vectordb')

collectionPersistent = client2.get_or_create_collection(name="articles")

collectionPersistent.add( # Add the other 49 articles; we skip the first one
    documents=remaining,
    ids=ids,
    embeddings=vectors
)

collectionPersistent.count()

# This created an SQLite database in the ./vectordb folder

49
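
Because the data now survives between runs, re-running the cell above would try to add IDs that are already stored. One hedged way to handle that is to filter out the known IDs first. The helper name `select_new` and the sample values are assumptions for this sketch.

```python
def select_new(existing_ids, ids, documents, vectors):
    """Keep only the entries whose id is not already stored."""
    existing = set(existing_ids)
    kept = [
        (doc_id, document, vector)
        for doc_id, document, vector in zip(ids, documents, vectors)
        if doc_id not in existing
    ]
    if not kept:
        return [], [], []
    # Transpose the kept rows back into three parallel lists.
    new_ids, new_documents, new_vectors = (list(column) for column in zip(*kept))
    return new_ids, new_documents, new_vectors


# Made-up sample: "id2" is already stored, "id3" is new.
new_ids, new_documents, new_vectors = select_new(
    existing_ids=["id2"],
    ids=["id2", "id3"],
    documents=["first text", "second text"],
    vectors=[[0.1], [0.2]],
)
print(new_ids, new_documents, new_vectors)
```

In the notebook the stored IDs could be fetched with `collectionPersistent.get(ids=ids)["ids"]`. Alternatively, `collectionPersistent.upsert(...)` takes the same arguments as `add` and overwrites existing entries.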

## Chroma server

This is what we would use in production.

The Chroma server is started with the command `chroma run --path /db_path`

In [None]:

client3=chromadb.HttpClient(host='localhost', port=8000)

# Conclusion

With this I have explored how ChromaDB could be used for the project.

We will use ChromaDB to store the faculty's data as vectors in a vector database.
It is important that the documents are consistently formatted: for example, one document may mention Matematicka Analiza 1 while another refers to the same course as MATAN1. We will have to watch for such cases and try to reconcile them.
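
One possible way to reconcile such naming differences before embedding is to replace known abbreviations with a canonical name. The sketch below assumes a hand-maintained alias table. The table contents, the helper name `normalize_course_names` and the sample sentence are illustrative assumptions, not a finished solution.

```python
import re

# Hypothetical alias table: abbreviation -> canonical course name.
COURSE_ALIASES = {
    "MATAN1": "Matematicka Analiza 1",
}


def normalize_course_names(text, aliases=COURSE_ALIASES):
    """Replace each known alias (whole word, case-insensitive) with its canonical name."""
    for alias, canonical in aliases.items():
        pattern = re.compile(rf"\b{re.escape(alias)}\b", flags=re.IGNORECASE)
        # A function replacement avoids any backslash interpretation in the canonical name.
        text = pattern.sub(lambda _match, name=canonical: name, text)
    return text


print(normalize_course_names("The MATAN1 exam is in June."))
```

Running every document through such a step before `collection.add` would make both spellings land on the same wording, which should bring their embeddings closer together.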

Next steps: explore LangChain and how it can build on what I have explored here.

