# Purpose of this notebook

- experiment with LangChain
- use a local Llama 3.1 model for inference
- use a local embedding model served by Ollama (this run uses `mxbai-embed-large`)
- use ChromaDB as the vector store
- ETL pipeline for Yahoo
- basic backend infrastructure for a UI app


# Import

In [36]:
from datasets import load_dataset
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings
from pprint import pp
from uuid import uuid4
from langchain_core.documents import Document


# Load Dataset

In [14]:
ds = load_dataset("virattt/financial-qa-10K", split="train")

In [15]:
for i in range(3):
    print(ds[i])

{'question': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?', 'answer': 'NVIDIA initially focused on PC graphics.', 'context': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.', 'ticker': 'NVDA', 'filing': '2023_10K'}
{'question': 'What are some of the recent applications of GPU-powered deep learning as mentioned by NVIDIA?', 'answer': 'Recent applications of GPU-powered deep learning include recommendation systems, large language models, and generative AI.', 'context': 'Some of the most recent applications of GPU-powered deep learning include recommendation systems, which are AI algorithms trained to understand the preferences, previous decisions, and characteristics of people and products using data gathered about their interactions, large language models, which can recognize, summarize, translate, predict and generate text and other content based on 

In [39]:
# Load into a list of documents
doc_list = []
for d in ds:
    doc = Document(
        page_content = d['context'],
        id=str(uuid4())  # Document IDs are strings, so convert the UUID
    )

    doc_list.append(doc)

print('doc list len = ', len(doc_list))

doc list len =  7000
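
The loop above keeps only the `context` text, so fields such as `ticker` and `filing` are not available for filtering later. A minimal sketch, assuming you want to carry those fields along as metadata: the hypothetical helper `row_to_document_kwargs` builds the keyword arguments for `Document`, which accepts a `metadata` dict alongside `page_content`. The sample row is copied from the first dataset record printed above.

```python
from uuid import uuid4


def row_to_document_kwargs(row: dict) -> dict:
    """Build keyword arguments for a LangChain Document from one dataset row.

    Hypothetical helper: not part of LangChain or of the original notebook.
    """
    return {
        "page_content": row["context"],
        # Keep the fields that could be useful as metadata filters later on.
        "metadata": {"ticker": row["ticker"], "filing": row["filing"]},
        # Store the ID as a string.
        "id": str(uuid4()),
    }


# First dataset record, as printed earlier in the notebook.
sample_row = {
    "question": "What area did NVIDIA initially focus on before expanding to other computationally intensive fields?",
    "answer": "NVIDIA initially focused on PC graphics.",
    "context": "Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.",
    "ticker": "NVDA",
    "filing": "2023_10K",
}

kwargs = row_to_document_kwargs(sample_row)
print(kwargs["metadata"])

# In the notebook this would become: doc = Document(**kwargs)
```

With metadata stored, a later query could pass something like `filter={'ticker': 'NVDA'}` to `similarity_search`; the exact filter syntax is worth checking against the Chroma documentation.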


# Load an embedding model from Ollama

In [33]:
EMBEDDING_MODEL = 'mxbai-embed-large'

embedding_model = OllamaEmbeddings(model = EMBEDDING_MODEL)

res = embedding_model.embed_query(ds[0]['context'])

print(len(res))
print(type(res))
print(res)

1024
<class 'list'>
[-0.012722743, 0.003792157, 0.011633649, -0.019778889, 0.033497896, 0.021232214, -0.0050703213, -0.04749044, 0.039212067, -0.011423886, 0.00092675333, -0.0075196754, 0.020072406, -0.060655143, 0.048937686, 0.012976466, -0.019728072, -0.03815751, -0.044454478, -0.02004559, -0.010513954, 0.00944843, -0.070821114, -0.024431212, -0.026921213, 0.035399646, 0.0821816, -0.0055558123, 0.07539738, -0.0014763722, 0.02383156, 0.0076157562, -0.0033858055, -0.032196112, -0.04912068, -0.020661965, 0.012037749, -0.0068636076, -0.006680019, -0.021836737, 0.02364378, -0.07988179, 0.015870523, -0.060347095, -0.059334986, 0.017904038, 0.053027608, -0.057077076, 0.075122096, -0.027112547, 0.017910995, 0.029090865, -0.05499518, 0.0031623722, -0.03654831, 0.00054243923, 0.0046587987, 0.0028844, 0.018506281, 0.0074807373, 0.054119304, 0.050255097, 0.0038435147, -0.018957293, 0.010586346, 0.046244167, -0.013806931, -0.035196595, 0.020551465, 0.030549075, -0.02404075, -0.01719271, 0.0086887
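
The 1024 numbers above are only useful when compared with other vectors. A minimal sketch of one common comparison, cosine similarity, in plain Python with made-up 3-dimensional toy vectors (the real `mxbai-embed-large` vectors have 1024 dimensions, as printed above). The function name `cosine_similarity` and the toy data are illustrative, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings; not real model output.
query_vec = [1.0, 0.0, 0.0]
close_vec = [0.9, 0.1, 0.0]
far_vec = [0.0, 1.0, 0.0]

print("query vs close:", cosine_similarity(query_vec, close_vec))
print("query vs far:  ", cosine_similarity(query_vec, far_vec))
```

In the notebook, the same function could be applied to two results of `embedding_model.embed_query(...)` to check that related sentences score higher than unrelated ones.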

# Create a Chroma vector DB

In [None]:
vector_store = Chroma(
    collection_name = 'news_context',
    embedding_function=embedding_model,
    persist_directory='./news_db'
)

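# One string ID per document
uuids = [str(uuid4()) for _ in range(len(doc_list))]

# The keyword is `ids`; an unrecognized `uuids=` keyword would not set the stored IDs
vector_store.add_documents(documents=doc_list, ids=uuids)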
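
Embedding all 7,000 documents in a single `add_documents` call gives no feedback until it finishes. A sketch, assuming you would rather insert in fixed-size chunks: the hypothetical helper `batched` slices any list, and the batch size of 500 in the commented usage is an arbitrary choice, not a Chroma requirement.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Stand-in data so the sketch runs without a vector store.
fake_ids = [f"id-{i}" for i in range(7)]
for batch in batched(fake_ids, 3):
    print(len(batch), batch)

# In the notebook, the same loop could wrap the insert, for example:
# for docs, ids in zip(batched(doc_list, 500), batched(uuids, 500)):
#     vector_store.add_documents(documents=docs, ids=ids)
```

Chunking also makes it easier to resume after a failure, since only the failed batch needs to be retried.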

# Query Chroma

In [45]:
results = vector_store.similarity_search(
    'Nviddia is growing very fast because development of AI',
    k = 5,
    # filter={},  # optional metadata filter criteria
)

# Related methods: similarity_search_with_score, similarity_search_by_vector

for res in results:
    print(res.page_content)

Fueled by the sustained demand for exceptional 3D graphics and the scale of the gaming market, NVIDIA has leveraged its GPU architecture to create platforms for scientific computing, AI, data science, AV, robotics, metaverse and 3D internet applications.
NVIDIA has a platform strategy, bringing together hardware, systems, software, algorithms, libraries, and services to create unique value for the markets we serve.
A rapidly growing number of enterprises and startups across a broad range of industries use our GPUs and software to bring automation to the products and services they build. The transportation industry is turning to our platforms for autonomous driving; the healthcare industry is leveraging them for enhanced medical imaging and acceleration of drug discovery; and the financial services industry is using them for fraud detection. Professional designers use our GPUs and software to create visual effects in movies and to design buildings and products ranging from cell phones t

In [50]:
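# Convert the vector store to a retriever for use in LCEL chains.
# Note: MMR picks k results out of fetch_k candidates, so with fetch_k == k
# it can only reorder the same 5 hits; raise fetch_k for more diverse results.

retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={'k': 5, 'fetch_k': 5})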
results = retriever.invoke('google is more like a data mining company')

for res in results:
    print(res.page_content)

Alphabet is a collection of businesses, the largest of which is Google. Alphabet reports Google in two segments, Google Services and Google Cloud; all non-Google businesses are collectively reported as Other Bets.
For reporting purposes Google comprises two segments: Google Services and Google Cloud.
We report Google in two segments, Google Services and Google Cloud, and all non-Google businesses collectively as Other Bets.
While Google Search started as a way to find web pages, organized into ten blue links, we have driven technical advancements and product innovations that have transformed Google Search into a dynamic, multimodal experience.
With the abundance of data, there are opportunities to develop AI tools with powerful computational abilities to extract insights and value from the captured data.
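
The retriever above uses `search_type="mmr"` (maximal marginal relevance), which trades relevance to the query against redundancy among the results already chosen. Below is a plain-Python sketch of that idea on made-up 3-dimensional vectors; the function names and toy data are illustrative, and LangChain's own implementation may differ in detail. It also shows why `fetch_k` matters: when `k` equals the number of candidates, MMR can only reorder them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    lambda_mult = 1.0 means pure relevance; lower values favor diversity.
    """
    selected = []
    remaining = list(range(len(candidate_vecs)))

    def score(i):
        relevance = cosine(query_vec, candidate_vecs[i])
        # Similarity to the closest item already selected (0.0 if none yet).
        redundancy = max(
            (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 is different.
query = [1.0, 1.0, 0.0]
candidates = [
    [1.0, 0.20, 0.0],
    [1.0, 0.21, 0.0],
    [0.15, 1.0, 0.0],
]

by_relevance = sorted(
    range(len(candidates)),
    key=lambda i: cosine(query, candidates[i]),
    reverse=True,
)
print("top-2 by relevance:", by_relevance[:2])
print("top-2 by MMR:      ", mmr(query, candidates, k=2))
print("MMR with k == number of candidates:", mmr(query, candidates, k=3))
```

On this toy data, plain top-2 by relevance returns the two near-duplicates, while MMR swaps one of them for the less similar vector. With `k=3` every candidate is returned, only in a different order, which mirrors the `k=5, fetch_k=5` setting used in the retriever.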
