## About


---

This notebook contains the practice exercises, tests, and experiments I tried out during the project.

I will add more comments as I go, including things you can try in your own projects.



```
    pip install -r requirements
```


### The OpenAI module


To use the OpenAI API, you need to create an account and get an API key. With the key you can use any OpenAI service or model.


Below are two of the many ways to use an OpenAI LLM:


```
from langchain_openai import ChatOpenAI
from langchain.llms import OpenAI
```


In [None]:
# Import the modules needed to work with OpenAI and LangChain
from dotenv import dotenv_values
from langchain.schema import (AIMessage, HumanMessage, SystemMessage)
from langchain_openai import ChatOpenAI
from langchain.llms import OpenAI

# Load environment variables from the .env file
env = dotenv_values('.env')


In [None]:
# Note: the text-davinci-003 model has been retired, so this call fails with the 404 error shown below.
# Use ChatOpenAI with a current chat model instead.
llm = OpenAI(model_name='text-davinci-003', api_key=env['OPENAI_API_KEY'])
llm.invoke('Who is JFK in America')


NotFoundError: Error code: 404 - {'error': {'message': 'The model `text-davinci-003` has been deprecated, learn more here: https://platform.openai.com/docs/deprecations', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

In [None]:
# Chat models take a list of messages: a system message sets the role, a human message carries the request
chat = ChatOpenAI(api_key=env['OPENAI_API_KEY'],
                model='gpt-3.5-turbo',
                temperature=0.3)
message = [
    SystemMessage(content='You are an expert data scientist'),
    HumanMessage(content='Write a python script that trains a neural network on simulated data')
]

response = chat.invoke(message)


In [5]:
print(response.content,end='\n')

Sure! Here is an example of a Python script that trains a simple neural network on simulated data using the TensorFlow library:

```python
import numpy as np
import tensorflow as tf

# Generate simulated data
np.random.seed(0)
X = np.random.rand(1000, 2)
y = np.array([1 if x1 + x2 > 1 else 0 for x1, x2 in X])

# Define the neural network architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(4, activation='relu', input_shape=(2,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=10, batch_size=32)

# Evaluate the model
loss, accuracy = model.evaluate(X, y)
print(f'Loss: {loss}, Accuracy: {accuracy}')
```

In this script:
1. We generate simulated data where the target variable is 1 if the sum of the input features is greater than 1, otherwise 0.
2. We define a simple neural network with one hidden layer and one output

## Prompting 

Using the `PromptTemplate` class, you can create custom prompts for specific purposes.


In [8]:
from langchain_core.prompts import PromptTemplate

template = """
You are an expert data scientist with an expertise in building deep learning models. 
Explain the concept of {concept} in a couple of lines
"""

prompt = PromptTemplate(input_variables=['concept'],
                        template = template)


In [9]:
prompt

PromptTemplate(input_variables=['concept'], input_types={}, partial_variables={}, template='\nYou are an expert data scientist with an expertise in building deep learning models. \nExplain the concept of {concept} in a couple of lines\n')
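To preview the exact text the model will receive, the template can be rendered by hand. The sketch below uses plain `str.format`, which behaves like `PromptTemplate.format` for simple `{variable}` templates such as this one. It is only an illustration, not LangChain's implementation.

```python
# Same template text as in the cell above, rendered without LangChain.
template = """
You are an expert data scientist with an expertise in building deep learning models.
Explain the concept of {concept} in a couple of lines
"""

# Substitute the placeholder the same way prompt.format(concept=...) would.
rendered = template.format(concept="batch normalization")
print(rendered)
```

Printing the rendered prompt before sending it is a cheap way to catch a missing or misspelled variable.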

In [10]:
chat.invoke(prompt.format(concept='batch normalization'))

AIMessage(content='Batch normalization is a technique used in deep learning to normalize the input of each layer by adjusting and scaling the activations. This helps in reducing internal covariate shift, improving training speed, and allowing for higher learning rates.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 44, 'prompt_tokens': 37, 'total_tokens': 81, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-BE8D91DX7DQE1sDU9db1C440m0Fyp', 'finish_reason': 'stop', 'logprobs': None}, id='run-e12c5b83-4e19-4651-bab2-953fa2d07290-0', usage_metadata={'input_tokens': 37, 'output_tokens': 44, 'total_tokens': 81, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning'

## Using LLMChains

An `LLMChain` ties a prompt and a model together into one callable step, similar to a pipeline in sklearn.

In [40]:
chain('bias and variance')



{'concept': 'bias and variance',
 'text': "Bias refers to the error introduced by approximating a real-world problem, while variance refers to the error introduced by the model's sensitivity to fluctuations in the training data. Balancing bias and variance is crucial for creating a model that generalizes well to unseen data."}

In [13]:
from langchain.chains import LLMChain 

chain = LLMChain(llm=chat, prompt= prompt) 

print(chain.invoke('bias and variance'))

{'concept': 'bias and variance', 'text': 'Bias refers to the error introduced by approximating a real-world problem, while variance refers to the error introduced by sensitivity to fluctuations in the training data. Balancing bias and variance is crucial in building accurate and generalizable deep learning models.'}
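The shape of that result (the input key plus a `text` key) can be reproduced with a few lines of plain Python. This is a minimal sketch of the idea only: `make_chain` and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in for a real model so the example runs without an API key.

```python
from typing import Callable, Dict


def make_chain(template: str, llm: Callable[[str], str], input_key: str):
    """Return a function that formats the prompt, calls the model,
    and returns the input alongside the generated text."""
    def run(value: str) -> Dict[str, str]:
        prompt_text = template.format(**{input_key: value})
        return {input_key: value, "text": llm(prompt_text)}
    return run


# Hypothetical stand-in for a real model: it only reports the prompt size.
def fake_llm(prompt_text: str) -> str:
    return f"(model reply to a {len(prompt_text)}-character prompt)"


toy_chain = make_chain("Explain {concept} in a couple of lines", fake_llm, "concept")
result = toy_chain("bias and variance")
print(result)
```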


### SimpleSequentialChain

A `SimpleSequentialChain` lines up a series of prompts so that the result of one prompt is used as the input to the next.
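The piping idea can be sketched without any LLM at all. In the example below, `explain` and `simplify` are hypothetical stand-ins for two LLM-backed chains, and `run_sequential` is an illustrative helper, not part of LangChain.

```python
from functools import reduce
from typing import Callable, List


def run_sequential(steps: List[Callable[[str], str]], first_input: str) -> str:
    """Feed the output of each step into the next one."""
    return reduce(lambda text, step: step(text), steps, first_input)


# Hypothetical stand-ins for two LLM-backed chains.
def explain(concept: str) -> str:
    return f"Technical explanation of {concept}"


def simplify(description: str) -> str:
    return f"Toddler-friendly version of: {description}"


final = run_sequential([explain, simplify], "bias and variance")
print(final)
```

Each step takes exactly one string and returns one string, which is the same restriction `SimpleSequentialChain` places on the chains it connects.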

In [16]:
from langchain.chains import SimpleSequentialChain 

template2 = """You are an expert data scientist. Take the description of {ml_concept} and explain it as if to a toddler in 300 words"""
prompt2 = PromptTemplate(
        input_variables=['ml_concept'],
        template = template2
)
chain2 = LLMChain(llm=chat, prompt=prompt2)

all_chain = SimpleSequentialChain(chains=[chain,chain2], verbose=True)


In [18]:
response = all_chain.invoke('bias and variance')

print(response)



> Entering new SimpleSequentialChain chain...
Bias refers to the error introduced by approximating a real-world problem, leading to underfitting. Variance refers to the error introduced by modeling the noise in the training data, leading to overfitting. Balancing bias and variance is crucial for building accurate and generalizable deep learning models.
Imagine you are trying to build a puzzle. Bias is like when you try to put a piece in the wrong spot because you don't really understand what the picture is supposed to look like. This can make your puzzle look really messy and not quite right. 

Variance is like when you try to make the puzzle perfect by putting every piece exactly where it seems to fit, even if it doesn't really belong there. This can make your puzzle look too perfect and not like the original picture at all.

When you are building a puzzle, you want to find a balance between making sure each piece fits correctly and not forcing 

## Agents 

Agents are LLM-driven bots that decide which tools to call in order to perform a specific task.


In [19]:
from langchain_experimental.agents.agent_toolkits.python.base import create_python_agent
from langchain_experimental.utilities import PythonREPL
from langchain_experimental.tools.python.tool import PythonREPLTool

In [20]:
agentexecutor = create_python_agent(
    llm= OpenAI(api_key=env['OPENAI_API_KEY'],
                temperature= 0, max_tokens= 1000),
    tool= PythonREPLTool(),
    verbose= True 
)

In [21]:
agentexecutor.invoke('Find the roots of a quadratic equation funcion 3x^6 + 4x + 2 = 1')




> Entering new AgentExecutor chain...


Python REPL can execute arbitrary code. Use with caution.


I can use the quadratic formula to find the roots of a quadratic equation.
Action: Python_REPL
Action Input: (-4 + (4**2 - 4*3*2)**0.5) / (2*3)
Observation: 
Thought: I can use the quadratic formula to find the roots of a quadratic equation.
Action: Python_REPL
Action Input: (-4 - (4**2 - 4*3*2)**0.5) / (2*3)
Observation: 
Thought: I now know the final answer
Final Answer: -0.3333333333333333, -0.6666666666666666

> Finished chain.


{'input': 'Find the roots of a quadratic equation funcion 3x^6 + 4x + 2 = 1',
 'output': '-0.3333333333333333, -0.6666666666666666'}
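The agent's answer is worth checking by hand. The prompt says `3x^6`, which is not a quadratic, and both `Observation:` lines are empty, so the final answer did not come from the REPL. The agent also plugged in `c = 2` instead of moving the 1 across. Below is a minimal check, assuming the intended equation was `3x^2 + 4x + 2 = 1` (my reading of the prompt, not something the run confirms).

```python
import cmath


def quadratic_roots(a: float, b: float, c: float):
    """Roots of a*x**2 + b*x + c = 0 (cmath keeps this safe for negative discriminants)."""
    disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)


# Assumed intended equation: 3x^2 + 4x + 2 = 1, i.e. 3x^2 + 4x + 1 = 0
r1, r2 = quadratic_roots(3, 4, 1)
print(r1, r2)
```

Under that assumption the roots are -1/3 and -1, so the agent's second value (-0.666...) does not match. Agent output should be verified before it is trusted.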

### Running generated code with exec

LLM responses come back as strings, so running the Python code an agent generates means executing a string. There are two ways to do this:

1. `exec` runs a string of code and, given a namespace dict, lets you read back the variables the code defines.
2. `subprocess.run` runs the code in a separate process (it can even give access to the shell, which is not advisable).

In [29]:
code = """ 
def greet(f):
    return f'Hello {f}' 
new = greet('Dapo')
"""
namespace = {}
exec(code, namespace) 

In [31]:
print(namespace['new'])

Hello Dapo


In [32]:
import subprocess 

result = subprocess.run(['python', '-c', 'print("Hello world")'],
                        capture_output=True,text=True) 

print(result.stdout.strip())

Hello world
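`exec` itself returns `None`, so anything the code prints is lost unless stdout is redirected. Here is a small sketch that returns both the namespace and the printed text. `run_and_capture` is a hypothetical helper name, and it does not sandbox the code in any way.

```python
import contextlib
import io


def run_and_capture(code: str):
    """Execute a code string; return its namespace and anything it printed."""
    namespace = {}
    buffer = io.StringIO()
    # Everything printed inside the with-block goes to the buffer, not the console.
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return namespace, buffer.getvalue()


ns, printed = run_and_capture("total = 2 + 3\nprint('total is', total)")
print(ns["total"], repr(printed))
```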


In [None]:
# Testing out python code generation

# first step is to have something to take in an input and prompt
# then code generation output
# the code is then passed as an input into the agent that calls the exec function to run the code using the input variable 


In [33]:
from langchain.llms import OpenAI 
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain 


chat = OpenAI(api_key=env['OPENAI_API_KEY'], temperature=0.2,
              max_tokens= 1000)

template3 = """ 
        You are an expert data scientist.
        Create a python script that takes in an integer argument {number} and returns the number multiplied by 4, 
        then create a variable called answer to store the value of the function
"""

prompt3 = PromptTemplate(input_variables= ['number'], 
                         template= template3,
                         )

chain3 = LLMChain(llm=chat, prompt=prompt3)

response = chain3.invoke(5)

In [38]:
namespace = {}
exec(response['text'].strip(),namespace )

20


In [39]:
namespace['answer']

20
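Passing the raw response to `exec` only works when the model returns bare code. Replies often wrap the code in a markdown fence with some chat around it, so it can help to extract the fenced part first. This is a sketch under that assumption: `extract_code` is a hypothetical helper and the `reply` string is made up for the example.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def extract_code(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if there is none."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    return (match.group(1) if match else text).strip()


# Made-up model reply with chat text around a fenced code block.
reply = (
    f"Here is the script:\n{FENCE}python\n"
    "def times4(n):\n    return n * 4\nanswer = times4(5)\n"
    f"{FENCE}\nHope that helps!"
)

namespace = {}
exec(extract_code(reply), namespace)
print(namespace["answer"])
```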

In [5]:
from langchain.memory import  ConversationBufferMemory,ConversationBufferWindowMemory
from langchain_openai import ChatOpenAI 
from langchain.chains import ConversationChain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv 

load_dotenv('.env')

True

In [24]:
chain = ConversationChain(llm=ChatOpenAI(temperature=0.3))
                          

In [21]:
chain.prompt.template 

'The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.\n\nCurrent conversation:\n{history}\nHuman: {input}\nAI:'

In [25]:
chain.invoke({'input':'bias and variance'})

{'input': 'bias and variance',
 'history': '',
 'response': "Ah, bias and variance are key concepts in machine learning! Bias refers to the error introduced by approximating a real-world problem, while variance is the error introduced by modeling the noise in the training data. Balancing bias and variance is crucial for building a model that generalizes well to unseen data. It's like finding the sweet spot between underfitting and overfitting. Do you want to know more about how to manage bias and variance in machine learning models?"}

In [26]:
chain.invoke("argentina's last match was against who and when?")

{'input': "argentina's last match was against who and when?",
 'history': "Human: bias and variance\nAI: Ah, bias and variance are key concepts in machine learning! Bias refers to the error introduced by approximating a real-world problem, while variance is the error introduced by modeling the noise in the training data. Balancing bias and variance is crucial for building a model that generalizes well to unseen data. It's like finding the sweet spot between underfitting and overfitting. Do you want to know more about how to manage bias and variance in machine learning models?",
 'response': "Argentina's last match was against Brazil on July 10, 2021. They played in the final of the Copa America tournament. Argentina won the match 1-0, with Angel Di Maria scoring the only goal of the game. It was a historic victory for Argentina, as it was their first Copa America title since 1993."}

In [28]:
print(chain.memory.buffer)

Human: bias and variance
AI: Ah, bias and variance are key concepts in machine learning! Bias refers to the error introduced by approximating a real-world problem, while variance is the error introduced by modeling the noise in the training data. Balancing bias and variance is crucial for building a model that generalizes well to unseen data. It's like finding the sweet spot between underfitting and overfitting. Do you want to know more about how to manage bias and variance in machine learning models?
Human: argentina's last match was against who and when?
AI: Argentina's last match was against Brazil on July 10, 2021. They played in the final of the Copa America tournament. Argentina won the match 1-0, with Angel Di Maria scoring the only goal of the game. It was a historic victory for Argentina, as it was their first Copa America title since 1993.
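The buffer above is just the running transcript, which gets pasted into the `{history}` slot of the prompt on every call. Here is a toy sketch of that idea, including the "keep only the last k turns" behaviour that a window memory provides. `ToyBufferMemory` is a hypothetical class, not LangChain's implementation, and the AI replies are shortened placeholders.

```python
from typing import List, Optional, Tuple


class ToyBufferMemory:
    """Keeps (human, ai) turns; k=None keeps everything, k=n (n >= 1) keeps the last n."""

    def __init__(self, k: Optional[int] = None):
        self.k = k
        self.turns: List[Tuple[str, str]] = []

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    @property
    def buffer(self) -> str:
        kept = self.turns if self.k is None else self.turns[-self.k:]
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in kept)


memory = ToyBufferMemory(k=1)
memory.save("bias and variance", "They are key ML concepts.")
memory.save("argentina's last match?", "Against Brazil.")
print(memory.buffer)
```

With `k=1` only the most recent exchange is kept, which keeps the prompt short at the cost of forgetting earlier turns.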


## Embeddings 

---

Embeddings turn text into vectors that can be searched by similarity. Here I use `OllamaEmbeddings` with the `llama3.2` model, which runs locally and saves money on API calls.


In [21]:
from langchain_community.vectorstores import FAISS 
from langchain_ollama import OllamaEmbeddings

In [23]:
db = FAISS.from_texts(['hello world', 'hello world 2', "hi there"], 
                 embedding=OllamaEmbeddings(model='llama3.2')
                
                 )

In [12]:
# Perform a similarity search
query = "hello"
results = db.similarity_search(query)  # returns the top k matches; k defaults to 4

# Print the results
for result in results:
    print(result.page_content)

hi there
hello world
hello world 2


In [13]:
results

[Document(id='dba82005-d6f9-4dbe-997c-413eec0fd272', metadata={}, page_content='hi there'),
 Document(id='df3f65a9-7bbf-4b58-ba59-ca818be1f42d', metadata={}, page_content='hello world'),
 Document(id='63b197ab-e275-47e2-9ae4-b76d821c6308', metadata={}, page_content='hello world 2')]
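Under the hood, a similarity search embeds the query and returns the stored texts whose vectors are closest to it. Below is a toy sketch with made-up 2-D vectors and Euclidean (L2) distance, which as far as I know is the default metric in LangChain's FAISS wrapper. `l2_search` is a hypothetical helper, and real embeddings have far more dimensions.

```python
import math
from typing import List, Sequence, Tuple


def l2_search(query: Sequence[float], vectors: List[Sequence[float]], k: int = 4) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k vectors closest to the query."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up 2-D "embeddings" for illustration only.
texts = ["hello world", "hello world 2", "hi there"]
vectors = [(1.0, 0.0), (0.9, 0.2), (0.0, 1.0)]

for idx, dist in l2_search((1.0, 0.1), vectors, k=2):
    print(texts[idx], round(dist, 3))
```

The ranking depends entirely on the embedding model, which is why "hi there" can outrank "hello world" for the query "hello" in the real output above.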

### Saving embeddings

The index is saved locally, in whichever directory you point it to.

In [14]:
db.save_local('faiss_index')

In [23]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

db = FAISS.load_local('faiss_index', embeddings=HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2'),
                 allow_dangerous_deserialization=True)

In [28]:
list(db.index_to_docstore_id.values())
# db.index.ntotal

['0a217fb3-335f-467e-b1e3-96fd00c4bf36',
 '32f5e753-f0e3-423f-bc92-29eccb361252',
 '1bfbf5ff-0239-4694-9bbc-1ab544de46ef',
 '1d56ac8e-8ba4-4b4d-992a-d7fc0ef3ca5f',
 'c3abc858-7c19-4ac6-a556-549c2eade1ca',
 '7f277dd7-b2a1-4de3-ae9d-ecda565a7e67',
 'e4e3e7cf-6b9c-4bfc-8ded-5b56d510d103',
 '04eb8f7e-0940-40c4-ac95-26757e8d7c1c',
 '8cb1efba-0923-4724-aff9-d1bbe21bbf3e',
 'cf6e33c4-305c-4fef-910b-60a91118a374',
 '6fb3f712-6d75-4de5-8cc2-a1dc842bbdd0',
 'd11729d5-328c-4367-b667-acd0eab80528',
 '64efcc17-f76f-4d18-93c0-59600f319c67',
 'b609376c-49d9-48b3-9b75-5c310f113057']

In [31]:
db.docstore.search('b609376c-49d9-48b3-9b75-5c310f113057').metadata['source']

'https://punchng.com/eid-el-fitr-nigerians-can-overcome-corruption-with-stronger-resolve-efcc-chairman/'

You can also load an existing vector store and update it with new data.

Merge the new store into the old one, then save again when you are done.

In [None]:
newdb = FAISS.load_local('faiss_index', OllamaEmbeddings(model='gemma3'), allow_dangerous_deserialization=True)

db2 = FAISS.from_texts(['food is good','my guy how are you', 'hello there'],
                 embedding=OllamaEmbeddings(model='gemma3')
                 )

In [17]:
newdb.merge_from(db2)
newdb.save_local('faiss_index')

In [18]:
newdb.similarity_search('hello')

[Document(id='ba9a0929-3f3a-4a52-816f-9b6a270f21ea', metadata={}, page_content='hello there'),
 Document(id='dba82005-d6f9-4dbe-997c-413eec0fd272', metadata={}, page_content='hi there'),
 Document(id='df3f65a9-7bbf-4b58-ba59-ca818be1f42d', metadata={}, page_content='hello world'),
 Document(id='c2e1ea99-281c-4dfa-a345-097cecbc0fc8', metadata={}, page_content='my guy how are you')]

In [20]:
newdb.similarity_search('food')

[Document(id='bdf4c390-865b-4bbf-ad03-b6091f67bd8c', metadata={}, page_content='food is good'),
 Document(id='ba9a0929-3f3a-4a52-816f-9b6a270f21ea', metadata={}, page_content='hello there'),
 Document(id='c2e1ea99-281c-4dfa-a345-097cecbc0fc8', metadata={}, page_content='my guy how are you'),
 Document(id='63b197ab-e275-47e2-9ae4-b76d821c6308', metadata={}, page_content='hello world 2')]

### Another way to do embedding using LlamaIndex

In this use case, I have a PDF file that I want to load and summarize. It is stored in the `data/` directory.

The logic is to read the file with `SimpleDirectoryReader`, embed it with `HuggingFaceEmbedding`, and build an index from the embeddings.

The index can then serve as a knowledge base for an LLM.

In [1]:
from llama_index.core import VectorStoreIndex, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

from llama_index.core import SimpleDirectoryReader
# from langchain_community.document_loaders import PyPDFLoader
# load data()
loader = SimpleDirectoryReader(
    input_dir='data',
    required_exts=['.pdf'],
    recursive=True
)



In [2]:
docs= loader.load_data()
embed_model = HuggingFaceEmbedding( model_name="BAAI/bge-large-en-v1.5", trust_remote_code=True)


In [4]:
from llama_index.core import Settings
# ====== Create vector store and upload indexed data ======
Settings.embed_model = embed_model # we specify the embedding model to be used
index = VectorStoreIndex.from_documents(docs)

In [5]:
Settings

_Settings(_llm=None, _embed_model=HuggingFaceEmbedding(model_name='BAAI/bge-large-en-v1.5', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000001F360646010>, num_workers=None, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None), _callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000001F360646010>, _tokenizer=None, _node_parser=SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000001F360646010>, id_func=<function default_id_func at 0x000001F350649BC0>, chunk_size=1024, chunk_overlap=200, separator=' ', paragraph_separator='\n\n\n', secondary_chunking_regex='[^,.;。？！]+[,.;。？！]?|[,.;。？！]'), _prompt_helper=None, _transformations=[SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 

In [6]:
from llama_index.llms.ollama import Ollama

# setting up the llm

llm = Ollama(model="gemma3", request_timeout=500) 

# ====== Setup a query engine on the index previously created ======
Settings.llm = llm # specifying the llm to be used
query_engine = index.as_query_engine(streaming=True, similarity_top_k=4)

In [7]:
from llama_index.core.prompts import PromptTemplate 

In [8]:
qa_prompt_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above I want you to think step by step "
    "to answer the query in a crisp manner. "
    "In case you don't know the answer, say 'I don't know!'.\n"
    "Query: {query_str}\n"
    "Answer: "
)

qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt_tmpl})



In [None]:
docs

In [9]:
response = query_engine.query('What is the document about?')
print(response)

The document is about the admission requirements and procedures for the Winter 2025-2026 Master in Data Science program at the University of Luxembourg, specifically for third-country national applicants. It details required documents, application deadlines, tuition fees, and contact information.


## Document loaders in detail



Still on embeddings, this section looks at loading documents from different file types before embedding them.


In [2]:
from langchain_community.document_loaders import TextLoader, CSVLoader, JSONLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

In [None]:
text_loader = TextLoader('data/onedrive.txt')
text_docs = text_loader.load()
text_docs

[Document(metadata={'source': 'data/onedrive.txt'}, page_content='Microsoft OneDrive is a cloud storage service that allows users to store files and access them from anywhere with an internet connection. It is integrated with the Microsoft 365 ecosystem, making it a convenient choice for individuals and businesses.\n\n**Benefits:**\n1. **Accessibility:** Files stored in OneDrive can be accessed from any device, including PCs, smartphones, and tablets.\n2. **Collaboration:** OneDrive enables real-time collaboration on documents, spreadsheets, and presentations through Microsoft Office apps.\n3. **Backup and Sync:** It provides automatic backup and synchronization of files across devices, ensuring data is always up-to-date.\n4. **Security:** OneDrive offers robust security features, including encryption, ransomware detection, and recovery options.\n5. **Integration:** Seamless integration with Microsoft 365 apps like Word, Excel, and Teams enhances productivity.\n\n**Use Cases:**\n- **Pe

In [None]:
# the text loader is used to load the text file and create a list of Document objects. Each Document object contains the text content and metadata (if any) associated with the document.
# The text_splitter is used to split the loaded text documents into smaller chunks based on the specified separators and chunk size. This is useful for processing large documents in smaller, manageable pieces.
text_splitter = RecursiveCharacterTextSplitter(separators=['\n\n', '\n','.'],chunk_size=200, chunk_overlap=20)
text_docs = text_splitter.split_documents(text_docs)
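To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified splitter that cuts fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one tries the separators in order before falling back to smaller pieces), and `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> List[str]:
    """Fixed-size character windows; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears partly in the next chunk, which helps retrieval keep some context.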

In [None]:
# chunks of the data
text_docs

[Document(metadata={'source': 'data/onedrive.txt'}, page_content='Microsoft OneDrive is a cloud storage service that allows users to store files and access them from anywhere with an internet connection'),
 Document(metadata={'source': 'data/onedrive.txt'}, page_content='. It is integrated with the Microsoft 365 ecosystem, making it a convenient choice for individuals and businesses.'),
 Document(metadata={'source': 'data/onedrive.txt'}, page_content='**Benefits:**\n1. **Accessibility:** Files stored in OneDrive can be accessed from any device, including PCs, smartphones, and tablets.'),
 Document(metadata={'source': 'data/onedrive.txt'}, page_content='2. **Collaboration:** OneDrive enables real-time collaboration on documents, spreadsheets, and presentations through Microsoft Office apps.'),
 Document(metadata={'source': 'data/onedrive.txt'}, page_content='3. **Backup and Sync:** It provides automatic backup and synchronization of files across devices, ensuring data is always up-to-da

In [None]:
# we need to create a vector store from the text documents. 
# The FAISS vector store is used to efficiently search and retrieve similar documents based on their embeddings.
# The OllamaEmbeddings class is used to generate embeddings for the text documents using a specified embedding model.

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_ollama.embeddings import OllamaEmbeddings

db = FAISS.from_documents(text_docs, embedding=OllamaEmbeddings(model='llama3.2')) 
resp = db.similarity_search('onedrive', k=3) # k is the number of top results to retrieve

for x in resp:
    print(x.page_content)


Microsoft OneDrive is a cloud storage service that allows users to store files and access them from anywhere with an internet connection
4. **Security:** OneDrive offers robust security features, including encryption, ransomware detection, and recovery options.
Overall, Microsoft OneDrive is a versatile and reliable cloud storage solution that caters to both personal and professional needs.


Use a `CSVLoader` when your data is in CSV format.

The first way to load the document:

In [51]:
csv_loader= CSVLoader('data/ngxdata.csv', source_column='Date')
# source_column is the column whose value is stored as each document's 'source' metadata
csv_docs = csv_loader.load()
len(csv_docs)

75177

In [52]:
csv_docs[0].metadata

{'source': '12/31/2021', 'row': 0}

In [53]:
csv_docs[:2]

[Document(metadata={'source': '12/31/2021', 'row': 0}, page_content='Date: 12/31/2021\nSymbol: ABCTRANS\nPclose: 0.31\nOpen: 0.31\nHigh: 0.31\nLow: 0.31\nClose: 0.31\n% Change: 0\nVolume: 3580.00\nValue: 1199.80'),
 Document(metadata={'source': '12/31/2021', 'row': 1}, page_content='Date: 12/31/2021\nSymbol: ACCESS\nPclose: 9.1\nOpen: 9.1\nHigh: 9.5\nLow: 9.1\nClose: 9.3\n% Change: 0.2\nVolume: 32102347.00\nValue: 301141911.90')]
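The output shows the pattern: each CSV row becomes one document whose text is `column: value` lines, with the source column and the row number kept as metadata. The sketch below mimics that shape using only the standard library. `rows_to_docs` is a hypothetical helper that returns plain dicts, and the sample keeps only four of the real columns.

```python
import csv
import io

# Two rows taken from the loader output above, trimmed to four columns.
raw = """Date,Symbol,Close,Volume
12/31/2021,ABCTRANS,0.31,3580.00
12/31/2021,ACCESS,9.3,32102347.00
"""


def rows_to_docs(csv_text: str, source_column: str):
    """Mimic the shape of the loader output with plain dicts."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": row[source_column], "row": i}})
    return docs


toy_docs = rows_to_docs(raw, source_column="Date")
print(toy_docs[0])
```

With one document per row, a file with tens of thousands of rows produces tens of thousands of embeddings, which explains the long embedding times further down.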

In [None]:
[vdoc.docstore.search(doc_id).page_content for doc_id in list(vdoc.index_to_docstore_id.values())]

[Document(id='a681b85c-8659-4a3b-a3a8-298f61fbf6b2', metadata={'source': 'Stevens-Brown', 'row': 0}, page_content='Date: 2024-01-01\nSymbol: BEAR\nCompany: Stevens-Brown\nSector: Technology\nOpen: 335.53\nHigh: 338.72\nLow: 332.25\nClose: 334.6\nVolume: 654598\nValue: 219028490.8'),
 Document(id='a5bfbbc8-bd2e-450d-a715-08214489c164', metadata={'source': 'Hill LLC', 'row': 1}, page_content='Date: 2024-01-01\nSymbol: BLAN\nCompany: Hill LLC\nSector: Consumer\nOpen: 745.75\nHigh: 747.47\nLow: 730.11\nClose: 731.89\nVolume: 872708\nValue: 638726258.12'),
 Document(id='1e921962-a4d0-430e-bc51-ac7b59c42e08', metadata={'source': 'Smith, Johnson and Mendoza', 'row': 2}, page_content='Date: 2024-01-01\nSymbol: COLE\nCompany: Smith, Johnson and Mendoza\nSector: Consumer\nOpen: 874.44\nHigh: 908.21\nLow: 873.96\nClose: 902.0\nVolume: 720105\nValue: 649534710.0'),
 Document(id='7d20702a-a998-475f-a255-8c846703e1a4', metadata={'source': 'Nichols, Ward and Miller', 'row': 3}, page_content='Date: 20

In [54]:
vdoc = FAISS.from_documents(csv_docs, embedding=HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2'))

KeyboardInterrupt: 

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
    You are a helpful data scientist. Use the context below to answer the question.
    Context: {context}

    Question: {question}
    Answer:
    """,
)
# RetrievalQA is a LangChain chain that combines a retriever with a question-answering model.

retriever = vdoc.as_retriever(search_kwargs={"k": 5}) # k is the number of top results to retrieve



chain = RetrievalQA.from_chain_type(
        llm=llm,chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt, "verbose": True},
        return_source_documents=True
    )


response = chain.invoke('What is the name of the company that has the highest number of trades in terms of volume?')
print(response['result'])




> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
    You are a helpful data scientist. Use the context below to answer the question.
    Context: Date: 2024-03-12
Symbol: MILL
Company: Kim and Sons
Sector: Finance
Open: 272.57
High: 274.36
Low: 270.66
Close: 272.12
Volume: 619238
Value: 168507044.56

Date: 2024-01-03
Symbol: MILL
Company: Kim and Sons
Sector: Finance
Open: 331.38
High: 338.53
Low: 330.81
Close: 337.1
Volume: 280231
Value: 94465870.1

Date: 2024-02-01
Symbol: MILL
Company: Kim and Sons
Sector: Finance
Open: 297.67
High: 328.6
Low: 296.29
Close: 325.6
Volume: 1434276
Value: 467000265.6

Date: 2024-01-31
Symbol: MILL
Company: Kim and Sons
Sector: Finance
Open: 289.9
High: 299.21
Low: 288.32
Close: 297.48
Volume: 383546
Value: 114097264.08

Date: 2024-03-06
Symbol: MILL
Company: Kim and Sons
Sector: Finance
Open: 314.72
High: 317.77
Low: 308.47
Close: 309.32
Volume: 973472
Value: 3011143

The initial embedding of the large dataset was cancelled because it was taking too long.

`FastEmbedEmbeddings` is reported to be faster for this use case. You need to install the `fastembed` package to use the LangChain embedding class:
```
pip install fastembed
```

You can read more [here](https://qdrant.github.io/fastembed/examples/Supported_Models/).

Another way to go is to embed in batches.

In [56]:
from langchain_community.embeddings import FastEmbedEmbeddings

embedding = FastEmbedEmbeddings(model_name='BAAI/bge-small-en-v1.5', trust_remote_code=True)


Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00,  8.18it/s]


In [58]:
from tqdm import tqdm
from typing import List

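# Note: this reassignment replaces the FastEmbed model created in the previous cell
embedding = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

def create_faiss_index_batched(documents: List, batch_size: int = 2000):
    """Embed documents batch by batch and merge them into a single FAISS index."""
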
    # Process in batches
    vdoc = None
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i + batch_size]
        
        # Create or update FAISS index
        if vdoc is None:
            vdoc = FAISS.from_documents(batch, embedding=embedding)
        else:
            batch_vdoc = FAISS.from_documents(batch, embedding=embedding)
            vdoc.merge_from(batch_vdoc)
            
        # Optional: save a checkpoint every 5 batches
        if i % (batch_size * 5) == 0 and i > 0:
            vdoc.save_local(f"faiss_checkpoint_{i}")
            
    return vdoc

# Usage with progress bar
vdoc = create_faiss_index_batched(csv_docs, batch_size=2000)

100%|██████████| 38/38 [20:50<00:00, 32.92s/it]
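One caveat about asking this kind of index for totals: a retriever with a small `k` hands the LLM only a few rows, so questions that aggregate over many rows (monthly totals, top-N rankings) are better computed directly from the data. A minimal sketch using only the standard library follows. The rows are a handful taken from the retrieved context shown in this notebook, so the sums are not the real monthly totals, and `monthly_volume` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime

# (date, symbol, volume) rows copied from retrieved context in this notebook.
rows = [
    ("5/30/2023", "TOTAL", 486135.0),
    ("5/31/2023", "TOTAL", 349703.0),
    ("5/11/2023", "SEPLAT", 1797.0),
    ("5/25/2023", "TOTAL", 719279.0),
    ("11/28/2023", "SEPLAT", 27230.0),
]


def monthly_volume(rows, year: int, month: int):
    """Sum volume per symbol for one calendar month (dates are MM/DD/YYYY)."""
    totals = defaultdict(float)
    for date_text, symbol, volume in rows:
        day = datetime.strptime(date_text, "%m/%d/%Y")
        if (day.year, day.month) == (year, month):
            totals[symbol] += volume
    return dict(totals)


may_totals = monthly_volume(rows, 2023, 5)
print(sorted(may_totals.items(), key=lambda kv: kv[1], reverse=True))
```

The computed numbers can then be passed to the LLM as context if a written summary is still wanted.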


In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
    You are an expert data scientist. 
    You are given a dataset of trades in a CSV file.
    The dataset contains the following columns: Date,Symbol,Pclose(Previous Closing price),Open,High,Low,Close,% Change,Volume(traded),Value(Traded)
    The dataset contains the following data: {context}.

    Your task is to analyze the dataset, answer the question and provide insights on the trades, based on the data provided.

    Note that the dates in the data are in the American format MM/DD/YYYY.

    Question: {question}
    Answer:
    """,
)
# RetrievalQA is a LangChain chain that combines a retriever with a question-answering model.

retriever = vdoc.as_retriever(search_kwargs={"k": 4}) # k is the number of top results to retrieve

llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0.2)


chain = RetrievalQA.from_chain_type(
        llm=llm,chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt, "verbose": True},
        return_source_documents=True
    )


response = chain.invoke('List the top 5 symbols based on the total volume traded in the entire month May 2023, also with their close price on the last day of the month? ')
print(response['result'])




> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
    You are an expert data scientist. 
    You are given a dataset of trades in a CSV file.
    The dataset contains the following columns: Date,Symbol,Pclose(Previous Closing price),Open,High,Low,Close,% Change,Volume(traded),Value(Traded)
    The dataset contains the following data: Date: 5/30/2023
Symbol: TOTAL
Pclose: 249
Open: 249
High: 272
Low: 272
Close: 272
% Change: 23
Volume: 486135.00
Value: 131286936.50

Date: 5/31/2023
Symbol: TOTAL
Pclose: 272
Open: 272
High: 272
Low: 272
Close: 272
% Change: 0
Volume: 349703.00
Value: 95030589.00

Date: 5/11/2023
Symbol: SEPLAT
Pclose: 1175
Open: 1175
High: 1175
Low: 1175
Close: 1175
% Change: 0
Volume: 1797.00
Value: 2146806.00

Date: 5/25/2023
Symbol: TOTAL
Pclose: 233
Open: 233
High: 249
Low: 249
Close: 249
% Change: 16
Volume: 719279.00
Value: 179714147.60.

    Your task is to analyze the dataset, a

In [64]:
d = chain.invoke('What was the total volume traded for SEPLAT in May 2023?')
print(d['result'], d.get('source_documents')[0].page_content)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
    You are a helpful data scientist. Use the context below to answer the question. 
    Note that the date in the data is in the america format MM/DD/YYYY.

    Context: Date: 11/28/2023
Symbol: SEPLAT
Pclose: 2100.1
Open: 2100.1
High: 2100.1
Low: 2100.1
Close: 2100.1
% Change: 0
Volume: 27230.00
Value: 62614526.10

Date: 11/29/2023
Symbol: SEPLAT
Pclose: 2100.1
Open: 2100.1
High: 2310.1
Low: 2310.1
Close: 2310.1
% Change: 210
Volume: 190206.00
Value: 438587142.80

Date: 5/11/2023
Symbol: SEPLAT
Pclose: 1175
Open: 1175
High: 1175
Low: 1175
Close: 1175
% Change: 0
Volume: 1797.00
Value: 2146806.00

Date: 2/15/2024
Symbol: SEPLAT
Pclose: 3370
Open: 3370
High: 3370
Low: 3370
Close: 3370
% Change: 0
Volume: 44386.00
Value: 149145178.50

Date: 12/6/2023
Symbol: SEPLAT
Pclose: 2310.1
Open: 2310.1
High: 2310.1
Low: 2310.1
Close: 2310.1
% Change: 0
Volume: 21
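
The retriever only passes `k=4` rows to the model, and in the run above most of the retrieved rows are not from May 2023, so the model cannot see every row it would need to add up. A total like this can instead be computed directly from the loaded rows. The sketch below is one way to do that, assuming each `page_content` keeps the `Key: value` layout shown in the outputs. The helper names `parse_record` and `total_volume` are made up for this example, and the sample rows are trimmed copies of rows from the outputs above.

```python
from datetime import datetime


def parse_record(page_content):
    """Turn the 'Key: value' lines of one row into a dict."""
    record = {}
    for line in page_content.splitlines():
        key, sep, value = line.partition(':')
        if sep:
            record[key.strip()] = value.strip()
    return record


def total_volume(page_contents, symbol, year, month):
    """Sum the Volume of every row for `symbol` in the given month."""
    total = 0.0
    for content in page_contents:
        rec = parse_record(content)
        date = datetime.strptime(rec['Date'], '%m/%d/%Y')  # American MM/DD/YYYY
        if rec['Symbol'] == symbol and (date.year, date.month) == (year, month):
            total += float(rec['Volume'])
    return total


# Rows taken from the outputs above, reduced to the fields used here
sample_rows = [
    "Date: 5/30/2023\nSymbol: TOTAL\nVolume: 486135.00",
    "Date: 5/31/2023\nSymbol: TOTAL\nVolume: 349703.00",
    "Date: 5/25/2023\nSymbol: TOTAL\nVolume: 719279.00",
    "Date: 5/11/2023\nSymbol: SEPLAT\nVolume: 1797.00",
    "Date: 11/28/2023\nSymbol: SEPLAT\nVolume: 27230.00",
]

seplat_may = total_volume(sample_rows, 'SEPLAT', 2023, 5)
total_may = total_volume(sample_rows, 'TOTAL', 2023, 5)
print(seplat_may, total_may)
```

In the notebook the same function could be called with `[doc.page_content for doc in csv_docs]`. The numbers printed here cover only the five sample rows, not the full dataset.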

In [37]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
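from dotenv import load_dotenv

load_dotenv()  # make OPENAI_API_KEY from .env available to ChatOpenAI()

# Join the text of every row into one long string
dd = ' '.join([doc.page_content for doc in csv_docs])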

template = """
        You are an expert data scientist.
        You are given a dataset of trades in a CSV file.
        The dataset contains the following columns: Date,Symbol,Company,Sector,Open,High,Low,Close,Volume,Value.
        The dataset contains the following data: {data}.

        Your task is to analyze the dataset, answer the question and provide insights on the trades, based on the data provided.

        The question is: {question}

"""
prompt = PromptTemplate(input_variables=['data','question'],
                        template=template,
                        )

llm = ChatOpenAI()
chain = prompt | llm | StrOutputParser()

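# Only the first 15,000 characters of the data are sent, so the answer reflects that slice and not the whole dataset.
response = chain.invoke({'data': dd.strip()[:15000],
                         'question': 'Who are the top 3 most traded companies in the dataset in terms of volume and value?'})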

In [35]:
len(dd.strip())

157380
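
With 157,380 characters in total, the 15,000-character slice sent to the model covers only a small share of the data. One possible alternative is to group whole rows into chunks that each fit the limit and query them one at a time. The sketch below shows only the chunking step, using a hypothetical helper `chunk_records`. How the per-chunk answers are combined is left open.

```python
def chunk_records(records, max_chars):
    """Group whole records into space-joined chunks of at most max_chars characters.

    A single record longer than max_chars still gets its own (oversized) chunk.
    """
    chunks, current, size = [], [], 0
    for rec in records:
        extra = len(rec) + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > max_chars:
            chunks.append(' '.join(current))
            current, size = [], 0
            extra = len(rec)
        current.append(rec)
        size += extra
    if current:
        chunks.append(' '.join(current))
    return chunks


# Share of the data that the 15,000-character slice actually covers
coverage = 15000 / 157380
print(f"{coverage:.1%}")

# Tiny demonstration with made-up records of 10 characters each
demo_chunks = chunk_records(['a' * 10] * 5, max_chars=25)
print([len(c) for c in demo_chunks])
```

In the notebook, `records` would be `[doc.page_content for doc in csv_docs]`, so no row is ever cut in half at a chunk boundary.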

In [41]:
print(response)

Based on the data provided, the top 3 most traded companies in terms of volume and value are:

1. Company: Cole (Smith, Johnson and Mendoza)
   - Volume: 5,851,106
   - Value: $3,709,174,113.37

2. Company: Moore (Petersen-Ali)
   - Volume: 4,739,437
   - Value: $2,272,063,745.9

3. Company: Herr (Key, Santiago and Costa)
   - Volume: 4,145,195
   - Value: $2,270,385,825.5

These companies have shown to have the highest trading volume and value in the dataset, indicating that they are actively traded and valued by investors.


In [71]:
from langchain_community.document_loaders import CSVLoader 

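# source_column='Date' stores each row's Date value in the document's metadata['source'].
# Because 'fieldnames' is passed explicitly, a header line in the CSV (if present) is read as data row 0.
csv_loader = CSVLoader('data/ngxdata.csv', source_column='Date',
                       csv_args={'delimiter': ',',
                                 'fieldnames': ['Date', 'Symbol', 'Pclose', 'Open', 'High', 'Low', 'Close', '% Change ', 'Volume', 'Value']})
csv_doc2 = csv_loader.load()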

In [73]:
csv_doc2[1]

Document(metadata={'source': '12/31/2021', 'row': 1}, page_content='Date: 12/31/2021\nSymbol: ABCTRANS\nPclose: 0.31\nOpen: 0.31\nHigh: 0.31\nLow: 0.31\nClose: 0.31\n% Change: 0\nVolume: 3580.00\nValue: 1199.80')
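
The cell above inspects index 1 rather than index 0. A likely reason, assuming `CSVLoader` hands `csv_args` to Python's `csv.DictReader` and the file starts with a header line, is that explicit `fieldnames` make the reader treat that header line as an ordinary data row. The standard-library sketch below shows this behaviour on a small in-memory CSV with only three of the columns.

```python
import csv
import io

# Small in-memory stand-in for the CSV file: one header line, one data row
raw = "Date,Symbol,Close\n12/31/2021,ABCTRANS,0.31\n"
fieldnames = ['Date', 'Symbol', 'Close']

# Explicit fieldnames: the header line comes back as the first "data" row
with_names = list(csv.DictReader(io.StringIO(raw), fieldnames=fieldnames))

# No fieldnames: the header line is consumed and used for the keys
without_names = list(csv.DictReader(io.StringIO(raw)))

print(with_names[0])
print(len(with_names), len(without_names))
```

If the header row is unwanted, either drop `fieldnames` or skip the first document after loading.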

### UnstructuredURLLoader is useful for web data

A use case is shown below.

### Fetching exchange rates from Aboki FX

In [74]:
from langchain_community.document_loaders import UnstructuredURLLoader

url_loader = UnstructuredURLLoader(urls= ['https://abokiforex.app/'])
url_docs = url_loader.load()

In [75]:
url_docs[0]

Document(metadata={'source': 'https://abokiforex.app/'}, page_content="Aboki Forex - Naira to Dollar Black Market Today\n\nDollar to Naira Today Black Market\n\nThe Black Market Dollar to Naira Rates are tabulated below:\n\nBlack Market Rates\n\nDollar to Naira rate\n\nnaira to dollar\n\nBUY\n\ndollar to naira\n\n1545\n\nDOLLAR (USD)\n\nSELL\n\npound to dollar\n\n1555\n\nPound to Naira rate\n\ndollar to pound\n\nBUY\n\ndollar to yen\n\n1980\n\nPOUND (GBP)\n\nSELL\n\npound to euro\n\n2005\n\nEuro to Naira rate\n\neuro to pound\n\nBUY\n\ndollar to euro\n\n1630\n\nEURO (EUR)\n\nSELL\n\ndollar to euro\n\n1660\n\nCanadian Dollar to Naira rate\n\ncanadian dollar to euro\n\nBUY\n\ndollar to canadian dollar\n\n1000\n\nDOLLAR (CAD)\n\nSELL\n\ndollar to rand\n\n1150\n\nSouth African Rand to Naira rate\n\nRand to dollar\n\nBUY\n\nzar to dollar\n\n80\n\nRAND (ZAR)\n\nSELL\n\nyuan to dollar\n\n100\n\nUAE Dirham to Naira rate\n\ndirham aed to dollar\n\nBUY\n\ndollar to yuan\n\n380\n\nDIRHAM (AED)\n\
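
The page text follows a regular BUY / rate / currency / SELL / rate layout, so the rates can also be pulled out without an LLM, for example to cross-check a model's answer. The regular expression below is a sketch that assumes the layout seen in this one snapshot and would need adjusting if the site changes. `extract_rates` is a hypothetical helper name.

```python
import re

# Excerpt copied from the loaded page text above (first two currencies only)
page_text = (
    "Dollar to Naira rate\n\nnaira to dollar\n\nBUY\n\ndollar to naira\n\n1545\n\n"
    "DOLLAR (USD)\n\nSELL\n\npound to dollar\n\n1555\n\n"
    "Pound to Naira rate\n\ndollar to pound\n\nBUY\n\ndollar to yen\n\n1980\n\n"
    "POUND (GBP)\n\nSELL\n\npound to euro\n\n2005\n\n"
)

# BUY, a filler line, the buy rate, "NAME (CODE)", SELL, a filler line, the sell rate
RATE_PATTERN = re.compile(
    r"BUY\n\n[^\n]*\n\n(?P<buy>\d+)\n\n"
    r"[^\n]*\((?P<code>[A-Z]{3})\)\n\n"
    r"SELL\n\n[^\n]*\n\n(?P<sell>\d+)"
)


def extract_rates(text):
    """Return {currency code: {'buy': int, 'sell': int}} for every match in text."""
    return {
        m['code']: {'buy': int(m['buy']), 'sell': int(m['sell'])}
        for m in RATE_PATTERN.finditer(text)
    }


rates = extract_rates(page_text)
print(rates)
```

In the notebook this could be run on `url_docs[0].page_content` instead of the excerpt.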

### Using an OpenAI LLM to extract the key data with a detailed prompt

In [39]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(temperature=0.4)

prompt = PromptTemplate(
        input_variables=['file'],
        template= '''
            You are a very good webscraper, also good with data mining.

            Fetch the exchange rate of the currencies in the document and return the result in a dictionary format
            for both the official rate and the black market.
            The document is as follows: {file}
'''
    )
chain = prompt | llm | StrOutputParser()

response = chain.invoke({'file':url_docs[0].page_content})

In [41]:
print(response)

{
    "exchange_rates": {
        "black_market": {
            "USD": {
                "buy": 1540,
                "sell": 1550
            },
            "GBP": {
                "buy": 2020,
                "sell": 2020
            },
            "EUR": {
                "buy": 1690,
                "sell": 1690
            },
            "CAD": {
                "buy": 1000,
                "sell": 1000
            },
            "ZAR": {
                "buy": 80,
                "sell": 1150
            },
            "AED": {
                "buy": 380,
                "sell": 420
            },
            "CNY": {
                "buy": 190,
                "sell": 215
            },
            "GHS": {
                "buy": 90,
                "sell": 105
            },
            "XOF": {
                "buy": 2370,
                "sell": 2550
            },
            "XAF": {
                "buy": 2250,
                "sell": 2400
            },
            "AUD"
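
The chain returns a plain string, so the dictionary still has to be parsed before it can be used in code. Models sometimes wrap the JSON in a code fence or add text around it. The helper below is a minimal sketch for that case: `parse_llm_json` is a hypothetical name, and it assumes the reply contains exactly one JSON object. The sample reply reuses the USD figures from the response above.

```python
import json
import re

FENCE_MARK = "`" * 3  # three backticks, built here so this example stays fence-safe
FENCE = re.compile(
    re.escape(FENCE_MARK) + r"(?:json|python)?\s*(.*?)" + re.escape(FENCE_MARK),
    re.DOTALL,
)


def parse_llm_json(text):
    """Extract and parse the first JSON object found in an LLM reply."""
    match = FENCE.search(text)
    payload = match.group(1) if match else text
    start, end = payload.find('{'), payload.rfind('}')
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in the reply")
    return json.loads(payload[start:end + 1])


fenced_reply = (
    "Here are the rates:\n"
    + FENCE_MARK + "json\n"
    + '{"USD": {"buy": 1540, "sell": 1550}}\n'
    + FENCE_MARK
)
parsed = parse_llm_json(fenced_reply)
print(parsed["USD"]["sell"])
```

A reply that is cut off before its closing brace raises `ValueError` (or its subclass `json.JSONDecodeError`), which is a useful signal to retry or raise the token limit.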

In [77]:
from langchain_ollama import OllamaLLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# num_predict is Ollama's limit on the number of generated tokens
ollama_llm = OllamaLLM(model='gemma3', temperature=0.2, num_predict=1000, verbose=True)
prompt = PromptTemplate(
        input_variables=['file'],
        template= '''
            You are a very good webscraper, also good with data mining.

            Fetch the exchange rate of the currencies against naira (NGN) in the webpage and return the result in a dictionary format
            for both the official rate and the black market.
            The document is as follows: {file}
'''
    )
ollama_chain = prompt | ollama_llm | StrOutputParser()
response = ollama_chain.invoke({'file':url_docs[0].page_content})


In [78]:
print(response)

```python
data = {
    "black_market_rates": {
        "dollar_to_naira": 1545,
        "pound_to_dollar": 1555,
        "dollar_to_yen": 1980,
        "pound_to_euro": 2005,
        "dollar_to_euro": 1630,
        "canadian_dollar_to_euro": 1000,
        "rand_to_dollar": 80,
        "yuan_to_dollar": 380,
        "dirham_to_dollar": 420,
        "euro_to_dirham": 190,
        "yuan_to_euro": 215,
        "cedi_to_dollar": 90,
        "xof_to_dollar": 2370,
        "wa_to_dollar": 2040.43,
        "chinese_yuan_to_dollar": 211.51,
        "saudi_riyal_to_dollar": 409.70
    }
}

print(data)
```

**Explanation:**

1.  **Data Structure:** The code creates a dictionary called `data`.
2.  **Keys:** The keys of the dictionary represent the currency pairs (e.g., "dollar\_to\_naira").
3.  **Values:** The values are the exchange rates extracted from the provided text.
4.  **Output:** The `print(data)` statement displays the entire dictionary, showing the extracted rates.

**Output:**

```
{
 

In [16]:
url_loader2 = UnstructuredURLLoader(urls=[
    'https://ngxgroup.com/exchange/data/equities-price-list/',
    'https://saharareporters.com/2025/03/30/edo-governor-suspends-security-chief-vigilante-groups-over-uromi-lynching',
    'https://punchng.com/eid-el-fitr-nigerians-can-overcome-corruption-with-stronger-resolve-efcc-chairman/',
])
data = url_loader2.load()

In [21]:
data[2]

Document(metadata={'source': 'https://punchng.com/eid-el-fitr-nigerians-can-overcome-corruption-with-stronger-resolve-efcc-chairman/'}, page_content="Advertise with us\n\nSunday, March 30, 2025\n\nMost Widely Read Newspaper\n\nVideos\n\nPunchNG Menu:\n\nVideo\n\nSpice\n\nSpecial Features\n\nEducation\n\nSex & Relationship\n\nInterview\n\nColumns\n\nOpinion\n\nAdvertise with us\n\nEid-el-Fitr: Nigerians can overcome corruption with stronger resolve – EFCC Chairman\n\n30th March 2025\n\nBy Solomon Odeniyi\n\nKindly share this story:\n\nGetting your Trinity Audio player ready...\n\nThe Executive Chairman of the Economic and Financial Crimes Commission, Ola Olukoyede, has expressed confidence that Nigerians have the capacity to defeat corruption through collective determination and commitment.\n\nOlukoyede made this statement in Abuja on Sunday in his goodwill message to Muslims celebrating Eid-el-Fitr across the country.\n\n“Tackling economic and financial crimes and other acts of corrupt