## About


---

This notebook collects the practice exercises, tests, and experiments I worked through during the project.

I will add more comments as I go, including things you can try in your own projects.



```
    pip install -r requirements
```


### The OpenAI module


To use the OpenAI API, you need to create an account and get an API key. With the key you can use any OpenAI service or model.


Below are two of the many ways to use an OpenAI LLM:


```
from langchain_openai import ChatOpenAI
from langchain.llms import OpenAI
```


In [None]:
# Import the modules needed for working with OpenAI and LangChain
from dotenv import dotenv_values
from langchain.schema import (AIMessage, HumanMessage, SystemMessage)
from langchain_openai import ChatOpenAI
from langchain.llms import OpenAI

# Load environment variables from the .env file
env = dotenv_values('.env')


In [None]:
# Note: the text-davinci-003 model has been retired, so this call fails with the 404 error shown below.
# Use ChatOpenAI with a current chat model instead.
llm = OpenAI(model_name='text-davinci-003', api_key=env['OPENAI_API_KEY'])
llm.invoke('Who is JFK in America')


NotFoundError: Error code: 404 - {'error': {'message': 'The model `text-davinci-003` has been deprecated, learn more here: https://platform.openai.com/docs/deprecations', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

In [None]:
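# Chat models take a list of messages: the system message sets the role, the human message carries the request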
chat = ChatOpenAI(api_key=env['OPENAI_API_KEY'],
                model='gpt-3.5-turbo',
                temperature=0.3)
message = [
    SystemMessage(content='You are an expert data scientist'),
    HumanMessage(content='Write a python script that trains a neural network on simulated data')
]

response = chat.invoke(message)


In [5]:
print(response.content,end='\n')

Sure! Here is an example of a Python script that trains a simple neural network on simulated data using the TensorFlow library:

```python
import numpy as np
import tensorflow as tf

# Generate simulated data
np.random.seed(0)
X = np.random.rand(1000, 2)
y = np.array([1 if x1 + x2 > 1 else 0 for x1, x2 in X])

# Define the neural network architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(4, activation='relu', input_shape=(2,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=10, batch_size=32)

# Evaluate the model
loss, accuracy = model.evaluate(X, y)
print(f'Loss: {loss}, Accuracy: {accuracy}')
```

In this script:
1. We generate simulated data where the target variable is 1 if the sum of the input features is greater than 1, otherwise 0.
2. We define a simple neural network with one hidden layer and one output

## Prompting 

Using the `PromptTemplate` class, you can create custom prompts for specific purposes.


In [8]:
from langchain_core.prompts import PromptTemplate

template = """
You are an expert data scientist with an expertise in building deep learning models. 
Explain the concept of {concept} in a couple of lines
"""

prompt = PromptTemplate(input_variables=['concept'],
                        template = template)


In [9]:
prompt

PromptTemplate(input_variables=['concept'], input_types={}, partial_variables={}, template='\nYou are an expert data scientist with an expertise in building deep learning models. \nExplain the concept of {concept} in a couple of lines\n')

In [10]:
chat.invoke(prompt.format(concept='batch normalization'))

AIMessage(content='Batch normalization is a technique used in deep learning to normalize the input of each layer by adjusting and scaling the activations. This helps in reducing internal covariate shift, improving training speed, and allowing for higher learning rates.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 44, 'prompt_tokens': 37, 'total_tokens': 81, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-BE8D91DX7DQE1sDU9db1C440m0Fyp', 'finish_reason': 'stop', 'logprobs': None}, id='run-e12c5b83-4e19-4651-bab2-953fa2d07290-0', usage_metadata={'input_tokens': 37, 'output_tokens': 44, 'total_tokens': 81, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning'
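
As a rough mental model, filling a prompt template works much like Python's own `str.format`. The sketch below uses only the standard library; `template_variables` and `fill_template` are hypothetical helpers, not LangChain functions, and they assume the placeholders are plain `{name}` fields.

```python
from string import Formatter

TEMPLATE = (
    "You are an expert data scientist with an expertise in building deep learning models.\n"
    "Explain the concept of {concept} in a couple of lines\n"
)

def template_variables(template: str) -> list[str]:
    """Return the placeholder names found in a format-style template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]

def fill_template(template: str, **values: str) -> str:
    """Fill the template, failing loudly if a placeholder has no value."""
    missing = [name for name in template_variables(template) if name not in values]
    if missing:
        raise KeyError(f"missing values for: {missing}")
    return template.format(**values)

filled = fill_template(TEMPLATE, concept="batch normalization")
print(filled)
```

Checking for missing variables before formatting gives a clearer error than a bare `KeyError` from `str.format`, which helps once templates grow to several placeholders.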

## Using LLMChains

An `LLMChain` ties a prompt template to a model so that both run in a single call, similar to a pipeline in scikit-learn.

In [40]:
chain('bias and variance')


{'concept': 'bias and variance',
 'text': "Bias refers to the error introduced by approximating a real-world problem, while variance refers to the error introduced by the model's sensitivity to fluctuations in the training data. Balancing bias and variance is crucial for creating a model that generalizes well to unseen data."}

In [13]:
from langchain.chains import LLMChain 

chain = LLMChain(llm=chat, prompt= prompt) 

print(chain.invoke('bias and variance'))

{'concept': 'bias and variance', 'text': 'Bias refers to the error introduced by approximating a real-world problem, while variance refers to the error introduced by sensitivity to fluctuations in the training data. Balancing bias and variance is crucial in building accurate and generalizable deep learning models.'}


### SimpleSequentialChain

A `SimpleSequentialChain` lines up a series of prompts so that the result of one prompt is used as the input to the next.

In [16]:
from langchain.chains import SimpleSequentialChain 

template2 = """You are an expert data scientist. Take the description of {ml_concept} and explain it as if to a toddler in 300 words"""
prompt2 = PromptTemplate(
        input_variables=['ml_concept'],
        template = template2
)
chain2 = LLMChain(llm=chat, prompt=prompt2)

all_chain = SimpleSequentialChain(chains=[chain,chain2], verbose=True)


In [18]:
response = all_chain.invoke('bias and variance')

print(response)



[1m> Entering new SimpleSequentialChain chain...[0m
[36;1m[1;3mBias refers to the error introduced by approximating a real-world problem, leading to underfitting. Variance refers to the error introduced by modeling the noise in the training data, leading to overfitting. Balancing bias and variance is crucial for building accurate and generalizable deep learning models.[0m
[33;1m[1;3mImagine you are trying to build a puzzle. Bias is like when you try to put a piece in the wrong spot because you don't really understand what the picture is supposed to look like. This can make your puzzle look really messy and not quite right. 

Variance is like when you try to make the puzzle perfect by putting every piece exactly where it seems to fit, even if it doesn't really belong there. This can make your puzzle look too perfect and not like the original picture at all.

When you are building a puzzle, you want to find a balance between making sure each piece fits correctly and not forcing 
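
The idea of passing one step's output to the next can be sketched without any LLM at all. In this minimal example, `run_sequential`, `explain`, and `simplify` are hypothetical stand-ins I made up for illustration; the two functions simply play the role of the two chains above.

```python
from functools import reduce
from typing import Callable

Step = Callable[[str], str]

def run_sequential(steps: list[Step], first_input: str) -> str:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda text, step: step(text), steps, first_input)

# Hypothetical stand-ins for the two LLM calls.
def explain(concept: str) -> str:
    return f"Short explanation of {concept}."

def simplify(explanation: str) -> str:
    return f"For a toddler: {explanation}"

result = run_sequential([explain, simplify], "bias and variance")
print(result)
```

Each step takes exactly one string and returns one string, which is the same single-input, single-output restriction that makes the simple sequential form easy to reason about.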

## Agents 

Agents are LLM-driven programs that decide which tools to call in order to perform a specific task.


In [19]:
from langchain_experimental.agents.agent_toolkits.python.base import create_python_agent
from langchain_experimental.utilities import PythonREPL
from langchain_experimental.tools.python.tool import PythonREPLTool

In [20]:
agentexecutor = create_python_agent(
    llm= OpenAI(api_key=env['OPENAI_API_KEY'],
                temperature= 0, max_tokens= 1000),
    tool= PythonREPLTool(),
    verbose= True 
)

In [21]:
agentexecutor.invoke('Find the roots of a quadratic equation funcion 3x^6 + 4x + 2 = 1')




[1m> Entering new AgentExecutor chain...[0m


Python REPL can execute arbitrary code. Use with caution.


[32;1m[1;3m I can use the quadratic formula to find the roots of a quadratic equation.
Action: Python_REPL
Action Input: (-4 + (4**2 - 4*3*2)**0.5) / (2*3)[0m
Observation: [36;1m[1;3m[0m
Thought:[32;1m[1;3m I can use the quadratic formula to find the roots of a quadratic equation.
Action: Python_REPL
Action Input: (-4 - (4**2 - 4*3*2)**0.5) / (2*3)[0m
Observation: [36;1m[1;3m[0m
Thought:[32;1m[1;3m I now know the final answer
Final Answer: -0.3333333333333333, -0.6666666666666666[0m

[1m> Finished chain.[0m


{'input': 'Find the roots of a quadratic equation funcion 3x^6 + 4x + 2 = 1',
 'output': '-0.3333333333333333, -0.6666666666666666'}
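
Both observations in the trace above are empty, most likely because the agent sent bare expressions without `print`, so the final answer was not actually computed by the tool. It is worth checking by hand. The sketch below assumes the intended equation was the quadratic 3x^2 + 4x + 2 = 1 (the prompt says x^6, but the agent treated it as a quadratic); `quadratic_roots` is a helper written for this check.

```python
import cmath

def quadratic_roots(a: float, b: float, c: float) -> tuple[complex, complex]:
    """Roots of a*x**2 + b*x + c = 0 via the quadratic formula."""
    if a == 0:
        raise ValueError("not a quadratic: a must be non-zero")
    root_disc = cmath.sqrt(b * b - 4 * a * c)  # cmath also handles negative discriminants
    return (-b + root_disc) / (2 * a), (-b - root_disc) / (2 * a)

# Assumption: 3x^2 + 4x + 2 = 1, rearranged to 3x^2 + 4x + 1 = 0.
r1, r2 = quadratic_roots(3, 4, 1)
print(r1, r2)

# Substitute each root back into the left-hand side; both should be close to zero.
residuals = [abs(3 * r * r + 4 * r + 1) for r in (r1, r2)]
print(residuals)
```

Under that assumption the roots are -1/3 and -1, so the agent's second value (-0.666...) does not satisfy the equation. Agent answers are worth verifying whenever the observations look empty.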

### Running generated code with `exec`

LLM responses arrive as strings, so generated Python code has to be executed from a string. There are two ways to do this:

1. `exec` runs a string as code; anything the code defines can be read back from the namespace dictionary passed to it, as in the cell below.
2. `subprocess.run` starts a separate process. It can even give access to the shell, which is not advisable with untrusted code.

In [29]:
code = """ 
def greet(f):
    return f'Hello {f}' 
new = greet('Dapo')
"""
namespace = {}
exec(code, namespace) 

In [31]:
print(namespace['new'])

Hello Dapo
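
If the generated code prints its result instead of storing it in a variable, the namespace alone is not enough. A minimal sketch of capturing printed output as well, using the standard library's `contextlib.redirect_stdout`; `run_snippet` is a hypothetical helper name.

```python
import contextlib
import io

def run_snippet(code: str) -> tuple[dict, str]:
    """Execute a code string; return its namespace and anything it printed."""
    namespace: dict = {}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return namespace, buffer.getvalue()

snippet = """
def greet(name):
    return f'Hello {name}'
new = greet('Dapo')
print(new)
"""
namespace, printed = run_snippet(snippet)
print(namespace["new"], "|", printed.strip())
```

Note that `exec` is not a sandbox: the string runs with the full permissions of the notebook process, so this is only suitable for code you are prepared to trust.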


In [32]:
import subprocess 

result = subprocess.run(['python', '-c', 'print("Hello world")'],
                        capture_output=True,text=True) 

print(result.stdout.strip())

Hello world
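
When running generated code in a subprocess, a few extra guards are cheap to add. The sketch below is one possible arrangement, not the only one: it uses `sys.executable` so the child uses the same interpreter as the notebook, sets a timeout, and inspects the return code and `stderr`. `run_in_subprocess` is a hypothetical helper.

```python
import subprocess
import sys

def run_in_subprocess(code: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run a code string in a fresh interpreter with a time limit."""
    return subprocess.run(
        [sys.executable, "-c", code],  # list form, so no shell is involved
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if exceeded
    )

ok = run_in_subprocess('print("Hello world")')
bad = run_in_subprocess("raise ValueError('boom')")

print(ok.returncode, ok.stdout.strip())
print(bad.returncode, bad.stderr.strip().splitlines()[-1])
```

A non-zero `returncode` together with the traceback in `stderr` can be fed back to the model as an error message, which is one way to let it retry.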


In [None]:
# Testing out python code generation

# first step is to have something to take in an input and prompt
# then code generation output
# the code is then passed as an input into the agent that calls the exec function to run the code using the input variable 


In [33]:
from langchain.llms import OpenAI 
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain 


chat = OpenAI(api_key=env['OPENAI_API_KEY'], temperature=0.2,
              max_tokens= 1000)

template3 = """ 
        You are an expert data scientist.
        Create a python script that takes in an integer argument {number} and returns the number multiplied by 4, 
        then create a variable called answer to store the value of the function
"""

prompt3 = PromptTemplate(input_variables= ['number'], 
                         template= template3,
                         )

chain3 = LLMChain(llm=chat, prompt=prompt3)

response = chain3.invoke(5)

In [38]:
namespace = {}
exec(response['text'].strip(),namespace )

20


In [39]:
namespace['answer']

20
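
Models often wrap generated code in markdown fences or add a sentence before and after it, in which case passing the raw text to `exec` raises a `SyntaxError`. A minimal sketch of pulling the code out first; `extract_code` is a hypothetical helper, and the `reply` string is made up for illustration rather than taken from a real response.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed
FENCE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)

def extract_code(text: str) -> str:
    """Return the first fenced block if there is one, else the stripped text."""
    match = FENCE_RE.search(text)
    return (match.group(1) if match else text).strip()

# Hypothetical model reply, for illustration only.
reply = (
    "Here is the script:\n"
    f"{FENCE}python\n"
    "def times_four(n):\n"
    "    return n * 4\n"
    "\n"
    "answer = times_four(5)\n"
    f"{FENCE}\n"
    "Hope it helps!"
)

namespace = {}
exec(extract_code(reply), namespace)
print(namespace["answer"])
```

If no fence is found, the helper falls back to the stripped text, so it behaves the same as `response['text'].strip()` for plain replies.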

In [5]:
from langchain.memory import  ConversationBufferMemory,ConversationBufferWindowMemory
from langchain_openai import ChatOpenAI 
from langchain.chains import ConversationChain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv 

load_dotenv('.env')

True

In [24]:
chain = ConversationChain(llm=ChatOpenAI(temperature=0.3))
                          

In [21]:
chain.prompt.template 

'The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.\n\nCurrent conversation:\n{history}\nHuman: {input}\nAI:'

In [25]:
chain.invoke({'input':'bias and variance'})

{'input': 'bias and variance',
 'history': '',
 'response': "Ah, bias and variance are key concepts in machine learning! Bias refers to the error introduced by approximating a real-world problem, while variance is the error introduced by modeling the noise in the training data. Balancing bias and variance is crucial for building a model that generalizes well to unseen data. It's like finding the sweet spot between underfitting and overfitting. Do you want to know more about how to manage bias and variance in machine learning models?"}

In [26]:
chain.invoke("argentina's last match was against who and when?")

{'input': "argentina's last match was against who and when?",
 'history': "Human: bias and variance\nAI: Ah, bias and variance are key concepts in machine learning! Bias refers to the error introduced by approximating a real-world problem, while variance is the error introduced by modeling the noise in the training data. Balancing bias and variance is crucial for building a model that generalizes well to unseen data. It's like finding the sweet spot between underfitting and overfitting. Do you want to know more about how to manage bias and variance in machine learning models?",
 'response': "Argentina's last match was against Brazil on July 10, 2021. They played in the final of the Copa America tournament. Argentina won the match 1-0, with Angel Di Maria scoring the only goal of the game. It was a historic victory for Argentina, as it was their first Copa America title since 1993."}

In [28]:
print(chain.memory.buffer)

Human: bias and variance
AI: Ah, bias and variance are key concepts in machine learning! Bias refers to the error introduced by approximating a real-world problem, while variance is the error introduced by modeling the noise in the training data. Balancing bias and variance is crucial for building a model that generalizes well to unseen data. It's like finding the sweet spot between underfitting and overfitting. Do you want to know more about how to manage bias and variance in machine learning models?
Human: argentina's last match was against who and when?
AI: Argentina's last match was against Brazil on July 10, 2021. They played in the final of the Copa America tournament. Argentina won the match 1-0, with Angel Di Maria scoring the only goal of the game. It was a historic victory for Argentina, as it was their first Copa America title since 1993.
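
The buffer above keeps every turn, so the prompt grows with the conversation. A windowed memory keeps only the last few exchanges instead. The sketch below illustrates that idea in plain Python with `collections.deque`; `WindowMemory` is a hypothetical class written for this note and is not the LangChain implementation, and the AI replies are placeholders.

```python
from collections import deque

class WindowMemory:
    """Keep only the last k human/AI exchanges, like a sliding window."""

    def __init__(self, k: int = 2) -> None:
        # A deque with maxlen drops the oldest item automatically.
        self.turns = deque(maxlen=k)

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    @property
    def buffer(self) -> str:
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)

memory = WindowMemory(k=2)
memory.save("bias and variance", "Key concepts in machine learning.")
memory.save("argentina's last match?", "Placeholder reply.")
memory.save("what is overfitting?", "Fitting noise in the training data.")
print(memory.buffer)
```

With `k=2`, the first exchange is dropped once the third is saved, which caps the size of the `{history}` text at the cost of forgetting older context.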


## Embeddings 

---

This section creates embeddings with `OllamaEmbeddings` and the `llama3.2` model, which runs locally and so avoids paid API calls.


In [21]:
from langchain_community.vectorstores import FAISS 
from langchain_ollama import OllamaEmbeddings

In [23]:
db = FAISS.from_texts(
    ['hello world', 'hello world 2', 'hi there'],
    embedding=OllamaEmbeddings(model='llama3.2'),
)

In [12]:
# Perform a similarity search
query = "hello"
results = db.similarity_search(query)  # returns the k most similar documents (k defaults to 4)

# Print the results
for result in results:
    print(result.page_content)

hi there
hello world
hello world 2


In [13]:
results

[Document(id='dba82005-d6f9-4dbe-997c-413eec0fd272', metadata={}, page_content='hi there'),
 Document(id='df3f65a9-7bbf-4b58-ba59-ca818be1f42d', metadata={}, page_content='hello world'),
 Document(id='63b197ab-e275-47e2-9ae4-b76d821c6308', metadata={}, page_content='hello world 2')]
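
To make the ranking step less of a black box, here is a small sketch of a top-k similarity search over toy vectors. It uses cosine similarity for readability; as far as I know the LangChain FAISS wrapper defaults to Euclidean distance, so treat this as an illustration of the idea rather than of FAISS itself. The three-dimensional vectors are invented, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

# Toy embeddings, invented for illustration.
toy_store = {
    "hello world": [0.9, 0.1, 0.0],
    "hi there": [0.8, 0.3, 0.1],
    "food is good": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```

With real embeddings the ordering depends entirely on the model, which is why "hi there" can outrank "hello world" for the query "hello" in the output above.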

### Saving embeddings

`save_local` writes the index to disk, in a directory of your choice.

In [14]:
db.save_local('faiss_index')

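You can also load a saved vector store and update it with new data.

Merge the new store into the loaded one, then save when you are done. Pass `load_local` the same embedding model that built the saved index, since vectors from different models are not comparable.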

In [None]:
newdb = FAISS.load_local('faiss_index', OllamaEmbeddings(model='gemma3'), allow_dangerous_deserialization=True)

db2 = FAISS.from_texts(['food is good','my guy how are you', 'hello there'],
                 embedding=OllamaEmbeddings(model='gemma3')
                 )

In [17]:
newdb.merge_from(db2)
newdb.save_local('faiss_index')

In [18]:
newdb.similarity_search('hello')

[Document(id='ba9a0929-3f3a-4a52-816f-9b6a270f21ea', metadata={}, page_content='hello there'),
 Document(id='dba82005-d6f9-4dbe-997c-413eec0fd272', metadata={}, page_content='hi there'),
 Document(id='df3f65a9-7bbf-4b58-ba59-ca818be1f42d', metadata={}, page_content='hello world'),
 Document(id='c2e1ea99-281c-4dfa-a345-097cecbc0fc8', metadata={}, page_content='my guy how are you')]

In [20]:
newdb.similarity_search('food')

[Document(id='bdf4c390-865b-4bbf-ad03-b6091f67bd8c', metadata={}, page_content='food is good'),
 Document(id='ba9a0929-3f3a-4a52-816f-9b6a270f21ea', metadata={}, page_content='hello there'),
 Document(id='c2e1ea99-281c-4dfa-a345-097cecbc0fc8', metadata={}, page_content='my guy how are you'),
 Document(id='63b197ab-e275-47e2-9ae4-b76d821c6308', metadata={}, page_content='hello world 2')]

### Another way to create embeddings, using LlamaIndex

In [1]:
from llama_index.core import VectorStoreIndex, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

from llama_index.core import SimpleDirectoryReader
# from langchain_community.document_loaders import PyPDFLoader
# Read every PDF under data/, including subfolders
loader = SimpleDirectoryReader(
    input_dir='data',
    required_exts=['.pdf'],
    recursive=True
)


In [2]:
docs= loader.load_data()
embed_model = HuggingFaceEmbedding( model_name="BAAI/bge-large-en-v1.5", trust_remote_code=True)


In [4]:
from llama_index.core import Settings
# ====== Create vector store and upload indexed data ======
Settings.embed_model = embed_model # we specify the embedding model to be used
index = VectorStoreIndex.from_documents(docs)

In [5]:
Settings

_Settings(_llm=None, _embed_model=HuggingFaceEmbedding(model_name='BAAI/bge-large-en-v1.5', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000001F360646010>, num_workers=None, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None), _callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000001F360646010>, _tokenizer=None, _node_parser=SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000001F360646010>, id_func=<function default_id_func at 0x000001F350649BC0>, chunk_size=1024, chunk_overlap=200, separator=' ', paragraph_separator='\n\n\n', secondary_chunking_regex='[^,.;。？！]+[,.;。？！]?|[,.;。？！]'), _prompt_helper=None, _transformations=[SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 
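
The `Settings` output above shows the default node parser with `chunk_size=1024` and `chunk_overlap=200`. To illustrate what the overlap does, here is a character-based sketch. It is a deliberate simplification: the real `SentenceSplitter` respects sentence boundaries and does not count plain characters, which this does not attempt. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this chunk reached the end of the text
            break
        start += step
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The shared characters at each boundary mean that a sentence cut in half by one chunk still appears whole, or nearly whole, in its neighbour, which helps retrieval.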

In [6]:
from llama_index.llms.ollama import Ollama

# setting up the llm

llm = Ollama(model="gemma3", request_timeout=500) 

# ====== Setup a query engine on the index previously created ======
Settings.llm = llm # specifying the llm to be used
query_engine = index.as_query_engine(streaming=True, similarity_top_k=4)

In [7]:
from llama_index.core.prompts import PromptTemplate 

In [8]:
qa_prompt_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above I want you to think step by step "
    "to answer the query in a crisp manner. "
    "In case you don't know the answer, say 'I don't know!'.\n"
    "Query: {query_str}\n"
    "Answer: "
)

qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt_tmpl})



In [None]:
docs

In [9]:
response = query_engine.query('What is the document about?')
print(response)

The document is about the admission requirements and procedures for the Winter 2025-2026 Master in Data Science program at the University of Luxembourg, specifically for third-country national applicants. It details required documents, application deadlines, tuition fees, and contact information.


## Document loaders in detail



Still on Embeddings 


In [21]:
from langchain_community.document_loaders import TextLoader,CSVLoader,JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

In [16]:
loader = TextLoader('data/onedrive.txt')

In [20]:
docs = loader.load()
docs

[Document(metadata={'source': 'data/onedrive.txt'}, page_content='Microsoft OneDrive is a cloud storage service that allows users to store files and access them from anywhere with an internet connection. It is integrated with the Microsoft 365 ecosystem, making it a convenient choice for individuals and businesses.\n\n**Benefits:**\n1. **Accessibility:** Files stored in OneDrive can be accessed from any device, including PCs, smartphones, and tablets.\n2. **Collaboration:** OneDrive enables real-time collaboration on documents, spreadsheets, and presentations through Microsoft Office apps.\n3. **Backup and Sync:** It provides automatic backup and synchronization of files across devices, ensuring data is always up-to-date.\n4. **Security:** OneDrive offers robust security features, including encryption, ransomware detection, and recovery options.\n5. **Integration:** Seamless integration with Microsoft 365 apps like Word, Excel, and Teams enhances productivity.\n\n**Use Cases:**\n- **Pe

In [28]:
csv_loader= CSVLoader('data/trades.csv', source_column='Company')
# source_column names the CSV column whose value becomes each document's 'source' metadata

In [29]:
csv_docs = csv_loader.load()

In [30]:
len(csv_docs)

975

In [31]:
csv_docs[0].metadata

{'source': 'Stevens-Brown', 'row': 0}
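
To show what happens per row, here is a rough standard-library sketch of turning CSV rows into document-like dictionaries with `source` and `row` metadata. It mirrors the shape of the metadata printed above but is not the loader's actual code; `rows_to_documents` is a hypothetical helper, and the two-row sample is invented (only the `Company` column name comes from the cell above).

```python
import csv
import io

def rows_to_documents(csv_text: str, source_column: str) -> list[dict]:
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per field, so the row reads like a small text document.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": row[source_column], "row": row_number},
        })
    return documents

# Invented sample data; the real data/trades.csv is not reproduced here.
sample = "Company,Amount\nStevens-Brown,100\nAcme,250\n"
sample_docs = rows_to_documents(sample, source_column="Company")
print(sample_docs[0]["metadata"])
print(sample_docs[1]["page_content"])
```

One document per row explains why `len(csv_docs)` above equals the number of data rows in the file.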

### Prompt to fetch exchange rates from Aboki FX 

In [32]:
from langchain_community.document_loaders import UnstructuredURLLoader

url_loader = UnstructuredURLLoader(urls= ['https://abokiforex.app/'])
url_docs = url_loader.load()

In [34]:
url_docs[0]

Document(metadata={'source': 'https://abokiforex.app/'}, page_content="Aboki Forex - Naira to Dollar Black Market Today\n\nDollar to Naira Today Black Market\n\nThe Black Market Dollar to Naira Rates are tabulated below:\n\nBlack Market Rates\n\nDollar to Naira rate\n\nnaira to dollar\n\nBUY\n\ndollar to naira\n\n1540\n\nDOLLAR (USD)\n\nSELL\n\npound to dollar\n\n1550\n\nPound to Naira rate\n\ndollar to pound\n\nBUY\n\ndollar to yen\n\n1990\n\nPOUND (GBP)\n\nSELL\n\npound to euro\n\n2020\n\nEuro to Naira rate\n\neuro to pound\n\nBUY\n\ndollar to euro\n\n1660\n\nEURO (EUR)\n\nSELL\n\ndollar to euro\n\n1690\n\nCanadian Dollar to Naira rate\n\ncanadian dollar to euro\n\nBUY\n\ndollar to canadian dollar\n\n1000\n\nDOLLAR (CAD)\n\nSELL\n\ndollar to rand\n\n1150\n\nSouth African Rand to Naira rate\n\nRand to dollar\n\nBUY\n\nzar to dollar\n\n80\n\nRAND (ZAR)\n\nSELL\n\nyuan to dollar\n\n100\n\nUAE Dirham to Naira rate\n\ndirham aed to dollar\n\nBUY\n\ndollar to yuan\n\n380\n\nDIRHAM (AED)\n\

### Using an OpenAI LLM to extract the key data with a detailed prompt

In [39]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
# chatmodel = ChatOpenAI()
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate 

llm = ChatOpenAI(temperature=0.4)

prompt = PromptTemplate(
        input_variables=['file'],
        template= '''
            You are a very good webscraper, also good with data mining.

            Fetch the exchange rate of the currencies in the document and return the result in a dictionary format
            for both the official rate and the black market.
            The document is as follows: {file}
'''
    )
chain = prompt | llm | StrOutputParser()

response = chain.invoke({'file':url_docs[0].page_content})

In [41]:
print(response)

{
    "exchange_rates": {
        "black_market": {
            "USD": {
                "buy": 1540,
                "sell": 1550
            },
            "GBP": {
                "buy": 2020,
                "sell": 2020
            },
            "EUR": {
                "buy": 1690,
                "sell": 1690
            },
            "CAD": {
                "buy": 1000,
                "sell": 1000
            },
            "ZAR": {
                "buy": 80,
                "sell": 1150
            },
            "AED": {
                "buy": 380,
                "sell": 420
            },
            "CNY": {
                "buy": 190,
                "sell": 215
            },
            "GHS": {
                "buy": 90,
                "sell": 105
            },
            "XOF": {
                "buy": 2370,
                "sell": 2550
            },
            "XAF": {
                "buy": 2250,
                "sell": 2400
            },
            "AUD"
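
Because the scraped page follows a regular BUY / value / currency / SELL / value layout, the LLM output can be cross-checked with a deterministic parser. The sketch below assumes that layout stays as it appears in `url_docs[0]` above; `RATE_RE` and `parse_rates` are hypothetical names, and the excerpt is copied from the first two currencies in that document.

```python
import json
import re

# BUY, a label line, the buy value, a name line ending in (CODE), SELL, a label line, the sell value.
RATE_RE = re.compile(
    r"BUY\s+[^\n]+\s+(\d+)\s+[^\n]*\((\w{3})\)\s+SELL\s+[^\n]+\s+(\d+)"
)

def parse_rates(page_text: str) -> dict[str, dict[str, int]]:
    """Extract buy and sell rates keyed by three-letter currency code."""
    return {
        code: {"buy": int(buy), "sell": int(sell)}
        for buy, code, sell in RATE_RE.findall(page_text)
    }

excerpt = (
    "BUY\n\ndollar to naira\n\n1540\n\nDOLLAR (USD)\n\nSELL\n\npound to dollar\n\n1550\n\n"
    "Pound to Naira rate\n\ndollar to pound\n\n"
    "BUY\n\ndollar to yen\n\n1990\n\nPOUND (GBP)\n\nSELL\n\npound to euro\n\n2020"
)
rates = parse_rates(excerpt)
print(json.dumps(rates, indent=2))
```

On this excerpt the parser reads the GBP buy rate as 1990, while the model output above reports 2020 for both buy and sell, so a check like this is a useful guard against extraction mistakes.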

In [43]:
from langchain_ollama import OllamaLLM 

ollama_llm = OllamaLLM(model='llama3.2', temperature=0.2, num_predict=1000)  # num_predict caps the number of generated tokens

ollama_chain = prompt | ollama_llm | StrOutputParser()
response = ollama_chain.invoke({'file':url_docs[0].page_content})


In [44]:
print(response)

import re

# Define the document content
document = """
Dollar to Naira Today Black Market

The Black Market Dollar to Naira Rates are tabulated below:

Black Market Rates

Dollar to Naira rate

naira to dollar

BUY

dollar to naira

1540

DOLLAR (USD)

SELL

pound to dollar

1550

Pound to Naira rate

dollar to pound

BUY

dollar to yen

1990

POUND (GBP)

SELL

pound to euro

2020

Euro to Naira rate

euro to pound

BUY

dollar to euro

1660

EURO (EUR)

SELL

dollar to euro

1690

Canadian Dollar to Naira rate

canadian dollar to euro

BUY

dollar to canadian dollar

1000

DOLLAR (CAD)

SELL

dollar to rand

1150

South African Rand to Naira rate

Rand to dollar

BUY

zar to dollar

80

RAND (ZAR)

SELL

yuan to dollar

100

UAE Dirham to Naira rate

dirham aed to dollar

BUY

dollar to yuan

380

DIRHAM (AED)

SELL

dirham to dollar

420

Chinese Yuan to Naira rate

yuan to euro

BUY

euro to dirham

190

YUAN (CNY)

SELL

pound to yuan

215

Ghanaian Cedi to Naira rate

Ghanaian c