In [1]:
!pip install pypdf sentence-transformers faiss-cpu numpy openai



# TEXT CONVERT

	•	pypdf → Reads and extracts text from PDFs.
	•	sentence-transformers → Creates embeddings (vector representations of text).
	•	faiss-cpu → Stores vectors and finds the most similar ones quickly.
	•	numpy → Handles numerical arrays (FAISS takes its vectors as NumPy arrays).
	•	openai → Lets you call GPT models with your API key.


In [1]:
from pypdf import PdfReader

import faiss
import numpy as np
from openai import OpenAI
import os

In [3]:
from sentence_transformers import SentenceTransformer # Hugging face

In [4]:
# Read PDF: concatenate the extracted text of every page
def read_pdf(path):
    reader = PdfReader(path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text() or ""  # guard against pages with no extractable text
        text += page_text
    return text


In [5]:
PDF_FILE = "data/Introduction to Neural Networks.pdf"

In [6]:
read_pdf(PDF_FILE)

"Introduction to Neural \nNetworks\nAgenda - Schedule\n1. Historical Intro to Neural Nets\n2. Perceptrons\n3. Break\n4. Gradient Descent\n5. Simple Feed Forward Networks Hyperplanes, bayes theorem, neighbors, why \nnot just take inspiration from the brain?\nAgenda - Goals\n● \nHistorical/Biological TangentWhile all of these are interesting ideas and could entail years of study, let’s \nfocus in on the “perceptron” as another attempt at “solving” intelligence. \n“Can Machines Think?”(1950) - \nAlan Turing\nLogic Theorist (1955) - Newell & \nSimonAutomatons - Ancient \nEgypt/Greece/China\nPerceptron Machines (1957) - \nRosenblatt\n… \n1000s of years \nlater…\nThe Perceptron - An Imitation of our Neurons\nIn 1957, Frank Rosenblatt (cognitive researcher/psychologist) \npublishes a paper titled “The Perceptron: A Probabilistic Model for \nInformation Storage & Organization in the Brain”\n“If we are eventually to understand the capability of higher organisms \nfor perceptual recognition, gen

In [7]:
text1 = read_pdf(PDF_FILE)

In [8]:
CHUNK_SIZE = 300  # maximum number of characters in each chunk of text
CHUNK_OVERLAP = 50  # each chunk shares 50 characters with the next one

In [9]:
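# Chunking function
def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    # Slide a window of `size` characters, advancing by `size - overlap` each step
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]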
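
To see what the overlap does, here is a small self-contained sketch that runs the same slicing logic on a short string. The helper name `chunk_demo` and the sample text are made up for illustration, and the sizes are tiny so the shared characters are easy to spot.

```python
def chunk_demo(text, size=10, overlap=3):
    """Same sliding-window slicing as chunk_text, with small illustrative defaults."""
    step = size - overlap  # each new chunk starts `step` characters after the previous one
    return [text[i:i + size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"  # hypothetical sample text, 26 characters
demo_chunks = chunk_demo(sample)

for c in demo_chunks:
    print(repr(c))

# The last `overlap` characters of a full-size chunk reappear at the start of the next one
assert demo_chunks[0][-3:] == demo_chunks[1][:3]
```

With size 300 and overlap 50 the window advances 250 characters at a time, so a text of N characters yields ceil(N / 250) chunks, and the final chunk may be shorter than the rest.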

In [10]:
chunk_text(text1)

['Introduction to Neural \nNetworks\nAgenda - Schedule\n1. Historical Intro to Neural Nets\n2. Perceptrons\n3. Break\n4. Gradient Descent\n5. Simple Feed Forward Networks Hyperplanes, bayes theorem, neighbors, why \nnot just take inspiration from the brain?\nAgenda - Goals\n● \nHistorical/Biological TangentWhile',
 'enda - Goals\n● \nHistorical/Biological TangentWhile all of these are interesting ideas and could entail years of study, let’s \nfocus in on the “perceptron” as another attempt at “solving” intelligence. \n“Can Machines Think?”(1950) - \nAlan Turing\nLogic Theorist (1955) - Newell & \nSimonAutomatons - Anc',
 ' Theorist (1955) - Newell & \nSimonAutomatons - Ancient \nEgypt/Greece/China\nPerceptron Machines (1957) - \nRosenblatt\n… \n1000s of years \nlater…\nThe Perceptron - An Imitation of our Neurons\nIn 1957, Frank Rosenblatt (cognitive researcher/psychologist) \npublishes a paper titled “The Perceptron: A Probab',
 'publishes a paper titled “The Perceptron: A Probabilist

In [11]:
chunks = chunk_text(text1)

In [12]:

EMBED_MODEL = "all-MiniLM-L6-v2"  # embedding model used to turn text into vectors
# "all-MiniLM-L6-v2" is a small, fast model from Sentence Transformers that outputs 384-dim vectors.


In [13]:
# Create embeddings
embedder = SentenceTransformer(EMBED_MODEL) # hugging face model
embeddings = embedder.encode(chunks, normalize_embeddings=True)
embeddings = np.array(embeddings, dtype="float32")


In [14]:
embeddings.shape  # 91 chunks, each a 384-dim vector: the "meaning coordinates" for that text in a semantic space

(91, 384)

In [15]:
# Store in FAISS index
dim = embeddings.shape[1]  # 384, the vector size this model produces
index = faiss.IndexFlatIP(dim)  # flat (exact) inner-product index; on normalized vectors this is cosine similarity
index.add(embeddings)  # store the document chunk vectors in the index

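These are the document chunk vectors (each 384-dimensional). They are stored in FAISS so that, given a query vector later, it can quickly find the most similar ones. Because the embeddings were normalized to unit length, the inner product used by IndexFlatIP is the same as cosine similarity.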
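
As a rough illustration of what a flat inner-product search computes, the following dependency-free sketch normalizes a few toy vectors and ranks them by dot product with a query. The vectors and the helper names (`normalize`, `dot`, `top_k`) are invented for this example and are not part of FAISS; the real index does the same exhaustive comparison on NumPy arrays.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Return (score, index) pairs for the k highest inner products."""
    scored = [(dot(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical values, not real model output)
raw_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
docs = [normalize(v) for v in raw_docs]
query_vec = normalize([1.0, 0.2, 0.0])

results = top_k(query_vec, docs, k=2)
for score, idx in results:
    print(f"idx={idx} score={score:.3f}")
```

The returned pairs play the same role as the scores and indices that `index.search` hands back: the index points into the list of chunks, and the score says how closely that chunk's direction matches the query.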


In [16]:

print(index.ntotal) # vectors stored

91


How it works in RAG
1. Retriever (Hugging Face model + FAISS) → Finds relevant chunks.
2. LLM (OpenAI GPT, or any other LLM) → Reads those chunks + your question → writes a final, well-formed answer.
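
The two stages can be sketched without any model at all. In the toy example below, `retrieve` is a stand-in retriever that ranks chunks by shared words (an assumption made for illustration, not how the embedding search works), and `build_prompt` shows how the retrieved text and the question are combined before being handed to an LLM. Both function names and the sample chunks are hypothetical.

```python
def retrieve(query, chunks, k=2):
    """Stage 1 (stand-in retriever): rank chunks by the number of words shared with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Stage 2 input: combine the retrieved chunks and the question into one prompt string."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical sample chunks
toy_chunks = [
    "a neuron is a cell that takes input and provides output",
    "gradient descent minimizes a loss function",
    "a perceptron imitates a neuron",
]

question = "what is a neuron"
prompt = build_prompt(question, retrieve(question, toy_chunks))
print(prompt)
```

Only the retrieved chunks reach the prompt, so the LLM never sees the unrelated gradient descent text.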


In [None]:
# CONFIG
OPENAI_API_KEY = ""  # paste your API key here; do not share or commit it

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
client = OpenAI(api_key=OPENAI_API_KEY)


In [18]:
TOP_K = 3  # FAISS compares the query against all chunks and returns the 3 most similar ones

In [19]:
def rag(query):
    # Embed the query with the same model used for the documents
    ques_emb = embedder.encode([query], normalize_embeddings=True).astype("float32")

    # Retrieve the top-k most similar chunks from FAISS
    D, I = index.search(ques_emb, TOP_K)  # D: scores, I: indices into the chunks list

    print("\n--- Chunks-------")
    for score, idx in zip(D[0], I[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than TOP_K results exist
            continue
        print(f"[score={score:.3f} idx={idx}] {chunks[idx][:200]}...")

    # Join the retrieved chunks into one context string, separated by blank lines
    context = "\n\n".join(chunks[idx] for idx in I[0] if idx != -1)

    print(f'--------Context-----------{context}')


In [20]:
rag('what is a neuron?')


--- Chunks-------
[score=0.643 idx=5] f our Neurons
Before exploring the features of a perceptron, 
let’s recognize the most basic components of 
our brain: the neuron.
A neuron is a cell that takes input and provides 
output via electric...
[score=0.548 idx=12] output as 
“1. ” Just like a neuron!
The Perceptron - An Imitation of our Neurons
However this only worked for linearly 
separable classiﬁcations.
Perhaps to achieve complex classiﬁcations, we 
don’t ...
[score=0.542 idx=13] mitation of our Neurons
However this only worked for linearly 
separable classiﬁcations.
Perhaps to achieve complex classiﬁcations, we 
don’t use one neuron, but rather… 
A network of neurons (a neura...
--------Context-----------f our Neurons
Before exploring the features of a perceptron, 
let’s recognize the most basic components of 
our brain: the neuron.
A neuron is a cell that takes input and provides 
output via electrical signals (action potential) 
once we cross a certain threshold potential. 
This

In [21]:
# Retrieve context for the query and ask the LLM
def rag_answer(query):
    q_emb = embedder.encode([query], normalize_embeddings=True).astype("float32")  # query vector
    D, I = index.search(q_emb, TOP_K)  # compare the query vector with all chunks; get the TOP_K most similar

    # Get top chunks
    retrieved_chunks = []
    for score, idx in zip(D[0], I[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than TOP_K results exist
            continue
        retrieved_chunks.append(f"[Source chunk {idx}] {chunks[idx]}")

    # Join outside the f-string: a backslash inside an f-string expression
    # is a SyntaxError before Python 3.12
    context = "\n".join(retrieved_chunks)

    # GPT prompt
    prompt = (
        "You are a helpful assistant. Use the provided context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer with facts from the context. If unsure, say 'I don't know'."
    )

    # GPT call
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return response.choices[0].message.content, retrieved_chunks

    


In [22]:
rag_answer("what is a neuron")

('A neuron is a cell that takes input and provides output via electrical signals (action potential) once a certain threshold potential is crossed.',
 ['[Source chunk 5] f our Neurons\nBefore exploring the features of a perceptron, \nlet’s recognize the most basic components of \nour brain: the neuron.\nA neuron is a cell that takes input and provides \noutput via electrical signals (action potential) \nonce we cross a certain threshold potential. \nThis by no means is a b',
  '[Source chunk 13] mitation of our Neurons\nHowever this only worked for linearly \nseparable classiﬁcations.\nPerhaps to achieve complex classiﬁcations, we \ndon’t use one neuron, but rather… \nA network of neurons (a neural network?)\nApplications of Neural Networks\nThere are are many different applications of this \npower',
  '[Source chunk 12] output as \n“1. ” Just like a neuron!\nThe Perceptron - An Imitation of our Neurons\nHowever this only worked for linearly \nseparable classiﬁcations.\nPerhaps to achie

________________END_____________________________

In [23]:
rag_answer('What is z?')

("I don't know.",
 ['[Source chunk 12] output as \n“1. ” Just like a neuron!\nThe Perceptron - An Imitation of our Neurons\nHowever this only worked for linearly \nseparable classiﬁcations.\nPerhaps to achieve complex classiﬁcations, we \ndon’t use one neuron, but rather… \nThe Perceptron - An Imitation of our Neurons\nHowever this only worked f',
  '[Source chunk 8] r that’s about.\n“Did I turn the stove off?”\n“What will I get my mom for \nher birthday?”\n“2 + 2  = ???”\nThe Perceptron - An Imitation of our Neurons\nRosenblatt observed this biological \nphenomenon (and with the help of past \nresearch) and posited:\n“...if one understood the code or ‘wiring diagram’ \no',
  '[Source chunk 4] king, we must \nﬁrst have answers to three fundamental questions:\n1. How is information about the …world sensed…\n2. In what form is information stored…\n3. How does information contained in storage…inﬂuence \nrecognition”\nThe Perceptron - An Imitation of our Neurons\nBefore exploring the featu

In [24]:
rag_answer('Can you show me a deep neural network diagram?')

("I don't know.",
 ['[Source chunk 89]  is a brand new \nproject, you will have to build this project from scratch.\nSubmit a link to your GitHub repository by 4/22.\nTomorrow\nNeural Networks\n○ What is a “recurrent” neural \nnetwork\n○ How do we build a neural \nnetwork?\n○ How do we use neural nets to \nunderstand sentiment?\nHow can we re-organ',
  '[Source chunk 0] Introduction to Neural \nNetworks\nAgenda - Schedule\n1. Historical Intro to Neural Nets\n2. Perceptrons\n3. Break\n4. Gradient Descent\n5. Simple Feed Forward Networks Hyperplanes, bayes theorem, neighbors, why \nnot just take inspiration from the brain?\nAgenda - Goals\n● \nHistorical/Biological TangentWhile',
  '[Source chunk 9] .if one understood the code or ‘wiring diagram’ \nof the nervous system, one should, in principle, be \nable to discover exactly what an organism \nremembers by reconstructing the original \nsensory pattern…”\nThe Perceptron - An Imitation of our Neurons\nWell, lo & behold , a machine that \n