## 0.00 Setup

In [None]:
%pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.25.4-cp39-abi3-win_amd64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.4-cp39-abi3-win_amd64.whl (16.6 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.4
Note: you may need to restart the kernel to use updated packages.


In [19]:
%pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp312-cp312-win_amd64.whl.metadata (4.5 kB)
Downloading faiss_cpu-1.10.0-cp312-cp312-win_amd64.whl (13.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0
Note: you may need to restart the kernel to use updated packages.


### 0.01 Load libraries and API keys

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from dotenv import load_dotenv

# Load API keys from .env file 
load_dotenv(override=True)

True

In [None]:
# Step 0.02: Check that the API key is set (without printing the secret)
import os
print(f"[API KEY] LANGSMITH_API_KEY is set: {bool(os.environ.get('LANGSMITH_API_KEY'))}")
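
Printing a key writes the secret into the notebook output, where it can end up in version control or a shared export. Below is a minimal sketch of a presence check that reports only the names of missing variables. The helper `missing_env_vars` and the `REQUIRED_KEYS` list are illustrative assumptions, not part of the original notebook; `OPENAI_API_KEY` is listed on the assumption that `ChatOpenAI` and `OpenAIEmbeddings` read it from the environment.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Hypothetical list of keys for this notebook; adjust to what your setup uses.
REQUIRED_KEYS = ["OPENAI_API_KEY", "LANGSMITH_API_KEY"]

missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Only variable names are printed, so the cell output is safe to share.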

## 1.00 Set up RAG pipeline

### 1.01 Load the documents

In [4]:
### Load PDF Document
loader = PyMuPDFLoader("documents/2025_NARST_conference_program_book.pdf")
docs = loader.load()
print(f"Number of pages in the document: {len(docs)}")

Number of pages in the document: 212


In [12]:
### Check document content
# Print the first page content
print(docs[0].page_content)

98th NARST International Conference | Digital Program
Chicago, Illinois, Hilton Downtown Chicago
Washington, DC
March 23 - 26, 2025


In [None]:
### Check metadata of loaded document
docs[3].__dict__

{'id': None,
 'metadata': {'producer': 'Adobe PDF Library 17.0',
  'creator': 'Adobe InDesign 20.2 (Macintosh)',
  'creationdate': '2025-03-14T13:22:13-04:00',
  'source': 'documents/2025_NARST_conference_program_book.pdf',
  'file_path': 'documents/2025_NARST_conference_program_book.pdf',
  'total_pages': 212,
  'format': 'PDF 1.6',
  'title': '',
  'author': '',
  'subject': '',
  'keywords': '',
  'moddate': '2025-03-17T10:04:33-06:00',
  'trapped': '',
  'modDate': "D:20250317100433-06'00'",
  'creationDate': "D:20250314132213-04'00'",
  'page': 3},
 'page_content': '98th NARST International Conference     March 23–26, 2025     4\n98th NARST International Conference\nGeneral Information\nGeneral Information\nInformation about NARST\nNARST is a global organization for improving science \nteaching and learning through research. Since its \ninception in 1928, NARST has promoted research in \nscience education and the communication of \nknowledge generated by the research. The ultimate

### 1.02 Document Chunking

In [14]:
## Define Chunking Size and Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)
print(f"Number of split chunks: {len(split_documents)}")

Number of split chunks: 1099
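
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch using a fixed sliding window over characters. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line boundaries); the function `sliding_window_chunks` is a hypothetical helper that only illustrates the overlap idea.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each window repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.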


### 1.03 Generate Chunk Embeddings 

In [15]:
# Initialize the embedding model
# (the chunk vectors themselves are computed when the vector store is built)
embeddings = OpenAIEmbeddings()

### 1.04 Store Embedded Chunks in Vector Database

In [27]:
# Create the vector store.
# The index is held in memory; nothing is saved to disk in this step.
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)

In [None]:
searchTerm = "What is the conference theme?"
number_of_results = 3

In [31]:
for doc in vectorstore.similarity_search(searchTerm, k=number_of_results):
    # Print the document content
    print(doc.page_content)

from a place of trust and relationship-building? Given 
that NARST's ultimate goal is to help all learners 
achieve science literacy, how might we reimagine 
science literacy with social, environmental, and 
epistemological justice at its core? 
This conference theme invites us to share the ways 
that we can transgress canonical boundaries in 
science education and expand dialogues on strategies 
for disrupting structures that sustain inequities, and
2025 NARST Annual International Conference, Washington DC 
 
 
44 
 
Strand 14: Environmental Education and 
Sustainability 
Stand-Alone Paper 
Bridging Roles: Educators and High 
School Graduates’ Sense of Climate 
Change. 
Shaima Alokbe*, Ben-Gurion University of 
the Negev, Israel 
Areej Nbari*, Ben-Gurion University of the 
Negev, Israel 
Wisam Sedawi*, University of Michigan, 
USA 
Orit Ben Zvi Assaraf, Ben-Gurion 
University of the Negev, Israel 
 
 
 
Closing remarks
in the spirit of Bell Hooks, “[envision] new, alternative, 
opposi
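
The search above ranks chunks by how close their embedding vectors are to the embedding of the query. A toy sketch of that ranking step follows, using Euclidean distance on made-up three-dimensional vectors. The function `top_k`, the `toy_index` data, and the choice of Euclidean distance are assumptions for illustration; they do not reproduce FAISS internals or real embedding values.

```python
import math


def top_k(query_vec, indexed, k=3):
    """Return the k (text, distance) pairs closest to query_vec by Euclidean distance."""
    scored = [(text, math.dist(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up 3-dimensional "embeddings"; real ones have far more dimensions.
toy_index = [
    ("conference theme", (0.9, 0.1, 0.0)),
    ("hotel floor plan", (0.0, 0.2, 0.9)),
    ("closing remarks", (0.6, 0.5, 0.1)),
]

for text, distance in top_k((1.0, 0.0, 0.0), toy_index, k=2):
    print(f"{distance:.3f}  {text}")
```

A smaller distance means a closer match, which is why unrelated chunks can still be returned when `k` is larger than the number of truly relevant chunks.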

### 1.05 Define Retriever to perform similarity search

In [32]:
# Create Retriever
# Search and retrieve information contained in the documents.
retriever = vectorstore.as_retriever()

In [33]:
initial_query = "What is the conference theme?"

In [34]:
retriever.invoke(initial_query)

[Document(id='67d5f200-1da7-4d93-b9e7-d9e0486e5d42', metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 20.2 (Macintosh)', 'creationdate': '2025-03-14T13:22:13-04:00', 'source': 'documents/2025_NARST_conference_program_book.pdf', 'file_path': 'documents/2025_NARST_conference_program_book.pdf', 'total_pages': 212, 'format': 'PDF 1.6', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-03-17T10:04:33-06:00', 'trapped': '', 'modDate': "D:20250317100433-06'00'", 'creationDate': "D:20250314132213-04'00'", 'page': 33}, page_content="from a place of trust and relationship-building? Given \nthat NARST's ultimate goal is to help all learners \nachieve science literacy, how might we reimagine \nscience literacy with social, environmental, and \nepistemological justice at its core? \nThis conference theme invites us to share the ways \nthat we can transgress canonical boundaries in \nscience education and expand dialogues on strategies \nfor disrupt

### 1.06 Create prompt for performing RAG

In [35]:
# Create Prompt
prompt = PromptTemplate.from_template(
    """You are a thinking partner for a teacher working to adapt their curricular material. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

#Context: 
{context}

#Question:
{question}

#Answer:"""
)
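
The template has two placeholders, `{context}` and `{question}`, that are filled in each time the chain runs. As a rough illustration of that substitution, here is a plain-Python sketch using `str.format` on a shortened copy of the template. `TEMPLATE` and `fill_prompt` are hypothetical names; the real `PromptTemplate` object does more than this (for example, input validation).

```python
# Shortened copy of the notebook's template, kept as a plain string.
TEMPLATE = """Use the following pieces of retrieved context to answer the question.

#Context:
{context}

#Question:
{question}

#Answer:"""


def fill_prompt(context, question):
    """Substitute the two placeholders and return the final prompt text."""
    return TEMPLATE.format(context=context, question=question)


print(fill_prompt("The conference runs March 23 - 26, 2025.", "When is the conference?"))
```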

### 1.07 Define the LLM

In [36]:
# Setup LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

### 1.08 Produce Output from Defined Chain

In [37]:
# Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
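
In this chain the retriever's output, a list of `Document` objects, is passed straight into `{context}`, so the prompt receives the list's string form, metadata included. A common alternative is to join only the chunk text first. The sketch below shows such a helper; `format_docs` is a hypothetical name, and `SimpleNamespace` objects stand in for LangChain documents so the example runs without any API calls.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for Document objects; only the page_content attribute is used.
fake_docs = [
    SimpleNamespace(page_content="First chunk text.", metadata={"page": 0}),
    SimpleNamespace(page_content="Second chunk text.", metadata={"page": 1}),
]
print(format_docs(fake_docs))

# In the chain, the helper would sit between the retriever and the prompt:
# chain = (
#     {"context": retriever | format_docs, "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )
```

Dropping the metadata keeps the prompt shorter and leaves more of the context window for the chunk text itself.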

### 1.09 Run Chain

In [42]:
# Run Chain
# Input a query about the document and print the response.
question = "What are important design principles to think about when adapting my science unit?"
response = chain.invoke(question)
print(response)

I don't know. The provided context does not include specific information about design principles for adapting a science unit.


## 2.00 Combined RAG Code

In [None]:
## Step 0: Import Libraries and Load Environment Variables
import os
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load API keys from the .env file
load_dotenv(override=True)

document_path = "documents"

# Define the root folder that contains your PDFs
pdf_folder = document_path  # adjust to your folder path

# List to store all loaded documents
all_docs = []

# Walk through the folder and its subfolders
for root, dirs, files in os.walk(pdf_folder):
    for file in files:
        if file.lower().endswith(".pdf"):
            file_path = os.path.join(root, file)
            print(f"Loading file: {file_path}")
            loader = PyMuPDFLoader(file_path)
            docs = loader.load()  # load returns a list of Document objects (e.g., one per page)
            all_docs.extend(docs)

print(f"Loaded {len(all_docs)} documents from {pdf_folder}.")


chunk_size = 500
chunk_overlap = 50
number_of_chunks = 3

question = "What are important design principles to think about when adapting my science unit?"
model_name = "gpt-4o"
temperature = 0

Loading file: documents\2025_NARST_conference_program_book.pdf
Loaded 212 documents from documents.
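
The folder walk above can also be written with `pathlib`. This is a minimal sketch of an equivalent file search, not a replacement for the loading loop: `find_pdfs` is a hypothetical helper that only collects paths, and each path would still need to be passed to `PyMuPDFLoader`.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return all PDF paths under `folder` (recursively), sorted, ignoring extension case."""
    root = Path(folder)
    if not root.is_dir():
        return []  # a missing folder yields no files rather than an error
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


for pdf_path in find_pdfs("documents"):
    print(f"Found: {pdf_path}")
```

Sorting the paths makes the load order, and therefore the chunk order, reproducible across runs.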


In [None]:

# Step 1: Load Documents
# The PDFs were already loaded into all_docs in the previous cell.

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
split_documents = text_splitter.split_documents(all_docs)

# Step 3: Initialize the Embedding Model
embeddings = OpenAIEmbeddings()

# Step 4: Create the Vector Store
# The index is held in memory; nothing is saved to disk in this step.
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)

# Step 5: Create Retriever
# Search and retrieve information contained in the documents,
# returning the number_of_chunks closest chunks per query.
retriever = vectorstore.as_retriever(search_kwargs={"k": number_of_chunks})

# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
    """You are a thinking partner for a teacher working to adapt their curricular material. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know.  

#Context: 
{context}

#Question:
{question}

#Answer:"""
)

# Step 7: Load LLM
llm = ChatOpenAI(model_name=model_name, temperature=temperature)

# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [49]:
# Run Chain
# Input a query about the document and print the response.
response = chain.invoke(question)
print(response)

Based on the retrieved context, here are some important design principles to consider when adapting your science unit:

1. **Promoting Critical Thinking**: Consider incorporating a newly designed instructional framework that emphasizes critical thinking. This can help students engage more deeply with the material and develop essential analytical skills.

2. **Addressing Instructional Shifts**: Be aware of the instructional shifts in science education and support students' sensemaking. This involves helping students understand and make sense of scientific concepts through inquiry-based learning and other student-centered approaches.

3. **Incorporating Play and Joyful Methodologies**: Think about integrating play and joyful methodologies into your science learning activities. This can make learning more engaging and enjoyable for students, fostering a positive learning environment.

4. **Enacting Care and Inclusivity**: Consider how you can enact care alongside students, teachers, and c