## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project uses RAG (Retrieval-Augmented Generation) to improve the accuracy of our question-answering assistant by grounding its answers in documents retrieved from our knowledge base.

In [1]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr

In [2]:
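# imports for langchain
# Note: in newer LangChain releases the loaders live in langchain_community.document_loaders
# and the splitters in langchain_text_splitters; adjust these imports if they fail for your version

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter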

In [3]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "gpt-4o-mini"
db_name = "vector_db"

In [4]:
# Load environment variables from a file called .env

load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')

In [16]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledge base

folders = glob.glob("knowledge-base/*")

# With thanks to CG and Jon R, students on the course, for this fix, which some users need
text_loader_kwargs = {'encoding': 'utf-8'}
# If that doesn't work, some Windows users might need to uncomment the next line instead
# text_loader_kwargs={'autodetect_encoding': True}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)
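
A quick sanity check after loading is to count how many documents landed under each `doc_type`. The sketch below is illustrative only: `count_by_doc_type` is a hypothetical helper, and the sample objects are stand-ins that mimic the `.metadata` attribute of a LangChain `Document` so the snippet runs without the knowledge base.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_doc_type(docs):
    """Count documents per metadata['doc_type'] value (hypothetical helper)."""
    return Counter(doc.metadata.get("doc_type", "unknown") for doc in docs)


# Stand-in objects with a .metadata dict, in place of real loaded documents
sample_docs = [
    SimpleNamespace(metadata={"doc_type": "grade"}),
    SimpleNamespace(metadata={"doc_type": "grade"}),
    SimpleNamespace(metadata={"doc_type": "staff"}),
]

counts = count_by_doc_type(sample_docs)
print(dict(counts))
```

In the notebook itself, the same helper could be called as `count_by_doc_type(documents)`.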

In [17]:
len(documents)

17

In [18]:
documents[4]

Document(metadata={'source': 'knowledge-base\\grade\\grade5.md', 'doc_type': 'grade'}, page_content='## Grade 5\n\n| Enrollment No. | First Name | Last Name | Age | Height (cm) | English | Math | Science | Art | French | Average | Grade | Extracurricular Activities |\n|---|---|---|---|---|---|---|---|---|---|---|---|\n| 1 | Aaron | Davis | 10 | 135 | 88 | 92 | 85 | 90 | 88 | 88.6 | A- | Soccer, Art Club |\n| 2 | Brianna | Miller | 10 | 140 | 90 | 95 | 88 | 90 | 92 | 91 | A | Music, Chess Club |\n| 3 | Caleb | Garcia | 10 | 133 | 82 | 88 | 80 | 92 | 80 | 84.4 | B+ | Dance, Swimming |\n| 4 | Dylan | Wilson | 10 | 137 | 90 | 95 | 88 | 85 | 90 | 89.6 | A | Basketball, Drama |\n| 5 | Emily | Anderson | 10 | 130 | 85 | 88 | 92 | 80 | 78 | 84.6 | B+ | Karate, Book Club |\n| 6 | Finn | Thomas | 10 | 135 | 78 | 82 | 75 | 88 | 80 | 80.6 | B | Football, Music |\n| 7 | Grace | Martinez | 10 | 133 | 92 | 95 | 90 | 85 | 90 | 90.4 | A | Soccer, Art Club |\n| 8 | Hayden | Rodriguez | 10 | 130 | 88 | 9

In [28]:
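# Split the documents into chunks of up to about 1000 characters, with up to 200 characters of overlap
# CharacterTextSplitter splits on blank lines ("\n\n") by default
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)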

In [29]:
len(chunks)

25
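
Beyond the chunk count, it can help to look at chunk lengths, since very short or oversized chunks may affect retrieval quality later. The following is a minimal sketch, assuming each chunk exposes a `page_content` string; `chunk_length_summary` is a hypothetical helper and the sample chunks are stand-ins, not data from the knowledge base.

```python
from types import SimpleNamespace


def chunk_length_summary(chunks, chunk_size):
    """Summarise chunk lengths and count chunks longer than chunk_size (hypothetical helper)."""
    lengths = [len(chunk.page_content) for chunk in chunks]
    return {
        "count": len(lengths),
        "shortest": min(lengths),
        "longest": max(lengths),
        "over_limit": sum(1 for n in lengths if n > chunk_size),
    }


# Stand-in chunks with a .page_content attribute (illustrative lengths only)
sample_chunks = [
    SimpleNamespace(page_content="x" * 10),
    SimpleNamespace(page_content="x" * 1500),
    SimpleNamespace(page_content="x" * 800),
]

summary = chunk_length_summary(sample_chunks, chunk_size=1000)
print(summary)
```

In the notebook this could be run as `chunk_length_summary(chunks, 1000)`.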

In [30]:
chunks[6]

Document(metadata={'source': 'knowledge-base\\grade\\grade4.md', 'doc_type': 'grade'}, page_content='## Grade 4')
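
A chunk containing only the heading `## Grade 4` is what you would expect if the splitter breaks text on blank lines and then merges pieces up to `chunk_size`: when the table that follows the heading is already longer than the limit, the heading cannot be merged with it and ends up alone. The code below is a simplified sketch of that idea, not LangChain's actual implementation. Overlap is deliberately omitted, and `split_and_merge` is a hypothetical name.

```python
def split_and_merge(text, chunk_size, separator="\n\n"):
    """Split on a separator, then greedily merge pieces without exceeding chunk_size.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either this starts a new chunk, or the piece still fits
            current = candidate
        else:
            # Adding the piece would exceed the limit: close the chunk
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Illustrative text: a short heading followed by one long block with no blank lines
sample_text = "## Grade 4\n\n" + "| row |\n" * 200
sample_chunks = split_and_merge(sample_text, chunk_size=1000)
print([len(c) for c in sample_chunks])
```

If tiny heading-only chunks turn out to hurt retrieval, one option is to experiment with a different separator or chunk size.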

In [31]:
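# Collect the distinct document types; sort them so the printed order is stable between runs
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(sorted(doc_types))}")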

Document types found: grade, staff


In [35]:
# Print every chunk whose text mentions 'Sarah' (case-sensitive match)
for chunk in chunks:
    if 'Sarah' in chunk.page_content:
        print(chunk)
        print("_________")

page_content='# Bayside Elementary School Staff Directory


### Grade 1: Ms. Sarah Thompson
**Resume:**
- Bachelor of Elementary Education, University of Michigan
- 8 years teaching experience
- Specialization: Early Childhood Literacy
- Certifications: Elementary Education K-6

**Previous Year Performance Report:**
- Student Reading Growth: 87% above district average
- Classroom Engagement: Excellent
- Parent Satisfaction Rating: 94%
- Implemented innovative phonics program' metadata={'source': 'knowledge-base\\staff\\Sarah Thompson.md', 'doc_type': 'staff'}
_________
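
The keyword scan above is case-sensitive and prints whole chunks. A small reusable variant is sketched below, assuming chunks expose `page_content` and `metadata` as in the output above. `find_chunks` is a hypothetical helper, and the sample chunks are shortened stand-ins so the snippet runs on its own.

```python
from types import SimpleNamespace


def find_chunks(chunks, keyword):
    """Return the chunks whose text contains keyword, ignoring case (hypothetical helper)."""
    needle = keyword.lower()
    return [chunk for chunk in chunks if needle in chunk.page_content.lower()]


# Stand-in chunks (shortened, illustrative content)
sample_chunks = [
    SimpleNamespace(page_content="### Grade 1: Ms. Sarah Thompson", metadata={"doc_type": "staff"}),
    SimpleNamespace(page_content="## Grade 4", metadata={"doc_type": "grade"}),
]

matches = find_chunks(sample_chunks, "sarah")
for chunk in matches:
    # Show the document type and a short preview of each match
    print(chunk.metadata["doc_type"], "|", chunk.page_content[:60])
```

A plain substring match like this only finds exact wording, which is useful for spot checks but is not a substitute for the retrieval step in a RAG pipeline.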
