# Midterm Challenge Notebook - Mike Dean

In [1]:
!pip install -qU langchain langchain_openai langchain_core==0.2.40 langchain_community
!pip install -qU qdrant_client pymupdf tiktoken

## Task 1.  Dealing with the Data
(Role: AI Solutions Engineer)

In [2]:
import os, tiktoken
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Path to my directory containing PDF files
directory = "References/"

# List to store all the documents
all_docs = []

# Iterate through all the files in the directory
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):  # Check if the file is a PDF
        file_path = os.path.join(directory, filename)
        loader = PyMuPDFLoader(file_path)
        docs = loader.load()
        all_docs.extend(docs)  # Append the loaded docs to my list

# Default behavior is to break PDF files into their pages
# Using tiktoken, I checked the token lengths of several representative pages and
# the lengths were always less than 1000 tokens, so INITIAL STRATEGY is to use
# each document as a single chunk and not further split.

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

page_split_vectorstore = Qdrant.from_documents(
    all_docs,
    embedding_model,
    location=":memory:",
    collection_name="page_split_collection",
)
page_split_retriever = page_split_vectorstore.as_retriever()


In [3]:

# ALTERNATIVE STRATEGY is to recombine all the pages into one string document and then
# split it.  The advantage of this approach is to have chunk overlap, which is not
# possible with my initial strategy.

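# Join the pages with a newline so the last word of one page
# does not fuse with the first word of the next page
one_document = "\n".join(doc.page_content for doc in all_docs)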

def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-4o").encode(
        text,
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 800,
    chunk_overlap = 400,
    length_function = tiktoken_len,
)

split_chunks = text_splitter.split_text(one_document)

chunk_split_vectorstore = Qdrant.from_documents(
    text_splitter.create_documents(split_chunks),
    embedding_model,
    location=":memory:",
    collection_name="chunk_split_collection",
)

chunk_split_retriever = chunk_split_vectorstore.as_retriever()


In [4]:
## Check that I have two vectorstores in memory
client = page_split_vectorstore.client
print(client.get_collections())
client = chunk_split_vectorstore.client
print(client.get_collections())

collections=[CollectionDescription(name='page_split_collection')]
collections=[CollectionDescription(name='chunk_split_collection')]
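
The per-page token check mentioned in the comments of cell 2 is not shown in the notebook. A minimal sketch of how it might look is below. The helper `pages_over_limit`, the stand-in `whitespace_len`, and the sample pages are all hypothetical names introduced for illustration; in the notebook the inputs would be `[doc.page_content for doc in all_docs]` and the `tiktoken_len` function from cell 3.

```python
from typing import Callable, Iterable


def pages_over_limit(
    page_texts: Iterable[str],
    length_fn: Callable[[str], int],
    limit: int = 1000,
) -> list[tuple[int, int]]:
    """Return (page_index, length) for every page whose length is >= limit."""
    over = []
    for index, text in enumerate(page_texts):
        count = length_fn(text)
        if count >= limit:
            over.append((index, count))
    return over


def whitespace_len(text: str) -> int:
    """Stand-in length function (word count) so the sketch runs offline.

    Assumption: in the notebook, tiktoken_len would be passed instead.
    """
    return len(text.split())


# Hypothetical sample pages, not taken from the real PDFs
sample_pages = ["short page " * 10, "long page " * 600]
print(pages_over_limit(sample_pages, whitespace_len))  # [(1, 1200)]
```

An empty result would confirm that every page can be embedded as a single chunk, rather than relying on a few representative pages.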


## Task 1 Deliverables:
1.  Describe the default chunking strategy that I will use:<br>
The default strategy will be to load the two PDF files using `PyMuPDFLoader` just as we have previously done.  This results in each PDF page being its own document.  I have checked sample pages with `tiktoken` and the token count per page is <1000, so these are small enough to just embed without further splitting. I saved these embeddings in `page_split_vectorstore`.

2.  Articulate a chunking strategy that I will also test:<br>
The disadvantage of the default strategy is that there is no chunk overlap between pages, which might weaken the ability to connect two pages that are both relevant to a query.  So I recombine the `page_content` of all pages into a single string, convert it into a document, and split it with a chunk size of 800 tokens and an overlap of 400 tokens (the default settings used by OpenAI).  This strategy allows chunks to overlap, perhaps adding semantic continuity between adjacent pages.  These chunks were embedded with the same embedding model and saved in `chunk_split_vectorstore`.

3.  Describe how and why I made these decisions:<br>
The default behavior of `PyMuPDFLoader` is not bad and I have been using it for several months.  However, I had been splitting each of the resulting page documents without thinking through that a chunk size greater than the page itself makes the split meaningless.  I also set a chunk overlap, but had not thought through the implications of each page being a separate document: the overlap never crosses a page boundary.  So I made these decisions for this Midterm Challenge so that I can later compare the performance of the two strategies using RAGAS in Task 5.
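
To illustrate why overlap only matters once the pages are recombined, here is a simplified sketch using a plain fixed-size sliding window over a token list. This is an assumption made for illustration, not how `RecursiveCharacterTextSplitter` works internally (it splits on separators and merges pieces), and `sliding_window_chunks` is a hypothetical helper. The sizes 800 and 400 match the settings used above; the 600-token page is made up.

```python
def sliding_window_chunks(tokens: list, chunk_size: int = 800, chunk_overlap: int = 400) -> list:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the input
        start += stride
    return chunks


# A single hypothetical 600-token page is shorter than chunk_size,
# so it comes back as one chunk and the overlap setting is never used.
page = list(range(600))
print(len(sliding_window_chunks(page)))  # 1

# Three such pages joined into one 1800-token document produce
# overlapping windows that span the former page boundaries.
combined = list(range(1800))
chunks = sliding_window_chunks(combined)
print(len(chunks))  # 4
print(len(set(chunks[0]) & set(chunks[1])))  # 400
```

With a stride of 800 - 400 = 400 tokens, each token in the middle of the combined document lands in two chunks, which is the continuity the alternative strategy is meant to provide.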

## Task 2.  Building a Quick End-to-End Prototype
(Role: AI Systems Engineer)

In [5]:
from langchain.prompts import ChatPromptTemplate
from langchain_openai.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

rag_prompt_template = """\
You are a helpful and polite and cheerful assistant who answers questions based solely on the provided context. 
Use the context to answer the question and provide a  clear answer. Do not mention the document in your
response.
If there is no specific information
relevant to the question, then tell the user that you can't answer based on the context.

Context:
{context}

Question:
{question}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)

In [6]:
from operator import itemgetter
from langchain.schema.output_parser import StrOutputParser

## CREATE MY TWO RAG CHAINS

page_split_rag_chain = (
    {"context": itemgetter("question") | page_split_retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

chunk_split_rag_chain = (
    {"context": itemgetter("question") | chunk_split_retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)


In [7]:
from IPython.display import Markdown, display

## TEST THE TWO CHAINS
page_response = (page_split_rag_chain.invoke({"question": "List the ten major risks of AI?"}))
display(Markdown(page_response))

chunk_response = (chunk_split_rag_chain.invoke({"question": "What are some risks of AI?"}))
display(Markdown(chunk_response))

Based on the provided context, the ten major risks of AI, particularly Generative AI (GAI), can be categorized as follows:

1. **Confabulation**
2. **Dangerous or Violent Recommendations**
3. **Data Privacy**
4. **Value Chain and Component Integration**
5. **Harmful Bias**
6. **Homogenization**
7. **CBRN Information or Capabilities**
8. **Human-AI Configuration**
9. **Obscene, Degrading, and/or Abusive Content**
10. **Information Integrity**

These risks can be further categorized into technical/model risks, misuse by humans, and ecosystem/societal risks.

Some risks of AI include:

1. **Confabulation**: The production of confidently stated but erroneous or false content, which can mislead or deceive users.
2. **Dangerous, Violent, or Hateful Content**: Eased production of and access to harmful content, including violent, inciting, radicalizing, or threatening material.
3. **Data Privacy**: Leakage and unauthorized use or disclosure of sensitive data like biometric, health, or location information.
4. **Environmental Impacts**: High compute resource utilization in training or operating AI models, which can adversely impact ecosystems.
5. **Harmful Bias or Homogenization**: Amplification and exacerbation of historical, societal, and systemic biases, and performance disparities between sub-groups or languages.
6. **Algorithmic Monocultures**: Increased susceptibility to correlated failures due to the repeated use of the same model in consequential decision-making settings.

These risks can arise during various stages of the AI lifecycle, from design and development to deployment and operation, and can have impacts at individual, application, and ecosystem levels.

## Task 2 Deliverables:
1.  Build a live public prototype on Hugging Face, and include the public URL link to my space.<br>

2.  How did I choose my stack, and why did I select each tool the way I did? <br>



## Task 3.  Creating a Golden Test Data Set
(Role: AI Evaluation and Performance Engineer)

## Task 3 Deliverables:
1.  Assess my pipeline using the RAGAS framework, including the key metrics of faithfulness, answer relevancy, context precision, and context recall.  Provide a table of my output results.<br>

2.  What conclusions can I draw about performance and effectiveness of my pipeline with this information? <br>


## Task 4.  Fine-Tuning Open-Source Embeddings
(Role: Machine Learning Engineer)

## Task 4 Deliverables:
1.  Swap out my existing embedding model for the new fine-tuned version.  Provide a link to my fine-tuned embedding model on the Hugging Face Hub.<br>

2.  How did I choose the embedding model for this application?<br>


## Task 5.  Assessing Performance
(Role: AI Evaluation and Performance Engineer)

## Task 5 Deliverables:
1.  Test the fine-tuned embedding model using the RAGAS framework to quantify any improvements.  Provide results in a table.<br>

2.  Test the two chunking strategies using the RAGAS framework to quantify any improvements.  Provide results in a table.<br>

3.  The AI Solutions Engineer asks me, "Which one is the best to test with internal stakeholders next week, and why?"<br>


## Task 6.  Managing Your Boss and User Expectations
(Role: SVP of Technology)

## Task 6 Deliverables:
1.  What is the story that I will give to the CEO to tell the whole company at the launch next month?<br>

2.  There appears to be important information not included in our build.  How might we incorporate relevant White House briefing information in future versions? <br>
