# Midterm Challenge Notebook - Mike Dean

In [12]:
!pip install -qU langchain langchain_openai langchain_core==0.2.40 langchain_community
!pip install -qU qdrant_client pymupdf tiktoken ragas pandas

## Task 1.  Dealing with the Data
(Role: AI Solutions Engineer)

In [3]:
import os, tiktoken
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Path to my directory containing PDF files
directory = "References/"

# List to store all the documents
all_docs = []

# Iterate through all the files in the directory (sorted, so the page order
# and the recombined text built from it later are reproducible across runs)
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".pdf"):  # Check if the file is a PDF
        file_path = os.path.join(directory, filename)
        loader = PyMuPDFLoader(file_path)
        docs = loader.load()
        all_docs.extend(docs)  # Append the loaded docs to my list

# Default behavior is to break PDF files into their pages
# Using tiktoken, I checked the token lengths of several representative pages and
# the lengths were always less than 1000 tokens, so INITIAL STRATEGY is to use
# each document as a single chunk and not further split.

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

page_split_vectorstore = Qdrant.from_documents(
    all_docs,
    embedding_model,
    location=":memory:",
    collection_name="page_split_collection",
)
page_split_retriever = page_split_vectorstore.as_retriever()
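
The page-length check mentioned in the comments above can be sketched as follows. This is a minimal sketch, not the code that was actually run: `page_length_report` is a hypothetical helper, and a whitespace word count stands in for a `tiktoken`-based length function so that the example has no dependencies. In the notebook, a token-counting function and `[d.page_content for d in all_docs]` would be passed instead.

```python
from typing import Callable, Dict, List


def page_length_report(
    pages: List[str],
    length_fn: Callable[[str], int],
    limit: int = 1000,
) -> Dict[str, object]:
    """Summarize page lengths and list the pages whose length is >= limit."""
    lengths = [length_fn(page) for page in pages]
    return {
        "max_length": max(lengths, default=0),
        "over_limit": [i for i, n in enumerate(lengths) if n >= limit],
    }


# Stand-in length function (assumption): counts whitespace-separated words.
def word_len(text: str) -> int:
    return len(text.split())


# Toy pages: the second one is deliberately longer than the limit.
sample_pages = ["short page", "word " * 1200, "another short page"]
report = page_length_report(sample_pages, word_len)
print(report)
```

Running a check like this over every page, rather than over a few representative ones, would confirm that no page exceeds the limit before the whole-page strategy is relied on.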


In [4]:

# ALTERNATIVE STRATEGY is to recombine all the pages into one string document and then
# split it.  The advantage of this approach is to have chunk overlap, which is not
# possible with my initial strategy.

# Concatenate the text of every page (no separator is added between pages)
one_document = "".join(doc.page_content for doc in all_docs)

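# Look up the gpt-4o encoder once and reuse it in the length function
gpt4o_encoding = tiktoken.encoding_for_model("gpt-4o")

def tiktoken_len(text):
    """Return the number of gpt-4o tokens in `text`."""
    return len(gpt4o_encoding.encode(text))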

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 800,
    chunk_overlap = 400,
    length_function = tiktoken_len,
)

split_chunks = text_splitter.split_text(one_document)

chunk_split_vectorstore = Qdrant.from_documents(
    text_splitter.create_documents(split_chunks),
    embedding_model,
    location=":memory:",
    collection_name="chunk_split_collection",
)

chunk_split_retriever = chunk_split_vectorstore.as_retriever()


In [5]:
## Check that I have two vectorstores in memory
client = page_split_vectorstore.client
print(client.get_collections())
client = chunk_split_vectorstore.client
print(client.get_collections())

collections=[CollectionDescription(name='page_split_collection')]
collections=[CollectionDescription(name='chunk_split_collection')]


## Task 1 Deliverables:
1.  Describe the default chunking strategy that I will use:<br>
The default strategy will be to load the two PDF files using `PyMuPDFLoader` just as we have previously done.  This results in each PDF page being its own document.  I have checked sample pages with `tiktoken` and the token count per page is <1000, so these are small enough to just embed without further splitting. I saved these embeddings in `page_split_vectorstore`.

2.  Articulate a chunking strategy that I will also test:<br>
The disadvantage of the default strategy is that there is no chunk overlap between pages, which might weaken the ability to connect two pages that are both relevant to a query.  So I recombine the `page_content` of all pages into a single string, convert it into a document, and split it with a chunk size of 800 tokens and an overlap of 400 tokens (the default settings used by OpenAI).  This strategy allows chunks to overlap, perhaps adding semantic continuity between adjacent pages.  The chunks were embedded with the same embedding model and saved in `chunk_split_vectorstore`.

3.  Describe how and why I made these decisions:<br>
The default behavior of `PyMuPDFLoader` is not bad, and I have been using it for several months.  However, I had been splitting each of the page documents it creates without thinking through that a chunk size greater than the page itself makes the split meaningless.  I also set a chunk overlap but had not thought through the implication of each page being a separate document: chunks never overlap across page boundaries.  So I made these decisions for this Midterm Challenge so that I can later compare the performance of the two strategies using RAGAS in Task 5.
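
The effect of overlap described in item 2 can be illustrated with a small sliding-window splitter. This is a simplified sketch under stated assumptions, not what `RecursiveCharacterTextSplitter` does internally (that class splits on separators before merging pieces): `split_with_overlap` is a hypothetical helper, the "tokens" are toy strings, and the sizes are scaled down from 800/400 to 8/4.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Two toy "pages" joined into one token list, as in the alternative strategy.
page_1 = [f"p1_{i}" for i in range(6)]
page_2 = [f"p2_{i}" for i in range(6)]
toy_document = page_1 + page_2

chunks = split_with_overlap(toy_document, chunk_size=8, overlap=4)
for chunk in chunks:
    print(chunk)
```

Both resulting chunks contain tokens from the end of the first page and the start of the second, which is the cross-page continuity that page-level documents cannot provide.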

## Task 2.  Building a Quick End-to-End Prototype
(Role: AI Systems Engineer)

In [6]:
from langchain.prompts import ChatPromptTemplate
from langchain_openai.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

rag_prompt_template = """\
You are a helpful and polite and cheerful assistant who answers questions based solely on the provided context. 
Use the context to answer the question and provide a  clear answer. Do not mention the document in your
response.
If there is no specific information
relevant to the question, then tell the user that you can't answer based on the context.

Context:
{context}

Question:
{question}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)

In [7]:
from operator import itemgetter
from langchain.schema.output_parser import StrOutputParser

## CREATE MY TWO RAG CHAINS

page_split_rag_chain = (
    {"context": itemgetter("question") | page_split_retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

chunk_split_rag_chain = (
    {"context": itemgetter("question") | chunk_split_retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
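
To make the data flow in these chains concrete, here is a dependency-free sketch of the same three steps: retrieve, fill the prompt, call the model. Everything in it is a stand-in: `FakeDoc`, `fake_retriever`, `fake_llm`, `format_docs`, and `run_chain` are hypothetical names, not LangChain APIs, and the passage text is made up. It also illustrates a point that may be worth checking in the real chains: when a list of document objects is substituted directly into `{context}`, the prompt receives the string form of that list, metadata included, whereas joining only `page_content` keeps the context to plain text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document (hypothetical, not a LangChain class)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


PROMPT = "Context:\n{context}\n\nQuestion:\n{question}\n"


def format_docs(docs: List[FakeDoc]) -> str:
    """Join only the text of each document, dropping metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


def run_chain(
    inputs: Dict[str, str],
    retriever: Callable[[str], List[FakeDoc]],
    llm: Callable[[str], str],
) -> str:
    # Step 1: build the prompt variables from the incoming question.
    variables = {
        "context": format_docs(retriever(inputs["question"])),
        "question": inputs["question"],
    }
    # Step 2: fill the prompt. Step 3: call the model and return plain text.
    return llm(PROMPT.format(**variables))


def fake_retriever(question: str) -> List[FakeDoc]:
    return [FakeDoc("Toy passage about an AI risk.", {"page": 7})]


def fake_llm(prompt: str) -> str:
    return prompt  # echo the prompt so the filled-in template can be inspected


filled = run_chain({"question": "What are some risks of AI?"}, fake_retriever, fake_llm)
print(filled)
```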


In [8]:
from IPython.display import Markdown, display

## TEST THE TWO CHAINS
page_response = (page_split_rag_chain.invoke({"question": "List the ten major risks of AI?"}))
display(Markdown(page_response))

chunk_response = (chunk_split_rag_chain.invoke({"question": "What are some risks of AI?"}))
display(Markdown(chunk_response))

Based on the provided context, the ten major risks of AI include:

1. Confabulation
2. Dangerous or Violent Recommendations
3. Data Privacy
4. Value Chain and Component Integration
5. Harmful Bias
6. Homogenization
7. CBRN Information or Capabilities
8. Human-AI Configuration
9. Obscene, Degrading, and/or Abusive Content
10. Information Integrity

Some risks of AI include:

1. **Confabulation**: The production of confidently stated but erroneous or false content, misleading or deceiving users.
2. **Dangerous, Violent, or Hateful Content**: Easier production and access to violent, inciting, radicalizing, or threatening content, including recommendations to carry out self-harm or illegal activities.
3. **Data Privacy**: Impacts due to leakage and unauthorized use, disclosure, or de-anonymization of personal or sensitive data.
4. **Environmental Impacts**: Adverse effects on ecosystems due to high compute resource utilization in training or operating AI models.
5. **Harmful Bias or Homogenization**: Amplification and exacerbation of historical, societal, and systemic biases, and performance disparities between different sub-groups or languages.
6. **Algorithmic Monocultures**: Increased susceptibility to correlated failures due to multiple actors relying on the same model or algorithm in decision-making settings.
7. **Disinformation**: Long-term effects on societal trust in public institutions due to the distribution of harmful deepfake images and disinformation.

These risks can vary significantly depending on the characteristics of the AI model, system, or use case, and may require tailored risk management approaches.

## Task 2 Deliverables:
1.  Build a live public prototype on Hugging Face, and include the public URL link to my space.<br>

Here is a one minute Loom video that demonstrates the prototype running in Hugging Face.
https://www.loom.com/share/70b741d3e4e14af792572b3aa9106463?sid=5aeb2e51-0f75-4c74-9f8c-6bc6d14aaa52

I used the page split retriever for this prototype, meaning that the documents were broken by page, not recombined, and were chunked as whole pages without overlap.

2.  How did I choose my stack, and why did I select each tool the way I did? <br>

My stack consists of the following:
- Hardware is Apple Mac Studio
- Editor is VSC, as recommended, and I have grown to like it very much because it includes everything.
- Qdrant is the vector store.  I have used FAISS and Chroma, but Qdrant is fantastic.  I run it locally as a server, though for this Hugging Face situation, I am using a memory-based implementation.
- ChainLit is the interface, as recommended.  Notably, the current version of ChainLit does NOT work on Hugging Face, and the version needs to be locked (chainlit==0.7.700).
- RAGAS is part of my stack for purposes of later evaluation.
- LangChain is used to help simplify the code.

I didn't choose this stack by accident; all of it was suggested in the course.  However, I have resisted moving from one framework to another because I think it is important to stop wandering in the AIE jungle and learn one set of reliable tools well.  This will prepare me to switch to other frameworks later, if desired.



## Task 3.  Creating a Golden Test Data Set
(Role: AI Evaluation and Performance Engineer)

In [28]:
## We already have loaded the text but need to chunk it differently


text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 400,
    chunk_overlap = 100,
    length_function = tiktoken_len,
)

eval_chunks = text_splitter_eval.split_text(one_document)
eval_documents = text_splitter_eval.create_documents(eval_chunks)
len(eval_documents)


In [13]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

## llm and embedding models were set earlier
generator_llm = llm
critic_llm = llm
embeddings = embedding_model

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

num_qa_pairs = 50 # I increased this from 20 because I have a surplus of OpenAI credits

testset = generator.generate_with_langchain_docs(eval_documents, num_qa_pairs, distributions)
testset.to_pandas()


Filename and doc_id are the same for all nodes.



|   | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | Why should data collection be minimized and cl... | `[from reidentification, and appropriate techni...` | Data collection should be minimized and clearl... | simple | [{}] | True |
| 1 | Why are proactive equity assessments important... | `[sex \n(including \npregnancy, \nchildbirth, \...` | Proactive equity assessments are important in ... | simple | [{}] | True |
| 2 | What are the limitations of early lifecycle TE... | `[49 \nearly lifecycle TEVV approaches are deve...` | Currently available pre-deployment TEVV proces... | simple | [{}] | True |
| 3 | What is the significance of intellectual prope... | `[content transparency, while balancing the pro...` | The answer to given question is not present in... | simple | [{}] | True |
| 4 | What measures are suggested to address harmful... | `[experts, experience with GAI technology) with...` | The suggested measures to address harmful bias... | simple | [{}] | True |
| 5 | What is the purpose of the Technical Companion... | `[moving principles into practice. \nThe expect...` | The expectations given in the Technical Compan... | simple | [{}] | True |
| 6 | What are some examples of automated systems fo... | `[Examples of automated systems for which the B...` | Examples of automated systems for which the Bl... | simple | [{}] | True |
| 7 | What is the purpose of engaging in internal an... | `[MS-1.1-008 \nDeﬁne use cases, contexts of use...` | The purpose of engaging in internal and extern... | simple | [{}] | True |
| 8 | How does the Office of Management and Budget (... | `[requirements on drivers, such as slowing down...` | The Office of Management and Budget (OMB) has ... | simple | [{}] | True |
| 9 | What are the sources of bias in GAI training a... | `[might be impacted by GAI systems through dire...` | Sources of bias in GAI training and TEVV data ... | simple | [{}] | True |
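
As a quick sanity check on the settings used for this test set, the weights in `distributions` can be converted into the number of questions of each type they imply. This is only a sketch: the evolution types are written as plain strings instead of the RAGAS objects, and the generator may not hit these counts exactly.

```python
import math

# Evolution-type weights copied from the notebook; keys are plain strings here.
distributions = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
num_qa_pairs = 50

# The weights are treated as proportions, so they should sum to 1.
weights_sum_to_one = math.isclose(sum(distributions.values()), 1.0)

# Number of questions of each type implied by the weights.
expected_counts = {
    name: round(num_qa_pairs * weight) for name, weight in distributions.items()
}
print(weights_sum_to_one, expected_counts)
```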


In [21]:
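import pandas as pd
testset_df = testset.to_pandas()
# index=False keeps the row index out of the file, so reading it back
# does not add an extra "Unnamed: 0" column
testset_df.to_csv("testset.csv", index=False)
test_df = pd.read_csv("testset.csv")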

In [22]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()
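
One detail of a CSV round trip like the one above is worth keeping in mind: list-valued columns such as `contexts` come back as strings. The notebook only reuses `question` and `ground_truth`, so it is unaffected, but the issue and one way around it (`ast.literal_eval`) are sketched here. The sketch uses the standard-library `csv` module and an in-memory buffer in place of pandas and `testset.csv`, and the row is made up for illustration.

```python
import ast
import csv
import io

# One made-up row shaped like the test set (illustrative values only).
rows = [
    {
        "question": "toy question?",
        "contexts": ["chunk a", "chunk b"],
        "ground_truth": "toy answer",
    }
]

# Write to an in-memory CSV; the list is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "contexts", "ground_truth"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is now a string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_contexts = loaded[0]["contexts"]

# Parse the string back into a real list.
parsed_contexts = ast.literal_eval(raw_contexts)
print(type(raw_contexts).__name__, type(parsed_contexts).__name__)
```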

In [37]:
from langchain_core.runnables import RunnablePassthrough
# Set up chains
primary_qa_llm = llm

retrieval_augmented_qa_chain_chunk = (
    {"context": itemgetter("question") | chunk_split_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | primary_qa_llm, "context": itemgetter("context")}
)
retrieval_augmented_qa_chain_paged = (
    {"context": itemgetter("question") | page_split_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | primary_qa_llm, "context": itemgetter("context")}
)


In [38]:
question = "What is a rule of thumb for selecting an industry to invest in?"

result = retrieval_augmented_qa_chain_chunk.invoke({"question" : question})

print(result["response"].content)

I can't answer that based on the context provided.


In [39]:
## Generate responses with our questions
answers = []
contexts = []

for question in test_questions:
    response = retrieval_augmented_qa_chain_chunk.invoke({"question" : question})
    answers.append(response["response"].content)
    contexts.append([context.page_content for context in response["context"]])

In [40]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

response_dataset[0]

{'question': 'Why should data collection be minimized and clearly communicated to the people whose data is collected?',
 'answer': "Data collection should be minimized and clearly communicated to the people whose data is collected to protect privacy by default and to ensure that individuals understand what data is being collected about them and how it will be used. This approach helps to match the data collection with people's expectations and desires, thereby minimizing potential harms and privacy risks.",
 'contexts': ['be understandable in plain language, and give you agency over data collection \nand the specific context of use; current hard-to-understand no\xad\ntice-and-choice practices for broad uses of data should be changed. Enhanced \nprotections and restrictions for data and inferences related to sensitive do\xad\nmains, including health, work, education, criminal justice, and finance, and \nfor data pertaining to youth should put you first. In sensitive domains, your \ndata

In [41]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
]

In [42]:
results = evaluate(response_dataset, metrics)
print(results)


No statements were generated from the answer.


{'faithfulness': 0.9074, 'answer_relevancy': 0.9634, 'context_recall': 0.9267, 'context_precision': 0.9061}


In [46]:
## Generate responses with our questions using paged retriever
answers = []
contexts = []

for question in test_questions:
    response = retrieval_augmented_qa_chain_paged.invoke({"question" : question})
    answers.append(response["response"].content)
    contexts.append([context.page_content for context in response["context"]])

paged_response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

paged_results = evaluate(paged_response_dataset, metrics)


No statements were generated from the answer.
No statements were generated from the answer.


In [51]:
df_chunked = pd.DataFrame(list(results.items()), columns=['Metric', 'chunks'])
df_paged = pd.DataFrame(list(paged_results.items()), columns=['Metric', 'paged'])

# Join the two result sets on the metric name for a side-by-side comparison
df_merged = pd.merge(df_paged, df_chunked, on='Metric')

df_merged

|   | Metric | paged | chunks |
|---|---|---|---|
| 0 | faithfulness | 0.873630 | 0.907450 |
| 1 | answer_relevancy | 0.944051 | 0.963378 |
| 2 | context_recall | 0.850000 | 0.926667 |
| 3 | context_precision | 0.905000 | 0.906111 |
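
To see where the two strategies actually differ, the per-metric gap can be computed from the scores in the table above. This is a small sketch using plain dictionaries with the reported numbers typed in by hand; in the notebook the same subtraction could be done on the `paged` and `chunks` columns of `df_merged`.

```python
# Scores copied from the comparison table above.
paged = {
    "faithfulness": 0.873630,
    "answer_relevancy": 0.944051,
    "context_recall": 0.850000,
    "context_precision": 0.905000,
}
chunks = {
    "faithfulness": 0.907450,
    "answer_relevancy": 0.963378,
    "context_recall": 0.926667,
    "context_precision": 0.906111,
}

# A positive delta means the chunked strategy scored higher.
deltas = {metric: round(chunks[metric] - paged[metric], 4) for metric in paged}

# Print the metrics from largest to smallest gap.
for metric, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{metric:<18} {delta:+.4f}")
```

The gap is largest for context recall and smallest for context precision, which is essentially unchanged between the two strategies.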


## Task 3 Deliverables:
1.  Assess my pipeline using the RAGAS framework including key metrics faithfulness, answer relevancy, context precision, and context recall.  Provide a table of my output results.<br>

I evaluated my prototype, which is based on the PDF documents being divided by page, and while I was here I compared it with the other strategy, which was to recombine all the text and then split it into chunks with overlap.

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Metric</th>
      <th>paged</th>
      <th>chunks</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>faithfulness</td>
      <td>0.873630</td>
      <td>0.907450</td>
    </tr>
    <tr>
      <th>1</th>
      <td>answer_relevancy</td>
      <td>0.944051</td>
      <td>0.963378</td>
    </tr>
    <tr>
      <th>2</th>
      <td>context_recall</td>
      <td>0.850000</td>
      <td>0.926667</td>
    </tr>
    <tr>
      <th>3</th>
      <td>context_precision</td>
      <td>0.905000</td>
      <td>0.906111</td>
    </tr>
  </tbody>
</table>
</div>

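The paged approach, which uses the PDF loader's default page-level documents, scored lower on all four metrics and is INFERIOR.  The gap is largest for context recall (0.850 vs. 0.927) and faithfulness (0.874 vs. 0.907), while answer relevancy (0.944 vs. 0.963) and context precision (0.905 vs. 0.906) are close.  I should therefore alter my app to use the chunking strategy.  In Task 5, I will compare the fine-tuned embedding model with the OpenAI embedding model, using the chunking strategy in both instances.  There is no point in re-studying the paged strategy.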

2.  What conclusions can I draw about performance and effectiveness of my pipeline with this information? <br>

First, the comparison of chunking strategies favors overlap: dividing PDF documents by page and embedding those pages scored lower than combining all the PDF text into one document and then splitting that with chunk overlap, most noticeably on context recall and faithfulness.

- Faithfulness: Measures whether all claims or statements in the answer can be inferred from the context that was provided.  The value is the number of claims that can be inferred divided by the total number of claims.  This metric improved with the chunking overlap strategy (0.874 to 0.907).
- Answer relevancy: Measures whether the answer is relevant to the question. It does not matter if the answer is actually correct, only that it directly answers the question without redundancy. This metric changed only slightly with the chunking strategy (0.944 to 0.963).
- Context recall: Measures whether the facts in the ground truth reference answer can be inferred from the context that was provided to the LLM.  In a perfect situation, every statement in the ground truth can be linked to the context.  This metric improved the most with the chunking overlap strategy (0.850 to 0.927).
- Context precision: Measures whether the retrieved chunks that are relevant to the ground truth are ranked at the top of the context.  This metric was not affected by the chunking strategy (0.905 vs. 0.906).
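
The definitions above can be made concrete with a small sketch. It is not the RAGAS implementation: RAGAS uses an LLM to produce the per-claim and per-chunk verdicts, whereas here they are hand-labelled booleans, the function names are hypothetical, and returning 0.0 for an empty input is a choice made for this sketch only. Answer relevancy is left out because it depends on embedding similarity.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of answer claims that can be inferred from the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(truth_statement_attributed: List[bool]) -> float:
    """Fraction of ground-truth statements that can be attributed to the context."""
    if not truth_statement_attributed:
        return 0.0
    return sum(truth_statement_attributed) / len(truth_statement_attributed)


def context_precision_score(chunk_relevant: List[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


print(faithfulness_score([True, True, False, True]))  # 3 of 4 claims supported
print(context_recall_score([True, False]))            # 1 of 2 statements attributed
print(context_precision_score([True, False, True]))   # relevant chunks at ranks 1 and 3
print(context_precision_score([False, True, True]))   # relevant chunks at ranks 2 and 3
```

The last two calls use the same number of relevant chunks but different rankings, which shows why context precision rewards retrievers that place relevant chunks first.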



## Task 4.  Fine-Tuning Open-Source Embeddings
(Role: Machine Learning Engineer)

I performed the training in a different notebook because I needed to run it in Colab.  The notebook is in this GitHub repository and is called `FineTuneEmbed.ipynb`.  The model was then placed on Hugging Face, and the link is provided as part of the deliverable.


## Task 4 Deliverables:
1.  Swap out my existing embedding model for the new fine-tuned version.  Provide a link to my fine-tuned embedding model on the Hugging Face Hub.<br>

2.  How did I choose the embedding model for this application?<br>


## Task 5.  Assessing Performance
(Role: AI Evaluation and Performance Engineer)

## Task 5 Deliverables:
1.  Test the fine-tuned embedding model using the RAGAS framework to quantify any improvements.  Provide results in a table.<br>

2.  Test the two chunking strategies using the RAGAS framework to quantify any improvements.  Provide results in a table.<br>

3.  The AI Solutions Engineer asks me, "Which one is the best to test with internal stakeholders next week, and why?"<br>


## Task 6.  Managing Your Boss and User Expectations
(Role: SVP of Technology)

## Task 6 Deliverables:
1.  What is the story that I will give to the CEO to tell the whole company at the launch next month?<br>

2.  There appears to be important information not included in our build.  How might we incorporate relevant White House briefing information in future versions? <br>
