# Midterm Challenge Notebook - Mike Dean

In [None]:
!pip install -q langchain_openai "langchain_huggingface<0.1.0"
!pip install -q langchain_core==0.2.40 langchain==0.2.4 langchain_community langchain-text-splitters==0.2.4
!pip install -q qdrant_client pymupdf tiktoken ragas pandas
!pip install -q python-pptx==1.0.2 nltk==3.9.1

In [1]:
!pip install -qU langchain langchain_openai langchain_core==0.2.40 langchain_community
!pip install -qU qdrant_client pymupdf tiktoken ragas pandas

## Task 1.  Dealing with the Data
(Role: AI Solutions Engineer)

In [2]:
import defaults
llm = defaults.default_llm

In [3]:
# Load PDF documents from a directory
import loadReferenceDocuments
separate_pages, one_document = loadReferenceDocuments.loadReferenceDocuments("References/")

<built-in method count of list object at 0x1143159c0>


## Chunking Strategies
#### Ingest the PDF by page - page_split
#### Ingest the PDF by page and recombine into single file and then chunk - chunk_split

In [4]:
import splitAndVectorize

page_split_vectorstore = splitAndVectorize.createVectorstore(
    separate_pages,
    "separate_page_collection",
)

page_split_retriever = page_split_vectorstore.as_retriever()

chunk_split_vectorstore = splitAndVectorize.createVectorstore(
    one_document,
    "chunk_split_collection",
    chunk_size=800,
    chunk_overlap=400,
)

chunk_split_retriever = chunk_split_vectorstore.as_retriever()


## Task 1 Deliverables:
1.  Describe the default chunking strategy that I will use:<br>
The default strategy will be to load the two PDF files using `PyMuPDFLoader` just as we have previously done.  This results in each PDF page being its own document.  I have checked sample pages with `tiktoken` and the token count per page is <1000, so these are small enough to just embed without further splitting. I saved these embeddings in `page_split_vectorstore`.

2.  Articulate a chunking strategy that I will also test:<br>
The disadvantage of the default strategy is that there is no chunk overlap between pages, which might weaken the ability to connect two pages that are both relevant to a query.  So I recombine the page_content of all pages into a single string, convert it into a document, and split it with a chunk size of 800 and an overlap of 400 (the default settings used by OpenAI).  This strategy allows chunks to overlap, perhaps adding semantic continuity between adjacent pages.  These were embedded with the same embedding model and saved in `chunk_split_vectorstore`.

3.  Describe how and why I made these decisions:<br>
The default behavior of `PyMuPDFLoader` is not bad and I have been using it for several months.  However, I was splitting each of the documents it created without thinking through that a chunk size greater than the page itself makes the split meaningless.  I also had chunk overlap, but had not thought through the implications of each page being a separate document.  So I made these decisions for this Midterm Challenge so that I can later compare the performance using RAGAS in Task 5.
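
The effect of overlap across a page boundary can be illustrated with a minimal sketch. This is not the code in `splitAndVectorize` (which is not shown in this notebook); `chunk_with_overlap` is a hypothetical helper that uses a plain character-based sliding window, whereas the real splitter may count tokens and split on separators.

```python
def chunk_with_overlap(text: str, chunk_size: int = 800, chunk_overlap: int = 400) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical stand-in for two PDF pages recombined into one string.
pages = ["A" * 1000, "B" * 1000]
combined = "".join(pages)
chunks = chunk_with_overlap(combined)

# Chunks containing text from both pages straddle the page boundary.
straddling = [c for c in chunks if "A" in c and "B" in c]
print(len(chunks), len(straddling))
```

With page-level documents no chunk can contain text from two pages; with the recombined string, some chunks do, which is the continuity the second strategy is meant to test.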

## Task 2.  Building a Quick End-to-End Prototype
(Role: AI Systems Engineer)

In [5]:
import prompts
from langchain.prompts import ChatPromptTemplate
from operator import itemgetter
from langchain.schema.output_parser import StrOutputParser

rag_prompt = ChatPromptTemplate.from_template(prompts.rag_prompt_template)

page_split_rag_chain = (
    {"context": itemgetter("question") | page_split_retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

chunk_split_rag_chain = (
    {"context": itemgetter("question") | chunk_split_retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [6]:
from IPython.display import Markdown, display

## TEST THE TWO CHAINS
page_response = (page_split_rag_chain.invoke({"question": "List the ten major risks of AI?"}))
display(Markdown(page_response))

chunk_response = (chunk_split_rag_chain.invoke({"question": "What are some risks of AI?"}))
display(Markdown(chunk_response))

The ten major risks of AI, as identified, are:

1. **Confabulation**: AI generating false or misleading information.
2. **Dangerous or Violent Recommendations**: AI suggesting harmful actions.
3. **Data Privacy**: Risks related to unauthorized access and misuse of personal data.
4. **Value Chain and Component Integration**: Issues arising from integrating various AI components.
5. **Harmful Bias**: AI perpetuating or amplifying biases present in training data.
6. **Homogenization**: The risk of creating uniformity that can lead to systemic vulnerabilities.
7. **CBRN Information or Capabilities**: Misuse of AI for chemical, biological, radiological, or nuclear purposes.
8. **Human-AI Configuration**: Risks from improper interaction between humans and AI systems.
9. **Obscene, Degrading, and/or Abusive Content**: AI generating harmful or offensive content.
10. **Information Integrity**: Ensuring the accuracy and reliability of information produced by AI systems.

These risks can be categorized into technical/model risks, misuse by humans, and ecosystem/societal risks.

Some risks of AI include:

1. **Technical / Model Risks**:
   - Confabulation: The production of confidently stated but erroneous or false content.
   - Dangerous or Violent Recommendations: AI models suggesting harmful actions.
   - Data Privacy: Leakage and unauthorized use of sensitive data.
   - Value Chain and Component Integration: Issues in integrating various components leading to risks.
   - Harmful Bias and Homogenization: Amplification of biases and undesired homogeneity.

2. **Misuse by Humans**:
   - CBRN Information or Capabilities: Easier access to information related to chemical, biological, radiological, or nuclear weapons.
   - Obscene, Degrading, and/or Abusive Content: Generation and distribution of harmful content.
   - Information Integrity and Security: Issues with the accuracy and security of information.

3. **Ecosystem / Societal Risks**:
   - Environmental Impacts: High resource utilization affecting ecosystems.
   - Intellectual Property: Risks to proprietary information and innovation.

These risks can arise from the design, training, or operation of AI systems and can be exacerbated by human behavior.

## Task 2 Deliverables:
1.  Build a live public prototype on Hugging Face, and include the public URL link to my space.<br>

Here is a one minute Loom video that demonstrates the prototype running in Hugging Face.
https://www.loom.com/share/70b741d3e4e14af792572b3aa9106463?sid=5aeb2e51-0f75-4c74-9f8c-6bc6d14aaa52

I used the page split retriever for this prototype, meaning that the documents were broken by page, not recombined, and were chunked as whole pages without overlap.

2.  How did I choose my stack, and why did I select each tool the way I did? <br>

My stack consists of the following:
- Hardware is Apple Mac Studio
- Editor is VS Code as recommended, and I have grown to like it very much because it includes everything.
- Qdrant is the vector store.  I have used FAISS and Chroma, but Qdrant is fantastic.  I run it locally as a server, though for this Hugging Face situation, I am using a memory-based implementation.
- Chainlit is the interface, as recommended.  Notably, the current version of Chainlit does NOT work on Hugging Face, and the version needs to be locked (chainlit==0.7.700).
- RAGAS is part of my stack for purposes of later evaluation.
- LangChain is used to help simplify the code.

I learned an important lesson about software engineering - notebooks are NOT good ways to organize code!  They are great for teaching.  So I reverted to my last 30 years of software development, and started refactoring code OUT OF THE NOTEBOOK, and then calling it inside the notebook.  So you (or others) can use the notebook as a navigation tool, but it doesn't become so unwieldy that you can't figure anything out.



## Task 3.  Creating a Golden Test Data Set
(Role: AI Evaluation and Performance Engineer)

In [7]:
import splitAndVectorize
# create a new splitting without embedding

eval_documents = splitAndVectorize.split_into_chunks(
    one_document,
    chunk_size=400,
    chunk_overlap=100,
)
len(eval_documents)


318

## The following code was used to create tests but I have commented out to avoid repetition.
## I created 50 pairs instead of 20.

In [14]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
import defaults

## llm and embedding models were set earlier
# generator_llm = llm
# critic_llm = llm
# embeddings = defaults.default_embedding_model

# generator = TestsetGenerator.from_langchain(
#     generator_llm,
#     critic_llm,
#     embeddings
# )

# distributions = {
#     simple: 0.5,
#     multi_context: 0.4,
#     reasoning: 0.1
# }

# num_qa_pairs = 50 # I increased this from 20 because I have a surplus of OpenAI credits

## I ALREADY RAN THIS SO HAVE COMMENTED IT OUT HERE SO I DON'T DO IT AGAIN
## I WILL READ IN THE CSV FILE TO CONTINUE
# testset = generator.generate_with_langchain_docs(eval_documents, num_qa_pairs, distributions)
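
As a quick arithmetic check on the settings above, the weights can be turned into expected question counts. This is only a sketch of the proportions; it assumes the generator follows the weights closely, which RAGAS does not guarantee for every run.

```python
# Evolution weights copied from the commented-out cell above,
# keyed by name instead of by the RAGAS evolution objects.
distributions = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
num_qa_pairs = 50

# The weights should describe a complete split of the test set.
assert abs(sum(distributions.values()) - 1.0) < 1e-9

expected_counts = {
    name: round(weight * num_qa_pairs) for name, weight in distributions.items()
}
print(expected_counts)
```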


### Get the test data from the stored CSV file

In [9]:
# READ IN THE TESTSET FROM THE CSV FILE
import pandas as pd
test_df = pd.read_csv("testset.csv")
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

In [10]:
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain_chunk = (
    {"context": itemgetter("question") | chunk_split_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | llm, "context": itemgetter("context")}
)
retrieval_augmented_qa_chain_paged = (
    {"context": itemgetter("question") | page_split_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | llm, "context": itemgetter("context")}
)
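
Each of these chains returns a dictionary with a `response` message and the retrieved `context` documents. The `evaluateRAGAS` module is not shown in this notebook, so the following is only a sketch of how such output could be gathered into the four columns that RAGAS 0.1.x scores. `collect_rows`, `FakeChain`, `FakeDoc`, and `FakeMessage` are hypothetical names used here so the example runs without an LLM or vector store.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document (only page_content is used)."""
    page_content: str


@dataclass
class FakeMessage:
    """Stand-in for the chat model's message object."""
    content: str


class FakeChain:
    """Hypothetical stub with the same output shape as the chains above."""

    def invoke(self, inputs: dict) -> dict:
        question = inputs["question"]
        return {
            "response": FakeMessage(content=f"answer to: {question}"),
            "context": [FakeDoc(page_content="retrieved text")],
        }


def collect_rows(chain, questions, ground_truths):
    """Run each question through the chain and gather the columns to score."""
    rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for question, truth in zip(questions, ground_truths):
        result = chain.invoke({"question": question})
        rows["question"].append(question)
        rows["answer"].append(result["response"].content)
        rows["contexts"].append([doc.page_content for doc in result["context"]])
        rows["ground_truth"].append(truth)
    return rows


rows = collect_rows(FakeChain(), ["What is AI risk?"], ["A reference answer."])
print(rows["answer"], rows["contexts"])
```

Keeping the retrieved documents in the chain output, rather than ending with a string parser, is what makes the context-based metrics possible.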


In [11]:
import evaluateRAGAS
results = evaluateRAGAS.evaluateRAGAS(retrieval_augmented_qa_chain_chunk,
                                                       test_questions, test_groundtruths)

Evaluating:   0%|          | 0/200 [00:00<?, ?it/s]

No statements were generated from the answer.


In [17]:

print(results)

{'faithfulness': 0.8999, 'answer_relevancy': 0.9475, 'context_recall': 0.9017, 'context_precision': 0.9083}


In [18]:
import evaluateRAGAS
paged_results = evaluateRAGAS.evaluateRAGAS(retrieval_augmented_qa_chain_paged,
                                                       test_questions, test_groundtruths)

Evaluating:   0%|          | 0/200 [00:00<?, ?it/s]

No statements were generated from the answer.


In [19]:
print(paged_results)

{'faithfulness': 0.9146, 'answer_relevancy': 0.9240, 'context_recall': 0.8700, 'context_precision': 0.9006}


In [21]:
df_chunked = pd.DataFrame(list(results.items()), columns=['Metric', 'total_chunked'])
df_paged = pd.DataFrame(list(paged_results.items()), columns=['Metric', 'separate_pages'])
df_merged = pd.merge(df_paged, df_chunked, on='Metric')
df_merged

| | Metric | separate_pages | total_chunked |
|---|---|---|---|
| 0 | faithfulness | 0.914559 | 0.899927 |
| 1 | answer_relevancy | 0.923959 | 0.947485 |
| 2 | context_recall | 0.87 | 0.901667 |
| 3 | context_precision | 0.900556 | 0.908333 |


## Task 3 Deliverables:
1.  Assess my pipeline using the RAGAS framework including key metrics faithfulness, answer relevancy, context precision, and context recall.  Provide a table of my output results.<br>

I did the evaluation of my prototype model which was based on the PDF documents being divided by pages, but while I was here, I compared this with the other strategy, which was to recombine all the text and then split by chunks with overlap.

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Metric</th>
      <th>separate_pages</th>
      <th>total_chunked</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>faithfulness</td>
      <td>0.914559</td>
      <td>0.899927</td>
    </tr>
    <tr>
      <th>1</th>
      <td>answer_relevancy</td>
      <td>0.923959</td>
      <td>0.947485</td>
    </tr>
    <tr>
      <th>2</th>
      <td>context_recall</td>
      <td>0.870000</td>
      <td>0.901667</td>
    </tr>
    <tr>
      <th>3</th>
      <td>context_precision</td>
      <td>0.900556</td>
      <td>0.908333</td>
    </tr>
  </tbody>
</table>
</div>

When the PDF document is separated by page, and then those pages are INDIVIDUALLY chunked or embedded, the context recall is somewhat lower (0.870 vs. 0.902); the other three metrics differ by less than 0.025 between the two strategies.  Both strategies need to be assessed again later when we use the fine-tuned embedding model.
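
The size of each gap can be read off with a few lines of plain Python. The scores are copied from the table above; this is only a sketch of the arithmetic, and with a single run over 50 questions it says nothing about statistical significance.

```python
# Scores copied from the results table above.
separate_pages = {
    "faithfulness": 0.914559,
    "answer_relevancy": 0.923959,
    "context_recall": 0.870000,
    "context_precision": 0.900556,
}
total_chunked = {
    "faithfulness": 0.899927,
    "answer_relevancy": 0.947485,
    "context_recall": 0.901667,
    "context_precision": 0.908333,
}

# Positive delta means the recombined-and-chunked strategy scored higher.
deltas = {m: round(total_chunked[m] - separate_pages[m], 6) for m in separate_pages}
largest = max(deltas, key=lambda m: abs(deltas[m]))

for metric, delta in deltas.items():
    print(f"{metric:<18} {delta:+.6f}")
print("largest gap:", largest)
```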

2.  What conclusions can I draw about performance and effectiveness of my pipeline with this information? <br>

- Faithfulness: Measures whether all claims or statements in the answer can be completely inferred from the context that was provided.  The value is the percentage of claims that can be inferred over the total number of claims.  
- Answer relevancy: Measures whether the answer is relevant to the question. It does not matter if the answer is actually correct - only that it directly answers the question without redundancy. 
- Context recall: This measures whether the facts that are in the ground truth reference answer can be inferred from the context that was provided to the LLM.  In a perfect situation, every statement in the ground truth should be able to be linked to the context.  
- Context precision: This measures whether all the elements in the ground truth are in the highest ranked parts of the context.  
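
The faithfulness ratio described above can be written out as a small worked example. In RAGAS an LLM extracts the claims and judges each one; here the verdicts are hypothetical hand-written booleans, and `faithfulness_score` is an illustrative helper, not a RAGAS function. Returning NaN when no claims are extracted is an assumption of this sketch.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims judged inferable from the retrieved context."""
    if not claim_supported:
        return float("nan")  # no claims extracted, so the ratio is undefined
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: three of four claims are supported by the context.
verdicts = [True, True, True, False]
print(faithfulness_score(verdicts))
```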

**OVERALL CONCLUSION:**<br> 
Not really any significant differences here, though I suspect that letting the pages be kept as separate documents is going to be inferior in the long run.



## Task 4.  Fine-Tuning Open-Source Embeddings
(Role: Machine Learning Engineer)

I performed the fine tuning in a separate notebook (FineTunePartTwo.ipynb) that you can find in this location:
https://github.com/mdean77a/AIE4/blob/main/Midterm/FineTunePartTwo.ipynb

I did this separately because I anticipated needing to use Colab.  To my utter surprise, the training actually worked on my Mac Studio (M1, 32 GB, 323 sec) and I could reproduce it on an M3 laptop (128 GB, 106 sec).  So I didn't end up having to wrestle with Colab.


## Task 4 Deliverables:
1.  Swap out my existing embedding model for the new fine-tuned version.  Provide a link to my fine-tuned embedding model on the Hugging Face Hub.<br>

https://huggingface.co/Mdean77/finetuned_arctic

2.  How did I choose the embedding model for this application?<br>

I selected Snowflake/snowflake-arctic-embed-m because it improved dramatically in our previous exercise with it.

## Task 5.  Assessing Performance
(Role: AI Evaluation and Performance Engineer)

In [1]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Mdean77/finetuned_arctic")


In [13]:
!pip install langchain_huggingface

Collecting langchain_huggingface
  Using cached langchain_huggingface-0.1.0-py3-none-any.whl.metadata (1.3 kB)
Using cached langchain_huggingface-0.1.0-py3-none-any.whl (20 kB)
Installing collected packages: langchain_huggingface
Successfully installed langchain_huggingface-0.1.0


In [15]:
from langchain_huggingface import HuggingFaceEmbeddings
finetune_embeddings = HuggingFaceEmbeddings(model_name="Mdean77/finetuned_arctic")

In [16]:
import splitAndVectorize
# page_split_vectorstore was created earlier, and uses te3 embedder.
# page_split_vectorstore = splitAndVectorize.createVectorstore(
#     separate_pages,
#     "separate_page_collection",
# )
te3_page_split_retriever = page_split_vectorstore.as_retriever()

arctic_page_split_vectorstore = splitAndVectorize.createVectorstore(
    separate_pages,
    "arctic_separate_page_collection",
    embedding_model=finetune_embeddings,
)
arctic_page_split_retriever = arctic_page_split_vectorstore.as_retriever()

# Need to RESPLIT the chunk split for testing (te3 is the default embedder)
chunk_split_vectorstore = splitAndVectorize.createVectorstore(
    one_document,
    "te3_chunk_split_collection",
    chunk_size=250,
    chunk_overlap=50,
)
te3_chunk_split_retriever = chunk_split_vectorstore.as_retriever()

# Now make the chunk split with the fine-tuned arctic embedder
arctic_chunk_split_vectorstore = splitAndVectorize.createVectorstore(
    one_document,
    "arctic_chunk_split_collection",
    chunk_size=250,
    chunk_overlap=50,
    embedding_model=finetune_embeddings,
)
arctic_chunk_split_retriever = arctic_chunk_split_vectorstore.as_retriever()



For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  from langchain_openai.chat_models.azure import AzureChatOpenAI
* 'allow_population_by_field_name' has been renamed to 'populate_by_name'


PydanticUserError: The `__modify_schema__` method is not supported in Pydantic v2. Use `__get_pydantic_json_schema__` instead in class `SecretStr`.

For further information visit https://errors.pydantic.dev/2.9/u/custom-json-schema

In [None]:


retrieval_augmented_qa_chain_chunk = (
    {"context": itemgetter("question") | chunk_split_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | llm, "context": itemgetter("context")}
)
retrieval_augmented_qa_chain_paged = (
    {"context": itemgetter("question") | page_split_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | llm, "context": itemgetter("context")}
)

## Task 5 Deliverables:
1.  Test the fine-tuned embedding model using the RAGAS framework to quantify any improvements.  Provide results in a table.<br>

2.  Test the two chunking strategies using the RAGAS framework to quantify any improvements.  Provide results in a table.<br>

3.  The AI Solutions Engineer asks me "Which one is the best to test with internal stakeholders next week, and why?"<br>


## Task 6.  Managing Your Boss and User Expectations
(Role: SVP of Technology)

## Task 6 Deliverables:
1.  What is the story that I will give to the CEO to tell the whole company at the launch next month?<br>

2.  There appears to be important information not included in our build.  How might we incorporate relevant White House briefing information in future versions? <br>
