# Machine Learning Deployment: Wedding Day vs. GenAI

### **The "Wedding Day": Traditional Computer Vision Models**
- **Deployment is just the beginning.**
  - Requires continuous updates, retraining, and maintenance.
  - Long-term effort, much like sustaining a marriage.

### **The "Mistress": Generative AI**
- **Effortless but deceptive.**
  - No continuous retraining required, so it appears easy and efficient.
  - Pathological liar: hallucinates convincingly without a factual basis.

### **The Solution: Multimodal Validation**
- **Catch the lies before they spread.**
  - Ensures all modalities (image, data, and prompt) are genuinely integrated.
  - Detects hallucinations and enforces workflow trustworthiness.

### **Takeaway:**
"Without validation, even the perfect wedding (or model) can fall apart."


# Multimodal Integrity Check for Generative AI
This notebook demonstrates a common-sense approach to a multimodal integrity check using three input modalities:
1. **Image (Modality 1):** `image.png`, representing the Jenkins character logo.
2. **RAG Data (Modality 2):** Fictional facts about random characters, including Jenkins, provided in `data.txt`.
3. **User Prompt (Modality 3):** A question asking who the character in the image is and for specific details, such as the hair products the character uses.

The goal is to verify that the generative AI pipeline integrates all three modalities (image, RAG data, and user prompt) to generate a coherent and enriched response. The test checks that the model does not compensate for a missing or failing modality by hallucinating its content. A successful run must show evidence of all three modalities in the generated response.


## 1 Start HTTP Server for RAG Data
The HTTP server serves the `data.txt` file for the RAG (Retrieval-Augmented Generation) pipeline.


In [1]:
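import subprocess
import sys

# Start a simple HTTP server (same interpreter as the notebook) to serve RAG data.
# Output goes to DEVNULL so an unread pipe cannot fill up and stall the server.
server = subprocess.Popen([sys.executable, "-m", "http.server", "8088"],
                          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
print("HTTP server started on port 8088. Access it at http://127.0.0.1:8088/")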


HTTP server started on port 8088. Access it at http://127.0.0.1:8088/
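
`subprocess.Popen` returns as soon as the process is spawned, so the server may not yet be accepting connections when the next cell requests `data.txt`. A minimal sketch of a readiness check follows; the helper name `wait_for_port` and its timeout values are assumptions for illustration, not part of the original notebook.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            # Not listening yet (or refused); retry shortly.
            time.sleep(0.1)
    return False
```

Calling `wait_for_port("127.0.0.1", 8088)` right after starting the server, and stopping if it returns `False`, would keep a slow startup from surfacing later as a confusing retrieval error.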


## 2 Multimodal Pipeline to Be Checked
This cell implements the full multimodal pipeline using LangChain with models served by Ollama. It integrates data from three modalities to produce a coherent and enriched response.

### Workflow
1. **RAG Data Processing:**
   - Documents are retrieved from the HTTP server.
   - Text is split into manageable chunks and indexed using a vector store.
   - A retriever dynamically fetches relevant chunks based on the user’s input.
2. **Image Encoding:**
   - The image is loaded and encoded in base64 format to be included in the prompt.
3. **Integration:**
   - The retrieved RAG data, user prompt, and encoded image are packaged into a structured input for the LLaVA vision-language model.
4. **Output:**
   - The LLaVA model processes the combined inputs and generates a detailed response.
   - The response integrates visual, textual, and contextual information, demonstrating the pipeline's multimodal capability.
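
The image-encoding step can be sketched on its own. The version below derives the MIME type from the file name, so the data URL for `image.png` is labeled `image/png`. The helper name `to_data_url` is an assumption for illustration; it uses only the standard library.

```python
import base64
import mimetypes


def to_data_url(file_path: str) -> str:
    """Encode an image file as a base64 data URL with a matching MIME type."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {file_path!r}")
    with open(file_path, "rb") as img_file:
        encoded = base64.b64encode(img_file.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Deriving the type from the extension is a convenience, not a guarantee: a mislabeled file would still be encoded under the wrong type.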



In [2]:
import base64
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_community import embeddings
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import CharacterTextSplitter

# Configuration variables
MODEL_VLM = ChatOllama(model="llava")
MODEL_EMBEDDING = embeddings.OllamaEmbeddings(model='nomic-embed-text')
PROMPT = """
Who is the character in the image, and what kind of hair products the character uses?
"""

def load_image(file_path):
    """Loads an image from the file path and encodes it in base64."""
    with open(file_path, "rb") as img_file:
        return base64.b64encode(img_file.read()).decode("utf-8")

def load_and_split_documents(urls):
    """Loads documents from URLs and splits them into chunks."""
    docs = [WebBaseLoader(url).load() for url in urls]
    docs_list = [item for sublist in docs for item in sublist]
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=7500, chunk_overlap=100)
    return text_splitter.split_documents(docs_list)

# Load RAG data from the HTTP server
urls = ["http://127.0.0.1:8088/data.txt"]
doc_splits = load_and_split_documents(urls)

# Create a vector store
vectorstore = Chroma.from_documents(documents=doc_splits, collection_name="rag-chroma", embedding=MODEL_EMBEDDING)
retriever = vectorstore.as_retriever()

# Load the image
image_b64 = load_image(r'image.png')

# Function to integrate image, RAG data, and user prompt
def prompt_func_rag(data):
    """Builds one multimodal message from retrieved context, the prompt, and the image."""
    text = data["text"]
    # Retrieve the chunks most relevant to the user prompt
    query_results = retriever.invoke(text)
    if query_results:
        rag_content = " ".join(doc.page_content for doc in query_results)
    else:
        rag_content = "No relevant content found."

    image_part = {
        "type": "image_url",
        # image.png is a PNG, so the data URL declares image/png
        "image_url": f"data:image/png;base64,{data['image']}",
    }
    content_parts = [{"type": "text", "text": rag_content}, {"type": "text", "text": text}, image_part]

    return [HumanMessage(content=content_parts)]

# Run the pipeline
chain = prompt_func_rag | MODEL_VLM | StrOutputParser()
result = chain.invoke({"rag": retriever, "text": PROMPT, "image": image_b64})
print(result)


USER_AGENT environment variable not set, consider setting it to identify your requests.
  MODEL_VLM = ChatOllama(model="llava")
  MODEL_EMBEDDING = embeddings.OllamaEmbeddings(model='nomic-embed-text')
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1


 The character in the image is Jenkins, an anthropomorphic character from the webcomic "Jenkins." In the comic, Jenkins appears to be a sophisticated and suave man with a classic hairstyle. As for hair products, while I can't explicitly say what specific product Jenkins uses without more context or seeing the comic where this is mentioned, it's reasonable to assume that given his dapper appearance and the fact that he is drawn in a manner reminiscent of webcomics with attention to detail, Jenkins likely uses high-quality and possibly expensive hair products. This could include items such as pomade, hair gel, or styling mousse to maintain his slicked-back hairstyle. 
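
One way to probe whether each modality actually contributes is to rebuild the message with one modality withheld and compare the responses. The sketch below covers only the message-assembly part, in plain Python; the function name `build_content_parts`, the `drop` parameter, and the stand-in values are assumptions for illustration and do not appear in the original pipeline.

```python
def build_content_parts(rag_text, prompt, image_b64, drop=None):
    """Assemble message parts, optionally omitting one modality.

    drop may be None, "rag", "prompt", or "image".
    """
    if drop not in (None, "rag", "prompt", "image"):
        raise ValueError(f"Unknown modality: {drop!r}")
    parts = []
    if drop != "rag":
        parts.append({"type": "text", "text": rag_text})
    if drop != "prompt":
        parts.append({"type": "text", "text": prompt})
    if drop != "image":
        parts.append({
            "type": "image_url",
            "image_url": f"data:image/png;base64,{image_b64}",
        })
    return parts


# Stand-in values (hypothetical); in the notebook these would come from the
# retriever, PROMPT, and load_image().
variants = {
    name: build_content_parts("RAG facts", "Who is this?", "aGVsbG8=", drop=name)
    for name in (None, "rag", "prompt", "image")
}
```

Each variant could then be wrapped in a `HumanMessage` and sent to the model. If the answer barely changes when the RAG text is withheld, that suggests the model is filling in the missing facts itself.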


## 3 Verifying the Generated Response
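This cell evaluates the output of the multimodal pipeline. A second LLaVA call, made through the `ollama` client, acts as a fact checker: the system prompt hardcodes the expected facts (the character is Jenkins and uses coconut oil or a blend as a hair product) and asks the model to flag missing facts or inconsistencies.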



In [3]:
import ollama

# Create the messages list with hardcoded facts in the system prompt
messages = [
    {
        'role': 'system',
        'content': (
            "You are a fact-checking AI assistant. Validate whether the given response: "
            "(1) identifies the character as Jenkins, "
            "(2) mentions that Jenkins uses coconut oil or blend as a hair product, "
            "and (3) maintains coherence with the provided text. Highlight any missing facts or inconsistencies."
        )
    },
    {
        'role': 'user',
        'content': "This is the text to analyze:\n\n" + result
    }
]

# Generate a response
response = ollama.chat(model='llava', messages=messages)

# Print the response
print(response['message']['content'])


 (1) The character in the image is identified as Jenkins.
(2) Yes, Jenkins uses coconut oil or blend as a hair product. The text specifically mentions that he uses this type of hair product and maintains his classic hairstyle.
(3) The provided text coherently describes the character, Jenkins, his appearance, and his possible use of high-quality hair products such as coconut oil or blend. There are no missing facts or inconsistencies in the given response. 
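
An LLM judge can itself hallucinate. In the recorded run above, the pipeline response never mentions coconut oil, yet the judge reports that it does. A deterministic keyword check is a cheap cross-check. The sketch below is a minimal version under stated assumptions: the helper name `check_required_terms` is hypothetical, the term list is taken from the system prompt, and `sample` is an abbreviated excerpt of the recorded response rather than the full text.

```python
def check_required_terms(response_text, required_terms):
    """Return the required terms that do not appear in the response (case-insensitive)."""
    lowered = response_text.lower()
    return [term for term in required_terms if term.lower() not in lowered]


# Abbreviated excerpt of the recorded pipeline response (stand-in for `result`).
sample = (
    "The character in the image is Jenkins. Jenkins likely uses high-quality "
    "hair products such as pomade, hair gel, or styling mousse."
)
missing = check_required_terms(sample, ["Jenkins", "coconut"])
print(missing)  # ['coconut']
```

A substring match is crude: it misses paraphrases and accepts negations such as "does not use coconut oil". It works best as a guard alongside the LLM judge rather than as a replacement.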


In [4]:
# Shutdown HTTP server
server.terminate()
server.wait()
print("HTTP server stopped.")


HTTP server stopped.
