# Vehicle Specification Extraction — Notebook README

This notebook extracts vehicle specifications (torques, fluid capacities, part numbers, etc.) from a service manual PDF using:
- PyMuPDF for robust PDF text extraction
- LangChain text-splitting for chunking
- HuggingFace Sentence-Transformers for embeddings
- Chroma for vector storage & retrieval
- Ollama + Llama-3 as the LLM (via `langchain-ollama`)

Run the code cells in order (Cell 1 → Cell 8).

**Install the Python packages needed for PDF parsing, embeddings, Chroma, and LangChain/Ollama.
This may take a few minutes; restart the runtime after a successful install.**


In [None]:
!pip install -U langchain PyMuPDF chromadb sentence-transformers transformers huggingface_hub langchain-community langchain-ollama langchain-huggingface --quiet

**Install and start the Ollama server, then pull the Llama-3 model for local inference.**


In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

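import os
import subprocess
import time

# Make sure the freshly installed binary is on PATH.
os.environ['PATH'] += os.pathsep + '/usr/local/bin'

try:
    # `ollama list` only succeeds when a server is reachable.
    subprocess.run(['ollama', 'list'], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    print("Ollama server is already running.")
except (subprocess.CalledProcessError, FileNotFoundError):
    print("Starting Ollama server...")
    # Discard server logs: an unread PIPE can fill up and stall the server.
    # start_new_session=True detaches the server from this cell's process group.
    process = subprocess.Popen(["ollama", "serve"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, start_new_session=True)
    # Give the server a few seconds to start listening.
    time.sleep(5)
    print("Ollama server started.")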

print("Pulling llama3:instruct model (this may take a few minutes)...")
!ollama pull llama3:instruct
print("llama3:instruct model pulled successfully.")

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Starting Ollama server...
Ollama server started.
Pulling llama3:instruct model (this may take a few minutes)...
llama3:instruct model pulled successfully.
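
A fixed `time.sleep(5)` may be too short on a slow machine and longer than needed on a fast one. As a rough sketch, the cell could instead poll the address reported by the installer (`127.0.0.1:11434`) until the port accepts connections. The helper name `wait_for_port` and its parameters are illustrative assumptions, not part of Ollama or of this notebook.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=11434, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` right after `subprocess.Popen(...)` would then stand in for the fixed sleep. An open port only shows that a process is listening, not that the model is loaded.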


**Load /content/service_manual.pdf with PyMuPDF and extract all pages into one text string.
Prints the number of characters extracted for a quick sanity check.**

In [None]:
import fitz  # PyMuPDF

pdf_path = "/content/service_manual.pdf"

# Concatenate the text of every page; the context manager closes the file.
with fitz.open(pdf_path) as doc:
    text = "".join(page.get_text() for page in doc)

print(f"Extracted {len(text)} characters from PDF")

Extracted 856936 characters from PDF


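**Split the extracted text into overlapping chunks with LangChain’s `RecursiveCharacterTextSplitter`.
Each chunk holds at most 1,000 characters and overlaps its neighbour by up to 200, so a specification that straddles a chunk boundary is more likely to appear whole in one chunk.**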

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents([Document(page_content=text)])
print(f"Created {len(chunks)} text chunks")

Created 1069 text chunks
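
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding window in plain Python. It only illustrates the overlap idea: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` chars that share `chunk_overlap` chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A fixed window over the 856,936 extracted characters with a step of 800 would yield about 1,070 windows, which is in line with the 1,069 chunks reported above.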


**Initialize the HuggingFace sentence-transformer embedding wrapper used for vectorization.
Model: `all-MiniLM-L6-v2`, a compact model that maps each chunk to a 384-dimensional vector for semantic search.**

In [None]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



**Create a Chroma vector store from the chunks and embeddings; persist it to disk.
This enables fast similarity search for retrieval-augmented generation.**

In [None]:
persist_directory = "chroma_db"

db = Chroma.from_documents(
    documents=chunks,
    embedding=embedder,
    persist_directory=persist_directory
)


print("ChromaDB vector store created and persisted.")

ChromaDB vector store created and persisted.
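
Conceptually, similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea with cosine similarity over toy 3-dimensional vectors in plain Python. The vectors, labels, and the `top_k` helper are made up for illustration; Chroma's actual index structure and default distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (label, score) pairs most similar to query_vec."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (hypothetical values, not produced by all-MiniLM-L6-v2).
store = {
    "caliper bolt torque": [0.9, 0.1, 0.0],
    "engine oil capacity": [0.1, 0.9, 0.1],
    "wheel nut torque": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
results = top_k(query, store, k=2)
print([label for label, _ in results])  # ['caliper bolt torque', 'wheel nut torque']
```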


**Initialize the Llama-3 LLM (ChatOllama), connect the Chroma retriever, and build the RAG chain.
qa_chain is ready to accept queries and return model responses.**

In [None]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


llm = ChatOllama(model="llama3:instruct")


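retriever = db.as_retriever()


def format_docs(docs):
    # Join the text of the retrieved chunks; passing the Document list
    # straight into the prompt would insert its repr, metadata included.
    return "\n\n".join(doc.page_content for doc in docs)


prompt_template = ChatPromptTemplate.from_messages([
    ("system", "Answer the question based only on the following context:\n{context}\n\nStrictly follow the output format instruction in the question."),
    ("user", "{question}")
])

qa_chain = (
    {
        # The retriever output is piped through format_docs before templating.
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt_template
    | llm
    | StrOutputParser()
)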

print("✅ 'qa_chain' has been successfully defined. You can now run the next cell.")

✅ 'qa_chain' has been successfully defined. You can now run the next cell.


**Interactive query cell: Enter a question, run the RAG chain, and parse the model’s JSON output.
Prints structured component, spec_type, value, unit (or raw output on parse failure).**

In [None]:
import json

query = input("Enter your query (e.g., 'Torque for brake caliper bolts'): ")

prompt = (
    "Extract the specification from the context and present the answer as JSON in this format:\n"
    "{\n"
    "  \"component\": \"...\",\n"
    "  \"spec_type\": \"...\",\n"
    "  \"value\": \"...\",\n"
    "  \"unit\": \"...\"\n"
    "}\n\n"
    f"Question: {query}"
)

response = qa_chain.invoke(prompt)


try:
    spec_json = json.loads(response)
    print("\nStructured Output (Parsed):")
    print(json.dumps(spec_json, indent=2))
except json.JSONDecodeError:
    print("\n⚠️ Response was not valid JSON. Here's what the model returned:")
    print(response)

Enter your query (e.g., 'Torque for brake caliper bolts'): Torque for brake caliper bolt

⚠️ Response was not valid JSON. Here's what the model returned:
Based on the provided context, the torque specification for the brake caliper bolt is:

{
  "component": "Brake flexible hose bracket bolt",
  "spec_type": "Torque",
  "value": 30,
  "unit": "Nm"
}

Note that there may be additional specifications or variations depending on the specific application or model.
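
In the run shown above, the model wrapped its JSON in explanatory prose, so `json.loads` rejected the whole response. One possible workaround, sketched here with the standard library only, is to scan for the first position where a JSON object can be decoded and ignore the surrounding text. The helper name `extract_first_json` is hypothetical and not part of the notebook.

```python
import json


def extract_first_json(text):
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates trailing text after it.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # this brace did not open valid JSON; try the next one
        start = text.find("{", start + 1)
    return None


response = (
    "Based on the provided context, the torque specification is:\n\n"
    '{\n  "component": "Brake flexible hose bracket bolt",\n'
    '  "spec_type": "Torque",\n  "value": 30,\n  "unit": "Nm"\n}\n\n'
    "Note that there may be additional specifications."
)
spec = extract_first_json(response)
print(spec["value"], spec["unit"])  # 30 Nm
```

In the query cell, `extract_first_json(response)` could replace the `json.loads` call, with a `None` result falling back to printing the raw response. Note that the component returned in this run is the brake flexible hose bracket bolt rather than the caliper bolt that was asked about, so the value should still be checked against the manual.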
