## Pre-processing and RAG retrieval testing

#### What this script does:
- Loads the processed text files
- Splits the documents into chunks for embedding
- Creates embeddings with all-MiniLM-L6-v2
- Saves a FAISS vector index in 04_models/vector_index/
- Runs test queries against the index to check that retrieval returns relevant results
- Prints the top-k documents and their similarity scores

In [None]:
# Import libraries

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
import os
import pickle
import numpy as np
from fastembed import TextEmbedding
import faiss
from datetime import datetime

### Preprocessing pipeline

1. Load documents (papers, patents, tech reports, startups data) and divide into chunks
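
The splitter configured in the next cell uses `chunk_size=1000` and `chunk_overlap=200`, so consecutive chunks can share up to 200 characters. As a rough illustration of what overlap means, here is a minimal sketch with a hypothetical helper, `split_with_overlap`, that cuts fixed-width windows. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraphs, lines, and spaces.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share `chunk_overlap` characters.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With a window of 10 and an overlap of 4, each chunk starts 6 characters after the previous one, and the last 4 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.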

In [6]:
# Define paths to the processed text files
data_path = "../../01_data/rag_automotive_tech/processed"
papers_path = os.path.join(data_path, "automotive_papers")  # journal papers and abstracts
patents_path = os.path.join(data_path, "automotive_tech_patents")  # patents data
reports_path = os.path.join(data_path, "tech_reports")  # tech reports
startups_path = os.path.join(data_path, "startups")  # startups data

# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

# Function for chunking documents from folders
def load_and_chunk_documents(folder_path, doc_type):
    """Load and chunk documents from a specific folder"""
    chunks = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(folder_path, filename)
            try:
                loader = TextLoader(file_path, encoding='utf-8')
                documents = loader.load()
                
                for doc in documents:
                    doc.metadata.update({
                        'source': filename,
                        'doc_type': doc_type,
                        'file_path': file_path
                    })
                
                doc_chunks = text_splitter.split_documents(documents)
                chunks.extend(doc_chunks)
                print(f"Loaded {len(doc_chunks)} chunks from {filename}")
                
            except Exception as e:
                print(f"Error loading {filename}: {e}")
    
    return chunks

# Load and chunk all documents
print("Loading research papers...")
papers_chunks = load_and_chunk_documents(papers_path, "research_paper")

print("Loading patents data...")
patents_chunks = load_and_chunk_documents(patents_path, "patents_data")

print("Loading tech reports...")
reports_chunks = load_and_chunk_documents(reports_path, "tech_report")

print("Loading startups data...")
startups_chunks = load_and_chunk_documents(startups_path, "startups")

# Combine all chunks
all_chunks = papers_chunks + patents_chunks + reports_chunks + startups_chunks
print("\nSummary:")
print(f"- Research papers: {len(papers_chunks)} chunks")
print(f"- Patents data: {len(patents_chunks)} chunks")
print(f"- Tech reports: {len(reports_chunks)} chunks")
print(f"- Startups data: {len(startups_chunks)} chunks")
print(f"Total chunks created: {len(all_chunks)}")

Loading research papers...
Loaded 61 chunks from enhanced_drift_aware_computer_vision_achitecture_for_autonomous_driving.txt
Loaded 102 chunks from Gen_AI_in_automotive_applications_challenges_and_opportunities_with_a_case_study_on_in-vehicle_experience.txt
Loaded 120 chunks from leveraging_vision_language_models_for_visual_grounding_and_analysis_of_automative_UI.txt
Loaded 11408 chunks from automotive_papers_processed.txt
Loaded 69 chunks from automating_automative_software_development_a_synergy_of_generative_AI_and_formal_methods.txt
Loaded 137 chunks from automotive-software-and-electronics-2030-full-report.txt
Loaded 102 chunks from AI_agents_in_engineering_design_a_multiagent_framework_for_aesthetic_and_aerodynamic_car_design.txt
Loaded 87 chunks from a_benchmark_framework_for_AI_models_in_automative_aerodynamics.txt
Loaded 227 chunks from generative_AI_for_autonomous_driving_a_review.txt
Loaded 46 chunks from Embedded_acoustic_intelligence_for_automotive_systems.txt
Loaded 107 ch

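2. Create embeddings and save the FAISS vector index

The chunks are embedded with the sentence-transformers/all-MiniLM-L6-v2 model, and the resulting vectors are stored in a FAISS index.

sentence-transformers/all-MiniLM-L6-v2 is a widely used, compact embedding model designed for semantic similarity, sentence clustering, and small- to medium-scale retrieval. It produces 384-dimensional vectors.

FAISS is a library for storing embeddings and running similarity search over them. The index type used here, IndexFlatL2, compares the query against every stored vector and ranks results by L2 distance, so a smaller distance means a closer match.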
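
To make the search step concrete, here is a minimal pure-Python sketch of what a flat L2 index computes. The next cell uses `faiss.IndexFlatL2`, which does the same exhaustive comparison in optimized native code and reports squared Euclidean distances. The helper `flat_l2_search` and the toy 2-dimensional vectors are hypothetical and only for illustration.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query.

    Distances are squared Euclidean distances, sorted ascending.
    Hypothetical helper; FAISS performs this search in native code.
    """
    scored = []
    for position, vector in enumerate(vectors):
        squared = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((squared, position))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-dimensional "embeddings"; the real ones in this notebook have 384 dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(stored, query=[1.0, 1.0], k=2)
print(distances, indices)
```

The returned indices are row positions in the stored array. That is why the notebook keeps `texts` and `metadatas` in the same order as the embeddings: position `i` in the index must refer to the same chunk in all three.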

In [None]:
# Create embeddings and save using FAISS

print(f"Processing {len(all_chunks)} chunks")

# Extract texts
texts = [chunk.page_content for chunk in all_chunks]
metadatas = [chunk.metadata for chunk in all_chunks]

# Setup paths
vector_index_path = "../../04_models/vector_index"
os.makedirs(vector_index_path, exist_ok=True)

# 1. Create embeddings
print("Loading fast embedding model...")
model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

print(f"Creating embeddings for {len(texts)} chunks...")
embeddings = list(model.embed(texts))
print(f"✓ Created {len(embeddings)} embeddings")

# Stack the per-chunk vectors into one (n_chunks, dim) float32 array, as FAISS expects
embeddings_array = np.asarray(embeddings, dtype="float32")
print(f"Embedding shape: {embeddings_array.shape}")

# 2. Create FAISS index
print("\n🔧 Creating FAISS index...")
dimension = embeddings_array.shape[1]

# Exact (brute-force) index; it reports squared L2 distances, so smaller is better
index = faiss.IndexFlatL2(dimension)
index.add(embeddings_array)

print(f"✓ FAISS index created with {index.ntotal} vectors")

# 3. Save everything
print("\n💾 Saving to disk...")

# Save FAISS index
faiss_index_path = os.path.join(vector_index_path, "faiss_index.bin")
faiss.write_index(index, faiss_index_path)
print(f"✓ FAISS index saved: {faiss_index_path}")

# Save texts
texts_path = os.path.join(vector_index_path, "texts.pkl")
with open(texts_path, "wb") as f:
    pickle.dump(texts, f)
print(f"✓ Texts saved: {texts_path} ({len(texts)} texts)")

# Save metadata
metadata_path = os.path.join(vector_index_path, "metadata.pkl")
with open(metadata_path, "wb") as f:
    pickle.dump(metadatas, f)
print(f"✓ Metadata saved: {metadata_path} ({len(metadatas)} entries)")

# Save embeddings for reference
embeddings_path = os.path.join(vector_index_path, "embeddings.npy")
np.save(embeddings_path, embeddings_array)
print(f"✓ Embeddings saved: {embeddings_path}")

# 4. Create index info
index_info = {
    "total_chunks": len(texts),
    "embedding_dim": dimension,
    "created_at": str(datetime.now()),
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "index_type": "FAISS IndexFlatL2"
}

info_path = os.path.join(vector_index_path, "index_info.json")
import json
with open(info_path, "w") as f:
    json.dump(index_info, f, indent=2)
print(f"✓ Index info saved: {info_path}")

# 5. VERIFICATION
print("\n" + "="*50)
print("✅ FAISS INDEX CREATION COMPLETE")
print("="*50)

print("\n📊 STATS:")
print(f"Total chunks: {len(texts)}")
print(f"Embedding dimension: {dimension}")
print(f"FAISS index size: {index.ntotal} vectors")

# Test query
print("\n🧪 TEST QUERY:")
test_query = "automotive technology"
print(f"Query: '{test_query}'")

# Create embedding for query as a (1, dim) float32 array
query_embedding = np.asarray(list(model.embed([test_query])), dtype="float32")

# Search
k = 3
distances, indices = index.search(query_embedding, k)

print(f"Found {len(indices[0])} results:")
for i, (idx, distance) in enumerate(zip(indices[0], distances[0])):
    if idx != -1:  # FAISS returns -1 if not enough results
        preview = texts[idx][:100] + "..." if len(texts[idx]) > 100 else texts[idx]
        doc_type = metadatas[idx].get('doc_type', 'unknown')
        print(f"\n  Result {i+1} (distance: {distance:.4f}):")
        print(f"    Type: {doc_type}")
        print(f"    Preview: {preview}")

# 6. Directory listing
print(f"\n📁 Directory contents of {vector_index_path}:")
for file in os.listdir(vector_index_path):
    file_path = os.path.join(vector_index_path, file)
    size = os.path.getsize(file_path)
    print(f"  - {file} ({size:,} bytes)")

print("\n" + "="*50)
print("🎉 RELIABLE VECTOR STORE READY!")
print("="*50)

🚀 RELIABLE FAISS - Processing 18717 chunks
Loading fast embedding model...
Creating embeddings for 18717 chunks...
✓ Created 18717 embeddings
Embedding shape: (18717, 384)

🔧 Creating FAISS index...
✓ FAISS index created with 18717 vectors

💾 Saving to disk...
✓ FAISS index saved: ../../04_models/vector_index/faiss_index.bin
✓ Texts saved: ../../04_models/vector_index/texts.pkl (18717 texts)
✓ Metadata saved: ../../04_models/vector_index/metadata.pkl (18717 entries)
✓ Embeddings saved: ../../04_models/vector_index/embeddings.npy
✓ Index info saved: ../../04_models/vector_index/index_info.json

✅ FAISS INDEX CREATION COMPLETE

📊 STATS:
Total chunks: 18717
Embedding dimension: 384
FAISS index size: 18717 vectors

🧪 TEST QUERY:
Query: 'automotive technology'
Found 3 results:

  Result 1 (distance: 0.6645):
    Type: research_paper
    Preview: (V2X), Internet of Things (IOT), public clouds, data analytics, artificial intelligence, digitalizat...

  Result 2 

### Retrieval test

Load and test the FAISS vector index
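
The test cell that follows turns each FAISS distance into a similarity score with `1 / (1 + distance)` and keeps results scoring above 0.5, which is the same as keeping distances below 1.0. The sketch below works through that arithmetic. It also shows the link to cosine similarity, which holds only under the assumption that the embeddings are unit-length (not verified in this notebook): in that case the squared L2 distance equals 2 - 2·cos. The helper names `distance_to_similarity` and `cosine_from_squared_l2` are hypothetical, and 0.6645 is the distance of the first result in the earlier test query.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a score in (0, 1]; distance 0 gives 1.0."""
    return 1.0 / (1.0 + distance)


def cosine_from_squared_l2(squared_distance):
    """Cosine similarity implied by a squared L2 distance.

    Valid only if both vectors have unit length (an assumption here).
    """
    return 1.0 - squared_distance / 2.0


for d in (0.0, 0.6645, 1.0, 2.0):
    score = distance_to_similarity(d)
    cosine = cosine_from_squared_l2(d)
    kept = score > 0.5  # same threshold as the test cell
    print(f"distance={d:.4f}  similarity={score:.4f}  cosine={cosine:.4f}  kept={kept}")
```

Note that the threshold is strict: a distance of exactly 1.0 gives a score of exactly 0.5 and is dropped. The score is a monotone rescaling of distance, so it does not change the ranking, only which results are printed.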

In [None]:
# Path to your FAISS index
VECTOR_INDEX_PATH = "../../04_models/vector_index"

print(f"Index path: {VECTOR_INDEX_PATH}")
print(f"Path exists: {os.path.exists(VECTOR_INDEX_PATH)}")

if os.path.exists(VECTOR_INDEX_PATH):
    print("\n📁 Directory contents:")
    for file in os.listdir(VECTOR_INDEX_PATH):
        file_path = os.path.join(VECTOR_INDEX_PATH, file)
        size = os.path.getsize(file_path)
        print(f"  - {file} ({size:,} bytes)")
    
    try:
        # 1. Load FAISS index
        index_path = os.path.join(VECTOR_INDEX_PATH, "faiss_index.bin")
        index = faiss.read_index(index_path)
        print(f"\n✅ FAISS index loaded: {index.ntotal} vectors")

        # 2. Load texts
        texts_path = os.path.join(VECTOR_INDEX_PATH, "texts.pkl")
        with open(texts_path, "rb") as f:
            texts = pickle.load(f)
        print(f"✅ Texts loaded: {len(texts)} chunks")

        # 3. Load metadata
        metadata_path = os.path.join(VECTOR_INDEX_PATH, "metadata.pkl")
        with open(metadata_path, "rb") as f:
            metadatas = pickle.load(f)
        print(f"✅ Metadata loaded: {len(metadatas)} entries")

        # 4. Initialize embedding model for queries
        model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
        print("✅ Embedding model loaded")
        
        # 5. Test queries
        test_queries = [
            "automotive startups",
            "autonomous driving technology", 
            "generative AI in automotive",
            "electric vehicle innovation"
        ]
        
        print(f"\n{'='*60}")
        print("🧪 Testing FAISS search directly:")

        for query in test_queries:
            print(f"\n🔍 Query: '{query}'")
            
            # Create query embedding
            query_embedding = np.array(
                list(model.embed([query]))[0].tolist()
            ).astype('float32').reshape(1, -1)
            
            # Search
            k = 3
            distances, indices = index.search(query_embedding, k)
            
            # Show results
            found_results = 0
            for i, (idx, distance) in enumerate(zip(indices[0], distances[0])):
                if idx != -1 and idx < len(texts):
                    # Map the L2 distance to a score in (0, 1]; a score above 0.5 means distance < 1.0
                    similarity = 1.0 / (1.0 + distance)
                    if similarity > 0.5:  # Threshold
                        found_results += 1
                        doc_type = metadatas[idx].get('doc_type', 'N/A') if idx < len(metadatas) else 'N/A'
                        source = metadatas[idx].get('source', 'N/A') if idx < len(metadatas) else 'N/A'

                        print(f"  Result {i+1}:")
                        print(f"    Type: {doc_type}")
                        print(f"    Source: {source}")
                        print(f"    Similarity: {similarity:.4f}")

                        # Flag results that come from the startups data
                        if doc_type == 'startups':
                            print("    🚀 Startups data: ✓")

            if found_results == 0:
                print("  ❌ No results above threshold 0.5")
        
        # Show document distribution
        print(f"\n{'='*60}")
        print("📈 Document type distribution:")
        doc_type_counts = {}
        for metadata in metadatas:
            doc_type = metadata.get('doc_type', 'unknown')
            doc_type_counts[doc_type] = doc_type_counts.get(doc_type, 0) + 1

        for doc_type, count in sorted(doc_type_counts.items(), key=lambda x: x[1], reverse=True):
            print(f"  - {doc_type}: {count} chunks")

        print(f"\n{'='*60}")
        print("✅ Direct FAISS test completed successfully!")

    except Exception as e:
        print(f"❌ Error during direct test: {e}")
        import traceback
        traceback.print_exc()
else:
    print("❌ Vector index directory not found!")
    print("Make sure you ran the FAISS embedding creation code first.")

🧪 Direct FAISS index test from Notebook 2
Index path: ../../04_models/vector_index
Path exists: True

📁 Directory contents:
  - metadata.pkl (431,988 bytes)
  - faiss_index.bin (28,749,357 bytes)
  - embeddings.npy (28,749,440 bytes)
  - index_info.json (187 bytes)
  - texts.pkl (11,096,569 bytes)

✅ FAISS index loaded: 18717 vectors
✅ Texts loaded: 18717 chunks
✅ Metadata loaded: 18717 entries
✅ Embedding model loaded

🧪 Testing FAISS search directly:

🔍 Query: 'automotive startups'
  Result 1:
    Type: research_paper
    Source: automotive_papers_processed.txt
    Similarity: 0.5672
  Result 2:
    Type: startups
    Source: autotechinsight_startups_processed.txt
    Similarity: 0.5527
    🚀 Startups data: ✓
  Result 3:
    Type: startups
    Source: autotechinsight_startups_processed.txt
    Similarity: 0.5459
    🚀 Startups data: ✓

🔍 Query: 'autonomous driving technology'
  Result 1:
    Type: research_paper
    Source: automotive_papers_processed