## Preprocessing and Building a RAG Retrieval Prototype

#### What this script does:
- Loads processed text files
- Prepares documents for embedding by splitting them into chunks
- Creates embeddings using TF-IDF
- Saves vector index in 04_models/vector_index/
- Loads the TF-IDF retriever from rag_components/retriever.py
- Tests retriever with some basic automotive-focused questions

In [None]:
# Import libraries

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
import importlib.util
import sys
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle
import joblib

### Preprocessing pipeline

1. Load documents and split them into chunks
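
The splitter in the next cell is configured with `chunk_size=1000` and `chunk_overlap=200`, both measured in characters because `length_function=len`. The sketch below illustrates only what the overlap means, using fixed character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks), and `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: the real splitter also tries to break
    on natural separators instead of cutting at arbitrary positions.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 2500
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

With these settings each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That repetition lets a sentence cut at a chunk boundary still appear whole in one of the two chunks.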

In [3]:
# Define paths where processed text files are found 
data_path = "../../01_data/rag_automotive_tech/processed"
papers_path = os.path.join(data_path, "automotive_papers")
reports_path = os.path.join(data_path, "tech_reports")
startups_file = os.path.join(data_path, "startups_processed.txt")  # Added startups file

# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

# Function for chunking tech and automotive reports

def load_and_chunk_documents(folder_path, doc_type):
    """Load and chunk documents from a specific folder"""
    chunks = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(folder_path, filename)
            try:
                loader = TextLoader(file_path, encoding='utf-8')
                documents = loader.load()
                
                # Add metadata to identify document source
                for doc in documents:
                    doc.metadata.update({
                        'source': filename,
                        'doc_type': doc_type,
                        'file_path': file_path
                    })
                
                # Split documents into chunks
                doc_chunks = text_splitter.split_documents(documents)
                chunks.extend(doc_chunks)
                print(f"Loaded {len(doc_chunks)} chunks from {filename}")
                
            except Exception as e:
                print(f"Error loading {filename}: {e}")
    
    return chunks

# Function for chunking the processed startups file (a single .txt file)

def load_and_chunk_single_file(file_path, doc_type, source_name):
    """Load and chunk a single file"""
    chunks = []
    try:
        loader = TextLoader(file_path, encoding='utf-8')
        documents = loader.load()
        
        # Add metadata to identify document source
        for doc in documents:
            doc.metadata.update({
                'source': source_name,
                'doc_type': doc_type,
                'file_path': file_path
            })
        
        # Split documents into chunks
        doc_chunks = text_splitter.split_documents(documents)
        chunks.extend(doc_chunks)
        print(f"Loaded {len(doc_chunks)} chunks from {source_name}")
        
    except Exception as e:
        print(f"Error loading {source_name}: {e}")
    
    return chunks

# Load and chunk all documents
print("Loading research papers...")
papers_chunks = load_and_chunk_documents(papers_path, "research_paper")

print("Loading tech reports...")
reports_chunks = load_and_chunk_documents(reports_path, "tech_report")

print("Loading startups data...")
startups_chunks = load_and_chunk_single_file(
    startups_file, 
    "startups_data", 
    "startups_processed.txt"
)

# Combine all chunks
all_chunks = papers_chunks + reports_chunks + startups_chunks
print(f"\nSummary:")
print(f"- Research papers: {len(papers_chunks)} chunks")
print(f"- Tech reports: {len(reports_chunks)} chunks")
print(f"- Startups data: {len(startups_chunks)} chunks")
print(f"Total chunks created: {len(all_chunks)}")

Loading research papers...
Loaded 61 chunks from enhanced_drift_aware_computer_vision_achitecture_for_autonomous_driving.txt
Loaded 102 chunks from Gen_AI_in_automotive_applications_challenges_and_opportunities_with_a_case_study_on_in-vehicle_experience.txt
Loaded 120 chunks from leveraging_vision_language_models_for_visual_grounding_and_analysis_of_automative_UI.txt
Loaded 69 chunks from automating_automative_software_development_a_synergy_of_generative_AI_and_formal_methods.txt
Loaded 137 chunks from automotive-software-and-electronics-2030-full-report.txt
Loaded 102 chunks from AI_agents_in_engineering_design_a_multiagent_framework_for_aesthetic_and_aerodynamic_car_design.txt
Loaded 87 chunks from a_benchmark_framework_for_AI_models_in_automative_aerodynamics.txt
Loaded 227 chunks from generative_AI_for_autonomous_driving_a_review.txt
Loaded 46 chunks from Embedded_acoustic_intelligence_for_automotive_systems.txt
Loaded 107 chunks from drive_disfluency-rich_synthetic_dialog_data_gen

2. Create embeddings using TF-IDF and save the vector index
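
TF-IDF gives a term more weight the more often it occurs in a chunk and the rarer it is across all chunks. The following pure-Python sketch shows that weighting, using the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. It is an illustration under assumptions: it ignores the `stop_words="english"` and `max_features=1000` settings used in the cell below, and `tfidf_vectors` is a hypothetical helper, not part of any library.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, L2-normalized."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", t.lower()) for t in texts]
    n_docs = len(tokenized)

    # Document frequency: in how many texts does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    "electric vehicle battery",
    "autonomous vehicle perception",
    "battery startup funding",
]
vecs = tfidf_vectors(docs)
# "vehicle" occurs in two texts and "electric" in one, so "electric" weighs more.
print(vecs[0]["electric"] > vecs[0]["vehicle"])
```

Because every vector has unit length, the cosine similarity between two chunks reduces to a dot product over their shared terms.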

In [4]:
# Confirm that we have all chunks
print(f"Total chunks to embed: {len(all_chunks)}")

# Create directory for vector storage
vector_index_path = "../../04_models/vector_index"
os.makedirs(vector_index_path, exist_ok=True)

print("Creating TF-IDF embeddings...")

# Extract text from chunks
texts = [chunk.page_content for chunk in all_chunks]

# Build TF-IDF matrix
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(texts)

# Package TF-IDF data
tfidf_data = {
    "matrix": tfidf_matrix,
    "vectorizer": vectorizer,
    "chunks": [
        {
            "page_content": chunk.page_content,
            "metadata": chunk.metadata
        }
        for chunk in all_chunks
    ]
}

# Save TF-IDF model + matrix
joblib.dump(tfidf_data, os.path.join(vector_index_path, "tfidf_embeddings.pkl"))
print("✓ TF-IDF embeddings created and saved!")

# Save chunk metadata (optional but useful)
print("Saving chunks metadata...")
chunks_metadata = [
    {
        "page_content": chunk.page_content,
        "metadata": chunk.metadata,
        "embedding_index": i
    }
    for i, chunk in enumerate(all_chunks)
]

with open(os.path.join(vector_index_path, "chunks_metadata.pkl"), "wb") as f:
    pickle.dump(chunks_metadata, f)

print("✓ Chunks metadata saved successfully!")
print(f"✓ Embedding process completed! Files saved in: {vector_index_path}")

# List created files
print("\n📁 Files in vector_index directory:")
for file in os.listdir(vector_index_path):
    print(f"  - {file}")


Total chunks to embed: 19038
Creating TF-IDF embeddings...
✓ TF-IDF embeddings created and saved!
Saving chunks metadata...
✓ Chunks metadata saved successfully!
✓ Embedding process completed! Files saved in: ../../04_models/vector_index

📁 Files in vector_index directory:
  - chunks_metadata.pkl
  - tfidf_embeddings.pkl
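
Row `i` of the TF-IDF matrix is meant to correspond to the chunk whose `embedding_index` is `i`, so a quick check after reloading the metadata can catch a mismatch. The sketch below round-trips toy records through `pickle` in a temporary directory. The record layout follows the cell above, while the record contents and the `check_alignment` helper are invented for illustration.

```python
import os
import pickle
import tempfile


def check_alignment(records):
    """Return True if every record's embedding_index equals its list position."""
    return all(rec["embedding_index"] == i for i, rec in enumerate(records))


# Toy records in the same layout as chunks_metadata (contents invented).
toy_records = [
    {"page_content": "first chunk", "metadata": {"doc_type": "tech_report"}, "embedding_index": 0},
    {"page_content": "second chunk", "metadata": {"doc_type": "startups_data"}, "embedding_index": 1},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunks_metadata.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_records, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(check_alignment(reloaded))
```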


### Retriever prototype

Load and test the retriever
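
The contents of `retriever.py` are not shown in this notebook. As a rough sketch of how a TF-IDF retriever commonly ranks chunks, the code below scores each chunk by cosine similarity against a query vector and returns the top `k`. The result keys (`source_file`, `doc_type`, `similarity_score`) follow the test cell further down; `cosine`, `rank_chunks`, and the toy vectors are hypothetical and are not the actual `DocumentAwareRetriever` implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(query_vec, chunks, k=3):
    """Score every chunk against the query and return the k best as dicts."""
    scored = [
        {
            "source_file": chunk["source"],
            "doc_type": chunk["doc_type"],
            "similarity_score": cosine(query_vec, chunk["vector"]),
        }
        for chunk in chunks
    ]
    scored.sort(key=lambda r: r["similarity_score"], reverse=True)
    return scored[:k]


# Toy index: hand-written weights, not real TF-IDF output.
toy_chunks = [
    {"source": "a.txt", "doc_type": "research_paper", "vector": {"autonomous": 1.0, "driving": 1.0}},
    {"source": "b.txt", "doc_type": "startups_data", "vector": {"electric": 1.0, "vehicle": 1.0}},
    {"source": "c.txt", "doc_type": "tech_report", "vector": {"software": 1.0}},
]

results = rank_chunks({"electric": 1.0, "vehicle": 1.0}, toy_chunks, k=2)
print(results[0]["source_file"], round(results[0]["similarity_score"], 4))
```

A score near 1 means the query and the chunk use the same terms in similar proportions, and a score of 0 means they share no terms.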

In [5]:
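def import_retriever():
    """Load DocumentAwareRetriever from rag_components/retriever.py."""
    # The path is resolved relative to the notebook's working directory
    current_dir = os.getcwd()
    retriever_path = os.path.join(current_dir, 'rag_components', 'retriever.py')
    spec = importlib.util.spec_from_file_location("retriever", retriever_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Could not load retriever module from {retriever_path}")
    retriever_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(retriever_module)
    return retriever_module.DocumentAwareRetriever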

DocumentAwareRetriever = import_retriever()
print("✅ Retriever imported!")

✅ Retriever imported!


In [6]:
# Initialize the retriever
print("Initializing TF-IDF retriever...")
VECTOR_INDEX_PATH = "../../04_models/vector_index"
retriever = DocumentAwareRetriever(VECTOR_INDEX_PATH)

# Test the retriever with startups data
print("\n🧪 Testing retriever with automotive-focused queries...")

test_queries = [
    "automotive startups",
    "autonomous driving technology", 
    "generative AI in automotive",
    "electric vehicle innovation"
]

for query in test_queries:
    results = retriever.retrieve_with_sources(query, k=3)
    
    if results:
        print(f"\n🔍 Query: '{query}'")
        print(f"📊 Top result: {results[0]['source_file']}")
        print(f"📝 Type: {results[0]['doc_type']}")
        print(f"⭐ Score: {results[0]['similarity_score']:.4f}")

        # Check if startups data was retrieved
        startups_found = any(doc['doc_type'] == 'startups_data' for doc in results)
        print(f"🚀 Startups data included: {'✓' if startups_found else '✗'}")
    else:
        print(f"\n❌ No results for: '{query}'")

# Show document type distribution
doc_type_counts = retriever.get_doc_type_counts()
print(f"\n📈 Document type distribution:")
for doc_type, count in doc_type_counts.items():
    print(f"  - {doc_type}: {count} chunks")

Initializing TF-IDF retriever...
✓ TF-IDF retriever loaded successfully

🧪 Testing retriever with automotive-focused queries...

🔍 Query: 'automotive startups'
📊 Top result: automotive-software-and-electronics-2030-full-report.txt
📝 Type: research_paper
⭐ Score: 0.4653
🚀 Startups data included: ✗

🔍 Query: 'autonomous driving technology'
📊 Top result: Gen_AI_in_automotive_applications_challenges_and_opportunities_with_a_case_study_on_in-vehicle_experience.txt
📝 Type: research_paper
⭐ Score: 0.6599
🚀 Startups data included: ✗

🔍 Query: 'generative AI in automotive'
📊 Top result: Gen_AI_in_automotive_applications_challenges_and_opportunities_with_a_case_study_on_in-vehicle_experience.txt
📝 Type: research_paper
⭐ Score: 0.8090
🚀 Startups data included: ✗

🔍 Query: 'electric vehicle innovation'
📊 Top result: startups_processed.txt
📝 Type: startups_data
⭐ Score: 0.5062
🚀 Startups data included: ✓

📈 Document type distri