<a href="https://colab.research.google.com/github/lannd3217/Advanced-Business-Analysis/blob/main/InterviewRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/lannd3217/Interview_RAG.git

fatal: destination path 'Interview_RAG' already exists and is not an empty directory.


In [2]:
%cd "Interview_RAG"

/content/Interview_RAG


In [3]:
!pip install -U langchain langchain-community langchain-openai langchain-text-splitters langchain-huggingface pymupdf sentence-transformers



In [4]:
!pip install -q sentence-transformers faiss-cpu transformers


In [4]:
import os
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

def load_pdfs_from_folder(folder_path="./docs"):
    """
    Loads all PDF documents from a specified directory.
    """
    print(f"Loading PDFs from: {folder_path}...")

    # Initialize the DirectoryLoader to specifically target PDF files
    # Using PyMuPDFLoader for its speed and robust text extraction
    loader = DirectoryLoader(
        folder_path,
        glob="*.pdf",
        loader_cls=PyMuPDFLoader,
        show_progress=True,
        silent_errors=True  # Skips files that might be corrupted
    )

    # Load the documents into a list of LangChain Document objects
    docs = loader.load()

    print(f"Successfully loaded {len(docs)} pages from your PDFs.")
    return docs

# Example usage
pdf_documents = load_pdfs_from_folder("./docs")



Loading PDFs from: ./docs...


100%|██████████| 9/9 [00:03<00:00,  2.47it/s]

Successfully loaded 1429 pages from your PDFs.





In [5]:
# Count the loaded pages (one Document per PDF page)
len(pdf_documents)
# Uncomment to preview the first 500 characters of one page (index 8 here):
# print(pdf_documents[8].page_content[:500])

1429

In [6]:
search_term = "A/B testing"
for doc in pdf_documents:
    if search_term.lower() in doc.page_content.lower():
        print(f"Found '{search_term}' in {doc.metadata['source']} on page {doc.metadata['page']}")

Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 0
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 1
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 2
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 3
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 4
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 5
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 6
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 7
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 8
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 10
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 12
Found 'A/B testing' in docs/A:B Testing- A SystematicLiterature.pdf on page 13
Found 'A/B testing' in docs/A:B Testing- A SystematicLiteratu
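
Printing one line per matching page gets long quickly. A minimal sketch of a per-file summary follows. It uses plain dictionaries as hypothetical stand-ins for the loaded pages, and the file names and the helper `count_hits_per_source` are illustrative, not part of the notebook:

```python
from collections import Counter

# Hypothetical stand-ins for the loaded pages: each dict carries the source
# file, the page index, and the page text.
pages = [
    {"source": "docs/a.pdf", "page": 0, "text": "Intro to A/B testing."},
    {"source": "docs/a.pdf", "page": 1, "text": "More on a/b testing."},
    {"source": "docs/b.pdf", "page": 0, "text": "Unrelated content."},
]

def count_hits_per_source(pages, search_term):
    """Count how many pages of each source contain search_term (case-insensitive)."""
    needle = search_term.lower()
    return Counter(p["source"] for p in pages if needle in p["text"].lower())

hits = count_hits_per_source(pages, "A/B testing")
for source, n_pages in hits.most_common():
    print(f"{source}: {n_pages} page(s)")
```

With the real `pdf_documents`, the same idea would read `doc.metadata['source']` and `doc.page_content` instead of the dictionary keys.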

In [7]:
import re

def clean_text(text):
    # 1. Remove weird "cid" characters (often found in PDFs with special fonts)
    text = re.sub(r'\(cid:\d+\)', '', text)

    # 2. Replace multiple newlines with a single one to keep structure but save space
    text = re.sub(r'\n\s*\n', '\n', text)

    # 3. Replace multiple spaces with a single space
    text = re.sub(r' +', ' ', text)

    # 4. Remove purely numerical lines (often page numbers or footers).
    #    Lines with trailing spaces, such as "1 ", are not matched by this pattern.
    text = re.sub(r'^\d+$\n', '', text, flags=re.MULTILINE)

    return text.strip()

# --- Apply cleaning to your existing documents ---
cleaned_samples = []

for doc in pdf_documents:
    # Extract raw content
    raw_content = doc.page_content

    # Clean it
    cleaned_content = clean_text(raw_content)

    # Store the source name with the full cleaned text; the preview is truncated at print time
    source_name = doc.metadata.get('source', 'Unknown File')
    cleaned_samples.append((source_name, cleaned_content))

# --- Print the Example Output ---
print(f"✨ Cleaned {len(pdf_documents)} pages.\n")
for filename, text in cleaned_samples[:3]: # Preview first 3 pages
    print(f"Source: {filename}")
    print(f"--- Cleaned Text Preview ---\n{text[:350]}...")
    print("-" * 30)

✨ Cleaned 1429 pages.

Source: docs/Meta_ML_initial_interview_prep.pdf
--- Cleaned Text Preview ---
1 
Machine Learning 
Initial 
Interview Guide 
Welcome to your preparation guide for your Machine Learning (ML) 
initial interview at Meta! Use the sidebar to quickly jump to the 
section you are looking for. Our ML engineering leaders and 
recruiters put together this guide, so you know what to expect and 
how to prepare. We recognize that intervi...
------------------------------
Source: docs/Meta_ML_initial_interview_prep.pdf
--- Cleaned Text Preview ---
2 
Coding (35 minutes) 
You'll solve two coding problems focused on CS fundamentals like algorithms, 
data structures, recursions, and binary trees. If your tech screen is by phone, the 
engineer will send you a collaborative editor (such as coderpad.io). If your tech 
screen is in person, you'll use a whiteboard. 
Answering Your Questions (5 minut...
------------------------------
Source: docs/Meta_ML_initial_interview_prep.pdf
--- C
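
The preview above still shows page numbers such as `1 ` because those lines end in a space, and the cleaned text is only stored in `cleaned_samples`. The sketch below shows one possible way to handle both points: strip trailing whitespace before removing digit-only lines, then write the result back onto each page. `Page` is a hypothetical stand-in for a LangChain `Document`, and `clean_text_strict` is an illustrative name. Whether assigning to `page_content` behaves the same on real `Document` objects is an assumption worth checking.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def clean_text_strict(text):
    # Drop "(cid:NN)" font artifacts.
    text = re.sub(r"\(cid:\d+\)", "", text)
    # Strip trailing spaces/tabs on every line so later patterns see clean line ends.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Remove lines that contain only digits (typical page numbers).
    text = re.sub(r"^\d+\n", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines, then runs of spaces.
    text = re.sub(r"\n\s*\n", "\n", text)
    text = re.sub(r" +", " ", text)
    return text.strip()

pages = [
    Page("1 \nMachine Learning \nInitial \n\n\nGuide  (cid:12)text\n",
         {"source": "demo.pdf"}),
]

# Write the cleaned text back so later steps (splitting, embedding) see it.
for page in pages:
    page.page_content = clean_text_strict(page.page_content)

print(repr(pages[0].page_content))
```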

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # Measured in characters (roughly 150-200 words)
    chunk_overlap=200, # Overlap helps keep context between chunks
    separators=["\n\n", "\n", " ", ""]
)

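# Create the chunks
# Note: this splits the original pdf_documents; the output of clean_text()
# was stored separately and is not part of these chunks.
chunks = text_splitter.split_documents(pdf_documents)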
print(f"Created {len(chunks)} chunks from your 9 PDFs.")

Created 3165 chunks from your 9 PDFs.
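
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators). The helper `sliding_window_chunks` is a hypothetical name used only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence cut at a boundary readable in at least one chunk.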


In [9]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Initialize the embedding model
# 'all-MiniLM-L6-v2' is a small, widely used sentence-embedding model (384-dimensional vectors) that runs quickly on CPU
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'} # Change to 'cuda' if you are using a GPU
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
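
The `normalize_embeddings` flag matters for how distances are interpreted. The sketch below uses toy 2-D vectors rather than real model output. It shows that for unit-length vectors the squared L2 distance equals `2 - 2 * cosine similarity`, so ranking by L2 distance on normalized vectors matches ranking by cosine similarity. Whether the FAISS index built later uses L2 distance by default is an assumption to verify against the LangChain documentation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (illustrative values only).
a = [3.0, 4.0]
b = [4.0, 3.0]

lhs = squared_l2(l2_normalize(a), l2_normalize(b))
rhs = 2 - 2 * cosine(a, b)
print(math.isclose(lhs, rhs))
```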


In [11]:

# 2. Create your vector store from the chunks you prepared
vector_db = FAISS.from_documents(chunks, embeddings)
print("✅ Vector store created successfully!")

✅ Vector store created successfully!


In [12]:
vector_db.save_local("faiss_interview_index")
print("Vector store saved to 'faiss_interview_index'")

Vector store saved to 'faiss_interview_index'


In [17]:
# To load it back later or in a new cell:
# vector_db = FAISS.load_local("faiss_interview_index", embeddings, allow_dangerous_deserialization=True)

# Create a retriever
retriever = vector_db.as_retriever(search_kwargs={"k": 3})

# Quick Test: Search without the LLM first
query = "What are the behavioral questions for Meta?"
docs = retriever.invoke(query)

print(f"Found {len(docs)} relevant snippets. Here is a peek at the first one:")
print(docs[0].page_content + "...")

Found 3 relevant snippets. Here is a peek at the first one:
The unique part of my Amazon interview was how big of a portion
the behavioral questions took. This made the behavioral interview
stories preparation way more intense and valuable. The
Leadership Principles are very important, even more so once you
get into Amazon.
—Ammar, Engineer at Amazon
Meta/Facebook
Meta has Six Core Values. At the time of writing, they are:
Move fast
Focus on long-term impact
Build awesome things
Live in the future
Be direct and respect your colleagues
Meta, Metamates, me
To repeat the tips I previously mentioned, you’ll have to weave these
principles into your behavioral questions responses: you’re not just
listing them out.
Meta has five signals to assess during the behavioral interview.
These are:
Resolving conflict
Growing continuously
Embracing ambiguity
Driving results
Communicating effectively...
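
Conceptually, a retriever with `k=3` embeds the query and returns the three stored chunks whose vectors are closest to it. The sketch below is a brute-force version over a toy in-memory index. The vectors and ids are made up, `top_k` is a hypothetical helper, and a real FAISS index performs this search far more efficiently.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=3):
    """Return the k (distance, doc_id) pairs closest to query_vec, nearest first."""
    scored = ((squared_l2(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return heapq.nsmallest(k, scored)

# Toy index: illustrative ids mapped to 2-D vectors, not real embeddings.
index = {
    "amazon_lp": [0.9, 0.1],
    "meta_values": [0.1, 0.9],
    "meta_signals": [0.2, 0.8],
    "coding": [0.7, 0.7],
}

results = top_k([0.0, 1.0], index, k=3)
for distance, doc_id in results:
    print(f"{doc_id}: {distance:.2f}")
```

The nearest chunk is returned first, which is why the notebook inspects `docs[0]` as the top hit. A close vector is not guaranteed to be on topic, as the Amazon quote at the start of the snippet above shows.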
