In [36]:
import os
import textwrap
from dotenv import load_dotenv

# LangChain loaders, splitters, and vector store
from langchain_community.document_loaders import PyPDFLoader, YoutubeLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore

# LangChain chains & prompts
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain import hub
from langchain_core.prompts import ChatPromptTemplate

load_dotenv()

True

### Ingesting ICA Resources into Pinecone

The next cell loads, chunks, embeds, and ingests documents related to **Independent Component Analysis (ICA)** into the Pinecone vector store for retrieval-based QA.

1. **Load PDFs**  
   We load one or more PDF documents using `PyPDFLoader`. In this case, we're using `fastICA.pdf`, which contains academic content on ICA.

2. **Load YouTube Transcript**  
   The transcript from a YouTube lecture (Stanford Online) is loaded using `YoutubeLoader`. The transcript is treated as a document and will be chunked like any other source.

3. **Combine All Documents**  
   All loaded documents (PDFs + YouTube transcript) are merged into a single list so they can be processed together.

4. **Chunking**  
   The combined content is split into smaller, overlapping chunks using `RecursiveCharacterTextSplitter`. Each chunk is at most 500 characters, with a 100-character overlap between neighboring chunks so that context spanning a chunk boundary is not lost at retrieval time.

5. **Generate Embeddings**  
   Each chunk is transformed into a high-dimensional embedding using OpenAI’s embedding model.

6. **Ingest into Pinecone**  
   Chunks larger than 3,000 bytes (UTF-8 encoded) are dropped as a conservative safeguard against Pinecone's per-vector metadata limit (~40KB). The remaining chunks are stored in the Pinecone index, which allows for fast similarity-based retrieval later.

Once complete, we can run queries over this embedded corpus and retrieve relevant chunks to answer ICA-related questions.
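
To make the chunk size and overlap parameters in step 4 concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on the separators it is given (paragraphs, lines, sentences), so its chunks are usually shorter and less regular. The helper name `chunk_with_overlap` and the sample text are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample: 1,200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(sample)

print([len(c) for c in chunks])             # chunk lengths
print(chunks[0][-100:] == chunks[1][:100])  # neighbors share 100 characters
```

The last 100 characters of one chunk reappear as the first 100 of the next, which is what lets a sentence cut at a boundary still be retrieved whole from at least one chunk.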

In [39]:
# --- Load PDFs ---
pdf_paths = ["fastICA.pdf"]
pdf_docs = []
for path in pdf_paths:
    loader = PyPDFLoader(path)
    pdf_docs.extend(loader.load())

# --- Load YouTube transcript ---
# Make sure the video has captions or transcript available
youtube_url = "https://www.youtube.com/watch?v=YQA9lLdLig8&ab_channel=StanfordOnline" 
yt_loader = YoutubeLoader.from_youtube_url(youtube_url, add_video_info=False)
yt_docs = yt_loader.load()

# --- Combine all docs ---
all_docs = pdf_docs + yt_docs

print("Splitting...")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""]
)
texts = splitter.split_documents(all_docs)
print(f"Created {len(texts)} chunks")

# --- Embeddings and Pinecone ingestion ---
embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])

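print("Ingesting into Pinecone...")
# Drop any chunk larger than 3,000 UTF-8 bytes, a conservative guard against
# Pinecone's per-vector metadata limit.
texts = [doc for doc in texts if len(doc.page_content.encode("utf-8")) < 3000]
# from_texts embeds and upserts the raw strings only; document metadata
# (e.g. source file, page number) is not passed along here.
PineconeVectorStore.from_texts(
    [doc.page_content for doc in texts],
    embedding=embeddings,
    index_name=os.environ["INDEX_NAME"]
)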
print("Ingestion complete.")

Splitting...
Created 364 chunks
Ingesting into Pinecone...
Ingestion complete.
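
The size filter in the cell above measures UTF-8 bytes rather than characters, and the two can differ. The sketch below, using made-up sample strings and a hypothetical helper `utf8_size`, shows how the byte size of a 500-character chunk depends on its content. Because a UTF-8 code point takes at most 4 bytes, a 500-character chunk cannot exceed 2,000 bytes, so with these settings the 3,000-byte filter acts purely as a safety net.

```python
MAX_BYTES = 3000  # same threshold as the filter in the ingestion cell


def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Three hypothetical 500-character chunks with different character widths
ascii_chunk = "a" * 500           # 1 byte per character
greek_chunk = "\u03b1" * 500      # Greek alpha, 2 bytes per character
emoji_chunk = "\U0001F600" * 500  # emoji, 4 bytes per character

for name, chunk in [("ascii", ascii_chunk), ("greek", greek_chunk), ("emoji", emoji_chunk)]:
    size = utf8_size(chunk)
    print(f"{name}: {len(chunk)} chars, {size} bytes, kept={size < MAX_BYTES}")
```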


### Ask Questions About ICA Using RAG

This cell allows the user to input a natural language question about **Independent Component Analysis (ICA)** and generates an answer based on the previously ingested documents (PDF + YouTube transcript).

1. **Prompt the User**  
   The user is asked to type a question (e.g., *"What is the cocktail-party problem?"*).

2. **Load Embedding Model and Vector Store**  
   - The same OpenAI embedding model used during ingestion is reloaded.
   - The Pinecone vector store is connected so we can retrieve relevant document chunks.

3. **Create a Retrieval-Augmented Generation (RAG) Chain**  
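   - A `retrieval_chain` is built with `create_retrieval_chain`, which pairs a document retriever (from Pinecone) with the prompt-and-LLM chain produced by `create_stuff_documents_chain`.
   - The system message tells the LLM to **base its answer primarily on the retrieved documents** and to say so when the context contains nothing useful, which discourages hallucinated answers.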
   - The retriever uses **vector similarity** to fetch the 5 most relevant document chunks based on the user’s query.

4. **Generate and Display Answer**  
   - The chain is invoked with the user’s question.
   - The resulting answer is printed.
   - Additionally, the document chunks used to generate the answer are shown under **“Context Chunks Used”** to provide transparency.

This setup allows us to ask detailed, contextual questions about ICA theory, applications, or examples, leveraging both text and transcript data.
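
As a rough illustration of the vector-similarity step, the sketch below ranks a few toy vectors against a query vector by cosine similarity and keeps the top `k`. Everything in it is assumed for illustration: the 3-dimensional vectors, the chunk names, and the helpers `cosine_similarity` and `top_k` are hypothetical, real embeddings have far more dimensions, and the metric Pinecone actually applies depends on how the index was configured, which this notebook does not show.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy "embeddings" (made up for this example)
query = [1.0, 0.0, 0.0]
toy_chunks = {
    "chunk_a": [0.9, 0.1, 0.0],   # nearly parallel to the query
    "chunk_b": [0.0, 1.0, 0.0],   # orthogonal to the query
    "chunk_c": [0.7, 0.7, 0.0],   # 45 degrees from the query
    "chunk_d": [-1.0, 0.0, 0.0],  # opposite direction
}

print(top_k(query, toy_chunks, k=2))
```

In the notebook itself this ranking happens inside Pinecone; the retriever only receives the `k` best-matching chunks.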

In [42]:
# --- Prompt user for query ---
query = input("Enter question about ICA: ").strip()

print("Loading vector store and models...")
embeddings = OpenAIEmbeddings()
llm = ChatOpenAI()

vectorstore = PineconeVectorStore(
    index_name=os.environ["INDEX_NAME"],
    embedding=embeddings
)

# --- Create retrieval chain ---
retrieval_qa_chat_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are an expert assistant answering questions using the context below. "
     "Base your answer primarily on the provided documents. "
     "If relevant information is found in the context, use it to answer as clearly and helpfully as possible. "
     "If nothing useful is found, you may respond with 'The answer is not available in the provided documents.'"),
    ("human", "Context:\n{context}\n\nQuestion:\n{input}")
])
combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
retrieval_chain = create_retrieval_chain(
    retriever=vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5}),
    combine_docs_chain=combine_docs_chain
)

# --- Run the query ---
print("Retrieving and generating answer...\n")
result = retrieval_chain.invoke({"input": query})

# --- Format and print Markdown-style output ---
answer = result.get("answer", "[No answer generated]")
sources = result.get("context", [])

print("### Answer:\n")
print(textwrap.fill(answer, width=100))

if sources:
    print("\n---\n")
    print("### Context Chunks Used:\n")
    for i, doc in enumerate(sources):
        print(f"**Chunk {i+1}:**\n")
        print(textwrap.fill(doc.page_content.strip(), width=100))
        print("\n---\n")

Enter question about ICA:  please explain the cocktail-party problem


Loading vector store and models...
Retrieving and generating answer...

### Answer:

The cocktail-party problem refers to a scenario where multiple sound sources are mixed together,
like at a party where many people are speaking simultaneously. In this situation, the goal is to
separate and isolate the individual sound sources from the mixed signals recorded by microphones.
This problem is analogous to trying to pick out and understand individual voices in a crowded and
noisy room.  To address the cocktail-party problem, researchers use techniques like Independent
Component Analysis (ICA) to estimate the original sound sources based on the mixed signals received
by microphones. By analyzing the recorded signals mathematically, it is possible to separate the
combined voices and extract the individual speech signals, ultimately reconstructing the original
sources.

---

### Context Chunks Used:

**Chunk 1:**

. And the goal is to find the matrix W, uh, which should hopefully be A inverse

### Displaying the Answer in Markdown Format

This cell formats the output using **Markdown**, but it could be adapted for use with any frontend. 

1. **Extract Results**  
   Retrieves the answer generated by the retrieval-augmented chain and the document chunks (`context`) used to produce it.

2. **Display Answer**  
   Uses `IPython.display.Markdown` to render the final answer under an "Answer" heading.

3. **Display Source Chunks**  
   If any relevant document chunks were used:
   - Each chunk is shortened to its first ~40 words.
   - Chunks are numbered and displayed in a clean, bullet-style list.
   - This gives users transparency into *where* the answer came from without overwhelming them with full documents.

In [44]:
from IPython.display import display, Markdown

# --- Format and print Markdown-style output ---
answer = result.get("answer", "[No answer generated]")
sources = result.get("context", [])

# Display the answer in Markdown
display(Markdown(f"### 🤖 Answer\n\n{answer.strip()}"))

if sources:
    md_chunks = ["\n---\n", "### Context Chunks Used\n"]
    for i, doc in enumerate(sources):
        # split() already collapses newlines and repeated whitespace
        words = doc.page_content.split()
        short = " ".join(words[:40]) + ("..." if len(words) > 40 else "")
        md_chunks.append(f"- **Chunk {i+1}**: {short}")
    display(Markdown("\n".join(md_chunks)))


### 🤖 Answer

The cocktail-party problem refers to a scenario where multiple sound sources are mixed together, like at a party where many people are speaking simultaneously. In this situation, the goal is to separate and isolate the individual sound sources from the mixed signals recorded by microphones. This problem is analogous to trying to pick out and understand individual voices in a crowded and noisy room.

To address the cocktail-party problem, researchers use techniques like Independent Component Analysis (ICA) to estimate the original sound sources based on the mixed signals received by microphones. By analyzing the recorded signals mathematically, it is possible to separate the combined voices and extract the individual speech signals, ultimately reconstructing the original sources.


---

### Context Chunks Used

- **Chunk 1**: . And the goal is to find the matrix W, uh, which should hopefully be A inverse, um, so that SI is W times X recovered the original sources. Uh, and we're going to use these W1 up through WN...
- **Chunk 2**: as a linear equation: x1(t) = a11s1 + a12s2 (1) x2(t) = a21s1 + a22s2 (2) where a11, a12, a21, anda22 are some parameters that depend on the distances of the microphones from the speakers. It would be very useful...
- **Chunk 3**: . So the analogy to the cocktail party problem, the, um, overlapping speakers' voices is that, you know, your- your brain [NOISE] does a lot of things at the same time, right? Your brain helps regulate your heartbeat, um, part...
- **Chunk 4**: reversed, but this has no signiﬁcance.) Independent component analysis was originally developed to deal with problems that are closely related to the cocktail-party problem. Since the recent increase of interest in ICA, it has become clear that this principle has...
- **Chunk 5**: . So think of there as, um, imagine that you have a, you know, cocktail party in your head, right? So many overlapping voices, so this is now voices in your head, uh, just going back, but one- one- one...