# Synthetic Data Generation for YouTube Insights

This notebook uses RAGAS to generate a synthetic test set from a YouTube transcript, uploads it to LangSmith as a dataset, and builds a small RAG chain over the same transcript so the pipeline can be evaluated and improved.

In [45]:
from langchain_community.document_loaders import YoutubeLoader

# Load transcript from video
video_url = "https://www.youtube.com/watch?v=BaTjJJsz0rY"
loader = YoutubeLoader.from_youtube_url(video_url)
docs = loader.load()

print(f"Loaded {len(docs)} document(s)")
print(f"First document preview: {docs[0].page_content[:200]}...")


Loaded 1 document(s)
First document preview: okay whiz we're talking about reasoning in latent space today is that the same as test time compute yeah that's right nice nice okay and we've got two big ideas to cover that are aimed at scaling the ...


In [21]:
import os
import getpass

In [22]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [46]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [47]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Generating personas: 100%|██████████| 1/1 [00:00<00:00,  1.63it/s]                                           
Generating Scenarios: 100%|██████████| 2/2 [00:03<00:00,  1.73s/it]
Generating Samples: 100%|██████████| 10/10 [00:02<00:00,  3.95it/s]


In [48]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How does the concept of Chain of Thought relat...,[okay whiz we're talking about reasoning in la...,"The Chain of Thought, specifically the continu...",single_hop_specifc_query_synthesizer
1,What are embedding models and how do they rela...,[kind of doing this compression okay we're tak...,Embedding models are used to represent data po...,single_hop_specifc_query_synthesizer
2,How did the multi-layer perceptron contribute ...,[like when you when you when you're trying to ...,The multi-layer perceptron played a significan...,single_hop_specifc_query_synthesizer
3,What significant developments in AI are antici...,[let's let's you know these breakthroughs that...,"In 2024, significant developments in AI are ex...",single_hop_specifc_query_synthesizer
4,What is the significance of coconut in the con...,[hidden state of the llm as a representation o...,Coconut is significant because it represents a...,single_hop_specifc_query_synthesizer
5,How does the concept of latent space reasoning...,[<1-hop>\n\nlike that's what I thought was int...,The concept of latent space reasoning in GPT m...,multi_hop_specific_query_synthesizer
6,What are the implications of reasoning in late...,[<1-hop>\n\nokay whiz we're talking about reas...,Reasoning in latent space has significant impl...,multi_hop_specific_query_synthesizer
7,How does the recurrent depth approach in the g...,[<1-hop>\n\nright where the green shared recur...,The recurrent depth approach in the gp2 archit...,multi_hop_specific_query_synthesizer
8,How does the concept of test time compute rela...,[<1-hop>\n\nokay whiz we're talking about reas...,The concept of test time compute is closely re...,multi_hop_specific_query_synthesizer
9,What are the key differences between the conti...,[<1-hop>\n\nokay whiz we're talking about reas...,The continuous Chain of Thought (coconut) and ...,multi_hop_specific_query_synthesizer


In [50]:
from langsmith import Client

client = Client()

dataset_name = "Reasoning in Continuous Latent Space"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Reasoning in Continuous Latent Space"
)

In [51]:
# Upload each synthetic sample as a LangSmith example:
# question -> inputs, reference answer -> outputs, source chunks -> metadata.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
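The key names chosen in the upload loop matter later: the chain is invoked with a `question` key and the evaluators read the reference from `answer`. As a rough illustration of that mapping, the sketch below builds the same payload shape from plain dictionaries without calling LangSmith. The `to_example` helper and the sample row are hypothetical and exist only for this illustration.

```python
def to_example(row: dict) -> dict:
    """Map one RAGAS testset row onto the inputs/outputs/metadata layout used above."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row, shaped like one record of dataset.to_pandas()
rows = [
    {
        "user_input": "What is latent space?",
        "reference_contexts": ["example transcript chunk"],
        "reference": "An example reference answer.",
        "synthesizer_name": "example_synthesizer",
    }
]

examples = [to_example(r) for r in rows]
print(examples[0]["inputs"])
```

Columns that are not mapped, such as `synthesizer_name`, are simply dropped in this sketch, which matches what the upload loop does.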

In [52]:
rag_documents = docs

In [53]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
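To build intuition for what `chunk_size = 500` and `chunk_overlap = 50` mean, here is a simplified sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs, newlines, and spaces, so its chunks are usually shorter than the limit. The `sliding_chunks` function is a hypothetical helper that only illustrates the size and overlap arithmetic.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Hypothetical 1200-character input made of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

With these settings each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next. That overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.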

In [54]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [55]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)

In [56]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

In [57]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Question: {question}

Context: {context}

Provide a clear, insightful answer using only the provided context.
If you cannot answer from the context, say "Insufficient context to answer."
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [58]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [59]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

# The input dict's "question" is sent to the retriever (to build "context")
# and is also passed through unchanged; the prompt, LLM, and parser run in sequence.
rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
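The data flow of the chain can be sketched without any LangChain objects. In the version below, `Doc`, `fake_retriever`, `format_docs`, and `build_prompt` are all hypothetical stand-ins for illustration only. One difference worth flagging: the chain above passes the retrieved documents into the prompt as-is, whereas this sketch joins their text first, which is a common optional step rather than something the notebook does.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def fake_retriever(question: str) -> list[Doc]:
    """Hypothetical retriever that returns canned chunks for any question."""
    return [
        Doc("latent space is distinct from token space"),
        Doc("hidden states carry the reasoning"),
    ]


def format_docs(docs: list[Doc]) -> str:
    """Join chunk texts so the prompt contains plain text, not object reprs."""
    return "\n\n".join(d.page_content for d in docs)


PROMPT = "Question: {question}\n\nContext: {context}"


def build_prompt(inputs: dict) -> str:
    # Mirrors the first stage of the chain: both prompt variables
    # are derived from inputs["question"].
    question = inputs["question"]
    context = format_docs(fake_retriever(question))
    return PROMPT.format(question=question, context=context)


prompt_text = build_prompt({"question": "What is latent space?"})
print(prompt_text)
```

In the real chain the resulting prompt would then go to the LLM and the string output parser; those steps are omitted here because they need an API key.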

In [60]:
rag_chain.invoke({"question" : "What is latent space?"})

"Latent space refers to a conceptual representation used in models, particularly in machine learning and AI, where data is transformed into a mathematical space that captures its essential features. It's distinct from natural language or token space and is often described in terms of embedding space, which consists of numerical representations of data. In the context of Transformer architectures, like those used in language models, latent space is where information is processed and reasoned before being converted back into human-readable language. This involves hidden states and scaling of the model, where the representation of reasoning about data occurs. Essentially, latent space serves as an intermediary space that facilitates computation and reasoning without directly operating in the space of words or language."

## LangSmith Evals

In [None]:
eval_llm = ChatOpenAI(model="gpt-4o")

In [None]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)
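The `prepare_data` callback in the labeled helpfulness evaluator is the piece that ties the chain output to the uploaded dataset. The sketch below exercises the same mapping on hypothetical stand-in objects built with `SimpleNamespace`, so it runs without LangSmith. It assumes, as the lambda above does, that a chain returning a plain string has its result stored under the `"output"` key of the run outputs.

```python
from types import SimpleNamespace


def prepare_data(run, example) -> dict:
    """Same mapping as the lambda passed to the labeled helpfulness evaluator."""
    return {
        "prediction": run.outputs["output"],      # what the RAG chain returned
        "reference": example.outputs["answer"],   # synthetic reference answer
        "input": example.inputs["question"],      # synthetic question
    }


# Hypothetical stand-ins for a LangSmith run and dataset example
run = SimpleNamespace(outputs={"output": "Latent space is a hidden representation."})
example = SimpleNamespace(
    inputs={"question": "What is latent space?"},
    outputs={"answer": "An example reference answer."},
)

fields = prepare_data(run, example)
print(fields)
```

If the dataset keys were named differently at upload time, for instance `query` instead of `question`, this mapping would raise a `KeyError`, so the two cells need to stay in sync.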