# LangChain RAG, using Cosmology Data, Parts 10 - 11 - Overview

The idea is to replicate the LangChain RAG template for our RAG application.
This is the third notebook, based on: https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_10_and_11.ipynb

### Imports and API Keys

In [2]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'

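# Assigning each key to itself changes nothing, but it raises a KeyError
# right away if the key is missing from the environment.
os.environ['LANGCHAIN_API_KEY'] = os.environ['LANGCHAIN_API_KEY']
os.environ['OPENAI_API_KEY'] = os.environ['OPENAI_API_KEY']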

In [19]:
import datetime
from typing import Literal, Optional, Tuple


from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.utils.math import cosine_similarity
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_community.document_loaders import YoutubeLoader


In [5]:
import warnings
warnings.filterwarnings("ignore")

## Logical and Semantic Routing
Use function calling for classification.

https://python.langchain.com/docs/use_cases/query_analysis/techniques/routing#routing-to-multiple-indexes

### Logical Routing

In [6]:
# Data model
class RouteQuery(BaseModel):
    """Route a user query to the most relevant datasource."""

    datasource: Literal["python_docs", "cosmology_docs", "history_docs"] = Field(
        ...,
        description="Given a user question choose which datasource would be most relevant for answering their question",
    )

# LLM with function call 
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = llm.with_structured_output(RouteQuery)

# Prompt 
system = """You are an expert at routing a user question to the appropriate data source.

Based on the topic the question is referring to, route it to the relevant data source."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)

# Define router 
router = prompt | structured_llm

In [7]:

question = """What is a galaxy cluster?"""

result = router.invoke({"question": question})

In [9]:
result, result.datasource

(RouteQuery(datasource='cosmology_docs'), 'cosmology_docs')

Now we can define a branch that uses result.datasource

https://python.langchain.com/docs/expression_language/how_to/routing

In [11]:
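def choose_route(result):
    """Pick a branch from the datasource chosen by the router."""
    if "python_docs" in result.datasource.lower():
        # Placeholder: invoke the python_docs chain here
        return "chain for python_docs"
    elif "cosmology_docs" in result.datasource.lower():
        # Placeholder: invoke the cosmology_docs chain here
        return "chain for cosmology_docs"
    else:
        # Placeholder: invoke the history_docs chain here (fallback branch)
        return "history_docs"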


full_chain = router | RunnableLambda(choose_route)

full_chain.invoke({"question": question})

'chain for cosmology_docs'

In [12]:
# Let's try another one

question = """What was the first Nobel Prize in Physics awarded for and when?"""

result = router.invoke({"question": question})

result.datasource

'history_docs'

In [13]:

full_chain = router | RunnableLambda(choose_route)

full_chain.invoke({"question": question})

'history_docs'
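
The placeholder branches in choose_route can be filled in by mapping each datasource name to its own chain. The following is a minimal sketch under stated assumptions, not the notebook's implementation: `FakeRoute` is a hypothetical stand-in for the RouteQuery object returned by the router, and the three handler functions are hypothetical placeholders for real retrieval chains. No LLM call is made.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FakeRoute:
    """Hypothetical stand-in for the RouteQuery returned by the router."""
    datasource: str


# Hypothetical handlers; in a real application each would be a retrieval chain.
def python_chain(question: str) -> str:
    return f"[python_docs] {question}"


def cosmology_chain(question: str) -> str:
    return f"[cosmology_docs] {question}"


def history_chain(question: str) -> str:
    return f"[history_docs] {question}"


# One entry per value allowed by the Literal type in RouteQuery.
ROUTES: Dict[str, Callable[[str], str]] = {
    "python_docs": python_chain,
    "cosmology_docs": cosmology_chain,
    "history_docs": history_chain,
}


def dispatch(route: FakeRoute, question: str) -> str:
    """Look up the handler for the chosen datasource and run it."""
    handler = ROUTES.get(route.datasource.lower())
    if handler is None:
        # Fail loudly instead of silently falling through to a default branch.
        raise ValueError(f"Unknown datasource: {route.datasource!r}")
    return handler(question)


print(dispatch(FakeRoute("cosmology_docs"), "What is a galaxy cluster?"))
```

A dictionary lookup keeps the routing table in one place, and an unknown datasource raises an error rather than being sent to whichever branch happens to be last.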

### Semantic Routing

In [15]:
# Two prompts
cosmology_template = """You are a very smart Cosmology professor. \
You are great at answering questions about Cosmology in a concise and easy to understand manner. \
When you don't know the answer to a question you admit that you don't know.

Here is a question:
{query}"""

math_template = """You are a very good mathematician. You are great at answering math questions. \
You are so good because you are able to break down hard problems into their component parts, \
answer the component parts, and then put them together to answer the broader question.

Here is a question:
{query}"""

# Embed prompts
embeddings = OpenAIEmbeddings()
prompt_templates = [cosmology_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)

# Route question to prompt 
def prompt_router(input):
    # Embed question
    query_embedding = embeddings.embed_query(input["query"])
    # Compute similarity
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    most_similar = prompt_templates[similarity.argmax()]
    # Report the chosen prompt ("PHYSICS" here means the cosmology template)
    print("Using MATH" if most_similar == math_template else "Using PHYSICS")
    return PromptTemplate.from_template(most_similar)


chain = (
    {"query": RunnablePassthrough()}
    | RunnableLambda(prompt_router)
    | ChatOpenAI()
    | StrOutputParser()
)

print(chain.invoke("What is a Galaxy Cluster?"))

Using PHYSICS
A galaxy cluster is a large group of galaxies that are bound together by gravity. These clusters can contain anywhere from a few dozen to thousands of galaxies, as well as dark matter, hot gas, and dust. Galaxy clusters are some of the largest structures in the universe, and studying them can help us understand the formation and evolution of galaxies.
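
The selection step inside prompt_router is a cosine-similarity comparison followed by an argmax. The sketch below reproduces that step in plain Python so it can run without an API key. The 3-dimensional vectors are invented for illustration only; real embeddings from OpenAIEmbeddings have far more dimensions, and the helper names `cosine` and `pick_route` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_route(query_vec: Sequence[float], route_vecs: List[Sequence[float]]) -> int:
    """Return the index of the route vector most similar to the query."""
    sims = [cosine(query_vec, vec) for vec in route_vecs]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy "embeddings", made up for this example (not real model output).
route_names = ["cosmology", "math"]
route_vecs = [[0.9, 0.1, 0.2], [0.1, 0.9, 0.3]]
query_vec = [0.8, 0.2, 0.1]

print(route_names[pick_route(query_vec, route_vecs)])
```

With these toy vectors the query points mostly along the first axis, so the cosmology route is selected.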


## Query structuring for Metadata filters

Many vectorstores contain metadata fields, so we can filter for specific chunks based on metadata.

Let's look at some example metadata we might see in a database of YouTube transcripts.

https://python.langchain.com/docs/use_cases/query_analysis/techniques/structuring

The example video is Lecture 1 of the 2013 Cosmology series at Stanford University, given by Leonard Susskind (https://en.wikipedia.org/wiki/Leonard_Susskind), an absolute legend of modern physics.

In [17]:
docs = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=P-medYaqVak&t=91s&ab_channel=Stanford", add_video_info=True
).load()

docs[0].metadata

{'source': 'P-medYaqVak',
 'title': 'Cosmology Lecture 1',
 'description': 'Unknown',
 'view_count': 1150943,
 'thumbnail_url': 'https://i.ytimg.com/vi/P-medYaqVak/hq720.jpg',
 'publish_date': '2013-01-28 00:00:00',
 'length': 5746,
 'author': 'Stanford'}

Suppose we’ve built an index that:
* Allows unstructured search over the contents and title of each document
* Supports range filtering on view count, publication date, and length

We want to convert natural-language questions into structured search queries.
To do that, we can define a schema for structured search queries.

In [20]:
class TutorialSearch(BaseModel):
    """Search over a database of lecture videos about Cosmology"""

    content_search: str = Field(
        ...,
        description="Similarity search query applied to video transcripts.",
    )
    title_search: str = Field(
        ...,
        description=(
            "Alternate version of the content search query to apply to video titles. "
            "Should be succinct and only include key words that could be in a video "
            "title."
        ),
    )
    min_view_count: Optional[int] = Field(
        None,
        description="Minimum view count filter, inclusive. Only use if explicitly specified.",
    )
    max_view_count: Optional[int] = Field(
        None,
        description="Maximum view count filter, exclusive. Only use if explicitly specified.",
    )
    earliest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Earliest publish date filter, inclusive. Only use if explicitly specified.",
    )
    latest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Latest publish date filter, exclusive. Only use if explicitly specified.",
    )
    min_length_sec: Optional[int] = Field(
        None,
        description="Minimum video length in seconds, inclusive. Only use if explicitly specified.",
    )
    max_length_sec: Optional[int] = Field(
        None,
        description="Maximum video length in seconds, exclusive. Only use if explicitly specified.",
    )

    def pretty_print(self) -> None:
        for field in self.__fields__:
            if getattr(self, field) is not None and getattr(self, field) != getattr(
                self.__fields__[field], "default", None
            ):
                print(f"{field}: {getattr(self, field)}")

Now we can prompt the LLM to produce queries.

In [21]:
system = """You are an expert at converting user questions into database queries. \
You have access to a database of lecture videos about Cosmology, given by Leonard Susskind. \
Given a question, return a database query optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = llm.with_structured_output(TutorialSearch)
query_analyzer = prompt | structured_llm

The part below is just a simple example. It still has to be tried against an actual database of YouTube video transcripts (using this same lecture series).

In [27]:
query_analyzer.invoke({"question": "What is the Hubble constant?"}).pretty_print() 

content_search: Hubble constant
title_search: Hubble constant
min_view_count: 10000


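Open question: why does the output above include min_view_count? The question never mentions view counts, and the field description says to use the filter only if explicitly specified.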

In [28]:
query_analyzer.invoke(
    {"question": "videos on cosmology published in 2013"}
).pretty_print()

content_search: cosmology
title_search: 2013
earliest_publish_date: 2013-01-01
latest_publish_date: 2014-01-01


In [30]:
query_analyzer.invoke(
    {
        "question": "videos on hubble's law, under 5 minutes in length, published after 2015"
    }
).pretty_print()

content_search: Hubble's Law
title_search: Hubble's Law
earliest_publish_date: 2016-01-01
min_length_sec: 0
max_length_sec: 300


Further development, connecting to vectorstores:
https://python.langchain.com/docs/modules/data_connection/retrievers/self_query#constructing-from-scratch-with-lcel
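
Before connecting the query analyzer to a real vectorstore, the range fields can be checked against plain metadata dictionaries. The sketch below is an assumption-laden illustration rather than the self-query retriever from the link above: `Filters` is a hypothetical dataclass mirroring only the range fields of TutorialSearch, and `in_range` and `matches` are hypothetical helpers. It follows the field descriptions, where lower bounds are inclusive and upper bounds are exclusive.

```python
import datetime
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Filters:
    """Hypothetical mirror of the range fields in TutorialSearch."""
    min_view_count: Optional[int] = None
    max_view_count: Optional[int] = None
    earliest_publish_date: Optional[datetime.date] = None
    latest_publish_date: Optional[datetime.date] = None
    min_length_sec: Optional[int] = None
    max_length_sec: Optional[int] = None


def in_range(value: Any, low: Any, high: Any) -> bool:
    """Half-open check: low is inclusive, high is exclusive, None is unbounded."""
    if low is not None and value < low:
        return False
    if high is not None and value >= high:
        return False
    return True


def matches(meta: Dict[str, Any], f: Filters) -> bool:
    """Return True if one document's metadata satisfies every range filter."""
    # publish_date is stored as a string such as '2013-01-28 00:00:00'.
    published = datetime.datetime.strptime(
        meta["publish_date"], "%Y-%m-%d %H:%M:%S"
    ).date()
    return (
        in_range(meta["view_count"], f.min_view_count, f.max_view_count)
        and in_range(published, f.earliest_publish_date, f.latest_publish_date)
        and in_range(meta["length"], f.min_length_sec, f.max_length_sec)
    )


# Metadata values copied from the lecture loaded earlier in this notebook.
lecture = {
    "title": "Cosmology Lecture 1",
    "view_count": 1150943,
    "publish_date": "2013-01-28 00:00:00",
    "length": 5746,
}

# Filters equivalent to two of the structured queries shown above.
published_2013 = Filters(
    earliest_publish_date=datetime.date(2013, 1, 1),
    latest_publish_date=datetime.date(2014, 1, 1),
)
under_five_minutes = Filters(min_length_sec=0, max_length_sec=300)

print(matches(lecture, published_2013))      # True
print(matches(lecture, under_five_minutes))  # False
```

The lecture was published on 2013-01-28, so it passes the 2013 filter, while its length of 5746 seconds fails the five-minute limit.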