# LangChain
LangChain is a powerful framework designed to streamline and enhance the development of applications using Large Language Models (LLMs).

# Overview
This notebook will introduce you to some of the core concepts and components of LangChain, including:
- **LLM Wrappers**: Learn how to use LangChain to interact with different large language models seamlessly.
- **Prompt Templates**: Discover how to create and manage reusable templates for generating consistent and effective prompts.
- **Chaining**: Understand how to chain multiple LLM calls together to build more complex and capable systems.
- **Structuring**: Format LLM outputs into a predefined structure so they can be processed automatically.
- **Text Splitting**: Explore techniques for splitting text into manageable chunks for processing.
- **Embedding**: Get insights into generating and utilizing embeddings for various natural language processing tasks.
- **Retrieval**: Dive into methods for retrieving relevant information from large text corpora using embeddings and other techniques.

# Package installs


In [None]:
# Install the required packages. This may take a few minutes; dependency warnings can usually be ignored.
%pip install openai
%pip install langchain
%pip install langchain-openai
%pip install langchain-community
%pip install pypdf
%pip install tiktoken
%pip install chromadb

# Key setup

Let's set up the OpenAI API key. Replace the `API_KEY` placeholder below with your own key.

In [None]:
import os

openai_api_key = 'API_KEY'

os.environ["OPENAI_API_KEY"] = openai_api_key
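
Hard-coding a key in a notebook makes it easy to share it by accident. As an alternative, here is a minimal sketch that reads the key from an environment variable and fails early if it is missing. The helper name `load_api_key` and the `DEMO_KEY` variable are hypothetical and only used for illustration.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in `env_var`, or raise if it is unset or still the placeholder."""
    key = os.environ.get(env_var, "").strip()
    if not key or key == "API_KEY":
        raise RuntimeError(f"Set the {env_var} environment variable to a real key first.")
    return key


# Demonstration with a made-up variable and value (not a real key)
os.environ["DEMO_KEY"] = "sk-demo"
demo_key = load_api_key("DEMO_KEY")
print(f"Loaded a key of length {len(demo_key)}")
```

In a real session you would export `OPENAI_API_KEY` in your shell before starting the notebook and call `load_api_key()` without arguments.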

# LLM Wrappers

We've seen in the API hands-on how to use the OpenAI API to call GPT models.
Here we see another way to use LLMs, this time through LangChain.
LangChain does all the heavy lifting for us in the backend.

In [None]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=20)
messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
output_message = model.invoke(messages)
print(output_message)

Instead of tuples of the form `("sender", "message")`, we can use LangChain message objects.

In [None]:
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

messages = [
    SystemMessage(content="YOUR SYSTEM PROMPT HERE"),
    HumanMessage(content="YOUR USER PROMPT HERE")
]

response = model.invoke(messages).content

# Prompt Templates

This already works well, but it doesn't scale.
Every time we want to ask a question, we have to rewrite the prompt.

This is where prompt templates come in.
Assume we have the following system message:

`You are an expert on machine learning and data science and your goal is to help students learn about different topics. Explain the topic they ask about like they were 10 years old. Assume they have no knowledge of that subject. Make sure to answer the question in French`

What if, for a different use case, you want the LLM to answer in German, English, or any other language? Or what if you want it to be an expert in a different field? It would be easier not to write a new prompt each time.
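
To make the idea concrete, here is a minimal plain-Python sketch of a reusable template built with `str.format`. It is not LangChain code; the names `SYSTEM_TEMPLATE` and `build_system_prompt` are made up for this illustration, and the template text is a shortened version of the system message above.

```python
# Plain-Python illustration of a reusable prompt template (not LangChain code).
SYSTEM_TEMPLATE = (
    "You are an expert on {field} and your goal is to help students learn "
    "about different topics. Explain the topic they ask about like they were "
    "10 years old. Make sure to answer the question in {language}."
)


def build_system_prompt(field: str, language: str) -> str:
    """Fill the placeholders to get a concrete system prompt."""
    return SYSTEM_TEMPLATE.format(field=field, language=language)


# The same template now serves two different use cases.
french_prompt = build_system_prompt("machine learning and data science", "French")
german_prompt = build_system_prompt("economics", "German")
print(french_prompt)
print(german_prompt)
```

Only the values change between use cases; the wording of the prompt is written once.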



In [None]:
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)


prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate.from_template("You are an expert on {field} and your goal is to help students learn about different topics. Explain the topic they ask about like they were 10 years old. Assume they have no knowledge of that subject. Make sure to answer the question in {language}"),
        HumanMessagePromptTemplate.from_template("{input}"),
    ]
)

prompt.invoke({
    "field": "data science and machine learning",
    "language": "english",
    "input": "Large Language Models"
})

# Chaining

Now that we have dynamically changing prompts, we want to pass them to our model. We could invoke the prompt template to get the prompt and then invoke the model to get the response.
Luckily, LangChain can do all of this work for us with chains.

Chains are constructed by placing `|` between the parts of the chain. For example, if the output of some LLM model `A` needs to go into another model `B` (or the same one), we can build the chain `A | B`.
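
As a rough intuition for how `A | B` can work, the sketch below overloads Python's `|` operator on a small wrapper class. This is only an illustration of the composition idea, not LangChain's actual implementation; the `Step` class and the `fake_model` stand-in are hypothetical.

```python
class Step:
    """Wraps a function so steps can be composed with `|` (hypothetical helper)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # A | B returns a new step that feeds A's output into B.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


to_prompt = Step(lambda topic: f"Explain {topic} simply.")
fake_model = Step(lambda prompt: prompt.upper())  # stand-in for an LLM call

chain = to_prompt | fake_model
result = chain.invoke("embeddings")
print(result)
```

Calling `invoke` on the combined chain runs each step in order, which mirrors how the chains in this notebook are used.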

In [None]:
basic_chain = prompt | model

# now to invoke the entire chain
# have fun making up some cool prompts
chain_output = basic_chain.invoke({
    "field": "ENTER SOME FIELD (computer science)",
    "language": "ENTER SOME LANGUAGE (german)",
    "input": "ENTER A TOPIC"
})
print(chain_output.content)

How about we also stop having to access `chain_output.content`?

Let's add another node to the chain and use an output parser.

In [None]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

parser_chain = basic_chain | parser # you can also recreate this from scratch using prompt | model | parser ...

In [None]:
chain_output = parser_chain.invoke({
    "field": "ENTER SOME FIELD (computer science)",
    "language": "ENTER SOME LANGUAGE (german)",
    "input": "ENTER A TOPIC"
})
print(chain_output)

# Structuring
Now that we have a nice chain with text output, it would be even nicer to structure the output into a format that lets us automate its use.

For example, assume hundreds of people used the previous chain and you are tasked with analyzing the output. It would be easier if you could create a CSV file that contains the field, language, and input, among other information.

Since you know the prompt template, you can extract this information in code. But what if you need extra information?

LangChain can provide that as well.


In [None]:
# Let's start by creating a class that describes the expected structure

from langchain_core.pydantic_v1 import BaseModel, Field
from typing import Optional

class ExplanationStructure(BaseModel):
    """Structure of the explanation"""
    field: str = Field(description="Field of expertise required to understand the topic")
    language: str = Field(description="Language of the text")
    text: str = Field(description="The previous entire text")
    # optional field: defaults to None when the model does not provide it
    required_knowledge: Optional[str] = Field(default=None, description="Knowledge required to understand the answer")
    extra_knowledge: str = Field(description="Extra information for the user to read, and related topics")
    # + any other outputs you expect from the LLM

# create the new model that will structure our outputs for us
structured_model = model.with_structured_output(ExplanationStructure)   # can re-use the model defined above or create a new one

In [None]:
# we could use this model by itself with structured_model.invoke("What can you tell me about LLMs?")
# but it's more interesting to see it as part of our chain. Let's add it as a node to our parser chain

structuring_chain = parser_chain | structured_model

# then we can run it as we did before
chain_output = structuring_chain.invoke({
    "field": "Economics",
    "language": "English",
    "input": "Business Model"
})
print(chain_output)
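
Once the outputs are structured, collecting them into a CSV file for analysis takes only a few lines. The sketch below uses a standard-library dataclass as a stand-in for the structured output and writes to an in-memory buffer; the `Explanation` class and the sample records are invented for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Explanation:
    """Stand-in for the structured output; field names are illustrative."""
    field: str
    language: str
    text: str


# Made-up records standing in for many chain outputs
records = [
    Explanation("Economics", "English", "A business model is how a company earns money."),
    Explanation("Computer science", "German", "Ein Algorithmus ist eine Schritt-für-Schritt-Anleitung."),
]

# Swap the buffer for open("explanations.csv", "w", newline="") to write a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Explanation)])
writer.writeheader()
for record in records:
    writer.writerow(asdict(record))

csv_text = buffer.getvalue()
print(csv_text)
```

With the Pydantic model from this section, the same pattern should apply after converting each output to a dictionary.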

# Text Splitting
Sometimes the input or output text is too large, and to work with it you need to split it into smaller chunks.
LangChain helps us with this too.

In [None]:
long_text = """Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as following:

Split the text up into small, semantically meaningful chunks (often sentences).
Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
That means there are two different axes along which you can customize your text splitter:

How the text is split
How the chunk size is measured
For specifics on how to use text splitters, see the relevant how-to guides here."""

Sometimes the text is part of a PDF or a text file. We can read a PDF using the following code snippet.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

pdf_path = "PATH_TO_PDF.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

# once the PDF loads, it's divided into pages. We loop over the first two pages and join their text into one string
long_text = ""
for page in documents[:2]:
    long_text += page.page_content + "\n\n"

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)

text_chunks = text_splitter.create_documents([long_text])

print(text_chunks)
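
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified plain-Python sketch that cuts a string into fixed-size character windows. It is an assumption-laden illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and newlines, which this sketch ignores. The function name `split_with_overlap` is made up.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, which is how overlap keeps some context between neighboring chunks.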

# Embeddings and Vector Spaces

Embeddings are vector representations of text: a list of numbers for every word, sentence, or document.

This is useful because similar pieces of text end up close together in the vector space. For example, we can expect all colors to be close together and all animals to be close together. Fish and aquarium can also be close because they relate to one another.
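
"Close" is commonly measured with cosine similarity. The sketch below computes it for three hand-made 3-dimensional vectors; the numbers are invented purely for illustration and are not real embeddings, which typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example
toy_embeddings = {
    "fish": [0.9, 0.8, 0.1],
    "aquarium": [0.8, 0.9, 0.2],
    "red": [0.1, 0.2, 0.9],
}

fish_aquarium = cosine_similarity(toy_embeddings["fish"], toy_embeddings["aquarium"])
fish_red = cosine_similarity(toy_embeddings["fish"], toy_embeddings["red"])
print(f"fish vs aquarium: {fish_aquarium:.2f}")
print(f"fish vs red:      {fish_red:.2f}")
```

In this toy setup the related pair scores much higher than the unrelated pair, which is the property retrieval relies on.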


We can use LangChain to embed large documents into a vector space.
This allows us to find the information most relevant to our prompts.


In [None]:
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")

# we can then use the embedding on a text query
vector = embedding_model.embed_query("This is an embedding")
print(vector)

# VectorStore

Now that we have embeddings for our text, we need to store them somewhere. LangChain already offers integrations with multiple vector databases.

For this exercise we will be using Chroma.
Chroma is an embedded database that runs locally inside your Python process. Because we pass a `persist_directory`, the data is also stored on disk on your machine.

In [None]:
# let's create a fresh directory to store our data
import shutil

vectorstore_path = "chroma/"
shutil.rmtree(vectorstore_path, ignore_errors=True)  # remove old database files, if any
os.makedirs(vectorstore_path)


from langchain_community.vectorstores import Chroma

vectordb = Chroma.from_documents(
    documents=text_chunks,
    embedding=embedding_model,
    persist_directory=vectorstore_path
)

# Retrieval

Now that our embeddings are saved in a local database, we would like to retrieve the chunks whose text is most similar to a query.
This is useful for QA systems that use retrieval-augmented generation (RAG).
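
Conceptually, a similarity search embeds the query, scores every stored chunk against it, and returns the top `k`. The sketch below shows that loop with a toy word-count "embedding" in place of a real embedding model; the functions `embed`, `similarity`, and `top_k` and the sample documents are all made up for illustration, and a real vector store uses far more efficient indexing.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, documents, k=2):
    """Return the k documents most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: similarity(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


documents = [
    "text splitters split long text into chunks",
    "embeddings map text to vectors",
    "bananas are yellow",
]
hits = top_k("how does text splitting split text", documents, k=2)
print(hits)
```

The unrelated sentence shares no words with the query, so it is never returned.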



In [None]:
question = "How does text splitting work?"
docs = vectordb.similarity_search(question, k=3)
print(docs)

Now let's see how LangChain combines retrieval and answering for us.

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    model,
    retriever=vectordb.as_retriever()
)

result = qa_chain.invoke({"query": question})  # calling the chain directly is deprecated; use invoke
result["result"]

# Agents

Finally, we mentioned that LangChain includes agents that can run code and carry out complex tasks.

Let's take a look at a basic LangChain building block that asks the model to turn your question into a math expression, which is then evaluated numerically in Python.

In [None]:
# LLMMathChain needs the numexpr package; run `%pip install numexpr` first if it is missing
from langchain.chains import LLMMathChain

llm_math = LLMMathChain.from_llm(model)

In [None]:
# LLMMathChain expects its input under the "question" key
math_input = {"question": "What is the square root of 5 multiplied by pi?"}
output_message = llm_math.invoke(math_input)
output_message
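
To sanity-check the answer, you can compute the value directly with Python's `math` module. Note that the question is ambiguous in plain English, so this sketch evaluates both readings; which one the model picks is not guaranteed.

```python
import math

# Reading 1: (square root of 5) multiplied by pi
sqrt_then_multiply = math.sqrt(5) * math.pi

# Reading 2: square root of (5 multiplied by pi)
multiply_then_sqrt = math.sqrt(5 * math.pi)

print(f"sqrt(5) * pi = {sqrt_then_multiply:.4f}")
print(f"sqrt(5 * pi) = {multiply_then_sqrt:.4f}")
```

If the chain's answer matches neither value, rephrasing the question with explicit parentheses is a reasonable first step.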

There is much more you can do with agents.
Feel free to read [this article on function calling](https://python.langchain.com/v0.2/docs/how_to/function_calling//) to learn more.