# Getting Started with LangChain and OpenAI

## This notebook covers the following:

- Set up LangChain, LangSmith and LangServe
- Use the most basic and common components of LangChain: prompt templates, models and output parsers
- Build a simple application with LangChain
- Trace your application with LangSmith
- Serve your application using LangServe

### Document Loaders

https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/

In [8]:
### Text Loader

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../requirements.txt")
loader.load()

[Document(metadata={'source': '../requirements.txt'}, page_content='langchain\nipykernel\nlangchain-openai\npython-dotenv\nlangchain_community')]

In [None]:
### PDF Loader
from langchain_community.document_loaders import PyPDFLoader

pdfLoader = PyPDFLoader("../Astrology_of_you_me_Aquarius.pdf")
pdfLoader.load()

[Document(metadata={'source': '../Astrology_of_you_me_Aquarius.pdf', 'page': 0}, page_content='Aquarius\nBIRTHDATE JANUARY 21–FEBRUARY 19\nThe fixed air sign Aquarius rules the new age in which we now\nlive. Governed by the revolutionary planet Uranus, Aquarians\ntend to be modern, forward-looking individuals who are unusual\nand accepting of this quality in others. Often the joy and despair\nof their sweethearts, Aquarians can be maddeningly unstable and\ncool, neglecting human feelings and making enduring'),
 Document(metadata={'source': '../Astrology_of_you_me_Aquarius.pdf', 'page': 1}, page_content='relationships with them difficult. Yet their fascinating qualities and\nquick minds attract people who seem willing to overlook or forgive\ntheir wayward tendencies.'),
 Document(metadata={'source': '../Astrology_of_you_me_Aquarius.pdf', 'page': 2}, page_content='Work\nAQUARIUS\nJanuary 21–February 19\nThe Aquarius Boss\nSince Aquarians are not particularly suited to be bosses, they are

In [23]:
## Web-based loader

import bs4
from langchain_community.document_loaders import WebBaseLoader

# Parse only the main article <div> so navigation and footer text are skipped
webLoader = WebBaseLoader(
    web_paths=("https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer("div", attrs={"class": "theme-doc-markdown markdown"})),
)
webLoader.load()

[Document(metadata={'source': 'https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/'}, page_content='Document loadersinfoHead to Integrations for documentation on built-in document loader integrations with 3rd-party tools.Use document loaders to load data from a source as Document\'s. A Document is a piece of text\nand associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text\ncontents of any web page, or even for loading a transcript of a YouTube video.Document loaders provide a "load" method for loading data as documents from a configured source. They optionally\nimplement a "lazy load" as well for lazily loading data into memory.Get started\u200bThe simplest loader reads in a file as text and places it all into one document.from langchain_community.document_loaders import TextLoaderloader = TextLoader("./index.md")loader.load()API Reference:TextLoader[    Document(page_content=\'---\\nsidebar_positio

In [None]:
## Community Loaders

from langchain_community.document_loaders import ArxivLoader

arxivLoader = ArxivLoader("1605.08366")
arxivLoader.load()

[Document(metadata={'Published': '2018-01-15', 'Title': 'Hitting minors, subdivisions, and immersions in tournaments', 'Authors': 'Jean-Florent Raymond', 'Summary': "The Erd\\H{o}s-P\\'osa property relates parameters of covering and packing of\ncombinatorial structures and has been mostly studied in the setting of\nundirected graphs. In this note, we use results of Chudnovsky, Fradkin, Kim,\nand Seymour to show that, for every directed graph $H$ (resp.\nstrongly-connected directed graph $H$), the class of directed graphs that\ncontain $H$ as a strong minor (resp. butterfly minor, topological minor) has\nthe vertex-Erd\\H{o}s-P\\'osa property in the class of tournaments. We also prove\nthat if $H$ is a strongly-connected directed graph, the class of directed\ngraphs containing $H$ as an immersion has the edge-Erd\\H{o}s-P\\'osa property in\nthe class of tournaments."}, page_content='arXiv:1605.08366v5  [cs.DM]  15 Jan 2018\nDiscrete Mathematics and Theoretical Computer Science\nDMTCS vo

### Text Splitting Techniques

- Recursive Character Text Splitter
- Character Text Splitter
- HTML Header Text Splitter
- Recursive JSON Splitter

In [32]:
# Recursive Character Text Splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import ArxivLoader

arxivLoader = ArxivLoader("1605.08366")
docs = arxivLoader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)
split_documents[0], split_documents[1]
# create_documents: use when you have raw text (a list of strings)
# split_documents: use when you have a list of LangChain Document objects

(Document(metadata={'Published': '2018-01-15', 'Title': 'Hitting minors, subdivisions, and immersions in tournaments', 'Authors': 'Jean-Florent Raymond', 'Summary': "The Erd\\H{o}s-P\\'osa property relates parameters of covering and packing of\ncombinatorial structures and has been mostly studied in the setting of\nundirected graphs. In this note, we use results of Chudnovsky, Fradkin, Kim,\nand Seymour to show that, for every directed graph $H$ (resp.\nstrongly-connected directed graph $H$), the class of directed graphs that\ncontain $H$ as a strong minor (resp. butterfly minor, topological minor) has\nthe vertex-Erd\\H{o}s-P\\'osa property in the class of tournaments. We also prove\nthat if $H$ is a strongly-connected directed graph, the class of directed\ngraphs containing $H$ as an immersion has the edge-Erd\\H{o}s-P\\'osa property in\nthe class of tournaments."}, page_content='arXiv:1605.08366v5  [cs.DM]  15 Jan 2018\nDiscrete Mathematics and Theoretical Computer Science\nDMTCS vo

In [33]:
# Character Text Splitter

from langchain_text_splitters import CharacterTextSplitter

char_text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
char_split_docs = char_text_splitter.split_documents(docs)
char_split_docs

[Document(metadata={'Published': '2018-01-15', 'Title': 'Hitting minors, subdivisions, and immersions in tournaments', 'Authors': 'Jean-Florent Raymond', 'Summary': "The Erd\\H{o}s-P\\'osa property relates parameters of covering and packing of\ncombinatorial structures and has been mostly studied in the setting of\nundirected graphs. In this note, we use results of Chudnovsky, Fradkin, Kim,\nand Seymour to show that, for every directed graph $H$ (resp.\nstrongly-connected directed graph $H$), the class of directed graphs that\ncontain $H$ as a strong minor (resp. butterfly minor, topological minor) has\nthe vertex-Erd\\H{o}s-P\\'osa property in the class of tournaments. We also prove\nthat if $H$ is a strongly-connected directed graph, the class of directed\ngraphs containing $H$ as an immersion has the edge-Erd\\H{o}s-P\\'osa property in\nthe class of tournaments."}, page_content='arXiv:1605.08366v5  [cs.DM]  15 Jan 2018\nDiscrete Mathematics and Theoretical Computer Science\nDMTCS vo

### RecursiveCharacterTextSplitter vs CharacterTextSplitter
https://python.langchain.com/docs/how_to/recursive_text_splitter/
https://python.langchain.com/docs/concepts/text_splitters/

Both splitters belong to the text-structure-based splitting techniques.

Text is naturally organized into hierarchical units such as paragraphs, sentences, and words. We can leverage this inherent structure to inform our splitting strategy, creating splits that maintain natural language flow, preserve semantic coherence within each split, and adapt to varying levels of text granularity. LangChain's RecursiveCharacterTextSplitter implements this concept:

- The RecursiveCharacterTextSplitter attempts to keep larger units (e.g., paragraphs) intact.
- If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
- This process continues down to the word level if necessary.

CharacterTextSplitter, by contrast, splits on a single separator (the "\n\n" passed in the cell above) and does not fall back to finer ones, so a chunk can exceed the chunk size when the separator is absent.
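The fallback from paragraphs to finer units can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` and the separator order are assumptions made for the example, chunk overlap is ignored, and separators are dropped where a split happens.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),  # paragraphs, lines, words
) -> List[str]:
    """Hypothetical helper: split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing finer left to try: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small units into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This unit is too large on its own: move to the next, finer level.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is much longer and will not fit in one chunk."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

With a 40-character limit, the first paragraph stays whole while the second is broken at word boundaries, which mirrors the behaviour described above.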



### Document-Structure Based

Some documents have an inherent structure, such as HTML, Markdown, or JSON files. In these cases, it's beneficial to split the document based on its structure, as it often naturally groups semantically related text. Key benefits of structure-based splitting:

- Preserves the logical organization of the document
- Maintains context within each chunk
- Can be more effective for downstream tasks like retrieval or summarization

Examples of structure-based splitting:

- Markdown: Split based on headers (e.g., #, ##, ###)
- HTML: Split using tags
- JSON: Split by object or array elements
- Code: Split by functions, classes, or logical blocks
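To make the Markdown case concrete, here is a minimal plain-Python sketch of header-based splitting. It is an illustration under stated assumptions, not the code behind LangChain's MarkdownHeaderTextSplitter: the function name `split_markdown_by_headers` and the `h1`/`h2` metadata keys are made up for the example, and lines starting with `#` inside code blocks are not handled.

```python
import re
from typing import Dict, List, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(markdown: str) -> List[Tuple[Dict[str, str], str]]:
    """Hypothetical helper: return (metadata, text) pairs, one per section."""
    sections: List[Tuple[Dict[str, str], str]] = []
    active: Dict[int, str] = {}  # header level -> title currently in scope
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected so far, tagged with the headers in scope.
        text = "\n".join(body).strip()
        if text:
            metadata = {f"h{level}": title for level, title in sorted(active.items())}
            sections.append((metadata, text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            for closed in [lvl for lvl in active if lvl >= level]:
                del active[closed]
            active[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = """# Loaders
Loaders read files.
## PDF
PDF loaders return one document per page.
# Splitters
Splitters cut documents into chunks."""

for metadata, text in split_markdown_by_headers(doc):
    print(metadata, "->", text)
```

Each chunk carries the headers it sits under as metadata, which is how structure-based splitting keeps context attached to every chunk.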

In [6]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_header_split = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "header 1"), ( "h2", "header 2")])
text = """<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>"""

docs = html_header_split.split_text(text)
docs_lg = html_header_split.split_text_from_url("https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/")
print(docs, docs_lg)

[Document(metadata={'header 1': 'My First Heading'}, page_content='My first paragraph.')] [Document(metadata={}, page_content='Skip to main content  \nThis is documentation for LangChain v0.1, which is no longer actively maintained. Check out the docs for the latest version here.  \nComponentsIntegrationsGuidesAPI Reference  \nMore  \nPeopleVersioningContributingTemplatesCookbooksTutorialsYouTube  \n💬  \nv0.1  \nLatestv0.2v0.1  \n🦜️🔗  \nLangSmithLangSmith DocsLangServe GitHubTemplates GitHubTemplates HubLangChain HubJS/TS Docs  \nSearch  \nComponents  \nModel I/O  \nPrompts  \nChat models  \nLLMs  \nOutput parsers  \nRetrieval  \nVector storesIndexing  \nDocument loaders  \nDocument loadersCustom Document LoaderCSVFile DirectoryHTMLJSONMarkdownMicrosoft OfficePDF  \nText splitters  \nEmbedding models  \nRetrievers  \nComposition  \nChains  \nTools  \nAgents  \nMore  \nThis is documentation for LangChain v0.1, which is no longer actively maintained.  \nFor the current stable version, se

### How to Split JSON data

In [11]:
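import requests
from langchain_text_splitters import RecursiveJsonSplitter

response = requests.get("https://jsonplaceholder.typicode.com/comments")
json_response = response.json()
json_splitter = RecursiveJsonSplitter(max_chunk_size=100)
# split_json returns the chunks as dicts; create_documents wraps them in Document objects.
# Lists are kept whole by default, so chunks can exceed max_chunk_size;
# pass convert_lists=True to either method to split inside the list as well.
json_chunks_lg = json_splitter.split_json({"data": json_response})
json_docs_lg = json_splitter.create_documents([{"data": json_response}])
print(json_docs_lg[:3])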

[Document(metadata={}, page_content='{"data": [{"postId": 1, "id": 1, "name": "id labore ex et quam laborum", "email": "Eliseo@gardner.biz", "body": "laudantium enim quasi est quidem magnam voluptate ipsam eos\\ntempora quo necessitatibus\\ndolor quam autem quasi\\nreiciendis et nam sapiente accusantium"}, {"postId": 1, "id": 2, "name": "quo vero reiciendis velit similique earum", "email": "Jayne_Kuhic@sydney.com", "body": "est natus enim nihil est dolore omnis voluptatem numquam\\net omnis occaecati quod ullam at\\nvoluptatem error expedita pariatur\\nnihil sint nostrum voluptatem reiciendis et"}, {"postId": 1, "id": 3, "name": "odio adipisci rerum aut animi", "email": "Nikita@garfield.biz", "body": "quia molestiae reprehenderit quasi aspernatur\\naut expedita occaecati aliquam eveniet laudantium\\nomnis quibusdam delectus saepe quia accusamus maiores nam est\\ncum et ducimus et vero voluptates excepturi deleniti ratione"}, {"postId": 1, "id": 4, "name": "alias odio sit", "email": "Le

In [12]:
import os
from dotenv import load_dotenv
load_dotenv()


True

In [13]:
from langchain_openai import OpenAIEmbeddings
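# load_dotenv() above already loaded OPENAI_API_KEY from .env; fail early with a clear message if it is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")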

open_ai_embedding_object = OpenAIEmbeddings(model="text-embedding-3-large")
open_ai_embedding_object

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x1073419f0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x1074ddae0>, model='text-embedding-3-large', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [16]:
docs = open_ai_embedding_object.embed_documents(["This is a practice code worksheet on Langchain and Generative AI"])

In [17]:
docs

[[-0.009898383170366287,
  0.011213991791009903,
  -0.017666734755039215,
  0.017807694151997566,
  -0.009350213222205639,
  -0.028426527976989746,
  0.006221729330718517,
  0.013688587583601475,
  0.0022455391008406878,
  0.052874911576509476,
  -0.0004133299516979605,
  0.019499188289046288,
  -0.00556784076616168,
  -0.020110007375478745,
  0.014706617221236229,
  0.011378442868590355,
  -0.003081498434767127,
  0.014048813842236996,
  -0.016382452100515366,
  -0.053407419472932816,
  -0.0037353867664933205,
  -0.010665821842849255,
  -0.05572539195418358,
  0.026782019063830376,
  -0.014565659686923027,
  0.0029366249218583107,
  0.003429977921769023,
  0.005877165123820305,
  -0.007596069481223822,
  -0.005070572253316641,
  0.03254563361406326,
  -0.013062107376754284,
  -0.006440997123718262,
  -0.025262804701924324,
  -0.004330542869865894,
  -0.030321631580591202,
  0.04595230519771576,
  0.0463908426463604,
  -0.04921000078320503,
  0.015192138962447643,
  0.00398597866296768