# Getting Started Notebook

- Get set up with LangChain
- Use the basic components of LangChain: prompt templates, models, and output parsers
- Build a simple application with LangChain
- Trace your application with LangSmith
- Serve your application with LangServe

## Data Ingestion - Document Loaders

Loading text files, PDFs, web pages, and arXiv papers into Python as LangChain Document objects

https://python.langchain.com/docs/integrations/document_loaders/
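Every loader in this section returns a list of Document objects, each holding page_content plus a metadata dict. The sketch below mimics that shape in plain Python for a text file. SimpleDocument and load_text are hypothetical names used only for illustration; this is not how LangChain implements its loaders.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one text file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    # Record where the text came from, as the loader outputs in this notebook do.
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]
```

Calling load_text on a file path gives a one-element list whose metadata carries the source path, the same layout seen in the loader outputs in this section.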

In [1]:
#Text
from langchain_community.document_loaders import TextLoader

loader = TextLoader(r'Data\plato_speech.txt')  # raw string so the backslash is not read as an escape
text_data = loader.load()
text_data

[Document(metadata={'source': 'Data\\plato_speech.txt'}, page_content='Concerning the things about which you ask to be informed I believe that I am not ill-prepared with an answer. For the day before yesterday I was coming from my own home at Phalerum to the city, and one of my acquaintance, who had caught a sight of me from behind, hind, out playfully in the distance, said: Apollodorus, O thou Phalerian man, halt! So I did as I was bid; and then he said, I was looking for you, Apollodorus, only just now, that I might ask you about the speeches in praise of love, which were delivered by Socrates, Alcibiades, and others, at Agathon\'s supper. Phoenix, the son of Philip, told another person who told me of them; his narrative was very indistinct, but he said that you knew, and I wish that you would give me an account of them. Who, if not you, should be the reporter of the words of your friend? And first tell me, he said, were you present at this meeting?\n\nYour informant, Glaucon, I said

In [5]:
#PDF
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(r'Data\data.pdf')  # raw string so the backslash is not read as an escape
pdf_data = loader.load()
pdf_data



[Document(metadata={'producer': 'OpenOffice.org 2.4', 'creator': 'Writer', 'creationdate': '2009-03-24T11:33:15-06:00', 'source': 'Data\\data.pdf', 'total_pages': 9, 'page': 0, 'page_label': '1'}, page_content="Bitcoin: A Peer-to-Peer Electronic Cash System\nSatoshi Nakamoto\nsatoshin@gmx.com\nwww.bitcoin.org\nAbstract.  A purely peer-to-peer version of electronic cash would allow online  \npayments to be sent directly from one party to another without going through a  \nfinancial institution.  Digital signatures provide part of the solution, but the main  \nbenefits are lost if a trusted third party is still required to prevent double-spending.  \nWe propose a solution to the double-spending problem using a peer-to-peer network.  \nThe network timestamps transactions by hashing them into an ongoing chain of  \nhash-based proof-of-work, forming a record that cannot be changed without redoing  \nthe proof-of-work.  The longest chain not only serves as proof of the sequence of  \nevents 

In [3]:
#Web
from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(
    web_path=(
        'https://en.wikipedia.org/wiki/Symposium_(Plato)',
        'https://en.wikipedia.org/wiki/Rhetoric_(Aristotle)',
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("mw-page-container", "mw-portlet mw-portlet-dock-bottom emptyPortlet")
        )
    ),
)
web_data = loader.load()
web_data

USER_AGENT environment variable not set, consider setting it to identify your requests.


[Document(metadata={'source': 'https://en.wikipedia.org/wiki/Symposium_(Plato)'}, page_content='\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1\nSetting\n\n\n\n\nToggle Setting subsection\n\n\n\n\n\n1.1\nPrincipal characters\n\n\n\n\n\n\n\n\n1.2\nBackground\n\n\n\n\n\n\n\n\n\n\n2\nStyle, dating and authorship\n\n\n\n\n\n\n\n\n3\nSynopsis\n\n\n\n\nToggle Synopsis subsection\n\n\n\n\n\n3.1\nFrame story\n\n\n\n\n\n\n\n\n3.2\nPhaedrus\' speech\n\n\n\n\n\n\n\n\n3.3\nPausanias\' speech\n\n\n\n\n\n\n\n\n3.4\nEryximachus\' speech\n\n\n\n\n\n\n\n\n3.5\nAristophanes\' speech\n\n\n\n\n\n\n\n\n3.6\nAgathon\'s speech\n\n\n\n\n\n\n\n\n3.7\nSocrates\' speech\n\n\n\n\n\n\n\n\n3.8\nAlcibiades\' speech\n\n\n\n\n\n\n\n\n3.9\nConclusion\n\n\n\n\n\n\n\n\n\n\n4\nInterpretations and themes\n\n\n\n\n\n\n\n\n5\nReception\n\n\n\n\nToggle Reception subsection\n\n\n\n\n\n5.1\nAncient\n\n\n\n\n\n\n5.1.1\nClassical and Hellenistic period\n\n\n\n\n\n\n\n\n5.1.2\nRoma

In [4]:
from langchain_community.document_loaders import ArxivLoader
arxiv_paper = ArxivLoader(query='2502.14926').load()
arxiv_paper

[Document(metadata={'Published': '2025-04-07', 'Title': 'DeepSeek-V3, GPT-4, Phi-4, and LLaMA-3.3 generate correct code for LoRaWAN-related engineering tasks', 'Authors': 'Daniel Fernandes, João P. Matos-Carvalho, Carlos M. Fernandes, Nuno Fachada', 'Summary': 'This paper investigates the performance of 16 Large Language Models (LLMs) in\nautomating LoRaWAN-related engineering tasks involving optimal placement of\ndrones and received power calculation under progressively complex zero-shot,\nnatural language prompts. The primary research question is whether lightweight,\nlocally executed LLMs can generate correct Python code for these tasks. To\nassess this, we compared locally run models against state-of-the-art\nalternatives, such as GPT-4 and DeepSeek-V3, which served as reference points.\nBy extracting and executing the Python functions generated by each model, we\nevaluated their outputs on a zero-to-five scale. Results show that while\nDeepSeek-V3 and GPT-4 consistently provided a

## How to Split Text into Chunks

Splitting text into chunks
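The splitters in this section share two knobs: a maximum chunk size and an overlap between neighbouring chunks. A minimal sketch of what those two numbers mean, using a plain sliding window over characters, is shown here. chunk_with_overlap is a hypothetical helper; the LangChain splitters cut on separators rather than at fixed offsets.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the role chunk_overlap plays in the splitters below.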

### Recursive Character Text Splitter

In [5]:
print(type(pdf_data))
print(type(pdf_data[0]))

<class 'list'>
<class 'langchain_core.documents.base.Document'>


In [6]:
print(pdf_data[0])

page_content='Bitcoin: A Peer-to-Peer Electronic Cash System
Satoshi Nakamoto
satoshin@gmx.com
www.bitcoin.org
Abstract.  A purely peer-to-peer version of electronic cash would allow online  
payments to be sent directly from one party to another without going through a  
financial institution.  Digital signatures provide part of the solution, but the main  
benefits are lost if a trusted third party is still required to prevent double-spending.  
We propose a solution to the double-spending problem using a peer-to-peer network.  
The network timestamps transactions by hashing them into an ongoing chain of  
hash-based proof-of-work, forming a record that cannot be changed without redoing  
the proof-of-work.  The longest chain not only serves as proof of the sequence of  
events witnessed, but proof that it came from the largest pool of CPU power.  As  
long as a majority of CPU power is controlled by nodes that are not cooperating to  
attack the network, they'll generate the longest

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_spliter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50) #each chunk has at most 500 characters, with up to 50 characters of overlap
chunk_data = text_spliter.create_documents([page.page_content for page in pdf_data])


In [8]:
print(len(chunk_data[0].page_content),'0')
print(len(chunk_data[1].page_content),'1')
print(len(chunk_data[2].page_content),'2')

435 0
497 1
434 2
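The chunk lengths stay at or below 500 because the recursive splitter prefers coarse separators and only falls back to finer ones for pieces that are still too long. The following is a simplified sketch of that idea, not LangChain's actual implementation. recursive_split and _merge are hypothetical names, the separator list is an assumption, and chunk overlap is left out for brevity.

```python
def _merge(parts: list[str], sep: str, chunk_size: int) -> list[str]:
    """Greedily join neighbouring parts with sep while the result still fits."""
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, chunk_size: int,
                    separators: tuple = ("\n\n", "\n", " ")) -> list[str]:
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts at fixed offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, small = [], []
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            small.append(part)
        else:
            # Flush the parts that fit, then split the oversized part more finely.
            chunks.extend(_merge(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(part, chunk_size, finer))
    chunks.extend(_merge(small, sep, chunk_size))
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=7))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits as a whole, so it is kept intact; only the second paragraph is broken on spaces.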


### Character Text Splitter

In [6]:
from langchain_text_splitters import CharacterTextSplitter
text_char_spliter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=50)
chunk_text_char_splitter = text_char_spliter.create_documents([page.page_content for page in pdf_data])
print(len(chunk_text_char_splitter[0].page_content),'0')
print(len(chunk_text_char_splitter[1].page_content),'1')

435 0
497 1
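With a single separator, splitting reduces to cutting on that separator and gluing pieces back together until the size limit is reached. A rough sketch under that assumption follows. split_on_separator is a hypothetical name, overlap is ignored, and a piece that is longer than chunk_size on its own is kept whole because there is no finer separator to fall back to.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Cut on one separator, then merge neighbouring pieces up to chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size if the piece itself is too long
    if current:
        chunks.append(current)
    return chunks


print(split_on_separator("ab\ncd\nefghijkl\nmn", separator="\n", chunk_size=5))
# ['ab\ncd', 'efghijkl', 'mn']
```

The middle chunk is eight characters long even though chunk_size is 5, which is the main practical difference from the recursive approach.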


### HTML Header Splitter

In [10]:
url = 'https://search.bisnis.com/?q=indofarma'

In [11]:
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1","Header 1"),
    ('h2','Header 2'),
    ('h3','Header 3')
]
html_text_splitter= HTMLHeaderTextSplitter(headers_to_split_on)
html_text_splitter.split_text_from_url(url)

[Document(metadata={}, page_content='Google Tag Manager (noscript) End Google Tag Manager (noscript) s: bottom billboard e: bottom billboard  \ns: network e: network s: flashnews e: flashnews s: skyscrapper billboard e: skyscrapper billboard  \nBisnis Indonesia Premium  \nE-Paper  \nBisnismuda.id  \nKonten Interaktif  \nBisnis Plus  \nBisnisgrafik  \nVideo  \nHypeabis.id  \nContext.id  \nBreaking News  \nMenanti Sentuhan AI dalam Optimalisasi Sumber Daya Papua  \nJAPFA Luncurkan OLAGUD Fillet Dada Ayam Siap Makan untuk Ekspor  \nPramono Anung, Jakarta Fund dan IPO Bank BUMD  \nExporters Adjust Sails Amid Trade Tensions: MLIA, SMSM, OMNED, and AISA  \nPNM Sediakan Sekolah Kejar Paket Gratis bagi Keluarga Nasabah Mekaar  \nPIS Beberkan 3 Strategi Jadi Pemain Maritim Kelas Dunia\xa0di\xa0IMW\xa02025  \nMidtrans Hadirkan Static QRIS, Tambahkan Opsi Pembayaran bagi Merchant  \nAskrindo Dukung Nelayan lewat Asuransi Mikro Bahari  \nDonald Trump Pertimbangkan Patok Tarif Impor 15% selama 150 
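A header-based splitter groups the text of a page under the headers that precede it and stores those headers as metadata. The sketch below illustrates the grouping with the standard library's html.parser. HeaderSectionParser and split_html_by_headers are hypothetical names, the header list is assumed to be ordered from shallowest to deepest, and script or style text is not filtered out, so this is only an approximation of what a real splitter does.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect (metadata, text) sections, starting a new one at each tracked header."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.sections = []      # finished (metadata, text) pairs
        self.active = {}        # header metadata currently in effect
        self.text_parts = []    # text collected for the current section
        self.in_header = None   # tag name while inside a tracked header
        self.header_text = []

    def flush_section(self):
        text = " ".join(" ".join(self.text_parts).split())
        if text:
            self.sections.append((dict(self.active), text))
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.flush_section()
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            order = list(self.header_names)
            # A new header clears its own level and every deeper level.
            for deeper in order[order.index(tag):]:
                self.active.pop(self.header_names[deeper], None)
            title = " ".join("".join(self.header_text).split())
            self.active[self.header_names[tag]] = title
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.header_text.append(data)
        else:
            self.text_parts.append(data)


def split_html_by_headers(html: str, headers_to_split_on) -> list:
    parser = HeaderSectionParser(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush_section()
    return parser.sections


sample = "<p>intro</p><h1>A</h1><p>one</p><h2>B</h2><p>two</p><h1>C</h1><p>three</p>"
for metadata, text in split_html_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(metadata, text)
```

Text that appears before the first header gets empty metadata, which matches the metadata={} seen in the first document of the output above.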

### JSON Splitter

In [16]:
import json
import requests

In [23]:
json_data = requests.get('https://api.smith.langchain.com/openapi.json').json()

In [26]:
from langchain_text_splitters import RecursiveJsonSplitter
json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks_data = json_splitter.split_json(json_data)
print(json_chunks_data)



In [31]:
json_chunks_documents = json_splitter.create_documents(texts=[json_data])
for doc in json_chunks_documents[:5]:
    print(doc.page_content)

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"tags": ["tracer-sessions"], "summary": "Get Tracing Project Prebuilt Dashboard", "description": "Get a prebuilt dashboard for a tracing project."}}}}
{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"operationId": "get_tracing_project_prebuilt_dashboard_api_v1_sessions__session_id__dashboard_post", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}
{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}
{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"requestBody": {"required": true, "content": {"application/json":
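Each chunk in the output repeats the full key path down to its values, which is why every piece is still valid JSON. A simplified sketch of that behaviour follows. split_json, _leaves, and _set_path are hypothetical helpers rather than the library's code, lists are treated as indivisible values, and a single value larger than the limit still becomes its own oversized chunk.

```python
import copy
import json


def _set_path(target: dict, path: tuple, value) -> None:
    """Write value into target at the nested key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def _leaves(data, path=()):
    """Yield (path, value) for every non-dict value in a nested dict."""
    if isinstance(data, dict) and data:
        for key, value in data.items():
            yield from _leaves(value, path + (key,))
    else:
        yield path, data


def split_json(data: dict, max_chunk_size: int) -> list[dict]:
    """Group leaves into dicts whose serialized length stays within max_chunk_size."""
    if not data:
        return []
    chunks, current = [], {}
    for path, value in _leaves(data):
        trial = copy.deepcopy(current)
        _set_path(trial, path, value)
        if current and len(json.dumps(trial)) > max_chunk_size:
            # Adding this leaf would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = {}
            _set_path(current, path, value)
        else:
            current = trial
    if current:
        chunks.append(current)
    return chunks


nested = {"a": {"b": 1, "c": 2}, "d": {"e": 3}}
print(split_json(nested, max_chunk_size=30))
# [{'a': {'b': 1, 'c': 2}}, {'d': {'e': 3}}]
```

Lowering max_chunk_size to 20 separates "b" and "c" as well, and each resulting chunk still carries the parent key "a".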

## Embeddings

Converting text into vectors
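To make the idea concrete, here is a toy embedding that maps any text to a fixed-length vector by hashing its words into buckets. It is only an illustration of the text-in, vector-out interface: toy_embed and its dim parameter are hypothetical, and the vectors carry none of the semantic information that the learned models in this section produce.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Hash each word into one of dim buckets, then scale the vector to unit length."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # md5 gives a stable bucket across runs, unlike Python's built-in hash().
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


vector = toy_embed("test google gen ai embeddings")
print(len(vector))
```

The length of the result depends only on dim, not on the input text, just as a real embedding model always returns vectors of one fixed size.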

### Google Gen AI

In [22]:
import os
from dotenv import load_dotenv
load_dotenv()
google_api_key = os.getenv('GOOGLE_API_KEY')

python-dotenv could not parse statement starting at line 4


In [78]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-exp-03-07")

text_embeddings = embeddings.embed_query('test google gen ai embeddings')

In [53]:
len(text_embeddings)

3072

In [59]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_spliter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50) #each chunk has at most 500 characters, with up to 50 characters of overlap
chunk_data = text_spliter.create_documents([page.page_content for page in text_data])


In [76]:
chunk_data[:3]

[Document(metadata={}, page_content='Concerning the things about which you ask to be informed I believe that I am not ill-prepared with an answer. For the day before yesterday I was coming from my own home at Phalerum to the city, and one of my acquaintance, who had caught a sight of me from behind, hind, out playfully in the distance, said: Apollodorus, O thou Phalerian man, halt! So I did as I was bid; and then he said, I was looking for you, Apollodorus, only just now, that I might ask you about the speeches in praise of love,'),
 Document(metadata={}, page_content="ask you about the speeches in praise of love, which were delivered by Socrates, Alcibiades, and others, at Agathon's supper. Phoenix, the son of Philip, told another person who told me of them; his narrative was very indistinct, but he said that you knew, and I wish that you would give me an account of them. Who, if not you, should be the reporter of the words of your friend? And first tell me, he said, were you present 

In [72]:
test_embedding = chunk_data[:3] #embed only the first three chunks to stay within the free-tier quota

In [73]:
from langchain_community.vectorstores import Chroma

db=Chroma.from_documents(test_embedding,embeddings)

In [77]:
query  = 'Your informant, Glaucon'
results = db.similarity_search(query)
print(results)

[Document(metadata={}, page_content='Your informant, Glaucon, I said, must have been very indistinct indeed, if you imagine that the occasion was recent; or that I could have been of the party.\n\nWhy, yes, he replied, I thought so.'), Document(metadata={}, page_content="ask you about the speeches in praise of love, which were delivered by Socrates, Alcibiades, and others, at Agathon's supper. Phoenix, the son of Philip, told another person who told me of them; his narrative was very indistinct, but he said that you knew, and I wish that you would give me an account of them. Who, if not you, should be the reporter of the words of your friend? And first tell me, he said, were you present at this meeting?"), Document(metadata={}, page_content='Concerning the things about which you ask to be informed I believe that I am not ill-prepared with an answer. For the day before yesterday I was coming from my own home at Phalerum to the city, and one of my acquaintance, who had caught a sight of 
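Conceptually, a similarity search embeds the query, scores it against every stored vector, and returns the best matches in order. The sketch below does this by brute force. cosine_similarity and similarity_search are hypothetical helpers, and cosine similarity is an assumption made for illustration; a vector store may use a different distance metric and an index instead of a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors, or 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(query_vec: list[float], doc_vecs: list[list[float]],
                      texts: list[str], k: int = 4) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = sorted(
        zip(texts, doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


texts = ["x", "y", "z"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(similarity_search([1.0, 0.1], vectors, texts, k=2))
# ['x', 'z']
```

With only three stored chunks, as in the cell above, every chunk comes back and only the ordering reflects how close each one is to the query.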

### Ollama Embeddings

Install Ollama on your local machine, then use it to download a model.

https://ollama.com/blog/embedding-models

https://ollama.com/search


We will use gemma:2b.

In [3]:
from langchain_community.embeddings import OllamaEmbeddings

In [4]:
embeddings = OllamaEmbeddings(model="gemma:2b")


In [11]:
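raw_document = chunk_text_char_splitter[0].page_content  # embed the chunk's text, not the Document object

In [13]:
embedded_document = embeddings.embed_documents([raw_document])  # embed_documents expects a list of strings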

### HuggingFace Embeddings Model

Sentence Transformers

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()
os.environ['HF_TOKEN'] = os.getenv('HUGGINGFACE_API_KEY')

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
huggingface_embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

