## Documents
#### Documents in LangChain consist of a piece of text and optional metadata. The text is what we pass to the language model, while the metadata keeps track of details about the document, such as its source.


## Document Loaders
#### Document loaders in LangChain load data from a source as documents.
1. CSVLoader
2. PyPDFLoader
3. WebBaseLoader
4. JSONLoader
5. YoutubeLoader
6. WikipediaLoader

### This notebook demonstrates commonly used Loaders and Splitters

In [12]:
import random
import datetime
sources = ["Wikipedia", "BBC News", "National Geographic", "NASA", "Reuters"]
source = random.choice(sources)
timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
author = "Generated by Python Script"

# Page content
content = """
Hard work and discipline are the foundation of success, forming the bridge between aspirations and achievements. 
Hard work fuels the effort we put into pursuing our goals, helping us push past challenges and rise above limitations. 
It teaches perseverance and develops the skills necessary to excel. Discipline, on the other hand, provides the structure to stay consistent 
and focused, even when motivation fades. Motivation may ignite the fire within us, but discipline sustains it, 
ensuring we continue to take small, deliberate steps toward our dreams. Together, hard work and discipline cultivate resilience, 
build character, and instill a sense of purpose
, proving that the greatest rewards often come from the determination to keep going, no matter how difficult the path.
"""

# Build the metadata header
metadata = f"""Source: {source}
Author: {author}
Date: {timestamp}

"""

# Combine metadata and page content
full_text = metadata + "Page Content:\n" + content

# Write to file
file_name = "sample.txt"
with open(file_name, "w") as file:
    file.write(full_text)

#### In LangChain, a document is a simple structure with two fields:
1. page_content (string): This field contains the raw text of the document.
2. metadata (dictionary): This field stores additional information about the text, such as the source URL, the author, or any other relevant details.
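
As a minimal sketch of this two-field shape, the stand-in class below mirrors the structure using only the standard library. `SimpleDocument` is a hypothetical name used for illustration; it is not LangChain's own `Document` class.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields described above."""

    page_content: str  # raw text of the document
    metadata: dict = field(default_factory=dict)  # optional details such as source or author


doc = SimpleDocument(
    page_content="Hard work and discipline are the foundation of success.",
    metadata={"source": "sample.txt", "author": "Generated by Python Script"},
)
print(doc.page_content)
print(doc.metadata["source"])
```

The metadata defaults to an empty dictionary, which matches the idea that it is optional.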

In [14]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("sample.txt")
document = loader.load()
print(document)

[Document(metadata={'source': 'sample.txt'}, page_content='Source: BBC News\nAuthor: Generated by Python Script\nDate: 2024-12-20 23:29:44\n\nPage Content:\n\nHard work and discipline are the foundation of success, forming the bridge between aspirations and achievements. \nHard work fuels the effort we put into pursuing our goals, helping us push past challenges and rise above limitations. \nIt teaches perseverance and develops the skills necessary to excel. Discipline, on the other hand, provides the structure to stay consistent \nand focused, even when motivation fades. Motivation may ignite the fire within us, but discipline sustains it, \nensuring we continue to take small, deliberate steps toward our dreams. Together, hard work and discipline cultivate resilience, \nbuild character, and instill a sense of purpose\n, proving that the greatest rewards often come from the determination to keep going, no matter how difficult the path.\n')]


In [16]:
document[0].page_content

'Source: BBC News\nAuthor: Generated by Python Script\nDate: 2024-12-20 23:29:44\n\nPage Content:\n\nHard work and discipline are the foundation of success, forming the bridge between aspirations and achievements. \nHard work fuels the effort we put into pursuing our goals, helping us push past challenges and rise above limitations. \nIt teaches perseverance and develops the skills necessary to excel. Discipline, on the other hand, provides the structure to stay consistent \nand focused, even when motivation fades. Motivation may ignite the fire within us, but discipline sustains it, \nensuring we continue to take small, deliberate steps toward our dreams. Together, hard work and discipline cultivate resilience, \nbuild character, and instill a sense of purpose\n, proving that the greatest rewards often come from the determination to keep going, no matter how difficult the path.\n'

In [18]:
document[0].metadata

{'source': 'sample.txt'}
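
Note that the loader records only the file path in `metadata`; the `Source`, `Author`, and `Date` lines written to the file remain part of `page_content`. The sketch below shows one way to lift such header lines into a dictionary. The `split_header` helper is hypothetical, and it assumes the header ends at the `Page Content:` marker used in `sample.txt`.

```python
def split_header(page_content: str, marker: str = "Page Content:"):
    """Split 'Key: value' header lines from the body that follows `marker`."""
    header, _, body = page_content.partition(marker)
    metadata = {}
    for line in header.splitlines():
        # Split on the first colon only, so timestamps keep their own colons
        key, sep, value = line.partition(":")
        if sep and value.strip():
            metadata[key.strip().lower()] = value.strip()
    return metadata, body.strip()


sample = (
    "Source: BBC News\n"
    "Author: Generated by Python Script\n"
    "Date: 2024-12-20 23:29:44\n\n"
    "Page Content:\n\nHard work and discipline are the foundation of success.\n"
)
metadata, body = split_header(sample)
print(metadata)
print(body)
```

The resulting dictionary could then be merged into a document's existing metadata.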

### Type of Document Loaders in LangChain
#### LangChain offers three main types of Document Loaders

1. Transform Loaders: These loaders handle different input formats and transform them into the Document format. For instance, consider a CSV file name.csv with columns "name" and "age". Using the CSVLoader, we can load the CSV data into Documents.
2. Public Dataset or Service Loaders: LangChain provides loaders for popular public sources, allowing quick retrieval and creation of Documents. For example, WikipediaLoader can load content from Wikipedia.
3. Proprietary Dataset or Service Loaders: These loaders are designed to handle proprietary sources that may require additional authentication or setup. For instance, a loader could be created specifically for loading data from an internal database or an API with proprietary access.
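
To make the first category concrete, the sketch below approximates what a CSV transform loader does: one document per row, with each column rendered as a `column: value` line. It uses only the standard library, plain dictionaries instead of LangChain's `Document`, and hypothetical rows for the `name.csv` example.

```python
import csv
import io

# Hypothetical contents of name.csv with the "name" and "age" columns
csv_text = "name,age\nAlice,30\nBob,25\n"

documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # Render every column of the row as a "column: value" line
    page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
    documents.append(
        {"page_content": page_content, "metadata": {"source": "name.csv", "row": row_number}}
    )

print(documents[0]["page_content"])
```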

### Transform Loader - CSV Loader

In [24]:
# CSV Loader

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("movie_review_test.csv")
documents = loader.load()

# Print only the first document to keep the output short
first_document = documents[0]
print(first_document.page_content)
print("------")

class: Pos
text: films adapted from comic books have had plenty of success   whether they re about superheroes   batman   superman   spawn     or geared toward kids   casper   or the arthouse crowd   ghost world     but there s never really been a comic book like from hell before    for starters   it was created by alan moore   and eddie campbell     who brought the medium to a whole new level in the mid  80s with a 12 part series called the watchmen    to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd    the book   or   graphic novel     if you will   is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes    in other words   don t dismiss this film because of its source    if you can get past the whole comic book thing   you might find another stumbling block in from hell s directors   albert and allen hughes    getting the hughes brothers to direct this

### Transform Loader - PDF Loader

In [29]:
!pip install pypdf

Collecting pypdf
  Using cached pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0


In [31]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("Priyanuj_Misra_Resume_AI.pdf")
pages = loader.load()

In [33]:
# Number the pages from 1 while printing their text
for cnt, page in enumerate(pages, start=1):
    print("--------Document #", cnt)
    print(page.page_content.strip())

--------Document # 1
PriyanujMisra
Senior Data Scientist
Dubai,UAE
 GitHub
 LinkedIn
 priyanujmisra.nits@gmail.com
 +971506138031
PROFESSIONALSUMMARY
Senior Data Scientist with 5+ years of experience in leveraging Advanced Analytics and Artificial Intelligence to provide ac-
tionable insights for Fortune 50 global leaders in BFSI, Retail and Telecom industries. Proficient in implementing end-to-end
data modeling pipelines and solutions using Python, PySpark, and Driverless AI, with expertise in Machine Learning and Deep
Learning algorithms.
EXPERIENCE
Etisalate&UAE | SENIOR DATA SCIENTIST
Aug2024–Present|Dubai,UAE
Ô Roles and Responsibilities - Part of the CVM Modeling team delivering Machine
Learning and Generative AI solutions to optimize business needs and drive
revenue. Currently working on preparing a SOP for Information Retrieval using Gen
AI (LLM)
Ô Customer Rejection Reason Analysis - Developed a Topic Modeling Pipeline
using Latent Dirichlet Allocation (LDA) to systematically 
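
PDF extraction often keeps the line-break hyphenation of the original layout, as in `ac-` followed by `tionable` in the output above. The sketch below is one simple way to rejoin such words; `join_hyphenated_line_breaks` is a hypothetical helper, and it will also merge a genuine compound that happens to break at its hyphen.

```python
import re


def join_hyphenated_line_breaks(text: str) -> str:
    """Join words split as 'ac-' + newline + 'tionable' into 'actionable'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "provide ac-\ntionable insights for end-to-end\ndata modeling pipelines"
print(join_hyphenated_line_breaks(sample))
```

Hyphens inside a line, such as those in `end-to-end`, are left untouched because the pattern only matches a hyphen directly before a newline.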

### Transform Loader - WebBaseLoader

In [36]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.eand.com/en/who-we-are.html")
data = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.
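
The warning above asks for a `USER_AGENT` environment variable. A minimal sketch of setting it from Python follows; the identifier string is a hypothetical placeholder, and the assumption is that the variable should be set before the loader module is imported so that it is picked up.

```python
import os

# Hypothetical identifier; replace it with a string that describes your application
os.environ.setdefault("USER_AGENT", "langchain-loader-notebook/0.1")

print(os.environ["USER_AGENT"])
```

`setdefault` leaves any value that is already configured in the environment unchanged.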


In [38]:
data[0].page_content

'\n\n\n\n\n\n\n\ne& (etisalat and ) | Global technology group | Who We Are\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWho we are\n\n\n\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t About Us\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t Governance\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t Senior Management\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t Our Strategy\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t Sustainability\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t Awards and recognitions\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t Carrier & Wholesale\n\n universe\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSince 1976, we have given people the power to come together .\n\n\n\n\n\n\n\n\nOur brands\n\n\n\n\n UAE\n\n life\n\n international\n\n enterprise\n\n capital\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAt e&, we do more so you can be more. Our mission is to enrich every day, every moment, for everyone we reach.\n\n\n\n\n\n\n\n\nInvestors\n\n\n\n\n\t\t\t\t\t\t\t\t\t\

In [44]:
formatted_text = data[0].page_content.strip().replace("\n\n", " \n").replace("\t\t", " \t").replace("  ", " ")
print(formatted_text)

e& (etisalat and ) | Global technology group | Who We Are 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Who we are 
 

 	 	 	 	 	 	 	 		 About Us 
 	 	 	 	 	 	 	 		 Governance 
 	 	 	 	 	 	 	 		 Senior Management 
 	 	 	 	 	 	 	 		 Our Strategy 
 	 	 	 	 	 	 	 		 Sustainability 
 	 	 	 	 	 	 	 		 Awards and recognitions 
 	 	 	 	 	 	 	 		 Carrier & Wholesale 
 universe 
 
 
 
 
 
 
 

Since 1976, we have given people the power to come together . 
 
 
 

Our brands 
 

 UAE 
 life 
 international 
 enterprise 
 capital 
 
 
 
 
 
 
 
At e&, we do more so you can be more. Our mission is to enrich every day, every moment, for everyone we reach. 
 
 
 

Investors 
 

 	 	 	 	 	 	 	 		 Financial Highlights 
 	 	 	 	 	 	 	 		 Share Information 
 	 	 	 	 	 	 	 		 Debt Profile 
 	 	 	 	 	 	 	 		 Dividends 
 	 	 	 	 	 	 	 		 Analyst Coverage 
 	 	 	 	 	 	 	 		 Financial Results 
 	 	 	 	 	 	 	 		 Annual Reports 
 	 	 	 	 	 	 	 		 Financial Calendar 
 	 	 	 	 	 	 	 		 Corporat

#### Using regular expressions for more comprehensive cleaning

In [47]:
import re

# \s matches spaces, tabs, and newlines, so this collapses every whitespace run
# (including line breaks) into a single space.
cleaned_text = re.sub(r"\s+", " ", formatted_text)

print(cleaned_text)

e& (etisalat and ) | Global technology group | Who We Are Who we are About Us Governance Senior Management Our Strategy Sustainability Awards and recognitions Carrier & Wholesale universe Since 1976, we have given people the power to come together . Our brands UAE life international enterprise capital At e&, we do more so you can be more. Our mission is to enrich every day, every moment, for everyone we reach. Investors Financial Highlights Share Information Debt Profile Dividends Analyst Coverage Financial Results Annual Reports Financial Calendar Corporate Announcements Invest in e& We operate in 38 countries across Middle East, Asia, Africa and central and eastern Europe Stories Careers عربي Who we are About Us Governance Senior Management Our Strategy Sustainability Awards and recognitions Carrier & Wholesale universe Our brands UAE life international enterprise capital Investors Financial Highlights Share Information Debt Profile Dividends Analyst Coverage Financial Results Annual
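
Because `\s+` also matches newlines, the substitution flattens the whole page into a single line. If paragraph breaks should survive, a variant such as the sketch below may be preferable. `clean_whitespace` is a hypothetical helper, and the sample string is a shortened imitation of the page text.

```python
import re


def clean_whitespace(text: str) -> str:
    """Collapse spaces and tabs, trim each line, and keep one blank line between blocks."""
    text = re.sub(r"[ \t]+", " ", text)  # runs of spaces and tabs become one space
    lines = [line.strip() for line in text.splitlines()]  # trim each line
    text = "\n".join(lines)
    return re.sub(r"\n{2,}", "\n\n", text).strip()  # runs of blank lines become one


sample = "Who we are \n \n \n \t \t About Us \n \t \t Governance \n\n\n\nOur brands "
print(clean_whitespace(sample))
```

Keeping the blank lines preserves a rough notion of sections, which can help when the text is later split into chunks.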

### Transform Loader - JSON Loader

In [72]:
import json

# Data to be written to the JSON file
data = {
    "employees": [
        {"name": "Alice Johnson", "email": "alice.johnson@example.com"},
        {"name": "Bob Smith", "email": "bob.smith@example.com"},
        {"name": "Charlie Brown", "email": "charlie.brown@example.com"},
        {"name": "Diana Prince", "email": "diana.prince@example.com"}
    ]
}

# File name
file_name = "sample.json"

# Write data to a JSON file
with open(file_name, "w") as file:
    json.dump(data, file, indent=4)

print(f"File '{file_name}' written successfully!")

File 'sample.json' written successfully!


In [74]:
!pip install jq



In [76]:
from langchain_community.document_loaders import JSONLoader

from pathlib import Path
from pprint import pprint

file_path = "sample.json"

data = json.loads(Path(file_path).read_text())

In [78]:
pprint(data)

{'employees': [{'email': 'alice.johnson@example.com', 'name': 'Alice Johnson'},
               {'email': 'bob.smith@example.com', 'name': 'Bob Smith'},
               {'email': 'charlie.brown@example.com', 'name': 'Charlie Brown'},
               {'email': 'diana.prince@example.com', 'name': 'Diana Prince'}]}


In [82]:
loader = JSONLoader(
    file_path="sample.json",
    jq_schema=".employees[].email",
    text_content=False,
)

data = loader.load()

In [84]:
data

[Document(metadata={'source': '/Users/priyanuj/Desktop/GenAI/sample.json', 'seq_num': 1}, page_content='alice.johnson@example.com'),
 Document(metadata={'source': '/Users/priyanuj/Desktop/GenAI/sample.json', 'seq_num': 2}, page_content='bob.smith@example.com'),
 Document(metadata={'source': '/Users/priyanuj/Desktop/GenAI/sample.json', 'seq_num': 3}, page_content='charlie.brown@example.com'),
 Document(metadata={'source': '/Users/priyanuj/Desktop/GenAI/sample.json', 'seq_num': 4}, page_content='diana.prince@example.com')]
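
For readers unfamiliar with jq, the sketch below shows a plain-Python counterpart of the expression `.employees[].email`, with results numbered from 1 like the `seq_num` values in the output above. It uses a shortened copy of the sample data and plain dictionaries; it is an illustration, not the loader's implementation.

```python
import json

raw = """
{"employees": [
    {"name": "Alice Johnson", "email": "alice.johnson@example.com"},
    {"name": "Bob Smith", "email": "bob.smith@example.com"}
]}
"""

payload = json.loads(raw)

# Plain-Python counterpart of the jq expression ".employees[].email"
emails = [employee["email"] for employee in payload["employees"]]

# Number the results from 1, as the seq_num values in the loader output do
records = [
    {"page_content": email, "metadata": {"seq_num": seq_num}}
    for seq_num, email in enumerate(emails, start=1)
]
print(records)
```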

### Public Dataset or Service Loaders

#### WikipediaLoader

In [88]:
from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader("Generative AI")

document = loader.load()

In [89]:
document[0].page_content

'Generative artificial intelligence (generative AI, GenAI, or GAI) is a subset of artificial intelligence that uses generative models to produce text, images, videos, or other forms of data. These models learn the underlying patterns and structures of their training data and use them to produce new data based on the input, which often comes in the form of natural language prompts.  \nImprovements in transformer-based deep neural networks, particularly large language models (LLMs), enabled an AI boom of generative AI systems in the early 2020s. These include chatbots such as ChatGPT, Copilot, Gemini, and LLaMA; text-to-image artificial intelligence image generation systems such as Stable Diffusion, Midjourney, and DALL-E; and text-to-video AI generators such as Sora. Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.\nGenerative AI has uses across a wide range of industries, including software developm

In [90]:
document[0].metadata

{'title': 'Generative artificial intelligence',
 'summary': 'Generative artificial intelligence (generative AI, GenAI, or GAI) is a subset of artificial intelligence that uses generative models to produce text, images, videos, or other forms of data. These models learn the underlying patterns and structures of their training data and use them to produce new data based on the input, which often comes in the form of natural language prompts.  \nImprovements in transformer-based deep neural networks, particularly large language models (LLMs), enabled an AI boom of generative AI systems in the early 2020s. These include chatbots such as ChatGPT, Copilot, Gemini, and LLaMA; text-to-image artificial intelligence image generation systems such as Stable Diffusion, Midjourney, and DALL-E; and text-to-video AI generators such as Sora. Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.\nGenerative AI has uses ac

#### IMSDb Movie Script Loader

In [95]:
from langchain_community.document_loaders import IMSDbLoader

loader = IMSDbLoader("https://imsdb.com/scripts/BlacKkKlansman.html")
data = loader.load()

In [97]:
formatted_text = data[0].page_content[:5000].strip()
print(formatted_text)

BLACKKKLANSMAN
                         
                         
                         
                         
                                      Written by

                          Charlie Wachtel & David Rabinowitz

                                         and

                              Kevin Willmott & Spike Lee








                         FADE IN:
                         
          SCENE FROM "GONE WITH THE WIND"
                         
          Scarlett O'Hara, played by Vivian Leigh, walks through the
          Thousands of injured Confederate Soldiers pulling back to
          reveal the Famous Shot of the tattered Confederate Flag in
          "Gone with the Wind" as The Max Stein Music Score swells from
          Dixie to Taps.
                         
                                   BEAUREGARD- KLAN NARRATOR (O.S.)
                       They say they may have lost the
                       Battle but they didn't lose The War.
                  

#### YouTubeLoader

In [100]:
!pip install --upgrade --quiet youtube-transcript-api

In [102]:
from langchain_community.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=dUYRdW6fYTk&t=2s", add_video_info=False)

data = loader.load()

In [104]:
# Print the full list of loaded documents
print(data)

[Document(metadata={'source': 'dUYRdW6fYTk'}, page_content="look at this fire number so you will need 12 and2 crores what happened why you laughing okay anyway we'll see no it is possible I'll tell you why because how many mutual funds do you have 25 what see actually Shar see he has recommended 25 different mutual funds for you which is definitely bad advice so that is your asset current asset allocation this is what it looks like did you know this looking ising nice actually XL magic so total you have 1 lakh 62,000 rupees available for investing every single month I thought that is not at all possible actually okay so these are all your goals now let us put the amount that you require for each of these goals what is this gold 40 lakh gold is not enough you want more gold I want a bigger car so all people are suggesting not to go for a bigger car what is the smart decision to do now the mistake that people make over here that that also is possible yes right so remember that time you w
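
The `source` value in the metadata above matches the `v` query parameter of the video URL. The sketch below recovers that ID with the standard library. `youtube_video_id` is a hypothetical helper that handles only `watch?v=` style URLs; it is not a description of how the loader parses URLs.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> str:
    """Return the value of the `v` query parameter from a watch URL."""
    query = parse_qs(urlparse(url).query)
    return query["v"][0]


print(youtube_video_id("https://www.youtube.com/watch?v=dUYRdW6fYTk&t=2s"))
```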

#### Another example: transcript language and translation options

In [111]:
!pip install pytube

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl.metadata (5.0 kB)
Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0


In [127]:
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=e14UnK9FRxM",
    add_video_info=False,
    language=["en", "id"],  # preferred transcript languages, in priority order
    translation="en",  # translate the transcript to English
)

ydata = loader.load()

In [129]:
ydata

[Document(metadata={'source': 'e14UnK9FRxM'}, page_content="in today's video I need to explain to  you why net worth explodes after 100K so  let me show you the math first and then  I'll show you what you need to do so  let's start with this  demonstration so what if I give you a  choice I'll give you two offers and then  you tell me which one that you would  rather take offer number one I will give  you a million dollars right now and  offer number two I will give you a penny  right now and every day for 30 days I  will double your money so which offer  would you take offer number one provides  you with $1 million and here's what  offer number two provides you with so on  the first day you start off with a penny  and on the second day you have two cents  day three 4 cents day four day five 6 7  8 9 10 so your first 10 days are very  boring and then it speeds up so take a  look at day 11 day 12 13 14 15 16 17 18  19 20 and these last 10 days it gets  very exciting on day 21 you're at  

In [131]:
formatted_text = ydata[0].page_content[:5000].strip()
print(formatted_text)

in today's video I need to explain to  you why net worth explodes after 100K so  let me show you the math first and then  I'll show you what you need to do so  let's start with this  demonstration so what if I give you a  choice I'll give you two offers and then  you tell me which one that you would  rather take offer number one I will give  you a million dollars right now and  offer number two I will give you a penny  right now and every day for 30 days I  will double your money so which offer  would you take offer number one provides  you with $1 million and here's what  offer number two provides you with so on  the first day you start off with a penny  and on the second day you have two cents  day three 4 cents day four day five 6 7  8 9 10 so your first 10 days are very  boring and then it speeds up so take a  look at day 11 day 12 13 14 15 16 17 18  19 20 and these last 10 days it gets  very exciting on day 21 you're at  $1,486 day 22 day 23 day 24 25 26  27 day 28 you're over a m