# Load Data

## LangChain documentation on Document Loaders
* [version v0.1](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/)
* [latest version](https://python.langchain.com/docs/how_to/#document-loaders)

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [2]:
MODEL_GPT = 'gpt-4o-mini'

## Simple data loading

### Loading TXT file

In [3]:
from langchain_openai import ChatOpenAI

# chatModel = ChatOpenAI(model="gpt-3.5-turbo-0125")
chatModel = ChatOpenAI(model=MODEL_GPT)

In [4]:
from langchain_community.document_loaders import TextLoader

# loader = TextLoader("./data/be-good.txt")
loader = TextLoader("../../data/be-good.txt")

loaded_data = loader.load()

In [5]:
#loaded_data

### Loading CSV file

In [6]:
from langchain_community.document_loaders import CSVLoader

# loader = CSVLoader('./data/Street_Tree_List.csv')
loader = CSVLoader('../../data/Street_Tree_List.csv')

loaded_data = loader.load()

In [7]:
#loaded_data

### Loading HTML file

In [10]:
from langchain_community.document_loaders import BSHTMLLoader

# BSHTMLLoader parses the HTML with BeautifulSoup (requires the beautifulsoup4 package)
# loader = BSHTMLLoader('./data/100-startups.html')
loader = BSHTMLLoader('../../data/100-startups.html')

loaded_data = loader.load()

In [11]:
#loaded_data

### Loading PDF file

In [12]:
#!pip install pypdf

In [13]:
from langchain_community.document_loaders import PyPDFLoader

# loader = PyPDFLoader('./data/5pages.pdf')
loader = PyPDFLoader('../../data/5pages.pdf')

loaded_data = loader.load_and_split()

In [14]:
#loaded_data[0].page_content

### Loading Wikipedia page and asking questions about it

In [15]:
# !pip install wikipedia

In [16]:
from langchain_community.document_loaders import WikipediaLoader

# query and load_max_docs are separate keyword arguments, not one string
name = "JFK"
loader = WikipediaLoader(query=name, load_max_docs=1)

loaded_data = loader.load()[0].page_content

In [17]:
from langchain_core.prompts import ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    [
        ("human", "Answer this {question}, here is some extra {context}"),
    ]
)

messages = chat_template.format_messages(
    name="JFK",
    question="Where was JFK born?",
    context=loaded_data
)

In [18]:
response = chatModel.invoke(messages)

In [19]:
response

AIMessage(content='John F. Kennedy (JFK) was born in Brookline, Massachusetts, on May 29, 1917.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 26, 'prompt_tokens': 800, 'total_tokens': 826, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_06737a9306', 'finish_reason': 'stop', 'logprobs': None}, id='run-5c3543eb-dfde-4410-813d-2bef0ed2ce13-0', usage_metadata={'input_tokens': 800, 'output_tokens': 26, 'total_tokens': 826, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

## Splitters (divide a loaded document into small chunks of text)

### Simple splitting by character

In [20]:
from langchain_community.document_loaders import TextLoader

# loader = TextLoader("./data/be-good.txt")
loader = TextLoader("../../data/be-good.txt")

loaded_data = loader.load()

In [21]:
loaded_data

[Document(metadata={'source': '../../data/be-good.txt'}, page_content='Be good\n\nApril 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we started Y Combinator we came up with the\nphrase that became our motto: Make something people want.  We\'ve\nlearned a lot since then, but if I were choosing now that\'s still\nthe one I\'d pick.Another thing we tell founders is not to worry too much about the\nbusiness model, at least at first.  Not because making money is\nunimportant, but because it\'s so much easier than building something\ngreat.A couple weeks ago I realized that if you put those two ideas\ntogether, you get something surprising.  Make something people want.\nDon\'t worry too much about making money.  What you\'ve got is a\ndescription of a charity.When you get an unexpected result like this, it could either be a\nbug or a new discovery.  Either businesses aren\'t supposed to be\nlike charities, and we\'ve proven by reductio ad absurdum th

In [22]:
loaded_data[0].page_content

'Be good\n\nApril 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we started Y Combinator we came up with the\nphrase that became our motto: Make something people want.  We\'ve\nlearned a lot since then, but if I were choosing now that\'s still\nthe one I\'d pick.Another thing we tell founders is not to worry too much about the\nbusiness model, at least at first.  Not because making money is\nunimportant, but because it\'s so much easier than building something\ngreat.A couple weeks ago I realized that if you put those two ideas\ntogether, you get something surprising.  Make something people want.\nDon\'t worry too much about making money.  What you\'ve got is a\ndescription of a charity.When you get an unexpected result like this, it could either be a\nbug or a new discovery.  Either businesses aren\'t supposed to be\nlike charities, and we\'ve proven by reductio ad absurdum that one\nor both of the principles we began with is false.  Or we have 

In [23]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

In [24]:
texts = text_splitter.create_documents([loaded_data[0].page_content])

In [25]:
len(texts)

2

In [26]:
texts[0]

Document(metadata={}, page_content='Be good')

In [27]:
# texts[1]
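
Only two chunks come back even though `chunk_size=1000`. That is consistent with the `"\n\n"` separator appearing just once in this text: a character splitter cuts only at the separator, so a piece with no further separators stays whole even when it is longer than `chunk_size`. The block below is a simplified plain-Python sketch of that behaviour, not LangChain's actual implementation. `split_on_separator` is a hypothetical helper, the sample text is made up, and `chunk_overlap` is ignored for brevity.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Keep growing the current chunk (an oversized first piece is kept whole)
            current = candidate
        else:
            # Adding this piece would exceed chunk_size: close the chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Made-up text with a single "\n\n": a short title followed by one long block
sample = "Be good\n\n" + "word " * 600
chunks = split_on_separator(sample)
print(len(chunks), [len(c) for c in chunks])
```

When the text has few blank lines, a splitter that falls back to smaller separators (for example `RecursiveCharacterTextSplitter`) is usually a better fit.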

### Splitting with metadata

In [28]:
metadatas = [{"chunk": 0}, {"chunk": 1}]

documents = text_splitter.create_documents(
    [loaded_data[0].page_content, loaded_data[0].page_content], 
    metadatas=metadatas
)

In [29]:
documents[0]

Document(metadata={'chunk': 0}, page_content='Be good')

In [30]:
print(documents[0])

page_content='Be good' metadata={'chunk': 0}


## Embeddings (transform chunks of text into vectors of numbers)

In [31]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [32]:
chunks_of_text = [
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!",
]

In [33]:
embeddings = embeddings_model.embed_documents(chunks_of_text)

In [34]:
len(embeddings)

5

In [35]:
len(embeddings[0])

1536

In [36]:
print(embeddings[0][:5])

[-0.020325319841504097, -0.007096723187714815, -0.022839006036520004, -0.026279456913471222, -0.037527572363615036]


In [37]:
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")

In [38]:
len(embedded_query)

1536
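
Embeddings are useful because texts with similar meaning tend to map to nearby vectors, and a common way to compare two vectors is cosine similarity. The sketch below shows the calculation in plain Python. The three-dimensional vectors are made-up stand-ins for the 1536-dimensional embeddings above, and `cosine_similarity` is a helper written for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real model output
greeting = [0.9, 0.1, 0.0]
similar_greeting = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.1, 0.9]

print(cosine_similarity(greeting, similar_greeting))  # close to 1
print(cosine_similarity(greeting, unrelated))         # close to 0
```

The same function could be applied to `embedded_query` and each entry of `embeddings` to see which chunk is closest to the question.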

## Vector databases (vector stores): store and search embeddings

In [39]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
# loaded_document = TextLoader('./data/state_of_the_union.txt').load()
# Read the file as UTF-8 so curly quotes and dashes are decoded correctly
loaded_document = TextLoader('../../data/state_of_the_union.txt', encoding='utf-8').load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

vector_db = Chroma.from_documents(chunks_of_text, OpenAIEmbeddings())

In [40]:
question = "What did the president say about the John Lewis Voting Rights Act?"

response = vector_db.similarity_search(question)

print(response[0].page_content)

Repeal the liability shield that makes gun manufacturers the only industry in America that can’t be sued. 

These laws don’t infringe on the Second Amendment. They save lives. 

The most fundamental right in America is the right to vote – and to have it counted. And it’s under assault. 

In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
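
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors are closest to it. The block below is a brute-force sketch of that idea in plain Python, under stated assumptions: `similarity_search` here is a hypothetical helper (not the Chroma method), the store is a plain list, and the vectors are made up. Real vector stores typically use optimized indexes and do not scan every vector.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_search(query_vector, store, k=4):
    """Return the texts of the k stored items most similar to query_vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, vector) pairs standing in for embedded chunks
toy_store = [
    ("chunk about voting rights", [0.9, 0.1, 0.0]),
    ("chunk about the economy", [0.1, 0.9, 0.0]),
    ("chunk about the Supreme Court", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]  # made-up embedding of the question

top = similarity_search(query_vector, toy_store, k=1)
print(top)
```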


## Retriever (returns relevant documents for a given question)

### Vector store as retriever

In [41]:
from langchain_community.document_loaders import TextLoader

# loader = TextLoader("./data/state_of_the_union.txt")
loader = TextLoader("../../data/state_of_the_union.txt", encoding="utf-8")

In [42]:
#!pip install faiss-cpu

In [43]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loaded_document = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

embeddings = OpenAIEmbeddings()

vector_db = FAISS.from_documents(chunks_of_text, embeddings)

In [44]:
retriever = vector_db.as_retriever()

### Simple use without LCEL

In [45]:
response = retriever.invoke("what did he say about ketanji brown jackson?")

In [46]:
len(response)

4

In [47]:
response[0]

Document(id='144c6543-c865-4ae5-b5ea-194a5c965597', metadata={'source': '../../data/state_of_the_union.txt'}, page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scann

### Specifying top k

In [48]:
retriever = vector_db.as_retriever(search_kwargs={"k": 1})

In [49]:
response = retriever.invoke("what did he say about ketanji brown jackson?")

In [50]:
len(response)

1

In [51]:
response

[Document(id='144c6543-c865-4ae5-b5ea-194a5c965597', metadata={'source': '../../data/state_of_the_union.txt'}, page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scan

### Simple use with LCEL and input and output formatters

In [52]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

# model = ChatOpenAI()
model = ChatOpenAI(model=MODEL_GPT)

def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = chain.invoke("what did he say about ketanji brown jackson?")
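
The dictionary at the start of the chain sends the same input string down two paths: `retriever | format_docs` turns it into the context text, while `RunnablePassthrough()` forwards it unchanged as the question. The block below is a plain-Python analogue of that data flow, offered as a rough sketch only. `fake_retriever`, `build_prompt`, and `run_chain` are hypothetical stand-ins, the retrieved chunks are made up, and no model is called.

```python
def fake_retriever(question):
    # Hypothetical stand-in for retriever.invoke(question): returns fixed chunks
    return ["First retrieved chunk.", "Second retrieved chunk."]


def join_chunks(chunks):
    # Same idea as format_docs: join chunk texts with blank lines
    return "\n\n".join(chunks)


def build_prompt(inputs):
    # Fill the same template used in the chain
    return (
        "Answer the question based only on the following context:\n\n"
        f"{inputs['context']}\n\n"
        f"Question: {inputs['question']}\n"
    )


def run_chain(question):
    # The one input feeds both keys, as in the dict at the head of the chain
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }
    return build_prompt(inputs)


prompt_text = run_chain("what did he say?")
print(prompt_text)
```

In the real chain, the filled prompt is then passed to the model and the reply is reduced to a plain string by `StrOutputParser()`.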

In [53]:
response

"He described Ketanji Brown Jackson as one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence. He noted her background as a former top litigator in private practice, a former federal public defender, and her roots in a family of public school educators and police officers. He also emphasized her ability as a consensus builder and mentioned that she has received a broad range of support from various organizations and individuals, including the Fraternal Order of Police and former judges appointed by both Democrats and Republicans."

## Indexing
* [version v0.1](https://python.langchain.com/v0.1/docs/modules/data_connection/indexing/)
* [latest version](https://python.langchain.com/docs/how_to/indexing/)