Langchain Information Extraction #12636

Closed
HasnainKhanNiazi opened this issue Oct 31, 2023 · 7 comments
Labels
Ɑ: doc loader · extraction · 🤖:improvement · Ɑ: models · 🤖:question

Comments

@HasnainKhanNiazi

I am trying to use the LangChain information extraction chain with OpenAI. First, I read a PDF file containing some text about one or more products. There is no specific format to the PDF: there can be a single product on one page, one product spread across two pages, or ten products on one page, so it's completely random. How can I extract the useful information (some key attributes) for all the products? Currently, I am using Pydantic to validate the output schema, and I am using PyPDFLoader to load the PDF and then calling load_and_split.

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_path=file_path)
pages = loader.load_and_split()

And then I am looping on the pages like this:

from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain

llm = ChatOpenAI(temperature=0, model=model_name, openai_api_key=key)
chain = create_extraction_chain(schema, llm)

results = []
for index in range(0, len(pages)):
    output = chain.run(pages[index])
    results.append(output)

How can I run the extraction so that it is completely automated? Currently it just extracts information from one page, then the next, and so on.

dosubot added the Ɑ: doc loader, Ɑ: models, 🤖:improvement, and 🤖:question labels Oct 31, 2023

dosubot bot commented Oct 31, 2023

🤖

Hello @HasnainKhanNiazi,

I'm here to assist you with your query. I understand you're trying to automate the information extraction process from a PDF file using LangChain, PyPDFLoader, and Pydantic, and you want the extraction to consider the entire document as a whole, not just page by page.

To achieve this, you can load the entire PDF file and concatenate all the text into a single string before running the extraction chain. Here's a simplified example of how you might do this:

from langchain.chains.openai_functions.extraction import create_extraction_chain_pydantic
from langchain.pydantic_v1 import BaseModel
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders.pdf import PyMuPDFLoader

# Define your Pydantic schema for the product details
class ProductDetails(BaseModel):
    product_name: str
    product_price: float
    # Add more fields as needed

# Load your PDF file
loader = PyMuPDFLoader("your_pdf_file.pdf")
documents = loader.load()

# Concatenate all the text from the PDF into a single string
# (join with a newline so words at page boundaries don't run together)
text = "\n".join(doc.page_content for doc in documents)

# Create your extraction chain
llm = ChatOpenAI(temperature=0)  # Replace with your actual language model
chain = create_extraction_chain_pydantic(ProductDetails, llm)

# Extract the product details from the entire PDF text
product_details = chain.run(text)

This code will consider the entire document as a whole when extracting information, as it first concatenates all the text from the PDF into a single string before running the extraction chain.

Please note that you'll need to replace "your_pdf_file.pdf" with the path to your actual PDF file, and ChatOpenAI(temperature=0) with the model configuration you actually use. Also, the ProductDetails class should be updated to match the actual structure of the product details in your PDF file.
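For instance, a slightly richer schema might look like the sketch below; these field names are purely illustrative (not prescribed by LangChain) and should be replaced with the attributes you actually need:

from typing import Optional
from langchain.pydantic_v1 import BaseModel

# Hypothetical schema -- the field names below are examples, not requirements
class ProductDetails(BaseModel):
    product_name: str
    product_price: Optional[float]  # optional, since some catalogs omit prices
    sku: Optional[str]
    description: Optional[str]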

You can find more details about the create_extraction_chain_pydantic function and how it works in the LangChain repository.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

Best,
Dosu


@HasnainKhanNiazi
Author

@dosu-bot this is helpful, but what if I concatenate the text of multiple PDF pages into a single string and then can't pass the whole string to the LLM as context because of token limitations? In that case, do I need to divide the text?


dosubot bot commented Oct 31, 2023

🤖

I understand your concern about the token limitations when dealing with large amounts of text. In your case, you can indeed divide the text into smaller parts to handle the token limitations of the Language Model (LLM).

LangChain provides text splitters that can split the text into chunks that fit within the token limit of the language model. For example, you can use the CharacterTextSplitter.from_tiktoken_encoder or TokenTextSplitter if you are using a BPE tokenizer like tiktoken.

Here is an example of how you can use the CharacterTextSplitter.from_tiktoken_encoder:

from langchain.text_splitter import CharacterTextSplitter

# Assume that 'text_from_pdf' is the text extracted from the PDF file
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(text_from_pdf)

And here is an example of how you can use the TokenTextSplitter:

from langchain.text_splitter import TokenTextSplitter

# Assume that 'text_from_pdf' is the text extracted from the PDF file
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(text_from_pdf)

If you are using a different tokenizer, you can use the corresponding text splitter. For example, if you are using the Hugging Face tokenizer, you can use the CharacterTextSplitter.from_huggingface_tokenizer:

from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Assume that 'text_from_pdf' is the text extracted from the PDF file
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(text_from_pdf)

These examples show how you can split the text into chunks that fit within the token limit of the language model. The chunk_size parameter specifies the maximum number of tokens in each chunk, and the chunk_overlap parameter specifies the number of tokens that overlap between consecutive chunks.
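As a quick illustration of the overlap behaviour, here is a minimal sketch (the sample sentence is arbitrary; TokenTextSplitter uses tiktoken under the hood, so that package must be installed):

from langchain.text_splitter import TokenTextSplitter

# Tiny chunk sizes chosen only to make the overlap visible
splitter = TokenTextSplitter(chunk_size=5, chunk_overlap=2)
chunks = splitter.split_text("one two three four five six seven eight nine ten")
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
# Each chunk after the first starts with roughly the last 2 tokens of the previous one.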

You can find more information about text splitters in the LangChain documentation here.

As for the maximum token limit for the BaseLanguageModel in the LangChain framework, I wasn't able to find an exact number in the repository. The token limit can vary depending on the specific language model you are using. For example, GPT-3 has a maximum token limit of 2048. You might want to check the documentation of the specific language model you are using for more accurate information.
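If you want to check a text against a model's limit before calling the chain, one option is to count tokens directly with tiktoken (a sketch; assumes the tiktoken package is installed and that text_from_pdf is defined as above):

import tiktoken

# encoding_for_model picks the tokenizer matching the given OpenAI model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
num_tokens = len(encoding.encode(text_from_pdf))
print(f"The text is {num_tokens} tokens long.")  # compare against the model's context window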

I hope this helps! If you have any more questions, feel free to ask.


@HasnainKhanNiazi
Author

@dosu-bot if I divide the text into multiple chunks like this:

from langchain.text_splitter import TokenTextSplitter

# Assume that 'text_from_pdf' is the text extracted from the PDF file
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(text_from_pdf)

then I need to call chain.run(texts[0]), then texts[1], ... up to texts[N], but in this case as well the model only takes one chunk in its context, not the complete PDF document text.


dosubot bot commented Oct 31, 2023

🤖

Based on the issues and discussions in the LangChain repository, it seems like you're facing a common challenge when dealing with large documents and language models with token limitations. The LangChain framework provides several tools to handle this, but they need to be used correctly to achieve the desired results.

From your code, it seems like you're using the TokenTextSplitter to split the text into smaller chunks. However, when you run the extraction chain on each chunk, the model is only considering the individual chunk in the context, not the complete PDF document text. This is because each chunk is treated as a separate document by the extraction chain.
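If per-chunk extraction is acceptable for your attributes, one pragmatic pattern is to run the chain over each chunk and merge the results afterwards. A sketch, assuming chain and texts are defined as in the earlier snippets in this thread:

all_products = []
for chunk in texts:
    # create_extraction_chain returns a list of dicts matching the schema
    extracted = chain.run(chunk)
    all_products.extend(extracted)

# Naive de-duplication, since overlapping chunks can yield repeated products
unique_products = [dict(t) for t in {tuple(sorted(p.items())) for p in all_products}]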

To modify the extraction process to consider the entire document as a whole, you can use the MapReduceDocumentsChain, as suggested in issue #6805. This chain is designed to process multiple documents, but it doesn't automatically split a single long document into smaller chunks, so you can use the CharacterTextSplitter to split the long document before passing it to load_qa_chain. Here is the suggested solution:

from langchain.document_loaders import TextLoader
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter

loader = TextLoader("state_of_the_union.txt")
documents = loader.load()  # load() returns a list of Document objects

# Split the long document into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

chain = load_qa_chain(OpenAI(temperature=0), chain_type="map_reduce")
query = "What did the president say about Justice Breyer"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

In this code, the CharacterTextSplitter is used to split the long document into smaller chunks, and then these chunks are passed to the load_qa_chain function, allowing it to process the document without exceeding the token limit.

However, if you're using a Sentence Transformers model, you might want to use the SentenceTransformersTokenTextSplitter class instead, as suggested in the discussion #7760. This class is designed to split the text based on tokens, which is more suitable for Sentence Transformers models.
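A minimal sketch of that splitter (assuming the sentence-transformers package is installed, since the splitter loads the named model's tokenizer, and text_from_pdf is as above):

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Splits based on the token counts of the named Sentence Transformers model
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    chunk_overlap=50,
)
texts = splitter.split_text(text_from_pdf)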

Please note that these are potential solutions based on the information you've provided and the similar issues found in the LangChain repository. You might need to adjust them to fit your specific needs. If you encounter any issues or have further questions, feel free to ask.


@JoffreyLemery

@HasnainKhanNiazi

Did you finally succeed? I'm facing the same situation!

@eyurtsev
Collaborator

eyurtsev commented Mar 8, 2024

Closing, as the extraction use case has now been updated:

Please review: https://python.langchain.com/docs/use_cases/extraction

eyurtsev closed this as completed Mar 8, 2024