I was able to use Unstructured two months ago to parse scanned PDFs, but since rebuilding my Docker container it keeps falling back to pdfminer instead. Now whenever I try to parse a scanned PDF, loader.load_and_split() returns an empty array.
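If it helps diagnose: my understanding (an assumption on my part, not something I've confirmed in the source) is that unstructured silently falls back to its fast/pdfminer strategy when the hi_res/OCR dependencies can't be imported, which would explain the empty output for scanned pages. A minimal probe I can run inside the container:

```python
# Probe the Python-side OCR dependencies. Assumption: unstructured
# falls back to the fast/pdfminer strategy when these imports fail,
# which would yield empty output for scanned PDFs.
def missing_hi_res_deps():
    missing = []
    for mod in ("unstructured_inference", "pytesseract", "detectron2"):
        try:
            __import__(mod)
        except ImportError:
            missing.append(mod)
    return missing

if __name__ == "__main__":
    print("missing hi_res deps:", missing_hi_res_deps())
```

If any of those come back missing in the rebuilt image, that would point at the pip install steps rather than my application code.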
Here's my Dockerfile:
FROM python:3.9-slim-buster
# Install system dependencies for OCR, PDF handling, and building wheels
RUN apt-get update && apt-get install -y \
    ffmpeg libsm6 libxext6 \
    gcc g++ git build-essential pkg-config \
    libpoppler-cpp-dev poppler-utils \
    libmagic-dev \
    tesseract-ocr libtesseract-dev \
    && rm -rf /var/lib/apt/lists/*
# Make working directories
RUN mkdir -p /app
WORKDIR /app
# Copy the requirements.txt file to the container
COPY requirements.txt .
# Install dependencies
RUN pip install --upgrade pip
RUN pip install torch torchvision torchaudio
RUN pip install unstructured-inference
RUN pip install -r requirements.txt
RUN pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'
# Copy the .env file to the container
COPY .env .
# Copy every file in the source folder to the created working directory
COPY . .
# Expose the port that the application will run on
EXPOSE 8080
# Start the application
CMD ["python3.9", "-m", "uvicorn", "main:app", "--proxy-headers", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
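After building, one sanity check I run inside the container: unstructured relies on the tesseract binary for OCR, so if the apt-installed binary somehow isn't on PATH, scanned pages would come back empty. A stdlib-only probe:

```python
# Check that the tesseract binary installed by the Dockerfile is
# actually on PATH inside the container; without it, OCR-based
# parsing of scanned PDFs can't work.
import shutil

def tesseract_on_path() -> bool:
    return shutil.which("tesseract") is not None

if __name__ == "__main__":
    print("tesseract on PATH:", tesseract_on_path())
```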
Here's the code segment that uses Unstructured:
@app.post("/document/index/scanned")
async def index_scanned_document(document: Document):
    try:
        loader = OnlinePDFLoader(document.pdf_url)
        recursive_text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            chunk_size=2000,
            chunk_overlap=100,
        )
        data = loader.load_and_split(text_splitter=recursive_text_splitter)
        if len(data) == 0:
            raise Exception("No texts found")
        embeddings = OpenAIEmbeddings()
        text_data = [d.page_content for d in data]
        text_metadata = [{"source": f"{i}-pl"} for i in range(len(data))]
        db = PGVector.from_texts(
            texts=text_data,
            embedding=embeddings,
            collection_name=document.user_id + "/scanned/" + document.pdf_title,
            connection_string=connection_string,
            distance_strategy=DistanceStrategy.COSINE,
            metadatas=text_metadata,
            pre_delete_collection=False,
        )
    except Exception as e:
        raise HTTPException(status_code=404, detail={
            "message": "Failed to index document",
            "error": str(e),
        })
    else:
        return {
            "message": "Document indexed successfully",
        }
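One workaround I'm considering, as a sketch only: langchain's UnstructuredPDFLoader forwards extra keyword arguments to unstructured's partition call, and passing strategy="hi_res" is supposed to force the OCR/model path instead of letting it auto-select pdfminer. I'm not sure OnlinePDFLoader accepts the same kwargs, so this assumes downloading the file first:

```python
# Sketch: force the hi_res strategy instead of letting unstructured
# auto-detect (assumes UnstructuredPDFLoader passes `strategy`
# through to unstructured's partition_pdf).
def load_scanned_pdf(path: str):
    # Imported inside the function so this sketch stays importable
    # even where langchain isn't installed.
    from langchain.document_loaders import UnstructuredPDFLoader

    loader = UnstructuredPDFLoader(path, mode="single", strategy="hi_res")
    return loader.load()
```

If forcing hi_res raises a dependency error instead of silently returning nothing, that would at least surface what's missing in the rebuilt image.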