I was able to use Unstructured two months ago to parse scanned PDFs, but since rebuilding my Docker container it keeps falling back to pdfminer instead. Now whenever I try to parse a scanned PDF, loader.load_and_split() returns an empty array.
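If it helps diagnose: my understanding (an assumption on my part, not something I've confirmed in the source) is that unstructured silently falls back to its fast/pdfminer strategy when the hi_res/OCR dependencies can't be imported, which would explain the empty output for scanned pages. A minimal probe I can run inside the container:

```python
# Probe the Python-side OCR dependencies. Assumption: unstructured
# falls back to the fast/pdfminer strategy when these imports fail,
# which would yield empty output for scanned PDFs.
def missing_hi_res_deps():
    missing = []
    for mod in ("unstructured_inference", "pytesseract", "detectron2"):
        try:
            __import__(mod)
        except ImportError:
            missing.append(mod)
    return missing

if __name__ == "__main__":
    print("missing hi_res deps:", missing_hi_res_deps())
```

If any of those come back missing in the rebuilt image, that would point at the pip install steps rather than my application code.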
Here's my Dockerfile:
FROM python:3.9-slim-buster
# Install system dependencies for OCR, PDF handling, and building wheels
RUN apt-get update && apt-get install -y \
    ffmpeg libsm6 libxext6 \
    gcc g++ git build-essential pkg-config \
    libpoppler-cpp-dev poppler-utils \
    libmagic-dev \
    tesseract-ocr libtesseract-dev \
    && rm -rf /var/lib/apt/lists/*
# Make working directories
RUN mkdir -p /app
WORKDIR /app
# Copy the requirements.txt file to the container
COPY requirements.txt .
# Install dependencies
RUN pip install --upgrade pip
RUN pip install torch torchvision torchaudio
RUN pip install unstructured-inference
RUN pip install -r requirements.txt
RUN pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'
# Copy the .env file to the container
COPY .env .
# Copy every file in the source folder to the created working directory
COPY . .
# Expose the port that the application will run on
EXPOSE 8080
# Start the application
CMD ["python3.9", "-m", "uvicorn", "main:app", "--proxy-headers", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
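After building, one sanity check I run inside the container: unstructured relies on the tesseract binary for OCR, so if the apt-installed binary somehow isn't on PATH, scanned pages would come back empty. A stdlib-only probe:

```python
# Check that the tesseract binary installed by the Dockerfile is
# actually on PATH inside the container; without it, OCR-based
# parsing of scanned PDFs can't work.
import shutil

def tesseract_on_path() -> bool:
    return shutil.which("tesseract") is not None

if __name__ == "__main__":
    print("tesseract on PATH:", tesseract_on_path())
```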
Here's the code segment that uses Unstructured:
@app.post("/document/index/scanned")
async def index_scanned_document(document: Document):
    try:
        loader = OnlinePDFLoader(document.pdf_url)
        recursive_text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            chunk_size=2000,
            chunk_overlap=100,
        )
        data = loader.load_and_split(text_splitter=recursive_text_splitter)
        if len(data) == 0:
            raise Exception("No texts found")
        embeddings = OpenAIEmbeddings()
        text_data = [d.page_content for d in data]
        text_metadata = [{"source": f"{i}-pl"} for i in range(len(data))]
        db = PGVector.from_texts(
            texts=text_data,
            embedding=embeddings,
            collection_name=document.user_id + "/scanned/" + document.pdf_title,
            connection_string=connection_string,
            distance_strategy=DistanceStrategy.COSINE,
            metadatas=text_metadata,
            pre_delete_collection=False,
        )
    except Exception as e:
        raise HTTPException(status_code=404, detail={
            "message": "Failed to index document",
            "error": str(e),
        })
    else:
        return {
            "message": "Document indexed successfully",
        }
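One workaround I'm considering, as a sketch only: langchain's UnstructuredPDFLoader forwards extra keyword arguments to unstructured's partition call, and passing strategy="hi_res" is supposed to force the OCR/model path instead of letting it auto-select pdfminer. I'm not sure OnlinePDFLoader accepts the same kwargs, so this assumes downloading the file first:

```python
# Sketch: force the hi_res strategy instead of letting unstructured
# auto-detect (assumes UnstructuredPDFLoader passes `strategy`
# through to unstructured's partition_pdf).
def load_scanned_pdf(path: str):
    # Imported inside the function so this sketch stays importable
    # even where langchain isn't installed.
    from langchain.document_loaders import UnstructuredPDFLoader

    loader = UnstructuredPDFLoader(path, mode="single", strategy="hi_res")
    return loader.load()
```

If forcing hi_res raises a dependency error instead of silently returning nothing, that would at least surface what's missing in the rebuilt image.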