Skip to content

Commit

Permalink
add ocr and recognizer demo, update README (#74)
Browse files Browse the repository at this point in the history
  • Loading branch information
KevinHuSh committed Feb 26, 2024
1 parent d141710 commit d1c600d
Show file tree
Hide file tree
Showing 9 changed files with 525 additions and 73 deletions.
4 changes: 2 additions & 2 deletions api/apps/conversation_app.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ def set_conversation():
conv = {
"id": get_uuid(),
"dialog_id": req["dialog_id"],
"name": "New conversation",
"name": req.get("name", "New conversation"),
"message": [{"role": "assistant", "content": dia.prompt_config["prologue"]}]
}
ConversationService.save(**conv)
Expand Down Expand Up @@ -102,7 +102,7 @@ def rm():
def list_convsersation():
dialog_id = request.args["dialog_id"]
try:
convs = ConversationService.query(dialog_id=dialog_id)
convs = ConversationService.query(dialog_id=dialog_id, order_by=ConversationService.model.create_time, reverse=True)
convs = [d.to_dict() for d in convs]
return get_json_result(data=convs)
except Exception as e:
Expand Down
6 changes: 6 additions & 0 deletions api/utils/file_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -185,5 +185,11 @@ def thumbnail(filename, blob):
pass


def traversal_files(base):
for root, ds, fs in os.walk(base):
for f in fs:
fullname = os.path.join(root, f)
yield fullname



62 changes: 55 additions & 7 deletions deepdoc/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,17 +11,51 @@ English | [简体中文](./README_zh.md)

With a bunch of documents from various domains with various formats and along with diverse retrieval requirements,
an accurate analysis becomes a very challenge task. *Deep*Doc is born for that purpose.
There 2 parts in *Deep*Doc so far: vision and parser.
There are 2 parts in *Deep*Doc so far: vision and parser.
You can run the flowing test programs if you're interested in our results of OCR, layout recognition and TSR.
```bash
python deepdoc/vision/t_ocr.py -h
usage: t_ocr.py [-h] --inputs INPUTS [--output_dir OUTPUT_DIR]

options:
-h, --help show this help message and exit
--inputs INPUTS Directory where to store images or PDFs, or a file path to a single image or PDF
--output_dir OUTPUT_DIR
Directory where to store the output images. Default: './ocr_outputs'
```
```bash
python deepdoc/vision/t_recognizer.py -h
usage: t_recognizer.py [-h] --inputs INPUTS [--output_dir OUTPUT_DIR] [--threshold THRESHOLD] [--mode {layout,tsr}]

options:
-h, --help show this help message and exit
--inputs INPUTS Directory where to store images or PDFs, or a file path to a single image or PDF
--output_dir OUTPUT_DIR
Directory where to store the output images. Default: './layouts_outputs'
--threshold THRESHOLD
A threshold to filter out detections. Default: 0.5
--mode {layout,tsr} Task mode: layout recognition or table structure recognition
```

Our models are served on HuggingFace. If you have trouble downloading HuggingFace models, this might help!!
```bash
export HF_ENDPOINT=https://hf-mirror.com
```

<a name="2"></a>
## 2. Vision

We use vision information to resolve problems as human being.
- OCR. Since a lot of documents presented as images or at least be able to transform to image,
OCR is a very essential and fundamental or even universal solution for text extraction.

```bash
python deepdoc/vision/t_ocr.py --inputs=path_to_images_or_pdfs --output_dir=path_to_store_result
```
The inputs could be directory to images or PDF, or a image or PDF.
You can look into the folder 'path_to_store_result' where has images which demonstrate the positions of results,
txt files which contain the OCR text.
<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://lh6.googleusercontent.com/2xdiSjaGWkZ71YdORc71Ujf7jCHmO6G-6ONklzGiUYEh3QZpjPo6MQ9eqEFX20am_cdW4Ck0YRraXEetXWnM08kJd99yhik13Cy0_YKUAq2zVGR15LzkovRAmK9iT4o3hcJ8dTpspaJKUwt6R4gN7So" width="300"/>
<img src="https://github.com/infiniflow/ragflow/assets/12318111/f25bee3d-aaf7-4102-baf5-d5208361d110" width="900"/>
</div>

- Layout recognition. Documents from different domain may have various layouts,
Expand All @@ -39,11 +73,18 @@ We use vision information to resolve problems as human being.
- Footer
- Reference
- Equation

Have a try on the following command to see the layout detection results.
```bash
python deepdoc/vision/t_recognizer.py --inputs=path_to_images_or_pdfs --threshold=0.2 --mode=layout --output_dir=path_to_store_result
```
The inputs could be directory to images or PDF, or a image or PDF.
You can look into the folder 'path_to_store_result' where has images which demonstrate the detection results as following:
<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppstructure/docs/layout/layout.png?raw=true" width="900"/>
<img src="https://github.com/infiniflow/ragflow/assets/12318111/07e0f625-9b28-43d0-9fbb-5bf586cd286f" width="1000"/>
</div>

- Table Structure Recognition(TSR). Data table is a frequently used structure present data including numbers or text.
- Table Structure Recognition(TSR). Data table is a frequently used structure to present data including numbers or text.
And the structure of a table might be very complex, like hierarchy headers, spanning cells and projected row headers.
Along with TSR, we also reassemble the content into sentences which could be well comprehended by LLM.
We have five labels for TSR task:
Expand All @@ -52,8 +93,15 @@ We use vision information to resolve problems as human being.
- Column header
- Projected row header
- Spanning cell

Have a try on the following command to see the layout detection results.
```bash
python deepdoc/vision/t_recognizer.py --inputs=path_to_images_or_pdfs --threshold=0.2 --mode=tsr --output_dir=path_to_store_result
```
The inputs could be directory to images or PDF, or a image or PDF.
You can look into the folder 'path_to_store_result' where has both images and html pages which demonstrate the detection results as following:
<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://user-images.githubusercontent.com/10793386/139559159-cd23c972-8731-48ed-91df-f3f27e9f4d79.jpg" width="900"/>
<img src="https://github.com/infiniflow/ragflow/assets/12318111/cb24e81b-f2ba-49f3-ac09-883d75606f4c" width="1000"/>
</div>
<a name="3"></a>
Expand All @@ -71,4 +119,4 @@ The résumé is a very complicated kind of document. A résumé which is compose
with various layouts could be resolved into structured data composed of nearly a hundred of fields.
We haven't opened the parser yet, as we open the processing method after parsing procedure.



45 changes: 45 additions & 0 deletions deepdoc/vision/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,49 @@

from .ocr import OCR
from .recognizer import Recognizer
from .layout_recognizer import LayoutRecognizer
from .table_structure_recognizer import TableStructureRecognizer

def init_in_out(args):
from PIL import Image
import fitz
import os
import traceback
from api.utils.file_utils import traversal_files
images = []
outputs = []

if not os.path.exists(args.output_dir):
os.mkdir(args.output_dir)

def pdf_pages(fnm, zoomin=3):
nonlocal outputs, images
pdf = fitz.open(fnm)
mat = fitz.Matrix(zoomin, zoomin)
for i, page in enumerate(pdf):
pix = page.get_pixmap(matrix=mat)
img = Image.frombytes("RGB", [pix.width, pix.height],
pix.samples)
images.append(img)
outputs.append(os.path.split(fnm)[-1] + f"_{i}.jpg")

def images_and_outputs(fnm):
nonlocal outputs, images
if fnm.split(".")[-1].lower() == "pdf":
pdf_pages(fnm)
return
try:
images.append(Image.open(fnm))
outputs.append(os.path.split(fnm)[-1])
except Exception as e:
traceback.print_exc()

if os.path.isdir(args.inputs):
for fnm in traversal_files(args.inputs):
images_and_outputs(fnm)
else:
images_and_outputs(args.inputs)

for i in range(len(outputs)): outputs[i] = os.path.join(args.output_dir, outputs[i])

return images, outputs
26 changes: 19 additions & 7 deletions deepdoc/vision/layout_recognizer.py
Original file line number Diff line number Diff line change
@@ -1,17 +1,26 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import re
from collections import Counter
from copy import deepcopy

import numpy as np

from api.utils.file_utils import get_project_base_directory
from .recognizer import Recognizer
from deepdoc.vision import Recognizer


class LayoutRecognizer(Recognizer):
def __init__(self, domain):
self.layout_labels = [
labels = [
"_background_",
"Text",
"Title",
Expand All @@ -24,7 +33,8 @@ def __init__(self, domain):
"Reference",
"Equation",
]
super().__init__(self.layout_labels, domain,
def __init__(self, domain):
super().__init__(self.labels, domain,
os.path.join(get_project_base_directory(), "rag/res/deepdoc/"))

def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.7, batch_size=16):
Expand All @@ -37,7 +47,7 @@ def __is_garbage(b):
return any([re.search(p, b["text"]) for p in patt])

layouts = super().__call__(image_list, thr, batch_size)
# save_results(image_list, layouts, self.layout_labels, output_dir='output/', threshold=0.7)
# save_results(image_list, layouts, self.labels, output_dir='output/', threshold=0.7)
assert len(image_list) == len(ocr_res)
# Tag layout type
boxes = []
Expand Down Expand Up @@ -117,3 +127,5 @@ def findLayout(ty):
ocr_res = [b for b in ocr_res if b["text"].strip() not in garbag_set]
return ocr_res, page_layout



Loading

0 comments on commit d1c600d

Please sign in to comment.