AWS Textract Table Extraction

Recognize Tables in PDF Files and Convert Them into Pandas DataFrames Using AWS Textract

This repository demonstrates how to use AWS Textract to recognize and extract tables from PDF files and convert them into Pandas DataFrames for easy data manipulation. The project also integrates with AWS S3 for storing PDF files and extracted data.

Overview

This project uses AWS Textract, a service that automatically extracts text and data from scanned documents, to recognize tables in PDF files. The extracted tables are then converted into Pandas DataFrames for further analysis and processing. Files are stored in and retrieved from AWS S3 buckets.
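
As a rough illustration of the Textract call described above, the sketch below requests table analysis for a single S3 object. It assumes a boto3 Textract client is passed in; the function name `analyze_tables` is illustrative and not taken from this repository. Note that the synchronous `analyze_document` call accepts only single-page PDFs, so multi-page files need Textract's asynchronous API instead.

```python
def analyze_tables(textract_client, bucket, key):
    """Ask Textract for table structure in one S3-hosted document.

    textract_client is assumed to be boto3.client("textract").
    Returns the raw list of Textract blocks (PAGE, TABLE, CELL, WORD, ...).
    """
    response = textract_client.analyze_document(
        # Point Textract at the object in S3 instead of uploading bytes.
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        # TABLES asks for TABLE and CELL blocks in addition to plain text.
        FeatureTypes=["TABLES"],
    )
    return response["Blocks"]
```

With real credentials this could be called as `analyze_tables(boto3.client("textract"), "your-s3-bucket", "path/to/file.pdf")`. Passing the client in, rather than creating it inside the function, keeps the function easy to exercise with a stub.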

Features

AWS Textract Integration: Automatically extracts tables from PDF files using AWS Textract.
AWS S3 Integration: PDF files and extracted data are stored in and read from AWS S3.
Pandas DataFrames: Extracted tables are converted into Pandas DataFrames for easy analysis and manipulation.
CSV Export: Convert DataFrames to CSV for downstream processing or storage.
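
To show what converting an extracted table involves, here is a minimal sketch that turns the `Blocks` list of a Textract response into plain row lists. It assumes the standard Textract block layout (TABLE blocks point to CELL blocks, which point to WORD blocks through `CHILD` relationships) and ignores merged cells and selection elements. The helper names are illustrative, not this repository's code.

```python
def _child_ids(block):
    """Return the Ids listed under a block's CHILD relationships."""
    return [
        child_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
    ]


def cell_text(cell, by_id):
    """Join the WORD children of a CELL block into one string."""
    words = [
        by_id[cid]["Text"]
        for cid in _child_ids(cell)
        if by_id[cid]["BlockType"] == "WORD"
    ]
    return " ".join(words)


def tables_to_rows(blocks):
    """Build one grid (list of rows) per TABLE block in the response."""
    by_id = {block["Id"]: block for block in blocks}
    tables = []
    for table in blocks:
        if table["BlockType"] != "TABLE":
            continue
        cells = [by_id[i] for i in _child_ids(table) if by_id[i]["BlockType"] == "CELL"]
        if not cells:
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [[""] * n_cols for _ in range(n_rows)]
        for c in cells:
            # Textract row and column indices are 1-based.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c, by_id)
        tables.append(grid)
    return tables
```

Each grid can then be handed to pandas, for example `pd.DataFrame(grid[1:], columns=grid[0])`, which assumes the first row of the table holds the column headers.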

Requirements

AWS Account with access to Textract and S3.
IAM Role with appropriate permissions for Textract and S3.
AWS CLI configured with access credentials.
Python 3.7+ and the following libraries:
- boto3: AWS SDK for Python
- pandas: For working with DataFrames
- awscli: For AWS CLI commands (optional)
- textract-trp: Textract response parser (optional)

Installation

Clone the repository:

```
git clone https://github.com/yourusername/aws-textract-table-extraction.git
```

Navigate to the project directory:
```
cd aws-textract-table-extraction
```
Install the required dependencies:
```
pip install -r requirements.txt
```
Configure AWS CLI with your credentials:
```
aws configure
```

Usage

Upload a PDF to S3: Upload your PDF file to your S3 bucket. Note the bucket name and the object key; the extraction script needs both.
```
aws s3 cp path_to_your_file.pdf s3://your-s3-bucket/path/to/file.pdf
```
Run the table extraction: Execute the script to extract tables from the PDF stored in S3:
```
python extract_tables.py --bucket your-s3-bucket --file_key path/to/file.pdf
```
Convert to Pandas DataFrame: When the extraction completes, each detected table is converted into a Pandas DataFrame.
Save DataFrame to CSV (optional): You can export the extracted tables to CSV for further use:
```
df.to_csv('extracted_table.csv', index=False)
```
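
For multi-page PDFs, the extraction step in this section would most likely need Textract's asynchronous API, because the synchronous call handles only single-page documents. The sketch below starts an analysis job, polls until it finishes, and follows `NextToken` to gather every page of results. It assumes a boto3 Textract client is passed in; the function name, polling interval, and poll limit are illustrative choices, not this repository's settings.

```python
import time


def collect_table_blocks(textract_client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Run an asynchronous TABLES analysis and return all result blocks."""
    job = textract_client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
    )
    job_id = job["JobId"]

    # Poll until the job is no longer IN_PROGRESS, or give up.
    for _ in range(max_polls):
        page = textract_client.get_document_analysis(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if page["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with {page['JobStatus']}")

    # Results are paginated; keep requesting while a NextToken is returned.
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = textract_client.get_document_analysis(
            JobId=job_id, NextToken=page["NextToken"]
        )
        blocks.extend(page.get("Blocks", []))
    return blocks
```

Polling is the simplest approach for a script. For larger workloads, Textract can instead publish a completion message to an SNS topic, which avoids repeated status calls.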

Configuration

Before running the script, ensure that your AWS environment is correctly configured with the following:

S3 Bucket: A bucket to store and retrieve your PDF files.
AWS Textract Role: Ensure that your AWS IAM role has Textract and S3 access.
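
To make "Textract and S3 access" more concrete, here is one plausible minimal IAM policy, built as a Python dictionary so it can be printed as JSON. The action list is an assumption about which calls the script makes (table analysis plus reading and uploading objects), and `build_policy` is a hypothetical helper; trim or extend the actions to match your setup.

```python
import json


def build_policy(bucket):
    """Return a minimal IAM policy document for Textract plus one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "textract:AnalyzeDocument",
                    "textract:StartDocumentAnalysis",
                    "textract:GetDocumentAnalysis",
                ],
                # Textract actions are granted on all resources.
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                # Object-level access limited to the one bucket.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


def policy_json(bucket):
    """Render the policy as indented JSON, ready to paste into IAM."""
    return json.dumps(build_policy(bucket), indent=2)
```

Calling `print(policy_json("your-s3-bucket"))` produces a document that can be attached to the role as an inline policy.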

Examples

Here's a sample invocation of the script that extracts tables from a PDF:

```
python extract_tables.py --bucket my-bucket --file_key documents/sample.pdf
```
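
The source of `extract_tables.py` is not shown in this README, so the following is only a guess at how its command-line interface might be defined. It uses `argparse` with the two flags that appear in the command above; the function name `parse_args` and the help texts are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the --bucket and --file_key flags used in the example command."""
    parser = argparse.ArgumentParser(
        description="Extract tables from a PDF stored in S3 using AWS Textract."
    )
    parser.add_argument("--bucket", required=True, help="Name of the S3 bucket")
    parser.add_argument("--file_key", required=True, help="Object key of the PDF")
    # argv=None makes argparse read sys.argv[1:], as a script normally would.
    return parser.parse_args(argv)
```

Marking both flags as required means a missing bucket or key fails immediately with a usage message, rather than later as an AWS error.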
