Extract text and transcriptions from PDFs and videos uploaded to an S3 bucket.
This project deploys a Lambda function triggered by S3 uploads. The function performs the following tasks:
- PDF Processing:
  - When a PDF file is uploaded, the function extracts images from the PDF.
  - It then processes the extracted images using Amazon Textract, an AWS service for optical character recognition (OCR).
  - The extracted text from the images is then uploaded to the same S3 bucket, with the file path modified to include the image name and "textract.txt".
- Video Processing:
  - When a video file (e.g., .mp4) is uploaded, the function submits a transcription job to Amazon Transcribe, an AWS service for speech-to-text conversion.
  - The transcription output is then uploaded to the S3 bucket, with the file path modified to include the original file name and "transcribe.out".
- Transcription Processing:
  - When the transcription output file (with the ".transcribe.out" extension) is uploaded, the function reads the JSON data, extracts the transcript, and uploads it to the S3 bucket with the file path modified to include the original file name and "transcribe.txt".
  - After the transcript is uploaded, the function deletes the intermediate artifacts (the ".transcribe.out" file and any associated objects) from the S3 bucket.
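As a rough illustration of how a single handler might tell the three cases apart, the sketch below routes an uploaded key by its suffix, derives the final transcript key, and pulls the transcript text out of a Transcribe result. The function names, the `.mp4`-only video list, and the exact output key layout are assumptions made for illustration and are not taken from this project's Lambda code. Only the `results.transcripts[].transcript` JSON shape and the URL-encoded keys in S3 events come from the AWS formats themselves.

```python
import json
from urllib.parse import unquote_plus

TRANSCRIBE_SUFFIX = ".transcribe.out"  # intermediate Transcribe output named in the README
VIDEO_SUFFIXES = (".mp4",)             # assumption: .mp4 is the only video type the README names


def classify(key: str) -> str:
    """Return which of the three pipelines a newly uploaded key belongs to."""
    lowered = key.lower()
    if lowered.endswith(".pdf"):
        return "pdf"
    if lowered.endswith(TRANSCRIBE_SUFFIX):
        return "transcription"
    if lowered.endswith(VIDEO_SUFFIXES):
        return "video"
    return "ignore"


def transcript_key(out_key: str) -> str:
    """Swap the trailing '.out' for '.txt' (hypothetical naming scheme)."""
    return out_key[: -len(".out")] + ".txt"


def extract_transcript(raw_json: str) -> str:
    """Join the transcript strings from an Amazon Transcribe result document."""
    results = json.loads(raw_json)["results"]
    return " ".join(t["transcript"] for t in results["transcripts"])


def uploaded_keys(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 notification event; keys arrive URL-encoded."""
    return [
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]
```

In this sketch, generated `textract.txt` and `transcribe.txt` objects classify as "ignore", so writing results back to the same bucket would not retrigger processing.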
Optional: register pre-commit hooks and set up asdf for dependencies.

    make init

Deploy the infrastructure (S3 trigger to a Lambda container).

    terraform init && terraform apply

Deploy code changes (uses containers on Lambda).

Set up the Python environment for local development.

    cd lambda
    make init

    make deploy-container function=my-function

Requirements:

| Name | Version |
|---|---|
| terraform | >= 1.0 |
| aws | ~> 5.0 |
| docker | >= 3.0 |

Providers:

| Name | Version |
|---|---|
| aws | ~> 5.0 |

Modules:

| Name | Source | Version |
|---|---|---|
| docker_image | terraform-aws-modules/lambda/aws//modules/docker-build | n/a |
| lambda | terraform-aws-modules/lambda/aws | n/a |

Resources:

| Name | Type |
|---|---|
| aws_iam_role.lambda | resource |
| aws_iam_role_policy_attachment.test_attach | resource |
| aws_s3_bucket_notification.main | resource |
| aws_caller_identity.current | data source |
| aws_ecr_authorization_token.token | data source |
| aws_s3_bucket.main | data source |

Inputs:

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| region | region | `string` | `"us-east-1"` | no |
| s3_bucket | s3 bucket to run image extraction against | `string` | n/a | yes |

No outputs.