Skip to content

Hegghammer/daiR

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
R
 
 
 
 
man
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

daiR: OCR with Google Document AI in R

daiR is an R package for Google Document AI, a powerful server-based OCR processor with support for over 60 languages. The package provides an interface for the Document AI API and comes with additional tools for output file parsing and text reconstruction. See the daiR website and this journal article for more details.

Use

Quick OCR short documents:

## NOT RUN
library(daiR)
response <- dai_sync("file.pdf")
text <- text_from_dai_response(response)
cat(text)

Batch process asynchronously via Google Storage:

## NOT RUN
library(googleCloudStorageR)
library(purrr)
my_files <- c("file1.pdf", "file2.pdf", "file3.pdf")
map(my_files, gcs_upload)
dai_async(my_files)
contents <- gcs_list_objects()
output_files <- grep("json$", contents$name, value = TRUE)
map(output_files, ~ gcs_get_object(.x, saveToDisk = file.path(tempdir(), .x)))
sample_text <- text_from_dai_file(file.path(tempdir(), output_files[1]))
cat(sample_text)

Turn images of tables into R dataframes:

## NOT RUN:
response <- dai_sync_tab("tables.pdf")
dfs <- tables_from_dai_response(response)

Requirements

Google Document AI is a paid service that requires a Google Cloud account and a Google Storage bucket. I recommend using Mark Edmondson's googleCloudStorageR package in combination with daiR.

Installation

Install daiR from CRAN:

install.packages("daiR")

Or install the latest development version from Github:

devtools::install_github("hegghammer/daiR")

Citation

To cite daiR in publications, please use

Hegghammer, T., (2021). daiR: an R package for OCR with Google Document AI. Journal of Open Source Software, 6(68), 3538, https://doi.org/10.21105/joss.03538

Bibtex:

@article{Hegghammer2021,
  doi = {10.21105/joss.03538},
  url = {https://doi.org/10.21105/joss.03538},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {68},
  pages = {3538},
  author = {Thomas Hegghammer},
  title = {daiR: an R package for OCR with Google Document AI},
  journal = {Journal of Open Source Software}
}

Acknowledgments

Thanks to Mark Edmondson, Hallvar Gisnås, Will Hanley, Neil Ketchley, Trond Arne Sørby, Chris Barrie, and Geraint Palmer for contributions to the project.

Code of conduct

Please note that the daiR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

DOI CRAN status R-CMD-check Codecov test coverage

About

R package for Google Document AI

Topics

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Code of conduct

Stars

Watchers

Forks

Languages