doc-file #5
Comments
If you have LibreOffice (not OpenOffice) installed, you can do something like (this is an OS X command-line):
which will convert
It'd be somewhat straightforward for me to write a function to identify whether LibreOffice is installed on a given system (Windows/macOS/Linux) and then perform this conversion if a
This may also work with OpenOffice, but the last time I tried the
NOTES
probable Linux locations
probable macOS locations
probable Windows locations
should work on Linux/macOS
should work on Windows
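The command line from the comment above did not survive in this copy of the thread, so the following is only a sketch of what such a conversion could look like when driven from R. The helper names `lo_convert_args()` and `convert_doc_to_docx()` are hypothetical and not part of the package; the `--headless`, `--convert-to`, and `--outdir` flags are LibreOffice's own command-line options.

```r
# Hypothetical helper: builds the argument vector for a headless
# LibreOffice conversion of a .doc file to .docx.
lo_convert_args <- function(doc_path, out_dir = tempdir()) {
  c("--headless", "--convert-to", "docx",
    "--outdir", shQuote(out_dir), shQuote(doc_path))
}

# Hypothetical wrapper: runs the conversion and returns the path of the
# converted file. `lo_path` is assumed to point at a working soffice binary.
convert_doc_to_docx <- function(doc_path, lo_path, out_dir = tempdir()) {
  status <- system2(lo_path, lo_convert_args(doc_path, out_dir),
                    stdout = FALSE, stderr = FALSE)
  # LibreOffice keeps the base name and swaps the extension.
  out_file <- file.path(
    out_dir,
    sub("\\.doc$", ".docx", basename(doc_path), ignore.case = TRUE)
  )
  if (status != 0 || !file.exists(out_file)) {
    stop("LibreOffice conversion failed for: ", doc_path, call. = FALSE)
  }
  out_file
}
```

Building the arguments separately from running them keeps the part that depends on a local LibreOffice install isolated, which should make the rest easier to check on machines without it.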
Can it change a table in a docx file?
@boksic1986 If you're asking if the package can modify the contents of a table in a Microsoft Word document, it cannot. If you desire this functionality, please file a new issue with some specifics on what you are looking for.
I know this is an old issue, but I had a work need to use this package for both
I'm using the LibreOffice software to convert
set_libreoffice_path("C:\\Program Files\\LibreOffice\\program\\soffice.exe")
paths <- c(
"http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1709151519250301478.docx",
"http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1708211847068712082.doc"
)
for (path in paths) {
  tbl <- docxtractr::read_docx(path)
  df <- docxtractr::docx_extract_tbl(tbl, 1)
  print(dim(df))
  Sys.sleep(5)
}
#> [1] 69 10
#> [1] 15 10
If I make any more progress I can update here. Also, just want to say this package is AWESOME! About a year ago I had a need to get data from both
wow (like, srsly: wow!) Definitely add yourself as an aut+ctb into the DESCRIPTION and shoot a PR over. I can poke at cross-platform bits (I have all three OSes but only rarely fire up Windows and always have libreoffice installed for forensic-tool-purposes) and also any CRAN issues that might be there due to a dep on libreoffice. This def needs to get on CRAN as I think a lot of folks are feeling similar pain. This is great! #ty!
Sure thing! Just opened a PR. I plan on working on this some more, and I'll let you know if I make more progress. If there's anything in particular you'd like help with on this pkg let me know, I'm happy to help!
I worked on this some more, I have it up and running on my Mac for both
I don't have much experience with calling command line tools within an R package, so I'm not sure if my implementation choices are the best. Again, let me know if you want a PR for these edits 😄
# Same test on a Mac
docxtractr::set_libreoffice_path("/Applications/LibreOffice.app/Contents/MacOS/soffice")
paths <- c(
"http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1709151519250301478.docx",
"http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1708211847068712082.doc"
)
for (path in paths) {
  tbl <- docxtractr::read_docx(path)
  df <- docxtractr::docx_extract_tbl(tbl, 1)
  print(dim(df))
  Sys.sleep(5)
}
#> [1] 69 10
#> [1] 15 10
I've been thinking about how to "automatically" determine the path to
We could do a manual search (example below), but that doesn't seem ideal. Is something like this what you had in mind?
# Check option "path_to_libreoffice". If it's NULL, call lo_find(), which
# will try to determine the local path to LibreOffice file "soffice". If
# lo_find() is successful, the path to "soffice" will be assigned to option
# "path_to_libreoffice", otherwise an error is thrown.
lo_assert <- function() {
  lo_path <- getOption("path_to_libreoffice")
  if (is.null(lo_path)) {
    lo_path <- lo_find()
    set_libreoffice_path(lo_path)
  }
}
# Returns the local path to LibreOffice file "soffice". Search is performed by
# looking in the known file locations for the current OS. If OS is not Linux,
# OSX, or Windows, an error is thrown. If path to "soffice" is not found, an
# error is thrown.
lo_find <- function() {
  user_os <- Sys.info()["sysname"]
  if (!user_os %in% names(lo_paths_to_check)) {
    stop(lo_path_missing, call. = FALSE)
  }
  lo_path <- NULL
  for (path in lo_paths_to_check[[user_os]]) {
    if (file.exists(path)) {
      lo_path <- path
      break
    }
  }
  if (is.null(lo_path)) {
    stop(lo_path_missing, call. = FALSE)
  }
  lo_path
}
# List obj containing known locations of LibreOffice file "soffice".
lo_paths_to_check <- list(
  "Linux" = c("/usr/bin/soffice",
              "/usr/local/bin/soffice"),
  "Darwin" = c("/Applications/LibreOffice.app/Contents/MacOS/soffice",
               "~/Applications/LibreOffice.app/Contents/MacOS/soffice"),
  "Windows" = c("C:\\Program Files\\LibreOffice\\program\\soffice.exe",
                "C:\\progra~1\\libreo~1\\program\\soffice.exe")
)
# Error message thrown if LibreOffice file "soffice" cannot be found.
lo_path_missing <- paste(
"LibreOffice software required to read '.doc' files.",
"Cannot determine file path to LibreOffice.",
"To download LibreOffice, visit: https://www.libreoffice.org/ \n",
"If you've already downloaded the software, use function",
"'set_libreoffice_path()' to point R to your local 'soffice.exe' file"
)
And then
## <snip at line 25>
# Check to see if input is a .doc file
is_input_doc <- is_doc(path)
# If input is a .doc file, create a temp .doc file
if (is_input_doc) {
  lo_assert()
  tmpf_doc <- tempfile(tmpdir = tmpd, fileext = ".doc")
  tmpf_docx <- gsub("\\.doc$", ".docx", tmpf_doc)
} else {
  tmpf_doc <- NULL
  tmpf_docx <- NULL
}
## <continue with function>
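The snippet above calls `is_doc()` without showing its definition. A minimal stand-in, assuming the check only needs to look at the file extension, might be the following; the real helper in the PR may well differ.

```r
# Hypothetical stand-in for is_doc(): TRUE when the path (or URL) ends in
# ".doc", case-insensitively. A ".docx" path returns FALSE.
is_doc <- function(path) {
  tolower(tools::file_ext(path)) == "doc"
}
```

An extension check like this also handles the download URLs used in the earlier tests, since they end in `.doc` or `.docx`, but it says nothing about the actual contents of the file.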
I was thinking of an alternative way of supporting doc-files. I opened #23 for it.
I have several .doc files that each (unzipped) only contain "[Content_Types].xml" and the folders "_rels" (with ".rls") and "theme" (with "theme/theme1.xml", "theme/themeManager.xml" and "theme/_rels/themeManager.xml.rels").
Any idea how to read the old ".doc" format?
(I hope it's OK to post this as an issue. Just delete it if not^^)
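Since the files described here can be unzipped, one thing that may be worth checking is what container format they really use, regardless of the extension. The sketch below reads the first bytes of a file and compares them with two well-known signatures: the OLE2 compound-file header used by legacy binary `.doc` files and the ZIP local-file header used by `.docx`. The function name `sniff_container()` is hypothetical and not part of the package.

```r
# Hypothetical helper: guesses the container format from the file signature.
sniff_container <- function(path) {
  sig <- readBin(path, what = "raw", n = 8L)
  # OLE2 compound file signature (legacy binary Office formats).
  ole2 <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))
  # ZIP local file header signature ("PK\003\004").
  zip <- as.raw(c(0x50, 0x4B, 0x03, 0x04))
  if (length(sig) >= 8L && identical(sig, ole2)) {
    "ole2"
  } else if (length(sig) >= 4L && identical(sig[1:4], zip)) {
    "zip"
  } else {
    "unknown"
  }
}
```

A result of `"zip"` for a file named `.doc` would suggest it is not in the old binary format at all, though it does not by itself say whether the archive holds a complete Word document.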