doc-file #5

Closed
brry opened this issue Jul 17, 2016 · 10 comments
@brry

brry commented Jul 17, 2016

I have several .doc files that each (unzipped) contain only "[Content_Types].xml" and the folders "_rels" (with ".rels") and "theme" (with "theme/theme1.xml", "theme/themeManager.xml" and "theme/_rels/themeManager.xml.rels").

Any idea how to read the old ".doc" format?
(I hope it's OK to post this as an issue. Just delete it if not^^)

@hrbrmstr hrbrmstr self-assigned this Jul 17, 2016
@hrbrmstr
Owner

hrbrmstr commented Jul 17, 2016

If you have LibreOffice (not OpenOffice) installed, you can do something like (this is an OS X command-line):

/Applications/LibreOffice.app/Contents/MacOS/soffice --convert-to docx:"MS Word 2007 XML" filename.doc --headless 

which will convert .doc files to .docx. I believe Windows requires single dashes (-) vs double dashes (--) for the cmd line param options.

It'd be somewhat straightforward for me to write a function to identify whether libreoffice is installed on a given system (win/mac/linux) and then perform this conversion if a .doc is detected (it's going to be a while before I get to that tho).

This may also work with OpenOffice but the last time I tried the soffice command in headless mode with OpenOffice it failed miserably.

@hrbrmstr
Owner

NOTES

probable linux locations

  • /usr/bin/soffice
  • /usr/local/bin/soffice

probable macOS locations

  • /Applications/LibreOffice.app/Contents/MacOS/soffice
  • ~/Applications/LibreOffice.app/Contents/MacOS/soffice

probable Windows locations

  • C:\Program Files\LibreOffice #.#\program\soffice.exe
  • C:\progra~1\libreo~1\program\soffice.exe

should work on linux/macOS

  • soffice --convert-to docx:"MS Word 2007 XML" --headless --outdir (somedir) filename.doc

should work on Windows

  • soffice -convert-to docx:"MS Word 2007 XML" -headless -outdir (somedir) filename.doc
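
From R, the argument list for these commands could be assembled roughly as follows. This is only a sketch: `soffice_args()` is a hypothetical helper name (not part of docxtractr), and the single-dash Windows variant simply follows the untested notes above.

```r
# Hypothetical helper: build the soffice arguments from the notes above.
# `windows = TRUE` switches to the single-dash style the notes suggest
# for Windows (an unverified assumption).
soffice_args <- function(doc, outdir, windows = FALSE) {
  dash <- if (windows) "-" else "--"
  # Quote explicitly so paths and the filter name survive the shell.
  qtype <- if (windows) "cmd" else "sh"
  c(
    paste0(dash, "convert-to"), shQuote("docx:MS Word 2007 XML", type = qtype),
    paste0(dash, "headless"),
    paste0(dash, "outdir"), shQuote(outdir, type = qtype),
    shQuote(doc, type = qtype)
  )
}
```

For example, `system2(soffice, soffice_args("filename.doc", tempdir()))` would run the conversion, where `soffice` holds one of the probable locations listed above.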

@boksic1986

Can it change the table of a docx file?

@hrbrmstr
Copy link
Owner

@boksic1986 If you're asking if the package can modify the contents of a table in a Microsoft Word document, it cannot. If you desire this functionality, please file a new issue with some specifics on what you are looking for.

@ChrisMuir
Contributor

I know this is an old issue, but I had a work need to use this package for both .docx and .doc files, so I've started making revisions in a fork to make it work for both file types. Figured I'd share here, I'd be happy to open a PR if you'd like (or you can use any pieces/parts if you'd like). See my latest commit for all edits.

I'm using the LibreOffice software to convert .doc to .docx, as suggested in this thread. I'm working on Windows so currently the edits are only suited to work on Windows (and it's on the user to figure out their file path of soffice.exe and register it using set_libreoffice_path()). I'd love to expand the functionality though, to work on Mac and Linux. I'm testing using some local files and two urls, everything is working great on my machine:

set_libreoffice_path("C:\\Program Files\\LibreOffice\\program\\soffice.exe")

paths <- c(
  "http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1709151519250301478.docx", 
  "http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1708211847068712082.doc"
)

for (path in paths) {
  tbl <- docxtractr::read_docx(path)
  df <- docxtractr::docx_extract_tbl(tbl, 1)
  print(dim(df))
  Sys.sleep(5)
}
#> [1] 69 10
#> [1] 15 10

If I make any more progress I can update here.

Also, just want to say this package is AWESOME! About a year ago I had a need to get data from both .docx and .doc files, I resorted to using Python and the win32com module to extract all content as a string, and then piece the data tables back together.....it was kind of a nightmare. So glad I found this, thank you for building it!

@hrbrmstr
Owner

wow (like, srsly: wow!) Definitely add yourself as an aut+ctb in the DESCRIPTION and shoot a PR over. I can poke at cross-platform bits (I have all three OSes but only rarely fire up Windows and always have libreoffice installed for forensic-tool-purposes) and also any CRAN issues that might be there due to a dep on libreoffice. This def needs to get on CRAN as I think a lot of folks are feeling similar pain.

This is great! #ty!

@ChrisMuir
Contributor

Sure thing! Just opened a PR.

I plan on working on this some more, I'll let you know if I make more progress. If there's anything in particular you'd like help with on this pkg let me know, I'm happy to help!

@ChrisMuir
Contributor

I worked on this some more, I have it up and running on my Mac for both .docx and .doc files (see my commit ecbf2a3). I'm basically just splitting the guts of convert_doc_to_docx() into two functions, convert_win() and convert_osx(), and then using Sys.info() to determine if the os is Windows or not.

I don't have much experience with calling command line tools within an R package, so I'm not sure if my implementation choices are the best.

Again, let me know if you want a PR for these edits 😄

# Same test on a Mac
docxtractr::set_libreoffice_path("/Applications/LibreOffice.app/Contents/MacOS/soffice")

paths <- c(
  "http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1709151519250301478.docx", 
  "http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1708211847068712082.doc"
)

for (path in paths) {
  tbl <- docxtractr::read_docx(path)
  df <- docxtractr::docx_extract_tbl(tbl, 1)
  print(dim(df))
  Sys.sleep(5)
}
#> [1] 69 10
#> [1] 15 10
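
On the question of calling command-line tools from inside an R package: one common pattern is to confirm the executable exists, run it with `system2()`, and then check both the exit status and the expected output file. The following is a sketch under those assumptions, not the fork's implementation; `run_soffice()` and its parameters are hypothetical names.

```r
# Hypothetical wrapper: run soffice and fail loudly if anything goes wrong.
# `args` is a character vector of command-line arguments and `expected_out`
# is the .docx file the conversion is supposed to produce.
run_soffice <- function(soffice, args, expected_out) {
  if (!file.exists(soffice)) {
    stop("soffice not found at: ", soffice, call. = FALSE)
  }
  # With stdout/stderr discarded, system2() returns the exit status.
  status <- system2(soffice, args, stdout = FALSE, stderr = FALSE)
  if (status != 0L || !file.exists(expected_out)) {
    stop("LibreOffice conversion failed (exit status ", status, ")",
         call. = FALSE)
  }
  invisible(expected_out)
}
```

Checking for the output file as well as the exit status guards against the case where the tool exits cleanly but writes nothing.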

@ChrisMuir
Contributor

I've been thinking about how to "automatically" determine the path to soffice. I've looked around for similar setups in other packages, but every example I can find involves software that is already on the system PATH, so it's easy to just use Sys.which() to get the software file path.

We could do a manual search (example below), but that doesn't seem ideal. Is something like this what you had in mind?

# Check the R option "path_to_libreoffice". If it's NULL, call lo_find(), which
# will try to determine the local path to the LibreOffice file "soffice". If
# lo_find() is successful, the path to "soffice" will be assigned to the
# option "path_to_libreoffice", otherwise an error is thrown.
lo_assert <- function() {
  lo_path <- getOption("path_to_libreoffice")
  
  if (is.null(lo_path)) {
    lo_path <- lo_find()
    set_libreoffice_path(lo_path)
  }
}

# Returns the local path to LibreOffice file "soffice". Search is performed by 
# looking in the known file locations for the current OS. If OS is not Linux, 
# OSX, or Windows, an error is thrown. If path to "soffice" is not found, an 
# error is thrown.
lo_find <- function() {
  user_os <- Sys.info()["sysname"]
  if (!user_os %in% names(lo_paths_to_check)) {
    stop(lo_path_missing, call. = FALSE)
  }
  
  lo_path <- NULL
  for (path in lo_paths_to_check[[user_os]]) {
    if (file.exists(path)) {
      lo_path <- path
      break
    }
  }
  
  if (is.null(lo_path)) {
    stop(lo_path_missing, call. = FALSE)
  }
  
  lo_path
}

# List obj containing known locations of LibreOffice file "soffice".
lo_paths_to_check <- list(
  "Linux" = c("/usr/bin/soffice",
              "/usr/local/bin/soffice"),
  "Darwin" = c("/Applications/LibreOffice.app/Contents/MacOS/soffice",
               "~/Applications/LibreOffice.app/Contents/MacOS/soffice"),
  "Windows" = c("C:\\Program Files\\LibreOffice\\program\\soffice.exe",
                "C:\\progra~1\\libreo~1\\program\\soffice.exe")
)

# Error message thrown if LibreOffice file "soffice" cannot be found.
lo_path_missing <- paste(
  "LibreOffice software required to read '.doc' files.",
  "Cannot determine file path to LibreOffice.",
  "To download LibreOffice, visit: https://www.libreoffice.org/ \n",
  "If you've already installed the software, use function",
  "'set_libreoffice_path()' to point R to your local 'soffice' file"
)
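
For context, lo_assert() above reads the option with getOption() and writes it through set_libreoffice_path(). A setter along the following lines would be consistent with that; this is a guess at its shape under a different, clearly hypothetical name, and the fork's actual function may differ.

```r
# Hypothetical sketch of a setter that pairs with
# getOption("path_to_libreoffice"); not the fork's actual code.
set_libreoffice_path_sketch <- function(path) {
  stopifnot(is.character(path), length(path) == 1L)
  if (!file.exists(path)) {
    stop("No file found at: ", path, call. = FALSE)
  }
  # Store the path as a session-wide R option.
  options(path_to_libreoffice = path)
  invisible(path)
}
```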

And then lo_assert() could be inserted near the top of read_docx(), like so:

## <snip at line 25>
# Check to see if input is a .doc file
is_input_doc <- is_doc(path)

# If input is a .doc file, create a temp .doc file
if (is_input_doc) {
  lo_assert()
  tmpf_doc <- tempfile(tmpdir = tmpd, fileext = ".doc")
  tmpf_docx <- gsub("\\.doc$", ".docx", tmpf_doc)
} else {
  tmpf_doc <- NULL
  tmpf_docx <- NULL
}
## <continue with function>
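
One possible refinement of the manual search: try Sys.which() first, in case soffice happens to be on the PATH, and only then fall back to the list of known locations. A minimal sketch, with `first_existing()` and `find_soffice()` as hypothetical helper names:

```r
# Return the first path in `paths` that exists on disk, or NULL if none do.
first_existing <- function(paths) {
  paths <- path.expand(paths)
  hits <- paths[file.exists(paths)]
  if (length(hits) > 0L) hits[[1L]] else NULL
}

# Prefer an soffice found on the PATH; otherwise search `candidates`,
# e.g. lo_paths_to_check[[Sys.info()[["sysname"]]]].
find_soffice <- function(candidates) {
  on_path <- unname(Sys.which("soffice"))
  if (!is.na(on_path) && nzchar(on_path)) {
    return(on_path)
  }
  first_existing(candidates)
}
```

Returning NULL rather than throwing leaves the choice of error message to the caller, which could still stop() with lo_path_missing.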

@bedantaguru

I was thinking of an alternative way of supporting doc-files. I opened #23 for it.
