Extractors

This plug-in lets Vim read text documents such as PDF, Microsoft Office files (Word (doc(x)), Excel (xls(x)), PowerPoint (ppt(x))), Open Document (odt), EPUB, ... Text extraction depends on external tools, but most use cases are covered by installing

LibreOffice and a common text browser (such as lynx), and
pdftotext.
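
As a quick check that these tools are installed, a minimal shell sketch can list whichever of them are missing from the PATH. The helper name missing_tools is hypothetical, and soffice is assumed to be the name of the LibreOffice executable on your system:

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH,
# one per line. Prints nothing if all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: soffice is the LibreOffice executable name on this system.
missing_tools soffice lynx pdftotext
```

An empty output means all three tools were found.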

Extractors

The plug-in uses appropriate external converters whenever they are available, such as unrtf, pandoc, docx2txt.pl, odt2txt, xlscat, xlsx2csv.py or pptx2md ..., and otherwise falls back to:

Either LibreOffice, an office suite that (together with a common text browser such as lynx) can handle all the formats listed above except PDF. (On Microsoft Windows, ensure after installation that the folder containing the executable, by default %ProgramFiles%\LibreOffice\program, is added to the %PATH% environment variable.)
Or Tika, a content extractor that can handle all the formats listed above and many more. To use it:
1. Download the latest runnable tika-app-...jar from Tika and save it as ~/bin/tika.jar (on Linux) or %USERPROFILE%\bin\tika.jar (on Microsoft Windows).
2. Create
  - on Linux, a shell script ~/bin/tika that reads
```
    #!/bin/sh
    exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null
```
  and mark it executable (with chmod a+x ~/bin/tika).
  - on Microsoft Windows, a batch script %USERPROFILE%\bin\tika.bat that reads
```
    @echo off
    java -Dfile.encoding=UTF-8 -jar "%USERPROFILE%\bin\tika.jar" %*
```
3. Add the folder of the newly created tika executable to your environment variable $PATH (on Linux) or %PATH% (on Microsoft Windows):
  - on Linux, by adding the following line to ~/.profile (bash) or ~/.zshenv (zsh):
```
    PATH=$PATH:~/bin
```
  - on Microsoft Windows, a convenient program to update %PATH% is Rapidee.
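
On Linux, if the profile file may be sourced more than once, the PATH entry can be added idempotently. The following is a minimal sketch, not part of the plug-in; the function name path_append is hypothetical:

```shell
#!/bin/sh
# Append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$PATH:$1" ;;     # otherwise append it at the end
    esac
}

path_append "$HOME/bin"
export PATH

# Warn if the tika wrapper script is still not found.
command -v tika >/dev/null 2>&1 || echo "tika not found on PATH" >&2
```

Once the wrapper is found, running tika --text on a document should print its plain-text content.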

OCR

To extract (English) text from image files in common formats, the plug-in uses tesseract, which is available for Microsoft Windows and Linux, whenever its executable is found. On Microsoft Windows, ensure after installation that the folder containing its executable, by default %ProgramFiles%\Tesseract-OCR, is added to the %PATH% environment variable.

Pass additional command-line options to tesseract via g:office_tesseract (empty by default), for example

  let g:office_tesseract = '-l eng+ita'

to properly extract Italian words (as well as English ones).
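
Each language passed with -l needs its trained data installed. As a rough sketch, the following shell helper (the name missing_langs is hypothetical) compares a language specification against the output of tesseract --list-langs and prints the languages that are not installed:

```shell
#!/bin/sh
# $1: a "+"-separated language spec such as "eng+ita"
# $2: the installed languages, one per line
# Prints every language from the spec that is missing from the list.
missing_langs() {
    printf '%s\n' "$1" | tr '+' '\n' | while read -r lang; do
        printf '%s\n' "$2" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
    done
}

# Only query tesseract if it is installed; its header line is ignored
# because only whole-line matches count.
if command -v tesseract >/dev/null 2>&1; then
    missing_langs "eng+ita" "$(tesseract --list-langs 2>&1)"
fi
```

An empty output means both English and Italian are available.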

Other (media) file formats

To go even further, for example to read media files (among many other file formats) in Vim, add this Vimscript snippet from lesspipe.sh to your vimrc.

Pandoc

To convert a file to markdown, add the following command to your vimrc and run :PandocToMarkdown inside the buffer of the opened file:

  command! -range=% PandocToMarkdown exe '<line1>,<line2>!pandoc --wrap=preserve --from='..PandocFiletype(&l:filetype)..' --to markdown %:S'
  function! PandocFiletype(filetype) abort
    if a:filetype ==# 'tex'
      return 'latex'
    elseif a:filetype ==# 'pandoc'
      return 'markdown'
    elseif a:filetype ==# 'text' || empty(a:filetype)
      return expand('%:e')
    else
      return a:filetype
    endif
  endfunction
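
To try the conversion outside Vim, the same filetype mapping can be sketched in shell. The function name pandoc_format and the file name paper.tex are hypothetical; the final line only prints the pandoc command instead of running it:

```shell
#!/bin/sh
# Map a Vim filetype to a pandoc input format, mirroring PandocFiletype.
# $1: Vim filetype (may be empty)
# $2: file name, used for the extension fallback
pandoc_format() {
    case $1 in
        tex)    echo latex ;;
        pandoc) echo markdown ;;
        text|'')
            # Fall back to the file extension, or nothing if there is none.
            case $2 in
                *.*) echo "${2##*.}" ;;
                *)   echo "" ;;
            esac ;;
        *)      echo "$1" ;;
    esac
}

# Show the command line for a LaTeX file.
echo pandoc --wrap=preserve --from="$(pandoc_format tex paper.tex)" --to markdown paper.tex
```

Dropping the leading echo runs the conversion, provided pandoc is installed and the file exists.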
