mett29/ppt2txt

ppt2txt

A pure-Python utility for extracting text from PPT files.

The code is based on the official documentation for MS-PPT files available at https://msopenspecs.azureedge.net/files/MS-PPT/%5bMS-PPT%5d.pdf.

How to install?

pip install ppt2txt

How to run?

  • From command line:
ppt2txt file.ppt -o output_dir
  • From python:
import ppt2txt

# extract content
parsed_ppt_dict = ppt2txt.process("file.ppt") 
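
To handle several files at once, the call above can be wrapped in a loop. The following is a minimal sketch, assuming only that `process` takes a file path and returns a dictionary, as shown above. `extract_folder` is a hypothetical helper and not part of ppt2txt. The extraction function is passed in as an argument so the helper does not depend on how the package is imported.

```python
from pathlib import Path


def extract_folder(folder, process):
    """Run `process` on every .ppt file directly inside `folder`.

    `process` is assumed to behave like ppt2txt.process: it receives a
    file path and returns a dictionary. This helper is hypothetical and
    is not provided by ppt2txt itself.
    """
    results = {}
    # Sort the paths so the result order is stable across runs.
    for path in sorted(Path(folder).glob("*.ppt")):
        results[path.name] = process(str(path))
    return results
```

With the package installed, a call might look like `extract_folder("presentations", ppt2txt.process)`, where `presentations` is a placeholder directory name.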

Output

parsed_ppt_dict is a dictionary with the following structure:

{
    "filename": "file.ppt",
    "slides": 4,
    "content": {
        "0": "Text from the first record",
        "1": "Text from the second record"
    }
}

where:

  • filename is the name of the input file
  • slides is the number of slides
  • content is a dictionary with one entry for each text record found in the document, keyed by the record's index as a string (as in the example above)
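
The structure above can be flattened into a single string. This is a minimal sketch, assuming the keys of `content` are always decimal strings as in the example. `records_to_text` is a hypothetical helper, not part of ppt2txt. The keys are sorted numerically because plain string sorting would place "10" before "2".

```python
def records_to_text(parsed, separator="\n"):
    """Join all text records of a parsed PPT dictionary in index order.

    Assumes `parsed["content"]` maps decimal string keys ("0", "1", ...)
    to text, as in the documented example. Hypothetical helper, not part
    of ppt2txt.
    """
    content = parsed["content"]
    # Sort by integer value so that "2" comes before "10".
    ordered_keys = sorted(content, key=int)
    return separator.join(content[key] for key in ordered_keys)


# Sample dictionary copied from the documented output structure.
sample = {
    "filename": "file.ppt",
    "slides": 4,
    "content": {
        "0": "Text from the first record",
        "1": "Text from the second record",
    },
}

print(records_to_text(sample))
```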
