feat(document): add docx doc pptx ppt html to transform to text in markdown format #232

chuang8511 · 2024-07-22T16:33:09Z

Because

We want to handle more types of unstructured data.

This commit

Adds transformers that convert docx, doc, pptx, ppt, and html files to Markdown.

At a high level, we check the file extension:
- If it is pdf, use pdfplumber directly to convert the PDF to Markdown.
- If it is docx / doc / ppt / pptx, convert the document to PDF with LibreOffice, then convert the PDF to Markdown with pdfplumber.
- If it is html, convert it to Markdown with the html-to-markdown package.

We handle docx/doc separately from ppt/pptx because users rely on them for different use cases, so the PDF-to-Markdown logic for each is expected to differ.
For that reason, we split them up front.
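
As a rough illustration of the routing described above, here is a minimal Python sketch. It is not the code in this PR: the function names, step labels, and the `soffice` binary name are assumptions made for the example (some systems expose the binary as `libreoffice`). It only plans the steps and builds the LibreOffice command line; it does not run any conversion.

```python
from pathlib import Path

# Hypothetical groupings: documents and slides are kept apart because the
# PR description says their PDF-to-Markdown handling is expected to differ.
OFFICE_DOCUMENT = {".docx", ".doc"}
OFFICE_SLIDES = {".pptx", ".ppt"}


def plan_conversion(filename: str) -> list[str]:
    """Return the ordered conversion steps for a file, chosen by extension."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        return ["pdfplumber:pdf->markdown"]
    if ext in OFFICE_DOCUMENT:
        return ["libreoffice:document->pdf", "pdfplumber:pdf->markdown"]
    if ext in OFFICE_SLIDES:
        return ["libreoffice:slides->pdf", "pdfplumber:pdf->markdown"]
    if ext == ".html":
        return ["html-to-markdown:html->markdown"]
    raise ValueError(f"unsupported file extension: {ext!r}")


def libreoffice_to_pdf_command(src: str, out_dir: str) -> list[str]:
    """Build (but do not run) a headless LibreOffice command that writes a PDF.

    Assumes the binary is available on PATH as `soffice`.
    """
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, src]
```

For example, `plan_conversion("slides.pptx")` yields a LibreOffice step followed by a pdfplumber step, while `plan_conversion("page.html")` yields a single html-to-markdown step. Keeping the document and slide branches separate leaves room for the two to diverge later without touching the dispatch.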


linear · 2024-07-22T16:33:13Z

INS-5302 Convert doc docx ppt pptx html to md

chuang8511 · 2024-07-26T10:07:35Z

Will need to deal with this issue in this PR as well.
Please wait for me.

chuang8511 · 2024-07-26T12:39:10Z

fixed

🤖 I have created a release *beep* *boop*

---

## [0.24.0-beta](v0.23.0-beta...v0.24.0-beta) (2024-07-31)

### Features

* add audio operator ([#236](#236)) ([fe8abff](fe8abff))
* add handler to auto-fill missing default values ([#210](#210)) ([dcad3f0](dcad3f0))
* add HubSpot component ([#199](#199)) ([b3936a8](b3936a8))
* add Jira component ([#205](#205)) ([51f3ed7](51f3ed7))
* add Ollama component ([#224](#224)) ([810f850](810f850))
* add sql component ([#193](#193)) ([9a373f3](9a373f3))
* add token count for each chunk ([#235](#235)) ([bb69104](bb69104))
* add video operator to fulfil unstructured data process ([#238](#238)) ([a1459d7](a1459d7))
* **document:** add docx doc pptx ppt html to transform to text in markdown format ([#232](#232)) ([2932db9](2932db9))
* **document:** move ConvertToText task from text operator to document operator ([#248](#248)) ([699ca70](699ca70))
* introduce event handler interface ([#253](#253)) ([9599b42](9599b42))
* **restapi:** recategorize the restapi component as a generic component ([#249](#249)) ([fbfc3a3](fbfc3a3))
* **website:** add scrape sitemap function ([#239](#239)) ([8648326](8648326))

### Bug Fixes

* bug of duplicate document ([#256](#256)) ([e028a6e](e028a6e))
* bug of json without setting array for images ([#259](#259)) ([4aeae69](4aeae69))
* change md format to html tag for correct frontend link ([#240](#240)) ([7e16b2b](7e16b2b))
* revert the alias because they are same as package name ([#243](#243)) ([1d9c42d](1d9c42d))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

chuang8511 added 3 commits July 22, 2024 16:41

feat: add more file extensions to be able to transform to text in markdown format

6906a44

refactor: move common utils to internal helper package

39916e1

feat: add integration test code and mock the markdown transformer

fd07429

chuang8511 requested review from donch1989, pinglin, xiaofei-du and jvallesm as code owners July 22, 2024 16:33

droplet-bot added the instill component label Jul 22, 2024

chuang8511 marked this pull request as draft July 22, 2024 16:39

chuang8511 marked this pull request as ready for review July 22, 2024 18:44

chore(document): update doc

d157a0f

chuang8511 requested a review from GeorgeWilliamStrong as a code owner July 23, 2024 10:43

chuang8511 added 2 commits July 26, 2024 13:30

fix: the constant name cannot handle multiple request

d476326

chore(document): add the explanation for the libreoffice

8ab52cf

donch1989 merged commit 2932db9 into main Jul 26, 2024
8 checks passed

donch1989 deleted the chunhao/ins-5302 branch July 26, 2024 13:51

droplet-bot mentioned this pull request Jul 26, 2024

chore(main): release 0.24.0-beta #242

Merged
