feat(document): add docx doc pptx ppt html to transform to text in markdown format #232

Merged on Jul 26, 2024 (6 commits)



@chuang8511 chuang8511 commented Jul 22, 2024

Because

  • We want to handle more types of unstructured data.

This commit

  • Adds transformers that convert docx, doc, pptx, ppt, and html files to Markdown.

At a high level, we:

  • Check the file extension.
    • If it is pdf, use pdfplumber directly to convert the PDF to Markdown.
    • If it is docx / doc / ppt / pptx, convert the document to PDF with LibreOffice, then convert the PDF to Markdown with pdfplumber.
    • If it is html, convert it to Markdown with the html-to-markdown package.

We handle docx/doc and ppt/pptx separately because users apply them to different use cases, so their PDF-to-Markdown logic will differ.
For that reason, we split them up front.
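
The routing described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the names `ROUTES`, `FAMILIES`, `plan_conversion`, and `libreoffice_pdf_command` are hypothetical, and the sketch assumes the LibreOffice binary is available on the PATH as `soffice`. It only plans the steps and builds the LibreOffice command line; it does not run pdfplumber or html-to-markdown.

```python
from pathlib import Path

# Hypothetical route table: file extension -> ordered conversion steps.
ROUTES = {
    ".pdf": ["pdfplumber"],
    ".docx": ["libreoffice", "pdfplumber"],
    ".doc": ["libreoffice", "pdfplumber"],
    ".pptx": ["libreoffice", "pdfplumber"],
    ".ppt": ["libreoffice", "pdfplumber"],
    ".html": ["html-to-markdown"],
}

# Hypothetical family labels: documents and slides stay separate so that
# their PDF-to-Markdown logic can diverge later.
FAMILIES = {
    ".pdf": "pdf",
    ".docx": "document",
    ".doc": "document",
    ".pptx": "slides",
    ".ppt": "slides",
    ".html": "html",
}


def plan_conversion(filename):
    """Return (family, steps) for a file, based only on its extension."""
    ext = Path(filename).suffix.lower()
    if ext not in ROUTES:
        raise ValueError(f"unsupported file extension: {ext!r}")
    return FAMILIES[ext], ROUTES[ext]


def libreoffice_pdf_command(src, outdir):
    """Build the headless LibreOffice command that converts src to PDF.

    Assumes the binary is called `soffice`; some installs expose
    `libreoffice` instead.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(src),
    ]


if __name__ == "__main__":
    print(plan_conversion("report.docx"))
    print(libreoffice_pdf_command("report.docx", "/tmp/out"))
```

Keeping the family label apart from the step list means docx/doc and ppt/pptx can share the LibreOffice step today while still getting different Markdown post-processing later.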


@chuang8511 chuang8511 marked this pull request as draft July 22, 2024 16:39
@chuang8511 chuang8511 marked this pull request as ready for review July 22, 2024 18:44
@chuang8511
Member Author

Will need to deal with this issue in this PR as well.
Please wait for me.

@chuang8511
Member Author

fixed

@donch1989 donch1989 merged commit 2932db9 into main Jul 26, 2024
8 checks passed
@donch1989 donch1989 deleted the chunhao/ins-5302 branch July 26, 2024 13:51
donch1989 pushed a commit that referenced this pull request Jul 31, 2024
🤖 I have created a release *beep* *boop*
---


## [0.24.0-beta](v0.23.0-beta...v0.24.0-beta) (2024-07-31)


### Features

* add audio operator ([#236](#236)) ([fe8abff](fe8abff))
* add handler to auto-fill missing default values ([#210](#210)) ([dcad3f0](dcad3f0))
* add HubSpot component ([#199](#199)) ([b3936a8](b3936a8))
* add Jira component ([#205](#205)) ([51f3ed7](51f3ed7))
* add Ollama component ([#224](#224)) ([810f850](810f850))
* add sql component ([#193](#193)) ([9a373f3](9a373f3))
* add token count for each chunk ([#235](#235)) ([bb69104](bb69104))
* add video operator to fulfil unstructured data process ([#238](#238)) ([a1459d7](a1459d7))
* **document:** add docx doc pptx ppt html to transform to text in markdown format ([#232](#232)) ([2932db9](2932db9))
* **document:** move ConvertToText task from text operator to document operator ([#248](#248)) ([699ca70](699ca70))
* introduce event handler interface ([#253](#253)) ([9599b42](9599b42))
* **restapi:** recategorize the restapi component as a generic component ([#249](#249)) ([fbfc3a3](fbfc3a3))
* **website:** add scrape sitemap function ([#239](#239)) ([8648326](8648326))


### Bug Fixes

* bug of duplicate document ([#256](#256)) ([e028a6e](e028a6e))
* bug of json without setting array for images ([#259](#259)) ([4aeae69](4aeae69))
* change md format to html tag for correct frontend link ([#240](#240)) ([7e16b2b](7e16b2b))
* revert the alias because they are same as package name ([#243](#243)) ([1d9c42d](1d9c42d))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).