[Feat] issue #1 exclude tags (html clean-up) #16

Open: wants to merge 6 commits into base: main

@oliviermills oliviermills commented Apr 18, 2024

Replaced the exclude-tag list with a function that does a cleaner, safer HTML clean-up. Resolves #1.
Added basic tests for the function.

Important: we should add an integration test covering a much larger variety of HTML pages; see #15.
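
For readers following the thread, here is a minimal sketch of the kind of clean-up function described above. It is not the code from this PR, which is not shown in the conversation. It assumes Python's standard-library `html.parser`, and the names `DEFAULT_EXCLUDED_TAGS`, `VOID_TAGS`, `clean_html`, and `extra_excluded` are hypothetical, as is the default tag list itself. The idea is that a fixed set of non-content tags is always removed together with everything nested inside them, and callers can only add to that set.

```python
from html.parser import HTMLParser

# Hypothetical defaults: the tag list used in the PR is not shown in this thread.
DEFAULT_EXCLUDED_TAGS = frozenset(
    {"script", "style", "noscript", "nav", "header", "footer", "aside"}
)

# Void elements have no closing tag, so excluding one must not open a skipped region.
VOID_TAGS = frozenset(
    {"area", "base", "br", "col", "embed", "hr", "img", "input",
     "link", "meta", "source", "track", "wbr"}
)


class _TagStripper(HTMLParser):
    """Re-emits the input HTML, dropping excluded elements and their contents."""

    def __init__(self, excluded):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.excluded = excluded
        self.skip_tag = None   # name of the excluded element currently being skipped
        self.skip_depth = 0    # nesting depth of that same tag name
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1  # nested element with the same name
            return
        if tag in self.excluded:
            if tag not in VOID_TAGS:
                self.skip_tag, self.skip_depth = tag, 1
            return
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit once, never change the depth.
        if self.skip_tag is None and tag not in self.excluded:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag not in self.excluded:  # drop stray closers of excluded tags
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_tag is None:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_tag is None:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_tag is None:
            self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        if self.skip_tag is None:
            self.parts.append(f"<!{decl}>")

    # Comments are dropped: the base class handle_comment does nothing.


def clean_html(html, extra_excluded=()):
    """Return html without the default excluded tags plus any extra ones."""
    excluded = DEFAULT_EXCLUDED_TAGS | {t.lower() for t in extra_excluded}
    parser = _TagStripper(excluded)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling `clean_html(page)` applies only the defaults, while `clean_html(page, extra_excluded=["form"])` removes `<form>` elements as well. Because the defaults are merged in rather than replaced, a caller cannot accidentally re-enable `<script>` or `<style>`. A streaming parser like this does not repair malformed markup, which is one reason the integration test over real pages mentioned above would still be needed.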

@oliviermills oliviermills changed the title from "Feat/issue #1 exclude tags (html clean-up)" to "[Feat] issue #1 exclude tags (html clean-up)" on Apr 18, 2024
Collaborator

@rafaelsideguide rafaelsideguide left a comment

I added back the tags that must be removed by default.

Collaborator

rafaelsideguide commented Apr 19, 2024

CI/CD is failing because we hit the LlamaParse rate limit.

Member

nickscamara commented May 15, 2024

@rafaelsideguide Now that we have an initial testing framework, we should start testing these changes and get this merged.

Also, @oliviermills, one quick thing: I noticed that this PR was made before we switched to AGPL 3.0. Can you confirm that you agree to relicense your contributions under the new license? Once that's done, we can proceed with merging them!

(You can just write a comment here saying "I agree to relicense my contributions to the AGPL". See #134 for more context.)

Thank you.

@oliviermills
Author

I agree to relicense my contributions to the AGPL 🤗

Development

Successfully merging this pull request may close these issues.

[Feat] Strip non-content tags, headers, footers
3 participants