FeatureRequest/HelpNeeded: highlight is not an exact subset of the text content #179

Open
thiswillbeyourgithub opened this issue Feb 24, 2024 · 4 comments

@thiswillbeyourgithub

Hi,

I'm the dev behind LogseqMarkdownParser and am working on a small script that turns highlights directly into Anki flashcards.

It's not yet working because I'm running into an issue with text formats.

You see, I don't just want to send the highlight to Anki. I want to grab the 1000 or so characters before and after the highlight, make a cloze card from it (i.e. blank out the highlight in the text so that you have to guess it), and then send that to Anki.
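A minimal sketch of that idea, assuming the highlight is an exact substring of the article text (the very assumption that breaks for highlights containing math). The function name `make_cloze` and the `context` parameter are hypothetical; only the `{{c1::...}}` cloze syntax comes from Anki itself:

```python
def make_cloze(text: str, highlight: str, context: int = 1000) -> str:
    """Wrap `highlight` in an Anki cloze, with up to `context` chars on each side."""
    start = text.find(highlight)
    if start == -1:
        # This is the failure case: the highlight was rendered to plain text,
        # but the article body still contains the raw markup.
        raise ValueError("highlight is not an exact substring of the text")
    end = start + len(highlight)
    before = text[max(0, start - context):start]
    after = text[end:end + context]
    return before + "{{c1::" + highlight + "}}" + after
```

The resulting string could then be used as the text field of a cloze note.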

The main issue is that the two texts don't match. For example, I have this highlight:
For example, suppose ΔW is the weight update for a weight matrix W∈RA×B.
And the relevant section of text is this:
For example, suppose \\(\\Delta W\\) is the weight update for a weight ' 'matrix \\(W \\in \\mathbb{R}^{A \\times B}\\).

I'm guessing this is MathJax.

I can't seem to find a good Python library that converts MathJax into text, or text into MathJax, let alone one that does it reliably.

So is it possible to:

  1. Either add {{{rawText}}} for the highlight, which would not be parsed (so it would still contain the MathJax)
  2. Or parse the content of the article just like the highlight (currently only the highlight is parsed to text)
  3. Also, the highlight positions seem to be broken: they are all equal to 0 on my end. Is this normal?

Thanks!

@thiswillbeyourgithub thiswillbeyourgithub changed the title FR/Help needed: Omnivore to Anki FeatureRequest/HelpNeeded: highlight is not an exact subset of the texte content Mar 14, 2024
@thiswillbeyourgithub
Author

Hi! Just a quick bump, as I would really like to wrap up my project while I have some free time :) But if you can't find the time to take a look, that's totally fine of course!

@jacksonh
Contributor

Hi, I think what you are seeing in the highlight text is raw text, or at least Markdown. Can you post a screenshot of the highlight itself?

@thiswillbeyourgithub
Author

Here's the highlighted section of the text:
[image: screenshot of the highlighted section]

Here is the link to the article: https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html

@thiswillbeyourgithub
Author

Hi,

I decided to go the "most robust way" anyway and implement a function that finds the substring of a corpus that best matches the highlight. This is computationally intensive and will probably be an issue for very long texts, but at least I can move on towards finishing this.
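For illustration, here is a minimal sketch of that kind of search. It is not the actual implementation from the project. It slides a fixed-width window over the corpus and scores each window with the standard library's `difflib.SequenceMatcher`. The name `best_match` and the `step` parameter are hypothetical. Using a window exactly as long as the highlight is a simplification, since raw MathJax is longer than its rendered text.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def best_match(corpus: str, highlight: str, step: int = 1) -> tuple[int, int, float]:
    """Return (start, end, score) of the corpus window most similar to `highlight`."""
    width = len(highlight)
    if width == 0 or width > len(corpus):
        raise ValueError("highlight must be non-empty and no longer than the corpus")
    # SequenceMatcher caches data about seq2, so fix the highlight as seq2
    # and only swap seq1 for each candidate window.
    matcher = SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(highlight)
    best = (0, width, 0.0)
    for start in range(0, len(corpus) - width + 1, step):
        matcher.set_seq1(corpus[start:start + width])
        score = matcher.ratio()  # 1.0 means the window equals the highlight
        if score > best[2]:
            best = (start, start + width, score)
    return best
```

Every window is compared against the whole highlight, so the cost grows roughly with the corpus length times the cost of one comparison, which is why long articles are slow. A larger `step` trades accuracy for speed.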

When I finish this project, if I think it's worth it I'll come back to you to see if that's worth a mention in a blog post or whatever :)

In the meantime, although I still think my request is legit and someone might have a real need for more precise filter access in the API, I'll let you decide if you want to close this or not :)

Have a nice day!

@thiswillbeyourgithub thiswillbeyourgithub changed the title FeatureRequest/HelpNeeded: highlight is not an exact subset of the texte content FeatureRequest/HelpNeeded: highlight is not an exact subset of the text content Apr 22, 2024