Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract hrefs from links, add to end of text and reference them. #3

Merged
merged 6 commits into from
Mar 16, 2021

Conversation

aduggleby
Copy link
Contributor

This change allows the links to survive the text conversion by extracting the urls (href attributes) during traversal and then appending them as a numbered list to the end of the document.

Copy link
Owner

@matteocontrini matteocontrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Alex, thank you for this PR, your code looks almost perfect!

I left a few comments in the review, but there's also another issue I noticed. I haven't run the code but I think that the fact that links are collected in that private List field could create an unideal situation where links in a list item or in a heading are placed inside the item/heading, instead of at the bottom of the document.

If you take a look here you can see that an HtmlTraversal instance is created when a list item needs to be converted. To be honest, I don't remember the exact reason why I did this, but that was what I went to. So if there's a link inside the <li>, when that list item is rendered links would be placed inline, which is not what we want.

Off the top of my mind I don't have an immediate solution to this... We probably need to refactor a bit those sections, to avoid creating a new instance of HtmlTraversal, or at least to share the state somehow. One temporary workaround could be to have a flag to disable the links references generation, so the constructor would be: HtmlTraversal(bool appendLinks = true). Links in list items and headings would be then discarded, which is unideal, but...

Textify/HtmlTraversal.cs Outdated Show resolved Hide resolved
Textify/HtmlTraversal.cs Outdated Show resolved Hide resolved
Textify/HtmlTraversal.cs Outdated Show resolved Hide resolved
Textify/HtmlTraversal.cs Outdated Show resolved Hide resolved
@aduggleby
Copy link
Contributor Author

There was a test that was commented out for nested lists which I think is where the double traversal object originally was an attempt to solve that use case but was failing. I've taken it out and put the nesting handling (adding tabs) just before the output is appended. Added some more tests with deeper testing for this.
Also added the README example to the tests and updated it to reflect the link extraction.

@matteocontrini
Copy link
Owner

matteocontrini commented Jan 13, 2021

Thanks for the changes!

I think I remember why there was that double traversal object: it was needed in order to add the * prefix only when the list item actually contains something. Some websites use list items that contain only icons, for example, so without that check there could be many useless empty lists.

The same thing happens with headings. The code needs the whole content of the heading to:

  • check whether the heading actually contains something
  • extract the length of the longest line in the heading so that the Markdown-like formatting can be generated (the ++++ or ----- symbols)

If we keep the double traversal object, one solution I can think of is to pass the links list to new HtmlTraversal instances, so that links can be appended by them and then be output only by the "root" instance. A bit hacky though...

Any thoughts?

@aduggleby
Copy link
Contributor Author

I understand, but I've decided to change the dividers to fixed length which doesn't required the new object. For my use case the trade-off for slightly less "beautiful" output vs. the other features and simplicity I gain is worth it, but I'm not sure if you want to pull it in for your case or others that depend on the existing output. It's here for your taking if you want, otherwise the changes will just live on in my fork. Thanks for the basis though, saved me a chunk of time.

@matteocontrini
Copy link
Owner

Sorry, I'm very late here. I don't have much time to think about this at the moment, but I'll merge and release :)

I'll create an issue to track the problems mentioned above with this approach.

Thank you!

@matteocontrini matteocontrini merged commit 0c9dbc8 into matteocontrini:master Mar 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants