Extract hrefs from links, add to end of text and reference them. #3

aduggleby · 2021-01-07T19:59:05Z

This change allows the links to survive the text conversion by extracting the urls (href attributes) during traversal and then appending them as a numbered list to the end of the document.

matteocontrini

Hi Alex, thank you for this PR, your code looks almost perfect!

I left a few comments in the review, but there's also another issue I noticed. I haven't run the code but I think that the fact that links are collected in that private List field could create an unideal situation where links in a list item or in a heading are placed inside the item/heading, instead of at the bottom of the document.

If you take a look here you can see that an HtmlTraversal instance is created when a list item needs to be converted. To be honest, I don't remember the exact reason why I did this, but that was what I went to. So if there's a link inside the <li>, when that list item is rendered links would be placed inline, which is not what we want.

Off the top of my mind I don't have an immediate solution to this... We probably need to refactor a bit those sections, to avoid creating a new instance of HtmlTraversal, or at least to share the state somehow. One temporary workaround could be to have a flag to disable the links references generation, so the constructor would be: HtmlTraversal(bool appendLinks = true). Links in list items and headings would be then discarded, which is unideal, but...

Textify/HtmlTraversal.cs

aduggleby · 2021-01-13T19:15:29Z

There was a test that was commented out for nested lists which I think is where the double traversal object originally was an attempt to solve that use case but was failing. I've taken it out and put the nesting handling (adding tabs) just before the output is appended. Added some more tests with deeper testing for this.
Also added the README example to the tests and updated it to reflect the link extraction.

matteocontrini · 2021-01-13T19:59:59Z

Thanks for the changes!

I think I remember why there was that double traversal object: it was needed in order to add the * prefix only when the list item actually contains something. Some websites use list items that contain only icons, for example, so without that check there could be many useless empty lists.

The same thing happens with headings. The code needs the whole content of the heading to:

check whether the heading actually contains something
extract the length of the longest line in the heading so that the Markdown-like formatting can be generated (the ++++ or ----- symbols)

If we keep the double traversal object, one solution I can think of is to pass the links list to new HtmlTraversal instances, so that links can be appended by them and then be output only by the "root" instance. A bit hacky though...

Any thoughts?

…rsion and ignore links with anchor hrefs.

aduggleby · 2021-01-20T10:32:44Z

I understand, but I've decided to change the dividers to fixed length which doesn't required the new object. For my use case the trade-off for slightly less "beautiful" output vs. the other features and simplicity I gain is worth it, but I'm not sure if you want to pull it in for your case or others that depend on the existing output. It's here for your taking if you want, otherwise the changes will just live on in my fork. Thanks for the basis though, saved me a chunk of time.

matteocontrini · 2021-03-16T19:42:44Z

Sorry, I'm very late here. I don't have much time to think about this at the moment, but I'll merge and release :)

I'll create an issue to track the problems mentioned above with this approach.

Thank you!

Extract hrefs from links, add to end of text and reference them.

53847bc

matteocontrini requested changes Jan 12, 2021

View reviewed changes

Textify/HtmlTraversal.cs Outdated Show resolved Hide resolved

Textify/HtmlTraversal.cs Outdated Show resolved Hide resolved

Textify/HtmlTraversal.cs Outdated Show resolved Hide resolved

Textify/HtmlTraversal.cs Outdated Show resolved Hide resolved

aduggleby added 4 commits January 13, 2021 19:07

Code review comments

014daa0

Add link to TEADME.md and add that example to tests

048b68d

Support nested UL without new HtmlTraversal object

472c0d9

One more test for nesting plus link extraction.

f0e53ba

Change header underlining to fixed divider, remove HtmlTraversal recu…

54be568

…rsion and ignore links with anchor hrefs.

matteocontrini merged commit 0c9dbc8 into matteocontrini:master Mar 16, 2021

matteocontrini mentioned this pull request Mar 16, 2021

Improve traversal #5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract hrefs from links, add to end of text and reference them. #3

Extract hrefs from links, add to end of text and reference them. #3

aduggleby commented Jan 7, 2021

matteocontrini left a comment

aduggleby commented Jan 13, 2021

matteocontrini commented Jan 13, 2021 •

edited

aduggleby commented Jan 20, 2021

matteocontrini commented Mar 16, 2021

Extract hrefs from links, add to end of text and reference them. #3

Extract hrefs from links, add to end of text and reference them. #3

Conversation

aduggleby commented Jan 7, 2021

matteocontrini left a comment

Choose a reason for hiding this comment

aduggleby commented Jan 13, 2021

matteocontrini commented Jan 13, 2021 • edited

aduggleby commented Jan 20, 2021

matteocontrini commented Mar 16, 2021

matteocontrini commented Jan 13, 2021 •

edited