Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some linkified urls are including the app server name #2496

Closed
JulienParis opened this issue Jun 15, 2020 · 17 comments · Fixed by #2500
Closed

Some linkified urls are including the app server name #2496

JulienParis opened this issue Jun 15, 2020 · 17 comments · Fixed by #2500

Comments

@JulienParis
Copy link
Contributor

JulienParis commented Jun 15, 2020

In a dataset's text, an url written in a markdown format is transformed into a <a> tag but its href attribute contains the app server name within the url...

You can check an example of this behaviour here => test Bleach dataset

Detailed Description

if you write a text as :

link 03 : [rte-france-03.com](rte-france-03.com)

you get on the client side something like :

<p>link 03 : <a href="https://www.data.gouv.frrte-france-03.com">rte-france-03.com</a></p>

... which is wrong and not desired behaviour I guess : it shoud not contain www.data.gouv.fr before rte-france-03.com in the href attribute

Possible Implementation

check PR #2486

@abulte
Copy link
Contributor

abulte commented Jun 15, 2020

What does the markdown says about that? I wouldn't be shocked if http(s):// was mandatory in a markdown external link.

@abulte
Copy link
Contributor

abulte commented Jun 15, 2020

Cf https://pad.incubateur.net/WK9dDgvlSZSP--G059UVMA?both for a test on another markdown parser (similar behaviour as data.gouv.fr today).

@abulte
Copy link
Contributor

abulte commented Jun 15, 2020

What's your suggested behaviour BTW?

@JulienParis
Copy link
Contributor Author

the wrong netloc value is injected/recreated at this moment =>

attrs[(None, 'href')] = '{scheme}://{netloc}{path}'.format(

@JulienParis
Copy link
Contributor Author

JulienParis commented Jun 15, 2020

What's your suggested behaviour BTW?

link 03 : [rte-france-03.com](rte-france-03.com)

should produce

<p>link 03 : <a href="https://rte-france-03.com">rte-france-03.com</a></p>

and also a rel="nofollow" attribute, but that's described in another issue

@abulte
Copy link
Contributor

abulte commented Jun 15, 2020

Cf https://gist.github.com/abulte/9b4ffe706f4667a13db3265bec14538e

So I think the real issue is that we don't add a / between data.gouv.fr and the (internal) URL when we need one. Other than that it's OK. #2497 is not an issue either.

@JulienParis
Copy link
Contributor Author

What should it produce then ?

@abulte
Copy link
Contributor

abulte commented Jun 15, 2020

To be clear, [rte-france-03.com](rte-france-03.com) should yield a link to https://www.data.gouv.fr/rte-france-03.com

@JulienParis
Copy link
Contributor Author

JulienParis commented Jun 15, 2020

but following that logic [www.rte-france-06.com](www.rte-france-06.com) should also yield href="https://www.data.gouv.fr/www.rte-france-06.com"... which seems very weird to me (but better for SEO :)

@abulte
Copy link
Contributor

abulte commented Jun 15, 2020

It's not a question of SEO, it's that we shouldn't try to guess if a link is internal or external. If the user wants an external link, she uses http(s)://, if not we consider it as an internal link and enforce our site name in front of it. By doing that we ensure that nofollow is used when needed.

@abulte
Copy link
Contributor

abulte commented Jun 15, 2020

This is what GitHub does in the end. They keep it as a relative link but we can't because we need the nofollow.

Capture d'écran 2020-06-15 13 37 57

@JulienParis
Copy link
Contributor Author

JulienParis commented Jun 15, 2020

nope, to https://www.data.gouv.fr/www.rte-france-06.com

sorry I rectified my comment above, but ok I get it better now...

@JulienParis
Copy link
Contributor Author

JulienParis commented Jun 15, 2020

Github behaviour just to compare...

- email text : me@lalala.com
- email text : me@lalala-composed.com
- email md : [me@lalala-composed.com](mailto:me@lalala-composed.com)
- plain text : lalala.com
- plain text + http : http://lalala.com
- alone : [lalala.com](lalala.com)
- www : [lalala.com](www.lalala.com)
- http : [lalala.com](http://lalala.com)
- https : [lalala.com](https://lalala.com)

So yes Bleach behaves another way and induce an external url, that is a root cause of our mutual miscomprehension on this matter :)

@abulte
Copy link
Contributor

abulte commented Jun 15, 2020

So yes Bleach behaves another way and induce an external url, that is a root cause of our mutual miscomprehension on this matter :)

It's not Bleach, it's our nofollow_callback function. And it's because we enforce a nofollow, like the function name says.

@JulienParis
Copy link
Contributor Author

JulienParis commented Jun 15, 2020

I did an update on a live test => https://www.data.gouv.fr/fr/datasets/bleach-test-002/

test **BLEACH** 002

----
### emails bleaching

#### as markdown 
email-01-md : [email-01-md@email.com](mailto:email-01-md@email.com)
email-02-md : [email-02-md@email-multi-composed.com](mailto:email-02-md@email-multi-composed.com)

#### as plain text
email-01-txt : email-01-txt@email.com
email-02-txt : email-02-txt@email-multi-composed.com

---- 
### links

#### as markdown 
link-md-01 : [link-md-01.com](link-md-01.com)
link-md-02 : [http://link-md-02.com](http://link-md-02.com)
link-md-03 : [https://link-md-03.com](https://link-md-03.com)

#### as plain text
link-txt-01 : link-txt-01.com
link-txt-02 : http://link-txt-02.com
link-txt-03 : www.link-txt-03.com
link-txt-04 : http://link-txt-04.com

--- 
### internal / client explicitly writes an internal link
internal-md-01 : [local/route/internal-01](local/route/internal-md-01)
internal-md-02 : [/local/route/internal-md-02](/local/route/internal-02)
internal-md-03 : [/ ... internal-md-03](/)

Capture d’écran 2020-06-15 à 15 26 36

@JulienParis
Copy link
Contributor Author

what confused me was link-txt-01 and link-txt-02 are not currently behaving as you mentionned above but link-md-01 is (approximatly modulo the /)...

@JulienParis
Copy link
Contributor Author

JulienParis commented Jun 15, 2020

Anyway I'll work on a solution giving the exact same behaviour than GH + adding nofollow if link is external + neutralizing all emails...

seems easier to describe that way... is that ok for u @abulte @quaxsze ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants