Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Link to an archived page shows up as broken link/flaw #2675

Closed
hamishwillee opened this issue Jan 25, 2021 · 19 comments
Closed

Link to an archived page shows up as broken link/flaw #2675

hamishwillee opened this issue Jan 25, 2021 · 19 comments
Labels
🧑‍🤝‍🧑 community contributions by our wonderful community flaw-system issues and feature requests related to the flaws system 🐌 idle Issues and PRs without recent activity. Flagged for maintainer follow-up. need more info Further information is requested p4 Not urgent, only if time allows 🚉 platform keeping the platform healthy

Comments

@hamishwillee
Copy link
Contributor

Locally testing I see broken links reported as a non-fixable flaw. Sometimes this is because the link is to an archived page (it appears archived pages are not fetched when you get the source).

Not sure what the solution is - perhaps a hint that this is possible, and in this case what we expect people to do in that case?

@peterbe
Copy link
Contributor

peterbe commented Jan 25, 2021

Only just last did we add a think so we now know what all the archived URLs are. We'll use that in the broken_link flaw checking.

@peterbe
Copy link
Contributor

peterbe commented Jan 25, 2021

@hamishwillee Do you know, by heart, a good page to test this with?

@peterbe
Copy link
Contributor

peterbe commented Jan 25, 2021

CC @fiji-flo
I looked into this a bit. It's tricky. The problem is that we honestly don't have this information. It's something we didn't export when we migrated from MySQL to git. The way the migration worked, was something like this:

if isArchive(row) {
  if (isRedirect(row) {
    /// NOTHING HERE!!
  else {
    writeArchivedContent(row)
  }
} else {
  if (isRedirect(row)) {
    writeRedirect(row);
  else {
    writeContent(row)
  }
}

Basically, we don't have, in git, a record of all known archived redirects.
Thankfully, the MySQL backup has all of that information and we can run a hack version of the migration script that extracts all of these.
For example:

mysql> select title, is_redirect, html from wiki_document where locale='en-us' and slug='The_Mozilla_platform';
+----------------------+-------------+-------------------------------------------------------------------------------------------------------+
| title                | is_redirect | html                                                                                                  |
+----------------------+-------------+-------------------------------------------------------------------------------------------------------+
| The Mozilla platform |           1 | REDIRECT <a class="redirect" href="/en-US/docs/Mozilla/The_Mozilla_platform">The Mozilla platform</a> |
+----------------------+-------------+-------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Rows like this got send to /dev/null in the big migration.

@peterbe
Copy link
Contributor

peterbe commented Jan 25, 2021

However, there is some hope. On http://localhost:3000/en-US/docs/Mozilla/Firefox#_flaws for example, there's a

<a href="/en-US/docs/Mozilla/Firefox/Multiple_profiles">use multiple Firefox profiles</a>

which shows up as "broken" and not being fixable. However, in the archived.txt we do have an entry of:

en-us/mozilla/firefox/multiple_profiles/index.html

equipped with that, you can conclude it's not a broken link. It's just archived. So it shouldn't be flagged as a flaw.

@hamishwillee
Copy link
Contributor Author

Excellent, and even though I don't know a good test page, you can find one easily in archived.txt :-)

fiji-flo pushed a commit that referenced this issue Jan 28, 2021
* link to archived pages is not a broken link

Part of #2675

* refactor
@peterbe
Copy link
Contributor

peterbe commented Jan 28, 2021

Note-to-self; #2687 make things better.
But there might still exist links that aren't archived URLs, but are archived redirects.

@hamishwillee
Copy link
Contributor Author

hamishwillee commented Jan 28, 2021

@peterbe Should #2687 be working now? I just pulled upstream main, did yarn install, and tried this - still see the flaw on your test page: http://localhost:5000/en-US/docs/Mozilla/Firefox#_flaws

image


Don't want to seem ungrateful because this is MUCH better for me - having a false-positive flaw I can't do anything about is annoying and unhelpful.
But whether it is the "best" fix depends a bit on our archive policy and process. For example if Multiple profiles was archived because we don't support them any more then the fix is to remove the whole section from the linking doc - http://localhost:5000/en-US/docs/Mozilla/Firefox. You can't do that if you don't know that the link is to a doc that is archived.

Upshot, if it is actually a flaw to link to an archived document then we should leave the link in, but add highlighting and an explanation of why it is a flaw. If it is not a flaw to link to an archived document, then please ignore this and accept my thanks.

@peterbe
Copy link
Contributor

peterbe commented Jan 29, 2021

Just seconds off the presses, mdn/content#1889 upgrades the Yari that's shipped inside mdn/content.
So pull the latest main and run yarn and try again.

Secondly, the flaw system isn't yet smart enough to make "policy decisions" about links to sections that we don't care for.
It's technically easy to know, in the flaw system that "Here is a link to the archived section!!!!"
But it's not clear what you should do about that. Many other flaws are "uncontroversial" in a sense. This one would require serious consideration from someone like @chrisdavidmills

As a note, archiving is hard. The reason we're doing it is to remove things that make the content repo otherwise unnecessarily heavy. By shoving it away from our active repo, we're also saving electricity waste by not building it frequently.

@hamishwillee
Copy link
Contributor Author

@peterbe I'm still getting the test case shown in flaws ("Multiple profiles" doc). I've updated to latest and run yarn (and yarn install for good measure).
image

So either this did not fix the problem, it did not get pulled in, or is being affected by some of my other problems.

Yes, archiving is hard. It's a lot easier though if you apply a "reason" for the archiving at the point you make the change. That might be a good case for page metadata.
What I am fairly sure on is that even though we don't want to build the archive every time, it would be good to have an easy way to fetch and build "everything" locally. This would make it easy to move things back out of archive if we wanted to. If you agree very happy to create a separate issue for that.

@peterbe
Copy link
Contributor

peterbe commented Feb 4, 2021

This is what I'm seeing, on this URL, in the latest and greatest version of yari within mdn/content.
Screen Shot 2021-02-04 at 10 11 42 AM
What do you get?

I ran git pull origin && yarn install && yarn start about 2min ago.

@hamishwillee
Copy link
Contributor Author

@peterbe So I see 3 flaws - including multiple profiles. BUT they are all valid - including the multiple profiles link, which now opens (and even though this is still listed in archive.txt (en-us/mozilla/firefox/multiple_profiles/index.html).

So a) I don't know why you see different, b) is there a known archive link I can test that is present in docs. I tried a few from archive.txt but can't find any hits.

image

@hamishwillee
Copy link
Contributor Author

Actually test case might be http://localhost:5000/en-US/docs/Web/JavaScript/#_flaws

The flaw shown is in the archived.txt and if you click it there is a redirect.

image

@hamishwillee
Copy link
Contributor Author

@peterbe FYI This is still a problem (archived pages showing up as broken links). It is a bit of a pain frankly, because I have to check them manually.

@peterbe
Copy link
Contributor

peterbe commented Mar 19, 2021

The only way I think of solving this is if you have a checkout of the github.com/mdn/archived-content on your laptop and then you add CONTENT_ARCHIVED_ROOT=path/to/mdn/archived-content in your .env file.
The link is gone now, but for example /en-US/docs/Web/JavaScript/New_in_JavaScript isn't an actual archived URL but actually an archived redirect URL.

Do you think that's a viable thing? Most people wouldn't need to worry about it but serious editors and writers like yourself can perhaps go the extra mile to get that figured out.

@hamishwillee
Copy link
Contributor Author

hamishwillee commented Mar 22, 2021

@peterbe That absolutely works "for me". If you were feeling very brave you could even add a yarn option to fetch the archive and update the env for those who want it.

Just an idea. Given that the archived links are more or less static might we not be better off just generating a static list of the archived topics and use that to hide the flaw (and possibly provide a stub target "archived topic").
I see your proposal as more generic/extensible, but the benefit of this idea is that it might work transparently for everyone.

@peterbe
Copy link
Contributor

peterbe commented Mar 22, 2021

might we not be better off just generating a static list of the archived topics

Quite complicatedly, we have a list of all the files
But we don't have one for all the redirects. For that, the only place is https://raw.githubusercontent.com/mdn/archived-content/main/files/$LOCALE/_redirects.txt

@hamishwillee
Copy link
Contributor Author

I guess then it comes down to the effort of perhaps importing that list of redirects (and using it) or your proposal. Either is much better "for power users" than the current situation.

@schalkneethling schalkneethling added the 🚉 platform keeping the platform healthy label Nov 4, 2021
@github-actions github-actions bot added the 🐌 idle Issues and PRs without recent activity. Flagged for maintainer follow-up. label Dec 11, 2021
@schalkneethling schalkneethling added 🧑‍🤝‍🧑 community contributions by our wonderful community flaw-system issues and feature requests related to the flaws system labels Apr 15, 2022
@caugner
Copy link
Contributor

caugner commented Nov 30, 2022

Locally testing I see broken links reported as a non-fixable flaw. Sometimes this is because the link is to an archived page (it appears archived pages are not fetched when you get the source).

@hamishwillee Is this still an issue? I think I'm missing context to understand what a link to an archived page is, because I only know about the archived-content repo.

@caugner caugner added p4 Not urgent, only if time allows need more info Further information is requested labels Nov 30, 2022
@hamishwillee
Copy link
Contributor Author

hamishwillee commented Dec 2, 2022

@caugner Closing.

TLDR;

FYI the historical context was that in the past the archived content was held in a separate repo, but built with yari using special setup that put clear "archiving" borders around it. When I posted this you got a warning when doing local testing, even though the content would actually be build on release - i.e a false positive.

We no longer build the archived material so this is invalid - a broken link is a real flaw :-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🧑‍🤝‍🧑 community contributions by our wonderful community flaw-system issues and feature requests related to the flaws system 🐌 idle Issues and PRs without recent activity. Flagged for maintainer follow-up. need more info Further information is requested p4 Not urgent, only if time allows 🚉 platform keeping the platform healthy
Projects
Development

No branches or pull requests

4 participants