Use git log to know a documents last modified date #706

Closed
peterbe opened this issue Jun 3, 2020 · 9 comments

@peterbe
Contributor

peterbe commented Jun 3, 2020

At the time of writing, we only use the .modified date in the sitemap XML files we build at the end of the builder. But really soon we should have a "last modified" date displayed in the footer of every document.

For documents that have been imported from MySQL we get this from the wikihistory.json but not for documents that came into existence after the MySQL import.

@peterbe
Contributor Author

peterbe commented Jun 3, 2020

@Gregoor and I dug into this in early Jan 2020 and managed to write an "algorithm" that uses git libraries to build a complete map of path => date from git log. Since that's fast, we build the map once instead of having to figure it out one document at a time in the builder.

What we stumbled on was that nodegit wasn't supported in modern versions of Node, so the proofs of concept we produced were written in Python and Rust.
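For illustration, here is a minimal Python sketch of the "one map for the whole repo" idea. It is not the original proof of concept: it shells out to plain git rather than a git library, and the "COMMIT " marker and both function names are assumptions made up for this example.

```python
import subprocess
from datetime import datetime
from typing import Dict


def parse_git_log(output: str, marker: str = "COMMIT ") -> Dict[str, datetime]:
    """Map each path to the date of the newest commit that touched it.

    Expects `git log --name-only` output where every commit starts with a
    line "<marker><ISO 8601 date>" followed by the paths it changed.
    """
    last_modified: Dict[str, datetime] = {}
    current = None
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(marker):
            current = datetime.fromisoformat(line[len(marker):])
        elif current is not None:
            # Keep the newest date, so the result does not depend on log order.
            previous = last_modified.get(line)
            if previous is None or current > previous:
                last_modified[line] = current
    return last_modified


def git_last_modified(repo_dir: str) -> Dict[str, datetime]:
    """Run git once for the whole repository and parse the result."""
    output = subprocess.run(
        ["git", "log", "--name-only", "--format=COMMIT %cI"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_git_log(output)
```

The builder would call something like git_last_modified() once at startup and then do plain dictionary lookups per document.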

@peterbe peterbe changed the title T - use git log know a documents last modified date Use git log know a documents last modified date Sep 23, 2020
@peterbe peterbe added this to the Yari1 milestone Sep 23, 2020
@escattone escattone added this to To do in Yari via automation Sep 28, 2020
@peterbe
Contributor Author

peterbe commented Oct 22, 2020

I'm going to wait for #1510 because I know it has a nifty wrapper on child_process called gitExec which I'd want to use to control the gathering of the git logs from within Node.
Also, it might be a good idea to just git clone the whole thing with no depth set, i.e. full depth, by setting fetch-depth: 0 in actions/checkout. That's going to work for a very long time.

@fiji-flo had an interesting idea. We could potentially control the fetch-depth parameter to actions/checkout by first attempting to retrieve the last log file from the cache. Something like this:

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - name: Cache git log JSON
      id: cache-git-log
      uses: actions/cache@v2
      with:
        path: git-log.json
        key: ${{ runner.os }}-git-log.json

    - uses: actions/checkout@v2
      with:
        fetch-depth: ${{ steps.cache-git-log.outputs.cache-hit == 'true' && 100 || 0 }}
    
    - name: Build Yari
      # ...

    - name: Dump git-log.json
      run: node cli.js dump-git-log-past -o git-log.json

@peterbe peterbe self-assigned this Oct 29, 2020
@peterbe peterbe changed the title Use git log know a documents last modified date Use git log to know a documents last modified date Oct 29, 2020
@peterbe
Contributor Author

peterbe commented Oct 29, 2020

Ugh. I’m struggling to figure out how to get the last-modified date. I think we might need to accept that all last-modified dates come from git.
Thing is, there is no document that we build, archived or not, that does NOT have a git last-modified. And it's always greater-or-equal to the date we did the migration.
I.e. we can't say a certain document was last modified "2019-11-02" (for example).

Every single document (we build) is in git. So for every single filepath, we'll have a date, and it'll be the last time the file got checked in. So if a document in the Wiki was last modified on 2019-05-10, it won't be anymore, because its date will come from either the first time we ran the migration (the creation of the mdn/content repo), the final migration, or a change made in GitHub since the final migration. It'll never be anything older than that.

In other words, we're going to lose the last-modified from before the migration. You'll never see, in the document footer: "Last modified: July 21, 2019"

Given the _wikihistory.json file and the output of the git log --name-only ... command, it'll be something like this:

# from _wikihistory.json
en-us/docs/foo 2019-04-01
en-us/docs/bar 2019-07-23

# from 'git log --name-only ...'
en-us/docs/foo 2020-10-20
en-us/docs/bar 2019-10-20
en-us/docs/buzz 2020-11-01

If you merge these it has to become

en-us/docs/foo 2020-10-20
en-us/docs/bar 2019-10-20
en-us/docs/buzz 2020-11-01
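A small Python sketch of that merge, using the example data above. The function name is made up for illustration; the only rule it encodes is "the git date wins whenever git has one", which for documents we build is always.

```python
from typing import Dict


def merge_last_modified(
    wiki_dates: Dict[str, str], git_dates: Dict[str, str]
) -> Dict[str, str]:
    """Combine both sources; any path known to git takes the git date."""
    merged = dict(wiki_dates)
    merged.update(git_dates)
    return merged


wiki_dates = {
    "en-us/docs/foo": "2019-04-01",
    "en-us/docs/bar": "2019-07-23",
}
git_dates = {
    "en-us/docs/foo": "2020-10-20",
    "en-us/docs/bar": "2019-10-20",
    "en-us/docs/buzz": "2020-11-01",
}
merged = merge_last_modified(wiki_dates, git_dates)
```

Because every built document is in git, no Wiki date survives the merge, so the result is just the git map.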

Some documents have not had any real edits since the migration, but how can you possibly know? And if we run mass-edits for fixable flaws, it'll look like the documents have had even more edits.

So I think the conclusion is pretty solid. We'll have to give up on last-modified dates that predate the migration(s).

@peterbe
Contributor Author

peterbe commented Oct 29, 2020

One repercussion is that we can probably stop recording the modified date in _wikihistory.json, because we can never use it.

@Ryuno-Ki

In case it helps: in 11ty/eleventy#142 (comment) I dug into how to get the last modified date for all files known to git.

peterbe added a commit to peterbe/yari that referenced this issue Oct 29, 2020
peterbe added a commit to peterbe/yari that referenced this issue Oct 29, 2020
@escattone
Contributor

@peterbe I wonder if we could solve this by always using the git last-modified date, except if it matches the migration date (the date we officially migrate all of our documents from the Wiki into our GitHub repos)? So, in other words, assuming we launch on Dec. 14th as planned, we always use the git last-modified date, except when it matches 2020-12-14, and only then do we use the modified date in _wikihistory.json. Of course, for that to work we'd also have to avoid making any changes to the documents via the repository on that date. What do you think about that?
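As a rough Python sketch of that rule (the function and constant names are invented for illustration, and it assumes a single migration date, which the planned Dec. 14th launch would give us):

```python
from datetime import date
from typing import Optional

# Assumption: one official migration date, as proposed in the comment above.
MIGRATION_DATE = date(2020, 12, 14)


def pick_last_modified(
    git_date: date,
    wiki_date: Optional[date] = None,
    migration_date: date = MIGRATION_DATE,
) -> date:
    """Use the git date unless it is exactly the migration date."""
    if git_date == migration_date and wiki_date is not None:
        # Untouched since the migration: fall back to the Wiki's date.
        return wiki_date
    return git_date
```

Note that the rule can only tell "untouched since migration" apart from "edited on the migration day" if nobody commits content changes on that date.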

@peterbe
Contributor Author

peterbe commented Oct 30, 2020

> @peterbe I wonder if we could solve this by always using the git last-modified date, except if it matches the migration date (the date we officially migrate all of our documents from the Wiki into our GitHub repos)? So, in other words, assuming we launch on Dec. 14th as planned, we always use the git last-modified date, except when it matches 2020-12-14, and only then do we use the modified date in _wikihistory.json. Of course, for that to work we'd also have to avoid making any changes to the documents via the repository on that date. What do you think about that?

Doable, but I don't like it. The problem is that we've been migrating from Wiki to github.com/mdn/content for over a month now. I've already done 5 checkins for content from MySQL and I hope to do many more in the next couple of weeks.

The other secret truth is that a lot of "last modified" dates in the Wiki are quite invalid, because a bot could have triggered an edit to the document without creating a new revision.

One option is that we display both dates. E.g.

Last modified: Dec 14, 2020 (migrated from the MDN Wiki on Feb 13, 2019)
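A hedged Python sketch of how such a footer line could be assembled; the function name is hypothetical and the wording simply copies the example above.

```python
from datetime import date
from typing import Optional


def format_footer(git_date: date, wiki_date: Optional[date] = None) -> str:
    """Build the footer text, mentioning the Wiki date when we have one."""

    def pretty(d: date) -> str:
        # %b is the abbreviated month name in the current (default C) locale.
        return f"{d:%b} {d.day}, {d.year}"

    text = f"Last modified: {pretty(git_date)}"
    if wiki_date is not None:
        text += f" (migrated from the MDN Wiki on {pretty(wiki_date)})"
    return text
```

Documents created after the migration have no Wiki date, so they would get only the first half of the line.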

The other not-so-secret truth is that a LOT of documents have fixable flaws, either external images or broken links that are fixable. So a huge number of documents will have a git last-modified date like Dec 15, 2020. Should we make an exception for that too? It's getting complicated.

@fiji-flo
Contributor

git log --name-only --grep="^Bump " --invert-grep --no-decorate --format="←→ %ci" --date-order --after="2020-10-01 00:00:00 -0000" --reverse

We should come up with a convention for non-content commits and put that into the grep instead of my ^Bump example.

That might be a solution to all of this.
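To make the convention idea concrete, here is a small Python sketch that builds the same git log invocation from a list of "non-content" commit-message patterns. The function name is an assumption, and any pattern other than ^Bump would be whatever convention we end up agreeing on. As far as I know, several --grep patterns combined with --invert-grep keep only the commits that match none of them.

```python
from typing import List, Optional


def content_log_args(
    non_content_patterns: List[str], after: Optional[str] = None
) -> List[str]:
    """Build argv for a git log that skips commits matching any pattern."""
    args = [
        "git",
        "log",
        "--name-only",
        "--no-decorate",
        "--date-order",
        "--reverse",
        "--format=←→ %ci",  # same per-commit marker as the command above
    ]
    for pattern in non_content_patterns:
        args.append(f"--grep={pattern}")
    if non_content_patterns:
        # Invert the match: keep commits whose message matches no pattern.
        args.append("--invert-grep")
    if after:
        args.append(f"--after={after}")
    return args
```

The list can be handed to subprocess.run() as-is, which avoids shell quoting problems with the patterns.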

@peterbe
Contributor Author

peterbe commented Oct 30, 2020

By the way, the way we compute last-modified in the dumper is wrong. Very wrong.
If you use wiki_document.modified you'll get the wrong date 79% of the time :)
If you use the max(wiki_revision.created) it is the most realistic last-modified date.

Obviously, it's not that simple. Without taking the time to carefully study the Kuma Wiki ORM code, it's very possible that sometimes a document legitimately is modified since the latest revision was created. For example, when you edit other things such as slug, title, revision adjustments.

Note to self: this was the query I used to deduce these percentages:

    select max(r.created) as max_created, d.id, d.modified from
    wiki_revision r
    INNER JOIN wiki_document d ON r.document_id = d.id
    where d.locale = 'en-us'
    group by r.document_id
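For the record, a Python sketch of how a percentage could be derived from rows shaped like that query's output. The thread doesn't spell out the comparison, so treating "wrong" as "d.modified differs from max(r.created)" is an assumption, as is the function name.

```python
from datetime import datetime
from typing import Iterable, Tuple

Row = Tuple[datetime, int, datetime]  # (max_created, document id, modified)


def percent_mismatched(rows: Iterable[Row]) -> float:
    """Percentage of documents whose modified date differs from max_created."""
    rows = list(rows)
    if not rows:
        return 0.0
    mismatched = sum(
        1 for max_created, _doc_id, modified in rows if modified != max_created
    )
    return 100.0 * mismatched / len(rows)
```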

@peterbe peterbe closed this as completed in 6c96d7f Dec 2, 2020
Yari automation moved this from To do to Done Dec 2, 2020
fiji-flo pushed a commit that referenced this issue Jan 26, 2022
#699 Trigger focus when main topbar search is opened