Git LFS support #1375

Open
rgn opened this issue Jun 28, 2021 · 15 comments

@rgn

rgn commented Jun 28, 2021

Hi there,

is there any plan to support Git Large File Storage (LFS) in isomorphic-git?

I found #218, which mentions LFS in an example, but I didn't find anything in the docs on how to enable it.

Best regards

Ralf

@jcubic
Contributor

jcubic commented Jun 28, 2021

I don't think there will be support unless someone steps up to add this feature. The main person behind the project doesn't work on it as much as at the beginning. I'm an admin/maintainer right now, but I won't be able to add this feature. In fact, I probably won't be able to add any feature myself, at least not in the near future. Maybe that will change later, once I've worked on the project for a while.

@mojavelinux
Contributor

mojavelinux commented Jun 30, 2021

I think the first thing to try/prove is whether Git LFS can be supported without a change to isomorphic-git. isomorphic-git gives you the low level functions to read the objects in a git repository (abstracting away all the interaction with the loose, pack, index, and ref files). From that information, it should (in theory) be possible to discover Git LFS references and resolve them in application code.

Personally, I don't yet understand Git LFS well enough to know what has to happen. But I can't imagine that there is anything in the loose, pack, index, or ref files that pertains to Git LFS, so it should be possible to handle those references without a change to isomorphic-git.

If we determine that there's something that has to change in isomorphic-git, and we know what that change is, then of course we can consider making those updates so that an application can resolve Git LFS references.

@mojavelinux
Contributor

I just tried to clone a repository that has files tracked by git-lfs and it seemed to work just fine. What's the specific request here? Is this a question about being able to add and remove lfs files, or is it about being able to clone a repository with git-lfs enabled? What specifically isn't working right now? And what do you want to see working?

@mojavelinux
Contributor

I understand this better now. When the repository is cloned, the lfs blobs are actually info files (much like symlinks). Here's an example:

version https://git-lfs.github.com/spec/v1
oid sha256:5f47400a1b4be065c4b64c9d2a06123bc4b463781561834aa9f0cd7061a86768
size 70787

So it's clear this reference needs to be resolved. The question is, what needs to be sent to the server to get the lfs storage (either a single file or all of them)? Does anyone know where this exchange is documented?
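For anyone following along, here is a minimal sketch of how a pointer file like the one above could be parsed in application code. The name parseLfsPointer is hypothetical (it is not an isomorphic-git function), and the sketch only accepts sha256 oids, which is an assumption based on the example shown.

```javascript
'use strict'

const LFS_VERSION_LINE = 'version https://git-lfs.github.com/spec/v1'

// Hypothetical helper (not part of isomorphic-git): parse the text of an
// LFS pointer file. Returns { oid, size } or null if the text does not
// look like a pointer.
function parseLfsPointer (text) {
  const lines = text.trim().split('\n')
  // The version line is expected to come first.
  if (lines[0] !== LFS_VERSION_LINE) return null
  const fields = {}
  for (const line of lines.slice(1)) {
    const idx = line.indexOf(' ')
    if (idx === -1) continue
    fields[line.slice(0, idx)] = line.slice(idx + 1)
  }
  // Assumption: only sha256 oids are handled in this sketch.
  const match = /^sha256:([0-9a-f]{64})$/.exec(fields.oid || '')
  if (!match || !/^\d+$/.test(fields.size || '')) return null
  return { oid: match[1], size: Number(fields.size) }
}

const pointer = parseLfsPointer([
  'version https://git-lfs.github.com/spec/v1',
  'oid sha256:5f47400a1b4be065c4b64c9d2a06123bc4b463781561834aa9f0cd7061a86768',
  'size 70787',
  ''
].join('\n'))
console.log(pointer)
```

Returning null for anything that doesn't match keeps regular files (which might merely start with the letter "v") from being mistaken for pointers.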

@mojavelinux
Contributor

Aha! Setting the following environment variables when running git allowed me to see what it is requesting:

GIT_TRACE=1 GIT_TRANSFER_TRACE=1 GIT_CURL_VERBOSE=1 

Here's what I saw:

HTTP: POST https://<host>/<org>/<repo>/info/lfs/objects/batch

That at least gives me a thread to follow.
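As a rough sketch of what that request might look like, the snippet below builds (but does not send) a batch request for the endpoint seen in the trace. The helper name buildLfsBatchRequest and the example.com URL are made up for illustration. The sketch simply appends the traced path to the remote URL; the server discovery rules in the Git LFS docs may require more than that.

```javascript
'use strict'

const LFS_MIME = 'application/vnd.git-lfs+json'

// Hypothetical helper: describe the batch request for a set of LFS objects.
// Nothing is sent over the network here.
function buildLfsBatchRequest (remoteUrl, objects, operation = 'download') {
  return {
    // Assumption: the LFS endpoint lives directly under the remote URL,
    // as seen in the trace output.
    url: `${remoteUrl.replace(/\/+$/, '')}/info/lfs/objects/batch`,
    method: 'POST',
    headers: { Accept: LFS_MIME, 'Content-Type': LFS_MIME },
    body: JSON.stringify({ operation, transfers: ['basic'], objects })
  }
}

const req = buildLfsBatchRequest('https://example.com/org/repo.git/', [
  { oid: '5f47400a1b4be065c4b64c9d2a06123bc4b463781561834aa9f0cd7061a86768', size: 70787 }
])
console.log(req.url)
console.log(req.body)
```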

@mojavelinux
Contributor

I have a very rudimentary prototype working to resolve lfs files when walking the git tree. I'll get the code organized into a step-by-step example and share it here. From there, we can think about how this can become part of isomorphic-git. I'm thinking something like adding lfs support transparently to the readBlob function, but it's too early to commit to anything at this point.

@mojavelinux
Contributor

Here are the docs for the lfs service for reference: https://github.com/git-lfs/git-lfs/tree/main/docs/api

@rgn
Author

rgn commented Jul 1, 2021

Thank you, @jcubic, for your quick response.

@mojavelinux As I can see, you already got the point. Sorry for reacting so late.
Nice that you already managed to build a first prototype. I can test it when you share the code.

Actually, there are two sides relevant for LFS:

a) Check out and retrieve the binary blobs from the endpoint, based on the reference files.
b) Replace binary blobs, as configured in .gitattributes, with the reference file and upload them to the endpoint.

There is a specification on how to implement LFS.
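To make side (b) a bit more concrete, here is a small sketch of how the reference (pointer) file for a binary blob could be generated. The function name createLfsPointer is hypothetical. The sketch assumes the sha256 pointer layout shown earlier in this thread and leaves out the upload step entirely.

```javascript
'use strict'

const crypto = require('crypto')

// Hypothetical helper: build the pointer text that would replace a binary
// blob in the repository. `content` is a Buffer or Uint8Array.
function createLfsPointer (content) {
  const oid = crypto.createHash('sha256').update(content).digest('hex')
  const size = content.byteLength
  const text = [
    'version https://git-lfs.github.com/spec/v1',
    `oid sha256:${oid}`,
    `size ${size}`,
    '' // pointer files end with a newline
  ].join('\n')
  return { oid, size, text }
}

const example = createLfsPointer(Buffer.from('hello\n'))
console.log(example.text)
```

The oid and size returned here are the same two values the batch API needs for an upload operation, so this helper could feed both the commit and the upload.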

@mojavelinux
Contributor

I'll be focusing on (a) at first, though I don't see anything preventing (b) from being implemented too.

@mojavelinux
Contributor

mojavelinux commented Jul 5, 2021

Here's the code to clone a repository, populate the lfs object cache from the LFS pointer files found in the tree, and replace each LFS pointer file in the worktree with the real lfs object. This code uses two commands from isomorphic-git, clone and walk, as well as the request function from the http plugin.

'use strict'

// Usage: node <script> <url> <dir>

const fs = require('fs')
const { promises: fsp } = fs
if (!fsp.rm) fsp.rm = fsp.rmdir
const git = require('isomorphic-git')
const http = require('isomorphic-git/http/node')
const ospath = require('path')

const SYMLINK_MODE = 0o120000 // git file mode for symlinks (40960 in decimal)
const LFS_POINTER_PREAMBLE = 'version https://git-lfs.github.com/spec/v1\n'

async function bodyToBuffer (body) {
  const buffers = []
  let offset = 0
  let size = 0
  for await (const chunk of body) {
    buffers.push(chunk)
    size += chunk.byteLength
  }
  body = new Uint8Array(size)
  for (const buffer of buffers) {
    body.set(buffer, offset)
    offset += buffer.byteLength
  }
  return Buffer.from(body.buffer)
}

function readLfsPointer ({ dir, gitdir = ospath.join(dir, '.git'), content }) {
  const info = content.toString().trim().split('\n').reduce((accum, line) => {
    const [k, v] = line.split(' ', 2)
    if (k === 'oid') {
      accum[k] = v.split(':', 2)[1]
    } else if (k === 'size') {
      accum[k] = Number(v) // the batch API expects size as a number, not a string
    }
    return accum
  }, {})
  const oid = info.oid
  // Objects are cached under lfs/objects/<first 2 chars>/<next 2 chars>/<oid>
  const objectPath = ospath.join(gitdir, 'lfs', 'objects', oid.slice(0, 2), oid.slice(2, 4), oid)
  return { info, objectPath }
}

async function downloadLfsObject ({ http: { request }, headers, url }, lfsInfo, lfsObjectPath) {
  const lfsInfoRequestData = { operation: 'download', transfers: ['basic'], objects: [lfsInfo] }
  const { body: lfsInfoBody } = await request({
    url: `${url}/info/lfs/objects/batch`,
    method: 'POST',
    headers: {
      ...headers,
      Accept: 'application/vnd.git-lfs+json',
      'Content-Type': 'application/vnd.git-lfs+json',
    },
    body: [Buffer.from(JSON.stringify(lfsInfoRequestData))]
  })
  const lfsInfoResponseData = JSON.parse(await bodyToBuffer(lfsInfoBody))
  const lfsObjectDownloadUrl = lfsInfoResponseData.objects[0].actions.download.href
  const { body: lfsObjectBody } = await request({ url: lfsObjectDownloadUrl, method: 'GET', headers })
  const content = await bodyToBuffer(lfsObjectBody)
  await fsp.mkdir(ospath.dirname(lfsObjectPath), { recursive: true })
  await fsp.writeFile(lfsObjectPath, content)
  return content
}

;(async (url, dir) => {
  const repo = { fs, dir }
  const headers = { 'user-agent': `git/isomorphic-git@${git.version()}` }
  // force: true avoids an error when the target directory does not exist yet
  await fsp.rm(repo.dir, { recursive: true, force: true })
  await git.clone({ ...repo, headers: { ...headers }, http, url })
  await git.walk({ ...repo, trees: [git.TREE({ ref: 'HEAD' })], map: async (filepath, [treeEntry]) => {
    const type = await treeEntry.type()
    if (type === 'tree') return true
    if (type === 'blob' && await treeEntry.mode() !== SYMLINK_MODE) {
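      // Respect byteOffset/byteLength in case the Uint8Array is a view onto a larger ArrayBuffer
      const content = await treeEntry.content().then((bytes) => Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength))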
      if (content[0] === 118 && content.subarray(0, 100).indexOf(LFS_POINTER_PREAMBLE) === 0) {
        const { info: lfsInfo, objectPath: lfsObjectPath } = readLfsPointer({ ...repo, content })
        if (await fsp.access(lfsObjectPath).catch(() => true)) {
          await downloadLfsObject({ headers, http, url }, lfsInfo, lfsObjectPath).then((content) => {
            const lfsWorktreePath = ospath.join(repo.dir, filepath)
            return fsp.lstat(lfsWorktreePath).then(({ mode }) => fsp.writeFile(lfsWorktreePath, content, { mode }))
          })
        }
      }
    }
  }})
})(...process.argv.slice(2))

This code has several shortcomings.

First, it leaves the worktree dirty. If I switch to the directory and run git status, the lfs object is reported as being changed. But if I run git add, the problem is corrected without having to make a commit. So something needs to be updated in the git index to indicate that the file has not changed.

Second, the code doesn't handle authentication. But that shouldn't be difficult to add (especially since it's using the same request function that isomorphic-git uses internally).

Finally, the code makes a separate request for each LFS object to get the download URL. But the LFS service supports returning the download URLs for multiple objects at once. So these requests could be consolidated into a single request, meaning the code would only make N+1 requests to the LFS service (one to collect all the download URLs and one for each download). (It also might be best to stream to the file in lfs/objects instead of buffering it into memory.)
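Here is a sketch of what the consolidation could look like: collect the pointers first, drop duplicates, send one batch request, then index the response by oid. The helper names and the sample response below are made up for illustration. The response shape (objects with either actions.download or error) reflects my reading of the batch API docs, so treat it as an assumption.

```javascript
'use strict'

// Hypothetical helper: drop duplicate oids so each object is requested once.
function uniqueObjects (pointers) {
  const seen = new Map()
  for (const { oid, size } of pointers) seen.set(oid, { oid, size })
  return [...seen.values()]
}

// Hypothetical helper: index a batch response by oid so each pointer can be
// matched with its download URL (or its error).
function indexDownloads (batchResponse) {
  const downloads = new Map()
  const errors = new Map()
  for (const obj of batchResponse.objects || []) {
    if (obj.actions && obj.actions.download) {
      const { href, header = {} } = obj.actions.download
      downloads.set(obj.oid, { href, header })
    } else if (obj.error) {
      errors.set(obj.oid, obj.error)
    }
  }
  return { downloads, errors }
}

// Made-up sample data, not output from a real server.
const objects = uniqueObjects([
  { oid: 'aaa', size: 1 },
  { oid: 'bbb', size: 2 },
  { oid: 'aaa', size: 1 }
])
const sampleResponse = {
  objects: [
    { oid: 'aaa', size: 1, actions: { download: { href: 'https://lfs.example.com/aaa', header: { Authorization: 'Basic abc' } } } },
    { oid: 'bbb', size: 2, error: { code: 404, message: 'Object does not exist' } }
  ]
}
const { downloads, errors } = indexDownloads(sampleResponse)
console.log(objects.length, downloads.get('aaa').href, errors.get('bbb').code)
```

Keeping the per-object header around matters because the server may expect it to be sent along with the download request.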

@mojavelinux
Contributor

The next step is now to figure out what isomorphic-git can do for us.

It seems reasonable that the checkout command would replace the LFS pointer files in the worktree since that's what git does (when lfs is installed). We could allow this behavior to be controlled using a new keyword argument such as lfs: 'smudge' (or perhaps even filters: ['lfs-smudge']).

But let's assume the repository is cloned without a checkout. When should isomorphic-git populate the .git/lfs/objects cache? Should it do it during clone/fetch? Or should it provide another object command like readLfsBlob that can be used when walking the tree? Also, when walking the tree, should isomorphic-git detect an LFS pointer file and set the type to lfs-pointer?

@jcubic
Contributor

jcubic commented Jul 5, 2021

@mojavelinux awesome job.

@rgn
Author

rgn commented Jul 6, 2021

I would expect to populate .git/lfs/objects during clone and fetch for the actual branch. At least, this is the behaviour I had with the git CLI. If you switch the branch, fetch the relevant objects.

Regarding the file type, I'm not sure what it is about and what the benefit would be.

@mojavelinux Awesome work!

@mojavelinux
Contributor

mojavelinux commented Jul 6, 2021

> I would expect to populate .git/lfs/objects during clone and fetch for the actual branch.

Indeed, that would match canonical git when lfs is installed. But there's also overhead in doing so that not every user may want when cloning (at least not unconditionally). So I think the behavior will need to be controlled via a switch.

> If you switch the branch, fetch the relevant objects.

This would be checkout. Again, I think it needs to be controlled via a switch. There are use cases when you don't want this to happen if you aren't interested in those files.

> Regarding the file type, I'm not sure what it is about and what the benefit would be.

If we are walking the git tree, we need to know whether we are looking at an lfs pointer file or a regular file...just like we do when we're looking at a symlink pointer. isomorphic-git should be able to tell us this. (It's more than just checking the file to see if it looks like an LFS pointer file...we need to be 100% sure by consulting .gitattributes). Otherwise, we cannot trigger the appropriate action.
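To illustrate the .gitattributes check, here is a deliberately simplified sketch. The function isLfsTracked is hypothetical, and it only handles a single .gitattributes file with `*` and `?` globs. Real gitattributes matching has more rules (nested files, `**`, directory patterns, macros), so this is a starting point under those assumptions rather than a faithful implementation.

```javascript
'use strict'

// Convert a simple glob (only `*` and `?`) into a RegExp.
function globToRegExp (glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&')
  return new RegExp('^' + escaped.replace(/\*/g, '[^/]*').replace(/\?/g, '[^/]') + '$')
}

// Hypothetical helper: does the given .gitattributes text assign filter=lfs
// to filepath? Later matching lines override earlier ones.
function isLfsTracked (gitattributes, filepath) {
  const basename = filepath.split('/').pop()
  let tracked = false
  for (const raw of gitattributes.split('\n')) {
    const line = raw.trim()
    if (!line || line.startsWith('#')) continue
    const [pattern, ...attrs] = line.split(/\s+/)
    // Simplification: patterns without a slash match the basename,
    // patterns with a slash match the full path.
    const subject = pattern.includes('/') ? filepath : basename
    if (!globToRegExp(pattern.replace(/^\//, '')).test(subject)) continue
    for (const attr of attrs) {
      if (attr === 'filter=lfs') tracked = true
      else if (attr === '-filter' || attr === '!filter' || attr.startsWith('filter=')) tracked = false
    }
  }
  return tracked
}

const attributes = '*.psd filter=lfs diff=lfs merge=lfs -text\n'
console.log(isLfsTracked(attributes, 'assets/logo.psd'))
console.log(isLfsTracked(attributes, 'README.md'))
```

Combining this check with a look at the blob content (does it parse as a pointer?) would give a stronger signal than either test alone.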

@strogonoff

> Or should it provide another object command like readLfsBlob that can be used when walking the tree.

From an end-user perspective, I believe there are two typical uses as far as reading LFS data goes:

  • Read a blob (possibly at a specified oid). A separate function like readLfsBlob() could work really well actually, matching the readBlob() API and just returning a blob. No need to fetch in parallel or worry about tainting the working directory in this case.
  • Retrieve blobs matching some arbitrary pattern (dependent on UX needs at runtime) into the working directory. A list of object paths would be built for a single batch request, but as @mojavelinux pointed out, the resulting state of the working directory might confuse other APIs. There’s also the unpleasant race potential, I imagine. Seems both trickier to implement and less essential.
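As a sketch of the first use case, a readLfsBlob()-style function could look in the .git/lfs/objects cache first and only download on a miss. Everything here is hypothetical: readLfsBlob is not an existing isomorphic-git API, and the download callback is injected so the sketch stays independent of how the batch request is actually made.

```javascript
'use strict'

const fs = require('fs')
const path = require('path')

// Cache layout used by the prototype earlier in this thread:
// <gitdir>/lfs/objects/<first 2 chars>/<next 2 chars>/<oid>
function lfsObjectPath (gitdir, oid) {
  return path.join(gitdir, 'lfs', 'objects', oid.slice(0, 2), oid.slice(2, 4), oid)
}

// Hypothetical API: resolve a pointer ({ oid, size }) to the real content.
// `download` is an injected async function that fetches the object.
async function readLfsBlob ({ gitdir, pointer, download }) {
  const objectPath = lfsObjectPath(gitdir, pointer.oid)
  try {
    return await fs.promises.readFile(objectPath) // cache hit
  } catch (err) {
    if (err.code !== 'ENOENT') throw err
  }
  const content = await download(pointer) // cache miss
  await fs.promises.mkdir(path.dirname(objectPath), { recursive: true })
  await fs.promises.writeFile(objectPath, content)
  return content
}

// Usage (assumed shape): const blob = await readLfsBlob({ gitdir, pointer, download })
```

Because nothing is written to the working directory, this avoids the dirty-worktree problem mentioned above.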

As to adding objects, it’s more difficult to mimic the existing API but a low-level helper function like lfsUpload({ filepath, oid, ... }) could work. One could call it with the current oid after adding & committing a file, for example.

Agreed that the behavior of fetching all LFS files matching .gitattributes patterns would rarely be welcome from an end application performance standpoint. Actually, personally I’d be fine if IsoGit ignored .gitattributes altogether, and left it to the programmer to decide when to use low-level LFS functions.
