
Expose the hashes based on content for locally-modified modules. #61

Closed

johanatan opened this issue Apr 25, 2018 · 13 comments
johanatan commented Apr 25, 2018

Which mbt command are you having troubles with?
mbt describe local

What's the current behaviour?
It returns 'local' as the version for any module with local modifications. This is fine, but I'd also like to build an optimisation into our pipeline so that it doesn't unnecessarily rebuild things. Exposing the content hash per module, perhaps via mbt hash module, would be a nice feature to have.

What's the expected behaviour?
Access to the content hashes could be made available (apart from committing local changes and then calling mbt describe head).

@johanatan johanatan changed the title Expose the hashes based on content for modules. Expose the hashes based on content for locally-modified modules. Apr 25, 2018
@johanatan (Author)

Another way to look at this issue: local is a premature, unnecessary and [inherently] destructive coercion to a "bottom" value. From this viewpoint, one might argue that mbt describe local itself should not be special and should simply return the module hashes for the modules that have changed on disk (per their [and their dependencies'] current contents).

@buddhike (Member)

Thanks for reporting. The current behaviour is intentional. The mbt build|describe local commands are there to find what's changed in your workspace; they should just build/describe workspace changes.
That said, I'm keen to understand where this improvement could be useful. Could you please elaborate on the optimisation you are trying to do in your pipeline?

@johanatan (Author)

johanatan commented Apr 26, 2018

I want to use the hash of the content to see if the build artifact for a particular locally modified module has been built and, if so, skip it.
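That optimisation can be sketched as follows. This is a hypothetical pipeline step, not part of mbt: `describe_output` stands in for the text of `mbt describe head` (assumed columns: name, path, version, as the workaround script later in this thread also assumes), and `built_versions` for the set of versions that already have published artifacts.

```python
def modules_to_build(describe_output: str, built_versions: set) -> list:
    """Return the names of modules whose version has no artifact yet.

    Hypothetical helper: assumes each line of describe_output is
    '<name> <path> <version>'; modules whose version is already in
    built_versions are skipped.
    """
    pending = []
    for line in describe_output.strip().splitlines():
        name, _path, version = line.split()[:3]
        if version not in built_versions:
            pending.append(name)
    return pending
```

With this shape, a CI job only invokes the build tool for the modules the function returns, and publishes the new artifacts keyed by version afterwards.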

@buddhike (Member)

Thanks for explaining what you are trying to do. After making a change in your repo, if you run mbt describe|build local, it should just build/describe what's changed, so this behaviour achieves what you need. A potential problem, though, is that if you repeat the command, the same changes get rebuilt, which is probably what you are trying to avoid.

Traditional build systems like make, maven and msbuild already optimise the build process this way, so that running the same command multiple times does not do much.
From the start, we did not want to replace these tools with mbt. Its sole purpose is to execute build tools of your choice in a subtree of a repo when that subtree has been modified.

Version calculation was required because, when building a subset of our repository, we want a reliable way to associate the build artifacts with a change in the repo (a commit SHA is not good enough here because it represents every module in the commit, not just the ones that got built as a result of a change).
Currently the version of a module is calculated by combining:

  • SHA of the git tree object for its directory (the same thing you get when you run git cat-file -p master^{tree} | grep tree)
  • SHAs of the git blob objects for its file dependencies (same as git cat-file -p master^{tree} | grep blob)
  • Version numbers of dependencies in .mbt.yml
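As an illustration only (this is not mbt's actual algorithm), those three inputs could be folded into one stable version string like this; `tree_sha`, `blob_shas` and `dep_versions` are hypothetical stand-ins for the values read from git and .mbt.yml:

```python
import hashlib

def module_version(tree_sha, blob_shas, dep_versions):
    """Illustrative sketch: combine the git tree SHA, file-dependency
    blob SHAs and .mbt.yml dependency versions into one hash."""
    h = hashlib.sha1()
    h.update(tree_sha.encode())
    for sha in sorted(blob_shas):       # sorted so enumeration order doesn't matter
        h.update(sha.encode())
    for version in dep_versions:
        h.update(version.encode())
    return h.hexdigest()
```

Any change to the module's tree, to a file dependency, or to a dependency's version changes the resulting hash, which is what makes it usable as an artifact key.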

mbt build|describe local was introduced because we found ourselves doing a lot of git commit ... followed by mbt build|describe head while developing the build-related scripts. Thinking about the problem, we figured out that we could just interrogate the workspace (file system) for modules and reduce that down to the ones that are not added to the index. However, since these are items we don't have in git yet, we cannot get SHAs for them. This is why we have a fixed version called local to denote a module that's being built using this command.

Sorry about the length of the response. I thought a bit of background on how local was born and the problem it's trying to solve would clear things up.

If you still think a version number based on the local content is essential, feel free to have a stab at it. I think you'd be able to generate the module set as per head and change our workspace discovery algorithm to construct the modified modules based on the workspace diff :-).

//cc @wenisman

@johanatan (Author)

Ah, thanks for the background.

Hmm, would that even be possible, though, since, as you mentioned, there isn't any hash of local changes until they're committed?

Or were you thinking of a different algorithm that looks at the content of the files on disk?

Also, can you please point me to where in the codebase your current “discover workspace algorithm” is implemented?

Thanks!

@buddhike (Member)

No problem at all. Thank you for feedback like this.

The current workspace discovery code is in the ModulesInWorkspace function in the discover component.

func (d *stdDiscover) ModulesInWorkspace() (Modules, error) {

That works out all the modules based on the presence of a .mbt.yml file in the directory tree. ModulesInWorkspace is used by a component called ManifestBuilder, which produces various views of our module set. In this particular case, it reduces the modules down to what's modified in the workspace.

mods, err = b.Reducer.Reduce(mods, deltas)

I was thinking that ModulesInWorkspace may not be necessary to achieve our goal here. ByWorkspaceChanges in ManifestBuilder could generate the modules based on the current head (i.e. call ManifestBuilder.ByCurrentBranch()), then have another function in the Discover module interrogate the workspace diff for modules. Finally, we union the two sets.

The new function in Discover could generate the Version based on the content of the modified files and their paths (to track mv x y scenarios).

Just food for thought anyway. There could be a simpler approach... 😄
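The union step described above is simple enough to sketch with hypothetical inputs: `head_modules` maps module name to its version as per the head commit, and `workspace_modules` maps modified module names to content-based versions. The content-based entries win, so unmodified modules retain their head versions instead of the fixed string local:

```python
def merge_module_sets(head_modules, workspace_modules):
    """Union of modules as per HEAD with modules modified in the workspace.
    Workspace (content-based) versions override HEAD versions; everything
    else keeps its HEAD version. Hypothetical sketch, not mbt code."""
    merged = dict(head_modules)
    merged.update(workspace_modules)
    return merged
```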

@buddhike (Member)

buddhike commented May 3, 2018

Closing this for now due to inactivity.

@buddhike buddhike closed this as completed May 3, 2018
@johanatan (Author)

How about:

1. Commit local changes.
2. Grab the new content hashes.
3. Restore the working dir, staging area, etc. to their original state.

Not the most elegant solution, but it should be straightforward. The most difficult part would be step 3, restoring all the associated state (new files, edited files, staged files, unstaged files, etc.).

Thoughts?

@buddhike (Member)

That's definitely workable. However, there's a slightly more elegant story to this now.

A couple of weeks ago, @sergey-ostapenko did some awesome work to remove the file-system walking used to work out local modules. If you look at the discussion in #86, you can see a method I suggested around creating a union of the modules as per the head commit and the currently modified items in the workspace. The current implementation is much simpler than that, but it does not have a way to differentiate modules that have been changed, so it is still using the fixed version string local.

If we change the implementation to the union based one, you could technically retain the versions from HEAD tree for unmodified modules. As for modified ones, you could simply create a hash based on the content of modified files and their path.

A bit more work than what you proposed, I suppose, but it would be cleaner in my opinion. (You might even find an easier approach than the union if you look at @sergey-ostapenko's implementation 😉.)

Let me know what you think.

@johanatan (Author)

johanatan commented Jul 26, 2018

Can you elaborate on "simply create a hash based on the content of modified files and their path"? Keep in mind that I'd like the hash we compute here to ultimately match the one that git itself would compute if we were to add and commit all of the local changes, so that build optimisation can occur, i.e. so that already-built modules would not be rebuilt later on.
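For reference, git's per-file (blob) hash is just SHA-1 over a "blob <size>\0" header followed by the file bytes, so hashing the working-tree content this way reproduces exactly what git hash-object (and a later commit) would produce for the same content. Note it covers content only; the path enters git's hashing at the tree-object level, not the blob level.

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """Replicates `git hash-object` for a blob:
    sha1(b"blob <size>\\0" + content)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_sha(b"hello\n"))  # -> ce013625030ba8dba906f756967f9e9ca394464a
```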

@johanatan (Author)

I am using the following script for now as a workaround. Feel free to integrate the equivalent into mbt itself:

#!/usr/bin/env bash
# Argument to the script is a valid project name.
# Commits the working tree against a throwaway index to get real content
# hashes out of `mbt describe head`, then restores .git state on exit.

pushd "$CODEBUILD_SRC_DIR" > /dev/null
tmp_file=$(mktemp)

function cleanup {
    rm -f "$tmp_file"
    rm -r .git/logs
    rm -r .git/refs
    mv .git/logs2 .git/logs
    mv .git/refs2 .git/refs
    [ -f .git/COMMIT_EDITMSG2 ] && mv .git/COMMIT_EDITMSG2 .git/COMMIT_EDITMSG
    popd > /dev/null
}

trap cleanup EXIT

# Back up the state the throwaway commit will touch.
cp .git/index "$tmp_file"
cp -R .git/logs .git/logs2
cp -R .git/refs .git/refs2
[ -f .git/COMMIT_EDITMSG ] && cp .git/COMMIT_EDITMSG .git/COMMIT_EDITMSG2

(
export GIT_INDEX_FILE="$tmp_file"
git add .
git commit -m "commit for latest version" > /dev/null
mbt describe head | grep "^$1 " | tr -s ' ' | cut -d ' ' -f3
)

@buddhike (Member)

> Can you elaborate on "simply create a hash based on the content of modified files and their path"?

We could compute a SHA-1 of the content of the modified file and its path. However, this does not guarantee that you would get the same version after it's committed.

Thanks for sharing your script.
Is this a requirement because your CI does some automated work before building your artifacts, but you don't want that work in the commit tree?

@johanatan (Author)

johanatan commented Jul 28, 2018

Yes, one should not have to commit in order to build.

I think this sums it up (from a systems consistency perspective):

local is a premature, unnecessary & [inherently] destructive coercion to a "bottom" value. It breaks consistency with the rest of an otherwise clean design.
