Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add reinsert functionality #11

Open
wants to merge 32 commits into
base: master
Choose a base branch
from

Conversation

NickLaMuro
Copy link
Collaborator

@NickLaMuro NickLaMuro commented Nov 13, 2019

VERY MUCH [WIP] AT THE MOMENT

I expect that I will be moving some commits around here and there, possibly into other branches to reduce this branch's size, and hopefully make the code look a lot better... but this is just a starting place for some feedback and discussion to get started.

Also of note: The first couple of commits are currently based on my previous unmerged PRs:

Simpler viewing can be found here:

https://github.com/NickLaMuro/code-extractor/compare/convert-to-module..extract-to-existing-repo

Changes

This is effectively the "undo" feature of the original project. By that, it allows a user to take a portion (or all) of a previously extracted project, and re-insert it back into the original project from which it came from.

This becomes tricky, as we need to filter out the existing commits that existed previously in the old ("target") repository. As a result, 3 filter-branch passes are done to achieve this change:

  1. After determining all file names that existed for the current extractions, filter commits to down to ones that only include the files we wish to commit. In addition, create a script to move any of those files into a temporary directory, maintaining the directory structure as we go.
  2. Take the temporary directory that was created and do a second filter that makes the new root of the project the temporary directory we created in the previous filter.
  3. After adding the target repository as a remote, filter the commits out that already exist in that remote. Also in this step, a commit is re-written that is shared between both repos that will be the "re-inject" commit, which all of the injected commits will be based off of.

From there, cherry-pick commits onto a new branch that is based off the existing HEAD of the current target remote's branch (most likely master) that will be receiving the "injected" commits. This allows the target repo to have the commits that were created "post extraction" that are based off a commit that re-adds the code in a state where it was originally extracted from.

@NickLaMuro
Copy link
Collaborator Author

Just an FYI: I tried running this on bits of the manageiq-gems-pending code that I planned on running this on, and unfortunately part of it doesn't work since that project was extracted prior to this project existing, and some assumptions were made to filter out existing commits that doesn't work without the commit messages including the (transferred from ManageIQ/manageiq@-----) bit.

I have an idea for a fallback, but it is taking me a bit to find a good way to handle that. I will be adding that as a follow up commit as well with a different test case to support the fallback, which will add to the time it will take to impilent.

@Fryguy
Copy link
Collaborator

Fryguy commented Nov 14, 2019

@NickLaMuro hard to review because it's too many things in one PR (though I understand as you said you'd be moving things around and things are based on different PRs), so I'll break down my thoughts:

  • I am 👍 on gemifying...I'll review the other PR.

  • I am 👍 on testing, but that should be a standalone PR (also hard to tell which tests are for the old stuff or the new stuff)

  • I am 〰️ on refactoring the original code. It was pretty simple (77 lines) to begin with, and very easy to follow, so moving things out into classes seems like overkill for a simple tool, unless those classes are going to be reused for the new functionality.

  • I am not sure about reinsert functionality. I can't think of a use-case for it, and even so I'm not sure why we need the complicated logic of matching the tree history vs just grafting in the repo's master branch (or master branch plus a commit to moves things into the right directory locations). I'm more trying to understand what problem you are trying to solve by introducing this direction.

    Separately, I'm not sure this feature belongs in this repo. The code-extractor is very single responsibility at the moment. I could see a separate code-inserter tool in the same vein as this repo.

@NickLaMuro
Copy link
Collaborator Author

NickLaMuro commented Nov 14, 2019

I am 👍 on testing, but that should be a standalone PR (also hard to tell which tests are for the old stuff or the new stuff)

@Fryguy Just an FYI the testing and even part of the "extract into classes" bit were part of the other two PRs mentioned. Best to to use the compare link that I provided in the PR description to see what this PR specifically adds (which admittedly, is still a frick'n lot...)

As I got going with the main purpose of this PR, I realized I had some more refactoring to do to share code between the two processes, but decided not to make too much more "PR-ception" than I already had before I started getting some form of feedback.

I am not sure about reinsert functionality. I can't think of a use-case for it...

As mentioned here, I have one with manageiq-gems-pending and putting some code back into ManageIQ/manageiq. But more on that below.

Separately, I'm not sure this feature belongs in this repo. The code-extractor is very single responsibility at the moment. I could see a separate code-inserter tool in the same vein as this repo.

I am actually fine with that, though I wouldn't mind if you then just scrap all three of these since I don't really see a point in even Gemifying (aka: "making it more complex") if we don't want to add any further scope to the project.

Honestly, the main reason I did this is I wanted a place to put the process that I would be taking with the effort of moving out he "object storage" and miq_ftp_lib.rb that should not exist in the manageiq-gems-pending repo, but don't really deserve standalone gems either. That is the current use case I am working on, but I also attempted something similar past in this PR:

And even documented it in one of the commits:

But it had a poor result, and was eventually scrapped. Besides, eventually ManageIQ/managiq-loggers came into being, but that was almost 1.5 years later. However, I do realize this feature has very limited uses, but I really don't want others to have to try and figure this out for themselves like I did if we ever have to do it again, and codifying it is the easiest way to do that I think.

@Fryguy
Copy link
Collaborator

Fryguy commented Nov 14, 2019

the main reason I did this is I wanted a place to put the process that I would be taking with the effort of moving out he "object storage" and miq_ftp_lib.rb that should not exist in the manageiq-gems-pending repo, but don't really deserve standalone gems either. That is the current use case I am working on

Ah ok. Thanks for the history behind this. That makes more sense to me now. I still think that it deserves a separate tool/repo, and I still like gemifying and testing, and will likely merge those in some form.

I really don't want others to have to try and figure this out for themselves like I did if we ever have to do it again, and codifying it is the easiest way to do that I think.

That's cool. I'm interested to see where this goes, so feel free to rev on it in this PR / repo space.

@NickLaMuro
Copy link
Collaborator Author

So a few force pushes later, and I have I think got this working... (still need to test it on my use case though)

Worth noting: Since this is based off of the #9 and #10 PRs, I found a few mistakes in those that I have since fixed and updated in those respective branches, and did some (not all) of the moving around that I said I would with this PR. A few commits were moved into other branches, and I might make an entirely new PR for some of the test_helper.rb/Rakefile changes I have done and re-ordered already. Hard to see with GitHub, since they sort commits by date and not ancestor order, but here is a pretty git log of the current order:

$ git log ...
* 8e0fc61 Nick LaMuro - (HEAD -> extract-to-existing-repo, origin/extract-to-existing-repo) [CodeExtractor] Support merge commits (23 hours ago)
* 52d2a40 Nick LaMuro - [CodeExtractorReinsertTest] Support custom commits (23 hours ago)
* 2059b75 Nick LaMuro - Add Fallback for when code_extractor wasn't used (30 hours ago)
* cef82a2 Nick LaMuro - Add code_extractor_reinsert_test.rb (5 days ago)
* dcc269f Nick LaMuro - [code_extractor_test.rb] Add .update_extraction_hash (5 days ago)
* c20177b Nick LaMuro - [code_extractor_test] Add .apply_new_commits_on_extracted_repo (5 days ago)
* d4c2e80 Nick LaMuro - [code_extractor_test] Add .perform_merges_of_extracted_code (5 days ago)
* a77c492 Nick LaMuro - [code_extractor_test.rb] Add .create_base_repo (5 days ago)
* 2943b5b Nick LaMuro - [CodeExtractor] Add reinsert functionality (7 days ago)
* 06f02bf Nick LaMuro - [test_helper.rb] Add TestRepo.merge (24 hours ago)
* 49d5de8 Nick LaMuro - [test_helper.rb] Add TestRepo#execute (24 hours ago)
* 5b35c07 Nick LaMuro - [test_helper.rb] Add TestRepo#checkout_b (7 days ago)
* cacf177 Nick LaMuro - [test_helper.rb] Add TestRepo.clone_at helper (7 days ago)
* 5351667 Nick LaMuro - [test_helper.rb] Add .set_destination_dir (7 days ago)
* eaaca5c Nick LaMuro - [test_helper.rb] Add assert_commits helper (7 days ago)
* e55817d Nick LaMuro - [Rakefile] Add test:debug task (7 days ago)
* 19ad15f Nick LaMuro - (origin/convert-to-module, convert-to-module) [CodeExtractor] Break out GitProject (3 weeks ago)
* 6020bdf Nick LaMuro - [CodeExtractor] Break out Config to it's own class (3 weeks ago)
* 12da249 Nick LaMuro - [CodeExtractor] Convert to module (3 weeks ago)
* 75534dc Nick LaMuro - (origin/gemify, gemify) [README.md] Update install/usage instructions (3 weeks ago)

(note: commits with [Rakefile] and [test_helper.rb] would get their own PR)

As another aside: All of this digging into git I have done for this PR has resulted in me finding out a better way to rebase "onto" branches I am working off of and moving commits between the two (the tl;dr for the lengthy blog post I would write about it is git rebase --onto).


Anyway, I will be testing out what I currently have against manageiq-gems-pending to so how that goes. Hopefully I have covered all the bases at this point.

This could have been done as part of the "Gemify" effort, but the intent
of that branch was to make as few changes to the core script before
starting to make sweeping changes.

By converting `CodeExtractor` to a `module`, that file and
`lib/code_extractor/version.rb` can be loaded at the same time (yes,
they couldn't before...), and it is just a better practice for dealing
with namespaces in general.

The `Runner` class was a necessary side effect, but I think it is mostly
understandable what it's role is and should be a straight forward
change.

Fun Fact
--------

The `attr_reader :extraction` is actually never used.  Kept it as a part
of `CodeExtractor::Runner`, but really, it is never used.
This is just a refactor commit to start the process of breaking out the
pieces of the code so that we can more easily distinguish certain pieces
of the script.
This commit separates the concerns of the runner and the GitProject to
define methods that make sense for a generic GitProject.

In the future, the intent is that this lib should be able to move code
between two separate and already existing `GitProject` instances, and
having this abstraction layer in place will allow code to be shared more
easily.

This also still attempts to retain the "script-like" nature of the
original project in the `Runner`, but hopefully creating this generic
`GitProject` construct will facilitate future improvements.
A convenience task for running specs and do some diagnosis on the
generated repos following the test run.

Basically avoids cleaning up the sandbox git directory after each test.
Used to test certain commits exist in the destination repo for the
current branch.
Used to reset the location of the destination_repo directory and clear
out any related variables.
Used to clone a repo (first arg) to a new directory (second arg).

`TestRepo#clone` is added as well to assit with this as well.
A instance method helper for TestRepo which performs a checkout out of a
new branch based off another (or the current branch if no second arg is
passed).

Mirrors the usage of `git checkout -b [new_branch] [base_branch]`
Shared method for running DSL methods (basically calls just calls
`instance_eval`)
Adds a helper function for handling merges via rugged.

Of note:  A `.checkout_head` call was added in addition to what was
there previously.  After doing a `commit` in rugged with a merge, the
git tree is left in a odd state (probably since the merge functionality
isn't "porcelain"), so this makes sure the git tree state is on HEAD.
This is effectively the "undo" feature of the original project.

By that, it allows a user to take a portion (or all) of a previously
extracted project, and re-insert it back into the original project from
which it came from.

This becomes tricky, as we need to filter out the existing commits that
existed previously in the old ("target") repository.  As a result, 3
`filter-branch` passes are done to achieve this change:

1. After determining all file names that existed for the current
extractions, filter commits to down to ones that only include the files
we wish to commit.  In addition, create a script to move any of those
files into a temporary directory, maintaining the directory structure as
we go.
2. Take the temporary directory that was created and do a second filter
that makes the new root of the project the temporary directory we
created in the previous filter.
3. After adding the target repository as a remote, filter the commits
out that already exist in that remote.  Also in this step, a commit is
re-written that is shared between both repos that will be the
"re-inject" commit, which all of the injected commits will be based off
of.

From there, cherry-pick commits onto a new branch that is based off the
existing HEAD of the current target remote's branch (most likely
`master`) that will be receiving the "injected" commits.  This allows
the target repo to have the commits that were created "post extraction"
that are based off a commit that re-adds the code in a state where it
was originally extracted from.
Meant as a seed method for building up the same sample repo, and
reducing boiler plate in a few of the tests.

Currently isn't made terribly flexible, but if that is required, it can
be added in the future.
Another helper method for breaking up some of the setup code for the
"unextract" test, which handles the portion of the setup that replicates
the process of merging the original extraction.

Not the best implementation, since `@bare_repo_dir` gets set as part of
the new method, and is used later in the test itself as the
`target_remote` in the `extraction_hash`, but for now it is a quick
solution.
Another helper method that includes boilerplate for updating the newly
extracted repo with some new commits for use when "unextracting"

Once again, instance variables were used here in a less than graceful
fashion, but it allowed for moving a bit more code out of the main
testing method that was mostly trivial.
Another helper methods for updating common changes to the
extraction_hash when doing a re-insert.

This, and the previous commits, will make more sense in a follow up
commit handling a different edge case with `re-insert`.
This extracts the "unextract" (aka: "reinsert") test into it's own test,
with the intent to later fill out other test cases in future commits.

Note:  The `create_base_repo` needed to be added to the `test_helper.rb`
since it was needed by both test files.  Not a huge change, but worth
nothing.
Allows reinsert to work when code_extractor wasn't used to originally
extract the commits
In .apply_new_commits_on_extracted_repo, add the ability to pass a
block, which will then instead be used instead of what was used by
default in the method.

This allows for taking advantage of the new `.merge` DSL for
representing so more real-world extracted repos
When using cherry-pick, it wouldn't handle merges properly, since it
requires a `-m 1` argument for it to work properly.

By switching to using `git rebase --root --onto`, we can take the branch
that now is a "orphan" branch, and apply the commits onto the target
remote's target branch.

Left extra code in there to show other possibilities (since this isn't
getting merged in away)
Improves the fallback approach for `.previously_extracted_commits` to
check against the entirety of the commit message.

For those cases where it is "fixes tests" or "Review Comments".
When updating the commit that will be the root commit for the injected
commits, update the committer and author values so that it is more
obvious who re-added the commit.

As a result, also include the original committer/author info in the
updated commit message.  Formatting has been tweaked in this commit as
well to make it a little easier to see what is being done.
Makes it easier to try test things out from post "pruning" step.
This was a mistake to have this.  Since the commits will work just fine
as they are, and don't need to have the content of master to do so, this
can be removed entirely.

This step is much simpler now, as it is just updating the commits we are
bringing over and then applying them with a `git rebase --onto ...`.
The upstream_name should be used here, and the target name should be
used for the filter.
This could be a "[FIXUP]" commit, but I think it is worth pointing out
what I did here.

Since there is multiple `git-filter-branch` calls, the original commit
SHAs that existed are re-written after each.  So delaying writing the
updated commit message will only produce references to commits that
don't exist since they were temporary.

So moving `--msg-filter` to the first `git-filter-branch` was the first
logical step.  However, this caused another issue.  Since we allow a
fallback for finding the commit messages that existed in the previous
repository, that would now be incorrect (in it's previous form) since it
relied on matching the exact messages created.

To handle this in most cases, and lines of `(transferred from ...@` are
filtered out from the commit messages before comparing.

* * *

The remaining changes are just to support this change.

  - Adjust some method signatures to support this change.
  - Update `assert_commits` to confirm the commit messages are pointing
    to a proper commit
  - Add a `reference_repo_dir` and `reference_repo` to help with the
    `assert_commits` changes (set in one of the helper methods)
Prior, when doing `build_prune_script`, the branch was left at what ever
state the previous step had left it at.  However, this caused and issue
when running `build_prune_script` as it was attempting to find the
history for files that were just deleted in the `extract_branch` step
(the branch that will be pushed to the `upstream_repo` that "deletes
code").

This step ensures a `git checkout #{source_branch}` prior to building
the script to ensure the files and their previous versions still exist.

And additional tweak was also added to split up the initialization of
the `file_and_ancestors` variable to make it easier to debug.
In a method above, "3 steps" was used to described how the prune script
would be used.  This change updates the second explaination to match.
This is a test to also include grafting as a feature (currently
disabled).

The intent is that a `graft_parent_commit` could be provided in the
config yaml, and that would be used in the changes here.

The steps in exampled in order:

1. Create a new branch with a HEAD at the "graft" point
2. Get back on the "inject_branch" (possibly unnecessary)
3. Rebase ("graft") the inject_branch onto the "old_branch"
4. Update the now grafted inject_branch to have the latest upstream
   changes

Step two is only "possibly unnecessary" if making use of another argument
of `git rebase` which I have yet to test.

Besides that though, the changes are pretty minimal to work with, so
this ends up being a pretty simple section to wrap your head around
(once you wrap your head around getting to this point in the first
place...).

Of note:  Since `inject_branch` is a branch with no parent commit for
it's root commit, the `--root` option is used here for `git rebase`,
which is important since the second arg is then the tip of
`inject_branch` (the branch name), and not the base of the branch that
you might normally provide when using `git rebase`.  This is important
since providing the base commit otherwise would "drop" that commit as
part of the rebase, and cause some other issues as a result (merge
conflicts).
@juliancheal
Copy link
Owner

👍 @NickLaMuro this looks like an awesome feature!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants