Add fallback to starters pull on kedro new #3900

Merged: 40 commits, Jul 15, 2024

Contributor

@lrcouto lrcouto commented May 29, 2024

Description

During project creation and test setup, Kedro currently looks up the kedro-starters repo and always uses the latest released version to get project templates. This can cause a problem during releases that depend on changes in the kedro-starters repo, as those changes won't be picked up by the current flow until after they are released.

This PR implements a fallback for that situation. When the version of Kedro installed in your environment does not match the latest kedro-starters release, it will pull the main branch instead.

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

lrcouto and others added 12 commits May 29, 2024 01:29
Signed-off-by: lrcouto <laurarccouto@gmail.com>
@lrcouto lrcouto marked this pull request as ready for review June 5, 2024 04:44
@lrcouto lrcouto requested a review from merelcht as a code owner June 5, 2024 04:44
Contributor Author

lrcouto commented Jun 5, 2024

A couple of notes:

  • I've used the requests_mock library to run tests for this PR because I was having a lot of trouble mocking requests to the GitHub API with the fixtures we already had.
  • In agreement with this response on Slack, I preferred not to touch the kedro checkout function, since it is user-facing and this fallback feature is just for us.
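
For readers unfamiliar with the idea of mocking the GitHub API in tests, here is a minimal sketch. The PR itself uses requests_mock; this example swaps in the standard library's unittest.mock so it has no extra dependencies. The function name `latest_starters_tag`, the helper `fake_urlopen`, and the endpoint URL are assumptions made for illustration, not code from the PR.

```python
import io
import json
from unittest import mock
from urllib import request

# Assumed endpoint: GitHub's "latest release" API for the starters repo
RELEASES_URL = "https://api.github.com/repos/kedro-org/kedro-starters/releases/latest"


def latest_starters_tag(url: str = RELEASES_URL) -> str:
    """Hypothetical helper: read the tag name of the latest release."""
    with request.urlopen(url, timeout=10) as response:
        return json.load(response)["tag_name"]


def fake_urlopen(url, timeout=None):
    # Stand-in response: a file-like object holding a canned JSON payload
    return io.BytesIO(json.dumps({"tag_name": "0.19.6"}).encode())


# While the patch is active, no real network request is made
with mock.patch("urllib.request.urlopen", fake_urlopen):
    mocked_tag = latest_starters_tag()
```

The test never touches the network, so it cannot be rejected or rate limited by GitHub.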

Member

@merelcht merelcht left a comment

The approach looks good to me 👍 I left some small suggestions.

Signed-off-by: lrcouto <laurarccouto@gmail.com>
@lrcouto lrcouto linked an issue Jun 5, 2024 that may be closed by this pull request
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Thank you, @lrcouto, like the solution!

Left one question and suggestion.

@@ -773,7 +812,7 @@ def _make_cookiecutter_args_and_fetch_template(

     tools = config["tools"]
     example_pipeline = config["example_pipeline"]
-    starter_path = "git+https://github.com/kedro-org/kedro-starters.git"
+    starter_path = _STARTERS_REPO
Contributor

@ElenaKhaustova ElenaKhaustova Jun 7, 2024

I can't comment on the exact line because it wasn't changed, but my suggestion is: what if we don't introduce the starter_path variable, since it complicates the logic? Instead we can do:

    else:
        # Use the default template path for non PySpark, Viz or example options:
        return cookiecutter_args, template_path
    return cookiecutter_args, _STARTERS_REPO

Contributor Author

@lrcouto lrcouto Jun 7, 2024

That's a good point. To be honest, I'm not sure why this variable was there in the first place.

EDIT: Reading the code now, the starter_path variable might not always be the same as the default _STARTERS_REPO, as the value may be passed through the template_path parameter.

Contributor

Yes, I've seen that it gets changed, but I still prefer to return the original one instead. See the edited suggestion (I had missed a "return", which made it look confusing).

_STARTERS_REPO = (
"git+https://github.com/kedro-org/kedro-starters.git"
if _kedro_and_starters_version_identical()
else "https://github.com/kedro-org/kedro-starters.git@main"
Contributor

What happens if we point to the main branch (_STARTERS_REPO="https://github.com/kedro-org/kedro-starters.git@main") but then below in _make_cookiecutter_args_and_fetch_template we set the checkout version as well? For example in this case:

if "PySpark" in tools and "Kedro Viz" in tools:
    # Use the spaceflights-pyspark-viz starter if both PySpark and Kedro Viz are chosen.
    cookiecutter_args["directory"] = "spaceflights-pyspark-viz"
    # Ensures we use the same tag version of kedro for kedro-starters
    cookiecutter_args["checkout"] = version

Will it then work as we expect, that is, creating a project while passing both the branch and the checkout version to cookiecutter?

def _create_project(

Contributor Author

This will lead us to cookiecutter shenanigans, which is where these parameters are ultimately used. You have to dig down four or five functions to see it, but cookiecutter is using the template parameter to determine the repo address and the checkout parameter to determine the branch, tag or commit ID. What it does is the following:

    if clone:
        try:
            subprocess.check_output(  # nosec
                [repo_type, 'clone', repo_url],
                cwd=clone_to_dir,
                stderr=subprocess.STDOUT,
            )
            if checkout is not None:
                checkout_params = [checkout]
                # Avoid Mercurial "--config" and "--debugger" injection vulnerability
                if repo_type == "hg":
                    checkout_params.insert(0, "--")
                subprocess.check_output(  # nosec
                    [repo_type, 'checkout', *checkout_params],
                    cwd=repo_dir,
                    stderr=subprocess.STDOUT,
                )

So if you passed repo_url as the address of the main branch and, for example, passed a different branch as the checkout value, cookiecutter would clone the main branch and then check out the other branch.
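
The sequence described above can be sketched as a small helper that only builds the git commands instead of running them. `git_commands` and its parameters are hypothetical names used for illustration; they are not part of cookiecutter.

```python
from typing import List, Optional


def git_commands(repo_url: str, checkout: Optional[str] = None) -> List[List[str]]:
    """Return the git invocations for a clone followed by an optional checkout."""
    commands = [["git", "clone", repo_url]]
    if checkout is not None:
        # The checkout runs after the clone, so it decides the final ref
        commands.append(["git", "checkout", checkout])
    return commands
```

Because the checkout step comes last, whatever is passed as `checkout` determines which ref the template is rendered from, regardless of what was cloned first.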

Contributor

@ElenaKhaustova ElenaKhaustova Jun 7, 2024

Thank you for explaining, that's what I thought as well. Doesn't that mean that in this case, instead of the main branch (which we aim for), we will again get the released version (since the checkout argument defaults to the installed kedro version when it's not passed), which is not desired?

checkout = checkout or version

Contributor Author

That's a good point, let me test that.

Contributor Author

Yeah I think you're right. The stuff passed to cookiecutter has to be formatted in a different way.

Contributor

I was thinking of refactoring the logic in _make_cookiecutter_args_and_fetch_template because now we have two places where we define the template which is a bit confusing. But that can be done in a separate PR.

For now, we can just identify the case when the repo versions do not match (as you do for setting _STARTERS_REPO). In this case, we only need to use the template path provided but skip setting the checkout parameter in _make_cookiecutter_args_and_fetch_template.

Contributor

@ElenaKhaustova ElenaKhaustova Jun 7, 2024

Or we can just do something like

checkout = checkout or version if _kedro_and_starters_version_identical() else None

checkout = checkout or version

Edit: updated link
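
One thing worth checking in the one-liner above is operator precedence: a Python conditional expression binds more loosely than `or`, so the line parses as `(checkout or version) if ... else None`. A minimal sketch, with a hypothetical `versions_match` flag standing in for `_kedro_and_starters_version_identical()` and hypothetical function names:

```python
def pick_checkout(checkout, version, versions_match):
    # Same shape as the suggested one-liner; Python parses it as
    # (checkout or version) if versions_match else None
    return checkout or version if versions_match else None


def pick_checkout_keep_explicit(checkout, version, versions_match):
    # Alternative grouping: an explicit checkout always wins
    return checkout or (version if versions_match else None)
```

When `versions_match` is false, the first form discards a user-supplied checkout, while the second keeps it. Which behaviour is wanted is a design decision for the PR, not something this sketch settles.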

lrcouto and others added 3 commits June 12, 2024 12:26
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Thank you for the updates!

One concern I still have is how we use the checkout argument together with version, and how the related logic differs between the _get_cookiecutter_dir and _make_cookiecutter_args_and_fetch_template methods. I would suggest sticking to a common approach if possible; otherwise the logic is a bit confusing to follow and some corner cases appear.

Once we're done, I would also suggest updating the release checklist to clarify how it works now.

Happy to help with the above.

if directory:
cookiecutter_args["directory"] = directory

tools = config["tools"]
example_pipeline = config["example_pipeline"]
starter_path = "git+https://github.com/kedro-org/kedro-starters.git"
checkout_version = version if kedro_version_match_starters else "main"
Contributor

Are we sure that the checkout key should be different for the cases below? In my understanding, we should:

  1. Use the checkout argument passed
  2. In case the checkout is None go with version if kedro version matches the starters
  3. Use main otherwise

Now we're mixing checkout and version which is a bit confusing. Please correct me if I'm missing any other logic behind it or if there's any particular reason why we don't want to use the provided checkout argument in case some tools are selected.
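
The three rules listed above could be sketched as follows. The function name `resolve_checkout` and the `versions_match` flag are placeholders for illustration, not names from the PR.

```python
from typing import Optional


def resolve_checkout(checkout: Optional[str], version: str, versions_match: bool) -> str:
    """Pick the ref to check out in the starters repo."""
    if checkout:
        # 1. An explicitly passed checkout argument wins
        return checkout
    if versions_match:
        # 2. Otherwise use the installed Kedro version when it matches starters
        return version
    # 3. Fall back to the main branch
    return "main"
```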

else:
# Use the default template path for non PySpark, Viz or example options:
starter_path = template_path

cookiecutter_args["checkout"] = (
Contributor

Related to the previous comment.

If the checkout argument is provided and the kedro version doesn't match starters, do we indeed want to stick to main?

In this case, there's also a discrepancy with _get_cookiecutter_dir method, where we use the provided checkout argument.

Contributor

@DimedS DimedS left a comment

Thank you for the PR, @lrcouto!

I have a few questions regarding it:

  1. I'm trying to understand the algorithm in this PR. Could you please help with that?

In my opinion, we need to check the following:

If current_kedro_version is not in the kedro_starters_tags_list, we should not use checkout in the starters repo (since that tag won't exist). Instead, we should use the default starters repo.
Otherwise, if current_kedro_version is in the list, we should set kedro_starters_tag to current_kedro_version.

From what I see in the current PR, we are checking current_kedro_version against latest_kedro_starters_tag. If they don't match, we use the default repo. It seems that if a user is using an older version of Kedro, it won't match latest_kedro_starters_tag, and the user will receive the latest default repo. However, they should receive the version of starters that matches current_kedro_version (I believe this is the current logic, which we don't want to change).

  2. Why are we using os.environ["KEDRO_STARTERS_VERSION"] to store the latest starters version? Why isn't a global variable sufficient?

Contributor Author

lrcouto commented Jun 25, 2024

Thank you for your reviews, @ElenaKhaustova and @DimedS ! From what I'm getting, the logic of the version/branch/tag selection for starters itself is a little confusing, so I'm gonna look into refactoring it a little bit.

As for the question regarding the environment variable, it came from this discussion. We were having problems with repeated requests to GitHub being rejected, which happened every time tests were run on our CI. My idea is that setting it in the environment once and then checking it would avoid this large number of requests, and it would not need to be re-declared every time the file is loaded.
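
The idea of "set it once, then check it" can be sketched like this. The variable name KEDRO_STARTERS_VERSION comes from the diff under review; the function name and the injected `fetch` callable are hypothetical and stand in for the real request to GitHub.

```python
import os
from typing import Callable

ENV_KEY = "KEDRO_STARTERS_VERSION"


def cached_starters_version(fetch: Callable[[], str]) -> str:
    """Return the cached version if present, otherwise fetch and remember it."""
    cached = os.environ.get(ENV_KEY)
    if cached:
        return cached
    tag = fetch()
    if tag:
        # Only cache non-empty results so a failed fetch can be retried later
        os.environ[ENV_KEY] = tag
    return tag
```

Within a single process, the second call returns the stored value without calling `fetch` again.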

Contributor Author

lrcouto commented Jul 2, 2024

If current_kedro_version is not in the kedro_starters_tags_list, we should not use checkout in the starters repo (since that tag won't exist). Instead, we should use the default starters repo. Otherwise, if current_kedro_version is in the list, we should set kedro_starters_tag to current_kedro_version.

Regarding this point, would we have to get all of the existing Kedro versions from Git? Or does that exist already somewhere? I'm asking because it'd be another request that we'd have to make.

Contributor

DimedS commented Jul 3, 2024

If current_kedro_version is not in the kedro_starters_tags_list, we should not use checkout in the starters repo (since that tag won't exist). Instead, we should use the default starters repo. Otherwise, if current_kedro_version is in the list, we should set kedro_starters_tag to current_kedro_version.

Regarding this point, would we have to get all of the existing Kedro versions from Git? Or does that exist already somewhere? I'm asking because it'd be another request that we'd have to make.

I believe we can use the latest_kedro_starters_tag that you already received and compare it with the current_kedro_version. If the current_kedro_version is greater than the latest_kedro_starters_tag (indicating that Kedro has been released with a new version, but the starters have not yet been updated), then we should use the default starters repository (i.e., the latest starters from the main branch). Otherwise, we should maintain the current logic and use the same version of starters that matches the Kedro version.
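
The comparison proposed above might look roughly like this. It is a sketch only: `parse_version` and `starters_checkout` are hypothetical names, and the parser assumes plain numeric tags such as 0.19.6 (no pre-release suffixes).

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn a plain 'X.Y.Z' tag into a comparable tuple (no pre-release handling)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def starters_checkout(kedro_version: str, latest_starters_tag: str) -> str:
    if parse_version(kedro_version) > parse_version(latest_starters_tag):
        # Kedro is ahead of the latest starters release: no matching tag exists yet
        return "main"
    # Older or equal Kedro versions keep using the starters tag of the same version
    return kedro_version
```

Comparing tuples of integers rather than raw strings matters for cases like 0.19.10 versus 0.19.9, where a string comparison would give the wrong order.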

Member

@merelcht merelcht left a comment

I left one more question, but otherwise the approach with checking if the version is smaller or equal to the starters version looks good!

else:
# Use the default template path for non PySpark, Viz or example options:
starter_path = template_path

cookiecutter_args["checkout"] = (
Member

Shouldn't this just be cookiecutter_args["checkout"] = checkout_version as well?

Contributor

@DimedS DimedS left a comment

Thank you for changing the logic, @lrcouto. It looks mostly good now! I've left a few additional questions.

_STARTERS_REPO = (
"git+https://github.com/kedro-org/kedro-starters.git"
if _kedro_version_equal_or_lower_to_starters(version)
else "https://github.com/kedro-org/kedro-starters.git@main"
Contributor

@DimedS DimedS Jul 9, 2024

Could you please explain why we need these two lines here now:

    if _kedro_version_equal_or_lower_to_starters(version)
    else "https://github.com/kedro-org/kedro-starters.git@main"

elif "Kedro Viz" in tools:
# Use the spaceflights-pandas-viz starter if only Kedro Viz is chosen.
cookiecutter_args["directory"] = "spaceflights-pandas-viz"
cookiecutter_args["checkout"] = checkout_version
Contributor

I agree with Merel, it's better to put cookiecutter_args["checkout"] = checkout_version at the top and write it only once, if we can handle this section properly:

else:
    # Use the default template path for non PySpark, Viz or example options:
    starter_path = template_path

Could you please explain the checkout logic for that part: what should we do here in terms of checkout? As I understand it, here we are taking the standard kedro template from the kedro repo?

logging.error(f"Error fetching kedro-starters latest release version: {e}")
return ""

os.environ["KEDRO_STARTERS_VERSION"] = latest_release["tag_name"]
Contributor

We started discussing the use of environment variables here to optimise time for different runs, such as in CI/CD pipelines. However, I read the following about os.environ:

The environment variable set using os.environ within a Python script is only set for the duration of the current process (i.e., the script execution) and does not persist beyond that. It will not be available to other processes or sessions, and once the script finishes executing, the environment variable will be lost.

If this is correct, it doesn't make sense to set this type of environment variable. Have you tested that it works?
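
A quick way to check the scope of such a variable is the sketch below. The value "0.19.6" is just an example. It shows that a value set through os.environ is visible to child processes started afterwards by the same script; it does not survive in the parent shell or in a separately launched CI step once the script exits.

```python
import os
import subprocess
import sys

os.environ["KEDRO_STARTERS_VERSION"] = "0.19.6"  # example value

# A child process started from this one inherits the variable
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('KEDRO_STARTERS_VERSION'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = child.stdout.strip()
```

So the variable can save repeated lookups inside one test run and its subprocesses, but a new process started independently would have to fetch the version again.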

if directory:
cookiecutter_args["directory"] = directory

tools = config["tools"]
example_pipeline = config["example_pipeline"]
starter_path = "git+https://github.com/kedro-org/kedro-starters.git"

if checkout:
Contributor

@ElenaKhaustova ElenaKhaustova Jul 9, 2024

Thank you for making the logic for setting checkout_version cleaner. Just to double check: shouldn't we distinguish between checkout and version here? If yes, note that above we do the following:

checkout = checkout or version

Do we expect checkout to always be truthy here because it falls back to version?

Contributor

@ElenaKhaustova ElenaKhaustova Jul 9, 2024

Can this logic be global within the new command, along with the selection of starter_path?

Contributor Author

lrcouto commented Jul 10, 2024

Pushed some changes based on the conversation earlier today with @ElenaKhaustova and @DimedS.

  • The logic that selects which checkout argument will be passed to the Cookiecutter function was moved to its own function.
  • _STARTERS_REPO now points only to the main starters repo.
  • Removed redundant logic from _make_cookiecutter_args_and_fetch_template.

Member

@merelcht merelcht left a comment

LGTM 👍

Contributor

@DimedS DimedS left a comment

Thank you for the PR and for addressing the comments, @lrcouto! It looks nice without that cookiecutter_args["checkout"] = version repetition in every if/else branch.

@lrcouto lrcouto merged commit 1aea4a3 into main Jul 15, 2024
34 checks passed
@lrcouto lrcouto deleted the fallback-to-starters-git-pull branch July 15, 2024 15:47
Linked issue: Implement version fallback for starters & framework
5 participants