-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fallback to starters pull on kedro new #3900
Conversation
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
A couple notes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach looks good to me 👍 I left some small suggestions.
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @lrcouto, like the solution!
Left one question and suggestion.
kedro/framework/cli/starters.py
Outdated
@@ -773,7 +812,7 @@ def _make_cookiecutter_args_and_fetch_template( | |||
|
|||
tools = config["tools"] | |||
example_pipeline = config["example_pipeline"] | |||
starter_path = "git+https://github.com/kedro-org/kedro-starters.git" | |||
starter_path = _STARTERS_REPO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot comment on the proper line as it was not changed but the suggestion is what if we do not introduce the starter_path
variable as it complicates the logic? So instead we can do:
else:
# Use the default template path for non PySpark, Viz or example options:
return cookiecutter_args, template_path
return cookiecutter_args, _STARTERS_REPO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. To be honest, I'm not sure why this variable was there in the first place.
EDIT: Reading the code now, the starter_path variable might not aways be the same as the default _STARTERS_REPO, as the value may be passed through the template_path parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've seen it changed but I still prefer to return the original one instead. See the edited suggestion (I've missed a "return" and looked confusing)
kedro/framework/cli/starters.py
Outdated
_STARTERS_REPO = ( | ||
"git+https://github.com/kedro-org/kedro-starters.git" | ||
if _kedro_and_starters_version_identical() | ||
else "https://github.com/kedro-org/kedro-starters.git@main" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we point to the main branch (_STARTERS_REPO="https://github.com/kedro-org/kedro-starters.git@main"
) but then below in _make_cookiecutter_args_and_fetch_template
we set the checkout version as well? For example in this case:
if "PySpark" in tools and "Kedro Viz" in tools:
# Use the spaceflights-pyspark-viz starter if both PySpark and Kedro Viz are chosen.
cookiecutter_args["directory"] = "spaceflights-pyspark-viz"
# Ensures we use the same tag version of kedro for kedro-starters
cookiecutter_args["checkout"] = version
Will it then work as we expect I mean creating a project and passing the branch and the checkout version to the cookiecutter?
kedro/kedro/framework/cli/starters.py
Line 932 in 15ba4da
def _create_project( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will lead us to cookiecutter shenanigans, which is where these parameters are ultimately used. You have to dig down four or five functions to see it, but cookiecutter is using the template parameter to determine the repo address and the checkout parameter to determine the branch, tag or commit ID. What it does is the following:
if clone:
try:
subprocess.check_output( # nosec
[repo_type, 'clone', repo_url],
cwd=clone_to_dir,
stderr=subprocess.STDOUT,
)
if checkout is not None:
checkout_params = [checkout]
# Avoid Mercurial "--config" and "--debugger" injection vulnerability
if repo_type == "hg":
checkout_params.insert(0, "--")
subprocess.check_output( # nosec
[repo_type, 'checkout', *checkout_params],
cwd=repo_dir,
stderr=subprocess.STDOUT,
)
So if you passed the repo_url as the address to the main branch and, for example, passed the checkout value as a different branch, cookiecutter would clone the main branch and then checkout to the other branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for explaining, that's what I thought as well. Doesn't it mean that in this case instead of main
branch (which we aim for) we again will have the released branch (as the checkout argument is the installed kedro version in case it's not passed) which is not desired?
kedro/kedro/framework/cli/starters.py
Line 344 in 15ba4da
checkout = checkout or version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, let me test that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think you're right. The stuff passed to cookiecutter has to be formatted in a different way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of refactoring the logic in _make_cookiecutter_args_and_fetch_template
because now we have two places where we define the template which is a bit confusing. But that can be done in a separate PR.
Now, we can just identify the case when repos versions are not matched (as you do for setting the _STARTERS_REPO
). In this case, we only need to use the template path provided but skip setting up the checkout parameter in _make_cookiecutter_args_and_fetch_template
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we can just do something like
checkout = checkout or version if _kedro_and_starters_version_identical() else None
kedro/kedro/framework/cli/starters.py
Line 344 in 15ba4da
checkout = checkout or version |
Edit: updated link
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates!
One of the concerns which is still in my head is how we use checkout
argument provided with version
and the way we change the related logic between _get_cookiecutter_dir
and _make_cookiecutter_args_and_fetch_template
methods. I would suggest to stick to common approach if possible, otherwise it's a bit confusing to follow the logic and some corner cases are appearing.
Once we're done I would also suggest to update the release check list to clarify how it works now.
Happy to help with the above.
kedro/framework/cli/starters.py
Outdated
if directory: | ||
cookiecutter_args["directory"] = directory | ||
|
||
tools = config["tools"] | ||
example_pipeline = config["example_pipeline"] | ||
starter_path = "git+https://github.com/kedro-org/kedro-starters.git" | ||
checkout_version = version if kedro_version_match_starters else "main" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure that the checkout
key should be different for the cases below? In my understanding, we should:
- Use the
checkout
argument passed - In case the
checkout
isNone
go withversion
if kedro version matches the starters - Use
main
otherwise
Now we're mixing checkout
and version
which is a bit confusing. Please correct me if I'm missing any other logic behind it or if there's any particular reason why we don't want to use the provided checkout
argument in case some tools are selected.
kedro/framework/cli/starters.py
Outdated
else: | ||
# Use the default template path for non PySpark, Viz or example options: | ||
starter_path = template_path | ||
|
||
cookiecutter_args["checkout"] = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to the previous comment.
If the checkout
argument is provided and kedro version doesn't match starters, do we indeed want to sick to main
?
In this case, there's also a discrepancy with _get_cookiecutter_dir
method, where we use the provided checkout
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR, @lrcouto!
I have a few questions regarding it:
- I'm trying to understand the algorithm in this PR. Could you please help with that?
In my opinion, we need to check the following:
If current_kedro_version
is not in the kedro_starters_tags_list
, we should not use checkout in the starters repo (since that tag won't exist). Instead, we should use the default starters repo.
Otherwise, if current_kedro_version
is in the list, we should set kedro_starters_tag
to current_kedro_version
.
From what I see in the current PR, we are checking current_kedro_version
against latest_kedro_starters_tag
. If they don't match, we use the default repo. It seems that if a user is using an older version of Kedro, it won't match with latest_kedro_starters_tag
, and the user will receive the latest default repo. However, they should receive the version of starters that matches current_kedro_version
(I believe this is a current logic that we don't want to change).
- Why are we using
os.environ["KEDRO_STARTERS_VERSION"]
to store the latest starters version? Why isn't a global variable sufficient?
Thank you for your reviews, @ElenaKhaustova and @DimedS ! From what I'm getting, the logic of the version/branch/tag selection for starters itself is a little confusing, so I'm gonna look into refactoring it a little bit. As for the question regarding the environment variable, it came from this discussion. We were having problems with repeated requests to Github being rejected, which happened every time tests were ran on out CI. My idea is that setting it on the environment once and then checking it would avoid this large number of requests, and it would not need to be re-declared every time the file is loaded. |
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Regarding this point, would we have to get all of the existing Kedro versions from Git? Or does that exist already somewhere? I'm asking because it'd be another request that we'd have to make. |
I believe we can use the |
Signed-off-by: lrcouto <laurarccouto@gmail.com>
…kedro into fallback-to-starters-git-pull
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one more question, but otherwise the approach with checking if the version is smaller or equal to the starters version looks good!
kedro/framework/cli/starters.py
Outdated
else: | ||
# Use the default template path for non PySpark, Viz or example options: | ||
starter_path = template_path | ||
|
||
cookiecutter_args["checkout"] = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this just be cookiecutter_args["checkout"] = checkout_version
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for changing the logic, @lrcouto. It looks mostly good now! I've left a few additional questions.
kedro/framework/cli/starters.py
Outdated
_STARTERS_REPO = ( | ||
"git+https://github.com/kedro-org/kedro-starters.git" | ||
if _kedro_version_equal_or_lower_to_starters(version) | ||
else "https://github.com/kedro-org/kedro-starters.git@main" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please comment why do we need that two lines here now:
if _kedro_version_equal_or_lower_to_starters(version)
else "https://github.com/kedro-org/kedro-starters.git@main"
kedro/framework/cli/starters.py
Outdated
elif "Kedro Viz" in tools: | ||
# Use the spaceflights-pandas-viz starter if only Kedro Viz is chosen. | ||
cookiecutter_args["directory"] = "spaceflights-pandas-viz" | ||
cookiecutter_args["checkout"] = checkout_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Merel, it's better to put cookiecutter_args["checkout"] = checkout_version
on top and write it only once, if we can handle that section properly:
else:
# Use the default template path for non PySpark, Viz or example options:
starter_path = template_path
could you please explain checkout logic with that part, what should we do here in terms of checkout? As I understood here we are taking standard kedro template from kedro repo?
logging.error(f"Error fetching kedro-starters latest release version: {e}") | ||
return "" | ||
|
||
os.environ["KEDRO_STARTERS_VERSION"] = latest_release["tag_name"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We started discussing the use of environment variables here to optimise time for different runs, such as in CI/CD pipelines. However, I read the following about os.environ:
The environment variable set using os.environ within a Python script is only set for the duration of the current process (i.e., the script execution) and does not persist beyond that. It will not be available to other processes or sessions, and once the script finishes executing, the environment variable will be lost.
If this is correct, it doesn't make sense to set this type of environment variable. Have you tested that it works?
kedro/framework/cli/starters.py
Outdated
if directory: | ||
cookiecutter_args["directory"] = directory | ||
|
||
tools = config["tools"] | ||
example_pipeline = config["example_pipeline"] | ||
starter_path = "git+https://github.com/kedro-org/kedro-starters.git" | ||
|
||
if checkout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making setting the checkout_version
logic cleaner. Just to double check: shouldn't we differ checkout
and version
here? if yes, above we do the following:
checkout = checkout or version
kedro/kedro/framework/cli/starters.py
Line 345 in 9ca8844
checkout = checkout or version |
Do we expect here that checkout
is True
because it is set to version
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this logic be global within new
command as well as selection on starter_path
?
Signed-off-by: lrcouto <laurarccouto@gmail.com>
Pushed some changes based on the conversation earlier today with @ElenaKhaustova and @DimedS.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR and addressing comment, @lrcouto ! Looks nice without that cookiecutter_args["checkout"] = version
repetition in every if else
line.
Description
During our current project creation and test setup, Kedro looks up to the
kedro-starters
repo and always uses the latest released version to get project templates. This can cause a problem during releases that depends on changes on thekedro-starters
repo, as they won't be acknowledged by the current flow until after they are released.This PR implements a fallback for when this situation happens. When the version of Kedro installed on your environment does not match the latest
kedro-starters
release, it will pull themain
branch instead.Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file