Skip to content
This repository has been archived by the owner on May 30, 2022. It is now read-only.

Add PR Files stream #6

Closed
wants to merge 7 commits into from
Closed

Add PR Files stream #6

wants to merge 7 commits into from

Conversation

Ry-DS
Copy link

@Ry-DS Ry-DS commented Nov 9, 2021

Not sure what state_partitioning_keys is, hoping it's correct currently. Any ideas?

@Ry-DS Ry-DS requested a review from laurentS November 9, 2021 09:14
@laurentS
Copy link
Collaborator

laurentS commented Nov 9, 2021

Not sure what state_partitioning_keys is, hoping it's correct currently. Any ideas?

The sdk docs are not super clear about it, but basically, what it means from what I've understood:

  • "state" is the progress of the stream, and is used to save a bookmark every so often, so that the tap can resume fetching data from where it left off.
  • "partitioning" is more or less what the state relates to. So if your endpoint lists issues for a repo, the partition is basically the repo. Put another way, a different state partition implies a different api call.
  • "keys", they are the field names on which a multi-column primary key is created by target-postgres. if you look at the current streams.py file, the keys are org, repo for all streams. In your case, I think it would be ["org", "repo", "pr_number"] (or whatever that last field is called). With just org/repo as you have now, all PR states for a given repo would overwrite each other.

@laurentS
Copy link
Collaborator

laurentS commented Nov 9, 2021

As a side question, which maybe we should have looked at first, is there no way to get the same info as what you have, but on a repo level endpoint? Something like /org/repo/pr_files? The reason being to reduce the number of api requests made.

Also, I just realised that you opened this PR against our fork, but maybe worth opening a PR against the main repo, to also get feedback from the meltano team, they understand this tap structure (and the underlying sdk) much better than we do.

Copy link
Collaborator

@laurentS laurentS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, barring a few minor changes. Is there another endpoint to get commits that are not attached to a PR (with the same info, "list commits" does not have the number of changes for instance)


name = "pull_request_files"
path = "/repos/{org}/{repo}/pulls/{pull_number}/files"
primary_keys = ["filename"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will cause problems, as filenames are most likely not going to be globally unique. This comment might be helpful.

Copy link
Author

@Ry-DS Ry-DS Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of the output:
image
I'm not sure what else could be the key. Any ideas? Since hashes could be the same (unlikely) I chose filename, since it looks like it gives the full filepath. What do you think?

primary_keys = ["filename"]
parent_stream_type = PullRequestsStream
ignore_parent_replication_key = False
state_partitioning_keys = ["repo", "org"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want something like this, but not 100% sure how you get the pull_number, you might need to add it to the record in post_process.

Suggested change
state_partitioning_keys = ["repo", "org"]
state_partitioning_keys = ["repo", "org", "pull_number"]

@Ry-DS
Copy link
Author

Ry-DS commented Nov 10, 2021

As a side question, which maybe we should have looked at first, is there no way to get the same info as what you have, but on a repo level endpoint? Something like /org/repo/pr_files? The reason being to reduce the number of api requests made.

Also, I just realised that you opened this PR against our fork, but maybe worth opening a PR against the main repo, to also get feedback from the meltano team, they understand this tap structure (and the underlying sdk) much better than we do.

I tried looking thru the docs but couldn't find anything like you mentioned 😕 (primarily looked to see if we could get files from Repo and work out which PR it came from). We might be stuck with this approach, but do let me konw if you find something that might work.

@ericboucher
Copy link
Member

Opened on meltano's main

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants