Compute relation between patch series #34

Open
metp opened this issue Nov 8, 2019 · 8 comments

Comments

@metp
Contributor

metp commented Nov 8, 2019

Currently, PaStA only analyzes patches (hence the name, of course). As Patchwork will support relations between both patches and cover letters, the question arises whether we should support analyzing cover letters as well.

@bulwahn
Contributor

bulwahn commented Nov 25, 2019

PaStA relates patches to each other using heuristics optimised for that purpose.

However, PaStA also has the information about which patches are in which series. So, we can use a further algorithm/heuristic to conclude, from related patches among multiple series, which series (possibly identified by their cover letters) are related to each other.

We can also consider possible metrics on the cover letters of the series as a further factor for determining the correct relation between series.
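
A minimal sketch of how such a series-level heuristic could look, assuming the patch relation is already available as clusters of related patch ids. The function name `related_series`, the `series_of` mapping, the overlap score and the `min_overlap` threshold are all hypothetical and not taken from PaStA; the sketch also assumes that each patch belongs to at most one cluster.

```python
from collections import defaultdict
from itertools import combinations


def related_series(patch_clusters, series_of, min_overlap=0.5):
    """Return pairs of series whose patches fall into shared clusters.

    patch_clusters: iterable of sets of patch ids judged related to each other
    series_of:      dict mapping a patch id to its series id (assumed layout)
    min_overlap:    illustrative threshold on the overlap score
    """
    # Count the patches per series so the score can be normalised.
    series_size = defaultdict(int)
    for series in series_of.values():
        series_size[series] += 1

    # For every pair of series, count the clusters containing patches of both.
    shared = defaultdict(int)
    for cluster in patch_clusters:
        touched = sorted({series_of[p] for p in cluster if p in series_of})
        for a, b in combinations(touched, 2):
            shared[(a, b)] += 1

    # Normalise by the smaller series and keep pairs above the threshold.
    result = {}
    for (a, b), count in shared.items():
        score = count / min(series_size[a], series_size[b])
        if score >= min_overlap:
            result[(a, b)] = score
    return result


# Toy data: two of the three patches of "v1" reappear in "v2".
clusters = [{"v1-1", "v2-1"}, {"v1-2", "v2-2"}, {"v1-3"}, {"other-1", "v2-3"}]
series_of = {
    "v1-1": "v1", "v1-2": "v1", "v1-3": "v1",
    "v2-1": "v2", "v2-2": "v2", "v2-3": "v2",
    "other-1": "other",
}
print(related_series(clusters, series_of))
```

Normalising by the smaller series means a single-patch series counts as fully related as soon as its one patch matches, which is exactly the kind of case where a cover-letter metric could serve as an additional factor.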

@rralf Would this be a suitable task for a bachelor's/master's thesis topic?

@bulwahn
Contributor

bulwahn commented Nov 25, 2019

Probably, the issue should be renamed to "Compute relation between patch series"; the cover letter is only one part of a patch series, one that identifies the series and is unique to it.

@metp metp changed the title from "Support cover letter analysis" to "Compute relation between patch series" on Nov 29, 2019

@vaniisgh

vaniisgh commented Jul 1, 2020

Hey :)
I read about PaStA on the community bridge website and was looking at this issue. It seems really interesting (and challenging), and I would like to try working on something like this, or even contribute to smaller issues independently. If you have any pointers on how to go about this, I would really appreciate it.

thanks & regards

edit: maybe something like #33 (combining PaStA with the cregit tool) would be more suitable, but the algorithm part of this issue really excites me :)

@bulwahn
Contributor

bulwahn commented Jul 1, 2020

@vaniisgh we have enough work on all ends of this project: deep internals, nice visualisations, connecting with other tools, etc.

I think this task is suitable for a mentorship. To begin with, I need to ask whether you roughly know the kernel workflow on the mailing lists, e.g., do you know what a cover letter is, what a patch series is, etc.?

Also, a somewhat simpler way to get started is to look into #21 or #14; please have a look, and then we can create a vision for a tool that we would like to develop for those points.

@vaniisgh

vaniisgh commented Jul 1, 2020

Thanks for the reply :)
Honestly, I am a beginner when it comes to contribution workflows. I have only ever used GitHub & Gerrit to push changes, so I haven't really used a mailing list before. Though I am aware of cover letters, I have never sent one. I think my knowledge of patches and patch series is a bit better :)

I will look at the issues you have mentioned, possibly comment with my doubts/ideas on the appropriate one, and try to get started on one of those first.

@vaniisgh

vaniisgh commented Jul 5, 2020

So I was reading through the papers mentioned in the README :)
and was thinking about how the current algorithms could be extended most elegantly. I was wondering if this should be done by

  • extracting keywords from the cover letter and commit messages, and extending the comparison (currently done with the Levenshtein string distance after tokenisation) with something like the Needleman-Wunsch algorithm or an affine-gap algorithm,

I was referring to these resources to understand the string matching better:

and then extending the same methodology to the diffs too.
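
For illustration, here is a minimal sketch of Needleman-Wunsch global alignment scoring applied to token sequences instead of characters. It uses a linear gap penalty (not the affine-gap variant), and the scoring values, the whitespace `tokenise` helper and the example subject lines are assumptions made for this sketch; it is not how PaStA compares messages.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two token sequences (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per token.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: align both tokens, or gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]


def tokenise(text):
    """Very naive tokeniser: lower-case and split on whitespace."""
    return text.lower().split()


s1 = tokenise("mm: fix NULL pointer dereference in foo")
s2 = tokenise("mm: fix a NULL pointer dereference in foo")
print(needleman_wunsch(s1, s2))
```

The same function works on any list of hashable tokens, so applying it to the lines of a diff would only require a different tokeniser.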

I also have this ... kind of adventurous idea. It really is half baked though ...
In bioinformatics we use a lot of sequence alignment algorithms, and it would be cool to use them here too, since code, like DNA or protein sequences, has a fixed number of sensible tokens. This is still something I am thinking about, but I wanted to share it. I was thinking of something like:

  • the BLAST algorithm

  • the algorithm used by Clustal. Its steps are:

    1. Calculate all possible pairwise alignments and record the score for each pair.
    2. Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining).
    3. Find the two most closely related sequences.
    4. Align the sequences by the progressive method:
       i. Calculate a consensus of this alignment.
       ii. Replace the two sequences with the consensus.
       iii. Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
       iv. Iterate until all sequences have been aligned.
    5. Expand the consensus sequences with the (gapped) original sequences.
    6. Report the multiple sequence alignment.

Then we could use this sequence alignment to generate similarity results based on the weights/significance and the total amount of changed code that matches?

@bulwahn
Contributor

bulwahn commented Jul 5, 2020

Nice collection of stuff, but unfortunately probably all irrelevant.

The PaStA project faces two issues:

  1. We have very few applications of PaStA implemented (we have no real users of the overall program), because many of the ideas for use cases have not been implemented. This should be the focus.

  2. We have only a very small ground truth. No sophisticated algorithm helps while the ground truth is this small. There is no way to increase the ground truth dataset, so we should focus on fixing specific systematic issues.

This issue is about extending the data structures to identify and include the notion of patch series, and then trying to compute a relationship between them.
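
As a rough sketch of what "the notion of patch series" could mean as a data structure: one record per series with an optional cover letter and its ordered patches, plus a reverse index from patch to series. The class `PatchSeries`, the helper `index_by_patch` and the message ids below are hypothetical and only illustrate the idea; PaStA's actual data structures may look quite different.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PatchSeries:
    """One submitted series: an optional cover letter plus ordered patches."""
    series_id: str
    cover_letter: Optional[str] = None  # message id of the cover letter, if any
    patches: List[str] = field(default_factory=list)  # message ids, in order


def index_by_patch(series_list: List[PatchSeries]) -> Dict[str, PatchSeries]:
    """Map every patch message id to the series that contains it."""
    index: Dict[str, PatchSeries] = {}
    for series in series_list:
        for patch in series.patches:
            index[patch] = series
    return index


# Made-up message ids, purely for illustration.
v1 = PatchSeries(
    "v1",
    cover_letter="<cover-v1@example.org>",
    patches=["<p1-v1@example.org>", "<p2-v1@example.org>"],
)
v2 = PatchSeries("v2", patches=["<p1-v2@example.org>"])

index = index_by_patch([v1, v2])
print(index["<p2-v1@example.org>"].series_id)
```

With such an index, a relation found between two patches can be lifted directly to the two series that contain them.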

@vaniisgh

vaniisgh commented Jul 6, 2020

Thanks for taking the time to review all this and answer my doubts. I'm just trying to understand PaStA at the moment, so sorry about all the irrelevant comments.
I think I understand the issues outlined now. Thanks :)
I will follow up soon with a more appropriate solution idea/POC.


3 participants