Compute relation between patch series #34

metp · 2019-11-08T16:17:12Z

Currently PaStA only analyzes patches (hence the name). As Patchwork will support relations between both patches and cover letters, the question arises whether we should support analyzing cover letters as well.

bulwahn · 2019-11-25T07:50:21Z

Pasta relates patches to each other using a heuristic optimised for relating patches.

However, pasta also has the information about which patches are in which series. So, we can use a further algorithm/heuristic to conclude, from related patches spread across multiple series, which series (possibly identified by their cover letters) are related to each other.

We can also consider possible metrics on the cover letters of the series as a further factor for determining the correct relation between series.
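A minimal sketch of the idea above, assuming we already have clusters of related patches and a mapping from each patch to its series. The names `related_series`, `clusters`, `series_of` and `min_shared` are hypothetical and not part of PaStA; the threshold is only an illustration of one possible heuristic.

```python
from collections import defaultdict
from itertools import combinations


def related_series(clusters, series_of, min_shared=1):
    """Count, for each pair of series, how many patch clusters link them.

    clusters:   iterable of sets of patch ids considered related (assumed input)
    series_of:  dict mapping patch id -> series id (assumed input)
    min_shared: keep only series pairs linked by at least this many clusters
    """
    shared = defaultdict(int)
    for cluster in clusters:
        # Series that contain at least one patch of this cluster.
        touched = {series_of[p] for p in cluster if p in series_of}
        for a, b in combinations(sorted(touched), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy data: series A-v2 resends two patches of A-v1; B-v1 shares only one.
clusters = [{"p1", "q1"}, {"p2", "q2"}, {"p3", "r1"}]
series_of = {
    "p1": "A-v1", "p2": "A-v1", "p3": "A-v1",
    "q1": "A-v2", "q2": "A-v2",
    "r1": "B-v1",
}
result = related_series(clusters, series_of, min_shared=2)
print(result)
```

With `min_shared=2`, only the pair linked by two clusters survives. A cover letter metric could later be added as an extra weight on each pair.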

@rralf Would this be a suitable task for a bachelor's/master's thesis topic?

bulwahn · 2019-11-25T07:57:10Z

Probably, the issue should be renamed to "Compute relation between patch series"; the cover letter is only a part of a patch series, that identifies and is unique to the patch series.

vaniisgh · 2020-07-01T04:39:28Z

Hey :)
I read about PaStA on the community bridge website and was looking at this issue. It seems really interesting (& challenging), and I would like to try to work on something like this, or even contribute to smaller issues independently. If you have any pointers on how to go about this process, I would really appreciate it.

thanks & regards

edit: maybe something like #33 combining PaStA with the cregit tool will be more suitable, but the algorithm part of this issue really excites me :)

bulwahn · 2020-07-01T11:22:15Z

@vaniisgh we have enough work on all ends of this project, deep internals, nice visualisations, connecting with other tools etc.

I think this task here is suitable for a mentorship. To begin with, I need to ask if you roughly know the kernel workflow on the mailing lists, e.g., do you know what a cover letter is, what a patch series is, etc.?

Also, a somewhat simpler way to get started is to look into #21 or #14; please have a look, and then we can create a vision for a tool that we would like to develop for those points.

vaniisgh · 2020-07-01T11:57:10Z

Thanks for the reply :)
Honestly, I am a beginner when it comes to contribution workflows: I have only ever used GitHub & Gerrit to push changes, so I haven't really used a mailing list before. Though I am aware of cover letters, I haven't ever sent one. I think my knowledge of patches and patch series is a bit better :)

I will look at the issues you have mentioned and possibly comment on my doubts/ideas on the appropriate one and try to get started on one of those first.

vaniisgh · 2020-07-05T17:29:38Z

So I was reading through the papers mentioned in the readme :)
and was thinking about how the current algorithms could be extended most elegantly. I was wondering if this should be done by

extracting keywords from the cover letter and commit messages (which is currently done with the Levenshtein string distance after tokenisation) and extending that with something like the Needleman-Wunsch algorithm or an affine gap algorithm.

I was referring to these resources to understand the string matching better:

and then extend the same methodology for the diffs too.
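To make the proposal concrete, here is a small sketch of Needleman-Wunsch global alignment over word tokens. It is not PaStA code; the function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration, and it uses a linear gap penalty rather than an affine one.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two token lists (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # token of a aligned with a gap
            left = cur[j - 1] + gap   # token of b aligned with a gap
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


# Two hypothetical commit subjects, tokenised on whitespace.
x = "fix null pointer dereference in probe".split()
y = "fix null dereference in probe path".split()
score = needleman_wunsch(x, y)
print(score)
```

Unlike a plain edit distance, the score rewards matching tokens, so a higher value means more similar; how to normalise it for messages of different length is left open here.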

I also have this ... kind of adventurous idea. It really is half baked though ...
In bioinformatics we use a lot of sequence alignment algorithms, and it would be cool to use them here too, since code, like DNA or protein sequences, has a fixed number of sensible tokens. This is still something I am thinking about, but I wanted to share it. I was thinking of something like:

the BLAST algorithm
the algorithm used by Clustal

Steps of the CLUSTAL algorithm:

-- Calculate all possible pairwise alignments and record the score for each pair.
-- Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining).
-- Find the two most closely related sequences.
-- Align the sequences by the progressive method:
i. Calculate a consensus of this alignment.
ii. Replace the two sequences with the consensus.
iii. Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
iv. Iterate until all sequences have been aligned.
-- Expand the consensus sequences with the (gapped) original sequences.
-- Report the multiple sequence alignment.

Then we could use this sequence alignment to generate similarity results based on the weights/significance and the amount of total changed code that matches?

bulwahn · 2020-07-05T20:26:39Z

Nice collection of stuff, but unfortunately probably all irrelevant.

The pasta project faces two issues:

We have very few applications built on pasta (we have no real users of the overall program), because many of the ideas for use cases are not implemented. This should be the focus.
We have only a very small ground truth. A sophisticated algorithm does not help while the ground truth is this small. There is no way to increase the ground truth dataset, so we should focus on fixing specific systematic issues.

This issue is about extending the data structures to identify and include the notion of patch series and try to compute a relationship between them.
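As a rough illustration of what "the notion of patch series" could look like as a data structure, the sketch below groups mails into series by parsing the usual `[PATCH vN i/n]` subject tag, where index 0 marks the cover letter. Everything here is an assumption for illustration: the `Series` class, `parse_subject`, `group_series`, and the input tuples `(thread_id, message_id, subject)` are hypothetical and do not reflect PaStA's actual internals.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Optional

TAG_RE = re.compile(r"\[([^\]]*PATCH[^\]]*)\]")


@dataclass
class Series:
    version: int
    cover_letter: Optional[str] = None                      # message id of the 0/n mail
    patches: Dict[int, str] = field(default_factory=dict)   # index -> message id


def parse_subject(subject):
    """Return (version, index, total) from a '[PATCH vN i/n]' tag, or None."""
    m = TAG_RE.search(subject)
    if not m:
        return None
    tag = m.group(1)
    pos = re.search(r"(\d+)/(\d+)", tag)
    ver = re.search(r"\bv(\d+)\b", tag)
    # A lone patch without 'i/n' is treated as a series of one.
    index, total = (int(pos.group(1)), int(pos.group(2))) if pos else (1, 1)
    return (int(ver.group(1)) if ver else 1, index, total)


def group_series(mails):
    """Group (thread_id, message_id, subject) tuples into Series objects."""
    series = {}
    for thread_id, message_id, subject in mails:
        parsed = parse_subject(subject)
        if parsed is None:
            continue  # not a patch mail, e.g. a review reply
        version, index, _total = parsed
        s = series.setdefault((thread_id, version), Series(version))
        if index == 0:
            s.cover_letter = message_id
        else:
            s.patches[index] = message_id
    return series


mails = [
    ("t1", "m0", "[PATCH v2 0/2] foo: rework locking"),
    ("t1", "m1", "[PATCH v2 1/2] foo: add helper"),
    ("t1", "m2", "[PATCH v2 2/2] foo: use helper"),
    ("t1", "m3", "Re: some review comment"),
]
grouped = group_series(mails)
print(grouped)
```

Once series are first-class objects like this, relations between them can be computed on top of the existing patch relations.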

vaniisgh · 2020-07-06T00:35:55Z

Thanks for taking the time to review all this and answer my doubts. I'm just trying to understand PaStA at the moment, so sorry about all the irrelevant comments.
I think I understand the issues outlined now. Thanks :)
I will follow up soon with a more appropriate solution idea/POC.

metp changed the title ~~Support cover letter analysis~~ Compute relation between patch series Nov 29, 2019
