Compare patch recipients vs. get_maintainer.pl recommendation #21

Closed
bulwahn opened this issue May 9, 2019 · 8 comments

Comments

@bulwahn
Contributor

bulwahn commented May 9, 2019

Motivation:
If we find systematic differences between the recipients that patches are usually sent to and the get_maintainer.pl recommendation, we can improve get_maintainer.pl in this regard.

@bulwahn
Contributor Author

bulwahn commented Jul 20, 2019

On small scale (for getting started):

To start understanding what needs to be done, we can just quickly run this investigation on a few hundred patches (selected from one week of a release candidate, where we have checked or can just assume that the MAINTAINERS file does not change).

Doing this at large scale will be more difficult because we would really like to collect this information for every patch email, and there are millions of patch emails, so we need to tweak and tune the operation.
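
A minimal sketch of the per-patch comparison described above, in Python. It assumes that the To/Cc header values of a patch email and the addresses recommended by get_maintainer.pl have already been obtained elsewhere; the function names `normalise` and `compare_recipients` are hypothetical and not part of the existing codebase.

```python
from email.utils import getaddresses


def normalise(addresses):
    """Return the set of lower-cased bare email addresses found in a list of
    header-style strings such as 'Alice <alice@example.org>, bob@example.org'."""
    return {addr.lower() for _, addr in getaddresses(addresses) if addr}


def compare_recipients(actual_headers, recommended):
    """Compare the recipients of one patch email with the recommended ones.

    actual_headers: list of To/Cc header values of the patch email.
    recommended:    list of address strings, assumed to have been extracted
                    from the get_maintainer.pl output beforehand.
    """
    actual = normalise(actual_headers)
    expected = normalise(recommended)
    return {
        "missing": expected - actual,  # recommended, but not addressed
        "extra": actual - expected,    # addressed, but not recommended
        "common": actual & expected,   # both recommended and addressed
    }
```

Running this over a few hundred patches and collecting the `missing` and `extra` sets would be one possible starting point for spotting patterns.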

@rsarky
Contributor

rsarky commented Mar 19, 2020

Since this issue was opened, @rralf has created the LinuxMaintainers class. Can we leverage that instead of get_maintainer.pl?

@rsarky
Contributor

rsarky commented Mar 22, 2020

Ah, after going over the codebase a bit more, I see that the module LinuxMailCharacteristics (particularly the _get_maintainer method) essentially acquires the heuristics required for this.

@ShubhamPandey28
Contributor

A classification approach:
If possible, we can classify each patch by its maintainers using a target vector that contains, for each maintainer, the probability that the patch belongs to that maintainer. For example, suppose there are 5 maintainers and the patch is classified as <0.7, 0.8, 0.67, 0.09, 0.9>. If we set the threshold probability to 0.7, the patch could be sent to maintainers 0, 1 and 4.
One drawback of this approach that I can think of right now: if we add a maintainer, we have to either retrain the model on all the data with the new maintainer included in the target vector, or compute a separate model for each maintainer (treating the classifications for different maintainers as mutually independent events). The latter also means storing the weights separately for every maintainer, which uses more memory.
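
The thresholding step of the example above can be sketched in a few lines of Python. This only illustrates how recipients would be selected from a probability vector, not how such a model would be trained; the name `select_maintainers` is hypothetical, and the inclusive comparison is an assumption needed to reproduce the example (maintainer 0 has a probability of exactly 0.7).

```python
def select_maintainers(probabilities, threshold=0.7):
    """Return the indices of all maintainers whose predicted probability
    reaches the threshold (inclusive comparison, by assumption)."""
    return [i for i, p in enumerate(probabilities) if p >= threshold]


# The example vector from the comment selects maintainers 0, 1 and 4.
selected = select_maintainers([0.7, 0.8, 0.67, 0.09, 0.9])
```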

Please object if this solution is inappropriate.

@bulwahn
Contributor Author

bulwahn commented Jul 4, 2020

@ShubhamPandey28 I fear you are providing such a dense description that we are not able to judge if this makes sense.
How are you determining any model? Neither get_maintainer.pl nor the history can serve as ground truth, because both contain mistakes: developers send patches to the wrong maintainers (or maintainers hand over to someone else), and get_maintainer.pl contains false entries.

How about starting with writing a script to identify significant systematic differences between get_maintainer.pl and email recipients? If you need to use vectors, go ahead, but you would really need to first describe your model.

@fun-akhil

I have observed that in some cases, the authors of a patch mail it to addresses/recipients that are not present in the MAINTAINERS file. In such a scenario, it is almost impossible to suggest the appropriate mailing list by running the get_maintainer.pl script. Is there any possible solution to this problem?

@bulwahn
Contributor Author

bulwahn commented Jul 6, 2020

@quantum109678 Well, you can just add that person to the MAINTAINERS file, right?

The real question is how you can train a "suitable" model on a set of files in a directory and keywords in patches for recipients, and then determine the "minimally invasive" but "maximally effective" change in the MAINTAINERS file.

I would be surprised if some crazy machine learning could solve that, but let us start easy and collect where there is a systematic difference, e.g., in 80% of the patches, between the recipient list of all patches belonging to a MAINTAINERS section and the information in the MAINTAINERS section. That probably already indicates that the entry might need adjustment.
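
A rough sketch of such a collection step, in Python. It assumes that each patch has already been assigned to a MAINTAINERS section (for example by matching the touched files against the section's file patterns) and that the addresses listed per section have been parsed beforehand; the function name `systematic_differences`, the input layout and the report keys are all hypothetical.

```python
from collections import Counter, defaultdict


def systematic_differences(patches, section_entries, threshold=0.8):
    """Report addresses that differ systematically from a MAINTAINERS section.

    patches:         iterable of (section, recipients) pairs, one per patch.
    section_entries: dict mapping a section name to the set of addresses
                     listed in that section.
    threshold:       fraction of a section's patches that counts as systematic.
    """
    totals = Counter()            # number of patches per section
    seen = defaultdict(Counter)   # per section: patches addressing each address
    for section, recipients in patches:
        totals[section] += 1
        for address in set(recipients):
            seen[section][address] += 1

    report = {}
    for section, total in totals.items():
        listed = section_entries.get(section, set())
        # Addressed in most patches, but not listed in the section.
        unlisted = {a for a, n in seen[section].items()
                    if a not in listed and n / total >= threshold}
        # Listed in the section, but left out of most patches.
        omitted = {a for a in listed
                   if (total - seen[section][a]) / total >= threshold}
        if unlisted or omitted:
            report[section] = {"unlisted": unlisted, "omitted": omitted}
    return report
```

Sections that show up in the report would be candidates for a closer manual look; the result says nothing about which side (the senders or the entry) is actually wrong.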

@bulwahn
Contributor Author

bulwahn commented Jul 14, 2020

I have formulated issue #65 as a nice side investigation to this question here.

rralf closed this as completed Jul 2, 2024