Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching #136

Merged
merged 17 commits into from Oct 17, 2014

Conversation

Projects
None yet
2 participants
@AlexisEidelman
Copy link
Contributor

commented Aug 31, 2014

No description provided.

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 6dcec12 Apr 9, 2014

the test is not needed IMO since it is a new function

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

OK

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 6dcec12 Apr 9, 2014

This "takes" the individuals with the lowest scores. Is that intended? Taking those with the highest scores seems more logical to me.

This comment has been minimized.

Copy link

replied Apr 9, 2014

Forget that... These are ranks not scores, so it is pretty logical as is.

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

In fact, both can be right, we just have to chose one and define it as a norm. For example we can rank people from the youngest to the oldest or from the oldest to the youngest. It's not obvious wether "rank1 = age" should set the youngest or the oldest in first position.
That raise the idea of having a 'decreasing' boolean option, as in sorting function in general.

This comment has been minimized.

Copy link

replied Apr 9, 2014

True, it would be better to have either score here, or change other methods to have rank instead of score. This is an arbitrary choice indeed. For me, the naming "implicitly" sets the default order: taking the highest "score" and smallest "rank" seem logical, but using the same vocabulary everywhere seems like a good idea. Having a "reverse" argument like in the sorted() Python builtin wouldn't hurt...

This comment has been minimized.

Copy link

replied Apr 9, 2014

Hmm, changing other methods to have rank instead of score is not really an option because in most cases, the score is actually a probability, which means it is much more practical to take the highest first.

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 37c1982 Apr 9, 2014

typo in class name: Neighbour !

This comment has been minimized.

Copy link

replied Apr 9, 2014

Unless you have some other matching function in mind that would inherit directly from Score_Matching, I do not think it makes sense to separate Score_Matching and Closest_Neighbourg.

This comment has been minimized.

Copy link

replied Apr 9, 2014

and by the way, I am a bit puzzled why you named it Closest_Neighbour, is it a standard name for the technique in the literature? I fail to see what is the relation to a "closest neighbour" algorithm...

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

Sorry for the typo. :)
Yes I've something else in mind. We can do everything we want with what I called closest neighbour algorithm it's not the lonely solution.
There is some other ways of matching (see for example the discussion on liam-dev google groups "Probability-weighted option for order of decreasing difficulty (ODD) matching in LIAM2").
As I told you in past I'm thinking of implemented an optimal matching method where the goal is minimazing for example the sum of distance of all pairs.

I wasn't really confident with the name I wrote. First, nearest neighbour is better. But I didn't find a better name. I wanted to explicit that the method is based on a list and an iteration. Why did I reject Ordered_Matching ? Because there can be some confusion with Ranking matching. Sequential_matching ?

This comment has been minimized.

Copy link

replied Apr 9, 2014

SequentialMatching sounds good to me

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 37c1982 Apr 9, 2014

missing "raise" statement

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 37c1982 Apr 9, 2014

the orderby field and argument (and the related traverse and collect_variables methods) should probably be moved to Classic_CN_Matching since Closest_Neighbourg is apparently meant as a "generic"/abstract class, it should be general.

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

I did that because both Classic_CN_Matching and ODD_CN_Matching share these methods so I thought it was general in that way but I may not use well the heritance in python.

This comment has been minimized.

Copy link

replied Apr 9, 2014

Well, that's the whole point of my comment: ODD_CN_Matching does not use the orderby field, nor needs the overridden traverse and collect_variables methods. The only common bit is the evaluate method which can stay where it is.

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

Yes, you're right. It makes me realize that ODD_CN_Matching is in fact a Classic_CN_Matching where the orderby is calculated (as you told below). So, now I support the "argument option" more than two function. I'll change that.

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 6dcec12 Apr 9, 2014

uh? AFAIK, this is an invalid import statement

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 37c1982 Apr 9, 2014

I wonder if it wouldn't be better to have both under the same "matching" function with an "algorithm" argument, akin to the "method" argument you did for "sidewalk" alignment. That is what I would have done anyway, but I am unsure it is better. If you have an opinion about this, please share it. ;-)

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

In general, I think it more user-friendly to have option more than different function name when the deep goal of the function is the same.
But I did it that way because of the "difficulty" argument which is useless in Classic_CN_Matching. So I preferred to explicit it rather than having a argument for Classic_CN_Matching with no interest.

That's my opinion but I think there are good reasons for both options but no decisive one.

This comment has been minimized.

Copy link

replied Apr 9, 2014

It would make sense to rename difficulty to orderby, as it is really what it is anyway. We would then have two "predefined" string values for it, or the most general way: the user passes the orderby vector directly.

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

Agree, cf above.

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 37c1982 Apr 9, 2014

invalid import?

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

It works for me. What do you mean ?

This comment has been minimized.

Copy link

replied Apr 9, 2014

ah well, it works indeed, but you should rather use "import itertools"

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 37c1982 Apr 9, 2014

missing raise

@gdementen

This comment has been minimized.

Copy link

commented on 6dcec12 Apr 9, 2014

since "Ranking_Matching" is not present in "functions", it is not available in models!

This comment has been minimized.

Copy link
Member Author

replied Apr 9, 2014

True but as I didn't tested it, I don't consider it as finished :)
That why I didn't push it to liam2/liam2. But your comments are welcomed right now !

This comment has been minimized.

Copy link

replied Apr 9, 2014

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 37c1982 Apr 9, 2014

'square distance to the other mean' looks like a misnomer: there is nothing squared about it, see above for some thoughts.

@gdementen

This comment has been minimized.

Copy link

commented on src/matching.py in 37c1982 Apr 9, 2014

Those two seem too long to be practical to type.

This comment has been minimized.

Copy link

replied Apr 9, 2014

"sum of squares" is actually "Sum of Squared Difference to Mean" => we could call it "SSDM". But since this is very close to (and in fact would result in the same results than) the variance (over all the fields/variables of the expression), we could call it something like "var" or "fieldsvar" or even "varvar" ;-)
"square distance to the other mean" is actually "Distance to Other Mean" => call it "DOM"?


from expr import expr_eval, collect_variables, traverse_expr
from exprbases import EvaluableExpression
from context import context_length, context_subset, context_delete
from utils import loop_wh_progress

# TODO: a quoi sert le collect variable ?

This comment has been minimized.

Copy link
@gdementen

gdementen Sep 1, 2014

Member

for various reasons, including testing whether all columns/variables needed for an expressions are present in the context. However, in the devel branch, it is not needed anymore to define one for each function you add as there is a generic one which works on all functions.


*generic setup* ::

matching(set1filter=boolean_expr,

This comment has been minimized.

Copy link
@gdementen

gdementen Sep 1, 2014

Member

should be rank_matching

def __init__(self, set1filter, set2filter, score, orderby):
class ScoreMatching(EvaluableExpression):
''' General framework for a Matching based on score
That kind of matching doesn't

This comment has been minimized.

Copy link
@gdementen

gdementen Sep 1, 2014

Member

doesn't what?

@gdementen

This comment has been minimized.

Copy link
Member

commented Sep 1, 2014

Thanks for following up on this! The tests you added are a good start: they ensure the functions are actually callable and do not crash... But I was hoping for "real" tests with asserts which check whether the results are correct/meaningful or not. The reason I insist so much that you write those, instead of writing them myself is that I am unsure what it should return and understanding the functions deeply enough takes more time than I have available.

I know that the normal "matching" function does not currently have those. This is because that part dates back to when the asserts were not implemented. All new code should have proper tests. That is the only rule I will be inflexible about. My other comments are all informational only...

PS: Why didn't you update your previous PR?

@gdementen gdementen added this to the 0.9 milestone Sep 1, 2014

@gdementen gdementen self-assigned this Sep 1, 2014

@gdementen gdementen merged commit 1ad7679 into liam2:master Oct 17, 2014

gdementen added a commit that referenced this pull request Oct 17, 2014

@gdementen

This comment has been minimized.

Copy link
Member

commented Oct 17, 2014

this is in 89c7024 ! Sorry it took so long, and thanks for the great contribution!

@gdementen

This comment has been minimized.

Copy link
Member

commented Oct 17, 2014

as a follow-up, it would be nice to tackle #145

@gdementen gdementen removed the inprogress label Dec 15, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.