New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized matching #144

Merged
merged 24 commits into from Nov 7, 2014

Conversation

Projects
None yet
2 participants
@AlexisEidelman
Contributor

AlexisEidelman commented Sep 28, 2014

No description provided.

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

the test is not needed IMO since it is a new function

gdementen commented on src/matching.py in 6dcec12 Apr 9, 2014

the test is not needed IMO since it is a new function

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman
Member

AlexisEidelman replied Apr 9, 2014

OK

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

This "takes" the individuals with the lowest scores. Is that intended? Taking those with the highest scores seems more logical to me.

gdementen commented on src/matching.py in 6dcec12 Apr 9, 2014

This "takes" the individuals with the lowest scores. Is that intended? Taking those with the highest scores seems more logical to me.

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

Forget that... These are ranks not scores, so it is pretty logical as is.

gdementen replied Apr 9, 2014

Forget that... These are ranks not scores, so it is pretty logical as is.

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman

AlexisEidelman Apr 9, 2014

Member

In fact, both can be right, we just have to chose one and define it as a norm. For example we can rank people from the youngest to the oldest or from the oldest to the youngest. It's not obvious wether "rank1 = age" should set the youngest or the oldest in first position.
That raise the idea of having a 'decreasing' boolean option, as in sorting function in general.

Member

AlexisEidelman replied Apr 9, 2014

In fact, both can be right, we just have to chose one and define it as a norm. For example we can rank people from the youngest to the oldest or from the oldest to the youngest. It's not obvious wether "rank1 = age" should set the youngest or the oldest in first position.
That raise the idea of having a 'decreasing' boolean option, as in sorting function in general.

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

True, it would be better to have either score here, or change other methods to have rank instead of score. This is an arbitrary choice indeed. For me, the naming "implicitly" sets the default order: taking the highest "score" and smallest "rank" seem logical, but using the same vocabulary everywhere seems like a good idea. Having a "reverse" argument like in the sorted() Python builtin wouldn't hurt...

gdementen replied Apr 9, 2014

True, it would be better to have either score here, or change other methods to have rank instead of score. This is an arbitrary choice indeed. For me, the naming "implicitly" sets the default order: taking the highest "score" and smallest "rank" seem logical, but using the same vocabulary everywhere seems like a good idea. Having a "reverse" argument like in the sorted() Python builtin wouldn't hurt...

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

Hmm, changing other methods to have rank instead of score is not really an option because in most cases, the score is actually a probability, which means it is much more practical to take the highest first.

gdementen replied Apr 9, 2014

Hmm, changing other methods to have rank instead of score is not really an option because in most cases, the score is actually a probability, which means it is much more practical to take the highest first.

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

typo in class name: Neighbour !

gdementen commented on src/matching.py in 37c1982 Apr 9, 2014

typo in class name: Neighbour !

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

Unless you have some other matching function in mind that would inherit directly from Score_Matching, I do not think it makes sense to separate Score_Matching and Closest_Neighbourg.

gdementen replied Apr 9, 2014

Unless you have some other matching function in mind that would inherit directly from Score_Matching, I do not think it makes sense to separate Score_Matching and Closest_Neighbourg.

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

and by the way, I am a bit puzzled why you named it Closest_Neighbour, is it a standard name for the technique in the literature? I fail to see what is the relation to a "closest neighbour" algorithm...

gdementen replied Apr 9, 2014

and by the way, I am a bit puzzled why you named it Closest_Neighbour, is it a standard name for the technique in the literature? I fail to see what is the relation to a "closest neighbour" algorithm...

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman

AlexisEidelman Apr 9, 2014

Member

Sorry for the typo. :)
Yes I've something else in mind. We can do everything we want with what I called closest neighbour algorithm it's not the lonely solution.
There is some other ways of matching (see for example the discussion on liam-dev google groups "Probability-weighted option for order of decreasing difficulty (ODD) matching in LIAM2").
As I told you in past I'm thinking of implemented an optimal matching method where the goal is minimazing for example the sum of distance of all pairs.

I wasn't really confident with the name I wrote. First, nearest neighbour is better. But I didn't find a better name. I wanted to explicit that the method is based on a list and an iteration. Why did I reject Ordered_Matching ? Because there can be some confusion with Ranking matching. Sequential_matching ?

Member

AlexisEidelman replied Apr 9, 2014

Sorry for the typo. :)
Yes I've something else in mind. We can do everything we want with what I called closest neighbour algorithm it's not the lonely solution.
There is some other ways of matching (see for example the discussion on liam-dev google groups "Probability-weighted option for order of decreasing difficulty (ODD) matching in LIAM2").
As I told you in past I'm thinking of implemented an optimal matching method where the goal is minimazing for example the sum of distance of all pairs.

I wasn't really confident with the name I wrote. First, nearest neighbour is better. But I didn't find a better name. I wanted to explicit that the method is based on a list and an iteration. Why did I reject Ordered_Matching ? Because there can be some confusion with Ranking matching. Sequential_matching ?

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

SequentialMatching sounds good to me

gdementen replied Apr 9, 2014

SequentialMatching sounds good to me

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

missing "raise" statement

gdementen commented on src/matching.py in 37c1982 Apr 9, 2014

missing "raise" statement

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

the orderby field and argument (and the related traverse and collect_variables methods) should probably be moved to Classic_CN_Matching since Closest_Neighbourg is apparently meant as a "generic"/abstract class, it should be general.

gdementen commented on src/matching.py in 37c1982 Apr 9, 2014

the orderby field and argument (and the related traverse and collect_variables methods) should probably be moved to Classic_CN_Matching since Closest_Neighbourg is apparently meant as a "generic"/abstract class, it should be general.

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman

AlexisEidelman Apr 9, 2014

Member

I did that because both Classic_CN_Matching and ODD_CN_Matching share these methods so I thought it was general in that way but I may not use well the heritance in python.

Member

AlexisEidelman replied Apr 9, 2014

I did that because both Classic_CN_Matching and ODD_CN_Matching share these methods so I thought it was general in that way but I may not use well the heritance in python.

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

Well, that's the whole point of my comment: ODD_CN_Matching does not use the orderby field, nor needs the overridden traverse and collect_variables methods. The only common bit is the evaluate method which can stay where it is.

gdementen replied Apr 9, 2014

Well, that's the whole point of my comment: ODD_CN_Matching does not use the orderby field, nor needs the overridden traverse and collect_variables methods. The only common bit is the evaluate method which can stay where it is.

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman

AlexisEidelman Apr 9, 2014

Member

Yes, you're right. It makes me realize that ODD_CN_Matching is in fact a Classic_CN_Matching where the orderby is calculated (as you told below). So, now I support the "argument option" more than two function. I'll change that.

Member

AlexisEidelman replied Apr 9, 2014

Yes, you're right. It makes me realize that ODD_CN_Matching is in fact a Classic_CN_Matching where the orderby is calculated (as you told below). So, now I support the "argument option" more than two function. I'll change that.

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

uh? AFAIK, this is an invalid import statement

gdementen commented on src/matching.py in 6dcec12 Apr 9, 2014

uh? AFAIK, this is an invalid import statement

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

I wonder if it wouldn't be better to have both under the same "matching" function with an "algorithm" argument, akin to the "method" argument you did for "sidewalk" alignment. That is what I would have done anyway, but I am unsure it is better. If you have an opinion about this, please share it. ;-)

gdementen commented on src/matching.py in 37c1982 Apr 9, 2014

I wonder if it wouldn't be better to have both under the same "matching" function with an "algorithm" argument, akin to the "method" argument you did for "sidewalk" alignment. That is what I would have done anyway, but I am unsure it is better. If you have an opinion about this, please share it. ;-)

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman

AlexisEidelman Apr 9, 2014

Member

In general, I think it more user-friendly to have option more than different function name when the deep goal of the function is the same.
But I did it that way because of the "difficulty" argument which is useless in Classic_CN_Matching. So I preferred to explicit it rather than having a argument for Classic_CN_Matching with no interest.

That's my opinion but I think there are good reasons for both options but no decisive one.

Member

AlexisEidelman replied Apr 9, 2014

In general, I think it more user-friendly to have option more than different function name when the deep goal of the function is the same.
But I did it that way because of the "difficulty" argument which is useless in Classic_CN_Matching. So I preferred to explicit it rather than having a argument for Classic_CN_Matching with no interest.

That's my opinion but I think there are good reasons for both options but no decisive one.

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

It would make sense to rename difficulty to orderby, as it is really what it is anyway. We would then have two "predefined" string values for it, or the most general way: the user passes the orderby vector directly.

gdementen replied Apr 9, 2014

It would make sense to rename difficulty to orderby, as it is really what it is anyway. We would then have two "predefined" string values for it, or the most general way: the user passes the orderby vector directly.

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman

AlexisEidelman Apr 9, 2014

Member

Agree, cf above.

Member

AlexisEidelman replied Apr 9, 2014

Agree, cf above.

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

invalid import?

gdementen commented on src/matching.py in 37c1982 Apr 9, 2014

invalid import?

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman

AlexisEidelman Apr 9, 2014

Member

It works for me. What do you mean ?

Member

AlexisEidelman replied Apr 9, 2014

It works for me. What do you mean ?

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

ah well, it works indeed, but you should rather use "import itertools"

gdementen replied Apr 9, 2014

ah well, it works indeed, but you should rather use "import itertools"

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

missing raise

gdementen commented on src/matching.py in 37c1982 Apr 9, 2014

missing raise

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

since "Ranking_Matching" is not present in "functions", it is not available in models!

gdementen commented on 6dcec12 Apr 9, 2014

since "Ranking_Matching" is not present in "functions", it is not available in models!

This comment has been minimized.

Show comment
Hide comment
@AlexisEidelman

AlexisEidelman Apr 9, 2014

Member

True but as I didn't tested it, I don't consider it as finished :)
That why I didn't push it to liam2/liam2. But your comments are welcomed right now !

Member

AlexisEidelman replied Apr 9, 2014

True but as I didn't tested it, I don't consider it as finished :)
That why I didn't push it to liam2/liam2. But your comments are welcomed right now !

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

gdementen replied Apr 9, 2014

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

'square distance to the other mean' looks like a misnomer: there is nothing squared about it, see above for some thoughts.

gdementen commented on src/matching.py in 37c1982 Apr 9, 2014

'square distance to the other mean' looks like a misnomer: there is nothing squared about it, see above for some thoughts.

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

Those two seem too long to be practical to type.

gdementen commented on src/matching.py in 37c1982 Apr 9, 2014

Those two seem too long to be practical to type.

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Apr 9, 2014

"sum of squares" is actually "Sum of Squared Difference to Mean" => we could call it "SSDM". But since this is very close to (and in fact would result in the same results than) the variance (over all the fields/variables of the expression), we could call it something like "var" or "fieldsvar" or even "varvar" ;-)
"square distance to the other mean" is actually "Distance to Other Mean" => call it "DOM"?

gdementen replied Apr 9, 2014

"sum of squares" is actually "Sum of Squared Difference to Mean" => we could call it "SSDM". But since this is very close to (and in fact would result in the same results than) the variance (over all the fields/variables of the expression), we could call it something like "var" or "fieldsvar" or even "varvar" ;-)
"square distance to the other mean" is actually "Distance to Other Mean" => call it "DOM"?

The SDtOM is the most relevant distance.
'''
def __init__(self, set1filter, set2filter, score, orderby, pool_size=None):
# Why not to have a second order ?

This comment has been minimized.

@gdementen

gdementen Sep 28, 2014

Member

because it doesn't make sense, we take the best score, independently to the order of individuals in set 2

@gdementen

gdementen Sep 28, 2014

Member

because it doesn't make sense, we take the best score, independently to the order of individuals in set 2

This comment has been minimized.

@AlexisEidelman

AlexisEidelman Sep 28, 2014

Contributor

True. :)

@AlexisEidelman

AlexisEidelman Sep 28, 2014

Contributor

True. :)

if order == 'EDtM':
for var in used_variables1:
orderby += (df1[var] - df1[var].mean())**2 / df1[var].var()
# if order == 'SDtOM':

This comment has been minimized.

@gdementen

gdementen Sep 28, 2014

Member

this case should be handled properly (if it is possible at all -- haven't thought about it). Handling might just be actually raising the exception, like in the comment, but not crashing with an hard-to-understand traceback like I suspect it does now.

@gdementen

gdementen Sep 28, 2014

Member

this case should be handled properly (if it is possible at all -- haven't thought about it). Handling might just be actually raising the exception, like in the comment, but not crashing with an hard-to-understand traceback like I suspect it does now.

This comment has been minimized.

@AlexisEidelman

AlexisEidelman Sep 28, 2014

Contributor

I think I start beeing lazy about that option. The idea is to evaluate the score to the center of set2 and then to sort by decreasing distance.

@AlexisEidelman

AlexisEidelman Sep 28, 2014

Contributor

I think I start beeing lazy about that option. The idea is to evaluate the score to the center of set2 and then to sort by decreasing distance.

@@ -1492,6 +1492,47 @@ entities:
if(gender, partner.newhousehold, newhousehold),
hh_id)
test_matching:

This comment has been minimized.

@gdementen

gdementen Sep 28, 2014

Member

is this function still making sense given we have a specific model/file to test matching stuff?

@gdementen

gdementen Sep 28, 2014

Member

is this function still making sense given we have a specific model/file to test matching stuff?

earnings: 40 + age
marriage:

This comment has been minimized.

@gdementen

gdementen Sep 28, 2014

Member

I would rather see this test in matching.yml, possibly in another procedure, but the same file, to avoid duplicating setup stuff...

@gdementen

gdementen Sep 28, 2014

Member

I would rather see this test in matching.yml, possibly in another procedure, but the same file, to avoid duplicating setup stuff...

@@ -569,16 +569,33 @@ def update(self, value):
sys.stdout.write(''.join(chars_to_write))
self.percent = percent_done
#class TkProgressBar(ProgressBar):

This comment has been minimized.

@gdementen

gdementen Sep 28, 2014

Member

these changes smell like a bad merge somewhere, these lines do not exist anymore in current master

@gdementen

gdementen Sep 28, 2014

Member

these changes smell like a bad merge somewhere, these lines do not exist anymore in current master

if matching_ctx['__len__'] == 0:
raise StopIteration()
size1 = len(row['idx'])

This comment has been minimized.

@gdementen

gdementen Sep 28, 2014

Member

a few comments in here wouldn't hurt. I am admittedly tired, but I have trouble following what happens exactly in here, even if I understand the general principle. What is row['idx'], the list of ids of the current "cell"?

@gdementen

gdementen Sep 28, 2014

Member

a few comments in here wouldn't hurt. I am admittedly tired, but I have trouble following what happens exactly in here, even if I understand the general principle. What is row['idx'], the list of ids of the current "cell"?

@gdementen gdementen added this to the 0.9 milestone Sep 28, 2014

@gdementen gdementen self-assigned this Sep 28, 2014

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Sep 28, 2014

Member

This looks like a great contribution (again). I wonder if it is not possible to make the optimized version return exactly the same results as the old method (possibly as an option if it has a non-negligible cost). If my understanding is correct, I guess it is a matter of getting "df_by_cell" to return a list of ids in the "original" order within the cell, no?

If that is indeed possible, we could simply remove the old method. There are probably cases where there are enough different combination of variables that the additional groupby makes the new method slower rather than faster. I believe those cases should be relatively rare, but it would be nice to know what is the threshold/at what point it is not worth it. Have you done some tests in that area?

Member

gdementen commented Sep 28, 2014

This looks like a great contribution (again). I wonder if it is not possible to make the optimized version return exactly the same results as the old method (possibly as an option if it has a non-negligible cost). If my understanding is correct, I guess it is a matter of getting "df_by_cell" to return a list of ids in the "original" order within the cell, no?

If that is indeed possible, we could simply remove the old method. There are probably cases where there are enough different combination of variables that the additional groupby makes the new method slower rather than faster. I believe those cases should be relatively rare, but it would be nice to know what is the threshold/at what point it is not worth it. Have you done some tests in that area?

@gdementen

This comment has been minimized.

Show comment
Hide comment
@gdementen

gdementen Nov 7, 2014

Member

merged in optimized_matching branch. I will clean it up before merging to master

Member

gdementen commented Nov 7, 2014

merged in optimized_matching branch. I will clean it up before merging to master

@gdementen gdementen merged commit 67fc8e6 into liam2:master Nov 7, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment