{{ message }}

# Optimized matching #144

Merged
merged 24 commits into from Nov 7, 2014
Merged

# Optimized matching#144

merged 24 commits into from Nov 7, 2014

## Conversation

### AlexisEidelman commented Sep 28, 2014

 No description provided.
``` more matching functions ```
``` 37c1982 ```
``` Ranking Matching ```
``` 6dcec12 ```

### gdementen commented on `src/matching.py` in `6dcec12`Apr 9, 2014

 the test is not needed IMO since it is a new function

 OK

### gdementen commented on `src/matching.py` in `6dcec12`Apr 9, 2014

 This "takes" the individuals with the lowest scores. Is that intended? Taking those with the highest scores seems more logical to me.

### gdementen replied Apr 9, 2014

 Forget that... These are ranks not scores, so it is pretty logical as is.

### AlexisEidelman replied Apr 9, 2014

 In fact, both can be right, we just have to chose one and define it as a norm. For example we can rank people from the youngest to the oldest or from the oldest to the youngest. It's not obvious wether "rank1 = age" should set the youngest or the oldest in first position. That raise the idea of having a 'decreasing' boolean option, as in sorting function in general.

### gdementen replied Apr 9, 2014

 True, it would be better to have either score here, or change other methods to have rank instead of score. This is an arbitrary choice indeed. For me, the naming "implicitly" sets the default order: taking the highest "score" and smallest "rank" seem logical, but using the same vocabulary everywhere seems like a good idea. Having a "reverse" argument like in the sorted() Python builtin wouldn't hurt...

### gdementen replied Apr 9, 2014

 Hmm, changing other methods to have rank instead of score is not really an option because in most cases, the score is actually a probability, which means it is much more practical to take the highest first.

### gdementen commented on `src/matching.py` in `37c1982`Apr 9, 2014

 typo in class name: Neighbour !

### gdementen replied Apr 9, 2014

 Unless you have some other matching function in mind that would inherit directly from Score_Matching, I do not think it makes sense to separate Score_Matching and Closest_Neighbourg.

### gdementen replied Apr 9, 2014

 and by the way, I am a bit puzzled why you named it Closest_Neighbour, is it a standard name for the technique in the literature? I fail to see what is the relation to a "closest neighbour" algorithm...

### AlexisEidelman replied Apr 9, 2014

 Sorry for the typo. :) Yes I've something else in mind. We can do everything we want with what I called closest neighbour algorithm it's not the lonely solution. There is some other ways of matching (see for example the discussion on liam-dev google groups "Probability-weighted option for order of decreasing difficulty (ODD) matching in LIAM2"). As I told you in past I'm thinking of implemented an optimal matching method where the goal is minimazing for example the sum of distance of all pairs. I wasn't really confident with the name I wrote. First, nearest neighbour is better. But I didn't find a better name. I wanted to explicit that the method is based on a list and an iteration. Why did I reject Ordered_Matching ? Because there can be some confusion with Ranking matching. Sequential_matching ?

### gdementen replied Apr 9, 2014

 SequentialMatching sounds good to me

### gdementen commented on `src/matching.py` in `37c1982`Apr 9, 2014

 missing "raise" statement

### gdementen commented on `src/matching.py` in `37c1982`Apr 9, 2014

 the orderby field and argument (and the related traverse and collect_variables methods) should probably be moved to Classic_CN_Matching since Closest_Neighbourg is apparently meant as a "generic"/abstract class, it should be general.

### AlexisEidelman replied Apr 9, 2014

 I did that because both Classic_CN_Matching and ODD_CN_Matching share these methods so I thought it was general in that way but I may not use well the heritance in python.

### gdementen replied Apr 9, 2014

 Well, that's the whole point of my comment: ODD_CN_Matching does not use the orderby field, nor needs the overridden traverse and collect_variables methods. The only common bit is the evaluate method which can stay where it is.

### AlexisEidelman replied Apr 9, 2014

 Yes, you're right. It makes me realize that ODD_CN_Matching is in fact a Classic_CN_Matching where the orderby is calculated (as you told below). So, now I support the "argument option" more than two function. I'll change that.

### gdementen commented on `src/matching.py` in `6dcec12`Apr 9, 2014

 uh? AFAIK, this is an invalid import statement

### gdementen commented on `src/matching.py` in `37c1982`Apr 9, 2014

 I wonder if it wouldn't be better to have both under the same "matching" function with an "algorithm" argument, akin to the "method" argument you did for "sidewalk" alignment. That is what I would have done anyway, but I am unsure it is better. If you have an opinion about this, please share it. ;-)

### AlexisEidelman replied Apr 9, 2014

 In general, I think it more user-friendly to have option more than different function name when the deep goal of the function is the same. But I did it that way because of the "difficulty" argument which is useless in Classic_CN_Matching. So I preferred to explicit it rather than having a argument for Classic_CN_Matching with no interest. That's my opinion but I think there are good reasons for both options but no decisive one.

### gdementen replied Apr 9, 2014

 It would make sense to rename difficulty to orderby, as it is really what it is anyway. We would then have two "predefined" string values for it, or the most general way: the user passes the orderby vector directly.

### AlexisEidelman replied Apr 9, 2014

 Agree, cf above.

### gdementen commented on `src/matching.py` in `37c1982`Apr 9, 2014

 invalid import?

### AlexisEidelman replied Apr 9, 2014

 It works for me. What do you mean ?

### gdementen replied Apr 9, 2014

 ah well, it works indeed, but you should rather use "import itertools"

### gdementen commented on `src/matching.py` in `37c1982`Apr 9, 2014

 missing raise

### gdementen commented on `6dcec12`Apr 9, 2014

 since "Ranking_Matching" is not present in "functions", it is not available in models!

### AlexisEidelman replied Apr 9, 2014

 True but as I didn't tested it, I don't consider it as finished :) That why I didn't push it to liam2/liam2. But your comments are welcomed right now !

### gdementen replied Apr 9, 2014

 You did send a pull request though... Maybe you didn't intent to have your PR "updated" when you committed your second change to TaxIPP-Life:master but that's how github works, and that's why it is usually recommended to create a specific branch per PR. By the way, I was about to write that your PR needs tests before I merge it ;-) … On Wed, Apr 9, 2014 at 11:44 AM, AlexisEidelman ***@***.***: True but as I didn't tested it, I don't consider it as finished :) That why I didn't push it to liam2/liam2. But your comments are welcomed right now ! ## Reply to this email directly or view it on GitHubhttps://github.com/TaxIPP-Life/liam2/commit/6dcec1276f3b64b990192d947d5a646cc3867743#commitcomment-5953630 .

### gdementen commented on `src/matching.py` in `37c1982`Apr 9, 2014

 'square distance to the other mean' looks like a misnomer: there is nothing squared about it, see above for some thoughts.

### gdementen commented on `src/matching.py` in `37c1982`Apr 9, 2014

 Those two seem too long to be practical to type.

### gdementen replied Apr 9, 2014

 "sum of squares" is actually "Sum of Squared Difference to Mean" => we could call it "SSDM". But since this is very close to (and in fact would result in the same results than) the variance (over all the fields/variables of the expression), we could call it something like "var" or "fieldsvar" or even "varvar" ;-) "square distance to the other mean" is actually "Distance to Other Mean" => call it "DOM"?
``` matching after Gaëtan's comments ```
``` ca41a95 ```
``` matching: doc and changelog ```
``` 2fe472b ```
```No test because, there is nothing to do except adding other matching
options in simulation.yml```
``` add arguments to the function in loop_wh_progress ```
``` 7efd233 ```
``` add a pool_size parameter in matching function ```
``` 32af7f8 ```
```Oops, there is something wrong with change/version_0_8_2.rst, I'll fix
it in an incoming commit.```
``` matching changes ```
``` 078e584 ```
``` change rank for score ```
``` f3222e0 ```
``` step for more general MatchingClass ```
``` 2cd0b0e ```
``` new organisation of matching functions ```
``` 1c2d3b9 ```
``` small correction ```
``` e0b00a8 ```
``` oups ```
``` 711d227 ```
``` test for matching functions ```
``` 405b105 ```
``` small improvment in doc about matching ```
``` 34d7c66 ```
``` small corrections ```
``` 5d59937 ```
``` more documentation ```
``` 2eafc60 ```
``` matching test ```
``` 1ad7679 ```
``` first optimization ```
``` 51930bd ```
```TODO:
- deal with orederby
- keep only optimized matching

- test performance```
``` use loop_wh_progress ```
``` 24b86e8 ```
``` many improvment in optimized matching ```
``` 58573f6 ```
``` merge with master ```
``` ae8e28f ```
``` Merge branch 'master' of https://github.com/liam2/liam2 into optimize… ```
``` 5dd62ec ```
```…d_matching

Conflicts:
doc/usersguide/source/changes.rst```
``` groupby matching with order ```
``` ccc1a26 ```
``` optimized as default funtion ```
``` 67fc8e6 ```
 The SDtOM is the most relevant distance. ''' def __init__(self, set1filter, set2filter, score, orderby, pool_size=None): # Why not to have a second order ?

#### gdementen Sep 28, 2014 Member

because it doesn't make sense, we take the best score, independently to the order of individuals in set 2

#### AlexisEidelman Sep 28, 2014 Author Contributor

True. :)

 if order == 'EDtM': for var in used_variables1: orderby += (df1[var] - df1[var].mean())**2 / df1[var].var() # if order == 'SDtOM':

#### gdementen Sep 28, 2014 Member

this case should be handled properly (if it is possible at all -- haven't thought about it). Handling might just be actually raising the exception, like in the comment, but not crashing with an hard-to-understand traceback like I suspect it does now.

#### AlexisEidelman Sep 28, 2014 Author Contributor

I think I start beeing lazy about that option. The idea is to evaluate the score to the center of set2 and then to sort by decreasing distance.

 @@ -1492,6 +1492,47 @@ entities: if(gender, partner.newhousehold, newhousehold), hh_id) test_matching:

#### gdementen Sep 28, 2014 Member

is this function still making sense given we have a specific model/file to test matching stuff?

 earnings: 40 + age marriage:

#### gdementen Sep 28, 2014 Member

I would rather see this test in matching.yml, possibly in another procedure, but the same file, to avoid duplicating setup stuff...

 @@ -569,16 +569,33 @@ def update(self, value): sys.stdout.write(''.join(chars_to_write)) self.percent = percent_done #class TkProgressBar(ProgressBar):

#### gdementen Sep 28, 2014 Member

these changes smell like a bad merge somewhere, these lines do not exist anymore in current master

 if matching_ctx['__len__'] == 0: raise StopIteration() size1 = len(row['idx'])

#### gdementen Sep 28, 2014 Member

a few comments in here wouldn't hurt. I am admittedly tired, but I have trouble following what happens exactly in here, even if I understand the general principle. What is row['idx'], the list of ids of the current "cell"?

added this to the 0.9 milestone Sep 28, 2014
self-assigned this Sep 28, 2014

### gdementen commented Sep 28, 2014

 This looks like a great contribution (again). I wonder if it is not possible to make the optimized version return exactly the same results as the old method (possibly as an option if it has a non-negligible cost). If my understanding is correct, I guess it is a matter of getting "df_by_cell" to return a list of ids in the "original" order within the cell, no? If that is indeed possible, we could simply remove the old method. There are probably cases where there are enough different combination of variables that the additional groupby makes the new method slower rather than faster. I believe those cases should be relatively rare, but it would be nice to know what is the threshold/at what point it is not worth it. Have you done some tests in that area?

### gdementen commented Nov 7, 2014

 merged in optimized_matching branch. I will clean it up before merging to master
merged commit `67fc8e6` into liam2:master Nov 7, 2014