
Dev lgbm #147

Merged: 34 commits into neptune-ai:dev on Jun 21, 2018

Conversation

apyskir (Contributor) commented Jun 20, 2018

LightGBM as a score builder, with the possibility to replace it with a default random forest.

Jakub Czakon and others added 29 commits May 29, 2018 19:12
* initial restructure

* clean structure (neptune-ai#126)

* clean structure

* correct readme

* further cleaning

* Dev apply transformer (neptune-ai#131)

* clean structure

* correct readme

* further cleaning

* resizer docstring

* couple docstrings

* make apply transformer, memory cache

* fixes

* postprocessing docstrings

* fixes in PR

* Dev repo cleanup (neptune-ai#132)

* cleanup

* remove src.

* Dev clean tta (neptune-ai#134)

* added resize padding, refactored inference pipelines

* refactored piepliens

* added color shift augmentation

* reduced caching to just mask_resize

* updated config

* Dev-repo_cleanup models and losses docstrings (neptune-ai#135)

* models and losses docstrings

* small fixes in docstrings

* resolve conflicts in with TTA PR (neptune-ai#137)
# Conflicts:
#	src/callbacks.py
#	src/evaluate_checkpoint.py
#	src/main.py
#	src/metrics.py
#	src/models.py
#	src/neptune.yaml
#	src/pipeline_config.py
#	src/pipelines.py
#	src/postprocessing.py
#	src/steps/preprocessing/misc.py
# Conflicts:
#	README.md
#	main.py
#	neptune.yaml
#	src/callbacks.py
#	src/loaders.py
#	src/models.py
#	src/pipeline_config.py
#	src/pipeline_manager.py
#	src/pipelines.py
#	src/postprocessing.py
#	src/utils.py
neptune.yaml Outdated
erode_selem_size: 0
dilate_selem_size: 2
tta_aggregation_method: gmean
iou_threshold: 0.5
Collaborator

@apyskir I'd go with something more descriptive, like nms_iou_threshold or similar.

Contributor Author

@jakubczakon OK, got it.

Collaborator

weighted_loss = partial(multiclass_weighted_cross_entropy,
**get_loss_variables(**architecture_config['weighted_cross_entropy']))
loss = partial(mixed_dice_cross_entropy_loss, dice_weight=architecture_config['loss_weights']['dice_mask'],
weights_function = partial(get_weights, **architecture_config['weighted_cross_entropy'])
Collaborator

@apyskir Why do we have to refactor this part once again? Is this change necessary?

Contributor Author

@jakubczakon I just adapted PyTorchUNetWeightedStream to match the loss definition in PyTorchUNetWeighted. I think @taraspiotr may have missed it while refactoring, because the stream mode was being dropped.

src/models.py Outdated
super().__init__(model_params, training_params)

def fit(self, features, **kwargs):
df_features = []
Collaborator

@apyskir I would move this out to a local method, so that fit contains only the high-level logic: prepare data, then fit.

Contributor Author

@jakubczakon OK, I can do it.
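
A minimal sketch of the suggested split, assuming a hypothetical private helper _prepare_data (the class, the 'iou' target column, and the helper name are illustrative, not the PR's final code):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    class ScoringModel:
        """Illustrative stand-in for the PR's scoring transformers."""

        def __init__(self, feature_names):
            self.feature_names = feature_names
            self.estimator = RandomForestRegressor()

        def fit(self, features, **kwargs):
            # high-level logic only: prepare the data, then fit
            df_features = self._prepare_data(features)
            self.estimator.fit(df_features[self.feature_names], df_features['iou'])
            return self

        def _prepare_data(self, features):
            # e.g. stack the per-image feature frames into one training table
            return pd.concat(features, ignore_index=True)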

return self

def transform(self, features, **kwargs):
scores = []
Collaborator

@apyskir I'd refactor this to have prepare data and super().transform here

Contributor Author

@jakubczakon I don't think it's possible. I mean, it is possible, but not very helpful, I think, because super().transform sits inside a double for loop and transforms a data frame for each layer of each image.
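
A rough shape of the double loop being described (names and the base class are illustrative, not the actual code):

    class BaseScorer:
        def transform(self, df):
            return df  # stand-in for the real per-frame transform

    class ScoringModel(BaseScorer):
        def transform(self, features, **kwargs):
            scores = []
            for image_features in features:            # one entry per image
                layer_scores = []
                for layer_features in image_features:  # one entry per layer
                    layer_scores.append(super().transform(layer_features))
                scores.append(layer_scores)
            return {'scores': scores}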

Collaborator

@apyskir typo here, I meant prepare data and super().fit (as is)

return {'scores': scores}

def save(self, filepath):
joblib.dump((self.estimator, self.feature_names), filepath)
Collaborator

@apyskir I actually like that it's less verbose than a dict
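
For symmetry, a matching load would unpack the tuple in the same order (a sketch relying only on joblib's standard dump/load round-trip):

    import joblib

    class ScoringModel:
        ...

        def save(self, filepath):
            joblib.dump((self.estimator, self.feature_names), filepath)

        def load(self, filepath):
            # joblib.load returns the tuple exactly as it was dumped
            self.estimator, self.feature_names = joblib.load(filepath)
            return self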

src/models.py Outdated
self.estimator = RandomForestRegressor()

def fit(self, features, **kwargs):
df_features = []
Collaborator

@apyskir again, I would move some of this to a private method

return self

def transform(self, features, **kwargs):
scores = []
Collaborator

@apyskir same here

Contributor Author

@jakubczakon same here :)

Collaborator

@apyskir even samer here (prepare data and super().fit() as is)

meta_train = meta_train.sample(params.lgbm__num_training_examples, random_state=seed)
train_mode = False
annotations = []
for image_id in meta_train['ImageId'].values:
Collaborator

@apyskir this logic should live somewhere else; only a high-level function should be called in the manager. So I would refactor this into something like load_lgbm_data that produces meta_train_lgbm and annotations_lgbm.

Contributor Author

@jakubczakon OK, you're right
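
A sketch of the suggested helper (load_lgbm_data and the annotation loader are hypothetical; the sampling call and column name are taken from the diff above):

    def load_lgbm_data(meta_train, params, seed):
        """Keep the manager high-level: return sampled metadata plus annotations."""
        meta_train_lgbm = meta_train.sample(params.lgbm__num_training_examples,
                                            random_state=seed)
        annotations_lgbm = []
        for image_id in meta_train_lgbm['ImageId'].values:
            annotations_lgbm.append(load_annotations_for_image(image_id))  # assumed loader
        return meta_train_lgbm, annotations_lgbm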

meta_valid = meta_valid.sample(int(params.evaluation_data_sample), random_state=seed)

if dev_mode:
meta_train = meta_train.sample(20, random_state=seed)
meta_valid = meta_valid.sample(10, random_state=seed)

if 'lgbm' in pipeline_name:
Collaborator

@apyskir since this is strictly for lgbm_train, I would explicitly go if pipeline_name == 'lgbm_train':

Contributor Author

@jakubczakon Currently it's strictly for pipeline_name == 'lgbm', actually. OK.

Contributor Author

@jakubczakon ...and now it's pipeline_name == 'scoring_model'
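
Combined with the helper sketched earlier, the manager-side check would then read roughly (hypothetical):

    if pipeline_name == 'scoring_model':  # explicit, as agreed above
        meta_train_lgbm, annotations_lgbm = load_lgbm_data(meta_train, params, seed)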

@@ -146,7 +168,7 @@ def predict(pipeline_name, dev_mode, submit_predictions, chunk_size, logger, par
meta_test = meta_test.sample(2, random_state=seed)

pipeline = PIPELINES[pipeline_name]['inference'](SOLUTION_CONFIG)
prediction = generate_prediction(meta_test, pipeline, logger, CATEGORY_IDS, chunk_size)
prediction = generate_prediction(meta_test, pipeline, logger, CATEGORY_IDS, chunk_size, params.num_threads)
Collaborator

@apyskir I think we have num_threads in some places and n_threads in others. We should choose one, I think.

Contributor Author

@jakubczakon Right, I'll adjust it.

src/pipelines.py Outdated
def lgbm_train(config):
save_output = False
unet_type = 'weighted'
config['execution']['stream_mode'] = True
Collaborator

@apyskir I think we should be able to access that via config.execution.stream_mode thanks to attrdict. I think it looks nicer that way.
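
For reads the two styles are indeed equivalent with attrdict:

    from attrdict import AttrDict

    config = AttrDict({'execution': {'stream_mode': False}})
    assert config.execution.stream_mode == config['execution']['stream_mode']

One caveat worth checking: the diff line here is a write, and to my knowledge attrdict does not support assignment through nested attribute access (config.execution returns a fresh mapping), so the dict-style write config['execution']['stream_mode'] = True may have to stay.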

src/pipelines.py Outdated
unet_type = 'weighted'
config['execution']['stream_mode'] = True

if unet_type == 'standard':
Collaborator

@apyskir I think it doesn't matter which one we choose, since we are not training the unet in this pipeline. We need to train it on the entire dataset and later port it to this pipeline. So until the memory jump (and hence the 10k subset limitation) has been dealt with, we don't need both.

Contributor Author

@jakubczakon Yes, I will clean it.

src/pipelines.py Outdated

scoring_model = Step(name='scoring_model',
transformer=ScoringLightGBM(**config['postprocessor']['lightGBM'])
if config['postprocessor']['scoring_model'] == 'lgbm' else
Collaborator

@apyskir Again I think we can access values via config.postprocessor.scoring_model etc

src/pipelines.py Outdated
scoring_model = Step(name='scoring_model',
transformer=ScoringLightGBM(**config['postprocessor']['lightGBM'])
if config['postprocessor']['scoring_model'] == 'lgbm' else
ScoringRandomForest(**config['postprocessor']['random_forest']),
Collaborator

@apyskir Since the pipeline is called lgbm_train, having a random forest here doesn't make sense. I would either create a new pipeline random_forest_train (and just use .get_step() substitution) or change the name to second_level_model. I think the latter is better.

Contributor Author

@jakubczakon Right, the RF was originally just a temporary idea, but it stayed as a valid solution, so I will generalize the naming.

src/pipelines.py Outdated
return scoring_model


def lgbm_inference(config, input_pipeline):
Collaborator

@apyskir I like this, with 2 tiny tweaks: I'd rather have input_pipeline as the first argument and config second. Also, I think the naming needs to change (as explained before).

Contributor Author

@jakubczakon OK, thx

Contributor Author

@jakubczakon I think we need config to be the first argument, because in pipeline_manager.py we call pipeline(SOLUTION_CONFIG), not pipeline(config=SOLUTION_CONFIG), so it tries to bind the first argument of lgbm_inference to SOLUTION_CONFIG.
I have to keep the order, or change pipeline_manager.py.
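
In other words (the positional call is visible in the diff above; the keyword form is the hypothetical change that would free the argument order):

    # current call site in pipeline_manager.py, positional:
    pipeline = PIPELINES[pipeline_name]['inference'](SOLUTION_CONFIG)

    # keyword call that would allow input_pipeline to come first:
    pipeline = PIPELINES[pipeline_name]['inference'](config=SOLUTION_CONFIG)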

def categorize_multilayer_image(image):
categorized_image = []
for category_id, category_output in enumerate(image):
thrs_step = 1. / (CATEGORY_LAYERS[category_id] + 1)
Collaborator

@apyskir thrs_step -> threshold_step. But more importantly, I don't understand this CATEGORY_LAYERS logic.

Contributor Author

@jakubczakon That's actually @taraspiotr's idea. The CATEGORY_LAYERS array indicates how many probability thresholds (uniformly distributed between 0 and 1) you want to use to extract objects. For the category with index 1 (buildings) you want 19 thresholds: from 0.05 to 0.95 with step 0.05. That's why CATEGORY_LAYERS[1] == 19.

Collaborator

@apyskir ok, so CATEGORY_LAYERS[0] == 1 is just the background threshold, right?

Contributor Author (Jun 21, 2018)

@jakubczakon Yes, it's the background threshold. So in a way we extract "background objects". We even extract features for them, but we drop them while preparing the data for training.
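
A quick check of that arithmetic, using the two values named in this thread and the threshold code from the diff below:

    import numpy as np

    CATEGORY_LAYERS = [1, 19]  # background, buildings (per the discussion above)

    for category_id, n_thresholds in enumerate(CATEGORY_LAYERS):
        threshold_step = 1. / (n_thresholds + 1)
        thresholds = np.arange(threshold_step, 1, threshold_step)
        print(category_id, np.round(thresholds, 2))

    # 0 [0.5]                 -> a single background cut-off at 0.5
    # 1 [0.05 0.1 ... 0.95]   -> 19 uniform thresholds with step 0.05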

thrs_step = 1. / (CATEGORY_LAYERS[category_id] + 1)
thresholds = np.arange(thrs_step, 1, thrs_step)
for thrs in thresholds:
categorized_image.append(category_output > thrs)
Collaborator

@apyskir thrs -> threshold


def join_score_image(image, score):
Collaborator

@apyskir do we need a function for that?

Contributor Author

@jakubczakon I wrote it to use with make_apply_transformer, but I forgot to wire it in and test it.

Collaborator

@apyskir if so, you can always use a lambda function

@@ -2,7 +2,7 @@
import numpy as np
import sklearn.linear_model as lr
from attrdict import AttrDict
from catboost import CatBoostClassifier
#from catboost import CatBoostClassifier
Contributor Author

@jakubczakon You need to import this file, but you don't need catboost, so I decided to comment it out, just so I'm not obliged to install catboost while I'm not using it.

Collaborator

@apyskir if you see something that should be dropped, just drop it. I don't like commented-out lines on master

Contributor Author

@jakubczakon But someone from outside may want to try catboost as a scoring model, so we don't want to drop the transformer with CatBoost. Following your argument we should drop the whole keras folder, but we don't do it, to keep the steps in place.

Collaborator

@apyskir then keep it uncommented and add the Catboost transformer

src/utils.py Outdated
@@ -68,7 +68,7 @@ def decompose(labeled):
return masks


def create_annotations(meta, predictions, logger, category_ids, save=False, experiment_dir='./'):
def create_annotations(meta, predictions, logger, category_ids, category_layers, save=False, experiment_dir='./'):
'''
:param meta: pd.DataFrame with metadata
Collaborator

@apyskir Didn't see it before, but this looks like numpy docstrings

Contributor Author

@jakubczakon That's probably my docstring. PyCharm proposed this format by default. I can change it.
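
For reference, the :param: fields in the snippet above are PyCharm's default reST style; a NumPy-style version of the same header would look like this (summary line and field text illustrative only):

    def create_annotations(meta, predictions, logger, category_ids, category_layers,
                           save=False, experiment_dir='./'):
        """Create annotations from predictions.

        Parameters
        ----------
        meta : pd.DataFrame
            DataFrame with metadata.
        """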



def get_features_for_image(image, probabilities, annotations):
Collaborator

@apyskir I think I'd rather have this as a transformer, where transform just executes a private prepare_data and then runs _transform, where the list of those feature extractions is executed. That would make it easier to read, I think. If you want to have it as a function (which is ok of course), extract the chunks responsible for getting to the single object mask and for the feature extractions on that object mask.

annotations = [annotation['segmentation'] for annotation in annotations]
for label_nr in range(1, labels.max() + 1):
mask = labels == label_nr
Collaborator

@apyskir I like np.where here (doesn't mean we can't use this of course)

for label_nr in range(1, labels.max() + 1):
mask = labels == label_nr
mask_anns.append(rle_from_binary(mask.astype('uint8')))
Collaborator

@apyskir I'd rather create mask_ann = rle_... and then mask_anns.append(mask_ann). Easier to read.
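
Applied to the loop above, that reads (same names as the diff, only the intermediate variable added):

    for label_nr in range(1, labels.max() + 1):
        mask = labels == label_nr
        mask_ann = rle_from_binary(mask.astype('uint8'))  # name it first...
        mask_anns.append(mask_ann)                        # ...then append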

thresholds = []
for n_thresholds in CATEGORY_LAYERS:
thrs_step = 1. / (n_thresholds + 1)
Collaborator

@apyskir thrs_step -> threshold_step



def get_distance_to_border(bbox, im_size):
Collaborator

@apyskir this is the min distance to the border, right? Maybe we could have one more feature with the max distance to the border as well.

Contributor Author

@jakubczakon Sure we can, I can add it.
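
A sketch of the proposed extra feature next to the existing one; the bbox layout (y_min, x_min, y_max, x_max) and im_size = (height, width) are assumptions, not taken from the diff:

    def get_distance_to_border(bbox, im_size):
        y_min, x_min, y_max, x_max = bbox
        height, width = im_size
        # closest the object gets to any image edge
        return min(y_min, x_min, height - y_max, width - x_max)

    def get_max_distance_to_border(bbox, im_size):
        # hypothetical companion feature suggested above
        y_min, x_min, y_max, x_max = bbox
        height, width = im_size
        return max(y_min, x_min, height - y_max, width - x_max)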

src/utils.py Outdated
@@ -298,6 +308,7 @@ def coco_evaluation(gt_filepath, prediction_filepath, image_ids, category_ids, s
cocoEval.params.areaRng = [[0 ** 2, 1e5 ** 2], [0 ** 2, small_annotations_size ** 2],
[small_annotations_size ** 2, 1e5 ** 2]]
cocoEval.params.areaRngLbl = ['all', 'small', 'large']
cocoEval.params.maxDets = [1, 10, 100000]
Contributor

@apyskir I think it is fine for local testing, but in general you should have max(maxDets) == 100, i.e. maxDets = [1, 10, 100], because that is the default in COCO and these values are being used on CrowdAI.

Contributor Author

@taraspiotr Thanks. I put it here because I found it somewhere else in the code. I'd better remove this line.

Contributor

@apyskir True, I believe I was using it for testing recall with multilayer output without NMS. Here it shouldn't matter, because after NMS there shouldn't be over 100 annotations for one image.
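
For reference, removing the override falls back to pycocotools' defaults, which are exactly the values recommended above (file paths taken from the coco_evaluation signature in the diff):

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO(gt_filepath)
    coco_eval = COCOeval(coco_gt, coco_gt.loadRes(prediction_filepath))
    print(coco_eval.params.maxDets)  # [1, 10, 100] -- the COCO default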

jakubczakon merged commit ee7a25c into neptune-ai:dev on Jun 21, 2018
jakubczakon added a commit that referenced this pull request Jun 21, 2018
* initial restructure

* thresholds on unet output

* added gmean tta, experimented with thresholding (#125)

* feature exractor and lightgbm

* pipeline is running ok

* tmp commit

* lgbm ready for tests

* tmp

* faster nms and feature extraction

* small fix

* cleaning

* Dev repo cleanup (#138)

* initial restructure

* clean structure (#126)

* clean structure

* correct readme

* further cleaning

* Dev apply transformer (#131)

* clean structure

* correct readme

* further cleaning

* resizer docstring

* couple docstrings

* make apply transformer, memory cache

* fixes

* postprocessing docstrings

* fixes in PR

* Dev repo cleanup (#132)

* cleanup

* remove src.

* Dev clean tta (#134)

* added resize padding, refactored inference pipelines

* refactored piepliens

* added color shift augmentation

* reduced caching to just mask_resize

* updated config

* Dev-repo_cleanup models and losses docstrings (#135)

* models and losses docstrings

* small fixes in docstrings

* resolve conflicts in with TTA PR (#137)

* refactor in stream mode (#139)

* hot fix of mask_postprocessing in tta with new make transformer

* finishing merge

* finishing merge v2

* finishing merge v3

* finishing merge v4

* tmp commit

* lgbm train and evaluate pipelines run correctly

* something is not yes

* fix

* working lgbm training with ugly train_mode=True

* back to pipelines.py

* small fix

* preparing PR

* preparing PR v2

* preparing PR v2

* fix

* fix_2

* fix_3

* fix_4