
Dev lgbm #147

Merged: 34 commits into neptune-ai:dev on Jun 21, 2018

Conversation

apyskir (Contributor) commented Jun 20, 2018

LightGBM as a score builder, with the possibility to replace it with a default random forest.

Jakub Czakon and others added 29 commits May 29, 2018 19:12
* initial restructure

* clean structure (neptune-ai#126)

* clean structure

* correct readme

* further cleaning

* Dev apply transformer (neptune-ai#131)

* clean structure

* correct readme

* further cleaning

* resizer docstring

* couple docstrings

* make apply transformer, memory cache

* fixes

* postprocessing docstrings

* fixes in PR

* Dev repo cleanup (neptune-ai#132)

* cleanup

* remove src.

* Dev clean tta (neptune-ai#134)

* added resize padding, refactored inference pipelines

* refactored piepliens

* added color shift augmentation

* reduced caching to just mask_resize

* updated config

* Dev-repo_cleanup models and losses docstrings (neptune-ai#135)

* models and losses docstrings

* small fixes in docstrings

* resolve conflicts in with TTA PR (neptune-ai#137)
# Conflicts:
#	src/callbacks.py
#	src/evaluate_checkpoint.py
#	src/main.py
#	src/metrics.py
#	src/models.py
#	src/neptune.yaml
#	src/pipeline_config.py
#	src/pipelines.py
#	src/postprocessing.py
#	src/steps/preprocessing/misc.py
# Conflicts:
#	README.md
#	main.py
#	neptune.yaml
#	src/callbacks.py
#	src/loaders.py
#	src/models.py
#	src/pipeline_config.py
#	src/pipeline_manager.py
#	src/pipelines.py
#	src/postprocessing.py
#	src/utils.py
neptune.yaml Outdated
erode_selem_size: 0
dilate_selem_size: 2
tta_aggregation_method: gmean
iou_threshold: 0.5
Collaborator

@apyskir I'd go with something more descriptive, like nms_iou_threshold or similar.

Contributor Author

@jakubczakon OK, got it.

Collaborator

weighted_loss = partial(multiclass_weighted_cross_entropy,
**get_loss_variables(**architecture_config['weighted_cross_entropy']))
loss = partial(mixed_dice_cross_entropy_loss, dice_weight=architecture_config['loss_weights']['dice_mask'],
weights_function = partial(get_weights, **architecture_config['weighted_cross_entropy'])
Collaborator

@apyskir Why do we have to refactor this part once again? Is this change necessary?

Contributor Author

@jakubczakon I just adapted PyTorchUNetWeightedStream to match the loss definition in PyTorchUNetWeighted. I think @taraspiotr may have missed it while refactoring, because the stream mode was being dropped.

src/models.py Outdated
super().__init__(model_params, training_params)

def fit(self, features, **kwargs):
df_features = []
Collaborator

@apyskir I would move this out to a local method, so that fit contains only the high-level logic: prepare data, then fit.

Contributor Author

@jakubczakon OK, I can do it.
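
A minimal sketch of the suggested split, assuming a hypothetical private helper _prepare_data (the class, the 'iou' target column, and the helper name are illustrative, not the PR's final code):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    class ScoringModel:
        """Illustrative stand-in for the PR's scoring transformers."""

        def __init__(self, feature_names):
            self.feature_names = feature_names
            self.estimator = RandomForestRegressor()

        def fit(self, features, **kwargs):
            # high-level logic only: prepare the data, then fit
            df_features = self._prepare_data(features)
            self.estimator.fit(df_features[self.feature_names], df_features['iou'])
            return self

        def _prepare_data(self, features):
            # e.g. stack the per-image feature frames into one training table
            return pd.concat(features, ignore_index=True)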

return self

def transform(self, features, **kwargs):
scores = []
Collaborator

@apyskir I'd refactor this to have prepare data and super().transform here

Contributor Author

@jakubczakon I don't think it's possible. I mean, it is possible, but not very helpful, I think, because super().transform sits inside a double for loop and transforms a data frame for each layer of each image.
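
A rough shape of the double loop being described (names and the base class are illustrative, not the actual code):

    class BaseScorer:
        def transform(self, df):
            return df  # stand-in for the real per-frame transform

    class ScoringModel(BaseScorer):
        def transform(self, features, **kwargs):
            scores = []
            for image_features in features:            # one entry per image
                layer_scores = []
                for layer_features in image_features:  # one entry per layer
                    layer_scores.append(super().transform(layer_features))
                scores.append(layer_scores)
            return {'scores': scores}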

Collaborator

@apyskir typo here, I meant prepare data and super().fit (as is)

return {'scores': scores}

def save(self, filepath):
joblib.dump((self.estimator, self.feature_names), filepath)
Collaborator

@apyskir I actually like that it's less verbose than a dict
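
For symmetry, a matching load would unpack the tuple in the same order (a sketch relying only on joblib's standard dump/load round-trip):

    import joblib

    class ScoringModel:
        ...

        def save(self, filepath):
            joblib.dump((self.estimator, self.feature_names), filepath)

        def load(self, filepath):
            # joblib.load returns the tuple exactly as it was dumped
            self.estimator, self.feature_names = joblib.load(filepath)
            return self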

src/models.py Outdated
self.estimator = RandomForestRegressor()

def fit(self, features, **kwargs):
df_features = []
Collaborator

@apyskir again, I would move some of this to a private method

return self

def transform(self, features, **kwargs):
scores = []
Collaborator

@apyskir same here

Contributor Author

@jakubczakon same here :)

Collaborator

@apyskir even samer here (prepare data and super().fit() as is)

meta_train = meta_train.sample(params.lgbm__num_training_examples, random_state=seed)
train_mode = False
annotations = []
for image_id in meta_train['ImageId'].values:
Collaborator

@apyskir this logic should live somewhere else; only a high-level function should be called in the manager. So I would refactor this into something like load_lgbm_data that produces meta_train_lgbm and annotations_lgbm.

Contributor Author

@jakubczakon OK, you're right
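
A sketch of the suggested helper (load_lgbm_data and the annotation loader are hypothetical; the sampling call and column name are taken from the diff above):

    def load_lgbm_data(meta_train, params, seed):
        """Keep the manager high-level: return sampled metadata plus annotations."""
        meta_train_lgbm = meta_train.sample(params.lgbm__num_training_examples,
                                            random_state=seed)
        annotations_lgbm = []
        for image_id in meta_train_lgbm['ImageId'].values:
            annotations_lgbm.append(load_annotations_for_image(image_id))  # assumed loader
        return meta_train_lgbm, annotations_lgbm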

meta_valid = meta_valid.sample(int(params.evaluation_data_sample), random_state=seed)

if dev_mode:
meta_train = meta_train.sample(20, random_state=seed)
meta_valid = meta_valid.sample(10, random_state=seed)

if 'lgbm' in pipeline_name:
Collaborator

@apyskir since this is strictly for lgbm_train, I would explicitly go if pipeline_name == 'lgbm_train':

Contributor Author

@jakubczakon Currently it's strictly for pipeline_name == 'lgbm', actually. OK.

Contributor Author

@jakubczakon ...and now it's pipeline_name == 'scoring_model'
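
Combined with the helper sketched earlier, the manager-side check would then read roughly (hypothetical):

    if pipeline_name == 'scoring_model':  # explicit, as agreed above
        meta_train_lgbm, annotations_lgbm = load_lgbm_data(meta_train, params, seed)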

@@ -146,7 +168,7 @@ def predict(pipeline_name, dev_mode, submit_predictions, chunk_size, logger, par
meta_test = meta_test.sample(2, random_state=seed)

pipeline = PIPELINES[pipeline_name]['inference'](SOLUTION_CONFIG)
prediction = generate_prediction(meta_test, pipeline, logger, CATEGORY_IDS, chunk_size)
prediction = generate_prediction(meta_test, pipeline, logger, CATEGORY_IDS, chunk_size, params.num_threads)
Collaborator

@apyskir I think we have num_threads in some places and n_threads in others. We should choose one, I think.

Contributor Author

@jakubczakon Right, I'll adjust it.

src/pipelines.py Outdated
def lgbm_train(config):
save_output = False
unet_type = 'weighted'
config['execution']['stream_mode'] = True
Collaborator

@apyskir I think we should be able to access that via config.execution.stream_mode thanks to attrdict. I think it looks nicer that way.
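
For reads the two styles are indeed equivalent with attrdict:

    from attrdict import AttrDict

    config = AttrDict({'execution': {'stream_mode': False}})
    assert config.execution.stream_mode == config['execution']['stream_mode']

One caveat worth checking: the diff line here is a write, and to my knowledge attrdict does not support assignment through nested attribute access (config.execution returns a fresh mapping), so the dict-style write config['execution']['stream_mode'] = True may have to stay.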

src/pipelines.py Outdated
unet_type = 'weighted'
config['execution']['stream_mode'] = True

if unet_type == 'standard':
Collaborator

@apyskir I think it doesn't matter which one we choose, since we are not training the unet in this pipeline. We need to train it on the entire dataset and later port it to this pipeline. So until the memory jump (and hence the 10k subset limitation) has been dealt with, we don't need both.

Contributor Author

@jakubczakon Yes, I will clean it.

src/pipelines.py Outdated

scoring_model = Step(name='scoring_model',
transformer=ScoringLightGBM(**config['postprocessor']['lightGBM'])
if config['postprocessor']['scoring_model'] == 'lgbm' else
Collaborator

@apyskir Again I think we can access values via config.postprocessor.scoring_model etc

src/pipelines.py Outdated
scoring_model = Step(name='scoring_model',
transformer=ScoringLightGBM(**config['postprocessor']['lightGBM'])
if config['postprocessor']['scoring_model'] == 'lgbm' else
ScoringRandomForest(**config['postprocessor']['random_forest']),
Collaborator

@apyskir Since the pipeline is called lgbm_train, having a random forest here doesn't make sense. I would either create a new pipeline random_forest_train (and just use .get_step() substitution) or change the name to second_level_model. I think the latter is better.

Contributor Author

@jakubczakon Right, the RF was originally just a temporary idea, but it stayed as a valid solution, so I will generalize the naming.

src/pipelines.py Outdated
return scoring_model


def lgbm_inference(config, input_pipeline):
Collaborator

@apyskir I like this, with 2 tiny tweaks: I'd rather have input_pipeline as the first argument and config second. Also, I think the naming needs to change (as explained before).

Contributor Author

@jakubczakon OK, thx

Contributor Author

@jakubczakon I think we need config to be the first argument, because in pipeline_manager.py we call pipeline(SOLUTION_CONFIG), not pipeline(config=SOLUTION_CONFIG), so it tries to bind the first argument of lgbm_inference to SOLUTION_CONFIG.
I have to keep the order, or change pipeline_manager.py.
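
In other words (the positional call is visible in the diff above; the keyword form is the hypothetical change that would free the argument order):

    # current call site in pipeline_manager.py, positional:
    pipeline = PIPELINES[pipeline_name]['inference'](SOLUTION_CONFIG)

    # keyword call that would allow input_pipeline to come first:
    pipeline = PIPELINES[pipeline_name]['inference'](config=SOLUTION_CONFIG)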

def categorize_multilayer_image(image):
categorized_image = []
for category_id, category_output in enumerate(image):
thrs_step = 1. / (CATEGORY_LAYERS[category_id] + 1)
Collaborator

@apyskir thrs_step -> threshold_step. But more importantly, I don't understand this CATEGORY_LAYERS logic.

Contributor Author

@jakubczakon That's actually @taraspiotr's idea. The CATEGORY_LAYERS array indicates how many probability thresholds (uniformly distributed between 0 and 1) you want to use to extract objects. For the category with index 1 (buildings) you want 19 thresholds: from 0.05 to 0.95 with step 0.05. That's why CATEGORY_LAYERS[1] == 19.

Collaborator

@apyskir ok, so CATEGORY_LAYERS[0] == 1 is just the background threshold, right?

Contributor Author (Jun 21, 2018)

@jakubczakon Yes, it's the background threshold. So in a way we extract "background objects". We even extract features for them, but we drop them while preparing the data for training.
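
A quick check of that arithmetic, using the two values named in this thread and the threshold code from the diff below:

    import numpy as np

    CATEGORY_LAYERS = [1, 19]  # background, buildings (per the discussion above)

    for category_id, n_thresholds in enumerate(CATEGORY_LAYERS):
        threshold_step = 1. / (n_thresholds + 1)
        thresholds = np.arange(threshold_step, 1, threshold_step)
        print(category_id, np.round(thresholds, 2))

    # 0 [0.5]                 -> a single background cut-off at 0.5
    # 1 [0.05 0.1 ... 0.95]   -> 19 uniform thresholds with step 0.05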

thrs_step = 1. / (CATEGORY_LAYERS[category_id] + 1)
thresholds = np.arange(thrs_step, 1, thrs_step)
for thrs in thresholds:
categorized_image.append(category_output > thrs)
Collaborator

@apyskir thrs -> threshold


def join_score_image(image, score):
Collaborator

@apyskir do we need a function for that?

Contributor Author

@jakubczakon I wrote it to use with make_apply_transformer, but I forgot to wire it in and test it.

Collaborator

@apyskir if so, you can always use a lambda function

@@ -2,7 +2,7 @@
import numpy as np
import sklearn.linear_model as lr
from attrdict import AttrDict
from catboost import CatBoostClassifier
#from catboost import CatBoostClassifier
Contributor Author

@jakubczakon You need to import this file, but you don't need catboost, so I decided to comment it out, just so I'm not obliged to install catboost while I'm not using it.

Collaborator

@apyskir if you see something that should be dropped, just drop it. I don't like commented-out lines on master

Contributor Author

@jakubczakon But someone from outside may want to try catboost as a scoring model, so we don't want to drop the transformer with CatBoost. Following your argument we should drop the whole keras folder, but we don't do it, to keep the steps in place.

Collaborator

@apyskir then keep it uncommented and add the Catboost transformer

src/utils.py Outdated
@@ -68,7 +68,7 @@ def decompose(labeled):
return masks


def create_annotations(meta, predictions, logger, category_ids, save=False, experiment_dir='./'):
def create_annotations(meta, predictions, logger, category_ids, category_layers, save=False, experiment_dir='./'):
'''
:param meta: pd.DataFrame with metadata
Collaborator

@apyskir Didn't see it before, but this looks like numpy docstrings

Contributor Author

@jakubczakon That's probably my docstring. PyCharm proposed this format by default. I can change it.
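
For reference, the :param: fields in the snippet above are PyCharm's default reST style; a NumPy-style version of the same header would look like this (summary line and field text illustrative only):

    def create_annotations(meta, predictions, logger, category_ids, category_layers,
                           save=False, experiment_dir='./'):
        """Create annotations from predictions.

        Parameters
        ----------
        meta : pd.DataFrame
            DataFrame with metadata.
        """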



def get_features_for_image(image, probabilities, annotations):
Collaborator

@apyskir I think I'd rather have this as a transformer, where transform just executes a private prepare_data and then runs _transform, where the list of those feature extractions is executed. That would make it easier to read, I think. If you want to have it as a function (which is ok of course), extract the chunks responsible for getting to the single object mask and for the feature extractions on that object mask.

annotations = [annotation['segmentation'] for annotation in annotations]
for label_nr in range(1, labels.max() + 1):
mask = labels == label_nr
Collaborator

@apyskir I like np.where here (doesn't mean we can't use this of course)

for label_nr in range(1, labels.max() + 1):
mask = labels == label_nr
mask_anns.append(rle_from_binary(mask.astype('uint8')))
Collaborator

@apyskir I'd rather create mask_ann = rle_... and then mask_anns.append(mask_ann). Easier to read.
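
Applied to the loop above, that reads (same names as the diff, only the intermediate variable added):

    for label_nr in range(1, labels.max() + 1):
        mask = labels == label_nr
        mask_ann = rle_from_binary(mask.astype('uint8'))  # name it first...
        mask_anns.append(mask_ann)                        # ...then append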

thresholds = []
for n_thresholds in CATEGORY_LAYERS:
thrs_step = 1. / (n_thresholds + 1)
Collaborator

@apyskir thrs_step -> threshold_step



def get_distance_to_border(bbox, im_size):
Collaborator

@apyskir this is the min distance to the border, right? Maybe we could have one more feature with the max distance to the border as well.

Contributor Author

@jakubczakon Sure we can, I can add it.
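
A sketch of the proposed extra feature next to the existing one; the bbox layout (y_min, x_min, y_max, x_max) and im_size = (height, width) are assumptions, not taken from the diff:

    def get_distance_to_border(bbox, im_size):
        y_min, x_min, y_max, x_max = bbox
        height, width = im_size
        # closest the object gets to any image edge
        return min(y_min, x_min, height - y_max, width - x_max)

    def get_max_distance_to_border(bbox, im_size):
        # hypothetical companion feature suggested above
        y_min, x_min, y_max, x_max = bbox
        height, width = im_size
        return max(y_min, x_min, height - y_max, width - x_max)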

src/utils.py Outdated
@@ -298,6 +308,7 @@ def coco_evaluation(gt_filepath, prediction_filepath, image_ids, category_ids, s
cocoEval.params.areaRng = [[0 ** 2, 1e5 ** 2], [0 ** 2, small_annotations_size ** 2],
[small_annotations_size ** 2, 1e5 ** 2]]
cocoEval.params.areaRngLbl = ['all', 'small', 'large']
cocoEval.params.maxDets = [1, 10, 100000]
Contributor

@apyskir I think it is fine for local testing, but in general you should have max(maxDets) == 100, i.e. maxDets = [1, 10, 100], because that is the default in COCO and these values are being used on CrowdAI.

Contributor Author

@taraspiotr Thanks. I put it here because I found it somewhere else in the code. I'd better remove this line.

Contributor

@apyskir True, I believe I was using it for testing recall with multilayer output without NMS. Here it shouldn't matter, because after NMS there shouldn't be over 100 annotations for one image.
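
For reference, removing the override falls back to pycocotools' defaults, which are exactly the values recommended above (file paths taken from the coco_evaluation signature in the diff):

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO(gt_filepath)
    coco_eval = COCOeval(coco_gt, coco_gt.loadRes(prediction_filepath))
    print(coco_eval.params.maxDets)  # [1, 10, 100] -- the COCO default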

jakubczakon merged commit ee7a25c into neptune-ai:dev on Jun 21, 2018
jakubczakon added a commit that referenced this pull request Jun 21, 2018
* initial restructure

* thresholds on unet output

* added gmean tta, experimented with thresholding (#125)

* feature exractor and lightgbm

* pipeline is running ok

* tmp commit

* lgbm ready for tests

* tmp

* faster nms and feature extraction

* small fix

* cleaning

* Dev repo cleanup (#138)

* initial restructure

* clean structure (#126)

* clean structure

* correct readme

* further cleaning

* Dev apply transformer (#131)

* clean structure

* correct readme

* further cleaning

* resizer docstring

* couple docstrings

* make apply transformer, memory cache

* fixes

* postprocessing docstrings

* fixes in PR

* Dev repo cleanup (#132)

* cleanup

* remove src.

* Dev clean tta (#134)

* added resize padding, refactored inference pipelines

* refactored piepliens

* added color shift augmentation

* reduced caching to just mask_resize

* updated config

* Dev-repo_cleanup models and losses docstrings (#135)

* models and losses docstrings

* small fixes in docstrings

* resolve conflicts in with TTA PR (#137)

* refactor in stream mode (#139)

* hot fix of mask_postprocessing in tta with new make transformer

* finishing merge

* finishing merge v2

* finishing merge v3

* finishing merge v4

* tmp commit

* lgbm train and evaluate pipelines run correctly

* something is not yes

* fix

* working lgbm training with ugly train_mode=True

* back to pipelines.py

* small fix

* preparing PR

* preparing PR v2

* preparing PR v2

* fix

* fix_2

* fix_3

* fix_4