TF2 porting: category feature #667

Merged

Conversation

@jimthompson5802 (Collaborator)

Here is the start of converting the category feature to TF2 eager execution. More work is needed to complete it.

The following have been implemented:

  • Setup category encoder and decoder
  • Adapt the category encoder to use current Ludwig Embed class
  • Custom softmax cross entropy loss function for training and evaluation
  • Custom softmax cross entropy metric function

At this point the training phase completes without error. As noted above, only the loss function has been implemented.

The predict phase fails because of an incomplete implementation of the predictions() method and missing metric functions. The work to be done is similar to what I did with the binary feature.

Since this is the first time I've implemented an encoder and decoder, I'd appreciate it if you would take a look at how I implemented them.

I'm attaching the training data I'm using for testing. Included in the zip file is a log file from a test run.
ludwig_category_feature.zip

Here is the model definition I'm using

python -m ludwig.experiment --data_csv data4/train.csv \
  --skip_save_processed_input \
  --model_definition "{input_features: [{name: x1, type: numerical, preprocessing: {normalization: minmax}},
    {name: x2, type: numerical, preprocessing: {normalization: zscore}},
    {name: x3, type: numerical, preprocessing: {normalization: zscore}},
    {name: c1, type: category, embedding_size: 6}],
    combiner: {type: concat, num_fc_layers: 5, fc_size: 64},
    output_features: [{name: y, type: category}], training: {epochs: 10}}"

@jimthompson5802 (Collaborator, Author) commented Mar 29, 2020

Commit cd78001 added probabilities and predictions to the CategoryOutputFeature.predictions() method. With this change, the ludwig.experiment run completes without any noticeable errors. This zip file contains the run log (ludwig_category_log2.txt) and the contents of the directory results/experiment_run/.
ludwig_category_test_results.zip

From my perspective, I plan to do the following to finish the category feature

  • enable/test the sparse representation of a category feature
  • implement the remaining metrics
  • enable use of regularizers
  • remove commented out legacy code

Let me know if I missed anything.

@w4nderlust (Collaborator) left a comment

Great stuff!

There are a bunch of things that should be tested here:

  • all parameters of the embed encoder (pretrained embeddings etc)
  • all parameters of the loss
    Maybe adding some unit tests here would be useful.

Also, after we get it to work entirely, it will be a good idea to do two things:

  • make Embed more TF2esque (I will study how TF2 embedding layers work to see if it's a good idea or not)
  • unpack the weighted cross entropy loss in a bunch of different losses, one for the sampled case, one for the weighted case (as the one-hots are required only in that case and slow things down quite a bit when there are a lot of classes).
    I think this is a good occasion for also improving the code structure here, what do you think?

Resolved review threads on ludwig/features/category_feature.py and ludwig/models/modules/category_decoders.py.
    def _setup_loss(self):
        self.train_loss_function = SoftmaxCrossEntropyLoss(
            num_classes=self.num_classes,
            feature_loss=self.loss,

@w4nderlust (Collaborator):

maybe here we can have all parameters unpacked, and pass **self.loss to the call instead of the dictionary
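
For illustration, the call-site change being suggested might look like this (a sketch only; the exact keyword names that **self.loss expands to are assumptions):

    # Sketch: pass the loss configuration as keyword arguments instead of a dict.
    self.train_loss_function = SoftmaxCrossEntropyLoss(
        num_classes=self.num_classes,
        **self.loss  # expands to e.g. class_weights=..., labels_smoothing=...
    )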

Resolved review threads on ludwig/models/modules/category_encoders.py.
    def __init__(
            self,
            num_classes=0,
            feature_loss=None,

@w4nderlust (Collaborator):

Unpack all needed parameters from feature_loss into separate parameters.

    def __init__(
            self,
            num_classes=0,
            feature_loss=None,

@w4nderlust (Collaborator):

Also here, unpack the parameters.

@w4nderlust changed the title from "tf2 port: convert category feature to eager execution" to "TF2 porting: category feature" on Mar 29, 2020
@jimthompson5802 (Collaborator, Author) commented Mar 30, 2020

re: these two comments

unpack the weighted cross entropy loss in a bunch of different losses

maybe here we can have all parameters unpacked, and pass **self.loss to the call instead of the dictionary

I understand this to mean to make the loss dictionary parameter more explicit in the function signature.

In the case of https://github.com/uber/ludwig/blob/126c1ca4df0c2f7a121cf68dadab01309f2fae43/ludwig/models/modules/loss_modules.py#L330
the loss dictionary parameter is replaced by keyword parameters class_weights=1, labels_smoothing=0.

Likewise for https://github.com/uber/ludwig/blob/126c1ca4df0c2f7a121cf68dadab01309f2fae43/ludwig/models/modules/loss_modules.py#L170-L172

The loss dictionary parameter is replaced by keyword parameters such as sampler=null, negative_samples=0, distortion=1, etc.
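
In other words, a signature along these lines (a sketch of this interpretation; the argument names other than the quoted defaults are assumptions):

    # Sketch of an unpacked signature for the sampled case (illustrative only).
    def sampled_softmax_cross_entropy(
            labels,
            last_hidden,
            sampler=None,
            negative_samples=0,
            distortion=1,
            **kwargs
    ):
        ...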

Is this the correct interpretation of the guidance?

@jimthompson5802 (Collaborator, Author) commented Mar 30, 2020

One more question re:

Maybe adding some unit tests here would be useful.

Is the focus of the 'unit test' on the loss function itself, i.e.,

  • that it returns correct values, or
  • its use in a training loop that demonstrates convergence?

I'm thinking you mean the former ('returns correct values'), but I wanted to confirm. In that case, do you have any favorite test cases that I can use to bootstrap this work?

@w4nderlust (Collaborator)

Is this the correct interpretation of the guidance?

Yes.

Is the focus of the 'unit test' on the loss function itself, i.e.,

So, I would imagine a test to check that the different losses (sampled, non-sampled, weighted, non-weighted) work for different parameters, ideally verifying the correct values.
The problem with using an integration test for this is that you would have to run a whole training for every combination of parameters, which would be time consuming, while all you want to test is just the loss function with the different parameter combinations.
Ideally one has both, but the unit test will do for now.

I don't have specific tests, but one thing I usually did when implementing that functionality was to take a dataset with a big output vocabulary, say 10000, and compare computation times and loss values between the regular softmax and the sampled one. Ideally the value of the loss is not too distant (although it will certainly be different, in particular at the beginning), while the speed of the sampled one should be much higher. Not sure this helps, but it's a good way to at least assess whether the sampled softmax is running correctly.

@jimthompson5802 (Collaborator, Author) commented Apr 1, 2020

I have a few questions on the sampled softmax loss functions.

Re: loss[class_counts]

I see references to a parameter loss['class_counts'] in a couple of sections of the function sampled_softmax_cross_entropy():
https://github.com/uber/ludwig/blob/126c1ca4df0c2f7a121cf68dadab01309f2fae43/ludwig/models/modules/loss_modules.py#L174-L183

https://github.com/uber/ludwig/blob/126c1ca4df0c2f7a121cf68dadab01309f2fae43/ludwig/models/modules/loss_modules.py#L200-L209

As best I can tell, loss['class_counts'] is an undocumented parameter. I don't see it in the user guide, nor do I see it in the CategoryOutputFeature.populate_defaults() method. Is this a new parameter for the loss dictionary? If so, what is a reasonable default value that I can use?

re: output_placeholder parameter.

I'm having difficulty understanding how to handle this parameter. As I understand it, the parameter eventually makes its way through output_placeholder ==> output_exp ==> true_classes=output_exp for functions such as tf.nn.fixed_unigram_candidate_sampler, tf.nn.uniform_candidate_sampler, tf.nn.log_uniform_candidate_sampler, and tf.nn.learned_unigram_candidate_sampler.

One question I have: what is the difference between output_placeholder ==> true_classes=output_exp and vector_labels? Both seem to represent the target category.

Assuming that the target is an encoded integer, I believe the shape of output_placeholder is [batch, 1]. However, the TF documentation for the tf.nn.*_sampler() functions says the shape is [batch, num_true]. What is the implication of num_true > 1?

Re: the class_weights and class_biases parameters

Are these the weights and biases used to create the logits?

@w4nderlust (Collaborator) commented Apr 1, 2020

I have a few questions on the sampled softmax loss functions.

Re: loss[class_counts]

To be honest, I implemented that part more than 2 years ago and I don't remember the details. What I believe to be true is that loss['class_counts'] should be a list of frequencies ordered by id, basically the content of the metadata idx2freq sorted by idx2str, but I don't remember why that parameter was undocumented. It's definitely a value that can be set within update_model_definition_with_metadata, because the information needed for it comes from the metadata.
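
For what it's worth, a hypothetical sketch of wiring that up inside update_model_definition_with_metadata(); the metadata keys idx2str and str2freq are assumptions and may not match the exact layout described above:

    # Build loss['class_counts'] as a list of class frequencies ordered by
    # integer id (hypothetical metadata keys; adjust to the real structure).
    idx2str = feature_metadata['idx2str']    # integer id -> class string
    str2freq = feature_metadata['str2freq']  # class string -> frequency count
    output_feature['loss']['class_counts'] = [str2freq[s] for s in idx2str]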

re: output_placeholder parameter.

One question I have: what is the difference between output_placeholder ==> true_classes=output_exp and vector_labels? Both seem to represent the target category.

Yes, they serve the same purpose, but some functions (specifically the weighted losses, for instance) need the labels as one-hots, while others, like the sampled softmax I believe, need them as integers; that's why you see both used.

Assuming that the target is an encoded integer, I believe the shape of output_placeholder is [batch, 1]. However, the TF documentation for the tf.nn.*_sampler() functions says the shape is [batch, num_true]. What is the implication of num_true > 1?

For categorical features it is always one. For other types of features (sets, for instance) it may be more than one, but for now don't worry and consider it to be 1. If you need, I can provide a general explanation of the intuition behind sampled softmax, as it can be confusing; it may help you understand better what's going on with that loss.

Re: the class_weights and class_biases parameters

Are these the weights and biases used to create the logits?

Yes they are weights and biases from the Classifier decoder. I'm pretty sure you can obtain them from a Dense object, something like Dense.kernel and Dense.bias or something similar.

@jimthompson5802 (Collaborator, Author)

Thank you for the clarifications.

re: loss['class_counts']

I understand that loss['class_counts'] is not user specified through the model_definition specification but can be computed and added to the loss{} dictionary.

re: class_weights and class_biases

Thank you for the pointer. I believe `self.decoder_obj.get_weights()[0]` and `self.decoder_obj.get_weights()[1]` will retrieve the weights and biases, respectively.
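
Both routes should work; a minimal standalone sketch with a plain tf.keras Dense layer (not the actual Ludwig decoder object):

    import tensorflow as tf

    dense = tf.keras.layers.Dense(10)
    dense.build((None, 64))                   # creates the kernel and bias variables

    kernel, bias = dense.kernel, dense.bias   # tf.Variable objects
    kernel_np, bias_np = dense.get_weights()  # NumPy copies of the same values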

I'll proceed as described above.

@jimthompson5802 (Collaborator, Author)

I just noticed a potential naming conflict with the sampled softmax loss function.

Case 1: Current code invokes tf.nn.sampled_softmax_loss() https://github.com/uber/ludwig/blob/268eb7a319f31d03e9af911772891da4cd31b829/ludwig/models/modules/loss_modules.py#L213-L215

where class_weights and class_biases are

Yes they are weights and biases from the Classifier decoder. I'm pretty sure you can obtain them from a Dense object, something like Dense.kernel and Dense.bias or something similar.

Case 2: There is a second use of the name class_weights that comes from the model_definition loss specification, specifically (from the Ludwig user guide):

class_weights (default 1): the value can be a vector of weights, one for each class, that is multiplied to the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like {class_a: 0.5, class_b: 0.7, ...}.

This conflict surfaced because of #667 (comment):

maybe here we can have all parameters unpacked, and pass **self.loss to the call instead of the dictionary

So the elements in the model definition loss dictionary are now passed as keyword parameters, which makes class_weights in the second case now visible in the function invocation.

I'm planning to resolve this conflict by changing the names class_weights and class_biases in Case 1 to decoder_weights and decoder_biases. I figure this is the better change because it leaves the model definition specification unchanged; changing the name in Case 2 would result in a breaking change for users.
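
For illustration, the rename would only touch the Ludwig-side argument names (a sketch; the surrounding variable names are made up, and tf.nn.sampled_softmax_loss keeps its own parameter names):

    # Sketch: decoder parameters feed tf.nn.sampled_softmax_loss under new names.
    sampled_loss = tf.nn.sampled_softmax_loss(
        weights=decoder_weights,    # formerly class_weights (Case 1)
        biases=decoder_biases,      # formerly class_biases
        labels=labels,              # [batch, 1] integer targets
        inputs=last_hidden,         # representation fed to the decoder
        num_sampled=negative_samples,
        num_classes=num_classes,
    )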

Let me know if I'm missing something.

@w4nderlust (Collaborator)

Sounds perfect.

@jimthompson5802 (Collaborator, Author)

I'm closer to getting sampled softmax working. The current hurdle I have to get over involves the call to tf.nn.sampled_softmax_loss() in this section of Ludwig's loss function sampled_softmax_cross_entropy(): https://github.com/uber/ludwig/blob/268eb7a319f31d03e9af911772891da4cd31b829/ludwig/models/modules/loss_modules.py#L213-L220

The issue is in line 216. As I understand this part, feature_hidden is the input to the category output feature's decoder. As best I can tell, feature_hidden is not addressable at the point where Ludwig's sampled_softmax_cross_entropy() is invoked.

Looking over the TF2 code base, I believe the required feature_hidden data structure is generated at this point in the processing: https://github.com/uber/ludwig/blob/268eb7a319f31d03e9af911772891da4cd31b829/ludwig/models/ecd.py#L61-L72
I believe the required data is decoder_last_hidden, which is eventually stored in the dictionary output_last_hidden[output_feature_name]. This dictionary is not returned from this call.

I believe that to complete the sampled softmax function we have to return the output_last_hidden dictionary from this function.

I wanted to run this by you before making any changes. I also noticed at least one other person has started contributing to the TF2 code base. So if a change has to be made, I wanted to coordinate it with you to minimize generating merge conflicts.

@jimthompson5802 (Collaborator, Author)

The last set of commits fixes the numeric output feature issue. The resolution was to wrap the Keras MSE metric class with a Ludwig-specific wrapper class to extract the required values.

Summary of what has been adapted to support the new prediction format:

  • Category output feature with softmax and sampled softmax loss computations
  • Numeric output feature with MSE loss function (only)

If the approach proposed in #667 (comment) is reasonable, I can do the retrofit of the

  • Numerical feature (retrofit for MAE loss)
  • Binary output feature

I'll wait to hear the verdict before continuing the work.

@w4nderlust (Collaborator)

OK, I checked everything. I approve of this design :) Great job!
There is still some cleanup needed in some functions, but you already put the TODOs there, so that's fine.
Also, eventually I would want to clean up the losses in general, but that's on me to do after we get everything working first.

One minor thing: 'type', 'logits' and 'final_hidden' can now definitely become constants.
Also, have you tested all the samplers? That would be useful to give confidence in the solidity of the implementation (again, a unit test may help here).

You can move on and adapt this to binary and numerical.

Once that is done, we can look at the encoding side, to see if there's a way to use the TF2 Embedding layer or to transform Ludwig's embedding mechanism into a TF2 Layer (which would be useful for serialization and deserialization purposes), and we should test the loading and assignment of pre-trained embeddings carefully.

@jimthompson5802 (Collaborator, Author)

Thank you. I understand my next steps are

  • Convert 'type', 'logits' and 'last_hidden' to Ludwig constants in constants.py and propagate them through the new work.
  • Retrofit the numerical and binary features to support the new prediction dictionary format.
  • Test out the other samplers in the sampled softmax loss computations.

@jimthompson5802 (Collaborator, Author)

Oops... just a point of clarification: I may have had a typo in my earlier response.

What I've been using as 'last_hidden', you want to rename to 'final_hidden'.

@jimthompson5802 (Collaborator, Author) commented Apr 4, 2020

Question on the coding convention for the use of ludwig/constants.py. Some modules use from ludwig.constants import *, while other modules use from ludwig.constants import LOSS, COMBINED. Right now I'm adhering to whichever convention I find in the specific module.

Is there a preference for which form to use?

@w4nderlust (Collaborator) commented Apr 5, 2020

What I've been using as 'last_hidden', you want to rename to 'final_hidden'.

LAST_HIDDEN is fine.

Is there a preference for which form to use?

tl;dr do as you please

Since the beginning I've been using * in the import from constants because otherwise, every time I introduce a new one, one would also have to change the imports. DeepSource complains about it, so in some cases when adding new code I ended up importing the constant itself without *, but I haven't been consistent in removing * everywhere and adding the specific ones. I'll probably do it eventually, but so far I've always found a better way to spend my Ludwig time :)

@jimthompson5802 (Collaborator, Author)

Status update:

  • constants.py updated to include LOGITS, TYPE and FINAL_HIDDEN. Retrofitted these new constants in the code base.
  • completed updating numerical and binary features to support new predictions format

Re: testing the other samplers. While working on this, I noticed there was no existing unit test for the simple features (e.g., numerical, binary, and category features), so I decided to add a new test harness, test_simple_features.py, based on test_experiment.py.

test_simple_features.py creates a simple model with one input feature to test the encoder and one output feature to test the decoder, and provides an optional way to specify loss parameters. This capability was implemented using pytest's pytest.mark.parametrize decorator:

    @pytest.mark.parametrize(
        'input_test_feature, output_test_feature, output_loss_parameter',
        ...
    )

  • input_test_feature: specifies Ludwig's data generator function for the input feature, e.g., numerical_feature()
  • output_test_feature: specifies Ludwig's data generator function for the output feature, e.g., numerical_feature()
  • output_loss_parameter: either None or the output feature's loss parameter

Example usage:

@pytest.mark.parametrize(
    'input_test_feature, output_test_feature, output_loss_parameter',
    [
        # numerical features
        (numerical_feature(), numerical_feature(), None),
        (
            numerical_feature(normalization='minmax'),
            numerical_feature(),
            {'loss': {'type': 'mean_squared_error'}}
        ),

The above example specifies two test runs:

  • The first specifies a model with a numerical input feature with no preprocessing, a numerical output feature, and the default loss specification.
  • The second specifies a model with a numerical input feature with minmax normalization and a numerical output feature with a mean_squared_error loss specification.

In the case of the categorical feature, here are the test cases for each sampler.

        # Categorical feature
        (category_feature(), category_feature(), None),
        (
            category_feature(),
            category_feature(),
            {'loss': {'type': 'softmax_cross_entropy'}}
        ),
        (
            category_feature(),
            category_feature(),
            {'loss': {
                'type': 'sampled_softmax_cross_entropy',
                'sampler': 'fixed_unigram',
                'negative_samples': 10
            }}
        ),
        (
            category_feature(),
            category_feature(),
            {'loss': {
                'type': 'sampled_softmax_cross_entropy',
                'sampler': 'uniform',
                'negative_samples': 10
            }}
        ),
        (
            category_feature(),
            category_feature(),
            {'loss': {
                'type': 'sampled_softmax_cross_entropy',
                'sampler': 'log_uniform',
                'negative_samples': 10
            }}
        ),
        (
            category_feature(),
            category_feature(),
            {'loss': {
                'type': 'sampled_softmax_cross_entropy',
                'sampler': 'learned_unigram',
                'negative_samples': 10
            }}
        )

Here is the pytest log for tests/integration_tests/test_simple_features.py:

root@a28f4a25cc7b:/opt/project# pytest tests/integration_tests/test_simple_features.py
=========================================================== test session starts ============================================================
platform linux -- Python 3.6.9, pytest-5.4.1, py-1.8.1, pluggy-0.13.1
rootdir: /opt/project
plugins: typeguard-2.7.1
collected 10 items

tests/integration_tests/test_simple_features.py ..........                                                                           [100%]

============================================================= warnings summary =============================================================
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py:15
  /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py:15: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

tests/integration_tests/test_simple_features.py::test_feature[input_test_feature3-output_test_feature3-None]
tests/integration_tests/test_simple_features.py::test_feature[input_test_feature4-output_test_feature4-None]
tests/integration_tests/test_simple_features.py::test_feature[input_test_feature5-output_test_feature5-output_loss_parameter5]
tests/integration_tests/test_simple_features.py::test_feature[input_test_feature6-output_test_feature6-output_loss_parameter6]
tests/integration_tests/test_simple_features.py::test_feature[input_test_feature7-output_test_feature7-output_loss_parameter7]
tests/integration_tests/test_simple_features.py::test_feature[input_test_feature8-output_test_feature8-output_loss_parameter8]
tests/integration_tests/test_simple_features.py::test_feature[input_test_feature9-output_test_feature9-output_loss_parameter9]
  /usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
    _warn_prf(average, modifier, msg_start, len(result))

-- Docs: https://docs.pytest.org/en/latest/warnings.html
===================================================== 10 passed, 8 warnings in 13.84s ======================================================
root@a28f4a25cc7b:/opt/project#

Re: the category feature sampler tests. These tests show that the code runs to completion without issue. I still have to look at how to test whether the correct values are computed.

What do you think?

@w4nderlust (Collaborator) commented Apr 5, 2020

This all sounds great, thank you! Adding @msaisumanth to the discussion here, as he was the one who implemented most of the tests, to give an opinion and suggestions, but it all looks good to me.
After he does a pass I think we can merge this PR and then open a new one for the TF2ization of the embed encoder.

@jimthompson5802 (Collaborator, Author)

Sounds good. I'll hold off on pushing any other changes to this PR until I hear back.

In the meantime, I can work offline on more robust testing of the category feature and the sampled softmax. Re: your suggestion:

I don't have specific tests, but one thing I usually did when implementing that functionality was to take a dataset with a big output vocabulary, say 10000, and compare computation times and loss values between the regular softmax and the sampled one. Ideally the value of the loss is not too distant (although it will certainly be different, in particular at the beginning), while the speed of the sampled one should be much higher. Not sure this helps, but it's a good way to at least assess whether the sampled softmax is running correctly.

I understand the overall procedure to be:

  • generate a category output feature with a large vocabulary using Ludwig's data generation function, e.g., category_feature(vocab_size=10000).
  • Do two training runs. One run with loss specified as softmax_cross_entropy and the second run with loss of sampled_softmax_cross_entropy.
  • compare the losses between the two runs. The losses should be "close" and the run time of the sampled_softmax_cross_entropy run should be faster (a rough standalone sanity check is sketched below).
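
A rough standalone sanity check of the underlying TF ops, independent of Ludwig, might look like this (a sketch using random data; the shapes, vocabulary size, and sample count are made up):

    import time
    import tensorflow as tf

    num_classes, hidden_size, batch_size = 10000, 64, 128
    W = tf.Variable(tf.random.normal([num_classes, hidden_size]))
    b = tf.Variable(tf.zeros([num_classes]))
    h = tf.random.normal([batch_size, hidden_size])  # stand-in for feature_hidden
    y = tf.random.uniform([batch_size, 1], maxval=num_classes, dtype=tf.int64)

    t0 = time.time()
    full = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=tf.squeeze(y, -1),
        logits=tf.matmul(h, W, transpose_b=True) + b)
    t1 = time.time()
    sampled = tf.nn.sampled_softmax_loss(
        weights=W, biases=b, labels=y, inputs=h,
        num_sampled=25, num_classes=num_classes)
    t2 = time.time()

    print('full softmax loss:    %.4f (%.4fs)' % (float(tf.reduce_mean(full)), t1 - t0))
    print('sampled softmax loss: %.4f (%.4fs)' % (float(tf.reduce_mean(sampled)), t2 - t1))

With a vocabulary this large the sampled call should be noticeably faster; as noted above, the loss values will differ, especially early on, so this mostly checks speed and that the op runs at all.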

Some questions on details:

  • If the output feature vocabulary size is 10,000, what size training data set should I create? Do I need a size at least as large as the vocabulary size?
  • When I compare losses, what granularity of the loss? Some examples I thought could be the basis of comparison:
    • loss per batch
    • mean loss per epoch over all the batches in that epoch
    • loss as reported by the evaluation on the training dataset, or validation data set or test data set.

Given the above choices, I'm thinking the comparison should be the first one, loss per batch, because that loss drives the gradient calculation.

    @tf.function
    def train_step(self, model, optimizer, inputs, targets):
        with tf.GradientTape() as tape:
            logits = model(inputs, training=True)
            loss, _ = model.train_loss(targets, logits)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        print('Training loss (for one batch): %s' % float(loss))
  • for the sampled_softmax_cross_entropy any recommended settings for the parameters associated with sampled softmax?
  • For purposes of this testing, I think a small number of epochs should be sufficient. Do I need anything more than 2 or 3 epochs?

@jimthompson5802 (Collaborator, Author) commented Apr 5, 2020

I figured I should push this fix before the merge.

While working on the sampled softmax testing, I noticed that the values stored in <feature_name>_probabilities.csv are written as tensors, e.g.,

"tf.Tensor(0.0917323, shape=(), dtype=float32)","tf.Tensor(0.09546962, shape=(), dtype=float32)","tf.Tensor(0.09563452, shape=(), dtype=float32)","tf.Tensor(0.09365028, shape=(), dtype=float32)","tf.Tensor(0.0859547, shape=(), dtype=float32)","tf.Tensor(0.0921639, shape=(), dtype=float32)","tf.Tensor(0.090296395, shape=(), dtype=float32)","tf.Tensor(0.092099555, shape=(), dtype=float32)","tf.Tensor(0.083821714, shape=(), dtype=float32)","tf.Tensor(0.08591533, shape=(), dtype=float32)","tf.Tensor(0.09326165, shape=(), dtype=float32)"
"tf.Tensor(0.09311067, shape=(), dtype=float32)","tf.Tensor(0.10002747, shape=(), dtype=float32)","tf.Tensor(0.09493849, shape=(), dtype=float32)","tf.Tensor(0.094783254, shape=(), dtype=float32)","tf.Tensor(0.08606884, shape=(), dtype=float32)","tf.Tensor(0.08756173, shape=(), dtype=float32)","tf.Tensor(0.0891696, shape=(), dtype=float32)","tf.Tensor(0.091503404, shape=(), dtype=float32)","tf.Tensor(0.08338976, shape=(), dtype=float32)","tf.Tensor(0.08853781, shape=(), dtype=float32)","tf.Tensor(0.09090898, shape=(), dtype=float32)"
"tf.Tensor(0.09113735, shape=(), dtype=float32)","tf.Tensor(0.09825466, shape=(), dtype=float32)","tf.Tensor(0.09695278, shape=(), dtype=float32)","tf.Tensor(0.098814435, shape=(), dtype=float32)","tf.Tensor(0.079404235, shape=(), dtype=float32)","tf.Tensor(0.09384818, shape=(), dtype=float32)","tf.Tensor(0.08696946, shape=(), dtype=float32)","tf.Tensor(0.09264682, shape=(), dtype=float32)","tf.Tensor(0.080888115, shape=(), dtype=float32)","tf.Tensor(0.08343119, shape=(), dtype=float32)","tf.Tensor(0.09765276, shape=(), dtype=float32)"
"tf.Tensor(0.0917323, shape=(), dtype=float32)","tf.Tensor(0.09546962, shape=(), dtype=float32)","tf.Tensor(0.09563452, shape=(), dtype=float32)","tf.Tensor(0.09365028, shape=(), dtype=float32)","tf.Tensor(0.0859547, shape=(), dtype=float32)","tf.Tensor(0.0921639, shape=(), dtype=float32)","tf.Tensor(0.090296395, shape=(), dtype=float32)","tf.Tensor(0.092099555, shape=(), dtype=float32)","tf.Tensor(0.083821714, shape=(), dtype=float32)","tf.Tensor(0.08591533, shape=(), dtype=float32)","tf.Tensor(0.09326165, shape=(), dtype=float32)"

Commit 73ef884 fixes the above issue. The data written to <feature_name>_probabilities.csv now appears as follows:

0.092731796,0.100902446,0.09528344,0.098668285,0.08902399,0.09348325,0.08973809,0.08907923,0.08383052,0.08188916,0.08536975
0.0918586,0.09546873,0.09540367,0.09558848,0.08542835,0.097054325,0.09082846,0.09085872,0.08415315,0.08341262,0.08994486
0.08857817,0.09729856,0.09666546,0.09640322,0.08355103,0.098109104,0.091066405,0.086696416,0.08678802,0.08358456,0.09125907
0.08857817,0.09729856,0.09666546,0.09640322,0.08355103,0.098109104,0.091066405,0.086696416,0.08678802,0.08358456,0.09125907
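
(For context, a guess at the kind of conversion such a fix typically involves; this sketch is illustrative and not taken from commit 73ef884:)

    # Convert eager tensors to NumPy values before handing rows to csv.writer,
    # so plain floats are serialized instead of tf.Tensor reprs (illustrative).
    if hasattr(probabilities, 'numpy'):
        probabilities = probabilities.numpy()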

@w4nderlust (Collaborator)

Some questions on details:

* If the output feature vocabulary size is 10,000, what size training data set should I create?  Do I need a size at least as large as the vocabulary size?

* When I compare losses, what granularity of the loss?  Some examples I thought could be the basis of comparison:
  
  * loss per batch
  * mean loss per epoch over all the batches in that epoch
  * loss as reported by the evaluation on the training dataset, or validation data set or test data set.

* for the `sampled_softmax_cross_entropy` any recommended settings for the parameters associated with sampled softmax?

* For purposes of this testing, I think a small number of epochs should be sufficient.  Do I need anything more than 2 or 3 epochs?

A general consideration that probably answers all the questions: the reason I didn't add a test like this to the suite before is that it would be really slow compared to most other tests. As you noticed, having 10000 categories means there have to be at least 10000 datapoints, and if you turn on stratified sampling in the synthesizer you'll get 1 sample per class, which is not great for comparing loss values; you are probably better off with a greater number, say 100000. Now, generating data and training a model on 100000 datapoints may take minutes, and it could be too expensive to run as a regular CI task.

Moreover, as you noticed, the idea of a similar loss or potentially similar accuracy is very fuzzy: the range of what is acceptable may vary depending on the number of categories and the number of negative samples (in theory, with 10000 categories and 9999 negative samples the losses should be identical, for instance), so it's really difficult to give you a precise epsilon that is always acceptable, or a precise set of parameters. The number of epochs may also vary depending on data size and number of classes. This is another reason why I didn't want to include it in my suite of tests and just used it as a sanity check.

Given this, I don't know if there's a mechanism in pytest to add a test case and tag it so that it is ignored by default but still there for people to run manually; that would definitely be useful.
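
(For what it's worth, pytest custom markers can do this; a minimal sketch, assuming a marker named slow is registered in pytest.ini and deselected by default:)

    import pytest

    # With pytest.ini containing:
    #   [pytest]
    #   markers = slow: long-running sanity checks
    #   addopts = -m "not slow"
    # this test is skipped by default and can be run manually with: pytest -m slow
    @pytest.mark.slow
    def test_sampled_vs_full_softmax_sanity():
        ...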

A possible alternative is to use a dataset with a relatively wide set of classes and see what the loss and accuracy trends of sampled vs. non-sampled look like. This will not allow us to test the speed aspect (with, say, 20 classes, the speed advantage of the sampled softmax will not show up), but it will give us confidence that the mechanism works overall. A possible example dataset for this could be the ATIS intent classification task, but it would require the sequence encoders to already be ported to TF2.

@jimthompson5802 (Collaborator, Author)

Thank you for the response.

I have a related question. When running with sampled_softmax I see many occurrences of this message during training, probably once per batch. These messages show up regardless of sampler specified.

WARNING:tensorflow:Gradients do not exist for variables ['ecd/category_output_feature_1/classifier/dense_2/kernel:0', 'ecd/category_output_feature_1/classifier/dense_2/bias:0'] when minimizing the loss.

Have you seen this message before?

I've done a search on 'gradients do not exist for variables'. I found a few references on Stack Overflow and in TensorFlow's GitHub issues. Based on my understanding, none seemed to point to a solution.

I've tried different settings for vocab_size, the number of observations, and the number and size of fully connected layers. The warning message always appears.

@w4nderlust (Collaborator) commented Apr 6, 2020

No, I've never seen it; it could be a new TF2 thing. It partially makes sense, as only parts of those variables should receive gradients when using sampled softmax (only the rows corresponding to the positive label and the sampled negatives at each batch), but if they don't receive gradients at all, that's problematic. Does it happen always or intermittently?

@msaisumanth (Contributor)

@jimthompson5802 The tests look great! Thanks a lot

@w4nderlust (Collaborator)

So I'm merging this.
I guess there are still a couple of things to figure out:

  • TF2ization of the embedding encoder
  • figuring out the weird error with sampled losses

We could open two separate PRs for those.

@w4nderlust merged commit d3f677b into ludwig-ai:tf2_porting on Apr 8, 2020
@jimthompson5802 deleted the tf2_categorical_feature branch on April 8, 2020 at 11:10
@jimthompson5802 (Collaborator, Author)

Thank you for the merge. Which one would you like me to focus on next? And how do you want to keep track of the other one so it does not fall through the cracks?

@w4nderlust (Collaborator)

We can use this column to keep track: https://github.com/uber/ludwig/projects/1#column-8037588. I will update it with the two items we discussed.

I believe the TF2ization will be fundamental when we reintroduce the model saving functionality, and I believe we would have to find new solutions for loading pre-trained embeddings, but at the same time I believe there is a bigger and more fundamental thing to do, which is tackling sequence features. If we tackle them, then text, timeseries, and audio fall into place easily, while the remaining features are pretty straightforward. So, if you are OK with it, let's work on the sequence features first.

@jimthompson5802 (Collaborator, Author)

re:

So, if you are OK with it, let's work on the sequence features first.

I understand you'd like me to work on the sequence feature next, similar to the work I did for the binary and category features. I'll get started and submit a new PR for this work.

@w4nderlust (Collaborator)

Yes. Consider that the sequence feature is the most complex one to tackle because it has the greatest number of different encoders and because it has two decoders; the tagger should be easy, but you already had a taste of the generator decoder :) This time it will be easier, though, because there won't be any mix of TF1 and TF2; it will be pure TF2, which will make things much easier.
I would start from the sequence input feature anyway, as that one is more straightforward than the output feature.

@w4nderlust (Collaborator)

An additional note: @ydudin3 ported the image input feature to TF2 using TF layers (resnet is still WIP). I did an additional pass on both conv and fc layers to expose almost all their parameters all the way up the stack. The work done on Stacked2DCNN that you can see in the commits is a good blueprint on how to do the same kind of work for both Embed encoder for category features and all the encoders for the sequence features.

@jimthompson5802 (Collaborator, Author)

Thank you for the pointer to Stacked2DCNN. I'll look at its implementation.
