[not-for-merge] Replacing the statistic pooling layer with self-attention layer in sre16 sv task #2223
Conversation
class XconfigSelfLayer(XconfigLayerBase):
    """This class is for parsing lines like
    self-layer name=attention config=mean+stddev(-99:3:9:99) input=tdnn1
Better to use a different name like weighted-stats-layer instead of a generic one like self-layer.
'dim': -1,
'config': '',
'affine-dim': 300,
'num-heads': 0}
Can num-heads be 0? Is it checked anywhere that it should not be zero?
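For illustration, a minimal sketch (my suggestion, not code from this PR) of the kind of guard the layer's check_configs() could add, following the usual xconfig pattern of raising RuntimeError on invalid options:

```python
# Hypothetical addition inside XconfigSelfLayer:
def check_configs(self):
    if self.config['num-heads'] <= 0:
        raise RuntimeError("num-heads has invalid value {0}; it must be a "
                           "positive integer.".format(self.config['num-heads']))
```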
# then transpose the weights back to column vector.

configs and their defaults: input-dim=-1, input-period=1, left-context=-1, right-context=-1,
num-heads=1, num-log-count-features=0, output-stddevs=true
num-heads=0 seems to be the default.
# The second affine node.
affine_options = 'param-stddev=0.04472135955 bias-stddev=1.0 bias-mean=0.0 max-change=0.75 l2-regularize=0.0'
configs.append(
    'component name={0}.second_affine type=NaturalGradientAffineComponent'
Is this needed if num-heads=0?
@@ -1 +0,0 @@
tuning/run_xvector_1a.sh |
Is dropout used?
''.format(self.name))

# The second affine node.
affine_options = 'param-stddev=0.04472135955 bias-stddev=1.0 bias-mean=0.0 max-change=0.75 l2-regularize=0.0'
Explain what the stddev value is.
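For context, my guess (not stated in the diff): 0.04472135955 is 1/sqrt(500), i.e. param-stddev = 1/sqrt(input-dim) with a 500-dimensional input, the usual scale-preserving initialization:

```python
import math

input_dim = 500  # assumed value; the diff does not say where the constant comes from
print(1.0 / math.sqrt(input_dim))  # 0.044721359549995794, matching param-stddev
```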
@@ -1 +0,0 @@
tuning/run_xvector_1a.sh
\ No newline at end of file
sd -> stddev
name=self.name, lc=self._left_context, rc=self._right_context,
dim=input_dim, input_period=self._input_period,
output_period=self._stats_period,
var='true' if self._output_stddev else 'false'))
@david-ryan-snyder, you should probably review this too.
This uses StatisticsExtractionComponent but without the StatisticsPoolingComponent, which is odd;
it's, in effect, pooling short groups of frames before the self-attention.
Also, this layer seems to be a bit against the spirit of a layer; it has a couple of Affine+Relu+Batchnorm blocks before the pooling and attention, so it should probably be split into two layers.
@vimalmanohar, if you have specific ideas to improve this, please comment too.
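To illustrate the point about pooling short groups of frames (a rough numpy sketch of my own, not the Kaldi C++ code):

```python
import numpy as np

# With input-period=1 and stats-period=3, the extraction component in effect
# replaces every 3 consecutive frames with their mean (plus a count column,
# omitted here), so the attention weights are then computed over group-level
# statistics rather than over individual frames.
frames = np.random.randn(9, 4)    # 9 frames of 4-dim features
groups = frames.reshape(3, 3, 4)  # 3 non-overlapping groups of 3 frames
pooled = groups.mean(axis=1)      # one mean vector per group
print(pooled.shape)               # (3, 4): the attention now sees 3 "frames"
```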
@tomkocse can correct me if I'm wrong, but I don't think this PR is meant to be merged in its current state. It's related to some experiments we're doing.
I think there's more work to be done on this, both to establish its value as a layer and to improve the code (again, @tomkocse, correct me if I'm wrong).
@david-ryan-snyder Yes, I agree with what you said; let's wait for more results from further experiments.
Also, I think this is closely related to the weighted pooling that we experimented with in the past. It appears to perform similarly. We should revisit that before committing to this implementation, as the other weighted pooling was simpler to implement. Anyway, @tomkocse is looking into this stuff more; let's see where this goes first.
@@ -59,6 +59,8 @@ ComponentPrecomputedIndexes* ComponentPrecomputedIndexes::NewComponentPrecompute
    ans = new StatisticsExtractionComponentPrecomputedIndexes();
  } else if (cpi_type == "StatisticsPoolingComponentPrecomputedIndexes") {
    ans = new StatisticsPoolingComponentPrecomputedIndexes();
  } else if (cpi_type == "SelfAttentionComponentPrecomputedIndexes") {
@tomkocse, this seems to be a bug fix, can you please make a separate PR for it?
@danpovey I don't see the bug here in the original code; can you point it out?
oh sorry, I didn't realize that SelfAttentionComponent was new.
a linear combination of a series of input frames

# In SelfAttentionComponent, the first n columns of the input matrix are interpreted
# as the weight vectors in multi-head attentions. If only a single-head attention is used,
Mention that n is the number of heads in the attention layer.
# In SelfAttentionComponent, the first n columns of the input matrix are interpreted
# as the weight vectors in multi-head attentions. If only a single-head attention is used,
# then the first column of the input matrix is the weight vector. The n + 1 th column of
# the input matrix is the count from the extraction component, it will not be used in
Does it mean that you always expect the previous layer to be a StatisticsExtraction component?
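To make the column layout concrete, here is a rough numpy sketch of my own reading of the comment above (not the actual C++ implementation; that the weights go through a softmax over frames is an assumption):

```python
import numpy as np

def self_attention_pool(inp, num_heads):
    # Columns 0..num_heads-1: per-frame attention scores, one column per head.
    # Column num_heads: the count from the extraction component (skipped here).
    # Remaining columns: the feature vectors to be pooled.
    scores = inp[:, :num_heads]
    feats = inp[:, num_heads + 1:]
    outputs = []
    for h in range(num_heads):
        w = np.exp(scores[:, h] - scores[:, h].max())
        w /= w.sum()                    # softmax over frames (assumed)
        outputs.append(w @ feats)       # weighted average of the frames
    return np.concatenate(outputs)

pooled = self_attention_pool(np.random.randn(50, 1 + 1 + 128), num_heads=1)
print(pooled.shape)  # (128,)
```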
# The first affine node.
configs.append(
    'component name={0}.first_affine type=NaturalGradientAffineComponent'
    ' input-dim={1} output-dim={2} {3}'
I am not sure why you need a separate part for the affine+nonlin+batchnorm within this component, when you could add that part with a relu-batchnorm-layer at the script level, as in the example below.
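For example (a hypothetical xconfig fragment; the layer names are made up), the pre-pooling stack could instead be written at the script level as:

```
relu-batchnorm-layer name=attn_pre dim=512 input=tdnn5
self-layer name=attention config=mean+stddev(-99:3:9:99) input=attn_pre
```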
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.
This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.