[not-for-merge] Replacing the statistic pooling layer with self-attention layer in sr… #2223

Open
tomkocse wants to merge 1 commit into master
Conversation

tomkocse
Contributor

…e16 sv task


class XconfigSelfLayer(XconfigLayerBase):
"""This class is for parsing lines like
self-layer name=attention config=mean+stddev(-99:3:9:99) input=tdnn1
Contributor

Better to use a different name like weighted-stats-layer instead of a generic one like self-layer.
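As a side note for readers, the config string in the docstring above (config=mean+stddev(-99:3:9:99)) appears to follow the same convention as the existing stats-layer, i.e. mean+stddev(left-context:input-period:stats-period:right-context). A minimal parsing sketch under that assumption (the regex and names below are illustrative, not the PR's actual code):

import re

def parse_stats_config(config_str):
    # Parse e.g. "mean+stddev(-99:3:9:99)" assuming the stats-layer
    # convention: left-context, input-period, stats-period, right-context.
    m = re.match(r'(mean(\+stddev)?)\((-?\d+):(-?\d+):(-?\d+):(-?\d+)\)$',
                 config_str)
    if m is None:
        raise RuntimeError("Invalid config string: {0}".format(config_str))
    output_stddev = m.group(2) is not None
    left_context, input_period, stats_period, right_context = \
        [int(m.group(i)) for i in range(3, 7)]
    return (output_stddev, left_context, input_period,
            stats_period, right_context)

# parse_stats_config('mean+stddev(-99:3:9:99)') -> (True, -99, 3, 9, 99)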

'dim': -1,
'config': '',
'affine-dim': 300,
'num-heads': 0}
Contributor

Can num-heads be 0? Is it checked anywhere that it should not be zero?
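A minimal sketch of how such a check could look in the layer's config validation, following the usual xconfig check_configs pattern (this is illustrative, not the PR's code):

def check_configs(self):
    # Hypothetical validation: reject a non-positive number of heads so
    # that num-heads=0 cannot slip through with the current default.
    if self.config['num-heads'] < 1:
        raise RuntimeError("num-heads must be >= 1, got {0}".format(
            self.config['num-heads']))
    if self.config['affine-dim'] <= 0:
        raise RuntimeError("affine-dim must be positive, got {0}".format(
            self.config['affine-dim']))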

# then transpose the weights back to column vector.

configs and their defaults: input-dim=-1, input-period=1, left-context=-1, right-context=-1,
num-heads=1, num-log-count-features=0, output-stddevs=true
Contributor

num-heads=0 seems to be the default.

# The second affine node.
affine_options = 'param-stddev=0.04472135955 bias-stddev=1.0 bias-mean=0.0 max-change=0.75 l2-regularize=0.0'
configs.append(
'component name={0}.second_affine type=NaturalGradientAffineComponent'
Contributor

Is this needed if num-heads=0?
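One way to address this would be to emit the second affine component only when multi-head attention is actually in use; a rough sketch (the guard and the dimension variables here are hypothetical):

# Hypothetical: only add the second affine component when more than one
# head is requested; for a single head it adds parameters without serving
# the head-combination purpose it was introduced for.
if self.config['num-heads'] > 1:
    configs.append(
        'component name={0}.second_affine type=NaturalGradientAffineComponent'
        ' input-dim={1} output-dim={2} {3}'
        ''.format(self.name, second_affine_input_dim, output_dim,
                  affine_options))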

@@ -1 +0,0 @@
tuning/run_xvector_1a.sh
Contributor

Is dropout used?

''.format(self.name))

# The second affine node.
affine_options = 'param-stddev=0.04472135955 bias-stddev=1.0 bias-mean=0.0 max-change=0.75 l2-regularize=0.0'
Contributor

Explain what the stddev value is.
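For what it's worth, 0.04472135955 equals 1/sqrt(500), so this looks like a 1/sqrt(input-dim) style initialization for an input dimension of 500. A sketch that derives the value instead of hard-coding it (the dimension variable is hypothetical):

import math

# Hypothetical: compute param-stddev from the affine component's input
# dimension rather than hard-coding 0.04472135955 (= 1/sqrt(500)).
param_stddev = 1.0 / math.sqrt(second_affine_input_dim)
affine_options = ('param-stddev={0} bias-stddev=1.0 bias-mean=0.0 '
                  'max-change=0.75 l2-regularize=0.0'.format(param_stddev))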

@@ -1 +0,0 @@
tuning/run_xvector_1a.sh
\ No newline at end of file
Contributor

sd -> stddev

name=self.name, lc=self._left_context, rc=self._right_context,
dim=input_dim, input_period=self._input_period,
output_period=self._stats_period,
var='true' if self._output_stddev else 'false'))
Contributor

@david-ryan-snyder, you should probably review this too.
This uses StatisticsExtractionComponent but without the StatisticsPoolingComponent, which is odd; in effect, it pools short groups of frames before the self-attention.
Also, this layer seems to be a bit against the spirit of a layer: it has a couple of Affine+Relu+Batchnorm blocks before the pooling and attention, so it should probably be split into two layers.
@vimalmanohar, if you have specific ideas to improve this, please comment too.
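For context, the format arguments in the excerpt above match the template used by the existing stats-layer for StatisticsExtractionComponent; the generated line is roughly of this form (a reconstruction for illustration, not copied from the PR):

configs.append(
    'component name={name}-extraction-{lc}-{rc} '
    'type=StatisticsExtractionComponent input-dim={dim} '
    'input-period={input_period} output-period={output_period} '
    'include-variance={var} '.format(
        name=self.name, lc=self._left_context, rc=self._right_context,
        dim=input_dim, input_period=self._input_period,
        output_period=self._stats_period,
        var='true' if self._output_stddev else 'false'))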

Contributor

@tomkocse can correct me if I'm wrong, but I don't think this PR is meant to be merged in its current state. It's related to some experiments we're doing.

I think there's more work to be done on this, both to establish its value as a layer and to improve the code (again, @tomkocse, correct me if I'm wrong).

Contributor Author

@david-ryan-snyder Yes, I agree with what you said; let's wait for more results from further experiments.

danpovey changed the title from "Replacing the statistic pooling layer with self-attention layer in sr…" to "[not-for-merge] Replacing the statistic pooling layer with self-attention layer in sr…" on Feb 20, 2018
david-ryan-snyder (Contributor) commented Feb 20, 2018

Also, I think this is closely related to the weighted pooling that we experimented with in the past. It appears to perform similarly. We should revisit that before committing to this implementation, as the other weighted pooling was simpler to implement. Anyway, @tomkocse is looking into this stuff more, let's see where this goes first.

@@ -59,6 +59,8 @@ ComponentPrecomputedIndexes* ComponentPrecomputedIndexes::NewComponentPrecompute
ans = new StatisticsExtractionComponentPrecomputedIndexes();
} else if (cpi_type == "StatisticsPoolingComponentPrecomputedIndexes") {
ans = new StatisticsPoolingComponentPrecomputedIndexes();
} else if (cpi_type == "SelfAttentionComponentPrecomputedIndexes") {
Contributor

@tomkocse, this seems to be a bug fix, can you please make a separate PR for it?

Contributor Author

@danpovey I don't see the bug here in the original code; can you point it out?

Contributor

Oh sorry, I didn't realize that SelfAttentionComponent was new.

a linear combination of a series of input frames

# In SelfAttentionComponent, the first n columns of the input matrix are interpreted
# as the weight vectors in multi-head attentions. If only a single-head attention is used,
Contributor

Mention that n is the number of heads in the attention layer.

# In SelfAttentionComponent, the first n columns of the input matrix are interpreted
# as the weight vectors in multi-head attentions. If only a single-head attention is used,
# then the first column of the input matrix is the weight vector. The n + 1 th column of
# the input matrix is the count from the extraction component, it will not be used in
Contributor

Does this mean that you always expect the previous layer to be a StatisticsExtractionComponent?
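For readers following along, here is a minimal numeric sketch of the pooling described in that code comment, assuming the component's input has the first n columns as per-frame attention scores (one per head), the next column as the count, and the remaining columns as the features to pool. The softmax normalization is a common choice and may differ from what the C++ component actually does:

import numpy as np

def self_attention_pool(inputs, num_heads):
    # inputs: (num_frames, num_heads + 1 + feat_dim); the count column at
    # index num_heads is ignored here, mirroring the code comment above.
    scores = inputs[:, :num_heads]        # (T, H) raw attention scores
    feats = inputs[:, num_heads + 1:]     # (T, D) features to pool
    # Normalize each head's scores over time (softmax; an assumption).
    weights = np.exp(scores - scores.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)
    # Each head yields a weighted mean over frames; heads are concatenated,
    # giving an H * D dimensional pooled vector.
    return np.concatenate([feats.T.dot(weights[:, h])
                           for h in range(num_heads)])

# Example: 50 frames, 2 heads, 10-dim features -> 20-dim pooled vector.
# pooled = self_attention_pool(np.random.randn(50, 2 + 1 + 10), num_heads=2)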

# The first affine node.
configs.append(
'component name={0}.first_affine type=NaturalGradientAffineComponent'
' input-dim={1} output-dim={2} {3}'
Contributor

I am not sure why you need to add a separate affine+nonlin+batchnorm part within this component, when you can add it using relu-batchnorm-layer at the script level.
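In other words, at the network-config level the pre-pooling layers could be written out explicitly, keeping the proposed layer focused on the attention pooling itself; an illustrative sketch (the layer names are made up, and weighted-stats-layer is the name suggested earlier in this thread):

relu-batchnorm-layer name=tdnn6 dim=512 input=tdnn5
relu-batchnorm-layer name=tdnn7 dim=512
weighted-stats-layer name=attention config=mean+stddev(-99:3:9:99) input=tdnn7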


stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 19, 2020

stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

stale bot closed this Jul 19, 2020
kkm000 reopened this Jul 19, 2020
stale bot removed the stale label Jul 19, 2020

stale bot commented Sep 17, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label Sep 17, 2020
Labels: stale
6 participants