Can we improve accuracy for equations and LaTeX formatting? #190

lmmx · 2022-09-29T13:33:08Z

lmmx
Sep 29, 2022

2 issues raised by Zico Kolter in this Twitter thread:

How to provide particular terminology, like equation terms?
How to get the model to use specific formatting like LaTeX underscores?

Here's the result given the first few minutes of lecture 6 of Zico's course.

yt-dlp --extract-audio --audio-format mp3 https://www.youtube.com/watch?v=CukpVt-1PA4
whisper --model large --language en --initial_prompt 'Terminology: i‘th activation (or layer) in the network = z_i, the bias term = b_i, associated nonlinear function = sigma_i, weight term = W_i.' Lecture\ 6\ -\ Fully\ connected\ networks\,\ optimization\,\ initialization\ \[CukpVt-1PA4\].mp3

Click to show transcribed output

[00:00.000 --> 00:07.600]  Hi everyone and welcome back. We're going to talk today about three different topics which are a
[00:07.600 --> 00:13.840]  bit distinct, but it's also a good order to sort of have one lead into the other. We're going to
[00:13.840 --> 00:17.840]  talk about fully connected networks now that we've gone through the exercise of automatic
[00:17.840 --> 00:24.080]  differentiation. Then we're going to talk about basically how we actually start making these
[00:24.080 --> 00:29.040]  networks in practice, work in practice. So we're going to talk quickly about optimization of these
[00:29.040 --> 00:35.440]  networks and then initialization. All right so let's jump right in and first start talking about
[00:35.440 --> 00:41.840]  fully connected networks. This is actually going to be in some sense just kind of a rehashing of
[00:41.840 --> 00:52.160]  the same material we covered in lecture three, so two, three lectures ago now. But the basic idea
[00:52.160 --> 00:57.840]  here is that now that we've covered automatic differentiation we can actually be much simpler
[00:57.840 --> 01:03.840]  in our definition of fully connected networks. And we can also be a bit more flexible too. We
[01:03.840 --> 01:09.120]  can add a few more elements to them that are kind of common in practice. So in particular we're going
[01:09.120 --> 01:16.480]  to define a L-layer fully connected network, also called a multi-layer perceptron or MLP,
[01:17.680 --> 01:26.480]  in the following way. We're going to say that Z i, the i plus 1th activation or just layer in the
[01:26.480 --> 01:32.640]  network, is equal to some non-linear function sigma of a linear function of the previous layer
[01:33.280 --> 01:38.640]  plus a bias term b i. And now in this case we're actually going to be explicit about including
[01:38.640 --> 01:43.840]  this bias term b i, which we didn't include before, mainly because when we went through
[01:43.840 --> 01:49.040]  the exercise of manually driving backprop it was just one more term to deal with.
[01:50.320 --> 01:55.120]  But now that we've sort of simplified this using automatic differentiation, or at least how we do
[01:55.120 --> 02:00.080]  these things, we're going to include it because it's pretty standard. We're going to define the
[02:00.080 --> 02:06.240]  output of the network as just the L plus 1th layer. So you know we repeat this iteration at first
[02:06.240 --> 02:10.000]  L times and then the output is just the last value of that iteration.
[02:11.040 --> 02:15.520]  And we also define the first layer just being equal to x, the input of the network.
[02:17.120 --> 02:22.960]  Now the parameters of our network here are w 1 through L and b 1 through L. That makes up our
[02:22.960 --> 02:29.200]  set of all the parameters theta. And sigma i is some non-linearity. So that's the non-linear
[02:29.200 --> 02:33.680]  activation function, like a relu or a sigmoid. We're typically going to use relus for most of
[02:33.680 --> 02:40.320]  the examples in this course, at least for the first bit. And then usually it's also the case
[02:40.320 --> 02:45.920]  though this sigma i can vary depending on i here. And usually it's the case that we choose the last
[02:45.920 --> 02:52.080]  layer to just be a linear, the last non-linear to be a linear function so that our log is essentially
[02:52.080 --> 02:56.640]  our linear function of the last layer. All right. So this is going to be our network.
[02:56.640 --> 03:00.880]  And just to define a few more things here, I want to mention that if we look at the sizes
[03:00.880 --> 03:07.360]  of these things, z i if we say we're not going to dwell too much on this because the reality is
[03:07.360 --> 03:12.000]  for the most case you don't really need to worry about these things because you can just define
[03:12.000 --> 03:16.720]  them once and then that sort of works. But let's give names to all these things here.
[03:16.720 --> 03:22.880]  Let's say that z i would be a vector in n i. So n i would be the size of the ith layer.
[03:23.760 --> 03:34.960]  Well in that case, w i then would be a matrix. Well w i transpose has to map n i to n i plus one.
[03:34.960 --> 03:46.560]  So w i would be n i by n i plus one. And of course b i would also then be vectors in n i plus one.
[03:47.040 --> 03:53.040]  It's important to note here that b i is a vector in n i plus one because we're adding it to sort
[03:53.040 --> 03:56.800]  of the output of our linear transformation and so it has to be the size of the next layer.
[03:58.880 --> 04:03.600]  And the nice thing is that now that we've defined automatic differentiation,
[04:03.600 --> 04:09.600]  we are now done. So this is a full complete definition now of a fully connected network.
[04:10.160 --> 04:17.040]  And unlike before, we don't have to worry about deriving the back prop equations for this of
[04:17.040 --> 04:21.440]  finding gradients of all these things when it's connected to larger as part of a larger network.
[04:22.320 --> 04:28.880]  Because now we can just implement these operations in the forward pass of a network
[04:28.880 --> 04:34.080]  and let our automatic differentiation tool give us a way of automatically computing the gradients.
[04:34.080 --> 04:37.600]  And that's that's an amazingly powerful thing, right? And it sort of shows the power
[04:38.160 --> 04:42.480]  of automatic differentiation is that now we are done with our layer definition.
[04:44.560 --> 04:51.200]  Almost. Not quite. Because there is one little issue that I want to mention here which seems
[04:51.200 --> 04:56.640]  like a minor point but which you are actually end up spending a lot of time on when you implement
[04:56.640 --> 05:01.760]  some of your networks. You probably already started to encounter it, you've encountered
[05:01.760 --> 05:07.520]  homework one already on my differentiation. You will encounter things relating to the size
[05:07.520 --> 05:13.520]  of these various operations, especially when, and vectors and matrices here, especially when
[05:13.520 --> 05:18.400]  you use the matrix form of these expressions. So let's talk about the matrix form of these
[05:18.400 --> 05:23.360]  expressions. The most obvious thing here to do is say, well, you know, we have our matrix forms
[05:23.360 --> 05:31.280]  before. Remember, we have our terms zi here. So zi is going to be like a stacking of all the z's,
[05:31.280 --> 05:38.400]  right? So we'd have like in the first row, it would have z1 transpose, I guess it'd be zi1
[05:38.400 --> 05:48.400]  transpose. And the second row would have z2i transpose, etc. Sort of the list of all our
[05:48.400 --> 05:53.920]  activations for that layer in the rows. And we have this sort of now very simple matrix form
[05:53.920 --> 05:58.800]  that we saw before, right? So we're going to just apply these things in matrix form. And this is
[05:58.800 --> 06:02.960]  now the, because these things are already transposed here, you actually don't include
[06:02.960 --> 06:07.600]  a transpose and you include wi after it, but this is all the same as in our original
[06:07.600 --> 06:14.960]  public-native networks. The difference here though comes in the by its term here. So remember, bi
[06:14.960 --> 06:35.520]  is a vector in ri and i plus 1. And what this means is, if we just, if we didn't write this,
[06:35.520 --> 06:40.000]  so if we just instead, if we sort of crossed this out for a moment and just wrote this in terms of
[06:40.000 --> 06:46.000]  bi, right? So sort of copying the notation from the previous slide. This really wouldn't work,
[06:46.000 --> 06:56.480]  right? Because this matrix here would be in, well, it would be m by ni plus 1. And that's a matrix

We get sigma i, z i, w i and b i. This is accurate enough, but to be fussy the capitalisation and underscore aren't perfect.

I also tried appending !!!! to particular terms as is done when using the Stable Diffusion prompt to weight a particular word more heavily; adding more context; and changing how I phrased the prompt. I didn't manage to get any closer to the desired output by doing so.

At a guess, perhaps the tokenisation is preventing the correct output here with the underscore, and perhaps the capitalisation is not amenable to prompting for whatever reason. Is there anything that can be done currently or could be developed to improve this?

When I tried to increase the temperature to 0.2, suspecting this could increase the prompt efficacy, I got an error from line 497 in verify_options.py (up to date git repo source, 62fe7f1):

    raise ValueError("beam_size and best_of can't be given together")
ValueError: beam_size and best_of can't be given together

I just wanted to record it and share it in passing while seeing what Whisper can do :-) Any thoughts or advice appreciated, and might I add thank you for publishing your work open source!

Answered by jongwook

Sep 29, 2022

I think the biggest reason here is the token suppression, where by default most special characters are explicitly forbidden during sampling:

whisper/whisper/tokenizer.py

Lines 239 to 249 in 62fe7f1

      Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech  
    annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.  
     
    - ♪♪♪  
    - ( SPEAKING FOREIGN LANGUAGE )  
    - [DAVID] Hey there,  
     
    keeping basic punctuations like commas, periods, question marks, exclamation points, etc.  
    """  
   symbols = list("\"#()*+/:;<=>@[\\]^_`{|}~「」『』")  
   symbols += "<< >> <<< >>> -- --- -( -[ (' (\" (( )) …

View full answer

jongwook · 2022-09-29T18:13:49Z

jongwook
Sep 29, 2022
Maintainer

I think the biggest reason here is the token suppression, where by default most special characters are explicitly forbidden during sampling:

whisper/whisper/tokenizer.py

Lines 239 to 249 in 62fe7f1

    
                   Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech 
        
                   annotations, to prevent sampling texts that are not actually spoken in the audio, e.g. 
        
                   - ♪♪♪ 
        
                   - ( SPEAKING FOREIGN LANGUAGE ) 
        
                   - [DAVID] Hey there, 
        
                   keeping basic punctuations like commas, periods, question marks, exclamation points, etc. 
        
                   """ 
        
                   symbols = list("\"#()*+/:;<=>@[\\]^_`{|}~「」『』") 
        
                   symbols += "<< >> <<< >>> -- --- -( -[ (' (\" (( )) ((( ))) [[ ]] {{ }} ♪♪ ♪♪♪".split()

You might see a better result if you relax this by supplying a different --suppress_tokens option or modifying the code above directly.

The --initial_prompt option will make only a limited influence to the output, and the effect wears off since only 224 tokens fits in the context. Logit biasing might be helpful, but a surefire way is to fine-tune the model with some LaTeX-style transcripts.

The ValueError is a bug in my implementation; will fix shortly. But in general, we observed that temperature 0 results in more accurate transcripts, unless it gets stuck in a repetition loop.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can we improve accuracy for equations and LaTeX formatting? #190

{{title}}

Replies: 1 comment

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

	Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech
	annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.

	- ♪♪♪
	- ( SPEAKING FOREIGN LANGUAGE )
	- [DAVID] Hey there,

	keeping basic punctuations like commas, periods, question marks, exclamation points, etc.
	"""
	symbols = list("\"#()*+/:;<=>@[\\]^_`{\|}~「」『』")
	symbols += "<< >> <<< >>> -- --- -( -[ (' (\" (( )) …

Can we improve accuracy for equations and LaTeX formatting? #190

lmmx Sep 29, 2022

Replies: 1 comment

jongwook Sep 29, 2022 Maintainer

lmmx
Sep 29, 2022

jongwook
Sep 29, 2022
Maintainer