This is just to check if my understanding is right. In the blogpost https://jaykmody.com/blog/speculative-sampling complexity analysis (as well as in the code), I think the time consumption should probably be modified to t_{draft} *K + t_{target}.
In your explanation, I think the +1 * t_{draft} consumption from step 2 can be eliminated, since the forward pass p(x | x_1,...., x_n, x^\tilde_1, ..., x^\tilde_K) is not really needed, right? Only the passes up to p(x | x_1,...., x_n, x^\tilde_1, ..., x^\tilde_{K-1}) are needed, as computed in step 1.