__init__.ja.json
{
"<h1>Attention with Linear Biases (ALiBi)</h1>\n<p>This is an implementation of Attention with Linear Biases (ALiBi) from the paper <a href=\"https://papers.labml.ai/paper/2108.12409\">Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation</a>.</p>\n<p>This replaces positional encodings with biases added to attention scores (attention logits, before the softmax). This is a relative scheme tested on autoregressive tasks, and the bias is higher for closeby tokens and lower for far-away tokens. The biases decrease linearly in the log scale (because it's before the softmax) and each head has a different slope.</p>\n<p>Here's the attention formula for <span translate=no>_^_0_^_</span>-th token,</p>\n<span translate=no>_^_1_^_</span><p>where <span translate=no>_^_2_^_</span> is the query of the <span translate=no>_^_3_^_</span>-th token, <span translate=no>_^_4_^_</span> are the keys up to <span translate=no>_^_5_^_</span>, and <span translate=no>_^_6_^_</span> the number of features per head. 
Note that the above equality halts because <span translate=no>_^_7_^_</span> is invariant to translations (you can add any constant to all elements without changing the result).</p>\n<p>Here is <a href=\"experiment.html\">the training code</a> for a ALiBi model.</p>\n": "<h1>\u7dda\u5f62\u30d0\u30a4\u30a2\u30b9\u306b\u3088\u308b\u6ce8\u610f (AliBi)</h1>\n<p>\u3053\u308c\u306f\u3001\u300c<a href=\"https://papers.labml.ai/paper/2108.12409\">\u30c8\u30ec\u30a4\u30f3\u30b7\u30e7\u30fc\u30c8\u3001\u30c6\u30b9\u30c8\u30ed\u30f3\u30b0\u300d\u3068\u3044\u3046\u8ad6\u6587\u306e\u300c\u7dda\u5f62\u30d0\u30a4\u30a2\u30b9\u306b\u3088\u308b\u6ce8\u610f\uff08AliBi\uff09\u300d\u306e\u5b9f\u88c5\u3067\u3059\u3002\u7dda\u5f62\u30d0\u30a4\u30a2\u30b9\u306b\u3088\u308b\u6ce8\u610f\u306b\u3088\u308a\u3001\u5165\u529b\u306e\u9577\u3055\u306e\u63a8\u5b9a\u304c\u53ef\u80fd\u306b\u306a\u308a\u307e\u3059</a>\u3002</p>\n<p>\u3053\u308c\u306b\u3088\u308a\u3001\u4f4d\u7f6e\u30a8\u30f3\u30b3\u30fc\u30c7\u30a3\u30f3\u30b0\u304c\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30b9\u30b3\u30a2\uff08\u30bd\u30d5\u30c8\u30de\u30c3\u30af\u30b9\u306e\u524d\u306e\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30ed\u30b8\u30c3\u30c8\uff09\u306b\u30d0\u30a4\u30a2\u30b9\u304c\u52a0\u308f\u3063\u305f\u3082\u306e\u306b\u7f6e\u304d\u63db\u308f\u308a\u307e\u3059\u3002\u3053\u308c\u306f\u81ea\u5df1\u56de\u5e30\u30bf\u30b9\u30af\u3067\u30c6\u30b9\u30c8\u3055\u308c\u305f\u76f8\u5bfe\u7684\u306a\u30b9\u30ad\u30fc\u30e0\u3067\u3001\u8fd1\u304f\u306b\u3042\u308b\u30c8\u30fc\u30af\u30f3\u306e\u65b9\u304c\u30d0\u30a4\u30a2\u30b9\u304c\u5927\u304d\u304f\u3001\u9060\u3044\u30c8\u30fc\u30af\u30f3\u306e\u65b9\u304c\u30d0\u30a4\u30a2\u30b9\u304c\u4f4e\u304f\u306a\u308a\u307e\u3059\u3002\u5bfe\u6570\u30b9\u30b1\u30fc\u30eb\u3067\u306f\uff08\u30bd\u30d5\u30c8\u30de\u30c3\u30af\u30b9\u306e\u524d\u306a\u306e\u3067\uff09\u30d0\u30a4\u30a2\u30b9\u306f\u76f4\u7dda\u7684\u306b\u6e1b\u5c11\u3057\u3001\u5404\u30d8\u30c3\u30c9\u306e\u50be\u304d\u306f\u7
570\u306a\u308a\u307e\u3059</p>\u3002\n<p><span translate=no>_^_0_^_</span>-th \u30c8\u30fc\u30af\u30f3\u306e\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30d5\u30a9\u30fc\u30df\u30e5\u30e9\u306f\u6b21\u306e\u3068\u304a\u308a\u3067\u3059\u3002</p>\n<span translate=no>_^_1_^_</span><p>\u3053\u3053\u3067\u3001<span translate=no>_^_2_^_</span>\u306f <span translate=no>_^_3_^_</span>-th \u30c8\u30fc\u30af\u30f3\u306e\u30af\u30a8\u30ea\u3001<span translate=no>_^_4_^_</span>\u307e\u3067\u306e\u30ad\u30fc<span translate=no>_^_5_^_</span>\u3001<span translate=no>_^_6_^_</span>\u304a\u3088\u3073\u30d8\u30c3\u30c9\u3042\u305f\u308a\u306e\u30d5\u30a3\u30fc\u30c1\u30e3\u6570\u3067\u3059\u3002<span translate=no>_^_7_^_</span>\u4e0a\u8a18\u306e\u7b49\u5f0f\u306f\u5909\u63db\u306b\u4e0d\u5909\u3067\u3042\u308b\u305f\u3081\u4e2d\u6b62\u3055\u308c\u308b\u3053\u3068\u306b\u6ce8\u610f\u3057\u3066\u304f\u3060\u3055\u3044 (\u7d50\u679c\u3092\u5909\u66f4\u305b\u305a\u306b\u3059\u3079\u3066\u306e\u8981\u7d20\u306b\u4efb\u610f\u306e\u5b9a\u6570\u3092\u8ffd\u52a0\u3067\u304d\u307e\u3059</p>)\u3002\n<p>AliBi <a href=\"experiment.html\">\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u30b3\u30fc\u30c9\u306f\u6b21\u306e\u3068\u304a\u308a\u3067\u3059</a>\u3002</p>\n",
"<h2>Attention with Linear Biases (ALiBi)</h2>\n<p>We override <a href=\"../mha.html\">Multi-Head Attention</a>.</p>\n": "<h2>\u7dda\u5f62\u30d0\u30a4\u30a2\u30b9\u306b\u3088\u308b\u6ce8\u610f (AliBi)</h2>\n<p><a href=\"../mha.html\">\u30de\u30eb\u30c1\u30d8\u30c3\u30c9\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3092\u7121\u52b9\u306b\u3057\u307e\u3059</a>\u3002</p>\n",
"<h2>Calculate the attention biases matrix</h2>\n<ul><li><span translate=no>_^_0_^_</span> is the number of heads in the attention layer </li>\n<li><span translate=no>_^_1_^_</span> is the attention mask of shape <span translate=no>_^_2_^_</span></li></ul>\n<p>This returns a matrix of shape <span translate=no>_^_3_^_</span> with ALiBi attention biases.</p>\n": "<h2>\u6ce8\u610f\u30d0\u30a4\u30a2\u30b9\u30de\u30c8\u30ea\u30c3\u30af\u30b9\u306e\u8a08\u7b97</h2>\n<ul><li><span translate=no>_^_0_^_</span>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30ec\u30a4\u30e4\u30fc\u306e\u30d8\u30c3\u30c9\u6570\u3067\u3059</li>\n<li><span translate=no>_^_1_^_</span>\u30b7\u30a7\u30a4\u30d7\u306e\u6ce8\u610f\u30de\u30b9\u30af\u3067\u3059 <span translate=no>_^_2_^_</span></li></ul>\n<p>\u3053\u308c\u306b\u3088\u308a\u3001AliBi <span translate=no>_^_3_^_</span> \u306e\u6ce8\u610f\u30d0\u30a4\u30a2\u30b9\u304c\u5165\u3063\u305f\u5f62\u72b6\u306e\u30de\u30c8\u30ea\u30c3\u30af\u30b9\u304c\u8fd4\u3055\u308c\u307e\u3059\u3002</p>\n",
"<h2>Get head-specific slope <span translate=no>_^_0_^_</span> for each head</h2>\n<ul><li><span translate=no>_^_1_^_</span> is the number of heads in the attention layer <span translate=no>_^_2_^_</span></li></ul>\n<p>The slope for first head is</p>\n<p><span translate=no>_^_3_^_</span></p>\n<p>The slopes for the rest of the heads are in a geometric series with a ratio same as above.</p>\n<p>For instance when the number of heads is <span translate=no>_^_4_^_</span> the slopes are <span translate=no>_^_5_^_</span></p>\n": "<h2><span translate=no>_^_0_^_</span>\u5404\u982d\u90e8\u306e\u982d\u90e8\u56fa\u6709\u306e\u52fe\u914d\u3092\u53d6\u5f97</h2>\n<ul><li><span translate=no>_^_1_^_</span>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30ec\u30a4\u30e4\u30fc\u306e\u30d8\u30c3\u30c9\u6570\u3067\u3059 <span translate=no>_^_2_^_</span></li></ul>\n<p>1 \u756a\u76ee\u306e\u30d8\u30c3\u30c9\u306e\u52fe\u914d\u306f</p>\n<p><span translate=no>_^_3_^_</span></p>\n<p>\u6b8b\u308a\u306e\u30d8\u30c3\u30c9\u306e\u52fe\u914d\u306f\u5e7e\u4f55\u5b66\u7684\u306b\u9023\u7d9a\u3057\u3066\u304a\u308a\u3001\u305d\u306e\u6bd4\u7387\u306f\u4e0a\u8a18\u3068\u540c\u3058\u3067\u3059\u3002</p>\n<p>\u305f\u3068\u3048\u3070\u3001\u30d8\u30c3\u30c9\u306e\u6570\u304c\u306e\u5834\u5408\u3001<span translate=no>_^_4_^_</span>\u30b9\u30ed\u30fc\u30d7\u306f <span translate=no>_^_5_^_</span></p>\n",
"<p> </p>\n": "<p></p>\n",
"<p> <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> are the tensors that store collection of <em>query</em>, <em>key</em> and <em>value</em> vectors. They have shape <span translate=no>_^_3_^_</span>.</p>\n<p><span translate=no>_^_4_^_</span> has shape <span translate=no>_^_5_^_</span> and <span translate=no>_^_6_^_</span> indicates whether for batch <span translate=no>_^_7_^_</span>, query at position <span translate=no>_^_8_^_</span> has access to key-value at position <span translate=no>_^_9_^_</span>.</p>\n": "<p><span translate=no>_^_0_^_</span>\u3001<span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span>\u304a\u3088\u3073\u306f\u3001<em>\u30af\u30a8\u30ea</em>\u3001<em>\u30ad\u30fc</em>\u3001<em>\u304a\u3088\u3073\u5024\u306e\u30d9\u30af\u30c8\u30eb\u306e\u30b3\u30ec\u30af\u30b7\u30e7\u30f3\u3092\u683c\u7d0d\u3059\u308b\u30c6\u30f3\u30bd\u30eb\u3067\u3059</em>\u3002\u5f62\u304c\u3042\u308a\u307e\u3059<span translate=no>_^_3_^_</span>\u3002</p>\n<p><span translate=no>_^_4_^_</span><span translate=no>_^_5_^_</span>\u5f62\u72b6\u304c\u3042\u308a\u3001\u30d0\u30c3\u30c1\u306e\u5834\u5408<span translate=no>_^_7_^_</span>\u3001<span translate=no>_^_6_^_</span><span translate=no>_^_8_^_</span>\u305d\u306e\u4f4d\u7f6e\u306e\u30af\u30a8\u30ea\u304c\u305d\u306e\u4f4d\u7f6e\u306e\u30ad\u30fc\u5024\u306b\u30a2\u30af\u30bb\u30b9\u3067\u304d\u308b\u304b\u3069\u3046\u304b\u3092\u793a\u3057\u307e\u3059\u3002<span translate=no>_^_9_^_</span></p>\n",
"<p> Simple test function to see the slopes.</p>\n": "<p>\u30b9\u30ed\u30fc\u30d7\u3092\u78ba\u8a8d\u3067\u304d\u308b\u7c21\u5358\u306a\u30c6\u30b9\u30c8\u6a5f\u80fd\u3002</p>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p><span translate=no>_^_0_^_</span> Note that we take steps by <span translate=no>_^_1_^_</span> to avoid slopes added previously. </p>\n": "<p><span translate=no>_^_0_^_</span>\u306a\u304a\u3001<span translate=no>_^_1_^_</span>\u4ee5\u524d\u306b\u30b9\u30ed\u30fc\u30d7\u304c\u8ffd\u52a0\u3055\u308c\u306a\u3044\u3088\u3046\u306b\u5bfe\u7b56\u3092\u8b1b\u3058\u3066\u3044\u307e\u3059\u3002</p>\n",
"<p><span translate=no>_^_0_^_</span> attention along the key sequence dimension <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u30ad\u30fc\u30b7\u30fc\u30b1\u30f3\u30b9\u6b21\u5143\u306b\u6cbf\u3063\u3066\u6ce8\u76ee <span translate=no>_^_1_^_</span></p>\n",
"<p><span translate=no>_^_0_^_</span> has shape <a href=\"seq_len, seq_len, 1, 1\">seq_len, seq_len, 1, 1</a> </p>\n": "<p><span translate=no>_^_0_^_</span><a href=\"seq_len, seq_len, 1, 1\">\u56f3\u5f62\u306f\u9023\u756a\u3001\u9023\u756a\u30011\u3001</a> 1\u3067\u3059</p>\n",
"<p><span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> have shape <span translate=no>_^_3_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u3001<span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span>\u305d\u3057\u3066\u5f62\u304c\u3042\u308b <span translate=no>_^_3_^_</span></p>\n",
"<p>ALiBi only works with causal masks. </p>\n": "<p>AliBi \u306f\u56e0\u679c\u30de\u30b9\u30af\u3067\u306e\u307f\u6a5f\u80fd\u3057\u307e\u3059\u3002</p>\n",
"<p>Add AliBi biases to attention scores. ALiBi biases has shape <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> has shape <span translate=no>_^_2_^_</span> </p>\n": "<p>AliBi \u30d0\u30a4\u30a2\u30b9\u3092\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30b9\u30b3\u30a2\u306b\u8ffd\u52a0\u3057\u307e\u3059\u3002AliBi <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> \u30d0\u30a4\u30a2\u30b9\u306b\u306f\u5f62\u3068\u5f62\u304c\u3042\u308b <span translate=no>_^_2_^_</span></p>\n",
"<p>Add head dimension to mask and check its shape. </p>\n": "<p>\u30de\u30b9\u30af\u306b\u982d\u90e8\u306e\u5bf8\u6cd5\u3092\u8ffd\u52a0\u3057\u3001\u5f62\u72b6\u3092\u78ba\u8a8d\u3057\u307e\u3059\u3002</p>\n",
"<p>Apply dropout </p>\n": "<p>\u30c9\u30ed\u30c3\u30d7\u30a2\u30a6\u30c8\u3092\u9069\u7528</p>\n",
"<p>Apply mask </p>\n": "<p>\u30de\u30b9\u30af\u3092\u9069\u7528</p>\n",
"<p>Calculate distances <span translate=no>_^_0_^_</span> Here we calculate the distances using the mask.</p>\n<p>Since it's causal mask we can just use <span translate=no>_^_1_^_</span> too. <span translate=no>_^_2_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u8ddd\u96e2\u306e\u8a08\u7b97\u3053\u3053\u3067\u306f\u30de\u30b9\u30af\u3092\u4f7f\u3063\u3066\u8ddd\u96e2\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002</p>\n<p><span translate=no>_^_1_^_</span>\u30ab\u30b8\u30e5\u30a2\u30eb\u30de\u30b9\u30af\u306a\u306e\u3067\u305d\u306e\u307e\u307e\u4f7f\u3048\u307e\u3059\u3002<span translate=no>_^_2_^_</span></p>\n",
"<p>Compute attention scores <span translate=no>_^_0_^_</span>. This gives a tensor of shape <span translate=no>_^_1_^_</span>. </p>\n": "<p><span translate=no>_^_0_^_</span>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30b9\u30b3\u30a2\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span>\u3053\u308c\u306b\u3088\u308a\u5f62\u72b6\u306e\u30c6\u30f3\u30bd\u30eb\u304c\u5f97\u3089\u308c\u307e\u3059</p>\u3002\n",
"<p>Concatenate multiple heads </p>\n": "<p>\u8907\u6570\u306e\u30d8\u30c3\u30c9\u3092\u9023\u7d50</p>\n",
"<p>Concatenate the slopes with the remaining slopes. </p>\n": "<p>\u30b9\u30ed\u30fc\u30d7\u3092\u6b8b\u308a\u306e\u30b9\u30ed\u30fc\u30d7\u3068\u9023\u7d50\u3057\u307e\u3059\u3002</p>\n",
"<p>Create AliBi biases if it's not cached </p>\n": "<p>\u30ad\u30e3\u30c3\u30b7\u30e5\u3055\u308c\u3066\u3044\u306a\u3044\u5834\u5408\u306fAliBi\u30d0\u30a4\u30a2\u30b9\u3092\u4f5c\u6210\u3059\u308b</p>\n",
"<p>Get slopes <span translate=no>_^_0_^_</span> for each head </p>\n": "<p><span translate=no>_^_0_^_</span>\u5404\u30d8\u30c3\u30c9\u306e\u30b9\u30ed\u30fc\u30d7\u3092\u53d6\u5f97</p>\n",
"<p>Get the closest power of 2 to <span translate=no>_^_0_^_</span>. If <span translate=no>_^_1_^_</span> is not a power of 2, then we first calculate slopes to the closest (smaller) power of 2, and then add the remaining slopes. </p>\n": "<p>2 <span translate=no>_^_0_^_</span> \u306e\u7d2f\u4e57\u306b\u6700\u3082\u8fd1\u3044\u3082\u306e\u3092\u6c42\u3081\u307e\u3059\u3002\u304c 2 <span translate=no>_^_1_^_</span> \u306e\u7d2f\u4e57\u3067\u306a\u3044\u5834\u5408\u306f\u3001\u307e\u305a 2 \u306b\u6700\u3082\u8fd1\u3044 (\u5c0f\u3055\u306a) \u7d2f\u4e57\u307e\u3067\u306e\u52fe\u914d\u3092\u8a08\u7b97\u3057\u3001\u6b21\u306b\u6b8b\u308a\u306e\u52fe\u914d\u3092\u52a0\u7b97\u3057\u307e\u3059</p>\u3002\n",
"<p>If <span translate=no>_^_0_^_</span> is not a power of 2, then we add the remaining slopes. We calculate the remaining slopes for <span translate=no>_^_1_^_</span> (avoiding slopes added previously). And pick the slopes upto <span translate=no>_^_2_^_</span>. </p>\n": "<p><span translate=no>_^_0_^_</span>\u304c 2 \u306e\u7d2f\u4e57\u3067\u306a\u3044\u5834\u5408\u306f\u3001\u6b8b\u308a\u306e\u52fe\u914d\u3092\u52a0\u7b97\u3057\u307e\u3059\u3002\u6b8b\u308a\u306e\u52fe\u914d\u3092\u8a08\u7b97\u3057\u307e\u3059 <span translate=no>_^_1_^_</span> (\u4ee5\u524d\u306b\u8ffd\u52a0\u3055\u308c\u305f\u52fe\u914d\u306f\u9664\u304d\u307e\u3059)\u3002\u305d\u3057\u3066\u3001<span translate=no>_^_2_^_</span>\u4e0a\u306e\u659c\u9762\u3092\u9078\u3093\u3067\u304f\u3060\u3055\u3044</p>.\n",
"<p>Multiply by values <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5024\u306b\u3088\u308b\u4e57\u7b97 <span translate=no>_^_0_^_</span></p>\n",
"<p>Multiply them pair-wise to get the AliBi bias matrix </p>\n": "<p>\u305d\u308c\u3089\u3092\u30da\u30a2\u3054\u3068\u306b\u4e57\u7b97\u3057\u3066\u3001AliBi \u30d0\u30a4\u30a2\u30b9\u30de\u30c8\u30ea\u30c3\u30af\u30b9\u3092\u6c42\u3081\u307e\u3059\u3002</p>\n",
"<p>Output layer </p>\n": "<p>\u51fa\u529b\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>Prepare <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> for attention computation. These will then have shape <span translate=no>_^_3_^_</span>. </p>\n": "<p><span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span>\u6ce8\u610f\u529b\u8a08\u7b97\u306e\u6e96\u5099\u3092\u3057\u3066<span translate=no>_^_3_^_</span>\u3053\u308c\u3067\u5f62\u304c\u3067\u304d\u3042\u304c\u308a\u307e\u3059\u3002</p>\n",
"<p>Scale scores <span translate=no>_^_0_^_</span> </p>\n": "<p>\u30b9\u30b1\u30fc\u30eb\u30b9\u30b3\u30a2 <span translate=no>_^_0_^_</span></p>\n",
"<p>To cache AliBi the biases </p>\n": "<p>AliBi \u306b\u30d0\u30a4\u30a2\u30b9\u3092\u30ad\u30e3\u30c3\u30b7\u30e5\u3059\u308b\u306b\u306f</p>\n",
"Attention with Linear Biases (ALiBi)": "\u7dda\u5f62\u30d0\u30a4\u30a2\u30b9\u306b\u3088\u308b\u6ce8\u610f (AliBi)",
"Documented implementation with explanations of Attention with Linear Biases (ALiBi)": "\u7dda\u5f62\u30d0\u30a4\u30a2\u30b9\u306b\u3088\u308b\u6ce8\u610f\uff08AliBi\uff09\u306e\u8aac\u660e\u3092\u542b\u3080\u6587\u66f8\u5316\u3055\u308c\u305f\u5b9f\u88c5"
}