From 570d4a0af82d547264a2bc46f6f0abeba59f3d66 Mon Sep 17 00:00:00 2001 From: Scott Main Date: Wed, 19 Apr 2023 16:47:35 -0700 Subject: [PATCH] [Docs][Mojo] Move some of the Mojo notebooks into new "public" dir. (#12881) This is to hopefully avoid committing notebooks that are meant for internal eyes only. The Mojo Playground can simply fetch all notebooks from this dir, and so can the docs website to render them as static web pages (for people without Playground access to read). modular-orig-commit: e00f2cb1c44c408eeb54918c1a722df0f9b7fa7d --- examples/BoolMLIR.ipynb | 485 +++++++++++++++ examples/HelloMojo.ipynb | 45 ++ examples/Matmul.ipynb | 1240 ++++++++++++++++++++++++++++++++++++++ examples/Memset.ipynb | 674 +++++++++++++++++++++ examples/index.md | 3 + 5 files changed, 2447 insertions(+) create mode 100644 examples/BoolMLIR.ipynb create mode 100644 examples/HelloMojo.ipynb create mode 100644 examples/Matmul.ipynb create mode 100644 examples/Memset.ipynb create mode 100644 examples/index.md diff --git a/examples/BoolMLIR.ipynb b/examples/BoolMLIR.ipynb new file mode 100644 index 000000000..e73ef4544 --- /dev/null +++ b/examples/BoolMLIR.ipynb @@ -0,0 +1,485 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "title: \"Low-Level IR in Mojo\"\n", + "subtitle: \"A Boolean Case Study\"\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "Mojo is a high-level programming language with an extensive set of modern features. But Mojo also provides you, the programmer, access to all of the low-level primitives that you need to write powerful -- yet zero-cost -- abstractions.\n", + "\n", + "These primitives are implemented in [MLIR](https://mlir.llvm.org), an extensible intermediate representation (IR) format for compiler design. 
Many different programming languages and compilers translate their source programs into MLIR, and because Mojo provides direct access to MLIR features, this means Mojo programs can enjoy the benefits of each of these tools.\n", + "\n", + "Going one step further, Mojo's unique combination of zero-cost abstractions with MLIR interoperability means that Mojo programs can take full advantage of *anything* that interfaces with MLIR. While this isn't something normal Mojo programmers may ever need to do, it's an extremely powerful capability when extending a system to interface with a new datatype, or an esoteric new accelerator feature.\n", + "\n", + "To illustrate these ideas, we'll implement a boolean type in Mojo below, which we'll call `OurBool`. We'll make extensive use of MLIR, so let's begin with a short primer.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## What is MLIR?\n", + "\n", + "MLIR is an intermediate representation of a program, not unlike an assembly language, in which a sequential set of instructions operate on in-memory values.\n", + "\n", + "More importantly, MLIR is modular and extensible. MLIR is composed of an ever-growing number of \"dialects.\" Each dialect defines operations and optimizations: for example, the ['math' dialect](https://mlir.llvm.org/docs/Dialects/MathOps/) provides mathematical operations such as sine and cosine, the ['amdgpu' dialect](https://mlir.llvm.org/docs/Dialects/AMDGPU/) provides operations specific to AMD processors, and so on.\n", + "\n", + "Each of MLIR's dialects can interoperate with the others. This is why MLIR is said to unlock heterogeneous compute: as newer, faster processors and architectures are developed, new MLIR dialects are implemented to generate optimal code for those environments. 
Any new MLIR dialect can be translated seamlessly into other dialects, so as more get added, all existing MLIR becomes more powerful.\n", + "\n", + "This means that our own custom types, such as the `OurBool` type we'll create below, can be used to provide programmers with a high-level, Python-like interface. But \"under the covers,\" Mojo and MLIR will optimize our convenient, high-level types for each new processor that appears in the future.\n", + "\n", + "There's much more to write about why MLIR is such a revolutionary technology, but let's get back to Mojo and defining the `OurBool` type. There will be opportunities to learn more about MLIR along the way." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "#|echo: false\n", + "\n", + "from IO import print\n", + "\n", + "\n", + "alias OurTrue = OurBool(__mlir_attr.`true`)\n", + "alias OurFalse: OurBool = __mlir_attr.`false`\n", + "\n", + "\n", + "@register_passable(\"trivial\")\n", + "struct OurBool:\n", + " var value: __mlir_type.i1\n", + " \n", + " fn __init__() -> Self:\n", + " return OurFalse\n", + "\n", + " fn __init__(value: __mlir_type.i1) -> Self:\n", + " return Self {value: value}\n", + " \n", + " fn __bool__(self) -> Bool:\n", + " return Bool(self.value)\n", + "\n", + " fn __mlir_i1__(self) -> __mlir_type.i1:\n", + " return self.value\n", + "\n", + " fn __eq__(self, rhs: OurBool) -> Self:\n", + " let lhsIndex = __mlir_op.`index.casts`[_type : __mlir_type.index](\n", + " self.value\n", + " )\n", + " let rhsIndex = __mlir_op.`index.casts`[_type : __mlir_type.index](\n", + " rhs.value\n", + " )\n", + " return Self(\n", + " __mlir_op.`index.cmp`[\n", + " pred : __mlir_attr.`#index`\n", + " ](lhsIndex, rhsIndex)\n", + " )\n", + "\n", + " fn __invert__(self) -> Self:\n", + " return OurFalse if self == OurTrue else OurTrue" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Defining the 
`OurBool` type\n", + "\n", + "We can use Mojo's `struct` keyword to define a new type `OurBool`:\n", + "\n", + "```\n", + "struct OurBool:\n", + " var value: __mlir_type.i1\n", + "```\n", + "\n", + "A boolean can represent 0 or 1, \"true\" or \"false.\" To store this information, `OurBool` has a single member, called `value`. Its type is represented *directly in MLIR*, using the MLIR builtin type [`i1`](https://mlir.llvm.org/docs/Dialects/Builtin/#integertype). In fact, you can use any MLIR type in Mojo, by prefixing the type name with `__mlir_type`.\n", + "\n", + "As we'll see below, representing our boolean value with `i1` will allow us to utilize all of the MLIR operations and optimizations that interface with the `i1` type -- and there are many of them!\n", + "\n", + "Having defined `OurBool`, we can now declare a variable of this type:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "var a: OurBool" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Leveraging MLIR\n", + "\n", + "Naturally, we might next try to create an instance of `OurBool`. Attempting to do so at this point, however, results in an error:\n", + "\n", + "```\n", + "let a = OurBool() # error: 'OurBool' does not implement an '__init__' method\n", + "```\n", + "\n", + "As in Python, `__init__` is a [special method](https://docs.python.org/3/reference/datamodel.html#specialnames) that can be defined to customize the behavior of a type. 
We can implement an `__init__` method that takes no arguments, and returns an `OurBool` with a \"false\" value.\n", + "\n", + "```\n", + "struct OurBool:\n", + " var value: __mlir_type.i1\n", + "\n", + " fn __init__(self&):\n", + " self.value = __mlir_op.`index.bool.constant`[\n", + " value : __mlir_attr.`false`,\n", + " ]()\n", + "```\n", + "\n", + "To initialize the underlying `i1` value, we use an MLIR operation from its ['index' dialect](https://mlir.llvm.org/docs/Dialects/IndexOps/), called [`index.bool.constant`](https://mlir.llvm.org/docs/Dialects/IndexOps/#indexboolconstant-mlirindexboolconstantop).\n", + "\n", + "MLIR's 'index' dialect provides us with operations for manipulating builtin MLIR types, such as the `i1` we use to store the value of `OurBool`. The `index.bool.constant` operation takes a `true` or `false` compile-time constant as input, and produces a runtime output of type `i1` with the given value.\n", + "\n", + "So, as shown above, in addition to any MLIR type, Mojo also provides direct access to any MLIR operation via the `__mlir_op` prefix, and to any attribute via the `__mlir_attr` prefix. MLIR attributes are used to represent compile-time constants.\n", + "\n", + "As you can see above, the syntax for interacting with MLIR isn't always pretty: MLIR attributes are passed in between square brackets `[...]`, and the operation is executed via a parentheses suffix `(...)`, which can take runtime argument values. 
However, most Mojo programmers will not need to access MLIR directly, and for the few that do, this \"ugly\" syntax gives them superpowers: they can define high-level types that are easy to use, but that internally plug into MLIR and its powerful system of dialects.\n", + "\n", + "We think this is very exciting, but let's bring things back down to earth: having defined an `__init__` method, we can now create an instance of our `OurBool` type:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "let b = OurBool()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Value semantics in Mojo\n", + "\n", + "We can now instantiate `OurBool`, but using it is another story:\n", + "\n", + "```\n", + "let a = OurBool()\n", + "let b = a # error: 'OurBool' does not implement the '__copyinit__' method\n", + "```\n", + "\n", + "Mojo uses \"value semantics\" by default, meaning that it expects to create a copy of `a` when assigning to `b`. However, Mojo doesn't make any assumptions about *how* to copy `OurBool`, or its underlying `i1` value. The error indicates that we should implement a `__copyinit__` method, which would implement the copying logic.\n", + "\n", + "In our case, however, `OurBool` is a very simple type, with only one \"trivially copyable\" member. We can use a decorator to tell the Mojo compiler that, saving us the trouble of defining our own `__copyinit__` boilerplate. 
Trivially copyable types must implement an `__init__` method that returns an instance of themselves, so we must also rewrite our initializer slightly.\n", + "\n", + "```\n", + "@register_passable(\"trivial\")\n", + "struct OurBool:\n", + " var value: __mlir_type.i1\n", + "\n", + " fn __init__() -> Self:\n", + " return Self {\n", + " value: __mlir_op.`index.bool.constant`[\n", + " value : __mlir_attr.`false`,\n", + " ]()\n", + " }\n", + "```\n", + "\n", + "We can now copy `OurBool` as we please:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "let c = OurBool()\n", + "let d = c" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Compile-time constants\n", + "\n", + "It's not very useful to have a boolean type that can only represent \"false.\" Let's define compile-time constants that represent true and false `OurBool` values.\n", + "\n", + "First, let's define another `__init__` constructor for `OurBool` that takes its `i1` value as an argument:\n", + "\n", + "```\n", + "@register_passable(\"trivial\")\n", + "struct OurBool:\n", + " var value: __mlir_type.i1\n", + " # ...\n", + "\n", + " fn __init__(value: __mlir_type.i1) -> Self:\n", + " return Self {value: value}\n", + "```\n", + "\n", + "This allows us to define compile-time constant `OurBool` values, using the `alias` keyword. First, let's define `OurTrue`:\n", + "\n", + "```\n", + "alias OurTrue = OurBool(__mlir_attr.`true`)\n", + "```\n", + "\n", + "Here we're passing in an MLIR compile-time constant value of `true`, which has the `i1` type that our new `__init__` constructor expects. 
We can use a slightly different syntax for `OurFalse`:\n", + "\n", + "```\n", + "alias OurFalse: OurBool = __mlir_attr.`false`\n", + "```\n", + "\n", + "`OurFalse` is declared to be of type `OurBool`, and then assigned an `i1` type -- in this case, the `OurBool` constructor we added is called implicitly.\n", + "\n", + "With true and false constants, we can also simplify our original `__init__` constructor for `OurBool`. Instead of constructing an MLIR value, we can simply return our `OurFalse` constant:\n", + "\n", + "```\n", + "alias OurTrue = OurBool(__mlir_attr.`true`)\n", + "alias OurFalse: OurBool = __mlir_attr.`false`\n", + "\n", + "@register_passable(\"trivial\")\n", + "struct OurBool:\n", + " # ...\n", + " fn __init__() -> Self:\n", + " return OurFalse\n", + "```\n", + "\n", + "Note also that we can define `OurTrue` before we define `OurBool`. The Mojo compiler is smart enough to figure this out.\n", + "\n", + "With these constants, we can now define variables with both true and false values of `OurBool`:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "let e = OurTrue\n", + "let f = OurFalse" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implementing `__bool__`\n", + "\n", + "Of course, the reason booleans are ubiquitous in programming is because they can be used for program control flow. However, if we attempt to use `OurBool` in this way, we get an error:\n", + "\n", + "```\n", + "let a = OurTrue\n", + "if a: print(\"It's true!\") # error: 'OurBool' does not implement the '__bool__' method\n", + "```\n", + "\n", + "When Mojo attempts to execute our program, it needs to be able to determine whether to print \"It's true!\" or not. It doesn't yet know that `OurBool` represents a boolean value -- Mojo just sees a struct that is 1 bit in size. 
However, Mojo also provides interfaces that convey boolean qualities, which are the same as those used by Mojo's standard library types, like `Bool`. In practice, this means Mojo gives you full control: any type that's packaged with the language's standard library is one for which you could define your own version.\n", + "\n", + "In the case of our error message, Mojo is telling us that implementing a `__bool__` method on `OurBool` would signify that it has boolean qualities.\n", + "\n", + "Thankfully, `__bool__` is simple to implement: Mojo's standard library and builtin types are all implemented on top of MLIR, and so the builtin `Bool` type also defines a constructor that takes an `i1`, just like `OurBool`:\n", + "\n", + "```\n", + "@register_passable(\"trivial\")\n", + "struct OurBool:\n", + " var value: __mlir_type.i1\n", + " # ...\n", + "\n", + " fn __bool__(self) -> Bool:\n", + " return Bool(self.value)\n", + "```\n", + "\n", + "Now we can use `OurBool` anywhere we can use the builtin `Bool` type:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "It's true!\n" + ] + } + ], + "source": [ + "#| CHECK: It's true!\n", + "let g = OurTrue\n", + "if g: print(\"It's true!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Avoiding type conversion with `__mlir_i1__`\n", + "\n", + "Our `OurBool` type is looking great, and by providing a conversion to `Bool`, it can be used anywhere the builtin `Bool` type can. But in the last section we promised you \"full control,\" the ability to define your own version of any type built into Mojo or its standard library. Surely `Bool` doesn't implement `__bool__` to convert itself into `Bool`?\n", + "\n", + "Indeed it doesn't: when Mojo evaluates a conditional expression, it actually attempts to convert it to an MLIR `i1` value, by searching for the special interface method `__mlir_i1__`. 
(The automatic conversion to `Bool` occurs because `Bool` is known to implement the `__mlir_i1__` method.)\n", + "\n", + "Again, Mojo is designed to be extensible and modular. By implementing all the special methods `Bool` does, we can create a type that can replace it entirely. Let's do so by implementing `__mlir_i1__` on `OurBool`:\n", + "\n", + "```\n", + "@register_passable(\"trivial\")\n", + "struct OurBool:\n", + " var value: __mlir_type.i1\n", + " # ...\n", + "\n", + " fn __mlir_i1__(self) -> __mlir_type.i1:\n", + " return self.value\n", + "```\n", + "\n", + "We can still use `OurBool` in conditionals just as we did before:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "No more Bool conversion!\n" + ] + } + ], + "source": [ + "#| CHECK: No more Bool conversion!\n", + "let h = OurTrue\n", + "if h: print(\"No more Bool conversion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But this time, no conversion to `Bool` occurs. You can try adding `print` statements to the `__bool__` and `__mlir_i1__` methods, or even removing the `__bool__` method entirely, to see for yourself." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adding functionality with MLIR\n", + "\n", + "There are many more ways we can improve `OurBool`. Many of those involve implementing special methods, some of which you may recognize from Python, and some of which are specific to Mojo. For example, we can implement inversion of an `OurBool` value by adding an `__invert__` method. We can also add an `__eq__` method, which allows two `OurBool` values to be compared with the `==` operator.\n", + "\n", + "What sets Mojo apart is the fact that we can implement each of these using MLIR. 
To implement `__eq__`, for example, we use the [`index.casts`](https://mlir.llvm.org/docs/Dialects/IndexOps/#indexcasts-mlirindexcastsop) operation to cast our `i1` values to the MLIR index dialect's `index` type, and then the [`index.cmp`](https://mlir.llvm.org/docs/Dialects/IndexOps/#indexcmp-mlirindexcmpop) operation to compare them for equality:\n", + "\n", + "```\n", + "@register_passable(\"trivial\")\n", + "struct OurBool:\n", + " var value: __mlir_type.i1\n", + " # ...\n", + "\n", + " fn __eq__(self, rhs: OurBool) -> Self:\n", + " let lhsIndex = __mlir_op.`index.casts`[_type : __mlir_type.index](\n", + " self.value\n", + " )\n", + " let rhsIndex = __mlir_op.`index.casts`[_type : __mlir_type.index](\n", + " rhs.value\n", + " )\n", + " return Self(\n", + " __mlir_op.`index.cmp`[\n", + " pred : __mlir_attr.`#index`\n", + " ](lhsIndex, rhsIndex)\n", + " )\n", + "```\n", + "\n", + "We can then implement `__invert__` in terms of `__eq__`:\n", + "\n", + "```\n", + "@register_passable(\"trivial\")\n", + "struct OurBool:\n", + " # ...\n", + " fn __invert__(self) -> Self:\n", + " return OurFalse if self == OurTrue else OurTrue\n", + "```\n", + "\n", + "This allows us to use the `~` operator with `OurBool`:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "It's false!\n" + ] + } + ], + "source": [ + "#| CHECK: It's false!\n", + "let i = OurFalse\n", + "if ~i: print(\"It's false!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This extensible design is what allows even \"built in\" Mojo types like `Bool`, `Int`, and even `Tuple` (!!) to be implemented in the Mojo standard library in terms of MLIR, rather than hard-coded into the Mojo language. 
This also means that there's almost nothing that those types can achieve that user-defined types cannot.\n", + "\n", + "By extension, this means that the incredible performance that Mojo unlocks for machine learning workflows isn't due to some magic being performed behind a curtain -- you can define your own high-level types that, in their implementation, use low-level MLIR to achieve unprecedented speed and control." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The promise of modularity\n", + "\n", + "As we've seen, Mojo's integration with MLIR allows Mojo programmers to implement zero-cost abstractions on par with Mojo's own builtin and standard library types.\n", + "\n", + "MLIR is open-source and extensible: new dialects are being added all the time, and those dialects then become available to use in Mojo. All the while, Mojo code gets more powerful and more optimized for new hardware -- with no additional work necessary by Mojo programmers.\n", + "\n", + "What this means is that your own custom types, whether those be `OurBool` or `OurTensor`, can be used to provide programmers with an easy-to-use and unchanging interface. But behind the scenes, MLIR will optimize those convenient, high-level types for the computing environments of tomorrow.\n", + "\n", + "In other words: Mojo isn't magic, it's modular." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Mojo", + "language": "mojo", + "name": "mojo-jupyter-kernel" + }, + "language_info": { + "file_extension": ".mojo", + "mimetype": "text/x-mojo", + "name": "mojo" + }, + "vscode": { + "interpreter": { + "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/HelloMojo.ipynb b/examples/HelloMojo.ipynb new file mode 100644 index 000000000..7687f7f02 --- /dev/null +++ b/examples/HelloMojo.ipynb @@ -0,0 +1,45 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Hello Mojo" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Mojo is designed to be a superset of Python, so many of the functions that you are already familiar with are present in Mojo. This document describes how to write a hello world program in Mojo." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IO import print \n", + "\n", + "#| CHECK: Hello Mojo!\n", + "print(\"Hello Mojo!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "language_info": { + "name": "python" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/Matmul.ipynb b/examples/Matmul.ipynb new file mode 100644 index 000000000..fdd948cac --- /dev/null +++ b/examples/Matmul.ipynb @@ -0,0 +1,1240 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Matrix Multiplication" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook describes how to write a matrix multiplication (matmul) algorithm in Mojo. 
We will start with a pure Python implementation, transition to a naive implementation that is essentially a copy of the Python one, then add types, then continue the optimizations by vectorizing, tiling, and parallelizing the implementation." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, let's define matrix multiplication. Given two dense matrices $A$ and $B$ of dimensions $M\\times K$ and $K\\times N$ respectively, we want to compute their matrix product $C = A . B$ (also known as matmul). The accumulating form $C += A . B$ is defined by" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "$$C_{i,j} += \\sum_{k \\in [0 \\cdots K)} A_{i,k} B_{k,j}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The format of this demo is to start with an implementation which is identical to that of Python (effectively renaming the file extension), then look at how adding types to the implementation helps performance before extending the implementation by leveraging the vectorization and parallelization capabilities available on modern hardware. At each step, we report the GFLOP/s achieved." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Python Implementation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's first implement matmul in Python directly from the definition." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%python\n", + "def matmul_python(C, A, B):\n", + " for m in range(C.rows):\n", + " for n in range(C.cols):\n", + " for k in range(A.cols):\n", + " C[m, n] += A[m, k] * B[k, n]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's benchmark our implementation using 128 by 128 square matrices and report the achieved GFLOP/s." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%python\n", + "import numpy as np\n", + "from timeit import timeit\n", + "\n", + "class Matrix:\n", + " def __init__(self, value, rows, cols):\n", + " self.value = value\n", + " self.rows = rows\n", + " self.cols = cols\n", + " \n", + " def __getitem__(self, idxs):\n", + " return self.value[idxs[0]][idxs[1]]\n", + " \n", + " def __setitem__(self, idxs, value):\n", + " self.value[idxs[0]][idxs[1]] = value\n", + "\n", + "def benchmark_matmul_python(M, N, K):\n", + " A = Matrix(list(np.random.rand(M, K)), M, K)\n", + " B = Matrix(list(np.random.rand(K, N)), K, N)\n", + " C = Matrix(list(np.zeros((M, N))), M, N)\n", + " secs = timeit(lambda: matmul_python(C, A, B), number=2)/2\n", + " print(((2*M*N*K)/secs) / 1e9, \"GFLOP/s\")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.005480328057626661 GFLOP/s\n" + ] + } + ], + "source": [ + "# %python # TODO: delete this once we switch to REPL in notebook testing (#12719).\n", + "benchmark_matmul_python(128, 128, 128)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Importing the Python implementation to Mojo" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using Mojo is as simple as Python. 
First, let's import the modules from the Mojo stdlib that we are going to use:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from Benchmark import Benchmark\n", + "from DType import DType\n", + "from IO import print, _printf, _put_kgen_scalar\n", + "from Intrinsics import strided_load\n", + "from List import VariadicList, VariadicListMem\n", + "from Math import div_ceil\n", + "from Memory import memset_zero\n", + "from Object import object, Attr\n", + "from Pointer import DTypePointer\n", + "from Random import rand, random_f64\n", + "from Range import range\n", + "from SIMD import SIMD, F64, F32\n", + "from TargetInfo import dtype_sizeof\n", + "from Complex import ComplexSIMD as ComplexGenericSIMD\n", + "\n", + "alias python_gflops = F64(0.005480328057626661)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, we can copy and paste our Python code. Mojo is a superset of Python, so the same Python code will run as Mojo code." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# This is exactly the same Python implementation, \n", + "# but is in fact Mojo code!\n", + "def matmul_untyped(C, A, B):\n", + " for m in range(C.rows):\n", + " for n in range(C.cols):\n", + " for k in range(A.cols):\n", + " C[m, n] += A[m, k] * B[k, n]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can then benchmark the implementation. 
As before we use a 128 by 128 matrix" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def matrix_getitem(self, i) -> object:\n", + " return self.value[i]\n", + "\n", + "\n", + "def matrix_setitem(self, i, value) -> object:\n", + " self.value[i] = value\n", + " return None\n", + "\n", + "\n", + "def matrix_append(self, value) -> object:\n", + " self.value.append(value)\n", + " return None\n", + "\n", + "\n", + "def matrix_init(rows: Int, cols: Int) -> object:\n", + " value = object([])\n", + " let getitem: object = __mlir_op.`kgen.addressof`[ _type : object.binary_function, callee:matrix_getitem, paramDecls : __mlir_attr.`#kgen`, ]()\n", + " let setitem: object = __mlir_op.`kgen.addressof`[ _type : object.ternary_function, callee:matrix_setitem, paramDecls : __mlir_attr.`#kgen`, ]()\n", + " let append: object = __mlir_op.`kgen.addressof`[ _type : object.binary_function, callee:matrix_append, paramDecls : __mlir_attr.`#kgen`, ]()\n", + " return object(\n", + " Attr(\"value\", value), Attr(\"__getitem__\", getitem), Attr(\"__setitem__\", setitem), \n", + " Attr(\"rows\", rows), Attr(\"cols\", cols), Attr(\"append\", append),\n", + " )\n", + "\n", + "def benchmark_matmul_untyped(M: Int, N: Int, K: Int):\n", + " C = matrix_init(M, N)\n", + " A = matrix_init(M, K)\n", + " B = matrix_init(K, N)\n", + " for i in range(M):\n", + " c = object([])\n", + " b = object([])\n", + " a = object([])\n", + " for j in range(N):\n", + " c.append(0.0)\n", + " b.append(random_f64(-5, 5))\n", + " a.append(random_f64(-5, 5))\n", + " C.append(c)\n", + " B.append(b)\n", + " A.append(a)\n", + "\n", + " fn test_fn():\n", + " try:\n", + " matmul_untyped(C, A, B)\n", + " except:\n", + " pass\n", + "\n", + " let secs = F64(Benchmark().run[test_fn]()) / 1_000_000_000\n", + " let gflops = ((2*M*N*K)/secs) / 1e9\n", + " _put_kgen_scalar[DType.f64](gflops.value)\n", + " _printf(\" GFLOP/s, a \")\n", + " let speedup : F64 
= gflops / python_gflops\n", + " _printf(\"%0.2f\", speedup.value)\n", + " print(\"x speedup over Python\")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.047082 GFLOP/s, a 8.59x speedup over Python\n" + ] + } + ], + "source": [ + "benchmark_matmul_untyped(128, 128, 128)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note the roughly 8.6x speedup that we got without changing the matmul code." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adding types to the Python implementation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above program, while achieving better performance than Python, is still not the best we can get from Mojo. If we tell Mojo the types of the inputs, it can optimize much of the code away and reduce the cost of dynamic dispatch. Unlike Python, which uses types only for type checking, Mojo also uses them for performance optimizations." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To do that, let's first define a Matrix struct. The Matrix struct contains a data pointer along with its dimensions (`rows` and `cols`). While the Matrix struct can be parametrized on any data type, here we set the data type to be f32 for conciseness. 
The implementation is:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "struct Matrix:\n", + " var data: DTypePointer[DType.f32]\n", + " var rows: Int\n", + " var cols: Int\n", + "\n", + " fn __init__(self&, rows: Int, cols: Int):\n", + " self.data = DTypePointer[DType.f32].alloc(rows * cols)\n", + " rand(self.data, rows*cols)\n", + " self.rows = rows\n", + " self.cols = cols\n", + "\n", + " fn __del__(owned self):\n", + " self.data.free()\n", + "\n", + " fn zero(self&):\n", + " memset_zero(self.data, self.rows * self.cols * dtype_sizeof[DType.f32]())\n", + "\n", + " @always_inline\n", + " fn __getitem__(self, y: Int, x: Int) -> F32:\n", + " return self.load[1](y, x)\n", + "\n", + " @always_inline\n", + " fn load[nelts:Int](self, y: Int, x: Int) -> SIMD[DType.f32, nelts]:\n", + " return self.data.simd_load[nelts](y * self.cols + x)\n", + "\n", + " @always_inline\n", + " fn load_tr[nelts:Int](self, y: Int, x: Int) -> SIMD[DType.f32, nelts]:\n", + " # Perform a transposed simd load. 
\n", + " # return strided_load[nelts,DType.f32](self.data + x* dtype_sizeof[DType.f32](), self.cols)\n", + " var res = SIMD[DType.f32, nelts]()\n", + " res[0] = self[y + 0, x]\n", + " res[1] = self[y + 1, x]\n", + " res[2] = self[y + 2, x]\n", + " res[3] = self[y + 3, x]\n", + " res[4] = self[y + 4, x]\n", + " res[5] = self[y + 5, x]\n", + " res[6] = self[y + 6, x]\n", + " res[7] = self[y + 7, x]\n", + " return res\n", + "\n", + " @always_inline\n", + " fn __setitem__(self, y: Int, x: Int, val: F32):\n", + " return self.store[1](y, x, val)\n", + "\n", + " @always_inline\n", + " fn store[nelts:Int](self, y: Int, x: Int, val: SIMD[DType.f32, nelts]):\n", + " var data = self.data\n", + " data.simd_store[nelts](y * self.cols + x, val)\n", + "\n", + " def to_numpy(self) -> PythonObject:\n", + " let np = Python.import_module(\"numpy\")\n", + " var numpy_array = np.zeros((self.cols, self.rows), np.uint32)\n", + " for x in range(self.cols):\n", + " for y in range(self.rows):\n", + " numpy_array.itemset((y, x), self[x, y])\n", + " return numpy_array" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Note that we implement `__getitem__` and `__setitem__` in terms of `load` and `store`. While this is not important for the naive matmul, we will use it in the vectorized version of matmul. We also define `load_tr`, which loads a vector of values from a single column, starting at the given row offset." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With the above Matrix type, we can effectively copy and paste the Python implementation and just add type annotations" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Note that C, A, and B have types.\n", + "def matmul_naive(C: Matrix, A: Matrix, B: Matrix):\n", + " for m in range(C.rows):\n", + " for n in range(C.cols):\n", + " for k in range(A.cols):\n", + " C[m, n] += A[m, k] * B[k, n]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are going to benchmark the implementations as we improve, so let's write a helper function that will do that for us: " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "@always_inline\n", + "def benchmark[func : __mlir_type[\n", + " `!kgen.signature<<>(`,\n", + " `!pop.pointer<`, Matrix,`>`, # C\n", + " ` borrow_in_mem,`,\n", + " `!pop.pointer<`, Matrix,`>`, # A\n", + " ` borrow_in_mem,`,\n", + " `!pop.pointer<`, Matrix,`>`, # B\n", + " ` borrow_in_mem) throws -> `,\n", + " `!pop.variant<`, Error, `,`, NoneType, `>`,\n", + " `>`,\n", + " ]](M : Int, N : Int, K : Int):\n", + " var C = Matrix(M, N)\n", + " C.zero()\n", + " var A = Matrix(M, K)\n", + " var B = Matrix(K, N)\n", + "\n", + " # func(C, A, B)\n", + " # print(C[10,4])\n", + "\n", + " @always_inline\n", + " fn test_fn():\n", + " try:\n", + " func(C, A, B)\n", + " except:\n", + " pass\n", + "\n", + " let secs = F64(Benchmark().run[test_fn]()) / 1_000_000_000\n", + " let gflops = ((2*M*N*K)/secs) / 1e9\n", + " _put_kgen_scalar[DType.f64](gflops.value)\n", + " _printf(\" GFLOP/s, a \")\n", + " let speedup : F64 = gflops / python_gflops\n", + " _printf(\"%0.2f\", speedup.value)\n", + " print(\"x speedup over Python\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Benchmarking the results 
shows significant speedups. We increase the size of the matrix to 512 by 512, since Mojo is much faster than Python." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.702316 GFLOP/s, a 310.62x speedup over Python\n" + ] + } + ], + "source": [ + "benchmark[matmul_naive](512, 512, 512)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Adding types gives a speedup of roughly 300x over the untyped Python version." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Vectorizing the innermost loop" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can do better than the above implementation by using vector instructions. Assuming a vector width of 8, we can modify the code as follows to leverage the SIMD instructions:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Mojo has SIMD vector types, so we can vectorize the\n", + "# matmul code as follows.\n", + "alias nelts = 8 # The SIMD vector width.\n", + "def matmul_vectorized_0(C: Matrix, A: Matrix, B: Matrix):\n", + " for m in range(C.rows):\n", + " for n in range(C.cols):\n", + " var tmp = SIMD[DType.f32, nelts]()\n", + " # Only iterate over full vectors; the tail is handled below.\n", + " for kv in range(0, nelts*(A.cols//nelts), nelts):\n", + " tmp += A.load[nelts](m,kv) * B.load_tr[nelts](kv,n)\n", + " C[m,n] += tmp.reduce_add()\n", + " \n", + " # Handle remaining elements with scalars.\n", + " for k in range(nelts*(A.cols//nelts), A.cols):\n", + " C[m,n] += A[m,k] * B[k,n]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can benchmark the above implementation. Note that many compilers can detect the naive loop and perform optimizations. Mojo, however, allows you to be explicit. In this case the compiler heuristics were better than what we wrote." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3.136114 GFLOP/s, a 572.25x speedup over Python\n" + ] + } + ], + "source": [ + "benchmark[matmul_vectorized_0](512, 512, 512)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since vectorization is a common optimization, Mojo provides a higher-order function that performs vectorization for you. The `vectorize` function takes a vector width and a function that is parametric on the vector width, and evaluates that function in a vectorized manner." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Simplify the code by using the builtin vectorize function\n", + "from Functional import vectorize\n", + "def matmul_vectorized_1(C: Matrix, A: Matrix, B: Matrix):\n", + " for m in range(C.rows):\n", + " for n in range(C.cols):\n", + " fn dot[nelts : Int](k : Int):\n", + " C[m,n] += (A.load[nelts](m,k) * B.load_tr[nelts](k,n)).reduce_add()\n", + " vectorize[nelts, dot](A.cols)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is only a slight difference in performance between the two implementations:" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3.125112 GFLOP/s, a 570.27x speedup over Python\n" + ] + } + ], + "source": [ + "benchmark[matmul_vectorized_1](512, 512, 512)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parallelizing Matmul" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To get the best performance from modern processors, one has to utilize the multiple cores they provide. 
We can adjust the program above to be multi-threaded and run across cores using the `parallelize` function. For simplicity, we only parallelize over the M dimension:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Parallelize the code by using the builtin parallelize function\n", + "from Functional import parallelize\n", + "def matmul_parallelized(C: Matrix, A: Matrix, B: Matrix):\n", + " fn calc_row(m: Int):\n", + " for n in range(C.cols):\n", + " fn dot[nelts : Int](k : Int):\n", + " C[m,n] += (A.load[nelts](m,k) * B.load_tr[nelts](k,n)).reduce_add()\n", + " vectorize[nelts, dot](A.cols)\n", + " \n", + " parallelize[calc_row](C.rows)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can benchmark the parallel matmul implementation. Again, we increase the size of the matrix, this time to 1024 by 1024, since this implementation is much faster than the single-threaded versions." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "11.388119 GFLOP/s, a 2078.00x speedup over Python\n" + ] + } + ], + "source": [ + "benchmark[matmul_parallelized](1024, 1024, 1024)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tiling Matmul" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Tiling is an optimization performed for matmul to increase cache locality. The idea is to keep sub-matrices resident in the cache and increase their reuse. 
The tile function itself can be written in Mojo as:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from Functional import Static2DTileUnitFunc as Tile2DFunc" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Perform 2D tiling on the iteration space defined by end_x and end_y.\n", + "fn tile[tiled_fn: Tile2DFunc, tile_x: Int, tile_y: Int](end_x: Int, end_y: Int):\n", + " # Note: this assumes that ends are multiples of the tiles.\n", + " for y in range(0, end_y, tile_y):\n", + " for x in range(0, end_x, tile_x):\n", + " tiled_fn[tile_x, tile_y](x, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above performs two-dimensional tiling over a 2D iteration space spanning $[0, end_x)$ by $[0, end_y)$. Once it is defined, we can use it within our matmul kernel. For simplicity, we choose 16 as the tile height, and since we also want to vectorize, we use `16 * nelts` as the tile width (since we vectorize on the columns)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Use the above tile function to perform tiled matmul.\n", + "def matmul_tiled_parallelized(C: Matrix, A: Matrix, B: Matrix):\n", + " fn calc_row(m: Int):\n", + " fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):\n", + " for n in range(y, y + tile_y):\n", + " fn dot[nelts: Int](k: Int):\n", + " C[m,n] += (A.load[nelts](m,k+x) * B.load_tr[nelts](k+x,n)).reduce_add()\n", + " vectorize[nelts, dot](tile_x)\n", + " \n", + " # We hardcode the tile factor to be 16.\n", + " alias tile_size = 16\n", + " # x tiles the K dimension (A.cols) and y tiles the N dimension (C.cols).\n", + " tile[calc_tile, nelts * tile_size, tile_size](A.cols, C.cols)\n", + "\n", + " parallelize[calc_row](C.rows)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, we can benchmark the tiled parallel matmul implementation:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "11.896004 GFLOP/s, a 2170.67x speedup over Python\n" + ] + } + ], + "source": [ + "benchmark[matmul_tiled_parallelized](1024, 1024, 1024)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One source of overhead in the above implementation is that we are not unrolling the loops that `vectorize` introduces for the `dot` function. 
We can do that via the `vectorize_unroll` higher-order function in Mojo:" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Unroll the vectorized loop by a constant factor.\n", + "from Functional import vectorize_unroll\n", + "def matmul_tiled_unrolled_parallelized(C: Matrix, A: Matrix, B: Matrix):\n", + " fn calc_row(m: Int):\n", + " fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):\n", + " for n in range(y, y + tile_y):\n", + " fn dot[nelts : Int](k : Int):\n", + " C[m,n] += (A.load[nelts](m,k+x) * B.load_tr[nelts](k+x,n)).reduce_add()\n", + " \n", + " # Vectorize by nelts and unroll by tile_x/nelts\n", + " # Here the unroll factor is (nelts * tile_size) / nelts = 16\n", + " vectorize_unroll[nelts, tile_x//nelts, dot](tile_x)\n", + "\n", + " alias tile_size = 16\n", + " tile[calc_tile, nelts*tile_size, tile_size](A.cols, C.cols)\n", + " \n", + " parallelize[calc_row](C.rows)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, we can benchmark the tiled and unrolled parallel matmul implementation:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "12.243320 GFLOP/s, a 2234.05x speedup over Python\n" + ] + } + ], + "source": [ + "benchmark[matmul_tiled_unrolled_parallelized](1024, 1024, 1024)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Searching for the `tile_factor`" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from Autotune import autotune, search\n", + "from Time import now\n", + "from Pointer import Pointer\n", + "\n", + "alias matmul_fn_type = __mlir_type[\n", + " `( `,\n", + " Pointer[Matrix].pointer_type,\n", + " `, `,\n", + " Pointer[Matrix].pointer_type,\n", + " `, `,\n", + " Pointer[Matrix].pointer_type,\n", + " `) -> `, NoneType\n", + "]\n", + 
"alias matmul_fn_sig_type = __mlir_type[\n", + " `!kgen.signature<<>(`,\n", + " `!pop.pointer<`, Matrix,`>`, # C\n", + " ` borrow_in_mem,`,\n", + " `!pop.pointer<`, Matrix,`>`, # A\n", + " ` borrow_in_mem,`,\n", + " `!pop.pointer<`, Matrix,`>`, # B\n", + " ` borrow_in_mem) -> `, NoneType,\n", + " `>`,\n", + " ]\n", + "\n", + "alias matmul_fn_ptr_type = __mlir_type[`!pop.pointer<`, matmul_fn_sig_type, `>`]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The choice of the tile factor can greatly impact the performace of the full matmul,\n", + "but the optimal tile factor is highly hardware-dependant, and is influenced by the\n", + "cache configuration and other hard-to-model effects. We want write to write portable code\n", + "without knowing everything about the hardware, so we can ask Mojo to automatically\n", + "select the best tile factor using autotuning." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Autotune the tile size used in the matmul.\n", + "@adaptive\n", + "fn matmul_autotune_impl(C: Matrix, A: Matrix, B: Matrix):\n", + " fn calc_row(m: Int):\n", + " fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):\n", + " for n in range(y, y + tile_y):\n", + " fn dot[nelts : Int](k : Int):\n", + " C[m,n] += (A.load[nelts](m,k+x) * B.load_tr[nelts](k+x,y)).reduce_add()\n", + " vectorize_unroll[nelts, tile_x // nelts, dot](tile_x)\n", + "\n", + " # Instead of hardcoding to tile_size = 16, search for the fastest \n", + " # tile size by evaluting this function as tile size varies.\n", + " alias tile_size = autotune(1, 2, 4, 8, 16, 32, 64)\n", + " tile[calc_tile, nelts * tile_size, tile_size](A.cols, C.cols)\n", + " \n", + " parallelize[calc_row](C.rows)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This will generate multiple candidates for the matmul function. 
To teach Mojo how\n", + "to find the best tile factor, we provide an evaluator function Mojo can use to\n", + "measure each candidate." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "fn matmul_evaluator(funcs: matmul_fn_ptr_type, size: Int) -> Int:\n", + " print(\"matmul_evaluator, number of candidates: \")\n", + " print(size)\n", + "\n", + " let eval_begin: Int = now()\n", + "\n", + " # This size is picked at random, in real code we could use a real size\n", + " # distribution here.\n", + " print(\"Optimizing for size: \")\n", + " let M = 1024\n", + " let N = 1024\n", + " let K = 1024\n", + " print(M)\n", + " print(N)\n", + " print(K)\n", + "\n", + " var best_idx: Int = -1\n", + " var best_time: Int = -1\n", + " var funcs_ptr = Pointer[matmul_fn_sig_type](funcs).bitcast[matmul_fn_type]()\n", + "\n", + " alias eval_iterations = 10\n", + " alias eval_samples = 10\n", + "\n", + " var C = Matrix(M, N)\n", + " var A = Matrix(M, K)\n", + " var B = Matrix(K, N)\n", + " let Cptr = Pointer[Matrix].address_of(C).address\n", + " let Aptr = Pointer[Matrix].address_of(A).address\n", + " let Bptr = Pointer[Matrix].address_of(B).address\n", + "\n", + " # Find the function that's the fastest on the size we're optimizing for\n", + " for f_idx in range(size):\n", + " let func = funcs_ptr.load(f_idx)\n", + "\n", + " @always_inline\n", + " fn wrapper():\n", + " __mlir_op.`pop.call_indirect`[_type:NoneType](\n", + " func, Cptr, Aptr, Bptr\n", + " )\n", + " let cur_time = Benchmark(1, 100_000, 500_000_000, 1000_000_000).run[wrapper]()\n", + "\n", + " if best_idx < 0:\n", + " best_idx = f_idx\n", + " best_time = cur_time\n", + " if best_time > cur_time:\n", + " best_idx = f_idx\n", + " best_time = cur_time\n", + "\n", + " let eval_end: Int = now()\n", + " print(\"Time spent in matmul_evaluator, ms: \")\n", + " print((eval_end - eval_begin) // 1000000)\n", + "\n", + " print(\"Best candidate idx:\")\n", + " 
print(best_idx)\n", + " return best_idx" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we need to define an entry function that simply calls the best candidate." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def matmul_autotune(C: Matrix, A: Matrix, B: Matrix):\n", + " alias best_impl: matmul_fn_sig_type\n", + " search[\n", + " matmul_fn_sig_type,\n", + " VariadicList(matmul_autotune_impl.__adaptive_set),\n", + " matmul_evaluator -> best_impl\n", + " ]()\n", + " # Run the best candidate\n", + " return best_impl(C, A, B)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's benchmark our new implementation:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "22.824017 GFLOP/s, a 4164.72x speedup over Python\n" + ] + } + ], + "source": [ + "benchmark[matmul_autotune](1024, 1024, 1024)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Mandelbrot in Mojo, Plotting in Python\n", + "\n", + "Mojo is great at compute and writing high-performance code, but Python has a huge ecosystem of libraries and tools. With seamless Python interoperability, Mojo can leverage Python for what it's good at, especially GUIs, without sacrificing performance in critical code. Let's take the classic Mandelbrot set algorithm and implement it in Mojo.\n", + "\n", + "We'll introduce a `Complex` type and use it in our implementation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "@register_passable(\"trivial\")\n", + "struct Complex:\n", + " var real: F32\n", + " var imag: F32\n", + "\n", + " fn __init__(real: F32, imag: F32) -> Self:\n", + " return Self {real: real, imag: imag}\n", + "\n", + " fn __add__(lhs, rhs: Self) -> Self:\n", + " return Self(lhs.real + rhs.real, lhs.imag + rhs.imag)\n", + "\n", + " fn __mul__(lhs, rhs: Self) -> Self:\n", + " return Self(\n", + " lhs.real * rhs.real - lhs.imag * rhs.imag,\n", + " lhs.real * rhs.imag + lhs.imag * rhs.real,\n", + " )\n", + "\n", + " fn norm(self) -> F32:\n", + " return self.real * self.real + self.imag * self.imag" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then we can write the core mandelbrot algorithm, which involves computing an iterative complex function for each pixel until it \"escapes\" the complex circle of radius 2, counting the number of iterations to escape.\n", + "\n", + "$$z_{i+1} = (z_i)^2 + c$$" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "alias xmin: F32 = -2.25\n", + "alias xmax: F32 = 0.75\n", + "alias xn = 1500\n", + "alias ymin: F32 = -1.25\n", + "alias ymax: F32 = 1.25\n", + "alias yn = 1250\n", + "\n", + "# Compute the number of steps to escape.\n", + "def mandlebrot_kernel(c: Complex) -> Int:\n", + " max_iter = 200\n", + " z = c\n", + " for i in range(max_iter):\n", + " z = z * z + c\n", + " if z.norm() > 4:\n", + " return i\n", + " return max_iter\n", + "\n", + "\n", + "def compute_mandlebrot() -> Matrix:\n", + " # create a matrix. 
Each element of the matrix corresponds to a pixel\n", + " result = Matrix(xn, yn)\n", + "\n", + " cnt = 0\n", + " x = xmin\n", + " dx = (xmax - xmin) / xn\n", + " dy = (ymax - ymin) / yn\n", + " for i in range(xn):\n", + " y = ymin\n", + " for j in range(yn):\n", + " result[i, j] = mandlebrot_kernel(Complex(x, y))\n", + " y += dy\n", + " x += dx\n", + " return result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plotting the number of iterations to escape with some color gives us the canonical mandelbrot set plot. We can directly leverage Python's `matplotlib` from Mojo to do our plotting!" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def to_numpy(m: Matrix) -> PythonObject:\n", + " let np = Python.import_module(\"numpy\")\n", + " var numpy_array = np.zeros((yn, xn), np.uint32)\n", + " for x in range(xn):\n", + " for y in range(yn):\n", + " numpy_array.itemset((y, x), m[x, y])\n", + " return numpy_array" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "image/png": "" + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "np = Python.import_module(\"numpy\")\n", + "plt = Python.import_module(\"matplotlib.pyplot\")\n", + "colors = Python.import_module(\"matplotlib.colors\")\n", + "\n", + "result = compute_mandlebrot()\n", + "dpi = 72\n", + "width = 10\n", + "height = 10 * yn // xn\n", + "\n", + "fig = plt.figure(1, [width, height], dpi)\n", + "ax = fig.add_axes([0.0, 0.0, 1.0, 1.0], False, 1)\n", + "\n", + "light = colors.LightSource(315, 10, 0, 1, 1, 0)\n", + "image = light.shade(\n", + " result.to_numpy(), plt.cm.hot, colors.PowerNorm(0.3), \"hsv\", 0, 0, 1.5\n", + " )\n", + "plt.imshow(image)\n", + "plt.axis(\"off\")\n", + "plt.show()" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "We showed a naive implementation of the mandelbrot algorithm, but there are two things we can do to speed it up. We can early-stop the loop iteration when a pixel is known to have escaped, and we can leverage Mojo's access to hardware by vectorizing the loop, computing multiple pixels simultaneously." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Some aliases to make the code easier to read\n", + "alias ComplexSIMD = ComplexGenericSIMD[DType.f32, 8]\n", + "alias NumStepsSIMD = SIMD[DType.si64, 8]\n", + "\n", + "# Multi-element mandlebrot with early-stop optimization.\n", + "def mandlebrot_kernel_simd(c: ComplexSIMD, iter: Int) -> NumStepsSIMD:\n", + " z = c\n", + " nv = NumStepsSIMD(0)\n", + " done_mask = SIMD[DType.bool, 8](0)\n", + "\n", + " i = 100\n", + " while i != 0 and done_mask:\n", + " done_mask = z.norm() > 4 \n", + " z = z*z + c\n", + " nv = done_mask.select(nv, nv + 1)\n", + " i -= 1\n", + " return nv" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Mojo", + "language": "mojo", + "name": "mojo-jupyter-kernel" + }, + "language_info": { + "codemirror_mode": { + "name": "mojo" + }, + "file_extension": ".mojo", + "mimetype": "text/x-mojo", + "name": "mojo" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/Memset.ipynb b/examples/Memset.ipynb new file mode 100644 index 000000000..aa148b82b --- /dev/null +++ b/examples/Memset.ipynb @@ -0,0 +1,674 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fast Memset in Mojo" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this tutorial we will implement a memset version optimized for small sizes\n", + "using Mojo's autotuning feature.\n", + "\n", + "The idea behind the implementation is based on Nadav 
Rotem's work [[1](https://github.com/nadavrot/memset_benchmark)], and is also well described in [[2](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4f7c3da72d557ed418828823a8e59942859d677f.pdf)].\n", + "\n", + "Below we briefly summarize the approach." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## High-level overview\n", + "\n", + "For the best memset performance we want to use the widest possible register\n", + "width for the memory access. For instance, if we want to store 19 bytes, we\n", + "want to use vector width 16 and use two overlapping stores. To store 9 bytes,\n", + "we would want to use two 8-byte stores.\n", + "\n", + "However, before we get to actually doing stores, we need to perform size\n", + "checks to make sure that we're in the right range. That is, we want to use\n", + "8-byte stores for sizes 8-16, 16-byte stores for sizes 16-32, etc.\n", + "\n", + "The order in which we do the size checks significantly affects performance,\n", + "and ideally we would like to run as few checks as possible for the sizes\n", + "that occur most often. For example, if most of the sizes we see are 16-32, then\n", + "we want to first check if it's within that range before we check if it's in\n", + "8-16 or some other range.\n", + "\n", + "This results in a number of different comparison \"trees\" that can be used to\n", + "perform the size checks, and in this tutorial we use Mojo's autotuning to pick\n", + "the best one given the distribution of input data." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implementation\n", + "\n", + "We will start as we always start - with imports and type aliases." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from Assert import assert_param\n", + "from Autotune import autotune_fork, search\n", + "from DType import DType\n", + "from IO import print, _printf, put\n", + "from List import VariadicList\n", + "from Math import min, max\n", + "from Memory import _malloc, _free\n", + "from OS import getenv\n", + "from Pointer import DTypePointer, Pointer\n", + "from Range import range\n", + "from SIMD import SIMD\n", + "from Sort import sort\n", + "from String import StringRef\n", + "from TargetInfo import sizeof\n", + "from Time import now\n", + "from Vector import DynamicVector\n", + "\n", + "alias UI8 = DType.ui8\n", + "alias BufferPtrType = DTypePointer[UI8]\n", + "alias ValueType = SIMD[UI8, 1]\n", + "alias NoneType = __mlir_type.`!lit.none`\n", + "\n", + "alias memset_fn_type = __mlir_type[\n", + " `(`, BufferPtrType, `, `, ValueType, `, `, Int, `) -> `, NoneType\n", + "]\n", + "alias memset_fn_sig_type = __mlir_type[\n", + " `!kgen.signature<(`,\n", + " BufferPtrType,\n", + " ` borrow, `,\n", + " ValueType,\n", + " ` borrow, `,\n", + " Int,\n", + " ` borrow) -> `,\n", + " NoneType, `>`\n", + "]\n", + "alias memset_fn_ptr_type = __mlir_type[\n", + " `!pop.pointer<`,\n", + " memset_fn_sig_type,\n", + " `>`\n", + "]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's add some auxiliary functions. We will use them to benchmark various\n", + "memset implementations and visualize results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fn optimization_barrier(ptr: BufferPtrType):\n", + " __mlir_op.`pop.inline_asm`[\n", + " _type:None,\n", + " assembly:(\"\").value,\n", + " constraints:(\"r,~{memory}\").value,\n", + " hasSideEffects : __mlir_attr.unit,\n", + " ](ptr.address)\n", + "\n", + "\n", + "fn alloc_buffer(size: Int) -> BufferPtrType:\n", + " let data_mem = _malloc[ValueType](sizeof[ValueType]() * size)\n", + " return DTypePointer[UI8.value](data_mem.address)\n", + "\n", + "\n", + "fn free_buffer(ptr: BufferPtrType):\n", + " _free[ValueType](ptr.as_scalar_pointer())\n", + "\n", + "\n", + "fn measure_time(\n", + " func: memset_fn_type, size: Int, ITERS: Int, SAMPLES: Int\n", + ") -> Int:\n", + " alias alloc_size = 1024 * 1024\n", + " var ptr = alloc_buffer(alloc_size)\n", + "\n", + " var samples = DynamicVector[Int](SAMPLES)\n", + "\n", + " for sample in range(SAMPLES):\n", + " let tic = now()\n", + " for iter in range(ITERS):\n", + " # Offset pointer to shake up cache a bit\n", + " var offset_ptr = ptr.offset((iter * 128) & 1024)\n", + "\n", + " # Just in case compiler will try to outsmart us and avoid repeating\n", + " # memset, change the value we're filling with\n", + " var v = ValueType(iter&255)\n", + "\n", + " # Actually call the memset function\n", + " __mlir_op.`pop.call_indirect`[_type:NoneType](\n", + " func, offset_ptr, v.value, size\n", + " )\n", + "\n", + " # Insert optimization barriers to prevent compiler from optimizing\n", + " # this loop away\n", + " optimization_barrier(ptr)\n", + " optimization_barrier(offset_ptr)\n", + "\n", + " let toc = now()\n", + " samples.push_back(toc - tic)\n", + "\n", + " # Find median across the samples\n", + " sort(samples)\n", + " let result = samples[SAMPLES // 2]\n", + "\n", + " samples.__del__()\n", + " free_buffer(ptr)\n", + " return result\n", + "\n", + "\n", + "fn visualize_result(size: Int, result: Int):\n", + " 
_printf(\"Size: \")\n", + " if size < 10:\n", + " _printf(\" \")\n", + " put(size)\n", + " _printf(\" |\")\n", + " for _ in range(result // 10000):\n", + " _printf(\"*\")\n", + " print(\"\")\n", + "\n", + "\n", + "fn benchmark(func: memset_fn_type, title: StringRef):\n", + " print(\"\\n================\")\n", + " print(title)\n", + " print(\"----------------\\n\")\n", + "\n", + " alias warmup_iterations = 100\n", + " alias benchmark_iterations = 100000\n", + " alias benchmark_samples = 5\n", + "\n", + " for size in range(35):\n", + " # Warmup\n", + " let _ = measure_time(func, size, warmup_iterations, 1)\n", + "\n", + " # Actual run\n", + " let result = measure_time(\n", + " func, size, benchmark_iterations, benchmark_samples\n", + " )\n", + "\n", + " visualize_result(size, result)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Reproducing results from the paper\n", + "\n", + "Let's implement a memset version from the paper in Mojo and compare it against\n", + "the system memset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@always_inline\n", + "fn overlapped_store[\n", + " width: Int\n", + "](ptr: BufferPtrType, value: ValueType, count: Int):\n", + " let v = SIMD.splat[UI8, width](value)\n", + " ptr.simd_store[width](v)\n", + " ptr.simd_store[width](count - width, v)\n", + "\n", + "\n", + "fn memset_manual(ptr: BufferPtrType, value: ValueType, count: Int):\n", + " if count < 32:\n", + " if count < 5:\n", + " if count == 0:\n", + " return\n", + " # 0 < count <= 4\n", + " ptr.store(0, value)\n", + " ptr.store(count - 1, value)\n", + " if count <= 2:\n", + " return\n", + " ptr.store(1, value)\n", + " ptr.store(count - 2, value)\n", + " return\n", + "\n", + " if count <= 16:\n", + " if count >= 8:\n", + " # 8 <= count < 16\n", + " overlapped_store[8](ptr, value, count)\n", + " return\n", + " # 4 < count < 8\n", + " overlapped_store[4](ptr, 
value, count)\n", + " return\n", + "\n", + " # 16 <= count < 32\n", + " overlapped_store[16](ptr, value, count)\n", + " else:\n", + " # 32 < count\n", + " memset_system(ptr, value, count)\n", + "\n", + "\n", + "fn memset_system(ptr: BufferPtrType, value: ValueType, count: Int):\n", + " __mlir_op.`pop.memset`(ptr.address, value.value, count.__as_mlir_index())\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#| CHECK: Manual memset\n", + "#| CHECK: System memset\n", + "let fptr_manual = __mlir_op.`kgen.addressof`[\n", + " _type:memset_fn_type,\n", + " callee:memset_manual,\n", + " paramDecls : __mlir_attr.`#kgen`,\n", + "]()\n", + "let fptr_system = __mlir_op.`kgen.addressof`[\n", + " _type:memset_fn_type,\n", + " callee:memset_system,\n", + " paramDecls : __mlir_attr.`#kgen`,\n", + "]()\n", + "benchmark(fptr_manual, \"Manual memset\")\n", + "benchmark(fptr_system, \"System memset\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Tweaking the implementation for different sizes\n", + "\n", + "We can see that it's already much faster for small sizes.\n", + "That version was specifically optimized for a certain input size distribution,\n", + "e.g. we can see that sizes 8-16 and 0-4 work fastest.\n", + "\n", + "But what if in **our use case** the distribution is different? Let's imagine that\n", + "in our case the most common sizes are 16-32 - is this version the most optimal\n", + "version we can use then? The answer is obviously \"no\", and we can easily tweak\n", + "the implementation to work better for these sizes - we just need to move the\n", + "corresponding check closer to the beginning of the function. E.g. 
like so:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fn memset_manual_2(ptr: BufferPtrType, value: ValueType, count: Int):\n", + " if count < 32:\n", + " if count >= 16:\n", + " # 16 <= count < 32\n", + " overlapped_store[16](ptr, value, count)\n", + " return\n", + "\n", + " if count < 5:\n", + " if count == 0:\n", + " return\n", + " # 0 < count <= 4\n", + " ptr.store(0, value)\n", + " ptr.store(count - 1, value)\n", + " if count <= 2:\n", + " return\n", + " ptr.store(1, value)\n", + " ptr.store(count - 2, value)\n", + " return\n", + "\n", + " if count >= 8:\n", + " # 8 <= count < 16\n", + " overlapped_store[8](ptr, value, count)\n", + " return\n", + " # 4 < count < 8\n", + " overlapped_store[4](ptr, value, count)\n", + "\n", + " else:\n", + " # 32 <= count\n", + " memset_system(ptr, value, count)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's check the performance of this version." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#| CHECK: Manual memset v2\n", + "let fptr_manual_2 = __mlir_op.`kgen.addressof`[\n", + " _type:memset_fn_type,\n", + " callee:memset_manual_2,\n", + " paramDecls : __mlir_attr.`#kgen`,\n", + "]()\n", + "benchmark(fptr_manual_2, \"Manual memset v2\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The performance is now much better on the 16-32 sizes!\n", + "\n", + "The problem is that we had to rewrite the code manually. Wouldn't it be nice\n", + "if this were done automatically?\n", + "\n", + "In Mojo this is possible (and quite easy): we can generate multiple\n", + "implementations and let the compiler pick the fastest one for us by evaluating\n", + "them on the sizes we want!" 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Mojo implementation\n", + "\n", + "Let's dive into that.\n", + "\n", + "The first thing we need to do is to generate all possible candidates. To do\n", + "that we will need to iteratively generate size checks to determine what size\n", + "we can use for the overlapping store. Once we have narrowed down the size interval, we\n", + "just call the overlapping store of the corresponding size.\n", + "\n", + "To express this we will implement an adaptive function `memset_impl_layer` with two\n", + "parameters designating the current interval of possible size values. When we\n", + "generate a new size check, we split that interval into two parts and\n", + "recursively call the same function on those two parts. Once we reach the\n", + "minimal intervals, we will call the corresponding `overlapped_store` function.\n", + "\n", + "This first implementation covers the minimal interval cases:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@adaptive\n", + "@always_inline\n", + "fn memset_impl_layer[\n", + " lower: Int, upper: Int\n", + "](ptr: BufferPtrType, value: ValueType, count: Int):\n", + " @parameter\n", + " if (lower == -100) & (upper == 0):\n", + " pass\n", + " elif (lower == 0) & (upper == 4):\n", + " ptr.store(0, value)\n", + " ptr.store(count - 1, value)\n", + " if count <= 2:\n", + " return\n", + " ptr.store(1, value)\n", + " ptr.store(count - 2, value)\n", + " elif (lower == 4) & (upper == 8):\n", + " overlapped_store[4](ptr, value, count)\n", + " elif (lower == 8) & (upper == 16):\n", + " overlapped_store[8](ptr, value, count)\n", + " elif (lower == 16) & (upper == 32):\n", + " overlapped_store[16](ptr, value, count)\n", + " elif (lower == 32) & (upper == 100):\n", + " memset_system(ptr, value, count)\n", + " else:\n", + " assert_param[False]()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "Let's now add an implementation for the other case, where we need to generate a\n", + "size check." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@adaptive\n", + "@always_inline\n", + "fn memset_impl_layer[\n", + " lower: Int, upper: Int\n", + "](ptr: BufferPtrType, value: ValueType, count: Int):\n", + " alias cur: Int\n", + " autotune_fork[Int, 0, 4, 8, 16, 32 -> cur]()\n", + "\n", + " assert_param[cur > lower]()\n", + " assert_param[cur < upper]()\n", + "\n", + " if count > cur:\n", + " memset_impl_layer[max(cur, lower), upper](ptr, value, count)\n", + " else:\n", + " memset_impl_layer[lower, min(cur, upper)](ptr, value, count)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we use `autotune_fork` to generate every check that is possible at that point.\n", + "\n", + "We discard values outside the current interval, and for the values within it\n", + "we recursively call this function on the two resulting sub-intervals.\n", + "\n", + "This is sufficient to generate multiple correct versions of memset, but to\n", + "achieve the best performance we need to take into account one more factor: when\n", + "we're dealing with such small sizes, even the code location matters a lot. 
E.g.\n", + "if we swap the Then and Else branches and invert the condition, the final\n", + "function might perform differently.\n", + "\n", + "To account for that, let's add one more implementation of our function, but now\n", + "with the branches swapped:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@adaptive\n", + "@always_inline\n", + "fn memset_impl_layer[\n", + " lower: Int, upper: Int\n", + "](ptr: BufferPtrType, value: ValueType, count: Int):\n", + " alias cur: Int\n", + " autotune_fork[Int, 0, 4, 8, 16, 32 -> cur]()\n", + "\n", + " assert_param[cur > lower]()\n", + " assert_param[cur < upper]()\n", + "\n", + " if count <= cur:\n", + " memset_impl_layer[lower, min(cur, upper)](ptr, value, count)\n", + " else:\n", + " memset_impl_layer[max(cur, lower), upper](ptr, value, count)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have defined the building blocks for our implementation; now we need to add a\n", + "top-level entry point that will kick off the recursion we've just defined.\n", + "\n", + "We will simply call our function with the interval [-100, 100]; the bounds -100 and 100\n", + "designate that no checks have been performed yet. This interval will be refined\n", + "as we generate more and more checks until we have enough to emit actual stores." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@adaptive\n", + "fn memset_autotune_impl(ptr: BufferPtrType, value: ValueType, count: Int):\n", + " memset_impl_layer[-100, 100](ptr, value, count)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "OK, we're done with our memset implementation. Now we just need to plug it into\n", + "the autotuning infrastructure to let the Mojo compiler do the search and pick the\n", + "best implementation.\n", + "\n", + "To do that, we need to define an evaluator: a function that takes\n", + "an array of function pointers to all implementations of our function and\n", + "returns the index of the best candidate.\n", + "\n", + "There are no limitations on how this function can be implemented: it can\n", + "return the first or a random candidate, or it can actually benchmark all of\n", + "them and pick the fastest, which is what we're going to do for this example." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fn memset_evaluator(funcs: memset_fn_ptr_type, size: Int) -> Int:\n", + "\n", + " print(\"memset_evaluator, number of candidates: \")\n", + " print(size)\n", + " let eval_begin: Int = now()\n", + "\n", + " # This size is picked at random, in real code we could use a real size\n", + " # distribution here.\n", + " print(\"Optimizing for size: \")\n", + " let size_to_optimize_for = 17\n", + " print(size_to_optimize_for)\n", + "\n", + " var best_idx: Int = -1\n", + " var best_time: Int = -1\n", + " var funcs_ptr = Pointer[memset_fn_sig_type](funcs).bitcast[memset_fn_type]()\n", + "\n", + " alias eval_iterations = 10000\n", + " alias eval_samples = 10\n", + "\n", + " # Find the function that's the fastest on the size we're optimizing for\n", + " for f_idx in range(size):\n", + " let func = funcs_ptr.load(f_idx)\n", + " let cur_time = measure_time(\n", + " func, size_to_optimize_for, eval_iterations, eval_samples\n", + " )\n", + " if best_idx < 0:\n", + " best_idx = f_idx\n", + " best_time = cur_time\n", + " if best_time > cur_time:\n", + " best_idx = f_idx\n", + " best_time = cur_time\n", + "\n", + " let eval_end: Int = now()\n", + " print(\"Time spent in memset_evaluator, ms: \")\n", + " print((eval_end - eval_begin) // 1000000)\n", + "\n", + " return best_idx" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The evaluator is ready, the last brush stroke is to add a function that will\n", + "call the best candidate.\n", + "\n", + "The search will be performed at compile time, and at runtime we will go\n", + "directly to the best implementation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fn memset_autotune(ptr: BufferPtrType, value: ValueType, count: Int):\n", + " # Get the set of all candidates\n", + " alias candidates = memset_autotune_impl.__adaptive_set\n", + "\n", + " # Use the evaluator to select the best candidate.\n", + " alias best_impl: memset_fn_sig_type\n", + " search[memset_fn_sig_type, VariadicList(candidates), memset_evaluator -> best_impl]()\n", + "\n", + " # Run the best candidate\n", + " return best_impl(ptr, value, count)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are now ready to benchmark our function, let's see how its performance looks!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#| CHECK: Mojo autotune memset\n", + "let fptr_autotune = __mlir_op.`kgen.addressof`[\n", + " _type:memset_fn_type,\n", + " callee:memset_autotune,\n", + " paramDecls : __mlir_attr.`#kgen`,\n", + "]()\n", + "benchmark(fptr_autotune, \"Mojo autotune memset\")" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/index.md b/examples/index.md new file mode 100644 index 000000000..50d115f48 --- /dev/null +++ b/examples/index.md @@ -0,0 +1,3 @@ +# Public Mojo notebooks + +All files in here will be publicly visible.