18.335J/6.337J: Introduction to Numerical Methods
This is the repository of course materials for the 18.335J/6.337J course at MIT, taught by Prof. Steven G. Johnson, in Spring 2019.
Lectures: Monday/Wednesday/Friday 3–4pm (2-190). Office Hours: Thursday 4–5pm (2-345).
Topics: Advanced introduction to numerical linear algebra and related numerical methods. Topics include direct and iterative methods for linear systems, eigenvalue decompositions and QR/SVD factorizations, stability and accuracy of numerical algorithms, the IEEE floating-point standard, sparse and structured matrices, and linear algebra software. Other topics may include memory hierarchies and the impact of caches on algorithms, nonlinear optimization, numerical integration, FFTs, and sensitivity analysis. Problem sets will involve use of Julia, a Matlab-like environment (little or no prior experience required; you will learn as you go).
Prerequisites: Understanding of linear algebra (18.06, 18.700, or equivalents). 18.335 is a graduate-level subject, however, so much more mathematical maturity, ability to deal with abstractions and proofs, and general exposure to mathematics is assumed than for 18.06!
Other Reading: Previous terms can be found in branches of the 18335 git repository. The course notes from 18.335 in much earlier terms can be found on OpenCourseWare. For a review of iterative methods, the online books Templates for the Solution of Linear Systems (Barrett et al.) and Templates for the Solution of Algebraic Eigenvalue Problems are useful surveys.
Grading: 33% problem sets (about six, ~biweekly). 33% take-home mid-term exam (posted Monday Apr. 8 and due Tuesday Apr. 9), 34% final project (one-page proposal due Friday March 22, project due Wednesday May 15).
- Psets will be submitted electronically via Stellar. Submit a good-quality PDF scan of any handwritten solutions and also a PDF printout of a Julia notebook of your computational solutions.
TA: Sung Woo Jeong.
Collaboration policy: Talk to anyone you want to and read anything you want to, with three exceptions: First, you may not refer to homework solutions from the previous terms in which I taught 18.335. Second, make a solid effort to solve a problem on your own before discussing it with classmates or googling. Third, no matter whom you talk to or what you read, write up the solution on your own, without having their answer in front of you.
Final Projects: The final project will be a 8–15 page paper (single-column, single-spaced, ideally using the style template from the SIAM Journal on Numerical Analysis), reviewing some interesting numerical algorithm not covered in the course. [Since this is not a numerical PDE course, the algorithm should not be an algorithm for turning PDEs into finite/discretized systems; however, your project may take a PDE discretization as a given "black box" and look at some other aspect of the problem, e.g. iterative solvers.] Your paper should be written for an audience of your peers in the class, and should include example numerical results (by you) from application to a realistic problem (small-scale is fine), discussion of accuracy and performance characteristics (both theoretical and experimental), and a fair comparison to at least one competing algorithm for the same problem. Like any review paper, you should thoroughly reference the published literature (citing both original articles and authoritative reviews/books where appropriate [rarely web pages]), tracing the historical development of the ideas and giving the reader pointers on where to go for more information and related work and later refinements, with references cited throughout the text (enough to make it clear what references go with what results). (Note: you may re-use diagrams from other sources, but all such usage must be explicitly credited; not doing so is plagiarism.) Model your paper on academic review articles (e.g. read SIAM Review and similar journals for examples).
Frequently asked questions about the final project:
- Does it have to be about numerical linear algebra? No. It can be any numerical topic (basically, anything where you are computing a conceptually real result, not integer computations), excluding algorithms for discretizing PDEs.
- Can I use a matrix from a discretized PDE? Yes. You can take a matrix from the PDE as input and then talk about iterative methods to solve it, etcetera. I just don't want the paper to be about the PDE discretization technique itself.
- How formal is the proposal? Very informal—one page describing what you plan to do, with a couple of references that you are using as starting points. Basically, the proposal is just so that I can verify that what you are planning is reasonable and to give you some early feedback.
- How much code do I need to write? A typical project (there may be exceptions) will include a working proof-of-concept implementation, e.g. in Julia or Python or Matlab, that you wrote to demonstrate that you understand how the algorithm works. Your code does not have to be competitive with "serious" implementations, and I encourage you to download and try out existing "serious" implementations (where available) for any large-scale testing and comparisons.
- How should I do performance comparisons? Be very cautious about timing measurements: unless you are measuring highly optimized code or only care about orders of magnitude, timing measurements are more about implementation quality than algorithms. Better to measure something implementation independent (like flop counts, or matrix-vector multiplies for iterative algorithms, or function evaluations for integrators/optimizers), although such measures have their own weaknesses.
Lecture Summaries and Handouts
Lecture 1 (Feb 6)
- pset 1 and accompanying notebook, due Friday Feb. 15.
- Newton's method for square roots and accompanying notebook.
Brief overview of the huge field of numerical methods, and outline of the small portion that this course will cover. Key new concerns in numerical analysis, which don't appear in more abstract mathematics, are (i) performance (traditionally, arithmetic counts, but now memory access often dominates) and (ii) accuracy (both floating-point roundoff errors and also convergence of intrinsic approximations in the algorithms).
As a starting example, considered the convergence of Newton's method (as applied to square roots); see the handout and Julia notebook above.
Further reading: Googling "Newton's method" will find lots of references; as usual, the Wikipedia article on Newton's method is a reasonable starting point. Beware that the terminology for the convergence order (linear, quadratic, etc.) is somewhat different in this context from the terminology for discretization schemes (first-order, second-order, etc.); see e.g. the linked Wikipedia article. Homer Reid's notes on machine arithmetic for 18.330 are an excellent introduction that covers several applications and algorithms for root-finding. For numerical computation in 18.335, we will be using the Julia language: see this information on Julia at MIT.
Julia tutorial (Feb 8: 5pm in 32-141) — optional
On Friday, 8 February, at 5pm in 32-141, I will give an (attendance-optional!) Julia tutorial, introducing the Julia programming language and environment that we will use this term. Please see the tutorial notes online.
Please bring your laptops, and try to install Julia and the IJulia interface first via the abovementioned tutorial notes. Several people will be at the tutorial session to help answer installation questions. Alternatively, you can use Julia online at JuliaBox without installing anything (although running things on your own machine is usually faster).
Lecture 2 (Feb 8)
- notes on floating-point (18.335 Fall 2007; also slides)
- Julia floating-point notebook
- some floating-point myths
New topic: Floating-point arithmetic
The basic issue is that, for computer arithmetic to be fast, it has to be done in hardware, operating on numbers stored in a fixed, finite number of digits (bits). As a consequence, only a finite subset of the real numbers can be represented, and the question becomes which subset to store, how arithmetic on this subset is defined, and how to analyze the errors compared to theoretical exact arithmetic on real numbers.
In floating-point arithmetic, we store both an integer coefficient and an exponent in some base: essentially, scientific notation. This allows large dynamic range and fixed relative accuracy: if fl(x) is the closest floating-point number to any real x, then |fl(x)-x| < ε|x| where ε is the machine precision. This makes error analysis much easier and makes algorithms mostly insensitive to overall scaling or units, but has the disadvantage that it requires specialized floating-point hardware to be fast. Nowadays, all general-purpose computers, and even many little computers like your cell phones, have floating-point units.
Went through some simple examples in Julia (see notebook above), illustrating basic syntax and a few interesting tidbits, in particular on the accuracy of summation algorithms, that we will investigate in more detail later.
Overview of floating-point representations, focusing on the IEEE 754 standard (see also handout from previous lecture). The key point is that the nearest floating-point number to x, denoted fl(x), has the property of uniform relative precision (for |x| and 1/|x| < than some range, ≈10308 for double precision) that |fl(x)−_x_| ≤ εmachine|x|, where εmachine is the relative "machine precision" (about 10−16 for double precision). There are also a few special values: ±Inf (e.g. for overflow), NaN, and ±0 (e.g. for underflow).
Further reading: What Every Computer Scientist Should Know About Floating Point Arithmetic (David Goldberg, ACM 1991). William Kahan, How Java's floating-point hurts everyone everywhere (2004): contains a nice discussion of floating-point myths and misconceptions. Trefethen, lecture 13. Homer Reid's notes on machine arithmetic for 18.330 are another nice introduction to this material.
Lecture 3 (Feb 11)
- notes on the accuracy of floating-point summation
- notes on backwards stability of summation — in class, I instead showed the same thing using the previous accuracy notes as a starting point, but similarly using the L₁ norm.
Analyzed the accumulated floating-point roundoff errors (see notes), explaining the results that we observed experimentally in the Julia notebook of the previous lecture. In those experiments, all of the inputs were non-negative, so that the "condition number" factor in the derivation equalled 1. In general, though, if you have cancellations between summands of opposite signs, the same analysis shows that the relative error of the output can be arbitrarily large.
Stability: Gave the obvious definition of accuracy, what we might call "forwards stability" = almost the right answer for the right input. Showed that this is often too strong; e.g. adding a sequence of numbers is not forwards stable. (Basically because the denominator in the relative forwards error, which is the exact sum, can be made arbitrarily small via cancellations.)
Define asymptotic notation O(ε): f(ε) is O(g(ε)) if there exist some constants C, ε₀ > 0 such that |f(ε)| < C|g(ε)| for all |ε|<ε₀. That is, g(ε) is an asymptotic upper bound for f(ε) as ε goes to zero, ignoring constant factors C. (A similar notation is used in computational complexity theory, but in the limit of large arguments n.) In the definitions of stability, we technically require uniform convergence: we must have O(ε) errors with the same constants C and ε₀ independent of the inputs x. (The constants can depend on the dimension of x, however.)
More generally, we apply a weaker condition: "stability" = almost the right answer for almost the right input. (Gave the technical version of this, from the book.)
Often, it is sufficient to prove "backwards stability" = right answer for almost the right input. Showed that, in our example of adding a sequence of numbers, backwards stability seems to work where forwards stability failed. Then more rigorously proved that floating-point summation of n numbers is backwards stable.
Further reading: See the further reading from the previous lecture. Trefethen, lectures 14, 15, and 3. See also the Wikipedia article on asymptotic ("big O") notation; note that for expressions like O(ε) we are looking in the limit of small arguments rather than of large arguments (as in complexity theory), but otherwise the ideas are the same.
Lecture 4 (Feb 13)
When quantifying errors, a central concept is a norm, and we saw in our proof of backwards stability of summation that the choice of norm seems important. Defined norms (as in lecture 3 of Trefethen): for a vector space V, a norm takes any v∈V and gives you a real number ‖v‖ satisfying three properties:
- Positive definite: ‖v‖≥0, and =0 if and only if v=0
- Scaling: ‖αv‖ = |α|⋅‖v‖ for any scalar α.
- Triangle inequality: ‖v+w‖≤‖v‖+‖w‖
There are many norms, for many different vector spaces. Gave examples of norms of column vectors: Lₚ norms (usually p = 1, 2, or ∞) and weighted L₂ norms. A complete (= continuous, essentially) normed vector space is called a Banach space, and these error concepts generalize to f(x) when f and x are in any Banach spaces (including scalars, column vectors, matrices, …though infinite-dimensional Banach spaces are trickier).
Especially important in numerical analysis are functions where the inputs and/or outputs are matrices, and for these cases we need matrix norms. The most important matrix norms are those that are related to matrix operations. Started with the Frobenius norm. Related the Frobenius norm ‖A‖F to the square root of the sum of eigenvalues of A*A, which are called the singular values σ²; we will do much more on singular values later, but for now noted that they equal the squared eigenvalues of A if A*=A (Hermitian). Also defined the induced matrix norm, and bounded it below by the maximum eigenvalue magnitude of A (if A is square). For the L₂ induced norm, related it (without proof for now) to the maximum singular value. A useful property of the induced norm is ‖AB‖≤‖A‖⋅‖B‖. Multiplication by a unitary matrix Q (Q* = Q⁻¹) preserves both the Frobenius and L₂ induced norms.
Equivalence of norms. Described fact that any two norms are equivalent up to a constant bound, and showed that this means that stability in one norm implies stability in all norms. Sketched proof (only skimmed this): (i) show we can reduce the problem to proving any norm is equivalent to e.g. L₁ on (ii) the unit circle; (iii) show any norm is continuous; and (ii) use a result from real analysis: a continuous function on a closed/bounded (compact) set achieves its maximum and minimum, the extreme value theorem. See notes handout.
Further reading: Trefethen, lecture 3. If you don't immediately recognize that A*A has nonnegative real eigenvalues (it is positive semidefinite), now is a good time to review your linear algebra; see also Trefethen lecture 24.
Lecture 5 (Feb 15)
Relate backwards error to forwards error, and backwards stability to forwards error (or "accuracy" as the book calls it). Show that, in the limit of high precision, the forwards error can be bounded by the backwards error multiplied by a quantity κ, the relative condition number. The nice thing about κ is that it involves only exact linear algebra and calculus, and is completely separate from considerations of floating-point roundoff. Showed that, for differentiable functions, κ can be written in terms of the induced norm of the Jacobian matrix.
Calculated condition number for square root, summation, and matrix-vector multiplication, as well as solving Ax=b, similar to the book. Defined the condition number of a matrix: for f(x)=Ax, the condition number is ‖A‖⋅‖x‖/‖Ax‖, which is bounded above by κ(A)=‖A‖⋅‖A⁻¹‖.
Related matrix _L_2 norm to eigenvalues of B=A*A. B is obviously Hermitian (B*=B), and with a little more work showed that it is positive semidefinite: x*_B__x_≥0 for any x. Reviewed and re-derived properties of eigenvalues and eigenvectors of Hermitian and positive-semidefinite matrices. Hermitian means that the eigenvalues are real, the eigenvectors are orthogonal (or can be chosen orthogonal). Also, a Hermitian matrix must be diagonalizable (I skipped the proof for this; we will prove it later in a more general setting). Positive semidefinite means that the eigenvalues are nonnegative.
Proved that, for a Hermitian matrix B, the Rayleigh quotient R(x)=x*Bx/x*x is bounded above and below by the largest and smallest eigenvalues of B (the "min–max theorem"). Hence showed that the L2 induced norm of A is the square root of the largest eigenvalue of B=A*A. Similarly, showed that the L₂ induced norm of A⁻¹, or more generally the supremum of ‖x‖/‖Ax‖, is equal to the square root of the inverse of the smallest eigenvalue of A*.A
Understanding norms and condition numbers of matrices therefore reduces to understanding the eigenvalues of A*A (or A__A*). However, looking at it this way is unsatisfactory for several reasons. First, we would like to solve one eigenproblem, not two. Second, working with things like A*A explicitly is often bad numerically, because it squares the condition number [showed that κ(A*A)=κ(A)2] and hence exacerbates roundoff errors. Third, we would really like to get some better understanding of A itself. All of these concerns are addressed by the singular value decomposition or SVD, which we will derive next time.
Further reading: Trefethen, lectures 12, 14, 15, 24. See any linear-algebra textbook for a review of eigenvalue problems, especially Hermitian/real-symmetric ones. See also these notes from 18.06.
Lecture 6 (Feb 19 — Tuesday is an "MIT Monday")
Guest lecture by MIT Instructor Dr. Chris Rackauckas: Modern Differential Equations Solver Software: Where We Are and Where We're Headed:
In this talk we will discuss the current state of software in differential equations and see how the continued advances in computer science and numerical methods are likely to impact our software in the near future. Issues such as efficiency improvements for stiff and non-stiff differential equations will be addressed from a numerical analysis standpoint but backed with recent benchmarking results. Newer mathematical topics like random ordinary differential equations, jump diffusion equations, and adaptivity for stochastic differential equations will be introduced and the successes and limitations in current automatic software solutions will be discussed. We will close with a discussion on how recent computational advancements have been influencing the software implementations, specifically showing the effects of generic typing over abstract algorithms and implicit parallelism.
Lecture 7 (Feb 20)
Guest lecture by Prof. Alan Edelman: the SVD, relationship to L₂ norms and condition numbers, applications (e.g. principal components analysis).
Lecture 8 (Feb 22)
Guest lecture by Prof. Alan Edelman: Least-square problems, why the "normal" equations square the condition number, and the alternatives of SVD or QR.