## All problems

Many of the problems ask for *explanations* of answers and calculations. For these you should write one to several complete sentences that explain the reasoning behind your work in a manner that would be helpful to a fellow student. 

## Problem

What is the next `Float64` number after 6.0? Determine the precise value based on the structure of the 64-bit floating point number system as presented in lecture. Describe your reasoning fully (e.g. write a few complete sentences in English which would explain your reasoning clearly to a fellow student). Then confirm your answer with a calculation in Julia. 
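A minimal sketch of the kind of confirmation the problem asks for, assuming the standard IEEE 754 layout with a 52-bit fraction (the variable names are illustrative). Since $6 = 1.5 \times 2^2$, neighboring `Float64`s in $[4, 8)$ should be spaced $2^{2-52} = 2^{-50}$ apart.

```julia
# 6.0 = 1.5 * 2^2 lies in [4, 8), where the Float64 spacing is 2^(2-52) = 2^-50
x   = 6.0
gap = 2.0^-50

xnext     = nextfloat(x)   # built-in: smallest Float64 greater than x
predicted = x + gap        # exact, since 6 + 2^-50 is representable in this range

xnext == predicted         # should agree if the reasoning above is right
```

The built-in `eps(6.0)` reports the same spacing and can serve as a second check.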

## Problem 

What range of integers can be represented exactly in the `Float64` number system? That is, what is the maximum value of $N$ for which all integers between $-N$ and $N$ are represented exactly as `Float64`s? As before, determine the answer based on the structure of the 64-bit floating-point number system as presented in lecture, and explain your reasoning. Then confirm your answer with a few calculations in Julia, and explain how the calculations confirm your expectations.
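One possible set of confirming calculations, sketched under the assumption that the limit comes from the 53 significant bits of a `Float64` (52 stored plus 1 implicit), so that the candidate is $N = 2^{53}$:

```julia
N = 2.0^53                      # candidate limit, assumed from 53 significant bits

below = (N - 1.0) + 1.0 == N    # N - 1 is representable, so adding 1 is exact
above = N + 1.0 == N            # N + 1 is not representable and rounds back to N
gap   = eps(N)                  # spacing between adjacent Float64s at N
```

If the reasoning holds, `below` and `above` are both `true` and `gap` is `2.0`, meaning odd integers just above $N$ are skipped.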

## Problem 

There is a debate underway in the Julia development community about whether the time library should represent time with floating-point numbers or with integers (https://discourse.julialang.org/t/why-do-time-quantities-have-to-be-integers/5864/51). Suppose you want to measure time with millisecond accuracy near the present, but also represent dates as far back as 0 BC, using just a single number. What kind of number should you use, integer or floating-point? How many bits would you need? Please explain your reasoning fully.
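The bit count can be estimated with a short calculation. This is only a sketch of the arithmetic, and it assumes that "the present" is the year 2017 and that a year averages 365.25 days; neither figure comes from the problem statement.

```julia
# Assumptions (not from the source): present year 2017, average year of 365.25 days
years       = 2017
ms_per_year = 365.25 * 24 * 60 * 60 * 1000
total_ms    = years * ms_per_year            # about 6.4e13 milliseconds

bits_needed     = ceil(Int, log2(total_ms))  # bits for an unsigned millisecond count
fits_in_float64 = total_ms < 2.0^53          # integer counts below 2^53 are exact in Float64
```

Under these assumptions the count needs 46 bits, plus one more if a sign is required.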

## Problems X-Y

For each of problems X through Y, use the Conditioning and Accuracy Theorem to estimate a value for the expected relative error $|\tilde{f} - f|\,/\,|f|$ for the floating-point calculation $\tilde{f}(x)$ of the mathematical problem $f(x)$ in 64-bit arithmetic. Then devise a calculation in Julia that confirms your expectations. Explain each of your answers.
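For reference, the estimate can be written as below. This assumes the theorem is the standard accuracy result for backward-stable algorithms (it is not restated here), with $\kappa(x)$ the relative condition number of $f$ at $x$ and $\epsilon_{\text{machine}} \approx 10^{-16}$ for `Float64`.

$$
\frac{|\tilde{f}(x) - f(x)|}{|f(x)|} = O\!\left(\kappa(x)\,\epsilon_{\text{machine}}\right)
$$

For example, a problem with $\kappa \approx 1$ would be expected to show a relative error of roughly $10^{-16}$.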

## Problem 1

Scalar multiplication $f(x) = cx$. The condition number is $\kappa = 1$.


In [4]:
c = BigFloat(3//19)
x = BigFloat(7//11)

f = c*x
f̃ = Float64(c)*Float64(x)

abs(f̃ - f)/abs(f)

3.039896376949833384493277186439150855654761904761904761904762444522439455307514e-17

In [5]:
c = BigFloat(3//19)*1e99
x = BigFloat(7//11)*1e49

f = c*x
f̃ = Float64(c)*Float64(x)

abs(f̃ - f)/abs(f)

8.39966623510554109559019391867281175106050489360511630169919801987848897855892e-18

## Problem 2

Scalar addition $f(x_1, x_2) = x_1 + x_2$.

In [8]:
x₁ = BigFloat(3//19)*1e19
x₂ = BigFloat(4//17)

f = x₁ + x₂
f̃ = Float64(x₁) + Float64(x₂)

abs(f̃ - f)/abs(f)

2.545098039215686274130534409842368319933488989905843152323636720082523588367262e-17

## Problem 3

Scalar subtraction, $f(x_1, x_2) = x_2 - x_1$. Let real numbers $x_1 = 129/13$ and $x_2 = x_1 + \delta$ where $\delta = 10^{-13}$. In the real numbers, $x_2 - x_1 = \delta = 10^{-13}$. What is the expected relative error $|\tilde{f} - f|/|f|$ for the floating-point calculation of $x_2 - x_1$ in 64-bit arithmetic, based on the Conditioning and Accuracy Theorem? Show your reasoning. Then devise a calculation in Julia that confirms your expectations. 


In [28]:
δ = 1e-13
x₁ = BigFloat(129//13)
x₂ = x₁ + δ

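# Note: this computes x₁ - x₂, the negative of x₂ - x₁ from the problem statement.
# The relative error is the same for either sign.
f = x₁ - x₂
f̃ = Float64(x₁) - Float64(x₂)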

abs(f̃ - f)/abs(f), f̃ - f, abs(f)

(5.240169935859769995008000450211218502739276674031611248936183852520240327738815e-03, 5.240169935859770154171588790995913603471684227841365100175607949495315551757812e-16, 1.000000000000000030373745563400370913603471684227841365100175607949495315551758e-13)
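The observed error can be compared with a $\kappa\,\epsilon$ estimate. The sketch below assumes the relative condition number of subtraction is $\kappa = (|x_1| + |x_2|)\,/\,|x_2 - x_1|$ and takes the unit roundoff as `eps(Float64)/2`; both are assumptions, since neither definition is spelled out above.

```julia
# Assumptions: κ = (|x₁| + |x₂|) / |x₂ - x₁| and unit roundoff eps(Float64)/2 = 2^-53
x₁ = BigFloat(129//13)
x₂ = x₁ + 1e-13

κ        = (abs(x₁) + abs(x₂)) / abs(x₂ - x₁)   # roughly 2e14
expected = Float64(κ) * eps(Float64) / 2         # rough scale of the relative error

f = x₂ - x₁
f̃ = Float64(x₂) - Float64(x₁)
observed = Float64(abs(f̃ - f) / abs(f))
```

With these definitions `expected` is about $2 \times 10^{-2}$, so an observed relative error of about $5 \times 10^{-3}$ is consistent with the estimate.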

## Problem 4

Power function $f(x) = x^n$.

In [21]:
x = big(e)        # Euler's number; written big(ℯ) in Julia 1.0 and later
n = big(179.0)

f = x^n
f̃ = Float64(x)^Float64(n)

abs(f̃ - f)/abs(f)

9.502680312005280031501588750876338363372820039456421762928766431671837048594457e-15

## Problem 5

Solution of a linear system $Ax = b$.

In [29]:
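# Build a random m×m matrix with condition number κ as the product U*Σ*V'
function randA(m, κ)
    σ = logspace(0, -log10(κ), m)   # singular values from 1 down to 1/κ
    Σ = diagm(σ)
    U,tmp = qr(randn(m,m))          # random orthogonal factors from QR
    V,tmp = qr(randn(m,m))
    A = U*Σ*V'
end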


randA (generic function with 1 method)

In [47]:
A = randA(5, 1e15)

5×5 Array{Float64,2}:
  0.22244     0.0205782    0.0612183  -0.0435877    0.034031  
 -0.0392961  -0.00363137  -0.0108725   0.00772521  -0.00605824
  0.157224    0.0145378    0.0433742  -0.0308536    0.0241373 
  0.251357    0.0232516    0.0692013  -0.0492648    0.0384748 
 -0.856329   -0.0791933   -0.236062    0.167969    -0.131323  

In [48]:
x = randn(5)

5-element Array{Float64,1}:
 -0.225303
 -0.671943
 -0.166272
  0.40871 
  0.311427

In [49]:
b = A*x

5-element Array{Float64,1}:
 -0.0813392
  0.014372 
 -0.0574966
 -0.0919145
  0.313151 

In [50]:
x̂ = A\b

5-element Array{Float64,1}:
 -0.225463
 -0.670227
 -0.168956
  0.406885
  0.313927

In [51]:
norm(x̂ - x)/norm(x)

0.0049876934733064805
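The same experiment can be sketched for Julia 1.x, where `logspace` is no longer available. The function name `randA_modern` is hypothetical, and a moderate $\kappa = 10^6$ is assumed here so that the comparison with the $\kappa\,\epsilon$ estimate is easy to reproduce.

```julia
using LinearAlgebra

# Hypothetical Julia 1.x counterpart of randA: random m×m matrix with
# singular values logarithmically spaced between 1 and 1/κ.
function randA_modern(m, κ)
    σ = 10.0 .^ range(0, stop=-log10(κ), length=m)
    U = Matrix(qr(randn(m, m)).Q)   # random orthogonal factors
    V = Matrix(qr(randn(m, m)).Q)
    return U * Diagonal(σ) * V'
end

A = randA_modern(5, 1e6)
x = randn(5)
b = A * x
x̂ = A \ b

expected = cond(A) * eps(Float64)      # κ·ε estimate of the relative error
observed = norm(x̂ - x) / norm(x)
```

For $\kappa = 10^{15}$, as in the cell above, the same estimate gives roughly $0.2$, and the observed relative error of about $0.005$ falls within it.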

In [67]:
Ã = convert(Array{Float64,2}, A)

5×5 Array{Float64,2}:
  0.0676934  -0.408041   0.0710563   0.149639    0.180646 
  0.075827   -0.456243   0.0794483   0.167331    0.202163 
  0.0626421  -0.375666   0.0654147   0.137801    0.166727 
  0.0209572  -0.12649    0.0220272   0.0463842   0.0559637
 -0.0718268   0.434047  -0.0755867  -0.159157   -0.191925 

In [68]:
cond(Ã)

9.99790065924683e12

In [69]:
b̃ = convert(Array{Float64,1}, b)

5-element Array{Float64,1}:
 -0.641846
 -1.76862 
  0.877396
  1.11914 
 -1.49699 

In [70]:
x̃ = Ã\b̃

5-element Array{Float64,1}:
 -1.14531e12
  7.05403e12
  3.96642e12
  1.86205e13
 -6.21812e11

In [71]:
norm(x̃-x)/norm(x)

6.172061329012690037938807242421238353323318282237103288748085107287282881047971e-05