<div class="alert alert-info">
    <h1> About these notebooks </h1>
    <p> If you opened this notebook in Binder, it is running on a server that was launched <b>just</b> for you. Your changes will be reset once the server restarts due to inactivity, so don't rely on it for anything you want to keep. Likewise, feel free to try and tweak anything you want, since you won't affect the original repository. And <b>if you have never used a Jupyter notebook:</b> to run a cell, just press <tt>Shift + Enter</tt>. </p>
    <p>Enjoy!</p>
</div>

# 1.1 Introduction

The objective of this notebook is to get you started in Julia with a real example: the Gradient Descent algorithm. We have chosen this algorithm because:

* **The mathematical problem is simple and ubiquitous**: given a function, find a minimum. 
* **The algorithm formulation is simple**: just take small leaps following the steepest path down.
* **The algorithm admits plenty of variations**: we can play with several of them to explore the Julia language in greater depth.


# 1.2 Mathematical formulation

*Gradient descent* is one of the simplest methods for finding optima. As [Wikipedia](https://en.wikipedia.org/wiki/Gradient_descent) puts it, its core idea is that a differentiable function $f(x)$ decreases fastest if one follows the direction of $-\nabla f$ (the direction of steepest descent). Since this holds in general only close to the current point, gradient descent takes a small step in that direction and then repeats, which gives the following sequence:

$$ x_{n+1} = x_n - \alpha \nabla f(x_n) $$

where $\alpha$ is a parameter (the step size, or *learning rate*) chosen so that $f$ keeps decreasing at each step (or at least we try to make it so). If $\alpha$ is too large, the next iterate could land even farther from the minimum! The resulting sequence $x_n$ will look something like this:

![gradient descent gif](figures/gradient_descent.gif)

(By the way, we will also learn to generate animations like this. Yay!)
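
To make the warning about $\alpha$ concrete, here is a short worked case of our own, using the simple function $f(x) = x^2$, whose gradient is $\nabla f(x) = 2x$. Substituting into the update rule gives

$$ x_{n+1} = x_n - \alpha \cdot 2 x_n = (1 - 2\alpha)\, x_n \quad \Longrightarrow \quad x_n = (1 - 2\alpha)^n x_0 $$

so the iterates approach the minimum at $x = 0$ only when $|1 - 2\alpha| < 1$, that is, when $0 < \alpha < 1$. For $\alpha > 1$ each step lands farther from the minimum than the previous one.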



# 1.3 The pseudo-code

We then want some code that, given a function `f`, its derivative or gradient `Df`, and a starting point `x`, returns a minimum and the value of `f` at this minimum. The code should look something like this:

```julia
function gradient_descent(f, Df, x; alpha = 0.1)
    while (x is not close enough to a minimum)
        x = x - alpha*Df(x)
    end
    return x, f(x)
end
```

If you are not familiar with the algorithm you may visit [Wikipedia's page](https://en.wikipedia.org/wiki/Gradient_descent#Computational_examples) and check the code in your favorite programming language.

Let's learn, little by little, all we need to code such a function!

# 1.4 Variables, operations and functions

Variable declaration and arithmetic operations won't look challenging.

In [None]:
x = 1
s = "Hello world!"
tup = (x, s) # This is a tuple, an immutable collection of (possibly distinct) elements

println(x)   # println adds a newline at the end; the function print doesn't
println(s)
println(tup)

In [None]:
y = x + 2
z = y * 5
z += 1 # shorthand for z = z + 1

print("The final value of z is $z") # The dollar sign interpolates a variable into a string
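
Interpolation is not limited to plain variables. As a small sketch (the names `a`, `b` and `msg` are ours), wrapping an expression in `$( )` evaluates it before inserting the result into the string:

```julia
a = 3
b = 4

# $a and $b insert the variables; $(a + b) inserts the value of the expression
msg = "The sum of $a and $b is $(a + b)"
println(msg)
```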

The standard way of defining a function begins with the `function` keyword and must end with an `end`. A function returns the value of its last evaluated expression (or whatever follows a `return`):

In [1]:
function f(x)
    x^2
end

f(-2)

4

In [2]:
# But sometimes it's simpler to just do:
Df(x) = 2*x

Df(1)

2
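
A function can also hand back several values at once by returning a tuple, which the caller can unpack into separate variables. The sketch below uses a made-up helper, `value_and_slope`, that is not part of the notebook; it only illustrates the pattern that our gradient descent will use to return both a point and the value of the function there.

```julia
# Hypothetical helper: returns x^2 and its derivative 2x at the same point
function value_and_slope(x)
    return x^2, 2*x   # the comma builds a tuple
end

val, slope = value_and_slope(3)   # unpack the tuple into two variables
println("value = $val, slope = $slope")
```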

# 1.5 Loops

`for` loops will be familiar to **Matlab** users:

In [None]:
for i=1:5
    print(i," ")
end

Anyway, **Python** users will also feel *almost* at home:

In [None]:
for i in 1:5
    print(i," ")
end

A nice feature of Julia is that nested loops can be written very compactly:

In [None]:
for i in 1:2, j in 3:4
    println((i,j))
end

Of course there are also while-loops:

In [None]:
i = 1
while i <= 5
    print(i, " ")
    i+=1
end

And `if` statements won't bring many surprises either:

In [None]:
a = 0
if a < 0
    print("a is negative")
elseif a > 0
    print("a is positive")
else
    print("a is zero")
end

<div class="alert alert-info">
    <h3>A quick but important warning about scopes</h3>
    <p> A frequent cause of frustration in Julia comes from defining a variable inside a loop and trying to access it later from outside. Outrageously, this fails to work. Why is that?</p>
    <p> Well, it turns out that, regarding variables, Julia has a <i>global scope</i> (the top level of a Jupyter Notebook cell, the REPL...) and <i>local scopes</i> (a function, a <tt>for</tt> loop...). A variable that is declared inside a local scope cannot be accessed from outside; this behavior is pretty common in other languages when it comes to functions, <b>but</b> in Julia it also extends to loops, which can be unexpected at first. </p>
    <p> Check the following code and try to fix it:  </p>
</div>


In [None]:
for i in 1:5
    b = i
    print(b," ")
end
b

However, `if` statements don't introduce any new scope:

In [None]:
if true
    c = 1
end
c
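
Coming back to the loop that failed above, one possible fix (a sketch, not the only option) is to put the loop inside a function and create the variable before the loop starts; the loop then updates that existing variable instead of creating a new one of its own. Outside a function, marking the assignment inside the loop with the `global` keyword has a similar effect. The name `last_index` is made up for this illustration.

```julia
# Hypothetical helper: returns the last value taken by the loop variable
function last_index(n)
    b = 0            # b exists before the loop starts
    for i in 1:n
        b = i        # updates the b defined above, no new variable is created
    end
    return b         # b is still visible here, after the loop
end

result = last_index(5)
println(result)
```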

# 1.6 A first version of the gradient descent

With all these tools we are ready to code a preliminary version of the gradient descent algorithm:

In [None]:
function gradient_descent(f, Df, x)
    α = 0.1
    
    for i in 1:100
        x = x - α*Df(x)
    end
    
    return (x, f(x))
end

(By the way, Julia allows Unicode characters in variable names. To type `α` while coding, just write `\alpha` and press Tab; it will be replaced by `α`.)

Notice that right now we have a fixed number of iterations, and the learning rate is fixed to `α = 0.1`. However, it kind of does the job; for `f` and `Df` as defined above it yields:

In [None]:
x_opt, f_opt = gradient_descent(f, Df, 1) 

Since the true values of `x_opt` and `f_opt` are both 0, the results are really good!
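
For this particular `f` we can double-check the result by hand: each update replaces `x` by `x - 0.1*2x = 0.8x`, so after 100 iterations starting from 1 we expect roughly `0.8^100`. The sketch below repeats the definitions from above so that it runs on its own, and compares both numbers up to floating-point rounding; the variable `expected` is ours.

```julia
f(x) = x^2
Df(x) = 2*x

function gradient_descent(f, Df, x)
    α = 0.1
    for i in 1:100
        x = x - α*Df(x)
    end
    return (x, f(x))
end

x_opt, f_opt = gradient_descent(f, Df, 1)
expected = 0.8^100   # closed-form value of the 100th iterate for f(x) = x^2

println("x_opt    = $x_opt")
println("expected = $expected")
println("They agree: $(isapprox(x_opt, expected; rtol = 1e-6))")
```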

We would now like a bit more flexibility and robustness. Let us include `α` as an optional *keyword argument* (note the `;` in the signature), and also add a *docstring* before the function, that is, a help string that explains what the function does. This docstring is also what you will see when you press `Shift + Tab` to ask for help.

In [None]:
""" Second version of the gradient descent, now with alpha as an optional argument"""
function gradient_descent(f, Df, x; α = 0.1)
    
    for N_iter in 1:100
        x = x - α*Df(x)
    end
    return (x, f(x))
end

println("Testing our implementation:")
println(gradient_descent(f, Df, 1))

# Change the learning rate
println("\nTesting the learning rate:")
println(gradient_descent(f, Df, 1, α=0.01))
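
To put the two runs side by side, the sketch below (which repeats the definitions above so that it runs on its own) performs the same 100 iterations with both learning rates and prints the distance to the minimum at 0. The names `x_fast` and `x_slow` are ours. It also shows that a keyword argument is always passed by name, after the positional ones.

```julia
f(x) = x^2
Df(x) = 2*x

function gradient_descent(f, Df, x; α = 0.1)
    for N_iter in 1:100
        x = x - α*Df(x)
    end
    return (x, f(x))
end

x_fast, _ = gradient_descent(f, Df, 1)            # uses the default α = 0.1
x_slow, _ = gradient_descent(f, Df, 1; α = 0.01)  # keyword argument passed by name

println("α = 0.1  -> distance to the minimum: $(abs(x_fast))")
println("α = 0.01 -> distance to the minimum: $(abs(x_slow))")
```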

As expected, decreasing the learning rate slows down convergence; in this case we end up much farther from the minimum. Let's keep improving our `gradient_descent` in the following exercise:

#### Exercise 1:

Improve the `gradient_descent` function by adding:

* *a tolerance and a maximum number of iterations*: stop iterating as soon as the absolute value of `Df(x)` is smaller than the tolerance or the maximum number of iterations has been reached (**Hint**: in Julia the logical "and" operator is written `&&`).
* *progress info*: print the current number of iterations, the value of `x` and that of `f(x)` every 100 iterations (**Hint**: in the second code cell of this notebook you can see how to format variables into text).
* *a `verbose` parameter*, to turn the progress info on or off.
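
As a warm-up for the first item (and not a solution to the exercise), here is a sketch of a `while` loop that keeps running only while two conditions hold at the same time. The function `halve_until_small` and its parameters are made up for this illustration; it combines the conditions with `&&`, the short-circuit form of "and".

```julia
# Hypothetical example: halve x until it drops below tol,
# or until maxiter halvings have been done, whichever comes first
function halve_until_small(x; tol = 1e-3, maxiter = 100)
    n = 0
    while abs(x) >= tol && n < maxiter
        x = x / 2
        n += 1
    end
    return x, n
end

x_small, n_steps = halve_until_small(1.0)
println("Stopped after $n_steps steps with x = $x_small")

# With a tiny iteration budget, the second condition stops the loop first
x_early, n_early = halve_until_small(1.0; maxiter = 3)
println("Stopped after $n_early steps with x = $x_early")
```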



(A solution is given in the `Exercise solutions` notebook)

In [5]:
""" Third version of the gradient descent. Stops when the gradient is smaller than `TOL`, 
    or when the maximum number of iterations `maxiter` has been reached"""
function gradient_descent(f, Df, x; α = 0.1, TOL = 1e-10, maxiter = 1000, verbose = false)
    
    # Here goes your code
    
    return (x, f(x))
end

gradient_descent

In [4]:
# Let's test it:
gradient_descent(f, Df, 1, α = 0.01, maxiter = 10000, verbose = true)

Iter. 100,	x = 0.13261955589475316,	f(x) = 0.017587946605721556
Iter. 200,	x = 0.017587946605721556,	f(x) = 0.00030933586580571244
Iter. 300,	x = 0.002332505667951425,	f(x) = 5.440582691025523e-6
Iter. 400,	x = 0.00030933586580571244,	f(x) = 9.568867787376974e-8
Iter. 500,	x = 4.102398514547257e-5,	f(x) = 1.6829673572159537e-9
Iter. 600,	x = 5.4405826910255245e-6,	f(x) = 2.959994001788654e-11
Iter. 700,	x = 7.215276602924866e-7,	f(x) = 5.2060216456715e-13
Iter. 800,	x = 9.56886778737698e-8,	f(x) = 9.156323073230082e-15
Iter. 900,	x = 1.2690189963775444e-8,	f(x) = 1.61040921316707e-16
Iter. 1000,	x = 1.6829673572159529e-9,	f(x) = 2.8323791254544486e-18
Iter. 1100,	x = 2.2319438349934617e-10,	f(x) = 4.981573282565321e-20


(4.9049991751351605e-11, 2.4059016908076606e-21)

# Bonus: Multiple dispatch: things start getting different

Imagine for a moment that we don't want to care much about the initial point `x`. This may happen, for example, if we just want to get some local minimum of `f` and not necessarily the smallest one. Then it may make sense to define another `gradient_descent` that just chooses `x` randomly. We could make `x` an optional argument too; nevertheless, we are going to explore another way that exploits one of Julia's most distinguishing features: *multiple dispatch*. To see how it works, let's consider this alternative version:

In [None]:
""" A fourth version of gradient descent, where the initial point gets initialized randomly.
    Stops when the gradient is smaller than `TOL`, or when the maximum number of iterations 
    `maxiter` has been reached"""
function gradient_descent(f, Df; α = 0.1, TOL = 1e-10, maxiter = 1000, verbose = false)
    x = rand()
    gradient_descent(f, Df, x; α = α, TOL = TOL, maxiter = maxiter, verbose = verbose)
end

# Let us first run the original version
x1, _ = gradient_descent(f, Df, 1, α = 0.1, maxiter = 10000);

# And then the new version
x2, _ = gradient_descent(f, Df, α = 0.1, maxiter = 10000);

println("The difference between the two solutions is $(abs(x1-x2)).")

As we have seen above, both versions of `gradient_descent` coexist despite sharing the same name. In fact, in Julia a function can have multiple *methods*, and which one gets called depends on the inputs. Thus, our `gradient_descent` has two methods:

In [None]:
methods(gradient_descent)

Other functions, like the `sort` function, have a lot more methods defined:

In [None]:
methods(sort)

As you may guess from this list, the function `sort` has methods defined for many relevant *types*: arrays, sparse arrays, ranges... And this is something we can also do, just by annotating the types of the function's inputs. This adds even more flexibility: not only can we have different methods for different numbers of inputs, we can also have different methods for different *types* of input, **and all behind the same function name**. Having different methods for different types helps Julia compile the program efficiently, which translates into its impressive performance.
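
To make the idea of type annotations concrete, here is a small sketch with a made-up function `describe` (it is not part of Julia or of this notebook). Each definition adds a new method, and Julia picks the most specific one that matches the arguments of the call:

```julia
# Hypothetical function with one method per kind of input
describe(x::Integer) = "an integer"
describe(x::AbstractFloat) = "a floating-point number"
describe(x) = "something else"   # fallback for any other type
describe(x, y) = "two things"    # a different number of arguments

println(describe(1))
println(describe(1.5))
println(describe("hello"))
println(describe(1, 2.0))
println("Number of methods: $(length(methods(describe)))")
```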

On the other hand, this *multiple dispatch* also opens the door to reusing function names from the standard libraries for other types (even user-defined ones!) if they perform a similar task. For example, Julia doesn't provide a method for sorting a tuple (and this may make sense, since tuples can contain elements of any type):

In [None]:
A = (1, 3, 2)
sort(A) # this doesn't work

However, we may as well write a method for sorting tuples, which will be added to the methods of the `sort` function.

In [None]:
import Base: sort                      # we must explicitly import the function to extend it
sort(A::Tuple) = Base.sort(collect(A)) # here ::Tuple signals that the method is supposed to work on tuples
methods(sort)

In [None]:
sort(A)
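
Note that the method above returns an array, because `collect` turns the tuple into one. If a tuple is wanted back, one option is to convert the sorted array again. The sketch below does so in a separate function with the made-up name `sort_tuple`, so that it leaves `Base.sort` untouched:

```julia
# Hypothetical helper: sort the elements of a tuple and return a new tuple
sort_tuple(t::Tuple) = Tuple(sort(collect(t)))

A = (1, 3, 2)
B = sort_tuple(A)

println(B)
println(typeof(B))
println(A)   # the original tuple is unchanged, since tuples are immutable
```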