# **PhD430 - Machine Learning Exam**
## Morteza Aghajanzadeh 
### Dec 2023

## **Task 1**

### (a) *Symbolic differentiation*

\begin{equation*}
\begin{split}
L & = (y - \omega x -b)^2\\
\frac{\partial L}{\partial \omega} &  = 2 (-x)(y - \omega x -b)\\
& = 2 (-2)(3 - 2 \omega ) \\
& = 8 \omega - 12
\end{split}
\end{equation*}


### (b) *The forward difference method*

\begin{equation*}
\begin{split}
L & = (y - \omega x -b)^2\\
\frac{\partial L}{\partial \omega} & \approx \dfrac{(y - (\omega + h) x -b)^2 - (y - \omega x -b)^2}{h}\\
& = \dfrac{\big((y - (\omega + h) x -b) + (y - \omega x -b)\big) \big((y - (\omega + h) x -b) - (y - \omega x -b)\big)}{h} \\
& = \dfrac{(2(y - \omega x -b) - hx) (-hx)}{h} \\
& = (2(y - \omega x -b) - hx) (-x)\\
& \xrightarrow{h \to 0} 2(-x) (y - \omega x -b)\\
& = 2 (-2)(3 - 2 \omega ) \\
& = 8 \omega - 12
\end{split}
\end{equation*}
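As a quick numerical check of (a) and (b), the sketch below compares the forward difference with the symbolic derivative. It assumes $x = 2$, $y = 3$, $b = 0$, which is only one choice consistent with the substitution $2(-2)(3 - 2\omega)$ above; the function names and the step size `h` are illustrative.

```python
def loss(omega, x=2.0, y=3.0, b=0.0):
    """Squared error L = (y - omega*x - b)^2 (x, y, b are assumed values)."""
    return (y - omega * x - b) ** 2


def symbolic_grad(omega):
    """Closed-form derivative from part (a): dL/d(omega) = 8*omega - 12."""
    return 8.0 * omega - 12.0


def forward_difference(omega, h=1e-4):
    """Forward difference approximation from part (b)."""
    return (loss(omega + h) - loss(omega)) / h


omega = 1.0
exact = symbolic_grad(omega)
approx = forward_difference(omega)
# The algebra in (b) gives (2(y - omega*x - b) - h*x)(-x), so the
# approximation differs from the exact derivative by h * x^2 = 4h here.
print(exact, approx, approx - exact)
```

Shrinking `h` reduces the gap linearly until floating-point rounding in the subtraction starts to dominate.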


### (c) *Autodifferentiation*

\begin{equation*}
\left.\begin{array}{c}
g(z) = z^2 \Rightarrow \frac{\partial g(z)}{\partial z} = 2z\\
f(\omega) = y - \omega x - b \Rightarrow \frac{\partial f(\omega)}{\partial \omega} = -x
\end{array}\right\} \Rightarrow \frac{\partial L}{\partial \omega} = \frac{\partial g(f(\omega))}{\partial f(\omega)} \frac{\partial f(\omega)}{\partial \omega} = (2f(\omega)) (-x) = 2 (y - \omega x -b)(-x)  = 2 (-2)(3 - 2 \omega ) = 8 \omega - 12
\end{equation*}
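The chain-rule bookkeeping above can be mechanized. The following is a minimal sketch of forward-mode automatic differentiation with dual numbers, which is one flavor of autodifferentiation and not necessarily the one intended by the task. The `Dual` class and the values $x = 2$, $y = 3$, $b = 0$ are assumptions made for illustration.

```python
class Dual:
    """A value together with its derivative with respect to omega."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Constants have derivative zero.
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual._lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual._lift(other) - self

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'.
        other = Dual._lift(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def loss(omega, x=2.0, y=3.0, b=0.0):
    """L = (y - omega*x - b)^2, written with plain arithmetic."""
    residual = y - omega * x - b
    return residual * residual


def dloss_domega(omega_value):
    """Seed d(omega)/d(omega) = 1 and read off the propagated derivative."""
    return loss(Dual(omega_value, 1.0)).deriv


print(dloss_domega(1.0), 8 * 1.0 - 12)
```

Each arithmetic operation propagates the derivative alongside the value, so no symbolic expression and no step size `h` is needed.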


## **Task 2**

Assume you want to estimate an AR(1) model of the log USD-GBP exchange rate:

\begin{equation}
y_{t} = \alpha + \rho y_{t-1} + \epsilon_t
\end{equation}

The code in this notebook trains the model by minimizing the following loss function:

\begin{equation}
L = \frac{1}{T}\sum_{t=1}^{T} \left(y_{t} - \alpha - \rho y_{t-1}
\right)^{2}
\end{equation}

In [7]:
# Import libraries.
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Define data path.
file_path = 'https://www.dropbox.com/scl/fi/utj4vox9yudaj5z0ngd8d/exchange_rate.csv?rlkey=1szy4yh3x1w3pac4qds3y6hpw&dl=1'

# Load data.
data = pd.read_csv(file_path)


alpha: 0.36203432083129883, rho: 0.2650577127933502
loss: 0.017625970765948296


### Modified code
I then wrapped the training code you provided in a function that runs the whole estimation.
I set a seed inside the function so that the results are the same on every run.
The function takes the data, the initial values $\alpha_{0}$ and $\rho_0$, the loss function (`loss_function`), and the optimizer (`opt`) as inputs.
The baseline values are set as the default arguments of the function.

In [27]:
def estimation_model(data,α_0 = 0.05,ρ_0 = 0.05,loss_function = tf.keras.losses.mse,opt = tf.keras.optimizers.SGD()):
	import random
	random.seed(13990508)
	# Convert log exchange rate to numpy array.
	e = np.array(np.log(data['USD_GBP']))

	# Define the lagged exchange rate as a tensorflow constant.
	le = tf.constant(e[1:-1], tf.float32)

	# Define the exchange rate as a tensorflow constant.
	e = tf.constant(e[2:], tf.float32)
	# Initialize parameters.
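	# Pass the dtype by keyword: the second positional argument of tf.Variable is `trainable`.
	alpha = tf.Variable(α_0, dtype=tf.float32)
	rho = tf.Variable(ρ_0, dtype=tf.float32)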
	# Define AR(1) model to make predictions.
	def ar(alpha, rho, le):
		yhat = alpha + rho*le
		return yhat
	# Define loss function.
	def loss(alpha, rho, e, le):
		yhat = ar(alpha, rho, le)
		return loss_function(e, yhat)
	# Instantiate optimizer.
	# NOTE: this line replaces whatever was passed as `opt`, so the argument has no effect.
	opt = tf.keras.optimizers.SGD()
	# Perform minimization.
	for i in range(100):
		opt.minimize(lambda:
		loss(alpha, rho, e, le),
		var_list = [alpha, rho]
		)
	# Print parameters.
	print('alpha: {}, rho: {}'.format(alpha.numpy(), rho.numpy()))

	# Generate predictions.
	ypred = ar(alpha, rho, le)

	# Print loss.
	print('loss: {}'.format(loss(alpha, rho, e, le).numpy()))

In [28]:
print("Results for the baseline model:")
estimation_model(data)

Results for the baseline model:
alpha: 0.36203432083129883, rho: 0.2650577127933502
loss: 0.017625970765948296


### (a) 
Now I change the loss function passed to the function from MSE to MAE.

In [42]:
estimation_model(data,loss_function = tf.keras.losses.mae)

alpha: 0.34546560049057007, rho: 0.306787371635437
loss: 0.09436121582984924


### (b) 
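Now I pass the Adam optimizer to the function. Note that the output below is identical to the baseline: inside `estimation_model` the optimizer is re-instantiated as `tf.keras.optimizers.SGD()`, so the `opt` argument is ignored and this run still uses SGD.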

In [43]:
estimation_model(data,opt=tf.keras.optimizers.Adam())

alpha: 0.36203432083129883, rho: 0.2650577127933502
loss: 0.017625970765948296


### (c) 
Now I use a different initial guess.

In [44]:
estimation_model(data,α_0 = 0.5,ρ_0 = 0.5)

alpha: 0.31938737630844116, rho: 0.43941545486450195
loss: 0.010513707995414734


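As you can see, the change made in the last part, the initial guess, has the largest effect on the loss: starting from $\alpha_0 = \rho_0 = 0.5$ gives an MSE of about 0.0105, compared with about 0.0176 for the baseline. The MAE loss in (a) is measured on a different scale, so its value is not directly comparable. Because the MSE loss of this linear model has a single minimum, the fact that two starting points end at different parameter values shows that 100 iterations are not enough for convergence, so the initial guess matters for the estimates that the model returns.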
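To illustrate why the starting point matters after only 100 iterations, the sketch below runs full-batch gradient descent on the same AR(1) MSE loss from two starting points and compares both with the closed-form least-squares minimizer. It uses a simulated series, not the exchange-rate data, and the helper names (`simulate_ar1`, `gradient_descent`, `ols`) and the simulation parameters are assumptions. The learning rate 0.01 matches the default of `tf.keras.optimizers.SGD`.

```python
import random


def simulate_ar1(alpha, rho, n, seed=0):
    """Simulate a hypothetical AR(1) series (not the exchange-rate data)."""
    rng = random.Random(seed)
    y = [alpha / (1.0 - rho)]
    for _ in range(n - 1):
        y.append(alpha + rho * y[-1] + rng.gauss(0.0, 0.1))
    return y


def mse(alpha, rho, y):
    """Loss from Task 2: mean of (y_t - alpha - rho * y_{t-1})^2."""
    pairs = list(zip(y[1:], y[:-1]))
    return sum((yt - alpha - rho * yl) ** 2 for yt, yl in pairs) / len(pairs)


def gradient_descent(alpha, rho, y, lr=0.01, steps=100):
    """Full-batch gradient descent on the MSE loss."""
    pairs = list(zip(y[1:], y[:-1]))
    n = len(pairs)
    for _ in range(steps):
        errors = [yt - alpha - rho * yl for yt, yl in pairs]
        grad_alpha = -2.0 * sum(errors) / n
        grad_rho = -2.0 * sum(e * yl for e, (_, yl) in zip(errors, pairs)) / n
        alpha, rho = alpha - lr * grad_alpha, rho - lr * grad_rho
    return alpha, rho


def ols(y):
    """Closed-form minimizer of the same MSE loss."""
    target, lagged = y[1:], y[:-1]
    n = len(target)
    mean_t, mean_l = sum(target) / n, sum(lagged) / n
    cov = sum((l - mean_l) * (t - mean_t) for l, t in zip(lagged, target))
    var = sum((l - mean_l) ** 2 for l in lagged)
    rho = cov / var
    return mean_t - rho * mean_l, rho


series = simulate_ar1(alpha=0.1, rho=0.8, n=500)
for start in [(0.05, 0.05), (0.5, 0.5)]:
    a, r = gradient_descent(*start, series)
    print(start, "->", (a, r), "loss:", mse(a, r, series))
a, r = ols(series)
print("OLS ->", (a, r), "loss:", mse(a, r, series))
```

The least-squares loss is a lower bound for any gradient-descent run, so the gap between it and each run's loss indicates how far that run is from convergence.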

## **Task 3**

### (a) 
After calculating the initial loss value, we need to find the next guess. Gradient is helpful to provide a direction for us to find the next guess.
The gradient is used to update $\theta$ because it points in the direction of the steepest increase of the loss function. By moving in the opposite direction of the gradient, we can iteratively update the parameters to minimize the loss. 
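A minimal sketch of this update rule, applied to the loss from Task 1 with the assumed values $x = 2$ and $y - b = 3$ (so that $L = (3 - 2\omega)^2$ and the gradient is $8\omega - 12$). The learning rate and the number of steps are illustrative.

```python
def loss(omega):
    """L = (3 - 2*omega)^2, the Task 1 loss under the assumed values."""
    return (3.0 - 2.0 * omega) ** 2


def grad(omega):
    """dL/d(omega) = 8*omega - 12."""
    return 8.0 * omega - 12.0


def gradient_descent(omega, learning_rate=0.05, steps=50):
    """Repeatedly step against the gradient and record the loss."""
    history = [loss(omega)]
    for _ in range(steps):
        omega = omega - learning_rate * grad(omega)
        history.append(loss(omega))
    return omega, history


omega_hat, history = gradient_descent(omega=0.0)
print(omega_hat)       # approaches the minimizer omega = 1.5
print(history[:4])     # the loss falls at every step
```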

### (b)
When selecting a learning rate for an optimization algorithm, it is essential to consider the trade-off between taking larger steps with each iteration and the potential for overshooting the minimum. A high learning rate can help us approach the minimum faster, but it may also lead us to miss it entirely. So, it's crucial to choose a learning rate that strikes the right balance between convergence speed and accuracy.
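The trade-off can be seen on the same one-dimensional loss $L = (3 - 2\omega)^2$ (assumed values $x = 2$, $y - b = 3$). Each update multiplies the distance to the minimizer $\omega^* = 1.5$ by $|1 - 8\alpha|$, so for this particular loss the iteration converges only when $\alpha < 0.25$. The learning rates below are illustrative.

```python
def distance_after(learning_rate, omega=0.0, steps=30):
    """Distance from the minimizer omega* = 1.5 after `steps` updates."""
    for _ in range(steps):
        omega = omega - learning_rate * (8.0 * omega - 12.0)
    return abs(omega - 1.5)


# Too small: slow progress. Moderate: fast convergence.
# Just below 0.25: oscillates but still shrinks. Above 0.25: diverges.
for lr in (0.01, 0.1, 0.24, 0.26):
    print(lr, distance_after(lr))
```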

### (c)
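In SGD we select a sample $j$ uniformly at random from all the observations in the data and update $\theta$ using
$$
\theta \coloneqq \theta - \alpha \nabla_{\theta} J^{(j)}(\theta)
$$
This is the main difference between SGD and GD: GD computes the gradient over all observations for every update, whereas SGD uses the gradient at a single randomly chosen observation.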

### (d)
Computing the gradient of $B$ examples simultaneously for the parameter $\theta$ can be faster than computing $B$ gradients separately due to hardware parallelization. So we sample $B$ examples $j_1,\dots, j_B$ (without replacement) from the observations and update $\theta$ by
$$
\theta \coloneqq \theta - \frac{\alpha}{B} \sum_{k=1}^{B}\nabla_{\theta} J^{(j_k)}(\theta)
$$
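The update can be sketched in plain Python for a linear model $y = a + r x$ with per-example loss $J^{(j)} = (y_j - a - r x_j)^2$. The toy data, the function name `minibatch_sgd`, and the hyperparameters are assumptions for illustration. With `batch_size=1` the same code reduces to the single-sample update in (c).

```python
import random


def minibatch_sgd(xs, ys, batch_size, learning_rate=0.05, steps=3000, seed=0):
    """Fit y = a + r*x by mini-batch SGD on the loss (y - a - r*x)^2."""
    rng = random.Random(seed)
    a, r = 0.0, 0.0
    indices = range(len(xs))
    for _ in range(steps):
        # Sample B distinct observations (without replacement).
        batch = rng.sample(indices, batch_size)
        grad_a = grad_r = 0.0
        for j in batch:
            error = ys[j] - a - r * xs[j]
            grad_a += -2.0 * error
            grad_r += -2.0 * error * xs[j]
        # theta := theta - (alpha / B) * sum of per-example gradients.
        a -= learning_rate / batch_size * grad_a
        r -= learning_rate / batch_size * grad_r
    return a, r


xs = [0.1 * k for k in range(20)]
ys = [1.0 + 2.0 * x for x in xs]  # noiseless toy data with a = 1, r = 2
print(minibatch_sgd(xs, ys, batch_size=5))
print(minibatch_sgd(xs, ys, batch_size=1))
```

Because the toy data are noiseless, both runs can recover the generating parameters; with noisy data the single-sample run would fluctuate more around the minimum than the mini-batch run.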

### (e)
Stochastic Gradient Descent (SGD) and mini-batch SGD are popular for training deep learning models because they are efficient and effective. Unlike regular Gradient Descent, which uses the entire dataset to compute each gradient, SGD and mini-batch SGD compute each update on a random subset of the data. Each update is therefore much cheaper, so the parameters are updated far more often for the same amount of computation, which typically speeds up convergence on large datasets. The noise introduced by random sampling can also help the model generalize to new, unseen data, acting as a form of regularization. Finally, only the current batch has to be held in memory, so datasets that do not fit into memory at once can still be processed.

## **Task 4**

In [45]:
# Import libraries.
import pandas as pd
import tensorflow as tf
# Load data.
data = pd.read_csv('https://www.dropbox.com/scl/fi/v7iqtlyf3voedweq7xct5/macrodata.csv?rlkey=ccr7auc4i910z2h3xrs7caprn&dl=1',
                        index_col = 'Date')

# Define target.
y = data['Inflation'].iloc[1:]

# Define features.
X = data[['Inflation', 'Unemployment']].iloc[:-1]

# Create train and test sets.
y_train, y_test = y.iloc[:400], y.iloc[400:]
X_train, X_test = X.iloc[:400], X.iloc[400:]

### (a)

In [48]:
# Define sequential model.
model = tf.keras.models.Sequential()

# Add input layer.
model.add(tf.keras.Input(shape=(2,)))

# Define dense layer.
model.add(tf.keras.layers.Dense(2, activation="relu"))

# Define output layer.
model.add(tf.keras.layers.Dense(1, activation="linear"))

# Compile the model.
model.compile(loss="mse", optimizer="adam")

# Train the model.
model.fit(X_train, y_train, epochs=100)
# Print model architecture (summary() prints directly and returns None).
model.summary()
# Evaluate the training set and the test set using MSE.
model.evaluate(X_train, y_train), model.evaluate(X_test, y_test)


[Training log: Epoch 1/100 through Epoch 78; per-epoch metrics not preserved, output truncated]

(0.06745272129774094, 0.13329584896564484)
