**Part1 Three layer network (60 points)**
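* Extend the fully-connected two-layer perceptron written in class, which solves a regression problem, so that it has two hidden layers, and use the extended network.
* Note that the derivation of the internal derivatives is always the same, regardless of the number of hidden layers.
* Under the conditions below, run several tests on x^2 + y^2 + 1 with the existing two-layer version and the new three-layer version.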
-----------------------------
    Conditions
      - lr=0.0001
      - tolerance_threshold=0.0001
      - max(iter)=100,000
      - No sgd, No regularization
      - act_func=ReLU
-----------------------------
    1st Test
      - two-layer version: 8 neurons
      - three-layer version: 4+4 neurons
      - Run each network 20 times and record the loss and the number of iterations
-----------------------------
    2nd Test
      - two-layer version: 16 neurons
      - three-layer version: 8+8 neurons
      - Run each network 20 times and record the loss and the number of iterations
-----------------------------
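
The two tests pair networks with the same number of hidden neurons, which is not the same as the same number of trainable parameters. The sketch below counts weights and biases for each configuration, assuming 2 inputs (x, y) and 1 output as implied by the target x^2 + y^2 + 1. `count_parameters` is a hypothetical helper, not part of the course code.

```python
def count_parameters(input_dim, hidden_nodes, output_dim):
    """Count weights and biases of a fully-connected network."""
    sizes = [input_dim] + list(hidden_nodes) + [output_dim]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # one weight per connection plus one bias per receiving neuron
        total += fan_in * fan_out + fan_out
    return total

# assumed dimensions: 2 inputs, 1 output
for hidden in ([8], [4, 4], [16], [8, 8]):
    print(hidden, count_parameters(2, hidden, 1))
```

Under these assumptions the counts are 33 vs 37 in the first test and 65 vs 105 in the second, so the three-layer networks carry somewhat more parameters for the same number of hidden neurons.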

Which architecture is better in tests 1 and 2?
- Plot the number of iterations and the error for all results on a single graph and describe the outcome
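
One possible way to put all runs on a single graph is a scatter plot of final loss against iteration count, with one colour per configuration. The sketch below assumes a list of dictionaries with the keys `label`, `losses` and `iterations`; `plot_runs` and the demo numbers are hypothetical placeholders that only show the input format, not measured results.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

def plot_runs(results):
    """Scatter final loss against iteration count, one colour per configuration."""
    fig, ax = plt.subplots(figsize=(7, 5))
    for res in results:
        ax.scatter(res["iterations"], res["losses"], label=res["label"], alpha=0.7)
    ax.set_xlabel("iterations until stop")
    ax.set_ylabel("final loss")
    ax.set_yscale("log")  # final losses can span orders of magnitude
    ax.legend()
    return fig, ax

# Placeholder numbers only (random, NOT experimental results).
rng = np.random.default_rng(0)
demo_results = [
    {"label": "config A", "iterations": rng.integers(1000, 100000, 20),
     "losses": rng.uniform(0.5, 5.0, 20)},
    {"label": "config B", "iterations": rng.integers(1000, 100000, 20),
     "losses": rng.uniform(0.5, 5.0, 20)},
]
fig, ax = plot_runs(demo_results)
```

A scatter plot keeps every one of the 20 runs visible, which makes the spread between random initialisations easier to judge than a single mean per configuration.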

In [47]:
# library imports
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import colorsys
import sys
import time
from IPython import display
from mpl_toolkits import mplot3d
from sklearn import datasets

In [48]:
# activation functions
def relu(X):
    return np.maximum(X, 0)

def relu_derivative(X):
    return 1. * (X > 0)

def tanh(X):
    return np.tanh(X)

def tanh_derivative(X):
    return 1.-tanh(X)**2

def logistic(X):
    return 1./(1. + np.exp(-X))

def logistic_derivative(X):
    return logistic(X)*(1. - logistic(X))

# Activation functions mapping
activation_functions = {
    'relu': relu,
    'tanh': tanh,
    'logistic': logistic
}

# Activation functions derivatives mapping
activation_derivatives = {
    'relu': relu_derivative,
    'tanh': tanh_derivative,
    'logistic': logistic_derivative
}
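
As a sanity check, the analytic derivatives can be compared with a central finite difference. The sketch below redefines the activation functions so that it runs on its own; `numerical_derivative` and the test points are assumptions chosen for illustration.

```python
import numpy as np

def relu(x): return np.maximum(x, 0)
def relu_derivative(x): return 1.0 * (x > 0)
def tanh_derivative(x): return 1.0 - np.tanh(x) ** 2
def logistic(x): return 1.0 / (1.0 + np.exp(-x))
def logistic_derivative(x): return logistic(x) * (1.0 - logistic(x))

def numerical_derivative(f, x, eps=1e-6):
    """Central finite difference of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# test points away from 0, where ReLU is not differentiable
z = np.array([-2.0, -0.5, 0.3, 1.7])
pairs = {
    "relu": (relu, relu_derivative),
    "tanh": (np.tanh, tanh_derivative),
    "logistic": (logistic, logistic_derivative),
}
errors = {}
for name, (f, df) in pairs.items():
    # largest absolute gap between numerical and analytic derivative
    errors[name] = np.max(np.abs(numerical_derivative(f, z) - df(z)))
    print(name, errors[name])
```

Note that each derivative here takes the pre-activation z as its argument. Passing the activation output a instead gives the same mask for ReLU (a > 0 exactly when z > 0) but a different value for tanh and logistic.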

In [49]:
# create a two-layer neural network
def old_create_model(X, hidden_nodes, output_dim = 2):
    # this will hold a dictionary of layers
    model = {}
    # input dimensionality
    input_dim = X.shape[1]
    # first set of weights from input to hidden layer 1
    model['W1'] = np.random.randn(input_dim, hidden_nodes) / np.sqrt(input_dim)
    # set of biases
    model['b1'] = np.zeros((1, hidden_nodes))

    # second set of weights from hidden layer 1 to output
    model['W2'] = np.random.randn(hidden_nodes, output_dim) / np.sqrt(hidden_nodes)
    # set of biases
    model['b2'] = np.zeros((1, output_dim))
    return model

# defines the forward pass given a model and data
def old_feed_forward(model, x):
    # get weights and biases
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # first layer
    z1 = x.dot(W1) + b1

    # activation function
    #a1 = logistic(z1)
    #a1 = tanh(z1)
    a1 = relu(z1)

    # second layer
    z2 = a1.dot(W2) + b2

    # no activation function as this is simply a linear layer!!
    out = z2
    return z1, a1, z2, out

# define the regression loss
def old_calculate_loss(model,X,y,reg_lambda):
    num_examples = X.shape[0]
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']

    # what are the current predictions
    z1, a1, z2, out = old_feed_forward(model, X)

    # calculate L2 loss
    loss = 0.5 * np.sum((out - y) ** 2)

    # add regularization term to loss
    loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))

    # return per-item loss
    return 1./num_examples * loss

# back-propagation for the two-layer network
def old_backprop(X,y,model,z1,a1,z2,output,reg_lambda):
    num_examples = X.shape[0]
    # derivative of loss function
    delta3 = (output-y)/num_examples
    # multiply this by activation outputs of hidden layer
    dW2 = (a1.T).dot(delta3)
    # and over all neurons
    db2 = np.sum(delta3, axis=0, keepdims=True)

    # derivative of activation function, evaluated at the pre-activation z1
    # (the *_derivative helpers expect z; for ReLU, a1 > 0 exactly when z1 > 0)
    #delta2 = delta3.dot(model['W2'].T) * logistic_derivative(z1) #if logistic
    #delta2 = delta3.dot(model['W2'].T) * tanh_derivative(z1) #if tanh
    delta2 = delta3.dot(model['W2'].T) * relu_derivative(z1) #if ReLU

    # multiply by input data
    dW1 = np.dot(X.T, delta2)
    # and sum over all neurons
    db1 = np.sum(delta2, axis=0)

    # add regularization terms on the two weights
    dW2 += reg_lambda * model['W2']
    dW1 += reg_lambda * model['W1']

    return dW1, dW2, db1, db2

# simple training loop
def old_train(model, X, y, num_passes=100000, reg_lambda = 0.1, learning_rate = 0.001, tolerance = 0.0001):
    # variable that checks whether we break iteration
    done = False

    # keeping track of losses
    previous_loss = float('inf')
    losses = []

    # iteration counter
    i = 0
    while not done:
        # get predictions
        z1,a1,z2,output = old_feed_forward(model, X)
        # feed this into backprop
        dW1, dW2, db1, db2 = old_backprop(X,y,model,z1,a1,z2,output,reg_lambda)

        # given the results of backprop, update both weights and biases
        model['W1'] -= learning_rate * dW1
        model['b1'] -= learning_rate * db1
        model['W2'] -= learning_rate * dW2
        model['b2'] -= learning_rate * db2

        # do some book-keeping every once in a while
        if i % 1000 == 0:
            loss = old_calculate_loss(model, X, y, reg_lambda)
            losses.append(loss)
            print("Two Layer Loss after iteration {}: {}".format(i, loss))
            # very crude method to break optimization: stop when the relative change
            # in loss falls below the tolerance (skipped on the first check, where
            # previous_loss is still inf and the ratio would be nan)
            if np.isfinite(previous_loss) and previous_loss != 0 and np.abs((previous_loss-loss)/previous_loss) < tolerance:
                done = True
            previous_loss = loss
        i += 1
        if i>=num_passes:
            done = True
    return model, losses

In [50]:
# create a three-layer neural network
def create_model(X, hidden_nodes, output_dim = 2, activation_function='relu'):
    # this will hold a dictionary of layers
    model = {}
    
    #save activation function to model
    model['activation_function'] = activation_function
    
    # input dimensionality
    input_dim = X.shape[1]
    
    # [i -> 1]weights and biases from input to hidden layer 1
    model['W1'] = np.random.randn(input_dim, hidden_nodes[0]) / np.sqrt(input_dim)
    model['b1'] = np.zeros((1, hidden_nodes[0]))
    
    # [1 -> 2]weights and biases from hidden layer 1 to hidden layer 2
    model['W2'] = np.random.randn(hidden_nodes[0], hidden_nodes[1]) / np.sqrt(hidden_nodes[0])
    model['b2'] = np.zeros((1, hidden_nodes[1]))

    # [2 -> o]weights and biases from hidden layer 2 to output
    model['W3'] = np.random.randn(hidden_nodes[1], output_dim) / np.sqrt(hidden_nodes[1])
    model['b3'] = np.zeros((1, output_dim))
    
    return model

# defines the forward pass given a model and data
def feed_forward(model, x):
    # get weights and biases
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'], model['W3'], model['b3']

    # get activation function
    act_func = activation_functions.get(model['activation_function'])
    
    # first layer
    z1 = x.dot(W1) + b1
    a1 = act_func(z1)

    # second layer
    z2 = a1.dot(W2) + b2
    a2 = act_func(z2)
    
    # third layer
    z3 = a2.dot(W3) + b3
    out = z3
    
    return z1, a1, z2, a2, z3, out
    
# define the regression loss
def calculate_loss(model,X,y,reg_lambda):
    num_examples = X.shape[0]
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'], model['W3'], model['b3']

    # what are the current predictions
    z1, a1, z2, a2, z3, out = feed_forward(model, X)

    # calculate L2 loss
    loss = 0.5 * np.sum((out - y) ** 2)

    # add regularization term to loss
    loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))

    # return per-item loss
    return 1./num_examples * loss

# back-propagation for the three-layer network
def backprop(X,y,model,z1,a1,z2,a2,z3,output,reg_lambda):
    num_examples = X.shape[0]
    # derivative of the activation function stored in the model
    act_func_derivative = activation_derivatives.get(model['activation_function'])

    # output layer is linear: gradient of the per-item L2 loss w.r.t. z3
    delta4 = (output-y)/num_examples
    dW3 = a2.T.dot(delta4)
    db3 = np.sum(delta4, axis=0, keepdims=True)

    # hidden layer 2: the derivative is evaluated at the pre-activation z2
    delta3 = delta4.dot(model['W3'].T) * act_func_derivative(z2)
    dW2 = np.dot(a1.T, delta3)
    db2 = np.sum(delta3, axis=0, keepdims=True)

    # hidden layer 1: use the model's activation (not hard-coded ReLU), evaluated at z1
    delta2 = delta3.dot(model['W2'].T) * act_func_derivative(z1)
    dW1 = np.dot(X.T, delta2)
    db1 = np.sum(delta2, axis=0, keepdims=True)

    # add regularization terms on the three weight matrices
    dW3 += reg_lambda * model['W3']
    dW2 += reg_lambda * model['W2']
    dW1 += reg_lambda * model['W1']

    return dW1, dW2, dW3, db1, db2, db3

# simple training loop
# NOTE: the `sgd` and `batch_size` parameters are assumed names; the original branch
# condition before `else:` was lost, and the batch size of 30 was hard-coded
def train(model, X, y, num_passes=100000, reg_lambda = 0, learning_rate = 0.001, tolerance = 0.0001, sgd = False, batch_size = 30):

    # variable that checks whether we break iteration
    done = False

    # keeping track of losses
    previous_loss = float('inf')
    losses = []

    # iteration counter
    i = 0
    while not done:
        if sgd:
            # choose a random set of points
            randinds = np.random.choice(np.arange(len(y)),batch_size,False)
            # get predictions
            z1,a1,z2,a2,z3,output = feed_forward(model, X[randinds,:])
            # feed this into backprop
            dW1, dW2, dW3, db1, db2, db3 = backprop(X[randinds,:],y[randinds],model,z1,a1,z2,a2,z3,output,reg_lambda)
        else:
            # get predictions
            z1,a1,z2,a2,z3,output = feed_forward(model, X)
            # feed this into backprop
            dW1, dW2, dW3, db1, db2, db3 = backprop(X,y,model,z1,a1,z2,a2,z3,output,reg_lambda)

        # given the results of backprop, update both weights and biases
        model['W1'] -= learning_rate * dW1
        model['b1'] -= learning_rate * db1
        model['W2'] -= learning_rate * dW2
        model['b2'] -= learning_rate * db2
        model['W3'] -= learning_rate * dW3
        model['b3'] -= learning_rate * db3

        # do some book-keeping every once in a while
        if i % 1000 == 0:
            loss = calculate_loss(model, X, y, reg_lambda)
            losses.append(loss)
            print("Three Layer Loss after iteration {}: {}".format(i, loss))
            # very crude method to break optimization: stop when the relative change
            # in loss falls below the tolerance (skipped on the first check, where
            # previous_loss is still inf and the ratio would be nan)
            if np.isfinite(previous_loss) and previous_loss != 0 and np.abs((previous_loss-loss)/previous_loss) < tolerance:
                done = True
            previous_loss = loss
        i += 1
        if i>=num_passes:
            done = True
    return model, losses

In [52]:
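# data creation
np.random.seed(0)  # For reproducibility
num_samples = 1000
X_train = np.random.uniform(-5, 5, (num_samples, 2))
y_train = (X_train[:, 0]**2 + X_train[:, 1]**2 + 1).reshape(-1, 1)

# hyperparameters from the assignment conditions
learning_rate = 0.0001
reg_lambda = 0  # no regularization

# create the models for a single reference run of each architecture
old_model = old_create_model(X_train,8,1)
model = create_model(X_train,[4,4],1, activation_function='relu')

# Training configurations
configs = [
    {"neurons": [8], "label": "Two-layer (8)"},
    {"neurons": [4, 4], "label": "Three-layer (4+4)"},
    {"neurons": [16], "label": "Two-layer (16)"},
    {"neurons": [8, 8], "label": "Three-layer (8+8)"}
]

results = []

# Training loop for each configuration
for config in configs:
    config_results = {"label": config["label"], "losses": [], "iterations": []}
    neurons = config["neurons"]
    for _ in range(20):  # 20 runs for each configuration
        if len(neurons) == 1:
            # one hidden layer: use the two-layer network from class
            run_model = old_create_model(X_train, neurons[0], 1)
            run_model, losses = old_train(run_model, X_train, y_train, reg_lambda=reg_lambda, learning_rate=learning_rate)
        else:
            # two hidden layers: use the new three-layer network
            run_model = create_model(X_train, neurons, 1, activation_function='relu')
            run_model, losses = train(run_model, X_train, y_train, reg_lambda=reg_lambda, learning_rate=learning_rate)
        config_results["losses"].append(losses[-1])
        # the loss is logged every 1000 iterations, so this is the iteration
        # at which the last loss was recorded
        config_results["iterations"].append((len(losses) - 1) * 1000)
    results.append(config_results)

results

# training
old_model, old_losses = old_train(old_model,X_train, y_train, reg_lambda=reg_lambda, learning_rate=learning_rate)

new_model, new_losses = train(model,X_train, y_train, reg_lambda=reg_lambda, learning_rate=learning_rate)

# determine predictions of the trained models (the last returned value is the output)
old_output = old_feed_forward(old_model, X_train)[-1]
output = feed_forward(new_model, X_train)[-1]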

Two Layer Loss after iteration 0: 1521.6570461719425
Two Layer Loss after iteration 1000: 26.98206088806651
Two Layer Loss after iteration 2000: 10.345953168112125
Two Layer Loss after iteration 3000: 7.2135242945218065
Two Layer Loss after iteration 4000: 6.424877598812335
Two Layer Loss after iteration 5000: 5.890511861048182
Two Layer Loss after iteration 6000: 5.667468946774807
Two Layer Loss after iteration 7000: 5.1882792944397575
Two Layer Loss after iteration 8000: 4.818835784523701
Two Layer Loss after iteration 9000: 4.429588956156372
Two Layer Loss after iteration 10000: 4.185377656494947
Two Layer Loss after iteration 11000: 4.2752520753480425
Two Layer Loss after iteration 12000: 3.6288564134878216
Two Layer Loss after iteration 13000: 3.6151854153028444
Two Layer Loss after iteration 14000: 3.2394202797021907
Two Layer Loss after iteration 15000: 3.3376810824305143
Two Layer Loss after iteration 16000: 3.050671681755592
Two Layer Loss after iteration 17000: 3.134164294691153
Two Layer Loss after iteration 18000: 3.126843193146604
Two Layer Loss after iteration 19000: 2.92406196

Three Layer Loss after iteration 1000: 12.515808653998098
Three Layer Loss after iteration 2000: 10.671005198877367
Three Layer Loss after iteration 3000: 7.656088862221957
Three Layer Loss after iteration 4000: 12.389931348327366
Three Layer Loss after iteration 5000: 5.494603818369914
Three Layer Loss after iteration 6000: 6.261388295479025
Three Layer Loss after iteration 7000: 4.5642718091018315
Three Layer Loss after iteration 8000: 3.837694082379573
Three Layer Loss after iteration 9000: 4.619846334538207
Three Layer Loss after iteration 10000: 3.4797042823925883
Three Layer Loss after iteration 11000: 3.4928060685296347
Three Layer Loss after iteration 12000: 3.749459332923351
Three Layer Loss after iteration 13000: 3.132621707762737
Three Layer Loss after iteration 14000: 3.5209541011191194
Three Layer Loss after iteration 15000: 3.563093603242017
Three Layer Loss after iteration 16000: 2.5654773594224607
Three Layer Loss after iteration 17000: 3.711897649277704
Three Layer Los

**Bonus: arbitrary number of layers (20 points)**
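For the code above, keep the overall call logic and the training loop, but modify the code so that the functions can use an arbitrary number of layers.
  - The dictionaries used in create_model, forward, and backprop must be modified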
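
One way to approach this is to store the layers under numbered keys (`W1`, `b1`, ..., `WL`, `bL`) and loop over them. The sketch below is a minimal illustration under stated assumptions, not a reference solution: the names `create_model_n`, `feed_forward_n` and `backprop_n` are hypothetical, the activation is fixed to ReLU, and regularization is left out.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def relu_derivative(z):
    return 1.0 * (z > 0)

def create_model_n(input_dim, hidden_nodes, output_dim):
    """Build weights W1..WL and biases b1..bL for any list of hidden sizes."""
    sizes = [input_dim] + list(hidden_nodes) + [output_dim]
    model = {"num_layers": len(sizes) - 1}
    for l in range(1, len(sizes)):
        model[f"W{l}"] = np.random.randn(sizes[l - 1], sizes[l]) / np.sqrt(sizes[l - 1])
        model[f"b{l}"] = np.zeros((1, sizes[l]))
    return model

def feed_forward_n(model, x):
    """Return pre-activations zs and activations acts (acts[0] is the input)."""
    L = model["num_layers"]
    zs, acts = [], [x]
    for l in range(1, L + 1):
        z = acts[-1].dot(model[f"W{l}"]) + model[f"b{l}"]
        zs.append(z)
        # hidden layers use ReLU, the output layer stays linear
        acts.append(relu(z) if l < L else z)
    return zs, acts

def backprop_n(model, y, zs, acts):
    """Gradients of the per-item L2 loss, computed from the output layer backwards."""
    L = model["num_layers"]
    grads = {}
    delta = (acts[-1] - y) / y.shape[0]
    for l in range(L, 0, -1):
        grads[f"W{l}"] = acts[l - 1].T.dot(delta)
        grads[f"b{l}"] = np.sum(delta, axis=0, keepdims=True)
        if l > 1:
            # same recurrence at every depth: push delta back through W_l, then
            # scale by the activation derivative at the previous pre-activation
            delta = delta.dot(model[f"W{l}"].T) * relu_derivative(zs[l - 2])
    return grads

# usage: three hidden layers on the same regression target
np.random.seed(0)
X = np.random.uniform(-5, 5, (50, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 + 1).reshape(-1, 1)
model = create_model_n(2, [4, 4, 4], 1)
zs, acts = feed_forward_n(model, X)
grads = backprop_n(model, y, zs, acts)
print({k: v.shape for k, v in grads.items()})
```

The loop in `backprop_n` is the same recurrence the fixed-depth versions write out by hand, which is why the derivation does not change with the number of hidden layers.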

**Part2 Pytorch version (20 points)**
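- Using PyTorch's nn layers, modify the code so that it solves the same regression problem with the same number of parameters and three layers, and visualize the result
- Using a sufficiently large number of iterations, test 20 times each with the ADAM optimizer and the SGD optimizer
- For each optimizer, evaluate the loss and compare the plots, then explain each optimizer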
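
A minimal sketch of what such a PyTorch version could look like is given below. It assumes the 4+4 hidden layout from the first test, full-batch training, and a short demo run; `make_model` and `fit` are hypothetical helper names, and the learning rate and iteration count are placeholders rather than tuned values.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.rand(1000, 2) * 10 - 5                    # uniform in [-5, 5)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 + 1).unsqueeze(1)  # target x^2 + y^2 + 1

def make_model(hidden=(4, 4)):
    # three linear layers: 2 -> h1 -> h2 -> 1, with ReLU between them
    return nn.Sequential(
        nn.Linear(2, hidden[0]), nn.ReLU(),
        nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
        nn.Linear(hidden[1], 1),
    )

def fit(optimizer_name, num_iters=2000, lr=1e-4):
    model = make_model()
    if optimizer_name == "adam":
        opt = torch.optim.Adam(model.parameters(), lr=lr)
    else:
        opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    losses = []
    for _ in range(num_iters):
        opt.zero_grad()              # clear gradients from the previous step
        loss = loss_fn(model(X), y)  # full-batch mean squared error
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return model, losses

# short demo run; the assignment calls for many more iterations and 20 repetitions
demo = {}
for name in ("adam", "sgd"):
    demo[name] = fit(name, num_iters=200)
    print(name, demo[name][1][0], demo[name][1][-1])
```

Note that `torch.optim.SGD` applied to the full batch, as here, is plain gradient descent. Adam additionally keeps running averages of the gradient and of its square and uses them to scale each parameter's step.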