**Part1 Three layer network (60 points)**
* Extend the fully-connected two-layer perceptron written in class, which solves a regression problem, by adding one more layer so that it has two hidden layers.
* Note that the internal derivatives are derived in the same way regardless of the number of hidden layers.
* Under the conditions below, run the tests on x^2 + y^2 + 1 with both the existing two-layer version and the new three-layer version.
-----------------------------
    Conditions
      - lr=0.0001
      - tolerance_threshold=0.0001
      - max(iter)=100,000
      - No SGD, no regularization
      - act_func=ReLU
-----------------------------
    1st Test
      - two-layer version: 8 neurons
      - three-layer version: 4+4 neurons
      - Run each network 20 times and record the loss and the number of iterations
-----------------------------
    2nd Test
      - two-layer version: 16 neurons
      - three-layer version: 8+8 neurons
      - Run each network 20 times and record the loss and the number of iterations
-----------------------------
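
The remark that the internal derivatives are derived the same way for any number of hidden layers can be written as one recursion. The notation here is an assumption chosen to match the NumPy code in this notebook (row-vector inputs, $a^{(0)} = X$, hidden activation $f$, a linear output layer $L$, and the loss $E = \frac{1}{2N}\sum_n \lVert \mathrm{out}_n - y_n \rVert^2$ without regularization); it is not taken from the lecture notes.

$$
\begin{aligned}
z^{(l)} &= a^{(l-1)} W^{(l)} + b^{(l)}, \qquad a^{(l)} = f\!\left(z^{(l)}\right) \ \text{for } l < L, \qquad \mathrm{out} = z^{(L)} \\
\delta^{(L)} &= \frac{1}{N}\left(\mathrm{out} - y\right) \\
\delta^{(l)} &= \left(\delta^{(l+1)} \, W^{(l+1)\top}\right) \odot f'\!\left(z^{(l)}\right), \qquad l = L-1, \dots, 1 \\
\frac{\partial E}{\partial W^{(l)}} &= a^{(l-1)\top} \delta^{(l)}, \qquad \frac{\partial E}{\partial b^{(l)}} = \sum_{n} \delta^{(l)}_{n}
\end{aligned}
$$

In the two-layer code, `delta3` and `delta2` play the roles of $\delta^{(2)}$ and $\delta^{(1)}$. Adding a hidden layer only adds one more application of the third line.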

Which architecture performs better in Tests 1 and 2?
- Plot the number of iterations and the error for all runs in a single graph and describe the results.
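
One possible way to organize the 20 runs per network and the combined plot is sketched below. Everything in it is an assumption for illustration: the helper names `make_data`, `repeat_runs`, and `plot_runs`, the sampling range and sample count, and the convention that a hypothetical `run_once(seed)` callable returns `(final_loss, n_iter)`. The plot uses matplotlib, which is imported only inside `plot_runs`.

```python
import numpy as np


def make_data(n=200, seed=0):
    """Sample points in [-1, 1]^2 with target x^2 + y^2 + 1 (range and n are assumptions)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 + 1.0).reshape(-1, 1)
    return X, y


def repeat_runs(run_once, n_runs=20):
    """Call run_once(seed) n_runs times; run_once must return (final_loss, n_iter)."""
    results = np.array([run_once(seed) for seed in range(n_runs)], dtype=float)
    return results[:, 0], results[:, 1]  # final losses, iteration counts


def plot_runs(results, filename=None):
    """Scatter iteration count against final loss, one marker series per network."""
    import matplotlib.pyplot as plt  # lazy import so the helpers above work without it

    fig, ax = plt.subplots()
    for label, (losses, iters) in results.items():
        ax.scatter(iters, losses, label=label, alpha=0.7)
    ax.set_xlabel("iterations until stop")
    ax.set_ylabel("final loss")
    ax.set_yscale("log")
    ax.legend()
    if filename is not None:
        fig.savefig(filename)
    return fig
```

With four hypothetical wrappers around the training code (one per network size), `results` could be built as `{"two-layer, 8": repeat_runs(run_two_layer_8), "three-layer, 4+4": repeat_runs(run_three_layer_4_4), ...}` and passed to `plot_runs`. Putting the four series on the same axes makes the comparison between Tests 1 and 2 direct.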

In [None]:
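import numpy as np

# ReLU helpers used below; assumed to match the ones defined in class
def relu(z):
    return np.maximum(0, z)

def relu_derivative(a):
    # a is the ReLU output, so the derivative is 1 where a > 0 and 0 elsewhere
    return (a > 0).astype(float)

# create a two-layer neural network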
def create_model(X, hidden_nodes, output_dim = 2):
    # this will hold a dictionary of layers
    model = {}
    # input dimensionality
    input_dim = X.shape[1]
    # first set of weights from input to hidden layer 1
    model['W1'] = np.random.randn(input_dim, hidden_nodes) / np.sqrt(input_dim)
    # set of biases
    model['b1'] = np.zeros((1, hidden_nodes))

    # second set of weights from hidden layer 1 to output
    model['W2'] = np.random.randn(hidden_nodes, output_dim) / np.sqrt(hidden_nodes)
    # set of biases
    model['b2'] = np.zeros((1, output_dim))
    return model

# defines the forward pass given a model and data
def feed_forward(model, x):
    # get weights and biases
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # first layer
    z1 = x.dot(W1) + b1

    # activation function
    #a1 = logistic(z1)
    #a1 = tanh(z1)
    a1 = relu(z1)

    # second layer
    z2 = a1.dot(W2) + b2

    # no activation function as this is simply a linear layer!!
    out = z2
    return z1, a1, z2, out

# define the regression loss
def calculate_loss(model,X,y,reg_lambda):
    num_examples = X.shape[0]
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']

    # what are the current predictions
    z1, a1, z2, out = feed_forward(model, X)

    # calculate L2 loss
    loss = 0.5 * np.sum((out - y) ** 2)

    # add regularization term to loss
    loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))

    # return per-item loss
    return 1./num_examples * loss

# back-propagation for the two-layer network
def backprop(X,y,model,z1,a1,z2,output,reg_lambda):
    num_examples = X.shape[0]
    # derivative of loss function
    delta3 = (output-y)/num_examples
    # multiply this by activation outputs of hidden layer
    dW2 = (a1.T).dot(delta3)
    # and sum over all examples to get the bias gradient
    db2 = np.sum(delta3, axis=0, keepdims=True)

    # derivative of activation function
    #delta2 = delta3.dot(model['W2'].T) * logistic_derivative(a1) #if logistic
    #delta2 = delta3.dot(model['W2'].T) * tanh_derivative(a1) #if tanh
    delta2 = delta3.dot(model['W2'].T) * relu_derivative(a1) #if ReLU

    # multiply by input data
    dW1 = np.dot(X.T, delta2)
    # and sum over all examples (keepdims so db1 has the same shape as b1)
    db1 = np.sum(delta2, axis=0, keepdims=True)

    # add regularization terms on the two weights
    dW2 += reg_lambda * model['W2']
    dW1 += reg_lambda * model['W1']

    return dW1, dW2, db1, db2

# simple training loop
def train(model, X, y, num_passes=100000, reg_lambda = 0.1, learning_rate = 0.001):
    # whether to do stochastic gradient descent
    # (the test conditions above ask for no SGD, i.e. sgd = False)
    sgd = True

    # variable that checks whether we break iteration
    done = False

    # keeping track of losses
    previous_loss = float('inf')
    losses = []

    # iteration counter
    i = 0
    while not done:
        if sgd:
            # choose a random set of points
            randinds = np.random.choice(np.arange(len(y)),30,False)
            # get predictions
            z1,a1,z2,output = feed_forward(model, X[randinds,:])
            # feed this into backprop
            dW1, dW2, db1, db2 = backprop(X[randinds,:],y[randinds],model,z1,a1,z2,output,reg_lambda)
        else:
            # get predictions
            z1,a1,z2,output = feed_forward(model, X)
            # feed this into backprop
            dW1, dW2, db1, db2 = backprop(X,y,model,z1,a1,z2,output,reg_lambda)

        # given the results of backprop, update both weights and biases
        model['W1'] -= learning_rate * dW1
        model['b1'] -= learning_rate * db1
        model['W2'] -= learning_rate * dW2
        model['b2'] -= learning_rate * db2

        # do some book-keeping every once in a while
        if i % 1000 == 0:
            loss = calculate_loss(model, X, y, reg_lambda)
            losses.append(loss)
            print("Loss after iteration {}: {}".format(i, loss))
            # very crude method to break optimization
            if np.abs((previous_loss-loss)/previous_loss) < 0.001:
                done = True
            previous_loss = loss
        i += 1
        if i>=num_passes:
            done = True
    return model, losses
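
A three-layer variant of the code above might look like the following sketch. It is not the reference solution. The names `create_model3`, `feed_forward3`, `backprop3`, `loss3`, and `train3` are hypothetical, `output_dim=1` is an assumption for the scalar target, and `tolerance_threshold` is interpreted here as the relative change in loss between two checks, mirroring the crude stopping rule in `train`. Training is full-batch and unregularized, as the test conditions require.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def create_model3(input_dim, h1, h2, output_dim=1, rng=None):
    """Weights W1..W3 and biases b1..b3 for input -> h1 -> h2 -> output."""
    rng = np.random.default_rng() if rng is None else rng
    sizes = [input_dim, h1, h2, output_dim]
    model = {}
    for k in range(1, 4):
        fan_in, fan_out = sizes[k - 1], sizes[k]
        model[f"W{k}"] = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)
        model[f"b{k}"] = np.zeros((1, fan_out))
    return model


def feed_forward3(model, x):
    a1 = relu(x.dot(model["W1"]) + model["b1"])
    a2 = relu(a1.dot(model["W2"]) + model["b2"])
    out = a2.dot(model["W3"]) + model["b3"]  # linear output layer
    return a1, a2, out


def backprop3(X, y, model, a1, a2, out):
    """Same chain rule as the two-layer case, applied once more."""
    delta3 = (out - y) / X.shape[0]
    delta2 = delta3.dot(model["W3"].T) * (a2 > 0)  # ReLU derivative from its output
    delta1 = delta2.dot(model["W2"].T) * (a1 > 0)
    grads = {}
    for k, (inp, delta) in enumerate([(X, delta1), (a1, delta2), (a2, delta3)], start=1):
        grads[f"W{k}"] = inp.T.dot(delta)
        grads[f"b{k}"] = delta.sum(axis=0, keepdims=True)
    return grads


def loss3(model, X, y):
    out = feed_forward3(model, X)[2]
    return 0.5 * np.sum((out - y) ** 2) / X.shape[0]


def train3(model, X, y, lr=1e-4, tol=1e-4, max_iter=100_000, check_every=1000):
    """Full-batch gradient descent; returns (model, final_loss, iterations_used)."""
    previous = None
    for i in range(max_iter):
        a1, a2, out = feed_forward3(model, X)
        for key, grad in backprop3(X, y, model, a1, a2, out).items():
            model[key] -= lr * grad
        if i % check_every == 0:
            current = loss3(model, X, y)
            # stop when the relative change between two checks falls below tol
            if previous is not None and previous > 0 and abs((previous - current) / previous) < tol:
                return model, current, i
            previous = current
    return model, loss3(model, X, y), max_iter
```

Returning the iteration count alongside the final loss makes it straightforward to record both values for each of the 20 runs.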

**Bonus: arbitrary number of layers (20 points)**
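Keep the overall call logic and the training loop of the code above, but modify the functions so that they can handle an arbitrary number of layers.
  - The dictionaries used in create_model, the forward pass (feed_forward), and backprop need to be modified.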
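
One way to generalize the dictionary is to store the weights and biases as lists and loop over them. The sketch below is an illustration only. The names `create_model_n`, `feed_forward_n`, and `backprop_n`, the `"W"`/`"b"` list layout, and `output_dim=1` are assumptions, and ReLU is hard-coded for the hidden layers.

```python
import numpy as np


def create_model_n(input_dim, hidden_sizes, output_dim=1, rng=None):
    """Model dict with one weight matrix and one bias row per layer."""
    rng = np.random.default_rng() if rng is None else rng
    sizes = [input_dim, *hidden_sizes, output_dim]
    pairs = list(zip(sizes[:-1], sizes[1:]))
    return {
        "W": [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in pairs],
        "b": [np.zeros((1, n)) for _, n in pairs],
    }


def feed_forward_n(model, x):
    """Return [x, a1, ..., a_hidden, out]; the last layer is linear."""
    activations = [x]
    last = len(model["W"]) - 1
    for k, (W, b) in enumerate(zip(model["W"], model["b"])):
        z = activations[-1].dot(W) + b
        activations.append(z if k == last else np.maximum(0.0, z))
    return activations


def backprop_n(y, model, activations):
    """Walk the layers backwards, reusing the same delta update at every step."""
    delta = (activations[-1] - y) / y.shape[0]
    dW, db = [], []
    for k in reversed(range(len(model["W"]))):
        dW.append(activations[k].T.dot(delta))
        db.append(delta.sum(axis=0, keepdims=True))
        if k > 0:
            # propagate through W[k], then through the ReLU that produced activations[k]
            delta = delta.dot(model["W"][k].T) * (activations[k] > 0)
    return dW[::-1], db[::-1]
```

With this layout the update step in the training loop becomes a loop over `zip(model["W"], dW)` and `zip(model["b"], db)`, so the rest of the call logic can stay unchanged.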

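**Part2 PyTorch version (20 points)**
- Using PyTorch's nn layers, rewrite the code so that it solves the same regression problem with three layers and the same number of parameters, and visualize the result.
- With a sufficiently large number of iterations, run 20 tests each with the ADAM optimizer and the SGD optimizer.
- Evaluate the loss for each optimizer, compare the plots, and then explain each optimizer.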
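
A PyTorch counterpart of the 4+4 network could be set up roughly as follows. This is a sketch under several assumptions: the helper names `make_net` and `train_torch` are hypothetical, training is full-batch, `nn.MSELoss` (the mean squared error, so twice the per-item loss used in the NumPy code) stands in for the loss, and the default `nn.Linear` initialization differs from the NumPy initialization above.

```python
import torch
from torch import nn


def make_net(h1=4, h2=4):
    """2 -> h1 -> h2 -> 1 with ReLU; 37 parameters for h1 = h2 = 4."""
    return nn.Sequential(
        nn.Linear(2, h1), nn.ReLU(),
        nn.Linear(h1, h2), nn.ReLU(),
        nn.Linear(h2, 1),
    )


def train_torch(optimizer_name, X, y, lr, n_iter=5000, seed=0):
    """Train one network with full-batch updates and return it with its loss history."""
    torch.manual_seed(seed)  # a different seed per run gives a different initialization
    net = make_net()
    if optimizer_name == "adam":
        optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    else:
        optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    for _ in range(n_iter):
        optimizer.zero_grad()
        loss = loss_fn(net(X), y)
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return net, history
```

Calling `train_torch("adam", X, y, lr, seed=s)` and `train_torch("sgd", X, y, lr, seed=s)` for `s` in `range(20)` yields 20 loss histories per optimizer that can be overlaid in one plot. Suitable values for `lr` and `n_iter` are left open here.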