# Machine Learning Systems PA 1

<a target="_blank" href="https://colab.research.google.com/github/hao-ai-lab/dsc291-PA/blob/main/pa1/mlsys_hw1.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Homework due: May 3, 2024, 11:59 pm, PST**.

Automatic differentiation is the core technique for training machine learning models. In this assignment, you will develop a basic automatic differentiation system from scratch. You will then build a logistic regression model and train it on a dataset of handwritten digits.

* This assignment must be completed **individually** and is not intended for group work.
* No GPU is needed for this assignment. You may choose to work in Google Colab (using the link provided above), on your personal computer, or on any other accessible server.
* This assignment solely requires Python programming; no C++ coding is involved.
* Make sure to refer to the final section of this notebook for details on how to submit your assignment.
* Avoid posting your completed work on any public platforms (such as GitHub).
* **Regarding testing and grading:** We have provided a suite of public test scripts located in the `tests/` directory that you can use to verify your implementation. **Additionally, your assignment will be evaluated using a series of private tests.** The marks you receive for each section will depend on how many of these private tests your code passes. You are allowed to submit your assignment multiple times, but only the last submission will be considered for grading after the deadline. **This implies that your scores will not be immediately visible post-submission.** We suggest writing your own test scripts to ensure your code functions correctly.


## Set up

* If you decide to use the Google Colab environment for this assignment, start by creating a copy of this notebook. You can do this by choosing "Save a copy in Drive" from the "File" menu. After saving the copy, execute the code block below to prepare your workspace. Once the repository is cloned, you will be able to view it in the "Files" tab on the left side of the screen.


In [None]:
# Code to set up the assignment
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir -p dsc291
%cd /content/drive/MyDrive/dsc291
!git clone https://github.com/hao-ai-lab/dsc291-PA.git
%cd /content/drive/MyDrive/dsc291/dsc291-PA/pa1

* If you are using a local or server environment, please clone this repository.

```shell
git clone https://github.com/hao-ai-lab/dsc291-PA.git
cd dsc291-PA/pa1
export PYTHONPATH=.:$PYTHONPATH
```

## Part 1: Automatic Differentiation Framework (60 pt)

In part 1, you are tasked with coding the reverse mode of the automatic differentiation algorithm.

The automatic differentiation algorithm in this assignment operates using a **computational graph**. A computational graph visually represents the sequence of operations needed to compute an expression. For instance, consider the expression $y = x_1 \times x_2 + x_1$, whose graph is shown below:


<img src="https://github.com/hao-ai-lab/dsc291-PA/blob/main/pa1/figure/computational_graph.jpg?raw=true" alt="figure/computational_graph.jpg" width="60%"/>

Let's begin by exploring the fundamental concepts and data structures used in the framework.
A computational graph is composed of **nodes**, each representing a distinct computation step in the evaluation of the entire expression.
Each node consists of three components, as shown in `auto_diff.py` line 6:

- an **operation** (field `op`), specifying the type of computation the node performs.
- a list of **input nodes** (field `inputs`), detailing the sources of input for the computation.
- optionally, additional "**attributes**" (field `attrs`), which vary depending on the node's operation. These attributes will be discussed in more detail later in this section.

Input nodes in a computational graph can be defined using `ad.Variable`. For instance, the input variable nodes $x_1$ and $x_2$ might be set up as follows:

```python
import auto_diff as ad

x1 = ad.Variable(name="x1")
x2 = ad.Variable(name="x2")
```

In `auto_diff.py` (line 81), the `ad.Variable` class is used to create a node
with the operation `placeholder` and a specified name. Input nodes have empty `inputs` and `attrs`:

```python
class Variable(Node):
    def __init__(self, name: str) -> None:
        super().__init__(inputs=[], op=placeholder, name=name)
```

Here, the `placeholder` operation signifies that the input variable node does not perform any computation. Apart from `placeholder`, there are other operations defined in `auto_diff.py`, such as:

- `add`, which adds two nodes,
- `matmul`, which performs matrix multiplication between two nodes.

It is important to note that these operations are globally defined once, and the `op` field of every node corresponds to one of these globally defined operations. You should not create your own instances of these `ops`.

Returning to our example where $y = x_1 \times x_2 + x_1$, with `x1` and `x2` already established as input variables, the rest of the graph can be defined using just one line of Python:
```python
y = x1 * x2 + x1
```

This code first creates a node with the operation `mul`, taking `x1` and `x2` as its inputs. It then constructs another node with `add`, which utilizes the result of the multiplication node along with `x1` as inputs. Consequently, this computational graph ultimately comprises four nodes.
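To make the node count concrete, here is a minimal, self-contained sketch of how operator overloading can build such a graph. `ToyNode` and its string-valued `op` field are hypothetical stand-ins used only for illustration; they are not the `Node` class from `auto_diff.py`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by field values
class ToyNode:
    op: str                                   # operation name (a plain string in this toy)
    inputs: List["ToyNode"] = field(default_factory=list)
    name: str = ""

    def __mul__(self, other: "ToyNode") -> "ToyNode":
        # Multiplying two nodes creates a new "mul" node that points at both operands.
        return ToyNode(op="mul", inputs=[self, other], name=f"({self.name}*{other.name})")

    def __add__(self, other: "ToyNode") -> "ToyNode":
        # Adding two nodes creates a new "add" node that points at both operands.
        return ToyNode(op="add", inputs=[self, other], name=f"({self.name}+{other.name})")


x1 = ToyNode(op="placeholder", name="x1")
x2 = ToyNode(op="placeholder", name="x2")
y = x1 * x2 + x1

mul_node = y.inputs[0]
print(y.op, [n.name for n in y.inputs])                 # add ['(x1*x2)', 'x1']
print(mul_node.op, [n.name for n in mul_node.inputs])   # mul ['x1', 'x2']
print(y.inputs[1] is x1)                                # True: x1 is shared, not copied
```

The four nodes are `x1`, `x2`, `mul_node`, and `y`. Note that `x1` appears as an input of two different nodes, which matters later when gradients from several consumers have to be accumulated.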

#### Important Note

It's important to note that a computational graph (e.g., the four nodes we defined) **does not** inherently store the actual values of its nodes. The structure of this assignment aligns with the TensorFlow v1 approach that was covered in our lectures. This method contrasts with frameworks like PyTorch, where input tensor values are specified upfront, and the values of intermediate tensors are computed immediately as they are defined.

In our framework, to calculate the value of the output `y` given the inputs `x1` and `x2`, we utilize the `Evaluator` class found in `auto_diff.py` at line 373.


### Evaluator
Here's a walkthrough of how `Evaluator` works. The constructor of `Evaluator` accepts a list of nodes that it needs to evaluate. By initiating an `Evaluator` with:
```python
evaluator = ad.Evaluator(eval_nodes=[y])
```
you are essentially setting up an `Evaluator` instance designed to compute the value of `y`. To calculate this, input tensor values are provided via the `Evaluator.run` method, which you will implement. These input tensors are assumed to be of type `numpy.ndarray` throughout this assignment. Here’s how it works:
```python
import numpy as np

x1_value = np.array(2)
x2_value = np.array(3)
y_value = evaluator.run(input_dict={x1: x1_value, x2: x2_value})
```

In this process, the `run` method takes the input values as a dictionary of type `Dict[Node, numpy.ndarray]`, calculates the value of the node `y` internally, and outputs the result. For instance, with the input values above, `2 * 3 + 2 = 8`, so the expected result for `y_value` is `np.array(8)`. Note that it will not yield the correct value until you have fully implemented the method:
```python
np.testing.assert_allclose(y_value, np.array(8))
```

The `Evaluator.run` method is responsible for the forward computation of nodes. Building on what was discussed in the lectures, to calculate the gradient of the output with respect to each input node within a computational graph, we enhance the forward graph with an additional backward component. By integrating both forward and backward graphs, and providing values for the input nodes, the `Evaluator` can compute the output value, the loss value, and the gradient values for each input node in a single execution of `Evaluator.run`.

You are tasked with implementing the function `gradients(output_node: Node, nodes: List[Node]) -> List[Node]` found in `auto_diff.py`. This function constructs the backward graph needed for gradient computation. It accepts an output node, typically the node representing the loss function in machine learning applications, whose own gradient is preset to 1. It also takes a list of nodes for which gradients are to be computed and returns a list of gradient nodes, one for each node in the input list.


Returning to our earlier example, once you have implemented the `gradients` function, you can use it to calculate the gradients of $y$ with respect to $x_1$ and $x_2$. This is done by running:
```python
x1_grad, x2_grad = ad.gradients(output_node=y, nodes=[x1, x2])
```
to obtain the respective gradients. Following this, you can set up an `Evaluator` with nodes `y`, `x1_grad`, and `x2_grad`. This allows you to use the `Evaluator.run` method to compute both the output value and the gradients for the input nodes.
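For this example the gradients can be derived by hand: $\partial y / \partial x_1 = x_2 + 1$ and $\partial y / \partial x_2 = x_1$, so with $x_1 = 2$ and $x_2 = 3$ the expected gradient values are 4 and 2. The sketch below confirms these hand-derived values with central finite differences in plain NumPy. `numeric_grad` is a hypothetical helper, not part of `auto_diff.py`; the same idea can be reused to cross-check the values produced by your own gradient nodes.

```python
import numpy as np


def f(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """The running example y = x1 * x2 + x1, written directly in NumPy."""
    return x1 * x2 + x1


def numeric_grad(func, args, index, eps=1e-6):
    """Central-difference estimate of d func / d args[index] for scalar inputs."""
    plus = list(args)
    minus = list(args)
    plus[index] = args[index] + eps
    minus[index] = args[index] - eps
    return (func(*plus) - func(*minus)) / (2 * eps)


x1_value = np.array(2.0)
x2_value = np.array(3.0)

x1_grad_numeric = numeric_grad(f, (x1_value, x2_value), index=0)
x2_grad_numeric = numeric_grad(f, (x1_value, x2_value), index=1)

# Hand-derived gradients: dy/dx1 = x2 + 1, dy/dx2 = x1.
np.testing.assert_allclose(x1_grad_numeric, x2_value + 1, rtol=1e-6)
np.testing.assert_allclose(x2_grad_numeric, x1_value, rtol=1e-6)
```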


Before you start working on the assignment, let's clarify how `operations` (ops) work. Within `auto_diff.py`, each op is equipped with three methods:

- `__call__(self, **kwargs) -> Node`, which accepts input nodes (and attributes), creates a new node utilizing this op, and returns the newly created node.
- `compute(self, node: Node, input_values: List[np.ndarray]) -> np.ndarray`, which processes the specified node along with its input values and delivers the resultant node value.
- `gradient(self, node: Node, output_grad: Node) -> List[Node]`, which receives a node and its gradient node, returning the partial adjoint nodes for each input node.

In essence, the `Op.compute` method is responsible for calculating the value of an individual node based on its inputs, while the `Evaluator.run` function computes the value of the entire graph's output based on the graph's inputs. The `Op.gradient` method is designed to construct the backward computational graph for an individual node, whereas the `gradients` function builds the backward graph for the entire graph. Accordingly, your implementation of `Evaluator.run` should effectively utilize the `compute` method from op, and your implementation of the `gradients` function should make use of the `gradient` method provided by op.


### Your tasks

**Task 1 (10 pt).** 
Implement the `compute` method for all operations in `auto_diff.py`. We have supplied examples for `AddOp` and `AddByConstOp` to guide you, but you will need to implement the remaining operations. For the scope of this homework, it is safe to assume that the inputs for operations like addition, multiplication, and division will be of the same shape.

Sample tests are provided in `tests/test_auto_diff_node_forward.py`. To evaluate your implementation of Task 1, you can execute these tests by running:


In [2]:
!python3 -m pytest -l -v tests/test_auto_diff_node_forward.py

platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.3.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /content/drive/MyDrive/15442/homework1
plugins: anyio-3.7.1
collected 8 items

tests/test_auto_diff_node_forward.py::test_mul PASSED                                        [ 12%]
tests/test_auto_diff_node_forward.py::test_mul_by_const PASSED                               [ 25%]
tests/test_auto_diff_node_forward.py::test_div PASSED                                        [ 37%]
tests/test_auto_diff_node_forward.py::test_div_by_const PASSED                               [ 50%]
tests/test_auto_diff_node_forward.py::test_matmul[False-False] PASSED                        [ 62%]
tests/test_auto_diff_node_forward.py::test_matmul[False-True] PASSED                         [ 75%]
tests/test_au

**Task 2 (15 pt).** 
Implement the `Evaluator.run` method in `auto_diff.py`. It may be beneficial to perform a [topological sort](https://en.wikipedia.org/wiki/Topological_sorting) of the computational graph to efficiently compute the output value.
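As an illustration of the idea, the following sketch performs a depth-first (post-order) topological sort on a toy graph stored as a plain dictionary that maps each node name to the names of its inputs. The dictionary representation and the function name `topological_order` are assumptions made for this example; your implementation will operate on `Node` objects instead.

```python
from typing import Dict, List


def topological_order(graph: Dict[str, List[str]], outputs: List[str]) -> List[str]:
    """Return the nodes needed for `outputs`, with every node placed after its inputs."""
    visited = set()
    order: List[str] = []

    def visit(name: str) -> None:
        if name in visited:
            return
        visited.add(name)
        for parent in graph[name]:
            visit(parent)       # make sure all inputs come first
        order.append(name)      # post-order: append once the inputs are done

    for out in outputs:
        visit(out)
    return order


# Toy graph for y = x1 * x2 + x1: each key lists the nodes it consumes.
graph = {"x1": [], "x2": [], "mul": ["x1", "x2"], "y": ["mul", "x1"]}
order = topological_order(graph, ["y"])
print(order)  # ['x1', 'x2', 'mul', 'y']
```

Evaluating nodes in this order guarantees that every input value is available before the node that consumes it is computed.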

Sample tests are available in `tests/test_auto_diff_graph_forward.py`. You can evaluate your implementation of Task 2 by executing these tests:


In [3]:
!python3 -m pytest -l -v tests/test_auto_diff_graph_forward.py

platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.3.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /content/drive/MyDrive/15442/homework1
plugins: anyio-3.7.1
collected 12 items

tests/test_auto_diff_graph_forward.py::test_identity PASSED                                  [  8%]
tests/test_auto_diff_graph_forward.py::test_add PASSED                                       [ 16%]
tests/test_auto_diff_graph_forward.py::test_add_by_const PASSED                              [ 25%]
tests/test_auto_diff_graph_forward.py::test_mul PASSED                                       [ 33%]
tests/test_auto_diff_graph_forward.py::test_mul_by_const PASSED                              [ 41%]
tests/test_auto_diff_graph_forward.py::test_div PASSED                                       [ 50%]
tests/test_au

**Task 3 (15 pt).** 
Implement the `gradient` method for all operations in `auto_diff.py`. We have provided examples for `AddOp` and `AddByConstOp` to guide you, but you will need to complete the implementations for the remaining operations.

Sample tests are provided in `tests/test_auto_diff_node_backward.py`. To evaluate your implementation of Task 3, you can execute these tests by running:


In [4]:
!python3 -m pytest -l -v tests/test_auto_diff_node_backward.py

platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.3.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /content/drive/MyDrive/15442/homework1
plugins: anyio-3.7.1
collecting ... collected 7 items

tests/test_auto_diff_node_backward.py::test_mul PASSED                                       [ 14%]
tests/test_auto_diff_node_backward.py::test_div PASSED                                       [ 28%]
tests/test_auto_diff_node_backward.py::test_div_by_const PASSED                              [ 42%]
tests/test_auto_diff_node_backward.py::test_matmul[False-False] PASSED                       [ 57%]
tests/test_auto_diff_node_backward.py::test_matmul[False-True] PASSED                        [ 71%]
tests/test_auto_diff_node_backward.py::test_matmul[True-False] PASSED

**Task 4 (20 pt).** 
Implement the `gradients` function in `auto_diff.py`. Utilizing a topological sort might prove useful for this implementation.

Sample tests are available in `tests/test_auto_diff_graph_backward.py`. You can assess your Task 4 implementation by running these tests:


In [5]:
!python3 -m pytest -l -v tests/test_auto_diff_graph_backward.py

platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.3.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /content/drive/MyDrive/15442/homework1
plugins: anyio-3.7.1
collecting ... collected 3 items

tests/test_auto_diff_graph_backward.py::test_graph PASSED                                    [ 50%]
tests/test_auto_diff_graph_backward.py::test_gradient_of_gradient PASSED                     [100%]



### A few notes
1. **Zero-rank arrays in NumPy.** Throughout this homework, all values are treated as `numpy.ndarray` types. An interesting aspect of NumPy is how it handles zero-rank arrays. For example, adding two zero-rank arrays (`np.array(1) + np.array(2)`) produces a scalar value instead of another zero-rank array:
```
>>> x = np.array(1)
>>> y = np.array(2)
>>> type(x), type(y), x.ndim, y.ndim
(<class 'numpy.ndarray'>, <class 'numpy.ndarray'>, 0, 0)
>>> z = x + y
>>> z, type(z)
(3, <class 'numpy.int64'>)
```
For a thorough implementation, you would need to ensure that results are wrapped back into `numpy.ndarray` types. However, for simplicity, this adjustment is optional in this homework. No tests will check for this behavior, and it will not impact your grade. Because Python is dynamically typed, NumPy scalars and zero-rank arrays can be mixed freely in the arithmetic used here.

2. **`Node.attrs`.** In our reference implementation of `AddByConstOp` in `auto_diff.py`, the `attrs` field stores the constant operand of the addition. Generally, the `attrs` field holds all **constants** known at the time of constructing the computational graph. For instance, in `AddByConstOp`, the constant operand is stored as a node attribute, whereas in `MatMulOp`, attributes like boolean flags for transposing input matrices are used. You might find it beneficial to store the reduction axis as an attribute when implementing operations like `SumOp`.

3. **Minimality of `gradients`.** The `gradients` function builds the backward graph and returns gradient nodes for the specified nodes. It's worth noting that it's not mandatory to construct a minimal backward graph, which would only include necessary gradient nodes. For instance, in the graph `y = x1 * x2 + x1`, if we only require the gradient for `x1 * x2`, a minimal backward graph would exclusively involve this gradient. While this homework doesn't require constructing minimal backward graphs, contemplating the potential advantages or disadvantages of such an approach is a valuable exercise.
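Regarding note 1, one way to wrap a result back into an array is `np.asarray`, which returns an `ndarray` whether it is given a NumPy scalar or an array. This is only a small sketch of that optional adjustment, not something the tests require:

```python
import numpy as np

x = np.array(1)
y = np.array(2)

z_scalar = x + y               # NumPy returns a scalar type (e.g. numpy.int64) here
z_array = np.asarray(x + y)    # wrapped back into a zero-rank ndarray

print(isinstance(z_scalar, np.ndarray))                  # False
print(isinstance(z_array, np.ndarray), z_array.ndim)     # True 0
```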




## Part 2: SGD for logistic regression (40 pt)




In this section, you are to implement the stochastic gradient descent (SGD) algorithm to train a straightforward logistic regression model.

Consider an input vector $x \in \mathbb{R}^n$. The logistic regression model we'll use is defined by the equation:
$$z = W^T x + b$$
Here, $W \in \mathbb{R}^{n \times k}$ represents the weight matrix, $b \in \mathbb{R}^k$ the bias vector, and $z \in \mathbb{R}^k$ the logits output by the model.

The model training will utilize the softmax function combined with cross-entropy loss applied to mini-batches of data. This entails solving the following optimization problem under a mini-batch setting:
\begin{equation}
\min_{W, b} \;\; \ell_{\mathrm{softmax}}(XW+b, y),
\end{equation}
where $X \in \mathbb{R}^{b \times n}$ represents a mini-batch of input data. Note that $b$ in the dimension $b \times n$ denotes the batch size, not the bias vector $b$.



### Your tasks

In general, you need the following steps (components) to train the logistic regression model:

**Task 5 (15 pt).** 
In the `logistic_regression` function within `logistic_regression.py`, you need to define the forward computational graph for the equation $Z = XW + b$. Here, $XW$ is a 2-dimensional matrix and $b$ is a 1-dimensional vector. This configuration necessitates the introduction of a new operator that can broadcast the vector $b$ to match the matrix dimensions of $XW$. 

In many frameworks, such as [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.broadcast_to.html), the `broadcast_to` function is used for this purpose. However, since our computational graph nodes do not maintain shape information, you may need to modify the interface of your broadcasting operator to accommodate this. Consider how you can implement this to ensure the vector $b$ correctly aligns with the dimensions of $XW$ within the graph.
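To see what a broadcasting operator has to do, the NumPy sketch below broadcasts a bias vector to the shape of $XW$ and shows that the gradient flowing back to $b$ is the sum of the upstream gradient over the broadcast (batch) axis. The shapes and variable names are chosen for illustration only; how you pass the target shape to your own op is a design decision left to you.

```python
import numpy as np

batch_size, k = 4, 3
xw = np.arange(batch_size * k, dtype=np.float64).reshape(batch_size, k)  # stands in for XW
b = np.array([0.1, 0.2, 0.3])

# Forward: replicate b along a new leading (batch) axis, then add.
b_broadcast = np.broadcast_to(b, xw.shape)
z = xw + b_broadcast

# Backward: every row of z depends on the same b, so the gradient with
# respect to b accumulates the upstream gradient over the batch axis.
upstream_grad = np.ones_like(z)          # pretend dL/dZ is all ones
b_grad = upstream_grad.sum(axis=0)

print(z.shape, b_grad)  # (4, 3) [4. 4. 4.]
```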


**Task 6 (15 pt).**
Create the `softmax_loss` function in the `logistic_regression.py` file, which builds the computational graph for evaluating the softmax loss. This function receives one input node holding the logits and another node holding the one-hot encodings of the true class labels. For a multi-class problem with $k$ classes, each example has a logits vector $z \in \mathbb{R}^k$ and a true class $y \in \{1, \ldots, k\}$, which is represented as a one-hot vector. The loss is calculated with the following formula:

\begin{equation}
\ell_{\mathrm{softmax}}(z, y) = \log\sum_{i=1}^k \exp z_i - z_y.
\end{equation}

Additional ops for summing, taking logarithms, and exponentiating may need to be introduced, along with their gradient calculations, to properly construct this softmax loss function.
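A plain NumPy reference can be handy for checking the values your graph produces. The sketch below evaluates the formula above for a batch of logits and one-hot labels and averages the result over the batch. The averaging and the function name `softmax_loss_reference` are assumptions made for this example, so check the starter code for the reduction your function is expected to use.

```python
import numpy as np


def softmax_loss_reference(logits: np.ndarray, y_one_hot: np.ndarray) -> float:
    """Mean over the batch of log(sum_i exp(z_i)) - z_y."""
    log_sum_exp = np.log(np.exp(logits).sum(axis=1))   # shape (batch,)
    z_true = (logits * y_one_hot).sum(axis=1)          # picks z_y via the one-hot mask
    return float(np.mean(log_sum_exp - z_true))


logits = np.zeros((2, 3))                 # uniform logits over k = 3 classes
labels = np.array([0, 2])                 # class indices are 0-based in this sketch
y_one_hot = np.eye(3)[labels]

loss = softmax_loss_reference(logits, y_one_hot)
np.testing.assert_allclose(loss, np.log(3))  # uniform prediction gives log(k)
```

This direct form can overflow for large logits; subtracting the row-wise maximum before exponentiating is a common remedy if you run into that.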


**Task 7 (10 pt).** 
Develop the `sgd_epoch` function within `logistic_regression.py` to facilitate a single epoch of stochastic gradient descent (SGD). This function should organize the provided input data and labels into multiple small groups, or mini-batches. Each mini-batch should then be fed sequentially into your pre-constructed computational graph as input.

Subsequently, compute the gradients from these operations and proceed to update the weights and biases in your logistic regression model accordingly.
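The batching step can be sketched independently of the computational graph. The generator below slices the data into consecutive mini-batches. The name `iterate_minibatches` and the decision to keep a smaller final batch are assumptions made for this example, so follow the starter code where it differs.

```python
import numpy as np
from typing import Iterator, Tuple


def iterate_minibatches(
    X: np.ndarray, y: np.ndarray, batch_size: int
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield consecutive (X_batch, y_batch) slices that cover the data once."""
    num_examples = X.shape[0]
    for start in range(0, num_examples, batch_size):
        end = start + batch_size          # slicing past the end is safe in NumPy
        yield X[start:end], y[start:end]


X = np.arange(20, dtype=np.float64).reshape(10, 2)   # 10 examples, 2 features
y = np.arange(10)

batch_shapes = [xb.shape for xb, _ in iterate_minibatches(X, y, batch_size=4)]
print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]
```

Inside such a loop, each batch would be passed to `Evaluator.run`, and the returned gradient values would drive an update of the form `W = W - lr * W_grad` (and likewise for `b`), where `lr` and `W_grad` are placeholder names for the learning rate and the evaluated gradient.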

Upon successful implementation and execution of this script on a dataset of handwritten digits, you should notice a prediction accuracy approaching 95%. You can verify this by executing the script in your terminal:

```shell
> python3 logistic_regression.py
...
Final test accuracy: 0.9611111111111111
```

In [None]:
!python3 logistic_regression.py

**Hint.** If the current set of ops does not meet your needs, consider introducing a new op.

## Part 3. **Optional** - Create Your Own Test Cases (up to 10 pt)

We encourage you to create your own test cases, which will help you confirm the correctness of your implementation.
If you are interested, you can write your own tests in `tests/test_customized_cases.py` and share them with us by including this file in your submission.
We would appreciate it if you shared your tests, as they can help improve this course and the homework. Please note that this part is voluntary.

## Part 4. **Optional** - Create the backward graph for Sampled Softmax Loss (20 pt)

As mentioned in lecture, `Sampled Softmax Loss` is a technique often used in language modeling and other tasks where the softmax layer becomes computationally expensive due to a large number of classes. For this optional task, create the computational graph for Sampled Softmax Loss, and then use it as a drop-in replacement for the Softmax Loss from Part 2.

Think about how the sampling procedure might affect the forward and backward pass.


In [None]:
!python3 sampled_softmax.py

## Part 5. Homework Feedback (0 pt)

This is the first time we are offering this course, and we appreciate any feedback on the homework.
You can leave your feedback (if any) in `feedback.txt`, and submit it together with the source code.
Possible topics include:

- How difficult do you think this homework is?
- How much time did the homework take? Which task took the most time?
- Which part of the homework did you find hard to understand?
- Anything else you would like to share.

Your feedback will be very useful in helping us improve the homework quality
for future years.


## How to Submit Your Homework

In the home directory for the assignment, execute the command

In [None]:
!make handin.tar

This will create an archive file containing `auto_diff.py`, `logistic_regression.py`, `sampled_softmax.py`, `tests/test_customized_cases.py`, and `feedback.txt`.
You can check the contents of `handin.tar` to make sure it contains all the needed files:

In [None]:
!tar -tf handin.tar

It is expected to list the following files:
```
auto_diff.py
logistic_regression.py
feedback.txt
sampled_softmax.py
tests/test_customized_cases.py
```

Then, please go to Gradescope and submit the file `handin.tar`.

This assignment is not automatically graded, and you will not receive immediate feedback.
You can submit multiple times, but only your final submission will be graded, and the time stamp of that submission will be used in determining any late penalties.
If you are enrolled in the course (on Webreg), but not registered on Gradescope, please let the course staff know in a private post on Piazza.