In [None]:
import math # Imports the 'math' module, which gives us access to mathematical functions like 'exp()' (e.g., e to the power of x).
import random # Imports the 'random' module, used to generate random numbers (e.g., for starting weights in our neural network).
import numpy as np # Imports 'NumPy', a super important library for working with numbers, especially in arrays (like lists of numbers arranged in rows and columns).
import matplotlib.pyplot as plt # Imports Matplotlib's 'pyplot' module for drawing graphs and charts. We'll use it to visualize functions and data.
# %matplotlib inline # This line is a "magic command" typically used in Jupyter notebooks.
                     # It makes sure any plots created by Matplotlib show up directly within the notebook, right below the code that generates them.

# --- Part 1: Basic Math and Numerical Derivatives (Warm-up) ---

# Defines a simple Python function named 'f'. It takes one input, 'x'.
# This function calculates 3 times x squared, minus 4 times x, plus 5.
def f(x):
  return 3*x**2 - 4*x + 5

# Calls our function 'f' with the number 3.0 as input.
# Expected output: 3*(3^2) - 4*3 + 5 = 3*9 - 12 + 5 = 27 - 12 + 5 = 15 + 5 = 20.0
f(3.0)
# 20.0 # This is the result from the line above.

# Creates a NumPy array (a special list of numbers) named 'xs'.
# It starts at -5, goes up to (but doesn't include) 5, and takes steps of 0.25.
# This array will serve as our x-axis values when we draw the graph of f(x).
xs = np.arange(-5, 5, 0.25)
# Applies our function 'f' to every single number in the 'xs' array.
# This creates a new array 'ys' containing the corresponding y-values for each x.
ys = f(xs)
# Uses Matplotlib to draw a line plot. 'xs' are the horizontal (x) coordinates, and 'ys' are the vertical (y) coordinates.
plt.plot(xs, ys)
# [<matplotlib.lines.Line2D at ...>] # This is just a technical reference to the plot object created by Matplotlib.

# Defines a very small number, 'h'. This tiny number is crucial for approximating derivatives numerically.
h = 0.000001
# Sets a specific value for 'x' where we want to figure out the slope of the function 'f'.
x = 2/3
# This is the formula for calculating a "numerical derivative" (an approximate slope).
# It's (f(x + a tiny step) - f(x)) / (the tiny step).
# It essentially measures how much 'f(x)' changes when 'x' changes by a very small amount 'h'.
(f(x + h) - f(x))/h
# 2.999378523327323e-06 # This is the approximate slope of 'f(x)' at 'x = 2/3'. The exact derivative, f'(x) = 6*x - 4, is 0 at this point; the small leftover (about 3*h) is the error of the forward-difference approximation.

# "Let's get more complex": we're moving on to a slightly more involved example.
a = 2.0 # Assigns the number 2.0 to variable 'a'.
b = -3.0 # Assigns the number -3.0 to variable 'b'.
c = 10.0 # Assigns the number 10.0 to variable 'c'.
d = a*b + c # Calculates 'd' based on 'a', 'b', and 'c'. d = (2.0 * -3.0) + 10.0 = -6.0 + 10.0 = 4.0.
print(d) # Prints the value of 'd'.
# 4.0 # The output from the previous print statement.

h = 0.0001 # Sets a new, slightly larger, tiny step 'h' for the next example.

# inputs # This comment indicates that 'a', 'b', and 'c' are inputs in this context.
a = 2.0 # Re-assigns 'a' to 2.0.
b = -3.0 # Re-assigns 'b' to -3.0.
c = 10.0 # Re-assigns 'c' to 10.0.

d1 = a*b + c # Calculates the initial value of 'd' (which is 4.0).
c += h # Increases the value of 'c' by 'h'. So, 'c' becomes 10.0001.
d2 = a*b + c # Calculates a new value of 'd' using the slightly changed 'c'.
             # d2 = (2.0 * -3.0) + 10.0001 = -6.0 + 10.0001 = 4.0001.

print('d1', d1) # Prints the initial value of 'd'.
print('d2', d2) # Prints the new value of 'd' after changing 'c'.
# Calculates the numerical slope of 'd' with respect to 'c'.
# (d2 - d1) / h = (4.0001 - 4.0) / 0.0001 = 0.0001 / 0.0001 = 1.0.
print('slope', (d2 - d1)/h)
# d1 4.0 # Output of the first print: the original value of 'd'.
# d2 4.0001 # Output of the second print: the value of 'd' after nudging 'c' by 'h'.
# slope 0.9999999999976694 # The calculated slope, which is very close to 1.0. This shows that if you change 'c' by a tiny bit, 'd' changes by the exact same tiny bit (because d = a*b + c, so changing c directly affects d by 1:1).
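
The warm-up above uses a one-sided (forward) difference. As a rough sketch of how the choice of formula affects accuracy, the block below compares it with a two-sided (central) difference on the same `f`. The helper names `forward_diff` and `central_diff` are introduced here for illustration and are not part of the original code.

```python
def f(x):
    # Same quadratic as in the warm-up above.
    return 3*x**2 - 4*x + 5

def forward_diff(func, x, h):
    # One-sided estimate: (f(x + h) - f(x)) / h
    return (func(x + h) - func(x)) / h

def central_diff(func, x, h):
    # Two-sided estimate: (f(x + h) - f(x - h)) / (2h)
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 3.0
exact = 6 * x0 - 4  # analytic derivative f'(x) = 6x - 4, so f'(3) = 14

for h in (1e-2, 1e-4, 1e-6):
    fwd_err = abs(forward_diff(f, x0, h) - exact)
    ctr_err = abs(central_diff(f, x0, h) - exact)
    print(f"h={h:g}  forward error={fwd_err:.2e}  central error={ctr_err:.2e}")
```

For this quadratic the forward-difference error is about `3*h`, while the central difference is limited only by floating-point rounding.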

---

### Part 2: The `Value` Class (The Brains of Automatic Differentiation)

This is the most important part! The `Value` class is a custom data type that not only holds a number but also keeps track of how that number was made. This allows us to automatically calculate gradients later.

```python
class Value:
  
  # This is the 'constructor' method. It's automatically called whenever you create a new Value object.
  # 'data': The actual number this Value object will hold (e.g., 2.0, -3.0).
  # '_children': A hidden tuple (like a fixed list) of other Value objects that were used as inputs
  #              to create *this* Value. This is how we build the "computational graph" (a map of how values relate).
  # '_op': A hidden string indicating which operation created this Value (e.g., '+', '*', 'tanh'). Useful for visualization.
  # 'label': An optional, human-readable name for this Value (e.g., 'a', 'loss'). Helpful for debugging.
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data # Stores the actual numerical value.
    self.grad = 0.0 # This is where the 'gradient' will be stored. It starts at 0.0.
                    # The gradient tells us how much a change in *this* value would affect the final output.
    self._backward = lambda: None # This is a placeholder for a special function. Each math operation
                                  # will replace this with a specific function that knows how to
                                  # pass gradients backward through that particular operation.
    self._prev = set(_children) # Stores the set of input Value objects that produced this one. The code calls them
                                # 'children'; a 'set' stores each input once (so `a + a` records 'a' a single time).
    self._op = _op # Stores the operation symbol (e.g., '+', '*').
    self.label = label # Stores the label for this Value.

  # This method defines what gets printed when you directly print a Value object (e.g., `print(my_value)`).
  def __repr__(self):
    return f"Value(data={self.data})" # For example, a Value holding 2.0 prints as 'Value(data=2.0)'.
  
  # This method "overloads" the addition operator (+).
  # It means that whenever you add two Value objects (e.g., `c = a + b`), Python calls this method.
  def __add__(self, other):
    # This line checks if 'other' (the thing we're adding) is already a Value object.
    # If not (meaning it's a regular number), it converts it into a Value object.
    other = other if isinstance(other, Value) else Value(other)
    # Creates a new Value object to hold the result of the addition.
    # It stores the sum of their 'data', and records 'self' and 'other' as its 'children'.
    # The operation symbol is set to '+'.
    out = Value(self.data + other.data, (self, other), '+')
    
    # Defines a nested function '_backward'. This function is specific to addition.
    # It explains how to pass gradients *backward* through an addition operation.
    # If 'out = self + other', then a tiny change in 'out' comes directly from
    # a tiny change in 'self' AND a tiny change in 'other'. So, the gradient just passes through.
    def _backward():
      self.grad += 1.0 * out.grad # Adds the gradient from 'out' to 'self's gradient.
                                  # (The derivative of (self + other) with respect to self is 1).
      other.grad += 1.0 * out.grad # Adds the gradient from 'out' to 'other's gradient.
                                   # (The derivative of (self + other) with respect to other is 1).
    out._backward = _backward # Assigns this special '_backward' function to the 'out' Value object.
    
    return out # Returns the newly created Value object that represents the sum.

  # This method overloads the multiplication operator (*).
  # Whenever you multiply two Value objects (e.g., `e = a * b`), this method is called.
  def __mul__(self, other):
    # Converts 'other' to a Value object if it isn't already one.
    other = other if isinstance(other, Value) else Value(other)
    # Creates a new Value object for the result of the multiplication.
    out = Value(self.data * other.data, (self, other), '*')
    
    # Defines the _backward function specific to multiplication.
    # For `out = self * other`:
    #   The derivative of 'out' with respect to 'self' is 'other.data'.
    #   The derivative of 'out' with respect to 'other' is 'self.data'.
    def _backward():
      self.grad += other.data * out.grad # Adds the propagated gradient to 'self'.
      other.grad += self.data * out.grad # Adds the propagated gradient to 'other'.
    out._backward = _backward # Assigns this backward function to 'out'.
      
    return out # Returns the new Value object representing the product.
  
  # This method overloads the power operator (**).
  # It's called when you raise a Value object to a power (e.g., `x**2`).
  def __pow__(self, other):
    # This line is a safety check. It makes sure that the 'power' ('other') is a simple number
    # (integer or float), not another Value object. This simplifies our current implementation.
    assert isinstance(other, (int, float)), "only supporting int/float powers for now"
    # Creates a new Value object for the result of the exponentiation (self.data raised to the power of 'other').
    # It records 'self' as its only child.
    out = Value(self.data**other, (self,), f'**{other}')
    
    # Defines the _backward function for exponentiation.
    # For `out = self ** power`:
    #   The derivative of 'out' with respect to 'self' is `power * (self.data ** (power - 1))`.
    def _backward():
      self.grad += other * (self.data ** (other - 1)) * out.grad # Adds the propagated gradient to 'self'.
    out._backward = _backward

    return out # Returns the new Value object.
  
  # This method handles "reverse multiplication" (e.g., `2 * my_value`).
  # It's called when the left side of the '*' is a regular number, but the right side IS a Value object.
  def __rmul__(self, other): # other * self
    return self * other # Simply calls our regular `__mul__` method, which knows how to handle this.

  # This method overloads the true division operator (/).
  # It's called when you divide one Value object by another (e.g., `value1 / value2`).
  # It cleverly reuses our existing multiplication and power operations.
  # Division by 'other' is the same as multiplying by 'other' raised to the power of -1 (1/other).
  def __truediv__(self, other): # self / other
    return self * other**-1 # Converts division into multiplication by the inverse.

  # This method overloads the unary negation operator (e.g., `-my_value`).
  def __neg__(self): # -self
    return self * -1 # Converts negation into multiplication by -1.

  # This method overloads the subtraction operator (-).
  # It's called when you subtract one Value object from another (e.g., `value1 - value2`).
  # It reuses our existing addition and negation methods.
  # `self - other` is the same as `self + (-other)`.
  def __sub__(self, other): # self - other
    return self + (-other) # Converts subtraction into addition with negation.

  # This method handles "reverse addition" (e.g., `2 + my_value`).
  # It's called when the left side of the '+' is a regular number, but the right side IS a Value object.
  def __radd__(self, other): # other + self
    return self + other # Simply calls our regular `__add__` method.

  # This method implements the hyperbolic tangent (tanh) activation function.
  # Tanh is a common "non-linear" function used in neural networks. It squashes any input
  # number into a range between -1 and 1.
  def tanh(self):
    x = self.data # Gets the numerical data from the current Value object.
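    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1) # Calculates tanh(x) using its mathematical formula.
                                                # Note: math.exp(2*x) overflows for very large x; math.tanh(x) computes the same value without that problem.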
    out = Value(t, (self, ), 'tanh') # Creates a new Value object for the result of tanh.
                                     # It records 'self' as its only child.
    
    # Defines the _backward function specifically for tanh.
    # The derivative (how the output changes with respect to the input) of tanh(x) is `1 - tanh(x)^2`.
    def _backward():
      self.grad += (1 - t**2) * out.grad # Adds the propagated gradient to 'self'.
    out._backward = _backward # Assigns this backward function to the 'out' Value object.
    
    return out # Returns the new Value object.
  
  # This method implements the exponential function (e^x, where 'e' is Euler's number, approx 2.718).
  def exp(self):
    x = self.data # Gets the numerical data from the current Value object.
    out = Value(math.exp(x), (self, ), 'exp') # Creates a new Value object for the result of e^x.
    
    # Defines the _backward function for the exponential operation.
    # The derivative of e^x is e^x itself.
    def _backward():
      # Propagates the gradient to 'self'. 'out.data' here is the calculated e^x.
      # Note the `+=` rather than `=`: a Value can feed into several operations, and the
      # gradients arriving along those different paths must accumulate instead of overwriting each other.
      self.grad += out.data * out.grad
    out._backward = _backward
    
    return out # Returns the new Value object.
  
  # This is the main function that performs the "backward pass" or "backpropagation".
  # It automatically calculates the gradients for ALL relevant Value objects in the computational graph,
  # starting from the current Value (which is usually the final output or "loss" of our calculation).
  def backward(self):
    
    # Step 1: Build a "topological sort" of the computational graph.
    # Imagine your calculation as a recipe. A topological sort puts the ingredients in order
    # so that you know which ingredients are needed before you can make a certain dish.
    # For backpropagation, we need to process nodes from the *output* back to the *inputs*.
    topo = [] # An empty list that will store the Value objects in topological order.
    visited = set() # A set to keep track of Value objects we've already processed, so a Value that is reachable along several paths is added to 'topo' only once.
    
    # This is a helper function that recursively builds the topological order.
    def build_topo(v):
      if v not in visited: # If this 'Value' object 'v' hasn't been visited yet:
        visited.add(v) # Mark it as visited.
        for child in v._prev: # For each 'Value' object that was an *input* to 'v' (stored in '_prev' and called a 'child' in this code):
          build_topo(child) # Recursively call 'build_topo' for that child.
        topo.append(v) # Once all of 'v's children have been processed, add 'v' itself to the 'topo' list.
    
    build_topo(self) # Start the process of building the topological sort from the 'self' Value object (our starting point for gradients, typically the final output).
    
    # Step 2: Initialize the gradient of the final output.
    # We set the gradient of the 'self' Value (our starting point) to 1.0.
    # This is because we're asking: "How much does *this* value change with respect to *itself*?" The answer is 1.
    self.grad = 1.0
    
    # Step 3: Iterate through the topologically sorted nodes in *reverse* order.
    # This ensures that when we call a node's '_backward' function, the gradient from its *output*
    # (which is `out.grad` within that _backward function) has already been computed and is ready to be used.
    for node in reversed(topo):
      node._backward() # Call the specific '_backward' function that was defined for each node's operation.
                       # This function then uses the accumulated gradient from its output (`node.grad`)
                       # and its own derivative rule to propagate (distribute) that gradient
                       # backward to its input 'Value' objects (their `grad` attributes).
```
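
One behaviour of the class above that is easy to miss is that gradients accumulate with `+=`. The sketch below uses a trimmed-down stand-in, named `MiniValue` here so it does not overwrite the full `Value` class (the name and the reduced feature set are assumptions made for this illustration), to show what happens when the same input is used more than once.

```python
class MiniValue:
    # Trimmed-down stand-in for the Value class above: only + and * are kept.
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = MiniValue(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = MiniValue(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Same scheme as above: topological order, then walk it in reverse.
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

u = MiniValue(3.0)
s = u + u            # 'u' feeds the sum twice, so ds/du = 2
s.backward()
print(u.grad)        # 2.0

p = MiniValue(-2.0)
q = MiniValue(3.0)
r = (p * q) + p      # dr/dp = q + 1 = 4, dr/dq = p = -2
r.backward()
print(p.grad, q.grad)  # 4.0 -2.0
```

With `=` instead of `+=` inside `_backward`, the first example would report a gradient of 1.0 rather than 2.0, because the second contribution would overwrite the first.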

---

### Part 3: Visualization (`draw_dot`)

This part helps us see the "computational graph" visually. It uses a separate library called `graphviz` to draw diagrams showing how our `Value` objects are connected and what their data and gradients are.

```python
from graphviz import Digraph # Imports the 'Digraph' class from the 'graphviz' library, used for drawing directed graphs.

# This function helps us trace all the connections (nodes and edges) in our computational graph.
# 'root': The starting Value object from which we want to draw the graph (usually the final output).
def trace(root):
  # builds a set of all nodes and edges in a graph
  nodes, edges = set(), set() # Initialize empty sets to store unique Value nodes and their connections (edges).
  def build(v): # A recursive helper function to build the graph structure.
    if v not in nodes: # If the current Value object 'v' hasn't been added to our 'nodes' set yet:
      nodes.add(v) # Add it to the set of nodes.
      for child in v._prev: # For each 'Value' object that was an *input* to 'v' (a 'child' in this code's terminology):
        edges.add((child, v)) # Add an edge (a connection) from the 'child' to 'v'.
        build(child) # Recursively call 'build' for the 'child' to trace its own dependencies further back.
  build(root) # Start the graph tracing process from the 'root' Value object.
  return nodes, edges # Return the collected unique nodes and their connections.

# This function takes the 'root' Value object of a computational graph and generates a visual representation of it.
def draw_dot(root):
  # Creates a 'Digraph' object.
  # 'format='svg'' means the output will be an SVG (Scalable Vector Graphics) image.
  # 'graph_attr={'rankdir': 'LR'}' tells graphviz to draw the graph from Left to Right, making it easier to read.
  dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})
  
  nodes, edges = trace(root) # Get all the Value nodes and their connections from the computational graph.
  for n in nodes: # Loop through each unique Value node found in the graph.
    uid = str(id(n)) # Get a unique string ID for each node. 'id(n)' gives its memory address, which is unique.
    # For any Value node in the graph, create a rectangular ('record' shape) node for it in the diagram.
    # The 'label' displays the Value's custom label, its numerical data (formatted to 4 decimal places),
    # and its computed gradient (also formatted to 4 decimal places).
    dot.node(name = uid, label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
    if n._op: # If this Value node was created as a result of an operation (meaning it has an '_op' symbol, like '+', '*'):
      # Create a separate node (an ellipse, graphviz's default shape) just for the operation itself (e.g., '+', '*').
      dot.node(name = uid + n._op, label = n._op)
      # And draw an arrow (edge) connecting this operation node to the Value node it produced.
      dot.edge(uid + n._op, uid)

  for n1, n2 in edges: # Loop through each connection (edge) found: 'n1' is an input to 'n2'.
    # Draw an arrow (edge) connecting the input node (n1) to the *operation node* of the output node (n2).
    # This visually shows the flow: Input Value -> Operation -> Output Value.
    dot.edge(str(id(n1)), str(id(n2)) + n2._op)

  return dot # Return the 'graphviz' object. In a Jupyter notebook, this object will automatically display the SVG graph.

# --- Example Usage: Neuron-like computation with tanh activation ---

# Define input values as Value objects. These are our initial numbers.
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# Define weight values as Value objects. These are like adjustable knobs in our calculation.
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# Define the bias value as a Value object. This is another adjustable knob, adding a constant offset.
b = Value(6.8813735870195432, label='b')

# Perform the core computation of a single "neuron": (input1 * weight1) + (input2 * weight2) + bias.
x1w1 = x1*w1; x1w1.label = 'x1*w1' # Calculate the product of x1 and w1.
x2w2 = x2*w2; x2w2.label = 'x2*w2' # Calculate the product of x2 and w2.
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2' # Sum the two weighted inputs.
n = x1w1x2w2 + b; n.label = 'n' # Add the bias to the sum. 'n' is the neuron's "raw" activation.

o = n.tanh(); o.label = 'o' # Apply the 'tanh' activation function to 'n' to get the neuron's final output 'o'.
o.backward() # Perform the backward pass starting from 'o'. This calculates all the gradients for x1, x2, w1, w2, and b.
draw_dot(o) # Draw the computational graph for 'o'. This graph will show the data and the *calculated gradients* for each node.

# --- Example Usage: Comparing our `Value` system with PyTorch ---
# This section demonstrates that our custom 'Value' class computes gradients very similarly to a professional library like PyTorch.

import torch # Imports the PyTorch library, a widely used framework for deep learning.

# Define input values using PyTorch's 'Tensor' objects.
# '.double()' ensures they are double-precision floating-point numbers.
# '.requires_grad = True' tells PyTorch to keep track of operations involving these tensors so it can compute gradients.
x1 = torch.Tensor([2.0]).double(); x1.requires_grad = True
x2 = torch.Tensor([0.0]).double(); x2.requires_grad = True
# Define weight values using PyTorch Tensors, also requiring gradients.
w1 = torch.Tensor([-3.0]).double(); w1.requires_grad = True
w2 = torch.Tensor([1.0]).double(); w2.requires_grad = True
# Define the bias value using a PyTorch Tensor, requiring gradients.
b = torch.Tensor([6.8813735870195432]).double(); b.requires_grad = True

# Perform the same neuron computation (weighted sum + bias) using PyTorch Tensors.
n = x1*w1 + x2*w2 + b
o = torch.tanh(n) # Apply the tanh activation function using PyTorch's built-in function.

print(o.data.item()) # Print the numerical data of the output 'o'. '.item()' extracts the Python number from the tensor.
o.backward() # Perform the backward pass in PyTorch to compute gradients. PyTorch handles this automatically!

print('---') # A separator line for clarity in the output.
# Print the gradients computed by PyTorch for each input and weight.
# You'll notice these values are almost identical to the gradients calculated by our custom 'Value' class,
# demonstrating that our simple implementation is correct in principle!
print('x2', x2.grad.item())
print('w2', w2.grad.item())
print('x1', x1.grad.item())
print('w1', w1.grad.item())
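
# --- Hand check of the gradients above (an added sketch, not part of the original walkthrough) ---
# Using plain floats and the chain rule, the same numbers can be reproduced without any autograd engine.
# The names ending in '_chk' are introduced only for this check, so they don't overwrite anything used later.
import math

x1_chk, x2_chk = 2.0, 0.0
w1_chk, w2_chk = -3.0, 1.0
b_chk = 6.8813735870195432

n_chk = x1_chk*w1_chk + x2_chk*w2_chk + b_chk  # raw activation of the neuron
o_chk = math.tanh(n_chk)                        # neuron output
do_dn = 1 - o_chk**2                            # local derivative of tanh, about 0.5 here

# Chain rule: d(o)/d(x1) = d(o)/d(n) * d(n)/d(x1) = do_dn * w1, and likewise for the others.
print('x2', do_dn * w2_chk)
print('w2', do_dn * x2_chk)
print('x1', do_dn * w1_chk)
print('w1', do_dn * x1_chk)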

# --- Part 4: Building a Neural Network (`Neuron`, `Layer`, `MLP`) ---
# Now we use our `Value` class to build the components of a simple neural network.

# Defines a single Neuron, the basic building block of a neural network.
class Neuron:
  
  # Constructor: Initializes a single neuron.
  # 'nin': The number of inputs this neuron will receive.
  def __init__(self, nin):
    # Creates a list of 'nin' weights ('w'). Each weight is a 'Value' object initialized with a random number
    # between -1 and 1. These weights are the "knobs" the neuron will adjust during learning.
    self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
    # Initializes the bias ('b') for this neuron with a random 'Value' object.
    self.b = Value(random.uniform(-1,1))
  
  # This special method makes the Neuron object "callable" like a function.
  # So, if you have `my_neuron = Neuron(3)`, you can then call `output = my_neuron(input_x)`.
  # 'x': The input data (a list of numbers) that this neuron will process.
  def __call__(self, x):
    # This line implements the core computation of a neuron: (weights * inputs) + bias.
    # 'zip(self.w, x)' pairs up each weight with its corresponding input from 'x'.
    # '(wi*xi for wi, xi in zip(self.w, x))' creates a generator that yields each weight-input product.
    # 'sum(..., self.b)' sums all these products and then adds the bias 'b'.
    # This 'act' (activation) is the raw, linear output of the neuron.
    act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
    # Applies a non-linear activation function (tanh in this case) to the 'act'.
    # Non-linearity is crucial for neural networks to learn complex patterns.
    # This 'out' is the final output of this neuron.
    out = act.tanh()
    return out # Returns the output of the neuron (which is a Value object).
  
  # This method returns a list of all the adjustable parameters (weights and bias) within this neuron.
  # These are the 'Value' objects whose gradients we'll compute and update during training.
  def parameters(self):
    return self.w + [self.b] # Combines the list of weights with the single bias Value.

# Defines a Layer, which is simply a collection of Neurons.
class Layer:
  
  # Constructor: Initializes a layer of neurons.
  # 'nin': The number of inputs going into this layer (and thus into each neuron in this layer).
  # 'nout': The number of neurons in this layer (which also determines the number of outputs from this layer).
  def __init__(self, nin, nout):
    # Creates a list of 'nout' Neuron objects. Each neuron is set up to take 'nin' inputs.
    self.neurons = [Neuron(nin) for _ in range(nout)]
  
  # This special method makes the Layer object "callable" like a function.
  # 'x': The input data (a list of numbers) that this layer will process.
  def __call__(self, x):
    # Passes the input 'x' through each neuron in the layer, collecting their individual outputs.
    outs = [n(x) for n in self.neurons]
    # If there's only one neuron in the layer (i.e., 'nout' was 1), return its output directly.
    # Otherwise, return the list of outputs from all neurons in the layer.
    return outs[0] if len(outs) == 1 else outs
  
  # This method returns a list of all adjustable parameters from all neurons within this layer.
  def parameters(self):
    # Uses a list comprehension to go through each neuron in the layer and collect its parameters,
    # effectively flattening them into a single list of 'Value' objects.
    return [p for neuron in self.neurons for p in neuron.parameters()]

# Defines an MLP (Multi-Layer Perceptron), which is a complete neural network made of stacked layers.
class MLP:
  
  # Constructor: Initializes the entire MLP network.
  # 'nin': The number of inputs to the very first layer of the network.
  # 'nouts': A list of numbers specifying the number of neurons in each subsequent (hidden and output) layer.
  #          For example, if `nouts` is `[4, 4, 1]`, it means:
  #          - The first hidden layer will have 4 neurons.
  #          - The second hidden layer will have 4 neurons.
  #          - The final output layer will have 1 neuron.
  def __init__(self, nin, nouts):
    # Creates a list 'sz' that defines the input/output sizes for each layer.
    # It starts with the network's initial input size ('nin'), then adds the specified output sizes for each layer.
    # Example: if nin=3, nouts=[4,4,1], then sz becomes [3, 4, 4, 1].
    sz = [nin] + nouts
    # Creates a list of Layer objects.
    # For each pair of consecutive sizes in 'sz' (e.g., (3,4), (4,4), (4,1)), it creates a layer.
    self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
  
  # This special method makes the MLP object "callable" like a function.
  # 'x': The initial input data (a list of numbers) for the entire network.
  def __call__(self, x):
    # Passes the input 'x' sequentially through each layer in the network.
    # The output of one layer becomes the input for the next layer.
    for layer in self.layers:
      x = layer(x) # Updates 'x' with the output of the current layer.
    return x # Returns the final output of the entire MLP (a Value object or a list of Value objects if the last layer has >1 neuron).
  
  # This method returns a list of all adjustable parameters (weights and biases) from all layers in the MLP.
  def parameters(self):
    # Flattens the list of parameters from all layers into a single, comprehensive list.
    return [p for layer in self.layers for p in layer.parameters()]

# Example of creating and using the MLP network:
x = [2.0, 3.0, -1.0] # An example input data point (a list representing 3 features).
# Creates an MLP: It takes 3 inputs, has a hidden layer with 4 neurons,
# another hidden layer with 4 neurons, and an output layer with 1 neuron.
n = MLP(3, [4, 4, 1])
n(x) # Calls the MLP with the input 'x' to get a prediction.
# Value(data=0.16578526021381612) # Example output for this input (a Value object). The exact number changes from run to run because the weights are initialized randomly.
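
# Quick sanity check (an added sketch): count the parameters this architecture should have.
# Each neuron holds one weight per input plus one bias, so a layer with 'nin' inputs and 'nout' neurons
# contributes (nin + 1) * nout parameters. 'sizes' and 'expected_params' are names introduced here.
sizes = [3, 4, 4, 1]
expected_params = sum((nin + 1) * nout for nin, nout in zip(sizes, sizes[1:]))
print(expected_params)  # 41, which should match len(n.parameters())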

# --- Part 5: Training the Neural Network ---
# This section demonstrates how we use the 'Value' class and its 'backward()' method
# to train our 'MLP' so it learns to make better predictions. This is the core of how neural networks learn!

xs = [
  [2.0, 3.0, -1.0], # Input data sample 1
  [3.0, -1.0, 0.5], # Input data sample 2
  [0.5, 1.0, 1.0],  # Input data sample 3
  [1.0, 1.0, -1.0], # Input data sample 4
] # A list of input data samples (each is a list of 3 numerical features).

ys = [1.0, -1.0, -1.0, 1.0] # desired targets
# A list of the "correct" or "desired" target outputs for each corresponding input sample in 'xs'.
# The network will try to learn to predict these values as accurately as possible.

# This loop represents the training process. We repeat the steps inside this loop 20 times.
# Each repetition is often called an "epoch".
for k in range(20):
  
  # forward pass # Step 1: Make predictions using the current state of the network (forward pass).
  # For each input 'x' in the 'xs' list, get the network's prediction 'n(x)'.
  # 'ypred' will be a list of 'Value' objects, one for each prediction.
  ypred = [n(x) for x in xs]
  
  # loss # Step 2: Calculate the "loss" (how wrong the predictions are).
  # We're using the sum of squared errors as our loss function. (Dividing it by the number of samples would give
  # the Mean Squared Error, MSE; the code below uses the plain sum.)
  # 'zip(ys, ypred)' pairs up each true target `ygt` with its corresponding predicted output `yout`.
  # `(yout - ygt)**2` calculates the squared difference (error) for each individual prediction.
  # `sum(...)` then adds up all these squared errors to get a single 'loss' Value object.
  # The ultimate goal of training is to make this 'loss' value as small as possible.
  loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
  
  # backward pass # Step 3: Compute gradients (backward pass).
  # Before calculating new gradients for the current iteration, it's essential to reset all previously computed
  # gradients for the network's parameters to 0.0.
  # This is because the '_backward' functions in the 'Value' class use `+=` (add-assignment)
  # to accumulate gradients from different paths in the computation graph. Resetting ensures a clean start.
  for p in n.parameters():
    p.grad = 0.0
  # Call the 'backward()' method on the 'loss' Value object.
  # This is the core of automatic differentiation! It triggers a chain reaction, calculating
  # the gradient for every single weight and bias (parameter) in the entire MLP.
  # These gradients tell us how much each parameter needs to change to reduce the 'loss'.
  loss.backward()
  
  # update # Step 4: Update parameters (optimization step).
  # Iterate through all the adjustable parameters (weights and biases) of the neural network.
  for p in n.parameters():
    # Adjust the parameter's 'data' (its numerical value) by moving it in the direction *opposite* to its gradient.
    # The `0.1` is the "learning rate". It's a small number that controls how big of a step
    # we take when adjusting parameters; the minus sign makes the step go *against* the gradient.
    # If a parameter's gradient is positive, increasing the parameter would increase the loss, so we decrease the parameter.
    # If a parameter's gradient is negative, decreasing the parameter would increase the loss, so we increase the parameter.
    p.data += -0.1 * p.grad
  
  # Prints the current iteration number 'k' and the current numerical value of the 'loss'.
  # As the network learns, you should see the 'loss.data' value gradually decrease over iterations.
  print(k, loss.data)
  
# Example output (showing the loss decreasing over epochs):
# 0 0.002056123958292787
# 1 0.0020404768419831024
# 2 0.0020250564320649566
# ... (many lines omitted)
# 19 0.0017934086088756394

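# After the training loop finishes (after 20 iterations), display the predictions from the final forward pass.
# Note that `ypred` was computed just before the last parameter update, so it corresponds to the last printed loss.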
ypred
# [Value(data=0.9817830812439714),    # Network's prediction for input 1 (original target was 1.0)
#  Value(data=-0.9863881624765284),   # Network's prediction for input 2 (original target was -1.0)
#  Value(data=-0.9766534529377958),   # Network's prediction for input 3 (original target was -1.0)
#  Value(data=0.9729591216966093)]    # Network's prediction for input 4 (original target was 1.0)
# Notice how these final predictions are now very close to the desired target values, showing that the network has learned!



In [None]:
"""Understanding Automatic Differentiation: A Simple Micrograd Implementation
This document explains a fundamental concept in machine learning and artificial intelligence: Automatic Differentiation (Autograd). This code provides a simplified, "micro" version of how powerful deep learning frameworks like PyTorch or TensorFlow calculate gradients, which are essential for training neural networks.

What is Automatic Differentiation?
Imagine you have a complex mathematical function, like the one that defines how a neural network makes a prediction. To "train" the network, you need to figure out how much each tiny adjustment to its internal settings (called "weights" and "biases") will affect the final prediction's error. This "how much" is precisely what a gradient tells you.

Automatic differentiation is a technique that automatically calculates these gradients. Instead of you manually deriving complex mathematical formulas for derivatives, the computer keeps track of all operations and then efficiently computes the gradients by working backward through the calculations.
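
To make "gradient" concrete, here is a minimal sketch that estimates a derivative numerically, reusing the warm-up function f(x) = 3x**2 - 4x + 5 from the start of the notebook. The helper name numerical_derivative and the step size h are illustrative assumptions, not part of the original code.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def numerical_derivative(fn, x, h=1e-5):
    # Central difference: (fn(x + h) - fn(x - h)) / (2 * h)
    return (fn(x + h) - fn(x - h)) / (2 * h)

slope = numerical_derivative(f, 3.0)
analytic = 6 * 3.0 - 4   # d/dx (3x**2 - 4x + 5) = 6x - 4
print(slope, analytic)   # both are approximately 14
```

Automatic differentiation arrives at the same number without having to choose h and without re-evaluating the function once per parameter.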

The Core Building Block: The Value Object
At the heart of this system is the Value class. Think of a Value object as a special container for a number. But it's more than just a number; it also keeps track of crucial information needed for automatic differentiation:

data: This is the actual numerical value that the Value object holds (e.g., 2.0, -3.0).

grad: This stands for "gradient." It's a number that tells us how much the final output of our entire calculation is affected by a small change in this specific Value. Initially, it's 0.0.

_prev: This is a set (a collection of unique items) of Value objects that were used as inputs to create this Value. For example, if c = a + b, then a and b would be in c._prev. This creates a "computational graph" – a record of how values depend on each other.

_op: A string representing the operation that created this Value (e.g., '+', '*', 'tanh'). This is mainly for visualization.

label: A simple name you can give to a Value object for easier understanding and visualization (e.g., 'a', 'b', 'L').

_backward: This is a special function associated with each Value object. When we perform an operation (like addition or multiplication), we also define how gradients should be passed backward through that specific operation. This function will be called during the backward pass.
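
The fields listed above could be held in a class along these lines. This is a minimal sketch: the constructor signature (the _children, _op and label arguments) is an assumption made for illustration and may differ from the notebook's actual class.

```python
class Value:
    # Minimal sketch: stores a number plus the bookkeeping described above.
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data                # the number itself
        self.grad = 0.0                 # d(final output)/d(this value); 0.0 until backward() runs
        self._backward = lambda: None   # leaf values have nothing to propagate
        self._prev = set(_children)     # input Values that produced this one
        self._op = _op                  # operation that produced this value
        self.label = label              # optional display name

    def __repr__(self):
        return f'Value(data={self.data})'

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
# Build the result of a + b by hand to show what gets recorded.
c = Value(a.data + b.data, _children=(a, b), _op='+', label='c')
print(c, c._op, sorted(v.label for v in c._prev))
```

Here c remembers both its inputs and the operation that produced it, which is all the backward pass needs to retrace the calculation.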

Building Blocks: Operations and Their Gradients
The Value class overrides standard Python operations like + (addition) and * (multiplication), allowing you to perform arithmetic directly on Value objects. This means when you add two Value objects, Python uses our custom __add__ method instead of its default addition.

When an operation like a + b happens:

A new Value object is created to hold the result (a.data + b.data).

This new Value object records a and b in its _prev set, linking them in the computational graph.

Most importantly, it defines its own _backward function. This function contains the specific rules for how the gradient flowing into this result should be distributed back to its inputs (a and b). For addition, the gradient simply passes through (multiplied by 1.0). For multiplication, it's a bit more complex (e.g., if out = a * b, then a.grad gets b.data * out.grad, and b.grad gets a.data * out.grad).
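
The two gradient rules just described can be sketched as follows. The class is a stripped-down stand-in written for this example (it assumes both operands are already Value objects), and the backward pass is run by hand, output first, to show what each _backward does.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # d(a+b)/da = 1 and d(a+b)/db = 1: the gradient passes through unchanged.
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # d(a*b)/da = b and d(a*b)/db = a.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
e = a * b        # -6.0
d = e + c        # 4.0

# Manually run the backward pass, starting from the output.
d.grad = 1.0
d._backward()    # fills e.grad and c.grad
e._backward()    # fills a.grad and b.grad
print(a.grad, b.grad, c.grad)   # -3.0 2.0 1.0
```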

The tanh function is an example of an "activation function" often used in neural networks. It squashes any input number into a range between -1 and 1. It also defines its specific _backward rule based on the derivative of the tanh function.
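
A minimal sketch of the tanh rule, checked against a numerical estimate. It uses math.tanh for brevity, which is an assumption; the notebook's class may compute tanh from exponentials instead. The derivative used is d/dx tanh(x) = 1 - tanh(x)**2.

```python
import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            # Local derivative (1 - t**2) times the gradient flowing into the output.
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

x = Value(0.5)
y = x.tanh()
y.grad = 1.0
y._backward()

# Compare with a central-difference estimate of the same derivative.
h = 1e-6
numeric = (math.tanh(0.5 + h) - math.tanh(0.5 - h)) / (2 * h)
print(x.grad, numeric)
```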

New Operations in Value
The updated Value class includes more operations, making it more versatile:

__pow__(self, other): Handles exponentiation (e.g., x**2). For simplicity, it supports only integer or float exponents. Its _backward function applies the power rule of differentiation.

__rmul__(self, other): This allows for reverse multiplication (e.g., 2 * x where 2 is a regular number and x is a Value object). It simply calls self * other.

__truediv__(self, other): Implements division (e.g., x / y). It cleverly reuses the multiplication and power operations by treating division as self * other**-1.

__neg__(self): Handles negation (e.g., -x). It treats this as multiplication by -1 (self * -1).

__sub__(self, other): Implements subtraction (e.g., x - y). It reuses addition and negation by treating it as self + (-other).

__radd__(self, other): Allows for reverse addition (e.g., 2 + x). It simply calls self + other.

exp(self): Implements the exponential function (e^x). Its _backward function correctly applies the derivative rule for e^x, which is e^x itself.

These additional methods make the Value class more robust, allowing you to build more complex mathematical expressions and automatically compute their gradients.
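
One way these operations might fit together is sketched below. The method bodies follow the descriptions above, but the details (such as wrapping plain numbers in Value inside __add__ and __mul__) are assumptions for illustration rather than a copy of the notebook's class.

```python
import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), 'only int/float powers are supported'
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            # Power rule: d/dx x**n = n * x**(n - 1)
            self.grad += other * self.data ** (other - 1) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        out = Value(math.exp(self.data), (self,), 'exp')
        def _backward():
            # d/dx e^x = e^x, which is exactly out.data
            self.grad += out.data * out.grad
        out._backward = _backward
        return out

    # Derived operations reuse the primitives above.
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __truediv__(self, other): return self * other ** -1

x = Value(3.0)
y = x ** 2
y.grad = 1.0
y._backward()
print(x.grad)                        # 6.0, from the power rule 2 * 3.0
print((Value(8.0) / Value(2.0)).data, (2 * Value(4.0)).data, (Value(5.0) - 2).data)
```

Because division, negation and subtraction are expressed through multiplication, addition and powers, they need no gradient rules of their own.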

The Magic: The backward() Method for Gradient Calculation
This is the core of automatic differentiation! When you call some_value.backward(), it does two main things:

Builds a Topological Order (build_topo): It first figures out the correct order to process all the Value objects in the computational graph. Starting from some_value (the final output), it traces back through all dependencies and produces a list in which every Value appears after the Values it was computed from. Traversing that list in reverse therefore guarantees that a Value's _backward() runs only after every Value that depends on it has already been processed. This ordered list is called a "topological sort."

Propagates Gradients:

It initializes the grad of the some_value (the final output) to 1.0. This is because we are asking "how much does the final output change with respect to itself?" The answer is 1.

Then, it iterates through the Value objects in the topo list in reverse order (from the final output back to the initial inputs).

For each Value object, it calls its specific _backward() function. This function then adds its share of the gradient to the Value objects it was computed from (the inputs stored in _prev). This process continues, effectively "back-propagating" the gradient through the entire graph.

After backward() completes, every Value object in the graph will have its grad attribute filled with the derivative of the final output with respect to that Value, telling you how strongly a small change in that specific value affects the final output.
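
The two steps above can be sketched as follows. The class is a reduced stand-in (addition and multiplication only) written for this example, so treat the exact method bodies as assumptions rather than the notebook's code.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Step 1: topological order, each Value listed after its inputs.
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # Step 2: seed the output gradient, then walk the list in reverse.
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

a, b, c, f = Value(2.0), Value(-3.0), Value(10.0), Value(-2.0)
L = (a * b + c) * f      # (-6 + 10) * -2 = -8
L.backward()
print(a.grad, b.grad, c.grad, f.grad)   # 6.0 -4.0 -2.0 4.0

# A value used twice receives gradient from both paths, thanks to +=.
x = Value(3.0)
y = x + x
y.backward()
print(x.grad)            # 2.0
```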

Visualizing the Graph (draw_dot)
The draw_dot function uses a library called graphviz to draw a visual representation of your computational graph. This is incredibly helpful for understanding how values are connected and how gradients flow. Each node in the graph shows:

Its label (if provided)

Its data value

Its calculated grad value (after backward() has been called)

Operations like +, *, tanh are also shown as separate nodes, connecting inputs to their results.

Building a Neural Network: Neuron, Layer, and MLP
The provided code also introduces classes to build a simple Multi-Layer Perceptron (MLP), a fundamental type of neural network.

Neuron Class:

A Neuron is the basic computational unit of a neural network.

It has a set of weights (w) and a bias (b). These are the adjustable parameters that the network "learns."

The __call__ method defines how the neuron processes input x: it calculates a weighted sum of inputs (wi*xi for each input xi and its corresponding weight wi), adds the bias (b), and then applies a non-linear activation function (here, tanh). The result is the neuron's output.

The parameters() method returns all the Value objects (weights and bias) within this neuron, which are the parameters we need to optimize during training.

Layer Class:

A Layer is a collection of Neuron objects.

It takes an input of a certain size (nin) and produces an output of another size (nout), where nout is the number of neurons in the layer.

Its __call__ method simply passes the input x through each neuron in the layer, collecting their outputs. If there's only one neuron, it returns its output directly; otherwise, it returns a list of outputs.

Its parameters() method collects all parameters from all neurons within that layer.

MLP (Multi-Layer Perceptron) Class:

An MLP is a sequence of Layer objects, forming the complete neural network.

It takes an input size (nin) and a list of output sizes, one per layer (nouts). The last entry in nouts is the size of the network's final output.

Its __call__ method processes the input x sequentially through each layer. The output of one layer becomes the input for the next, until the final output of the network is produced.

Its parameters() method collects all parameters from all layers in the network.
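
The structure of the three classes can be sketched as below. To keep the example short and self-contained it uses plain floats instead of Value objects, so it shows only the forward pass and parameter collection, not gradients. The layer sizes (3 inputs, then layers of 4, 4 and 1 neurons) and the weight initialization range are illustrative assumptions.

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # Weighted sum of inputs plus bias, squashed by tanh.
        act = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return math.tanh(act)

    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        # The output of each layer is the input of the next.
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

random.seed(0)
net = MLP(3, [4, 4, 1])
out = net([2.0, 3.0, -1.0])
print(out, len(net.parameters()))
```

With these sizes the network has (3+1)*4 + (4+1)*4 + (4+1)*1 = 41 parameters: one weight per input plus one bias for every neuron.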

Training a Neural Network
The final part of the code demonstrates a basic training loop for the MLP. The goal of training is to adjust the network's parameters (weights and biases) so that its predictions (ypred) are as close as possible to the desired target values (ys).

Data Preparation:

xs: A list of input data samples.

ys: A list of corresponding desired target outputs for each input sample.

Training Loop (for k in range(20)): This loop runs for a fixed number of iterations (epochs).

Forward Pass:

ypred = [n(x) for x in xs]: For each input sample x in xs, the MLP (n) makes a prediction, resulting in a list of ypred Value objects.

loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred)): A loss function is calculated. Here, it's the sum of squared errors (the same quantity as the mean squared error, just without dividing by the number of samples). For each prediction yout and its corresponding true target ygt, the squared difference is calculated, and all these squared differences are summed up. The loss is a single Value object representing how "wrong" the network's current predictions are.
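
As a quick check of the loss expression, the sketch below applies it to plain floats, using the four targets and the four predictions printed at the end of the notebook's training run. The variable names sse and mse are illustrative. Note that the expression sums the squared errors; dividing by the number of samples would give the mean.

```python
ys = [1.0, -1.0, -1.0, 1.0]
ypred = [0.9817830812439714, -0.9863881624765284,
         -0.9766534529377958, 0.9729591216966093]

# Same expression as in the training loop, applied to plain floats.
sse = sum((yout - ygt) ** 2 for ygt, yout in zip(ys, ypred))
mse = sse / len(ys)
print(sse, mse)
```

The sum comes out at roughly 0.00179, in line with the loss printed for the last iteration of the run.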

Backward Pass:

for p in n.parameters(): p.grad = 0.0: Before computing new gradients, all existing gradients for the network's parameters are reset to zero. This is crucial because _backward functions add to grad, so we need a clean slate for each iteration.

loss.backward(): This is where the magic happens! The backward() method is called on the loss Value object. This triggers the automatic differentiation process, propagating gradients all the way back through the MLP (through tanh, +, * operations in neurons and layers) to compute the gradient for every single weight and bias in the network.
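
A small sketch of why the reset matters. The stripped-down Value class here (multiplication only) is an assumption made for illustration; the point is only that _backward uses +=, so gradients from an earlier backward pass linger unless they are zeroed.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

w = Value(3.0)

# Without a reset, the second pass adds onto the first pass's gradient.
stale = []
for _ in range(2):
    loss = w * w          # d(w*w)/dw = 2*w = 6.0
    loss.backward()
    stale.append(w.grad)

# With a reset before each backward pass, the gradient is correct every time.
fresh = []
for _ in range(2):
    w.grad = 0.0
    loss = w * w
    loss.backward()
    fresh.append(w.grad)

print(stale, fresh)       # [6.0, 12.0] [6.0, 6.0]
```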

Update Parameters:

for p in n.parameters(): p.data += -0.1 * p.grad: This is the optimization step. Each parameter's data is updated. We move its value in the direction opposite to its gradient (hence the -0.1 * p.grad).

0.1 is the learning rate, a small number that controls how big of a step we take against the gradient. A positive gradient means increasing the parameter would increase the loss, so we decrease the parameter. A negative gradient means decreasing the parameter would increase the loss, so we increase the parameter.
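
The update rule can be seen in isolation on a one-parameter problem. In this sketch the loss (w - 3)**2 and its hand-written gradient are illustrative stand-ins for the network's loss and for the gradients that loss.backward() would supply; the learning rate of 0.1 and the 20 iterations mirror the training loop.

```python
def loss_fn(w):
    return (w - 3.0) ** 2

def grad_fn(w):
    # Derivative of (w - 3)**2 with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
losses = []
for k in range(20):
    losses.append(loss_fn(w))
    # Step against the gradient, exactly like p.data += -0.1 * p.grad.
    w += -learning_rate * grad_fn(w)

print(w, losses[0], losses[-1])
```

Each step shrinks the distance to the minimum at w = 3 by a constant factor, so the recorded loss falls on every iteration.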

Print Loss: print(k, loss.data): The current iteration number k and the value of the loss are printed. You'll observe that the loss.data generally decreases over iterations, indicating that the network is learning and its predictions are getting closer to the targets.

This training loop demonstrates the full cycle of a neural network learning: making predictions, calculating error, finding out how to adjust parameters to reduce error, and then actually adjusting them. This iterative process is how deep learning models are trained to perform complex tasks."""