# Lecture 03/24: Physics-Informed Neural Networks (PINNs)
### Approach: background, methodology, examples

The key idea of a physics-informed neural network is to modify the loss function (i.e., the prescribed error that the training algorithm minimizes). By writing the physical laws governing the system as an expression that vanishes when the laws are satisfied, the laws themselves can be _encoded_ in the neural network by adding that expression as a term in the loss function. For example, the loss $\mathcal{L}$ is composed of $\mathcal{L}_{\mathrm{error}}$ and $\mathcal{L}_{\mathrm{physical}}$, such that
$$\mathcal{L}=\mathcal{L}_{\mathrm{error}} + \mathcal{L}_{\mathrm{physical}}$$

and

$$ \min{\mathcal{L}} = \min{\left[\mathcal{L}_{\mathrm{error}} + \mathcal{L}_{\mathrm{physical}}\right]} $$

The error from the PDE governing the physical system, $\mathcal{L}_{\mathrm{physical}}$, measures the _residual_ of the PDE or law: the amount by which the network's prediction fails to satisfy it.

Since a PDE alone does not uniquely determine the behavior of the system, the initial conditions and boundary conditions must likewise be incorporated as separate terms in the loss function.

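As a rough illustration of how these terms combine, the sketch below evaluates a composite loss for the 1D advection equation $u_t + c\,u_x = 0$ on a candidate function `u(x, t)`. Everything specific here is an assumption made for the example rather than something taken from the lecture: the names (`pinn_loss`, `residual`, the weights `w_*`), the initial condition $u(x,0)=\sin x$, the boundary at $x=0$, and the use of central finite differences in place of the automatic differentiation a real PINN would use. The exact solution should give a loss near zero, while a wave travelling the wrong way should not.

```python
import math

C = 1.0  # assumed advection speed

def residual(u, x, t, c=C, h=1e-4):
    """PDE residual u_t + c*u_x at (x, t), approximated by central differences."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    return u_t + c * u_x

def mse(values):
    """Mean of squared values."""
    return sum(v * v for v in values) / len(values)

def pinn_loss(u, data, colloc, ic_points, bc_points, c=C,
              w_data=1.0, w_phys=1.0, w_ic=1.0, w_bc=1.0):
    """Weighted sum of data, physics, initial-condition and boundary terms."""
    l_error = mse([u(x, t) - y for (x, t, y) in data])           # fit to observations
    l_phys = mse([residual(u, x, t, c) for (x, t) in colloc])    # PDE residual
    l_ic = mse([u(x, 0.0) - math.sin(x) for x in ic_points])     # assumed u(x, 0) = sin(x)
    l_bc = mse([u(0.0, t) - math.sin(-c * t) for t in bc_points])  # assumed u(0, t)
    return w_data * l_error + w_phys * l_phys + w_ic * l_ic + w_bc * l_bc

def exact(x, t):
    return math.sin(x - C * t)   # satisfies the PDE and both conditions

def wrong(x, t):
    return math.sin(x + C * t)   # travels in the wrong direction

grid = [(0.5 * i, 0.25 * j) for i in range(5) for j in range(1, 5)]
data = [(x, t, exact(x, t)) for (x, t) in grid[:5]]
ic_points = [0.5 * i for i in range(5)]
bc_points = [0.25 * j for j in range(5)]

loss_exact = pinn_loss(exact, data, grid, ic_points, bc_points)
loss_wrong = pinn_loss(wrong, data, grid, ic_points, bc_points)
print(loss_exact, loss_wrong)
```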
With each additional term, the loss landscape in the high-dimensional parameter space becomes increasingly rugged. This can lead to _worse_ performance for the PINN than for a network trained on data alone, without the physical laws. There are, however, methods to mitigate these effects. One such method is **curriculum learning**.

For example, we would first train the advection-equation PINN on only a small portion of the problem, namely a specific regime of velocities or solutions, and gradually expand its scope once it has learned that regime. An alternative approach is to pose the problem as a sequence of time windows: the PINN learns to predict the solution over a finite time horizon and then iteratively predicts the following time windows.

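Both curriculum variants amount to a schedule over what the network sees at each stage. The sketch below only builds such schedules; the function names, the linear widening of the velocity range, and the equal-width time windows are illustrative assumptions, and the actual training call is left as a comment because the lecture does not specify one.

```python
def velocity_curriculum(c_min, c_max, n_stages):
    """Yield (lo, hi) velocity ranges that widen linearly up to the full range."""
    for k in range(1, n_stages + 1):
        hi = c_min + (c_max - c_min) * k / n_stages
        yield (c_min, hi)

def time_windows(t_end, n_windows):
    """Split [0, t_end] into consecutive windows to be learned one after another."""
    dt = t_end / n_windows
    return [(k * dt, (k + 1) * dt) for k in range(n_windows)]

stages = list(velocity_curriculum(1.0, 5.0, 4))
windows = time_windows(2.0, 4)

for lo, hi in stages:
    # A hypothetical train_stage(model, lo, hi) would run here, reusing the
    # weights learned on the previous, narrower velocity range.
    print(f"train on velocities in [{lo}, {hi}]")

for t0, t1 in windows:
    # In the sequential variant, the prediction at t0 from the previous window
    # would serve as the initial condition for this one.
    print(f"train on time window [{t0}, {t1}]")
```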
Another such technique is _adaptive sampling_. Similar to importance sampling in Monte Carlo simulations, this method adaptively redistributes the sample points during training so that they concentrate on regions with steep gradients, where the solution carries the most information.

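One simple way to realize this idea is to draw collocation points with probability proportional to the current size of the PDE residual. The sketch below assumes that scheme, which is only one of several possible ones. `residual_magnitude` is a made-up stand-in with a sharp peak near $x = 0.5$; in a real PINN it would be computed from the network itself.

```python
import math
import random

def residual_magnitude(x):
    """Stand-in for |PDE residual| at x: sharply peaked near x = 0.5 (assumed)."""
    return math.exp(-((x - 0.5) / 0.05) ** 2) + 1e-3

def adaptive_sample(n_points, n_candidates=2000, seed=0):
    """Pick collocation points from uniform candidates, weighted by the residual."""
    rng = random.Random(seed)
    candidates = [rng.random() for _ in range(n_candidates)]
    weights = [residual_magnitude(x) for x in candidates]
    return rng.choices(candidates, weights=weights, k=n_points)

points = adaptive_sample(500)
near_front = sum(1 for x in points if abs(x - 0.5) < 0.1) / len(points)
print(f"fraction of points within 0.1 of the front: {near_front:.2f}")
```

Uniform sampling would place about 20% of the points in that band; the weighted draw places most of them there. The small constant added to the weights keeps some coverage everywhere else.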
### Example PDEs:
- Advection Equation
- Reaction-Diffusion Equation

## Supervised learning vs. sequence modelling:
#### Supervised learning:
- data: $\{x_i, y_i\}$
- model: $y \approx f_\theta(x)$
- Loss: $\mathcal{L} = \sum_{i=1}^N l(f_\theta(x_i),y_i)$
- Optimization: $\theta^* = \arg\min_\theta (\mathcal{L})$

#### Sequence Modelling:
- data: $\{x_i\}$
- model: $p(x) \approx f_\theta(x)$
- Loss (log-likelihood): $\mathcal{L} = \sum_{i=1}^N \log f_\theta(x_i)$
- Optimization: $\theta^* = \arg\max_\theta (\mathcal{L})$

$\implies$ the sequence modelling approach does NOT use any labels (i.e., known outputs $y_i$); it learns from the inputs $x_i$ alone.

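The contrast between the two objectives can be made concrete with toy numbers. In the sketch below the squared-error choice for $l$, the one-parameter models, and the tiny datasets are all assumptions chosen for illustration. The supervised loss needs pairs $(x_i, y_i)$ and is minimized; the log-likelihood needs only the $x_i$ and is maximized.

```python
import math

def supervised_loss(f, xs, ys):
    """Sum of squared errors between predictions f(x_i) and labels y_i."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys))

def log_likelihood(p, xs):
    """Sum of log-probabilities the model p assigns to the observed x_i."""
    return sum(math.log(p(x)) for x in xs)

# Supervised: labelled pairs, model y = theta * x, lower loss is better.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
loss_good = supervised_loss(lambda x: 2.0 * x, xs, ys)
loss_bad = supervised_loss(lambda x: 1.0 * x, xs, ys)

# Sequence / density modelling: tokens only, no labels, higher likelihood is better.
tokens = ["a", "b", "a", "a"]

def categorical(theta_a):
    """Two-symbol model: p('a') = theta_a, p(anything else) = 1 - theta_a."""
    return lambda tok: theta_a if tok == "a" else 1.0 - theta_a

ll_mle = log_likelihood(categorical(0.75), tokens)   # matches the empirical frequency
ll_other = log_likelihood(categorical(0.5), tokens)

print(loss_good, loss_bad)
print(ll_mle, ll_other)
```

Here `theta = 2` drives the supervised loss to zero, and `theta_a = 0.75` (the observed frequency of `"a"`) gives a higher log-likelihood than `theta_a = 0.5`.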
# The Transformer
#### Created by Google Brain
- NN that learns context and thus meaning
- Tracks relationships in sequential data, like words in a sentence
- Applies an evolving set of operations, called attention
- Detects subtle ways that distant data elements can influence and depend on each other
- Now typically trained in a self-supervised way and used as the basis of “foundation” models
 
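The notes do not spell out the attention operation itself. A minimal sketch, assuming the standard scaled dot-product form $\mathrm{softmax}(QK^\top/\sqrt{d})\,V$, is shown below. Using the raw inputs as queries, keys and values at once (no learned projection matrices) is a simplification made only to keep the example short.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights); weights[i][j] is how much query i attends to key j.
    """
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Three "tokens" with 2-dimensional embeddings (made-up values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)   # self-attention: Q = K = V = X
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, so every output is a weighted average over the whole sequence; this is how an element can depend on others regardless of how far apart they are.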
#### The positional encoder
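- Tracks where the element is in the sequence
- Uses sinusoidal functions of the position $\mathrm{pos}$ and the embedding index $i$ (with embedding dimension $d$), such as
$$ \sin{\left( \frac{\mathrm{pos}}{10000^{\frac{2i}{d}} } \right)} $$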
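
A minimal sketch of this encoding for a single position follows. The sine term is the formula above; pairing it with a cosine on the odd embedding indices follows the original Transformer paper and is an addition not shown in these notes. The function name and the small dimension `d_model = 4` are chosen for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position: sin on even indices, cos on odd."""
    pe = []
    for k in range(d_model):
        i = k // 2  # index of the sin/cos pair, the i in the formula
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)
print([round(v, 4) for v in pe1])
```

Low pair indices oscillate quickly with position and high ones slowly, so each position receives a distinct pattern across the embedding dimensions.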