torch.nn: PyTorch's neural network library, which supports the construction of neural networks with pre-defined classes and functions that streamline building, training, and deploying deep learning models.
- supports a variety of layer implementations
    - linear layers: nn.Linear
    - convolutional layers: nn.Conv2d, nn.Conv3d for convolutional operations
    - recurrent layers: nn.RNN, nn.LSTM, nn.GRU for sequential data processing
    - normalization layers: nn.BatchNorm2d, nn.LayerNorm for normalizing activations

- activation functions:
    - nn.ReLU, nn.Sigmoid, nn.Tanh

- loss functions:
    - mean squared error: nn.MSELoss
    - cross entropy loss: nn.CrossEntropyLoss
    - negative log-likelihood: nn.NLLLoss

- container modules:
    - chaining layers so that each output feeds the next layer: nn.Sequential
    - holding submodules in a list-like container: nn.ModuleList

- utilities:
    - regularization: nn.Dropout
    - learnable parameters: nn.Parameter
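
example: a minimal sketch of how the pieces listed above fit together, assuming PyTorch is installed. the layer sizes, the random batch, and the class labels are made up for illustration and are not from any particular dataset.

```python
import torch
from torch import nn

torch.manual_seed(0)  # reproducible random weights and inputs

# A small classifier assembled from the building blocks listed above.
model = nn.Sequential(
    nn.Linear(8, 16),   # linear layer: 8 input features -> 16 hidden units
    nn.ReLU(),          # activation function
    nn.Dropout(p=0.2),  # regularization utility
    nn.Linear(16, 3),   # output layer: 3 raw class scores (logits)
)

loss_fn = nn.CrossEntropyLoss()  # loss function for classification

x = torch.randn(4, 8)                 # batch of 4 made-up samples
targets = torch.tensor([0, 2, 1, 2])  # made-up class labels

logits = model(x)
loss = loss_fn(logits, targets)

print(logits.shape)  # torch.Size([4, 3])
print(loss.item())   # a single scalar loss value
```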

    

nn.Module: the base class for building neural networks in PyTorch; it helps define complex architectures, manage parameters effectively, and use built-in functionality that streamlines development

- modular design: networks are composed by stacking modules such as linear, convolutional, or recurrent layers, so complex architectures are built from simpler components
- parameter management: all parameters (weights and biases) assigned as attributes of an nn.Module, directly or through its submodules, are registered automatically and are accessible via methods such as .parameters() and .named_parameters()
- device management: .to() moves a module between CPU and GPU in either direction and transfers all of its parameters and submodules along with it
- training and evaluation: .train() and .eval() switch between training and evaluation modes, which changes the behavior of layers such as nn.Dropout and nn.BatchNorm2d
- nested: a module can contain other modules as submodules
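
example: a minimal sketch of a custom module that touches each point above. the class name TinyNet, its layer sizes, and the extra scale parameter are hypothetical and chosen only for illustration.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """Hypothetical two-layer network used only to illustrate nn.Module."""

    def __init__(self):
        super().__init__()
        # Submodules assigned as attributes are registered automatically.
        self.hidden = nn.Linear(4, 8)
        self.drop = nn.Dropout(p=0.5)
        self.out = nn.Linear(8, 1)
        # A bare nn.Parameter is registered as well.
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.scale * self.out(self.drop(h))

net = TinyNet()

# Parameter management: names and total count of learnable values.
param_names = sorted(name for name, _ in net.named_parameters())
n_params = sum(p.numel() for p in net.parameters())
print(param_names)
print(n_params)  # 50 = (4*8 + 8) + (8*1 + 1) + 1

# Device management: move everything to the GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(device)

# Training vs. evaluation mode propagates to nested submodules.
net.eval()
eval_flags = (net.training, net.drop.training)  # (False, False)
output = net(torch.randn(2, 4, device=device))
net.train()
train_flags = (net.training, net.drop.training)  # (True, True)
```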

nn.Linear: implements an affine transformation (a linear map plus an optional bias) that maps input features to output features

at its core, nn.Linear represents a fully connected layer and performs the following operation: y = x . W^T + bias
x: input tensor, W: weight matrix of shape (out_features, in_features), bias: vector of length out_features


- in_features: number of input features expected
- out_features: number of output features produced
- bias: a boolean (default True); if True, the layer adds a learnable bias
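
example: a small sketch that checks the operation of nn.Linear against the same computation written by hand. the sizes (3 inputs, 2 outputs, batch of 5) are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(in_features=3, out_features=2, bias=True)

x = torch.randn(5, 3)  # batch of 5 samples, 3 features each
y = layer(x)

# The same computation written out by hand: x times W transposed, plus bias.
y_manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape)  # torch.Size([2, 3]) -> (out_features, in_features)
print(layer.bias.shape)    # torch.Size([2])
print(y.shape)             # torch.Size([5, 2])

same = torch.allclose(y, y_manual, atol=1e-6)
print(same)  # True
```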



forward propagation: the process in which data passes through the layers of a neural network to produce an output. each layer takes its input, applies a linear transformation, and passes the result through an activation function.

prediction: forward propagation is the mechanism that produces an output (the prediction) from a given input

input data: raw features or data points
linear transformation: for each layer, the weighted sum of the inputs is computed and the bias is added; z = x . W^T + bias

activation function: an activation function such as sigmoid or ReLU is applied to introduce non-linearity; a = activation_func(z)

layer-by-layer propagation: the output a of one layer becomes the input x of the next layer, and this continues until the final layer is reached

output: the final activation from the output layer is the prediction for the given input
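
example: the steps above written out by hand for a two-layer network, as a rough sketch. the weights are random and the layer sizes are made up; the names W1, b1, W2, b2 are illustrative only.

```python
import torch

torch.manual_seed(0)

# input data: one sample with 3 raw features
x = torch.randn(1, 3)

# made-up parameters: hidden layer (3 -> 4) and output layer (4 -> 1)
W1, b1 = torch.randn(4, 3), torch.zeros(4)
W2, b2 = torch.randn(1, 4), torch.zeros(1)

# layer 1: linear transformation, then activation
z1 = x @ W1.T + b1
a1 = torch.relu(z1)

# layer 2: the output a1 of layer 1 is the input of layer 2
z2 = a1 @ W2.T + b2
prediction = torch.sigmoid(z2)  # final activation is the prediction

print(prediction.shape)  # torch.Size([1, 1])
```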


weights: adjustable parameters that determine how much influence each input feature has on the output. during forward propagation, weights scale the inputs before they are summed (usually with a bias added) and passed through an activation function

- weights are the primary means by which a network learns patterns: they are adjusted during training to reduce the error
- they allow the network to emphasize the most important or relevant features and suppress the less significant ones
- combination of weights across layers allows the network to approximate complex functions

- in a fully connected layer with n inputs and m outputs, the weight matrix has shape m x n


bias: an extra parameter added to the weighted sum of the inputs before the activation function is applied; it shifts the value fed into the activation function and therefore shifts the output

- without bias, the weighted sum is always zero when the inputs are zero, no matter how the weights are adjusted; the bias lets a neuron produce a different value even when all inputs are zero
- bias helps the network learn patterns that do not necessarily pass through the origin, which is crucial for accurately modelling real-world relationships
- by adjusting the bias, the network can shift its decision boundaries and capture the underlying relationships in the data
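
example: a small sketch of the zero-input case, comparing a linear layer with and without a bias. the layer sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

zeros = torch.zeros(1, 3)  # an all-zero input with 3 features

no_bias = nn.Linear(3, 2, bias=False)
with_bias = nn.Linear(3, 2, bias=True)

with torch.no_grad():
    out_no_bias = no_bias(zeros)      # weighted sum of zeros is zero
    out_with_bias = with_bias(zeros)  # weighted sum of zeros plus the bias

print(out_no_bias)    # all zeros, whatever the weights are
print(out_with_bias)  # equals the layer's bias vector
```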



error: a measure of how far the network's predicted output is from the actual or target value

- knowing how far off the predictions are makes it possible to adjust the weights during training
- it provides immediate feedback on the model's performance on individual examples
- it is the raw signal that a loss function builds upon to improve overall model performance
error = target - prediction
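
example: the formula above applied to three made-up target and prediction values.

```python
import torch

target = torch.tensor([3.0, -0.5, 2.0])      # made-up target values
prediction = torch.tensor([2.5, 0.0, 2.0])   # made-up predictions

error = target - prediction  # one signed error per example
print(error.tolist())  # [0.5, -0.5, 0.0]
```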


loss: a function that quantifies how far a network's predictions are from the actual target values by transforming the raw error between targets and predictions into a scalar value

- minimizing the loss is the central objective of a training algorithm
- it gives a numerical measure of how well a model is performing
- during training, the loss is used to calculate gradients that tell the network how to update its weights and biases


MSE or Mean Squared Error: a loss function that measures the average of squared differences between the predicted values and the actual target values
    - squaring the errors penalizes larger errors more heavily than the smaller ones, thus encouraging the model to avoid large mistakes
    - MSE is smooth and differentiable, which is essential for optimization methods such as gradient descent
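
example: MSE computed by hand and with nn.MSELoss on made-up values. the single error of size 1.0 contributes four times as much as each error of size 0.5, which shows the heavier penalty on larger errors.

```python
import torch
from torch import nn

target = torch.tensor([3.0, -0.5, 2.0, 7.0])      # made-up targets
prediction = torch.tensor([2.5, 0.0, 2.0, 8.0])   # made-up predictions

squared_errors = (target - prediction) ** 2
manual_mse = squared_errors.mean()
builtin_mse = nn.MSELoss()(prediction, target)

print(squared_errors.tolist())  # [0.25, 0.25, 0.0, 1.0]
print(manual_mse.item())        # 0.375
print(builtin_mse.item())       # 0.375
```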

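Cross Entropy: measures the difference between two probability distributions, typically the predicted class probabilities and the true labels; nn.CrossEntropyLoss takes raw scores (logits) and target class indices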
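
example: a sketch showing that nn.CrossEntropyLoss on raw scores gives the same value as a log-softmax followed by nn.NLLLoss. the scores and labels are made up.

```python
import torch
from torch import nn

# made-up raw scores (logits) for 2 samples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # made-up correct class per sample

# built-in cross entropy works directly on the raw scores
ce = nn.CrossEntropyLoss()(logits, targets)

# the same value in two steps: log-probabilities, then negative log-likelihood
log_probs = torch.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# and fully by hand: average of -log p(correct class)
manual = -log_probs[torch.arange(2), targets].mean()

print(torch.allclose(ce, nll))     # True
print(torch.allclose(ce, manual))  # True
```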

cost: an overall measure of how well a neural network is performing, obtained by aggregating the loss across a dataset or batch. the loss is generally computed per sample; the cost is usually the average loss over all samples
- cost provides a single, comprehensive metric that represents the network's performance
- during training, an optimization process such as gradient descent minimizes the cost; a lower cost means the network's predictions are closer to the target values
- cost is generally used to compare different models or training iterations, which helps show how changes in training lead to better performance
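
example: a sketch of the per-sample loss versus the cost, using squared error on four made-up samples. reduction="none" keeps one loss value per sample; averaging those values gives the cost.

```python
import torch
from torch import nn

prediction = torch.tensor([[2.5], [0.0], [2.0], [8.0]])  # made-up predictions
target = torch.tensor([[3.0], [-0.5], [2.0], [7.0]])     # made-up targets

# loss: one value per sample
per_sample_loss = nn.MSELoss(reduction="none")(prediction, target).squeeze(1)

# cost: the average loss over the batch
cost = per_sample_loss.mean()

# the default reduction ("mean") does the averaging in one call
cost_builtin = nn.MSELoss()(prediction, target)

print(per_sample_loss.tolist())  # [0.25, 0.25, 0.0, 1.0]
print(cost.item())               # 0.375
```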

back propagation: the algorithm used to compute gradients when training neural networks. once forward propagation gives an output, back propagation computes how much each weight and bias contributed to the final error (loss).

- tells the network how to adjust weights to reduce the error
- by applying the chain rule, back propagation computes all necessary gradients efficiently
- these gradients are used by optimization algorithms to update parameters, ultimately minimizing the loss and improving model performance
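
example: a sketch of back propagation on a single neuron y = w * x + b with a squared-error loss, using PyTorch's autograd and checking the result against the chain rule by hand. all numbers and the learning rate lr are made up for illustration.

```python
import torch

x = torch.tensor(2.0)
target = torch.tensor(10.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# forward propagation
y = w * x + b             # 3 * 2 + 1 = 7
loss = (y - target) ** 2  # (7 - 10)^2 = 9

# back propagation: fills w.grad and b.grad
loss.backward()

# chain rule by hand: dL/dy = 2 * (y - target), dy/dw = x, dy/db = 1
dL_dy = 2 * (y.item() - target.item())  # -6.0
print(w.grad.item(), dL_dy * x.item())  # -12.0 -12.0
print(b.grad.item(), dL_dy)             # -6.0 -6.0

# one gradient descent step using those gradients
lr = 0.05
with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad

new_loss = ((w * x + b - target) ** 2).item()
print(new_loss < loss.item())  # True: the update reduced the loss
```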


ANN for regression: a neural network designed to predict continuous output values given an input, where instead of classifying inputs into discrete categories, the network learns to approximate a function that maps inputs to continuous outputs.

- can approximate highly complex or non-linear functions
- can model complex relationships that traditional linear models may not capture
- can scale to large datasets and multiple input features and provide predictions
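
example: a minimal sketch of a regression network fitted to a non-linear function. the target function (sin), the network size, the Adam optimizer from torch.optim, the learning rate, and the number of steps are all assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# made-up regression data: y = sin(x) on [-3, 3]
x = torch.linspace(-3, 3, 200).unsqueeze(1)  # shape (200, 1)
y = torch.sin(x)

# small network with one continuous output (no final activation)
model = nn.Sequential(
    nn.Linear(1, 32),
    nn.Tanh(),
    nn.Linear(32, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

initial_loss = loss_fn(model(x), y).item()

for _ in range(500):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(x), y)  # forward propagation and loss
    loss.backward()              # back propagation
    optimizer.step()             # update weights and biases

final_loss = loss_fn(model(x), y).item()
print(initial_loss, final_loss)  # the final loss should be much smaller
```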
