Step-by-step, detailed gradient derivations for common supervised machine learning and deep learning loss functions, suitable for people who have just started learning machine learning and deep learning. Notation in the field varies widely from person to person and changes over time, but the notation here is consistent and easy to follow.
So far, I have derived the gradients of the loss functions for linear regression and logistic regression using the notation below. To see how gradients can be derived in a deep learning neural network setting, you can check my Notes-for-Stanford-CS224N-NLP-with-Deep-Learning, although the notation there differs from the notation used here:
- [Jack's Notes] 1-Intro and Word Vectors.ipynb provides gradient derivations for the average negative log-likelihood loss function used in the word2vec algorithm, a shallow neural network architecture.
- [Jack's Notes] 3-Neural Networks.ipynb provides the general way to derive gradients in multilayer neural networks using the chain rule and Jacobian matrices.
- [Jack's notes] 4-Backpropagation.ipynb provides an easy and unified way to derive gradients for both the sigmoid function and the softmax function, the two most common output-layer activation functions used in neural networks.
In the future, I will use the notation below for the gradient derivations for neural networks as well, so that there is one unified, hands-on tutorial for common supervised machine learning and deep learning loss functions. Keep learning!
- $x$: the input variables (features).
- $y$: the true output variables that we want to predict (observations).
- $\hat{y}$: the predicted values.
- $w$: weights for the input variables.
- $b$: bias term for the input variables. In some machine learning courses and tutorials, people may use $\theta$ to represent both the weights and the bias term, which seems to be the more popular notation for deriving basic machine learning loss function gradients, such as those of linear regression and logistic regression. But in deep learning, the $w$ and $b$ separation seems to be more common.
- Bold font denotes a vector: $\mathbf{x}$ is a vector of $x_j$, $\mathbf{y}$ is a vector of $y_i$, and so on. But please note that $\mathbf{x}$ holds all the input variables of any given single training example, while $\mathbf{y}$ holds the observations of all the training examples. This is because the observation to predict is always a single fixed value, regardless of whether it is a discrete class (for classification) or a continuous value (for regression).
- Uppercase bold font denotes a matrix, such as $\mathbf{X}$, etc.
- $m$ always denotes the number of training examples, so $y_i$ for $i = 1, \ldots, m$. The subscript $i$ is always related to $m$.
- $n$ always denotes the number of input variables for any given training example, so $x_j$ for $j = 1, \ldots, n$. The subscript $j$ is always related to $n$. In classification problems, the subscript $k$ instead stands for the index of the true class.
- When two subscripts are used together, $i$ always comes before $j$, such that $x_{ij}$ means the $j$-th input variable in the $i$-th training example. Thus, $i = 1, \ldots, m$ and $j = 1, \ldots, n$.
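To make the notation above concrete, here is a minimal NumPy sketch (not from the derivations themselves; the variable names are illustrative) that lays out the shapes implied by the list, with $\mathbf{X}$ of shape $(m, n)$, and computes the mean-squared-error loss of linear regression together with its gradients with respect to $w$ and $b$:

```python
import numpy as np

m, n = 4, 3                       # m training examples, n input variables
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))       # X[i, j]: the j-th input variable of the i-th example
y = rng.normal(size=m)            # y[i]: the observation for the i-th example
w = np.zeros(n)                   # one weight per input variable
b = 0.0                           # bias term, kept separate from w as in the notation above

y_hat = X @ w + b                 # predicted values, one per training example

# Mean squared error loss for linear regression, averaged over the m examples,
# and its gradients with respect to w (shape (n,)) and the scalar b.
loss = np.mean((y_hat - y) ** 2)
grad_w = (2.0 / m) * X.T @ (y_hat - y)
grad_b = (2.0 / m) * np.sum(y_hat - y)

print(loss, grad_w.shape, grad_b)
```

The same shape conventions carry over to logistic regression; only the loss and the resulting gradient expressions change.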