<a name='top'></a>
# Study Notes

## Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition 

<!--[section title](#section_ID) -->

# Go to:

[Chapter 1](#ch1)<br>
[Chapter 2: End-to-End ML Project](#ch2)<br>
[Chapter 3: Classification](#ch3)<br>
[Chapter 4: Training Models](#ch4)<br>
[Chapter 5: Support Vector Machines](#ch5)<br>
[Chapter 6](#ch6)<br>
[Chapter 7](#ch7)<br>
[Chapter 8](#ch8)<br>
[Chapter 9](#ch9)<br>
[Chapter 10](#ch10)<br>
[Chapter 11](#ch11)<br>
[Chapter 12](#ch12)<br>
[Chapter 13](#ch13)<br>
[Chapter 14](#ch14)<br>
[Chapter 15](#ch15)<br>
[Chapter 16](#ch16)<br>
[Chapter 17](#ch17)<br>
[Chapter 18](#ch18)<br>
[Chapter 19](#ch19)<br>

<h2><a id="ch2">Chapter 2: End-to-End ML Project</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
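
A minimal sketch of the outlier sensitivity described above, using made-up prediction errors (the arrays and helper names are illustrative assumptions, not from the book):

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a 1-D array of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error of a 1-D array of prediction errors."""
    return float(np.mean(np.abs(errors)))

# Hypothetical prediction errors: ten small errors, then the same with one outlier.
clean_errors = np.array([1.0] * 10)
outlier_errors = np.array([1.0] * 9 + [20.0])

clean_rmse, clean_mae = rmse(clean_errors), mae(clean_errors)
outlier_rmse, outlier_mae = rmse(outlier_errors), mae(outlier_errors)

print(f"clean:   RMSE={clean_rmse:.3f}  MAE={clean_mae:.3f}")
print(f"outlier: RMSE={outlier_rmse:.3f}  MAE={outlier_mae:.3f}")
```

Because the errors are squared before averaging, the single large error moves the RMSE much more than it moves the MAE.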


* Many histograms are tail-heavy: they extend much farther to the right of the median than to the left.


* your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of Machine Learning model. When you estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that will not perform as well as expected. This is called <b>data snooping bias</b>.


* For example, the US population is 51.3% females and 48.7% males, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called <b>stratified sampling</b>: the population is divided into homogeneous subgroups called <i>strata</i>, and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population. If the people running the survey used purely random sampling, there would be about a 12% chance of sampling a skewed test set that was either less than 49% female or more than 54% female. Either way, the survey results would be significantly biased.
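
The chance of a skewed purely random sample can be estimated with a binomial model. This is only a sketch: it assumes each of the 1,000 respondents is independently female with probability 0.513, and it treats "less than 49%" as fewer than 490 women and "more than 54%" as more than 540 women. The exact figure depends on how those boundaries are handled, but it comes out on the order of one in ten.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

n, p = 1000, 0.513  # sample size and female share quoted in the text

# Fewer than 490 women (< 49%) or more than 540 women (> 54%).
too_few = sum(binomial_pmf(k, n, p) for k in range(0, 490))
too_many = sum(binomial_pmf(k, n, p) for k in range(541, n + 1))
skew_probability = too_few + too_many

print(f"P(skewed sample) = {skew_probability:.3f}")
```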


* It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of a stratum’s importance may be biased.


* The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss out on nonlinear relationships (e.g., “if x is close to 0, then y generally goes up”).
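
A small sketch of that limitation, on synthetic data chosen for illustration: y = x² is completely determined by x, yet its linear correlation with x is essentially zero when x is symmetric around 0.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around 0

y_linear = 2.0 * x + 1.0          # perfectly linear in x
y_quadratic = x ** 2              # determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```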


* One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g., for ordered categories such as “bad,” “average,” “good,” and “excellent”), but it is obviously not the case for the ocean_proximity column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category is “INLAND” (and 0 otherwise), and so on. This is called <b>one-hot encoding</b>, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors.

    
*  Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding, we get a matrix with thousands of columns, and the matrix is full of 0s except for a single 1 per row. <b>Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements.</b> You can use it mostly like a normal 2D array, but if you really want to convert it to a (dense) NumPy array, just call the toarray() method.
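
A minimal sketch of the encoder on a tiny, made-up category column (the four sample rows are an assumption for illustration; the real dataset has more rows and categories):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column with three distinct categories.
ocean_proximity = np.array([["<1H OCEAN"], ["INLAND"], ["NEAR BAY"], ["INLAND"]])

encoder = OneHotEncoder()
onehot_sparse = encoder.fit_transform(ocean_proximity)  # SciPy sparse matrix
onehot_dense = onehot_sparse.toarray()                   # dense NumPy array

categories = list(encoder.categories_[0])

print(type(onehot_sparse))
print(categories)
print(onehot_dense)
```

Each row of the dense array contains exactly one 1, in the column of that row's category; the sparse version stores only those nonzero positions.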
    
    
* There are two common ways to get all attributes to have the same scale: *min-max scaling* and *standardization*.
    - Min-max scaling (many people call this normalization) is the simplest: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. **(MinMaxScaler)**
    - Standardization is different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Standardization is much less affected by outliers. **(StandardScaler)**
    -  It is important to fit the scalers to *the training data only*
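
A short sketch of both scalers, fitted on the training data only and then reused on the test data. The synthetic arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # hypothetical features
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Fit on the training set only, then apply the same transformation to both sets.
minmax = MinMaxScaler().fit(X_train)
X_train_minmax = minmax.transform(X_train)
X_test_minmax = minmax.transform(X_test)

standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
X_test_std = standard.transform(X_test)

print(X_train_minmax.min(axis=0), X_train_minmax.max(axis=0))
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

Since the scalers only know the training statistics, scaled test values may fall slightly outside the 0 to 1 range or have a mean that is not exactly zero. That is expected.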
    
    
* Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes -scores before calculating the square root.
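
A sketch of the sign flip on a toy regression problem (the data and the choice of LinearRegression are assumptions, not the book's housing example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(100, 1))                    # hypothetical feature
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=100)    # noisy linear target

# The scorer returns the negated MSE, so every score is <= 0.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)

# Flip the sign back before taking the square root to get one RMSE per fold.
rmse_scores = np.sqrt(-scores)

print(scores)
print(rmse_scores)
```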


* If GridSearchCV is initialized with **refit=True** (which is the default), then once it finds the best estimator using cross-validation, it **retrains** it on the whole training set. This is usually a good idea, since feeding it more data will likely improve its performance.


* The grid search will automatically find out whether or not to add a feature you were not sure about (e.g., using the add_bedrooms_per_room hyperparameter of your CombinedAttributesAdder transformer). It may similarly be used to automatically find the best way to **handle outliers, missing features, feature selection, and more**.


* Even a model trained to classify pictures of cats and dogs **may need to be retrained regularly**, not because cats and dogs will mutate overnight, but because cameras keep changing, along with image formats, sharpness, brightness, and size ratios.


* The Machine Learning algorithms are important, of course, but it is probably preferable to be comfortable with the overall process and know three or four algorithms well rather than to spend all your time exploring advanced algorithms.

<h2><a id="ch3">Chapter 3: Classification</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. **Shuffling** the dataset ensures that this won’t happen.


* A good place to start is with a **Stochastic Gradient Descent (SGD) classifier**, using Scikit-Learn’s SGDClassifier class. This classifier has the advantage of being **capable of handling very large datasets efficiently.** This is in part because SGD deals with training instances independently, one at a time (which also makes **SGD well suited for online learning**), as we will see later. 
    * The SGDClassifier **relies on randomness during training** (hence the name “stochastic”). If you want reproducible results, you should set the random_state parameter.


* Evaluating a classifier is often significantly trickier than evaluating a regressor.


* This demonstrates <u>why accuracy is generally not the preferred performance measure for classifiers</u>, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).


* A much better way to evaluate the performance of a classifier is to look at the **confusion matrix**. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look at row #5 and column #3 of the confusion matrix (counting rows and columns from 0, because the classes are the digits 0 to 9).


* Confusion matrix output = [[TN, FP], [FN, TP]], where each row is an actual class and each column is a predicted class (negative class first, then positive class)
    * precision = TP / (TP+FP)
    * precision is typically used along with another metric named recall, also called sensitivity or the true positive rate (TPR)
    * recall = TP / (TP+FN)
    * <img src='img/mls2_0302.jpg'>
    * It is often convenient to combine precision and recall into a single metric called the **F1 score**, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall (Equation 3-3). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.
    * <img src='img/mls2_02.png' align='left'>
    <br clear='left'>
    * Unfortunately, you can’t have it both ways: **increasing precision reduces recall, and vice versa**. This is called the precision/recall trade-off
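
The definitions above can be checked by hand on a tiny example. This is a sketch with made-up labels and predictions; the F1 line uses the harmonic-mean form 2 × precision × recall / (precision + recall).

```python
import numpy as np

# Hypothetical ground truth and predictions (1 = positive class, 0 = negative class).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

# Rows are actual classes, columns are predicted classes.
confusion = np.array([[tn, fp],
                      [fn, tp]])

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(confusion)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```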
  
  
* The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to 1 – the true negative rate (TNR), which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence, the ROC curve plots sensitivity (recall) versus 1 – specificity.
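
A sketch of how individual ROC points arise: each decision threshold gives one (FPR, TPR) pair. The scores, labels, and the `roc_point` helper are illustrative assumptions.

```python
import numpy as np

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when instances with score >= threshold are predicted positive."""
    y_pred = scores >= threshold
    positives = y_true == 1
    negatives = ~positives
    tpr = np.sum(y_pred & positives) / np.sum(positives)  # recall / sensitivity
    fpr = np.sum(y_pred & negatives) / np.sum(negatives)  # 1 - specificity
    return float(fpr), float(tpr)

# Hypothetical labels and decision scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Lowering the threshold raises the TPR, but also the FPR.
roc_points = [roc_point(y_true, scores, t) for t in (0.9, 0.5, 0.3, 0.0)]
print(roc_points)
```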


* As a rule of thumb, you should prefer the PR curve whenever **the positive class is rare or when you care more about the false positives than the false negatives**. Otherwise, use the ROC curve.
    

* Some algorithms (such as Logistic Regression classifiers, Random Forest classifiers, and naive Bayes classifiers) are capable of handling multiple classes natively. Others (such as SGD Classifiers or Support Vector Machine classifiers) are strictly binary classifiers.


* One way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called **the one-versus-the-rest (OvR) strategy (also called one-versus-all)**.


* Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the **one-versus-one (OvO) strategy**. If there are N classes, you need to train N × (N – 1) / 2 classifiers. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.
    * Some algorithms (such as <u>Support Vector Machine classifiers</u>) scale poorly with the size of the training set. For these algorithms OvO is preferred because it is faster to train many classifiers on small training sets than to train few classifiers on large training sets. **For most binary classification algorithms, however, OvR is preferred**.
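
A small sketch that counts the binary classifiers each strategy needs for the 10 digit classes and shows the OvR decision rule. The decision scores are made up for illustration.

```python
from itertools import combinations

classes = list(range(10))  # digits 0 to 9

# OvR: one "class c versus everything else" classifier per class.
ovr_tasks = [(c, "rest") for c in classes]

# OvO: one classifier per unordered pair of classes, N * (N - 1) / 2 in total.
ovo_tasks = list(combinations(classes, 2))

def ovr_predict(scores_by_class):
    """Pick the class whose binary classifier outputs the highest decision score."""
    return max(scores_by_class, key=scores_by_class.get)

# Hypothetical decision scores from a few of the OvR detectors for one image.
hypothetical_scores = {0: -3.1, 3: 0.4, 5: 1.7, 8: -0.2}
predicted = ovr_predict(hypothetical_scores)

print(len(ovr_tasks), len(ovo_tasks), predicted)
```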
    
    
* When a classifier is trained, it **stores the list of target classes in its classes_ attribute**, ordered by value. In this case, the index of each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be class 5), but in general you won’t be so lucky.


* However, most misclassified images seem like obvious errors to us, and it’s hard to understand why the classifier made the mistakes it did. The reason is that we used a simple SGDClassifier, which is a linear model. **All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class**. So since 3s and 5s differ only by a few pixels, this model will easily confuse them.
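
The per-class weighted sum can be sketched in a few lines. The weights here are random and untrained, the image is random noise, and the 784-pixel size and the bias vector are assumptions for illustration; only the shape of the computation matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_pixels = 10, 784  # assumed: 10 digits, 28 x 28 flattened images

weights = rng.normal(size=(n_classes, n_pixels))  # hypothetical: one weight per class per pixel
bias = np.zeros(n_classes)                        # hypothetical per-class offset
image = rng.uniform(0.0, 255.0, size=n_pixels)    # hypothetical flattened image

# One score per class: the weighted sum of the pixel intensities.
class_scores = weights @ image + bias
predicted_class = int(np.argmax(class_scores))

print(class_scores.shape, predicted_class)
```

Because every pixel contributes independently and linearly, two digits that differ in only a few pixels receive very similar scores.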


* Then when the classifier is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning “Alice yes, Bob no, Charlie yes”). Such a classification system that outputs multiple binary tags is called a **multilabel classification system**.


* This assumes that all labels are equally important, however, which may not be the case. In particular, if you have many more pictures of Alice than of Bob or Charlie, you may want to give more weight to the classifier’s score on pictures of Alice. One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label). To do this, simply **set average="weighted"** in the preceding code.
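
What support weighting does to an averaged score can be sketched with plain NumPy. The per-label F1 scores and supports are invented numbers; this mirrors the idea rather than reproducing Scikit-Learn's internals.

```python
import numpy as np

labels = ["Alice", "Bob", "Charlie"]
f1_per_label = np.array([0.9, 0.6, 0.5])   # hypothetical per-label F1 scores
support = np.array([100, 20, 10])          # hypothetical number of instances per label

# Unweighted mean: every label counts equally.
macro_f1 = float(np.mean(f1_per_label))

# Support-weighted mean: labels with more instances count more.
weighted_f1 = float(np.average(f1_per_label, weights=support))

print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f}")
```

With many more pictures of Alice, the weighted average is pulled toward the score on Alice's label.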


* The last type of classification task we are going to discuss here is called multioutput–multiclass classification (or simply **multioutput classification**). It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).

<h2><a id="ch4">Chapter 4: Training Models</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* Two ways to train linear regression
    * Using a direct **“closed-form” equation** that directly computes the model parameters that best fit the model to the training set (i.e., the model parameters that minimize the cost function over the training set).
    * Using an iterative optimization approach called Gradient Descent (GD) that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method. We will look at a few variants of Gradient Descent that we will use again and again when we study neural networks in Part II: **Batch GD, Mini-batch GD, and Stochastic GD**.


* Polynomial Regression, a more complex model that can fit nonlinear datasets. Since this model has more parameters than Linear Regression, it is **more prone to overfitting the training data**, so we will look at how to detect whether or not this is the case using **learning curves**, and then we will look at several regularization techniques that can reduce the risk of overfitting the training set.


* Two more models that are commonly used for classification tasks: **Logistic Regression and Softmax Regression**.


* <img src='img/eq4-2_linear_regression.jpg' align='left'>
<br clear='left'>


* In Chapter 2 we saw that the most common performance measure of a regression model is the Root Mean Square Error (RMSE) (Equation 2-1). Therefore, to train a Linear Regression model, we need to find the value of θ that minimizes the RMSE. **In practice, it is simpler to minimize the mean squared error (MSE) than the RMSE, and it leads to the same result** (because the value that minimizes a function also minimizes its square root).

* <img src='img/eq4-3_MSE.jpg' align='left'>
<br clear='left'>

* <img src='img/eq4-4_normal.jpg' align='left'>
<br clear='left'>


* This function computes $\hat{\theta} = X^{+} y$, where $X^{+}$ is the **pseudoinverse** of $X$ (specifically, the Moore-Penrose inverse). You can use np.linalg.pinv() to compute the pseudoinverse directly.


* <img src='img/4_moore_penrose.jpg' align='left'>
<br clear='left'>


* The pseudoinverse itself is computed using a standard matrix factorization technique called **Singular Value Decomposition (SVD)** that can decompose the training set matrix $X$ into the matrix multiplication of three matrices $U \Sigma V^{\top}$ (see numpy.linalg.svd()). The pseudoinverse is computed as $X^{+} = V \Sigma^{+} U^{\top}$. To compute the matrix $\Sigma^{+}$, the algorithm takes $\Sigma$ and sets to zero all values smaller than a tiny threshold value, then it replaces all the nonzero values with their inverse, and finally it transposes the resulting matrix. This approach is more efficient than computing the Normal Equation, plus it handles edge cases nicely: indeed, the Normal Equation may not work if the matrix $X^{\top} X$ is not invertible (i.e., singular), such as if m < n or if some features are redundant, but the pseudoinverse is always defined.
    * The Normal Equation computes the inverse of X⊺ X, which is an (n + 1) × (n + 1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about **O(n^2.4) to O(n^3)**.
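
The three routes to the same parameters can be compared on synthetic data. This is a sketch: the data, the random seed, and the singular-value threshold are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2.0 * rng.random((m, 1))                               # hypothetical single feature
y = 4.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=m)    # noisy linear target

X_b = np.c_[np.ones((m, 1)), X]  # add the bias feature x0 = 1 to every instance

# 1) Normal Equation: inverse of X^T X, times X^T y.
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# 2) Pseudoinverse computed directly by NumPy.
theta_pinv = np.linalg.pinv(X_b) @ y

# 3) Pseudoinverse rebuilt by hand from the SVD: X^+ = V Sigma^+ U^T.
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)
threshold = 1e-12 * s.max()          # illustrative cutoff for "tiny" singular values
s_inv = np.zeros_like(s)
s_inv[s > threshold] = 1.0 / s[s > threshold]
X_pinv_svd = Vt.T @ np.diag(s_inv) @ U.T
theta_svd = X_pinv_svd @ y

print(theta_normal, theta_pinv, theta_svd)
```

All three agree here because X⊺X is invertible; only the pseudoinverse routes remain usable when it is not.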
    
    
    
* Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to **tweak parameters iteratively in order to minimize a cost function**.


* When using Gradient Descent, you should **ensure that all features have a similar scale** (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.


* <img src='img/eq4-6_gradient_vector.jpg' align='left'>
<br clear='left'>


* Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_{\theta}\,\mathrm{MSE}(\theta)$ from θ. This is where the learning rate η comes into play: multiply the gradient vector by η to determine the size of the downhill step (Equation 4-7).
    * <img src='img/eq4-7_GD_step.jpg' align='left'>
<br clear='left'>

* You may wonder how to set the number of iterations. If it is too low, you will still be far away from the optimal solution when the algorithm stops; but if it is too high, you will waste time while the model parameters do not change anymore. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny—that is, <mark>when its norm becomes smaller than a tiny number ϵ (called the tolerance)</mark>—because this happens when Gradient Descent has (almost) reached the minimum.
    * The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large.
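
A sketch of Batch Gradient Descent for linear regression with a tolerance-based stop. The data, learning rate, and tolerance are illustrative assumptions; the gradient is the standard MSE gradient (2/m) X⊺(Xθ − y).

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2.0 * rng.random((m, 1))                               # hypothetical single feature
y = 4.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]                            # add the bias feature x0 = 1

eta = 0.1                 # learning rate
epsilon = 1e-8            # tolerance on the gradient norm
max_iterations = 100_000  # generous upper bound

theta = np.zeros(2)
for iteration in range(max_iterations):
    # Gradient of the MSE over the whole training set.
    gradients = (2.0 / m) * X_b.T @ (X_b @ theta - y)
    if np.linalg.norm(gradients) < epsilon:
        break  # the gradient is tiny, so we are (almost) at the minimum
    theta = theta - eta * gradients

# Reference solution from a direct least-squares solver.
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(iteration, theta, theta_exact)
```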
    
    
* At the opposite extreme, **Stochastic Gradient Descent** picks a <mark>random instance in the training set at every step and computes the gradients based only on that single instance.</mark> Obviously, working on a single instance at a time makes the algorithm much faster because it has very little data to manipulate at every iteration.
    * When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.
    * When using Stochastic Gradient Descent, the training instances must be **independent and identically distributed (IID)** to ensure that the parameters get pulled toward the global optimum, on average.
    
    
* **Mini-batch GD** computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
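
A sketch of Mini-batch GD for linear regression. The `minibatch_gd` helper, the data, and the hyperparameters are assumptions for illustration. One deliberate simplification: instances are shuffled once per epoch and then consumed in batches, rather than drawn at random at every step.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2.0 * rng.random((m, 1))                               # hypothetical single feature
y = 4.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=m)
X_b = np.c_[np.ones((m, 1)), X]                            # add the bias feature x0 = 1

def minibatch_gd(X_b, y, batch_size, eta=0.1, n_epochs=100, seed=0):
    """Gradient Descent where each step uses only `batch_size` instances."""
    shuffle_rng = np.random.default_rng(seed)
    n_instances, n_params = X_b.shape
    theta = np.zeros(n_params)
    for epoch in range(n_epochs):
        order = shuffle_rng.permutation(n_instances)  # reshuffle every epoch
        for start in range(0, n_instances, batch_size):
            batch = order[start:start + batch_size]
            X_batch, y_batch = X_b[batch], y[batch]
            # MSE gradient estimated on the mini-batch only.
            gradients = (2.0 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
            theta = theta - eta * gradients
    return theta

theta_minibatch = minibatch_gd(X_b, y, batch_size=20)
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_minibatch, theta_exact)
```

In this sketch, `batch_size=1` corresponds to Stochastic GD and `batch_size=m` to Batch GD. With a constant learning rate the parameters keep bouncing around the minimum rather than settling exactly on it.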

* <img src='img/table4-1_alg_comp.jpg' align='left'>
<br clear='left'>


* Note that when there are multiple features, <mark>Polynomial Regression is capable of finding relationships between features</mark> (which is something a plain Linear Regression model cannot do). This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the combinations $ab$, $a^2b$, and $ab^2$.

    * PolynomialFeatures(degree=d) transforms an array containing n features **into an array containing (n + d)! / (d! n!) features**, where n! is the factorial of n, equal to 1 × 2 × 3 × ⋯ × n. Beware of the combinatorial explosion of the number of features!
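
The feature count can be checked by enumerating the monomials directly. This sketch does not call Scikit-Learn; `polynomial_terms` is a hypothetical helper, and the count includes the constant term "1".

```python
from itertools import combinations_with_replacement
from math import comb

def polynomial_terms(feature_names, degree):
    """List every monomial of total degree 0..degree built from the given features."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(feature_names, d):
            terms.append("*".join(combo) if combo else "1")
    return terms

terms = polynomial_terms(["a", "b"], degree=3)

n, d = 2, 3
expected_count = comb(n + d, d)  # (n + d)! / (d! n!)

print(terms)
print(len(terms), expected_count)
```

For a sense of the explosion, `comb(100 + 3, 3)` already gives 176,851 features for 100 inputs at degree 3.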
    
    
* <mark>If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting. If it performs poorly on both, then it is underfitting.</mark>
    * Another way to tell is to look at the **learning curves**: these are plots of the model’s performance on the training set and the validation set as a function of the training set size (or the training iteration).


* One way to improve an **overfitting** model is to feed it more training data until the validation error reaches the training error.


* If your model is **underfitting** the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.


* <img src='img/ch4_bias_variance.jpg' align='left'>
<br clear='left'>


* **Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression**: a regularization term equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.


* <img src='img/eq4-8_ridge_reg.jpg' align='left'>
<br clear='left'>


* <img src='img/eq4-10_lasso_reg.jpg' align='left'>
<br clear='left'>

    * The Lasso cost function <mark>is not differentiable at $\theta_i = 0$</mark> (for i = 1, 2, ⋯, n), but Gradient Descent still works fine if you use a subgradient vector $g$ instead when any $\theta_i = 0$.
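
A sketch of such a subgradient, assuming the Lasso cost is the MSE plus α times the sum of |θi| for i ≥ 1 (the bias term θ0 is left unregularized). The `lasso_subgradient` helper and the tiny dataset are illustrative assumptions.

```python
import numpy as np

def lasso_subgradient(theta, X_b, y, alpha):
    """MSE gradient plus alpha * sign(theta); sign(0) = 0 handles the kink at zero."""
    m = len(y)
    mse_gradient = (2.0 / m) * X_b.T @ (X_b @ theta - y)
    penalty = alpha * np.sign(theta)  # np.sign returns -1, 0, or +1
    penalty[0] = 0.0                  # the bias term theta_0 is not regularized
    return mse_gradient + penalty

# Hypothetical data: first column is the bias feature x0 = 1.
X_b = np.array([[1.0, 0.5, -1.0],
                [1.0, 1.5, 2.0],
                [1.0, -0.5, 0.0]])
y = np.array([1.0, 2.0, 0.0])
theta = np.array([0.3, 0.0, -0.7])  # note the weight that is exactly 0

g = lasso_subgradient(theta, X_b, y, alpha=0.1)
g_unregularized = lasso_subgradient(theta, X_b, y, alpha=0.0)

print(g - g_unregularized)
```

The difference between the two vectors isolates the penalty part: nothing for the bias, nothing for the weight sitting at 0, and −α for the negative weight.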


* <img src='img/eq4-12_elastic_net.jpg' align='left'>
<br clear='left'>


* A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called **early stopping**.
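
One way to sketch early stopping is to keep a snapshot of the parameters whenever the validation error improves, and roll back to that snapshot at the end. The data, model, and hyperparameters are assumptions. The loop runs a fixed number of epochs instead of breaking out, since the validation error can fluctuate before reaching its true minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 30                                   # few instances, many features
X = rng.normal(size=(m, n))
true_theta = np.zeros(n)
true_theta[:3] = [2.0, -1.0, 0.5]               # only three features matter
y = X @ true_theta + rng.normal(scale=1.0, size=m)

X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

def mse(theta, X, y):
    """Mean squared error of a linear model with parameters theta."""
    return float(np.mean((X @ theta - y) ** 2))

theta = np.zeros(n)
eta = 0.01
best_val_error = float("inf")
best_epoch = None
best_theta = None
val_history = []

for epoch in range(500):
    # One Batch Gradient Descent step on the training set.
    gradients = (2.0 / len(y_train)) * X_train.T @ (X_train @ theta - y_train)
    theta = theta - eta * gradients

    # Track the validation error and remember the best parameters seen so far.
    val_error = mse(theta, X_val, y_val)
    val_history.append(val_error)
    if val_error < best_val_error:
        best_val_error = val_error
        best_epoch = epoch
        best_theta = theta.copy()

print(best_epoch, best_val_error, val_history[-1])
```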


* <img src='img/eq4-13_logistic_reg.jpg' align='left'>
<br clear='left'>
    
    * The score t is often called the logit. The name comes from the fact that the logit function, <mark>defined as logit(p) = log(p / (1 – p))</mark>, is the inverse of the logistic function. Indeed, if you compute the logit of the estimated probability p, you will find that the result is t. The logit is **also called the log-odds**, since it is the log of the ratio between the estimated probability for the positive class and the estimated probability for the negative class.
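
The inverse relationship between the two functions is easy to check numerically (a small sketch; the score values are arbitrary):

```python
import math

def logistic(t):
    """Logistic (sigmoid) function: maps a score t to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Log-odds: log of p / (1 - p), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

scores = [-3.0, -0.5, 0.0, 0.5, 3.0]
round_trip = [logit(logistic(t)) for t in scores]

print(round_trip)
```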
    
    
* The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers (as discussed in Chapter 3). This is called **Softmax Regression**, or Multinomial Logistic Regression. The idea is simple: when given an instance x, the Softmax Regression model first computes a score $s_k(x)$ for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.

    * <img src='img/ep4-19_softmax.jpg' align='left'>
<br clear='left'>

    * <img src='img/eq4-20_softmax.jpg' align='left'>
    <br clear='left'>
    
    * The Softmax Regression classifier predicts only one class at a time (i.e., <mark>it is multiclass, not multioutput</mark>), so it should be used only with mutually exclusive classes, such as different types of plants. You cannot use it to recognize multiple people in one picture.
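
A sketch of the softmax step on made-up class scores. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here; it does not change the resulting probabilities.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential: turns a vector of scores into class probabilities."""
    shifted = scores - np.max(scores)  # stability trick; leaves the result unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

class_scores = np.array([2.0, 1.0, -1.0])   # hypothetical scores s_k(x) for 3 classes
probabilities = softmax(class_scores)

# The prediction is the single class with the highest estimated probability.
predicted_class = int(np.argmax(probabilities))

print(probabilities, predicted_class)
```

The probabilities always sum to 1, which is why the model suits mutually exclusive classes only.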
    
    
* The training objective is a model that estimates a high probability for the target class (and consequently a low probability for the other classes). Minimizing the cost function shown in Equation 4-22, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class. **Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes**.

    * <img src='img/eq4-22_cross_entropy.jpg' align='left'>
    <br clear='left'>
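
A sketch of the cross entropy cost, assuming one-hot target vectors and the usual form −(1/m) Σᵢ Σₖ yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾). The probability tables are invented for illustration.

```python
import numpy as np

def cross_entropy(y_one_hot, p_hat):
    """Mean cross entropy between one-hot targets and estimated class probabilities."""
    return float(-np.mean(np.sum(y_one_hot * np.log(p_hat), axis=1)))

# Two instances, three classes; targets are class 0 and class 1.
y_one_hot = np.array([[1, 0, 0],
                      [0, 1, 0]])

# Hypothetical estimated probabilities from two different models.
confident = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.8, 0.1]])
unsure = np.array([[0.4, 0.3, 0.3],
                   [0.4, 0.3, 0.3]])

loss_confident = cross_entropy(y_one_hot, confident)
loss_unsure = cross_entropy(y_one_hot, unsure)

print(f"confident={loss_confident:.4f} unsure={loss_unsure:.4f}")
```

Only the probability assigned to the target class enters each instance's term, so a low probability on the target is what drives the cost up.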

<h2><a id="ch5">Chapter 5: Support Vector Machines</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called **large margin classification**.


* Notice that adding more training instances “off the street” will not affect the decision boundary at all: it is fully determined (or “supported”) by the instances located on the edge of the street. These instances are called the **support vectors** (they are circled in Figure 5-1).


* <mark>SVMs are sensitive to the feature scales.</mark>