<a name='top'></a>
# Study Notes

## Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition 

<!--[section title](#section_ID) -->

# Go to:

[Chapter 1: The Machine Learning Landscape](#ch1)<br>
[Chapter 2: End-to-End ML Project](#ch2)<br>
[Chapter 3: Classification](#ch3)<br>
[Chapter 4: Training Models](#ch4)<br>
[Chapter 5: Support Vector Machines](#ch5)<br>
[Chapter 6: Decision Trees](#ch6)<br>
[Chapter 7: Ensemble Learning and Random Forest](#ch7)<br>
[Chapter 8: Dimensionality Reduction](#ch8)<br>
[Chapter 9: Unsupervised Learning](#ch9)<br>
[Chapter 10: Neural Nets with Keras](#ch10)<br>
[Chapter 11: Training Deep Neural Networks](#ch11)<br>
[Chapter 12: Custom Models and Training with Tensorflow](#ch12)<br>
[Chapter 13: Loading and Preprocessing Data](#ch13)<br>
[Chapter 14: Deep Computer Vision with CNNs](#ch14)<br>
[Chapter 15: Processing Sequences using RNNs and CNNs](#ch15)<br>
[Chapter 16: NLP with RNNs and Attention](#ch16)<br>
[Chapter 17: Autoencoders and GANs](#ch17)<br>
[Chapter 18: Reinforcement Learning](#ch18)<br>
[Chapter 19: Training and Deploying at Scale](#ch19)<br>

<h2><a id="ch2">Chapter 2: End-to-End ML Project</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
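
To make the outlier sensitivity concrete, here is a minimal sketch using NumPy. The error values are made up for illustration and do not come from the housing dataset.

```python
import numpy as np

def rmse(errors):
    """Root mean square error of a vector of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error of a vector of prediction errors."""
    return float(np.mean(np.abs(errors)))

# Hypothetical prediction errors: identical except for one outlier.
errors_clean = np.array([1.0, -1.0, 1.0, -1.0])
errors_outlier = np.array([1.0, -1.0, 1.0, -10.0])

rmse_clean, mae_clean = rmse(errors_clean), mae(errors_clean)
rmse_outlier, mae_outlier = rmse(errors_outlier), mae(errors_outlier)

print(f"clean:   RMSE={rmse_clean:.2f}  MAE={mae_clean:.2f}")
print(f"outlier: RMSE={rmse_outlier:.2f}  MAE={mae_outlier:.2f}")
```

With the single outlier, the MAE rises from 1.00 to 3.25 while the RMSE rises from 1.00 to about 5.07, because squaring gives large errors more weight.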


* Tail-heavy histograms: they extend much farther to the right of the median than to the left.


* Your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of Machine Learning model. When you estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that will not perform as well as expected. This is called <b>data snooping bias</b>.


* For example, the US population is 51.3% females and 48.7% males, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called <b>stratified sampling</b>: the population is divided into homogeneous subgroups called <i>strata</i>, and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population. If the people running the survey used purely random sampling, there would be about a 12% chance of sampling a skewed test set that was either less than 49% female or more than 54% female. Either way, the survey results would be significantly biased.


* It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of a stratum’s importance may be biased.
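
A minimal sketch of stratified sampling with Scikit-Learn's `train_test_split` and its `stratify` argument. The population below is synthetic (an assumption for illustration): 10,000 people with the 51.3% / 48.7% split from the example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic population (assumption): 5,130 females and 4,870 males.
labels = np.array(["F"] * 5130 + ["M"] * 4870)
indices = np.arange(len(labels))

# stratify=labels keeps the F/M ratio (almost) identical in both splits.
train_idx, test_idx = train_test_split(
    indices, test_size=0.1, stratify=labels, random_state=42
)

test_female_ratio = float(np.mean(labels[test_idx] == "F"))
train_female_ratio = float(np.mean(labels[train_idx] == "F"))
print(f"test set:  {test_female_ratio:.3f} female")
print(f"train set: {train_female_ratio:.3f} female")
```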


* The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss out on nonlinear relationships (e.g., “if x is close to 0, then y generally goes up”).


* One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g., for ordered categories such as “bad,” “average,” “good,” and “excellent”), but it is obviously not the case for the ocean_proximity column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category is “INLAND” (and 0 otherwise), and so on. This is called <b>one-hot encoding</b>, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors.

    
* Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding, we get a matrix with thousands of columns, and the matrix is full of 0s except for a single 1 per row. <b>Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements.</b> You can use it mostly like a normal 2D array, but if you really want to convert it to a (dense) NumPy array, just call the toarray() method.
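
A short sketch of the encoder's sparse output and the `toarray()` conversion. The four category values are a tiny hand-made sample, not the real ocean_proximity column.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Tiny hand-made categorical column (assumption, for illustration only).
ocean_proximity = np.array([["<1H OCEAN"], ["INLAND"], ["NEAR BAY"], ["INLAND"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(ocean_proximity)  # SciPy sparse matrix by default

print(sparse.issparse(encoded))   # the encoder returns a sparse matrix
print(encoder.categories_)        # categories learned during fit

dense = encoded.toarray()         # dense NumPy array, one row per instance
print(dense)
```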
    
    
* There are two common ways to get all attributes to have the same scale: *min-max scaling* and *standardization*.
    - Min-max scaling (many people call this normalization) is the simplest: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. **(MinMaxScaler)**
    - Standardization is different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Standardization is much less affected by outliers. **(StandardScaler)**
    -  It is important to fit the scalers to *the training data only*
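
A minimal sketch of both scalers, fitted on the training data only and then applied to the test data. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Invented single-feature data (assumption).
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[0.0], [6.0]])

# Fit on the training set only, then reuse the fitted scaler everywhere.
min_max = MinMaxScaler().fit(X_train)
standard = StandardScaler().fit(X_train)

X_train_mm = min_max.transform(X_train)    # ranges from 0 to 1
X_test_mm = min_max.transform(X_test)      # may fall outside [0, 1]
X_train_std = standard.transform(X_train)  # zero mean, unit variance

print(X_train_mm.ravel())
print(X_test_mm.ravel())
print(X_train_std.mean(), X_train_std.std())
```

Because the scaler only saw the training range, test values below the training minimum or above the training maximum land outside [0, 1]; that is expected.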
    
    
* Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes -scores before calculating the square root.
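
A minimal sketch of the sign flip, assuming a synthetic linear dataset and a plain `LinearRegression` model rather than the book's housing pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# The scorer returns the negated MSE, so that greater is better.
scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=5
)

# Flip the sign before taking the square root to get one RMSE per fold.
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())
```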


* If GridSearchCV is initialized with **refit=True** (which is the default), then once it finds the best estimator using cross-validation, it **retrains** it on the whole training set. This is usually a good idea, since feeding it more data will likely improve its performance.
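
A small sketch of the refit behavior, assuming a synthetic dataset and a `Ridge` model with a hypothetical `alpha` grid.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data and a hypothetical hyperparameter grid (assumptions).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# refit=True is the default; it is spelled out here for clarity.
grid_search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error", refit=True
)
grid_search.fit(X, y)

# best_estimator_ has been retrained on the whole training set.
best_model = grid_search.best_estimator_
predictions = best_model.predict(X[:3])
print(grid_search.best_params_, predictions)
```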


* The grid search will automatically find out whether or not to add a feature you were not sure about (e.g., using the add_bedrooms_per_room hyperparameter of your CombinedAttributesAdder transformer). It may similarly be used to automatically find the best way to **handle outliers, missing features, feature selection, and more**.


* Even a model trained to classify pictures of cats and dogs **may need to be retrained regularly**, not because cats and dogs will mutate overnight, but because cameras keep changing, along with image formats, sharpness, brightness, and size ratios.


* The Machine Learning algorithms are important, of course, but it is probably preferable to be comfortable with the overall process and know three or four algorithms well rather than to spend all your time exploring advanced algorithms.

<h2><a id="ch3">Chapter 3: Classification</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. **Shuffling** the dataset ensures that this won’t happen.


* A good place to start is with a **Stochastic Gradient Descent (SGD) classifier**, using Scikit-Learn’s SGDClassifier class. This classifier has the advantage of being **capable of handling very large datasets efficiently.** This is in part because SGD deals with training instances independently, one at a time (which also makes **SGD well suited for online learning**), as we will see later. 
    * The SGDClassifier **relies on randomness during training** (hence the name “stochastic”). If you want reproducible results, you should set the random_state parameter.


* Evaluating a classifier is often significantly trickier than evaluating a regressor.


* This demonstrates <u>why accuracy is generally not the preferred performance measure for classifiers</u>, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).


* A much better way to evaluate the performance of a classifier is to look at the **confusion matrix**. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look in row 5 and column 3 of the confusion matrix (rows and columns are indexed by class, starting at 0).


* Confusion matrix output = [[TN, FP], [FN, TP]]: each row is an actual class (negative first, then positive) and each column is a predicted class
    * precision = TP / (TP+FP)
    * precision is typically used along with another metric named recall, also called sensitivity or the true positive rate (TPR)
    * recall = TP / (TP+FN)
    * <img src='img/mls2_0302.jpg'>
    * It is often convenient to combine precision and recall into a single metric called the **F1 score**, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall (Equation 3-3). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.
    * <img src='img/mls2_02.png' align='left'>
    <br clear='left'>
    * Unfortunately, you can’t have it both ways: **increasing precision reduces recall, and vice versa**. This is called the precision/recall trade-off.
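
The formulas above can be checked on a small example. The labels and predictions below are invented; the manual results are compared against Scikit-Learn's metric functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented binary labels and predictions (assumption).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(tn, fp, fn, tp)
print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```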
  
  
* The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to 1 – the true negative rate (TNR), which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence, the ROC curve plots sensitivity (recall) versus 1 – specificity.
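
A minimal sketch of the quantities behind the ROC curve, using four invented scores. It computes the TPR and FPR by hand at one threshold (0.5, an arbitrary choice) and then lets `roc_curve` sweep all thresholds.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Invented labels and decision scores (assumption).
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Manual computation at a single, arbitrarily chosen threshold.
threshold = 0.5
y_pred = y_scores >= threshold
tp = np.sum(y_pred & (y_true == 1))
fn = np.sum(~y_pred & (y_true == 1))
fp = np.sum(y_pred & (y_true == 0))
tn = np.sum(~y_pred & (y_true == 0))

tpr_manual = tp / (tp + fn)    # recall / sensitivity
specificity = tn / (tn + fp)   # true negative rate
fpr_manual = fp / (fp + tn)    # equals 1 - specificity

# roc_curve evaluates the same two rates over all useful thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
print(tpr_manual, fpr_manual, auc)
```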


* As a rule of thumb, you should prefer the PR curve whenever **the positive class is rare or when you care more about the false positives than the false negatives**. Otherwise, use the ROC curve.
    

* Some algorithms (such as Logistic Regression classifiers, Random Forest classifiers, and naive Bayes classifiers) are capable of handling multiple classes natively. Others (such as SGD Classifiers or Support Vector Machine classifiers) are strictly binary classifiers.


* One way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called **the one-versus-the-rest (OvR) strategy (also called one-versus-all)**.


* Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the **one-versus-one (OvO) strategy**. If there are N classes, you need to train N × (N – 1) / 2 classifiers. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.
    * Some algorithms (such as <u>Support Vector Machine classifiers</u>) scale poorly with the size of the training set. For these algorithms OvO is preferred because it is faster to train many classifiers on small training sets than to train few classifiers on large training sets. **For most binary classification algorithms, however, OvR is preferred**.
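
To see the classifier counts in practice, here is a sketch that forces each strategy with Scikit-Learn's `OneVsRestClassifier` and `OneVsOneClassifier` wrappers. It uses the small bundled digits dataset (8×8 images) as a stand-in for MNIST, which is an assumption made to keep the example fast.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Small bundled digits dataset, used here instead of MNIST (assumption).
X, y = load_digits(return_X_y=True)
n_classes = len(np.unique(y))

ovr = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X, y)
ovo = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X, y)

print(len(ovr.estimators_))             # N classifiers
print(len(ovo.estimators_))             # N * (N - 1) / 2 classifiers
print(n_classes * (n_classes - 1) // 2)
```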
    
    
* When a classifier is trained, it **stores the list of target classes in its classes_ attribute**, ordered by value. In this case, the index of each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be class 5), but in general you won’t be so lucky.


* However, most misclassified images seem like obvious errors to us, and it’s hard to understand why the classifier made the mistakes it did. The reason is that we used a simple SGDClassifier, which is a linear model. **All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class**. So since 3s and 5s differ only by a few pixels, this model will easily confuse them.


* Then when the classifier is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning “Alice yes, Bob no, Charlie yes”). Such a classification system that outputs multiple binary tags is called a **multilabel classification system**.


* This assumes that all labels are equally important, however, which may not be the case. In particular, if you have many more pictures of Alice than of Bob or Charlie, you may want to give more weight to the classifier’s score on pictures of Alice. One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label). To do this, simply **set average="weighted"** in the preceding code.


* The last type of classification task we are going to discuss here is called multioutput–multiclass classification (or simply **multioutput classification**). It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).

<h2><a id="ch4">Chapter 4: Training Models</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* Two ways to train linear regression
    * Using a direct **“closed-form” equation** that directly computes the model parameters that best fit the model to the training set (i.e., the model parameters that minimize the cost function over the training set).
    * Using an iterative optimization approach called Gradient Descent (GD) that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method. We will look at a few variants of Gradient Descent that we will use again and again when we study neural networks in Part II: **Batch GD, Mini-batch GD, and Stochastic GD**.


* Polynomial Regression, a more complex model that can fit nonlinear datasets. Since this model has more parameters than Linear Regression, it is **more prone to overfitting the training data**, so we will look at how to detect whether or not this is the case using **learning curves**, and then we will look at several regularization techniques that can reduce the risk of overfitting the training set.


* Two more models that are commonly used for classification tasks: **Logistic Regression and Softmax Regression**.


* <img src='img/eq4-2_linear_regression.jpg' align='left'>
<br clear='left'>


* In Chapter 2 we saw that the most common performance measure of a regression model is the Root Mean Square Error (RMSE) (Equation 2-1). Therefore, to train a Linear Regression model, we need to find the value of θ that minimizes the RMSE. **In practice, it is simpler to minimize the mean squared error (MSE) than the RMSE, and it leads to the same result** (because the value that minimizes a function also minimizes its square root).

* <img src='img/eq4-3_MSE.jpg' align='left'>
<br clear='left'>

* <img src='img/eq4-4_normal.jpg' align='left'>
<br clear='left'>


* This function computes $\hat{\theta} = X^{+} y$, where $X^{+}$ is the **pseudoinverse** of $X$ (specifically, the Moore-Penrose inverse). You can use np.linalg.pinv() to compute the pseudoinverse directly.


* <img src='img/4_moore_penrose.jpg' align='left'>
<br clear='left'>


* The pseudoinverse itself is computed using a standard matrix factorization technique called **Singular Value Decomposition (SVD)** that can decompose the training set matrix $X$ into the matrix multiplication of three matrices $U \Sigma V^\top$ (see numpy.linalg.svd()). The pseudoinverse is computed as $X^{+} = V \Sigma^{+} U^\top$. To compute the matrix $\Sigma^{+}$, the algorithm takes $\Sigma$ and sets to zero all values smaller than a tiny threshold value, then it replaces all the nonzero values with their inverse, and finally it transposes the resulting matrix. This approach is more efficient than computing the Normal Equation, plus it handles edge cases nicely: indeed, the Normal Equation may not work if the matrix $X^\top X$ is not invertible (i.e., singular), such as if $m < n$ or if some features are redundant, but the pseudoinverse is always defined.
    * The Normal Equation computes the inverse of $X^\top X$, which is an $(n + 1) \times (n + 1)$ matrix (where $n$ is the number of features). The computational complexity of inverting such a matrix is typically about **$O(n^{2.4})$ to $O(n^3)$**.
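
A sketch comparing the three routes described above on synthetic data (an assumption): the Normal Equation, `np.linalg.pinv()`, and a pseudoinverse built by hand from the SVD. The threshold `1e-10` is an illustrative choice, not the value NumPy uses internally.

```python
import numpy as np

# Synthetic linear data: y = 4 + 3x + noise (assumption).
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=m)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias column x0 = 1

# 1) Normal Equation: requires X^T X to be invertible.
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# 2) Pseudoinverse computed by NumPy.
theta_pinv = np.linalg.pinv(X_b) @ y

# 3) Pseudoinverse built from the SVD: X+ = V Sigma+ U^T.
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)
s_inv = np.where(s > 1e-10, 1.0 / s, 0.0)  # invert only non-tiny singular values
theta_svd = Vt.T @ (s_inv * (U.T @ y))

print(theta_normal, theta_pinv, theta_svd)
```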
    
    
    
* Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to **tweak parameters iteratively in order to minimize a cost function**.


* When using Gradient Descent, you should **ensure that all features have a similar scale** (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.


* <img src='img/eq4-6_gradient_vector.jpg' align='left'>
<br clear='left'>


* Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_{\theta}\,\mathrm{MSE}(\theta)$ from $\theta$. This is where the learning rate $\eta$ comes into play: multiply the gradient vector by $\eta$ to determine the size of the downhill step (Equation 4-7).
    * <img src='img/eq4-7_GD_step.jpg' align='left'>
<br clear='left'>

* You may wonder how to set the number of iterations. If it is too low, you will still be far away from the optimal solution when the algorithm stops; but if it is too high, you will waste time while the model parameters do not change anymore. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny—that is, <mark>when its norm becomes smaller than a tiny number ϵ (called the tolerance)</mark>—because this happens when Gradient Descent has (almost) reached the minimum.
    * The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large.
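
A minimal sketch of Batch Gradient Descent for Linear Regression with a tolerance-based stop. The data is synthetic, and the learning rate, tolerance, and iteration cap are illustrative values rather than recommendations.

```python
import numpy as np

# Synthetic linear data: y = 4 + 3x + noise (assumption).
rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias column x0 = 1

eta = 0.1                 # learning rate (illustrative)
tolerance = 1e-8          # stop when the gradient norm falls below this
max_iterations = 100_000  # deliberately large upper bound

theta = np.zeros(2)
n_iterations = 0
for n_iterations in range(1, max_iterations + 1):
    # Gradient of the MSE over the whole training set (hence "batch").
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    if np.linalg.norm(gradients) < tolerance:
        break
    theta = theta - eta * gradients

theta_best = np.linalg.pinv(X_b) @ y  # closed-form solution for comparison
print(n_iterations, theta, theta_best)
```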
    
    
* At the opposite extreme, **Stochastic Gradient Descent** picks a <mark>random instance in the training set at every step and computes the gradients based only on that single instance.</mark> Obviously, working on a single instance at a time makes the algorithm much faster because it has very little data to manipulate at every iteration.
    * When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.
    * When using Stochastic Gradient Descent, the training instances must be **independent and identically distributed (IID)** to ensure that the parameters get pulled toward the global optimum, on average.
    
    
* **Mini-batch GD** computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

* <img src='img/table4-1_alg_comp.jpg' align='left'>
<br clear='left'>


* Note that when there are multiple features, <mark>Polynomial Regression is capable of finding relationships between features</mark> (which is something a plain Linear Regression model cannot do). This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there were two features $a$ and $b$, PolynomialFeatures with degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the combinations $ab$, $a^2b$, and $ab^2$.

    * PolynomialFeatures(degree=d) transforms an array containing $n$ features **into an array containing $\frac{(n + d)!}{d!\,n!}$ features**, where $n!$ is the factorial of $n$, equal to $1 \times 2 \times 3 \times \cdots \times n$. Beware of the combinatorial explosion of the number of features!
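
A quick check of the feature count for two features and degree 3. Note that the count given by the formula includes the bias column of 1s, which `PolynomialFeatures` adds by default (`include_bias=True`).

```python
from math import factorial

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n, d = 2, 3                 # two features a and b, degree 3
X = np.array([[2.0, 3.0]])  # one instance with a=2, b=3 (made-up values)

poly = PolynomialFeatures(degree=d)  # include_bias=True by default
X_poly = poly.fit_transform(X)

expected = factorial(n + d) // (factorial(d) * factorial(n))
print(X_poly.shape[1], expected)
print(X_poly)  # 1, a, b, a^2, ab, b^2, a^3, a^2 b, a b^2, b^3
```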
    
    
* <mark>If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting. If it performs poorly on both, then it is underfitting.</mark>
    * Another way to tell is to look at the **learning curves**: these are plots of the model’s performance on the training set and the validation set as a function of the training set size (or the training iteration).


* One way to improve an **overfitting** model is to feed it more training data until the validation error reaches the training error.


* If your model is **underfitting** the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.


* <img src='img/ch4_bias_variance.jpg' align='left'>
<br clear='left'>


* **Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression**: a regularization term equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.
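
A small sketch of the shrinking effect, assuming a synthetic dataset and three arbitrary values of `alpha`. It fits Scikit-Learn's `Ridge` and reports the norm of the learned weight vector.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data (assumption).
X, y = make_regression(n_samples=50, n_features=10, noise=5.0, random_state=0)

alphas = [0.01, 1.0, 100.0]  # arbitrary regularization strengths
norms = [
    float(np.linalg.norm(Ridge(alpha=alpha).fit(X, y).coef_)) for alpha in alphas
]

for alpha, norm in zip(alphas, norms):
    print(f"alpha={alpha:>6}: ||weights|| = {norm:.2f}")
```

A larger `alpha` puts more weight on the penalty term, so the weight norm shrinks as `alpha` grows.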


* <img src='img/eq4-8_ridge_reg.jpg' align='left'>
<br clear='left'>


* <img src='img/eq4-10_lasso_reg.jpg' align='left'>
<br clear='left'>

    * The Lasso cost function <mark>is not differentiable at $\theta_i = 0$</mark> (for $i = 1, 2, \cdots, n$), but Gradient Descent still works fine if you use a subgradient vector $\mathbf{g}$ instead when any $\theta_i = 0$.


* <img src='img/eq4-12_elastic_net.jpg' align='left'>
<br clear='left'>


* A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called **early stopping**.


* <img src='img/eq4-13_logistic_reg.jpg' align='left'>
<br clear='left'>
    
    * The score t is often called the logit. The name comes from the fact that the logit function, <mark>defined as logit(p) = log(p / (1 – p))</mark>, is the inverse of the logistic function. Indeed, if you compute the logit of the estimated probability p, you will find that the result is t. The logit is **also called the log-odds**, since it is the log of the ratio between the estimated probability for the positive class and the estimated probability for the negative class.
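
A numerical check that the logit undoes the logistic function. The scores in `t` are arbitrary example values.

```python
import numpy as np

def logistic(t):
    """Logistic (sigmoid) function: maps a score t to a probability."""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Log-odds: log of p / (1 - p), the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

t = np.array([-2.0, 0.0, 1.5])  # arbitrary scores
p = logistic(t)                 # estimated probabilities
recovered_t = logit(p)          # should give back the original scores

print(p)
print(recovered_t)
```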
    
    
* The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers (as discussed in Chapter 3). This is called **Softmax Regression**, or Multinomial Logistic Regression. The idea is simple: when given an instance $\mathbf{x}$, the Softmax Regression model first computes a score $s_k(\mathbf{x})$ for each class $k$, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.

    * <img src='img/ep4-19_softmax.jpg' align='left'>
<br clear='left'>

    * <img src='img/eq4-20_softmax.jpg' align='left'>
    <br clear='left'>
    
    * The Softmax Regression classifier predicts only one class at a time (i.e., <mark>it is multiclass, not multioutput</mark>), so it should be used only with mutually exclusive classes, such as different types of plants. You cannot use it to recognize multiple people in one picture.
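
A minimal NumPy sketch of the softmax step. The class scores are invented; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])  # invented scores s_k(x) for three classes
probas = softmax(scores)
predicted_class = int(np.argmax(probas))  # exactly one class is predicted

print(probas, predicted_class)
```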
    
    
* Minimizing the cost function shown in Equation 4-22, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class. **Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes**.

    * <img src='img/eq4-22_cross_entropy.jpg' align='left'>
    <br clear='left'>

<h2><a id="ch5">Chapter 5: Support Vector Machines</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called **large margin classification**.


* Notice that adding more training instances “off the street” will not affect the decision boundary at all: it is fully determined (or “supported”) by the instances located on the edge of the street. These instances are called the **support vectors** (they are circled in Figure 5-1).


* <mark>SVMs are sensitive to the feature scales.</mark>


* If we strictly impose that all instances must be off the street and on the right side, this is called **hard margin classification**. There are two main issues with hard margin classification. <mark>First, it only works if the data is linearly separable. Second, it is sensitive to outliers.</mark>


* The objective is to find a good balance between keeping the street as large as possible and <mark>limiting the margin violations</mark> (i.e., instances that end up in the middle of the street or even on the wrong side). This is called **soft margin classification**.


* If your SVM model is overfitting, you can try regularizing it by reducing C.


* Unlike Logistic Regression classifiers, SVM classifiers <u>do not output probabilities for each class</u>.


* The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, **without actually having to add them**. So there is no combinatorial explosion of the number of features because you don’t actually add any features.
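
A small worked check of the idea for a second-degree polynomial on 2D vectors. The explicit mapping `phi` below is the standard textbook one; the vectors `a` and `b` are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit second-degree polynomial mapping of a 2D vector into 3D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a, b):
    """Second-degree polynomial kernel: computed in the original 2D space."""
    return float(np.dot(a, b)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = float(np.dot(phi(a), phi(b)))  # transform first, then dot product
implicit = poly_kernel(a, b)              # no transformation needed

print(explicit, implicit)
```

Both routes give the same number, but the kernel never builds the transformed vectors, which is what keeps high-degree polynomials affordable.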


* Another technique to tackle nonlinear problems is to add features computed using a **similarity function**, which measures how much each instance resembles a particular landmark.


* Nonlinear SVM Classification
    * Polynomial Kernel
    * Gaussian RBF Kernel
    * String Kernel
    
    
* With so many kernels to choose from, how can you decide which one to use? As a rule of thumb, you should always try the linear kernel first (<mark>remember that LinearSVC is much faster than SVC(kernel="linear")</mark>), especially if the training set is very large or if it has plenty of features. If the training set is not too large, you should also try the Gaussian RBF kernel; it works well in most cases. Then if you have spare time and computing power, you can experiment with a few other kernels, using cross-validation and grid search. You’d want to experiment like that especially if there are kernels specialized for your training set’s data structure.


* <img src='img/table5-1_complexity.jpg' align='left'>
    <br clear='left'>
    
* As mentioned earlier, the SVM algorithm is versatile: not only does it support linear and nonlinear classification, but it also supports linear and nonlinear regression. To use SVMs for regression instead of classification, the trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, <mark>SVM Regression tries to fit as many instances as possible on the street</mark> while limiting margin violations (i.e., instances off the street).

    * Adding more training instances within the margin does not affect the model’s predictions; thus, the model is said to be ϵ-insensitive.

* <img src='img/eq5-5_quad_prog.jpg' align='left'>
    <br clear='left'>


* Kernel trick
    * <img src='img/eq5-9_kernal_trick.jpg' align='left'>
    <br clear='left'>

<h2><a id="ch8">Chapter 8: Dimensionality Reduction</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* Two main approaches to reducing dimensionality: **projection** and **Manifold Learning**.

    * In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated (as discussed earlier for MNIST). As a result, all training instances lie within (or close to) a <mark>much lower-dimensional subspace of the high-dimensional space.</mark>
    
    * The Swiss roll is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space.
    
    
* **Principal Component Analysis (PCA)** is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.


* PCA identifies the axis that accounts for the **largest amount of variance** in the training set.


* <img src='img/eq8-1.png' align='left'>
    <br clear='left'>
    
    
* Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components.

    * Instead of arbitrarily choosing the number of dimensions to reduce down to, it is simpler to choose the number of dimensions that add up to a <mark>sufficiently large portion of the variance (e.g., 95%)</mark>. Unless, of course, you are reducing dimensionality for data visualization—in that case you will want to reduce the dimensionality down to 2 or 3.
    
    * It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. This won’t give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be close to the original data.
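
A sketch of both steps with Scikit-Learn's `PCA`. It uses the small bundled digits dataset (64 features) as a stand-in for the 784-dimensional MNIST images, which is an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Bundled 8x8 digits (64 features), used instead of MNIST (assumption).
X, _ = load_digits(return_X_y=True)

# A float between 0 and 1 asks PCA to keep enough components
# to preserve that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Decompress back to the original number of dimensions.
X_recovered = pca.inverse_transform(X_reduced)

explained = float(pca.explained_variance_ratio_.sum())
reconstruction_mse = float(np.mean((X - X_recovered) ** 2))
print(X.shape[1], "->", X_reduced.shape[1], "dimensions")
print(f"explained variance: {explained:.3f}, reconstruction MSE: {reconstruction_mse:.3f}")
```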
    
    
* **Randomized PCA**: If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic algorithm called Randomized PCA that quickly finds an approximation of the first d principal components.


* **Incremental PCA**: One problem with the preceding implementations of PCA is that they require the whole training set to fit in memory in order for the algorithm to run. Fortunately, Incremental PCA (IPCA) algorithms have been developed. They allow you to split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time.


* **Kernel PCA**: In Chapter 5 we discussed the kernel trick, a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space. It turns out that the same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA). It is often <mark>good at preserving clusters of instances after projection</mark>, or sometimes even unrolling datasets that lie close to a twisted manifold.


* **Locally Linear Embedding (LLE)**: It is another powerful nonlinear dimensionality reduction (NLDR) technique. It is a Manifold Learning technique that does not rely on projections, like the previous algorithms do. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors, and then looking for a low-dimensional representation of the training set where these local relationships are best preserved. This approach makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.


* Other dimensionality reduction techniques

    * <img src='img/ch8-1.png' align='left'>
        <br clear='left'>

<h2><a id="ch9">Chapter 9: Unsupervised Learning Techniques</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* <img src='img/ch9-1.png' align='left'>
    <br clear='left'>
    
    
* **Clustering** is the task of identifying similar instances and assigning them to clusters, or groups of similar instances.


* <img src='img/ch9-2.png' align='left'><br>
  <img src='img/ch9-3.png' align='left'>
    <br clear='left'>


* Some algorithms look for instances centered around a particular point, called a *centroid*. Others look for continuous regions of densely packed instances: these clusters can take on any shape. Some algorithms are hierarchical, looking for clusters of clusters.


* The K-Means algorithm was proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation, but it was only published outside of the company in 1982. In 1965, Edward W. Forgy had published virtually the same algorithm, so K-Means is sometimes referred to as *Lloyd–Forgy*.


* The computational complexity of the algorithm is generally linear with regard to the number of instances m, the number of clusters k, and the number of dimensions n. However, this is only true when the data has a clustering structure. If it does not, then in the worst-case scenario the complexity can increase exponentially with the number of instances. In practice, this rarely happens, and **K-Means is generally one of the fastest clustering algorithms**.


* The K-Means algorithm is one of the fastest clustering algorithms, but also one of the simplest:
    * First initialize $k$ centroids randomly: $k$ distinct instances are chosen randomly from the dataset and the centroids are placed at their locations.
    * Repeat until convergence (i.e., until the centroids stop moving):
        * Assign each instance to the closest centroid.
        * Update the centroids to be the mean of the instances that are assigned to them.
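The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the helper names, the two-blob toy data, and the iteration cap `max_iters` are assumptions made for the example.

```python
import numpy as np

def closest_centroid(X, centroids):
    """Index of the nearest centroid for every instance."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: k distinct instances become the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step.
        labels = closest_centroid(X, centroids)
        # Update step: mean of the assigned instances
        # (a centroid with no instances is left where it is).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: the centroids stopped moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, closest_centroid(X, centroids)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),   # hypothetical blob 1
               rng.normal(5.0, 0.5, size=(50, 2))])  # hypothetical blob 2
centroids, labels = k_means(X, k=2)
```

Because the result depends on the random initialization, a single run may stop at a suboptimal solution, which is why several runs are usually compared.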


* To select the best model, we will need a way to evaluate a K-Means model's performance. Unfortunately, clustering is an unsupervised task, so we do not have the targets. But at least we can measure the distance between each instance and its centroid. This is the idea behind the _inertia_ metric.
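As a small illustration, inertia can be computed as the sum of squared distances between each instance and its closest centroid (the summed form is, to my knowledge, what Scikit-Learn reports as `inertia_`; a mean would rank models the same way). The four points and the two candidate sets of centroids below are made up for the example.

```python
import numpy as np

def inertia(X, centroids):
    """Sum of squared distances from each instance to its closest centroid."""
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])  # one per visible group
bad_centroids = np.array([[5.0, 0.0], [5.0, 2.0]])    # splits both groups

good_inertia = inertia(X, good_centroids)
bad_inertia = inertia(X, bad_centroids)
```

The lower value identifies the better of two models trained with the same number of clusters.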


* **K-means++**
    * Instead of initializing the centroids entirely randomly, it is preferable to initialize them using the following algorithm, proposed in a [2006 paper](https://goo.gl/eNUPw6) by David Arthur and Sergei Vassilvitskii:
    * Take one centroid $c_1$, chosen uniformly at random from the dataset.
    * Take a new centroid $c_i$, choosing an instance $\mathbf{x}_i$ with probability $D(\mathbf{x}_i)^2 / \sum\limits_{j=1}^{m} D(\mathbf{x}_j)^2$, where $D(\mathbf{x}_i)$ is the distance between the instance $\mathbf{x}_i$ and the closest centroid that was already chosen. This probability distribution ensures that instances farther away from already chosen centroids are much more likely to be selected as centroids.
    * Repeat the previous step until all $k$ centroids have been chosen.
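The initialization steps above can be sketched as follows. The function name `kmeans_pp_init` and the toy data are assumptions for illustration; only the sampling rule comes from the description.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random from the dataset.
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # D(x)^2: squared distance to the closest centroid chosen so far.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next centroid: sampled with probability D(x_i)^2 / sum_j D(x_j)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))          # hypothetical data
init_centroids = kmeans_pp_init(X, k=5)
```

An already chosen instance has $D = 0$, so it cannot be picked twice.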



* Limitations of K-Means

    * it is necessary to run the algorithm several times to avoid suboptimal solutions
    * need to specify the number of clusters
    * K-Means does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes
    

* Since there are 10 different digits, it is tempting to set the number of clusters to 10. However, each digit can be written several different ways, so it is <mark>preferable to use a larger number of clusters, such as 50</mark>.


* The DBSCAN algorithm defines clusters as continuous regions of high density. This algorithm works well if all the clusters are dense enough and if they are well separated by low-density regions.


* *A Gaussian mixture model (GMM)* is a probabilistic model that assumes that the instances were <mark>generated from a mixture of several Gaussian distributions whose parameters are unknown</mark>. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid
    * <img src='img/fig9-16.png' align='left'>
        <br clear='left'>
        
        
* Likelihood function: Given a statistical model with some parameters θ, <mark>the word “probability” is used to describe how plausible a future outcome x is (knowing the parameter values θ), while the word “likelihood” is used to describe how plausible a particular set of parameter values θ is, after the outcome x is known.</mark>
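To make the distinction concrete, here is a small sketch using a single 1D Gaussian (an assumed toy model; the observed value and the candidate means are invented). The same density formula is read in two ways: as a function of x with θ fixed, or as a function of θ with x fixed.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# "Probability" view: parameters are fixed (mu=0, sigma=1), the outcome x varies.
densities = {x: gaussian_pdf(x, mu=0.0, sigma=1.0) for x in (-1.0, 0.0, 2.5)}

# "Likelihood" view: the outcome is fixed (x=2.5), the parameter mu varies.
x_observed = 2.5
likelihoods = {mu: gaussian_pdf(x_observed, mu=mu, sigma=1.0)
               for mu in (0.0, 1.0, 2.5, 4.0)}

# The candidate mean that makes the observation most plausible.
best_mu = max(likelihoods, key=likelihoods.get)
```

Among these candidates, the likelihood is highest for the mean that coincides with the observed value.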

<h2><a id="ch10">Chapter 10: Introduction to Artificial Neural Networks with Keras</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* <img src='img/fig10-3.png' align='left'>
    <br clear='left'><br>
    
    * The first network on the left is the identity function: if neuron A is activated, then neuron C gets activated as well (since it receives two input signals from neuron A); but if neuron A is off, then neuron C is off as well.

    * The second network performs a logical AND: neuron C is activated only when both neurons A and B are activated (a single input signal is not enough to activate neuron C).

    * The third network performs a logical OR: neuron C gets activated if either neuron A or neuron B is activated (or both).

    * Finally, if we suppose that an input connection can inhibit the neuron’s activity (which is the case with biological neurons), then the fourth network computes a slightly more complex logical proposition: neuron C is activated only if neuron A is active and neuron B is off. If neuron A is active all the time, then you get a logical NOT: neuron C is active when neuron B is off, and vice versa.
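The four networks can be simulated with a tiny model of the artificial neuron. This sketch assumes a neuron fires when at least two of its excitatory input signals are active and no inhibitory input is active; the exact wiring of each network (two connections from A, and so on) is inferred from the descriptions above, not read off the figure.

```python
def neuron(excitatory, inhibitory=(), threshold=2):
    """Fire if no inhibitory input is active and at least `threshold`
    excitatory input signals are active (assumed activation rule)."""
    if any(inhibitory):
        return False
    return sum(excitatory) >= threshold

def identity(a):
    return neuron([a, a])                  # two connections from A

def logical_and(a, b):
    return neuron([a, b])                  # one connection from each input

def logical_or(a, b):
    return neuron([a, a, b, b])            # two connections from each input

def a_and_not_b(a, b):
    return neuron([a, a], inhibitory=[b])  # B inhibits C

def logical_not(b):
    return a_and_not_b(True, b)            # A is active all the time
```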
    
    
* **Perceptron**

    * The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see Figure 10-4) called a *threshold logic unit (TLU)*, or sometimes a linear threshold unit (LTU). The inputs and output are numbers (instead of binary on/off values), and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs ($z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{x}^\top \mathbf{w}$), then applies a step function to that sum and outputs the result: $h_\mathbf{w}(\mathbf{x}) = \text{step}(z)$.<br>
    
    * <img src='img/fig10-4.png' align='left'>
    <br clear='left'>
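A TLU is short enough to write out directly. The sketch below assumes a Heaviside step function (output 1 when $z \ge 0$) and folds the bias into the weights through a constant first input of 1; the weights are hypothetical values chosen so that the unit behaves like a logical AND.

```python
import numpy as np

def step(z):
    """Heaviside step function (assumed choice): 1 if z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(int)

def tlu(x, w):
    """Weighted sum of the inputs followed by the step function."""
    return step(x @ w)

# Hypothetical weights: bias -1.5, then weight 1 for each of the two inputs.
w = np.array([-1.5, 1.0, 1.0])
# Each row is (1, x1, x2); the leading 1 is the constant bias input.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
outputs = tlu(X, w)
```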
    
    
* **Basics of NN**

    * A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer (i.e., its input neurons), the layer is called a fully connected layer, or a dense layer. The inputs of the Perceptron are fed to special passthrough neurons called input neurons: they output whatever input they are fed. All the input neurons form the input layer. Moreover, an extra bias feature is generally added ($x_0 = 1$): it is typically represented using a special type of neuron called a bias neuron, which outputs 1 all the time. A Perceptron with two inputs and three outputs is represented in Figure 10-5. This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multioutput classifier.
    
    * <img src='img/fig10-5.png' align='left'>
    <br clear='left'>
    
    
* <img src='img/eq10-3.png' align='left'>
    <br clear='left'>
    
    
* The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers). However, if the training instances are linearly separable, Rosenblatt demonstrated that this algorithm would converge to a solution. This is called the ***Perceptron convergence theorem***.
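A sketch of Perceptron training on a linearly separable toy problem (logical AND). It assumes the standard update rule, in which each weight is nudged by `eta * (target - prediction) * input` whenever an instance is misclassified; the learning rate, epoch cap, and data are illustrative choices.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    Xb = np.c_[np.ones(len(X)), X]        # prepend the bias input x0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, target in zip(Xb, y):
            pred = int(xi @ w >= 0)       # step function
            if pred != target:
                # Reinforce connections that would have reduced the error.
                w += eta * (target - pred) * xi
                errors += 1
        if errors == 0:                   # a full pass with no mistakes
            break
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                # logical AND is linearly separable
w = train_perceptron(X, y)
predictions = (np.c_[np.ones(len(X)), X] @ w >= 0).astype(int)
```

On a problem that is not linearly separable, such as XOR, the loop would simply run until `max_epochs` without reaching zero errors.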


* XOR implementation using multilayer perceptron

    * It turns out that some of the <mark>limitations of Perceptrons can be eliminated by stacking multiple Perceptrons.</mark> The resulting ANN is called a Multilayer Perceptron (MLP). An MLP can solve the XOR problem, as you can verify by computing the output of the MLP represented on the right side of Figure 10-6: with inputs (0, 0) or (1, 1), the network outputs 0, and with inputs (0, 1) or (1, 0) it outputs 1. All connections have a weight equal to 1, except the four connections where the weight is shown.
    
    * <img src='img/fig10-6.png' align='left'>
    <br clear='left'>
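The XOR claim is easy to check by hand or in code. The weights below are one set that works (a hidden unit acting as AND, another acting as OR, and an output that fires for "OR but not AND"); they are an assumption for illustration and are not transcribed from the figure.

```python
def step(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: both inputs feed both neurons with weight 1.
    h_and = step(x1 + x2 - 1.5)   # fires only when both inputs are 1
    h_or = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    # Output layer: OR minus AND, with a bias of -0.5.
    return step(h_or - h_and - 0.5)

truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```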

* **Backpropagation**

    * For many years researchers struggled to find a way to train MLPs, without success. But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the backpropagation training algorithm, which is still used today. In short, it is Gradient Descent (introduced in Chapter 4) using an efficient technique for computing the gradients automatically: **in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network’s error with regard to every single model parameter.** In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.<br><br>
   
    * For each training instance, the backpropagation algorithm first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error (Gradient Descent step).<br><br>
    
    * It is important to initialize all the hidden layers’ connection weights randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical. In other words, <mark>despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won’t be too smart.</mark> If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons.<br><br>
    
    * 3 common activation functions

        * logistic (sigmoid) function, σ(z) = 1 / (1 + exp(–z))
        * The hyperbolic tangent function: tanh(z) = 2σ(2z) – 1
        * The Rectified Linear Unit function: ReLU(z) = max(0, z)
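The three activation functions listed above translate directly into NumPy. The sketch also checks the stated identity $\tanh(z) = 2\sigma(2z) - 1$ against NumPy's own `np.tanh`; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Written via the identity from the notes rather than np.tanh.
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-5.0, 5.0, 11)
identity_holds = np.allclose(tanh(z), np.tanh(z))
```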
    
    
    
* Automatically computing gradients is called **automatic differentiation, or autodiff**. There are various autodiff techniques, with different pros and cons. The one used by backpropagation is called reverse-mode autodiff. It is fast and precise, and is well suited when the function to differentiate has many variables (e.g., connection weights) and few outputs (e.g., one loss). If you want to learn more about autodiff, check out Appendix D.
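To illustrate what "one forward pass, one backward pass" means, here is a hand-written sketch for a tiny network. The architecture (3 inputs, 4 tanh hidden units, 1 linear output), the squared-error loss, and the random values are all assumptions made for the example. One gradient entry is compared with a finite-difference estimate as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0                    # one hypothetical training instance
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden layer parameters
w2, b2 = rng.normal(size=4), 0.0                  # output layer parameters

def forward(W1, b1, w2, b2):
    h = np.tanh(W1 @ x + b1)          # hidden activations
    y_hat = w2 @ h + b2               # network output
    loss = 0.5 * (y_hat - y) ** 2     # squared error
    return loss, h, y_hat

# Forward pass: compute and keep the intermediate values.
loss, h, y_hat = forward(W1, b1, w2, b2)

# Backward pass: apply the chain rule from the loss back to each parameter.
d_yhat = y_hat - y                    # dLoss/dy_hat
grad_w2 = d_yhat * h
grad_b2 = d_yhat
d_h = d_yhat * w2                     # error flowing into the hidden layer
d_z = d_h * (1.0 - h ** 2)            # through tanh: tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(d_z, x)
grad_b1 = d_z

# Sanity check: central finite difference on a single weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, w2, b2)[0] - forward(W1_minus, b1, w2, b2)[0]) / (2 * eps)

# One regular Gradient Descent step using the computed gradients.
lr = 0.01
W1_new, b1_new = W1 - lr * grad_W1, b1 - lr * grad_b1
w2_new, b2_new = w2 - lr * grad_w2, b2 - lr * grad_b2
```

The backward pass produces every gradient at roughly the cost of one extra pass, whereas the finite-difference check needs two forward passes per parameter.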

<h2><a id="ch11">Chapter 11: Training Deep Neural Networks</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* You may be faced with the tricky *vanishing gradients problem* or the related *exploding gradients problem*. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backward through the DNN during training. Both of these problems make lower layers very hard to train.


* **Glorot and He Initialization**
    * In their paper, Glorot and Bengio propose a way to significantly alleviate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don’t want the signal to die out, nor do we want it to explode and saturate. **For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction** (please check out the paper if you are interested in the mathematical details). It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the fan-in and fan-out of the layer), but Glorot and Bengio proposed a good compromise that has proven to work very well in practice: the connection weights of each layer must be initialized randomly as described in Equation 11-1, where $\text{fan}_\text{avg} = (\text{fan}_\text{in} + \text{fan}_\text{out})/2$. This initialization strategy is called Xavier initialization or Glorot initialization, after the paper’s first author.
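Equation 11-1 is not reproduced in these notes. As I understand it, Glorot initialization draws weights either from a normal distribution with mean 0 and variance $1/\text{fan}_\text{avg}$, or from a uniform distribution on $[-r, r]$ with $r = \sqrt{3/\text{fan}_\text{avg}}$ (which has the same variance). A minimal sketch under that assumption, with hypothetical layer sizes:

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    fan_avg = (fan_in + fan_out) / 2
    # Normal distribution with mean 0 and variance 1 / fan_avg.
    return rng.normal(0.0, np.sqrt(1.0 / fan_avg), size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    fan_avg = (fan_in + fan_out) / 2
    # Uniform on [-r, r]; its variance r^2 / 3 also equals 1 / fan_avg.
    r = np.sqrt(3.0 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 300, 100                 # hypothetical layer sizes
W_normal = glorot_normal(fan_in, fan_out, rng)
W_uniform = glorot_uniform(fan_in, fan_out, rng)
target_variance = 1.0 / ((fan_in + fan_out) / 2)
```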
    
    
* It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle (we will discuss how to find them in Chapter 14), then reuse the lower layers of this network. This technique is called ***transfer learning***. It will not only speed up training considerably, but also require significantly less training data.

     * <img src='img/fig11-4.png' align='left'>
    <br clear='left'>
    
    
* Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training (and reach a better solution):
    1. applying a good initialization strategy for the connection weights
    2. using a good activation function
    3. using Batch Normalization
    4. reusing parts of a pretrained network (possibly built on an auxiliary task or using unsupervised learning)

    Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer: momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam and Nadam optimization.
    
* <img src='img/table11-2.png' align='left'>
    <br clear='left'>
    
    
* Avoiding Overfitting Through Regularization
    
    * With thousands of parameters, you can fit the whole zoo. Deep neural networks typically have tens of thousands of parameters, sometimes even millions. This gives them an incredible amount of freedom and means they can fit a huge variety of complex datasets. But this great flexibility also makes the network prone to overfitting the training set. We need regularization.
    
    * **ℓ1 and ℓ2 Regularization**
        * Just as with simple linear models, you can use ℓ2 regularization to constrain a neural network’s connection weights, and/or ℓ1 regularization if you want a sparse model (with many weights equal to 0).
    * **Dropout**
        * It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step (see Figure 11-9). The hyperparameter p is called the dropout rate, and it is typically set between 10% and 50%: closer to 20–30% in recurrent neural nets (see Chapter 15), and closer to 40–50% in convolutional neural networks (see Chapter 14). After training, neurons don’t get dropped anymore. And that’s all (except for a technical detail we will discuss momentarily).
        
        * <img src='img/fig11-9.png' align='left'>
    <br clear='left'>
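A minimal sketch of dropout applied to one layer's activations. It assumes the common "inverted dropout" convention, in which the surviving activations are divided by the keep probability $1 - p$ during training so that their expected value is unchanged; the function name and the data are illustrative.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training."""
    if not training:
        return activations            # after training, nothing is dropped
    keep_mask = rng.random(activations.shape) >= p
    # Rescale survivors by the keep probability (assumed convention).
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones((1000, 10))     # hypothetical layer outputs
dropped = dropout(activations, p=0.2, rng=rng)
untouched = dropout(activations, p=0.2, rng=rng, training=False)
```

A fresh mask is drawn on every call, so a neuron dropped at one training step may be active at the next.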
    
    * **Monte Carlo (MC) Dropout**
    
        * *Averaging over multiple predictions* with dropout on gives us a Monte Carlo estimate that is generally more reliable than the result of a single prediction with dropout off.
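The mechanics of MC Dropout can be sketched with a toy network. The weights here are random stand-ins for a trained model, and the layer sizes, dropout rate, and number of samples are arbitrary, so the numbers themselves mean nothing; the point is only that dropout stays on at prediction time and the stochastic predictions are averaged.

```python
import numpy as np

rng = np.random.default_rng(42)
W1 = rng.normal(size=(8, 4))          # hypothetical "trained" hidden weights
w2 = rng.normal(size=8)               # hypothetical "trained" output weights

def predict(x, p=0.5, dropout_on=False):
    h = np.maximum(0.0, W1 @ x)       # ReLU hidden layer
    if dropout_on:
        mask = rng.random(h.shape) >= p
        h = h * mask / (1.0 - p)      # inverted dropout (assumed convention)
    return w2 @ h

x = np.ones(4)                        # hypothetical input
single_prediction = predict(x)        # dropout off: deterministic

# MC Dropout: many stochastic forward passes, then average.
samples = np.array([predict(x, dropout_on=True) for _ in range(100)])
mc_mean, mc_std = samples.mean(), samples.std()
```

The spread of the samples (`mc_std`) can serve as a rough indication of the model's uncertainty for that input.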

<h2><a id="ch12">Chapter 12: Custom Models and Training with TensorFlow</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* About TensorFlow

    * Its core is very similar to NumPy, but with GPU support.

    * It supports distributed computing (across multiple devices and servers).

    * It includes a kind of just-in-time (JIT) compiler that allows it to optimize computations for speed and memory usage. It works by extracting the computation graph from a Python function, then optimizing it (e.g., by pruning unused nodes), and finally running it efficiently (e.g., by automatically running independent operations in parallel).

    * Computation graphs can be exported to a portable format, so you can train a TensorFlow model in one environment (e.g., using Python on Linux) and run it in another (e.g., using Java on an Android device).

    * It implements autodiff (see Chapter 10 and Appendix D) and provides some excellent optimizers, such as RMSProp and Nadam (see Chapter 11), so you can easily minimize all sorts of loss functions.
    
    
* **TensorFlow's Python API**
    * <img src='img/fig12-1.png' align='left'>
<br clear='left'>


* GPUs can dramatically speed up computations by splitting them into many smaller chunks and running them in parallel across many GPU threads. TPUs are even faster: they are custom ASIC chips built specifically for Deep Learning operations.


* <mark>TensorFlow’s API revolves around tensors, which flow from operation to operation—hence the name TensorFlow.</mark> A tensor is very similar to a NumPy ndarray: it is usually a multidimensional array, but it can also hold a scalar (a simple value, such as 42). These tensors will be important when we create custom cost functions, custom metrics, custom layers, and more


* <img src='img/fig12-1-1.png' align='left'><br>
<img src='img/fig12-1-2.png' align='left'>
<br clear='left'>


* **The mean squared error** might penalize large errors too much and cause your model to be imprecise. The **mean absolute error** would not penalize outliers as much, but training might take a while to converge, and the trained model might not be very precise. This is probably a good time to use the Huber loss (introduced in Chapter 10) instead of the good old MSE. The Huber loss is not currently part of the official Keras API, but it is available in tf.keras (just use an instance of the keras.losses.Huber class).
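For reference, the Huber loss is quadratic for small errors and linear for large ones, which is why it sits between the MSE and the MAE. A NumPy sketch follows; the threshold `delta=1.0` and the sample values are assumptions for illustration, and in a Keras model the built-in class mentioned above would be used instead.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                       # MSE-like near zero
    linear = delta * (np.abs(error) - 0.5 * delta)   # MAE-like for large errors
    return np.where(is_small, squared, linear).mean()

y_true = np.array([0.0, 0.0])
y_pred = np.array([0.5, 3.0])         # one small error, one outlier
loss = huber_loss(y_true, y_pred)
mse = np.mean((y_true - y_pred) ** 2)
```

With these values the outlier contributes far less to the Huber loss than it does to the MSE.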

<h2><a id="ch13">Chapter 13: Loading and Preprocessing Data with TensorFlow</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>

* Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other Deep Learning libraries, but TensorFlow makes it easy thanks to the **Data API**: you just create a dataset object, and tell it where to get the data and how to transform it. TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, and prefetching. 


* The whole Data API revolves around the concept of a *dataset*

    * The dataset methods do not modify datasets, they create new ones, so make sure to keep a reference to these new datasets (e.g., with dataset = ...), or else nothing will happen.


* <img src='img/fig13-2.png' align='left'>
<br clear='left'>

    
* By calling prefetch(1) at the end, we are creating a dataset that will do its best to always be one batch ahead. <mark>In other words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting the next batch ready</mark> (e.g., reading the data from disk and preprocessing it). This can improve performance dramatically

    * <img src='img/fig13-3.png' align='left'>
<br clear='left'>


* The **TFRecord format** is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently. It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record is comprised of a length, a CRC checksum to check that the length was not corrupted, then the actual data, and finally a CRC checksum for the data). You can easily create a TFRecord file using the tf.io.TFRecordWriter class


* **The TensorFlow Datasets (TFDS) Project** makes it very easy to download common datasets, from small ones like MNIST or Fashion MNIST to huge datasets like ImageNet (you will need quite a bit of disk space!). The list includes image datasets, text datasets (including translation datasets), and audio and video datasets. You can visit https://homl.info/tfds to view the full list, along with a description of each dataset.

<h2><a id="ch14">Chapter 14: Deep Computer Vision Using Convolutional Neural Networks</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>


<h2><a id="ch15">Chapter 15: Processing Sequences Using RNNs and CNNs</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>


<h2><a id="ch16">Chapter 16: Natural Language Processing with RNNs and Attention</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>


<h2><a id="ch17">Chapter 17: Representation Learning and Generative Learning Using Autoencoders and GANs</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>


<h2><a id="ch18">Chapter 18: Reinforcement Learning</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>


<h2><a id="ch19">Chapter 19: Training and Deploying TensorFlow Models at Scale</a></h2>
<div align='right'><a href='#top'>Go To Top</a></div>
