<h1>Chapter 1: The Machine Learning Landscape</h1>

<h2>1.1 Definition</h2>

<p>A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.</p>

<h2>1.2 Why use ML?</h2>
<p>ML is great for:</p>
<ul>
    <li>Problems for which existing solutions require a lot of fine-tuning or long lists of rules: one ML algorithm can often simplify code and perform better than the traditional approach.</li>
    <li>Complex problems for which using a traditional approach yields no good solution: the best ML techniques can perhaps find a solution.</li>
    <li>Fluctuating environments: an ML system can adapt to new data.</li>
    <li>Getting insights about complex problems and large amounts of data.</li>
</ul>

<h2>1.3 Classification of ML Systems</h2>
<p>There are many different types of ML systems, so it is useful to classify them in broad categories, based on the following criteria:</p>
<ol>
    <li>Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, or Reinforcement Learning).</li>
    <li>Whether or not they can learn incrementally on the fly (online versus batch learning).</li>
    <li>Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much like scientists do (instance-based learning versus model-based learning).</li>
</ol>

<h3>1.3.1 Supervised and Unsupervised Learning</h3>
<h4>1.3.1.1 Supervised Learning</h4>
<p>In <em>Supervised Learning</em>, the training set you feed to the algorithm includes the desired solutions, called labels.</p>

<p>A typical supervised learning task is <em>Classification</em>. The spam filter is a good example of this: it is trained with many example e-mails along with their <em>class</em>, and it must learn how to classify new e-mails.</p>

<p>Another example of a supervised learning task is <em>Regression</em>, where the goal is to predict a <em>target</em> numeric variable, such as the price of a car, given a set of <em>features</em> (mileage, brand, etc.) called <em>predictors</em>. To train the system, you need to give it many examples of cars, including both their predictors and their labels (here, their prices).</p>

<p>Some regression algorithms can be used for classification as well, and vice versa. For example, <em>Logistic Regression</em> is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class.</p>
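<p>As a minimal sketch of how a regression-style score can be turned into a class probability, the example below applies the logistic (sigmoid) function to a weighted sum of features. The feature names, weights and bias are made up for illustration and are not learned from data.</p>

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam_probability(features, weights, bias):
    """Estimated probability that an e-mail is spam, given numeric features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for two made-up features:
# (number of suspicious words, number of links).
weights = [0.8, 0.5]
bias = -2.0

p = predict_spam_probability([3, 2], weights, bias)
label = "spam" if p >= 0.5 else "ham"  # threshold the probability to classify
print(round(p, 3), label)
```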

<h5>Supervised Learning Algorithms</h5>
<ul>
    <li>k-Nearest Neighbors</li>
    <li>Linear Regression</li>
    <li>Logistic Regression</li>
    <li>Support Vector Machines</li>
    <li>Decision Trees and Random Forests</li>
    <li>Neural Networks (although some are Unsupervised)</li>
</ul>

<h4>1.3.1.2 Unsupervised Learning</h4>
<p>In <em>Unsupervised Learning</em> the training data is unlabeled. The system tries to learn without a teacher.</p>

<h5>Unsupervised Learning Algorithms</h5>
<ul>
    <li>Clustering: K-Means, DBSCAN, Hierarchical Cluster Analysis (HCA).</li>
    <li>Anomaly detection and novelty detection: One-class SVM, Isolation Forest.</li>
    <li>Visualisation and Dimensionality Reduction: Principal Component Analysis (PCA), Kernel PCA, Locally Linear Embedding (LLE), t-Distributed Stochastic Neighbor Embedding (t-SNE).</li>
    <li>Association rule learning: Apriori, Eclat.</li>
</ul>
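<p>To make the idea of learning from unlabelled data concrete, here is a rough sketch of K-Means on one-dimensional data. The data points, the starting centroids and the fixed number of iterations are assumptions chosen for illustration; real implementations pick starting centroids more carefully and stop when the centroids no longer move.</p>

```python
def k_means_1d(points, centroids, n_iterations=10):
    """Plain K-Means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = sum(cluster) / len(cluster)
    return centroids, clusters

# No labels are given: the algorithm has to find the two groups by itself.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centroids, clusters = k_means_1d(points, centroids=[0.0, 6.0])
print(centroids)
print(clusters)
```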

<h4>1.3.1.3 Semisupervised Learning</h4>
<p>Some algorithms can deal with data that is partially labelled. This is called <em>Semisupervised Learning</em>. Most of these algorithms are combinations of supervised and unsupervised algorithms. For example, Deep Belief Networks (DBNs) are based on unsupervised components called Restricted Boltzmann Machines (RBMs) stacked on top of one another. The RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.</p>

<h4>1.3.1.4 Reinforcement Learning</h4>
<p><em>Reinforcement Learning</em> is different altogether. The learning system, called an <em>agent</em> in this context, can observe the environment, select and perform actions, and get <em>rewards</em> in return (or <em>penalties</em> in the form of negative rewards). It must learn by itself what is the best strategy, called a <em>policy</em>, to get the most reward over time.</p>
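<p>The loop of acting, receiving a reward or penalty, and updating a policy can be sketched with a very simple environment. The example below is an epsilon-greedy agent on a multi-armed bandit, which is only one possible illustration: the reward probabilities, <code>epsilon</code> and the number of steps are assumptions, and the resulting "policy" is just a single preferred action because this toy environment has no state to observe.</p>

```python
import random

def run_bandit(reward_probs, n_steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent facing a multi-armed bandit environment."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions      # how often each action has been tried
    values = [0.0] * n_actions    # running average reward of each action
    for _ in range(n_steps):
        if epsilon > rng.random():
            action = rng.randrange(n_actions)                        # explore
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit
        # The environment answers with a reward (+1) or a penalty (-1).
        reward = 1 if reward_probs[action] > rng.random() else -1
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    # The learned policy: always pick the action with the best average reward.
    policy = max(range(n_actions), key=lambda a: values[a])
    return policy, values

policy, values = run_bandit([0.2, 0.8])
print("learned policy: always pick action", policy)
```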

<h3>1.3.2 Batch and Online Learning</h3>
<h4>1.3.2.1 Batch Learning</h4>
<p>In <em>Batch Learning</em>, the system is incapable of learning incrementally: it must be trained using all the available data. This generally takes a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called <em>offline learning</em>.</p>

<h4>1.3.2.2 Online Learning</h4>
<p>In <em>Online Learning</em>, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called <em>mini-batches</em>. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.</p>

<p>This is great for systems that receive data as a continuous flow (e.g. stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources. It can also be used to train systems on huge datasets that cannot fit into one machine's main memory (this is called <em>out-of-core</em> learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.</p>

<p>One important parameter of online learning systems is how fast they should adapt to changing data: this is called the <em>learning rate</em>.</p>
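<p>The following sketch shows one way incremental training on mini-batches might look: a linear model updated by a gradient step each time a new mini-batch arrives. The data stream (points on the line y = 2x + 1), the batch size and the <code>learning_rate</code> value are hypothetical choices made for illustration.</p>

```python
def online_linear_fit(stream, learning_rate=0.05):
    """Fit y = w * x + b incrementally, one mini-batch at a time."""
    w, b = 0.0, 0.0
    for batch in stream:
        # Average gradient of the squared error over the current mini-batch.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        # The learning rate controls how strongly each batch moves the model.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mini_batches(n_batches, batch_size=4):
    """Hypothetical data stream: points on the line y = 2x + 1."""
    for i in range(n_batches):
        xs = [((i * batch_size + j) % 10) / 10 for j in range(batch_size)]
        yield [(x, 2 * x + 1) for x in xs]

# Each batch is used once and then discarded, so memory use stays small.
w, b = online_linear_fit(mini_batches(5000))
print(round(w, 2), round(b, 2))
```

<p>A larger learning rate makes the model adapt faster to new data but also forget old data sooner; a smaller one makes it more stable but slower to react.</p>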

<h3>1.3.3 Instance-Based Versus Model-Based Learning</h3>
<p>Another way to categorise ML systems is how they generalise. Most ML tasks are about making predictions. This means that given a number of training examples, the system needs to be able to make good predictions for (generalise to) examples it has never seen before.</p>

<h4>1.3.3.1 Instance-Based Learning</h4>
<p>Possibly the most trivial form of learning is to learn by heart. An <em>instance-based learning</em> system learns the examples by heart, then generalises to new cases by using a similarity measure to compare them to the learned examples (or a subset of them).</p>
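<p>k-Nearest Neighbors is a typical instance-based method. The sketch below stores the training examples as they are and classifies a new point by majority vote among its closest neighbours, using squared Euclidean distance as the similarity measure. The features and examples are invented for illustration.</p>

```python
from collections import Counter

def knn_predict(training_set, new_point, k=3):
    """Classify new_point by majority vote among its k nearest stored examples."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    neighbours = sorted(
        training_set, key=lambda example: squared_distance(example[0], new_point)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical examples: (suspicious words, links) paired with a label.
training_set = [
    ((0, 1), "ham"), ((1, 0), "ham"), ((1, 1), "ham"),
    ((7, 5), "spam"), ((8, 6), "spam"), ((9, 4), "spam"),
]
print(knn_predict(training_set, (8, 5)))
print(knn_predict(training_set, (0, 0)))
```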

<h4>1.3.3.2 Model-Based Learning</h4>
<p>Another way to generalise from a set of examples is to build a model of these examples and then use that model to make predictions. You can, for example, assume that the data follows a linear model and use Linear Regression to make predictions.</p>
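<p>As an illustration of model-based learning, the sketch below fits a straight line with the ordinary least-squares formulas and then uses only the two learned parameters to predict. The car ages and prices are made-up numbers.</p>

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: car age in years and price in thousands.
ages = [1, 2, 3, 4, 5]
prices = [20.0, 18.1, 15.9, 14.0, 12.0]
slope, intercept = fit_line(ages, prices)

def predict_price(age):
    """Prediction uses the model parameters only, not the stored examples."""
    return slope * age + intercept

print(round(slope, 2), round(intercept, 2), round(predict_price(6), 2))
```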

<h2>1.4 Challenges of ML</h2>
<ul>
    <li>Insufficient Quantity of Training Data: It takes a lot of data for most ML algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition, you may need millions of examples.</li>
    <li>Nonrepresentative Training Data: In order to generalise well, it is crucial that your training data be representative of the new cases you want to generalise to. This is often harder than it sounds: if the sample is too small, you will have <em>sampling noise</em> (nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called <em>sampling bias</em>.</li>
    <li>Poor-Quality Data: If your data is full of errors, outliers, and noise (e.g. due to poor quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.</li>
    <li>Irrelevant Features: Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of an ML project is coming up with a good set of features to train on. This process, called <em>Feature Engineering</em>, involves the following steps:
    <ul>
        <li><em>Feature Selection</em>: Selecting the most useful features to train on among existing features.</li>
        <li><em>Feature Extraction</em>: Combining existing features to produce a more useful one. </li>
        <li>Creating new features by gathering new data.</li>
    </ul>
    </li>
    <li>Overfitting the Training Data: Overfitting is when the ML model performs well on the training data but does not generalise well. You can constrain the model to make it simpler and reduce the risk of overfitting. This is known as <em>regularisation</em>. The amount of regularisation to apply during learning can be controlled by a <em>hyperparameter</em>, which is a parameter of the learning algorithm, not of the model: it is set before training and stays constant during training.</li>
    <li>Underfitting the Training Data: This occurs when your model is too simple to learn the underlying structure of the data. </li>
</ul>
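<p>To illustrate how a regularisation hyperparameter constrains a model, here is a small sketch using ridge-style regularisation on a one-parameter linear model. This is only one form of regularisation, and the data and the values of <code>alpha</code> are assumptions for illustration. Larger values of <code>alpha</code> pull the learned slope towards zero.</p>

```python
def fit_ridge_slope(xs, ys, alpha):
    """Slope w minimising sum((w * x - y) ** 2) + alpha * w ** 2 (no intercept)."""
    # Setting the derivative to zero gives w = sum(x * y) / (sum(x * x) + alpha).
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

# alpha is the hyperparameter: it is chosen before training, not learned.
for alpha in (0.0, 1.0, 10.0):
    print(alpha, round(fit_ridge_slope(xs, ys, alpha), 3))
```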


<h2>1.5 Testing and Validating</h2>
<p>The only way to know how well a model will generalise to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. A better option is to split your data into two sets: the <em>training set</em> and the <em>test set</em>. You train the model on the training set and evaluate it on the test set. The error rate on new cases is called the <em>generalisation error</em>, and the error measured on the test set gives an estimate of it.</p>

<p>If the training error is low, but the generalisation error is high, it means that your model is overfitting the training data.</p>
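<p>A rough sketch of this check follows. It splits a toy dataset into a training set and a test set, then compares the two error rates for a deliberately extreme model that only memorises its training examples. The dataset, the 20% test ratio and the memorising model are all assumptions chosen to make the gap obvious.</p>

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into a training set and a test set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(model, examples):
    """Fraction of (x, label) examples that the model gets wrong."""
    return sum(1 for x, label in examples if model(x) != label) / len(examples)

data = [(x, "even" if x % 2 == 0 else "odd") for x in range(100)]
train_set, test_set = train_test_split(data)

# A model that memorises the training set and guesses "even" for anything else.
memory = dict(train_set)

def memorising_model(x):
    return memory.get(x, "even")

train_error = error_rate(memorising_model, train_set)
test_error = error_rate(memorising_model, test_set)
# Zero training error with a much higher test error is the signature of overfitting.
print(train_error, test_error)
```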

<h3>1.5.1 Hyperparameter Tuning and Model Selection</h3>
<p>Evaluating a model is simple enough: just use a test set. But suppose you are hesitating between two types of models. How do you decide between them? One option is to train both and compare how well they generalise using the test set.</p>

<p>Now, suppose one of the models generalises better, and you want to apply some regularisation to avoid overfitting. How do you choose the value of the regularisation hyperparameter? One option is to train 100 different models using 100 different values for the hyperparameter. Suppose you find the hyperparameter value that produces the model with the lowest generalisation error, say just 5%. You launch this model into production, but it produces a 15% error. What happened?</p>

<p>The problem is that you measured the generalisation error multiple times on the test set, and adapted the model and hyperparameters to produce the best model for that particular set. A common solution to this problem is called <em>holdout validation</em>: you simply hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the <em>validation set</em>. More specifically, you train multiple models with various hyperparameters on the reduced training set (the full training set minus the validation set), and select the model that performs best on the validation set. After this, you train the best model on the full training set (including the validation set) to get your final model. Last, you evaluate this final model on the test set to get an estimate of the generalisation error.</p>
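<p>The holdout procedure can be sketched as follows. This is a minimal illustration, assuming a one-parameter ridge-style model, synthetic noisy data, and a hypothetical list of candidate values for the regularisation hyperparameter <code>alpha</code>; the split sizes are arbitrary.</p>

```python
import random

def fit_slope(examples, alpha):
    """Slope w minimising sum((w * x - y) ** 2) + alpha * w ** 2."""
    sxy = sum(x * y for x, y in examples)
    sxx = sum(x * x for x, _ in examples)
    return sxy / (sxx + alpha)

def mse(w, examples):
    """Mean squared error of the model y = w * x on the given examples."""
    return sum((w * x - y) ** 2 for x, y in examples) / len(examples)

# Synthetic data: y = 2x plus Gaussian noise.
rng = random.Random(0)
data = [(i / 10, 2.0 * (i / 10) + rng.gauss(0, 1)) for i in range(100)]
rng.shuffle(data)

test_set, full_train = data[:20], data[20:]
validation_set, reduced_train = full_train[:20], full_train[20:]

# 1. Train one candidate per hyperparameter value on the reduced training set
#    and keep the value that does best on the validation set.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_alpha = min(
    candidates, key=lambda a: mse(fit_slope(reduced_train, a), validation_set)
)

# 2. Retrain with the chosen value on the full training set.
final_w = fit_slope(full_train, best_alpha)

# 3. Touch the test set once, at the very end.
test_mse = mse(final_w, test_set)
print(best_alpha, round(final_w, 3), round(test_mse, 3))
```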

<p>If the validation set is too small, model evaluations will be imprecise; if it is too large, the remaining training set will be much smaller than the full training set. One way to address this is to perform repeated <em>cross-validation</em>, using many small validation sets. Each model is evaluated once per validation set after it is trained on the rest of the data, and the evaluations are averaged to get a more accurate measure of its performance.
</p>
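<p>One common way to organise this is k-fold cross-validation, sketched below under the assumption that the data is split into consecutive folds. The helper names, the toy "model" (which simply predicts the mean of its training targets) and the data are hypothetical.</p>

```python
def k_fold_indices(n_examples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    fold_sizes = [
        n_examples // n_folds + (1 if n_examples % n_folds > i else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        held_out = set(validation)
        train = [i for i in range(n_examples) if i not in held_out]
        yield train, validation
        start += size

def cross_validate(examples, n_folds, fit, score):
    """Average validation score over all folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(examples), n_folds):
        model = fit([examples[i] for i in train_idx])
        scores.append(score(model, [examples[i] for i in val_idx]))
    return sum(scores) / len(scores)

def fit_mean(targets):
    """Toy model: always predict the mean of the training targets."""
    return sum(targets) / len(targets)

def mean_squared_error(prediction, targets):
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_validate(targets, 3, fit_mean, mean_squared_error))
```

<p>The drawback is that the training time is multiplied by the number of validation sets.</p>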

<h3>1.5.2 Data Mismatch</h3>
<p>In some cases, it's easy to get a large amount of data for training, but this data probably won't be perfectly representative of the data that will be used in production. One solution is to hold out some of the training data in yet another set called the <em>train-dev set</em>. After the model is trained (on the training set, not on the train-dev set), you can evaluate it on the train-dev set. If it performs well there, then the model is not overfitting the training set; if it then performs poorly on the validation set, the problem must be coming from the data mismatch. Conversely, if the model performs poorly on the train-dev set, then it must have overfit the training set.</p>
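<p>The train-dev diagnosis can be written as a small decision rule. The sketch below assumes that a clear gap between training error and train-dev error points to overfitting, and that a clear gap between train-dev error and validation error points to data mismatch. The function name and the <code>tolerance</code> threshold are hypothetical.</p>

```python
def diagnose(train_error, train_dev_error, validation_error, tolerance=0.02):
    """Rough diagnosis from error rates measured on three different sets."""
    # Train-dev data comes from the same source as the training data, so a
    # large gap here cannot be blamed on data mismatch.
    if train_dev_error - train_error > tolerance:
        return "overfitting the training set"
    # The model copes with unseen training-like data but not with
    # production-like data.
    if validation_error - train_dev_error > tolerance:
        return "data mismatch"
    return "no obvious problem"

print(diagnose(0.01, 0.10, 0.11))
print(diagnose(0.04, 0.05, 0.15))
print(diagnose(0.04, 0.05, 0.06))
```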