# The Fundamentals of Machine Learning

## Chapter 1 - The Machine Learning Landscape

### **What Is Machine Learning?**

Machine Learning is the science (and art) of programming computers so they can
learn from data.

Here is a slightly more general definition:


> [Machine Learning is the] field of study that gives computers the ability to learn
without being explicitly programmed.<br>&nbsp;&nbsp;
 —Arthur Samuel, 1959

And a more engineering-oriented one:

> A computer program is said to learn from experience E with respect to some task T
and some performance measure P, if its performance on T, as measured by P,
improves with experience E.<br>&nbsp;&nbsp; —Tom Mitchell, 1997

A spam filter is a Machine Learning program that learns from examples of spam and ham emails.

The dataset used for learning is the training set, and each example is a training instance.

In this case:

- Task (T): flag spam for new emails.

- Experience (E): the training data.

- Performance (P): ratio of correctly classified emails (accuracy).

Simply downloading a copy of Wikipedia gives your computer much more data, but it is not Machine Learning: no task T or performance measure P is defined, so the computer does not get better at anything.
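The performance measure P for the spam filter can be computed directly from the filter's predictions. The following is a minimal sketch; the label encoding (1 for spam, 0 for ham) and the example lists are assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Performance measure P: the ratio of correctly classified emails."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # what the emails really are
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]   # what the filter flagged
print(accuracy(y_true, y_pred))     # 6 of 8 correct -> 0.75
```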

### **Why Use Machine Learning?**

Consider how you would write a spam filter using traditional programming techniques (Figure 1-1):

1. Identify common spam features, such as frequent words (“4U,” “credit card,” “free,” “amazing”) and patterns in the subject, sender, or body.

2. Create detection algorithms for these patterns and flag emails as spam if enough are present.

3. Test and refine the program until performance is satisfactory.

![Figure 1-1](./Fig/Chapter_1/Fig1-1.png)


Rule-based spam filters become long, complex, and hard to maintain. In contrast, a Machine Learning–based filter (Figure 1-2) automatically learns frequent word patterns in spam vs. ham, making it shorter, easier to maintain, and more accurate. When spammers adapt (e.g., changing “4U” to “For U”), rule-based filters require constant updates, while an ML filter (Figure 1-3) adapts automatically by detecting new frequent patterns in flagged spam.

![Figure 1-2](./Fig/Chapter_1/Fig1-2.png)

![Figure 1-3](./Fig/Chapter_1/Fig1-3.png)

Machine Learning also excels at problems that are too complex for traditional approaches or that have no known algorithm, such as speech recognition. Simple hardcoded rules (e.g., detecting the high-pitch “T” sound in “two”) cannot scale to many words, speakers, noisy environments, and languages. Instead, ML algorithms learn from large sets of recordings. ML can also help humans learn (Figure 1-4): trained models can be inspected to reveal key predictors (e.g., words in spam filters), sometimes uncovering unexpected correlations or trends. Using ML to explore large datasets and reveal hidden patterns is known as data mining.

![Figure 1-4](./Fig/Chapter_1/Fig1-4.png)

To summarize, Machine Learning is well-suited for:

- Problems where traditional solutions need heavy fine-tuning or long rule lists.

- Complex problems with no good traditional solution.

- Fluctuating environments where systems must adapt to new data.

- Gaining insights from complex problems and large datasets.

### **Examples of Applications**

Let’s look at some concrete examples of Machine Learning tasks, along with the techniques that can tackle them:

***Analyzing images of products on a production line to automatically classify them***
<br>
&nbsp;&nbsp; This is image classification, typically performed using convolutional neural networks (CNNs; see Chapter 14).

***Detecting tumors in brain scans***
<br>
&nbsp;&nbsp; This is semantic segmentation, where each pixel in the image is classified (as we
want to determine the exact location and shape of tumors), typically using CNNs
as well.

***Automatically classifying news articles***
<br>
&nbsp;&nbsp; This is natural language processing (NLP), and more specifically text classification, which can be tackled using recurrent neural networks (RNNs), CNNs, or
Transformers (see Chapter 16).

***Automatically flagging offensive comments on discussion forums***
<br>
&nbsp;&nbsp; This is also text classification, using the same NLP tools.

***Summarizing long documents automatically***
<br>
&nbsp;&nbsp; This is a branch of NLP called text summarization, again using the same tools.

***Creating a chatbot or a personal assistant***
<br>
&nbsp; &nbsp; This involves many NLP components, including natural language understanding
(NLU) and question-answering modules.

***Building an intelligent bot for a game***
<br>
<p align = "justify">
&nbsp;&nbsp;This is often tackled using Reinforcement Learning (RL; see Chapter 18), which
is a branch of Machine Learning that trains agents (such as bots) to pick the
actions that will maximize their rewards over time (e.g., a bot may get a reward
every time the player loses some life points), within a given environment (such as
the game). The famous AlphaGo program that beat the world champion at the
game of Go was built using RL.
</p>


### **Supervised Learning / Unsupervised Learning**

Machine Learning systems can be classified by the type of supervision in training: supervised, unsupervised, semisupervised, and reinforcement learning.


#### **Supervised Learning**

In supervised learning, the training set includes the desired solutions, called labels (Figure 1-5).<br>
<br>

![Figure 1-5](./Fig/Fig1-5.png)

<br>

Common tasks:

- Classification (e.g., spam filter trained with labeled emails).
- Regression (predicting numeric values such as car prices from features like mileage, age, brand; Figure 1-6).

<br>

![Figure 1-6](./Fig/Fig1-6.png)

<br>

Key notes:
- Attribute = data type (e.g., mileage).
- Feature = attribute + value (e.g., mileage = 15,000).
- Some regression algorithms can be used for classification, and vice versa (e.g., Logistic Regression is commonly used for classification because it outputs the probability of belonging to a given class).

<br>    

Important supervised learning algorithms:
1. k-Nearest Neighbors
2. Linear Regression
3. Logistic Regression
4. Support Vector Machines (SVMs)
5. Decision Trees & Random Forests
6. Neural Networks

#### **Unsupervised Learning**

In unsupervised learning, the training data is unlabeled (Figure 1-7). The system learns without a teacher.
<br>

![Figure 1-7](./Fig/Chapter_1/Fig1-7.png)

<br>
Key unsupervised learning tasks and algorithms:

- **Clustering**: K-Means, DBSCAN, Hierarchical Cluster Analysis (HCA).
- **Anomaly & novelty detection**: One-class SVM, Isolation Forest.
- **Visualization & dimensionality reduction**: PCA, Kernel PCA, LLE, t-SNE.
- **Association rule learning**: Apriori, Eclat.

**Example – Clustering** :<br>

A clustering algorithm can group blog visitors without labels, e.g., 40% are comic book fans who read in the evening, while 20% are sci-fi fans active on weekends. Hierarchical clustering can subdivide groups further (Figure 1-8).

<br>

![Figure 1-8](./Fig/Chapter_1/Fig1-8.png)
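As a rough illustration of clustering, here is a minimal k-means sketch in plain Python. The visitor data (visit hour, share of comic pages read) is made up, and the centroids are initialized from the first `k` points, which is a simplification; practical implementations usually pick random or smarter starting points.

```python
def kmeans(points, k, n_iter=10):
    """Plain k-means on a list of (x, y) tuples; returns (centroids, labels)."""
    centroids = list(points[:k])  # simplistic init: first k points (an assumption)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                        + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical visitors: (hour of visit, share of comic-book pages read).
visitors = [(20, 0.9), (9, 0.1), (21, 0.8), (22, 0.95), (10, 0.2), (8, 0.15)]
centroids, labels = kmeans(visitors, k=2)
print(labels)
print(centroids)
```

No labels are given to the algorithm; the two groups (evening comic readers and morning visitors) emerge from the distances alone. With features on very different scales, as here, you would normally rescale them first.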

<br>

**Example – Visualization** :<br>

Visualization algorithms (e.g., t-SNE) project high-dimensional unlabeled data into 2D/3D while preserving structure, revealing semantic clusters (Figure 1-9).

<br>

![Figure 1-9](./Fig/Chapter_1/Fig1-9.png)

<br>

**Example – Dimensionality reduction** :<br>

Dimensionality reduction simplifies the data without losing too much information, for example by merging correlated features (e.g., mileage + age → wear and tear). This reduces computation time and storage, and can sometimes improve performance.

**Example – Anomaly & novelty detection** :<br>

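Anomaly detection flags unusual patterns such as fraud or manufacturing defects. The system is shown mostly normal instances during training, so it learns to recognize them; when it sees a new instance, it can tell whether it looks normal or is likely an anomaly (Figure 1-10). Novelty detection is closely related: it aims to detect new instances that look different from every instance in the training set, which requires a training set free of the kind of instance you want to detect.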

<br>

![Figure 1-10](./Fig/Chapter_1/Fig1-10.png)
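The idea of learning what is normal and flagging what deviates from it can be sketched with a simple statistical rule. Note that this is a stand-in technique (a mean and standard deviation threshold), not One-class SVM or Isolation Forest; the amounts and the three-sigma cutoff are assumptions.

```python
import statistics


def fit_normal_profile(values):
    """Learn what 'normal' looks like: mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def is_anomaly(value, profile, n_sigmas=3.0):
    """Flag a value lying more than `n_sigmas` standard deviations from the mean."""
    mean, std = profile
    return abs(value - mean) > n_sigmas * std


# Hypothetical transaction amounts considered normal.
normal_amounts = [20.0, 22.5, 19.0, 21.0, 23.0, 18.5, 20.5, 21.5]
profile = fit_normal_profile(normal_amounts)

print(is_anomaly(21.0, profile))    # a typical amount
print(is_anomaly(250.0, profile))   # far from anything seen in training
```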

<br>

**Example – Association rule learning**:<br>

Discovers relations in data, e.g., supermarket sales logs may show that people who buy barbecue sauce and potato chips also often buy steak.
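Association rules are usually scored with two measures, support and confidence. The sketch below computes both for the barbecue example on a made-up sales log; it only evaluates one candidate rule and does not implement the search performed by Apriori or Eclat.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transaction log."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical supermarket baskets.
baskets = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "beer"},
    {"barbecue sauce", "potato chips"},
    {"milk", "bread"},
    {"steak", "bread"},
]
rule_from = {"barbecue sauce", "potato chips"}

print(support(baskets, rule_from))                # 3 of 5 baskets -> 0.6
print(confidence(baskets, rule_from, {"steak"}))  # 2 of those 3 also contain steak
```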

#### **Semisupervised Learning**

Since labeling data is costly, we often have many unlabeled and few labeled instances. Algorithms that can use both are called **semisupervised** learning algorithms (Figure 1-11).
<br>

![Figure 1-11](./Fig/Chapter_1/Fig1-11.png)

Example :<br>
Photo-hosting services (e.g., Google Photos) cluster faces (unsupervised), then need only one label per person to name them across all photos.

How it works :<br>
Most semisupervised approaches combine unsupervised and supervised methods. For instance, deep belief networks (DBNs) use unsupervised restricted Boltzmann machines (RBMs) stacked together, then fine-tuned with supervised learning.

#### **Reinforcement Learning**

In **Reinforcement Learning (RL)**, the agent observes the environment, performs actions, and receives rewards or penalties (Figure 1-12). The goal is to learn the best policy, that is, a strategy mapping situations to actions that maximizes rewards over time.
<br>

![Figure 1-12](./Fig/Chapter_1/Fig1-12.png)

Examples : <br>

- Robots use RL to learn how to walk.
- AlphaGo (DeepMind) learned its winning policy by analyzing millions of games and playing against itself, later beating world champion Ke Jie in 2017.

<br>


### **Batch and Online Learning**

#### **Batch Learning**

In batch (offline) learning, the system is trained once on all available data, then deployed without further learning. To adapt to new data (e.g., new spam), a new version of the system must be trained from scratch on the full dataset (old and new data) and then replace the old version.

Although training–evaluation–deployment can be automated (Figure 1-3), retraining is costly in time and resources. Typically, updates are done daily or weekly, which is unsuitable for rapidly changing data (e.g., stock prices).

Batch learning also demands heavy resources (CPU, memory, disk, network). For massive datasets, frequent retraining can be too expensive or even infeasible. Moreover, systems with limited resources (e.g., smartphones, Mars rovers) cannot carry large datasets or train for hours.

In such cases, incremental learning algorithms are a better option.

#### **Online Learning**

In online learning, the system trains incrementally by receiving data sequentially, either one instance at a time or in small mini-batches. Each step is fast and cheap, allowing the model to adapt continuously as new data arrives.
<br>

![Figure 1-13](./Fig/Chapter_1/Fig1-13.png)


This approach is ideal for continuous data streams (e.g., stock prices) or when resources are limited. Once new data is learned, it can be discarded, saving storage.

Online learning also enables out-of-core learning, useful for huge datasets that don’t fit into memory. The algorithm processes data in chunks until the entire dataset has been covered.

<br>

![Figure 1-14](./Fig/Chapter_1/Fig1-14.png)
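A minimal sketch of out-of-core training: a linear model is updated one small chunk at a time, and each chunk is discarded after use. The `stream_chunks` generator is a hypothetical stand-in for reading pieces of a large file, and the data follows y = 2x + 1 exactly so the result is easy to check.

```python
def stream_chunks(n_chunks, chunk_size):
    """Hypothetical data source: yields small chunks of (x, y) pairs, y = 2x + 1."""
    for c in range(n_chunks):
        xs = [((c * chunk_size + i) % 10) / 10 for i in range(chunk_size)]
        yield [(x, 2 * x + 1) for x in xs]


def sgd_step(theta0, theta1, chunk, learning_rate):
    """One gradient step on the mean squared error of a single chunk."""
    n = len(chunk)
    grad0 = sum(2 * (theta0 + theta1 * x - y) for x, y in chunk) / n
    grad1 = sum(2 * (theta0 + theta1 * x - y) * x for x, y in chunk) / n
    return theta0 - learning_rate * grad0, theta1 - learning_rate * grad1


theta0, theta1 = 0.0, 0.0
for chunk in stream_chunks(n_chunks=2000, chunk_size=7):
    theta0, theta1 = sgd_step(theta0, theta1, chunk, learning_rate=0.1)
    # The chunk can be discarded now; only the two parameters are kept.

print(round(theta0, 2), round(theta1, 2))  # should end up close to 1 and 2
```

Only one chunk is ever held in memory, which is the point of the technique. Libraries such as Scikit-Learn expose the same pattern through incremental training methods on some estimators.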

A key parameter is the learning rate:

- High rate → adapts quickly but forgets old data (e.g., a spam filter only catching recent spam).
- Low rate → adapts slowly, more robust to noise and outliers.
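The trade-off can be seen with a toy running estimate, where each new value pulls the estimate toward itself by a fraction equal to the learning rate. This is an illustrative stand-in for a real online learner, and the stream below is made up: a long stable period followed by a single outlier.

```python
def online_estimate(stream, learning_rate):
    """Track a running estimate; each new value pulls it by `learning_rate`."""
    estimate = stream[0]
    for value in stream[1:]:
        estimate += learning_rate * (value - estimate)
    return estimate


# Hypothetical stream: fifty readings of 10, then one outlier of 100.
stream = [10.0] * 50 + [100.0]

high = online_estimate(stream, learning_rate=0.9)
low = online_estimate(stream, learning_rate=0.05)
print(high)  # jumps most of the way to the outlier
print(low)   # barely moves
```

With the high rate, a single bad reading nearly overwrites fifty good ones; with the low rate, the estimate is stable but would also be slow to follow a genuine change.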

The main risk is bad data, which can degrade performance over time (e.g., faulty sensors, spam attacks). To mitigate this, systems should be monitored closely, with mechanisms to pause learning, roll back to a stable state, or detect anomalies in the input.

### **Instance-Based Versus Model-Based Learning**

#### **Instance-Based Learning**

One simple approach to generalization is instance-based learning—the system memorizes training examples and classifies new ones by comparing them to known instances using a similarity measure.

For example, a spam filter could flag emails not only identical to known spam but also those with many words in common with spam messages. The new instance is then classified according to the most similar stored examples.
<br>

![Figure 1-15](./Fig/Chapter_1/Fig1-15.png)
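Here is a small sketch of instance-based classification for the spam example. The similarity measure (shared words divided by all distinct words, known as Jaccard similarity) and the value `k=3` are assumptions chosen for illustration; the stored emails are made up.

```python
def word_similarity(email_a, email_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = set(email_a.lower().split()), set(email_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def classify(email, labeled_emails, k=3):
    """Label a new email by majority vote among its k most similar stored emails."""
    nearest = sorted(labeled_emails,
                     key=lambda pair: word_similarity(email, pair[0]),
                     reverse=True)[:k]
    spam_votes = sum(1 for _, label in nearest if label == "spam")
    return "spam" if spam_votes > k / 2 else "ham"


# The "model" is simply the memorized training instances.
memory = [
    ("free credit card offer", "spam"),
    ("amazing free offer today", "spam"),
    ("free amazing credit", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday", "ham"),
]

print(classify("amazing credit card offer", memory))
print(classify("notes for lunch on monday", memory))
```

Nothing is fitted in advance: all the work happens at prediction time, when the new email is compared with every stored example.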

#### **Model-Based Learning**

Another approach to generalization is model-based learning, where the system builds a model from examples and uses it to make predictions.
<br>

![Figure 1-16](./Fig/Chapter_1/Fig1-16.png)

Example: Suppose you want to know if money makes people happy. You combine GDP per capita data (IMF) with life satisfaction data (OECD).
<br>

![Table 1-1](./Fig/Chapter_1/Table1-1.png)

When plotted, the data shows a roughly linear trend: higher GDP per capita often corresponds to greater life satisfaction.
<br>

![Figure 1-17](./Fig/Chapter_1/Fig1-17.png)

You can model this relationship using a linear function:

> $life\_satisfaction = \theta_{0} + \theta_{1} \times GDP\_per\_capita$

where $\theta_{0}$ and $\theta_{1}$ are parameters. Adjusting these parameters lets you fit different linear models.
<br>

![Figure 1-18](./Fig/Chapter_1/Fig1-18.png)

To find the best parameters, you define a performance measure (e.g., a cost function measuring prediction error).  
A Linear Regression algorithm then trains the model by minimizing this error.  
For this example, the best fit is:

> $\theta_{0} = 4.85,\ \theta_{1} = 4.91 \times 10^{-5}$

![Figure 1-19](./Fig/Chapter_1/Fig1-19.png)
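To make the idea of minimizing a cost function concrete, here is a sketch of the closed-form least-squares fit for a one-feature linear model. The data points are made up for illustration and are not the IMF/OECD figures, so the resulting parameters differ slightly from the values quoted above.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = theta0 + theta1 * x (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def mse(theta0, theta1, xs, ys):
    """Cost function: mean squared prediction error."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up (GDP per capita, life satisfaction) pairs, NOT the IMF/OECD data.
gdp = [10_000, 20_000, 30_000, 40_000, 50_000]
satisfaction = [5.3, 5.9, 6.2, 6.9, 7.2]

theta0, theta1 = fit_line(gdp, satisfaction)
print(theta0, theta1)
print(mse(theta0, theta1, gdp, satisfaction))
```

Any other pair of parameters gives a higher `mse` on this data, which is what "best fit" means here.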

Finally, you can use the trained model to predict unseen cases.  
For instance, given Cyprus’s GDP per capita ($22,587), the model predicts:

> $4.85 + (22,587 \times 4.91 \times 10^{-5}) \approx 5.96$


Example 1-1. Training and running a linear model using Scikit-Learn

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
# NOTE: prepare_country_stats() is not defined in this section; it is assumed
# to merge the two tables into a single DataFrame with one row per country.
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]
```

If the model predicts poorly, you may need more attributes, better data, or a more powerful model (e.g., Polynomial Regression).

In summary:

- Study the data
- Select a model
- Train it (minimize cost)
- Apply it to make predictions (inference)

This outlines a typical ML project workflow. Next, we’ll explore what can go wrong and hinder accurate predictions.

###  **Insufficient Quantity of Training Data**


A toddler can learn what an apple is after seeing it just a few times, but Machine Learning still requires large amounts of data: thousands of examples for simple tasks and millions for complex ones like image or speech recognition.

### **The Unreasonable Effectiveness of Data**

In 2001, Banko and Brill showed that different Machine Learning algorithms perform similarly well on natural language tasks when given enough data (see Figure 1-20).
<br>

![Figure 1-20](./Fig/Chapter_1/Fig1-20.png)

<br>
Their findings suggested that data collection may deserve priority over algorithm development. Peter Norvig and his coauthors later popularized this idea in their 2009 paper “The Unreasonable Effectiveness of Data.” However, since small- and medium-sized datasets are still common, algorithms remain important.

### **Nonrepresentative Training Data**

To generalize well, training data must be representative of the cases you want to predict. For example, when missing countries were added to the earlier dataset (see Figure 1-21), the resulting model changed significantly, revealing that a simple linear model could not capture the relationship between wealth and happiness. Using nonrepresentative data leads to inaccurate predictions, especially for very poor or very rich countries. 
<br>

![Figure 1-21](./Fig/Chapter_1/Fig1-21.png)

<br>
Ensuring representativeness is challenging, as small samples suffer from noise and large samples can still be biased if collected improperly.

### **Examples of Sampling Bias**

<p style="text-align: justify;">

Perhaps the most famous example of sampling bias happened during the US presidential election in 1936, which pitted Landon against Roosevelt: the Literary Digest conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes. The flaw was in the Literary Digest’s sampling method:

- First, to obtain the addresses to send the polls to, the Literary Digest used telephone directories, lists of magazine subscribers, club membership lists, and the
 like. All of these lists tended to favor wealthier people, who were more likely to
 vote Republican (hence Landon).

- Second, less than 25% of the people who were polled answered. Again this introduced a sampling bias, by potentially ruling out people who didn’t care much about politics, people who didn’t like the Literary Digest, and other key groups. This is a special type of sampling bias called nonresponse bias. 

Here is another example: say you want to build a system to recognize funk music videos. One way to build your training set is to search for "funk music" on YouTube and use the resulting videos. But this assumes that YouTube's search engine returns a set of videos that are representative of all the funk music videos on YouTube. In reality, the search results are likely to be biased toward popular artists (and if you live in Brazil you will get a lot of "funk carioca" videos, which sound nothing like James Brown). On the other hand, how else can you get a large training set?

</p>

### **Poor Quality Data**

Poor-quality data with errors, outliers, or noise reduces model performance, which makes data cleaning essential; it is a major part of a data scientist’s work. Common approaches include:

- Outliers: discard or correct them manually.
- Missing features: ignore the attribute, drop affected instances, fill values (e.g., median age), or train models with and without the feature.
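One of the options above, filling missing values with the median, can be sketched in a few lines. Missing entries are represented as `None` here, which is an assumption about how the data is stored; the ages are made up.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]


ages = [25, None, 41, 33, None, 58, 29]
print(fill_missing_with_median(ages))  # known median is 33
```

The median should be computed on the training set only and then reused for new data, so that evaluation data does not influence the fill value.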

### **Irrelevant Features**

A system learns effectively only if training data has enough relevant features and avoids irrelevant ones. Success often depends on feature engineering, which includes:

- Feature selection: choosing the most useful existing features.
- Feature extraction: combining features to create better ones (e.g., with dimensionality reduction).
- Creating new features: gathering additional data.

### **Overfitting the Training Data**

Overgeneralization is common in humans, and Machine Learning models can fall into the same trap—this is called overfitting. A model may perform very well on training data but fail to generalize to new cases. For example, a high-degree polynomial life satisfaction model strongly overfits the training data, fitting noise and irrelevant details rather than real patterns (see Figure 1-22). 
<br>

![Figure 1-22](./Fig/Chapter_1/Fig1-22.png)

Complex models such as deep neural networks are especially vulnerable if the training set is noisy or too small, sometimes detecting spurious correlations—for instance, countries with a “w” in their name appearing happier, a coincidence unlikely to generalize.

Overfitting occurs when a model is too complex compared to the amount and quality of data. Common solutions include:

- **Simplify the model**: reduce parameters, remove attributes, or apply constraints.
- **Gather more training data**.
- **Reduce data noise**: correct errors and eliminate outliers.

Constraining complexity is called regularization. It reduces degrees of freedom, forcing the model to remain simpler and improving its ability to generalize. Figure 1-23 shows how regularization creates a flatter slope: the fit on training data is less precise, but performance on unseen data improves. 
<br>

![Figure 1-23](./Fig/Chapter_1/Fig1-23.png)

The amount of regularization is controlled by a hyperparameter, which is set before training and determines the balance between underfitting and overfitting.
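The flattening effect of regularization can be reproduced with a one-feature linear model. The sketch below assumes one specific form of regularization, a squared penalty on the slope (often called ridge or L2), with `alpha` as the hyperparameter; the source does not say which form Figure 1-23 uses, and the data is made up.

```python
def fit_ridge_line(xs, ys, alpha):
    """Fit y = theta0 + theta1 * x, penalizing theta1 ** 2 with weight alpha.

    With alpha = 0 this is ordinary least squares; larger alpha shrinks the slope.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / (sxx + alpha)
    theta0 = mean_y - theta1 * mean_x   # the intercept is not penalized
    return theta0, theta1


# Made-up data: x in tens of thousands of dollars, y a satisfaction score.
xs = [1, 2, 3, 4, 5]
ys = [5.3, 5.9, 6.2, 6.9, 7.2]

for alpha in (0, 10, 100):
    theta0, theta1 = fit_ridge_line(xs, ys, alpha)
    print(alpha, round(theta0, 3), round(theta1, 3))
```

As `alpha` grows, the slope shrinks toward zero and the line approaches a constant prediction at the mean, which is the underfitting end of the trade-off.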

### **Underfitting the Training Data**

Underfitting is the opposite of overfitting: it happens when a model is too simple to capture the true structure of the data. For instance, a linear model of life satisfaction underfits because reality is more complex, leading to inaccurate predictions even on training examples.

Possible solutions include:

- Use a more powerful model with additional parameters.
- Improve features through feature engineering.
- Relax model constraints, such as reducing the regularization hyperparameter.

### **Testing and Validating**


The only way to know how well a model generalizes is to test it on new cases. Deploying it in production can work, but if the model performs poorly, users will notice, which makes this an undesirable approach. A safer method is to split the data into a training set and a test set. The model is trained on the former and evaluated on the latter to estimate the generalization error (also called out-of-sample error), which reflects performance on unseen data.

If training error is low but generalization error is high, this indicates the model is overfitting the training data.
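A train/test split can be sketched as a shuffle followed by a cut. The 20% test share and the fixed seed are assumptions chosen for illustration; the list of integers stands in for real labeled instances.

```python
import random


def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data, then hold out `test_ratio` of it for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # stand-in for 100 labeled instances
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))  # 80 20
```

The test set must stay untouched during training; comparing the error on `train_set` with the error on `test_set` is what reveals overfitting.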

### **Hyperparameter Tuning and Model Selection**

Evaluating a model with a test set is straightforward, but choosing between models or tuning hyperparameters is more complex. For example, comparing a linear and polynomial model can be done with a test set, but repeatedly testing different hyperparameter values (e.g., regularization strength) risks overfitting to the test set itself. This explains why a model that looked good in testing may perform worse in production.

The common solution is holdout validation: set aside part of the training set as a validation set (also called dev set). Candidate models with various hyperparameters are trained on the reduced training data and evaluated on the validation set. The best-performing model is then retrained on the full training set, and finally evaluated once on the test set to estimate the generalization error.

This approach works well, but the size of the validation set matters. If it is too small, evaluations become noisy and may lead to poor model selection. If it is too large, the remaining training set is much smaller than the full training set, so candidates are compared under conditions that differ from how the final model is trained. A common remedy is repeated cross-validation with many small validation sets: each model is evaluated once per validation set after being trained on the rest of the data, and the evaluations are averaged, giving a more reliable performance estimate. The drawback is that training time is multiplied by the number of validation sets.
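Cross-validation can be sketched as a loop over folds. The version below assumes contiguous folds on data that has already been shuffled; `fit` and `error` are placeholder callables, and the mean-predicting model is a toy used only to keep the example short.

```python
def k_fold_indices(n_instances, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, fit, error, k=5):
    """Average validation error of a model over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        scores.append(error(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / len(scores)


# Toy candidate model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def mean_squared_error(model, val_x, val_y):
    return sum((model - y) ** 2 for y in val_y) / len(val_y)


xs = list(range(10))
ys = [2.0] * 10
print(k_fold_indices(10, 3))
print(cross_validate(xs, ys, fit_mean, mean_squared_error, k=5))  # constant targets -> 0.0
```

Each candidate model is trained `k` times, which is where the extra training cost comes from.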

### **Data Mismatch**

Sometimes large training datasets are available but are not representative of production data. For example, millions of flower images from the web may differ significantly from photos actually taken with a mobile app. If only 10,000 representative pictures exist, it is crucial that both the validation set and test set consist exclusively of these representative images. Split them between the two sets, taking care that no duplicates or near-duplicates end up in both.

If a model trained on web images performs poorly on the validation set, it may be unclear whether this is due to overfitting or a data mismatch. A solution proposed by Andrew Ng is to create a train-dev set from the web data. After training on the remaining web images, evaluate on the train-dev set:

- Good performance on train-dev but poor on validation → the issue is data mismatch.
- Poor performance on train-dev → the model has overfit the training data.

In case of mismatch, preprocessing the web images to resemble mobile app photos can help. If overfitting is the issue, possible remedies include simplifying or regularizing the model, gathering more data, or cleaning the training set.
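The train-dev reasoning amounts to comparing three error rates. The helper below is a hypothetical triage function; the `gap` threshold of 0.05 is arbitrary and the error values are made up.

```python
def diagnose(train_error, train_dev_error, validation_error, gap=0.05):
    """Rough triage following the train-dev reasoning; `gap` is an arbitrary threshold."""
    if train_dev_error - train_error > gap:
        # Worse on held-out data from the SAME distribution as training.
        return "overfitting"
    if validation_error - train_dev_error > gap:
        # Fine on held-out web data, worse on representative data.
        return "data mismatch"
    return "no obvious problem"


print(diagnose(train_error=0.02, train_dev_error=0.15, validation_error=0.17))
print(diagnose(train_error=0.02, train_dev_error=0.03, validation_error=0.20))
```

The first call reports overfitting and the second reports a data mismatch, matching the two cases listed above.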

### **No Free Lunch Theorem**

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. To decide what data to discard and what data to keep, you must make assumptions. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored.

In a famous 1996 paper, David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.