# The Fundamentals of Machine Learning

## Chapter 2 - End-to-End Machine Learning Project

In this chapter you will work through an example project end to end, pretending to be a recently hired data scientist at a real estate company. Here are the main steps you will go through:
<br>
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.
<br>

### **Working with Real Data**

When learning Machine Learning, it is best to work with real-world datasets rather than toy examples. Many open datasets are available across domains, including:
- Popular repositories:
  - [UC Irvine ML Repository](https://archive.ics.uci.edu/ml/index.php)
  - [Kaggle](https://www.kaggle.com/datasets)
  - [AWS datasets](https://registry.opendata.aws/)

- Meta portals:
  - [OpenDataMonitor](http://opendatamonitor.eu/)
  - [Quandl](https://www.quandl.com/)

- Other lists:
  - [Wikipedia ML datasets](https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)
  - [Quora threads](https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public)
  - [r/datasets subreddit](https://www.reddit.com/r/datasets/)

In this chapter, we use the California Housing Prices dataset, built from the 1990 census and available from the StatLib repository. Although slightly outdated, it is a good dataset for learning. For teaching purposes, one categorical attribute has been added and some features removed.

As shown in Figure 2-1, this dataset provides information on California housing prices.<br>

![Figure 2-1](./Fig/Chapter_2/Fig2-1.png)



### **Look at the Big Picture**

#### **Frame the Problem**

- Always begin by asking: what is the business objective?
- Predictions must fit the downstream system; e.g., predicted median house values help guide investment decisions.
- The framing (classification vs. regression, numeric vs. categorical output) influences algorithm choice, evaluation, and effort.

#### **Pipelines**

- ML systems usually operate inside data pipelines: chains of components that fetch data, run models, and feed outputs to applications.
- Pipelines increase modularity and robustness, but failures in upstream components (e.g., stale data) can quickly harm predictions.
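
To make the idea of chained components concrete, here is a minimal sketch of a pipeline as a sequence of plain Python functions. Everything in it is an assumption made for illustration: the function names, the district records, and the toy pricing rule (which merely stands in for a trained model) are hypothetical. Real pipelines often run their components asynchronously with a data store between them; this sketch runs them synchronously for simplicity.

```python
def fetch_districts():
    """Hypothetical upstream component: returns raw district records."""
    return [
        {"district": "A", "median_income": 8.3},
        {"district": "B", "median_income": 3.1},
    ]


def predict_prices(districts):
    """Hypothetical model component: a toy rule standing in for a trained model."""
    return [
        {**d, "predicted_price": 40_000 * d["median_income"]}
        for d in districts
    ]


def flag_investments(predictions, threshold=200_000):
    """Hypothetical downstream component: keeps districts priced at or above the threshold."""
    return [p["district"] for p in predictions if p["predicted_price"] >= threshold]


def run_pipeline(source, *stages):
    """Call the source, then pass its output through each stage in order."""
    data = source()
    for stage in stages:
        data = stage(data)
    return data


flagged = run_pipeline(fetch_districts, predict_prices, flag_investments)
print(flagged)  # ['A']
```

Because each stage only depends on the output of the previous one, a stage can be replaced or tested in isolation. The same property explains the risk noted above: if `fetch_districts` returned stale records, every later stage would silently work from bad input.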

#### **Choose the Learning Paradigm**

The housing example is:

- Supervised learning → labels are known (median house value).

- Univariate regression → predict a single numeric value.

- Batch learning is sufficient → no need for online updates in this case.

#### **Select Performance Measure**

For regression tasks, common metrics are:

- RMSE (Root Mean Squared Error): sensitive to large errors.

![Equation 2-1](./Fig/Chapter_2/Equation2-1.png)

- MAE (Mean Absolute Error): less sensitive to outliers.

![Equation 2-2](./Fig/Chapter_2/Equation2-2.png)

Pick the measure that best matches the business need.
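
The difference between the two measures is easier to see on numbers. The sketch below implements the standard definitions, RMSE as the square root of the mean squared error and MAE as the mean absolute error, using only the standard library. The price values are made up for illustration, with one deliberately large miss.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean of squared errors."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute errors."""
    absolute_errors = [abs(p - t) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical median house values; the last prediction is off by 100,000.
actual = [200_000, 150_000, 300_000, 250_000]
predicted = [210_000, 140_000, 310_000, 150_000]

print(f"RMSE: {rmse(actual, predicted):.0f}")  # RMSE: 50744
print(f"MAE:  {mae(actual, predicted):.0f}")   # MAE:  32500
```

Three of the four errors are only 10,000, yet the single large miss pulls the RMSE well above the MAE, because squaring gives large errors more weight.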

#### **Check the Assumptions**

It is good practice to list and verify all assumptions made so far, since catching problems early saves effort later.

Example: your system outputs district prices that feed into a downstream ML system, and you assume those prices will be used as they are. But if the downstream system converts prices into categories (e.g., “cheap,” “medium,” “expensive”), then predicting exact prices is unnecessary: the problem should be framed as classification, not regression.

After discussion with the downstream team, you confirm they indeed need actual prices, not categories. ✅ Assumptions are validated, so you can confidently move forward to data collection and modeling.
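
For contrast, here is a sketch of what the category scenario could have looked like if the downstream system had binned prices. The thresholds and the `to_category` helper are hypothetical, chosen only to illustrate the point.

```python
from bisect import bisect_right

# Hypothetical thresholds: below 150,000 is "cheap", 300,000 and above is "expensive".
CATEGORY_BOUNDS = [150_000, 300_000]
CATEGORY_LABELS = ["cheap", "medium", "expensive"]


def to_category(price):
    """Map a price to its category by locating it among the sorted bounds."""
    return CATEGORY_LABELS[bisect_right(CATEGORY_BOUNDS, price)]


predicted_prices = [120_000, 150_000, 299_999, 450_000]
categories = [to_category(p) for p in predicted_prices]
print(categories)  # ['cheap', 'medium', 'medium', 'expensive']
```

Under such a conversion, any two predictions in the same bin are indistinguishable downstream (150,000 and 299,999 both become "medium"), which is why a classifier would have been the more direct framing in that case.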

### **Get the Data**

It’s time to get your hands dirty. Don’t hesitate to pick up your laptop and walk through the following code examples in a Jupyter notebook. The full Jupyter notebook is available in the [handson-ml2 GitHub repository](https://github.com/ageron/handson-ml2).