# Introduction to Prediction (Supervised Learning): Classification and Regression

## What is prediction and supervised learning?

**Prediction** is the process of estimating the value of a variable based on data. Examples include estimating the sales price of a home from its physical characteristics, determining whether an email is spam based on the content and metadata (e.g. the sender's address) of the message, and determining whether a photograph contains a cat.

Here, we're using the term "prediction" a bit differently than it is often used colloquially, where it usually means a **forecast**: an estimate of some future state. It is a common misconception that a prediction must estimate something that will occur in the future. A prediction does not have to be of a future state; in fact, none of the examples above predicts a future state.

Predictions are core to **supervised learning**, the subfield of machine learning concerned with learning from examples to make predictions on unseen data. Let's start with an example to make this clearer.

Suppose you wanted to predict the value of a house based on characteristics that describe it. This is a prediction problem. In this case, our output variable, $y$, represents the value of the house that we're trying to predict. We want to predict it from our input data, $\mathbf{x}=[x_1, x_2, x_3]$, which may represent the number of bedrooms ($x_1$), the age of the home ($x_2$), and the presence of a swimming pool ($x_3$). We typically call these inputs **features**. We then use a prediction algorithm, $f$, to estimate the value of our output variable, $y$. Mathematically, we represent this as $y = f(x_1,x_2,x_3)$. For prediction, we need to build a function $f$ that accurately predicts the output variable we care about, $y$.
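
To make the notation $y = f(x_1,x_2,x_3)$ concrete, here is a minimal Python sketch of what such a function could look like. The function body is purely hypothetical: the dollar amounts are arbitrary placeholders chosen for illustration, not values learned from data. In supervised learning, the point is to learn these numbers from examples rather than write them by hand.

```python
def f(x1: int, x2: int, x3: bool) -> float:
    """Hypothetical hand-written predictor of a home's value in dollars.

    x1: number of bedrooms
    x2: age of the home in years
    x3: True if the home has a swimming pool
    All dollar amounts below are arbitrary placeholders, not learned values.
    """
    base_value = 200_000.0       # placeholder starting value
    per_bedroom = 50_000.0       # placeholder value added per bedroom
    per_year_of_age = -1_000.0   # placeholder value lost per year of age
    pool_bonus = 25_000.0        # placeholder value added for a pool
    return (
        base_value
        + per_bedroom * x1
        + per_year_of_age * x2
        + (pool_bonus if x3 else 0.0)
    )


# Estimate y for one hypothetical house described by its three features.
y = f(x1=3, x2=20, x3=True)
print(y)
```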



## Training and testing

There are at least two steps in any prediction problem: training and testing. Training is the process of giving the prediction algorithm **labeled** examples to learn from, meaning examples that include both the inputs and the outputs. Testing is the process of evaluating the trained algorithm's predictions on labeled examples that it did not see during training. For example, the **training data** (data that includes both the features and the target variable) for the housing price prediction example could be:

*Table 1. Example of training data*

| $x_1$<br>Number of bedrooms| $x_2$<br>Year built | $x_3$<br>Swimming pool present? | $y$<br>Price (\$) |
| ----- | ------ | ---- | ---- |
| 2 | 1965 | True  | 325,000 |
| 1 | 1957 | False | 297,000 |
| 3 | 2004 | False | 443,000 |
| 4 | 2023 | True  | 502,000 |

You'll notice that for training data, we know both the input values AND the output values. Knowing both allows the algorithm to learn from those data in the hope of being able to generalize to new, unseen data for which we do not have the house price (output). The prediction algorithm, $f$ (which you'll be coding up shortly!), learns from the training data so that it can make accurate predictions of $y$ for those unseen examples.
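
As a rough illustration of learning from labeled examples, the sketch below stores the rows of Table 1 and predicts the price of an unseen house by copying the price of the most similar training example (a 1-nearest-neighbor rule). This is one simple technique chosen for illustration, not necessarily the algorithm developed later in this material. The function names and the `new_house` values are assumptions made up for the example.

```python
# Table 1 as labeled examples: (bedrooms, year built, has pool) -> price in dollars.
training_data = [
    ((2, 1965, True), 325_000),
    ((1, 1957, False), 297_000),
    ((3, 2004, False), 443_000),
    ((4, 2023, True), 502_000),
]


def squared_distance(a, b):
    """Sum of squared feature differences (True/False count as 1/0)."""
    return sum((float(u) - float(v)) ** 2 for u, v in zip(a, b))


def train(examples):
    """'Training' a nearest-neighbor model just means storing the labeled examples."""
    return list(examples)


def predict(model, x):
    """Return the price of the stored example whose features are closest to x."""
    _, nearest_price = min(
        model, key=lambda example: squared_distance(example[0], x)
    )
    return nearest_price


model = train(training_data)
new_house = (3, 2000, False)  # hypothetical unseen house; its price is unknown
print(predict(model, new_house))
```

Because the features are on very different scales, the year built dominates the distance here. In practice, features are usually rescaled before distances are compared.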

There are two primary categories of prediction: **classification** (for discrete variables) and **regression** (for continuous variables). 

## Classification

Is an email spam or not? Will a stock price go up, down, or stay the same? Is the image a picture of a dog, a cat, or a frog? These are all examples of categorical predictions, which can be made through the process of classification. Whenever we are asking yes-or-no questions or choosing from options in a list, we're performing classification. For classification, the output variable, $y$, is categorical. Each of the options that could be estimated (e.g. spam, not spam) is a **class** label that can be predicted. This concept is separate from the object-oriented class that we have learned about in programming.

We typically think of classification when we are making estimates to answer questions like:
- Is it _____ or _____?
- True or false?
- What type of _____ is it?
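
The sketch below shows what a categorical output looks like in code: the classifier can only ever return one of the class labels seen in its training data. It uses a k-nearest-neighbors majority vote, chosen here only because it is short. The email features (number of links, number of exclamation marks) and all data values are hypothetical and made up for illustration.

```python
from collections import Counter

# Hypothetical labeled emails:
# (number of links, number of exclamation marks) -> class label.
labeled_emails = [
    ((0, 0), "not spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
    ((7, 5), "spam"),
    ((9, 3), "spam"),
    ((6, 8), "spam"),
]


def classify(examples, x, k=3):
    """Predict the majority class label among the k training examples closest to x."""
    nearest = sorted(
        examples,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify(labeled_emails, (8, 4)))
print(classify(labeled_emails, (1, 1)))
```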

## Regression

How much will the home sell for? How tall will the child grow to be? These are examples of regression problems, in which the output variable, $y$, is continuous.

We typically think of regression when we are making estimates to answer questions like:
- What is the cost of ______?
- How many _____ are there?
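
For contrast with classification, here is a minimal regression sketch in which the output is a number rather than a label. It fits a straight line by ordinary least squares to a single feature from Table 1 (number of bedrooms) and uses the line to estimate a price. Using only one feature and a straight line are simplifying assumptions for illustration, and the function names are made up for this example.

```python
# (number of bedrooms, price in dollars) pairs taken from Table 1.
bedrooms = [2, 1, 3, 4]
prices = [325_000, 297_000, 443_000, 502_000]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(slope, intercept, n_bedrooms):
    """Evaluate the fitted line at a given number of bedrooms."""
    return intercept + slope * n_bedrooms


slope, intercept = fit_line(bedrooms, prices)
# Estimate the price of a hypothetical five-bedroom home.
print(predict_price(slope, intercept, 5))
```

With only four training examples, the fitted line is a very rough estimate; the sketch is meant to show the shape of a regression problem, not a realistic pricing model.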

## Terminology Review
- **Prediction**.
- **Supervised Learning**.
- **Forecasting**.
- **Classification**.
- **Class**.
- **Regression**.
- **Features**.
- **Labeled data**.
- **Training data**.
- **Test data**.
- **Sample** or **observation**.

*Table 2. Terminology review*

| Term | Definition | Example |
| ----- | ------ | ---- |
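| Prediction | The process of estimating the value of a variable based on data | Estimating the sales price of a home from its physical characteristics |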