# Challenge: Which model?
You now have a fairly substantial starting toolbox of supervised learning methods that you can use to tackle a host of exciting problems. To make sure all of these ideas are organized in your mind, please go through the list of problems below. For each, identify which supervised learning method(s) would be best for addressing that particular problem. Explain your reasoning and discuss your answers with your mentor.

**1. Predict the running times of prospective Olympic sprinters using data from the last 20 Olympics.** The times are constantly getting faster, so a classifier or a decision-tree-based model would not work well: these models can only predict outcomes within the range they have seen before. Thus, I would lean towards using a linear regression model, which can extrapolate the trend.

**2. You have more features (columns) than rows in your dataset.** This describes a situation related to the curse of dimensionality (Wikipedia: typically an enormous amount of training data is required to ensure that there are several samples with each combination of values). Thus, I would use a Naive Bayes or random forest model, both of which can handle large numbers of features (however, a random forest model may overfit in this scenario). PCA could also be used to reduce the number of features. Plain OLS linear regression is risky here: with more features than rows it can fit the training data perfectly and overfit, so a regularized variant such as Ridge or Lasso would be the safer linear option.

**3. Identify the most important characteristic predicting likelihood of being jailed before age 20.** I would use a random forest model and look at which feature produces the rules with the largest gain in information (or loss in entropy) across the trees.

**4. Implement a filter to “highlight” emails that might be important to the recipient.** I would use a Naive Bayes model because these models are typically used for sentiment classification (easy to build, work well with large datasets).

**5. You have 1000+ features.** I would use a SVM or random forest model because these models can handle large numbers of features. PCA could also be used to reduce the number of features.

**6. Predict whether someone who adds items to their cart on a website will purchase the items.** I would use a SVM or random forest classifier, depending on the size of the dataset (random forest classifier if large dataset).

**7. Your dataset dimensions are 982400 x 500.** I would use an OLS linear regression model because these models tend to perform well when there is a large number of observations (low variance).

**8. Identify faces in an image.** I would use either a KNN classifier, which learns by similarity, or a Naive Bayes model, which is often used for facial recognition (easy to build, work well with large datasets, learn via probability).

**9. Predict which of three flavors of ice cream will be most popular with boys vs girls.** I would use a KNN classifier because it learns by similarity (expect girls and boys to cluster separately). A random forest classifier would also be a good option (intrinsically suited for multiclass problems, work well with mixture of numerical and categorical features, can give a probability of belonging to a given class).

# Supervised Learning Methods

## Classification (Outcome is Categorical)

* It is important to note that with a classifier, the only outcomes that will be seen as possible have to be in the training set.
* Keep in mind a classifier generally does not average these outcomes for you, so if you're just getting a single output it will give you the most likely outcome, not the average.

### 1. Naive Bayes Classifier (Learning via Probability)
* Naive Bayes is built on Bayes' rule, $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$. Recall that this can be read as: the probability of y (in the case of our model, the categorical outcome we’re interested in) given a set of observations is equal to the probability of that set of observations given y, multiplied by the prior probability of y, and divided by the probability of that set of observations.
* The other part of Naive Bayes is of course Naive. In this setting Naive refers to the assumption that any pair of variables in the conditional vector (the x variables) are independent from each other.
* We are interested in which y value is most likely to have given the observed set of x variables based on their Bayesian probabilities.
* This is most common in sentiment classification, a branch of machine learning that is designed to focus on trying to classify textual samples according to sentiment. Practically it is very good for spam filtering or telling if comments are positive or negative.
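
As a rough illustration of the bullets above, here is a minimal sketch of a Bernoulli-style Naive Bayes built by counting. The function names, the toy spam data, and the add-one smoothing are all assumptions made for this example; they are not taken from the course material or from any library.

```python
from collections import Counter, defaultdict

def train_bernoulli_nb(rows, labels):
    """Estimate P(y) and P(x_i = 1 | y) from binary rows (add-one smoothing, an assumption)."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    ones = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            ones[label][i] += value
    priors = {y: class_counts[y] / len(labels) for y in class_counts}
    likelihoods = {
        y: [(ones[y][i] + 1) / (class_counts[y] + 2) for i in range(n_features)]
        for y in class_counts
    }
    return priors, likelihoods

def predict_proba(row, priors, likelihoods):
    """Return P(y | row) for each class, multiplying per-feature terms (the naive assumption)."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for value, p_one in zip(row, likelihoods[y]):
            score *= p_one if value == 1 else (1 - p_one)
        scores[y] = score
    total = sum(scores.values())  # plays the role of P(x), the normalizing constant
    return {y: s / total for y, s in scores.items()}

# Toy data, made up for illustration: [contains "free", contains "meeting"]
rows = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

priors, likelihoods = train_bernoulli_nb(rows, labels)
posterior = predict_proba([1, 0], priors, likelihoods)
prediction = max(posterior, key=posterior.get)  # the most likely y given x
```

The prediction is simply the class with the largest posterior, which is the "most likely y" described above.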

#### Types of Naive Bayes
There are three main classifiers: **Bernoulli, Multinomial, and Gaussian Naive Bayes**. We’ve covered these distributions briefly in the fundamentals course. Each classifier assumes that the distribution of the conditional (the aforementioned P(xi|y)) is the given distribution.

Now these distributions have limitations. A Bernoulli variable only takes two possible values. A multinomial has discrete outcomes, and a Gaussian (also known as "normal") takes values along the continuous normal distribution.

What this means is that choosing which kind of classifier you want to use depends on the distribution of your features (the x variables), since it is the conditional P(xi|y) that is being modeled. Choose the distribution that best fits your data.

#### Strengths
* Naive Bayes is indifferent to missing datapoints. Those missing datapoints simply get ignored, and the model draws what information it can from the other variables of that observation.
* Easy to build
* Useful for very large data sets
* Good choice when CPU and memory resources are a limiting factor

#### Weaknesses
* The first and most obvious downside of Naive Bayes is that assumption of independence. That is a double edged sword because not only is it a condition you’ll often fail to have (even when the model works well), but it also means that any time two variables affect the outcome most in concert your model will fail to see it (called interaction).
* Naive Bayes can only predict the outcome of categories it has seen before.

### 2. KNN Classifiers (Learning via Similarity)
* Look for the datapoints that are most similar to the observation we are trying to predict

#### Nearest Neighbor
* Simplest form of a similarity model
* This works quite simply: when trying to predict an observation, we find the closest (or _nearest_) known observation in our training data and use that value to make our prediction
* To find which observation is "nearest" we need some kind of way to measure distance. Typically we use _Euclidean distance_, the standard distance measure that you're familiar with from geometry. With one observation in n-dimensions $(x_1, x_2, ...,x_n)$ and the other $(w_1, w_2,...,w_n)$:

$$ \sqrt{(x_1-w_1)^2 + (x_2-w_2)^2+...+(x_n-w_n)^2} $$
* All it takes to train the model is a dataframe of independent variables and a dataframe of dependent outcomes.
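
The distance formula and the nearest-neighbor rule can be sketched in a few lines. This is a minimal illustration with made-up points; `euclidean` and `nearest_neighbor_predict` are hypothetical helper names, not library functions.

```python
import math

def euclidean(x, w):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))

def nearest_neighbor_predict(train_X, train_y, query):
    """Return the outcome of the single closest training observation."""
    best_index = min(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    return train_y[best_index]

# Toy training data, made up for illustration
train_X = [(1.0, 1.0), (2.0, 1.5), (8.0, 9.0), (9.0, 8.5)]
train_y = ["a", "a", "b", "b"]

label = nearest_neighbor_predict(train_X, train_y, (1.5, 1.0))
```

Note that "training" here is nothing more than storing `train_X` and `train_y`; all of the work happens at prediction time.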

#### K-Nearest Neighbors
* K-Nearest Neighbors (or "KNN") is the logical extension of Nearest Neighbor. Instead of looking at just the single nearest datapoint to predict an outcome, we look at several of the nearest neighbors, with $k$ representing the number of neighbors we choose to look at. Each of the $k$ neighbors gets to vote on what the predicted outcome should be.
* Using several neighbors has two benefits. First, it smooths out the predictions. If only one neighbor gets to influence the outcome, the model explicitly overfits to the training data. Any single outlier can create pockets of one category prediction surrounded by a sea of the other category. Second, instead of just predicting classes, we get implicit probabilities. If each of the $k$ neighbors gets a vote on the outcome, then the probability of the test example being from any given class $i$ is:

$$ \frac{votes_i}{k} $$

* This model can accommodate as many classes as the data set necessitates. To come up with a classifier prediction it simply takes the class for which that fraction is maximized.
* We can visualize our decision bounds with something called a **mesh**. This allows us to generate a prediction over the whole space.
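
The voting rule and the implicit probabilities can be sketched as follows. This is a minimal example on made-up data (one "blue" outlier sits among the "red" points); the function names are hypothetical, and `math.dist` assumes Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict_proba(train_X, train_y, query, k):
    """Let the k nearest neighbors vote and return votes_i / k for each class."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return {label: count / k for label, count in votes.items()}

def knn_predict(train_X, train_y, query, k):
    """Predict the class whose vote fraction is largest."""
    proba = knn_predict_proba(train_X, train_y, query, k)
    return max(proba, key=proba.get)

# Toy data, made up for illustration: (2, 2) is a blue outlier inside the red group
train_X = [(1, 1), (2, 1), (1, 2), (6, 6), (7, 6), (2, 2)]
train_y = ["red", "red", "red", "blue", "blue", "blue"]

query = (2, 2.5)
one_neighbor = knn_predict(train_X, train_y, query, k=1)    # follows the outlier
three_neighbors = knn_predict(train_X, train_y, query, k=3)  # outlier is outvoted
proba = knn_predict_proba(train_X, train_y, query, k=3)
```

With `k=1` the outlier decides the prediction, while with `k=3` it is outvoted, which is the smoothing effect described above.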

#### Parameter Tuning
* The distance measurement makes the assumption that all units are equal. This is intensely problematic and one of the main issues people have with KNN. Units are rarely equivalent, and discerning how to adjust for that inequivalence is an abstract and touchy subject. **This difficulty also makes binary or categorical variables nearly impossible to include in a KNN model. It really is best if they are continuous.**
* There are **two main normalization techniques** that are effective with KNN: 1) You can set the bounds of the data to 0 and 1, and then rescale every variable to be within those bounds (it may also be reasonable to do -1 to 1, but the difference is actually immaterial). This way every data point is measured in terms of its distance between the max and minimum of its category. **This is best if the data shows a linear relationship, such that scaling to a 0 to 1 range makes logical sense. It is also best if there are known limits to the dataset, as those make for logical bounds for 0 and 1 for the rescaling**, and 2) You can calculate how far each observation is from the mean, expressed in number of standard deviations: this is often called **z-scores**. Calculating z-scores and using them as your basis for measuring distance works for continuous data and puts everything in terms of how far from the mean (or "abnormal") it is.
* Sometimes the $k$ nearest observations are not all similarly close to the test. In that case it may be useful to **weight by distance**. Functionally this will weight by the inverse of distance, so that closer datapoints (with a low distance) have a higher weight than further ones.
* Choosing $k$ is a tradeoff. The larger the $k$, the more smoothed out your decision space will be, with more observations getting a vote in the prediction. A smaller $k$ will pick up more subtle deviations, but these deviations could be just randomness and therefore you could just be overfitting. Add in weighting and that's an additional dimension to this entire conversation.
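
The two normalization techniques can be sketched with the standard library alone. The helper names and the toy `ages` and `incomes` columns are assumptions for this example; neither helper guards against a constant column, where the denominator would be zero.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values to the 0-1 range using the observed min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy columns on very different scales, made up for illustration
ages = [20, 30, 40, 50, 60]
incomes = [20000, 40000, 60000, 80000, 100000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
z_ages = z_scores(ages)
z_incomes = z_scores(incomes)
```

After either transformation the two columns contribute comparably to a Euclidean distance, whereas on the raw scale the income column would dominate.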

### 3. Decision Tree Classifiers (Learning via Questions)
* A decision tree is a series of rules used to arrive at a conclusion.
* Each of the questions is a __node__. Nodes are either root nodes (the first node), interior nodes (follow up questions), or leaf nodes (endpoints). Every node except for leaf nodes contains a __rule__, which is the question we’re asking. The links between nodes are called __branches__ or __paths__. When put in terms of flow, you start at the root node and follow branches through interior nodes until you arrive at a leaf node.
* Each rule divides the data into a certain number of subgroups, typically two subgroups with binary "yes or no" questions being particularly common. It is important to note that all data has to have a way to flow through the tree, it cannot simply disappear or not be contained in the tree.
* The important thing to take away here is that **entropy is a measure of uncertainty in the outcome**. As we limit the possible number of outcomes and become more confident in the outcome, the entropy decreases. An area of the tree with only one possible outcome has zero entropy because there is no uncertainty. We can then use entropy to measure the information gain, defined as the change in entropy from the original state to the weighted potential outcomes of the following state. One method of designing an efficient decision tree is to gain the most information as quickly as possible.
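
Entropy and information gain are straightforward to compute by hand. The sketch below uses base-2 logarithms (so entropy is measured in bits) and made-up labels; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy labels, made up for illustration
parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```

A rule that separates the classes completely recovers all of the parent's entropy, while a rule that leaves each group as mixed as before gains nothing.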

#### Parameter Tuning
* We typically use entropy to build the tree (`criterion='entropy'`), which follows the method we described above of looking for information gain.
* `max_features` is the number of features that will be considered when looking for the best split at each node.
* `max_depth` of the tree is the number of decision levels below the root for our classification.
* Decision tree classifiers involve some randomness when they are built (for example, in which features are considered at each split and how ties between equally good splits are broken), which can lead to inconsistent trees if the code is run more than once. We can set the random seed so the tree looks a specific way for a given assignment, but a model should not depend on one particular seed to perform well.

#### Strengths
* Easy to represent the model visually
* Can handle varied types of data
* Feature selection is a part of the model
* Easy to use with little data preparation (do not have to worry about outliers or whether data is linearly separable)
* Can easily handle feature interactions

#### Weaknesses
* They are **randomly generated**, which can lead to **high variance** in estimates (The tree doesn't build the same way every time)
* Do not support online or live learning -- have to rebuild your tree when new examples come in
* Incredibly **prone to overfitting** (especially if allowed to grow too deep or complex)
* Biased towards the dominant class since they are driven by information gain. Thus, **balanced data is necessary**.
* Can take up a lot of memory (the more features you have, the deeper and larger your tree is likely to be)

### 4. Random Forest Classifiers (Learning via Questions)
* Ensemble model made up of decision trees
* Much like decision trees, random forest models can be used for both classification and regression problems. The main difference is how the votes are aggregated. As a classifier the most popular outcome (the mode) is returned. As a regression it is typically the average or mean that is returned.
* A decision tree is built on an entire dataset, using all the features/variables of interest, whereas a random forest randomly selects observations/rows and specific features/variables to build multiple decision trees from and then aggregates/averages the results.
* Now, Random Forest models don't just create a ton of trees using the same data again and again and again. Instead they use **bagging and random subspace** to generate trees that are different. Without this, the trees could be incredibly similar (even identical), leading to correlation between trees and vulnerability to bias in the trees from some highly predictive features dominating every tree, creating a series of very similar trees with very similar, and potentially biased, predictions. In **bagging**, each tree selects a subset of observations with replacement to build the training set. Replacement here means it can simply choose the same observation multiple times, which is only really a problem when there are few observations. It puts the observation "back in the bag", if you will, where it can be pulled and chosen again. Random forests also typically use a random subset of features for each split. This means for each time it has to perform a split or generate a rule, it is only looking at the **random subspace** created by a random subset of _some_ of the features as possibilities to generate that rule. This will help avoid the aforementioned correlation problem because the trees will not be built with the same available features at every point. **As a general rule, for a dataset with x features $\sqrt{x}$ features are used for classifiers and $x/3$ for regression.**
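
The two sources of randomness can be sketched as sampling steps. This is only an illustration of the sampling, not a full forest: the function names are hypothetical, and drawing a bootstrap sample the same size as the dataset is a common convention assumed here rather than something stated above.

```python
import math
import random

def bootstrap_sample(n_rows, rng):
    """Bagging: draw row indices with replacement, so a row can appear more than once."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, rng, task="classification"):
    """Random subspace: pick the features a single split is allowed to consider."""
    if task == "classification":
        size = max(1, round(math.sqrt(n_features)))  # sqrt(x) rule of thumb
    else:
        size = max(1, n_features // 3)               # x/3 rule of thumb
    return sorted(rng.sample(range(n_features), size))

rng = random.Random(0)  # seeded only so the sketch is reproducible

rows_for_tree = bootstrap_sample(10, rng)          # one bag of rows per tree
features_for_split = random_feature_subset(16, rng)  # redrawn at every split
```

Each tree would get its own call to `bootstrap_sample`, and each split within a tree its own call to `random_feature_subset`, which is what keeps the trees from all looking alike.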

#### Parameter Tuning
* When building a Random Forest you get to set parameters for both the tree and the forest. So for the tree you have the same parameters as before: you can set the depth of the tree and the number of features used in each rule or split. You can also specify how the tree is built, with using information gain and entropy like we did before, or other methods like Gini impurity.
* You also get to control the **number of estimators you want to generate, or the number of trees in the forest**. Here you have a tradeoff between how much variance you can explain and the computational complexity. This is pretty easily tuneable. **As you increase the number of trees in the forest the accuracy should converge as eventually the additional learning from another tree approaches zero.** There isn't an infinite amount of information to learn, and at some point the trees have learned all they can. So when you have acceptable variance in accuracy you can stop adding trees.

#### Strengths
* Low variance and high accuracy
* Classifiers and regressors
* Helps identify most significant variables from thousands of input variables
* Intrinsically suited for multiclass problems (unlike SVM, which is intrinsically two-class)
* Work well with mixture of numerical and categorical features
* Gives a probability of belonging to a given class (unlike SVM, which gives a distance to the boundary that needs to be converted to probability somehow)

#### Weaknesses
* Cannot predict outside of the sample (true for classifiers and regressors)
* Can become large and slow
* Considered black box models (gives you an output but very little insight into how it got there)
* Cannot iterate or improve generated models

### 5. Linear Support Vector Binary Classifiers (Learning via Errors)
* Simplest example of a Support Vector Machine (SVM)
* The **margin** is the distance between the nearest point of each class and the boundary.
* The nearest point for each class is called the **support vector**
* The goal of SVM is to find the _best_ boundary, or the boundary that maximizes the margin.
* In a linear SVM model, we force the boundary to be linear. However, a straight line doesn't always nicely classify our data, so SVM models can also use the "kernel trick" to create non-linear decision boundaries (transform data into a higher dimension, find a good hyperplane boundary in the higher dimension, and transform the result back to our starting dimension/data -- the kernel trick makes this process possible using the dot product).
* SVM works in as many dimensions as you'd like (given the limitations of your computing resources). The boundary between two groups is therefore not always a line. A line is simply the way to represent this boundary in two dimensions. In general terms, the boundary is always a "hyperplane". A hyperplane in n-dimensional space is an n-minus-one-dimensional space.
* It is called a hard margin when a dataset has a boundary that groups each observation exclusively on one side of the line. It won't always be possible to make a boundary with a hard margin, however. When it's not, the problem is called soft margin (these two terms apply to all classifiers, by the way). To deal with this kind of problem, SVM imposes a cost function. The cost function gives SVM two things to balance: the size of the margin (which it wants to maximize) and the cumulative distance of points on the wrong side of the margin from the boundary (which it wants to minimize). You can control the priorities of this tradeoff by setting the weight placed on the second term. The right weight depends on your tolerance for inaccurate results as compared to large margins.
* SVM models can be used for data with multiple classes by: 1) Running a one-vs-rest form of binary classifier, once for each value your outcome can take. For each category you create a binary classifier between having that category or having any other outcome. To aggregate these and create a multi-class classifier, each one has an output function to define its confidence in classification, which is related to its distance from the boundary and the weights for the accuracy of the classifier. The highest output value dominates, thereby deciding the class, or 2) Using a pairwise approach, where every category is compared to the others in pairs. Here the class is decided by the maximum number of wins given an observation's characteristics. So an observation is categorized under every possible pair of outcomes, and then the outcome is assigned to the one that was most common.
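
To make the soft-margin tradeoff concrete, here is a minimal sketch of one standard way to write the cost: half the squared norm of the weights (small weights mean a wide margin) plus `C` times the hinge loss of each point. This particular formulation, the name `soft_margin_cost`, and the toy data are assumptions for illustration; the notes above do not give an explicit formula.

```python
def soft_margin_cost(w, b, X, y, C):
    """Soft-margin SVM cost for a boundary w.x + b = 0 with labels in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)  # smaller means a wider margin
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge += max(0.0, 1.0 - yi * score)       # zero for points safely outside the margin
    return margin_term + C * hinge

# Toy data, made up for illustration: the last point sits on the wrong side
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

tolerant_cost = soft_margin_cost(w, b, X, y, C=1.0)
strict_cost = soft_margin_cost(w, b, X, y, C=10.0)
```

Raising `C` makes the same misclassified point far more expensive, so an optimizer would accept a narrower margin to fix it.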

#### Strengths
* Flexible
* Great visual explanatory power (especially linear SVC)
* Tremendous accuracy (kernel smoothing)
* Clustering (unsupervised algorithm)
* Classifier or regressor
* Widely used in pattern recognition and classification problems when the data has exactly two classes (popular in text classification)
* Tend to perform better than random forests on sparse datasets

#### Weaknesses
* Potential for poor computational efficiency (memory intensive) and explanatory power (hard to interpret)
* Difficult to tune
* Relies on some concept of "distance" between different points
* Hardly scalable beyond 10^5 datapoints

### All Classifiers

#### Evaluation Metrics
* The most basic measure of success, then, is how often our model was correct. This is called the **accuracy**. Accuracy is not always the best metric: 1) Not all (types of) errors are created equal, and 2) Understanding how your model is failing can be key to improving it. If a certain outcome is not being predicted accurately you may want to focus on engineering more features to identify that outcome.
* The next level of analysis of your classifier is often something called a **Confusion Matrix**. This is a matrix that shows the count of each possible combination of target and prediction. It includes **false positives** (Type I errors, a false alarm) and **false negatives** (Type II errors, a miss). Within this context, **Sensitivity** is the percentage of positives correctly identified. **Specificity** is just the opposite, the percentage of negatives correctly identified.
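
The counts in a binary confusion matrix, and the metrics derived from them, can be sketched as follows. The labels are made up and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1  # fp: Type I error
        else:
            counts["fn" if truth == positive else "tn"] += 1  # fn: Type II error
    return counts

# Toy targets and predictions, made up for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
sensitivity = counts["tp"] / (counts["tp"] + counts["fn"])  # positives correctly identified
specificity = counts["tn"] / (counts["tn"] + counts["fp"])  # negatives correctly identified
```

Here accuracy alone hides the fact that the model is better at catching positives than at ruling out negatives.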

#### Class Imbalance
* Imbalanced classes are a common problem in machine learning classification, where there is a disproportionate ratio of observations in each class. A small imbalance in your outcome classes doesn't tend to make much of a difference. A large class imbalance, however, can hugely affect your model. That's because when one class makes up a large portion of the data, a model can look pretty successful by simply choosing the dominant class every time. **As a baseline, a worthwhile model will have an accuracy greater than the dominant class rate.**
* To deal with **class imbalance**, you can:
  1. Ignore it and try to engineer features that strongly identify the minority class.
  2. Deliberately oversample the minority class or undersample the majority class to create a more balanced training set.
  3. Use probability outputs. Although Naive Bayes' probability outputs are generally inaccurate and not to be used, other models will give you a more accurate probability of a certain class. Things like logistic regression or support vector machines ("SVM") can be good at this. Instead of just taking the most likely outcome you can set up a specific cutoff or a more complex rule. In the binary case it could be going with the minority class if it has a probability greater than some threshold.
  4. Create cost functions for errors. This effectively quantifies ways in which errors are not equal. You find some functional form to scale the cost of an error up or down. This can mean something like a Type II error being twice as bad as a Type I error, or a hundred times as bad, or however you choose to quantify that relationship.
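
Two of these ideas are easy to sketch: the dominant class rate as a baseline, and a probability cutoff for the minority class. The helper names, the toy labels, and the probabilities are all made up for illustration; in practice the probabilities would come from a fitted model.

```python
from collections import Counter

def dominant_class_rate(labels):
    """Accuracy obtained by always predicting the most common class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def predict_with_threshold(minority_probs, threshold):
    """Predict the minority class (1) whenever its probability reaches the cutoff."""
    return [1 if p >= threshold else 0 for p in minority_probs]

# Toy labels: 90% of observations belong to class 0
labels = [0] * 9 + [1]
baseline = dominant_class_rate(labels)

# Hypothetical model probabilities for the minority class
minority_probs = [0.05, 0.10, 0.20, 0.35, 0.60]
default_cutoff = predict_with_threshold(minority_probs, 0.5)
lower_cutoff = predict_with_threshold(minority_probs, 0.3)
```

Lowering the cutoff flags more observations as the minority class, trading extra false alarms for fewer misses.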

## Regression (outcome is continuous)

### 1. Linear Regression Models (Learning via Errors)
* Often use an optimization algorithm called Ordinary Least Squares (OLS) to minimize the sum of the squared distances between each point and the line
* Assumes six conditions:
  1. The relationship between (coefficients of) feature(s) and target(s) is linear.
  2. The errors of the model should be equal to zero on average.
  3. The model’s errors are consistently distributed, which is known as **homoscedasticity**.
  4. Features are at most only weakly correlated. Put differently there is not strong **multicollinearity**.
  5. The model’s errors should be uncorrelated with each other.
  6. The features and model errors are independent of one another.

#### Evaluation metrics

When we were evaluating the training performance of a linear regression model, we used metrics like R-squared and adjusted R-squared. R-squared measures the ratio of variance in the target variable that is explained by the model. However, when we are making predictions we care more about how close our predictions are to the target rather than the variance in the target variable. This means that we usually use metrics other than R-squared to gauge how good our predictions are. Here, we introduce four of the most common ones.

* **Mean absolute error (MAE)** is defined as the average of the absolute values of the errors between the true values and the predicted values:

$$ \frac{1}{n} \sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert  $$

* **Mean squared error (MSE)** is defined as the average of the squared errors between the true values and the predicted values:

$$ \frac{1}{n} \sum_{i=1}^{n} (y_i-\hat{y}_i)^2  $$

* **Root mean squared error (RMSE)** is defined as the square root of the MSE:

$$ \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i-\hat{y}_i)^2}  $$

* **Mean absolute percentage error (MAPE)** is defined as the average of the ratio of the absolute values of the errors to the true values:

$$ \frac{1}{n} \sum_{i=1}^{n}\frac{\lvert y_i-\hat{y}_i\rvert}{y_i}  $$

Although these are different metrics, they are all built from the difference between the true value of the target and the value predicted by the model. These errors are then aggregated into an overall error score.

We can use any one of the above metrics. But there are some important points to note about them:

* Lower values are desirable for all four metrics. The lower the value, the better the performance of the model.
* MAE and RMSE are in the unit of the target variable, MSE is in squared units of the target variable, and MAPE is unitless. So MAE, MSE, and RMSE are only useful if we compare different models that have the same target variable.
* MSE and RMSE penalize large errors more than the MAE and MAPE do. This means that MSE and RMSE are more useful when high error values are undesirable.
* For target values very close to zero, MAPE may provide a problematic picture of the performance as the ratio may go to very high values, and this may distort the average. It can also give division by zero errors if some values of the target are zero!
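
The four metrics translate directly into code. The sketch below uses made-up true and predicted values, and the `mape` helper assumes no true value is zero, for the reason given in the last bullet.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    # Assumes every true value is nonzero; a zero would raise ZeroDivisionError
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values, made up for illustration: the third prediction is far off
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 340.0]

scores = {
    "mae": mae(y_true, y_pred),
    "mse": mse(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
    "mape": mape(y_true, y_pred),
}
```

Because of the one large error, RMSE comes out noticeably higher than MAE on this data, which is the extra penalty on large errors described above.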

#### Other types of regression besides OLS
To make a model more generalizable to a test set, we sometimes adjust the model's learning objectives or loss functions. By doing this, we actually impose our preferences over potential solutions and force the model to choose one of our preferred solutions, assuming there isn't a nonpreferred solution that performs significantly better. In general, the term **regularization** refers to the process of modifying algorithms in order to lower the generalization gap, typically accepting slightly worse training performance in exchange.

When we introduced linear regression, we said that model fit is determined by minimizing the sum of the squared differences between the predicted and actual values. This is _Ordinary Least Squares_:

$$\sum_{i=1}^n(y_i-(\alpha+\beta x_i))^2$$

It just so happens, however, that we can get more accurate *predictions* by modifying this cost function. One way to think of this is that the OLS cost function optimizes variance explained *in the training set.*  **Ridge**, **Lasso**, and **ElasticNet** regression are three examples of modifying this cost function. Each adds a penalty on the size of the coefficients, with the aim of improving variance explained *in the test sets*. In general, our goal is to make a model that tells us about the world (and not just our training sample) so Ridge, Lasso, and ElasticNet solutions are useful.
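
For reference, the three modified cost functions are commonly written as the OLS cost plus a penalty on the coefficient. The penalty weights $\lambda$, $\lambda_1$, and $\lambda_2$ are symbols introduced here for illustration; they are tuning parameters and do not appear in the notes above.

Ridge (squared penalty):

$$\sum_{i=1}^n(y_i-(\alpha+\beta x_i))^2 + \lambda\beta^2$$

Lasso (absolute-value penalty):

$$\sum_{i=1}^n(y_i-(\alpha+\beta x_i))^2 + \lambda\lvert\beta\rvert$$

ElasticNet (a mix of both):

$$\sum_{i=1}^n(y_i-(\alpha+\beta x_i))^2 + \lambda_1\lvert\beta\rvert + \lambda_2\beta^2$$

With several features, the penalty is summed over all of the slope coefficients. Setting the penalty weights to zero recovers Ordinary Least Squares.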

### 2. KNN Regression (Learning via Similarity)
* Switching KNN to a regression is a simple process. In our previous classification models, each of the $k$ observations voted for a _category_. As a regression they vote instead for a _value_. Then instead of taking the most popular response, the algorithm averages all of the votes. If you have weights you perform a weighted average.
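
A minimal sketch of KNN regression, with and without inverse-distance weights, might look like this. The function name and toy data are hypothetical, `math.dist` assumes Python 3.8 or later, and the small constant added to each distance is an assumption made here to avoid dividing by zero when a neighbor coincides with the query.

```python
import math

def knn_regress(train_X, train_y, query, k, weighted=False):
    """Average the values of the k nearest neighbors, optionally weighting by 1/distance."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    if not weighted:
        return sum(value for _, value in neighbors) / k
    # Closer neighbors get larger weights; 1e-9 guards against a zero distance
    weights = [1.0 / (math.dist(x, query) + 1e-9) for x, _ in neighbors]
    return sum(w * value for w, (_, value) in zip(weights, neighbors)) / sum(weights)

# Toy one-feature data, made up for illustration
train_X = [(1.0,), (2.0,), (4.0,), (10.0,)]
train_y = [10.0, 20.0, 40.0, 100.0]

plain = knn_regress(train_X, train_y, (2.5,), k=3)
weighted = knn_regress(train_X, train_y, (2.5,), k=3, weighted=True)
```

The weighted estimate is pulled towards the value of the closest neighbor, while the plain estimate treats all three neighbors equally.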

#### Evaluation metrics
* Cross validation is still valuable
* Can get an $R^2$ value for the regression.

#### Linear vs. KNN
* OLS models will perform better on data that can be explained by a linear/polynomial function, while KNN models will perform better on non-linear/non-polynomial data. Thus, KNN models may be a safer option when the relationship is unknown.

### 3. Decision Tree Regression (Learning via Questions)
* Decision trees are predictive models that use a set of binary rules to calculate a target value.
* After the training phase, the decision tree produces a tree like the classification trees described above, calculating the best questions and the order in which to ask them so as to make the most accurate estimates possible. To make a prediction, new data must be provided to the model in the same format as the training data. The prediction will be an estimate based on the data the model was trained on.
* Decision tree regression models normally use mean squared error (MSE) to decide to split a node in two or more sub-nodes.
* We need to pick a variable and the value to split on such that the two groups are as different from each other as possible.
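
Choosing a split by MSE can be sketched for a single feature as follows. This is a simplified illustration with made-up data that tries every observed value as a binary threshold; `best_split` is a hypothetical helper, not how any particular library implements it.

```python
def mse(values):
    """Mean squared error of a group around its own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return the (threshold, cost) pair with the lowest size-weighted MSE of the two groups."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value would leave the right group empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Toy data, made up for illustration: two clearly separated groups
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, cost = best_split(xs, ys)
```

Minimizing the error within each group is what makes the two groups as different from each other as possible. Each leaf then predicts the mean of its group, so no prediction can fall outside the range of the training targets.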

#### Weaknesses
* Prone to overfitting
* Decision trees, and tree-based models in general, are unable to extrapolate to any kind of data they haven’t seen before, particularly future time periods, as the models are just averaging data points they have already seen. Thus, decision trees are typically used for classification problems, and only used for regression problems if the target variable is expected to stay inside the range of values in the training set.

### 4. Random Forest Regression (Learning via Questions)
* Ensemble model made up of decision trees
* Much like decision trees, random forest models can be used for both classification and regression problems. The main difference is how the votes are aggregated. As a classifier the most popular outcome (the mode) is returned. As a regression it is typically the average or mean that is returned.
* A decision tree is built on an entire dataset, using all the features/variables of interest, whereas a random forest randomly selects observations/rows and specific features/variables to build multiple decision trees from and then aggregates/averages the results.
* Now, Random Forest models don't just create a ton of trees using the same data again and again and again. Instead they use **bagging and random subspace** to generate trees that are different. Without this, the trees could be incredibly similar (even identical), leading to correlation between trees and vulnerability to bias in the trees from some highly predictive features dominating every tree, creating a series of very similar trees with very similar, and potentially biased, predictions. In **bagging**, each tree selects a subset of observations with replacement to build the training set. Replacement here means it can simply choose the same observation multiple times, which is only really a problem when there are few observations. It puts the observation "back in the bag", if you will, where it can be pulled and chosen again. Random forests also typically use a random subset of features for each split. This means for each time it has to perform a split or generate a rule, it is only looking at the **random subspace** created by a random subset of _some_ of the features as possibilities to generate that rule. This will help avoid the aforementioned correlation problem because the trees will not be built with the same available features at every point. **As a general rule, for a dataset with x features $\sqrt{x}$ features are used for classifiers and $x/3$ for regression.**

#### Parameter Tuning
* When building a Random Forest you get to set parameters for both the tree and the forest. So for the tree you have the same parameters as before: you can set the depth of the tree and the number of features used in each rule or split. You can also specify how the tree is built, with using information gain and entropy like we did before, or other methods like Gini impurity.
* You also get to control the **number of estimators you want to generate, or the number of trees in the forest**. Here you have a tradeoff between how much variance you can explain and the computational complexity. This is pretty easily tuneable. **As you increase the number of trees in the forest the accuracy should converge as eventually the additional learning from another tree approaches zero.** There isn't an infinite amount of information to learn, and at some point the trees have learned all they can. So when you have acceptable variance in accuracy you can stop adding trees.

#### Strengths
* Low variance and high accuracy

#### Weaknesses
* Cannot predict outside the range of outcomes seen in the training sample (true for both classifiers and regressors)
* Can become large and slow
* Considered black box models (gives you an output but very little insight into how it got there)

### 5. Support Vector Regression (Learning via Errors)
* Support Vector Regression (SVR) operates much like an inversion of Support Vector Classifiers. In classification we had a computational advantage because we were only interested in the points closest to the boundary. **In regression, we are instead only interested in values far away from the prediction**, that is, the points that fall outside the margin.

#### Parameter Tuning
* There are two major values we tune in SVR, **C and epsilon**. **C is called the box constraint and sets the penalty for being outside of our margin. Epsilon sets the size of our margin.** So, much like in the classification problem, we take our data, find each observation's distance from a specified point (previously the boundary, now the prediction), and minimize the cost incurred by observations that fall outside the margin. This ends up being a huge advantage of SVM for regression: you can set the sensitivity when building the model, not just after the fact.
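
A small sketch of how C and epsilon enter the cost may help. It uses the standard epsilon-insensitive loss for a linear SVR; the function names and the example numbers are made up for illustration, and no kernel or optimizer is included.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors inside the margin (|error| <= epsilon) cost nothing;
    errors outside it cost only the amount by which they exceed epsilon."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

def svr_cost(weights, y_true, y_pred, C, epsilon):
    """Linear SVR objective: flatness term plus C times the total margin violation."""
    flatness = 0.5 * sum(w * w for w in weights)
    violations = sum(epsilon_insensitive_loss(y_true, y_pred, epsilon))
    return flatness + C * violations

y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 1.0]   # hypothetical predictions

# First point is inside the margin, so it contributes zero loss.
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)

# A larger C makes the same violations more expensive.
cost_small_c = svr_cost([2.0], y_true, y_pred, C=1.0, epsilon=0.1)
cost_large_c = svr_cost([2.0], y_true, y_pred, C=10.0, epsilon=0.1)
```

Widening epsilon lets more points sit inside the margin at no cost, while raising C punishes whatever still falls outside more heavily.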

## Ensemble Models, In General
* Ensemble models are essentially models made up of other models. These component models are often models that are simpler than would be necessary to accurately predict the desired outcome on their own.
* Ensemble models combine many less effective models (“weak learners”) into one more effective model (“strong learner”).

### Three Main Categories of Ensemble Models
1. Bagging is one such ensemble technique. In bagging you take subsets of the data and train a model on each subset. Then the models trained on those subsets simultaneously vote on the outcome, taking either the majority (for classification) or the mean (for regression). You just saw this in action with Random Forests, the most popular bagging technique.

2. Another ensemble technique is called boosting. Rather than build multiple models simultaneously like bagging, boosting uses the output of one model as an input into the next in a form of serial processing. These models then get daisy-chained together sequentially until some stopping condition is met. We’ll cover boosting methods later. The boosting approach is exceptionally flexible – it works for classification and regression, and can be combined with any of the modeling approaches we've covered so far.

3. Lastly, stacking is a two-phase process. In the first phase multiple models are trained in parallel. Then in the second phase the predictions of those models are used as inputs to a final model that gives your prediction. This approach combines the parallel approach embodied by bagging with the serial approach of boosting to create a hybrid of the two.
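
As a minimal sketch of the voting step used in bagging, the two helpers below combine predictions that are assumed to have already been produced by separate models. The function names and sample predictions are hypothetical; ties in the vote go to whichever label appeared first.

```python
from collections import Counter

def majority_vote(labels):
    """Classification: return the label predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]

def mean_prediction(values):
    """Regression: average the numeric predictions of all models."""
    return sum(values) / len(values)

# Predictions from three hypothetical models for a single observation.
class_votes = ["spam", "ham", "spam"]
numeric_votes = [2.0, 4.0, 6.0]

final_label = majority_vote(class_votes)
final_value = mean_prediction(numeric_votes)
```

Stacking would replace these fixed rules with a second model trained on the first-phase predictions.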

### More on Boosting
* The principle behind boosting is iterative. We start by fitting a simple model on all the data. We identify the information that the model was not able to account for (incorrect predictions for a classifier, residuals for a regression) and build a new simple model that targets that new pool of information. We repeat this until we reach some predetermined stopping rule. The combination of all the models is then used to make the final predictions. Boosting is great because we can use many simple models that are each computationally fast to arrive at very accurate predictions.
* There are many different implementations of boosting that vary along the following axes:
    1. **Type of simple model**. You can use almost any model you like.
    2. **Index of error**. You can use residuals from regression, classification errors, or any cost function.
    3. **How the next iteration targets the error**. You can weight inaccurately-predicted cases high and accurately-predicted cases low, you can directly model residuals, or you can model only the subset of the data that was inaccurately predicted.
    4. **Stopping rule**. You can stop once you've run a certain number of models, once the amount of variance explained by the most recent iteration of the model is lower than some threshold, or once the change in weights between the two most recent model iterations is lower than some threshold.
* Gradient boosting is a specific type of boosting that typically uses decision trees. With gradient boosting, each time we fit a decision tree, we extract the residuals. Then we fit a new decision tree, using those residuals as the outcome to be predicted. After reaching a stopping point, we add together the predicted values from all of the decision trees to create the final gradient boosted prediction. Gradient boosting comes with some methods to avoid overfitting. One option is **subsampling**, where each iteration of the boosting algorithm uses a subsample of the original data. By introducing some randomness into the process, subsampling makes it harder to overfit. Another option is **shrinkage**, which we have encountered before in ridge regression. Here, the shrinkage/regularization parameter, called the "learning rate", scales down the contribution of each iteration to the final solution. Visually, you can picture it making each "step" along the loss function gradient smaller than it would otherwise be. This prevents any one iteration from being too influential and misdirecting the overall boosted solution. Learning rates range from 0 (only the initial iteration matters) to 1 (every iteration's prediction is added at full weight). A model made up of many small steps is less prone to overfitting than a model made up of few large steps, but it can also lead to much slower running times, depending on the stopping rule in play.
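
The residual-fitting loop and the learning rate can be sketched in plain Python. This is a simplified illustration under several assumptions that go beyond the notes: squared-error loss, a single numeric feature, one-split trees ("stumps") as the simple model, a fixed number of rounds as the stopping rule, and no subsampling. All names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrinkage: add only a fraction of each stump's prediction.
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    """Final prediction: initial mean plus the shrunken sum of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data, made up for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.1, 4.9]

baseline = gradient_boost(xs, ys, n_rounds=0, learning_rate=0.5)
one_round = gradient_boost(xs, ys, n_rounds=1, learning_rate=0.5)
many_rounds = gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5)
```

Comparing `mse(baseline, xs, ys)`, `mse(one_round, xs, ys)`, and `mse(many_rounds, xs, ys)` shows the training error shrinking as rounds are added. That only measures fit to the training data, so a held-out set is still needed to judge overfitting.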

### Strengths
* High accuracy and low variance

### Weaknesses
* Prone to overfitting (especially boosting models)
* Lack of transparency (black box)
* Can be computationally slow