# Statistical Machine Learning

## Machine Learning vs. Statistics

In the context of predictive modeling, what is the difference between machine learning and
statistics? There is not a bright line dividing the two disciplines. Machine learning tends to be
more focused on developing efficient algorithms that scale to large data in order to optimize
the predictive model. Statistics generally pays more attention to the probabilistic theory and
underlying structure of the model. Bagging, and the random forest (see “Bagging and the
Random Forest”), grew up firmly in the statistics camp. Boosting (see “Boosting”), on the
other hand, has been developed in both disciplines but receives more attention on the machine
learning side of the divide. Regardless of the history, the promise of boosting ensures that it
will thrive as a technique in both statistics and machine learning.

## K-Nearest Neighbors

The idea behind K-Nearest Neighbors (KNN) is very simple. For each record to
be classified or predicted:
1. Find K records that have similar features (i.e., similar predictor values).
2. For classification: Find out what the majority class is among those similar records, and assign that class to the new record.
3. For prediction (also called KNN regression): Find the average among those similar records, and predict that average for the new record.
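
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `knn_predict`, its parameters, and the toy records are all hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Hypothetical helper: predict for one new record from its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: find the k records with the most similar predictor values
    # (assumption: similarity is measured by Euclidean distance).
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances, kind="stable")[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Step 2: assign the majority class among the neighbors.
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Step 3: KNN regression predicts the average of the neighbors' values.
    return float(np.mean(neighbor_targets))

# Toy data, invented for illustration only.
X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8]]
y_class = ["paid off", "paid off", "default", "default", "default"]
y_value = [10.0, 20.0, 30.0, 100.0, 110.0]

predicted_class = knn_predict(X_train, y_class, [1.5, 1.5], k=3)
predicted_value = knn_predict(X_train, y_value, [1.5, 1.5], k=3, task="regression")
```

The sketch computes the distance to every training record, which is fine for small data; library implementations typically use faster neighbor-search structures.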

- **Neighbor**: A record that has similar predictor values to another record.
- **Distance metrics**: Measures that sum up in a single number how far one record is from another.
- **Standardization**: Subtract the mean and divide by the standard deviation.
    - _Synonym_: Normalization
- **Z-score**: The value that results after standardization.
- **K**: The number of neighbors considered in the nearest neighbor calculation.

The prediction results depend on how the features are scaled, how
similarity is measured, and how large K is. Also, all predictors must be in
numeric form.

While the output of KNN for classification is typically a binary decision, such as default or
paid off in the loan data, KNN routines usually offer the opportunity to output a probability
(propensity) between 0 and 1. The probability is based on the fraction of one class in the $K$
nearest neighbors. In the preceding example, this probability of default would have been
estimated at $\frac{14}{20}$ or 0.7. Using a probability score lets you use classification rules other than
a simple majority vote (a cutoff of 0.5). This is especially important in problems with
imbalanced classes; see “Strategies for Imbalanced Data”. For example, if the goal is to
identify members of a rare class, the cutoff would typically be set below 50%. One common
approach is to set the cutoff at the probability of the rare event.
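
As a rough sketch of how a propensity and a cutoff interact, the snippet below computes the fraction of one class among the $K$ neighbors and then applies a cutoff. The helper names (`knn_propensity`, `classify_with_cutoff`), the second set of neighbor labels, and the 10% rare-event rate are assumptions made up for illustration; only the 14-of-20 case comes from the text.

```python
import numpy as np

def knn_propensity(neighbor_labels, positive_class):
    """Fraction of the K nearest neighbors that belong to positive_class."""
    labels = np.asarray(neighbor_labels)
    return float(np.mean(labels == positive_class))

def classify_with_cutoff(propensity, cutoff=0.5,
                         positive_class="default", negative_class="paid off"):
    """Assign positive_class when the propensity exceeds the cutoff."""
    return positive_class if propensity > cutoff else negative_class

# The case from the text: 14 of the 20 nearest neighbors defaulted.
neighbors = ["default"] * 14 + ["paid off"] * 6
p_default = knn_propensity(neighbors, "default")        # 14 / 20
majority_decision = classify_with_cutoff(p_default)     # cutoff of 0.5

# Hypothetical rare-class case: 3 of 20 neighbors are positive, and the
# rare event is assumed to occur in about 10% of all records.
rare_neighbors = ["default"] * 3 + ["paid off"] * 17
p_rare = knn_propensity(rare_neighbors, "default")
with_majority_cutoff = classify_with_cutoff(p_rare, cutoff=0.5)
with_rare_cutoff = classify_with_cutoff(p_rare, cutoff=0.10)
```

With the majority cutoff the second record is classified as paid off, while the lower cutoff flags it as a likely default, which is the behavior you usually want when hunting for a rare class.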

### Distance Metrics

Similarity (nearness) is determined using a distance metric, which is a function
that measures how far two records $(x_1, x_2, \ldots, x_p)$ and $(u_1, u_2, \ldots, u_p)$ are from one
another. The most popular distance metric between two vectors is Euclidean
distance. To measure the Euclidean distance between two vectors, subtract one
from the other, square the differences, sum them, and take the square root:

\begin{equation}
\sqrt{(x_1 - u_1)^2 +  (x_2 - u_2)^2 + \cdots + (x_p - u_p)^2}
\end{equation}

Another common distance metric for numeric data is Manhattan distance:

\begin{equation}
|x_1 - u_1| +  |x_2 - u_2| + \cdots + |x_p - u_p|
\end{equation}

Euclidean distance corresponds to the straight-line distance between two points. 
Manhattan distance is the distance between two points
traversed in a single direction at a time (e.g., traveling along rectangular city
blocks). For this reason, Manhattan distance is a useful approximation if
similarity is defined as point-to-point travel time.
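
Both formulas translate directly into NumPy. The following is a small sketch; the function names and the two example points are chosen here for illustration.

```python
import numpy as np

def euclidean_distance(x, u):
    """Square root of the sum of squared coordinate differences."""
    x, u = np.asarray(x, dtype=float), np.asarray(u, dtype=float)
    return float(np.sqrt(np.sum((x - u) ** 2)))

def manhattan_distance(x, u):
    """Sum of the absolute coordinate differences."""
    x, u = np.asarray(x, dtype=float), np.asarray(u, dtype=float)
    return float(np.sum(np.abs(x - u)))

# Two illustrative points: 3 units apart on one axis, 4 on the other.
x = [0.0, 0.0]
u = [3.0, 4.0]
d_euclid = euclidean_distance(x, u)     # straight line: 5.0
d_manhattan = manhattan_distance(x, u)  # along the axes: 3 + 4 = 7.0
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, since it forbids cutting the corner.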

In measuring distance between two vectors, variables (features) that are
measured on a comparatively large scale will dominate the measure. For
example, for the loan data, the distance would be almost solely a function of the
income and loan amount variables, which are measured in tens or hundreds of
thousands. Ratio variables would count for practically nothing in comparison.
We address this problem by standardizing.
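
To see the effect of scale, the sketch below compares Euclidean distances before and after converting each column to z-scores. The three records are invented for illustration (they are not the loan data), and the population standard deviation (NumPy's default, `ddof=0`) is an arbitrary choice; using the sample standard deviation would rescale all z-scores by the same factor.

```python
import numpy as np

# Hypothetical records: annual income in dollars and a ratio between 0 and 1.
records = np.array([
    [50_000.0, 0.10],
    [52_000.0, 0.90],   # similar income to record 0, very different ratio
    [90_000.0, 0.12],   # very different income, similar ratio
])

def standardize(X):
    """Subtract each column's mean and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def euclidean_distance(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw scale: the income column decides almost everything.
raw_d01 = euclidean_distance(records[0], records[1])  # roughly 2,000
raw_d02 = euclidean_distance(records[0], records[2])  # roughly 40,000

# Standardized scale: both variables contribute comparably.
z = standardize(records)
z_d01 = euclidean_distance(z[0], z[1])
z_d02 = euclidean_distance(z[0], z[2])
```

On the raw scale, record 1 looks about twenty times closer to record 0 than record 2 does, even though its ratio is very different. After standardization the two distances are of similar size, so the ratio variable is no longer ignored.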