### Notebook setup

In [1]:
import numpy as np
import plotly.offline as pyo
import plotly.graph_objs as go
from plotly import tools

# Look at the Big Picture

## Frame the Problem

<span style="font-size:14pt; color:black; font-family:Garamond">

The first question to ask your boss is what exactly the business objective is. Building a model is probably not the end goal.
</span>

## Pipelines

<span style="font-size:14pt; color:black; font-family:Garamond">

A sequence of data processing *components* is called a data *pipeline*. Pipelines are very common in machine learning systems, since there is a lot of data to manipulate and many data transformations to apply.
    
Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store. Some time later, the next component in the pipeline pulls this data and spits out its own output, and so on. Each component is fairly self-contained: the interface between components is simply the data store. This makes the system quite simple to grasp (with the help of a data flow graph), and different teams can focus on different components. Moreover, if a component breaks down, the downstream components can often continue to run normally (at least for a while) by using the last output from the broken component. This makes the architecture quite robust. On the other hand, a broken component can go unnoticed for some time if proper monitoring is not implemented. The data then gets stale and the overall system's performance drops.
</span>
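
The behaviour described above can be sketched in a few lines. This is only an illustration under loud assumptions: the "data store" is a plain dictionary, and the component names `clean`, `summarize` and the helper `run` are hypothetical. It shows a downstream component continuing to run on the last good output of a broken upstream component, which is exactly how stale data can go unnoticed.

```python
# Hypothetical in-memory "data store": each key holds the latest output of one component.
store = {"raw": [3.0, 1.0, 2.0]}

def clean(store):
    """Component 1: read the raw data and write a sorted copy."""
    store["clean"] = sorted(store["raw"])

def summarize(store):
    """Component 2: read the cleaned data and write its mean."""
    values = store["clean"]
    store["summary"] = sum(values) / len(values)

def run(component, store):
    """Run one component; if it fails, its previous output stays in the store."""
    try:
        component(store)
        return True
    except Exception:
        return False

run(clean, store)
run(summarize, store)
first_summary = store["summary"]

# New raw data arrives but is malformed, so the cleaning component breaks.
store["raw"] = [5.0, None, 4.0]
clean_ok = run(clean, store)

# The downstream component still runs, silently reusing the stale cleaned data.
summary_ok = run(summarize, store)
print(clean_ok, summary_ok, store["summary"])
```

Without monitoring on the return value of `run`, nothing signals that `store["summary"]` no longer reflects the newest raw data.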

# Terms:

## Multiple Regression:

<span style="font-size:14pt; color:black; font-family:Garamond">

When a model uses multiple features to make a prediction, this is known as multiple regression.
</span>

## Univariate Regression:

<span style="font-size:14pt; color:black; font-family:Garamond">

When a model predicts a single value for each input, the problem is known as univariate regression.
</span>

## Multivariate Regression:

<span style="font-size:14pt; color:black; font-family:Garamond">

Predicting multiple values for each input is known as a multivariate regression problem.
</span>
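
The three terms can be told apart by the shapes of the arrays involved. The following is a minimal sketch on synthetic data, assuming a plain least-squares linear model fitted with `np.linalg.lstsq`; the weights `w_true` and `W_true` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3                      # m instances, n features (n > 1: multiple regression)
X = rng.normal(size=(m, n))

# Univariate regression: one target value per instance, so y has shape (m,)
w_true = np.array([1.5, -2.0, 0.5])
y_uni = X @ w_true
w_uni, *_ = np.linalg.lstsq(X, y_uni, rcond=None)

# Multivariate regression: two target values per instance, so Y has shape (m, 2)
W_true = np.array([[1.5, 0.0],
                   [-2.0, 1.0],
                   [0.5, 3.0]])
Y_multi = X @ W_true
W_multi, *_ = np.linalg.lstsq(X, Y_multi, rcond=None)

print(y_uni.shape, w_uni.shape)      # one output column
print(Y_multi.shape, W_multi.shape)  # two output columns
```

Both fits are multiple regression because they use three features; only the number of predicted values per instance differs.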

## When to use batch learning:

<span style="font-size:14pt; color:black; font-family:Garamond">

Batch learning is applicable whenever the data and the model fit in main memory. If the data were huge, you could either split your batch learning work across multiple servers (using the MapReduce technique) or use an online learning technique.
</span>
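
To make the contrast concrete, here is a rough sketch on synthetic data. It assumes a linear model, uses `np.linalg.lstsq` for the batch fit and a hand-written mini-batch gradient descent loop as a stand-in for online learning; the learning rate and batch size are untuned, assumed values.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 2000, 2
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(m, n))
y = X @ w_true + rng.normal(scale=0.1, size=m)

# Batch learning: the whole dataset is in memory and is fitted in one shot.
w_batch, *_ = np.linalg.lstsq(X, y, rcond=None)

# Online learning: see the data one mini-batch at a time and update incrementally.
w_online = np.zeros(n)
learning_rate = 0.05   # assumed value, not tuned
batch_size = 20        # assumed value
for start in range(0, m, batch_size):
    X_b = X[start:start + batch_size]
    y_b = y[start:start + batch_size]
    # Gradient of the mean squared error on this mini-batch only
    grad = 2 / batch_size * X_b.T @ (X_b @ w_online - y_b)
    w_online -= learning_rate * grad

print(w_batch, w_online)
```

The online loop never needs more than `batch_size` rows at once, which is why this style of training suits data that does not fit in memory.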

# Select a Performance Measure

## Root Mean Square Error (RMSE)

<span style="font-size:14pt; color:black; font-family:Garamond">

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. Equation 2-1 shows the mathematical formula to compute the RMSE.
    
*Equation 2-1. Root Mean Square Error (RMSE)*

$$
RMSE(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\bigl(h(x^{(i)}) - y^{(i)}\bigr)^2}
$$
    

</span>

### Notations:

<span style="font-size:14pt; color:black; font-family:Garamond">

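$m$ is the number of instances in the dataset you are measuring the RMSE on.<br>
$x^{(i)}$ is a vector of all the feature values (excluding the label) of the $i^{th}$ instance.<br>
$y^{(i)}$ is the label (the desired output value for that instance).<br>
$X$ is a matrix containing all the feature values (excluding labels) of all instances in the dataset, one row per instance.<br>
$h$ is the system's prediction function, also called a hypothesis: $h(x^{(i)})$ is the predicted value for the $i^{th}$ instance.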
</span>
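
Equation 2-1 translates almost directly into NumPy. The helper name `rmse` and the sample values are chosen here for illustration only.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error between predictions h(x^(i)) and labels y^(i)."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 9.0, 9.0])   # errors: -1, 0, 2, 0
print(rmse(y_pred, y_true))               # sqrt((1 + 0 + 4 + 0) / 4) = sqrt(1.25)
```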

## Mean Absolute Error

<span style="font-size:14pt; color:black; font-family:Garamond">

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer another function. For instance, if the dataset contains many outliers, you may consider using the *mean absolute error* (MAE, also called the ***average absolute deviation***; see Equation 2-2). The MAE should not be confused with the ***Least Absolute Shrinkage and Selection Operator (LASSO)***, a regularized regression method that uses the $l_1$ norm as a penalty on the model weights.
</span>

*Equation 2-2. Mean Absolute Error (MAE)*

$$
MAE(X,h) = \frac{1}{m}\sum_{i=1}^{m}\left\lvert h(x^{(i)}) - y^{(i)}\right\rvert
$$
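
A small numeric sketch of Equation 2-2, with the RMSE alongside for comparison. The helper names `mae` and `rmse` and the toy error vectors are illustrative assumptions; they show how a single large error moves the RMSE much more than the MAE.

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)))

def rmse(y_pred, y_true):
    """Root mean square error."""
    return np.sqrt(np.mean((np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)) ** 2))

y_true = np.zeros(4)
clean = np.array([1.0, 1.0, 1.0, 1.0])     # every error has magnitude 1
outlier = np.array([1.0, 1.0, 1.0, 10.0])  # one large error

print(mae(clean, y_true), rmse(clean, y_true))      # both are 1.0
print(mae(outlier, y_true), rmse(outlier, y_true))  # MAE = 13/4 = 3.25, RMSE = sqrt(103/4)
```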


<span style="font-size:14pt; color:black; font-family:Garamond">

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible. Calculating the size or length of a vector is often required, either directly or as part of a broader vector or vector-matrix operation.
    
The $l_1$, $l_2$ and, more generally, $l_k$ norms are commonly used to assign a magnitude to a vector. For a vector $x$ with $N$ components, the $l_1$ norm sums the absolute values of the components; taking absolute values ensures that the magnitude is never negative.

- Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the $l_{2}$ *norm*, noted $\lVert \cdot \rVert_{2}$. Ridge regression uses this norm (squared) as its penalty term, but "Ridge" is not a name for the norm itself.
- Computing the sum of absolutes (MAE) corresponds to the $l_{1}$ *norm*, noted $\lVert \cdot \rVert_{1}$. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
- More generally, the $l_{k}$ norm of a vector $v$ containing $n$ elements is defined as $\lVert v \rVert_{k} = \bigl( \left\lvert v_{1} \right\rvert^{k} + \left\lvert v_{2} \right\rvert^{k} + \cdots + \left\lvert v_{n} \right\rvert^{k} \bigr)^{\frac{1}{k}}$. $l_{0}$ just gives the number of non-zero elements in the vector, and $l_{\infty}$ gives the maximum absolute value in the vector.
- The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
</span>
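
The norms listed above can be checked on a tiny vector with `np.linalg.norm`. The vector `v` is an arbitrary example; the last lines treat it as a vector of prediction errors to show how the MAE and RMSE are scaled versions of the $l_1$ and $l_2$ norms.

```python
import numpy as np

v = np.array([3.0, 0.0, -4.0])

l0 = np.linalg.norm(v, 0)         # number of non-zero elements: 2
l1 = np.linalg.norm(v, 1)         # |3| + |0| + |-4| = 7
l2 = np.linalg.norm(v, 2)         # sqrt(9 + 0 + 16) = 5
linf = np.linalg.norm(v, np.inf)  # max(|3|, |0|, |-4|) = 4

# Treating v as a vector of prediction errors, the metrics are scaled norms:
m = len(v)
mae = l1 / m             # mean of absolute errors
rmse = l2 / np.sqrt(m)   # root of the mean of squared errors

print(l0, l1, l2, linf, mae, rmse)
```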

In [13]:
num_points = 1000
num_outliers = 50

x = np.linspace(0, 10, num_points)

# Indices at which outliers are injected, and the extra offsets added there
outlier_locs = np.random.choice(len(x), size=num_outliers, replace=False)
outlier_vals = np.random.normal(loc=1, scale=5, size=num_outliers)

y_true = 2 * x
y_pred = 2 * x + np.random.normal(size=num_points)
y_pred[outlier_locs] += outlier_vals

y_diff = y_true - y_pred

In [14]:
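def my_norm(array, k):
    # Mean-based l_k norm; the same helper is defined again in Example 2
    return np.mean(np.abs(array) ** k) ** (1 / k)

losses_given_lk = []  # sum-based norm from NumPy
losses = []           # mean-based norm from my_norm
norms = np.linspace(1, 5, 50)

for k in norms:
    losses_given_lk.append(np.linalg.norm(y_diff, k))
    losses.append(my_norm(y_diff, k))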

In [16]:
trace_1 = go.Scatter(x=norms, 
                     y=losses_given_lk, 
                     mode="markers+lines", 
                     name="lk_norm")

trace_2 = go.Scatter(x=norms, 
                     y=losses, 
                     mode="markers+lines", 
                     name="my_lk_norm")

trace_3 = go.Scatter(x=x, 
                     y=y_true, 
                     mode="lines", 
                     name="y_true")

trace_4 = go.Scatter(x=x, 
                     y=y_pred, 
                     mode="markers", 
                     name="y_true + noise")

from plotly.subplots import make_subplots  # replaces the deprecated plotly.tools.make_subplots

fig = make_subplots(rows=1, cols=4,
                    subplot_titles=("lk_norms", "my_lk_norms", "y_true", "y_true + noise"))
fig.add_trace(trace_1, row=1, col=1)
fig.add_trace(trace_2, row=1, col=2)
fig.add_trace(trace_3, row=1, col=3)
fig.add_trace(trace_4, row=1, col=4)

pyo.plot(fig, filename="lk_norms.html")

'lk_norms.html'

## Example 2

In [34]:
array = np.random.randn(10)
# array = np.array([1, 1, 1, 4, 1, 1, 1, 1])
# array = np.array([1, 1, 3])
array

array([-0.51731879,  1.43852722,  0.5841042 ,  1.20600714,  2.01543641,
        1.3420095 , -0.25783117, -1.12253678,  0.66446307,  0.41302383])

In [42]:
def my_norm(array, k):
    # Mean-based l_k norm: averages |x_i|^k where np.linalg.norm sums them
    return np.mean(np.abs(array) ** k) ** (1 / k)

In [40]:
np.linalg.norm(array, 1), np.linalg.norm(array, 2), np.linalg.norm(array, 3), np.linalg.norm(array, 10)

(9.561258110585216, 3.4545982749318846, 2.5946495606046547, 2.027258231324604)

In [43]:
my_norm(array, 1), my_norm(array, 2), my_norm(array, 3), my_norm(array, 10)

(0.9561258110585216, 1.092439894967332, 1.2043296427640868, 1.610308452218342)
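
The two output tuples above differ because `np.linalg.norm` sums $\lvert x_i \rvert^k$ while `my_norm` averages them, so the values are related by a factor of $n^{1/k}$ (for $k=1$ and $n=10$: 9.56 versus 0.956). A quick check of that relation is sketched below, on a freshly seeded random vector rather than the exact array printed above.

```python
import numpy as np

def my_norm(array, k):
    # Mean-based l_k norm, same definition as in the cell above
    return np.mean(np.abs(array) ** k) ** (1 / k)

rng = np.random.default_rng(0)
a = rng.standard_normal(10)
n = len(a)

orders = (1, 2, 3, 10)
sum_based = [np.linalg.norm(a, k) for k in orders]
mean_based = [my_norm(a, k) for k in orders]

for k, s, mu in zip(orders, sum_based, mean_based):
    # Dividing the sum-based norm by n^(1/k) recovers the mean-based value
    print(k, s, mu, np.isclose(mu, s / n ** (1 / k)))
```

This scaling also explains the opposite trends in the two tuples: the sum-based norm shrinks as $k$ grows, while the mean-based version grows towards the maximum absolute value.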

In [33]:
2**(2/3)

1.5874010519681994