# Regression

## Linear Regression
The goal of linear regression is to find the best-fitting hyperplane through your samples. Unlike PCA, which measures the shortest (perpendicular) distance from each sample to the fit, linear regression measures distance along the $y$ axis only: the difference between the predicted value $y(x)$ and the observed value. Stated differently, these distances are the errors between the approximate solution and the actual values. Ordinary least squares works by minimizing the sum of the squares of these errors to compute the best-fitting line.

Ordinary least squares, and by extension linear regression in its multivariate form, measures error along the dependent variable only: it minimizes the sum of squared differences between the observed and predicted values of the dependent variable, given all of the independent variables.
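To make the minimization concrete, here is a minimal sketch using NumPy's least-squares solver. The toy data (a line with slope 2 and intercept 1 plus noise) and all variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: y = 2*x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution: the w that minimizes sum((A @ w - y) ** 2).
w, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = w

# Errors are measured along y only: prediction minus observed value.
errors = A @ w - y
sse = np.sum(errors ** 2)
print("slope:", slope, "intercept:", intercept, "SSE:", sse)
```

Any other choice of slope and intercept gives a larger sum of squared errors on this data, which is what "best fitting" means here.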

Once you have the equation, you can use it to calculate an expected value for feature y, given a value of feature x. If the x values you plug in lie within the range of x values you trained the regression on, this is called interpolation, or even approximation, because you do have actual observed values of y for data in that range. When you use the function to calculate a y-value outside the bounds of your training data's x-domain, that is called extrapolation.
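The distinction can be checked mechanically by comparing a new x against the range seen in training. The following is a small sketch; the data and the helper names `predict` and `is_extrapolation` are hypothetical.

```python
import numpy as np

# Hypothetical training data observed only for x in [0, 10].
x_train = np.linspace(0, 10, 20)
y_train = 3.0 * x_train + 5.0

# Fit a straight line (degree-1 polynomial) to the training data.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

def is_extrapolation(x, x_seen):
    """True when x falls outside the range of x values seen in training."""
    return x < x_seen.min() or x > x_seen.max()

for x_new in (4.5, 25.0):
    kind = "extrapolation" if is_extrapolation(x_new, x_train) else "interpolation"
    print(x_new, predict(x_new), kind)
```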

It is an extremely well-understood and interpretable technique that runs very fast and produces reasonable results as long as you do not extrapolate too far away from your training data.

You can use linear regression if your features display a significant correlation. The stronger the correlation, that is, the closer it is to +1 or -1, the more accurate the linear regression model for your data will be. The questions linear regression helps you answer are which independent input features relate to the dependent output feature, and the degree of that relationship.
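One way to check this before fitting is to compute the Pearson correlation between a feature and the target. The sketch below uses made-up data; for a single-feature regression with an intercept, the R2 of the fit equals the square of that correlation.

```python
import numpy as np

# Hypothetical data: one target strongly tied to x, one unrelated to x.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y_strong = 4.0 * x + rng.normal(scale=1.0, size=200)
y_weak = rng.normal(scale=1.0, size=200)

# Pearson correlation coefficients, each in [-1, +1].
r_strong = np.corrcoef(x, y_strong)[0, 1]
r_weak = np.corrcoef(x, y_weak)[0, 1]

# R2 of a one-feature linear fit on the strongly correlated target.
slope, intercept = np.polyfit(x, y_strong, deg=1)
pred = slope * x + intercept
ss_res = np.sum((y_strong - pred) ** 2)
ss_tot = np.sum((y_strong - y_strong.mean()) ** 2)
r2_strong = 1.0 - ss_res / ss_tot

print("r (strong):", r_strong, "r (weak):", r_weak)
print("R2 (strong):", r2_strong, "r squared:", r_strong ** 2)
```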

The normal case is a continuous feature as an independent variable. In that case, each weight coefficient tells you how much your output, the dependent feature, changes for one unit of change in that independent feature, with the other features held fixed. That value is the actual weight.
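As an illustration, the sketch below fits scikit-learn's `LinearRegression` on synthetic data with known weights (3 and -2, chosen arbitrarily) and confirms that raising the first feature by one unit moves the prediction by exactly its coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y = 3*x1 - 2*x2 + 7.
rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 7.0

model = LinearRegression().fit(X, y)

# Increase the first feature by one unit and hold the second fixed.
base = np.array([[1.0, 1.0]])
bumped = base + np.array([[1.0, 0.0]])
delta = model.predict(bumped)[0] - model.predict(base)[0]

print("coefficients:", model.coef_)
print("change in prediction:", delta)
```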

Categorical data must first be numerically encoded. The weight coefficient of each encoded feature then tells you the effect on the output of that category being present versus absent.
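A minimal sketch of this, assuming a made-up pricing dataset with one continuous feature (`scoops`) and one two-level categorical feature (`flavors`) encoded as a 0/1 indicator. The names and the price formula are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 20 vanilla orders, then 20 chocolate orders.
flavors = np.repeat(["vanilla", "chocolate"], 20)
scoops = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), 10)

# Numerically encode the category as a 0/1 indicator column.
is_chocolate = (flavors == "chocolate").astype(float)

# Assumed pricing rule: chocolate costs 0.75 more than vanilla.
price = 1.0 + 1.5 * scoops + 0.75 * is_chocolate

X = np.column_stack([scoops, is_chocolate])
model = LinearRegression().fit(X, price)

# Same number of scoops, with and without the category present.
with_choc = model.predict(np.array([[2.0, 1.0]]))[0]
without_choc = model.predict(np.array([[2.0, 0.0]]))[0]
diff = with_choc - without_choc

print("indicator weight:", model.coef_[1], "prediction difference:", diff)
```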

One thing to keep in mind is that the R2 coefficient measured on your training data increases the more features you include in your linear regression, even if those features don't correlate well with your dependent feature's values. Because of this, be selective about which features you use and keep just the subset of the most promising ones; otherwise you risk overfitting.
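The effect is easy to reproduce. The sketch below, on synthetic data, adds 20 columns of pure noise to a one-feature model: the training R2 cannot go down, while the held-out R2 will often be worse. The data sizes and the helper `make_data` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

def make_data(n):
    """One informative feature, 20 junk features, and a noisy target."""
    x = rng.uniform(0, 10, size=(n, 1))
    junk = rng.normal(size=(n, 20))
    y = 2.0 * x[:, 0] + rng.normal(scale=3.0, size=n)
    return x, junk, y

x_tr, junk_tr, y_tr = make_data(40)
x_te, junk_te, y_te = make_data(40)

X_tr_all = np.hstack([x_tr, junk_tr])
X_te_all = np.hstack([x_te, junk_te])

small = LinearRegression().fit(x_tr, y_tr)
large = LinearRegression().fit(X_tr_all, y_tr)

r2_train_small = small.score(x_tr, y_tr)
r2_train_large = large.score(X_tr_all, y_tr)

print("train R2, 1 feature:  ", r2_train_small)
print("train R2, 21 features:", r2_train_large)
print("test R2, 1 feature:   ", small.score(x_te, y_te))
print("test R2, 21 features: ", large.score(X_te_all, y_te))
```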

Under the hood, linear regression examines the relationship between the mean value of your output variable and your input variables. So if you're trying to model stock market security price as a function of the date, linear regression will only factor in the average stock price taken at different date intervals. If you wanted to know what the highest and lowest values were for any date interval, linear regression wouldn't be able to provide that.

Above all, the major thing to watch out for when using linear regression (even more so than its sensitivity to outliers) is that it assumes your observations are independent of one another. If your dataset holds multiple observations, the model assumes that the feature values of one sample have nothing to do with the values of another. This is often not the case. In the stock market example above, one company's stock price fluctuations often have a ripple effect on other companies in the same market.

The last thing to watch out for with linear regression is that the further you extrapolate from the range of your training data, the less reliable the results of the regression become. Keep these thoughts in mind while using linear regression!

Finally, linear regression works with continuous data, as well as categorical data once numerically encoded. If you do end up using categorical data and have multiple dummy boolean columns you want to calculate, for example IceCream_Vanilla, IceCream_Chocolate, and IceCream_CookiesNCream, then you should calculate all three of these target regression lines simultaneously. Just increase the number of targets, that is, the dimension of your training labels. Each regression calculation will then have its own offset stored in the `.intercept_` array attribute, and the `.coef_` attribute will become an array of arrays, one per target. In scikit-learn, this is called multi-output linear regression.
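A short sketch of the multi-output case, assuming two input features and three arbitrary synthetic targets. It only demonstrates the shapes of `coef_` and `intercept_` after fitting on a 2-D label array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 50 samples, 2 features, 3 targets fitted at once.
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(50, 2))
Y = np.column_stack([
    X[:, 0] + 1.0,           # target 0
    2.0 * X[:, 1] - 1.0,     # target 1
    X[:, 0] - X[:, 1],       # target 2
])

model = LinearRegression().fit(X, Y)

# One row of weights and one offset per target.
print(model.coef_.shape)       # (3, 2)
print(model.intercept_.shape)  # (3,)
print(model.predict(X[:1]).shape)  # (1, 3)
```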


```python
import numpy as np
from sklearn import linear_model

# X_train, y_train, X_test, and y_test are assumed to be defined already
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# R2 score on the test set; 1.0 is a perfect fit
r2 = model.score(X_test, y_test)

# Sum of squared errors: square each residual, then sum
sse = np.sum((model.predict(X_test) - y_test) ** 2)
```

```python
from sklearn import linear_model

# n_jobs=-1 asks scikit-learn to use all available CPU cores
model = linear_model.LinearRegression(n_jobs=-1)
```

```python
import matplotlib.pyplot as plt

def drawLine(model, X_test, y_test, title):
    # Plot the test observations against the fitted regression line
    # and show the R2 coefficient in the plot title
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(X_test, y_test, c='g', marker='o')
    ax.plot(X_test, model.predict(X_test), color='orange', linewidth=1, alpha=0.7)

    # The model expects a 2-D input, hence the nested list
    print("Est 2014 " + title + " Life Expectancy: ", model.predict([[2014]])[0])
    print("Est 2030 " + title + " Life Expectancy: ", model.predict([[2030]])[0])
    print("Est 2045 " + title + " Life Expectancy: ", model.predict([[2045]])[0])

    score = model.score(X_test, y_test)
    title += " R2: " + str(score)
    ax.set_title(title)

    plt.show()
```

```python
import numpy as np
import matplotlib.pyplot as plt

def drawPlane(model, X_test, y_test, title, R2):
    # This convenience method will take care of plotting your
    # test observations, comparing them to the regression plane,
    # and displaying the R2 coefficient
    fig = plt.figure()
    # Create the 3D axes through the figure so they are attached to it
    ax = fig.add_subplot(111, projection='3d')
    ax.set_zlabel('prediction')

    # You might have passed in a DataFrame, a Series (slice),
    # an NDArray, or a Python List... so let's keep it simple:
    X_test = np.array(X_test)
    col1 = X_test[:, 0]
    col2 = X_test[:, 1]

    # Set up a Grid. We could have predicted on the actual
    # col1, col2 values directly; but that would have generated
    # a mesh with WAY too fine a grid, which would have detracted
    # from the visualization
    x_min, x_max = col1.min(), col1.max()
    y_min, y_max = col2.min(), col2.max()
    x = np.arange(x_min, x_max, (x_max - x_min) / 10)
    y = np.arange(y_min, y_max, (y_max - y_min) / 10)
    x, y = np.meshgrid(x, y)

    # Predict based on possible input values that span the domain
    # of the x and y inputs:
    z = model.predict(np.c_[x.ravel(), y.ravel()])
    z = z.reshape(x.shape)

    ax.scatter(col1, col2, y_test, c='g', marker='o')
    ax.plot_wireframe(x, y, z, color='orange', alpha=0.7)

    title += " R2: " + str(R2)
    ax.set_title(title)
    print(title)
    print("Intercept(s): ", model.intercept_)

    plt.show()
```