
Regression: using data to build a model that predicts a numerical output from a set of numerical inputs.

## 1. Parametric regression

Building a model that is represented by a fixed set of parameters.
- e.g. polynomial regression
- Biased: you start with a guess about what kind of equation the underlying model is. This can be good if you do know the form.
- Don't need to store original data so more space-efficient
- Can't easily update the model as more data is gathered: the model usually has to be retrained from scratch.
- Training is slow, querying is fast.
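
As a rough illustration of the points above, the sketch below fits a quadratic by least squares and then discards the training data, keeping only the parameters. The data, the chosen degree, and the `predict` helper are assumptions made up for this example.

```python
import numpy as np

# Illustrative data (made up for this sketch): noisy samples of a quadratic.
rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 50)
y_train = 2.0 * x_train**2 - 1.0 * x_train + 0.5 + rng.normal(0, 0.1, x_train.size)

# "Training": solve the least-squares problem for the polynomial coefficients.
degree = 2
A = np.vander(x_train, degree + 1)  # columns are x**2, x**1, x**0
params, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# The training data is no longer needed; the whole model is `params`.
del x_train, y_train


def predict(x_query, params):
    """Evaluate the fitted polynomial (highest power first) at x_query."""
    return np.polyval(params, x_query)


print(params)                # three numbers describe the model
print(predict(1.5, params))  # querying is a single polynomial evaluation
```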

## 2. K Nearest Neighbour (KNN)

**Instance-based methods:** A data-centric approach: keep the data and use it when we make a query. (Best when you don't have a guess for what the underlying mathematical model might look like, because instance-based methods can fit any shape.)
**Take mean of k nearest neighbours' y-value.**
Repeat across the x-axis
- Interpolates smoothly around datapoints.
- Unbiased: Avoid having to assume a particular model. Good for fitting complex datasets where we don't know what the underlying model is like.
- Hard to apply with a large dataset (takes up a lot of memory)
- New data can be added easily
- Training is fast, querying is potentially slow.
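
A minimal sketch of the idea for one-dimensional inputs, assuming absolute difference as the distance measure. The function name `knn_predict` and the sample data are hypothetical.

```python
import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Predict y at each query point as the mean y of the k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Pairwise absolute distances, shape (n_query, n_train).
    dists = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k smallest distances for each query point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Every neighbour contributes equally to the prediction.
    return y_train[nearest].mean(axis=1)


x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [1.0, 4.0, 9.0, 16.0, 25.0]
print(knn_predict(x_train, y_train, [2.1, 4.6], k=2))  # [ 6.5 20.5]
```

"Training" here is just storing `x_train` and `y_train`; all of the work happens at query time, which is why queries slow down as the dataset grows.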

## 3. Kernel Regression

Weight each datapoint's contribution according to how far it is from the query point. In KNN, by contrast, each of the k neighbours gets essentially equal weight.
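
One common way to do this weighting is a Gaussian kernel with a weighted average (the Nadaraya-Watson form). The notes do not say which kernel is meant, so the kernel choice, the `bandwidth` parameter, and the name `kernel_predict` are assumptions for this sketch.

```python
import numpy as np


def kernel_predict(x_train, y_train, x_query, bandwidth=1.0):
    """Distance-weighted average of y_train using a Gaussian kernel (1-D inputs)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Signed differences between each query and each training point.
    diffs = x_query[:, None] - x_train[None, :]
    # Closer points get weights near 1, distant points get weights near 0.
    weights = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    # Weighted mean of the training targets for each query point.
    return (weights * y_train).sum(axis=1) / weights.sum(axis=1)


x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [1.0, 4.0, 9.0, 16.0, 25.0]
print(kernel_predict(x_train, y_train, [2.1, 4.6], bandwidth=0.5))
```

Unlike KNN, every training point contributes to every prediction; a smaller `bandwidth` makes the fit follow nearby points more closely.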

## Numpy Polyfit

`numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)`

Least squares polynomial fit.

Fit a polynomial p(x) = p[0] * x**deg + ... + p[deg] of degree deg to points (x, y). Returns a vector of coefficients p that minimises the squared error.

**Parameters:**
- `x` : array_like, shape (M,). x-coordinates of the M sample points (x[i], y[i]).
- `y` : array_like, shape (M,) or (M, K). y-coordinates of the sample points. Several data sets of sample points sharing the same x-coordinates can be fitted at once by passing in a 2D-array that contains one dataset per column.
- `deg` : int. Degree of the fitting polynomial.
- `rcond` : float, optional. Relative condition number of the fit. Singular values smaller than this relative to the largest singular value will be ignored. The default value is len(x)*eps, where eps is the relative precision of the float type, about 2e-16 in most cases.
- `full` : bool, optional. Switch determining nature of return value. When it is False (the default) just the coefficients are returned; when True, diagnostic information from the singular value decomposition is also returned.
- `w` : array_like, shape (M,), optional. Weights to apply to the y-coordinates of the sample points.
- `cov` : bool, optional. Return the estimate and the covariance matrix of the estimate. If full is True, then cov is not returned.

**Returns:**
- `p` : ndarray, shape (deg + 1,) or (deg + 1, K). Polynomial coefficients, highest power first. If y was 2-D, the coefficients for the k-th data set are in p[:,k].
- `residuals, rank, singular_values, rcond` : Present only if full = True. Residuals of the least-squares fit, the effective rank of the scaled Vandermonde coefficient matrix, its singular values, and the specified value of rcond. For more details, see linalg.lstsq.
- `V` : ndarray, shape (deg + 1, deg + 1) or (deg + 1, deg + 1, K). Present only if full = False and cov = True. The covariance matrix of the polynomial coefficient estimates. The diagonal of this matrix holds the variance estimates for each coefficient. If y is a 2-D array, then the covariance matrix for the k-th data set is in V[:,:,k].

**Warns:**
- `RankWarning` : The rank of the coefficient matrix in the least-squares fit is deficient. The warning is only raised if full = False. The warnings can be turned off with `import warnings` followed by `warnings.simplefilter('ignore', np.RankWarning)` (in NumPy 2.0 and later the class is `np.exceptions.RankWarning`).

**See also:**
- `polyval` : Computes polynomial values.
- `linalg.lstsq` : Computes a least-squares fit.
- `scipy.interpolate.UnivariateSpline` : Computes spline fits.


This function takes lists of x and y values and a degree, and outputs the fitted polynomial's coefficients as a list p = [p[0],p[1],...,p[degree]], highest power first, as in the model above.
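
A short sketch of the coefficient ordering and the multi-dataset form described above. The sample points are made up and lie exactly on known polynomials so the recovered coefficients are easy to check.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x**2 - 2.0 * x + 1.0  # points on an exact quadratic

# Coefficients come back highest power first: [p[0], p[1], p[2]].
p = np.polyfit(x, y, 2)
print(p)                    # approximately [ 3. -2.  1.]
print(np.polyval(p, 5.0))   # approximately 66.0

# Two datasets sharing the same x, one per column of Y.
Y = np.column_stack([y, 0.5 * x + 4.0])
P = np.polyfit(x, Y, 2)
print(P.shape)              # (3, 2): column k holds the fit for dataset k
print(P[:, 1])              # approximately [0.  0.5 4. ]
```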

Another tool you may see or use in the future is the scikit-learn preprocessing class `PolynomialFeatures`, described in the scikit-learn documentation. It adds features to a dataset which are quadratic (or higher) combinations of the previous features.
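
To make "quadratic combinations of the previous features" concrete, here is a hand-written NumPy sketch. It is not scikit-learn's implementation, and `quadratic_features` is a hypothetical helper; for two features a and b it produces the columns 1, a, b, a², ab, b², which should correspond to what a degree-2 expansion generates.

```python
import numpy as np
from itertools import combinations_with_replacement


def quadratic_features(X):
    """Return a bias column, the original features, and all degree-2 products."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    columns = [np.ones(n_samples)]                       # bias term
    columns.extend(X[:, j] for j in range(n_features))   # degree-1 terms
    # Degree-2 terms: every product X_i * X_j with i <= j.
    for i, j in combinations_with_replacement(range(n_features), 2):
        columns.append(X[:, i] * X[:, j])
    return np.column_stack(columns)


X = np.array([[2.0, 3.0]])
print(quadratic_features(X))  # [[1. 2. 3. 4. 6. 9.]]
```

A linear model fitted on these expanded columns is then a quadratic model in the original features.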


In [None]:
#
#
# Regression and Classification programming exercises
#
#


#
#	In this exercise we will be taking a small data set and computing a linear function
#	that fits it, by hand.
#	

#	the data set

import numpy as np

sleep = [5,6,7,8,10]
scores = [65,51,75,75,86]


def compute_regression(sleep,scores):

    #	First, compute the average amount of each list

    avg_sleep = np.mean(sleep)
    avg_scores = np.mean(scores)

    #	Then normalize the lists by subtracting the mean value from each entry

    normalized_sleep  = [s - avg_sleep for s in sleep]
    normalized_scores = [s - avg_scores for s in scores]
    print(normalized_sleep)
    #	Compute the slope of the line by taking the sum over each student
    #	of the product of their normalized sleep times their normalized test score.
    #	Then divide this by the sum of squares of the normalized sleep times.

    
    slope =  np.dot(normalized_sleep, normalized_scores) / np.dot(normalized_sleep, normalized_sleep)
    #	Finally, We have a linear function of the form
    #	y - avg_y = slope * ( x - avg_x )
    #	Rewrite this function in the form
    #	y = m * x + b
    #	Then return the values m, b

    m = slope
    b = - slope * avg_sleep + avg_scores

    print("m, b =", m, b)
    return m,b


if __name__=="__main__":
    m,b = compute_regression(sleep,scores)
    print("Your linear model is y={}*x+{}".format(m,b))
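
As a sanity check on the hand computation, the same slope and intercept can be compared against a degree-1 `np.polyfit`. This sketch repeats the closed-form steps inline so that it runs on its own; the variable names are chosen for this example.

```python
import numpy as np

sleep = np.array([5, 6, 7, 8, 10], dtype=float)
scores = np.array([65, 51, 75, 75, 86], dtype=float)

# Closed-form simple linear regression, as in the exercise above.
dx = sleep - sleep.mean()
dy = scores - scores.mean()
m_hand = np.dot(dx, dy) / np.dot(dx, dx)
b_hand = scores.mean() - m_hand * sleep.mean()

# The same line from a degree-1 least-squares polynomial fit.
m_fit, b_fit = np.polyfit(sleep, scores, 1)

print(m_hand, b_hand)
print(m_fit, b_fit)  # should agree with the line above up to rounding
```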

In [None]:
#
#	Polynomial Regression
#
#	In this exercise we will examine more complex models of test grades as a function of 
#	sleep using numpy.polyfit to determine a good relationship and incorporating more data.
#
#
#   at the end, store the coefficients of the polynomial you found in coeffs
#

import numpy as np

sleep = [5,6,7,8,10,12,16]
scores = [65,51,75,75,86,80,0]

coeffs = np.polyfit(sleep, scores, 2)

print(coeffs)
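
One way to use the fitted coefficients is to wrap them in `np.poly1d`, which makes the model callable. The sketch below also locates the vertex of the fitted parabola; the query value of 9 hours is an arbitrary choice for illustration.

```python
import numpy as np

sleep = [5, 6, 7, 8, 10, 12, 16]
scores = [65, 51, 75, 75, 86, 80, 0]

coeffs = np.polyfit(sleep, scores, 2)
model = np.poly1d(coeffs)  # callable polynomial, highest power first

# Predicted score for a hypothetical student who slept 9 hours.
print(model(9))

# For y = a*x**2 + b*x + c the turning point is at x = -b / (2a).
a, b, c = coeffs
best_sleep = -b / (2 * a)
print(best_sleep, model(best_sleep))
```

Because the last data point (16 hours, score 0) pulls the curve down, the fitted parabola opens downward and the vertex is a maximum.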