# Project 1: Linear and Polynomial Multivariate Regression

This notebook estimates a car's MPG from other data about the car. It reads its data from a whitespace-delimited text file (`auto-mpg.data`) and stores it in a Pandas DataFrame. Basic imputation is performed to replace the missing values found in the horsepower column, and the data is standardized. Both linear and polynomial multivariate regression algorithms are used to predict the MPG of the car.

In [None]:
import pandas as pd
import numpy as np

## Open the file into a Pandas DataFrame

In [None]:
def create_data_frame(fname):
    # sep=r"\s+" splits on runs of whitespace (replaces the deprecated delim_whitespace=True)
    data = pd.read_table(fname, header=None, sep=r"\s+",
                         names=["mpg", "cylinders", "displacement", "horsepower",
                                "weight", "acceleration", "model year", "origin", "car name"])
    return data

In [None]:
data = create_data_frame("auto-mpg.data")
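
As an alternative to cleaning the horsepower column after loading, the missing entries could be converted to NaN while the file is read. The sketch below assumes that missing values are written as `?` (as the `'?'` check in `get_stats` suggests) and uses two made-up rows in memory instead of the real file. The names `COLUMNS` and `load_with_nan` are illustrative and not part of the notebook.

```python
import io
import pandas as pd

COLUMNS = ["mpg", "cylinders", "displacement", "horsepower",
           "weight", "acceleration", "model year", "origin", "car name"]

def load_with_nan(source):
    # sep=r"\s+" splits on runs of whitespace; quoted car names stay intact.
    # na_values="?" is an assumption about how missing values are encoded.
    return pd.read_csv(source, header=None, sep=r"\s+",
                       names=COLUMNS, na_values="?")

# Two made-up rows in the same layout as the data file (not real data)
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "car a"\n'
    '25.0 4 98.00 ? 2046. 19.0 71 1 "car b"\n'
)
df = load_with_nan(sample)
print(df["horsepower"].isna().sum())  # number of missing horsepower entries
```

Loading this way would give a numeric horsepower column from the start, so later steps would not need to handle strings.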

## Contents of `auto-mpg.data`

Contents are listed as pairs of column names and the type of data in the column:

1. __mpg__:           continuous
2. __cylinders__:    multi-valued discrete
3. __displacement__:  continuous
4. __horsepower__:    continuous
5. __weight__:        continuous
6. __acceleration__:  continuous
7. __model year__:    multi-valued discrete
8. __origin__:        multi-valued discrete
9. __car name__:      string (unique for each instance)

There are 398 rows (instances), each with these 9 attributes. The horsepower column is known to contain 6 missing values, which the cleaning code below treats as NaN.

The following cell shows the first 10 rows of the data.

In [None]:
data.head(10)

## Imputation

Missing values in the _horsepower_ column are replaced with the column mean, so that no rows have to be discarded.

In [None]:
def clean_Nan(data):
    # Replaces missing or unparseable entries with the column mean, in place.
    # The last column (car name) is skipped.
    num_rows, num_cols = data.shape
    numeric_types = (int, float, np.integer, np.floating)
    for col in range(num_cols-1):
        elem_list = []  # (row, col) positions that need imputing
        col_sum = 0
        num_items = 0
        for row in range(num_rows):
            value = data.iloc[row, col]
            # iloc returns NumPy scalars for numeric columns, so use isinstance
            # rather than an exact type comparison
            if isinstance(value, numeric_types):
                if np.isnan(value):
                    elem_list.append((row, col))
                else:
                    col_sum += value
                    num_items += 1
            elif isinstance(value, str):
                # Strings that cannot be parsed as numbers count as missing
                try:
                    fdata = float(value)
                except ValueError:
                    fdata = np.nan
                if np.isnan(fdata):
                    elem_list.append((row, col))
                else:
                    data.iloc[row, col] = fdata
                    col_sum += fdata
                    num_items += 1
        if num_items > 0:
            avg = col_sum / num_items
            for r, c in elem_list:
                data.iloc[r, c] = avg
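
The same mean imputation can be expressed with vectorized pandas operations. This is a minimal sketch on a toy frame, not the notebook's implementation; `impute_with_mean` and the toy values are illustrative only.

```python
import pandas as pd

def impute_with_mean(df, columns):
    # Returns a copy in which unparseable or missing entries of the given
    # columns are replaced by that column's mean.
    out = df.copy()
    for col in columns:
        numeric = pd.to_numeric(out[col], errors="coerce")  # e.g. "?" becomes NaN
        out[col] = numeric.fillna(numeric.mean())            # mean() skips NaN
    return out

toy = pd.DataFrame({"horsepower": ["100", "?", "140"],
                    "weight": [2000.0, 2500.0, 3000.0]})
filled = impute_with_mean(toy, ["horsepower", "weight"])
print(filled)
```

Unlike `clean_Nan`, this version leaves the input untouched and returns a new frame.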

The `car name` column is dropped because it provides no useful information for the algorithm, and a constant column of ones is added so that the regression can fit a bias term.

In [None]:
clean_Nan(data)
data = data.iloc[:, :-1].copy()  # drop "car name"; copy so the new column is added to an independent frame
data["const"] = 1
data

## Generate Statistics

Statistics are obtained for the imputed data to aid in determining a standardization process. The following statistics are calculated:
* Mean
* Standard Deviation
* Min/Max
* Quartiles
* Number of Entries

In [None]:
def get_stats(data):
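    # Makes an 8x8 array of statistics (one row per data column, one column per statistic)
    # Note: the last column of `data` (the constant, once car names are dropped) is excluded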
    stats = np.empty([8,8])
    df = data.values[:,:-1]
    inds = np.asarray(np.where(df == '?'))
    for r, c in inds.T:
        df[r, c] = np.nan
    df = df.astype(float)
    stats[:,0] = np.mean(df, axis=0)
    stats[:,1] = np.std(df, axis=0)
    stats[:,2] = df.min(axis=0)
    stats[:,3] = df.max(axis=0)
    stats[:,4] = np.percentile(df, 25, axis=0)
    stats[:,5] = np.percentile(df, 50, axis=0)
    stats[:,6] = np.percentile(df, 75, axis=0)
    stats[:,7].fill(df.shape[0])
    stats = pd.DataFrame(stats, index=data.columns[:-1], columns=["Mean", "Std", "Min", "Max", "25 Percentile", "50 Percentile", "75 Percentile", "Num Elems"])
    return stats
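
For comparison, pandas can produce a similar summary with `DataFrame.describe`. The sketch below uses a toy frame rather than the notebook's data. One difference worth noting: `describe` reports the sample standard deviation (`ddof=1`), whereas `np.std` as used in `get_stats` defaults to the population value (`ddof=0`).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 10.0, 20.0, 20.0]})

# One row per column: count, mean, std, min, 25%, 50%, 75%, max
summary = toy.describe().T

# Add the population standard deviation to match np.std's default
summary["std_population"] = toy.std(ddof=0)
print(summary)
```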

In [None]:
stats = get_stats(data)

In [None]:
print(stats)

## Standardization

For standardization, each value will be replaced with its z-score.

In [None]:
def standardize(data, stats):
    for label in stats.index:
        data[label] = data[label].apply(lambda x: (x - stats.loc[label, "Mean"]) / stats.loc[label, "Std"])

In [None]:
standardize(data, stats)
data
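
The z-score of a value $x$ in a column with mean $\mu$ and standard deviation $\sigma$ is $z = (x - \mu) / \sigma$. As a rough illustration, the same transformation can be applied to a whole NumPy array at once. The helper `zscore_columns` and the toy array are assumptions for this sketch, not part of the notebook.

```python
import numpy as np

def zscore_columns(values):
    # values: 2-D float array; each column is standardized independently.
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # population standard deviation (ddof=0)
    return (values - mean) / std

toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
z = zscore_columns(toy)
print(z.mean(axis=0), z.std(axis=0))
```

After standardization every column has mean 0 and standard deviation 1 (up to floating-point error), which puts features with very different units on a comparable scale.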

## Split Data into Training and Testing Sets

The data is divided so that 80% of it is used for training, and the remaining 20% is used for testing.

The data is divided randomly to prevent bias.

In [None]:
num_rows = data.shape[0]
div = num_rows // 5
train_max = 4 * div
inds = np.random.choice(range(num_rows), size=train_max, replace=False)
test_inds = [i for i in range(num_rows) if i not in inds]
train = data.iloc[inds.tolist(), :]
test = data.iloc[test_inds, :]
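
Because the split is random, the reported errors change from run to run. If reproducible results are wanted, a seeded generator could be used. The following is a sketch under that assumption; `split_indices` and the seed value are hypothetical, and the split sizes follow the same integer arithmetic as the cell above.

```python
import numpy as np

def split_indices(num_rows, seed=0):
    # Shuffle all row indices once, then cut at 4/5 of the rows
    # (computed with integer division, as in the notebook).
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_rows)
    cut = 4 * (num_rows // 5)
    return shuffled[:cut], shuffled[cut:]

train_idx, test_idx = split_indices(398)
print(len(train_idx), len(test_idx))
```

Slicing a single permutation also guarantees that the two index sets are disjoint without a membership test per row.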

## Split Data into Inputs and Outputs

The output data is separated from the input data, and all data is converted to `numpy` arrays of floats to simplify later calculations.

In [None]:
X_train = train.loc[:, "cylinders":"const"].values.astype(float)
r_train = train.loc[:, "mpg"].values.astype(float)
X_test = test.loc[:, "cylinders":"const"].values.astype(float)
r_test = test.loc[:, "mpg"].values.astype(float)

## Training for Linear Regression

A standard multivariate linear regression algorithm is used. The equation for the weights is as follows:
$$
w = (X^{T}X)^{-1}X^{T}r
$$

In [None]:
def linreg_train(X, r):
    return np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T, X)), X.T), r)
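
Explicitly inverting $X^{T}X$ can be numerically fragile when columns are nearly collinear. One common alternative is to let NumPy solve the least-squares problem directly. The sketch below compares both approaches on synthetic data; `linreg_train_lstsq` and the generated data are illustrative assumptions.

```python
import numpy as np

def linreg_train_lstsq(X, r):
    # Solves min_w ||Xw - r||^2 without forming (X^T X)^{-1} explicitly.
    weights, *_ = np.linalg.lstsq(X, r, rcond=None)
    return weights

rng = np.random.default_rng(0)
# Two random features plus a constant column of ones as the last column
X = np.column_stack((rng.normal(size=(50, 2)), np.ones(50)))
true_w = np.array([2.0, -1.0, 0.5])
r = X @ true_w  # noise-free targets

w_lstsq = linreg_train_lstsq(X, r)
w_normal = np.linalg.inv(X.T @ X) @ X.T @ r  # normal-equation solution
print(np.allclose(w_lstsq, w_normal))
```

On well-conditioned data such as this, both approaches should agree closely.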

The prediction function ensures that the input ends with a constant 1, appending one if necessary, so that the last weight acts as the bias term.

In [None]:
def linreg_predict(X, weights):
    if len(X) == len(weights):
        X_pred = X[:]
    elif len(X) == len(weights)-1:
        X_pred = np.append(X, 1)
    else:
        raise TypeError("weights (size {}) and X (size {}) have incompatible sizes.\nSizes should either be the same, or X should be one element smaller than weights.".format(len(weights), len(X)))
    return np.dot(weights, X_pred)

Calculates the mean squared error given input `X` and expected output `r`.

In [None]:
def error_linreg(X, r, weights):
    scores = []
    for data, result in zip(X, r):
        y = linreg_predict(data, weights)
        scores.append((y-result)**2)
    scores = np.array(scores)
    lsquare_error = np.average(scores)
    return lsquare_error
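
When every row of `X` already ends with the constant 1, the same error can be computed without a Python loop. This is a minimal sketch with hand-picked toy values; the function name `mse` is an assumption.

```python
import numpy as np

def mse(X, r, weights):
    # Assumes each row of X already contains the constant column.
    predictions = X @ weights
    return np.mean((predictions - r) ** 2)

X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
weights = np.array([2.0, 0.0])  # predicts 2 * x
r = np.array([2.0, 4.0, 7.0])   # the last target is off by 1
print(mse(X, r, weights))
```

Here the squared errors are 0, 0, and 1, so the mean squared error is 1/3.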

In [None]:
weights = linreg_train(X_train, r_train)
weights

## Check Training Error

In [None]:
lsquare = error_linreg(X_train, r_train, weights)
print("Mean Squared Error on Training = {}".format(lsquare))

## Testing for Linear Regression

Mean squared error is used as the main evaluation metric on the test set.

In [None]:
lsquare_test = error_linreg(X_test, r_test, weights)
print("Mean Squared Error on Testing = {}".format(lsquare_test))

## Training for Polynomial Regression

The multivariate polynomial regression algorithm is implemented by calculating every power of each variable from 1 up to the degree of the polynomial (e.g., for a quadratic regression, it calculates the square of each feature and keeps the original values). No cross terms between different features are generated. The extra columns are collected into a new data array, and the linear regression algorithm from above is then applied to the expanded dataset to obtain the polynomial regression's weights.

This function reads in the same type of data that was passed to the linear regression algorithm and expands it to work for the polynomial regression algorithm.

In [None]:
def _expand_data_to_degree(data, degree=2):
    try:
        num_cols = data.shape[1]-1
        final_data = np.empty((data.shape[0],0))
        for col in range(num_cols):
            for i in range(degree-1):
                new_col = np.power(data[:, col], degree-i)
                final_data = np.column_stack((final_data, new_col))
            final_data = np.column_stack((final_data, data[:, col]))
        final_data = np.column_stack((final_data, data[:, -1]))
    except IndexError:
        num_cols = len(data)-1
        final_data = np.empty((0,))
        for col in range(num_cols):
            for i in range(degree-1):
                new_col = data[col]**(degree-i)
                final_data = np.append(final_data, new_col)
            final_data = np.append(final_data, data[col])
        final_data = np.append(final_data, data[-1])
    return final_data
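
To make the column ordering concrete, the sketch below expands a single row. For each feature the powers appear from highest to lowest, and the constant stays last, so the fitted model has the form $y = w_0 + \sum_j \sum_{p=1}^{d} w_{j,p}\, x_j^{p}$. The helper `expand_row` is illustrative and only mirrors the one-dimensional branch of the function above.

```python
import numpy as np

def expand_row(row, degree=2):
    # row: 1-D array whose last entry is the constant 1.
    features, const = row[:-1], row[-1]
    expanded = []
    for x in features:
        # Highest power first, down to the original value
        expanded.extend(x ** p for p in range(degree, 0, -1))
    expanded.append(const)
    return np.array(expanded)

quadratic = expand_row(np.array([2.0, 3.0, 1.0]), degree=2)
cubic = expand_row(np.array([2.0, 3.0, 1.0]), degree=3)
print(quadratic)  # [4. 2. 9. 3. 1.]
print(cubic)      # [ 8.  4.  2. 27.  9.  3.  1.]
```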

In [None]:
def polyreg_train(X, r, degree=2):
    X_poly = _expand_data_to_degree(X, degree)
    return (linreg_train(X_poly, r), degree)

In [None]:
def polyreg_predict(X, weights):
    X_poly = _expand_data_to_degree(X, weights[1])
    return linreg_predict(X_poly, weights[0])

Calculates the mean squared error for the provided model given input `X` and expected output `r`.

In [None]:
def error_polyreg(X, r, weights):
    scores = []
    for data, result in zip(X, r):
        y = polyreg_predict(data, weights)
        scores.append((y-result)**2)
    scores = np.array(scores)
    lsquare_error = np.average(scores)
    return lsquare_error

In [None]:
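def cross_validate(X_train, r_train):
    # Repeats a random 75%/25% train/validation split 10 times and averages
    # the errors for polynomial degrees 1 through 4.
    train_errors = []
    valid_errors = []
    for degree in range(1, 5):
        train_errors.append([])
        valid_errors.append([])
    for i in range(10):
        num_tr = X_train.shape[0]
        # Size the split from the data passed in rather than from notebook globals
        div_tr = num_tr // 4
        tr_max = 3 * div_tr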
        tr_inds = np.random.choice(range(num_tr), size=tr_max, replace=False)
        tr_valid_inds = [i for i in range(num_tr) if i not in tr_inds]
        X_train_tr = X_train[tr_inds, :]
        X_valid_tr = X_train[tr_valid_inds, :]
        r_train_tr = r_train[tr_inds]
        r_valid_tr = r_train[tr_valid_inds]
        for degree in range(1, 5):
            poly_weights = polyreg_train(X_train_tr, r_train_tr, degree=degree)
            train_error = error_polyreg(X_train_tr, r_train_tr, poly_weights)
            train_errors[degree-1].append(train_error)
            valid_error = error_polyreg(X_valid_tr, r_valid_tr, poly_weights)
            valid_errors[degree-1].append(valid_error)
    avg_train_errors = []
    avg_valid_errors = []
    for te, ve in zip(train_errors, valid_errors):
        te = np.array(te)
        ve = np.array(ve)
        avg_train_errors.append(np.average(te))
        avg_valid_errors.append(np.average(ve))
    avg_train_errors = np.array(avg_train_errors)
    avg_valid_errors = np.array(avg_valid_errors)
    print("Average Training Errors: {}".format(avg_train_errors))
    print("Average Validation Errors: {}".format(avg_valid_errors))
    # Choose the lowest degree whose average validation error is less than
    # 0.005 above that of every other degree
    tolerance = 0.005
    best_degree = 0
    for i in range(4):
        if all(avg_valid_errors[i] - avg_valid_errors[j] < tolerance
               for j in range(4) if j != i):
            best_degree = i+1
            break
    print("Cross Validation suggests the best polynomial degree is {}".format(best_degree))
    return best_degree
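
The selection step at the end of `cross_validate` appears to amount to choosing the lowest degree whose average validation error lies within 0.005 of the best one, which favors simpler models when the gain from a higher degree is negligible. The sketch below isolates that rule; `pick_degree` and the error values are made up for illustration.

```python
import numpy as np

def pick_degree(avg_valid_errors, tolerance=0.005):
    # Returns the lowest 1-based degree whose error is within `tolerance`
    # of the smallest error, or 0 if no degree qualifies (e.g. NaN errors).
    errors = np.asarray(avg_valid_errors, dtype=float)
    best = errors.min()
    for degree, err in enumerate(errors, start=1):
        if err - best < tolerance:
            return degree
    return 0

# Degree 3 has the smallest error, but degree 2 is within the tolerance
print(pick_degree([0.30, 0.201, 0.20, 0.25]))
```

With these values the rule selects degree 2, since its error is only 0.001 above the minimum.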

## Check Training Error

In [None]:
degree = cross_validate(X_train, r_train)
poly_weights = polyreg_train(X_train, r_train, degree=degree)
print("\nPolynomial Weights: {}".format(poly_weights))
lsquare_poly = error_polyreg(X_train, r_train, poly_weights)
print("Mean Squared Error on Training = {}".format(lsquare_poly))

## Testing for Polynomial Regression
Mean squared error is used as the main evaluation metric on the test set.

In [None]:
lsquare_testpoly = error_polyreg(X_test, r_test, poly_weights)
print("Mean Squared Error on Testing = {}".format(lsquare_testpoly))

## Checking for Feature Importance

To gauge how important a feature is, leading columns are progressively removed from the dataset, and the training and validation errors of a model fit on the remaining columns are obtained and printed.

In [None]:
def feature_analysis(X, r, degree):
    train_errors = []
    valid_errors = []
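    for start in range(X.shape[1]):
        # Drop the first `start` feature columns (not rows); the constant column is always kept
        X_train = X[:, start:]
        num_tr = X_train.shape[0]
        # Size the split from the data passed in rather than from notebook globals
        div_tr = num_tr // 4
        tr_max = 3 * div_tr
        tr_inds = np.random.choice(range(num_tr), size=tr_max, replace=False)
        tr_valid_inds = [k for k in range(num_tr) if k not in tr_inds]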
        X_train_tr = X_train[tr_inds, :]
        X_valid_tr = X_train[tr_valid_inds, :]
        r_train_tr = r[tr_inds]
        r_valid_tr = r[tr_valid_inds]
        weights = polyreg_train(X_train_tr, r_train_tr, degree=degree)
        te = error_polyreg(X_train_tr, r_train_tr, weights)
        ve = error_polyreg(X_valid_tr, r_valid_tr, weights)
        train_errors.append(te)
        valid_errors.append(ve)
    print("Training Errors by Starting Column:\n  {}\n".format(train_errors))
    print("Validation Errors by Starting Column:\n  {}\n".format(valid_errors))

In [None]:
feature_analysis(X_train, r_train, degree)
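
Removing columns rather than rows requires slicing along the second axis. The toy sketch below shows the shapes produced when leading columns are dropped one at a time while the constant column stays last. The array is made up for illustration.

```python
import numpy as np

# 3 rows: 3 feature columns followed by a constant column
X = np.arange(12.0).reshape(3, 4)
X[:, -1] = 1.0

shapes = []
for start in range(X.shape[1]):
    X_sub = X[:, start:]  # keep all rows, drop the first `start` columns
    shapes.append(X_sub.shape)
    print(start, X_sub.shape)
```

By contrast, `X[start:]` would drop rows and leave every column in place, so the targets would no longer line up with the inputs.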