# Linear Regression

This is the second programming assignment for CSCE478/878 Introduction to Machine Learning, covering linear regression. The notebook is divided into three sections:
1. **Part A (Model Code)**
1. **Part B (Data Processing)**
1. **Part C (Model Evaluation)**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import itertools

## Part A

### 1. Implement the following function that generates the polynomial and interaction features for a given degree of the polynomial

In [2]:
"""
Takes the feature matrix X and the polynomial degree and returns
X with features of polynomial degrees and pairwise interaction
terms

Input:
    X - 2D np.ndarray feature matrix
    degree - integer > 0
    
Output:
    X - 2D np.ndarray polynomial feature matrix 
"""
def polynomialFeatures(X, degree):
    # Create x_0 as all 1s
    x_0 = np.ones((X.shape[0], 1))

    # Pairwise interaction terms x_i * x_j (i < j) of the original features
    pairs = itertools.combinations(range(X.shape[1]), 2)
    interactions = [X[:, [i]] * X[:, [j]] for i, j in pairs]

    # Element-wise powers of the original features for degrees 2..degree (inclusive)
    powers = [np.power(X, d) for d in range(2, degree + 1)]

    # Concatenate bias column, original features, powers, and interactions
    X = np.concatenate([x_0, X] + powers + interactions, axis=1)

    return X

In [3]:
polynomialFeatures(X,3)

NameError: name 'X' is not defined

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
poly.fit_transform(X)

### 2. Implement the following function to calculate and return the mean squared error (mse) of two vectors.
 

In [None]:
# Khalid
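
The cell above is still a placeholder. A minimal sketch of what such a function might look like, assuming both inputs are equal-length numeric vectors (the name `mse` and its argument names are assumptions, not taken from the assignment):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equal-shape vectors (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    # Average of the squared element-wise differences
    return float(np.mean(np.square(y_true - y_pred)))
```

This is the same quantity the later cells compute inline with `np.mean(np.square(pred - y_test))`.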

### 3. Implement the following function to plot the training and validation root mean square error (rmse) values of the data matrix X for various polynomial degrees, starting from 1 up to the value set by the argument “maxPolynomialDegree”

In [4]:
# Khalid
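
The cell above is still a placeholder. One possible shape for this step is sketched below, under several assumptions: the helper names `poly_expand`, `rmse_by_degree`, and `plot_rmse` are hypothetical, the expansion uses element-wise powers only (no interaction terms), a single train/validation split stands in for the folds, and ordinary least squares via `np.linalg.lstsq` stands in for the gradient-descent model so the sketch stays short.

```python
import numpy as np

def poly_expand(X, degree):
    """Bias column plus element-wise powers 1..degree (no interaction terms)."""
    cols = [np.ones((X.shape[0], 1))] + [X ** d for d in range(1, degree + 1)]
    return np.concatenate(cols, axis=1)

def rmse_by_degree(X_train, y_train, X_val, y_val, maxPolynomialDegree):
    """Return (degrees, train_rmse, val_rmse) for degrees 1..maxPolynomialDegree."""
    degrees = list(range(1, maxPolynomialDegree + 1))
    train_rmse, val_rmse = [], []
    for d in degrees:
        P_train, P_val = poly_expand(X_train, d), poly_expand(X_val, d)
        # Stand-in fit: ordinary least squares instead of gradient descent
        w, *_ = np.linalg.lstsq(P_train, y_train, rcond=None)
        train_rmse.append(float(np.sqrt(np.mean((P_train @ w - y_train) ** 2))))
        val_rmse.append(float(np.sqrt(np.mean((P_val @ w - y_val) ** 2))))
    return degrees, train_rmse, val_rmse

def plot_rmse(degrees, train_rmse, val_rmse):
    """Plot both RMSE curves against polynomial degree."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    plt.plot(degrees, train_rmse, marker="o", label="training")
    plt.plot(degrees, val_rmse, marker="o", label="validation")
    plt.xlabel("polynomial degree")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()
```

A widening gap between the two curves as the degree grows is the usual sign of overfitting, which is what this kind of plot is meant to expose.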

### 4. Implement a Linear_Regression model class. It should have the following three methods. Note that the “fit” method should implement the batch gradient descent algorithm.


In [208]:
'''
Linear regression with optional hyperparameters. The weight parameters of the
model are estimated with batch gradient descent.
Methods:
    fit - Given X and y, use batch gradient descent to find w
    predict - Given X, use w to return predictions
'''
class Linear_Regression():
    def __init__(self, learning_rate=0.01, epochs=100, tol=None, regularizer=None, lambd=0.0, **kwargs):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.tol = tol
        self.regularizer = regularizer
        self.lambd = lambd
        self.w = None
        
        return
    
    def fit(self, X, y):
        # Add x_0 = 1 for all for intercept and concat data
        x_0 = np.ones((X.shape[0],1))
        X = np.concatenate((x_0,X), axis=1)
        
        # Init prev_cost with infinity so the first tolerance check cannot stop early
        prev_cost = np.inf
        
        # Number of training samples
        m = len(X)
        # Initialize all weights to 0
        self.w = np.zeros((X.shape[1],))
        
        # Run batch gradient descent for at most self.epochs iterations
        for i in range(self.epochs):
            # Calculate mse with current weights
            new_cost = np.mean(np.square(X.dot(self.w) - y))
            
            # Stop when the absolute change in cost since the previous iteration is no larger than self.tol
            if self.tol is not None:
                if abs(prev_cost - new_cost) > self.tol:
                    prev_cost = new_cost
                else:
                    break
            
            # Calculate gradient
            grad = (X.T.dot(X.dot(self.w)-y))
            
            # Apply Regularization term to gradient
            if self.regularizer == "l2":
                grad = grad + self.lambd * self.w
            elif self.regularizer == 'l1':
                grad = grad + self.lambd * np.sign(self.w)
            
            # Update weights
            self.w = self.w - (self.learning_rate/m)*grad
            
    def predict(self, X):
        
        x_0 = np.ones((X.shape[0],1))
        X = np.concatenate((x_0,X), axis=1)
        
        pred = X.dot(self.w)
        
        return pred
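
For reference, the update performed by the `fit` method above can be written out as follows. The symbols are notation introduced here, not taken from the notebook: $m$ is the number of training samples, $X$ the data matrix including the column of ones, $\alpha$ the learning rate (`learning_rate`), and $\lambda$ the regularization strength (`lambd`).

```latex
\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{m}\Bigl( X^\top (X\mathbf{w} - \mathbf{y}) + \lambda\, r(\mathbf{w}) \Bigr),
\qquad
r(\mathbf{w}) =
\begin{cases}
\mathbf{w} & \text{if regularizer = l2} \\
\operatorname{sign}(\mathbf{w}) & \text{if regularizer = l1} \\
\mathbf{0} & \text{otherwise}
\end{cases}
```

Read this way, the code performs gradient descent on $J(\mathbf{w}) = \frac{1}{2m}\lVert X\mathbf{w} - \mathbf{y}\rVert^2$ plus a penalty of $\frac{\lambda}{2m}\lVert\mathbf{w}\rVert_2^2$ for l2 or $\frac{\lambda}{m}\lVert\mathbf{w}\rVert_1$ for l1. The cost tracked for the tolerance check is the plain mean squared error, without the penalty term, and the penalty as coded also applies to the intercept weight.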


In [209]:
X = X_train
y = y_train

In [210]:
lm = Linear_Regression(learning_rate=0.0001,epochs=1000,regularizer='l2',lambd = 0.01)

In [211]:
lm.fit(X,y)

In [213]:
pred = lm.predict(X_test)

In [214]:
np.mean(np.square(pred-y_test))

0.5358009728201006

In [37]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X,y)

In [38]:
reg.coef_

array([ 2.49905527e-02, -1.08359026e+00, -1.82563948e-01,  1.63312698e-02,
       -1.87422516e+00,  4.36133331e-03, -3.26457970e-03, -1.78811638e+01,
       -4.13653144e-01,  9.16334413e-01,  2.76197699e-01])

## Part B

### 5. Read in the winequality-red.csv file as a Pandas data frame.

In [31]:
df = pd.read_csv('./Data/winequality-red.csv', sep=';')

### 6. Use the techniques from the recitation to summarize each of the variables in the dataset in terms of mean, standard deviation, and quartiles. Include this in your report.

In [32]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


### 7. Shuffle the rows of your data. You can use df = df.sample(frac=1) as an idiomatic way to shuffle the data in Pandas without losing column names. Create a test dataset by randomly sampling 20% of the data. The remaining data should be used for training.

In [33]:
df = df.sample(frac=1)

In [34]:
def partition(X, y, t):
    # t is the fraction of samples assigned to the training set
    n = len(y)
    size_train = int(t * n)

    # Randomly pick the training indices without replacement
    train_index = random.sample(range(n), size_train)
    # The test indices are all indices not used for training
    test_index = sorted(set(range(n)).difference(train_index))

    # Subset train and test
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

    return X_train, X_test, y_train, y_test

In [35]:
X = np.array(df.drop(columns=['quality']))
y = np.array(df['quality'])

In [36]:
X_train, X_test, y_train, y_test = partition(X, y, 0.8)

## Part C

### 8. Model selection via Hyperparameter tuning: Use the kFold function (known as sFold function from previous assignment) to evaluate the performance of your model over each combination of lambd, learning_rate and regularizer hyperparameters from the following sets:

In [8]:
# Khalid
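
The cell above is still a placeholder, and the hyperparameter sets are not listed in this notebook. The sketch below shows one way k-fold evaluation over every hyperparameter combination could be organized. Everything in it is an assumption made for illustration: the helper names `kfold_indices`, `fit_gd`, and `grid_search_cv`, the compact gradient-descent routine (which expects X to already contain a bias column), and the values in `example_grid`, which are placeholders rather than the assignment's sets.

```python
import itertools
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def fit_gd(X, y, learning_rate, lambd, regularizer, epochs=500):
    """Batch gradient descent with an optional l1/l2 penalty; X includes a bias column."""
    w = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y)
        if regularizer == "l2":
            grad = grad + lambd * w
        elif regularizer == "l1":
            grad = grad + lambd * np.sign(w)
        w = w - (learning_rate / m) * grad
    return w

def grid_search_cv(X, y, grid, k=5):
    """Return (best_params, best_mean_validation_mse) over every combination in grid."""
    folds = kfold_indices(len(y), k)
    keys = sorted(grid)
    best_params, best_score = None, np.inf
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            w = fit_gd(X[train_idx], y[train_idx], **params)
            scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
        mean_score = float(np.mean(scores))
        # Strict comparison keeps the first combination on ties and skips NaN scores
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Placeholder grid: illustrative values only, not the assignment's sets
example_grid = {
    "lambd": [0.0, 0.1],
    "learning_rate": [0.1, 0.01],
    "regularizer": ["l1", "l2"],
}
```

Calling `grid_search_cv(X, y, example_grid)` returns the combination with the lowest mean validation MSE across the folds. A learning rate that makes gradient descent diverge produces a NaN score and is never selected.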

### 9. Evaluate your model on the test data and report the mean squared error.
 

In [None]:
# Khalid

### 10. Determine the best model hyperparameter values for the training data matrix with polynomial degree 3.


In [9]:
# Khalid

### 11. Using the plot_polynomial_model_complexity function, plot the rmse values for the training and validation folds for polynomial degrees 1, 2, 3, 4 and 5. Use the training data as input for this function. You need to choose the hyperparameter values judiciously to work on the higher-degree polynomial models.

In [10]:
# Khalid

### 12. Implement the Stochastic Gradient Descent Linear Regression algorithm. 

In [None]:
# Khalid
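
The cell above is still a placeholder. A minimal sketch of stochastic gradient descent for linear regression follows. It assumes a constant learning rate, no regularization, and one pass over a freshly shuffled dataset per epoch; the function name `sgd_linear_regression` and its parameters are hypothetical.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=50, seed=0):
    """Stochastic gradient descent for least squares; returns weights with the bias first."""
    rng = np.random.default_rng(seed)
    # Prepend the bias column x_0 = 1
    Xb = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        # Visit every sample once per epoch, in a fresh random order
        for i in rng.permutation(len(y)):
            error = Xb[i] @ w - y[i]
            # Gradient of 0.5 * error**2 with respect to w for this single sample
            w = w - learning_rate * error * Xb[i]
    return w
```

Unlike the batch version, each update uses a single sample, so the weights change `len(y)` times per epoch, and on noisy data the cost fluctuates rather than decreasing monotonically.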