We've already seen how to implement a linear regression in which a single variable is used to predict the value of another, related variable. When we want to predict the value of a variable from more than one input variable, we need to work with matrices.

In this notebook we'll implement a multivariate linear regression. We'll only cover continuous covariates here, but the method works identically for categorical covariates; it just requires some extra processing before fitting the model!
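As an illustration of that extra processing, here is a minimal sketch of one common approach, one-hot (dummy) encoding with `pd.get_dummies`. The toy DataFrame and its column names (`size`, `city`) are hypothetical and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical toy data: one continuous and one categorical covariate
df = pd.DataFrame({
    "size": [50.0, 80.0, 65.0, 120.0],
    "city": ["London", "Paris", "Berlin", "Paris"],
})

# Replace the categorical column with 0/1 indicator columns.
# drop_first=True drops one level (here "Berlin") so that the indicators
# are not perfectly collinear with the intercept column.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=float)

# Add the intercept column, as we do for the continuous case
encoded.insert(0, "intercept", 1.0)

print(encoded)
```

The resulting numeric matrix can then be used as the feature matrix in the same way as the continuous features generated below.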

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Generate data for multivariate regression

In [None]:
n = 1000 #Number of observations in the training set
p = 5 #Number of parameters, including intercept

#Assign True parameters to be estimated
beta = np.random.uniform(-10, 10, p) #Randomly initialise true parameters
print(beta)

In [None]:
X = np.random.uniform(0,10,(n,(p-1))) 
X0 = np.ones((n,1)) #Column of ones for the intercept

X = np.concatenate([X0,X], axis = 1) #Join intercept to other variables to form feature matrix


In [None]:
Y = np.matmul(X,beta) + np.random.normal(0,10,n) #Linear combination of the features plus a normal error term

In [None]:
#Concatenate to create dataframe

dataFeatures = pd.DataFrame(X)
dataFeatures.columns = [f'X{i}' for i in range(p)]

dataTarget = pd.DataFrame(Y)
dataTarget.columns = ['Y']

data = pd.concat([dataFeatures, dataTarget], axis = 1)


In [None]:
data.head()

# The Algebra

To fit a linear regression for a set of features $X$ and a set of targets $Y$, we compute the model parameters as:

$$\hat \beta = (X^TX)^{-1}X^TY$$

$\hat \beta$ is a $p \times 1$ vector in which each element is the estimate of the corresponding true parameter that generated the data.
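To make the formula concrete, here is a small self-contained sketch that evaluates it on synthetic data in three ways. The seed, sample size and true parameter values are arbitrary assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)

n, p = 200, 3  # illustrative sizes, not the ones used in this notebook
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p - 1))])  # intercept + features
beta_true = np.array([2.0, -1.0, 0.5])                              # arbitrary true parameters
Y = X @ beta_true + rng.normal(0, 1, n)                             # linear signal plus noise

# 1) Literal translation of the formula: (X^T X)^{-1} X^T Y
beta_hat_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# 2) Solve the linear system (X^T X) beta = X^T Y without forming the inverse
beta_hat_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# 3) NumPy's least-squares solver applied directly to X and Y
beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat_inv)
print(beta_hat_solve)
print(beta_hat_lstsq)
```

All three should agree up to floating-point error. Solving the system (option 2) or calling `np.linalg.lstsq` (option 3) is generally preferred numerically over computing the inverse explicitly.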


This estimator is derived using the same ideas as in the single-variable case, but we have to work with matrices rather than vectors. See [this link](http://home.iitk.ac.in/~shalab/regression/Chapter3-Regression-MultipleLinearRegressionModel.pdf) for a detailed derivation.
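As a brief sketch of that derivation, assuming the usual least-squares criterion (the sum of squared residuals, written here as $S(\beta)$, a symbol introduced only for this sketch), we minimise

$$S(\beta) = (Y - X\beta)^T(Y - X\beta)$$

Setting the gradient with respect to $\beta$ to zero gives

$$\frac{\partial S}{\partial \beta} = -2X^T(Y - X\beta) = 0 \quad \Longrightarrow \quad X^TX\hat \beta = X^TY$$

and, provided $X^TX$ is invertible, multiplying both sides by $(X^TX)^{-1}$ yields $\hat \beta = (X^TX)^{-1}X^TY$.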

In [None]:
class LinearRegressionMultivariate:
    
    def __init__(self, data, target, features, trainTestRatio = 0.9):
        #data - a pandas dataset 
        #target - the name of the pandas column which contains the true labels
        #features - A list containing the names of the columns which we will use to do the regression
        #trainTestRatio - the proportion of the entire dataset which we'll use for training
                    #   - the rest will be used for testing
        
        self.target = target
        self.features = features 
        
        #Split up data into a training and testing set
        self.train, self.test = train_test_split(data, test_size=1-trainTestRatio)
    
    
        
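    def fitLR(self):
        #Fit a linear regression to the training data
        X = np.array(self.train[self.features])
        Y = np.array(self.train[self.target])
        
        #Solve the normal equations (X^T X) betaHat = X^T Y, which is equivalent to
        #betaHat = (X^T X)^{-1} X^T Y but avoids forming the inverse explicitly
        self.betaHat = np.linalg.solve(np.matmul(X.T, X), np.matmul(X.T, Y))
    
    def predict(self,x):
        #Given a vector (or matrix) of new observations x, predict the corresponding target values
        return np.matmul(x, self.betaHat)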
    

In [None]:
myModel = LinearRegressionMultivariate(data, 'Y', [f'X{i}' for i in range(p)])

In [None]:
myModel.fitLR()

# Print the model estimates - there should be the right number (p) of them!

In [None]:
print(myModel.betaHat)
print(myModel.betaHat.shape) #Should contain p elements

# Predict values for the test set

In [None]:
testPred = myModel.predict(np.array(myModel.test[myModel.features]))

In [None]:
plt.scatter(myModel.test[myModel.target], testPred)
plt.xlabel('True test values')
plt.ylabel('Predicted test values')

#plot line y = x
x = np.arange(np.floor(myModel.test[myModel.target].min()), np.ceil(myModel.test[myModel.target].max()))
plt.plot(x,x,color = 'green')

plt.show()

If the points roughly follow the line y = x, that indicates the model's predictions are close to the true test values.
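The visual check can be complemented with summary numbers. Below is a minimal sketch that computes the root mean squared error (RMSE) and the coefficient of determination ($R^2$) for a set of predictions. The helper name `regression_metrics` and the toy values are hypothetical and not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, R^2) for true and predicted target values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    # RMSE: typical size of a prediction error, in the units of the target
    rmse = np.sqrt(np.mean(residuals ** 2))

    # R^2: 1 minus (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2

# Toy values for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

rmse, r2 = regression_metrics(y_true, y_pred)
print(f"RMSE = {rmse:.4f}, R^2 = {r2:.4f}")
```

With the notebook's objects, this could be called as `regression_metrics(myModel.test[myModel.target], testPred)`. Since the data were generated with a noise standard deviation of 10, a well-fitted model would be expected to give a test RMSE of roughly that size.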