# COSI123A Homework2 Nicole Meng
In this assignment, you are given a training data set (training_data.xlsx) and a test data set (test_data.xlsx). Each row in training_data.xlsx and test_data.xlsx is a sample. The first column of training_data.xlsx contains the sample IDs. The last column of training_data.xlsx contains the targets of the training samples. The remaining columns are the features of the samples. The test data file (test_data.xlsx) contains the same information except that it does not contain the targets. Use the training data set to train a linear regression model and apply the model to the test samples.

Submit: 

(a) Code with proper comments

(b) Prediction results of the test samples. Save the prediction results in a text file ('test_results.txt'). Each line in 'test_results.txt' is the prediction result of the corresponding test sample in 'test_data.xlsx'.

(c) A brief write-up explaining how you make your model robust.

# Import Libraries and Packages

In [1]:
#First, we will import the libraries used in this project
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import normalize

# Data Processing

In [2]:
#Since the two data files are given in .xlsx format, we load them into dataframes so they are easy to work with
train = pd.read_excel('training_data.xlsx')
test = pd.read_excel('test_data.xlsx')

We will take a look at the first few rows of the training and test data to get a sense of their structure.

In [3]:
train.head()

Unnamed: 0,Samples,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X270,X271,X272,X273,X274,X275,X276,X277,X278,Y
0,training sample 1,73.713,72.861,73.287,54.231,51,0.670229,9.09,3.23,5.67,...,1.7,2.12,6.8,1.7,3.28,-0.40075,-0.1971,187.84,1.57,-1.334686
1,training sample 2,54.388,54.395,54.3915,56.388,70,0.773891,6.89,1.7,3.33,...,1.09,1.09,6.75,1.7,3.38,-0.47187,-0.23242,159.88,7.34,0.434026
2,training sample 3,55.665,55.883,55.774,56.229,62,-0.739181,7.44,2.39,5.06,...,1.09,1.09,8.88,1.69,6.19,-0.39627,-0.21657,268.42,1.26,1.629256
3,training sample 4,63.92,63.925,63.9225,56.643,62,-0.739181,6.97,1.71,5.73,...,1.7,2.12,7.86,1.84,3.29,-0.38502,-0.18639,215.35,1.25,-0.307086
4,training sample 5,55.665,55.883,55.774,56.229,62,-0.739181,7.44,2.39,5.06,...,1.7,2.12,8.12,2.05,3.27,-0.44449,-0.21373,175.32,1.27,-1.147735


In [4]:
test.head()

Unnamed: 0,Samples,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X269,X270,X271,X272,X273,X274,X275,X276,X277,X278
0,test sample 1,55.665,55.883,55.774,56.229,62,-0.739181,7.44,2.39,5.06,...,3.89,2.08,2.72,7.83,1.7,3.25,-0.40553,-0.22048,217.3,-3.89
1,test sample 2,78.854,78.853,78.8535,55.161,61,-0.966118,6.94,1.73,5.72,...,3.59,1.7,2.12,11.09,1.97,3.29,-0.39109,-0.19325,293.53,1.22
2,test sample 3,52.789,52.801,52.795,56.202,64,0.920026,8.28,1.7,3.34,...,3.59,1.7,2.13,4.77,1.7,7.17,-0.4033,-0.17332,193.5,1.62
3,test sample 4,42.215,52.094,47.1545,55.769,50,-0.262375,11.22,2.01,3.5,...,6.26,1.71,4.57,8.9,1.7,4.81,-0.38733,-0.21012,300.42,-2.98
4,test sample 5,73.713,72.861,73.287,54.231,51,0.670229,9.09,3.23,5.67,...,2.58,1.09,1.09,6.76,1.7,3.39,-0.44273,-0.23633,180.99,6.67


In [5]:
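#Next, we will split the dataframes into features and targets
X = train.iloc[: , 1:-1] #training features (drop the sample ID and target columns)
X_test = test.iloc[: , 1:] #test features (drop the sample ID column)
Y = train.iloc[: , -1] #target column

#We will then normalize the data: with axis = 1 and norm = 'max', each sample (row) is divided by its largest value
X = normalize(X, axis = 1, norm = 'max')
X_test = normalize(X_test, axis = 1, norm = 'max')
#normalize returns NumPy arrays, so we wrap them in dataframes again (columns are now labeled 0, 1, 2, ...)
X = pd.DataFrame(X)
X_test = pd.DataFrame(X_test)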
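
To make the effect of `axis = 1` concrete, here is a minimal NumPy sketch on made-up, non-negative numbers. It reproduces by hand the row-wise max scaling that `normalize(X, axis = 1, norm = 'max')` is used for above, and shows a column-wise variant purely for comparison. The array `A` and both result names are hypothetical and not part of the assignment data.

```python
import numpy as np

# Toy matrix: 2 samples (rows) x 3 features (columns); values are made up.
A = np.array([[1.0, 2.0, 4.0],
              [10.0, 5.0, 20.0]])

# Row-wise max scaling: each sample is divided by its own largest absolute value.
row_scaled = A / np.abs(A).max(axis=1, keepdims=True)

# Column-wise max scaling (for comparison only): each feature is divided
# by its largest absolute value across all samples.
col_scaled = A / np.abs(A).max(axis=0, keepdims=True)

print(row_scaled)
print(col_scaled)
```

Row-wise scaling puts every sample on a common scale, while column-wise scaling puts every feature on a common scale. Which one suits a regression problem better depends on the data; the notebook uses the row-wise form.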

In [6]:
#Create the absolute correlation matrix for X and keep only its upper triangle, so each pair of features appears once
correlation = X.corr().abs()
diagonal = np.triu(np.ones(correlation.shape), 1)
select = correlation.where(diagonal.astype(bool))
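
A correlation matrix is symmetric with ones on its diagonal, so every pair of features appears twice. The sketch below, for a hypothetical 3 x 3 case, shows which entries a mask built with `np.triu(..., 1)` keeps.

```python
import numpy as np

# Strict upper triangle of a 3 x 3 matrix: k=1 excludes the diagonal.
mask = np.triu(np.ones((3, 3)), k=1).astype(bool)
print(mask)
```

Only the entries above the diagonal are `True`, so each feature pair is examined exactly once and a feature is never compared with itself.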

In [7]:
#Next, we drop every column that is highly correlated (above 0.9) with an earlier column, since it carries largely redundant information
drop_list = []
for col in select.columns:
    if any(select[col] > 0.9):
        drop_list.append(col)
#Drop the same columns from the training and test features so they stay aligned
X = X.drop(columns = drop_list)
X_test = X_test.drop(columns = drop_list)
X_after_name = list(X.columns) #labels of the remaining columns
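
The correlation filter can be checked on a small synthetic example. This is a sketch under stated assumptions: the dataframe `toy` and its columns `a`, `b`, `c` are invented for illustration, with `b` constructed as a near-duplicate of `a`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + 0.01 * rng.normal(size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                   # independent of "a" and "b"
})

# Absolute correlations, keeping only the strict upper triangle.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is dropped if it correlates above 0.9 with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Because only the upper triangle is inspected, the later column of a correlated pair is the one removed, and the earlier one is kept.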

# Linear Regression Model

In [8]:
#Now we fit a linear regression model on the remaining features
model = LinearRegression()
model.fit(X, Y)
importance = model.coef_ #fitted coefficients, used below as importance scores for feature selection

In [9]:
#For the importance-based selection, we map each remaining column label to its coefficient
score_map = dict(zip(X_after_name, importance))

#Columns whose coefficient has an absolute value below 20 are treated as unimportant
low_importance = [key for key, score in score_map.items() if abs(score) < 20]

#Remove them from both the training and test features
X = X.drop(columns = low_importance)
X_test = X_test.drop(columns = low_importance)
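
The coefficient-threshold step can be illustrated on synthetic data. This is only a sketch: `toy_X`, `toy_y`, and the coefficients 30 and 0.5 are assumptions chosen so that exactly one feature clears the threshold of 20 used above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
toy_X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x0", "x1", "x2"])
# Noise-free target: x0 has a large effect, x1 a small one, x2 none.
toy_y = 30.0 * toy_X["x0"] + 0.5 * toy_X["x1"]

toy_model = LinearRegression().fit(toy_X, toy_y)
coef = pd.Series(toy_model.coef_, index=toy_X.columns)

# Keep only the features whose absolute coefficient reaches the threshold.
kept = coef[coef.abs() >= 20].index.tolist()
toy_X_selected = toy_X[kept]

print(coef)
print(kept)
```

Note that the size of a coefficient depends on the scale of its feature, so a fixed cutoff such as 20 is only meaningful relative to how the features were scaled beforehand.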

In [10]:
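#We will make our predictions by solving the normal equations for the weight vector w:
#w = pinv(X^T X) X^T Y
#Note: this closed-form solution has no intercept term, unlike the default LinearRegression fit above
transpose_x = X.transpose()
inverse = np.linalg.pinv(np.dot(transpose_x, X)) #pseudo-inverse, so a singular X^T X does not cause an error
temp = np.dot(inverse, transpose_x)
w = np.dot(temp, Y)
Y_test = np.dot(w.T, X_test.transpose()) #one prediction per test sample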
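
As a sanity check on the closed-form solution, the sketch below solves a small synthetic least-squares problem with the same pseudo-inverse formula and compares the result with `np.linalg.lstsq`. The matrix `A`, the weights `true_w`, and the new samples `A_new` are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(40, 3))          # 40 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])   # weights used to generate the targets
b = A @ true_w                        # noise-free targets

# Normal-equation solution: w = pinv(A^T A) A^T b
w_normal = np.linalg.pinv(A.T @ A) @ A.T @ b

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Predictions for new samples are a matrix-vector product with the weights.
A_new = rng.normal(size=(5, 3))
preds = A_new @ w_normal

print(w_normal)
print(w_lstsq)
```

On well-conditioned data the two solutions agree. The pseudo-inverse matters when `A.T @ A` is singular or nearly so, which is likely when there are many correlated features.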

# Saving Result

In [11]:
prediction = list(Y_test)

#Put the final prediction output in a file named "test_results.txt"
np.savetxt('test_results.txt', prediction, fmt = "%f")
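
To confirm the output format, the sketch below writes a few made-up predictions with the same `np.savetxt` call and reads them back. The file name `toy_results.txt` and the values in `toy_pred` are hypothetical, and a temporary directory is used so the real results file is not touched.

```python
import os
import tempfile
import numpy as np

toy_pred = np.array([0.5, -1.25, 3.0])  # made-up predictions

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_results.txt")
    # Same call pattern as above: one value per line, fixed-point format.
    np.savetxt(path, toy_pred, fmt="%f")
    with open(path) as fh:
        lines = fh.read().splitlines()

print(lines)
```

Each prediction ends up on its own line, which matches the requirement that every line of the results file corresponds to one test sample.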