Since the HW grading is done in a semi-automatic manner, please adhere to the following naming format for your submission.
Each group of students (mostly pairs, with some approved exceptions) should submit a Jupyter notebook (an .ipynb file, not a .zip file) whose name is the underscore-separated ID list of all the submitters.
For example, for two submitters, the naming format is: id1_id2.ipynb.

# Question 1

a) Download the "Boston1.csv" dataset and explore the data. An explanation of the dataset can be found here: http://www.clemson.edu/economics/faculty/wilson/R-tutorial/analyzing_data.html

Find the columns with missing values and filter them out of the data.

In [22]:
import pandas as pd

df = pd.read_csv("boston1.csv")

null_columns = df.columns[df.isnull().sum() > 0].tolist()
print(f"Columns with nulls are: {null_columns}. Dropping those columns...")
df = df.drop(columns=null_columns)  # columns= already implies axis=1

Columns with nulls are: ['misData']. Dropping those columns...


b) Divide the filtered data randomly into a train set (70% of the data) and test set (30% of the data).

In [23]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.3)
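
Because the split is random, every run of the notebook produces a different partition and therefore slightly different coefficients and MSE values. The following is a minimal sketch of how a fixed seed makes the split repeatable; the small `example_df` frame and the seed value 42 are assumptions made for illustration and are not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the filtered Boston data.
example_df = pd.DataFrame({"x": range(10), "medv": range(10, 20)})

# The same random_state yields the same partition on every call.
train_a, test_a = train_test_split(example_df, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(example_df, test_size=0.3, random_state=42)

print(len(train_a), len(test_a))             # 7 3
print(train_a.index.equals(train_b.index))   # True
```

Whether to fix the seed is a judgment call: it helps when comparing results between runs, but a single fixed split can hide how much the metrics vary from split to split.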

# Question 2

If you haven't done so previously, install the scikit-learn package for Python.

a) On the train set, run a linear regression model as follows:
Divide the training set into explanatory variables (the X matrix with which we'll try to make a prediction) and a target variable (y, the value which we'll try to predict). Use the 'medv' attribute as the target variable y and the rest of the features as the X matrix. Run a linear regression model on those sets, and print the regression coefficients.

In [24]:
from sklearn.linear_model import LinearRegression

X_train = train_data.drop('medv', axis=1)
y_train = train_data['medv']

# Create and run the LinearRegression model
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

# Print the regression coefficients
regression_coefficients = pd.DataFrame({'Feature': X_train.columns, 'Coefficient': regression_model.coef_})
print(regression_coefficients)

    Feature  Coefficient
0      crim    -0.078811
1        zn     0.036640
2     indus     0.018928
3      chas     2.945335
4       nox   -15.058001
5        rm     4.708187
6       age    -0.016677
7       dis    -1.384853
8       rad     0.242605
9       tax    -0.009831
10  ptratio    -0.921733
11    black     0.010887
12    lstat    -0.433003
13  randCol     0.952398


b) Use the linear regression model to predict the values of the test set's 'medv' column, based on the test set's other attributes. Print the Mean Squared Error of the model on the train set and on the test set.
Usually, the MSE on the train set is lower than the MSE on the test set, since the model parameters are optimized with respect to the train set. Must this always be the case? Can you think of a few examples of when this might not be the case?

In [25]:
from sklearn.metrics import mean_squared_error

y_hat_train = regression_model.predict(X_train)
train_MSE = mean_squared_error(y_train, y_hat_train)

X_test = test_data.drop('medv', axis=1)
y_test = test_data['medv']
y_hat_test = regression_model.predict(X_test)
test_MSE = mean_squared_error(y_test, y_hat_test)
pd.DataFrame({'Set': ['train_set', 'test_set'], 'MSE': [train_MSE, test_MSE]})

         Set        MSE
0  train_set  20.519814
1   test_set  26.317532


In [26]:
print("No, the train MSE can be higher than the test MSE. If the train data cannot be fit exactly by a line, the train MSE is greater than 0, while the test set may happen, by chance, to contain only points that lie exactly on the fitted regression line. For example, see the code and outcome below:")

# Train data that no line fits exactly (not realizable by linear regression)
example_train_data = pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 4, 1]})
example_X_train = example_train_data.drop('Y', axis=1)
example_y_train = example_train_data['Y']

example_regression_model = LinearRegression()
example_regression_model.fit(example_X_train, example_y_train)

# "Perfect" test data: every point lies on the fitted line y = 2
example_test_data = pd.DataFrame({'X': range(100), 'Y': 100 * [2]})
example_X_test = example_test_data.drop('Y', axis=1)
example_y_test = example_test_data['Y']

example_y_hat_train = example_regression_model.predict(example_X_train)
example_train_MSE = mean_squared_error(example_y_train, example_y_hat_train)

example_y_hat_test = example_regression_model.predict(example_X_test)
example_test_MSE = mean_squared_error(example_y_test, example_y_hat_test)
pd.DataFrame({'Set': ['train_set', 'test_set'], 'MSE': [example_train_MSE, example_test_MSE]})

No, the train MSE can be higher than the test MSE. If the train data cannot be fit exactly by a line, the train MSE is greater than 0, while the test set may happen, by chance, to contain only points that lie exactly on the fitted regression line. For example, see the code and outcome below:


         Set           MSE
0  train_set           2.0
1   test_set  1.720373e-28


c) Add some noise (with mean=0, std=1) to the test set's y, and predict it again. What happened to the MSE? Why?

In [27]:
import numpy as np

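# Draw one independent N(0, 1) noise value per test sample. np.random.normal(0, 1)
# without a size returns a single scalar, which shifts every target by the same constant.
noise = np.random.normal(0, 1, size=len(y_test))
noised_test_y = y_test + noise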

y_hat_test = regression_model.predict(X_test)
noised_test_MSE = mean_squared_error(noised_test_y, y_hat_test)
pd.DataFrame({'Set': ['test_set', 'noised_test_set'], 'MSE': [test_MSE, noised_test_MSE]})

               Set        MSE
0         test_set  26.317532
1  noised_test_set   26.97571


In [28]:
print("The MSE increased only slightly (from 26.32 to 26.98). The predictions did not change, because the model and X_test are the same; only the targets moved. Zero-mean noise that is independent of the residuals adds about its variance (here 1) to the MSE in expectation, which is small compared to the existing test MSE of about 26. Individual points move closer to or further from the regression line, but on average the squared error grows by roughly the noise variance.")

The MSE increased only slightly (from 26.32 to 26.98). The predictions did not change, because the model and X_test are the same; only the targets moved. Zero-mean noise that is independent of the residuals adds about its variance (here 1) to the MSE in expectation, which is small compared to the existing test MSE of about 26. Individual points move closer to or further from the regression line, but on average the squared error grows by roughly the noise variance.
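
The effect of target noise on the MSE can be checked numerically. The following is a minimal sketch on synthetic data: `y_true` and `y_pred` are hypothetical stand-ins for `y_test` and `y_hat_test`, not the Boston data, and the sample size and seed are arbitrary assumptions. It compares one independent noise draw per sample with a single constant shift applied to all targets.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical targets and predictions; the residuals have a variance of about 25.
n = 100_000
y_true = rng.normal(22.0, 9.0, size=n)
y_pred = y_true + rng.normal(0.0, 5.0, size=n)


def mse(y, y_hat):
    """Mean squared error between two equally sized arrays."""
    return float(np.mean((y - y_hat) ** 2))


base_mse = mse(y_true, y_pred)

# Case 1: one independent N(0, 1) draw per sample.
per_sample_noise = rng.normal(0.0, 1.0, size=n)
noisy_mse = mse(y_true + per_sample_noise, y_pred)

# Case 2: the same constant c added to every target (what a single scalar draw does).
c = 1.0  # fixed here for illustration
shifted_mse = mse(y_true + c, y_pred)
mean_residual = float(np.mean(y_true - y_pred))

print(f"base MSE:               {base_mse:.3f}")
print(f"per-sample noise MSE:   {noisy_mse:.3f}")
print(f"constant-shift MSE:     {shifted_mse:.3f}")
print(f"base + 2*c*mean_res + c^2: {base_mse + 2 * c * mean_residual + c ** 2:.3f}")
```

With per-sample noise, the cross term between noise and residuals averages out, so the MSE grows by about the noise variance (1). With a constant shift c, the change is exactly c^2 + 2*c*mean(residual), so it depends on the drawn value and on the mean residual and can even be negative.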


# Question 3

a) Create a Recursive feature elimination model, with a linear regression estimator, that selects half of the original number of features. Hint: Check the feature_selection module in scikit-learn.

b) Use the feature elimination model on the full database (after filtering columns with missing values, before partitioning into train/test). Print the features that were selected. Remember that we separate the 'medv' attribute to be our y, while the rest of the attributes in the dataset serve as features to learn from.
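
Parts a) and b) could be approached along the following lines. This is only a sketch under stated assumptions: it uses a synthetic frame (`example_df`, with made-up feature names `f0` to `f5`) instead of the Boston data, so the selected features shown here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the filtered DataFrame: 6 features plus a 'medv' target.
n_rows = 200
features = pd.DataFrame(rng.normal(size=(n_rows, 6)), columns=[f"f{i}" for i in range(6)])
# Only f0, f1 and f2 carry signal; f3..f5 are pure noise.
target = (3.0 * features["f0"] - 2.0 * features["f1"] + 1.5 * features["f2"]
          + rng.normal(scale=0.1, size=n_rows))
example_df = features.assign(medv=target)

X = example_df.drop(columns="medv")
y = example_df["medv"]

# Keep half of the original features, eliminating one feature per iteration.
selector = RFE(estimator=LinearRegression(), n_features_to_select=X.shape[1] // 2, step=1)
selector.fit(X, y)

selected_features = X.columns[selector.support_].tolist()
print(f"Selected features: {selected_features}")
```

With a linear regression estimator, RFE ranks features by the magnitude of their coefficients, which depends on each feature's scale. On real data with differently scaled columns, standardizing the features first may give a more meaningful ranking.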

c) We'd like to find out the optimal number of features. Create feature elimination models (with linear regression estimators) for every number of features between 1 and n (where n is the number of original features, 'medv' excluded). For each number of features, run a linear regression as in Question 2, only on the selected features, in order to predict 'medv'. Print the Mean Squared Error for each number of features.
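
One possible shape for the loop in part c) is sketched below on synthetic data (the arrays, sizes and seed are assumptions for illustration). As a design choice that the assignment does not prescribe, the sketch selects features on the train set only, so that the test set plays no role in the selection.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 6 features, of which only the first 3 affect the target.
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

mse_by_k = {}
for k in range(1, X.shape[1] + 1):
    # Select k features with RFE, fitted on the train set.
    selector = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X_train, y_train)
    # Fit a fresh linear regression on the selected features only.
    model = LinearRegression().fit(selector.transform(X_train), y_train)
    y_hat = model.predict(selector.transform(X_test))
    mse_by_k[k] = mean_squared_error(y_test, y_hat)
    print(f"{k} features: test MSE = {mse_by_k[k]:.4f}")
```

In this synthetic setup the test MSE drops sharply until the three informative features are included and then flattens, which is the kind of pattern to look for when weighing accuracy against the number of features.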

d) Determine the optimal number of features for this task. Think about the cost of adding more data versus the benefit of a more accurate prediction. Explain your answer.

# Question 4

Perform a cross-validation of the linear regression on the train set with K=5. Print the CV score for each fold.
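
A minimal sketch of 5-fold cross-validation with scikit-learn follows. The synthetic `X_train` and `y_train` arrays are stand-ins for the real train set, and shuffling the folds with a fixed seed is an assumption rather than a requirement of the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for X_train / y_train.
X_train = rng.normal(size=(200, 4))
y_train = X_train @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

# Shuffle before splitting so that fold membership does not depend on row order.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn treats higher scores as better, so MSE is exposed as its negative.
neg_mse_scores = cross_val_score(LinearRegression(), X_train, y_train,
                                 cv=cv, scoring="neg_mean_squared_error")
r2_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=cv, scoring="r2")

for fold, (neg_mse, r2) in enumerate(zip(neg_mse_scores, r2_scores), start=1):
    print(f"fold {fold}: MSE = {-neg_mse:.4f}, R^2 = {r2:.4f}")
```

If no `scoring` argument is given, `cross_val_score` uses the estimator's own `score` method, which for `LinearRegression` is R^2. Passing `cv=5` instead of a `KFold` object splits the rows in order without shuffling.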