# Linear Regression [35pts (+5 bonus)]

## Introduction
One of the most widespread regression tools is the simple but powerful linear regression. In this notebook, you will engineer the Pittsburgh bus data into numerical features and use them to predict the number of minutes until the bus reaches the bus stop at Forbes and Morewood. 

Notebook restriction: you may not use scikit-learn for this notebook.  

## Q1: Labeling the Dataset [8pts]

You may have noticed that the Pittsburgh bus data has a predictions table with the TrueTime predictions on arrival time, however it does not have the true label: the actual number of minutes until a bus reaches Forbes and Morewood. You will have to generate this yourself. 

Using the `all_trips` function that you implemented in homework 2, you can split the dataframe into separate trips. You will first process each trip into a form more natural for the regression setting. For each trip, you will need to locate the point at which a bus passes the bus stop to get the time at which the bus passes the bus stop. From here, you can calculate the true label for all prior datapoints, and throw out the rest. 

### Importing functions from homework 2

Using the menu in Jupyter, you can import code from your notebook as a Python script using the following steps: 
1. Click File -> Download as -> Python (.py)
2. Save file (time_series.py) in the same directory as this notebook 
3. (optional) Remove all test code (i.e. lines between AUTOLAB_IGNORE macros) from the script for faster loading time
4. Import from the notebook with `from time_series import function_name`

### Specifications

1. To determine when the bus passes Morewood, we will use the Euclidean distance as a metric to determine how close the bus is to the bus stop. 
2. We will assume that the row entry with the smallest Euclidean distance to the bus stop is when the bus reaches the bus stop, and that you should truncate all rows that occur **after** this entry.  In the case where there are multiple entries with the exact same minimal distance, you should just consider the first one that occurs in the trip (so truncate everything after the first occurance of minimal distance). 
3. Assume that the row with the smallest Euclidean distance to the bus stop is also the true time at which the bus passes the bus stop. Using this, create a new column called `eta` that contains for each row, the number of minutes until the bus passes the bus stop (so the last row of every trip will have an `eta` of 0).
4. Make sure your `eta` is numerical and not a python timedelta object. 

In [1]:
import pandas as pd
import numpy as np
import scipy.linalg as la
from collections import Counter

In [11]:
# AUTOLAB_IGNORE_START
from time_series import load_data, split_trips

try:
    vdf = pd.read_pickle('vdf.pkl')
except FileNotFoundError:
    vdf, _ = load_data('bus_train.db')
except:
    raise
    
all_trips = split_trips(vdf)
# AUTOLAB_IGNORE_STOP

In [30]:
def label_and_truncate(trip, bus_stop_coordinates):
    """ Given a dataframe of a trip following the specification in the previous homework assignment,
        generate the labels and throw away irrelevant rows. 
        
        Args: 
            trip (dataframe): a dataframe from the list outputted by split_trips from homework 2
            stop_coordinates ((float, float)): a pair of floats indicating the (latitude, longitude) 
                                               coordinates of the target bus stop. 
            
        Return:
            (dataframe): a labeled trip that is truncated at Forbes and Morewood and contains a new column 
                         called `eta` which contains the number of minutes until it reaches the bus stop. 
        """
    dist_square = np.square(trip["lat"] - bus_stop_coordinates[0]) + np.square(trip["lon"] - bus_stop_coordinates[1])
    # find smallest one
    min_label = dist_square.idxmin()
    # calulate eta
    eta = min_label - trip.index
    # create a new column eta
    trip["eta"] = eta.seconds // 60
    # return labeled and truncated trip
    return trip[trip.index <= min_label]

In [78]:
# AUTOLAB_IGNORE_START
try:
    labeled_vdf = pd.read_pickle('labeled_vdf.pkl')
except FileNotFoundError:
    morewood_coordinates = (40.444671114203, -79.94356058465502) # (lat, lon)
    labeled_trips = [label_and_truncate(trip, morewood_coordinates) for trip in all_trips]
    labeled_vdf = pd.concat(labeled_trips).reset_index()
    # We remove datapoints that make no sense (ETA more than 10 hours)
    labeled_vdf = labeled_vdf[labeled_vdf["eta"] < 10*60].reset_index(drop=True)
except:
    raise
print(Counter([len(t) for t in labeled_trips]))
labeled_vdf.head()
# AUTOLAB_IGNORE_STOP

Counter({18: 231, 19: 196, 17: 184, 20: 163, 21: 162, 16: 162, 15: 127, 22: 112, 31: 110, 23: 108, 30: 106, 14: 105, 33: 102, 32: 100, 36: 93, 34: 89, 37: 85, 24: 84, 28: 83, 35: 81, 29: 79, 27: 77, 25: 69, 41: 68, 13: 67, 40: 59, 26: 58, 38: 55, 45: 48, 44: 45, 42: 45, 39: 44, 48: 43, 46: 41, 1: 40, 5: 37, 43: 35, 50: 33, 6: 32, 47: 30, 12: 30, 7: 27, 51: 25, 49: 24, 52: 20, 11: 20, 8: 20, 55: 19, 54: 18, 10: 17, 53: 15, 9: 15, 4: 15, 3: 12, 2: 12, 58: 7, 57: 4, 61: 3, 56: 3, 59: 2, 62: 2, 64: 1, 60: 1, 63: 1})


Unnamed: 0,tmstmp,vid,lat,lon,hdg,pid,rt,des,pdist,spd,tablockid,tatripid,eta
0,2016-08-11 10:56:00,3200,40.445932,-79.951764,340,4669,61D,Downtown,37371,22,061D-278,6899,0
1,2016-08-11 12:15:00,3200,40.411015,-79.906921,204,4669,61D,Downtown,212,0,061D-278,6913,35
2,2016-08-11 12:16:00,3200,40.411015,-79.906921,204,4669,61D,Downtown,176,0,061D-278,6913,34
3,2016-08-11 12:17:00,3200,40.411015,-79.906921,204,4669,61D,Downtown,176,0,061D-278,6913,33
4,2016-08-11 12:18:00,3200,40.411015,-79.906921,204,4669,61D,Downtown,176,0,061D-278,6913,32


## Q2: Generating Basic Features [8pts]
In order to perform linear regression, we need to have numerical features. However, not everything in the bus database is a number, and not all of the numbers even make sense as numerical features. If you use the data as is, it is highly unlikely that you'll achieve anything meaningful.

Consequently, you will perform some basic feature engineering. Feature engineering is extracting "features" or statistics from your data, and hopefully improve the performance if your learning algorithm (in this case, linear regression). Good features can often make up for poor model selection and improve your overall predictive ability on unseen data. In essence, you want to turn your data into something your algorithm understands. 

### Specifications
1. The input to your function will be a concatenation of the trip dataframes generated in Q1 with the index dropped (so same structure as the original dataframe, but with an extra column and less rows). 
2. Linear models typically have a constant bias term. We will encode this as a column of 1s in the dataframe. Call this column 'bias'. 
2. We will keep the following columns as is, since they are already numerical:  pdist, spd, lat, lon, and eta 
3. Time is a cyclic variable. To encode this as a numerical feature, we can use a sine/cosine transformation. Suppose we have a feature of value f that ranges from 0 to N. Then, the sine and cosine transformation would be $\sin\left(2\pi \frac{f}{N}\right)$ and $\cos\left(2\pi \frac{f}{N}\right)$. For example, the sine transformation of 6 hours would be $\sin\left(2\pi \frac{6}{24}\right)$, since there are 24 hours in a cycle. You should create sine/cosine features for the following:
    * day of week (cycles every week, 0=Monday)
    * hour of day (cycles every 24 hours, 0=midnight)
    * time of day represented by total number of minutes elapsed in the day (cycles every 60*24 minutes, 0=midnight).
4. Heading is also a cyclic variable, as it is the ordinal direction in degrees (so cycles every 360 degrees). 
4. Buses run on different schedules on the weekday as opposed to the weekend. Create a binary indicator feature `weekday` that is 1 if the day is a weekday, and 0 otherwise. 
5. Route and destination are both categorical variables. We can encode these as indicator vectors, where each column represents a possible category and a 1 in the column indicates that the row belongs to that category. This is also known as a one hot encoding. Make a set of indicator features for the route, and another set of indicator features for the destination. 
6. The names of your indicator columns for your categorical variables should be exactly the value of the categorical variable. The pandas function `pd.DataFrame.get_dummies` will be useful. 

In [79]:
def create_features(vdf):
    """ Given a dataframe of labeled and truncated bus data, generate features for linear regression. 
    
        Args:
            df (dataframe) : dataframe of bus data with the eta column and truncated rows
        Return: 
            (dataframe) : dataframe of features for each example
        """
    # cyclic varibles
    def cyclic_transform(f, N, name):
        return pd.Series(np.sin(2 * np.pi * f / N), name="sin_"+name), pd.Series(np.cos(2 * np.pi * f / N), name="cos_"+name)
    
    (sin_day_of_week, cos_day_of_week) = cyclic_transform(vdf["tmstmp"].dt.dayofweek, 7, "day_of_week")
    (sin_hour_of_day, cos_hour_of_day) = cyclic_transform(vdf["tmstmp"].dt.hour, 24, "hour_of_day")
    (sin_time_of_day, cos_time_of_day) = cyclic_transform(vdf["tmstmp"].dt.hour * 60 + vdf["tmstmp"].dt.minute,
                                                         24*60, "time_of_day")
    (sin_hdg, cos_hdg) = cyclic_transform(vdf["hdg"], 360, "hdg")
    # is_weekday
    is_weekday_bool = vdf["tmstmp"].dt.dayofweek < 5
    weekday = is_weekday_bool.astype(int).rename("weekday")
    # one hot encoding route and destination
    rt_df = pd.get_dummies(vdf["rt"])
    des_df = pd.get_dummies(vdf["des"])
    # bias
    bias = pd.Series(np.ones(vdf.index.shape))
    # concatenating objects
    vdf_num = vdf.loc[:, ("pdist", "spd", "lat", "lon", "eta")]
    feature_df = pd.concat([bias, vdf_num, sin_hdg, cos_hdg, sin_day_of_week, cos_day_of_week,
                            sin_hour_of_day, cos_hour_of_day,sin_time_of_day, cos_time_of_day,
                            weekday, rt_df, des_df], axis=1, ignore_index=False)
    return feature_df

# AUTOLAB_IGNORE_START
try:
    vdf_features = pd.read_pickle('vdf_features.pkl')
except FileNotFoundError:
    vdf_features = create_features(labeled_vdf)
except:
    raise
# AUTOLAB_IGNORE_STOP

In [80]:
# AUTOLAB_IGNORE_START
with pd.option_context('display.max_columns', 26):
    vdf_features.columns
    vdf_features.head()
# AUTOLAB_IGNORE_STOP

Index([                  0,             'pdist',               'spd',
                     'lat',               'lon',               'eta',
                 'sin_hdg',           'cos_hdg',   'sin_day_of_week',
         'cos_day_of_week',   'sin_hour_of_day',   'cos_hour_of_day',
         'sin_time_of_day',   'cos_time_of_day',           'weekday',
                     '61A',               '61B',               '61C',
                     '61D',         'Braddock ',          'Downtown',
         'Greenfield Only',       'McKeesport ', 'Murray-Waterfront',
               'Swissvale'],
      dtype='object')

Unnamed: 0,0,pdist,spd,lat,lon,eta,sin_hdg,cos_hdg,sin_day_of_week,cos_day_of_week,sin_hour_of_day,cos_hour_of_day,sin_time_of_day,cos_time_of_day,weekday,61A,61B,61C,61D,Braddock,Downtown,Greenfield Only,McKeesport,Murray-Waterfront,Swissvale
0,1.0,37371,22,40.445932,-79.951764,0,-0.34202,0.939693,0.433884,-0.900969,0.5,-0.866025,0.275637,-0.961262,1,0,0,0,1,0,1,0,0,0,0
1,1.0,212,0,40.411015,-79.906921,35,-0.406737,-0.913545,0.433884,-0.900969,1.224647e-16,-1.0,-0.065403,-0.997859,1,0,0,0,1,0,1,0,0,0,0
2,1.0,176,0,40.411015,-79.906921,34,-0.406737,-0.913545,0.433884,-0.900969,1.224647e-16,-1.0,-0.069756,-0.997564,1,0,0,0,1,0,1,0,0,0,0
3,1.0,176,0,40.411015,-79.906921,33,-0.406737,-0.913545,0.433884,-0.900969,1.224647e-16,-1.0,-0.074108,-0.99725,1,0,0,0,1,0,1,0,0,0,0
4,1.0,176,0,40.411015,-79.906921,32,-0.406737,-0.913545,0.433884,-0.900969,1.224647e-16,-1.0,-0.078459,-0.996917,1,0,0,0,1,0,1,0,0,0,0


Our implementation has the following output. Verify that your code has the following columns (order doesn't matter): 
```python
>>> vdf_features.columns
Index([             u'bias',             u'pdist',               u'spd',
                     u'lat',               u'lon',               u'eta',
                 u'sin_hdg',           u'cos_hdg',   u'sin_day_of_week',
         u'cos_day_of_week',   u'sin_hour_of_day',   u'cos_hour_of_day',
         u'sin_time_of_day',   u'cos_time_of_day',           u'weekday',
               u'Braddock ',          u'Downtown',   u'Greenfield Only',
             u'McKeesport ', u'Murray-Waterfront',         u'Swissvale',
                     u'61A',               u'61B',               u'61C',
                     u'61D'],
      dtype='object')
   bias  pdist  spd        lat        lon  eta   sin_hdg   cos_hdg  \
0   1.0   1106    0  40.439504 -79.996981   16  0.913545 -0.406737   
1   1.0   1106    0  40.439504 -79.996981   15  0.913545 -0.406737   
2   1.0   1778    8  40.438842 -79.994733   14  0.829038 -0.559193   
3   1.0   2934    7  40.437938 -79.991213   13  0.997564 -0.069756   
4   1.0   2934    7  40.437938 -79.991213   13  0.997564 -0.069756   

   sin_day_of_week  cos_day_of_week ...   Braddock   Downtown  \
0         0.433884        -0.900969 ...         0.0       0.0   
1         0.433884        -0.900969 ...         0.0       0.0   
2         0.433884        -0.900969 ...         0.0       0.0   
3         0.433884        -0.900969 ...         0.0       0.0   
4         0.433884        -0.900969 ...         0.0       0.0   

   Greenfield Only  McKeesport   Murray-Waterfront  Swissvale  61A  61B  61C  \
0              0.0          0.0                0.0        1.0  1.0  0.0  0.0   
1              0.0          0.0                0.0        1.0  1.0  0.0  0.0   
2              0.0          0.0                0.0        1.0  1.0  0.0  0.0   
3              0.0          0.0                0.0        1.0  1.0  0.0  0.0   
4              0.0          0.0                0.0        1.0  1.0  0.0  0.0   

   61D  
0  0.0  
1  0.0  
2  0.0  
3  0.0  
4  0.0  

[5 rows x 25 columns]
```

## Q3 Linear Regression using Ordinary Least Squares [10 + 4pts]
Now you will finally implement a linear regression. As a reminder, linear regression models the data as

$$\mathbf y = \mathbf X\mathbf \beta + \mathbf \epsilon$$

where $\mathbf y$ is a vector of outputs, $\mathbf X$ is also known as the design matrix, $\mathbf \beta$ is a vector of parameters, and $\mathbf \epsilon$ is noise. We will be estimating $\mathbf \beta$ using Ordinary Least Squares, and we recommending following the matrix notation for this problem (https://en.wikipedia.org/wiki/Ordinary_least_squares). 

### Specification
1. We use the numpy term array-like to refer to array like types that numpy can operate on (like Pandas DataFrames). 
1. Regress the output (eta) on all other features
2. Return the predicted output for the inputs in X_test
3. Calculating the inverse $(X^TX)^{-1}$ is unstable and prone to numerical inaccuracies. Furthermore, the assumptions of Ordinary Least Squares require it to be positive definite and invertible, which may not be true if you have redundant features. Thus, you should instead use $(X^TX + \lambda*I)^{-1}$ for identity matrix $I$ and $\lambda = 10^{-4}$, which for now acts as a numerical "hack" to ensure this is always invertible. Furthermore, instead of computing the direct inverse, you should utilize the Cholesky decomposition which is much more stable when solving linear systems. 

In [81]:
class LR_model():
    """ Perform linear regression and predict the output on unseen examples. 
        Attributes: 
            beta (array_like) : vector containing parameters for the features """
    
    def __init__(self, X, y):
        """ Initialize the linear regression model by computing the estimate of the weights parameter
            Args: 
                X (array_like) : feature matrix of training data where each row corresponds to an example
                y (array_like) : vector of training data outputs 
            """
        self.beta = np.zeros(X.shape[1])
        self.X = X
        self.y = y
        
    def predict(self, X_p): 
        """ Predict the output of X_p using this linear model. 
            Args: 
                X_p (array_like) feature matrix of predictive data where each row corresponds to an example
            Return: 
                (array_like) vector of predicted outputs for the X_p
            """
        restrict = 1E-4
        self.beta = np.linalg.solve(self.X.T @ self.X + restrict * np.identity(self.X.shape[1]),
                                    self.X.T @ self.y)
        return X_p @ self.beta


We have provided some validation data for you, which is another scrape of the Pittsburgh bus data (but for a different time span). You will need to do the same processing to generate labels and features to your validation dataset. Calculate the mean squared error of the output of your linear regression on both this dataset and the original training dataset. 

How does it perform? One simple baseline is to make sure that it at least predicts as well as predicting the mean of what you have seen so far. Does it do better than predicting the mean? Compare the mean squared error of a predictor that predicts the mean vs your linear classifier. 

### Specifications
1. Build your linear model using only the training data
2. Compute the mean squared error of the predictions on both the training and validation data. 
3. Compute the mean squared error of predicting the mean of the **training outputs** for all inputs. 
4. You will need to process the validation dataset in the same way you processed the training dataset.
5. You will need to split your features from your output (eta) prior to calling compute_mse

In [82]:
# Calculate mean squared error on both the training and validation set
def compute_mse(LR, X, y, X_v, y_v):
    """ Given a linear regression model, calculate the mean squared error for the 
        training dataset, the validation dataset, and for a mean prediction
        Args:
            LR (LR_model) : Linear model
            X (array-like) : feature matrix of training data where each row corresponds to an example
            y (array like) : vector of training data outputs 
            X_v (array-like) : feature matrix of validation data where each row corresponds to an example
            y_v (array like) : vector of validation data outputs 
        Return: 
            (train_mse, train_mean_mse, 
             valid_mse, valid_mean_mse) : a 4-tuple of mean squared errors
                                             1. MSE of linear regression on the training set
                                             2. MSE of predicting the mean on the training set
                                             3. MSE of linear regression on the validation set
                                             4. MSE of predicting the mean on the validation set
                         
            
    """
    train_predict = LR.predict(X)
    train_mean = np.mean(y)
    valid_predict = LR.predict(X_v)
    valid_mean = np.mean(y_v)
    train_mse = np.mean(np.square(train_predict - y))
    train_mean_mse = np.mean(np.square(train_mean - y))
    valid_mse = np.mean(np.square(valid_predict - y_v))
    valid_mean_mse = np.mean(np.square(valid_mean - y_v))
    return train_mse, train_mean_mse, valid_mse, valid_mean_mse

In [83]:
# AUTOLAB_IGNORE_START
# First you should replicate the same processing pipeline as we did to the training set
try:
    vdf_valid = pd.read_pickle('vdf_valid.pkl')
    pdf_valid = pd.read_pickle('pdf_valid.pkl')
except FileNotFoundError:
    vdf_valid, pdf_valid = load_data('bus_valid.db')
except:
    raise

try:
    vdf_features_valid = pd.read_pickle('vdf_features_valid.pkl')
except FileNotFoundError:
    all_trips_valid = split_trips(vdf_valid)
    labeled_trips_valid = [label_and_truncate(trip, morewood_coordinates) for trip in all_trips_valid]
    labeled_vdf_valid = pd.concat(labeled_trips_valid).reset_index()
    labeled_vdf_valid = labeled_vdf_valid[labeled_vdf_valid["eta"] < 10*60].reset_index(drop=True)
    vdf_features_valid = create_features(labeled_vdf_valid)
except:
    raise   

In [84]:
# Separate the features from the output and pass it into your linear regression model.
X_df = vdf_features.drop(columns="eta").values
y_df = vdf_features["eta"].values
X_valid_df = vdf_features_valid.drop(columns="eta").values
y_valid_df = vdf_features_valid["eta"].values
LR = LR_model(X_df, y_df)
train_mse, train_mean_mse, valid_mse, valid_mean_mse = compute_mse(LR, 
                                                                   X_df, 
                                                                   y_df, 
                                                                   X_valid_df, 
                                                                   y_valid_df)

print ("train_mse: ", train_mse, "\n", 
       "train_mean_mse: ", train_mean_mse, "\n", 
       "valid_mse: ", valid_mse, "\n", 
       "valid_mean_mse: ", valid_mean_mse)
# AUTOLAB_IGNORE_STOP

train_mse:  70.44884676590628 
 train_mean_mse:  208.4347581494387 
 valid_mse:  46.6973461276751 
 valid_mean_mse:  179.30565894042988


As a quick check, our training data MSE is approximately 38.99. 

In [89]:
# Save dataframes
# vdf.to_pickle("vdf.pickle") 
# labeled_vdf.to_pickle("labeled_vdf.pickle") 
# vdf_features.to_pickle("vdf_features.pickle") 
# vdf_features_valid.to_pickle("vdf_features_valid.pickle")
# vdf_valid.to_pickle("vdf_valid.pickle")
# pdf_valid.to_pickle("pdf_valid.pickle")

## Q4 TrueTime Predictions [5pts]
How do you fare against the Pittsburgh Truetime predictions? In this last problem, you will match predictions to their corresponding vehicles to build a dataset that is labeled by TrueTime. Remember that we only evaluate performance on the validation set (never the training set). How did you do?

### Specification
1. You should use the pd.DataFrame.merge function to combine your vehicle dataframe and predictions dataframe into a single dataframe. You should drop any rows that have no predictions (see the how parameter). (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html)
2. You can compute the TrueTime ETA by taking their predicted arrival time and subtracting the timestamp, and converting that into an integer representing the number of minutes. 
3. Compute the mean squared error for linear regression only on the rows that have predictions (so only the rows that remain after the merge). 

In [86]:
def compare_truetime(LR, labeled_vdf, pdf):
    """ Compute the mse of the truetime predictions and the linear regression mse on entries that have predictions.
        Args:
            LR (LR_model) : an already trained linear model
            labeled_vdf (pd.DataFrame): a dataframe of the truncated and labeled bus data (same as the input to create_features)
            pdf (pd.DataFrame): a dataframe of TrueTime predictions
        Return: 
            (tt_mse, lr_mse): a tuple of the TrueTime MSE, and the linear regression MSE
        """
    # merge df
    merged_df = labeled_vdf_valid.merge(pdf_valid, how = "inner", on = ("tmstmp", "vid", "rt", "des"))
    merged_df = merged_df.loc[:, ['tmstmp', 'vid', 'lat', 'lon', 'hdg',
                                  'pdist', 'spd', 'rt', 'des', 'eta', 'prdtm']]
    true_eta = merged_df["prdtm"] - merged_df["tmstmp"]
    true_eta = true_eta.dt.seconds // 60
    merged_df["true_eta"] = true_eta
    # create features
    vdf_features_valid = create_features(merged_df)
    # tt mse
    y_mean = np.mean(true_eta)
    tt_mse = np.mean(np.square(true_eta - y_mean))
    # predict
    X_valid = vdf_features_valid.drop(columns="eta").values
    y_predict = LR.predict(X_valid)
    lr_mse = np.mean(np.square(true_eta - y_predict))
    return tt_mse, lr_mse

In [87]:
# AUTOLAB_IGNORE_START
compare_truetime(LR, labeled_vdf_valid, pdf_valid)
# AUTOLAB_IGNORE_STOP

(64.80611398866446, 4.737152357333266)

As a sanity check, your linear regression MSE should be approximately 50.20. 

In [None]:
# My score is way higher than sample answer, because I cleaned the data when importing?

## Q5 Feature Engineering contest (bonus)

You may be wondering "why did we pick the above features?" Some of the above features may be entirely useless, or you may have ideas on how to construct better features. Sometimes, choosing good features can be the entirety of a data science problem. 

In this question, you are given complete freedom to choose what and how many features you want to generate. Upon submission to Autolab, we will run linear regression on your generated features and maintain a scoreboard of best regression accuracy (measured by mean squared error). 

The top scoring students will receive a bonus of 5 points. 

### Tips:
* Test your features locally by building your model using the training data, and predicting on the validation data. Compute the mean squared error on the **validation dataset** as a metric for how well your features generalize. This helps avoid overfitting to the training dataset, and you'll have faster turnaround time than resubmitting to autolab. 
* The linear regression model will be trained on your chosen features of the same training examples we provide in this notebook. 
* We test your regression on a different dataset from the training and validation set that we provide for you, so the MSE you get locally may not match how your features work on the Autolab dataset. 
* We will solve the linear regression using Ordinary Least Squares with regularization $\lambda=10^{-4}$ and a Cholesky factorization, exactly as done earlier in this notebook. 
* Note that the argument contains **UNlabeled** data: you cannot build features off the output labels (there is no ETA column). This is in contrast to before, where we kept everything inside the same dataframe for convenience. You can produce the sample input by removing the "eta" column, which we provide code for below. 
* Make sure your features are all numeric. Try everything!

In [None]:
def contest_features(vdf, vdf_train):
    """ Given a dataframe of UNlabeled and truncated bus data, generate ANY features you'd like for linear regression. 
        Args:
            vdf (dataframe) : dataframe of bus data with truncated rows but unlabeled (no eta column )
                              for which you should produce features
            vdf_train (dataframe) : dataframe of training bus data, truncated and labeled 
        Return: 
            (dataframe) : dataframe of features for each example in vdf
        """
    # create your own engineered features
    pass
    
# AUTOLAB_IGNORE_START
# contest_cols = list(labeled_vdf.columns)
# contest_cols.remove("eta")
# contest_features(labeled_vdf_valid[contest_cols], labeled_vdf).head()
# AUTOLAB_IGNORE_STOP