In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

%matplotlib inline


In [3]:

#read final cleaned X and Y data sets
X_full=pd.read_csv('X_seq.csv',low_memory=False)
y=pd.read_csv('Y_seq.csv',low_memory=False)

While we would prefer to start with multivariate (multi-output) models for simplicity, some papers suggest that multiple univariate models can have advantages. For example, one study reports that multivariate models increased processing time and that correlation among the outputs affected their performance, while offering no significant performance advantage over univariate models.

https://www.researchgate.net/publication/357876220_Machine_Learning_for_Multi-Output_Regression_When_should_a_holistic_multivariate_approach_be_preferred_over_separate_univariate_ones
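As a rough illustration of that trade-off, the sketch below fits one multi-output model and a set of per-output models on synthetic data, then compares error and fit time. Everything in it is an assumption made for illustration: the `make_regression` data stands in for `X_seq.csv` and `Y_seq.csv`, the helper `fit_and_score` is hypothetical, and a random forest is used because a KNN regressor would give identical predictions either way (its neighbors do not depend on the targets).

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in for the notebook's data (assumption): 3 targets
X_demo, y_demo = make_regression(n_samples=500, n_features=20, n_targets=3,
                                 noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)


def fit_and_score(estimator):
    """Fit on the training split; return (test MSE, fit time in seconds)."""
    start = time.perf_counter()
    estimator.fit(Xtr, ytr)
    elapsed = time.perf_counter() - start
    return mean_squared_error(yte, estimator.predict(Xte)), elapsed


# One forest that predicts all three outputs jointly
multi = RandomForestRegressor(n_estimators=50, random_state=0)
# One independent forest per output column
separate = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))

mse_multi, t_multi = fit_and_score(multi)
mse_sep, t_sep = fit_and_score(separate)

print('multivariate MSE: {:.3f} (fit {:.2f}s)'.format(mse_multi, t_multi))
print('univariate   MSE: {:.3f} (fit {:.2f}s)'.format(mse_sep, t_sep))
```

Which approach wins depends on the data and the estimator, so the same comparison would need to be rerun on the real data before drawing any conclusion.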

Another paper worth exploring outlines an evolutionary prototype selection method for multi-output regression (EPS-MOR), which aims to obtain reduced-size training sets while maximizing prediction quality.

https://www.sciencedirect.com/science/article/pii/S0925231219307611

We were unable to replicate their code, but we performed random prototype selection on our training data and compared the results using mean squared error.
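For reference, prototype selection in the instance-selection sense keeps a subset of training rows. The sketch below is a random baseline of that idea, not the EPS-MOR algorithm from the paper. The helper `random_row_prototypes`, the synthetic `make_regression` data, and the choice of 200 prototypes are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def random_row_prototypes(n_prototypes, X, y, seed=0):
    """Hypothetical helper: keep a random subset of training rows."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(len(X), size=n_prototypes, replace=False)
    return X[rows], y[rows], rows


# Synthetic stand-in for the notebook's data (assumption): 3 targets
X_demo, y_demo = make_regression(n_samples=1000, n_features=20, n_targets=3,
                                 noise=5.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Keep 200 of the 750 training rows as prototypes
X_proto, y_proto, rows = random_row_prototypes(200, Xtr, ytr, seed=10)

full = KNeighborsRegressor().fit(Xtr, ytr)
reduced = KNeighborsRegressor().fit(X_proto, y_proto)

mse_full = mean_squared_error(yte, full.predict(Xte))
mse_reduced = mean_squared_error(yte, reduced.predict(Xte))
print('MSE with all rows:       {:.3f}'.format(mse_full))
print('MSE with 200 prototypes: {:.3f}'.format(mse_reduced))
```

A smarter selector would replace the random draw inside `random_row_prototypes` while keeping the same fit-and-compare loop.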


In [39]:
#read final cleaned X and Y data sets
X_full=pd.read_csv('X_seq.csv',low_memory=False)
y=pd.read_csv('Y_seq.csv',low_memory=False)

In [40]:
cols_to_drop = [x for x in X_full.columns if 'Unnamed' in x]
X_full = X_full.drop(cols_to_drop, axis=1)
cols_to_drop = [x for x in y.columns if 'Unnamed' in x]
y = y.drop(cols_to_drop, axis=1)

In [41]:
x_train, x_test, y_train, y_test = train_test_split(X_full, y, random_state=0)

In [79]:
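#randomly select M columns of train_data; indices are drawn from range(len(train_labels))
def rand_prototypes(M, train_data, train_labels):
    np.random.seed(10)  #fixed seed so the selection is reproducible
    #draw M distinct positions without replacement
    indices = np.random.choice(len(train_labels), M, replace=False)
    #note: iloc[:, indices] selects columns (features), not rows
    return train_data.iloc[:, indices], indices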

In [70]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [66]:
# define model
model = KNeighborsRegressor()
# fit model
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
# summarize prediction
print(y_pred[0])

[1.79461069e-04 7.20366732e-02 8.96051157e-01]


In [69]:
print(np.shape(y_test))
print(np.shape(y_pred))

(1460, 3)
(1460, 3)


In [72]:
print('MSE: {}'.format(mean_squared_error(y_test, y_pred)))
#score() returns the coefficient of determination (R^2)
print(model.score(x_test, y_test))

MSE: 0.0022796685474204027
0.7627693141391253
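With its default settings, `mean_squared_error` reports a single value averaged uniformly over the three outputs, which can hide one poorly predicted target. The sketch below shows how per-output metrics could be obtained with `multioutput='raw_values'`. The small arrays are made-up values used only for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (assumption), 4 samples x 3 outputs
y_true_demo = np.array([[0.00, 0.10, 0.90],
                        [0.00, 0.20, 0.80],
                        [0.10, 0.00, 0.90],
                        [0.00, 0.30, 0.70]])
y_pred_demo = np.array([[0.00, 0.15, 0.85],
                        [0.05, 0.20, 0.75],
                        [0.05, 0.05, 0.90],
                        [0.00, 0.20, 0.80]])

# One value per output column instead of a single average
per_output_mse = mean_squared_error(y_true_demo, y_pred_demo, multioutput='raw_values')
per_output_r2 = r2_score(y_true_demo, y_pred_demo, multioutput='raw_values')

for i, (mse, r2) in enumerate(zip(per_output_mse, per_output_r2)):
    print('output {}: MSE={:.5f}, R^2={:.3f}'.format(i, mse, r2))

# The default aggregate is the plain mean of the per-output values
print('aggregate MSE: {:.5f}'.format(per_output_mse.mean()))
```

Applied to the notebook, the same two calls would take `y_test` and `y_pred` in place of the demo arrays.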


In [83]:
#prototype selection using 100 random features
x_proto, idx = rand_prototypes(100, x_train, x_train.columns)

In [84]:
#compare KNN
# define model
modelp = KNeighborsRegressor()
# fit model
modelp.fit(x_proto, y_train)
y_pred = modelp.predict(x_test.iloc[:,idx])
# summarize prediction
print(y_pred[0])

[1.64103240e-04 7.20366732e-02 8.91222798e-01]


In [85]:
print('MSE: {}'.format(mean_squared_error(y_test, y_pred)))
#score() returns the coefficient of determination (R^2)
print(modelp.score(x_test.iloc[:,idx], y_test))

MSE: 0.00218309683560113
0.7693902916101947


Note that, depending on the seed, the random prototype selection did not always improve model performance, so smarter prototype selection should be explored.
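One way to quantify that seed sensitivity is to repeat the random selection over several seeds and look at the spread of the test error. The sketch below does this for random column subsets, mirroring what `rand_prototypes` samples. The helper `mse_for_seed`, the synthetic `make_regression` data, and the choice of 20 of 40 columns are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def mse_for_seed(seed, n_keep, Xtr, Xte, ytr, yte):
    """Hypothetical helper: test MSE of KNN on a random column subset."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(Xtr.shape[1], size=n_keep, replace=False)
    model = KNeighborsRegressor().fit(Xtr[:, cols], ytr)
    return mean_squared_error(yte, model.predict(Xte[:, cols]))


# Synthetic stand-in for the notebook's data (assumption): 3 targets
X_demo, y_demo = make_regression(n_samples=600, n_features=40, n_informative=10,
                                 n_targets=3, noise=5.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Baseline: all columns
baseline = mean_squared_error(yte, KNeighborsRegressor().fit(Xtr, ytr).predict(Xte))

# Repeat the random selection for 10 different seeds
seed_mses = np.array([mse_for_seed(s, 20, Xtr, Xte, ytr, yte) for s in range(10)])

print('baseline MSE: {:.3f}'.format(baseline))
print('subset MSE:   mean={:.3f}, std={:.3f}, min={:.3f}, max={:.3f}'.format(
    seed_mses.mean(), seed_mses.std(), seed_mses.min(), seed_mses.max()))
```

If the spread across seeds is comparable to the gap from the baseline, a single-seed improvement such as the one above should not be read as evidence that the selection helps.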
