### Old analysis related to forgetting to pass `cv` to `cross_val_predict`

> I'm keeping this around because it's got some good stuff on manually exploring the folds in k-fold cross-validation and general advice on `cross_validate`, `cross_val_score`, and `cross_val_predict`.


Interestingly, in the plots noq appears to fit better than basicq. Is this overfitting? Yet, looking at the metrics dataframes, basicq fits better. Is this possible?

Yes. This seems to be related to the difference between `cross_val_score` (equivalently `cross_validate`, which I use so that I can report multiple scoring metrics) and `cross_val_predict`. See, for example, https://stackoverflow.com/questions/55009704/why-is-cross-val-predict-not-appropriate-for-measuring-the-generalisation-error. This is also discussed in the sklearn docs. In a nutshell, `cross_validate` reports one score per fold, which are typically averaged, while `cross_val_predict` simply provides the prediction for each data point from when it was in the test set. Computing an error metric from these pooled predictions and the actuals does not necessarily give the same value as averaging over folds: many metrics are not linear, in the sense that the average of per-fold values is not the same as the value computed over the entire set. (For MAE the two agree when all folds are the same size, but for metrics such as RMSE or R² they generally do not.)
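
To make the "average of averages" point concrete, here is a minimal sketch using made-up numbers (the data, fold layout, and helper names are assumptions for illustration, not values from this study). It compares the mean of per-fold metrics with the metric computed once over all pooled predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up actuals and out-of-fold predictions for 20 points (illustration only)
y_true = rng.normal(50.0, 10.0, size=20)
y_pred = y_true + rng.normal(0.0, 5.0, size=20)

# Four equal-sized test folds of 5 points each
folds = np.array_split(np.arange(20), 4)


def mae(actual, pred):
    return float(np.mean(np.abs(actual - pred)))


def rmse(actual, pred):
    return float(np.sqrt(np.mean((actual - pred) ** 2)))


summary = {}
for name, metric in [("MAE", mae), ("RMSE", rmse)]:
    per_fold = [metric(y_true[idx], y_pred[idx]) for idx in folds]
    summary[name] = {
        # What averaging the per-fold scores gives
        "mean_of_folds": float(np.mean(per_fold)),
        # What scoring all out-of-fold predictions at once gives
        "pooled": metric(y_true, y_pred),
    }
    print(name, summary[name])
```

With equal-sized folds the two MAE values coincide, while the pooled RMSE is at least as large as the mean of the per-fold RMSEs (the square root is concave). With unequal fold sizes even the MAE values can differ.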

**HOWEVER, I still don't see how it's possible to get the two plots above and yet in the `metrics_df`'s below, the basicq had lower MAEs in all folds than noq.**

**2021-10-01 - I THINK I FOUND THE PROBLEM! I was not passing cv=cv_iterator into cross_val_predict which means the folds in cross_validate were not the same as those in cross_val_predict. Doh!**

**YES, that was the issue. Of course, that means I need to rerun all the model fits yet again.**
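
A minimal sketch of the fix, on synthetic stand-in data (the dataset, the plain `LinearRegression` model, and the variable names other than `cv_iterator` are assumptions for illustration). It passes the same splitter to both functions and checks that per-fold MAEs rebuilt from the `cross_val_predict` output match those from `cross_validate`, then repeats the check with `cv` left out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict, cross_validate

# Synthetic stand-in data; the real X and y come from the csv files
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

model = LinearRegression()
cv_iterator = KFold(n_splits=5, shuffle=True, random_state=4)

# Per-fold test MAE as reported by cross_validate
scores = cross_validate(model, X, y, cv=cv_iterator,
                        scoring="neg_mean_absolute_error")
mae_cross_validate = -scores["test_score"]


def fold_maes(y_pred):
    """MAE of the given predictions on each test fold of cv_iterator."""
    return np.array([
        mean_absolute_error(y[test_idx], y_pred[test_idx])
        for _, test_idx in cv_iterator.split(X)
    ])


# Same splitter passed explicitly: predictions come from the same folds
mae_same_folds = fold_maes(cross_val_predict(model, X, y, cv=cv_iterator))

# cv omitted: cross_val_predict falls back to its default splitter
mae_default_folds = fold_maes(cross_val_predict(model, X, y))

print("same cv matches:   ", np.allclose(mae_cross_validate, mae_same_folds))
print("default cv matches:", np.allclose(mae_cross_validate, mae_default_folds))
```

This relies on `KFold` with an integer `random_state` producing the same splits every time `split` is called, which is why one `cv_iterator` object can be shared between the two functions.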

According to the SO post, both are reasonable ways of looking at relative model performance, but to really get at generalization error one probably needs a held-out test set that is not used at all in the CV procedure.
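
One way the held-out idea could look, again as a sketch on synthetic data (the 80/20 split, the model, and the names `X_dev` and `X_holdout` are arbitrary choices for illustration, not what was done in this study):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_validate, train_test_split

# Synthetic stand-in data
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Set aside 20% that the CV procedure never sees
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

cv_iterator = KFold(n_splits=5, shuffle=True, random_state=4)
model = LinearRegression()

# Model comparison uses only the development portion
scores = cross_validate(model, X_dev, y_dev, cv=cv_iterator,
                        scoring="neg_mean_absolute_error")
cv_mae = float(-scores["test_score"].mean())

# Final estimate: refit on all development data, score once on the holdout
model.fit(X_dev, y_dev)
holdout_mae = float(mean_absolute_error(y_holdout, model.predict(X_holdout)))

print(f"CV MAE: {cv_mae:.3f}, holdout MAE: {holdout_mae:.3f}")
```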

It's possible to get the kfold indices to more deeply explore model performance. 

https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py

In [None]:
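import pandas as pd

X_pp_noq = pd.read_csv('data/X_pp_noq_exp11.csv', index_col=0)
X_pp_basicq = pd.read_csv('data/X_pp_basicq_exp11.csv', index_col=0)
# read_csv's squeeze=True was removed in pandas 2.0; squeeze the single column into a Series instead
y_pp_occ_p95 = pd.read_csv('data/y_pp_occ_p95_exp11.csv', index_col=0).squeeze('columns')
y_pp_occ_p95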

In [None]:
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True, random_state=4)

In [None]:
cv.get_n_splits()

In [None]:
# cv = KFold(n_splits=5, shuffle=False, random_state=None)
cv = KFold(n_splits=5, shuffle=True, random_state=4)
# Print the train/test row positions for each split, numbered from 1
for split, (train_index, test_index) in enumerate(cv.split(X_pp_noq), start=1):
    print(f"Split {split}")
    print(train_index, test_index)

In [None]:
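from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 polynomial features feeding an ordinary least squares fit
model = make_pipeline(PolynomialFeatures(2), LinearRegression(fit_intercept=True))
cv = KFold(n_splits=5, shuffle=True, random_state=4)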

split = 1
noq_data = {}
for train_index, test_index in cv.split(X_pp_noq):
    data = {}
    X_train = X_pp_noq.iloc[train_index]
    X_test = X_pp_noq.iloc[test_index]
    y_train = y_pp_occ_p95.iloc[train_index]
    y_test = y_pp_occ_p95.iloc[test_index]
    print(f"Split {split}")
    # Fit on train
    model.fit(X_train, y_train)
    y_fitted = model.predict(X_train)
    y_predict = model.predict(X_test)
    mae_train = mean_absolute_error(y_train, y_fitted)
    mae_test = mean_absolute_error(y_test, y_predict)
    print(f"mae_train: {mae_train}, mae_test: {mae_test}")
    
    data['train_index'] = train_index
    data['test_index'] = test_index
    data['y_train'] = y_train
    data['y_test'] = y_test
    data['y_fitted'] = y_fitted
    data['y_predict'] = y_predict
    
    noq_data[split] = data
    split += 1

Yep, these match the metrics_df shown below. Now let's do basicq. Yes, they match too.

In [None]:
from sklearn.metrics import mean_absolute_error

# Same pipeline and same folds as for noq, so per-fold results are comparable
model = make_pipeline(PolynomialFeatures(2), LinearRegression(fit_intercept=True))
cv = KFold(n_splits=5, shuffle=True, random_state=4)

split = 1
basicq_data = {}
for train_index, test_index in cv.split(X_pp_basicq):
    data = {}
    X_train = X_pp_basicq.iloc[train_index]
    X_test = X_pp_basicq.iloc[test_index]
    y_train = y_pp_occ_p95.iloc[train_index]
    y_test = y_pp_occ_p95.iloc[test_index]
    print(f"Split {split}")
    
    # Fit on train
    model.fit(X_train, y_train)
    y_fitted = model.predict(X_train)
    y_predict = model.predict(X_test)
    mae_train = mean_absolute_error(y_train, y_fitted)
    mae_test = mean_absolute_error(y_test, y_predict)
    print(f"mae_train: {mae_train}, mae_test: {mae_test}")
    
    data['train_index'] = train_index
    data['test_index'] = test_index
    data['y_train'] = y_train
    data['y_test'] = y_test
    data['y_fitted'] = y_fitted
    data['y_predict'] = y_predict
    
    basicq_data[split] = data
    split += 1

OK, now look at each of the folds and try to make sense of how the overall graphs can suggest that noq fits better than basicq while the `metrics_df` summaries suggest the opposite.

In [None]:
fold = 4
actual = noq_data[fold]['y_test']
noq_pred = noq_data[fold]['y_predict']
basicq_pred = basicq_data[fold]['y_predict']

mae_noq = mean_absolute_error(actual, noq_pred)
mae_basicq = mean_absolute_error(actual, basicq_pred)

In [None]:
for fold in range(1, 6):
    print(basicq_data[fold]['y_predict'])

In [None]:
import matplotlib.pyplot as plt

ax_anchor = 0  # the point (ax_anchor, ax_anchor) lies on the y = x reference line

fig, ax = plt.subplots(2, figsize=(10, 10))
ax[0].scatter(actual, noq_pred, color='red')
ax[0].set_title('noq')
ax[0].annotate(f"{mae_noq:.3f}", xy=(5, 50), xycoords='data')
ax[1].scatter(actual, basicq_pred, color='blue')
ax[1].set_title('basicq')
ax[1].annotate(f"{mae_basicq:.3f}", xy=(5, 50), xycoords='data')
for axis in ax:
    # Perfect predictions would fall on this 45-degree line
    axis.axline((ax_anchor, ax_anchor), slope=1)
    axis.set_xlabel('actual')
    axis.set_ylabel('predicted')
    axis.set_xlim(0, 80)
    axis.set_ylim(0, 80)
fig.tight_layout()

Neither the individual fold scatter plots nor the test predictions themselves contain the points that appear in the original scatter plot (e.g., actual ~ 10 and predicted ~ 15).

#### Another issue related to randomness
https://scikit-learn.org/stable/common_pitfalls.html#randomness

In researching the details of `cross_validate` and `cross_val_predict`, I uncovered another potential complication related to integer random number seeds vs `RandomState` instances. See the link above for an example of how, in a random forest (RF), an integer seed leads to common random numbers (CRN) in the random parts of the RF, because every fit restarts from the same seed. This isn't necessarily terrible.
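
The underlying behavior can be seen without any cross-validation at all. The sketch below (synthetic data and a small forest, both assumptions for illustration) fits the same estimator object twice, once with an integer seed and once with a `RandomState` instance, and checks whether the two fits agree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)


def fit_twice(random_state):
    """Fit one estimator object twice and return both sets of predictions."""
    rf = RandomForestRegressor(n_estimators=10, random_state=random_state)
    first = rf.fit(X, y).predict(X)
    second = rf.fit(X, y).predict(X)
    return first, second


# Integer seed: every call to fit restarts from the same seed
int_first, int_second = fit_twice(0)
same_with_int = bool(np.allclose(int_first, int_second))

# RandomState instance: each fit consumes the generator, so the next fit differs
inst_first, inst_second = fit_twice(np.random.RandomState(0))
same_with_instance = bool(np.allclose(inst_first, inst_second))

print("int seed, fits identical:     ", same_with_int)
print("RandomState, fits identical:  ", same_with_instance)
```

This only illustrates repeated `fit` calls on one object; how the seed type plays out across the folds inside `cross_validate` is covered in the pitfalls page linked above.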

I've got an input param in `crossval_summarize_mm` for the random forest random state. Right now it's set to 0 (an int) but I could change this to a `RandomState`. I don't think this is a big deal in this study.

It might also be useful to rerun the experiments with a different random int for my `kfold_random_state` param in `crossval_summarize_mm` (currently defaulted to 4).
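
A rough sketch of what such a sensitivity check could look like using `cross_validate` directly rather than `crossval_summarize_mm` (the synthetic data, the plain linear model, and the seeds other than 4 are arbitrary assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
model = LinearRegression()

mean_mae_by_seed = {}
for seed in [4, 17, 42]:  # 4 is the current default; the others are arbitrary
    cv_iterator = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_validate(model, X, y, cv=cv_iterator,
                            scoring="neg_mean_absolute_error")
    # Average test MAE over the five folds for this particular shuffle
    mean_mae_by_seed[seed] = float(-scores["test_score"].mean())

for seed, mean_mae in mean_mae_by_seed.items():
    print(f"kfold seed {seed}: mean test MAE = {mean_mae:.3f}")
```

If the ranking of models changes from seed to seed, the differences between them are probably within the noise of the fold assignment.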