# Crossval-POC

We can't do *pure* cross-validation by holding out entire rows: projecting a held-out row onto the fitted model means observing that row's data, so the held-out error is not truly out-of-sample.
But we can probably still do model selection by finding the number of components at which the improvement in fit diverges from the improvement in cross-validation error.

There's also a scikit-learn example [here](https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_fa_model_selection.html#sphx-glr-auto-examples-decomposition-plot-pca-vs-fa-model-selection-py), comparing PCA and factor analysis for model selection, that I'm interested in.
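
To make the leak concrete, here is a minimal sketch of row holdout using scikit-learn's `PCA` on synthetic data. It is not the code in `selection.row_cross_validate`; the function name `row_holdout_mse` and the synthetic matrix `Z_demo` are assumptions made up for illustration. Because `transform` sees every feature of a held-out row, and the k-component subspace is nested inside the (k+1)-component one, the held-out reconstruction error should never increase as components are added.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold


def row_holdout_mse(Z, max_components, n_splits=5, seed=0):
    """Mean held-out reconstruction MSE for k = 1..max_components (hypothetical helper)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mse = np.zeros(max_components)
    for train_idx, test_idx in kf.split(Z):
        for k in range(1, max_components + 1):
            pca = PCA(n_components=k).fit(Z[train_idx])
            # transform() uses all features of the held-out rows: this is the leak
            Z_hat = pca.inverse_transform(pca.transform(Z[test_idx]))
            mse[k - 1] += np.mean((Z[test_idx] - Z_hat) ** 2)
    return mse / n_splits


# Synthetic data for illustration only: 3 latent factors plus noise
rng = np.random.default_rng(0)
Z_demo = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 12))
Z_demo += 0.5 * rng.normal(size=(200, 12))

row_mse = row_holdout_mse(Z_demo, max_components=8)
print(np.round(row_mse, 3))
```

So under row holdout the curve can show an elbow, but it should not turn back up to flag overfitting on its own.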

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
%matplotlib inline

# Local files
import factor
import selection
import vis
import data

# Load models
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

In [None]:
X = data.load("p70m")  # helm, p70m or synthetic

# Hypothesis: small predictive features may be getting clobbered by large
# spurious features (poor signal-to-noise ratio)

print(f"{X.shape[0]} models, {X.shape[1]} features")

In [None]:
# Now let's do row-wise cross validation for model selection

Z = StandardScaler().fit_transform(X)
n_components = 10
row_MSEs, row_std, row_fit = selection.row_cross_validate(Z, PCA(), n_components)
al_MSEs, al_std, fit_err = selection.cross_validate(Z, factor.PCA(), n_components)

components = np.arange(n_components) + 1
plt.figure()
plt.plot(components, row_MSEs, label="Row holdout (scikit-learn)")
plt.plot(components, fit_err, label="No holdout (scikit-learn)")
plt.plot(components, al_MSEs, label="Partial holdout (inhouse)")
plt.xlabel("Number of components")
plt.ylabel("MSE")
plt.legend()
plt.show()

# Thoughts:
* The elbow is visible in all three curves (even the one with no holdout), so we'd *probably* guess the right number of dimensions either way
* Only my partial-holdout method definitively shows overfitting
* I believe some methods respond better to row holdout, e.g. factor analysis
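
For comparison, one way a partial holdout could work is sketched below. This is an assumption about the general idea, not the implementation behind `selection.cross_validate` or `factor.PCA`; the name `partial_holdout_mse`, the `hidden_frac` parameter and the synthetic data are all hypothetical. For each held-out row a random subset of features is hidden, the scores are estimated by least squares from the visible features only, and the error is measured on the hidden features, which the model never saw.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold


def partial_holdout_mse(Z, max_components, n_splits=5, hidden_frac=0.3, seed=0):
    """MSE on hidden entries of held-out rows for k = 1..max_components (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    sq_err = np.zeros(max_components)
    n_hidden = 0
    for train_idx, test_idx in kf.split(Z):
        # Fit once per fold; the first k rows of components_ span the k-component model
        pca = PCA(n_components=max_components).fit(Z[train_idx])
        for z in Z[test_idx]:
            hidden = rng.random(Z.shape[1]) < hidden_frac
            if hidden.all() or not hidden.any():
                continue  # need both visible and hidden features
            centered = z - pca.mean_
            for k in range(1, max_components + 1):
                W = pca.components_[:k].T  # loadings, shape (n_features, k)
                # Estimate scores from the visible features only
                scores, *_ = np.linalg.lstsq(W[~hidden], centered[~hidden], rcond=None)
                pred = W[hidden] @ scores
                sq_err[k - 1] += np.sum((centered[hidden] - pred) ** 2)
            n_hidden += int(hidden.sum())
    return sq_err / n_hidden


# Synthetic data for illustration only: 3 latent factors plus noise
rng = np.random.default_rng(0)
Z_demo = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 12))
Z_demo += 0.5 * rng.normal(size=(200, 12))

partial_mse = partial_holdout_mse(Z_demo, max_components=8)
print(np.round(partial_mse, 3))
```

Unlike the row-holdout error, nothing forces this curve to keep decreasing, so it can rise once extra components start fitting noise.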