In [None]:
import pandas
import kfold_learning as kfl
from sklearn import datasets
from sklearn.model_selection import train_test_split

Let's load some data

In [None]:
data = datasets.fetch_california_housing()

In [None]:
features = pandas.DataFrame(data.data, columns=data.feature_names)
features.shape

In [None]:
data.feature_names

Split the data into training (2/3) and testing (1/3) sets, then set our X and y variables for train and test

In [None]:
tr_X, te_X = train_test_split(features, train_size = 0.67)
target = pandas.Series(data.target)
tr_y = target[tr_X.index]
te_y = target[te_X.index]

Now we'll run the model. Using default settings, kfold_feature_learning expects a regression problem and will run nested 10-fold cross-validation of LassoCV. Let's run it with the defaults. However, we will not apply the model to our test data just yet...

In [None]:
output = kfl.kfold_feature_learning(tr_X, te_X, tr_y, te_y, 
                                    hide_test=True) # don't apply to test data
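
As a rough illustration of what nested cross-validation of LassoCV means, the sketch below uses scikit-learn directly on synthetic data: an outer 10-fold loop produces out-of-fold predictions, while LassoCV runs its own inner cross-validation on each training split to choose the regularization strength. This shows the general technique only and is not a reproduction of what kfold_feature_learning does internally; the synthetic data and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data (illustrative only, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

# Outer loop: 10 folds, so each sample is predicted exactly once
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Inner loop: LassoCV picks alpha by cross-validating within each training split
inner_model = LassoCV(cv=10)

# Out-of-fold predictions from models that never saw the held-out fold
oof_pred = cross_val_predict(inner_model, X_demo, y_demo, cv=outer_cv)
validation_r2 = r2_score(y_demo, oof_pred)
print(round(validation_r2, 3))
```

The score computed this way is a validation score: it uses only the training data, which is why the test set can stay hidden at this stage.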

Because we didn't burn the test data, we can tweak some of the parameters to see if we can improve the validation accuracy (though there is no guarantee that this will improve the model's generalizability)

Let's start by changing the number of folds in the model

In [None]:
output = kfl.kfold_feature_learning(tr_X, te_X, tr_y, te_y,
                                    folds = 20,
                                    hide_test=True) # don't apply to test data
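
To get a feel for what changing the number of folds does to each split, the helper below computes training and validation set sizes for a plain k-fold split. The name kfold_sizes is hypothetical and not part of kfl; it follows the same convention as scikit-learn's KFold, where the first n_samples % n_folds folds receive one extra sample. The sample count of 1000 is just an example.

```python
def kfold_sizes(n_samples, n_folds):
    """Return a (train_size, validation_size) pair for each fold."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds hold one more validation sample than the rest
    val_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
    return [(n_samples - v, v) for v in val_sizes]

for k in (3, 10, 20):
    train_size, val_size = kfold_sizes(1000, k)[0]
    print(k, "folds:", train_size, "train /", val_size, "validation")
```

More folds mean each model trains on a larger share of the data and is validated on a smaller one, at the cost of more model fits.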

We can also set the model to only include features that are significantly associated with the target at a specified p-value threshold

In [None]:
output = kfl.kfold_feature_learning(tr_X, te_X, tr_y, te_y,
                                    folds = 3, p_cutoff = 0.001,
                                    hide_test=True) # don't apply to test data
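
The idea behind a p-value cutoff can be sketched in plain Python. The example below assumes a Pearson correlation between each feature and the target, with the p-value approximated through the Fisher z-transform; which test kfold_feature_learning actually uses is not stated here, so treat the test, the function names, and the toy data as assumptions.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z-transform (normal approximation)."""
    if abs(r) >= 1.0:
        return 0.0  # perfect correlation; atanh is undefined here
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def select_features(columns, target, p_cutoff):
    """Keep the names of columns whose association with target has p < p_cutoff."""
    kept = []
    for name, values in columns.items():
        r = pearson_r(values, target)
        if approx_p_value(r, len(target)) < p_cutoff:
            kept.append(name)
    return kept

# Toy data: "trend" tracks the target closely, "flip" just alternates sign
target_demo = [i + (i % 3) for i in range(40)]
columns_demo = {
    "trend": list(range(40)),
    "flip": [(-1) ** i for i in range(40)],
}
kept = select_features(columns_demo, target_demo, p_cutoff=0.001)
print(kept)
```

A stricter cutoff leaves fewer features for the model to fit, which can help when many features are only weakly related to the target.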

Perhaps you would rather run a different type of model? Just pass it as the clf argument

In [None]:
from sklearn import linear_model
output = kfl.kfold_feature_learning(tr_X, te_X, tr_y, te_y,
                                    folds = 3, 
                                    clf = linear_model.RidgeCV(),
                                    hide_test=True) # don't apply to test data

The kfold_feature_learning function will also accept grid_search type models

In [None]:
from sklearn.model_selection import GridSearchCV
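# Note: 'squared_error' was named 'squared_loss' before scikit-learn 1.0,
# and None replaces the older 'none' string for no penalty
params = {'loss': ['squared_error','huber','epsilon_insensitive',
                   'squared_epsilon_insensitive'],
          'penalty': [None, 'l2', 'l1'],
          'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1]}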
selector = GridSearchCV(linear_model.SGDRegressor(random_state=123),
                        [params],cv=10, scoring = 'r2')

In [None]:
output = kfl.kfold_feature_learning(tr_X, te_X, tr_y, te_y,
                                    folds = 3, 
                                    clf = selector,
                                    search = True, # req. if using grid search
                                    hide_test=True) # don't apply to test data
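
Grid searches get expensive quickly, so it can help to estimate the number of model fits before running one. The arithmetic below uses the grid sizes and cv value from the cells above; it assumes the search is run once per outer fold, which is a guess about how kfold_feature_learning uses the selector rather than a documented behavior.

```python
import math

# Number of options per hyperparameter in the grid above
grid_sizes = {"loss": 4, "penalty": 3, "alpha": 6}
inner_cv = 10     # cv=10 passed to GridSearchCV
outer_folds = 3   # folds=3 passed to kfold_feature_learning

n_candidates = math.prod(grid_sizes.values())      # 72 parameter combinations
fits_per_search = n_candidates * inner_cv          # 720 fits per grid search
# Assumption: one full grid search per outer fold
total_inner_fits = fits_per_search * outer_folds   # 2160 fits

print(n_candidates, fits_per_search, total_inner_fits)
```

This is part of why the example uses only 3 outer folds: every extra outer fold would add another 720 fits under this assumption.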

There are other parameters to explore as well. You can learn more by viewing the docstring: kfl.kfold_feature_learning?

For now, let's go ahead and apply a model to our test data.

In [None]:
output = kfl.kfold_feature_learning(tr_X, te_X, tr_y, te_y,
                                    folds = 20,
                                    hide_test=False) # apply to test data

Many relevant aspects of the model can be found in the output. We can use these to explore our model further

In [None]:
output.keys()

First, let's inspect the feature weights of the final model. These weights can also be used to apply the model to another dataset

In [None]:
list(zip(data.feature_names,
         output['final_model_weights']))
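
For a linear model, applying the weights to a new dataset amounts to a dot product per row. The sketch below assumes the new data has the same feature order and preprocessing as the training data and that an intercept is available; whether the output dictionary exposes an intercept or scaling information is not covered here, so apply_linear_weights and the example numbers are hypothetical.

```python
def apply_linear_weights(rows, weights, intercept=0.0):
    """Predict one value per row as sum(value * weight) + intercept."""
    if any(len(row) != len(weights) for row in rows):
        raise ValueError("each row needs one value per weight")
    return [sum(v * w for v, w in zip(row, weights)) + intercept
            for row in rows]

# Hypothetical two-feature example, not taken from the housing model
new_rows = [[1.0, 2.0], [3.0, 0.5]]
demo_weights = [0.5, -1.0]
predictions = apply_linear_weights(new_rows, demo_weights, intercept=2.0)
print(predictions)
```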

Using the output, we can visualize the predicted vs. observed target values for the test data

In [None]:
import matplotlib.pyplot as plt
plt.scatter(output['test_predicted'], te_y)
plt.xlabel('Predicted')
plt.ylabel('Observed')
plt.show()
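
A scatter plot is easier to interpret next to a couple of summary numbers. The helpers below are a minimal pure-Python sketch of R squared and mean absolute error, shown on made-up values; they are not part of kfl.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# Made-up values for illustration
observed_demo = [1.0, 2.0, 3.0, 4.0]
predicted_demo = [1.5, 2.0, 2.5, 4.0]
r2_demo = r_squared(observed_demo, predicted_demo)
mae_demo = mean_absolute_error(observed_demo, predicted_demo)
print(r2_demo, mae_demo)
```

In the notebook, the same functions could be called as r_squared(list(te_y), list(output['test_predicted'])), assuming the predictions are stored in the same order as te_y.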

I started integrating support for classifiers but ran out of time. Right now, the classification part of the code *does not work*, so don't bother with it.