<!--
{% comment %}
Copyright 2021 IBM

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

### Tuning hyper-parameters with CodeFlare Pipelines

`GridSearchCV()` is often used for hyper-parameter tuning of a model constructed via sklearn pipelines. It performs an exhaustive search over specified parameter values for a pipeline, and it implements a `fit()` method and a `score()` method. The parameters of the pipeline used to apply these methods are optimized by cross-validated grid search over a parameter grid.

Here we show how to convert an example that uses `GridSearchCV()` to tune the hyper-parameters of an sklearn pipeline into one that uses the CodeFlare (CF) pipelines `grid_search_cv()`. We use [Pipelining: chaining a PCA and a logistic regression](https://scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html#sphx-glr-auto-examples-compose-plot-digits-pipe-py) from the sklearn examples as our starting point.

In this sklearn example, a pipeline chains together a PCA and a LogisticRegression. The `n_components` parameter of the PCA and the `C` parameter of the LogisticRegression are defined in a `param_grid`, with `n_components` taken from `[5, 15, 30, 45, 64]` and `C` from `np.logspace(-4, 4, 4)`. A total of 20 combinations of `n_components` and `C` values are explored by `GridSearchCV()` to find the one with the highest `mean_test_score`.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pca = PCA()
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

X_digits, y_digits = datasets.load_digits(return_X_y=True)

param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```

After running `GridSearchCV().fit()`, the best values of `pca__n_components` and `logistic__C`, together with the cross-validated `mean_test_score`, are printed out as follows. In this example, the best `n_components` chosen for the PCA is 45.

```
Best parameter (CV score=0.920):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 45}
```

The PCA explained variance ratio and the best `n_components` chosen are plotted in the top chart. The classification accuracy and its `std_test_score` are plotted in the bottom chart. The best `n_components` can be obtained by calling `best_estimator_.named_steps['pca'].n_components` on the object returned by `GridSearchCV()`.
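
For instance, continuing from the `search` object fitted above, a minimal sketch of that lookup (the variable name is just for illustration):

```python
# `best_estimator_` is the refitted sklearn Pipeline; `named_steps`
# indexes its steps by name, so we can read the chosen PCA setting.
best_pca_components = search.best_estimator_.named_steps['pca'].n_components
print(best_pca_components)  # 45 in this example
```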

### Converting to CodeFlare Pipelines with *grid search*

We next describe the step-by-step conversion of this example to one that uses CodeFlare Pipelines.

#### **Step 1: importing codeflare.pipelines packages and ray**

We first import the various `codeflare.pipelines` packages, including Datamodel and Runtime, as well as `ray`, and then call `ray.shutdown()` and `ray.init()`. Note that, in order to run this CodeFlare example notebook, you need a running Ray instance.

```python
# CodeFlare pipelines data model and runtime
import codeflare.pipelines.Datamodel as dm
import codeflare.pipelines.Runtime as rt
from codeflare.pipelines.Datamodel import Xy
from codeflare.pipelines.Datamodel import XYRef
from codeflare.pipelines.Runtime import ExecutionType

# start Ray (shutting down any instance left over from a previous run)
import ray
ray.shutdown()
ray.init()
```

#### **Step 2: defining and setting up a codeflare pipeline**

A codeflare pipeline is defined by EstimatorNodes and the edges connecting them. In this case, we define `node_pca` and `node_logistic` and connect the two nodes with `pipeline.add_edge()`. Before we can execute `fit()` on the pipeline, we also need to set up the proper input to the pipeline, as sketched below.
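
The notebook contains the exact construction; the following is a minimal sketch using the `codeflare.pipelines` Datamodel imported above, and assumes the `pca`, `logistic`, `X_digits`, and `y_digits` objects from the sklearn example are reused:

```python
pipeline = dm.Pipeline()

# wrap the sklearn estimators in EstimatorNodes and chain them
node_pca = dm.EstimatorNode('pca', pca)
node_logistic = dm.EstimatorNode('logistic', logistic)
pipeline.add_edge(node_pca, node_logistic)

# feed the digits data to the first node of the pipeline
pipeline_input = dm.PipelineInput()
pipeline_input.add_xy_arg(node_pca, dm.Xy(X_digits, y_digits))
```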

#### **Step 3: defining pipeline param grid and executing**

The codeflare pipelines runtime converts an sklearn `param_grid` into a codeflare pipelines `PipelineParam` via `dm.PipelineParam.from_param_grid()`. We also specify the default `KFold` parameter for running the cross-validation. Finally, the runtime executes `grid_search_cv()`.

```python
from sklearn.model_selection import KFold

# param_grid
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}

pipeline_param = dm.PipelineParam.from_param_grid(param_grid)

# default KFold for grid search
k = 5
kf = KFold(k)

# execute CF pipeline grid_search_cv
result = rt.grid_search_cv(kf, pipeline, pipeline_input, pipeline_param)
```
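
The 20 pipelines mentioned earlier come from the 5 × 4 grid above; a purely illustrative sanity check on the size of the search space:

```python
# 5 values of n_components x 4 values of C = 20 candidate pipelines
n_grid_points = len(param_grid['pca__n_components']) * len(param_grid['logistic__C'])
print(n_grid_points)  # 20
```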

#### **Step 4: parsing the returned result from `grid_search_cv()`**

As the codeflare pipelines project is still under active development, APIs to access some attributes of the pipelines explored by `grid_search_cv()` are not yet available. As a result, slightly more verbose code is needed to get the best pipeline, its associated parameter values, and other statistics from the object returned by `grid_search_cv()`. For example, we need to loop through all 20 explored pipelines to find the best one. And, to get the `n_components` of an explored pipeline, we first call `.get_nodes()` on the returned cross-validated pipeline, then `.get_estimator()`, and finally `.get_params()`.

```python
import statistics
import pandas as pd

# track the best mean test score and the corresponding pipeline
best_pipeline = None
best_mean_scores = 0.0
best_n_components = 0

rows = []
for cv_pipeline, scores in result.items():
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    n_components = 0
    params = {}
    # get the 'n_components' value of the PCA in this cv_pipeline
    for node_name, node in cv_pipeline.get_nodes().items():
        params[node_name] = node.get_estimator().get_params()
        if 'n_components' in params[node_name]:
            n_components = params[node_name]['n_components']
    assert n_components > 0
    rows.append({'n_components': n_components,
                 'mean_test_score': mean,
                 'std_test_score': std})
    if mean > 0.92:
        print(mean)
        print(str(params))

    if mean > best_mean_scores:
        best_pipeline = cv_pipeline
        best_mean_scores = mean
        best_n_components = n_components

df = pd.DataFrame(rows, columns=['n_components', 'mean_test_score', 'std_test_score'])
```
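
With the variables computed in the loop above, the winner can then be reported, for example:

```python
# summary print, reusing best_mean_scores / best_n_components from the loop
print("Best mean test score: %0.4f" % best_mean_scores)
print("Best PCA n_components: %d" % best_n_components)
```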

Due to differences in the cross-validation splits, the CodeFlare pipelines `grid_search_cv()` finds the best pipeline with `n_components = 64` for the PCA, and the second best with `n_components = 45`. The parameters of the second-best and the best pipelines are printed out as follows.

```
0.9226679046734757
{'pca__3': {'copy': True, 'iterated_power': 'auto', 'n_components': 45, 'random_state': None, 'svd_solver': 'auto', 'tol': 0.0, 'whiten': False}, 'logistic__1': {'C': 0.046415888336127774, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 10000, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.1, 'verbose': 0, 'warm_start': False}}
0.9260058805323429
{'pca__4': {'copy': True, 'iterated_power': 'auto', 'n_components': 64, 'random_state': None, 'svd_solver': 'auto', 'tol': 0.0, 'whiten': False}, 'logistic__1': {'C': 0.046415888336127774, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 10000, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.1, 'verbose': 0, 'warm_start': False}}
```

The corresponding plots are similar to those from the sklearn `GridSearchCV()`, except that the `n_components` chosen for the best score by the CodeFlare pipelines `grid_search_cv()` is 64.

The Jupyter notebook of this example is available [here](https://github.com/project-codeflare/codeflare/blob/main/notebooks/plot_digits_pipe.ipynb). Please download it and try it out to understand how you might convert an sklearn example to one that uses CodeFlare pipelines. And please let us know what you think.