### Tuning hyper-parameters with CodeFlare Pipelines
`GridSearchCV()` is often used for hyper-parameter tuning of a model constructed via sklearn pipelines. It does an exhaustive search over specified parameter values for a pipeline, and it implements a `fit()` method and a `score()` method. The parameters of the pipeline used to apply these methods are optimized by cross-validated grid search over a parameter grid.
Here we show how to convert an example of using `GridSearchCV()` to tune the hyper-parameters of an sklearn pipeline into one that uses Codeflare (CF) pipelines `grid_search_cv()`. We use the [Pipelining: chaining a PCA and a logistic regression](https://scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html#sphx-glr-auto-examples-compose-plot-digits-pipe-py) from sklearn pipelines as an example.
In this sklearn example, a pipeline is chained together with a PCA and a LogisticRegression. The `n_components` parameter of the PCA and the `C` parameter of the LogisticRegression are defined in a `param_grid`: with `n_components` in `[5, 15, 30, 45, 64]` and `C` defined by `np.logspace(-4, 4, 4)`. A total of 20 combinations of `n_components` and `C` parameter values will be explored by `GridSearchCV()` to find the best one with the highest `mean_test_score`.
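Abridged from the linked sklearn example, the original `GridSearchCV()` version looks roughly like the following sketch (the `max_iter=10000, tol=0.1` settings follow that sklearn tutorial and are not stated in this document):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# chain a PCA and a LogisticRegression into one pipeline
pipe = Pipeline(steps=[
    ('pca', PCA()),
    ('logistic', LogisticRegression(max_iter=10000, tol=0.1)),
])

X_digits, y_digits = load_digits(return_X_y=True)

# 5 x 4 = 20 parameter combinations to explore
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid)
search.fit(X_digits, y_digits)

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```

The best `n_components` can then be read back via `search.best_estimator_.named_steps['pca'].n_components`.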
After running `GridSearchCV().fit()`, the best parameters of `PCA__n_components` and `LogisticRegression__C`, together with the cross-validated `mean_test_score` values, are printed out as follows. In this example, the best `n_components` chosen for the PCA is 45.
The PCA explained variance ratio and the best `n_components` chosen are plotted in the top chart. The classification accuracy and its `std_test_score` are plotted in the bottom chart. The best `n_components` can be obtained by calling `best_estimator_.named_steps['pca'].n_components` on the returned object of `GridSearchCV()`.
We next describe the step-by-step conversion of this example to one that uses Codeflare pipelines.
#### **Step 1: importing codeflare.pipelines packages and ray**
We need to first import various `codeflare.pipelines` packages, including `Datamodel` and `Runtime`, as well as `ray`, and then call `ray.shutdown()` and `ray.init()`. Note that, in order to run this CodeFlare example notebook, you need to have a running ray instance.
```python
import codeflare.pipelines.Datamodel as dm
import codeflare.pipelines.Runtime as rt
import ray

# a running ray instance is required before executing the pipeline
ray.shutdown()
ray.init()
```
#### **Step 2: defining and setting up a codeflare pipeline**
A codeflare pipeline is defined by `EstimatorNode`s and edges connecting pairs of `EstimatorNode`s. In this case, we define `node_pca` and `node_logistic` and connect these two nodes with `pipeline.add_edge()`. Before we can execute `fit()` on a pipeline, we need to set up the proper input to the pipeline.
Codeflare pipelines runtime converts an sklearn `param_grid` into a codeflare pipelines param grid. We also specify the default `KFold` parameter for running the cross-validation. Finally, Codeflare pipelines runtime executes the `grid_search_cv()`.
```python
import numpy as np
from sklearn.model_selection import KFold

# param_grid (values from the sklearn example above)
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}
# convert the sklearn param_grid into a codeflare pipelines param grid
# (PipelineParam.from_param_grid is assumed from the CodeFlare API)
pipeline_param = dm.PipelineParam.from_param_grid(param_grid)

# default KFold for the cross-validation
kf = KFold(5)

# execute grid_search_cv()
result = rt.grid_search_cv(kf, pipeline, pipeline_input, pipeline_param)
```
#### **Step 4: parsing the returned result from `grid_search_cv()`**
As the Codeflare pipelines project is still actively under development, APIs to access some attributes of the pipelines explored by `grid_search_cv()` are not yet available. As a result, slightly more verbose code is needed to get the best pipeline, its associated parameter values, and other statistics from the returned object of `grid_search_cv()`. For example, we need to loop through all 20 explored pipelines to find the best one. And to get the `n_components` of an explored pipeline, we first use `.get_nodes()` on the returned cross-validated pipeline, then `.get_estimator()`, and finally `.get_params()`.