Skip to content

Commit 38110ab

Browse files
authored
Update hyperparameter.md
1 parent 3acc9a8 commit 38110ab

File tree

1 file changed

+9
-10
lines changed

1 file changed

+9
-10
lines changed

docs/source/examples/hyperparameter.md

Lines changed: 9 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -18,10 +18,10 @@ limitations under the License.
1818

1919
### Tuning hyper-parameters with CodeFlare Pipelines
2020

21-
GridSearchCV() is often used for hyper-parameter turning for a model constructed via sklearn pipelines. It does an exhaustive search over specified parameter values for a pipeline. It implements a `fit()` method and a `score()` method. The parameters of the pipeline used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
21+
`GridSearchCV()` is often used for hyper-parameter turning for a model constructed via sklearn pipelines. It does an exhaustive search over specified parameter values for a pipeline. It implements a `fit()` method and a `score()` method. The parameters of the pipeline used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
2222
Here we show how to convert an example of using `GridSearchCV()` to tune the hyper-parameters of an sklearn pipeline into one that uses Codeflare (CF) pipelines `grid_search_cv()`. We use the [Pipelining: chaining a PCA and a logistic regression](https://scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html#sphx-glr-auto-examples-compose-plot-digits-pipe-py) from sklearn pipelines as an example. 
2323

24-
In this sklearn example, a pipeline is chained together with a PCA and a LogisticRegression. The n_components parameter of the PCA and the C parameter of the LogisticRegression are defined in a param_grid: with n_components in `[5, 15, 30, 45, 64]` and `C` defined by `np.logspace(-4, 4, 4)`. A total of 20 combinations of `n_components` and `C` parameter values will be explored by `GridSearchCV()` to find the best one with the highest `mean_test_score`.
24+
In this sklearn example, a pipeline is chained together with a PCA and a LogisticRegression. The `n_components` parameter of the PCA and the `C` parameter of the LogisticRegression are defined in a `param_grid`: with `n_components` in `[5, 15, 30, 45, 64]` and `C` defined by `np.logspace(-4, 4, 4)`. A total of 20 combinations of `n_components` and `C` parameter values will be explored by `GridSearchCV()` to find the best one with the highest `mean_test_score`.
2525

2626
```python
2727
pca = PCA()
@@ -40,14 +40,14 @@ print("Best parameter (CV score=%0.3f):" % search.best_score_)
4040
print(search.best_params_)
4141
```
4242

43-
After running `GridSearchCV().fit()`, the best parameters of `PCA__n_components` and `LogisticRegression__C`, together with the cross-validated mean_test scores are printed out as follows. In this example, the best n_components chosen is 45 for the PCA.
43+
After running `GridSearchCV().fit()`, the best parameters of `PCA__n_components` and `LogisticRegression__C`, together with the cross-validated `mean_test scores` are printed out as follows. In this example, the best `n_components` chosen is 45 for the PCA.
4444

4545
```python
4646
Best parameter (CV score=0.920):
4747
{'logistic__C': 0.046415888336127774, 'pca__n_components': 45}
4848
```
4949

50-
The PCA explained variance ratio and the best n_components chosen are plotted in the top chart. The classification accuracy and its std_test_score are plotted in the bottom chart. The best n_components can be obtained by calling best_estimator_.named_step['pca'].n_components from the returned object of GridSearchCV().
50+
The PCA explained variance ratio and the best `n_components` chosen are plotted in the top chart. The classification accuracy and its `std_test_score` are plotted in the bottom chart. The best `n_components` can be obtained by calling `best_estimator_.named_step['pca'].n_components` from the returned object of `GridSearchCV()`.
5151

5252
![](../images/pca_1.png)
5353

@@ -58,7 +58,7 @@ We next describe the step-by-step conversion of this example to one that uses Co
5858

5959
#### **Step 1: importing codeflare.pipelines packages and ray**
6060

61-
We need to first import various `codeflare.pipelines` packages, including Datamodel and runtime, as well as ray and call `ray.shutdwon()` and `ray.init()`. Note that, in order to run this CodeFlare example notebook, you need to have a running ray instance.
61+
We need to first import various `codeflare.pipelines` packages, including `Datamodel` and `runtime`, as well as `ray` and call `ray.shutdwon()` and `ray.init()`. Note that, in order to run this CodeFlare example notebook, you need to have a running ray instance.
6262

6363
```python
6464
import codeflare.pipelines.Datamodel as dm
@@ -73,7 +73,7 @@ ray.init()
7373

7474
#### **Step 2: defining and setting up a codeflare pipeline**
7575

76-
A codeflare pipeline is defined by EstimatorNodes and edges connecting two EstimatorNodes. In this case, we define node_pca and node_logistic and we connect these two nodes with `pipeline.add_edge()`. Before we can execute `fit()` on a pipeline, we need to set up the proper input to the pipeline.
76+
A codeflare pipeline is defined by `EstimatorNodes` and `edges` connecting two `EstimatorNodes`. In this case, we define `node_pca` and `node_logistic` and we connect these two nodes with `pipeline.add_edge()`. Before we can execute `fit()` on a pipeline, we need to set up the proper input to the pipeline.
7777

7878
```python
7979
pca = PCA()
@@ -88,10 +88,9 @@ pipeline_input = dm.PipelineInput()
8888
pipeline_input.add_xy_arg(node_pca, dm.Xy(X_digits, y_digits))
8989
```
9090

91-
#### **Step 3: defining pipeline param grid and executing**
91+
#### **Step 3: defining pipeline param grid and executing Codeflare pipelines `grid_search_cv()`**
9292

93-
Codeflare pipelines grid_search_cv()
94-
Codeflare pipelines runtime converts an sklearn param_grid into a codeflare pipelines param grid. We also specify the default KFold parameter for running the cross-validation. Finally, Codeflare pipelines runtime executes the grid_search_cv().
93+
Codeflare pipelines runtime converts an sklearn param_grid into a codeflare pipelines param grid. We also specify the default `KFold` parameter for running the cross-validation. Finally, Codeflare pipelines runtime executes the `grid_search_cv()`.
9594

9695
```python
9796
# param_grid
@@ -112,7 +111,7 @@ result = rt.grid_search_cv(kf, pipeline, pipeline_input, pipeline_param)
112111

113112
#### **Step 4: parsing the returned result from `grid_search_cv()`**
114113

115-
As the Codeflare pipelines project is still actively under development, APIs to access some attributes of the explored pipelines in the `grid_search_cv()` are not yet available. As a result, a slightly more verbose code is used to get the best pipeline, its associated parameter values and other statistics from the returned object of `grid_search_cv()`. For example, we need to loop through all the 20 explored pipelines to get the best pipeline. And, to get the n_component of an explored pipeline, we first use `.get_nodes()` on the returned cross-validated pipeline and then use .get_estimator() and then finally use `.get_params()`.
114+
As the Codeflare pipelines project is still actively under development, APIs to access some attributes of the explored pipelines in the `grid_search_cv()` are not yet available. As a result, a slightly more verbose code is used to get the best pipeline, its associated parameter values and other statistics from the returned object of `grid_search_cv()`. For example, we need to loop through all the 20 explored pipelines to get the best pipeline. And, to get the `n_component` of an explored pipeline, we first use `.get_nodes()` on the returned cross-validated pipeline and then use `.get_estimator()` and then finally use `.get_params()`.
116115

117116
```python
118117
import statistics

0 commit comments

Comments
 (0)