Commit db270e4

Updated docs.

1 parent 22c11dc commit db270e4

File tree

11 files changed: +262 -21 lines changed

docs/source/_static/custom.css

Lines changed: 1 addition & 1 deletion

@@ -29,5 +29,5 @@ img[alt='Open R Editor'] {
 }

 .wy-side-nav-search .wy-dropdown > a img.logo, .wy-side-nav-search > a img.logo {
-  width: 170px;
+  width: 160px;
 }
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
<!--
{% comment %}
Copyright 2021 IBM

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->
### Fit and score multiple classifiers with CodeFlare Pipelines

We use the sklearn pipeline example *Comparing Nearest Neighbors with and without Neighborhood Components Analysis* to demonstrate how to define, fit, and score multiple classifiers with CodeFlare (CF) Pipelines. The combined sklearn and CF pipeline notebook is published [here](https://github.com/project-codeflare/codeflare/blob/main/notebooks/plot_nca_classification.ipynb).
This example plots the class decision boundaries given by a Nearest Neighbors classifier when using the Euclidean distance on the original features, versus using the Euclidean distance after the transformation learned by Neighborhood Components Analysis. The output is illustrated with colored decision boundaries like the pictures below.

![](../images/classification_and_score_1.jpeg)

Classification score and boundaries of KNN with k=1

![](../images/classification_and_score_2.jpeg)

Classification score and boundaries of KNN with Neighborhood Components Analysis
In the original sklearn pipeline definition, the two KNN classifiers, with and without NCA, are defined as a list of two Pipeline objects as follows:

```python
classifiers = [Pipeline([('scaler', StandardScaler()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
                         ]),
               Pipeline([('scaler', StandardScaler()),
                         ('nca', NeighborhoodComponentsAnalysis()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
                         ])
               ]
```
Recognizing that both pipelines start with a StandardScaler, we can express it as a single shared EstimatorNode in a CF pipeline to avoid redundant computation, as follows:

```python
pipeline = dm.Pipeline()
node_scalar = dm.EstimatorNode('scaler', StandardScaler())
node_knn = dm.EstimatorNode('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
node_nca = dm.EstimatorNode('nca', NeighborhoodComponentsAnalysis())
node_knn_post_nca = dm.EstimatorNode('knn_post_nca', KNeighborsClassifier(n_neighbors=n_neighbors))
```
In the above CF pipeline, four EstimatorNodes are instantiated to define a StandardScaler, an NCA transformer, and two instances of the KNN classifier. The EstimatorNodes host the respective transformers and models, with data input and output ports connected by edges. As shown below, the add_edge() method takes two nodes and connects them in upstream-to-downstream order.

```python
pipeline.add_edge(node_scalar, node_knn)
pipeline.add_edge(node_scalar, node_nca)
pipeline.add_edge(node_nca, node_knn_post_nca)
```
A pipeline's input data is constructed by creating a PipelineInput object. The add_xy_arg() method specifies the input node of the CF pipeline where the training data and labels X_train and y_train are added. The CF runtime rt then executes the pipeline, taking the declared pipeline, an ExecutionType, and the PipelineInput as arguments. In our two-branch pipeline example, ExecutionType.FIT fits both branches in a single invocation.

```python
train_input = dm.PipelineInput()
train_input.add_xy_arg(node_scalar, dm.Xy(X_train, y_train))
pipeline_fitted = rt.execute_pipeline(pipeline, ExecutionType.FIT, train_input)
```
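Note that the scoring and prediction steps below also use test_input and predict_input objects that are not shown in this excerpt: test_input carries the held-out test data, and predict_input carries the mesh-grid points used to draw the decision boundaries. A minimal sketch of test_input, assuming an X_test/y_test split and following the same pattern as train_input, might look like this:

```python
# Hypothetical sketch: the scoring input is built the same way as train_input,
# entering the pipeline at the shared scaler node.
test_input = dm.PipelineInput()
test_input.add_xy_arg(node_scalar, dm.Xy(X_test, y_test))
```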
To get classification scores and prediction output, we first use the select_pipeline() method to identify a fitted pipeline by its end EstimatorNode. For example, the first KNN classifier, without NCA, is declared by node_knn, so select_pipeline() returns the branch ending at node_knn. We then use execute_pipeline() with ExecutionType.SCORE to invoke scoring; similarly, prediction output is returned via ExecutionType.PREDICT.

```python
knn_pipeline = rt.select_pipeline(pipeline_fitted, pipeline_fitted.get_xyrefs(node_knn)[0])
knn_score = ray.get(rt.execute_pipeline(knn_pipeline, ExecutionType.SCORE, test_input).get_xyrefs(node_knn)[0].get_Xref())
Z = ray.get(rt.execute_pipeline(knn_pipeline, ExecutionType.PREDICT, predict_input).get_xyrefs(node_knn)[0].get_Xref())
```
The same steps apply to get the score and predictions of the branch ending at node_knn_post_nca, as shown below.

```python
nca_pipeline = rt.select_pipeline(pipeline_fitted, pipeline_fitted.get_xyrefs(node_knn_post_nca)[0])
nca_score = ray.get(rt.execute_pipeline(nca_pipeline, ExecutionType.SCORE, test_input).get_xyrefs(node_knn_post_nca)[0].get_Xref())
Z = ray.get(rt.execute_pipeline(nca_pipeline, ExecutionType.PREDICT, predict_input).get_xyrefs(node_knn_post_nca)[0].get_Xref())
```
89+
90+
Lastly, the fitted, scored and predicted results of the CF pipeline fully replicated those of the original sklearn pipeline, with the same plots shown below.
91+
92+
![](../images/classification_and_score_3.jpeg)
93+
94+
Classification score and boundaries of KNN with k=1
95+
96+
![](../images/classification_and_score_4.jpeg)
97+
98+
Classification score and boundaries of KNN with Neighborhood Component Analysis
99+
100+
The Jupyter notebook of this example is available [here](https://github.com/project-codeflare/codeflare/blob/main/notebooks/plot_nca_classification.ipynb) to demonstrate how one might translate sklearn pipelines to Codeflare pipelines that take advantage of Ray's distributed processing. Please try it out and let us know what you think.
Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
<!--
{% comment %}
Copyright 2021 IBM

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->
### Tuning hyper-parameters with CodeFlare Pipelines

GridSearchCV() is often used for hyper-parameter tuning of a model constructed via sklearn pipelines. It performs an exhaustive search over specified parameter values for a pipeline, implementing a `fit()` method and a `score()` method; the pipeline's parameters are optimized by cross-validated grid search over a parameter grid.

Here we show how to convert an example that uses `GridSearchCV()` to tune the hyper-parameters of an sklearn pipeline into one that uses the CodeFlare (CF) pipelines `grid_search_cv()`. We use the sklearn example [Pipelining: chaining a PCA and a logistic regression](https://scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html#sphx-glr-auto-examples-compose-plot-digits-pipe-py).

In this sklearn example, a pipeline chains together a PCA and a LogisticRegression. The `n_components` parameter of the PCA and the `C` parameter of the LogisticRegression are defined in a param_grid, with `n_components` in `[5, 15, 30, 45, 64]` and `C` given by `np.logspace(-4, 4, 4)`. A total of 20 combinations (5 values of `n_components` times 4 values of `C`) will be explored by `GridSearchCV()` to find the one with the highest `mean_test_score`.
```python
pca = PCA()
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

X_digits, y_digits = datasets.load_digits(return_X_y=True)

param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
After running `GridSearchCV().fit()`, the best values of `pca__n_components` and `logistic__C`, together with the cross-validated `mean_test_score`, are printed out as follows. In this example, the best `n_components` chosen for the PCA is 45.

```
Best parameter (CV score=0.920):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 45}
```
The PCA explained variance ratio and the best n_components chosen are plotted in the top chart below. The classification accuracy and its std_test_score are plotted in the bottom chart. The best n_components can be obtained from the object returned by GridSearchCV() via best_estimator_.named_steps['pca'].n_components.
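For reference, with the fitted `search` object from the sklearn code above, the chosen component count can be read back like this (a one-line sketch using sklearn's standard `best_estimator_` and `named_steps` attributes):

```python
# Read back the n_components selected by GridSearchCV.
best_n_components = search.best_estimator_.named_steps['pca'].n_components
print(best_n_components)  # 45 in the run reported above
```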
![](../images/pca_1.png)

### Converting to CodeFlare Pipelines with *grid search*

We next describe the step-by-step conversion of this example to one that uses CodeFlare Pipelines.

#### **Step 1: importing codeflare.pipelines packages and ray**

We first import the various `codeflare.pipelines` packages, including Datamodel and Runtime, as well as ray, and call `ray.shutdown()` and `ray.init()`. Note that, in order to run this CodeFlare example notebook, you need a running ray instance.

```python
import codeflare.pipelines.Datamodel as dm
import codeflare.pipelines.Runtime as rt
from codeflare.pipelines.Datamodel import Xy
from codeflare.pipelines.Datamodel import XYRef
from codeflare.pipelines.Runtime import ExecutionType
import ray
ray.shutdown()
ray.init()
```

#### **Step 2: defining and setting up a codeflare pipeline**

A codeflare pipeline is defined by EstimatorNodes and by edges connecting pairs of EstimatorNodes. In this case, we define node_pca and node_logistic and connect the two nodes with `pipeline.add_edge()`. Before we can execute `fit()` on the pipeline, we also need to set up its input, as sketched below.
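The full code for this step is in the linked notebook; a minimal sketch, reusing the `dm.EstimatorNode`, `dm.Pipeline`, and `dm.PipelineInput` pattern from the classification example above (the exact node names and input construction here are illustrative), might look like this:

```python
# Sketch of Step 2: a two-node CF pipeline chaining PCA into LogisticRegression.
pipeline = dm.Pipeline()
node_pca = dm.EstimatorNode('pca', PCA())
node_logistic = dm.EstimatorNode('logistic', LogisticRegression(max_iter=10000, tol=0.1))
pipeline.add_edge(node_pca, node_logistic)

# Pipeline input: the digits data and labels enter at the PCA node.
pipeline_input = dm.PipelineInput()
pipeline_input.add_xy_arg(node_pca, dm.Xy(X_digits, y_digits))
```

The `pipeline` and `pipeline_input` objects defined here are the ones passed to `grid_search_cv()` in Step 3.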
#### **Step 3: defining pipeline param grid and executing**

CodeFlare pipelines provides its own `grid_search_cv()`. The CodeFlare pipelines runtime converts an sklearn param_grid into a CodeFlare pipelines param grid. We also specify the default KFold parameter for running the cross-validation. Finally, the runtime executes `grid_search_cv()`.

```python
# param_grid
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}

pipeline_param = dm.PipelineParam.from_param_grid(param_grid)

# default KFold (from sklearn.model_selection) for grid search cross-validation
k = 5
kf = KFold(k)

# execute CF pipeline grid_search_cv
result = rt.grid_search_cv(kf, pipeline, pipeline_input, pipeline_param)
```

#### **Step 4: parsing the returned result from `grid_search_cv()`**

As the CodeFlare pipelines project is still under active development, APIs to access some attributes of the pipelines explored by `grid_search_cv()` are not yet available. As a result, slightly more verbose code is needed to get the best pipeline, its parameter values, and other statistics from the object returned by `grid_search_cv()`. For example, we loop through all 20 explored pipelines to find the best one. And to get the `n_components` of an explored pipeline, we first call `.get_nodes()` on the returned cross-validated pipeline, then `.get_estimator()`, and finally `.get_params()`.

```python
import statistics
import pandas as pd

# pick the best mean score and the best pipeline
best_pipeline = None
best_mean_scores = 0.0
best_n_components = 0

df = pd.DataFrame(columns=('n_components', 'mean_test_score', 'std_test_score'))

for cv_pipeline, scores in result.items():
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    n_components = 0
    params = {}
    # get the 'n_components' value of the PCA in this cv_pipeline
    for node_name, node in cv_pipeline.get_nodes().items():
        params[node_name] = node.get_estimator().get_params()
        if 'n_components' in params[node_name]:
            n_components = params[node_name]['n_components']
    assert(n_components > 0)
    df = df.append({'n_components': n_components, 'mean_test_score': mean, 'std_test_score': std}, ignore_index=True)
    if mean > 0.92:
        print(mean)
        print(str(params))

    if mean > best_mean_scores:
        best_pipeline = cv_pipeline
        best_mean_scores = mean
        best_n_components = n_components
```
Due to differences in the cross-validation splits, the CodeFlare pipelines `grid_search_cv()` produces the best pipeline with `n_components = 64` for the PCA, and the second best with `n_components = 45`. The parameters of the second-best and best pipelines are printed as follows.

```
0.9226679046734757
{'pca__3': {'copy': True, 'iterated_power': 'auto', 'n_components': 45, 'random_state': None, 'svd_solver': 'auto', 'tol': 0.0, 'whiten': False}, 'logistic__1': {'C': 0.046415888336127774, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 10000, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.1, 'verbose': 0, 'warm_start': False}}
0.9260058805323429
{'pca__4': {'copy': True, 'iterated_power': 'auto', 'n_components': 64, 'random_state': None, 'svd_solver': 'auto', 'tol': 0.0, 'whiten': False}, 'logistic__1': {'C': 0.046415888336127774, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 10000, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.1, 'verbose': 0, 'warm_start': False}}
```
The corresponding plots are similar to those from the sklearn `GridSearchCV()`, except that the `n_components` chosen for the best score by the CodeFlare pipelines `grid_search_cv()` is 64.

![](../images/pca_2.png)

The Jupyter notebook of this example is available [here](https://github.com/project-codeflare/codeflare/blob/main/notebooks/plot_digits_pipe.ipynb). Please download it and try it out to see how you might convert an sklearn example to one that uses CodeFlare pipelines, and let us know what you think.
docs/source/examples/test.md

Lines changed: 0 additions & 19 deletions
This file was deleted.

docs/source/images/pca_1.png

17 KB

docs/source/images/pca_2.png

17.2 KB
