# Streamlining workflows with pipelines

1-Obtaining the Breast Cancer Wisconsin dataset

In [2]:
import pandas as pd

# The raw WDBC file has no header row, so pandas numbers the columns 0..31
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/'
    'machine-learning-databases'
    '/breast-cancer-wisconsin/wdbc.data',
    header=None)

2-Next, we assign the 30 features (columns 2 onward) to a NumPy array, X,
and the diagnosis labels (column 1) to y. Using a LabelEncoder object, we
transform the class labels from their original string representation
('M' and 'B') into integers:

In [4]:
from sklearn.preprocessing import LabelEncoder

X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)
le.classes_

array(['B', 'M'], dtype=object)

In [5]:
le.transform(['M', 'B'])

array([1, 0])
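
The mapping can also be reversed. The following is a minimal sketch on a handful of made-up labels (the `labels` array is an assumption for illustration, not taken from the dataset) showing that LabelEncoder assigns integers in sorted class order and that `inverse_transform` recovers the original strings:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only (not read from wdbc.data)
labels = np.array(['M', 'B', 'B', 'M', 'B'])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)        # classes are sorted: 'B' -> 0, 'M' -> 1
decoded = le_demo.inverse_transform(encoded)   # map the integers back to strings

print(encoded)
print(decoded)
```

Keeping the fitted encoder around is useful when predictions need to be reported with the original class names rather than as integers.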

3-Splitting the data into training (80%) and test (20%) sets

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    stratify=y,
    random_state=1)
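
The `stratify=y` argument asks train_test_split to keep the class proportions the same in both subsets. A small sketch of that effect follows; it uses synthetic data with a 60/40 class balance (the `X_demo` and `y_demo` arrays are assumptions for illustration, not the WDBC data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 60 of class 0 and 40 of class 1
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    stratify=y_demo,
    random_state=1)

# Class counts per subset; both should reflect the 60/40 balance
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts / len(y_tr))
print(test_counts / len(y_te))
```

Without stratification, a random split of an imbalanced dataset can leave the test set with a noticeably different class balance than the training set.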

Instead of going through the model fitting and data transformation steps for the
training and test datasets separately, we can chain the StandardScaler, PCA,
and LogisticRegression objects in a pipeline:

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
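# Chain scaling, dimensionality reduction, and the classifier into one estimator
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1,
                                           solver='lbfgs'))
# fit() fits each transformer on the training data, then fits the classifier
pipe_lr.fit(X_train, y_train)
# predict() reuses the fitted transformers on the test data before classifying
y_pred = pipe_lr.predict(X_test)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))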

Test Accuracy: 0.956
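
To make explicit what the pipeline does internally, here is a hedged sketch that runs the same three steps by hand and compares the predictions with those of an equivalent pipeline. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the example runs without downloading wdbc.data); all names ending in `_demo`, `_tr`, and `_te` are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption for this sketch, not the WDBC file)
X_demo, y_demo = make_classification(n_samples=300, n_features=30,
                                     n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=1)

# Pipeline version: one fit() call and one predict() call
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(random_state=1))
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version: fit each transformer on the training data only,
# then apply the already-fitted transformers to the test data
sc = StandardScaler()
pca = PCA(n_components=2)
lr = LogisticRegression(random_state=1)

X_tr_pca = pca.fit_transform(sc.fit_transform(X_tr))
lr.fit(X_tr_pca, y_tr)

X_te_pca = pca.transform(sc.transform(X_te))
manual_pred = lr.predict(X_te_pca)

# Both routes should yield the same predictions
same = np.array_equal(pipe_pred, manual_pred)
print(same)
```

The manual version shows the point of the pipeline: the scaler and PCA are fitted on the training data only, and the test data is passed through `transform` without refitting, which avoids leaking test-set statistics into the model.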
