#### Pipelines
Pipelines are a simple way to <span style="background-color: #FFFF00">keep your data preprocessing and modeling code organized</span>. A pipeline <span style="background-color: #FFFF00">bundles preprocessing and modeling steps</span> so that the whole bundle can be used as if it were a single step. The main benefits are:
- Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you <span style="background-color: #FFFF00">won't need to manually keep track of your training and validation data</span> at each step.
- Fewer Bugs: There are <span style="background-color: #FFFF00">fewer opportunities to misapply a step or forget a preprocessing step</span>.
- Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
- More Options for Model Validation: You will see an example in the next tutorial, which <span style="background-color: #FFFF00">covers cross-validation</span>.
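
To make the "cleaner code" and "fewer bugs" points concrete, here is a minimal sketch on made-up toy data. The arrays, the variable names, and the choice of LinearRegression are illustrative assumptions and are not part of the housing example. It compares reusing a fitted imputer by hand with bundling it in a pipeline; both routes should produce the same predictions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): a single feature with missing values
X_train_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train_toy = np.array([2.0, 4.0, 6.0, 8.0])
X_valid_toy = np.array([[np.nan], [3.0]])

# Manual route: fit the imputer on the training data, then remember to
# reuse that same fitted imputer on the validation data
imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train_toy)
X_valid_imp = imputer.transform(X_valid_toy)
manual_model = LinearRegression().fit(X_train_imp, y_train_toy)
manual_preds = manual_model.predict(X_valid_imp)

# Pipeline route: the same two steps, fitted and applied as one object
toy_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                               ('model', LinearRegression())])
toy_pipeline.fit(X_train_toy, y_train_toy)
pipeline_preds = toy_pipeline.predict(X_valid_toy)

# Both routes apply identical steps, so the predictions should match
print(np.allclose(manual_preds, pipeline_preds))
```

In the manual route, the intermediate arrays have to be carried from step to step; the pipeline route only ever handles the raw inputs.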

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('train.csv', index_col='Id')
X_test_full = pd.read_csv('test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)


In [4]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [col for col in X_train_full.columns
                    if X_train_full[col].dtype == 'object' and
                    X_train_full[col].nunique() < 10]
# Select numerical columns
numerical_cols = [col for col in X_train_full.columns if X_train_full[col].dtype in ['int64','float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
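
The column selection above can be tried out on a small hand-made frame. This is only a sketch: the toy values and the threshold of 3 are assumptions chosen so the effect is visible on four rows, and the text columns are created with an explicit object dtype so that the `dtype == 'object'` check applies.

```python
import pandas as pd

# Toy frame (hypothetical values), loosely modeled on the housing columns
toy = pd.DataFrame({
    'Street': pd.Series(['Pave', 'Grvl', 'Pave', 'Pave'], dtype=object),  # 2 unique values
    'Neighborhood': pd.Series(['A', 'B', 'C', 'D'], dtype=object),        # 4 unique values
    'LotArea': [8450, 9600, 11250, 9550],                                 # int64
    'LotFrontage': [65.0, 80.0, float('nan'), 60.0],                      # float64
})

max_cardinality = 3  # assumed threshold for this toy frame (the cell above uses 10)

# Text columns with few distinct values are kept for one-hot encoding
low_card_cols = [col for col in toy.columns
                 if toy[col].dtype == 'object' and toy[col].nunique() < max_cardinality]
# Numerical columns are selected by dtype
num_cols = [col for col in toy.columns if toy[col].dtype in ['int64', 'float64']]

print(low_card_cols)
print(num_cols)
```

Here 'Neighborhood' is dropped because it has too many distinct values for the chosen threshold, which mirrors how high-cardinality columns are excluded from the housing data.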

Step 1: Define Preprocessing Steps

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:

- <span style="background-color: #FFFF00">imputes missing values in numerical data</span>, and
- imputes missing values and applies a <span style="background-color: #FFFF00">one-hot encoding to categorical data</span>.

In [23]:
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Preprocessing for numerical data
num_pre = SimpleImputer(strategy='mean')
# Preprocessing for categorical data
cat_pre = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Bundle preprocessing for numerical and categorical data
preprocessing = ColumnTransformer(transformers=[('num', num_pre, numerical_cols),
                                                ('cat', cat_pre, categorical_cols)])
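
To see what such a bundle produces, the sketch below applies the same kind of transformer to a tiny made-up frame. The toy data and variable names are assumptions, and `sparse_threshold=0` is added only so that the result comes back as a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training frame (hypothetical): one numerical and one categorical column
toy_train = pd.DataFrame({
    'LotArea': [1000.0, np.nan, 3000.0, 2000.0],
    'Street': pd.Series(['Pave', 'Pave', np.nan, 'Grvl'], dtype=object),
})

toy_cat = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])
toy_pre = ColumnTransformer(
    transformers=[('num', SimpleImputer(strategy='mean'), ['LotArea']),
                  ('cat', toy_cat, ['Street'])],
    sparse_threshold=0)  # always return a dense array

# Columns of the result: imputed LotArea, then one column per Street category
train_out = toy_pre.fit_transform(toy_train)
print(train_out)

# A new row with a missing number and a category never seen during fitting
toy_new = pd.DataFrame({
    'LotArea': [np.nan],
    'Street': pd.Series(['Dirt'], dtype=object),
})
new_out = toy_pre.transform(toy_new)
print(new_out)
```

The missing LotArea is filled with the training mean, and because of handle_unknown='ignore' the unseen category is encoded as all zeros instead of raising an error.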

Step 2: Define the Model

In [25]:
from sklearn.ensemble import RandomForestRegressor 
model = RandomForestRegressor(n_estimators=100, random_state=0)


Step 3: Create and Evaluate the Pipeline
Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. 


In [33]:
from sklearn.metrics import mean_absolute_error 
# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessing),
                      ('model', model)])
# Preprocessing of training data, fit model
clf.fit(X_train,y_train)
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
# Evaluate the model
print(f"Mae is {mean_absolute_error(y_valid, preds)}")


Mae is 17612.84342465753
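
The reported score is the mean absolute error: the average of the absolute differences between the true values and the predictions, in the same units as the target. As a small check of that definition, the sketch below computes it by hand on made-up numbers (the three prices are hypothetical) and compares the result with scikit-learn's function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true prices and predictions (hypothetical values)
y_true_toy = np.array([200000.0, 150000.0, 300000.0])
y_pred_toy = np.array([210000.0, 140000.0, 320000.0])

# Mean absolute error by hand: the average of |true - predicted|
manual_mae = np.mean(np.abs(y_true_toy - y_pred_toy))

# The same quantity from scikit-learn (argument order: y_true, y_pred)
sklearn_mae = mean_absolute_error(y_true_toy, y_pred_toy)

print(manual_mae, sklearn_mae)
```

The individual errors here are 10000, 10000 and 20000, so both values come out at roughly 13333.33.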
