# Course: [Master Machine Learning with scikit-learn](https://courses.dataschool.io/view/courses/master-machine-learning-with-scikit-learn)

## Chapters 1-5

*© 2022 Data School. All rights reserved.*

# Chapter 1: Introduction

## 1.1 Course overview

**High-level topics:**

- Handling missing values, text data, categorical data, and class imbalance
- Building a reusable workflow
- Feature engineering, selection, and standardization
- Avoiding data leakage
- Tuning your entire workflow

**How you will benefit from this course:**

- Knowledge of best practices
- Confidence when tackling new ML problems
- Ability to anticipate and solve problems
- Improved code quality
- Better, faster results

## 1.2 scikit-learn vs Deep Learning

**Benefits of scikit-learn:**

- Consistent interface to many models
- Many tuning parameters (but sensible defaults)
- Workflow-related functionality
- Exceptional documentation
- Active community support

**Drawbacks of deep learning:**

- More computational resources
- Higher learning curve
- Less interpretable models

## 1.3 Prerequisite skills

**scikit-learn prerequisites:**

- Loading a dataset
- Defining the features and target
- Training and evaluating a model
- Making predictions with new data

**New to scikit-learn?**

- Enroll in "Introduction to Machine Learning with scikit-learn" (free)
- Available at https://courses.dataschool.io
- Complete lessons 1 through 7

## 1.4 Course setup and software versions

**How to install scikit-learn and pandas:**

- **Option 1:** Install together
  - **Anaconda:** https://www.anaconda.com/products/distribution
- **Option 2:** Install separately
  - **scikit-learn:** https://scikit-learn.org
  - **pandas:** https://pandas.pydata.org

In [113]:
import sklearn
sklearn.__version__

'1.0.2'

**scikit-learn version:**

- **Course version:** 0.23.2
- **Minimum version:** 0.20.2

**How to install scikit-learn 0.23.2:**

- **Option 1:** conda install scikit-learn==0.23.2
- **Option 2:** pip install -U scikit-learn==0.23.2

In [114]:
import pandas
pandas.__version__

'1.1.5'

**Using Google Colab with the course:**

- Similar to the Jupyter Notebook
- Runs in your browser
- Free (but requires a Google account)
- Available at https://colab.research.google.com

## 1.5 Course outline

**Chapters:**

1. Introduction
2. Review of the Machine Learning workflow
3. Encoding categorical features
4. Improving your workflow with ColumnTransformer and Pipeline
5. Workflow review #1
6. Encoding text data
7. Handling missing values
8. Fixing common workflow problems
9. Workflow review #2
10. Evaluating and tuning a Pipeline
11. Comparing linear and non-linear models
12. Ensembling multiple models
13. Feature selection
14. Feature standardization
15. Feature engineering with custom transformers
16. Workflow review #3
17. High-cardinality categorical features
18. Class imbalance
19. Class imbalance walkthrough
20. Going further

**Lesson types:**

- Core lessons
- Q&A lessons

**Why not focus on algorithms?**

- Workflow will have a greater impact on your results
- Reusable workflow enables you to try many different algorithms
- Hard to know (in advance) which algorithm will work best

## 1.6 Course datasets

**Datasets:**

- Titanic
- US census
- Mammography scans

**Why use smaller datasets?**

- Easier and faster access to files
- Reduced computational time
- Greater understanding of the course material

## 1.7 Meet your instructor

**About me:**

- Founder of Data School
- Teaching data science for 7+ years
- Passionate about teaching people who are new to data science
- Live in Asheville, North Carolina
- Degree in Computer Engineering

# Chapter 2: Review of the Machine Learning workflow

## 2.1 Loading and exploring a dataset

In [115]:
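import pandas as pd
# Option 1: read the first 10 rows from the URL
df = pd.read_csv('http://bit.ly/MLtrain', nrows=10)
# Option 2: read from a local copy instead (uncomment if you have downloaded the file)
# df = pd.read_csv('titanic_train.csv', nrows=10)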

In [116]:
df

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


**Machine Learning terminology:**

- **Target:** Goal of prediction
- **Classification:** Problem with a categorical target
- **Feature:** Input to the model (column)
- **Sample:** Single observation (row)
- **Training data:** Data with known target values

**Feature selection methods:**

- Human intuition
- Domain knowledge
- Data exploration
- Automated methods

**Currently selected features:**

- **Parch:** Number of parents or children aboard with that passenger
- **Fare:** Amount the passenger paid

In [117]:
X = df[['Parch', 'Fare']]
X

| | Parch | Fare |
|---|---|---|
| 0 | 0 | 7.25 |
| 1 | 0 | 71.2833 |
| 2 | 0 | 7.925 |
| 3 | 0 | 53.1 |
| 4 | 0 | 8.05 |
| 5 | 0 | 8.4583 |
| 6 | 0 | 51.8625 |
| 7 | 1 | 21.075 |
| 8 | 2 | 11.1333 |
| 9 | 0 | 30.0708 |


In [118]:
y = df['Survived']
y

0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64

In [119]:
X.shape

(10, 2)

In [120]:
y.shape

(10,)

## 2.2 Building and evaluating a model

In [121]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=1)

Now that we've defined X and y, our next step is to build and evaluate a model.

To start, we're going to use logistic regression as our model. It's a good default choice for classification problems because it's both fast and interpretable.

We import it from the linear_model module, and then we create an instance called logreg. This is our model object.

The default solver for logistic regression has changed between different scikit-learn versions, but in this course I'm going to set the solver to liblinear. I'm specifying the solver explicitly and setting a value for random_state so that if you run the same code at home, you will most likely get the same results as me.

Let's now talk about model evaluation. The goal of model evaluation is to simulate how a model will perform on future data so that we can choose between models today. To do model evaluation, we need both an evaluation procedure and an evaluation metric.

The procedure we will use is K-fold cross-validation. Another option is to use train/test split, but cross-validation is generally superior because it gives a lower variance estimate of model performance.

The metric we will use is classification accuracy. There are many other classification metrics we could have chosen, but accuracy is suitable for this problem for two reasons:

- First, there is not significant class imbalance.
- And second, predicting the positive class correctly is just as important to us as predicting the negative class correctly.

That being said, I will cover other classification metrics in the chapters on class imbalance.

With such a small dataset, we're going to use 3-fold cross-validation rather than 5 or 10 folds, which are more typical. Let me briefly review what happens during 3-fold cross-validation:

- The rows are split into 3 subsets, which we'll call A, B, and C.
- First, A and B together become the training set, and C becomes the testing set. The model is trained on the training set, the trained model makes predictions for the testing set, and those predictions are evaluated.
- Next, A and C together become the training set, and B becomes the testing set. Again, the model is trained, it makes predictions, and the predictions are evaluated.
- Finally, B and C together become the training set, and A becomes the testing set. The training, predicting, and evaluation process happens one final time.
- Because the evaluation process occurred 3 times, it returns 3 scores, and we will usually take the mean of those scores.

Let's go ahead and use cross-validation to evaluate our model:

- First we import the cross_val_score function from the model_selection module.
- Then we pass it the model object, X and y, the number of cross-validation folds, and the evaluation metric. Although the default metric for classification problems is accuracy, I recommend specifying it explicitly so that there's no ambiguity.
- When we run cross_val_score, it does the dataset splitting, training, predicting, and evaluation. 3 accuracy scores are returned, and the mean of those scores is 69%.

If you received a different result, that's not a problem. The results can vary based on your scikit-learn version due to changes in the default parameters, algorithm changes, bug fixes, and so on.

Unfortunately, we can't take these results seriously because the dataset is so small. There's actually no reliable evaluation procedure when your training data only contains 10 rows, but I did want to demonstrate it anyway to emphasize that model evaluation is a normal part of the Machine Learning workflow.


**Requirements for model evaluation:**

- **Procedure:** K-fold cross-validation
- **Metric:** Classification accuracy

**Steps of 3-fold cross-validation:**

1. Split rows into 3 subsets (A, B, C)
2. A & B is training set, C is testing set
  - Train model on training set
  - Make predictions on testing set
  - Evaluate predictions
3. Repeat with A & C as training set, B as testing set
4. Repeat with B & C as training set, A as testing set
5. Calculate the mean of the scores

In [122]:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()

0.6944444444444443
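
The five steps listed above can also be written out by hand. The following is a minimal sketch, not what scikit-learn literally executes: it assumes that `cv=3` splits a classification dataset with an unshuffled `StratifiedKFold` (covered in lesson 2.12), and it re-creates the 10 training rows inline so that the cell runs on its own. The names `splitter` and `manual_scores` are chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# The 10 training rows shown in lesson 2.1, typed in by hand
X = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05,
             8.4583, 51.8625, 21.075, 11.1333, 30.0708],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name='Survived')

logreg = LogisticRegression(solver='liblinear', random_state=1)

# Step 1: split the rows into 3 subsets (assumed to mirror cv=3)
splitter = StratifiedKFold(n_splits=3)

manual_scores = []
for train_idx, test_idx in splitter.split(X, y):
    # Steps 2-4: train on two subsets, predict on the third, evaluate
    model = clone(logreg)  # fresh, unfitted copy for each round
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    y_pred = model.predict(X.iloc[test_idx])
    manual_scores.append(accuracy_score(y.iloc[test_idx], y_pred))

# Step 5: calculate the mean of the scores
mean_score = np.mean(manual_scores)

# For comparison, the one-line version used in this lesson
builtin_scores = cross_val_score(logreg, X, y, cv=3, scoring='accuracy')
print(manual_scores, mean_score)
print(builtin_scores)
```

In practice you would use `cross_val_score` directly; the loop is only meant to make each step visible.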

## 2.3 Using the model to make predictions

At this point in the workflow, we would typically try making changes in order to achieve a better accuracy, such as:

- Tuning the model's hyperparameters
- Adding or removing features
- Or trying a different classification model other than logistic regression.

We'll cover these topics in detail later in the course, but for now, let's assume that we're happy with the model as-is. Thus, our next steps are to train the model and then use it to make predictions on new data.

We use the model's fit method, which instructs the model to try to learn the relationship between X and y.

There are four important points I want to note here:

- First, you should train your model on the entire dataset before using it to make predictions, otherwise you are throwing away valuable training data. In truth, we do have more than 10 rows, but for now we are considering our entire training dataset to be these 10 rows.
- Second, the model object is modified in-place when you run the fit method, and so there's no need to overwrite the logreg object using an assignment statement.
- Third, scikit-learn understands how to work with pandas objects, and so we can pass X and y directly to the fit method.
- And finally, if you're using scikit-learn 0.23 or later, you will only see the parameters that have changed from the defaults when you print or fit a model. That's why it only displays the random_state and solver parameters, whereas in previous versions of scikit-learn, all model parameters would have been displayed.

Now, let's read in a new dataset for which we don't know the target values. You can read it from a URL or from your local computer, so choose whichever option you prefer. Again, we are only going to keep the first 10 rows.

You'll notice that it has the same columns as the df DataFrame, except that there's no Survived column, which is the column that we're going to predict.

Before we make predictions, we have to define X_new. It has to have the same columns as X, and those columns have to be in the same order.

Finally, we'll use the trained model to make predictions by passing X_new to the predict method, which outputs a NumPy array. There are 10 predictions because it makes 1 prediction for each sample in X_new.

The predictions are in the same order as the samples in X_new, meaning the first prediction is for the first row in X_new, the second prediction is for the second row in X_new, and so on.

Note that we can't actually evaluate the accuracy of these predictions because we don't know the true target values for the samples in X_new.

**Ways to improve the model:**

- Hyperparameter tuning
- Adding or removing features
- Trying a different model

In [123]:
logreg.fit(X, y)

LogisticRegression(random_state=1, solver='liblinear')

**Important points about model fitting:**

- Train your model on the entire dataset before making predictions
- Assignment statement is unnecessary
- Passing pandas objects is fine
- Only prints parameters that have changed (version 0.23 or later)
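
To illustrate the second point, here is a small sketch showing that `fit` modifies the model object in place and returns that same object, which is why an assignment statement is unnecessary. The 10 training rows are re-created inline so the cell is self-contained; the variable name `returned` is used only for this demonstration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The 10 training rows shown in lesson 2.1, typed in by hand
X = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05,
             8.4583, 51.8625, 21.075, 11.1333, 30.0708],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name='Survived')

logreg = LogisticRegression(solver='liblinear', random_state=1)

# fit() returns the estimator itself, already fitted
returned = logreg.fit(X, y)
same_object = returned is logreg
print(same_object)

# The learned coefficients now live on the original object
print(logreg.coef_)
```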

In [124]:
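# Option 1: read the first 10 rows from the URL
df_new = pd.read_csv('http://bit.ly/MLnewdata', nrows=10)
# Option 2: read from a local copy instead (uncomment if you have downloaded the file)
# df_new = pd.read_csv('titanic_new.csv', nrows=10)
df_new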

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| 5 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.225 | NaN | S |
| 6 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q |
| 7 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0 | NaN | S |
| 8 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C |
| 9 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.15 | NaN | S |


In [125]:
X_new = df_new[['Parch', 'Fare']]
X_new

| | Parch | Fare |
|---|---|---|
| 0 | 0 | 7.8292 |
| 1 | 0 | 7.0 |
| 2 | 0 | 9.6875 |
| 3 | 0 | 8.6625 |
| 4 | 1 | 12.2875 |
| 5 | 0 | 9.225 |
| 6 | 0 | 7.6292 |
| 7 | 1 | 29.0 |
| 8 | 0 | 7.2292 |
| 9 | 0 | 24.15 |


In [126]:
logreg.predict(X_new)

array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1], dtype=int64)
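
Because X_new has to have the same columns as X in the same order, one defensive habit is to select the new columns by reusing `X.columns` rather than retyping the list. This is a sketch under assumptions: the shuffled column order of `df_new` below is hypothetical (the real file already matches), and only three rows are typed in to keep the example short.

```python
import pandas as pd

# A few training rows with the two selected features
X = pd.DataFrame({'Parch': [0, 0, 0], 'Fare': [7.25, 71.2833, 7.925]})

# Hypothetical new data: same values as the first rows of df_new,
# but with the columns in a different order plus an extra column
df_new = pd.DataFrame({
    'Fare': [7.8292, 7.0, 9.6875],
    'Pclass': [3, 3, 2],
    'Parch': [0, 0, 0],
})

# Reuse the training columns so that names and order are guaranteed to match
X_new = df_new[X.columns]
print(list(X_new.columns))
```

If a required column were missing from `df_new`, this selection would raise a `KeyError` instead of silently producing mismatched features.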

## 2.4 Q&A: How do I adapt this workflow to a regression problem?

In this course, we're going to be focusing on classification problems, which means that the target you're trying to predict is categorical. The other main type of prediction problem is regression, in which your target value is continuous.

If you're planning to work on a regression problem, the good news is that the workflow I'm teaching will work just as well to solve classification or regression problems. There are only two changes you will need to make to adapt this workflow for regression:

- First, you will need to choose a different Machine Learning model. For example, you might choose linear regression instead of logistic regression, since linear regression predicts continuous values whereas logistic regression predicts class values.
- Second, you will need to choose a different model evaluation metric. For example, you might choose mean squared error instead of accuracy, since mean squared error is appropriate for continuous values whereas accuracy is only appropriate for categorical values.

**Adapting this workflow for regression:**

1. Choose a different model
2. Choose a different evaluation metric
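
As a rough sketch of those two changes, the cell below swaps in `LinearRegression` and scores it with mean squared error. The data is synthetic and exists only for illustration (it is not one of the course datasets). Note that scikit-learn names this scorer `'neg_mean_squared_error'`, so the returned values are negated errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, invented purely for this example
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(30, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, size=30)

# Change 1: a regression model instead of a classifier
linreg = LinearRegression()

# Change 2: a regression metric instead of accuracy
neg_mse = cross_val_score(linreg, X, y, cv=3,
                          scoring='neg_mean_squared_error')

# Flip the sign to report an ordinary (positive) mean squared error
mse = -neg_mse.mean()
print(mse)
```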

## 2.5 Q&A: How do I adapt this workflow to a multiclass problem?

In this chapter, we worked on a binary classification problem, which means there are only two possible output classes.

Multiclass problems are ones in which there are more than two output classes. The classic example of this is the iris dataset, in which each iris plant can be classified as one of three possible species.

Thankfully, in scikit-learn, all classifiers automatically handle multiclass problems with no changes to the workflow. scikit-learn detects the number of classes from the data, so you don't even have to tell it that you're working on a multiclass problem.

So how do classifiers handle multiclass problems?

- Many classifiers are inherently multiclass, meaning that they work exactly the same regardless of the number of classes.
- For classifiers that only work in the binary case, they can be extended to the multiclass case using the so-called "one-vs-one" or "one-vs-rest" strategies, in which multiple models are fit and the results are combined. You can research more about these strategies if you're interested, but the key point is that this is handled for you automatically by scikit-learn without you needing to do anything special.

**Types of classification problems:**

- **Binary:** Two output classes
- **Multiclass:** More than two output classes

**How classifiers handle multiclass problems:**

- Many are inherently multiclass
- Others can be extended using "one-vs-one" or "one-vs-rest" strategies
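
Here is a brief sketch using the iris dataset mentioned above. It fits a logistic regression without telling it the number of classes, then builds an explicit one-vs-rest wrapper to show what that strategy looks like when written out. The explicit wrapper is shown for illustration only; as noted above, you normally don't need to do this yourself.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Iris has three species, so this is a multiclass problem
X, y = load_iris(return_X_y=True)

# No multiclass-specific setup: the classes are detected from y
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
n_classes = len(clf.classes_)
print(clf.classes_)

# Explicit one-vs-rest: one binary model is fit per class
ovr = OneVsRestClassifier(
    LogisticRegression(solver='liblinear', random_state=1))
ovr.fit(X, y)
n_binary_models = len(ovr.estimators_)
print(n_binary_models)
```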

## 2.6 Q&A: Why should I select a Series for the target?

From our DataFrame, there are two ways you could imagine selecting the target variable of Survived:

- The first way is as a pandas Series, which we can do using a single set of brackets.
- The second way is as a pandas DataFrame with one column, which we can do using two sets of brackets.

These two objects look similar, but they actually have different shapes. The Series is a one-dimensional object, while the DataFrame is a two-dimensional object.

The difference between these two objects is more clear if you convert them to NumPy arrays using the to_numpy method.

Now that you've seen that these two objects are different, the question is: Why does it matter which object you use for the target?

To answer this question, I have to briefly explain multilabel classification. A multilabel classification problem is one in which each sample can simultaneously have more than one label. The classic example of this is classifying the topic of a document. For example, a document might be about politics, religion, or law, or it might fit into multiple topics at once.

This is different from multiclass classification because multiclass only allows a sample to have a single label, whereas multilabel allows a single sample to have multiple labels.

Anyway, scikit-learn supports multilabel classification by allowing you to represent the target as a two-dimensional object. For example, if you had 10 documents and there were 3 possible labels, you would actually use a 10 by 3 DataFrame as your y value.

That is all to say that using a two-dimensional DataFrame as your y value signals to scikit-learn that you are working on a multilabel problem. Our classification problem is not multilabel, and thus we use a one-dimensional Series as our y value to signal that we are working on a single-label problem.

In [127]:
df['Survived']

0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64

In [128]:
df[['Survived']]

| | Survived |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 1 |
| 9 | 1 |


In [129]:
df['Survived'].shape

(10,)

In [130]:
df[['Survived']].shape

(10, 1)

In [131]:
df['Survived'].to_numpy()

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], dtype=int64)

In [132]:
df[['Survived']].to_numpy()

array([[0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1]], dtype=int64)

**Multilabel vs multiclass problems:**

- **Multilabel:** Each sample can have more than one label
- **Multiclass:** Each sample can have one label

**Multilabel vs multiclass targets:**

- **Multilabel:** 2-dimensional y (DataFrame)
- **Multiclass:** 1-dimensional y (Series)
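
If you want to see how scikit-learn itself classifies a target, the `type_of_target` utility reports it. The sketch below compares the one-dimensional Survived values with a hypothetical 10 by 3 multilabel target; the document labels in `Y_multi` are invented for illustration, and plain NumPy arrays are used instead of pandas objects to keep the example minimal.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# One-dimensional target: the Survived values from our 10 rows
y_single = np.array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])

# HYPOTHETICAL two-dimensional target: 10 documents, 3 possible topics
# (columns could stand for politics, religion, law); 1 means "has label"
Y_multi = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
])

single_type = type_of_target(y_single)
multi_type = type_of_target(Y_multi)
print(single_type)
print(multi_type)
```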

## 2.7 Q&A: How do I add the model's predictions to a DataFrame?

Let's say that you wanted to match up the 10 predictions output by the model with the 10 rows of the X_new DataFrame so that you can see the predictions next to the features.

To do this, you would first convert the predictions from a NumPy array to a pandas Series. Note that we are setting the Series index to match the index of X_new, and we're giving the Series a name.

Then, you use the concat function to concatenate the X_new DataFrame and the predictions Series along the columns axis, which outputs a DataFrame. Note that the name of the Series became the name of that DataFrame column.

In [133]:
predictions = pd.Series(logreg.predict(X_new), index=X_new.index,
                        name='Prediction')

In [134]:
pd.concat([X_new, predictions], axis='columns')

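| | Parch | Fare | Prediction |
|---|---|---|---|
| 0 | 0 | 7.8292 | 0 |
| 1 | 0 | 7.0 | 0 |
| 2 | 0 | 9.6875 | 0 |
| 3 | 0 | 8.6625 | 0 |
| 4 | 1 | 12.2875 | 1 |
| 5 | 0 | 9.225 | 0 |
| 6 | 0 | 7.6292 | 0 |
| 7 | 1 | 29.0 | 1 |
| 8 | 0 | 7.2292 | 0 |
| 9 | 0 | 24.15 | 1 |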
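
Setting the Series index matters because `concat` aligns rows by index label, not by position. The sketch below assumes a scenario that does not occur in this chapter: X_new has been filtered down to rows 4 through 6, so its index no longer starts at 0. The values are copied from the tables above; the variable names `misaligned` and `aligned` are illustrative.

```python
import pandas as pd

# Rows 4-6 of X_new, keeping their original index labels
X_new = pd.DataFrame({'Parch': [1, 0, 0],
                      'Fare': [12.2875, 9.225, 7.6292]},
                     index=[4, 5, 6])

# The model's predictions for those three rows
preds = [1, 0, 0]

# Without index=..., the Series gets labels 0, 1, 2, so nothing lines up
misaligned = pd.concat([X_new, pd.Series(preds, name='Prediction')],
                       axis='columns')

# With the index copied from X_new, each prediction sits next to its row
aligned = pd.concat([X_new,
                     pd.Series(preds, index=X_new.index, name='Prediction')],
                    axis='columns')

print(misaligned.shape)
print(aligned)
```

The misaligned result contains extra rows filled with missing values, which is an easy mistake to overlook in a large DataFrame.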


## 2.8 Q&A: How do I determine the confidence level of each prediction?

For some classification problems, you are only interested in the predicted class labels. However, sometimes it's useful to output the predicted probabilities of class membership using the predict_proba method.

The output array has 10 rows because the model made predictions for 10 samples, and it has 2 columns because there are 2 possible classes. The left column represents class 0, and the right column represents class 1.

Let's talk about how to interpret this array, using the first row as an example:

- The model calculated a likelihood of 58% that the first sample in X_new was class 0 and a 42% likelihood that it was class 1.
- Because class 0 had a higher likelihood, the model predicted class 0 for this sample.
- And as you might imagine, the values in every row will always add up to 1.

If you need just the second column, meaning the predicted probabilities of class 1, then you can extract it using NumPy's slicing notation. The colon means select all rows, and the 1 means select the column in the 1 position, which is the second column.

Some classifiers, such as logistic regression, are known as well-calibrated classifiers, which means that their predicted probabilities can be directly interpreted as the model's confidence level in that prediction. So for example, it's more confident that the 8th sample is class 1 than it is that the 10th sample is class 1.

Knowing these confidence levels can be useful if you're most interested in the samples with the highest predicted probabilities. For example, if you were trying to predict who might be interested in purchasing a specific product, you might focus all of your marketing budget on reaching those customers with the highest predicted probabilities of purchase.

Keep in mind that there are other classifiers which are not as well-calibrated, such as Naive Bayes. In those cases, it's less appropriate to interpret their predicted probabilities as confidence levels. Thus if you know you're going to be interested in these confidence levels, it's best to use a well-calibrated classifier like logistic regression.

In [135]:
logreg.predict(X_new)

array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1], dtype=int64)

In [136]:
logreg.predict_proba(X_new)

array([[0.57804075, 0.42195925],
       [0.58275546, 0.41724454],
       [0.56742414, 0.43257586],
       [0.57328835, 0.42671165],
       [0.48357081, 0.51642919],
       [0.57007262, 0.42992738],
       [0.57917926, 0.42082074],
       [0.38795132, 0.61204868],
       [0.58145374, 0.41854626],
       [0.48342837, 0.51657163]])

**Array of predicted probabilities:**

- One row for each sample
- One column for each class

In [137]:
logreg.predict_proba(X_new)[:, 1]

array([0.42195925, 0.41724454, 0.43257586, 0.42671165, 0.51642919,
       0.42992738, 0.42082074, 0.61204868, 0.41854626, 0.51657163])
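
To connect these probabilities back to the class predictions, and to the marketing example of focusing on the most likely samples, here is a small NumPy sketch. It reuses the class 1 probabilities printed above, so no model is needed. The 0.5 threshold is an assumption that holds for this binary problem, where the class with the higher probability is predicted.

```python
import numpy as np

# Predicted probabilities of class 1, copied from the output above
proba_class_1 = np.array([0.42195925, 0.41724454, 0.43257586, 0.42671165,
                          0.51642919, 0.42992738, 0.42082074, 0.61204868,
                          0.41854626, 0.51657163])

# Class 1 is predicted whenever its probability exceeds 0.5
predicted = (proba_class_1 > 0.5).astype(int)
print(predicted)

# Rank the samples from most to least likely to be class 1
ranked = np.argsort(proba_class_1)[::-1]
top_3 = ranked[:3]
print(top_3)
```

The three highest-ranked positions are 7, 9, and 4, which correspond to the 8th, 10th, and 5th samples in X_new.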

## 2.9 Q&A: How do I check the accuracy of the model's predictions?

Earlier in this chapter, we made predictions for the 10 samples in X_new, though we couldn't check the accuracy of these predictions because we don't know the true target values for those samples. That will sometimes be the case in the real world.

For example, if you built a model to predict what medical conditions someone might develop based on their genetic information, you may not ever find out whether the model's predictions were correct, either because that data is not being collected or because that data is protected by privacy laws.

In other cases, you can actually check the accuracy of your predictions. For example, if you built a model to predict the outcome of all US Supreme Court cases, you would make those predictions before those cases were decided, and then you could check the model's accuracy once the court's rulings were publicly announced.

Ideally, the actual accuracy of your model will be close to the accuracy that you estimated using your training data during model evaluation. If it's not close, that could indicate a problem with your model evaluation procedure, or it could indicate that there are some important differences between your training data and the new data.

Whenever you do learn the true target values, you can incorporate this new data into your training data, which should help the model to make better predictions in the future.

**Checking model accuracy:**

- **Not possible:** Target value is unknown or is private data
- **Possible:** Target value is known
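
When the true target values do become available, checking the predictions is a single function call. In the sketch below, the predictions are the ones output in lesson 2.3, but the "true" values are entirely hypothetical: they are invented to demonstrate the call and say nothing about the real passengers.

```python
from sklearn.metrics import accuracy_score

# Predictions the model made for the 10 samples in X_new
y_pred = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

# HYPOTHETICAL true values, invented purely for illustration
y_true_hypothetical = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]

# Fraction of predictions that match the (made-up) true values
accuracy = accuracy_score(y_true_hypothetical, y_pred)
print(accuracy)
```

You would then compare this number with the accuracy estimated during cross-validation to see whether the two are close.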

## 2.10 Q&A: What do the "solver" and "random_state" parameters do?

In this chapter, when creating the model object, I set the logistic regression's solver to liblinear, and I set the random_state to 1. I set these values so that if you ran the same code at home, you would most likely get the same results as me. In this lesson, I'll explain what these two parameters actually do.

The solver is the algorithm used to solve the optimization problem of calculating the logistic regression's coefficients. In other words, given the features and the target, the solver figures out the coefficients. The solvers have different strengths and weaknesses, different properties, and ultimately may come up with different results. Here's a comparison chart from the scikit-learn documentation.

I recommend reviewing this chart and reading the documentation to decide which solver to use for your particular problem, but ultimately it's fine to just try each one and see what happens.

liblinear used to be the default solver for logistic regression, but in scikit-learn version 0.22, they changed the default to lbfgs instead. I'm having all of us use liblinear in this course so that we will tend to get the same results, regardless of scikit-learn version. Keep in mind that we still aren't guaranteed to get the exact same results, because with each new version, bugs are fixed and other algorithm parameters are sometimes changed.

One final note about the solver is that if you ever get a convergence warning when using logistic regression, the best solution is usually just to use a different solver.

Next, let's talk about the random_state parameter. It happens that there is some randomness involved in three of the solvers, including liblinear. That means you may get different results each time you fit the model. By setting the random_state, you ensure that your model will output the same results every time.

More generally, any time you're running a scikit-learn function that involves a random process, I recommend setting the random_state parameter to any integer. That allows your code to be reproducible, both by you and others. Keep in mind that the only way to know whether a given function involves a random process is by reading the documentation.

In [138]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

<img src="https://www.dataschool.io/files/solver_comparison.png">

**Default solver for logistic regression:**

- **Before version 0.22:** liblinear
- **Starting in version 0.22:** lbfgs

**Advice for random_state:**

- Set random_state to any integer when a random process is involved
- Allows your code to be reproducible
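
The following sketch illustrates both parameters on the 10 training rows, which are re-created inline so the cell runs on its own. It fits liblinear twice with the same random_state to confirm the coefficients match, then fits lbfgs so you can compare its coefficients yourself. The `max_iter=1000` setting is an assumption added here to give lbfgs more iterations on these unscaled features; it is not part of the course code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The 10 training rows shown in lesson 2.1, typed in by hand
X = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05,
             8.4583, 51.8625, 21.075, 11.1333, 30.0708],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name='Survived')

# Same solver and same random_state: the two fits should agree
first = LogisticRegression(solver='liblinear', random_state=1).fit(X, y)
second = LogisticRegression(solver='liblinear', random_state=1).fit(X, y)
same_coefficients = np.allclose(first.coef_, second.coef_)
print(same_coefficients)

# A different solver may arrive at different coefficients
other = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X, y)
print(first.coef_)
print(other.coef_)
```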

## 2.11 Q&A: How do I show all of the model parameters?

Starting in scikit-learn version 0.23, when you print any estimator (such as a model, a transformer, or a pipeline), it will only show you the parameters that are not set to their default values. For example, when we print out the logreg model object, it only shows the random_state and solver parameters because we set those explicitly.

However, you can still see all parameters by running the get_params method.

If you like, you can restore the behavior from previous scikit-learn versions by importing the set_config function and then setting the print_changed_only parameter to False. Now, all parameters will be printed, regardless of whether you've changed them.

I prefer the new behavior, so I'm going to set print_changed_only back to True.

In [139]:
logreg

LogisticRegression(random_state=1, solver='liblinear')

In [140]:
logreg.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 1,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [141]:
from sklearn import set_config
set_config(print_changed_only=False)

In [142]:
logreg

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [143]:
set_config(print_changed_only=True)
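
If you ever want the changed parameters as a dictionary rather than as printed text, one possible approach is to compare `get_params` against a default instance. This is only a sketch of that idea, not a built-in scikit-learn feature, and the names `defaults` and `changed` are illustrative.

```python
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='liblinear', random_state=1)

# Parameters of an instance created with no arguments
defaults = LogisticRegression().get_params()

# Keep only the parameters whose values differ from the defaults
changed = {name: value
           for name, value in logreg.get_params().items()
           if value != defaults[name]}
print(changed)
```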

## 2.12 Q&A: Should I shuffle the samples when using cross-validation?

When I ran cross_val_score earlier in this chapter, I passed an integer, 3 in this case, that specified the number of cross-validation folds.

This code shows you what happens "under the hood" when you specify cv=3 for a classification problem. I'm going to walk through this code so you understand what's happening and you can modify it when needed.

First, you'll notice that we're importing a class called StratifiedKFold. It's known as a cross-validation splitter, which means that its role is to split datasets. We create an instance of this class and pass it a 3 so that it will create 3 folds. And then we can pass this instance to cross_val_score instead of an integer.

It's called StratifiedKFold because it uses stratified sampling to ensure that the class proportions are approximately equal in each fold. For example, if 40% of the passengers in the dataset survived, then stratified sampling ensures that about 40% of each fold is survived passengers.

In other words, it ensures that each fold is representative of the entire dataset. Stratified sampling is desirable because it produces more reliable cross-validation scores, and again, scikit-learn will do this for you by default.

Another good thing to know about StratifiedKFold is that by default, it does not shuffle the samples before splitting. Thus, there is nothing random about this process, and as such you will get the same results every time you run cross_val_score.

In most cases, it doesn't matter whether you shuffle the samples before splitting. However, if the order of the samples in your dataset is not arbitrary, then it's important to randomly shuffle the samples when cross-validating.

For example, you could imagine that if your dataset was sorted by one of the features, then some folds would only have high values of that feature and other folds would only have low values of that feature, which could result in unreliable cross-validation scores.

If you do need to shuffle the samples, you simply modify the cross-validation splitter by setting shuffle to True, and then pass that splitter object to cross_val_score. Note that because you are introducing randomness into the process by shuffling, you should also set a random_state to ensure reproducibility.

In summary, if you have a classification problem and the samples are in an arbitrary order, you can just pass an integer to the cv parameter of cross_val_score, and it will use stratified sampling without shuffling.

If your samples are not in an arbitrary order, you should use StratifiedKFold as your splitter and set shuffle to True, and then pass the splitter object to the cv parameter of cross_val_score, as I did above.

Finally, it's worth mentioning that if you're working on a regression problem instead, and you need to shuffle the samples, you should use the KFold class instead of the StratifiedKFold class, because stratified sampling does not apply to regression problems.

In [144]:
cross_val_score(logreg, X, y, cv=3, scoring='accuracy')

array([0.75      , 0.66666667, 0.66666667])

In [145]:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(3)
cross_val_score(logreg, X, y, cv=kf, scoring='accuracy')

array([0.75      , 0.66666667, 0.66666667])

**Stratified sampling:**

- Ensures that each fold is representative of the dataset
- Produces more reliable cross-validation scores
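
To make the idea of stratified sampling concrete, here's a minimal sketch using a made-up target in which 40% of the samples are positive. The names `y_demo`, `X_demo`, and `fold_rates` are hypothetical and are not part of the course notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical target: 10 samples, 4 of which (40%) are positive
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_demo = np.zeros((10, 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=2)

# Share of positive samples in the testing portion of each fold
fold_rates = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_rates)
```

Each fold should contain the same 40% share of positive samples as the full target.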

In [146]:
kf = StratifiedKFold(3, shuffle=True, random_state=1)
cross_val_score(logreg, X, y, cv=kf, scoring='accuracy')

array([0.75      , 0.33333333, 0.66666667])

**When to shuffle your samples:**

- **Samples in arbitrary order:** Shuffling not needed
- **Samples are ordered:** Shuffling needed

**How to shuffle your samples:**

- **Classification:** StratifiedKFold
- **Regression:** KFold
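
For the regression case, the same pattern applies with KFold. This is only a sketch on synthetic data that is deliberately sorted by its one feature. The names `X_reg`, `y_reg`, and `kf_reg` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, sorted by the feature (so the order is not arbitrary)
rng = np.random.RandomState(0)
X_reg = np.sort(rng.uniform(0, 10, size=(30, 1)), axis=0)
y_reg = 3 * X_reg.ravel() + rng.normal(size=30)

# Shuffle before splitting, and set random_state for reproducibility
kf_reg = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X_reg, y_reg, cv=kf_reg, scoring='r2')
print(scores)
```

Because random_state is set, running cross_val_score again with the same splitter should return the same three scores.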

# Chapter 3: Encoding categorical features

## 3.1 Introduction to one-hot encoding 

**How to run the code above:**

- **Jupyter Notebook:**
  - Select this cell
  - Click "Cell" menu, then "Run All Above"
- **JupyterLab:**
  - Select this cell
  - Click "Run" menu, then "Run All Above Selected Cell"

In this chapter, we're going to focus on one of the most important data preprocessing steps, which is the encoding of categorical features.

Before we start, it's important to note that this chapter builds on the objects and the imports from the previous chapter. So if you just opened your notebook a moment ago, you need to run all of the code above before starting this chapter.

- If you're using the Jupyter Notebook, the easiest way to do this is to select this cell, click on the "Cell" menu and then select "Run All Above", which runs all of the cells above the currently selected cell.
- If you're using JupyterLab, you would select this cell, click the "Run" menu and then select "Run All Above Selected Cell", which does the same thing.

You should repeat this process every time you open the notebook.

Now that that's complete, let's take a look at our Titanic DataFrame. In the last chapter, the only features we used were Parch and Fare. In this chapter, we want to add Embarked and Sex as additional features, in case they improve our model.

As a reminder, Parch is the number of parents or children aboard with that passenger, and Fare is the amount the passenger paid. Our first new feature, Embarked, is the port that each passenger embarked from, and the possible values are C, Q, or S. Our other new feature, Sex, is simply male or female.

Both Embarked and Sex are known as unordered categorical features because there are distinct categories and there's no inherent logical ordering to the categories. This type of data is also known as nominal data.

All scikit-learn models expect features to be numeric, and so Embarked and Sex can't actually be passed directly to a model. Instead, we're going to encode them using a process called one-hot encoding, also known as dummy encoding.



In [147]:
df

&nbsp; | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
:--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :---
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S
4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S
5 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q
6 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
7 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S
8 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S
9 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C


**Currently selected features:**

- **Parch:** Number of parents or children aboard with that passenger
- **Fare:** Amount the passenger paid
- **Embarked:** Port the passenger embarked from
- **Sex:** Male or Female

**Unordered categorical data:**

- Contains distinct categories
- No inherent logical ordering to the categories
- Also called "nominal data"

Let's look at the code for one-hot encoding. First, we import the OneHotEncoder class from the preprocessing module. Then, we create an instance of it and set sparse to False.

By default, OneHotEncoder will output a sparse matrix, which is the most efficient and performant data structure for this type of data. By setting sparse to False, it will instead output a dense matrix, which is just the normal way of representing a matrix. This representation will allow us to examine the output so that we can understand the encoding scheme.



In [148]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

**Matrix representations:**

- **Sparse:** More efficient and performant
- **Dense:** More readable
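
Here's a small sketch of the difference between the two representations, using a hypothetical four-row `ports` DataFrame. Rather than setting sparse to False, this example keeps the default sparse output and converts it with `toarray`, since the name of that parameter depends on your scikit-learn version (it is called sparse_output in version 1.2 and later).

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-dataset with one categorical feature
ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# Default behavior: the output is a SciPy sparse matrix
encoded_sparse = OneHotEncoder().fit_transform(ports)
print(sparse.issparse(encoded_sparse))

# Convert to a dense NumPy array so the encoding can be examined
encoded_dense = encoded_sparse.toarray()
print(encoded_dense)
```

The dense array contains the same values as the sparse matrix, but it also stores all of the zeros explicitly.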

Next, we'll encode the Embarked column by passing it to the fit_transform method of the OneHotEncoder. We'll talk about the fit_transform method in the next lesson, but for now I just want to highlight that we'll use double brackets around Embarked to pass it as a single-column DataFrame instead of using single brackets to pass it as a Series.

This is important because OneHotEncoder expects to receive a two-dimensional object (such as a DataFrame) since a one-dimensional object is considered ambiguous. A one-dimensional Series could be interpreted either as a single feature or a single sample, whereas our two-dimensional DataFrame signals to scikit-learn that this is indeed a single feature.

Running the fit_transform method outputs this 10 by 3 array. This is the encoded version of the Embarked column, and it is exactly what we will pass to the model instead of the strings C, Q, and S.

Let's talk about how we interpret this output. There are 3 columns because there were 3 unique values in Embarked. Each row contains a single 1, and the rest of the values in the row are 0. 100 means "C", 010 means "Q", and 001 means "S", which you can confirm by comparing it to the Embarked column in our DataFrame.

As an aside, this is called one-hot encoding because in each row there is one "hot" level, meaning one non-zero level.

This is also the same output you would get by using the get_dummies function in pandas, though we'll talk later in the course why it's best to do all of your preprocessing in scikit-learn instead of pandas.

Let's now look at the categories_ attribute of the OneHotEncoder. You can think of it as the column header for our 10 by 3 array. In other words, the categories_ attribute tells you that the first column represents C, the second column represents Q, and the third column represents S. Because the categories are always in alphabetical order from left to right, I didn't actually have to examine the categories_ attribute in order to know how to interpret the array.

As an aside, you'll notice a lot of attributes in scikit-learn end in an underscore. This is scikit-learn's convention for any attribute that is learned or estimated from the data during the fit step.

We've now seen how the OneHotEncoder encodes the Embarked feature. But why is this a reasonable way to encode a categorical feature?

You can think of it this way: OneHotEncoder creates a feature from each level so that the model can learn the relationship between each level and the target value. In this case, the model can learn the relationship between the target value of Survived and whether or not a passenger embarked at a given port.

For example, the model might learn from the first feature that passengers who embarked at C have a higher survival rate than passengers who didn't embark at C. This is similar to how a model might learn from a numeric feature like Fare that passengers with a higher Fare have a higher survival rate than passengers with a lower Fare.

At this point, you might be wondering whether we could have instead encoded Embarked as a single numeric feature with the values 0, 1, and 2 representing C, Q, and S. The answer is that yes, we can do this, but it's generally not a good idea to do this with unordered categories because it would imply an ordering that doesn't inherently exist.

To see why it's not a good idea, let's pretend that passengers who embarked at C and S had high survival rates, and passengers who embarked at Q had low survival rates. If Embarked were encoded as a single feature, a linear model like logistic regression would have no way to learn that relationship, since it assigns only one coefficient to each feature. That one coefficient can't be both negative (to represent the impact of Q with respect to C) and positive (to represent the impact of S with respect to Q).

In summary, encoding Embarked as a single feature would prevent a linear model from learning a non-linear relationship in the data, which is why encoding it as multiple features is generally the better choice.


**Why use double brackets?**

- **Single brackets:**
  - Outputs a Series
  - Could be interpreted as a single feature or a single sample
- **Double brackets:**
  - Outputs a single-column DataFrame
  - Interpreted as a single feature
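
You can check the difference between single and double brackets directly in pandas. This sketch uses a hypothetical four-row `ports` DataFrame rather than the Titanic data.

```python
import pandas as pd

# Hypothetical mini-dataset with one categorical feature
ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

as_series = ports['Embarked']    # single brackets: one-dimensional Series
as_frame = ports[['Embarked']]   # double brackets: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has a shape of (4,), whereas the single-column DataFrame has a shape of (4, 1), which is the two-dimensional shape that OneHotEncoder expects.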

In [149]:
ohe.fit_transform(df[['Embarked']])

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])

**Output of OneHotEncoder:**

- One column for each unique value
- One non-zero value in each row:
  - 1, 0, 0 means "C"
  - 0, 1, 0 means "Q"
  - 0, 0, 1 means "S"
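
As a quick check of the get_dummies comparison, this sketch encodes a hypothetical `ports` DataFrame both ways and confirms that the values match. It converts the default sparse output with `toarray` instead of setting sparse to False.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-dataset with one categorical feature
ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# pandas approach: one dummy column per unique value
dummies = pd.get_dummies(ports['Embarked'], dtype=float)

# scikit-learn approach: convert the default sparse output to a dense array
encoded = OneHotEncoder().fit_transform(ports[['Embarked']]).toarray()

same = np.array_equal(dummies.to_numpy(), encoded)
print(list(dummies.columns), same)
```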

In [150]:
# can be considered the column header of our ohe output
ohe.categories_

[array(['C', 'Q', 'S'], dtype=object)]
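
If you'd like to see the encoded array with its column header attached, one option is to wrap it in a DataFrame. This is just a sketch for inspecting the output, and the names `ports`, `ohe_demo`, and `labeled` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-dataset with one categorical feature
ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

ohe_demo = OneHotEncoder()
encoded = ohe_demo.fit_transform(ports[['Embarked']]).toarray()

# categories_ is a list containing one array per input feature,
# so the first element holds the categories learned from Embarked
labeled = pd.DataFrame(encoded, columns=ohe_demo.categories_[0])
print(labeled)
```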

**Why use one-hot encoding?**

- Model can learn the relationship between each level and the target value
- Example: Model might learn that "C" passengers have a higher survival rate than "not C" passengers

**Why not encode as a single feature?**

- **Pretend:**
  - C: high survival rate
  - Q: low survival rate
  - S: high survival rate
- **Single feature would need two coefficients:**
  - Negative coefficient for impact of Q (with respect to C)
  - Positive coefficient for impact of S (with respect to Q)
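
Here's a sketch of that thought experiment using made-up data in which every C and S passenger survives and no Q passenger does. The data and the variable names are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: 20 passengers per port, where 0 = C, 1 = Q, 2 = S
codes = np.repeat([0, 1, 2], 20)
survived = np.where(codes == 1, 0, 1)  # only the Q passengers do not survive

X_single = codes.reshape(-1, 1)  # Embarked as a single numeric feature
X_onehot = np.eye(3)[codes]      # Embarked as three one-hot features

# Training accuracy is enough to show what each encoding can represent
acc_single = LogisticRegression().fit(X_single, survived).score(X_single, survived)
acc_onehot = LogisticRegression().fit(X_onehot, survived).score(X_onehot, survived)
print(acc_single, acc_onehot)
```

With a single feature, the predicted probability can only move in one direction as the code increases, so the model can classify at most two of the three ports correctly. With the one-hot features, it can learn a separate coefficient for each port.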

## 3.2 Transformer methods: fit, transform, fit_transform

**Generic transformer methods:**

- **fit:** Transformer learns something
- **transform:** Transformer uses what it learned to do the data transformation

**OneHotEncoder methods:**

- **fit:** Learn the categories
- **transform:** Create the feature matrix using those categories

Let's discuss the fit_transform method, since that's the method we used with OneHotEncoder to encode the Embarked feature.

OneHotEncoder is known as a transformer, meaning its role is to perform data transformations. Transformers usually have a "fit" method and always have a "transform" method. The fit method is when the transformer learns something, and the transform method is when it uses what it learned to do the data transformation.

Using OneHotEncoder as an example, the fit method is when it learns the categories from the data in alphabetical order, and the transform method is when it creates the feature matrix using those categories.

The fit_transform method, which is what we used above, just combines those two steps into a single method call. You can actually do those steps as two separate calls of fit then transform, but the single method call of fit_transform is better because it's more computationally efficient and also more readable in my opinion.
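
To confirm that the two approaches produce the same result, here's a minimal sketch using a hypothetical `ports` DataFrame. It converts the default sparse output with `toarray` so that the arrays can be compared.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-dataset with one categorical feature
ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# Two separate calls: fit learns the categories, transform creates the matrix
ohe_two_step = OneHotEncoder()
ohe_two_step.fit(ports)
two_step = ohe_two_step.transform(ports).toarray()

# One combined call
one_step = OneHotEncoder().fit_transform(ports).toarray()

print(np.array_equal(two_step, one_step))
```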

## 3.3 One-hot encoding of multiple features

We saw how to use OneHotEncoder to encode the Embarked column, but we actually need to encode both Embarked and Sex. Thankfully, OneHotEncoder can be applied to multiple features at once.

To do this, we simply pass a two-column DataFrame to the fit_transform method, whereas previously we had passed a one-column DataFrame. It outputs 5 columns, in which the first 3 columns represent Embarked and the last 2 columns represent Sex.

Looking at the categories attribute, we first see the 3 categories that were learned from Embarked in alphabetical order, and then we see the 2 categories that were learned from Sex in alphabetical order. Thus, we know that a 10 in the last two columns means "female", and a 01 in the last two columns means "male".

So for example, the first sample in the output array is 00101, which means they embarked from S and they are male. The second sample in the array is 10010, which means they embarked from C and they are female. And so on.

Recall that our goal in this chapter was to numerically encode Embarked and Sex so we could include them in our model along with Parch and Fare. How might we do that?

One idea would be to manually stack Parch and Fare side-by-side with the 5 columns output by OneHotEncoder, and then train the model using all 7 columns. However, we would need to repeat the same exact process of encoding and stacking with the new data, since if you train a model with 7 features, you need the same 7 features in the new data in order to make predictions.

This process would work, but it's less than ideal, since repeating the same steps twice is both inefficient and error-prone. Not only that, but the complexity of this process will continue to increase as you preprocess additional features.

In the next chapter, I'll introduce you to the ColumnTransformer and Pipeline classes. We'll use these two classes to accomplish our goal of adding Embarked and Sex to our model, but we'll do it in a way that is both reliable and efficient.

In [151]:
ohe.fit_transform(df[['Embarked', 'Sex']])

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.]])

In [152]:
ohe.categories_

[array(['C', 'Q', 'S'], dtype=object), array(['female', 'male'], dtype=object)]

**Decoding the output array:**

- **First three columns:**
  - 1, 0, 0 means "C"
  - 0, 1, 0 means "Q"
  - 0, 0, 1 means "S"
- **Last two columns:**
  - 1, 0 means "female"
  - 0, 1 means "male"
- **Example:**
  - 0, 0, 1, 0, 1 means "S, male"
  - 1, 0, 0, 1, 0 means "C, female"
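
Rather than decoding rows by hand, you can also ask the encoder to do it with its inverse_transform method. This sketch uses the Embarked and Sex values from the first six rows of the DataFrame shown earlier, and the names `people`, `ohe_demo`, and `decoded` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Embarked and Sex values from the first six rows of the Titanic DataFrame
people = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q'],
                       'Sex': ['male', 'female', 'female', 'female', 'male', 'male']})

ohe_demo = OneHotEncoder().fit(people)

# The two example rows from the list above
rows = np.array([[0, 0, 1, 0, 1],
                 [1, 0, 0, 1, 0]])
decoded = ohe_demo.inverse_transform(rows)
print(decoded)
```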

**How to manually add Embarked and Sex to the model:**

1. Stack Parch and Fare side-by-side with OneHotEncoder output
2. Repeat the same process with new data

**Problems with a manual approach:**

- Repeating steps is inefficient and error-prone
- Complexity will increase
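
To show what the manual approach might look like, here's a rough sketch. The training rows are the first six passengers from the DataFrame shown earlier, whereas the `new` passenger is invented for illustration. It is meant to show the repetition involved, not to be a recommended workflow.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# First six rows of the four selected features
train = pd.DataFrame({'Parch': [0, 0, 0, 0, 0, 0],
                      'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
                      'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q'],
                      'Sex': ['male', 'female', 'female', 'female', 'male', 'male']})

# Hypothetical new passenger
new = pd.DataFrame({'Parch': [1], 'Fare': [20.0], 'Embarked': ['Q'], 'Sex': ['female']})

ohe_demo = OneHotEncoder()

# Step 1: encode the categorical columns and stack them next to the numeric ones
X_train = np.hstack([train[['Parch', 'Fare']].to_numpy(),
                     ohe_demo.fit_transform(train[['Embarked', 'Sex']]).toarray()])

# Step 2: repeat the same encoding and stacking for the new data,
# reusing the encoder that was already fit on the training data
X_new = np.hstack([new[['Parch', 'Fare']].to_numpy(),
                   ohe_demo.transform(new[['Embarked', 'Sex']]).toarray()])

print(X_train.shape, X_new.shape)
```

Both arrays end up with the same 7 columns, but only because every step was carefully repeated in the same order.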

## 3.4 Q&A: When should I use transform instead of fit_transform?

Earlier in this chapter, we used the fit_transform method of OneHotEncoder to encode two categorical features. In this lesson, I'll show you when it's appropriate to just use the transform method instead of fit_transform. The example below will use OneHotEncoder, but the principles I'm teaching here apply the same way to all transformers.

We'll start by creating a DataFrame of training data with just 1 categorical feature. Let's run fit_transform on the entire DataFrame.

Recall that fit_transform is really 2 steps. During the first step, which is fit, the OneHotEncoder learns the 3 categories. During the second step, which is transform, the OneHotEncoder creates the feature matrix using those categories. It outputs a 4 by 3 array, since there are 4 samples and 3 categories.

Now, we'll create a DataFrame of testing data. It contains the same feature, but that feature includes one less category. What would happen if we ran fit_transform on the testing data?

The output array only includes two columns, because the testing data only included two categories. The first column represents the A category, and the second column represents the C category.

This is problematic, because if we trained a model using the 3-column feature matrix, and then tried to make predictions on the 2-column feature matrix, it would error due to a shape mismatch. That makes sense because if you train a model such as logistic regression using 3 features, it will learn 3 coefficients, and it expects to use all 3 of those coefficients when making predictions.

The solution is to run fit_transform on the training data, and only run transform on the testing data. Let's take a look at the output arrays.

Notice that the categories are represented the same way in both arrays: the first column represents A, the second column represents B, and the third column represents C.

This happened because we only ran the fit method once, on the training data, and the fit method is when the OneHotEncoder learns the categories.

Then we ran the transform method twice, both on the training data and the testing data. Because we didn't run the fit method on the testing data, the categories learned from the training data were applied to both the training and testing data. This is critically important because it means that both our training and testing feature matrices have 3 columns, and those 3 columns mean the same thing.

In summary, when using any transformer, you will always use the fit_transform method on the training data and only the transform method on the testing data.

In [153]:
demo_train = pd.DataFrame({'letter':['A', 'B', 'C', 'B']})
demo_train

&nbsp; | letter
:--- | :---
0 | A
1 | B
2 | C
3 | B


In [154]:
ohe.fit_transform(demo_train)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

**Example of fit_transform on training data:**

- **fit:** Learn 3 categories (A, B, C)
- **transform:** Create feature matrix with 3 columns

In [155]:
demo_test = pd.DataFrame({'letter':['A', 'C', 'A']})
demo_test

&nbsp; | letter
:--- | :---
0 | A
1 | C
2 | A


In [156]:
ohe.fit_transform(demo_test)

array([[1., 0.],
       [0., 1.],
       [1., 0.]])

**Example of fit_transform on testing data:**

- **fit:** Learn 2 categories (A, C)
- **transform:** Create feature matrix with 2 columns

In [157]:
ohe.fit_transform(demo_train)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [158]:
ohe.transform(demo_test)

array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

**Correct process:**

1. Run fit_transform on training data:
  - **fit:** Learn 3 categories (A, B, C)
  - **transform:** Create feature matrix with 3 columns
2. Run transform on testing data:
  - **transform:** Create feature matrix with 3 columns
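
Here's a sketch of what the shape mismatch looks like once a model is involved. The labels in `y_train` are invented so that there is something to train on, and the other new names are also hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

demo_train = pd.DataFrame({'letter': ['A', 'B', 'C', 'B']})
demo_test = pd.DataFrame({'letter': ['A', 'C', 'A']})
y_train = [0, 1, 0, 1]  # invented labels, only for illustration

# fit_transform on the training data, then train on the 3-column matrix
ohe_demo = OneHotEncoder()
model = LogisticRegression().fit(ohe_demo.fit_transform(demo_train), y_train)

# Incorrect: a fresh fit on the testing data produces only 2 columns
wrong_test = OneHotEncoder().fit_transform(demo_test)
try:
    model.predict(wrong_test)
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
print(mismatch_error)

# Correct: transform the testing data with the already-fitted encoder
right_test = ohe_demo.transform(demo_test)
predictions = model.predict(right_test)
print(right_test.shape, predictions)
```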

## 3.5 Q&A: What happens if the testing data includes a new category?

In the previous lesson, we created this example DataFrame of training data. When we passed that DataFrame to fit_transform, the output array included three columns. We know from the categories attribute that those columns represent the categories A, B, and C.

Now we'll create a new DataFrame of testing data that includes a category, D, which was not seen in the training data.

If you pass this new DataFrame to the transform method, it will throw an error because it doesn't know how to represent the D category. It only knows how to represent the A, B, and C categories because those are the ones that were seen by the OneHotEncoder during the fit step.

There are two possible solutions to this problem.

The first solution is to specify the categories manually to the OneHotEncoder when creating an instance. Then, when fit_transform is run on the training data, a column is reserved for each of the four categories. As a result, the transform on the testing data will no longer error.

However, specifying the categories manually is only a useful solution if you know all possible categories that might ever appear in your data. But in the real world, you don't always know the full set of categories ahead of time.

For example, there might be rare categories that aren't present in your set of samples, or new categories might be added in the future. If one of your categorical features was medical billing codes, for instance, you could imagine that new billing codes are added over time.

If you don't know all possible categories, then the solution is to set the handle_unknown parameter of the OneHotEncoder to ignore, which overrides the default value of error.

Let's use the fit_transform method on our training data one more time. The output array includes three columns representing A, B, and C. Now, when you use the transform method on the testing data, the third sample is encoded as all zeros because D is an unknown category.

Although this might seem strange, this is actually quite a reasonable approach since you don't have any information from the training data about the relationship between the D category and the target value.

One limitation of this approach, however, is that all unknown categories will be encoded the same way, which means that an E value in the testing data would also be encoded as all zeros.

Here's my overall advice:

1. When starting a project, keep the handle_unknown parameter set to its default value of error so that you will know if you are encountering new categories in your testing data.

2. If you do find that you're encountering new categories, but you can determine the full set of categories through research, then define the categories manually when creating the OneHotEncoder instance.

3. If you can't determine the full set of categories, then set the handle_unknown parameter to ignore. However, you should retrain your model as soon as possible using data that includes those new categories.

In [159]:
demo_train

&nbsp; | letter
:--- | :---
0 | A
1 | B
2 | C
3 | B


In [160]:
ohe.fit_transform(demo_train)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [161]:
ohe.categories_

[array(['A', 'B', 'C'], dtype=object)]

In [162]:
demo_test_unknown = pd.DataFrame({'letter':['A', 'C', 'D']})
demo_test_unknown

&nbsp; | letter
:--- | :---
0 | A
1 | C
2 | D


In [163]:
ohe.transform(demo_test_unknown)

ValueError: Found unknown categories ['D'] in column 0 during transform

In [None]:
ohe = OneHotEncoder(sparse=False, categories=[['A', 'B', 'C', 'D']])

In [None]:
ohe.fit_transform(demo_train)

In [None]:
ohe.transform(demo_test_unknown)
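
As a self-contained sketch of the first solution, here's what the encoded arrays look like when all four categories are specified manually. It uses a separate encoder named `ohe_manual` (a hypothetical name) and converts the default sparse output with `toarray`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_train = pd.DataFrame({'letter': ['A', 'B', 'C', 'B']})
demo_test_unknown = pd.DataFrame({'letter': ['A', 'C', 'D']})

# Reserve a column for each of the four categories, including D
ohe_manual = OneHotEncoder(categories=[['A', 'B', 'C', 'D']])

train_encoded = ohe_manual.fit_transform(demo_train).toarray()
test_encoded = ohe_manual.transform(demo_test_unknown).toarray()

print(train_encoded)
print(test_encoded)
```

The training array has four columns even though D never appears in it, and the D sample in the testing data gets a 1 in the fourth column.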

**Why you might not know all possible categories:**

- Rare categories aren't present in your set of samples
- New categories are added later

In [None]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

In [None]:
ohe.fit_transform(demo_train)

In [None]:
ohe.transform(demo_test_unknown)
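
Here's a self-contained sketch of the second solution. The testing data in this example is hypothetical and includes both D and E in order to show that all unknown categories are encoded the same way.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_train = pd.DataFrame({'letter': ['A', 'B', 'C', 'B']})
test_two_unknowns = pd.DataFrame({'letter': ['A', 'D', 'E']})  # hypothetical testing data

# Ignore unknown categories instead of raising an error
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
ohe_ignore.fit(demo_train)

encoded = ohe_ignore.transform(test_two_unknowns).toarray()
print(encoded)
```

The A sample is encoded as usual, whereas the D and E samples are both encoded as all zeros.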

**Advice for OneHotEncoder:**

1. Start with handle_unknown set to 'error'
2. If possible, specify the categories manually
3. If necessary, set handle_unknown to 'ignore' and then retrain your model

## 3.6 Q&A: Should I drop one of the one-hot encoded categories?

Here's the example training data that we've used in the past few lessons. And here's the default one-hot encoding of this DataFrame.

When one-hot encoding, it's somewhat common to drop the first column of the output array because it contains redundant information and because dropping it avoids collinearity between features.

If you want to drop the first column, you can set the OneHotEncoder's drop parameter to first, though this option only exists in scikit-learn version 0.21 and later. When you run the fit_transform, you can see that the output array contains 1 less column. However, the new encoding retains the same information, since each category is still represented by a unique code: 00 means A, 10 means B, and 01 means C.

Dropping the first column will work regardless of the number of categories, but you're only ever allowed to drop a single column per feature. And it doesn't actually matter which column you drop, though the convention is to drop the first column.

You've now seen that you can drop the first column, but the question is, should you drop the first column? Here's my advice.

If you know that perfectly collinear features will cause problems, such as when feeding the resulting data into a neural network or an unregularized regression, then it's a good idea to drop the first column. However, for most scikit-learn models, perfectly collinear features will not cause any problems, and thus dropping the first column will not benefit the model.

There are also some significant downsides to dropping the first column that you need to be aware of.

Number one, dropping the first column is incompatible with ignoring unknown categories, which is the handle_unknown='ignore' option that we saw in the previous lesson, since the dropped category and unknown categories would both be encoded as all zeros. You are allowed to do this starting in scikit-learn 1.0, but I still don't recommend it.

Number two, dropping the first column can introduce bias into the model if you standardize your features, such as with StandardScaler, or if you use a regularized model, such as logistic regression, since the dropped category will be exempt from standardization and regularization.

In summary, I recommend that you drop the first column only if you know that perfectly collinear features will cause problems, otherwise I don't recommend dropping the first column.

In [None]:
demo_train

In [None]:
ohe.fit_transform(demo_train)

**You can drop the first column:**

- Contains redundant information
- Avoids collinearity between features

In [None]:
ohe = OneHotEncoder(sparse=False, drop='first')
ohe.fit_transform(demo_train)

**Decoding the output array (after dropping the first column):**

- 0, 0 means "A"
- 1, 0 means "B"
- 0, 1 means "C"
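
Here's a self-contained sketch comparing the full encoding with the encoding after dropping the first column. The names `full` and `dropped` are hypothetical, and the default sparse output is converted with `toarray`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_train = pd.DataFrame({'letter': ['A', 'B', 'C', 'B']})

# Full encoding: every row sums to 1, so any one column can be
# inferred from the others (which is the redundancy)
full = OneHotEncoder().fit_transform(demo_train).toarray()
print(full.sum(axis=1))

# Encoding with the first column (A) dropped
dropped = OneHotEncoder(drop='first').fit_transform(demo_train).toarray()
print(dropped)
```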

**Should you drop the first column?**

- **Advantages:**
  - Useful if perfectly collinear features will cause problems (does not apply to most models)
- **Disadvantages:**
  - Incompatible with handle_unknown='ignore'
  - Introduces bias if you standardize features or use a regularized model

## 3.7 Q&A: How do I encode an ordinal feature?

**Types of categorical data:**

- Unordered (nominal data)
- Ordered (ordinal data)

Throughout this chapter, we used one-hot encoding to encode unordered categorical features, also known as nominal data. But how should you encode categorical features with an inherent logical ordering, also known as ordinal data? That's the subject of this lesson.

Let's take a look at our Titanic DataFrame.

Pclass, which stands for passenger class, is an ordinal feature. Although it's already numeric, the numbers 1, 2, and 3 represent the categories 1st class, 2nd class, and 3rd class. It's considered ordinal data because there is a logical ordering to the categories.

Our intuition is that there may be a relationship between Pclass values increasing and survival rate decreasing, because passengers in the lower-numbered classes may have gotten priority access to lifeboats. Thus if we were going to include Pclass in the model, we would keep the existing numeric encoding so that the model can learn the relationship between Pclass and Survived with a single feature. You could use one-hot encoding with Pclass instead, but the model wouldn't be able to learn that relationship as effectively because that information would be spread out across three features.

Let's create an example DataFrame to see how to handle ordinal features that are stored as strings.

In this DataFrame, we have two ordinal features, Class and Size. If you have ordinal data, you should use the OrdinalEncoder class to do the encoding. First, you import it from the preprocessing module. Then, you create an instance of OrdinalEncoder, and when you do so, you define the logical order of the categories.

We pass a list of lists to the categories parameter, in which the first inner list is the categories for the Class feature, and the second inner list is the categories for the Size feature. I put the two lists in that order because that is the order in which I'll be passing the features to the fit_transform method.

One important note is that I included the M category for Size even though it wasn't present in this DataFrame because I knew that it would occur in the dataset at some point.

Next, we pass the DataFrame to the OrdinalEncoder's fit_transform method in order to do the encoding. You'll notice that each input feature became a single feature in the output array.

For the Class feature, first was encoded as 0, second was encoded as 1, and third was encoded as 2.

For the Size feature, S was encoded as 0, L was encoded as 2, and XL was encoded as 3. And if M appears in the data at some point, it will be encoded as 1.

Again, we encoded each input feature as a single column so that the model can learn the relationship between the target and an increase or decrease in each feature.

Let's briefly contrast this with the output you would get if you used OneHotEncoder with these same two features.

OneHotEncoder would create 7 columns in the output array, since Class has 3 categories and Size has 4 categories. These 7 columns contain the same information as the 2 columns output by OrdinalEncoder, but the model would have a comparatively harder time learning from the 7 columns since the information is expressed in a less compact form.

Here's a summary of my advice on this topic:

1. If you have an ordinal feature that's already encoded numerically, then leave it as-is.
2. If you have an ordinal feature that's stored as strings, then encode it using OrdinalEncoder.
3. If you have a nominal feature, then encode it using OneHotEncoder.

In chapter 17, we'll explore this topic further and see if there are cases in which you should diverge from this advice.

One final note about OrdinalEncoder is that unlike OneHotEncoder, it does not allow for new categories in the testing data that were not seen during training. However, that functionality is available beginning in scikit-learn version 0.24 using a handle_unknown parameter.

In [None]:
df

**Options for encoding Pclass:**

- **Ordinal encoding:** Creates one feature
- **One-hot encoding:** Creates three features

In [None]:
df_ordinal = pd.DataFrame({'Class': ['third', 'first', 'second', 'third'],
                           'Size': ['S', 'S', 'L', 'XL']})
df_ordinal

In [None]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['first', 'second', 'third'],
                                ['S', 'M', 'L', 'XL']])

In [None]:
oe.fit_transform(df_ordinal)

**Decoding the output array:**

- **First column:**
  - 0 means "first"
  - 1 means "second"
  - 2 means "third"
- **Second column:**
  - 0 means "S"
  - 1 means "M"
  - 2 means "L"
  - 3 means "XL"
- **Example:**
  - 2, 0 means "third, S"
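
Here's the full ordinal encoding as a self-contained sketch, using the same example DataFrame and category order. The names `oe_demo` and `encoded` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_ordinal = pd.DataFrame({'Class': ['third', 'first', 'second', 'third'],
                           'Size': ['S', 'S', 'L', 'XL']})

# Define the logical order for each feature, in the order the columns are passed
oe_demo = OrdinalEncoder(categories=[['first', 'second', 'third'],
                                     ['S', 'M', 'L', 'XL']])

encoded = oe_demo.fit_transform(df_ordinal)
print(encoded)
```

Notice that L is encoded as 2 rather than 1, because position 1 is reserved for M even though M does not appear in this DataFrame.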

In [None]:
ohe = OneHotEncoder(sparse=False, categories=[['first', 'second', 'third'],
                                              ['S', 'M', 'L', 'XL']])
ohe.fit_transform(df_ordinal)

**Advice for encoding categorical data:**

- **Ordinal feature stored as numbers:** Leave as-is
- **Ordinal feature stored as strings:** Use OrdinalEncoder
- **Nominal feature:** Use OneHotEncoder
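
Regarding new categories with OrdinalEncoder, here's a sketch of how the handle_unknown parameter can be used. It assumes scikit-learn 0.24 or later, and the choice of -1 as the unknown value and the "XXL" category are my own for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes_train = pd.DataFrame({'Size': ['S', 'L', 'XL']})
sizes_test = pd.DataFrame({'Size': ['M', 'XXL']})  # XXL was never defined as a category

# Requires scikit-learn 0.24+: unknown categories are encoded as unknown_value
oe_unknown = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']],
                            handle_unknown='use_encoded_value',
                            unknown_value=-1)
oe_unknown.fit(sizes_train)

encoded = oe_unknown.transform(sizes_test)
print(encoded)
```

M is encoded as 1 because it was included in the defined categories, whereas XXL is encoded as -1 because it is unknown.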

## 3.8 Q&A: What's the difference between OrdinalEncoder and LabelEncoder?

There are many similarities between the OrdinalEncoder and LabelEncoder classes, so in this lesson I'll explain how they're different and why you should be using OrdinalEncoder, not LabelEncoder.

The first main difference is that OrdinalEncoder allows you to define the order of the categories, whereas LabelEncoder does not. LabelEncoder simply uses the alphabetical order of the values you pass to it to determine which value to encode as 0, which value to encode as 1, and so on.

The second main difference is that OrdinalEncoder can be used to encode multiple features at once, whereas LabelEncoder can only encode one column of data at once.

Because of these differences, OrdinalEncoder is much better suited than LabelEncoder for encoding ordinal features. And in fact, LabelEncoder is only intended for the encoding of class labels, hence its name.

You might be asking why LabelEncoder even exists, given its limitations. There are two reasons.

First, in older versions of scikit-learn, some classification models were not able to handle string-based labels. LabelEncoder was used to encode those strings as integers so that they could be passed to the model. That limitation was removed a few years ago, and so all scikit-learn classifiers can now handle string-based labels. Therefore, you should never need to use LabelEncoder for encoding your class labels.

Second, also in older versions of scikit-learn, OneHotEncoder did not accept strings as input. Thus if you had categorical data stored as strings, you actually had to use LabelEncoder to encode the strings as integers before passing them to the OneHotEncoder. Again, that limitation was removed a few years ago, and so you can pass string-based categorical data directly to OneHotEncoder.

Because of this legacy from older versions of scikit-learn, many people are familiar with LabelEncoder and thus use it to encode features. However, the best practice is to use OrdinalEncoder to encode ordinal features. In fact, it's rare that you will ever need to use LabelEncoder, which is why I'm not using it in this course.

&nbsp; | OrdinalEncoder | LabelEncoder
:--- | :---: | :---:
Can you define the category order? | Yes | No
Can you encode multiple features? | Yes | No

**Outdated uses for LabelEncoder:**

- Encoding string-based labels for some classifiers
- Encoding string-based features for OneHotEncoder
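To illustrate why the first use is outdated, here's a small sketch in which a classifier is fit directly on string-based labels. The toy data and the names `X_demo`, `y_demo`, and `clf` are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (an assumption): one numeric feature and string class labels
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array(['no', 'no', 'yes', 'yes'])

# No LabelEncoder is needed: the classifier accepts the strings as-is
clf = LogisticRegression(solver='liblinear', random_state=1)
clf.fit(X_demo, y_demo)

print(clf.classes_)  # the class labels the model learned
preds = clf.predict(X_demo)
print(preds)         # predictions are returned as the original strings
```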

## 3.9 Q&A: Should I encode numeric features as ordinal features?

Normally, when you have a continuous numeric feature such as Fare, you pass that feature directly to your Machine Learning model. However, one strategy that is sometimes used with numeric features is to "discretize" or "bin" them into categorical features. In scikit-learn, we can do this using KBinsDiscretizer.

When creating an instance of KBinsDiscretizer, you define the number of bins, the binning strategy, and the method used for encoding the result. Here's the output when we pass the Fare feature to the fit_transform method.

Because we specified 3 bins, every sample has been assigned to bin 0 or 1 or 2. The smallest values were assigned to bin 0, the largest values were assigned to bin 2, and the values in between were assigned to bin 1. Thus, we've taken a continuous numeric feature and encoded it as an ordinal feature, and this ordinal feature could be passed to the model in place of the numeric feature.

The obvious follow-up question is: Should we discretize our numeric features? Theoretically, discretization can benefit linear models by helping them to learn non-linear trends. However, my general recommendation is to not use discretization, for three main reasons.

First, discretization removes all nuance from the data, which makes it harder for a model to learn the actual trends that are present in the data.

Second, discretization reduces the variation in the data, which makes it easier to find trends that don't actually exist.

Third, any possible benefits of discretization are highly dependent on the parameters used with KBinsDiscretizer. Making those decisions by hand creates a risk of overfitting the training data, and making those decisions during a tuning process adds both complexity and processing time, and so neither of those options is particularly attractive to me.

In [None]:
df[['Fare']]

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
kb = KBinsDiscretizer(n_bins=3, strategy='quantile', encode='ordinal')

In [None]:
kb.fit_transform(df[['Fare']])

**Why not discretize numeric features?**

- Makes it harder to learn the actual trends
- Makes it easier to discover non-existent trends
- May result in overfitting
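Here's a rough sketch of how much the result can depend on the KBinsDiscretizer parameters. It uses the ten Fare values from our training data and only changes the binning strategy. The variable names are hypothetical, and depending on your scikit-learn version, the quantile strategy may emit a warning about its defaults.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# The ten Fare values from the training data, as a single-column 2D array
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05,
                 8.4583, 51.8625, 21.075, 11.1333, 30.0708]).reshape(-1, 1)

# Same number of bins, two different binning strategies
kb_quantile = KBinsDiscretizer(n_bins=3, strategy='quantile', encode='ordinal')
kb_uniform = KBinsDiscretizer(n_bins=3, strategy='uniform', encode='ordinal')

bins_quantile = kb_quantile.fit_transform(fare).ravel()
bins_uniform = kb_uniform.fit_transform(fare).ravel()

print(bins_quantile)              # roughly equal number of samples per bin
print(bins_uniform)               # equal-width bins, so most fares land in bin 0
print(kb_uniform.bin_edges_[0])   # the learned bin edges for the Fare feature
```

The same fare can be assigned to different bins depending on the strategy, which is why these choices would need to be tuned rather than guessed.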

# Chapter 4: Improving your workflow with ColumnTransformer and Pipeline

## 4.1 Preprocessing features with ColumnTransformer

In the last chapter, our goal was to include two numeric features and two categorical features in our model. We saw how to numerically encode the categorical features using OneHotEncoder, but we lacked an efficient process for stacking those encoded features next to the numerical features, and we lacked an efficient way to apply this same preprocessing to our new data.

In this chapter, we're going to solve both of those problems using the ColumnTransformer and Pipeline classes:

- ColumnTransformer will make it easy to apply different preprocessing steps to different columns.
- Pipeline will make it easy to apply the same workflow to training data and new data.

**Problems from Chapter 3:**

- Need to stack categorical features next to numerical features
- Need to apply the same preprocessing to new data

**How to solve those problems:**

- **ColumnTransformer:** Apply different preprocessing steps to different columns
- **Pipeline:** Apply the same workflow to training data and new data

To start, we'll create a Python list of the four columns we've been working with, and use that to create our X object.

We're still going to be one-hot encoding the Embarked and Sex columns, so we'll create an instance of OneHotEncoder. We're using the default options for OneHotEncoder, which means it will output a sparse matrix, but that's fine because we're not going to examine the output directly.

In [164]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

In [165]:
X = df[cols]
X

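&nbsp; | Parch | Fare | Embarked | Sex
:--- | :---: | :---: | :---: | :---:
0 | 0 | 7.25 | S | male
1 | 0 | 71.2833 | C | female
2 | 0 | 7.925 | S | female
3 | 0 | 53.1 | S | female
4 | 0 | 8.05 | S | male
5 | 0 | 8.4583 | Q | male
6 | 0 | 51.8625 | S | male
7 | 1 | 21.075 | S | male
8 | 2 | 11.1333 | S | female
9 | 0 | 30.0708 | C | female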


In [166]:
ohe = OneHotEncoder()

Now it's time to create our first ColumnTransformer, which will take care of any data transformations that we specify. We'll start by importing the make_column_transformer function from the compose module.

In general, you use make_column_transformer by passing it one or more tuples, and each tuple should have two elements:

1. The first element is a transformer.
2. The second element is a list of columns to which that transformer should be applied. Note that in most cases, this element should be a list even if you are only specifying a single column.

In our case, we'll pass it a single tuple in which the first element is our OneHotEncoder object and the second element is a list of the two columns we want to one-hot encode.

After all tuples, we'll set the remainder parameter to drop, which means that all columns which are not explicitly mentioned in the ColumnTransformer should be dropped. Drop is actually the default value for remainder, but I'm including it here just for clarity.

Note that I could have defined the ColumnTransformer on a single line, but I prefer breaking the lines in this way for readability.

When we run this code, the make_column_transformer function returns a ColumnTransformer object, which we'll save as ct.


In [167]:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='drop')

**Tuple elements for make_column_transformer:**

1. Transformer object
2. List of columns to which the transformer should be applied

Next, we'll perform the transformation by passing X, which is our four-column DataFrame, to the fit_transform method of the ct object. It outputs a 10 by 5 array that represents the one-hot encoding of the Embarked and Sex columns. The first three columns represent Embarked and the other two columns represent Sex, and they're in that order because that's the order in which they were listed in the ColumnTransformer.

Note that even though the Parch and Fare columns are part of X, they're excluded from the output array because we told the ColumnTransformer to drop all unspecified columns.

In [168]:
ct.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.]])

**Output columns:**

- **Columns 1-3:** Embarked
- **Columns 4-5:** Sex

This is nice, but our actual goal was to create a matrix that includes the Parch and Fare columns alongside the encoded versions of Embarked and Sex. To accomplish that, we'll simply change the value of remainder from drop to passthrough. This means that all columns which are not mentioned in the ColumnTransformer should be passed through to the output unmodified. In other words, include the Parch and Fare columns in the output, but don't transform them in any way.

In [169]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')

When we run the fit_transform method this time, it outputs a 10 by 7 array. The first five columns represent the encoded Embarked and Sex columns, and the sixth and seventh columns are the Parch and Fare columns. The column order is based on the order in which the columns were listed in the ColumnTransformer, followed by any remainder columns that are passed through.

In [170]:
ct.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])

**Output columns:**

- **Columns 1-3:** Embarked
- **Columns 4-5:** Sex
- **Column 6:** Parch
- **Column 7:** Fare

We were able to figure out on our own what each column represents, but you can also use the ColumnTransformer's get_feature_names method to confirm the meanings of these 7 features. The x0 simply means feature 0 that was passed to the OneHotEncoder, and the x1 means feature 1.


In [171]:
ct.get_feature_names()



['onehotencoder__x0_C',
 'onehotencoder__x0_Q',
 'onehotencoder__x0_S',
 'onehotencoder__x1_female',
 'onehotencoder__x1_male',
 'Parch',
 'Fare']

Before we move on, I have two quick asides about the get_feature_names method:

- First, the get_feature_names method didn't work with passthrough columns prior to scikit-learn version 0.23, so you'll get an error if you run the code with previous versions.
- Second, the get_feature_names method was deprecated in scikit-learn 1.0 in favor of a similar method called get_feature_names_out, and it was removed entirely in version 1.2.

**Notes about get_feature_names:**

- **Before version 0.23:** Didn't work with passthrough columns
- **Starting in version 1.0:** Has been replaced with get_feature_names_out
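If you need code that runs on both older and newer versions, one possible approach is to check which method is available. This is only a sketch: the three-row `X_demo` DataFrame and the `ct_demo` name are assumptions for illustration, and the exact feature names you get back will differ between the two methods.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Three illustrative rows with the same four columns as X (an assumption)
X_demo = pd.DataFrame({'Parch': [0, 0, 0],
                       'Fare': [7.25, 71.2833, 8.4583],
                       'Embarked': ['S', 'C', 'Q'],
                       'Sex': ['male', 'female', 'male']})

ct_demo = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    remainder='passthrough')
ct_demo.fit(X_demo)

# Prefer the newer method when it exists, otherwise fall back to the older one
if hasattr(ct_demo, 'get_feature_names_out'):
    feature_names = list(ct_demo.get_feature_names_out())
else:
    feature_names = ct_demo.get_feature_names()

print(feature_names)
```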

**Tuple elements for make_column_transformer (revised):**

1. Transformer object or "drop" or "passthrough"
2. List of columns to which the transformer should be applied

To wrap up this lesson, I want to show you one other way to specify this same ColumnTransformer.

As I mentioned before, make_column_transformer accepts tuples, and the first element of each tuple is usually a transformer object (like our "ohe" object). However, the first element of the tuple can also be the special strings "drop" or "passthrough", which tell the ColumnTransformer to drop or pass through specific columns.

So, we're going to add a second tuple in which the transformer is the string "passthrough", and we want to apply this passthrough transformer to the columns Parch and Fare. This ColumnTransformer will do the exact same thing as the previous one, but I actually prefer this notation any time I have a small number of passthrough columns, since it reminds me of which columns I'm passing through.

It's still important to remember that the default value for the remainder parameter is "drop", which means that any unspecified columns will be dropped, though we don't have any unspecified columns in this case.

We'll run the fit_transform method one more time, and you can see that it outputs the same 7 columns as before. And to be clear, this is the feature matrix that we will pass to our model.

In [172]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))

In [173]:
ct.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])
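Since the output column order follows the order of the tuples, swapping the tuples changes the layout of the feature matrix. Here's a small sketch of that using three rows taken from X. The names `X_demo`, `ct_ohe_first`, and `ct_pass_first` are hypothetical, and I'm setting sparse_threshold=0 only to guarantee a dense array for printing.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Three rows taken from X (rows 0, 1, and 5)
X_demo = pd.DataFrame({'Parch': [0, 0, 0],
                       'Fare': [7.25, 71.2833, 8.4583],
                       'Embarked': ['S', 'C', 'Q'],
                       'Sex': ['male', 'female', 'male']})

# Encoded columns first, then the passthrough columns
ct_ohe_first = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']),
    sparse_threshold=0)

# Same transformers, but with the tuples listed in the opposite order
ct_pass_first = make_column_transformer(
    ('passthrough', ['Parch', 'Fare']),
    (OneHotEncoder(), ['Embarked', 'Sex']),
    sparse_threshold=0)

out_ohe_first = ct_ohe_first.fit_transform(X_demo)
out_pass_first = ct_pass_first.fit_transform(X_demo)

print(out_ohe_first[0])   # Embarked (3), Sex (2), Parch, Fare
print(out_pass_first[0])  # Parch, Fare, Embarked (3), Sex (2)
```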

## 4.2 Chaining steps with Pipeline

In the previous lesson, we accomplished our first goal, which was to apply different preprocessing to different columns using ColumnTransformer. In this lesson, we're moving on to our second goal, which is to apply the same workflow to training data and new data using the Pipeline class.

A Pipeline is used to chain together sequential steps. In this case, we want to chain together two steps, namely data preprocessing followed by model building.

We'll start by importing the make_pipeline function. Then, we can create a Pipeline instance by passing it two objects: our ColumnTransformer instance for data preprocessing, and our logistic regression instance for model building. We'll save it as an object called "pipe", which is a 2-step Pipeline.

You might remember that back in Chapter 2, we used cross-validation to evaluate our model when it only included the Parch and Fare features. Now that we've added the Embarked and Sex features, it would normally make sense to cross-validate the updated model to see whether adding those features made our model better or worse. And in fact, you can (and should) cross-validate an entire Pipeline.

However, any model evaluation procedure is highly unreliable with only 10 rows of data, and so any change in the cross-validated accuracy would be misleading. Thus we're going to skip the cross-validation step for the moment, though we'll return to it in a later chapter once we're using the full dataset.

Since we're skipping cross-validation, our next step is just to run the fit method on the Pipeline, and pass it X and y. Here's what happens when we fit the Pipeline:

- First, it runs the ColumnTransformer step, meaning that it takes X, which is a 4-column DataFrame that contains both numbers and strings, and transforms it into the 7-column feature matrix that only includes numbers.
- Second, it runs the LogisticRegression step, meaning that the model is fit to this 7-column feature matrix. In other words, it learns the relationship between those 7 features and the y values.

Note that when you fit a Pipeline in a notebook, the fitted Pipeline is displayed as the cell's output, which lists its steps. You can see that step 1 is a ColumnTransformer that includes a OneHotEncoder and a passthrough transformer, and step 2 is a LogisticRegression model.



In [175]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ct, logreg)

**Pipeline steps:**

1. Data preprocessing with ColumnTransformer
2. Model building with LogisticRegression

In [176]:
pipe.fit(X, y)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Embarked', 'Sex']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch', 'Fare'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

**Fitting the Pipeline:**

1. ColumnTransformer converts X (4 columns) into a numeric feature matrix (7 columns)
2. LogisticRegression model is fit to the feature matrix

In case it helps you to understand the Pipeline better, I'm going to show you what happens "under the hood" when you fit this Pipeline. To be clear, you should not actually write the following code. It is just for teaching purposes.

First, X is transformed by the ColumnTransformer into X_t, which stands for X transformed. Second, the LogisticRegression model is fit on X_t and y. And as you would expect, X has the shape 10 by 4, and X_t has the shape 10 by 7.

In [177]:
X_t = ct.fit_transform(X)
logreg.fit(X_t, y)

LogisticRegression(random_state=1, solver='liblinear')

In [178]:
print(X.shape)
print(X_t.shape)

(10, 4)
(10, 7)


## 4.3 Using the Pipeline to make predictions

Now that we've fit our Pipeline, we want to use it to make predictions on new data.

The first step is to update the X_new DataFrame so that it contains the same columns as X. Recall that the cols object contains the names of our four columns, and so we can use it to select those four columns from the df_new DataFrame.

Now, we can pass X_new to the Pipeline's predict method to make predictions for these ten samples. When we run it, the Pipeline applies the same transformations to X_new that it applied to X, and the transformed version of X_new is passed to the fitted logistic regression model so that it can make predictions.

In [179]:
X_new = df_new[cols]
X_new

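&nbsp; | Parch | Fare | Embarked | Sex
:--- | :---: | :---: | :---: | :---:
0 | 0 | 7.8292 | Q | male
1 | 0 | 7.0 | S | female
2 | 0 | 9.6875 | Q | male
3 | 0 | 8.6625 | S | male
4 | 1 | 12.2875 | S | female
5 | 0 | 9.225 | S | male
6 | 0 | 7.6292 | Q | female
7 | 1 | 29.0 | S | male
8 | 0 | 7.2292 | C | female
9 | 0 | 24.15 | S | male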


In other words, the Pipeline enabled us to accomplish our second goal, which is to apply the same workflow to training data and new data.

As a reminder, we can't evaluate the accuracy of these ten predictions because we don't know the true target values for X_new.

In [180]:
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

**Predicting with the Pipeline:**

1. ColumnTransformer applies the same transformations to X_new
2. Fitted LogisticRegression model makes predictions on the transformed version of X_new

Just like before, I'm going to show you what happens "under the hood" when you make predictions using this Pipeline. Again, you should not actually write the following code. It is just for teaching purposes.

First, X_new is transformed by the ColumnTransformer into X_new_t, which stands for X_new transformed. Second, the fitted LogisticRegression model makes predictions for the samples in X_new_t. And as you would expect, X_new has the shape 10 by 4, and X_new_t has the shape 10 by 7.

In [182]:
X_new_t = ct.transform(X_new)
logreg.predict(X_new_t)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

In [183]:
print(X_new.shape)
print(X_new_t.shape)

(10, 4)
(10, 7)
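If you'd like to convince yourself that the Pipeline and the "under the hood" steps really are equivalent, here's a self-contained sketch that runs both routes and compares the predictions. The feature values are copied from X and X_new, but the target values in `y_demo` and all of the `_demo` and `_manual` names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X_demo = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05,
             8.4583, 51.8625, 21.075, 11.1333, 30.0708],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C'],
    'Sex': ['male', 'female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female', 'female']})
y_demo = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]  # assumed target values for illustration
X_new_demo = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    'Fare': [7.8292, 7.0, 9.6875, 8.6625, 12.2875,
             9.225, 7.6292, 29.0, 7.2292, 24.15],
    'Embarked': ['Q', 'S', 'Q', 'S', 'S', 'S', 'Q', 'S', 'C', 'S'],
    'Sex': ['male', 'female', 'male', 'male', 'female',
            'male', 'female', 'male', 'female', 'male']})

ct_demo = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
logreg_demo = LogisticRegression(solver='liblinear', random_state=1)

# Route 1: let the Pipeline handle both steps (clones keep the routes independent)
pipe_demo = make_pipeline(clone(ct_demo), clone(logreg_demo))
pipe_demo.fit(X_demo, y_demo)
pipe_preds = pipe_demo.predict(X_new_demo)

# Route 2: do the same steps by hand, for teaching purposes only
ct_manual = clone(ct_demo)
logreg_manual = clone(logreg_demo)
X_t = ct_manual.fit_transform(X_demo)          # fit and transform the training data
logreg_manual.fit(X_t, y_demo)
X_new_t = ct_manual.transform(X_new_demo)      # transform only, no refitting
manual_preds = logreg_manual.predict(X_new_t)

print(np.array_equal(pipe_preds, manual_preds))
```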


**ColumnTransformer methods:**

1. Run fit_transform on X:
  - **fit:** Learn the encoding
  - **transform:** Apply the encoding to create 7 columns
2. Run transform on X_new:
  - **transform:** Apply the encoding to create 7 columns

One important point I want to highlight is that the Pipeline's predict method called the ColumnTransformer's transform method, not its fit_transform method. Why would that be?

Recall that the fit step is when a transformer learns something, and the transform step is when it uses what it learned to do the transformation. Thus you fit on X to learn an encoding, and you transform on X and X_new to apply that encoding.

This is critically important. Our logistic regression model was fit on 7 columns, and so it learned 7 coefficients. To make predictions, you need to pass 7 columns to the predict method, and those 7 columns need to mean the same thing as the 7 columns you used when fitting the model. Thus, the predict method only runs transform so that the exact same encoding will be applied to the training data and the new data.

It's okay if you're still a bit fuzzy on the difference between fit and transform, because the Pipeline object will just do the right thing for you when you run fit or predict. However, understanding the difference will ultimately help you to go further with scikit-learn.
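Here's a small sketch of what could go wrong if fit_transform were run on the new data instead of transform. The `train` and `new` DataFrames are made-up examples in which the new data happens to be missing one of the Embarked categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data (an assumption): the new data contains no "C" values
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
new = pd.DataFrame({'Embarked': ['Q', 'S']})

# Correct: learn the encoding from the training data, then apply it to the new data
ohe_demo = OneHotEncoder()
ohe_demo.fit(train)
right = ohe_demo.transform(new).toarray()

# Incorrect: relearn the encoding from the new data
wrong = OneHotEncoder().fit_transform(new).toarray()

print(right.shape)  # 3 columns (C, Q, S), matching the training encoding
print(wrong.shape)  # 2 columns (Q, S), which no longer match
```

In the incorrect version, the first column means "Q" rather than "C", so a model fit on the training encoding would misinterpret every column.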

## 4.4 Q&A: How do I drop some columns and passthrough others?

Currently we only have 4 columns in X, namely Parch, Fare, Embarked, and Sex. But imagine that we had many more columns, and we wanted to drop a few columns and passthrough the rest. How would we do that efficiently?

We can use the special string "drop" to tell the ColumnTransformer which columns to drop, and also tell it to passthrough all remaining columns. So in this example, we're one-hot encoding Embarked and Sex, which creates 5 columns, dropping Fare, and passing through Parch, which adds 1 more column.

We could use this same pattern to drop a few columns and passthrough hundreds of columns without having to list the passthrough columns one-by-one.

In [184]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('drop', ['Fare']),
    remainder='passthrough')
ct.fit_transform(X)

array([[0., 0., 1., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 1., 0., 0.],
       [0., 0., 1., 1., 0., 0.],
       [0., 0., 1., 0., 1., 0.],
       [0., 1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 1., 1.],
       [0., 0., 1., 1., 0., 2.],
       [1., 0., 0., 1., 0., 0.]])

Conversely, we might want to passthrough a few columns and drop the rest. We can use the special string "passthrough" to tell the ColumnTransformer which columns to passthrough, and also tell it to drop all remaining columns. So in this example, we're one-hot encoding Embarked and Sex, which creates 5 columns, passing through Parch, which adds 1 more column, and dropping Fare.

Again, we can use this pattern to passthrough a few columns and drop hundreds of columns without listing them all.

Finally, just a reminder that "drop" is the default value for remainder, so you aren't actually required to specify it here.

In [185]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch']),
    remainder='drop')
ct.fit_transform(X)

array([[0., 0., 1., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 1., 0., 0.],
       [0., 0., 1., 1., 0., 0.],
       [0., 0., 1., 0., 1., 0.],
       [0., 1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 1., 1.],
       [0., 0., 1., 1., 0., 2.],
       [1., 0., 0., 1., 0., 0.]])

## 4.5 Q&A: How do I transform the unspecified columns?

We know how to drop or passthrough the unspecified columns in a ColumnTransformer, but let's pretend we wanted to apply a transformation to all of the unspecified columns. This is actually simple to do by passing a transformer to the remainder parameter.

For example, we might want to scale all of the unspecified columns. One option is MaxAbsScaler, which divides each feature by its maximum absolute value and thus scales it to the range negative 1 to positive 1. We'll import it from the preprocessing module and then create an instance. Then, we can pass the scaler to the remainder parameter.

When we run the fit_transform method, you can see that the first 5 columns were created from Embarked and Sex, and the sixth column is the scaled version of the Parch column.

In [186]:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()

In [187]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('drop', ['Fare']),
    remainder=scaler)
ct.fit_transform(X)

array([[0. , 0. , 1. , 0. , 1. , 0. ],
       [1. , 0. , 0. , 1. , 0. , 0. ],
       [0. , 0. , 1. , 1. , 0. , 0. ],
       [0. , 0. , 1. , 1. , 0. , 0. ],
       [0. , 0. , 1. , 0. , 1. , 0. ],
       [0. , 1. , 0. , 0. , 1. , 0. ],
       [0. , 0. , 1. , 0. , 1. , 0. ],
       [0. , 0. , 1. , 0. , 1. , 0.5],
       [0. , 0. , 1. , 1. , 0. , 1. ],
       [1. , 0. , 0. , 1. , 0. , 0. ]])
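To see the scaling on its own, here's a quick sketch that applies MaxAbsScaler to the three distinct Parch values, plus a made-up column containing a negative number to show that the scaler divides by the maximum absolute value. The array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# The distinct Parch values from X: the maximum absolute value is 2
parch = np.array([[0.0], [1.0], [2.0]])
scaled_parch = MaxAbsScaler().fit_transform(parch)
print(scaled_parch.ravel())   # each value divided by 2

# Made-up values (an assumption): the maximum absolute value is 4, not 2
signed = np.array([[-4.0], [2.0]])
scaled_signed = MaxAbsScaler().fit_transform(signed)
print(scaled_signed.ravel())  # each value divided by 4
```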

## 4.6 Q&A: How do I select columns from a NumPy array?

Throughout the course, we've been using a pandas DataFrame as our input. But what if your input data was a NumPy array instead? Let's see how that affects our workflow.

We'll start by converting the X and X_new DataFrames into NumPy arrays called X_array and X_new_array. Here's what X_array looks like.

In [188]:
X_array = X.to_numpy()
X_new_array = X_new.to_numpy()

In [189]:
X_array

array([[0, 7.25, 'S', 'male'],
       [0, 71.2833, 'C', 'female'],
       [0, 7.925, 'S', 'female'],
       [0, 53.1, 'S', 'female'],
       [0, 8.05, 'S', 'male'],
       [0, 8.4583, 'Q', 'male'],
       [0, 51.8625, 'S', 'male'],
       [1, 21.075, 'S', 'male'],
       [2, 11.1333, 'S', 'female'],
       [0, 30.0708, 'C', 'female']], dtype=object)

If this was our input data, and we wanted to use a ColumnTransformer, we wouldn't be able to specify the columns by name because columns of a NumPy array don't have names. However, we do have a couple of other options.

First, we could specify the columns by integer position. Embarked and Sex are columns 2 and 3, so in this example, we're one-hot encoding Embarked and Sex and passing through the remainder. Note that we're passing X_array, not X, to the fit_transform method.

In [190]:
ct = make_column_transformer(
    (ohe, [2, 3]),
    remainder='passthrough')
ct.fit_transform(X_array)

array([[0.0, 0.0, 1.0, 0.0, 1.0, 0, 7.25],
       [1.0, 0.0, 0.0, 1.0, 0.0, 0, 71.2833],
       [0.0, 0.0, 1.0, 1.0, 0.0, 0, 7.925],
       [0.0, 0.0, 1.0, 1.0, 0.0, 0, 53.1],
       [0.0, 0.0, 1.0, 0.0, 1.0, 0, 8.05],
       [0.0, 1.0, 0.0, 0.0, 1.0, 0, 8.4583],
       [0.0, 0.0, 1.0, 0.0, 1.0, 0, 51.8625],
       [0.0, 0.0, 1.0, 0.0, 1.0, 1, 21.075],
       [0.0, 0.0, 1.0, 1.0, 0.0, 2, 11.1333],
       [1.0, 0.0, 0.0, 1.0, 0.0, 0, 30.0708]], dtype=object)

Another option is to specify the columns using slices, which is useful for large ranges of columns next to one another. In this case, we're selecting columns 2 through 3 for one-hot encoding, and passing through the remainder. Remember that Python slices are inclusive of the starting value, which is 2 in this case, and exclusive of the ending value, which is 4 in this case.


In [None]:
ct = make_column_transformer(
    (ohe, slice(2, 4)),
    remainder='passthrough')
ct.fit_transform(X_array)

One final option is to specify the columns using a boolean mask. Normally you would create the mask using some sort of condition, but in this case I'm just writing out a mask to select columns 2 and 3 for one-hot encoding, and passing through the remainder.


In [None]:
ct = make_column_transformer(
    (ohe, [False, False, True, True]),
    remainder='passthrough')
ct.fit_transform(X_array)

So those are our three options for selecting columns in a ColumnTransformer when your input source is a NumPy array.
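As a quick check, this sketch builds a ColumnTransformer with each of the three options and confirms that they produce the same output. The three-row `X_array_demo` and the other names are assumptions for illustration, and sparse_threshold=0 is only there to guarantee a dense result that is easy to compare.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Three rows in the same layout as X_array: Parch, Fare, Embarked, Sex
X_array_demo = np.array([[0, 7.25, 'S', 'male'],
                         [0, 71.2833, 'C', 'female'],
                         [0, 8.4583, 'Q', 'male']], dtype=object)

# Three equivalent ways of selecting columns 2 and 3
column_specs = {
    'integer positions': [2, 3],
    'slice': slice(2, 4),
    'boolean mask': [False, False, True, True],
}

outputs = {}
for label, spec in column_specs.items():
    ct_demo = make_column_transformer(
        (OneHotEncoder(), spec),
        remainder='passthrough',
        sparse_threshold=0)
    # Convert the object array to float so the results are easy to compare
    outputs[label] = ct_demo.fit_transform(X_array_demo).astype(float)

for label, output in outputs.items():
    print(label, output[0])
```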

Other than that, the rest of our workflow remains the same. We'll just update the Pipeline to use our new ColumnTransformer. Then we can fit the Pipeline with X_array and y, and make predictions for the X_new_array.

**Options for selecting columns from a NumPy array:**

- Integer position
- Slice
- Boolean mask

In [191]:
pipe = make_pipeline(ct, logreg)

In [192]:
pipe.fit(X_array, y)
pipe.predict(X_new_array)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

## 4.7 Q&A: How do I select columns by data type?

In [193]:
from sklearn.compose import make_column_selector

So far in the course, we've been selecting columns one-by-one. But let's say that we had many more columns, and we simply wanted to one-hot encode all object columns and passthrough all numeric columns without listing all of them out. How would we do that?

The easiest way to do this is with the make_column_selector function, which is new in scikit-learn version 0.22.

We're going to create two column selectors called select_object and select_number. To do this, we just set the dtype_include parameter to the data type we want to include, and it outputs a callable. Then, we pass the callables to make_column_transformer instead of the column names, and the callables select the columns for us.

In [194]:
select_object = make_column_selector(dtype_include=object)
select_number = make_column_selector(dtype_include='number')
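If you're curious what these callables actually do, you can call them on a DataFrame yourself, and each one returns a list of the matching column names. This is a small sketch with an illustrative `X_demo` DataFrame, in which the string columns are explicitly stored with the object data type.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Illustrative DataFrame (an assumption) with the same columns as X
X_demo = pd.DataFrame({
    'Parch': [0, 0, 0],
    'Fare': [7.25, 71.2833, 8.4583],
    'Embarked': pd.Series(['S', 'C', 'Q'], dtype=object),
    'Sex': pd.Series(['male', 'female', 'male'], dtype=object)})

select_object_demo = make_column_selector(dtype_include=object)
select_number_demo = make_column_selector(dtype_include='number')

# Calling a selector on a DataFrame returns the names of the selected columns
object_cols = select_object_demo(X_demo)
number_cols = select_number_demo(X_demo)
print(object_cols)
print(number_cols)
```

The ColumnTransformer makes this same call for you when it is fit, which is why the selectors can be passed in place of the column names.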

When we run fit_transform, you can see that once again, the object columns have been one-hot encoded and the numeric columns have been passed through.


In [195]:
ct = make_column_transformer(
    (ohe, select_object),
    ('passthrough', select_number))
ct.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])

One slight variation of this is that you can tell make_column_selector to exclude instead of include a specific data type. In this example, we're using the dtype_exclude parameter to create a column selector that excludes the object data type.

This time, we'll tell the ColumnTransformer to one-hot encode all object columns and passthrough all non-object columns, which has the same effect as before.

In [196]:
exclude_object = make_column_selector(dtype_exclude=object)

In [197]:
ct = make_column_transformer(
    (ohe, select_object),
    ('passthrough', exclude_object))
ct.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])

In [198]:
select_datetime = make_column_selector(dtype_include='datetime')
select_category = make_column_selector(dtype_include='category')

There are also other data type options you can use, such as the datetime data type or the pandas category data type.

Finally, it's worth noting that you can also pass a list of multiple data types to make_column_selector.

In [200]:
select_multiple = make_column_selector(dtype_include=[object, 'category'])

## 4.8 Q&A: How do I select columns by column name pattern?

Let's say that we had a lot of columns, and all of the columns that we wanted to select for a particular transformation had the same pattern in their names. For example, maybe all of those columns started with the same word.

Once again, we can use the make_column_selector function, which allows us to select columns by regular expression pattern. Here's a silly example in which we select columns that include the capital letters E or S.

When we run the fit_transform method, Embarked and Sex have been one-hot encoded, and the remaining columns have been passed through.

Again, this is only useful if your column names follow a particular pattern and you know how to write regular expressions.

In [201]:
select_ES = make_column_selector(pattern='E|S')

In [202]:
ct = make_column_transformer(
    (ohe, select_ES),
    remainder='passthrough')
ct.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])
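As a quick way to test a pattern before building a ColumnTransformer, you can call the selector on a DataFrame and look at the column names it returns. This is a minimal sketch using a one-row stand-in DataFrame with the same four column names. The '^S' and 'e$' patterns are extra examples that are not used in the course.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# One-row stand-in with the same column names as X (only the names matter here)
X_demo = pd.DataFrame({'Parch': [0], 'Fare': [7.25],
                       'Embarked': ['S'], 'Sex': ['male']})

# The pattern is a regular expression that can match anywhere in the name
select_ES = make_column_selector(pattern='E|S')
# Anchors restrict the match to the start (^) or end ($) of the name
starts_with_S = make_column_selector(pattern='^S')
ends_with_e = make_column_selector(pattern='e$')

es_cols = select_ES(X_demo)
s_cols = starts_with_S(X_demo)
e_cols = ends_with_e(X_demo)

print(es_cols)
print(s_cols)
print(e_cols)
```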

## 4.9 Q&A: Should I use ColumnTransformer or make_column_transformer?

So far in the course, we've been creating ColumnTransformers using the make_column_transformer function. In this lesson, I'll show you how to use the ColumnTransformer class and then compare it to make_column_transformer so that you can decide which one you want to use.

To start, we'll import the ColumnTransformer class from the compose module, and then we'll create an instance.

When creating an instance, the first difference you might notice is that the tuples have three elements rather than two. The first element of each tuple is a name of your choosing that you are required to assign to the transformer.

In [203]:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('OHE', ohe, ['Embarked', 'Sex']),
     ('pass', 'passthrough', ['Parch', 'Fare'])])
ct

ColumnTransformer(transformers=[('OHE', OneHotEncoder(), ['Embarked', 'Sex']),
                                ('pass', 'passthrough', ['Parch', 'Fare'])])

In this case, the first tuple is our one-hot encoding of Embarked and Sex, and we're assigning it the name "OHE" in all caps. The second tuple is our special passthrough transformer for Parch and Fare, and we're assigning it the name "pass". We can see these names when we print out the ColumnTransformer.

You might also notice that the tuples are in a list, which is a requirement of the ColumnTransformer class.


**Tuple elements for ColumnTransformer:**

1. Transformer name
2. Transformer object or "drop" or "passthrough"
3. List of columns to which the transformer should be applied

Now let's create the same ColumnTransformer using the make_column_transformer function. When using make_column_transformer, we don't define names for the transformers. Instead, each transformer is assigned a default name, which is the lowercase version of the transformer's class name.

As you can see when we print it out, the one-hot encoder is assigned the name "onehotencoder" (all lowercase), and the passthrough transformer is assigned the name "passthrough".

All of that being said, which one should you use?

In [204]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
ct

ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                 ['Embarked', 'Sex']),
                                ('passthrough', 'passthrough',
                                 ['Parch', 'Fare'])])

I prefer make_column_transformer, because I find the code both easier to read and easier to write, so that's what I'll use in this course. I usually don't mind the default transformer names, and in fact I like that I don't have to come up with a name for each transformer.

However, there are times when defining names for the transformers is useful. Custom names can be clearer if you're performing a grid search of transformer parameters, or if you're using the same type of transformer multiple times in the same ColumnTransformer instance. We'll see examples of this later in the course.
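To illustrate the second case, here is a small sketch of what happens when the same type of transformer appears twice. With make_column_transformer, the default names get numeric suffixes, whereas the ColumnTransformer class lets you choose descriptive names. The names "ohe_embarked" and "ohe_sex" are made up for this example.

```python
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Two transformers of the same class, created with the helper function
ct_default = make_column_transformer(
    (OneHotEncoder(), ['Embarked']),
    (OneHotEncoder(), ['Sex']))
default_names = [name for name, _, _ in ct_default.transformers]

# The same structure with custom (hypothetical) names
ct_custom = ColumnTransformer(
    [('ohe_embarked', OneHotEncoder(), ['Embarked']),
     ('ohe_sex', OneHotEncoder(), ['Sex'])])
custom_names = [name for name, _, _ in ct_custom.transformers]

print(default_names)
print(custom_names)
```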

One final note is that the ColumnTransformer class supports transformer weights (through its transformer_weights parameter), meaning you can emphasize the output of some transformers more than others by multiplying that output by a weight. The specific use case of this is not yet clear to me, but if you do decide to use transformer weights, then you can't use the make_column_transformer function and you must use the ColumnTransformer class.

&nbsp; | ColumnTransformer | make_column_transformer
:--- | :---: | :---:
Allows custom names? | Yes | No
Allows transformer weights? | Yes | No
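If you're curious what transformer weights actually do, the following sketch may help. It uses a three-row made-up DataFrame and an arbitrary weight of 2.0 for the "OHE" transformer (both are assumptions for illustration), and it sets sparse_threshold=0 so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not the course data)
X_demo = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                       'Fare': [7.25, 71.2833, 7.925]})

ct_weighted = ColumnTransformer(
    [('OHE', OneHotEncoder(), ['Sex']),
     ('pass', 'passthrough', ['Fare'])],
    transformer_weights={'OHE': 2.0},  # multiply the OHE output by 2
    sparse_threshold=0)                # always return a dense array

weighted = np.asarray(ct_weighted.fit_transform(X_demo))
print(weighted)
```

The one-hot encoded columns contain 0 and 2 rather than 0 and 1, while the Fare column is unchanged because no weight was assigned to the "pass" transformer.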

## 4.10 Q&A: Should I use Pipeline or make_pipeline?

So far in the course, we've been creating Pipelines using the make_pipeline function. In this lesson, I'll show you how to use the Pipeline class and then compare it to make_pipeline so that you can decide which one you want to use.

To start, we'll import the Pipeline class from the pipeline module, and then we'll create an instance.

When creating an instance, the main difference you might notice is that we're passing in a list of tuples to the Pipeline constructor. Each tuple has two elements, in which the first element is the name you're assigning to the Pipeline step, and the second element is the model or transformer you're including in the Pipeline.

In [205]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
pipe

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Embarked', 'Sex']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch', 'Fare'])])),
                ('classifier',
                 LogisticRegression(random_state=1, solver='liblinear'))])

**Tuple elements for Pipeline:**

1. Step name
2. Model or transformer object

In this case, the first tuple is our preprocessing step using ColumnTransformer, and we're assigning it the name "preprocessor". The second tuple is our model building step using logistic regression, and we're assigning it the name "classifier". We can see these names when we print out the Pipeline.

We can also see the step names by accessing the named_steps attribute of the Pipeline and running the keys method.

In [206]:
pipe.named_steps.keys()

dict_keys(['preprocessor', 'classifier'])

Now let's create the same Pipeline using the make_pipeline function. When using make_pipeline, we don't define names for the steps. Instead, each step is assigned a default name, which is the lowercase version of the step's class name.

As you can see when we print it out, the first step is assigned the name "columntransformer" (all lowercase), and the second step is assigned the name "logisticregression" (all lowercase). Again, we can also see the step names using the named_steps attribute.

All of that being said, which one should you use?

In [207]:
pipe = make_pipeline(ct, logreg)
pipe

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Embarked', 'Sex']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch', 'Fare'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

In [208]:
pipe.named_steps.keys()

dict_keys(['columntransformer', 'logisticregression'])

&nbsp; | Pipeline | make_pipeline
:--- | :---: | :---:
Allows custom names? | Yes | No

I prefer make_pipeline, because I find the code both easier to read and easier to write, so that's what I'll use in this course. I usually don't mind the default step names, and in fact I like that I don't have to come up with a name for each step.

However, custom step names can be useful for clarity, especially if you're performing a grid search of a Pipeline. We'll see many examples of this later in the course.
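One place where the step names become visible is in the parameter names reported by get_params, where each parameter is prefixed by its step name and two underscores. The sketch below rebuilds both versions of the Pipeline (without fitting them) and looks up the name of the logistic regression C parameter in each.

```python
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

ct = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
logreg = LogisticRegression(solver='liblinear', random_state=1)

# Same two steps, with custom names versus default names
named_pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
default_pipe = make_pipeline(ct, logreg)

# Find the parameter name that refers to the C parameter in each Pipeline
named_C = [p for p in named_pipe.get_params() if p.endswith('__C')]
default_C = [p for p in default_pipe.get_params() if p.endswith('__C')]

print(named_C)
print(default_C)
```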

## 4.11 Q&A: How do I examine the steps of a Pipeline?

Sometimes you might want to examine the steps of a fitted Pipeline so that you can understand what's happening within each step. In this lesson, I'll show you how to do it.

We'll start by fitting the Pipeline, which prints out the two steps.

In [209]:
pipe.fit(X, y)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Embarked', 'Sex']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch', 'Fare'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

As I mentioned in the previous lesson, make_pipeline assigned a name to each step, which is the lowercase version of the step's class name. In this case, our step names are "columntransformer" and "logisticregression".

To examine an individual step, you select the named_steps attribute and pass the step name in brackets. Note that if we had assigned custom step names such as "preprocessor" and "classifier", we would be using those here instead.

In [210]:
pipe.named_steps.keys()

dict_keys(['columntransformer', 'logisticregression'])

In [211]:
pipe.named_steps['columntransformer']

ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                 ['Embarked', 'Sex']),
                                ('passthrough', 'passthrough',
                                 ['Parch', 'Fare'])])

In [212]:
pipe.named_steps['logisticregression']

LogisticRegression(random_state=1, solver='liblinear')

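Once you've accessed a step, you can examine its attributes or run its methods. For example, we can run the get_feature_names method from the "columntransformer" step to learn the names of each feature. (Note that get_feature_names was deprecated in scikit-learn 1.0 and removed in version 1.2, so in newer versions you would use get_feature_names_out instead.) As a reminder, the x0 means feature 0 that was passed to the OneHotEncoder, and the x1 means feature 1.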

In [213]:
pipe.named_steps['columntransformer'].get_feature_names()

['onehotencoder__x0_C',
 'onehotencoder__x0_Q',
 'onehotencoder__x0_S',
 'onehotencoder__x1_female',
 'onehotencoder__x1_male',
 'Parch',
 'Fare']

We can also see the coefficient values of the 7 features by examining the "coef_" attribute of the "logisticregression" step. These coefficients are listed in the same order as the features, though the intercept is stored in a separate attribute called "intercept_".

By finding the 4 positive coefficients, you can determine that embarking at port C, being female, and having a higher Parch and Fare are all associated with a greater likelihood of survival. Note that these are just associations the model learned from 10 rows of training data. They are not necessarily statistically significant associations, and in fact scikit-learn does not provide p-values.

In [214]:
pipe.named_steps['logisticregression'].coef_

array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
         0.20056557,  0.01597307]])

Finally, it's worth noting that there are three other ways that you can examine the steps of a Pipeline:

- First, you can use named_steps with periods.
- Second, you can exclude the named_steps attribute entirely.
- And third, you can reference the step by position rather than by name.

Personally, I like the initial bracket notation because I think it's the most readable, even though it's the most typing. However, using named_steps with the periods seems to be the only option that supports autocompleting both the step name and the attribute, which is a nice benefit.

In [215]:
pipe.named_steps.logisticregression.coef_

array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
         0.20056557,  0.01597307]])

In [216]:
pipe['logisticregression'].coef_

array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
         0.20056557,  0.01597307]])

In [217]:
pipe[1].coef_

array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
         0.20056557,  0.01597307]])
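All four notations return the very same step object, which you can confirm with the "is" operator. The following sketch rebuilds the Pipeline without fitting it, since accessing a step does not require a fitted Pipeline.

```python
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

ct = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe_demo = make_pipeline(ct, logreg)

by_bracket = pipe_demo.named_steps['logisticregression']  # named_steps with brackets
by_period = pipe_demo.named_steps.logisticregression      # named_steps with periods
by_name = pipe_demo['logisticregression']                 # without named_steps
by_position = pipe_demo[1]                                # by position

# True only if every notation refers to the same object
all_same = by_bracket is by_period is by_name is by_position
print(all_same)
```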

# Chapter 5: Workflow review #1

## 5.1 Recap of our workflow

In this chapter, we're going to review the workflow that we've built so far to make sure you understand the key concepts before we start adding additional complexity.

To start, we're going to walk through all of the code that is necessary to recreate our workflow up to this point. We begin by importing pandas, the OneHotEncoder and LogisticRegression classes, and the make_column_transformer and make_pipeline functions.

Next, we create a list of the four columns we're going to select from our data. Then, we read in 10 rows of training data and use it to define our X and y. And we read in 10 rows of new data and use it to define X_new.

We create an instance of OneHotEncoder, which is our only transformer at this point. And then we build the ColumnTransformer, which one-hot encodes Embarked and Sex and passes through Parch and Fare.

We also create an instance of logistic regression. Finally, we create a two-step Pipeline, fit the Pipeline to X and y, and use the fitted Pipeline to make predictions on X_new.

That's really all the code we need to recreate our entire workflow from the last few chapters. You'll notice that there are no calls to fit_transform or transform because all of that functionality is encapsulated by the Pipeline.

I did exclude cross-validation from this recap because, as mentioned previously, any model evaluation procedure is highly unreliable with only 10 rows of data. However, we will thoroughly explore the topic of model evaluation later in the course, once we are using the full dataset.

In [218]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [219]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

In [220]:
df = pd.read_csv('http://bit.ly/MLtrain', nrows=10)
X = df[cols]
y = df['Survived']

In [221]:
df_new = pd.read_csv('http://bit.ly/MLnewdata', nrows=10)
X_new = df_new[cols]

In [222]:
ohe = OneHotEncoder()

In [223]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))

In [224]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

In [225]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

## 5.2 Comparing ColumnTransformer and Pipeline

<img src="https://www.dataschool.io/files/simple_pipeline.png" width="400">

**ColumnTransformer vs Pipeline:**

- **ColumnTransformer:**
  - Selects subsets of columns, transforms them independently, stacks the results side-by-side
  - Only includes transformers
  - Does not have steps (transformers operate in parallel)
- **Pipeline:**
  - Series of steps that occur in order
  - Output of each step becomes the input to the next step
  - Last step is a model or transformer, all other steps are transformers

In order to be successful in the rest of the course, it's very important that you clearly understand the differences between a ColumnTransformer and a Pipeline. In this lesson, I'm going to explain those differences. This diagram should help to illustrate the concepts.

Let's start with the ColumnTransformer, which received 4 columns of input from the X DataFrame:

- It selected 2 of those columns, namely Embarked and Sex, and used the OneHotEncoder to transform them into 5 columns.
- It selected the other 2 columns, namely Parch and Fare, and did nothing to them, which of course resulted in 2 columns.
- Finally, it stacked the 5 columns output by OneHotEncoder and the 2 columns output by the passthrough transformer side-by-side, resulting in a total of 7 columns.

Now let's talk about the Pipeline, which has 2 steps:

- Step 1 is a ColumnTransformer that received 4 columns of input and transformed them into 7 columns.
- Step 2 is a logistic regression model that received 7 columns of input and used those 7 columns either for fitting or predicting.

With those examples in mind, we can step back and summarize the differences between a ColumnTransformer and a Pipeline:

- A ColumnTransformer pulls out subsets of columns and transforms them independently, and then stacks the results side-by-side.
- It only ever does data transformations, meaning your ColumnTransformer will never include a model.
- And it does not have steps, because each subset of columns is transformed independently. In other words, data does not flow from one transformer to the next.

In contrast:

- A Pipeline is a series of steps that occur in order, and the output of each step becomes the input to the next step.
- Thus if you had a 3-step Pipeline, the output of step 1 becomes the input to step 2, and the output of step 2 becomes the input to step 3.
- The last step of a Pipeline can be a model or a transformer, whereas all other steps must be transformers.

In summary, a Pipeline contains steps that operate in sequence, whereas a ColumnTransformer contains transformers that operate in parallel. In later chapters, you'll see why this difference is so important and how it guides the structure of our workflow.
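One way to convince yourself that a ColumnTransformer works in parallel is to reproduce its output by hand: transform each subset of columns separately, and then stack the results side-by-side. This sketch does that with four made-up rows (an assumption for illustration) and sets sparse_threshold=0 so that the ColumnTransformer always returns a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the same four columns as X
X_demo = pd.DataFrame({
    'Parch': [0, 0, 0, 1],
    'Fare': [7.25, 71.2833, 7.925, 8.4583],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'male']})

ct = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']),
    sparse_threshold=0)  # always return a dense array
ct_output = np.asarray(ct.fit_transform(X_demo))

# Do the same work by hand: each subset is handled independently
encoded = OneHotEncoder().fit_transform(X_demo[['Embarked', 'Sex']]).toarray()
passed = X_demo[['Parch', 'Fare']].to_numpy()
manual_output = np.hstack([encoded, passed])  # stack side-by-side

print(encoded.shape, passed.shape, ct_output.shape)
print(np.allclose(ct_output, manual_output))
```

The 5 encoded columns and the 2 passthrough columns add up to the 7 columns that the next Pipeline step receives.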

## 5.3 Creating a Pipeline diagram

To wrap up this chapter, I want to show you a feature that's new in scikit-learn version 0.23 that can help you to visualize and thus better understand your Pipelines.

To start, we'll import the set_config function. Then we run it and set the display parameter to "diagram". With that configuration, you'll see a diagram any time you print out a Pipeline or any other estimator.

This is basically the same diagram I showed you in the previous lesson. And you can actually click on any element in order to see more details. For example, if you click on the transformer names, you can see the columns they're transforming. And if you click on the class names, you can see any parameters that have been changed from their default values.

If you use this configuration but you ever need to see the regular text output, you can just use the print function with your Pipeline.

Finally, it's worth noting that displaying diagrams is the default starting in scikit-learn version 1.1. If you'd prefer to always see the text output, you can change the configuration by setting the display parameter to "text".

In [226]:
from sklearn import set_config
set_config(display='diagram')

In [227]:
pipe

In [228]:
print(pipe)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Embarked', 'Sex']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch', 'Fare'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])


In [229]:
set_config(display='text')
pipe

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Embarked', 'Sex']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch', 'Fare'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
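If you only want to change the display setting temporarily, one option (not covered in this lesson) is the config_context context manager, which restores the previous configuration when the block ends. This is a minimal sketch that uses get_config to check the setting inside and outside of the block.

```python
from sklearn import config_context, get_config, set_config

# Start from the diagram display, as in this lesson
set_config(display='diagram')

# Temporarily switch to text output inside the "with" block
with config_context(display='text'):
    inside = get_config()['display']

# The previous setting is restored once the block ends
after = get_config()['display']

print(inside)
print(after)
```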