# Course: [Master Machine Learning with scikit-learn](https://courses.dataschool.io/view/courses/master-machine-learning-with-scikit-learn)

## Chapters 1-5

*© 2022 Data School. All rights reserved.*

# Chapter 1: Introduction

## 1.1 Course overview

**High-level topics:**

- Handling missing values, text data, categorical data, and class imbalance
- Building a reusable workflow
- Feature engineering, selection, and standardization
- Avoiding data leakage
- Tuning your entire workflow

**How you will benefit from this course:**

- Knowledge of best practices
- Confidence when tackling new ML problems
- Ability to anticipate and solve problems
- Improved code quality
- Better, faster results

## 1.2 scikit-learn vs Deep Learning

**Benefits of scikit-learn:**

- Consistent interface to many models
- Many tuning parameters (but sensible defaults)
- Workflow-related functionality
- Exceptional documentation
- Active community support

**Drawbacks of deep learning:**

- More computational resources
- Steeper learning curve
- Less interpretable models

## 1.3 Prerequisite skills

**scikit-learn prerequisites:**

- Loading a dataset
- Defining the features and target
- Training and evaluating a model
- Making predictions with new data

**New to scikit-learn?**

- Enroll in "Introduction to Machine Learning with scikit-learn" (free)
- Available at https://courses.dataschool.io
- Complete lessons 1 through 7

## 1.4 Course setup and software versions

**How to install scikit-learn and pandas:**

- **Option 1:** Install together
  - **Anaconda:** https://www.anaconda.com/products/distribution
- **Option 2:** Install separately
  - **scikit-learn:** https://scikit-learn.org
  - **pandas:** https://pandas.pydata.org

In [None]:
import sklearn
sklearn.__version__

**scikit-learn version:**

- **Course version:** 0.23.2
- **Minimum version:** 0.20.2

**How to install scikit-learn 0.23.2:**

- **Option 1:** conda install scikit-learn==0.23.2
- **Option 2:** pip install -U scikit-learn==0.23.2

In [None]:
import pandas
pandas.__version__

**Using Google Colab with the course:**

- Similar to the Jupyter Notebook
- Runs in your browser
- Free (but requires a Google account)
- Available at https://colab.research.google.com

## 1.5 Course outline

**Chapters:**

1. Introduction
2. Review of the Machine Learning workflow
3. Encoding categorical features
4. Improving your workflow with ColumnTransformer and Pipeline
5. Workflow review #1
6. Encoding text data
7. Handling missing values
8. Fixing common workflow problems
9. Workflow review #2
10. Evaluating and tuning a Pipeline
11. Comparing linear and non-linear models
12. Ensembling multiple models
13. Feature selection
14. Feature standardization
15. Feature engineering with custom transformers
16. Workflow review #3
17. High-cardinality categorical features
18. Class imbalance
19. Class imbalance walkthrough
20. Going further

**Lesson types:**

- Core lessons
- Q&A lessons

**Why not focus on algorithms?**

- Workflow will have a greater impact on your results
- Reusable workflow enables you to try many different algorithms
- Hard to know (in advance) which algorithm will work best

## 1.6 Course datasets

**Datasets:**

- Titanic
- US census
- Mammography scans

**Why use smaller datasets?**

- Easier and faster access to files
- Reduced computational time
- Greater understanding of the course material

## 1.7 Meet your instructor

**About me:**

- Founder of Data School
- Teaching data science for 7+ years
- Passionate about teaching people who are new to data science
- Live in Asheville, North Carolina
- Degree in Computer Engineering

# Chapter 2: Review of the Machine Learning workflow

## 2.1 Loading and exploring a dataset

In [1]:
import pandas as pd
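# Option 1: read the training data from a URL
df = pd.read_csv('http://bit.ly/MLtrain', nrows=10)
# Option 2: read it from a local file instead (use whichever option you prefer)
df = pd.read_csv('titanic_train.csv', nrows=10)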

In [2]:
df

&nbsp; | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
:--- | :---: | :---: | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S
4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S
5 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q
6 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
7 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S
8 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S
9 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C


**Machine Learning terminology:**

- **Target:** Goal of prediction
- **Classification:** Problem with a categorical target
- **Feature:** Input to the model (column)
- **Sample:** Single observation (row)
- **Training data:** Data with known target values

**Feature selection methods:**

- Human intuition
- Domain knowledge
- Data exploration
- Automated methods

**Currently selected features:**

- **Parch:** Number of parents or children aboard with that passenger
- **Fare:** Amount the passenger paid

In [3]:
X = df[['Parch', 'Fare']]
X

&nbsp; | Parch | Fare
:--- | :---: | :---:
0 | 0 | 7.25
1 | 0 | 71.2833
2 | 0 | 7.925
3 | 0 | 53.1
4 | 0 | 8.05
5 | 0 | 8.4583
6 | 0 | 51.8625
7 | 1 | 21.075
8 | 2 | 11.1333
9 | 0 | 30.0708


In [4]:
y = df['Survived']
y

0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64

In [5]:
X.shape

(10, 2)

In [6]:
y.shape

(10,)

## 2.2 Building and evaluating a model

In [7]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=1)

Now that we've defined X and y, our next step is to build and evaluate a model.

To start, we're going to use logistic regression as our model. It's a good default choice for classification problems because it's both fast and interpretable.

We import it from the linear_model module, and then we create an instance called logreg. This is our model object.

The default solver for logistic regression has changed between different scikit-learn versions, but in this course I'm going to set the solver to liblinear. I'm specifying the solver explicitly and setting a value for random_state so that if you run the same code at home, you will most likely get the same results as me.

Let's now talk about model evaluation. The goal of model evaluation is to simulate how a model will perform on future data so that we can choose between models today. To do model evaluation, we need both an evaluation procedure and an evaluation metric.

The procedure we will use is K-fold cross-validation. Another option is to use train/test split, but cross-validation is generally superior because it gives a lower-variance estimate of model performance.

The metric we will use is classification accuracy. There are many other classification metrics we could have chosen, but accuracy is suitable for this problem for two reasons:

- First, there is not significant class imbalance.
- And second, predicting the positive class correctly is just as important to us as predicting the negative class correctly.

That being said, I will cover other classification metrics in the chapters on class imbalance.

With such a small dataset, we're going to use 3-fold cross-validation, rather than the more typical 5 or 10 folds. Let me briefly review what happens during 3-fold cross-validation:

- The rows are split into 3 subsets, which we'll call A, B, and C.
- First, A and B together become the training set, and C becomes the testing set. The model is trained on the training set, the trained model makes predictions for the testing set, and those predictions are evaluated.
- Next, A and C together become the training set, and B becomes the testing set. Again, the model is trained, it makes predictions, and the predictions are evaluated.
- Finally, B and C together become the training set, and A becomes the testing set. The training, predicting, and evaluation process happens one final time.
- Because the evaluation process occurred 3 times, it returns 3 scores, and we will usually take the mean of those scores.

Let's go ahead and use cross-validation to evaluate our model:

- First we import the cross_val_score function from the model_selection module.
- Then we pass it the model object, X and y, the number of cross-validation folds, and the evaluation metric. Although the default metric for classification problems is accuracy, I recommend specifying it explicitly so that there's no ambiguity.
- When we run cross_val_score, it does the dataset splitting, training, predicting, and evaluation. 3 accuracy scores are returned, and the mean of those scores is 69%.

If you received a different result, that's not a problem. The results can vary based on your scikit-learn version due to changes in the default parameters, algorithm changes, bug fixes, and so on.

Unfortunately, we can't take these results seriously because the dataset is so small. There's actually no reliable evaluation procedure when your training data only contains 10 rows, but I did want to demonstrate it anyway to emphasize that model evaluation is a normal part of the Machine Learning workflow.


**Requirements for model evaluation:**

- **Procedure:** K-fold cross-validation
- **Metric:** Classification accuracy

**Steps of 3-fold cross-validation:**

1. Split rows into 3 subsets (A, B, C)
2. A & B is training set, C is testing set
  - Train model on training set
  - Make predictions on testing set
  - Evaluate predictions
3. Repeat with A & C as training set, B as testing set
4. Repeat with B & C as training set, A as testing set
5. Calculate the mean of the scores

In [8]:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()

0.6944444444444443
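
The steps of 3-fold cross-validation listed above can also be written out by hand. This is only a minimal sketch, not how cross_val_score is implemented internally. It assumes that, for a classifier and an integer cv, cross_val_score splits the rows with an unshuffled StratifiedKFold, and it rebuilds the 10 training rows shown earlier so that it runs on its own. The names X_demo, y_demo, and manual_scores are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Rebuild the 10 training rows shown earlier (Parch, Fare, Survived)
X_demo = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625,
             21.075, 11.1333, 30.0708]})
y_demo = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name='Survived')

logreg = LogisticRegression(solver='liblinear', random_state=1)

manual_scores = []
# Each iteration uses two subsets for training and the third for testing
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_demo, y_demo):
    model = clone(logreg)  # fresh, unfitted copy for every fold
    model.fit(X_demo.iloc[train_idx], y_demo.iloc[train_idx])
    y_pred = model.predict(X_demo.iloc[test_idx])
    manual_scores.append(accuracy_score(y_demo.iloc[test_idx], y_pred))

# The built-in helper should produce the same three scores
cv_scores = cross_val_score(logreg, X_demo, y_demo, cv=3, scoring='accuracy')
print(np.mean(manual_scores), cv_scores.mean())
```

Writing the loop by hand is rarely necessary, but it makes clear that each score comes from a model that never saw the rows it is evaluated on.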

## 2.3 Using the model to make predictions

At this point in the workflow, we would typically try making changes in order to achieve a better accuracy, such as:

- Tuning the model's hyperparameters
- Adding or removing features
- Or trying a classification model other than logistic regression.

We'll cover these topics in detail later in the course, but for now, let's assume that we're happy with the model as-is. Thus, our next steps are to train the model and then use it to make predictions on new data.

We use the model's fit method, which instructs the model to try to learn the relationship between X and y.

There are four important points I want to note here:

- First, you should train your model on the entire dataset before using it to make predictions, otherwise you are throwing away valuable training data. In truth, we do have more than 10 rows, but for now we are considering our entire training dataset to be these 10 rows.
- Second, the model object is modified in-place when you run the fit method, and so there's no need to overwrite the logreg object using an assignment statement.
- Third, scikit-learn understands how to work with pandas objects, and so we can pass X and y directly to the fit method.
- And finally, if you're using scikit-learn 0.23 or later, you will only see the parameters that have changed from the defaults when you print or fit a model. That's why it only displays the random_state and solver parameters, whereas in previous versions of scikit-learn, all model parameters would have been displayed.

Now, let's read in a new dataset for which we don't know the target values. You can read it from a URL or from your local computer, so choose whichever option you prefer. Again, we are only going to keep the first 10 rows.

You'll notice that it has the same columns as the df DataFrame, except that there's no Survived column, which is the column that we're going to predict.

Before we make predictions, we have to define X_new. It has to have the same columns as X, and those columns have to be in the same order.

Finally, we'll use the trained model to make predictions by passing X_new to the predict method, which outputs a NumPy array. There are 10 predictions because it makes 1 prediction for each sample in X_new.

The predictions are in the same order as the samples in X_new, meaning the first prediction is for the first row in X_new, the second prediction is for the second row in X_new, and so on.

Note that we can't actually evaluate the accuracy of these predictions because we don't know the true target values for the samples in X_new.

**Ways to improve the model:**

- Hyperparameter tuning
- Adding or removing features
- Trying a different model

In [9]:
logreg.fit(X, y)

LogisticRegression(random_state=1, solver='liblinear')

**Important points about model fitting:**

- Train your model on the entire dataset before making predictions
- Assignment statement is unnecessary
- Passing pandas objects is fine
- Only prints parameters that have changed (version 0.23 or later)

In [10]:
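# Option 1: read the new data from a URL
df_new = pd.read_csv('http://bit.ly/MLnewdata', nrows=10)
# Option 2: read it from a local file instead (use whichever option you prefer)
df_new = pd.read_csv('titanic_new.csv', nrows=10)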
df_new

&nbsp; | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
:--- | :---: | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q
1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S
2 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q
3 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S
4 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S
5 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.225 | NaN | S
6 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q
7 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0 | NaN | S
8 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C
9 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.15 | NaN | S


In [11]:
X_new = df_new[['Parch', 'Fare']]
X_new

&nbsp; | Parch | Fare
:--- | :---: | :---:
0 | 0 | 7.8292
1 | 0 | 7.0
2 | 0 | 9.6875
3 | 0 | 8.6625
4 | 1 | 12.2875
5 | 0 | 9.225
6 | 0 | 7.6292
7 | 1 | 29.0
8 | 0 | 7.2292
9 | 0 | 24.15


In [12]:
logreg.predict(X_new)

array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1], dtype=int64)

## 2.4 Q&A: How do I adapt this workflow to a regression problem?

**Adapting this workflow for regression:**

1. Choose a different model
2. Choose a different evaluation metric

## 2.5 Q&A: How do I adapt this workflow to a multiclass problem?

**Types of classification problems:**

- **Binary:** Two output classes
- **Multiclass:** More than two output classes

**How classifiers handle multiclass problems:**

- Many are inherently multiclass
- Others can be extended using "one-vs-one" or "one-vs-rest" strategies
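
To make the "one-vs-rest" strategy concrete, here is a small sketch that applies it explicitly with OneVsRestClassifier. It uses the iris dataset bundled with scikit-learn, which has three classes and is not one of the course datasets, so treat the dataset and the names X_iris, y_iris, and ovr as assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# iris has three output classes (illustrative dataset, not used in the course)
X_iris, y_iris = load_iris(return_X_y=True)

# Wrap a classifier so that one binary model is trained per class
ovr = OneVsRestClassifier(
    LogisticRegression(solver='liblinear', random_state=1))

# The evaluation workflow is unchanged from the binary case
ovr_scores = cross_val_score(ovr, X_iris, y_iris, cv=5, scoring='accuracy')
print(ovr_scores.mean())

ovr.fit(X_iris, y_iris)
print(len(ovr.estimators_))  # one fitted binary classifier per class
```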

## 2.6 Q&A: Why should I select a Series for the target?

In [13]:
df['Survived']

0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64

In [14]:
df[['Survived']]

&nbsp; | Survived
:--- | :---:
0 | 0
1 | 1
2 | 1
3 | 1
4 | 0
5 | 0
6 | 0
7 | 0
8 | 1
9 | 1


In [15]:
df['Survived'].shape

(10,)

In [16]:
df[['Survived']].shape

(10, 1)

In [17]:
df['Survived'].to_numpy()

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], dtype=int64)

In [18]:
df[['Survived']].to_numpy()

array([[0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1]], dtype=int64)

**Multilabel vs multiclass problems:**

- **Multilabel:** Each sample can have more than one label
- **Multiclass:** Each sample can have exactly one label

**Multilabel vs multiclass targets:**

- **Multilabel:** 2-dimensional y (DataFrame)
- **Multiclass:** 1-dimensional y (Series)

## 2.7 Q&A: How do I add the model's predictions to a DataFrame?

In [None]:
predictions = pd.Series(logreg.predict(X_new), index=X_new.index,
                        name='Prediction')

In [None]:
pd.concat([X_new, predictions], axis='columns')

## 2.8 Q&A: How do I determine the confidence level of each prediction?

In [None]:
logreg.predict(X_new)

In [None]:
logreg.predict_proba(X_new)

**Array of predicted probabilities:**

- One row for each sample
- One column for each class

In [None]:
logreg.predict_proba(X_new)[:, 1]
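
The sketch below shows how the array of predicted probabilities relates to the class predictions. It rebuilds the 10 training rows and 10 new rows shown earlier so that it runs on its own. The names X_demo, y_demo, X_new_demo, and preds_from_proba are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Rebuild the training rows and new rows shown earlier (Parch, Fare)
X_demo = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625,
             21.075, 11.1333, 30.0708]})
y_demo = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name='Survived')
X_new_demo = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    'Fare': [7.8292, 7.0, 9.6875, 8.6625, 12.2875, 9.225, 7.6292,
             29.0, 7.2292, 24.15]})

logreg = LogisticRegression(solver='liblinear', random_state=1)
logreg.fit(X_demo, y_demo)

proba = logreg.predict_proba(X_new_demo)
print(logreg.classes_)  # the column order of proba follows classes_
print(proba.shape)      # one row per sample, one column per class

# Each row sums to 1, and predict returns the class with the larger value
preds_from_proba = logreg.classes_[proba.argmax(axis=1)]
print(preds_from_proba)
```

Selecting column 1 with `[:, 1]` therefore gives the predicted probability of class 1, which here means the passenger survived.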

## 2.9 Q&A: How do I check the accuracy of the model's predictions?

**Checking model accuracy:**

- **Not possible:** Target value is unknown or is private data
- **Possible:** Target value is known

## 2.10 Q&A: What do the "solver" and "random_state" parameters do?

In [None]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

<img src="https://www.dataschool.io/files/solver_comparison.png">

**Default solver for logistic regression:**

- **Before version 0.22:** liblinear
- **Starting in version 0.22:** lbfgs

**Advice for random_state:**

- Set random_state to any integer when a random process is involved
- Allows your code to be reproducible

## 2.11 Q&A: How do I show all of the model parameters?

In [None]:
logreg

In [None]:
logreg.get_params()

In [None]:
from sklearn import set_config
set_config(print_changed_only=False)

In [None]:
logreg

In [None]:
set_config(print_changed_only=True)

## 2.12 Q&A: Should I shuffle the samples when using cross-validation?

In [None]:
cross_val_score(logreg, X, y, cv=3, scoring='accuracy')

In [None]:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(3)
cross_val_score(logreg, X, y, cv=kf, scoring='accuracy')

**Stratified sampling:**

- Ensures that each fold is representative of the dataset
- Produces more reliable cross-validation scores

In [None]:
kf = StratifiedKFold(3, shuffle=True, random_state=1)
cross_val_score(logreg, X, y, cv=kf, scoring='accuracy')

**When to shuffle your samples:**

- **Samples in arbitrary order:** Shuffling not needed
- **Samples are ordered:** Shuffling needed

**How to shuffle your samples:**

- **Classification:** StratifiedKFold
- **Regression:** KFold

# Chapter 3: Encoding categorical features

## 3.1 Introduction to one-hot encoding

**How to run the code above:**

- **Jupyter Notebook:**
  - Select this cell
  - Click "Cell" menu, then "Run All Above"
- **JupyterLab:**
  - Select this cell
  - Click "Run" menu, then "Run All Above Selected Cell"

In [None]:
df

**Currently selected features:**

- **Parch:** Number of parents or children aboard with that passenger
- **Fare:** Amount the passenger paid
- **Embarked:** Port the passenger embarked from
- **Sex:** Male or Female

**Unordered categorical data:**

- Contains distinct categories
- No inherent logical ordering to the categories
- Also called "nominal data"

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

**Matrix representations:**

- **Sparse:** More efficient and performant
- **Dense:** More readable

**Why use double brackets?**

- **Single brackets:**
  - Outputs a Series
  - Could be interpreted as a single feature or a single sample
- **Double brackets:**
  - Outputs a single-column DataFrame
  - Interpreted as a single feature

In [None]:
ohe.fit_transform(df[['Embarked']])

**Output of OneHotEncoder:**

- One column for each unique value
- One non-zero value in each row:
  - 1, 0, 0 means "C"
  - 0, 1, 0 means "Q"
  - 0, 0, 1 means "S"

In [None]:
ohe.categories_
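
To see this output concretely, here is a small sketch that encodes the Embarked values from the 10 training rows shown in Chapter 2. One deliberate difference from the course code: it keeps the encoder's default sparse output and converts it with toarray(), because the sparse parameter was renamed sparse_output in later scikit-learn releases. The names embarked, ohe_demo, and encoded are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Embarked values for the 10 training rows shown in Chapter 2
embarked = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'S', 'S',
                                      'Q', 'S', 'S', 'S', 'C']})

# Keep the default sparse output and convert it to a dense array,
# so the sketch does not depend on the dense-output parameter name
ohe_demo = OneHotEncoder()
encoded = ohe_demo.fit_transform(embarked).toarray()

print(ohe_demo.categories_)  # learned categories, in column order
print(encoded[:3])           # first three rows: S, C, S
```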

**Why use one-hot encoding?**

- Model can learn the relationship between each level and the target value
- Example: Model might learn that "C" passengers have a higher survival rate than "not C" passengers

**Why not encode as a single feature?**

- **Pretend:**
  - C: high survival rate
  - Q: low survival rate
  - S: high survival rate
- **Single feature would need two coefficients:**
  - Negative coefficient for impact of Q (with respect to C)
  - Positive coefficient for impact of S (with respect to Q)

## 3.2 Transformer methods: fit, transform, fit_transform

**Generic transformer methods:**

- **fit:** Transformer learns something
- **transform:** Transformer uses what it learned to do the data transformation

**OneHotEncoder methods:**

- **fit:** Learn the categories
- **transform:** Create the feature matrix using those categories

## 3.3 One-hot encoding of multiple features

In [None]:
ohe.fit_transform(df[['Embarked', 'Sex']])

In [None]:
ohe.categories_

**Decoding the output array:**

- **First three columns:**
  - 1, 0, 0 means "C"
  - 0, 1, 0 means "Q"
  - 0, 0, 1 means "S"
- **Last two columns:**
  - 1, 0 means "female"
  - 0, 1 means "male"
- **Example:**
  - 0, 0, 1, 0, 1 means "S, male"
  - 1, 0, 0, 1, 0 means "C, female"

**How to manually add Embarked and Sex to the model:**

1. Stack Parch and Fare side-by-side with OneHotEncoder output
2. Repeat the same process with new data

**Problems with a manual approach:**

- Repeating steps is inefficient and error-prone
- Complexity will increase

## 3.4 Q&A: When should I use transform instead of fit_transform?

In [None]:
demo_train = pd.DataFrame({'letter':['A', 'B', 'C', 'B']})
demo_train

In [None]:
ohe.fit_transform(demo_train)

**Example of fit_transform on training data:**

- **fit:** Learn 3 categories (A, B, C)
- **transform:** Create feature matrix with 3 columns

In [None]:
demo_test = pd.DataFrame({'letter':['A', 'C', 'A']})
demo_test

In [None]:
ohe.fit_transform(demo_test)

**Example of fit_transform on testing data:**

- **fit:** Learn 2 categories (A, C)
- **transform:** Create feature matrix with 2 columns

In [None]:
ohe.fit_transform(demo_train)

In [None]:
ohe.transform(demo_test)

**Correct process:**

1. Run fit_transform on training data:
  - **fit:** Learn 3 categories (A, B, C)
  - **transform:** Create feature matrix with 3 columns
2. Run transform on testing data:
  - **transform:** Create feature matrix with 3 columns
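
The difference between the incorrect and correct process shows up in the shape of the output, as the following sketch demonstrates. It keeps the encoder's default sparse output and converts it with toarray() so that it does not depend on the name of the dense-output parameter. The names ohe_demo, wrong, train_matrix, and test_matrix are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_train = pd.DataFrame({'letter': ['A', 'B', 'C', 'B']})
demo_test = pd.DataFrame({'letter': ['A', 'C', 'A']})

ohe_demo = OneHotEncoder()

# Incorrect: refitting on the testing data learns only 2 categories (A, C)
wrong = ohe_demo.fit_transform(demo_test).toarray()

# Correct: fit on the training data, then only transform the testing data
train_matrix = ohe_demo.fit_transform(demo_train).toarray()
test_matrix = ohe_demo.transform(demo_test).toarray()

# The incorrect matrix has 2 columns, the correct ones both have 3
print(wrong.shape, train_matrix.shape, test_matrix.shape)
```

A model trained on 3 columns cannot make predictions from a 2-column matrix, and even when the column counts happen to match, the columns may no longer mean the same thing.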

## 3.5 Q&A: What happens if the testing data includes a new category?

In [None]:
demo_train

In [None]:
ohe.fit_transform(demo_train)

In [None]:
ohe.categories_

In [None]:
demo_test_unknown = pd.DataFrame({'letter':['A', 'C', 'D']})
demo_test_unknown

In [None]:
ohe.transform(demo_test_unknown)

In [None]:
ohe = OneHotEncoder(sparse=False, categories=[['A', 'B', 'C', 'D']])

In [None]:
ohe.fit_transform(demo_train)

In [None]:
ohe.transform(demo_test_unknown)

**Why you might not know all possible categories:**

- Rare categories aren't present in your set of samples
- New categories are added later

In [None]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

In [None]:
ohe.fit_transform(demo_train)

In [None]:
ohe.transform(demo_test_unknown)

**Advice for OneHotEncoder:**

1. Start with handle_unknown set to 'error'
2. If possible, specify the categories manually
3. If necessary, set handle_unknown to 'ignore' and then retrain your model
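
The two handle_unknown settings can be compared side by side with the following sketch. It again converts the default sparse output with toarray() instead of setting the dense-output parameter, and the names strict, lenient, raised, and ignored are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_train = pd.DataFrame({'letter': ['A', 'B', 'C', 'B']})
demo_test_unknown = pd.DataFrame({'letter': ['A', 'C', 'D']})

# Default (handle_unknown='error'): an unseen category raises an error
strict = OneHotEncoder()
strict.fit(demo_train)
try:
    strict.transform(demo_test_unknown)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# handle_unknown='ignore': the unseen category is encoded as all zeros
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(demo_train)
ignored = lenient.transform(demo_test_unknown).toarray()
print(ignored)
```

Starting with 'error' is useful precisely because the exception tells you that a new category has appeared, rather than silently encoding it as zeros.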

## 3.6 Q&A: Should I drop one of the one-hot encoded categories?

In [None]:
demo_train

In [None]:
ohe.fit_transform(demo_train)

**You can drop the first column:**

- Contains redundant information
- Avoids collinearity between features

In [None]:
ohe = OneHotEncoder(sparse=False, drop='first')
ohe.fit_transform(demo_train)

**Decoding the output array (after dropping the first column):**

- 0, 0 means "A"
- 1, 0 means "B"
- 0, 1 means "C"

**Should you drop the first column?**

- **Advantages:**
  - Useful if perfectly collinear features will cause problems (does not apply to most models)
- **Disadvantages:**
  - Incompatible with handle_unknown='ignore'
  - Introduces bias if you standardize features or use a regularized model

## 3.7 Q&A: How do I encode an ordinal feature?

**Types of categorical data:**

- Unordered (nominal data)
- Ordered (ordinal data)

In [None]:
df

**Options for encoding Pclass:**

- **Ordinal encoding:** Creates one feature
- **One-hot encoding:** Creates three features

In [None]:
df_ordinal = pd.DataFrame({'Class': ['third', 'first', 'second', 'third'],
                           'Size': ['S', 'S', 'L', 'XL']})
df_ordinal

In [None]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['first', 'second', 'third'],
                                ['S', 'M', 'L', 'XL']])

In [None]:
oe.fit_transform(df_ordinal)

**Decoding the output array:**

- **First column:**
  - 0 means "first"
  - 1 means "second"
  - 2 means "third"
- **Second column:**
  - 0 means "S"
  - 1 means "M"
  - 2 means "L"
  - 3 means "XL"
- **Example:**
  - 2, 0 means "third, S"

In [None]:
ohe = OneHotEncoder(sparse=False, categories=[['first', 'second', 'third'],
                                              ['S', 'M', 'L', 'XL']])
ohe.fit_transform(df_ordinal)

**Advice for encoding categorical data:**

- **Ordinal feature stored as numbers:** Leave as-is
- **Ordinal feature stored as strings:** Use OrdinalEncoder
- **Nominal feature:** Use OneHotEncoder

## 3.8 Q&A: What's the difference between OrdinalEncoder and LabelEncoder?

&nbsp; | OrdinalEncoder | LabelEncoder
:--- | :---: | :---:
Can you define the category order? | Yes | No
Can you encode multiple features? | Yes | No

**Outdated uses for LabelEncoder:**

- Encoding string-based labels for some classifiers
- Encoding string-based features for OneHotEncoder

## 3.9 Q&A: Should I encode numeric features as ordinal features?

In [None]:
df[['Fare']]

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
kb = KBinsDiscretizer(n_bins=3, strategy='quantile', encode='ordinal')

In [None]:
kb.fit_transform(df[['Fare']])

**Why not discretize numeric features?**

- Makes it harder to learn the actual trends
- Makes it easier to discover non-existent trends
- May result in overfitting

# Chapter 4: Improving your workflow with ColumnTransformer and Pipeline

## 4.1 Preprocessing features with ColumnTransformer

**Problems from Chapter 3:**

- Need to stack categorical features next to numerical features
- Need to apply the same preprocessing to new data

**How to solve those problems:**

- **ColumnTransformer:** Apply different preprocessing steps to different columns
- **Pipeline:** Apply the same workflow to training data and new data

In [None]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

In [None]:
X = df[cols]
X

In [None]:
ohe = OneHotEncoder()

In [None]:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='drop')

**Tuple elements for make_column_transformer:**

1. Transformer object
2. List of columns to which the transformer should be applied

In [None]:
ct.fit_transform(X)

**Output columns:**

- **Columns 1-3:** Embarked
- **Columns 4-5:** Sex

In [None]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')

In [None]:
ct.fit_transform(X)

**Output columns:**

- **Columns 1-3:** Embarked
- **Columns 4-5:** Sex
- **Column 6:** Parch
- **Column 7:** Fare

In [None]:
ct.get_feature_names()

**Notes about get_feature_names:**

- **Before version 0.23:** Didn't work with passthrough columns
- **Starting in version 1.0:** Has been replaced with get_feature_names_out

**Tuple elements for make_column_transformer (revised):**

1. Transformer object or "drop" or "passthrough"
2. List of columns to which the transformer should be applied

In [None]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))

In [None]:
ct.fit_transform(X)

## 4.2 Chaining steps with Pipeline

In [None]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ct, logreg)

**Pipeline steps:**

1. Data preprocessing with ColumnTransformer
2. Model building with LogisticRegression

In [None]:
pipe.fit(X, y)

**Fitting the Pipeline:**

1. ColumnTransformer converts X (4 columns) into a numeric feature matrix (7 columns)
2. LogisticRegression model is fit to the feature matrix

In [None]:
X_t = ct.fit_transform(X)
logreg.fit(X_t, y)

In [None]:
print(X.shape)
print(X_t.shape)

## 4.3 Using the Pipeline to make predictions

In [None]:
X_new = df_new[cols]
X_new

In [None]:
pipe.predict(X_new)

**Predicting with the Pipeline:**

1. ColumnTransformer applies the same transformations to X_new
2. Fitted LogisticRegression model makes predictions on the transformed version of X_new

In [None]:
X_new_t = ct.transform(X_new)
logreg.predict(X_new_t)

In [None]:
print(X_new.shape)
print(X_new_t.shape)

**ColumnTransformer methods:**

1. Run fit_transform on X:
  - **fit:** Learn the encoding
  - **transform:** Apply the encoding to create 7 columns
2. Run transform on X_new:
  - **transform:** Apply the encoding to create 7 columns
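
To check that the Pipeline really performs these steps, the sketch below compares its predictions with the manual fit_transform / transform sequence. It rebuilds the relevant columns of the training and new data shown in Chapter 2 so that it runs on its own, and it uses clone so that the Pipeline and the manual version each fit their own copies of the objects. The names X_demo, y_demo, X_new_demo, pipe_preds, and manual_preds are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X_demo = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625,
             21.075, 11.1333, 30.0708],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C'],
    'Sex': ['male', 'female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female', 'female']})
y_demo = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name='Survived')
X_new_demo = pd.DataFrame({
    'Parch': [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    'Fare': [7.8292, 7.0, 9.6875, 8.6625, 12.2875, 9.225, 7.6292,
             29.0, 7.2292, 24.15],
    'Embarked': ['Q', 'S', 'Q', 'S', 'S', 'S', 'Q', 'S', 'C', 'S'],
    'Sex': ['male', 'female', 'male', 'male', 'female',
            'male', 'female', 'male', 'female', 'male']})

ct = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    remainder='passthrough')
logreg = LogisticRegression(solver='liblinear', random_state=1)

# Pipeline: one fit call and one predict call
pipe = make_pipeline(clone(ct), clone(logreg))
pipe.fit(X_demo, y_demo)
pipe_preds = pipe.predict(X_new_demo)

# Manual equivalent: fit_transform on X, then transform (only) on X_new
X_t = ct.fit_transform(X_demo)
logreg.fit(X_t, y_demo)
X_new_t = ct.transform(X_new_demo)
manual_preds = logreg.predict(X_new_t)

print(X_t.shape, X_new_t.shape)
print(np.array_equal(pipe_preds, manual_preds))
```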

## 4.4 Q&A: How do I drop some columns and passthrough others?

In [None]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('drop', ['Fare']),
    remainder='passthrough')
ct.fit_transform(X)

In [None]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch']),
    remainder='drop')
ct.fit_transform(X)

## 4.5 Q&A: How do I transform the unspecified columns?

In [None]:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()

In [None]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('drop', ['Fare']),
    remainder=scaler)
ct.fit_transform(X)

## 4.6 Q&A: How do I select columns from a NumPy array?

In [None]:
X_array = X.to_numpy()
X_new_array = X_new.to_numpy()

In [None]:
X_array

In [None]:
ct = make_column_transformer(
    (ohe, [2, 3]),
    remainder='passthrough')
ct.fit_transform(X_array)

In [None]:
ct = make_column_transformer(
    (ohe, slice(2, 4)),
    remainder='passthrough')
ct.fit_transform(X_array)

In [None]:
ct = make_column_transformer(
    (ohe, [False, False, True, True]),
    remainder='passthrough')
ct.fit_transform(X_array)

**Options for selecting columns from a NumPy array:**

- Integer position
- Slice
- Boolean mask

In [None]:
pipe = make_pipeline(ct, logreg)

In [None]:
pipe.fit(X_array, y)
pipe.predict(X_new_array)

## 4.7 Q&A: How do I select columns by data type?

In [None]:
from sklearn.compose import make_column_selector

In [None]:
select_object = make_column_selector(dtype_include=object)
select_number = make_column_selector(dtype_include='number')

In [None]:
ct = make_column_transformer(
    (ohe, select_object),
    ('passthrough', select_number))
ct.fit_transform(X)

In [None]:
exclude_object = make_column_selector(dtype_exclude=object)

In [None]:
ct = make_column_transformer(
    (ohe, select_object),
    ('passthrough', exclude_object))
ct.fit_transform(X)

In [None]:
select_datetime = make_column_selector(dtype_include='datetime')
select_category = make_column_selector(dtype_include='category')

In [None]:
select_multiple = make_column_selector(dtype_include=[object, 'category'])

## 4.8 Q&A: How do I select columns by column name pattern?

In [None]:
select_ES = make_column_selector(pattern='E|S')

In [None]:
ct = make_column_transformer(
    (ohe, select_ES),
    remainder='passthrough')
ct.fit_transform(X)
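
If you're unsure which columns a selector will pick, you can call it directly on a DataFrame, as in this small sketch. The three-row X_demo DataFrame is made up for illustration and only mimics the column names and data types used in this chapter.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Tiny stand-in for X with the same column names (illustrative values)
X_demo = pd.DataFrame({
    'Parch': [0, 1, 2],
    'Fare': [7.25, 21.075, 11.1333],
    'Embarked': ['S', 'S', 'C'],
    'Sex': ['male', 'male', 'female']})

select_ES = make_column_selector(pattern='E|S')
select_number = make_column_selector(dtype_include='number')

# A selector is callable: pass it a DataFrame to get the matching column names
print(select_ES(X_demo))
print(select_number(X_demo))
```

The pattern is a case-sensitive regular expression that is searched for anywhere in the column name, so 'E|S' matches any name containing an uppercase E or an uppercase S.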

## 4.9 Q&A: Should I use ColumnTransformer or make_column_transformer?

In [None]:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('OHE', ohe, ['Embarked', 'Sex']),
     ('pass', 'passthrough', ['Parch', 'Fare'])])
ct

**Tuple elements for ColumnTransformer:**

1. Transformer name
2. Transformer object or "drop" or "passthrough"
3. List of columns to which the transformer should be applied

In [None]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
ct

&nbsp; | ColumnTransformer | make_column_transformer
:--- | :---: | :---:
Allows custom names? | Yes | No
Allows transformer weights? | Yes | No

## 4.10 Q&A: Should I use Pipeline or make_pipeline?

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
pipe

**Tuple elements for Pipeline:**

1. Step name
2. Model or transformer object

In [None]:
pipe.named_steps.keys()

In [None]:
pipe = make_pipeline(ct, logreg)
pipe

In [None]:
pipe.named_steps.keys()

&nbsp; | Pipeline | make_pipeline
:--- | :---: | :---:
Allows custom names? | Yes | No

## 4.11 Q&A: How do I examine the steps of a Pipeline?

In [None]:
pipe.fit(X, y)

In [None]:
pipe.named_steps.keys()

In [None]:
pipe.named_steps['columntransformer']

In [None]:
pipe.named_steps['logisticregression']

In [None]:
pipe.named_steps['columntransformer'].get_feature_names()

In [None]:
pipe.named_steps['logisticregression'].coef_

In [None]:
pipe.named_steps.logisticregression.coef_

In [None]:
pipe['logisticregression'].coef_

In [None]:
pipe[1].coef_

# Chapter 5: Workflow review #1

## 5.1 Recap of our workflow

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [None]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

In [None]:
df = pd.read_csv('http://bit.ly/MLtrain', nrows=10)
X = df[cols]
y = df['Survived']

In [None]:
df_new = pd.read_csv('http://bit.ly/MLnewdata', nrows=10)
X_new = df_new[cols]

In [None]:
ohe = OneHotEncoder()

In [None]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))

In [None]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

In [None]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

## 5.2 Comparing ColumnTransformer and Pipeline

<img src="https://www.dataschool.io/files/simple_pipeline.png" width="400">

**ColumnTransformer vs Pipeline:**

- **ColumnTransformer:**
  - Selects subsets of columns, transforms them independently, stacks the results side-by-side
  - Only includes transformers
  - Does not have steps (transformers operate in parallel)
- **Pipeline:**
  - Series of steps that occur in order
  - Output of each step becomes the input to the next step
  - Last step is a model or transformer, all other steps are transformers

## 5.3 Creating a Pipeline diagram

In [None]:
from sklearn import set_config
set_config(display='diagram')

In [None]:
pipe

In [None]:
print(pipe)

In [None]:
set_config(display='text')
pipe