# Preprocessing and Pipelines

In this module, you will learn some simple ways to preprocess your data. We will also look at how scikit-learn's pipelines help prevent data leakage.

<b>Functions and attributes in this lecture: </b>
- `pandas` - Pandas package with alias `pd`
  - `.mean()` - Get the mean value of each column in a dataframe
  - `.replace()` - Replace values in a series with new values
  - `.drop()` - Drop certain columns or rows
  - `.dropna()` - Drop the rows with missing values
  - `.fillna()` - Fill in the missing values with a specific value
- `sklearn.preprocessing` - Submodule for preprocessing data
  - `StandardScaler()` - Standardize features to zero mean and unit variance
    - `.fit()` - Fit the scaler on the data
    - `.transform()` - Transform data by scaling it
    - `.mean_` - Get the per-feature mean used for the scaling
    - `.var_` - Get the per-feature variance used for the scaling
- `sklearn.pipeline` - Submodule for assembling pipelines
  - `Pipeline()` - Basic constructor for setting up a pipeline
    - `.fit()` - Training the pipeline on the data
    - `.predict()` - Predict values on new data
    - `.score()` - Get the score determined by the last model in the pipeline
    - `.named_steps` - Get the components of the pipeline

In [None]:
# Non-sklearn packages
import numpy as np
import pandas as pd

# Sklearn packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Importing the Titanic Dataset

In this section we will import the famous Titanic dataset and clean some of the missing values in it!
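
Before working on the real data, here is a minimal sketch of the kind of cleaning we are about to do. It uses a small made-up dataframe (the name `toy` and all of its values are assumptions for illustration, not rows from the Titanic dataset) and only the pandas methods listed at the top of this lecture.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embarked": ["S", "C", np.nan, "Q"],
})

# Count the missing values in each column
missing_before = toy.isna().sum()

# Drop a column that is mostly missing
cleaned = toy.drop(columns=["deck"])

# Fill the missing ages with the mean of the observed ages
mean_age = cleaned["age"].mean()
cleaned["age"] = cleaned["age"].fillna(mean_age)

# Drop the few rows that still contain a missing value
cleaned = cleaned.dropna()

# Every count should now be zero
missing_after = cleaned.isna().sum()
```

Dropping a whole column makes sense when most of its values are missing, while filling or dropping rows is usually preferable when only a few values are missing.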

In [None]:
# The Titanic dataset can be loaded through the seaborn package
from seaborn import load_dataset

# Load the Titanic dataset
titanic = load_dataset("titanic")

In [None]:
# Checking summary data of the dataset


In [None]:
# Information about the columns


In [None]:
# Remove the "deck" feature


In [None]:
# Fill in the mean age for those with missing age


In [None]:
# Check that the value has been filled in


In [None]:
# Drop the remaining two rows with missing values


In [None]:
# Check that we have no more missing values


### Choosing Relevant Features

Not all the features you are presented with are necessarily useful for predicting the `survived` column. We will now exclude some of the features and keep only those we believe have a meaningful effect on `survived`.
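
As a rough sketch of this selection process, the example below works on a small made-up dataframe (again called `toy`, with hypothetical values). It removes a column that duplicates the target, encodes `sex` numerically, and ranks the remaining features by their absolute correlation with `survived`. The `threshold` value is an arbitrary assumption for illustration. Note that the sketch uses `.map()` for the encoding, whereas the lecture lists `.replace()`; this is a substitution made here to keep the resulting column type unambiguous.

```python
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "sex": ["male", "female", "female", "male", "female", "male"],
    "fare": [7.25, 71.28, 7.92, 8.05, 53.10, 8.46],
    "alive": ["no", "yes", "yes", "no", "yes", "no"],
})

# "alive" repeats the information in "survived", so we drop it
toy = toy.drop(columns=["alive"])

# Encode sex as 0 for female and 1 for male
toy["sex"] = toy["sex"].map({"female": 0, "male": 1})

# Absolute correlation of every remaining feature with the target
corr = toy.corr(numeric_only=True)
corr_with_target = corr["survived"].drop("survived").abs()

# Hypothetical cutoff: drop features whose absolute correlation is below it
threshold = 0.1
low_corr = corr_with_target[corr_with_target < threshold].index
selected = toy.drop(columns=low_corr)
```

Correlation only captures linear relationships, so a low value is a hint that a feature may be uninformative, not proof of it.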

In [None]:
# Checking our dataset


In [None]:
# Removing duplicate information


In [None]:
# Encode the sex as 0 for female and 1 for male


In [None]:
# Can look at the correlation matrix (embark town is not present!)


In [None]:
# Drop the low-correlation columns and the embark town column


In [None]:
# Our dataset 


## Standardizing the Values

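It is often useful to standardize the features (rescale each one to zero mean and unit variance) before passing them into a machine learning model. Not every model needs this: tree-based models are largely insensitive to feature scale, while models that rely on distances, gradients, or regularization, such as logistic regression, usually benefit from it.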
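
The sketch below shows the general pattern on synthetic data (the arrays `X` and `y` are generated here purely for illustration). The key point is that the scaler is fitted on the training set only, and the same learned mean and variance are then applied to both the training set and the testing set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumed data, for illustration)
rng = np.random.default_rng(0)
X = rng.normal(loc=[30.0, 50.0], scale=[10.0, 40.0], size=(200, 2))
y = (X[:, 0] > 30.0).astype(int)

# Split first, so the testing set plays no part in fitting the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the per-feature mean and variance from the training rows only
scaler = StandardScaler()
scaler.fit(X_train)

# Apply the same transformation to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The learned statistics are stored on the fitted scaler
train_mean = scaler.mean_
train_var = scaler.var_
```

Fitting the scaler on the full dataset before splitting would let information from the testing set influence the preprocessing, which is one form of data leakage.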

In [None]:
# Dividing into training and testing sets


In [None]:
# Importing and initializing a standard scaler estimator


In [None]:
# Fitting the estimator on the training set


In [None]:
# Getting the mean and variance of the training set


In [None]:
# Scaling the training set and testing set in the same way


In [None]:
# Checking the output


In [None]:
# Training a logistic regression model


In [None]:
# Get the predictions and accuracy score


In [None]:
# This accuracy is not so impressive, since the classes are unbalanced!
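
To see why accuracy can be misleading on unbalanced data, consider this small sketch with made-up labels (the 90/10 split is an assumption for illustration, not the Titanic class balance). A "model" that always predicts the majority class never identifies a single positive case, yet it still reaches a high accuracy. The `balanced_accuracy_score` function from `sklearn.metrics` is included as one possible alternative metric.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 90 negatives and 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Fraction of correct predictions, dominated by the majority class
baseline_accuracy = accuracy_score(y_true, y_pred)

# Average of the per-class recalls, which weights both classes equally
baseline_balanced = balanced_accuracy_score(y_true, y_pred)
```

Comparing a model's accuracy against such a majority-class baseline is a quick sanity check before trusting the number.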


## Creating a Pipeline for Our Data

We will now put our scaling and logistic regression into a pipeline so that it is more manageable.
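
A minimal sketch of such a pipeline is shown below, using synthetic data generated for illustration. The step names `"scaler"` and `"logreg"` are arbitrary labels chosen here. Calling `.fit()` fits the scaler on the training data and then trains the model on the scaled result, while `.predict()` and `.score()` apply the already fitted scaler to new data before passing it to the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(loc=[30.0, 50.0], scale=[10.0, 40.0], size=(200, 2))
y = (X[:, 0] > 30.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each step is a (name, estimator) pair; the last step is the model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression()),
])

# Fits the scaler on X_train, then trains the model on the scaled X_train
pipe.fit(X_train, y_train)

# The fitted steps can be looked up by name
fitted_scaler = pipe.named_steps["scaler"]

# X_test is scaled with the training statistics before prediction
predictions = pipe.predict(X_test)
test_accuracy = pipe.score(X_test, y_test)
```

Because the scaler inside the pipeline only ever sees the data passed to `.fit()`, the testing set cannot leak into the preprocessing by accident.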

In [None]:
# Importing the Pipeline object


In [None]:
# Creating a pipeline


In [None]:
# Fitting the pipeline


In [None]:
# The fitted steps are all available through the named_steps attribute


In [None]:
# We can now call predict, and X_test gets scaled automatically


In [None]:
# We can call score to get the accuracy of the logistic regression
