# Pipeline scaling

In this notebook, we are going to scale the different steps of our Machine Learning process by making a pipeline.

In [72]:
# === System imports ===
import sys
sys.path.append("../../")

# === Third-party imports ===
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
from sklearn.impute import SimpleImputer

# === Local imports ===
import preprocessors as pp
import config
import utils

ROOT = utils.get_project_root()

## Making the Pipeline

The Pipeline steps are:

1. Numerical: Remove outliers (IQR)
2. Numerical: Log transform variables
3. Categorical: Add 'Rare' label
4. Categorical: One-hot-encoder
5. Categorical: Add additional columns
6. Feature Scaling
7. Lasso model
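
The preprocessing steps listed above rely on custom transformers from the local `preprocessors` module, whose code is not shown in this notebook. As a rough illustration of how such a step can plug into a scikit-learn `Pipeline`, here is a minimal sketch of a log transformer. The class name `LogTransformerSketch` and the toy data are assumptions made for this example, not the actual `pp.LogTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LogTransformerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.LogTransformer (assumed behaviour)."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X, y=None):
        # Nothing to learn; fit only exists so the class works in a Pipeline.
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        X = X.copy()
        for var in self.variables:
            X[var] = np.log(X[var])
        return X


# Toy data, not taken from the housing dataset.
demo = pd.DataFrame({"GrLivArea": [1000.0, 2000.0], "OverallQual": [5, 7]})
demo_pipe = Pipeline(
    [("log_transformer", LogTransformerSketch(variables=["GrLivArea"]))]
)
demo_out = demo_pipe.fit_transform(demo)
```

Only the columns listed in `variables` are transformed; the other columns pass through unchanged.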


In [2]:
pipe_train_1 = Pipeline(
    [
        ('outliers_remover',
            pp.OutliersRemover(variables=config.NUMERICAL_VARIABLES)),
        ('log_transformer',
            pp.LogTransformer(variables=config.VARIABLES_TO_LOG_TRANSFORM)),
        ('rare_label_encoder',
            pp.RareLabelCategoricalEncode(variables=config.CATEGORICAL_VARIABLES)),
        ('one_hot_encoder',
            pp.OneHotEncoder(variables=config.CATEGORICAL_VARIABLES)),
    ]
)

pipe_test_1 = Pipeline(
    [
        ('log_transformer',
            pp.LogTransformer(variables=config.VARIABLES_TO_LOG_TRANSFORM)),
        ('rare_label_encoder',
            pp.RareLabelCategoricalEncode(variables=config.CATEGORICAL_VARIABLES)),
        ('one_hot_encoder',
            pp.OneHotEncoder(variables=config.CATEGORICAL_VARIABLES)),
    ]
)

# Log transform target

pipe_2 = Pipeline(
    [
        ('scaler', MinMaxScaler()),
        ('model', Lasso(alpha=0.005, random_state=0))
    ]
)

## Reproduce the result from the previous notebooks

To check that our pipeline works fine, we are going to fit it and predict on exactly the same dataset we have been working with so far.

### Load data

In [3]:
data = pd.read_csv(filepath_or_buffer=config.TRAINING_DATAFILE)
data.head()

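| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |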


In [4]:
train, test = train_test_split(data[config.MOST_RELEVANT_VARIABLES],
                               test_size=0.1,
                               random_state=0)
for name, dataset in [('train', train), ('test', test)]:
    print(f'{name}: {dataset.shape}')



train: (1314, 7)
test: (146, 7)


***Note:***

`train` and `test` each still hold both the features and the target, i.e. `X_train` and `y_train` for `train`, and `X_test` and `y_test` for `test`.

### Train and predict with the pipelines

In [5]:
def train_and_predict(train, test):
    train = train.copy()
    test = test.copy()

    # First steps of transformation
    train = pipe_train_1.transform(train)
    test = pipe_test_1.transform(test)

    # Column matching
    train, test = pp.match_one_hot_encoded_vars(X_train=train, X_test=test)

    # Split X and y
    X_train = train.drop(config.TARGET, axis=1)
    y_train = train[config.TARGET]

    # The submission dataset has no target column; errors='ignore' covers that case
    X_test = test.drop(columns=config.TARGET, errors='ignore')

    # Training
    model = pipe_2.fit(X_train, y_train)

    # Make prediction
    y_pred = model.predict(X_test)

    # Apply the exponential to the predictions
    y_pred = np.exp(y_pred)

    # Transform to Pandas Series
    y_pred = pd.Series(y_pred, index=test.index)

    return y_pred
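
The column-matching step in `train_and_predict` matters because one-hot encoding the train and test sets separately can produce different dummy columns. The helper `pp.match_one_hot_encoded_vars` is not shown in this notebook, so the following is only a sketch of one way such an alignment could be done with pandas. The function name `match_columns_sketch` and the toy frames are assumptions made for this example.

```python
import pandas as pd


def match_columns_sketch(X_train, X_test):
    """Hypothetical stand-in for pp.match_one_hot_encoded_vars."""
    # Reindex the test frame on the training columns: columns missing from
    # the test set are created and filled with 0, extra ones are dropped.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
    return X_train, X_test


# Toy frames: 'zone_RH' only exists in train, 'zone_FV' only in test.
train_demo = pd.DataFrame({"area": [1.0, 2.0], "zone_RL": [1, 0], "zone_RH": [0, 1]})
test_demo = pd.DataFrame({"area": [3.0], "zone_RL": [1], "zone_FV": [0]})

train_out, test_out = match_columns_sketch(train_demo, test_demo)
print(list(test_out.columns))  # ['area', 'zone_RL', 'zone_RH']
```

After alignment both frames expose the same columns in the same order, which is what the scaler and the model expect.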


In [6]:
y_pred = train_and_predict(train, test)
y_test = test[config.TARGET]

### Check the score

In [7]:
print(f'Test RMSLE: {np.sqrt(mean_squared_log_error(y_test, y_pred))}')

Test RMSLE: 0.22342908158240873


If you check the previous notebook, you will see that we got exactly the same result.
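
Because the score above is the square root of `mean_squared_log_error`, it is a root mean squared logarithmic error (RMSLE). As a sketch of what that metric computes, here is a plain Python version using `log(1 + x)`, which is the form scikit-learn documents for this metric. The function name `rmsle` and the toy prices are assumptions made for this example.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    squared_log_errors = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Toy prices, not taken from the dataset.
demo_true = [100000.0, 200000.0]
demo_pred = [110000.0, 180000.0]
demo_score = rmsle(demo_true, demo_pred)
print(demo_score)
```

Since the error is measured on logarithms, it reflects the relative gap between predicted and actual prices rather than the absolute gap.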

## Make prediction for submission

Now that our pipeline is confirmed to give the expected result, let's make our first prediction for submission.

We will train our model on the whole `train.csv` file and predict on the `test.csv` file for submission.

### Load the data

In [57]:
train = pd.read_csv(filepath_or_buffer=config.TRAINING_DATAFILE, index_col=0)
test = pd.read_csv(filepath_or_buffer=config.TESTING_DATAFILE, index_col=0)

for name, dataset in [('train', train), ('test', test)]:
    print(f'{name}: {dataset.shape}')

train: (1460, 80)
test: (1459, 79)


In [58]:
test.head()

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | Inside | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |


In [59]:
train = train[config.MOST_RELEVANT_VARIABLES]
test = test[config.FEATURES]

for name, dataset in [('train', train), ('test', test)]:
    print(f'{name}: {dataset.shape}')

train: (1460, 7)
test: (1459, 6)


### Make the prediction

When we naively ran the function `train_and_predict` on the training and testing sets to produce the submission, an error was raised because of missing values.

They most likely come from the test set, which we haven't worked with so far.

Let's check that out!

In [60]:
print("Number of missing values in the train set")
print(train[config.FEATURES].isna().sum(axis=1).sum())

print("Number of missing values in the test set")
print(test[config.FEATURES].isna().sum(axis=1).sum())

Number of missing values in the train set
0
Number of missing values in the test set
2


Indeed! ... but there are only two missing values.

We didn't have to deal with this issue when building the model with the train dataset.
Let's see which variable(s) contain these **NaNs**.

In [61]:
test[config.FEATURES].isna().sum(axis=0)

GrLivArea       0
GarageArea      1
TotalBsmtSF     1
OverallQual     0
FullBath        0
TotRmsAbvGrd    0
dtype: int64

These are the two numerical variables `GarageArea` and `TotalBsmtSF`, with one missing value each.

Let's quickly fix the issue by imputing the missing values with the mean value of each variable.

In [62]:
def impute_mean(df):
    mean_imputer = SimpleImputer(strategy='mean')
    # Impute each column of df with its own mean
    imputed_df = pd.DataFrame(mean_imputer.fit_transform(df))
    imputed_df.columns = df.columns
    imputed_df.index = df.index

    assert not imputed_df.isna().any().any(), "Still have missing values"

    return imputed_df

test = impute_mean(test)
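
The cell above computes the means on the test set itself. A common alternative, sketched here on toy frames, is to learn the means on the training set and reuse them on the test set, so that no statistic is derived from the data being predicted. The frames `train_demo` and `test_demo` and their values are assumptions made for this example; this is not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames, not taken from the housing dataset.
train_demo = pd.DataFrame(
    {"GarageArea": [400.0, 600.0], "TotalBsmtSF": [800.0, 1000.0]}
)
test_demo = pd.DataFrame(
    {"GarageArea": [np.nan, 500.0], "TotalBsmtSF": [900.0, np.nan]},
    index=[1461, 1462],
)

mean_imputer = SimpleImputer(strategy="mean")
mean_imputer.fit(train_demo)  # learn the column means on the training set only

# transform returns a NumPy array, so restore the column names and the index.
test_imputed = pd.DataFrame(
    mean_imputer.transform(test_demo),
    columns=test_demo.columns,
    index=test_demo.index,
)
print(test_imputed)
```

Here the missing `GarageArea` is filled with 500.0 and the missing `TotalBsmtSF` with 900.0, which are the training means of the two columns.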

In [65]:
y_pred = train_and_predict(train, test)
y_pred

Id
1461    140764.962227
1462    163501.696673
1463    157060.321307
1464    171897.231418
1465    189884.690467
            ...      
2915    101121.269307
2916    113459.248760
2917    156778.301614
2918    108373.437717
2919    220920.741563
Length: 1459, dtype: float64

Here we are with our first results for submission.
Let's save them.

### Save results

In [68]:
first_submission = y_pred.to_frame(name=config.TARGET)
first_submission

| Id | SalePrice |
|---|---|
| 1461 | 140764.962227 |
| 1462 | 163501.696673 |
| 1463 | 157060.321307 |
| 1464 | 171897.231418 |
| 1465 | 189884.690467 |
| ... | ... |
| 2915 | 101121.269307 |
| 2916 | 113459.248760 |
| 2917 | 156778.301614 |
| 2918 | 108373.437717 |


In [73]:
first_submission.to_csv(path_or_buf=f'{ROOT}/datasets/outputs/with_main_variables/first_submission_lasso.csv')
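
Before uploading, it can be worth checking that the saved file has the expected layout: an `Id` index column, a single `SalePrice` column and no missing values. The sketch below does such a round trip through an in-memory buffer instead of the real output path. The frame `submission_demo` and its two rounded values are assumptions made for this example.

```python
import io

import pandas as pd

# Toy submission with the same layout as first_submission.
submission_demo = pd.Series(
    [140764.96, 163501.70],
    index=pd.Index([1461, 1462], name="Id"),
).to_frame(name="SalePrice")

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
submission_demo.to_csv(buffer)
buffer.seek(0)

# Read the file back and check its layout.
reloaded = pd.read_csv(buffer, index_col="Id")
assert list(reloaded.columns) == ["SalePrice"]
assert not reloaded["SalePrice"].isna().any()
```

The same checks could be run on the real file by replacing the buffer with the path passed to `to_csv`.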