# Kaggle - Learn: Intermediate Machine Learning
- https://www.kaggle.com/learn/intermediate-machine-learning
> JM-Future: make a program to obtain a mini dataset from a big dataset,
> maintaining column names and a proportional number of NaN in every column, e.g. to convert melb_data to min_melb_data; offer three versions or take the shrink percentage as a parameter.
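
A minimal sketch of that idea, assuming a plain random row sample is good enough: `DataFrame.sample` keeps all column names and preserves the NaN share of each column only approximately (in expectation), not exactly. The names `shrink_dataframe` and `nan_share` and the toy data are hypothetical, not part of the course.

```python
import numpy as np
import pandas as pd


def shrink_dataframe(df, frac, random_state=0):
    """Return a random row sample of df; column names and dtypes are kept."""
    return df.sample(frac=frac, random_state=random_state)


def nan_share(df):
    """Fraction of missing values per column."""
    return df.isnull().mean()


# Toy stand-in for melb_data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=10_000),
    "Car": rng.choice([1.0, 2.0, np.nan], size=10_000, p=[0.5, 0.4, 0.1]),
    "BuildingArea": rng.choice([100.0, np.nan], size=10_000, p=[0.5, 0.5]),
})

# frac is the share of rows to keep (the "% of shrink" from the note)
mini = shrink_dataframe(big, frac=0.1)

# Compare the NaN share per column before and after shrinking
report = pd.DataFrame({"big": nan_share(big), "mini": nan_share(mini)})
print(report)
```

If the NaN share must match exactly in one column, sampling separately from the rows with and without a missing value in that column would be one option; matching every column exactly at once is generally not possible with row sampling.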

## 4.- Pipelines
- A critical skill for deploying (and even testing) complex models with pre-processing.
- How to use pipelines to clean up your modeling code.

### Intro
- A pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.
- __Cleaner Code:__ you won't need to manually keep track of your training and validation data at each step.
- __Fewer Bugs:__ There are fewer opportunities to misapply a step or forget a preprocessing step.
- __Easier to Productionize:__ Pipelines help to transition a model from a prototype to something deployable at scale.
- __More Options for Model Validation:__ for example cross-validation, covered in a future example.
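
A minimal sketch of the last point, assuming synthetic data instead of the Melbourne dataset: because the whole pipeline behaves like one estimator, it can be passed directly to `cross_val_score`, and the imputer is refit inside every fold. The data, `toy_pipeline`, and the small forest size are assumptions chosen for a quick run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (hypothetical): y depends linearly on two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out some values so the imputation step has work to do
X[rng.random(X.shape) < 0.05] = np.nan

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# scikit-learn reports negative MAE so that higher is always better;
# flip the sign to get the usual MAE back
scores = -cross_val_score(toy_pipeline, X, y, cv=5,
                          scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
print("Mean MAE:", scores.mean())
```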

## A Case
- First, the steps to obtain training and validation data.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('files/melb_data.csv')

# Separate target from predictors
y = df.Price
X = df.drop(['Price'], axis=1)

# #JM: Other ways to select predictors
# leftcols = [col for col in df.columns if col != 'Price']
# df1 = df.loc[:, leftcols]
# df2 = df[leftcols]
# print(df.shape, df1.shape, df2.shape)
# display(df.head(2), df1.head(2), df2.head(2))

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [col for col in X_train_full.columns if
                    X_train_full[col].nunique() < 10 and X_train_full[col].dtype == 'object']

# Select numerical columns
numerical_cols = [col for col in X_train_full.columns if
                  X_train_full[col].dtype in ['float64', 'int64']]

# #JM: Comparing collections or dfs equality
# num_cols = [col for col in X_train_full.columns if 
#             X_train_full[col].dtype != 'object']
# #print(numerical_cols == num_cols, numerical_cols is num_cols)   # True False
# df2 = df.copy()
# print(df2 == df, df2 is df)   # True..True..True.. False

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

display(X_train.head(), X_train.isnull().any(), X_train.shape)
print(X_train.Car.isnull().value_counts())

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 664 | h | S | Southern Metropolitan | 3 | 9.2 | 3104.0 | 3.0 | 2.0 | 2.0 | 368.0 | 177.0 | 2009.0 | -37.7846 | 145.0935 | 7809.0 |
| 3270 | h | S | Eastern Metropolitan | 2 | 10.5 | 3081.0 | 2.0 | 1.0 | 2.0 | 586.0 | 80.0 | 1955.0 | -37.7435 | 145.0486 | 2947.0 |
| 3873 | h | S | Southern Metropolitan | 2 | 11.2 | 3145.0 | 2.0 | 1.0 | 1.0 | 348.0 | NaN | NaN | -37.8672 | 145.0432 | 8801.0 |
| 13170 | h | S | Northern Metropolitan | 3 | 19.6 | 3076.0 | 3.0 | 1.0 | 1.0 | 521.0 | NaN | NaN | -37.63854 | 145.05179 | 10926.0 |
| 1730 | h | S | Southern Metropolitan | 4 | 11.4 | 3163.0 | 3.0 | 2.0 | 2.0 | 687.0 | 237.0 | 1983.0 | -37.8931 | 145.0479 | 7822.0 |


Type             False
Method           False
Regionname       False
Rooms            False
Distance         False
Postcode         False
Bedroom2         False
Bathroom         False
Car               True
Landsize         False
BuildingArea      True
YearBuilt         True
Lattitude        False
Longtitude       False
Propertycount    False
dtype: bool

(10185, 15)

False    10138
True        47
Name: Car, dtype: int64
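
The column selection in the cell above can also be sketched with `select_dtypes`, which picks up numeric columns of any width rather than only `float64` and `int64`. This is an alternative sketch on a toy frame with hypothetical values, not the course's code; the threshold of 4 is arbitrary and only suits this tiny example.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with mixed column types
toy = pd.DataFrame({
    "Type": ["h", "u", "h", "t"],                         # 3 unique values
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St"],  # 4 unique values
    "Rooms": [3, 2, 2, 4],
    "Car": [2.0, np.nan, 1.0, 2.0],
})

max_cardinality = 4  # arbitrary threshold for this toy example

# Non-numeric columns with fewer unique values than the threshold
text_cols = toy.select_dtypes(exclude="number").columns
low_card_cols = [col for col in text_cols
                 if toy[col].nunique() < max_cardinality]

# Numeric columns, whatever their exact integer or float dtype
num_cols = toy.select_dtypes(include="number").columns.tolist()

print(low_card_cols)
print(num_cols)
# Missing values per column, as a count rather than a True/False flag
print(toy.isnull().sum())
```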


> Notice that the data contains both categorical data and columns with missing values. With a pipeline, it's easy to deal with both!

We construct the full pipeline in three steps:
1. Define Preprocessing Steps, using the *ColumnTransformer* class.
2. Define the Model, e.g. a *RandomForestRegressor*.
3. Create and Evaluate the Pipeline, using the *Pipeline* class.

### Step 1: Define Preprocessing Steps.
The preprocessor defined in this step:
- imputes missing values in numerical data, and
- imputes missing values and applies a one-hot encoding to categorical data.

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data:
# strategy='constant' with no fill_value replaces missing numbers with 0
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
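
To see what such a preprocessor produces, here is a minimal sketch on a toy frame with hypothetical values. It mirrors the transformers above, with one assumption added for readability: `sparse_threshold=0` forces a dense NumPy array as output. It also shows what `handle_unknown='ignore'` does with a category that was not seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (hypothetical) with one missing value per column
train = pd.DataFrame({
    "Rooms": [3.0, 2.0, np.nan, 4.0],
    "Type": ["h", "u", np.nan, "h"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="constant"), ["Rooms"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Type"]),
    ],
    sparse_threshold=0,  # assumption for this sketch: always return dense output
)

# Columns of the result: Rooms, then one column per category seen in Type
train_out = toy_preprocessor.fit_transform(train)
print(train_out)

# A row with a missing number and a category unseen during fit:
# the number is filled with 0 and the one-hot columns are all 0
new = pd.DataFrame({"Rooms": [np.nan], "Type": ["t"]})
new_out = toy_preprocessor.transform(new)
print(new_out)
```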

### Step 2: Define the Model

In [8]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

In [9]:
# ## JM tries
# m1 = RandomForestRegressor()
# m1.get_params()                 # n_estimators default = 100
#                                 # random_state = None

### Step 3: Create and Evaluate the Pipeline
We use the *Pipeline* class to define a pipeline that bundles the preprocessing and modeling steps. Things to notice:
- With the pipeline, we preprocess the training data and fit the model in a single line of code.
- With the pipeline, we supply the unprocessed features in X_valid to the predict() method, and the pipeline automatically preprocesses the features before generating predictions.

In [10]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 163987.3804899362
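
To make the "single line of code" point concrete, the sketch below compares a pipeline with the same steps done by hand. It assumes synthetic data and a small forest for speed (hypothetical values, not the Melbourne dataset); the point is only that both routes apply the same preprocessing before predicting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data (hypothetical) with some missing values
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train.sum(axis=1)
X_valid = rng.normal(size=(20, 3))
X_train[rng.random(X_train.shape) < 0.1] = np.nan
X_valid[rng.random(X_valid.shape) < 0.1] = np.nan

# Route 1: the pipeline handles imputation inside fit() and predict()
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant")),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_valid)

# Route 2: the same steps by hand; note the two separate transform calls
# that must be kept in sync between training and validation data
imputer = SimpleImputer(strategy="constant")
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(imputer.fit_transform(X_train), y_train)
manual_preds = model.predict(imputer.transform(X_valid))

# Both routes should give the same predictions
print(np.allclose(pipe_preds, manual_preds))
```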


## Conclusion
Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.