# kaggle - Learn: Intermediate Machine Learning
- https://www.kaggle.com/learn/intermediate-machine-learning
> JM-Future: make a program to obtain a mini dataset from a big dataset,
maintaining column names and a proportional number of NaN in every column? E.g. to convert melb_data to min_melb_data, in three versions or with the % of shrink as a parameter.
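
A minimal sketch of that idea, assuming a uniform random row sample is acceptable. `shrink_dataset` and the toy `big` frame are hypothetical names, not part of the course. A random sample keeps every column name, but it preserves the share of NaN per column only approximately.

```python
import numpy as np
import pandas as pd


def shrink_dataset(df: pd.DataFrame, frac: float, random_state: int = 0) -> pd.DataFrame:
    """Return a random row sample of df; column names and order are unchanged."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in (0, 1]")
    return df.sample(frac=frac, random_state=random_state)


# Toy stand-in for a big dataset (hypothetical columns and values)
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=1000).astype(float),
    "Price": rng.normal(1_000_000, 200_000, size=1000),
})
big.loc[rng.random(1000) < 0.20, "Rooms"] = np.nan
big.loc[rng.random(1000) < 0.05, "Price"] = np.nan

mini = shrink_dataset(big, frac=0.1)

# Compare the fraction of NaN per column before and after shrinking
comparison = pd.DataFrame({
    "big": big.isna().mean(),
    "mini": mini.isna().mean(),
})
print(comparison)
```

If the NaN proportions must match more closely, a stratified sample on the missingness pattern would be one option; the sketch above does not attempt that.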

## 4.- Exercise: Pipelines

### Getting the training and validation sets

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('files/train.csv', index_col='Id')
X_test_full = pd.read_csv('files/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

> Notice that the data contains both categorical data and columns with missing values. With a pipeline, it's easy to deal with both!
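
To check this kind of thing on a DataFrame, a small summary of dtype, missing count and cardinality per column is usually enough. The sketch below uses a hypothetical toy frame (`toy`) with made-up values rather than the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: two numerical and two categorical columns
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Street": pd.Series(["Pave", "Pave", "Grvl", "Pave"], dtype=object),
    "Alley": pd.Series([np.nan, "Grvl", np.nan, np.nan], dtype=object),
})

# One row per column: dtype, number of missing values, number of unique values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),  # NaN is not counted as a unique value
})
print(summary)

cols_with_missing = [col for col in toy.columns if toy[col].isna().any()]
print(cols_with_missing)
```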

### Preprocess the data and train the model
- preprocess: handle missing values and categorical variables.

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

print('MAE:', mean_absolute_error(y_valid, preds))

MAE: 17861.780102739725
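
As a reminder of what this number means: MAE is the average of the absolute differences between the true and predicted values, in the units of the target. A small check with made-up prices (not from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted sale prices
y_true = np.array([200_000, 150_000, 310_000])
y_pred = np.array([210_000, 140_000, 300_000])

# Manual computation: mean of |y_true - y_pred|
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Each prediction here is off by 10,000, so both computations give an MAE of 10,000.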


> The code yields a value around 17_862 for the mean absolute error (MAE). In the next step, you will amend the code to do better.
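
To see what the preprocessing step hands to the model, here is a toy sketch with made-up rows. It uses the same transformers as the cell above, with one assumption added for readability: `sparse_threshold=0`, which forces a dense array so the result prints cleanly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows with one missing value per column
train = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9000.0],
    "Street": pd.Series(["Pave", "Pave", "Grvl", np.nan], dtype=object),
})
# Made-up validation row with a category never seen during fit
valid = pd.DataFrame({
    "LotArea": [9600.0],
    "Street": pd.Series(["Dirt"], dtype=object),
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="constant"), ["LotArea"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Street"]),
    ],
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)

train_out = preprocessor.fit_transform(train)
valid_out = preprocessor.transform(valid)

print(train_out)  # columns: LotArea, Street_Grvl, Street_Pave
print(valid_out)
```

The missing `LotArea` is filled with 0 (the default fill value for numerical data with `strategy='constant'`), the missing `Street` is filled with the most frequent value `Pave`, and the unseen category `Dirt` is encoded as all zeros because of `handle_unknown='ignore'`.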

## Step 1: Improve the performance
### Part A
Now, it's your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:
- numerical_transformer
- categorical_transformer
- model

To pass this part of the exercise, you need only define valid preprocessing steps and a random forest model.


In [13]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(random_state=18)

### hint() + solution()
Hint: While there are many different potential solutions to this problem, we achieved satisfactory results by changing only categorical_transformer from the default value - specifically, we changed the strategy parameter of its imputer, which decides how missing values are imputed.

Solution:

- Preprocessing for numerical data    
numerical_transformer = SimpleImputer(strategy='constant')

- Preprocessing for categorical data    
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

- Bundle preprocessing for numerical and categorical data    
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

- Define model    
model = RandomForestRegressor(n_estimators=100, random_state=0)
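
For reference, a small sketch of what each `strategy` value of `SimpleImputer` does to a single missing entry, on a made-up numerical column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column: observed values 1, 2, 2, 10 and one missing entry (row 3)
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

results = {}
for strategy in ["mean", "median", "most_frequent", "constant"]:
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(X)
    results[strategy] = float(filled[3, 0])  # value that replaced the NaN

print(results)
```

The observed values average 3.75, their median and most frequent value are both 2.0, and `'constant'` without an explicit `fill_value` uses 0 for numerical data. For string or object columns, `'constant'` fills with the string `"missing_value"` by default, which the one-hot encoder then treats as its own category.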

### Part B

Run the code cell below without changes.

To pass this step, you need to have defined a pipeline in __Part A__ that achieves lower MAE than the code above. You're encouraged to take your time here and try out many different approaches, to see how low you can get the MAE! (If your code does not pass, please amend the preprocessing steps and model in Part A.)

In [14]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 17755.97363013699
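
One way to try out several approaches is to wrap the pipeline in a function and loop over candidate settings. The sketch below runs on synthetic data so that it is self-contained; the `evaluate` helper, the column names and the candidate list are all hypothetical, and the scores say nothing about the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the housing data
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "quality": pd.Series(rng.choice(["low", "mid", "high"], n), dtype=object),
})
bonus = X["quality"].map({"low": 0, "mid": 20_000, "high": 50_000})
y = 100 * X["area"] + bonus + rng.normal(0, 5_000, n)

# Inject missing values after the target has been computed
X.loc[rng.random(n) < 0.1, "area"] = np.nan
X.loc[rng.random(n) < 0.1, "quality"] = np.nan

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, random_state=0)


def evaluate(num_strategy, cat_strategy, n_estimators):
    """Fit a fresh pipeline with the given settings and return validation MAE."""
    preprocessor = ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy=num_strategy), ["area"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy=cat_strategy)),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["quality"]),
    ])
    pipeline = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("model", RandomForestRegressor(n_estimators=n_estimators, random_state=0)),
    ])
    pipeline.fit(X_train, y_train)
    return mean_absolute_error(y_valid, pipeline.predict(X_valid))


candidates = [
    ("constant", "most_frequent", 100),
    ("median", "most_frequent", 100),
    ("median", "constant", 200),
]
scores = {config: evaluate(*config) for config in candidates}

# Print the candidates from lowest to highest MAE
for config, mae in sorted(scores.items(), key=lambda item: item[1]):
    print(config, round(mae))
```

Comparing many settings against a single validation split can overfit to that split; cross-validation is the usual remedy.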


## Step 2: Generate test predictions
Now, you'll use your trained model to generate predictions with the test data.

In [15]:
# Preprocessing of test data, get predictions (the pipeline is already fitted)
preds_test = my_pipeline.predict(X_test)

### hint() + solution()
Hint: Use the pipeline in my_pipeline and the predict() method.

Solution:

- Preprocessing of test data, get predictions    
preds_test = my_pipeline.predict(X_test)



Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.


In [None]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
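
Before submitting, it can be worth reading the file back and checking its shape. The sketch below is an optional sanity check that is not part of the exercise; it uses made-up ids and predictions and writes to a temporary directory instead of the working directory.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-ins for X_test.index and preds_test
test_index = pd.Index([1461, 1462, 1463], name="Id")
preds_test = [120_000.0, 155_500.0, 180_250.0]

output = pd.DataFrame({"Id": test_index, "SalePrice": preds_test})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    output.to_csv(path, index=False)
    check = pd.read_csv(path)  # read back what was actually written

has_expected_columns = list(check.columns) == ["Id", "SalePrice"]
has_no_missing = not check.isna().any().any()
has_one_row_per_id = check["Id"].is_unique and len(check) == len(test_index)

print(has_expected_columns, has_no_missing, has_one_row_per_id)
```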