**Pipelines** are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
4. More Options for Model Validation: You will see an example in the next tutorial, which covers cross-validation.
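
To make the "single step" idea concrete, here is a minimal sketch on hypothetical toy data. The names `X_toy`, `y_toy`, and `toy_pipeline` are invented for illustration, and `LinearRegression` is only a stand-in model. One `fit` call runs the imputation and the model training, and one `predict` call reuses the fitted imputer on new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: a single feature with one missing value.
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # fill NaN with the column mean
    ("model", LinearRegression()),                # stand-in model for the sketch
])

# One call fits the imputer, transforms the data, then fits the model.
toy_pipeline.fit(X_toy, y_toy)

# One call applies the already-fitted imputer, then predicts.
toy_preds = toy_pipeline.predict(np.array([[np.nan], [5.0]]))
print(toy_preds)
```

Because the imputer is fitted only inside `fit`, the same pipeline object can be handed to validation or test data without repeating any preprocessing by hand.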

In [10]:
import pandas as pd

X = pd.read_csv("winemag-data_first150k.csv", index_col=0)
# Drop every row that has a missing value in any column
X.dropna(inplace=True)

# "Cardinality" means the number of unique values in a column.
# Select every categorical (object-dtype) column. The low-cardinality filter
# (e.g. X[cname].nunique() < 10) is disabled here, so high-cardinality columns
# such as 'description' are selected too.
categorical_cols = [cname for cname in X.columns
                    if X[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]

print('cols: ', X.columns)
print("cat: ", categorical_cols)
print("num: ", numerical_cols)

cols:  Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'variety', 'winery'],
      dtype='object')
cat:  ['country', 'description', 'designation', 'province', 'region_1', 'region_2', 'variety', 'winery']
num:  ['points', 'price']
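
The cardinality comment in the cell above can be illustrated with a small sketch. The frame `toy` and the threshold `max_cardinality` are hypothetical and chosen only for this example. The sketch keeps non-numeric columns whose number of unique values is at or below the threshold.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
toy = pd.DataFrame({
    "country": ["US", "FR", "US", "IT"],
    "description": ["ripe", "dry", "oaky", "crisp"],
    "points": [90, 88, 92, 85],
})

max_cardinality = 3  # arbitrary threshold, an assumption for this sketch

# Keep non-numeric columns with at most `max_cardinality` unique values.
low_card_cols = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
    and toy[col].nunique() <= max_cardinality
]
print(low_card_cols)
```

A filter like this matters for one-hot encoding, because every unique value becomes its own output column.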


### Step 1: Define the preprocessing
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:

- imputes missing values in numerical data, and
- imputes missing values and applies a one-hot encoding to categorical data.
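
The two bullets above can be tried on a tiny frame before running them on the real data. This is only a sketch: the frame `toy` and its values are hypothetical, and `sparse_threshold=0.0` is set here purely so the result is a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "points": [90.0, np.nan, 85.0, 88.0],
    "country": pd.Series(["US", np.nan, "FR", "US"], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Numerical column: fill missing values with a constant (0 by default).
        ("num", SimpleImputer(strategy="constant"), ["points"]),
        # Categorical column: fill with the most frequent value, then one-hot encode.
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["country"]),
    ],
    sparse_threshold=0.0,  # always return a dense array (for readability only)
)

encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)
print(encoded)
```

The output has one column for `points` followed by one column per country category, so each row of the original frame becomes a purely numeric row.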

In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

### Step 2: Define the model

In [12]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

### Use the pipeline

In [14]:
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

In [17]:
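# Y is not defined in the cells above. It must hold the target values, one per
# row of X, and the target column must not also be a feature column of X.
my_pipeline.fit(X, Y)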

In [None]:
# X_test is not defined in the cells above. It must have the same columns as X.
preds = my_pipeline.predict(X_test)

### Parameters

There are ways to test different parameters or methods.
For this, **GridSearchCV** is used.
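
A minimal sketch of a grid search over a pipeline, run on synthetic data from `make_regression` so that it is self-contained. The names `X_demo`, `y_demo`, `demo_pipeline`, `demo_grid`, and `demo_search` are hypothetical, and the small grid is chosen only to keep the run short. Each key in the grid follows the pattern `<step name>__<parameter name>`.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Synthetic numeric data, used only so the sketch runs on its own.
X_demo, y_demo = make_regression(
    n_samples=60, n_features=6, n_informative=3, noise=0.1, random_state=0
)

demo_pipeline = Pipeline(steps=[
    ("SKB", SelectKBest(score_func=f_regression)),
    ("SVR", SVR(kernel="rbf")),
])

# "SKB__k" sets k on the "SKB" step; "SVR__C" sets C on the "SVR" step.
demo_grid = {"SKB__k": [2, 4], "SVR__C": [1, 100]}

# Every combination in the grid is evaluated with 3-fold cross-validation.
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
```

Since the feature selector sits inside the pipeline, it is refitted on each training fold, so the held-out fold never influences which features are kept.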

In [None]:

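from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Grid keys follow the pattern <step name>__<parameter name>
steps = [('SKB', SelectKBest(score_func=f_regression)), ('SVR', SVR(kernel='rbf'))]
param_grid = {'SVR__C': [100, 500, 600, 1000, 1500, 2000], 'SKB__k': [2, 3, 4, 5, 6, 7, 8]}

estimator = Pipeline(steps=steps)
cv = GridSearchCV(estimator, param_grid, verbose=5, n_jobs=-1, cv=10)
# Assumption: X_train and y_train are numeric training features and targets
# prepared elsewhere. SelectKBest and SVR cannot take the raw text columns of X.
cv.fit(X_train, y_train)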

In [None]:
# Print the best parameters and the best cross-validation score
print(cv.best_params_, cv.best_score_)

In [None]:
# X_test is assumed to be numeric test features prepared the same way as the training data
pred = cv.predict(X_test)