# Improving your model
>  Here, you'll improve on your benchmark model using pipelines. Because the budget consists of both text and numeric data, you'll learn how to build pipelines that process multiple types of data. You'll also explore how the flexibility of the pipeline workflow makes testing different approaches efficient, even in complicated problems like this one!

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 3 exercises of the course "Case Study: School Budgeting with Machine Learning in Python" on DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://learn.datacamp.com/skill-tracks/machine-learning-fundamentals-with-python)

In [1]:
import pandas as pd
import numpy as np

## Pipelines, feature & text preprocessing

### Instantiate pipeline

<div class=""><p>In order to make your life easier as you start to work with all of the data in your original DataFrame, <code>df</code>, it's time to turn to one of scikit-learn's most useful objects: the <code>Pipeline</code>.</p>
<p>For the next few exercises, you'll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset.</p>
<p>The sample data is stored in the DataFrame, <code>sample_df</code>, which has three kinds of feature data: numeric, text, and numeric with missing values. It also has a label column with two classes, <code>a</code> and <code>b</code>.</p>
<p>In this exercise, your job is to instantiate a pipeline that trains using the <code>numeric</code> column of the sample data.</p></div>

In [6]:
sample_df = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/case-study-school-budgeting-with-machine-learning-in-python/data/sample_df.csv', index_col=0)

Instructions
<ul>
<li>Import <code>Pipeline</code> from <code>sklearn.pipeline</code>.</li>
<li>Create training and test sets using the numeric data only. Do this by specifying <code>sample_df[['numeric']]</code> in <code>train_test_split()</code>.</li>
<li>Instantiate a pipeline as <code>pl</code> by adding the classifier step. Use a name of <code>'clf'</code> and the same classifier from Chapter 2: <code>OneVsRestClassifier(LogisticRegression())</code>.</li>
<li>Fit your pipeline to the training data and compute its accuracy to see it in action! Since this is toy data, you'll use the default scoring method for now. In the next chapter, you'll return to log loss scoring.</li>
</ul>

In [7]:
# Import Pipeline
from sklearn.pipeline import Pipeline

# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Split and select numeric data only, no nans 
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],
                                                    pd.get_dummies(sample_df['label']), 
                                                    random_state=22)

# Instantiate Pipeline object: pl
pl = Pipeline([
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)


Accuracy on sample data - numeric, no nans:  0.62


**Now it's time to incorporate numeric data with missing values by adding a preprocessing step!**

### Preprocessing numeric features

<div class=""><p>What would have happened if you had included the <code>'with_missing'</code> column in the last exercise? Without imputing missing values, the pipeline would <strong>not</strong> be happy (try it and see). So, in this exercise you'll improve your pipeline a bit by using an imputation transformer from scikit-learn to fill in missing values in your sample data. The course uses <code>Imputer()</code> from <code>sklearn.preprocessing</code>; that class was removed in scikit-learn 0.22, so the code in this post uses its replacement, <code>SimpleImputer()</code> from <code>sklearn.impute</code>.</p>
<p>By default, the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html" target="_blank" rel="noopener noreferrer">imputer transformer</a> replaces NaNs with the mean value of the column. That's a good enough imputation strategy for the sample data, so you won't need to pass anything extra to the imputer.</p>
<p>After importing the transformer, you will edit the steps list used in the previous exercise by inserting a <code>(name, transform)</code> tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your <em>preprocessing</em> step is put in the right place.</p>
<p>The <code>sample_df</code> is in the workspace, in case you'd like to take another look. Make sure to select <strong>both</strong> numeric columns- in the previous exercise we couldn't use <code>with_missing</code> because we had no preprocessing step!</p></div>
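
To make the default behaviour concrete, here is a minimal sketch of mean imputation on a hypothetical three-row column (the array `X` is made up for illustration and is not part of the course data). It assumes a scikit-learn version that provides `SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value
X = np.array([[1.0], [np.nan], [3.0]])

# The default strategy is 'mean': each NaN is replaced by its column mean
imputer = SimpleImputer()
X_filled = imputer.fit_transform(X)

# statistics_ holds the per-column fill values learned during fit
print(imputer.statistics_)
print(X_filled.ravel())
```

The fill value is learned in `fit` and reused in `transform`, which is why placing the imputer inside the pipeline keeps test data from influencing the value used for training.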


Instructions
<ul>
<li>Import <code>SimpleImputer</code> from <code>sklearn.impute</code> (the course's <code>Imputer</code> from <code>sklearn.preprocessing</code> is no longer available in current scikit-learn).</li>
<li>Create training and test sets by selecting the correct subset of <code>sample_df</code>: <code>'numeric'</code> and <code>'with_missing'</code>.</li>
<li>Add the tuple <code>('imp', SimpleImputer())</code> to the correct position in the pipeline. <code>Pipeline</code> processes steps sequentially, so the imputation step should come <em>before</em> the classifier step.</li>
<li>Complete the <code>.fit()</code> and <code>.score()</code> methods to fit the pipeline to the data and compute the accuracy.</li>
</ul>

In [9]:
# Import SimpleImputer (the replacement for the removed sklearn.preprocessing.Imputer)
from sklearn.impute import SimpleImputer

# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(sample_df[["numeric", "with_missing"]],
                                                    pd.get_dummies(sample_df['label']), 
                                                    random_state=456)

# Instantiate Pipeline object: pl
pl = Pipeline([
        ('imp', SimpleImputer()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all numeric, incl nans: ", accuracy)


Accuracy on sample data - all numeric, incl nans:  0.636


**Now you know how to use preprocessing in pipelines with numeric data. The accuracy is slightly higher than before (0.636 versus 0.62), although the two exercises use different random splits, so the comparison is only rough.**

## Text features and feature unions

### Preprocessing text features

<div class=""><p>Here, you'll perform a similar preprocessing pipeline step, only this time you'll use the <code>text</code> column from the sample data.</p>
<p>To preprocess the text, you'll turn to <code>CountVectorizer()</code> to generate a bag-of-words representation of the data, as in Chapter 2. Using the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html" target="_blank" rel="noopener noreferrer">default</a> arguments, add a <code>(step, transform)</code> tuple to the steps list in your pipeline.</p>
<p>Make sure you select only the text column for splitting your training and test sets.</p>
<p>As usual, your <code>sample_df</code> is ready and waiting in the workspace.</p></div>
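
As a quick illustration of what a bag-of-words representation looks like, the sketch below runs `CountVectorizer` with default arguments on three hypothetical documents (the list `docs` is invented for this example). Each row is a document and each column counts one token from the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus, similar in spirit to the 'text' column
docs = ["foo bar", "foo", "bar bar baz"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix of token counts

# vocabulary_ maps each token to its column index
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

An empty string produces a row of zeros, which is why replacing missing text with `''` before vectorizing is enough for this step.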

In [36]:
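# Assign the result back; calling fillna(inplace=True) on a selected column may not update sample_df in recent pandas
sample_df['text'] = sample_df['text'].fillna('')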

Instructions
<ul>
<li>Import <code>CountVectorizer</code> from <code>sklearn.feature_extraction.text</code>.</li>
<li>Create training and test sets by selecting the correct subset of <code>sample_df</code>: <code>'text'</code>.</li>
<li>Add the <code>CountVectorizer</code> step (with the name <code>'vec'</code>)  to the correct position in the pipeline.</li>
<li>Fit the pipeline to the training data and compute its accuracy.</li>
</ul>

In [37]:
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Split out only the text data
X_train, X_test, y_train, y_test = train_test_split(sample_df['text'],
                                                    pd.get_dummies(sample_df['label']), 
                                                    random_state=456)

# Instantiate Pipeline object: pl
pl = Pipeline([
        ('vec', CountVectorizer()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - just text data: ", accuracy)


Accuracy on sample data - just text data:  0.808


### Multiple types of processing: FunctionTransformer

<div class=""><p>The next two exercises will introduce new topics you'll need to make your pipeline truly excel. </p>
<p>Every step in the pipeline except the last <em>must</em> be an object that implements the <code>fit</code> and <code>transform</code> methods (the final step only needs <code>fit</code>). The <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html" target="_blank" rel="noopener noreferrer"><code>FunctionTransformer</code></a> creates an object with these methods out of any Python function that you pass to it. We'll use it to help select subsets of data in a way that plays nicely with pipelines.</p>
<p>You are working with numeric data that needs imputation, and text data that needs to be converted into a bag-of-words. You'll create functions that separate the text from the numeric variables and see how the <code>.fit()</code> and <code>.transform()</code> methods work.</p></div>
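
Before using lambdas on the sample data, it may help to see the same idea with a named function. The sketch below is an illustration under assumptions: the helper `select_numeric` and the DataFrame `toy` are hypothetical and do not appear in the course.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def select_numeric(frame):
    """Return only the numeric columns of a DataFrame."""
    return frame.select_dtypes(include="number")


# Hypothetical two-column frame with one numeric and one text column
toy = pd.DataFrame({"amount": [1.5, 2.0], "note": ["foo", "bar"]})

# Wrap the plain function so it exposes fit and transform
selector = FunctionTransformer(select_numeric, validate=False)
selector.fit(toy)              # learns nothing from the data
out = selector.transform(toy)  # simply calls select_numeric(toy)
print(list(out.columns))
```

A named function behaves the same as a lambda here. One practical difference is that a pipeline containing lambdas cannot be saved with the standard `pickle` module, while one built from module-level functions can.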

Instructions
<ul>
<li>Compute the selector <code>get_text_data</code> by using a lambda function and <code>FunctionTransformer()</code> to obtain all <code>'text'</code> columns.</li>
<li>Compute the selector <code>get_numeric_data</code> by using a lambda function and <code>FunctionTransformer()</code> to obtain all the numeric columns (including missing data). These are <code>'numeric'</code> and <code>'with_missing'</code>.</li>
<li>Fit and transform <code>get_text_data</code> using the <code>.fit_transform()</code> method with <code>sample_df</code> as the argument.</li>
<li>Fit and transform <code>get_numeric_data</code> using the same approach as above.</li>
</ul>

In [38]:
# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Obtain the text data: get_text_data
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)

# Obtain the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[['numeric', 'with_missing']], validate=False)

# Fit and transform the text data: just_text_data
just_text_data = get_text_data.fit_transform(sample_df)

# Fit and transform the numeric data: just_numeric_data
just_numeric_data = get_numeric_data.fit_transform(sample_df)

# Print head to check results
print('Text Data')
print(just_text_data.head())
print('\nNumeric Data')
print(just_numeric_data.head())

Text Data
0           
1        foo
2    foo bar
3           
4    foo bar
Name: text, dtype: object

Numeric Data
     numeric  with_missing
0 -10.856306      4.433240
1   9.973454      4.310229
2   2.829785      2.469828
3 -15.062947      2.852981
4  -5.786003      1.826475


**You can see in the shell that fit and transform are now available to the selectors. Let's put the selectors to work!**

### Multiple types of processing: FeatureUnion

<div class=""><p>Now that you can separate text and numeric data in your pipeline, you're ready to perform separate steps on each by nesting pipelines and using <code>FeatureUnion()</code>.</p>
<p>These tools will allow you to streamline all preprocessing steps for your model, even when multiple datatypes are involved. Here, for example, you don't want to impute your text data, and you don't want to create a bag-of-words with your numeric data. Instead, you want to deal with these separately and then join the results together using <code>FeatureUnion()</code>.</p>
<p>In the end, you'll still have only two high-level steps in your pipeline: preprocessing and model instantiation. The difference is that the first preprocessing step actually consists of a pipeline for numeric data and a pipeline for text data. The results of those pipelines are joined using <code>FeatureUnion()</code>.</p></div>
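
To see what "joining" means, the following sketch applies two hypothetical branches to a small array and lets `FeatureUnion` place their outputs side by side. The branch names and the array `X` are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Two hypothetical branches: one keeps the first column, one squares every value
first_col = FunctionTransformer(lambda a: a[:, [0]])
squared = FunctionTransformer(np.square)

union = FeatureUnion([("first", first_col), ("squared", squared)])
joined = union.fit_transform(X)

# One column from the first branch followed by two columns from the second
print(joined.shape)
print(joined)
```

Each branch receives the same input, and the outputs are concatenated column-wise. When any branch returns a sparse matrix, as `CountVectorizer` does, the joined result is sparse as well.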

Instructions
<ul>
<li>In the <code>process_and_join_features</code>:<ul>
<li>Add the steps <code>('selector', get_numeric_data)</code> and <code>('imputer', SimpleImputer())</code> to the <code>'numeric_features'</code> preprocessing step.</li>
<li>Add the equivalent steps for the <code>text_features</code> preprocessing step. That is, use <code>get_text_data</code> and a <code>CountVectorizer</code> step with the name <code>'vectorizer'</code>.</li></ul></li>
<li>Add the transform step <code>process_and_join_features</code> to <code>'union'</code> in the main pipeline, <code>pl</code>.</li>
<li>Run the cell to see the pipeline in action!</li>
</ul>

In [40]:
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Split using ALL data in sample_df
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing', 'text']],
                                                    pd.get_dummies(sample_df['label']), 
                                                    random_state=22)

# Create a FeatureUnion with nested pipeline: process_and_join_features
process_and_join_features = FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )

# Instantiate nested pipeline: pl
pl = Pipeline([
        ('union', process_and_join_features),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])


# Fit pl to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all data: ", accuracy)


Accuracy on sample data - all data:  0.928


**You now know more about pipelines than many practicing data scientists.**

## Choosing a classification model

### Using FunctionTransformer on the main dataset

<div class=""><p>In this exercise you're going to use <code>FunctionTransformer</code> on the primary budget data, before instantiating a multiple-datatype pipeline in the next exercise.</p>
<p>Recall from Chapter 2 that you used a custom function <code>combine_text_columns</code> to select and properly format <strong>text data</strong> for tokenization; it is loaded into the workspace and ready to be put to work in a function transformer!</p>
<p>Concerning the <strong>numeric data</strong>, you can use <code>NUMERIC_COLUMNS</code>, preloaded as usual, to help design a subset-selecting lambda function.</p>
<p>You're all finished with sample data. The original <code>df</code> is back in the workspace, ready to use.</p></div>
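
The core idea behind combining text columns can be sketched on a tiny hypothetical frame. The column names `Job` and `Program` and the values below are invented for illustration; the steps shown (drop non-text columns, blank out missing values, join each row with spaces) follow the description of the course helper rather than reproducing it.

```python
import pandas as pd

# Hypothetical mini budget table: one numeric column and two text columns
toy = pd.DataFrame({
    "Total": [100.0, 250.0],
    "Job": ["teacher", None],
    "Program": ["math", "transport"],
})

# Keep only the text columns and replace missing values with empty strings
text_only = toy.drop(columns=["Total"]).fillna("")

# Join the text of each row into a single string separated by spaces
combined = text_only.apply(lambda row: " ".join(row), axis=1)
print(combined.tolist())
```

A missing value leaves only an extra space behind, which the default tokenizer of `CountVectorizer` ignores.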

In [80]:
df = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/case-study-school-budgeting-with-machine-learning-in-python/data/TrainingData.csv', index_col=0)
LABELS = ['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type', 'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status']
NUMERIC_COLUMNS = ['FTE', 'Total']

In [81]:
#https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/data/multilabel.py

from warnings import warn

import numpy as np
import pandas as pd

def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.
        
        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # replace nans with blanks
    text_data.fillna("", inplace=True)
    
    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)

def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if `size` <= 1.
        
        The sample is guaranteed to have > `min_count` of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')
    
    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')
    
    if size <= 1:
        # cast to int so the value can later be used as a sample size by rng.choice
        size = int(np.floor(y.shape[0] * size))
    
    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count
    
    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))
    
    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])
    
    sample_idxs = np.array([], dtype=choices.dtype)
    
    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])
        
    sample_idxs = np.unique(sample_idxs)
        
    # now that we have at least min_count of each, we can just random sample
    sample_count = size - sample_idxs.shape[0]
    
    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count, replace=False)
        
    return np.concatenate([sample_idxs, remaining_sampled])


def multilabel_sample_dataframe(df, labels, size, min_count=5, seed=None):
    """ Takes a dataframe `df` and returns a sample of size `size` where all
        classes in the binary matrix `labels` are represented at
        least `min_count` times.
    """
    idxs = multilabel_sample(labels, size=size, min_count=min_count, seed=seed)
    return df.loc[idxs]


def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)    
    train_set_idxs = np.setdiff1d(index, test_set_idxs)
    
    test_set_mask = index.isin(test_set_idxs)
    train_set_mask = ~test_set_mask
    
    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])

In [85]:
from sklearn.model_selection import train_test_split

Instructions
<ul>
<li>Complete the call to <code>multilabel_train_test_split()</code> by selecting <code>df[NON_LABELS]</code>.</li>
<li>Compute <code>get_text_data</code> by using <code>FunctionTransformer()</code> and passing in <code>combine_text_columns</code>. Be sure to also specify <code>validate=False</code>.</li>
<li>Use <code>FunctionTransformer()</code> to compute <code>get_numeric_data</code>. In the lambda function, select out the <code>NUMERIC_COLUMNS</code> of <code>x</code>. Like you did when computing <code>get_text_data</code>, also specify <code>validate=False</code>.</li>
</ul>

In [110]:
# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])

# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]

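# Split into training and test sets.
# The course calls multilabel_train_test_split (kept here for reference);
# a plain train_test_split is used in its place:
# X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
#                                                                dummy_labels,
#                                                                size=0.2,
#                                                                seed=123)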

X_train, X_test, y_train, y_test = train_test_split(df[NON_LABELS], 
                                                    dummy_labels, 
                                                    test_size=0.2, 
                                                    random_state=123)

# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)

# Preprocess the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

### Add a model to the pipeline

<div class=""><p>You're about to take everything you've learned so far and implement it in a <code>Pipeline</code> that works with the real, <a href="https://www.drivendata.org/" target="_blank" rel="noopener noreferrer">DrivenData</a> budget line item data you've been exploring.</p>
<p><strong>Surprise!</strong> The structure of the pipeline is exactly the same as earlier in this chapter:</p>
<ul>
<li>the <strong>preprocessing step</strong> uses <code>FeatureUnion</code> to join the results of nested pipelines that each rely on <code>FunctionTransformer</code> to select multiple datatypes</li>
<li>the <strong>model step</strong> stores the model object</li>
</ul>
<p>You can then call familiar methods like <code>.fit()</code> and <code>.score()</code> on the <code>Pipeline</code> object <code>pl</code>.</p></div>

Instructions
<ul>
<li>Complete the <code>'numeric_features'</code> transform with the following steps:<ul>
<li><code>get_numeric_data</code>, with the name <code>'selector'</code>.</li>
<li><code>SimpleImputer()</code>, with the name <code>'imputer'</code>.</li></ul></li>
<li>Complete the <code>'text_features'</code> transform with the following steps:<ul>
<li><code>get_text_data</code>, with the name <code>'selector'</code>.</li>
<li><code>CountVectorizer()</code>, with the name <code>'vectorizer'</code>.</li></ul></li>
<li>Fit the pipeline to the training data.</li>
<li>Run the cell to compute the accuracy!</li>
</ul>

In [118]:
# Complete the pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ('clf', OneVsRestClassifier(LogisticRegression(max_iter = 500)))
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)


Accuracy on budget dataset:  0.0
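
An accuracy of 0.0 looks alarming, but one likely contributor is how the score is defined. For a multilabel indicator target such as `dummy_labels`, scikit-learn's accuracy is subset accuracy: a row counts as correct only if every label column is predicted correctly. The sketch below uses small made-up label matrices to contrast this with the fraction of individual labels that are right; it illustrates the metric and is not a diagnosis of this particular model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted indicator matrices (3 rows, 4 label columns)
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],   # every label correct
                   [0, 1, 1, 0],   # two labels wrong
                   [1, 0, 0, 0]])  # one label wrong

# Subset accuracy: only the first row matches exactly
subset_acc = accuracy_score(y_true, y_pred)

# Fraction of individual label entries that match
per_label_acc = (y_true == y_pred).mean()

print(subset_acc, per_label_acc)
```

With many label columns, a single wrong column in a row is enough for that row to count as a miss, so the score can be very low even when most individual labels are predicted well.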


**Now that you've built the entire pipeline, you can easily start trying out different models by just modifying the 'clf' step.**

### Try a different class of model

<div class=""><p>Now you're cruising. One of the great strengths of pipelines is how easy they make the process of testing different models.</p>
<p>Until now, you've been using the model step <code>('clf', OneVsRestClassifier(LogisticRegression()))</code> in your pipeline.</p>
<p>But what if you want to try a different model? Do you need to build an entirely new pipeline? New nests? New FeatureUnions? Nope! You just have a simple one-line change, as you'll see in this exercise.</p>
<p>In particular, you'll swap out the logistic-regression model and replace it with a <a href="https://en.wikipedia.org/wiki/Random_forest" target="_blank" rel="noopener noreferrer">random forest</a> classifier, which uses the statistics of an ensemble of decision trees to generate predictions.</p></div>
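
Besides editing the step list by hand, a fitted pipeline's final step can also be replaced in place with `set_params`. The following is a minimal sketch on synthetic data (the arrays `X` and `y` are generated for illustration and are unrelated to the budget data).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic binary classification data, for illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pl = Pipeline([
    ("imp", SimpleImputer()),
    ("clf", LogisticRegression()),
])
pl.fit(X, y)

# Replace only the step named 'clf'; the preprocessing step is kept as is
pl.set_params(clf=RandomForestClassifier(n_estimators=15, random_state=0))
pl.fit(X, y)

print(type(pl.named_steps["clf"]).__name__)
```

Either approach works. Rewriting the step list, as the exercise does, keeps every variant of the pipeline visible in the notebook, while `set_params` is convenient inside loops or parameter searches.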

Instructions
<ul>
<li>Import the <code>RandomForestClassifier</code> from <code>sklearn.ensemble</code>.</li>
<li>Add a <code>RandomForestClassifier()</code> step named <code>'clf'</code> to the pipeline.</li>
<li>Run the cell to fit the pipeline to the training data and compute its accuracy.</li>
</ul>

In [112]:
# Import random forest classifier
from sklearn.ensemble import RandomForestClassifier

# Edit model step in pipeline
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ('clf', RandomForestClassifier())
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)


Accuracy on budget dataset:  0.3108974358974359


### Can you adjust the model or parameters to improve accuracy?

<div class=""><p>You just saw a substantial improvement in accuracy by swapping out the model. Pipelines are amazing!</p>
<p>Can you make it better? Try setting the parameter <code>n_estimators</code> of <code>RandomForestClassifier()</code> to <code>15</code>. Its default was <code>10</code> in the scikit-learn version used by the course and has been <code>100</code> since version 0.22.</p></div>
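
If you want to compare several values rather than a single one, a small loop is enough. The sketch below uses synthetic data and an arbitrary set of candidate values, both chosen only for illustration; it makes no claim about which value works best on the budget data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with a non-linear decision rule, for illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit one forest per candidate value and record its test accuracy
scores = {}
for n in (5, 15, 50):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores[n] = clf.fit(X_train, y_train).score(X_test, y_test)

print(scores)
```

Fixing `random_state` keeps the comparison repeatable, since a random forest otherwise gives slightly different results on every run.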

Instructions
<ul>
<li>Import the <code>RandomForestClassifier</code> from <code>sklearn.ensemble</code>.</li>
<li>Add a <code>RandomForestClassifier()</code> step with <code>n_estimators=15</code> to the pipeline with a name of <code>'clf'</code>.</li>
<li>Run the cell to fit the pipeline to the training data and compute its accuracy.</li>
</ul>

In [115]:
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Add model step to pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ('clf', RandomForestClassifier(n_estimators=15))
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)


Accuracy on budget dataset:  0.32371794871794873


**It's time to get serious and work with the log loss metric. You'll learn expert techniques in the next chapter to take the model to the next level.**