# Learning from the experts
>  In this chapter, you will learn the tricks used by the competition winner, and implement them yourself using scikit-learn. Enjoy!

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 4 exercises from the course "Case Study: School Budgeting with Machine Learning in Python" on DataCamp. <br>[GitHub repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [7]:
import pandas as pd
import numpy as np

## Learning from the expert: processing

### How many tokens?

<div class=""><p>Recall from previous chapters that how you tokenize text affects the n-gram statistics used in your model.</p>
<p>Going forward, you'll use alpha-numeric sequences, and <em>only</em> alpha-numeric sequences, as tokens. Alpha-numeric tokens contain only letters a-z and numbers 0-9 (no other characters). In other words, you'll tokenize on punctuation to generate n-gram statistics.</p>
<p>In this exercise, you'll make sure you remember how to tokenize on punctuation.</p>
<p>Assuming we tokenize on punctuation, accepting only alpha-numeric sequences as tokens, how many tokens are in the following string from the main dataset?</p>
<pre><code>'PLANNING,RES,DEV,&amp; EVAL      '
</code></pre>
<p>If you want, we've loaded this string into the workspace as <code>SAMPLE_STRING</code>, but you may not need it to answer the question.</p></div>

<pre>
Possible Answers
4, because RES and DEV are not tokens
<b>4, because , and & are not tokens</b>
7, because there are 4 different words, some commas, an & symbol, and whitespace
7, because there are 7 whitespaces
</pre>

**Commas, "&", and whitespace are not alpha-numeric tokens. Keep it up!**
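
The count can be checked with Python's `re` module. This is a minimal sketch that assumes a token is any maximal run of letters and digits (the plain pattern `[A-Za-z0-9]+`, which is simpler than the `TOKENS_ALPHANUMERIC` pattern used later). `SAMPLE_STRING` is redefined here so the snippet runs on its own.

```python
import re

# The string from the question, redefined so the snippet is self-contained
SAMPLE_STRING = 'PLANNING,RES,DEV,& EVAL      '

# Assumption: a token is any maximal run of letters and digits
tokens = re.findall(r'[A-Za-z0-9]+', SAMPLE_STRING)

print(tokens)
print(len(tokens))
```

Commas, the ampersand, and the trailing spaces only act as separators, which leaves four tokens.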

### Deciding what's a word

<div class=""><p>Before you build up to the winning pipeline, it will be useful to look a little deeper into how the text features will be processed.</p>
<p>In this exercise, you will use <code>CountVectorizer</code> on the training data <code>X_train</code> (preloaded into the workspace) to see the effect of tokenization on punctuation.</p>
<p>Remember, since <code>CountVectorizer</code> expects a vector, you'll need to use the preloaded function, <code>combine_text_columns</code> before fitting to the training data.</p></div>

In [9]:
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/case-study-school-budgeting-with-machine-learning-in-python/data/TrainingData.csv', index_col=0)
LABELS = ['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type', 'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status']
NUMERIC_COLUMNS = ['FTE', 'Total']

# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])
# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]

X_train, X_test, y_train, y_test = train_test_split(df[NON_LABELS], 
                                                    dummy_labels, 
                                                    test_size=0.2, 
                                                    random_state=123)

In [4]:
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.
        
        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # replace nans with blanks
    text_data.fillna("", inplace=True)
    
    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)

Instructions
<ul>
<li>Create <code>text_vector</code> by preprocessing <code>X_train</code> using <code>combine_text_columns</code>. This is important, or else you won't get any tokens!</li>
<li>Instantiate <code>CountVectorizer</code> as <code>text_features</code>. Specify the keyword argument <code>token_pattern=TOKENS_ALPHANUMERIC</code>.</li>
<li>Fit <code>text_features</code> to the <code>text_vector</code>.</li>
<li>Hit 'Submit Answer' to print the first 10 tokens.</li>
</ul>

In [16]:
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the text vector
text_vector = combine_text_columns(X_train)

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate the CountVectorizer: text_features
text_features = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit text_features to the text vector
text_features.fit(text_vector)

# Print the first 10 tokens
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(text_features.get_feature_names_out()[:10].tolist())

['00a', '12', '2nd', '3rd', '4th', '5th', '70', '70h', '8', 'a']
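
Note that `TOKENS_ALPHANUMERIC` ends with the lookahead `(?=\\s+)`, so a run of letters and digits only counts as a token when whitespace follows it. The sketch below applies the same regular expression with the standard `re` module to a made-up string (not a row from the dataset) to show the effect. `CountVectorizer` additionally lowercases the text by default before matching, which is why the example text is lowercase.

```python
import re

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Hypothetical example text, lowercased as CountVectorizer would do by default
text = 'planning,res,dev,& eval 2nd grade'

# Course pattern: a token must be followed by whitespace
with_lookahead = re.findall(TOKENS_ALPHANUMERIC, text)

# Plain pattern without the lookahead, for comparison
without_lookahead = re.findall('[A-Za-z0-9]+', text)

print(with_lookahead)
print(without_lookahead)
```

With the lookahead, words that are directly followed by punctuation or that end the string are skipped, so the two patterns can produce different vocabularies.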


### N-gram range in scikit-learn

<div class=""><p>In this exercise you'll insert a <code>CountVectorizer</code> instance into your pipeline for the main dataset, and compute multiple n-gram features to be used in the model.</p>
<p>In order to look for ngram relationships at multiple scales, you will use the <code>ngram_range</code> parameter as Peter discussed in the video. </p>
<p><strong>Special functions:</strong> You'll notice a couple of new steps provided in the pipeline in this and many of the remaining exercises. Specifically, the <code>dim_red</code> step following the <code>vectorizer</code> step, and the <code>scale</code> step preceding the <code>clf</code> (classification) step.</p>
<p>These have been added in order to account for the fact that you're using a reduced-size sample of the full dataset in this course. To make sure the models perform as the expert competition winner intended, we have to apply a <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction" target="_blank" rel="noopener noreferrer">dimensionality reduction</a> technique, which is what the <code>dim_red</code> step does, and we have to <a href="https://en.wikipedia.org/wiki/Feature_scaling" target="_blank" rel="noopener noreferrer">scale the features</a> to lie between -1 and 1, which is what the <code>scale</code> step does.</p>
<p>The <code>dim_red</code> step uses a scikit-learn function called <code>SelectKBest()</code>, applying something called the <a href="https://en.wikipedia.org/wiki/Chi-squared_test" target="_blank" rel="noopener noreferrer">chi-squared test</a> to select the K "best" features. The <code>scale</code> step uses a scikit-learn function called <code>MaxAbsScaler()</code> in order to squash the relevant features into the interval -1 to 1.</p>
<p>You won't need to do anything extra with these functions here, just complete the vectorizing pipeline steps below. However, notice how easy it was to add more processing steps to our pipeline!</p></div>
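
To make `ngram_range=(1, 2)` concrete, here is a pure-Python sketch of the features it asks for: every single token plus every pair of adjacent tokens. The helper `ngrams` and the token list are made up for illustration; `CountVectorizer` builds these internally and then counts their occurrences.

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for n in the inclusive range."""
    low, high = ngram_range
    result = []
    for n in range(low, high + 1):
        # Slide a window of length n across the token list
        for start in range(len(tokens) - n + 1):
            result.append(' '.join(tokens[start:start + n]))
    return result

tokens = ['petro', 'vend', 'fuel']
features = ngrams(tokens)
print(features)
```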

Instructions
<ul>
<li>Import <code>CountVectorizer</code> from <code>sklearn.feature_extraction.text</code>. </li>
<li>Add a <code>CountVectorizer</code> step to the pipeline with the name <code>'vectorizer'</code>.<ul>
<li>Set the token pattern to be <code>TOKENS_ALPHANUMERIC</code>.</li>
<li>Set the <code>ngram_range</code> to be <code>(1, 2)</code>.</li></ul></li>
</ul>
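
The `scale` step can be pictured as dividing each feature column by its largest absolute value, so every column ends up in the interval -1 to 1 and zeros stay zero. The function below is a hypothetical pure-Python stand-in for that idea, not scikit-learn's implementation.

```python
def max_abs_scale(columns):
    """Scale each column (a list of numbers) by its maximum absolute value."""
    scaled = []
    for column in columns:
        max_abs = max(abs(value) for value in column)
        # Leave all-zero columns untouched to avoid dividing by zero
        divisor = max_abs if max_abs != 0 else 1.0
        scaled.append([value / divisor for value in column])
    return scaled

# Two made-up feature columns on very different scales
columns = [[0.0, 2.0, -4.0], [10.0, 5.0, 0.0]]
scaled_columns = max_abs_scale(columns)
print(scaled_columns)
```

Because zeros are preserved, this kind of scaling keeps sparse count features sparse.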

In [55]:
# Import pipeline
from sklearn.pipeline import Pipeline

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Import other preprocessing modules
from sklearn.impute import SimpleImputer
#from sklearn.preprocessing import Imputer
from sklearn.feature_selection import chi2, SelectKBest

# Select 300 best features
chi_k = 300

# Import functional utilities
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.pipeline import FeatureUnion

# Perform preprocessing
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                   ngram_range=(1, 2))),
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

**Log loss score: 1.2681. Great work! You'll now add some additional tricks to make the pipeline even better.**

In [56]:
%%time
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)

CPU times: user 3.03 s, sys: 125 ms, total: 3.15 s
Wall time: 3.21 s


In [57]:
accuracy

0.1987179487179487
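
The accuracy looks low next to the log loss quoted above because the two numbers measure different things. Since `y_train` is a multi-column indicator matrix, `pl.score` most likely reports subset accuracy: a row counts as correct only when every one of its label columns is predicted correctly. The sketch below uses made-up predictions (not output from the pipeline) to show how strict that is.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label vector matches exactly."""
    exact_matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact_matches / len(y_true)

def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label entries predicted correctly."""
    total = sum(len(row) for row in y_true)
    correct = sum(t == p
                  for true_row, pred_row in zip(y_true, y_pred)
                  for t, p in zip(true_row, pred_row))
    return correct / total

# Hypothetical indicator rows: 4 samples, 3 label columns
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 0]]

print(subset_accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```

Here three quarters of the individual entries are right, yet only one row in four matches exactly.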

## Learning from the expert: a stats trick

### Which models of the data include interaction terms?

<div class=""><p>Recall from the video that interaction terms involve products of features.</p>
<p>Suppose we have two features <code>x</code> and <code>y</code>, and we use models that process the features as follows:</p>
<ol>
<li>βx + βy + ββ</li>
<li>βxy + βx + βy</li>
<li>βx + βy + βx^2 + βy^2</li>
</ol>
<p>where β is a coefficient in your model (not a feature).</p>
<p>Which expression(s) include interaction terms?</p></div>

<pre>
Possible Answers
The first expression.
<b>The second expression.</b>
The third expression.
The first and third expressions.
</pre>

**An xy term is present, which represents interactions between features. Nice work, let's implement this!**
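
As a small numeric illustration of the second expression, the function below evaluates a model with an interaction term. The coefficient values are invented; the point is that the `xy` term only contributes when both features are non-zero.

```python
def predict_with_interaction(x, y, b_x=1.0, b_y=2.0, b_xy=3.0):
    """Evaluate b_xy*x*y + b_x*x + b_y*y with made-up coefficients."""
    return b_xy * x * y + b_x * x + b_y * y

# Only x present, only y present, then both together
print(predict_with_interaction(1, 0))
print(predict_with_interaction(0, 1))
print(predict_with_interaction(1, 1))
```

When both features are present, the result is larger than the sum of the two single-feature cases; that extra amount is what the interaction term captures.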

### Implement interaction modeling in scikit-learn

<div class=""><p>It's time to add interaction features to your model. The <code>PolynomialFeatures</code> object in scikit-learn does just that, but here you're going to use a custom interaction object, <code>SparseInteractions</code>. Interaction terms are a statistical tool that lets your model express what happens if two features appear together in the same row.</p>
<p><code>SparseInteractions</code> does the same thing as <code>PolynomialFeatures</code>, but it uses sparse matrices to do so. You can get the code for <code>SparseInteractions</code> in <a href="https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py" target="_blank" rel="noopener noreferrer">this GitHub repository</a>.</p>
<p><code>PolynomialFeatures</code> and <code>SparseInteractions</code> both take the argument <code>degree</code>, which tells them what polynomial degree of interactions to compute.</p>
<p>You're going to consider interaction terms of <code>degree=2</code> in your pipeline. You will insert these steps <em>after</em> the preprocessing steps you've built out so far, but <em>before</em> the classifier steps.</p>
<p>Pipelines with interaction terms take a while to train (since you're making n features into n-squared features!), so as long as you set it up right, we'll do the heavy lifting and tell you what your score is!</p></div>
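
To see where the extra features come from, the sketch below expands one dense row the way a degree-2 interaction step would, using `itertools.combinations` to multiply every pair of distinct columns, and then counts the output width for a given input width. The figure of 302 input columns is an assumption: the 2 numeric columns plus the `chi_k = 300` selected text features from the earlier pipeline.

```python
from itertools import combinations
from math import comb

def add_pairwise_interactions(row):
    """Append the product of every pair of distinct columns to the row."""
    products = [row[i] * row[j] for i, j in combinations(range(len(row)), 2)]
    return list(row) + products

print(add_pairwise_interactions([2, 3, 5]))

# Output width: n original columns plus one column per pair of columns
n_features = 302  # assumption: 2 numeric + 300 selected text features
n_output = n_features + comb(n_features, 2)
print(n_output)
```

The number of pairs grows roughly with the square of the number of input columns, which is why training slows down noticeably once interactions are added.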

In [21]:
#https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py
from itertools import combinations

import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class SparseInteractions(BaseEstimator, TransformerMixin):
    def __init__(self, degree=2, feature_name_separator="_"):
        self.degree = degree
        self.feature_name_separator = feature_name_separator

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not sparse.isspmatrix_csc(X):
            X = sparse.csc_matrix(X)

        if hasattr(X, "columns"):
            self.orig_col_names = X.columns
        else:
            self.orig_col_names = np.array([str(i) for i in range(X.shape[1])])

        spi = self._create_sparse_interactions(X)
        return spi

    def get_feature_names(self):
        return self.feature_names

    def _create_sparse_interactions(self, X):
        out_mat = []
        self.feature_names = self.orig_col_names.tolist()

        for sub_degree in range(2, self.degree + 1):
            for col_ixs in combinations(range(X.shape[1]), sub_degree):
                # add name for new column
                name = self.feature_name_separator.join(self.orig_col_names[list(col_ixs)])
                self.feature_names.append(name)

                # get column multiplications value
                out = X[:, col_ixs[0]]
                for j in col_ixs[1:]:
                    out = out.multiply(X[:, j])

                out_mat.append(out)

        return sparse.hstack([X] + out_mat)

Instructions
<ul>
<li>Add the interaction terms step using <code>SparseInteractions()</code> with <code>degree=2</code>. Give it a name of <code>'int'</code>, and make sure it is after the preprocessing step but before scaling.</li>
</ul>

In [52]:
# Instantiate pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                   ngram_range=(1, 2))),  
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree=2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

**Log loss score: 1.2256. Nice improvement from 1.2681!**

In [53]:
%%time
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)

CPU times: user 1min 46s, sys: 1min 21s, total: 3min 8s
Wall time: 2min


In [54]:
accuracy

0.2948717948717949

## Learning from the expert: the winning model

### Why is hashing a useful trick?

<div class=""><p>In the video, Peter explained that a <a href="https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick" target="_blank" rel="noopener noreferrer">hash</a> function takes an input, in your case a token, and outputs a hash value. For example, the input may be a string and the hash value may be an integer.</p>
<p>We've loaded a familiar Python datatype, a dictionary called <code>hash_dict</code>, that makes this mapping concept a bit more explicit. In fact, <a href="http://stackoverflow.com/questions/114830/is-a-python-dictionary-an-example-of-a-hash-table" target="_blank" rel="noopener noreferrer">Python dictionaries ARE hash tables</a>!</p>
<p>Print <code>hash_dict</code> in the IPython Shell to get a sense of how strings can be mapped to integers.</p>
<p>By explicitly stating how many possible outputs the hashing function may have, we limit the size of the objects that need to be processed. With these limits known, computation can be made more efficient and we can get results faster, even on large datasets.</p>
<p>Using the above information, answer the following:</p>
<p>Why is hashing a useful trick?</p></div>

In [None]:
hash_dict = {'and': 780, 'fluids': 354, 'fuel': 895, 'petro': 354, 'vend': 785}

<pre>
Possible Answers
Hashing isn't useful unless you're working with numbers.
Some problems are memory-bound and not easily parallelizable, but hashing parallelizes them.
<b>Some problems are memory-bound and not easily parallelizable, and hashing enforces a fixed length computation instead of using a mutable datatype (like a dictionary).</b>
Hashing enforces a mutable length computation instead of using a fixed length datatype, like a dictionary.
</pre>

**Enforcing a fixed length can speed up calculations drastically, especially on large datasets!**
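
The idea can be sketched without scikit-learn. Below, each token is mapped to one of a fixed number of buckets, so the feature vector has the same length no matter how many distinct tokens appear, and no vocabulary dictionary has to be stored. The choice of `hashlib.md5` and of 8 buckets is arbitrary for illustration and is not what `HashingVectorizer` uses internally.

```python
import hashlib

def bucket_index(token, n_buckets):
    """Map a token to a stable bucket index in range(n_buckets)."""
    digest = hashlib.md5(token.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

def hash_counts(tokens, n_buckets=8):
    """Count tokens per bucket; the result always has n_buckets entries."""
    counts = [0] * n_buckets
    for token in tokens:
        counts[bucket_index(token, n_buckets)] += 1
    return counts

tokens = ['and', 'fluids', 'fuel', 'petro', 'vend', 'fuel']
vector = hash_counts(tokens)
print(vector)
```

Two different tokens can land in the same bucket, as `'fluids'` and `'petro'` do in `hash_dict` above; that is the price paid for the fixed length.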

### Implementing the hashing trick in scikit-learn

<div class=""><p>In this exercise you will check out the scikit-learn implementation of <code>HashingVectorizer</code> before adding it to your pipeline later.</p>
<p>As you saw in the video, <code>HashingVectorizer</code> acts just like <code>CountVectorizer</code> in that it can accept <code>token_pattern</code> and <code>ngram_range</code> parameters. The important difference is that it creates hash values from the text, so that we get all the computational advantages of hashing!</p></div>

Instructions
<ul>
<li>Import <code>HashingVectorizer</code> from <code>sklearn.feature_extraction.text</code>.</li>
<li>Instantiate the <code>HashingVectorizer</code> as <code>hashing_vec</code> using the <code>TOKENS_ALPHANUMERIC</code> pattern.</li>
<li>Fit and transform <code>hashing_vec</code> using <code>text_data</code>. Save the result as <code>hashed_text</code>.</li>
<li>Hit 'Submit Answer' to see some of the resulting hash values.</li>
</ul>

In [32]:
# Import HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

# Get text data: text_data
text_data = combine_text_columns(X_train)

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 

# Instantiate the HashingVectorizer: hashing_vec
hashing_vec = HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit and transform the Hashing Vectorizer
hashed_text = hashing_vec.fit_transform(text_data)

# Create DataFrame and print the head
hashed_df = pd.DataFrame(hashed_text.data)
print(hashed_df.head())

          0
0  0.213201
1 -0.213201
2 -0.213201
3 -0.213201
4  0.213201


**As you can see, some text is hashed to the same value, but as Peter mentioned in the video, this doesn't necessarily hurt performance.**
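
The printed numbers are worth a second look: `hashed_text.data` holds the non-zero entries of the sparse matrix, not raw hash values. With its default settings, `HashingVectorizer` applies L2 normalization to each row and may flip the sign of an entry (`alternate_sign=True`), which would explain values of equal magnitude and mixed sign. As a plausibility check, and assuming the first row contains 22 distinct hashed tokens that each occur once, the magnitude would be 1/sqrt(22):

```python
import math

# Assumption: a row with 22 non-zero entries, each a count of 1 before normalization
n_nonzero = 22
counts = [1.0] * n_nonzero

# L2 normalization divides every entry by the row's Euclidean length
row_length = math.sqrt(sum(value ** 2 for value in counts))
normalized = [value / row_length for value in counts]

print(round(normalized[0], 6))
```

The result agrees with the magnitude in the output above, although the row length of 22 is inferred from that number rather than read from the data.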

### Build the winning model

<div class=""><p>You have arrived! This is where all of your hard work pays off. It's time to build the model that won DrivenData's competition.</p>
<p>You've constructed a robust, powerful pipeline capable of processing training <em>and</em> testing data. Now that you understand the data and know all of the tools you need, you can essentially solve the whole problem in a relatively small number of lines of code. Wow!</p>
<p>All you need to do is add the <code>HashingVectorizer</code> step to the pipeline to replace the <code>CountVectorizer</code> step.</p>
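<p>The parameters <code>non_negative=True</code>, <code>norm=None</code>, and <code>binary=False</code> make the <code>HashingVectorizer</code> perform similarly to the default settings on the <code>CountVectorizer</code> so you can just replace one with the other. Newer scikit-learn releases no longer accept <code>non_negative</code>, so the code below passes <code>alternate_sign=False</code> instead.</p></div>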

Instructions
<ul>
<li>Import <code>HashingVectorizer</code> from <code>sklearn.feature_extraction.text</code>.</li>
<li>Add a <code>HashingVectorizer</code> step to the pipeline.<ul>
<li>Name the step <code>'vectorizer'</code>.</li>
<li>Use the <code>TOKENS_ALPHANUMERIC</code> token pattern.</li>
<li>Specify the <code>ngram_range</code> to be <code>(1, 2)</code></li></ul></li>
</ul>

In [46]:
# Import the hashing vectorizer
from sklearn.feature_extraction.text import HashingVectorizer

# Instantiate the winning model pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                     norm=None, binary=False,
                                                     ngram_range=(1, 2),
                                                     # alternate_sign=False replaces the older non_negative=True
                                                     alternate_sign=False)),
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree=2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

**Log loss: 1.2258. Looks like the performance is about the same, but this is expected since the HashingVectorizer should work the same as the CountVectorizer. Try this pipeline out on the whole dataset on your local machine to see its full power!**

In [47]:
%%time
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)

CPU times: user 2min 24s, sys: 1min 19s, total: 3min 44s
Wall time: 2min 37s


In [48]:
accuracy

0.2948717948717949

### What tactics got the winner the best score?

<div class=""><p>Now you've implemented the winning model from start to finish. If you want to use this model locally, <a href="https://github.com/datacamp/course-resources-ml-with-experts-budgets/blob/master/notebooks/1.0-full-model.ipynb" target="_blank" rel="noopener noreferrer">this Jupyter notebook</a> contains all the code you've worked so hard on. You can now take that code and build on it!</p>
<p>Let's take a moment to reflect on why this model did so well. What tactics got the winner the best score?</p></div>

<pre>
Possible Answers
The winner used a 500 layer deep convolutional neural network to master the budget data.
The winner used an ensemble of many models for classification, taking the best results as predictions.
<b>The winner used skillful NLP, efficient computation, and simple but powerful stats tricks to master the budget data.</b>
</pre>