# ML multi-class multi-label classification
## - school budgets case study
- build a baseline model - simple, first-pass approach
- use NLP to prepare budgets for modeling
- build a more accurate model
- from DrivenData challenge

Challenge: School budgets take hundreds of hours each year to manually label
Goal: Build an ML algorithm that can automate the process
Budget data
- line-item: "Algebra books for 8th grade students"
- target/labels: "Textbooks", "Math", "Middle School"

Notes:
- Supervised learning problem
- classification problem
- over 100 target variables
- 9 broad categories with many possible sub-label instances

What solution are we seeking?
- Human-in-the-loop learning system
- We want probability suggestions for each line item
    - i.e., "I'm 60% sure this line is for textbooks; if it's not textbooks, I'm 30% sure it's office supplies"
    - These suggestions can help prioritize analysts' time
    

# 1. Exploring the raw data - EDA

In [None]:
# Load and preview the data
import pandas as pd
sample_df = pd.read_csv('sample_data.csv')
sample_df.head()

# summarize the data
sample_df.info()
sample_df.describe()


In [None]:
# Example
import matplotlib.pyplot as plt

df = pd.read_csv('TrainingData.csv',index_col=0)
df.info()
df.head()
df.tail()

# Print the summary statistics
print(df.describe())


# Create the histogram after dropping null values
plt.hist(df['FTE'].dropna())

# Add title and labels
plt.title('Distribution of %full-time \n employee works')
plt.xlabel('% of full-time')
plt.ylabel('num employees')

# Display the histogram
plt.show()

## Encode labels as categories
- ML algorithms work on numbers, not strings
    - need a numeric representation of these strings
- Strings can be slow compared to numbers
- in pandas, 'category' dtype encodes categorical data numerically
    - can speed up code
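
As a small illustration of what the 'category' dtype stores, the sketch below uses a made-up Series (the values are hypothetical, not taken from the budget data) and prints the integer codes pandas keeps in place of the repeated strings.

```python
import pandas as pd

# Hypothetical label values, for illustration only
labels = pd.Series(['Textbooks', 'Math', 'Textbooks', 'Supplies'])

as_category = labels.astype('category')

# Each distinct string is stored once ...
print(list(as_category.cat.categories))
# ... and every row holds a small integer code pointing at it
print(as_category.cat.codes.tolist())
```

The codes index into the sorted list of categories, which is why comparisons and group-bys on a category column can be faster than on raw strings.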

In [None]:
# Example: encode labels from string as categories
sample_df.label.head(2)
# dtype: object

sample_df.label = sample_df.label.astype('category')

sample_df.label.head(2)
# dtype: category

In [None]:
# Dummy variable encoding
# aka 'binary indicator' representation
dummies = pd.get_dummies(sample_df[['label']], prefix_sep='_')
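
To make the 'binary indicator' idea concrete, here is a minimal sketch on a hypothetical three-row frame (the name `toy` and its values are assumptions for illustration). Each distinct label value becomes its own 0/1 column.

```python
import pandas as pd

# Hypothetical two-class label column, for illustration only
toy = pd.DataFrame({'label': ['a', 'b', 'a']})

# One indicator column per distinct value, named <column><prefix_sep><value>
toy_dummies = pd.get_dummies(toy[['label']], prefix_sep='_')

print(toy_dummies.columns.tolist())
# Cast to int so the output is 0/1 regardless of the pandas version's default dtype
print(toy_dummies.astype(int).values.tolist())
```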

In [None]:
# Lambda functions
# - alternative to 'def' syntax
# - easy to make simple, one-line functions
# example
square = lambda x: x*x
square(2)
# out: 4

In [None]:
# Make multiple columns into category type using lambda function

# lambda function to convert a column to the category dtype
categorize_label = lambda x: x.astype('category')

# apply() returns a DataFrame, so assign back to the same column selection
sample_df[['label']] = sample_df[['label']].apply(categorize_label, axis=0)
sample_df.info()

In [None]:
# more EDA
df.dtypes.value_counts()
# object     23
# float64     2
# dtype: int64

# Encode labels as categorical variables
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)

# Print the converted dtypes
print(df[LABELS].dtypes)

In [None]:
# Count and Plot unique values for each label category

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique)

# Plot number of unique values for each label
num_unique_labels.plot(kind='bar')

# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')

# Display the plot
plt.show()

### Evaluating success
How do we measure success?
- accuracy can be misleading when classes are imbalanced
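
A quick sketch of why accuracy misleads on imbalanced classes, using made-up labels (95 negatives, 5 positives, chosen only for illustration): a "model" that always predicts the majority class looks accurate while finding nothing.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 negatives, 5 positives
actual = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class
always_negative = np.zeros_like(actual)

accuracy = np.mean(always_negative == actual)
positives_found = int(np.sum((always_negative == 1) & (actual == 1)))

print(accuracy)         # high accuracy ...
print(positives_found)  # ... yet no positive example is ever identified
```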

Metric: log loss
- loss function
- measure of error 
- want to minimize the error (unlike accuracy)

Log loss binary classification
- penalizes predictions that are both wrong and confident
- actual value: y = {1=yes, 0=no}
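
For reference, the standard binary log loss over $N$ rows, with true label $y_i \in \{0, 1\}$ and predicted probability $p_i$, can be written as follows (the symbols $N$, $y_i$ and $p_i$ are notation introduced here, not taken from the notes):

```latex
\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \Big]
```

When $y_i = 0$ and $p_i$ is close to 1, the term $\log(1 - p_i)$ becomes very negative, which is how confident wrong predictions end up with a large penalty.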

In [None]:
# Log loss implementation with NumPy and clip()
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """ 
    Computes the logarithmic loss between predicted and actual
    when these are 1D arrays.
    
    :param predicted: The predicted probabilities as floats between 0-1
    :param actual: The actual binary labels. Either 0 or 1.
    :param eps (optional): log(0) is inf, so we need to offset our
        predicted values slightly by eps from 0 or 1.
    """
    predicted = np.clip(predicted, eps, 1-eps)
    loss = -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual)
                        * np.log(1 - predicted))
    return loss

# compute log loss
# confident and wrong item
compute_log_loss(predicted=0.9, actual=0)
# out: 2.3025850929940459

# prediction right in the middle (predicted=0.5)
compute_log_loss(predicted=0.5, actual=1)
# out: 0.6931471805599453

In [7]:
# example log loss calculations
print(compute_log_loss(predicted=0.85, actual=1),
compute_log_loss(predicted=0.99, actual=0),
compute_log_loss(predicted=0.51, actual=0))

0.16251892949777494 4.605170185988091 0.7133498878774648


To see how the log loss metric handles the trade-off between accuracy and confidence, we will use some sample data generated with NumPy and compute the log loss with the compute_log_loss() function defined above.

Five one-dimensional numeric arrays simulating different types of predictions have been pre-loaded: actual_labels, correct_confident, correct_not_confident, wrong_not_confident, and wrong_confident.

Your job is to compute the log loss for each sample set using compute_log_loss(predicted_values, actual_values), which takes the predicted values as the first argument and the actual values as the second.

Compute the log loss for the following predicted values (in each case, the actual values are contained in actual_labels):
- correct_confident
- correct_not_confident
- wrong_not_confident
- wrong_confident
- actual_labels

In [None]:
# Results are stored under new names so the input arrays are not overwritten

# Compute and print log loss for 1st case
correct_confident_loss = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident_loss))

# Compute and print log loss for 2nd case
correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident_loss))

# Compute and print log loss for 3rd case
wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss))

# Compute and print log loss for 4th case
wrong_confident_loss = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident_loss))

# Compute and print log loss for actual labels
actual_labels_loss = compute_log_loss(actual_labels, actual_labels)
print("Log loss, actual labels: {}".format(actual_labels_loss))

# <script.py> output:
#     Log loss, correct and confident: 0.05129329438755058
#     Log loss, correct and not confident: 0.4307829160924542
#     Log loss, wrong and not confident: 1.049822124498678
#     Log loss, wrong and confident: 2.9957322735539904
#     Log loss, actual labels: 9.99200722162646e-15

# 2. Create a simple first model
- Train basic model on numeric data only
    - remove text data
- Multi-class logistic regression (each label independent)
    - train classifier on each label separately and use to predict
- Format prediction and save to csv
- Compute log loss score
- use NLP to wrangle text data

Splitting the multi-class dataset
- Train-test-split won't work here
    - may end up with labels in test set that never appear in training set
- Solution?: StratifiedShuffleSplit
    - Only works with a single target variable
    - But, this problem has many target variables
- actual solution: multilabel_train_test_split()
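
The failure mode described above can be checked directly. The sketch below is an illustration on made-up data, and the helper name `labels_missing_from` is hypothetical: it lists the label columns that have no positive example in a given split.

```python
import pandas as pd

def labels_missing_from(y_split):
    """Return the label columns that have no positive example in y_split."""
    counts = y_split.sum(axis=0)
    return counts[counts == 0].index.tolist()

# Hypothetical indicator matrix: 'rare' appears in only one row
y = pd.DataFrame({
    'common': [1, 1, 1, 1, 0, 1],
    'rare':   [0, 0, 0, 0, 1, 0],
})

# A naive split that happens to put the only 'rare' row in the test set
y_train, y_test = y.iloc[:4], y.iloc[4:]

print(labels_missing_from(y_train))
print(labels_missing_from(y_test))
```

A classifier trained on this `y_train` never sees a positive example of 'rare', which is the situation a label-aware split is meant to avoid.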
        

### 2.1 Splitting the data

In [None]:
# NUMERIC_COLUMNS is a list of column names that are numeric
# fill NaN with -1000 (rather than 0) so the model can tell
#  a missing value apart from a real value of 0
data_to_train = df[NUMERIC_COLUMNS].fillna(-1000)
# create array of target variables
labels_to_use = pd.get_dummies(df[LABELS])
X_train, X_test, y_train, y_test = multilabel_train_test_split(
    data_to_train, labels_to_use, size=0.2, seed=123)


### 2.2 Training the model
- OneVsRestClassifier
    - treats each column of y independently
    - in other words, it fits a separate classifier for each of the columns
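
A minimal sketch of that behaviour on hypothetical toy data (the arrays `X` and `Y` below are made up): with two label columns, OneVsRestClassifier ends up holding two fitted estimators, and predict_proba returns one probability per row and per label column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: one numeric feature, two independent label columns
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([[0, 1],
              [0, 1],
              [0, 0],
              [1, 0],
              [1, 1],
              [1, 0]])

ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, Y)

# One fitted LogisticRegression per label column
print(len(ovr.estimators_))

# predict_proba returns one probability per row and per label column
print(ovr.predict_proba(X).shape)
```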

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)

### 2.1.a multilabel_train_test_split()
https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/data/multilabel.py

In [None]:
from warnings import warn

import numpy as np
import pandas as pd

def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if size <= 1.
        The sample is guaranteed to have > `min_count` of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).any():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices,
                                   size=sample_count,
                                   replace=False)

    return np.concatenate([sample_idxs, remaining_sampled])


def multilabel_sample_dataframe(df, labels, size, min_count=5, seed=None):
    """ Takes a dataframe `df` and returns a sample of size `size` where all
        classes in the binary matrix `labels` are represented at
        least `min_count` times.
    """
    idxs = multilabel_sample(labels, size=size, min_count=min_count, seed=seed)
    return df.loc[idxs]


def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    train_set_idxs = np.setdiff1d(index, test_set_idxs)

    test_set_mask = index.isin(test_set_idxs)
    train_set_mask = ~test_set_mask

    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])

### 2.1.b Setting up a train-test split in scikit-learn
Alright, you've been patient and awesome. It's finally time to start training models!

The first step is to split the data into a training set and a test set. Some labels don't occur very often, but we want to make sure that they appear in both the training and the test sets. We provide a function that will make sure at least min_count examples of each label appear in each split: multilabel_train_test_split.

The full code for multilabel_train_test_split is listed in section 2.1.a above.

You'll start with a simple model that uses just the numeric columns of your DataFrame when calling multilabel_train_test_split. The data has been read into a DataFrame df and a list consisting of just the numeric columns is available as NUMERIC_COLUMNS.

In [None]:
# split data

# Create the new DataFrame: numeric_data_only
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])

# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(
                                                    numeric_data_only,
                                                    label_dummies,
                                                    size=0.2, 
                                                    seed=123)

# Print the info
print("X_train info:")
print(X_train.info())
print("\nX_test info:")  
print(X_test.info())
print("\ny_train info:")  
print(y_train.info())
print("\ny_test info:")  
print(y_test.info()) 

# X_train info:
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 1040 entries, 198 to 101861
# Data columns (total 2 columns):
# FTE      1040 non-null float64
# Total    1040 non-null float64
# dtypes: float64(2)
# memory usage: 24.4 KB
# None

# X_test info:
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 520 entries, 209 to 448628
# Data columns (total 2 columns):
# FTE      520 non-null float64
# Total    520 non-null float64
# dtypes: float64(2)
# memory usage: 12.2 KB
# None

# y_train info:
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 1040 entries, 198 to 101861
# Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
# dtypes: float64(104)
# memory usage: 853.1 KB
# None

# y_test info:
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 520 entries, 209 to 448628
# Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
# dtypes: float64(104)
# memory usage: 426.6 KB
# None


### 2.2.a Training a model
With split data in hand, you're only a few lines away from training a model.

In this exercise, you will import the logistic regression and one versus rest classifiers in order to fit a multi-class logistic regression model to the NUMERIC_COLUMNS of your feature data.

Then you'll test and print the accuracy with the .score() method to see the results of training.

Before you train! Remember, we're ultimately going to be using logloss to score our model, so don't worry too much about the accuracy here. Keep in mind that you're throwing away all of the text data in the dataset - that's by far most of the data! So don't get your hopes up for a killer performance just yet. We're just interested in getting things up and running at the moment.

All data necessary to call multilabel_train_test_split() has been loaded into the workspace.

In [None]:
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Create the DataFrame: numeric_data_only
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])

# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(
                                                       numeric_data_only,
                                                       label_dummies,
                                                       size=0.2, 
                                                       seed=123)

# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Print the accuracy
print("Accuracy: {}".format(clf.score(X_test, y_test)))

# Accuracy: 0.0

The bad news is that your model scored the lowest possible accuracy: 0.0! 

But hey, you just threw away ALL of the text data in the budget. Later, you won't. 

Before you add the text data, let's see how the model does when scored by log loss.

### 2.3 Predicting on holdout data

In [None]:
holdout = pd.read_csv('HoldoutData.csv',index_col=0)
holdout = holdout[NUMERIC_COLUMNS].fillna(-1000)
predictions = clf.predict_proba(holdout)

If .predict() were used instead of .predict_proba():
- the output would be hard 0 or 1 labels rather than probabilities
- log loss penalizes being confident and wrong
- so the log loss score would be worse than with .predict_proba()
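
The effect can be sketched with a few made-up numbers. The helper `log_loss_1d` below is a hypothetical stand-in written for this example, following the same clipping idea as the log loss function earlier in these notes; the arrays are illustrative, not model output.

```python
import numpy as np

def log_loss_1d(predicted, actual, eps=1e-14):
    """Binary log loss, clipping predictions away from exactly 0 and 1."""
    predicted = np.clip(predicted, eps, 1 - eps)
    return -np.mean(actual * np.log(predicted)
                    + (1 - actual) * np.log(1 - predicted))

# Hypothetical values: four rows, the last prediction is on the wrong side
actual = np.array([1, 0, 1, 0])
probabilities = np.array([0.8, 0.2, 0.7, 0.6])      # like .predict_proba()
hard_labels = (probabilities >= 0.5).astype(float)  # like .predict()

loss_proba = log_loss_1d(probabilities, actual)
loss_hard = log_loss_1d(hard_labels, actual)

print(loss_proba)
print(loss_hard)
```

The single wrong row costs little when it is reported as 0.6, but a great deal when it is rounded up to a fully confident 1.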

### 2.4 Format and submit predictions
- all formatting can be done with the pandas .to_csv() method
- predictions are an array, so convert to a dataframe

In [None]:
# format column names with '__'
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS],
                            prefix_sep='__').columns,
                            index=holdout.index,
                            data=predictions)
prediction_df.to_csv('predictions.csv')
score = score_submission(pred_path='predictions.csv')

### 2.3.a Use your model to predict values on holdout data
You're ready to make some predictions! Remember, the train-test-split you've carried out so far is for model development. The original competition provides an additional test set, for which you'll never actually see the correct labels. This is called the "holdout data."

The point of the holdout data is to provide a fair test for machine learning competitions. If the labels aren't known by anyone but DataCamp, DrivenData, or whoever is hosting the competition, you can be sure that no one submits a mere copy of labels to artificially pump up the performance on their model.

Remember that the original goal is to predict the probability of each label. In this exercise you'll do just that by using the .predict_proba() method on your trained model.

First, however, you'll need to load the holdout data, which is available in the workspace as the file HoldoutData.csv.

In [None]:
# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit it to the training data
clf.fit(X_train, y_train)

# Load the holdout data: holdout
holdout = pd.read_csv('HoldoutData.csv', index_col=0)

# Generate predictions: predictions
predictions = clf.predict_proba(
                holdout[NUMERIC_COLUMNS].fillna(-1000))


### 2.4.a Writing out your results to a csv for submission
At last, you're ready to submit some predictions for scoring. In this exercise, you'll write your predictions to a .csv using the .to_csv() method on a pandas DataFrame. Then you'll evaluate your performance according to the LogLoss metric discussed earlier!

You'll need to make sure your submission obeys the correct format.

To do this, you'll use your predictions values to create a new DataFrame, prediction_df.

Interpreting LogLoss & Beating the Benchmark:

When interpreting your log loss score, keep in mind that the score will change based on the number of samples tested. To get a sense of how this very basic model performs, compare your score to the DrivenData benchmark model performance: 2.0455, which merely submitted uniform probabilities for each class.

Remember, the lower the log loss the better. Is your model's log loss lower than 2.0455?

In [None]:
# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))

# Format predictions in DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(
                             df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)


# Save prediction_df to csv
prediction_df.to_csv('predictions.csv')

# Submit the predictions for scoring: score
score = score_submission(pred_path='predictions.csv')

# Print score
print('Your model, trained with numeric data only, yields logloss score: {}'.format(score))

# Your model, trained with numeric data only, 
# yields logloss score: 1.9067227623381413


### 2.5 Intro to NLP
Data for NLP: text, documents, speech, ...

Tokenization
   - = splitting a string into segments
   - store segments as list
   - Example: 'Natural Language Processing'
       - ['Natural', 'Language', 'Processing']
       
   

### Tokens and token patterns

example: PETRO-VEND FUEL AND FLUIDS

Tokenize on whitespace
- PETRO-VEND | FUEL | AND | FLUIDS

Tokenize on whitespace and punctuation
- PETRO | VEND | FUEL | AND | FLUIDS
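
The two tokenizations can be reproduced with the standard library. This is a minimal sketch; the regular expression is one reasonable choice for "whitespace and punctuation", not the only one.

```python
import re

text = 'PETRO-VEND FUEL AND FLUIDS'

# Split on runs of whitespace only
whitespace_tokens = text.split()

# Split on runs of anything that is not a letter or digit
# (whitespace and punctuation both act as separators)
punctuation_tokens = [t for t in re.split(r'[^A-Za-z0-9]+', text) if t]

print(whitespace_tokens)
print(punctuation_tokens)
```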

Bag of words representation
- count number of times a particular token appears
- "bag of words"
    - count number of times a word occurs
    - This approach discards info about word order
        - "Red, not blue" is the same as "blue, not red"

N-grams
- 1-gram, 2-gram, ..., n-gram
    - 1-gram: PETRO, VEND, FUEL, AND, FLUIDS
    - 2-grams: PETRO VEND, VEND FUEL, FUEL AND, AND FLUIDS
    - 3-grams: PETRO VEND FUEL, VEND FUEL AND, FUEL AND FLUIDS
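
The n-gram lists above can be generated with a short helper. The function name `ngrams` is chosen here for illustration; it slides a window of length n over the token list.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['PETRO', 'VEND', 'FUEL', 'AND', 'FLUIDS']

for n in (1, 2, 3):
    print(n, len(ngrams(tokens, n)), ngrams(tokens, n))
```

For a list of k tokens there are k - n + 1 n-grams, so five tokens give 5, 4 and 3 of them for n = 1, 2, 3.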
    

### 2.6 Testing your NLP credentials with n-grams
You're well on your way to NLP superiority. Let's test your mastery of n-grams!

In the workspace, we have loaded a Python list, one_grams, which contains all 1-grams of the string petro-vend fuel and fluids, tokenized on punctuation. Specifically,

one_grams = ['petro', 'vend', 'fuel', 'and', 'fluids']

In this exercise, your job is to determine the sum of the sizes of 1-grams, 2-grams and 3-grams generated by the string petro-vend fuel and fluids, tokenized on punctuation.

Recall that the n-grams of a sequence are all of its contiguous subsequences of length n, kept in their original order.

Answer:
The number of 1-grams + 2-grams + 3-grams is 5 + 4 + 3 = 12

### 2.7 Representing text numerically
- Bag-of-words
    - simple way to represent text in machine learning
    - discards info about grammar and word order
    - computes frequency of occurrence
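
A bag-of-words can be sketched with `collections.Counter`. The helper `bag_of_words` is a hypothetical name used only for this example; it shows that two phrases with opposite meanings collapse to the same representation once word order is discarded.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Count lowercase alphanumeric tokens, ignoring their order."""
    return Counter(re.findall(r'[a-z0-9]+', text.lower()))

first = bag_of_words('Red, not blue')
second = bag_of_words('blue, not red')

print(first)
print(first == second)
```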
    
sklearn tools for bag-of-words
- CountVectorizer()
    - Tokenizes all the strings
    - builds a 'vocabulary' (of all the words)
    - counts the occurrences of each token in the vocabulary
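
Those three steps can be seen on a tiny, made-up corpus (the strings in `docs` are hypothetical). This sketch uses CountVectorizer's default token pattern, which lowercases text and keeps tokens of two or more word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical line-item descriptions, for illustration only
docs = ['fuel and fluids', 'fuel for bus', 'bus driver overtime']

vec = CountVectorizer()
counts = vec.fit_transform(docs)   # sparse matrix: rows = docs, cols = tokens

# vocabulary_ maps each token to its column index in `counts`
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(counts.toarray())
```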
    

In [None]:
# CountVectorizer() - use on column of main dataset
from sklearn.feature_extraction.text import CountVectorizer

# regex that does whitespace split
TOKENS_BASIC = '\\S+(?=\\s+)'

# replace NaN with empty string
df['Program_Description'] = df['Program_Description'].fillna('')

# CountVectorizer object to create bag-of-words
vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)

# use fit and transform pattern
# fit will parse the text and create the vocabulary
# (transform would then tokenize the text and create an array of counts)
vec_basic.fit(df.Program_Description)

msg = 'There are {} tokens in Program_Description if tokens are any non-whitespace'

# get_feature_names_out() needs scikit-learn >= 1.0;
# older versions use get_feature_names()
print(msg.format(len(vec_basic.get_feature_names_out())))

### 2.8 Creating a bag-of-words in scikit-learn
In this exercise, you'll study the effects of tokenizing in different ways by comparing the bag-of-words representations resulting from different token patterns.

You will focus on one feature only, the Position_Extra column, which describes any additional information not captured by the Position_Type label.

For example, in the Shell you can check out the budget item in row 8960 of the data using df.loc[8960]. The output shows that the Object_Description for this row is overtime pay. For whom? The Position_Type is merely "other", but Position_Extra elaborates: "BUS DRIVER". Explore the column further to see more instances. It has a lot of NaN values.

Your task is to turn the raw text in this column into a bag-of-words representation by creating tokens that contain only alphanumeric characters.

For comparison purposes, the first 15 tokens of vec_basic, which splits df.Position_Extra into tokens when it encounters only whitespace characters, have been printed along with the length of the representation.

In [None]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Fill missing values in df.Position_Extra
# (assign back instead of inplace=True on a column selection)
df['Position_Extra'] = df['Position_Extra'].fillna('')

# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)

# Print the number of tokens and first 15 tokens
# get_feature_names_out() needs scikit-learn >= 1.0;
# older versions use get_feature_names()
msg = "There are {} tokens in Position_Extra if we split on non-alphanumeric"
print(msg.format(len(vec_alphanumeric.get_feature_names_out())))
print(vec_alphanumeric.get_feature_names_out()[:15])


### 2.9 Combining text columns for tokenization
In order to get a bag-of-words representation for all of the text data in our DataFrame, you must first convert the text data in each row of the DataFrame into a single string.

In the previous exercise, this wasn't necessary because you only looked at one column of data, so each row was already just a single string. 

- CountVectorizer expects each row to just be a single string, so in order to use all of the text columns, you'll need a method to turn a list of strings into a single string.

In this exercise, you'll complete the function definition combine_text_columns(). When completed, this function will convert all training text data in your DataFrame to a single string per row that can be passed to the vectorizer object and made into a bag-of-words using the .fit_transform() method.

Note that the function uses NUMERIC_COLUMNS and LABELS to determine which columns to drop. These lists have been loaded into the workspace.

In [None]:
# Define combine_text_columns() to merge all text columns into one string per row
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Converts all text in each row of data_frame to a single string """
    
    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # Replace nans with blanks
    text_data.fillna('',inplace=True)
    
    # Join all text items in a row, separated by a single space
    # axis=1 applies the lambda to each row (vs each column)
    return text_data.apply(lambda x: " ".join(x), axis=1)

### 2.10 What's in a token?
Now you will use combine_text_columns to convert all training text data in your DataFrame to a single vector that can be passed to the vectorizer object and made into a bag-of-words using the .fit_transform() method.

You'll compare the effect of tokenizing using any non-whitespace characters as a token and using only alphanumeric characters as a token.

In [None]:
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the basic token pattern
TOKENS_BASIC = '\\S+(?=\\s+)'

# Create the alphanumeric token pattern
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate basic CountVectorizer: vec_basic
vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)

# Instantiate alphanumeric CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Create the text vector
text_vector = combine_text_columns(df)

# Fit and transform vec_basic
vec_basic.fit_transform(text_vector)

# Print number of tokens of vec_basic
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
print("There are {} tokens in the dataset".format(
    len(vec_basic.get_feature_names_out())))

# Fit and transform vec_alphanumeric
vec_alphanumeric.fit_transform(text_vector)

# Print number of tokens of vec_alphanumeric
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
print("There are {} alpha-numeric tokens in the dataset".format(
    len(vec_alphanumeric.get_feature_names_out())))

# There are 1405 tokens in the dataset
# There are 1117 alpha-numeric tokens in the dataset

# 3. Improve benchmark model using pipelines
- use text and numeric data
- use pipelines to process multiple types of data
- use pipeline workflow

The pipeline workflow
- repeatable way to go from raw data to trained model
- Pipeline object takes sequential list of steps
    - output of one step is input to next step
- Each step is a tuple with two elements
    1. Name: string
    2. Transform: object implementing .fit() and .transform() (the final step, usually the classifier, only needs .fit())
- Flexible: a step can itself be another pipeline
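
The points above can be sketched on hypothetical toy data. The step names ('prep', 'imp', 'scale', 'clf') and the choice of StandardScaler are assumptions made for illustration, and SimpleImputer assumes scikit-learn 0.20 or later.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value
X = np.array([[1.0], [2.0], [np.nan], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A step can itself be a pipeline: 'prep' bundles two transforms
prep = Pipeline([
    ('imp', SimpleImputer()),       # fill NaN with the column mean
    ('scale', StandardScaler()),    # zero mean, unit variance
])

pl = Pipeline([
    ('prep', prep),                 # (name, transform) tuple
    ('clf', LogisticRegression()),  # final step only needs .fit()
])

pl.fit(X, y)
print([name for name, _ in pl.steps])
print(pl.predict(X).shape)
```

Calling `pl.fit` runs fit and transform on each step in order and passes the result on, so the classifier never sees the NaN.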

In [None]:
# simple pipeline with one step
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
pl = Pipeline([
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# train and test with sample numeric data to predict 'label'
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                    sample_df[['numeric']],
                                    pd.get_dummies(sample_df['label']),
                                    random_state=2)
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print('accuracy on numeric data, no nans: ', accuracy)
# accuracy on numeric data, no nans: 0.44

# Adding more steps to the pipeline
# Adding 'with_missing' gives an error because the input has NaN
X_train, X_test, y_train, y_test = train_test_split(
                                    sample_df[['numeric','with_missing']],
                                    pd.get_dummies(sample_df['label']),
                                    random_state=2)
# pl.fit(X_train, y_train)  # raises ValueError: input contains NaN

# Add step to pipeline: imputer to take care of NaN values
# Default imputer in sklearn fills NaN with the column mean
# (SimpleImputer replaces the older sklearn.preprocessing.Imputer,
#  which was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
pl = Pipeline([
        ('imp', SimpleImputer()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print('accuracy on numeric data, incl nans: ', accuracy)
# accuracy on numeric data, incl nans: 0.48

### 3.1 Instantiate pipeline
In order to make your life easier as you start to work with all of the data in your original DataFrame, df, it's time to turn to one of scikit-learn's most useful objects: the Pipeline.

For the next few exercises, you'll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset.

The sample data is stored in the DataFrame, sample_df, which has three kinds of feature data: numeric, text, and numeric with missing values. It also has a label column with two classes, a and b.

In this exercise, your job is to instantiate a pipeline that trains using the numeric column of the sample data.

In [None]:
# Import Pipeline
from sklearn.pipeline import Pipeline

# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Split and select numeric data only, no nans 
X_train, X_test, y_train, y_test = train_test_split(
    sample_df[['numeric']],
    pd.get_dummies(sample_df['label']), 
    random_state=22)

# Instantiate Pipeline object: pl
pl = Pipeline([
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)

# Accuracy on sample data - numeric, no nans:  0.62

### 3.2 Preprocessing numeric features
Preprocessing step to deal with NaNs

What would have happened if you had included the 'with_missing' column in the last exercise? Without imputing missing values, fitting the pipeline would fail because the input contains NaN (try it and see). So, in this exercise you'll improve your pipeline a bit by using scikit-learn's imputation transformer, Imputer() (named SimpleImputer in sklearn.impute since scikit-learn 0.20), to fill in missing values in your sample data.

By default, the imputer transformer replaces NaNs with the mean value of the column. That's a good enough imputation strategy for the sample data, so you won't need to pass anything extra to the imputer.

After importing the transformer, you will edit the steps list used in the previous exercise by inserting a (name, transform) tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your preprocessing step is put in the right place.

The sample_df is in the workspace, in case you'd like to take another look. Make sure to select both numeric columns. In the previous exercise we couldn't use with_missing because we had no preprocessing step!
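
To make the default mean imputation concrete, here is a minimal sketch on a hypothetical three-row column. It assumes a recent scikit-learn, where the imputer is available as SimpleImputer in sklearn.impute:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value
X = np.array([[1.0], [np.nan], [3.0]])

# The default strategy is 'mean': NaNs become the column mean
imputer = SimpleImputer()
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # the per-column means learned during fit
print(X_filled.ravel())     # the NaN is replaced by the mean of 1.0 and 3.0
```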

In [None]:
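# Import the Imputer object.
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# SimpleImputer replaces it and is aliased to Imputer here so the
# pipelines below run unchanged.
try:
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    from sklearn.preprocessing import Imputer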

# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(
                                    sample_df[['numeric', 'with_missing']],
                                    pd.get_dummies(sample_df['label']), 
                                    random_state=456)

# Instantiate Pipeline object: pl
pl = Pipeline([
        ('imp', Imputer()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all numeric, incl nans: ", accuracy)

# Accuracy on sample data - all numeric, incl nans:  0.636

### 3.3 Text features and feature unions - in pipeline


In [None]:
# Preprocessing text features

from sklearn.feature_extraction.text import CountVectorizer

X_train, X_test, y_train, y_test = train_test_split(
                                    sample_df['text'],
                                    pd.get_dummies(sample_df['label']), 
                                    random_state=2)
pl = Pipeline([
        ('vec', CountVectorizer()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data: ", accuracy)

#### 3.3.a Preprocessing multiple dtypes
- Want to use ALL available features in one pipeline
- Problem
    - Pipeline steps for numeric and text processing can't follow each other
    - ie. output of CountVectorizer can't be input to Imputer
- Need to separately operate on text vs numeric columns
- Solution
    - FunctionTransformer()
    - FeatureUnion()

FunctionTransformer
- job: Turns a python function into an object that scikit-learn pipeline can understand
- Need to write two functions for pipeline preprocessing
    - Take entire DataFrame, return numeric columns
    - Take entire DataFrame, return text columns
- Can then preprocess numeric and text data in separate pipelines

In [None]:
# FunctionTransformer()
X_train, X_test, y_train, y_test = train_test_split(
                                    sample_df[['numeric','with_missing',
                                             'text']],
                                    pd.get_dummies(sample_df['label']), 
                                    random_state=2)
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion

# Create 2 FunctionTransformer objects.
# validate=False means don't check for NaNs or validate dtypes,
# as we'll handle that ourselves.
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
# Double brackets select a list of columns and return a DataFrame
get_numeric_data = FunctionTransformer(lambda x: x[['numeric',
                                                    'with_missing']],
                                       validate=False)

In [None]:
# FeatureUnion Text and Numeric features
# See how FeatureUnion works and how it works in pipeline below
from sklearn.pipeline import FeatureUnion

# Define the per-dtype pipelines first so the union can reference them
numeric_pipeline = Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
])
text_pipeline = Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
])
union = FeatureUnion([
        ('numeric', numeric_pipeline),
        ('text', text_pipeline)
])

# entire pipeline
pl = Pipeline([
    ('union', FeatureUnion([
        ('numeric', numeric_pipeline),
        ('text', text_pipeline)
    ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data: ", accuracy)

#### 3.3.b Preprocessing text features
Here, you'll perform a similar preprocessing pipeline step, only this time you'll use the text column from the sample data.

To preprocess the text, you'll turn to CountVectorizer() to generate a bag-of-words representation of the data, as in Chapter 2. Using the default arguments, add a (step, transform) tuple to the steps list in your pipeline.

Make sure you select only the text column for splitting your training and test sets.

As usual, your sample_df is ready and waiting in the workspace.
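
As a quick illustration of what a bag-of-words representation looks like, the sketch below runs CountVectorizer with default arguments on three hypothetical strings (not taken from sample_df):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
docs = ["foo bar", "foo", "bar bar baz"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix: rows = docs, columns = tokens

# Column order follows the alphabetically sorted vocabulary
print(sorted(vec.vocabulary_))
print(counts.toarray())
```

Each row counts how often each vocabulary token appears in the corresponding string, and that matrix is what the classifier step receives.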

In [None]:
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Split out only the text data
X_train, X_test, y_train, y_test = train_test_split(
                                        sample_df['text'],
                                        pd.get_dummies(sample_df['label']), 
                                        random_state=456)

# Instantiate Pipeline object: pl
pl = Pipeline([
        ('vec', CountVectorizer()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - just text data: ", accuracy)

# Accuracy on sample data - just text data:  0.808

#### 3.3.c Multiple types of processing: FunctionTransformer
The next two exercises will introduce new topics you'll need to make your pipeline truly excel.

Any step in the pipeline must be an object that implements the fit and transform methods. The FunctionTransformer creates an object with these methods out of any Python function that you pass to it. We'll use it to help select subsets of data in a way that plays nicely with pipelines.

You are working with numeric data that needs imputation, and text data that needs to be converted into a bag-of-words. You'll create functions that separate the text from the numeric variables and see how the .fit() and .transform() methods work.

In [None]:
# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Obtain the text data: get_text_data
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)

# Obtain the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[['numeric',
                                                    'with_missing']], 
                                                    validate=False)

# Fit and transform the text data: just_text_data
just_text_data = get_text_data.fit_transform(sample_df)

# Fit and transform the numeric data: just_numeric_data
just_numeric_data = get_numeric_data.fit_transform(sample_df)

# Print head to check results
print('Text Data')
print(just_text_data.head())
print('\nNumeric Data')
print(just_numeric_data.head())

# Text Data
# 0           
# 1        foo
# 2    foo bar
# 3           
# 4    foo bar
# Name: text, dtype: object

# Numeric Data
#      numeric  with_missing
# 0 -10.856306      4.433240
# 1   9.973454      4.310229
# 2   2.829785      2.469828
# 3 -15.062947      2.852981
# 4  -5.786003      1.826475

#### 3.3.d Multiple types of processing: FeatureUnion
### Nested Pipeline example
Now that you can separate text and numeric data in your pipeline, you're ready to perform separate steps on each by nesting pipelines and using FeatureUnion().

These tools will allow you to streamline all preprocessing steps for your model, even when multiple datatypes are involved. Here, for example, you don't want to impute your text data, and you don't want to create a bag-of-words from your numeric data. Instead, you want to deal with these separately and then join the results together using FeatureUnion().

In the end, you'll still have only two high-level steps in your pipeline: preprocessing and model instantiation. The difference is that the first preprocessing step actually consists of a pipeline for numeric data and a pipeline for text data. The results of those pipelines are joined using FeatureUnion().
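
To see what "joined" means in practice, the following sketch builds a FeatureUnion over a hypothetical three-row DataFrame and inspects the shape of the combined feature matrix. The names toy_df, select_numeric and select_text are illustrative, and SimpleImputer is assumed as the imputer (recent scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data with the same column names as sample_df
toy_df = pd.DataFrame({
    'numeric': [1.0, 2.0, 3.0],
    'with_missing': [0.5, np.nan, 1.5],
    'text': ['foo bar', 'foo', 'bar'],
})

select_numeric = FunctionTransformer(
    lambda x: x[['numeric', 'with_missing']], validate=False)
select_text = FunctionTransformer(lambda x: x['text'], validate=False)

union = FeatureUnion([
    ('numeric_features', Pipeline([
        ('selector', select_numeric),
        ('imputer', SimpleImputer())
    ])),
    ('text_features', Pipeline([
        ('selector', select_text),
        ('vectorizer', CountVectorizer())
    ]))
])

# Columns from each branch are stacked side by side:
# 2 numeric columns + 2 token columns ('bar', 'foo')
features = union.fit_transform(toy_df)
print(features.shape)  # (3, 4)
```

The classifier step never sees the original columns, only this single combined matrix.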

In [None]:
# Nested pipeline and featureunion

# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Split using ALL data in sample_df
X_train, X_test, y_train, y_test = train_test_split(
                            sample_df[['numeric', 'with_missing', 'text']],
                            pd.get_dummies(sample_df['label']), 
                            random_state=22)

# Create a FeatureUnion with nested pipeline: process_and_join_features
process_and_join_features = FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )

# Instantiate nested pipeline: pl
pl = Pipeline([
        ('union', process_and_join_features),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])


# Fit pl to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all data: ", accuracy)

# Accuracy on sample data - all data:  0.928

### 3.4 Choosing a classification model
Recall school budget dataset
- 14 text columns


In [None]:
# Main dataset: lots of text
LABELS = ['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type',
         'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status']
NON_LABELS = [c for c in df.columns if c not in LABELS]
len(NON_LABELS) - len(NUMERIC_COLUMNS)
# 14

# Recall combine_text_columns() that combined all text columns into
# a single column.
# Use this function in the budget dataset.

# Using pipeline with main dataset
import numpy as np
import pandas as pd
df = pd.read_csv('TrainingSetSample.csv', index_col=0)
dummy_labels = pd.get_dummies(df[LABELS])
X_train,X_test,y_train,y_test = multilabel_train_test_split(
                                df[NON_LABELS], dummy_labels, 0.2)

get_text_data = FunctionTransformer(combine_text_columns,
                                   validate=False)
get_numeric_data = FunctionTransformer(lambda x:
                                      x[NUMERIC_COLUMNS], validate=False)

pl = Pipeline([
    ('union', FeatureUnion([
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
            ])
    ),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
pl.fit(X_train, y_train)


#### 3.4.a Flexibility of model step
- Is current model the best?
- Can quickly try different models with pipelines
    - Pipeline preprocessing steps unchanged
    - Edit the model step in your pipeline
    - Ie. classifiers: Random Forest, Naïve Bayes, k-NN
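
One way to compare several models is to loop over candidate classifiers while reusing the same preprocessing step. The sketch below is illustrative only: it uses synthetic single-label data rather than the budget dataset, and the candidates dictionary and its keys are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic data: the label depends on the first two features
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    'logreg': LogisticRegression(),
    'random_forest': RandomForestClassifier(n_estimators=15, random_state=0),
    'knn': KNeighborsClassifier(),
}

scores = {}
for name, model in candidates.items():
    # Only the 'clf' step changes; the preprocessing step stays the same
    pl = Pipeline([
        ('imputer', SimpleImputer()),
        ('clf', model)
    ])
    pl.fit(X_train, y_train)
    scores[name] = pl.score(X_test, y_test)

print(scores)
```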

In [None]:
# Easily try new models using pipeline: ie. Random Forest
# import classifier
from sklearn.ensemble import RandomForestClassifier
pl = Pipeline([
    ('union', FeatureUnion([
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
            ])
    ),
    # change one line for different model
    ('clf', OneVsRestClassifier(RandomForestClassifier()))
])
pl.fit(X_train, y_train)

### Using FunctionTransformer on the main dataset
In this exercise you're going to use FunctionTransformer on the primary budget data, before instantiating a multiple-datatype pipeline in the next exercise.

Recall from Chapter 2 that you used a custom function combine_text_columns to select and properly format text data for tokenization; it is loaded into the workspace and ready to be put to work in a function transformer!

Concerning the numeric data, you can use NUMERIC_COLUMNS, preloaded as usual, to help design a subset-selecting lambda function.

You're all finished with sample data. The original df is back in the workspace, ready to use.

In [None]:
# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])

# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]

# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2, 
                                                               seed=123)

# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)

# Preprocess the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], 
                                       validate=False)


### Add a model to the pipeline
You're about to take everything you've learned so far and implement it in a Pipeline that works with the real, DrivenData budget line item data you've been exploring.

Surprise! The structure of the pipeline is exactly the same as earlier in this chapter:

- the preprocessing step uses FeatureUnion to join the results of nested pipelines that each rely on FunctionTransformer to select multiple datatypes
- the model step stores the model object

You can then call familiar methods like .fit() and .score() on the Pipeline object pl.

In [None]:
# Complete the pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

# Accuracy on budget dataset:  0.203846153846

### Try a different class of model: RandomForest
Now you're cruising. One of the great strengths of pipelines is how easy they make the process of testing different models.

Until now, you've been using the model step ('clf', OneVsRestClassifier(LogisticRegression())) in your pipeline.

But what if you want to try a different model? Do you need to build an entirely new pipeline? New nests? New FeatureUnions? Nope! You just have a simple one-line change, as you'll see in this exercise.

In particular, you'll swap out the logistic-regression model and replace it with a random forest classifier, which uses the statistics of an ensemble of decision trees to generate predictions.


In [None]:
# Import random forest classifier
from sklearn.ensemble import RandomForestClassifier

# Edit model step in pipeline
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        # removed OneVsRestClassifier from previous example
        ('clf', RandomForestClassifier())
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

# Accuracy on budget dataset:  0.296153846154

### Can you adjust the model or parameters to improve accuracy?
You just saw a substantial improvement in accuracy by swapping out the model. Pipelines are amazing!

Can you make it better? Try changing the parameter n_estimators of RandomForestClassifier() to 15. (The default was 10 when this exercise was written; scikit-learn 0.22 raised it to 100.)
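
Besides rebuilding the pipeline, a parameter of a nested step can also be changed in place with set_params, using the `<step name>__<parameter>` naming convention. This is a minimal sketch on synthetic data (not the budget dataset), assuming SimpleImputer as the imputer:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

pl = Pipeline([
    ('imputer', SimpleImputer()),
    ('clf', RandomForestClassifier(random_state=0))
])

# 'clf__n_estimators' targets the n_estimators parameter of the 'clf' step
pl.set_params(clf__n_estimators=15)
pl.fit(X, y)

# The fitted forest contains one tree per estimator
print(len(pl.named_steps['clf'].estimators_))
```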

In [None]:
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Add model step to pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ('clf', RandomForestClassifier(n_estimators=15))
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

# Accuracy on budget dataset:  0.321153846154

# 4. Learning from experts/other models