# Simple classifier for both text and non-text features
I've built a fair number of text classifiers, and I usually don't do rules-based feature extraction. When I went to do it for a recent project, however, I discovered a fun fact:

There are a hell of a lot of roadblocks.

Here's a condensed version of the journey, for myself and posterity. <3 <3

## Setup and imports

I'm using Python 3.6.3 (if you're using 2.7 still... I'm sorry? Here's your [deathclock](https://pythonclock.org/).)

* [pandas 0.21.0](https://pandas.pydata.org/pandas-docs/stable/whatsnew.html)
* [numpy 1.13.3](https://github.com/numpy/numpy/releases/tag/v1.13.3)
* [xgboost 0.6a2](https://pypi.python.org/pypi/xgboost/)
* [scikit-learn 0.19.1](https://pypi.python.org/pypi/scikit-learn/0.19.1)

In [None]:
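# from nlp.icd.classifier import *  # project-local module; the names this post needs are imported explicitly below
import pandas as pd
import warnings
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
from datetime import datetime
import re
# from collections import Counter
from nltk import ngrams
import numpy as np
import xgboost as xgb
from IPython.display import display
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline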

## Data
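I'll use the most basic sample data here. Keep in mind that testing with single letters like 'a', 'b', etc. won't work when we go to vectorize the text: scikit-learn's default token pattern only keeps tokens of two or more characters, so single letters get dropped much like stopwords. An empty vector model (a matrix of all zeros, or in the vectorizer's terms an empty vocabulary) is checked for and disallowed.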

For that reason, our test data has full words in the textual column.
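A quick way to see why single letters vanish: the sketch below applies the regular expression that scikit-learn documents as the default `token_pattern` for its text vectorizers. The `tokenize` helper is a made-up stand-in for illustration, not part of scikit-learn, and it skips everything else the real vectorizer does.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers:
# it only matches runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Hypothetical helper: lowercase, then keep tokens of 2+ characters."""
    return TOKEN_PATTERN.findall(doc.lower())


print(tokenize("a"))           # []
print(tokenize("love"))        # ['love']
print(tokenize("I love a b"))  # ['love']
```

A column made up entirely of single letters therefore produces no tokens at all, which is the empty-vocabulary case described above.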


In [50]:
data = pd.DataFrame()
data['x1'] = [0, 1, 1, 0]
data['x2'] = ['love', 'hate', 'kiss', 'kill']
data['y'] = [1, 1, 0, 0]

Our input matrix (denoted X in math and machine learning but, because I pretend to follow PEP 8, 'x' henceforth in my code) should contain both the text and non-text columns, and y is brilliantly labelled already.

In [51]:
x = data[['x1', 'x2']].copy()
y = data['y']

Input matrix with records and features:

In [52]:
display(x)

| | x1 | x2 |
|---|---|---|
| 0 | 0 | love |
| 1 | 1 | hate |
| 2 | 1 | kiss |
| 3 | 0 | kill |


Output array with the right answers for each input record:

In [53]:
display(y)

0    1
1    1
2    0
3    0
Name: y, dtype: int64

For later on, we'll go ahead and split it into train and test sets.

In [54]:
# x, y = shuffle(x, y, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, 
                                                    y, test_size=0.5, 
                                                    stratify=y)
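To make the effect of `stratify` concrete, here is a small self-contained sketch on the same toy data. The `random_state=0` argument and the `toy`, `features_*`, and `labels_*` names are my additions for repeatability and are not in the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "x1": [0, 1, 1, 0],
    "x2": ["love", "hate", "kiss", "kill"],
    "y": [1, 1, 0, 0],
})

# random_state is an assumption added so the split is repeatable.
features_train, features_test, labels_train, labels_test = train_test_split(
    toy[["x1", "x2"]], toy["y"], test_size=0.5, stratify=toy["y"], random_state=0
)

# Stratifying keeps the 50/50 class balance in both halves.
print(sorted(labels_train.tolist()))  # [0, 1]
print(sorted(labels_test.tolist()))   # [0, 1]
```

Which rows land in which half still depends on the seed; only the class balance is guaranteed.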

## Setting up the pipeline
Because there's kind of a lot going on in the data preparation before we can call the final ```.fit(x, y)```, I use sklearn's incredibly inconvenient Pipeline API.

The overall flow of what we need to do looks something like this:
1. pass the textual column to a vectorizer
2. don't do anything to the non-textual column, it's already one-hot encoded and everything
3. join both the features in a FeatureUnion
4. make the meta-pipeline from the feature union (data-prep) to the xgboost classifier (the model itself)

The steps in the pipeline (except for the end model, in our case an XGBClassifier) have to all implement ```transform(...)``` and ```fit(...)```.
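The contract is small enough to sketch without scikit-learn at all. In the toy below, `Lowercase`, `Lengths`, and `run_steps` are hypothetical names made up for illustration; the real `Pipeline` does more (parameter handling, caching, a final estimator), but the fit-then-transform chaining has roughly this shape.

```python
class Lowercase:
    """Stateless step: nothing to learn, so fit just returns self."""

    def fit(self, xs, y=None):
        return self

    def transform(self, xs):
        return [s.lower() for s in xs]


class Lengths:
    """Turn each string into a one-element row holding its length."""

    def fit(self, xs, y=None):
        return self

    def transform(self, xs):
        return [[len(s)] for s in xs]


def run_steps(steps, xs, y=None):
    """Fit each step on the current data, then pass its output along."""
    for step in steps:
        xs = step.fit(xs, y).transform(xs)
    return xs


print(run_steps([Lowercase(), Lengths()], ["Love", "HATE"]))  # [[4], [4]]
```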

You'll see why I don't particularly enjoy working with pipelines, starting with all the little wrapper classes referred to as **transformations** (read "functions"). I come from the world of FP and have never really understood why languages don't make it easier to pass around simple functions, but I digress.

### Transforming sparse matrices

The initial transformation I define here is kind of a weird one. Its purpose is to transform tfidf sparse matrices into something usable downstream.

This is firstly to get around [a bug in xgboost](https://github.com/dmlc/xgboost/issues/1238) in the processing of csr sparse matrices (xgboost can't do it, so I convert them to pandas' experimental [SparseDataFrames](https://pandas.pydata.org/pandas-docs/stable/sparse.html#sparsedataframe)). For what it's worth, if you build from source they've got it fixed; I installed with pip3, so no such luck.

The second purpose of this little guy is to get around [another bug, this time in pandas](https://github.com/pandas-dev/pandas/issues/5470).
We have to add a column of ones to the tfidf matrix (you'll have to do it anytime you work with a sparse matrix, probably). You can technically get around it by calling ```.todense()``` on the matrix or by converting it to an array, but that's **incredibly** memory-intensive. Like, intractably so for any commercial problem.

Adding the column of 1s is the best way I've found to do it.

In [55]:
class FixSparseMatrix:

    @staticmethod
    def transform(xs):
        # work around https://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543
        # another problem: https://github.com/pandas-dev/pandas/issues/5470
        df = pd.SparseDataFrame(xs)
        # append the column of ones under a fresh integer label
        df[df.shape[1] + 1] = np.ones(df.shape[0])
        # return df.to_coo()
        return df.values

    def fit(self, *args):
        # nothing to learn; return self so the pipeline can chain calls
        return self
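For reference, the "append a column of ones" step can also be written against scipy directly, which keeps everything sparse. This is a sketch under assumptions and not the transformer above: `add_ones_column` is a hypothetical helper, and whether its CSR output sidesteps the xgboost issue depends on the xgboost version you have installed.

```python
import numpy as np
from scipy import sparse


def add_ones_column(matrix):
    """Hypothetical helper: append a column of 1s without densifying."""
    ones = sparse.csr_matrix(np.ones((matrix.shape[0], 1)))
    return sparse.hstack([matrix, ones], format="csr")


# Small stand-in for a tf-idf matrix, including an all-zero row.
tfidf_like = sparse.csr_matrix(np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]))
fixed = add_ones_column(tfidf_like)

print(fixed.shape)      # (3, 3)
print(fixed.toarray())  # small enough to densify just for printing
```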

### Transforming a matrix into a subset of itself
Technically a transformer, but it would have been *so* much more convenient if we thought of these as functions... ugh.

Basically, if you remember our input matrix x, it had heterogeneous data types that need to be processed differently. A machine learning algorithm takes a matrix of numbers, and right now we have text - something has to convert the text into corresponding numeric values (NLP to the rescue!).

Before we can convert the text though, we (annoyingly) have to convert x into the subset we want (in this case, it's ```x['x2']```).

I took this directly from the [scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/hetero_feature_union) on using a TfidfVectorizer in a larger pipeline. They call it "ItemSelector", but in an act of defiance I name it consistently with the rest of the pipeline. Keep in mind that any transformation needs to implement both ```fit(...)``` and ```transform(...)```. 

In [56]:
class Subset(TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
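If the wrapper class feels heavy, scikit-learn's `FunctionTransformer` (in `sklearn.preprocessing`) wraps a plain function in the same fit/transform interface. A minimal sketch, assuming the same toy columns as above; `select_text` and `text_only` are made-up names, and `validate=False` is passed explicitly so the string column is handed through untouched.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({"x1": [0, 1, 1, 0], "x2": ["love", "hate", "kiss", "kill"]})


def select_text(frame):
    """Plain function doing the same job as Subset(key='x2')."""
    return frame["x2"]


# validate=False: skip the numeric array check so strings pass through as-is.
text_only = FunctionTransformer(select_text, validate=False)

print(text_only.fit_transform(toy).tolist())  # ['love', 'hate', 'kiss', 'kill']
```

The class version is still handy when the column name needs to be a tunable parameter; for a fixed column, the function form is shorter.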

### Reshaping within a subset transformation
So for some data, all we need is to grab a subset. But unfortunately, we're joining multiple features into one pipeline. Because [sklearn's FeatureUnion does not properly handle one-dimensional arrays](https://stackoverflow.com/questions/42022487/combining-heterogenous-features-in-scikit-learn), you have to reshape 1d's into 2d's. Otherwise you'll get 
```ValueError: all the input arrays must have same number of dimensions```

This happens when we try to grab just the non-textual data from x.

We need another wrapper class. Look at all these lines of code for such teeny functionality!

In [57]:
class ReshapeSubset(TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key].values.reshape(-1, 1)
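The shape problem is easy to reproduce with plain numpy, outside of any pipeline. In this sketch `text_features` is just a block of zeros standing in for the vectorized text, and all variable names are made up for illustration.

```python
import numpy as np

text_features = np.zeros((4, 3))  # stand-in for a 2-D block of text features
flags = np.array([0, 1, 1, 0])    # 1-D, shape (4,)

try:
    np.hstack([text_features, flags])
    stacked_without_reshape = True
except ValueError:
    stacked_without_reshape = False  # mixing 2-D and 1-D inputs is rejected

column = flags.reshape(-1, 1)  # shape (4, 1): one row per record
combined = np.hstack([text_features, column])

print(stacked_without_reshape)                    # False
print(flags.shape, column.shape, combined.shape)  # (4,) (4, 1) (4, 4)
```

The `-1` tells numpy to infer the number of rows, so the same call works for any number of records.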

### Transforming text into numbers
I didn't write this class, so I don't get to be snarky about what it's called - but the bread and butter of NLP is the vector model - one of many ways to convert text into numbers. I instantiate a TfidfVectorizer from scikit-learn, just for funsies. 

It already implements ```fit(...)``` and ```transform(...)```, so it'll fit into the pipeline as-is.

In [58]:
vec = TfidfVectorizer(analyzer='word')
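To see what the vectorizer hands to the next step, here is a standalone run on the four toy words. `demo_vec`, `words`, and `matrix` are names made up for this sketch so it doesn't touch the `vec` instance used in the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

words = ["love", "hate", "kiss", "kill"]
demo_vec = TfidfVectorizer(analyzer="word")
matrix = demo_vec.fit_transform(words)  # scipy sparse matrix, one row per document

print(matrix.shape)                  # (4, 4): four documents, four distinct words
print(sorted(demo_vec.vocabulary_))  # ['hate', 'kill', 'kiss', 'love']
print(matrix.toarray())              # tiny here, so densifying for display is fine
```

Each document here is a single word, so every row has exactly one non-zero entry. Real text gives wider, much sparser rows, which is why the output stays in a sparse format.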

## Hooking it up 
IT BEGINS (after way too much effort)

Take the text column and vectorize it, transforming its output into a pandas SparseDataFrame to get around those bugs in the libraries:

In [59]:
text_pipeline = make_pipeline(Subset(key='x2'), vec, FixSparseMatrix())

It's annoying, but afaik you have to label the transformations in sklearn (I'm making dummies for them with 't' and 'nt'). The only time you don't have to do that is when you build the object with a helper such as make_pipeline, which names the steps for you. FeatureUnion has an equivalent helper, make_union, but I spell the labels out here.

In [60]:
all_features = FeatureUnion([('t', text_pipeline), ('nt', ReshapeSubset(key='x1'))])
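For comparison, scikit-learn's `make_union` helper (in `sklearn.pipeline`) builds a `FeatureUnion` and generates the labels for you. The sketch below is self-contained and makes a few assumptions that differ from the cells above: it redefines `Subset` and `ReshapeSubset` with `BaseEstimator` added as a base class so they play nicely with newer scikit-learn releases, and it leaves out `FixSparseMatrix`, so it shows the union itself and not the xgboost workaround.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union


class Subset(TransformerMixin, BaseEstimator):
    """Select one column; same idea as the Subset class above."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data):
        return data[self.key]


class ReshapeSubset(Subset):
    """Select one column and return it as an (n, 1) array."""

    def transform(self, data):
        return data[self.key].values.reshape(-1, 1)


toy = pd.DataFrame({"x1": [0, 1, 1, 0], "x2": ["love", "hate", "kiss", "kill"]})

# make_union derives each label from the lowercased class name.
union = make_union(
    make_pipeline(Subset(key="x2"), TfidfVectorizer(analyzer="word")),
    ReshapeSubset(key="x1"),
)

print([name for name, _ in union.transformer_list])  # ['pipeline', 'reshapesubset']
print(union.fit_transform(toy).shape)  # (4, 5): four tf-idf columns plus x1
```

Explicit labels like 't' and 'nt' are still worth having if you plan to reference the steps by name later, for example in a parameter grid.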

And we've finally arrived at the finished pipeline:

In [61]:
pipeline = make_pipeline(all_features, xgb.XGBClassifier())

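Now we can train and test on the raw x, y as if it were one single step. Note that this cell fits on all four rows, so the test rows scored below were also seen in training; with real data you'd fit on x_train, y_train instead.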

In [62]:
pipeline.fit(x, y)

Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('t', Pipeline(memory=None,
     steps=[('subset', <__main__.Subset object at 0x119070978>), ('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding...logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1))])

## And the results?
Well, these results are terrible, which is exactly as accurate as the data. But the pipeline works! Go find you some better data.

In [65]:
predictions = pipeline.predict(x_test)
predictions = [round(value) for value in predictions]
accuracy = accuracy_score(y_test, predictions)
msg = "Accuracy: %.2f%%\n" % (accuracy * 100.0)
print(msg)

Accuracy: 50.00%

