# Relevance Classification Models V2

This notebook tests a series of modifications to the classification models. The potential improvements are:

- Inclusion of bigrams and trigrams
- Inclusion of POS-tagged tokens
- More model types

**Scikit-learn n-grams**
- CountVectorizer with n-grams 1 to n
- Tf-idf
- SVD (sklearn/gensim/jakevdp method)

**Scikit-learn n-grams with gensim SVD**
- CountVectorizer with n-grams 1 to n
- Convert to gensim corpus
- SVD with gensim

**Gensim n-grams**
- Build grams up to n
- Add n-grams onto original document by concatenating lists and dropping terms already in text

**Separate ngram corpora**
- Create separate bi and trigram corpora with gensim
- ML on all three and combine

**Split text into all possible n-grams**
- Use ALL n-grams
- (is this what sklearn does anyway?)

**Count Vectorizer**
- Same as methods above but with no Tf-idf
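
On the question raised above: with a word-level analyzer, scikit-learn's `CountVectorizer(ngram_range=(1, n))` does count every contiguous n-gram from length 1 up to n. The sketch below shows that enumeration in plain Python. `all_ngrams` is a hypothetical helper written for illustration, not part of scikit-learn, and it assumes the text is already tokenized.

```python
def all_ngrams(tokens, max_n=3):
    """Return every contiguous n-gram of length 1..max_n as a space-joined string."""
    grams = []
    for n in range(1, max_n + 1):
        # slide a window of width n across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams


tokens = 'flooding has stranded many people'.split()
print(all_ngrams(tokens, max_n=2))
```

Five tokens give 5 unigrams, 4 bigrams and 3 trigrams, so the feature space grows quickly with `max_n`.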

In [None]:
import pandas as pd
import numpy as np
import re
import string
from langdetect import detect
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.model_selection import KFold, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from mlxtend.classifier import StackingClassifier  # used in the Ensemble sections
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
try:
    import joblib
except ImportError:  # older scikit-learn releases bundle joblib
    from sklearn.externals import joblib

import gensim
from gensim.models.phrases import Phraser
from gensim.models import Phrases

import pandas.io.sql as psql
import psycopg2 as pg

import matplotlib.pyplot as plt
import seaborn as sns  # needed by visualize_processor_vectors
from gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel import SklLsiModel
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklLdaModel
import pycountry
import unicodedata

#%matplotlib inline

In [None]:
pd.options.display.max_columns = 999

In [None]:
import spacy
nlp = spacy.load('en_default')

In [None]:
from idetect.geotagger import strip_accents, compare_strings, strip_words, common_names, LocationType
from idetect.geotagger import subdivision_country_code, match_country_name, city_subdivision_country

## Training Data

- Import training data
- Filter to English only

In [None]:
with open('../data/stop_words_en_long.txt', 'r') as f:
    stop_words = f.read()
stop_words = stop_words.split('\n')

## Preprocessing

### Example Texts

For quick testing, a single example text and a list of example sentences are defined.

In [None]:
# example_text = 'Flooding has stranded many people. 100 families fled across the border.'
example_text = ['At least 6171 people in London were forced to evacuate homes, including more than 2600 staying in emergency shelters']

In [None]:
example_sentences = [u'Flooding has stranded many people. 100 families were evacuated by the army.',
                     u'Some people have fled to safety from the wild fire.',
                     u'Flooding has stranded more people in nearby London.',
                     u'It was all going ok but now flooding has stranded them.']

In [None]:
example_sentence = """At least six thousand people were forced to evacuate homes, including more than 2600 staying in emergency shelters, according to Taiwan's Central Emergency Operation Center. Taiwanese weather authorities lifted sea and land warnings on Thursday as Trami blew away from the island. Tell us more at jon@disasters.com."""

In [None]:
doc = nlp(example_sentence)
docs = [nlp(sent) for sent in example_sentences]

### Cleaning Processor

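The `LocationProcessor` class normalizes location and numeric mentions in the texts: country names are replaced with `Switzerland`, other places with `Zurich`, and number-like tokens with `1000`, while URLs and email addresses are dropped.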
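
The replacement rules used by the `LocationProcessor` class below can be sketched without spaCy. This is only an approximation: regular expressions and whitespace tokenization stand in for spaCy's `like_num`, `like_url` and `like_email` attributes, and the `COUNTRIES` and `PLACES` sets are hypothetical lookups standing in for the `idetect.geotagger` functions.

```python
import re

COUNTRIES = {'taiwan', 'france'}   # hypothetical lookup, for illustration only
PLACES = {'london', 'paris'}       # hypothetical lookup, for illustration only

URL_RE = re.compile(r'^(https?://|www\.)\S+$', re.IGNORECASE)
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
NUM_RE = re.compile(r'^\d[\d,.]*$')


def normalise(text):
    """Replace numbers and place names with fixed placeholders; drop URLs and emails."""
    out = []
    for tok in text.split():
        bare = tok.strip('.,;:!?')      # ignore surrounding punctuation when matching
        if URL_RE.match(bare) or EMAIL_RE.match(bare):
            continue
        if NUM_RE.match(bare):
            out.append('1000')
        elif bare.lower() in COUNTRIES:
            out.append('Switzerland')
        elif bare.lower() in PLACES:
            out.append('Zurich')
        else:
            out.append(tok)
    return ' '.join(out)


print(normalise('2600 people left London for Taiwan see www.example.com or jon@disasters.com'))
# prints: 1000 people left Zurich for Switzerland see or
```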

In [None]:
class LocationProcessor(BaseEstimator, TransformerMixin):
    """Transformer that replaces all country and subdivisions
    mentioned in text with common names.
    """
    
    def tag_entities(self, text):
        tokens = []
        for token in text:
            if token.ent_type_ == 'GPE':
                if match_country_name(token.text)[0]:
                    tokens.append('Switzerland')
                elif city_subdivision_country(token.text):
                    tokens.append('Zurich')
                else:
                    tokens.append('Zurich')
            elif token.like_num:
                tokens.append('1000')
            elif token.like_url:
                continue
            elif token.like_email:
                continue
            else:
                tokens.append(token.text)
        return tokens

    def join_phrases(self, phrases):
        joined = []
        for phrase in phrases:
            tokens = []
            for token in phrase:
                if isinstance(token, spacy.tokens.token.Token):
                    tokens.append(token.lemma_)
                else:
                    tokens.append(token)
            if len(tokens) < 2:
                continue
            joined.append('_'.join(tokens))
        return joined
    
    def single_string(self, texts):
        strings = [' '.join(t) for t in texts]
        return strings
    
    def fit(self, texts, *args):
        return self
    
    def transform(self, texts, *args):
        texts = [nlp(t) for t in texts]
        texts = [self.tag_entities(t) for t in texts]
        texts = self.single_string(texts)
        return texts

In [None]:
cleaner = LocationProcessor()

In [None]:
cleaner.fit_transform(example_sentences)

### Phrase Processor

There is often significant vocabulary overlap between articles of different categories. However, the phrases that describe actual events are often different. The `PhraseProcessor` attempts to extract short phrases from the text, in the hope that they are common within a class.
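
The phrase extraction can be illustrated on a hand-written dependency parse. The sketch below assumes each token has a parent index and that the root points at itself, as in spaCy. It joins each token with its parent and, when the parent is not the root, its grandparent. The `tokens` and `heads` values are made up for illustration. Unlike the class below, the sketch uses raw words rather than lemmas and does not skip punctuation.

```python
def parse_phrases(tokens, heads):
    """tokens: list of words; heads[i]: index of the parent of tokens[i]
    (the root token points at itself)."""
    phrases = []
    for i, word in enumerate(tokens):
        parent = heads[i]
        if parent == i:                 # root token: no parent to attach
            continue
        grandparent = heads[parent]
        if grandparent != parent:
            phrases.append([word, tokens[parent], tokens[grandparent]])
        else:                           # the parent is the root
            phrases.append([word, tokens[parent]])
    return ['_'.join(p) for p in phrases]


tokens = ['families', 'fled', 'across', 'the', 'border']
heads = [1, 1, 1, 4, 2]   # hand-written parse, for illustration only
print(parse_phrases(tokens, heads))
# prints: ['families_fled', 'across_fled', 'the_border_across', 'border_across_fled']
```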

In [None]:
class PhraseProcessor(BaseEstimator, TransformerMixin):
    """Transformer that turns documents in string form
    into strings of dependency-based phrases, where each phrase
    joins a token with its parent and grandparent tokens.
    
    Parameters
    ----------
    stop_words : list of str, required
        Stop words stored on the processor. They are not
        currently applied in `transform`.
    """
    
    def __init__(self, stop_words):
        self.stop_words = stop_words
    
    def parse_phrases(self, doc):
        '''Return a list of lists, with each sublist containing a token from the text,
        its parent token and its grandparent token. Does not return any repeat tokens
        in each phrase.'''
        phrases = []
        for d in doc:
            if not d.is_punct:
                if d.head != d:
                    if d.head.head != d.head:
                        phrases.append([d, d.head, d.head.head])
                    else:
                        phrases.append([d, d.head])
        return phrases

    def join_phrases(self, phrases):
        joined = []
        for phrase in phrases:
            tokens = []
            for token in phrase:
                if isinstance(token, spacy.tokens.token.Token):
                    tokens.append(token.lemma_)
                else:
                    tokens.append(token)
            if len(tokens) < 2:
                continue
            joined.append('_'.join(tokens))
        return joined
    
    def single_string(self, texts):
        strings = [' '.join(t) for t in texts]
        return strings
    
    def fit(self, texts, *args):
        return self
    
    def transform(self, texts, *args):
#         import pdb; pdb.set_trace()
        docs = [nlp(t) for t in texts]
        phrases = [self.parse_phrases(d) for d in docs]
        joined = [self.join_phrases(p) for p in phrases]
        text = self.single_string(joined)
        return text

In [None]:
phraser = PhraseProcessor(stop_words)

In [None]:
processed_sentences = phraser.transform(example_sentences)
# wrap the single string in a list so it is not iterated character by character
processed_sentence = phraser.transform([example_sentence])

### Part of Speech Tokenizer

This processor lemmatizes the tokens of a text, removes stop words, and optionally labels each lemma with its part of speech (POS) tag. It also removes determiners, numerals and symbols, and tokens of two characters or fewer.
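
The filtering and tagging steps can be sketched on hand-tagged input. In the notebook the lemma and POS tag come from spaCy. Here each token is a hypothetical `(text, lemma, pos)` tuple written by hand, so the sketch runs without a language model.

```python
NOISE_TAGS = {'DET', 'NUM', 'SYM'}


def pos_features(tagged, stop_words):
    """Drop stop words, short tokens and noisy POS tags; return 'lemma_pos' strings."""
    features = []
    for text, lemma, pos in tagged:
        if text in stop_words:
            continue
        if len(text) <= 2:              # keep only tokens longer than two characters
            continue
        if pos in NOISE_TAGS:
            continue
        features.append('{}_{}'.format(lemma, pos).lower())
    return features


tagged = [('The', 'the', 'DET'), ('families', 'family', 'NOUN'),
          ('were', 'be', 'VERB'), ('evacuated', 'evacuate', 'VERB'),
          ('by', 'by', 'ADP'), ('100', '100', 'NUM')]
print(pos_features(tagged, stop_words={'were', 'by'}))
# prints: ['family_noun', 'evacuate_verb']
```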

In [None]:
class POSProcessor(BaseEstimator, TransformerMixin):
    """Transformer that turns documents in string form
    into token lists, with various processing steps applied.
    
    Parameters
    ----------
    stop_words : list of str, required
        Words to remove from the text.
    pos_tags : bool, optional (default = True)
        Whether to tag lemmas with their part of speech labels.
    rejoin : bool, optional (default = True)
        Whether to join each token list back into a single string.
    """
    
    def __init__(self, stop_words, pos_tags=True,
                rejoin=True):
        self.stop_words = stop_words
        self.pos_tags = pos_tags
        self.rejoin = rejoin

    def tag_pos(self, text):
        return [(t, t.pos_) for t in text]
    
    def get_lemmas(self, text):
        return [t[0].lemma_ for t in text]

    def remove_noise(self, text):
        noise_tags = ['DET', 'NUM', 'SYM']
        text = [t for t in text if t[0].text not in self.stop_words]
        text = [t for t in text if len(t[0]) > 2]
        text = [t for t in text if t[1] not in noise_tags]
        # `~` on a Python bool is always truthy; use `not` to drop number-like tokens
        text = [t for t in text if not t[0].like_num]
        return text
    
    def join_pos_lemmas(self, pos, lemmas):
        return ['{}_{}'.format(l, p[1]).lower() for p, l
                in zip(pos, lemmas)]
    
    def fit(self, texts, *args):
        return self
    
    def single_string(self, texts):
        strings = [' '.join(t) for t in texts]
        return strings
    
    def transform(self, texts, *args):
        docs = [nlp(sent) for sent in texts]
        docs = [self.tag_pos(d) for d in docs]
        docs = [self.remove_noise(d) for d in docs]
        lemmas = [self.get_lemmas(d) for d in docs]
        if self.pos_tags:
            docs = [self.join_pos_lemmas(d, l) for d, l
                    in zip(docs, lemmas)]
        else:
            # without POS tags, return the plain lemmas
            docs = lemmas
        if self.rejoin:
            docs = self.single_string(docs)
        return docs

In [None]:
pos_processor = POSProcessor(stop_words, rejoin=False)

In [None]:
pos_processor.fit_transform(example_sentences)

### N-gram Processor

The n-gram processor identifies common co-occurrences of two or three words.
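
How a pair of words comes to count as a common co-occurrence can be sketched in plain Python. The sketch assumes the default scoring described in the gensim `Phrases` documentation, `(count(a, b) - min_count) * vocabulary size / (count(a) * count(b))`, with a bigram kept when its score exceeds `threshold`. It is a simplification, not gensim's implementation: `score_bigrams` is a hypothetical helper, and the vocabulary size here counts unigrams only.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those above the threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    kept = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            kept[a + '_' + b] = score
    return kept


sentences = [['flood', 'water', 'rose'],
             ['flood', 'water', 'fell'],
             ['water', 'fell']]
print(sorted(score_bigrams(sentences, min_count=1, threshold=0.5)))
# prints: ['flood_water', 'water_fell']
```

Raising `min_count` or `threshold` keeps fewer pairs, which is why the example cell below uses very low values on a four-sentence corpus.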

In [None]:
class NGramProcessor(BaseEstimator, TransformerMixin):
    """Transformer that finds and returns common bi and tri-grams
    that appear in a text.
    
    Parameters
    ----------
    bi_min_count : int, optional (default = 5)
        Minimum number of times a bi-gram must occur in the corpus
        to be counted.
    bi_threshold : int, optional (default = 10)
        Threshold value used to calculate threshold scoring for 
        bi-grams to be counted. See gensim docs for more details.
    tri_min_count : int, optional (default = 5)
        Minimum number of times a tri-gram must occur in the corpus
        to be counted.
    tri_threshold : int, optional (default = 10)
        Threshold value used to calculate threshold scoring for 
        tri-grams to be counted. See gensim docs for more details.
    mode : str, optional (default = 'trigram')
        If 'bigram', then the text is returned with only bi-grams
        replacing corresponding tokens.
        If 'trigram', then the text is returned with both bi
        and tri-grams replacing corresponding tokens.
        If 'everything', then full original text is returned with all
        bi and tri-grams appended.
    """
    
    def __init__(self, bi_min_count=5, bi_threshold=10,
                 tri_min_count=5, tri_threshold=10,
                 mode='trigram'):
        self.pos_processor = POSProcessor(stop_words, rejoin=False)
        self.bi_min_count = bi_min_count
        self.bi_threshold = bi_threshold
        self.tri_min_count = tri_min_count
        self.tri_threshold = tri_threshold
        self.mode = mode
    
    def fit(self, texts, *args):
        # Phrases expects lists of tokens, so tokenize the raw strings first
        docs = self.pos_processor.fit_transform(texts)
        self.build_grammer(docs)
        return self
    
    def build_grammer(self, texts):
        """Creates bi and tri-gram Phraser models.
        """
        self.bigram = Phrases(texts, 
                          min_count=self.bi_min_count, 
                          threshold=self.bi_threshold)
        self.bigrammer = Phraser(self.bigram)
        if (self.mode=='trigram') | (self.mode=='trigram+') | (self.mode=='everything'):
            self.trigram = Phrases(self.bigrammer[texts], 
                              min_count=self.tri_min_count, 
                              threshold=self.tri_threshold)
            self.trigrammer = Phraser(self.trigram)

    def make_grams(self, text):
        """Applies Phraser models. Returns text with bigrams or trigrams replacing
        their corresponding tokens, or both bigrams and trigrams, plus all original tokens."""
        bigrams = self.bigrammer[text]
        if self.mode=='bigram':
            return bigrams
        elif (self.mode=='trigram') | (self.mode=='everything'):
            trigrams = self.trigrammer[bigrams]
            if self.mode=='trigram':
                return trigrams
            elif self.mode=='everything':
                trigrams = [t for t in trigrams if t not in bigrams]
                all_grams = bigrams + trigrams
                return all_grams
    
    def merge(self, text, ngrams):
        """Returns text where the original text is kept, and
        all n-grams are appended."""
        grams_only = [ng for ng in ngrams if ng not in list(set(text))]
        return text + grams_only
    
    def stringify(self, grams):
        gram_strs = []
        for g in grams:
            gram_strs.append(u' '.join(g))
        return gram_strs
    
    def transform(self, texts, *args):
        texts = self.pos_processor.fit_transform(texts)
        grams = [self.make_grams(d) for d in texts]
        if (self.mode=='bigram') | (self.mode=='trigram'):
            return self.stringify(grams)
        elif self.mode=='everything':
            # merge each tokenized document with its own n-grams
            grammed_text = [self.merge(d, ng) for d, ng
                            in zip(texts, grams)]
            return self.stringify(grammed_text)

In [None]:
grammer = NGramProcessor(bi_min_count=2, bi_threshold=1,
                 tri_min_count=1, tri_threshold=1,
                 mode='bigram')

In [None]:
grammer.fit(example_sentences)

In [None]:
grammed_text = grammer.transform(example_sentences)

In [None]:
grammed_text

### LSI

In [None]:
from scipy import sparse
from gensim import matutils, models
from sklearn.exceptions import NotFittedError

class CustomSklLsiModel(SklLsiModel):
    """Gensim's Lsi model with sklearn wrapper, modified to handle sparse matrices 
    for both fit and transform. Makes the class compatible with sklearn's Tfidf and 
    Count vectorizers.
    """
    
    def sparse_2_tupes(self, matrix):
        """Converts sparse matrix into manageable tuple format:
        one list of (column index, value) tuples per row."""
        docs = []
        for row in matrix.tocsr():
            cx = row.tocoo()
            docs.append(list(zip(cx.col, cx.data)))
        return docs
    
    def fit(self, X, y=None):
        """
        Fit the model according to the given training data.
        Calls gensim.models.LsiModel
        """
        if sparse.issparse(X):
            corpus = matutils.Sparse2Corpus(X, documents_columns=False)
        else:
            corpus = X

        self.gensim_model = models.LsiModel(corpus=corpus, num_topics=self.num_topics, id2word=self.id2word, chunksize=self.chunksize,
            decay=self.decay, onepass=self.onepass, power_iters=self.power_iters, extra_samples=self.extra_samples)
        return self
    
    def transform(self, docs):
        """
        Takes a list of documents as input ('docs').
        Returns a matrix of topic distribution for the given document bow, where a_ij
        indicates (topic_i, topic_probability_j).
        The input `docs` should be in BOW format and can be a list of documents like : [ [(4, 1), (7, 1)], [(9, 1), (13, 1)], [(2, 1), (6, 1)] ]
        or a single document like : [(4, 1), (7, 1)]
        """
        if self.gensim_model is None:
            raise NotFittedError("This model has not been fitted yet. Call 'fit' with appropriate arguments before using this method.")

        # The input as array of array
        # import pdb; pdb.set_trace()
        # check = lambda x: [x] if isinstance(x[0], tuple) else x
        # docs = check(docs)
        if sparse.issparse(docs):
            docs = matutils.Sparse2Corpus(docs, documents_columns=False)
        X = [[] for i in range(0, len(docs))]
        for k,v in enumerate(docs):
            doc_topics = self.gensim_model[v]
            probs_docs = list(map(lambda x: x[1], doc_topics))
            # Everything should be equal in length
            if len(probs_docs) != self.num_topics:
                probs_docs.extend([1e-12]*(self.num_topics - len(probs_docs)))
            X[k] = probs_docs
            probs_docs = []
        return np.reshape(np.array(X), (len(docs), self.num_topics))
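
The `transform` method above pads short topic vectors at the end, which is only correct if any omitted topics are the trailing ones. As a minimal alternative sketch, assuming the model returns `(topic_id, weight)` pairs, each weight can be placed by its topic id instead. `dense_topic_vector` is a hypothetical helper, not part of gensim.

```python
def dense_topic_vector(doc_topics, num_topics, fill=1e-12):
    """Expand sparse (topic_id, weight) pairs into a fixed-length list."""
    vec = [fill] * num_topics
    for topic_id, weight in doc_topics:
        vec[topic_id] = weight          # position is given by the topic id
    return vec


print(dense_topic_vector([(0, 0.5), (2, 0.25)], num_topics=4))
# prints: [0.5, 1e-12, 0.25, 1e-12]
```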

### LDA

In [None]:
class CustomSklLdaModel(SklLdaModel):
    """Gensim's Lda model with sklearn wrapper, modified to handle sparse matrices 
    for both fit and transform. Makes the class compatible with sklearn's Tfidf and 
    Count vectorizers.
    """
    
    def sparse_2_tupes(self, matrix):
        """Converts sparse matrix into manageable tuple format:
        one list of (column index, value) tuples per row."""
        docs = []
        for row in matrix.tocsr():
            cx = row.tocoo()
            docs.append(list(zip(cx.col, cx.data)))
        return docs
    
    def fit(self, X, y=None):
        """
        Fit the model according to the given training data.
        Calls gensim.models.LdaModel
        """
        if sparse.issparse(X):
            corpus = matutils.Sparse2Corpus(X, documents_columns=False)
        else:
            corpus = X

        self.gensim_model = models.LdaModel(corpus=corpus, num_topics=self.num_topics, id2word=self.id2word,
            chunksize=self.chunksize, passes=self.passes, update_every=self.update_every,
            alpha=self.alpha, eta=self.eta, decay=self.decay, offset=self.offset,
            eval_every=self.eval_every, iterations=self.iterations,
            gamma_threshold=self.gamma_threshold, minimum_probability=self.minimum_probability,
            random_state=self.random_state)
        return self
    
    def transform(self, docs):
        """
        Takes a list of documents as input ('docs').
        Returns a matrix of topic distribution for the given document bow, where a_ij
        indicates (topic_i, topic_probability_j).
        The input `docs` should be in BOW format and can be a list of documents like : [ [(4, 1), (7, 1)], [(9, 1), (13, 1)], [(2, 1), (6, 1)] ]
        or a single document like : [(4, 1), (7, 1)]
        """
        if self.gensim_model is None:
            raise NotFittedError("This model has not been fitted yet. Call 'fit' with appropriate arguments before using this method.")

        # The input as array of array
        # import pdb; pdb.set_trace()
        # check = lambda x: [x] if isinstance(x[0], tuple) else x
        # docs = check(docs)
        if sparse.issparse(docs):
            docs = matutils.Sparse2Corpus(docs, documents_columns=False)
        X = [[] for i in range(0, len(docs))]
        for k, v in enumerate(docs):
            doc_topics = self.gensim_model[v]
            probs_docs = list(map(lambda x: x[1], doc_topics))
            # Everything should be equal in length
            if len(probs_docs) != self.num_topics:
                probs_docs.extend([1e-12]*(self.num_topics - len(probs_docs)))
            X[k] = probs_docs
        return np.reshape(np.array(X), (len(docs), self.num_topics))

### Two Countries

In [None]:
# TODO

## Model Building

## Relevance

### Data Prep

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

In [None]:
le = LabelEncoder()

In [None]:
df_rel = pd.read_csv('../data/training_data_clean_09242017.csv')

In [None]:
df_rel.head(1)

In [None]:
df_rel['is_displacement'].value_counts()

In [None]:
df_rel['category'].value_counts()

In [None]:
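def detect_lang(text):
    # langdetect raises an exception on empty or undetectable text;
    # mark those rows with the placeholder code 'x'
    try:
        return detect(text)
    except Exception:
        return 'x'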
    
def english_only(df):
    language = df['text'].apply(detect_lang)
    df = df[language == 'en']
    return df

In [None]:
# df = english_only(df)

In [None]:
df_rel['is_displacement_label'] = le.fit_transform(df_rel['is_displacement'])
df_rel_modeling = df_rel.sample(frac=1)

In [None]:
# cleaner = CleaningProcessor()
# df_rel_modeling['text'] = cleaner.fit_transform(df_rel_modeling['text'])

In [None]:
# df_rel_modeling.to_csv('../../data/training_data_clean_09242017.csv')

In [None]:
X_rel_train, X_rel_test, y_rel_train, y_rel_test = train_test_split(df_rel_modeling['text'], 
                                                    df_rel_modeling['is_displacement_label'],
                                                    test_size=0.2)

In [None]:
pos_processor = POSProcessor(stop_words)
phrase_processor = PhraseProcessor(stop_words)
pos_processor, phrase_processor = (pos_processor.fit(X_rel_train), phrase_processor.fit(X_rel_train))
X_train_pos, X_train_phrase = (pos_processor.transform(X_rel_train), phrase_processor.transform(X_rel_train))

In [None]:
X_test_pos, X_test_phrase = (pos_processor.transform(X_rel_test), phrase_processor.transform(X_rel_test))

In [None]:
X_train_all = []
for pos, phrase in zip(X_train_pos, X_train_phrase):
    X_train_all.append(pos + phrase)

X_test_all = []
for pos, phrase in zip(X_test_pos, X_test_phrase):
    X_test_all.append(pos + phrase)

### Corpus Statistics

In [None]:
# from collections import Counter
# import operator

In [None]:
# def splitter(x):
#     s = x.split(' ')
#     return s

# word_counter = Counter()
# for word in X_train.apply(splitter):
#     word_counter.update(word)
    
# sorted_x = sorted(word_counter.items(), key=operator.itemgetter(1),
#                   reverse=True)
# len(sorted_x)

In [None]:
## Uncomment to see frequency of words in corpus

# sorted_x

### Preprocessing and Vectorising

In [None]:
vocab = ['leave', 'evacuate', 'homeless', 'forced', 'flee', 'fled', 'destroyed', 'destruction', 
         'submerged', 'wrecked', 'washed', 'away', 'devastated', 'under', 'water', 'underwater', 
         'inundated', 'camp', 'collapse', 'left', 'reconstruct', 'demolished', 'uninhabitable', 
         'border', 'across', 'refugee', 'shelter', 'crops', 'corn', 'rice', 'maize', 'wheat', 
         'field']
counter = CountVectorizer(vocabulary=vocab)

In [None]:
## Set number of Lsi topics here

num_topics = 50

processor = Pipeline([
               ('tfidf', TfidfVectorizer(norm='l2')),
               ('lsi', CustomSklLsiModel(num_topics=num_topics))
])

In [None]:
# processor = processor.fit(X_train)
processor_pos = processor.fit(X_train_pos)
# processor_phrase = processor.fit(X_train_phrase)
# processor_all = processor.fit(X_train_all)

### Lsi Vector Plot

In [None]:
def visualize_processor_vectors(X, y, processor):
    vecs = processor.transform(X)
    
    v0 = []
    v1 = []
    v2 = []
    v3 = []
    v4 = []
    for v in vecs:
        v0.append(v[0])
        v1.append(v[1])
        v2.append(v[2])
        v3.append(v[3])
        v4.append(v[4])

    df = pd.DataFrame(data={'v0': v0, 'v1': v1, 'v2': v2, 'v3':v3, 'v4': v4, 'category': y})
            
    sns.pairplot(df, hue="category", plot_kws={'alpha': 0.2})
    
    return df

In [None]:
df = visualize_processor_vectors(X_train_pos, y_rel_train, processor_pos)

### LSI Topic Investigation

In [None]:
# get vocabulary from tfidf to assess topics in LSI model
tfidf = processor.named_steps['tfidf']
id2word = {t[1]: t[0] for t in list(tfidf.vocabulary_.items())}
lsi_model = processor.named_steps['lsi'].gensim_model
lsi_model.id2word = id2word

In [None]:
lsi_model.print_topics()

### Feature Resolving Power

### Models

The models chosen are all capable of multiclass classification.

In [None]:
n_jobs = -1

rf_clf = RandomForestClassifier(n_jobs=n_jobs)

knn_clf = KNeighborsClassifier(n_jobs=n_jobs)

svm_clf = LinearSVC()

gnb_clf = GaussianNB()

lr_clf = LogisticRegression(multi_class='ovr')

### Simple Mode

Simple mode means transforming all of the features once for all of the classifiers. This has the advantage of speed, but the disadvantage of not tuning the input features for each classifier.

In [None]:
clf = joblib.load('../python/idetect/nlp_models/relevance_classifier_svm_10052017.pkl')

In [None]:
clf.predict(['Lots of people were evacuated due to the flooding.'])

### LSI

In [None]:
lsi_pipe = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('phrase', Pipeline([
            ('processor', PhraseProcessor(stop_words)),
            ('tfidf', TfidfVectorizer(max_features=20000)),
            ('lsi', CustomSklLsiModel(num_topics=300))
        ])),
        ('pos', Pipeline([
            ('processor', POSProcessor(stop_words)),
            ('tfidf', TfidfVectorizer(max_features=20000, max_df=0.7)),
            ('lsi', CustomSklLsiModel(num_topics=300))
        ]))
    ]))
])

In [None]:
# lsi_pipe = joblib.load(lsi_pipe, 'lsi_union_10022017.pkl')
lsi_pipe.fit(df_rel_modeling['text'])

In [None]:
# lsi_vecs_rel_train = lsi_pipe.fit_transform(X_rel_train)
# lsi_vecs_rel_test = lsi_pipe.transform(X_rel_test)

In [None]:
lsi_vecs_rel_all = lsi_pipe.transform(df_rel_modeling['text'])

In [None]:
# joblib.dump(lsi_pipe, 'lsi_union_10022017.pkl')

In [None]:
union = lsi_pipe.named_steps['union']
tfidf_phrase = union.transformer_list[0][1].named_steps['tfidf']
tfidf_pos = union.transformer_list[1][1].named_steps['tfidf']

In [None]:
d = {'a': 'A', 'b': 'B'}
for a,b in d.items():
    print(a)

In [None]:
def vocab_to_id2word(vocab):
    # invert the vectorizer's term -> index mapping into index -> term
    id2word = {}
    for k, v in vocab.items():
        id2word[v] = k
    return id2word

In [None]:
# union.transformer_list[0][1].named_steps['lsi'].id2word = vocab_to_id2word(tfidf_phrase.vocabulary_)
# union.transformer_list[1][1].named_steps['lsi'].id2word = vocab_to_id2word(tfidf_pos.vocabulary_)
# union.transformer_list[0][1].named_steps['lsi'].gensim_model.id2word = vocab_to_id2word(tfidf_phrase.vocabulary_)
# union.transformer_list[1][1].named_steps['lsi'].gensim_model.id2word = vocab_to_id2word(tfidf_pos.vocabulary_)

In [None]:
lsi_pos = union.transformer_list[1][1].named_steps['lsi']

In [None]:
tfidf_phrase.vocabulary_

### SVM Linear

In [None]:
svm = LinearSVC()
svm_params = {'C': [0.03, 0.1, 0.3, 1, 3, 10]}
svm_grid = GridSearchCV(svm, svm_params, cv=5, scoring=None, n_jobs=-1, verbose=1)

In [None]:
svm_grid.fit(lsi_vecs_rel_train, y_rel_train)

In [None]:
svm_best = svm_grid.best_estimator_
svm_best

In [None]:
svm_preds = svm_best.predict(lsi_vecs_rel_test)
print(classification_report(y_rel_test, svm_preds))

In [None]:
svm_final = LinearSVC(C=1)

In [None]:
svm_final.fit(lsi_vecs_rel_all, df_rel_modeling['is_displacement_label'])

In [None]:
svm_pipe = Pipeline([
    ('location', LocationProcessor()),
    ('lsi', lsi_pipe),
    ('svm', svm_final)
])

In [None]:
joblib.dump(svm_pipe, 'relevance_classifier_svm_10132017.pkl')

### Logistic Regression

In [None]:
lr_params = {'C': [0.03, 0.1, 0.3, 1, 3, 10]}
lr_grid = GridSearchCV(lr_clf, lr_params, cv=5, scoring=None, n_jobs=-1, verbose=1)

In [None]:
lr_grid.fit(lsi_vecs_rel_train, y_rel_train)

In [None]:
lr_best = lr_grid.best_estimator_
lr_preds = lr_best.predict(lsi_vecs_rel_test)
print(classification_report(y_rel_test, lr_preds))

### GNB

In [None]:
gnb_clf.fit(lsi_vecs_rel_train, y_rel_train)

In [None]:
gnb_best = gnb_clf
gnb_preds = gnb_best.predict(lsi_vecs_rel_test)
print(classification_report(y_rel_test, gnb_preds))

### Random Forest

In [None]:
rf_params = {'n_estimators': [1000],
             'max_features': [18, 19, 20],
             'max_depth': [100],
             'min_samples_split': [4,5,6]}
rf_grid = GridSearchCV(rf_clf, rf_params, cv=5, scoring=None, n_jobs=-1, verbose=1)

In [None]:
rf_grid.fit(lsi_vecs_rel_train, y_rel_train)

In [None]:
rf_best = rf_grid.best_estimator_
rf_best

In [None]:
rf_preds = rf_best.predict(lsi_vecs_rel_test)
print(classification_report(y_rel_test, rf_preds))

### KNN

In [None]:
knn_params = {'n_neighbors': [2,3,4,5,6,7],
             'leaf_size': [10,20,30,40]}
knn_grid = GridSearchCV(knn_clf, knn_params, cv=5, scoring=None, n_jobs=-1, verbose=1)

In [None]:
knn_grid.fit(lsi_vecs_rel_train, y_rel_train)

In [None]:
knn_best = knn_grid.best_estimator_
knn_best

In [None]:
knn_preds = knn_best.predict(lsi_vecs_rel_test)
print(classification_report(y_rel_test, knn_preds))

### Ensemble

In [None]:
clfs = [svm_best, gnb_best, lr_best]
sclf = StackingClassifier(classifiers=clfs,
                          meta_classifier=GaussianNB())

In [None]:
sclf.fit(lsi_vecs_rel_train, y_rel_train)
# joblib.load(sclf, 'stacked_classifier_10022017.pkl')

In [None]:
preds = sclf.predict(lsi_vecs_rel_test)
print(classification_report(y_rel_test, preds))

In [None]:
# joblib.dump(sclf, 'stacked_classifier_10022017.pkl')

### Single Pipeline

In [None]:
stacked_pipe = Pipeline([
    ('cleaner', CleaningProcessor()),
    ('lsi', lsi_pipe),
    ('stacked', sclf)
])

In [None]:
stacked_pipe.fit(df_rel_modeling['text'], df_rel_modeling['is_displacement_label'])

In [None]:
joblib.dump(stacked_pipe, 'relevance_classifier_10032017.pkl')

## Category

In [None]:
df_cat = pd.read_csv('../../data/category_training_data_en_09102017.csv')

In [None]:
df_cat_modeling = df_cat.sample(frac=1)

In [None]:
cleaner = CleaningProcessor()
df_cat_modeling['text'] = cleaner.fit_transform(df_cat_modeling['text'])

In [None]:
le = LabelEncoder()
df_cat_modeling['category_label'] = le.fit_transform(df_cat_modeling['category'])

In [None]:
X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(df_cat_modeling['text'], 
                                                    df_cat_modeling['category_label'],
                                                    test_size=0.2, random_state=42)

In [None]:
pos_processor = POSProcessor(stop_words)
phrase_processor = PhraseProcessor(stop_words)
pos_processor, phrase_processor = (pos_processor.fit(X_cat_train), phrase_processor.fit(X_cat_train))
X_train_pos, X_train_phrase = (pos_processor.transform(X_cat_train), phrase_processor.transform(X_cat_train))

In [None]:
X_test_pos, X_test_phrase = (pos_processor.transform(X_cat_test), phrase_processor.transform(X_cat_test))

In [None]:
lsi_pipe = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('phrase', Pipeline([
            ('processor', PhraseProcessor(stop_words)),
            ('tfidf', TfidfVectorizer(max_features=20000)),
            ('lda', CustomSklLsiModel(num_topics=300))
        ])),
        ('pos', Pipeline([
            ('processor', POSProcessor(stop_words)),
            ('tfidf', TfidfVectorizer(max_features=20000)),
            ('lda', CustomSklLsiModel(num_topics=300))
        ]))
    ]))
])

In [None]:
# lsi_pipe = joblib.load(lsi_pipe, 'lsi_union_10022017.pkl')

In [None]:
lsi_vecs_train = lsi_pipe.fit_transform(X_cat_train)
lsi_vecs_test = lsi_pipe.transform(X_cat_test)

In [None]:
n_jobs = -1

rf_cat_clf = RandomForestClassifier(n_jobs=n_jobs)

knn_cat_clf = KNeighborsClassifier(n_jobs=n_jobs)

svm_cat_clf = LinearSVC(multi_class='ovr')

gnb_cat_clf = GaussianNB()

lr_cat_clf = LogisticRegression(multi_class='ovr')

### SVM Linear

In [None]:
svm_cat_params = {'C': [0.1, 0.9, 1, 1.1, 1.2]}
svm_cat_grid = GridSearchCV(svm_cat_clf, svm_cat_params, cv=5, scoring=None, n_jobs=-1, verbose=1)

In [None]:
svm_cat_grid.fit(lsi_vecs_train, y_cat_train)

In [None]:
svm_cat_best = svm_cat_grid.best_estimator_
svm_cat_best

In [None]:
svm_cat_preds = svm_cat_best.predict(lsi_vecs_test)
print(classification_report(y_cat_test, svm_cat_preds))

### Logistic Regression

In [None]:
lr_cat_params = {'C': [8, 9, 10, 11, 12, 100]}
lr_cat_grid = GridSearchCV(lr_cat_clf, lr_cat_params, cv=5, scoring=None, n_jobs=-1, verbose=1)

In [None]:
lr_cat_grid.fit(lsi_vecs_train, y_cat_train)

In [None]:
lr_cat_best = lr_cat_grid.best_estimator_
lr_cat_best

In [None]:
lr_cat_preds = lr_cat_best.predict(lsi_vecs_test)
print(classification_report(y_cat_test, lr_cat_preds))

### GNB

In [None]:
gnb_cat_clf.fit(lsi_vecs_train, y_cat_train)

In [None]:
gnb_cat_best = gnb_cat_clf
gnb_cat_preds = gnb_cat_best.predict(lsi_vecs_test)
print(classification_report(y_cat_test, gnb_cat_preds))

### Ensemble

In [None]:
clfs = [svm_cat_best, gnb_cat_best, lr_cat_best]
sclf_cat = StackingClassifier(classifiers=clfs,
                          meta_classifier=LinearSVC(multi_class='ovr', C=10))

In [None]:
sclf_cat.fit(lsi_vecs_train, y_cat_train)

In [None]:
sclf_cat_preds = sclf_cat.predict(lsi_vecs_test)

In [None]:
print(classification_report(y_cat_test, sclf_cat_preds))

In [None]:
sclf_cat_grid = GridSearchCV(sclf_cat, sclf_cat_params, cv=5, scoring=None, verbose=1)

In [None]:
sclf_cat.fit(lsi_vecs_train, y_cat_train)
# joblib.load(sclf, 'stacked_classifier_10022017.pkl')

In [None]:
cat_pipeline = Pipeline([
    ('cleaner', CleaningProcessor()),
    ('lsi', lsi_pipe),
    ('svm', svm_cat_best)
])

In [None]:
cat_pipeline.fit(df_cat_modeling['text'], df_cat_modeling['category_label'])

In [None]:
joblib.dump(cat_pipeline, 'category_classifier_10032017.pkl')