## Plan of attack

1. Use spaCy to tokenize the text better.
2. Build a custom text-cleaning step for the pipeline.
3. Build a step that converts each document into a vector.
4. Pair the questions back up, or process the two questions of each pair in parallel.
5. Run XGBoost to classify.

In [1]:
# data manipulation
from utils import save, load
import numpy as np
import pandas as pd

# text manipulation
import spacy

nlp = spacy.load('en_core_web_lg')
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

import string
punctuations = string.punctuation

# modeling
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn import metrics

from xgboost import XGBClassifier

In [2]:
train_df = load('train')

Functions to clean problematic text and tokenize the Quora questions.

In [3]:
def clean_text(question):
    ''' Pre-processor to clean the Quora questions
    '''
    # found 1 example of no space after a question mark, which causes issues with the tokenizer
    for p in punctuations:
        question = question.replace(p, ' ')
    
    return question

def spacy_tokenizer(question):
    ''' Tokenizer that returns the lemma of each token (stop word and punctuation removal is commented out for now)
    '''
    tokens = nlp.tokenizer(question)
#     tokens = [tok.lemma_.lower().strip() for tok in tokens if tok.lemma_ != '-PRON-']
#     tokens = [tok for tok in tokens if tok not in punctuations and tok not in spacy_stopwords]     
    
    return [token.lemma_ for token in tokens]

In [4]:
q = train_df.loc[:,'question1'].sample(1).values[0]
q

'How can we know the number of people connected with the hotspot of Moto G4 Plus?'

In [5]:
spacy_tokenizer(q)

['How',
 'can',
 'we',
 'know',
 'the',
 'numb',
 'of',
 'people',
 'connect',
 'with',
 'the',
 'hotspot',
 'of',
 'Moto',
 'G4',
 'Plus',
 '?']
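
The comment in `clean_text` mentions a question with no space after a question mark. Below is a minimal sketch of what the punctuation replacement does to such a string, using only the standard library. The sample string is made up for illustration, and `str.translate` is swapped in here as a single-pass alternative to the loop of `replace` calls; it is not the notebook's own implementation.

```python
import string

def clean_text(question):
    """Replace every ASCII punctuation character with a space."""
    # Translation table mapping each punctuation character to a single space
    table = str.maketrans({p: ' ' for p in string.punctuation})
    return question.translate(table)

# Hypothetical example, not taken from the data set
sample = "Is this a duplicate?Or is it not?"
cleaned = clean_text(sample)

print(cleaned)
print(cleaned.split())
```

One side effect to keep in mind: apostrophes and hyphens are punctuation too, so contractions such as "don't" are split into two pieces by this step.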

The pipeline will have the following steps:

1. Transform pairs into documents
2. TF-IDF of the stacked documents, with the `spacy_tokenizer` (the MVP below only uses the spaCy stop word list)
3. NMF (planned: 50 components; the MVP below uses 5)
4. Transform the documents back into pairs
5. XGBoost classification

In [6]:
def stack_questions(df):
    ''' Takes the pair of questions, and stacks them as individual documents to be processed.
    
    df: DataFrame 
    The data frame must have the 3 cols (id, question1, question2).
    
    return: DataFrame
    Returns a data frame of documents (questions)
    '''
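    # first question of each pair
    q1 = df.loc[:, ['id', 'question1']]
    # second question of each pair, renamed so both frames share one column
    q2 = df.loc[:, ['id', 'question2']].rename(columns={'question2': 'question1'})
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    X = pd.concat([q1, q2], sort=False)
    # mergesort is stable, so question1 stays ahead of question2 within each id
    X = X.sort_values('id', kind='mergesort').reset_index(drop=True)
    
    return np.array(X['question1'])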

def unstack_questions(X):
    ''' Takes X (n_questions*2, n_features) and transforms it to a (n_questions, n_features*2) numpy array. 
    '''
    odd_idx = [i for i in range(len(X)) if i % 2 == 1]
    even_idx = [i for i in range(len(X)) if i % 2 == 0]
    
    return np.hstack([X[odd_idx], X[even_idx]])
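
To make the stack/unstack shapes concrete, here is a small toy check. It assumes that the two questions of each pair end up in adjacent rows after stacking, and it uses a dummy feature matrix as a stand-in for the TF-IDF and NMF output; all names and values are illustrative.

```python
import numpy as np

# Toy pairs: row i holds (question1, question2) for pair i
pairs = np.array([["a1", "a2"], ["b1", "b2"], ["c1", "c2"]])

# Stacking interleaves each pair: a1, a2, b1, b2, c1, c2
stacked = pairs.reshape(-1)

# Stand-in for TF-IDF + NMF: one feature row per document (2 dummy features)
features = np.array([[i, i * 10] for i in range(len(stacked))])

# Same logic as unstack_questions: odd rows first, then even rows
unstacked = np.hstack([features[1::2], features[0::2]])

print(unstacked.shape)
print(unstacked)
```

Each output row holds the features of the second question followed by those of the first, so the classifier sees one row per pair with twice as many columns as the NMF step produces.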

Now let's build a simple pipeline to confirm the questions can be transformed from pairs to an array of documents and back to pairs.

In [7]:
pipeline = Pipeline(
    [
        ('stack', FunctionTransformer(stack_questions, validate=False)),
        # recent scikit-learn versions expect stop_words as a list, not a set
        ('tf', TfidfVectorizer(stop_words=list(spacy_stopwords))),
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(unstack_questions, validate=False)),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1))
    ]
)

y = train_df.loc[:,'is_duplicate'].values
pipeline.fit(train_df, y)

Pipeline(memory=None,
     steps=[('stack', FunctionTransformer(accept_sparse=False, check_inverse=True,
          func=<function stack_questions at 0x1a8f8d4a60>,
          inv_kw_args=None, inverse_func=None, kw_args=None,
          pass_y='deprecated', validate=False)), ('tf', TfidfVectorizer(analyzer='word', binary=False,...tate=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1))])

Next, measure the AUC on the training data set. A cross-validation is also needed, since scores computed on the training data are optimistic.
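
The cross-validation mentioned above could look roughly like the following sketch. It runs on synthetic data with a stand-in `LogisticRegression` so that it is self-contained; the data, the estimator, and the fold count are assumptions for illustration, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Mildly imbalanced synthetic data (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.63, 0.37], random_state=0
)

# Stand-in estimator; any scikit-learn estimator or Pipeline can go here
model = LogisticRegression(max_iter=1000)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")

print(scores.mean(), scores.std())
```

In this notebook the same call would presumably take `pipeline`, `train_df`, and `y` in place of the stand-ins. `cross_val_score` refits a fresh clone of the estimator on each fold, so it can be slow with 500 trees.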

In [8]:
len(y[y == 1]) / len(y)

0.3692197711407835

In [11]:
def log_scores(model, X, y, m_name, p_cut = 0.5):
    probs = model.predict_proba(X)[:, 1]
    score = (probs >= p_cut).astype(int)
    
    measures = np.array([
        metrics.accuracy_score(y, score),
        metrics.precision_score(y, score),
        metrics.recall_score(y, score),
        metrics.f1_score(y, score),
        metrics.roc_auc_score(y, probs)
    ])
    
    return pd.DataFrame(data = measures.reshape(1, 5), columns=['accuracy', 'precision', 'recall', 'f1', 'auc'], index=[m_name])

In [12]:
results_df = log_scores(pipeline, train_df, y, 'mvp (tf-idf, nmf(5), xgboost)')
results_df

| | accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.704685 | 0.672101 | 0.390847 | 0.494264 | 0.746805 |
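
As a quick sanity check on these numbers, the sketch below recomputes F1 from the reported precision and recall, and compares the accuracy against the trivial baseline of always predicting "not duplicate". All values are copied from the outputs above; nothing here is newly measured.

```python
# Values copied from the notebook outputs above
precision, recall = 0.672101, 0.390847
f1_reported, accuracy = 0.494264, 0.704685
positive_rate = 0.3692197711407835

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class (not duplicate)
baseline_accuracy = 1 - positive_rate

print(round(f1, 6))
print(round(baseline_accuracy, 4))
```

The recomputed F1 agrees with the table, and the majority-class baseline sits at about 63% accuracy, so the MVP's roughly 70% is only a modest lift. The low recall is the main weakness at the 0.5 cut-off.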


In [13]:
save(pipeline, 'mvp_model')
save(results_df, 'results')

## Improvements

1. Add a lemmatizer and fully incorporate a spaCy tokenizer.
2. Analyze and determine if any further data cleaning is needed.
  * Look at questions not ending in a question mark.
3. Build a pipeline using the GloVe vectors.
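
For the data cleaning item, a first look at questions that do not end in a question mark could be sketched as below. The toy frame is a hypothetical stand-in for `train_df`; only the column names `question1` and `question2` come from this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_df = pd.DataFrame({
    "question1": ["What is NMF?", "Explain TF-IDF", None],
    "question2": ["How does XGBoost work?", "What is spaCy? ", "Why"],
})

missing_qmark = {}
for col in ["question1", "question2"]:
    # Strip trailing whitespace first; missing values count as "no question mark"
    ends_with_q = toy_df[col].str.strip().str.endswith("?", na=False)
    missing_qmark[col] = int((~ends_with_q).sum())

print(missing_qmark)
```

Filtering the frame with `~ends_with_q` instead of counting would show the affected rows, which helps decide whether they need special handling.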