## Plan of attack

1. Use spaCy to better tokenize the text.
2. Build a custom text cleaner step in the pipeline process.
3. Build a document vector conversion step.
4. Build the pairing of the questions, or somehow process the two questions in parallel.
5. Run XGBoost to classify.

In [51]:
# data manipulation
from utils import save, load
import numpy as np
import pandas as pd

# text manipulation
import spacy

nlp = spacy.load('en_core_web_lg')
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

import string
punctuations = string.punctuation

# modeling
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [2]:
train_df = load('train')

Functions to clean bad text and tokenize the Quora questions.

In [8]:
def clean_text(question):
    ''' Pre-processor to clean the Quora questions
    '''
    # found 1 example of no space after a question mark, which causes issues with the tokenizer,
    # so every punctuation character is replaced with a space
    for p in punctuations:
        question = question.replace(p, ' ')
    
    return question

def spacy_tokenizer(question):
    ''' Tokenizer that lemmatizes and removes stop words and punctuation
    '''
    tokens = nlp(clean_text(question))
    tokens = [tok.lemma_.lower().strip() for tok in tokens if tok.lemma_ != '-PRON-']
    tokens = [tok for tok in tokens if tok not in punctuations and tok not in spacy_stopwords]     
    
    return tokens

In [12]:
q = train_df.loc[:,'question1'].sample(1).values[0]
q

'How can I direct message someone on Instagram from my computer?'

In [13]:
spacy_tokenizer(q)

['direct', 'message', 'instagram', 'computer']
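To illustrate why `clean_text` replaces punctuation with a space instead of deleting it, here is a minimal sketch using only the standard library. The sample question and the helper name `replace_punctuation` are made up for illustration, and plain whitespace splitting stands in for the spaCy tokenizer, so the real tokenizer may behave differently.

```python
import string

def replace_punctuation(text):
    # same idea as clean_text: swap every punctuation character for a space
    for p in string.punctuation:
        text = text.replace(p, ' ')
    return text

# hypothetical question with no space after the first question mark
raw = 'Why am I lonely?How can I solve it?'

# without cleaning, the two words around the question mark stay glued together
print(raw.split())

# after cleaning, they separate into individual words
print(replace_punctuation(raw).split())
```

Deleting the punctuation outright would instead merge the two words into `lonelyHow`, which is why a space is used as the replacement.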

The pipeline will be as follows:

1. Transform pairs into documents
2. TF-IDF of the documents, using the `spacy_tokenizer`
3. NMF with 50 components
4. Transform the document back into pairs
5. XGBoost classification
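As a rough sketch of steps 1 to 4 on toy data, assuming scikit-learn's `TfidfVectorizer` and `NMF`: the question pairs below are invented, the default TF-IDF tokenizer stands in for `spacy_tokenizer`, and 2 components are used instead of 50 because the toy corpus is tiny. Step 5 is left out; the resulting matrix is the kind of input a classifier such as XGBoost would receive.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# toy question pairs (hypothetical data, not from the training set)
pairs = np.array([
    ['how do i learn python', 'what is the best way to learn python'],
    ['how do i cook rice', 'what is the capital of france'],
    ['why is the sky blue', 'what makes the sky look blue'],
])

# 1. stack pairs into documents: row-major flattening keeps each pair adjacent
docs = pairs.reshape(-1)

# 2. TF-IDF over the individual questions (default tokenizer, not spacy_tokenizer)
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# 3. NMF topic features; 2 components for the toy data instead of 50
nmf = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
X_topics = nmf.fit_transform(X_tfidf)

# 4. back to pairs: one row per pair, question-1 topics followed by question-2 topics
X_pairs = X_topics.reshape(len(pairs), -1)

print(X_pairs.shape)
```

Fitting TF-IDF and NMF on the stacked documents means both questions in a pair share one vocabulary and one topic space, so their topic vectors are directly comparable.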

In [54]:
def stack_questions(df):
    ''' Takes the pairs of questions and stacks them as individual documents to be processed.
    
    df: DataFrame 
    The data frame must have the 3 columns (id, question1, question2).
    
    return: ndarray
    Returns a 1-D array of documents (questions), with the two questions of each pair in adjacent rows.
    '''
    q1 = df.loc[:, ['id', 'question1']]
    q2 = df.loc[:, ['id', 'question2']].rename(columns={'question2': 'question1'})
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    X = pd.concat([q1, q2], sort=False)
    # a stable sort keeps question1 ahead of question2 within each id
    X = X.sort_values('id', kind='mergesort').reset_index(drop=True)
    
    return np.array(X['question1'])

def unstack_questions(X):
    ''' Takes X of shape (n_questions*2,) and transforms it into a (n_questions, 2) numpy array.
    Odd-indexed rows go into the first column and even-indexed rows into the second.
    '''
    odd_idx = [i for i in range(len(X)) if i % 2 == 1]
    even_idx = [i for i in range(len(X)) if i % 2 == 0]
    
    return np.vstack([X[odd_idx], X[even_idx]]).T

Now let's build a simple pipeline to confirm that the questions can be transformed from pairs to an array of documents and back to pairs.

In [55]:
simple_pip = Pipeline(
    [
        ('stack', FunctionTransformer(stack_questions, validate=False)),
        ('unstack', FunctionTransformer(unstack_questions, validate=False))
    ]
)

simple_pip.transform(train_df)[:5]

array([['What is the step by step guide to invest in share market?',
        'What is the step by step guide to invest in share market in india?'],
       ['What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',
        'What is the story of Kohinoor (Koh-i-Noor) Diamond?'],
       ['How can Internet speed be increased by hacking through DNS?',
        'How can I increase the speed of my internet connection while using a VPN?'],
       ['Find the remainder when [math]23^{24}[/math] is divided by 24,23?',
        'Why am I mentally very lonely? How can I solve it?'],
       ['Which fish would survive in salt water?',
        'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?']],
      dtype=object)
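A small check on toy data, independent of `train_df`, shows how the odd/even split in `unstack_questions` orders the two columns. The labels `a1`, `a2`, `b1`, `b2` are made up to mark the first and second question of two hypothetical pairs, and `reshape(-1, 2)` is shown only as a possible alternative, not as the method used above.

```python
import numpy as np

# stacked documents for two hypothetical pairs, in the order q1, q2, q1, q2
X = np.array(['a1', 'a2', 'b1', 'b2'], dtype=object)

# the odd/even split used by unstack_questions, written with slices
odd_first = np.vstack([X[1::2], X[0::2]]).T

# reshape keeps the stacked order within each pair
stacked_order = X.reshape(-1, 2)

print(odd_first.tolist())      # [['a2', 'a1'], ['b2', 'b1']]
print(stacked_order.tolist())  # [['a1', 'a2'], ['b1', 'b2']]
```

With the odd rows first, the second stacked question of each pair lands in column 0. That is harmless for symmetric features, but it matters if later steps assume column 0 is `question1`.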