## Enhanced feature engineering model

Some questions differ only in the numbers they contain, so this pipeline no longer strips numbers from the questions.
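
To illustrate why keeping numbers matters, here is a minimal sketch of a cleaning step with an `excl_num` switch. The function name `clean_question` and its regex rules are assumptions made for illustration; the real `utils.clean_questions` is not shown in this notebook and may behave differently.

```python
import re


def clean_question(text, excl_num=True):
    """Hypothetical cleaner: lowercase, drop punctuation, optionally drop digits."""
    text = text.lower()
    # Keep digits only when excl_num is False.
    pattern = r"[^a-z\s]" if excl_num else r"[^a-z0-9\s]"
    text = re.sub(pattern, "", text)
    # Collapse the whitespace left behind by removed characters.
    return " ".join(text.split())


q1 = "Can I make 30,000 a month betting on horses?"
q2 = "Can I make 80,000 a month betting on horses?"

# With numbers stripped, the two questions collapse to the same string.
same_without_nums = clean_question(q1) == clean_question(q2)
# With numbers kept, they stay distinguishable.
same_with_nums = clean_question(q1, excl_num=False) == clean_question(q2, excl_num=False)
```

Under these assumed rules, any feature computed on the cleaned text cannot separate the two questions unless the digits are kept.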

Two types of features are engineered:

1. n_gram similarity between each pair of questions
2. min/max/avg distance between words in a single question, currently using the following metrics:
  * euclidean
  * cosine
  * city block (Manhattan)
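
As a rough illustration of the first feature type, the sketch below scores a question pair by the Jaccard overlap of their word n-grams for n = 1, 2, 3. Jaccard overlap is an assumption here: the notebook does not show how `utils.calc_ngram_similarity` computes its score, and the names `ngrams` and `ngram_similarity` are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(q1, q2, n_grams=(1, 2, 3)):
    """Jaccard overlap of n-gram sets, one score per n (assumed metric)."""
    t1, t2 = q1.split(), q2.split()
    sims = []
    for n in n_grams:
        a, b = ngrams(t1, n), ngrams(t2, n)
        union = a | b
        # Questions too short to form any n-gram get a score of 0.
        sims.append(len(a & b) / len(union) if union else 0.0)
    return sims


identical = ngram_similarity("month bet horse", "month bet horse")
partial = ngram_similarity("a b c", "a b d")
```

Identical questions score 1.0 for every n, while questions that share only a prefix score lower as n grows.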
  
**Pipeline**
1. Stack questions
2. Clean questions - **include numbers**
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. Random Forest
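
The min/max/avg distance features in steps 3 and 5 can be sketched as follows, assuming each word in a question has already been mapped to a vector. The function names, the toy vectors, and the choice to emit zeros for questions with fewer than two words are all assumptions; `utils.add_min_max_avg_distance_features` is not shown here and may differ.

```python
from itertools import combinations

import numpy as np


def _distance(u, v, metric):
    """Distance between two word vectors for one of the three metrics."""
    if metric == "euclidean":
        return float(np.linalg.norm(u - v))
    if metric == "cityblock":
        return float(np.abs(u - v).sum())
    if metric == "cosine":
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        # Treat a zero vector as distance 0 to avoid dividing by zero.
        return float(1.0 - np.dot(u, v) / denom) if denom else 0.0
    raise ValueError(f"unknown metric: {metric}")


def min_max_avg_distances(word_vectors, metrics=("euclidean", "cosine", "cityblock")):
    """Return [min, max, avg] of pairwise word distances for each metric."""
    vectors = [np.asarray(v, dtype=float) for v in word_vectors]
    pairs = list(combinations(vectors, 2))
    feats = []
    for metric in metrics:
        if not pairs:
            # Assumption: fewer than two words yields zeros.
            feats.extend([0.0, 0.0, 0.0])
            continue
        d = [_distance(u, v, metric) for u, v in pairs]
        feats.extend([min(d), max(d), sum(d) / len(d)])
    return feats


# Toy 2-d "word vectors" standing in for real embeddings.
feats = min_max_avg_distances([[1, 0], [0, 1], [1, 1]])
```

With three metrics and three statistics, this yields nine values per question.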

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')
MEM_PATH = '../data/transform_memory/'
model_name = 'xgb_feat_eng_incl_nums'

In [3]:
clean_text_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False, kw_args={'excl_num':False})),
    ],
    memory = MEM_PATH
)

lemma_text_pipe = Pipeline(
    [
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))        
    ],
    memory = MEM_PATH
)

# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# clean text pipe
clean_text_features = Pipeline(
    [
        ('clean', clean_text_pipe),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# lemma pipe
lemma_features = Pipeline(
    [
        ('clean', clean_text_pipe),
        ('lemma', lemma_text_pipe),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# pre-process pipe
feature_union_pipe = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_features', clean_text_features),
                ('lemma_pipe', lemma_features)
            ]
        )),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=1, random_state=42))
    ]
)

In [4]:
# random_state only applies when shuffle=True; recent scikit-learn raises a ValueError otherwise
skf = StratifiedKFold(n_splits=3)
cv = cross_validate(feature_union_pipe, 
               X_train, 
               y_train, 
               cv=skf, 
               n_jobs=1, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'),
               verbose=100)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  ................................................................


KeyboardInterrupt: 

In [None]:
results_df = utils.load('results')

results_df = results_df.drop(index=model_name, errors='ignore')
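# DataFrame.append was removed in pandas 2.0; this assumes utils.log_scores returns a DataFrame
results_df = pd.concat([results_df, utils.log_scores(cv, model_name)])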
results_df

In [None]:
utils.save(results_df, 'results')

## Results

The engineered features raise AUC from 0.80 to 0.82.

In [7]:
rf.fit(X_train, y_train)
utils.save(rf, model_name)

y_probs = rf.predict_proba(X_train)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

|   | gt | prob | diff |
|---|----|------|------|
| 0 | 0 | 0.15 | -0.15 |
| 1 | 0 | 0.18 | -0.18 |
| 2 | 0 | 0.086 | -0.086 |
| 3 | 0 | 0.0 | 0.0 |
| 4 | 0 | 0.042 | -0.042 |
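
The output above is consistent with `diff` being the ground truth minus the predicted probability. A minimal sketch of such a helper follows; the name `ground_truth_diff` and the sample values are hypothetical, and the real `utils.ground_truth_analysis` may compute more than this.

```python
import pandas as pd


def ground_truth_diff(y_true, y_probs):
    """Hypothetical helper: per-row gap between label and predicted probability."""
    df = pd.DataFrame({"gt": list(y_true), "prob": list(y_probs)})
    # Close to +1: confident false negative; close to -1: confident false positive.
    df["diff"] = df["gt"] - df["prob"]
    return df


# Made-up labels and probabilities for illustration only.
errors = ground_truth_diff([0, 1, 1, 0], [0.15, 0.02, 0.90, 0.86])

worst_fn = errors.sort_values("diff", ascending=False).index[0]
worst_fp = errors.sort_values("diff").index[0]
```

Sorting `diff` in descending order surfaces the worst false negatives, and sorting in ascending order surfaces the worst false positives.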


In [8]:
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
    ]
)
X_train_lemma = lemma_pipe.transform(X_train)

## Top false negative errors

In [9]:
fn_idx = class_errors_df.sort_values('diff', ascending = False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.022035150817837166

Is unifunds.com legit?
Is unifunds legit?

Lemma--------

unifund legit
unifundscom legit

Feature Space------
[ 0.66666667  0.33333333  0.33333333  6.57793499  6.57793499  6.57793499
  0.732846    0.732846    0.732846   88.33192412 88.33192412 88.33192412
  6.57793499  6.57793499  6.57793499  0.732846    0.732846    0.732846
 88.33192412 88.33192412 88.33192412  0.5         0.5         0.5
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.        ]
-------------------------------------------

Prob: 0.06587658468825115

What is the remainder when 2^100 is divided by 101?
What is the remainder when [math]2^{100}[/math] is divided by 101?

Lemma--------

remainder  divide
remainder  divide

Feature Space------
[  1.           1.           1.           0.           7.61300637
   5.48716164   0.           0.83449638   0.

## Top false positive errors

In [10]:
fp_idx = class_errors_df.sort_values('diff').head().index
for idx in fp_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.859919422244422

Can I make 30,000 a month betting on horses?
Can I make 80,000 a month betting on horses?

Lemma--------

 month bet horse
 month bet horse

Feature Space------
[  1.           1.           1.           3.54843587   8.64389945
   6.99204975   0.24125347   0.86209143   0.6860896   48.7158108
 119.75634825  95.64765528   3.54843587   8.64389945   6.99204975
   0.24125347   0.86209143   0.6860896   48.7158108  119.75634825
  95.64765528   1.           1.           1.           6.92833324
   8.06033018   7.32694928   0.60332816   0.81230419   0.70401332
  95.01284819 112.79109566 101.19969094   6.92833324   8.06033018
   7.32694928   0.60332816   0.81230419   0.70401332  95.01284819
 112.79109566 101.19969094]
-------------------------------------------

Prob: 0.8402367798867798

What was the significance of the battle of Somme, and how did this battle compare and contrast to the Battle of Taiyuan?
What was the significance of the battle of Somme, and how did this 

## Next Steps

1. The false negative and false positive errors appear to differ from those of the XGBoost model.
  * Run an ensemble of Random Forest and XGBoost to smooth out the bias.
2. Short questions with numbers and stop words are a common theme among the false positive errors.
  * Generate features that include stop words and numbers.
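
One simple way to try the ensemble idea is to average the predicted probabilities of the two fitted models. The sketch below assumes both probability arrays are already available; the function name `blend_probabilities`, the equal default weight, and the sample values are illustrative choices, not something this notebook has evaluated.

```python
import numpy as np


def blend_probabilities(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two models' positive-class probabilities."""
    prob_a = np.asarray(prob_a, dtype=float)
    prob_b = np.asarray(prob_b, dtype=float)
    if prob_a.shape != prob_b.shape:
        raise ValueError("probability arrays must have the same shape")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    return weight_a * prob_a + (1.0 - weight_a) * prob_b


# Made-up probabilities standing in for the two models' predict_proba output.
rf_probs = [0.2, 0.8, 0.6]
xgb_probs = [0.4, 0.6, 0.9]
blended = blend_probabilities(rf_probs, xgb_probs)
```

scikit-learn's `VotingClassifier` with `voting='soft'` does a similar averaging when both estimators are fitted together, which may be more convenient inside the existing pipeline.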