In [14]:
import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('universal_tagset')

from nltk.corpus import stopwords

import pandas as pd
import scipy
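import sklearn.linear_model  # binds the name `sklearn`, used as sklearn.linear_model.LogisticRegression below
from sklearn import *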
import re

from SimpleCountVectorizer import *
from SimpleCountVectorizerAMC import *

from TFIDFVectorizer import *
from utils import *

from nltk.stem import WordNetLemmatizer, SnowballStemmer
import pickle
import xgboost as xgb

# Count Vectorizer

In [15]:
train_df = pd.read_csv("./data/quora_train_data.csv")
test_df = pd.read_csv('./data/quora_test_data.csv')

In [16]:
train_df.shape, test_df.shape

((323432, 6), (80858, 6))

In [17]:
all_questions = cast_list_as_strings(list(train_df.loc[:, 'question1'])+list(train_df.loc[:, 'question2']))
print(set(type(x).__name__ for x in all_questions))

{'str'}


### Tokenizer function

The tokenization function is the most important part of our CountVectorizer: it decides which tokens represent a document (or phrase). It supports several options, detailed below:

* **Stopwords**: disabled by default; when enabled, it removes the most common English words.
    Enabling it lowered our evaluation metrics on this particular problem, but it is worth keeping in mind for future projects.


* **Numbers to words**: handles cases such as:
    * Q1: How much is 2+2?
    * Q2: What is the sum of two plus two?
    
    Here the digits are converted to their word form by a function implemented in the utils library.


* **Stemmer and Lemmatizer**: two staples of any text model. They normalize words by reducing them to a root form, for example by removing the 's' from plurals.


* **N-grams**: To capture more than single tokens, we also emit tuples of consecutive tokens. As in sklearn, the size of the N-grams is set through a function parameter.

* **N-tokens**: We added an extra feature that holds the number of tokens in the document. This feature helps improve accuracy.

* **Duplicate question words**: To emphasize the type of question being asked, we duplicate the question keyword.

* **Duplicate verbs**: Verbs are extremely important in deciphering the underlying meaning of a sentence, so we give them more weight by duplicating them.

* **Duplicate nouns**: Nouns matter just as much for the underlying meaning of a sentence, so we give them more weight by duplicating them too.
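
The options listed above can be illustrated with a minimal, standard-library sketch. This is not the project's `my_tokenizer_func`: the function name `sketch_tokenizer`, its parameters, the tiny `STOPWORDS`, `NUM_WORDS` and `QUESTION_WORDS` tables, and the `__n_tokens_k__` marker are all assumptions made for illustration. Stemming, lemmatization and the verb/noun duplication are left out because they need NLTK resources.

```python
import re

# Hypothetical, tiny stand-ins for the real resources
# (NLTK stopwords and the number conversion in utils).
STOPWORDS = {"the", "is", "of", "a", "an"}
NUM_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
             "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
QUESTION_WORDS = {"what", "why", "how", "who", "when", "where", "which"}


def sketch_tokenizer(text, remove_stopwords=False, ngram_range=(1, 2),
                     duplicate_question_words=True, add_n_tokens=True):
    """Illustrative tokenizer; names and defaults are assumptions."""
    # Lowercase, then keep runs of letters and runs of digits.
    words = re.findall(r"[a-z]+|\d+", text.lower())
    # Numbers to words (only single digits are mapped in this sketch).
    words = [NUM_WORDS.get(w, w) for w in words]
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]

    # N-grams: every run of n consecutive words, for each n in the range.
    tokens = []
    min_n, max_n = ngram_range
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))

    # Duplicate question words so they count twice in the vector.
    if duplicate_question_words:
        tokens += [w for w in words if w in QUESTION_WORDS]

    # N-tokens: one extra token that encodes the document length.
    if add_n_tokens:
        tokens.append(f"__n_tokens_{len(words)}__")
    return tokens


print(sketch_tokenizer("How much is 2+2?", ngram_range=(1, 1)))
```

With unigrams only, "How much is 2+2?" yields "how" twice (once as a word, once as a duplicated question word) and "two" twice, so it now shares tokens with "What is the sum of two plus two?".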


### Fitting the improved SimpleCountVectorizer
+ We load the already fitted vectorizer directly with pickle. For details of the fitting process, see the notebook **F1_Building_the_model**.

In [18]:
with open("models/CountVect.pkl", 'rb') as f:
    count_vect = pickle.load(f)

### Transforming the datasets into sparse matrices

In [19]:
with open("models/X_tr_q1q2.pkl", 'rb') as f:
    X_tr_q1q2 = pickle.load(f)
with open("models/X_te_q1q2.pkl", 'rb') as f:
    X_te_q1q2 = pickle.load(f)

## Checking shapes
X_tr_q1q2.shape, train_df.shape, X_te_q1q2.shape, test_df.shape

((323432, 9425768), (323432, 6), (80858, 9425768), (80858, 6))

In [24]:
y_train = train_df["is_duplicate"].values
y_test = test_df['is_duplicate'].values

y_train.shape, y_test.shape

((323432,), (80858,))

## Base model (Logistic Regression)

In [25]:
# Load the model from disk
with open("models/model_lr_count.pkl", 'rb') as f:
    loaded_linear_reg = pickle.load(f)
# For a classifier, .score() returns the mean accuracy
result_train = loaded_linear_reg.score(X_tr_q1q2, y_train)
result_test = loaded_linear_reg.score(X_te_q1q2, y_test)

print("Accuracy in training:", result_train)
print("Accuracy in testing:", result_test)

Accuracy in training: 0.9990075193549185
Accuracy in testing: 0.8131662915234115


## Improving results (XGBoost)

In [36]:
with open("models/xgboost_model_countvect.pkl", 'rb') as f:
    xgb_model_countvect = pickle.load(f)


## Training curves

In [37]:
from matplotlib import pyplot as plt
%matplotlib inline

results = xgb_model_countvect.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)

fig = plt.figure(figsize=(20,6))

# plot log loss
ax = fig.add_subplot(121)
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
ax.set_ylabel('Log Loss')
ax.set_title('XGBoost Log Loss')

# plot classification AUC
ax = fig.add_subplot(122)
ax.plot(x_axis, results['validation_0']['auc'], label='Train')
ax.plot(x_axis, results['validation_1']['auc'], label='Test')
ax.legend()
ax.set_ylabel('Classification AUC')
ax.set_title('XGBoost Classification AUC')
plt.show()
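
The two metrics plotted above can be made concrete with a small sketch in plain Python. The function names `binary_log_loss` and `pairwise_auc` are illustrative assumptions, not XGBoost's implementation; the AUC here is the simple quadratic pairwise form, which is fine for a toy example but too slow for the full dataset.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(binary_log_loss(labels, scores), pairwise_auc(labels, scores))
```

Log loss penalizes confident wrong probabilities, while AUC only looks at how well duplicates are ranked above non-duplicates, which is why the two curves can move differently.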

# TFIDF

In our case, TFIDF performs worse than SimpleCountVectorizer (this was already true with the default parameters). We managed to raise its score slightly, but for prediction we only use the CountVectorizer implemented at the beginning.
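
As a reminder of what TF-IDF changes relative to raw counts, here is a minimal sketch of one common, unsmoothed variant (term count times `log(N / df)`). The function `tfidf_weights` is hypothetical; the project's `TFIDFVectorizer` may use a different formula, and libraries such as sklearn add smoothing and normalization by default.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Unsmoothed inverse document frequency (an assumption of this sketch).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    # Weight = raw term count in the document * idf of the term.
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in docs]


docs = [["how", "to", "cook"], ["how", "to", "code"]]
for weights in tfidf_weights(docs):
    print(weights)
```

Terms that occur in every document ("how", "to") get weight 0 in this variant, while the distinguishing terms ("cook", "code") keep a positive weight.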

In [39]:
tfidf_vectorizer = TFIDFVectorizer(count_vect.vocabulary, count_vect.word_to_ind, my_tokenizer_func)
tfidf_vectorizer.fit(all_questions)

In [40]:
X_tfidf_tr_q1q2 = get_features_from_df(train_df, tfidf_vectorizer)
X_tfidf_te_q1q2 = get_features_from_df(test_df, tfidf_vectorizer)

X_tfidf_tr_q1q2.shape, train_df.shape, test_df.shape, X_tfidf_te_q1q2.shape

In [41]:
logistic = sklearn.linear_model.LogisticRegression(solver="liblinear", verbose=1, max_iter=1000)
logistic.fit(X_tfidf_tr_q1q2, y_train)

logistic.score(X_tfidf_tr_q1q2, y_train), logistic.score(X_tfidf_te_q1q2, y_test)

In [42]:
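N = 10000  # upper bound on boosting rounds; early stopping usually ends training sooner
xgb_model = xgb.XGBClassifier(n_estimators=N)
# Note: passing early_stopping_rounds / eval_metric to fit() works on older XGBoost
# releases; newer releases expect them in the XGBClassifier constructor instead.
xgb_model.fit(X_tfidf_tr_q1q2, y_train,
              verbose=10,
              eval_set=[(X_tfidf_tr_q1q2, y_train), (X_tfidf_te_q1q2, y_test)],
              early_stopping_rounds=10,
              eval_metric=['auc', 'logloss'],
              )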
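
The early-stopping rule used in the call above can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that a lower validation metric is better (as with log loss); the helper `best_round_with_early_stopping` is hypothetical and is not XGBoost's internal code.

```python
def best_round_with_early_stopping(val_losses, patience=10):
    """Return (best_round, best_loss), stopping once `patience` rounds pass without improvement."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and keep training.
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop.
            break
    return best_round, best_loss


losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48]
print(best_round_with_early_stopping(losses, patience=2))
```

A small `patience` saves time but can stop before a later improvement, which is the trade-off behind choosing a value such as 10 together with a large `n_estimators`.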

In [43]:
xgb_model.save_model('models/model_tfidf.dat')

Important notes:
I have uploaded final notebook 1: F1_Building_the_model
I have moved some functions into a utils.py library (int2num, cast2int)
I have also created a mistakes.py library with the four mistakes functions. Whoever writes the notebook should not define them in the notebook, only import them:

from utils import *
from mistakes import *

I have generated two final models, model_count.dat and model_tfidf.dat, which can be loaded and compared with the sklearn ones. The first reaches an AUC of 88% and the second 80%.