In [20]:
# Pre-existing libraries
import nltk
from nltk.corpus import stopwords
import pandas as pd
import scipy
from sklearn import *
import re
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import pickle
import xgboost as xgb

# Custom made modules
from SimpleCountVectorizer import *
from SimpleCountVectorizerAMC import *
from TFIDFVectorizer import *
from utils import *
from mistakes import *

# Count Vectorizer

In [21]:
train_df = pd.read_csv("./data/quora_train_data.csv")
test_df = pd.read_csv('./data/quora_test_data.csv')

In [22]:
train_df.shape, test_df.shape

((323432, 6), (80858, 6))

In [23]:
all_questions = cast_list_as_strings(list(train_df.loc[:, 'question1'])+list(train_df.loc[:, 'question2']))
print(set(type(x).__name__ for x in all_questions))

{'str'}


### Tokenizer function

The tokenization function is the most important part of our CountVectorizer: it decides which tokens represent each document (here, each question). We added several options to it, detailed below:

* **Stopwords**: disabled by default; when enabled, it removes the most common English words. 
    Enabling it lowered our evaluation metrics on this specific problem, but it is a useful option to keep in mind for future projects.


* **Numbers to words**: handles question pairs such as:
    * Q1: How much is 2+2?
    * Q2: What is the sum of two plus two?
    
    Here the numbers are converted to their word form by a function implemented in the utils library.


* **Stemmer and Lemmatizer**: Two great allies of any text model. They normalize words by reducing them to a root form, for example by removing the 's' from plurals.


* **N-grams**: To improve prediction beyond what single tokens allow, we added N-grams (tuples of consecutive tokens). As in sklearn, the size of the N-grams is set with a function parameter.

* **N-tokens**: We suspected that questions with the same number of tokens would be more likely to have the same meaning. Therefore, we implemented a feature that appends to every question a token encoding its length (e.g. ['How', 'old', 'am', 'I'] --> ['How', 'old', 'am', 'I', 'four_words']). This effectively produces a one-hot encoding of the number of tokens in each question.



* **Duplicate question words**: To give more weight to the type of question being asked, we duplicate its question word (what, why, when, who, etc.).

* **Duplicate verbs**: Verbs are extremely important in deciphering the underlying meaning of a sentence. Therefore, we built a feature to apply more importance to them via duplication. Though this approach seemed intuitive, it did not actually improve the performance of the classifier and we therefore left this feature disabled by default. 

* **Duplicate nouns**: Nouns are extremely important in deciphering the underlying meaning of a sentence. Therefore, we attributed more importance to them via duplication.
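
As a rough illustration of how some of the options above fit together, here is a minimal sketch of a tokenizer with number-to-word conversion, N-grams, question-word duplication and a length token. It is not the notebook's actual implementation: the name `sketch_tokenize`, its parameters, the `DIGIT_WORDS` lookup (single digits only) and the `5_words` token format are all assumptions made for the example.

```python
import re

# Hypothetical lookup for single digits only; the real helper in utils may cover more.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
# Assumed list of question words, used only for this illustration.
QUESTION_WORDS = {"what", "why", "when", "who", "where", "which", "how"}


def sketch_tokenize(text, ngram_range=(1, 1), numbers_to_words=True,
                    duplicate_question_words=True, add_length_token=True):
    """Illustrative tokenizer; not the notebook's actual implementation."""
    # Lowercase, then keep runs of letters and individual digits as tokens.
    tokens = re.findall(r"[a-z]+|\d", text.lower())
    n_words = len(tokens)

    if numbers_to_words:
        tokens = [DIGIT_WORDS.get(tok, tok) for tok in tokens]

    # Build N-grams from the plain token sequence, before any duplication.
    lo, hi = ngram_range
    features = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))

    # Repeat each question word once more so that its count is doubled.
    if duplicate_question_words:
        features += [tok for tok in tokens if tok in QUESTION_WORDS]

    # Append one token that encodes the question length.
    if add_length_token:
        features.append(f"{n_words}_words")

    return features


q1_features = sketch_tokenize("How much is 2+2?")
q2_features = sketch_tokenize("How old am I", ngram_range=(1, 2),
                              duplicate_question_words=False)
```

With these assumptions, "2+2" and "two plus two" share the token `two`, and questions of equal length share the same length token.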


## Simple Count Vectorizer

* **N-tokens**: We added an extra feature in the transform step of SimpleCountVectorizerAMC that encodes the number of tokens in each question. This differs from the previous implementation in that it adds only a single feature (one column) to the sparse matrix. Moreover, it helps the classifier by providing a distance between the lengths of questions. Both N-tokens features improved accuracy. 

* **Out-of-vocabulary words**: We imported several word-distance functions and tried to map each out-of-vocabulary word to the closest word in the vocabulary. 
Because both the Levenshtein and Jaccard distances have a complexity of $O(N^2)$, and because our simple count vectorizer produces an extremely large number of features, computing these distances proved infeasible. We therefore implemented a naive yet fast distance algorithm with a complexity upper bound of $O(\min(L_1, L_2))$, where $L_1$ and $L_2$ are the lengths of the two words. In the end, this faster approach also proved infeasible, so neither approach was included in the final algorithm. 
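
The single-column N-tokens feature described above can be sketched as follows. This is only an illustration under stated assumptions: `append_length_column` is a hypothetical helper, whitespace splitting stands in for the real tokenizer, and the actual SimpleCountVectorizerAMC transform may build the column differently.

```python
import numpy as np
from scipy import sparse


def append_length_column(X, questions):
    """Return X with one extra column holding each question's token count.

    Hypothetical helper: whitespace splitting stands in for the real tokenizer.
    """
    lengths = np.array([[len(q.split())] for q in questions], dtype=X.dtype)
    # hstack adds the lengths as one new right-most column and keeps X sparse.
    return sparse.hstack([X, sparse.csr_matrix(lengths)], format="csr")


# Toy count matrix with 2 questions and a 3-word vocabulary.
X = sparse.csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
questions = ["How old am I", "What is the sum of two plus two"]
X_ext = append_length_column(X, questions)
```

Unlike the one-hot length token, a numeric column lets the classifier treat lengths 4 and 5 as closer to each other than lengths 4 and 8.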

### Fitting the improved SimpleCountVectorizer
+ We now load the sparse matrices directly from disk. They are produced in **F1_Building_the_model**.

### Transforming the datasets into sparse matrices

In [16]:
X_tr_q1q2 = scipy.sparse.load_npz('models/X_tr_q1q2.npz')
X_te_q1q2 = scipy.sparse.load_npz('models/X_te_q1q2.npz')

## Checking shapes
X_tr_q1q2.shape, train_df.shape, X_te_q1q2.shape, test_df.shape

((323432, 9425768), (323432, 6), (80858, 9425768), (80858, 6))

In [17]:
y_train = train_df["is_duplicate"].values
y_test = test_df['is_duplicate'].values

y_train.shape, y_test.shape

((323432,), (80858,))

## Base model (Logistic Regression)

In [19]:
# Load the pickled model from disk (a LogisticRegression, despite the variable name)
with open("models/model_lr_count.pkl", "rb") as f:
    loaded_linear_reg = pickle.load(f)

# For classifiers, score() returns the mean accuracy
result_train = loaded_linear_reg.score(X_tr_q1q2, y_train)
result_test = loaded_linear_reg.score(X_te_q1q2, y_test)

print("Accuracy in training:", result_train)
print("Accuracy in testing:", result_test)
loaded_linear_reg

Accuracy in training: 0.9990075193549185
Accuracy in testing: 0.8131662915234115


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=1,
                   warm_start=False)

## Improving results (XGBoost)

In [11]:
xgb_model_countvect = pickle.load(open("models/xgboost_model_countvect.pkl", 'rb'))

result_train = xgb_model_countvect.score(X_tr_q1q2, y_train)
result_test = xgb_model_countvect.score(X_te_q1q2, y_test)

print("Accuracy in training:", result_train)
print("Accuracy in testing:",result_test)

FileNotFoundError: [Errno 2] No such file or directory: 'models/xgboost_model_countvect.pkl'

In [12]:
N=10000
xgb_model = xgb.XGBClassifier(n_estimators=N)
xgb_model.load_model('models/model_tfidf.dat')
xgb_model.predict(X_tr_q1q2)



array([0, 0, 0, ..., 0, 0, 0])

# Some comments

+ The TFIDF vectorizer underperformed SimpleCountVectorizer in every attempt. We hypothesize that TFIDF vectors are less suited to this specific task. One possible reason is that question words (such as what, why, when, who, etc.) appear in many documents, so TFIDF gives them less weight in the feature vector. This could well cause a strong drop in the performance of any classifier.
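
To make the hypothesis above concrete, the following sketch computes an inverse document frequency on a toy corpus. It assumes the smoothed formula $\ln\frac{1+N}{1+df}+1$, which is scikit-learn's default; our custom TFIDFVectorizer may use a different variant, and the four questions are invented for illustration.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain the term.
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log((1 + n_docs) / (1 + df)) + 1


# Invented toy corpus; "what" appears in 3 of 4 questions, "inflation" in 1.
questions = [
    "what is machine learning",
    "what is the capital of france",
    "what causes inflation",
    "why is the sky blue",
]
idf_what = idf("what", questions)
idf_inflation = idf("inflation", questions)
print(idf_what < idf_inflation)
```

Under this weighting the frequent question word receives a smaller weight than the rare content word, even though the question word is what separates "what causes X" from "why does X happen".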

