If any part of the code is random, set seeds so that your results are
reproducible.\
This notebook should report TRAIN and VALIDATION results for ROC AUC (`sklearn.metrics.roc_auc_score`).\
Optional: TEST results can be obtained by sending predictions to Kaggle.\
This notebook does not have to train anything.\
It should be relatively fast to execute (probably less than 10 minutes, since there is no
training).\
This notebook should only load trained models from disk, make predictions, and compute
metrics.
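
As a minimal sketch of the seeding advice above, the snippet below seeds Python's `random` module and a NumPy generator, then checks that two runs produce identical draws. The helper name `draw_sample` and the seed value are illustrative only; the notebook itself passes its seed to `train_test_split` through `random_state`.

```python
import random

import numpy as np

SEED = 123  # illustrative value; any fixed integer works


def draw_sample(seed):
    """Return a few pseudo-random numbers drawn after seeding both generators."""
    random.seed(seed)                  # seeds Python's built-in generator
    rng = np.random.default_rng(seed)  # creates a seeded NumPy generator
    return random.random(), rng.random(3).tolist()


first = draw_sample(SEED)
second = draw_sample(SEED)
print(first == second)  # same seed, same numbers
```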

In [3]:
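import pickle  # to load the trained model from disk
import numpy as np  # np.where is used below to threshold similarity scores
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from utils import *  # expected to provide get_features_from_df
RANDOM_SEED = 123  # same value as used when training the models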

# load in quora datasets
# use this to train and VALIDATE your solution
data = pd.read_csv("./quora_train_data.csv")

# hold out 5% of the data as a local test set, then 5% of the remainder for validation
A_df, test_df, y_A, y_test = train_test_split(data, data["is_duplicate"].values, test_size=0.05, random_state=RANDOM_SEED)
train_df, va_df, y_train, y_val = train_test_split(A_df, y_A, test_size=0.05, random_state=RANDOM_SEED)

print('tr_df.shape=',train_df.shape) # tr_df.shape= (291897, 6)
print('va_df.shape=',va_df.shape) # va_df.shape= (15363, 6)
print('te_df.shape=',test_df.shape) # te_df.shape= (16172, 6)

tr_df.shape= (291897, 6)
va_df.shape= (15363, 6)
te_df.shape= (16172, 6)
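
The row counts above follow from applying a 5% hold-out twice. The sketch below reproduces that arithmetic in plain Python, assuming the held-out count is 5% of the rows rounded up, which is consistent with the printed shapes. `split_sizes` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def split_sizes(n_rows, test_percent):
    """Return (n_kept, n_held_out), assuming the held-out count is rounded up."""
    n_held_out = (n_rows * test_percent + 99) // 100  # integer ceiling of n_rows * percent / 100
    return n_rows - n_held_out, n_held_out


n_total = 291897 + 15363 + 16172           # total rows, summed from the shapes printed above
n_rest, n_test = split_sizes(n_total, 5)   # first split: local test set
n_train, n_val = split_sizes(n_rest, 5)    # second split: validation set
print(n_train, n_val, n_test)  # 291897 15363 16172
```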


In [7]:
# load the model from disk
filename="logreg.sav"
with open(filename, 'rb') as f:
    count_vectorizer, logistic = pickle.load(f)
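
For context, a file such as `logreg.sav` can be produced by pickling the fitted vectorizer and classifier together as one tuple, which is what the unpacking above expects. The sketch below shows such a round trip on toy data. The toy sentences, labels, and temporary file path are made up for illustration, and the real training notebook may save the model differently.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for this illustration only.
texts = ["how to learn python", "learn python fast", "best pizza in rome", "pizza recipe"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

path = os.path.join(tempfile.mkdtemp(), "toy_model.sav")
with open(path, "wb") as f:
    pickle.dump((vectorizer, model), f)  # save both objects as one tuple

with open(path, "rb") as f:
    loaded_vectorizer, loaded_model = pickle.load(f)  # unpack in the same order

original = model.predict(vectorizer.transform(texts))
restored = loaded_model.predict(loaded_vectorizer.transform(texts))
print((restored == original).all())  # the reloaded pair predicts like the original
```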

In [8]:
# preprocess the data with the fitted vectorizer, then predict and score
# Validation
print("Validation Results")
va_df_prep = get_features_from_df(va_df,count_vectorizer)
predictions = logistic.predict(va_df_prep)  # hard 0/1 labels, not probabilities
result = roc_auc_score(y_val, predictions)
print("Val ROC-AUC: %.3f"%(result))

# Test
print("\nTest Results")
te_df_prep = get_features_from_df(test_df,count_vectorizer)
predictions = logistic.predict(te_df_prep)  # hard 0/1 labels, not probabilities
result = roc_auc_score(y_test, predictions)
print("Test ROC-AUC: %.3f"%(result))

Validation Results
Val ROC-AUC: 0.720

Test Results
Test ROC-AUC: 0.729
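
Note that `predict` returns hard 0/1 labels, so the ROC AUC above is computed from a single operating point. ROC AUC is usually computed from continuous scores, such as `predict_proba(...)[:, 1]` for a scikit-learn classifier. The toy example below, with made-up labels and scores, illustrates how thresholding before scoring can change the metric. It is only an illustration and does not reproduce the notebook's numbers.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.6, 0.55, 0.9, 0.3, 0.7])  # stand-in for predict_proba(X)[:, 1]
hard_labels = (scores > 0.5).astype(int)            # stand-in for thresholded 0/1 predictions

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print("AUC from scores: %.3f" % auc_from_scores)
print("AUC from labels: %.3f" % auc_from_labels)
```

In this toy case the thresholded labels give a lower AUC than the raw scores, because all ranking information within each predicted class is lost.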


### Improved version using cosine similarity
Our manually written preprocessing function is very inefficient and takes a long time to run, so here we use CountVectorizer with similar hyperparameters instead.
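
A minimal sketch of this idea: vectorize both questions of each pair with the same `CountVectorizer` and take the cosine similarity of the two count vectors, row by row. The function name `pairwise_cosine`, the toy questions, and the default vectorizer settings are assumptions for illustration; the hyperparameters actually used to produce the CSV files loaded below are not shown in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize


def pairwise_cosine(q1, q2, vectorizer):
    """Cosine similarity between q1[i] and q2[i] for every pair i."""
    a = normalize(vectorizer.transform(q1))  # L2-normalize each row
    b = normalize(vectorizer.transform(q2))
    # The dot product of unit vectors is their cosine similarity.
    return np.asarray(a.multiply(b).sum(axis=1)).ravel()


# Toy question pairs, invented for this illustration only.
q1 = ["how do I learn python", "what is the capital of france"]
q2 = ["how do I learn python", "best pizza recipe"]

vectorizer = CountVectorizer().fit(q1 + q2)  # one shared vocabulary for both columns
sims = pairwise_cosine(q1, q2, vectorizer)
print(sims)  # identical questions score near 1, unrelated ones score 0
```

Swapping in `TfidfVectorizer` from the same module would give a TF-IDF weighted variant without changing `pairwise_cosine`.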

In [10]:
# Train Set Performance
train_results = pd.read_csv("train_cosine_similiarity.csv")  # precomputed similarity scores
print("Train Results")
predictions = np.where(train_results > 0.5,1,0)  # predict "duplicate" when similarity > 0.5
result = roc_auc_score(y_train, predictions)
print("Train ROC-AUC: %.3f"%(result))

Train Results
Train ROC-AUC: 0.672


In [12]:
# Validation
print("Validation Results")
val_results = pd.read_csv("val_cosine_similiarity.csv")  # precomputed similarity scores
predictions = np.where(val_results > 0.5,1,0)  # predict "duplicate" when similarity > 0.5
result = roc_auc_score(y_val, predictions)
print("Val ROC-AUC: %.3f"%(result))

# Test
print("\nTest Results")
test_results = pd.read_csv("test_cosine_similiarity.csv")
predictions = np.where(test_results > 0.5,1,0)
result = roc_auc_score(y_test, predictions)
print("Test ROC-AUC: %.3f"%(result))

Validation Results
Val ROC-AUC: 0.669

Test Results
Test ROC-AUC: 0.668


### Improved version using cosine similarity and TF-IDF

In [17]:
# Train Set Performance
train_results = pd.read_csv("train_tfidf_cos.csv")  # precomputed TF-IDF similarity scores
print("Train Results")
predictions = np.where(train_results > 0.5,1,0)  # predict "duplicate" when similarity > 0.5
result = roc_auc_score(y_train, predictions)
print("Train ROC-AUC: %.3f"%(result))

# Validation
print("\nValidation Results")
val_results = pd.read_csv("val_tfidf_cos.csv")
predictions = np.where(val_results > 0.5,1,0)
result = roc_auc_score(y_val, predictions)
print("Val ROC-AUC: %.3f"%(result))

# Test
print("\nTest Results")
test_results = pd.read_csv("test_tfidf_cos.csv")
predictions = np.where(test_results > 0.5,1,0)
result = roc_auc_score(y_test, predictions)
print("Test ROC-AUC: %.3f"%(result))

Train Results
Train ROC-AUC: 0.675

Validation Results
Val ROC-AUC: 0.666

Test Results
Test ROC-AUC: 0.669
