## Training

### Import Libraries and functions from utils.py

In [1]:
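import os
import numpy as np
import pandas as pd
import scipy
import sklearn
# Import the submodules used below explicitly instead of `from sklearn import *`
import sklearn.feature_extraction.text
import sklearn.linear_model
import sklearn.model_selection
# Helper functions defined in utils.py
from utils import cast_list_as_strings, get_features_from_df, save_model_with_overwrite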

In [2]:
path_data = os.path.expanduser('~')


# use this to train and VALIDATE your solution
train_df = pd.read_csv("./data/quora_train_data.csv")

# use this to provide the expected generalization results
test_df = pd.read_csv("./data/quora_test_data.csv")

In [3]:
train_df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 346692 | 38482 | 10706 | Why do I get easily bored with everything? | Why do I get bored with things so quickly and ... | 1 |
| 1 | 327668 | 454117 | 345117 | How do I study for Honeywell company recruitment? | How do I study for Honeywell company recruitme... | 1 |
| 2 | 272993 | 391373 | 391374 | Which search engine algorithm is Quora using? | Why is Quora not using reliable search engine? | 0 |
| 3 | 54070 | 82673 | 95496 | How can I smartly cut myself? | Can someone who thinks about suicide for 7 yea... | 0 |
| 4 | 46450 | 38384 | 72436 | How do I see who is viewing my Instagram videos? | Can one tell who viewed my Instagram videos? | 1 |


In [4]:
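# TODO: don't we already load the test set into test_df?
# Hold out 5% of train_df as an internal test partition (te_df)
A_df, te_df = sklearn.model_selection.train_test_split(train_df, test_size=0.05, random_state=123)

# Split the remaining 95% again: 5% for validation (va_df), the rest for training (tr_df)
tr_df, va_df = sklearn.model_selection.train_test_split(A_df, test_size=0.05, random_state=123)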
print('tr_df.shape=',tr_df.shape)
print('va_df.shape=',va_df.shape)
print('te_df.shape=',te_df.shape)

tr_df.shape= (291897, 6)
va_df.shape= (15363, 6)
te_df.shape= (16172, 6)
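
The three sizes printed above can be reproduced by hand. The following is a minimal sketch of the arithmetic, assuming the hold-out size for a fractional `test_size` is rounded up, which is consistent with the shapes shown; the helper name `split_sizes` is hypothetical and is not part of utils.py.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (hypothetical helper)."""
    # Assumed rounding rule: the hold-out size is rounded up
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 323432                         # rows in train_df
n_a, n_te = split_sizes(n_total, 0.05)   # first split: A_df / te_df
n_tr, n_va = split_sizes(n_a, 0.05)      # second split: tr_df / va_df
print(n_tr, n_va, n_te)
```

Because the second split is taken from the remaining 95%, tr_df holds roughly 90% of train_df, while va_df and te_df each hold about 5%.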


<b>cast_list_as_strings</b> casts each element in the input list to a string.
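
The cast matters because pandas reads an empty question as a float NaN, and a text vectorizer expects strings. The actual helper lives in utils.py and is not shown here, so the following is only a sketch of what it might look like; `cast_list_as_strings_sketch` is a hypothetical name.

```python
def cast_list_as_strings_sketch(values):
    """Hypothetical stand-in for utils.cast_list_as_strings: apply str() to every element."""
    return [str(v) for v in values]

# A missing question shows up as float('nan') after pd.read_csv
mixed = ["How do I learn Python?", float("nan"), 42]
casted = cast_list_as_strings_sketch(mixed)
print(casted)
```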

In [5]:
q1_train =  cast_list_as_strings(list(train_df["question1"]))
q2_train =  cast_list_as_strings(list(train_df["question2"]))
q1_test  =  cast_list_as_strings(list(test_df["question1"]))
q2_test  =  cast_list_as_strings(list(test_df["question2"]))

In [6]:
q1_train[0], q2_train[0]

('Why do I get easily bored with everything?',
 'Why do I get bored with things so quickly and easily?')

Combine all the questions in the train partition into a single list, <b>all_questions</b>, to fit the count_vectorizer. The test partition is left out of the fit, so the vocabulary is learned from training data only.

In [7]:
all_questions = q1_train + q2_train

In [8]:
count_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,1))
count_vectorizer.fit(all_questions)
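
To make the fitting step concrete, here is a small sketch on two made-up questions (the toy sentences and the `toy_` names are assumptions for illustration only). Fitting learns a vocabulary; transforming counts how often each vocabulary word appears, and words that were not seen during fitting are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_questions = ["How do I learn Python?", "How can I learn Python quickly?"]

# Same setting as above: unigrams only
toy_vectorizer = CountVectorizer(ngram_range=(1, 1))
toy_vectorizer.fit(toy_questions)

# Words are lowercased; with the default tokenizer, single-character tokens such as "I" are dropped
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# "and" and "rust" were never seen during fit, so they contribute nothing
toy_row = toy_vectorizer.transform(["learn python and rust"]).toarray()[0].tolist()
print(toy_row)
```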

The function <b>get_features_from_df</b> takes a dataframe in the same format as the train data, together with the fitted count_vectorizer, and returns a scipy sparse matrix with the features from question 1 and question 2.
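
The helper itself is defined in utils.py and is not shown in this notebook. One plausible way to build such a matrix, sketched below under that caveat, is to vectorize each question column separately and stack the two sparse blocks side by side; the name `get_features_sketch` and the toy dataframe are hypothetical.

```python
import pandas as pd
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

def get_features_sketch(df, vectorizer):
    """Hypothetical stand-in for utils.get_features_from_df."""
    q1 = [str(q) for q in df["question1"]]
    q2 = [str(q) for q in df["question2"]]
    X_q1 = vectorizer.transform(q1)
    X_q2 = vectorizer.transform(q2)
    # Assumption: the two bag-of-words blocks are concatenated column-wise
    return scipy.sparse.hstack((X_q1, X_q2)).tocsr()

toy_df = pd.DataFrame({
    "question1": ["How do I learn Python?", "What is a sparse matrix?"],
    "question2": ["How can I learn Python quickly?", "Why is the sky blue?"],
})
toy_vec = CountVectorizer().fit(list(toy_df["question1"]) + list(toy_df["question2"]))
X_toy = get_features_sketch(toy_df, toy_vec)
print(X_toy.shape)
```

With this layout each row has twice as many columns as the vocabulary has words: the first half describes question 1 and the second half describes question 2.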

In [9]:
X_tr_q1q2 = get_features_from_df(train_df,count_vectorizer)
X_te_q1q2  = get_features_from_df(test_df, count_vectorizer)

In [10]:
X_tr_q1q2.shape, train_df.shape, test_df.shape, X_te_q1q2.shape

((323432, 156550), (323432, 6), (80858, 6), (80858, 156550))

### Models

Now we can use this representation X_tr_q1q2 to fit a model. We are going to train two models:
- Perceptron
- Logistic Regression

In [11]:
perceptron = sklearn.linear_model.Perceptron()

y_train = train_df["is_duplicate"].values
perceptron.fit(X_tr_q1q2, y_train)

In [12]:
logistic = sklearn.linear_model.LogisticRegression(solver="liblinear",
                                                   random_state=123)
# y_train is reused from the previous cell
logistic.fit(X_tr_q1q2, y_train)
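
Both models above are fit on the full train_df, so their training accuracy says little about generalization. As a sketch of how the two could be compared on held-out rows, the following uses small synthetic data as a stand-in (an assumption, not the Quora features); the `toy_` names are hypothetical.

```python
import scipy.sparse
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, converted to sparse to mimic the bag-of-words matrix
X_dense, y_toy = make_classification(n_samples=500, n_features=20, random_state=123)
X_toy = scipy.sparse.csr_matrix(X_dense)
X_fit, X_val, y_fit, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)

toy_perceptron = Perceptron(random_state=123).fit(X_fit, y_fit)
toy_logistic = LogisticRegression(solver="liblinear", random_state=123).fit(X_fit, y_fit)

# Accuracy on rows neither model saw during fitting
acc_perceptron = accuracy_score(y_val, toy_perceptron.predict(X_val))
acc_logistic = accuracy_score(y_val, toy_logistic.predict(X_val))
print(acc_perceptron, acc_logistic)

# Only the logistic model exposes class probabilities
proba_val = toy_logistic.predict_proba(X_val)
```

In this notebook the same pattern would apply to features built from tr_df and va_df rather than from the whole train_df.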

### Store the trained Logistic Regression model to disk

In [13]:
# save logistic regression model to disk
save_model_with_overwrite(logistic, 'model_artifacts/logistic_model.joblib')


The file already exists. Do you want to overwrite it? (y/n) n
Aborting. Model not saved.


In [14]:
# Save perceptron model to disk
save_model_with_overwrite(perceptron, 'model_artifacts/perceptron_model.joblib')

The file already exists. Do you want to overwrite it? (y/n) n
Aborting. Model not saved.
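
The prompts above come from save_model_with_overwrite in utils.py, whose code is not shown. A minimal sketch of a helper with similar behavior, assuming it is built on joblib, could look like this; `save_model_sketch` and its `ask` parameter are hypothetical and exist only so the prompt can be answered programmatically.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

def save_model_sketch(model, path, ask=input):
    """Hypothetical stand-in for utils.save_model_with_overwrite.

    Returns True if the model was written, False if the user declined.
    """
    if os.path.exists(path):
        answer = ask("The file already exists. Do you want to overwrite it? (y/n) ")
        if answer.strip().lower() != "y":
            print("Aborting. Model not saved.")
            return False
    # Create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    joblib.dump(model, path)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model_artifacts", "demo_model.joblib")
    demo_model = LogisticRegression(solver="liblinear")
    first_save = save_model_sketch(demo_model, demo_path)                            # no file yet: written
    second_save = save_model_sketch(demo_model, demo_path, ask=lambda prompt: "n")   # declined: kept as is
    restored = joblib.load(demo_path)                                                # load the stored model back
```

Answering "n", as in the cells above, leaves the previously stored file untouched, so the artifact on disk may come from an earlier run.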


### Output

In [15]:
print("Training complete.")
print("Perceptron model path: model_artifacts/perceptron_model.joblib")
print("Logistic regression model path: model_artifacts/logistic_model.joblib")

Training complete.
Perceptron model path: model_artifacts/perceptron_model.joblib
Logistic regression model path: model_artifacts/logistic_model.joblib
