# Problem Description

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.
* Input:
  a question asked on Quora
* Output:
  0/1 - a prediction of whether the question is insincere (1) or sincere (0)

# Installations and imports

**Note:** Restart the runtime after installing the packages 

In [None]:
! pip install flyvec



In [None]:
!pip install -U PyYAML
import yaml

yaml.__version__



'6.0'

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import gc
import re
import spacy

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras import backend as K
from keras.layers import *
from keras.models import *
from keras.initializers import Constant
from keras.utils.vis_utils import plot_model

import torch
import torch.nn as nn
from tensorflow.keras.optimizers import Adam

from torch import LongTensor, FloatTensor, DoubleTensor
from torch.utils.data import Dataset, DataLoader, sampler
from torch.utils.data.distributed import DistributedSampler

from tqdm.notebook import tqdm
from IPython.display import display, HTML
tqdm.pandas()  # registers progress_apply / progress_map on pandas objects

pd_ctx = pd.option_context('display.max_colwidth', 100)


# Preprocess and Prepare Dataset

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/dataset /train.csv')
df.info()

test_df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/dataset /test.csv')
test_df.info()

df['word_count']= df.question_text.progress_apply(lambda x: len(x.split()))
sincere_data = df[df['target']==0]
insincere_data = df[df['target']==1]
print("Sincere question")
display(sincere_data.head())
print("Insincere question")
display(insincere_data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1306122 entries, 0 to 1306121
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   qid            1306122 non-null  object
 1   question_text  1306122 non-null  object
 2   target         1306122 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 29.9+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375806 entries, 0 to 375805
Data columns (total 2 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   qid            375806 non-null  object
 1   question_text  375806 non-null  object
dtypes: object(2)
memory usage: 5.7+ MB



Sincere question


index,qid,question_text,target,word_count
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0,13
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0,16
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0,10
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0,9
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0,15


Insincere question


index,qid,question_text,target,word_count
22,0000e91571b60c2fb487,Has the United States become the largest dicta...,1,11
30,00013ceca3f624b09f42,Which babies are more sweeter to their parents...,1,15
110,0004a7fcb2bf73076489,If blacks support school choice and mandatory ...,1,15
114,00052793eaa287aff1e1,I am gay boy and I love my cousin (boy). He is...,1,34
115,000537213b01fd77b58a,Which races have the smallest penis?,1,6


**Data size:**
1,306,122 questions

**Data fields**
   * `qid` - unique question identifier
   * `question_text` - Quora question text
   * `target` - a question labeled "insincere" has a value of 1, otherwise 0
   

*No values are null or missing.*

## Clean data

* Replace math equations and links with the placeholders `math equation` and `url`
* Expand contractions
* Correct misspelled words
* Remove punctuation
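
The four steps above can be sketched on a single sentence. This is a minimal illustration, not the notebook's implementation: `sketch_clean`, `SAMPLE_CONTRACTIONS` and `SAMPLE_MISSPELLINGS` are hypothetical names, the lookup tables are tiny stand-ins for the full dictionaries, and the URL pattern is a simplified assumption.

```python
import re

# Tiny, hypothetical lookup tables; the notebook's real dictionaries are much larger.
SAMPLE_CONTRACTIONS = {"don't": "do not", "what's": "what is"}
SAMPLE_MISSPELLINGS = {"Qoura": "Quora", "colour": "color"}


def sketch_clean(text):
    # 1. Replace math blocks and links with placeholder words.
    text = re.sub(r"\[math\].*?\[/math\]", "math equation", text)
    text = re.sub(r"https?://\S+", "url", text)  # simplified URL pattern (assumption)
    # 2. Expand contractions, then 3. fix known misspellings, word by word.
    words = [SAMPLE_CONTRACTIONS.get(w, w) for w in text.split()]
    words = [SAMPLE_MISSPELLINGS.get(w, w) for w in words]
    # 4. Replace every non-alphanumeric character with a space.
    return re.sub(r"[^a-zA-Z0-9]", " ", " ".join(words))


cleaned = sketch_clean("Why don't people on Qoura solve [math]x^2=4[/math]?")
print(cleaned.split())
```

Because the lookups work on whitespace-separated words, a contraction with punctuation attached (for example `don't,`) would not match in this sketch.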

*I consider removing stop words and lowercasing a bad idea for this classification task. In sentiment analysis, removing stop words is problematic when it changes the context. For example, stop-word lists include ‘not’, a negation that can alter the valence of a passage. In addition, proper nouns can strongly affect context classification, as the earlier analysis showed (insincere questions contain many controversial proper nouns such as "Trump", "Indian", "Kim Jung Un", etc.). Therefore, the data keeps its stop words and uppercase letters.*
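
The effect of dropping a negation can be seen in a small sketch. The stop-word list here is a hypothetical five-word sample, and `remove_stop_words` is an illustrative helper, not part of the notebook.

```python
# A small, hypothetical stop-word list that, like the corpus mentioned above, contains "not".
SAMPLE_STOP_WORDS = {"is", "a", "not", "the", "this"}


def remove_stop_words(text):
    # Drop every token whose lowercase form is in the stop-word list.
    return " ".join(w for w in text.split() if w.lower() not in SAMPLE_STOP_WORDS)


positive = remove_stop_words("This is a good answer")
negative = remove_stop_words("This is not a good answer")
print(positive)
print(negative)
```

Both sentences collapse to the same string, so a classifier trained on the filtered text could no longer tell them apart.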

In [None]:
contractions= {"i'm": 'i am',"i'm'a": 'i am about to',"i'm'o": 'i am going to',"i've": 'i have',"i'll": 'i will',"i'll've": 'i will have',"i'd": 'i would',"i'd've": 'i would have',"Whatcha": 'What are you',"amn't": 'am not',"ain't": 'are not',"aren't": 'are not',"'cause": 'because',"can't": 'can not',"can't've": 'can not have',"could've": 'could have',"couldn't": 'could not',"couldn't've": 'could not have',"daren't": 'dare not',"daresn't": 'dare not',"dasn't": 'dare not',"didn't": 'did not','didn’t': 'did not',"don't": 'do not','don’t': 'do not',"doesn't": 'does not',"e'er": 'ever',"everyone's": 'everyone is',"finna": 'fixing to',"gimme": 'give me',"gon't": 'go not',"gonna": 'going to',"gotta": 'got to',"hadn't": 'had not',"hadn't've": 'had not have',"hasn't": 'has not',"haven't": 'have not',"he've": 'he have',"he's": 'he is',"he'll": 'he will',"he'll've": 'he will have',"he'd": 'he would',"he'd've": 'he would have',"here's": 'here is',"how're": 'how are',"how'd": 'how did',"how'd'y": 'how do you',"how's": 'how is',"how'll": 'how will',"isn't": 'is not',"it's": 'it is',"'tis": 'it is',"'twas": 'it was',"it'll": 'it will',"it'll've": 'it will have',"it'd": 'it would',"it'd've": 'it would have',"kinda": 'kind of',"let's": 'let us',"luv": 'love',"ma'am": 'madam',"may've": 'may have',"mayn't": 'may not',"might've": 'might have',"mightn't": 'might not',"mightn't've": 'might not have',"must've": 'must have',"mustn't": 'must not',"mustn't've": 'must not have',"needn't": 'need not',"needn't've": 'need not have',"ne'er": 'never',"o'": 'of',"o'clock": 'of the clock',"ol'": 'old',"oughtn't": 'ought not',"oughtn't've": 'ought not have',"o'er": 'over',"shan't": 'shall not',"sha'n't": 'shall not',"shalln't": 'shall not',"shan't've": 'shall not have',"she's": 'she is',"she'll": 'she will',"she'd": 'she would',"she'd've": 'she would have',"should've": 'should have',"shouldn't": 'should not',"shouldn't've": 'should not have',"so've": 'so have',"so's": 'so is',"somebody's": 
'somebody is',"someone's": 'someone is',"something's": 'something is',"sux": 'sucks',"that're": 'that are',"that's": 'that is',"that'll": 'that will',"that'd": 'that would',"that'd've": 'that would have',"em": 'them',
"there're": 'there are',"there's": 'there is',"there'll": 'there will',"there'd": 'there would',"there'd've": 'there would have',"these're": 'these are',"they're": 'they are',"they've": 'they have',"they'll": 'they will',"they'll've": 'they will have',"they'd": 'they would',"they'd've": 'they would have',
"this's": 'this is',"those're": 'those are',"to've": 'to have',"wanna": 'want to',"wasn't": 'was not',"we're": 'we are',"we've": 'we have',"we'll": 'we will',"we'll've": 'we will have',"we'd": 'we would',"we'd've": 'we would have',"weren't": 'were not',
"what're": 'what are',"what'd": 'what did',"what've": 'what have',"what's": 'what is',"what'll": 'what will',"what'll've": 'what will have',"when've": 'when have',"when's": 'when is',"where're": 'where are',"where'd": 'where did',"where've": 'where have',"where's": 'where is',"which's": 'which is',
"who're": 'who are',"who've": 'who have',"who's": 'who is',"who'll": 'who will',"who'll've": 'who will have',"who'd": 'who would',"who'd've": 'who would have',"why're": 'why are',"why'd": 'why did',"why've": 'why have',"why's": 'why is',
"will've": 'will have',"won't": 'will not',"won't've": 'will not have',"would've": 'would have',"wouldn't": 'would not',"wouldn't've": 'would not have',
"y'all": 'you all',"y'all're": 'you all are',"y'all've": 'you all have',"y'all'd": 'you all would',"y'all'd've": 'you all would have',"you're": 'you are',"you've": 'you have',"you'll've": 'you shall have',"you'll": 'you will',"you'd": 'you would',"you'd've": 'you would have',
'jan.': 'january','feb.': 'february','mar.': 'march','apr.': 'april','jun.': 'june','jul.': 'july','aug.': 'august','sep.': 'september','oct.': 'october','nov.': 'november','dec.': 'december',
'I’m': 'I am','I’m’a': 'I am about to','I’m’o': 'I am going to','I’ve': 'I have','I’ll': 'I will','I’ll’ve': 'I will have','I’d': 'I would','I’d’ve': 'I would have','amn’t': 'am not','ain’t': 'are not','aren’t': 'are not','’cause': 'because',
'can’t': 'can not','can’t’ve': 'can not have','could’ve': 'could have','couldn’t': 'could not','couldn’t’ve': 'could not have','daren’t': 'dare not','daresn’t': 'dare not','dasn’t': 'dare not','doesn’t': 'does not','e’er': 'ever','everyone’s': 'everyone is','gon’t': 'go not',
'hadn’t': 'had not','hadn’t’ve': 'had not have','hasn’t': 'has not','haven’t': 'have not','he’ve': 'he have','he’s': 'he is','he’ll': 'he will','he’ll’ve': 'he will have','he’d': 'he would','he’d’ve': 'he would have','here’s': 'here is',
'how’re': 'how are','how’d': 'how did','how’d’y': 'how do you','how’s': 'how is','how’ll': 'how will','isn’t': 'is not','it’s': 'it is','’tis': 'it is','’twas': 'it was','it’ll': 'it will','it’ll’ve': 'it will have','it’d': 'it would','it’d’ve': 'it would have',
'let’s': 'let us','ma’am': 'madam','may’ve': 'may have','mayn’t': 'may not','might’ve': 'might have','mightn’t': 'might not','mightn’t’ve': 'might not have','must’ve': 'must have','mustn’t': 'must not','mustn’t’ve': 'must not have','needn’t': 'need not','needn’t’ve': 'need not have',
'ne’er': 'never','o’': 'of','o’clock': 'of the clock','ol’': 'old','oughtn’t': 'ought not','oughtn’t’ve': 'ought not have','o’er': 'over','shan’t': 'shall not','sha’n’t': 'shall not','shalln’t': 'shall not','shan’t’ve': 'shall not have',
'she’s': 'she is','she’ll': 'she will','she’d': 'she would','she’d’ve': 'she would have','should’ve': 'should have','shouldn’t': 'should not','shouldn’t’ve': 'should not have','so’ve': 'so have','so’s': 'so is','somebody’s': 'somebody is','someone’s': 'someone is','something’s': 'something is',
'that’re': 'that are','that’s': 'that is','that’ll': 'that will','that’d': 'that would','that’d’ve': 'that would have','there’re': 'there are','there’s': 'there is','there’ll': 'there will','there’d': 'there would','there’d’ve': 'there would have','these’re': 'these are',
'they’re': 'they are','they’ve': 'they have','they’ll': 'they will','they’ll’ve': 'they will have','they’d': 'they would','they’d’ve': 'they would have','this’s': 'this is','those’re': 'those are','to’ve': 'to have','wasn’t': 'was not',
'we’re': 'we are','we’ve': 'we have','we’ll': 'we will','we’ll’ve': 'we will have','we’d': 'we would','we’d’ve': 'we would have','weren’t': 'were not','what’re': 'what are','what’d': 'what did','what’ve': 'what have','what’s': 'what is','what’ll': 'what will','what’ll’ve': 'what will have',
'when’ve': 'when have','when’s': 'when is','where’re': 'where are','where’d': 'where did','where’ve': 'where have','where’s': 'where is','which’s': 'which is','who’re': 'who are','who’ve': 'who have','who’s': 'who is','who’ll': 'who will','who’ll’ve': 'who will have','who’d': 'who would','who’d’ve': 'who would have',
'why’re': 'why are','why’d': 'why did','why’ve': 'why have','why’s': 'why is','will’ve': 'will have','won’t': 'will not','won’t’ve': 'will not have','would’ve': 'would have','wouldn’t': 'would not','wouldn’t’ve': 'would not have',
'y’all': 'you all','y’all’re': 'you all are','y’all’ve': 'you all have','y’all’d': 'you all would','y’all’d’ve': 'you all would have','you’re': 'you are','you’ve': 'you have','you’ll’ve': 'you shall have','you’ll': 'you will','you’d': 'you would','you’d’ve': 'you would have'}
missing_spell = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist',
'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'bitcoin', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota',
'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization','electroneum':'bitcoin','nanodegree':'degree','hotstar':'star','dream11':'dream','ftre':'fire','tensorflow':'framework','unocoin':'bitcoin','lnmiit':'limit','unacademy':'academy',
'altcoin':'bitcoin','altcoins':'bitcoin','litecoin':'bitcoin','coinbase':'bitcoin','cryptocurency':'cryptocurrency','simpliv':'simple','quoras':'quora','schizoids':'psychopath','remainers':'remainder','twinflame':'soulmate','quorans':'quora','brexit':'demonetized','iiest':'institute','dceu':'comics','pessat':'exam','uceed':'college','bhakts':'devotee','boruto':'anime',
'cryptocoin':'bitcoin','blockchains':'blockchain','fiancee':'fiance','redmi':'smartphone','oneplus':'smartphone','qoura':'quora','deepmind':'framework','ryzen':'cpu','whattsapp':'whatsapp','undertale':'adventure','zenfone':'smartphone','cryptocurencies':'cryptocurrencies','koinex':'bitcoin','zebpay':'bitcoin','binance':'bitcoin','whtsapp':'whatsapp','reactjs':'framework',
'bittrex':'bitcoin','bitconnect':'bitcoin','bitfinex':'bitcoin','yourquote':'your quote','whyis':'why is','jiophone':'smartphone','dogecoin':'bitcoin','onecoin':'bitcoin','poloniex':'bitcoin','7700k':'cpu','angular2':'framework','segwit2x':'bitcoin','hashflare':'bitcoin','940mx':'gpu','openai':'framework','hashflare':'bitcoin','1050ti':'gpu','nearbuy':'near buy',
'freebitco':'bitcoin','antminer':'bitcoin','filecoin':'bitcoin','whatapp':'whatsapp','empowr':'empower','1080ti':'gpu','crytocurrency':'cryptocurrency','8700k':'cpu','whatsaap':'whatsapp','g4560':'cpu','payymoney':'pay money','fuckboys':'fuck boys','intenship':'internship','zcash':'bitcoin','demonatisation':'demonetization','narcicist':'narcissist','mastuburation':'masturbation',
'trignometric':'trigonometric','cryptocurreny':'cryptocurrency','howdid':'how did','crytocurrencies':'cryptocurrencies','phycopath':'psychopath','bytecoin':'bitcoin','possesiveness':'possessiveness','scollege':'college','humanties':'humanities','altacoin':'bitcoin','demonitised':'demonetized','brasília':'brazilia','accolite':'accolyte','econimics':'economics','varrier':'warrier',
'quroa':'quora','statergy':'strategy','langague':'language','splatoon':'game','7600k':'cpu','gate2018':'gate 2018','in2018':'in 2018','narcassist':'narcissist','jiocoin':'bitcoin','hnlu':'hulu','7300hq':'cpu','weatern':'western','interledger':'blockchain','deplation':'deflation', 'cryptocurrencies':'cryptocurrency', 'bitcoin':'blockchain cryptocurrency'}
# Replace math equations and links with the placeholders "math equation" and "url"
def clean_tag(x):
    if '[math]' in x:
        x = re.sub(r'\[math\].*?math\]', 'math equation', x)  # replace with "math equation"
    if 'http' in x or 'www' in x:
        x = re.sub(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', 'url', x)  # replace with "url"
    return x

# Expand contractions; words without an entry are returned unchanged
def contraction_fix(word):
    return contractions.get(word, word)

# Correct misspelled words; words without an entry are returned unchanged
def misspell_fix(word):
    return missing_spell.get(word, word)


def clean_text(text):
    text = clean_tag(text)
    text = " ".join([contraction_fix(w) for w in text.split()]) 
    text = " ".join([misspell_fix(w) for w in text.split()]) 
    #Remove punctuation
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text) 
    return text

def apply_clean_text(question_text):
    tmp = pd.DataFrame()
    tmp['question_text'] = question_text
    tmp['clean'] = tmp.question_text.progress_map(clean_text)
    with pd_ctx:
        display(tmp)
    return tmp['clean']


trainX_ques = apply_clean_text(df.question_text)
testX_ques = apply_clean_text(test_df.question_text)


index,question_text,clean
0,How did Quebec nationalists see their province as a nation in the 1960s?,How did Quebec nationalists see their province as a nation in the 1960s
1,"Do you have an adopted dog, how would you encourage people to adopt and not shop?",Do you have an adopted dog how would you encourage people to adopt and not shop
2,Why does velocity affect time? Does velocity affect space geometry?,Why does velocity affect time Does velocity affect space geometry
3,How did Otto von Guericke used the Magdeburg hemispheres?,How did Otto von Guericke used the Magdeburg hemispheres
4,Can I convert montra helicon D to a mountain bike by just changing the tyres?,Can I convert montra helicon D to a mountain bike by just changing the tyres
...,...,...
1306117,What other technical skills do you need as a computer science undergrad other than c and c++?,What other technical skills do you need as a computer science undergrad other than c and c
1306118,Does MS in ECE have good job prospects in USA or like India there are more IT jobs present?,Does MS in ECE have good job prospects in USA or like India there are more IT jobs present
1306119,Is foam insulation toxic?,Is foam insulation toxic
1306120,How can one start a research project based on biochemistry at UG level?,How can one start a research project based on biochemistry at UG level



index,question_text,clean
0,Why do so many women become so rude and arrogant when they get just a little bit of wealth and p...,Why do so many women become so rude and arrogant when they get just a little bit of wealth and p...
1,When should I apply for RV college of engineering and BMS college of engineering? Should I wait ...,When should I apply for RV college of engineering and BMS college of engineering Should I wait ...
2,What is it really like to be a nurse practitioner?,What is it really like to be a nurse practitioner
3,Who are entrepreneurs?,Who are entrepreneurs
4,Is education really making good people nowadays?,Is education really making good people nowadays
...,...,...
375801,How many countries listed in gold import in indua?,How many countries listed in gold import in indua
375802,Is there an alternative to dresses on formal parties?,Is there an alternative to dresses on formal parties
375803,Where I can find best friendship quotes in Telugu?,Where I can find best friendship quotes in Telugu
375804,What are the causes of refraction of light?,What are the causes of refraction of light


## Embedding text to vectors

Tokenization separates a piece of text into smaller units called tokens. Here, each token is a word in a sentence. I use `torchtext` to tokenize: it goes through the words of the entire corpus to build a dictionary (vocabulary), with the words sorted by their frequency of occurrence, so the more frequent a word is, the lower its index. This dictionary is then used to transform each sentence from text into a sequence of numbers.

1. Create a dictionary of the dataset
    The `specials` argument lists the special tokens (e.g., padding or eos) that are prepended to the vocabulary; here it contains only the \<unk> token, which is also set as the default index for out-of-vocabulary words.
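
The idea of a frequency-sorted vocabulary with a leading `<unk>` entry can be sketched in plain Python. This is only an illustration under simplifying assumptions: `build_frequency_vocab` and `encode` are hypothetical helpers, tokens are split on whitespace, and ties are broken alphabetically, which is not necessarily how `torchtext` orders them.

```python
from collections import Counter


def build_frequency_vocab(sentences, specials=("<unk>",)):
    # Count how often each lowercase, whitespace-separated token occurs.
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    # Special tokens come first, then words from most to least frequent
    # (ties broken alphabetically so the result is deterministic).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}


def encode(sentence, vocab):
    # Unknown words fall back to the index of "<unk>".
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.lower().split()]


sample_vocab = build_frequency_vocab(["the cat sat", "the dog sat", "the end"])
print(sample_vocab)
print(encode("the bird sat", sample_vocab))
```

In this toy corpus the most frequent word, "the", receives index 1, right after the special token, and the unseen word "bird" maps to index 0.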

In [None]:
import torchtext

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

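# Note: the vocabulary is built from the cleaned test questions (testX_ques) only,
# so words that occur only in the training questions map to <unk>.
vocab = build_vocab_from_iterator(yield_tokens(testX_ques), specials=["<unk>"])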
vocab.set_default_index(vocab["<unk>"])

In [None]:
len(vocab)
vocab(['here', 'is', 'an', 'example'])
vocab.lookup_token(1)

'the'

2. Turn each question into a sequence of numbers.

In [None]:
# text_pipeline = lambda x: vocab(tokenizer(x))
# text_pipeline('here is the an example')

def tokenize_ques(data_iter):
    text_list = []
    text_pipeline = lambda x: vocab(tokenizer(x))
    for text in data_iter:
        processed_text = text_pipeline(text)
        text_list.append(processed_text)
    return text_list

word_sequences = tokenize_ques(testX_ques)

print("Length of 20 first word_sequences:")
print(list(map(lambda x: len(x) ,word_sequences[:20])))

print("\n20 first word_sequences:")
for sequence in word_sequences[:20]:
    print(sequence)

Length of 20 first word_sequences:
[21, 30, 10, 3, 7, 9, 6, 12, 25, 8, 13, 6, 7, 9, 7, 10, 11, 5, 8, 12]

20 first word_sequences:
[16, 10, 53, 70, 123, 96, 53, 1687, 11, 3932, 34, 48, 35, 98, 4, 739, 1576, 7, 2466, 11, 296]
[34, 32, 8, 366, 13, 8949, 145, 7, 152, 11, 7008, 145, 7, 152, 32, 8, 1708, 13, 1, 4228, 1059, 25, 78, 8, 1698, 5, 366, 146, 1, 1059]
[2, 3, 17, 129, 44, 5, 26, 4, 4510, 14599]
[43, 12, 3821]
[3, 447, 129, 359, 54, 39, 2502]
[9, 10, 14, 1014, 4, 12587, 5, 687, 1634]
[2, 3, 1, 1595, 6, 34472]
[2, 3, 1, 246, 13, 10839, 15, 1, 139, 905, 151, 1965]
[18, 1064, 1540, 3, 7989, 32243, 18, 2371, 1064, 1258, 84, 5, 25441, 3509, 3193, 6, 18, 760, 2371, 1540, 2, 3, 18, 1064, 312]
[9, 73, 27, 4, 6240, 616, 6, 583]
[2, 12, 1, 36, 222, 5, 910, 126, 13, 242, 397, 6, 583]
[2, 1327, 15, 8, 708, 12908]
[12, 508, 432, 1411, 11639, 19673, 4018]
[16, 18, 596, 23, 35, 52, 3394, 24, 105]
[1961, 4, 50507, 108, 11375, 6306, 375]
[2, 12, 1, 36, 173, 4848, 7919, 239, 1833, 2647]
[16, 3, 17, 1

**Padding and Truncating**
The tokenization step shows that the questions do not have the same length, which makes training the model difficult. Hence, let's regularize the sequences with padding and truncating:
Each sequence gets the fixed length 60, chosen from the Data Analysis section: 99.9% of sentences have a length less than or equal to 54 words.
* Padding: if a sequence is shorter than the fixed length, zeros are added after it.
* Truncating: if a sequence is longer than the fixed length, it is shortened by removing the excess tokens.
* 'post': padding or truncating is applied at the end of the sequence
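
The 'post' behaviour described above can be sketched for a single sequence in plain Python. `pad_or_truncate` is a hypothetical helper written for illustration; the notebook itself relies on Keras `pad_sequences`.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    # 'post' truncation: keep only the first max_len tokens.
    clipped = list(sequence[:max_len])
    # 'post' padding: append pad_value until the length is max_len.
    return clipped + [pad_value] * (max_len - len(clipped))


short_example = pad_or_truncate([43, 12, 3821], max_len=5)
long_example = pad_or_truncate([2, 3, 17, 129, 44, 5, 26], max_len=5)
print(short_example)
print(long_example)
```

The short sequence gains two trailing zeros, while the long one loses its last two tokens.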

In [None]:
MAX_SENTENCE_LENGTH = 60 
PADDING_TYPE = 'post' 
TRUNCATE_TYPE = 'post'
def create_sequence(word_sequences):
    padded_word_sequences = pad_sequences(word_sequences, maxlen=MAX_SENTENCE_LENGTH, padding=PADDING_TYPE, truncating=TRUNCATE_TYPE)
    return padded_word_sequences
padded_sequences = create_sequence(word_sequences)

print("Array size:",padded_sequences.shape)

print("Length of 20 first word_sequences:")
print(list(map(lambda x: len(x) ,padded_sequences[:20])))

print("\n10 first word_sequences:")
for sequence in padded_sequences[:10]:
    print(sequence)

Array size: (375806, 60)
Length of 20 first word_sequences:
[60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60]

10 first word_sequences:
[  16   10   53   70  123   96   53 1687   11 3932   34   48   35   98
    4  739 1576    7 2466   11  296    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
[  34   32    8  366   13 8949  145    7  152   11 7008  145    7  152
   32    8 1708   13    1 4228 1059   25   78    8 1698    5  366  146
    1 1059    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
[    2     3    17   129    44     5    26     4  4510 14599     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
  

## Split dataset into train/valid/test sets
In this section, I define `QuoraDataset`, which uses the data preparation functions above to process the dataset. I also split the dataset into a train set and a valid set; the valid set gives an estimate of model skill while tuning the model’s hyperparameters.
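
An 80/20 split can be sketched with index lists alone. Note the assumption: this sketch shuffles the rows with a fixed seed before cutting, whereas the cell that follows takes the first 80% of rows in file order. `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, train_fraction=0.8, seed=123):
    # Shuffle the row indices with a fixed seed so the split is reproducible.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(train_fraction * n_rows)
    # The first part is used for training, the remainder for validation.
    return indices[:cut], indices[cut:]


train_idx, valid_idx = split_indices(10)
print(len(train_idx), len(valid_idx))
```

With a pandas DataFrame, such index lists could be passed to `df.iloc[...]` to obtain the two subsets.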

In [None]:
class QuoraDataset(Dataset):
    def __init__(self, dataset):
        # all question texts in the data
        self.text = dataset.question_text
        # target is 0/1 for labeled data and -1 when the dataset has no target column (test data)
        self.target = dataset.target if "target" in dataset.columns else [-1]*len(dataset)
        
    def __len__(self):
        return len(self.text)

    def __getitem__(self, i):
        target = [self.target[i]]
        question = str(self.text[i])
        question_id = create_sequence([vocab(tokenizer(question))])
        return FloatTensor(target), question, question_id

In [None]:
from torch.utils.data.dataset import random_split

BATCH_SIZE = 1024
# df = df.sample(n=500, random_state=123).reset_index(drop=True)
split = np.int32(0.8*len(df))
valid_data, training_data = df[split:], df[:split]
valid_data = valid_data.reset_index(drop=True)
val_dataset = QuoraDataset(valid_data)
val_loader = DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE,
                        num_workers=0, shuffle=True)

training_data = training_data.reset_index(drop=True)
train_dataset = QuoraDataset(training_data)
train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE,
                           num_workers=0, shuffle=True)

In [None]:
EMBEDDING_DIM = 400
VOCAB_SIZE = len(vocab)

# Model architecture
This section describes the model architectures used in the experiments to classify insincere questions and improve the evaluation metric, the F1 score.

I use an LSTM network. FlyVec and GloVe provide the word embeddings fed to the model.

1. Loading pretrained GloVe word embeddings
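
GloVe text files store one word per line, followed by the components of its vector separated by spaces. The parsing step can be sketched on two made-up lines held in memory; `SAMPLE_GLOVE_TEXT` and `parse_glove` are hypothetical names, and the 3-dimensional vectors are invented for illustration.

```python
import io

# Two made-up lines in the GloVe text layout: a word followed by its vector components.
SAMPLE_GLOVE_TEXT = "the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n"


def parse_glove(lines):
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # First field is the word, the remaining fields are the vector.
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


sample_vectors = parse_glove(io.StringIO(SAMPLE_GLOVE_TEXT))
print(sample_vectors["cat"])
```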



In [None]:
def load_glove_vectors(glove_file="/content/gdrive/MyDrive/Colab Notebooks/dataset /glove.6B.50d.txt"):
    """Load the glove word vectors"""
    word_vectors = {}
    with open(glove_file, encoding="utf-8") as f:
        for line in f:
            split = line.split()
            word_vectors[split[0]] = np.array([float(x) for x in split[1:]])
    return word_vectors

In [None]:
def get_GloveEmb_matrix(pretrained, vocab, emb_size = 50):
    """Create an embedding matrix from pretrained word vectors."""
    vocab_size = len(vocab) + 2  # two extra rows: padding and unknown words
    vocab_to_idx = {}
    dic = ["", "UNK"]
    W = np.zeros((vocab_size, emb_size), dtype="float32")
    W[0] = np.zeros(emb_size, dtype='float32') # adding a vector for padding
    W[1] = np.random.uniform(-0.25, 0.25, emb_size) # adding a vector for unknown words
    for i in range(vocab_size-2):
        word = vocab.lookup_token(i)
        if word in pretrained:
            W[i+2] = pretrained[word]
        else:
            # words missing from the pretrained vectors get a random vector
            W[i+2] = np.random.uniform(-0.25,0.25, emb_size)
        vocab_to_idx[word] = i+2
        dic.append(word)
    return W, np.array(dic), vocab_to_idx

In [None]:
word_vecs = load_glove_vectors()
glove_weights, dicts, vocab2index = get_GloveEmb_matrix(word_vecs, vocab)

2. Loading pretrained Flyvec

In [None]:
import numpy as np
from flyvec import FlyVec

model = FlyVec.load(force_redownload=True)

embed_info = model.get_sparse_embedding("market"); embed_info


{'embedding': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0,

In [None]:
e= model.get_sparse_embedding("DefNotAWord")
e['embedding']

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
def get_FlyEmb_matrix(vocab, emb_size = 400):
    """Create an embedding matrix from FlyVec sparse embeddings."""
    vocab_size = len(vocab) + 2  # two extra rows: padding and unknown words
    vocab_to_idx = {}
    dic = ["", "UNK"]
    W = np.zeros((vocab_size, emb_size), dtype="int32")
    W[0] = np.zeros(emb_size, dtype='int32') # adding a vector for padding
    e= model.get_sparse_embedding("DefNotAWord")
    W[1] = e['embedding'] # adding a vector for unknown words
    for i in range(vocab_size-2):
        word = vocab.lookup_token(i)
        w = model.get_sparse_embedding(word)
        W[i+2] = w['embedding']
        vocab_to_idx[word] = i+2
        dic.append(word)
    return W, np.array(dic), vocab_to_idx

In [None]:
fly_weights, dictionary, vocab2index = get_FlyEmb_matrix(vocab)

3. LSTM with pre-trained word embeddings:


In [None]:
class LSTM_emb_vecs(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim, pretrained_weights) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embeddings.weight.data.copy_(torch.from_numpy(pretrained_weights))
        self.embeddings.weight.requires_grad = False ## freeze embeddings
        self.dropout = nn.Dropout(0.2)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers = 2, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 1)   
    def forward(self, x):
        x = torch.tensor(x).to(device)
        x = self.embeddings(x) # size: batch_size x sentence_length x embedding_dim
        lstm_out, (ht, ct) = self.lstm(x) # ht size: num_layers x batch_size x hidden_dim
        return self.linear(ht[-1]) # final hidden state of the last layer -> one logit per question

lstm_flyvec_model = LSTM_emb_vecs(VOCAB_SIZE+2, 400, 128, fly_weights)
lstm_glove_model = LSTM_emb_vecs(VOCAB_SIZE+2, 50, 128, glove_weights)

# Model Training 

### Using F1 score
Data analysis shows that the training dataset is imbalanced. Using accuracy as the performance metric can therefore give a misleading evaluation of the model. The F1 score is a more suitable measure for models tested on this imbalanced classification dataset.
Formula of F1: 

$F1 \textrm{ Score} = 2*\frac{\textrm{Precision}*\textrm{Recall}}{\textrm{Precision}+\textrm{Recall}}$

Where $\textrm{Recall} = \frac{TP}{TP + FN}$

and $\textrm{Precision} = \frac{TP}{TP + FP}$

Here $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives. Precision and recall are illustrated in this picture:

<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" width="300"/>
</div>
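
As a minimal sketch of why accuracy misleads on imbalanced data, the example below scores a classifier that always answers "sincere". The 6-in-100 class ratio is hypothetical toy data, not the ratio of the Quora dataset, and the helper names are made up for this illustration.

```python
# Hypothetical toy data: 100 questions, 6 of them insincere (label 1).
y_true = [1] * 6 + [0] * 94
y_pred = [0] * 100          # a classifier that always answers "sincere"

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def accuracy_and_f1(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

print(accuracy_and_f1(y_true, y_pred))  # (0.94, 0.0)
```

The accuracy looks strong although not a single insincere question is found, while the F1 score of 0 exposes the failure.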

In [None]:
def f1_score(y_pred, y_true):
    # y_pred holds raw logits: apply the sigmoid and round to get hard 0/1 labels
    y_true = y_true.squeeze()
    y_pred = torch.round(torch.sigmoid(y_pred)).squeeze()
    tp = (y_true * y_pred).sum().to(torch.float32)
    fp = ((1 - y_true) * y_pred).sum().to(torch.float32)
    fn = (y_true * (1 - y_pred)).sum().to(torch.float32)
    epsilon = 1e-7  # avoids division by zero when a batch has no positive labels or predictions
    recall = tp / (tp + fn + epsilon)
    precision = tp / (tp + fp + epsilon)
    return 2*(precision*recall) / (precision + recall + epsilon)

### GPU 

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### Model Training 
I selected a learning rate of 0.001 and a batch size of 1024 for the LSTM models and 128 for RoBERTa (the model is quite big, and training it with a large batch size makes the GPU run out of memory). The validation set is evaluated with a larger batch size, since it needs no gradient calculation. Binary cross entropy (BCE), which is commonly used in binary classification tasks, is the loss function; the code uses `nn.BCEWithLogitsLoss`, which applies the sigmoid to the raw model outputs internally.
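
To make the loss concrete: for one example with label $y$ and raw model output $z$, binary cross entropy is $-[y \log p + (1-y) \log(1-p)]$ with $p = \sigma(z)$. The sketch below computes it in plain Python, once directly and once in the rearranged form $\max(z, 0) - zy + \log(1 + e^{-|z|})$, which is algebraically identical but does not overflow for large $|z|$. The function names are hypothetical, and this illustrates the formula only; it is not PyTorch's implementation.

```python
import math

def bce_naive(z, y):
    """BCE computed directly from the sigmoid probability."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(z, y):
    """Same quantity, rearranged so that exp() only sees non-positive arguments."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Hypothetical (logit, label) pairs; both columns should agree.
for z, y in [(2.0, 1), (2.0, 0), (-1.5, 1), (0.0, 0)]:
    print(z, y, round(bce_naive(z, y), 4), round(bce_from_logit(z, y), 4))
```

Working on logits rather than probabilities is the reason the models return the output of the final linear layer without a sigmoid.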


In [None]:
LEARNING_RATE = 0.001
NUM_EPOCHS = 10
MODEL_SAVE_PATH = 'insincerity_model.pt'

def train_quoraModel(model, train_loader, valid_loader):
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE) 
    val_losses, val_f1s = [], []
    train_losses, train_f1s = [], []
    model.to(device)
    for epoch in range(NUM_EPOCHS):
        print("EPOCH :" + str(epoch+1))
        model.train()  
        for step, train_batch in enumerate(tqdm(train_loader), start=1):
            train_targ, train_ques, train_id = train_batch
            train_targ = train_targ.to(device)
            train_id = train_id.to(device)
            train_preds = model(train_id.squeeze(dim=1))
            train_loss = criterion(train_preds, train_targ)
            train_f1 = f1_score(train_preds, train_targ)
            f1 = np.round(train_f1.item(), 3)
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()
            if step % 100 == 0:
                print(
                    f"Step [{step}], "
                    f"F1Score [{f1}], "
                    f"Loss: {train_loss.item():.4f}"
                )
        val_loss, val_f1, val_points = 0, 0, 0

        model.eval()
        with torch.no_grad():
            # use the loader passed as an argument rather than the global val_loader
            for val_batch in valid_loader:
                val_targ, val_ques, val_id = val_batch
                val_targ = val_targ.to(device)
                val_id = val_id.to(device)
                val_preds = model(val_id.squeeze(dim=1))
                val_points = val_points + len(val_targ)
                # criterion returns the batch mean, so weight it by batch size before summing
                val_loss = val_loss + criterion(val_preds, val_targ).item()*len(val_targ)
                val_f1 = val_f1 + f1_score(val_preds, val_targ).item()*len(val_preds)
        val_f1 = val_f1/ val_points
        val_loss = val_loss/ val_points
        # the training metrics recorded per epoch are those of the last batch
        val_f1s.append(val_f1); train_f1s.append(train_f1.item())
        val_losses.append(val_loss); train_losses.append(train_loss.item())
    print("END TRAINING")
    # note: every call writes to the same MODEL_SAVE_PATH and metric file names
    torch.save(model.state_dict(), MODEL_SAVE_PATH); del model; gc.collect()

    metric_lists = [val_losses, train_losses, val_f1s, train_f1s]
    metric_names = ['val_loss_', 'train_loss_', 'val_f1_', 'train_f1_']
    for i, metric_list in enumerate(metric_lists):
        for j, metric_value in enumerate(metric_list):
            torch.save(metric_value, metric_names[i] + str(j) + '.pt')
    return metric_lists
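
The validation F1 above is an average of per-batch F1 scores, weighted by batch size. Because F1 is not linear in the confusion counts, this average can differ from the F1 computed once over all validation predictions. The sketch below shows the gap on made-up counts, using the equivalent form $F1 = \frac{2TP}{2TP + FP + FN}$; `f1_from_counts` and the two toy batches are assumptions for illustration only.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from confusion counts; returns 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Two hypothetical validation batches of equal size, given as (tp, fp, fn).
batches = [(8, 2, 2), (0, 1, 3)]

# Equal batch sizes, so the size-weighted average reduces to a plain mean.
mean_of_batch_f1 = sum(f1_from_counts(*b) for b in batches) / len(batches)

# Pool the counts over all batches, then compute F1 once.
tp, fp, fn = (sum(col) for col in zip(*batches))
pooled_f1 = f1_from_counts(tp, fp, fn)

print(round(mean_of_batch_f1, 3), round(pooled_f1, 3))  # 0.4 0.667
```

If a single dataset-level score is wanted, one option is to accumulate `tp`, `fp` and `fn` across batches and compute F1 once at the end of the epoch.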

In [None]:
glove_lists = train_quoraModel(lstm_glove_model, train_loader, val_loader)
fly_lists = train_quoraModel(lstm_flyvec_model, train_loader, val_loader)


EPOCH :1


  0%|          | 0/1021 [00:00<?, ?it/s]



Step [100], F1Score [0.0], Loss: 0.2020
Step [200], F1Score [0.0], Loss: 0.2495
Step [300], F1Score [0.0], Loss: 0.1883
Step [400], F1Score [0.0], Loss: 0.2131
Step [500], F1Score [0.0], Loss: 0.2179
Step [600], F1Score [0.0], Loss: 0.2075
Step [700], F1Score [0.0], Loss: 0.1955
Step [800], F1Score [0.0], Loss: 0.1755
Step [900], F1Score [0.033], Loss: 0.1784
Step [1000], F1Score [0.289], Loss: 0.1596
EPOCH :2


Step [100], F1Score [0.337], Loss: 0.1726
Step [200], F1Score [0.179], Loss: 0.1695
Step [300], F1Score [0.232], Loss: 0.1695
Step [400], F1Score [0.314], Loss: 0.1419
Step [500], F1Score [0.341], Loss: 0.1751
Step [600], F1Score [0.167], Loss: 0.1618
Step [700], F1Score [0.29], Loss: 0.1339
Step [800], F1Score [0.261], Loss: 0.1841
Step [900], F1Score [0.412], Loss: 0.1605
Step [1000], F1Score [0.389], Loss: 0.1764
EPOCH :3


Step [100], F1Score [0.326], Loss: 0.1686
Step [200], F1Score [0.472], Loss: 0.1311
Step [300], F1Score [0.465], Loss: 0.1244
Step [400], F1Score [0.247], Loss: 0.1653
Step [500], F1Score [0.293], Loss: 0.1481
Step [600], F1Score [0.222], Loss: 0.1967
Step [700], F1Score [0.289], Loss: 0.1396
Step [800], F1Score [0.535], Loss: 0.1409
Step [900], F1Score [0.364], Loss: 0.1791
Step [1000], F1Score [0.5], Loss: 0.1279
EPOCH :4


Step [100], F1Score [0.372], Loss: 0.1633
Step [200], F1Score [0.505], Loss: 0.1317
Step [300], F1Score [0.462], Loss: 0.1419
Step [400], F1Score [0.346], Loss: 0.1572
Step [500], F1Score [0.433], Loss: 0.1287
Step [600], F1Score [0.325], Loss: 0.1440
Step [700], F1Score [0.442], Loss: 0.1494
Step [800], F1Score [0.44], Loss: 0.1453
Step [900], F1Score [0.505], Loss: 0.1282
Step [1000], F1Score [0.535], Loss: 0.1196
EPOCH :5


Step [100], F1Score [0.424], Loss: 0.1433
Step [200], F1Score [0.484], Loss: 0.1114
Step [300], F1Score [0.539], Loss: 0.1427
Step [400], F1Score [0.404], Loss: 0.1440
Step [500], F1Score [0.514], Loss: 0.1306
Step [600], F1Score [0.552], Loss: 0.1249
Step [700], F1Score [0.404], Loss: 0.1438
Step [800], F1Score [0.538], Loss: 0.1429
Step [900], F1Score [0.587], Loss: 0.0949
Step [1000], F1Score [0.611], Loss: 0.1168
EPOCH :6


Step [100], F1Score [0.541], Loss: 0.1285
Step [200], F1Score [0.619], Loss: 0.1165
Step [300], F1Score [0.5], Loss: 0.1179
Step [400], F1Score [0.441], Loss: 0.1464
Step [500], F1Score [0.494], Loss: 0.1145
Step [600], F1Score [0.481], Loss: 0.1290
Step [700], F1Score [0.538], Loss: 0.1469
Step [800], F1Score [0.495], Loss: 0.1434
Step [900], F1Score [0.654], Loss: 0.1146
Step [1000], F1Score [0.661], Loss: 0.1021
EPOCH :7


Step [100], F1Score [0.5], Loss: 0.1516
Step [200], F1Score [0.4], Loss: 0.1490
Step [300], F1Score [0.532], Loss: 0.1087
Step [400], F1Score [0.368], Loss: 0.1220
Step [500], F1Score [0.559], Loss: 0.1317
Step [600], F1Score [0.477], Loss: 0.1544
Step [700], F1Score [0.46], Loss: 0.1097
Step [800], F1Score [0.531], Loss: 0.1257
Step [900], F1Score [0.442], Loss: 0.1081
Step [1000], F1Score [0.514], Loss: 0.1441
EPOCH :8


Step [100], F1Score [0.37], Loss: 0.1751
Step [200], F1Score [0.62], Loss: 0.0906
Step [300], F1Score [0.629], Loss: 0.1097
Step [400], F1Score [0.56], Loss: 0.1494
Step [500], F1Score [0.486], Loss: 0.1303
Step [600], F1Score [0.541], Loss: 0.1294
Step [700], F1Score [0.549], Loss: 0.1450
Step [800], F1Score [0.562], Loss: 0.1410
Step [900], F1Score [0.613], Loss: 0.1281
Step [1000], F1Score [0.569], Loss: 0.1375
EPOCH :9


Step [100], F1Score [0.556], Loss: 0.0992
Step [200], F1Score [0.583], Loss: 0.1419
Step [300], F1Score [0.551], Loss: 0.1145
Step [400], F1Score [0.661], Loss: 0.1099
Step [500], F1Score [0.652], Loss: 0.1391
Step [600], F1Score [0.574], Loss: 0.1055
Step [700], F1Score [0.526], Loss: 0.1218
Step [800], F1Score [0.578], Loss: 0.1297
Step [900], F1Score [0.61], Loss: 0.1173
Step [1000], F1Score [0.517], Loss: 0.1315
EPOCH :10


Step [100], F1Score [0.575], Loss: 0.1028
Step [200], F1Score [0.603], Loss: 0.1301
Step [300], F1Score [0.593], Loss: 0.1437
Step [400], F1Score [0.558], Loss: 0.1225
Step [500], F1Score [0.576], Loss: 0.1238
Step [600], F1Score [0.574], Loss: 0.1072
Step [700], F1Score [0.579], Loss: 0.1236
Step [800], F1Score [0.581], Loss: 0.1208
Step [900], F1Score [0.596], Loss: 0.1013
Step [1000], F1Score [0.644], Loss: 0.1172
END TRAINING
EPOCH :1


Step [100], F1Score [0.0], Loss: 0.2103
Step [200], F1Score [0.0], Loss: 0.2392
Step [300], F1Score [0.0], Loss: 0.2140
Step [400], F1Score [0.0], Loss: 0.2366
Step [500], F1Score [0.0], Loss: 0.1771
Step [600], F1Score [0.25], Loss: 0.1584
Step [700], F1Score [0.519], Loss: 0.1547
Step [800], F1Score [0.41], Loss: 0.1510
Step [900], F1Score [0.312], Loss: 0.1704
Step [1000], F1Score [0.304], Loss: 0.1585
EPOCH :2


Step [100], F1Score [0.439], Loss: 0.1272
Step [200], F1Score [0.505], Loss: 0.1306
Step [300], F1Score [0.484], Loss: 0.1695
Step [400], F1Score [0.417], Loss: 0.1409
Step [500], F1Score [0.632], Loss: 0.1466
Step [600], F1Score [0.55], Loss: 0.1514
Step [700], F1Score [0.556], Loss: 0.1143
Step [800], F1Score [0.436], Loss: 0.1286
Step [900], F1Score [0.4], Loss: 0.1274
Step [1000], F1Score [0.569], Loss: 0.1085
EPOCH :3


Step [100], F1Score [0.523], Loss: 0.1297
Step [200], F1Score [0.533], Loss: 0.1212
Step [300], F1Score [0.356], Loss: 0.1327
Step [400], F1Score [0.662], Loss: 0.1209
Step [500], F1Score [0.579], Loss: 0.1197
Step [600], F1Score [0.614], Loss: 0.1321
Step [700], F1Score [0.606], Loss: 0.1241
Step [800], F1Score [0.559], Loss: 0.1079
Step [900], F1Score [0.721], Loss: 0.1074
Step [1000], F1Score [0.531], Loss: 0.1278
EPOCH :4


Step [100], F1Score [0.505], Loss: 0.1269
Step [200], F1Score [0.495], Loss: 0.1281
Step [300], F1Score [0.524], Loss: 0.1397
Step [400], F1Score [0.585], Loss: 0.1206
Step [500], F1Score [0.514], Loss: 0.1175
Step [600], F1Score [0.593], Loss: 0.1415
Step [700], F1Score [0.597], Loss: 0.1292
Step [800], F1Score [0.495], Loss: 0.1392
Step [900], F1Score [0.662], Loss: 0.1119
Step [1000], F1Score [0.526], Loss: 0.1164
EPOCH :5


Step [100], F1Score [0.656], Loss: 0.1194
Step [200], F1Score [0.597], Loss: 0.1210
Step [300], F1Score [0.523], Loss: 0.1245
Step [400], F1Score [0.56], Loss: 0.1188
Step [500], F1Score [0.532], Loss: 0.1061
Step [600], F1Score [0.544], Loss: 0.1436
Step [700], F1Score [0.607], Loss: 0.1082
Step [800], F1Score [0.525], Loss: 0.1502
Step [900], F1Score [0.484], Loss: 0.1114
Step [1000], F1Score [0.632], Loss: 0.1205
EPOCH :6


Step [100], F1Score [0.412], Loss: 0.1273
Step [200], F1Score [0.604], Loss: 0.1028
Step [300], F1Score [0.68], Loss: 0.1147
Step [400], F1Score [0.648], Loss: 0.1259
Step [500], F1Score [0.621], Loss: 0.1123
Step [600], F1Score [0.547], Loss: 0.1113
Step [700], F1Score [0.641], Loss: 0.1225
Step [800], F1Score [0.474], Loss: 0.1115
Step [900], F1Score [0.656], Loss: 0.1107
Step [1000], F1Score [0.687], Loss: 0.1141
EPOCH :7


Step [100], F1Score [0.689], Loss: 0.0944
Step [200], F1Score [0.617], Loss: 0.1174
Step [300], F1Score [0.65], Loss: 0.1122
Step [400], F1Score [0.606], Loss: 0.1083
Step [500], F1Score [0.581], Loss: 0.1178
Step [600], F1Score [0.624], Loss: 0.1152
Step [700], F1Score [0.653], Loss: 0.0813
Step [800], F1Score [0.632], Loss: 0.1114
Step [900], F1Score [0.634], Loss: 0.1020
Step [1000], F1Score [0.584], Loss: 0.1267
EPOCH :8


Step [100], F1Score [0.667], Loss: 0.0771
Step [200], F1Score [0.6], Loss: 0.1012
Step [300], F1Score [0.632], Loss: 0.1056
Step [400], F1Score [0.638], Loss: 0.0961
Step [500], F1Score [0.639], Loss: 0.1090
Step [600], F1Score [0.617], Loss: 0.0919
Step [700], F1Score [0.636], Loss: 0.0937
Step [800], F1Score [0.577], Loss: 0.0991
Step [900], F1Score [0.696], Loss: 0.1000
Step [1000], F1Score [0.696], Loss: 0.0910
EPOCH :9


Step [100], F1Score [0.66], Loss: 0.0902
Step [200], F1Score [0.717], Loss: 0.1088
Step [300], F1Score [0.701], Loss: 0.0929
Step [400], F1Score [0.598], Loss: 0.0941
Step [500], F1Score [0.716], Loss: 0.0878
Step [600], F1Score [0.763], Loss: 0.0882
Step [700], F1Score [0.738], Loss: 0.0860
Step [800], F1Score [0.641], Loss: 0.0986
Step [900], F1Score [0.642], Loss: 0.1077
Step [1000], F1Score [0.564], Loss: 0.1036
EPOCH :10


Step [100], F1Score [0.646], Loss: 0.0893
Step [200], F1Score [0.623], Loss: 0.1149
Step [300], F1Score [0.609], Loss: 0.1053
Step [400], F1Score [0.73], Loss: 0.0865
Step [500], F1Score [0.684], Loss: 0.0870
Step [600], F1Score [0.631], Loss: 0.0806
Step [700], F1Score [0.716], Loss: 0.0796
Step [800], F1Score [0.637], Loss: 0.1027
Step [900], F1Score [0.635], Loss: 0.0739
Step [1000], F1Score [0.595], Loss: 0.1006
END TRAINING


# Experimental results report

In [None]:
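# Prepend a nominal starting point (F1 = 0, loss = 0.25) to each curve; these are placeholders, not measured values
glo_val_f1s = [0] + [metric_value for metric_value in glove_lists[2]]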
glo_train_f1s = [0] + [metric_value for metric_value in glove_lists[3]]
glo_val_losses = [0.25] + [metric_value for metric_value in glove_lists[0]]
glo_train_losses = [0.25] + [metric_value for metric_value in glove_lists[1]]

fly_val_f1s = [0] + [metric_value for metric_value in fly_lists[2]]
fly_train_f1s = [0] + [metric_value for metric_value in fly_lists[3]]
fly_val_losses = [0.25] + [metric_value for metric_value in fly_lists[0]]
fly_train_losses = [0.25] + [metric_value for metric_value in fly_lists[1]]

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(x=np.arange(1, len(glo_val_losses)+1),
                         y=glo_val_losses, mode="lines+markers", name="glo_val",
                         marker=dict(color="indianred", line=dict(width=.5,
                                                                  color='rgb(0, 0, 0)'))))

fig.add_trace(go.Scatter(x=np.arange(1, len(glo_train_losses)+1),
                         y=glo_train_losses, mode="lines+markers", name="glo_train",
                         marker=dict(color="darkorange", line=dict(width=.5,
                                                                   color='rgb(0, 0, 0)'))))
fig.add_trace(go.Scatter(x=np.arange(1, len(fly_val_losses)+1),
                         y=fly_val_losses, mode="lines+markers", name="fly_val",
                         marker=dict(color="cornflowerblue", line=dict(width=.5,
                                                                  color='rgb(0, 0, 0)'))))

fig.add_trace(go.Scatter(x=np.arange(1, len(fly_train_losses)+1),
                         y=fly_train_losses, mode="lines+markers", name="fly_train",
                         marker=dict(color="cadetblue", line=dict(width=.5,
                                                                   color='rgb(0, 0, 0)'))))

fig.update_layout(xaxis_title="Epochs", yaxis_title="Binary Cross Entropy",
                  title_text="Binary Cross Entropy vs. Epochs", template="plotly_white", paper_bgcolor="#f0f0f0")

fig.show()

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=np.arange(1, len(glo_val_f1s)+1),
                         y=glo_val_f1s, mode="lines+markers", name="glo_val",
                         marker=dict(color="indianred", line=dict(width=.5,
                                                                  color='rgb(0, 0, 0)'))))

fig.add_trace(go.Scatter(x=np.arange(1, len(glo_train_f1s)+1),
                         y=glo_train_f1s, mode="lines+markers", name="glo_train",
                         marker=dict(color="darkorange", line=dict(width=.5,
                                                                   color='rgb(0, 0, 0)'))))
fig.add_trace(go.Scatter(x=np.arange(1, len(fly_val_f1s)+1),
                         y=fly_val_f1s, mode="lines+markers", name="fly_val",
                         marker=dict(color="cornflowerblue", line=dict(width=.5,
                                                                  color='rgb(0, 0, 0)'))))

fig.add_trace(go.Scatter(x=np.arange(1, len(fly_train_f1s)+1),
                         y=fly_train_f1s, mode="lines+markers", name="fly_train",
                         marker=dict(color="cadetblue", line=dict(width=.5,
                                                                   color='rgb(0, 0, 0)'))))

fig.update_layout(xaxis_title="Epochs", yaxis_title="F1 Score",
                  title_text="F1 Score vs. Epochs", template="plotly_white", paper_bgcolor="#f0f0f0")

fig.show()


In [None]:
def predict_insincerity(question, network):
    ins = {0: 'sincere', 1: 'insincere'}
    print(question.strip())
    quest_id = create_sequence([vocab(tokenizer(question))])
    # as_tensor accepts arrays and tensors alike and avoids the copy-construct warning
    quest_id = torch.as_tensor(quest_id).to(device)
    network.to(device)
    network.eval()  # inference mode
    with torch.no_grad():
        output = network(quest_id)
    # the network returns a logit: apply the sigmoid and round to get the class
    return ins[int(torch.round(torch.sigmoid(output)).item())]

print(predict_insincerity("How can I train roBERTa base on TPUs?", lstm_flyvec_model))
print(predict_insincerity("Why is that stupid man the biggest dictator in the world?", lstm_flyvec_model))

How can I train roBERTa base on TPUs?
sincere
Why is that stupid man the biggest dictator in the world?
insincere




