# Quora question pairs classification


Given a pair of sentences, we need to classify whether they are semantically similar to each other or not.

This classification problem has a wide range of applications, such as:
- Detecting similar questions on platforms like Quora, Stack Overflow, etc.
- Finding similar blog posts on Medium, etc.
- Finding similar search results in search engines.
- Serving as a classifier in GANs.
- Determining whether two sentences are paraphrases of each other.

# Quora Dataset

Let's take a look at the Quora question pairs dataset to see what duplicate and non-duplicate questions look like.

> id | qid1 | qid2 | question1 | question2 | is_duplicate
> --- | --- | --- | --- | --- | ---
> 360472 | 364011 | 490273 | What causes stool color to change to yellow? | What can cause stool to come out as little balls? | 0
> 150662 | 155721 | 7256 | What can one do after MBBS? | What do i do after my MBBS ? | 1

Now that we have seen what duplicate and non-duplicate pairs look like, let's build a model that detects them automatically.

# Model


There are various model architectures we can use to detect semantically similar sentences. For example:
- Logistic regression on bag-of-words vectors
- Random Forests built using NLP features
- Converting words into vectors using word embeddings and feeding them to RNNs
- Converting words into vectors using word embeddings and feeding them to CNNs
- Combining RNNs and CNNs
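
As a rough illustration of the "NLP features" idea in the list above, the sketch below computes a few hand-crafted overlap features for a question pair. The function names and the choice of features are assumptions made for this example and are not part of the original notebook; a Random Forest or logistic regression could be trained on features of this kind.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only alphanumeric word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pair_features(q1, q2):
    """Return simple overlap features for a pair of questions (hypothetical feature set)."""
    t1, t2 = tokenize(q1), tokenize(q2)
    s1, s2 = set(t1), set(t2)
    shared = s1 & s2
    union = s1 | s2
    return {
        "len_diff": abs(len(t1) - len(t2)),      # difference in token counts
        "shared_words": len(shared),             # number of distinct words in common
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# The duplicate pair from the dataset sample above
features = pair_features("What can one do after MBBS?", "What do i do after my MBBS ?")
print(features)
```

Features like these capture surface overlap only, which is one reason embedding-based models tend to do better on paraphrases that share few words.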

Many other architectures could be used. We will use the state-of-the-art model architecture **BERT**.

Google AI released a model called BERT. If you are not familiar with BERT, please go through the following links:
- [paper](https://arxiv.org/pdf/1810.04805.pdf)
- [blog](http://jalammar.github.io/illustrated-bert/)

At the time of its publication, BERT obtained new state-of-the-art results on eleven natural language processing tasks.

We will use BERT to predict whether a given pair of sentences are semantically similar to each other or not.

---

The BERT paper suggests an architecture for text classification problems.

Let's go through the steps for using the pretrained BERT:


*   We have Question A and Question B, and we need to classify whether they are similar or not.
*   The input to the BERT model contains three types of embeddings:

     - **Token Embeddings:** Tokens of the input sentences.
     - **Segment Embeddings:** Ids that indicate the different sentences: 0s for Sentence A, 1s for Sentence B.
     - **Position Embeddings:** Indicate the positions of the words in the sentences.
*   So in our case, we need to convert the sentences into tokens and segment ids. Position embeddings are learned during pre-training.
*   A **[CLS]** token is added at the start of each input sequence.
*   A **[SEP]** token is added at the end of each sentence in an input sequence.
*   So in our case the input sequence is: **[CLS] Question A [SEP] Question B [SEP]**
*   For example:

     - *Question A*: what can one do after mbbs ?
     - *Question B*: what do i do after my mbbs ?
     - *Input tokens*: [CLS] what can one do after mbbs ? [SEP] what do i do after my mbbs ? [SEP]
     - *Segment ids*: 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1

*   The input preparation steps above are implemented in the function **convert_examples_to_features**.
*   Then we feed the tokenized input sequences to the BERT model.
*   We take the output corresponding to the [CLS] token and add a linear layer that outputs 2 labels, indicating whether the pair is similar or not. (This is already implemented in the class **BertForSequenceClassification**.)
*   We will use the class *BertForSequenceClassification*, which returns the logits. We then calculate the loss using the ground-truth labels.
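
The input preparation and loss steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's `convert_examples_to_features`: whitespace splitting stands in for BERT's WordPiece tokenizer, the helper names `build_bert_input` and `cross_entropy` are hypothetical, and the logits are made-up placeholders for what a real model would output.

```python
import math


def build_bert_input(question_a, question_b):
    """Build the [CLS] A [SEP] B [SEP] token sequence and its segment ids.

    Assumption: whitespace splitting replaces the real WordPiece tokenizer.
    """
    tokens_a = question_a.lower().split()
    tokens_b = question_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], question A and its [SEP];
    # segment 1 covers question B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, given raw logits and the true class index."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


tokens, segment_ids = build_bert_input(
    "what can one do after mbbs ?", "what do i do after my mbbs ?"
)
print(tokens)
print(segment_ids)

# Placeholder logits for (not similar, similar); a real model would produce these.
loss = cross_entropy([0.0, 0.0], label=1)
print(loss)
```

With equal logits the model is maximally unsure between the two classes, so the loss equals ln 2; training pushes the logit of the correct class up and the loss down.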

*Let's get into coding !!!*

## Tutorial - applying an existing model directly, without fine-tuning



https://www.kaggle.com/abhilash1910/bertsimilarity-library

In [1]:
import sys
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/master_degree/paper')
!ls

Mounted at /content/drive
'논문계획서 (2021.01.22).gdoc'		 '기존 분석결과.gdoc'
'논문 초안 (2021.02).gdoc'		 '논문 지도교수 배정신청서.gdoc'
 archive.zip				 '레퍼런스 링크.gdoc'
 booking.com_hotel_review_korea.csv.zip  '논문계획 정보모음.gdoc'
 code_R					  glue_data
 colab					 '연구계획 표.gsheet'
 data					  model


In [None]:
#kaggle data download
! pip install -q kaggle
from google.colab import files
files.upload()

In [None]:
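# Put the uploaded API token where the Kaggle CLI expects it
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
# Sanity check that the credentials work
!kaggle datasets list
# Download into data/ so that the unzip step below finds data/quora-question-pairs.zip
!kaggle competitions download -c quora-question-pairs -p data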

In [None]:
# unzip
!pip install patool
import patoolib
patoolib.extract_archive("data/quora-question-pairs.zip", outdir='data/quora/')

In [None]:
for dirname, _, filenames in os.walk('data/quora'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data/quora/sample_submission.csv.zip
data/quora/test.csv
data/quora/test.csv.zip
data/quora/train.csv.zip


In [3]:
data = pd.read_csv("data/quora/train.csv.zip")

In [4]:
data.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [5]:
len(data)

404290

In [6]:
!pip install BERTSimilarity

Collecting BERTSimilarity
  Downloading https://files.pythonhosted.org/packages/f9/f1/52928d627e616b185ffcdc877f6db647ff0232302da0b502b3b56fed1785/BERTSimilarity-0.1.tar.gz
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/88/b1/41130a228dd656a1a31ba281598a968320283f48d42782845f6ba567f00b/transformers-4.2.2-py3-none-any.whl (1.8MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: BERTSimilarity, s

In [7]:
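# Example use case
import BERTSimilarity.BERTSimilarity as bertsimilarity

f1 = 'The man is playing soccer.'
f2 = 'The man is playing football.'
# Note: this rebinds the name `bertsimilarity` from the module to a model instance,
# which the later cells rely on.
bertsimilarity = bertsimilarity.BERTSimilarity()
# Despite the method name, the outputs below suggest that higher values mean more similar sentences.
dist = bertsimilarity.calculate_distance(f1, f2)
print(f'The distance between sentence1: {f1} and sentence2: {f2} is {dist}')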

The distance between sentence1: The man is playing soccer. and sentence2: The man is playing football. is 0.9718281626701355


In [10]:
# Example use case (reuses the `bertsimilarity` model instance created in the first example,
# which avoids loading the model again)
f1 = 'How are you?'
f2 = 'How old are you?'
dist = bertsimilarity.calculate_distance(f1, f2)
print(f'The distance between sentence1: {f1} and sentence2: {f2} is {dist}')

The distance between sentence1: How are you? and sentence2: How old are you? is 0.722168505191803


In [11]:
# Example use case (reuses the `bertsimilarity` model instance created in the first example,
# which avoids loading the model again)
f1 = 'What is your age?'
f2 = 'How old are you?'
dist = bertsimilarity.calculate_distance(f1, f2)
print(f'The distance between sentence1: {f1} and sentence2: {f2} is {dist}')

The distance between sentence1: What is your age? and sentence2: How old are you? is 0.8757752180099487
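
The notebook does not show how BERTSimilarity computes its score. A common approach for scores of this kind, assumed here purely for illustration, is to average the token vectors of each sentence and take the cosine similarity of the two averages. The sketch below uses invented 3-dimensional "embeddings" and hypothetical helper names; real BERT vectors have hundreds of dimensions.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy token embeddings, invented for illustration only
sentence_1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
sentence_2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

score = cosine_similarity(mean_pool(sentence_1), mean_pool(sentence_2))
print(score)
```

Under this reading, a score close to 1 means the two sentence vectors point in nearly the same direction, which matches the pattern in the outputs above where closer paraphrases score higher.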


## Testing the BERTSimilarity library on a small sample of 100 texts

In [None]:
%%time

# Compute the similarity between two sentences/paragraphs
def calculate_similarity(q1, q2, bertsimilarity):
    return bertsimilarity.calculate_distance(q1, q2)

# Score the first 100 question pairs; `bertsimilarity` is the model instance created earlier
distances = []
for i in range(len(data[:100])):
    q1 = data['question1'][i]
    q2 = data['question2'][i]
    z = calculate_similarity(q1, q2, bertsimilarity)
    distances.append(z)
    print(q1, "---", q2, "---", z, "\n")

What is the step by step guide to invest in share market in india? --- What is the step by step guide to invest in share market? --- 0.9701508283615112 

What is the story of Kohinoor (Koh-i-Noor) Diamond? --- What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? --- 0.905712902545929 

How can I increase the speed of my internet connection while using a VPN? --- How can Internet speed be increased by hacking through DNS? --- 0.9232534170150757 

Why am I mentally very lonely? How can I solve it? --- Find the remainder when [math]23^{24}[/math] is divided by 24,23? --- 0.6107032299041748 

Which one dissolve in water quikly sugar, salt, methane and carbon di oxide? --- Which fish would survive in salt water? --- 0.7617752552032471 

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me? --- I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me? --- 0.9052724242210388 

Should I buy ti

In [None]:
result_dataset=pd.DataFrame(columns=['question1','question2','similarity_score'])
result_dataset['question1']=data['question1'][:100]
result_dataset['question2']=data['question2'][:100]
result_dataset['similarity_score']=distances

In [None]:
result_dataset.head()

| | question1 | question2 | similarity_score |
| --- | --- | --- | --- |
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0.970151 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0.905713 |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0.923253 |
| 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0.610703 |
| 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0.761775 |
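
To turn these scores into duplicate / non-duplicate predictions, one simple option is to apply a cutoff. The sketch below is an assumption-laden illustration rather than part of the original notebook: the helper names are hypothetical and the threshold of 0.9 is an arbitrary choice. The scores and `is_duplicate` labels are those of the five rows displayed in this notebook.

```python
def predict_duplicates(scores, threshold):
    """Label a pair as duplicate (1) when its similarity score reaches the threshold."""
    return [1 if score >= threshold else 0 for score in scores]


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Similarity scores and is_duplicate labels of the first five question pairs
scores = [0.970151, 0.905713, 0.923253, 0.610703, 0.761775]
labels = [0, 0, 0, 0, 0]

predictions = predict_duplicates(scores, threshold=0.9)  # 0.9 is an arbitrary cutoff
print(predictions)
print(accuracy(predictions, labels))
```

On the full sample, the same helpers could be applied to `result_dataset['similarity_score']` and compared against `data['is_duplicate'][:100]`. Three of these five non-duplicate pairs score above 0.9, which hints that raw similarity scores from a model that was not fine-tuned may separate the classes poorly.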
