# Quora question pairs classification


Given a pair of sentences, we need to classify whether they are semantically similar to each other or not.

This classification problem has a wide range of applications, such as:
- Detecting similar questions on platforms like Quora, Stack Overflow, etc.
- Finding similar blog posts on Medium, etc.
- Finding similar search results in search engines
- Serving as a classifier in GANs
- Determining whether two sentences are paraphrases of each other

# Quora Dataset

Let's take a look at the Quora question pairs dataset to see what duplicate and non-duplicate questions look like.

> id | qid1 | qid2 | question1 | question2 | is_duplicate
> --- | --- | --- | --- | --- | ---
> 360472 | 364011 | 490273 | What causes stool color to change to yellow? | What can cause stool to come out as little balls? | 0
> 150662 | 155721 | 7256 | What can one do after MBBS? | What do i do after my MBBS ? | 1

The rows above show what a non-duplicate pair (`is_duplicate` = 0) and a duplicate pair (`is_duplicate` = 1) look like. Let's build a model that detects duplicates automatically.
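To make the column layout concrete, here is a minimal sketch that parses the two example rows with the standard `csv` module and computes the share of duplicates. The inline `SAMPLE` string and the helper name `load_pairs` are illustrative assumptions, not part of the original notebook; the real dataset would be read from the downloaded file instead.

```python
import csv
import io

# The two example rows from the table above, written as CSV text (illustrative only).
SAMPLE = """id,qid1,qid2,question1,question2,is_duplicate
360472,364011,490273,What causes stool color to change to yellow?,What can cause stool to come out as little balls?,0
150662,155721,7256,What can one do after MBBS?,What do i do after my MBBS ?,1
"""

def load_pairs(text):
    """Parse CSV text into (question1, question2, label) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (row["question1"], row["question2"], int(row["is_duplicate"]))
        for row in reader
    ]

pairs = load_pairs(SAMPLE)
# Fraction of pairs labelled as duplicates in this tiny sample.
duplicate_share = sum(label for _, _, label in pairs) / len(pairs)
print(len(pairs), duplicate_share)
```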

# Model


There are various model architectures we can use to detect semantically similar sentences. For example:
- Logistic regression on bag-of-words vectors
- Random forests built on NLP features
- Word embeddings fed into RNNs
- Word embeddings fed into CNNs
- Combining RNNs and CNNs
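For intuition about the simplest option in the list, the sketch below turns each question into a bag-of-words count vector and compares the two with cosine similarity. This is only an illustration of the bag-of-words idea in plain Python: the regex tokenizer and the function names are assumptions, and a real baseline would feed such vectors (or features derived from them) into a trained classifier such as logistic regression.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    """Lowercase the sentence and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)

q1 = bag_of_words("What can one do after MBBS?")
q2 = bag_of_words("What do i do after my MBBS ?")
score = cosine_similarity(q1, q2)
print(round(score, 3))
```

A score like this ignores word order and meaning, which is one reason the notebook moves on to a contextual model.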

Many other architectures could be used as well. We will use the state-of-the-art model architecture **BERT**.

Google AI released a model called BERT. If you don't know what BERT is, please go through the following links:
- [paper](https://arxiv.org/pdf/1810.04805.pdf)
- [blog](http://jalammar.github.io/illustrated-bert/)

BERT obtains new state-of-the-art results on eleven natural language processing tasks.

We will use BERT to predict whether a given pair of sentences is semantically similar or not.

---

The BERT paper suggests an architecture for text classification problems.

Let's go through the steps for using the pretrained BERT:


*   We have Question A and Question B, and we need to classify whether they are similar or not.
*   The input to the BERT model is built from three types of embeddings:

     - **Token Embeddings:** Tokens of the input sentences.
     - **Segment Embeddings:** Ids that distinguish the two sentences: 0's for Sentence A, 1's for Sentence B.
     - **Position Embeddings:** Indicate the positions of the tokens in the sequence.
*   So in our case, we need to convert the sentences into tokens and segment ids. Position embeddings are learnt during pre-training.
*   A **[CLS]** token is added at the start of each input sequence.
*   A **[SEP]** token is added at the end of each sentence in an input sequence.
*   So in our case the input sequence is: **[CLS] Question A [SEP] Question B [SEP]**
*   For example:

     - *Question A*: what can one do after mbbs ?
     - *Question B*: what do i do after my mbbs ?
     - *Input tokens*: [CLS] what can one do after mbbs ? [SEP] what do i do after my mbbs ? [SEP]
     - *Segment ids*: 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1

*   The input preparation steps above are implemented in the function **convert_examples_to_features**.
*   We then feed the tokenized input sequences to the BERT model.
*   We take the output corresponding to the [CLS] token and add a linear layer that outputs 2 labels, indicating similar or not. (This is already implemented as the class **BertForSequenceClassification**.)
*   We will use the class *BertForSequenceClassification*, which returns the logits. We then calculate the loss using the ground-truth labels.
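The input layout described in the steps above can be sketched in a few lines of plain Python. This is not the actual `convert_examples_to_features` implementation: the function name `build_pair_input` is hypothetical, and whitespace splitting stands in for BERT's WordPiece tokenizer, which would usually break rare words into sub-word pieces. Library implementations also typically handle truncation and padding to a fixed length, which this sketch leaves out.

```python
def build_pair_input(question_a, question_b):
    """Return (tokens, segment_ids) for a question pair in BERT's input layout.

    Assumption: whitespace tokenization is used here in place of WordPiece.
    """
    tokens_a = question_a.lower().split()
    tokens_b = question_b.lower().split()

    # [CLS] Question A [SEP] Question B [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

    # Segment 0 covers [CLS], Question A and the first [SEP];
    # segment 1 covers Question B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    "what can one do after mbbs ?",
    "what do i do after my mbbs ?",
)
print(tokens)
print(segment_ids)
```

For the example pair this yields 18 tokens, with nine 0's followed by nine 1's as segment ids, matching the example in the list.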

*Let's get into coding !!!*
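As a warm-up, the last step described above (turning the two logits into a loss against the ground-truth label) can be sketched without any deep learning framework. The logit values below are made up for illustration, and the function names are assumptions; in practice the framework's built-in cross-entropy loss would be used.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(softmax(logits)[label])

# Hypothetical logits for one question pair: index 0 = not similar, 1 = similar.
logits = [0.2, 1.5]
probs = softmax(logits)
loss = cross_entropy(logits, label=1)
print(probs, loss)
```

The loss is small when the model puts most of its probability on the correct label and grows as that probability shrinks.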

## Required installations

We will use the [repo](https://github.com/huggingface/pytorch-pretrained-BERT), which contains an implementation of BERT and pretrained models. We will install it using the following command:

https://www.kaggle.com/abhilash1910/bertsimilarity-library

In [1]:
import sys

In [16]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/master_degree/paper')
!ls

Mounted at /content/drive
'논문계획서 (2021.01.22).gdoc'		 '기존 분석결과.gdoc'
'논문 초안 (2021.02).gdoc'		 '논문 지도교수 배정신청서.gdoc'
 archive.zip				 '레퍼런스 링크.gdoc'
 booking.com_hotel_review_korea.csv.zip  '논문계획 정보모음.gdoc'
 code_R					  glue_data
 colab					 '연구계획 표.gsheet'
 data					  model


In [10]:
! pip install -q kaggle
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"mapmateers","key":"<redacted>"}'}

In [17]:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets list
!kaggle competitions download -c quora-question-pairs

mkdir: cannot create directory ‘/root/.kaggle’: File exists
cp: cannot stat 'kaggle.json': No such file or directory
ref                                                    title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
-----------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
ayushggarg/all-trumps-twitter-insults-20152021         All Trump's Twitter insults (2015-2021)             581KB  2021-01-20 16:51:05           1680        196  1.0              
sevgisarac/temperature-change                          Temperature change                                  778KB  2020-12-24 20:06:36           1606         73  1.0              
gpreda/reddit-wallstreetsbets-posts                    Reddit WallStreetBets Posts                           5MB  2021-02-04 07:44:01            145         34  1.0   

In [12]:
!pip install BERTSimilarity



In [2]:
# Clone the BERT GitHub repo
# !git clone -q https://github.com/google-research/bert.git

# Add bert to sys.path
if 'bert' not in sys.path:
    sys.path.append('bert')

# Instead of cloning, you can also install via pip
# !pip install bert-tensorflow

# We will use the base uncased model; you can also try the large models
PRETRAINED_DIR = 'gs://cloud-tpu-checkpoints/bert/' + 'uncased_L-12_H-768_A-12'
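A pretrained BERT directory is normally used through a handful of files inside it. The sketch below builds those paths, assuming the directory follows the layout of the published Google checkpoints (`bert_config.json`, `vocab.txt`, `bert_model.ckpt`); the helper name `checkpoint_paths` is hypothetical, and the file names should be checked against the actual bucket contents.

```python
PRETRAINED_DIR = "gs://cloud-tpu-checkpoints/bert/" + "uncased_L-12_H-768_A-12"

def checkpoint_paths(pretrained_dir):
    """Build the paths of the files usually found in a BERT checkpoint directory.

    Assumption: the directory uses the file names of the published checkpoints.
    """
    base = pretrained_dir.rstrip("/")
    return {
        "config": f"{base}/bert_config.json",    # model hyperparameters
        "vocab": f"{base}/vocab.txt",            # WordPiece vocabulary
        "checkpoint": f"{base}/bert_model.ckpt", # pretrained weights prefix
    }

paths = checkpoint_paths(PRETRAINED_DIR)
for name, path in paths.items():
    print(name, path)
```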

## Data Loading

In [13]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [15]:
os.listdir()

['.config', 'quora-question-pairs.zip', 'kaggle.json', 'sample_data']
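The listing shows that `quora-question-pairs.zip` has been downloaded but not yet extracted. A minimal sketch of the extraction step with the standard `zipfile` module follows; the helper name `extract_archive` and the target directory `quora_data` are assumptions, and the names of the files inside the archive are not shown in this notebook, so the code simply reports whatever it finds.

```python
import os
import zipfile

def extract_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())

# Only runs if the archive from the listing above is in the working directory.
# "quora_data" is a hypothetical output directory name.
if os.path.exists("quora-question-pairs.zip"):
    print(extract_archive("quora-question-pairs.zip", "quora_data"))
```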