This notebook explores the question data in the Quora duplicate question dataset.

### Summary (Aug 29, 2021):
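1. Three records have NaN in question1 or question2; they were removed from the dataset.
2. Duplicate pairs (is_duplicate == 1) are short: the median question length is 9 words, 75% of questions have at most 11 words, and the longest have 80 (question1) and 60 (question2) words.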

### Next Steps:
1. Select only the records with is_duplicate == 1 and prepare a modelling dataset for training a Siamese network with triplet loss.
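
One possible way to turn duplicate pairs into triplets is sketched below. This is only an illustration under stated assumptions, not the method used later: `build_triplets` is a hypothetical helper, and the negative for each anchor is simply question2 of another, randomly chosen duplicate pair.

```python
import numpy as np
import pandas as pd


def build_triplets(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Turn duplicate pairs into (anchor, positive, negative) rows.

    Hypothetical helper. Assumption of this sketch: the negative is
    question2 taken from a different, randomly chosen duplicate pair.
    """
    dup = df.loc[df["is_duplicate"] == 1, ["question1", "question2"]]
    dup = dup.reset_index(drop=True)
    n = len(dup)
    if n < 2:
        raise ValueError("need at least two duplicate pairs to sample negatives")

    rng = np.random.default_rng(seed)
    # Shift every row position by a random non-zero offset (modulo n),
    # so no anchor is ever paired with its own positive as the negative.
    offsets = rng.integers(1, n, size=n)
    neg_idx = (np.arange(n) + offsets) % n

    return pd.DataFrame({
        "anchor": dup["question1"].to_numpy(),
        "positive": dup["question2"].to_numpy(),
        "negative": dup["question2"].to_numpy()[neg_idx],
    })


# Toy data for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "question1": ["a1", "b1", "c1", "d1"],
    "question2": ["a2", "b2", "c2", "d2"],
    "is_duplicate": [1, 0, 1, 1],
})
triplets = build_triplets(toy)
print(triplets)
```

Random sampling can occasionally pick a negative that is itself a duplicate of the anchor, since the same question may appear in several pairs; a real pipeline would probably need to filter such cases, for example by using qid1 and qid2.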


In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

In [2]:
## Read train data
train_df = pd.read_csv(Path.cwd().joinpath('data', 'train.csv'))
train_df.head()

,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [3]:
train_df.shape

(404290, 6)

In [4]:
train_df.dtypes

id               int64
qid1             int64
qid2             int64
question1       object
question2       object
is_duplicate     int64
dtype: object

In [5]:
train_df.isnull().any()

id              False
qid1            False
qid2            False
question1        True
question2        True
is_duplicate    False
dtype: bool
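
`isnull().any()` only shows which columns contain missing values, not how many rows are affected. A minimal sketch for counting those rows follows; `count_rows_with_missing` is a hypothetical helper and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def count_rows_with_missing(df: pd.DataFrame, columns: list) -> int:
    """Number of rows with a missing value in at least one of `columns`."""
    return int(df[columns].isna().any(axis=1).sum())


# Toy data for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "question1": ["How do I learn Python?", np.nan, "What is pandas?"],
    "question2": ["How can I learn Python?", "What is NumPy?", None],
})
n_missing = count_rows_with_missing(toy, ["question1", "question2"])
print(n_missing)
```

Called as `count_rows_with_missing(train_df, ['question1', 'question2'])` at this point in the notebook, it should report the three records mentioned in the summary.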

In [6]:
## Remove records with NaN questions
train_df = train_df.dropna(subset=['question1', 'question2'])

In [7]:
train_df.is_duplicate.value_counts()

0    255024
1    149263
Name: is_duplicate, dtype: int64
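
The class balance is easier to read as proportions. The sketch below recomputes the shares from the counts printed above; on the real frame, `train_df.is_duplicate.value_counts(normalize=True)` should give the same numbers directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 255024, 1: 149263}, name="is_duplicate")

# Share of each class among the 404287 remaining records.
shares = counts / counts.sum()
print(shares.round(3))
```

Roughly 37% of the pairs are duplicates, so about 149 thousand pairs are available as anchor and positive candidates.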

In [8]:
## Compute question length in words (whitespace-separated tokens)
train_df = train_df.copy()
train_df['question1_length'] = train_df.question1.map(lambda x: len(x.split()))
train_df['question2_length'] = train_df.question2.map(lambda x: len(x.split()))

In [9]:
train_df[['question1_length', 'question2_length']].describe()

,question1_length,question2_length
count,404287.0,404287.0
mean,10.942256,11.182017
std,5.428812,6.30521
min,1.0,1.0
25%,7.0,7.0
50%,10.0,10.0
75%,13.0,13.0
max,125.0,237.0


In [10]:
## Question length for duplicate records
train_df.loc[train_df.is_duplicate == 1, ['question1_length', 'question2_length']].describe()

,question1_length,question2_length
count,149263.0,149263.0
mean,9.847665,9.859999
std,4.162756,4.155931
min,1.0,1.0
25%,7.0,7.0
50%,9.0,9.0
75%,11.0,11.0
max,80.0,60.0
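
To put a number on how short the duplicate questions are, one option is to measure the share of pairs in which both questions fit within a word limit. The sketch below assumes whitespace tokenization as in the cells above; `share_within_length` and the threshold of 10 are hypothetical choices for illustration.

```python
import pandas as pd


def share_within_length(df: pd.DataFrame, max_words: int) -> float:
    """Share of rows whose two questions both have at most `max_words` words."""
    q1_len = df["question1"].str.split().str.len()
    q2_len = df["question2"].str.split().str.len()
    return float(((q1_len <= max_words) & (q2_len <= max_words)).mean())


# Toy data for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "question1": [
        "How do I learn Python?",
        "What is a Siamese network and how is it trained with triplet loss?",
    ],
    "question2": [
        "How can I learn Python quickly?",
        "What is triplet loss?",
    ],
})
share = share_within_length(toy, max_words=10)
print(share)
```

Applied to the duplicate subset with a few candidate limits, this could help choose a maximum sequence length for the Siamese network.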
