In [None]:
# Author: Zhengxiang (Jack) Wang 
# Date: 2022-01-15
# GitHub: https://github.com/jaaack-wang 

## Get Quora duplicate questions corpus

We will use a subset of this corpus to build a text matching classifier.

In [1]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv

--2022-01-15 10:43:59--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 151.101.53.2
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|151.101.53.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv’


2022-01-15 10:44:05 (10.4 MB/s) - ‘quora_duplicate_questions.tsv’ saved [58176133/58176133]



## Loading data

In [2]:
# use a context manager so the file handle is closed after reading
with open('quora_duplicate_questions.tsv', 'r', encoding='utf-8') as f:
    data = f.readlines()

In [3]:
# take a quick look at it

len(data), data[:4]

(404302,
 ['id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n',
  '0\t1\t2\tWhat is the step by step guide to invest in share market in india?\tWhat is the step by step guide to invest in share market?\t0\n',
  '1\t3\t4\tWhat is the story of Kohinoor (Koh-i-Noor) Diamond?\tWhat would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?\t0\n',
  '2\t5\t6\tHow can I increase the speed of my internet connection while using a VPN?\tHow can Internet speed be increased by hacking through DNS?\t0\n'])

In [4]:
tmp = "{:3}{:5}{:5}{}\t{}\t{}"

for line in data[:4]:
    print(tmp.format(*line.split('\t')))

id qid1 qid2 question1	question2	is_duplicate

0  1    2    What is the step by step guide to invest in share market in india?	What is the step by step guide to invest in share market?	0

1  3    4    What is the story of Kohinoor (Koh-i-Noor) Diamond?	What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?	0

2  5    6    How can I increase the speed of my internet connection while using a VPN?	How can Internet speed be increased by hacking through DNS?	0



## Converting the data

For text matching, we only want text pairs along with their labels.

In [5]:
corpus = []

# skip the header row
for line in data[1:]:
    line = line.split('\t')
    try:
        # the last three fields are question1, question2, and is_duplicate
        text_a, text_b, label = line[-3], line[-2], line[-1].strip()
        int(label)  # raises ValueError if the label is not an integer
        corpus.append([text_a, text_b, label])
    except (IndexError, ValueError):
        # malformed line (too few fields or a non-integer label): skip it
        continue
    
    
len(corpus), corpus[:3]

(404283,
 [['What is the step by step guide to invest in share market in india?',
   'What is the step by step guide to invest in share market?',
   '0'],
  ['What is the story of Kohinoor (Koh-i-Noor) Diamond?',
   'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',
   '0'],
  ['How can I increase the speed of my internet connection while using a VPN?',
   'How can Internet speed be increased by hacking through DNS?',
   '0']])
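
The line-based loop above keeps 404,283 of the 404,301 non-header lines. One possible cause, which is an assumption and not verified here, is that some questions contain line breaks inside quoted fields, so a single record spans several lines. A minimal sketch of an alternative that lets Python's `csv` module do the splitting is shown below. The function name `read_pairs` and the toy input are hypothetical.

```python
import csv
import io

def read_pairs(f):
    """Yield [question1, question2, label] from a tab-separated file object."""
    reader = csv.reader(f, delimiter='\t')
    next(reader)  # skip the header row
    for row in reader:
        # keep only rows with at least three fields and a 0/1 label
        if len(row) >= 3 and row[-1] in ('0', '1'):
            yield row[-3:]

# toy input (not from the corpus): the second question spans two lines inside quotes
sample_tsv = (
    'id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n'
    '0\t1\t2\tWhat is A?\t"What is A,\nexactly?"\t1\n'
)
pairs = list(read_pairs(io.StringIO(sample_tsv, newline='')))
```

For the real file, the object passed in would come from `open('quora_duplicate_questions.tsv', newline='', encoding='utf-8')`. Questions recovered this way may still contain line breaks, which would need to be replaced (for example with spaces) before writing one example per line.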

In [6]:
# count the matched (duplicate) and unmatched pairs

matched = [c for c in corpus if c[-1] == "1"]
unmatched = [c for c in corpus if c[-1] == "0"]

len(matched), len(unmatched)

(149263, 255020)

## Make a small dataset

For illustration and efficiency, we will use only 5,000 question pairs: 3,000 for the train set, 1,000 for the dev set, and the remaining 1,000 for the test set. Each set contains an equal number of matched and unmatched pairs.

In [7]:
from random import seed, sample, shuffle

seed(32)
part1 = sample(matched, 2500)
part2 = sample(unmatched, 2500)

train = part1[:1500] + part2[:1500]
dev = part1[1500:2000] + part2[1500:2000]
test = part1[2000:] + part2[2000:]

shuffle(train)
shuffle(dev)
shuffle(test)

len(train), len(dev), len(test)

(3000, 1000, 1000)
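
As a sanity check on the split, it can help to confirm that each set is balanced and that no question pair appears in two sets. `sample` draws without replacement, so an overlap could only arise if the corpus itself repeats a pair. The sketch below uses two hypothetical helpers, `label_counts` and `shared_pairs`, and toy data in place of the real splits.

```python
from collections import Counter

def label_counts(dataset):
    """Count examples per label; the label is the last item of each example."""
    return Counter(example[-1] for example in dataset)

def shared_pairs(set_a, set_b):
    """Return the (text_a, text_b) pairs that occur in both datasets."""
    pairs_a = {(a, b) for a, b, _ in set_a}
    return pairs_a & {(a, b) for a, b, _ in set_b}

# toy data (hypothetical, not drawn from the corpus)
toy_train = [['q1', 'q2', '1'], ['q3', 'q4', '0']]
toy_dev = [['q5', 'q6', '1'], ['q3', 'q4', '0']]

counts = label_counts(toy_train)
overlap = shared_pairs(toy_train, toy_dev)
```

Applied to the notebook's variables, `label_counts(train)` should report 1,500 examples per label, and `shared_pairs(train, test)` should ideally be empty.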

In [8]:
print("This is train set sample:", train[:3])
print("\nThis is dev set sample:", dev[:3])
print("\nThis is test set sample:", test[:3])

This is train set sample: [['How do I write a good essay?', 'How do I write an essay in English?', '1'], ['Which is the best thriller movies ?', 'Which are the best brain twisting psychological thriller movies ever made?', '0'], ['How do I preserve a journal that has pencil and pen writings in it?', "If I'm depressed, can I atleast write my feelings down in a diary/journal? Is it good or bad?", '0']]

This is dev set sample: [['What is the customer service number for PAYPAL Philippines?', 'What is the customer service number for PayPal?', '0'], ['If you found a genie and had 3 wishes, what would you wish for?', 'If you had three wishes, what would they be and why?', '1'], ['How can I root android 2.3?', 'How can one root android devices?', '0']]

This is test set sample: [['Why are my questions always flagged as needing improvement?', 'Why are so many questions about president elect Trump getting flagged as needing improvement when many are clear and concise?', '1'], ['How does the ban

## Save the dataset

In [9]:
def save(dataset, fpath):
    # one example per line: text_a, text_b, and label separated by tabs
    dataset = ['\t'.join(d) for d in dataset]
    with open(fpath, 'w', encoding='utf-8') as f:
        f.write('\n'.join(dataset))
    print(f"{fpath} has been saved!")

In [10]:
save(train, "train.txt")
save(dev, "dev.txt")
save(test, "test.txt")

train.txt has been saved!
dev.txt has been saved!
test.txt has been saved!
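
To use the saved files later, each line has to be split back into its three tab-separated fields. The sketch below shows one way to do that. The `load` function is a hypothetical counterpart to `save`, and the round trip is demonstrated on a toy example written to a temporary directory, not on the real files.

```python
import os
import tempfile

def load(fpath):
    """Read a file written by save() back into [text_a, text_b, label] lists."""
    with open(fpath, 'r', encoding='utf-8') as f:
        return [line.rstrip('\n').split('\t') for line in f if line.strip()]

# round trip on a toy example (hypothetical data) in a temporary directory
toy = [['How do I learn X?', 'What is the best way to learn X?', '1']]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join('\t'.join(d) for d in toy))
    restored = load(path)
```

This assumes that no question contains a tab or a line break, which holds for examples kept by the line-based parsing used in this notebook.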
