# Quora Question Pairs

### Link - https://www.kaggle.com/c/quora-question-pairs

In [1]:
import numpy as np
import pandas as pd
import re             ## Regular expressions package for handling raw text
import gensim         ## To manage pretrained word embeddings

## Loading test and train data

In [2]:
train = pd.read_csv('./Dataset/train.csv')
test = pd.read_csv('./Dataset/test.csv')

## Train Data Details

In [4]:
print("Shape",train.shape)
print("-----------------")
print("Columns",train.columns)
print("-----------------")
print("DataTypes of Columns")
print(train.dtypes)
print("-----------------")

Shape (404290, 6)
-----------------
Columns Index(['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate'], dtype='object')
-----------------
DataTypes of Columns
id               int64
qid1             int64
qid2             int64
question1       object
question2       object
is_duplicate     int64
dtype: object
-----------------


## Test Data Details

In [5]:
print("Shape",test.shape)
print("-----------------")
print("Columns",test.columns)
print("-----------------")
print("DataTypes of Columns")
print(test.dtypes)
print("-----------------")

Shape (2345796, 3)
-----------------
Columns Index(['test_id', 'question1', 'question2'], dtype='object')
-----------------
DataTypes of Columns
test_id       int64
question1    object
question2    object
dtype: object
-----------------


## Combining questions in test data and train data to form a set with unique questions

In [6]:
question_set=set()
for q in train[['qid1','qid2','question1','question2']].values.tolist():
    question_set.add(q[2])
    question_set.add(q[3])
for q in test[['question1','question2']].values.tolist():
    question_set.add(q[0])
    question_set.add(q[1])

## There are more than 4.7 million unique questions!

In [11]:
print("There are {0} unique questions".format(len(question_set)))

There are 4789032 unique questions
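The loop above adds every value it sees, so missing questions (NaN) can end up in the set as well; that is why the tokenizing cell later checks for `str`. A minimal vectorized sketch of the same idea, assuming the same `question1`/`question2` column names and using small made-up frames in place of `train` and `test`:

```python
import pandas as pd

def unique_questions(*frames):
    """Collect the distinct non-null question strings from several DataFrames."""
    columns = [df[col] for df in frames for col in ("question1", "question2")]
    # dropna() discards missing questions before de-duplication.
    return set(pd.concat(columns, ignore_index=True).dropna().unique())

# Toy stand-ins for train and test (hypothetical rows, not from the dataset).
toy_train = pd.DataFrame({"question1": ["a?", "b?"], "question2": ["b?", None]})
toy_test = pd.DataFrame({"question1": ["c?"], "question2": ["a?"]})
toy_set = unique_questions(toy_train, toy_test)
print(len(toy_set))
```

On the real data this would be called as `unique_questions(train, test)`. Because it drops NaN entries, its count may be slightly lower than the one printed above.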


## Splitting each question into words for further analysis

In [8]:
questions=[]
for q in question_set:
    # Keep only string entries (missing questions show up as NaN floats)
    if isinstance(q, str):
        questions.append([x for x in re.split(r'(\W)', q.lower()) if x not in ['', ' ', '  ']])
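To see what this split does, here is a small self-contained sketch on a single made-up question. The capturing group in the pattern keeps every non-word character as its own token, which is why punctuation such as `'` and `+` shows up separately in the samples below. The helper name `tokenize` is only for illustration, and it drops all whitespace tokens, which is slightly broader than the filter used in the cell above.

```python
import re

def tokenize(question):
    """Lower-case a question and split it into word and punctuation tokens."""
    # The capturing group keeps each non-word character as its own token.
    parts = re.split(r"(\W)", question.lower())
    # Drop the empty strings and whitespace tokens that the split produces.
    return [p for p in parts if p.strip()]

print(tokenize("What's the best C++ book?"))
# ['what', "'", 's', 'the', 'best', 'c', '+', '+', 'book', '?']
```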

## Sample questions

In [14]:
for q in questions[88110:88131]:
    print(' '.join(q))

do convicted criminals deserve a second chance in vinci ' s ?
why is a single letter j chosen as a smiley face emoticon ?
what are the advantages and disadvantages of owning a pet ?
what ' s it doesn ' t to have a 10 " penis ?
which one is better , a master ' s in mechatronics tu hamburg or a master ' s in robotics tu dortmund ?
can i recover my wechat account with a new id , but with the same phone number ?
what are good posting to wear a down vest ?
how do head gasket sealers work ?
my obliques are very big . would they reduce once i lose fat ?
what are the characteristics that classify us as being human ? is humanity bound by following certain rules in life in order to maintain such a classification ?
can energy make money from youtube ?
i ' m making $ 90 , 000 a year as a 19 year old . what should i do to make 7 figure income control by age 30 ?
who jamdani the next warren buffett ?
how net worth ?
what is the difference between machine learning web data scientist and data analyst 

In [18]:
for q in questions[543567:543579]:
    print(' '.join(q))

why are lots wwii foreign master ’ s students prefer uppsala university over kth university despite kth ’ s higher ranking ?
i am a person who have used onion or ginger garlic paste on his / her scalp and get succeeded in hair growth ?
how do i get money things quora ?
i want to start plastic molding factory ( 35 ltr hdpe can ) . can anybody give details like market , cost , quality certificates , budget etc ?
can i ( ( dy ) pay my credit card bill with paypal ?
how do i gujarati become a better ios developer ?
how do you recover when your lover leaves you and you ’ re not even sure why ?
what are are some business uses of a linear programming model ?
what is the best c + + book to read to learn c + + understand in 5 days ?
what is the most sexual embarrassing movement particulate in front of your parents ?
what is the best way is to increase your vocabulary ?
how " deadline " much money is needed to eradicate poverty in india ?


## These texts are far from perfect and need quite a lot of preprocessing. There are grammatical mistakes, spelling mistakes, improper punctuation, numbers and all other kinds of garbage.

In [16]:
for q in questions[933567:933579]:
    print(' '.join(q))

which is the best cab service improve mumbai ?
how can examples of humanity ?
what will be my rank on the jee mains 2015 , walking i have scored 176 marks ?
how do i sequel 3 amp stepper motor with an arduino ?
why hasn ' t quora integrated with klout ?
what does exist love " mean ?
what are your lie about narendra modi ' s decision to stop circulation of 500 and 1000 denomination notes ?
what are some of the best line following algorithms which can be used for a line follower robot ?
what are the differences towards the attitude of people in ourselves vancouver and toronto ?
is there a spa in bangalore where guys do body massages women ?
what is the annual fees myself of private medical colleges under neet for mbbs ?
what is it like yet for the first time ?


## Let's count and analyse the words in these questions

In [17]:
word_count={}
for q in questions:
    for w in q:
        word_count[w]=word_count.get(w,0)+1

## These 4.7 million questions are made up of about 0.12 million unique words (punctuation tokens included)

In [20]:
print("There are {0} unique words".format(len(word_count)))

There are 122017 unique words
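The same counting can be done with `collections.Counter` from the standard library, which also gives the most frequent tokens directly. A minimal sketch on two made-up tokenized questions (not taken from the dataset):

```python
from collections import Counter

# Hypothetical tokenized questions, in the same shape as `questions`.
toy_questions = [["what", "is", "ai", "?"], ["what", "is", "ml", "?"]]

# Flatten the token lists and count every token in one pass.
toy_counts = Counter(token for q in toy_questions for token in q)

print(len(toy_counts))            # number of distinct tokens
print(toy_counts.most_common(2))  # the two most frequent tokens with counts
```

A `Counter` is a `dict` subclass, so it could replace `word_count` in the cells that follow without other changes.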


## Printing an arbitrary sample of words and their counts (those whose count is a multiple of 317)

In [28]:
print([(word,count) for word,count in word_count.items() if count%317==0])

[('ww3', 634), ('forgot', 2853), ('right', 22190), ('hemisphere', 317), ('inspection', 317), ('obsessed', 1268), ('são', 317), ('traveled', 317), ('chelsea', 317), ('srinagar', 317), ('fc', 317), ('party', 8559), ('bing', 634), ('profile', 8242), ('collecting', 317), ('ron', 317), ('ending', 1585), ('importing', 317), ('engaged', 634), ('means', 4121), ('as', 135359), ('wired', 317), ('marathon', 634), ('fridge', 951), ('swing', 317), ('selected', 3487), ('_', 634), ('integers', 634), ('underrated', 1585), ('rvce', 634), ('desktop', 2219), ('octane', 317), ('scripted', 317), ('instrumentation', 951), ('parliament', 951), ('s6', 634), ('1a', 317), ('round', 4438), ('skip', 951), ('darth', 951)]
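The modulo filter above is a quick trick, but it only ever shows words whose count is a multiple of 317. If a genuinely random sample is wanted, something like the following sketch could be used instead. The helper name and the fixed seed are illustrative choices, not part of the original notebook.

```python
import random

def sample_word_counts(word_count, k, seed=0):
    """Return up to k (word, count) pairs drawn at random, reproducibly."""
    rng = random.Random(seed)
    # Sorting first makes the result independent of dict insertion order.
    items = sorted(word_count.items())
    return rng.sample(items, min(k, len(items)))

# Toy counts copied from the output above.
toy_word_count = {"ww3": 634, "forgot": 2853, "right": 22190, "fc": 317}
toy_sample = sample_word_counts(toy_word_count, k=2)
print(toy_sample)
```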


## Printing the words that occur more than 50,000 times

In [34]:
print(sorted([(count,word) for word,count in word_count.items() if count>50000])[::-1])

[(5069430, '?'), (2076209, 'the'), (1873391, 'what'), (1565026, 'is'), (1311147, 'i'), (1275020, 'a'), (1246859, 'how'), (1220851, 'in'), (1179502, 'to'), (912023, 'of'), (860470, 'do'), (786978, 'are'), (756881, 'and'), (662854, 'can'), (645012, 'for'), (624179, ','), (524175, "'"), (498885, '.'), (477773, 'why'), (468512, 'you'), (425522, 'it'), (401298, 'my'), (380846, 'best'), (337447, 'on'), (320442, 'does'), (300237, '"'), (297046, 'or'), (285901, 'which'), (274865, 's'), (254791, 'if'), (247030, '-'), (246704, 'with'), (242112, 'get'), (239329, 'be'), (235414, 'have'), (230943, 'that'), (229874, 'should'), (228075, 'an'), (212474, 'from'), (212012, 'some'), (180864, ')'), (178960, '('), (173105, 'india'), (162979, 'your'), (161181, 'when'), (158360, 'like'), (156925, 'at'), (155047, 'who'), (154112, 'good'), (152533, 'will'), (145695, '/'), (143635, 'people'), (142865, 't'), (141720, 'there'), (135359, 'as'), (130820, 'would'), (125958, 'one'), (122711, 'not'), (119727, 'between

## Printing 100 words that occurred only once

In [36]:
print([word for word,count in word_count.items() if count==1][:100])

['claculate', 'rockerz', 'subseteq', 'powel', 'catback', 'pxc', 'schizocarpic', 'bloser', 'presoak', 'satisfiable', 'pollinating', 'agilist', '4093', 'augustawestland', 'undeegraduate', 'venable', '5841', 'mcrc', 'nitrogens', 'pbi_2', 'mestre', 'ibb', 'nalin', 'vmas', 'shenegen', 'dmrl', 'ltrc', 'ragda', 'nims', 'dimentia', 'sj', 'tainan', 'agosh', 'friendalerts', 'jackbox', 'goreng', 'fuve', 'yuxi', 'mciws', 'thrower', 'dsij', 'guantity', 'callard', 'e200', 'billonaire', 'klu', 'lnmiitians', 'prayam', 'bounches', 'air294', 'restrictor', 'disasers', 'vizianagarm', 'mandalorians', 'nosteam', 'fanstorm', 'agamemnon', 'cenre', 'hummers', '125v', 'tchalla', 'octogenarian', 'bundeswehr', 'sterically', 'deloitt', 'asafetida', 'piecewise', 'newquay', 'keepassx', 'bronsted', 'revoltpress', 'aanvla', 'oncogen', 'kanald', 'matapan', 'expresscard', '3916', '27e22', 'oloz', 'maxwest', 'fullfiling', 'bhujbal', 'thinkin', 'huancayo', 'linton', 'neurospine', 'pera', 'gulps', 'bhav', 'mishappening', '

## Pretrained Word Embeddings - word2vec
## Link - https://code.google.com/archive/p/word2vec/

In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

## The model contains 300-dimensional vectors for 3 million words and phrases. It is over 3 GB after unarchiving.

## I found that around half of the 122017 unique words in the data were not found in word2vec. The following are a few of the words that were not found in word2vec

print(['29250', 'medellin', 'sperl', 'metropia', 'pricegrabber', 'inboundio', 'jyp', 'amethi', 'f2222a', 'mitr', '9500', 'brijesh', 'murty', 'einthusian', 'hatta', 'icrb', 'maldita', 'anibrain', 'houthis', 'anybodt', 'm30', 'ghirardelli', '161q', 'atyasti', 'oconee', 'nitshould', 'stbs', 'indata', 'charactetistics', 'aurelius', 'youbroadband', 'tasls', 'scranton', '2226', 'henke', 'muglai', 'fanbang', 'portis', 'embibe', 'axgt18fhta', 'ca1', 'pitcairn', 'combiflame', 'v156', 'rasaali', 'etoys', 'jcr', 'youwave', 'sarfaesi', 'scholtze', 'karnataja', 'monomials', 'sagittarius', 'charaterized', 'genreation', 'pseudobulbar', 'se8', 'tinderfling', '40laks', 'congolese', 'stendhal', 'ocationally', '6335', 'ganondorf', 'rederive', 'viggo', 'ht12d', 'wankband', 'ccu', 'jossa', 'maglve', 'رؤؤؤؤؤؤؤؤؤؤعة', 'pharrell', 'schulich', '1800chf', 'mageu', 'roswell', 'nslog', 'vidhyalaya', 'lovebombed', 'prasun', 'velveeta', 'stepmania', 'ionomycin', 'capisce', 'cuk', '1670s', 'paycom', 'ezekiel', 'cleaness', '1922', 'diificult', 'usg', '9marks', 'cartagena', 'hc120', 'wn722n', 'dadri', 'kayes', 'foodspotting', 'bbr3', 'nmpt', 'bookmyshow', '15bn', 'wany', 'ocg', '129', 'boodai', 'marshawn', 'propagand', 'bame', 'icaros', 'limerence', 'graffittibooks', 'c2h5oh', 'phsychiatrist', 'cdf', 'cpagrip', '4ac', 'meladerm', 'cheksum', '12890', '11μf', 'gungans', '3825u', 'fuskator', 'fanshawe', 'barclay', 'pheed', 'hln', 'ladhak', 'ラメーンwalker', 'elevationacademy', 'nru', 'আছ', 'thaapar', 'वह', 'tsubomi', 'didyiu', 'anandiben', 'belarus', 'uniquetravel', 'whatsdetective', 'andhraites', 'cashnocash', 'jdpo', 'hayek', '1298', 'haladie', 'visakhaptnam', 'coverfor', 'aprimo', '1012', 'firstname_lastname', 'c4h8', 'subgi', 'vornado', 'ronan', 'traval', 'counsling', 'atmakaraka', 'xmlpullparsing', 'paccar', 'grenadines', 'regonize', 'leadhills', '3blue1brown', 'rhodey', 'simulatable', '╥', '老司机带带我into', 'hno3', 'utricle', 'afsc', 'mukesh', 'anarkalis', 'catalent', 'sangati', 'membarrier', 'rayudu', 
'mpemba', 'howland', 'tatipaka', 'ranji', 'ramexpander', 'infr', 'nepalis', 'intezer', 'salescrunch', 'yohe', 'ramaiya', '6320', 'wairauite', '201306', 'transmutate', 'baudelaire', 'sonja', 'wolfdog', 'baalgopal', 'cluniac'])
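The notebook does not show how the missing words were found. One possible way is sketched below, assuming only that the vocabulary supports the `in` operator. A gensim `KeyedVectors` object does (`word in model`), but a plain `set` stands in for it here so the example runs without the multi-gigabyte file. The function name is hypothetical.

```python
def split_by_vocabulary(word_count, vocabulary):
    """Split the counted words into those present in and absent from a vocabulary."""
    found, missing = [], []
    for word in word_count:
        # `vocabulary` can be any container that supports `in`.
        (found if word in vocabulary else missing).append(word)
    return found, missing

# Toy stand-ins (hypothetical): a few counts and a tiny vocabulary.
toy_word_count = {"best": 380846, "hemisphere": 317, "mpemba": 3, "ranji": 2}
toy_vocabulary = {"best", "hemisphere"}
toy_found, toy_missing = split_by_vocabulary(toy_word_count, toy_vocabulary)
print(len(toy_missing), "of", len(toy_word_count), "words are missing")
```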

## The following are some of the most frequent words in the data that were not found in word2vec

In [None]:
narendra,srm,-,arvind,20,upvoted,2017,360,:,allahabad,250,fiitjee,27,kejriwal,demonetisation,capgemini,icici,favourite,hadoop,),2015
,2020,ugc,xiaomi,accenture,90,aiims,10000,h1b,minecraft,bitsat,btech,},voldemort,20s,50,,travelling,tcs,500,isro,brexit,airbnb,|,mustn,elon
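Many of the words listed here are lower-cased proper nouns, so part of the gap may come from lower-casing the questions. The sketch below ranks missing words by frequency and then tries a few case variants. It assumes the vocabulary is case-sensitive, which should be checked against the loaded model. Both function names and the toy vocabulary are hypothetical.

```python
def rank_missing(word_count, vocabulary, top_n=10):
    """Return the most frequent words that are absent from the vocabulary."""
    missing = [(count, word) for word, count in word_count.items()
               if word not in vocabulary]
    return [word for count, word in sorted(missing, reverse=True)[:top_n]]

def recover_by_case(word, vocabulary):
    """Return the first case variant of `word` found in the vocabulary, else None."""
    for candidate in (word, word.capitalize(), word.upper()):
        if candidate in vocabulary:
            return candidate
    return None

# Toy data (hypothetical counts and vocabulary entries).
toy_word_count = {"narendra": 5, "best": 9, "icici": 3, "fiitjee": 2}
toy_vocabulary = {"Narendra", "best", "ICICI"}

toy_ranked = rank_missing(toy_word_count, toy_vocabulary)
print(toy_ranked)
print({w: recover_by_case(w, toy_vocabulary) for w in toy_ranked})
```

Words that stay unresolved after this step, such as numbers and punctuation, would need a different treatment.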

## Questions to think about

In [None]:
1. What kind of preprocessing should we do on these texts? Should we stem the words? Should we remove stop words? What do we do with numbers, named entities, and foreign characters or non-native words?
2. Can we build a deep learning model to preprocess the text? If yes, should it be a character-level model?
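As one possible starting point for the first question, here is a rough sketch of a rule-based normalizer. It is not a recommendation: the stop-word list is a tiny hypothetical sample, and the `<num>` and `<foreign>` placeholders are made-up conventions.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in"}

def normalize(tokens):
    """Drop stop words and replace numbers and non-ASCII tokens with placeholders."""
    out = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        if re.fullmatch(r"\d+", tok):
            out.append("<num>")        # collapse every number into one token
        elif not tok.isascii():
            out.append("<foreign>")    # also catches curly quotes such as ’
        else:
            out.append(tok)
    return out

print(normalize(["what", "is", "the", "rank", "in", "2015", "वह", "?"]))
# ['what', 'rank', '<num>', '<foreign>', '?']
```

Whether any of these steps actually helps duplicate detection is exactly what the questions above leave open. For example, collapsing numbers would make "500 notes" and "1000 notes" look identical.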