# TheMedNet Data Challenge: Lukasz Przychodzien

The goal is to identify whether two questions are duplicates or not. Please submit a GitHub repository that includes a Python notebook. Report your performance on the included test set.
https://www.kaggle.com/c/quora-question-pairs/overview 

For each ID in the test set, you must predict the probability that the questions are duplicates (a number between 0 and 1). The file should contain a header and have the following format:

In [None]:
test_id,is_duplicate
0,0.5
1,0.4
2,0.9
etc.
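The format above can be produced with the standard `csv` module. The following is a minimal sketch, not the notebook's actual submission code: the helper name `write_submission` and the three example predictions are assumptions made for illustration.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (test_id, probability) rows in the required CSV format.

    `write_submission` is a hypothetical helper, not part of the notebook.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["test_id", "is_duplicate"])  # required header
    for test_id, prob in predictions:
        # Each prediction must be a probability between 0 and 1.
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for test_id {test_id}: {prob}")
        writer.writerow([test_id, prob])


# Made-up predictions that mirror the sample rows above.
example_predictions = [(0, 0.5), (1, 0.4), (2, 0.9)]

buffer = io.StringIO()
write_submission(example_predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write a real file, pass an open file handle (for example `open('submission.csv', 'w', newline='')`) in place of the in-memory buffer.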

The goal of this competition is to predict which of the provided pairs of questions contain two questions with the same meaning. The ground truth is the set of labels that have been supplied by human experts. The ground truth labels are inherently subjective, as the true meaning of sentences can never be known with certainty. Human labeling is also a 'noisy' process, and reasonable people will disagree. As a result, the ground truth labels on this dataset should be taken to be 'informed' but not 100% accurate, and may include incorrect labeling. We believe the labels, on the whole, to represent a reasonable consensus, but this may often not be true on a case by case basis for individual items in the dataset.

## Data fields
id - the id of a training set question pair

qid1, qid2 - unique ids of each question (only available in train.csv)

question1, question2 - the full text of each question

is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

## Evaluation
Submissions are evaluated on the log loss between the predicted values and the ground truth. Therefore, we are striving for a lower log-loss value.

https://www.kaggle.com/dansbecker/what-is-log-loss

https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb 
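For binary labels, log loss is the average of -[y·ln(p) + (1 - y)·ln(1 - p)] over all pairs, where y is the true label and p the predicted probability of a duplicate. The sketch below implements that formula with the standard library only. The function name `log_loss`, the clipping constant `eps`, and the toy labels are assumptions for illustration; Kaggle's exact clipping may differ.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip so that a prediction of exactly 0 or 1 never hits log(0).
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Toy labels: the first pair is a duplicate, the second is not.
labels = [1, 0]

confident_right = log_loss(labels, [0.9, 0.1])  # confident and correct
confident_wrong = log_loss(labels, [0.1, 0.9])  # confident and wrong

print(confident_right, confident_wrong)
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is why well-calibrated probabilities matter for this metric.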

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
os.chdir('C:/Users/lprzy/Documents/takehome/quora-question-pairs')

In [9]:
df_train = pd.read_csv('train.csv')

In [10]:
df_train.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


## Exploratory Analysis of Training Data

Add visualizations

In [11]:
#Total number of question pairs for training
len(df_train.id)

404290

In [13]:
#Total number of duplicate pairs
len(df_train[df_train.is_duplicate==1])

149263

In [14]:
#Fraction of pairs that are duplicates
len(df_train[df_train.is_duplicate==1])/len(df_train.id)

0.369197853026293
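The duplicate rate gives a useful reference point for the log-loss metric: predicting the same constant probability for every pair. The sketch below uses the two counts printed above and computes the training-set log loss of that constant prediction. It is a baseline under the assumption that the class balance is the only information used; the test set's balance may differ, so the leaderboard score of such a submission need not match.

```python
import math

# Counts taken from the cell outputs above.
n_pairs = 404290
n_duplicates = 149263

# Constant prediction: the observed duplicate rate.
p = n_duplicates / n_pairs

# Log loss of predicting p for every training pair:
# a fraction p of pairs has label 1, the remainder has label 0.
baseline = -(p * math.log(p) + (1 - p) * math.log(1 - p))

print(f"duplicate rate: {p:.4f}, baseline log loss: {baseline:.4f}")
```

Any model worth keeping should score below this value on held-out training data.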

In [23]:
#Total number of unique questions
totqid = list(df_train.qid1)+list(df_train.qid2)
unique_elem, counts_elem = np.unique(totqid, return_counts=True)
len(unique_elem)

537933

In [33]:
#Number of questions that appear more than once
np.sum(counts_elem > 1)

111780

In [35]:
#Fraction of unique questions that appear more than once
np.sum(counts_elem > 1)/len(counts_elem)

0.20779539459375052

In [37]:
#Maximum number of times one question appears in the dataset
counts_elem.max()

157

In [65]:
totquest = pd.Series(df_train.question1.tolist() + df_train.question2.tolist()).astype(str)

In [66]:
#Maximum number of characters found in a question
totquest.apply(len).max()

1169

In [67]:
#Maximum number of words found in a question
totquest.apply(lambda x: len(x.split(' '))).max()

237

In [68]:
#Number of distinct question lengths (in words); note this is not the vocabulary size
totquest.apply(lambda x: len(x.split(' '))).nunique()

102
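The last cell counts how many distinct word-count values occur, not how many distinct words the corpus contains. If the vocabulary size is what is wanted, one possible approach is to collect every token into a set. This is a minimal sketch: the helper name `vocabulary_size`, the lowercasing, and the whitespace tokenization are assumptions, and punctuation is left attached to words.

```python
def vocabulary_size(questions):
    """Count distinct lowercased, whitespace-separated tokens.

    `vocabulary_size` is a hypothetical helper, not part of the notebook.
    """
    vocab = set()
    for question in questions:
        # str() guards against missing values read in as NaN.
        vocab.update(str(question).lower().split())
    return len(vocab)


# Made-up questions standing in for the real corpus.
sample_questions = ["What is log loss?", "What is a duplicate question?"]
sample_vocab_size = vocabulary_size(sample_questions)
print(sample_vocab_size)
```

Calling `vocabulary_size(totquest)` would apply the same count to the full training corpus. A tokenizer that strips punctuation would give a smaller, cleaner vocabulary.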