**Objective:**
Our primary objective is to assess the textual similarity between pairs of questions, labeled "**question1**" and "**question2**," in our dataset. Each row contains unique question identifiers ("**qid1**" and "**qid2**"), the text of the two questions ("**question1**" and "**question2**"), and a binary label ("**is_duplicate**") indicating whether the two questions are duplicates (1) or not (0).

**Primary Goals:**
1. **Model Accuracy:** Our foremost goal is to construct a highly accurate machine learning model. This model aims to evaluate the similarity between pairs of questions effectively. By achieving high accuracy, we aim to enhance the overall user experience on Quora, ensuring that users receive meaningful and relevant question recommendations.

2. **Recommendation System:** In cases where two questions are deemed similar, we intend to utilize the model to provide users with recommendations to explore similar questions. This recommendation system is designed to enhance user engagement and satisfaction on the platform.

**Considerations:**
- **Latency:** While we prioritize accuracy, latency (response time) is not a critical concern in this context. Our focus is primarily on delivering precise recommendations, and we are willing to accept slightly longer processing times to achieve this objective.

- **Interpretability:** The interpretability of the model is not a primary concern. We prioritize performance and precision over the ability to easily interpret the model's decision-making processes.

- **Precision:** We emphasize the importance of precision in our model. High precision ensures that the majority of the predicted similar question pairs are indeed true positives, minimizing the risk of recommending irrelevant questions to users.
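
To make the precision criterion concrete, here is a minimal sketch of how it could be computed for the duplicate class. The `precision` helper and the label lists are hypothetical and only illustrate the definition (true positives divided by all predicted positives); they are not part of the dataset or of any model output.

```python
def precision(y_true, y_pred):
    """Fraction of pairs predicted as duplicates (1) that really are duplicates."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # With no positive predictions, precision is undefined; return 0.0 by convention here.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical labels and predictions for six question pairs
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]

# Two true positives and one false positive
print(precision(y_true, y_pred))
```

A missed duplicate (the fourth pair) does not lower precision; it would only lower recall, which matches the stated priority of avoiding irrelevant recommendations.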

In summary, our formal agenda revolves around constructing a high-accuracy, high-precision model to evaluate the similarity of questions on Quora, accepting somewhat higher latency where needed. This model will serve as the foundation for a recommendation system aimed at enhancing the user experience and engagement on the platform.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print('File name:    ', os.path.join(dirname, filename), '\tFile Size:', str(round(os.path.getsize(os.path.join(dirname, filename)) / 1000000, 2)) + 'MB')

In [None]:
df = pd.read_csv('/kaggle/input/quora-question-pairs/train.csv.zip')
df.head(10)

`question1` and `question2` are free-text questions; for each pair we want to decide whether the two questions are similar.

* We want a high-accuracy model so that the customer experience on Quora is not hampered.
* If two questions are similar, we can recommend that the user refer to the similar question.
* Latency is not a major concern.
* Interpretability is not important.
* We need a high-precision model here.

**`id:`** Row index of the question pair

**`qid(1,2):`** Unique IDs of the two questions in the pair

**`question(1,2):`** The actual text of the two questions.

**`is_duplicate:`** The class label that we have to predict (0 for non-duplicates, 1 for duplicates).

# EDA

In [None]:
df.shape

We have about 400k records or question pairs.

In [None]:
# About 63% of the pairs are non-duplicates and about 37% are duplicates,
# so the classes are moderately imbalanced.

print(df['is_duplicate'].value_counts())
print(df['is_duplicate'].value_counts(normalize=True))

df.groupby("is_duplicate")['id'].count().plot.bar()

In [None]:
# question1 has one NULL value and question2 has two NULL values

print(df.info())

In [None]:
total_questions_id = pd.concat([df['qid1'], df['qid2']], axis = 0)
plt.figure(figsize=(12, 5))
plt.hist(total_questions_id.value_counts(), bins=100)
plt.yscale('log')
plt.title('Log-Histogram of question appearance counts')
plt.xlabel('Number of occurrences of question')
plt.ylabel('Number of questions')
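
The histogram can be complemented with a few summary numbers. The sketch below uses a small hypothetical frame (`toy`) with the same `qid1`/`qid2` layout to show how the number of unique questions, the number of repeated questions, and the maximum repetition count could be derived from the same `value_counts()` result.

```python
import pandas as pd

# Hypothetical question IDs for four pairs
toy = pd.DataFrame({'qid1': [1, 3, 1, 5], 'qid2': [2, 4, 4, 1]})

# Count how often each question ID appears across both columns
counts = pd.concat([toy['qid1'], toy['qid2']], axis=0).value_counts()

n_unique = counts.shape[0]             # distinct questions
n_repeated = int((counts > 1).sum())   # questions that appear more than once
max_count = int(counts.max())          # appearances of the most frequent question

print(n_unique, n_repeated, max_count)
```

Applied to the real data, the same three lines would run on `total_questions_id.value_counts()`.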

In [None]:
# Checking for duplicate question pairs - there are no duplicate rows

# df.duplicated(['qid1', 'qid2', 'question1', 'question2']).any()

df.duplicated(['question1', 'question2']).any()
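
Note that `duplicated` compares the columns in order, so a pair stored once as (A, B) and once as (B, A) would not be flagged. Whether such swapped pairs exist in this dataset is not checked above; the sketch below, on a hypothetical toy frame, shows one way an order-insensitive check could be written.

```python
import pandas as pd

# Hypothetical rows; the second row repeats the first with the questions swapped
toy = pd.DataFrame({
    'question1': ['How do I learn Python?', 'What is pandas?', 'Why is the sky blue?'],
    'question2': ['What is pandas?', 'How do I learn Python?', 'What is NumPy?'],
})

# Sort each pair so that (a, b) and (b, a) map to the same key
pair_key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(toy['question1'], toy['question2'])],
    index=toy.index,
)

ordered_dupes = bool(toy.duplicated(['question1', 'question2']).any())
unordered_dupes = bool(pair_key.duplicated().any())

print(ordered_dupes, unordered_dupes)
```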

In [None]:
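# Looking at the rows with null questions
# There are 3 rows containing a null value, which I will simply drop

# axis is passed by keyword; recent pandas versions no longer accept it positionally
nan_rows = df[df.isnull().any(axis=1)]
print(nan_rows)

df.dropna(axis=0, inplace=True)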

# Feature Engineering

In [None]:
df['frequency_qid1'] = df.groupby('qid1')['qid1'].transform('count')
df['frequency_qid2'] = df.groupby('qid2')['qid2'].transform('count')
df['len_ques1'] = df['question1'].str.len()
df['len_ques2'] = df['question2'].str.len()
df['#_of_words_ques1'] = df['question1'].apply(lambda row: len(row.split(" ")))
df['#_of_words_ques2'] = df['question2'].apply(lambda row: len(row.split(" ")))

def word_common_calculator(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
    return 1.0 * len(w1 & w2)
df['word_common'] = df.apply(word_common_calculator, axis=1)

def word_share_calculator(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
    return 1.0 * len(w1 & w2) / (len(w1) + len(w2))
df['word_share'] = df.apply(word_share_calculator, axis=1)


df.head()
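
As a worked example of the two overlap features, the sketch below repeats the same set logic on a hypothetical pair of questions. The `word_sets` helper and the example strings are illustrative only. Because the denominator is `len(w1) + len(w2)` rather than the size of the union, two identical questions get a `word_share` of 0.5, not 1.0.

```python
def word_sets(q1, q2):
    """Lowercased, stripped word sets, mirroring the feature functions above."""
    w1 = {word.lower().strip() for word in q1.split(" ")}
    w2 = {word.lower().strip() for word in q2.split(" ")}
    return w1, w2

# Hypothetical question pair
q1 = "How do I learn Python"
q2 = "How can I learn Python fast"

w1, w2 = word_sets(q1, q2)
word_common = 1.0 * len(w1 & w2)                  # shared words: how, i, learn, python
word_share = word_common / (len(w1) + len(w2))    # 4 / (5 + 6)

# Identical questions reach the maximum value of this feature
s1, s2 = word_sets(q1, q1)
identical_share = 1.0 * len(s1 & s2) / (len(s1) + len(s2))

print(word_common, word_share, identical_share)
```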

In [None]:
# import seaborn as sns
# sns.set_style('whitegrid')
# sns.pairplot(df.iloc[:, 5:], hue = 'is_duplicate', height = 3)
# plt.show()

pd.options.display.max_colwidth = 100

In [None]:
df[['question1', 'question2', 'is_duplicate']][df['is_duplicate'] == 0]

In [None]:
df[['question1', 'question2', 'is_duplicate']][df['is_duplicate'] == 1]

In [None]:
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))  # a set makes membership tests fast

def remove_stop_word(text):
    # Keep only the non-stop-words of a question (lowercased, "?" removed)
    words = [word.replace("?", "") for word in text.lower().split() if word.replace("?", "") not in STOP_WORDS]
    return ' '.join(words)

def include_stop_word(text):
    # Keep only the stop-words of a question (lowercased, "?" removed)
    words = [word.replace("?", "") for word in text.lower().split() if word.replace("?", "") in STOP_WORDS]
    return ' '.join(words)

df['transformed_question1'] = df.question1.apply(remove_stop_word)
df['transformed_question2'] = df.question2.apply(remove_stop_word)
df['stopword_question1'] = df.question1.apply(include_stop_word)
df['stopword_question2'] = df.question2.apply(include_stop_word)

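# word_common_calculator and word_share_calculator always read 'question1'/'question2',
# so the overlap on the derived columns is computed with a column-aware helper.
def word_overlap(row, col1, col2, share=False):
    w1, w2 = set(row[col1].split()), set(row[col2].split())
    common = 1.0 * len(w1 & w2)
    if not share:
        return common
    total = len(w1) + len(w2)
    return common / total if total else 0.0  # both strings empty after filtering

df['word_common_transformed_question'] = df.apply(word_overlap, axis=1, args=('transformed_question1', 'transformed_question2'))
df['word_share_transformed_question'] = df.apply(word_overlap, axis=1, args=('transformed_question1', 'transformed_question2', True))
df['word_common_stopword_question'] = df.apply(word_overlap, axis=1, args=('stopword_question1', 'stopword_question2'))
df['word_share_stopword_question'] = df.apply(word_overlap, axis=1, args=('stopword_question1', 'stopword_question2', True))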
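
To illustrate what the stop-word split produces, here is a minimal sketch on a single hypothetical question. It uses a tiny hand-written stop-word set (`TOY_STOP_WORDS`) instead of the NLTK list, so the exact output on real data may differ.

```python
# Hypothetical mini stop-word list; the notebook itself uses NLTK's English list
TOY_STOP_WORDS = {"what", "is", "the", "of", "a", "in"}

def split_by_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Return (content words, stop-words) as two space-joined strings."""
    tokens = [word.replace("?", "") for word in text.lower().split()]
    content = [word for word in tokens if word not in stop_words]
    stops = [word for word in tokens if word in stop_words]
    return " ".join(content), " ".join(stops)

content, stops = split_by_stop_words("What is the capital of France?")
print(content)  # capital france
print(stops)    # what is the of
```

A question made up entirely of stop-words yields an empty content string, which is why any overlap ratio on these columns needs a guard against an empty denominator.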

In [None]:
df.head()

In [None]:
# Character-length difference between the two stop-word-free questions
df['transformed_question_diff'] = df['transformed_question1'].apply(len) - df['transformed_question2'].apply(len)

In [None]:
df.head(5)