**Objective:**
Our main goal is to compare the two question columns in our dataset, `question1` and `question2`, and determine which questions on Quora are effectively repeats of ones that have already been asked. This is helpful because it lets us quickly surface answers to questions that have already been answered. Our task is to decide whether each pair of questions is a duplicate or not. The target is a binary label, and submissions are evaluated with log loss, which scores predicted probabilities rather than hard 0/1 labels.
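
To make the metric concrete, here is a minimal sketch of binary log loss in plain Python. The function name `binary_log_loss`, the clipping constant `eps`, and the toy labels and probabilities are illustrative assumptions, not part of the competition code.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Toy data (hypothetical): same labels, good vs. bad probability estimates
labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

print(binary_log_loss(labels, confident_right))
print(binary_log_loss(labels, confident_wrong))
```

Confident but wrong probabilities are penalized much more heavily than confident and correct ones, which is why well-calibrated probabilities matter for this metric.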

**And if you find my work valuable, I would greatly appreciate your support by giving me upvotes. They're like a boost of motivation for me.**

Now, let's dive into the data and get started!

**Primary Goals:**
1. **Model Accuracy:** Our foremost goal is to construct a highly accurate machine learning model. This model aims to evaluate the similarity between pairs of questions effectively. By achieving high accuracy, we aim to enhance the overall user experience on Quora, ensuring that users receive meaningful and relevant question recommendations.

2. **Recommendation System:** In cases where two questions are deemed similar, we intend to utilize the model to provide users with recommendations to explore similar questions. This recommendation system is designed to enhance user engagement and satisfaction on the platform.

**Considerations:**
- **Latency:** While we prioritize accuracy, latency (response time) is not a critical concern in this context. Our focus is primarily on delivering precise recommendations, and we are willing to accept slightly longer processing times to achieve this objective.

- **Interpretability:** The interpretability of the model is not a primary concern. We prioritize performance and precision over the ability to easily interpret the model's decision-making processes.

- **Precision:** We emphasize the importance of precision in our model. High precision ensures that the majority of the predicted similar question pairs are indeed true positives, minimizing the risk of recommending irrelevant questions to users.

In summary, our agenda is to build a high-accuracy, high-precision model for evaluating the similarity of questions on Quora, accepting somewhat longer processing times where needed. This model will serve as the foundation for a recommendation system aimed at enhancing the user experience and engagement on the platform.
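
Since precision is emphasized above, the following sketch shows how precision (and recall, for contrast) could be computed from predicted probabilities at a chosen decision threshold. The helper `precision_recall`, the thresholds, and the toy data are assumptions made for illustration only.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall of the positive (duplicate) class at a given threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1          # predicted duplicate, truly duplicate
        elif pred == 1 and y == 0:
            fp += 1          # predicted duplicate, actually different
        elif pred == 0 and y == 1:
            fn += 1          # missed duplicate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data (hypothetical)
y_true = [1, 1, 0, 0, 1, 0]
y_prob = [0.95, 0.70, 0.65, 0.20, 0.40, 0.10]

for t in (0.5, 0.8):
    print(t, precision_recall(y_true, y_prob, t))
```

Raising the threshold in this toy example increases precision at the cost of recall, which is the usual trade-off when irrelevant recommendations are more costly than missed ones.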

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print('File name:    ', os.path.join(dirname, filename), '\tFile Size:', str(round(os.path.getsize(os.path.join(dirname, filename)) / 1000000, 2)) + 'MB')

In [None]:
df = pd.read_csv('/kaggle/input/quora-question-pairs/train.csv.zip')
df.head(10)

We have question1 and question2 as text, and we want to check whether the two questions are similar.

* We would like to build a high-accuracy model so that the customer experience at Quora is not hampered.
* If two questions are similar, we could recommend that the user refer to the similar questions.
* Latency is not a major concern.
* Interpretability is not important. Hence, there is no need to explain the rationale behind the model's prediction of a particular class label.
* We need a high-precision model here.

**`id:`** Row index of the question pair

**`qid(1,2):`** Unique IDs of the two questions in the pair

**`question(1,2):`** The actual data/text of the questions.

**`is_duplicate:`** The class label that we have to predict (0 for non-duplicates, 1 for duplicates).

# EDA

In [None]:
df.shape

We have about 400k records or question pairs.

In [None]:
# About 63% of the pairs are non-duplicates and about 37% are duplicates.
# The classes are moderately imbalanced.

print(df['is_duplicate'].value_counts())
print(df['is_duplicate'].value_counts(normalize=True))

df.groupby("is_duplicate")['id'].count().plot.bar()
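
The class split also gives a useful reference point for log loss: a model that ignores the questions and always predicts the training duplicate rate scores the entropy of that rate. The sketch below computes this for the roughly 37% duplicate share noted above. The function name `constant_prediction_log_loss` is hypothetical, and the figure applies only to data with the same class balance.

```python
import math

def constant_prediction_log_loss(positive_rate):
    """Log loss obtained by predicting `positive_rate` for every pair,
    when that rate equals the true share of positives."""
    p = positive_rate
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

baseline = constant_prediction_log_loss(0.37)  # about 0.66
print(round(baseline, 3))
```

Any trained model should score noticeably below this value on held-out data with a similar class balance.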

In [None]:
# question2 has two NULL values and question1 has one

df.info()

**Question Distribution**

Let's explore how often each question appears:

In [None]:
total_questions_id = pd.concat([df['qid1'], df['qid2']], axis = 0)
plt.figure(figsize=(12, 5))
plt.hist(total_questions_id.value_counts(), bins=100)
plt.yscale('log')
plt.title('Log-Histogram of question appearance counts')
plt.xlabel('Number of occurrences of question')
plt.ylabel('Number of questions')

# A few questions are asked many times, while most questions appear only once or a few times.
# Out of about 400k question pairs, about 37% are labeled as duplicates.
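
The histogram can be complemented with a few summary numbers. The sketch below uses plain Python on a tiny, made-up set of question IDs to show how the appearance counts behind the plot could be summarized. The lists `qid1` and `qid2` here are toy stand-ins for the dataframe columns, not real data.

```python
from collections import Counter

# Toy stand-ins (hypothetical) for df['qid1'] and df['qid2']
qid1 = [1, 3, 5, 1, 7]
qid2 = [2, 4, 6, 4, 1]

# Count how many times each question ID appears across both columns
appearances = Counter(qid1 + qid2)

unique_questions = len(appearances)
repeated_questions = sum(1 for count in appearances.values() if count > 1)
most_common_qid, max_count = appearances.most_common(1)[0]

print("Unique questions:", unique_questions)
print("Questions appearing more than once:", repeated_questions)
print("Most frequent question ID:", most_common_qid, "with", max_count, "appearances")
```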

**Word Distribution**

Let's explore the distribution of the words

In [None]:
# Looking at the null questions - let's remove them first, because they won't impact the analysis much.
# There are 3 rows with a NULL question, which I will simply drop.

# Pass axis by keyword: positional axis arguments are not accepted by recent pandas versions
nan_rows = df[df.isnull().any(axis=1)]
print(nan_rows)

df.dropna(axis=0, inplace=True)

In [None]:
# Calculate word count for each question
df['no_of_words_ques1'] = df['question1'].apply(lambda row: len(row.split(" ")))
df['no_of_words_ques2'] = df['question2'].apply(lambda row: len(row.split(" ")))


# Prepare data for line chart
word_counts1 = df['no_of_words_ques1'].value_counts().sort_index()
word_counts2 = df['no_of_words_ques2'].value_counts().sort_index()

# Create a line chart for word count distribution
plt.figure(figsize=(10, 5))

# Line chart for question 1
plt.subplot(1, 2, 1)
plt.plot(word_counts1.index, word_counts1.values, color='blue')
plt.title('Word Count Distribution for Question 1')
plt.xlabel('Word Count')
plt.ylabel('Frequency')

# Line chart for question 2
plt.subplot(1, 2, 2)
plt.plot(word_counts2.index, word_counts2.values, color='green')
plt.title('Word Count Distribution for Question 2')
plt.xlabel('Word Count')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# The word count distribution for question 1 looks roughly log-normal.
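
One informal way to probe the log-normal impression: if word counts are roughly log-normal, their logarithms should look roughly symmetric, so the skewness should drop towards zero after a log transform. The sketch below demonstrates the idea on synthetic counts drawn from a log-normal distribution with arbitrarily chosen parameters. It does not use the Quora data, and the helper `skewness` is an illustrative assumption.

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: mean of cubed standardized values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Synthetic word counts (assumption): rounded log-normal draws, not real data
random.seed(0)
word_counts = [max(1, round(random.lognormvariate(2.3, 0.5))) for _ in range(5000)]

raw_skew = skewness(word_counts)
log_skew = skewness([math.log(c) for c in word_counts])

print("Skewness of raw counts:", raw_skew)
print("Skewness of log counts:", log_skew)
```

Applying the same check to `df['no_of_words_ques1']` would show whether the real counts behave similarly.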

In [None]:
# Checking for duplicates - no duplicate rows

# df.duplicated(['qid1', 'qid2', 'question1', 'question2']).any()

df.duplicated(['question1', 'question2']).any()
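
The check above only finds pairs repeated in the same order. A pair stored once as (A, B) and once as (B, A) would not be flagged. The following sketch, in plain Python on made-up questions, shows one way to detect such order-insensitive repeats. The function `has_unordered_duplicates` is hypothetical and not part of the original analysis.

```python
def has_unordered_duplicates(pairs):
    """Return True if any question pair occurs twice, ignoring the order within the pair."""
    seen = set()
    for q1, q2 in pairs:
        key = tuple(sorted((q1, q2)))  # (A, B) and (B, A) map to the same key
        if key in seen:
            return True
        seen.add(key)
    return False

# Toy pairs (hypothetical): the second pair is the first one with the order swapped
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("What is the best way to learn Python?", "How do I learn Python?"),
    ("What is AI?", "What is machine learning?"),
]

print(has_unordered_duplicates(pairs))
```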

# Feature Engineering

In [None]:
from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus; run nltk.download('stopwords') if it is missing.
STOP_WORDS = set(stopwords.words("english"))  # a set gives fast membership checks

def word_set(text):
    # Lowercased, whitespace-stripped tokens obtained by splitting on single spaces
    return set(word.lower().strip() for word in text.split(" "))

def word_match(row, col1='question1', col2='question2'):
    # Number of distinct words shared by the two questions
    return 1.0 * len(word_set(row[col1]) & word_set(row[col2]))

def word_share(row, col1='question1', col2='question2'):
    # Shared words divided by the combined size of both word sets
    w1, w2 = word_set(row[col1]), word_set(row[col2])
    return 1.0 * len(w1 & w2) / (len(w1) + len(w2))

def remove_stop_words(text):
    # Lowercase, drop question marks, and remove English stop words
    words = [word.replace("?", "") for word in text.lower().split()]
    return ' '.join(word for word in words if word not in STOP_WORDS)

df['word_match'] = df.apply(word_match, axis=1)
df['word_share'] = df.apply(word_share, axis=1)

# How often each question ID appears within its own column
df['freuqency_qid1'] = df.groupby('qid1')['qid1'].transform('count')
df['freuqency_qid2'] = df.groupby('qid2')['qid2'].transform('count')

# Character length of each question
df['len_ques1'] = df['question1'].str.len()
df['len_ques2'] = df['question2'].str.len()

df['transformed_question1'] = df['question1'].apply(remove_stop_words)
df['transformed_question2'] = df['question2'].apply(remove_stop_words)

# Same overlap features, now computed on the stop-word-free text
# (previously these re-used the raw questions and duplicated word_match / word_share)
transformed_cols = ('transformed_question1', 'transformed_question2')
df['word_common_transformed_question'] = df.apply(word_match, axis=1, args=transformed_cols)
df['word_share_transformed_question'] = df.apply(word_share, axis=1, args=transformed_cols)

# Absolute difference in word count between the two transformed questions
df['transformed_question_diff'] = abs(df['transformed_question1'].apply(lambda text: len(text.split(" ")))
                                      - df['transformed_question2'].apply(lambda text: len(text.split(" "))))

pd.options.display.max_colwidth = 100

df.head()
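
To build intuition for the `word_share` feature, here is a standalone re-implementation that takes two strings instead of a dataframe row, applied to a few made-up question pairs. The example questions are assumptions chosen for illustration.

```python
def word_set(text):
    # Same tokenization as the feature: split on spaces, lowercase, strip
    return {word.lower().strip() for word in text.split(" ")}

def word_share(q1, q2):
    # Shared words divided by the combined size of both word sets
    w1, w2 = word_set(q1), word_set(q2)
    return len(w1 & w2) / (len(w1) + len(w2))

identical = word_share("What is AI?", "What is AI?")   # 3 shared / (3 + 3)
partial = word_share("What is AI?", "What is ML?")     # 2 shared / (3 + 3)
disjoint = word_share("Hello there", "Good morning")   # 0 shared / (2 + 2)

print(identical, partial, disjoint)
```

Because the denominator adds the sizes of both word sets, the feature tops out at 0.5 for identical questions rather than 1.0. Punctuation also stays attached to tokens, so "AI?" and "AI" would count as different words.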