# Text Data Cleaning and EDA Overview
This notebook documents the process of cleaning and exploring the Quora question pairs dataset. The goal is to identify and address common text issues to prepare the data for downstream NLP tasks. Each step below highlights a specific problem, the rationale for addressing it, and the solution applied.

In [None]:
import pandas as pd
import re
import contractions
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk
from tqdm.notebook import tqdm
import unicodedata

In [None]:
# Set display options for pandas
pd.set_option('display.max_colwidth',1000)
pd.set_option('display.max_rows', 100)

# Download necessary NLTK resources
nltk.download('stopwords')

# Set up tqdm for pandas
tqdm.pandas()

In [3]:
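df = pd.read_csv('quora.csv')
# Cast to str so the regex and string operations below never receive NaN.
# Note: missing values become the literal string 'nan' (checked in Step 6).
df['question1'] = df['question1'].astype(str)
df['question2'] = df['question2'].astype(str)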

Now we will scan the dataframe to see what needs cleaning and what doesn't.

## Step 1: Detecting and Handling HTML Tags
Text data scraped from the web often contains HTML tags, which can interfere with NLP models. Here, we identify questions containing HTML and replace tags with descriptive tokens to preserve information while removing markup.

In [4]:
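# Problem: HTML tags
# Detect with a regex: '<', an optional '/', a letter, then any characters up to the next '>'.
# Note: this also flags code-like text such as arrayList<Integer> or #include<stdio.h>.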
html_pattern = re.compile(r'</?[a-zA-Z][^>]*>')

# Check for HTML in both columns
q1_html = df[df['question1'].astype(str).apply(lambda x: bool(html_pattern.search(x)))]
q2_html = df[df['question2'].astype(str).apply(lambda x: bool(html_pattern.search(x)))]

# Display summary
print(f"Total questions with HTML tags in question1: {len(q1_html)}")
print(f"Total questions with HTML tags in question2: {len(q2_html)}")

# Concatenate for side-by-side view (not preserving pairs)
html_examples = pd.concat([q1_html['question1'].reset_index(drop=True), q2_html['question2'].reset_index(drop=True)], axis=1)

# Show top 25 rows
html_examples[['question1', 'question2']].head(25)


Total questions with HTML tags in question1: 32
Total questions with HTML tags in question2: 24


Unnamed: 0,question1,question2
0,“><img src=x onerror=prompt(1)>,"""><img src=x onerror=prompt(0) >?"
1,Is this the correct way to implement a stack using an arrayList<Integer> and a queue as an arrayList<Integer> in Java?,“><img src=x onerror=prompt(1)>
2,What is the function of <head> <title>Page Title</title> </head> in HTML?,What are the pros and <bold>cons</bold> of demonetizing Rs.500 and Rs.1000 notes?
3,How do you make one <div> layer show over another in HTML/CSS?,What is #include<conio.h>?
4,Avg antivirus 1800</v\>251<’-‘>4919 Avg tech support phone number 24x7?,"What do T1,T2,FLAIR, <D>, and FA tell you about someone's brain from MRI?"
5,"How can one define a constant <img> src URL in HTML, so that it can be reused in multiple attributes?","<h1 style=""color: blue;"">HTML injection</h1>?"
6,What is #include<stdio.h>?,What is the h for in #include <stdio.h>?
7,"How do I open a windows file/folders in local drive using html ""<a href>"" tag?",What is #include<stdio.h>?
8,Avg antivirus 1800</v\>251<’-‘>4919 Avg tech support phone number 24x7?,What is #include<stdlib.h>?
9,What is #include<stdio.h>?,How do I display a <div> table when a link is clicked? (variable-based)


In [5]:
# Solution: replace each tag with a token built from its name, e.g. <div> -> ' divtag '
def tag_to_token(text):
    return re.sub(r'</?([a-zA-Z]+)[^>]*>', r' \1tag ', text)

df['q1_html'] = df['question1'].astype(str).apply(tag_to_token)
df['q2_html'] = df['question2'].astype(str).apply(tag_to_token)

# Check results
q1_html = df[df['q1_html'].astype(str).apply(lambda x: bool(html_pattern.search(x)))]
q2_html = df[df['q2_html'].astype(str).apply(lambda x: bool(html_pattern.search(x)))]
print(f"Total questions with HTML tags in question1: {len(q1_html)}")
print(f"Total questions with HTML tags in question2: {len(q2_html)}")
html_examples = pd.concat([q1_html['q1_html'].reset_index(drop=True), q2_html['q2_html'].reset_index(drop=True)], axis=1)
html_examples[['q1_html', 'q2_html']].head(25)

Total questions with HTML tags in question1: 0
Total questions with HTML tags in question2: 0


Unnamed: 0,q1_html,q2_html
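
To make the effect of the substitution concrete, the sketch below applies the same regular expression used in `tag_to_token` to strings taken from the table above. It also shows that code-like text such as `arrayList<Integer>` or `#include<stdio.h>` is treated as a tag. The `KNOWN_TAGS` whitelist, `TAG_RE`, and the `known_tag_to_token` helper are hypothetical additions, not part of the original notebook; they illustrate one possible way to limit the replacement to real HTML element names.

```python
import re

def tag_to_token(text):
    # Same substitution as in the cell above: <name ...> or </name> -> ' nametag '
    return re.sub(r'</?([a-zA-Z]+)[^>]*>', r' \1tag ', text)

# Hypothetical whitelist of HTML element names (an assumption; extend as needed)
KNOWN_TAGS = {'a', 'b', 'div', 'h1', 'head', 'img', 'p', 'span', 'title'}

# Tag names may contain digits (e.g. h1), so the name group allows them here
TAG_RE = re.compile(r'</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>')

def known_tag_to_token(text):
    def replace(match):
        name = match.group(1).lower()
        # Leave anything that is not a known HTML element untouched
        return f' {name}tag ' if name in KNOWN_TAGS else match.group(0)
    return TAG_RE.sub(replace, text)

samples = [
    'How do you make one <div> layer show over another in HTML/CSS?',
    'What is #include<stdio.h>?',
]
for s in samples:
    print(repr(tag_to_token(s)))
    print(repr(known_tag_to_token(s)))
```

With the whitelist, the `<div>` question is tokenized as before, while the `#include<stdio.h>` question is returned unchanged. Whether that is preferable depends on how much code-like text the downstream model should see.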


## Step 2: Expanding Contractions and Removing Possessives
Contractions and possessives can introduce inconsistencies in text analysis. This step expands contractions (e.g., "can't" to "cannot") and removes possessive forms.

In [6]:
# Problem: contractions, possessives, apostrophes in general

# Filter rows that contain common contractions
contractions_pattern = r"\b(?:I'm|you're|he's|she's|it's|we're|they're|I've|you've|they've|I'd|you'd|we'd|I'll|you'll|won't|can't|n't|'re|'ve|'ll|'d|'s)\b"
contractions_q1 = df[df['question1'].str.contains(contractions_pattern, case=False, na=False)]
contractions_q2 = df[df['question2'].str.contains(contractions_pattern, case=False, na=False)]

# Display summary
print(f"Total questions with contractions in question1: {len(contractions_q1)}")
print(f"Total questions with contractions in question2: {len(contractions_q2)}")

# Concatenate for side-by-side view (not preserving pairs)
contraction_examples = pd.concat([contractions_q1['question1'].reset_index(drop=True), contractions_q2['question2'].reset_index(drop=True)], axis=1)

# Show top 25 rows
contraction_examples[['question1', 'question2']].head(25)

Total questions with contractions in question1: 25517
Total questions with contractions in question2: 25326


Unnamed: 0,question1,question2
0,What's causing someone to be jealous?,"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"
1,What's one thing you would like to do better?,What's one thing you do despite knowing better?
2,I'm a 19-year-old. How can I improve my skills or what should I do to become an entrepreneur in the next few years?,What is the most delicious dish you've ever eaten and why?
3,What is the best/most memorable thing you've ever eaten and why?,Why can't I do my homework?
4,I was suddenly logged off Gmail. I can't remember my Gmail password and just realized the recovery email is no longer alive. What can I do?,Do you apply for programs like RSI when you're a rising senior?
5,How do I log out of my Gmail account on my friend's phone?,I can't remember my Gmail password or my recovery email. How can I recover my e-mail?
6,When will the BJP government strip all the Muslims and the Christians of the Indian citizenship and put them on boats like the Rohingya's of Burma?,What's the purpose of life? What is life actually about?
7,What's the difference between love and pity?,What's the difference between honest and sincere?
8,What is the most creative college admissions essay you've read?,What's life after retirement?
9,What's the best way to start learning robotics?,Has any college admission officer cried in public while reading an applicant's essay?


In [7]:
# Solution: Expand contractions

def expand_contractions(text):
    return contractions.fix(text)

# Apply to both columns
df['q1_expanded'] = df['question1'].apply(expand_contractions)
df['q2_expanded'] = df['question2'].apply(expand_contractions)

# Optionally remove possessives completely
def remove_possesives(text):
    return re.sub(r"(?i)\b's\b", "", text)

df['q1_expanded'] = df['q1_expanded'].apply(remove_possesives)
df['q2_expanded'] = df['q2_expanded'].apply(remove_possesives)

# Check results
contractions_q1 = df[df['q1_expanded'].str.contains(contractions_pattern, case=False, na=False)]
contractions_q2 = df[df['q2_expanded'].str.contains(contractions_pattern, case=False, na=False)]
print(f"Total questions with contractions in question1: {len(contractions_q1)}")
print(f"Total questions with contractions in question2: {len(contractions_q2)}")
contraction_examples = pd.concat([contractions_q1['q1_expanded'].reset_index(drop=True), contractions_q2['q2_expanded'].reset_index(drop=True)], axis=1)
contraction_examples[['q1_expanded', 'q2_expanded']].head(25)


Total questions with contractions in question1: 5
Total questions with contractions in question2: 9


Unnamed: 0,q1_expanded,q2_expanded
0,Can Adam'D Angelo be banned on Quora?,I am 15 and 5'll.So starting bodybuilding at this age a good idea?
1,How can I catch a Farfetch'd in Pokémon Go while visiting in China?,WI'll it be easy to get job in private sector after 2.5 years sitting jobless?
2,"How do I identify the original Skullcandy earphones, Ink'd 2.0?","In Punk'd, who is the punk?"
3,"I am looking to switch from android to IOS. Give me some reasons, suggestions and insight into why I should/n't? Always used android.",İ've heard that smoking 3 cigarettes a day does not harm health. Is it true?
4,What would have Theodore Hertzl'd proposed seven starred flag from his “the Jewish state” actually look like?,I am a 5 letter word. I am normally below you If you remove my 1st letter you'll find me above you If you remove my 1st & 2nd letters you cannot see me Answer is really very interesting Let us see who solves this.... ⏰Time limit :- today YOU can also send to other grps if I?
5,,"Why cannot rich black people create their're own NBA, NFL team what is stopping them, nothing stop them in the past with the negro leagues?"
6,,"If you only have 1099 salespeople and no W2'd employees, can you decide to turn some into W2 employees and keep some as 1099s? Can the ones who stay as 1099s do anything legally?"
7,,Would you second guess your interactions on Quora if the site was SEO optimized and caused your answers to appear prominently when someone Google'd your name? Why?
8,,"I go kind of blank when someone asks me, ""what is up"" or ""how've you been?"" How can one respond to these genuinely if his life is going just normal and there is nothing special to tell about?"
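
The possessive rule is easiest to judge on small inputs. The sketch below reuses the regular expression from `remove_possesives` on three strings: two from the tables in this notebook and one made-up plural possessive (an illustrative assumption, not from the dataset). It suggests two things worth keeping in mind: the pattern only matches the straight ASCII apostrophe, so a typographic apostrophe as in `clock’s` passes through unchanged unless Unicode normalization runs first, and plural possessives such as `students'` are not affected.

```python
import re

def remove_possesives(text):
    # Same pattern as above: drop a word-final 's (case-insensitive)
    return re.sub(r"(?i)\b's\b", "", text)

samples = [
    "How do I log out of my Gmail account on my friend's phone?",  # straight apostrophe
    "How many times a day do a clock’s hands overlap?",            # typographic apostrophe
    "Where can I find the students' results?",                     # made-up plural possessive
]
for s in samples:
    print(remove_possesives(s))
```

Only the first sample changes (to `friend phone`); the other two are returned as they are.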


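## Step 3: Normalizing Unicode and Stripping Diacritics
Text data may contain accented or non-standard Unicode characters, which can cause issues for tokenization and modeling. This step normalizes text to a standard Unicode form, strips diacritics, and maps typographic quotes and dashes to their ASCII equivalents. Other non-ASCII characters, such as non-Latin scripts, currency symbols, and mathematical symbols, are left in place.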

In [8]:
# Problem: Non-ASCII (e.g., accented or unusual symbols)

# Regex to detect characters outside the basic ASCII range
non_ascii_pattern = re.compile(r'[^\x00-\x7F]')

# Mask for each column
q1_unicode = df[df['question1'].astype(str).apply(lambda x: bool(non_ascii_pattern.search(x)))]
q2_unicode = df[df['question2'].astype(str).apply(lambda x: bool(non_ascii_pattern.search(x)))]

# Display summary
print(f"Total questions with Non-ASCII in question1: {len(q1_unicode)}")
print(f"Total questions with Non-ASCII in question2: {len(q2_unicode)}")

# Concatenate for side-by-side view (not preserving pairs)
unicode_examples = pd.concat([q1_unicode['question1'].reset_index(drop=True), q2_unicode['question2'].reset_index(drop=True)], axis=1)

# Show top 25 rows
unicode_examples[['question1', 'question2']].head(25)


Total questions with Non-ASCII in question1: 5053
Total questions with Non-ASCII in question2: 4559


Unnamed: 0,question1,question2
0,When do you use シ instead of し?,How many times a day do a clock’s hands overlap?
1,What would a Trump presidency mean for current international master’s students on an F1 visa?,How can I ask a question without getting marked as ‘need to improve’?
2,"I got job offer @ Chelmsford-Essex, London with £3764 PM pay-after tax deduction. Pls advice tentative monthly expenses for couple? & saving possible?","What is it like to live in Köln, Germany?"
3,When travelling to a new region is it better to immerse yourself in 1–2 cities or to see as many cities as you can cram in?,What will be the impact of scrapping of ₹500 and ₹1000 rupee notes on the real estate market?
4,Why can’t charged molecules pass through the lipid cell membrane?,How can I best invest ₹5000 over the next 6 months?
5,What’s the best time to have sex?,What jobs are available with a bachelor’s degree in Homeland Security?
6,What would be today’s technology had we never realized the value of binary numbers and harnessed it to produce digital technology?,Emoticons: What does “:/” mean?
7,Whenever its about “her” its a very special feeling. Tried hard to forget her but she's alwys spcl. Should I stop talking to her even as a friend ?,When will the Pokémon series end?
8,What individuals and events in history are a source of pride for São Tomé and Príncipe?,"Can poetry be learned? Or is it a “you have it or not"" sort of thing?"
9,"My parents give me ""advice"" and they get really mad if I dont take it… What am I supposed to do?",Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…


In [None]:
# Solution: normalize Unicode, strip accents, and map typographic punctuation to ASCII

def normalize_unicode(text, remove_accents=True):
    # Normalize to NFKD (decomposed form)
    text = unicodedata.normalize('NFKD', text)

    if remove_accents:
        # Remove diacritics (accents) by dropping combining characters
        text = ''.join([char for char in text if not unicodedata.combining(char)])

    # Re-compose characters (helps keep things standard)
    text = unicodedata.normalize('NFC', text)

    # Replace curly quotes, dashes, etc. with ASCII equivalents
    text = text.replace('“', '"').replace('”', '"')
    text = text.replace("‘", "'").replace("’", "'")
    text = text.replace("–", "-").replace("—", "-")
    text = re.sub(r'\s+', ' ', text)  # for excess whitespace

    return text.strip()

df['q1_unicode'] = df['question1'].apply(normalize_unicode)
df['q2_unicode'] = df['question2'].apply(normalize_unicode)

# Check results
q1_unicode = df[df['q1_unicode'].astype(str).apply(lambda x: bool(non_ascii_pattern.search(x)))]
q2_unicode = df[df['q2_unicode'].astype(str).apply(lambda x: bool(non_ascii_pattern.search(x)))]
print(f"Total questions with Non-ASCII in question1: {len(q1_unicode)}")
print(f"Total questions with Non-ASCII in question2: {len(q2_unicode)}")
unicode_examples = pd.concat([q1_unicode['q1_unicode'].reset_index(drop=True), q2_unicode['q2_unicode'].reset_index(drop=True)], axis=1)
unicode_examples[['q1_unicode', 'q2_unicode']].head(25)


Total questions with Non-ASCII in question1: 974
Total questions with Non-ASCII in question2: 1030


Unnamed: 0,q1_unicode,q2_unicode
0,When do you use シ instead of し?,What will be the impact of scrapping of ₹500 and ₹1000 rupee notes on the real estate market?
1,"I got job offer @ Chelmsford-Essex, London with £3764 PM pay-after tax deduction. Pls advice tentative monthly expenses for couple? & saving possible?",How can I best invest ₹5000 over the next 6 months?
2,Which is the best SEO ‪Company‬ in ‪Delhi‬?,What are your views on demonetization of ₹500 & ₹1000 notes in India?
3,What is the difference between ج and ز?,"Why is Persian word ""ذليل"" meaning 'a Muslim' translated incorrectly on Microsoft Translator? Is it intentional or careless?"
4,How can I prove that (A × B) − (C × D) = (A − C) × B ∪ A × (B − D)?,What are going to be the rammifications of the Indian government's decision affecting invalid ₹500/₹1000 currency notes?
5,How will demonetization of ‎₹1000 and ‎₹500 notes will help curb the rampant black currency in India?,What is the best thing I can buy for 2€ on Amazon?
6,Which are the positive benefits of banning existing ₹500 and ₹1000 currency notes in India?,How can I get Clash Of Clans Gem or Gold for free? ‎
7,"Now after banning of ₹500 & ₹1000 notes, what are the ways in which people can convert their black money into white and how can it be prevented?",What does 분위기 mean?
8,What would be the quadratic equation for the roots 2+√2 and 2-√2?,What is the relationship between e and π?
9,How much effect can banning of ₹500 and ₹1000 notes have on people having black money (won't have they invested their money in areas)?,How will discontinuing ₹500 and ₹1000 notes affect India's economy?
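
The behaviour of the normalization step can be checked on a few strings taken from the tables above. The sketch below isolates the decompose, strip, recompose core of `normalize_unicode` in a helper named `strip_accents` (a name introduced here for illustration only). It also shows a side effect to be aware of: dropping combining characters affects scripts other than Latin. The Japanese character `ジ`, for example, decomposes into `シ` plus a combining voiced mark, so it comes out as `シ`.

```python
import unicodedata

def strip_accents(text):
    # Decompose characters into base characters plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks (accents and similar diacritics)
    without_marks = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose what is left into the standard composed form
    return unicodedata.normalize('NFC', without_marks)

print(strip_accents('São Tomé and Príncipe'))  # Sao Tome and Principe
print(strip_accents('Köln'))                   # Koln
print(strip_accents('₹500'))                   # unchanged: no decomposition exists
print(strip_accents('take it…'))               # the ellipsis becomes three ASCII dots
print(strip_accents('ジ'))                      # シ
```

This is consistent with the counts above: accented Latin text and the ellipsis drop out of the non-ASCII set, while currency symbols and non-Latin scripts remain.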


## Step 4: Handling Currency Symbols
Currency symbols may not be handled well by standard tokenizers. Here we replace common currency symbols with their corresponding ISO 4217 currency codes (e.g., $ to USD).

In [10]:
# Problem: Currency symbols

# Define common currency symbols (you can expand this if needed)
currency_symbols = r'[$€£¥₹₩₽₺฿₫₪₴₦]'

# Check for presence of currency symbols
q1_currency = df[df['question1'].astype(str).str.contains(currency_symbols, regex=True)]
q2_currency = df[df['question2'].astype(str).str.contains(currency_symbols, regex=True)]

# Display summary
print(f"Total questions with currency symbols in question1: {len(q1_currency)}")
print(f"Total questions with currency symbols in question2: {len(q2_currency)}")

# Concatenate for side-by-side view (not preserving pairs)
currency_examples = pd.concat([q1_currency['question1'].reset_index(drop=True), q2_currency['question2'].reset_index(drop=True)], axis=1)

# Show top 25 rows
currency_examples[['question1', 'question2']].head(25)

Total questions with currency symbols in question1: 1425
Total questions with currency symbols in question2: 1499


Unnamed: 0,question1,question2
0,"I got job offer @ Chelmsford-Essex, London with £3764 PM pay-after tax deduction. Pls advice tentative monthly expenses for couple? & saving possible?",What will be the impact of scrapping of ₹500 and ₹1000 rupee notes on the real estate market?
1,What is the easiest way to become a billionaire($)?,How can I best invest ₹5000 over the next 6 months?
2,"What is the best way to invest $500 legally so that I can get tangible profits over a relatively short period of time, say 6 months?",What are your views on demonetization of ₹500 & ₹1000 notes in India?
3,"How comfortably can I live in Washington DC on a $80,000 salary?","Can I live comfortably in DC on $80,000 - $114,000 salary?"
4,"If you make $130,000/year in NYC, what is your take-home bi-weekly payment?","I own a multinational company worth $450 million dollars, and I want to prepare my son to be the future CEO. What should I advise him?"
5,A cab's flat rate is $6 at $0.75 per mile. How many miles could you travel if you only have $20 to spend?,What are going to be the rammifications of the Indian government's decision affecting invalid ₹500/₹1000 currency notes?
6,"Can a family live on $120,000 yearly in New York City?",What is the best thing I can buy for 2€ on Amazon?
7,"Why, or why not, should the minimum wage be raised to $15?","How do I ship 100 kgs of luggage (clothes) from the USA to India, within $150?"
8,I want a real and effective way to make $ 500 per month with the knowledge that I have no money to invest?,"Can a family live comfortable on $150,000 a year in New York City?"
9,"What is the best waterproof and shockproof 15.6"" laptop bag for every day use under $50?","What does ""$"" mean in Java?"


In [11]:
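# Solution: replace symbols with ISO 4217 currency codes.
# Only the five symbols in the map below are converted; other symbols matched by
# `currency_symbols` are left unchanged.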

def convert_currency_symbols(text):
    currency_map = {
        '$': 'USD',
        '€': 'EUR',
        '£': 'GBP',
        '¥': 'JPY',
        '₹': 'INR'
    }
    # Replace each currency symbol with its acronym
    for symbol, code in currency_map.items():
        text = text.replace(symbol, f' {code} ')
    return re.sub(r'\s+', ' ', text).strip()  # Clean up spacing

df['q1_currency'] = df['question1'].apply(convert_currency_symbols)
df['q2_currency'] = df['question2'].apply(convert_currency_symbols)

# Check results
q1_currency = df[df['q1_currency'].astype(str).str.contains(currency_symbols, regex=True)]
q2_currency = df[df['q2_currency'].astype(str).str.contains(currency_symbols, regex=True)]
print(f"Total questions with currency symbols in question1: {len(q1_currency)}")
print(f"Total questions with currency symbols in question2: {len(q2_currency)}")
currency_examples = pd.concat([q1_currency['q1_currency'].reset_index(drop=True), q2_currency['q2_currency'].reset_index(drop=True)], axis=1)
currency_examples[['q1_currency', 'q2_currency']].head(25)

Total questions with currency symbols in question1: 0
Total questions with currency symbols in question2: 0


Unnamed: 0,q1_currency,q2_currency
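
As a variation on the loop above, the replacement can be done in a single regex pass driven by the same mapping. The sketch below is an alternative under stated assumptions, not the notebook's own code: `CURRENCY_MAP`, `CURRENCY_RE`, `convert_currency_single_pass`, and `unmapped_symbols` are names introduced here for illustration. It also lists which symbols from the detection pattern have no code in the map.

```python
import re

# Same five symbols as the map above. Mapping '$' to USD is itself an assumption,
# since '$' is also used for other dollar currencies and inside code snippets.
CURRENCY_MAP = {'$': 'USD', '€': 'EUR', '£': 'GBP', '¥': 'JPY', '₹': 'INR'}
CURRENCY_RE = re.compile('[' + re.escape(''.join(CURRENCY_MAP)) + ']')

def convert_currency_single_pass(text):
    # Replace every mapped symbol in one regex pass, then collapse whitespace
    text = CURRENCY_RE.sub(lambda m: f' {CURRENCY_MAP[m.group(0)]} ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Symbols detected by the `currency_symbols` pattern but absent from the map
DETECTED = '$€£¥₹₩₽₺฿₫₪₴₦'
unmapped_symbols = set(DETECTED) - set(CURRENCY_MAP)

print(convert_currency_single_pass('What is the best thing I can buy for 2€ on Amazon?'))
print(convert_currency_single_pass('Can I live comfortably in DC on $80,000 - $114,000 salary?'))
print(sorted(unmapped_symbols))
```

Eight of the thirteen detected symbols have no entry in the map. The check above reports zero remaining symbols for this dataset, but the map would need extending before reuse on other data.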


## Step 5: Comprehensive Cleaning Functions
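After addressing individual issues, we combine the previous steps into three cleaning functions to suit different modeling needs. The squeaky strategy is the most aggressive: it applies every step above, then lowercases, removes punctuation and stop words, and stems. The light strategy does the same without stemming. The transformer strategy is minimal: it replaces HTML tags, normalizes Unicode, expands contractions, and converts currency symbols, but keeps case, punctuation, stop words, and possessives.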

In [None]:
# The usual NLP preprocessing steps

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def strip_whitespace(text):
  return re.sub(r'\s+', ' ', text).strip()

def remove_punctuation(text):
  return re.sub(r'[^\w\s]', ' ', text)

def remove_stopwords(text):
  tokens = text.split()
  tokens = [word for word in tokens if word not in stop_words]
  return ' '.join(tokens)

def stem_words(text):
  tokens = text.split()
  tokens = [stemmer.stem(word) for word in tokens]  
  return ' '.join(tokens)


In [None]:
# Full cleaning pipelines that combine the steps above

def squeaky_cleaning(text):
  text = tag_to_token(text)
  text = normalize_unicode(text)
  text = expand_contractions(text)
  text = remove_possesives(text)
  text = convert_currency_symbols(text)
  text = text.lower()
  text = remove_punctuation(text)
  text = remove_stopwords(text)
  text = stem_words(text)
  text = strip_whitespace(text)
  return text

def light_cleaning(text):
  text = tag_to_token(text)
  text = normalize_unicode(text)
  text = expand_contractions(text)
  text = remove_possesives(text)
  text = convert_currency_symbols(text)
  text = text.lower()
  text = remove_punctuation(text)
  text = remove_stopwords(text)
  text = strip_whitespace(text)
  return text

def transformer_cleaning(text):
  text = tag_to_token(text)
  text = normalize_unicode(text)
  text = expand_contractions(text)
  text = convert_currency_symbols(text)
  text = strip_whitespace(text)
  return text

df['question1_squeaky'] = df['question1'].astype(str).progress_apply(squeaky_cleaning)
df['question2_squeaky'] = df['question2'].astype(str).progress_apply(squeaky_cleaning)

df['question1_light'] = df['question1'].astype(str).progress_apply(light_cleaning)
df['question2_light'] = df['question2'].astype(str).progress_apply(light_cleaning)

df['question1_transformer'] = df['question1'].astype(str).progress_apply(transformer_cleaning)
df['question2_transformer'] = df['question2'].astype(str).progress_apply(transformer_cleaning)
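
The three strategies differ only in which steps they include, which is easier to see when each strategy is written as an ordered list of steps. The sketch below is one possible refactoring, not the notebook's own code: `make_pipeline` is a hypothetical helper, and `TOY_STOP_WORDS` is a tiny stand-in for the NLTK stop word list so that the example runs without downloads. It also illustrates how a question made only of stop words and punctuation ends up empty after aggressive cleaning.

```python
import re

def strip_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', ' ', text)

# Stand-in for NLTK's English stop words (assumption: illustrative subset only)
TOY_STOP_WORDS = {'what', 'is', 'the', 'of', 'a'}

def remove_stopwords(text):
    return ' '.join(w for w in text.split() if w not in TOY_STOP_WORDS)

def make_pipeline(*steps):
    """Return a function that applies the given steps in order."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

# Simplified analogues of the light and transformer strategies
light_like = make_pipeline(str.lower, remove_punctuation, remove_stopwords, strip_whitespace)
transformer_like = make_pipeline(strip_whitespace)

print(light_like('What is the  purpose of life?'))        # purpose life
print(transformer_like('What is the  purpose of life?'))  # What is the purpose of life?
print(repr(light_like('What is "what is"?')))             # ''
```

Listing the steps explicitly makes it harder for the strategies to drift apart by accident, at the cost of one extra level of indirection.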


Save the cleaned CSV for use in the main notebook.

In [None]:
df = df[['question1', 'question2', 'question1_squeaky', 'question2_squeaky', 'question1_light', 'question2_light', 'question1_transformer', 'question2_transformer', 'is_duplicate']]

df.to_csv('quora_cleaned.csv', index=False)
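
One detail to keep in mind when reloading this file: with its default settings, `pandas.read_csv` parses empty fields and strings such as `nan` as missing values, so cleaned questions that ended up empty come back as `NaN` rather than `''`. The sketch below demonstrates this on a small in-memory frame (the `demo` data is made up for illustration) and shows `keep_default_na=False` as one option for reading the text back literally.

```python
import io
import pandas as pd

# Made-up frame with a normal value, an empty string, and a literal 'nan' string
demo = pd.DataFrame({
    'question1_light': ['purpose life', '', 'nan'],
    'is_duplicate': [0, 1, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default parsing: '' and 'nan' are both read back as missing values
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Literal parsing: the default NA markers are disabled, strings are kept as written
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['question1_light'].isna().sum())
print(literal_read['question1_light'].tolist())
```

Which behaviour is preferable depends on the main notebook: reading literally keeps empty strings as strings, while the default makes missing text easy to find with `isna()`.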

## Step 6: Inspecting Missing and Invalid Data
After cleaning, it is worth checking for rows with missing, empty, or invalid questions.

In [15]:
def missingness(df, col):
  # Look for 'fake' NaNs: string entries that literally say "nan"
  # (produced when missing values are cast with astype(str))
  mask_string_nan = df[col].astype(str).str.lower().eq('nan')

  print(f"'nan' strings in {col}: {mask_string_nan.sum()}")

  # Look for fully empty strings or just whitespace
  mask_empty = df[col].astype(str).str.strip() == ''

  print(f"Empty strings in {col}: {mask_empty.sum()}")

for col in ['question1', 'question2', 'question1_squeaky', 'question2_squeaky', 'question1_light', 'question2_light', 'question1_transformer', 'question2_transformer']:
  missingness(df, col)

'nan' strings in question1: 1
Empty strings in question1: 0
'nan' strings in question2: 2
Empty strings in question2: 0
'nan' strings in question1_squeaky: 1
Empty strings in question1_squeaky: 85
'nan' strings in question2_squeaky: 2
Empty strings in question2_squeaky: 72
'nan' strings in question1_light: 1
Empty strings in question1_light: 85
'nan' strings in question2_light: 2
Empty strings in question2_light: 72
'nan' strings in question1_transformer: 1
Empty strings in question1_transformer: 0
'nan' strings in question2_transformer: 2
Empty strings in question2_transformer: 0


In [16]:
# Flag rows whose raw question is not the literal string 'nan' (originally missing)
mask_q1_valid = ~df['question1'].str.lower().eq('nan')
mask_q2_valid = ~df['question2'].str.lower().eq('nan')

# Keep only rows where both questions are valid
df = df[mask_q1_valid & mask_q2_valid]

In [17]:
empty_q1 = df['question1_squeaky'].astype(str).str.strip() == ''
empty_q2 = df['question2_squeaky'].astype(str).str.strip() == ''

# Combine the two masks
empty_rows = df[empty_q1 | empty_q2]

# Display them
print(f"Total rows with empty cleaned questions: {len(empty_rows)}")
empty_rows[['question1','question1_squeaky', 'question2','question2_squeaky']].head(25)


Total rows with empty cleaned questions: 136


Unnamed: 0,question1,question1_squeaky,question2,question2_squeaky
3306,.,,Why is Cornell's endowment the lowest in the Ivy League?,cornel endow lowest ivi leagu
7120,Is it proper to use a comma after saying thank you?,proper use comma say thank,What is here and not there?,
9581,How can I just be myself?,,How can I not be myself?,
13016,?,,Why should one not work at Google?,one work googl
17486,"I am neither good at studies nor at anything else, what should a loser like me do to transform self?",neither good studi anyth els loser like transform self,If + * + = + then why - * - ! = -?,
19007,What is hirarki?,hirarki,"What is ""what is""?",
20072,How could I solve this?,could solv,…………..,
20794,?,,What is the Gmail tech support help phone number?,gmail tech support help phone number
25228,What?,,What should Indians do if Donald Trump becomes President?,indian donald trump becom presid
26795,What is quoro?,quoro,What is ∀?,


Questions that are too short even before cleaning are also worth flagging, since they may indicate low-quality entries in the raw dataset.

In [18]:
# Helper: flag text with at most one whitespace-separated word (non-strings count as too short)
def is_too_short(text):
    if not isinstance(text, str):
        return True
    return len(text.strip().split()) <= 1

# Apply to both columns
short_q1 = df['question1'].apply(is_too_short)
short_q2 = df['question2'].apply(is_too_short)

# Combine masks
short_rows = df[short_q1 | short_q2]

# Display them
print(f"Total rows with very short questions: {len(short_rows)}")
short_rows[['question1', 'question2', 'is_duplicate']]


Total rows with very short questions: 91


Unnamed: 0,question1,question2,is_duplicate
3306,.,Why is Cornell's endowment the lowest in the Ivy League?,0
13016,?,Why should one not work at Google?,0
17682,deleted,Which website will be suitable for downloading eBooks and lectures?,0
20072,How could I solve this?,…………..,0
20794,?,What is the Gmail tech support help phone number?,0
23305,deleted,Which are some best websites for downloading newly published books/eBooks?,0
23884,HH,What is hh?,0
25228,What?,What should Indians do if Donald Trump becomes President?,0
25315,deleted,What kind of questions on Quora aren't OK? What is Quora's policy on question deletion?,0
39769,deleted,What is a website where I can download eBooks legally?,0


Note that the cases identified above (questions that are empty after cleaning or very short in the raw data) were intentionally kept rather than discarded, because it is worth exploring how our models handle them.

# Next Steps
The cleaned data is now ready for feature engineering and model development for duplicate question detection.