
In [1]:
import pandas as pd
import numpy as np
import re

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## STEP 1: Dataset Description

We created a dataset of 20 sentence pairs (40 documents).

Each pair contains sentences with:
- Similar meaning
- Paraphrasing
- Synonyms
- Related context

The dataset helps evaluate lexical and semantic similarity.

It allows comparison of different similarity techniques
on a small set of short sentences.


In [3]:
data = {
    "Sentence1": [
        "I love machine learning",
        "The doctor treated the patient",
        "She likes reading books",
        "He is playing football",
        "Artificial intelligence is powerful",
        "The cat is sleeping",
        "He bought a new car",
        "Students are studying hard",
        "She cooked delicious food",
        "Weather is very hot today",

        "He is learning data science",
        "She enjoys painting pictures",
        "The teacher explained the lesson",
        "Birds are flying in the sky",
        "He repaired the broken phone",
        "The baby is crying loudly",
        "She wrote a beautiful poem",
        "They are watching a movie",
        "The farmer grows crops",
        "He is jogging every morning"
    ],

    "Sentence2": [
        "I enjoy studying artificial intelligence",
        "The physician helped the sick person",
        "She enjoys novels",
        "He plays soccer",
        "AI is very strong",
        "The kitten is resting",
        "He purchased a vehicle",
        "Learners are working seriously",
        "She prepared tasty meals",
        "It is extremely warm today",

        "He studies machine learning",
        "She loves drawing artwork",
        "The professor taught the topic",
        "Birds are soaring above",
        "He fixed his mobile device",
        "The infant is weeping loudly",
        "She composed a nice poem",
        "They are viewing a film",
        "The farmer cultivates plants",
        "He runs daily in the morning"
    ]
}

df = pd.DataFrame(data)

df


Unnamed: 0,Sentence1,Sentence2
0,I love machine learning,I enjoy studying artificial intelligence
1,The doctor treated the patient,The physician helped the sick person
2,She likes reading books,She enjoys novels
3,He is playing football,He plays soccer
4,Artificial intelligence is powerful,AI is very strong
5,The cat is sleeping,The kitten is resting
6,He bought a new car,He purchased a vehicle
7,Students are studying hard,Learners are working seriously
8,She cooked delicious food,She prepared tasty meals
9,Weather is very hot today,It is extremely warm today


## STEP 2: Text Preprocessing

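Steps:

1. Lowercase:
   Converts text to lowercase so that "The" and "the" match.

2. Remove punctuation and numbers:
   Removes every character that is not a letter or whitespace.

3. Tokenization:
   Splits sentences into words.

4. Remove stopwords:
   Removes common words like "is", "the", "and".

5. Lemmatization:
   Converts words to their base form (books → book).
   `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed,
   so verb forms such as "playing" are left unchanged here.

These steps remove tokens that carry little meaning, so the similarity scores depend on content words.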


In [4]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase
    text = text.lower()

    # Remove punctuation & numbers
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords and lemmatize
    cleaned = []
    for word in tokens:
        if word not in stop_words:
            word = lemmatizer.lemmatize(word)
            cleaned.append(word)

    return " ".join(cleaned)


In [7]:
import nltk
nltk.download('punkt_tab', quiet=True)

df["Clean1"] = df["Sentence1"].apply(preprocess)
df["Clean2"] = df["Sentence2"].apply(preprocess)

df

Unnamed: 0,Sentence1,Sentence2,Clean1,Clean2
0,I love machine learning,I enjoy studying artificial intelligence,love machine learning,enjoy studying artificial intelligence
1,The doctor treated the patient,The physician helped the sick person,doctor treated patient,physician helped sick person
2,She likes reading books,She enjoys novels,like reading book,enjoys novel
3,He is playing football,He plays soccer,playing football,play soccer
4,Artificial intelligence is powerful,AI is very strong,artificial intelligence powerful,ai strong
5,The cat is sleeping,The kitten is resting,cat sleeping,kitten resting
6,He bought a new car,He purchased a vehicle,bought new car,purchased vehicle
7,Students are studying hard,Learners are working seriously,student studying hard,learner working seriously
8,She cooked delicious food,She prepared tasty meals,cooked delicious food,prepared tasty meal
9,Weather is very hot today,It is extremely warm today,weather hot today,extremely warm today


In [8]:
all_text = list(df["Clean1"]) + list(df["Clean2"])

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(all_text)

tfidf_matrix.shape


(40, 98)

## STEP 3: Cosine Similarity

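Cosine similarity measures the cosine of the angle between two vectors.

Value range: 0 to 1 for TF-IDF vectors, because their entries are never negative.

- 1 → Same direction (same weighted vocabulary)
- 0 → No shared terms

Higher score = more shared vocabulary. Two sentences with the same meaning but no words in common still score 0.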
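
To make the definition concrete, here is a minimal sketch that computes cosine similarity by hand. The vectors are hypothetical raw term counts (not TF-IDF weights), so the values differ from the scores in the results table; the helper name `cosine` is also an assumption used only for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # An all-zero vector has no direction; treat it as "no similarity"
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over the vocabulary
# ["weather", "hot", "today", "extremely", "warm"]
doc_a = [1, 1, 1, 0, 0]
doc_b = [0, 0, 1, 1, 1]
doc_c = [0, 0, 0, 1, 1]

print(cosine(doc_a, doc_b))  # one shared term ("today")
print(cosine(doc_a, doc_c))  # no shared terms
```

With raw counts, one shared term out of three per sentence gives 1/3, and sentences with no shared terms give 0 regardless of meaning.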


In [9]:
n = len(df)

cosine_scores = []

for i in range(n):
    v1 = tfidf_matrix[i]
    v2 = tfidf_matrix[i+n]

    score = cosine_similarity(v1, v2)[0][0]
    cosine_scores.append(score)

df["Cosine_Similarity"] = cosine_scores
df


Unnamed: 0,Sentence1,Sentence2,Clean1,Clean2,Cosine_Similarity
0,I love machine learning,I enjoy studying artificial intelligence,love machine learning,enjoy studying artificial intelligence,0.0
1,The doctor treated the patient,The physician helped the sick person,doctor treated patient,physician helped sick person,0.0
2,She likes reading books,She enjoys novels,like reading book,enjoys novel,0.0
3,He is playing football,He plays soccer,playing football,play soccer,0.0
4,Artificial intelligence is powerful,AI is very strong,artificial intelligence powerful,ai strong,0.0
5,The cat is sleeping,The kitten is resting,cat sleeping,kitten resting,0.0
6,He bought a new car,He purchased a vehicle,bought new car,purchased vehicle,0.0
7,Students are studying hard,Learners are working seriously,student studying hard,learner working seriously,0.0
8,She cooked delicious food,She prepared tasty meals,cooked delicious food,prepared tasty meal,0.0
9,Weather is very hot today,It is extremely warm today,weather hot today,extremely warm today,0.287868


## STEP 4: Jaccard Similarity

Jaccard similarity measures word overlap between two sets of tokens.

Formula:
Common words / Total unique words across both sentences

It counts only exact word matches, so synonyms contribute nothing.
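
As a worked example, the sketch below applies the formula to the cleaned text of pair 9 ("weather hot today" vs "extremely warm today"). The helper name `jaccard` is an assumption for illustration; it is a standalone version of the same idea, not the notebook's own function.

```python
def jaccard(tokens_a, tokens_b):
    """Shared unique tokens divided by all unique tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    # Two empty token lists have no words to compare
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

a = "weather hot today".split()
b = "extremely warm today".split()

# Shared: {"today"}; unique across both: 5 words
print(jaccard(a, b))  # 1 / 5 = 0.2
```

"hot" and "warm" are close in meaning, but they count as different words, so only "today" contributes to the score.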


In [10]:
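def jaccard_similarity(s1, s2):
    # Compare the sets of unique tokens in each cleaned sentence
    set1 = set(s1.split())
    set2 = set(s2.split())

    intersection = set1.intersection(set2)
    union = set1.union(set2)

    # Both sentences can be empty after stopword removal; avoid dividing by zero
    if not union:
        return 0.0

    return len(intersection) / len(union)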


jaccard_scores = []

for i in range(n):
    score = jaccard_similarity(df["Clean1"][i], df["Clean2"][i])
    jaccard_scores.append(score)

df["Jaccard_Similarity"] = jaccard_scores
df


Unnamed: 0,Sentence1,Sentence2,Clean1,Clean2,Cosine_Similarity,Jaccard_Similarity
0,I love machine learning,I enjoy studying artificial intelligence,love machine learning,enjoy studying artificial intelligence,0.0,0.0
1,The doctor treated the patient,The physician helped the sick person,doctor treated patient,physician helped sick person,0.0,0.0
2,She likes reading books,She enjoys novels,like reading book,enjoys novel,0.0,0.0
3,He is playing football,He plays soccer,playing football,play soccer,0.0,0.0
4,Artificial intelligence is powerful,AI is very strong,artificial intelligence powerful,ai strong,0.0,0.0
5,The cat is sleeping,The kitten is resting,cat sleeping,kitten resting,0.0,0.0
6,He bought a new car,He purchased a vehicle,bought new car,purchased vehicle,0.0,0.0
7,Students are studying hard,Learners are working seriously,student studying hard,learner working seriously,0.0,0.0
8,She cooked delicious food,She prepared tasty meals,cooked delicious food,prepared tasty meal,0.0,0.0
9,Weather is very hot today,It is extremely warm today,weather hot today,extremely warm today,0.287868,0.2


## STEP 5: WordNet Semantic Similarity

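WordNet groups words into sets of synonyms (synsets) and links them through relations such as hypernymy (a "car" is a kind of "vehicle").

We use Wu-Palmer similarity.

It scores two synsets by how deep their closest shared ancestor sits in the WordNet hierarchy, relative to the depths of the synsets themselves.

This helps capture meaning beyond exact words.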
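
The idea behind Wu-Palmer can be sketched on a toy hierarchy without loading WordNet. Everything here is hypothetical: the `PARENT` taxonomy, the helper names, and the convention that the root has depth 1. It illustrates the textbook formula 2 * depth(lcs) / (depth(a) + depth(b)); NLTK's `wup_similarity` works on real synsets and may differ in detail.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root)
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "truck": "vehicle",
    "animal": "entity",
    "cat": "animal",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Number of nodes on the path from the root to node (root has depth 1)."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest shared ancestor."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a is the deepest one
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("car", "truck"))  # shared ancestor "vehicle": 2*2 / (3+3)
print(wu_palmer("car", "cat"))    # shared ancestor "entity":  2*1 / (3+3)
```

Words whose shared ancestor is deeper in the hierarchy score higher, which is why related words can get a non-zero score even though they never match exactly.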


In [11]:
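def wordnet_similarity(w1, w2):
    syn1 = wordnet.synsets(w1)
    syn2 = wordnet.synsets(w2)

    # Words missing from WordNet cannot be compared
    if not syn1 or not syn2:
        return 0

    # Only the first (most common) sense of each word is compared;
    # wup_similarity can return None, which is mapped to 0
    return syn1[0].wup_similarity(syn2[0]) or 0


def sentence_similarity(s1, s2):
    words1 = s1.split()
    words2 = s2.split()

    # Nothing to compare if either sentence is empty after preprocessing
    if not words1 or not words2:
        return 0.0

    scores = []

    # For each word in s1, keep its best match in s2 (the score is not symmetric)
    for w1 in words1:
        max_sim = 0
        for w2 in words2:
            sim = wordnet_similarity(w1, w2)
            if sim > max_sim:
                max_sim = sim
        scores.append(max_sim)

    return np.mean(scores)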


wordnet_scores = []

for i in range(n):
    score = sentence_similarity(df["Clean1"][i], df["Clean2"][i])
    wordnet_scores.append(score)

df["WordNet_Similarity"] = wordnet_scores
df


Unnamed: 0,Sentence1,Sentence2,Clean1,Clean2,Cosine_Similarity,Jaccard_Similarity,WordNet_Similarity
0,I love machine learning,I enjoy studying artificial intelligence,love machine learning,enjoy studying artificial intelligence,0.0,0.0,0.371503
1,The doctor treated the patient,The physician helped the sick person,doctor treated patient,physician helped sick person,0.0,0.0,0.678571
2,She likes reading books,She enjoys novels,like reading book,enjoys novel,0.0,0.0,0.208689
3,He is playing football,He plays soccer,playing football,play soccer,0.0,0.0,0.771667
4,Artificial intelligence is powerful,AI is very strong,artificial intelligence powerful,ai strong,0.0,0.0,0.422222
5,The cat is sleeping,The kitten is resting,cat sleeping,kitten resting,0.0,0.0,0.375
6,He bought a new car,He purchased a vehicle,bought new car,purchased vehicle,0.0,0.0,0.733333
7,Students are studying hard,Learners are working seriously,student studying hard,learner working seriously,0.0,0.0,0.457516
8,She cooked delicious food,She prepared tasty meals,cooked delicious food,prepared tasty meal,0.0,0.0,0.506536
9,Weather is very hot today,It is extremely warm today,weather hot today,extremely warm today,0.287868,0.2,0.560606


## STEP 6: Comparison of Methods

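1. TF-IDF + Cosine rewards shared vocabulary, weighted by how rare each term is.
2. Jaccard depends on exact word matching.
3. WordNet captures meaning better.

In the rows shown above, cosine and Jaccard both score 0.0 for nine of the ten pairs, because the paraphrases share no words after preprocessing.
Only pair 9, which shares "today", gets a non-zero score (0.288 cosine, 0.2 Jaccard).
WordNet gives every displayed pair a non-zero score, from 0.21 to 0.77, because it can relate synonyms such as "doctor" and "physician".

Scores disagree when:
- Different words have the same meaning.
- Sentences use paraphrasing.

WordNet performs best for semantic similarity on this dataset.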
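
One way to back the comparison with numbers is to summarize each method's scores. The sketch below copies the scores of the ten displayed rows (pairs 0-9) from the results table; the variable names are assumptions chosen so they do not clash with the notebook's own lists, and the remaining ten pairs are not included.

```python
from statistics import mean

# Scores copied from the ten displayed rows (pairs 0-9)
shown_cosine = [0.0] * 9 + [0.287868]
shown_jaccard = [0.0] * 9 + [0.2]
shown_wordnet = [0.371503, 0.678571, 0.208689, 0.771667, 0.422222,
                 0.375, 0.733333, 0.457516, 0.506536, 0.560606]

summary = {
    "cosine": shown_cosine,
    "jaccard": shown_jaccard,
    "wordnet": shown_wordnet,
}

# Report the mean score and how many pairs each method scored above zero
for name, scores in summary.items():
    nonzero = sum(1 for s in scores if s > 0)
    print(f"{name:8s} mean={mean(scores):.3f} nonzero={nonzero}/{len(scores)}")
```

In the notebook itself, the same summary over all 20 pairs could be taken from the `df` score columns instead of hard-coded lists.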


# STEP 7: LAB REPORT

## Objective
To analyze text similarity using cosine, Jaccard, and WordNet methods.

## Dataset Description
We used 20 sentence pairs (40 sentences); the two sentences in each pair are paraphrases or closely related in meaning.

## Preprocessing
Lowercasing, punctuation removal, stopword removal, tokenization, lemmatization.

## Cosine Similarity
TF-IDF vectors were used to compute similarity.

## Jaccard Similarity
Measured word overlap between sentences.

## WordNet Similarity
Used Wu-Palmer similarity for semantic comparison.

## Comparison
WordNet performed best for meaning.
Cosine scored 0 for most displayed pairs because they share no terms.
Jaccard was equally sensitive to vocabulary.

## Conclusion
On paraphrased text, semantic methods recover similarity that lexical methods miss.


## Questions and Answers

1. What is text similarity in NLP?
   A measure of how close two texts are in wording or meaning.

2. Lexical vs Semantic:
   Lexical = word matching
   Semantic = meaning-based

3. Why is cosine popular?
   It works well with high-dimensional sparse vectors and ignores document length.

4. When does Jaccard fail?
   When synonyms are used.

5. How does WordNet help?
   It links words through semantic relationships such as synonymy and hypernymy.

6. What is the effect of preprocessing?
   It reduces noise, so scores depend on content words.

7. Applications:
   - Plagiarism detection
   - Search engines