We import the necessary libraries: nltk for text preprocessing (word_tokenize, stopwords) and sklearn for TF-IDF vectorization (TfidfVectorizer) and cosine similarity computation (cosine_similarity).

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

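We download the NLTK resources needed for text preprocessing: punkt for tokenization and stopwords for stop word removal. Depending on the installed NLTK version, word_tokenize may also ask for the punkt_tab resource, which can be downloaded the same way.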

In [2]:
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nikhi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nikhi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

We define a function preprocess_text that lowercases the input text, tokenizes it with NLTK's word_tokenize, removes English stopwords, and joins the remaining tokens back into a single string.

In [3]:
# Function to preprocess text data
def preprocess_text(text):
    # Lowercase the text, then tokenize it into words
    tokens = word_tokenize(text.lower())
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    # Return preprocessed text as a single string
    return ' '.join(filtered_tokens)
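
To make the three preprocessing steps easier to follow, here is a minimal sketch that needs no NLTK downloads. It is not the function above: demo_preprocess and DEMO_STOP_WORDS are hypothetical names, the stop-word set is a deliberately tiny stand-in for NLTK's English list, and a regular expression replaces word_tokenize, so punctuation is dropped rather than kept as separate tokens.

```python
import re

# Hypothetical, deliberately tiny stop-word set (assumption for illustration);
# NLTK's English stop-word list is much longer.
DEMO_STOP_WORDS = {"is", "a", "of", "the", "and", "with", "between"}

def demo_preprocess(text):
    # Step 1: lowercase. Step 2: tokenize by keeping runs of letters and digits,
    # which also discards punctuation (unlike word_tokenize).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Step 3: drop stop words, then rejoin into a single string.
    return " ".join(token for token in tokens if token not in DEMO_STOP_WORDS)

print(demo_preprocess("Natural language processing is a subfield of artificial intelligence."))
# natural language processing subfield artificial intelligence
```

Returning a single string rather than a token list matters for the next step, because TfidfVectorizer expects raw strings and applies its own tokenization.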

We define a function calculate_cosine_similarity that takes two text documents as input and preprocesses each with preprocess_text. It then fits a TfidfVectorizer on the two preprocessed documents and returns the cosine similarity between their TF-IDF vectors, computed with cosine_similarity from sklearn.

In [4]:
# Function to calculate cosine similarity between two text documents
def calculate_cosine_similarity(text1, text2):
    # Preprocess text
    processed_text1 = preprocess_text(text1)
    processed_text2 = preprocess_text(text2)
    
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    
    # Fit and transform text data into TF-IDF vectors
    tfidf_matrix = vectorizer.fit_transform([processed_text1, processed_text2])
    
    # Calculate cosine similarity between the TF-IDF vectors
    cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
    
    return cosine_sim
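
To show what the vectorizer and the similarity call compute, the following is a rough standard-library sketch of the same arithmetic. It assumes scikit-learn's default TF-IDF settings as we understand them from its documentation (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, L2 normalization); tfidf_vectors and cosine are hypothetical helper names, and the token lists are written out by hand as the assumed result of preprocessing the two example sentences used in the next cell.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of non-empty token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term count times smoothed IDF (assumed to match scikit-learn's defaults).
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    # For unit-length vectors the dot product equals the cosine similarity.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hand-written token lists (assumption): the example sentences after
# lowercasing, stop-word removal, and dropping punctuation.
doc1 = "natural language processing subfield artificial intelligence".split()
doc2 = "nlp deals interaction computers humans using natural language".split()

v1, v2 = tfidf_vectors([doc1, doc2])
score = cosine(v1, v2)
print(round(score, 4))
# 0.1708
```

Because the IDF is fitted on only the two documents being compared, terms that appear in both get the lowest possible weight, so the score depends on which pair is vectorized together.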

In [5]:
# Example usage
if __name__ == "__main__":
    # Define two text documents
    text1 = "Natural language processing is a subfield of artificial intelligence."
    text2 = "NLP deals with the interaction between computers and humans using natural language."

    # Calculate cosine similarity between the two text documents
    similarity_score = calculate_cosine_similarity(text1, text2)
    
    # Print the cosine similarity score
    print("Cosine Similarity Score:", similarity_score)

Cosine Similarity Score: 0.17077611319011654
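
To see why the score is this low, it can help to look at which terms the two documents actually share. The sketch below uses hand-written token sets (an assumption, not output captured from the notebook) standing in for the preprocessed sentences.

```python
# Assumed preprocessing results for the two example sentences, written by hand.
doc1_terms = {"natural", "language", "processing", "subfield", "artificial", "intelligence"}
doc2_terms = {"nlp", "deals", "interaction", "computers", "humans", "using", "natural", "language"}

# Only terms present in both documents contribute to the cosine similarity.
shared = sorted(doc1_terms & doc2_terms)
print(shared)
# ['language', 'natural']
```

Only two of the twelve distinct terms overlap, and TF-IDF treats "nlp" and "natural language processing" as unrelated tokens, so the similarity reflects exact word overlap rather than meaning.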
