<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_023_text_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Load Libraries

### 1. **`nltk.corpus.stopwords` (`from nltk.corpus import stopwords`)**
   - **Purpose**: NLTK (Natural Language Toolkit) is a library for working with human language data (text). Its `stopwords` corpus provides per-language lists of common words (like "the", "is", etc.) that are often removed from text during preprocessing in NLP tasks.
   - **Common Uses**:
     - Removing stop words from text to focus on meaningful words.
     - Preprocessing text data for tasks like text classification, sentiment analysis, or keyword extraction.

### 2. **`nltk.download('stopwords')`**
   - **Purpose**: Downloads the NLTK stop-word corpus to the local NLTK data directory. This must run once before `stopwords.words()` can be called.
   - **Common Uses**: Ensures that stop words are available for text preprocessing.

### 3. **`re` (`import re`)**
   - **Purpose**: The `re` module provides support for working with regular expressions in Python. Regular expressions are used for pattern matching, searching, and manipulating strings.
   - **Common Uses**:
     - Searching for patterns within text.
     - Replacing or removing specific patterns (e.g., removing special characters, validating formats like emails, etc.).
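
As a small illustration of the "replacing or removing" use case, the sketch below applies `re.sub` with the same character class the notebook uses later (`[^a-zA-Z]`). The sample string is made up for this example.

```python
import re

# Hypothetical sample string, chosen only for illustration.
sample = "Hello, World! 123"

# Replace every character that is not an ASCII letter with a space.
letters_only = re.sub(r'[^a-zA-Z]', ' ', sample)

# Collapse the resulting runs of spaces and lowercase the text.
normalized = " ".join(letters_only.split()).lower()

print(repr(letters_only))
print(normalized)
```

Note that the substitution keeps the string length unchanged: each removed character leaves a space behind, which is why the second step collapses whitespace.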

### 4. **`TfidfVectorizer` (`from sklearn.feature_extraction.text import TfidfVectorizer`)**
   - **Purpose**: `TfidfVectorizer` from the `scikit-learn` library is used to convert a collection of raw text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. It helps quantify the importance of words in a document relative to a corpus.
   - **Common Uses**:
     - Converting text data into numerical features for machine learning models.
     - Measuring the importance of words in text documents for tasks like text classification or information retrieval.

### 5. **`cosine_similarity` (`from sklearn.metrics.pairwise import cosine_similarity`)**
   - **Purpose**: This function computes the cosine similarity between two sets of vectors. Cosine similarity measures the cosine of the angle between two vectors, making it useful for comparing text documents based on the similarity of their content.
   - **Common Uses**:
     - Finding similarity between text documents (e.g., finding related documents).
     - Clustering or comparing documents in NLP tasks.
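
To make the "cosine of the angle" idea concrete, here is a minimal pure-Python sketch of the calculation. The helper name `cosine` and the three toy vectors are assumptions for illustration; this is not scikit-learn's implementation.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-count vectors over a 3-word vocabulary.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a, twice the length
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

print(cosine(doc_a, doc_b))  # close to 1: identical direction
print(cosine(doc_a, doc_c))  # 0.0: orthogonal vectors
```

Doubling a document's term counts does not change its cosine similarity to other documents, which is why the measure is often preferred for texts of different lengths.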

### 6. **`euclidean_distances` (`from sklearn.metrics.pairwise import euclidean_distances`)**
   - **Purpose**: This function computes the Euclidean distance between two sets of vectors. Euclidean distance is the straight-line distance between two points in multi-dimensional space, often used for measuring the difference between text documents or data points.
   - **Common Uses**:
     - Measuring the distance between documents or data points in machine learning tasks.
     - Comparing how different two text documents or vectors are in a numerical space.
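
The straight-line distance can be sketched in a few lines of pure Python. The helper name `euclidean` and the toy vectors are assumptions for illustration, not scikit-learn's code.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term-count vectors over a 3-word vocabulary.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a, twice the length

print(euclidean(doc_a, doc_a))  # 0.0: a vector is at distance zero from itself
print(euclidean(doc_a, doc_b))  # sqrt(5): non-zero despite the shared direction
```

The two vectors point in the same direction, yet the distance is non-zero: Euclidean distance is sensitive to vector magnitude.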



In [2]:
# !pip install --upgrade nltk

In [3]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
# Sample corpus
documents = ['Machine learning is the study of computer algorithms that improve automatically through experience.\
              Machine learning algorithms build a mathematical model based on sample data, known as training data.\
              The discipline of machine learning employs various approaches to teach computers to accomplish tasks \
              where no fully satisfactory algorithm is available.',
              'Machine learning is closely related to computational statistics, which focuses on making predictions using computers.\
              The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.',
              'Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. \
              It involves computers learning from data provided so that they carry out certain tasks.',
              'Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"\
              or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement',
              'Software engineering is the systematic application of engineering approaches to the development of software.\
              Software engineering is a computing discipline.',
              'A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned\
              about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.\
              Developing a machine learning application is more iterative and explorative process than software engineering.'
]

### Step 1: Load the documents into a DataFrame
The code starts by loading the `documents` list defined above into a pandas DataFrame.

- **What this does**: Creates a DataFrame `documents_df` with a single column named `'documents'`, where each row corresponds to a document from the `documents` list.

In [5]:
documents_df=pd.DataFrame(documents,columns=['documents'])
for d in documents_df.documents:
  print(d)

Machine learning is the study of computer algorithms that improve automatically through experience.              Machine learning algorithms build a mathematical model based on sample data, known as training data.              The discipline of machine learning employs various approaches to teach computers to accomplish tasks               where no fully satisfactory algorithm is available.
Machine learning is closely related to computational statistics, which focuses on making predictions using computers.              The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so.               It involves computers learning from data provided so that they carry out certain tasks.
Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"        

### Step 2: Clean the text by removing special characters and stop words

- **Goal**: This step aims to clean the text by:
  - Removing special characters (anything that is not a letter).
  - Converting text to lowercase.
  - Removing stop words (common words like "the," "is," etc.) based on the stop word list `stop_words_l`.

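- **What needs cleaning**:
  - `re.sub()` runs twice per word, once in the `if` filter and once to produce the output; the cleaned word could be computed once and reused.
  - Punctuation is replaced with a space rather than deleted, so a token such as `"is,"` becomes `"is "` (with a trailing space) and does not match the stop-word list. Stop words attached to punctuation therefore survive this step.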
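
The sketch below shows one possible way to compute the cleaned token once, and how stripping the token changes which stop words are removed. The `strip_tokens` parameter and the small hand-picked `stop_words` set are assumptions made so the example runs without the NLTK corpus; they are not part of the notebook.

```python
import re

# Small hand-picked stop-word set (an assumption for this example;
# the notebook uses stopwords.words('english') instead).
stop_words = {"the", "is", "of", "a"}

def clean_text(doc, strip_tokens=False):
    cleaned = []
    for word in doc.split():
        # Compute the cleaned token once and reuse it.
        token = re.sub(r'[^a-zA-Z]', ' ', word).lower()
        if strip_tokens:
            token = token.strip()  # hypothetical option: drop padding spaces
        if token and token not in stop_words:
            cleaned.append(token)
    return " ".join(cleaned)

text = "The model is, of course, a sketch."
print(repr(clean_text(text)))                     # "is " survives the filter
print(repr(clean_text(text, strip_tokens=True)))  # "is" is removed
```

With `strip_tokens=False` the function mirrors the notebook's behavior. Enabling it changes the cleaned text, so any similarity scores computed downstream would change as well.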

In [6]:
import re
from nltk.corpus import stopwords

# Create a list of English stop words
stop_words_l = stopwords.words('english')

# Function to clean text
def clean_text(doc):
    # Split the document into words, remove non-alphabetical characters, convert to lowercase, and remove stop words
    return " ".join(
        re.sub(r'[^a-zA-Z]', ' ', word).lower()
        for word in doc.split()
        if re.sub(r'[^a-zA-Z]', ' ', word).lower() not in stop_words_l
    )

# Apply the clean_text function to the documents
documents_df['documents_cleaned'] = documents_df['documents'].apply(clean_text)

In [7]:
stop_words_l = stopwords.words('english')

documents_df['documents_cleaned'] = documents_df.documents.apply(
    lambda x: " ".join(
        re.sub(r'[^a-zA-Z]', ' ', w).lower()
        for w in x.split()
        if re.sub(r'[^a-zA-Z]', ' ', w).lower() not in stop_words_l
    )
)

### Step 3: TF-IDF Vectorization

- **Explanation**:
  - `TfidfVectorizer()` is used to convert the cleaned documents into **TF-IDF vectors**.
  - The `fit()` method learns the vocabulary from the cleaned documents.
  - The `transform()` method converts the cleaned documents into a sparse matrix of TF-IDF scores (one vector per document).

- **What this does**: The `tfidf_vectors` matrix represents the term frequency-inverse document frequency (TF-IDF) score of each word in each document, allowing for further analysis like similarity or distance calculations.
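
To show the arithmetic behind a TF-IDF row, here is a hand-rolled sketch on a toy corpus. It assumes the smoothed IDF formula documented for scikit-learn's default settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, and it uses plain whitespace tokenization. It illustrates the idea only and is not `TfidfVectorizer`'s actual code; all names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = ["machine learning data", "software engineering", "machine software"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency (assumed to match sklearn's default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # L2-normalized row

vectors = [tfidf_vector(doc) for doc in tokenized]
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:12s} {weight:.3f}")
```

In the first toy document, "data" appears in only one document and receives a higher weight than "machine", which appears in two.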

In [8]:
tfidfvectoriser = TfidfVectorizer()
tfidfvectoriser.fit(documents_df['documents_cleaned'])
tfidf_vectors = tfidfvectoriser.transform(documents_df['documents_cleaned'])

### Step 4: Calculate Pairwise Similarities and Differences

- **Cosine Similarity**:
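  - `np.dot(tfidf_vectors, tfidf_vectors.T).toarray()` computes the dot product between every pair of TF-IDF vectors. Because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), this dot product equals the **cosine similarity**, giving a similarity matrix where each entry is the cosine similarity between two documents. Calling `cosine_similarity(tfidf_vectors)` yields the same matrix without relying on that default.
  - Cosine similarity measures the angle between two vectors, capturing how similar the content of two documents is while ignoring vector magnitude.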

- **Euclidean Distance**:
  - `euclidean_distances(tfidf_vectors)` computes the Euclidean distance between each pair of TF-IDF vectors. Euclidean distance measures the straight-line distance between two vectors (documents).

- **Result**: You now have two matrices:
  - `pairwise_similarities`: Contains cosine similarity scores between documents.
  - `pairwise_differences`: Contains Euclidean distances between documents.
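
The two matrices are closely related here. Assuming each TF-IDF row has unit length (the `TfidfVectorizer` default, `norm='l2'`), expanding the squared distance between two rows $a$ and $b$ gives:

$$
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a \cdot b = 2 - 2\cos\theta
\quad \text{when } \|a\| = \|b\| = 1
$$

Under that assumption the Euclidean distance is $\sqrt{2 - 2\cos\theta}$, so ranking by ascending distance and ranking by descending cosine similarity produce the same order. For example, a cosine similarity of 0.228 corresponds to a distance of about 1.24. This equivalence does not hold for vectors that are not unit length.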

In [9]:
pairwise_similarities = np.dot(tfidf_vectors, tfidf_vectors.T).toarray()
pairwise_differences = euclidean_distances(tfidf_vectors)

### Step 5: Function to Find the Most Similar Documents


- **Explanation**:
  - This function prints out the document with the given `doc_id` and finds other documents that are similar based on the provided similarity or distance matrix.
  - It sorts the documents either by **cosine similarity** (descending order, higher is better) or **Euclidean distance** (ascending order, lower is better).
  - It then prints the most similar documents based on the metric.
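
The sorting logic can be sketched independently of the printing. The helper `rank_neighbours` and the two score rows below are hypothetical and exist only to illustrate the two sort directions.

```python
def rank_neighbours(scores, doc_id, higher_is_better):
    """Return the indices of the other documents, best match first."""
    others = [i for i in range(len(scores)) if i != doc_id]
    return sorted(others, key=lambda i: scores[i], reverse=higher_is_better)

# Hypothetical row 0 of a cosine-similarity matrix and of a distance matrix.
similarity_row = [1.0, 0.21, 0.23, 0.05]
distance_row = [0.0, 1.26, 1.24, 1.38]

print(rank_neighbours(similarity_row, 0, higher_is_better=True))
print(rank_neighbours(distance_row, 0, higher_is_better=False))
```

Both calls rank document 2 first: it has the highest similarity in one row and the lowest distance in the other.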

In [10]:
def most_similar(doc_id, similarity_matrix, matrix_type):
    print(f'Document: {documents_df.iloc[doc_id]["documents"]}')
    print('\nSimilar Documents:')

    # Determine the sorting direction based on similarity or distance
    if matrix_type == 'Cosine Similarity':
        similar_ix = np.argsort(similarity_matrix[doc_id])[::-1]  # Sort in descending order (higher similarity is better)
    elif matrix_type == 'Euclidean Distance':
        similar_ix = np.argsort(similarity_matrix[doc_id])  # Sort in ascending order (lower distance is better)
    else:
        # Fail early instead of hitting an undefined similar_ix below
        raise ValueError(f'Unknown matrix_type: {matrix_type!r}')

    # Print similar documents
    for ix in similar_ix:
        if ix == doc_id:  # Skip the document itself
            continue
        print('\n')
        print(f'Document: {documents_df.iloc[ix]["documents"]}')
        print(f'{matrix_type}: {similarity_matrix[doc_id][ix]}')



### Step 6: Testing the Similarity Function

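- **What this does**:
  - The function `most_similar(0, ...)` is called with document ID `0` to find and print the documents most similar to the first document.
  - It does this twice: once using cosine similarity and once using Euclidean distance.
  - **Note on the outputs shown**: the execution counts (`In [16]`, `In [17]`) indicate that these cells were last run after the Sentence Transformers cell (`In [15]`), which reassigns `pairwise_similarities` and `pairwise_differences`. The scores printed for these two cells (0.826 and 9.65) therefore come from the SBERT embeddings, not from TF-IDF. A distance of 9.65 is not possible between L2-normalized TF-IDF vectors, whose distance is at most √2. The TF-IDF scores appear in the output of the Full Code cell.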

### Explanation Summary:
1. **Text Cleaning**: The code replaces non-alphabetic characters with spaces, lowercases each word, and drops words found in the stop-word list.
2. **TF-IDF Vectorization**: It converts the cleaned text into numerical vectors representing the importance of words in the document using TF-IDF.
3. **Similarity/Difference Calculation**: It calculates pairwise similarities (cosine similarity) and differences (Euclidean distance) between documents.
4. **Most Similar Documents**: It prints documents most similar to a selected document based on cosine similarity or Euclidean distance.



In [16]:
most_similar(0, pairwise_similarities, 'Cosine Similarity')

Document: Machine learning is the study of computer algorithms that improve automatically through experience.              Machine learning algorithms build a mathematical model based on sample data, known as training data.              The discipline of machine learning employs various approaches to teach computers to accomplish tasks               where no fully satisfactory algorithm is available.

Similar Documents:


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.              The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity: 0.8262172937393188


Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned              about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and var

In [17]:
most_similar(0, pairwise_differences, 'Euclidean Distance')

Document: Machine learning is the study of computer algorithms that improve automatically through experience.              Machine learning algorithms build a mathematical model based on sample data, known as training data.              The discipline of machine learning employs various approaches to teach computers to accomplish tasks               where no fully satisfactory algorithm is available.

Similar Documents:


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.              The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Euclidean Distance: 9.6488618850708


Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned              about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and varia

## Full Code

In [12]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from nltk.corpus import stopwords

# Load documents into a DataFrame
documents_df = pd.DataFrame(documents, columns=['documents'])

# Create a list of English stop words
stop_words_l = stopwords.words('english')

# Function to clean text by removing special characters and stop words
def clean_text(doc):
    return " ".join(
        re.sub(r'[^a-zA-Z]', ' ', word).lower()
        for word in doc.split()
        if re.sub(r'[^a-zA-Z]', ' ', word).lower() not in stop_words_l
    )

# Clean the documents
documents_df['documents_cleaned'] = documents_df['documents'].apply(clean_text)

# TF-IDF vectorization
tfidfvectoriser = TfidfVectorizer()
tfidfvectoriser.fit(documents_df['documents_cleaned'])
tfidf_vectors = tfidfvectoriser.transform(documents_df['documents_cleaned'])

# Compute pairwise cosine similarities and Euclidean distances
pairwise_similarities = np.dot(tfidf_vectors, tfidf_vectors.T).toarray()
pairwise_differences = euclidean_distances(tfidf_vectors)

# Function to print the most similar documents
def most_similar(doc_id, similarity_matrix, matrix_type):
    print(f'Document: {documents_df.iloc[doc_id]["documents"]}')
    print('\nSimilar Documents:')

    if matrix_type == 'Cosine Similarity':
        similar_ix = np.argsort(similarity_matrix[doc_id])[::-1]  # Sort descending for similarity
    elif matrix_type == 'Euclidean Distance':
        similar_ix = np.argsort(similarity_matrix[doc_id])  # Sort ascending for distance
    else:
        # Fail early instead of hitting an undefined similar_ix below
        raise ValueError(f'Unknown matrix_type: {matrix_type!r}')

    for ix in similar_ix:
        if ix == doc_id:  # Skip the original document
            continue
        print('\n')
        print(f'Document: {documents_df.iloc[ix]["documents"]}')
        print(f'{matrix_type}: {similarity_matrix[doc_id][ix]}')

# Test: Find documents similar to the first document
most_similar(0, pairwise_similarities, 'Cosine Similarity')
most_similar(0, pairwise_differences, 'Euclidean Distance')

Document: Machine learning is the study of computer algorithms that improve automatically through experience.              Machine learning algorithms build a mathematical model based on sample data, known as training data.              The discipline of machine learning employs various approaches to teach computers to accomplish tasks               where no fully satisfactory algorithm is available.

Similar Documents:


Document: Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so.               It involves computers learning from data provided so that they carry out certain tasks.
Cosine Similarity: 0.2282251849266025


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.              The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity: 0.2072169372318548


## Sentence Transformers

In [13]:
!pip install sentence-transformers



In [14]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')


In [15]:
document_embeddings = sbert_model.encode(documents_df['documents_cleaned'])

pairwise_similarities=cosine_similarity(document_embeddings)
pairwise_differences=euclidean_distances(document_embeddings)

most_similar(0,pairwise_similarities,'Cosine Similarity')
most_similar(0,pairwise_differences,'Euclidean Distance')

Document: Machine learning is the study of computer algorithms that improve automatically through experience.              Machine learning algorithms build a mathematical model based on sample data, known as training data.              The discipline of machine learning employs various approaches to teach computers to accomplish tasks               where no fully satisfactory algorithm is available.

Similar Documents:


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.              The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity: 0.8262172937393188


Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned              about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and var