# Text Similarity using Deep Learning

transformers does not support Keras 3, so you have to fall back to Keras 2 by running `pip install tf-keras`.

In [12]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Load the pre-trained IndoBERT tokenizer and model
model_name = 'indobenchmark/indobert-base-p1'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertModel.from_pretrained(model_name)

Some layers from the model checkpoint at indobenchmark/indobert-base-p1 were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at indobenchmark/indobert-base-p1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [13]:
# Function to tokenize and get embeddings from BERT
def get_embeddings(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True, max_length=128)
    
    # Get BERT embeddings
    outputs = model(inputs)
    
    # Use the last hidden state of the [CLS] token for sentence-level embeddings
    cls_embeddings = outputs.last_hidden_state[:, 0, :]
    
    return cls_embeddings

# Function to calculate cosine similarity between two embeddings
def calculate_similarity(text1, text2):
    # Get embeddings for both texts
    embeddings1 = get_embeddings(text1)
    embeddings2 = get_embeddings(text2)
    
    # Convert embeddings to numpy arrays
    embeddings1 = embeddings1.numpy()
    embeddings2 = embeddings2.numpy()
    
    # Calculate cosine similarity
    similarity = cosine_similarity(embeddings1, embeddings2)
    
    return similarity[0][0]

# Example usage
text1 = "Rubah cokelat yang cepat melompati anjing yang malas."
text2 = "Seekor hewan cokelat yang cepat melompati anjing yang sedang tidur."

similarity_score = calculate_similarity(text1, text2)
print(f"Cosine Similarity: {similarity_score}")

Cosine Similarity: 0.9854233264923096
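
To make the score above easier to interpret, here is a minimal sketch of what cosine similarity computes, written with the standard library only. The helper name `cosine_similarity_1d` is hypothetical and is not part of scikit-learn. The notebook itself uses `sklearn.metrics.pairwise.cosine_similarity`, which works on 2-D arrays and returns a matrix of pairwise scores.

```python
import math

def cosine_similarity_1d(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction, different length: score close to 1
parallel = cosine_similarity_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: score 0
orthogonal = cosine_similarity_1d([1.0, 0.0], [0.0, 1.0])
# Opposite directions: score -1
opposite = cosine_similarity_1d([1.0, 0.0], [-1.0, 0.0])
```

The score depends only on the angle between the two vectors, not on their lengths, so it always falls between -1 and 1.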


In [14]:
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

# Function to tokenize and get embeddings from BERT
def get_embeddings(text):
    inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True, max_length=128)
    outputs = model(inputs)
    cls_embeddings = outputs.last_hidden_state[:, 0, :]
    return cls_embeddings

# Function to calculate cosine similarity between two embeddings
def calculate_similarity(text1, text2):
    embeddings1 = get_embeddings(text1).numpy()
    embeddings2 = get_embeddings(text2).numpy()
    similarity = cosine_similarity(embeddings1, embeddings2)
    return similarity[0][0]

# Load the dataset
df = pd.read_excel('examples-datasets.xlsx')

# Choose the student text to test (e.g., first row)
student_text = df['Pelajar'].iloc[0]

max_similarity = float('-inf')  # Cosine similarity can be negative, so do not start at 0
closest_gpt_text = ''

# Compare with each GPT text
for gpt_text in df['GPT']:
    similarity_score = calculate_similarity(student_text, gpt_text)
    
    # Check if this is the highest similarity found
    if similarity_score > max_similarity:
        max_similarity = similarity_score
        closest_gpt_text = gpt_text

# Decide whether the text looks closer to GPT or to student writing
similarity_percentage = max_similarity * 100  # Convert to percentage
print(f"Teks Siswa : '{student_text}'")
if max_similarity > 0.5:  # Decision threshold; adjust as needed
    print(f"{similarity_percentage:.2f}% sepertinya mirip GPT")
    print(f"Teks yang paling mirip: '{closest_gpt_text}'")
else:
    print(f"{similarity_percentage:.2f}% sepertinya buatan pelajar")
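
The search-and-threshold pattern in the cell above can be sketched in isolation, without loading BERT. In this sketch, `find_closest`, `label_for`, and `toy_score` are hypothetical names introduced for illustration. `toy_score` is a simple word-overlap (Jaccard) score standing in for `calculate_similarity`, and the English labels stand in for the printed messages. The 0.5 threshold is taken from the cell above.

```python
def find_closest(query, candidates, score_fn):
    """Return (best_candidate, best_score) under score_fn (hypothetical helper)."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    # Score every candidate, then keep the highest-scoring pair
    scored = [(score_fn(query, c), c) for c in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score

def label_for(score, threshold=0.5):
    """Map a similarity score to a label using a strict threshold."""
    return "gpt-like" if score > threshold else "student-like"

def toy_score(a, b):
    """Word-overlap (Jaccard) score; a stand-in for the BERT-based similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

candidates = ["the cat sat on the mat", "dogs bark loudly", "a cat sat quietly"]
best_text, best_score = find_closest("the cat sat", candidates, toy_score)
label = label_for(best_score)
```

Using `max` over all scored pairs avoids depending on the starting value of a running maximum, and passing the scoring function as an argument makes it possible to swap in `calculate_similarity` without changing the search logic.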

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Teks Siswa : 'Kesehatan mental di kalangan pelajar itu penting banget, tapi seringkali nggak diperhatikan. Banyak yang mikir kalau kesehatan mental cuma soal “nggak gila” atau “nggak stres berat,” padahal lebih dari itu. Pelajar kayak kita sering ngerasain tekanan dari tugas, ujian, ekspektasi orang tua, dan lingkungan sekolah. Akibatnya, banyak dari kita yang jadi cemas, susah tidur, bahkan kadang sampai merasa depresi. Tapi, karena kurangnya pengetahuan, kita sering anggap remeh perasaan itu atau malah nggak tahu harus gimana ngatasinnya.

Belum lagi, ngomongin kesehatan mental di sekolah tuh sering dianggap tabu. Malu, takut dikira lemah, atau malah takut dihakimi. Padahal, kalau kita bisa lebih terbuka tentang apa yang kita rasain, mungkin kita bisa dapet dukungan yang lebih baik, entah dari guru, teman, atau konselor. Hal-hal sederhana kayak istirahat yang cukup, punya hobi, atau ngobrol sama orang yang bisa dipercaya, bisa banget bantu jaga kesehatan mental kita.

Jadi, penting b