In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import joblib
import os

In [3]:
df = pd.read_csv(os.path.join(os.getcwd(), '..', 'data', 'dataset.csv'))
df.head()

Unnamed: 0,title,abstract,authors
0,Entanglement and control operations in Ising i...,Entanglement generated by Ising model has be...,"Francisco Delgado (Tecnologico de Monterrey, C..."
1,Coronal lines and dust formation in SN 2005ip:...,We present optical photometry and spectrosco...,"Nathan Smith, Jeffrey M. Silverman, Ryan Chorn..."
2,Products of straight spaces,A metric space X is straight if for each fin...,"Alessandro Berarducci, Dikran Dikranjan, Jan P..."
3,The Carina Nebula: A Laboratory for Feedback a...,The Carina Nebula (NGC 3372) is our richest ...,Nathan Smith and Kate J. Brooks
4,Metric groups attached to biextensions,"Let $G$ be a connnected, unipotent, perfect ...",Swarnendu Datta


In [4]:
df.shape

(1256074, 3)

In [5]:
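# Preprocessing: lowercase the abstracts. TfidfVectorizer also lowercases by
# default, so this column mainly keeps the cleaned text around for inspection.
df['abstract_cleaned'] = df['abstract'].str.lower()

# Convert abstracts to TF-IDF vectors, dropping English stop words
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['abstract_cleaned'])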
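
To make the weighting in the cell above concrete, here is a minimal pure-Python sketch of TF-IDF with smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what scikit-learn applies with its default settings. The helper names `tfidf_vectors` and `cosine` and the three toy documents are assumptions made for illustration. Tokenization here is a plain whitespace split with no stop-word removal, so it is not a drop-in replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF dicts for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, as in scikit-learn's default (smooth_idf=True).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors

def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy documents (hypothetical, not taken from the dataset).
docs = [
    "deep learning models".split(),
    "deep space telescope".split(),
    "metric spaces topology".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # the two documents share only "deep"
print(cosine(vecs[0], vecs[2]))  # no shared terms, so the score is zero
```

A term that appears in fewer documents gets a larger IDF, so a match on a rare word contributes more to the similarity than a match on a common one.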

In [6]:
# Input query
query = "Deep learning is"
query_vec = vectorizer.transform([query])

# Compute cosine similarity between the query and every abstract
similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()

# Sort the abstracts by similarity in descending order
sorted_indices = similarities.argsort()[::-1]

# Display abstracts with similarity score > 0.5
print("Abstracts with similarity score > 0.5:\n")
for idx in sorted_indices:
    # Scores are in descending order, so stop at the first one at or below 0.5
    if similarities[idx] <= 0.5:
        break
    print(f"Similarity Score: {similarities[idx]:.4f}")
    print(f"Abstract: {df['abstract'].iloc[idx]}\n")
    print("-" * 80)

Abstracts with similarity score > 0.5:

Similarity Score: 0.7577
Abstract:   Deep learning has achieved a great success in many areas, from computer
vision to natural language processing, to game playing, and much more. Yet,
what deep learning is really doing is still an open question. There are a lot
of works in this direction. For example, [5] tried to explain deep learning by
group renormalization, and [6] tried to explain deep learning from the view of
functional approximation. In order to address this very crucial question, here
we see deep learning from perspective of mechanical learning and learning
machine (see [1], [2]). From this particular angle, we can see deep learning
much better and answer with confidence: What deep learning is really doing? why
it works well, how it works, and how much data is necessary for learning. We
also will discuss advantages and disadvantages of deep learning at the end of
this work.


-------------------------------------------------------------
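
With roughly 1.26 million abstracts, sorting every score just to read the few highest ones does more work than necessary. One possible alternative, sketched below, uses `np.argpartition` to pull out the `k` largest scores and sorts only those. The helper name `top_k_above`, its default values, and the toy `scores` array are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def top_k_above(similarities, k=10, threshold=0.5):
    """Return indices of up to k scores above threshold, highest first."""
    similarities = np.asarray(similarities)
    k = min(k, similarities.size)
    if k == 0:
        return np.array([], dtype=int)
    # argpartition moves the k largest scores into the last k slots (unordered).
    candidates = np.argpartition(similarities, -k)[-k:]
    # Sort only those k candidates in descending order of score.
    candidates = candidates[np.argsort(similarities[candidates])[::-1]]
    # Keep only the candidates that clear the threshold.
    return candidates[similarities[candidates] > threshold]

# Toy scores (hypothetical); in the notebook this would be `similarities`.
scores = np.array([0.10, 0.76, 0.52, 0.49, 0.90])
print(top_k_above(scores, k=3))  # [4 1 2]
```

Unlike the loop in the cell above, this returns at most `k` results even when more abstracts clear the threshold, so `k` caps how much gets printed.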

In [8]:
# Make sure the output directory exists before saving
model_dir = os.path.join(os.getcwd(), '..', 'model')
os.makedirs(model_dir, exist_ok=True)

# Save the TF-IDF vectorizer
joblib.dump(vectorizer, os.path.join(model_dir, 'tfidf_vectorizer.pkl'))

# Save the TF-IDF matrix
joblib.dump(tfidf_matrix, os.path.join(model_dir, 'tfidf_matrix.pkl'))

['/Users/naufalputra/Downloads/chatbot_pka/notebook/../model/tfidf_matrix.pkl']
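
The saved files are presumably meant to be loaded elsewhere, for example by the chatbot application. The round trip below is a rough sketch of that step on a toy corpus: it fits a vectorizer, dumps both artifacts, loads them back with `joblib.load`, and answers a query from the loaded objects only. The three toy abstracts and the temporary directory are assumptions for illustration; the real files live under the `model` directory shown above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for df['abstract'] (hypothetical data).
abstracts = [
    "deep learning for computer vision",
    "optical photometry of a supernova",
    "straight metric spaces and their products",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(abstracts)

# A temporary directory stands in for the notebook's model directory.
with tempfile.TemporaryDirectory() as model_dir:
    vec_path = os.path.join(model_dir, "tfidf_vectorizer.pkl")
    mat_path = os.path.join(model_dir, "tfidf_matrix.pkl")
    joblib.dump(vectorizer, vec_path)
    joblib.dump(tfidf_matrix, mat_path)

    # Later, possibly in another process: load both artifacts.
    loaded_vectorizer = joblib.load(vec_path)
    loaded_matrix = joblib.load(mat_path)

# Query using only the loaded objects.
query_vec = loaded_vectorizer.transform(["deep learning"])
scores = cosine_similarity(query_vec, loaded_matrix).flatten()
best = int(scores.argmax())
print(best, abstracts[best])
```

The matrix stores only vectors, and its row order matches the order of the rows it was fitted on. To display titles or abstracts for the matching rows, the loading side also needs the original dataset in the same order.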