In [1]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download the NLTK stop-word list used below
nltk.download('stopwords')

# Sample documents
documents = ['Machine learning is the study of computer algorithms that improve automatically through experience.\
Machine learning algorithms build a mathematical model based on sample data, known as training data.\
The discipline of machine learning employs various approaches to teach computers to accomplish tasks \
where no fully satisfactory algorithm is available.',
'Machine learning is closely related to computational statistics, which focuses on making predictions using computers.\
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.',
'Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. \
It involves computers learning from data provided so that they carry out certain tasks.',
'Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"\
or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement',
'Software engineering is the systematic application of engineering approaches to the development of software.\
Software engineering is a computing discipline.',
'A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned\
about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.\
Developing a machine learning application is more iterative and explorative process than software engineering.'
]

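# Create a DataFrame from the documents using pandas
documentsDf=pd.DataFrame(documents,columns=['documents'])

# Remove stop words and replace non-letter characters with spaces
stopWords=stopwords.words('english')
def cleanDocument(text):
    # Lowercase each whitespace-separated token and replace every non-letter with a space
    tokens=(re.sub(r'[^a-zA-Z]',' ',w).lower() for w in text.split())
    # The stop-word check runs on the substituted token, so a stop word with
    # attached punctuation (e.g. 'so.' becomes 'so ') is not removed
    return " ".join(t for t in tokens if t not in stopWords)
documentsDf['CleanedDocuments']=documentsDf.documents.apply(cleanDocument)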

# Transform the text into embeddings (vector-space representations) using TF-IDF
tfIdfVectoriser=TfidfVectorizer()
# Learn the vocabulary and IDF weights, then build one TF-IDF vector per cleaned document
tfidfVectors=tfIdfVectoriser.fit_transform(documentsDf.CleanedDocuments)


def mostSimilarCosine(docId,tfidfVectors):
    # Print the other documents in descending order of cosine similarity to row docId of documentsDf
    print (f'Document: {documentsDf.iloc[docId]["documents"]}')
    print ('\n')
    print ('Similar Documents:')
    # TF-IDF rows are L2-normalised by default, so their dot products are cosine similarities
    pairwiseSimilarities=(tfidfVectors @ tfidfVectors.T).toarray()
    similarIndex=np.argsort(pairwiseSimilarities[docId])[::-1]
    for i in similarIndex:
        if i==docId:
            continue
        print('\n')
        print (f'Document: {documentsDf.iloc[i]["documents"]}')
        print (f'CosineSimilarity: {pairwiseSimilarities[docId][i]}')

# Rank the other documents by the cosine similarity of their embeddings to document 0
mostSimilarCosine(0,tfidfVectors)
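
The cell above computes similarities as a plain dot product of the TF-IDF matrix with its transpose. A minimal sketch of why that can be read as cosine similarity, assuming `TfidfVectorizer` is left at its default `norm='l2'`: each row then has unit length, so the dot product and `cosine_similarity` should agree. The names `toy_docs`, `toy_vectors` and `rank_similar` are hypothetical and used only for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for this illustration
toy_docs = [
    "machine learning builds models from data",
    "software engineering builds programs from logic",
    "machine learning and software engineering",
]

# With the default norm='l2', every TF-IDF row is scaled to unit length
toy_vectors = TfidfVectorizer().fit_transform(toy_docs)
row_norms = np.sqrt(toy_vectors.multiply(toy_vectors).sum(axis=1))

# For unit-length rows, the product with the transpose is the cosine similarity matrix
dot_similarities = (toy_vectors @ toy_vectors.T).toarray()
cosine_similarities = cosine_similarity(toy_vectors)


def rank_similar(doc_id, similarities):
    """Return the indices of the other documents, most similar first."""
    order = np.argsort(similarities[doc_id])[::-1]
    return [int(i) for i in order if i != doc_id]


print(np.allclose(dot_similarities, cosine_similarities))
print(rank_similar(0, cosine_similarities))
```

If the vectoriser were created with `norm=None`, the two matrices would generally differ, and `cosine_similarity` would be the safer call.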

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available.


Similar Documents:


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
CosineSimilarity: 0.22860560787391593


Document: Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tas