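## Matching resumes to a job description with K-means clustering and cosine similarity

Each resume is assigned to a cluster with K-means, and the input job description is compared against every resume using cosine similarity on TF-IDF vectors. The resume with the highest similarity is treated as the closest match, and the other resumes in its cluster are reported as close matches.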

### Import the Python libraries required for processing

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

### Job description (a simple placeholder text for now)

In [2]:
job_description = "This is a new document with similarities."

### Resumes (simple placeholder texts for now)

In [3]:
resumes = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

In [4]:
df = pd.DataFrame({'Text': resumes})

### Data cleansing (removal of stop words) and TF-IDF vectorization

In [5]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['Text'])

### Apply K-means clustering on the resumes - currently on sample texts

In [6]:
num_clusters = 2
# Set n_init explicitly: its default value differs between scikit-learn versions
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)

### TF-IDF vectorization for the input text

In [7]:
input_vector = vectorizer.transform([job_description])
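
Because `transform` reuses the vocabulary learned from the resumes, any word in the job description that never appeared in a resume is silently dropped. The sketch below, which assumes the same sample texts, shows one way to check which terms actually contribute to the comparison; `matched_terms` is an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

job_description = "This is a new document with similarities."
resumes = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(resumes)

# transform() only keeps terms that are in the fitted vocabulary
input_vector = vectorizer.transform([job_description])

# Map the non-zero entries back to their terms (illustrative variable name)
matched_terms = list(vectorizer.inverse_transform(input_vector)[0])
print(matched_terms)
```

Here only `document` is retained, while `new` and `similarities` are ignored because no resume contains them.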

### Calculate the cosine similarity between the job description and each resume

In [8]:
# X already holds the TF-IDF vectors of the resumes, so there is no need to re-vectorize each text
df['Similarity'] = cosine_similarity(X, input_vector).ravel()
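
As a sanity check on what the similarity score means: `TfidfVectorizer` L2-normalizes each row by default, so the cosine similarity of two TF-IDF vectors reduces to their dot product. The following is a small sketch under that default setting; `library_scores` and `dot_scores` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_description = "This is a new document with similarities."
resumes = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer(stop_words='english')  # norm='l2' is the default
X = vectorizer.fit_transform(resumes)
input_vector = vectorizer.transform([job_description])

# Scores from the library function, one per resume
library_scores = cosine_similarity(X, input_vector).ravel()

# Same scores from a plain sparse dot product, valid because rows are unit length (or all zero)
dot_scores = (X @ input_vector.T).toarray().ravel()

print(np.allclose(library_scores, dot_scores))
```

A text whose words are all stop words has an all-zero vector, and both approaches give it a score of 0.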

### Display the resumes in the same cluster as the best match

In [9]:
cluster_matches = df[df['Cluster'] == df.loc[df['Similarity'].idxmax(), 'Cluster']]
print("Close matches in the cluster:")
print(cluster_matches[['Text', 'Similarity']])

Close matches in the cluster:
                                    Text  Similarity
0            This is the first document.    1.000000
1  This document is the second document.    0.787223
3            Is this the first document?    1.000000
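
The output above lists every resume in the winning cluster in its original order. To single out the closest match, the cluster can be sorted by similarity. The sketch below is one possible way to do this and is not part of the original notebook; `ranked`, `top_score`, and `best_matches` are illustrative names, and ties are kept rather than broken arbitrarily.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_description = "This is a new document with similarities."
resumes = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
df = pd.DataFrame({'Text': resumes})

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['Text'])
df['Cluster'] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
df['Similarity'] = cosine_similarity(X, vectorizer.transform([job_description])).ravel()

# Cluster that contains the highest-scoring resume
best_cluster = df.loc[df['Similarity'].idxmax(), 'Cluster']

# Rank the resumes of that cluster from most to least similar
ranked = df[df['Cluster'] == best_cluster].sort_values('Similarity', ascending=False)

# Keep every resume that shares the top score, so ties are visible
top_score = ranked['Similarity'].max()
best_matches = ranked[np.isclose(ranked['Similarity'], top_score)]
print(best_matches[['Text', 'Similarity']])
```

With the sample texts, resumes 0 and 3 tie for the top score because they reduce to the same term after stop-word removal.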
