## 📦 Importing Libraries  
We begin by importing the libraries used for text cleaning (`re`), stemming (NLTK's `PorterStemmer`), building the inverted index (`defaultdict`), and loading the dataset (pandas).


In [1]:
import re  # for text cleaning
from nltk.stem.porter import PorterStemmer  # for stemming (getting root words)
from collections import defaultdict  # dict that creates a default value for missing keys
import pandas as pd  # for loading the job dataset

## 📋 Sample Job Descriptions  
Here’s a sample dataset of job descriptions we’ll use to test our search engine.


In [2]:
df = pd.read_csv("sample_job_dataset.csv")
df.head()


Unnamed: 0,id,title,description,category
0,1,Frontend Developer,"Develop UI using HTML, CSS, JavaScript and React.",Web Development
1,2,Data Analyst,"Analyze data using SQL, Excel and create dashb...",Data Analysis
2,3,Backend Developer,Build server-side logic with Node.js and manag...,Web Development
3,4,Machine Learning Engineer,Design ML models using Python and scikit-learn.,Machine Learning
4,5,Full Stack Developer,Work on both frontend and backend using MERN s...,Web Development


## 🧹 Preprocessing Function  
This function cleans and stems the job description text to normalize it for matching.


In [3]:
# Create a stemmer object
stemmer = PorterStemmer()

# Function to clean and stem words
def preprocess(text):
    # lowercase, then drop every character that is not a letter or whitespace (digits included)
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    words = text.split()  # split into words
    stemmed = [stemmer.stem(word) for word in words]  # stem each word
    return stemmed
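
To make the cleaning step concrete, the sketch below applies the same regular expression to a made-up description. It leaves out the stemmer so that it runs with the standard library alone; `clean_tokens` and the sample string are illustrative names, not part of the notebook.

```python
import re

def clean_tokens(text):
    """Lowercase the text, keep only letters and whitespace, then split into tokens."""
    cleaned = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    return cleaned.split()

# Hypothetical description used only for illustration
sample = "Build REST APIs with Node.js, Python 3 and CI/CD."
tokens = clean_tokens(sample)
print(tokens)
```

Because punctuation is deleted rather than replaced by a space, `Node.js` becomes `nodejs` and `CI/CD` becomes `cicd`, and the digit `3` disappears. Queries still match these terms as long as they pass through the same cleaning, which is why `search` calls `preprocess` on the query as well.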


## 📚 Building the Inverted Index  
We create an inverted index to map each word to the job descriptions that contain it.


In [4]:
inverted_index = defaultdict(set)

# Fill the inverted index with job data from CSV
for _, row in df.iterrows():
    words = preprocess(row["description"])  # clean each job's description
    for word in words:
        inverted_index[word].add(row["id"])  # link each word to job ID
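
As a rough illustration of the resulting structure, the sketch below builds the same kind of index over three hand-written token lists. The job IDs and tokens in `toy_jobs` are assumptions made for the example, not rows from the CSV.

```python
from collections import defaultdict

# Hypothetical mini-dataset: job ID -> already cleaned and stemmed tokens
toy_jobs = {
    1: ["develop", "ui", "react"],
    2: ["analyz", "data", "sql"],
    3: ["build", "api", "sql"],
}

toy_index = defaultdict(set)
for job_id, tokens in toy_jobs.items():
    for token in tokens:
        toy_index[token].add(job_id)  # each token maps to the set of jobs containing it

sql_jobs = sorted(toy_index["sql"])
react_jobs = sorted(toy_index["react"])
has_docker = "docker" in toy_index  # a membership test does not create a new key

print(sql_jobs, react_jobs, has_docker)
```

Using a `set` as the default value means a word that appears several times in one description is still linked to that job only once.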

## 🔍 Keyword-Based Job Search  
This function scores each job by the number of query words found in its description, looked up through the inverted index, and prints the three highest-scoring jobs.



In [5]:
def search(query):
    query_words = preprocess(query)  # clean user query
    job_scores = defaultdict(int)  # store match score for each job

    for word in query_words:
        for job_id in inverted_index.get(word, []):  # get job IDs that contain the word
            job_scores[job_id] += 1  # add score if word matches

    # Sort jobs based on score (most relevant first)
    sorted_jobs = sorted(job_scores.items(), key=lambda x: x[1], reverse=True)

    if not sorted_jobs:
        print("No matching jobs found.")
        return

    print("Top matching jobs:\n")
    for job_id, score in sorted_jobs[:3]:  # show top 3 jobs
        job = df[df["id"] == job_id].iloc[0]  # get the job row from the dataframe
        print(f"🔹 {job['title']} (ID: {job_id}) — Match Score: {score}")
        print(f"📝 Description: {job['description']}\n")
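
The scoring logic can also be written as a function that returns its results instead of printing them, which makes it easier to test. The version below is a minimal sketch and differs from `search` in two ways that are choices of the example: it counts each distinct query word once, and it breaks ties by job ID. `rank_jobs` and `toy_index` are hypothetical names.

```python
from collections import defaultdict

def rank_jobs(query_tokens, index, top_k=3):
    """Return up to top_k (job_id, score) pairs, highest score first."""
    scores = defaultdict(int)
    for token in set(query_tokens):          # count each distinct query term once
        for job_id in index.get(token, ()):  # unknown terms contribute nothing
            scores[job_id] += 1
    # Sort by score descending, then by job ID so ties are deterministic
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Hypothetical index: token -> set of job IDs
toy_index = {
    "sql": {2, 3},
    "data": {2},
    "react": {1},
}

results = rank_jobs(["sql", "data", "sql"], toy_index)
print(results)
```

Job 2 matches both `sql` and `data`, so it ranks above job 3, which matches only `sql`. A query with no known words yields an empty list, which the caller can report as "no matching jobs".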

## 🧑‍💻 Run a Keyword-Based Search  
The user enters a query, and the system prints the job descriptions that best match its keywords.


In [6]:
# Ask the user for a search query
user_query = input("Enter job keywords (e.g., 'SQL Developer'): ")
search(user_query)


Top matching jobs:

🔹 DevOps Engineer (ID: 9) — Match Score: 1
📝 Description: Implement CI/CD pipelines using Jenkins and Docker.



In [22]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd


In [23]:
df = pd.read_csv("sample_job_dataset.csv")  # Replace with your file name

In [24]:
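vectorizer = TfidfVectorizer()
# TF-IDF features from the raw descriptions
# (note: fitted on all rows, including those later held out for testing)
X = vectorizer.fit_transform(df["description"])
y = df["category"]  # target labels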


In [25]:
# Hold out 20% of the jobs for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [26]:
model = LinearSVC()
model.fit(X_train, y_train)


In [27]:
# Evaluate accuracy
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.3173076923076923
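
An accuracy figure like the one above is easier to interpret next to a trivial baseline, such as always predicting the most frequent training category. The sketch below shows that comparison in plain Python; the label lists are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline(y_train, y_true):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))

# Made-up labels for illustration only
train_labels = ["Web", "Web", "Data", "ML", "Web", "DevOps"]
test_labels = ["Web", "Data", "Web", "ML"]
model_preds = ["Web", "Web", "Web", "ML"]

model_acc = accuracy(test_labels, model_preds)
baseline_acc = majority_baseline(train_labels, test_labels)
print(model_acc, baseline_acc)
```

With the notebook's variables, the same comparison would take `y_train`, `y_test`, and `y_pred` as inputs. A model that does not clearly beat the baseline has learned little from the descriptions.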


In [28]:
user_query = input("Enter your job-related query: ")
query_vec = vectorizer.transform([user_query])
predicted_category = model.predict(query_vec)[0]
print(f"\nPredicted Category: {predicted_category}")
print(f"\nTop jobs in '{predicted_category}' category:\n")


Predicted Category: DevOps

Top jobs in 'DevOps' category:



In [29]:
# Print matched jobs
matches = df[df["category"] == predicted_category]
for _, row in matches.iterrows():
    print(f"🔹 {row['title']} (ID: {row['id']})")
    print(f"📝 Description: {row['description']}\n")

🔹 DevOps Engineer (ID: 9)
📝 Description: Implement CI/CD pipelines using Jenkins and Docker.

🔹 Kubernetes Administrator (ID: 44)
📝 Description: Manage container orchestration at scale.

🔹 Site Reliability Engineer (ID: 65)
📝 Description: Maintain critical production systems.



In [30]:
# Apply stemming to all job descriptions
processed_descriptions = [" ".join(preprocess(description)) for description in df['description']]


In [31]:
# Get and preprocess the user query
user_query = input("Enter job keywords (e.g., 'SQL Developer'): ")
processed_query = " ".join(preprocess(user_query))


In [32]:
# Combine job descriptions and query into one list
corpus = processed_descriptions + [processed_query]


In [33]:
from sklearn.feature_extraction.text import CountVectorizer

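# Use a separate name so the TF-IDF vectorizer used by the classifier is not overwritten
count_vectorizer = CountVectorizer()
vectors = count_vectorizer.fit_transform(corpus)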


In [34]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(vectors[-1], vectors[:-1]).flatten()
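
For intuition about what the similarity score measures, the sketch below computes cosine similarity on raw word counts in plain Python. It is a simplified stand-in for the `CountVectorizer` plus `cosine_similarity` combination, not a reimplementation of it: for example, `CountVectorizer` ignores single-character tokens by default, which this sketch does not. The token lists are invented for the example.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using word counts as vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical preprocessed query and job descriptions
query = ["sql", "develop"]
job_a = ["sql", "data", "dashboard", "develop"]
job_b = ["docker", "jenkin"]

sim_a = round(cosine(query, job_a), 3)
sim_b = cosine(query, job_b)
print(sim_a, sim_b)
```

The score depends only on shared words and vector lengths, so a job with no words in common with the query always scores exactly 0.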


In [35]:
# Indices of the three highest similarity scores, best first
top_indices = similarity_scores.argsort()[::-1][:3]


In [21]:
print("\nTop matching jobs:\n")
for idx in top_indices:
    job = df.iloc[idx]  # Access job data using the index
    score = similarity_scores[idx]
    print(f"🔹 {job['title']} (ID: {job['id']}) — Similarity Score: {round(score, 2)}")
    print(f"📝 Description: {job['description']}\n")


Top matching jobs:

🔹 DevOps Engineer (ID: 9) — Similarity Score: 0.38
📝 Description: Implement CI/CD pipelines using Jenkins and Docker.

🔹 Content Strategist (ID: 32) — Similarity Score: 0.0
📝 Description: Plan and optimize digital content campaigns.

🔹  Quantum Legend Analyst (ID: 497) — Similarity Score: 0.0
📝 Description: Study El Cid loyalty via QC.
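
The output above lists jobs with a similarity of 0.0 because the top three are always shown, even when fewer than three jobs share a word with the query. One possible refinement, sketched here with made-up scores and a hypothetical helper `top_matches`, is to drop zero-score entries before displaying results.

```python
def top_matches(scores, top_k=3):
    """Return (index, score) pairs for the top_k highest scores, skipping zeros."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(idx, score) for idx, score in ranked[:top_k] if score > 0]

# Made-up similarity scores: only the job at position 1 shares words with the query
scores = [0.0, 0.38, 0.0, 0.0]
matches = top_matches(scores)
print(matches)
```

In the notebook, the returned indices could be passed to `df.iloc` in the same way as `top_indices`.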

