### 📦 1. Importing Required Libraries
We begin by importing the necessary Python libraries for data processing, text cleaning, and building an inverted index:
- `re`: for regex-based text cleaning  
- `PorterStemmer` from NLTK: for stemming words to their root forms  
- `defaultdict`: a dictionary subclass from `collections` that supplies a default value for missing keys  
- `pandas`: for loading and handling the dataset


In [None]:
import re  
from nltk.stem.porter import PorterStemmer  
from collections import defaultdict 
import pandas as pd

### 📄 2. Loading the Dataset
We load a CSV file named `sample_job_dataset.csv` containing job descriptions and associated metadata.  
Then we display the first few rows to get a quick overview of the dataset.


In [None]:
df = pd.read_csv("sample_job_dataset.csv")
df.head()


### 🧹 3. Text Preprocessing and Stemming
We define a `preprocess` function that:
- Converts text to lowercase  
- Removes every character that is not a letter or whitespace, so digits and punctuation are dropped  
- Splits the text into words on whitespace  
- Applies stemming using `PorterStemmer`

This ensures all job descriptions are normalized before analysis.


In [None]:

stemmer = PorterStemmer()

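def preprocess(text):
    # Lowercase, then delete everything except letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    # Split on whitespace to get individual words
    words = text.split()
    # Reduce each word to its stem, e.g. "developers" -> "develop"
    stemmed = [stemmer.stem(word) for word in words]
    return stemmed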
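
To see what the cleaning and splitting steps do before stemming is applied, here is a minimal standard-library sketch. The helper name `clean_and_split` and the sample string are illustrative only; it uses the same regex as `preprocess` but skips the stemmer so it runs without NLTK.

```python
import re

def clean_and_split(text):
    """Lowercase, strip non-letters, and split on whitespace (no stemming)."""
    # Same pattern as preprocess: keep only ASCII letters and whitespace
    cleaned = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    return cleaned.split()

tokens = clean_and_split("Senior SQL-Developer (3+ years)!")
print(tokens)
```

Because the hyphen is deleted rather than replaced by a space, `SQL-Developer` collapses into the single token `sqldeveloper`. Substituting a space instead of an empty string would keep the two words apart, if that behavior is preferred.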


### 🔍 4. Building the Inverted Index
We use a `defaultdict` to create an inverted index that maps each stemmed word to the set of job IDs where it appears.  
This structure allows fast lookup of job descriptions based on keyword matches.


In [None]:
inverted_index = defaultdict(set)


for _, row in df.iterrows():
    words = preprocess(row["description"])  
    for word in words:
        inverted_index[word].add(row["id"]) 
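
The resulting structure is easiest to picture on a few postings. The sketch below builds the same kind of index from a hypothetical `docs` dictionary of already-preprocessed tokens; the job IDs and tokens are made up for illustration.

```python
from collections import defaultdict

# Toy postings: job ID -> already-preprocessed tokens (illustrative data)
docs = {
    1: ["sql", "develop", "databas"],
    2: ["python", "develop"],
    3: ["sql", "analyst"],
}

toy_index = defaultdict(set)
for job_id, tokens in docs.items():
    for token in tokens:
        # Each token points to every job whose description contains it
        toy_index[token].add(job_id)

print(sorted(toy_index["sql"]))      # jobs containing "sql"
print(sorted(toy_index["develop"]))  # jobs containing "develop"
```

Looking up a word is then a single dictionary access, regardless of how many postings were indexed.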

### 🔎 5. Search Function Using the Inverted Index
The `search` function takes a user query, preprocesses it, and counts how many query words appear in each job description.  
It prints the three highest-scoring jobs, ranked by that count, or a short message if nothing matches.


In [None]:
def search(query):
    query_words = set(preprocess(query))  # deduplicate so a repeated query word is counted once
    job_scores = defaultdict(int)  

    for word in query_words:
        for job_id in inverted_index.get(word, []):  
            job_scores[job_id] += 1 

   
    sorted_jobs = sorted(job_scores.items(), key=lambda x: x[1], reverse=True)

    if not sorted_jobs:
        print("No matching jobs found.")
        return

    print("Top matching jobs:\n")
    for job_id, score in sorted_jobs[:3]:
        job = df[df["id"] == job_id].iloc[0]  
        print(f"🔹 {job['title']} (ID: {job_id}) — Match Score: {score}")
        print(f"📝 Description: {job['description']}\n")
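
If the ranking is needed by other code rather than only printed, a variant that returns the scored pairs may be more convenient. This is a sketch under assumptions: `rank_jobs`, `toy_index`, and the `top_n` parameter are hypothetical names, and the index is a hand-written stand-in for `inverted_index`.

```python
from collections import Counter

# Toy inverted index: stemmed word -> set of job IDs (illustrative data)
toy_index = {
    "sql": {1, 3},
    "develop": {1, 2},
    "python": {2},
}

def rank_jobs(query_tokens, index, top_n=3):
    """Return up to top_n (job_id, score) pairs, highest score first."""
    scores = Counter()
    for token in set(query_tokens):          # count each distinct word once
        for job_id in index.get(token, ()):  # unknown words match nothing
            scores[job_id] += 1
    # Sort by score descending, then by job ID so ties are deterministic
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]

results = rank_jobs(["sql", "develop"], toy_index)
print(results)
```

Returning an empty list for a query with no matches lets the caller decide how to report it.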

### 👤 6. User Search Query Input
The user is prompted to enter job keywords (e.g., "SQL Developer").  
The search function then retrieves and ranks job postings based on the provided keywords.


In [None]:
# Ask the user for a search query
user_query = input("Enter job keywords (e.g., 'SQL Developer'): ")
search(user_query)


### 🛠️ 7. Importing ML Libraries for Text Classification
We now move to building a classification model by importing:
- `TfidfVectorizer`: to convert text into numerical TF-IDF features  
- `LinearSVC`: a linear support vector machine classifier  
- `train_test_split`: to hold out part of the data as a test set  
- `accuracy_score`: to measure accuracy on that test set


In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd


### 📂 8. Reloading the Dataset for Classification
We reload the job dataset from the CSV file to begin the classification task, where each job description will be categorized.


In [None]:
df = pd.read_csv("sample_job_dataset.csv")  

### ✨ 9. TF-IDF Vectorization
We use `TfidfVectorizer` to convert job descriptions into TF-IDF feature vectors.  
Each vector weights a word by how often it occurs in that description, scaled down for words that are common across all descriptions. These vectors are the input to the classifier.


In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["description"])
y = df["category"]
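
The weighting can be worked through by hand on a tiny corpus. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF and L2 normalization) and uses pre-split toy documents, so it simplifies the library's own tokenization; the data and the `tfidf` helper are illustrative.

```python
import math
from collections import Counter

# Toy documents, already split into tokens (illustrative data)
docs = [
    ["sql", "developer", "sql"],
    ["python", "developer"],
    ["sales", "manager"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, `sql` receives a larger weight than `developer` because it occurs twice there and appears in fewer documents overall.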


### 🧪 10. Splitting the Data
We split the TF-IDF features and labels into training and testing sets (80-20 split) using `train_test_split`.  
This helps evaluate our model's performance on unseen data.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 🧠 11. Training the Classification Model
We use the `LinearSVC` classifier to train our model on the TF-IDF vectors of job descriptions and their corresponding categories.


In [None]:
model = LinearSVC()
model.fit(X_train, y_train)
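
Note that the cells above fit the vectorizer on the full dataset before splitting, so the test descriptions contribute to the vocabulary and IDF weights. One way to avoid this, sketched here on made-up data, is to split the raw text first and wrap both steps in a scikit-learn pipeline. The `texts` and `labels` lists are illustrative stand-ins for `df["description"]` and `df["category"]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative toy data, not taken from the notebook's dataset
texts = [
    "sql database developer", "python backend developer",
    "java software engineer", "database administrator sql",
    "sales account manager", "retail sales associate",
    "regional sales executive", "customer sales representative",
]
labels = ["tech", "tech", "tech", "tech", "sales", "sales", "sales", "sales"]

# Split the raw text, keeping the class balance in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the SVM
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
pipeline.fit(train_texts, train_labels)

prediction = pipeline.predict(["senior sql developer"])[0]
print(prediction)
```

The fitted pipeline accepts raw strings directly, so a user query needs no separate `transform` call.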


### 📊 12. Evaluating Model Accuracy
After training, we predict job categories for the test set and calculate the model's accuracy using `accuracy_score`.


In [None]:

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
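
Accuracy is the fraction of test predictions that match the true label, which can hide poor results on rare categories. The sketch below computes accuracy and per-class recall by hand on made-up labels; `y_true` and `y_hat` are illustrative and are not results from this dataset.

```python
from collections import Counter

# Illustrative labels, not results from the notebook's dataset
y_true = ["tech", "tech", "tech", "tech", "sales", "hr"]
y_hat = ["tech", "tech", "tech", "tech", "tech", "tech"]

# Accuracy: fraction of predictions that equal the true label
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# Per-class recall: of the true members of each class, how many were found
totals = Counter(y_true)
hits = Counter(t for t, p in zip(y_true, y_hat) if t == p)
recall = {label: hits[label] / totals[label] for label in totals}

print(round(accuracy, 3))
print(recall)
```

Here a model that always answers `tech` still scores about 0.67 accuracy while finding none of the `sales` or `hr` jobs, which is why a per-class view is worth checking alongside the single number.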

### 🔮 13. Predicting Category from User Query
The user provides a job-related search query.  
We transform it using the trained TF-IDF vectorizer and use the model to predict the most suitable job category.


In [None]:
user_query = input("Enter your job-related query: ")
query_vec = vectorizer.transform([user_query])
predicted_category = model.predict(query_vec)[0]
print(f"\nPredicted Category: {predicted_category}")
print(f"\nTop jobs in '{predicted_category}' category:\n")

### 📋 14. Displaying Jobs in the Predicted Category
Once the category is predicted, we filter and display all jobs in that category, including their titles and descriptions.


In [None]:

matches = df[df["category"] == predicted_category]
for _, row in matches.iterrows():
    print(f"🔹 {row['title']} (ID: {row['id']})")
    print(f"📝 Description: {row['description']}\n")

### 🧼 15. Preprocessing All Job Descriptions
We apply our earlier `preprocess` function to all job descriptions in the dataset.  
This prepares the data for similarity-based search.


In [None]:

processed_descriptions = [" ".join(preprocess(description)) for description in df['description']]


### 🗣️ 16. Preprocessing User Search Query
The user provides a natural language search query, which is preprocessed using the same logic as job descriptions.


In [None]:

user_query = input("Enter job keywords (e.g., 'SQL Developer'): ")
processed_query = " ".join(preprocess(user_query))


### 🔄 17. Preparing the Corpus
We combine the preprocessed job descriptions with the preprocessed user query into one list, forming a single corpus for vectorization. Vectorizing them together guarantees that the query and the descriptions share one vocabulary.



In [None]:
corpus = processed_descriptions + [processed_query]


### 🧮 18. Count Vectorization
We convert the corpus (job descriptions + query) into count vectors using `CountVectorizer`.  
This helps us later compute similarity scores based on word frequency.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

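# Use a separate name so the TF-IDF `vectorizer` fitted for the classifier is not overwritten
count_vectorizer = CountVectorizer()
vectors = count_vectorizer.fit_transform(corpus)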
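
The idea behind count vectors can be reproduced with the standard library alone. This sketch mirrors the concept rather than the library's implementation; `toy_corpus`, `vocabulary`, and `count_vector` are hypothetical names, and the last corpus entry plays the role of the query.

```python
from collections import Counter

# Illustrative preprocessed texts; the last entry stands in for the query
toy_corpus = ["sql develop sql", "python develop", "sql develop"]

# Shared vocabulary: every distinct token in the corpus, in sorted order
vocabulary = sorted({token for text in toy_corpus for token in text.split()})

def count_vector(text):
    """Return token counts for text, one entry per vocabulary term."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

toy_vectors = [count_vector(text) for text in toy_corpus]
print(vocabulary)
print(toy_vectors)
```

Every vector has the same length and the same term order, which is what makes them directly comparable.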


### 📐 19. Calculating Cosine Similarity
We compute the cosine similarity between the user query vector and all job description vectors.  
This helps us rank jobs based on how similar they are to the query.


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(vectors[-1], vectors[:-1]).flatten()
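
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. A minimal hand-rolled sketch follows; the `cosine` helper and the toy vectors are illustrative, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

toy_query = [1, 0, 1]
toy_jobs = [[1, 0, 2], [1, 1, 0]]
scores = [cosine(toy_query, job) for job in toy_jobs]
print([round(s, 3) for s in scores])
```

The first job shares both of the query's terms and scores close to 1, while the second shares only one term and scores 0.5.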


### 🔝 20. Extracting Top Matches
We sort the cosine similarity scores in descending order and select the top 3 indices.  
These indices correspond to the most relevant job postings for the user query.


In [None]:
top_indices = similarity_scores.argsort()[::-1][:3]
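
The `argsort` slice yields three indices whenever the dataset has at least three jobs, even if some of them share no words with the query and therefore score 0. A sketch of a stricter selection that drops zero-score jobs is shown here; `top_matches` and `toy_scores` are hypothetical names used for illustration.

```python
def top_matches(scores, k=3):
    """Return indices of the k highest positive scores, best first."""
    # Rank all positions by score, highest first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep only positions with some overlap, then truncate to k
    return [i for i in ranked if scores[i] > 0][:k]

toy_scores = [0.0, 0.82, 0.0, 0.41, 0.0]
print(top_matches(toy_scores))
```

With only two non-zero scores, the function returns two indices instead of padding the result with an unrelated job.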


### 🧾 21. Displaying Top Matching Jobs
For each of the top 3 job postings:
- We show the job title and ID  
- We display the similarity score  
- We print the job description

This gives the user the job postings whose wording overlaps most with their query.


In [None]:
print("\nTop matching jobs:\n")
for idx in top_indices:
    job = df.iloc[idx]  
    score = similarity_scores[idx]
    print(f"🔹 {job['title']} (ID: {job['id']}) — Similarity Score: {round(score, 2)}")
    print(f"📝 Description: {job['description']}\n")