## **Resume Screening — Model Training**

In this notebook, we will:
- Load the preprocessed resume dataset
- Convert text into numerical vectors using **TF-IDF**
- Train two models: **Logistic Regression** and **Naive Bayes**
- Evaluate their performance
- Save the best model and vectorizer for later use

**Import Libraries**

In [19]:
# pickle loads the preprocessed resumes; joblib saves the best model and vectorizer
import pickle
import joblib
import pandas as pd  # data handling
# Machine learning
from sklearn.feature_extraction.text import TfidfVectorizer  # converts text to numeric features
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

**Load Preprocessed Dataset**

In [20]:
with open('src/resumes_df.pkl','rb') as f:
    resumes_df = pickle.load(f)
    
# Features (resume text) and labels (job category)
X = resumes_df['final_resume']
y = resumes_df['Category']  # target variable

resumes_df[['final_resume','Category']].head()

| | final_resume | Category |
|---|---|---|
| 0 | skill programming language python panda numpy ... | Data Science |
| 1 | education detail may 2013 may 2017 b.e uit-rgp... | Data Science |
| 2 | area interest deep learning control system des... | Data Science |
| 3 | skill r python sap hana tableau sap hana sql s... | Data Science |
| 4 | education detail mca ymcaust faridabad haryana... | Data Science |


**Convert Text to TF-IDF Vectors**

In [21]:
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2)) # initialize the vectorizer
X_vec = tfidf.fit_transform(X) # transform the text data to feature vectors
X_vec.shape

(962, 5000)

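- `max_features=5000`: keep only the 5,000 terms (unigrams and bigrams together) with the highest term frequency across the corpus
- `ngram_range=(1,2)`: extract both single words (unigrams) and pairs of adjacent words (bigrams)
- Output: a sparse matrix with 962 rows (resumes) and 5,000 columns (features)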

**Split Dataset into Train & Test**

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, stratify=y, random_state=42)
print(f"Training set size: {X_train.shape}, Test set size: {X_test.shape}")

Training set size: (769, 5000), Test set size: (193, 5000)


- 80% training data, 20% testing data
- **`stratify=y`** keeps each category's share approximately the same in the train and test sets.
- Total samples = **962 (769 train + 193 test)**.
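
The effect of `stratify` can be checked on a small made-up label set. The sketch below uses hypothetical toy labels (80 of class "A", 20 of class "B") and counts the classes on each side of the split. It also reproduces the test-set size for 962 samples, assuming the 20% fraction is rounded up.

```python
import math
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 samples of class "A", 20 of class "B"
toy_X = list(range(100))
toy_y = ["A"] * 80 + ["B"] * 20

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts, test_counts)  # the 80/20 class ratio holds on both sides

# Test-set size for the notebook's 962 resumes: 20% of 962 is 192.4, rounded up
n_test = math.ceil(0.2 * 962)
print(n_test, 962 - n_test)
```

The last line agrees with the 769/193 split printed by the notebook.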

**Train Logistic Regression Model**

In [23]:
lr_model = LogisticRegression(class_weight="balanced", max_iter=1000)
lr_model.fit(X_train, y_train)

# Predict & Evaluate
lr_preds = lr_model.predict(X_test)
print("Logistic Regression Report:")
print(classification_report(y_test, lr_preds)) # comparison of true labels (y_test) vs predicted labels (lr_preds)

Logistic Regression Report:
                           precision    recall  f1-score   support

                 Advocate       1.00      1.00      1.00         4
                     Arts       1.00      1.00      1.00         7
       Automation Testing       0.83      1.00      0.91         5
               Blockchain       1.00      1.00      1.00         8
         Business Analyst       1.00      1.00      1.00         6
           Civil Engineer       1.00      1.00      1.00         5
             Data Science       1.00      1.00      1.00         8
                 Database       1.00      1.00      1.00         7
          DevOps Engineer       1.00      0.91      0.95        11
         DotNet Developer       1.00      1.00      1.00         5
            ETL Developer       1.00      1.00      1.00         8
   Electrical Engineering       1.00      1.00      1.00         6
                       HR       1.00      1.00      1.00         9
                   Hadoop       1

**Insights from Logistic Regression Report**
- **Overall accuracy ≈ 99%** → the model predicts almost all test samples correctly.
- **Precision/Recall/F1 ≈ 1.0** → very few errors across classes.
- **DevOps Engineer** is slightly weaker **(recall = 0.91, i.e. 1 of its 11 test samples missed)**; **Automation Testing** has precision 0.83 (one false positive).
- Small classes (e.g. **Advocate = 4 samples**) are still classified perfectly.
- The model performs **strongly on the held-out test split** of this dataset.
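
The classification report shows that a class lost some samples but not which class received them. The `confusion_matrix` function imported earlier can answer that. The sketch below uses small hypothetical label lists (not the notebook's predictions) to show how to read the off-diagonal entries.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes
labels = ["Automation Testing", "DevOps Engineer", "HR"]
toy_true = ["DevOps Engineer", "DevOps Engineer", "DevOps Engineer",
            "Automation Testing", "HR", "HR"]
toy_pred = ["DevOps Engineer", "DevOps Engineer", "Automation Testing",
            "Automation Testing", "HR", "HR"]

# Rows are true classes, columns are predicted classes, both ordered as `labels`
cm = confusion_matrix(toy_true, toy_pred, labels=labels)
print(cm)

# Off-diagonal entries are the misclassifications
for i, true_label in enumerate(labels):
    for j, pred_label in enumerate(labels):
        if i != j and cm[i, j] > 0:
            print(f"{cm[i, j]} x {true_label} predicted as {pred_label}")
```

In this notebook the equivalent call would presumably be `confusion_matrix(y_test, lr_preds, labels=lr_model.classes_)`.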

**Train Naive Bayes Model**

In [24]:
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict & Evaluate
nb_preds = nb_model.predict(X_test)
print("Naive Bayes Classification Report:")
print(classification_report(y_test, nb_preds)) # comparison of true labels (y_test) vs predicted labels (nb_preds)

Naive Bayes Classification Report:
                           precision    recall  f1-score   support

                 Advocate       1.00      0.50      0.67         4
                     Arts       1.00      1.00      1.00         7
       Automation Testing       1.00      0.60      0.75         5
               Blockchain       1.00      1.00      1.00         8
         Business Analyst       1.00      1.00      1.00         6
           Civil Engineer       1.00      1.00      1.00         5
             Data Science       1.00      0.88      0.93         8
                 Database       1.00      1.00      1.00         7
          DevOps Engineer       1.00      0.91      0.95        11
         DotNet Developer       1.00      0.40      0.57         5
            ETL Developer       1.00      1.00      1.00         8
   Electrical Engineering       1.00      1.00      1.00         6
                       HR       1.00      0.78      0.88         9
                   Hadoop 

**Insights from Naive Bayes Report**
- **Accuracy ~94%** → good but lower than Logistic Regression (99%).
- **Strong classes:** Blockchain, Business Analyst, Civil/Mechanical Engineer, Database, etc. got perfect scores.
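- **Weaker classes (recall):** DotNet Developer (0.40), Advocate (0.50), Automation Testing (0.60), HR (0.78), Data Science (0.88). Precision is still 1.00 for these classes, so the errors are **missed samples** that were assigned to other categories.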
- **Java & Python Developers** → high recall but lower precision (some false positives).
- **Macro avg recall = 0.92** → every class counts equally in this average, so the weak small classes pull it below the overall accuracy.
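
Macro-averaged recall is the unweighted mean of the per-class recalls. The following sketch computes it by hand on hypothetical labels (one large class predicted perfectly, one small class half missed) to show why it falls below accuracy when small classes are weak.

```python
from collections import Counter

# Hypothetical labels: a large class predicted perfectly, a small one half missed
toy_true = ["Big"] * 16 + ["Small"] * 4
toy_pred = ["Big"] * 16 + ["Small", "Small", "Big", "Big"]

support = Counter(toy_true)
correct = Counter(t for t, p in zip(toy_true, toy_pred) if t == p)

# Recall per class = correctly predicted samples / true samples of that class
recall = {label: correct[label] / support[label] for label in support}
macro_recall = sum(recall.values()) / len(recall)
accuracy = sum(correct.values()) / len(toy_true)

print(recall)
print(f"accuracy={accuracy:.2f}, macro recall={macro_recall:.2f}")
```

Here two missed samples out of twenty cost 10 points of accuracy but 25 points of macro recall, because the small class carries the same weight as the large one.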

**Comparison of models**
- Naive Bayes works decently but struggles with certain classes, especially small/overlapping ones. 
- Logistic Regression is clearly more robust on this test split.
- We will therefore save the Logistic Regression model.
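
This comparison rests on a single 80/20 split, and the vectorizer in this notebook is fitted on all 962 resumes before splitting, so the test rows influence the vocabulary and IDF weights. A more conservative check, sketched here on a hypothetical toy corpus, puts the vectorizer and classifier in one pipeline and scores both models with stratified cross-validation. The corpus, `candidates`, and `cv_scores` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical); in the notebook these would be X and y
toy_texts = [
    "python pandas machine learning", "deep learning python numpy",
    "statistics python model training", "machine learning data analysis",
    "java spring hibernate backend", "java servlet jsp developer",
    "spring boot java microservice", "java backend rest api",
]
toy_labels = ["Data Science"] * 4 + ["Java Developer"] * 4

candidates = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Naive Bayes": MultinomialNB(),
}
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

cv_scores = {}
for name, clf in candidates.items():
    # Inside each split, the vectorizer is refitted on the training folds only
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    cv_scores[name] = cross_val_score(
        pipe, toy_texts, toy_labels, cv=cv, scoring="f1_macro"
    )
    print(f"{name}: mean macro F1 = {cv_scores[name].mean():.2f}")
```

On the real data, a larger `n_splits` (for example 5) would be a more typical choice; two folds are used here only because the toy corpus is tiny.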

**Save Best Model + Vectorizer**

In [25]:
joblib.dump(lr_model, "src/resume_model.pkl")
joblib.dump(tfidf, "src/tfidf.pkl")

['src/tfidf.pkl']
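
For later use, both artifacts need to be loaded together, and new text has to pass through the saved vectorizer's `transform` before prediction. The sketch below is self-contained: it trains a tiny stand-in model on hypothetical data and writes to a temporary directory instead of `src/`, so the paths and `toy_*` names are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (hypothetical data) so the sketch runs on its own
toy_texts = ["python pandas machine learning", "deep learning python numpy",
             "java spring hibernate backend", "java servlet jsp developer"]
toy_labels = ["Data Science", "Data Science", "Java Developer", "Java Developer"]
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2))
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_tfidf.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "resume_model.pkl")
    tfidf_path = os.path.join(tmp_dir, "tfidf.pkl")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_tfidf, tfidf_path)

    # Later, e.g. in another script: load both artifacts
    loaded_model = joblib.load(model_path)
    loaded_tfidf = joblib.load(tfidf_path)

# Use transform (not fit_transform) so the saved vocabulary is reused
new_resume = ["java backend developer with spring experience"]
prediction = loaded_model.predict(loaded_tfidf.transform(new_resume))[0]
print(prediction)
```

A real resume would presumably also need the same text preprocessing that produced the `final_resume` column before it is vectorized.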