## Candidate Ranking and Re-Ranking with Starring

This section builds the candidate ranking pipeline and tests how it responds to user feedback. Key steps and highlights include:

- **Data-Driven Insights**: Automating the ranking from `fit` scores and refining it with user feedback, with the aim of faster and more consistent screening decisions.
- **Two-Step Ranking Process**:
  1. **Initial Ranking**: Candidates are ranked based on their initial fitness scores (`fit`), providing a baseline.
  2. **Re-Ranking with Starring**: Feedback from users (e.g., starring preferred candidates) dynamically updates the ranking model using Learning-to-Rank techniques.

In [6]:
import os
import sys


try:
    from google.colab import drive
    drive.mount('/content/drive')
    root_dir = "/content/drive/MyDrive/wdir/repos/Apziva/3-potential_talents/"
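    os.chdir(root_dir)  # relative paths such as Path("data") below assume this working directory

except ImportError:
    # Walk up until the working directory is the one that contains 'potential_talents'
    while 'potential_talents' not in os.listdir('.'):
        if os.path.dirname(os.getcwd()) == os.getcwd():
            raise FileNotFoundError("'potential_talents' not found in any parent directory")
        os.chdir('..')
    root_dir = os.getcwd()  # set even when the loop body never runs

    # Append the project root to sys.path to import custom functions
    sys.path.append('.')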
    
%pwd

'/home/sagemaker-user/3-potential_talents'
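
The upward directory walk in the cell above can also be written without changing the working directory. The following is a minimal sketch, assuming a hypothetical helper named `find_project_root`; the marker name `potential_talents` is taken from the cell above, and everything else is illustrative.

```python
from pathlib import Path


def find_project_root(start, marker="potential_talents"):
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    # Unlike an unbounded while-loop, this fails loudly at the filesystem root
    raise FileNotFoundError(f"{marker!r} not found in {start} or any parent directory")
```

A call such as `root_dir = find_project_root(Path.cwd())` would then give an absolute root for the paths used in later cells, without relying on `os.chdir`.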

### **1. Dependencies**

In [7]:
from pathlib import Path
from IPython.display import display, Markdown, clear_output
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

from geopy.distance import geodesic
from fuzzywuzzy import fuzz
import toml
import json

SEED = 42
connection_threshold = 50

data_path = Path("data")
data = pd.read_parquet(data_path / "interim" / "encoded.parquet", columns=['job_title'])

credentials_path = Path(root_dir) / "config" / "credentials.json"
with open(credentials_path, "r") as file:
    credentials = json.load(file)

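keywords_path = Path(root_dir) / "config" / "search_terms.toml"
search_config = toml.load(keywords_path)  # parse the file once
target_keywords = search_config['search_phrases']
# NOTE: this reads the same 'search_phrases' key as target_keywords, which looks like a
# copy-paste slip; point it at the location entry of search_terms.toml if one is defined.
target_location = search_config['search_phrases']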

# API and credentials setup
API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/msmarco-distilbert-base-tas-b"
headers = {"Authorization": f"Bearer {credentials['HUGGINGFACE_TOKEN']}"}

**Key Highlights**:

- `xgboost` provides `XGBRanker`, the gradient-boosted ranker used below for Learning-to-Rank.
- `pandas` handles loading, sorting, and filtering of the candidate table.
- `IPython.display` (`display`, `Markdown`, `clear_output`) redraws the candidate table in place during the interactive starring loop.

### **2. Initial Setup**

#### Data Loading and Initial Ranking

In [8]:
data_path = Path("data")
data = pd.read_parquet(data_path / "processed" / "grouped_results.parquet")\
         .sort_values('fit', ascending=False)
# Initial Ranking
data['rank'] = range(1, len(data) + 1)
data['is_starred'] = 0
data = data.reset_index()
data.head()


| | job_title | fit | rank | is_starred |
|---|---|---|---|---|
| 0 | human resources staffing and recruiting profes... | 0.752753 | 1 | 0 |
| 1 | retired army national guard recruiter office m... | 0.728867 | 2 | 0 |
| 2 | aspiring human resources professional an ener... | 0.727303 | 3 | 0 |
| 3 | aspiring human resources manager seeking inter... | 0.720167 | 4 | 0 |
| 4 | human resources coordinator at intercontinenta... | 0.717091 | 5 | 0 |


- The `fit` score is the initial estimate of how well a candidate suits the role.
- `rank` is assigned by sorting on `fit` in descending order (rank 1 = highest `fit`).
- `is_starred` starts at 0 for every candidate and is set to 1 when a reviewer stars that candidate.
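
The ranking step described above can be sketched on toy data. The titles and scores below are made up for illustration, and using `rank(method="first")` to keep ranks unique when two `fit` values tie is an assumption about the desired tie behaviour, not something the notebook specifies.

```python
import pandas as pd

# Toy candidates; titles and scores are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "job_title": ["hr generalist", "recruiter", "data analyst", "hr manager"],
    "fit": [0.72, 0.68, 0.41, 0.72],
})

# Sort by descending fit; a stable sort keeps the input order among tied scores
toy = toy.sort_values("fit", ascending=False, kind="stable").reset_index(drop=True)

# Gives 1..n on the sorted frame, with ties broken by position
toy["rank"] = toy["fit"].rank(method="first", ascending=False).astype(int)
toy["is_starred"] = 0  # no feedback recorded yet
```

On an already sorted frame this matches `range(1, len(toy) + 1)`; the `rank` call just makes the tie rule explicit.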

### **3. Incorporating User Feedback for Re-Ranking**

#### **Dynamic Starring Functionality**

In [18]:

def star_candidate(data):
    """
    Interactively star candidates by rank or by exact job title.

    Sets `is_starred` to 1 on `data` in place; ranks are not recomputed here.
    The function continues until the user types 'exit' or 'q'.
    """
    while True:
        # Clear the previous output
        clear_output(wait=True)

        # Display the current candidates table
        display(Markdown("## Current Candidates:"))
        display(data[['job_title', 'fit', 'rank', 'is_starred']])

        # Get user input
        user_input = input("\nEnter the job title or rank of the candidate to star (or type 'exit' or 'q' to quit): ").strip()

        # Exit condition
        if user_input.lower() in ['exit', 'q']:
            print("Exiting the star candidate selection.")
            break

        try:
            if user_input.isdigit():
                rank = int(user_input)
                if rank in data['rank'].values:
                    data.loc[data['rank'] == rank, 'is_starred'] = 1
                    print(f"\nCandidate with rank {rank} has been starred.")
                else:
                    print("Invalid rank. Please try again.")
            elif user_input in data['job_title'].values:
                data.loc[data['job_title'] == user_input, 'is_starred'] = 1
                print(f"\nCandidate '{user_input}' has been starred.")
            else:
                print("Invalid job title or rank. Please try again.")
        except Exception as e:
            print(f"Error: {e}")

    return data

data = star_candidate(data)

## Current Candidates:

| | job_title | fit | rank | is_starred |
|---|---|---|---|---|
| 0 | human resources staffing and recruiting profes... | 0.752753 | 1 | 1 |
| 1 | retired army national guard recruiter office m... | 0.728867 | 2 | 0 |
| 2 | aspiring human resources professional an ener... | 0.727303 | 3 | 1 |
| 3 | aspiring human resources manager seeking inter... | 0.720167 | 4 | 0 |
| 4 | human resources coordinator at intercontinenta... | 0.717091 | 5 | 0 |
| 5 | experienced retail manager and aspiring human ... | 0.715438 | 6 | 0 |
| 6 | not tech is seeking human resources payroll a... | 0.713653 | 7 | 0 |
| 7 | human resources manager at not tech shine nort... | 0.713335 | 8 | 1 |
| 8 | aspiring human resources professional passion... | 0.710101 | 9 | 1 |
| 9 | aspiring human resources manager graduating m... | 0.710046 | 10 | 0 |


Exiting the star candidate selection.


- The `star_candidate` function records reviewer preferences in `is_starred`; the ranking itself changes only after the model in the next section is trained on those stars.
- This interactivity keeps human oversight central to the process.
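
For scripted or repeatable runs, the same starring step can be done without `input()`. This is a minimal sketch, assuming a hypothetical helper `star_by_rank` and the column names used above; unlike `star_candidate`, it returns a copy rather than modifying the frame in place.

```python
import pandas as pd


def star_by_rank(df, ranks):
    """Return a copy of `df` with `is_starred` set to 1 for the given ranks."""
    ranks = list(ranks)
    unknown = set(ranks) - set(df["rank"])
    if unknown:
        raise ValueError(f"Unknown ranks: {sorted(unknown)}")
    out = df.copy()
    out.loc[out["rank"].isin(ranks), "is_starred"] = 1
    return out


# Toy frame with the same columns as the candidate table
toy_candidates = pd.DataFrame({
    "job_title": ["a", "b", "c"],
    "fit": [0.9, 0.8, 0.7],
    "rank": [1, 2, 3],
    "is_starred": 0,
})
starred_candidates = star_by_rank(toy_candidates, [1, 3])
```

Rejecting unknown ranks up front mirrors the "Invalid rank" branch of the interactive version.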

### **4. Re-Ranking with Learning-to-Rank (LTR)**

#### Pairwise Data Generation for Training

In [19]:
def generate_pairwise_data(df):
    """Generate pairwise training data for LTR."""
    starred = df[df['is_starred'] == 1]
    not_starred = df[df['is_starred'] == 0]
    
    X, y = [], []
    for _, s_row in starred.iterrows():
        for _, ns_row in not_starred.iterrows():
            X.append([s_row['fit'] - ns_row['fit']])
            y.append(1)  # Positive pair: Starred is better
            
            X.append([ns_row['fit'] - s_row['fit']])
            y.append(0)  # Negative pair: Not-starred is worse
    return np.array(X), np.array(y)


X, y = generate_pairwise_data(data)

# Train-test split (X_test and y_test are held out but not used later in this section)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

print("Pairwise Training Data:")
print(X[:5], y[:5])

Pairwise Training Data:
[[ 0.02388589]
 [-0.02388589]
 [ 0.0325863 ]
 [-0.0325863 ]
 [ 0.03566164]] [1 0 1 0 1]
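
The nested `iterrows` loops above visit every starred/not-starred pair, which can be slow on larger tables. A vectorised equivalent using NumPy broadcasting might look like the sketch below; `pairwise_diffs` is a hypothetical name, and the four `fit` values are the rounded scores of the first four candidates shown earlier.

```python
import numpy as np


def pairwise_diffs(fit, is_starred):
    """Build mirrored (starred - not_starred) differences and their labels."""
    fit = np.asarray(fit, dtype=float)
    mask = np.asarray(is_starred).astype(bool)
    starred, not_starred = fit[mask], fit[~mask]
    # Row i, column j holds starred[i] - not_starred[j]; ravel() keeps the loop order
    d = (starred[:, None] - not_starred[None, :]).ravel()
    # Interleave each difference with its mirror: d0, -d0, d1, -d1, ...
    X = np.stack([d, -d], axis=1).reshape(-1, 1)
    y = np.tile([1, 0], d.size)
    return X, y


X_demo, y_demo = pairwise_diffs(
    fit=[0.752753, 0.728867, 0.727303, 0.720167],
    is_starred=[1, 0, 1, 0],
)
```

Note that a starred candidate with a lower `fit` than an unstarred one yields a negative difference that is still labelled 1; in this toy input that happens for 0.727303 versus 0.728867.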


#### Training the LambdaMART Model

In [20]:
model = xgb.XGBRanker(
    objective="rank:pairwise",
    learning_rate=0.1,
    max_depth=5,
    n_estimators=100,
    random_state=SEED
)

# A single query group: all training pairs are ranked against each other
group = [len(X_train)]

model.fit(X_train, y_train, group=group)

In [21]:
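# NOTE: the model was trained on pairwise *differences* in `fit`, but it is scored here on raw `fit` values
data['adjusted_fit'] = model.predict(data[['fit']].values)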
data = data.sort_values('adjusted_fit', ascending=False).reset_index(drop=True)
data['rank'] = range(1, len(data) + 1)
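
Because the new ranks depend entirely on `adjusted_fit`, it can help to check that the scores actually differ between candidates before relying on them. The following is a small sanity-check sketch; `rerank_summary` is a hypothetical helper and the demo values are toy numbers.

```python
import pandas as pd


def rerank_summary(df, base="fit", adjusted="adjusted_fit"):
    """Report whether `adjusted` scores discriminate and reorder candidates."""
    # Stable sorts keep the input order among ties, so constant scores change nothing
    base_order = df.sort_values(base, ascending=False, kind="stable").index.tolist()
    new_order = df.sort_values(adjusted, ascending=False, kind="stable").index.tolist()
    return {
        "distinct_adjusted_scores": int(df[adjusted].nunique()),
        "order_changed": base_order != new_order,
    }


# Toy example: identical adjusted scores cannot reorder anything
demo = pd.DataFrame({"fit": [0.75, 0.73, 0.72], "adjusted_fit": [3.87, 3.87, 3.87]})
summary = rerank_summary(demo)
```

A result with a single distinct score and `order_changed` equal to `False` would indicate that the re-ranking step had no effect on the ordering.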


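**Key Advantages**:

- **Scalability**: Gradient-boosted rankers such as LambdaMART train quickly on large candidate pools. Note that `generate_pairwise_data` emits 2 × (starred × not-starred) rows, so the training set grows with the product of the two group sizes.
- **Feedback Integration**: Starring supplies the labels the ranker learns from, so new preferences can be folded in by re-training. In the run shown in the next section, however, every listed candidate receives the same `adjusted_fit` (3.868388), so the starred feedback does not yet separate these candidates. A likely cause is that the model is trained on pairwise differences in `fit` but scored on raw `fit` values.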

### **5. Filtering Candidates**

#### Pre-Processing Filters

In [22]:
# Text filter on job titles (placeholder)
def filter_by_text(data, keywords, threshold=60):
    """Filter candidates based on fuzzy text matching.

    Placeholder: currently returns `data` unchanged, so `keywords` and `threshold` have no effect.
    """
    return data  # Implementation placeholder

filtered_data = filter_by_text(data, target_keywords, threshold=60)
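
One way the placeholder could be filled in is sketched below. The notebook imports `fuzz` from `fuzzywuzzy`, but this sketch swaps in the standard library's `difflib.SequenceMatcher` so that it runs without extra packages; its scores are not the same as `fuzzywuzzy`'s, so a threshold of 60 is not directly comparable. `partial_score` and `filter_titles` are hypothetical names, and the window-based matching is an assumption about what "fuzzy text matching" should mean here.

```python
from difflib import SequenceMatcher


def partial_score(text, keyword):
    """Best 0-100 similarity between `keyword` and any same-length word window of `text`."""
    words = text.lower().split()
    key_words = keyword.lower().split()
    if not words or not key_words:
        return 0.0
    n = len(key_words)
    target = " ".join(key_words)
    # Slide a window of n words over the text (a single window if the text is shorter)
    windows = [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    return 100.0 * max(SequenceMatcher(None, w, target).ratio() for w in windows)


def filter_titles(titles, keywords, threshold=60):
    """Keep titles whose best score against any keyword reaches `threshold`."""
    return [
        t for t in titles
        if max((partial_score(t, k) for k in keywords), default=0.0) >= threshold
    ]


titles = ["aspiring human resources professional", "retail manager"]
kept = filter_titles(titles, ["human resources"], threshold=60)
```

Applied to the candidate table, the same score could be used as a boolean mask, for example `data[data['job_title'].map(lambda t: max(partial_score(t, k) for k in keywords) >= threshold)]`.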


In [26]:
thresh = 0.7           # minimum initial fit score
adjusted_thresh = 3.8  # minimum re-ranked score
refiltered_data = filtered_data[(filtered_data.fit > thresh) & (filtered_data.adjusted_fit > adjusted_thresh)]
refiltered_data.to_csv(data_path / "processed" / "filtered.csv", index=False)
refiltered_data.to_parquet(data_path / "processed" / "filtered.parquet", index=False, compression="brotli")
display(refiltered_data)

| | job_title | fit | rank | is_starred | adjusted_fit |
|---|---|---|---|---|---|
| 0 | human resources staffing and recruiting profes... | 0.752753 | 1 | 1 | 3.868388 |
| 1 | retired army national guard recruiter office m... | 0.728867 | 2 | 0 | 3.868388 |
| 2 | aspiring human resources professional an ener... | 0.727303 | 3 | 1 | 3.868388 |
| 3 | aspiring human resources manager seeking inter... | 0.720167 | 4 | 0 | 3.868388 |
| 4 | human resources coordinator at intercontinenta... | 0.717091 | 5 | 0 | 3.868388 |
| 5 | experienced retail manager and aspiring human ... | 0.715438 | 6 | 0 | 3.868388 |
| 6 | not tech is seeking human resources payroll a... | 0.713653 | 7 | 0 | 3.868388 |
| 7 | human resources manager at not tech shine nort... | 0.713335 | 8 | 1 | 3.868388 |
| 8 | aspiring human resources professional passion... | 0.710101 | 9 | 1 | 3.868388 |
| 9 | aspiring human resources manager graduating m... | 0.710046 | 10 | 0 | 3.868388 |
