### AI Workshop: Building a Smart Hiring System with RAG-Powered Applicant Tracking

#### Introduction

Traditional Applicant Tracking Systems (ATS) often struggle with handling large volumes of data, maintaining context, and applying insights to niche domains, leading to inefficiencies and inaccuracies.

This is where Retrieval-Augmented Generation (RAG) comes into play, addressing key issues like hallucination in AI responses and providing domain-specific context.

#### What is RAG?

Retrieval-Augmented Generation (RAG) combines information retrieval with a text generation model. A standard language model answers from its training data alone; a RAG system first retrieves relevant passages from an external data source and passes them to the model as context, so the generated output is grounded in real data. This reduces the risk of hallucination, where the model produces plausible but incorrect information, and improves the system's ability to apply insights to niche domains.
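
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is only an illustration: the word-overlap retriever, the `retrieve` and `build_prompt` helpers, and the three sample strings are all hypothetical stand-ins. The workshop itself uses OpenAI embeddings and MongoDB Atlas for retrieval.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    """Put the retrieved passages in front of the question so the model answers from them."""
    context = "\n".join(f"- {passage}" for passage in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical mini-corpus standing in for parsed resumes.
resumes = [
    "Data scientist with Python and SQL experience",
    "Graphic designer skilled in Photoshop",
    "Machine learning engineer, Python and deep learning",
]

top = retrieve("Python data science experience", resumes, k=2)
prompt = build_prompt("Who has Python experience?", top)
print(prompt)
```

The prompt that reaches the language model contains only the two retrieved passages, which is what keeps the answer tied to the underlying data.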

#### Benefits to Human Resource Departments

Implementing a RAG-powered Applicant Tracking System (ATS) offers numerous benefits to HR departments:

- **Enhanced Accuracy**: By retrieving current applicant data at query time, RAG grounds the AI-generated insights and recommendations in the actual resumes.
- **Contextual Relevance**: RAG provides contextual answers tailored to specific job roles and industry requirements, improving the quality of candidate evaluations.
- **Efficiency**: Automating the retrieval and analysis of applicant data saves time and reduces the workload on HR personnel, allowing them to focus on strategic decision-making.
- **Scalability**: RAG-powered systems can handle large volumes of applications and apply the same retrieval and evaluation process to every candidate, which supports consistent assessments.

#### About the Dataset

The dataset used in this workshop is sourced from Kaggle and can be accessed [here](https://www.kaggle.com/datasets/shivani12sharma/resume-dataset-new/data). It contains a collection of resumes across various job roles and industries, stored as PDF files alongside some Word documents and images, and serves as the source data for building and testing our RAG-powered ATS. The resumes include details such as candidate skills, experience, education, and other relevant attributes, which lets us build a system capable of context-aware applicant tracking and evaluation. Only the PDF files are loaded in this workshop.


In [None]:
#@title Download Resume Dataset (Already saved to Google Drive)
!unzip "/content/drive/MyDrive/Colab Notebooks/resume/archive_resume.zip"


Archive:  /content/drive/MyDrive/Colab Notebooks/resume/archive_resume.zip
  inflating: 1.pdf                   
  inflating: 10.pdf                  
  inflating: 11.pdf                  
  inflating: 12.pdf                  
  inflating: 13.pdf                  
  inflating: 14.pdf                  
  inflating: 15.pdf                  
  inflating: 16.pdf                  
  inflating: 17.pdf                  
  inflating: 18.pdf                  
  inflating: 19.pdf                  
  inflating: 2.pdf                   
  inflating: 20.pdf                  
  inflating: 21.docx                 
  inflating: 22.docx                 
  inflating: 23.docx                 
  inflating: 24.png                  
  inflating: 25.jfif                 
  inflating: 26.jfif                 
  inflating: 27.jpg                  
  inflating: 28.jpeg                 
  inflating: 29.jfif                 
  inflating: 3.pdf                   
  inflating: 4.pdf                   
  inflating: 

# What You Need to Build the System

1. A NoSQL Vector Database - MongoDB Atlas
2. Data - Dataset of Resume Files (PDFs)
3. Large Language Model - OpenAI


## Set Up MongoDB

Setting up a free MongoDB Atlas database is straightforward. Follow these steps to create an account, set up a cluster, and connect to your database:

### 1. **Create a MongoDB Atlas Account**
1. Go to the [MongoDB Atlas website](https://www.mongodb.com/cloud/atlas/register).
2. Fill out the registration form or use an existing Google account to sign up.
3. Verify your email address if prompted.

### 2. **Create a New Project**
1. Once logged in, click on **"New Project"**.
2. Enter a name for your project (e.g., "MyFirstProject").
3. Optionally, add members and set permissions.
4. Click **"Create Project"**.

### 3. **Build a Cluster**
1. In your new project, click **"Build a Cluster"**.
2. Select the **"Shared Clusters"** tab to choose a free tier cluster.
3. Choose a cloud provider and region. The free tier offers multiple options.
4. Customize your cluster settings (or leave defaults) and click **"Create Cluster"**.

### 4. **Configure Network Access**
1. While the cluster is being created, set up network access.
2. Go to **"Network Access"** in the left-hand menu.
3. Click **"Add IP Address"**.
4. You can add your current IP address or allow access from anywhere by adding `0.0.0.0/0`. Note that allowing access from anywhere can be insecure.

### 5. **Create a Database User**
1. Go to **"Database Access"** in the left-hand menu.
2. Click **"Add New Database User"**.
3. Enter a username and password for your database user.
4. Assign roles (e.g., "Atlas Admin").
5. Click **"Add User"**.

### 6. **Connect to Your Cluster**
1. Once your cluster is created (it may take a few minutes), click **"Clusters"** in the left-hand menu.
2. Click **"Connect"** next to your cluster.
3. Choose a connection method:
    - **Connect Your Application**: Get the connection string to use in your application code.
    - **MongoDB Compass**: Use MongoDB's graphical user interface.
    - **Connect from Mongo Shell**: Use the Mongo shell to connect directly.

### 7. **Connecting with pymongo in Python**
If you choose to connect your application, follow these steps to connect using `pymongo`:

1. **Install pymongo**:
    ```bash
    pip install pymongo
    ```

2. **Get the Connection String**:
    - Select **"Connect Your Application"**.
    - Choose your driver and version (e.g., Python and 3.6 or later).
    - Copy the provided connection string.

3. **Use the Connection String in Your Code**:
    Replace `<username>`, `<password>`, `<cluster-url>`, and `<dbname>` with your database user credentials, your cluster's host name, and your database name.

    ```python
    from pymongo import MongoClient

    # Replace <password> with the password for the <username> user.
    # Replace <dbname> with the name of the database that connections will use by default.
    connection_string = "mongodb+srv://<username>:<password>@<cluster-url>/<dbname>?retryWrites=true&w=majority"

    client = MongoClient(connection_string)

    # Access a database
    db = client.get_database('<dbname>')

    # Access a collection
    collection = db.get_collection('my_collection')

    # Perform operations
    document = collection.find_one()
    print(document)
    ```

By following these steps, you'll set up a free MongoDB Atlas cluster, configure access, create a database user, and connect to the cluster using Python and `pymongo`.
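
One practical detail when filling in the connection string: a username or password that contains characters such as `@`, `:`, or `/` has to be percent-encoded, otherwise the URI is parsed incorrectly. The sketch below shows one way to do that with the standard library. The `build_connection_string` helper and all sample values are hypothetical and used for illustration only.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, cluster_url, dbname):
    """Assemble an Atlas SRV connection string with percent-encoded credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return (
        f"mongodb+srv://{user}:{pwd}@{cluster_url}/{dbname}"
        "?retryWrites=true&w=majority"
    )


# Hypothetical values for illustration only (not real credentials).
uri = build_connection_string(
    "hr_user", "p@ss:word/1", "cluster0.example.mongodb.net", "hr"
)
print(uri)
```

The resulting string can be passed to `MongoClient` exactly like the hand-written one above.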



In [None]:
#@title Install all Library Dependencies
!pip install "unstructured[all-docs]" -q
!apt-get -qq install poppler-utils tesseract-ocr
!pip install -q --user --upgrade pillow
!pip install langchain tiktoken langchain-community
!pip install pymongo==4.6.1 -q
!pip install openai -q





### Step-by-Step Guide to Setting Up a Search Index on MongoDB Atlas

#### 1. Log In to Your MongoDB Atlas Account
- Navigate to the [MongoDB Atlas website](https://cloud.mongodb.com/) and log in with your credentials.

#### 2. Select Your Cluster
- From the MongoDB Atlas dashboard, select the cluster where you want to create the search index.

#### 3. Navigate to the Collections
- Click on the **"Collections"** tab for your selected cluster to access the database and collections.

#### 4. Select Your Database and Collection
- Choose the database and the specific collection on which you want to create the search index.

#### 5. Create a Search Index
- In the collection view, click on the **"Indexes"** tab.
- Click on the **"Create Search Index"** button.

#### 6. Configure the Search Index
- **Set the Name of the Index**:
  - Set the name of the index to `default`.
- **Copy and Paste the JSON Configuration**:
  - Copy the following JSON configuration into the index configuration window:

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "embedding": {
        "type": "knnVector",
        "dimensions": 1536,
        "similarity": "cosine"
      }
    }
  }
}
```

#### 7. Explanation of the JSON Configuration
- **dynamic: false**: This setting disables dynamic mapping, meaning only fields explicitly defined in the mappings section will be indexed.
- **fields.embedding**:
  - **type: "knnVector"**: Specifies that the embedding field is a vector field that will be used for k-nearest neighbors (kNN) search.
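  - **dimensions: 1536**: Defines the number of dimensions in the vector. This must match the output size of the embedding model; 1536 is the size of the vectors produced by OpenAI's `text-embedding-ada-002` model.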
  - **similarity: "cosine"**: Sets the similarity metric to cosine similarity.
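
For intuition about what a kNN search with cosine similarity computes, here is a small brute-force sketch in plain Python. It is not how Atlas implements the search internally; the three-dimensional vectors and the `resume_a`/`resume_b`/`resume_c` names are made up for illustration, whereas the real index stores 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_nearest(query_vec, stored, k=2):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings.
stored_vectors = {
    "resume_a": [1.0, 0.0, 0.0],
    "resume_b": [0.6, 0.8, 0.0],
    "resume_c": [0.0, 0.0, 1.0],
}

nearest = k_nearest([1.0, 0.0, 0.0], stored_vectors, k=2)
print(nearest)
```

Vectors that point in the same direction as the query score close to 1, and orthogonal vectors score 0, so `resume_a` and `resume_b` are returned ahead of `resume_c`.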

#### 8. Final Steps
- **Create the Index**:
  - After pasting the JSON configuration, click on the **"Create Index"** button to create the search index.
- **Verify the Index**:
  - Once the index creation process is complete, verify that the index is listed under the **"Indexes"** tab with the name `default`.

---

By following these steps, you will successfully set up a search index on MongoDB Atlas, allowing you to perform efficient searches on vector fields using k-nearest neighbors (kNN) and cosine similarity.

In [None]:
#@title Import Libraries
from pymongo import MongoClient
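# These integrations live in the langchain-community package installed above
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_community.document_loaders import DirectoryLoader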
import json

In [None]:
#@title Set Key Variables
with open('/content/drive/MyDrive/Colab Notebooks/resume/config.json', 'r') as keys:
    secret_keys = json.load(keys)

db_name = "human-resource-rag"
collection_name = "job-applicants-gpt"
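
The rest of the notebook reads three keys from `config.json`: `mongodb_server_connection_url`, `openai_api_key`, and `openai_api_org`. As a minimal sketch, assuming the file is a flat JSON object with exactly those keys, a loader can fail early when one is missing. The `load_config` helper and the placeholder values below are hypothetical.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"mongodb_server_connection_url", "openai_api_key", "openai_api_org"}


def load_config(path):
    """Load the JSON config and raise KeyError if a required key is missing."""
    with open(path, "r") as handle:
        config = json.load(handle)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config


# Demonstration with a temporary file and placeholder values (not real credentials).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(
        {
            "mongodb_server_connection_url": "mongodb+srv://<username>:<password>@<cluster-url>/",
            "openai_api_key": "<openai-api-key>",
            "openai_api_org": "<openai-org-id>",
        },
        tmp,
    )

demo_config = load_config(tmp.name)
os.remove(tmp.name)
print(sorted(demo_config))
```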

In [None]:
client = MongoClient(secret_keys['mongodb_server_connection_url'])
collection = client[db_name][collection_name]

In [None]:
#@title Load PDF files & Create Embeddings & Load Data to Database
# Parse every PDF in /content into LangChain documents
loader = DirectoryLoader('/content', glob="*.pdf", show_progress=True)
data = loader.load()

# Embed each document with OpenAI and store the text, metadata, and vector in MongoDB
embeddings = OpenAIEmbeddings(
    api_key=secret_keys['openai_api_key'],
    organization=secret_keys['openai_api_org'],
)
vectorStore = MongoDBAtlasVectorSearch.from_documents(data, embeddings, collection=collection)

100%|██████████| 20/20 [00:34<00:00,  1.72s/it]


## Optional: Limit Network Access to the Database

To restrict which machines can connect, visit https://cloud.mongodb.com/v2/{SESSION_ID}/security/network/accessList and add your IP address.

In [None]:
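#@title Look Up the Current IP Address (to add to the MongoDB access list)
%%time
import socket

# Get the hostname of the machine
hostname = socket.gethostname()

# Get the IP address associated with the hostname.
# Note: on a hosted notebook this is usually the machine's internal address,
# which may differ from the public address that MongoDB Atlas sees.
ip_address = socket.gethostbyname(hostname)

print(f"Hostname: {hostname}")
print(f"IP Address: {ip_address}")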

Hostname: 2a5ac730c315
IP Address: 172.28.0.12
CPU times: user 1.55 ms, sys: 0 ns, total: 1.55 ms
Wall time: 1.57 ms
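
The address printed above, `172.28.0.12`, belongs to a private range, so it is probably not the address that Atlas sees for incoming connections. A quick way to check whether an address is private is the standard `ipaddress` module, sketched below. The `is_private_address` helper is a hypothetical name used only for this illustration.

```python
import ipaddress


def is_private_address(ip_text):
    """Return True if the address lies in a private (non-routable) range."""
    return ipaddress.ip_address(ip_text).is_private


print(is_private_address("172.28.0.12"))
print(is_private_address("8.8.8.8"))
```

If the check reports a private address, the entry added to the Atlas access list should be the public address of the network the notebook runs on, not the value printed by `socket.gethostbyname`.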


In [None]:
#@title Search Database
vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    secret_keys['mongodb_server_connection_url'],
    f"{db_name}.{collection_name}",  # namespace in "<database>.<collection>" form
    OpenAIEmbeddings(
        api_key=secret_keys['openai_api_key'],
        organization=secret_keys['openai_api_org'],
    ),
    index_name="default",  # must match the name of the search index created in Atlas
)

In [None]:
def execute_query(query):
    """Return the two most similar resumes for the query as plain dictionaries."""
    results = vector_search.similarity_search(query=query, k=2)

    # Convert each LangChain Document into a dict (keys include page_content and metadata)
    return [dict(result) for result in results]

In [None]:
#@title Execute First Query
query = "How many candidates have Systems Design experience?"

[i['page_content'] for i in execute_query(query)]

dict_keys(['id', 'metadata', 'page_content', 'type'])


['development.\n\nof System Engineer.\n\nImplemented various Machine Learning Algorithms including Logistic Regression, Support Vector Machine, K-• Fold Cross Validation, Random Forest, K-Nearest Neighbor, and Artificial Neural Network to achieve optimal results.\n\nAchieved the highest accuracy in predictions by utilizing the Artificial Neural Network Algorithm. •\n\nSkills\n\nCertificates\n\nCertified - Azure AI Fundamentals (AI 900)\n\n(01/2023 - Present), Certified from Azure\n\nAWS Certified Cloud Practitioner (AWS CCP) (12/2022 - 12/2025), Certified from AWS\n\nMicroso Certified - Azure Fundamentals (AZ 900) (11/2022 - Present), Certified from Azure\n\nGoogle Cloud Certified - Associate Cloud Engineer (GCP ACE) (10/2022 - 10/2025), Certified from GCP\n\nMachine Learning and Statistical Analysis Unit II (03/2020 - Present), Certified from World Quant University (WQU)\n\nScientific Computing and Python for Data Science Unit I (12/2019 - Present), Certified from World Quant Universi

In [None]:
#@title Execute Second Query
new_query = "List Candidates with Data Science Experience"

context=[i['page_content'] for i in execute_query(new_query)]
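

The context built here carries only `page_content`, so the model has no file path to report if it is asked for a document source. One possible variation, sketched below, prefixes each passage with its source. It assumes each result is a dictionary with a `page_content` string and a `metadata` dictionary that may contain a `source` entry; the `build_context` helper and the sample results are hypothetical.

```python
def build_context(search_results):
    """Pair each resume's text with its source path so the model can cite it."""
    blocks = []
    for result in search_results:
        source = result.get("metadata", {}).get("source", "unknown")
        blocks.append(f"[source: {source}]\n{result['page_content']}")
    return "\n\n".join(blocks)


# Hypothetical results shaped like the dictionaries returned by execute_query.
sample_results = [
    {"page_content": "Data Scientist, Python, SQL", "metadata": {"source": "/content/1.pdf"}},
    {"page_content": "ML Engineer, R, Deep Learning", "metadata": {}},
]

context_with_sources = build_context(sample_results)
print(context_with_sources)
```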

In [None]:
#@title Processing Output with OpenAI
from openai import OpenAI

# Use a separate name so the MongoDB `client` created earlier is not overwritten
openai_client = OpenAI(api_key=secret_keys['openai_api_key'], organization=secret_keys['openai_api_org'])

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a useful assistant. Use the assistant's content to answer the user's query. Create a markdown table with columns 'Candidate name', 'top 5 skills', 'current job title','years of experience' and 'document source' (metadata source)"},
        # The retrieved resume text is passed to the model as the assistant message
        {"role": "assistant", "content": f"{context}"},
        {"role": "user", "content": f"{new_query}"},
    ],
    temperature=0.2,
)


ChatCompletion(id='chatcmpl-9mTvuOMkDKAtkPRyqBnKJU3a4Nm1p', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here is a table listing candidates with Data Science experience:\n\n| Candidate Name | Top 5 Skills | Current Job Title | Years of Experience | Document Source |\n|----------------|--------------|-------------------|---------------------|-----------------|\n| Kunika Bhargav | Python, SQL, Machine Learning, Data Visualization, NLP | Data Scientist at Almabetter | 2+ years | Resume |\n| Siddhi Shukla  | R, Python, Machine Learning, SQL, Deep Learning | Data Scientist – Tech Lead at Legato Healthcare | 8 years | Resume |\n\nBoth candidates have significant experience in Data Science, with Kunika Bhargav having over 2 years of experience and Siddhi Shukla having 8 years of experience in the field.', role='assistant', function_call=None, tool_calls=None))], created=1721341834, model='gpt-4o-2024-05-13', object='chat.completion', ser

In [None]:
#@title TABLE: List Candidates with Data Science Experience
from IPython.display import display, Markdown

display(Markdown(response.choices[0].message.content))

Here is a table listing candidates with Data Science experience:

| Candidate Name | Top 5 Skills | Current Job Title | Years of Experience | Document Source |
|----------------|--------------|-------------------|---------------------|-----------------|
| Kunika Bhargav | Python, SQL, Machine Learning, Data Visualization, NLP | Data Scientist at Almabetter | 2+ years | Resume |
| Siddhi Shukla  | R, Python, Machine Learning, SQL, Deep Learning | Data Scientist – Tech Lead at Legato Healthcare | 8 years | Resume |

Both candidates have significant experience in Data Science, with Kunika Bhargav having over 2 years of experience and Siddhi Shukla having 8 years of experience in the field.