Implementation of the Prototype

Input Module

In [9]:
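def load_job_description(file_path):
    """Read a job description from a text file; return None if it cannot be read."""
    try:
        # An explicit encoding avoids platform-dependent decoding of the file.
        with open(file_path, 'r', encoding='utf-8') as file:
            return file.read()
    except (OSError, UnicodeDecodeError) as e:
        print(f"Error: {e}")
        return None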

# Example Usage
#job_description = load_job_description("/content/sample_data/data_scientist_jobs.txt")
job_description = load_job_description("/home/lateefat/Automated Search Strategy Generation/data/prototype_test.txt")
print(job_description)


Job Title: Data Scientist

Company: XYZ Tech Solutions

Location: New York, NY

Job Type: Full-time

Job Description:
XYZ Tech Solutions is seeking a Data Scientist to join our growing analytics team. The ideal candidate will have strong technical expertise in Python, SQL, and Machine Learning frameworks, along with experience in building predictive models and data-driven solutions.

Responsibilities:

Collect, clean, and preprocess large datasets from multiple sources.
Develop and deploy machine learning models for predictive analytics.
Perform exploratory data analysis (EDA) to identify patterns and insights.
Use tools like Python (Pandas, Scikit-learn) and SQL to build data pipelines.
Collaborate with cross-functional teams to develop business solutions.
Communicate results through data visualizations using Power BI or Tableau.
Stay updated with the latest tools and techniques in data science and AI.
Required Skills:

Strong programming skills in Python and SQL.
Experience with Mach
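
The sample posting printed above opens with labelled header lines ("Job Title:", "Company:", "Location:", "Job Type:"). When a posting follows that layout, those fields can be read directly instead of being inferred later. The following is a minimal sketch under that assumption; `parse_header_fields` and its default field list are hypothetical and not part of the prototype.

```python
import re

def parse_header_fields(text, fields=("Job Title", "Company", "Location", "Job Type")):
    """Return a dict of 'Label: value' header lines found in the text."""
    result = {}
    for field in fields:
        # Match the label at the start of a line and capture the rest of that line.
        match = re.search(rf"^{re.escape(field)}:[ \t]*(.+)$", text, flags=re.MULTILINE)
        if match:
            result[field] = match.group(1).strip()
    return result

sample = (
    "Job Title: Data Scientist\n\n"
    "Company: XYZ Tech Solutions\n\n"
    "Location: New York, NY\n\n"
    "Job Type: Full-time\n"
)
print(parse_header_fields(sample))
```

Fields that are missing from the text are simply absent from the returned dict, so callers can fall back to another extraction method for them.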

Preprocessing Module

In [10]:
import re
import spacy

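def preprocess_text(text):
    # Loading the model on every call is slow; cache it if this runs in a loop.
    nlp = spacy.load("en_core_web_sm")
    # Lowercase, then delete everything except letters and whitespace. This also
    # removes digits and hyphens, so "full-time" becomes "fulltime".
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    doc = nlp(text)
    # Keep lemmas of non-stop-word tokens. Newline runs survive as tokens
    # because they are not filtered out with token.is_space.
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    return tokens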

# Example
cleaned_tokens = preprocess_text(job_description)
print(cleaned_tokens)


['job', 'title', 'data', 'scientist', '\n\n', 'company', 'xyz', 'tech', 'solution', '\n\n', 'location', 'new', 'york', 'ny', '\n\n', 'job', 'type', 'fulltime', '\n\n', 'job', 'description', '\n', 'xyz', 'tech', 'solution', 'seek', 'data', 'scientist', 'join', 'grow', 'analytic', 'team', 'ideal', 'candidate', 'strong', 'technical', 'expertise', 'python', 'sql', 'machine', 'learn', 'framework', 'experience', 'build', 'predictive', 'model', 'datadriven', 'solution', '\n\n', 'responsibility', '\n\n', 'collect', 'clean', 'preprocess', 'large', 'dataset', 'multiple', 'source', '\n', 'develop', 'deploy', 'machine', 'learning', 'model', 'predictive', 'analytic', '\n', 'perform', 'exploratory', 'datum', 'analysis', 'eda', 'identify', 'pattern', 'insight', '\n', 'use', 'tool', 'like', 'python', 'panda', 'scikitlearn', 'sql', 'build', 'datum', 'pipeline', '\n', 'collaborate', 'crossfunctional', 'team', 'develop', 'business', 'solution', '\n', 'communicate', 'result', 'data', 'visualization', 'pow
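
The token list above contains fused words such as 'fulltime' and 'datadriven', plus newline tokens, because punctuation is deleted rather than replaced. One possible adjustment to the cleanup step is sketched below, assuming the goal is to keep hyphenated words as separate tokens; `normalize_text` is a hypothetical helper that could run before the spaCy pass.

```python
import re

def normalize_text(text):
    """Lowercase text and replace non-letter characters with single spaces."""
    # Substitute a space (not an empty string) so "full-time" becomes "full time".
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Collapse runs of spaces and newlines into one space.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Job Type: Full-time\n\nData-driven solutions (2+ years)"))
# job type full time data driven solutions years
```

Collapsing whitespace up front also means no newline tokens reach the lemmatizer.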

Keyword Extraction Module

Named Entity Recognition (NER)

In [11]:
def extract_entities(text):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    entities = {"Job Title": [], "Skills": [], "Location": []}

    for ent in doc.ents:
        # en_core_web_sm has no "TITLE" label, so in practice only ORG matches here.
        if ent.label_ in ("ORG", "TITLE"):
            entities["Job Title"].append(ent.text)
        elif ent.label_ == "GPE":  # Geopolitical entity
            entities["Location"].append(ent.text)
        else:
            # Catch-all: the stock model has no "SKILL" label, so every remaining
            # entity (dates, money, products, ...) is filed under "Skills".
            entities["Skills"].append(ent.text)
    return entities

# Example Usage
entities = extract_entities(job_description)
print("Extracted Entities:", entities)


Extracted Entities: {'Job Title': ['XYZ Tech Solutions', 'XYZ Tech Solutions', 'Data Scientist', 'SQL', 'Develop', 'EDA', 'SQL', 'Communicate', 'Power BI', 'AI', 'SQL', 'Machine Learning', 'PyTorch', 'Power BI', 'Data Science, Computer Science, Statistics', 'NLP', 'Spark', 'Hadoop'], 'Skills': ['Machine Learning', 'Pandas', 'Required Skills', 'TensorFlow', 'Matplotlib', '2+ years', '$100,000 - $130,000'], 'Location': ['New York', 'Python', 'Tableau', 'Python', 'Tableau', 'New York']}
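
The output above shows that general-purpose NER labels do not map cleanly onto skills: 'Python' and 'Tableau' land under Location, while a salary range lands under Skills. A common workaround is to match the text against a curated vocabulary. The sketch below assumes such a list exists; `SKILL_VOCAB` and `extract_skills` are hypothetical names, and the vocabulary is illustrative only.

```python
import re

# Hypothetical vocabulary; a real list would come from a skills taxonomy.
SKILL_VOCAB = ["Python", "SQL", "Machine Learning", "Pandas",
               "Scikit-learn", "Power BI", "Tableau"]

def extract_skills(text, vocab=SKILL_VOCAB):
    """Return vocabulary terms found in the text, in vocabulary order."""
    found = []
    for term in vocab:
        # Lookarounds act as word boundaries that also work for multi-word
        # or hyphenated terms, and stop "SQL" from matching inside "NoSQL".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found

sample = "Use tools like Python (Pandas, Scikit-learn) and SQL to build data pipelines."
print(extract_skills(sample))
# ['Python', 'SQL', 'Pandas', 'Scikit-learn']
```

Each term is reported once, however often it occurs, which avoids the repeated entries seen in the NER output.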


TF-IDF for Keyword Extraction

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords_tfidf(text, top_n=5):
    vectorizer = TfidfVectorizer(stop_words='english')
    # The vectorizer is fitted on a single document, so every term gets the same
    # IDF weight and the ranking reduces to normalized term frequency.
    tfidf_matrix = vectorizer.fit_transform([text])
    scores = zip(vectorizer.get_feature_names_out(), tfidf_matrix.toarray()[0])
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    return [word for word, score in sorted_scores[:top_n]]

# Example Usage
keywords = extract_keywords_tfidf(job_description)
print("Top Keywords:", keywords)


Top Keywords: ['data', 'experience', 'learning', 'machine', 'skills']
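
Generic terms such as 'data' and 'experience' rank highest above because TF-IDF only down-weights common words when it can compare the target text against other documents. The pure-Python sketch below illustrates the idea with a small background corpus; `tfidf_keywords`, the toy texts, and the smoothed IDF formula are assumptions chosen for illustration, not the prototype's method.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_keywords(target, background, top_n=3):
    """Rank the target's terms by TF-IDF computed against background documents."""
    docs = [tokenize(target)] + [tokenize(d) for d in background]
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[0])
    # Smoothed IDF: terms present in every document get the minimum weight of 1.
    scores = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

target = "data scientist python sql experience"
background = ["data analyst excel experience", "data engineer spark experience"]
print(tfidf_keywords(target, background))
# ['python', 'scientist', 'sql']
```

Here 'data' and 'experience' occur in every document and drop below the terms that are specific to the target posting.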


Abstractive Summarization (DistilBART)

In [13]:
from transformers import pipeline

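def summarize_text_bert(text):
    # With no model argument, the pipeline falls back to DistilBART
    # (sshleifer/distilbart-cnn-12-6), a BART model rather than BERT.
    summarizer = pipeline("summarization")
    # max_length=20 tokens is short enough to cut the summary mid-sentence.
    summary = summarizer(text, max_length=20, min_length=10, do_sample=False)
    return summary[0]['summary_text']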

# Example Usage
summary = summarize_text_bert(job_description)
print("BERT Summary:", summary)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


BERT Summary:  XYZ Tech Solutions is seeking a Data Scientist to join our growing analytics team . The


Query Generation Module

In [14]:
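def generate_search_query(entities, keywords):
    """Build a Boolean query string from extracted entities and TF-IDF keywords."""
    # Terms are quoted as exact phrases. Duplicates are not removed, and job
    # titles and locations are joined with AND, so every listed term must match.
    job_title = ' AND '.join([f'"{title}"' for title in entities["Job Title"]])
    skills = ' OR '.join([f'"{skill}"' for skill in entities["Skills"] + keywords])
    location = ' AND '.join([f'"{loc}"' for loc in entities["Location"]])
    return f"({job_title}) AND ({skills}) AND ({location})"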

# Example Usage
search_query = generate_search_query(entities, keywords)
print("Generated Query:", search_query)


Generated Query: ("XYZ Tech Solutions" AND "XYZ Tech Solutions" AND "Data Scientist" AND "SQL" AND "Develop" AND "EDA" AND "SQL" AND "Communicate" AND "Power BI" AND "AI" AND "SQL" AND "Machine Learning" AND "PyTorch" AND "Power BI" AND "Data Science, Computer Science, Statistics" AND "NLP" AND "Spark" AND "Hadoop") AND ("Machine Learning" OR "Pandas" OR "Required Skills" OR "TensorFlow" OR "Matplotlib" OR "2+ years" OR "$100,000 - $130,000" OR "data" OR "experience" OR "learning" OR "machine" OR "skills") AND ("New York" AND "Python" AND "Tableau" AND "Python" AND "Tableau" AND "New York")
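
The generated query repeats terms and requires every title and every location to match at once, which is unlikely to return results. One possible refinement is sketched below, assuming that terms within a group are alternatives (OR) and that the groups themselves are all required (AND); `dedupe` and `build_boolean_query` are hypothetical helpers, not part of the prototype.

```python
def dedupe(terms):
    """Drop case-insensitive duplicates while keeping first-seen order."""
    seen = set()
    unique = []
    for term in terms:
        key = term.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(term.strip())
    return unique

def build_boolean_query(titles, skills, locations):
    """OR the terms inside each group, then AND the non-empty groups together."""
    groups = []
    for terms in (titles, skills, locations):
        unique = dedupe(terms)
        if unique:  # Skip empty groups so the query never contains "()".
            groups.append("(" + " OR ".join(f'"{t}"' for t in unique) + ")")
    return " AND ".join(groups)

query = build_boolean_query(
    ["Data Scientist"],
    ["Python", "SQL", "python", "Machine Learning"],
    ["New York", "New York"],
)
print(query)
# ("Data Scientist") AND ("Python" OR "SQL" OR "Machine Learning") AND ("New York")
```

Whether OR or AND is right inside a group depends on the target search engine and on how strict the match should be, so the operators are worth treating as configurable.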
