In [42]:

"""
Project Title: Job Matching and Visualization

Problem Statement:
As part of this project, we aim to develop a system that efficiently matches job listings with resumes based on skills, experience, and education. 

Solution Overview:
To address this problem, we will implement a multi-step approach:
1. Data Preparation: We'll gather job listings and resumes data and preprocess it for indexing.
2. Indexing: We'll index the preprocessed data into Elasticsearch, a highly scalable search engine.
3. Matching Algorithm: We'll develop an algorithm to match resumes with job listings based on key criteria such as skills, experience, and education.


Approach:
- Utilize Python and Elasticsearch for data preprocessing, indexing, and matching.
- Implement NLP techniques for feature extraction and keyword matching.
- Test and iterate on the matching algorithm to improve accuracy and efficiency.
- Document each step and provide clear explanations for transparency and reproducibility.
"""




"\nProject Title: Job Matching and Visualization\n\nProblem Statement:\nAs part of this project, we aim to develop a system that efficiently matches job listings with resumes based on skills, experience, and education. \n\nSolution Overview:\nTo address this problem, we will implement a multi-step approach:\n1. Data Preparation: We'll gather job listings and resumes data and preprocess it for indexing.\n2. Indexing: We'll index the preprocessed data into Elasticsearch, a highly scalable search engine.\n3. Matching Algorithm: We'll develop an algorithm to match resumes with job listings based on key criteria such as skills, experience, and education.\n\n\nApproach:\n- Utilize Python and Elasticsearch for data preprocessing, indexing, and matching.\n- Implement NLP techniques for feature extraction and keyword matching.\n- Test and iterate on the matching algorithm to improve accuracy and efficiency.\n- Document each step and provide clear explanations for transparency and reproducibilit

In [43]:
pip install elasticsearch

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.3.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [44]:
"""
Generate Fake Job Data using Faker

This section generates synthetic job data using the Faker library while considering specific aspects required for the job data. Faker is a Python library that generates fake data such as names, addresses, and job titles, which is useful for creating realistic but synthetic datasets for testing and development purposes.

In this script, we'll use Faker to generate job-related data for various fields such as required_skills, experience_title, and required_education. We'll ensure that the generated job data aligns with certain aspects, such as the required skills, experience titles, and required education levels specified by the user.

By using Faker, we can easily create large volumes of diverse job data that closely resemble real-world job listings. This synthetic data will be useful for testing the matching algorithm and visualizing the job data in Kibana.

"""


"\nGenerate Fake Job Data using Faker\n\nThis section generates synthetic job data using the Faker library while considering specific aspects required for the job data. Faker is a Python library that generates fake data such as names, addresses, and job titles, which is useful for creating realistic but synthetic datasets for testing and development purposes.\n\nIn this script, we'll use Faker to generate job-related data for various fields such as required_skills, experience_title, and required_education. We'll ensure that the generated job data aligns with certain aspects, such as the required skills, experience titles, and required education levels specified by the user.\n\nBy using Faker, we can easily create large volumes of diverse job data that closely resemble real-world job listings. This synthetic data will be useful for testing the matching algorithm and visualizing the job data in Kibana.\n\n"

In [45]:
from faker import Faker
import random

fake = Faker()

# Provided sample skills and degrees
sample_skills = {
    "Computer Science": ["Python", "Java", "C++", "Machine Learning", "Data Analysis"],
    "Mechanical Engineering": ["CAD", "SolidWorks", "Thermodynamics", "Structural Analysis"],
    "Electrical Engineering": ["Circuit Design", "Microcontrollers", "Power Systems"],
    "Chemical Engineering": ["Chemical Process Design", "Chemical Kinetics", "Mass Transfer"],
    "Biomedical Engineering": ["Biomaterials", "Medical Imaging", "Biomechanics"],
}

sample_degrees = [
    "Bachelor of Science in Computer Science",
    "Bachelor of Engineering in Mechanical Engineering",
    "Bachelor of Science in Electrical Engineering",
    "Bachelor of Science in Chemical Engineering",
    "Bachelor of Science in Biomedical Engineering",
]

degree_to_field = {
    "Bachelor of Science in Computer Science": "Computer Science",
    "Bachelor of Engineering in Mechanical Engineering": "Mechanical Engineering",
    "Bachelor of Science in Electrical Engineering": "Electrical Engineering",
    "Bachelor of Science in Chemical Engineering": "Chemical Engineering",
    "Bachelor of Science in Biomedical Engineering": "Biomedical Engineering",
}

def generate_job_data(num_jobs=10):
    job_data = []
    for _ in range(num_jobs):
        degree = random.choice(sample_degrees)
        field = degree_to_field[degree]
        num_skills = len(sample_skills[field])
        num_skills_to_sample = min(random.randint(2, 4), num_skills)

        job = {
            "required_skills": random.sample(sample_skills[field], k=num_skills_to_sample),
            "experience_title": fake.job(),
            "required_education": degree
        }
        job_data.append(job)
    
    return job_data

# Generate job data
jobs_data = generate_job_data(1000)

# Display a few generated job data
for job in jobs_data[:10]:
    print(job)

import string

def generate_provider_name(index):
    # Generates provider names like 'Provider A', 'Provider B', ...
    return f"Provider {string.ascii_uppercase[index % 26]}"

# Add a 'provider' field to each job
for index, job in enumerate(jobs_data):
    job['provider'] = generate_provider_name(index)

# Example: Print the first 10 jobs to see the provider names
for job in jobs_data[:1000]:
    print(job)



{'required_skills': ['Biomechanics', 'Medical Imaging', 'Biomaterials'], 'experience_title': 'Information systems manager', 'required_education': 'Bachelor of Science in Biomedical Engineering'}
{'required_skills': ['Medical Imaging', 'Biomechanics', 'Biomaterials'], 'experience_title': 'Prison officer', 'required_education': 'Bachelor of Science in Biomedical Engineering'}
{'required_skills': ['Chemical Process Design', 'Chemical Kinetics'], 'experience_title': 'TEFL teacher', 'required_education': 'Bachelor of Science in Chemical Engineering'}
{'required_skills': ['Machine Learning', 'Python', 'C++'], 'experience_title': 'Building control surveyor', 'required_education': 'Bachelor of Science in Computer Science'}
{'required_skills': ['Circuit Design', 'Power Systems', 'Microcontrollers'], 'experience_title': 'Occupational therapist', 'required_education': 'Bachelor of Science in Electrical Engineering'}
{'required_skills': ['Biomaterials', 'Biomechanics', 'Medical Imaging'], 'experie

In [46]:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [47]:
pip install faker

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.3.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [48]:

from faker import Faker
import random

fake = Faker()






# Sample data
sample_degrees = [
    "Bachelor of Science in Computer Science",
    "Bachelor of Engineering in Mechanical Engineering",
    "Bachelor of Science in Electrical Engineering",
    "Bachelor of Science in Chemical Engineering",
    "Bachelor of Science in Biomedical Engineering",
]

sample_skills = {
    "Computer Science": ["Python", "Java", "C++", "Machine Learning", "Data Analysis"],
    "Mechanical Engineering": ["CAD", "SolidWorks", "Thermodynamics", "Structural Analysis"],
    "Electrical Engineering": ["Circuit Design", "Microcontrollers", "Power Systems"],
    "Chemical Engineering": ["Chemical Process Design", "Chemical Kinetics", "Mass Transfer"],
    "Biomedical Engineering": ["Biomaterials", "Medical Imaging", "Biomechanics"],
    
}


import random

def generate_experience(degree):
    
    experience_titles = {
        "Bachelor of Science in Computer Science": ["Software Developer", "Data Analyst"],
        "Bachelor of Science in Electrical Engineering": ["Electrical Engineer", "Systems Engineer"],
        "Bachelor of Science in Biomedical Engineering": ["Biomedical Engineer", "Clinical Engineer"],
        "Bachelor of Engineering in Mechanical Engineering": ["Mechanical Engineer", "Design Engineer"],
        "Bachelor of Science in Chemical Engineering": ["Chemical Engineer", "Process Engineer"],
        
    }

    # List to hold experience
    experiences = []

    # Generate a random number of experiences (1 to 3 for this example)
    num_experiences = random.randint(1, 3)

    for _ in range(num_experiences):
        title = random.choice(experience_titles.get(degree, ["General Position"]))
        duration = f"{random.randint(1, 5)} years"  # Random duration between 1 to 5 years

        experiences.append({"title": title, "duration": duration})

    return experiences





# Function to generate a synthetic resume
def generate_resume():
    # Randomly select a degree
    degree = random.choice(sample_degrees)
    
    # Extract the field from the degree (splitting by 'in' and taking the last part)
    field = degree.split('in')[-1].strip()
    
    # Ensure the field is in sample_skills, or choose a random field if not found
    if field not in sample_skills:
        field = random.choice(list(sample_skills.keys()))
    
    # Randomly select skills for the resume, limiting the number of skills to 2-4
    num_skills = random.randint(2, min(4, len(sample_skills[field])))
    skills = random.sample(sample_skills[field], k=num_skills)

    # Generate a synthetic resume with skills, experience, and education
    return {
        "Skills": skills,
        "Experience": generate_experience(degree),
        "Education": degree
    }

# Generate a list of synthetic resumes
num_of_resumes = 1000  # You can change this to generate more resumes
synthetic_resumes = [generate_resume() for _ in range(num_of_resumes)]

# Display the generated resumes
for resume in synthetic_resumes:
    print(resume)



{'Skills': ['Data Analysis', 'C++', 'Machine Learning', 'Python'], 'Experience': [{'title': 'Electrical Engineer', 'duration': '4 years'}, {'title': 'Electrical Engineer', 'duration': '3 years'}, {'title': 'Electrical Engineer', 'duration': '2 years'}], 'Education': 'Bachelor of Science in Electrical Engineering'}
{'Skills': ['Java', 'Data Analysis'], 'Experience': [{'title': 'Systems Engineer', 'duration': '3 years'}, {'title': 'Systems Engineer', 'duration': '3 years'}, {'title': 'Systems Engineer', 'duration': '4 years'}], 'Education': 'Bachelor of Science in Electrical Engineering'}
{'Skills': ['Biomaterials', 'Medical Imaging'], 'Experience': [{'title': 'Electrical Engineer', 'duration': '3 years'}, {'title': 'Systems Engineer', 'duration': '3 years'}, {'title': 'Systems Engineer', 'duration': '2 years'}], 'Education': 'Bachelor of Science in Electrical Engineering'}
{'Skills': ['Chemical Kinetics', 'Chemical Process Design', 'Mass Transfer'], 'Experience': [{'title': 'Biomedical 

In [49]:
def extract_keywords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    keywords = [word for word in word_tokens if word.isalpha() and word not in stop_words]
    return keywords


def extract_keywords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    keywords = [word for word in word_tokens if word.isalpha() and word not in stop_words]
    return keywords


for job in jobs_data:
    job['keywords'] = extract_keywords(' '.join(job['required_skills']) + ' ' + job['experience_title'] + ' ' + job['required_education'])

for resume in synthetic_resumes:
    resume['keywords'] = extract_keywords(' '.join(resume['Skills']) + ' ' + ' '.join([exp['title'] for exp in resume['Experience']]) + ' ' + resume['Education'])

    def match_score(job_keywords, resume_keywords):
     return len(set(job_keywords) & set(resume_keywords))

matches = []
for resume in synthetic_resumes:
    for job in jobs_data:
        score = match_score(job['keywords'], resume['keywords'])
        matches.append((score, job, resume))

# Sort matches based on score
matches.sort(reverse=True, key=lambda x: x[0])


for score, job, resume in matches[:500]:  # Display top 10 matches
    print(f"Match Score: {score}\nJob: {job['experience_title']}\nResume: {resume['Experience'][0]['title']}\n")



Match Score: 10
Job: Engineer, control and instrumentation
Resume: Biomedical Engineer

Match Score: 10
Job: Engineer, technical sales
Resume: Biomedical Engineer

Match Score: 10
Job: Engineer, electrical
Resume: Biomedical Engineer

Match Score: 10
Job: Engineer, biomedical
Resume: Biomedical Engineer

Match Score: 10
Job: Engineer, control and instrumentation
Resume: Biomedical Engineer

Match Score: 10
Job: Engineer, chemical
Resume: Biomedical Engineer

Match Score: 10
Job: Engineer, control and instrumentation
Resume: Systems Engineer

Match Score: 10
Job: Engineer, technical sales
Resume: Systems Engineer

Match Score: 10
Job: Engineer, electrical
Resume: Systems Engineer

Match Score: 10
Job: Engineer, biomedical
Resume: Systems Engineer

Match Score: 10
Job: Engineer, control and instrumentation
Resume: Systems Engineer

Match Score: 10
Job: Engineer, chemical
Resume: Systems Engineer

Match Score: 10
Job: Engineer, control and instrumentation
Resume: Systems Engineer

Match S

In [50]:
resume_mapping = {
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "Skills": {"type": "text", "analyzer": "custom_analyzer"},
            "Experience": {
                "type": "nested",
                "properties": {
                    "title": {"type": "text", "analyzer": "custom_analyzer"},
                    "duration": {"type": "text", "analyzer": "custom_analyzer"}
                }
            },
            "Education": {"type": "text", "analyzer": "custom_analyzer"}
        }
    }
}



In [51]:
from elasticsearch import Elasticsearch

# Connect to Elasticsearch
es = Elasticsearch("http://localhost:9200")

# Define custom analyzer (if needed)
custom_analyzer = {
    "analyzer": {
        "custom_lowercase_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase"]
        }
    }
}

# Define mapping for job listing index
job_mapping = {
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_lowercase_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "required_skills": {"type": "text", "analyzer": "custom_lowercase_analyzer"},
            "experience_title": {"type": "text", "analyzer": "custom_lowercase_analyzer"},
            "required_education": {"type": "text", "analyzer": "custom_lowercase_analyzer"},
            "keywords": {"type": "text", "analyzer": "custom_lowercase_analyzer"},
            "provider": {"type": "text", "analyzer": "custom_lowercase_analyzer"}  # New field
        }
    }
}

resume_mapping = {
    "settings": custom_analyzer,
    "mappings": {
        "properties": {
            "skills": {"type": "text", "analyzer": "custom_lowercase_analyzer"},
            "experience": {"type": "text", "analyzer": "custom_lowercase_analyzer"},
            "education": {"type": "text", "analyzer": "custom_lowercase_analyzer"},
            "keywords": {"type": "text", "analyzer": "custom_lowercase_analyzer"}
        }
    }
}



In [52]:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")



def job_data_generator(job_data):
    for job in job_data:
        yield {
            "_index": "job_listings",
            "_id": job.get("id"),
            "_source": job
        }

es.indices.delete(index='job_listings', ignore=[400, 404])

# Create the new index with the updated mapping
es.indices.create(index='job_listings', body=job_mapping)

# Re-index your job data
helpers.bulk(es, job_data_generator(jobs_data))





# Function to create a generator yielding resume data for bulk indexing
def resume_data_generator(resume_data):
    for resume in resume_data:
        yield {
            "_index": "resumes",
            "_id": resume.get("id"),  # Assuming each resume has a unique 'id' field
            "_source": resume
        }





# Bulk indexing resume data
helpers.bulk(es, resume_data_generator(synthetic_resumes))

print("Indexing complete.")

  es.indices.delete(index='job_listings', ignore=[400, 404])
  es.indices.delete(index='job_listings', ignore=[400, 404])
  es.indices.create(index='job_listings', body=job_mapping)
  helpers.bulk(es, job_data_generator(jobs_data))
  helpers.bulk(es, resume_data_generator(synthetic_resumes))


Indexing complete.


In [53]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

input_resume = {
    "skills": ['CAD', 'Thermodynamics', 'SolidWorks'],  # Relevant skills for mechanical engineering
    "education": "Bachelor's in Mechanical Engineering",
    "experience": "Mechanical Design"  # Optional: Include if experience titles are relevant
}


# Formulate a more targeted but not overly complex search query
query_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"required_skills": {"query": skill, "boost": 1.2}} for skill in input_resume["skills"]},
                {"match_phrase": {"required_education": input_resume["education"]}},
                # Optional: Include if experience titles are relevant
                {"match": {"experience_title": input_resume["experience"]}}
            ],
            "minimum_should_match": 1
        }
    },
    "size": 10  # Adjust the number of results as needed
}

# Search the job listings index
results = es.search(index="job_listings", body=query_body)

# Print the matched job listings
print("Matched Job Listings:")
for hit in results['hits']['hits']:
    job_listing = hit['_source']
    provider = job_listing.get('provider', 'Unknown Provider')  # Default to 'Unknown Provider' if not present
    print(f"Job Title: {job_listing['experience_title']}, Provider: {job_listing.get('provider', 'Unknown')}, Score: {hit['_score']}")






Matched Job Listings:
Job Title: Mechanical engineer, Provider: Provider U, Score: 6.9844527
Job Title: Programmer, applications, Provider: Provider V, Score: 3.003996
Job Title: Commercial art gallery manager, Provider: Provider Z, Score: 3.003996
Job Title: Agricultural engineer, Provider: Provider T, Score: 3.003996
Job Title: Intelligence analyst, Provider: Provider D, Score: 3.003996
Job Title: Solicitor, Scotland, Provider: Provider B, Score: 3.003996
Job Title: Community education officer, Provider: Provider K, Score: 3.003996
Job Title: Warden/ranger, Provider: Provider R, Score: 3.003996
Job Title: Materials engineer, Provider: Provider U, Score: 3.003996
Job Title: Tourist information centre manager, Provider: Provider L, Score: 3.003996


  results = es.search(index="job_listings", body=query_body)
