## Project Summary – SmartMatch: AI-Powered CV & Job Description Matcher

This notebook implements **SmartMatch**, an AI system that automatically compares a candidate’s CV with multiple job descriptions to determine the best-fit positions. The approach combines semantic similarity (using sentence embeddings) with skill-level matching (using Named Entity Recognition).

---

### Pipeline Overview

- **Library Setup**  
  Imports core libraries: `pandas`, `transformers`, `scikit-learn`, and supporting NLP/embedding utilities.

- **Model Loading**  
  Loads a pre-trained **SentenceTransformer** model to convert job descriptions and CVs into dense vector embeddings for semantic comparison.

- **Data Loading**  
  Loads job descriptions from a `.csv` file and reads a user-uploaded CV file (PDF) using text extraction.

- **Embedding Generation**  
  Computes vector embeddings for each job description and the extracted CV using the sentence transformer.

- **Semantic Scoring**  
  Calculates **cosine similarity** between the CV vector and each job description vector to measure content relevance and contextual overlap. This value is used as the job's similarity score.

- **Skill Extraction with NER and Skills Scoring**  
  Applies a fine-tuned Hugging Face **NER model** (`algiraldohe/lm-ner-linkedin-skills-recognition`) to extract technical, technology-related, and soft skills from both the CV and job descriptions.
  Compares the extracted skills and computes a skill match score based on overlapping and missing skills.

- **Final Weighted Score**  
  Computes a weighted score that applies a 90% weight to the similarity score and a 10% weight to the skills score, so that the final ranking also reflects technical skill gaps for each position.

- **LLM-powered written feedback**  
  Uses Gemini to generate written feedback on the semantic fit between the CV and the job descriptions, along with points the user can improve or highlight in their CV to raise the job score.


# 0. Installs

In [1]:
!pip install pymupdf
!pip install -U sentence-transformers
!pip install torch
!pip install deep-translator
!pip install fuzzywuzzy
!pip install rapidfuzz
!pip install transformers
!pip install langdetect

Collecting pymupdf
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.5
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-c

In [6]:
# imports

# Step 1 and 2: Loading data and Pre-processing
import pandas as pd
import fitz  # PyMuPDF
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from deep_translator import GoogleTranslator
from langdetect import detect
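# Importing fuzz from both fuzzywuzzy and rapidfuzz would let the second import
# shadow the first, so only the rapidfuzz version is imported
from rapidfuzz import process, fuzz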
from datetime import datetime

# Step 3: Model
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoModelForTokenClassification, AutoTokenizer, pipeline
import torch
import numpy as np
import ast

# Step 4: Semantic Fit
import google.generativeai as genai
from google.colab import userdata
import os
from tqdm import tqdm

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [3]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 1. Data Loading (load jobs, skills and a sample CV)

## 1.1 Load Jobs

In [4]:
# Load the job listings from the mounted Drive into a pandas DataFrame

# Load Master_Data.Template.csv from drive
jobs = pd.read_csv('/content/drive/My Drive/Master_Data.Template.csv')

# Show head
jobs.head()

Unnamed: 0,id2,site,job_url,job_url_direct,title,company,location,date_posted,job_type,is_remote,job_level,job_function,listing_type,emails,description,company_industry,company_logo,search_term,country
0,gd-1009717598311,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Internship (Master Thesis) - Data & AI,Atlas Copco IAS GmbH,Bretten,2025-04-22,,0,,,organic,IAS.career@atlascopco.com,**Internship (Master Thesis) \\- Data \\& AI**...,,https://media.glassdoor.com/sql/10368/atlas-co...,Data Science,Germany
1,gd-1009717576144,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Praxissemester als Data Scientist – KI - AI (m...,XiLLeR GmbH,,2025-04-22,,1,,,organic,karriere@xiller.com,Du bist eingeschriebener Student und suchst ei...,,https://media.glassdoor.com/sql/6051596/xiller...,Data Science,Germany
2,gd-1009717972945,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Praktikant/Werkstudent Business Intelligence &...,H. Ludendorff GmbH,Darmstadt,2025-04-22,,0,,,organic,,Die H. Ludendorff GmbH ist ein Großhandel für ...,,https://media.glassdoor.com/sql/5474819/h-lude...,Data Science,Germany
3,gd-1009709002018,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,"Praktikum Data Analytics im Bereich Wellbeing,...",Deutsche Telekom AG,Bonn,2025-04-14,,0,,,organic,,**Aufgabe**\n-----------\n\nOb Pflicht\\- oder...,,https://media.glassdoor.com/sql/4092/deutsche-...,Data Science,Germany
4,gd-1009708963480,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Internship Medical Image Segmentation and Acti...,Bayer,Berlin,2025-04-14,,0,,,organic,,**Where do you want to go? What do you want to...,,https://media.glassdoor.com/sql/4245/bayer-squ...,Data Science,Germany


## 1.2 Load sample CV

In [5]:
# Load in a sample CV

def extract_text_from_pdf(file_path):
    """
    Extracts and returns all text from a PDF file.

    :param file_path: Path to the PDF file (e.g., "Profile (1).pdf")
    :return: Extracted plain text
    """
    doc = fitz.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text.strip()


pdf_path = "/content/drive/My Drive/Sample_LinkedIN_CV.pdf"  # Make sure this path matches your file location
extracted_text = extract_text_from_pdf(pdf_path)

# MANUAL STEP: Set the CV name to the name of the person whose CV you are loading,
# so that their name can be removed before the text is passed to the model (data privacy)
CV_name = "Finn Hetzler"

# Print the output of the PDF
print(extracted_text)

Contact
finn.hetzler@whu.edu
www.linkedin.com/in/finn-hetzler
(LinkedIn)
Top Skills
Python (Programming Language)
Microsoft Excel
Microsoft Power BI
Languages
Deutsch (Native or Bilingual)
Englisch (Native or Bilingual)
Spanisch (Limited Working)
Certifications
DELE B2
Master Git and GitHub in 5 Days: Go
from Zero to Hero
Foundation of Generative AI
Analyze Data with Python Skill Path
Learn the Command Line Course
Honors-Awards
NOVA SBE Merit Scholarship
Finn Hetzler
Master in Business Analytics @ Nova SBE
Germany
Summary
Relentlessly curious
Experience
confluentes e. V. – The student's consultancy at WHU
Freelance Consultant
September 2023 - March 2024 (7 months)
Mercedes-Benz.io
Data Analyst
May 2022 - July 2022 (3 months)
Lisbon, Portugal
Mercedes-Benz AG
Business Analyst
May 2021 - August 2021 (4 months)
Stuttgart, Baden-Württemberg, Deutschland
Boehm-Bezing & Cie. GmbH
Consultant
May 2020 - June 2020 (2 months)
Stuttgart, Baden-Württemberg, Deutschland
Education
Nova School of Bus

## 1.3 Load Skills Dataset

In [None]:
# Load in CSV with Skills from EU ESCO database
skills = pd.read_csv('/content/drive/My Drive/skills_en.csv')

# Show head
skills.head()

Unnamed: 0,conceptType,conceptUri,skillType,reuseLevel,preferredLabel,altLabels,hiddenLabels,status,modifiedDate,scopeNote,definition,inScheme,description
0,KnowledgeSkillCompetence,http://data.europa.eu/esco/skill/0005c151-5b5a...,skill/competence,sector-specific,manage musical staff,manage staff of music\ncoordinate duties of mu...,,released,2023-11-30T15:53:37.136Z,,,http://data.europa.eu/esco/concept-scheme/skil...,Assign and manage staff tasks in areas such as...
1,KnowledgeSkillCompetence,http://data.europa.eu/esco/skill/00064735-8fad...,skill/competence,occupation-specific,supervise correctional procedures,oversee prison procedures\nmanage correctional...,,released,2023-11-30T15:04:00.689Z,,,http://data.europa.eu/esco/concept-scheme/memb...,Supervise the operations of a correctional fac...
2,KnowledgeSkillCompetence,http://data.europa.eu/esco/skill/000709ed-2be5...,skill/competence,sector-specific,apply anti-oppressive practices,apply non-oppressive practices\napply an anti-...,,released,2023-11-28T10:45:53.54Z,,,http://data.europa.eu/esco/concept-scheme/skil...,"Identify oppression in societies, economies, c..."
3,KnowledgeSkillCompetence,http://data.europa.eu/esco/skill/0007bdc2-dd15...,skill/competence,sector-specific,control compliance of railway vehicles regulat...,monitoring of compliance with railway vehicles...,,released,2023-11-30T16:29:18.273Z,,,http://data.europa.eu/esco/concept-scheme/skil...,"Inspect rolling stock, components and systems ..."
4,KnowledgeSkillCompetence,http://data.europa.eu/esco/skill/00090cc1-1f27...,skill/competence,cross-sector,identify available services,establish available services\ndetermine rehabi...,,released,2023-11-28T10:38:49.206Z,,,http://data.europa.eu/esco/concept-scheme/memb...,Identify the different services available for ...


# 2. Text Pre-processing

Text preprocessing applies lowercasing, punctuation removal, stop word removal, stemming, and lemmatization.

## 2.1  Jobs Pre-processing

### 2.1.1 Feature selection

In [None]:
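# Select only the columns that could hold information relevant to CV matching:
# title, description, company_industry and search_term.
# The country column is also kept for the combined job text built later.
# .copy() makes this an independent DataFrame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning.
jobs_relevant_cols = jobs[['title', 'description', 'company_industry', 'search_term', 'country']].copy()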
jobs_relevant_cols.head()

Unnamed: 0,title,description,company_industry,search_term,country
0,Internship (Master Thesis) - Data & AI,**Internship (Master Thesis) \\- Data \\& AI**...,,Data Science,Germany
1,Praxissemester als Data Scientist – KI - AI (m...,Du bist eingeschriebener Student und suchst ei...,,Data Science,Germany
2,Praktikant/Werkstudent Business Intelligence &...,Die H. Ludendorff GmbH ist ein Großhandel für ...,,Data Science,Germany
3,"Praktikum Data Analytics im Bereich Wellbeing,...",**Aufgabe**\n-----------\n\nOb Pflicht\\- oder...,,Data Science,Germany
4,Internship Medical Image Segmentation and Acti...,**Where do you want to go? What do you want to...,,Data Science,Germany


### 2.1.2 Text Preprocessing

In [None]:
def translate_to_english(text: str, source_lang: str) -> str:
    """
    Translates text to English using GoogleTranslator from deep-translator.
    Handles errors and unknown words gracefully. Splits text into chunks of
    5000 characters or less, respecting word boundaries.

    Args:
        text: The input text string.
        source_lang: The source language code.

    Returns:
        Translated text in English, or the original text if an error occurs
        or if the source language is already English.
    """
    # 1. Pre-processing and early exit
    processed_text = " ".join(text.split()) # Remove excessive spaces and normalize
    if not processed_text:
        return "" # Handle empty or whitespace-only input
    if source_lang == "en":
        return processed_text

    # 2. Initialize translator
    try:
        translator = GoogleTranslator(source=source_lang, target="en")
    except Exception as e:
        print(f"Error initializing translator: {e}. Returning original text.")
        return processed_text # Could not initialize translator

    # 3. Chunking logic
    max_length = 5000 # Google Translate API has a general limit around 5000 chars
    words = processed_text.split(' ')

    if not words: # Should not happen if processed_text is not empty, but good practice
        return ""

    chunks = []
    current_chunk = words[0] # Start with the first word

    for word in words[1:]:
        if len(current_chunk) + len(word) + 1 <= max_length: # +1 for the space
            current_chunk += " " + word
        else:
            chunks.append(current_chunk)
            current_chunk = word # Start new chunk with current word
    chunks.append(current_chunk) # Add the last chunk

    # 4. Translate chunks
    translated_chunks = []
    for chunk in chunks:
        try:
            # Note: deep-translator might have its own internal retry/error handling.
            # If a chunk is empty (e.g. if original text was " "), it might error or return empty.
            if chunk: # Ensure chunk is not empty before attempting translation
                translated_chunk = translator.translate(chunk)
                translated_chunks.append(translated_chunk if translated_chunk else chunk) # Keep original if translation is None/empty
            else:
                translated_chunks.append("") # Preserve empty chunks if they somehow occur
        except Exception as e:
            print(f"Translation error for chunk: '{chunk[:50]}...': {e}. Keeping original chunk.")
            translated_chunks.append(chunk) # Keep original on error

    return " ".join(translated_chunks)

# Detect the language of each description and translate it to English
jobs_relevant_cols['description_en'] = jobs_relevant_cols['description'].apply(
    lambda text: translate_to_english(text, detect(text)))

# Detect the language of each title and translate it to English
jobs_relevant_cols['title_en'] = jobs_relevant_cols['title'].apply(
    lambda text: translate_to_english(text, detect(text)))

# Display the results
jobs_relevant_cols[['description', 'description_en']]

Translation error for chunk: '**Praktikum Data Engineer im Bereich Advanced Anal...': **Praktikum Data Engineer im Bereich Advanced Analytics (m/w/d)** **In Deutschland** \\- Berlin \\| Bonn \\| Frankfurt/Main \\| Hamburg \\| Köln \\| München Du willst Praxiserfahrung in einer internationalen Unternehmensberatung sammeln? Bei uns kannst du dein Interesse an Wirtschaftsthemen mit deinen analytischen Fähigkeiten verbinden! Werde für mindestens 10 Wochen Teil eines außergewöhnlichen Teams. Was uns besonders macht: ein starker Unternehmergeist und jede Menge Gestaltungsfreiheit. Der Advanced Analytics Bereich bei Simon\\-Kucher konzentriert sich auf die Entwicklung von Best\\-in\\-Class\\-Ansätzen und \\-Modellen, um mit Hilfe von maschinellem Lernen und fortgeschrittener Analytik komplexe Geschäftsprobleme zu lösen. Bei uns wirst du ein integraler Bestandteil der Projektteams sein, die das Wachstum unserer Kunden vorantreiben. Während deines Praktikums wirst du von unseren erfahrenen Koll


Unnamed: 0,description,description_en
0,**Internship (Master Thesis) \\- Data \\& AI**...,**Internship (Master Thesis) \\- Data \\& AI**...
1,Du bist eingeschriebener Student und suchst ei...,Are you a registered student and are you looki...
2,Die H. Ludendorff GmbH ist ein Großhandel für ...,H. Ludendorff GmbH is a wholesale for sanitary...
3,**Aufgabe**\n-----------\n\nOb Pflicht\\- oder...,** Task ** ------------ or voluntary internshi...
4,**Where do you want to go? What do you want to...,**Where do you want to go? What do you want to...
...,...,...
1306,**Azienda**\n**Herzum Software S.R.L. Uniperso...,**Azienda** **Herzum Software S.R.L. Uniperson...
1307,Who We Are\n \n \n\n**We are much more than ...,Who We Are **We are much more than just an IT ...
1308,**Organizational Setting**\n In accordance wit...,**Organizational Setting** In accordance with ...
1309,Energy Team is the leading Italian operator in...,Energy Team is the leading Italian operator in...
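
The word-boundary chunking inside `translate_to_english` can be tried out without calling the translation service. The following is a minimal sketch: `chunk_by_words` is a hypothetical helper name that does not exist in the notebook, and the small `max_length` is chosen only to make the behaviour visible.

```python
def chunk_by_words(text: str, max_length: int = 5000) -> list[str]:
    """Split text into chunks of at most max_length characters on word boundaries.

    A single word longer than max_length is kept whole in its own chunk,
    which mirrors the loop in translate_to_english.
    """
    words = " ".join(text.split()).split(" ")
    if words == [""]:  # empty or whitespace-only input
        return []
    chunks = []
    current = words[0]
    for word in words[1:]:
        # +1 accounts for the space that joins the word to the chunk
        if len(current) + len(word) + 1 <= max_length:
            current += " " + word
        else:
            chunks.append(current)
            current = word
    chunks.append(current)
    return chunks


chunks = chunk_by_words("Werde für mindestens 10 Wochen Teil eines Teams", max_length=20)
print(chunks)  # ['Werde für mindestens', '10 Wochen Teil eines', 'Teams']
```

Because chunks are cut between words, each chunk is translated without broken words, although a sentence may still be split across two chunks.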


In [None]:
# Concatenate the cells in each of the rows into a single column
jobs_relevant_cols['combined_job_text_en'] = jobs_relevant_cols['title_en'].fillna('') + " " + \
                   jobs_relevant_cols['description_en'].fillna('') + " " + \
                   jobs_relevant_cols['company_industry'].fillna('') + " " + \
                   jobs_relevant_cols['search_term'].fillna('') + " " + \
                   jobs_relevant_cols['country'].fillna('')

jobs_relevant_cols.head()


Unnamed: 0,title,description,company_industry,search_term,country,description_en,title_en,combined_job_text_en
0,Internship (Master Thesis) - Data & AI,**Internship (Master Thesis) \\- Data \\& AI**...,,Data Science,Germany,**Internship (Master Thesis) \\- Data \\& AI**...,Internship (Master Thesis) - Data & AI,Internship (Master Thesis) - Data & AI **Inter...
1,Praxissemester als Data Scientist – KI - AI (m...,Du bist eingeschriebener Student und suchst ei...,,Data Science,Germany,Are you a registered student and are you looki...,Practice semester as a data scientist - KI - A...,Practice semester as a data scientist - KI - A...
2,Praktikant/Werkstudent Business Intelligence &...,Die H. Ludendorff GmbH ist ein Großhandel für ...,,Data Science,Germany,H. Ludendorff GmbH is a wholesale for sanitary...,Intern/work student Business Intelligence & Da...,Intern/work student Business Intelligence & Da...
3,"Praktikum Data Analytics im Bereich Wellbeing,...",**Aufgabe**\n-----------\n\nOb Pflicht\\- oder...,,Data Science,Germany,** Task ** ------------ or voluntary internshi...,Internship Data Analytics in the field of Well...,Internship Data Analytics in the field of Well...
4,Internship Medical Image Segmentation and Acti...,**Where do you want to go? What do you want to...,,Data Science,Germany,**Where do you want to go? What do you want to...,Internship Medical Image Segmentation and Acti...,Internship Medical Image Segmentation and Acti...


In [24]:
def preprocess_text(text):
  """
  Preprocesses text by applying lowercasing, punctuation removal,
  stop word removal, stemming, and lemmatization.

  Args:
    text: The input text string (expected to be in English).

  Returns:
    The preprocessed tokens joined into a single space-separated string.
  """
  # Lowercasing
  text = text.lower()

  # Removing punctuation and extra spaces
  text = re.sub(r'[^\w\s]', '', text)
  text = " ".join(text.split())

  # Tokenization
  tokens = nltk.word_tokenize(text)

  stop_words = set(stopwords.words('english'))  # Default to English

  filtered_tokens = [w for w in tokens if w not in stop_words]

  # Stemming
  porter = PorterStemmer()
  stemmed_tokens = [porter.stem(w) for w in filtered_tokens]

  # Lemmatization
  lemmatizer = WordNetLemmatizer()
  lemmatized_tokens = [lemmatizer.lemmatize(w) for w in stemmed_tokens]

  # join the tokens back together into a single string
  preprocessed_text = ' '.join(lemmatized_tokens)

  return preprocessed_text
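
To see what the first steps of `preprocess_text` do without downloading any NLTK data, here is a stdlib-only sketch. `simple_clean` and the tiny `STOP_WORDS` set are illustrative assumptions. The notebook itself uses NLTK's tokenizer and English stop word list, and additionally stems and lemmatizes the tokens.

```python
import re

# Tiny illustrative stop word set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def simple_clean(text: str) -> str:
    """Lowercase, strip punctuation, and drop stop words (no stemming or lemmatization)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # same punctuation pattern as preprocess_text
    tokens = text.split()                # whitespace split instead of nltk.word_tokenize
    return " ".join(t for t in tokens if t not in STOP_WORDS)


cleaned = simple_clean("**Internship (Master Thesis)** - Data & AI in the Cloud!")
print(cleaned)  # internship master thesis data ai cloud
```

The markdown markers and symbols that appear in the scraped job descriptions disappear in the punctuation step, which is why the descriptions do not need separate markdown cleaning.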

In [None]:
# Create a copy of the relevant columns
processed_jobs = jobs_relevant_cols[['combined_job_text_en']].copy()

# Apply the preprocessing function
processed_jobs['processed_job_text_en'] = processed_jobs.apply(
  lambda row: preprocess_text(row['combined_job_text_en']), axis=1)


In [None]:
# Copy the combined and processed text columns back to jobs_relevant_cols
jobs_relevant_cols['combined_job_text_en'] = processed_jobs['combined_job_text_en']
jobs_relevant_cols['processed_job_text_en'] = processed_jobs['processed_job_text_en']



### 2.1.3 Create merged database

In [None]:
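# Merge the translated and processed text columns back onto the full jobs table.
# Note: 'description' is not a unique key, so listings that share the same description
# are matched to each other and can appear more than once in merged_jobs.
# Duplicates are dropped in the final results step.
merged_jobs = jobs.merge(jobs_relevant_cols[['description', 'title_en', 'description_en','combined_job_text_en', 'processed_job_text_en']],
                         on='description', how='left')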

### 2.1.4 Process and treat additional description data

In [None]:
merged_jobs['jobType'] = merged_jobs['job_type'].str.lower().apply(
  lambda x: "Internship" if isinstance(x, str) and "internship" in x else
        "Part-Time" if isinstance(x, str) and "parttime" in x else
        "Full-time" if isinstance(x, str) and "fulltime" in x else
        "Full-time"
)

In [None]:
level_mapping = {
  'internship': 'Internship',
  'entry level': 'Entry Level',
  'not applicable': "Entry Level",
  'mid-senior level': 'Mid-Senior Level',
  'associate': 'Mid-Senior Level',
  'executive': 'Executive',
  'director': 'Executive'
}

merged_jobs['jobLevel'] = merged_jobs['job_level'].fillna('entry level').map(level_mapping)

In [None]:
# Define a simulated salary band for each job level
salary_mapping = {
    'Internship': "Below $90K",  # Internship salary range
    'Entry Level': "$90-120K",  # Entry level salary range
    'Mid-Senior Level': "$120-150K",  # Mid-Senior Level salary range
    'Executive': "$150K+"  # Executive salary range
}

# Look up the simulated salary band for a job level (a fixed mapping, not a random draw)
def assign_simulated_salary(job_level):
    if pd.notnull(job_level):
        return salary_mapping[job_level]
    return "$90-120K"  # Missing job levels default to the entry-level band

# Fill the 'salarySim' column based on 'jobLevel'
merged_jobs['salarySim'] = merged_jobs['jobLevel'].apply(assign_simulated_salary)


In [None]:
# Calculate the difference between today and the 'date_posted' column
today = datetime.today()
merged_jobs['postedDate'] = merged_jobs['date_posted'].apply(
    lambda x: f"{(today - datetime.strptime(x, '%Y-%m-%d')).days} days ago"
    if pd.notnull(x) else np.nan
)
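
Because the cell above calls `datetime.today()`, the `postedDate` labels change every time the notebook is run. A small sketch of the same calculation with an explicit reference date makes the result reproducible. `days_ago_label` and the reference date used here are assumptions for illustration, not part of the notebook.

```python
from datetime import date, datetime
from typing import Optional


def days_ago_label(date_posted: Optional[str], today: date) -> Optional[str]:
    """Return an 'N days ago' label for a 'YYYY-MM-DD' string, or None if it is missing."""
    if not date_posted:
        return None
    posted = datetime.strptime(date_posted, "%Y-%m-%d").date()
    return f"{(today - posted).days} days ago"


# Hypothetical reference date, fixed so that the output does not depend on the run date
label = days_ago_label("2025-04-22", today=date(2025, 5, 12))
print(label)  # 20 days ago
```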

In [None]:
companies_data = pd.read_csv('/content/drive/My Drive/companies_data.csv')

In [None]:
# Create a dictionary mapping company names to industries
company_to_industry = dict(zip(companies_data['Company'], companies_data['Industry']))

# Map the 'company' column to the corresponding 'industry' values
merged_jobs['industry'] = merged_jobs['company'].map(company_to_industry)

In [None]:
# Create a dictionary mapping company names to size
company_to_size = dict(zip(companies_data['Company'], companies_data['Size']))

# Map the 'company' column to the corresponding 'size' values
merged_jobs['size'] = merged_jobs['company'].map(company_to_size)

In [None]:
merged_jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1339 entries, 0 to 1338
Data columns (total 29 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id2                    1339 non-null   object
 1   site                   1339 non-null   object
 2   job_url                1339 non-null   object
 3   job_url_direct         837 non-null    object
 4   title                  1339 non-null   object
 5   company                1329 non-null   object
 6   location               1309 non-null   object
 7   date_posted            1241 non-null   object
 8   job_type               740 non-null    object
 9   is_remote              1339 non-null   int64 
 10  job_level              375 non-null    object
 11  job_function           373 non-null    object
 12  listing_type           245 non-null    object
 13  emails                 202 non-null    object
 14  description            1339 non-null   object
 15  company_industry     

Assign skills according to job description

Model: https://huggingface.co/algiraldohe/lm-ner-linkedin-skills-recognition

In [19]:
ner = pipeline("ner", model="algiraldohe/lm-ner-linkedin-skills-recognition", aggregation_strategy="simple")

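def find_skills_in_description(description):
    """Return the unique skills that the NER model finds in a text."""
    entities = ner(description)

    # Keep only entities from the skill-related groups
    skill_groups = ["TECHNOLOGY", "TECHNICAL", "SOFT"]
    skills = [
        ent["word"]
        for ent in entities
        if ent["entity_group"] in skill_groups and ent["word"].lower() != "doe"
    ]
    return list(set(skills))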

merged_jobs["skills"] = merged_jobs["description_en"].apply(
    lambda description: find_skills_in_description(description))

Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
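
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries, and the cell above reads their `entity_group` and `word` fields. The filtering step can be tested without loading the model by passing in hand-written entities. In the sketch below, `skills_from_entities`, the mock data, and the `"OTHER"` label are all made up for illustration. Lowercasing before deduplication is an optional extra that the notebook does not apply.

```python
SKILL_GROUPS = {"TECHNOLOGY", "TECHNICAL", "SOFT"}


def skills_from_entities(entities):
    """Collect unique, lowercased skill names from NER-style entity dictionaries."""
    skills = {
        ent["word"].strip().lower()
        for ent in entities
        if ent["entity_group"] in SKILL_GROUPS
    }
    return sorted(skills)  # sorted for a stable, comparable result


# Hand-written stand-in for pipeline output (assumption, not real model output)
mock_entities = [
    {"entity_group": "TECHNOLOGY", "word": "Python"},
    {"entity_group": "TECHNOLOGY", "word": "python"},
    {"entity_group": "SOFT", "word": "communication"},
    {"entity_group": "OTHER", "word": "consulting"},
]

extracted = skills_from_entities(mock_entities)
print(extracted)  # ['communication', 'python']
```

Normalizing case matters for the later skill comparison, since set intersection treats `"Python"` and `"python"` as different skills.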


In [21]:
# Save the merged DataFrame to a CSV file
merged_jobs.to_csv('/content/drive/My Drive/merged_jobs.csv', index=False)

In [None]:
# Optional: load the merged DataFrame to avoid having to rerun all the pre-processing
# (especially the translation into English of the jobs)
# merged_jobs = pd.read_csv('/content/drive/My Drive/merged_jobs.csv')

In [22]:
merged_jobs.head()

Unnamed: 0,id2,site,job_url,job_url_direct,title,company,location,date_posted,job_type,is_remote,...,postedDate,jobType,jobLevel,industry,size,score_model1,salarySim,skills,skills_score,weighted_score
0,gd-1009717598311,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Internship (Master Thesis) - Data & AI,Atlas Copco IAS GmbH,Bretten,2025-04-22,,0,...,20 days ago,Full-time,Entry Level,Education,Mid-size (51-500),0.696794,$90-120K,"[atlas, automotive, ai, ias, large language mo...",0.2,0.547755
1,gd-1009717598311,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Internship (Master Thesis) - Data & AI,Atlas Copco IAS GmbH,Bretten,2025-04-22,,0,...,20 days ago,Full-time,Entry Level,Education,Mid-size (51-500),0.696794,$90-120K,"[atlas, automotive, ai, ias, large language mo...",0.2,0.547755
2,gd-1009717576144,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Praxissemester als Data Scientist – KI - AI (m...,XiLLeR GmbH,,2025-04-22,,1,...,20 days ago,Full-time,Entry Level,Healthcare,Large (500+),0.647171,$90-120K,"[data science, data pipelines, automation, ai,...",0.0,0.45302
3,gd-1009717972945,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Praktikant/Werkstudent Business Intelligence &...,H. Ludendorff GmbH,Darmstadt,2025-04-22,,0,...,20 days ago,Full-time,Entry Level,Marketing,Large (500+),0.698073,$90-120K,"[data science, dashboards, sql, visualization,...",0.4,0.608651
4,gd-1009709002018,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,"Praktikum Data Analytics im Bereich Wellbeing,...",Deutsche Telekom AG,Bonn,2025-04-14,,0,...,28 days ago,Full-time,Entry Level,Technology,Large (500+),0.659626,$90-120K,"[interfaces, strategy, data science, psycholog...",0.0,0.461738


## 2.2 Sample CV Pre-processing

### 2.2.1 Text Preprocessing

In [25]:
# remove personal identifiers from the CV
def remove_personal_identifiers(text, name):
  """Removes phone numbers, URLs, email addresses, and the given name from text."""

  # Patterns for common identifiers
  phone_pattern = r"\+?\d[\d -]{8,12}\d"  # Matches various phone number formats
  # The URL pattern allows "@" before the domain, so it also matches email addresses
  url_pattern = r"(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
  linkedin_pattern = r"linkedin\.com/in/[\w-]+"  # Matches LinkedIn profile URLs, including hyphenated handles

  # Apply patterns and replace with empty string
  text = re.sub(phone_pattern, "", text)
  text = re.sub(url_pattern, "", text)
  text = re.sub(linkedin_pattern, "", text)
  text = re.sub(re.escape(name), "", text)  # Escape the name so it is matched literally

  return text

# Apply the function to the extracted CV text, using CV_name from earlier
private_CV_text = remove_personal_identifiers(extracted_text, CV_name)

# Apply the same text pre-processing steps as for the job listings
processed_CV_text = preprocess_text(private_CV_text)
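
When a literal string such as a person's name is used as a regex pattern, characters like `.` act as wildcards unless the string is escaped. The sketch below shows this together with an explicit email pattern. `redact`, `EMAIL_PATTERN`, and the sample text are illustrative assumptions and are not taken from the notebook.

```python
import re

# Simple email pattern for illustration (assumption); it does not cover every valid address
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"


def redact(text: str, name: str) -> str:
    """Remove email addresses and literal occurrences of name from text."""
    text = re.sub(EMAIL_PATTERN, "", text)
    # re.escape makes regex metacharacters in the name (such as ".") match literally
    return re.sub(re.escape(name), "", text)


sample = "Contact\njane.doe@example.com\nJane A. Doe\nReference: Jane Ax Doe"
redacted = redact(sample, "Jane A. Doe")
print(repr(redacted))  # 'Contact\n\n\nReference: Jane Ax Doe'
```

Without `re.escape`, the `.` in the name would match any character, so the unrelated string "Jane Ax Doe" would also be removed.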

Assign skills from the CV

In [26]:
# Extract skills from the CV text with the same NER function used for the job descriptions
cv_skills = find_skills_in_description(extracted_text)

# 3. Calculate job scoring based on CV information

Model: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

### 3.1 Model: Load SentenceTransformer

In [29]:
# Preload models
model1 = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

### 3.2 Model: Embeddings for job descriptions

In [None]:
merged_jobs = pd.read_csv('/content/drive/My Drive/merged_jobs.csv')

In [None]:
texts = merged_jobs['processed_job_text_en'].fillna("").astype(str).tolist()

print("Calculating embeddings...")

# Calculate embeddings
job_embeddings_model1 = model1.encode(
    texts,
    batch_size=64,              # Adjust batch size based on available resources
    show_progress_bar=True,
    convert_to_numpy=True
)

# Save the embeddings to Drive, which is where section 3.3 loads them from
np.save('/content/drive/My Drive/job_embeddings_model1.npy', job_embeddings_model1)

Calculating embeddings...


### 3.3 Model: Embedding for CV text

In [30]:
# Load the job embeddings from the .npy file
job_embeddings_model1 = np.load('/content/drive/My Drive/job_embeddings_model1.npy')

print("Calculating embeddings for CV text...")
# Calculate embeddings for the CV text only once since it is the same for all rows
cv_embedding_model1 = model1.encode(processed_CV_text)

print("Embeddings calculated!")


Calculating embeddings for CV text...
Embeddings calculated!


### 3.4 Calculate Cosine Similarity

In [31]:
print("Calculating cosine similarity...")
# Calculate cosine similarity for Model 1
similarities_model1 = cosine_similarity(job_embeddings_model1, cv_embedding_model1.reshape(1, -1))
merged_jobs['score_model1'] = similarities_model1.flatten()


Calculating cosine similarity...
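As a minimal sketch of what the score above measures, the following computes cosine similarity by hand on toy 3-dimensional vectors. The vectors and the helper name `cosine` are illustrative assumptions; the notebook itself uses scikit-learn's `cosine_similarity` on the model's embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for the CV embedding and three job embeddings
cv_vec = [1.0, 2.0, 0.0]
job_vecs = [
    [2.0, 4.0, 0.0],  # same direction as the CV, different length
    [0.0, 0.0, 3.0],  # orthogonal to the CV
    [1.0, 0.0, 0.0],  # partial overlap
]

scores = [cosine(job, cv_vec) for job in job_vecs]
print([round(s, 3) for s in scores])  # [1.0, 0.0, 0.447]
```

The first job scores 1.0 even though its vector is twice as long: cosine similarity depends only on direction, not on magnitude.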


### 3.5 Calculate skills score

In [32]:
def calculate_skills_score(cv_skills, job_skills):
    """
    Calculates the fraction of a job's skills that also appear in the CV.

    Args:
      cv_skills: A set of skills extracted from the CV.
      job_skills: A list of skills extracted from the job description.

    Returns:
      A float between 0 and 1: the number of job skills found in the CV
      divided by the total number of job skills (0.0 if the job has none).
    """
    # Handle missing values (None/NaN) and empty skill lists
    if job_skills is None or isinstance(job_skills, float) or len(job_skills) == 0:
        return 0.0

    cv_skills_set = set(cv_skills)
    job_skills_set = set(job_skills)
    return len(cv_skills_set & job_skills_set) / len(job_skills_set)

# Apply the function to each job's skill list to calculate its skills score
merged_jobs['skills_score'] = merged_jobs['skills'].map(
            lambda job_skills: calculate_skills_score(cv_skills, job_skills))

In [33]:
merged_jobs["skills_score"].head(10)

Unnamed: 0,skills_score
0,0.1
1,0.1
2,0.066667
3,0.12
4,0.083333
5,0.25
6,0.111111
7,0.086957
8,0.166667
9,0.0625
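To illustrate how fractions like these arise, here is a small worked example with made-up skill lists. The skill names and the `overlap_fraction` helper are hypothetical and not taken from the dataset. It also shows that set intersection is case-sensitive, so skills need a common case before they are compared.

```python
def overlap_fraction(cv_skills, job_skills):
    """Fraction of the job's skills that also appear in the CV."""
    job_set = set(job_skills)
    if not job_set:
        return 0.0
    return len(set(cv_skills) & job_set) / len(job_set)

cv = {"python", "sql", "economics"}

# 2 of the 4 job skills ("python", "sql") are in the CV
half = overlap_fraction(cv, ["python", "pytorch", "sql", "docker"])
# Same skills, different casing: no exact string matches
case_mismatch = overlap_fraction(cv, ["Python", "SQL"])
# A job with no extracted skills scores zero rather than dividing by zero
empty = overlap_fraction(cv, [])

print(half, case_mismatch, empty)  # 0.5 0.0 0.0
```

Because the denominator is the number of job skills, a posting with a single matching skill scores 1.0, while a posting with many listed skills needs many matches to score as high.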


### 3.6 Weighted hybrid score

In [47]:
# We give the cosine similarity a weight of 0.9 and the skills score a weight
# of 0.1: cosine similarity compares the entire CV and job description text
# holistically, while the skills score adds extra emphasis on matched skills,
# which often inform a hiring decision
merged_jobs["weighted_score"] = 0.9 * merged_jobs["score_model1"] + 0.1 * merged_jobs["skills_score"]
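The effect of the weights can be checked with a short example. The helper `weighted_score` and its `w_sim` parameter are illustrative names; the two score pairs are taken from rows displayed in this notebook's result tables.

```python
def weighted_score(similarity, skills, w_sim=0.9):
    """Blend semantic similarity and skills score; the two weights sum to 1."""
    return w_sim * similarity + (1.0 - w_sim) * skills

# (score_model1, skills_score) pairs
job_a = (0.740497, 0.181818)  # high similarity, few matched skills
job_b = (0.622916, 1.0)       # lower similarity, every listed skill matched

for w in (0.9, 0.8):
    a = weighted_score(*job_a, w_sim=w)
    b = weighted_score(*job_b, w_sim=w)
    print(f"w_sim={w}: A={a:.6f}, B={b:.6f}, top={'A' if a > b else 'B'}")
```

With weights of 0.9 and 0.1, job A ranks first. Moving to 0.8 and 0.2 is enough to put job B ahead, so the ranking is sensitive to this choice, especially for postings with very short skill lists.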

### 3.7 Final Results

In [48]:
# Eliminate duplicate postings (same title, company, and similarity score)
merged_jobs_final = merged_jobs.drop_duplicates(subset=['title', 'company', 'score_model1'], keep='first')

print("Top jobs found!")
# Sort jobs by the highest score
top_jobs = merged_jobs_final.sort_values(by='weighted_score', ascending=False).head(5)

Top jobs found!
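The de-duplication and ranking logic can be sketched without pandas as follows. This is a plain-Python analogue under stated assumptions: the records and the `top_k_unique` helper are hypothetical, and the key fields mirror the `subset` passed to `drop_duplicates` above.

```python
def top_k_unique(jobs, k=5):
    """Keep the first record per (title, company, score_model1), then rank."""
    seen = set()
    unique = []
    for job in jobs:
        key = (job["title"], job["company"], job["score_model1"])
        if key not in seen:  # mirrors keep='first'
            seen.add(key)
            unique.append(job)
    # Highest weighted score first, truncated to the top k
    return sorted(unique, key=lambda j: j["weighted_score"], reverse=True)[:k]

jobs = [
    {"title": "Data Intern", "company": "Acme", "score_model1": 0.70, "weighted_score": 0.66},
    # Same posting scraped from a second site
    {"title": "Data Intern", "company": "Acme", "score_model1": 0.70, "weighted_score": 0.66},
    {"title": "Analyst", "company": "Globex", "score_model1": 0.72, "weighted_score": 0.68},
]

top = top_k_unique(jobs, k=2)
print([j["title"] for j in top])  # ['Analyst', 'Data Intern']
```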


In [44]:
top_jobs

Unnamed: 0,id2,site,job_url,job_url_direct,title,company,location,date_posted,job_type,is_remote,...,postedDate,jobType,jobLevel,industry,size,score_model1,salarySim,skills,skills_score,weighted_score
102,in-bfe4fd1cc012f49b,indeed,https://de.indeed.com/viewjob?jk=bfe4fd1cc012f49b,https://www.rwth-aachen.de/go/id/kbag/file/V00...,Research Assistant with a bachelor's degree (f...,RWTH Aachen University,"Aachen, NW, DE",2025-04-22,parttime,0,...,20 days ago,Part-Time,Entry Level,Finance,Large (500+),0.740497,$90-120K,"[chemical engineering, prediction, pytorch, so...",0.181818,0.684629
339,li-4207049116,linkedin,https://www.linkedin.com/jobs/view/4207049116,https://tietalent.com/en/jobs/p-1352133/winnen...,Praktikum / Internship User Generated Content ...,TieTalent,"Winnenden, Baden-Württemberg, Germany",2025-04-11,internship,0,...,31 days ago,Internship,Internship,Manufacturing,Startup (1-50),0.723152,Below $90K,"[sustainability, data analytics, analytics, ge...",0.25,0.675837
406,go-yNnP5do5qoukuHtHAAAAAA==,google,https://de.indeed.com/viewjob?jk=80a54c3c1ac40...,,Praktikanten als Data Engineer (m/w/d),GIM - Gesellschaft für Innovative Marktforschu...,"Berlin, Deutschland",,,0,...,,Full-time,Entry Level,Education,Mid-size (51-500),0.724018,$90-120K,"[reliability, gitlab, git, social, data manage...",0.181818,0.669798
371,gd-1009697061393,glassdoor,https://www.glassdoor.de/job-listing/j?jl=1009...,,Intern at Center of Excellence for Artificial ...,MTU Aero Engines AG,München,2025-04-03,,0,...,39 days ago,Full-time,Entry Level,Hospitality,Startup (1-50),0.720075,$90-120K,"[data science, databases, sql, english, use ca...",0.210526,0.66912
121,li-4211498507,linkedin,https://www.linkedin.com/jobs/view/4211498507,https://tietalent.com/en/jobs/p-1419086/berlin...,Werkstudent*in Geowissenschaften (d/m/w),TieTalent,"Berlin, Berlin, Germany",2025-04-16,internship,0,...,26 days ago,Internship,Internship,Manufacturing,Startup (1-50),0.728216,Below $90K,"[data science, sustainability, social, machine...",0.090909,0.664485


In [49]:
# Extract and display the full descriptions of the top jobs
top_job_descriptions = top_jobs['combined_job_text_en'].tolist()
for i, desc in enumerate(top_job_descriptions, 1):
    print(f"Job {i} Description:\n{desc}\n")

Job 1 Description:
Junior Business Controller - Internship We at *** Spreafico Francesco \\ & f.lli spa ***, since 1955, bases our activity on the import, production and distribution of innovative and quality products. We are leader in the fruit and vegetable sector and reliable partner on different channels: from the traditional market, to modern distribution, to the Ho.Re.Ca and Naval sector. With a view to a continuous improvement, we always turn particular attention to the supply chain and end consumers! Entrepreneurship is in the DNA of all our collaborators and all our collaborators, for us it is important to create an environment in which each person, with their own skills and quality, can make a difference! We are a dynamic reality, always ready to embrace new opportunities and grow. For this reason, with a view to enhancing our team dedicated to management control, we are looking for a figure of *** Junior Business Controller. *** The candidate or candidate will begin his trai

### 3.8 Matching and Missing Skills

In [38]:
import ast

def process_skills_entry(skills_str_or_list):
    """Normalize a skills cell into a list of strings.

    Accepts a list, a list-literal string such as "['a', 'b']", or a
    comma-separated string. Any other value (e.g. NaN or None) yields [].
    """
    actual_list = []
    if isinstance(skills_str_or_list, str):
        try:
            # Attempt to parse string like "['val1', 'val2']"
            evaluated = ast.literal_eval(skills_str_or_list)
            if isinstance(evaluated, list):
                actual_list = [str(item) for item in evaluated] # Ensure items are strings
            elif isinstance(evaluated, str): # If it was just a single skill string like "'Python'"
                actual_list = [evaluated]
            # Add more specific handling if ast.literal_eval could return other types
        except (ValueError, SyntaxError):
            # If not a list literal, assume comma-separated like "skill1, skill2"
            actual_list = [s.strip() for s in skills_str_or_list.split(',') if s.strip()]
    elif isinstance(skills_str_or_list, list):
        actual_list = [str(item) for item in skills_str_or_list] # Ensure items are strings
    return actual_list
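The parser above catches both `ValueError` and `SyntaxError` because `ast.literal_eval` fails in different ways depending on the input. The short probe below shows the outcome for four sample strings; the helper `parse_outcome` and the sample strings are illustrative only.

```python
import ast

def parse_outcome(text):
    """Return the resulting type name, or the exception name if parsing fails."""
    try:
        return type(ast.literal_eval(text)).__name__
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__

samples = [
    "['python', 'sql']",   # list literal, e.g. a list column read back from CSV
    "'Python'",            # a single quoted skill
    "python, sql",         # valid syntax, but bare names are not literals
    "data science, sql",   # not valid Python syntax at all
]

outcomes = {s: parse_outcome(s) for s in samples}
for sample, outcome in outcomes.items():
    print(f"{sample!r:25} -> {outcome}")
```

The last two cases are the ones that fall through to the comma-splitting branch.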

In [50]:
# Apply the function to parse the 'skills' column and create a new column
top_jobs['parsed_skills_list'] = top_jobs['skills'].apply(process_skills_entry)

cv_skills_lower_set = {skill.lower() for skill in cv_skills}  # Convert cv_skills to lowercase set

# Now use the parsed list

# Add matching skills
top_jobs['matchingSkills'] = top_jobs['parsed_skills_list'].apply(
    lambda skill_list: [skill.title() for skill in skill_list if skill.lower() in cv_skills_lower_set]
)

# Add missing skills
top_jobs['missingSkills'] = top_jobs['parsed_skills_list'].apply(
    lambda skill_list: [skill.title() for skill in skill_list if skill.lower() not in cv_skills_lower_set]
)

# Drop the intermediate column now that it is no longer needed
top_jobs = top_jobs.drop(columns=['parsed_skills_list'])
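The split into matching and missing skills can be traced on a toy example. The skill lists here are made up for illustration; the comparison and formatting steps follow the cell above.

```python
cv_skills = {"Python", "SQL"}
job_skills = ["python", "pytorch", "sql", "data science"]

# Compare in lowercase so that "Python" and "python" count as the same skill
cv_lower = {skill.lower() for skill in cv_skills}

matching = [s.title() for s in job_skills if s.lower() in cv_lower]
missing = [s.title() for s in job_skills if s.lower() not in cv_lower]

print(matching)  # ['Python', 'Sql']
print(missing)   # ['Pytorch', 'Data Science']
```

Note that `str.title()` capitalizes only the first letter of each word, so acronyms and product names come out as `Sql` or `Pytorch` in the displayed lists.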

In [51]:
top_jobs

Unnamed: 0,id2,site,job_url,job_url_direct,title,company,location,date_posted,job_type,is_remote,...,jobLevel,industry,size,score_model1,salarySim,skills,skills_score,weighted_score,matchingSkills,missingSkills
1303,li-4204745476,linkedin,https://www.linkedin.com/jobs/view/4204745476,,Junior Business Controller - Internship,SPREAFICO FRANCESCO & F.LLI SPA,"Dolzago, Lombardy, Italy",2025-04-09,internship,0,...,Internship,Finance,Startup (1-50),0.622916,Below $90K,[economics],1.0,0.698332,[Economics],[]
527,li-4210375414,linkedin,https://www.linkedin.com/jobs/view/4210375414,https://www.clever-fit.com/de/karriere/job-suc...,"BA-Student oder IST-Student, Fitness- Wissensc...",clever fit,"Holzgerlingen, Baden-Württemberg, Germany",2025-04-15,internship,0,...,Internship,Healthcare,Large (500+),0.680049,Below $90K,"[economics, planning]",0.5,0.644039,[Economics],[Planning]
102,in-bfe4fd1cc012f49b,indeed,https://de.indeed.com/viewjob?jk=bfe4fd1cc012f49b,https://www.rwth-aachen.de/go/id/kbag/file/V00...,Research Assistant with a bachelor's degree (f...,RWTH Aachen University,"Aachen, NW, DE",2025-04-22,parttime,0,...,Entry Level,Finance,Large (500+),0.740497,$90-120K,"[chemical engineering, prediction, pytorch, so...",0.181818,0.628761,"[Python, Programming]","[Chemical Engineering, Prediction, Pytorch, So..."
339,li-4207049116,linkedin,https://www.linkedin.com/jobs/view/4207049116,https://tietalent.com/en/jobs/p-1352133/winnen...,Praktikum / Internship User Generated Content ...,TieTalent,"Winnenden, Baden-Württemberg, Germany",2025-04-11,internship,0,...,Internship,Manufacturing,Startup (1-50),0.723152,Below $90K,"[sustainability, data analytics, analytics, ge...",0.25,0.628522,[Analytics],"[Sustainability, Data Analytics, German]"
256,in-03400469f6c64469,indeed,https://de.indeed.com/viewjob?jk=03400469f6c64469,https://career.krahn.eu/WERKSTUDENT-PEOPLE-CUL...,WERKSTUDENT PEOPLE & CULTURE (M/W/D),KRAHN CHEMIE GMBH,"Hamburg, HH, DE",2025-04-22,,0,...,Entry Level,Manufacturing,Startup (1-50),0.659092,$90-120K,"[analytics, networking]",0.5,0.627274,[Analytics],[Networking]


In [52]:
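# Select relevant columns; .copy() makes this an independent DataFrame so that
# columns can be added later without a SettingWithCopyWarning
top_jobs_output = top_jobs[[
    'title_en', 'company', 'location', 'is_remote', 'weighted_score',
    'matchingSkills', 'missingSkills', 'description_en', 'postedDate',
    'jobType', 'jobLevel', 'salarySim', 'job_url'
]].copy()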

# 4. Semantic Fit

In [None]:
# Use Gemini to write a two-bullet-point description explaining the semantic fit
# between the CV and the job description, and to suggest what else to include
# in the CV to fit the job even better

# First configure API key
import os

is_api_key_configured = False

try:
    # Read the API key from the GEMINI_API_KEY environment variable instead of
    # hardcoding it in the notebook; a missing variable raises KeyError
    gemini_api_key = os.environ["GEMINI_API_KEY"]
    genai.configure(api_key=gemini_api_key)
    is_api_key_configured = True
    print("Gemini API Key configured successfully from environment variable.")
except KeyError:
    print("--------------------------------------------------------------------------------")
    print("🚨 GEMINI_API_KEY environment variable not found.")
    print("🚨 Please try one of the following options:")
    print("🚨 OPTION 1 (Recommended): Set an environment variable or Colab Secret named GEMINI_API_KEY.")
    print("🚨 OPTION 2 (For quick testing, less secure):")
    print("🚨   In the next code cell, uncomment the lines and replace 'YOUR_GEMINI_API_KEY_HERE' with your actual key.")
    print("🚨 Get an API key from Google AI Studio: https://aistudio.google.com/app/apikey")
    print("--------------------------------------------------------------------------------")

Gemini API Key configured successfully from environment variable.
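A minimal sketch of a non-raising alternative for the key lookup, assuming the key is exposed as an environment variable. The helper name `load_api_key` is hypothetical; it returns `None` when the variable is unset or blank so the caller can decide how to proceed.

```python
import os

def load_api_key(var_name="GEMINI_API_KEY"):
    """Return the key from the environment, or None if it is unset or blank."""
    key = os.environ.get(var_name, "").strip()
    return key or None

api_key = load_api_key()
if api_key is None:
    print("GEMINI_API_KEY is not set; skipping Gemini configuration.")
else:
    # Never print the key itself; its length is enough to confirm it was read
    print(f"API key loaded ({len(api_key)} characters).")
```

Keeping the key out of the notebook source means the file can be shared or committed without exposing credentials.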


In [None]:
# Set generation configuration - greedy decoding settings make the output as
# repeatable as possible, so the user gets consistent feedback when rerunning
# the same job search query
generation_config = {
  "temperature": 0,
  "top_p": 1,
  "top_k": 1,
}

In [None]:
# Load Gemini 2.5 Flash model
model = None # Initialize model to None

if is_api_key_configured:
    try:
        model = genai.GenerativeModel(
            model_name="gemini-2.5-flash-preview-04-17",
            generation_config=generation_config,
            # safety_settings=... # Optional: configure safety settings if needed
        )
        print(f"Gemini model '{model.model_name}' initialized successfully.")
    except Exception as e:
        print(f"🚨 Error initializing Gemini model: {e}")
        print("🚨 This can happen if the API key is invalid, not authorized for the model, or if there are network issues.")
        print("🚨 Please double-check your API key and its permissions in Google AI Studio.")
else:
    print("🚨 Skipping model initialization because the API key was not configured.")

Gemini model 'models/gemini-2.5-flash-preview-04-17' initialized successfully.


In [None]:
# Create helper function to make calls to Gemini API
def gemini_generation(prompt_template, CV_text, top_5_job, model_instance):
    """Generate an output based on a prompt and an input document using Gemini."""
    if not model_instance:
        print("Error: Gemini model is not initialized.")
        return None

    full_prompt = prompt_template.replace("[CV]", CV_text)
    full_prompt = full_prompt.replace("[TOP_5_JOB]", top_5_job)

    try:
        response = model_instance.generate_content(full_prompt)
        return response.text.strip()
    except Exception as e:
        # Log a short preview of both inputs so the failing call can be identified
        print(f"Error during Gemini API call (CV '{CV_text[:50]}...', job '{top_5_job[:50]}...'): {e}")
        return None  # Callers must handle None (no feedback generated)

In [None]:
# Specify prompt template with placeholders for CV and top 5 jobs
prompt_template = """Write a two-bullet-point description explaining the semantic fit
between the following CV and job description.

CV: '[CV]'

Job Description: '[TOP_5_JOB]'

Additionally, write two bullet points listing skills or experience the candidate
may consider adding to their CV, if they genuinely have that experience, to match
the job description even better. Finally, add a disclaimer that we do not condone
faking experience in a CV just to get a better match.
"""
print("Classification prompt template defined.")

Classification prompt template defined.
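The placeholder substitution used with this template can be sketched in isolation. The shortened template, the sample texts, and the `fill_prompt` helper below are illustrative assumptions. Only the two placeholder tokens are taken from the notebook.

```python
TEMPLATE = "CV: '[CV]'\n\nJob Description: '[TOP_5_JOB]'"

def fill_prompt(template, cv_text, job_text):
    """Substitute both placeholders with plain string replacement."""
    return template.replace("[CV]", cv_text).replace("[TOP_5_JOB]", job_text)

prompt = fill_prompt(TEMPLATE, "Data analyst, Python", "Research assistant, ML")
print(prompt)

# Caveat: the replacements run in sequence, so a placeholder token that appears
# inside the CV text is itself replaced by the second call.
leaky = fill_prompt(TEMPLATE, "see [TOP_5_JOB]", "X")
print(leaky)
```

For real CV text this collision is unlikely, but it is the reason to pick placeholder tokens that cannot plausibly occur in user-supplied documents.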


In [None]:
# For each row of top_jobs_output, use the 'description_summary' column as the
# job description and private_CV_text as the CV, call gemini_generation, and
# store the result in a new 'semanticFit' column
for index, row in top_jobs_output.iterrows():
    job_description = row['description_summary']  # Get job description from 'description_summary'
    semantic_fit_text = gemini_generation(prompt_template, private_CV_text, job_description, model)
    top_jobs_output.loc[index, 'semanticFit'] = semantic_fit_text  # Add 'semanticFit' column

print("Semantic fit generated and updated in the DataFrame!")


Semantic fit generated and updated in the DataFrame!


In [None]:
# print final top_jobs_output df
display(top_jobs_output)

Unnamed: 0,title_en,company,location,is_remote,weighted_score,matchingSkills,missingSkills,description_en,postedDate,jobType,jobLevel,salarySim,job_url,description_summary,semanticFit
102,Research Assistant with a bachelor's degree (f...,RWTH Aachen University,"Aachen, NW, DE",0,0.670795,"[German, R]","[Machine Learning, English, Data Protection]",### **Contact** ### **Name** Antoine Siraudin ...,22 days ago,Part-Time,Entry Level,$90-120K,https://de.indeed.com/viewjob?jk=bfe4fd1cc012f49b,Research assistant will support experiments fo...,Here is the semantic fit analysis and suggesti...
406,Practics ALS Data Engineer (m/w/d),GIM - Gesellschaft für Innovative Marktforschu...,"Berlin, Deutschland",0,0.667365,"[German, R]","[Sql, Computer Science, Dies]","As a full service market research institute, w...",,Full-time,Entry Level,$90-120K,https://de.indeed.com/viewjob?jk=80a54c3c1ac40...,The GIM mbH is one of the leading independent ...,Here is an analysis of the semantic fit and su...
12,Intern Strategy and Market Analysis - Focus on...,MTU Aero Engines AG,München,0,0.666712,[R],"[Mathematics, Database, Mysql, Sql]","\\#UPLIFT**Y**OURFUTURE Over 12,000 People. 19...",41 days ago,Full-time,Entry Level,$90-120K,https://www.glassdoor.de/job-listing/j?jl=1009...,Market Analysis group is part of the Corporate...,Here is the semantic fit analysis and suggesti...
371,Intern at Center of Excellence for Artificial ...,MTU Aero Engines AG,München,0,0.652829,"[German, R]","[Communication, Mathematics, Database]","\\#UPLIFT**Y**OURFUTURE Over 12,000 People. 19...",41 days ago,Full-time,Entry Level,$90-120K,https://www.glassdoor.de/job-listing/j?jl=1009...,Munich University is searching for an intern f...,Here is a description of the semantic fit and ...
121,Work student in geosciences (D/m/f),TieTalent,"Berlin, Berlin, Germany",0,0.652146,[German],"[Ski, Digitization, Data Science, Machine Lear...",** About ** ** Welcome to Toll Collect! ** As ...,28 days ago,Internship,Internship,Below $90K,https://www.linkedin.com/jobs/view/4211498507,Toll Collect has been running one of the world...,Here is the semantic fit analysis and suggesti...


In [None]:
# For each job in the df, extract and display the job title, job description
# summary, and semanticFit text
for i, (index, row) in enumerate(top_jobs_output.iterrows(), 1):
    job_title = row['title_en']
    job_description = row['description_summary']
    semantic_fit = row['semanticFit']

    print(f"Matching Job Title {i}: {job_title}")
    print(f"Job Description: {job_description}")
    print(f"Semantic Fit: {semantic_fit}")
    print("-" * 20)

Matching Job Title 1: Research Assistant with a bachelor's degree (f/m/d) in the field Learning on Graphs
Job Description: Research assistant will support experiments for an applied machine learning project in the field of chemical engineering. The position is to be filled at the earliest possible date and offered for a fixed term initially for 1 year. The standard weekly hours will be 7
Semantic Fit: Here is the semantic fit analysis and suggestions:

**Semantic Fit between CV and Job Description:**

*   The candidate's Master's degree in Business Analytics, coupled with certifications in Python for Data Analysis and Foundation of Generative AI, demonstrates a strong technical foundation and interest in data-driven fields relevant to supporting an applied machine learning project.
*   The candidate's listed skills in Python and prior experience as a Data Analyst align with the practical data handling and technical support likely required for assisting with machine learning experiments