## Entity extraction from job descriptions with Gemini 1.5 Flash

In this notebook, I use the Gemini API (Gemini 1.5 Flash) to extract specific information from job description text that I previously scraped from a job search site.

**About the data used**

The data consists of job descriptions for Software Engineer positions in Manchester, which I scraped in an earlier data scraping project. For more details, see https://github.com/morikaglobal/jobsite_selenium

I will import the data collected in that project, which is available at https://github.com/morikaglobal/jobsite_selenium/blob/master/Jobsite.csv

References used:

- Gemini Reshaping the NLP Task for Extracting Knowledge in Text: https://medium.com/@joansantoso/gemini-reshaping-the-nlp-task-for-extracting-knowledge-in-text-c0d5fdd4edd8
- Extract structured data using function calling: https://ai.google.dev/gemini-api/tutorials/extract_structured_data

In [57]:
import pandas as pd

In [58]:
data = pd.read_csv('https://raw.githubusercontent.com/morikaglobal/jobsite_selenium/master/Jobsite.csv')

data.head()

Unnamed: 0,job_title,company,location,salary,job_description
0,Software Engineer - iOS,BBC,"Salford, Greater Manchester",Competitive Salary + Benefits,Job Introduction We are looking for Mobile Sof...
1,Software Engineer,Ultimate Performance Fitness,"M1, Manchester","From £50,000 to £60,000 per annum",Location- based in the heart of Manchester Sal...
2,Software Engineer,Precisely Software Limited,UK,Competitive,Precisely is the leader in data integrity. We ...
3,Software Engineer - Manchester - £57k,DGH Recruitment Ltd,"Manchester, Greater Manchester",£45000 - £57000 per annum,My client is currently recruiting for an exper...
4,Graduate Software Engineer,ITECCO Limited,"Chorley, Lancashire",£20000 - £30000 per annum,Are you ready to kick start your career as a G...


In [59]:
data.shape

(552, 5)

The job_description column stores the description of each job advertised for a Software Engineer position. The first few descriptions look like this:

In [60]:
for job in data['job_description'][:3]:
  print(job)
  print('______'*5)

Job Introduction We are looking for Mobile Software Engineers to join the Sport App team. BBC Sport App is one of the UK's most well-known and loved brands, and we're looking for passionate team members to join our collaborative, agile, iOS team. We welcome applications from all, regardless of age, gender, ethnicity, disability, sexuality, social background, religion and/or belief. As a Software Engineer for the Sport App team, you will have the opportunity to join an engineering team that delivers an intuitive and engaging sport-oriented experience to millions of audience members every day. To be successful in this role you will need a good understanding of object-oriented programming, clean architecture, and test-driven development.
______________________________
Location- based in the heart of Manchester Salary- £50-£60k Hours: Full-time, Monday - Friday 9am - 5:30pm, hybrid (2 days a week in the office) Our mission-Ultimate Performance (U.P.) has forged a reputation as the worlds f

So for example, for the very first job description:

In [61]:
data['job_description'][0]

"Job Introduction We are looking for Mobile Software Engineers to join the Sport App team. BBC Sport App is one of the UK's most well-known and loved brands, and we're looking for passionate team members to join our collaborative, agile, iOS team. We welcome applications from all, regardless of age, gender, ethnicity, disability, sexuality, social background, religion and/or belief. As a Software Engineer for the Sport App team, you will have the opportunity to join an engineering team that delivers an intuitive and engaging sport-oriented experience to millions of audience members every day. To be successful in this role you will need a good understanding of object-oriented programming, clean architecture, and test-driven development."

I want the following text to be extracted as the position:

"Mobile Software Engineers"

and the following text to be extracted as the experience:

"a good understanding of object-oriented programming, clean architecture, and test-driven development"



## Gemini 1.5 Flash

In [62]:
!pip install -U -q google-generativeai

In [63]:
import json

In [64]:
from google.colab import userdata
import google.generativeai as genai

GOOGLE_API_KEY=userdata.get("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)

In [65]:
model = genai.GenerativeModel(
    model_name="models/gemini-1.5-flash-latest"
)

Prompt

First, I will try an initial prompt and see how it works on the first job description in the dataframe.

In [66]:
prompt1='''
1.You are an Entity Recognition in English Language.
2.Understand the given query of job description and do some analysis to extract the desired skills and experiences that the employer is looking for in potential candidates.
3.Output the role or position advertised as POSITION, background or experience that candidates must possess to apply for the specific role as EXPERIENCE and programming language or skill desired as SKILL.

Analyze the sentence as follow: "'
'''
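The instruction text is combined with each job description by plain string concatenation. As a minimal sketch of an alternative, the hypothetical helper `build_prompt` below wraps the description in explicit delimiters so that its start and end are unambiguous. The helper name, the "Job description:" label and the triple-quote delimiters are assumptions for illustration and are not part of the prompts used in this notebook.

```python
def build_prompt(instructions: str, description: str) -> str:
    """Combine the instruction text with one job description.

    The description is wrapped in triple quotes so that its boundaries are
    explicit (an assumption of this sketch, not the notebook's prompt format).
    """
    return f'{instructions.strip()}\n\nJob description:\n"""\n{description.strip()}\n"""'


# Hypothetical, shortened inputs for illustration only
example = build_prompt(
    "Output the role advertised as POSITION.",
    "We are looking for Mobile Software Engineers to join the Sport App team.",
)
print(example)
```

Whether delimiters change the extraction quality would need to be checked against the actual model output.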

In [67]:
query1 = data['job_description'][0]
query1

"Job Introduction We are looking for Mobile Software Engineers to join the Sport App team. BBC Sport App is one of the UK's most well-known and loved brands, and we're looking for passionate team members to join our collaborative, agile, iOS team. We welcome applications from all, regardless of age, gender, ethnicity, disability, sexuality, social background, religion and/or belief. As a Software Engineer for the Sport App team, you will have the opportunity to join an engineering team that delivers an intuitive and engaging sport-oriented experience to millions of audience members every day. To be successful in this role you will need a good understanding of object-oriented programming, clean architecture, and test-driven development."

In [68]:
response1 = model.generate_content(prompt1 + query1, generation_config={'response_mime_type':'application/json'})

print(response1.text)

{"POSITION": "Mobile Software Engineer", "EXPERIENCE": "good understanding of object-oriented programming, clean architecture, and test-driven development", "SKILL": "iOS"}



In [69]:
response_dict = json.loads(response1.text)
response_dict

{'POSITION': 'Mobile Software Engineer',
 'EXPERIENCE': 'good understanding of object-oriented programming, clean architecture, and test-driven development',
 'SKILL': 'iOS'}
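Because the key names come from the prompt rather than from an enforced schema, it may be worth checking that the parsed response contains every requested key. The sketch below is one possible way to do that; `parse_entities` and `REQUIRED_KEYS` are hypothetical names, and the sample string is the response printed above.

```python
import json

# Keys requested in the prompt (assumed to be the complete set for this check)
REQUIRED_KEYS = ("POSITION", "EXPERIENCE", "SKILL")


def parse_entities(raw_text, required=REQUIRED_KEYS):
    """Parse a JSON response string and list any requested keys that are missing."""
    parsed = json.loads(raw_text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required if key not in parsed]
    return parsed, missing


sample = (
    '{"POSITION": "Mobile Software Engineer", '
    '"EXPERIENCE": "good understanding of object-oriented programming, '
    'clean architecture, and test-driven development", "SKILL": "iOS"}'
)
entities, missing = parse_entities(sample)
print(entities["POSITION"], missing)
```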

The prompt extracts the entities as I expected. Next, I will try the same prompt on the second job description and see how well it works.

In [70]:
query2 = data['job_description'][1]
query2

'Location- based in the heart of Manchester Salary- £50-£60k Hours: Full-time, Monday - Friday 9am - 5:30pm, hybrid (2 days a week in the office) Our mission-Ultimate Performance (U.P.) has forged a reputation as the worlds foremost body transformation experts, delivering exceptional client results under the mantra maximum results, minimum time. Our vision is to empower everyone across the world to live healthier lives. The business has grown to become the worlds only truly international personal training company, with 22 private personal training facilities across the globe and growing. But U.P. is more than just a gym. We operate world-leading online and virtual personal training, training camps, meal preparation services, as well as develop a range of premium, results-focused supplements.'

Although this advertisement was scraped from the search results for Software Engineer positions, the job description seems to be aimed more at recruiting a personal fitness trainer.


I will improve my prompt so that, if the job description is most likely for a position unrelated to Software Engineer, the returned JSON says so.

In [71]:
prompt_new='''
1.You are an Entity Recognition in English Language.
2.Understand the given query of job description and do some analysis to extract the desired skills and experiences that the employer is looking for in potential candidates.
3.Output the role or position advertised as POSITION, background or experience that candidates must possess to apply for the specific role as EXPERIENCE and programming language or skill desired as SKILL.
4.If you judge that the job description is not for Software enginner or related positions, output Y in DIFFERNT_POSITION.

Analyze the sentence as follow: "'
'''

In [72]:
response_new = model.generate_content(prompt_new + query2, generation_config={'response_mime_type':'application/json'})

print(response_new.text)

{"POSITION": "Personal Trainer", "EXPERIENCE": "Experience in personal training", "SKILL": [], "DIFFERNT_POSITION": "Y"}



In [73]:
response_dict = json.loads(response_new.text)
response_dict

{'POSITION': 'Personal Trainer',
 'EXPERIENCE': 'Experience in personal training',
 'SKILL': [],
 'DIFFERNT_POSITION': 'Y'}

The new prompt seems to work well, so I will test it on the first job description from the dataframe.

In [74]:
response_new = model.generate_content(prompt_new + query1, generation_config={'response_mime_type':'application/json'})

print(response_new.text)

{"POSITION": "Mobile Software Engineer", "EXPERIENCE": "good understanding of object-oriented programming, clean architecture, and test-driven development", "SKILL": "iOS", "DIFFERNT_POSITION": "N"}



In [75]:
response_dict = json.loads(response_new.text)
response_dict

{'POSITION': 'Mobile Software Engineer',
 'EXPERIENCE': 'good understanding of object-oriented programming, clean architecture, and test-driven development',
 'SKILL': 'iOS',
 'DIFFERNT_POSITION': 'N'}

The new prompt seems to work well again, so I will test it on another job description from the dataframe.

In [76]:
query3 = data['job_description'][3]
query3

'My client is currently recruiting for an experienced Software Engineer to support a large change and transformation project. This is an excellent opportunity to work on an important project for my client and gain further exposure to technologies such as Power BI. The successful candidate will be experienced in the full software development life cycle and posses strong .NET, C# development skills Skills required * C# * ASP.NET * JavaScript * MVC * HTML * SQL Desirable * Power BI * Dynamics 365 In accordance with the Employment Agencies and Employment Businesses Regulations 2003, this position is advertised based upon DGH Recruitment Limited having first sought approval of its client to find candidates for this position. DGH Recruitment Limited acts as both an Employment Agency and Employment Business'

In [77]:
response_new = model.generate_content(prompt_new + query3, generation_config={'response_mime_type':'application/json'})

print(response_new.text)

{"POSITION": "Software Engineer", "EXPERIENCE": "experienced in the full software development life cycle", "SKILL": "C#, ASP.NET, JavaScript, MVC, HTML, SQL, Power BI, Dynamics 365", "DIFFERNT_POSITION": "N"}



In [78]:
response_dict = json.loads(response_new.text)
response_dict

{'POSITION': 'Software Engineer',
 'EXPERIENCE': 'experienced in the full software development life cycle',
 'SKILL': 'C#, ASP.NET, JavaScript, MVC, HTML, SQL, Power BI, Dynamics 365',
 'DIFFERNT_POSITION': 'N'}
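The SKILL field came back as an empty list for the second job description and as a comma-separated string here, so the value type is not stable across responses. A minimal sketch of a normaliser that turns either form into a single string is shown below. `normalize_field` is a hypothetical helper, and treating `None`, the string "null" and empty values as "N/A" is an assumption about how missing values should be represented.

```python
def normalize_field(value, missing="N/A"):
    """Return one display string for an extracted field, whatever its JSON type."""
    # JSON null is parsed to None; the model may also return the literal string "null"
    if value is None or value == "null":
        return missing
    # Lists are joined into a comma-separated string, skipping empty items
    if isinstance(value, (list, tuple)):
        items = [str(item).strip() for item in value if str(item).strip()]
        return ", ".join(items) if items else missing
    text = str(value).strip()
    return text if text else missing


print(normalize_field([]))
print(normalize_field(["C#", "ASP.NET"]))
print(normalize_field("iOS"))
```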

## Apply the prompt to the dataframe

Now I will apply the prompt to the first 15 rows of the dataframe to extract the position, experience and skill required for each job description, as well as whether the position is likely to be related to Software Engineer.

In [81]:
# Copy the slice so that adding columns later does not act on a view of `data`
df = data.iloc[:15].copy()
df

Unnamed: 0,job_title,company,location,salary,job_description
0,Software Engineer - iOS,BBC,"Salford, Greater Manchester",Competitive Salary + Benefits,Job Introduction We are looking for Mobile Sof...
1,Software Engineer,Ultimate Performance Fitness,"M1, Manchester","From £50,000 to £60,000 per annum",Location- based in the heart of Manchester Sal...
2,Software Engineer,Precisely Software Limited,UK,Competitive,Precisely is the leader in data integrity. We ...
3,Software Engineer - Manchester - £57k,DGH Recruitment Ltd,"Manchester, Greater Manchester",£45000 - £57000 per annum,My client is currently recruiting for an exper...
4,Graduate Software Engineer,ITECCO Limited,"Chorley, Lancashire",£20000 - £30000 per annum,Are you ready to kick start your career as a G...
5,Software Engineer .Net,IO Associates,"Manchester, Greater Manchester",£50000 - £70000 per annum + benefits,Role: Software Engineer .Net Location: Remote ...
6,Software Engineer,Bright Purple,UK,"From £50,000 to £65,000 per annum",Heres a rare opportunity to join a global engi...
7,Software Engineer/Data Scientist/Maths,RE&M,"WA14, Altrincham",£50000 - £60000 per annum,Software engineers & Data Scientists are expec...
8,Software Engineer,Sopra Steria Limited,UK,"Up to £44,000 per annum",This opportunity is for a Software Engineer lo...
9,Principal Mobile Software Engineer,BBC,"Salford, Greater Manchester",Competitive Salary + Benefits,"Job Introduction The BBC Mobile Platform, part..."


In [82]:
position, experience, skill, different_position = [], [], [], []

for job in df['job_description']:
  print(job)
  response = model.generate_content(prompt_new + job, generation_config={'response_mime_type':'application/json'})

  # Guard against responses with no parts (e.g. FinishReason.RECITATION), for which the
  # response.text quick accessor would raise an error and stop the loop
  if not response.parts:
    position.append('Not available')
    experience.append('Not available')
    skill.append('Not available')
    different_position.append('Not available')
  else:
    response_dict = json.loads(response.text)
    print(response_dict)

    if response_dict['POSITION'] == "null":
      position.append('N/A')
      print('N/A')
    else:
      position.append(response_dict['POSITION'])
      print(response_dict['POSITION'])

    if response_dict['EXPERIENCE'] == "null":
      experience.append('N/A')
      print('N/A')
    else:
      experience.append(response_dict['EXPERIENCE'])
      print(response_dict['EXPERIENCE'])

    if response_dict['SKILL'] == "null" or len(response_dict['SKILL']) == 0:
      skill.append('N/A')
      print('N/A')
    else:
      if isinstance(response_dict['SKILL'], list):
        print('LIST LIST LIST')
        skill_string = ", ".join(response_dict['SKILL'])
        skill.append(skill_string)
        print(skill_string)
      else:
        skill.append(response_dict['SKILL'])
        print(response_dict['SKILL'])

    if response_dict['DIFFERNT_POSITION'] == 'N':
      different_position.append('N')
      print('N')
    else:
      different_position.append('Y')
      print('Y')

  print('___________')

print(position)
print(experience)
print(skill)
print(different_position)

Job Introduction We are looking for Mobile Software Engineers to join the Sport App team. BBC Sport App is one of the UK's most well-known and loved brands, and we're looking for passionate team members to join our collaborative, agile, iOS team. We welcome applications from all, regardless of age, gender, ethnicity, disability, sexuality, social background, religion and/or belief. As a Software Engineer for the Sport App team, you will have the opportunity to join an engineering team that delivers an intuitive and engaging sport-oriented experience to millions of audience members every day. To be successful in this role you will need a good understanding of object-oriented programming, clean architecture, and test-driven development.
{'POSITION': 'Mobile Software Engineer', 'EXPERIENCE': 'good understanding of object-oriented programming, clean architecture, and test-driven development', 'SKILL': 'object-oriented programming, clean architecture, test-driven development', 'DIFFERNT_POS
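The guard at the top of the loop can also be isolated in a small function, which makes it testable without calling the API. The sketch below assumes only that the response object exposes `.parts` and `.text`, as used in the loop. `response_to_dict` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for real API responses.

```python
import json
from types import SimpleNamespace


def response_to_dict(response):
    """Return the parsed JSON object, or None if the response is empty or not valid JSON."""
    if not response.parts:
        return None
    try:
        payload = json.loads(response.text)
    except (ValueError, TypeError):
        # json.JSONDecodeError is a subclass of ValueError
        return None
    return payload if isinstance(payload, dict) else None


# Stand-ins for API responses, for illustration only
ok = SimpleNamespace(parts=["part"], text='{"POSITION": "Software Engineer"}')
empty = SimpleNamespace(parts=[], text="")
broken = SimpleNamespace(parts=["part"], text="not json")

print(response_to_dict(ok))
print(response_to_dict(empty))
print(response_to_dict(broken))
```

Returning `None` for every failure case keeps the calling loop to a single check before the "Not available" placeholders are appended.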

The extracted entities are in lists; I will add them as new columns to the dataframe.

In [83]:
df['position'] = pd.Series(position)
df['experience'] = pd.Series(experience)
df['skill'] = pd.Series(skill)
df['different_position'] = pd.Series(different_position)
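Assigning `pd.Series(some_list)` to a column aligns on index labels, so it works here because the first rows of the dataframe are labelled from 0 upwards, the same labels a new Series gets by default. The sketch below uses a small made-up frame (the data and column values are hypothetical) to show what happens when the labels do not match, and one way to make the assignment independent of the index.

```python
import pandas as pd

# Made-up frame whose index does not start at 0
frame = pd.DataFrame({"job_title": ["a", "b", "c"]}, index=[10, 11, 12])
labels = ["x", "y", "z"]

# A new Series is labelled 0..2, which matches none of 10..12, so every value becomes NaN
misaligned = frame.assign(position=pd.Series(labels))

# Reusing the frame's index keeps the values in row order
aligned = frame.assign(position=pd.Series(labels, index=frame.index))

print(misaligned["position"].isna().all())
print(aligned["position"].tolist())
```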

In [84]:
df

Unnamed: 0,job_title,company,location,salary,job_description,position,experience,skill,different_position
0,Software Engineer - iOS,BBC,"Salford, Greater Manchester",Competitive Salary + Benefits,Job Introduction We are looking for Mobile Sof...,Mobile Software Engineer,good understanding of object-oriented programm...,"object-oriented programming, clean architectur...",N
1,Software Engineer,Ultimate Performance Fitness,"M1, Manchester","From £50,000 to £60,000 per annum",Location- based in the heart of Manchester Sal...,Personal Trainer,world-leading online and virtual personal trai...,,Y
2,Software Engineer,Precisely Software Limited,UK,Competitive,Precisely is the leader in data integrity. We ...,,,,Y
3,Software Engineer - Manchester - £57k,DGH Recruitment Ltd,"Manchester, Greater Manchester",£45000 - £57000 per annum,My client is currently recruiting for an exper...,Software Engineer,experienced in the full software development l...,"C#, ASP.NET, JavaScript, MVC, HTML, SQL",N
4,Graduate Software Engineer,ITECCO Limited,"Chorley, Lancashire",£20000 - £30000 per annum,Are you ready to kick start your career as a G...,Graduate Software Developer,2:1 or above,"C#, .NET Core, JavaScript",N
5,Software Engineer .Net,IO Associates,"Manchester, Greater Manchester",£50000 - £70000 per annum + benefits,Role: Software Engineer .Net Location: Remote ...,Software Engineer .Net,Mid to Senior/Tech Lead level,.Net,N
6,Software Engineer,Bright Purple,UK,"From £50,000 to £65,000 per annum",Heres a rare opportunity to join a global engi...,Senior Software Engineer,Software Engineering,"C++, C#, SQL Server",N
7,Software Engineer/Data Scientist/Maths,RE&M,"WA14, Altrincham",£50000 - £60000 per annum,Software engineers & Data Scientists are expec...,Not available,Not available,Not available,Not available
8,Software Engineer,Sopra Steria Limited,UK,"Up to £44,000 per annum",This opportunity is for a Software Engineer lo...,Software Engineer,some software engineering experience,"Enterprise Resource and Planning applications,...",N
9,Principal Mobile Software Engineer,BBC,"Salford, Greater Manchester",Competitive Salary + Benefits,"Job Introduction The BBC Mobile Platform, part...",Principal Software Engineer,developing media player components,media player components,N


I want to remove the rows where the different_position column value is Y, as that means the position was judged to be most likely unrelated to Software Engineer.

In [85]:
df[df['different_position'] == 'Y']

Unnamed: 0,job_title,company,location,salary,job_description,position,experience,skill,different_position
1,Software Engineer,Ultimate Performance Fitness,"M1, Manchester","From £50,000 to £60,000 per annum",Location- based in the heart of Manchester Sal...,Personal Trainer,world-leading online and virtual personal trai...,,Y
2,Software Engineer,Precisely Software Limited,UK,Competitive,Precisely is the leader in data integrity. We ...,,,,Y


The above positions do indeed look like they are not exactly Software Engineer positions, so I will remove them from the dataframe

In [86]:
df = df[df['different_position']!='Y']
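Filtering with `!= 'Y'` keeps every row whose flag is anything other than Y, including rows where no response was available. If only confirmed matches are wanted, selecting `== 'N'` is a stricter alternative. The sketch below compares the two on a small made-up frame; the data is hypothetical and only the column name is taken from the notebook.

```python
import pandas as pd

# Made-up flags covering the three values that occur in the notebook
flags = pd.DataFrame({
    "job_title": ["job A", "job B", "job C", "job D"],
    "different_position": ["N", "Y", "Not available", "N"],
})

not_flagged = flags[flags["different_position"] != "Y"]   # keeps 'N' and 'Not available'
confirmed = flags[flags["different_position"] == "N"]     # keeps 'N' only

print(len(not_flagged), len(confirmed))
```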

Now I have a dataframe of Software Engineer or related positions with the entities extracted from the job descriptions. Rows for which no response was available (marked "Not available") are also kept.

In [87]:
df

Unnamed: 0,job_title,company,location,salary,job_description,position,experience,skill,different_position
0,Software Engineer - iOS,BBC,"Salford, Greater Manchester",Competitive Salary + Benefits,Job Introduction We are looking for Mobile Sof...,Mobile Software Engineer,good understanding of object-oriented programm...,"object-oriented programming, clean architectur...",N
3,Software Engineer - Manchester - £57k,DGH Recruitment Ltd,"Manchester, Greater Manchester",£45000 - £57000 per annum,My client is currently recruiting for an exper...,Software Engineer,experienced in the full software development l...,"C#, ASP.NET, JavaScript, MVC, HTML, SQL",N
4,Graduate Software Engineer,ITECCO Limited,"Chorley, Lancashire",£20000 - £30000 per annum,Are you ready to kick start your career as a G...,Graduate Software Developer,2:1 or above,"C#, .NET Core, JavaScript",N
5,Software Engineer .Net,IO Associates,"Manchester, Greater Manchester",£50000 - £70000 per annum + benefits,Role: Software Engineer .Net Location: Remote ...,Software Engineer .Net,Mid to Senior/Tech Lead level,.Net,N
6,Software Engineer,Bright Purple,UK,"From £50,000 to £65,000 per annum",Heres a rare opportunity to join a global engi...,Senior Software Engineer,Software Engineering,"C++, C#, SQL Server",N
7,Software Engineer/Data Scientist/Maths,RE&M,"WA14, Altrincham",£50000 - £60000 per annum,Software engineers & Data Scientists are expec...,Not available,Not available,Not available,Not available
8,Software Engineer,Sopra Steria Limited,UK,"Up to £44,000 per annum",This opportunity is for a Software Engineer lo...,Software Engineer,some software engineering experience,"Enterprise Resource and Planning applications,...",N
9,Principal Mobile Software Engineer,BBC,"Salford, Greater Manchester",Competitive Salary + Benefits,"Job Introduction The BBC Mobile Platform, part...",Principal Software Engineer,developing media player components,media player components,N
10,C# Software Engineer,Pimlico Banks Recruitment,"M1, Manchester","Up to £50,000 per annum + amazing benefits + 2...",Are you a talented and skilled Software Engine...,Software Engineer,talented and skilled Software Engineer with st...,"C#, .NET, .NET Core, JavaScript, React, AWS, K...",N
11,Embedded Software Engineer,KO2 Embedded Recruitment Solutions LTD,"Barnsley, South Yorkshire, S75 3JT",£35000 - £45000 per annum,Position: Embedded Engineer Package: Salary up...,Embedded Engineer,"Embedded C, RTOS or BareMetal development, SPI...","Embedded C, RTOS, BareMetal, SPI, UART, I2C, E...",N
