In [11]:
import time

import openai
import pandas as pd
import spacy

In [None]:
#!pip install spacy
#!python -m spacy download en_core_web_sm

### Parameters you can play with
1. `limit` in the function `get_paras`: the maximum paragraph length, counted in whitespace-separated words rather than model tokens. A paragraph is what we pass to GPT-3 as input.
2. `temperature` (randomness of the GPT-3 output) and `model` in the function `get_response`.
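
To get a feel for how `limit` trades paragraph size against the number of API calls, here is a minimal sketch that works on plain strings instead of spaCy sentences. The name `chunk_sentences` and the sample sentences are illustrative only and are not part of this notebook.

```python
def chunk_sentences(sentences, limit=64):
    """Group sentences into chunks of fewer than `limit` whitespace-separated words."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk when adding this sentence would reach the limit.
        if current and count + n_words >= limit:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = [
    "Experience with Python and SQL.",
    "Knowledge of Docker is a plus.",
    "Strong communication skills.",
]
for limit in (8, 64):
    # One API call would be made per chunk.
    print(limit, len(chunk_sentences(sentences, limit=limit)))
```

A smaller `limit` gives the model less text per call but needs more calls. A larger `limit` does the opposite.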

In [28]:
# Reading the data
df = pd.read_csv('all_data.csv')
df.head()

Unnamed: 0,url,title,job_description,seniority_level,employment_type
0,https://www.linkedin.com/jobs/view/database-de...,Database Developer,SummaryThe Database Developer is part of the C...,Entry level,Full-time
1,https://www.linkedin.com/jobs/view/software-en...,Software Engineer I (Full Time) United States,What You’ll DoOur software engineers are the g...,Not Applicable,Full-time
2,https://www.linkedin.com/jobs/view/penetration...,Penetration Tester (Network/Cloud/Application)...,Responsibilities About TikTokTikTok is the lea...,Not Applicable,Full-time
3,https://www.linkedin.com/jobs/view/sql-develop...,SQL Developer,"SQL Developer - Long Beach, CA - Infosys Need...",Entry level,Contract
4,https://ca.linkedin.com/jobs/view/cyber-securi...,Cyber Security Specialist,As one of Canada’s largest and fastest growing...,Not Applicable,Full-time


In [329]:
# Read the key and close the file; strip the trailing newline, if any
with open('API_key.txt') as file:
    API_key = file.read().strip()

In [330]:
# Setting the API key
openai.api_key = API_key
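
As an alternative to keeping the key in a text file next to the notebook, the sketch below looks for it in an environment variable first and falls back to the file. The helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed here as the variable name.

```python
import os
from pathlib import Path


def load_api_key(path="API_key.txt", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment if set, otherwise from a file."""
    key = os.environ.get(env_var)
    if key and key.strip():
        return key.strip()
    # Fall back to the key file used in this notebook.
    return Path(path).read_text().strip()


# Possible usage in the cell above:
# openai.api_key = load_api_key()
```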

In [331]:
def create_prompt(input):
    prompt = f"""Input is a job description. Output is the  technical skills and technoligies required for Input job description. 
    You have to predict Output using Input.
    Make sure you don't include anything in the Output that is not part of Input. 
    Output should only include technical skills used in the IT industry like programming languages, algoritms, software tools etc.

    Input: My name is X. I am proficient in python, SQL, machine learning. I am experienced in classification and regression. We want a software engineer who is comfortable with html and css. Who has knowledge of javascript and SQL. Who has c++ and python programming experience.
    Output: python, SQL, machine learning, classification, regression, html, css, c++,python
    Input: Proficiency with Python, or another interpreted programming language like R or Matlab. Experience with time series modeling and causal inference a plus 
    Output: time series, Python
    Input: Good communication skills and confidence to give presentations in high pressure situations
    Output:
    Input: Ability to optimize and condense information and transform data into easily understandable concepts.Basic technical skills in MS Excel, PowerPoint, Word
    Output: MS Excel, PowerPoint, Word
    Input: years of relevant experience with statistical computing in R or Python. 3+ years of experience with Machine Learning algorithms and Probabilistic Modelling. Experience with SQL and SQL Server. Experience with modern R packages and technologies such as dplyr, tidyR, data.table, shinyR preferred. Experience with .NET Framework and C# is preferable.
    Output: R, Python, Machine Learning, SQL, dplyr, tidyR, shinyR, .NET, C#
    Input: years experience with data manipulation ecosystems with R, Pandas. Also good with html and java.
    Output: R, Pandas,html, java
    Input: Knowledge of data lake storage and data warehousing solution components like Azure Data Lake, Synapse, Databricks, snowflakes. 2 years in c++ programming.
    Output: Azure Data Lake, Synapse, Databricks, snowflakes, c++
    Input: Solid knowledge of JavaScript, CSS, HTML, and front-end languages including Node-JS.
    Output: JavaScript, CSS, HTML, Node-JS
    Input : {input}
    Output:"""
    return prompt

create_prompt('We need someone with Matlab expertise')

"Input is a job description. Output is the  technical skills and technoligies required for Input job description. \n    You have to predict Output using Input.\n    Make sure you don't include anything in the Output that is not part of Input. \n    Output should only include technical skills used in the IT industry like programming languages, algoritms, software tools etc.\n\n    Input: My name is X. I am proficient in python, SQL, machine learning. I am experienced in classification and regression. We want a software engineer who is comfortable with html and css. Who has knowledge of javascript and SQL. Who has c++ and python programming experience.\n    Output: python, SQL, machine learning, classification, regression, html, css, c++,python\n    Input: Proficiency with Python, or another interpreted programming language like R or Matlab. Experience with time series modeling and causal inference a plus \n    Output: time series, Python\n    Input: Good communication skills and confide

In [332]:
nlp = spacy.load('en_core_web_sm')

def sentence_segmentation(job_desc):
    """
    This function returns a list of sentences (spaCy spans) for a job description
    """
    doc = nlp(job_desc)
    return list(doc.sents)

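def get_paras(sents, limit = 64):
    '''
    Groups a list of sentences into paragraphs of fewer than `limit` words each
    (so that we limit the number of API calls to some extent).
    A single sentence of `limit` words or more becomes its own paragraph.
    '''
    num_words = 0
    paragraphs = []
    para = ''
    for sent in sents:
        sent = sent.text
        sent_words = len(sent.split())
        # Close the current paragraph before it reaches the limit
        if para and num_words + sent_words >= limit:
            paragraphs.append(para)
            para = ''
            num_words = 0
        # Join with a space so that adjacent sentences do not run together
        para = f'{para} {sent}' if para else sent
        num_words += sent_words
    if para:
        paragraphs.append(para)
    return paragraphs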

get_paras(sentence_segmentation(df['job_description'][0]))

['SummaryThe Database Developer is part of the CEMCO development team whose responsibilities include managing and maintaining the enterprise data warehouse.Strong development skills in C# and .Net, including Excel VSTO.The ideal candidate should have a deep understanding of database management systems, as well as the ability to write complex code in C# and .Net to build, maintain and optimize databases.',
 'Effective at communicating with users to define requirements and design, solutions based on those requirements.This position is also responsible for creating tools that provide insight to various departments, including Finance, Accounting, and Customer Service to build data visualizations and dashboards.Essential Duties And ResponsibilitiesDesign, install, configure, and maintain database management systems.Monitor database performance and provide optimization.Develop, implement, and maintain backup and recovery procedures.Perform database security administration.Ensure data integri

In [333]:
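def get_response(prompt, model, temperature = 0):
    """
    Sends the prompt to the completion API and returns the list of skills
    parsed from the first line of the completion
    """
    response = openai.Completion.create(
        engine=model,
        prompt=prompt,
        temperature=temperature,  # use the argument instead of a hard-coded 0
        max_tokens=6
    )

    # Only the first line of the completion is treated as the answer
    result = response.choices[0].text
    result = result.split('\n')[0]
    # Keep comma-separated items of at most two words and at least two characters
    return [i.strip() for i in result.split(',') if len(i.split())<3 and len(i.strip())>1]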
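
The post-processing in `get_response` can be tried without calling the API. The sketch below repeats the same filter in a standalone function. `parse_skills`, its parameter names, and the sample completion text are made up for illustration.

```python
def parse_skills(completion_text, max_words=2, min_chars=2):
    """Split the first line of a completion on commas and filter the items."""
    first_line = completion_text.split("\n")[0]
    skills = []
    for item in first_line.split(","):
        item = item.strip()
        # Same rule as get_response: at most two words, at least two characters.
        if len(item.split()) <= max_words and len(item) >= min_chars:
            skills.append(item)
    return skills


sample = " Python, SQL, R, strong written communication skills\nInput: ..."
print(parse_skills(sample))
```

With this rule the long phrase is dropped, but so is the single-character skill `R`, which is worth keeping in mind when reading the results.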

In [335]:
def get_skills(job_desc):
    """
    Function to get all the skills required for a job
    """
    skills = []
    for para in get_paras(sentence_segmentation(job_desc)):
        # Skip paragraphs that are too short to contain a skill
        if len(para) > 5:
            prompt = create_prompt(para)
            output = get_response(prompt, 'text-curie-001')
            # Keep only skills that literally appear in the paragraph
            output = [i for i in output if i.lower() in para.lower()]
            skills.extend(output)
            # Pause between API calls
            time.sleep(1)
    return skills
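
Because `get_skills` extends the list once per paragraph, a skill mentioned in several paragraphs is collected several times, possibly with different casing. A minimal sketch of case-insensitive de-duplication that keeps the first spelling seen follows. `dedupe_skills` is a hypothetical helper, not part of this notebook.

```python
def dedupe_skills(skills):
    """Remove case-insensitive duplicates, keeping first occurrences in order."""
    seen = set()
    unique = []
    for skill in skills:
        key = skill.lower()
        if key not in seen:
            seen.add(key)
            unique.append(skill)
    return unique


raw = ["C#", ".Net", "database performance", "C#", ".NET", "T-SQL", "SQL"]
print(dedupe_skills(raw))
# ['C#', '.Net', 'database performance', 'T-SQL', 'SQL']
```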
    

#### Getting skills for the first 3 jobs

In [339]:
jobs = df['job_description'].tolist()[:3]

In [340]:
import time
skills = []
jds = []
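for job in jobs:
    try:
        skills.append(get_skills(job))
        jds.append(job)
    except Exception as e:
        # The job is skipped on failure; wait before moving on to the next one
        print('Skipping job:', e)
        time.sleep(60)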
skills

[['C#',
  '.Net',
  'database performance',
  'C#',
  '.NET',
  'T-SQL',
  'SQL',
  'Project management',
  'collaboration',
  'teamwork',
  'close vision',
  'distance vision',
  'reasonable accommodations',
  'office',
  'CEMCO',
  'LLC',
  'CEMCO',
  'Steel Fram'],
 ['C++', 'Python', 'Java'],
 ['USDSAt TikTok', 'Security Testing', 'Operating Systems', 'Health Care']]
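
The loop above skips a job when the API call fails. If the job should be retried instead, one option is a small retry wrapper such as the sketch below. `call_with_retry` and `flaky_call` are illustrative names, and the failure is simulated so the example runs without the API.

```python
import time


def call_with_retry(func, *args, retries=3, wait_seconds=60, sleep=time.sleep):
    """Call func(*args); on an exception, wait and retry up to `retries` times."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception as error:
            last_error = error
            # Do not wait after the final failed attempt.
            if attempt < retries:
                sleep(wait_seconds)
    raise last_error


attempts = {"count": 0}


def flaky_call(value):
    """Stand-in for an API call: fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return value.upper()


# The sleep is replaced by a no-op so the demo finishes immediately.
result = call_with_retry(flaky_call, "python", wait_seconds=0, sleep=lambda seconds: None)
print(result, attempts["count"])
```

In the notebook this could be used as `skills.append(call_with_retry(get_skills, job))`, assuming a 60-second wait is enough for the failure to clear.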

In [341]:
# Saving the data
df_result = pd.DataFrame({
    'job':jds, 
    'skills': skills
})
# index=False avoids an extra "Unnamed: 0" column when the file is read back
df_result.to_csv('GPT3_annotated_data.csv', index=False)
df_result

Unnamed: 0,job,skills
0,SummaryThe Database Developer is part of the C...,"[C#, .Net, database performance, C#, .NET, T-S..."
1,What You’ll DoOur software engineers are the g...,"[C++, Python, Java]"
2,Responsibilities About TikTokTikTok is the lea...,"[USDSAt TikTok, Security Testing, Operating Sy..."
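
Note that `to_csv` writes each list in the `skills` column as its string representation, so reading the file back yields strings such as `"['C++', 'Python', 'Java']"` rather than lists. A minimal sketch of converting such a cell back, using only the standard library (`parse_skill_cell` is a hypothetical helper):

```python
import ast


def parse_skill_cell(cell):
    """Turn the text of a saved `skills` cell back into a Python list."""
    # literal_eval only evaluates Python literals, unlike eval.
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


print(parse_skill_cell("['C++', 'Python', 'Java']"))
# ['C++', 'Python', 'Java']
```

After `pd.read_csv('GPT3_annotated_data.csv')`, this could be applied with `df['skills'].apply(parse_skill_cell)`.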


#### For debugging: printing the output of each API call

In [338]:
for sent in get_paras(sentence_segmentation(df['job_description'][1])):
    print('Sentence:',sent)
    prompt = create_prompt(sent)
    output = get_response(prompt, 'text-curie-001')
    output = [i for i in output if i.lower() in sent.lower()]
    print('----------------')
    print('Output:',output)
    print()
    print('')
    #print('prompt:', prompt)
    #break

Sentence: What You’ll DoOur software engineers are the gurus behind the scenes ensuring all of our programs are easy to use and bug free.Using a keen eye, you’ll develop software and tools in support of many of our high-impact technology platforms such as operating systems, networks, databases and more.
----------------
Output: []


Sentence: While we’re growing in the software business, you’ll still need to see the big picture and watch for hardware compatibility while even potentially influencing design.
----------------
Output: []


Sentence: Who You’ll Work WithOur hardworking team members are busy programming magic across the globe on teams such as Engineering, Information Technology, Supply Chain, Customer Experience, Security and Trust, etc.You would play a crucial role in driving next-gen software innovations including cloud, mobile, desktop or security spaces.', "On any of these teams, you'll get hands-on experience working with applications that make technology accessible no 