# 3. Extract Entities From Job Description

I want to extract three kinds of entities from the job descriptions:
- **Programming Languages**: which programming languages companies are asking for
- **Soft Skills**: which soft skills are expected in Data Analyst jobs
- **Hard Skills**: which hard skills are expected in Data Analyst jobs

## 3.1 spaCy Setup

In [72]:
# !pip freeze | grep spacy

In [73]:
!python -m spacy download en_core_web_sm 


Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 8.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [1]:
# wordsegment probabilistically splits undelimited text into words, e.g. it turns 'hiimjohn' into 'hi im john'.
# It may or may not end up being used for entity extraction.
from wordsegment import load as load_words, segment, clean as segment_clean, WORDS, BIGRAMS, UNIGRAMS
import string

In [2]:
load_words()
len(BIGRAMS), len(UNIGRAMS), len(WORDS)

(258437, 333213, 178758)

In [3]:
from nltk.corpus import words
setofwords = set(words.words())

In [4]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

## 3.2 Test out Doc and Span classes

In [16]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
# (named doc_words so the nltk `words` corpus imported earlier is not shadowed)
doc_words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=doc_words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

In [21]:
doc.ents[0].start

0

In [23]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [337]:
spacy.explain("appos")

'appositional modifier'

## 3.3 spaCy Custom Components

In [5]:
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

In [8]:
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
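
The same pattern could carry over to the skill entities listed at the start of this section. The sketch below is only an illustration under a few assumptions: the term list is a hypothetical hand-written one, the names `HARD_SKILL` and `hard_skill_component` are made up for this example, and a blank English pipeline is used so that no model download is needed. Unlike the animal component, it merges new matches with whatever is already in `doc.ents` (using `spacy.util.filter_spans` to drop overlaps) instead of overwriting them.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Blank pipeline: tokenizer only, no trained components (assumption for the sketch)
skill_nlp = spacy.blank("en")

# Hypothetical term list; a real one would be much longer
hard_skill_terms = ["Python", "SQL", "Tableau"]

# attr="LOWER" makes the match case-insensitive
skill_matcher = PhraseMatcher(skill_nlp.vocab, attr="LOWER")
skill_matcher.add("HARD_SKILL", [skill_nlp.make_doc(t) for t in hard_skill_terms])


@Language.component("hard_skill_component")
def hard_skill_component(doc):
    # Build a labelled Span for every phrase match
    new_spans = [
        Span(doc, start, end, label="HARD_SKILL")
        for _, start, end in skill_matcher(doc)
    ]
    # Keep existing entities and resolve overlaps (longest span wins)
    doc.ents = filter_spans(list(doc.ents) + new_spans)
    return doc


skill_nlp.add_pipe("hard_skill_component", last=True)

skill_doc = skill_nlp("I want to use Tableau and be good at python")
skills = [(ent.text, ent.label_) for ent in skill_doc.ents]
print(skills)
```

With a trained pipeline, placing the component after `ner` and merging in this way would keep the statistical entities alongside the matched skills.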


Rate limiting is a problem here. I do not want to upgrade from the free plan, so I either live with whatever entity extraction the free plan allows or drop the idea of using these powerful generative autoregressive LLMs.
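
One way to stay under a rate limit is to space out requests and never send the same text twice. The sketch below is not what this notebook does; it is a standard-library illustration in which `Throttle` and `make_cached_extractor` are hypothetical helpers, and `extract` stands in for whatever client call performs the extraction. The actual limit of the free plan is not stated here, so `min_interval` has to be chosen by the user.

```python
import time
from functools import lru_cache


class Throttle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def make_cached_extractor(extract, throttle):
    """Wrap `extract(text, searched_entity)` with a cache and a throttle.

    Repeated (text, searched_entity) pairs are answered from the cache and
    do not count against the rate limit.
    """

    @lru_cache(maxsize=None)
    def cached(text, searched_entity):
        throttle.wait()
        return extract(text, searched_entity)

    return cached
```

Job descriptions often repeat boilerplate sentences, so the cache alone may remove a fair share of requests before the throttle even matters.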

In [None]:
llama.entities("I want to use Tableau and be good at Python", searched_entity="hard skill")

{'entities': [{'start': 14,
   'end': 21,
   'type': 'hard skill',
   'text': 'Tableau'},
  {'start': 37, 'end': 43, 'type': 'hard skill', 'text': 'Python'}]}
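
Since the offsets come from a generative model, it seems worth checking that they really point at the reported text before using them. A minimal sketch, assuming the response always has the shape shown above; `check_entity_offsets` is a hypothetical helper, not part of any library.

```python
def check_entity_offsets(text, response):
    """Return the entities whose start/end offsets do not reproduce their text."""
    mismatches = []
    for ent in response.get("entities", []):
        # Slice the source text with the reported character offsets
        if text[ent["start"]:ent["end"]] != ent["text"]:
            mismatches.append(ent)
    return mismatches


sample_text = "I want to use Tableau and be good at Python"
sample_response = {
    "entities": [
        {"start": 14, "end": 21, "type": "hard skill", "text": "Tableau"},
        {"start": 37, "end": 43, "type": "hard skill", "text": "Python"},
    ]
}

mismatches = check_entity_offsets(sample_text, sample_response)
print(mismatches)  # → []
```

An empty list means every offset pair slices out exactly the entity text, as it does for the response above.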