### An example of tagging EDUCATION in a resume using [spaCy's rule-based matching](https://spacy.io/usage/rule-based-matching)

In [2]:
import pandas as pd
import numpy as np
import re
import string
import spacy
from spacy.matcher import Matcher

In [3]:
# Download the dataset from here: https://www.kaggle.com/samdeeplearning/deepnlp
# It's a rather small dataset but good for demonstration purposes. 
resume_df = pd.read_csv('../data/Sheet_2.csv', encoding = "ISO-8859-1")
resume_df.head()

Unnamed: 0,resume_id,class,resume_text
0,resume_1,not_flagged,\rCustomer Service Supervisor/Tier - Isabella ...
1,resume_2,not_flagged,\rEngineer / Scientist - IBM Microelectronics ...
2,resume_3,not_flagged,\rLTS Software Engineer Computational Lithogra...
3,resume_4,not_flagged,TUTOR\rWilliston VT - Email me on Indeed: ind...
4,resume_5,flagged,\rIndependent Consultant - Self-employed\rBurl...


In [91]:
resume_df['resume_text_pp'] = resume_df['resume_text'].apply(lambda x: re.sub(r'\r', '\n', x))
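
The cell above swaps bare carriage returns for newlines. If a file also contained Windows-style `\r\n` pairs, that substitution would leave doubled newlines. A minimal sketch of a stricter normalizer follows; the helper name `normalize_newlines` is hypothetical and not part of the notebook.

```python
import re

def normalize_newlines(text: str) -> str:
    # Hypothetical helper: turn CRLF pairs and bare CR characters into a single LF.
    return re.sub(r'\r\n?', '\n', text)

sample = "TUTOR\rWilliston VT\r\nWORK EXPERIENCE"
print(normalize_newlines(sample).split('\n'))
```

It could be applied the same way as the original lambda, e.g. `resume_df['resume_text'].apply(normalize_newlines)`.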

In [93]:
resume_df.head()

Unnamed: 0,resume_id,class,resume_text,resume_text_pp
0,resume_1,not_flagged,\rCustomer Service Supervisor/Tier - Isabella ...,\nCustomer Service Supervisor/Tier - Isabella ...
1,resume_2,not_flagged,\rEngineer / Scientist - IBM Microelectronics ...,\nEngineer / Scientist - IBM Microelectronics ...
2,resume_3,not_flagged,\rLTS Software Engineer Computational Lithogra...,\nLTS Software Engineer Computational Lithogra...
3,resume_4,not_flagged,TUTOR\rWilliston VT - Email me on Indeed: ind...,TUTOR\nWilliston VT - Email me on Indeed: ind...
4,resume_5,flagged,\rIndependent Consultant - Self-employed\rBurl...,\nIndependent Consultant - Self-employed\nBurl...


In [95]:
print(resume_df['resume_text_pp'][3])

 TUTOR
Williston VT - Email me on Indeed: indeed.com/r/Alec-Schwartz/7177c11327372c0a
WORK EXPERIENCE
Tutor
Dickinson College Biology Department - Carlisle PA - March 2016 to May 2016
I was invited to tutor three students enrolled in Biology 120: Life at the Extremes. I helped them learn as independently as possible while still acting as a mentor and guide.
Teaching Assistant
Dickinson College Biology Department - Carlisle PA - January 2016 to May 2016
Taught by Professor Scott Boback this comparative physiology course explored how extremophiles are capable of surviving and maintaining
homeostasis in harsh environments. I helped students perform hypothesis-driven physiology experiments and vertebrate dissections.
QA/QC Laboratory Coordinator
Alliance for Aquatic Resource Monitoring (ALLARM) - Carlisle PA - August 2015 to May 2016
ALLARM a small NGO housed at Dickinson college engages communities to use science as a tool to investigate the health of their streams. I helped
mentor organi

In [104]:
s1 = r'EDUCATION AAS in Visual Arts \n Westchester Community College - New York NY School knowledge'
s2 = r'EDUCATION Bachelors in Business Technology and Management \nVermont Technical College'
s3 = r'EDUCATION BS in Biochemistry and Molecular Biology \nDickinson College - Carlisle PA 2012 to 2016'
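
One detail to check in the three strings above: they carry the `r` (raw) prefix, so `\n` is kept as two characters, a backslash and an `n`, rather than as a newline. Whether that is intended here is not stated; the short check below only shows the difference between the two spellings.

```python
raw = r'Arts \n Westchester'     # raw string: backslash and 'n' are kept literally
cooked = 'Arts \n Westchester'   # regular string: \n is a single newline character

print(len(raw), len(cooked))        # the raw version is one character longer
print('\n' in raw, '\n' in cooked)  # only the regular string contains a newline
```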

In [105]:
nlp = spacy.load('en_core_web_sm')

In [132]:
doc = nlp(s1)

In [133]:
for tok in doc:
    print(tok.text, " ", tok.dep_, " ", tok.pos_)

EDUCATION   compound   PROPN
AAS   ROOT   PROPN
in   prep   ADP
Visual   compound   ADJ
Arts   compound   NOUN
\n   pobj   SPACE
Westchester   compound   PROPN
Community   compound   PROPN
College   compound   PROPN
-   punct   PUNCT
New   compound   PROPN
York   compound   PROPN
NY   compound   PROPN
School   compound   PROPN
knowledge   ROOT   NOUN


In [134]:
doc.ents

()

In [135]:
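### Rule-based information extraction
# Token pattern: "EDUCATION", a degree word, a preposition, then a title-cased field name.
pattern = [
    {'TEXT': 'EDUCATION'},                                      # section header
    {'TEXT': {"REGEX": "(Bachelors)|(Masters)|([A-Za-z.]+)"}},  # degree, e.g. AAS or BS
    {'POS': "ADP"},                                             # preposition such as "in"
    {'IS_TITLE': True, "OP": '+'},                              # one or more title-cased tokens
    {'TEXT': "and", "OP": "?"},                                 # optional "and"
    {'IS_TITLE': True, "OP": '+'}                               # one or more further title-cased tokens
]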
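
In the pattern above, the `REGEX` value lists three alternatives, but `[A-Za-z.]+` already matches `Bachelors` and `Masters`. To my understanding spaCy evaluates `REGEX` with `re.search`, so an unanchored pattern accepts any token that contains a letter or a period. The check below uses plain `re` to illustrate this; the anchored variant `degree_re` is a hypothetical tightening, not the notebook's rule.

```python
import re

original = re.compile(r"(Bachelors)|(Masters)|([A-Za-z.]+)")
# Hypothetical stricter variant: the whole token must be a capitalized
# word made of letters and periods (e.g. "AAS", "B.S.", "Bachelors").
degree_re = re.compile(r"^(Bachelors|Masters|[A-Z][A-Za-z.]*)$")

for token in ["Bachelors", "AAS", "BS", "in", "2016a"]:
    print(token, bool(original.search(token)), bool(degree_re.search(token)))
```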

In [136]:
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("match1", [pattern])  # list-of-patterns signature (spaCy v2.2.2+ and v3)

In [137]:
matches = matcher(doc)

In [138]:
matches

[(12981744483764759145, 0, 5)]

In [140]:
doc[matches[0][1]:matches[0][2]]

EDUCATION AAS in Visual Arts
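
Each entry returned by the matcher is a `(match_id, start, end)` triple, where `start` and `end` are token offsets into the document. The sketch below illustrates that slicing with a plain Python list standing in for the spaCy `Doc` (an assumption made so the snippet runs without a model); the token texts and the match triple are copied from the outputs above.

```python
# First five token texts from the printed parse of s1; later tokens are omitted.
tokens = ["EDUCATION", "AAS", "in", "Visual", "Arts"]
matches = [(12981744483764759145, 0, 5)]

for match_id, start, end in matches:
    # end is exclusive, so tokens[start:end] covers tokens 0 through 4
    span_text = " ".join(tokens[start:end])
    print(match_id, start, end, span_text)
```

With a real `Doc`, `doc[start:end]` returns a `Span`, and `nlp.vocab.strings[match_id]` recovers the rule name (`match1` here).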