In [158]:
import pandas as pd
import sqlite3
import spacy
import numpy as np
from numpy import dot
from numpy.linalg import norm

In [78]:
# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("collectors/data.sqlite3")
job_df = pd.read_sql_query("SELECT * from job_post", con)
company_review_df = pd.read_sql_query("SELECT * from company_review", con)
job_interview_df = pd.read_sql_query("SELECT * from job_interview", con)
con.close()

In [79]:
# Verify that result of SQL query is stored in the dataframe
job_df.head()

Unnamed: 0,id,title,company,location,description,source,search_kw
0,1,Data Scientist,Aquatic Informatics,"Vancouver, BC",Do you want a meaningful role in a company tha...,indeed.com,data scientist
1,2,Business Intelligence Analyst,GLENTEL,"Burnaby, BC",Brand: Glentel Corporate\nLocation: Burnaby Of...,indeed.com,data scientist
2,3,Human Resources Data Scientist,Rio Tinto,Canada,2 x newly created Data Scientist opportunities...,indeed.com,data scientist
3,4,Lead - Human Resource Data Scientist,Rio Tinto,Canada,Newly created data science lead embedded withi...,indeed.com,data scientist
4,5,Machine Learning Engineer,Skycope Technologies Inc,"Vancouver, BC","Who We are\nFounded in 2016, Skycope Technolog...",indeed.com,data scientist


In [80]:
job_descriptions = job_df['description'].to_list()

In [81]:
print("FIRST 100 CHARACTERS OF 5 JOB DESCRIPTIONS:")
for i, description in enumerate(job_descriptions[:5]):
    print(f"\nJOB {i}")
    print(description[:100])

FIRST 100 CHARACTERS OF 5 JOB DESCRIPTIONS:

JOB 0
Do you want a meaningful role in a company that is making a difference in the world? Do you want to 

JOB 1
Brand: Glentel Corporate
Location: Burnaby Office - 8501 Commerce Court, Burnaby, BC
Are you looking

JOB 2
2 x newly created Data Scientist opportunities embedded within the HR function
Help to define and ma

JOB 3
Newly created data science lead embedded within the HR function
Exciting opportunity to help shape t

JOB 4
Who We are
Founded in 2016, Skycope Technologies Inc. is a high tech company based in Burnaby, Canad
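
The slice `description[:100]` above takes the first 100 characters of each description, which is why the previews stop mid-word. If a preview of whole words is preferred, one option is to split on whitespace first. The sketch below is a minimal illustration; `first_words` is a hypothetical helper that is not part of the notebook, and the sample string is toy data.

```python
def first_words(text, n=100):
    """Return the first n whitespace-separated words of text, joined by single spaces."""
    # str.split() with no arguments splits on any run of whitespace, including newlines
    return " ".join(text.split()[:n])

# Toy input, not read from the database
sample = "Brand: Glentel Corporate\nLocation: Burnaby Office"
print(first_words(sample, n=3))  # Brand: Glentel Corporate
```

Note that this collapses newlines into spaces, so it suits a quick preview rather than anything that must preserve layout.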


In [88]:
# Download the pretrained English model if it is not already installed
try:
    import en_core_web_sm
except ImportError:
    !python -m spacy download en_core_web_sm
    import en_core_web_sm

In [83]:
# Pretrained English model (small). It ships without static word vectors,
# so token.vector falls back to context-sensitive tensors.
nlp = en_core_web_sm.load()

In [46]:
docs = [nlp(doc) for doc in job_descriptions]

In [85]:
# Sentence by sentence
for sentence in docs[0].sents:
    print(sentence)

Do you want a meaningful role in a company that is making a difference in the world?
Do you want to be involved in one of the most important environmental resource areas today?
Do you want to learn what's involved in developing and deploying machine learning and predictive analytics solutions from colleagues with years of research and development experience?
Then join our energetic and growing team and help revolutionize an industry.

About
our company
Founded in 2003, Aquatic Informatics provides software solutions that address critical water data management, analytics and compliance challenges for the rapidly growing water industry.
Aquatic Informatics is the trusted provider of water management solutions to over 1,000 municipal, federal, state/provincial, hydropower, mining, academic, and consulting organizations in over 60 countries that collect, manage, and process large volumes of water data.

Aquatic Informatics' platforms include AQUARIUS (http://aquaticinformatics.com/why-aqua

In [75]:
# Token by token, with each token's character offset in the document
for token in docs[0][:25]:
    print(token, token.idx)

Do 0
you 3
want 7
a 12
meaningful 14
role 25
in 30
a 33
company 35
that 43
is 48
making 51
a 58
difference 60
in 71
the 74
world 78
? 83
Do 85
you 88
want 92
to 97
be 100
involved 103
in 112


In [86]:
# Each doc exposes its extracted named entities, each with a label, via doc.ents
# Check: https://spacy.io/api/annotation#named-entities for further details
for item in docs[0].ents:
    print(item.text, item.label_)

today DATE
years DATE
2003 DATE
Aquatic Informatics ORG
Aquatic Informatics ORG
over 1,000 CARDINAL
over 60 CARDINAL
Aquatic Informatics' ORG
AQUARIUS ORG
WaterTrax ORG
Linko ORG
https://aquaticinformatics.com/products/linko/ ORG
Aquatic Informatics ORG
Vancouver GPE
Canada GPE
US GPE
Australia GPE
one CARDINAL
Canada GPE
EQ ORG
PhD WORK_OF_ART
2+ years DATE
at least one CARDINAL
Python ORG
English LANGUAGE
NumPy LOC
TensorFlow PRODUCT
PyTorch ORG
AWS ORG


In [99]:
# Show lemmatization of the first 50 tokens of a document
print("BEFORE LEMMATIZATION:")
print(docs[0][:50])
print("\nAFTER LEMMATIZATION:")
print(' '.join([token.lemma_ for token in docs[0][:50]]))

BEFORE LEMMATIZATION:
Do you want a meaningful role in a company that is making a difference in the world? Do you want to be involved in one of the most important environmental resource areas today? Do you want to learn what's involved in developing and deploying machine learning and

AFTER LEMMATIZATION:
do -PRON- want a meaningful role in a company that be make a difference in the world ? do -PRON- want to be involve in one of the most important environmental resource area today ? do -PRON- want to learn what be involve in develop and deploy machine learning and


In [159]:
def sentvec(s):
    # Represent a string as the mean of its token vectors
    sent = nlp(s)
    return meanv([w.vector for w in sent])

def meanv(coords):
    # Element-wise mean of a non-empty list of vectors
    # assumes every item in coords has same length as item 0
    sumv = [0] * len(coords[0])
    for item in coords:
        for i in range(len(item)):
            sumv[i] += item[i]
    mean = [0] * len(sumv)
    for i in range(len(sumv)):
        mean[i] = float(sumv[i]) / len(coords)
    return mean

def cosine(v1, v2):
    # Cosine similarity; defined as 0.0 when either vector has zero norm
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0
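
The two helpers above can also be written with NumPy directly, which avoids the explicit loops. This is a minimal sketch rather than a drop-in replacement: `mean_vector` and `cosine_similarity` are hypothetical names, and the inputs are toy 2-D vectors, not real spaCy output.

```python
import numpy as np

def mean_vector(vectors):
    """Element-wise mean of a non-empty sequence of equal-length vectors."""
    return np.mean(np.asarray(vectors, dtype=float), axis=0)

def cosine_similarity(v1, v2):
    """Cosine similarity, defined as 0.0 if either vector has zero norm."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    # Compute the product of norms once and reuse it
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom > 0 else 0.0

# Toy 2-D vectors for illustration only
print(mean_vector([[1.0, 2.0], [3.0, 4.0]]))       # [2. 3.]
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
```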

In [146]:
sentences = list(docs[0].sents)

In [147]:
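def spacy_closest_sent(space, input_str, n=10):
    # Return the n spans in `space` whose mean token vector is closest
    # (by cosine similarity) to the mean token vector of input_str
    input_vec = sentvec(input_str)
    return sorted(space,
                  key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vec),
                  reverse=True)[:n]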
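
When the sentence vectors are already stacked in a matrix, the same ranking can be computed in one pass instead of calling `cosine` per sentence. The following is a sketch under that assumption: `top_n_by_cosine` is a hypothetical helper, and `sentence_vecs` holds toy vectors standing in for per-sentence mean vectors.

```python
import numpy as np

def top_n_by_cosine(matrix, query, n=10):
    """Return row indices of `matrix`, ordered by descending cosine similarity to `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    row_norms = np.linalg.norm(matrix, axis=1)
    denom = row_norms * np.linalg.norm(query)
    # Where a row or the query has zero norm, the similarity is left at 0.0
    sims = np.divide(matrix @ query, denom,
                     out=np.zeros(len(matrix)), where=denom > 0)
    # Stable sort on the negated scores keeps the original order among ties
    return np.argsort(-sims, kind="stable")[:n].tolist()

# Toy vectors for illustration only
sentence_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(top_n_by_cosine(sentence_vecs, [1.0, 0.2], n=2))  # [0, 2]
```

The returned indices can then be used to look up the corresponding sentences, for example `sentences[i]` for each index `i`.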

In [162]:
for sent in spacy_closest_sent(sentences, "Learning forever and ever"):
    print(sent.text)
    print("---")

Life long learner - incessantly inquisitive about emerging research and technology.

---
Experience with AWS or other cloud services and cloud technologies.
---
Do you want to learn what's involved in developing and deploying machine learning and predictive analytics solutions from colleagues with years of research and development experience?
---
About the opportunity
You will become an integral member of the team researching and operationalizing algorithms and models for processing water and other environmental data.
---
About
---
Working in an Agile scrum team, you will be exposed to the breadth of data science activities, including hypothesis definition, data wrangling and exploratory data analysis, model development and validation, and production deployment and debugging test and customer-reported issues.

---
Solid understanding of mathematical concepts and techniques including: time series analysis, regression modeling, forecasting, and machine-learning.

---
2+ years of hands-on