# SpaCy

### Overview
Here we use a pre-trained model from the SpaCy package ("en_core_web_lg") to build a recommender system. The SpaCy model is able to process word context, meaning, and sentence structure when calculating similarity scores. SpaCy allows for a deeper analysis into each body of text as compared to the CountVectorizer, Tf-Idf Vectorizer and KNN models. We had high expectations for this model, however as you will see, the model fails to make accurate recommendations in this project.

In [9]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import spacy
from IPython.display import display_html
from sklearn.feature_extraction import text

---

In [3]:
jobs = pd.read_csv('../data/job_postings.csv')
jobs = jobs.drop(columns=['date_added', 'organization', 'skills_len', 'job_type'])

In [43]:
jobs['job_title'].unique()

array(['Analyst', 'Developer', 'Manager', 'Administrator', 'Support',
       'Technician', 'Consulting', 'Engineer', 'Architect', 'Designer',
       'Programmer', 'Data Position', 'Director'], dtype=object)

In [4]:
jobs.head(5)

Unnamed: 0,job_description,job_title,location,skills
0,n edi analyst with experience please read on ...,Analyst,Northeast United States,edi trustedlink as van
1,informatica etl developerst petersburg fl only...,Developer,Southern United States,etl informatica b data exchange netezza oracle...
2,this nationally recognized microsoft gold part...,Manager,Western United States,microsoft dynamics ax project manager - toront...
3,.net developer with experience please read on...,Developer,Northeast United States,c asp.net sql javascript mvc
4,hatstand a global financial consultancy is see...,Developer,Northeast United States,java linux unix sdlc; multi-threaded or concur...


In [5]:
jobs.isna().sum()

job_description      0
job_title            0
location             0
skills             170
dtype: int64

In [6]:
# Replace nan's with empty string
jobs.fillna('', inplace=True)

In [7]:
# Combine job description and text into a single column
jobs['text'] = jobs['job_description'] + ' ' + jobs['skills']

In [22]:
def remove_stopwords(lst):
    for word in lst:
        if word in text.ENGLISH_STOP_WORDS:
            lst.remove(word)
    return lst

In [31]:
text_split = jobs['text'].map(lambda x: x.split())
text_split_no_stopwords = text_split.map(remove_stopwords)
text_no_stopwords = text_split_no_stopwords.map(lambda x: ' '.join(x))

In [32]:
jobs['text'] = text_no_stopwords

# Spacy
https://spacy.io/usage/spacy-101

In [15]:
# Load one of the larger models for a better similarity score
nlp = spacy.load("en_core_web_lg")

In [33]:
# Create spacy documents for each job post
titles_and_docs = jobs[['job_title']].copy()
titles_and_docs['doc'] = jobs['text'].map(nlp)

In [34]:
titles_and_docs

Unnamed: 0,job_title,doc
0,Analyst,"(n, edi, analyst, experience, read, we, strong..."
1,Developer,"(informatica, etl, developerst, petersburg, fl..."
2,Manager,"(nationally, recognized, microsoft, gold, part..."
3,Developer,"(.net, developer, experience, read, what, will..."
4,Developer,"(hatstand, global, financial, consultancy, see..."
...,...,...
16427,Developer,"(jpmorgan, chase, co., (, nyse, :, jpm, ), lea..."
16428,Administrator,"(seeking, jr, ., systems, administrators, expe..."
16429,Developer,"(senior, lead, devops, engineer, desired, set,..."
16430,Developer,"(headquartered, downtown, san, francisco, ca, ..."


In [35]:
def gather_profile_data(file_path):
    profile_data = pd.read_csv(file_path)
    profile_data['text'] = profile_data['Titles'] + ' ' \
                            + profile_data['Skills'] + ' ' \
                            + profile_data['Summary'] + ' ' \
                            + profile_data['Education']
    try: profile_data['text'] += ' ' + profile_data['Certifications']
    except: pass
    
    try: profile_data['text'] += ' ' + profile_data['Projects']
    except: pass
    
    return profile_data

In [36]:
def get_recommendations(profile_data):
    # Create nlp doc from profile
    profile_text = profile_data['text'][0]
    profile_doc = nlp(profile_text)
    
    # Calculate scores
    scores = jobs[['job_title']].copy()
    scores['sim_score'] = titles_and_docs['doc'].map(lambda x: x.similarity(profile_doc))
    
    return scores

In [37]:
# Reading in linkedin profile data.
profile_data_zach = gather_profile_data('../data/linkedin/test-output/Zach_LinkedInData_12-16-2020.csv')
profile_data_nolan = gather_profile_data('../data/linkedin/test-output/Nolan_LinkedInData_12-16-2020.csv')
profile_data_albert = gather_profile_data('../data/linkedin/test-output/Albert_LinkedInData.csv')
profile_data_ye = gather_profile_data('../data/linkedin/test-output/Ye_LinkedInData.csv')

In [38]:
# How much text is in each of our profiles?
print('Zach:', len(profile_data_zach['text'][0]))
print('Nolan:', len(profile_data_nolan['text'][0]))
print('Albert:', len(profile_data_albert['text'][0]))
print('Ye:', len(profile_data_ye['text'][0]))

Zach: 933
Nolan: 1432
Albert: 3426
Ye: 1255


In [39]:
# Calculate scores
zach_scores = get_recommendations(profile_data_zach)
nolan_scores = get_recommendations(profile_data_nolan)
albert_scores = get_recommendations(profile_data_albert)
ye_scores = get_recommendations(profile_data_ye)

In [40]:
# Group by job title
zachs_recommendations = zach_scores.groupby('job_title').mean().sort_values('sim_score', ascending=False)
nolans_recommendations = nolan_scores.groupby('job_title').mean().sort_values('sim_score', ascending=False)
alberts_recommendations = albert_scores.groupby('job_title').mean().sort_values('sim_score', ascending=False)
yes_recommendations = ye_scores.groupby('job_title').mean().sort_values('sim_score', ascending=False)

In [41]:
# Credit for notebook styling: https://blog.softhints.com/display-two-pandas-dataframes-side-by-side-jupyter-notebook/
df1_styler = zachs_recommendations.style.set_table_attributes("style='display:inline'").set_caption('Zach')
df2_styler = nolans_recommendations.style.set_table_attributes("style='display:inline'").set_caption('Nolan')
df3_styler = alberts_recommendations.style.set_table_attributes("style='display:inline'").set_caption('Albert')
df4_styler = yes_recommendations.style.set_table_attributes("style='display:inline'").set_caption('Ye')

space = "\xa0" * 5
display_html(df1_styler._repr_html_() + space + df2_styler._repr_html_() + space + df3_styler._repr_html_() + space + df4_styler._repr_html_(), raw=True)

Unnamed: 0_level_0,sim_score
job_title,Unnamed: 1_level_1
Analyst,0.915819
Data Position,0.915598
Engineer,0.912122
Director,0.910046
Programmer,0.908521
Architect,0.908135
Manager,0.908016
Developer,0.907839
Support,0.90756
Technician,0.90177

Unnamed: 0_level_0,sim_score
job_title,Unnamed: 1_level_1
Developer,0.861498
Programmer,0.860427
Architect,0.859831
Data Position,0.856976
Administrator,0.855542
Engineer,0.851969
Analyst,0.849719
Consulting,0.845425
Designer,0.843528
Support,0.837432

Unnamed: 0_level_0,sim_score
job_title,Unnamed: 1_level_1
Director,0.914719
Manager,0.912253
Analyst,0.910727
Technician,0.909855
Support,0.909721
Designer,0.908678
Engineer,0.9081
Data Position,0.906248
Programmer,0.900293
Developer,0.899605

Unnamed: 0_level_0,sim_score
job_title,Unnamed: 1_level_1
Director,0.853864
Manager,0.852047
Analyst,0.848565
Technician,0.844154
Support,0.843433
Data Position,0.835049
Engineer,0.835037
Designer,0.826644
Architect,0.826256
Programmer,0.823107


### SpaCy Conclusion
As shown in the results above, the similarity scores for each LinkedIn profile hardly differ from the highest recommended title to the lowest recommended title. This is particularly apparent in Zach's scores, where the highest recommended position, Analyst, is only .0255 greater than the lowest recommended position of Consulting. We conclude from these results that this recommender is significantly effected by the noise, i.e. common words, in the dataset, and thus fails to provide accurate recommendations. This approach shows promise in future applications, as we fine tune the data cleaning process, however in the current state we decide not to use SpaCy.