**Setting Up StarSpace**

In [2]:
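import os

def setup_starspace():
    """Build StarSpace from source unless the binary is already installed."""
    if not os.path.exists("/usr/local/bin/starspace"):
        # StarSpace depends on Boost: fetch and unpack release 1.63.0
        os.system("wget https://dl.bintray.com/boostorg/release/1.63.0/source/boost_1_63_0.zip")
        os.system("unzip boost_1_63_0.zip && mv boost_1_63_0 /usr/local/bin")
        # Clone and compile StarSpace, then copy the binary to /usr/local/bin
        os.system("git clone https://github.com/facebookresearch/Starspace.git")
        os.system("cd Starspace && make && cp -Rf starspace /usr/local/bin")

setup_starspace()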
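
Since `os.system` does not raise when a shell command fails, it can be worth confirming that the binary really ended up on the PATH before training. A minimal sketch, assuming the executable is named `starspace`; the helper name `starspace_available` is hypothetical and not part of the notebook.

```python
import shutil

def starspace_available(binary_name="starspace"):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(binary_name) is not None

print("starspace on PATH:", starspace_available())
```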

**Downloading and Storing the Job Recommendation Challenge Dataset**

In [5]:
!kaggle datasets download -d kandij/job-recommendation-datasets

Downloading job-recommendation-datasets.zip to /content
 78% 41.0M/52.4M [00:01<00:00, 23.2MB/s]
100% 52.4M/52.4M [00:01<00:00, 35.0MB/s]


In [6]:
!mkdir job-dataset
!unzip job-recommendation-datasets.zip -d job-dataset

Archive:  job-recommendation-datasets.zip
  inflating: job-dataset/Combined_Jobs_Final.csv  
  inflating: job-dataset/Experience.csv  
  inflating: job-dataset/Job_Views.csv  
  inflating: job-dataset/Positions_Of_Interest.csv  
  inflating: job-dataset/job_data.csv  


In [7]:
import pandas as pd
import numpy as np

**Descriptions of the Loaded DataFrames**

1. jobs = Job listings
2. job_views = Job listings viewed by the applicants
3. experience = Previous experience details of the applicants
4. positions = Positions of interest to the applicants

In [8]:
jobs = pd.read_csv("/content/job-dataset/Combined_Jobs_Final.csv")
job_views = pd.read_csv("/content/job-dataset/Job_Views.csv")
experience = pd.read_csv("/content/job-dataset/Experience.csv")
positions =  pd.read_csv("/content/job-dataset/Positions_Of_Interest.csv", sep=',')

In [9]:
jobs.head()

Unnamed: 0,Job.ID,Provider,Status,Slug,Title,Position,Company,City,State.Name,State.Code,Address,Latitude,Longitude,Industry,Job.Description,Requirements,Salary,Listing.Start,Listing.End,Employment.Type,Education.Required,Created.At,Updated.At
0,111,1,open,palo-alto-ca-tacolicious-server,Server @ Tacolicious,Server,Tacolicious,Palo Alto,California,CA,,37.443346,-122.16117,Food and Beverages,Tacolicious' first Palo Alto store just opened...,,8.0,,,Part-Time,,2013-03-12 02:08:28 UTC,2014-08-16 15:35:36 UTC
1,113,1,open,san-francisco-ca-claude-lane-kitchen-staff-chef,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,California,CA,,37.78983,-122.404268,Food and Beverages,\r\n\r\nNew French Brasserie in S.F. Financia...,,0.0,,,Part-Time,,2013-04-12 08:36:36 UTC,2014-08-16 15:35:36 UTC
2,117,1,open,san-francisco-ca-machka-restaurants-corp-barte...,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,California,CA,,37.795597,-122.402963,Food and Beverages,We are a popular Mediterranean wine bar and re...,,11.0,,,Part-Time,,2013-07-16 09:34:10 UTC,2014-08-16 15:35:37 UTC
3,121,1,open,brisbane-ca-teriyaki-house-server,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,California,CA,,37.685073,-122.400275,Food and Beverages,● Serve food/drinks to customers in a profess...,,10.55,,,Part-Time,,2013-09-04 15:40:30 UTC,2014-08-16 15:35:38 UTC
4,127,1,open,los-angeles-ca-rosa-mexicano-sunset-kitchen-st...,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,California,CA,,34.073384,-118.460439,Food and Beverages,"Located at the heart of Hollywood, we are one ...",,10.55,,,Part-Time,,2013-07-17 15:26:18 UTC,2014-08-16 15:35:40 UTC


Important Columns - Job.ID, Title, Position, Company, City, Job.Description, Employment.Type

Extracting Important Information of the Job Listings

In [10]:
# important information for job search and recommendations
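# .copy() makes jobs_info an independent DataFrame, so later assignments do not raise SettingWithCopyWarning
jobs_info = jobs[['Job.ID', 'Title', 'Position', 'Company', 'City', 'Job.Description', 'Employment.Type']].copy()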
jobs_info.head()

Unnamed: 0,Job.ID,Title,Position,Company,City,Job.Description,Employment.Type
0,111,Server @ Tacolicious,Server,Tacolicious,Palo Alto,Tacolicious' first Palo Alto store just opened...,Part-Time
1,113,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,\r\n\r\nNew French Brasserie in S.F. Financia...,Part-Time
2,117,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,We are a popular Mediterranean wine bar and re...,Part-Time
3,121,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,● Serve food/drinks to customers in a profess...,Part-Time
4,127,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,"Located at the heart of Hollywood, we are one ...",Part-Time


Checking how many important column values are null

In [11]:
jobs_info.isnull().sum()

Job.ID                0
Title                 0
Position              0
Company            2271
City                135
Job.Description      56
Employment.Type      10
dtype: int64

Filling in missing City values with each company's location (a city or, for some companies, only a state), looked up manually on Google

In [12]:
empty_city=jobs_info[pd.isnull(jobs_info['City'])]
print(empty_city.groupby(['Company'])['City'].count())
jobs_info['Company'] = jobs_info['Company'].replace(['Genesis Health Systems'], 'Genesis Health System')
jobs_info.loc[jobs_info.Company == 'CHI Payment Systems', 'City'] = 'Illinois'
jobs_info.loc[jobs_info.Company == 'Academic Year In America', 'City'] = 'Stamford'
jobs_info.loc[jobs_info.Company == 'CBS Healthcare Services and Staffing ', 'City'] = 'Urbandale'
jobs_info.loc[jobs_info.Company == 'Driveline Retail', 'City'] = 'Coppell'
jobs_info.loc[jobs_info.Company == 'Educational Testing Services', 'City'] = 'New Jersey'
jobs_info.loc[jobs_info.Company == 'Genesis Health System', 'City'] = 'Davenport'
jobs_info.loc[jobs_info.Company == 'Home Instead Senior Care', 'City'] = 'Nebraska'
jobs_info.loc[jobs_info.Company == 'St. Francis Hospital', 'City'] = 'New York'
jobs_info.loc[jobs_info.Company == 'Volvo Group', 'City'] = 'Washington'
jobs_info.loc[jobs_info.Company == 'CBS Healthcare Services and Staffing', 'City'] = 'Urbandale'

Company
Academic Year In America                0
CBS Healthcare Services and Staffing    0
CHI Payment Systems                     0
Driveline Retail                        0
Educational Testing Services            0
Genesis Health System                   0
Genesis Health Systems                  0
Home Instead Senior Care                0
St. Francis Hospital                    0
Volvo Group                             0
Name: City, dtype: int64
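
The same kind of fill can be driven by a lookup table instead of one `.loc` line per company. The sketch below is an illustration on a toy frame, not the notebook's own code: `toy` and `city_by_company` are hypothetical names, and unlike the cell above it only touches rows whose City is actually missing.

```python
import pandas as pd

# Toy frame standing in for jobs_info; the rows are illustrative only.
toy = pd.DataFrame({
    "Company": ["Driveline Retail", "Volvo Group", "Tacolicious"],
    "City": [None, None, "Palo Alto"],
})

# Hypothetical lookup table: company name -> location to fill in.
city_by_company = {"Driveline Retail": "Coppell", "Volvo Group": "Washington"}

# Fill only the rows whose City is missing; known cities stay untouched.
missing = toy["City"].isna()
toy.loc[missing, "City"] = toy.loc[missing, "Company"].map(city_by_company)

print(toy)
```

Working on a frame created with `.copy()` (or built from scratch, as here) and assigning through a single `.loc[rows, column]` call is what pandas recommends to avoid the view-versus-copy ambiguity.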




Re-checking Null values of column

In [13]:
jobs_info.isnull().sum()

Job.ID                0
Title                 0
Position              0
Company            2271
City                  0
Job.Description      56
Employment.Type      10
dtype: int64

Employment Type empty values

In [14]:
employee_empty = jobs_info[pd.isnull(jobs_info['Employment.Type'])]
employee_empty

Unnamed: 0,Job.ID,Title,Position,Company,City,Job.Description,Employment.Type
10768,153197,Driving Partner @ Uber,Driving Partner,Uber,San Francisco,Uber is changing the way the world moves. From...,
10769,153198,Driving Partner @ Uber,Driving Partner,Uber,Los Angeles,Uber is changing the way the world moves. From...,
10770,153199,Driving Partner @ Uber,Driving Partner,Uber,Chicago,Uber is changing the way the world moves. From...,
10771,153200,Driving Partner @ Uber,Driving Partner,Uber,Boston,Uber is changing the way the world moves. From...,
10772,153201,Driving Partner @ Uber,Driving Partner,Uber,Ann Arbor,Uber is changing the way the world moves. From...,
10773,153202,Driving Partner @ Uber,Driving Partner,Uber,Oklahoma,Uber is changing the way the world moves. From...,
10774,153203,Driving Partner @ Uber,Driving Partner,Uber,Omaha,Uber is changing the way the world moves. From...,
10775,153204,Driving Partner @ Uber,Driving Partner,Uber,Lincoln,Uber is changing the way the world moves. From...,
10776,153205,Driving Partner @ Uber,Driving Partner,Uber,Minneapolis,Uber is changing the way the world moves. From...,
10777,153206,Driving Partner @ Uber,Driving Partner,Uber,St. Paul,Uber is changing the way the world moves. From...,


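All ten rows with a missing Employment.Type are Uber driving-partner listings, which can be worked either full-time or part-time, so the missing values are filled with 'Full-Time/Part-Time'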

In [15]:
jobs_info['Employment.Type']=jobs_info['Employment.Type'].fillna('Full-Time/Part-Time')
jobs_info.isna().sum()



Job.ID                0
Title                 0
Position              0
Company            2271
City                  0
Job.Description      56
Employment.Type       0
dtype: int64

Combining Position, Company, City, Job.Description and Employment.Type (every column from index 2 onward, so Title is not included) into one column to form the training corpus

In [16]:
jobs_info['complete_description'] = jobs_info[jobs_info.columns[2:]].apply(lambda x:' '.join(x.dropna().astype(str)),axis=1)
jobs_info.head()



Unnamed: 0,Job.ID,Title,Position,Company,City,Job.Description,Employment.Type,complete_description
0,111,Server @ Tacolicious,Server,Tacolicious,Palo Alto,Tacolicious' first Palo Alto store just opened...,Part-Time,Server Tacolicious Palo Alto Tacolicious' firs...
1,113,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,\r\n\r\nNew French Brasserie in S.F. Financia...,Part-Time,Kitchen Staff/Chef Claude Lane San Francisco ...
2,117,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,We are a popular Mediterranean wine bar and re...,Part-Time,Bartender Machka Restaurants Corp. San Francis...
3,121,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,● Serve food/drinks to customers in a profess...,Part-Time,Server Teriyaki House Brisbane ● Serve food/d...
4,127,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,"Located at the heart of Hollywood, we are one ...",Part-Time,Kitchen Staff/Chef Rosa Mexicano - Sunset Los ...
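
To make the row-wise join concrete, here is a small sketch on a made-up frame (the column values are illustrative). It shows why `dropna()` is applied inside the lambda: a missing Company is skipped instead of contributing a literal "nan" token to the combined text.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Position": ["Server", "Driving Partner"],
    "Company": ["Tacolicious", np.nan],
    "City": ["Palo Alto", "Boston"],
})

# For each row: drop missing cells, cast the rest to str, join with spaces.
toy["combined"] = toy[["Position", "Company", "City"]].apply(
    lambda row: " ".join(row.dropna().astype(str)), axis=1
)

print(toy["combined"].tolist())
```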


Text Pre-processing

In [17]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Raw strings keep the backslashes in the pattern from being read as string escapes
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """Lowercase, replace punctuation with spaces, and drop English stop words."""
    text = str(text).lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # separators such as / @ , ; become spaces
    text = BAD_SYMBOLS_RE.sub(' ', text)       # keep only 0-9, a-z, space, # + _
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
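
A quick way to see what this cleaning does is to run it on one of the job titles. The sketch below re-implements the same three steps so that it runs without downloading NLTK data; `text_prepare_demo` and the tiny `DEMO_STOPWORDS` set are assumptions made for the illustration, not the notebook's real stop-word list.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: a small hand-picked set stands in for NLTK's English stop words.
DEMO_STOPWORDS = {"a", "the", "in", "is", "we", "are", "and"}

def text_prepare_demo(text):
    text = str(text).lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # "/" and "@" become spaces
    text = BAD_SYMBOLS_RE.sub(' ', text)       # remaining punctuation becomes spaces
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)

print(text_prepare_demo("Kitchen Staff/Chef @ Claude Lane"))
print(text_prepare_demo("We are a popular Mediterranean wine bar."))
```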


Applying the text pre-processing to complete_description to form the training corpus

In [18]:
jobs_info['complete_description']=jobs_info['complete_description'].apply(text_prepare)
jobs_info.head()



Unnamed: 0,Job.ID,Title,Position,Company,City,Job.Description,Employment.Type,complete_description
0,111,Server @ Tacolicious,Server,Tacolicious,Palo Alto,Tacolicious' first Palo Alto store just opened...,Part-Time,server tacolicious palo alto tacolicious first...
1,113,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,\r\n\r\nNew French Brasserie in S.F. Financia...,Part-Time,kitchen staff chef claude lane san francisco n...
2,117,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,We are a popular Mediterranean wine bar and re...,Part-Time,bartender machka restaurants corp san francisc...
3,121,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,● Serve food/drinks to customers in a profess...,Part-Time,server teriyaki house brisbane serve food drin...
4,127,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,"Located at the heart of Hollywood, we are one ...",Part-Time,kitchen staff chef rosa mexicano sunset los an...


In [20]:
jobs_list=jobs_info[['Job.ID', 'Title', 'complete_description']]
jobs_list.head()

Unnamed: 0,Job.ID,Title,complete_description
0,111,Server @ Tacolicious,server tacolicious palo alto tacolicious first...
1,113,Kitchen Staff/Chef @ Claude Lane,kitchen staff chef claude lane san francisco n...
2,117,Bartender @ Machka Restaurants Corp.,bartender machka restaurants corp san francisc...
3,121,Server @ Teriyaki House,server teriyaki house brisbane serve food drin...
4,127,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,kitchen staff chef rosa mexicano sunset los an...


Writing the corpus to a .tsv file for training the StarSpace word embedding model

In [21]:
description=jobs_info['complete_description']
description.to_csv('output.tsv', sep='\t', index=False)

In [22]:
# Read the file back as a list of lines; the context manager closes the file afterwards
with open('output.tsv') as f_input:
    list_jobs = f_input.readlines()

In [None]:
list_jobs[0]

'complete_description\n'

In [24]:
import csv
with open('output1.tsv', 'w', newline='') as f_output:
    tsv_output = csv.writer(f_output, delimiter='\t')
    tsv_output.writerow(list_jobs[1:])  # skip the header line; the descriptions are written as the tab-separated fields of one row
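
How many records end up in the file depends on whether `csv.writer.writerow` or `writerows` is used. The sketch below contrasts the two on in-memory buffers with two made-up descriptions; it only illustrates the behaviour of Python's `csv` module, and which layout a given StarSpace configuration expects should be checked against the StarSpace documentation.

```python
import csv
import io

lines = ["server tacolicious palo alto\n", "bartender machka restaurants corp\n"]

# writerow: the whole list becomes the fields of a single record.
one_row = io.StringIO()
csv.writer(one_row, delimiter='\t').writerow(lines)

# writerows: each inner list becomes its own record, one description per line.
per_line = io.StringIO()
csv.writer(per_line, delimiter='\t', lineterminator='\n').writerows(
    [line.rstrip('\n')] for line in lines
)

print(repr(one_row.getvalue()))
print(repr(per_line.getvalue()))
```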


Training on the job corpus stored in the .tsv file. The StarSpace word embeddings are trained with trainMode 3 because the task here is sentence/document similarity, so that similar listings end up close together.

Embedding dimension: 100

Similarity measure used during training: cosine similarity

Optimizer: AdaGrad

Learning rate: 0.01
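
The settings listed above can also be kept in one place and turned into the command line programmatically. This is only a sketch: the flag names and values are copied from the training command in this notebook, while the `train_args` dictionary and the way the command is assembled are assumptions made for illustration.

```python
import shlex

# Flags and values copied from the notebook's training command.
train_args = {
    "-trainFile": "output1.tsv",
    "-model": "starspace_embedding_jobs",
    "-trainMode": 3,
    "-adagrad": "true",
    "-ngrams": 1,
    "-epoch": 10,
    "-dim": 100,
    "-similarity": "cosine",
    "-minCount": 2,
    "-verbose": "true",
    "-fileFormat": "labelDoc",
    "-negSearchLimit": 10,
    "-lr": 0.01,
}

command = ["starspace", "train"]
for flag, value in train_args.items():
    command += [flag, str(value)]

# shlex.join quotes arguments where needed, giving a shell-safe string.
print(shlex.join(command))
```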


In [None]:
!starspace train -trainFile "output1.tsv" -model starspace_embedding_jobs \
-trainMode 3 -adagrad true -ngrams 1 -epoch 10 -dim 100 -similarity cosine -minCount 2 \
-verbose true -fileFormat labelDoc -negSearchLimit 10 -lr 0.01

Arguments: 
lr: 0.01
dim: 100
epoch: 10
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: cosine
maxNegSamples: 10
negSearchLimit: 10
batchSize: 5
thread: 10
minCount: 2
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 3
fileFormat: labelDoc
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : output1.tsv
Read 13M words
Number of words in dictionary:  45735
Number of labels in dictionary: 0
Loading data from file : output1.tsv
Total number of examples loaded : 84089
Initialized model weights. Model size :
matrix : 45735 100
Training epoch 0: 0.01 0.001
Epoch: 100.0%  lr: 0.009000  loss: 0.207160  eta: 0h58m  tot: 0h6m30s  (10.0%)
 ---+++                Epoch    0 Train error : 0.20735569 +++--- ☃
Training epoch 1: 0.009 0.001
Epoch: 100.0%  lr: 0.008000  loss: 0.190011  eta: 0h55m  tot: 0h13m24s  (20.0%)


In [27]:
job_views.head() #Jobs viewed by the Applicants

Unnamed: 0,Applicant.ID,Job.ID,Title,Position,Company,City,State.Name,State.Code,Industry,View.Start,View.End,View.Duration,Created.At,Updated.At
0,10000,73666,Cashiers & Valets Needed! @ WallyPark,Cashiers & Valets Needed!,WallyPark,Newark,New Jersey,NJ,,2014-12-12 20:12:35 UTC,2014-12-12 20:31:24 UTC,1129.0,2014-12-12 20:12:35 UTC,2014-12-12 20:12:35 UTC
1,10000,96655,Macy's Seasonal Retail Fragrance Cashier - Ga...,Macy's Seasonal Retail Fragrance Cashier - Ga...,Macy's,Garden City,New York,NY,,2014-12-12 20:08:50 UTC,2014-12-12 20:10:15 UTC,84.0,2014-12-12 20:08:50 UTC,2014-12-12 20:08:50 UTC
2,10001,84141,Part Time Showroom Sales / Cashier @ Grizzly I...,Part Time Showroom Sales / Cashier,Grizzly Industrial Inc.,Bellingham,Washington,WA,,2014-12-12 20:12:32 UTC,2014-12-12 20:17:18 UTC,286.0,2014-12-12 20:12:32 UTC,2014-12-12 20:12:32 UTC
3,10002,77989,Event Specialist Part Time @ Advantage Sales &...,Event Specialist Part Time,Advantage Sales & Marketing,Simpsonville,South Carolina,SC,,2014-12-12 20:39:23 UTC,2014-12-12 20:42:13 UTC,170.0,2014-12-12 20:39:23 UTC,2014-12-12 20:39:23 UTC
4,10002,69568,Bonefish - Kitchen Staff @ Bonefish Grill,Bonefish - Kitchen Staff,Bonefish Grill,Greenville,South Carolina,SC,,2014-12-12 20:43:25 UTC,2014-12-12 20:43:58 UTC,33.0,2014-12-12 20:43:25 UTC,2014-12-12 20:43:25 UTC


In [None]:
job_views.columns[3:6]

Index(['Position', 'Company', 'City'], dtype='object')

Forming a text corpus of the jobs viewed by each applicant, built from the Position, Company and City of every viewed listing.
This corpus is used later for making recommendations.

In [28]:
job_views_description = job_views[['Applicant.ID']].copy()  # copy so the new column is added to an independent DataFrame
job_views_description['complete_description']=job_views[job_views.columns[3:6]].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)
job_views_description['complete_description']=job_views_description['complete_description'].apply(text_prepare) #pre-processing the text
job_views_description.head()



Unnamed: 0,Applicant.ID,complete_description
0,10000,cashiers valets needed wallypark newark
1,10000,macy seasonal retail fragrance cashier garden ...
2,10001,part time showroom sales cashier grizzly indus...
3,10002,event specialist part time advantage sales mar...
4,10002,bonefish kitchen staff bonefish grill greenville


One applicant may have viewed more than one job listing, so all of an applicant's viewed listings are concatenated into a single complete_description

In [66]:
job_views_description = job_views_description.groupby('Applicant.ID', sort=False)['complete_description'].apply(' '.join).reset_index()

In [68]:
job_views_description.head()

Unnamed: 0,Applicant.ID,complete_description
0,10000,cashiers valets needed wallypark newark macy s...
1,10001,part time showroom sales cashier grizzly indus...
2,10002,event specialist part time advantage sales mar...
3,10003,entry level security officer securitas securit...
4,10004,pt teller chester east 36th cleveland keybank ...
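
The grouping step can be seen on a three-row toy example (the rows are made up). `groupby(...)[column].apply(' '.join)` hands the strings of each applicant to `' '.join`, and `sort=False` keeps applicants in order of first appearance.

```python
import pandas as pd

views = pd.DataFrame({
    "Applicant.ID": [10000, 10000, 10001],
    "complete_description": [
        "cashiers valets needed",
        "seasonal retail cashier",
        "showroom sales cashier",
    ],
})

# One row per applicant, with all of their viewed descriptions joined by spaces.
per_applicant = (
    views.groupby("Applicant.ID", sort=False)["complete_description"]
    .apply(" ".join)
    .reset_index()
)

print(per_applicant)
```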


In [29]:
experience.head() #experience database of each applicant

Unnamed: 0,Applicant.ID,Position.Name,Employer.Name,City,State.Name,State.Code,Start.Date,End.Date,Job.Description,Salary,Can.Contact.Employer,Created.At,Updated.At
0,10001,Account Manager / Sales Administration / Quali...,Barcode Resourcing,Bellingham,Washington,WA,2012-10-15,,,,,2014-12-12 20:10:02 UTC,2014-12-12 20:10:02 UTC
1,10001,Electronics Technician / Item Master Controller,Ryzex Group,Bellingham,Washington,WA,2001-12-01,2012-04-01,,,,2014-12-12 20:10:02 UTC,2014-12-12 20:10:02 UTC
2,10001,Machine Operator,comptec inc,Custer,Washington,WA,1997-01-01,1999-01-01,,,,2014-12-12 20:10:02 UTC,2014-12-12 20:10:02 UTC
3,10003,maintenance technician,Winn residental,washington,District of Columbia,DC,,,"Necessary maintenance for ""Make Ready"" Plumbin...",10.0,False,2014-12-12 21:27:05 UTC,2014-12-12 21:27:05 UTC
4,10003,Electrical Helper,michael and son services,alexandria,Virginia,VA,,,repair and services of electrical construction,,False,2014-12-12 21:27:05 UTC,2014-12-12 21:27:05 UTC


Preparing the corpus of applicant experience to be later used in recommendations

In [30]:
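experience_list = experience[['Applicant.ID', 'Position.Name']].copy()
# text_prepare calls str() on its input, so a missing Position.Name becomes the token 'nan'
experience_list['Position.Name'] = experience_list['Position.Name'].apply(text_prepare)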
experience_list.head()



Unnamed: 0,Applicant.ID,Position.Name
0,10001,account manager sales administration quality a...
1,10001,electronics technician item master controller
2,10001,machine operator
3,10003,maintenance technician
4,10003,electrical helper


Combining all positions previously held by an applicant into a single Position.Name string (in case the applicant has listed more than one position under experience)

In [31]:
experience_list = experience_list.groupby('Applicant.ID', sort=False)['Position.Name'].apply(' '.join).reset_index()
experience_list.head()

Unnamed: 0,Applicant.ID,Position.Name
0,10001,account manager sales administration quality a...
1,10003,maintenance technician electrical helper techn...
2,10004,nan nan shift superviveur
3,10005,star houseman
4,10007,bartender bar manager head bartender bartender


In [134]:
positions =  pd.read_csv("/content/job-dataset/Positions_Of_Interest.csv", sep=',') #position of interests to various applicants

In [135]:
positions.head()

Unnamed: 0,Applicant.ID,Position.Of.Interest,Created.At,Updated.At
0,10003,security officer,2014-12-12 21:20:54 UTC,2014-12-12 21:20:54 UTC
1,10007,Server,2014-08-14 15:56:42 UTC,2015-02-26 20:35:12 UTC
2,10007,Bartender,2014-08-14 15:56:44 UTC,2015-02-19 23:21:28 UTC
3,10008,Host,2014-08-14 15:56:42 UTC,2015-02-26 20:35:12 UTC
4,10008,Barista,2014-08-14 15:56:43 UTC,2015-02-18 02:35:06 UTC


Preparing the corpus of Positions of Interest and pre-processing the text to be used later in recommending jobs 

In [136]:
positions_new=positions.drop(['Created.At', 'Updated.At'], axis=1)
positions_new.head()

Unnamed: 0,Applicant.ID,Position.Of.Interest
0,10003,security officer
1,10007,Server
2,10007,Bartender
3,10008,Host
4,10008,Barista


In [137]:
positions_new['Position.Of.Interest']=positions_new['Position.Of.Interest'].apply(text_prepare)
positions_new= positions_new.groupby('Applicant.ID', sort=True)['Position.Of.Interest'].apply(' '.join).reset_index()
positions_new.head()

Unnamed: 0,Applicant.ID,Position.Of.Interest
0,96,server
1,153,server host barista customer service rep sales...
2,256,server host receptionist book keeper customer ...
3,438,server host barista customer service rep
4,568,receptionist book keeper customer service rep


Merging Job Views description and Experience by Applicant IDs

In [69]:
jobviews_exp = job_views_description.merge(experience_list, left_on='Applicant.ID', right_on='Applicant.ID', how='outer')
jobviews_exp = jobviews_exp.fillna(' ')
jobviews_exp = jobviews_exp.sort_values(by='Applicant.ID')
jobviews_exp.head()

Unnamed: 0,Applicant.ID,complete_description,Position.Name
4090,2,,writer uloop blog volunteer
4565,3,,prep cook server marketing intern
5706,6,,project assistant
6122,8,,deli clerk server cashier food prep order taker
3542,11,,cashier
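
The blank complete_description cells above come from the outer merge: an applicant who appears in only one of the two frames still gets a row, and the missing side is then filled with a single space. A minimal sketch on made-up rows:

```python
import pandas as pd

views = pd.DataFrame({
    "Applicant.ID": [10000, 10001],
    "complete_description": ["cashier newark", "showroom sales"],
})
exp = pd.DataFrame({
    "Applicant.ID": [2, 10001],
    "Position.Name": ["writer volunteer", "machine operator"],
})

# how='outer' keeps applicants found in either frame; gaps become ' '.
merged = (
    views.merge(exp, on="Applicant.ID", how="outer")
    .fillna(" ")
    .sort_values(by="Applicant.ID")
)

print(merged)
```

Applicant 2 has experience but no viewed jobs, so its complete_description is blank; applicant 10000 is the reverse case.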


In [70]:
jobviews_exp['complete_description'][0]

'cashiers valets needed wallypark newark macy seasonal retail fragrance cashier garden city ny roosevelt field macy garden city'

Merging Job Views+Experience data with Position of Interest data by Applicant IDs

In [71]:
jobviews_exp_interests=jobviews_exp.merge(positions_new, left_on='Applicant.ID', right_on='Applicant.ID', how='outer')
jobviews_exp_interests = jobviews_exp_interests.fillna(' ')
jobviews_exp_interests = jobviews_exp_interests.sort_values(by='Applicant.ID')
jobviews_exp_interests.head()

Unnamed: 0,Applicant.ID,complete_description,Position.Name,Position.Of.Interest
0,2,,writer uloop blog volunteer,
1,3,,prep cook server marketing intern,
2,6,,project assistant,
3,8,,deli clerk server cashier food prep order taker,
4,11,,cashier,


In [39]:
jobviews_exp_interests.columns[1:4]

Index(['complete_description', 'Position.Name', 'Position.Of.Interest'], dtype='object')

Forming a text corpus merging the jobs viewed descriptions, experience/previous positions held and Positions of Interest indicated by Applicants

In [72]:
jobviews_exp_interests['viewedjob+experience+interestedposition']=jobviews_exp_interests[jobviews_exp_interests.columns[1:4]].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)
jobviews_exp_interests.head()

Unnamed: 0,Applicant.ID,complete_description,Position.Name,Position.Of.Interest,viewedjob+experience+interestedposition
0,2,,writer uloop blog volunteer,,writer uloop blog volunteer
1,3,,prep cook server marketing intern,,prep cook server marketing intern
2,6,,project assistant,,project assistant
3,8,,deli clerk server cashier food prep order taker,,deli clerk server cashier food prep order ta...
4,11,,cashier,,cashier


Building a dictionary of word embeddings from the StarSpace model saved earlier

In [41]:
starspace_embeddings_1 = dict()
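# Each line of the saved model holds a token followed by its tab-separated vector components
with open('starspace_embedding_jobs.tsv', encoding='utf-8') as f_embeddings:
    for line in f_embeddings:
        row = line.strip().split('\t')
        starspace_embeddings_1[row[0]] = np.array(row[1:], dtype=np.float32)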
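
The loader above assumes each line holds a token followed by tab-separated numbers. The sketch below parses two made-up lines in that layout from an in-memory buffer, so the resulting dictionary can be inspected without a trained model; the three-dimensional vectors are illustrative only.

```python
import io
import numpy as np

# Two made-up rows: token, then tab-separated vector components.
fake_tsv = io.StringIO("server\t0.1\t-0.2\t0.3\ncashier\t0.0\t0.5\t-0.5\n")

embeddings = {}
for line in fake_tsv:
    token, *values = line.rstrip("\n").split("\t")
    # NumPy converts the numeric strings to float32 directly.
    embeddings[token] = np.array(values, dtype=np.float32)

print(sorted(embeddings))
print(embeddings["server"].shape)
```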

Converting the applicant text to a vector by averaging the embeddings of its words

In [42]:
def question_to_vec(question, embeddings, dim):
    """Average the embeddings of the known words in `question` (zeros if none are known)."""
    result = np.zeros(dim)
    no_of_words = 0
    for word in question.split():
        if word in embeddings:
            no_of_words += 1
            result += embeddings[word]

    if no_of_words != 0:
        result = result / no_of_words

    return result
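
A toy example makes the averaging visible. The compact re-implementation below follows the same idea (mean over known words, zeros when no word is known); the name `average_embedding` and the two-dimensional `toy_embeddings` are hypothetical.

```python
import numpy as np

def average_embedding(text, embeddings, dim):
    """Mean of the vectors of known words; a zero vector if none are known."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

toy_embeddings = {
    "server": np.array([1.0, 0.0]),
    "host": np.array([0.0, 1.0]),
}

# "barista" is out of vocabulary and is simply skipped.
print(average_embedding("server host barista", toy_embeddings, 2))
print(average_embedding("barista", toy_embeddings, 2))
```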

Ranking the candidate job listings by cosine similarity to the applicant's text and keeping the top 11

In [171]:
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(question, candidates, embeddings, dim):
    """Return (position, text) for the 11 candidates most similar to `question`."""
    # Embed the applicant text once, as a single-row matrix, instead of once per candidate
    question_vec = question_to_vec(question, embeddings, dim).reshape(1, -1)
    candidate_vec = np.array([question_to_vec(candidate, embeddings, dim) for candidate in candidates])
    # Similarity between the applicant data and every job listing
    cosine_sim = cosine_similarity(question_vec, candidate_vec)[0]
    merged_list = list(zip(cosine_sim, range(len(candidates)), candidates))
    sorted_list = sorted(merged_list, key=lambda x: x[0], reverse=True)[:11]
    return [(index, text) for _, index, text in sorted_list]
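
The ranking idea can also be written with plain NumPy, which makes the cosine formula explicit: the dot product of two vectors divided by the product of their norms. This is a sketch under stated assumptions rather than the notebook's implementation: `top_k_by_cosine` is a hypothetical helper, and all-zero vectors are given a similarity of 0.

```python
import numpy as np

def top_k_by_cosine(query_vec, candidate_vecs, k):
    """Indices and similarities of the k candidates closest to query_vec, best first."""
    norms = np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(query_vec)
    # Avoid dividing by zero: an all-zero vector gets similarity 0.
    safe_norms = np.where(norms == 0, 1.0, norms)
    sims = (candidate_vecs @ query_vec) / safe_norms
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]

query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

indices, scores = top_k_by_cosine(query, candidates, k=2)
print(indices, scores)
```

Only one query vector is needed here; the similarity to all candidates comes from a single matrix-vector product.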

In [115]:
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.metrics.pairwise import cosine_similarity

Getting Recommendation based on the Applicant IDs

In [174]:
def get_recommedations(app_id):
  applicant_info=jobviews_exp_interests.loc[jobviews_exp_interests['Applicant.ID']==app_id]
  app_data=applicant_info['viewedjob+experience+interestedposition'].values[0] #applicant's data collected
  print('Applicant ID: ', app_id )
  pos_int=list(positions['Position.Of.Interest'].loc[positions['Applicant.ID']==app_id].values)
  print('Position/ Positions of Interest: ')
  for i in range(len(pos_int)):
    print(str(i+1)+str('. ')+pos_int[i])
  pos_held=list(experience['Position.Name'].loc[experience['Applicant.ID']==app_id].values)
  print('Previous Position/Positions held: ')
  for i in range(len(pos_held)):
    print(str(i+1)+str('. ')+pos_held[i])
  job_viewed=list(job_views['Title'].loc[job_views['Applicant.ID']==app_id].values)
  print('Jobs Viewed:')
  for i in range(len(job_viewed)):
    print(str(i+1)+str('. ')+job_viewed[i])
  # Due to memory constraints, only 30,000 candidates are ranked at once:
  # job listings 10,000 to 40,000 are scored against the applicant's data.
  # rank_candidates returns 11 results; [1:] drops the top-ranked one, leaving 10.
  result=rank_candidates(app_data, jobs_info['complete_description'][10000:40000], starspace_embeddings_1, 100)[1:]  
  result_index=[i[0]+10000 for i in result] # positions are relative to the slice, so add the 10,000 offset
  print('Job Recommendations:')
  recommendations=pd.DataFrame(jobs_info.loc[result_index, ['Title', 'Position','City']])
  return recommendations

In [176]:
recommendation=get_recommedations(10085) #for applicant ID 10085
recommendation

Applicant ID:  10085
Position/ Positions of Interest: 
1. Host
2. Receptionist
3. Customer Service Rep
Previous Position/Positions held: 
1. cashier
Jobs Viewed:
1. Entry Level Sales / Customer Service – Part time / Full Time @ Vector Marketing
2. Entry Level Sales / Customer Service – Part time / Full Time @ Vector Marketing
3. Entry Level Sales / Customer Service – Part time / Full Time @ Vector Marketing
Job Recommendations:


Unnamed: 0,Title,Position,City
17581,Entry Level Customer Service / Entry Level Ret...,Entry Level Customer Service / Entry Level Ret...,Fresno
32475,Junior Sales/Marketing Training @ Interview Now,Junior Sales/Marketing Training,Gaithersburg
39612,CUSTOMER SERVICE POSITIONS OPEN-HIRING ENTRY L...,CUSTOMER SERVICE POSITIONS OPEN-HIRING ENTRY L...,Albuquerque
19025,Dishwasher @ Claridge Court,Dishwasher,Prairie Village
16542,CUSTOMER SERVICE POSITIONS OPEN-HIRING ENTRY L...,CUSTOMER SERVICE POSITIONS OPEN-HIRING ENTRY L...,Marion
20166,Customer Service -- Management Training -- Vot...,Customer Service -- Management Training -- Vot...,Tysons Corner
29512,CUSTOMER SERVICE REPRESENTATIVE - FULL TIME @...,CUSTOMER SERVICE REPRESENTATIVE - FULL TIME,Dayton
30483,Nurse Aide @ Carespring Health Care Management,Nurse Aide,West Chester
32654,MANAGEMENT TRAINEE- CUSTOMER SERVICE- FULL TIM...,MANAGEMENT TRAINEE- CUSTOMER SERVICE- FULL TIME,Toms River
14384,CUSTOMER SERVICE POSITIONS OPEN-HIRING ENTRY L...,CUSTOMER SERVICE POSITIONS OPEN-HIRING ENTRY L...,Cape Girardeau
