Recompute job keywords and scores on skills update #62

Merged
merged 2 commits into from
Nov 3, 2019
58 changes: 58 additions & 0 deletions backend/src/all_skills/AllSkills.js
@@ -0,0 +1,58 @@
+import assert from 'assert';
+import { forEachAsync } from 'foreachasync';
+
+import { Skills, Jobs } from '../schema';
+
+class AllSkills {
+  constructor(jobAnalyzer) {
+    this.jobAnalyzer = jobAnalyzer;
+  }
+
+  static async setup() {
+    // Make sure there is always one and only one entry
+    const count = await Skills.countDocuments();
+    assert(count === 0 || count === 1);
+
+    if (count === 0) {
+      await Skills.create({});
+    }
+  }
+
+  /* gets and returns a set containing the collective skills of all the users */
+  static async getAll() {
+    const doc = await Skills.findOne({});
+    return doc.skills;
+  }
+
+  /**
+   * Updates all skills and recomputes job keyword counts and scores
+   * @param {Array<String>} skills new skills to add to all skills
+   */
+  async update(skills) {
+    const oldSkillsData = await Skills.findOneAndUpdate({}, {
+      $addToSet: {
+        skills,
+      },
+    }).orFail();
+
+    const updatedSkills = await AllSkills.getAll();
+    // Find newly added skills
+    const newSkills = updatedSkills.slice(oldSkillsData.skills.length);
+
+    if (newSkills.length === 0) {
+      return;
+    }
+
+    // Update keyword counts of each job
+    const jobs = await Jobs.find({});
+    await forEachAsync(jobs, async (_, jobIdx) => {
+      this.jobAnalyzer.computeJobKeywordCount(jobs[jobIdx], newSkills);
Owner

As a side note: if new jobs are added while new skills are being added, this is still broken lol

Collaborator Author

Very

Owner

Create an issue, author?

Collaborator Author

#71

+      await jobs[jobIdx].save();
Owner

We could make this loop synchronous and save after it (this may perform better, especially when jobs is much larger).

Collaborator Author

You'd still need to loop through all jobs and save each one, so I don't see the difference?

Owner

I spent some time googling how .save() is implemented, but couldn't find an answer. I was concerned because I imagined it saves everything in jobs[jobIdx] every time you call .save(), so you would be saving everything jobs.length times... Never mind: if you saved outside the loop you would have to call jobs.save(), which could take longer because the .save() calls inside the loop would be asynchronous?

Collaborator Author

jobs is a Document[] and .save() operates on a Document.

Doing

loop {
  //stuff
  await .save()
}

shouldn't have any performance difference compared to

loop {
  //stuff
}
loop {
  await .save()
}
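
Both loop shapes in this thread wait for each save before starting the next one, which is why they should cost about the same. What would change the timing is starting all saves first and waiting for them together. Below is a minimal sketch of that difference, assuming forEachAsync awaits each callback before starting the next. FakeDoc, its 20 ms delay, and the helper names are illustrative stand-ins, not Mongoose APIs and not code from this PR.

```javascript
// Stand-in for a Mongoose document: save() resolves after a short delay.
// FakeDoc and the 20 ms delay are assumptions made for illustration only.
class FakeDoc {
  constructor(id) {
    this.id = id;
    this.saved = false;
  }

  save() {
    return new Promise((resolve) => {
      setTimeout(() => {
        this.saved = true;
        resolve(this);
      }, 20);
    });
  }
}

// Await each save before moving on to the next document.
async function saveSequentially(docs) {
  for (const doc of docs) {
    // eslint-disable-next-line no-await-in-loop
    await doc.save();
  }
}

// Start every save first, then wait for all of them together.
async function saveConcurrently(docs) {
  await Promise.all(docs.map(doc => doc.save()));
}

// Measure how long a save strategy takes on a fresh list of documents.
async function timeIt(saveFn, docs) {
  const start = Date.now();
  await saveFn(docs);
  return Date.now() - start;
}

async function main() {
  const makeDocs = () => [1, 2, 3, 4, 5].map(id => new FakeDoc(id));
  const sequentialMs = await timeIt(saveSequentially, makeDocs());
  const concurrentMs = await timeIt(saveConcurrently, makeDocs());
  console.log({ sequentialMs, concurrentMs });
}

main();
```

With real documents the gain depends on the database and connection pool, so this only illustrates the control flow, not expected production numbers.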

+    });
+
+    // Computes tf idf for newly added skills
+    await this.jobAnalyzer.computeJobScores(oldSkillsData.skills.length);
+  }
+}
+
+export default AllSkills;
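
The update method finds the newly added skills by slicing the updated list at the old list's length. The sketch below shows that idea in isolation. It assumes new values are always appended at the end of the array and that nothing else modifies the list between the two reads; addToSet and findNewSkills are hypothetical helpers that only mimic the behavior, they are not part of the PR or of Mongoose.

```javascript
// Mimics adding several values to a set-like array:
// each value that is not already present is appended at the end.
function addToSet(existing, incoming) {
  const result = existing.slice();
  incoming.forEach((skill) => {
    if (!result.includes(skill)) {
      result.push(skill);
    }
  });
  return result;
}

// Everything past the old length must have been appended by the update.
function findNewSkills(oldSkills, updatedSkills) {
  return updatedSkills.slice(oldSkills.length);
}

const oldSkills = ['java', 'python'];
const updatedSkills = addToSet(oldSkills, ['python', 'go', 'rust']);
console.log(findNewSkills(oldSkills, updatedSkills)); // [ 'go', 'rust' ]
```

If two updates interleave, the old length read by one of them may no longer mark the boundary of its own additions, which is the kind of race the review thread points at.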
3 changes: 3 additions & 0 deletions backend/src/all_skills/index.js
@@ -0,0 +1,3 @@
+import AllSkills from './AllSkills';
+
+export default AllSkills;
18 changes: 11 additions & 7 deletions backend/src/index.js
@@ -12,6 +12,7 @@ import JobSearcher from './job_searcher';
 import JobShortLister from './job_shortlister';
 import JobAnalyzer from './job_analyzer';
 import Messenger from './messenger';
+import AllSkills from './all_skills';
 import firebaseCredentials from '../credentials/firebase';


@@ -47,13 +48,16 @@ admin.initializeApp({
 });

 // Setup modules
-const shortlister = new JobShortLister(app);
-const messenger = new Messenger(app, shortlister);
-const user = new User(app);
-const jobAnalyzer = new JobAnalyzer(app, shortlister);
-new Friend(app, messenger);
-new JobSearcher(jobAnalyzer);
-new ResumeParser(app, user);
+AllSkills.setup().then(() => {
+  const shortlister = new JobShortLister(app);
+  const messenger = new Messenger(app, shortlister);
+  const jobAnalyzer = new JobAnalyzer(app, shortlister);
+  const allSkills = new AllSkills(jobAnalyzer);
+  const user = new User(app, allSkills);
+  new Friend(app, messenger);
+  new JobSearcher(jobAnalyzer);
+  new ResumeParser(app, user);
+});

 // Start the server
 app.listen(PORT, () => {
63 changes: 41 additions & 22 deletions backend/src/job_analyzer/JobAnalyzer.js
@@ -1,8 +1,8 @@
 import Logger from 'js-logger';
 import { forEachAsync } from 'foreachasync';

-import User from '../user';
 import Response from '../types';
+import AllSkills from '../all_skills';
 import { Jobs, Users } from '../schema';
 import { JOBS_PER_SEND } from '../constants';

@@ -18,37 +18,56 @@ class JobAnalyzer {
     });
   }

-  async computeJobScores() {
+  /**
+   * Computes the number of times the given keywords appear in the given job
+   * and modifies in the job in-place
+   * @param {Array<String>} keywords
+   * @param {Job} job
+   */
+  computeJobKeywordCount(job, keywords) {
+    // Add the number of occurance of all keywords of the result
+    const description = job.description.toLowerCase();
+    keywords.forEach((keyword) => {
wchang22 marked this conversation as resolved.
+      // TODO: matches "java" with "javascript" from description
+      // NOTE: if you map with spaces around it, problems such as "java," arise
+      const re = new RegExp(keyword, 'g');
+      job.keywords.push({
+        name: keyword,
+        count: (description.match(re) || []).length,
+      });
+    });
+  }
+
+  /**
+   * Computes tf-idf scores for all jobs using all user skills
+   * Optionally specify a range of skills to use
+   *
+   * @param {Number} skillsStart Index of first skill to use
+   * @param {Number} skillsEnd One past the index of the last skill to use
+   */
+  async computeJobScores(skillsStart, skillsEnd) {
     this.logger.info('Starting to compute job scores...');

     const jobs = await Jobs.find({});
-    const skills = await User._getAllSkills();
+    const offset = skillsStart || 0;
Owner

this does offset = skillsStart

Collaborator Author

If skillsStart is not passed in, this evaluates to undefined || 0 = 0 instead of undefined.

Owner

assert statement?

Owner

Will it ever be undefined?

Collaborator Author

The idea is that you can call this with or without an argument.

i.e.

computeJobScores() // computes for all skills
computeJobScores(5, 12) // computes skills 5 to 11

Owner

I see what you are trying to do now. Could we set the default of skillsStart and skillsEnd to 0 then? I feel like that is easier to read.
async computeJobScores(skillsStart=0, skillsEnd=0)
I am not sure how it works in JavaScript, but if we do the above, a caller could specify only skillsStart, and that would lead to an error when we try to slice the list. Maybe
async computeJobScores(skillsStart=0, skillsEnd=skillsStart)?

Collaborator Author

There is no such thing as optional arguments in JS.

.slice(skillsStart, undefined) actually works, because this is the same as .slice(skillsStart), which by default will slice until the end.
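
For reference, the behavior discussed in this thread can be checked in isolation. The sketch below uses a hypothetical selectSkills helper that mirrors the range selection in computeJobScores in simplified form: an omitted argument arrives as undefined, and slice treats an undefined end as "until the end". The second helper shows the same behavior with an ES2015 default parameter, and the final line shows why a default of 0 for skillsEnd would not work.

```javascript
// Hypothetical helper mirroring the skill-range selection (simplified).
function selectSkills(allSkills, skillsStart, skillsEnd) {
  // An omitted skillsStart is undefined, so this falls back to 0.
  const offset = skillsStart || 0;
  // An omitted skillsEnd is undefined, so slice runs to the end of the array.
  return allSkills.slice(offset, skillsEnd);
}

// Same behavior written with an ES2015 default parameter for skillsStart.
function selectSkillsWithDefault(allSkills, skillsStart = 0, skillsEnd) {
  return allSkills.slice(skillsStart, skillsEnd);
}

const skills = ['c', 'go', 'java', 'python', 'rust'];
console.log(selectSkills(skills));       // all five skills
console.log(selectSkills(skills, 3));    // [ 'python', 'rust' ]
console.log(selectSkills(skills, 1, 3)); // [ 'go', 'java' ]

// A default of 0 for skillsEnd would select nothing:
console.log(skills.slice(0, 0));         // []
```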

+    const allSkills = await AllSkills.getAll();
+    const newKeywords = offset > 0 ? allSkills.slice(offset, skillsEnd) : allSkills;

-    await forEachAsync(skills, async (skill, skillIdx) => {
-      const keyword = skill.replace(/[-/\\^$*+?.()|[\]{}]/g, '\\$&');
-      const docCount = jobs.reduce((sum, posting) => sum
-        + Number(posting.keywords[skillIdx].count > 0), 0);
+    await forEachAsync(newKeywords, async (_, newKeywordIdxBase) => {
+      const allKeywordIdx = newKeywordIdxBase + offset;
+      // Count the number of jobs with the given skill
+      const docCount = jobs.reduce((sum, job) => sum
+        + Number(job.keywords[allKeywordIdx].count > 0), 0);

-      const jobsLen = jobs.length;
       // calculate tf_idf each doc and save it
-      await forEachAsync(jobs, async (job, i) => {
-        const keywordOccurrences = job.keywords[skillIdx].count; // TODO: what if new keyword?
+      await forEachAsync(jobs, async (job, jobIdx) => {
+        const keywordOccurrences = job.keywords[allKeywordIdx].count;
         const wordCount = job.description.split(' ').length;
         const tf = keywordOccurrences / wordCount;
-        const idf = docCount !== 0 ? Math.log(jobsLen / docCount) : 0;
-        const tfidf = tf * idf;
+        const idf = docCount !== 0 ? Math.log(jobs.length / docCount) : 0;

-        // add name and tf_idf score to each job's keywords the first time
         // replace tf_idf score for a keyword for each job
-        const keywordIdx = job.keywords.findIndex(elem => elem.name === keyword);
-        if (keywordIdx === -1) {
-          job.keywords.push({
-            name: keyword,
-            tfidf,
-          });
-        } else {
-          jobs[i].keywords[keywordIdx].tfidf = tfidf;
-        }
+        jobs[jobIdx].keywords[allKeywordIdx].tfidf = tf * idf;

         await job.save();
Owner

can save all jobs outside the async block

Owner

We could probably do job.keywords[keywordIdx].save(). This is very nitpicky, but it may be faster with a lot of users. I would test that there are no errors AND that it actually saves, though.

Collaborator Author

Tried, got
mongoose: calling save() on a subdoc does not save the document to MongoDB, it only runs save middleware.

       });
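
The scoring in this hunk follows the usual tf-idf shape: term frequency is the keyword count divided by the description's word count, and inverse document frequency is the log of total jobs over jobs containing the keyword (or 0 when no job contains it). Here is a minimal, database-free sketch of the same arithmetic. The computeTfIdf helper and the `counts` object on each job are assumptions for illustration; the PR stores counts in job.keywords by index instead.

```javascript
// jobs: [{ description, counts: { keyword: occurrences } }]  (hypothetical shape)
function computeTfIdf(jobs, keyword) {
  // Number of jobs whose description contains the keyword at least once.
  const docCount = jobs.reduce((sum, job) => sum + Number(job.counts[keyword] > 0), 0);
  // Guard against division by zero when no job mentions the keyword.
  const idf = docCount !== 0 ? Math.log(jobs.length / docCount) : 0;

  return jobs.map((job) => {
    const wordCount = job.description.split(' ').length;
    const tf = job.counts[keyword] / wordCount;
    return tf * idf;
  });
}

const sampleJobs = [
  { description: 'java developer wanted', counts: { java: 1, developer: 1 } },
  { description: 'python developer wanted now', counts: { java: 0, developer: 1 } },
];

console.log(computeTfIdf(sampleJobs, 'java'));      // first score is ln(2) / 3, second is 0
console.log(computeTfIdf(sampleJobs, 'developer')); // [ 0, 0 ]
```

Note that a keyword appearing in every job gets an idf of log(1) = 0, so it contributes nothing to any job's score.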
39 changes: 16 additions & 23 deletions backend/src/job_searcher/JobSearcher.js
@@ -3,14 +3,16 @@ import axios from 'axios';
 import cheerio from 'cheerio';
 import Logger from 'js-logger';

-import User from '../user';
+import AllSkills from '../all_skills';
 import { Jobs } from '../schema';
 import { MIN_JOBS_IN_DB } from '../constants';

 class JobSearcher {
   constructor(jobAnalyzer) {
     this.logger = Logger.get(this.constructor.name);

+    this.jobAnalyzer = jobAnalyzer;
+
     // TODO: Change this to run periodically instead of on startup
     this.updateJobStore().then(() => {
       jobAnalyzer.computeJobScores()
@@ -43,53 +45,44 @@ class JobSearcher {
   }

   async searchJobs(keyphrases) {
-    const jobs = [];
-    const keywords = await User._getAllSkills();
+    const jobList = [];
+    const keywords = await AllSkills.getAll();

     await Promise.all(keyphrases.map(async (keyphrase) => {
       // TODO: change these hardcoded params
       try {
-        const results = await indeed.query({
+        const jobs = await indeed.query({
           query: keyphrase,
           maxAge: '30',
           sort: 'relevance',
           limit: 30,
         });

-        await Promise.all(results.map(async (result, i) => {
+        await Promise.all(jobs.map(async (job, jobIdx) => {
           // Add description, unique url to each result by scraping the webpage
-          const jobPage = await axios.get(result.url);
+          const jobPage = await axios.get(job.url);
           const $ = cheerio.load(jobPage.data);
-          results[i].description = $('#jobDescriptionText').text();
-          results[i].url = $('#indeed-share-url').attr('content');
+          jobs[jobIdx].description = $('#jobDescriptionText').text();
+          jobs[jobIdx].url = $('#indeed-share-url').attr('content');

-          const jobExists = await Jobs.findOne({ url: result[i].url });
+          const jobExists = await Jobs.findOne({ url: jobs[jobIdx].url });
           // Check if job exists in the database already
           if (jobExists !== null) {
             return;
           }

-          // Add the number of occurance of all keywords of the result
-          const jobDescriptionLower = results[i].description.toLowerCase();
-          results[i].keywords = [];
-          keywords.forEach((keyword) => {
-            // TODO: matches "java" with "javascript" from description
-            // NOTE: if you map with spaces around it, problems such as "java," arise
-            const re = new RegExp(keyword, 'g');
-            results[i].keywords.push({
-              name: keyword,
-              count: (jobDescriptionLower.match(re) || []).length,
-            });
-          });
+          jobs[jobIdx].keywords = [];
+          // Compute count of each keyword in the job
+          this.jobAnalyzer.computeJobKeywordCount(jobs[jobIdx], keywords);
         }));

-        jobs.push(...results);
+        jobList.push(...jobs);
       } catch (e) {
         this.logger.error(e);
       }
     }));

     return jobList;
-    return jobs;
   }

   async addToJobStore(jobs) {
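
The TODO carried along with the keyword-count code notes that "java" also matches inside "javascript", while padding the keyword with spaces misses cases such as "java,". One possible approach, sketched below under stated assumptions, is to escape the keyword and require that it is not directly adjacent to another letter or digit. countKeyword and escapeRegExp are hypothetical helpers, not part of the PR, and the lookbehind syntax assumes a reasonably recent Node.js version.

```javascript
// Escape regex metacharacters so skills such as "c++" are matched literally.
function escapeRegExp(text) {
  return text.replace(/[-/\\^$*+?.()|[\]{}]/g, '\\$&');
}

// Count occurrences of the keyword that are not directly adjacent
// to another letter or digit on either side.
function countKeyword(description, keyword) {
  const escaped = escapeRegExp(keyword.toLowerCase());
  const re = new RegExp(`(?<![a-z0-9])${escaped}(?![a-z0-9])`, 'g');
  return (description.toLowerCase().match(re) || []).length;
}

console.log(countKeyword('Java, JavaScript and java.', 'java')); // 2
console.log(countKeyword('We use C++ and Go', 'c++'));           // 1
```

This is still approximate: a very short keyword such as "c" would match the "c" in "c++", because "+" is neither a letter nor a digit.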
3 changes: 2 additions & 1 deletion backend/src/schema/index.js
@@ -1,4 +1,5 @@
 import Jobs from './job_schema';
 import Users from './user_schema';
+import Skills from './skills_schema';

-export { Jobs, Users };
+export { Jobs, Users, Skills };
13 changes: 13 additions & 0 deletions backend/src/schema/skills_schema.js
@@ -0,0 +1,13 @@
+import mongoose from 'mongoose';
+
+
+const skillsSchema = new mongoose.Schema({
+  skills: [String],
+},
+{
+  versionKey: false,
+});
+
+const Skills = mongoose.model('Skills', skillsSchema);
+
+export default Skills;
20 changes: 6 additions & 14 deletions backend/src/user/User.js
@@ -6,11 +6,13 @@ import { Users } from '../schema';
 import credentials from '../../credentials/google';

 class User {
-  constructor(app) {
+  constructor(app, allSkills) {
jacksonx9 marked this conversation as resolved.
     this.logger = Logger.get(this.constructor.name);

     this.googleAuth = new OAuth2Client(credentials.clientId);

+    this.allSkills = allSkills;
+
     app.post('/users/googleLogin', async (req, res) => {
       const { idToken, firebaseToken } = req.body;
       const response = await this.loginGoogle(idToken, firebaseToken);
@@ -172,19 +174,6 @@ class User {
     }
   }

-  /* gets and returns a set containing the collective skills of all the users */
-  static async _getAllSkills() {
-    const keywords = [];
-    const users = await Users.find({});
-
-    users.forEach((user) => {
-      const skills = user.keywords.map(keyword => keyword.name);
-      keywords.push(...skills);
-    });
-
-    return keywords;
-  }
-
   async updateSkills(userId, skills) {
     if (!userId || !skills) {
       return new Response(false, 'Invalid userId or skills', 400);
@@ -208,6 +197,9 @@ class User {

       await user.save();

+      // Update global set of skills
+      await this.allSkills.update(skills);
+
       return new Response(true, '', 200);
     } catch (e) {
       return new Response(false, 'Invalid userId or skills', 400);