Directory structure 🌳

Ranking CVs against JDs.

Directory structure 🌳

.
|   
+-- .gitignore
+-- README.md
+-- model/  --gitignored
+-- data/                 #Both made accessible to all parent folders easily.
|   +-- collectCV.py
|   +-- jd.csv
|   +-- raw_data/
+-- model-stuff/
|   +-- Word2Vec Model Training.ipynb
|   +-- paragraph_extraction_from_Posts.xml.ipynb
|   +-- sentence_extraction_from_paras.txt.ipynb
|   +-- sample_bitcoin.stackexchange_paras.txt
|   +-- sample_bitcoin.stackexchange_sentences.txt
|   +-- stackexchange/ --gitignored
+-- section-stuff/
|   +-- get_sections.ipynb
|   +-- prc_data.csv
+-- score-stuff/
|   +-- WithWord2Vec.ipynb
|   +-- WithSpacyModel.ipynb

gitignored ⛔ too big for GitHub.

See below for detailed explaination. 👇

About the Generated Word2Vec Word Embeddings

Each word is represented as a 300-sized numpy array. 🎷
Trained against stackoverflow⭐ data dump (in xml).
Collected 1237328 word types from a corpus of 565919447 raw words and 32701720 sentences. ❗
Running time ~3hrs. ⏳
Most of the text preprocessing was already completed earlier.
For more details see model-stuff/ below. 👇

Directory details 👈

model-stuff/

Successfully trained the word2vec model on stackexchange data !! 🌞
The code for training the model is present as Word2Vec Model Training.ipynb notebook. The model was saved in ./model/ subdirectory (locally).
sample_bitcoin.stackexchange_paras.txt is the paras.txt (paragraph in html tags) file for bitcoin.stackexchange.com subdirectory of the dataset. It was generated from the Posts.xml using the code in paragraph_extraction_from_Posts.xml.ipynb notebook.
sample_bitcoin.stackexchange_sentences.txt is the sentences.txt (pure sentences) file for bitcoin.stackexchange.com subdirectory of the dataset. It was generated from the corresponding paras.txt, using the code in sentence_extraction_from_paras.txt.ipynb. The process took around ~12.5 hours to complete.
stackexchange/ The dataset is hosted on archive.org. The dataset has dump for all communities under StackExchange in xml format. Each subdirectory(community) had the following directory substructure:

stackexchange/
+-- README
+-- android.stackexchange.com/
|   +-- Posts.xml
|   +-- PostHistory.xml
|   +-- Badges.xml
|   +-- Comments.xml
|   +-- PostLinks.xml
|   +-- Badges.xml
|   +-- Tags.xml
|   +-- Users.xml
|   +-- Votes.xml
+-- askubuntu.com/
|   +-- Posts.xml
|   +-- PostHistory.xml
|   +-- Badges.xml
|   +-- Comments.xml
|   +-- PostLinks.xml
|   +-- Badges.xml
|   +-- Tags.xml
|   +-- Users.xml
|   +-- Votes.xml
+-- ...so on

section-stuff/

sample_bitcoin.stackexchange_sentences.txt is the sentences.txt (pure sentences) file for bitcoin.stackexchange.com subdirectory of the dataset. It was generated from the corresponding paras.txt generated earlier using the code in sentence_extraction_from_paras.txt.ipynb. The process took around 12.5 hours to complete.

data/

raw_data/ : contains collected CVs
collectCV.py : While this program is running, every new text copied to clipboard is saved as a CV in raw_data/ directory in text format.
collectCV.py : While this program is running, every new text copied to clipboard is saved as a CV in raw_data subdirectory in text format.
jd.csv : This program is to filter the Job Descriptions, only for IT positions from the Kaggle dataset here: https://www.kaggle.com/c/job-salary-prediction/data.

score-stuff/

WithWord2Vec.ipynb: Demonstrates how to use word2vec to get similar words by words and similar words by vector. It also implements sent2vec() function. This function takes a sentence as a argument and returns a average vector for the sentence. Root Mean Square is used to average the vectors. The advantage of this function is to use it to find similar words for phrases which makes more sense while searching for roles etc.

For example:

'software developer' will give 'developers' as a similar word

The notebook also implements a function for calculating cosine similarity b/w two vectors.

WithSpacyModel.ipynb: Demonstrates the need for a custom Word2Vector model rather than a general model trained otherwise. The similarity values generated by en_core_web_md spaCy model trained on Google News articles, do not reflect the 🍁 technological sharpness 🍁 required for the project.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Directory structure 🌳

About the Generated Word2Vec Word Embeddings

Directory details 👈

model-stuff/

section-stuff/

data/

score-stuff/

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
data		data
model-stuff		model-stuff
score-stuff		score-stuff
section-stuff		section-stuff
.gitignore		.gitignore
README.md		README.md
Report.md		Report.md
Report.pdf		Report.pdf

mesksr/resume-matcher

Folders and files

Latest commit

History

Repository files navigation

Directory structure 🌳

About the Generated Word2Vec Word Embeddings

Directory details 👈

model-stuff/

section-stuff/

data/

score-stuff/

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages