Ranking CVs against JDs.
.
|
+-- .gitignore
+-- README.md
+-- model/ --gitignored
+-- data/ # both model/ and data/ are kept at the top level so they are easily accessible from the other folders.
| +-- collectCV.py
| +-- jd.csv
| +-- raw_data/
+-- model-stuff/
| +-- Word2Vec Model Training.ipynb
| +-- paragraph_extraction_from_Posts.xml.ipynb
| +-- sentence_extraction_from_paras.txt.ipynb
| +-- sample_bitcoin.stackexchange_paras.txt
| +-- sample_bitcoin.stackexchange_sentences.txt
| +-- stackexchange/ --gitignored
+-- section-stuff/
| +-- get_sections.ipynb
| +-- prc_data.csv
+-- score-stuff/
| +-- WithWord2Vec.ipynb
| +-- WithSpacyModel.ipynb
--gitignored ⛔ : too big for GitHub.
See below for a detailed explanation. 👇
- Each word is represented as a 300-dimensional numpy array. 🎷
- Trained on the StackExchange ⭐ data dump (in XML).
- Collected 1,237,328 word types from a corpus of 565,919,447 raw words and 32,701,720 sentences. ❗
- Running time: ~3 hrs. ⏳
- Most of the text preprocessing was already completed earlier; for more details see model-stuff/ below. 👇
- Successfully trained the word2vec model on StackExchange data!! 🌞
- The code for training the model is in the Word2Vec Model Training.ipynb notebook. The model is saved in the ./model/ subdirectory (locally).
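For quick reference, here is a minimal sketch of loading the saved model and querying it with gensim. The filename w2v.model and the save format are assumptions; adjust them to whatever the notebook actually wrote into ./model/.

```python
# Minimal sketch (the filename "w2v.model" is assumed, not the notebook's exact one).
from gensim.models import Word2Vec

model = Word2Vec.load("./model/w2v.model")

print(model.wv.vector_size)                      # 300, per the note above
print(model.wv.most_similar("bitcoin", topn=5))  # nearest neighbours by cosine similarity
```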
- sample_bitcoin.stackexchange_paras.txt is the paras.txt (paragraphs in HTML tags) file for the bitcoin.stackexchange.com subdirectory of the dataset. It was generated from Posts.xml using the code in the paragraph_extraction_from_Posts.xml.ipynb notebook (a sketch of this step appears after the directory listing below).
- sample_bitcoin.stackexchange_sentences.txt is the sentences.txt (pure sentences) file for the bitcoin.stackexchange.com subdirectory of the dataset. It was generated from the corresponding paras.txt using the code in sentence_extraction_from_paras.txt.ipynb; the process took around 12.5 hours to complete (see the second sketch after the directory listing below).
- stackexchange/ : the dataset is hosted on archive.org and contains a dump for every community under StackExchange, in XML format. Each subdirectory (community) had the following directory substructure:
stackexchange/
+-- README
+-- android.stackexchange.com/
| +-- Posts.xml
| +-- PostHistory.xml
| +-- Badges.xml
| +-- Comments.xml
| +-- PostLinks.xml
| +-- Tags.xml
| +-- Users.xml
| +-- Votes.xml
+-- askubuntu.com/
| +-- Posts.xml
| +-- PostHistory.xml
| +-- Badges.xml
| +-- Comments.xml
| +-- PostLinks.xml
| +-- Tags.xml
| +-- Users.xml
| +-- Votes.xml
+-- ...so on
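For illustration, the paragraph-extraction step might look like the sketch below. This is a hypothetical reading of paragraph_extraction_from_Posts.xml.ipynb, relying only on the fact that the dump's Posts.xml holds one row element per post with its HTML body in the Body attribute; paths and file names are illustrative.

```python
# Hypothetical sketch of the Posts.xml -> paras.txt step.
import re
import xml.etree.ElementTree as ET

SRC = "stackexchange/bitcoin.stackexchange.com/Posts.xml"

with open("bitcoin.stackexchange_paras.txt", "w", encoding="utf-8") as out:
    # iterparse streams the file, so the multi-GB dump never sits in memory
    for _, elem in ET.iterparse(SRC):
        if elem.tag == "row":
            body = elem.get("Body") or ""
            # keep each <p>...</p> paragraph on its own line
            for para in re.findall(r"<p>.*?</p>", body, flags=re.DOTALL):
                out.write(para.replace("\n", " ") + "\n")
            elem.clear()  # free memory as we go
```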
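Likewise, a matching sketch of the paras.txt -> sentences.txt step, assuming an HTML-tag strip followed by NLTK's punkt sentence tokenizer; the notebook may well tokenize differently.

```python
# Hypothetical sketch of the paras.txt -> sentences.txt step.
import re
import nltk

nltk.download("punkt", quiet=True)  # punkt tokenizer is an assumption

with open("bitcoin.stackexchange_paras.txt", encoding="utf-8") as src, \
     open("bitcoin.stackexchange_sentences.txt", "w", encoding="utf-8") as out:
    for para in src:
        text = re.sub(r"<[^>]+>", " ", para)      # strip HTML tags
        text = re.sub(r"\s+", " ", text).strip()  # normalise whitespace
        for sent in nltk.sent_tokenize(text):
            out.write(sent + "\n")
```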
- raw_data/ : contains the collected CVs.
- collectCV.py : while this program is running, every new text copied to the clipboard is saved as a CV in the raw_data/ subdirectory, in text format (a sketch of this loop follows below).
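A minimal sketch of what that clipboard loop could look like, assuming the pyperclip library; the real collectCV.py may differ in naming and polling details.

```python
# Hypothetical sketch of collectCV.py's clipboard-watching loop.
import time
import pathlib
import pyperclip

RAW_DATA = pathlib.Path("data/raw_data")
RAW_DATA.mkdir(parents=True, exist_ok=True)

last, count = "", 0
while True:
    text = pyperclip.paste()
    if text and text != last:  # only save new clipboard content
        count += 1
        (RAW_DATA / f"cv_{count}.txt").write_text(text, encoding="utf-8")
        last = text
    time.sleep(1)              # poll once per second
```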
- jd.csv : Job Descriptions filtered down to IT positions only, taken from the Kaggle dataset here: https://www.kaggle.com/c/job-salary-prediction/data (a possible filtering step is sketched below).
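One plausible way the filtering could be done with pandas is sketched below; the column names (Category, FullDescription) follow the Kaggle dataset, but the exact filter behind jd.csv is an assumption.

```python
# Hypothetical sketch of filtering the Kaggle data down to IT jobs.
import pandas as pd

df = pd.read_csv("Train_rev1.csv")         # Kaggle training file
it_jobs = df[df["Category"] == "IT Jobs"]  # keep IT positions only
it_jobs[["Title", "FullDescription"]].to_csv("data/jd.csv", index=False)
```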
- WithWord2Vec.ipynb : demonstrates how to use word2vec to get similar words by word and by vector. It also implements a sent2vec() function, which takes a sentence as an argument and returns an average vector for the sentence; Root Mean Square is used to average the vectors. The advantage of this function is that it can find similar words for whole phrases, which makes more sense when searching for roles etc. For example, 'software developer' will give 'developers' as a similar word. The notebook also implements a function for calculating cosine similarity between two vectors (both are sketched below).
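A minimal sketch of the sent2vec() and cosine-similarity ideas described above, reading "Root Mean Square" as an element-wise RMS over the word vectors; the notebook's exact implementation may differ.

```python
# Sketch of sent2vec() (RMS average) and cosine similarity; illustrative only.
import numpy as np

def sent2vec(model, sentence):
    """Element-wise RMS average of the in-vocabulary word vectors."""
    vecs = [model.wv[w] for w in sentence.lower().split() if w in model.wv]
    if not vecs:
        return np.zeros(model.wv.vector_size)
    stacked = np.stack(vecs)
    return np.sqrt((stacked ** 2).mean(axis=0))

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Phrase query, e.g.:
# model.wv.similar_by_vector(sent2vec(model, "software developer"), topn=5)
```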
- WithSpacyModel.ipynb : demonstrates the need for a custom word2vec model rather than a general-purpose pretrained one. The similarity values produced by spaCy's en_core_web_md model, which is trained on general news and web text, do not reflect the 🍁 technological sharpness 🍁 required for this project (see the sketch below).
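A hypothetical snippet in the spirit of that comparison; the word pair is illustrative and no score here is taken from the notebook.

```python
# Hypothetical comparison of a general spaCy model vs. the custom model.
import spacy

nlp = spacy.load("en_core_web_md")
print("spaCy:", nlp("python").similarity(nlp("django")))

# Custom StackExchange word2vec model, loaded as in the earlier sketch:
# print("custom:", model.wv.similarity("python", "django"))
```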