LinkedIn data science job postings analysis using natural language processing techniques and prediction of candidates' education level

m3redithw/Linked-inSight

by Meredith Wang

Python Pandas NumPy SciPy Matplotlib Selenium seaborn plotly sklearn NLTK

🌐 Project Description

Job hunting is a tedious and stressful process. Stacked paragraphs of description and long lists of requirements in job listings only add fuel to the fire. This project aims to help me and other aspiring data science professionals gain clear insight into the role they're pursuing and a better understanding of the education level of their competitors.

🌟 Project Goals

Our goal is to analyze data science job postings on LinkedIn using Natural Language Processing techniques and to predict candidates' education level.

Education level is classified into two categories:

  • Undergraduate (candidates whose highest education level is a Bachelor's degree, and those who have 'other' degrees)
  • Graduate (candidates whose highest education level is a Master's degree or PhD)
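
As a minimal sketch of how a posting could be assigned to one of these two classes, the function below applies a simple majority rule to the candidate education percentages. The rule itself, the tie-breaking choice, and the parameter names are assumptions made for illustration; the notebooks may derive the target differently.

```python
def label_education(bachelor_pct, master_pct, phd_pct, other_pct=0):
    """Return 'Graduate' or 'Undergraduate' for one job posting.

    Assumption: a posting is labeled by whichever group holds the larger
    share of candidates. Ties fall to 'Undergraduate' (arbitrary choice).
    """
    # 'Other' degrees are grouped with Bachelor's, as in the class definition.
    undergraduate_share = bachelor_pct + other_pct
    graduate_share = master_pct + phd_pct
    return "Graduate" if graduate_share > undergraduate_share else "Undergraduate"


print(label_education(bachelor_pct=30, master_pct=50, phd_pct=15, other_pct=5))
```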

📝 Initial Questions

▪ What does the overall distribution of candidates' education look like?

▪ Is role dependent on the education level of candidates?

▪ Is job level dependent on the education level of candidates?

▪ Is the job description different for the graduate vs. undergraduate group?

📂 Data Dictionary

| Variable | Value | Meaning |
| --- | --- | --- |
| Link | String | The URL of the job posting |
| Company | String | The company name of the job posting |
| Mode | On-Site; Remote; Hybrid | The working environment of the job posting |
| Type | Full-time; Contract | The contract type of the job posting |
| Level | Entry; Associate; Mid-Senior | The job level of the job posting |
| Requirements | String | The requirements in the description section of the job posting |
| Edu Level | Int | Percentage of candidates at each education level for the job position |
| Skills | String | The top 10 skills of the candidates for the job posting |

🧭 Outline/Planning

1️⃣ Data Acquisition

Gather data from LinkedIn using Selenium
  • Install the Selenium WebDriver

  • Create function to guide driver to automate job search

  • Store data locally to a .csv file
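
The acquisition steps can be sketched roughly as follows. This is only an outline under stated assumptions: the function names and the `Title` column are hypothetical, logging in and running the job search are omitted, and extracting the real fields (company, level, requirements, and so on) depends on LinkedIn's page markup, which is not reproduced here.

```python
import csv


def make_driver():
    """Start a Chrome session (assumes Selenium and a Chrome driver are installed)."""
    # Imported lazily so the helpers below can be used without Selenium.
    from selenium import webdriver
    return webdriver.Chrome()


def collect_postings(driver, urls):
    """Visit each posting URL and record basic page information."""
    rows = []
    for url in urls:
        driver.get(url)
        # 'Title' is a hypothetical placeholder for the fields scraped in practice.
        rows.append({"Link": driver.current_url, "Title": driver.title})
    return rows


def save_postings(rows, path):
    """Store the collected rows locally in a .csv file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Link", "Title"])
        writer.writeheader()
        writer.writerows(rows)
```

A run would then look like `save_postings(collect_postings(make_driver(), urls), "postings.csv")`, where `urls` is the list of posting links gathered from the search results.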

2️⃣ Data Preparation

Missing Values
  • When a job posting does not have enough candidates to generate insights, the education level and skills will be missing

  • Missing values are filled manually by visiting the URL of the job posting and finding another posting with the same job level, role, and company

Dummy Variables

Categorical features (e.g. role, level) are turned into dummy variables to quantify the features, so we can use them in the models.
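
A small illustration of this step with `pandas.get_dummies`. The rows are made up, and the column names `Role` and `Level` are assumptions based on the features named above.

```python
import pandas as pd

# Made-up rows for illustration only.
postings = pd.DataFrame({
    "Role": ["Data Scientist", "Data Analyst", "Data Engineer", "Data Scientist"],
    "Level": ["Entry", "Associate", "Mid-Senior", "Entry"],
})

# One 0/1 indicator column per category value, e.g. "Role_Data Scientist".
dummies = pd.get_dummies(postings, columns=["Role", "Level"], dtype=int)
print(dummies.columns.tolist())
```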

Initial Text Cleaning

Job role names vary across companies. For example, a data scientist position may be listed as "Data Scientist II", "Data Scientist, Charging Data and Modeling", "Data Scientist - Credit Card", and so on. To analyze the relationship between the general role category and the target variable, all roles are generalized into 4 categories: Data Scientist, Data Analyst, Data Engineer, Managerial Roles.
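
One possible way to do this generalization is keyword matching on the job title, sketched below. The keyword lists and the order of the checks are assumptions for illustration, not the exact rules used in prepare.py.

```python
import re

# Hypothetical keywords that mark a managerial title.
MANAGERIAL_WORDS = {"manager", "director", "head", "lead"}


def generalize_role(title):
    """Map a raw job title to one of the 4 general role categories."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    if words & MANAGERIAL_WORDS:
        return "Managerial Roles"
    if "engineer" in words:
        return "Data Engineer"
    if "analyst" in words:
        return "Data Analyst"
    if "scientist" in words:
        return "Data Scientist"
    return None  # title does not match any category; review it by hand


for title in ["Data Scientist II", "Data Scientist - Credit Card", "Senior Data Engineer"]:
    print(title, "->", generalize_role(title))
```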

Parsing Text
  • Convert text to all lower case for consistency

  • Remove any accented characters, non-ASCII characters

  • Remove special characters

  • Lemmatization

  • Remove stopwords

  • Store the clean text and the original text for use in future notebooks
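
The parsing steps above can be sketched with the standard library alone. The tiny stopword set and the optional `lemmatize` hook are assumptions that keep the example self-contained. With NLTK, one could pass `WordNetLemmatizer().lemmatize` and `set(stopwords.words("english"))` instead, once the corresponding corpora are downloaded.

```python
import re
import unicodedata

# Deliberately tiny stopword set for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "for", "is", "are", "or", "on"}


def clean_text(text, lemmatize=None, stopwords=STOPWORDS):
    """Lower-case, strip accents and special characters, lemmatize, drop stopwords."""
    text = text.lower()
    # Decompose accented characters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace special characters with spaces.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    words = text.split()
    if lemmatize is not None:
        words = [lemmatize(word) for word in words]
    return " ".join(word for word in words if word not in stopwords)


original = "Résumé with 3+ years of SQL & Python!"
# Keep both versions, as in the last step of the list.
record = {"original": original, "clean": clean_text(original)}
print(record["clean"])
```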

3️⃣ Data Exploration

  • Address the initial questions to find the key features associated with the undergraduate and graduate groups

  • Explore each feature's correlation with education distribution

  • Use visualizations to better understand the relationship between features and target variable

4️⃣ Statistical Testing

  • Conduct T-Test for categorical variable vs. numerical variable

  • Conduct Chi^2 Test for categorical variable vs. categorical variable

  • Conclude hypothesis and address the initial questions
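
A hedged sketch of the two tests with `scipy.stats`. All numbers below are invented for illustration and are not the project's data; the 0.05 significance level and the grouping used for the t-test are likewise assumptions.

```python
from scipy import stats

ALPHA = 0.05  # assumed significance level

# Chi^2 test: categorical vs. categorical.
# Illustrative counts only. Rows: role (scientist, analyst, engineer);
# columns: (graduate, undergraduate).
observed = [
    [30, 10],
    [12, 20],
    [8, 15],
]
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"chi^2 = {chi2:.2f}, dof = {dof}, p = {p_chi2:.4f}")
print("reject independence" if p_chi2 < ALPHA else "fail to reject independence")

# t-test: categorical (two groups) vs. numerical.
# Illustrative values only: share of candidates with a graduate degree (%)
# for postings in two hypothetical groups.
remote = [62, 70, 55, 68, 74, 66]
on_site = [58, 49, 63, 52, 60, 57]
t_stat, p_t = stats.ttest_ind(remote, on_site, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_t:.4f}")
```

Setting `equal_var=False` runs Welch's t-test, which does not assume the two groups have equal variance.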

5️⃣ Modeling

  • Create decision tree classifier and fit train dataset

  • Find the max depth for the best performing decision tree classifier (evaluated using classification report, accuracy score)

  • Create random forest classifier and fit train dataset

  • Find the max depth for the best performing random forest classifier (evaluated using classification report, accuracy score)

  • Create logistic regression model and fit train dataset

  • Find the parameter C for the best performing logistic regression model (evaluated using classification report, accuracy score)

  • Create XGBoost classifier and fit train dataset

  • Pick the top 3 models among all the models and evaluate performance on validate dataset

  • Pick the model with highest accuracy and evaluate on test dataset
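
The search for the best max depth can be sketched as below for the decision tree. It runs on synthetic data from `make_classification`, not the job postings, and the 60/20/20 split and depth range are assumptions. For brevity it selects the depth on the validate split directly rather than shortlisting three models first.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target (243 rows).
X, y = make_classification(n_samples=243, n_features=10, random_state=42)

# Assumed 60/20/20 train/validate/test split.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

# Fit one tree per depth and score it on the validate split.
scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_val, tree.predict(X_val))

best_depth = max(scores, key=scores.get)

# Evaluate the selected model once on the held-out test split.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(f"best max_depth = {best_depth}, test accuracy = {test_accuracy:.2f}")
```

The same loop applies to the random forest (`max_depth`) and logistic regression (`C`) by swapping the estimator and the parameter being varied.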

🔁 Steps to Reproduce

NOTE: The job postings data is not static, so each run of the automated search returns different results. The insights from exploration and the accuracy of the models will therefore differ slightly as well.

  • You will need a LinkedIn account, preferably a Premium account, so you can access the part of the data that is used as modeling features. Store your password locally in a secret text file.
  • You will need to install the Selenium WebDriver. Please follow the documentation and steps in the acquisition notebook.
  • Run the driver to acquire the latest job postings yourself, then store them in a .csv file.

OR

  • You can choose to use the data that I based my analysis on. Please contact me for the .csv file.

The following steps apply to both options:

  • Clone my repo (including imports.py, prepare.py)
  • Libraries used are pandas, matplotlib, seaborn, plotly, sklearn, scipy, selenium, nltk
  • Follow the instructions in each notebook throughout the pipeline (preparation, exploration, modeling) and in the README file
  • You are then good to run the workbook and read through the white paper 😸

🔑 Key Findings

  • For less than 1/4 of the candidates for data science job postings, the highest education level is a Bachelor's degree.

  • Candidate's education distribution is dependent on role (scientist, analyst, engineer, managerial roles)

  • Candidate's education distribution is independent of job level (entry, associate, mid-senior)

  • For entry level positions, the number of candidates with graduate degrees is significantly higher than the number with undergrad degrees.

  • Top phrases mentioned in data science job descriptions are: Data Analytics, no. of years experience, SQL, Python, Master Degree, Business

  • Top skills among data science candidates: SQL, Python, Machine Learning, Data Analysis, R, C/C++, Tableau, Data Visualization

  • The final model, a decision tree, is expected to predict with 87% accuracy on future unseen data.

🔜 Next Steps

  • For the purpose of completing an MVP, I was only able to gather 243 observations. That is one of the reasons there is a class imbalance in our dataset, and why the model fails to converge while still reporting a higher accuracy. Gathering more data is therefore an important next step.

  • This project is solely focused on Data Science related job positions in the United States. We can expand the scope to other areas in tech (e.g. web development, cloud administration, etc.) and compare the education distribution across fields. We can also expand to other countries to see whether such a master's-degree-dominant pool exists only in the United States.

  • There are a large number of master's programs, and there is no indicator of the quality of each program. For further study, I would like to include parameters that distinguish between different levels of degree accomplishment.

🔆 Recommendations/Further Questions

  • For candidates who don't have a graduate degree, or a bachelor's degree in STEM, I suggest you focus on mastering the "top skills" identified in the explore section.

  • What exactly is the difference between candidates who acquire the skills on their own and those who went through a graduate program that costs $50k on average? How small is the chance for someone without a desired degree to "survive" the sea of resumes?
