## Capstone Project: Job Posting and Profitability 

**Springboard Data Science Intensive Course**

October 23, 2017

Phai Phongthiengtham
***

### 1. Introduction

What drives a company's success? One important factor is undeniably its people, and hiring starts with vacancy postings. Using a merged dataset from CareerBuilder and Compustat, this project examines how vacancy postings relate to a company's success. Section 2 shows how to process the raw posting data and how to apply a fuzzy string matching algorithm to match company names in job postings to legal company names in Compustat. Section 3 provides basic exploration and visualization of the data. Section 4 applies several machine learning algorithms to predict a company's profitability from education requirements, locations, and industry sectors; the resulting model can be used to recommend an education requirement when posting a job. Section 5 adds job titles to the model and discusses the resulting insights. Section 6 concludes and outlines possible future work and extensions.

***

### 2. Data Wrangling

The main datasets in this project are:
1. Job postings from CareerBuilder in 2016, provided by Economic Modeling Specialists International (EMSI). This IPython notebook [here](https://github.com/phaiptt125/online_job_posting/blob/master/data_cleaning/initial_cleaning.ipynb) explains in detail how to clean the job postings data.
2. Financial data on publicly traded companies from the Compustat (North America) database. This IPython notebook [here](https://github.com/phaiptt125/online_job_posting/blob/master/data_cleaning/merge_compustat.ipynb) explains in detail how to merge the job postings data with Compustat.
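
The name-matching step can be illustrated with a minimal sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the linked notebook may well use a different algorithm. The `normalize` and `best_match` helpers, the suffix list, and the 0.85 threshold are all assumptions chosen for illustration.

```python
import difflib
import re

# Hypothetical list of legal suffixes to drop before comparing names.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "plc", "llc"}

def normalize(name):
    """Lowercase, strip punctuation, and remove common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def best_match(posting_name, legal_names, threshold=0.85):
    """Return the legal name most similar to posting_name, or None if no
    candidate reaches the (assumed) similarity threshold."""
    target = normalize(posting_name)
    best_name, best_score = None, 0.0
    for candidate in legal_names:
        score = difflib.SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None

compustat_names = ["Medtronic PLC", "Macy's Inc", "TE Connectivity Ltd"]
print(best_match("Macys", compustat_names))         # punctuation and suffix differ
print(best_match("Medtronc plc", compustat_names))  # misspelled posting name
print(best_match("Acme Widgets", compustat_names))  # no plausible match
```

A threshold trades off false matches against missed matches, so in practice it would be tuned by inspecting a sample of matched pairs.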

There are around 1.4 million ads in total, which is too many to process on a personal computer. The analysis therefore uses a random sample of 10,000 ads, drawn with the command *df = df_full.sample(n=10000)*. In addition, this project only considers manager positions.
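
A reproducible version of this sampling step might look like the sketch below. The toy `df_full`, the seed, and the order of operations (filter first, then sample) are assumptions. The manager filter relies on the fact that management occupations form major group 11 in the O*NET-SOC coding, so their codes start with `11-`; whether the project filtered this way is not stated.

```python
import pandas as pd

# Toy stand-in for the full postings table (the real one has ~1.4 million rows).
df_full = pd.DataFrame({
    "onet": ["11-9051.00", "15-1132.00", "11-3021.00", "41-2031.00", "11-2021.00"],
    "original_jobtitle": ["Food Service Director", "Software Developer",
                          "IT Manager", "Retail Sales Associate", "Marketing Manager"],
})

# Assumption: manager positions are those in O*NET-SOC major group 11.
managers = df_full[df_full["onet"].str.startswith("11-")]

# Fixing random_state makes the subsample identical across runs.
# min(...) guards against requesting more rows than are available.
n = min(10000, len(managers))
df = managers.sample(n=n, random_state=0)
print(len(df))
```

Without `random_state`, each run draws a different sample, so downstream numbers would not be exactly repeatable.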

***

#### 2.1 Measuring Profitability

This project measures profitability with return on equity (ROE): net income as a percentage of shareholders' equity (excluding preferred stock). ROE reveals how much profit a company generates with the money shareholders have invested.

$$ \text{Return on Equity} = \frac{\text{Net Income}}{\text{Shareholders' Equity}}$$
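
As a small worked example of the formula, the sketch below computes ROE for two made-up companies. The function name, the dollar figures, and the choice to reject non-positive equity are illustrative assumptions, not part of the project's code.

```python
def return_on_equity(net_income, shareholders_equity):
    """Return ROE as a fraction (0.08 means 8%)."""
    if shareholders_equity <= 0:
        # Assumption: ROE is treated as undefined for non-positive equity.
        raise ValueError("shareholders' equity must be positive")
    return net_income / shareholders_equity

# Hypothetical figures, in millions of dollars.
profitable = return_on_equity(net_income=12.0, shareholders_equity=150.0)
loss_making = return_on_equity(net_income=-30.0, shareholders_equity=150.0)
print(f"{profitable:.1%}, {loss_making:.1%}")  # prints 8.0%, -20.0%
```

A company with a net loss has a negative ROE, which is why negative values appear in the data.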

***

#### 2.2 Variable Description

The cleaned data is shown below:

```python
import pandas as pd

# Load the cleaned, tab-separated postings data.
df = pd.read_csv('data_postings.txt', sep='\t', header=0)

# Columns to display.
select_var = ['conml', 'roe', 'onet', 'naics', 'state', 'original_jobtitle',
              'high_school', 'associate', 'bachelor', 'master', 'phd']
df[select_var].head()
```

| | conml | roe | onet | naics | state | original_jobtitle | high_school | associate | bachelor | master | phd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Del Taco Restaurants Inc | 0.055423 | 11-9051.00 | 722513 | CA | Food Service Director | 1 | 0 | 0 | 0 | 0 |
| 1 | Medtronic PLC | 0.080089 | 11-3021.00 | 334510 | MN | I&O Business Development & Technology Sr IT Pr... | 0 | 0 | 0 | 0 | 0 |
| 2 | Macy's Inc | 0.143188 | 11-3121.00 | 452111 | AL | Macy's Brookwood Village, Birmingham, AL: Huma... | 0 | 0 | 0 | 0 | 0 |
| 3 | TE Connectivity Ltd | 0.236771 | 11-9141.00 | 334417 | NJ | Property Manager | 0 | 0 | 1 | 0 | 0 |
| 4 | CA Inc | 0.136228 | 11-2021.00 | 511210 | CA | VP, Regional Field Marketing | 0 | 0 | 1 | 1 | 0 |


* *"conml"* : The official company name as reported in its SEC EDGAR filings.
* *"roe"* : Return on equity (net income divided by shareholders' equity), stored as a fraction, e.g. 0.055 for 5.5%.
* *"onet"* : Occupation code from the U.S. Department of Labor; see [here](https://www.onetonline.org/) for more information.
* *"naics"* : North American Industry Classification System code; see [here](https://www.census.gov/eos/www/naics/) for more information.
* *"state"* : State in which the company is located (50 states + DC).
* *"original_jobtitle"* : Original job title as it appeared on CareerBuilder.
* *"high_school"* : Whether a post specifically requires a high school degree (=1 if yes).
* *"associate"* : Whether a post specifically requires an associate degree (=1 if yes).
* *"bachelor"* : Whether a post specifically requires a bachelor's degree (=1 if yes).
* *"master"* : Whether a post specifically requires a master's degree (=1 if yes).
* *"phd"* : Whether a post specifically requires a PhD (=1 if yes).
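
The education dummies are not mutually exclusive: a post can flag several degrees (row 4 above flags both *bachelor* and *master*) or none at all (rows 1 and 2). As a rough illustration of how to read them, the sketch below collapses the five dummies into a single label. The `highest_required_level` helper and the `"none_specified"` label are hypothetical and are not part of the project's code.

```python
# Education dummies, ordered from lowest to highest level.
EDUCATION_LEVELS = ["high_school", "associate", "bachelor", "master", "phd"]

def highest_required_level(row):
    """Return the highest education level flagged in a posting, or
    'none_specified' when every dummy is 0."""
    flagged = [level for level in EDUCATION_LEVELS if row.get(level, 0) == 1]
    return flagged[-1] if flagged else "none_specified"

# Dummy values copied from rows 0, 1 and 4 of the table above.
rows = [
    {"high_school": 1, "associate": 0, "bachelor": 0, "master": 0, "phd": 0},
    {"high_school": 0, "associate": 0, "bachelor": 0, "master": 0, "phd": 0},
    {"high_school": 0, "associate": 0, "bachelor": 1, "master": 1, "phd": 0},
]
labels = [highest_required_level(r) for r in rows]
print(labels)
```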

***

### 3. Data Story

This section presents basic data exploration (module 4). This IPython notebook [here](https://github.com/phaiptt125/springboard_course/blob/master/Capstone/data_story.ipynb) covers this section in detail.

1. The average ROE is 6.4%, but the spread is large: the median ROE is 9.18%, and values range from -96.5% to 89%.
2. Successful companies post more ads.
3. Successful companies post proportionally more ads for top executive positions.
4. Most ads (58%) specifically require a bachelor's degree, followed by a master's degree (17.51%).
5. Successful companies often specify higher education requirements.
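
A mean below the median, as in the first finding, is typical of a distribution with a few large negative outliers. The sketch below reproduces that pattern on made-up ROE values using the standard `statistics` module; the numbers are purely illustrative and are not drawn from the dataset.

```python
import statistics

# Hypothetical ROE values (as fractions); NOT taken from the project data.
roe = [-0.90, -0.05, 0.04, 0.08, 0.10, 0.11, 0.14, 0.30, 0.45]

mean_roe = statistics.mean(roe)
median_roe = statistics.median(roe)

# One large loss drags the mean below the median.
print(f"mean={mean_roe:.3f}, median={median_roe:.3f}")
print(f"min={min(roe):.2f}, max={max(roe):.2f}")
```

With a spread like this, the median is a more robust summary of a typical company than the mean.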

***

### 4. Recommending Education Requirement when Posting Vacancies

This section applies several machine learning algorithms to predict a company's profitability from education requirements, locations, and industry sectors. The resulting model can be used to recommend an education requirement when posting a job. This IPython notebook [here](https://github.com/phaiptt125/springboard_course/blob/master/Capstone/analyze_edu.ipynb) explains the procedure in detail.
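
To make the idea of a recommendation concrete, here is a minimal sketch under stated assumptions. It fits a ridge regression in closed form with NumPy on a made-up table, then compares the predicted ROE of a posting under each candidate education requirement. The data, the option set, the penalty `alpha=1.0`, and the helper names are hypothetical; the notebook's actual models and features may differ, and in practice a library estimator such as scikit-learn's `Ridge` would be the usual choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy postings; the ROE values are illustrative, not real data.
df = pd.DataFrame({
    "roe":      [0.06, 0.08, 0.14, 0.24, 0.14, 0.02, 0.18, 0.10],
    "state":    ["CA", "MN", "AL", "NJ", "CA", "MN", "NJ", "AL"],
    "sector":   ["72", "33", "45", "33", "51", "72", "51", "45"],  # 2-digit NAICS
    "bachelor": [0, 0, 0, 1, 1, 0, 1, 1],
    "master":   [0, 0, 0, 0, 1, 0, 1, 0],
})

# One-hot encode the categorical columns and append the education dummies.
X = pd.concat([pd.get_dummies(df[["state", "sector"]]).astype(float),
               df[["bachelor", "master"]].astype(float)], axis=1)
columns = list(X.columns)
y = df["roe"].to_numpy()

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean  # centering keeps the intercept out of the penalty
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return y_mean - x_mean @ w, w

intercept, coefs = fit_ridge(X.to_numpy(), y, alpha=1.0)

# Candidate education requirements to compare (hypothetical option set).
OPTIONS = {"none": [], "bachelor": ["bachelor"], "master": ["master"]}

def recommend(state, sector):
    """Return the option with the highest predicted ROE, plus all predictions."""
    predictions = {}
    for name, flags in OPTIONS.items():
        row = np.zeros(len(columns))
        for col in [f"state_{state}", f"sector_{sector}", *flags]:
            row[columns.index(col)] = 1.0  # raises ValueError for unseen values
        predictions[name] = float(intercept + row @ coefs)
    return max(predictions, key=predictions.get), predictions

best, predictions = recommend("CA", "51")
print(best, predictions)
```

Because this model is linear and additive, the recommended option is the same for every state and sector; a model with interaction terms or a tree-based model would be needed for recommendations that vary by context.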

***

### 5. Job Titles and Profitability

This section adds job titles to the model and provides several interesting insights. This IPython notebook [here](https://github.com/phaiptt125/springboard_course/blob/master/Capstone/analyze_title.ipynb) explains the procedure in detail.

First, job titles should not be too long. With ROE modeled as a quadratic function of the number of words in the title, $ROE = a \cdot (word) + b \cdot (word)^2$, the ROE-maximizing number of words is $-\frac{a}{2b} = \frac{0.004857}{2 \times 0.000801} \approx 3.03$. Job titles should therefore generally not exceed 3-4 words.
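
The optimum follows from setting the derivative of the quadratic to zero: $a + 2b \cdot (word) = 0$, which gives $word = -\frac{a}{2b}$. This is a maximum only when $b < 0$. The short check below reproduces the arithmetic. The coefficient magnitudes come from the text, while the signs ($a > 0$, $b < 0$) are inferred from the formula; the function name is hypothetical.

```python
def optimal_word_count(a, b):
    """Maximizer of a*w + b*w**2; requires b < 0 so the parabola is concave."""
    if b >= 0:
        raise ValueError("b must be negative for an interior maximum")
    return -a / (2 * b)

# Magnitudes from the text; signs inferred (a > 0, b < 0).
a, b = 0.004857, -0.000801
w_star = optimal_word_count(a, b)
print(round(w_star, 2))  # 3.03

def predicted_effect(words):
    """Contribution of title length to predicted ROE under the quadratic fit."""
    return a * words + b * words ** 2

# Titles of 3 words score higher than titles of 1 or 6 words.
print(predicted_effect(1), predicted_effect(3), predicted_effect(6))
```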

Second, companies that include location information in their job titles tend to perform better, as the corresponding ridge regression coefficient is positive (~0.018). The causal relationship, however, is not clear. Most likely there is a confounding variable: companies that are performing well tend to be the ones expanding, and as they expand into new areas they tend to put the new location into their job titles.
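
How the location indicator was built is not described here. One simple possibility, shown purely as an assumption, is to flag titles that contain a comma followed by a two-letter U.S. state abbreviation. The `mentions_state` helper and the pattern are hypothetical, and the heuristic will miss titles that name only a city or spell out the state.

```python
import re

# Postal abbreviations for the 50 states plus DC.
STATE_CODES = set("""
AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO
MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
""".split())

# A comma, optional whitespace, then exactly two uppercase letters as a whole word.
PATTERN = re.compile(r",\s*([A-Z]{2})\b")

def mentions_state(title):
    """Heuristic: True if the title contains ', XX' where XX is a state code."""
    return any(m.group(1) in STATE_CODES for m in PATTERN.finditer(title))

titles = [
    "Macy's Brookwood Village, Birmingham, AL: Huma...",
    "VP, Regional Field Marketing",
    "Property Manager",
    "Director, IT Operations",
]
flags = [mentions_state(t) for t in titles]
print(flags)
```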

***

### 6. Future Work

This project attempts to shed some light on how vacancy postings relate to a company's profitability. There is, however, much to be done:
* This project only uses a subset of the full data. Future work could incorporate all ads (more than 100 million) across time and occupations.
* An open-source *Word2Vec* model trained on job descriptions could be useful to HR departments.
* Further work could explore the relationship between profitability and whether a company puts location information into its job titles.