# Salary Predictions Based on Job Descriptions

### Overall Problem and Motivation

The goal is to build a model that predicts the salary for a job posting from the details of that posting. Accurate salary predictions are valuable because they help a company:
* budget for new talent efficiently
* free up employees' time for more important projects


### The Data

The data will be read in from three CSV files:
1. __train_features.csv__: holds the feature data for each posting that will be used for training the model
2. __train_salaries.csv__: holds the target values for each posting that will be used for training the model
3. __test_features.csv__: holds the feature data for the postings that will be used for my model's final predictions

The features used to predict salary are job type, degree, major, industry, years of experience, and miles from metropolis. Each row is identified by a unique key named *jobId*. The target salaries are given in thousands of dollars and are also keyed by *jobId*.


### Evaluation Metric

I will evaluate model accuracy using the mean squared error (MSE) between the predicted values and the actual target values. Since the target values for the test data are not given, I will withhold a random portion of the training data at the very beginning of the project and evaluate my model on it.
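
One way to set aside that holdout data is sketched below. The helper name `split_holdout`, the 20% holdout fraction, and the random seed are illustrative assumptions, not values fixed by this project.

```python
import pandas as pd


def split_holdout(df, holdout_fraction=0.2, seed=42):
    """Randomly withhold a fraction of the rows for final evaluation.

    Assumes df has a unique index; the fraction and seed are illustrative.
    """
    holdout = df.sample(frac=holdout_fraction, random_state=seed)
    train = df.drop(index=holdout.index)
    return train, holdout


# Tiny made-up frame standing in for the merged training data.
demo = pd.DataFrame({
    "jobId": [f"JOB{i}" for i in range(10)],
    "salary": [90, 100, 110, 120, 130, 95, 105, 115, 125, 135],
})
train_part, holdout_part = split_holdout(demo)
print(len(train_part), len(holdout_part))
```

Fixing the seed keeps the split reproducible, so every later experiment is scored against the same withheld rows.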

MSE is appropriate for regression models because it penalizes larger errors, both overestimates and underestimates, more severely than absolute error does.
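
A small illustration of that penalty, using made-up salaries in thousands of dollars: both prediction sets below have the same mean absolute error, but the set with a single large miss has a much higher MSE. The function names `mse` and `mae` are hypothetical helpers written for this example.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only, not taken from the project data.
actual = [100, 100, 100, 100]
many_small_errors = [110, 90, 110, 90]   # every prediction is off by 10
one_large_error = [100, 100, 100, 140]   # a single prediction is off by 40

print(mae(actual, many_small_errors), mae(actual, one_large_error))  # 10.0 10.0
print(mse(actual, many_small_errors), mse(actual, one_large_error))  # 100.0 400.0
```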

### Output

The final output will be a CSV file named *test_salaries.csv*. The format of this CSV will mirror the format of *train_salaries.csv* with the first column representing the *jobId* and the second column being the corresponding predicted salary in thousands of dollars.
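
A minimal sketch of writing that file with pandas follows. The header name `salary` is an assumption about how *train_salaries.csv* labels its second column, and `write_predictions` is a hypothetical helper; the example writes to an in-memory buffer, but a path such as `test_salaries.csv` can be passed instead.

```python
import io

import pandas as pd


def write_predictions(job_ids, predicted_salaries, destination):
    """Write jobId / predicted salary pairs as a two-column CSV."""
    out = pd.DataFrame({
        "jobId": list(job_ids),
        "salary": list(predicted_salaries),  # assumed header name
    })
    out.to_csv(destination, index=False)


# Made-up identifiers and predictions, for illustration only.
buffer = io.StringIO()
write_predictions(["JOB1", "JOB2"], [112.5, 98.0], buffer)
print(buffer.getvalue())
```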

## Part 2 - DISCOVER

### ---- 2 Load the data ----

In [3]:
#load the data into a Pandas dataframe
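
A possible loading step is sketched here, assuming the feature and salary files share the *jobId* key described above. The column names other than `jobId` (`industry`, `yearsExperience`, `salary`) and the helper `load_training_data` are assumptions for illustration; the demo reads from in-memory text so it runs without the real files.

```python
import io

import pandas as pd


def load_training_data(features_source, salaries_source):
    """Read the feature and target CSVs and join them on jobId."""
    features = pd.read_csv(features_source)
    salaries = pd.read_csv(salaries_source)
    # An inner join keeps only postings present in both files.
    return features.merge(salaries, on="jobId", how="inner")


# Stand-ins for train_features.csv and train_salaries.csv (made-up rows).
features_csv = io.StringIO(
    "jobId,industry,yearsExperience\nJOB1,HEALTH,10\nJOB2,WEB,3\n"
)
salaries_csv = io.StringIO("jobId,salary\nJOB1,130\nJOB2,101\n")

train_df = load_training_data(features_csv, salaries_csv)
print(train_df)
```

With the real data, the two arguments would be the paths to `train_features.csv` and `train_salaries.csv`.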

### ---- 3 Clean the data ----

In [2]:
#look for duplicate data, invalid data (e.g. salaries <=0), or corrupt data and remove it
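
The cleaning rules above could look roughly like this. The sketch assumes the merged frame has `jobId` and `salary` columns (the latter name is an assumption) and treats a repeated `jobId` as a duplicate; `clean_training_data` is a hypothetical helper.

```python
import pandas as pd


def clean_training_data(df):
    """Drop repeated postings and rows with non-positive or missing salaries."""
    cleaned = df.drop_duplicates(subset="jobId", keep="first")
    # NaN > 0 evaluates to False, so missing salaries are removed as well.
    cleaned = cleaned[cleaned["salary"] > 0]
    return cleaned.reset_index(drop=True)


# Made-up rows: one duplicate jobId and one zero salary.
demo = pd.DataFrame({
    "jobId": ["JOB1", "JOB1", "JOB2", "JOB3"],
    "salary": [100, 100, 0, 85],
})
cleaned = clean_training_data(demo)
print(cleaned)
```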

### ---- 4 Explore the data (EDA) ----

In [3]:
#summarize each feature variable
#summarize the target variable
#look for correlation between each feature and the target
#look for correlation between features
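
The four exploration steps can be sketched with a few pandas calls. The column names (`yearsExperience`, `milesFromMetropolis`, `industry`, `salary`) are assumed spellings of the features described earlier, and `explore` is a hypothetical helper.

```python
import pandas as pd


def explore(df, numeric_cols, categorical_cols, target_col="salary"):
    """Collect basic summaries and correlations for EDA."""
    return {
        # Summary of each feature variable (counts, means, top categories).
        "feature_summary": df[numeric_cols + categorical_cols].describe(include="all"),
        # Summary of the target variable.
        "target_summary": df[target_col].describe(),
        # Pairwise correlations among numeric features and the target.
        "numeric_correlations": df[numeric_cols + [target_col]].corr(),
        # Mean target per category, a simple view of categorical influence.
        "mean_target_by_category": {
            col: df.groupby(col)[target_col].mean() for col in categorical_cols
        },
    }


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "yearsExperience": [1, 5, 10, 20],
    "milesFromMetropolis": [80, 50, 30, 5],
    "industry": ["WEB", "WEB", "HEALTH", "HEALTH"],
    "salary": [60, 90, 120, 180],
})
report = explore(demo, ["yearsExperience", "milesFromMetropolis"], ["industry"])
print(report["numeric_correlations"])
print(report["mean_target_by_category"]["industry"])
```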

### ---- 5 Establish a baseline ----

In [5]:
#select a reasonable metric (MSE in this case)
#create an extremely simple model and measure its efficacy
#e.g. use "average salary" for each industry as your model and then measure MSE
#during 5-fold cross-validation
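
A rough sketch of that baseline: predict each posting's salary as the average salary of its industry, computed only from the training folds, and average the MSE over 5 folds. The column names `industry` and `salary` and the helper `baseline_cv_mse` are assumptions; the fold split is done by hand with NumPy rather than with a library splitter.

```python
import numpy as np
import pandas as pd


def baseline_cv_mse(df, group_col="industry", target_col="salary", n_folds=5, seed=0):
    """Cross-validated MSE of a 'group average' model (assumes a unique index)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(df)), n_folds)
    fold_mse = []
    for fold in folds:
        valid = df.iloc[fold]
        train = df.drop(index=df.index[fold])
        group_means = train.groupby(group_col)[target_col].mean()
        # Groups unseen in the training folds fall back to the overall mean.
        preds = valid[group_col].map(group_means).fillna(train[target_col].mean())
        fold_mse.append(float(((valid[target_col] - preds) ** 2).mean()))
    return float(np.mean(fold_mse))


# Toy data where salary is constant within each industry, so the
# group-average model is exact and the error is zero.
demo = pd.DataFrame({
    "industry": ["HEALTH"] * 5 + ["WEB"] * 5,
    "salary": [100] * 5 + [120] * 5,
})
print(baseline_cv_mse(demo))  # 0.0
```

On real data the result is the number that later models need to beat.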

### ---- 6 Hypothesize solution ----

In [None]:
#brainstorm 3 models that you think may improve results over the baseline model based
#on your EDA

Brainstorm 3 models that you think may improve results over the baseline model based on your EDA and explain why they're reasonable solutions here.

Also write down any new features that you think you should try adding to the model based on your EDA, e.g. interaction variables, summary statistics for each group, etc.

## Part 3 - DEVELOP

You will cycle through creating features, tuning models, and training/validating models (steps 7-9) until you've reached your efficacy goal.

#### Your metric will be MSE and your goal is:
 - <360 for entry-level data science roles
 - <320 for senior data science roles

### ---- 7 Engineer features  ----

In [None]:
#make sure that data is ready for modeling
#create any new features needed to potentially enhance model
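
As one example of a new feature, the sketch below attaches per-group summary statistics of the target to each row. The statistics are computed from the training rows only and then joined onto whichever frame needs them. The column names and the helper `add_group_stats` are assumptions for illustration.

```python
import pandas as pd


def add_group_stats(train_df, target_df, group_cols, target_col="salary"):
    """Join per-group salary statistics from train_df onto target_df."""
    stats = (
        train_df.groupby(group_cols)[target_col]
        .agg(group_mean="mean", group_median="median",
             group_min="min", group_max="max")
        .reset_index()
    )
    # A left join keeps every row of target_df; unseen groups get NaN stats.
    return target_df.merge(stats, on=group_cols, how="left")


# Made-up rows for illustration only.
train = pd.DataFrame({
    "industry": ["WEB", "WEB", "HEALTH"],
    "salary": [80, 100, 120],
})
test = pd.DataFrame({"jobId": ["JOB9"], "industry": ["WEB"]})
enriched = add_group_stats(train, test, ["industry"])
print(enriched)
```

When these statistics are added to the same rows they were computed from, they carry information about each row's own target, so cross-validation scores may look better than they should; computing them inside each training fold is a safer variant.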

### ---- 8 Create models ----

In [15]:
#create and tune the models that you brainstormed during part 2

### ---- 9 Test models ----

In [1]:
#do 5-fold cross validation on models and measure MSE
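
A sketch of the cross-validation step using scikit-learn follows. The two candidate models and the synthetic data are placeholders chosen for illustration, not the models this project settled on; `cv_mse` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def cv_mse(model, X, y, n_folds=5, seed=0):
    """Mean MSE over shuffled k-fold cross-validation."""
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # scikit-learn reports negated MSE so that higher is always better.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    return float(-scores.mean())


# Synthetic stand-in for the engineered feature matrix and salaries.
rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(200, 2))
y = 60 + 2.0 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 5, size=200)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
}
results = {name: cv_mse(model, X, y) for name, model in candidates.items()}
best_name = min(results, key=results.get)
print(results, best_name)
```

Picking the entry with the lowest cross-validated MSE is one straightforward way to choose the model to carry forward.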

### ---- 10 Select best model  ----

In [None]:
#select the model with the lowest error as your "production" model

## Part 4 - DEPLOY

### ---- 11 Automate pipeline ----

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
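
One possible shape for such a script, assuming scikit-learn and joblib are available. The feature lists, the linear model, and the file name `salary_model.joblib` are placeholders; the real script would use the selected model and the full feature set.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

CATEGORICAL = ["industry"]        # assumed column names
NUMERIC = ["yearsExperience"]


def build_pipeline():
    """One-hot encode categoricals, pass numerics through, then fit a model."""
    encode = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",
    )
    return Pipeline([("encode", encode), ("model", LinearRegression())])


def train_and_save(train_df, model_path):
    """Fit on the entire training set and persist the fitted pipeline."""
    pipeline = build_pipeline()
    pipeline.fit(train_df[CATEGORICAL + NUMERIC], train_df["salary"])
    joblib.dump(pipeline, str(model_path))


def score(test_df, model_path):
    """Load the saved pipeline and predict salaries for the test postings."""
    pipeline = joblib.load(str(model_path))
    preds = pipeline.predict(test_df[CATEGORICAL + NUMERIC])
    return pd.DataFrame({"jobId": test_df["jobId"], "salary": preds})


# Made-up rows standing in for the real training and test data.
train_df = pd.DataFrame({
    "industry": ["WEB", "WEB", "HEALTH", "HEALTH"],
    "yearsExperience": [2, 10, 3, 12],
    "salary": [80, 120, 95, 140],
})
test_df = pd.DataFrame({
    "jobId": ["JOB7", "JOB8"],
    "industry": ["WEB", "HEALTH"],
    "yearsExperience": [5, 8],
})

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "salary_model.joblib"
    train_and_save(train_df, model_path)
    predictions = score(test_df, model_path)
print(predictions)
```

Keeping the encoder and the model in a single pipeline means the saved file applies the same preprocessing at scoring time as it did during training.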

### ---- 12 Deploy solution ----

In [16]:
#save your predictions to a CSV file or optionally save them as a table in a SQL database
#additionally, save a visualization and summary of your predictions and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders

### ---- 13 Measure efficacy ----

We'll skip this step since we don't have the outcomes for the test data.