## Linear Regression Assignment

Today in class, we will work on predicting salaries based on job advertisements.  The goal is to take a set of job postings and KNOWN job salaries (supervised learning) and predict the salary for future job postings.

This was a Kaggle competition for the startup company Adzuna.  Kaggle is a place to compete in applications of machine learning and data science.  Adzuna wanted the ability to provide a predicted salary for any job listing on their website.

This is the next assignment in the course, but we will start it together in class.  You will build a model to predict salaries based on these job listings.

The evaluation criterion will be Mean Absolute Error (MAE): we want the lowest average error in salary prediction.

In [206]:
import pandas as pd

import seaborn as sb
%matplotlib inline

### The dataset

The dataset we are using contains one job listing per row.

`SalaryNormalized` is the outcome or `y` variable: it is the known salary.  

All other fields can be used as input or `x` variables to predict it.

In [207]:
data = pd.read_csv('train-cedited.csv', sep=',')
data.head()

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk


# Modeling with sklearn

To use scikit-learn, we first need to create our design matrix X from our initial dataframe.  There are many ways to do this. Once we have that matrix X, we will use model.fit(X, y) to fit the model.

In [100]:
from sklearn.linear_model import LinearRegression

model = LinearRegression() 
# model = model.fit(X, data.SalaryNormalized) # This won't work yet, until we create X from our dataframe

## Using get_dummies to create the design matrix

We can use the Pandas function get_dummies. 

From a single column, data.Category, it creates 
multiple columns, one for each unique value in data.Category. 
Each row corresponds to a row from the original data and has exactly one column
with value 1, according to the category of that job listing. (For example, notice that the first 5 rows in data all
have Category = Engineering Jobs, and therefore the first 5 rows in the design matrix have 1 in the Engineering Jobs column.)

In [181]:
category_dummies = pd.get_dummies(data.Category)
category_dummies.head()

Unnamed: 0,Accounting & Finance Jobs,Admin Jobs,Charity & Voluntary Jobs,Consultancy Jobs,Creative & Design Jobs,Customer Services Jobs,Domestic help & Cleaning Jobs,"Energy, Oil & Gas Jobs",Engineering Jobs,HR & Recruitment Jobs,...,Other/General Jobs,"PR, Advertising & Marketing Jobs",Property Jobs,Retail Jobs,Sales Jobs,Scientific & QA Jobs,Social work Jobs,Teaching Jobs,Trade & Construction Jobs,Travel Jobs
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
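To make the one-column-per-category idea concrete, here is a minimal sketch on a tiny made-up frame. The `toy` data and the `toy_dummies` name are assumptions for illustration only and are not taken from the Adzuna file. Passing `dtype=int` is a choice made here so the output shows 0/1 on newer pandas versions, which otherwise return booleans.

```python
import pandas as pd

# Hypothetical toy data, not rows from the real dataset
toy = pd.DataFrame(
    {"Category": ["Engineering Jobs", "Teaching Jobs", "Engineering Jobs", "Sales Jobs"]}
)

# One output column per unique category value
toy_dummies = pd.get_dummies(toy["Category"], dtype=int)
print(toy_dummies)

# Each row should contain exactly one 1
print(toy_dummies.sum(axis=1).tolist())
```

The columns come out in sorted order of the category names, and summing across each row is a quick sanity check that every listing falls into exactly one category.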


## Feature Engineering using dmatrix 

Let's see another way to do the same thing.

In [213]:
from patsy import dmatrices, dmatrix
# The '+ 0' syntax throws away the intercept since we don't need that as a column
X = dmatrix("Category + 0", data=data, return_type='dataframe')

X.head()

Unnamed: 0,Category[Accounting & Finance Jobs],Category[Admin Jobs],Category[Charity & Voluntary Jobs],Category[Consultancy Jobs],Category[Creative & Design Jobs],Category[Customer Services Jobs],Category[Domestic help & Cleaning Jobs],"Category[Energy, Oil & Gas Jobs]",Category[Engineering Jobs],Category[HR & Recruitment Jobs],...,Category[Other/General Jobs],"Category[PR, Advertising & Marketing Jobs]",Category[Property Jobs],Category[Retail Jobs],Category[Sales Jobs],Category[Scientific & QA Jobs],Category[Social work Jobs],Category[Teaching Jobs],Category[Trade & Construction Jobs],Category[Travel Jobs]
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
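The effect of `+ 0` is easiest to see by building the same formula with and without it. The sketch below uses a small made-up frame (`toy` is an assumption, not the real data) and only compares the resulting column names.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical toy data with two categories
toy = pd.DataFrame({"Category": ["Admin Jobs", "Sales Jobs", "Admin Jobs"]})

# Default formula: patsy adds an Intercept column and drops one category level
with_intercept = dmatrix("Category", data=toy, return_type="dataframe")

# '+ 0' removes the intercept, so every category keeps its own column
no_intercept = dmatrix("Category + 0", data=toy, return_type="dataframe")

print(list(with_intercept.columns))
print(list(no_intercept.columns))
```

Without the intercept, the matrix has the same shape as the `get_dummies` result: one indicator column per category.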


## Train-Test Split

We can split OUR training data to evaluate our model, our features and any parameters we have set.  We want some idea of how well we will do on the true test set, since we do not know the true salaries on that set.  The held-out part of the split stands in for the new data we expect in the future.
The default score method of LinearRegression is R^2: 
https://en.wikipedia.org/wiki/Coefficient_of_determination

In [216]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, data.SalaryNormalized)
model.fit(X_train, y_train)
model.score(X_test, y_test) # This is the R^2 score

0.13120634168938916
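As a rough illustration of what the R^2 score measures, the sketch below recomputes it by hand as 1 - SS_res / SS_tot on synthetic data and compares it with `model.score`. The data, the coefficients and all `*_toy` / `*_hold` names are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=2.0, size=200)

# Fit on the first 150 rows, hold out the last 50
toy_model = LinearRegression().fit(X_toy[:150], y_toy[:150])
X_hold, y_hold = X_toy[150:], y_toy[150:]

pred = toy_model.predict(X_hold)
ss_res = np.sum((y_hold - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_hold - y_hold.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, toy_model.score(X_hold, y_hold))
```

An R^2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the held-out targets.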

## Evaluating using MAE

As mentioned, our evaluation criterion is MAE (Mean Absolute Error): we want the lowest average error in salary prediction.

In [217]:
from sklearn.metrics import mean_absolute_error

predicted_salaries = model.predict(X_test)
print (mean_absolute_error(y_test, predicted_salaries))

10922.9524
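To see what this number means, here is a small sketch that computes MAE by hand and checks it against scikit-learn. The true salaries are the first four `SalaryNormalized` values shown earlier; the predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([25000, 30000, 30000, 27500])  # first four salaries from data.head()
y_pred = np.array([27000, 26000, 31000, 27500])  # hypothetical predictions

# MAE is the average of the absolute differences: (2000 + 4000 + 1000 + 0) / 4
mae_manual = np.mean(np.abs(y_true - y_pred))
print(mae_manual)                            # 1750.0
print(mean_absolute_error(y_true, y_pred))   # same value from scikit-learn
```

Because MAE is in the same units as the target, an MAE of about 10,900 means our predictions are off by roughly that much salary on average.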


## Cross Validation

In [201]:
# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer 
# cv=3 matches the three folds in the output below (newer scikit-learn versions default to 5)
print ("R-squared:", cross_val_score(model, X, data.SalaryNormalized, cv=3, scoring = 'r2'))
print ("Mean absolute error", cross_val_score(model, X, data.SalaryNormalized, cv=3, scoring=make_scorer(mean_absolute_error)))
print ("Mean absolute error average: ", cross_val_score(model, X, data.SalaryNormalized, cv=3, scoring=make_scorer(mean_absolute_error)).mean())

R-squared: [ 0.16976911  0.07820618  0.19462668]
Mean absolute error [  9945.80854771  10189.04253533  10966.73806919]
Mean absolute error average:  10367.1963793
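One detail worth knowing: scikit-learn also ships a built-in scorer string, `'neg_mean_absolute_error'`, which returns the MAE with its sign flipped so that higher is always better. The sketch below compares it with the `make_scorer` approach used here, on synthetic data with an explicit 3-fold split. The data and the `mae_pos` / `mae_neg` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption, not the salary dataset)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(90, 2))
y_toy = X_toy @ np.array([1.5, -1.0]) + rng.normal(size=90)

# Fixed folds so that both calls score exactly the same splits
folds = KFold(n_splits=3, shuffle=True, random_state=0)

mae_pos = cross_val_score(LinearRegression(), X_toy, y_toy, cv=folds,
                          scoring=make_scorer(mean_absolute_error))
mae_neg = cross_val_score(LinearRegression(), X_toy, y_toy, cv=folds,
                          scoring="neg_mean_absolute_error")

print(mae_pos)   # positive MAE per fold
print(mae_neg)   # the same values, negated
```

Either form works for reporting; the negated form matters mainly when the score is handed to a tool that maximizes it, such as a grid search.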


## Adding text features

In [199]:
# Making your own dummy columns
from scipy.sparse import hstack

# Flag a listing with 1 if its lower-cased description contains the keyword(s), else 0
data['isManager'] = data.FullDescription.str.lower().map(lambda x: 1 if 'manager' in x else 0)
data['isAssistant'] = data.FullDescription.str.lower().map(lambda x: 1 if 'assistant' in x else 0)
data['isExecutive'] = data.FullDescription.str.lower().map(lambda x: 1 if 'exec' in x or 'ceo' in x or 'president' in x else 0)

# Put the three keyword flags side by side with the category dummies (hstack returns a sparse matrix)
X = hstack((data[['isManager', 'isAssistant', 'isExecutive']], category_dummies))
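The same kind of keyword flag can also be sketched with pandas' vectorized `str.contains`, which handles case-insensitivity and missing descriptions in one call. This is an alternative to the lambdas above, not the method the notebook uses; `keyword_flag` is a hypothetical helper and `toy` is made-up data.

```python
import pandas as pd

# Hypothetical toy descriptions, including a missing one
toy = pd.DataFrame({"FullDescription": [
    "Senior Project Manager, London",
    "Teaching assistant needed",
    "Assistant Manager for retail store",
    "Chief Executive wanted",
    None,
]})

def keyword_flag(text, pattern):
    """Return 1 where the text matches the regex pattern (ignoring case), else 0."""
    return text.str.contains(pattern, case=False, regex=True, na=False).astype(int)

toy["isManager"] = keyword_flag(toy["FullDescription"], "manager")
toy["isAssistant"] = keyword_flag(toy["FullDescription"], "assistant")
toy["isExecutive"] = keyword_flag(toy["FullDescription"], "exec|ceo|president")

print(toy[["isManager", "isAssistant", "isExecutive"]])
```

With `na=False`, a missing description simply gets 0 for every flag instead of raising an error.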


In [200]:
print ("Mean absolute error: ", cross_val_score(model, X, data.SalaryNormalized, scoring=make_scorer(mean_absolute_error)).mean())

Mean absolute error mean:  10367.1963674


## --> Our score has improved! 

In [202]:
data['ContractType'] = data.ContractType.fillna('NA')
X = dmatrix("Category + isManager + isAssistant + isExecutive + ContractType", data=data)
print ("Mean absolute error: ", cross_val_score(model, X, data.SalaryNormalized, scoring=make_scorer(mean_absolute_error)).mean())

Mean absolute error:  10363.4525473


## Let's add another feature

In [205]:
QuantJob = ['analyst','analytical', 'analysis', 'math', 'quant', 'model', 'science', 'scientific', 'simulation', 'simulate', 'engineer']
def Quantitative(desc):
    for x in QuantJob:
        if x.lower() in str(desc).lower():
            return 1
    return 0
data['QuantJob'] = data['FullDescription'].map(Quantitative)
X = dmatrix("Category + isManager + isAssistant + isExecutive + ContractType + QuantJob + 0", data=data)
print ("Mean absolute error: ", cross_val_score(model, X, data.SalaryNormalized, scoring=make_scorer(mean_absolute_error)).mean())

Mean absolute error:  10275.3755651


## Your Turn!

Think about how you can improve our model score. Use your domain knowledge to come up with new 
features, then test your new model and see if the MAE has improved.