# Predicting Salary Using Job Description Text and Location Data
In this notebook, I show how to predict salary using only the job description text and the job location.
I used a wide and deep network, built with Keras.

The description text is often left out of model building because it is a challenge to handle. With neural network frameworks like TensorFlow and Keras, it is becoming much less of a challenge.

I decided to test how well the description text does on its own with just one other column. The results are exciting. I expect that adding the other features to the model will give better results. I will work on a part two and check.
Data source: Kaggle.

Credit to Sara Robinson of Google Cloud for her post on predicting the price of wine with Keras.

In [34]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.preprocessing import LabelEncoder

from tensorflow import keras
layers = keras.layers
%reload_ext signature

In [2]:
data = pd.read_csv('Train_rev1.csv')

In [3]:
data.columns

Index(['Id', 'Title', 'FullDescription', 'LocationRaw', 'LocationNormalized',
       'ContractType', 'ContractTime', 'Company', 'Category', 'SalaryRaw',
       'SalaryNormalized', 'SourceName'],
      dtype='object')

In [4]:
data.head()

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk


In [5]:
pd.set_option('display.max_rows', 2000)
local=data['LocationNormalized'].value_counts()
local

UK                                   41093
London                               30522
South East London                    11713
The City                              6678
Manchester                            3516
Leeds                                 3401
Birmingham                            3061
Central London                        2607
West Midlands                         2540
Surrey                                2397
Reading                               2187
Bristol                               2085
Nottingham                            1873
Sheffield                             1766
Aberdeen                              1634
Hampshire                             1557
Belfast                               1537
East Sheen                            1531
Milton Keynes                         1523
Berkshire                             1502
Oxford                                1497
Newcastle Upon Tyne                   1390
Liverpool                             1341
Kent       

In [9]:
data_desc_sal=data[['FullDescription','SalaryNormalized','LocationNormalized']]

In [11]:
data_desc_sal.head()

Unnamed: 0,FullDescription,SalaryNormalized,LocationNormalized
0,Engineering Systems Analyst Dorking Surrey Sal...,25000,Dorking
1,Stress Engineer Glasgow Salary **** to **** We...,30000,Glasgow
2,Mathematical Modeller / Simulation Analyst / O...,30000,Hampshire
3,Engineering Systems Analyst / Mathematical Mod...,27500,Surrey
4,"Pioneer, Miser Engineering Systems Analyst Do...",25000,Surrey


In [12]:
data_desc_sal.shape

(244768, 3)

In [14]:
# Shuffle the data
data_desc_sal = data_desc_sal.sample(frac=1)

# Print the first 5 rows
data_desc_sal.head()

Unnamed: 0,FullDescription,SalaryNormalized,LocationNormalized
24919,"Sales, Marketing & Customer Services Represent...",31200,Birmingham
154954,A fantastic opportunity has arisen for a Part ...,40000,London
206237,"Our client, a large national contractor have r...",42500,UK
219399,A Housing Association based in East London is ...,37440,East London
107444,This is a very exciting opportunity… We are lo...,13440,Slough


In [15]:
location_threshold = 50  # Locations with this many postings or fewer are removed.
value_counts = data_desc_sal['LocationNormalized'].value_counts()
to_remove = value_counts[value_counts <= location_threshold].index
# Filter on the location column only, so matching strings in other columns are left untouched
data_desc_sal = data_desc_sal[~data_desc_sal['LocationNormalized'].isin(to_remove)]
data_desc_sal = data_desc_sal[pd.notnull(data_desc_sal['LocationNormalized'])]
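
The filtering step above keeps only locations that occur more than `location_threshold` times. As a rough illustration of the same idea without pandas, here is a minimal sketch on toy data; the helper name `drop_rare` and the sample rows are hypothetical and not part of the notebook.

```python
from collections import Counter

def drop_rare(rows, key, threshold):
    """Keep rows whose `key` value occurs more than `threshold` times."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] > threshold]

# Toy rows, made up for illustration (not taken from the dataset)
rows = [
    {"loc": "London", "salary": 40000},
    {"loc": "London", "salary": 35000},
    {"loc": "London", "salary": 50000},
    {"loc": "Dorking", "salary": 25000},
]
kept = drop_rare(rows, key="loc", threshold=1)
print([r["loc"] for r in kept])
```

Dropping rare locations keeps the one-hot location vector small and avoids categories with too few examples to learn from.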

In [16]:
# Split data into train and test
train_size = int(len(data_desc_sal) * .8)
print("Train size: %d" % train_size)
print("Test size: %d" % (len(data_desc_sal) - train_size))

Train size: 181910
Test size: 45478


In [17]:
# Train features
description_train = data_desc_sal['FullDescription'][:train_size]
location_train = data_desc_sal['LocationNormalized'][:train_size]

# Train labels
labels_train = data_desc_sal['SalaryNormalized'][:train_size]

# Test features
description_test = data_desc_sal['FullDescription'][train_size:]
location_test = data_desc_sal['LocationNormalized'][train_size:]

# Test labels
labels_test = data_desc_sal['SalaryNormalized'][train_size:]
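
The split above is purely positional: the first 80% of the already shuffled rows become the training set and the rest the test set. A minimal sketch of that slicing, assuming a plain Python list; the helper `split_at` is a hypothetical name used only here.

```python
def split_at(items, train_fraction=0.8):
    """Split a sequence positionally into (train, test) at the given fraction."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

# Toy data: ten items standing in for shuffled rows
train, test = split_at(list(range(10)))
print(len(train), len(test))
```

Because the cut is positional, the split is only random if the rows were shuffled beforehand, as they are earlier in this notebook.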

In [18]:
# Create a tokenizer to preprocess our text descriptions
vocab_size = 12000 # This is a hyperparameter, experiment with different values for your dataset
tokenize = keras.preprocessing.text.Tokenizer(num_words=vocab_size, char_level=False)
tokenize.fit_on_texts(description_train) # only fit on train

In [19]:
# Wide feature 1: sparse bag of words (bow) vocab_size vector 
description_bow_train = tokenize.texts_to_matrix(description_train)
description_bow_test = tokenize.texts_to_matrix(description_test)
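
To make the bag-of-words feature concrete, here is a simplified pure-Python sketch of what a binary word-presence matrix looks like. It only approximates the Keras `Tokenizer` (no punctuation filtering, whitespace splitting only); the helper names `fit_vocab` and `to_binary_matrix` and the toy texts are assumptions for illustration.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank words by frequency; index 0 stays unused and only indices < num_words are kept."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(ranked, start=1) if i < num_words}

def to_binary_matrix(texts, vocab, num_words):
    """One row per text; column i is 1 if the word with index i occurs in the text."""
    matrix = []
    for t in texts:
        row = [0] * num_words
        for w in t.lower().split():
            idx = vocab.get(w)
            if idx is not None:  # words outside the vocabulary are ignored
                row[idx] = 1
        matrix.append(row)
    return matrix

# Toy texts, made up for illustration
train_texts = ["senior java developer", "java developer london", "nurse london"]
vocab = fit_vocab(train_texts, num_words=4)
bow = to_binary_matrix(["java developer london senior"], vocab, num_words=4)
print(bow)
```

Each description becomes a fixed-length vector regardless of its length, which is what lets the wide model feed it straight into a dense layer.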

In [20]:
# Wide feature 2: one-hot vector of location categories

# Use sklearn utility to convert label strings to numbered index
encoder = LabelEncoder()
encoder.fit(location_train)
location_train = encoder.transform(location_train)
location_test = encoder.transform(location_test)
num_classes = np.max(location_train) + 1

# Convert labels to one hot
location_train = keras.utils.to_categorical(location_train, num_classes)
location_test = keras.utils.to_categorical(location_test, num_classes)
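
The two steps above (string label to integer index, then index to one-hot vector) can be sketched in plain Python. This is an illustration on toy data, not the sklearn or Keras implementation; `fit_label_index` and `one_hot` are hypothetical helper names.

```python
def fit_label_index(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def one_hot(indices, num_classes):
    """Turn each integer index into a vector with a single 1.0."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in indices]

# Toy locations, made up for illustration
locations = ["London", "Leeds", "London", "Surrey"]
index = fit_label_index(locations)
encoded = [index[loc] for loc in locations]
vectors = one_hot(encoded, num_classes=len(index))
print(encoded)
print(vectors[0])
```

Note that a location seen only at prediction time has no entry in the mapping, so it would need to be handled before encoding.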

In [21]:
# Define our wide model with the functional API
bow_inputs = layers.Input(shape=(vocab_size,))
location_inputs = layers.Input(shape=(num_classes,))
merged_layer = layers.concatenate([bow_inputs, location_inputs])
merged_layer = layers.Dense(256, activation='relu')(merged_layer)
predictions = layers.Dense(1)(merged_layer)
wide_model = keras.Model(inputs=[bow_inputs, location_inputs], outputs=predictions)

In [22]:
# Accuracy is a classification metric; mean absolute error is more informative for a salary regression
wide_model.compile(loss='mse', optimizer='adam', metrics=['mae'])
print(wide_model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 12000)        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 421)          0                                            
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 12421)        0           input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
dense (Dense)                   (None, 256)          3180032     concatenate[0][0]                
__________

In [23]:
# Deep model feature: word embeddings of job descriptions
train_embed = tokenize.texts_to_sequences(description_train)
test_embed = tokenize.texts_to_sequences(description_test)

max_seq_length = 250
train_embed = keras.preprocessing.sequence.pad_sequences(
    train_embed, maxlen=max_seq_length, padding="post")
test_embed = keras.preprocessing.sequence.pad_sequences(
    test_embed, maxlen=max_seq_length, padding="post")
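
For intuition, post-padding to a fixed length can be sketched as below. This assumes the Keras default of truncating from the front (`truncating="pre"`), which the call above does not override; `pad_post` is a hypothetical helper, not a Keras function.

```python
def pad_post(sequences, maxlen):
    """Pad with zeros at the end; over-long sequences keep their last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # drop tokens from the front if too long
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

# Toy token-id sequences, made up for illustration
print(pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

With these settings a description longer than `max_seq_length` tokens loses its opening words rather than its closing ones.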

In [24]:
# Define our deep model with the Functional API
deep_inputs = layers.Input(shape=(max_seq_length,))
embedding = layers.Embedding(vocab_size, 8, input_length=max_seq_length)(deep_inputs)
embedding = layers.Flatten()(embedding)
embed_out = layers.Dense(1)(embedding)
deep_model = keras.Model(inputs=deep_inputs, outputs=embed_out)
print(deep_model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 250)               0         
_________________________________________________________________
embedding (Embedding)        (None, 250, 8)            96000     
_________________________________________________________________
flatten (Flatten)            (None, 2000)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 2001      
Total params: 98,001
Trainable params: 98,001
Non-trainable params: 0
_________________________________________________________________
None
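
The parameter counts in the summaries can be checked by hand. The arithmetic below uses only the sizes shown in the notebook (vocabulary 12000, embedding width 8, sequence length 250, 421 location classes, 256 hidden units).

```python
vocab_size, embed_dim, max_seq_length = 12000, 8, 250
num_locations, hidden_units = 421, 256

# Deep model: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim
# Flatten turns (250, 8) into a single vector
flatten_width = max_seq_length * embed_dim
# Dense(1): one weight per flattened input plus one bias
deep_dense_params = flatten_width * 1 + 1
deep_total = embedding_params + deep_dense_params

# Wide model: Dense(256) over the concatenated bag-of-words and location inputs
wide_dense_params = (vocab_size + num_locations) * hidden_units + hidden_units

print(embedding_params, flatten_width, deep_dense_params, deep_total, wide_dense_params)
```

These match the 96000, 2000, 2001, and 98,001 reported for the deep model and the 3180032 reported for the wide dense layer.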


In [25]:
# Mean absolute error is reported because accuracy is not meaningful for regression
deep_model.compile(loss='mse',
                   optimizer='adam',
                   metrics=['mae'])

In [26]:
# Combine wide and deep into one model
merged_out = layers.concatenate([wide_model.output, deep_model.output])
merged_out = layers.Dense(1)(merged_out)
combined_model = keras.Model(wide_model.input + [deep_model.input], merged_out)
print(combined_model.summary())

# Mean absolute error is reported because accuracy is not meaningful for regression
combined_model.compile(loss='mse',
                       optimizer='adam',
                       metrics=['mae'])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 12000)        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 421)          0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 250)          0                                            
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 12421)        0           input_1[0][0]                    
                                                                 input_2[0][0]                    
__________
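
Since the wide and deep models each output a single number, the final `Dense(1)` over their concatenation amounts to a learned weighted sum of the two predictions plus a bias. A minimal sketch of that last step; the function name and the weight values are hypothetical, not the trained ones.

```python
def combine(wide_out, deep_out, w_wide, w_deep, bias):
    """Dense(1) with no activation applied to the concatenation [wide_out, deep_out]."""
    return w_wide * wide_out + w_deep * deep_out + bias

# Made-up sub-model outputs and weights, for illustration only
print(combine(30000.0, 26000.0, w_wide=0.5, w_deep=0.5, bias=0.0))
```

During training the weights of both sub-models and of this final layer are updated together, since they are all part of `combined_model`.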

In [30]:
# Run training
combined_model.fit([description_bow_train, location_train] + [train_embed], labels_train, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x11520a5f710>

In [31]:
# Generate predictions
predictions = combined_model.predict([description_bow_test, location_test] + [test_embed])

In [41]:
# Compare predictions with actual values for the first few items in our test dataset
num_predictions = 40
diff = 0

for i in range(num_predictions):
    val = predictions[i]
    print(description_test.iloc[i])
    print('Predicted: ', val[0], 'Actual: ', labels_test.iloc[i], '\n')
    diff += abs(val[0] - labels_test.iloc[i])

nbsp; We are looking for an experienced Coordinator to oversee and deliver a volunteer programme that nbsp;provides nbsp;isolated older people with indoor and outdoor exercise following a fall or illness. nbsp;The nbsp;Coordinator will
Predicted:  32986.312 Actual:  27007 

Are you a Technical Team Leader looking for a new challenge? Do you have an automotive background? Do you have EDS/Wiring Harness experience? If so read on as this could be the ideal role for you The client I am representing is an industry leading automotive 1st tier supplier and due to a period of sustained growth, has the requirement for a Technical Team Leader to project manage the engineering team to deliver engineering solutions and liaise with the customer for Automotive Wire Harnesses. The responsibilities of the role are; Supporting the Project Manager in technical aspects of the project such as timing, APQP, etc. Representing the project engineering team internally and to the customer. Ensuring the use of e
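
The loop above accumulates `diff` but never reports it. One way to summarize the error is the mean absolute difference between predicted and actual salaries; the sketch below shows the calculation on made-up numbers, with `mean_absolute_error` written by hand here rather than imported from a library.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over paired values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Hypothetical values, not results from this notebook
print(mean_absolute_error([30000.0, 26000.0], [27000, 28000]))
```

In the notebook, the equivalent would be dividing the accumulated `diff` by `num_predictions` after the loop.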

#### Validate

In [97]:
valid = pd.read_csv('Test_rev1.csv')

In [102]:
valid_desc=valid[['FullDescription']][:1]

In [103]:
valid_loc=valid[['LocationNormalized']][:1]

In [104]:
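# Pass the text column itself: iterating over a DataFrame yields column names, not row values
valid_bow_test = tokenize.texts_to_matrix(valid_desc['FullDescription'])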

In [105]:
# Pass a 1-D column so the encoder does not need to flatten a column vector
valid_loc = encoder.transform(valid_loc['LocationNormalized'])
valid_loc = keras.utils.to_categorical(valid_loc, num_classes)


In [106]:
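# Pass the text column itself: iterating over a DataFrame yields column names, not row values
valid_embed = tokenize.texts_to_sequences(valid_desc['FullDescription'])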
valid_embed = keras.preprocessing.sequence.pad_sequences(
    valid_embed, maxlen=max_seq_length, padding="post")

In [107]:
# Generate predictions
predictions_val = combined_model.predict([valid_bow_test, valid_loc] + [valid_embed])

In [108]:
# Print the prediction for the first validation posting.
# labels_test belongs to the hold-out split of Train_rev1.csv, so it is not compared here.
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; older versions used -1)
num_predictions = 1

for i in range(num_predictions):
    val = predictions_val[i]
    print(valid_desc.iloc[i])
    print('Predicted: ', val[0], '\n')

FullDescription    The Company: Our client is a national training provider based in Gateshead, delivering learning programmes across many regions of England. Founded in **** they have developed a firm foundation that underpins their core offer to employers and individuals that is we work with you to fully understand your training and development needs . Their expertise enables them to deliver a range of learning programmes from NVQ certificates and diplomas to short courses that are designed to upskill individuals, including English and maths. They contract with the Skills Funding Agency to provide Workplace and Classroom based learning programmes, Apprenticeships and courses for individuals who are currently seeking employment or alternative employment. The Role:  Our client is looking for an exceptional business development person who could hit the ground running and have a possible client base to bring with them.  Selling to businesses local and nationally NVQ and apprenticeship opp

In [35]:
%signature