# Salary prediction by vacancy description

## Dataset description

The dataset represents the data about vacancies which were published in the world net for different countries. The vacancy info has a full description of this vacancy, title, location, company, working category, salary etc.
In this assignment you have to predict the possibility of raising the salary threshold, using the vacancy description. The data is presented in the dataframe. The columns of interest are:
* FullDescription - description of vacancy
* SalaryNormalized - predicted salary threshold.

Dataset description

There are steps which are required to successfully complete the assignment:
1. Read dataset
2. Perform text transformation by removing punctuation terms and stop words.
3. Generate n-grams.	
4. Count TF * IDF features
5. Fit model for generated features.


## Reading dataset

Init pyspark session

In [3]:
from __future__ import division, print_function, unicode_literals # For the compatibility with Python 2

In [4]:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder\
                            .enableHiveSupport()\
                            .appName("spark sql")\
                            .master("local[4]")\
                            .getOrCreate()

In [5]:
sc=spark_session.sparkContext

In [6]:
!ls /data/vacancie


test.csv  train.csv


In [7]:
train_data = spark_session.read.csv("/data/vacancie/",inferSchema=True,header=True)

train_data.printSchema()


root
 |-- _c0: integer (nullable = true)
 |-- Id: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- FullDescription: string (nullable = true)
 |-- LocationRaw: string (nullable = true)
 |-- LocationNormalized: string (nullable = true)
 |-- ContractType: string (nullable = true)
 |-- ContractTime: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- SalaryRaw: string (nullable = true)
 |-- SalaryNormalized: integer (nullable = true)
 |-- SourceName: string (nullable = true)



In [8]:
dataset = train_data.select('FullDescription','SalaryNormalized')
dataset.show()

+--------------------+----------------+
|     FullDescription|SalaryNormalized|
+--------------------+----------------+
|Stress Engineer G...|               0|
|Mathematical Mode...|               0|
|Engineering Syste...|               0|
|Pioneer, Miser  E...|               0|
|Engineering Syste...|               0|
|This is an except...|               0|
|A subsea engineer...|               1|
|Are you a success...|               0|
|PROJECT ENGINEER ...|               1|
|Senior Fatigue St...|               1|
|A well respected ...|               0|
|Our client are a ...|               0|
|A leading Subsea ...|               1|
|A popular hotel l...|               0|
| HOTEL AND CONFER...|               0|
|Senior Control an...|               1|
|Control and Instr...|               1|
|Senior Process En...|               1|
|CHEF DE PARTIE PO...|               0|
|Senior Sous Chef ...|               0|
+--------------------+----------------+
only showing top 20 rows



## Transforming dataset

Remove redundant punctuation signs using RegexTokenizer with pattern <code>"\\\\s+|,|\\\\*|/|\\\\."</code>. This pattern removes whitespaces, commas, dots and other characters.

In [9]:
from pyspark.ml.feature import RegexTokenizer, Tokenizer

In [10]:
regexTokenizer = RegexTokenizer(pattern="\\s+|,|\\*|/|\\.", inputCol="FullDescription", outputCol="FullDescriptionTokenized")
regex_data = regexTokenizer.transform(dataset)

print(regex_data.show(3))


+--------------------+----------------+------------------------+
|     FullDescription|SalaryNormalized|FullDescriptionTokenized|
+--------------------+----------------+------------------------+
|Stress Engineer G...|               0|    [stress, engineer...|
|Mathematical Mode...|               0|    [mathematical, mo...|
|Engineering Syste...|               0|    [engineering, sys...|
+--------------------+----------------+------------------------+
only showing top 3 rows

None


Remove English stop words using StopWordsRemover

In [11]:
from pyspark.ml.feature import StopWordsRemover

In [12]:
remover = StopWordsRemover(inputCol="FullDescriptionTokenized", outputCol="FullDescriptionFiltered")
filtered_data = remover.transform(regex_data)

In [13]:
from pyspark.ml.feature import NGram


ngram = NGram(n=2, inputCol="FullDescriptionFiltered", outputCol="2grams")

ngramDataFrame = ngram.transform(filtered_data)
ngramDataFrame.select("2grams").show(3)

ngram = NGram(n=3, inputCol="FullDescriptionFiltered", outputCol="3grams")

ngramDataFrame = ngram.transform(ngramDataFrame)
ngramDataFrame.select("3grams").show(3)

ngramDataFrame.show(3)

+--------------------+
|              2grams|
+--------------------+
|[stress engineer,...|
|[mathematical mod...|
|[engineering syst...|
+--------------------+
only showing top 3 rows

+--------------------+
|              3grams|
+--------------------+
|[stress engineer ...|
|[mathematical mod...|
|[engineering syst...|
+--------------------+
only showing top 3 rows

+--------------------+----------------+------------------------+-----------------------+--------------------+--------------------+
|     FullDescription|SalaryNormalized|FullDescriptionTokenized|FullDescriptionFiltered|              2grams|              3grams|
+--------------------+----------------+------------------------+-----------------------+--------------------+--------------------+
|Stress Engineer G...|               0|    [stress, engineer...|   [stress, engineer...|[stress engineer,...|[stress engineer ...|
|Mathematical Mode...|               0|    [mathematical, mo...|   [mathematical, mo...|[mathematical mo

Generate n-grams with $n = 2$, $n = 3$ (module pyspark.ml.feature). After that you can perform some experiments with concatenating of column datasets (e.g. words and 3-grams or words with 2-grams and 3-grams). You can use the function in the cell below to concatenate lists.

In [14]:
from itertools import chain
from pyspark.sql.functions import col, udf
from pyspark.sql.types import *

def concat(type):
    def concat_(*args):
        return list(chain(*args))
    return udf(concat_, ArrayType(type))

concat_string_arrays = concat(StringType())

In [15]:
ngram = ngramDataFrame.select(concat_string_arrays(col("2grams"),col("3grams")).alias("features"), "SalaryNormalized")

ngram.show(10)

+--------------------+----------------+
|            features|SalaryNormalized|
+--------------------+----------------+
|[stress engineer,...|               0|
|[mathematical mod...|               0|
|[engineering syst...|               0|
|[pioneer miser, m...|               0|
|[engineering syst...|               0|
|[exceptional oppo...|               0|
|[subsea engineeri...|               1|
|[successful resul...|               0|
|[project engineer...|               1|
|[senior fatigue, ...|               1|
+--------------------+----------------+
only showing top 10 rows



## Counting TF-IDF features

Use hashing trick and IDF to count features of train dataset. The appropriate features number for the dataset is about 2000. You can experiment with varying the number of features.


<b>Note.</b> Remember to save IDF model in order to apply it to the test dataset.

In [16]:
from pyspark.ml.feature import HashingTF, IDF, CountVectorizer

In [64]:
# Apply Hashing TF

hashingTF = HashingTF()

hashingTF = HashingTF(inputCol="features", outputCol="rawFeatures", numFeatures=400)
featurizedData = hashingTF.transform(ngram)


In [65]:
featurizedData  = featurizedData.select("rawFeatures", "SalaryNormalized")
featurizedData.show(10)

+--------------------+----------------+
|         rawFeatures|SalaryNormalized|
+--------------------+----------------+
|(400,[3,4,6,7,11,...|               1|
|(400,[3,6,9,12,17...|               1|
|(400,[1,10,16,17,...|               0|
|(400,[2,10,12,13,...|               0|
|(400,[1,7,14,16,2...|               1|
|(400,[0,2,3,4,6,9...|               1|
|(400,[0,1,4,10,13...|               0|
|(400,[1,2,3,5,9,1...|               0|
|(400,[1,3,4,5,6,8...|               1|
|(400,[0,1,2,3,4,5...|               0|
+--------------------+----------------+
only showing top 10 rows



In [66]:
# Transform data with the IDF model

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

In [67]:
rescaledData = rescaledData.select(rescaledData.SalaryNormalized.alias("labels"), "features")

# Fitting model

Split the dataset to train and validation part (it is better to use 90% for the train part and 10% for the validation part)

In [68]:
train_data, test_data = rescaledData.randomSplit([0.9, 0.1])

Fit the Logistic Regression to the model on the splitted train part. Use about 15 iterations for the training process.

<b>Hint.</b> Use regularization parameter in order to prevent overfitting.

In [69]:
from pyspark.ml.classification import LogisticRegression

In [70]:
lr = LogisticRegression(featuresCol='features',labelCol='labels')

model = lr.fit(train_data)


Print the loss function for each iteration. What can you notice from the behaviour of loss function?

<b>Hint.</b> Use summary.objectiveHistory for this case.

In [71]:
trainingSummary = model.summary

trainingSummary.predictions.show(5)

+------+--------------------+--------------------+--------------------+----------+
|labels|            features|       rawPrediction|         probability|prediction|
+------+--------------------+--------------------+--------------------+----------+
|   0.0|(400,[0,1,2,3,4,5...|[-1.3020942812736...|[0.21381276384264...|       1.0|
|   0.0|(400,[0,1,2,3,4,5...|[-1.9345402186152...|[0.12624889764235...|       1.0|
|   0.0|(400,[0,1,2,3,4,5...|[-0.0779602543740...|[0.48051980180497...|       1.0|
|   0.0|(400,[0,1,2,3,4,5...|[0.96634800360846...|[0.72439097931260...|       0.0|
|   0.0|(400,[0,1,2,3,4,5...|[0.91479487133508...|[0.71398033879609...|       0.0|
+------+--------------------+--------------------+--------------------+----------+
only showing top 5 rows



Apply the model to the validation set

In [72]:
test_results = model.evaluate(test_data)

test_results.predictions.show(5)

+------+--------------------+--------------------+--------------------+----------+
|labels|            features|       rawPrediction|         probability|prediction|
+------+--------------------+--------------------+--------------------+----------+
|     0|(400,[0,1,2,3,4,5...|[-0.7818075866801...|[0.31393044056371...|       1.0|
|     0|(400,[0,1,2,3,4,5...|[-0.1445073275075...|[0.46393590493266...|       1.0|
|     0|(400,[0,1,2,3,4,5...|[0.86611700775817...|[0.70393708623075...|       0.0|
|     0|(400,[0,1,2,3,4,5...|[-0.9014196299617...|[0.28875885037011...|       1.0|
|     0|(400,[0,1,2,3,4,5...|[0.62730115244500...|[0.65187725512390...|       0.0|
+------+--------------------+--------------------+--------------------+----------+
only showing top 5 rows



In [73]:
model.summary.objectiveHistory

[0.6927536511049203,
 0.691332843856734,
 0.6906553500872187,
 0.6896795432297079,
 0.6865647069577447,
 0.6800505984184066,
 0.6674675316069572,
 0.6522705972539948,
 0.6426301234840726,
 0.6404475468547574,
 0.6401993871264021,
 0.6400408063473533,
 0.639693774409343,
 0.6389731722208523,
 0.637731206551552,
 0.6371392086968888,
 0.635856429727614,
 0.6357584855198875,
 0.6357076989626469,
 0.6356138794759065,
 0.635336738935395,
 0.6351580612111113,
 0.6350410210005988,
 0.6350080574849847,
 0.6349639728371852,
 0.6349284791445033,
 0.6348704224118138,
 0.6347874068662419,
 0.6347583778647272,
 0.634704543474558,
 0.6346942011585923,
 0.6346864378022077,
 0.6346748326989902,
 0.6346621482583705,
 0.6346569260888605,
 0.6346497683876172,
 0.634649255021077,
 0.6346488246559323,
 0.6346481327206955,
 0.6346476019175346,
 0.6346475819913922,
 0.6346471995709722,
 0.6346470460722691,
 0.6346469452903982,
 0.6346468715448711,
 0.6346467225114841,
 0.6346466765821948,
 0.6346465500950841,

In [74]:
predictions = model.transform(test_data)

predictions.show(3)

+------+--------------------+--------------------+--------------------+----------+
|labels|            features|       rawPrediction|         probability|prediction|
+------+--------------------+--------------------+--------------------+----------+
|     0|(400,[0,1,2,3,4,5...|[-0.7818075866801...|[0.31393044056371...|       1.0|
|     0|(400,[0,1,2,3,4,5...|[-0.1445073275075...|[0.46393590493266...|       1.0|
|     0|(400,[0,1,2,3,4,5...|[0.86611700775817...|[0.70393708623075...|       0.0|
+------+--------------------+--------------------+--------------------+----------+
only showing top 3 rows



Calculate AUC-ROC for the predicted data. For this purpose, you can use BinaryClassificationEvaluator from ml.evaluation model.

In [75]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol='labels') #default=label not labels
print('Test Area Under ROC', evaluator.evaluate(predictions))

Test Area Under ROC 0.6559214041310016


<b>Self-check question:</b>

1. Try to fit and predict model using pure words. Has the result changed?



# Performing test submission

Apply the learned models to the test dataset.

<b>Note!</b> The test dataset will be changed during the test phase. Your last cell output must be the output of the AUC-ROC score.

In [76]:
test_data = spark_session.read.csv("/data/vacancie/test.csv",inferSchema=True,header=True)
test_dataset = test_data.select('FullDescription','SalaryNormalized')
regexTokenizer = RegexTokenizer(pattern="\\s+|,|\\*|/|\\.", inputCol="FullDescription", outputCol="FullDescriptionTokenized")
test_regex_data = regexTokenizer.transform(test_dataset)
remover = StopWordsRemover(inputCol="FullDescriptionTokenized", outputCol="FullDescriptionFiltered")
test_filtered_data = remover.transform(test_regex_data)




ngram = NGram(n=2, inputCol="FullDescriptionFiltered", outputCol="2grams")

ngramDataFrame = ngram.transform(test_filtered_data)

ngram = NGram(n=3, inputCol="FullDescriptionFiltered", outputCol="3grams")

ngramDataFrame = ngram.transform(ngramDataFrame)

ngram = ngramDataFrame.select(concat_string_arrays(col("2grams"),col("3grams")).alias("features"), "SalaryNormalized")

hashingTF = HashingTF()

hashingTF = HashingTF(inputCol="features", outputCol="rawFeatures", numFeatures=400)
featurizedData = hashingTF.transform(ngram)


test_filtered_data = featurizedData.select(featurizedData.rawFeatures.alias('features'), 
                                              featurizedData.SalaryNormalized.alias('labels'))

In [77]:
# Transform dataset and calculate auc-roc
test_filtered_data.show(5)

test_results_dataset = model.evaluate(test_filtered_data)


+--------------------+------+
|            features|labels|
+--------------------+------+
|(400,[3,4,6,7,11,...|     1|
|(400,[3,6,9,12,17...|     1|
|(400,[1,10,16,17,...|     0|
|(400,[2,10,12,13,...|     0|
|(400,[1,7,14,16,2...|     1|
+--------------------+------+
only showing top 5 rows



In [78]:
print(evaluator.evaluate(test_results_dataset.predictions))

0.6872518365175885
