# Salary prediction by vacancy description

## Dataset description

The dataset represents the data about vacancies which were published in the world net for different countries. The vacancy info has a full description of this vacancy, title, location, company, working category, salary etc.
In this assignment you have to predict the possibility of raising the salary threshold, using the vacancy description. The data is presented in the dataframe. The columns of interest are:
* FullDescription - description of vacancy
* SalaryNormalized - predicted salary threshold.

Dataset description

There are steps which are required to successfully complete the assignment:
1. Read dataset
2. Perform text transformation by removing punctuation terms and stop words.
3. Generate n-grams.	
4. Count TF * IDF features
5. Fit model for generated features.


## Reading dataset

Init pyspark session

In [5]:
from __future__ import division, print_function, unicode_literals # For the compatibility with Python 2

In [6]:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder\
                            .enableHiveSupport()\
                            .appName("spark sql")\
                            .master("local[4]")\
                            .getOrCreate()

In [7]:
sc=spark_session.sparkContext

In [8]:
#!ls /data/vacancie


In [9]:
#!ls /data/covertype2

In [10]:
train_data = spark_session.read.csv("/data/covertype2/train.csv",inferSchema=True,header=True)

#train_data.printSchema()


In [11]:
#dataset = train_data.select('FullDescription','SalaryNormalized')
#train_data.select('Hillshade_3pm').show()


## Transforming dataset

In [12]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer


stringIndexer = StringIndexer(inputCol="Wild_Type", outputCol="Wild_Type_Indexed")
model = stringIndexer.fit(train_data)
indexed = model.transform(train_data)


stringIndexer = StringIndexer(inputCol="Soil_Type", outputCol="Soil_Type_Indexed")
model = stringIndexer.fit(indexed)
raw_data = model.transform(indexed)


cols = list(set(raw_data.columns) - {'Wild_Type', 'Soil_Type', 'Target' })
#encoded = encoded.select(cols)


In [13]:
from pyspark.ml.feature import StandardScaler

from  pyspark.ml.feature  import VectorAssembler

# Define assembler 
assembler = VectorAssembler(
    inputCols=cols,
    outputCol='features')

# transform
vector_indexed_data = assembler.transform(raw_data)
#vector_indexed_data.printSchema()

dataset = vector_indexed_data.select('features', 'Target')

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)

# Compute summary statistics by fitting the StandardScaler
scaler_model = scaler.fit(dataset)
scaledData = scaler_model.transform(dataset)
dataset = scaledData

In [14]:
#dataset.show(5, False)


# Fitting model

Split the dataset to train and validation part (it is better to use 90% for the train part and 10% for the validation part)

In [15]:
train_data, test_data = dataset.randomSplit([0.9, 0.1])

Fit the Logistic Regression to the model on the splitted train part. Use about 15 iterations for the training process.

<b>Hint.</b> Use regularization parameter in order to prevent overfitting.

In [16]:
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

In [17]:
#lr = LogisticRegression(featuresCol='features',labelCol='labels')

lr = rf = RandomForestClassifier(labelCol="Target", featuresCol="features", numTrees=100, maxBins=40, maxDepth=7)

model = lr.fit(dataset)


In [18]:
#model.featureImportances

Print the loss function for each iteration. What can you notice from the behaviour of loss function?

<b>Hint.</b> Use summary.objectiveHistory for this case.

In [19]:
# trainingSummary = model.summary

# trainingSummary.predictions.show(5)

Apply the model to the validation set

In [20]:
# test_results = model.evaluate(test_data)

# test_results.predictions.show(5)

#test_data.show()

In [21]:
#model.summary.objectiveHistory

In [22]:
predictions = model.transform(test_data)

#predictions.show(3)

Calculate AUC-ROC for the predicted data. For this purpose, you can use BinaryClassificationEvaluator from ml.evaluation model.

In [23]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


evaluator = MulticlassClassificationEvaluator(labelCol='Target') #default=label not labels
# evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})

<b>Self-check question:</b>

1. Try to fit and predict model using pure words. Has the result changed?



# Performing test submission

Apply the learned models to the test dataset.

<b>Note!</b> The test dataset will be changed during the test phase. Your last cell output must be the output of the AUC-ROC score.

In [24]:
test_data_ = spark_session.read.csv("/data/covertype2/test.csv",inferSchema=True,header=True)


stringIndexer = StringIndexer(inputCol="Wild_Type", outputCol="Wild_Type_Indexed")
modeler = stringIndexer.fit(test_data_)
test_indexed = modeler.transform(test_data_)


stringIndexer = StringIndexer(inputCol="Soil_Type", outputCol="Soil_Type_Indexed")
modeler = stringIndexer.fit(test_indexed)
test_raw_data = modeler.transform(test_indexed)

test_cols = list(set(test_raw_data.columns) - {'Wild_Type', 'Soil_Type', 'Target'})

# Define assembler 
assembler = VectorAssembler(
    inputCols=test_cols,
    outputCol='features')

vector_indexed_data = assembler.transform(test_raw_data)
#vector_indexed_data.printSchema()
test_dataset = vector_indexed_data.select('features', 'Target')


In [25]:
# Transform dataset and calculate auc-roc
predictions_2 = model.transform(test_dataset)

In [26]:
evaluator.evaluate(predictions_2, {evaluator.metricName: "accuracy"})

0.7215819755688788