# NLP in PySpark's MLlib Project Solution

## Fake Job Posting Predictions

Indeed.com has just hired you to create a system that automatically flags suspicious job postings on its website. It has recently seen an influx of fake job postings that is negatively impacting its customer experience. Because of the high volume of job postings it receives every day, its employees don't have the capacity to check every posting, so they would like an automated system that prioritizes which postings to review before deleting them.

#### Your task
Use the attached dataset to create an NLP algorithm that automatically flags suspicious posts for review.

#### The data
This dataset contains 18K job descriptions out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs.

**Data Source:** https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction

#### My basic approach
I will use just the job description variable for this analysis (there are a few other lengthy text variables we could have chosen from) and leave the company profile for a later analysis. Something that would be cool here would be to create multiple models that all work together to provide a recommendation or a score of how likely it is that a job posting is fake.

In [1]:
# First let's create our PySpark instance
# import findspark
# findspark.init()

import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession

# May take awhile locally
spark = SparkSession.builder.appName("NLP").getOrCreate()

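# Note: this counts the entries in the executor memory status map (executors, including the driver),
# which is used here as a rough stand-in for the number of cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()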
print("You are working with", cores, "core(s)")
spark

22/10/11 21:21:53 WARN Utils: Your hostname, masoud-ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.7.139 instead (on interface wlp2s0)
22/10/11 21:21:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/11 21:21:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/11 21:21:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
You are working with 1 core(s)


**Read in dependencies**

In [2]:
from pyspark.ml.feature import (
    MinMaxScaler,
    CountVectorizer,
    StringIndexer,
    RegexTokenizer,
    StopWordsRemover,
    HashingTF,
    IDF,
    Word2Vec,
)
from pyspark.sql.functions import (
    col,
    udf,
    regexp_replace,
    isnull,
    translate,
    lower,
    countDistinct,
)
from pyspark.sql.types import StringType, IntegerType
from pyspark.ml.classification import (
    LogisticRegression,
    OneVsRest,
    MultilayerPerceptronClassifier,
    NaiveBayes,
    LinearSVC,
    RandomForestClassifier,
    DecisionTreeClassifier,
    GBTClassifier,
)
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator,
    MulticlassClassificationEvaluator,
)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# For pipeline development
from pyspark.ml import Pipeline

**And the dataset**

In [3]:
path = "Datasets/"

# CSV
postings = spark.read.csv(path + "fake_job_postings.csv", inferSchema=True, header=True)

                                                                                

**View the data for QA**

In [4]:
postings.limit(4).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0


In [5]:
# Let's read a full row of data from a fraudulent posting
postings.filter("fraudulent=1").show(1, False)
# These look good!

+------+-----------------+---------------+----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [6]:
postings.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- location: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary_range: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- benefits: string (nullable = true)
 |-- telecommuting: string (nullable = true)
 |-- has_company_logo: string (nullable = true)
 |-- has_questions: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- required_experience: string (nullable = true)
 |-- required_education: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- function: string (nullable = true)
 |-- fraudulent: string (nullable = true)



**See how many rows are in the df**

In [7]:
postings.count()

17880

## Null Values?

In [8]:
from pyspark.sql.functions import col  # explicit import; a wildcard import would shadow Python built-ins such as sum, min, and max


def null_value_calc(df):
    null_columns_counts = []
    numRows = df.count()
    for k in df.columns:
        nullRows = df.where(col(k).isNull()).count()
        if nullRows > 0:
            temp = k, nullRows, (nullRows / numRows) * 100
            null_columns_counts.append(temp)
    return null_columns_counts


null_columns_calc_list = null_value_calc(postings)
spark.createDataFrame(
    null_columns_calc_list, ["Column_Name", "Null_Values_Count", "Null_Value_Percent"]
).show()

+-------------------+-----------------+--------------------+
|        Column_Name|Null_Values_Count|  Null_Value_Percent|
+-------------------+-----------------+--------------------+
|           location|              346|  1.9351230425055927|
|         department|            11547|   64.58053691275167|
|       salary_range|            15011|   83.95413870246085|
|    company_profile|             3308|  18.501118568232663|
|        description|                1|0.005592841163310962|
|       requirements|             2573|  14.390380313199106|
|           benefits|             6966|   38.95973154362416|
|      telecommuting|               89| 0.49776286353467564|
|   has_company_logo|               29|  0.1621923937360179|
|      has_questions|               30| 0.16778523489932887|
|    employment_type|             3292|   18.41163310961969|
|required_experience|             6723|  37.600671140939596|
| required_education|             7748|  43.333333333333336|
|           industry|   

Quite a bit of missing data here. We had better be careful about what we drop.

In [9]:
# Let's see how much total
og_len = postings.count()
drop_len = postings.na.drop().count()
print("Total Null Rows:", og_len - drop_len)
print("Percentage Null Rows", (og_len - drop_len) / og_len)

Total Null Rows: 17094
Percentage Null Rows 0.9560402684563758


Wow, 95% is way too much. Dropping every row that has a null in any column is not an option, so we need a better approach.

In [10]:
# How about dropping nulls only in the columns we need for now?
df = postings.na.drop(subset=["fraudulent", "description"])

**Much better**

In [11]:
df.count()

17704
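
To see why restricting the drop to a subset of columns keeps so many more rows, here is a minimal plain-Python sketch (not Spark code). The toy records and the helper `drop_nulls` are made up for illustration; they only mimic the behaviour of `na.drop()` with and without `subset`.

```python
# Plain-Python illustration (not Spark): dropping rows with a null in ANY column
# removes far more rows than dropping on a chosen subset of columns.
rows = [
    {"description": "great job", "salary_range": None, "fraudulent": "0"},
    {"description": "earn cash", "salary_range": None, "fraudulent": "1"},
    {"description": None, "salary_range": "10-20", "fraudulent": "0"},
    {"description": "engineer", "salary_range": "50-60", "fraudulent": "0"},
]


def drop_nulls(records, subset=None):
    """Keep records with no None in `subset` (all columns if subset is None)."""
    kept = []
    for record in records:
        columns = subset if subset is not None else list(record)
        if all(record[c] is not None for c in columns):
            kept.append(record)
    return kept


all_columns = drop_nulls(rows)
needed_only = drop_nulls(rows, subset=["fraudulent", "description"])
print(len(all_columns), len(needed_only))
```

A sparsely populated column such as salary_range dominates the first result, while the second only loses rows that are missing the text or the label.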

In [12]:
# Quick data quality check on the dependent var....
# This should be a binary outcome (0 or 1)
df.groupBy("fraudulent").count().orderBy(col("count").desc()).show(8)

+--------------------+-----+
|          fraudulent|count|
+--------------------+-----+
|                   0|16080|
|                   1|  886|
|           Full-time|   73|
|Hospital & Health...|   55|
|   Bachelor's Degree|   53|
|         Engineering|   26|
| perform quality ...|   17|
|         Unspecified|   15|
+--------------------+-----+
only showing top 8 rows



We can see from the query above that we have some invalid data in the label (fraudulent) column. Let's delete those.

In [13]:
df = df.filter("fraudulent IN('0','1')")
# QA again
df.groupBy("fraudulent").count().show(truncate=False)

+----------+-----+
|fraudulent|count|
+----------+-----+
|0         |16080|
|1         |886  |
+----------+-----+
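
One plausible cause of the stray text values in the label column (an assumption, not verified against this file) is that some text fields contain embedded newlines or commas, so a reader that treats every physical line as a record shifts the columns. The sketch below uses Python's standard `csv` module on a made-up three-line sample to show the effect. If this is the cause, the `multiLine` and `escape` options of `spark.read.csv` may be worth trying instead of filtering the rows out.

```python
import csv
import io

# Hypothetical sample: the description of job 1 contains a newline and a comma.
raw = (
    "job_id,description,fraudulent\n"
    '1,"Great role.\nFull-time, remote",0\n'
    "2,Simple posting,1\n"
)

# Quote-aware parsing keeps the embedded newline and comma inside one field.
proper_rows = list(csv.reader(io.StringIO(raw)))

# Naive parsing treats every physical line as a record and every comma as a delimiter.
naive_rows = [line.split(",") for line in raw.splitlines()]

# Last column of every data row, i.e. what would end up in the label column.
proper_labels = [row[-1] for row in proper_rows[1:]]
naive_labels = [row[-1] for row in naive_rows[1:]]

print(proper_labels)
print(naive_labels)
```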



### Balance the signal

The other thing I want to do is resample the dataframe so that I get a better signal from the data, since there are not many fraudulent cases. I mentioned class imbalance earlier on, but we haven't come across a good example yet. This is a good one: the ratio between the fraudulent cases and the real cases is extremely unbalanced, so undersampling the non-fraudulent cases will help with that.

Luckily, Spark has a cool built-in function for this called sampleBy.

In [14]:
df = df.sampleBy("fraudulent", fractions={"0": 0.4, "1": 1.0}, seed=10)
# QA again
df.groupBy("fraudulent").count().show(truncate=False)

+----------+-----+
|fraudulent|count|
+----------+-----+
|0         |6323 |
|1         |886  |
+----------+-----+



That's better!
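
As a rough mental model of stratified undersampling, the plain-Python sketch below keeps each row with a probability that depends on its label. The helper `sample_by` and the toy data are hypothetical and only approximate the behaviour of Spark's `sampleBy`. Because every row is kept or dropped independently, the sampled count for a stratum is close to, but usually not exactly, fraction times stratum size, which is consistent with getting 6323 rows rather than exactly 40% of 16080.

```python
import random


def sample_by(records, key, fractions, seed):
    """Keep each record with the probability given for its stratum.

    Strata that are missing from `fractions` are treated as fraction 0.
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fractions.get(r[key], 0.0)]


# Toy data: 1000 real postings and 50 fraudulent ones.
records = [{"fraudulent": "0"}] * 1000 + [{"fraudulent": "1"}] * 50
sampled = sample_by(records, "fraudulent", {"0": 0.4, "1": 1.0}, seed=10)

kept_real = sum(1 for r in sampled if r["fraudulent"] == "0")
kept_fake = sum(1 for r in sampled if r["fraudulent"] == "1")
print(kept_real, kept_fake)
```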

### Encode the label column

In [15]:
# Let's go ahead and encode it too
indexer = StringIndexer(inputCol="fraudulent", outputCol="label")
df = indexer.fit(df).transform(df)
df.limit(6).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,label
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0,0.0
1,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0,0.0
2,6,Accounting Clerk,"US, MD,",,,,Job OverviewApex is an environmental consultin...,,,0,0,0,,,,,,0,0.0
3,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"Novitex Enterprise Solutions, formerly Pitney ...",The Customer Service Associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,,0,1,0,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,0,0.0
4,12,Talent Sourcer (6 months fixed-term contract),"GB, LND, London",HR,,Want to build a 21st century financial service...,TransferWise is the clever new way to move mon...,We’re looking for someone who:Proven track rec...,You will join one of Europe’s most hotly tippe...,0,1,0,,,,,,0,0.0
5,13,"Applications Developer, Digital","US, CT, Stamford",,,"Novitex Enterprise Solutions, formerly Pitney ...","The Applications Developer, Digital will devel...",Requirements:4 – 5 years’ experience in develo...,,0,1,0,Full-time,Associate,Bachelor's Degree,Management Consulting,Information Technology,0,0.0
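
The label values above follow from how StringIndexer orders categories: by default the most frequent value gets index 0.0. Below is a minimal plain-Python sketch of that ordering. The helper `fit_string_index` is hypothetical, and the alphabetical tie-break is an assumption about the default behaviour.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct value to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


mapping = fit_string_index(["0"] * 5 + ["1"] * 2)
print(mapping)
```

Since real postings outnumber fraudulent ones, "0" maps to 0.0 and "1" maps to 1.0, so the encoded label keeps its original meaning here. On a dataset where fraud was the majority class, the mapping would flip.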


In [16]:
# Let's check the quality of the description var
df.select("description").show(1, False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|description                                                                                

Looks pretty standard.

## Clean the datasets

In [17]:
# Removing anything that is not a letter
df = df.withColumn("description", regexp_replace(df["description"], "[^A-Za-z ]+", ""))
# Remove multiple spaces
df = df.withColumn("description", regexp_replace(df["description"], " +", " "))
# Lower case everything
df = df.withColumn("description", lower(df["description"]))

In [18]:
df.limit(5).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,label
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...",food a fastgrowing james beard awardwinning on...,Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0,0.0
1,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,job title itemization review managerlocation f...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0,0.0
2,6,Accounting Clerk,"US, MD,",,,,job overviewapex is an environmental consultin...,,,0,0,0,,,,,,0,0.0
3,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"Novitex Enterprise Solutions, formerly Pitney ...",the customer service associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,,0,1,0,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,0,0.0
4,12,Talent Sourcer (6 months fixed-term contract),"GB, LND, London",HR,,Want to build a 21st century financial service...,transferwise is the clever new way to move mon...,We’re looking for someone who:Proven track rec...,You will join one of Europe’s most hotly tippe...,0,1,0,,,,,,0,0.0
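
The three cleaning steps can be tried out on a single string with Python's `re` module. This is a minimal sketch with a made-up sample sentence and a hypothetical helper `clean_text`. It uses the same patterns as the cell above, and it also shows a side effect visible in the output: removing hyphens and digits outright merges or shortens words (for example "fast-growing" becomes "fastgrowing").

```python
import re


def clean_text(text):
    """Apply the same three steps: strip non-letters, collapse spaces, lowercase."""
    text = re.sub(r"[^A-Za-z ]+", "", text)  # remove anything that is not a letter or space
    text = re.sub(r" +", " ", text)          # collapse runs of spaces into one
    return text.lower()


sample = "Food52, a fast-growing   site!"
cleaned = clean_text(sample)
print(cleaned)
```

If keeping word boundaries matters, replacing non-letters with a space instead of an empty string is one alternative to consider.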


## Split text into words (Tokenizing)

In [19]:
regex_tokenizer = RegexTokenizer(
    inputCol="description", outputCol="words", pattern="\\W"
)
df = regex_tokenizer.transform(df)

df.limit(5).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,label,words
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...",food a fastgrowing james beard awardwinning on...,Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0,0.0,"[food, a, fastgrowing, james, beard, awardwinn..."
1,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,job title itemization review managerlocation f...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0,0.0,"[job, title, itemization, review, managerlocat..."
2,6,Accounting Clerk,"US, MD,",,,,job overviewapex is an environmental consultin...,,,0,0,0,,,,,,0,0.0,"[job, overviewapex, is, an, environmental, con..."
3,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"Novitex Enterprise Solutions, formerly Pitney ...",the customer service associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,,0,1,0,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,0,0.0,"[the, customer, service, associate, will, be, ..."
4,12,Talent Sourcer (6 months fixed-term contract),"GB, LND, London",HR,,Want to build a 21st century financial service...,transferwise is the clever new way to move mon...,We’re looking for someone who:Proven track rec...,You will join one of Europe’s most hotly tippe...,0,1,0,,,,,,0,0.0,"[transferwise, is, the, clever, new, way, to, ..."
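
As a rough equivalent of what the tokenizer does with the pattern `\W`, the sketch below splits on non-word characters and drops empty tokens. The helper `regex_tokenize` is hypothetical. Its lowercasing and minimum token length of 1 are meant to mirror what I understand to be the RegexTokenizer defaults.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Split lowercased text on the pattern and drop tokens that are too short."""
    tokens = re.split(pattern, text.lower())
    return [t for t in tokens if len(t) >= min_token_length]


tokens = regex_tokenize("job title  itemization review")
print(tokens)
```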


## Removing Stopwords

In [20]:
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="words", outputCol="filtered")
feature_data = remover.transform(df)

feature_data.limit(5).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,...,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,label,words,filtered
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...",food a fastgrowing james beard awardwinning on...,Experience with content management systems a m...,,0,...,0,Other,Internship,,,Marketing,0,0.0,"[food, a, fastgrowing, james, beard, awardwinn...","[food, fastgrowing, james, beard, awardwinning..."
1,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,job title itemization review managerlocation f...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,...,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0,0.0,"[job, title, itemization, review, managerlocat...","[job, title, itemization, review, managerlocat..."
2,6,Accounting Clerk,"US, MD,",,,,job overviewapex is an environmental consultin...,,,0,...,0,,,,,,0,0.0,"[job, overviewapex, is, an, environmental, con...","[job, overviewapex, environmental, consulting,..."
3,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"Novitex Enterprise Solutions, formerly Pitney ...",the customer service associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,,0,...,0,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,0,0.0,"[the, customer, service, associate, will, be, ...","[customer, service, associate, based, phoenix,..."
4,12,Talent Sourcer (6 months fixed-term contract),"GB, LND, London",HR,,Want to build a 21st century financial service...,transferwise is the clever new way to move mon...,We’re looking for someone who:Proven track rec...,You will join one of Europe’s most hotly tippe...,0,...,0,,,,,,0,0.0,"[transferwise, is, the, clever, new, way, to, ...","[transferwise, clever, new, way, move, money, ..."
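
Stop-word removal is a simple filter over the token list. The sketch below uses a tiny hypothetical stop list, far shorter than the English list that StopWordsRemover ships with, and a hypothetical helper `remove_stop_words`. It reproduces the last row of the output above.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "to", "will", "be"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


words = ["transferwise", "is", "the", "clever", "new", "way", "to", "move", "money"]
filtered = remove_stop_words(words)
print(filtered)
```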


## Converting text into vectors

We test out the following three vectorization methods:

1. Count Vectors
2. TF-IDF
3. Word2Vec
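
To make the difference between a count vector and a hashed term-frequency vector concrete, here is a minimal plain-Python sketch. Both helpers are hypothetical, and `zlib.crc32` is only a stand-in hash chosen for reproducibility; it is not the hash function Spark uses. The point is that the hashed version needs no vocabulary, at the price of different words possibly landing in the same bucket when the number of buckets is small.

```python
import zlib
from collections import Counter


def count_vector(tokens, vocabulary):
    """Term counts laid out in the order of a known vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def hashing_tf(tokens, num_features):
    """Term counts laid out by hashing each token into a fixed number of buckets."""
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features  # stand-in hash
        vector[index] += 1
    return vector


tokens = ["job", "title", "review", "job", "manager"]
cv = count_vector(tokens, vocabulary=["job", "manager", "review", "title"])
htf = hashing_tf(tokens, num_features=20)
print(cv)
print(htf)
```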

In [21]:
# Count Vector (CountVectorizer and HashingTF both produce term counts; CountVectorizer learns a vocabulary, while HashingTF hashes terms into a fixed number of buckets)
# cv = CountVectorizer(inputCol="filtered", outputCol="features")
# model = cv.fit(feature_data)
# countVectorizer_features = model.transform(feature_data)

# Hashing TF (numFeatures=20 is a very small hash space, so many different words will share a bucket)
hashingTF = HashingTF(inputCol="filtered", outputCol="rawfeatures", numFeatures=20)
HTFfeaturizedData = hashingTF.transform(feature_data)

# TF-IDF
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(HTFfeaturizedData)
TFIDFfeaturizedData = idfModel.transform(HTFfeaturizedData)
TFIDFfeaturizedData.name = "TFIDFfeaturizedData"

# rename the HTF features to features to be consistent
HTFfeaturizedData = HTFfeaturizedData.withColumnRenamed("rawfeatures", "features")
HTFfeaturizedData.name = "HTFfeaturizedData"  # We will use later for printing
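
The IDF step reweights each term-frequency column by how rare it is across documents. The sketch below is a plain-Python illustration using the smoothed formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. To my knowledge this matches the formula in the MLlib documentation, but treat it as an assumption. The helper names and the toy vectors are made up.

```python
import math


def idf_weights(tf_vectors):
    """One weight per feature column: log((m + 1) / (df + 1))."""
    num_docs = len(tf_vectors)
    num_features = len(tf_vectors[0])
    weights = []
    for j in range(num_features):
        doc_freq = sum(1 for vector in tf_vectors if vector[j] > 0)
        weights.append(math.log((num_docs + 1) / (doc_freq + 1)))
    return weights


def tf_idf(tf_vectors):
    """Scale every term frequency by the IDF weight of its column."""
    weights = idf_weights(tf_vectors)
    return [[tf * w for tf, w in zip(vector, weights)] for vector in tf_vectors]


docs = [[1, 0, 2], [1, 1, 0]]  # toy term-frequency vectors for two documents
weights = idf_weights(docs)
weighted = tf_idf(docs)
print(weights)
print(weighted)
```

A column that is non-zero in every document gets weight 0, so it no longer helps separate the classes. With only 20 hash buckets this can happen to most columns.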

                                                                                

In [22]:
# Word2Vec
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="filtered", outputCol="features")
model = word2Vec.fit(feature_data)

W2VfeaturizedData = model.transform(feature_data)
# W2VfeaturizedData.show(1,False)

# Word2Vec features typically contain negative values, so we rescale them here so that we can use the Naive Bayes classifier
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(W2VfeaturizedData)

# Rescale each feature to the range [0, 1] (the MinMaxScaler default)
scaled_data = scalerModel.transform(W2VfeaturizedData)
W2VfeaturizedData = scaled_data.select(
    "fraudulent", "description", "label", "scaledFeatures"
)
W2VfeaturizedData = W2VfeaturizedData.withColumnRenamed("scaledFeatures", "features")

W2VfeaturizedData.name = "W2VfeaturizedData"  # We will need this to print later
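
The rescaling step exists because Naive Bayes expects non-negative feature values, while averaged word vectors can be negative. Here is a minimal plain-Python sketch of column-wise min-max scaling. The helper `min_max_scale` is hypothetical, and mapping a constant column to the midpoint of the target range is an assumption about how such columns are handled.

```python
def min_max_scale(vectors, new_min=0.0, new_max=1.0):
    """Rescale every column of `vectors` linearly into [new_min, new_max]."""
    columns = list(zip(*vectors))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for vector in vectors:
        row = []
        for value, low, high in zip(vector, lows, highs):
            if high == low:
                # Constant column: use the midpoint of the target range.
                row.append(0.5 * (new_max + new_min))
            else:
                row.append((value - low) / (high - low) * (new_max - new_min) + new_min)
        scaled.append(row)
    return scaled


vectors = [[-1.0, 2.0], [0.0, 2.0], [1.0, 2.0]]
scaled = min_max_scale(vectors)
print(scaled)
```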

                                                                                

## Train and Evaluate your model

From here on out, it is straight-up classification, so we can go ahead and use our trusty function!

In [23]:
def ClassTrainEval(classifier, features, classes, train, test):
    def FindMtype(classifier):
        # Instantiate Model
        M = classifier
        # Learn what it is
        Mtype = type(M).__name__

        return Mtype

    Mtype = FindMtype(classifier)

    def IntanceFitModel(Mtype, classifier, classes, features, train):

        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
            #             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
            # Cross Validator requires the following parameters:
            crossval = CrossValidator(
                estimator=OVRclassifier,
                estimatorParamMaps=paramGrid,
                evaluator=MulticlassClassificationEvaluator(),
                numFolds=2,
            )  # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count + 1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(
                maxIter=100, layers=layers, blockSize=128, seed=1234
            )
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if (
            Mtype in ("LinearSVC", "GBTClassifier") and classes != 2
        ):  # These classifiers currently only accept binary classification
            print(
                Mtype,
                " could not be used because PySpark currently only accepts binary classification data for this algorithm",
            )
            return
        if Mtype in (
            "LogisticRegression",
            "NaiveBayes",
            "RandomForestClassifier",
            "GBTClassifier",
            "LinearSVC",
            "DecisionTreeClassifier",
        ):

            # Add parameters of your choice here:
            if Mtype == "LogisticRegression":
                paramGrid = (
                    ParamGridBuilder()  #                              .addGrid(classifier.regParam, [0.1, 0.01]) \
                    .addGrid(classifier.maxIter, [10, 15, 20])
                    .build()
                )

            # Add parameters of your choice here:
            if Mtype == "NaiveBayes":
                paramGrid = (
                    ParamGridBuilder()
                    .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6])
                    .build()
                )

            # Add parameters of your choice here:
            if Mtype == "RandomForestClassifier":
                paramGrid = (
                    ParamGridBuilder().addGrid(classifier.maxDepth, [2, 5, 10])
                    #                                .addGrid(classifier.maxBins, [5, 10, 20])
                    #                                .addGrid(classifier.numTrees, [5, 20, 50])
                    .build()
                )

            # Add parameters of your choice here:
            if Mtype == "GBTClassifier":
                paramGrid = (
                    ParamGridBuilder()  #                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                    #                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                    .addGrid(classifier.maxIter, [10, 15, 50, 100]).build()
                )

            # Add parameters of your choice here:
            if Mtype == "LinearSVC":
                paramGrid = (
                    ParamGridBuilder()
                    .addGrid(classifier.maxIter, [10, 15])
                    .addGrid(classifier.regParam, [0.1, 0.01])
                    .build()
                )

            # Add parameters of your choice here:
            if Mtype == "DecisionTreeClassifier":
                paramGrid = (
                    ParamGridBuilder()  #                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                    .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
                    .build()
                )

            # Cross Validator requires all of the following parameters:
            crossval = CrossValidator(
                estimator=classifier,
                estimatorParamMaps=paramGrid,
                evaluator=MulticlassClassificationEvaluator(),
                numFolds=2,
            )  # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel

    fitModel = IntanceFitModel(Mtype, classifier, classes, features, train)

    # Print feature selection metrics
    if fitModel is not None:

        if Mtype == "OneVsRest":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print("\033[1m" + Mtype + "\033[0m")
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print(
                    "\033[1m" + "Intercept: " + "\033[0m",
                    model.intercept,
                    "\033[1m" + "\nCoefficients:" + "\033[0m",
                    model.coefficients,
                )

        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print("\033[1m" + Mtype, " Weights" + "\033[0m")
            print("\033[1m" + "Model Weights: " + "\033[0m", fitModel.weights.size)
            print("")

        if Mtype in (
            "DecisionTreeClassifier",
            "GBTClassifier",
            "RandomForestClassifier",
        ):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature’s importance is the average of its importance across all trees
            # in the ensemble. The importance vector is normalized to sum to 1.
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print("\033[1m" + Mtype, " Feature Importances" + "\033[0m")
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            print(BestModel.featureImportances)

            if Mtype == "DecisionTreeClassifier":
                global DT_featureimportances
                DT_featureimportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype == "GBTClassifier":
                global GBT_featureimportances
                GBT_featureimportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype == "RandomForestClassifier":
                global RF_featureimportances
                RF_featureimportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        if Mtype == "LogisticRegression":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print("\033[1m" + Mtype, " Coefficient Matrix" + "\033[0m")
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficientMatrix))
            print("Intercept: " + str(BestModel.interceptVector))
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        if Mtype == "LinearSVC":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print("\033[1m" + Mtype, " Coefficients" + "\033[0m")
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficients))
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel

    # Set the column names to match the external results dataframe that we will join with later:
    columns = ["Classifier", "Result"]

    if Mtype in ("LinearSVC", "GBTClassifier") and classes != 2:
        Mtype = [Mtype]  # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype, score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(
            metricName="accuracy"
        )  # predictionCol="prediction" is the default
        accuracy = (MC_evaluator.evaluate(predictions)) * 100
        Mtype = [Mtype]  # make this a list
        score = [str(accuracy)]  # make this a string and convert to a list
        result = spark.createDataFrame(zip(Mtype, score), schema=columns)
        result = result.withColumn("Result", result.Result.substr(0, 5))

    return result
    # Note: coefficients and feature importances are printed and stored in globals above;
    # only the results dataframe is returned
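
The function above leans on two ideas: a parameter grid (every combination of the listed values) and k-fold cross-validation (each combination is trained and scored once per fold). The plain-Python sketch below illustrates both with hypothetical helpers. The round-robin fold assignment is a simplification, since a real cross-validator assigns rows to folds at random.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def k_fold_indices(num_rows, num_folds):
    """Return (training_indices, validation_indices) pairs, one per fold."""
    folds = [list(range(i, num_rows, num_folds)) for i in range(num_folds)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


param_grid = build_param_grid({"maxIter": [10, 15], "regParam": [0.1, 0.01]})
splits = k_fold_indices(num_rows=6, num_folds=2)

# Number of model fits during the search (before any final refit on all training rows).
num_fits = len(param_grid) * len(splits)
print(param_grid)
print(splits)
print(num_fits)
```

This is why larger grids and more folds get expensive quickly: the search cost grows with the product of the two.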

Read in all dependencies and declare the algorithms you want to test

In [24]:
# from pyspark.ml.classification import *
# from pyspark.ml.evaluation import *
# from pyspark.sql import functions
# from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
    LogisticRegression(),
    OneVsRest(),
    LinearSVC(),
    NaiveBayes(),
    RandomForestClassifier(),
    GBTClassifier(),
    DecisionTreeClassifier(),
    MultilayerPerceptronClassifier(),
]

featureDF_list = [HTFfeaturizedData, TFIDFfeaturizedData, W2VfeaturizedData]
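
The Naive Bayes caveat above exists because Spark's `NaiveBayes` requires non-negative feature values. Here is a minimal, Spark-free sketch of how that check could be automated before building the classifier list. The helper names and the toy rows are hypothetical; in the notebook you would run the same test against the `features` column instead.

```python
def has_negative_values(rows):
    """Return True if any feature value in any row is below zero."""
    return any(value < 0 for row in rows for value in row)


def select_classifier_names(candidate_names, rows):
    """Drop NaiveBayes from the candidates when the features contain negatives."""
    if has_negative_values(rows):
        return [name for name in candidate_names if name != "NaiveBayes"]
    return list(candidate_names)


# Toy stand-ins (assumed values, not taken from the dataset)
count_like_rows = [[0.0, 2.0, 1.0], [3.0, 0.0, 0.0]]        # term counts are never negative
embedding_like_rows = [[-0.54, 3.39, 1.79], [0.60, -1.08, -0.91]]  # raw embeddings can be

names = ["LogisticRegression", "NaiveBayes", "RandomForestClassifier"]
print(select_classifier_names(names, count_like_rows))
print(select_classifier_names(names, embedding_like_rows))
```

An alternative to dropping the classifier is to rescale the features into a non-negative range first, which keeps Naive Bayes in the comparison.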

Loop through all feature types (hashingTF, TFIDF and Word2Vec)

In [26]:
for featureDF in featureDF_list:
    print("\033[1m" + featureDF.name, " Results:" + "\033[0m")
    train, test = featureDF.randomSplit([0.7, 0.3], seed=11)
    features = featureDF.select(["features"]).collect()
    # Count the distinct labels so ClassTrainEval can tell a binary problem from a multiclass one
    class_count = featureDF.select(countDistinct("label")).collect()
    classes = class_count[0][0]

    # set up your results table
    columns = ["Classifier", "Result"]
    vals = [("Place Holder", "N/A")]
    results = spark.createDataFrame(vals, columns)

    for classifier in classifiers:
        new_result = ClassTrainEval(classifier, features, classes, train, test)
        results = results.union(new_result)
    results = results.where("Classifier!='Place Holder'")
    results.show(truncate=False)

HTFfeaturizedData  Results:

LogisticRegression  Coefficient Matrix
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[-0.02084135, -0.04647101,  0.00528349,  0.04161692, -0.00721138,
              -0.00366765,  0.08073108, -0.01817296, -0.0345073 , -0.00674497,
               0.04157756, -0.06228243, -0.06560296, -0.01242676, -0.03218051,
               0.0797224 ,  0.00129737, -0.02077709,  0.05564467, -0.00402408]])
Intercept: [-1.7538739056298203]

OneVsRest
Intercept: 1.7673949621780012
Coefficients: [0.013699596005148407,0.0375539432086722,-0.004625932185069328,-0.03147622716814407,0.005101493780399831,0.005126544026804918,-0.06399745961961556,0.015410993391970385,0.02556532705435464,0.005447313986072354,-0.030030348653883737,0.04912626924776846,0.04841364056815083,0.008049224973723871,0.02762072912874003,-0.06299882274501249,0.0009178213668294183,0.016836208362601326,-0.043130006238764104,0.0026849622195413763]
Intercept: -1.7673949622168357
Coefficients: [-0.013699595992360465,-0.03755394320475597,0.0046259321809378525,0.031476227176455485,-0.005101493749404688,-0.0051265440358126075,0.06399745959483165,-0.015410993395815663,-0.02556532703387564,-0.005447314001058085,0.030030348640315615,-0.04912626924659134,-0.048413640537167096,-0.008049224952406928,-0.02762072913550087,0.06299882269925611,-0.000917821356736577,-0.01683620837750009,0.04313000623940599,-0.0026849622130867396]

LinearSVC  Coefficients
You should compares these relative to eachother
Coefficients: 
[-0.0009364279131075579,-0.0021881971164234224,-0.0004688209052867408,0.0021319078724465227,-0.0007381186382697452,0.0,0.006153152018785885,-0.001889439882337969,-0.0022515257801326762,-0.0006627715779285126,0.00028564788208232833,-0.002339193136430156,-0.004857343957189951,-0.0009169301540094105,-0.0011674810984431288,0.004782500017219599,-0.00023310913804272139,-0.0013679016541575377,0.0026932745010846987,-0.0008042571915764504]

RandomForestClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],[0.06481879017312955,0.06283168628744923,0.04191259242079413,0.04548559289154977,0.0289362809459686,0.040302929665097964,0.07641067576398769,0.04095060708570659,0.04225499727540636,0.03877017856735944,0.06343045731451313,0.04731447082823992,0.04905733418458075,0.0534958640446744,0.05926565367419835,0.05450047839523453,0.05179573358420345,0.045548233472902266,0.04894826850108662,0.04396917492391711])

GBTClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],[0.061852117682657186,0.05646359791559172,0.054984625282304016,0.054444630977027245,0.05155710766257806,0.04584960779191782,0.0407729632370736,0.04691462371224818,0.05839682980358466,0.03370322316309605,0.059457128899880786,0.0408023574834652,0.058872672513295275,0.06658470539856765,0.04312828366850235,0.05770503959909279,0.031165564155839424,0.054886326764443244,0.0358415129212004,0.04661708136763444])

DecisionTreeClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,5,6,7,9,10,11,12,13,17,18,19],[0.01219252435063646,0.010973271915572819,0.02608308547888155,0.06641598244412432,0.04906178458796776,0.09266576422763101,0.059419720309872026,0.0573583186594524,0.18542535864891077,0.09312826951256573,0.07997182189770888,0.05853368926386936,0.03577652008412855,0.07551653517110311,0.09747735344757529])

MultilayerPerceptronClassifier  Weights
Model Weights: 923

+------------------------------+------+
|Classifier                    |Result|
+------------------------------+------+
|LogisticRegression            |88.81 |
|OneVsRest                     |88.86 |
|LinearSVC                     |88.90 |
|NaiveBayes                    |88.44 |
|RandomForestClassifier        |91.18 |
|GBTClassifier                 |91.87 |
|DecisionTreeClassifier        |89.28 |
|MultilayerPerceptronClassifier|89.65 |
+------------------------------+------+

TFIDFfeaturizedData  Results:

22/10/11 21:49:27 WARN CacheManager: Asked to cache already cached data.
22/10/11 21:49:27 WARN CacheManager: Asked to cache already cached data.

LogisticRegression  Coefficient Matrix
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[-0.24325844, -0.43096839,  0.10227758,  0.39384314, -0.06556127,
              -0.04469701,  0.8670754 , -0.32004126, -0.3521491 , -0.05515062,
               0.63494538, -0.64669185, -1.31449645, -0.23075988, -0.26792781,
               0.96448225,  0.01336484, -0.29772685,  0.37715412, -0.02591433]])
Intercept: [-1.7538739056298203]

OneVsRest
Intercept: 1.7673949621624678
Coefficients: [0.15990052403227323,0.34827225118480437,-0.08954858378230796,-0.29787632675239395,0.04637955878357584,0.06247629360117852,-0.6873514099522913,0.27140063456290453,0.26089569125351564,0.04454027755802488,-0.4586039289574358,0.5100885733475632,0.9700714641901563,0.1494707809051943,0.2299640490791398,-0.7621602430482655,0.009454943314643886,0.24125572619651073,-0.29233094322042535,0.017290655095082826]
Intercept: -1.767394962185767
Coefficients: [-0.15990052394271265,-0.3482722511630141,0.08954858373432106,0.2978763267995855,-0.04637955861449908,-0.06247629366704679,0.6873514097925758,-0.2714006346035348,-0.26089569112811806,-0.0445402776315416,0.45860392883311496,-0.5100885733402288,-0.970071463817655,-0.14947078066768568,-0.22996404911291332,0.7621602427161115,-0.009454943252255186,-0.24125572632460335,0.29233094322303466,-0.01729065507014205]

LinearSVC  Coefficients
You should compares these relative to eachother
Coefficients: 
[-0.010929907269685402,-0.020293164196008122,-0.0090754136509975,0.020175381334414696,-0.006710508365454006,0.0,0.06608665000527114,-0.03327463519342938,-0.02297695520756354,-0.005419190108458121,0.004362228441217207,-0.024288343242931866,-0.09732733808330983,-0.017027013961906395,-0.009720188029658806,0.05785872204622981,-0.0024013754269371257,-0.019601450632379946,0.01825474985789971,-0.005179266059203886]

RandomForestClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],[0.06481879017312955,0.06283168628744923,0.04191259242079413,0.04548559289154977,0.0289362809459686,0.040302929665097964,0.07641067576398769,0.04095060708570659,0.04225499727540636,0.03877017856735944,0.06343045731451313,0.04731447082823992,0.04905733418458075,0.0534958640446744,0.05926565367419835,0.05450047839523453,0.05179573358420345,0.045548233472902266,0.04894826850108662,0.04396917492391711])

GBTClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],[0.061852117682657186,0.05646359791559172,0.054984625282304016,0.054444630977027245,0.05155710766257806,0.04584960779191782,0.0407729632370736,0.04691462371224818,0.05839682980358466,0.03370322316309605,0.059457128899880786,0.0408023574834652,0.058872672513295275,0.06658470539856765,0.04312828366850235,0.05770503959909279,0.031165564155839424,0.054886326764443244,0.0358415129212004,0.04661708136763444])

DecisionTreeClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,5,6,7,9,10,11,12,13,17,18,19],[0.01219252435063646,0.010973271915572819,0.02608308547888155,0.06641598244412432,0.04906178458796776,0.09266576422763101,0.059419720309872026,0.0573583186594524,0.18542535864891077,0.09312826951256573,0.07997182189770888,0.05853368926386936,0.03577652008412855,0.07551653517110311,0.09747735344757529])

MultilayerPerceptronClassifier  Weights
Model Weights: 923

+------------------------------+------+
|Classifier                    |Result|
+------------------------------+------+
|LogisticRegression            |88.81 |
|OneVsRest                     |88.86 |
|LinearSVC                     |88.90 |
|NaiveBayes                    |88.90 |
|RandomForestClassifier        |91.18 |
|GBTClassifier                 |91.87 |
|DecisionTreeClassifier        |89.23 |
|MultilayerPerceptronClassifier|89.37 |
+------------------------------+------+

W2VfeaturizedData  Results:

22/10/11 21:53:47 WARN BlockManager: Asked to remove block broadcast_31220_piece0, which does not exist

LogisticRegression  Coefficient Matrix
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[-0.54425166,  3.39478679,  1.79104606]])

Intercept: [-4.879653458767514]

OneVsRest
Intercept: 2.8234485040585304
Coefficients: [0.600701851435322,-1.0776240013426288,-0.9078208234514907]
Intercept: -2.8234485040585353
Coefficients: [-0.6007018514353212,1.0776240013426335,0.9078208234514938]

LinearSVC  Coefficients
You should compares these relative to eachother
Coefficients: 
[-0.0067368359233024085,0.0510651022598983,-0.003019466432914807]

RandomForestClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(3,[0,1,2],[0.33209521630208516,0.3083154451775549,0.3595893385203599])

GBTClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(3,[0,1,2],[0.3066905529065085,0.3208647981897012,0.3724446489037903])

DecisionTreeClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(3,[0,1,2],[0.3902412086615409,0.386161344912962,0.2235974464254972])

MultilayerPerceptronClassifier  Weights
Model Weights: 39

+------------------------------+------+
|Classifier                    |Result|
+------------------------------+------+
|LogisticRegression            |87.61 |
|OneVsRest                     |87.61 |
|LinearSVC                     |87.61 |
|NaiveBayes                    |87.61 |
|RandomForestClassifier        |90.62 |
|GBTClassifier                 |89.97 |
|DecisionTreeClassifier        |88.72 |
|MultilayerPerceptronClassifier|86.63 |
+------------------------------+------+

Looks like the tree ensembles on either the HTFfeaturizedData or the TFIDFfeaturizedData are our best-performing feature/classifier combos: GBTClassifier scores highest at 91.87% accuracy, with RandomForestClassifier close behind at 91.18%. Let's go with the Random Forest on the Hashing TF vector for the sake of simplicity, create our final model, and play around with the test dataframe.
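
To compare the combinations side by side, the accuracies from the three result tables can be ranked in one list. The sketch below is plain Python with the numbers copied by hand from the tables above; the `results` dictionary and the `ranked` list are illustrative names, not objects created elsewhere in this notebook.

```python
# Accuracy (%) per classifier, copied from the three result tables above
results = {
    "HTFfeaturizedData": {
        "LogisticRegression": 88.81, "OneVsRest": 88.86, "LinearSVC": 88.90,
        "NaiveBayes": 88.44, "RandomForestClassifier": 91.18,
        "GBTClassifier": 91.87, "DecisionTreeClassifier": 89.28,
        "MultilayerPerceptronClassifier": 89.65,
    },
    "TFIDFfeaturizedData": {
        "LogisticRegression": 88.81, "OneVsRest": 88.86, "LinearSVC": 88.90,
        "NaiveBayes": 88.90, "RandomForestClassifier": 91.18,
        "GBTClassifier": 91.87, "DecisionTreeClassifier": 89.23,
        "MultilayerPerceptronClassifier": 89.37,
    },
    "W2VfeaturizedData": {
        "LogisticRegression": 87.61, "OneVsRest": 87.61, "LinearSVC": 87.61,
        "NaiveBayes": 87.61, "RandomForestClassifier": 90.62,
        "GBTClassifier": 89.97, "DecisionTreeClassifier": 88.72,
        "MultilayerPerceptronClassifier": 86.63,
    },
}

# Flatten to (accuracy, feature set, classifier) and sort from best to worst
ranked = sorted(
    (
        (accuracy, feature_set, classifier)
        for feature_set, scores in results.items()
        for classifier, accuracy in scores.items()
    ),
    reverse=True,
)

for accuracy, feature_set, classifier in ranked[:5]:
    print(f"{accuracy:6.2f}  {classifier:<24} {feature_set}")
```

Accuracy alone can flatter a model on imbalanced labels, so it may be worth checking precision and recall on the fraudulent class before settling on a final choice.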

In [27]:
# Train final model
classifier = RandomForestClassifier()
featureDF = HTFfeaturizedData

train, test = featureDF.randomSplit([0.7, 0.3], seed=11)
features = featureDF.select(["features"]).collect()

# Count the distinct labels so ClassTrainEval can tell a binary problem from a multiclass one
class_count = featureDF.select(countDistinct("label")).collect()
classes = class_count[0][0]

# Running this again will generate all the objects needed to play around with the test data
ClassTrainEval(classifier, features, classes, train, test)

RandomForestClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],[0.06481879017312955,0.06283168628744923,0.04191259242079413,0.04548559289154977,0.0289362809459686,0.040302929665097964,0.07641067576398769,0.04095060708570659,0.04225499727540636,0.03877017856735944,0.06343045731451313,0.04731447082823992,0.04905733418458075,0.0534958640446744,0.05926565367419835,0.05450047839523453,0.05179573358420345,0.045548233472902266,0.04894826850108662,0.04396917492391711])

DataFrame[Classifier: string, Result: string]

Let's see some results!

In [28]:
predictions = RF_BestModel.transform(test)
print("Predicted Fraudulent:")
predictions.select("fraudulent", "description").filter("prediction=1").orderBy(
    predictions["prediction"].desc()
).show(3, False)
print(" ")
print("Predicted Not Fraudulent:")
predictions.select("fraudulent", "description").filter("prediction=0").orderBy(
    predictions["prediction"].desc()
).show(3, False)

Predicted Fraudulent:

[Output table truncated in export; columns: fraudulent, description]

Predicted Not Fraudulent:

[Output table truncated in export; columns: fraudulent, description]

## What could be next?

This analysis was really just the tip of the iceberg. We could also consider the following analyses:

1. Autotag suspicious descriptions
2. Conduct a similar analysis on the company profile field and the requirements field (NLP)
3. Frequent pattern mining on null and non-null values in other fields
4. Consider analyzing the number of typos in the description
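
As a starting point for the typo idea in item 4, here is a minimal plain-Python sketch that scores a description by the share of its words missing from a known vocabulary. The `typo_rate` function, the tiny `vocabulary` set, and both sample sentences are hypothetical; a real version would need a proper dictionary and some handling for names, acronyms, and industry jargon.

```python
import re


def typo_rate(text, vocabulary):
    """Return the fraction of alphabetic tokens in `text` not found in `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    unknown = [token for token in tokens if token not in vocabulary]
    return len(unknown) / len(tokens)


# Hypothetical vocabulary; in practice this would be a full word list
vocabulary = {"we", "are", "hiring", "a", "data", "analyst", "to", "join", "our", "team"}

clean = "We are hiring a data analyst to join our team"
sloppy = "We are hirring a data analist to join our teem"

print(typo_rate(clean, vocabulary))
print(typo_rate(sloppy, vocabulary))
```

The resulting rate could be added as one more numeric feature next to the text vectors, so the classifiers above can weigh it alongside the description itself.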