<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project: Real or Fake Jobs
_**Note**: Best viewed via [Google Colab](https://drive.google.com/file/d/1_B619CKVS0MyF2QeghC_6FsgQh419u2F/view)_

## Contents:
### Part 3 (of 3)
- Problem Statement
- Background
- Data Cleaning
- Data Dictionary
- Exploratory Data Analysis
- Data Modelling
    - Model Fitting & Evaluation
- Data Modelling with SMOTE
    - Model Fitting & Evaluation (SMOTE)
- Model Selection
- Misclassified Data
- [Topic Modelling](#Topic-Modelling)
- [Recommendations](#Recommendations)
- [Conclusions](#Conclusions)
- [References](#References)

## Import Libraries

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from time import time
import string
import re
import nltk

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, \
accuracy_score, recall_score, precision_score, plot_roc_curve, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from imblearn.over_sampling import SMOTE 
from imblearn.pipeline import Pipeline as pipeline

# show all columns and up to 100 characters per cell when displaying DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)



In [3]:
!pip install bertopic



In [4]:
from bertopic import BERTopic

topic_model = BERTopic(language = "english", calculate_probabilities = True, verbose = True)



## Topic Modelling

In [5]:
# load the fake_jobs text
fake = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/Capstone Project/data/fake_jobs_text.csv')

In [6]:
fake.head(1)

Unnamed: 0,text_lemma
0,ic e technician ic e technician bakersfield ca mt posoprincipal duty responsibility calibrates t...


### I. Fake Jobs Topic Modelling

In [7]:
# fit the topic model: embed the text, reduce dimensionality, cluster, and return each ad's topic and topic probabilities
topics, probs = topic_model.fit_transform(fake['text_lemma'])

Batches:   0%|          | 0/16 [00:00<?, ?it/s]

2021-09-08 10:30:38,495 - BERTopic - Transformed documents to Embeddings
2021-09-08 10:30:49,777 - BERTopic - Reduced dimensionality with UMAP
2021-09-08 10:30:49,821 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [8]:
# get the different topics and their frequency
freq = topic_model.get_topic_info()
freq.sort_values(by = 'Count', ascending = False)

Unnamed: 0,Topic,Count,Name
0,0,403,0_system_management_engineering_business
1,1,36,1_job_suppliescomputer_distractionsmust_wage
2,2,34,2_care_nursing_patient_rn
3,3,13,3_glass_optical_optician_lens


In [9]:
# top words in topic 0
topic_model.get_topic(0)

[('system', 0.023941138564172813),
 ('management', 0.022948831317430553),
 ('engineering', 0.02184365650449586),
 ('business', 0.01940101260410698),
 ('equipment', 0.016996461475508783),
 ('responsibility', 0.01618578060292454),
 ('industry', 0.016115081578170223),
 ('design', 0.01581828989972461),
 ('office', 0.015741377113781502),
 ('service', 0.015440683565150588)]

In [10]:
# top words in topic 1
topic_model.get_topic(1)

[('job', 0.13739443488303296),
 ('suppliescomputer', 0.08048696583555454),
 ('distractionsmust', 0.08048696583555454),
 ('wage', 0.07412470764036783),
 ('guarantee', 0.07397709191480306),
 ('benefit', 0.07181144973717363),
 ('applying', 0.07172424609732467),
 ('salary', 0.07137460339963311),
 ('career', 0.0680385365357489),
 ('bonus', 0.06796482484632246)]

In [11]:
# top words in topic 2
topic_model.get_topic(2)

[('care', 0.07307847244476162),
 ('nursing', 0.06650610123780185),
 ('patient', 0.06513947274992281),
 ('rn', 0.06214495691061141),
 ('nurse', 0.04726340812585129),
 ('hospital', 0.03993980426660773),
 ('staff', 0.03915781606185133),
 ('surgery', 0.03313655800183737),
 ('perioperative', 0.03313655800183737),
 ('surgical', 0.03155392972243682)]

In [12]:
# top words in topic 3
topic_model.get_topic(3)

[('glass', 0.21681437641016615),
 ('optical', 0.1859535307632584),
 ('optician', 0.1313440309191682),
 ('lens', 0.1098630761033127),
 ('position', 0.10900946244207413),
 ('drop', 0.09941065984864286),
 ('select', 0.09470475444135318),
 ('optometric', 0.08544295723184353),
 ('selecting', 0.06311806045478131),
 ('presenting', 0.056112957419944896)]

**Observation**: The fake job ads fall into 4 topics, with about 83% of them (403 of 486) in Topic0. Its top words point to engineering and business-management roles. The three smaller topics appear to cover pay and benefits (wage, salary, bonus; `suppliescomputer` and `distractionsmust` look like words fused together during text cleaning), healthcare and nursing, and optometry.
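
The share of ads in each topic follows directly from the counts in the `get_topic_info()` output above. The short sketch below redoes that arithmetic in plain Python. It assumes the four rows shown are the complete output, so that the counts sum to the total number of fake ads.

```python
# Topic sizes copied from the get_topic_info() output above.
topic_counts = {0: 403, 1: 36, 2: 34, 3: 13}

# Assumption: these four rows are the full table, so their sum is the total.
total_ads = sum(topic_counts.values())
topic_shares = {topic: count / total_ads for topic, count in topic_counts.items()}

for topic, share in topic_shares.items():
    print(f"topic {topic}: {share:.1%} of {total_ads} ads")
```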

In [None]:
# visualise the topics
topic_model.visualize_topics()

![google-colab](../image/fake-topic-dist.png)

**Observation**: The topic circles do not overlap, which suggests that the four topics are clearly distinct from one another.

In [None]:
# visualise topic hierarchy
topic_model.visualize_hierarchy()

![google-colab](../image/fake-hierarchical-clustering.png)

In [None]:
# visualise the common words for each topic
topic_model.visualize_barchart(top_n_topics = 5)

![google-colab](../image/fake-top-words.png)

### II. Real Jobs Topic Modelling

In [16]:
# load the real jobs data
real = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/Capstone Project/data/real_jobs_text.csv')

In [17]:
topic_model = BERTopic(language = "english", calculate_probabilities = True, verbose = True)

In [18]:
# fit the topic model: embed the text, reduce dimensionality, cluster, and return each ad's topic and topic probabilities
topics, probs = topic_model.fit_transform(real['text_lemma'])

Batches:   0%|          | 0/356 [00:00<?, ?it/s]

2021-09-08 10:41:08,760 - BERTopic - Transformed documents to Embeddings
2021-09-08 10:41:30,470 - BERTopic - Reduced dimensionality with UMAP
2021-09-08 10:41:55,465 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [19]:
# get the topics
freq = topic_model.get_topic_info()
freq.sort_values(by = 'Count', ascending = False)

Unnamed: 0,Topic,Count,Name
0,-1,3667,-1_digital_want_platform_engineer
1,0,329,0_selling_prospect_revenue_target
2,1,221,1_test_testing_qa_automated
3,2,205,2_java_oracle_xml_framework
4,3,184,3_admin_funding_apprenticeship_na
...,...,...,...
194,190,11,190_fundraising_award_nonprofit_social
201,200,10,200_intercom_impersonal_whatsapp_venture
202,201,10,201_salesphone_websocial_mediaecommercewebsites_website
203,202,10,202_companiestesting_watermarking_technologiesplanning_videostrong


**Observation**: With far more data on real jobs, the ads are clustered into 204 topics, plus the outlier topic -1, which holds the 3,667 ads that were not assigned to any cluster. The 3 largest topics of the genuine job ads seem to be related to sales (target/revenue), software testing and quality assurance, and Java programming.
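
In BERTopic, topic -1 collects the documents that HDBSCAN treats as noise, so it is worth separating it from the real topics when counting. The sketch below is a minimal helper for that, assuming `topics` is the list of per-document topic ids returned by `fit_transform`. The function name `summarise_topics` and the example assignments are hypothetical and not part of the original notebook.

```python
from collections import Counter

def summarise_topics(topics):
    """Return (number of non-outlier topics, share of documents labelled -1)."""
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)
    # Do not count the outlier label -1 as a topic of its own.
    n_topics = len(counts) - (1 if -1 in counts else 0)
    outlier_share = n_outliers / len(topics) if topics else 0.0
    return n_topics, outlier_share

# Hypothetical assignments for eight documents, not the real model output.
example_topics = [-1, 0, 0, 1, -1, 2, 1, -1]
n_topics, outlier_share = summarise_topics(example_topics)
print(f"{n_topics} topics, {outlier_share:.1%} of documents are outliers")
```

Calling `summarise_topics(topics)` on the real assignments would show what fraction of genuine ads ended up in topic -1, which helps judge how much of the corpus the named topics actually describe.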


In [20]:
# display top 20 topics
freq.sort_values(by = 'Count', ascending = False).head(20)

Unnamed: 0,Topic,Count,Name
0,-1,3667,-1_digital_want_platform_engineer
1,0,329,0_selling_prospect_revenue_target
2,1,221,1_test_testing_qa_automated
3,2,205,2_java_oracle_xml_framework
4,3,184,3_admin_funding_apprenticeship_na
5,4,182,4_designer_visual_creative_photoshop
6,5,172,5_hr_recruiting_recruitment_recruiter
7,6,157,6_sql_database_oracle_server
8,7,149,7_estimating_managing_manage_planning
9,8,143,8_office_assistant_executive_calendar


In [None]:
# visualise how close the topics are to each other
topic_model.visualize_topics()

![google-colab](../image/real-topic-dist.png)

**Observation**: Many of the topics are very close to each other, as seen by the overlapping circles.

In [None]:
# visualise the hierarchy between the top 20 topics
topic_model.visualize_hierarchy(top_n_topics=20)

![google-colab](../image/real-hierarchical-clustering.png)

In [None]:
# visualise the top words from the top 10 topics
topic_model.visualize_barchart(top_n_topics = 10)

![google-colab](../image/real-top-words.png)

In [24]:
# update the topics by using bigrams instead of unigrams
topic_model.update_topics(real['text_lemma'], topics, n_gram_range = (2, 2))

In [25]:
# display the top bigrams from topic of our choice, to better understand the context
topic_model.get_topic(0)

[('sale representative', 0.005868507264418676),
 ('sale manager', 0.00483916083925407),
 ('sale director', 0.004275268948443754),
 ('sale associate', 0.00391747430041073),
 ('sale executive', 0.0035945663504305114),
 ('sale strategy', 0.00279824238030757),
 ('sale development', 0.002740307223647008),
 ('sale professional', 0.0026990948084351625),
 ('sale operation', 0.0025354650998916286),
 ('sale marketing', 0.002191186170686824)]

**Observation**: The bigrams give more context to the job listings than the unigrams above. The top bigrams of Topic0 (sale representative, sale manager, sale director) make it clear that this topic covers sales roles.
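
To illustrate what `n_gram_range = (2, 2)` changes, the sketch below counts adjacent word pairs in a lemmatised string. It is a simplified stand-in that splits on whitespace, whereas BERTopic builds its topic words with a vectorizer and weights them with c-TF-IDF. The helper name `count_bigrams` and the example sentence are hypothetical.

```python
from collections import Counter

def count_bigrams(text):
    """Count adjacent word pairs in a whitespace-tokenised string."""
    tokens = text.split()
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

# Hypothetical lemmatised text, not taken from the dataset.
example = "sale manager lead sale team sale manager report sale director"
bigrams = count_bigrams(example)
print(bigrams.most_common(3))
```

The unigram `sale` alone says little about the role, while pairs such as `sale manager` or `sale director` carry the job title.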

#### Topic Reduction
We can also reduce the number of topics after training a BERTopic model, once we know how many were actually created. In this case 204 topics were created and some are very close to each other, so we reduce them to 50 to better understand how distinct the topics are.

In [26]:
# reduce the number of topics from 204 to 50 (the log below also counts the outlier topic -1, hence 205 to 51)
new_topics, new_probs = topic_model.reduce_topics(real['text_lemma'], topics, probs, nr_topics = 50)

2021-09-08 10:44:24,686 - BERTopic - Reduced number of topics from 205 to 51


In [27]:
# display the reclassified 50 topics and frequency
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,3769,-1_business development_software development_project management_software engineer
1,0,417,0_sale representative_inside sale_sale executive_sale manager
2,1,253,1_employee also_account executive_service provider_value employee
3,2,250,2_old cash_ipads touch_cash register_touch screen
4,3,243,3_patient care_physical therapist_clinic assistant_patient resident
5,4,230,4_working capital_tidewater finance_international money_money transfer
6,5,225,5_hiring manager_management training_employee relation_training compensation
7,6,221,6_test plan_test case_test automation_automated test
8,7,220,7_graphic designer_graphic design_visual design_web designer
9,8,205,8_java developer_java ee_senior java_java web


**Observation**: After the reduction, the topics are broader and more intuitive than the original 204 topics.

#### Identifying Similar Topics
After training the model, we can use `find_topics` to search for topics that are similar to an input `search_term`.

In [28]:
# search for topics that are closely related to 'data science' and check the results
similar_topics, similarity = topic_model.find_topics("data science", top_n = 5)
similar_topics

[44, 9, 22, 18, 6]

In [32]:
# view topic44
topic_model.get_topic(44)

[('data integration', 0.021542134270185986),
 ('data solution', 0.016013178223398507),
 ('process server', 0.010507105751059533),
 ('building data', 0.009959375219183626),
 ('data visualization', 0.009639605324536758),
 ('business process', 0.009334507567075235),
 ('project team', 0.007658501554167974),
 ('solution client', 0.007473318249605507),
 ('data usability', 0.007117026816030808),
 ('project management', 0.006246712052399304)]

In [33]:
# view topic9 to see if it is similar to topic44
topic_model.get_topic(9)

[('big data', 0.020810860547717587),
 ('data scientist', 0.012443198366093453),
 ('qubit cutting', 0.010035408321542235),
 ('machine learning', 0.009120880274296788),
 ('data analyst', 0.00888640941500809),
 ('data engineeringqubit', 0.008468792974006403),
 ('understand qubit', 0.007668550761584072),
 ('qubit store', 0.007668550761584072),
 ('engineeringqubit looking', 0.004071528529447254),
 ('manager qubit', 0.0038881571197120213)]

In [34]:
# view topic22 to see if it is similar to topic44 and topic9
topic_model.get_topic(22)

[('sql server', 0.025920481715442523),
 ('data warehouse', 0.012993522743012966),
 ('database design', 0.0069057689973876036),
 ('database development', 0.006745529932106903),
 ('sql developer', 0.006185490745762732),
 ('database performance', 0.005530038854880041),
 ('database administration', 0.005126602748401246),
 ('server database', 0.004857655983724162),
 ('data management', 0.004274220002474243),
 ('resolving database', 0.004200203319567925)]

**Observation**: The top bigrams of Topic44, Topic9 and Topic22 show that these topics are similar and related to data science. Examples include data integration, data visualisation, machine learning and SQL Server, all of which are common in data-related roles.
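
As described in the BERTopic documentation, `find_topics` embeds the search term and compares it with the topic embeddings, returning the closest topics. The sketch below shows that idea with cosine similarity on toy vectors. The 3-dimensional embeddings, the helper names `cosine_similarity` and `rank_topics`, and the choice of topic ids are all made up for illustration and do not reproduce the library's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_topics(query_vec, topic_vecs, top_n=3):
    """Return the ids of the top_n topics most similar to the query vector."""
    scores = {topic: cosine_similarity(query_vec, vec) for topic, vec in topic_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical embeddings; real sentence embeddings have hundreds of dimensions.
topic_vecs = {44: [0.9, 0.1, 0.0], 9: [0.8, 0.3, 0.1], 0: [0.0, 0.2, 0.9]}
query_vec = [1.0, 0.2, 0.0]

ranked = rank_topics(query_vec, topic_vecs, top_n=2)
print(ranked)
```

Because the ranking depends only on direction in the embedding space, a topic can be returned as "similar" even when none of its top words literally match the search term, which is why the top words of each returned topic are inspected above.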

## Conclusions
The Logistic Regression with TFIDF Vectorizer distinguishes well between fraudulent (fake) and non-fraudulent (real) jobs, with a high ROC AUC score of 0.946 and a recall score of 0.801. If we deploy this model to job portals, it should flag about 80% of fraudulent job ads and prevent them from being published. Fewer fraudulent listings would substantially reduce the risk of job-seekers falling prey to job scams. However, there is a trade-off between recall and precision. The model has a high recall score (0.801) but a low precision score (0.429), which means that more than half of the ads it flags as fake are actually genuine. Some genuine jobs would therefore not be published on job portals, and their employers would have a harder time finding candidates to fill their positions. These employers can then choose to amend their job ads or use other channels, such as internal referrals, to source candidates.
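
To make the recall and precision trade-off concrete, the sketch below works through the arithmetic for a hypothetical batch containing 100 fraudulent ads, using the reported recall (0.801) and precision (0.429). The batch size and the helper names `flagged_breakdown` and `f1_from` are assumptions for illustration only.

```python
def flagged_breakdown(n_fake, recall, precision):
    """Expected true and false positives among flagged ads, given n_fake fraudulent ads."""
    true_positives = n_fake * recall            # fake ads correctly flagged
    flagged = true_positives / precision        # all ads flagged as fake
    false_positives = flagged - true_positives  # genuine ads wrongly flagged
    return true_positives, false_positives

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical batch with 100 fraudulent ads; scores are those reported above.
tp, fp = flagged_breakdown(n_fake=100, recall=0.801, precision=0.429)
f1 = f1_from(precision=0.429, recall=0.801)
print(f"caught fake ads: {tp:.1f}, genuine ads wrongly flagged: {fp:.1f}, F1: {f1:.3f}")
```

Under these assumptions, roughly 80 of the 100 fraudulent ads are caught, at the cost of more than 100 genuine ads being flagged alongside them.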

To take this project further, we could explore the following:
- Getting more recent jobs data (e.g. from 2019 onwards, covering the COVID-19 period)
- Getting additional data on fraudulent jobs as and when such cases are reported
- Collecting more features, such as the platform the job ad was posted on and the nature of the job
- Further fine-tuning hyperparameters and understanding context before removing stopwords
- Trying other oversampling techniques

## Recommendations
While we are able to train a classifier to help keep fraudulent job ads off job portals, job-seekers should also be cautious when applying for jobs. If a job posting does not come with a company logo or profile, job-seekers should research the company, ask around or check online reviews before applying.

Based on the top features of the model and topic modelling, many fraudulent jobs appear to be in engineering or admin. Job-seekers are advised to be wary of 'too-good-to-be-true' jobs. If the job is high paying and technical but has low requirements, job-seekers should pay extra attention and scrutinize the job ad further.

Job-seekers should also look for jobs of interest that require more specialised skills, such as computing/programming, rather than generic roles requiring basic skills like Microsoft Office. Even if the recruitment process is more tedious, it puts job-seekers in a safer position with a lower risk of falling prey to job scams.

As the classifier is not 100% accurate or robust, job-seekers or any users who come across suspicious job ads, such as those requesting sensitive information or payment, should report these listings so that action can be taken against the original poster.

## References
1. http://emscad.samos.aegean.gr/ 
2. https://www.american.edu/careercenter/fraudulentjobs.cfm 
3. https://www.flexjobs.com/blog/post/how-to-find-a-real-online-job-and-avoid-the-scams-v2/
4. https://insights.omnia-health.com/hospital-management/employment-fraud-rates-increase-30-cent-during-pandemic
5. https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/
6. https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing#scrollTo=9antKpdC91A-