# Analysis of the Employment Scam Aegean Dataset


### Content

* Introduction
* Data description 
* Research objectives
* Data acquisition, cleaning and shaping  
* Data analysis and visualization
* Conclusion 

## Introduction

During the COVID-19 global pandemic, the number of people working online increased dramatically. Changes in the way people work also changed the recruitment process, and as a result the number of employment frauds has increased. According to "ActionFraud", about three quarters of job hunters admit they wouldn't recognise the signs of a job scam. An employment scam can cost a job seeker around £4000 on average. The following analysis focuses on determining the signs of an employment scam by analyzing the text data, meta-features and phrases of job descriptions.

## Data description

The Employment Scam Aegean Dataset contains 17880 real-life job ads published between 2012 and 2014, of which 17014 are legitimate and 866 are fraudulent. The dataset contains the following fields:

* Title - The title of the job ad entry.
* Location - Geographical location of the job ad.
* Department - Corporate department (e.g. sales).
* Salary range - Indicative salary range (e.g. \$50,000-\$60,000).
* Company profile - A brief company description.
* Description - The detailed description of the job ad.
* Requirements - Listed requirements for the job opening.
* Benefits - Listed benefits offered by the employer.
* Telecommuting - True for telecommuting positions.
* Company logo - True if company logo is present.
* Questions - True if screening questions are present.
* Fraudulent - Classification attribute; true if the job ad is fraudulent.
* In balanced - Selected for the balanced dataset.
* Employment type - Full-time, Part-time, Contract, etc.
* Required experience - Executive, Entry level, Intern, etc.
* Required education - Doctorate, Master’s Degree, Bachelor, etc.
* Industry - Automotive, IT, Health care, Real estate, etc.
* Function - Consulting, Engineering, Research, Sales, etc.
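
The class counts quoted in the description imply a strongly imbalanced dataset, which matters for any classifier built on it later. The short sketch below only redoes that arithmetic from the three numbers given above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts as quoted in the dataset description above.
n_total = 17880
n_legit = 17014
n_fraud = 866

# The two classes should add up to the full dataset.
assert n_legit + n_fraud == n_total

fraud_share = n_fraud / n_total
legit_per_fraud = n_legit / n_fraud

print(f"Fraudulent share: {fraud_share:.2%}")
print(f"Legitimate ads per fraudulent ad: {legit_per_fraud:.1f}")

# A model that always answers "legitimate" is already right this often,
# so plain accuracy is a weak yardstick for a fraud classifier.
baseline_accuracy = n_legit / n_total
print(f"Always-legitimate baseline accuracy: {baseline_accuracy:.2%}")
```

Because fraudulent ads are rare, measures such as precision and recall on the fraudulent class are likely to be more informative than accuracy alone.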

## Research objectives
1. Identify the phrases and key words of legitimate and fraudulent job ads.
2. Define the relationship between the company profile and the legitimacy of a job ad.
3. Analyze the relationship between required education/experience and benefits and salary range.
4. Identify the relationship between location/industry and job legitimacy.
5. Create a model that helps predict whether a job ad is fraudulent or real, based on the previously analyzed data.

## Data acquisition, cleaning and shaping  

In [1]:
# Import all dependencies that will be used
import pandas as pd
import numpy as np
import pygal
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
import re
import string
import unicodedata
from string import punctuation 
from IPython.display import display, HTML
base_html = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
  <script type="text/javascript" src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js"></script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""

# One-time NLTK resource downloads; uncomment on the first run
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('stopwords')

In [2]:
# Extracting data from csv file
data = pd.read_csv("fake_job_postings.csv")
data.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [3]:
# Count the non-null values in each column
data_not_null = data.notnull().sum()
barChart = pygal.Bar(height=450)
barChart.title = "Number of non-null values in each column"
for column, count in data_not_null.items():
    barChart.add(column, count)
display(HTML(base_html.format(rendered_chart=barChart.render(is_unicode=True))))

The 'salary_range' and 'department' columns are filled in for less than half of the rows, so they can be removed.
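
The "less than half" check can also be made explicit in code rather than read off the chart. The following is a minimal sketch on a small stand-in frame; `toy`, `MIN_FILLED` and the 0.5 cut-off are assumptions for illustration, and in the notebook the same two lines would be applied to `data`.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values); the notebook would use `data`.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "department": ["Sales", None, None, None],
    "salary_range": [None, None, "10-20", None],
    "benefits": ["x", "y", None, "z"],
})

MIN_FILLED = 0.5  # assumed cut-off: keep columns that are at least half filled

# notna().mean() gives the fraction of non-null values per column.
filled_share = toy.notna().mean()
sparse_columns = filled_share[filled_share < MIN_FILLED].index.tolist()

print(filled_share)
print("Sparse columns:", sparse_columns)

# Drop the sparse columns in one step.
trimmed = toy.drop(columns=sparse_columns)
```

Working with a share instead of a raw count keeps the threshold meaningful if rows are removed later.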

In [4]:
# Split the location column into country and city columns
data['country'] = data['location'].str.split(', ').str.get(0)
data['city'] = data['location'].str.split(', ').str.get(2)
# Remove unnecessary columns
data = data.drop(columns=['department', 'location', 'job_id', 'salary_range'])
data.head()

Unnamed: 0,title,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,country,city
0,Marketing Intern,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0,US,New York
1,Customer Service - Cloud Video Production,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0,NZ,Auckland
2,Commissioning Machinery Assistant (CMA),Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0,US,Wever
3,Account Executive - Washington DC,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0,US,Washington
4,Bill Review Manager,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0,US,Fort Worth


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                17880 non-null  object
 1   company_profile      14572 non-null  object
 2   description          17879 non-null  object
 3   requirements         15185 non-null  object
 4   benefits             10670 non-null  object
 5   telecommuting        17880 non-null  int64 
 6   has_company_logo     17880 non-null  int64 
 7   has_questions        17880 non-null  int64 
 8   employment_type      14409 non-null  object
 9   required_experience  10830 non-null  object
 10  required_education   9775 non-null   object
 11  industry             12977 non-null  object
 12  function             11425 non-null  object
 13  fraudulent           17880 non-null  int64 
 14  country              17534 non-null  object
 15  city                 17440 non-null  object
dtypes: int64(4), object(12)

In [6]:
# Rearranging the order of columns
columns = data.columns.tolist()
order = [0, 1, 14, 15, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
columns = [columns[i] for i in order]
data = data[columns]

In [7]:
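# Remove duplicates and rows with a NaN value in the title or country column.
# Both methods return a new DataFrame, so the result has to be assigned back.
data = data.drop_duplicates()
data = data.dropna(subset=['title', 'country'])
data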

Unnamed: 0,title,company_profile,country,city,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,Marketing Intern,"We're Food52, and we've created a groundbreaki...",US,New York,"Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,Customer Service - Cloud Video Production,"90 Seconds, the worlds Cloud Video Production ...",NZ,Auckland,Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,Commissioning Machinery Assistant (CMA),Valor Services provides Workforce Solutions th...,US,Wever,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,Account Executive - Washington DC,Our passion for improving quality of life thro...,US,Washington,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,Bill Review Manager,SpotSource Solutions LLC is a Global Human Cap...,US,Fort Worth,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17875,Account Director - Distribution,Vend is looking for some awesome new talent to...,CA,Toronto,Just in case this is the first time you’ve vis...,To ace this role you:Will eat comprehensive St...,What can you expect from us?We have an open cu...,0,1,1,Full-time,Mid-Senior level,,Computer Software,Sales,0
17876,Payroll Accountant,WebLinc is the e-commerce platform and service...,US,Philadelphia,The Payroll Accountant will focus primarily on...,- B.A. or B.S. in Accounting- Desire to have f...,Health &amp; WellnessMedical planPrescription ...,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,0
17877,Project Cost Control Staff Engineer - Cost Con...,We Provide Full Time Permanent Positions for m...,US,Houston,Experienced Project Cost Control Staff Enginee...,At least 12 years professional experience.Abil...,,0,0,0,Full-time,,,,,0
17878,Graphic Designer,,NG,Lagos,Nemsia Studios is looking for an experienced v...,1. Must be fluent in the latest versions of Co...,Competitive salary (compensation will be based...,0,0,1,Contract,Not Applicable,Professional,Graphic Design,Design,0


## Data analysis

### Identifying the phrases/key words of legitimate and fraudulent job ads
The first objective of this research is to identify the phrases and key words of legitimate and fraudulent job ads. To do this, the 'company_profile', 'description', 'requirements' and 'benefits' columns are joined into one column, and phrases are then extracted from it. All words are lemmatized, and the text is split into phrases at stop words and punctuation marks. The frequency of the phrases can then be visualized with word clouds.
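
The phrase-splitting step described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: `STOP_WORDS` is a deliberately tiny, hypothetical list (the notebook relies on NLTK's English stop words), `split_into_phrases` is a made-up helper name, and lemmatization is left out to keep the example short.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "is",
              "we", "are", "with", "you", "our", "as"}

def split_into_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    # Punctuation marks end a phrase outright.
    for chunk in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOP_WORDS:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

sample = "We are looking for a data entry clerk. Work from home as a data entry clerk!"
counts = Counter(split_into_phrases(sample))
print(counts.most_common(1))
```

Counting whole phrases rather than single words keeps expressions such as "work from home" together, which is the kind of unit a word cloud of phrases would display.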

In [8]:
# Replace missing text with a blank so the columns can be concatenated
text_columns = ['company_profile', 'description', 'requirements', 'benefits']
data[text_columns] = data[text_columns].fillna(" ")
data['information'] = data['company_profile'] + ' ' + data['description'] + ' ' + data['requirements'] + ' ' + data['benefits']
data.head()

Unnamed: 0,title,company_profile,country,city,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,information
0,Marketing Intern,"We're Food52, and we've created a groundbreaki...",US,New York,"Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0,"We're Food52, and we've created a groundbreaki..."
1,Customer Service - Cloud Video Production,"90 Seconds, the worlds Cloud Video Production ...",NZ,Auckland,Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0,"90 Seconds, the worlds Cloud Video Production ..."
2,Commissioning Machinery Assistant (CMA),Valor Services provides Workforce Solutions th...,US,Wever,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0,Valor Services provides Workforce Solutions th...
3,Account Executive - Washington DC,Our passion for improving quality of life thro...,US,Washington,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0,Our passion for improving quality of life thro...
4,Bill Review Manager,SpotSource Solutions LLC is a Global Human Cap...,US,Fort Worth,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0,SpotSource Solutions LLC is a Global Human Cap...


In [9]:
## stop = set(stopwords.words('english'))
## stop.update(set(string.punctuation))

In [10]:
## def get_wordnet_part_of_speech(tag):
##    # Map the first letter of a Penn Treebank tag to a WordNet POS constant
##    word_dict = {"J": wordnet.ADJ,
##                 "N": wordnet.NOUN,
##                 "V": wordnet.VERB,
##                 "R": wordnet.ADV}
##
##    return word_dict.get(tag[0], wordnet.NOUN)

In [11]:
## lemmatizer = WordNetLemmatizer()
## def lemmatize_words(text):
##    lemmatized_text = []
##    for i in nltk.word_tokenize(text):
##        if i.strip().lower() not in stop:
##            part_of_speech = pos_tag([i.strip()])
##            word = lemmatizer.lemmatize(i.strip(), get_wordnet_part_of_speech(part_of_speech[0][1]))
##            lemmatized_text.append(word.lower())
##    return " ".join(lemmatized_text)  