**Author: Guillermo Raya Garcia**<br>
**NIU: 1568864**<br>
**Universitat Autònoma de Barcelona**
# Real Vs Fake Job Postings: An analysis

__[Link to our dataset](https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction)__.

In [1]:
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import scipy.stats

In [2]:
# Function to read data in CSV format
def load_dataset(path):
    dataset = pd.read_csv(path, header=0, delimiter=',')
    return dataset

# Load the example dataset
dataset = load_dataset('fake_job_postings.csv')
values = dataset.values
labels = dataset.columns.values

In [3]:
values.shape[0]

17880

In [4]:
values.shape[1]

18

In [5]:
values[:,0:18]

array([[1, 'Marketing Intern', 'US, NY, New York', ..., nan, 'Marketing',
        0],
       [2, 'Customer Service - Cloud Video Production', 'NZ, , Auckland',
        ..., 'Marketing and Advertising', 'Customer Service', 0],
       [3, 'Commissioning Machinery Assistant (CMA)', 'US, IA, Wever',
        ..., nan, nan, 0],
       ...,
       [17878,
        'Project Cost Control Staff Engineer - Cost Control Exp - TX',
        'US, TX, Houston', ..., nan, nan, 0],
       [17879, 'Graphic Designer', 'NG, LA, Lagos', ...,
        'Graphic Design', 'Design', 0],
       [17880, 'Web Application Developers', 'NZ, N, Wellington', ...,
        'Computer Software', 'Engineering', 0]], dtype=object)

## 1. Initial observations on the dataset

In [6]:
dataset

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17875,17876,Account Director - Distribution,"CA, ON, Toronto",Sales,,Vend is looking for some awesome new talent to...,Just in case this is the first time you’ve vis...,To ace this role you:Will eat comprehensive St...,What can you expect from us?We have an open cu...,0,1,1,Full-time,Mid-Senior level,,Computer Software,Sales,0
17876,17877,Payroll Accountant,"US, PA, Philadelphia",Accounting,,WebLinc is the e-commerce platform and service...,The Payroll Accountant will focus primarily on...,- B.A. or B.S. in Accounting- Desire to have f...,Health &amp; WellnessMedical planPrescription ...,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,0
17877,17878,Project Cost Control Staff Engineer - Cost Con...,"US, TX, Houston",,,We Provide Full Time Permanent Positions for m...,Experienced Project Cost Control Staff Enginee...,At least 12 years professional experience.Abil...,,0,0,0,Full-time,,,,,0
17878,17879,Graphic Designer,"NG, LA, Lagos",,,,Nemsia Studios is looking for an experienced v...,1. Must be fluent in the latest versions of Co...,Competitive salary (compensation will be based...,0,0,1,Contract,Not Applicable,Professional,Graphic Design,Design,0


In [7]:
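def chartMe(content,colnames,rownames,precision=9,pre="", extra=""):
    # Builds a two-column chart: one row per value in 'content', labelled with the matching entry of 'rownames'.
    # 'pre' and 'extra' are optional texts added before and after each row name.
    chart=[]
    for i,valuesRow in enumerate(content):
        newRow=[str(pre+" "+rownames[i]+" "+extra)]
        newRow.extend([valuesRow])
        chart.append(newRow)
        
    # Setting up the precision, as requested by function call.
    # The full option name is used because the short form 'precision' can be ambiguous in newer pandas versions.
    pd.set_option('display.precision', precision)
    # Note: Styler.hide_index() was removed in pandas 2.0; there, .hide(axis="index") replaces it.
    seriousChart=pd.DataFrame(chart, columns=colnames).style.hide_index()
    pd.reset_option('display.precision')
    
    return seriousChart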

In [8]:
chartMe(content=values[1],colnames=["Attribute","Value"],rownames=labels)

Attribute,Value
job_id,2
title,Customer Service - Cloud Video Production
location,"NZ, , Auckland"
department,Success
salary_range,
company_profile,"90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish. http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides a 100% success guarantee.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L’Oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo and Singapore.http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630# | http://90#URL_e2ad0bde3f09a0913a486abdbb1e6ac373bb3310f64b1fbcf550049bcba4a17b# | http://90#URL_8c5dd1806f97ab90876d9daebeb430f682dbc87e2f01549b47e96c7bff2ea17e#"
description,"Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally. Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. Fast, affordable, and all managed seamlessly in the cloud from purchase to publish. 90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing network of over 2,000 rated video professionals in over 50 countries and dedicated production success teams in 5 countries guaranteeing video project success 100%. 
It's as easy as commissioning a quick google adwords campaign.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L'oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo & Singapore.Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct - GridAKL!"
requirements,"What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and freelance community throughout the video production process including, shoot planning, securing freelance talent, managing workflow and the online production management system. The aim is to manage each video project effectively so that we produce great videos that our clients love.Key attributesClient focused - excellent customer service and communication skillsOnline - oustanding computer knowledge and experience using online software and project management toolsOrganised - manage workload and able to multi-task100% attention to detailMotivated - self-starter with a passion for doing excellent work and achieving great resultsAdaptable - show initiative and think on your feet as this is a constantly evolving atmosphereFlexible - fast turnaround work and after hours availabilityEasy going & upbeat - dosen't get bogged down and loves the challengeSense of Humour - have a laugh and know that working in a startup takes guts!Ability to deliver - including meeting project deadlines and budgetAttitude is more important than experience at 90 Seconds, however previous experience in customer service and/or project management is beneficialPlease view our platform / website at #URL_395a8683a907ce95f49a12fb240e6e47ad8d5a4f96d07ebbd869c4dd4dea1826# and get a clear understand about what we do before reaching out."
benefits,"What you will get from usThrough being part of the 90 Seconds team you will gain:experience working on projects located around the world with an international brandexperience working with a variety of clients and on a large range of projectsopportunity to drive and grow production function and teama positive working environment with a great teamPay$40,000-$55,000Applying for this role with a VIDEOBeing a video business, we understand that one of the quickest ways that we can assess your suitability for this role, and one of the quickest ways that you can apply for it, is for you to submit a 60-90 second long video telling us about yourself, your experience and why you think you would be perfect for the role. It’s not about being a filmmaker or making a really creative video. A simple video filmed with a smart phone or web cam will be fine. Please also include where you are based and when you can start.You can upload the video onto YouTube or Vimeo (or similar) as a Draft or Live link.APPLICATIONS DUE by 5pm on Wednesday 18th July 2014 - Once you have a video ready, apply for this role via the following link together with a cover letter and your CV. After we have watched your video and get an idea of your suitability for the role, we will email the shortlisted candidates"
telecommuting,0


This dataset has 18 columns:

| # | Attribute | Data type | Attribute description |
|:--| :-: | :-: | :-- |
|0|'job_id'| int | Sequential identifier of the posting (row number) |
|1|'title'| str | Name of the job offered |
|2|'location'| str | Location of the job, composed of country, state and city |
|3|'department'| str | Department of the offered job |
|4|'salary_range'| str | Salary range of the offered job |
|5|'company_profile'| str | Brief overview of the company |
|6|'description'| str | Description of the offered job |
|7|'requirements'| str | Requirements for the applicants to the job |
|8|'benefits'| str | Benefits included in the offer |
|9|'telecommuting'| int | 1 if the position allows telecommuting, otherwise 0 |
|10|'has_company_logo'| int | 1 if the posting includes a company logo, otherwise 0 |
|11|'has_questions'| int | 1 if the posting includes screening questions, otherwise 0 |
|12|'employment_type'| str | Type of employment (e.g. 'Full-time', 'Part-time', 'Contract') |
|13|'required_experience'| str | Required experience for applicants |
|14|'required_education'| str | Required education for applicants |
|15|'industry'| str | Industry of the job offered |
|16|'function'| str | Function of the job offer |
|17|'fraudulent'| int | Fraudulent: 1 if yes, otherwise 0 |

Missing entries in the text columns appear as NaN, which is a float value; this is why a text column whose first rows are empty can look like a float column at first glance.
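
The data types of the columns can be checked programmatically rather than by eye. Below is a minimal sketch on a hypothetical three-row stand-in for the dataset: the `toy` frame and its values are made up for illustration and are not read from `fake_job_postings.csv`. It counts the missing values per column and looks at the Python type of the entries that are actually present.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real dataset (illustration only).
toy = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["Marketing Intern", "Graphic Designer", "Payroll Accountant"],
    "benefits": [np.nan, "Competitive salary", np.nan],
    "fraudulent": [0, 0, 0],
})

# Column dtypes as pandas reports them.
print(toy.dtypes)

# Number of missing values (NaN) in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Python type of the non-missing entries of a text column.
# NaN itself is a float, so it is dropped before checking.
benefit_types = toy["benefits"].dropna().map(type).unique()
print(benefit_types)
```

Applied to the real `dataset`, the same calls would show which attributes hold text and how many rows lack each one.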

## 2. Preliminary manipulation of our dataset

Let's remove the first attribute from our data, since it only numbers our rows and gives us no useful information.

In [9]:
dataset=dataset.drop(columns=['job_id'])

Since we will mainly be working with the 'description' attribute, we also want to remove every row where it is missing (NaN).

In [10]:
dataset=dataset.dropna(axis=0,subset=['description'])

Let's recalculate 'values' and 'labels' with the recently applied changes.

In [11]:
values = dataset.values
labels = dataset.columns.values

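Let's separate X (the input data, or explanatory variables) from y (the target variable). After dropping 'job_id', the 16 explanatory attributes occupy columns 0 to 15 and 'fraudulent' is the last column.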

In [12]:
X=values[:,0:16]
y=values[:,-1:]

We'll __[split our data into a Training set and a Test set](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#)__. We will use __[K-fold cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)__ later on; this single split is just to get started.

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
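
For reference, here is a minimal sketch of how K-fold cross-validation splits data, as opposed to the single split above. The arrays `X_toy` and `y_toy` are hypothetical toy data, and the choice of 5 folds is an assumption made for the example, not a decision taken in this analysis.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with 2 features each (illustration only).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# 5 folds: every sample is used for testing exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy)):
    # Index the arrays with the positions returned for this fold.
    X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
    y_tr, y_te = y_toy[train_idx], y_toy[test_idx]
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Each fold trains on 8 samples and tests on the remaining 2, so a model would be fitted and evaluated five times.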

Since many important fields in this dataset consist solely of text, we need to turn them into numbers before we can compare them. For that, we'll use scikit-learn's tools for text feature extraction.

Taking a look at __[the documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)__, we choose to start with a unigram (single-word) count vectorisation of the 'description' attribute, in order to try out our tools. With `min_df=0.05`, only words that appear in at least 5% of the training descriptions are kept, and English stop words are discarded.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.05, stop_words='english')
X_t_desc = vectorizer.fit_transform(X_train[:,5])
X_t_desc # This is the matrix that we get from vectorizing the description attribute from X_train

<11978x330 sparse matrix of type '<class 'numpy.int64'>'
	with 398952 stored elements in Compressed Sparse Row format>
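
To see what `min_df` and `stop_words` do, here is a small sketch on four made-up sentences. The names `toy_docs` and `toy_vectorizer` are hypothetical, and `min_df=0.5` is chosen only so that the effect is visible on such a tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up "descriptions" (illustration only).
toy_docs = [
    "marketing intern for a design team",
    "marketing manager for the sales team",
    "graphic design intern",
    "sales manager",
]

# min_df=0.5 keeps only words present in at least half of the documents;
# stop_words='english' drops common words such as "for" and "the".
toy_vectorizer = CountVectorizer(min_df=0.5, stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# The learned vocabulary, in alphabetical order.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# One row per document, one column per vocabulary word.
print(toy_counts.toarray())
```

The word "graphic" appears in only one of the four documents, so it falls below the threshold and is left out of the vocabulary.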

In [15]:
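# This is the number of different words found in the vectorisation.
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out() instead.
len(vectorizer.get_feature_names())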

330

In [16]:
vectorizer.get_feature_names() # These are the words we found. When we start analyzing other texts, we'll be looking for the presence and frequency of these words.

['12',
 '200',
 'ability',
 'able',
 'account',
 'accounts',
 'achieve',
 'activities',
 'amp',
 'analysis',
 'application',
 'applications',
 'apply',
 'appropriate',
 'area',
 'areas',
 'assigned',
 'assist',
 'available',
 'based',
 'benefits',
 'best',
 'big',
 'brand',
 'build',
 'building',
 'business',
 'calls',
 'campaigns',
 'candidate',
 'candidates',
 'care',
 'career',
 'client',
 'clients',
 'closely',
 'code',
 'come',
 'communicate',
 'communication',
 'community',
 'companies',
 'company',
 'competitive',
 'complete',
 'complex',
 'contact',
 'content',
 'contract',
 'control',
 'cost',
 'create',
 'creating',
 'creative',
 'credit',
 'cross',
 'culture',
 'current',
 'currently',
 'customer',
 'customers',
 'daily',
 'data',
 'day',
 'degree',
 'deliver',
 'delivering',
 'delivery',
 'department',
 'design',
 'develop',
 'developer',
 'developers',
 'developing',
 'development',
 'digital',
 'direct',
 'direction',
 'directly',
 'director',
 'documentation',
 'drive',


In [17]:
vectorizer.vocabulary_.get('website') # ← This command gets us the position of a certain word in our vocabulary

322

In [18]:
# This next function applies count vectorisation to a given string, using the vectorizer fitted above.
# Input:
    # · text: string to be transformed.
# Output:
    # Sparse matrix (1 row, one column per vocabulary word) with the word counts. Can be turned into an array (for readable results) using ".toarray()"
def vecCountTrans(text):
    return vectorizer.transform([text])

In [19]:
# ↓ And here's the result of vectorizing a new sentence of my choosing :)
vecCountTrans("Check out my new project! It's a travel planning website.").toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

Now that we have familiarized ourselves with count vectorisation, we will add a column holding the vectorised description to both of our sets (Train and Test):

In [47]:
# This next function applies vecCountTrans to every element of one column of an array (such as X_train or X_test).
# Input:
    # · myArray: 2-D array containing a column that will be transformed using vecCountTrans
    # · colNumber: The index (int) of the column to be transformed using vecCountTrans
# Output:
    # · newCol: 1-D object array containing the vectorisation (a sparse matrix) of each element in the chosen column.
def vecCountTrans_column(myArray,colNumber):
    # Preallocate the result instead of calling np.append inside the loop, which copies the whole array on every iteration.
    newCol=np.empty(myArray.shape[0],dtype=object)
    for row in range(myArray.shape[0]):
        newCol[row]=vecCountTrans(myArray[row,colNumber])
    return newCol

# This other function applies the previous function and appends the resulting array to the provided array, as a last column.
# Input:
    # · myArray: 2-D array containing a column that will be transformed using vecCountTrans
    # · colNumber: The index (int) of the column to be transformed using vecCountTrans
# Output:
    # · myNewArray: Array containing the same as myArray, plus the result of vecCountTrans as a last column
def append_vecCountTrans_column(myArray,colNumber):
    nC=vecCountTrans_column(myArray,colNumber)
    nC=nC.reshape(-1,1) # Turn the 1-D result into a single column
    myNewArray=np.append(myArray,nC,axis=1)
    return myNewArray

In [48]:
X_train=append_vecCountTrans_column(X_train,5)

In [49]:
X_test=append_vecCountTrans_column(X_test,5)
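
As a point of comparison, a vectorizer can also transform a whole column in a single call, which yields one sparse matrix with a row per sample instead of one small matrix per row. The sketch below is an alternative shown on hypothetical toy data (`toy_X`, `toy_vectorizer`); it is not the approach used in the cells above, and stacking the counts next to a numeric column with `scipy.sparse.hstack` is an assumption about how such a matrix might be combined with other features.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy array: column 0 is numeric, column 1 holds text (illustration only).
toy_X = np.array([
    [0, "marketing intern wanted"],
    [1, "graphic designer wanted"],
    [1, "marketing manager"],
], dtype=object)

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_X[:, 1])

# Transform the whole text column at once: one row per sample, one column per word.
text_counts = toy_vectorizer.transform(toy_X[:, 1])

# Convert the numeric column to a sparse matrix and place the word counts next to it.
numeric_part = sparse.csr_matrix(toy_X[:, [0]].astype(float))
combined = sparse.hstack([numeric_part, text_counts], format="csr")

print(text_counts.shape)
print(combined.shape)
```

The result is a purely numeric sparse matrix, which is the form most scikit-learn estimators accept directly.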