<a href="https://colab.research.google.com/github/mariarua/Fake-job-posting/blob/main/Fake_job_posting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
import nltk
from nltk.corpus import stopwords

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!git clone https://github.com/mariarua/Fake-job-posting /content/drive/MyDrive/fake_job_posting

fatal: destination path '/content/drive/MyDrive/fake_job_posting' already exists and is not an empty directory.


In [5]:
Jobs_Dataset = pd.read_csv('/content/drive/MyDrive/fake_job_posting/fake_job_postings.csv')
jobs_DS = Jobs_Dataset.copy()

**Data exploration**

We start by identifying the features in which 40% or more of the values are null (that is, fewer than 60% of the rows have data) and removing them from our dataset.

In [6]:
jobs_DS.isnull().mean()<0.4

job_id                  True
title                   True
location                True
department             False
salary_range           False
company_profile         True
description             True
requirements            True
benefits               False
telecommuting           True
has_company_logo        True
has_questions           True
employment_type         True
required_experience     True
required_education     False
industry                True
function                True
fraudulent              True
dtype: bool

In [7]:
jobs_DS= jobs_DS[jobs_DS.columns[jobs_DS.isnull().mean()<0.4]]
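
To make the threshold concrete, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from the real dataset; only the `isnull().mean() < 0.4` rule mirrors the cell above.

```python
import pandas as pd

# Toy frame (hypothetical data): "salary" is 75% null, "title" is 25% null.
toy = pd.DataFrame({
    "title": ["dev", "qa", None, "ops"],
    "salary": [None, None, None, "50k"],
    "fraudulent": [0, 0, 1, 0],
})

null_fraction = toy.isnull().mean()   # fraction of null values per column
keep = null_fraction < 0.4            # True for columns that are under 40% null
filtered = toy.loc[:, keep].copy()    # .copy() gives an independent frame to modify

print(null_fraction.to_dict())
print(list(filtered.columns))         # ['title', 'fraudulent']
```

Only the column that crosses the 40% mark is dropped; columns with a few missing values survive and are handled by the fill step that follows.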

After that, we fill the remaining null values with an empty string.

In [8]:
# Reassign instead of using inplace=True; the result is the same
jobs_DS = jobs_DS.fillna('')

We drop some columns from our dataset because they are not very useful for what we want to do.

In [9]:
jobs_DS.drop(columns=['telecommuting','has_company_logo','has_questions','job_id'], inplace=True)

To process the text, we concatenate all the features into a single column called text and drop the remaining columns. This way all the features live in one place.

In [10]:
# Join every text feature into one string per posting, separated by spaces
jobs_DS['text'] = (
    jobs_DS['title'] + " " + jobs_DS['location'] + " " + jobs_DS['company_profile'] + " "
    + jobs_DS['description'] + " " + jobs_DS['requirements'] + " " + jobs_DS['employment_type'] + " "
    + jobs_DS['required_experience'] + " " + jobs_DS['industry'] + " " + jobs_DS['function']
)

In [11]:
jobs_DS.drop(columns=['title','location','company_profile','description','requirements','employment_type','required_experience','industry','function'],inplace=True)

Next, we replace newlines, carriage returns, and tabs with spaces.

In [12]:
jobs_DS['text'] = jobs_DS['text'].str.replace('\n', ' ')
jobs_DS['text'] = jobs_DS['text'].str.replace('\r', ' ')
jobs_DS['text'] = jobs_DS['text'].str.replace('\t', ' ')

Now we replace digits and a fixed set of special characters with spaces.

In [13]:
jobs_DS['text'] = jobs_DS['text'].apply(lambda x: re.sub(r'[0-9]',' ',x))
jobs_DS['text'] = jobs_DS['text'].apply(lambda x: re.sub(r'[/(){}\[\]\|@,;.:-]',' ',x))

Then we convert all the text to lowercase.

In [14]:
jobs_DS['text']= jobs_DS['text'].apply(lambda s:s.lower() if type(s) == str else s)

Next, we check that each value is a string; if it is, we split it into a list of words and join the words back together with single spaces, which collapses any repeated whitespace.

In [15]:
jobs_DS['text']= jobs_DS['text'].apply(lambda s:" ".join(s.split()) if type(s) == str else s)
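
The cleaning cells above can be read as one pipeline applied to each string. The sketch below bundles them into a single function; `clean_text` is a hypothetical helper name and the sample posting is invented, but the regular expressions are the ones used in the cells above.

```python
import re

def clean_text(text):
    """Apply the same cleaning steps as the cells above to one string."""
    text = re.sub(r"[\n\r\t]", " ", text)              # line breaks and tabs -> space
    text = re.sub(r"[0-9]", " ", text)                  # digits -> space
    text = re.sub(r"[/(){}\[\]\|@,;.:-]", " ", text)    # listed punctuation -> space
    text = text.lower()
    return " ".join(text.split())                       # collapse repeated whitespace

print(clean_text("Senior Dev (Remote)\nSalary: 50,000-60,000 USD!"))
# -> senior dev remote salary usd!
```

Note that the exclamation mark survives: only the characters listed in the pattern are removed, so other symbols stay attached to their words.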

**STOPWORDS**

Stop words are very common words (such as "the", "and", or "of") that carry little meaning on their own. Search engines, for example, usually ignore them when indexing and ranking content.

We download the stop words from nltk and remove them from our dataset, since they add no value to our model.

In [16]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
jobs_DS['text'] = jobs_DS['text'].apply(lambda x:' '.join([word for word in x.split() if word not in (stop_words)]))
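
As a small illustration of what this filtering does to one posting, the sketch below uses a tiny hand-written stop-word set instead of the NLTK list, so it runs without any download. Both `demo_stop_words` and `remove_stop_words` are hypothetical names.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's English list instead.
demo_stop_words = {"we", "are", "a", "to", "our", "the", "for"}

def remove_stop_words(text, stop_words):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stop_words("we are seeking a developer to join our team", demo_stop_words))
# -> seeking developer join team
```

Storing the stop words in a set, as the notebook does, keeps each membership check fast even though it runs for every token of every posting.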

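Now we use the one_hot() function to build a numeric representation of our text, with a vocabulary size of 5000. Despite its name, one_hot() does not produce one-hot vectors: it hashes each word to an integer index below 5000, so two different words can occasionally share the same index.

With pad_sequences we fix the length of every sequence to 40: sequences shorter than 40 are padded with zeros, and longer ones are truncated (with the default settings, only the last 40 tokens are kept).

This gives us a numeric representation of the job postings in which every sequence has the same length.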



In [18]:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

one_hot_x = [one_hot(description,5000) for description in jobs_DS['text']]
max_l = 40
embedded_description = pad_sequences(one_hot_x,max_l)
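
To show the idea behind these two calls without TensorFlow, here is a pure-Python sketch. It is an illustration, not the Keras implementation. `hashed_indices` and `pad_or_truncate` are hypothetical helpers, and MD5 is swapped in as a stable hash. As far as I know, one_hot() relies on Python's built-in hash(), whose string values can change between interpreter sessions.

```python
import hashlib

def hashed_indices(text, vocab_size):
    """Map each word to an integer in [1, vocab_size - 1] with a stable hash."""
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices

def pad_or_truncate(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items (maxlen >= 1)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

short = pad_or_truncate([7, 8, 9], 5)            # [0, 0, 7, 8, 9]
long = pad_or_truncate([1, 2, 3, 4, 5, 6], 5)    # [2, 3, 4, 5, 6]
ids = hashed_indices("remote data entry remote", 5000)
print(short, long, ids[0] == ids[3])             # the repeated word gets the same index
```

Index 0 is never produced by the hashing step, which leaves it free to act as the padding value.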

We then define the metrics needed to evaluate our model.

In [19]:
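# Use the Keras bundled with TensorFlow, matching the other imports in this notebook
from tensorflow import keras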

METRICS = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'), 
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
      keras.metrics.AUC(name='prc', curve='PR')
]
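
The first four metrics are raw confusion-matrix counts, and accuracy, precision, and recall can be derived from them. The sketch below shows that relationship with made-up counts; `summarize` is a hypothetical helper and the numbers are not results from this notebook.

```python
def summarize(tp, fp, tn, fn):
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts for an imbalanced problem: 50 fraudulent out of 1000 postings.
result = summarize(tp=30, fp=10, tn=940, fn=20)
print(result)   # accuracy 0.97, precision 0.75, recall 0.6
```

In this made-up case accuracy looks high even though 40% of the fraudulent postings are missed, which is one reason to track precision, recall, and the PR curve alongside accuracy.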

We build a sequential neural network to process the job postings. It consists of an embedding layer, a bidirectional LSTM layer, a dropout layer, and a dense output layer.

In [20]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Embedding(5000,40,'uniform',input_length=max_l))
model.add(Bidirectional(LSTM(100)))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=METRICS)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 40, 40)            200000    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              112800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 1)                 201       
                                                                 
Total params: 313,001
Trainable params: 313,001
Non-trainable params: 0
_________________________________________________________________
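
The parameter counts in the summary can be checked by hand. The arithmetic below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias) and the layer sizes from the cell above; the variable names are only for illustration.

```python
vocab_size, embed_dim, lstm_units = 5000, 40, 100

embedding_params = vocab_size * embed_dim
# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_params      # forward and backward LSTM
dense_params = 2 * lstm_units * 1 + 1       # 200 inputs -> 1 output, plus bias

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
# -> 200000 112800 201 313001
```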


In [21]:
X = np.array(embedded_description)
Y = np.array(jobs_DS['fraudulent'])

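We train with 3-fold cross-validation. Note that the loop keeps fitting the same model instance, so its weights carry over from one fold to the next. The validation metrics for the second and third folds are therefore measured on rows the model has already trained on, and they should be read as optimistic.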

In [22]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X):
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]  

    model.fit(X_train,y_train, validation_data=(X_test,y_test), epochs=6, batch_size=30)


Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
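
One way to keep the folds independent is to build a new, untrained model for every fold. The sketch below shows that pattern in plain Python under stated assumptions: `kfold_indices` imitates contiguous, unshuffled folds, `MajorityClassifier` is a hypothetical stand-in for the Keras model, and `build_model` is any callable that returns a fresh model. With Keras, that callable would wrap the model construction and compile steps.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

class MajorityClassifier:
    """Hypothetical stand-in for the Keras model: predicts the most common label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        return sum(1 for label in y if label == self.label_) / len(y)

def cross_validate(build_model, X, y, n_splits=3):
    """Train a new model on every fold and return the per-fold scores."""
    scores = []
    for train, test in kfold_indices(len(X), n_splits):
        model = build_model()   # fresh, untrained model for this fold
        model.fit([X[i] for i in train], [y[i] for i in train])
        scores.append(model.score([X[i] for i in test], [y[i] for i in test]))
    return scores

X_demo = list(range(9))
y_demo = [0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = cross_validate(MajorityClassifier, X_demo, y_demo)
print(scores)
```

Each score now comes from a model that never saw its validation rows. A final model for prediction can then be trained separately on all the data.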


In [23]:
def predict(m,fake_job_post):
  # Apply the same cleaning steps used on the training text
  # (the variable is named text to avoid shadowing the built-in input)
  text = fake_job_post.replace('\n',' ').replace('\r',' ').replace('\t',' ')
  text = re.sub(r'[0-9]',' ',text)
  text = re.sub(r'[/(){}\[\]\|@,;.:-]',' ',text)
  text = text.lower()
  text = " ".join(text.split())
  text = ' '.join([word for word in text.split() if word not in (stop_words)])

  # Encode and pad exactly as during training
  one_hot_input = one_hot(text,5000)
  embedded = pad_sequences([one_hot_input],maxlen=max_l)

  pred = m.predict(embedded)
  print(pred)

  # pred has shape (1, 1); the sigmoid output is the estimated probability of fraud
  if pred[0][0] > 0.5: return "This job posting its FAKE"
  return "This job posting its TRUE"

## **Real job description**

Software Developer

Job Description: We are seeking a highly skilled software developer to join our development team. You will be responsible for designing, developing, and maintaining high-quality software applications. Strong knowledge of programming languages such as Python and experience in web application development are required. We also value database skills and familiarity with development frameworks such as Django. You will work closely with our team of engineers to create innovative and scalable technology solutions. If you are passionate about coding and enjoy tackling technical challenges, this position is for you!

Job Requirements:

Demonstrable experience in software development using Python.
Strong knowledge of programming languages such as Java, C++, or Ruby.
Experience in web application development using frameworks like Django or Flask.
Proficiency in relational databases such as MySQL or PostgreSQL.
Problem-solving skills and ability to work in a team.
Ability to quickly learn new technologies and adapt to changing environments.
We offer a dynamic and challenging work environment, professional growth opportunities, and a highly collaborative team. If you are seeking a new challenge in the field of software development, we look forward to receiving your application!

You can use this job description as input in your model to evaluate whether it is classified as true or false. Remember that the model should have been trained on a labeled dataset in order to make accurate predictions.

In [29]:
predict(model, "Software Developer Job Description: We are seeking a highly skilled software developer to join our development team. You will be responsible for designing, developing, and maintaining high-quality software applications. Strong knowledge of programming languages such as Python and experience in web application development are required. We also value database skills and familiarity with development frameworks such as Django. You will work closely with our team of engineers to create innovative and scalable technology solutions. If you are passionate about coding and enjoy tackling technical challenges, this position is for you! Job Requirements: Demonstrable experience in software development using Python. Strong knowledge of programming languages such as Java, C++, or Ruby. Experience in web application development using frameworks like Django or Flask. Proficiency in relational databases such as MySQL or PostgreSQL. Problem-solving skills and ability to work in a team. Ability to quickly learn new technologies and adapt to changing environments. We offer a dynamic and challenging work environment, professional growth opportunities, and a highly collaborative team. If you are seeking a new challenge in the field of software development, we look forward to receiving your application! You can use this job description as input in your model to evaluate whether it is classified as true or false. Remember that the model should have been trained on a labeled dataset in order to make accurate predictions")

[[4.2903477e-05]]


'This job posting its TRUE'

## **Fake job description**

As coronavirus_are increasing, so have the number of companies asking their employees to stay at home
As travelers cancel flights and stocks fall, a global health pandemic now has become a global economic crisis
In any health pandemic, our first concern us what the health of those affected,
COVID-19 has brought about many more death worldwide and more and more cases are being confirmed daily counties the World
But unfortunately, the economic impacts also have dramatic effects on the wellbeing of families and communities
Although traditional forms of tutoring, including face.to.face lessons and residential placements remain as popular as ever, has also been gaming
traction over the last few years
With a distinct use in online tuition websites, many tutors have begun to work exclusively online and some schools have even started offering online programs
As the world comes together to solve this coronavirus pandemic, the demand for online tuition has also become more and more In demand
Click here and find out how to work from home as an online tutor
Here
Best Regards
Emmanuel



In [31]:
predict(model, "As coronavirus_are increasing, so have the number of companies asking their employees to stay at home As travelers cancel flights and stocks fall, a global health pandemic now has become a global economic crisis In any health pandemic, our first concern us what the health of those affected, COVID-19 has brought about many more death worldwide and more and more cases are being confirmed daily counties the World But unfortunately, the economic impacts also have dramatic effects on the wellbeing of families and communities Although traditional forms of tutoring, including face.to.face lessons and residential placements remain as popular as ever, has also been gaming traction over the last few years With a distinct use in online tuition websites, many tutors have begun to work exclusively online and some schools have even started offering online programs As the world comes together to solve this coronavirus pandemic, the demand for online tuition has also become more and more In demand Click here and find out how to work from home as an online tutor Here Best Regards Emmanuel")

[[0.9999918]]


'This job posting its FAKE'