# Data Cleaning & Pre-processing
![data-cleaning-in-python](https://daxg39y63pxwu.cloudfront.net/images/blog/data-cleaning-in-python/data-cleaning-in-python.png)

The first step of an analytics project is to clean the datasets and pre-process them so that they are suitable for use by analytical models and visualization.


In [None]:
# install dependencies
# Run the following commands in your terminal if you don't have the dependencies installed:
    # pip install -U imbalanced-learn --user
    # pip install statsmodels
    # pip install -U scikit-learn
    # pip install nltk==3.5

# Or run in notebook:    
# %pip install -U imbalanced-learn 
# %pip install statsmodels
# %pip install -U scikit-learn
# %pip install nltk==3.5

In [1]:
# import basic libraries
import pandas as pd
import numpy as np
import seaborn as sb

References:
- https://www.kaggle.com/code/madz2000/text-classification-using-keras-nb-97-accuracy
- https://realpython.com/nltk-nlp-python/
- https://codeigo.com/python/remove-punctuation-with-nltk
- https://rishabh20118.medium.com/fake-job-posting-detection-and-getting-useful-job-posting-insights-from-the-dataset-e8edf1870831
- https://towardsdatascience.com/fake-job-predictor-a168a315d866

## 1) Import dataset into the notebook

In [2]:
# job_df -> fake_job_postings dataframe
job_df = pd.read_csv('datasets/fake_job_postings.csv')
job_df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


## 2) Data Cleaning

In [4]:
job_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

### **2.1 NA values**

NA values are meaningful because they tell us that a job listing lacks a particular field, which may play an important role in determining whether the listing is a scam. Thus, NA values will be **replaced with meaningful values** according to domain knowledge and the values in the dataset.
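One way to check whether missingness carries signal is to compare the fraud rate between rows where a field is missing and rows where it is present. The sketch below uses a small made-up DataFrame (`toy_df` and its values are illustrative assumptions, not taken from the dataset); the same pattern could be applied to `job_df`.

```python
import pandas as pd

# Hypothetical toy data: None marks a missing company profile
toy_df = pd.DataFrame({
    "company_profile": ["We build apps", None, None, "Global retailer", None, "Law firm"],
    "fraudulent": [0, 1, 1, 0, 0, 0],
})

# Label each row as "missing" or "present" for the chosen field
group = toy_df["company_profile"].isna().map({True: "missing", False: "present"})

# Mean of the 0/1 target per group = share of fraudulent listings in that group
fraud_rate = toy_df.groupby(group)["fraudulent"].mean()
print(fraud_rate)
```

A clearly higher rate in the "missing" group would support keeping the missingness information instead of dropping those rows.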

#### **2.1.1 Columns with NA values**

In [22]:
job_df.isna().sum()

job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2695
benefits                7210
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64

In [3]:
columnsWithNA = []

def displayNAColumns(df):
    # Columns of df that contain at least one NA value
    tempColumnsWithNA = df.columns[df.isna().any()].tolist()

    # Record them in the global list (used later for modelling), skipping duplicates
    for col in tempColumnsWithNA:
        if col not in columnsWithNA:
            columnsWithNA.append(col)

    print(f"{len(tempColumnsWithNA)} columns with NA values")
    if tempColumnsWithNA:
        print(tempColumnsWithNA)

displayNAColumns(job_df)

12 columns with NA values
['location', 'department', 'salary_range', 'company_profile', 'description', 'requirements', 'benefits', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']


There are 12 columns with NA values.

#### **2.1.2 Replacing NA values**


**2.1.2 (a) `For EDA`, NA values for these columns will be converted as shown below:**

- location -> "Unknown"
- department -> "Unknown"
- salary_range -> "Unknown"
- company_profile -> " " (a single space)
- description -> " " (a single space)
- requirements -> " " (a single space)
- benefits -> "Unknown"
- employment_type -> "Unknown"
- required_experience -> "Not Applicable"
- required_education -> "Not Applicable"
- industry -> "Unknown"
- function -> "Unknown"
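The same mapping can also be expressed as a single dictionary passed to `DataFrame.fillna`, which keeps the replacement values in one place. This is a minimal sketch on a made-up frame (`sample_df` is a hypothetical stand-in for `job_df`); the free-text fields use a single space here.

```python
import pandas as pd

# Column -> replacement value for missing entries
eda_fill_values = {
    "location": "Unknown",
    "department": "Unknown",
    "salary_range": "Unknown",
    "company_profile": " ",
    "description": " ",
    "requirements": " ",
    "benefits": "Unknown",
    "employment_type": "Unknown",
    "required_experience": "Not Applicable",
    "required_education": "Not Applicable",
    "industry": "Unknown",
    "function": "Unknown",
}

# Tiny hypothetical frame with a few of the columns
sample_df = pd.DataFrame({
    "location": ["US, NY, New York", None],
    "company_profile": [None, "We sell widgets"],
    "required_experience": [None, "Internship"],
})

# Each column is filled with its own value from the dictionary
filled_df = sample_df.fillna(value=eda_fill_values)
print(filled_df)
```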

In [7]:
job_df_eda = job_df.copy()

In [8]:
job_df_eda['location'] = job_df_eda['location'].fillna("Unknown")
job_df_eda['department'] = job_df_eda['department'].fillna("Unknown")
job_df_eda['salary_range'] = job_df_eda['salary_range'].fillna("Unknown")
job_df_eda['company_profile'] = job_df_eda['company_profile'].fillna(" ")
job_df_eda['description'] = job_df_eda['description'].fillna(" ")
job_df_eda['requirements'] = job_df_eda['requirements'].fillna(" ")
job_df_eda['benefits'] = job_df_eda['benefits'].fillna("Unknown")
job_df_eda['employment_type'] = job_df_eda['employment_type'].fillna("Unknown")
job_df_eda['required_experience'] = job_df_eda['required_experience'].fillna("Not Applicable")
job_df_eda['required_education'] = job_df_eda['required_education'].fillna("Not Applicable")
job_df_eda['industry'] = job_df_eda['industry'].fillna("Unknown")
job_df_eda['function'] = job_df_eda['function'].fillna("Unknown")

In [9]:
print('Number of NA entries: ', job_df_eda.isna().sum().sum())
displayNAColumns(job_df_eda)

Number of NA entries:  0
0 columns with NA values


In [10]:
job_df_eda.isna().sum()

job_id                 0
title                  0
location               0
department             0
salary_range           0
company_profile        0
description            0
requirements           0
benefits               0
telecommuting          0
has_company_logo       0
has_questions          0
employment_type        0
required_experience    0
required_education     0
industry               0
function               0
fraudulent             0
dtype: int64

**2.1.2 (b) `For Modelling`, all NA values will be converted to `""` (empty strings) in order to facilitate text mining**

In [4]:
job_df_model = job_df.copy()

In [5]:
for col in columnsWithNA:
  job_df_model[col] = job_df_model[col].fillna("")

In [6]:
displayNAColumns(job_df_model)

0 columns with NA values


### **2.2 Cleaning up continuous variables**

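There are no continuous variables to be cleaned: the `int64` columns shown below are the identifier `job_id` and binary 0/1 flags such as `telecommuting`, `has_company_logo` and `has_questions`. Thus, there are no outliers to be considered.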
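A quick way to back this up is to list the numeric columns and confirm that, apart from the identifier, they only contain 0 and 1. The sketch below runs on a made-up frame (`toy_df` is hypothetical); on the real data the same lines would be applied to `job_df`.

```python
import pandas as pd

# Hypothetical miniature of the dataset's numeric and text columns
toy_df = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "telecommuting": [0, 0, 1, 0],
    "has_company_logo": [1, 1, 0, 1],
    "title": ["A", "B", "C", "D"],
})

# Numeric columns, excluding the row identifier
numeric_cols = toy_df.select_dtypes(include="number").columns.drop("job_id")

# True for each column whose values are all 0 or 1
is_binary = toy_df[numeric_cols].isin([0, 1]).all()
print(is_binary)
```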

In [17]:
job_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

### **2.3 Cleaning up categorical variables**

In [19]:
# Print the value counts of every text (object) column
for feature in job_df_eda.select_dtypes(include='object').columns:
  print(job_df_eda[feature].value_counts(), end='\n\n')

English Teacher Abroad                                         311
Customer Service Associate                                     146
Graduates: English Teacher Abroad (Conversational)             144
English Teacher Abroad                                          95
Software Engineer                                               86
                                                              ... 
West Coast Regional Channel Manager (RCM)                        1
BI Practice Manager                                              1
Community Coordinator- Ambassador Programme                      1
Senior Traffic Engineer                                          1
Project Cost Control Staff Engineer - Cost Control Exp - TX      1
Name: title, Length: 11231, dtype: int64

GB, LND, London          718
US, NY, New York         658
US, CA, San Francisco    472
GR, I, Athens            464
Unknown                  346
                        ... 
GB, SFK, Leiston           1
GB, LND, Hammersmi

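- the categorical variables are largely consistent in their values (near-duplicate entries such as the two `English Teacher Abroad` titles presumably differ only in whitespace, which does not matter once the text is tokenized)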
- there are no missing values
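Consistency of labels can also be checked programmatically rather than by eye. The following is a rough sketch, assuming that "inconsistent" means labels that differ only in whitespace or letter case; the `titles` series is made up for illustration.

```python
import pandas as pd

# Hypothetical labels, including whitespace and case variants
titles = pd.Series([
    "English Teacher Abroad",
    "English Teacher Abroad ",   # trailing space
    "english teacher abroad",
    "Software Engineer",
])

# Normalize: strip outer whitespace, collapse inner whitespace, ignore case
normalized = titles.str.strip().str.replace(r"\s+", " ", regex=True).str.casefold()

# Count distinct raw spellings per normalized label
variants = titles.groupby(normalized).nunique()

# More than one spelling for the same normalized label signals an inconsistency
inconsistent = variants[variants > 1]
print(inconsistent)
```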

Therefore, no data cleaning is needed for categorical variables. Additionally, one-hot encoding is not required to turn the categorical variables into numeric form because:

1) The text-based variables will be combined into a single text field on which text mining will be performed.
2) Categorical variables with 2 levels are already integer-encoded (0/1).

## 3) Data Preprocessing

To apply `text mining` to the dataset, it first needs to be preprocessed into a form that `analytical models` can use. In this step, we will process the `job_df_model` dataframe to combine several fields into one `text` column, which will then be used for text mining.
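As an illustration of the idea, several string columns can be joined row by row into one field. The sketch below is one possible way to do it with `DataFrame.agg` on made-up data (`toy_df` and its values are hypothetical); unlike the loop used in this notebook, it does not prepend a leading space.

```python
import pandas as pd

# Hypothetical rows; empty strings stand in for filled NA values
toy_df = pd.DataFrame({
    "title": ["Marketing Intern", "Data Analyst"],
    "department": ["Marketing", ""],
    "description": ["Write posts.", "Build dashboards."],
})

# Join the selected columns of each row with a single space
cols = ["title", "department", "description"]
toy_df["text"] = toy_df[cols].agg(" ".join, axis=1)
print(toy_df["text"].tolist())
```

An empty field leaves two consecutive spaces in the combined text, which is harmless once the text is tokenized.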

### 3.1 Overview of the job_df_model dataframe

In [7]:
job_df_model.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


### 3.2 Combining columns into `text` column

In [8]:
columnsToCombine = [
    "title",
    "location",
    "department",
    "salary_range",
    "company_profile",
    "description",
    "requirements",
    "benefits",
    "employment_type",
    "required_experience",
    "required_education",
    "industry",
    "function"
]

job_df_model["text"] = ""

for col in columnsToCombine:
    job_df_model["text"] = job_df_model["text"] + " " + job_df_model[col]


In [9]:
# Inspect one of the combined 'text' values
job_df_model["text"][0]

" Marketing Intern US, NY, New York Marketing  We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City. Food52, a fast-growing, James Beard Award-winning online food commu

### 3.3 Dropping redundant columns

#### 3.3.1 For `job_df_eda` 

In [11]:
job_df_eda.drop(["job_id"], axis=1, inplace=True)

In [12]:
job_df_eda.head()

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,Marketing Intern,"US, NY, New York",Marketing,Unknown,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,Unknown,0,1,0,Other,Internship,Not Applicable,Unknown,Marketing,0
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,Unknown,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,Not Applicable,Marketing and Advertising,Customer Service,0
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",Unknown,Unknown,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,Unknown,0,1,0,Unknown,Not Applicable,Not Applicable,Unknown,Unknown,0
3,Account Executive - Washington DC,"US, DC, Washington",Sales,Unknown,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,Bill Review Manager,"US, FL, Fort Worth",Unknown,Unknown,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


#### 3.3.2 For `job_df_model`

In [10]:
columnsToDrop = columnsToCombine + ["job_id"]

job_df_model.drop(columnsToDrop, axis=1, inplace=True)

In [11]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ..."
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,..."
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I..."
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was..."
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot..."


### 3.4 Processing `text` column

**Reference:**
- https://realpython.com/nltk-nlp-python/
- https://codeigo.com/python/remove-punctuation-with-nltk

#### 3.4.0 Install Dependencies and Import Library for NLP

**Install the `nltk` package and its data if not already done:**

In [150]:
# %pip install nltk==3.5
# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

**Import `nltk` library:**

In [12]:
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

#### 3.4.1 Character + Word + Sentence Count

**(a) Adding `character_count` column**

In [13]:
# Adding a new column - character_count (number of characters in each text)
job_df_model["character_count"] = job_df_model["text"].str.len()

In [14]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text,character_count
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ...",2720
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,...",6227
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I...",2662
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was...",5558
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot...",4060


**(b) Adding `word_count` column**

Note: This is an estimate obtained with a tokenizer; punctuation is removed here.
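To see why the count is only an estimate, the pattern can be tried on a short string. The sketch below uses the standard library's `re.findall` with the same regular expression, on the assumption that a pattern-based tokenizer returns the same non-overlapping matches; the sample sentence is made up.

```python
import re

# Made-up sample with contractions and a missing space after a full stop
sample_text = "We're Food52, and we've created a site.We have a top team!"

# Runs of ASCII letters and digits count as words; everything else is a separator
word_pattern = re.compile(r"[a-zA-Z0-9]+")
tokens = word_pattern.findall(sample_text)
word_count = len(tokens)

print(tokens)
print(word_count)
```

Contractions such as "We're" are split into two tokens ("We", "re"), so the count is slightly higher than the number of words a human reader would count.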

In [15]:
# Create a tokenizer based on a regular expression.
# "[a-zA-Z0-9]+" matches runs of ASCII letters and digits, so punctuation is dropped
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]+")

# Adding a new column - word_count (number of tokens in each text)
job_df_model["word_count"] = job_df_model["text"].apply(lambda text: len(tokenizer.tokenize(text)))

In [16]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text,character_count,word_count
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ...",2720,411
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,...",6227,944
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I...",2662,387
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was...",5558,752
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot...",4060,510


**(c) Adding `sentence_count` column**

In [17]:
# Adding a new column - sentence_count (number of sentences found in each text)
job_df_model["sentence_count"] = job_df_model["text"].apply(lambda text: len(sent_tokenize(text)))

In [18]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text,character_count,word_count,sentence_count
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ...",2720,411,7
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,...",6227,944,25
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I...",2662,387,8
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was...",5558,752,9
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot...",4060,510,16


#### 3.4.2 Tokenizing

Note: Punctuation is removed here too.

> By tokenizing, you can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It’s your first step in turning unstructured data into structured data, which is easier to analyze.

In [19]:
# Create a tokenizer based on a regular expression.
# "[a-zA-Z0-9]+" matches runs of ASCII letters and digits
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]+")

# Creating tokens & Adding a new column - tokens
job_df_model["tokens"] = job_df_model.apply(lambda row: tokenizer.tokenize(row["text"]), axis=1)

In [20]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text,character_count,word_count,sentence_count,tokens
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ...",2720,411,7,"[Marketing, Intern, US, NY, New, York, Marketi..."
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,...",6227,944,25,"[Customer, Service, Cloud, Video, Production, ..."
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I...",2662,387,8,"[Commissioning, Machinery, Assistant, CMA, US,..."
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was...",5558,752,9,"[Account, Executive, Washington, DC, US, DC, W..."
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot...",4060,510,16,"[Bill, Review, Manager, US, FL, Fort, Worth, S..."


#### 3.4.3 Filtering Stop Words

> Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.
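The filtering step itself is a simple membership test. Here is a minimal sketch with a hand-written stop-word set (`stop_words` below is a hypothetical mini list for illustration; the notebook uses NLTK's English list instead).

```python
# Hypothetical mini stop-word list, all lowercase
stop_words = {"in", "is", "an", "the", "a", "we", "and"}

# Made-up tokens with mixed casing
tokens = ["We", "are", "hiring", "an", "Intern", "in", "THE", "Marketing", "team"]

# casefold() makes the comparison case-insensitive;
# the surviving tokens keep their original casing
filtered = [word for word in tokens if word.casefold() not in stop_words]
print(filtered)
```

Using a `set` for the stop words keeps each lookup cheap, which matters when the filter runs over every token of every posting.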

In [21]:
from nltk.corpus import stopwords

# Generating English Stopwords
stop_words = set(stopwords.words("english"))

# Removing Stopwords & Adding a new column - tokens_without_stopwords
job_df_model["tokens_without_stopwords"] = job_df_model.apply(lambda row: [word for word in row["tokens"] if word.casefold() not in stop_words], axis=1)

In [22]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text,character_count,word_count,sentence_count,tokens,tokens_without_stopwords
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ...",2720,411,7,"[Marketing, Intern, US, NY, New, York, Marketi...","[Marketing, Intern, US, NY, New, York, Marketi..."
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,...",6227,944,25,"[Customer, Service, Cloud, Video, Production, ...","[Customer, Service, Cloud, Video, Production, ..."
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I...",2662,387,8,"[Commissioning, Machinery, Assistant, CMA, US,...","[Commissioning, Machinery, Assistant, CMA, US,..."
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was...",5558,752,9,"[Account, Executive, Washington, DC, US, DC, W...","[Account, Executive, Washington, DC, US, DC, W..."
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot...",4060,510,16,"[Bill, Review, Manager, US, FL, Fort, Worth, S...","[Bill, Review, Manager, US, FL, Fort, Worth, S..."


#### 3.4.4 Tagging `tokens_without_stopwords`

> Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

In [23]:
from nltk import pos_tag

# Tagging tokens_without_stopwords & Adding a new column - tokens_tagged
job_df_model["tokens_tagged"] = job_df_model.apply(lambda row: pos_tag(row["tokens_without_stopwords"]), axis=1)

In [24]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text,character_count,word_count,sentence_count,tokens,tokens_without_stopwords,tokens_tagged
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ...",2720,411,7,"[Marketing, Intern, US, NY, New, York, Marketi...","[Marketing, Intern, US, NY, New, York, Marketi...","[(Marketing, VBG), (Intern, NNP), (US, NNP), (..."
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,...",6227,944,25,"[Customer, Service, Cloud, Video, Production, ...","[Customer, Service, Cloud, Video, Production, ...","[(Customer, NNP), (Service, NNP), (Cloud, NNP)..."
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I...",2662,387,8,"[Commissioning, Machinery, Assistant, CMA, US,...","[Commissioning, Machinery, Assistant, CMA, US,...","[(Commissioning, VBG), (Machinery, NNP), (Assi..."
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was...",5558,752,9,"[Account, Executive, Washington, DC, US, DC, W...","[Account, Executive, Washington, DC, US, DC, W...","[(Account, NNP), (Executive, NNP), (Washington..."
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot...",4060,510,16,"[Bill, Review, Manager, US, FL, Fort, Worth, S...","[Bill, Review, Manager, US, FL, Fort, Worth, S...","[(Bill, NNP), (Review, NNP), (Manager, NNP), (..."


#### 3.4.5 Lemmatizing

> Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word

Note: Each token is lemmatized according to its POS tag.
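The tag-based choice boils down to mapping Penn Treebank tags (as produced by a POS tagger) to the single-letter parts of speech that a WordNet lemmatizer expects. The helper below is a hypothetical sketch of that mapping only; it does not call NLTK, and the tagged pairs are made up.

```python
def treebank_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("VB"):   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("JJ"):   # adjectives: JJ, JJR, JJS
        return "a"
    if tag.startswith("RB"):   # adverbs: RB, RBR, RBS
        return "r"
    return "n"                 # everything else is treated as a noun


# Made-up (word, tag) pairs
tagged = [("Marketing", "VBG"), ("Intern", "NNP"), ("quickly", "RB"), ("better", "JJR")]
pos_letters = [(word, treebank_to_wordnet_pos(tag)) for word, tag in tagged]
print(pos_letters)
```

Falling back to noun for all other tags mirrors the lemmatizer's default part of speech.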

In [25]:
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer to use
lemmatizer = WordNetLemmatizer()


def lemmatizeFunction(tagged_word):
    word, tag = tagged_word[0], tagged_word[1]

    # ADJ, ADV, NOUN, VERB = "a", "r", "n", "v"
    if tag.startswith("VB"):
        return lemmatizer.lemmatize(word, pos="v")
    elif tag.startswith("JJ"):
        return lemmatizer.lemmatize(word, pos="a")
    elif tag.startswith("RB"):
        return lemmatizer.lemmatize(word, pos="r")

    return lemmatizer.lemmatize(word)

# Lemmatizing & Adding a new column - lemma
job_df_model["lemma"] = job_df_model.apply(lambda row: [lemmatizeFunction(tagged_word) for tagged_word in row["tokens_tagged"]], axis=1)

In [26]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text,character_count,word_count,sentence_count,tokens,tokens_without_stopwords,tokens_tagged,lemma
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ...",2720,411,7,"[Marketing, Intern, US, NY, New, York, Marketi...","[Marketing, Intern, US, NY, New, York, Marketi...","[(Marketing, VBG), (Intern, NNP), (US, NNP), (...","[Marketing, Intern, US, NY, New, York, Marketi..."
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,...",6227,944,25,"[Customer, Service, Cloud, Video, Production, ...","[Customer, Service, Cloud, Video, Production, ...","[(Customer, NNP), (Service, NNP), (Cloud, NNP)...","[Customer, Service, Cloud, Video, Production, ..."
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I...",2662,387,8,"[Commissioning, Machinery, Assistant, CMA, US,...","[Commissioning, Machinery, Assistant, CMA, US,...","[(Commissioning, VBG), (Machinery, NNP), (Assi...","[Commissioning, Machinery, Assistant, CMA, US,..."
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was...",5558,752,9,"[Account, Executive, Washington, DC, US, DC, W...","[Account, Executive, Washington, DC, US, DC, W...","[(Account, NNP), (Executive, NNP), (Washington...","[Account, Executive, Washington, DC, US, DC, W..."
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot...",4060,510,16,"[Bill, Review, Manager, US, FL, Fort, Worth, S...","[Bill, Review, Manager, US, FL, Fort, Worth, S...","[(Bill, NNP), (Review, NNP), (Manager, NNP), (...","[Bill, Review, Manager, US, FL, Fort, Worth, S..."


#### 3.4.6 Converting `lemma` into `processed_text` column

In [29]:
# Converting lemmas into a new column - processed_text
job_df_model["processed_text"] = job_df_model.apply(lambda row: " ".join(row["lemma"]), axis=1)

In [30]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,text,character_count,word_count,sentence_count,tokens,tokens_without_stopwords,tokens_tagged,lemma,processed_text
0,0,1,0,0,"Marketing Intern US, NY, New York Marketing ...",2720,411,7,"[Marketing, Intern, US, NY, New, York, Marketi...","[Marketing, Intern, US, NY, New, York, Marketi...","[(Marketing, VBG), (Intern, NNP), (US, NNP), (...","[Marketing, Intern, US, NY, New, York, Marketi...",Marketing Intern US NY New York Marketing Food...
1,0,1,0,0,"Customer Service - Cloud Video Production NZ,...",6227,944,25,"[Customer, Service, Cloud, Video, Production, ...","[Customer, Service, Cloud, Video, Production, ...","[(Customer, NNP), (Service, NNP), (Cloud, NNP)...","[Customer, Service, Cloud, Video, Production, ...",Customer Service Cloud Video Production NZ Auc...
2,0,1,0,0,"Commissioning Machinery Assistant (CMA) US, I...",2662,387,8,"[Commissioning, Machinery, Assistant, CMA, US,...","[Commissioning, Machinery, Assistant, CMA, US,...","[(Commissioning, VBG), (Machinery, NNP), (Assi...","[Commissioning, Machinery, Assistant, CMA, US,...",Commissioning Machinery Assistant CMA US IA We...
3,0,1,0,0,"Account Executive - Washington DC US, DC, Was...",5558,752,9,"[Account, Executive, Washington, DC, US, DC, W...","[Account, Executive, Washington, DC, US, DC, W...","[(Account, NNP), (Executive, NNP), (Washington...","[Account, Executive, Washington, DC, US, DC, W...",Account Executive Washington DC US DC Washingt...
4,0,1,1,0,"Bill Review Manager US, FL, Fort Worth Spot...",4060,510,16,"[Bill, Review, Manager, US, FL, Fort, Worth, S...","[Bill, Review, Manager, US, FL, Fort, Worth, S...","[(Bill, NNP), (Review, NNP), (Manager, NNP), (...","[Bill, Review, Manager, US, FL, Fort, Worth, S...",Bill Review Manager US FL Fort Worth SpotSourc...


#### 3.4.7 Removing intermediate columns created

In [31]:
job_df_model.drop(["text", "tokens", "tokens_without_stopwords", "tokens_tagged"], axis=1, inplace=True)

In [32]:
job_df_model.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,character_count,word_count,sentence_count,lemma,processed_text
0,0,1,0,0,2720,411,7,"[Marketing, Intern, US, NY, New, York, Marketi...",Marketing Intern US NY New York Marketing Food...
1,0,1,0,0,6227,944,25,"[Customer, Service, Cloud, Video, Production, ...",Customer Service Cloud Video Production NZ Auc...
2,0,1,0,0,2662,387,8,"[Commissioning, Machinery, Assistant, CMA, US,...",Commissioning Machinery Assistant CMA US IA We...
3,0,1,0,0,5558,752,9,"[Account, Executive, Washington, DC, US, DC, W...",Account Executive Washington DC US DC Washingt...
4,0,1,1,0,4060,510,16,"[Bill, Review, Manager, US, FL, Fort, Worth, S...",Bill Review Manager US FL Fort Worth SpotSourc...


## 4) Export generated dataset

- `fake_job_postings_eda.csv` is a dataset for EDA and has all its NA values replaced with meaningful placeholder values.
- `fake_job_postings_model.csv` is a dataset for modelling. It replaces the text-related columns with a `processed_text` column (the combined text after NLP processing), together with the `lemma` list and the `character_count`, `word_count` and `sentence_count` columns. The intermediate `text` column is not exported.

In [33]:
job_df_eda.to_csv('datasets/fake_job_postings_eda.csv', index=False)
job_df_model.to_csv('datasets/fake_job_postings_model.csv', index=False)
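One thing to keep in mind when loading the modelling file again: a column holding Python lists (such as `lemma`) is written to CSV as text, so it comes back as strings rather than lists. The sketch below shows the round trip on a made-up frame using an in-memory buffer, and one possible way to restore the lists with `ast.literal_eval` (`toy_df` and its values are hypothetical).

```python
import ast
import io

import pandas as pd

# Hypothetical frame with a list-valued column
toy_df = pd.DataFrame({
    "lemma": [["Marketing", "Intern"], ["Bill", "Review", "Manager"]],
    "processed_text": ["Marketing Intern", "Bill Review Manager"],
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the column holds strings, not lists
type_after_reload = type(reloaded.loc[0, "lemma"])
print(type_after_reload)

# literal_eval parses the string representation back into a Python list
reloaded["lemma"] = reloaded["lemma"].apply(ast.literal_eval)
print(reloaded.loc[0, "lemma"])
```

If only `processed_text` is needed downstream, the `lemma` column can simply be ignored on load.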

## 5) Overview of datasets created from this notebook

Datasets created from this notebook:

    .
    └── fake_job_postings.csv                # original dataset
        ├── fake_job_postings_eda.csv        # for EDA and visualization
        └── fake_job_postings_model.csv      # for analytical models (Combined columns + NLP related columns)

 