# Data Cleaning (Pinoy Jobs)

## Imports used

* `os` - provides functions for interacting with the operating system.
* `pandas` - data analysis library built around the DataFrame.
* `numpy` - library of functions for working with arrays and matrices.
* `json` - reads and writes JSON data.
* `re` - short for Regular Expressions; finds patterns in strings so they can be extracted or rewritten.
* `gensim` - library for topic modelling that efficiently handles large, unstructured text collections.
* `nltk` - short for Natural Language Toolkit; provides tools for processing human-language text.
* `requests` - sends HTTP requests.
* `seaborn` - visualization library for drawing informative statistical graphs.
* `calplot` - draws calendar heatmaps from time-series data.
* `matplotlib.pyplot` - plotting interface used to create and customize figures.
* `math` - standard library module of mathematical functions.
* `gensim.utils` `simple_preprocess` - preprocesses text by lower-casing it and splitting it into tokens (tokenizing).
* `gensim.parsing.preprocessing` `STOPWORDS` - set of stop words: common words that carry little meaning and are often removed in pre-processing.
* `gensim` `corpora` - used to build dictionaries and corpora from tokenized text.
* `gensim` `models` - used for topic modelling and model training.
* `nltk.stem` `WordNetLemmatizer` - reduces words to their base form (lemma), so different inflections of the same word are grouped together.
* `bs4` `BeautifulSoup` - parses HTML scraped from websites.
* `datetime` `datetime` - class that represents a date and time. Used to convert date strings into datetime objects with a month, day, and year.
* `datetime` `timedelta` - used to compute the posting date from the scrape time when a listing was posted some minutes, hours, days, or weeks ago.
* `dateutil.relativedelta` `relativedelta` - used to compute the posting date from the scrape time when a listing was posted some months or years ago.

In [18]:
import os
import pandas as pd
import numpy as np
import json
import re
import gensim
import nltk
import requests
import datetime
import seaborn as sns
import calplot
import matplotlib.pyplot as plt
import math

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from datetime import datetime
from datetime import timedelta
from dateutil.relativedelta import relativedelta

today = datetime.today()

### Importing JSON File

Load the previously created JSON file into a DataFrame and preview it to confirm that the import works.

In [19]:
#Read from Json File 
pinoyjobs_df_json = pd.read_json (r'PinoyJobs Data\pinoy_jobs.json')
pinoyjobs_df_json

Unnamed: 0,Website,Job Title,Category,Company,Date Posted,Location,Status,Salary,Education,Years of Work Experience,Job Description
0,Pinoy Jobs,Project Manager,"Jobs in Engineering, Construction & Electrical",C.M Pancho Construction Inc.,"Posted on January 17, 2020","The Forum Bldg., #71-A Scout Borromeo St. Brgy...",Full-time,"₱40,000 - ₱80,000",Not Specified / In Description,Not Specified / In Description,Description: Responsible for the performance a...
1,Pinoy Jobs,Project Engineer,"Jobs in Engineering, Construction & Electrical",C.M Pancho Construction Inc.,"Posted on January 17, 2020","The Forum Bldg., #71-A Scout Borromeo St. Brgy...",Full-time,"₱25,000 - ₱45,000",Not Specified / In Description,Not Specified / In Description,Description: Over all supervision of field act...
2,Pinoy Jobs,CAD Sketch up Operator,"Jobs in Engineering, Construction & Electrical",BestBuilders Inc.,"Posted on September 17, 2019","Unit 7 3rd Floor CVA Bldg. National Rd, Putata...",Full-time,₱15000 - ₱20000,Not Specified / In Description,Not Specified / In Description,Description: Design and draft CAD (computer-ai...
3,Pinoy Jobs,Mechanical / Painting Technician,"Jobs in Engineering, Construction & Electrical",Aero Auto Metal Products LLC,"Posted on September 12, 2019","Abudhabi, United Arab Emirates",Full-time,₱35000 - ₱65000,Not Specified / In Description,Not Specified / In Description,Description: Assembly/Mechanical TechnicianExe...
4,Pinoy Jobs,Project Engineer – Mechanical,"Jobs in Engineering, Construction & Electrical",JAM Industrial Sales,"Posted on September 10, 2019","2946 Molave Street Tondo, Manila",Full-time,₱15000 - ₱20000,Not Specified / In Description,Not Specified / In Description,Description: Assessing project requirementsMea...
...,...,...,...,...,...,...,...,...,...,...,...
3735,Pinoy Jobs,Online Marketing Manager – SEO Manager,"Jobs in Web Development & Design, HTML, SEO",Placidway,"Posted on June 19, 2015",homebased/virtual,full time,negotiable,Not Specified / In Description,Not Specified / In Description,We are looking for a result-oriented and self-...
3736,Pinoy Jobs,Web Developer,"Jobs in Web Development & Design, HTML, SEO",4th SHift Global Inc.,"Posted on June 5, 2015","919/F Trafalgar Plaza, HV Dela Costa st., Salc...",Full time,,Not Specified / In Description,Not Specified / In Description,JOB SUMMARYResponsible for the development an...
3737,Pinoy Jobs,Web Designer / Graphics Designer ***Work from ...,"Jobs in Web Development & Design, HTML, SEO",CONFIDENTIAL,"Posted on May 28, 2015",Located in USA but Work From Home!,Full Time - HomeBased,Negotiable,Not Specified / In Description,Not Specified / In Description,Candidate must be willing to work US Eastern S...
3738,Pinoy Jobs,Web Designer/Developer (WordPress),"Jobs in Web Development & Design, HTML, SEO",Hyper6,"Posted on May 26, 2015",Work From Home,Full Time,Negotiable,Not Specified / In Description,Not Specified / In Description,"Our company is seeking a very talented, full-t..."


In [20]:
pinoyjobs_df_json["Category"].unique()

array(['Jobs in Engineering, Construction & Electrical',
       'Jobs in IT, Programming, Systems & Networks',
       'Jobs in Manufacturing, Production',
       'Jobs in Nursing, Medical, Dental & Health',
       'Jobs in Sciences, Lab, R&D',
       'Jobs in Web Development & Design, HTML, SEO'], dtype=object)

### Getting the Date Posted
The Date Posted column of the PinoyJobs dataframe holds strings such as "Posted on January 17, 2020" rather than dates, so we parse them into datetime values, which display as YYYY-MM-DD.

In [21]:
#Converts the Date Format (PinoyJobs)
new_date_posted = []        
for index, row in pinoyjobs_df_json.iterrows():
    then = datetime.strptime(row["Date Posted"], 'Posted on %B %d, %Y')
    new_date_posted.append(then)
pinoyjobs_df_json["Date Posted"] = new_date_posted
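The same conversion can also be done on the whole column at once. The following is a minimal sketch, assuming the strings always follow the "Posted on Month D, YYYY" pattern; the `sample` series is made up for illustration, and `errors="coerce"` is an added choice that turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical sample mirroring the raw "Date Posted" strings.
sample = pd.Series([
    "Posted on January 17, 2020",
    "Posted on June 5, 2015",
    "not a date",  # deliberately malformed
])

# Parse every value in one call; malformed values become NaT.
parsed = pd.to_datetime(sample, format="Posted on %B %d, %Y", errors="coerce")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Whether to coerce or to fail loudly depends on how much the scraped strings can be trusted; the loop above raises on the first bad value.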

### Salary

Each value in the Salary column is a range with two components: a minimum and a maximum salary. We split these into two separate columns, Min Salary and Max Salary. Values without numbers, such as "Negotiable", are kept as text, and blank values become "Not Specified".
### Getting the Min Salary

In [22]:
#Gets the MIN salary (PinoyJobs)
def salary_seperatorinator_MIN_PJ(salary):
    if not len(salary):
        salary = "Not Specified"
    str2 = (salary.replace('₱', ''))
    str3 = (str2.replace(',', ''))
    stroutput = [int(s) for s in str3.split() if s.isdigit()]
    if not len(stroutput):
        return salary
    else:
        return stroutput[0]

### Getting the Max Salary

In [23]:
#Gets the MAX salary (PinoyJobs)
def salary_seperatorinator_MAX_PJ(salary):
    if not len(salary):
        salary = "Not Specified"
    str2 = (salary.replace('₱', ''))
    str3 = (str2.replace(',', ''))
    stroutput = [int(s) for s in str3.split() if s.isdigit()]
    if not len(stroutput):
        return salary
    else:
        try:
            return stroutput[1]
        except IndexError:
            # Only one number was found, so keep the original text
            return salary
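As an illustration of the splitting logic, here is a minimal sketch that returns both bounds from a single helper. The name `split_salary` is hypothetical and not part of the notebook. It also differs from the functions above in two ways, both assumptions: it finds digits with a regular expression, so ranges written without spaces around the dash still parse, and a single figure is used for both bounds.

```python
import re

def split_salary(salary):
    """Return (min, max) for a salary string; hypothetical helper."""
    # Drop thousands separators, then collect every run of digits.
    numbers = [int(n) for n in re.findall(r"\d+", salary.replace(",", ""))]
    if not numbers:
        # Keep free text such as "Negotiable"; label blank values.
        label = salary.strip() or "Not Specified"
        return label, label
    # A single figure serves as both bounds.
    return numbers[0], numbers[-1]

print(split_salary("₱40,000 - ₱80,000"))
print(split_salary("Negotiable"))
```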

### Getting the Years of Experience from Description

In [24]:
#Trying to get the years of experience from description
def find_experienceinator(description):
    yearoutput = [int(s) for s in description.split() if s.isdigit()]
    if len(yearoutput) > 0:
        if (yearoutput[0] < 20):
            if len(yearoutput) > 1:
                return ("{0} - {1}").format(yearoutput[0],yearoutput[1])
            elif len(yearoutput) == 1:
                return ("{}").format(yearoutput[0])
            else:
                return "Not Specified"
        else:
            return "Not Specified"
    else:
        return "Not Specified"

### Getting the Min Years

As with the salary, the years of experience sometimes have two components, which we separate into Min Years of Work Experience and Max Years of Work Experience. For the minimum, we take the first number found; if only one number is found, we use that number.

In [25]:
#Gets the MIN year (PinoyJobs)
def year_seperatorinator_MIN_PJ(year_exp):
    if not len(year_exp):
        year_exp = "Not Specified"
    str2 = (year_exp.replace('-', ' '))
    str3 = (str2.replace(',', ' '))
    stroutput = [int(s) for s in str3.split() if s.isdigit()]
    if not len(stroutput):
        return year_exp
    else:
        try:
            return stroutput[0]
        except:
            return year_exp

### Getting the Max Years
For the maximum, we again look at the numbers in the years of experience: if there are two, we take the second; if there is only one, we use that number as the maximum as well.

In [26]:
#Gets the MAX year (PinoyJobs)
def year_seperatorinator_MAX_PJ(year_exp):
    if not len(year_exp):
        year_exp = "Not Specified"
    str2 = (year_exp.replace('-', ' '))
    str3 = (str2.replace(',', ''))
    stroutput = [int(s) for s in str3.split() if s.isdigit()]
    if not len(stroutput):
        return year_exp
    else:
        try:
            return stroutput[1]
        except IndexError:
            # Only one number was found, so it is also the maximum
            return stroutput[0]

### Getting Educational Attainment

The job description sometimes states the required educational attainment.

We search the description for three keywords: "Bachelor", "Degree", and "BS". If any keyword is found, Education is set to "Bachelor's Degree"; otherwise it is set to "Not Specified". The match is a case-sensitive substring check.

In [27]:
def find_education(description):
    eduoutput = (description.replace('/', ''))
    education_list = ['Bachelor','Degree','BS']
    if any(x in eduoutput for x in education_list):
        return "Bachelor's Degree"
    else:
        return "Not Specified"
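A plain substring check also matches "BS" inside longer words such as "JOBS", and it misses "BACHELOR" or "degree" in other capitalizations. The sketch below shows one possible stricter variant using word boundaries. The name `find_education_strict` and the exact pattern are assumptions for illustration, and unlike the function above it does not strip slashes first.

```python
import re

# "Bachelor"/"Degree" match in any capitalization; "BS" must be an
# upper-case standalone word so it does not match inside "JOBS".
EDUCATION_PATTERN = re.compile(r"(?i:\b(?:bachelors?|degrees?)\b)|\bBS\b")

def find_education_strict(description):
    """Hypothetical stricter variant of the keyword search."""
    if EDUCATION_PATTERN.search(description):
        return "Bachelor's Degree"
    return "Not Specified"

print(find_education_strict("BS Computer Science graduate"))
print(find_education_strict("We post JOBS daily"))
```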

### Getting the Work Experience in Years
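From the description, we extract the years of experience with regular expressions, tried in this order: a single-digit range such as "1-4 years", a number followed by three words, a number directly before "year", and finally a word directly before "year". The patterns are case-sensitive, so the last two only match a lower-case "year".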

In [28]:
def find_experienceinator_test(description):
    # 1) A single-digit range followed by a word, e.g. "1-4 years"
    ranges = re.findall(r"\d-\d \w+", description)
    if ranges:
        return ranges[0]

    # 2) A number followed by any three words, e.g. "10 years of experience"
    phrases = re.findall(r"\d+ \w+ \w+ \w+", description)
    if phrases:
        return phrases[0]

    # 3) A number directly before "year": return that number
    numbers = re.findall(r"(\d+) year", description)
    if numbers:
        return numbers[0]

    # 4) A word directly before "year", e.g. "one year": return that word
    words = re.findall(r"(\w+) year", description)
    if words:
        return words[0]
    return "Not Specified"
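For comparison, a pattern anchored to the word "year" in any capitalization can return the minimum and maximum directly. This is a minimal sketch, not the notebook's method: `YEARS_PATTERN` and `extract_years` are hypothetical names, and the accepted phrasings ("3 years", "1-4 Years", "2 to 3 YEARS", "5+ years") are assumptions about how listings are worded.

```python
import re

# A number, an optional "-N" or "to N" upper bound, an optional "+",
# then "year" or "years" in any capitalization.
YEARS_PATTERN = re.compile(
    r"(\d+)(?:\s*(?:-|to)\s*(\d+))?\s*\+?\s*years?", re.IGNORECASE
)

def extract_years(description):
    """Return (min_years, max_years); hypothetical helper."""
    match = YEARS_PATTERN.search(description)
    if match is None:
        return "Not Specified", "Not Specified"
    low = int(match.group(1))
    # Without an upper bound, the single number serves as both.
    high = int(match.group(2)) if match.group(2) else low
    return low, high

print(extract_years("At least 1-4 Years of related experience"))
```

Anchoring on "year" avoids picking up unrelated numbers, such as a street number followed by three words.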

### Getting the Employment Status

In [29]:
def status_cleanator_PJ(status):
    if (len(re.findall(r"Full",status, re.IGNORECASE)) > 0):
        return ("Full Time")
    elif(len(re.findall(r"Part",status, re.IGNORECASE)) > 0):
        return ("Part Time")
    elif(len(re.findall(r"Contract",status, re.IGNORECASE)) > 0):
        return ("Contract")
    elif(len(re.findall(r"Project",status, re.IGNORECASE)) > 0):
        return ("Project Base")    
    elif(len(re.findall(r"Freelance",status, re.IGNORECASE)) > 0):
        return ("Freelance")        
    elif(len(re.findall(r"OJT",status, re.IGNORECASE)) > 0):
        return ("OJT")    
    elif(len(re.findall(r"Intern",status, re.IGNORECASE)) > 0):
        return ("OJT")    
    elif(len(re.findall(r"Regular",status, re.IGNORECASE)) > 0):
        return ("Full Time")
    elif(len(re.findall(r"Temporary",status, re.IGNORECASE)) > 0):
        return ("Contract")   
    elif(len(re.findall(r"Permanent",status, re.IGNORECASE)) > 0):
        return ("Full Time")  
    else: return ("Not Specified")
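The keyword chain above can also be written as an ordered lookup table, which makes the priority of the keywords easier to see and extend. The names `STATUS_RULES` and `clean_status` below are hypothetical; this is a sketch of the same first-match-wins logic using a case-insensitive substring test.

```python
# Ordered (keyword, label) pairs; the first keyword found wins,
# mirroring the order of the if/elif chain.
STATUS_RULES = [
    ("Full", "Full Time"),
    ("Part", "Part Time"),
    ("Contract", "Contract"),
    ("Project", "Project Base"),
    ("Freelance", "Freelance"),
    ("OJT", "OJT"),
    ("Intern", "OJT"),
    ("Regular", "Full Time"),
    ("Temporary", "Contract"),
    ("Permanent", "Full Time"),
]

def clean_status(status):
    """Map a raw status string to a normalized label."""
    lowered = status.lower()
    for keyword, label in STATUS_RULES:
        if keyword.lower() in lowered:
            return label
    return "Not Specified"

print(clean_status("Full Time - HomeBased"))
```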

### Getting the Job Location

In [30]:
def location_cleanator_PJ(location):
    if (len(re.findall(r"Home",location, re.IGNORECASE)) > 0):
        return ("Work From Home")
    elif (len(re.findall(r"\w+ City",location, re.IGNORECASE)) > 0):
        output = (re.findall(r"\w+ City",location, re.IGNORECASE))
        return (output[0])
    elif (len(re.findall(r"Metro Manila",location, re.IGNORECASE)) > 0):
        return ("Metro Manila")
    elif (len(re.findall(r"Manila",location, re.IGNORECASE)) > 0):
        return ("Manila City")
    elif (len(re.findall(r"Laguna",location, re.IGNORECASE)) > 0):
        return ("Laguna")
    elif (len(re.findall(r"Pasig",location, re.IGNORECASE)) > 0):
        return ("Pasig City")
    elif (len(re.findall(r"Paranaque",location, re.IGNORECASE)) > 0):
        return ("Paranaque")
    elif (len(re.findall(r"Pampanga",location, re.IGNORECASE)) > 0):
        return ("Pampanga")
    elif (len(re.findall(r"Batangas",location, re.IGNORECASE)) > 0):
        return ("Batangas")
    elif (len(re.findall(r"Cavite",location, re.IGNORECASE)) > 0):
        return ("Cavite")
    else:
        return location

### Categorizing for Combined Dataset

* <a href="https://www.bestcolleges.com/careers/stem/">
    bestcolleges.com
</a> 

    - Careers from "Jobs in IT, Programming, Systems & Networks" and "Jobs in Web Development & Design, HTML, SEO" are classified as IT
    - Careers from "Jobs in Engineering, Construction & Electrical" and "Jobs in Manufacturing, Production" are classified as Engineering
    - Careers from "Jobs in Sciences, Lab, R&D" are classified as Science
    - Careers from "Jobs in Nursing, Medical, Dental & Health" are classified as Medicine

In [31]:
def field_deciderinator_PJ(field):
    if (len(re.findall(r"Jobs in IT, Programming, Systems & Networks",field, re.IGNORECASE)) > 0):
        return ("IT")
    elif (len(re.findall(r"Jobs in Engineering, Construction & Electrical",field, re.IGNORECASE)) > 0):
        return ("Engineering")
    elif (len(re.findall(r"Jobs in Manufacturing, Production",field, re.IGNORECASE)) > 0):
        return ("Engineering")
    elif (len(re.findall(r"Jobs in Nursing, Medical, Dental & Health",field, re.IGNORECASE)) > 0):
        return ("Medicine")
    elif (len(re.findall(r"Jobs in Web Development & Design, HTML, SEO",field, re.IGNORECASE)) > 0):
        return ("IT")
    elif (len(re.findall(r"Jobs in Sciences, Lab, R&D",field, re.IGNORECASE)) > 0):
        return ("Science")
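Because the categories are matched as whole strings, the mapping can also be expressed as a dictionary. In this sketch `FIELD_BY_CATEGORY` and `decide_field` are hypothetical names, and the "Not Specified" fallback for unknown categories is an assumption; the function above returns `None` in that case.

```python
# Category-to-field mapping taken from the classification listed above.
FIELD_BY_CATEGORY = {
    "Jobs in IT, Programming, Systems & Networks": "IT",
    "Jobs in Web Development & Design, HTML, SEO": "IT",
    "Jobs in Engineering, Construction & Electrical": "Engineering",
    "Jobs in Manufacturing, Production": "Engineering",
    "Jobs in Nursing, Medical, Dental & Health": "Medicine",
    "Jobs in Sciences, Lab, R&D": "Science",
}

def decide_field(category):
    """Look up the field; unknown categories get an explicit label."""
    return FIELD_BY_CATEGORY.get(category, "Not Specified")

print(decide_field("Jobs in Sciences, Lab, R&D"))
```

With pandas, the same dictionary could be passed to `Series.map` to convert the whole Category column in one call.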

### Applying Functions

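Apply each cleaning function to its corresponding column, then drop the original Salary and Years of Work Experience columns, which the new Min and Max columns replace.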

In [32]:
pinoyjobs_df_json["Min Salary"]= pinoyjobs_df_json["Salary"].apply(salary_seperatorinator_MIN_PJ) 
pinoyjobs_df_json["Max Salary"]= pinoyjobs_df_json["Salary"].apply(salary_seperatorinator_MAX_PJ) 
pinoyjobs_df_json["Years of Work Experience"] = pinoyjobs_df_json["Job Description"].apply(find_experienceinator_test)
pinoyjobs_df_json["Education"] = pinoyjobs_df_json["Job Description"].apply(find_education)
pinoyjobs_df_json["Min Years of Work Experience"]= pinoyjobs_df_json["Years of Work Experience"].apply(year_seperatorinator_MIN_PJ) 
pinoyjobs_df_json["Max Years of Work Experience"]= pinoyjobs_df_json["Years of Work Experience"].apply(year_seperatorinator_MAX_PJ) 
pinoyjobs_df_json["Status"]= pinoyjobs_df_json["Status"].apply(status_cleanator_PJ) 
pinoyjobs_df_json["Location"]= pinoyjobs_df_json["Location"].apply(location_cleanator_PJ) 
pinoyjobs_df_json["Field"]= pinoyjobs_df_json["Category"].apply(field_deciderinator_PJ) 
pinoyjobs_df_json.drop("Salary", inplace=True, axis=1)
pinoyjobs_df_json.drop("Years of Work Experience", inplace=True, axis=1)

### Pinoy Jobs Dataset Cleaned
The cleaned dataset contains the following variables:
- `Website` - the site the listing was scraped from
- `Job Title` - the title of the job position
- `Category` - the type of the job, or job category
- `Company` - the employer
- `Date Posted` - date when the listing was posted on the site
- `Location` - location where the applicants are to be deployed
- `Status` - employment status, such as Full Time, Part Time, or Contract (or Not Specified)
- `Education` - educational attainment found in the description ("Bachelor's Degree" or "Not Specified")
- `Job Description` - detailed description of the job listing
- `Min Salary` - minimum of the monetary compensation range in Philippine Peso (PHP)
- `Max Salary` - maximum of the monetary compensation range in Philippine Peso (PHP)
- `Min Years of Work Experience` - minimum years of experience required
- `Max Years of Work Experience` - maximum years of experience required
- `Field` - the STEM field the category was classified under


In [33]:
pinoyjobs_df_json

Unnamed: 0,Website,Job Title,Category,Company,Date Posted,Location,Status,Education,Job Description,Min Salary,Max Salary,Min Years of Work Experience,Max Years of Work Experience,Field
0,Pinoy Jobs,Project Manager,"Jobs in Engineering, Construction & Electrical",C.M Pancho Construction Inc.,2020-01-17,Quezon City,Full Time,Not Specified,Description: Responsible for the performance a...,40000,80000,10,10,Engineering
1,Pinoy Jobs,Project Engineer,"Jobs in Engineering, Construction & Electrical",C.M Pancho Construction Inc.,2020-01-17,Quezon City,Full Time,Not Specified,Description: Over all supervision of field act...,25000,45000,7,7,Engineering
2,Pinoy Jobs,CAD Sketch up Operator,"Jobs in Engineering, Construction & Electrical",BestBuilders Inc.,2019-09-17,Muntinlupa City,Full Time,Not Specified,Description: Design and draft CAD (computer-ai...,15000,20000,2,2,Engineering
3,Pinoy Jobs,Mechanical / Painting Technician,"Jobs in Engineering, Construction & Electrical",Aero Auto Metal Products LLC,2019-09-12,"Abudhabi, United Arab Emirates",Full Time,Bachelor's Degree,Description: Assembly/Mechanical TechnicianExe...,35000,65000,Not Specified,Not Specified,Engineering
4,Pinoy Jobs,Project Engineer – Mechanical,"Jobs in Engineering, Construction & Electrical",JAM Industrial Sales,2019-09-10,Manila City,Full Time,Bachelor's Degree,Description: Assessing project requirementsMea...,15000,20000,1,4,Engineering
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3735,Pinoy Jobs,Online Marketing Manager – SEO Manager,"Jobs in Web Development & Design, HTML, SEO",Placidway,2015-06-19,Work From Home,Full Time,Not Specified,We are looking for a result-oriented and self-...,negotiable,negotiable,Not Specified,Not Specified,IT
3736,Pinoy Jobs,Web Developer,"Jobs in Web Development & Design, HTML, SEO",4th SHift Global Inc.,2015-06-05,Makati City,Full Time,Not Specified,JOB SUMMARYResponsible for the development an...,Not Specified,Not Specified,3,5,IT
3737,Pinoy Jobs,Web Designer / Graphics Designer ***Work from ...,"Jobs in Web Development & Design, HTML, SEO",CONFIDENTIAL,2015-05-28,Work From Home,Full Time,Not Specified,Candidate must be willing to work US Eastern S...,Negotiable,Negotiable,Not Specified,Not Specified,IT
3738,Pinoy Jobs,Web Designer/Developer (WordPress),"Jobs in Web Development & Design, HTML, SEO",Hyper6,2015-05-26,Work From Home,Full Time,Not Specified,"Our company is seeking a very talented, full-t...",Negotiable,Negotiable,Not Specified,Not Specified,IT


### Parsing to CSV File

Store the cleaned data in a CSV file.

In [34]:
pinoyjobs_df_json.to_csv(r'Cleaned Data CSV\pinoyjobs_clean.csv', index=False)
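The backslash in the path above only works on Windows, and the target folder must already exist. A portable alternative is to build the path with `pathlib`. The sketch below uses a made-up one-row `sample` frame and a temporary folder as a stand-in for the project directory, both assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the cleaned DataFrame and project folder.
sample = pd.DataFrame({"Job Title": ["Web Developer"], "Field": ["IT"]})
base = Path(tempfile.mkdtemp())

# Create the output folder if needed, then join the path portably.
output_dir = base / "Cleaned Data CSV"
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "pinoyjobs_clean.csv"

sample.to_csv(output_path, index=False)

# Read the file back to confirm that it was written as expected.
round_trip = pd.read_csv(output_path)
print(round_trip)
```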