# Job Portal - Pinoy Jobs

Web scraping https://pinoyjobs.ph/ using Beautiful Soup.
    
### Fetch Data from the Website's Jobs in the 'IT, Programming, Systems & Networks' Category (for testing)

For the Pinoy Jobs dataset, Beautiful Soup is used to extract the raw HTML from the page at the specified URL. As a test, only the listing page of one category the group is interested in is fetched. The returned HTML contains a list of job results, which are examined further in the following sections.

## Imports used

* `os` - module that provides functions for interacting with the operating system.
* `pandas` - library for analyzing and manipulating tabular data.
* `numpy` - library of functions for working with arrays and matrices.
* `json` - module for reading and writing JSON data.
* `re` - short for regular expressions; recognizes patterns in strings and is used to restructure them.
* `gensim` - library that efficiently handles large collections of unstructured text.
* `nltk` - short for Natural Language Toolkit; a library for processing human language data.
* `requests` - library for sending HTTP requests.
* `seaborn` - library for visualizing data through informative statistical graphs.
* `math` - standard module that provides common mathematical functions.
* `gensim.utils` `simple_preprocess` - preprocesses text by lower-casing it and splitting it into a list of tokens.
* `gensim.parsing.preprocessing` `STOPWORDS` - set of stop words, common words that carry little meaning and are often removed in pre-processing.
* `gensim` `corpora` - used to build dictionaries and corpora from tokenized text.
* `gensim` `models` - used for topic modelling and model training.
* `nltk.stem` `WordNetLemmatizer` - reduces words to their base form (lemma) so that inflected forms of the same word are grouped together.
* `bs4` `BeautifulSoup` - library for parsing HTML scraped from websites.
* `datetime` `datetime` - class that represents a date and time. Used to convert date strings into a month, day, and year format.
* `datetime` `timedelta` - used to compute the posting date from the scrape time when the date is given as minutes, hours, days, or weeks ago.
* `dateutil.relativedelta` `relativedelta` - used to compute the posting date from the scrape time when the date is given as months or years ago.
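
The last two imports exist because posting dates are reported relative to the time of scraping. A minimal sketch of that conversion follows, assuming the date text looks like `3 days ago` (the exact wording on the site is an assumption, and `parse_time_ago` is a hypothetical helper name). Only the units that `timedelta` supports are handled; months and years are left out because they would need `relativedelta`.

```python
import re
from datetime import datetime, timedelta

# Units that timedelta accepts directly as keyword arguments.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}

def parse_time_ago(text, scraped_at):
    """Convert text such as '3 days ago' into an absolute datetime.

    Returns None when the text does not match the assumed pattern.
    """
    match = re.search(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", text.lower())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return scraped_at - timedelta(**{_UNITS[unit]: amount})

# A fixed scrape time keeps the example reproducible.
scraped_at = datetime(2021, 6, 15, 12, 0)
posted = parse_time_ago("Posted 3 days ago", scraped_at)
print(posted.date().isoformat())  # 2021-06-12
```

Passing the scrape time in as an argument, rather than calling `datetime.today()` inside the helper, keeps the result stable when the same data is processed again later.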

In [1]:
import os
import pandas as pd
import numpy as np
import json
import re
import gensim
import nltk
import requests
import seaborn as sns
import math

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from datetime import datetime
from datetime import timedelta
from dateutil.relativedelta import relativedelta

today = datetime.today()

In [None]:
URL = "https://pinoyjobs.ph/job-hiring/category/it-programming-systems-networks/"
page=requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
soup

### Get the Job URLS

Taking a closer look at the HTML of the results page, the listing shows only a few details for each job: the date, location, title, and hiring company. The description, which holds further relevant details, is cut off from view. To read the full description, each job page must be accessed and its data extracted.
Fortunately, the listing also contains the links to the individual job posts, so all the links found are collected into a list.
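
The link collection can be tried on a small inline snippet before running it against the live site. The sketch below is illustrative only: the sample markup, its job paths, the `BASE_URL` constant, and the `extract_job_links` name are assumptions, and only the `card-content` class comes from the page itself. It also removes repeated links and resolves relative ones, which the notebook cell does not do.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://pinoyjobs.ph/"  # used only to resolve relative links

# Hypothetical markup that mimics a listing page with two job cards.
SAMPLE_HTML = """
<div class="card-content"><a href="/job/1/">Developer</a></div>
<div class="card-content"><a href="https://pinoyjobs.ph/job/2/">Tester</a>
  <a href="/job/1/">Developer (repeat)</a></div>
<div class="sidebar"><a href="/about/">About</a></div>
"""

def extract_job_links(html, base_url=BASE_URL):
    """Return unique absolute links found inside 'card-content' divs, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", class_="card-content"):
        for a in div.find_all("a", href=True):
            url = urljoin(base_url, a["href"])
            if url not in links:  # skip links that were already collected
                links.append(url)
    return links

print(extract_job_links(SAMPLE_HTML))
```

Links outside the `card-content` divs, such as the sidebar link in the sample, are ignored.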


In [None]:
jobURL_List = []
i = 0  # running count of the links found
for div in soup.find_all('div', class_='card-content'):
    for a in div.find_all('a', href=True):
        jobURL_List.append(a['href'])
        print("Found the URL:", a['href'])
        i = i + 1
        print(i)

### HTML Parsing of the Job Links

In [None]:
jobURL = jobURL_List[0]
print(jobURL)
jobPage=requests.get(jobURL)
soupJobs = BeautifulSoup(jobPage.content, 'html.parser')

### Getting the Job Title

The fields listed in this and the following sections are fetched from each URL in the list `jobURL_List`, starting with:

- Job_Title : The title of the job position

In [None]:
Job_Title = soupJobs.findAll("h1", {"itemprop": "title"})
print (Job_Title[0].text)

### Getting the Job Employment Type
- Job_employmentType : Full-time/Part-time

In [None]:
Job_employmentType = soupJobs.findAll("li", {"itemprop": "employmentType"})
print (Job_employmentType[0].text)

### Getting the Job Location
- Job_jobLocation : The location where the position is based. For home-based jobs, some employers write this as Anywhere/Online or a similar term

In [None]:
Job_jobLocation = soupJobs.find_all('li',{"itemprop": "addressLocality"})
print (Job_jobLocation[0].text)

### Getting the Date Posted
- Job_dateposted : The date and time when the job was posted

In [None]:
Job_dateposted = soupJobs.find_all('li',{"itemprop": "datePosted"})
print (Job_dateposted[0].text)

### Getting the Job Salary
- Job_salary : The expected salary in Philippine Peso (PHP)

In [None]:
Job_salary = soupJobs.find_all('i', {"itemprop": "salary"})
print(Job_salary[0].text)

### Getting the Job Description
- Job_desc : A detailed job description containing the requirements and responsibilities of the job

In [None]:
for div in soupJobs.find_all('div', {"itemprop": "description"}):
    Job_desc = div.find_all("p")
    for paragraph in Job_desc:
        print(paragraph.text)
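
Each lookup in the cells above indexes `[0]`, which raises an `IndexError` when a posting lacks that field. One way to guard against this is sketched below. The sample markup is hypothetical (the salary element is left out on purpose), `first_text` and `parse_job` are illustrative names, and only the tag and `itemprop` pairs are taken from the cells above.

```python
from bs4 import BeautifulSoup

# Hypothetical job page; the salary field is deliberately missing.
SAMPLE_JOB_HTML = """
<h1 itemprop="title"> Web Developer </h1>
<ul>
  <li itemprop="employmentType">Full-time</li>
  <li itemprop="addressLocality">Makati</li>
  <li itemprop="datePosted">3 days ago</li>
</ul>
<div itemprop="description"><p>Build sites.</p><p>Fix bugs.</p></div>
"""

def first_text(soup, tag, itemprop, default="Not Specified"):
    """Return the stripped text of the first matching element, or a default."""
    element = soup.find(tag, {"itemprop": itemprop})
    return element.get_text(strip=True) if element is not None else default

def parse_job(html):
    """Collect the fields of one job page into a dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    description = soup.find("div", {"itemprop": "description"})
    paragraphs = description.find_all("p") if description is not None else []
    return {
        "title": first_text(soup, "h1", "title"),
        "employment_type": first_text(soup, "li", "employmentType"),
        "location": first_text(soup, "li", "addressLocality"),
        "date_posted": first_text(soup, "li", "datePosted"),
        "salary": first_text(soup, "i", "salary"),
        "description": "\n".join(p.get_text(strip=True) for p in paragraphs),
    }

job = parse_job(SAMPLE_JOB_HTML)
print(job["title"], "|", job["salary"])
```

With a default in place, one incomplete posting no longer stops a long scraping run.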

### Extracting the Data

Putting it all together, the job posts in the following categories:
- Engineering, Construction, Electrical
- IT, Programming, Systems & Networks
- Manufacturing Production
- Nursing, Medical, Dental Health
- Sciences, Lab Research
- Web Development, Design, HTML, SEO

will be fetched and turned into a dataframe with the following columns:
- Website
- Job Title
- Category
- Company
- Date Posted
- Location
- Status (the employment type)
- Salary
- Education
- Years of Work Experience
- Job Description

In [None]:
# Category listing pages to scrape
PinoyJobs_URL_List = [
    "https://pinoyjobs.ph/job-hiring/category/engineering-construction-electrical/",
    "https://pinoyjobs.ph/job-hiring/category/it-programming-systems-networks/",
    "https://pinoyjobs.ph/job-hiring/category/manufacturing-production/",
    "https://pinoyjobs.ph/job-hiring/category/nursing-medical-dental-health/",
    "https://pinoyjobs.ph/job-hiring/category/sciences-lab-research/",
    "https://pinoyjobs.ph/job-hiring/category/web-development-design-html-seo/",
]

# One list per extracted field
job_title_list = []
job_employment_type_list = []
job_jobLocation_list = []
job_dateposted_list = []
job_desc_list = []
job_salary_list = []
job_location_list = []
job_type_list = []
company_name_list = []

for category_URL in PinoyJobs_URL_List:
    page = requests.get(category_URL)
    soup = BeautifulSoup(page.content, 'html.parser')

    # The second-to-last pagination item holds the highest page number.
    # Default to a single page when the category has no pagination.
    max_page = 1
    for ul in soup.find_all('ul', {"class": "pagination hide-on-small-only"}):
        page_num = ul.find_all("li")
        max_page = int(page_num[len(page_num)-2].text.strip())

    # Assumes pages are numbered 1..max_page, as shown in the pagination list
    for web_page in range(1, max_page + 1):
        URL = category_URL + "page/{}/".format(web_page)
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')

        # Collect the links to the individual job posts on this page
        jobURL_List = []
        for div in soup.find_all('div', class_='card-content'):
            for a in div.find_all('a', href=True):
                jobURL_List.append(a['href'])

        for jobURL in jobURL_List:
            jobPage = requests.get(jobURL)
            soupJobs = BeautifulSoup(jobPage.content, 'html.parser')

            # The category name is the heading of the listing page
            job_type = soup.find_all("h1")
            job_type_list.append(job_type[0].text)

            Job_title = soupJobs.find_all("h1", {"itemprop": "title"})
            job_title_list.append(Job_title[0].text)

            Company_name = soupJobs.find_all('h5', {"itemprop": "hiringOrganization"})
            company_name_list.append(Company_name[0].text)

            Job_employmentType = soupJobs.find_all("li", {"itemprop": "employmentType"})
            job_employment_type_list.append(Job_employmentType[0].text)

            Job_jobLocation = soupJobs.find_all('li', {"itemprop": "addressLocality"})
            job_jobLocation_list.append(Job_jobLocation[0].text)

            Job_dateposted = soupJobs.find_all('li', {"itemprop": "datePosted"})
            job_dateposted_list.append(Job_dateposted[0].text)

            Job_desc = soupJobs.find_all('div', {"itemprop": "description"})
            job_desc_list.append(Job_desc[0].text)

            Job_salary = soupJobs.find_all('i', {"itemprop": "salary"})
            job_salary_list.append(Job_salary[0].text)

            # The location is taken from the last <i> element on the job page
            Job_location = soupJobs.find_all('i')
            job_location_list.append(Job_location[len(Job_location)-1].text)

# Build the dataframe once, after all categories have been scraped
jobs_data = {'Website': 'Pinoy Jobs',
             'Job Title': job_title_list,
             'Category': job_type_list,
             'Company': company_name_list,
             'Date Posted': job_dateposted_list,
             'Location': job_location_list,
             'Status': job_employment_type_list,
             'Salary': job_salary_list,
             #'Office Location': job_jobLocation_list,
             'Education': "Not Specified / In Description",
             'Years of Work Experience': "Not Specified / In Description",
             'Job Description': job_desc_list,
             }
pinoy_jobs_df = pd.DataFrame(data=jobs_data)
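
The page count in the cell above is read from the second-to-last `<li>` of the pagination list. The sketch below isolates that step on inline markup so it can be checked without a network request. The sample markup and the helper names `max_page_number` and `page_urls` are assumptions, as is the premise that pages are numbered from 1 and that the last item is a "next" control.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup; the last item stands in for a "next" control.
SAMPLE_PAGINATION = """
<ul class="pagination hide-on-small-only">
  <li><a href="#">1</a></li>
  <li><a href="#">2</a></li>
  <li><a href="#">7</a></li>
  <li><a href="#">Next</a></li>
</ul>
"""

def max_page_number(html):
    """Return the highest page number, or 1 when no usable pagination exists."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.find("ul", class_="pagination")
    if pagination is None:
        return 1
    items = pagination.find_all("li")
    if len(items) < 2:
        return 1
    label = items[-2].get_text(strip=True)
    return int(label) if label.isdigit() else 1

def page_urls(category_url, last_page):
    """Build the listing URLs for pages 1..last_page of one category."""
    return [f"{category_url}page/{n}/" for n in range(1, last_page + 1)]

last_page = max_page_number(SAMPLE_PAGINATION)
print(last_page)
print(page_urls("https://pinoyjobs.ph/job-hiring/category/sciences-lab-research/", 2))
```

Falling back to a single page avoids a `NameError` or a stale page count when a category has too few jobs to be paginated.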

### Parsing to JSON File

Store the gathered data in a JSON file.

In [None]:
# Save the dataframe to a JSON file as a list of records
data = pinoy_jobs_df.to_json(orient='records')
parsed = json.loads(data)
with open('pinoy_jobs.json', 'w') as json_file:
    json.dump(parsed, json_file, indent=4)
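
A quick round trip can confirm that the file holds a list of records that loads back unchanged. The sketch below uses only the standard `json` module and two made-up records in place of the scraped dataframe; `save_records`, `load_records`, and the temporary file name are hypothetical.

```python
import json
import os
import tempfile

def save_records(records, path):
    """Write a list of dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(records, json_file, indent=4, ensure_ascii=False)

def load_records(path):
    """Read the list of dictionaries back from the JSON file."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)

# Made-up records shaped like rows of the dataframe.
records = [
    {"Website": "Pinoy Jobs", "Job Title": "Web Developer", "Salary": "Not Specified"},
    {"Website": "Pinoy Jobs", "Job Title": "Nurse", "Salary": "20000"},
]

path = os.path.join(tempfile.mkdtemp(), "pinoy_jobs_sample.json")
save_records(records, path)
print(load_records(path) == records)  # True
```

Writing to a temporary directory keeps the check from overwriting the real `pinoy_jobs.json`.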