# Job Portal - Monster

## Imports used

* `os` - provides functions for interacting with the operating system.
* `pandas` - data analysis library built around the DataFrame.
* `numpy` - library of functions for working with arrays and matrices.
* `json` - reads and writes JSON data.
* `re` - regular expressions; used to recognize patterns in strings and to reformat them.
* `gensim` - topic modelling library that efficiently handles large collections of unstructured text.
* `nltk` - short for Natural Language Toolkit; a set of tools for processing human-language text.
* `requests` - sends HTTP requests.
* `seaborn` - visualization library used to draw informative statistical graphs.
* `math` - standard library module of mathematical functions.
* `gensim.utils` `simple_preprocess` - preprocesses text by lowercasing it and splitting it into tokens (tokenizing).
* `gensim.parsing.preprocessing` `STOPWORDS` - set of stop words, common words that carry little meaning and are often removed in pre-processing.
* `gensim` `corpora` - used to build dictionaries and corpora from tokenized text.
* `gensim` `models` - used for topic modelling and model training.
* `nltk.stem` `WordNetLemmatizer` - reduces words to their dictionary base form (lemma), so that inflected forms of a word are grouped together.
* `bs4` `BeautifulSoup` - parses the HTML retrieved from websites during web scraping.
* `datetime` `datetime` - class that represents a date and time. Used to convert date strings into datetime objects and format them as month, day, and year.
* `datetime` `timedelta` - used to subtract the elapsed time from the time of scraping when a posting date is given as minutes, hours, days, or weeks ago.
* `dateutil.relativedelta` `relativedelta` - used to subtract the elapsed time from the time of scraping when a posting date is given as months or years ago.
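
The last three entries describe turning a relative posting date into a calendar date. The sketch below shows one way this could work. The function name `posted_to_date` and the input format (`"3 days ago"`) are assumptions made for illustration and are not taken from the notebook; the text Monster actually displays may differ.

```python
import re
from datetime import datetime, timedelta

# Units that timedelta accepts directly as keyword arguments.
TIMEDELTA_UNITS = {"minute": "minutes", "hour": "hours",
                   "day": "days", "week": "weeks"}


def posted_to_date(text, scraped_at):
    """Convert a relative string such as '3 days ago' into a datetime.

    Hypothetical helper: the input format is assumed, not confirmed.
    Returns None when the text is not a recognized relative date.
    """
    match = re.search(r"(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago",
                      text.lower())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    if unit in TIMEDELTA_UNITS:
        return scraped_at - timedelta(**{TIMEDELTA_UNITS[unit]: amount})
    # Months and years vary in length, so they need relativedelta.
    # Imported here so the timedelta paths run without dateutil installed.
    from dateutil.relativedelta import relativedelta
    return scraped_at - relativedelta(**{unit + "s": amount})


scraped_at = datetime(2021, 3, 15, 12, 0)
print(posted_to_date("Posted: 3 days ago", scraped_at).strftime("%m/%d/%Y"))
# prints 03/12/2021
```

Passing the scrape time in as an argument, rather than calling `datetime.today()` inside the function, keeps the conversion reproducible when the data is processed on a later day.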

In [1]:
import os
import pandas as pd
import numpy as np
import json
import re
import gensim
import nltk
import requests
import seaborn as sns
import math

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from datetime import datetime
from datetime import timedelta
from dateutil.relativedelta import relativedelta

today = datetime.today()



Monster is another job listing site, one that has been around for more than 20 years. It provides a global "job board" along with products and services for job seeking, career management, recruitment, and talent management. The company uses technology to build recruiting media, technologies, and platforms that connect employers with job seekers.

## Web Scraping Data

### Selecting Categories

To scrape the available jobs, we selected the job category links on monster.com.ph that relate to STEM courses.

In [46]:
#The List of Categories Relevant to the Paper
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
           'AppleWebKit/537.36 (KHTML, like Gecko) '\
           'Chrome/75.0.3770.80 Safari/537.36'}
Category_URL_List = ['https://www.monster.com.ph/search/it-computers-hardware-networking-jobs',
                     'https://www.monster.com.ph/search/it-computers-software-jobs',
                     'https://www.monster.com.ph/search/engineering-design-jobs',
                     'https://www.monster.com.ph/search/electrical-switchgears-jobs',
                     'https://www.monster.com.ph/search/chemicals-petrochemicals-jobs', 
                     'https://www.monster.com.ph/search/construction-engineering-jobs',
                     'https://www.monster.com.ph/search/hospitals-healthcare-diagnostics-jobs']
                     
#Category_URL_List

### Parse HTML data

We then collect all of the job URLs listed within each category by parsing the pages with Beautiful Soup. The links in `Category_URL_List` are used to reach the individual job listings, and each listing is stored in a dataframe with the following features:
* `Website` - Monster PH
* `Job Title` - name of the job listed
* `Category` - job category based on Monster PH
* `Company` - employer company
* `Date Posted` - date job was posted
* `Location` - where the work site is located
* `Status` - employment type (Full/Part time)
* `Salary` - salary offered, recorded as "Not Specified" when the listing omits it
* `Education` - highest level of education attainment required
* `Years of Work Experience` - years of working experience required
* `Job Description` - body text description
* `Office Location` - office location shown in the job title block
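
Each feature is read from the first matching element on the job page, with a placeholder stored when nothing is found. A minimal sketch of that pattern follows. The helper `first_text` and the stand-in class `FakeTag` are hypothetical names introduced here so the example runs without `bs4`; with real data, the elements would be the tags returned by Beautiful Soup. Unlike the notebook, this sketch also collapses repeated whitespace.

```python
import re


def first_text(elements, index=0, default="Not Specified"):
    """Return the cleaned text of elements[index], or default if missing."""
    try:
        raw = elements[index].get_text()
    except (IndexError, AttributeError):
        return default
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", raw).strip()


class FakeTag:
    """Stand-in for a BeautifulSoup tag (assumption for this sketch)."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


titles = [FakeTag("\n  Data   Engineer\n")]
print(first_text(titles))        # prints Data Engineer
print(first_text([], index=0))   # prints Not Specified
```

A single helper like this keeps every feature list the same length, because each call returns exactly one value per job.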

In [None]:
for kekw in range(len(Category_URL_List)):
    #URL Page Gatherer
    print (Category_URL_List[kekw])
    isNext = True
    isNext_Page = 1
    isPer_Page = []
    #kekw = 6
    URL = Category_URL_List[kekw]
    while isNext:
        page=requests.get(URL,headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        #Reset the button list so a page without navigation ends the loop
        Test = []
        for div in soup.findAll('div', {"class": "srp-navigation"}):
            Test = div.find_all("button")
        if len(Test) > 1:
            isOutput = Test[1].get_text().replace('\n','').replace(' ','').replace('\t','')
            if isOutput == "Next":
                #print(isOutput)
                isNext_Page = isNext_Page + 1
                #print(isNext_Page)
                URL = Category_URL_List[kekw] + "-{}?".format(isNext_Page)
                #print(URL)
                isPer_Page.append(Category_URL_List[kekw] + "-{}?".format(isNext_Page))
            else:
                #print(isOutput)
                #print(isNext_Page)
                isNext = False
        elif (len(Test) == 1):
            isOutput = Test[0].get_text().replace('\n','').replace(' ','').replace('\t','')
            if isOutput == "Next":
                #print(isOutput)
                #print(isNext_Page)
                URL = Category_URL_List[kekw] + "-{}?".format(isNext_Page)
                #print(URL)
                isPer_Page.append(Category_URL_List[kekw] + "-{}?".format(isNext_Page))
                isNext_Page = isNext_Page + 1
            elif isOutput == "Previous":
                #print(Test[0].get_text().replace('\n',''))
                #print ("Test")
                #print(isNext_Page)
                isNext = False
        else:
            break
            
    #Gathering Data 
    job_title_list = []
    job_employment_type_list = []
    job_jobLocation_list = []
    job_dateposted_list = []
    job_desc_list = []
    job_salary_list = []
    job_location_list = []
    job_type_list = []
    comapny_name_list = []
    job_years_list = []
    job_education_list = []

    #The number of pages
    for kek in range(0,len(isPer_Page)): ##len(isPer_Page)
        #print("Page Number: ",kek)
        #Getting the Job URL per page
        URL = isPer_Page[kek]
        page=requests.get(URL,headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        jobURL_List = []
        for div in soup.find_all('div', class_='job-tittle'):
            #Skip sponsored listings
            isSponsored = div.find_all('div', class_='sponsr')
            if len(isSponsored):
                continue
            for h3 in div.find_all('h3', class_='medium'):
                for a in h3.find_all('a', href=True):
                    jobURL_List.append(a['href'])

        #Getting the Info of the JOB                

        for i in range(len(jobURL_List)):
            Monster_URL = jobURL_List[i]
            Monster_page=requests.get(Monster_URL,headers=headers)
            Monstersoup = BeautifulSoup(Monster_page.content, 'html.parser')
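            #Reset the header fields so a missing header does not reuse the previous job's values
            Job_Title, Company_name, Office_Location, Salary, Years_WE = [], [], [], [], []
            for div in Monstersoup.findAll('div', {"class": "job-tittle detail-job-tittle jd-mt-0"}):
                Job_Title = div.findAll('h1')
                Company_name = div.findAll('a')
                Office_Location = div.findAll('span',{"class": "loc jd-loc"} )
                Salary = div.findAll('span',{"class": "package"} )
                Years_WE = div.findAll("span",{"class": "loc"})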
            Job_location =  Monstersoup.findAll("div",{"class": "posted-update pl5 w100"})
            Date_posted = Monstersoup.findAll('span',{"class": "posted seprator pLR-10"} )
            Job_Description = Monstersoup.findAll('div', {"class": "card-body card-body-apply pd20"})
            Job_detail = Monstersoup.findAll('div', {"class": "job-detail-list"})
            #Job_Education = Monstersoup.findAll('a', {"href": "https://www.monster.com.ph/search/bachelors-degree-jobs"})
            Job_Education = Monstersoup.findAll('div', {"class": "job-detail-list"})

            #Data Appending
            try:
                job_title_list.append(Job_Title[0].get_text().replace('\n',' '))
            except:
                #Append to the title list so all lists stay the same length
                job_title_list.append("Error")


            try:
                comapny_name_list.append(Company_name[0].get_text().replace('\n',' '))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                comapny_name_list.append("Not Specified")

            try:
                job_employment_type_list.append(Job_detail[0].get_text().replace('\n',' '))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_employment_type_list.append("Not Specified")
            try:    
                job_type_list.append(Job_detail[1].get_text().replace('\n',' '))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_type_list.append("Not Specified")

            try:
                job_jobLocation_list.append(Office_Location[0].get_text().replace('\n',' '))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_jobLocation_list.append("Not Specified")

            try:
                job_dateposted_list.append(Date_posted[0].get_text().replace('\n',' '))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_dateposted_list.append("Not Specified")

            try:            
                job_desc_list.append(Job_Description[0].get_text().replace('\n',' '))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_desc_list.append("Not Specified")

            try:
                job_salary_list.append(Salary[0].get_text().replace('\n',' '))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_salary_list.append("Not Specified")

            try:
                job_location_list.append(Job_location[0].get_text().replace('\n',' '))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_location_list.append("Not Specified")

            try:
                job_years_list.append(Years_WE[1].get_text().replace('\n',''))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_years_list.append("Not Specified")

            try:
                job_education_list.append(Job_Education[5].get_text().replace('\n',''))
            except:
                #print("An exception occurred") 
                #print(Job_Title[0].get_text().replace('\n',' '))
                #print(jobURL_List[i]) 
                job_education_list.append("Not Specified")

            jobs_data={'Website': "Monster PH",
                       'Job Title': job_title_list, 
                       'Category': job_type_list,
                       'Company': comapny_name_list,
                       'Date Posted': job_dateposted_list, 
                       'Location': job_location_list,
                       'Status': job_employment_type_list, 
                       'Salary': job_salary_list,
                       'Education': job_education_list,
                       'Years of Work Experience': job_years_list,
                       'Job Description': job_desc_list,
                       'Office Location': job_jobLocation_list, 
                       }
            monster_jobs_df = pd.DataFrame(data=jobs_data)

    #Save the scraped data of each category to its own JSON file
    #(file names follow the order of Category_URL_List)
    Output_File_List = ['monster_ph_IT_HW.json',
                        'monster_ph_IT_SW.json',
                        'monster_ph_ENG_DE.json',
                        'monster_ph_ELEC_SG.json',
                        'monster_ph_CHEM_ENG.json',
                        'monster_ph_CON_ENG.json',
                        'monster_ph_HEALTH.json']
    data = monster_jobs_df.to_json(orient='records')
    parsed = json.loads(data)
    with open(Output_File_List[kekw], 'w') as json_file:
        json.dump(parsed, json_file)
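
The first part of the cell above gathers result-page URLs by appending `-{page}?` to the category URL for as long as a "Next" button is found. The sketch below isolates that idea. The function `collect_page_urls`, its `has_next` callable, and the `max_pages` cap are hypothetical and introduced only for illustration; in the notebook, the "Next" check is done with `requests` and Beautiful Soup. It also assumes that `-1?` addresses the first results page, as the cell above does.

```python
def collect_page_urls(category_url, has_next, max_pages=500):
    """Return result-page URLs for a category by following "Next".

    has_next(url) is a hypothetical callable that reports whether the
    page at url shows a "Next" button.
    """
    urls = []
    page = 1
    while page <= max_pages:  # hard cap guards against an endless loop
        url = f"{category_url}-{page}?"
        urls.append(url)
        if not has_next(url):
            break
        page += 1
    return urls


def fake_has_next(url):
    """Pretend the category has three pages: pages 1 and 2 show "Next"."""
    return not url.endswith("-3?")


for url in collect_page_urls("https://example.com/search/demo-jobs",
                             fake_has_next):
    print(url)
```

Separating the "is there a next page" check from the URL bookkeeping makes the loop testable without network access, and the page cap prevents an endless loop if the site layout changes.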
