<a href="https://colab.research.google.com/github/drshahizan/python-web/blob/main/lxml/QUAD/QUAD_LXML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Web Scraping using LXML

<br>
 <p align="center">
  <img src="https://raw.githubusercontent.com/Terence172/FirstR/main/Pictures/lxml.png" height = "150"/>
 </p>
</br>

🚀 Group Members QUAD

> 1. CHONG KAI ZHE
> 2. TERENCE A/L LOORTHANATHAN
> 3. RISHMA FATHIMA BINTI BASHER
> 4. NUR SYAMALIA FAIQAH BINTI MOHD KAMAL

In this notebook, we will show you how to scrape a website using lxml. lxml is a Python library for parsing and manipulating XML and HTML documents. It provides a way to navigate, search, and modify the elements and attributes of an XML or HTML document using a simple and consistent API.

The library is built on top of the libxml2 and libxslt C libraries, which provide fast and efficient parsing and manipulation of XML and HTML documents. lxml provides a Pythonic API that is easy to use and intuitive for Python programmers, while still being very powerful and flexible.
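
As a small illustration of that API, the sketch below parses an HTML string, queries it with XPath, and modifies an attribute. The HTML snippet and the `data-seen` attribute are made up for this example and are not taken from Jobstreet.

```python
from lxml import html

# Hypothetical HTML snippet, invented for illustration only.
snippet = """
<html><body>
  <div class="job"><a href="/job/1">Data Analyst</a><span>Johor Bahru</span></div>
  <div class="job"><a href="/job/2">Software Engineer</a><span>Subang Jaya</span></div>
</body></html>
"""

# Parse the string into an element tree.
tree = html.fromstring(snippet)

# Search: text() selects text nodes, @href selects attribute values.
titles = tree.xpath('//div[@class="job"]/a/text()')
links = tree.xpath('//div[@class="job"]/a/@href')

# Modify: set an attribute on the first matching element.
first_card = tree.xpath('//div[@class="job"]')[0]
first_card.set("data-seen", "yes")

print(titles)
print(links)
```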

<br>

---
<br>
Why use lxml? <br>
lxml is one of the most feature-rich and stable XML and HTML parsing libraries for Python. Because the parsing itself runs in C (libxml2), it is usually faster than pure-Python parsers such as the `html.parser` backend that BeautifulSoup uses by default, and it supports XPath 1.0 and XSLT 1.0 for complex queries and transformations.

<br>
For more information on lxml, see https://lxml.de/
<br><br>

---
<br>

<br>
 <p align="center">
  <img src="https://raw.githubusercontent.com/Terence172/FirstR/main/Pictures/jobstreet.jpg" height = "150"/>
 </p>
</br>

Which website are we going to scrape?<br>
We are going to use Jobstreet, one of the most widely used online job search websites in Malaysia. Jobstreet operates primarily in Southeast Asia, including Malaysia, Singapore, the Philippines, Indonesia, and Vietnam, and it is headquartered in Malaysia.
<br><br>

---
<br>
What data are we going to scrape?<br>
We are going to retrieve job offerings for Computer/Information Technology specialists. For each offering we collect basic information: the job title, the company offering it, the location, the salary, and the first listed benefit.<br>


**First step** - Install all the required libraries.
<br>
Install lxml with `!pip install lxml` if it is not already available in your environment. We also run `!pip install requests` because we need the requests library to retrieve the HTML content of the website we are scraping.



In [None]:
!pip install requests
!pip install lxml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Second step** - Import the required libraries
<br>
As explained, we need the requests library to retrieve the HTML content of the website. We need lxml to parse the HTML and locate elements using XPath expressions. We also need the DataFrame functionality from pandas.

In [None]:
import requests
from lxml import html
import pandas as pd

**Third step** - Use the requests package to retrieve the HTML source for the first page of 30 job offerings

In [None]:
url = 'https://www.jobstreet.com.my/en/job-search/job-vacancy.php?specialization=191%2C192%2C193'

response = requests.get(url)

tree = html.fromstring(response.content)
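
The call above assumes the request always succeeds. A slightly more defensive version might look like the sketch below. The function name `fetch_tree`, the `get` parameter, the timeout value, and the User-Agent string are all assumptions made for this example, not part of the original notebook.

```python
import requests
from lxml import html


def fetch_tree(url, get=requests.get, timeout=10):
    """Download `url` and return the parsed lxml tree.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be tried out without network access.
    """
    # Assumption: a browser-like User-Agent, since some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; lxml-tutorial)"}
    response = get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page.
    response.raise_for_status()
    return html.fromstring(response.content)
```

With this helper, the third step becomes `tree = fetch_tree(url)`, and a failed download stops the notebook with a clear error instead of silently producing an empty result.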

**Fourth step** - Create a variable to store the result of the xpath query `'//div[@class="sx2jih0 zcydq876 zcydq866 zcydq896 zcydq886 zcydq8n zcydq856 zcydq8f6 zcydq8eu"]'`, which finds all div elements whose class attribute is exactly `"sx2jih0 zcydq876 zcydq866 zcydq896 zcydq886 zcydq8n zcydq856 zcydq8f6 zcydq8eu"`. We can then iterate over this variable to find the information of each job offering.

In [None]:
elements = tree.xpath('//div[@class="sx2jih0 zcydq876 zcydq866 zcydq896 zcydq886 zcydq8n zcydq856 zcydq8f6 zcydq8eu"]')
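
Note that `@class="..."` compares the whole attribute string, so the query stops matching if the site adds, removes, or reorders a single class name. The sketch below contrasts three ways of matching a class in XPath 1.0. The markup and the class names `card`, `highlighted`, and `discard` are invented for this illustration.

```python
from lxml import html

# Hypothetical markup: the class names below are invented for illustration.
snippet = """
<div>
  <div class="card  highlighted">A</div>
  <div class="card">B</div>
  <div class="discard">C</div>
</div>
"""
tree = html.fromstring(snippet)

# Exact comparison: matches only when the attribute is exactly "card".
exact = tree.xpath('//div[@class="card"]/text()')

# Substring test: also matches "discard", which is usually not intended.
substring = tree.xpath('//div[contains(@class, "card")]/text()')

# Token test: matches when "card" is one of the space-separated class names.
token = tree.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " card ")]/text()'
)

print(exact, substring, token)
```

Where a page offers them, attributes such as the `data-automation` values used in the next step tend to be a more stable anchor than generated class names.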

**Fifth step** - Now we have to access specific sub-elements of each job offering using xpath<br><br>

We can use a `for` loop to iterate through each element in the `elements` list and call the `xpath()` method on it to locate specific sub-elements. The leading `.` in each expression (`.//`) makes the search relative to the current element rather than the whole page. The `xpath()` method returns a list of matches, so we use indexing to access the first element in the list.

> While looking through the job offerings, we noticed that some companies prefer not to give full information about the job offering. This could be because of confidentiality, or because the recruiter simply forgot. For this reason, whenever the list is empty (no element matched the xpath) we assign a default value, such as an empty string, instead of indexing into it.

The extracted data has to be appended to the data list, which will be used to create a Pandas dataframe.

In [None]:
data = []

for element in elements:
    
    # Job title: text of the span whose class attribute is exactly "sx2jih0"
    job_title = element.xpath('.//span[@class="sx2jih0"]/text()')
    job_title = job_title[0] if job_title else ''  # Empty string if no title was found

    # Company name: text of the <a> tag with data-automation="jobCardCompanyLink"
    company_name = element.xpath('.//a[@data-automation="jobCardCompanyLink"]/text()')
    company_name = company_name[0] if company_name else 'Company Confidential'  # No company link: the employer prefers to stay confidential

    # Job location: text of the <a> tag with data-automation="jobCardLocationLink"
    job_loc = element.xpath('.//a[@data-automation="jobCardLocationLink"]/text()')
    job_loc = job_loc[0] if job_loc else ''  # Empty string if no location was found

    # Salary: text of the span with class "sx2jih0 zcydq84u es8sxo0 es8sxo3 es8sxo21 es8sxoh"
    salary = element.xpath('.//span[@class="sx2jih0 zcydq84u es8sxo0 es8sxo3 es8sxo21 es8sxoh"]/text()')
    salary = salary[0] if salary else 'Not Specified'  # The company did not specify a salary

    # First benefit: text of the span with class "sx2jih0 zcydq84u es8sxo0 es8sxo1 es8sxo21 _1d0g9qk4 es8sxo7"
    benefit = element.xpath('.//span[@class="sx2jih0 zcydq84u es8sxo0 es8sxo1 es8sxo21 _1d0g9qk4 es8sxo7"]/text()')
    benefit = benefit[0] if benefit else 'Nothing'  # The company did not specify a benefit

    # Append the values in the same order as the dataframe columns in the next step
    data.append([job_title, company_name, job_loc, salary, benefit])
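
The "take the first match or fall back to a default" pattern repeats for every field, so it can be factored into a small helper. This is only a sketch: the name `first_text`, the sample job card, and the `salary` class are hypothetical, and the `.strip()` call is an addition that the loop above does not perform.

```python
from lxml import html


def first_text(element, path, default=""):
    """Return the first text node matched by `path`, or `default` if none."""
    matches = element.xpath(path)
    # strip() removes surrounding whitespace; this goes beyond the original loop.
    return matches[0].strip() if matches else default


# Hypothetical job card, loosely modelled on the attributes used above.
card = html.fromstring("""
<div>
  <a data-automation="jobCardCompanyLink"> Example Sdn. Bhd. </a>
  <a data-automation="jobCardLocationLink">Kuala Lumpur</a>
</div>
""")

company = first_text(card, './/a[@data-automation="jobCardCompanyLink"]/text()',
                     default="Company Confidential")
location = first_text(card, './/a[@data-automation="jobCardLocationLink"]/text()')
# The card has no salary element (hypothetical class), so the default is used.
salary = first_text(card, './/span[@class="salary"]/text()', default="Not Specified")

print(company, location, salary, sep=" | ")
```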

**Sixth step** - Convert the list to a dataframe and name the columns

In [None]:
df = pd.DataFrame(data, columns=['Job Title', 'Company Name', 'Job Location', 'Salary', 'Benefit'])
df.head()

| | Job Title | Company Name | Job Location | Salary | Benefit |
|---|---|---|---|---|---|
| 0 | Senior Software Engineer / Software Engineer (... | Ideagen Plc. | Subang Jaya | Not Specified | 13 Month Salary |
| 1 | Software Developer ( Java ) | Wiseview Information Technology | Kuala Lumpur | MYR 6K - 8,400 monthly | Nothing |
| 2 | Data Analyst | Zempot Malaysia Sdn. Bhd. | Johor Bahru | Not Specified | Training Provided |
| 3 | Software Engineer (Java) | Ideagen Plc. | Subang Jaya | Not Specified | 13 Month Salary |
| 4 | Internship for Information Technology (IT) Stu... | Infineon Technologies (Malaysia) Sdn Bhd | Melaka | Not Specified | Nothing |
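
When rows are plain lists, the column names are attached by position, so a value and its label can drift apart unnoticed. One alternative, sketched here with made-up records, is to collect each row as a dictionary so that every value carries its own column name.

```python
import pandas as pd

# Hypothetical records shaped like the rows above; the values are invented.
records = [
    {"Job Title": "Data Analyst", "Company Name": "Example Sdn. Bhd.",
     "Job Location": "Johor Bahru", "Salary": "Not Specified", "Benefit": "Nothing"},
    {"Job Title": "Software Engineer", "Company Name": "Company Confidential",
     "Job Location": "Kuala Lumpur", "Salary": "MYR 6K - 8,400 monthly",
     "Benefit": "Training Provided"},
]

# pandas takes the column names from the dictionary keys.
df_example = pd.DataFrame(records)
print(df_example)
```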


**Seventh step** - Make sure there are no null values
<br><br>
Use `.isnull()` to check whether there are any null values.

In [None]:
df.isnull().sum()

Job Title       0
Company Name    0
Job Location    0
Salary          0
Benefit         0
dtype: int64

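There are no null values, so we do not have to take any action. This is expected: the loop in the fifth step replaces every missing field with a placeholder string, so no column can contain a null.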
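
Because missing fields were filled with placeholder strings, `.isnull()` cannot tell how much information is actually absent. A rough way to count it, assuming the placeholders `''`, `'Not Specified'`, and `'Nothing'` from the fifth step, is `DataFrame.isin`. The sample dataframe below is invented for the illustration.

```python
import pandas as pd

# Hypothetical sample that uses the same placeholder strings as the scraping loop.
sample = pd.DataFrame({
    "Job Location": ["Subang Jaya", "", "Melaka"],
    "Salary": ["Not Specified", "MYR 6K - 8,400 monthly", "Not Specified"],
    "Benefit": ["13 Month Salary", "Nothing", "Nothing"],
})

placeholders = ["", "Not Specified", "Nothing"]

# isin() marks every cell holding a placeholder; sum() counts them per column.
missing_per_column = sample.isin(placeholders).sum()
print(missing_per_column)
```

On the scraped data the same check would be `df.isin(placeholders).sum()`.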

**Last step** - Convert the result dataframe into a csv file

In [None]:
# Convert the dataframe into a csv file
df.to_csv('job_search.csv', index=False)
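
One thing to keep in mind when loading the file again: by default `pd.read_csv` treats empty fields as missing, so an empty-string placeholder comes back as `NaN`. The sketch below shows the round trip on a tiny made-up dataframe, using an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical one-row dataframe with an empty-string placeholder.
sample = pd.DataFrame({"Job Title": ["Data Analyst"], "Job Location": [""]})

# Write to an in-memory buffer; index=False keeps the row index out of the output.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default behaviour: the empty field is read back as NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps the empty field as an empty string.
verbatim_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read.isnull().sum())
print(verbatim_read.isnull().sum())
```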