# Web-scraping: an educational excursion with a few detours to gain valuable hands-on experience into one of the key tools in data science

## Data understanding

John Rollins’s data science methodology outlines 10 stages that guide data scientists in solving complex problems and making data-driven decisions. While this Python project doesn’t focus on the methodology, it is best practice to follow it, especially in real-world scenarios.

The data understanding stage is the fifth stage, so we should first revisit the first stage, business understanding, to recap the business problem, the data needed, and the goals and objectives of this project. The project introduction helps identify these.

#### What was the business problem?
- The recruitment agency was struggling to identify potential job openings for its clients efficiently. Manual job searches across multiple sites were time-consuming and risked missing job ads, which could affect clients’ employment prospects.

#### What did the business need?
- The agency urgently needed an analytical tool to speed up its job search process. The tool should automatically collect job postings from multiple job marketplace websites and provide them to clients as part of the agency's service.

#### What does the business aim to achieve?
- The agency in this scenario aims to improve the efficiency of its job vacancy sourcing process through the web-scraping tool.
- By quickly finding relevant job openings across multiple posting sites, the agency has a better chance of finding good opportunities before others do. This, in turn, helps improve the quality of the job vacancies it sources.

Once achieved, these two objectives will make the business and its clients more competitive in the job market, because the project deliverable enhances the quality of the service with more accurate and more up-to-date listings. That is the ultimate goal.


#### The target websites

Having identified the business problem together with its goals and objectives, I moved on to building the tools that will help the business achieve these outcomes. As I understand it, the data the business requires simply consists of job listings from several posting sites. The agency might have its own list of job-posting websites that it uses or partners with, and I would need to ask the agency for that list. For now, in Australia, I am considering scraping data from the following sites:
* [Seek, AU](https://seek.com.au)
* [Indeed, AU](https://au.indeed.com)
* [Jora, AU](https://jora.com.au)
* [LinkedIn, AU](https://linkedin.com.au)

I am fully aware that the tools I am going to develop, if used to scrape data without the owners' permission, can result in legal consequences. I will therefore discuss this crucial aspect of web-scraping with the agency to avoid any unnecessary legal entanglements for myself and the business. I will probably advise the business to obtain permission, and any licence needed for its User-agent, that permits it to scrape these sites.

In addition, I find it risky and potentially misleading to use one of the above sites to test the tools’ functionality. If my tools request information from them and the responses are refused or contain inaccurate data, it will be hard to judge whether the tools are malfunctioning or the scraping is simply forbidden.

Therefore, I instead use another site, [Fake Python job posting for your web scraping journeys](https://realpython.github.io/fake-jobs/), to assess the scraping capability. As the title suggests, the site was developed by the Python community so that data scientists and Python users can practise their web-scraping skills without having to worry about breaking the rules.

### Detour: understanding of the rules around web-scraping & anti-scraping

- I have done substantial research on the ethical considerations of web-scraping to establish best practice for this project and to set guidelines for my future projects and my journey as a data scientist. The review outlines the following practices and actionable steps for remaining compliant with anti-scraping guidelines.
- Respecting robots.txt: most websites communicate their scraping rules through the robots.txt file. Although the file is not always perfectly accurate, it should be respected and must not be ignored when we intend to crawl the site. The file lists User-agents (bots) together with 'Allow' and 'Disallow' directives, which may apply to the whole site or only to certain content. This information represents the owner's view and intention about scraping the website, and failing to obey it can lead to ethical and legal issues.

- The robots.txt file can be conveniently obtained by appending '/robots.txt' to the website's URL, as follows:
  * https://www.seek.com.au/robots.txt
  * https://au.indeed.com/robots.txt
  * https://jora.com.au/robots.txt
  * https://linkedin.com/robots.txt  

Clicking on each URL opens the text file that sets out the rules of each individual site. We should pay particular attention to the **Disallow directives**, which indicate the pages, sections, and content that are off-limits to crawlers.
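The same check can also be done programmatically. The sketch below is a minimal illustration using `urllib.robotparser` from the Python standard library. The rules in `SAMPLE_ROBOTS_TXT`, the bot names `MyScraper` and `ExampleBlockedBot`, and the `example.com` URLs are made-up assumptions and are not taken from any of the sites above. As far as I know, this parser matches paths by simple prefix and does not interpret `*` wildcards inside a path, so directives such as `*/job/` would need separate handling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written only for this illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBlockedBot
Disallow: /

User-agent: *
Disallow: /job/
Disallow: /api/jobsearch/
"""


def build_parser(robots_text):
    """Parse robots.txt text that we already hold in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A bot without its own group falls back to the "User-agent: *" rules
job_allowed = parser.can_fetch("MyScraper", "https://example.com/job/123")
about_allowed = parser.can_fetch("MyScraper", "https://example.com/about")
# A bot with its own group follows that group instead
blocked_bot_allowed = parser.can_fetch("ExampleBlockedBot", "https://example.com/about")

print(job_allowed, about_allowed, blocked_bot_allowed)
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file before `can_fetch()` is called.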

#### Robots.txt Results

In [3]:
# robots.txt file for www.seek.com.au

As part of the learning curve and of establishing best practice for my crawling journey, I will spend some time here interpreting the directives from seek.com.au so that I can develop my understanding in this area. For the other websites, I will just show the outcomes instead.
##### **Default directives**
- _Disallow: */job/_ **==** all bots are ***disallowed*** from accessing any ***URLs that contain /job/***
- _Disallow: *?returnUrl=_ **==** Bots ***cannot access*** URLs with the ***query parameter returnUrl.***
- _Disallow: *?page=_ **==** Bots ***cannot access*** URLs with ***the query parameter page.***
- _Disallow: /graphql=_ **==** Bots are ***disallowed*** from accessing the ***/graphql endpoint.***
- _Disallow: /api/jobsearch/_ **==** Bots ***cannot access*** the ***job search API.***

_Interpretation:_ The site _restricts most bots_ from accessing job listings and *certain API endpoints*, ***indicating a desire to protect job data from being scraped.***

##### **Disallowed bots**
- _Disallow: /_:

  These specific bots, including ***LinkedInBot***, ***Baiduspider***, and ***PetalBot***, are completely disallowed from accessing any part of the site.
 
_Interpretation:_ *The site explicitly blocks these bots, likely due to concerns about scraping or unwanted traffic.*
  
##### **Exceptions**
- _Disallow: /companies:_

  User-agents including **anthropic-ai**, **Bytespider**, **CCBot**, **Diffbot**, **Google-Extended**, **omgili**, and **GPTBot** are ***disallowed*** from accessing ***the /companies section.***
- _Disallow: */job/:_
  They are also ***disallowed*** from accessing ***job listings.***

_Interpretation:_ *While these bots are recognized, they still face restrictions similar to other bots regarding job listings and company data.*

##### **User-agent: LinkedInBot**
- _Allow: */job/:_ **==** Interestingly, ***LinkedInBot*** is ***allowed to access job listings***, which is a **specific exception to the general disallowance.**
- User-agent: **facebookexternalhit** _Allow: */job/_, _Allow: */jobs*_, _Allow: */*-jobs*_

_Interpretation:_ This indicates **a selective approach** where certain bots, like ***LinkedIn*** and ***Facebook's crawlers***, are permitted to access ***job listings***, ***possibly for sharing or indexing purposes.***

[Seek, AU](https://seek.com.au) strongly disallows scraping of its site.

In [4]:
# robots.txt file for au.indeed.com

In [None]:
User-agent: *
Allow: /
Allow: /hire/*?*isid=
Allow: /personeel/*?*isid=
Allow: /reclutamiento/*?*isid=
Allow: /recruiting/*?*isid=
Allow: /recrutement/*?*isid=

Disallow: /*rt=nc
Disallow: /*&alid=
Disallow: /*&calert=
Disallow: /*&iafilter=
Disallow: /*&mna=
Disallow: /*?rss
Disallow: /addlLoc/
Disallow: /ads/
Disallow: /advanced_search
Disallow: /alert
Disallow: /api/fetch/mc-anon
Disallow: /api/getrecjobs
Disallow: /applystart
Disallow: /cmp/_/
Disallow: /cmp/_c/claim/*
Disallow: /cmp/_rpc/
Disallow: /cmp/addlink
Disallow: /cmp/addvideo
Disallow: /cmp/Login*
Disallow: /cmp/*/analytics
Disallow: /cmp/*/company-questions
Disallow: /cmp/*/people
Disallow: /cmp/*/write-review
Disallow: /community/
Disallow: /company/*
Disallow: /conversion/
Disallow: /cookiemigrator/
Disallow: /create-resume/lp/
Disallow: /cdn-cgi/
Disallow: /%E4%BB%95%E4%BA%8B?
Disallow: /%E5%B7%A5%E4%BD%9C/
Disallow: /%E5%B7%A5%E4%BD%9C/CN/
Disallow: /%E5%B7%A5%E4%BD%9C/title/
Disallow: /%E6%B1%82%E4%BA%BA/JP/
Disallow: /%E6%B1%82%E4%BA%BA/title/
Disallow: /%E8%81%8C%E4%BD%8D%E6%98%BE%E7%A4%BA?
Disallow: /%EC%B7%A8%EC%97%85/KR/
Disallow: /%EC%B7%A8%EC%97%85/title/
Disallow: /empleo/
Disallow: /emploi/
Disallow: /emplois/FR/
Disallow: /emplois/title/
Disallow: /emprego/
Disallow: /empregos/BR/
Disallow: /empregos/title
Disallow: /forum/profile
Disallow: /g/
Disallow: /graphql
Disallow: /imgping
Disallow: /ita?
Disallow: /ja/clk?
Disallow: /ja/imp.gif
Disallow: /job/
Disallow: /Job/
Disallow: /jobb/
Disallow: /jobb/SE/
Disallow: /jobb/title/
Disallow: /jobroll
Disallow: /jobs/AE/
Disallow: /jobs/AQ/
Disallow: /Jobs/AT/
Disallow: /jobs/AU/
Disallow: /jobs/BE/
Disallow: /jobs/BH/
Disallow: /jobs/CA/
Disallow: /jobs/CZ/
Disallow: /jobs/DE/
Disallow: /Jobs/DE/
Disallow: /jobs/DK/
Disallow: /jobs/FI/
Disallow: /jobs/GB/
Disallow: /jobs/GR/
Disallow: /jobs/HK/
Disallow: /jobs/HU/
Disallow: /jobs/ID/
Disallow: /jobs/IE/
Disallow: /jobs/IL/
Disallow: /jobs/IN/
Disallow: /jobs/KW/
Disallow: /jobs/LU/
Disallow: /jobs/MY/
Disallow: /jobs/NO/
Disallow: /jobs/NZ/
Disallow: /jobs/OM/
Disallow: /jobs/PE/
Disallow: /jobs/PH/
Disallow: /jobs/PK/
Disallow: /jobs/PT/
Disallow: /jobs/QA/
Disallow: /jobs/RO/
Disallow: /jobs/RU/
Disallow: /jobs/SA/
Disallow: /jobs/SG/
Disallow: /jobs/title
Disallow: /Jobs/title
Disallow: /jobs/TR/
Disallow: /jobs/TW/
Disallow: /jobs/US/
Disallow: /jobs/VE/
Disallow: /jobs/ZA/
Disallow: /jobtrends/trends/log/*
Disallow: /jwidget.js*
Disallow: /me/*/pdf
Disallow: /m/basecamp/analytics.js
Disallow: /m/jobalerts?
Disallow: /m/moreLoc
Disallow: /m/newjobs
Disallow: /m/recommended
Disallow: /m/rpc/
Disallow: /m/viewjob?
Disallow: /my/
Disallow: /*oc=1
Disallow: /ofertas/ES/
Disallow: /ofertas/title/
Disallow: /offerta-lavoro
Disallow: /offerte-lavoro/IT/
Disallow: /offerte-lavoro/title
Disallow: /pagead/
Disallow: /poka%C5%BCprac%C4%99?
Disallow: /praca/
Disallow: /praca/PL/
Disallow: /praca/title/
Disallow: /preferences
Disallow: /*radius=
Disallow: /rc/
Disallow: /rdr/
Disallow: /recommendedjobs
Disallow: /resumes/account
Disallow: /resumes/advanced?
Disallow: /resumes/alert
Disallow: /resumes*radius=
Disallow: /resumes*rb=
Disallow: /resumes*res=
Disallow: /resumes/rpc/
Disallow: /resumes*sort=
Disallow: /resumes*start=
Disallow: /rpc/
Disallow: /r/*/pdf
Disallow: /rss
Disallow: /setprefs
Disallow: /*sid=
Disallow: /*sp=0
Disallow: /*&start=
Disallow: /stc
Disallow: /Stellen/CH/
Disallow: /Stellen/title
Disallow: /tellafriend
Disallow: /tmn/ccs/
Disallow: /tos/banner.js
Disallow: /trabajo/
Disallow: /trabajo/AR/
Disallow: /trabajo/CL/
Disallow: /trabajo/CO/
Disallow: /trabajo/MX/
Disallow: /trabajo/title/
Disallow: /url
Disallow: /vacature/
Disallow: /vacature-bekijken?
Disallow: /vacatures/NL/
Disallow: /vacatures/title/
Disallow: /ver-empleo?
Disallow: /ver-emprego?
Disallow: /ver-oferta?
Disallow: /viewjob?
Disallow: /visajobb?
Disallow: /voir-emploi?
Disallow: /Zeige-Job?
Disallow: //applystart

User-agent: Googlebot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Allow: /
Allow: /hire/*?*isid=
Allow: /personeel/*?*isid=
Allow: /reclutamiento/*?*isid=
Allow: /recruiting/*?*isid=
Allow: /recrutement/*?*isid=
Allow: /viewjob?

Disallow: /*rt=nc
Disallow: /*&alid=
Disallow: /*&calert=
Disallow: /*&iafilter=
Disallow: /*&mna=
Disallow: /*?rss
Disallow: /*&serpstart=
Disallow: /*/vaclk
Disallow: /addlLoc/
Disallow: /ads/
Disallow: /advanced_search
Disallow: /alert
Disallow: /api/fetch/mc-anon
Disallow: /api/getrecjobs
Disallow: /applystart
Disallow: /cmp/_/
Disallow: /cmp/_c/claim/*
Disallow: /cmp/_rpc/
Disallow: /cmp/addlink
Disallow: /cmp/addvideo
Disallow: /cmp/Login*
Disallow: /cmp/*/analytics
Disallow: /cmp/*/company-questions
Disallow: /cmp/*/people
Disallow: /cmp/*/write-review
Disallow: /community/
Disallow: /company/*
Disallow: /conversion/
Disallow: /cookiemigrator/
Disallow: /create-resume/lp/
Disallow: /%E4%BB%95%E4%BA%8B?
Disallow: /%E5%B7%A5%E4%BD%9C/
Disallow: /%E5%B7%A5%E4%BD%9C/CN/
Disallow: /%E5%B7%A5%E4%BD%9C/title/
Disallow: /%E6%B1%82%E4%BA%BA/JP/
Disallow: /%E6%B1%82%E4%BA%BA/title/
Disallow: /%E8%81%8C%E4%BD%8D%E6%98%BE%E7%A4%BA?
Disallow: /%EC%B7%A8%EC%97%85/KR/
Disallow: /%EC%B7%A8%EC%97%85/title/
Disallow: /empleo/
Disallow: /emploi/
Disallow: /emplois/FR/
Disallow: /emplois/title/
Disallow: /emprego/
Disallow: /empregos/BR/
Disallow: /empregos/title
Disallow: /forum/profile
Disallow: /g/
Disallow: /graphql
Disallow: /imgping
Disallow: /ita?
Disallow: /ja/clk?
Disallow: /ja/imp.gif
Disallow: /job/
Disallow: /Job/
Disallow: /jobb/
Disallow: /jobb/SE/
Disallow: /jobb/title/
Disallow: /jobroll
Disallow: /jobs/AE/
Disallow: /jobs/AQ/
Disallow: /Jobs/AT/
Disallow: /jobs/AU/
Disallow: /jobs/BE/
Disallow: /jobs/BH/
Disallow: /jobs/CA/
Disallow: /jobs/CZ/
Disallow: /jobs/DE/
Disallow: /Jobs/DE/
Disallow: /jobs/DK/
Disallow: /jobs/FI/
Disallow: /jobs/GB/
Disallow: /jobs/GR/
Disallow: /jobs/HK/
Disallow: /jobs/HU/
Disallow: /jobs/ID/
Disallow: /jobs/IE/
Disallow: /jobs/IL/
Disallow: /jobs/IN/
Disallow: /jobs/KW/
Disallow: /jobs/LU/
Disallow: /jobs/MY/
Disallow: /jobs/NO/
Disallow: /jobs/NZ/
Disallow: /jobs/OM/
Disallow: /jobs/PE/
Disallow: /jobs/PH/
Disallow: /jobs/PK/
Disallow: /jobs/PT/
Disallow: /jobs/QA/
Disallow: /jobs/RO/
Disallow: /jobs/RU/
Disallow: /jobs/SA/
Disallow: /jobs/SG/
Disallow: /jobs/title
Disallow: /Jobs/title
Disallow: /jobs/TR/
Disallow: /jobs/TW/
Disallow: /jobs/US/
Disallow: /jobs/VE/
Disallow: /jobs/ZA/
Disallow: /jobtrends/trends/log/*
Disallow: /jwidget.js*
Disallow: /me/*/pdf
Disallow: /m/basecamp/analytics.js
Disallow: /m/jobalerts?
Disallow: /m/moreLoc
Disallow: /m/newjobs
Disallow: /m/recommended
Disallow: /m/rpc/
Disallow: /my/
Disallow: /*oc=1
Disallow: /ofertas/ES/
Disallow: /ofertas/title/
Disallow: /offerte-lavoro/IT/
Disallow: /offerte-lavoro/title
Disallow: /pagead/
Disallow: /praca/
Disallow: /praca/PL/
Disallow: /praca/title/
Disallow: /preferences
Disallow: /*radius=
Disallow: /rc/
Disallow: /rdr/
Disallow: /recommendedjobs
Disallow: /resumes/account
Disallow: /resumes/advanced?
Disallow: /resumes/alert
Disallow: /resumes*radius=
Disallow: /resumes*rb=
Disallow: /resumes*res=
Disallow: /resumes/rpc/
Disallow: /resumes*sort=
Disallow: /resumes*start=
Disallow: /rpc/
Disallow: /r/*/pdf
Disallow: /rss
Disallow: /setprefs
Disallow: /*sid=
Disallow: /*sp=0
Disallow: /*&start=
Disallow: /stc
Disallow: /Stellen/CH/
Disallow: /Stellen/title
Disallow: /tellafriend
Disallow: /tmn/ccs/
Disallow: /tos/banner.js
Disallow: /trabajo/
Disallow: /trabajo/AR/
Disallow: /trabajo/CL/
Disallow: /trabajo/CO/
Disallow: /trabajo/MX/
Disallow: /trabajo/title/
Disallow: /url
Disallow: /vacature/
Disallow: /vacatures/NL/
Disallow: /vacatures/title/
Disallow: //applystart

User-agent: Bingbot
Allow: /
Allow: /hire/*?*isid=
Allow: /personeel/*?*isid=
Allow: /reclutamiento/*?*isid=
Allow: /recruiting/*?*isid=
Allow: /recrutement/*?*isid=

Disallow: /*rt=nc
Disallow: /*&alid=
Disallow: /*&calert=
Disallow: /*&iafilter=
Disallow: /*&mna=
Disallow: /*?rss
Disallow: /addlLoc/
Disallow: /ads/
Disallow: /advanced_search
Disallow: /alert
Disallow: /api/fetch/mc-anon
Disallow: /api/getrecjobs
Disallow: /applystart
Disallow: /cmp/_/
Disallow: /cmp/_c/claim/*
Disallow: /cmp/_rpc/
Disallow: /cmp/addlink
Disallow: /cmp/addvideo
Disallow: /cmp/Login*
Disallow: /cmp/*/analytics
Disallow: /cmp/*/company-questions
Disallow: /cmp/*/people
Disallow: /cmp/*/write-review
Disallow: /community/
Disallow: /company/*
Disallow: /conversion/
Disallow: /cookiemigrator/
Disallow: /create-resume/lp/
Disallow: /%E4%BB%95%E4%BA%8B?
Disallow: /%E5%B7%A5%E4%BD%9C/
Disallow: /%E5%B7%A5%E4%BD%9C/CN/
Disallow: /%E5%B7%A5%E4%BD%9C/title/
Disallow: /%E6%B1%82%E4%BA%BA/JP/
Disallow: /%E6%B1%82%E4%BA%BA/title/
Disallow: /%E8%81%8C%E4%BD%8D%E6%98%BE%E7%A4%BA?
Disallow: /%EC%B7%A8%EC%97%85/KR/
Disallow: /%EC%B7%A8%EC%97%85/title/
Disallow: /empleo/
Disallow: /emploi/
Disallow: /emplois/FR/
Disallow: /emplois/title/
Disallow: /emprego/
Disallow: /empregos/BR/
Disallow: /empregos/title
Disallow: /forum/profile
Disallow: /g/
Disallow: /graphql
Disallow: /imgping
Disallow: /ita?
Disallow: /ja/clk?
Disallow: /ja/imp.gif
Disallow: /job/
Disallow: /Job/
Disallow: /jobb/
Disallow: /jobb/SE/
Disallow: /jobb/title/
Disallow: /jobroll
Disallow: /jobs/AE/
Disallow: /jobs/AQ/
Disallow: /Jobs/AT/
Disallow: /jobs/AU/
Disallow: /jobs/BE/
Disallow: /jobs/BH/
Disallow: /jobs/CA/
Disallow: /jobs/CZ/
Disallow: /jobs/DE/
Disallow: /Jobs/DE/
Disallow: /jobs/DK/
Disallow: /jobs/FI/
Disallow: /jobs/GB/
Disallow: /jobs/GR/
Disallow: /jobs/HK/
Disallow: /jobs/HU/
Disallow: /jobs/ID/
Disallow: /jobs/IE/
Disallow: /jobs/IL/
Disallow: /jobs/IN/
Disallow: /jobs/KW/
Disallow: /jobs/LU/
Disallow: /jobs/MY/
Disallow: /jobs/NO/
Disallow: /jobs/NZ/
Disallow: /jobs/OM/
Disallow: /jobs/PE/
Disallow: /jobs/PH/
Disallow: /jobs/PK/
Disallow: /jobs/PT/
Disallow: /jobs/QA/
Disallow: /jobs/RO/
Disallow: /jobs/RU/
Disallow: /jobs/SA/
Disallow: /jobs/SG/
Disallow: /jobs/title
Disallow: /Jobs/title
Disallow: /jobs/TR/
Disallow: /jobs/TW/
Disallow: /jobs/US/
Disallow: /jobs/VE/
Disallow: /jobs/ZA/
Disallow: /jobtrends/trends/log/*
Disallow: /jwidget.js*
Disallow: /me/*/pdf
Disallow: /m/basecamp/analytics.js
Disallow: /m/jobalerts?
Disallow: /m/moreLoc
Disallow: /m/newjobs
Disallow: /m/recommended
Disallow: /m/rpc/
Disallow: /m/viewjob?
Disallow: /my/
Disallow: /*oc=1
Disallow: /ofertas/ES/
Disallow: /ofertas/title/
Disallow: /offerta-lavoro
Disallow: /offerte-lavoro/IT/
Disallow: /offerte-lavoro/title
Disallow: /pagead/
Disallow: /poka%C5%BCprac%C4%99?
Disallow: /praca/
Disallow: /praca/PL/
Disallow: /praca/title/
Disallow: /preferences
Disallow: /*radius=
Disallow: /rc/
Disallow: /rdr/
Disallow: /recommendedjobs
Disallow: /resumes/account
Disallow: /resumes/advanced?
Disallow: /resumes/alert
Disallow: /resumes*radius=
Disallow: /resumes*rb=
Disallow: /resumes*res=
Disallow: /resumes/rpc/
Disallow: /resumes*sort=
Disallow: /resumes*start=
Disallow: /rpc/
Disallow: /r/*/pdf
Disallow: /rss
Disallow: /setprefs
Disallow: /*sid=
Disallow: /*sp=0
Disallow: /*&start=
Disallow: /stc
Disallow: /Stellen/CH/
Disallow: /Stellen/title
Disallow: /tellafriend
Disallow: /tmn/ccs/
Disallow: /tos/banner.js
Disallow: /trabajo/
Disallow: /trabajo/AR/
Disallow: /trabajo/CL/
Disallow: /trabajo/CO/
Disallow: /trabajo/MX/
Disallow: /trabajo/title/
Disallow: /url
Disallow: /vacature/
Disallow: /vacature-bekijken?
Disallow: /vacatures/NL/
Disallow: /vacatures/title/
Disallow: /ver-empleo?
Disallow: /ver-emprego?
Disallow: /ver-oferta?
Disallow: /viewjob?
Disallow: /visajobb?
Disallow: /voir-emploi?
Disallow: /Zeige-Job?
Disallow: //applystart

User-agent: ia_archiver
Disallow: /

User-Agent: OmniExplorer_Bot
Disallow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
Allow: /create-resume/lp/
Allow: /*sid=
Disallow: /rpc/

User-agent: adidxbot
Allow: /*sid=

User-agent: Googlebot-Image
Disallow: /trendgraph/*

User-agent: msnbot-media
Disallow: /trendgraph/*

User-agent: 008
Disallow: /

User-agent: JobdiggerSpider
Disallow: /

User-agent: Cliqzbot
Disallow: /

User-agent: GoogleOther
Disallow: / 

User-agent: facebookexternalhit
User-agent: WhatsApp
User-agent: snapchat
User-agent: TelegramBot
User-agent: Twitterbot
Allow: /

User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: FacebookBot
User-agent: AmazonBot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: Baiduspider
User-agent: cohere-training-data-crawler
User-agent: FriendlyCrawler
User-agent: img2dataset
Disallow: /c/info/
Disallow: /hire/how-to-hire/
Disallow: /hire/interview-questions/
Disallow: /recruiting/interviewfragen/
Disallow: /recruiting/stellenbeschreibung/
Disallow: /recrutement/description-du-poste/
Disallow: /recrutement/questions-entretien-dembauche/
Disallow: /personeel/functiebeschrijving/
Disallow: /companies/
Disallow: /cmp/
Disallow: /career-advice/
Disallow: /conselho-de-carreira/
Disallow: /hire/job-description/
Disallow: /karriere-guide/
Disallow: /conseils-carrieres/
Disallow: /conseils-carriere/ 
Disallow: /orientacion-profesional/
Disallow: /porady-zawodowe/
Disallow: /guida-alla-carriera/
Disallow: /職涯貼士/
Disallow: /carrieregids/
Disallow: /karriarrad/
Disallow: /orientacion-laboral/ 
Disallow: /career/
Disallow: /*jobs.html
Disallow: /jobs
Disallow: /viewjob
Disallow: /certifications/

[**Indeed, AU**](https://indeed.com.au) also has very strict rules about crawling its job listings for most User-agents. Those granted broader access are mainly search-engine and social-media bots, such as those from Google, Bing, and Facebook.

In [5]:
# robots.txt file for au.jora.com

In [None]:
User-agent: Googlebot
User-agent: Bingbot
User-agent: Amazonbot
User-agent: Baiduspider
User-agent: Blekkobot
User-agent: DuckDuckBot
User-agent: Ecosia
User-agent: ExaBot
User-agent: facebookexternalhit
User-agent: Yeti/Naverbot
User-agent: Slurp
User-agent: SeznamBot
User-agent: Sogou spider
User-agent: Soso Spider
User-agent: YandexBot
User-agent: TwitterBot
User-agent: Alexabot
User-agent: APIs-Google
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
User-agent: AdsBot-Google-Mobile-Apps
User-agent: DuplexWeb-Google
User-agent: Googlebot-Image
User-agent: Googlebot-News
User-agent: Googlebot-Video
User-agent: AdIdxBot
User-agent: BingPreview
User-agent: AhrefsBot
User-agent: ArchitextSpider
User-agent: Crawler4j
User-agent: RogerBot
User-agent: SEMrushBot
User-agent: ia_archiver # Alexa web-wide crawler
User-agent: Lycos_Spider_(T-Rex)
User-agent: speedy_spider
User-agent: Teoma
User-agent: anthropic-ai
User-agent: Bytespider
User-agent: CCBot
User-agent: Diffbot
User-agent: Google-Extended
User-agent: omgili
User-agent: GPTBot
Disallow: /rpc/
Disallow: /job/rd/
Disallow: /job/
Disallow: /job/description/
Disallow: /job-search/
Disallow: /view-job/
Disallow: /vanity/
Disallow: /(empleo|emploi|vaga|việc|trabajo|งาน|job vacancy)/
Disallow: /rss
Disallow: /style-guide/
Disallow: /iniciar-sesion/
Disallow: /cdn-cgi/
Disallow: /*disallow=true*

User-agent: *
Disallow: /

[**Jora, AU**](https://au.jora.com) bars the listed User-agents from crawling specific paths, especially its job listings, and disallows all other User-agents (`User-agent: *`) from the entire site.

In [5]:
# robots.txt file for linkedin.com

[LinkedIn](https://linkedin.com) publishes a very extensive list specifying which agents can access which pages, showing that its view on crawling is also very strict. However, the owner does advise crawlers to contact them by email to apply for permission, as follows:
>
>**Notice:** If you would like to crawl LinkedIn,
>please email whitelist-crawl@linkedin.com to apply
>for white listing.
>

My rule of thumb on scraping is to always contact the owner of a site if I have a genuine interest in crawling it, irrespective of the purpose. This practice not only avoids breaking the rules but also creates the transparency and integrity that facilitate fair and profitable partnerships among businesses, and it builds an individual's reputation for complying with crawling rules.

So I wrote to [LinkedIn](https://linkedin.com) to ask for permission to crawl the site for educational purposes, but I have not received a reply yet.
[Seek, AU](https://seek.com.au) is my favorite site to look for new opportunities, so I wrote to them for a permit, but the request was rejected due to safety concerns. I totally understand the owners' point of view and I respect that.

Correspondence from Seek regarding the permission:

![Seek_reply](https://i.postimg.cc/RFjkfYLz/Gmail-Other-Redacted.png)

***This anti-scraping exercise is one of the crucial tasks in the data requirements and preparation stages, as it determines where the data can be sourced and its quality.***

In [8]:
# Before extracting any data, revisit the anti-scraping rules. It is important
# to double-check them by inspecting response.status_code, with try/except
# handling any errors and exceptions raised by the request.
import requests

def check_anti_scraping(URL):
    try:
        # Request a response from the URL with requests.get(); the timeout
        # stops the call from hanging indefinitely on an unresponsive server
        response = requests.get(URL, timeout=10)
        # status_code is an attribute of the response object
        if response.status_code == 403:  # a very clear anti-scraping signal
            print("Access Forbidden (403): The User-agent is disallowed.")
        elif response.status_code == 429:  # this could happen with social media
            print("Too Many Requests (429): User-agent is rate-limited.")
        elif response.status_code != 200:  # 200 means a successful response,
            # so any other code signals an issue; print the unexpected
            # status code for further assessment
            print(f"Unexpected status code: {response.status_code}")
        else:
            # The status is 200, so check the response content for
            # other anti-scraping indicators
            content = response.text.lower()  # text is an attribute of the response object
            # Crude heuristic: "bot" also matches harmless words such as "bottom"
            if "captcha" in content or "bot" in content:
                print("The website may have anti-scraping measures (CAPTCHA or bot detection).")
            else:  # no indicator was found; this does not prove permission,
                # so robots.txt and the site's terms still apply
                print("The website appears to allow scraping.")
    # requests can raise exceptions such as connection errors,
    # timeout errors or HTTP errors
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")

# To see the result, we call the function after defining it.
# Wrapping the check in a function saves time and adds flexibility:
# multiple websites can be checked repeatedly without having to
# rewrite or paste the whole code set.
# Target URLs
URL_seek       = 'https://www.seek.com.au/jobs-in-healthcare-medical/in-All-Sydney-NSW/full-time/on-site?daterange=3&salaryrange=50000-200000&salarytype=annual'
URL_indeed     = 'https://au.indeed.com/jobs?q=data+scientist&l=Sydney+NSW&radius=50&sc=0kf%3Aattr%28CF3CP%29%3B&from=searchOnDesktopSerp&vjk=ba2dce9f8e032c78'
URL_jora       = 'https://au.jora.com/j?disallow=true&jt=3&l=Sydney+NSW&q=data+scientist&sp=facet_job_type'
URL_linkedin   = 'https://www.linkedin.com/jobs/search?trk=guest_homepage-basic_guest_nav_menu_jobs&position=1&pageNum=0'
URL_fake_job   = 'https://realpython.github.io/fake-jobs/'


In [None]:
#Seek
check_anti_scraping(URL_seek)
#Indeed
check_anti_scraping(URL_indeed)
#Jora
check_anti_scraping(URL_jora)
#LinkedIn
check_anti_scraping(URL_linkedin)
#Fakejob-Python-page
check_anti_scraping(URL_fake_job)

**These results are consistent with the findings from each site's '/robots.txt'.**

### What parameters should be included in the job search output?

- To help the agency gauge the quality of the job vacancies and then allocate the opportunities to eligible applicants with confidence, I suggest presenting the output as a listing in which each row is a job vacancy followed by its associated details, as follows:

| Job title | Company | Location | Job description | Posting date | Closing date | Salary | URL link |
|:-----:|:----------:|:--------:|:------------:|:-------:|:--------:|:-----:|:---------:|
| Data scientist / Data engineer | IBM Coursera | Sydney NSW 2020 | - Hybrid work | 11/07/2025 | 30/07/2025 | $135,000 pa excluding super, leave loading | https://_ _ _ |
| | | | - To join a business analytics team | | | | |
| | | | - To build machine learning models for clients in the banking industry | | | | |

* These parameters are not guaranteed to be obtained in full, due to anti-scraping rules, privacy policies, and the businesses' own restrictions.
* However, I will do my best to make sure the project is well managed and delivered successfully, on time and within budget.
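One way to represent a row of this output in Python is sketched below. It is only an illustration under my own assumptions: the class name `JobPosting`, the field names, and the helper `postings_to_csv` are hypothetical, and the example values are placeholders. Every detail that may be blocked or missing is optional, which reflects the caveat above that not all parameters can always be obtained.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class JobPosting:
    """One row of the proposed output; field names are illustrative."""
    job_title: str
    company: str
    location: str
    job_description: Optional[str] = None
    posting_date: Optional[str] = None
    closing_date: Optional[str] = None
    salary: Optional[str] = None
    url: Optional[str] = None


def postings_to_csv(postings):
    """Serialise postings to CSV text; missing details become empty cells."""
    buffer = io.StringIO()
    column_names = [field.name for field in fields(JobPosting)]
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for posting in postings:
        writer.writerow(asdict(posting))
    return buffer.getvalue()


# Placeholder values, not real scraped data
example = JobPosting(
    job_title="Data scientist",
    company="Example Pty Ltd",
    location="Sydney NSW",
)
csv_text = postings_to_csv([example])
print(csv_text)
```

Keeping the record definition separate from the scraping code means the same structure can be filled from any of the target sites.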

## Code choices

This section of the project conveys the coding concepts and my approach to building the web-scraping tools. The code sets are presented with detailed explanations, along with warnings and suggestions, to guide the agency's staff in using the scraping tools to capture as many good opportunities as possible for the clients.

For the design and display of the code sets, I will adhere to the 'Task-by-task guide' provided in the project introduction, as I think the document contains useful suggestions for the code sets and links to helpful sources.


## Task 1

### Import your important libraries here 

In [7]:
# pandas for organising the scraped data in a DataFrame
import pandas as pd
# datetime for getting the current date
import datetime
# requests for sending HTTP requests to the website
import requests
# BeautifulSoup for parsing the HTML source code of the webpage
from bs4 import BeautifulSoup
# time for introducing a delay in our program
import time

# This Jupyter notebook was created in an Anaconda Navigator environment
# that I named 'scrapy_env', in which the Scrapy framework is installed.
# The environment supports the 'requests' and BeautifulSoup modules as well as
# Scrapy spiders.

# json and csv for writing the scraped data to JSON and CSV files
import json
import csv

## Task 2

### Figuring out different components within an URL

When entering a job posting site such as [Seek, AU](https://seek.com.au), choosing a job category, for example 'All Accounting', and a location, for example 'All Sydney NSW' (shown below), and then clicking Search, the following URL was shown:

"https://www.seek.com.au/jobs-in-accounting/in-All-Sydney-NSW".

  
The URL consists of three components:
- The scheme: 'https://'
- The host (the base of the URL): 'www.seek.com.au'
- The path: '/jobs-in-accounting/in-All-Sydney-NSW'
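The three components can be confirmed with `urllib.parse.urlparse` from the Python standard library. This is a small sketch that only splits the string and sends no request to the site. Note that `urlparse` reports the scheme without the `://` separator.

```python
from urllib.parse import urlparse

url = "https://www.seek.com.au/jobs-in-accounting/in-All-Sydney-NSW"

# urlparse splits the string into named parts without contacting the site
parts = urlparse(url)

print("scheme:", parts.scheme)  # the protocol, without '://'
print("host:  ", parts.netloc)  # the host, also called the network location
print("path:  ", parts.path)    # everything after the host
```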


### Generating the URL (dynamic parameters)

I then changed the job category to Healthcare and Medical and added other parameters, including the job type, the work mode, the pay range, and how long the job has been listed, as below:


![Seek_job_search](https://i.postimg.cc/zvRXpdp5/Screenshot-2025-07-18-at-2-44-37-pm.png)

The URL then changed to:

'https://www.seek.com.au/jobs-in-healthcare-medical/in-All-Sydney-NSW/full-time/on-site?daterange=3&salaryrange=50000-200000&salarytype=annual'

The base of the URL, 'https://www.seek.com.au/', the job category 'jobs-in-...' and the location, 'in-...' are the ***static components*** of the URL.

However, the URL also contains components of a ***dynamic type***.

- They are the ***query parameters***, which start with **"?"** followed by the **input information** in a [key=value] format, such as [daterange=3], [salaryrange=50000-200000] and [salarytype=annual].

- Each parameter is separated by **"&"** known as the **separator**.
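The query parameters described above can be read and rebuilt with the standard library, as in the sketch below. The helper name `build_search_url` and its arguments are my own assumptions for illustration. The code only manipulates strings and does not send any request to the site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = (
    "https://www.seek.com.au/jobs-in-healthcare-medical/in-All-Sydney-NSW/"
    "full-time/on-site?daterange=3&salaryrange=50000-200000&salarytype=annual"
)

# Read the dynamic part: everything after '?' becomes key -> list of values
params = parse_qs(urlsplit(url).query)
print(params)


def build_search_url(host, path_segments, query_params):
    """Assemble a search URL from static path segments and dynamic parameters."""
    path = "/" + "/".join(path_segments)
    query = urlencode(query_params)  # joins the key=value pairs with '&'
    return urlunsplit(("https", host, path, query, ""))


rebuilt = build_search_url(
    "www.seek.com.au",
    ["jobs-in-healthcare-medical", "in-All-Sydney-NSW", "full-time", "on-site"],
    {"daterange": 3, "salaryrange": "50000-200000", "salarytype": "annual"},
)
print(rebuilt)
```

Changing a value in the dictionary, for example `daterange`, produces a new search URL without editing the string by hand.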

  
Websites like [Seek](https://seek.com.au) often have many pages and I think to  properly scrape the contents of [Seek](seek.com.au), I need to consider whether the scraping tool meets this scale and scope. The project instruction and resources does direct me to use the BeautifulSoup but I am curious about another possible tool and wonder whether I should consider using it. 

- I knew that the Beautiful Soup module parses the HTML response fetched via requests.get(), and I came across the Scrapy framework in one of the learning materials.
- As I was curious about the difference, I asked the AI-integrated Coach on the Coursera platform for help. The AI model provided useful information, and I summarised the comparison between the BeautifulSoup module and the Scrapy framework in **Table 1** below.

**Table 1 - A comparison between BeautifulSoup 4 and Scrapy**

| Feature          | BeautifulSoup 4                                                     | Scrapy                               |
|:----------------:|:--------------------------------------------------------------------|:-------------------------------------|
| **Purpose**      |- A library for parsing HTML & XML docs                              |- A full-fledged web scraping framework |
|                  |- Used for extracting data from webpages                             |- Designed for **large-scale** web scraping projects|
| **Usage**        |- For smaller projects such as **a few pages**                       |- Scraping **multiple pages** or entire websites|
|                  |- With **requests.get(url)** to fetch web pages                      |- Provides built-in support for handling requests, following links, and storing scraped data|
| **Features**     |- Simple and easy to use for navigating & searching the parse tree   |- Asynchronous processing: faster for scraping large amounts of data|
|                  |- Permits easy manipulation of the HTML/XML structure                |- Built-in support for handling cookies, sessions, and user agents|
|                  |- Good for poorly formatted HTML                                     |- Offers a robust pipeline for processing and storing scraped data|
|                  |                                                                     |- Includes tools for pagination and data cleaning|

| **Similarities** | BeautifulSoup 4 and Scrapy                                          |
|:-----------------|:--------------------------------------------------------------------|
|                  |- Web scraping: both are used to extract data from web pages|
|                  |- HTML parsing: both can parse HTML and XML documents, allowing users to navigate and search through the parse tree |
|                  |- Python libraries: both are written in Python and can be easily integrated into Python projects|

From **Table 1**, BeautifulSoup and the Scrapy framework are similar in functionality: both are used to extract data from webpages, both can parse HTML and XML files into their constituents in the parse tree, and both are written in Python. However, *Scrapy* is designed for **large-scale** web scraping projects with multiple pages, whilst *BeautifulSoup*, though easy to use, is intended for smaller projects with a few pages. BeautifulSoup also lacks a pipeline for processing and storing the scraped data. Within the Scrapy framework, a Spider written in Python can crawl through the pages, extract the data and export it as JSON files, which can be conveniently opened in a Jupyter notebook.
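
Because BeautifulSoup has no built-in pipeline, the cleaning and storing steps have to be written by hand. The sketch below is one possible way to do that with the standard `json` module, not what Scrapy does internally. The function names `clean_record` and `export_json`, the sample records and the output file name are all assumptions made for illustration.

```python
import json
import os
import tempfile


def clean_record(record):
    """Strip surrounding whitespace from every string value in a scraped record."""
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()}


def export_json(records, path):
    """Write the cleaned records to a JSON file and return how many were written."""
    cleaned = [clean_record(record) for record in records]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, indent=2, ensure_ascii=False)
    return len(cleaned)


# Sample records with untidy whitespace, as raw scraped text often has
scraped = [
    {"job_title": "  Senior Python Developer ", "location": "\n Stewartbury, AA\n"},
    {"job_title": "Energy engineer", "location": " Christopherville, AA "},
]

# Hypothetical output file, placed in the system's temporary folder
out_path = os.path.join(tempfile.gettempdir(), "jobs_sketch.json")
count = export_json(scraped, out_path)
print(f"Wrote {count} records to {out_path}")
```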

### Coming back to BeautifulSoup

Out of curiosity, I attempted the Scrapy framework for this project and had intended to attach the details of that attempt here. However, after checking the project requirements and other suggestions from the resources, I decided to include my BeautifulSoup attempt this time and to leave the Scrapy framework for another attempt, hopefully in a new project.

### Back to Task 2

### Generating the URL (static parameters)

In [1]:
### The above information can be used to work out the static and the dynamic features
# of the page you will scrape, and it also helps identify the right tags in Task 4
### This is to capture the parameters: the position (job title) and the location

## For a static URL
def generate_static_url(job_title):
    base_url = "https://realpython.github.io/fake-jobs/jobs/"
    # Format the job title to create a valid URL segment
    formatted_title = job_title.lower().replace(" ", "-")  # Replace spaces with hyphens
    url = f"{base_url}{formatted_title}-0.html"  # Append the formatted title and a static identifier
    return url

# Example usage
job_title = "Senior Python Developer"
url = generate_static_url(job_title)
print("Generated URL:", url)

Generated URL: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
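
The function above hard-codes the `-0` suffix, which matches only the first job card. On the listing page, the Apply links end in `-0.html` for the first card and `-1.html` for the second, so the suffix appears to be the card's position. The sketch below passes that position in as a parameter. The function name and the rule of turning every run of non-alphanumeric characters into one hyphen are my assumptions, not something the site documents.

```python
import re

BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def generate_static_url_indexed(job_title, index):
    # Lower-case the title and collapse every run of characters that is not
    # a letter or a digit into a single hyphen (an assumed slug rule)
    slug = re.sub(r"[^a-z0-9]+", "-", job_title.lower()).strip("-")
    return f"{BASE_URL}{slug}-{index}.html"


print(generate_static_url_indexed("Senior Python Developer", 0))
print(generate_static_url_indexed("Energy engineer", 1))
```

Both generated URLs agree with the Apply links of the first two job cards on the listing page.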


### Generating the URL (dynamic parameters)

In [16]:
### For a dynamic URL, using Indeed as an example


def generate_job_search_url(query, location, source, job_id):
    base_url = "https://au.indeed.com/jobs?"
    url = (f"{base_url}q={query}&l={location}&from={source}&vjk={job_id}")
    return url

#### For example
url = generate_job_search_url("data+scientist", "All+Sydney+NSW", "searchOnHP%2Cwhereautocomplete", "35eecd1df181ecb1")
print("Generated URL:", url)

Generated URL: https://au.indeed.com/jobs?q=data+scientist&l=All+Sydney+NSW&from=searchOnHP%2Cwhereautocomplete&vjk=35eecd1df181ecb1
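
The function above expects values that are already encoded (for example `data+scientist` and `%2C`). As an alternative sketch, `urlencode` from the standard `urllib.parse` module can do the encoding, so the inputs can be written as plain text. The helper name `build_search_url` is an assumption for illustration.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    # urlencode escapes spaces as '+' and reserved characters such as ',' as %2C
    return f"{base_url}?{urlencode(params)}"


params = {
    "q": "data scientist",
    "l": "All Sydney NSW",
    "from": "searchOnHP,whereautocomplete",
    "vjk": "35eecd1df181ecb1",
}
url = build_search_url("https://au.indeed.com/jobs", params)
print("Generated URL:", url)
```

The parameters are passed as a dictionary rather than as keyword arguments because `from` is a reserved word in Python.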


### Going back to BeautifulSoup

## Task 3

### Constructing a hierarchy tree to identify the different elements, tags and contents in an HTML file for collecting the right details

In [30]:
URL = 'https://realpython.github.io/fake-jobs/'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
html_string = str(soup)  # convert soup to a str so that it can be split on line breaks
html_lines = html_string.split('\n')
first_70_lines = html_lines[:70]  # I want to investigate the first 70 lines to get an idea

# To make it easy for myself and everyone to follow, I numbered each line in the output
html_output = "<ol>\n"
for index, line in enumerate(first_70_lines, start=1):  # Start numbering from 1
    html_output += f"<li>{index}. {line}</li>\n"  # Include the line number
html_output += "</ol>"

# Print the HTML output
print(html_output)


<ol>
<li>1. <!DOCTYPE html></li>
<li>2. </li>
<li>3. <html></li>
<li>4. <head></li>
<li>5. <meta charset="utf-8"/></li>
<li>6. <meta content="width=device-width, initial-scale=1" name="viewport"/></li>
<li>7. <title>Fake Python</title></li>
<li>8. <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/></li>
<li>9. </head></li>
<li>10. <body></li>
<li>11. <section class="section"></li>
<li>12. <div class="container mb-5"></li>
<li>13. <h1 class="title is-1"></li>
<li>14.         Fake Python</li>
<li>15.       </h1></li>
<li>16. <p class="subtitle is-3"></li>
<li>17.         Fake Jobs for Your Web Scraping Journey</li>
<li>18.       </p></li>
<li>19. </div></li>
<li>20. <div class="container"></li>
<li>21. <div class="columns is-multiline" id="ResultsContainer"></li>
<li>22. <div class="column is-half"></li>
<li>23. <div class="card"></li>
<li>24. <div class="card-content"></li>
<li>25. <div class="media"></li>
<li>26. <div class="media-left"></li>
<li>
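
Numbered lines show the order of the markup but not the nesting. As a complementary sketch, the function below walks the parsed tree and prints one indented line per tag. The function name `tag_tree` and the shortened sample HTML are assumptions for illustration; the sample only imitates the structure of the page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A shortened, hand-written imitation of the listing page structure
SAMPLE_HTML = """
<div class="columns is-multiline" id="ResultsContainer">
  <div class="column is-half">
    <div class="card">
      <h2 class="title is-5">Senior Python Developer</h2>
      <p class="location">Stewartbury, AA</p>
    </div>
  </div>
</div>
"""


def tag_tree(node, depth=0):
    """Return one indented line per tag, showing its name and classes."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes such as whitespace
            classes = ".".join(child.get("class", []))
            label = f"{child.name}.{classes}" if classes else child.name
            lines.append("  " * depth + label)
            lines.extend(tag_tree(child, depth + 1))
    return lines


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tree_lines = tag_tree(sample_soup)
print("\n".join(tree_lines))
```

Calling `tag_tree(soup)` on the real page would give the same kind of outline for the whole document.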

In [31]:
# After the subtitle line on line 17

soup.find("p",  class_="subtitle is-3")

<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>

In [32]:
# After the subtitle, the rest of the site is about job openings, so the whole block of HTML after the subtitle should be filtered
# by id="ResultsContainer" via the find() method (line 21)

listings = soup.find(id='ResultsContainer')

As shown below, inside the ResultsContainer we can see a repeating code unit that starts with `<div class="column is-half">` and ends with `</div>`.

- The portion of the ***listings*** below shows two units of the repeating blocks
(1) line 22
```
<div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Senior Python Developer
      </h2>
      <h3 class="subtitle is-6 company">
       Payne, Roberts and Davis
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Stewartbury, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">
      Apply
     </a>
    </footer>
   </div>
  </div>
 </div>
 ```
(2) line 51
 ```
  <div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Energy engineer
      </h2>
      <h3 class="subtitle is-6 company">
       Vasquez-Davidson
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Christopherville, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html" target="_blank">
      Apply
     </a>
    </footer>
   </div>
  </div>
 </div>
 ```

If we treat each repeating unit as one job card, we can collect all the cards by taking them one at a time; this is called looping.

### Extract the Job Data from a single job posting card

In [33]:
# soup contains div block elements; among them, the div with class "card" gives us all the details
# I will print out 2 cards for us to examine

job_cards = soup.find_all("div", class_='card')
for i, job in enumerate(job_cards):  # go through the cards one by one,
    # but only print the top 2 cards to have a look
    if i < 2:
        print(job, end='\n'*3)

<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>
<div class="content">
<p class="location">
        Stewartbury, AA
      </p>
<p class="is-small has-text-grey">
<time datetime="2021-04-08">2021-04-08</time>
</p>
</div>
<footer class="card-footer">
<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
</footer>
</div>
</div>


<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48

Inspecting each job card, we can easily identify:

- The **Job title** as a string in the **'h2'** tag
  ```
  <h2 class="title is-5">Senior Python Developer</h2>
  ```
  
- The **company name** as a string in the **'h3'** tag
  ```
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  ```
  
- The **location** as a string in the **'p'** tag
  ```
  <p class="location">
        Christopherville, AA </p>
  ```
- The **posted date** as a string in the **'time'** tag
  ```
  <time datetime="2021-04-08">2021-04-08</time>
  ```
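
The four fields identified above can be pulled out of every card in one loop. The sketch below is a minimal illustration on a hand-shortened copy of the two cards shown earlier, so it runs without a network request. The variable names are assumptions, and reading the date from the `datetime` attribute is one of two possible choices (the tag text holds the same value).

```python
from bs4 import BeautifulSoup

# Hand-shortened copies of the first two job cards
CARDS_HTML = """
<div class="card">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
        Stewartbury, AA
      </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">
        Christopherville, AA
      </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""

cards_soup = BeautifulSoup(CARDS_HTML, "html.parser")

rows = []
for card in cards_soup.find_all("div", class_="card"):
    rows.append({
        "job_title": card.find("h2", class_="title").get_text(strip=True),
        "company": card.find("h3", class_="company").get_text(strip=True),
        "location": card.find("p", class_="location").get_text(strip=True),
        "posted_date": card.find("time")["datetime"],
    })

for row in rows:
    print(row)
```

`get_text(strip=True)` removes the line breaks and indentation that surround the location text in the raw HTML.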

In [34]:
# For the job description, we need to request the job's own page and parse its HTML with BeautifulSoup again to gain access to the job summary
job_html = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'
job_html_content = requests.get(job_html)
soup_job = BeautifulSoup(job_html_content.text, 'html.parser')
soup_job

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="box">
<h1 class="title is-2">Senior Python Developer</h1>
<h2 class="subtitle is-4 company">Payne, Roberts and Davis</h2>
<div class="content">
<p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inc

In [35]:
soup_job.find("div", class_="content").find('p').text.strip()

'Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.'
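
The chained call above raises an `AttributeError` if a page has no `<div class="content">`, because the first `find()` then returns `None`. The sketch below shows one defensive way to write the same lookup. The helper name `extract_description` and the two tiny HTML samples are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def extract_description(page_soup, default="Description not available"):
    """Return the first paragraph inside <div class="content">, or a default."""
    content = page_soup.find("div", class_="content")
    paragraph = content.find("p") if content is not None else None
    return paragraph.get_text(strip=True) if paragraph is not None else default


# One sample with a description and one without, to exercise both branches
with_text = BeautifulSoup(
    '<div class="content"><p> Professional asset web application. </p></div>',
    "html.parser",
)
without_text = BeautifulSoup('<div class="box"></div>', "html.parser")

print(extract_description(with_text))
print(extract_description(without_text))
```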

In [37]:
#### The tool is designed to search with keywords on 'Job Titles'
# The input can be anything related to the common job title vocabulary
# The location is optional: the tool can still search if no location is entered
### However, if you want to search for a specific, exact job title, please use the following
# to extract that job title word for word
# from the list of HTML elements containing job titles
#
# Find the relevant tags (e.g., <a> tags with class 'job-title')
# The find_all method targets the 'a' tag, as this is an anchor tag that is
# linked to the page of the job ad, and in that tag element we specify class 'job-title'

atags = soup.find_all('a', class_='job-title')  # be flexible and change accordingly
                                                # Adjust the tag and class as needed
# Print the job titles, for us to see the tag list 
for tag in atags:
    print(tag.text.strip())
    
# test them out to see if the job-titles are extracted, remember atags is a list,
# use the index atags[index] to pick the right link
#### Otherwise, just search using keywords
try:
    job_title = atags[0].text.strip()  # Attempt to extract the job title
except IndexError:
    job_title = ""  # If an IndexError occurs, set job_title to an empty string
# Continue with your code
print("Job Title:", job_title)

# The listings is for every job title contained in the ResultsContainer and
# We will loop through it

Job Title: 


In [38]:
# From the previous section, I found that the 'h2' tag with class 'title' returns a list of all the job titles
atags = soup.find_all('h2', class_='title')

for index, tag in enumerate(atags):
    print(f"Index: {index}, Tag: {tag}")

    
# test them out to see if the job-titles are extracted, remember atags is a list,
# use the index atags[index] to pick the right link



Index: 0, Tag: <h2 class="title is-5">Senior Python Developer</h2>
Index: 1, Tag: <h2 class="title is-5">Energy engineer</h2>
Index: 2, Tag: <h2 class="title is-5">Legal executive</h2>
Index: 3, Tag: <h2 class="title is-5">Fitness centre manager</h2>
Index: 4, Tag: <h2 class="title is-5">Product manager</h2>
Index: 5, Tag: <h2 class="title is-5">Medical technical officer</h2>
Index: 6, Tag: <h2 class="title is-5">Physiological scientist</h2>
Index: 7, Tag: <h2 class="title is-5">Textile designer</h2>
Index: 8, Tag: <h2 class="title is-5">Television floor manager</h2>
Index: 9, Tag: <h2 class="title is-5">Waste management officer</h2>
Index: 10, Tag: <h2 class="title is-5">Software Engineer (Python)</h2>
Index: 11, Tag: <h2 class="title is-5">Interpreter</h2>
Index: 12, Tag: <h2 class="title is-5">Architect</h2>
Index: 13, Tag: <h2 class="title is-5">Meteorologist</h2>
Index: 14, Tag: <h2 class="title is-5">Audiological scientist</h2>
Index: 15, Tag: <h2 class="title is-5">English as a 

In [39]:
# Index 26 will give us Data Scientist job
try:
    job_title = atags[26].text.strip()  # Attempt to extract the job title
except IndexError:
    job_title = ""  # If an IndexError occurs, set job_title to an empty string
# Continue with your code
print("Job Title:", job_title)

Job Title: Data scientist
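
A hard-coded index such as 26 only works while the page keeps the same order. As a sketch of an alternative, the helper below searches a list of titles by keyword and returns the position it found. The name `find_title` is an assumption, and the short list stands in for the titles that could be built in the notebook with `[tag.text.strip() for tag in atags]`.

```python
def find_title(titles, keyword):
    """Return (index, title) of the first title containing keyword, or (None, "")."""
    needle = keyword.lower()
    for index, title in enumerate(titles):
        if needle in title.lower():
            return index, title
    return None, ""


# The first three titles from the listing, used here as sample data
titles = ["Senior Python Developer", "Energy engineer", "Legal executive"]

print(find_title(titles, "engineer"))
print(find_title(titles, "astronaut"))
```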


### Task 4 

### Define the main function

In [40]:
# Define the URL of the page to scrape
URL = "https://realpython.github.io/fake-jobs/"

# Request the page
response = requests.get(URL)

if response.status_code == 200:
    print("Success", 'code=', response.status_code)
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    print(f"Unexpected status code: {response.status_code}")
    exit()  # Exit if the request fails

# All job cards sit inside the element with id="ResultsContainer"
listings = soup.find(id="ResultsContainer")


def job_crawling(job_title_keyword, location_keyword=None):
    job_data = []  # Initialize job data list
    if listings is None:
        print("No listings found. Please check the HTML structure of the page.")
        return job_data

    def heading_matches(text):
        # text is None when an <h2> does not hold a single string
        if not text:
            return False
        lowered = text.lower()
        # Both keywords are compared against the <h2> (job title) text
        return job_title_keyword.lower() in lowered or bool(
            location_keyword and location_keyword.lower() in lowered
        )

    job_find = listings.find_all('h2', string=heading_matches)
    # Climb from each <h2> up to the enclosing <div class="card-content">
    cards = [h2_element.parent.parent.parent for h2_element in job_find]

    for job_card in cards:
        try:
            job_title = job_card.find("h2", class_="title").text.strip()
            company = job_card.find("h3", class_="company").text.strip()
            location = job_card.find("p", class_="location").text.strip()
            posted_date = job_card.find("time").text.strip()
            link_url = job_card.find_all("a")[1]["href"]  # the second link is "Apply"

            # Get the job description from the job's own page
            job_html = requests.get(link_url)
            if job_html.status_code == 200:
                soup_job = BeautifulSoup(job_html.content, 'html.parser')
                job_description = soup_job.find("div", class_="content").find('p').text.strip()
            else:
                job_description = "Description not available"

            job_data.append({
                'job_title': job_title,
                'company': company,
                'location': location,
                'posted_date': posted_date,
                'job_description': job_description,
                'link_url': link_url
            })
        except Exception as e:
            print(f"Error processing job card: {e}")

    return job_data


# Call the function
job_data = job_crawling('scientist', 'sydney')

now = datetime.datetime.now()
print('A search was conducted @', URL, 'on', now, end='\n' * 3)

# Create DataFrame and save to CSV
output_file = 'job_listings_scientist.csv'
job_listings = pd.DataFrame(job_data)
job_listings.to_csv(output_file, index=False)

print(f"Successfully saved {len(job_listings)} job listings to {output_file}")

Success code= 200
A search was conducted @ https://realpython.github.io/fake-jobs/ on 2025-07-20 21:06:58.183931


Successfully saved 6 job listings to job_listings_scientist.csv
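
In `job_crawling`, the location keyword is compared against the `<h2>` text, which holds the job title, so a keyword such as 'sydney' is effectively a second title keyword. If the intention is to narrow the results by place, one possible refinement is to filter the collected records on their own fields afterwards. The sketch below is an assumption about that intention, not the behaviour of the run above; `filter_jobs` and the sample records are illustrative.

```python
def filter_jobs(records, title_keyword, location_keyword=None):
    """Keep records whose title contains title_keyword and, when a location
    keyword is given, whose location contains that keyword as well."""
    title_kw = title_keyword.lower()
    location_kw = location_keyword.lower() if location_keyword else None
    matches = []
    for record in records:
        if title_kw not in record["job_title"].lower():
            continue
        if location_kw and location_kw not in record["location"].lower():
            continue
        matches.append(record)
    return matches


# Sample records in the same shape as the dictionaries built by job_crawling
sample = [
    {"job_title": "Senior Python Developer", "location": "Stewartbury, AA"},
    {"job_title": "Energy engineer", "location": "Christopherville, AA"},
]

print(filter_jobs(sample, "engineer"))
print(filter_jobs(sample, "engineer", "stewartbury"))
```

With this approach the location stays optional: leaving it out returns every record that matches the title keyword.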


### Task 5

#### A conclusion about Process and key findings

I have finally reached this stage, the conclusion of the project. What a journey it has been for me in using Python to build the web-scraping tool for the recruitment agency in particular and for business and future employers in general. The project has allowed me to apply effectively what I learnt about Python from the data science course(s) to solve real-world problems. The process was not so straightforward, as I mentioned in the title of this notebook; it was an excursion into web-scraping with Python, and I took a detour into the Scrapy framework to find out if that approach could help. I also spent quite a lot of my time on the anti-scraping topic because I think it is very important to be compliant rather than being sorry down the track. I want to demonstrate that the work I deliver is of quality and the highest ethical standards, always. For example, I contacted Seek.com.au and LinkedIn seeking their permission to scrape. This was the right thing to do.


From the beginning of this project, I spent a fair bit of time thinking about the scenario and enlarged it in a practical sense so that I could gain a deeper understanding and be more connected to the project. I exercised critical thinking by referring back to the data methodology by John Rollins because the methodology truly teaches data scientists to think while doing. I find the methodology most practical, yet applying it requires us to think further and customise it for each data project.


During this Python excursion, I not only learnt how to apply Python but also connected the theory with the practice of my Python knowledge. I believe this demonstrates that my skills match the proficient level of a Python user for such a project. I read every document suggested in the instructions and followed every link, which further enhanced my knowledge and helped me identify new resources for my future projects and career development. I gratefully acknowledge the AI Coach on the Coursera learning platform. The Coach is an amazing model for learning and teaching, especially for coding.


As technology advances, I am always open to continued learning so as to remain current and to work more efficiently, an open mindset being an integral way forward. Whilst aiming to become a Data Scientist, I cannot go past the presence and the need of AI. It is for this reason that I am also growing my AI Generalist skills. They complement each other incredibly well. Such skills and capability will equip me to advise organisations on their AI journey, including the integration likely to lead to their desired transformation, whilst uplifting productivity. This course and this project have helped me to code with confidence, something new as I didn’t code a lot in my previous Scientist roles. I better understand Python, the code sets and the coding principles. By using the AI Coach and applying my critical thinking and experience with (writing) prompts, I was able to piece together all components and elements to deliver the project.


I am a PhD with a background in biochemistry and genomics. My science journey has been quite eventful, and I previously worked mainly in research laboratories and with -omics data. I am curious and serious about a career in data science and I look forward to an exciting opportunity.