# Scraping data from 'TimesJobs' website using BeautifulSoup

Objective of the Project:

This project scrapes job postings from a website using Python's BeautifulSoup library. The extracted data is formatted and printed in the notebook for further use. The program asks the user to enter a skill they are not familiar with, so that posts requiring that skill are filtered out of the results. The posts are also filtered to show only the jobs that were posted a few days ago.

Additionally, each matching post is saved to its own text file, and the files are rewritten automatically every 10 minutes so that any changes made on the website are captured in the scraped data.

Libraries Used: BeautifulSoup (with the lxml parser), requests, time

In [1]:
#Importing the required libraries.
from bs4 import BeautifulSoup
import requests
import time

In [2]:
#Input the skill that needs to be filtered out.
print("Select a skill that you are not familiar with:")
unknown_skill = input('>')
print(f'Filtering Out: {unknown_skill}')

Select a skill that you are not familiar with:
>django
Filtering Out: django


In [3]:
#Fetching data from the website using requests.get().
html_text = requests.get('https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=Microsoft+SQL&txtLocation=').text
soup = BeautifulSoup(html_text, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).loader_config={licenseKey:"2aa1afabd7",applicationID:"57254935"};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var i=e[n]={exports:{}};t[n][0].call(i.exports,function(e){var i=t[n][1][e];return r(i||e)},i,i.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(t,e,n){function r(){}function i(t,e,n){return function(){return o(t,[u.now()].concat(f(arguments)),e?null:this,n),e?void 0:this}}var o=t("handle"),a=t(8),f=t(9),c=t("ee").get("tracer"),u=t("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var d=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],p="api-",l=p+"ixn-";a(d,function(t,e){s[e]=i(p+e,!0,"api")}),s.addPageAction=i(p+"addPageAction",!0),s.setCurrentRouteName=i(p+"routeName",!0),e.exports=
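
The search URL above is hard-coded for the keywords "Microsoft SQL". As a minimal sketch, the same URL can be assembled from its query parameters with the standard library, which makes the keywords easy to change. The helper name `build_search_url` and the constant `BASE_URL` are hypothetical; the parameter names are copied from the URL used in this notebook, and whether the site accepts other values for them is an assumption.

```python
from urllib.parse import urlencode

# Base address taken from the URL used in this notebook.
BASE_URL = 'https://www.timesjobs.com/candidate/job-search.html'


def build_search_url(keywords, location=''):
    """Build a search URL (hypothetical helper, not part of the notebook)."""
    params = {
        'searchType': 'personalizedSearch',
        'from': 'submit',
        'txtKeywords': keywords,
        'txtLocation': location,
    }
    # urlencode() escapes each value and encodes spaces as '+'.
    return f'{BASE_URL}?{urlencode(params)}'


url = build_search_url('Microsoft SQL')
print(url)
```

With the keywords 'Microsoft SQL' and an empty location, this reproduces the URL passed to requests.get() in the cell above.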

In [13]:
#Defining the print_jobs() function to print the filtered posts in a readable format.
def print_jobs():
    html_text = requests.get('https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=Microsoft+SQL&txtLocation=').text
    soup = BeautifulSoup(html_text, 'lxml')
    #Each job post on the results page is an <li> element with these classes.
    listed_jobs = soup.find_all('li', class_ = 'clearfix job-bx wht-shd-bx')
    for index, job in enumerate(listed_jobs):
        publishing_date = job.find('span', class_ = 'sim-posted').span.text
        #Keep only recent posts, i.e. those whose posting date text contains 'few'.
        if 'few' in publishing_date:
            company_name = job.find('h3', class_ = 'joblist-comp-name').text.replace(' ','')
            listed_skills = job.find('span', class_ = 'srp-skills').text.replace(' ','')
            added_info = job.header.h2.a['href']
            #Case-sensitive substring check: skip posts that list the unknown skill.
            if unknown_skill not in listed_skills:
                print(f'Company Name: {company_name.strip()}')
                print(f'Required Skills: {listed_skills.strip()}')
                print(f'More Info: {added_info} \n')
            #Printed for every recent post, including filtered ones; no file is written here.
            print(f'File Saved: {index} \n')

In [14]:
print_jobs()

Company Name: BRINGLEACADEMY
Required Skills: MicrosoftSql,trainingmanuals,developtraining
More Info: https://www.timesjobs.com/job-detail/microsoft-sql-db-trainer-bringle-academy-mumbai-0-to-3-yrs-jobid-qsrw0__SLASH__zhNNtzpSvf__PLUS__uAgZw==&source=srp 

File Saved: 0 

Company Name: alliancerecruitmentagency
Required Skills: performancetuning,disasterrecovery,recovery,sql,sqldatabaseadministrator,database
More Info: https://www.timesjobs.com/job-detail/microsoft-sql-database-administrator-alliance-recruitment-agency-canada-5-to-8-yrs-jobid-9NxQHscE__SLASH__HRzpSvf__PLUS__uAgZw==&source=srp 

File Saved: 1 

Company Name: alliancerecruitmentagency
Required Skills: performancetuning,disasterrecovery,recovery,sql,sqldatabaseadministrator,database
More Info: https://www.timesjobs.com/job-detail/microsoft-sql-database-administrator-alliance-recruitment-agency-canada-5-to-8-yrs-jobid-s9rrt7apQxZzpSvf__PLUS__uAgZw==&source=srp 

File Saved: 2 

Company Name: Talendrone
Required Skills: sql
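
The check `unknown_skill not in listed_skills` used above is a case-sensitive substring test, so 'sql' also matches 'mysql', and 'django' does not match 'Django'. Entering an empty string filters out every post, because an empty string is contained in any string. A possible alternative, sketched here under the assumption that skills are comma-separated and should be compared as whole names, is to split the string first. The helpers `parse_skills` and `requires_skill` are hypothetical and not part of the notebook; they expect the skills text before its spaces are stripped.

```python
def parse_skills(raw_skills):
    """Split a comma-separated skills string into clean, lowercase names."""
    skills = []
    for part in raw_skills.split(','):
        # Collapse runs of whitespace and ignore letter case.
        name = ' '.join(part.split()).lower()
        if name:
            skills.append(name)
    return skills


def requires_skill(raw_skills, unknown_skill):
    """Return True if the post lists the unknown skill as a whole name."""
    wanted = ' '.join(unknown_skill.split()).lower()
    return wanted in parse_skills(raw_skills)


print(parse_skills('\n  Microsoft Sql , training manuals ,\n'))
print(requires_skill(' Python , Django ,  rest api ', 'django'))
print(requires_skill('mysql, database', 'sql'))
```

In print_jobs(), the condition could then read `if not requires_skill(skills_text, unknown_skill):`, where `skills_text` stands for the text of the 'srp-skills' span without the `.replace(' ','')` call.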

In [4]:
#Defining the search_jobs() function to save the filtered jobs in separate text files.
def search_jobs():
    html_text = requests.get('https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=Microsoft+SQL&txtLocation=').text
    soup = BeautifulSoup(html_text, 'lxml')
    listed_jobs = soup.find_all('li', class_ = 'clearfix job-bx wht-shd-bx')
    for index, job in enumerate(listed_jobs):
        publishing_date = job.find('span', class_ = 'sim-posted').span.text
        #Keep only recent posts, i.e. those whose posting date text contains 'few'.
        if 'few' in publishing_date:
            company_name = job.find('h3', class_ = 'joblist-comp-name').text.replace(' ','')
            listed_skills = job.find('span', class_ = 'srp-skills').text.replace(' ','')
            added_info = job.header.h2.a['href']
            if unknown_skill not in listed_skills:
                #The target folder must already exist; open() does not create it.
                with open(f'/Users/User/Desktop/Posts/{index}.txt','w') as t:
                    t.write(f'Company Name: {company_name.strip()} \n')
                    t.write(f'Required Skills: {listed_skills.strip()} \n')
                    t.write(f'More Info: {added_info}')
                print(f'File Saved: {index}')
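
Because open() fails with FileNotFoundError when the target folder is missing, it may help to create the folder before writing. The following is a minimal sketch of the file-writing step on its own, using pathlib; the helper `save_post` and the example values are hypothetical, and a temporary folder stands in for the Desktop path used above.

```python
import tempfile
from pathlib import Path


def save_post(directory, index, company_name, listed_skills, link):
    """Write one job post to '<directory>/<index>.txt' and return the path."""
    folder = Path(directory)
    # Create the folder (and any missing parents) if it does not exist yet.
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f'{index}.txt'
    lines = [
        f'Company Name: {company_name.strip()}',
        f'Required Skills: {listed_skills.strip()}',
        f'More Info: {link}',
    ]
    path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return path


# Demo with made-up values, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_post(Path(tmp) / 'Posts', 0, ' ExampleCorp ', ' sql,python ',
                      'https://example.com/job')
    content = saved.read_text(encoding='utf-8')
print(content)
```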

In [5]:
#Running a while loop so that the search_jobs() function runs every 10 minutes and keeps our fetched
#data up to date.
if __name__ == '__main__':
    while True:
        search_jobs()
        refresh = 10
        print(f'Refreshing in {refresh} minutes...')
        time.sleep(refresh*60)

File Saved: 0
File Saved: 1
File Saved: 2
File Saved: 3
File Saved: 4
File Saved: 5
File Saved: 17
Refreshing in 10 minutes...


KeyboardInterrupt: 
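
The loop above runs until the kernel is interrupted, which is what produces the KeyboardInterrupt shown in the output. A possible variation, sketched here, catches the interrupt and optionally stops after a fixed number of runs. The helper `run_periodically` and its `max_runs` and `sleep` parameters are hypothetical; `sleep` is injectable only so that the demo does not actually wait.

```python
import time


def run_periodically(task, interval_minutes, max_runs=None, sleep=time.sleep):
    """Call task() repeatedly, pausing interval_minutes between calls.

    Stops after max_runs calls (or never, if max_runs is None) and
    returns the number of completed runs.
    """
    runs = 0
    try:
        while max_runs is None or runs < max_runs:
            task()
            runs += 1
            if max_runs is not None and runs >= max_runs:
                break  # No need to wait after the final run.
            print(f'Refreshing in {interval_minutes} minutes...')
            sleep(interval_minutes * 60)
    except KeyboardInterrupt:
        # Exit cleanly instead of showing a traceback.
        print('Stopped by user.')
    return runs


# Demo: a stand-in task and a fake sleep that only records the wait time.
calls = []
waits = []
completed = run_periodically(lambda: calls.append('run'), 10,
                             max_runs=3, sleep=waits.append)
print(completed, waits)
```

In the notebook, the call would be `run_periodically(search_jobs, 10)`, which keeps the original behaviour of refreshing every 10 minutes until interrupted.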