# Exercise - How to Build A Web Crawler

Web crawlers are also called "spiders".

## Objective: Crawling [Indeed.com](https://www.indeed.com) for software developer jobs in Dallas, TX
Starting URL: https://www.indeed.com/jobs?q=software%20developer&l=Dallas,%20TX&start=0  *(start=# goes up by increments of 10)*
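
The `start` offset can be generated instead of typed by hand. Below is a minimal sketch, assuming ten results per page as noted above. The helper name `build_search_urls` and its parameters are illustrative and not part of the original notebook.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_urls(query, location, num_pages, page_size=10):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page_index in range(num_pages):
        params = {"q": query, "l": location, "start": page_index * page_size}
        # quote_via=quote encodes spaces as %20; safe="," keeps the comma
        # literal, which matches the starting URL shown above.
        urls.append(BASE_URL + "?" + urlencode(params, quote_via=quote, safe=","))
    return urls


urls = build_search_urls("software developer", "Dallas, TX", num_pages=3)
for url in urls:
    print(url)
```

Taking a page count here, rather than a maximum offset, makes it harder to confuse "pages" with "results".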

In [1]:
import requests # one of the ways to connect to websites via Python
from bs4 import BeautifulSoup # allows you to go through page source and get data

### Main Spider Function

In [72]:
# max_pages is compared against the `start` offset, not a page count:
# job_pages_spider(10) requests start=0 and start=10 (two results pages)
def job_pages_spider(max_pages):
    page = 0  # `start` offset; each results page advances it by 10

    while page <= max_pages:
        url = "https://www.indeed.com/jobs?q=software%20developer&l=Dallas,%20TX&start="+str(page)

        # GET request; returns a Response object for the page
        source_code = requests.get(url)

        # .text is the response body (the page's HTML source) decoded to a string
        plain_text = source_code.text

        # parse the HTML into a searchable tree; naming the parser avoids bs4's "no parser specified" warning
        soup = BeautifulSoup(plain_text, "html.parser")

        # TASK: Gather all job titles & their URLs (inspect the page to find unique class/id names)
        for link in soup.find_all('a', {'class': 'jcs-JobTitle'}): # same as the CSS selector a.jcs-JobTitle (an <a> that has this class)
            job_title = link.get_text().strip()
            job_href = link.get('href')
            print(job_title + " " + "https://www.indeed.com"+job_href + "\n")

        # Increment the offset by 10 to move to the next results page
        page += 10


In [73]:
job_pages_spider(10)

Staff Software Engineer - Interview Scheduling Team https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0CiRNM7CVr8YueLFKlzwbFWI0o7IjV438l4sVrvKZ0flpURU_mqoI8E-VxPfg2eTCGE0XMt_o0TuNAuIilnB6fQMQFwZDFV9_Yv7vEwlFKnZeZwY3Fr2nchFF2I9Aer3Xih3kyYsI_vwvwQHg46FNMbWAyBPlyolR972r5CIqFfdF25AlJK_p7wU0QdBir-cabFWD3lbKGd318WQgjBshs1VkeT5zsqJcByTlSPe61yGxbkczYg3ypdzxgerlHCM8D-S6tx8-L4k8_So8Aob7ImNXQkOCHOCjKl0a02ZqOthS4MU_m7qdZKu6u_N8MTMT0lZDLvp9XLlm6Qjh1RwZpgBqyd5bJphLxSNSRWPc6dIYqu1VRNOZXiPdTk4Y-mADyHfJoSTvnHBX_hddBuzpr3pYxNtmxW3JL8vOVZQ9EVm78tCqaVs82USzOwnePUzKxvLkAzroKtGO6AlAjnr4PfOgnlkgRU51puohCVjLWjO5whSkBV68ynWs3Xp7MLDiMjEI6ItMSsYJXOV0rn6jAzif7kAFdVRsGOEu7xSPQpRQ25W4-ohdu0J3Qn_7-tWLI=&p=0&fvj=0&vjs=3

Software Developer (.NET) Opportunities for Entry-Level Cand... https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0CgC63lnUQVJUwrbKOtHhMUe_Phd-DOzVl9NhGqIw_FIy82NPGo905VFf-PfPrqpp43W7YtOhz_iRGcncNecKe-9XYwrwmeEHMHqPDkCpDuuWMwWMypUtaTN1it7QbtQo448tQ1PITr8f8LNGxB1xpqdcDlTFG-yIsSo07yFwoidhMzAZ2TssVHj2
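
The title-and-link extraction in `job_pages_spider` can be separated from the network request so it can be checked against a small HTML string. This is a sketch only: `extract_job_links` is a hypothetical helper, and `SAMPLE_HTML` with its `href` values is made up for illustration. The real page markup is more complex and may change.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written stand-in for a results page (assumption, not real Indeed markup).
SAMPLE_HTML = """
<div>
  <a class="jcs-JobTitle" href="/rc/clk?jk=abc123"> Junior Software Developer </a>
  <a class="other-link" href="/companies">Companies</a>
  <a class="jcs-JobTitle" href="/pagead/clk?ad=xyz">Software Engineer</a>
</div>
"""


def extract_job_links(html, base_url="https://www.indeed.com"):
    """Return (title, absolute_url) pairs for every a.jcs-JobTitle element."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", class_="jcs-JobTitle"):
        title = link.get_text(strip=True)
        href = link.get("href")
        if href:  # skip anchors without an href instead of crashing on None
            results.append((title, urljoin(base_url, href)))
    return results


jobs = extract_job_links(SAMPLE_HTML)
for title, url in jobs:
    print(title, url)
```

`urljoin` resolves relative links the same way the string concatenation does, and it also leaves an already absolute `href` untouched.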

- - - -

### Dynamic Web Crawler

Follow each listing URL and print the job title found on the listing page (targets ONE CSS class):

In [161]:
# max_pages is compared against the `start` offset, not a page count
def job_pages_spider2(max_pages):
    page = 0  # `start` offset; each results page advances it by 10

    while page <= max_pages:
        url = "https://www.indeed.com/jobs?q=software%20developer&l=Dallas,%20TX&start="+str(page)

        # GET request; returns a Response object for the page
        source_code = requests.get(url)

        # .text is the response body (the page's HTML source) decoded to a string
        plain_text = source_code.text

        # parse the HTML into a searchable tree
        soup = BeautifulSoup(plain_text, "html.parser")


        # TASK: Gather all job titles & their URLs (inspect the page to find unique class/id names)
        for link in soup.find_all('a', {'class': 'jcs-JobTitle'}): # same as the CSS selector a.jcs-JobTitle (an <a> that has this class)
            job_title = link.get_text().strip()
            job_href = "https://www.indeed.com" + link.get('href')
            get_single_job_posting_data(job_href) # crawl the listing page with the function defined in the next cell
            # print(job_title + " " + job_href + "\n")

        # Increment the offset by 10 to move to the next results page
        page += 10


In [162]:
def get_single_job_posting_data(job_url):
    # GET request; returns a Response object for the listing page
    source_code = requests.get(job_url)

    # .text is the response body (the page's HTML source) decoded to a string
    plain_text = source_code.text

    # parse the HTML into a searchable tree
    soup = BeautifulSoup(plain_text, "html.parser")

    # print the text of every <h1> that has the job-title class
    for jobtitle in soup.find_all('h1', {'class': 'jobsearch-JobInfoHeader-title'}):
        print(jobtitle.get_text())
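
Every listing triggers its own request, so crawling two results pages sends roughly thirty extra requests in quick succession. Below is a hedged sketch of a more defensive fetch step. The one-second pause and ten-second timeout are arbitrary assumptions, and `fetch_html`, its parameters, and the stub objects are hypothetical names that do not appear in the notebook.

```python
import time

import requests


def fetch_html(url, get=requests.get, timeout_seconds=10, delay_seconds=1.0):
    """Fetch a page politely and return its HTML, or None on failure.

    `get` is injectable so the function can be exercised without network access.
    """
    time.sleep(delay_seconds)  # pause before every request to limit load on the site
    try:
        response = get(url, timeout=timeout_seconds)
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    except requests.RequestException as error:
        print(f"Skipping {url}: {error}")
        return None
    return response.text


# Offline demonstration with stand-ins for requests.get (no network needed).
class FakeResponse:
    text = "<h1>Junior Software Developer</h1>"

    def raise_for_status(self):
        pass  # pretend the server answered 200 OK


def fake_get(url, timeout):
    return FakeResponse()


def failing_get(url, timeout):
    raise requests.ConnectionError("simulated network failure")


html = fetch_html("https://example.com/job/1", get=fake_get, delay_seconds=0)
missing = fetch_html("https://example.com/job/2", get=failing_get, delay_seconds=0)
print(html)
print(missing)
```

In the crawler, `requests.get(job_url)` followed by `.text` could be swapped for one call to such a helper, with listings that return `None` skipped.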
    

In [163]:
job_pages_spider2(10)

Software Developer (.NET) Opportunities for Entry-Level Candidates
Junior Software Developer
Entry Level Software Engineer
Entry-level Software Engineer
Software Engineering - Multiple Openings
Assistant Software Developer
Entry Level Software Engineers
Backend Software Developer
Software Engineer
Software Engineer II
Software Developer Associate
Associate Software Quality Engineer
Jr. Software Engineer - Java
Software Engineer (US REMOTE)
Seasonal Employee - Software Engineer
Software Developer
Software Developer Associate
Entry Level Software Engineers
Seasonal Employee - Software Engineer
Java Spring Boot Developer- 4287421
Flutter Developer (Remote)
Junior Software Developer
Software Engineer
Software Engineer
Software Engineering - Multiple Openings
Software Developer (Salesforce)
Software Developer - Hybrid Remote Options
Assistant Software Developer
Software Developer
Entry Level Java Developer
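
Several titles in the output above appear more than once. Titles alone cannot show whether these are the same posting, since different companies can use the same title. One option is to deduplicate on a more specific key such as the listing URL. This is a minimal sketch with a hypothetical helper `unique_in_order`, shown on titles only because the URLs are not part of this output.

```python
def unique_in_order(items, key=lambda item: item):
    """Drop items whose key was already seen, keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        marker = key(item)
        if marker not in seen:
            seen.add(marker)
            unique.append(item)
    return unique


# A few titles copied from the output above.
titles = [
    "Junior Software Developer",
    "Software Engineer",
    "Software Developer Associate",
    "Software Developer Associate",
    "Junior Software Developer",
    "Software Engineer",
]
print(unique_in_order(titles))
```

With `(title, url)` pairs, passing `key=lambda pair: pair[1]` would deduplicate by URL instead.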


- - - -

Side project: getting both job title & company (targets MULTIPLE CSS classes on one element):

In [158]:
# max_pages is compared against the `start` offset, not a page count
def job_pages_spider3(max_pages):
    page = 0  # `start` offset; each results page advances it by 10

    while page <= max_pages:
        url = "https://www.indeed.com/jobs?q=software%20developer&l=Dallas,%20TX&start="+str(page)

        # GET request; returns a Response object for the page
        source_code = requests.get(url)

        # .text is the response body (the page's HTML source) decoded to a string
        plain_text = source_code.text

        # parse the HTML into a searchable tree
        soup = BeautifulSoup(plain_text, "html.parser")


        # TASK: Gather all job titles & their URLs (inspect the page to find unique class/id names)
        for link in soup.find_all('a', {'class': 'jcs-JobTitle'}): # same as the CSS selector a.jcs-JobTitle (an <a> that has this class)
            job_title = link.get_text().strip()
            job_href = "https://www.indeed.com" + link.get('href')
            get_single_job_posting_data2(job_href) # crawl the listing page with the function defined in the next cell
            # print(job_title + " " + job_href + "\n")

        # Increment the offset by 10 to move to the next results page
        page += 10


In [159]:
def get_single_job_posting_data2(job_url):
    # GET request; returns a Response object for the listing page
    source_code = requests.get(job_url)

    # .text is the response body (the page's HTML source) decoded to a string
    plain_text = source_code.text

    # parse the HTML into a searchable tree
    soup = BeautifulSoup(plain_text, "html.parser")

    # job header block that lists job title, rating (if applicable), company name, location
    jobHeader = soup.find_all('div', {'class': 'jobsearch-DesktopStickyContainer'})

    for x in jobHeader:
        ### GET JOB TITLE
        job_title = x.find('h1', {'class': 'jobsearch-JobInfoHeader-title'})
        print(job_title.get_text())

        ### GET COMPANY
        # select() always returns a list of every matching element, even when only one matches
        # (IMPORTANT) note: both classes targeted must be on the same element (e.g. <p class='class1 class2'></p>)
        job_company = x.select('div.icl-u-lg-mr--sm.icl-u-xs-mr--xs')

        # on these pages the second match (index 1) holds the company name;
        # index 0 printed only the div tags, with no text
        print(job_company[1].get_text() + "\n")
        #print(x)
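
The multi-class selector and the hard-coded index can be tried on a small HTML string. The sketch below assumes a simplified header in which the first matching `div` is empty. `SAMPLE_HEADER` is hand-written for illustration and is not a copy of Indeed's markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a job header (assumed structure).
SAMPLE_HEADER = """
<div class="jobsearch-DesktopStickyContainer">
  <h1 class="jobsearch-JobInfoHeader-title">Junior Software Developer</h1>
  <div class="icl-u-lg-mr--sm icl-u-xs-mr--xs"></div>
  <div class="icl-u-xs-mr--xs icl-u-lg-mr--sm">Revature</div>
  <div class="icl-u-lg-mr--sm">Dallas, TX</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HEADER, "html.parser")
header = soup.select_one("div.jobsearch-DesktopStickyContainer")

# Compound selector: a <div> carrying BOTH classes, in any order.
# The last <div> has only one of the two classes, so it is not matched.
matches = header.select("div.icl-u-lg-mr--sm.icl-u-xs-mr--xs")
print(len(matches))

# select() returns a list, so take the first match that actually has text
# instead of relying on an index that may not exist on every page.
company = next(
    (m.get_text(strip=True) for m in matches if m.get_text(strip=True)),
    None,
)
print(company)

# find() returns None when nothing matches, so guard before get_text().
title_tag = header.find("h1", class_="jobsearch-JobInfoHeader-title")
title = title_tag.get_text(strip=True) if title_tag else None
print(title)
```

Picking the first non-empty match avoids an `IndexError` on pages where fewer than two elements match. Whether that rule fits every listing page is an assumption that would need checking against real pages.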
    

In [160]:
job_pages_spider3(10)

Software Developer (.NET) Opportunities for Entry-Level Candidates
Quotum Technologies, Inc

Junior Software Developer
Revature

Entry Level Software Engineer
Revature

Entry-level Software Engineer
Cognizant Technology Solutions

Software Engineering - Multiple Openings
amdocs

Assistant Software Developer
Fujitsu

Entry Level Software Engineers
SkillStorm

Backend Software Developer
IBM

Software Engineer
Microsoft

Software Engineer II
Microsoft

Software Developer Associate
PNC Financial Services Group

Associate Software Quality Engineer
Vizient, Inc.

Jr. Software Engineer - Java
JPMorgan Chase Bank, N.A.

Software Engineer (US REMOTE)
Splunk

Seasonal Employee - Software Engineer
State Farm

Junior Software Developer
Brooksource

Entry Level Software Engineer
Neo Prism Solutions

Associate Software Quality Engineer
Vizient, Inc.

Enterprise Systems Developer III-Lead Software Engineer
University of Texas at Dallas

Seasonal Employee - Software Engineer
State Farm

Full Stack Sof