## Introduction
When I started searching for data science opportunities after finishing my master's program, I naturally began with Indeed.com, a popular American job search website, but I immediately noticed a problem with my search results. Whenever I searched for jobs related to data science and machine learning, a number of unrelated jobs would come up, such as "biochemical engineer", "tech support assistant", and even "sales representative", none of which I am interested in. At one point, I had to click through 3 pages of results before I saw my first data science job, and even that wasn't what I was looking for, as the company wanted a Senior Data Scientist...

## Goal
To make my life easier, I decided to apply what I had learned just a couple of weeks earlier and build my own data scraper to surface the jobs that would help during the application process. The program outputs a .json and a .csv file containing all of the jobs that are relevant to a young data science professional looking to enter the industry, such as myself.

## Data Scraping Script

Import the necessary libraries that will be used during this process.

In [1]:
import json
import string
import requests
from urllib.parse import urljoin 
from bs4 import BeautifulSoup, Tag 
import re
import csv
from tqdm import tqdm
import pyexcel as pe
import datetime

Specify the base URL, create a request to retrieve data from the base URL, and create an HTML parser using BeautifulSoup. During this project, I will be scraping jobs located in San Francisco, California and will sort these jobs by the date that they were posted.

In [2]:
base_url = "https://www.indeed.com/jobs?q=data+scientist&l=San+Francisco&sort=date"
req = requests.get(base_url)
soup = BeautifulSoup(req.text, "html.parser")
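
The query string in the base URL can also be assembled from its parts rather than typed by hand. The sketch below is one way to do that with the standard library. The helper name build_search_url is hypothetical and not part of the original script.

```python
from urllib.parse import urlencode

def build_search_url(query, location, sort="date"):
    """Hypothetical helper: build an Indeed search URL from its parts."""
    # urlencode escapes each value and encodes spaces as '+'
    params = urlencode({"q": query, "l": location, "sort": sort})
    return "https://www.indeed.com/jobs?" + params

example_url = build_search_url("data scientist", "San Francisco")
print(example_url)
```

Changing the job title or city then only requires changing the arguments.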

By inspecting the HTML of the base URL, we can locate the main container that holds all of the job listings of interest. This ignores all other links and text that are irrelevant to what we want. Using our HTML parser, we find the table cell (the td element) whose ID is 'resultsCol' and call this block of HTML "main".

In [3]:
main = soup.find('td', {'id' : 'resultsCol'})

The function below handles a single page on Indeed. Its output is a list, named 'job_list', in which each entry is a dictionary containing the name of the company, the title of the job, the location, how long ago the job was posted, and a link to the posting.

Comments within the function itself explain the code in more detail.

In [4]:
def getSinglePage(main = main):
    
    # Specify the job title that is wanted
    regex_job_title = "[Dd]ata [Ss]cien"
    
    
    # All of the constraints that we use on our query.
    # Note that since we do NOT want jobs with these names (Senior, Manager, Lead, etc.), each regex uses a
    # negative lookahead, (?!...), so it only matches titles that do not contain the word
    regex_constraint1 = "^((?![Ss]enior).)*$"
    regex_constraint2 = "^((?![Mm]anage).)*$"
    regex_constraint3 = "^((?![Pp][Hh]d).)*$"
    regex_constraint4 = "^((?![Ll]ead).)*$"
    regex_constraint5 = "^((?![Dd]irector).)*$"
    regex_constraint6 = "^((?![Pp]rincipal).)*$"
    # Word boundaries make this match 'Sr' only as a standalone word (e.g. 'Sr.' or 'Sr')
    regex_constraint7 = r"^((?!\b[Ss][Rr]\b).)*$"
    regex_constraint8 = "^((?![Cc]hief).)*$"
    
    # Put all of the constraints into a list
    regex_constraints = [regex_constraint1, regex_constraint2, regex_constraint3, regex_constraint4,
        regex_constraint5, regex_constraint6, regex_constraint7, regex_constraint8]
    
    # Initialize a list that will eventually contain of all of the jobs for this specific page
    job_list = []

    # Iterate through all of the jobs using the HTML parser and regular expressions
    # Each of the 10 jobs listed on the page has a class of 'row result'
    # The name of the company, the location, title, and age are all extracted from the HTML
    for jobs in main.find_all('div', {'class': re.compile(r"row result")}):
        company = jobs.find('span', {'class': 'company'}).text.strip()
        location = jobs.find('span', {'class': 'location'}).text.strip()
        title = jobs.find('h2', {'class': 'jobtitle'}).text.strip()
        age = jobs.find('span', {'class': 'date'}).text.strip()
        

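        # This if statement takes care of the case in which the same company posts multiple jobs
        # job_list holds dictionaries, so compare against the company names already stored
        if any(job['company'] == company for job in job_list):
            company = company + "*"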
            
        # This variable will be used in the following if statement in order to count how many constraints this specific job title matches
        constraint_counter = 0
        
        # This if statement first checks whether the title actually matches the job title that we are looking for
        # If the title of the job matches the job that we are looking for (e.g. Data Scientist), it goes into the following for-loop
        if re.search(regex_job_title, title):
            
            # The for-loop iterates through the constraint list.
            # If the title doesn't contain that specific excluded word, it adds 1 to the constraint counter
            for constraint in regex_constraints:
                if re.search(constraint, title):
                    constraint_counter += 1
                    
        # Checks if the constraint counter satisfies all constraints (e.g. does not contain 'Manager')
        # If true, it gets the URL and appends the company name, title, location, age, and URL to the job_list
        if constraint_counter == len(regex_constraints):
            url = "https://www.indeed.com" + str(jobs.find('h2', {'class': 'jobtitle'}).a['href'])
            job_list.append({'company': company, 'title': title, 'location': location, 'age': age, 'url': url})
            
            
            
    return(job_list)
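
To see how the exclusion patterns behave, the sketch below applies the same negative-lookahead shape to a few made-up job titles (the titles are illustrative, not scraped data). A pattern of the form ^((?!X).)*$ matches only when X appears nowhere in the string.

```python
import re

# Same shape as the constraints above: match only if "Senior"/"senior" is absent
no_senior = re.compile(r"^((?![Ss]enior).)*$")

# Illustrative titles, not real scraped data
sample_titles = ["Data Scientist", "Senior Data Scientist", "Junior Data Scientist"]

# Keep only the titles that the pattern accepts
allowed = [title for title in sample_titles if no_senior.search(title)]
print(allowed)
```

Only the title containing "Senior" is dropped. The other two pass through unchanged.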

For the first page (as mentioned in the Introduction), no data science jobs were found that match what we're looking for, as shown below:

In [5]:
getSinglePage()

[]

To show a page that does contain what we are looking for, I'll output the 5th page of results by changing the base URL (the start parameter advances by 10 per page, so start=40 is the 5th page). From there, I need to re-initialize our request, the HTML parser, and the main block of HTML. Note that this is only done to demonstrate the above code. We will restore everything to its original state immediately afterwards.

In [None]:
# Change base URL to the 5th page (start=40) in order to show results.
base_url = "https://www.indeed.com/jobs?q=data+scientist&l=San+Francisco&sort=date&fromage=last&start=40"
req = requests.get(base_url)
soup = BeautifulSoup(req.text, "html.parser")
main = soup.find('td', {'id' : 'resultsCol'})

getSinglePage(main)

[{'age': '1 day ago',
  'company': 'Electronic Arts',
  'location': 'Redwood City, CA 94065',
  'title': 'Research Data Scientist',
  'url': 'https://www.indeed.com/rc/clk?jk=b6cc4e352a182b68&fccid=617d7f961cfcf54a&vjs=3'}]

Now that we've seen what an observation looks like, I'll restore everything to its original state.

In [None]:
# Change the base URL back to its original value
base_url = "https://www.indeed.com/jobs?q=data+scientist&l=San+Francisco&sort=date"
req = requests.get(base_url)
soup = BeautifulSoup(req.text, "html.parser")
main = soup.find('td', {'id' : 'resultsCol'})

Using the previous function, we now iterate through the specified number of pages to build the list 'allJobs'. This list will contain all jobs matching our criteria across those pages. At every iteration, a new request, HTML parser, and main block need to be initialized.

In [None]:
def getMultiplePages(numPages = 10):
    original_url = base_url + "&start="
    allJobs = []
    for i in range(0, numPages):
        newPage = original_url + str(i * 10)
        newReq = requests.get(newPage)
        newSoup = BeautifulSoup(newReq.text, "html.parser")
        newPage_main = newSoup.find('td', {'id' : 'resultsCol'})
        # Add this page's matching jobs to the running list
        page_jobs = getSinglePage(newPage_main)
        allJobs.extend(page_jobs)

    return(allJobs)
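
The paging logic is simple arithmetic: page i (counting from zero) starts at result offset i * 10. Here is a small sketch, assuming 10 results per page as in the function above. The helper page_urls is hypothetical. It only builds the URLs and makes no requests.

```python
def page_urls(base_url, num_pages, results_per_page=10):
    """Hypothetical helper: list the URL of each results page without fetching it."""
    return [
        base_url + "&start=" + str(page * results_per_page)
        for page in range(num_pages)
    ]

search_url = "https://www.indeed.com/jobs?q=data+scientist&l=San+Francisco&sort=date"
urls = page_urls(search_url, 3)
for url in urls:
    print(url)
```

Printing the URLs first is a cheap way to confirm the offsets before sending any requests.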

Here is an example of the types of data science jobs that we are specifically looking for in the first 15 pages.

In [None]:
getMultiplePages(numPages = 15)

Now that we can scrape multiple Indeed pages, I created a function to write the results to a .JSON file. As input, the function takes the list that contains every job (which can be generated by the getMultiplePages() function) and the name of the .JSON file that you want to write to.

In [None]:
def createJSON(job_list, output_name):
    with open(output_name, 'w') as f:
        json.dump(job_list, f, indent=4)
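
As a quick check, a file written this way can be read back with json.load. The sketch below writes one made-up record to a temporary directory and loads it again. The record and the file name are placeholders.

```python
import json
import os
import tempfile

# Placeholder record with the same keys that the scraper produces
sample_jobs = [{"company": "Example Co", "title": "Data Scientist",
                "location": "San Francisco, CA", "age": "1 day ago",
                "url": "https://www.indeed.com/"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_jobs.json")
    # Same call as in createJSON
    with open(path, "w") as f:
        json.dump(sample_jobs, f, indent=4)
    # Read the file back to confirm nothing was lost
    with open(path) as f:
        loaded = json.load(f)

print(loaded == sample_jobs)
```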

Using the PyExcel library and the previously created .JSON file, this function then creates a .CSV file that can be opened in Excel. When opening the .CSV file, the names of the columns will be "Company Name", "Job Title", "Location", "Date Posted", and "URL". The inputs needed to run this function are the name of the previously created .JSON file and the name of the .CSV file that you would like to create.

In [None]:
def createCSV(json_file, output_file):
    # The with statement closes the file automatically
    with open(json_file) as json_data:
        d = json.load(json_data)
    allJobs = []
    headers = ['Company Name', 'Job Title', 'Location', 'Date Posted', 'URL']
    allJobs.append(headers)
    for i in range(len(d)):
        newJob = []
        newJob.append(d[i]['company'])
        newJob.append(d[i]['title'])
        newJob.append(d[i]['location'])
        newJob.append(d[i]['age'])
        newJob.append(d[i]['url'])
        allJobs.append(newJob)

    sheet = pe.Sheet(allJobs)
    sheet.save_as(output_file)
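
If PyExcel is not available, the same table can be produced with Python's built-in csv module. This is an alternative sketch, not the method used above. The name jobs_to_csv_text is hypothetical and the sample record is made up.

```python
import csv
import io

def jobs_to_csv_text(jobs):
    """Hypothetical helper: render job dictionaries as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Same column headers as in createCSV
    writer.writerow(["Company Name", "Job Title", "Location", "Date Posted", "URL"])
    for job in jobs:
        writer.writerow([job["company"], job["title"], job["location"],
                         job["age"], job["url"]])
    return buffer.getvalue()

# Made-up record with the same keys that the scraper produces
sample = [{"company": "Example Co", "title": "Data Scientist",
           "location": "San Francisco, CA", "age": "1 day ago",
           "url": "https://www.indeed.com/"}]
csv_text = jobs_to_csv_text(sample)
print(csv_text)
```

The csv writer quotes fields that contain commas, so a location such as "San Francisco, CA" stays in a single column.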

Because this script can be run multiple times, I found it helpful to also include the date on which each scrape took place.

In [None]:
date = datetime.datetime.now()
month_day_year = str(date.month) + "_" + str(date.day) + "_" + str(date.year)
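
A similar stamp can be built with strftime, which zero-pads the month and day, so the result differs slightly from the string concatenation above (03_07_2018 rather than 3_7_2018). The sketch uses a fixed, made-up date so that the result is reproducible:

```python
import datetime

# Fixed example date so the result does not depend on when this runs
example_date = datetime.datetime(2018, 3, 7)

# %m and %d are zero-padded, %Y is the four-digit year
stamp = example_date.strftime("%m_%d_%Y")
print(stamp)  # 03_07_2018
```

Zero-padded stamps keep the width of the date constant across file names.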

Putting everything together now:

In [None]:
allJobs = getMultiplePages(30)
createJSON(allJobs, "Indeed_jobs_" + month_day_year + ".json")
createCSV("Indeed_jobs_" + month_day_year + ".json", "Indeed_jobs_" + month_day_year + ".csv")

And we're done! Check your directory now for both a .JSON file and a .CSV file that lists all of the data science jobs!

## Conclusion
In conclusion, this personal project really helped cement what I was taught in my Applied Data Mining (Natural Language Processing) lecture just a few weeks ago. The final output is something that I think will be very usable in the near future and could help simplify the job application process for me. In the future, I hope to come back to this code and find a better way to deal with the constraints that were specified using regular expressions. Overall, however, I'm very pleased with my results and hope you are too!
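
One possible direction for the constraint handling is to fold the unwanted words into a single case-insensitive pattern. This is only a sketch of that idea, not the approach used in the script. The helper is_relevant is hypothetical, re.IGNORECASE is broader than the first-letter-only matching above, and the word boundary around "sr" is my own assumption.

```python
import re

# One case-insensitive pattern listing every unwanted word
excluded = re.compile(
    r"senior|manage|phd|lead|director|principal|\bsr\b|chief",
    re.IGNORECASE,
)
# The wanted job title, also matched case-insensitively
wanted = re.compile(r"data scien", re.IGNORECASE)

def is_relevant(title):
    """Hypothetical helper: True if the title is wanted and has no excluded word."""
    return bool(wanted.search(title)) and not excluded.search(title)

print(is_relevant("Data Scientist"))
print(is_relevant("Sr. Data Scientist"))
```

Adding or removing an excluded word then means editing one pattern instead of maintaining a numbered list of constraints.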