# Indeed Job Scraper

The objective of this program is to scrape a batch of job postings from Indeed. The program currently lets the user pass multiple queries and locations to the scraper and leaves all other search settings at their defaults. For each posting, the scraper returns the
* title
* location
* company
* salary
* synopsis

The output is previewed below and stored in a local CSV file. That file feeds the main project, which analyzes the job postings to gain insight from each posting's synopsis.

Current Issues
* Sponsored job posts show up in data multiple times
* Summary page is not formatted properly; .split() doesn't fix it
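
One possible approach to both issues, offered as a sketch rather than a tested fix: if the formatting problem is stray newlines and repeated spaces, `" ".join(text.split())` collapses them (a bare `.split()` only returns a list of words). Dropping duplicates on the posting's own fields, rather than on every column, lets a sponsored post that reappears under a different query count only once. The `clean_text` and `dedupe_posts` helpers and the sample rows are hypothetical.

```python
import pandas as pd

def clean_text(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    return " ".join(str(text).split())

def dedupe_posts(df, keys=("Title", "Location", "Company", "Synopsis")):
    """Normalize the key columns, then keep the first row for each posting."""
    out = df.copy()
    for col in keys:
        out[col] = out[col].map(clean_text)
    return out.drop_duplicates(subset=list(keys)).reset_index(drop=True)

# Hypothetical sample: rows 0 and 1 are the same post with messy whitespace,
# returned by two different queries
sample = pd.DataFrame({
    "Title": ["Data Analyst", "Data  Analyst\n", "Data Engineer"],
    "Location": ["Duluth, MN", "Duluth, MN", "Elgin, IL"],
    "Company": ["Acme", "Acme", "Acme"],
    "Synopsis": ["Builds reports.", "\nBuilds   reports.", "Builds pipelines."],
    "Query": ["data+analyst", "data+scientist", "data+engineer"],
})
deduped = dedupe_posts(sample)
print(deduped[["Title", "Synopsis", "Query"]])
```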

Would be nice, but not sure how to implement:

* responsibilities
* requirements
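
A possible starting point for these two fields, assuming the full description text of a posting is available (the scraper currently collects only the synopsis) and that sections are introduced by a heading on its own line. The `split_sections` helper, the heading keywords, and the example text are all hypothetical; real postings vary widely.

```python
import re

# Hypothetical heading keywords, matched on a line of their own
SECTION_PATTERN = re.compile(
    r"^[ \t]*(responsibilities|requirements|qualifications)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(description):
    """Return {heading: text} for each recognized heading in a description."""
    # With one capture group, re.split yields
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = SECTION_PATTERN.split(description)
    sections = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[heading.lower()] = " ".join(body.split())
    return sections

example = """About the role
Responsibilities:
Build dashboards.
Clean data.
Requirements
SQL and Python.
"""
sections = split_sections(example)
print(sections)
```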

Link to project: https://github.com/mhuh22/Thinkful/blob/master/Bootcamp/Unit%207/Analysis%20on%20Indeed%20Data%20Scientist%20Postings.ipynb

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
import os

In [2]:
def parse(query, metro, n):
    """Scrape n result pages for one query/location pair and return a DataFrame."""
    
    # Target url, and n number of pages to scrape
    url = "https://www.indeed.com/jobs?q={}&l={}&start={}"
    
    # Collect one dict per posting and build the dataframe once at the end
    # (DataFrame.append was removed in pandas 2.0)
    columns = ["Title", "Location", "Company", "Salary", "Synopsis", "Query", "Metro"]
    rows = []
    
    # Scrape first n pages
    for i in range(n):
        html = requests.get(url.format(query, metro, i))
        time.sleep(1)  # Pause between requests
        soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
        
        # find() returns None when a tag is missing, so .text raises AttributeError
        for each in soup.find_all(class_="result"):
            try:
                title = each.find(class_='jobtitle').text.replace('\n', '')
            except AttributeError:
                title = 'None'
            try:
                location = each.find('span', {'class': "location"}).text.replace('\n', '')
            except AttributeError:
                try:
                    location = each.find('div', {'class': "location"}).text.replace('\n', '')
                except AttributeError:
                    location = 'None'
            try:
                company = each.find(class_='company').text.replace('\n', '')
            except AttributeError:
                company = 'None'
            try:
                salary = each.find('span', {'class': 'no-wrap'}).text
            except AttributeError:
                salary = 'None'
            # Can also extract url (not sure if it will be used, though)
#             try:
#                 link = 'https://www.indeed.com/viewjob' + each.find('a', href=True)['href'][7:]
#             except:    
#                 link = 'None'
            try:
                synopsis = each.find('span', {'class': 'summary'}).text.replace('\n', '')
            except AttributeError:
                synopsis = 'None'
            rows.append({'Title': title, 'Location': location, 'Company': company, 'Salary': salary, 'Synopsis': synopsis, 'Query': query, 'Metro': metro})
    return pd.DataFrame(rows, columns=columns)
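
The query string could also be built with `urllib.parse.urlencode` instead of `str.format`, which escapes spaces and other special characters in the query and location. The sketch below also spaces consecutive pages `per_page` results apart, on the assumption that `start` is a result offset rather than a page index. The `build_url` helper and `RESULTS_PER_PAGE = 10` are hypothetical and should be checked against the live site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"
RESULTS_PER_PAGE = 10  # Assumption: results shown per page

def build_url(query, metro, page, per_page=RESULTS_PER_PAGE):
    """Build a search URL for a zero-based page number."""
    params = {"q": query, "l": metro, "start": page * per_page}
    # urlencode escapes each value, turning spaces into '+'
    return BASE_URL + "?" + urlencode(params)

print(build_url("data analyst", "united states", 2))
# https://www.indeed.com/jobs?q=data+analyst&l=united+states&start=20
```

With this helper the search terms are passed as plain text (`"data analyst"`, not `"data+analyst"`), since `urlencode` would escape a literal `+` as `%2B`.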

In [3]:
# Lists of positions and locations to search
positions = ['data+analyst', 'data+scientist', 'data+engineer']
cities = ['united states']
n = 100

# Create the dataframe
data = pd.DataFrame()

# Scrape all positions in all cities specified, and append to dataframe
for position in positions:
    for city in cities:
        data = pd.concat([data, parse(position, city, n)])

# Drop duplicate posts (typically caused by sponsored positions reappearing)
data = data.drop_duplicates()
        
# Preview the data
data.head(10)

| | Title | Location | Company | Salary | Synopsis | Query | Metro |
|---|---|---|---|---|---|---|---|
| 0 | Data Analyst | Duluth, MN | Members Cooperative Credit Union | | The Data Analyst i... | data+analyst | united states |
| 1 | Data Analyst | Elgin, IL | voestalpine High Performance Metals Co... | | Proven work experi... | data+analyst | united states |
| 2 | Data Analyst | Jersey City, NJ 07311 (Downtown area) | Tradeweb Markets LLC | | Tradeweb has an op... | data+analyst | united states |
| 3 | Web Data Analyst | Pleasant Prairie, WI 53158 | Uline | | Web Data Analyst. ... | data+analyst | united states |
| 4 | SAS Programmer (Data Analyst) | Washington, DC | Oasis Systems, Inc. | | Perform data proce... | data+analyst | united states |
| 5 | CLINICAL SYSTEM DATA ANALYST (STATISTICAL PROG... | Philadelphia, PA 19102 (City Center West area) | Temple Health | | Provides clinical ... | data+analyst | united states |
| 6 | Junior Data Analyst | Valhalla, NY | Rectangle Health | | Data visualization skills (ex. As the Junior D... | data+analyst | united states |
| 7 | Business / Data Analyst | New York, NY 10038 (Financial District area) | Bank of America | | Proficient with Data modeling, SQL... | data+analyst | united states |
| 8 | Data Analyst | New York, NY 10005 (Financial District area) | Murmuration | | The Data Analyst will:. The Data A... | data+analyst | united states |
| 9 | Jr. Data Analyst (Data Assurance) ~ Entry Level | Jersey City, NJ | Genuent | `\n $20 - $25 an hour` | Data Analyst (Data Assurance) ~ Jersey City, N... | data+analyst | united states |
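
The `Salary` column comes back as free text, as in row 9 of the preview (`\n $20 - $25 an hour`). A minimal sketch of turning such strings into numbers for later analysis follows. The `parse_salary` helper is hypothetical, and any salary wording other than the one shown in the preview is an assumption.

```python
import re

def parse_salary(raw):
    """Return (low, high, period) from a salary string, or None if no dollar amount is found."""
    text = " ".join(str(raw).split())  # Drop the leading newline and extra spaces
    amounts = [
        float(amount.replace(",", ""))
        for amount in re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", text)
    ]
    if not amounts:
        return None
    # Assumed phrasing for the pay period: "an hour", "a year", "per month", ...
    match = re.search(r"\b(?:an?|per)\s+(hour|day|week|month|year)\b", text)
    period = match.group(1) if match else None
    return min(amounts), max(amounts), period

print(parse_salary("\n $20 - $25 an hour"))  # (20.0, 25.0, 'hour')
print(parse_salary("None"))                  # None
```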


In [4]:
# Return size of dataframe
data.shape

(671, 7)

In [5]:
# Save scraped data to a local directory (to_csv overwrites any existing file)
os.makedirs("data", exist_ok=True)
data.to_csv('data/job_board.csv', encoding='utf-8')
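
Writing with the default settings also saves the DataFrame index, which shows up as an `Unnamed: 0` column when the file is read back with `pd.read_csv`. If the index is not needed downstream, passing `index=False` avoids it. The round trip below is a sketch. The sample frame is hypothetical and the temporary directory is there only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame standing in for the scraped data
frame = pd.DataFrame({"Title": ["Data Analyst"], "Salary": ["$20 - $25 an hour"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data", "job_board.csv")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # index=False keeps the row index out of the file
    frame.to_csv(path, encoding="utf-8", index=False)
    restored = pd.read_csv(path)

print(list(restored.columns))
```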