Scraping CollegeRecruiter.com
=============================
This notebook walks through scraping CollegeRecruiter to find the distribution of job postings across a list of cities in the Bay Area. This was very much a learn-as-you-go exercise.

The code and process are based on lecture 02 (Data Scraping) of Harvard's CS 109 class. Please refer to it for more info.

Basic process flow:

- Determine how many jobs there are so you know how many pages to scrape in a loop
- Create a loop to grab all the relevant job description links
- Create a loop to go through each job page and look for city names in the HTML
- Use a dictionary and counter to keep track of counts
- Plot the distribution with a bar chart

In [2]:
import urllib2
import bs4
import numpy as np
import time
import matplotlib
import matplotlib.pyplot as plt
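import pickle
import re #used later to clean up the job page HTML
import socket #used later to catch dropped connections
import pandas as pd #used later for plotting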

First, we'll want to pull a search page for our target website and scrape it for relevant information. 

In [3]:
# Fixed URL for job postings matching the keyword "Analyst" near San Jose (page 2 of the results)
url = 'https://www.collegerecruiter.com/job-search?keyword=Analyst&location=San+Jose' \
      '%2C+CA%2C+United+States&locMatches=1&lat=37.3382082&lng=-121.88632860000001&page=2'
# get HTML using urllib2 read()
source = urllib2.urlopen(url).read()
# Use BS4 to parse code
bs_tree = bs4.BeautifulSoup(source, "lxml") #name the parser explicitly, as the bs4 warning suggests
print bs_tree

<!DOCTYPE html>
<html>
<head>
<title>
        Analyst Jobs near San Jose, CA, United States |        Entry Level Jobs | Internships for Students | College Recruiter
    </title>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Leading niche job board for college and university students searching for internships, part-time employment, and seasonal work and recent graduates hunting for entry-level jobs and other career opportunities." name="description"/>
<meta content="entry-level entry level jobs work careers internships co-op employment recent graduates students college university" name="keywords"/>
<meta content="College Recruiter" name="author"/>
<link href="https://www.collegerecruiter.com/favicon.png?v=1409759204" rel="icon" type="image/png"/>
<link href="https://www.collegerecruiter.com/vendor/bootstrap/dist/css/bootstrap.min.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://www.collegerecruiter.com/vendor/fontawesome/cs

Second, we'll parse the HTML using BS4. If you inspect the page here: https://www.collegerecruiter.com/job-search?keyword=Analyst&location=San+Jose%2C+CA%2C+United+States&locMatches=1&lat=37.3382082&lng=-121.88632860000001&page=2

You'll see that the total number of jobs is in a div with class "pull-left", so let's get that string and split it. We'll then use place-value arithmetic to convert that string into a number. The conversion method was taken from the lecture. To explain it:

The list comprehension and zip() pair each digit with the exponent of its place value: the digit list is reversed so the ones digit comes first, and range(len(digits)) supplies the exponents 0, 1, 2, and so on (Ex: in 25, the '5' gets exponent 0 and the '2' gets exponent 1). The number is then the sum of each digit multiplied by 10 ^ (exponent).

Example: 142 = (2*(10^0))+(4*(10^1))+(1*(10^2)) = (2 + 40 + 100)
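
The place-value sum can be checked on a small string. The sketch below uses only the standard library and a hypothetical helper name, `digits_to_int`, which is not part of the notebook. It also shows that joining the digit characters and calling `int()` gives the same number, which may be simpler in practice.

```python
def digits_to_int(text):
    """Convert the digit characters in `text` to an integer via place values."""
    digits = [int(ch) for ch in text if ch.isdigit()]
    # Reverse so the ones digit comes first, then pair each digit with its exponent.
    return sum(d * 10 ** exp for exp, d in enumerate(reversed(digits)))

print(digits_to_int("142"))    # 2*10**0 + 4*10**1 + 1*10**2
print(digits_to_int("5,199"))  # non-digit characters are skipped

# Simpler alternative: keep only the digit characters and let int() do the work.
print(int("".join(ch for ch in "5,199" if ch.isdigit())))
```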



In [4]:
#get total # of hits listed so we know what to loop to
job_count_string = bs_tree.find("div",class_= 'pull-left').contents[0]
print job_count_string #.contents returns a list of all children of the found tag, hence why you need [0] to access the list
job_count_string = job_count_string.split()[-1] #-1 is given because the wanted string is the last object in array from split
print job_count_string
digits = []
for i in job_count_string: #slow and ugly way to get all the ints in a list
    if i.isdigit():
        digits.append(int(i))
print digits #now convert the array to a number
job_count = np.sum([digit*(10**exponent) for digit, exponent in 
                    zip(digits[::-1], range(len(digits)))])
print job_count


                        Jobs 11 -
                        20 of 5199                    
5199
[5, 1, 9, 9]
5199
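
As an alternative to splitting the string and rebuilding the number digit by digit, the count could be pulled out with a regular expression. This is a sketch, not the lecture's method: it assumes the summary text always ends in the form "of N" (as in the output above), and the helper name `parse_job_count` and the comma handling are illustrative assumptions.

```python
import re

def parse_job_count(summary):
    """Return the integer that follows 'of' in a 'Jobs 11 - 20 of 5199' summary."""
    match = re.search(r"of\s+([\d,]+)", summary)
    if match is None:
        raise ValueError("no job count found in %r" % summary)
    # Drop thousands separators, in case the site ever prints '5,199'.
    return int(match.group(1).replace(",", ""))

summary = """
                        Jobs 11 -
                        20 of 5199
"""
print(parse_job_count(summary))
```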


Now that we know how many pages we need to go through, let's create a loop to go through each job search page and append the job description links to a list. The process is pretty much the same as before: find the relevant HTML tag, parse it, and repeat as needed.

In [None]:
# The website is only listing 10 results per page, 
# so we need to scrape them page after page
num_pages = int(np.ceil(job_count/10.0))

job_links = []
for page in range(1, num_pages + 1): #pages are numbered from 1; range() excludes its end, so add 1 to reach the last page
    if page%10==0:
        print num_pages-page #progress: number of pages remaining
    url = "https://www.collegerecruiter.com/job-search?keyword=Analyst&location=San+Jose" \
          "%2C+CA%2C+United+States&locMatches=1&lat=37.3382082&lng=-121.88632860000001&page=" + str(page)
    #use loop to go through each url needed
    html_page = urllib2.urlopen(url).read() 
    bs_tree = bs4.BeautifulSoup(html_page, "lxml")
    job_link_area = bs_tree.find_all("form", method="POST")[1] #list containing all form tags, only want 2nd one
    job_postings = job_link_area.find_all(class_="jobTitle") #jobTitle contains the link
    for posting in job_postings: #separate name so the page counter is not overwritten
        job_links.append(posting.find("a")['href']) #within the a tag in jobTitle, use the href attribute to access just the link

    time.sleep(1) #delay between each scrape
#print job_links    
print "We found a lot of jobs: ", len(job_links)
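
The page arithmetic and URL construction can be checked without touching the network. The sketch below is an illustration under stated assumptions: `page_urls`, `BASE_URL`, and `RESULTS_PER_PAGE` are hypothetical names, the query parameters are copied from the URL used in this notebook, and `urlencode` is used in place of the hand-escaped string. The import fallback lets it run on both Python 2 and 3.

```python
import math

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in this notebook

BASE_URL = "https://www.collegerecruiter.com/job-search"
RESULTS_PER_PAGE = 10  # the site listed 10 results per page

def page_urls(job_count, per_page=RESULTS_PER_PAGE):
    """Build one search URL per results page, numbered from 1."""
    num_pages = int(math.ceil(job_count / float(per_page)))
    urls = []
    for page in range(1, num_pages + 1):  # + 1 so the last page is included
        params = [
            ("keyword", "Analyst"),
            ("location", "San Jose, CA, United States"),
            ("locMatches", 1),
            ("lat", "37.3382082"),
            ("lng", "-121.88632860000001"),
            ("page", page),
        ]
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls

urls = page_urls(5199)
print(len(urls))
print(urls[0])
```

With 5199 jobs this yields 520 URLs, which matches the page count mentioned in the next cell's description.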

Scraping 520 pages takes a while, so I already scraped all of the job links and stored them in a pickle file on GitHub called "joblinks.pickle". To load it, run the following cell.

In [10]:
with open('joblinks.pickle','rb') as handle:
    job_links = pickle.load(handle)
print job_links

['https://www.collegerecruiter.com/job/33093138-behavior-analyst', 'https://www.collegerecruiter.com/job/33148079-data-analyst', 'https://www.collegerecruiter.com/job/33690967-human-resources-operations-analyst', 'https://www.collegerecruiter.com/job/31731893-business-analyst', 'https://www.collegerecruiter.com/job/32725184-financial-analyst', 'https://www.collegerecruiter.com/job/33692338-finacial-analyst', 'https://www.collegerecruiter.com/job/31516611-business-systems-analyst', 'https://www.collegerecruiter.com/job/32942263-systems-analyst-cadence', 'https://www.collegerecruiter.com/job/33187781-systems-analyst-asap', 'https://www.collegerecruiter.com/job/32943852-systems-analyst-optime', 'https://www.collegerecruiter.com/job/33715445-systems-analyst-radiant', 'https://www.collegerecruiter.com/job/32616529-data-verification-analyst', 'https://www.collegerecruiter.com/job/33053510-systems-analyst-radiant', 'https://www.collegerecruiter.com/job/33655889-web-development-qa-analyst', 'h
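
The cell that wrote "joblinks.pickle" is not shown in this notebook. A minimal sketch of how such a file could be written and read back is below; it uses two links from the output above as sample data and a hypothetical file name in a temporary directory so the real file is not overwritten. Protocol 2 is an assumption, chosen so the file stays readable from Python 2.

```python
import os
import pickle
import tempfile

sample_links = [
    "https://www.collegerecruiter.com/job/33093138-behavior-analyst",
    "https://www.collegerecruiter.com/job/33148079-data-analyst",
]

# Hypothetical path, used only for this round-trip demonstration.
path = os.path.join(tempfile.mkdtemp(), "joblinks_sample.pickle")

# Write the list in binary mode.
with open(path, "wb") as handle:
    pickle.dump(sample_links, handle, protocol=2)

# Read it back the same way the notebook loads joblinks.pickle.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == sample_links)
```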

Now that we have all relevant links in a list, we'll go through each one and check whether any of the key cities appear in the HTML. We first use a regex to replace every character that is not a-z, ., +, or 3 with a space. After that, let's plot the dictionary!

In [7]:
cities =  {"sunnyvale":0, "san jose": 0, "campbell":0, "san francisco": 0, "santa clara": 0,  "mountain view":0}
for link in job_links:
    #These are here to prevent a bad link or connection from messing up the entire for loop
    try:
        html_page = urllib2.urlopen(link).read()
    except urllib2.HTTPError:
        print "HTTPError:"
        continue
    except urllib2.URLError:
        print "URLError:"
        continue
    except socket.error as error:
        print "Connection closed"
        continue
    #Count which cities are mentioned on this job page
    
    html_text = re.sub("[^a-z.+3]"," ", html_page.lower()) #replace any char that is not a-z, ., +, or 3 with a space
    #the leading caret inside the brackets means "anything but" 
    for key in cities.keys():
        if key in html_text:
            cities[key]+=1
            
print cities

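
The cleaning and matching step can be tried on a few made-up HTML snippets without fetching any pages. In this sketch, `count_cities` and the sample strings are hypothetical; the regex is the one used in the cell above.

```python
import re

def count_cities(pages, city_names):
    """Count, for each city, how many pages mention it at least once."""
    counts = dict((name, 0) for name in city_names)
    for page in pages:
        # Lowercase, then replace everything except a-z, '.', '+', '3' with a space.
        text = re.sub("[^a-z.+3]", " ", page.lower())
        for name in city_names:
            if name in text:
                counts[name] += 1
    return counts

sample_pages = [
    "<p>Location: San Jose, CA</p>",
    "<p>Offices in SAN-FRANCISCO and San Jose</p>",
    "<p>Remote</p>",
]
counts = count_cities(sample_pages, ["san jose", "san francisco", "campbell"])
print(counts)
```

Each page adds at most one to a city's count, and a city name appearing anywhere in the HTML counts as a match, so a page that mentions a city outside its location field would still be counted.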

In [None]:
pseries = pd.Series(cities) #create pandas series from dict
pseries = pseries.sort_values(ascending=False) #sort by count, largest first

pseries.plot(kind = 'bar') #simple bar chart using pandas
plt.title('Bay Area Cities Distribution')
plt.xlabel('City')
plt.ylabel('Count')
plt.show()

<img src="files/cities_dist.png">

Based on this, I think we can say that San Francisco is the city to go to if you are a recent college graduate looking for an entry-level job!

You'll also notice that the counts are small. This is because CollegeRecruiter's search engine lists all the jobs in its database when you do a search, sorted by distance from the location you indicate, so many of the results are likely outside the Bay Area.