Scraping CollegeRecruiter.com
=============================
This notebook walks through scraping CollegeRecruiter with Python and BeautifulSoup to find how job postings are distributed across a list of cities in the Bay Area.

Basic process flow:

- Determine the total number of jobs so you know how many pages to scrape in a loop
- Create a loop to grab all the relevant job description links
- Create a loop to go through each job page and look for city names in the HTML
- Use a dictionary and counter to keep track of counts
- Plot the distribution with a bar chart

In [2]:
import urllib2
import bs4
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import pickle
import re
import socket

### 1. Determine HTML Structure
Pull HTML from a random page to parse in later steps.

In [3]:
url = ('https://www.collegerecruiter.com/job-search?keyword=Analyst&location=San+Jose'
      '%2C+CA%2C+United+States&locMatches=1&lat=37.3382082&lng=-121.88632860000001&page=2')
print url
# get HTML using urllib2 read()
source = urllib2.urlopen(url).read()
# Use BS4 to parse code
bs_tree = bs4.BeautifulSoup(source, "lxml") #lxml given from warnings
#print bs_tree

https://www.collegerecruiter.com/job-search?keyword=Analyst&location=San+Jose%2C+CA%2C+United+States&locMatches=1&lat=37.3382082&lng=-121.88632860000001&page=2


### 2. Get Total Number of Jobs to Scrape
Inspect the page here: https://www.collegerecruiter.com/job-search?keyword=Analyst&location=San+Jose%2C+CA%2C+United+States&locMatches=1&lat=37.3382082&lng=-121.88632860000001&page=2

The total number of jobs is located in a div with class "pull-left", so find that div and scrape its string. Then use place-value arithmetic to convert the digit characters in that string into a number. The method works as follows:

The for loop collects the digits, and zip() pairs each digit, read from right to left, with its exponent: 0 for the ones place, 1 for the tens place, and so on, up to len(digits) - 1 (Ex: in 25, the '5' gets exponent 0 and the '2' gets exponent 1). The number is then the sum of every digit multiplied by 10 ^ (exponent).

Example: 142 = (2*(10^0))+(4*(10^1))+(1*(10^2)) = (2 + 40 + 100)
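
As a quick check on the arithmetic above, here is a minimal sketch of the same place-value conversion in plain Python. The helper name `digits_to_int` and the sample string are made up for illustration and are not part of the notebook. Calling the built-in `int()` on the joined digit characters gives the same result and is usually the simpler choice.

```python
def digits_to_int(text):
    """Convert the digit characters in `text` to an integer using place values."""
    # Keep only the digit characters, e.g. "5,199" -> [5, 1, 9, 9]
    digits = [int(ch) for ch in text if ch.isdigit()]
    # Reverse so the ones digit comes first, then pair each digit with its exponent
    return sum(d * 10 ** exp for exp, d in enumerate(reversed(digits)))

# Hypothetical text shaped like the counter on the results page
sample = "20 of 5199".split()[-1]
job_count_example = digits_to_int(sample)
print(job_count_example)
```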



In [4]:
#get total # of hits listed so we know what to loop to
job_count_string = bs_tree.find("div",class_= 'pull-left').contents[0]
print job_count_string 
job_count_string = job_count_string.split()[-1]
print job_count_string
digits = []
for i in job_count_string:
    if i.isdigit():
        digits.append(int(i))
print digits #now convert the array to a number
job_count = np.sum([digit*(10**exponent) for digit, exponent in 
                    zip(digits[::-1], range(len(digits)))])
print job_count


                        Jobs 11 -
                        20 of 5199                    
5199
[5, 1, 9, 9]
5199


### 3. Scrape and Compile All Job Links
Divide the total number of jobs by the number of jobs shown per page (10) and round up to get the number of pages to loop over. The outer for loop requests each results page and finds the HTML element that holds the links: here, the second "form" tag with method "POST", which contains the tags with class "jobTitle". An inner for loop goes through those tags and appends each link to a list.
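
The page arithmetic can be sketched on its own, without making any requests. This is only an illustration: the function name `page_urls` is hypothetical, and the page size of 10 is taken from the "Jobs 11 - 20" counter shown earlier. Rounding up matters because a partly filled last page still has to be fetched.

```python
BASE_URL = ("https://www.collegerecruiter.com/job-search?keyword=Analyst&location=San+Jose"
            "%2C+CA%2C+United+States&locMatches=1&lat=37.3382082&lng=-121.88632860000001&page=")
JOBS_PER_PAGE = 10  # assumption: every results page lists 10 jobs

def page_urls(job_count, per_page=JOBS_PER_PAGE):
    """Build one search URL per results page."""
    # Ceiling division: a partly filled last page still needs its own request
    num_pages = (job_count + per_page - 1) // per_page
    # Pages are numbered from 1, so the range has to include num_pages itself
    return [BASE_URL + str(page) for page in range(1, num_pages + 1)]

urls = page_urls(5199)
print(len(urls))
```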

Skip to next cell if you do not want to wait for the link scraping.

In [None]:
""" Uncomment if you want to scrape all the links. Otherwise, use the pickle cell below to get the links.
num_pages = int(np.ceil(job_count/10.0))

job_links = []
for page in range(1, num_pages + 1): #pages have 10 jobs each and are numbered from 1, so go from 1 to num_pages to get all job links
    if page%10==0:
        print num_pages-page #pages left to scrape
    a = str(page)
    url = "https://www.collegerecruiter.com/job-search?keyword=Analyst&location=San+Jose" \
          "%2C+CA%2C+United+States&locMatches=1&lat=37.3382082&lng=-121.88632860000001&page=" + a
    #use loop to go through each url needed
    html_page = urllib2.urlopen(url).read() 
    bs_tree = bs4.BeautifulSoup(html_page, "lxml") #name the parser explicitly, as in the first cell
    job_link_area = bs_tree.find_all("form", method="POST")[1] #list containing all form tags, only want 2nd one
    job_postings = job_link_area.find_all(class_="jobTitle") #jobTitle contains the link
    for posting in job_postings: #separate name so the page counter is not overwritten
        job_links.append(posting.find("a")['href']) #within the a tag in jobTitle, use the href attribute to access just the link

    time.sleep(1) #delay between each scrape 
print "We found a lot of jobs: ", len(job_links)
"""

### 3.5 Pickling
Scraping 520 pages takes a while, so I already scraped all of the job links and stored them in a pickle file on GitHub called "joblinks.pickle". To load it, run the following cell. You can also see how the pickle was created by looking at the Python file.
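
For reference, a pickle round trip looks roughly like the sketch below. It is not the code from the Python file mentioned above: the link list and the file name `joblinks_example.pickle` are stand-ins, and the file is written to a temporary directory so the real "joblinks.pickle" is left untouched.

```python
import os
import pickle
import tempfile

# Stand-in list; in the notebook this would be the scraped job_links
example_links = ["https://example.com/job/1", "https://example.com/job/2"]

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "joblinks_example.pickle")

# Pickle files must be opened in binary mode for both writing and reading
with open(path, "wb") as handle:
    pickle.dump(example_links, handle)

with open(path, "rb") as handle:
    restored_links = pickle.load(handle)

print(restored_links == example_links)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.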

In [11]:
with open('joblinks.pickle','rb') as handle:
    job_links = pickle.load(handle)

### 4. Use Regex to get Counts
Now that we have all the job links, loop through the list and get the HTML for each link. Then use regex to replace every character that is not a-z or "." with a space. Afterwards, check the cleaned HTML for each key in the cities dictionary, and add 1 to that city's count if the city is found in the HTML.
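
The counting step can be tried offline on a few made-up pages before running it against thousands of links. In the sketch below, the function name `count_city_mentions` and the sample HTML strings are hypothetical, and the regex is the same rule described above.

```python
import re

def count_city_mentions(pages, city_names):
    """Return {city: number of pages whose cleaned text mentions that city}."""
    counts = dict((city, 0) for city in city_names)
    for page in pages:
        # Lowercase first, then turn every character outside a-z and "." into a space
        text = re.sub(r"[^a-z.]", " ", page.lower())
        for city in counts:
            if city in text:  # substring test: at most one count per page
                counts[city] += 1
    return counts

# Made-up snippets standing in for downloaded job pages
sample_pages = [
    "<p>Analyst, San Jose, CA</p>",
    "<p>Offices in San Francisco and San Jose</p>",
    "<p>Analyst, Austin, TX</p>",
]
city_counts = count_city_mentions(sample_pages, ["san jose", "san francisco", "campbell"])
print(city_counts)
```

Because this is a plain substring test, a city name that appears anywhere on the page counts, including in navigation or related-job links, so the counts are best read as rough estimates.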

In [None]:
cities =  {"sunnyvale":0, "san jose": 0, "campbell":0, "san francisco": 0, "santa clara": 0,  "mountain view":0}
for link in job_links:
    #These are here to prevent a bad link or connection from messing up the entire for loop
    try:
        html_page = urllib2.urlopen(link).read()
    except urllib2.HTTPError:
        print "HTTPError:"
        continue
    except urllib2.URLError:
        print "URLError:"
        continue
    except socket.error as error:
        print "Connection closed"
        continue

    #regex rule: replace any char that is not a-z or "." with a space; the caret means "anything but"
    html_text = re.sub("[^a-z.]"," ", html_page.lower()) 
    for key in cities.keys():
        if key in html_text:
            cities[key]+=1
            
print cities

### 5. Plot the Results
Use a simple bar chart to plot the counts using Pandas.
Based on this, San Francisco looks like the best of these cities for a recent college graduate looking for an entry-level job, as it has the largest number of jobs on the site.

I also noticed the counts were really low, and after looking through the website, I realized that any search query will end up listing all jobs sorted by distance from the zip code entered, so a majority of the ~5000 links scraped were not in CA. 

In [None]:
pseries = pd.Series(cities) #create pandas series from dict
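pseries = pseries.sort_values(ascending=False) #Series.sort() was removed in newer pandas; sort_values returns a new sorted Series

pseries.plot(kind = 'bar') #simple bar chart using pandas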
plt.title('Bay Area Cities Distribution')
plt.xlabel('City')
plt.ylabel('Count')
plt.show()

<img src="files/cities_dist.png">