# Scraping and processing resumes off Craigslist

To gather summary statistics, you'll probably first want to figure out where the most populated areas are. There are a few ways to go about this, but Craigslist searches are easiest to define with a zip code and a radius around it. The 100 most populated zip codes were found via http://localistica.com/usa/zipcodes/most-populated-zipcodes/.

In [1]:
from IPython.display import display, Video  # Jupyter notebook methods

from selenium import webdriver

import sys
sys.path.append("./../scripts/")

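## What is a webdriver?

A webdriver is a program that lets code control a real browser. Selenium sends commands (open a URL, find an element, click) to `chromedriver`, which carries them out in Chrome.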

In [2]:
# Launch Chrome through the chromedriver executable at this path (Selenium 3 style call)
driver = webdriver.Chrome("./../assets/chromedriver")

## Let's make our first automated move to Craigslist!

In [3]:
driver.get("http://craigslist.org")

Well it unfortunately looks like Craigslist defaults to searching in your local area. You could manually change the location by searching for a new one... or is there an automated way of doing this? 

Let's see what happens when you search for a new address.

In [5]:
%%HTML
<video width="1000" height="600" autoplay controls>
  <source src="./../assets/videos/craigslist_hyperlink_zip.mp4" type="video/mp4">
</video>

It's subtle, but the hyperlink provides a lot of information. Searching CL in different cities can be done by following the format ```{cityname}.craigslist.org/?search_distance={distance}&postal={zipcode}```.
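
As a rough illustration of that URL format, here is a minimal sketch that fills in the pattern with the standard library. The function name `craigslist_home_url` is made up for this example, and the `https://` scheme and the lowercasing of the city name are assumptions rather than something shown in the video.

```python
from urllib.parse import urlencode


def craigslist_home_url(city, zipcode, distance):
    """Build a Craigslist URL for a city subdomain, centered on a zip code."""
    # urlencode escapes the values and joins the parameters with '&'
    query = urlencode({"search_distance": distance, "postal": zipcode})
    # Lowercasing the city is an assumption; hostnames are case-insensitive
    return "https://{}.craigslist.org/?{}".format(city.lower(), query)


print(craigslist_home_url("ElPaso", "79936", 30))
# prints: https://elpaso.craigslist.org/?search_distance=30&postal=79936
```

Building the query string with `urlencode` instead of gluing strings together keeps the URL valid even if a value ever contains a space or other special character.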

With this knowledge we are now able to change the area that we are searching! So what now? Do we compile a list of every city and its zipcode? Naw, we're smarter than that. There are various approaches to this, such as simply changing the distance to something really large (9999) and picking the middle of America, but a more clever approach may be to find the cities and zipcodes with the highest populations. For learning purposes, let's give the latter a shot!

First we're going to find a website that provides a list of the most populated zipcodes. Google tells us that http://localistica.com/usa/zipcodes/most-populated-zipcodes/ provides just that information. Let's check it out.

In [6]:
driver.get('http://localistica.com/usa/zipcodes/most-populated-zipcodes/')

Cool site! There's also a really convenient table that provides both the zip code and city! But how do we get this table into our Notebook? Yes, we can copy and paste, or even worse, manually copy down the values. But here's where Selenium shines! We can quickly copy the table into our notebook by extracting the HTML element. This can be done in a few ways, but we're going to do it through the `XPath` on the page (think of this as the path to some element, i.e. the table). The video below shows how to get the `XPath` of a webpage element.

In [7]:
%%HTML
<video width="1000" height="600" autoplay controls>
  <source src="./../assets/videos/getting_xpaths.mp4" type="video/mp4">
</video>
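
To get a feel for how an `XPath` reads, here is a small sketch that runs without a browser. It uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so treat it as an illustration of the path idea rather than a replacement for Selenium. The markup below is a hypothetical stand-in, not the real page.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a web page (hypothetical markup)
page = """
<html>
  <body>
    <form id="frmAF">
      <div>search box</div>
      <div>
        <table><tr><td>79936</td><td>El Paso</td></tr></table>
      </div>
    </form>
  </body>
</html>
""".strip()

root = ET.fromstring(page)

# Read as: anywhere in the document, find the element whose id is "frmAF",
# then take its second <div> child.
target = root.find(".//*[@id='frmAF']/div[2]")

print(target.tag)
print([cell.text for cell in target.iter("td")])
```

The path copied from DevTools works the same way, just with more steps: start at the element with a known `id`, then walk down by tag name and position.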

With the `XPath` copied, we can get the element by telling the driver to `find_element_by_xpath`. 

In [8]:
# Selenium 3 API; in Selenium 4 the equivalent is driver.find_element(By.XPATH, ...)
webpage_table = driver.find_element_by_xpath('//*[@id="frmAF"]/div[6]/div/div[1]/div[2]')

What is returned is a `WebElement` object. It has a few methods and attributes, but what we're interested in is the `text` attribute, which contains the element's visible text. Let's see what it looks like.

In [9]:
print("webpage_table type: ", type(webpage_table))
print("webpage_table text: \n", webpage_table.text)

webpage_table type:  <class 'selenium.webdriver.remote.webelement.WebElement'>
webpage_table text: 
 ZipCode City Population Growth Age Income per household
79936 El Paso TX 115,556 3% 31.00 $42,857.00
90011 Los Angeles CA 106,326 2% 26.20 $23,851.00
60629 Chicago IL 105,209 -8% 28.80 $40,279.00
90650 Norwalk CA 104,765 0% 32.50 $46,012.00
90201 Bell Gardens CA 101,479 0% 27.80 $30,029.00
77084 Houston TX 101,233 6% 31.20 $53,075.00
92335 Fontana CA 99,743 4% 26.90 $35,008.00
78521 Brownsville TX 99,632 5% 28.00 $23,426.00
77449 Katy TX 99,586 5% 29.60 $59,198.00
78572 Mission TX 96,822 22% 31.50 $23,799.00
90250 Hawthorne CA 96,593 3% 31.90 $33,656.00
90280 South Gate CA 95,430 1% 29.40 $35,744.00
11226 Brooklyn NY 94,814 -7% 34.50 $29,498.00
90805 Long Beach CA 94,475 1% 29.00 $32,565.00
91331 Pacoima CA 93,821 -10% 29.50 $39,225.00
08701 Lakewood NJ 93,320 0% 23.90 $35,647.00
90044 Los Angeles CA 92,967 3% 28.60 $22,091.00
92336 Fontana CA 92,195 4% 30.10 $55,340.00
00926 San Juan P

Eyy! We got the table! Now we need to get just the zipcodes and cities. This process is often referred to as _parsing_. Separating the zipcodes and city names is actually not entirely intuitive, nor is it important for this tutorial. We provide a simple helper function for this, `parse_webpage_table`, in the `ancillary` module. Feel free to view that code, or don't and just think of it as a magical black box.

The function returns a list of tuples containing ({city}, {zipcode}). Let's see the first few. Note that the spaces in the city names are removed because that's how Craigslist formats them.

In [10]:
from ancillary import parse_webpage_table

In [11]:
webpage_table_info = parse_webpage_table(webpage_table)

In [12]:
webpage_table_info[:5]

[('ElPaso', '79936'),
 ('LosAngeles', '90011'),
 ('Chicago', '60629'),
 ('Norwalk', '90650'),
 ('BellGardens', '90201')]

Looks good! For a sanity check, let's make sure that all 100 rows are captured.

In [13]:
print("Number of elements parsed table: ", len(webpage_table_info))

Number of elements parsed table:  100
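
If you're curious what's inside the black box, here is one way such parsing could work. This is a hypothetical sketch, not the actual `ancillary` code: the name `parse_table_text` is made up, it takes the raw text (e.g. `webpage_table.text`) rather than the `WebElement`, and it assumes every complete row ends with a one-word state followed by four numeric columns.

```python
def parse_table_text(table_text):
    """Return (city, zipcode) tuples from the table's raw text.

    Hypothetical stand-in for parse_webpage_table; not the ancillary code.
    """
    results = []
    for line in table_text.splitlines():
        tokens = line.split()
        # A full row is: zip, city word(s), state, population, growth, age, income
        if len(tokens) < 7:
            continue  # truncated or empty rows
        zipcode = tokens[0]
        if not (zipcode.isdigit() and len(zipcode) == 5):
            continue  # skips the header row
        # Drop the zip (first token) and the state plus four numeric columns
        # (last five tokens); whatever remains is the city name.
        city = "".join(tokens[1:-5])
        results.append((city, zipcode))
    return results


sample = """ZipCode City Population Growth Age Income per household
79936 El Paso TX 115,556 3% 31.00 $42,857.00
90201 Bell Gardens CA 101,479 0% 27.80 $30,029.00
00926 San Juan P"""

print(parse_table_text(sample))
# prints: [('ElPaso', '79936'), ('BellGardens', '90201')]
```

Counting columns from the right is what makes multi-word city names work: the number of words in the city varies, but the columns after it do not.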


Sweet, now that we're able to search different cities, let's check out resumes! Back to Craigslist!

In [14]:
driver.get("http://craigslist.org")

If we search a new city and click on ```resumes``` you'll notice that the hyperlink adds a bit more information, in the format of, `https://{city}.craigslist.org/d/resumes/search/rrr?postal={zip}&search_distance={distance}`. This is perfect since we have all the information we need! 

Let's try this out with one of the cities and zipcodes we found to be heavily populated.

In [15]:
city, zipcode = webpage_table_info[0]
distance = 30

driver.get("https://{}.craigslist.org/d/resumes/search/rrr?postal={}&search_distance={}".format(city, zipcode, distance))

Awesome! We can now start scraping these resumes across all the cities!
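
To go from one city to all of them, the same URL pattern can be filled in for every (city, zipcode) pair. The sketch below is only an illustration: `resume_search_url` is a made-up helper, the city is lowercased as an assumption, and nothing here checks that each city name is actually a valid Craigslist subdomain.

```python
def resume_search_url(city, zipcode, distance=30):
    """Fill in the resume-search URL pattern for one (city, zipcode) pair."""
    return (
        "https://{}.craigslist.org/d/resumes/search/rrr"
        "?postal={}&search_distance={}"
    ).format(city.lower(), zipcode, distance)


# The first few pairs, in the same shape as webpage_table_info
pairs = [("ElPaso", "79936"), ("LosAngeles", "90011"), ("Chicago", "60629")]

search_urls = [resume_search_url(city, zipcode) for city, zipcode in pairs]
for search_url in search_urls:
    print(search_url)
```

In the notebook, `pairs` would simply be `webpage_table_info`, and each resulting URL could be passed to `driver.get`.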

First we need to get all the hyperlinks associated with the posts. This can be done in the DevTools panel once again (F12). Search until you find the line that highlights the results table on the CL page, and copy that `XPath` again. See below for an example; the `XPath` should be `//*[@id="sortable-results"]/ul`.

In [16]:
%%HTML
<img src="./../assets/images/cl_table_ref.png" 
    alt="DevTool example for location of CL table element " 
    style="width:1000px;height:600px;"
>

In [17]:
cl_results = driver.find_element_by_xpath('//*[@id="sortable-results"]/ul')
cl_results_items = cl_results.find_elements_by_tag_name("li")

In [18]:
item = cl_results_items[0]
url = item.find_element_by_tag_name('a').get_attribute('href')
print(url)

https://elpaso.craigslist.org/res/d/professional-talented-and-energetic/6966671942.html
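
The same two calls can be wrapped in a loop to collect a URL from every result item. This is a minimal sketch with a made-up function name, `collect_post_urls`. It sticks to the Selenium 3 method names used in this notebook (Selenium 4 spells the lookup `find_elements(By.TAG_NAME, "a")`), and it assumes the first link in each item is the post link.

```python
def collect_post_urls(result_items):
    """Return the first link's href from each result item, without duplicates."""
    urls = []
    seen = set()
    for item in result_items:
        # find_elements_* returns an empty list (rather than raising)
        # when an item has no <a> tag
        links = item.find_elements_by_tag_name("a")
        if not links:
            continue
        href = links[0].get_attribute("href")
        # Keep only non-empty links we have not seen yet, preserving order
        if href and href not in seen:
            seen.add(href)
            urls.append(href)
    return urls
```

With the variables above, this would be called as `post_urls = collect_post_urls(cl_results_items)`.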


Boom, we're able to pull URLs! We can now simply go through all the URLs we find and pull the information through Selenium once again.

Let's start with pulling resume information from the link.

In [19]:
driver.get(url)

Again, we find the `XPath`, this time for the content body.

In [20]:
content = driver.find_element_by_xpath('//*[@id="postingbody"]')

In [22]:
print(content.text)

Reliable well organized professional with strong leadership qualities. Committed to producing results above and beyond what is expected. Strong in conflict resolutions. Your company will love to have an enthusiastic, discipline, knowledgeable & fast learner employee like me. Available IMMEDIATELY. Full-time job offer desirable.

My career expertise areas are:
• Customer Service
• Customer Satisfaction
• Salesperson
• Administrative Assistant
• Human Resources Assistant
• Human Resources Specialist
• Shift Supervisor
• Management
• Law Enforcement/Security Specialist

I believe that I would be an assets to your company because my relevant knowledge, skills and abilities for this job includes:

1. Bilingual Spanish & English
2. Clerical Skills
3. Microsoft Office Suite
4. High Stress Environment
5. High degree of initiative
6. Proven Leadership
7. Integrity
8. Problem Solving
9. High Degree of Initiative

Over 15 years career in law enforcement with a successfully approved MBA in Human R