# Scraping resumes off Craigslist
---

To gather summary statistics, you'll probably first want to figure out where the most populated areas are. There are a few ways to go about this, but zip codes make Craigslist searches much easier, since we can define a search radius around each zip code. The 100 most populated zip codes were found via http://localistica.com/usa/zipcodes/most-populated-zipcodes/.

## Sections
---
- [Part 1: What is a WebDriver?](#Part-1:-What-is-a-WebDriver?)

In [2]:
from IPython.display import display, Video  # Jupyter notebook methods

import sys
sys.path.append("./../scripts/")

## Part 1: What is a WebDriver?
---
Webdriving, in simple terms, is __the automation of actions on a web browser__ such as Chrome or Firefox. You can think of it as telling a computer how to navigate websites and perform human actions (copy, download, etc.) automatically. Most web drivers work by analyzing HTML, which makes knowledge of the markup language extremely helpful for this task. Web driving is a popular approach to web scraping and is closely related to web crawling.

With Python, the most common webdriver can be found in the selenium suite, which is offered in C#, Java, JavaScript, Python, and Ruby. More information on the platform can be found [here](https://www.seleniumhq.org/).

Now let's get started! To kick things off, we first have to download our browser's webdriver; we're sticking with Google Chrome for consistency. Chrome's webdriver application (chromedriver) can be found on the [chromium website](https://chromedriver.chromium.org/). The version of the driver needs to match your current browser's version. To check this, watch the video below.

Once chromedriver is downloaded, move it into the `./../assets` directory. 

In [5]:
%%HTML
<video width="1000" height="600" controls>
  <source src="./../assets/videos/webdriver_download.mp4" type="video/mp4">
</video>

With chromedriver downloaded, let's get to some code! We first import the `webdriver` module from selenium with the line below. Next, we define a variable called `driver`, which opens a Chrome browser by instantiating the `webdriver.Chrome` class (note the path to the webdriver application downloaded above; newer Selenium 4 releases expect this path wrapped in a `Service` object from `selenium.webdriver.chrome.service` and passed as `service=`). This browser is special, however, because it can be controlled via the `driver` variable.

In [6]:
from selenium import webdriver

In [9]:
driver = webdriver.Chrome("./../assets/chromedriver")

Woohoo! We've got Chrome open. This browser works like your normal one: you can click around and go to whatever webpage you like. The special part is that we can use the `driver`'s methods to perform these tasks for us. Let's make our first automated move to Craigslist! This is done with the `get` method.

In [10]:
driver.get("http://craigslist.org")

Sweet, we're able to jump from page to page... now what? Well, let's see if we can pull some HTML information from this page. 
One way to extract HTML information is by using the `find_element_by_xpath` method of the driver variable ([more on XPath](https://www.w3schools.com/xml/xpath_intro.asp)). This seeks out a particular element on the page and returns it as a `WebElement` object. We'll start with the front page text by extracting the `/html` element and assigning it to the variable `front_page`.

If the XPath is found, the returned object has a `text` attribute, which holds a string of all the text found in the associated HTML element.
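One caveat on versions: the `find_element_by_*` helpers were removed in Selenium 4.3, while the generic `find_element(by, value)` call exists in both older and newer releases. Below is a minimal sketch of that version-agnostic form. The helper name `find_by_xpath` and the `FakeDriver` stand-in are hypothetical, used only so the snippet runs without a browser; with a real driver you would pass `driver` instead.

```python
# "xpath" is the same string that selenium.webdriver.common.by.By.XPATH holds.
XPATH = "xpath"


def find_by_xpath(driver, xpath):
    """Locate one element by XPath using the generic find_element call."""
    return driver.find_element(XPATH, xpath)


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (assumption, not Selenium code)."""

    def find_element(self, by, value):
        # A real driver would return a WebElement here.
        return "element located by {}: {}".format(by, value)


result = find_by_xpath(FakeDriver(), "/html")
print(result)
```

With a real Selenium driver, `find_by_xpath(driver, '/html').text` would give the same page text as the cell that follows.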

In [16]:
front_page = driver.find_element_by_xpath('/html')
print(front_page.text)

washington, DC
doc
nva
mld
craigslist
create a posting
my account
event calendar
M T W T F S S
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
help, faq, abuse, legal
avoid scams & fraud
personal safety tips
terms of usenew
privacy policy
system status
about craigslist
craigslist is hiring in sf
craigslist open source
craigslist blog
best-of-craigslist
craigslist TV
"craigslist joe"
craig connects
community
activities
artists
childcare
classes
events
general
groups
local news
lost+found
missed connections
musicians
pets
politics
rants & raves
rideshare
volunteers
services
automotive
beauty
cell/mobile
computer
creative
cycle
event
farm+garden
financial
household
labor/move
legal
lessons
marine
pet
real estate
skilled trade
sm biz ads
travel/vac
write/ed/tran
discussion forums
android
apple
arts
atheist
autos
beauty
bikes
celebs
comp
cosmos
diet
divorce
dying
eco
feedbk
film
fixit
food
frugal
gaming
garden
haiku
help
history
housing
jobs
jokes
legal
linux
man

Well, unfortunately it looks like Craigslist defaults to searching in your local area. You could manually change the location by searching for a new one... or is there an automated way of doing this?

Let's see what happens when you search for a new address.

In [4]:
%%HTML
<video width="1000" height="600" autoplay controls>
  <source src="./../assets/videos/craigslist_hyperlink_zip.mp4" type="video/mp4">
</video>

It's subtle, but the hyperlink provides a lot of information. Searching CL in different cities can be done by following the format ```{cityname}.craigslist.org/?search_distance={distance}&postal={zipcode}```.
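To make the URL format concrete, here is a small sketch that builds such a link and then pulls the pieces back out using only the standard library. The function names `build_home_url` and `split_home_url` are hypothetical, and the `https://` scheme and the `elpaso` subdomain are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_home_url(city, zipcode, distance):
    """Build a Craigslist URL scoped to a city, zip code, and search radius."""
    query = urlencode({"search_distance": distance, "postal": zipcode})
    return "https://{}.craigslist.org/?{}".format(city, query)


def split_home_url(url):
    """Recover (city, zipcode, distance) from a URL in the format above."""
    parts = urlparse(url)
    city = parts.netloc.split(".")[0]  # subdomain carries the city name
    params = parse_qs(parts.query)     # query string carries zip and radius
    return city, params["postal"][0], int(params["search_distance"][0])


url = build_home_url("elpaso", "79936", 30)
print(url)
print(split_home_url(url))
```

Keeping the zip code as a string matters: zip codes such as `08701` would lose their leading zero if converted to integers.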

With this knowledge we are now able to change the area that we are searching! So what now? Do we compile a list of every city and its zipcode? Naw, we're smarter than that. There are various approaches to this, such as simply changing the distance to something really large (9999) and picking the middle of America, but a more clever approach may be to find the cities and zipcodes with the highest populations. For learning purposes, let's give the latter a shot!

First we're going to find a website that provides a list of the most populated zipcodes. Google tells us that http://localistica.com/usa/zipcodes/most-populated-zipcodes/ provides just that information. Let's check it out.

In [5]:
driver.get('http://localistica.com/usa/zipcodes/most-populated-zipcodes/')

Cool site! There's also a really convenient table that provides both the zip code and city! But how do we get this table into our Notebook? Yes, we could copy and paste or, even worse, manually copy down the values. But here's where selenium shines! We can quickly copy the table into our notebook by extracting the HTML element. This can be done in a few ways, but we're going to do it through the `XPath` on the page (think of this as the path to some element, i.e. the table). The video below shows how to get the `XPath` of a webpage element.

In [6]:
%%HTML
<video width="1000" height="600" autoplay controls>
  <source src="./../assets/videos/getting_xpaths.mp4" type="video/mp4">
</video>

With the `XPath` copied, we can get the element by telling the driver to `find_element_by_xpath`. 

In [7]:
webpage_table = driver.find_element_by_xpath('//*[@id="frmAF"]/div[6]/div/div[1]/div[2]')

What is returned is a `WebElement` object. It has a few methods and attributes, but what we're interested in is the `text` attribute, which contains the raw information. Let's see what it looks like.

In [8]:
print("webpage_table type: ", type(webpage_table))
print("webpage_table text: \n", webpage_table.text)

webpage_table type:  <class 'selenium.webdriver.remote.webelement.WebElement'>
webpage_table text: 
 ZipCode City Population Growth Age Income per household
79936 El Paso TX 115,556 3% 31.00 $42,857.00
90011 Los Angeles CA 106,326 2% 26.20 $23,851.00
60629 Chicago IL 105,209 -8% 28.80 $40,279.00
90650 Norwalk CA 104,765 0% 32.50 $46,012.00
90201 Bell Gardens CA 101,479 0% 27.80 $30,029.00
77084 Houston TX 101,233 6% 31.20 $53,075.00
92335 Fontana CA 99,743 4% 26.90 $35,008.00
78521 Brownsville TX 99,632 5% 28.00 $23,426.00
77449 Katy TX 99,586 5% 29.60 $59,198.00
78572 Mission TX 96,822 22% 31.50 $23,799.00
90250 Hawthorne CA 96,593 3% 31.90 $33,656.00
90280 South Gate CA 95,430 1% 29.40 $35,744.00
11226 Brooklyn NY 94,814 -7% 34.50 $29,498.00
90805 Long Beach CA 94,475 1% 29.00 $32,565.00
91331 Pacoima CA 93,821 -10% 29.50 $39,225.00
08701 Lakewood NJ 93,320 0% 23.90 $35,647.00
90044 Los Angeles CA 92,967 3% 28.60 $22,091.00
92336 Fontana CA 92,195 4% 30.10 $55,340.00
00926 San Juan P

Eyy! We got the table! Now we need to get just the zipcodes and cities. This process is often referred to as _parsing_. Separating the zipcodes and city names is actually not entirely intuitive, nor is it important for this tutorial. We provide a simple helper function for this, `parse_webpage_table`, in the `ancillary` library. Feel free to view that code, or don't and just think of it as a magical black box.

The function returns a list of tuples containing ({city}, {zipcode}). Let's see the first few. Note that the spaces in the city names are removed because that's how Craigslist processes them.

In [9]:
from ancillary import parse_webpage_table

In [10]:
webpage_table_info = parse_webpage_table(webpage_table)

In [11]:
webpage_table_info[:5]

[('ElPaso', '79936'),
 ('LosAngeles', '90011'),
 ('Chicago', '60629'),
 ('Norwalk', '90650'),
 ('BellGardens', '90201')]

Looks good! For a sanity check, let's make sure that all 100 rows are captured.

In [12]:
print("Number of elements parsed table: ", len(webpage_table_info))

Number of elements parsed table:  100
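For anyone who would rather peek inside the black box, here is one way such parsing could work. This is a sketch, not the actual `ancillary.parse_webpage_table`: the name `parse_table_text` and the regular expression are assumptions, and it takes the table's text rather than the `WebElement` itself.

```python
import re

# zip code, city name, two-letter state code, then the population column
ROW_PATTERN = re.compile(r"^(\d{5})\s+(.+?)\s+([A-Z]{2})\s+[\d,]+\s")


def parse_table_text(table_text):
    """Return a list of (city_without_spaces, zipcode) tuples."""
    locations = []
    for line in table_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # skip the header and any truncated or malformed row
        zipcode, city, _state = match.groups()
        locations.append((city.replace(" ", ""), zipcode))
    return locations


sample = """ZipCode City Population Growth Age Income per household
79936 El Paso TX 115,556 3% 31.00 $42,857.00
90011 Los Angeles CA 106,326 2% 26.20 $23,851.00
60629 Chicago IL 105,209 -8% 28.80 $40,279.00"""

print(parse_table_text(sample))
```

The tricky part is that city names can contain spaces, so the pattern anchors on the two-letter state code followed by the population figure to decide where the city name ends.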


Sweet, now that we're able to search different cities, let's check out resumes! Back to Craigslist!

In [13]:
driver.get("http://craigslist.org")

If we search a new city and click on ```resumes```, you'll notice that the hyperlink adds a bit more information, in the format `https://{city}.craigslist.org/d/resumes/search/rrr?postal={zip}&search_distance={distance}`. This is perfect since we have all the information we need!

Let's try this out with one of the cities and zipcodes we found to be heavily populated.

In [14]:
city, zipcode = webpage_table_info[0]
distance = 30

driver.get("https://{}.craigslist.org/d/resumes/search/rrr?postal={}&search_distance={}".format(city, zipcode, distance))

Awesome! We can now start scraping these resumes across all the cities!

First we need to get all the hyperlinks associated with the posts. This can be done in the DevTools panel once again (F12). Search through the HTML until a line highlights the table on the CL page, then copy that `XPath` again. See below for an example; the `XPath` should be `//*[@id="sortable-results"]/ul`.

In [15]:
%%HTML
<img src="./../assets/images/cl_table_ref.png" 
    alt="DevTool example for location of CL table element " 
    style="width:1000px;height:600px;"
>

In [16]:
cl_results = driver.find_element_by_xpath('//*[@id="sortable-results"]/ul')
cl_results_items = cl_results.find_elements_by_tag_name("li")

In [17]:
item = cl_results_items[0]
url = item.find_element_by_tag_name('a').get_attribute('href')
print(url)

https://elpaso.craigslist.org/res/d/professional-talented-and-energetic/6966671942.html


Boom, we're able to pull URLs! We can now simply go through all the URLs we find and pull the information through selenium once again.

Let's start with pulling resume information from the link.

In [18]:
driver.get(url)

Again, we find the `XPath`, this time for the content body.

In [19]:
content = driver.find_element_by_xpath('//*[@id="postingbody"]')

In [20]:
print(content.text)

Reliable well organized professional with strong leadership qualities. Committed to producing results above and beyond what is expected. Strong in conflict resolutions. Your company will love to have an enthusiastic, discipline, knowledgeable & fast learner employee like me. Available IMMEDIATELY. Full-time job offer desirable.

My career expertise areas are:
• Customer Service
• Customer Satisfaction
• Salesperson
• Administrative Assistant
• Human Resources Assistant
• Human Resources Specialist
• Shift Supervisor
• Management
• Law Enforcement/Security Specialist

I believe that I would be an assets to your company because my relevant knowledge, skills and abilities for this job includes:

1. Bilingual Spanish & English
2. Clerical Skills
3. Microsoft Office Suite
4. High Stress Environment
5. High degree of initiative
6. Proven Leadership
7. Integrity
8. Problem Solving
9. High Degree of Initiative

Over 15 years career in law enforcement with a successfully approved MBA in Human R

Awesome! We are able to pull resume information. That's pretty much everything we need to start pulling data, we just need to summarize everything and run it as a loop. Let's create a function to do each step of the process. Ultimately, what we'd like to have is a single function that takes in the list of ({city}, {zipcode}) tuples, and spits out all the resumes found on CL. 

Let's break this down into the steps we performed above:

1. Going to the webpage via the city and zipcode (and distance).
2. Getting all the associated URLs for the location.
3. Going to the URLs and pulling the body contents.
4. Repeating steps 1-3 until all 100 locations are scraped.
    
To keep the resume contents in a format that will make it easier to process later on, we're going to use a pandas dataframe, which is effectively a glorified Excel table.

In [21]:
from collections import defaultdict

import pandas as pd

from ancillary import update_search_page

In [22]:
class CraigslistScrapper():
    def __init__(self, driver, locations):
        self.driver = driver
        self.locations = locations
        
    # Function 1 to go to a particular webpage
    def cl_resume_area(self, city, zipcode, distance=9999):
        return self.driver.get("https://{}.craigslist.org/d/resumes/search/rrr?postal={}&search_distance={}".format(city, zipcode, distance))
    
    # Function 2 to get all the associated URLs for the location.
    def extract_urls(self):
        urls = []
        
        # try statement to check if page exists
        try: 
            n_pages = int(self.driver.find_element_by_xpath('//*[@id="searchform"]/div[3]/div[3]/span[2]/span[3]/span[2]').text) // 120 + 1

            for page in range(n_pages):
                self.driver.get(update_search_page(self.driver.current_url, page))
                cl_results = self.driver.find_element_by_xpath('//*[@id="sortable-results"]/ul')
                cl_results_items = cl_results.find_elements_by_tag_name("li")
                for item in cl_results_items:
                    urls.append(item.find_element_by_tag_name('a').get_attribute('href'))

            return urls
        
        except Exception:  # count or results element missing: return what we have so far
            return urls
        
    # Function 3 to get the body content of a posting
    def extract_resume_content(self, url):
        self.driver.get(url)

        try:
            return self.driver.find_element_by_xpath('//*[@id="postingbody"]').text
        except Exception:  # posting body not found (e.g. removed post)
            return None
        
    # additional function to determine if location is in a diff area i.e. NJ w/ NYC search
    def extract_nearby_location(self):
        return self.driver.find_element_by_xpath('//*[@id="sortable-results"]/ul/li[1]/p/span[2]/span[1]').get_attribute('title')
    
    # Looping over all locations and saving as a dataframe
    def execute_scrape(self):
        self.raw_data = defaultdict(list)
        
        for city, zipcode in self.locations:
            self.cl_resume_area(city, zipcode)
            for url in self.extract_urls():
                contents = self.extract_resume_content(url)
                if contents is None:
                    continue
                    
                try:
                    city = self.extract_nearby_location()
                except Exception:  # no nearby-location label found: keep the current city
                    pass
                
                self.raw_data['city'].append(city)
                self.raw_data['searched_zipcode'].append(zipcode)
                self.raw_data['url'].append(url)
                self.raw_data['content'].append(contents)
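The pagination step inside `extract_urls` is worth a closer look. The sketch below shows the page-count arithmetic and one way a URL could be moved to a given results page. It is not the actual `ancillary.update_search_page`: the names `count_pages` and `with_page_offset` are hypothetical, and the `s` offset query parameter is an assumption about how the search pages are addressed.

```python
import math
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

RESULTS_PER_PAGE = 120  # page size implied by the `// 120` in extract_urls


def count_pages(total_results, per_page=RESULTS_PER_PAGE):
    """Number of result pages, rounding up and never below one."""
    return max(1, math.ceil(total_results / per_page))


def with_page_offset(url, page, per_page=RESULTS_PER_PAGE, offset_param="s"):
    """Return `url` with its offset parameter set for the given page.

    `offset_param="s"` is an assumption, not confirmed by the tutorial.
    """
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query[offset_param] = str(page * per_page)  # overwrite any existing offset
    return urlunparse(parts._replace(query=urlencode(query)))


base = "https://elpaso.craigslist.org/d/resumes/search/rrr?postal=79936&search_distance=30"
print(count_pages(250))
print(with_page_offset(base, 1))
```

Two design points: `total // 120 + 1` requests one extra (empty) page whenever the total is an exact multiple of 120, which the ceiling avoids; and overwriting the offset rather than appending it keeps the helper safe to call on `driver.current_url` even after a previous page has already been loaded.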

In [23]:
scraper = CraigslistScrapper(driver, webpage_table_info)  # WARNING: scrapes all 100 locations; do not run in a demo.
# scraper = CraigslistScrapper(driver, [webpage_table_info[0]])
scraper.execute_scrape()

In [26]:
df = pd.DataFrame(scraper.raw_data)

In [27]:
df.to_csv("./../assets/data_large.csv")