## Pulling from Craigslist: Overview
The following code scrapes Craigslist (CL) job posts. First, a query is run against each geo-specific CL site for each job category:

Geo CL Site
> `https://sfbay.craigslist.org`

Job Categories
> `ofc, bus, csr, edu, egr, etc...`

The specific URLs that we pull from CL return the search results page for full-time jobs that were posted today in a given job category. The URL format is:

> `https://sfbay.craigslist.org/search/{Job Category}?employment_type=1&postedToday=1`
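
As a minimal sketch of that format, the helper below fills in the site and the category code. The name `build_search_url` is hypothetical and not part of the notebook; the query parameters are the ones shown above.

```python
def build_search_url(site, job_cat):
    """Build the 'full time, posted today' search URL for one site and category.

    `site` is a geo-specific base URL such as https://sfbay.craigslist.org and
    `job_cat` is a 3-letter category code such as 'ofc'.
    """
    # Drop a trailing slash so the result never contains '//search'.
    base = site.rstrip("/")
    return "{0}/search/{1}?employment_type=1&postedToday=1".format(base, job_cat)


example_url = build_search_url("https://sfbay.craigslist.org", "ofc")
print(example_url)
```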


In [3]:
from __future__ import division
import sys
import pandas as pd
import numpy as np
import random
import time
import psycopg2
%pylab inline
import requests
from bs4 import BeautifulSoup as bs4
import re
from IPython.display import clear_output


def clean_name(raw):
    """Keep only letters, lowercase them, and collapse runs of whitespace."""
    letters = re.sub("[^a-zA-Z]", " ", raw)
    words = letters.lower().split()
    return " ".join(words)

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


## Load Sets of CL sites and job types to query over

(1) SITES: The list of ~420 geo-specific CL sites was pulled from this website [find website] and is stored in a .csv file in the local directory.

(2) JOB TYPES: The list of job types is manually coded. Each code was found by searching that category on CL and reading the corresponding 3-letter code from the URL.
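
To get a feel for the size of the crawl, the sketch below pairs sites with category codes using `itertools.product`. The two sites and three codes are only an illustrative subset (`newyork` is an assumed example of a second geo site). With roughly 420 sites and the 32 codes used in this notebook, the same pairing gives about 13,440 site/category queries before any paging.

```python
from itertools import product

# Illustrative subset only; the notebook loads the real site list from a .csv file.
example_sites = ["https://sfbay.craigslist.org", "https://newyork.craigslist.org"]
example_cats = ["ofc", "bus", "csr"]

# Every (site, category) pair corresponds to one search query.
queries = list(product(example_sites, example_cats))
print(len(queries))  # 2 sites x 3 categories

# Rough size of the full crawl: ~420 sites x 32 category codes.
full_crawl_queries = 420 * 32
print(full_crawl_queries)
```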



In [4]:
#this code loads the various CL sites and job categories we want to search

CL_sites=pd.read_csv('CL_US_Sites.csv')
#shuffle the site order with pandas, then keep a plain list of URL strings
CL_sites=CL_sites['site'].astype(str).sample(frac=1).tolist()

CL_job_cats=['ofc','bus','csr','edu','egr','etc','acc','fbh','lab','gov','hea','hum','eng','lgl','mnu','mar','med',
            'npo','rej','ret','sls','spa','sci','sec','trd','sof','sad','tch','trp','tfr','web','wri']

## Pull All Job Postings from Today

There are 4 nested loops that must be executed to pull all job postings on CL:

(1) CL GEO SITE: Loop through all CL sites

(2) JOB CATEGORY: Loop through all job categories
    
(3) RESULT PAGES: Since CL returns search results in batches of 100, we must iterate through each HTML result page and scrape it. 

(4) JOB DESCRIPTION: Loop through all job posting links to pull the text from the individual job descriptions

The scraping cell (In [25]) builds a dataframe with one row per posting, containing the following:
- Time of CL posting
- CL posting title
- CL link to the actual job posting
- The job category
- Job description
- The CL site the posting was found on
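
The result-page loop and the link handling can be sketched in isolation, without any network access. This is a simplified illustration, not the notebook's code: `collect_links` and `fetch_page` are hypothetical names, the posting paths are made up, and it assumes Python 3 (`urllib.parse`), whereas the notebook itself targets Python 2.

```python
from urllib.parse import urljoin  # Python 3; the notebook itself targets Python 2

PAGE_SIZE = 100  # results per page, as described above


def collect_links(fetch_page, site, max_pages=1000):
    """Return unique absolute posting links, stopping at the first short page.

    fetch_page(offset) is a hypothetical callable that returns the list of
    hrefs found on the result page starting at `offset`.
    """
    seen = set()
    ordered = []
    for page in range(max_pages):
        hrefs = fetch_page(page * PAGE_SIZE)
        for href in hrefs:
            # urljoin resolves both "/path" and "//host/path" against the site URL.
            link = urljoin(site, href)
            if link not in seen:
                seen.add(link)
                ordered.append(link)
        if len(hrefs) < PAGE_SIZE:
            break  # fewer than a full page: no more results
    return ordered


# Fake two-page result set standing in for real HTTP requests (made-up paths).
fake_pages = {
    0: ["/sfc/ofc/{0}.html".format(n) for n in range(100)],
    100: ["//sfbay.craigslist.org/sfc/ofc/0.html", "/sfc/ofc/100.html"],
}
links = collect_links(lambda offset: fake_pages.get(offset, []),
                      "https://sfbay.craigslist.org")
print(len(links))
```

The second fake page repeats one posting in protocol-relative form, so it is dropped as a duplicate after resolution. Unlike the notebook, which prefixes `http:` to such links, this sketch lets them inherit the scheme of the site URL.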

In [25]:
## To scrape, let this run for a few minutes and then hit 'stop'. Your IP address will be blocked eventually. 

results = []  # We'll store the data here

# Careful with this...too many queries == your IP gets banned temporarily

search_indices = np.arange(0, 100000, 100)  #result offsets in steps of 100, one per result page; the loop below stops at the first page with fewer than 100 rows

# to loop through all (1) CL sites, (2) CL job cats, and (3) result pages

site_job_count=0
result_count=0
desc_count=0

results_df=pd.DataFrame(columns=['time','title', 'links', 'job description'])

for site in CL_sites:
    
    for job_cat in CL_job_cats:
        clear_output(wait=1)
        print("{:.4%}".format(site_job_count/(len(CL_sites)*len(CL_job_cats))))
        site_job_count=site_job_count+1
        
        for i in search_indices:
            result_count=result_count+1
            #Create URL to pull
            url = site + '/search/{0}?employment_type=1&postedToday=1'.format(job_cat)
            jobs = []  #reset so a failed request cannot reuse the previous page's rows

            try: #Send to CL.com
                resp = requests.get(url, params={'s': i})

                #Turn the returned info into readable HTML
                txt = bs4(resp.text, 'html.parser')

                #Grab just the listed job postings 
                jobs = txt.findAll(attrs={'class': "row"})

                # Find the CL posting title, the posting link, and posting time
                title = [rw.find('a', attrs={'class': 'hdrlnk'}).text
                              for rw in jobs]
                links = [rw.find('a', attrs={'class': 'hdrlnk'})['href']
                         for rw in jobs]
                #named post_times so the imported time module is not shadowed
                post_times = [pd.to_datetime(rw.find('time')['datetime']) for rw in jobs]
                
                #links=np.unique(links)

                #Create the full job posting URLs (the geo site + specific single posting)
                temp=[]
            
                for x in links:
                    #links that already name a craigslist host only get a scheme prefix
                    if 'craigslist' in x: 
                        temp.append('http:' + x)
                    else: 
                        temp.append((site + x))

                #temp=set(links)-set(results_df.links)
                temp_df=pd.DataFrame({'links':temp,'time':post_times,'title':title})
                links_new=set(temp_df.links)-set(results_df.links)
                results_df=pd.concat([results_df,temp_df],ignore_index=True)
                results_df=results_df.drop_duplicates('links')
                
                
                
                
                
                #########Get individual CL job descriptions#########
               
                for link in links_new:
                    url2 = link
                    desc_count=desc_count+1
                    try:
                        resp2 = requests.get(url2)
                        txt2 = bs4(resp2.text, 'html.parser')
                        jobs2 = txt2.findAll('section', attrs={'class': 'userbody'})
                        desc = [rw.get_text() for rw in jobs2]

                        if len(desc)>0:
                            results_df.loc[results_df.links==link,'job description']=clean_name(desc[0])
                              
                            
                    except KeyboardInterrupt: 
                        sys.exit()

                    except Exception:
                        #skip postings that fail to download or parse
                        pass
        
                # Tag the newly added postings with their job category and CL site
                                         
                results_df.loc[results_df.links.isin(links_new),'job category'] = job_cat
                results_df.loc[results_df.links.isin(links_new),'CL site'] =site
                
            except KeyboardInterrupt: 
                sys.exit()    
                
            except Exception:
                #skip result pages that fail to download or parse
                pass
            
            #a short page means this was the last page of results
            if len(jobs)<100:
                break
                    



0.0673%


SystemExit: 

To exit: use 'exit', 'quit', or Ctrl-D.


In [28]:
#Writes a csv file containing CL job postings from today

results_df.to_csv('CL_JOBS.csv', encoding='utf-8')