## Web scraping with BeautifulSoup

This demo shows how to use BeautifulSoup to crawl job listings on Indeed.

In [25]:
## Import the necessary packages
from bs4 import BeautifulSoup
import urllib
import re
import pandas as pd

### 1. Collect the job links first

We use Indeed's mobile web version because its HTML is simpler.

In [26]:
from urllib.request import urlopen
url = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"
page = urlopen(url)
soup = BeautifulSoup(page, 'lxml')

## Job links on the listing page carry rel="nofollow"
all_matches = soup.find_all('a', attrs={'rel':['nofollow']})
for i in all_matches:
    print(i['href'])                                 ## relative link
    print(type(i['href']))
    print("https://www.indeed.com/m/" + i['href'])   ## absolute link

viewjob?jk=7aed7cc38549bce3
<class 'str'>
https://www.indeed.com/m/viewjob?jk=7aed7cc38549bce3
viewjob?jk=9b738179006949c6
<class 'str'>
https://www.indeed.com/m/viewjob?jk=9b738179006949c6
viewjob?jk=56e9ad66b5f392da
<class 'str'>
https://www.indeed.com/m/viewjob?jk=56e9ad66b5f392da
viewjob?jk=7ba20367fa1415e4
<class 'str'>
https://www.indeed.com/m/viewjob?jk=7ba20367fa1415e4
viewjob?jk=8d8506f8352eac32
<class 'str'>
https://www.indeed.com/m/viewjob?jk=8d8506f8352eac32
viewjob?jk=95d9a941220172a1
<class 'str'>
https://www.indeed.com/m/viewjob?jk=95d9a941220172a1
viewjob?jk=624c8f359efae6c0
<class 'str'>
https://www.indeed.com/m/viewjob?jk=624c8f359efae6c0
viewjob?jk=2256ad6801a8503d
<class 'str'>
https://www.indeed.com/m/viewjob?jk=2256ad6801a8503d
viewjob?jk=91ef39342a2a6562
<class 'str'>
https://www.indeed.com/m/viewjob?jk=91ef39342a2a6562
viewjob?jk=a09116a70fc1d56b
<class 'str'>
https://www.indeed.com/m/viewjob?jk=a09116a70fc1d56b
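
The links printed above are relative, so the notebook prepends the site prefix by hand. As a minimal sketch, the same step can be done with the standard library's `urllib.parse`, which also builds the query string of the search URL. The helper names `build_search_url` and `absolute_job_url` are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.indeed.com/m/jobs"

def build_search_url(query, place):
    """Encode the search terms into the query string of the listing page."""
    return BASE_URL + "?" + urlencode({"q": query, "l": place})

def absolute_job_url(page_url, href):
    """Resolve a relative href such as 'viewjob?jk=...' against the page URL."""
    return urljoin(page_url, href)

search_url = build_search_url("data scientist", "Los Angeles, CA")
job_url = absolute_job_url(search_url, "viewjob?jk=7aed7cc38549bce3")
print(search_url)
print(job_url)
```

`urljoin` resolves the link against the page it was found on, so the prefix does not have to be repeated in every loop.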


### 2. Find the title, company, location and detailed job description for each job

Let's first see a brief example:

In [27]:
test_html= \
'''
<html>
	<body>
		<p>
			<b>
				<font size="+1">Analyst - Data Science</font>
			</b>
			<br>The Boston Consulting Group - <span class="location">Los Angeles, CA</span>
		</p>
	</body>
</html>
'''


In [28]:
bs = BeautifulSoup(test_html,'lxml')

In [29]:
print(bs.body.p.b.font.text)

Analyst - Data Science


In [30]:
print(bs.body.p.text)



Analyst - Data Science

The Boston Consulting Group - Los Angeles, CA



In [31]:
print(bs.body.p.span.text)

Los Angeles, CA
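
Dotted access such as `bs.body.p.b.font.text` raises an `AttributeError` as soon as one tag in the chain is missing, because the missing step returns `None`. A small sketch of a more forgiving lookup follows; the helper `text_or_none` is hypothetical, and the built-in `html.parser` is used here only so the snippet does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

def text_or_none(root, *names):
    """Walk down the given tag names; return the stripped text or None."""
    node = root
    for name in names:
        if node is None:
            return None
        node = node.find(name)  # first descendant with this tag name, or None
    return node.get_text(strip=True) if node is not None else None

sample = '<html><body><p><b><font size="+1">Analyst - Data Science</font></b></p></body></html>'
sample_soup = BeautifulSoup(sample, 'html.parser')

found = text_or_none(sample_soup, 'body', 'p', 'b', 'font')
missing = text_or_none(sample_soup, 'body', 'p', 'span')  # no <span> in this sample
print(found)
print(missing)
```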


#### Find title, company, location and job description for one position

In [32]:
title = []
company = []
location = []
jd = []
for each in all_matches:
    jd_url = 'https://www.indeed.com/m/' + each['href']
    jd_page = urlopen(jd_url)
    jd_soup = BeautifulSoup(jd_page, 'lxml')
    jd_desc = jd_soup.find_all('div', attrs={'id':['desc']}) ## find the structure like: <div id="desc">...</div>

    title.append(jd_soup.body.p.b.font.text)
    company.append(jd_desc[0].span.text)
    location.append(jd_soup.body.p.span.text)
    jd.append(jd_desc[0].text)
    
#     break
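
The loop above opens one detail page per job in quick succession. A hedged sketch of a slower, more explicit fetch helper, using only the standard library, is shown below. The `USER_AGENT` string, the one-second delay, and the function names are assumptions made for illustration; the original notebook calls `urlopen` directly.

```python
import time
from urllib.request import Request, urlopen

USER_AGENT = "job-scraper-demo/0.1"  # hypothetical identifier, choose your own

def build_request(url):
    """Attach an explicit User-Agent header to the request."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, delay_seconds=1.0, timeout=10):
    """Wait briefly, then download the page and return its raw bytes."""
    time.sleep(delay_seconds)
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Inside the loop, `BeautifulSoup(fetch_html(jd_url), 'lxml')` could then stand in for the `urlopen` call.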

In [33]:
## Job Description
print(jd_desc[0].text)

Job Category: Information Technology
Department: Enterprise Data Strategy
Location: Los Angeles, CA, US, 90017
Position Type: Full Time
Requisition ID: 3621

Established in 1997, L.A. Care Health Plan is an independent public agency created by the state of California to provide health coverage to low-income Los Angeles County residents. We are the nation’s largest publicly operated health plan. Serving more than 2 million members in five health plans, we make sure our members get the right care at the right place at the right time.

Mission: L.A. Care’s mission is to provide access to quality health care for Los Angeles County's vulnerable and low-income communities and residents and to support the safety net required to achieve that purpose.

Job Summary

The Data Scientist is responsible for supporting L.A. Care’s strategic business initiatives through the application of predictive analytics as part of production workflows. This individual will lead projects through the iterative dat

In [34]:
## Job Title 
print(jd_soup.body.p.b.font.text)

Data Scientist


In [35]:
## Company Name
print(jd_desc[0].span.text)
print(jd_soup.body.p.span.previous_sibling.split('-')[0][1:])

L.A. Care Health Plan
L.A. Care Health Plan 
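
The second `print` above reads the company name from the text node just before the location `<span>`, splitting on `-` and dropping the first character. That breaks for hyphenated company names and depends on the exact leading whitespace. A sketch of a more tolerant version follows; the helper `company_from_header` and the second sample string are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

def company_from_header(p_tag):
    """Return the text that precedes <span class="location">, minus the trailing ' - '."""
    span = p_tag.find('span', class_='location')
    if span is None or not isinstance(span.previous_sibling, NavigableString):
        return None
    text = str(span.previous_sibling).strip()
    if text.endswith('-'):
        text = text[:-1]  # drop only the separator in front of the location
    return text.strip() or None

header = ('<p><b><font size="+1">Analyst - Data Science</font></b>'
          '<br>The Boston Consulting Group - <span class="location">Los Angeles, CA</span></p>')
hyphenated = '<p><br>Hewlett-Packard - <span class="location">Palo Alto, CA</span></p>'

company_a = company_from_header(BeautifulSoup(header, 'html.parser').p)
company_b = company_from_header(BeautifulSoup(hyphenated, 'html.parser').p)
print(company_a)
print(company_b)
```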


In [36]:
title

['Data Scientist',
 'Experience Insights Manager',
 'Data Scientist (Entry Level)',
 'Data Wrangler',
 'Chief Data Scientist',
 'Data Scientist',
 'Laboratory Assistant (AM)',
 'Research Analyst, Amazon Studios Consumer Insights',
 'Resesarch Business Administrastor II/III - Epidemiologic Research',
 'Data Scientist']

#### Save the data into Data Frame

In [37]:
job = {'title': title,
         'company': company,
         'location': location,
         'Job Description': jd}
df = pd.DataFrame.from_dict(job)

In [38]:
df

|   | title | company | location | Job Description |
|---|---|---|---|---|
| 0 | Data Scientist | Kaiser Permanente | Pasadena, CA | The Knowledge Management functional area manag... |
| 1 | Experience Insights Manager | Apple | Los Angeles, CA 90020 | Summary\nPosted: Dec 3, 2018\nRole Number: 200... |
| 2 | Data Scientist (Entry Level) | HireClout | Los Angeles, CA | Want to work on a profitable, data-focused pro... |
| 3 | Data Wrangler | USC | Los Angeles, CA | Environment\nThe Lawrence J. Ellison Institute... |
| 4 | Chief Data Scientist | Epix | Culver City, CA 90230 | Metro-Goldwyn-Mayer Studios Inc. ("MGM") is lo... |
| 5 | Data Scientist | Amazon.com | Santa Monica, CA | Job Description\nDo you like to solve the most... |
| 6 | Laboratory Assistant (AM) | Olympia Medical Center | Los Angeles, CA 90036 | 1. Completes laboratory requests with satisfac... |
| 7 | Research Analyst, Amazon Studios Consumer Insi... | Amazon.com | Santa Monica, CA | Job Description\nWe hire the world’s brightest... |
| 8 | Resesarch Business Administrastor II/III - Epi... | Kaiser Permanente | Pasadena, CA | The incumbent of this position assists the Div... |
| 9 | Data Scientist | L.A. Care Health Plan | Los Angeles, CA 90017 | Job Category: Information Technology\nDepartme... |
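
The notebook collects four parallel lists and zips them into a DataFrame. As an alternative sketch, each job can be kept as one dictionary so its fields cannot drift out of alignment, and the frame can be written to CSV without the index column. The two records below reuse values from the table above; the in-memory `io.StringIO` buffer stands in for a file and is an assumption of this example.

```python
import io
import pandas as pd

COLUMNS = ["title", "company", "location", "Job Description"]

# One dictionary per job, with values taken from the table above
records = [
    {"title": "Data Scientist", "company": "Kaiser Permanente",
     "location": "Pasadena, CA",
     "Job Description": "The Knowledge Management functional area manag..."},
    {"title": "Experience Insights Manager", "company": "Apple",
     "location": "Los Angeles, CA 90020",
     "Job Description": "Summary\nPosted: Dec 3, 2018\nRole Number: 200..."},
]

jobs_df = pd.DataFrame(records, columns=COLUMNS)

# index=False keeps the row index out of the file
buffer = io.StringIO()
jobs_df.to_csv(buffer, index=False)
csv_header = buffer.getvalue().splitlines()[0]
print(csv_header)
```

Writing with `index=False` avoids an extra `Unnamed: 0` column when the file is read back with `pd.read_csv`.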


With the `break` in the loop above commented out, the loop crawls the information for every job on the page.

### 3. Change pages automatically

In [22]:
title = []
company = []
location = []
jd = []
url = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"
for i in range(2):
    
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    all_matches = soup.findAll(attrs={'rel':['nofollow']})
    for each in all_matches:
        jd_url = 'https://www.indeed.com/m/' + each['href']
        jd_page = urlopen(jd_url)
        jd_soup = BeautifulSoup(jd_page, 'lxml')
        jd_desc = jd_soup.findAll(attrs={'id':['desc']})
        title.append(jd_soup.body.p.b.font.text)
        company.append(jd_desc[0].span.text)
        location.append(jd_soup.body.p.span.text)
        jd.append(jd_desc[0].text)
        
    ## Move on to the next page; stop if there is no "next" link
    url_all = soup.find_all(attrs={'rel':['next']})
    if not url_all:
        break
    url = 'https://www.indeed.com/m/' + str(url_all[0]['href'])
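
The page-changing step can be tried without touching the network by passing the download function in as an argument. The sketch below follows `rel="next"` links until none is left or a page limit is reached. The function `crawl_pages`, the `fake_site` dictionary, and the `example.com` URLs are all invented for this illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl_pages(start_url, fetch, max_pages=2):
    """Follow rel="next" links and return the URLs of the pages visited."""
    visited = []
    url = start_url
    for _ in range(max_pages):
        soup = BeautifulSoup(fetch(url), 'html.parser')
        visited.append(url)
        next_link = soup.find('a', attrs={'rel': 'next'})
        if next_link is None or not next_link.get('href'):
            break  # last page reached
        url = urljoin(url, next_link['href'])
    return visited

# Two fake pages standing in for real listing pages
fake_site = {
    "https://example.com/m/jobs": '<a rel="next" href="jobs?start=10">Next</a>',
    "https://example.com/m/jobs?start=10": "<p>last page</p>",
}
pages = crawl_pages("https://example.com/m/jobs", fake_site.__getitem__, max_pages=5)
print(pages)
```

Passing `fetch` in keeps the paging logic separate from the download step, so the same loop works with `urlopen` or with canned HTML.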


In [23]:
job = {'title': title,
         'company': company,
         'location': location,
         'Job Description': jd}
df = pd.DataFrame.from_dict(job)

In [24]:
df

|   | title | company | location | Job Description |
|---|---|---|---|---|
| 0 | Data Scientist (Entry Level) | HireClout | Los Angeles, CA | Want to work on a profitable, data-focused pro... |
| 1 | Data Scientist | Kaiser Permanente | Pasadena, CA | The Knowledge Management functional area manag... |
| 2 | Data Scientist | L.A. Care Health Plan | Los Angeles, CA 90017 | Job Category: Information Technology\nDepartme... |
| 3 | Laboratory Assistant (AM) | Olympia Medical Center | Los Angeles, CA 90036 | 1. Completes laboratory requests with satisfac... |
| 4 | Experience Insights Manager | Apple | Los Angeles, CA 90020 | Summary\nPosted: Dec 3, 2018\nRole Number: 200... |
| 5 | Receptionist | SARA | Cypress, CA 90630 | Description\nReceptionist\n\nJob Summary\nThis... |
| 6 | Data Scientist | Snap Inc. | Los Angeles, CA | Snap Inc. is a camera company. We believe that... |
| 7 | Data Scientist | Hearts and Science | Burbank, CA | Description:\n\nHearts & Science has been insp... |
| 8 | Learning Enrichment Insights Manager | Apple | Los Angeles, CA 90020 | Summary\nPosted: Nov 28, 2018\nRole Number: 20... |
| 9 | Linguist, Text Classification (Multiple Langua... | Google | Los Angeles, CA | Qualifications\nMinimum qualifications:\n\nBac... |
