# Example - Netflix job postings

### Introduction

Like some other companies, Netflix posts its job offers on a platform called Lever. In this example, I capture information on these jobs, extracting it from the Lever web pages. In this simple **web scraping** exercise, I use the Python packages Requests and Beautiful Soup.

**Netflix job postings** can be found at `jobs.lever.co/netflix`. I call this page the **main page**. The day you visit it, it will display about 450 postings. The postings can be filtered by city, team and work type. Most of the postings on display are for teams from the Streaming division.

The main page contains, for each available position, basic information about the job, such as the job title, the location and the team, and a link to a page specific for that position, such as `jobs.lever.co/netflix/2d11d912-bfb3-4d9d-bfa1-0ce036214284`. I call that specific page the **individual page**. The individual page presents a description of the company and the role of the new employee.

### Capturing the source code

**HTTP** is a protocol for communication between clients and servers. For instance, a client (such as your browser) sends an **HTTP request** to the server. Then the server returns the response to the client. The response contains status information about the request and, when the request is accepted, the requested content.

**GET** is one of the most common HTTP methods. It is used to request data from a specified resource. In the Python package `requests`, the function `get` is an implementation of the HTTP method GET. `requests` comes with the Anaconda distribution, so you can import it directly.

In [1]:
import requests

The `requests` function `get` returns an object of a special type (type `requests.models.Response`). The attribute `text` of this object is a string which, for an ordinary web page, is the HTML source code.

In [2]:
html_str = requests.get('https://jobs.lever.co/netflix').text

Now, `html_str` is a string containing the source code of the Netflix Lever main page.
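Before using the text, it may be worth checking that the request was accepted. The following is only a minimal sketch, not part of the original example: the helper name `fetch_html` and the 10-second timeout are assumptions of mine.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML source of `url`, raising an error on a failed request.

    `fetch_html` is a hypothetical helper; the timeout value is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text

# Usage (this sends a real request, so it is left commented out):
# html_str = fetch_html('https://jobs.lever.co/netflix')
```

With this helper, a rejected request stops the script with an error instead of silently handing an error page to the parser.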

### Parsing the source code

To parse HTML code, learning the tree structure it conveys, I use the function `BeautifulSoup` from the package `bs4` (Beautiful Soup, version 4). I import this function with: 

In [3]:
from bs4 import BeautifulSoup

`BeautifulSoup` transforms the string `html_str` into the "soup" object `soup`:

In [4]:
soup = BeautifulSoup(html_str)
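When no parser is named, Beautiful Soup picks one of those installed and typically issues a warning, so the result can differ between machines. The sketch below names the parser explicitly and runs on a made-up HTML fragment (an assumption for illustration, not the real Lever source). The names `sample_html` and `sample_soup` are mine.

```python
from bs4 import BeautifulSoup

# Made-up fragment, standing in for the much longer `html_str`
sample_html = '<html><body><h5 data-qa="posting-name">VFX Supervisor</h5></body></html>'

# 'html.parser' is the parser included in the Python standard library
sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_h5 = sample_soup.find('h5')
print(first_h5.name)   # tag name
print(first_h5.attrs)  # dictionary of attributes
print(first_h5.text)   # text inside the tag
```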

Next, I use the method `find_all` to extract from the soup the data on the title, as a list. Every item of the list will be one job title. I will repeat the exercise with the job location and the team for every job. To get the extra information found on the individual pages, I will also extract a list with the links to those pages.

### Job titles

In a web scraping job, we take advantage of the fact that web pages posting information units in a systematic way have a repetitive structure, in which every unit is contained in a set of HTML elements with the same names and attribute values. So, by means of `find_all`, we can capture one of the features for all the units in one shot, as a list. Let me show you how to do this with the job title.

The key assumption is that all job titles will be stored in HTML elements with the same name and attribute values, and that this combination is exclusive to job titles. This is, precisely, what allows Lever to update the pages in a programmatic way with the information supplied by Netflix.

To use `find_all`, we need to know the name of the tag and, probably, some of the attributes. How can we find this? There are many ways, and every web scraper has their own cookbook. The simplest approach is based on browser tools. First, I count the number of times that *APPLY* appears on the page. This is 455 (you will probably get a different number when you visit the page). So I know the number of job titles that I have to capture.

Next, I use the *Inspect* tool of the browser. Right-click on the first job title, opening a contextual menu, and select *Inspect*. This will open a window showing a view of the source code in which the element containing that job title is highlighted. The element is:

`<h5 data-qa="posting-name">Compositing Supervisor - Wendell & Wild</h5>`

So, I try:

In [5]:
job = soup.find_all('h5', {'data-qa': 'posting-name'})

If this is right, I must have a list with 455 elements. Indeed:

In [6]:
len(job)

455

To be sure, I explore the head and the tail of this list:

In [7]:
job[:5]

[<h5 data-qa="posting-name">Compositing Supervisor - Wendell &amp; Wild</h5>,
 <h5 data-qa="posting-name">Lead Compositor - Wendell &amp; Wild</h5>,
 <h5 data-qa="posting-name">Lead Houdini FX Artist  - Wendell &amp; Wild</h5>,
 <h5 data-qa="posting-name">Production Pipeline TD</h5>,
 <h5 data-qa="posting-name">VFX Supervisor</h5>]

In [8]:
job[-5:]

[<h5 data-qa="posting-name">Manager, CREWS Technology Program</h5>,
 <h5 data-qa="posting-name">Manager, Space Planning - EMEA</h5>,
 <h5 data-qa="posting-name">Manager, Space Planning - UCAN</h5>,
 <h5 data-qa="posting-name">Workplace Manager - Stockholm</h5>,
 <h5 data-qa="posting-name">Workplace Operations Co-Ordinator - MAC, London</h5>]

*Note*. The tags `h1`, `h2`, etc. are used for headers. Here, the `h5` elements don't carry a `class` attribute, because their style is unique, specified in a `style` element within the `head` part of the source code. Frequently you don't need the attribute value to capture these elements. In this case, `soup.find_all('h5')` would have given you the same result.
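Searching by tag name alone is safe only as long as no other `h5` element appears on the page. The sketch below uses a made-up fragment (not the real Lever source) with one extra `h5`, to show how the two searches can differ:

```python
from bs4 import BeautifulSoup

# Made-up fragment: two postings plus an unrelated h5 header
sample_html = '''
<div class="posting"><h5 data-qa="posting-name">Production Pipeline TD</h5></div>
<div class="posting"><h5 data-qa="posting-name">VFX Supervisor</h5></div>
<h5>About us</h5>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

by_name = sample_soup.find_all('h5')                               # every h5 element
by_attr = sample_soup.find_all('h5', {'data-qa': 'posting-name'})  # only job titles

print(len(by_name))  # 3
print(len(by_attr))  # 2
```

Comparing both counts with the number of postings seen in the browser is a quick way to decide whether the attribute is needed.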

To extract the text from every element, I use a list comprehension:

In [9]:
job = [j.text for j in job]
job[:5]

['Compositing Supervisor - Wendell & Wild',
 'Lead Compositor - Wendell & Wild',
 'Lead Houdini FX Artist  - Wendell & Wild',
 'Production Pipeline TD',
 'VFX Supervisor']
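Note that the third title contains a double space. If that matters for later matching, the whitespace can be normalized. This is a minimal sketch on a hard-coded list copied from the output above; the names `job_sample` and `clean_job` are mine.

```python
job_sample = ['Compositing Supervisor - Wendell & Wild',
              'Lead Houdini FX Artist  - Wendell & Wild']

# str.split() with no argument splits on any run of whitespace,
# so joining the pieces with a single space collapses repeated spaces
clean_job = [' '.join(j.split()) for j in job_sample]
print(clean_job)
```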

### Job location

Next comes the job location. Following the same approach as for the job title, I find it as the text within a `span` tag with `class="sort-by-location posting-category small-category-label"`.

In [10]:
location = soup.find_all('span', 'sort-by-location posting-category small-category-label')
location = [l.text for l in location]
location[:5]

['Oregon',
 'Oregon',
 'Oregon',
 'Los Angeles, California',
 'Los Angeles, California']
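The full class string is not the only way to match these elements. As a sketch on a made-up fragment (not the real Lever source), a single class name, or an equivalent CSS selector passed to `select`, captures the same `span`:

```python
from bs4 import BeautifulSoup

# Made-up fragment with one location and one team element
sample_html = '''
<span class="sort-by-location posting-category small-category-label">Oregon</span>
<span class="sort-by-team posting-category small-category-label">Animation</span>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A single class name matches any element whose class list contains it
by_one_class = sample_soup.find_all('span', class_='sort-by-location')

# The same search written as a CSS selector
by_selector = sample_soup.select('span.sort-by-location')

print([s.text for s in by_one_class])  # ['Oregon']
print([s.text for s in by_selector])   # ['Oregon']
```

Matching on one distinctive class is less likely to break if the page later adds or reorders the other classes.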

### Team

The team is found in a `span` tag with `class="sort-by-team posting-category small-category-label"`:

In [11]:
team = soup.find_all('span', 'sort-by-team posting-category small-category-label')
team = [t.text for t in team]
team[:5]

['Animation – Animation',
 'Animation – Animation',
 'Animation – Animation',
 'Animation – Animation',
 'Animation – Animation']

The team comes in two parts: (a) a division, such as *Animation* or *Gaming*, and (b) a department, such as *Art* or *Production Management*. It might be interesting to split it into these two parts, which are separated by a symbol that looks like a hyphen but is a bit longer. It is the **en dash** (see `jkorpela.fi/dashes.html` if you are curious about this). You can copy and paste it in a Jupyter interface, or use the Unicode escape `\u2013`.

In [12]:
team = [t.split(' – ') for t in team]
team[:5]

[['Animation', 'Animation'],
 ['Animation', 'Animation'],
 ['Animation', 'Animation'],
 ['Animation', 'Animation'],
 ['Animation', 'Animation']]

Once the split has been performed, I name the two parts:

In [13]:
division = [t[0] for t in team]
division[:5]

['Animation', 'Animation', 'Animation', 'Animation', 'Animation']

In [14]:
dept = [t[1] for t in team]
dept[:5]

['Animation', 'Animation', 'Animation', 'Animation', 'Animation']
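The same split can be written with the Unicode escape instead of a pasted en dash. The sketch below runs on a hard-coded list, in which the second value is made up for illustration. The argument `maxsplit=1` is my addition: it guarantees two parts at most, in case a department name contains an en dash itself.

```python
# Hard-coded sample; 'Gaming – Art' is a made-up combination
team_sample = ['Animation \u2013 Animation', 'Gaming \u2013 Art']

# '\u2013' is the en dash; maxsplit=1 keeps any further en dash in the second part
pairs = [t.split(' \u2013 ', maxsplit=1) for t in team_sample]
division_sample = [p[0] for p in pairs]
dept_sample = [p[1] for p in pairs]

print(division_sample)  # ['Animation', 'Gaming']
print(dept_sample)      # ['Animation', 'Art']
```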

### Links

With the *Inspect* tool, one can learn that there is a rectangular area of the page associated with every job posting. For the first posting, that area corresponds to an HTML element with a `div` tag, with `class="posting"` and `data-qa-posting-id="bec6d4e4-47ac-4ec0-b2b2-34ad1845de11"`. The `data-qa-posting-id` value is specific to that posting, and it works as an ID for the job. The link is formed by pasting the ID after the fixed part `https://jobs.lever.co/netflix/`.

So, one way of getting the links is by capturing the ID as:

In [15]:
id = soup.find_all('div', 'posting')
id = [i['data-qa-posting-id'] for i in id]

Alternatively, since we know that every link has to appear as the value of an `href` attribute in an `a` tag, we can search directly for these tags. Indeed, just below the `h5` tag enclosing the job title, the link appears twice, in two `a` tags. The second one has `class="posting-title"`. So, the links can be captured as:

In [16]:
link = soup.find_all('a', 'posting-title')
link = [l['href'] for l in link]
link[:5]

['https://jobs.lever.co/netflix/bec6d4e4-47ac-4ec0-b2b2-34ad1845de11',
 'https://jobs.lever.co/netflix/78b120a0-1f92-4f74-b894-716fa7fb83fc',
 'https://jobs.lever.co/netflix/3f9031e5-e350-483c-be7e-508236645b8c',
 'https://jobs.lever.co/netflix/f0615765-1451-42ae-bf76-7d3dfc1de481',
 'https://jobs.lever.co/netflix/d98d1d6a-3b80-4a6d-b0f0-e1648bbd6034']
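The two approaches can be checked against each other. The sketch below builds the links from the IDs and compares them with the `href` values. It runs on a made-up, simplified fragment (the real page has more elements inside each `div`), and the names `base_url`, `links_from_ids` and `links_from_href` are mine.

```python
from bs4 import BeautifulSoup

# Simplified, made-up fragment for a single posting
sample_html = '''
<div class="posting" data-qa-posting-id="bec6d4e4-47ac-4ec0-b2b2-34ad1845de11">
  <a class="posting-title" href="https://jobs.lever.co/netflix/bec6d4e4-47ac-4ec0-b2b2-34ad1845de11"></a>
</div>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

base_url = 'https://jobs.lever.co/netflix/'

# First approach: paste every ID after the fixed part of the URL
ids = [d['data-qa-posting-id'] for d in sample_soup.find_all('div', 'posting')]
links_from_ids = [base_url + i for i in ids]

# Second approach: read the href attribute directly
links_from_href = [a['href'] for a in sample_soup.find_all('a', 'posting-title')]

print(links_from_ids == links_from_href)  # True
```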

### Packing

Now, leaving aside `team` and `id`, I have five lists: `job`, `location`, `division`, `dept` and `link`. I can pack them as the columns of a Pandas data frame:

In [17]:
import pandas as pd

In [18]:
df = pd.DataFrame({'job': job, 'location': location, 'division': division, 'dept': dept, 'link': link})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   job       455 non-null    object
 1   location  455 non-null    object
 2   division  455 non-null    object
 3   dept      455 non-null    object
 4   link      455 non-null    object
dtypes: object(5)
memory usage: 17.9+ KB


In [19]:
df.head()

| | job | location | division | dept | link |
|---|---|---|---|---|---|
| 0 | Compositing Supervisor - Wendell & Wild | Oregon | Animation | Animation | https://jobs.lever.co/netflix/bec6d4e4-47ac-4e... |
| 1 | Lead Compositor - Wendell & Wild | Oregon | Animation | Animation | https://jobs.lever.co/netflix/78b120a0-1f92-4f... |
| 2 | Lead Houdini FX Artist - Wendell & Wild | Oregon | Animation | Animation | https://jobs.lever.co/netflix/3f9031e5-e350-48... |
| 3 | Production Pipeline TD | Los Angeles, California | Animation | Animation | https://jobs.lever.co/netflix/f0615765-1451-42... |
| 4 | VFX Supervisor | Los Angeles, California | Animation | Animation | https://jobs.lever.co/netflix/d98d1d6a-3b80-4a... |
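Packing the lists this way works only if all of them have the same length; otherwise, `pd.DataFrame` raises a `ValueError`. A mismatch usually means that one of the searches captured too many or too few elements. The following sketch, with a hypothetical helper `check_lengths` and made-up sample lists, reports the length of every list before the data frame is built:

```python
def check_lengths(columns):
    """Return the common length of the lists in `columns`, or raise ValueError.

    `check_lengths` is a hypothetical helper, not part of pandas.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # The message shows which list is out of line
        raise ValueError(f'Lists of different length: {lengths}')
    return next(iter(lengths.values()), 0)

# Made-up sample, standing in for the five lists of the example
columns = {'job': ['VFX Supervisor', 'Production Pipeline TD'],
           'location': ['Los Angeles, California', 'Los Angeles, California']}
print(check_lengths(columns))  # 2
```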


### Exporting to CSV file

Finally, we can export the data to a CSV file by means of the function `to_csv` (edit the path of the file):

In [20]:
df.to_csv('netflix.csv', index=False)

The argument `index=False` is used to skip the default of `to_csv`, which is to add a column containing the index.
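The effect of `index=False` can be seen without writing any file. The sketch below exports a small made-up data frame to in-memory text buffers (`io.StringIO`) and compares the header lines. The names `sample_df`, `with_index` and `without_index` are mine.

```python
import io
import pandas as pd

# Made-up data frame with a single row
sample_df = pd.DataFrame({'job': ['VFX Supervisor'],
                          'location': ['Los Angeles, California']})

with_index = io.StringIO()
sample_df.to_csv(with_index)                  # default: index written as first column
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)  # index column skipped

print(with_index.getvalue().splitlines()[0])     # ,job,location
print(without_index.getvalue().splitlines()[0])  # job,location

# Reading the first version back turns the unnamed index column into 'Unnamed: 0'
with_index.seek(0)
print(list(pd.read_csv(with_index).columns))  # ['Unnamed: 0', 'job', 'location']
```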

### Homework

1. In this example, I have used the browser tools to find the elements containing the data. Nevertheless, in most cases you can use the method `find_all` to find them. For instance, if you know that the first title is `'Compositing Supervisor - Wendell & Wild'`, you can use this information to find the tag. To do this, import the package `re`, and use an appropriate regular expression `expr`, covering any potential tag name, in `soup.find_all(re.compile(expr), text='Compositing Supervisor - Wendell & Wild')`. Newer versions of Beautiful Soup call this argument `string`, and still accept `text`.

2. The method suggested above would work for any data that come as the text within a tag. Use it again to find the tag for the location of the first job posting. Since the location and the team are displayed in upper case in the web page and one cannot trust upper case in web pages, you may have to refine the method, using a regular expression as the text.

3. For the team, the method can give you trouble if you use `text='animation'`, but it will work fine with `text='animation – animation'`.

4. Find the elements containing the links using `find_all`.