## Your Tasks:

- Scrape the first 100 available search results
- Generalize your code to allow searching for different locations/jobs
- Pick out information about the URL, job title, and job location
- Save the results to a file

In [1]:
import requests
from bs4 import BeautifulSoup

---

### Part 1: Inspect

- How do the URLs change when you navigate to the next results page?
- How do the URLs change when you use a different location and/or job title search?
- Which HTML elements contain the link, title, and location of each job?

**Next Page**: The `start=` parameter gets added and incremented by the value of `10` for each additional page. This is because each results page displays 10 job results.

E.g.: <https://www.indeed.com/jobs?q=python&l=new+york&start=20>

**Different Location/Job Title**: The values for the query parameters `q` (for job title) and `l` (for location) change accordingly.

In [2]:
page = requests.get('https://www.indeed.com/jobs?q=python&l=new+york')

**HTML Elements**: A single job posting lives inside of a `div` element with the class name `result`. Inside there are other elements. You can find the specific info you're looking for here:

- **Link**: In the `href` attribute of the `<a>` Element that is a child of the title `<h2>` element
- **Title**: The text of the link in the `<h2>` element which also contains the link URL mentioned above
- **Location**: A `<span>` element with the telling class name `location`

---

### Part 2: Scrape

- Build the code to fetch the first 100 search results. This means you will need to automatically navigate to multiple results pages
- Write functions that allow you to specify the job title, location, and amount of results as arguments

In [3]:
page_2 = requests.get('https://www.indeed.com/jobs?q=python&l=new+york&start=20')

Every 10 results means you're on a new page. Let's make that an argument to a function:

In [4]:
def get_jobs(page=1):
    """Fetches the HTML from a search for Python jobs in New York on Indeed.com from a specified page."""
    base_url_indeed = 'https://www.indeed.com/jobs?q=python&l=new+york&start='
    results_start_num = page*10
    url = f'{base_url_indeed}{results_start_num}'
    page = requests.get(url)
    return page

In [5]:
get_jobs(3)

<Response [200]>

In [6]:
get_jobs(4)

<Response [200]>

Great! Let's customize this function some more to allow for different search queries and search locations:

In [7]:
def get_jobs(title, location, page=1):
    """Fetches the HTML from a search for Python jobs in New York on Indeed.com from a specified page."""
    loc = location.replace(' ', '+')  # for multi-part locations
    base_url_indeed = f'https://www.indeed.com/jobs?q={title}&l={loc}&start='
    results_start_num = page*10
    url = f'{base_url_indeed}{results_start_num}'
    page = requests.get(url)
    return page

In [8]:
get_jobs('python', 'new york', 3)

<Response [200]>

With a generalized way of scraping the page done, you can move on to picking out the information you need by parsing the HTML.

---

### Part 3: Parse

- Sieve through your HTML soup to pick out only the job title, link, and location
- Format the results in a readable format (e.g. JSON)
- Save the results to a file

Let's start by getting access to all interesting search results for one page:

In [9]:
site = get_jobs('python', 'new york')

In [10]:
soup = BeautifulSoup(site.content)

In [11]:
results = soup.find(id='resultsCol')

In [12]:
jobs = results.find_all('div', class_='result')

**Job Titles** can be found like this:

In [13]:
job_titles = [job.find('h2').find('a').text.strip() for job in jobs]

In [14]:
job_titles

['Data Engineer Internship (REMOTE)',
 'Data Scientist Internship (REMOTE)',
 'Fraud Intelligence, Data Operations Analyst',
 'Data Analyst, Royalties and Accounting',
 'Computer Vision Co-Op Internship',
 'Software Developer (Work Remotely)',
 'Forward Deployed Engineer',
 'Data Scientist',
 'Junior Laboratory Associate',
 'Data Science Intern',
 'Technical Recruiter',
 'Python Developer - Clearing',
 'Python / Perl Dev (Application Development)',
 'Assistant Deputy Comptroller - Item # 06585',
 'Software Engineer - Python']

**Link URLs** need to be assembled, and can be found like this:

In [15]:
base_url = 'https://www.indeed.com'

In [16]:
job_links = [base_url + job.find('h2').find('a')['href'] for job in jobs]

In [17]:
job_links

['https://www.indeed.com/rc/clk?jk=230b869a1ae6f0d9&fccid=f0d890633e02e75f&vjs=3',
 'https://www.indeed.com/rc/clk?jk=a418287c804bd2e0&fccid=f0d890633e02e75f&vjs=3',
 'https://www.indeed.com/rc/clk?jk=9b185f14c1a68aaa&fccid=046813bd6f2871bb&vjs=3',
 'https://www.indeed.com/rc/clk?jk=68aa8fc72efce89d&fccid=fe404d18bb9eef1e&vjs=3',
 'https://www.indeed.com/rc/clk?jk=d27cdbda643f6fba&fccid=ffd98152d5e973b6&vjs=3',
 'https://www.indeed.com/rc/clk?jk=9fe2a2c6c88eb76f&fccid=105ecfd0283f415f&vjs=3',
 'https://www.indeed.com/rc/clk?jk=11d921a85988c3a5&fccid=e032d49a262c01c8&vjs=3',
 'https://www.indeed.com/rc/clk?jk=1636faa0308fd08f&fccid=1577085fc2290983&vjs=3',
 'https://www.indeed.com/rc/clk?jk=7866f9bbf9d0b701&fccid=8077183a161ef0fd&vjs=3',
 'https://www.indeed.com/rc/clk?jk=4ca17765619fed16&fccid=f68587a673422855&vjs=3',
 'https://www.indeed.com/rc/clk?jk=470fff695259ff95&fccid=363d27976f5fa667&vjs=3',
 'https://www.indeed.com/rc/clk?jk=a035a9bc24f8c883&fccid=01b91641951e8886&vjs=3',
 'ht

**Locations** can be picked out of the soup by their class name:

In [18]:
job_locations = [job.find(class_='location').text for job in jobs]

In [19]:
job_locations

['New York State',
 'New York State',
 'New York State',
 'New York, NY 10011 (Flatiron District area)',
 'Clifton Park, NY',
 'New York, NY 10003 (Flatiron District area)',
 'New York, NY 10002 (Lower East Side area)',
 'New York, NY',
 'New York, NY 10012 (Greenwich Village area)',
 'New York, NY 10006 (Financial District area)',
 'New York, NY 10013 (SoHo area)',
 'New York, NY 10005 (Financial District area)',
 'New York, NY',
 'New York, NY 10038 (Financial District area)',
 'New York, NY']

Let's assemble all this info into a function, so you can pick out the pieces and save them to a useful data structure:

In [20]:
def parse_info(soup):
    """
    Parses HTML containing job postings and picks out job title, location, and link.
    
    args:
    soup (BeautifulSoup object): A parsed bs4.BeautifulSoup object of a search results page on indeed.com
    
    returns:
    job_list (list): A list of dictionaries containing the title, link, and location of each job posting
    """
    results = soup.find(id='resultsCol')
    jobs = results.find_all('div', class_='result')
    base_url = 'https://www.indeed.com'

    job_list = list()
    for job in jobs:
        title = job.find('h2').find('a').text.strip()
        link = base_url + job.find('h2').find('a')['href']
        location = job.find(class_='location').text
        job_list.append({'title': title, 'link': link, 'location': location})

    return job_list

Let's give it a try:

In [21]:
page = get_jobs('python', 'new_york')

In [22]:
soup = BeautifulSoup(page.content)

In [23]:
results = parse_info(soup)

In [24]:
results

[{'title': 'Junior Software Engineer',
  'link': 'https://www.indeed.com/rc/clk?jk=611147834ac53ea4&fccid=5eb16c6eab3dc5ef&vjs=3',
  'location': 'Long Island City, NY 11101'},
 {'title': 'Data Strategy & Operations Associate',
  'link': 'https://www.indeed.com/rc/clk?jk=d63a6ea80c6f91ea&fccid=facc18fe475dac15&vjs=3',
  'location': 'New York, NY'},
 {'title': 'Python Developer',
  'link': 'https://www.indeed.com/rc/clk?jk=ff6aa54312d72b56&fccid=c2a63affe8751868&vjs=3',
  'location': 'New York, NY 10271 (Financial District area)'},
 {'title': 'Content Contributor: Data Science',
  'link': 'https://www.indeed.com/rc/clk?jk=4376c8c660c60538&fccid=b9d4e9eceb3ff4c0&vjs=3',
  'location': 'New York State'},
 {'title': 'Apprentice Conversion - Entry-Level Software Developer',
  'link': 'https://www.indeed.com/rc/clk?jk=49f1e9e8766bb5cf&fccid=de71a49b535e21cb&vjs=3',
  'location': 'Poughkeepsie, NY 12601'},
 {'title': 'Data Engineer Internship (REMOTE)',
  'link': 'https://www.indeed.com/rc/clk?

And let's add a final step of generalization:

In [25]:
def get_job_listings(title, location, amount=100):
    results = list()
    for page in range(amount//10):
        site = get_jobs(title, location, page=page)
        soup = BeautifulSoup(site.content)
        page_results = parse_info(soup)
        results += page_results
    return results

In [26]:
r = get_job_listings('python', 'new york', 100)

In [27]:
len(r)

135

In [29]:
r

[{'title': 'Junior Python Developer',
  'link': 'https://www.indeed.com/rc/clk?jk=f8f068da6dd93fa6&fccid=ca7680692810259a&vjs=3',
  'location': 'New York, NY 10022 (Midtown area)'},
 {'title': "Mosaic Group's Free Artificial Intelligence Academy",
  'link': 'https://www.indeed.com/rc/clk?jk=420abc44f8e74252&fccid=bb2c5b251a31bfc4&vjs=3',
  'location': 'New York State'},
 {'title': 'Python Developer',
  'link': 'https://www.indeed.com/rc/clk?jk=f61cf22d1bc1088d&fccid=0c39fb2c91742dcf&vjs=3',
  'location': 'New York, NY'},
 {'title': 'Penetration Testing Trainee (Remote USA)',
  'link': 'https://www.indeed.com/rc/clk?jk=487b30db63184515&fccid=bf0600f0f252b45b&vjs=3',
  'location': 'Florida, NY'},
 {'title': 'Data Technician (Full- or Part-Time)',
  'link': 'https://www.indeed.com/rc/clk?jk=da727c0cddda240e&fccid=56a26d4c816e53d1&vjs=3',
  'location': 'New York, NY 10003 (Greenwich Village area)'},
 {'title': 'Software Engineer Internship (REMOTE)',
  'link': 'https://www.indeed.com/rc/cl