## Webscraping

In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.


In [40]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np
import io
import re
import datetime
from datetime import timedelta



### 1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object. 

In [2]:
endpoint = 'https://realpython.github.io/fake-jobs/'
response = requests.get(endpoint)
response.raise_for_status()  # stop here if the request did not succeed
soup = BeautifulSoup(response.text, features="html.parser")

 #### a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title. 

In [3]:
s_python_dev = soup.find('h2').text.strip()  # first job title on the page

 #### b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list. 

In [4]:
job_titles = [item.text.strip() for item in soup.find_all('h2')]

 #### c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end. 

In [5]:
# Strip every field so no leading or trailing whitespace survives
companies = [name.text.strip() for name in soup.find_all(attrs={'class':'subtitle is-6 company'})]
locations = [place.text.strip() for place in soup.find_all(attrs={'class':'location'})]
dates = [day.text.strip() for day in soup.find_all('time')]
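
Building four separate lists works as long as every card contains every field. One alternative is to walk each job card once and pull all fields from it, using `get_text(strip=True)` so whitespace is trimmed at extraction time. The sketch below runs on a small inline HTML snippet rather than the live page; the `card-content` wrapper class, the sample markup, and the `parse_cards` helper are assumptions for illustration, so check them against the real HTML before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on one job card (an assumption,
# not copied from the live page).
SAMPLE_HTML = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
        Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""

def parse_cards(html):
    """Return one dict per job card, with surrounding whitespace stripped."""
    soup = BeautifulSoup(html, features="html.parser")
    rows = []
    for card in soup.select("div.card-content"):
        rows.append({
            "job title": card.find("h2").get_text(strip=True),
            "company": card.find(class_="company").get_text(strip=True),
            "location": card.find(class_="location").get_text(strip=True),
            "date": card.find("time").get_text(strip=True),
        })
    return rows

rows = parse_cards(SAMPLE_HTML)
print(rows[0])
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`.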


#### d. Take the lists that you have created and combine them into a pandas DataFrame. 


In [6]:
job_app_data = pd.DataFrame({'job title': job_titles, 'company':companies,'location': locations, 'date':dates})
job_app_data

Unnamed: 0,job title,company,location,date
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08
...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08


### 2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.  

#### a. First, use the BeautifulSoup find_all method to extract the urls. 

In [7]:
urls = [x['href'] for x in soup.find_all('a') if x.text == 'Apply']
urls

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html',
 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html',
 'https://realpython.github.io/fake-jobs/jobs/architect-12.html',
 'https://realpython.gi

#### b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.
    


In [8]:
job_app_data['url_form'] = (
    job_app_data['job title']
    .str.lower()
    .str.replace(r"[\ !@#$%\^&*\(\)\[\]\{\};:,\.\/<>?`~=_\+]", '-', regex=True)+'-'+job_app_data.index.astype('string')
)
job_app_data['url_form'] = [re.sub(r'--+', '-', row) for row in job_app_data['url_form']]
job_app_data['url'] = 'https://realpython.github.io/fake-jobs/jobs/' + job_app_data['url_form'].astype('string') + '.html'
job_app_data = job_app_data.drop(columns='url_form')
job_app_data

Unnamed: 0,job title,company,location,date,url
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/se...
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/en...
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/le...
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fi...
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/pr...
...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/mu...
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/ra...
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/da...
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fu...
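
The URL pattern can also be expressed as a small standalone function, which makes it easy to check individual titles against the URLs extracted with BeautifulSoup. This is a minimal sketch, not the approach used in the cell above: it treats every run of non-alphanumeric characters as a single hyphen instead of listing punctuation explicitly, and `build_job_url` is a hypothetical helper name. Titles with accented letters or other characters outside `a-z0-9` would need extra handling.

```python
import re

BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"

def build_job_url(title, index):
    """Build a job URL from a title and its row index (hypothetical helper)."""
    # Lowercase, then collapse each run of non-alphanumeric characters
    # into one hyphen and trim hyphens left at either end.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{BASE_URL}{slug}-{index}.html"

print(build_job_url("Software Engineer (Python)", 10))
```

To confirm the two approaches agree, compare the generated column with the extracted list, for example with `(job_app_data['url'] == urls).all()`.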


### 3. Finally, we want to get the job description text for each job.  


#### a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.  


In [9]:
endpoint = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'
response2 = requests.get(endpoint)
soup = BeautifulSoup(response2.text, features="html.parser")
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="box">
<h1 class="title is-2">Senior Python Developer</h1>
<h2 class="subtitle is-4 company">Payne, Roberts and Davis</h2>
<div class="content">
<p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inc

In [10]:
# The first <p> is the page subtitle; the second holds the job description
jd = soup.find_all('p')[1].text
jd

'Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.'

#### b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.". 

In [11]:
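def get_jd(url):
    """Return the job description text from the job page at `url`."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='html.parser')
    # The first <p> is the page subtitle; the second holds the description
    return soup.find_all('p')[1].text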

In [12]:
url = 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html'
get_jd(url)

'At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.'
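
Selecting the description by position (`[1]`) depends on how many paragraphs come before it. Since the page source shown earlier wraps the description in `<div class="content">`, a selector based on that container may be more robust. The sketch below separates parsing from fetching so it can be tried on an inline snippet; `extract_description` is a hypothetical helper and the sample markup uses placeholder description text, not a real listing.

```python
from bs4 import BeautifulSoup

# Trimmed-down markup modeled on the job page source; the description
# text is a placeholder, not the real listing.
SAMPLE_PAGE = """
<p class="subtitle is-3">Fake Jobs for Your Web Scraping Journey</p>
<div class="box">
  <h1 class="title is-2">Senior Python Developer</h1>
  <h2 class="subtitle is-4 company">Payne, Roberts and Davis</h2>
  <div class="content">
    <p>Placeholder description text.</p>
  </div>
</div>
"""

def extract_description(html):
    """Return the first paragraph inside div.content, or None if it is missing."""
    soup = BeautifulSoup(html, features="html.parser")
    paragraph = soup.select_one("div.content p")
    return paragraph.get_text(strip=True) if paragraph is not None else None

print(extract_description(SAMPLE_PAGE))
print(extract_description("<p>no content div here</p>"))
```

Returning `None` for a page without the expected container lets the caller decide how to handle missing descriptions instead of failing with an `IndexError`.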

#### c. Use the [.apply method](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) on the url column you created above to retrieve the description text for all of the jobs.

In [13]:
job_app_data['job description'] = job_app_data['url'].apply(get_jd)
job_app_data

Unnamed: 0,job title,company,location,date,url,job description
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/se...,Professional asset web application environment...
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/en...,Party prevent live. Quickly candidate change a...
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/le...,Administration even relate head color. Staff b...
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fi...,Tv program actually race tonight themselves tr...
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/pr...,Traditional page a although for study anyone. ...
...,...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/mu...,Paper age physical current note. There reality...
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/ra...,Able such right culture. Wrong pick structure ...
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/da...,Create day party decade high clear. Past trade...
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fu...,Pressure under rock next week. Recognize so re...


## Webscraping Bonus

### 1. Navigate to https://www.billboard.com/charts/hot-100/. Using BeautifulSoup, extract out the This Week, artist, song, Last Week, Peak Position, and Weeks on Chart values into a pandas DataFrame. Hint: The HTML for the number one ranked song is slightly different from that of the rest of the songs.

In [14]:
endpoint = 'https://www.billboard.com/charts/hot-100/'
response = requests.get(endpoint)
soup = BeautifulSoup(response.text, features='html.parser')

**The commented-out code below was my first attempt at pulling each requested item row by row, but I ran into indexing issues. I then backed out and realized I could write one larger for loop that gives me all of the information I was looking for except the 'This Week' column, which sits in a separate div and falls outside the scope of the loop. The this_week list comprehension still worked well, and because its values follow the row order, I could instead use the index as the 'This Week' column if I started it at 1 instead of 0.**

In [15]:
# titles = [row.h3.text.strip() for row in soup.select('.o-chart-results-list-row')]
# this_week = [row.span.text.strip() for row in soup.select('.o-chart-results-list-row')]
# artists = [row.ul.li.span.text.strip() for row in soup.select('.o-chart-results-list-row')]
# last_week = [row.ul.ul.span.text.strip() for row in soup.select('.o-chart-results-list-row')]

**This is my final solution**

In [16]:
# list comprehension to get 'This Week' values
this_week = [row.span.text.strip() for row in soup.select('.o-chart-results-list-row')]
# initialize list for use inside loop
here = []
for row in soup.select('.o-chart-results-list-row'):
    item = row.ul.text.strip() # gets the text info inside the unordered list
    item = re.sub(r'[\n\t]+', ',', item) # removes many \n and \t and replaces with commas
    here.append(item) #add each item to the list
# split the list into elements separated by commas rather than a list of 100 elements
data = [row.split(',') for row in here]
# create dataframe, clean, and reshape
info_df = (
    pd.DataFrame(data,columns = ['Song Title', 'Artist', 'Last Week A', 'Peak A', 'Wks on Chart A','Last Week B', 'Peak B', 'Wks on Chart B', 'None' ])
    .drop(columns=['Last Week B', 'Peak B', 'Wks on Chart B', 'None'])
    .rename(columns={'Last Week A':'Last Week', 'Peak A':'Peak', 'Wks on Chart A':'Weeks on Chart'})
)
info_df['This Week'] = this_week
info_df['Last Week'] = [re.sub(r'-+', 'New to the list!', row) for row in info_df['Last Week']]
info_df = info_df.loc[:,['Song Title', 'Artist','This Week', 'Last Week', 'Peak', 'Weeks on Chart']]
info_df

Unnamed: 0,Song Title,Artist,This Week,Last Week,Peak,Weeks on Chart
0,Love Somebody,Morgan Wallen,1,New to the list!,1,1
1,A Bar Song (Tipsy),Shaboozey,2,1,1,28
2,Birds Of A Feather,Billie Eilish,3,2,2,23
3,Die With A Smile,Lady Gaga & Bruno Mars,4,4,3,10
4,Espresso,Sabrina Carpenter,5,3,3,28
...,...,...,...,...,...,...
95,Leave Me Alone,BigXthaPlug,96,99,96,2
96,Belong Together,Mark Ambor,97,New to the list!,74,24
97,The Emptiness Machine,Linkin Park,98,New to the list!,21,6
98,Mantra,Jennie,99,98,98,2
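
One caveat with replacing whitespace runs with commas and then splitting on commas: a song title or artist name that itself contains a comma is split into extra fields. Splitting directly on the newline and tab runs avoids this. The sketch below uses a made-up row string (an assumption, not real chart text) to show the difference.

```python
import re

# Hypothetical raw text for one chart row; real rows contain more fields.
raw = "Hello, Goodbye\n\n\t\n\tSome Artist\n\n\t3\n\t1\n\t12"

# Approach used in the cell above: substitute commas, then split on commas.
via_commas = re.sub(r"[\n\t]+", ",", raw.strip()).split(",")

# Alternative: split on the whitespace runs themselves.
via_split = re.split(r"[\n\t]+", raw.strip())

print(len(via_commas), via_commas)
print(len(via_split), via_split)
```

With the comma-based approach the title is broken into two pieces and every later column shifts by one; `re.split` keeps the title whole.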


### 2. After getting the code working for the current chart, navigate to last week's chart. Notice how the url for the page changes. Write a function which will, given a date, return a pandas DataFrame containing the Billboard chart data for that date.

**Link format**
https://www.billboard.com/charts/hot-100/2024-10-26/

In [27]:
def get_billboard_chart(date=None):
    data = []  # rows collected during this call
    # default to the current week if no date is given
    if date is None:
        date = datetime.date.today().strftime('%Y-%m-%d')
    endpoint = 'https://www.billboard.com/charts/hot-100/' + date + '/'
    response = requests.get(endpoint)

    if response.status_code != 200:
        # Return early: info_df is never built when the request fails
        print(f"Request failed with status code: {response.status_code}")
        return None

    soup = BeautifulSoup(response.text, features='html.parser')
    # Loop through each row of the chart and extract relevant text data
    for row in soup.select('.o-chart-results-list-row'):
        item = row.ul.text.strip()  # Extract text inside the unordered list
        item = re.sub(r'[\n\t]+', ',', item)  # Replace newline and tab characters with commas
        row_data = item.split(',')

        # Adjust row length to exactly 7 columns
        if len(row_data) > 7:
            row_data = row_data[:7]  # Truncate to 7 items if longer
        elif len(row_data) < 7:
            row_data.extend([None] * (7 - len(row_data)))  # Pad with None if shorter

        data.append(row_data)  # Add processed row data to the list

    info_df = (
        pd.DataFrame(data, columns=['Song Title', 'Artist', 'Last Week A', 'Peak A', 'Wks on Chart A', 'Last Week B', 'Peak B'])
        .drop(columns=['Last Week B', 'Peak B'])
        .rename(columns={'Last Week A':'Last Week', 'Peak A':'Peak', 'Wks on Chart A':'Weeks on Chart'})
    )
    info_df = info_df.rename_axis(index='This Week')
    info_df.index += 1  # chart positions start at 1, not 0
    info_df['Last Week'] = [re.sub(r'-+', 'New to the list!', row) for row in info_df['Last Week']]
    info_df = info_df.loc[:,['Song Title', 'Artist', 'Last Week', 'Peak', 'Weeks on Chart']]
    return info_df

In [26]:
# date = '2024-02-10'
get_billboard_chart()

Unnamed: 0_level_0,Song Title,Artist,Last Week,Peak,Weeks on Chart
This Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Love Somebody,Morgan Wallen,New to the list!,1,1
2,A Bar Song (Tipsy),Shaboozey,1,1,28
3,Birds Of A Feather,Billie Eilish,2,2,23
4,Die With A Smile,Lady Gaga & Bruno Mars,4,3,10
5,Espresso,Sabrina Carpenter,3,3,28
...,...,...,...,...,...
96,Leave Me Alone,BigXthaPlug,99,96,2
97,Belong Together,Mark Ambor,New to the list!,74,24
98,The Emptiness Machine,Linkin Park,New to the list!,21,6
99,Mantra,Jennie,98,98,2
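
Because the function builds its URL by string concatenation, a malformed date only shows up as a failed request. A small helper can validate the date before any request is made. This is a sketch under the assumption that the link format shown above holds for every date; `chart_url` is a hypothetical name, not part of the solution.

```python
import datetime

BASE = "https://www.billboard.com/charts/hot-100/"

def chart_url(date=None):
    """Build the chart URL from a date object or a 'YYYY-MM-DD' string."""
    if date is None:
        date = datetime.date.today()
    elif isinstance(date, str):
        # fromisoformat raises ValueError on malformed input such as '2024-13-01'
        date = datetime.date.fromisoformat(date)
    elif isinstance(date, datetime.datetime):
        date = date.date()  # drop the time part so only YYYY-MM-DD remains
    return f"{BASE}{date.isoformat()}/"

print(chart_url("2024-10-26"))
print(chart_url(datetime.date(2024, 10, 26)))
```

Both calls produce the same URL as the link format example, and an invalid date string fails immediately with a `ValueError` rather than after a network round trip.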


### 3. Write a loop to retrieve the Billboard chart data for the last 10 weeks.

In [57]:
# Get today's date
today = datetime.date.today()
dfs = []
# Iterate 10 weeks back
for i in range(10):
    # Calculate the date for the week
    week_date = today - timedelta(weeks=i)
    # Format the date as a string (e.g., YYYY-MM-DD)
    date_string = week_date.strftime("%Y-%m-%d")
    dfs.append(get_billboard_chart(date_string))
print(dfs)
display(dfs[3])
display(dfs[8])

[                       Song Title                  Artist         Last Week  \
This Week                                                                     
1                   Love Somebody           Morgan Wallen  New to the list!   
2              A Bar Song (Tipsy)               Shaboozey                 1   
3              Birds Of A Feather           Billie Eilish                 2   
4                Die With A Smile  Lady Gaga & Bruno Mars                 4   
5                        Espresso       Sabrina Carpenter                 3   
...                           ...                     ...               ...   
96                 Leave Me Alone             BigXthaPlug                99   
97                Belong Together              Mark Ambor  New to the list!   
98          The Emptiness Machine             Linkin Park  New to the list!   
99                         Mantra                  Jennie                98   
100        Angel With An Attitude                Ro

Unnamed: 0_level_0,Song Title,Artist,Last Week,Peak,Weeks on Chart
This Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,A Bar Song (Tipsy),Shaboozey,1,1,25
2,Birds Of A Feather,Billie Eilish,6,2,20
3,Timeless,The Weeknd & Playboi Carti,New to the list!,3,1
4,I Had Some Help,Post Malone Featuring Morgan Wallen,2,1,21
5,Espresso,Sabrina Carpenter,3,3,25
...,...,...,...,...,...
96,Keep Up,Odetari,New to the list!,96,1
97,Passport Junkie,Rod Wave,98,61,3
98,This Is My Dirt,Justin Moore,New to the list!,98,1
99,Close To You,Gracie Abrams,New to the list!,49,9


Unnamed: 0_level_0,Song Title,Artist,Last Week,Peak,Weeks on Chart
This Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,A Bar Song (Tipsy),Shaboozey,1,1,20
2,Taste,Sabrina Carpenter,New to the list!,2,1
3,Please Please Please,Sabrina Carpenter,9,1,12
4,Espresso,Sabrina Carpenter,7,3,20
5,I Had Some Help,Post Malone Featuring Morgan Wallen,2,1,16
...,...,...,...,...,...
96,Prove It,21 Savage & Summer Walker,New to the list!,43,8
97,American Nights,Zach Bryan,100,21,8
98,She's Somebody's Daughter (Reimagined),Drew Baldridge,96,93,3
99,Close To You,Gracie Abrams,New to the list!,49,5


Unnamed: 0_level_0,Song Title,Artist,Last Week,Peak,Weeks on Chart
This Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,A Bar Song (Tipsy),Shaboozey,1,1,21
2,I Had Some Help,Post Malone Featuring Morgan Wallen,5,1,17
3,Espresso,Sabrina Carpenter,4,3,21
4,Please Please Please,Sabrina Carpenter,3,1,13
5,Taste,Sabrina Carpenter,2,2,2
...,...,...,...,...,...
96,Close To You,Gracie Abrams,99,49,6
97,Prove It,21 Savage & Summer Walker,96,43,9
98,Parking Lot,Mustard & Travis Scott,New to the list!,57,6
99,Sorry Not Sorry,Lil Yachty & Veeze,New to the list!,99,1
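
The date arithmetic in the loop can be separated from the scraping so that it is easy to check on its own. The sketch below uses a fixed end date instead of `datetime.date.today()` to keep the result reproducible; `weekly_date_strings` is a hypothetical helper name.

```python
import datetime

def weekly_date_strings(end_date, weeks=10):
    """Return `weeks` ISO date strings, stepping back 7 days at a time from end_date."""
    return [
        (end_date - datetime.timedelta(weeks=i)).isoformat()
        for i in range(weeks)
    ]

dates = weekly_date_strings(datetime.date(2024, 10, 26), weeks=3)
print(dates)  # ['2024-10-26', '2024-10-19', '2024-10-12']
```

Keeping the date strings in a list also gives each chart a label: assuming every request succeeded, something like `pd.concat(dfs, keys=dates)` would combine the weekly DataFrames into one, indexed by date and chart position.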
