## Webscraping

In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

### 1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from IPython.core.display import HTML

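# Fetch a web page and return its parsed content as a BeautifulSoup object
def parseWebpageWithSoup(url, parserType):
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
    return BeautifulSoup(response.text, features=parserType)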
    

URL = 'https://realpython.github.io/fake-jobs/'
parseType = 'html.parser'

soup = parseWebpageWithSoup(URL,parseType)


a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.

In [2]:
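# .find returns the first matching tag; the job titles are <h2> tags with the class "title"
firstTitle = soup.find('h2', class_='title')
firstTitle.text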

'Senior Python Developer'

b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.

In [3]:
# find_all is the current name of the deprecated findAll method
titles = soup.find_all('h2')
titleList = [title.text for title in titles]
titleList

['Senior Python Developer',
 'Energy engineer',
 'Legal executive',
 'Fitness centre manager',
 'Product manager',
 'Medical technical officer',
 'Physiological scientist',
 'Textile designer',
 'Television floor manager',
 'Waste management officer',
 'Software Engineer (Python)',
 'Interpreter',
 'Architect',
 'Meteorologist',
 'Audiological scientist',
 'English as a second language teacher',
 'Surgeon',
 'Equities trader',
 'Newspaper journalist',
 'Materials engineer',
 'Python Programmer (Entry-Level)',
 'Product/process development scientist',
 'Scientist, research (maths)',
 'Ecologist',
 'Materials engineer',
 'Historic buildings inspector/conservation officer',
 'Data scientist',
 'Psychiatrist',
 'Structural engineer',
 'Immigration officer',
 'Python Programmer (Entry-Level)',
 'Neurosurgeon',
 'Broadcast engineer',
 'Make',
 'Nurse, adult',
 'Air broker',
 'Editor, film/video',
 'Production assistant, radio',
 'Engineer, communications',
 'Sales executive',
 'Software Deve

c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.


In [334]:
def getTextFromTagsOrClass(tagName, attrs=None):
    """Return the cleaned text of every tag named tagName in the global soup object.
    attrs is an optional dictionary of attributes to match, such as {'class': 'title'}."""
    # Default to None rather than a mutable {} default argument
    tags = soup.find_all(tagName, attrs or {})
    return [tag.text.replace('\n', '').strip() for tag in tags]

posting_df = pd.DataFrame(
    {
        'Title': getTextFromTagsOrClass('h2',{'class':'title'}),
        'Company': getTextFromTagsOrClass('h3',{'class':'subtitle'}),
        'Location' : getTextFromTagsOrClass('p',{'class':'location'}),
        'Posting_Date' : getTextFromTagsOrClass('time')
    }
    
)

posting_df

Unnamed: 0,Title,Company,Location,Posting_Date
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08
...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08
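
Building each column from its own `find_all` call assumes that every job card contains every field. If one card were missing a location, for example, the lists would differ in length or slip out of alignment. A minimal sketch of an alternative is to loop over one container per job and read each field from inside it. The inline HTML below is a made-up stand-in for the real page, the `card-content` container class is an assumption that should be checked against the page source, and `text_or_none` and `parse_cards` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page: two job cards, the second one has no location
SAMPLE_HTML = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""


def text_or_none(parent, name, **kwargs):
    """Return the stripped text of the first matching child tag, or None if it is absent."""
    tag = parent.find(name, **kwargs)
    return tag.get_text(strip=True) if tag else None


def parse_cards(html):
    """Return one dictionary per job card, so the fields of a job always stay together."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumption: each job sits in its own <div class="card-content">
    for card in soup.find_all("div", class_="card-content"):
        rows.append({
            "Title": text_or_none(card, "h2", class_="title"),
            "Company": text_or_none(card, "h3", class_="subtitle"),
            "Location": text_or_none(card, "p", class_="location"),
            "Posting_Date": text_or_none(card, "time"),
        })
    return rows


rows = parse_cards(SAMPLE_HTML)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give the same column layout as above, with a missing value wherever a card lacks a field.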


### 2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.
a. First, use the BeautifulSoup find_all method to extract the urls.

In [26]:
posting_df["URL"] = [link.get('href') for link in soup.find_all('a') if link.text == 'Apply']
# .str.lower() returns a new Series, so assign it back to keep the result
posting_df['URL'] = posting_df['URL'].str.lower()
posting_df

Unnamed: 0,Title,Company,Location,Posting_Date,URL
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/se...
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/en...
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/le...
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fi...
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/pr...
...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/mu...
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/ra...
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/da...
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fu...


b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

In [25]:
import re
# Remove punctuation from each title, turn spaces and slashes into hyphens, append the row number,
# and lowercase the whole url so that it matches the urls extracted with BeautifulSoup
posting_df["URL"] = [
    f'https://realpython.github.io/fake-jobs/jobs/{re.sub(r"[^a-zA-Z0-9/ -]", "", value).replace(" ", "-").replace("/", "-")}-{index}.html'.lower()
    for index, value in enumerate(posting_df['Title'])
]
posting_df

Unnamed: 0,Title,Company,Location,Posting_Date,URL
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/se...
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/en...
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/le...
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fi...
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/pr...
...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/mu...
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/ra...
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/da...
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fu...
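
The string cleaning can also be pulled into a small helper, which makes it easier to try on the tricky titles. This is a sketch under an assumed pattern: the title is lowercased, every run of characters other than letters and digits becomes a single hyphen, and the row number is appended. `build_job_url` is a hypothetical name, and the results should still be compared against the urls extracted with BeautifulSoup.

```python
import re

BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def build_job_url(title, index):
    """Build a job url from a title and its row number (assumed pattern, see above)."""
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{BASE_URL}{slug}-{index}.html"


print(build_job_url("Software Engineer (Python)", 10))
print(build_job_url("Scientist, research (maths)", 22))
```

Collapsing runs of characters avoids doubled hyphens in titles such as "Scientist, research (maths)", where a comma is followed by a space.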


### 3. Finally, we want to get the job description text for each job.

a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.

In [12]:
# Note: this cell uses the page for the third job (legal-executive-2) rather than the first job
URL = 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html'
parseType = 'html.parser'
soupDesc = parseWebpageWithSoup(URL,parseType)

def has_class_and_id(tag):
    # Despite the name, this is True when the tag has a class OR an id attribute
    return tag.has_attr('class') or tag.has_attr('id')


# The job description is the second <p> tag and has no class or id attribute,
# so print the <p> tags that have neither.
for pTag in soupDesc.find_all('p'):
    if not has_class_and_id(pTag):
        print(pTag.string)

Administration even relate head color. Staff beyond chair recently and off. Own available buy country store build before. Already against which continue. Look road article quickly. International big employee determine positive go Congress. Level others record hospital employee toward like.


b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".

In [18]:
# Return the job description text from the job page at JobURL
def getJobDescription(JobURL):
    parseType = 'html.parser'
    soupDesc = parseWebpageWithSoup(JobURL, parseType)

    # The description is the first <p> tag without a class or id attribute
    for pTag in soupDesc.find_all('p'):
        if not has_class_and_id(pTag):
            return pTag.string

jobLink = 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html'
print(getJobDescription(jobLink))

At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.


c. Use the .apply method on the url column you created above to retrieve the description text for all of the jobs.

In [28]:
posting_df['URL'].apply(getJobDescription)

0     Professional asset web application environment...
1     Party prevent live. Quickly candidate change a...
2     Administration even relate head color. Staff b...
3     Tv program actually race tonight themselves tr...
4     Traditional page a although for study anyone. ...
                            ...                        
95    Paper age physical current note. There reality...
96    Able such right culture. Wrong pick structure ...
97    Create day party decade high clear. Past trade...
98    Pressure under rock next week. Recognize so re...
99    Management common popular project only. Must s...
Name: URL, Length: 100, dtype: object

## Webscraping Bonus

1. Navigate to https://www.billboard.com/charts/hot-100/. Using BeautifulSoup, extract the This Week, artist, song, Last Week, Peak Position, and Weeks on Chart values into a pandas DataFrame. Hint: The HTML for the number one ranked song is slightly different from that of the rest of the songs.

In [196]:
URL = 'https://www.billboard.com/charts/hot-100/'
parseType = 'html.parser'
soupBillBoard = parseWebpageWithSoup(URL,parseType)

In [270]:
soupBillBoard.select('.o-chart-results-list-row')

[<ul class="o-chart-results-list-row // lrv-a-unstyle-list lrv-u-flex u-height-200 u-height-100@mobile-max u-height-100@tablet-only lrv-u-background-color-white a-chart-has-chart-detail" data-ajax="" data-detail-target="1">
 <li class="o-chart-results-list__item // lrv-u-background-color-black lrv-u-color-white u-width-100 u-width-55@mobile-max u-width-55@tablet-only lrv-u-height-100p lrv-u-flex lrv-u-flex-direction-column lrv-u-flex-shrink-0 lrv-u-align-items-center lrv-u-justify-content-center lrv-u-border-b-1 u-border-b-0@mobile-max lrv-u-border-color-grey">
 <span class="c-label a-font-primary-bold-l u-font-size-32@tablet u-letter-spacing-0080@tablet">
 	
 	1
 </span>
 <span class="c-label u-width-40 a-font-primary-bold-xxs lrv-u-background-color-grey-dark u-color-yellow lrv-u-text-align-center u-hidden@tablet">
 	
 	NEW
 </span>
 </li>
 <li class="o-chart-results-list__item // u-width-200 u-width-100@tablet-only u-width-67@mobile-max lrv-u-border-b-1 u-border-b-0@mobile-max lrv-u-

### Solution:
The solution relies on the specific CSS classes attached to each item needed in the DataFrame. BeautifulSoup extracts each HTML tag by its class, and the extracted text is then used to build the DataFrame. However, the last week, peak position, and weeks on chart values share a common class, and the extracted list contains duplicates, so I have to filter out the duplicates to get one set of values per song.
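
The duplicate filtering described above can be sketched on its own. The sample values are made up, and the layout (groups of three values, with each song's group appearing twice and the second copy kept) is taken from the loop in the cell below rather than verified against the page. `every_other_group` is a hypothetical helper name.

```python
def every_other_group(values, size=3):
    """Split values into consecutive groups of `size` and keep the groups at odd positions."""
    groups = [values[i:i + size] for i in range(0, len(values), size)]
    return groups[1::2]


# Made-up values: each (last week, peak position, weeks on chart) group appears twice
sample = ["-", "1", "1", "-", "1", "1", "1", "1", "28", "1", "1", "28"]

kept = every_other_group(sample)
# Transpose the kept groups into one list per column
last_week, peak, weeks = (list(col) for col in zip(*kept))
print(last_week, peak, weeks)
```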

In [239]:
Position = [tag.text.strip() for tag in soupBillBoard.find_all("span", attrs={"class": "u-letter-spacing-0080@tablet"})]

# The number one song uses a different class from the other songs, so extract it separately
songTitle = [soupBillBoard.find("h3", attrs={"class": "u-font-size-23@tablet"}).text.strip()]
art2 = [tag.text.strip() for tag in soupBillBoard.find_all("h3", attrs={"class": "lrv-u-font-size-18@tablet"})]
songList = songTitle + art2

artistName = [tag.text.strip() for tag in soupBillBoard.find_all("span", attrs={"class": "lrv-u-font-size-14@mobile-max"})]

lastWeek = []
PeakPosition = []
WksOnChart = []

# Last week, peak position, and weeks on chart share one class, so split the values into groups of three
num = [tag.text.strip() for tag in soupBillBoard.find_all("span", attrs={"class": "lrv-u-padding-tb-050@mobile-max"})]
step = 3
listOfList = [num[i:i + step] for i in range(0, len(num), step)]

# The list "num" has duplicate data, so keep only the groups at odd positions
for j in range(len(listOfList)):
    if j % 2:
        lastWeek.append(listOfList[j][0])
        PeakPosition.append(listOfList[j][1])
        WksOnChart.append(listOfList[j][2])

hot100CurrentWeek = pd.DataFrame(
    {
        'Position':Position,
        'Song' : songList,
        'Artist':artistName,
        'LastWeek':lastWeek,
        'PeakPosition':PeakPosition,
        'WksOnChart':WksOnChart
    }
)

hot100CurrentWeek

Unnamed: 0,Position,Song,Artist,LastWeek,PeakPosition,WksOnChart
0,1,Love Somebody,Morgan Wallen,-,1,1
1,2,A Bar Song (Tipsy),Shaboozey,1,1,28
2,3,Birds Of A Feather,Billie Eilish,2,2,23
3,4,Die With A Smile,Lady Gaga & Bruno Mars,4,3,10
4,5,Espresso,Sabrina Carpenter,3,3,28
...,...,...,...,...,...,...
95,96,Leave Me Alone,BigXthaPlug,99,96,2
96,97,Belong Together,Mark Ambor,-,74,24
97,98,The Emptiness Machine,Linkin Park,-,21,6
98,99,Mantra,Jennie,98,98,2


#### 2. After getting the code working for the current chart, navigate to last week's chart. Notice how the url for the page changes. Write a function which will, given a date, return a pandas DataFrame containing the Billboard chart data for that date.

In [242]:
def retrieveBillBoardChart(yourDate):
    URL = 'https://www.billboard.com/charts/hot-100/' + yourDate+'/'
    parseType = 'html.parser'
    soupBillBoard = parseWebpageWithSoup(URL,parseType)
    
    Position = [tag.text.strip() for tag in soupBillBoard.find_all("span", attrs={"class": "u-letter-spacing-0080@tablet"})]

    # The number one song uses a different class from the other songs, so extract it separately
    songTitle = [soupBillBoard.find("h3", attrs={"class": "u-font-size-23@tablet"}).text.strip()]
    art2 = [tag.text.strip() for tag in soupBillBoard.find_all("h3", attrs={"class": "lrv-u-font-size-18@tablet"})]
    songList = songTitle + art2

    artistName = [tag.text.strip() for tag in soupBillBoard.find_all("span", attrs={"class": "lrv-u-font-size-14@mobile-max"})]

    lastWeek = []
    PeakPosition = []
    WksOnChart = []

    # Last week, peak position, and weeks on chart share one class, so split the values into groups of three
    num = [tag.text.strip() for tag in soupBillBoard.find_all("span", attrs={"class": "lrv-u-padding-tb-050@mobile-max"})]
    step = 3
    listOfList = [num[i:i + step] for i in range(0, len(num), step)]

    # The list "num" has duplicate data, so keep only the groups at odd positions
    for j in range(len(listOfList)):
        if j % 2:
            lastWeek.append(listOfList[j][0])
            PeakPosition.append(listOfList[j][1])
            WksOnChart.append(listOfList[j][2])

    return pd.DataFrame(
        {
            'Position':Position,
            'Song' : songList,
            'Artist':artistName,
            'LastWeek':lastWeek,
            'PeakPosition':PeakPosition,
            'WksOnChart':WksOnChart
        }
    )

In [245]:
mychart = retrieveBillBoardChart('2024-10-26')

In [246]:
mychart

Unnamed: 0,Position,Song,Artist,LastWeek,PeakPosition,WksOnChart
0,1,A Bar Song (Tipsy),Shaboozey,1,1,27
1,2,Birds Of A Feather,Billie Eilish,2,2,22
2,3,Espresso,Sabrina Carpenter,4,3,27
3,4,Die With A Smile,Lady Gaga & Bruno Mars,5,3,9
4,5,I Had Some Help,Post Malone Featuring Morgan Wallen,3,1,23
...,...,...,...,...,...,...
95,96,Scared Love,Rod Wave,-,96,1
96,97,Lil Demon,Future,64,25,4
97,98,Mantra,Jennie,-,98,1
98,99,Leave Me Alone,BigXthaPlug,-,99,1
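
Because the date is pasted straight into the url, a malformed date only shows up later as a failed or unexpected page. A minimal sketch of validating the date first, assuming the url pattern used above (the base url followed by a YYYY-MM-DD date and a trailing slash). `chart_url` is a hypothetical helper name.

```python
from datetime import datetime

BASE_URL = "https://www.billboard.com/charts/hot-100/"


def chart_url(chart_date):
    """Return the chart url for a YYYY-MM-DD date string."""
    # strptime raises ValueError when the string is not a valid date in this format
    parsed = datetime.strptime(chart_date, "%Y-%m-%d")
    # Reformat so the url always uses zero-padded month and day
    return f"{BASE_URL}{parsed:%Y-%m-%d}/"


print(chart_url("2024-10-26"))
```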


#### 3. Write a loop to retrieve the Billboard chart data for the last 10 weeks.

In [253]:
from datetime import datetime, timedelta
now = datetime.now()

In [262]:
chartDataframes=[]
for n in range(10):  # 10 weeks, starting with the current one
    chartDate = (now - timedelta(weeks=n)).strftime('%Y-%m-%d')
    chartDataframes.append(retrieveBillBoardChart(chartDate))

chartDataframes

[   Position                    Song                  Artist LastWeek  \
 0         1           Love Somebody           Morgan Wallen        -   
 1         2      A Bar Song (Tipsy)               Shaboozey        1   
 2         3      Birds Of A Feather           Billie Eilish        2   
 3         4        Die With A Smile  Lady Gaga & Bruno Mars        4   
 4         5                Espresso       Sabrina Carpenter        3   
 ..      ...                     ...                     ...      ...   
 95       96          Leave Me Alone             BigXthaPlug       99   
 96       97         Belong Together              Mark Ambor        -   
 97       98   The Emptiness Machine             Linkin Park        -   
 98       99                  Mantra                  Jennie       98   
 99      100  Angel With An Attitude                Rod Wave       46   
 
    PeakPosition WksOnChart  
 0             1          1  
 1             1         28  
 2             2         23  
 3
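
The date arithmetic in the loop can be checked separately from the scraping. This sketch uses a fixed start date instead of `datetime.now()` so that the result is reproducible. `weekly_dates` is a hypothetical helper name, and the start date is an arbitrary example.

```python
from datetime import date, timedelta


def weekly_dates(start, weeks):
    """Return `weeks` date strings, one per week, counting back from `start`."""
    return [(start - timedelta(weeks=n)).strftime("%Y-%m-%d") for n in range(weeks)]


dates = weekly_dates(date(2024, 11, 2), 10)
print(dates)
```

Each string could then be passed to `retrieveBillBoardChart`. Adding the date as a column to each DataFrame before combining them with `pd.concat` would keep track of which week a row belongs to.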

### Webscraping Nightmare_Mode

Web Scraping the Ryman Calendar
In this exercise, your objective is to use BeautifulSoup in order to obtain a dataset of upcoming events at the Ryman. This information is available at https://ryman.com/events/, but you will take the contents of this website and convert it into a pandas DataFrame.

The website splits the events across multiple pages, but start by just working on the first page. Later on in the exercise, you'll take what you've done for the first page and apply it across other pages.

#### 1. Start by using either the inspector or by viewing the page source. Can you identify a tag that might be helpful for finding the names of all performers? For now, just worry about the headliner and don't worry about the opener. (E.g., for Vince Gill, featuring Wendy Moten, we only care about Vince Gill.) Make use of this to create a list containing just the name of each headliner.

In [302]:
URL = 'https://ryman.com/events/'
parseType = 'html.parser'
soupRyman = parseWebpageWithSoup(URL,parseType)

# The performer names are in <h3> tags, so select those
tagList = list(tag.text.replace('\n','').strip() for tag in soupRyman.select('#list > div > div:nth-of-type(2) > h3'))
tagList

["A Tribute to Ramblin' Jack Elliott - CANCELED",
 'Straight No Chaser',
 'Clairo',
 'Ryman Sidewalk Sessions with Sam Jones & the Wretched Pews',
 'Nitty Gritty Dirt Band & Friends',
 'Kathleen Madigan',
 'Dawes',
 'Bonnie Raitt',
 'Leon Bridges',
 'Luke Grimes',
 'Chelsea Cutler and Jeremy Zucker present Brent Forever: The Tour',
 'ZZ Top']

#### 2. Next, try to find a tag that could be used to find the date and time for each show. Extract these into a list. Challenge: Convert these into two lists, one containing the date and the other containing the time. (E.g., split Mar 9, 2023 8:00 PM into Mar 9, 2023 and 8:00 PM.)

In [323]:
dateTimeList = list(tag.text.replace('\n','').replace("  "," ").replace("- ","-").strip() for tag in soupRyman.select('#list > div > div:nth-of-type(2) > div:nth-of-type(1) > a:nth-of-type(1)'))
dateTimeList

['Nov 2, 2024 8:00 PM',
 'Nov 3, 2024 7:00 PM',
 'Nov 4, 2024 7:30 PM',
 'Nov 7, 2024 5:30 PM',
 'Nov 7-8, 2024',
 'Nov 9, 2024 7:00 PM',
 'Nov 10, 2024 7:30 PM',
 'Nov 11, 2024 7:30 PM',
 'Nov 11-13, 2024',
 'Nov 14, 2024 7:30 PM',
 'Nov 17, 2024 7:30 PM',
 'Nov 18, 2024 7:30 PM']

In [329]:
#Identify the date list
import re
regexDate1 = r"\w{3}\s{1,2}\d{1,2}\W\s\d{4}"
regexDate2 = r"\w{3}\s{1,2}\d{1,2}\W\d{1,2}\W\s\d{4}"

dateList=[]
for dateTimeItem in dateTimeList:
    if re.match(regexDate1, dateTimeItem):
        dateList.append(re.findall(regexDate1,dateTimeItem))
    if re.match(regexDate2, dateTimeItem):
        dateList.append(re.findall(regexDate2,dateTimeItem))

dateList
    


[['Nov 2, 2024'],
 ['Nov 3, 2024'],
 ['Nov 4, 2024'],
 ['Nov 7, 2024'],
 ['Nov 7-8, 2024'],
 ['Nov 9, 2024'],
 ['Nov 10, 2024'],
 ['Nov 11, 2024'],
 ['Nov 11-13, 2024'],
 ['Nov 14, 2024'],
 ['Nov 17, 2024'],
 ['Nov 18, 2024']]

In [346]:
import re
# Match times such as 8:00 PM or 10:30 AM; the hour can have one or two digits
regexDate3 = r"\d{1,2}:\d{2}\s[AP]M"
timeList = []
for dateTimeItem in dateTimeList:
    timeList.append(re.findall(regexDate3, dateTimeItem))

timeList  

[['8:00 PM'],
 ['7:00 PM'],
 ['7:30 PM'],
 ['5:30 PM'],
 [],
 ['7:00 PM'],
 ['7:30 PM'],
 ['7:30 PM'],
 [],
 ['7:30 PM'],
 ['7:30 PM'],
 ['7:30 PM']]
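
The two lists above hold one-element lists rather than plain strings. A minimal sketch of a single pattern that splits each entry into a flat date string and a flat time string, with `None` where no time is listed. The sample entries are copied from the output above. `split_date_time` is a hypothetical helper name, and the pattern assumes the layouts seen so far (a single day or a day range, optionally followed by a time).

```python
import re

# Date: month abbreviation, day or day range, comma, year. Time: optional, e.g. 7:30 PM.
DATE_TIME = re.compile(
    r"^(?P<date>\w{3}\s+\d{1,2}(?:-\d{1,2})?,\s+\d{4})"
    r"(?:\s+(?P<time>\d{1,2}:\d{2}\s[AP]M))?$"
)


def split_date_time(item):
    """Return (date, time) for one entry; time is None when the entry has no time."""
    match = DATE_TIME.match(item)
    if match is None:
        return item, None  # unrecognized layout: keep the raw string as the date
    return match.group("date"), match.group("time")


sample = ['Nov 2, 2024 8:00 PM', 'Nov 7-8, 2024', 'Nov 11-13, 2024', 'Nov 14, 2024 7:30 PM']
dates, times = (list(col) for col in zip(*map(split_date_time, sample)))
print(dates)
print(times)
```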