# Web Scraping

### BUSI 520 - Python for Business Research
### Kerry Back, JGSB, Rice University

### Example

* Visit creditfixings.com
* Click on 2023 Auctions.  That takes us to

    https://www.creditfixings.com/CreditEventAuctions/AuctionByYear.jsp?year=2023
    
* Click on Diamond Sports Group L.L.C.
* Click on Diamond Sports Group L.L.C. Auction Results
* Click "I Agree"

* You should now be at https://www.creditfixings.com/CreditEventAuctions/results.jsp?ticker=DIAMSPO.  
* We want the data on this page.  Also, we want the other auctions for 2023 and for other years.
* Reading the data is easy: use pandas read_html.
* To find the data for other auctions, we need to follow the links on the site to find each auction's "ticker."
* pandas read_html reads all tables on a page into a list of dataframes, one dataframe per table.
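
As a minimal offline sketch of the last point, the snippet below runs read_html on a small hand-written HTML string instead of a live page. The dealer names and numbers are made up for illustration, and an HTML parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; the values are invented for illustration.
html = """
<table>
  <tr><th>Dealer</th><th>Bid</th><th>Offer</th></tr>
  <tr><td>Dealer A</td><td>1.00</td><td>3.00</td></tr>
  <tr><td>Dealer B</td><td>1.25</td><td>3.25</td></tr>
</table>
<table>
  <tr><th>Dealer</th><th>Size</th></tr>
  <tr><td>Dealer A</td><td>15.3</td></tr>
</table>
"""

# read_html returns one dataframe per <table> element, in page order.
tables = pd.read_html(StringIO(html))
print(len(tables))
for table in tables:
    print(table)
```

Wrapping the string in StringIO keeps the example independent of the network; with a real page, the URL can be passed directly as in the next cell.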

In [1]:
import pandas as pd
lst = pd.read_html(
    "https://www.creditfixings.com/CreditEventAuctions/results.jsp?ticker=DIAMSPO" 
)
for data in lst:
    print(data.head(3))

  Relevant Currency
0               USD
  Auction Currency Rates  Auction Currency Rates.1
0                USD/USD                       1.0
                  Dealer   Bid  Offer               Dealer.1
0      Barclays Bank PLC  1.00   3.00      Barclays Bank PLC
1         BNP Paribas SA  1.00   3.00         BNP Paribas SA
2  Bofa Securities, Inc.  1.25   3.25  Bofa Securities, Inc.
                  Dealer Bid/Offer    Size
0      Barclays Bank PLC     Offer  15.302
1         BNP Paribas SA     Offer   0.000
2  Bofa Securities, Inc.     Offer   0.000
                                0      1
0    Sum of Buy Physical Requests   0.0m
1   Sum of Sell Physical Requests  44.7m
2  Sum of Physical Request Trades   0.0m
                       Dealer     Bid  Size
0       Bofa Securities, Inc.    2.0*   3.0
1  J.P. Morgan Securities LLC  1.875*  10.0
2  J.P. Morgan Securities LLC   1.75^  32.7


### Read a webpage

In [2]:
import requests
session = requests.Session() 

url = "https://www.creditfixings.com/CreditEventAuctions/AuctionByYear.jsp?year=2023"
html = session.get(url).text
html

'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" >\n\n\n\n\n\n\n\n\n\n\n\n<html>\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\r\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\r\n<!-- Magnolia delivered pages are identified by full URLs proxy passed pages by relative URLs -->\r\n\n<head>\n    <title>Year Credit Event Fixing</title>\n\t<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>\r\n<meta content="" name="description"/>\r\n<meta content="" name="keywords"/>\r\n\n\t<link type="text/css" rel="stylesheet"\nhref="/information/affiliations/fixings/css/main.css" />\n<link type="text/css" rel="stylesheet"\nhref="/information/affiliations/fixings/css/richEdit.css" />\n<link type="text/css" rel="stylesheet"\nhref="/information/affiliations/fixings/css/header.css" />\n<link type="text/css" rel="stylesheet"\nhref="/information/affiliations/fixings/css/page_menu.css" />\n<style type="text/css">\n<!-- .style1 {color: #00

### Extract the elements of a page

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" >

<html>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<!-- Magnolia delivered pages are identified by full URLs proxy passed pages by relative URLs -->
<head>
<title>Year Credit Event Fixing</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="" name="description"/>
<meta content="" name="keywords"/>
<link href="/information/affiliations/fixings/css/main.css" rel="stylesheet" type="text/css"/>
<link href="/information/affiliations/fixings/css/richEdit.css" rel="stylesheet" type="text/css"/>
<link href="/information/affiliations/fixings/css/header.css" rel="stylesheet" type="text/css"/>
<link href="/information/affiliations/fixings/css/page_menu.css" rel="stylesheet" type="text/css"/>
<style type="text/css">
<!-- .style1 {color: #0065FF} .style2 {color: #993333} -->
</style>
</head>
<body>
<div id="

### Extract links containing a string

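By viewing the page source, we can see that the links to the individual auctions begin with "holdings.jsp". The CSS selector a[href^="holdings.jsp"] picks out the anchors whose href starts with that string.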

In [4]:
anchors = soup.select('a[href^="holdings.jsp"]') 
anchors

[<a href="holdings.jsp?auctionId=9175" target="_top">
                     		CASINO GUICHARDPERRACHON - 25 September 2023
                     	</a>,
 <a href="holdings.jsp?auctionId=9174" target="_top">
                     		Vue Entmt Intl Ltd (Bucket 1) - 14 September 2023
                     	</a>,
 <a href="holdings.jsp?auctionId=9173" target="_top">
                     		DIAMOND SPORTS GROUP LLC - 13 April 2023
                     	</a>]
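
In the selector, ^= means "starts with"; CSS also offers *= for "contains". The sketch below shows the difference on a hand-written snippet. The hrefs in it are hypothetical, apart from the holdings.jsp pattern seen above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; only the holdings.jsp pattern comes from the real page.
html = """
<a href="holdings.jsp?auctionId=1">Auction one</a>
<a href="archive/holdings.jsp?auctionId=2">Archived auction</a>
<a href="about.jsp">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# ^= matches hrefs that begin with the string.
starts = [a.get("href") for a in soup.select('a[href^="holdings.jsp"]')]
# *= matches hrefs that contain the string anywhere.
contains = [a.get("href") for a in soup.select('a[href*="holdings.jsp"]')]

print(starts)
print(contains)
```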

### Extract elements of an anchor

In [5]:
anchor = anchors[0]
text = anchor.text.strip()
url = anchor.get("href")
print("text is", text)
print("url is", url)

text is CASINO GUICHARDPERRACHON - 25 September 2023
url is holdings.jsp?auctionId=9175
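
The href printed above is relative to the page it was found on, so it must be combined with that page's address before it can be requested. One option, shown here as a sketch and not as the approach this notebook itself takes, is urljoin from the standard library:

```python
from urllib.parse import urljoin

# URL of the page the link was found on, and the relative href from the anchor.
page_url = "https://www.creditfixings.com/CreditEventAuctions/AuctionByYear.jsp?year=2023"
href = "holdings.jsp?auctionId=9175"

# urljoin resolves the href relative to the page, as a browser would.
full_url = urljoin(page_url, href)
print(full_url)
```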


### Go to one of those pages and extract links containing a string

By viewing the source of one of the holdings pages, we can see that the auction results are at a link beginning with "results.jsp".

In [6]:
url = anchor.get('href')
url = f'https://www.creditfixings.com/CreditEventAuctions/{url}'
html = session.get(url).text
soup = BeautifulSoup(html, "html.parser")
anchors = soup.select('a[href^="results.jsp"]') 
anchors

[<a href="results.jsp?ticker=GROUPE">CASINO GUICHARDPERRACHON Auction Results</a>]
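
The href of this anchor carries the auction's ticker as a query parameter. If the ticker itself is wanted, for example to label output, a sketch using the standard library's URL parsing could look like this:

```python
from urllib.parse import parse_qs, urlsplit

href = "results.jsp?ticker=GROUPE"

# Split off the query string and parse it into a dict mapping names to lists of values.
query = parse_qs(urlsplit(href).query)
ticker = query["ticker"][0]
print(ticker)
```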

### Extract the link and read the page with pandas read_html

In [7]:
anchor = anchors[0]
url = anchor.get('href')
url = f'https://www.creditfixings.com/CreditEventAuctions/{url}'

lst = pd.read_html(url)
for df in lst:
    print(df.head(3))

  Relevant Currency
0               EUR
                             Dealer   Bid  Offer  \
0                 Barclays Bank PLC  0.50   2.50   
1                       BNP Paribas  0.25   2.25   
2  Citigroup Global Markets Limited  0.75   2.75   

                           Dealer.1  
0                 Barclays Bank PLC  
1                       BNP Paribas  
2  Citigroup Global Markets Limited  
                             Dealer Bid/Offer  Size
0                 Barclays Bank PLC     Offer   0.0
1                       BNP Paribas     Offer  21.8
2  Citigroup Global Markets Limited     Offer   0.0
                                0       1
0    Sum of Buy Physical Requests   61.0m
1   Sum of Sell Physical Requests  258.3m
2  Sum of Physical Request Trades   61.0m
                        Dealer     Bid  Size
0   J.P. Morgan Securities PLC  2.625*  50.0
1   J.P. Morgan Securities PLC    1.5*  10.0
2  Goldman Sachs International  1.375*  25.0


### Extract all auction data for a single year

This is everything, including imports, needed to get all of the data for a single year. To get multiple years, put it in a loop over years. Note that auctions before 2010 are coded differently on the site, so this code will not work for those years.

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

base = "https://www.creditfixings.com/CreditEventAuctions/"
session = requests.Session()

# Read the page that lists all auctions for the year.
html = session.get(base + "AuctionByYear.jsp?year=2023").text
soup = BeautifulSoup(html, "html.parser")
auction_anchors = soup.select('a[href^="holdings.jsp"]')

for auction_anchor in auction_anchors:
    # The anchor text (name and date) is used to name the output files.
    text = auction_anchor.text.strip()

    # Follow the link to the auction's holdings page.
    holdings_url = base + auction_anchor.get("href")
    holdings_html = session.get(holdings_url).text
    holdings_soup = BeautifulSoup(holdings_html, "html.parser")

    # Find the link to the auction results; skip the auction if there is none.
    result_anchors = holdings_soup.select('a[href^="results.jsp"]')
    if not result_anchors:
        continue
    results_url = base + result_anchors[0].get("href")

    # Save each table on the results page to its own csv file.
    lst = pd.read_html(results_url)
    for i, df in enumerate(lst):
        df.to_csv(f"{text}_{i}.csv")
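
To extend this to several years, the year-specific starting URLs can be generated first and the code above run once per URL. The sketch below only builds the URLs. The helper name year_urls and the cutoff constant are assumptions made for illustration; the cutoff is taken from the note above about years before 2010.

```python
BASE = "https://www.creditfixings.com/CreditEventAuctions/"
FIRST_SUPPORTED_YEAR = 2010  # earlier years are coded differently on the site


def year_urls(first, last):
    """Return {year: starting URL} for each year from first to last, inclusive."""
    if first < FIRST_SUPPORTED_YEAR:
        raise ValueError(f"years before {FIRST_SUPPORTED_YEAR} are not supported")
    return {
        year: f"{BASE}AuctionByYear.jsp?year={year}"
        for year in range(first, last + 1)
    }


for year, url in year_urls(2021, 2023).items():
    print(year, url)
```

Each URL can then take the place of the hard-coded 2023 URL in the cell above.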