## Homework 3 Part Two: Scraping
This homework asks you to scrape from three different sources: Supreme Court decisions, rev.com transcripts, and a more complicated version of Shakespeare (a bonus; yay, Shakespeare!!).

Please follow the instructions and do the best you can. For examples, look at the tutorial, my answers for Homework 3 Part One, the Beautiful Soup documentation, and any other Python resource (such as Stack Overflow). As you get further into this assignment, a lot of the trick will be using loops properly and appending information to lists. A good way to use Beautiful Soup carefully is to first use find() to get the first instance of something and search within it, and then use find_all() to get a list of results that you loop through and search within.
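
As a minimal sketch of the find()-then-find_all() pattern described above, the snippet below parses a small made-up HTML string. The table, its `id`, and the cell values are invented for illustration and are not taken from any page in this homework.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch only.
sample_html = """
<div id="fruit">
  <table>
    <tr><th>Name</th><th>Color</th></tr>
    <tr><td>apple</td><td>red</td></tr>
    <tr><td>banana</td><td>yellow</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first match, which narrows the search area.
fruit_div = soup.find(id="fruit")

# find_all() returns a list of matches that we then loop through.
rows = fruit_div.find_all("tr")

table_data = []
for row in rows[1:]:              # skip the header row
    one_row = []
    for cell in row.find_all("td"):
        one_row.append(cell.string)
    table_data.append(one_row)    # one inner list per table row

print(table_data)
```

The inner list is created fresh on every pass through the outer loop, so each row ends up as its own list inside `table_data`.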

If you only get 70% of the stuff done that's great!

In [1]:
#import requests and Beautiful Soup here
import requests
from bs4 import BeautifulSoup

## Supreme Court Decisions 2021 
Now it's time to scrape from reality. The Supreme Court posts its decisions in a format that is not immediately data friendly. They have a simple HTML table with some information about the decision, including a link to a PDF that contains the written decision. We won't mess with those PDFs this week, but we do want to transform their tables into something useful to us. 

We will be scraping this page: 
https://www.supremecourt.gov/opinions/slipopinion/21

*Note:* While you won't see all of the tables for all the months when you go to the page, they are all there in the HTML that you will download and in the HTML source you view (which is the same thing). Definitely do a view source, and study the structure of the HTML tables before you start coding.

You eventually want to end up with a list of lists (rows and then columns) for every decision from the 2021 term. Follow the process, and see how far you get.


Write your lines that use requests to get the page, and a second variable that passes the raw HTML into Beautiful Soup for parsing. Include a third line that prints the HTML in the prettify() way.

In [2]:
#First I scrape the html from the website
my_url = "https://www.supremecourt.gov/opinions/slipopinion/21"
raw_html = requests.get(my_url).content

#Then I save the html on my computer
with open("supreme_court.html", "wb") as file:
    file.write(raw_html)

#Then I load the local copy and parse it (the with block closes the file afterwards)
with open("supreme_court.html", "r", encoding="utf-8") as html_file:
    soup_doc = BeautifulSoup(html_file, "html.parser")

print(soup_doc.prettify())

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head id="ctl00_ctl00_Head1">
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="txt/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <script src="/js/jquery-3.5.1.min.js" type="text/javascript">
  </script>
  <script src="/js/bootstrap.min.js" type="text/javascript">
  </script>
  <link href="/css/font-awesome.min.css" rel="stylesheet" type="text/css"/>
  <link href="/css/bootstrap.min.css" rel="Stylesheet" type="text/css"/>
  <link href="/css/bootstrap-theme.min.css" rel="Stylesheet" type="text/css"/>
  <link href="/styles/newBootStrap2.css" rel="stylesheet" type="text/css"/>
  <!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media queries -->
  <!--[if lt IE 9]>
          <script src="/js/html5shiv.js"></script>
          <script src="/js/respond.min.js"></script>
        <![endif]-->
  <!--[if lt IE 8]>
       

Isolate the first row of data in the HTML table, which holds the case Biden v. Texas (as of 11/9/22 that is the most recent case; these things can update, though!).

In [3]:
# First I create a variable 'june' that contains all the Supreme Court cases in June (id="cell6").
june = soup_doc.find(id="cell6")

# Then I create a variable called 'biden_v_texas' which contains the first judgment in the table
# (Note: this variable will change if a new judgment is added)
biden_v_texas = june.find_all('tr')[1]

Print out each cell of information from that first row. Your output should look like this:


```
66
6/30/22
21-954
Biden v. Texas
 
R
597/2
```

In [4]:
#Then I create the variable 'cells' which contains all elements of the biden_v_texas row. 
cells = biden_v_texas.find_all('td')

# And loop through all the elements and print them.         
for cell in cells:
    print(cell.string)

66
6/30/22
21-954
Biden v. Texas
 
R
597/2


But wait, there is more information hidden inside the tags! Really important information. Find it and print it out like this (still just for this first row):
```
/opinions/21pdf/21-954_7l48.pdf 
 The Government’s rescission of Migrant Protection Protocols did not violate section 1225 of the Immigration and Nationality Act, and the then-Secretary of Homeland Security’s October 29 Memoranda constituted valid final agency action.
 ```

In [5]:
print(biden_v_texas.a['href'])
print(biden_v_texas.a['title'])

/opinions/21pdf/21-954_7l48.pdf
The Government’s rescission of Migrant Protection Protocols did not violate section 1225 of the Immigration and Nationality Act, and the then-Secretary of Homeland Security’s October 29 Memoranda constituted valid final agency action.
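
The `href` printed above is a relative path. Here is a minimal sketch of turning it into a full URL with the standard library's `urljoin`, assuming the link is resolved against the page URL used earlier in this notebook. The variable names are illustrative.

```python
from urllib.parse import urljoin

# The page that was scraped, and the relative link found in the first row.
page_url = "https://www.supremecourt.gov/opinions/slipopinion/21"
relative_href = "/opinions/21pdf/21-954_7l48.pdf"

# urljoin resolves the relative path against the page's scheme and domain.
full_url = urljoin(page_url, relative_href)
print(full_url)
```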


Okay, time to make this useful. Take the information you printed in the last two cells and combine it all into a list. Output the list; it should look like this:
```
['66',
 '6/30/22',
 '21-954',
 'Biden v. Texas',
 '\xa0',
 'R',
 '597/2',
 'The Government’s rescission of Migrant Protection Protocols did not violate section 1225 of the Immigration and Nationality Act, and the then-Secretary of Homeland Security’s October 29 Memoranda constituted valid final agency action.',
 '/opinions/21pdf/21-954_7l48.pdf']
 ```
 

In [6]:
#Combine everything you have coded so far,
#Including finding that first row in the table
#It will be useful for later

#First I create an empty list
bvt_list = []

# Then I take every 'td' in the biden_v_texas variable and append them to the list. 
# Afterwards I take the 'title' and 'href' and append them to the bvt_list list too. 
cells = biden_v_texas.find_all('td')
for cell in cells:
    bvt_list.append(cell.string)
bvt_list.append(biden_v_texas.a['title'])
bvt_list.append(biden_v_texas.a['href'])

bvt_list

['66',
 '6/30/22',
 '21-954',
 'Biden v. Texas',
 '\xa0',
 'R',
 '597/2',
 'The Government’s rescission of Migrant Protection Protocols did not violate section 1225 of the Immigration and Nationality Act, and the then-Secretary of Homeland Security’s October 29 Memoranda constituted valid final agency action.',
 '/opinions/21pdf/21-954_7l48.pdf']
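
Since the same steps get repeated for other rows, one option is to wrap them in a small function. This is only a sketch: `row_to_list` is a hypothetical helper name, and the one-row HTML string below is a trimmed stand-in for the real table (with a shortened placeholder title) so the example runs on its own.

```python
from bs4 import BeautifulSoup

def row_to_list(row):
    """Return the cell strings of one <tr>, followed by the link's title and href."""
    values = [cell.string for cell in row.find_all("td")]
    link = row.find("a")
    if link is not None:          # rows without a link are returned as plain cells
        values.append(link.get("title"))
        values.append(link.get("href"))
    return values

# Trimmed stand-in for one row of the real table (placeholder title).
sample_row_html = """
<table><tr>
  <td>66</td><td>6/30/22</td><td>21-954</td>
  <td><a href="/opinions/21pdf/21-954_7l48.pdf" title="Example holding text.">Biden v. Texas</a></td>
  <td>&nbsp;</td><td>R</td><td>597/2</td>
</tr></table>
"""

sample_row = BeautifulSoup(sample_row_html, "html.parser").find("tr")
print(row_to_list(sample_row))
```

Calling the same function on a different row would avoid copying the loop by hand each time.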

Now, run the exact same code, but for the first row in the third table, April 2022 (still part of the 2021 term). The output should look like this:
```
['28',
 '4/28/22',
 '20-807',
 'LeDure v. Union Pacific Railroad Co.',
 '\xa0',
 'PC',
 '596/1',
 'Judgment affirmed by an equally divided Court.',
 '/opinions/21pdf/20-807_3f14.pdf']
```


In [7]:
april = soup_doc.find(id="cell4")

#Then I create a variable called 'april1' which contains the first judgement in the april table
april1 = april.find_all('tr')[1]

# Then I create an empty list.
april1_list = []

# And take every 'td' in the april1 variable and append them to the list. 
# Afterwards I take the 'title' and 'href' and append them to the april1_list list too. 
cells1 = april1.find_all('td')
for cell in cells1:
    april1_list.append(cell.string)
april1_list.append(april1.a['title'])
april1_list.append(april1.a['href'])

april1_list

['28',
 '4/28/22',
 '20-807',
 'LeDure v. Union Pacific Railroad Co.',
 '\xa0',
 'PC',
 '596/1',
 'Judgment affirmed by an equally divided Court.',
 '/opinions/21pdf/20-807_3f14.pdf']

Great! Now you want to go through all of the rows in that third table, April (but not the header), and get a list of lists with the information for every case in that table.

Note that the code here should be similar to the code above, but you will need to loop through all of the rows in April and collect the info for each row in a new list, which is then appended to a larger list each time the loop finishes (before looping back to the next row).

Your output should look like this:

```
[['28',
  '4/28/22',
  '20-807',
  'LeDure v. Union Pacific Railroad Co.',
  '\xa0',
  'PC',
  '596/1',
  '/opinions/21pdf/20-807_3f14.pdf',
  'Judgment affirmed by an equally divided Court.'],
 ['27',
  '4/28/22',
  '20-219',
  'Cummings v. Premier Rehab Keller',
  '\xa0',
  'R',
  '596/1',
  '/opinions/21pdf/20-219_1b82.pdf',
  'Emotional distress damages are not recoverable in a private action to enforce either the Rehabilitation Act of 1973 or the Affordable Care Act.'],
 ['26',
  '4/21/22',
  '20-1472',
  'Boechler v. Commissioner',
  '\xa0',
  'AB',
  '596/1',
  '/opinions/21pdf/20-1472_6j37.pdf',
  'The 30-day time limit to file a petition for review of a collection due process determination, 26 U. S. C. §6330(d)(1), is a nonjurisdictional deadline subject to equitable tolling.'],
 ['25',
  '4/21/22',
  '20-303',
  'United States v. Vaello Madero',
  '4/28/22',
  'BK',
  '596/1',
  '/opinions/21pdf/20-303_new_21o2.pdf',
  'The Constitution does not require Congress to make Supplemental Security Income benefits available to residents of Puerto Rico.'],
 ['24',
  '4/21/22',
  '20-826',
  'Brown v. Davenport',
  '\xa0',
  'NG',
  '596/1',
  '/opinions/21pdf/20-826_p702.pdf',
  'When a state court has ruled on the merits of a state prisoner’s claim, a federal court cannot grant habeas relief without applying both the test this Court outlined in Brecht v. Abrahamson, 507 U. S. 619, and the one Congress prescribed in the Antiterrorism and Effective Death Penalty Act of 1996; the Sixth Circuit erred in granting habeas relief to Mr. Davenport based solely on its assessment that he could satisfy the Brecht standard.'],
 ['23',
  '4/21/22',
  '20-1566',
  'Cassirer v. Thyssen-Bornemisza Collection Foundation',
  '\xa0',
  'EK',
  '596/1',
  '/opinions/21pdf/20-1566_l5gm.pdf',
  'In a suit raising non-federal claims against a foreign state or instrumentality under the Foreign Sovereign Immunities Act of 1976, a court should determine the substantive law by using the same choice-of-law rule applicable in a similar suit against a private party.'],
 ['22',
  '4/21/22',
  '20-1029',
  'City of Austin v. Reagan National Advertising of Austin, LLC',
  '\xa0',
  'SS',
  '596/1',
  '/opinions/21pdf/20-1029_i42k.pdf',
  'The distinction between on-premises signs and off-premises signs in the City of Austin’s sign code is facially content neutral under the First Amendment.'],
 ['21',
  '4/04/22',
  '20-659',
  'Thompson v. Clark',
  '\xa0',
  'BK',
  '596/1',
  '/opinions/21pdf/20-659_3ea4.pdf',
  'Petitioner Thompson’s showing that his criminal prosecution ended without a conviction satisfies the requirement to demonstrate a favorable termination of a criminal prosecution in a Fourth Amendment claim under 42 U. S. C. §1983 for malicious prosecution; an affirmative indication of innocence is not needed.']]
  
```

In [8]:
# Every row of the April table except the header
aprilall = april.find_all('tr')[1:]

aprilall_list = []

for row in aprilall:
    # Build one list per row: the cell strings, then the link's href and title
    one_row = []
    cells = row.find_all('td')
    for cell in cells:
        one_row.append(cell.string)
    one_row.append(cells[3].a['href'])
    one_row.append(cells[3].a['title'])
    aprilall_list.append(one_row)
aprilall_list

[['28',
  '4/28/22',
  '20-807',
  'LeDure v. Union Pacific Railroad Co.',
  '\xa0',
  'PC',
  '596/1',
  '/opinions/21pdf/20-807_3f14.pdf',
  'Judgment affirmed by an equally divided Court.'],
 ['27',
  '4/28/22',
  '20-219',
  'Cummings v. Premier Rehab Keller',
  '\xa0',
  'R',
  '596/1',
  '/opinions/21pdf/20-219_1b82.pdf',
  'Emotional distress damages are not recoverable in a private action to enforce either the Rehabilitation Act of 1973 or the Affordable Care Act.'],
 ['26',
  '4/21/22',
  '20-1472',
  'Boechler v. Commissioner',
  '\xa0',
  'AB',
  '596/1',
  '/opinions/21pdf/20-1472_6j37.pdf',
  'The 30-day time limit to file a petition for review of a collection due process determination, 26 U. S. C. §6330(d)(1), is a nonjurisdictional deadline subject to equitable tolling.'],
 ['25',
  '4/21/22',
  '20-303',
  'United States v. Vaello Madero',
  '4/28/22',
  'BK',
  '596/1',
  '/opinions/21pdf/20-303_new_21o2.pdf',
  'The Constitution does not require Congress to make Suppl

Finally, go through EVERY table, and get out every row--no headers. So you have all of the 2021 decisions from 66-1 in highly useful list-within-list format.

In [9]:
all_months = soup_doc.find_all(class_="table table-bordered")

all_months

[<table cellpadding="2" cellspacing="0" class="table table-bordered" style="text-align: left;">
 <tr>
 <th scope="col" style="text-align: center;" width="20">R-</th>
 <th scope="col" style="text-align: center;" width="60">Date</th>
 <th scope="col" style="text-align: center;" width="60">Docket</th>
 <th scope="col" style="text-align: center;" width="400">Name</th>
 <th scope="col" style="text-align: center;" width="60">Revised</th>
 <th scope="col" style="text-align: center;" width="20">J.</th>
 <th scope="col" style="text-align: center;" width="40">Pt.</th>
 </tr>
 <tr>
 <td style="text-align: center;">66</td>
 <td style="text-align: center;">6/30/22</td>
 <td style="text-align: center; white-space: nowrap;">21-954</td>
 <td><a href="/opinions/21pdf/21-954_7l48.pdf" target="_blank" title="The Government’s rescission of Migrant Protection Protocols did not violate section 1225 of the Immigration and Nationality Act, and the then-Secretary of Homeland Security’s October 29 Memoranda con

In [10]:
all_months_list = []
for month in all_months:
    # Skip the header row of each monthly table
    all_rows = month.find_all('tr')
    for row in all_rows[1:]:
        one_row = []
        cells = row.find_all('td')
        for cell in cells:
            one_row.append(cell.string)
        one_row.append(cells[3].a['href'])
        one_row.append(cells[3].a['title'])
        all_months_list.append(one_row)
all_months_list

[['66',
  '6/30/22',
  '21-954',
  'Biden v. Texas',
  '\xa0',
  'R',
  '597/2',
  '/opinions/21pdf/21-954_7l48.pdf',
  'The Government’s rescission of Migrant Protection Protocols did not violate section 1225 of the Immigration and Nationality Act, and the then-Secretary of Homeland Security’s October 29 Memoranda constituted valid final agency action.'],
 ['65',
  '6/30/22',
  '20-1530',
  'West Virginia v. EPA',
  '7/13/22',
  'R',
  '597/2',
  '/opinions/21pdf/20-1530_new_l537.pdf',
  'Congress did not grant the Environmental Protection Agency in Section 111(d) of the Clean Air Act the authority to devise emissions caps based on the generation shifting approach the Agency took in the Clean Power Plan.'],
 ['64',
  '6/29/22',
  '21-429',
  'Oklahoma v. Castro-Huerta',
  '\xa0',
  'BK',
  '597/2',
  '/opinions/21pdf/21-429_8o6a.pdf',
  'The Federal Government and the State have concurrent jurisdiction to prosecute crimes committed by non-Indians against Indians in Indian country.'],
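
One way to make a list of lists like this useful is to save it as a CSV file. The sketch below uses Python's built-in `csv` module. The two sample rows are copied from the outputs above, the first seven column names follow the table header on the page, `Link` and `Summary` are labels invented here, and the filename is an arbitrary choice. It also replaces the non-breaking space `'\xa0'` with an empty string, which is an optional cleanup step.

```python
import csv

# Two rows copied from the outputs above; the real list would hold every decision.
sample_rows = [
    ['64', '6/29/22', '21-429', 'Oklahoma v. Castro-Huerta', '\xa0', 'BK', '597/2',
     '/opinions/21pdf/21-429_8o6a.pdf',
     'The Federal Government and the State have concurrent jurisdiction to prosecute '
     'crimes committed by non-Indians against Indians in Indian country.'],
    ['28', '4/28/22', '20-807', 'LeDure v. Union Pacific Railroad Co.', '\xa0', 'PC', '596/1',
     '/opinions/21pdf/20-807_3f14.pdf',
     'Judgment affirmed by an equally divided Court.'],
]

# First seven names come from the table header; the last two are labels chosen for this sketch.
header = ['R-', 'Date', 'Docket', 'Name', 'Revised', 'J.', 'Pt.', 'Link', 'Summary']

# Swap the non-breaking space for an empty string so blank cells are truly blank.
cleaned_rows = [['' if value == '\xa0' else value for value in row] for row in sample_rows]

# newline='' is what the csv module expects when opening a file for writing.
with open('decisions_sample.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(header)
    writer.writerows(cleaned_rows)
```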


## Part Two: Frances Haugen Senate Transcript 
Often there is information that is publicly available but not in a format you want it to be in. That is particularly the case with transcripts. The Web site https://www.rev.com/ publishes some of its transcripts publicly. We are going to start by extracting the dialogue of the transcript from the HTML for Frances Haugen's Senate testimony.

**Please note here:** this is the first time I have assigned scraping from this site. Because it is a commercial site, it may have protocols that try to block scraping. I was able to download the HTML for all of the pages that we need to scrape below with no problem. Once all 18 of you begin trying to scrape these pages, their site might take notice, or it might not. More importantly, **be careful about how many times you run the cell that executes the actual request.** As I discussed in class, once you have completed a request for an HTML page, save the HTML file locally, and then never run the request again (if you are scraping pages that are continually updated that is not an option, but here it is). If you need to take a break and come back to your work, make sure to load the local HTML file that you downloaded rather than running the request again. For the final multi-page scrape it won't make much sense to do that, but you still could.

(I don't anticipate a problem with this site. But I have found that many coders who are new to scraping run requests over and over, far too many times, and eventually get blocked by the site, so it is always something to keep in mind when you are scraping. And please DO NOT select "run all cells" unless you have successfully requested and saved a page locally and then commented out the request. Be careful about running all cells generally, especially while you are first building your code.)
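
The request-once-then-load-locally habit described above can be sketched as a small helper. This is an illustration, not a required part of the assignment: `load_html_once` and its `fetch` parameter are hypothetical names, and the demo at the bottom passes in a fake fetch function so that running the sketch never touches the network.

```python
import os

def load_html_once(url, filename, fetch=None):
    """Return the page HTML as bytes, requesting it only if no local copy exists."""
    if os.path.exists(filename):
        # A saved copy exists, so read it instead of hitting the site again.
        with open(filename, "rb") as saved:
            return saved.read()

    if fetch is None:
        # Default to requests, imported lazily so it is only needed for a real download.
        import requests

        def fetch(page_url):
            return requests.get(page_url).content

    raw_html = fetch(url)
    with open(filename, "wb") as saved:
        saved.write(raw_html)
    return raw_html

# Demo with a fake fetch function (no network traffic); it records each call.
calls = []

def fake_fetch(page_url):
    calls.append(page_url)
    return b"<html><body>example</body></html>"

demo_file = "cache_demo.html"
if os.path.exists(demo_file):
    os.remove(demo_file)          # start the demo from a clean slate

first = load_html_once("https://example.com/page", demo_file, fetch=fake_fetch)
second = load_html_once("https://example.com/page", demo_file, fetch=fake_fetch)
print(len(calls))                 # number of times the fake fetch actually ran
```

With a helper like this, re-running a cell (or even "run all cells") reads the saved file rather than sending another request.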

First, go to this page and take a look at what you're contending with:

https://www.rev.com/blog/transcripts/facebook-whistleblower-frances-haugen-testifies-on-children-social-media-use-full-senate-hearing-transcript



Step 1: In the next few cells, use requests to download the HTML. 

Save the downloaded HTML locally and then load in the local file. 

Then run that locally-loaded HTML through Beautiful Soup to parse it. 

Then print the prettify() version of that downloaded HTML.

In [11]:
#First I scrape the html from the website
my_url2 = "https://www.rev.com/blog/transcripts/facebook-whistleblower-frances-haugen-testifies-on-children-social-media-use-full-senate-hearing-transcript"
raw_html2 = requests.get(my_url2).content

In [12]:
#Then I save the html on my computer
with open("haugen_transcript.html", "wb") as file:
    file.write(raw_html2)

In [13]:
html_file2 = open("haugen_transcript.html", "r", encoding="utf-8")

In [14]:
soup_doc2 = BeautifulSoup(html_file2, "html.parser")
print(soup_doc2.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <script>
   if(navigator.userAgent.match(/MSIE|Internet Explorer/i)||navigator.userAgent.match(/Trident\/7\..*?rv:11/i)){var href=document.location.href;if(!href.match(/[?&]nowprocket/)){if(href.indexOf("?")==-1){if(href.indexOf("#")==-1){document.location.href=href+"?nowprocket=1"}else{document.location.href=href.replace("#","?nowprocket=1#")}}else{if(href.indexOf("#")==-1){document.location.href=href+"&nowprocket=1"}else{document.location.href=href.replace("#","&nowprocket=1#")}}}}
  </script>
  <script>
   class RocketLazyLoadScripts{constructor(e){this.triggerEvents=e,this.eventOptions={passive:!0},this.userEventListener=this.triggerListener.bind(this),this.delayedScripts={normal:[],async:[],defer:[]},this.allJQueries=[]}_addUserInteractionListener(e){this.triggerEvents.forEach((t=>window.addEventListener(t,e.userEventListener,e.eventOptions)))}_removeUserInteractionListener(e){this.triggerEvents.forEach((t=>wi

Step 2: Get the text of the first line of dialogue:

`'Mr. Chairman Blumenthal: (00:04)\n[crosstalk 00:00:04].'`

In [15]:
frances = soup_doc2.find(class_='fl-callout-text')

frances.find('p').get_text()

'Mr. Chairman Blumenthal: (00:04)\n[crosstalk 00:00:04].'

Next, get the text of the second line:

`'Mr. Chairman Blumenthal: (02:54)\nSo, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits ahead of people.'`

In [16]:
frances = soup_doc2.find(class_='fl-callout-text')

frances.find_all('p')[1].get_text()

'Mr. Chairman Blumenthal: (02:54)\nSo, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits ahead of people.'

Great! But that text is not nicely structured. Using this second line of dialogue, get the speaker's name, the time, and the text of the speech individually. Use print to print them individually along with the strings "speaker:", "time:", and "text:" before each of them, like this:

`
speaker: Mr. Chairman Blumenthal: (
time: 02:54
text: 
So, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits ahead of people.
`

In [17]:
speaker1 = frances.find_all('p')[1]

print("speaker: " + speaker1.a.previous + " time: " + speaker1.a.next + " text: " + speaker1.a.next.next.next.next)

speaker: Mr. Chairman Blumenthal: ( time: 02:54 text: 
So, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits ahead of people.


Now run the exact same code but use Python to remove that `(` after the speaker's name, so your printout looks like this:

`speaker: Mr. Chairman Blumenthal: 
time: 02:54
text: 
So, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits ahead of people.`

In [18]:
print("speaker: " + speaker1.a.previous.get_text(strip=True).strip('(') + "time: " + speaker1.a.next + " text: " + speaker1.a.next.next.next.next)

speaker: Mr. Chairman Blumenthal: time: 02:54 text: 
So, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits ahead of people.
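
An alternative to stepping through `.next` and `.previous` is to take the whole paragraph's `get_text()` string and split it with a regular expression. This is a different technique from the one in the cells above, offered as a sketch. The pattern assumes every line has the shape `Speaker: (time)` followed by a newline and the text, which matches the examples shown here but may not hold for every paragraph on the page. The sample string is a shortened version of the second line of dialogue.

```python
import re

# Shortened copy of the second line of dialogue, as returned by get_text().
line = 'Mr. Chairman Blumenthal: (02:54)\nSo, welcome my colleagues.'

# speaker = everything before the "(", time = digits and colons inside the parentheses,
# text = everything after the ")" (re.DOTALL lets "." match the newline too).
pattern = re.compile(r'^(?P<speaker>.+?)\((?P<time>[\d:]+)\)(?P<text>.*)$', re.DOTALL)

match = pattern.match(line)
if match:
    entry = match.groupdict()     # dictionary with keys speaker, time, and text
    print("speaker:", entry['speaker'])
    print("time:", entry['time'])
    print("text:", entry['text'])
```

Because the `(` sits outside the `speaker` group, no separate step is needed to strip it.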


Now, modify that code so you iterate through every line of speech and print out each line the same way, also printing a delimiter between each instance of dialogue. Like this:

`speaker: Mr. Chairman Blumenthal: 
time: 00:04
text: 
[crosstalk 00:00:04].
###################
speaker: Mr. Chairman Blumenthal: 
time: 02:54
text: 
So, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits ahead of people.
###################
speaker: Mr. Chairman Blumenthal: 
time: 04:09
text: 
Among other revelations, the information that you have provided to Congress is powerful proof that Facebook knew its products were harming teenagers. Facebook exploited teens using powerful algorithms that amplified their insecurities and abuses through what it found was an addict’s narrative. There is a question, which I hope you will discuss, as to whether there is such a thing as a safe algorithm. Facebook saw teens creating secret accounts that are often hidden from their parents as unique value proposition. In their words, a unique value proposition. A way to drive out numbers for advertisers and shareholders at the expense of safety, and it doubled down on targeting children pushing products on pre-teens not just teens, but pre-teens that it knows are harmful to our kids’ mental health and wellbeing.
###################
speaker: Mr. Chairman Blumenthal: 
time: 05:21
text: 
Instead of telling parents, Facebook concealed the facts, it sought to stonewall and block this information from becoming public, including to this committee when Senator Blackburn and I specifically asked the company. And still, even now, as of just last Thursday, when a Facebook witness came before this committee, it has refused this disclosure or even to tell us when it might decide whether to disclose additional documents. And they’ve continued their tactics, even after they knew the disruption it caused it. Isn’t just that they made money from these practices, but they continued to profit from them. Their profit was more important than the pain that they caused.
###################
speaker: Mr. Chairman Blumenthal: 
time: 06:14
text: 
Last Thursday, the message from Ms. Antigone Davis, Facebook’s Global Head of Safety was simple, “This research is not a bombshell.” and she repeated the line, not a bombshell. Well, this research is the very definition of a bombshell. Facebook and big tech are facing a big tobacco moment, a moment of reckoning, the parallel is striking. I sued big tobacco as Connecticut’s attorney general, I helped to lead the states in that legal action and I remember very, very well, the moment in the course of our litigation when we learned of those files that showed not only that big tobacco knew that its product caused cancer but that they had done the research, they concealed the files, and now we knew and the world knew. And big tech now faces that big tobacco jaw dropping moment of truth. It is documented proof that Facebook knows its products can be addictive and toxic to children. And it’s not just that they made money, again, it’s that they valued their profit more than the pain that they caused to children and their families.
###################`

And so forth to the end

In [19]:
speakers = frances.find_all('p')

for speaker in speakers: 
    print("speaker: " + speaker.a.previous.get_text(strip=True).strip('(') + "time: " + speaker.a.next + " text: " + speaker.a.next.next.next.next + "###################")

speaker: Mr. Chairman Blumenthal: time: 00:04 text: 
[crosstalk 00:00:04].###################
speaker: Mr. Chairman Blumenthal: time: 02:54 text: 
So, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits 

Finally, make that useful!!!

Instead of printing, make an empty list before the loop starts. Then inside the loop make a dictionary with three keys (speaker, time, and text) and append each dictionary to the list as you loop through. The output should look like this:

`[{'speaker': 'Mr. Chairman Blumenthal: ',
  'time': '00:04',
  'text': '\n[crosstalk 00:00:04].'},
 {'speaker': 'Mr. Chairman Blumenthal: ',
  'time': '02:54',
  'text': '\nSo, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how Facebook has put profits ahead of people.'},
 {'speaker': 'Mr. Chairman Blumenthal: ',
  'time': '04:09',
  'text': '\nAmong other revelations, the information that you have provided to Congress is powerful proof that Facebook knew its products were harming teenagers. Facebook exploited teens using powerful algorithms that amplified their insecurities and abuses through what it found was an addict’s narrative. There is a question, which I hope you will discuss, as to whether there is such a thing as a safe algorithm. Facebook saw teens creating secret accounts that are often hidden from their parents as unique value proposition. In their words, a unique value proposition. A way to drive out numbers for advertisers and shareholders at the expense of safety, and it doubled down on targeting children pushing products on pre-teens not just teens, but pre-teens that it knows are harmful to our kids’ mental health and wellbeing.'},
 {'speaker': 'Mr. Chairman Blumenthal: ',
  'time': '05:21',
  'text': '\nInstead of telling parents, Facebook concealed the facts, it sought to stonewall and block this information from becoming public, including to this committee when Senator Blackburn and I specifically asked the company. And still, even now, as of just last Thursday, when a Facebook witness came before this committee, it has refused this disclosure or even to tell us when it might decide whether to disclose additional documents. And they’ve continued their tactics, even after they knew the disruption it caused it. Isn’t just that they made money from these practices, but they continued to profit from them. Their profit was more important than the pain that they caused.'},`

All the way to the end:



`,
{'speaker': 'Mr. Chairman Blumenthal: ',
  'time': '03:20:20',
  'text': '\nThe record will remain open for two weeks. Any senators who want to submit questions for the record should do so by October 19th, this hearing is adjourned.'},
 {'speaker': 'Miss Frances Haugen: ',
  'time': '03:20:35',
  'text': '\nThank you. [crosstalk 03:20:35] (silence).'},
 {'speaker': 'Speaker 1: ',
  'time': '03:21:30',
  'text': '\n[inaudible 03:21:30].'},
 {'speaker': 'Miss Frances Haugen: ',
  'time': '03:21:42',
  'text': '\n[inaudible 03:21:42].'},
 {'speaker': 'Mr. Chairman Blumenthal: ',
  'time': '03:22:20',
  'text': '\nThank you. [inaudible 03:22:20] See you tomorrow.'}]`
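As a rough illustration of the "empty list, then one dictionary per loop" pattern described above, here is a minimal sketch. The markup in `SAMPLE_HTML` is invented for the example and is only an assumption about how a speaker paragraph might be laid out; the real transcript page may be structured differently, and `paragraphs_to_records` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE_HTML = """
<div id="transcript">
  <p>Speaker A: (<a href="#t=4">00:04</a>)<br/>Hello there.</p>
  <p>Speaker B: (<a href="#t=9">00:09</a>)<br/>Hi, thanks for coming.</p>
  <p>A paragraph with no timestamp link.</p>
</div>
"""


def paragraphs_to_records(html):
    """Turn each speaker paragraph into a dict with speaker, time and text."""
    soup = BeautifulSoup(html, "html.parser")
    records = []  # the empty list, created before the loop starts
    for p in soup.find_all("p"):
        link = p.find("a")
        line_break = p.find("br")
        if link is None or line_break is None:
            continue  # skip paragraphs that lack the expected pieces
        record = {
            # first child of <p> is the text before the link, e.g. "Speaker A: ("
            "speaker": str(p.contents[0]).rstrip("( "),
            "time": link.get_text(strip=True),
            # the spoken text is whatever follows the <br/> tag
            "text": str(line_break.next_sibling or "").strip(),
        }
        records.append(record)
    return records


records = paragraphs_to_records(SAMPLE_HTML)
print(records)
```

Checking for `None` before reaching into a tag keeps one odd paragraph from stopping the whole loop with an `AttributeError`.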

In [20]:
speakers1 = []
for speaker in speakers:
    speak = {}
    speak['speaker'] = speaker.a.previous.get_text(strip=True).strip('(')
    speak['time'] = speaker.a.next
    speak['text'] = speaker.a.next.next.next.next
    speakers1.append(speak)
speakers1  

[{'speaker': 'Mr. Chairman Blumenthal: ',
  'time': '00:04',
  'text': '\n[crosstalk 00:00:04].'},
 {'speaker': 'Mr. Chairman Blumenthal: ',
  'time': '02:54',
  'text': '\nSo, welcome my colleagues and I want to thank Ranking Member Senator Blackburn for her cooperation and collaboration, we’ve been working very closely. And the ranking member who is here, Senator Wicker, as well as our Chairwoman Maria Cantwell. Senator Cantwell, I’m sure will be here shortly. Most important, I’d like to thank our witness Frances Haugen for being here and the two council who are representing her [inaudible 00:03:27] my heartfelt gratitude for your courage and strength in coming forward. As you have done standing up to one of the most powerful impactful corporate giants in the history of the world, without any exaggeration. You have a compelling, credible voice, which we’ve heard already. But you are not here alone, you’re armed with documents and evidence, and you speak volumes, as they do, about how

## Part Two (Section Two): More Transcripts!!!
Now let's start collecting their table of contents, so we can get the transcript information as well as the links to the transcripts that they are making publicly available. We are going to start with the second page of the table of contents:

https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/2

Why page 2? Because page one has its own special URL, but all subsequent pages share the exact same URL except for the number right at the end, such as `/2` (and, in fact, `/1` works to get the first page).

When we are scraping across pages (which you will attempt a little further down this notebook), it is super helpful if you can find a consistent pattern for numbering pages, so that you can loop through URLs without any other difficult work (like scraping page link URLs).

So take a look at page 2 and try to figure out how their contents and links are held in the HTML (this may be challenging; you really need to think about structure here).



First, do your request, save it locally, load in the local file, and run it through Beautiful Soup's html parser.
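One way to think about the "request, save locally, load the local file" routine is as a small cache: download only when the file is not already on disk. The sketch below is an assumption-laden illustration, not part of the assignment. `load_cached_html` and `fake_download` are hypothetical names, and the download step is passed in as a function so the example runs offline.

```python
import tempfile
from pathlib import Path


def load_cached_html(path, download):
    """Return the bytes stored at path, calling download() only if the file is missing."""
    cache_file = Path(path)
    if not cache_file.exists():
        # hit the network once, then reuse the saved copy on later runs
        cache_file.write_bytes(download())
    return cache_file.read_bytes()


# Offline demonstration with a stand-in for the real download.
# With requests, the stand-in could be: lambda: requests.get(some_url).content
calls = []


def fake_download():
    calls.append(1)  # record each "network" call
    return b"<html><body><p>hello</p></body></html>"


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "page.html"
    first = load_cached_html(target, fake_download)
    second = load_cached_html(target, fake_download)

print(len(calls))  # how many times the download actually ran
```

Working from the saved file means you can rerun the parsing cells as often as you like without requesting the page again.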

In [21]:
my_url3 = "https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/2"
raw_html3 = requests.get(my_url3).content

In [22]:
with open("testimony_transcript.html", "wb") as file:
    file.write(raw_html3)

In [23]:
html_file3 = open("testimony_transcript.html", "r", encoding="utf-8")

In [24]:
soup_doc3 = BeautifulSoup(html_file3, "html.parser")
print(soup_doc3.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <script>
   if(navigator.userAgent.match(/MSIE|Internet Explorer/i)||navigator.userAgent.match(/Trident\/7\..*?rv:11/i)){var href=document.location.href;if(!href.match(/[?&]nowprocket/)){if(href.indexOf("?")==-1){if(href.indexOf("#")==-1){document.location.href=href+"?nowprocket=1"}else{document.location.href=href.replace("#","?nowprocket=1#")}}else{if(href.indexOf("#")==-1){document.location.href=href+"&nowprocket=1"}else{document.location.href=href.replace("#","&nowprocket=1#")}}}}
  </script>
  <script>
   class RocketLazyLoadScripts{constructor(e){this.triggerEvents=e,this.eventOptions={passive:!0},this.userEventListener=this.triggerListener.bind(this),this.delayedScripts={normal:[],async:[],defer:[]},this.allJQueries=[]}_addUserInteractionListener(e){this.triggerEvents.forEach((t=>window.addEventListener(t,e.userEventListener,e.eventOptions)))}_removeUserInteractionListener(e){this.triggerEvents.forEach((t=>wi

Extract the first transcript's description:

`'Attorney General Merrick Garland testifies on DOJ budget in Senate hearing 4/26/22 Transcript'`

In [25]:
transcripts = soup_doc3.find_all(class_='fl-post-column')

merrick = transcripts[0]
merrick.strong.string


'Attorney General Merrick Garland testifies on DOJ budget in Senate hearing 4/26/22 Transcript'

Extract the first transcript's time label (this isn't so useful, but it could be, so why not!):

`'7 months ago'`

In [26]:
merrick.small.string

'7 months ago'

Extract the link to that transcript!

`'https://www.rev.com/blog/transcripts/attorney-general-merrick-garland-testifies-on-doj-budget-in-senate-hearing-4-26-22-transcript'`

In [27]:
merrick.a['href']

'https://www.rev.com/blog/transcripts/attorney-general-merrick-garland-testifies-on-doj-budget-in-senate-hearing-4-26-22-transcript'

Now, a leap: get all of the transcript details (description, time, link) on the page, and print them out, like so:

`Attorney General Merrick Garland testifies on DOJ budget in Senate hearing 4/26/22 Transcript
7 months ago
https://www.rev.com/blog/transcripts/attorney-general-merrick-garland-testifies-on-doj-budget-in-senate-hearing-4-26-22-transcript
############
Crypto CEOs Testify Before House Financial Services Hearing Transcript
11 months ago
https://www.rev.com/blog/transcripts/crypto-ceos-testify-before-house-financial-services-hearing-transcript
############
Senate Hearing Oversight Capitol Police After January 6 Attack: Transcript
11 months ago
https://www.rev.com/blog/transcripts/senate-hearing-oversight-capitol-police-after-january-6-attack-transcript
############
Janet Yellen & Jerome Powell Testimony on Economy Full Hearing Transcript November 30
11 months ago
https://www.rev.com/blog/transcripts/janet-yellen-jerome-powell-testimony-on-economy-full-hearing-transcript-november-30
############
Dr. Fauci & Dr. Walensky Testify on COVID-19 Response, Vaccines Full Senate Hearing Transcript
1 year ago
https://www.rev.com/blog/transcripts/dr-fauci-dr-walensky-testify-on-covid-19-response-vaccines-full-senate-hearing-transcript
############
AG Merrick Garland Testifies on Justice Department Oversight: Full Senate Hearing Transcript
1 year ago
https://www.rev.com/blog/transcripts/ag-merrick-garland-testifies-on-justice-department-oversight-full-senate-hearing-transcript
############
Tom Cotton Tells AG Merrick Garland He “Should Resign in Disgrace” Transcript
1 year ago
https://www.rev.com/blog/transcripts/tom-cotton-tells-ag-merrick-garland-he-should-resign-in-disgrace-transcript
############
Facebook Whistleblower Frances Haugen Testifies Before UK Parliament Transcript
1 year ago
https://www.rev.com/blog/transcripts/facebook-whistleblower-frances-haugen-testifies-before-uk-parliament-transcript
############
Facebook Whistleblower Frances Haugen Testifies on Children & Social Media Use: Full Senate Hearing Transcript
1 year ago
https://www.rev.com/blog/transcripts/facebook-whistleblower-frances-haugen-testifies-on-children-social-media-use-full-senate-hearing-transcript
############
Facebook Head of Safety Testimony on Mental Health Effects: Full Senate Hearing Transcript
1 year ago
https://www.rev.com/blog/transcripts/facebook-head-of-safety-testimony-on-mental-health-effects-full-senate-hearing-transcript
############
Janet Yellen & Jerome Powell Testimony on Pandemic Economic Recovery Full Hearing Transcript September 30
1 year ago
https://www.rev.com/blog/transcripts/janet-yellen-jerome-powell-testimony-on-economic-recovery-full-hearing-transcript-sept-30
############
Military Leaders, Gen. Milley Testify on Afghanistan Exit: Full House Hearing Transcript September 29
1 year ago
https://www.rev.com/blog/transcripts/military-leaders-gen-milley-testify-on-afghanistan-exit-full-house-hearing-transcript-september-29
############
Matt Gaetz Exchange with Military Leaders Transcript: Hearing on Afghanistan Exit
1 year ago
https://www.rev.com/blog/transcripts/matt-gaetz-exchange-with-military-leaders-transcript-hearing-on-afghanistan-exit
############
Senate Hearing on Texas Abortion Law Transcript
1 year ago
https://www.rev.com/blog/transcripts/senate-hearing-on-texas-abortion-law-transcript
############`

In [28]:
# I create a loop and print the description, time, and link for each transcript,
# one per line, followed by a separator row
for transcript in transcripts:
    print(transcript.strong.string, transcript.small.string, transcript.a['href'], "############", sep="\n")

Attorney General Merrick Garland testifies on DOJ budget in Senate hearing 4/26/22 Transcript
7 months ago
https://www.rev.com/blog/transcripts/attorney-general-merrick-garland-testifies-on-doj-budget-in-senate-hearing-4-26-22-transcript
############
Crypto CEOs Testify Before House Financial Services Hearing Transcript
11 months ago
https://www.rev.com/blog/transcripts/crypto-ceos-testify-before-house-financial-services-hearing-transcript
############
Senate Hearing Oversight Capitol Police After January 6 Attack: Transcript
11 months ago
https://www.rev.com/blog/transcripts/senate-hearing-oversight-capitol-police-after-january-6-attack-transcript
############
Janet Yellen & Jerome Powell Testimony on Economy Full Hearing Transcript November 30
12 months ago
https://www.rev.com/blog/transcripts/janet-yellen-jerome-powell-testimony-on-economy-full-hearing-transcript-november-30
############
Dr. Fauci & Dr. Walensky Testify on COVID-19 Response, Vaccines Full Senate Hear

Now, make that into a list of lists or a list of dictionaries--up to you!

Either:

`[['Attorney General Merrick Garland testifies on DOJ budget in Senate hearing 4/26/22 Transcript',
  '7 months ago',
  'https://www.rev.com/blog/transcripts/attorney-general-merrick-garland-testifies-on-doj-budget-in-senate-hearing-4-26-22-transcript'],
 ['Crypto CEOs Testify Before House Financial Services Hearing Transcript',
  '11 months ago',
  'https://www.rev.com/blog/transcripts/crypto-ceos-testify-before-house-financial-services-hearing-transcript'],
 ['Senate Hearing Oversight Capitol Police After January 6 Attack: Transcript',
  '11 months ago',
  'https://www.rev.com/blog/transcripts/senate-hearing-oversight-capitol-police-after-january-6-attack-transcript'],`
  
  and so forth

Or:

`[{'content': 'Attorney General Merrick Garland testifies on DOJ budget in Senate hearing 4/26/22 Transcript',
  'time': '7 months ago',
  'link': 'https://www.rev.com/blog/transcripts/attorney-general-merrick-garland-testifies-on-doj-budget-in-senate-hearing-4-26-22-transcript'},
 {'content': 'Crypto CEOs Testify Before House Financial Services Hearing Transcript',
  'time': '11 months ago',
  'link': 'https://www.rev.com/blog/transcripts/crypto-ceos-testify-before-house-financial-services-hearing-transcript'},
 {'content': 'Senate Hearing Oversight Capitol Police After January 6 Attack: Transcript',
  'time': '11 months ago',
  'link': 'https://www.rev.com/blog/transcripts/senate-hearing-oversight-capitol-police-after-january-6-attack-transcript'},
 {'content': 'Janet Yellen & Jerome Powell Testimony on Economy Full Hearing Transcript November 30',
  'time': '11 months ago',
  'link': 'https://www.rev.com/blog/transcripts/janet-yellen-jerome-powell-testimony-on-economy-full-hearing-transcript-november-30'},`
  
  and so forth...
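When building either structure, it can help to guard against listings that are missing a piece, since `.strong`, `.small`, or `.a` return `None` when the tag is absent, and `.string` returns `None` when a tag has more than one child. The sketch below uses invented markup (the `card` class and `example.com` links are assumptions, not the real page), and `text_or_none` and `cards_to_dicts` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the second card has no <small> tag.
SAMPLE_HTML = """
<div class="card"><a href="https://example.com/one"><strong>First title</strong></a><small>2 days ago</small></div>
<div class="card"><a href="https://example.com/two"><strong>Second title</strong></a></div>
"""


def text_or_none(tag):
    """Return the tag's stripped text, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def cards_to_dicts(html):
    soup = BeautifulSoup(html, "html.parser")
    listings = []  # one dictionary per listing
    for card in soup.find_all(class_="card"):
        link = card.find("a")
        listings.append({
            "content": text_or_none(card.find("strong")),
            "time": text_or_none(card.find("small")),
            "link": link.get("href") if link is not None else None,
        })
    return listings


listings = cards_to_dicts(SAMPLE_HTML)
print(listings)
```

A missing value shows up as `None` in the result instead of raising an error partway through the page.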

In [29]:
transcript = []
for scripts in transcripts:
    rows = {}
    rows['description'] = scripts.strong.string
    rows['time'] = scripts.small.string
    rows['link'] = scripts.a['href']
    transcript.append(rows)
transcript


[{'description': 'Attorney General Merrick Garland testifies on DOJ budget in Senate hearing 4/26/22 Transcript',
  'time': '7 months ago',
  'link': 'https://www.rev.com/blog/transcripts/attorney-general-merrick-garland-testifies-on-doj-budget-in-senate-hearing-4-26-22-transcript'},
 {'description': 'Crypto CEOs Testify Before House Financial Services Hearing Transcript',
  'time': '11 months ago',
  'link': 'https://www.rev.com/blog/transcripts/crypto-ceos-testify-before-house-financial-services-hearing-transcript'},
 {'description': 'Senate Hearing Oversight Capitol Police After January 6 Attack: Transcript',
  'time': '11 months ago',
  'link': 'https://www.rev.com/blog/transcripts/senate-hearing-oversight-capitol-police-after-january-6-attack-transcript'},
 {'description': 'Janet Yellen & Jerome Powell Testimony on Economy Full Hearing Transcript November 30',
  'time': '12 months ago',
  'link': 'https://www.rev.com/blog/transcripts/janet-yellen-jerome-powell-testimony-on-economy

## Part Two (Section Three): Multiple Pages!!!
Now, we are going to use the exact same code you used above but on multiple pages. This means making a loop that goes from page to page and executes the exact same process. We will only scrape pages 1 through 4.

First, using Python, figure out how to make a loop that changes the URL so that the number at the end of the URL shifts from 1 to 4. You should just print out the following:

`https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/1
https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/2
https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/3
https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/4`

In [30]:
url = "https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/"

# range(1, 5) yields 1, 2, 3, 4, so no separate counter is needed
for num in range(1, 5):
    print(f"{url}{num}")

https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/1
https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/2
https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/3
https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/4


Now, take a deep breath, big leap here:

Make a loop that loops through those page urls:

In each loop you need to:

    make the URL with the proper number (like above)

    use requests to download the page with that URL

    (skip the save locally part if you want)

    parse the page with Beautiful Soup

    extract each listing on the page: content, link, time

    append that info to a master list 
    (make sure the master list is declared 
    before the loop starts)

Your resulting list should have 56 elements, and (unless they update the site between 11/9 and the time you are working on this) it should begin:

`['Jan. 6 Committee Releases Testimony On Lines Cut From Trump’s Speech The Day After Capitol Riot Transcript',
  '4 months ago',
  'https://www.rev.com/blog/transcripts/jan-6-committee-releases-testimony-on-lines-cut-from-trumps-speech-the-day-after-capitol-riot-transcript'],
 ['Full Jan. 6 Committee Hearing – Day 8 Transcript',
  '4 months ago',
  'https://www.rev.com/blog/transcripts/full-jan-6-committee-hearing-day-8-transcript'],
 ['Jan. 6 Committee Hearing – Day 7 Transcript',
  '4 months ago',
  'https://www.rev.com/blog/transcripts/jan-6-committee-hearing-day-7-transcript'],`
  
and end:
  
`['Elizabeth Warren Questions JPMorgan Chase CEO Jamie Dimon on Overdraft Fees Transcript',
  '1 year ago',
  'https://www.rev.com/blog/transcripts/elizabeth-warren-questions-jpmorgan-chase-ceo-jamie-dimon-on-overdraft-fees-transcript'],
 ['DHS Secretary Alejandro Mayorkas Testimony on Immigration, Southern Border Transcript',
  '1 year ago',
  'https://www.rev.com/blog/transcripts/dhs-secretary-alejandro-mayorkas-testimony-on-immigration-southern-border-transcript'],
 ['Dr. Fauci, CDC Director Testify Before Senate on COVID-19 Guidelines Transcript',
  '1 year ago',
  'https://www.rev.com/blog/transcripts/dr-fauci-cdc-director-testify-before-senate-on-covid-19-guidelines-transcript']]`
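The steps listed above can be sketched as a single function. This is only an outline under stated assumptions: `scrape_pages`, `fake_download`, and `fake_extract` are hypothetical names, and the download and extraction steps are passed in as functions so the example runs without touching the network. In the notebook, those two roles would be played by `requests` and your Beautiful Soup code.

```python
import time

BASE_URL = "https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/"


def scrape_pages(page_numbers, download, extract, pause_seconds=0.0):
    """Visit each page, extract its rows, and collect them in one master list."""
    master_list = []  # declared before the loop starts
    for number in page_numbers:
        html = download(f"{BASE_URL}{number}")  # build the URL, then fetch it
        master_list.extend(extract(html))       # add this page's rows to the total
        time.sleep(pause_seconds)               # optional pause between requests
    return master_list


# Offline stand-ins for the real download and parsing steps.
def fake_download(url):
    return url  # pretend the "page" is just its own URL


def fake_extract(html):
    page = html.rsplit("/", 1)[-1]
    return [[f"title {page}-{i}", "some time ago", f"{html}#item{i}"] for i in range(2)]


rows = scrape_pages(range(1, 5), fake_download, fake_extract)
print(len(rows))
```

Setting `pause_seconds` to a small positive value is a courtesy to the site when the real download function is plugged in.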


In [31]:
master_list = []

url = "https://www.rev.com/blog/transcript-category/congressional-testimony-hearing-transcripts/page/"

# Loop through pages 1 to 4, building each URL from the page number
for num in range(1, 5):
    my_url = f"{url}{num}"
    raw_html = requests.get(my_url).content
    speech_doc = BeautifulSoup(raw_html, "html.parser")
    # Each listing on the page becomes one [description, time, link] row
    for grid in speech_doc.find_all(class_='fl-post-grid-post'):
        grid_list = []
        grid_list.append(grid.strong.string)
        grid_list.append(grid.small.string)
        grid_list.append(grid.a['href'])
        master_list.append(grid_list)

# A bare expression in the middle of a cell displays nothing; wrap it in print() to see the count
len(master_list)
print("*********")
master_list

*********


[['Jan. 6 Committee Releases Testimony On Lines Cut From Trump’s Speech The Day After Capitol Riot Transcript',
  '4 months ago',
  'https://www.rev.com/blog/transcripts/jan-6-committee-releases-testimony-on-lines-cut-from-trumps-speech-the-day-after-capitol-riot-transcript'],
 ['Full Jan. 6 Committee Hearing – Day 8 Transcript',
  '4 months ago',
  'https://www.rev.com/blog/transcripts/full-jan-6-committee-hearing-day-8-transcript'],
 ['Jan. 6 Committee Hearing – Day 7 Transcript',
  '4 months ago',
  'https://www.rev.com/blog/transcripts/jan-6-committee-hearing-day-7-transcript'],
 ['Day 6 of Jan. 6 committee hearings 6/28/22 Transcript',
  '5 months ago',
  'https://www.rev.com/blog/transcripts/day-6-of-jan-6-committee-hearings-6-28-22-transcript'],
 ['Former top Meadows aide to testify in unexpected Jan. 6 committee hearing 6/28/22 Transcript',
  '5 months ago',
  'https://www.rev.com/blog/transcripts/former-top-meadows-aide-to-testify-in-unexpected-jan-6-committee-hearing-6-28-22-

## Real Shakespeare: Extra Credit
Haven't done enough scraping? You're in luck!!!

The Folger Shakespeare Library has HTML versions of their Shakespeare publicly available, but in a terrible HTML format. If you want to challenge yourself, try pulling out the first 100 lines of Twelfth Night, available here:

http://floatingmedia.com/columbia/FolgerShakes/TN.html

The final output should resemble what you see below. Each of these lines contains three elements:

1) a code for act.scene.line, along with whether it is a stage direction
2) the speaker or the last person who spoke prior to the stage direction
3) a line or stage direction.

`
line-SD 1.1.0	NOSPEAKER	Enter Orsino, Duke of Illyria, Curio, and other Lords,
line-SD 1.1.0	NOSPEAKER	with
line-SD 1.1.0	NOSPEAKER	 Musicians playing.
line-1.1.1	ORSINO	If music be the food of love, play on.
line-1.1.2	ORSINO	Give me excess of it, that, surfeiting,
line-1.1.3	ORSINO	The appetite may sicken and so die.
line-1.1.4	ORSINO	That strain again! It had a dying fall.
line-1.1.5	ORSINO	O, it came o’er my ear like the sweet sound
line-1.1.6	ORSINO	That breathes upon a bank of violets,
line-1.1.7	ORSINO	Stealing and giving odor. Enough; no more.
line-1.1.8	ORSINO	’Tis not so sweet now as it was before.
line-1.1.9	ORSINO	O spirit of love, how quick and fresh art thou,
line-1.1.10	ORSINO	That, notwithstanding thy capacity
line-1.1.11	ORSINO	Receiveth as the sea, naught enters there,
line-1.1.12	ORSINO	Of what validity and pitch soe’er,
line-1.1.13	ORSINO	But falls into abatement and low price
line-1.1.14	ORSINO	Even in a minute. So full of shapes is fancy
line-1.1.15	ORSINO	That it alone is high fantastical.
line-1.1.16	CURIO	Will you go hunt, my lord?
line-1.1.17	ORSINO	What, Curio?
line-1.1.18	CURIO	The hart.
line-1.1.19	ORSINO	Why, so I do, the noblest that I have.
line-1.1.20	ORSINO	O, when mine eyes did see Olivia first,
line-1.1.21	ORSINO	Methought she purged the air of pestilence.
line-1.1.22	ORSINO	That instant was I turned into a hart,
line-1.1.23	ORSINO	And my desires, like fell and cruel hounds,
line-1.1.24	ORSINO	E’er since pursue me.
line-SD 1.1.24.1	ORSINO	Enter Valentine.
line-1.1.25	ORSINO	How now, what news from her?
line-1.1.26	VALENTINE	So please my lord, I might not be admitted,
line-1.1.27	VALENTINE	But from her handmaid do return this answer:
line-1.1.28	VALENTINE	The element itself, till seven years’ heat,
line-1.1.29	VALENTINE	Shall not behold her face at ample view,
line-1.1.30	VALENTINE	But like a cloistress she will veilèd walk,
line-1.1.31	VALENTINE	And water once a day her chamber round
line-1.1.32	VALENTINE	With eye-offending brine—all this to season
line-1.1.33	VALENTINE	A brother’s dead love, which she would keep fresh
line-1.1.34	VALENTINE	And lasting in her sad remembrance.
line-1.1.35	ORSINO	O, she that hath a heart of that fine frame
line-1.1.36	ORSINO	To pay this debt of love but to a brother,
line-1.1.37	ORSINO	How will she love when the rich golden shaft
line-1.1.38	ORSINO	Hath killed the flock of all affections else
line-1.1.39	ORSINO	That live in her; when liver, brain, and heart,
line-1.1.40	ORSINO	These sovereign thrones, are all supplied, and filled
line-1.1.41	ORSINO	Her sweet perfections with one self king!
line-1.1.42	ORSINO	Away before me to sweet beds of flowers!
line-1.1.43	ORSINO	Love thoughts lie rich when canopied with bowers.
line-SD 1.1.43.1	ORSINO	They exit.
line-SD 1.2.0	ORSINO	Enter Viola, a Captain, and Sailors.
line-1.2.1	VIOLA	What country, friends, is this?
line-1.2.2	CAPTAIN	This is Illyria, lady.
line-1.2.3	VIOLA	And what should I do in Illyria?
line-1.2.4	VIOLA	My brother he is in Elysium.
line-1.2.5	VIOLA	Perchance he is not drowned.—What think you,
line-1.2.6	VIOLA	sailors?
line-1.2.7	CAPTAIN	It is perchance that you yourself were saved.
line-1.2.8	VIOLA	O, my poor brother! And so perchance may he be.
line-1.2.9	CAPTAIN	True, madam. And to comfort you with chance,
line-1.2.10	CAPTAIN	Assure yourself, after our ship did split,
line-1.2.11	CAPTAIN	When you and those poor number saved with you
line-1.2.12	CAPTAIN	Hung on our driving boat, I saw your brother,
line-1.2.13	CAPTAIN	Most provident in peril, bind himself
line-1.2.14	CAPTAIN	(Courage and hope both teaching him the practice)
line-1.2.15	CAPTAIN	To a strong mast that lived upon the sea,
line-1.2.16	CAPTAIN	Where, like Arion
line-1.2.16	CAPTAIN	 on the dolphin’s back,
line-1.2.17	CAPTAIN	I saw him hold acquaintance with the waves
line-1.2.18	CAPTAIN	So long as I could see.
line-SD 1.2.19	VIOLA	, giving
line-SD 1.2.19	VIOLA	 him money
line-1.2.19	VIOLA	For saying so, there’s gold.
line-1.2.20	VIOLA	Mine own escape unfoldeth to my hope,
line-1.2.21	VIOLA	Whereto thy speech serves for authority,
line-1.2.22	VIOLA	The like of him. Know’st thou this country?
line-1.2.23	CAPTAIN	Ay, madam, well, for I was bred and born
line-1.2.24	CAPTAIN	Not three hours’ travel from this very place.
line-1.2.25	VIOLA	Who governs here?
line-1.2.26	CAPTAIN	A noble duke, in nature as in name.
line-1.2.27	VIOLA	What is his name?
line-1.2.28	CAPTAIN	Orsino.
line-1.2.29	VIOLA	Orsino. I have heard my father name him.
line-1.2.30	VIOLA	He was a bachelor then.
line-1.2.31	CAPTAIN	And so is now, or was so very late;
line-1.2.32	CAPTAIN	For but a month ago I went from hence,
line-1.2.33	CAPTAIN	And then ’twas fresh in murmur (as, you know,
line-1.2.34	CAPTAIN	What great ones do the less will prattle of)
line-1.2.35	CAPTAIN	That he did seek the love of fair Olivia.
line-1.2.36	VIOLA	What’s she?
line-1.2.37	CAPTAIN	A virtuous maid, the daughter of a count
line-1.2.38	CAPTAIN	That died some twelvemonth since, then leaving her
line-1.2.39	CAPTAIN	In the protection of his son, her brother,
line-1.2.40	CAPTAIN	Who shortly also died, for whose dear love,
line-1.2.41	CAPTAIN	They say, she hath abjured the sight
line-1.2.42	CAPTAIN	And company of men.
line-1.2.43	VIOLA	O, that I served that lady,
line-1.2.44	VIOLA	And might not be delivered to the world
line-1.2.45	VIOLA	Till I had made mine own occasion mellow,
line-1.2.46	VIOLA	What my estate is.
line-1.2.47	CAPTAIN	That were hard to compass
line-1.2.48	CAPTAIN	Because she will admit no kind of suit,
`
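The trickiest part of the output above is the middle column: each row carries the most recent speaker, and rows before anyone has spoken get `NOSPEAKER`. A minimal sketch of that bookkeeping follows. It assumes the HTML has already been reduced to simple `(kind, code, value)` tuples, which is a hypothetical intermediate format invented for this example, and `label_lines` is a hypothetical name.

```python
def label_lines(tokens):
    """Attach the most recent speaker to every line and stage direction."""
    current_speaker = "NOSPEAKER"  # used until the first speaker appears
    rows = []
    for kind, code, value in tokens:
        if kind == "speaker":
            current_speaker = value  # remember who is talking from here on
        elif kind == "stage":
            rows.append(f"line-SD {code}\t{current_speaker}\t{value}")
        elif kind == "line":
            rows.append(f"line-{code}\t{current_speaker}\t{value}")
    return rows


# Hypothetical pre-extracted tokens, using text from the expected output.
tokens = [
    ("stage", "1.1.0", "Enter Orsino, Duke of Illyria, Curio, and other Lords,"),
    ("speaker", "", "ORSINO"),
    ("line", "1.1.1", "If music be the food of love, play on."),
    ("speaker", "", "CURIO"),
    ("line", "1.1.16", "Will you go hunt, my lord?"),
    ("speaker", "", "ORSINO"),
    ("line", "1.1.17", "What, Curio?"),
    ("stage", "1.1.24.1", "Enter Valentine."),
]

rows = label_lines(tokens)
for row in rows:
    print(row)
```

Keeping the speaker in a variable that lives outside the loop body is what lets a stage direction inherit the last person who spoke.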

Request and parse the HTML, and give it a try!

In [32]:
# First I scrape the HTML from the website
my_url = "http://floatingmedia.com/columbia/FolgerShakes/TN.html"
raw_html = requests.get(my_url).content

# Then I save the HTML on my computer
with open("twelfth_night.html", "wb") as file:
    file.write(raw_html)

# Finally I reopen the saved file and parse it; the with block closes the file afterwards
with open("twelfth_night.html", "r", encoding="utf-8") as shakespeare:
    soup_doc = BeautifulSoup(shakespeare, "html.parser")
print(soup_doc.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html xmlns:html="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Twelfth Night
  </title>
  <meta content="Folger Shakespeare Library" name="author"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=600" name="viewport"/>
  <style media="screen,print" type="text/css">
   @import "fdt.css";
  </style>
  <script type="text/javascript">
   var smartCopyYN = true;
    function smartCopy() {
      if (smartCopyYN == false) return true;
      if (navigator.appVersion.indexOf("MSIE")!=-1) {
        document.getElementById('copyPaste').style.visibility = "hidden";
      }
      var html = "";
      var top = window.pageYOffset || document.documentElement.scrollTop;
      if (typeof window.getSelection != "undefined") {
        var sel = window.getSelection();
        if (sel.rangeCount) {
         

I surrender! 