## Homework 3 Part Two: Scraping
This homework asks you to scrape from three different sources: The Guardian, Supreme Court Decisions and a more complicated version of Shakespeare (bonus). 

Again, please follow the instructions and do the best you can. For examples, look at the tutorial, my answers for Homework 3 Part One, the Beautiful Soup documentation, and any other Python resource (such as Stack Overflow). As you get further into this assignment, much of the trick will be using loops properly and appending information to lists. A good way to use Beautiful Soup carefully is to first use find() to get the first instance of something and search within it, and then use find_all() to get a list of results that you loop through and search within.
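As a rough illustration of that find() then find_all() pattern, here is a minimal sketch. The HTML string and the names `sample_html`, `sample_soup`, and `fruit_rows` are invented for this example and are not part of the assignment pages.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for illustration only.
sample_html = """
<table id="fruits">
  <tr><th>Name</th><th>Color</th></tr>
  <tr><td>Apple</td><td>Red</td></tr>
  <tr><td>Banana</td><td>Yellow</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first match (or None), which makes it easy to inspect.
first_cell = sample_soup.find("td")
print(first_cell.get_text())

# find_all() returns a list of matches, so loop through it and search within each one.
fruit_rows = []
for tr in sample_soup.find_all("tr")[1:]:  # skip the header row
    cells = []
    for td in tr.find_all("td"):
        cells.append(td.get_text())
    fruit_rows.append(cells)
print(fruit_rows)
```

Getting one item right with find() before looping with find_all() tends to make mistakes easier to spot.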

In [2]:
import requests
from bs4 import BeautifulSoup

## Supreme Court Decisions 2020 
Now it's time to scrape from reality. The Supreme Court posts its decisions in a format that is not immediately data friendly. They have a simple HTML table with some information about the decision, including a link to a PDF that contains the written decision. We won't mess with those PDFs this week, but we do want to transform their tables into something useful to us. 

We will be scraping this page: 
https://www.supremecourt.gov/opinions/slipopinion/20

*Note:* While you won't see all of the tables for all the months when you go to the page, they are all there in the HTML that you will download and in the HTML source you view (which is the same thing). Definitely do a view source, and study the structure of the HTML tables before you start coding.

You eventually want to end up with a list of lists (rows and then columns) covering every decision from the 2020 term. Follow the process, and see how far you get.


Write your lines that use requests to get the page, and a second variable that passes the raw HTML into Beautiful Soup for parsing. Include a third line that prints the HTML in the prettify() way.

In [3]:
raw_html = requests.get('https://www.supremecourt.gov/opinions/slipopinion/20').content
soup = BeautifulSoup(raw_html, "html.parser")
#print(soup.prettify())


Isolate the HTML row that holds the first row of information, for the case Alabama Assn. of Realtors v. Department of Health and Human Servs. (As of 11/10/21 that is the most recent case. These things can update, though!)

In [4]:
first_row = soup.find_all('tr')[2]
first_row


<tr>
<td style="text-align: center;">68</td>
<td style="text-align: center;">8/26/21</td>
<td style="text-align: center; white-space: nowrap;">21A23</td>
<td><a href="/opinions/20pdf/21a23_ap6c.pdf" target="_blank" title="The District Court’s judgment—which vacated as unlawful the Centers for Disease Control and Prevention’s imposition of a nationwide moratorium on evictions of any tenants who live in a county that is experiencing substantial or high levels of COVID–19 transmission and who make certain declarations of financial need, 86 Fed. Reg. 43244—is enforceable and the stay of that judgment is vacated.">Alabama Assn. of Realtors v. Department of Health and Human Servs.</a></td>
<td style="text-align: center;"> </td>
<td style="text-align: center;">PC</td>
<td style="text-align: center;">594/2</td>
</tr>

Print out each cell of information from that first row. Your output should look like this:


```
68
8/26/21
21A23
Alabama Assn. of Realtors v. Department of Health and Human Servs.
 
PC
594/2
```

In [5]:
first_row_text = first_row.get_text()
print(first_row_text)



68
8/26/21
21A23
Alabama Assn. of Realtors v. Department of Health and Human Servs.
 
PC
594/2



But wait, there is more information hidden inside the tags! Really important information. Find it and print it out like this (still just for this first row):
```
/opinions/20pdf/21a23_ap6c.pdf 
 The District Court’s judgment—which vacated as unlawful the Centers for Disease Control and Prevention’s imposition of a nationwide moratorium on evictions of any tenants who live in a county that is experiencing substantial or high levels of COVID–19 transmission and who make certain declarations of financial need, 86 Fed. Reg. 43244—is enforceable and the stay of that judgment is vacated.
 ```

In [6]:
#first_row.td(string=True)
hidden_url = first_row.find('a')['href']
hidden_info = first_row.find('a')['title']
print(hidden_url)
print(hidden_info)

/opinions/20pdf/21a23_ap6c.pdf
The District Court’s judgment—which vacated as unlawful the Centers for Disease Control and Prevention’s imposition of a nationwide moratorium on evictions of any tenants who live in a county that is experiencing substantial or high levels of COVID–19 transmission and who make certain declarations of financial need, 86 Fed. Reg. 43244—is enforceable and the stay of that judgment is vacated.
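Since this information lives in tag attributes rather than in the visible text, a short sketch of attribute access may help. It assumes a single-cell HTML snippet modeled on the row above, with the title shortened for illustration. The names `page_url`, `link`, and `full_url` are made up here, and `urljoin` from the standard library is one possible way to turn the relative link into a full URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "https://www.supremecourt.gov/opinions/slipopinion/20"

# Snippet modeled on the row above; the title text is shortened for illustration.
sample_html = (
    '<td><a href="/opinions/20pdf/21a23_ap6c.pdf" '
    'title="Shortened summary for illustration.">'
    'Alabama Assn. of Realtors v. Department of Health and Human Servs.</a></td>'
)
link = BeautifulSoup(sample_html, "html.parser").find("a")

href = link["href"]             # square brackets raise KeyError if the attribute is missing
title = link.get("title")       # .get() returns None instead of raising
missing = link.get("download")  # this attribute is not on the tag, so this is None

# The href is relative to the site, so join it with the page URL to get a full link.
full_url = urljoin(page_url, href)
print(full_url)
```

Using `.get()` is a reasonable choice when some rows might lack an attribute, because the loop keeps running instead of stopping on an error.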


Okay, time to make this useful. Take the information you printed in the last two cells, and combine them all into a list. Output the list, it should look like this:
```
['68',
 '8/26/21',
 '21A23',
 'Alabama Assn. of Realtors v. Department of Health and Human Servs.',
 '\xa0',
 'PC',
 '594/2',
 '/opinions/20pdf/21a23_ap6c.pdf',
 'The District Court’s judgment—which vacated as unlawful the Centers for Disease Control and Prevention’s imposition of a nationwide moratorium on evictions of any tenants who live in a county that is experiencing substantial or high levels of COVID–19 transmission and who make certain declarations of financial need, 86 Fed. Reg. 43244—is enforceable and the stay of that judgment is vacated.']
 ```
 

In [7]:
first_row_list = []
for td in first_row.find_all('td'):
    first_row_list.append(td.get_text())
    
first_row_list.append(hidden_url)
first_row_list.append(hidden_info)

first_row_list

['68',
 '8/26/21',
 '21A23',
 'Alabama Assn. of Realtors v. Department of Health and Human Servs.',
 '\xa0',
 'PC',
 '594/2',
 '/opinions/20pdf/21a23_ap6c.pdf',
 'The District Court’s judgment—which vacated as unlawful the Centers for Disease Control and Prevention’s imposition of a nationwide moratorium on evictions of any tenants who live in a county that is experiencing substantial or high levels of COVID–19 transmission and who make certain declarations of financial need, 86 Fed. Reg. 43244—is enforceable and the stay of that judgment is vacated.']
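The `'\xa0'` in that list is how Python displays a non-breaking space, which is what the HTML entity `&nbsp;` becomes after parsing. The expected output keeps it, so no cleanup is needed for this homework. If you later want empty strings instead, the sketch below shows one option. The names `raw_cells` and `clean_cells` are invented for the example.

```python
from bs4 import BeautifulSoup

# A table cell that contains only a non-breaking space, as in the empty column above.
cell = BeautifulSoup("<td>&nbsp;</td>", "html.parser").td
raw_text = cell.get_text()
print(repr(raw_text))

# str.strip() with no arguments removes all Unicode whitespace, including '\xa0'.
raw_cells = ["68", "8/26/21", raw_text, "PC"]
clean_cells = [text.strip() for text in raw_cells]
print(clean_cells)
```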

Now run the exact same code, but for the first row in the third table, June 2021. The output should look like this:
```
['64',
 '6/29/21',
 '20-440',
 'Minerva Surgical, Inc. v. Hologic, Inc.',
 '\xa0',
 'EK',
 '594/2',
 '/opinions/20pdf/20-440_9ol1.pdf',
 'The well-grounded patent law doctrine of assignor estoppel applies only when the assignor’s claim of invalidity contradicts explicit or implicit representations the assignor made in assigning the patent.']
```


In [8]:
# ids are unique on a page, so find() can fetch the June div directly
june_table = soup.find('div', id='cell6')
june_row1 = june_table.find_all('tr')[1]  # index 0 is the header row
print(june_row1)

<tr>
<td style="text-align: center;">64</td>
<td style="text-align: center;">6/29/21</td>
<td style="text-align: center; white-space: nowrap;">20-440</td>
<td><a href="/opinions/20pdf/20-440_9ol1.pdf" target="_blank" title="The well-grounded patent law doctrine of assignor estoppel applies only when the assignor’s claim of invalidity contradicts explicit or implicit representations the assignor made in assigning the patent.">Minerva Surgical, Inc. v. Hologic, Inc.</a></td>
<td style="text-align: center;"> </td>
<td style="text-align: center;">EK</td>
<td style="text-align: center;">594/2</td>
</tr>


Great! Now go through all of the rows in that third table, June (but not the header), and build a list of lists with the information for every case in that table.

Note that the code here should be similar to the code above, but you will need to loop through all of the rows in June and collect the info for each row in a new list, which is then appended to a larger list each time the loop finishes (before looping back to the next row).

In [9]:
june_table_clean = []
for row in june_table.find_all('tr')[1:]:
    row_list = []
    for td in row.find_all('td'):
        row_list.append(td.get_text())
    row_list.append(row.find('a')['href'])
    row_list.append(row.find('a')['title'])
    june_table_clean.append(row_list)
june_table_clean[:5]

[['64',
  '6/29/21',
  '20-440',
  'Minerva Surgical, Inc. v. Hologic, Inc.',
  '\xa0',
  'EK',
  '594/2',
  '/opinions/20pdf/20-440_9ol1.pdf',
  'The well-grounded patent law doctrine of assignor estoppel applies only when the assignor’s claim of invalidity contradicts explicit or implicit representations the assignor made in assigning the patent.'],
 ['63',
  '6/29/21',
  '19-897',
  'Johnson v. Guzman Chavez',
  '\xa0',
  'A',
  '594/2',
  '/opinions/20pdf/19-897_c07d.pdf',
  'The detention of an alien ordered removed from the United States who reenters without authorization is governed by 8 U. S. C. §1231.'],
 ['62',
  '6/29/21',
  '19-1039',
  'PennEast Pipeline Co. v. New Jersey',
  '\xa0',
  'R',
  '594/2',
  '/opinions/20pdf/19-1039_8n5a.pdf',
  'A certificate of public convenience and necessity issued by the Federal Energy Regulatory Commission pursuant to §717f(h) of the Natural Gas Act authorizes a private company to condemn all necessary rights-of-way, whether owned by priv

Finally, go through EVERY table and get out every row, with no headers, so that you have all of the 2020 decisions, from 68 down to 1, in a highly useful list-within-list format.

In [10]:
all_tables = []
for div in soup.find_all('div'):
    if div.get('id') is not None:
        if div.get('id')[:4] == 'cell':
            month_table = div
            for row in month_table.find_all('tr')[1:]:
                row_list = []
                for td in row.find_all('td'):
                    row_list.append(td.get_text())
                row_list.append(row.find('a')['href'])
                row_list.append(row.find('a')['title'])
                all_tables.append(row_list)
all_tables[:5]

[['68',
  '8/26/21',
  '21A23',
  'Alabama Assn. of Realtors v. Department of Health and Human Servs.',
  '\xa0',
  'PC',
  '594/2',
  '/opinions/20pdf/21a23_ap6c.pdf',
  'The District Court’s judgment—which vacated as unlawful the Centers for Disease Control and Prevention’s imposition of a nationwide moratorium on evictions of any tenants who live in a county that is experiencing substantial or high levels of COVID–19 transmission and who make certain declarations of financial need, 86 Fed. Reg. 43244—is enforceable and the stay of that judgment is vacated.'],
 ['67',
  '7/02/21',
  '20-1084',
  'Dunn v. Reeves',
  '\xa0',
  'PC',
  '594/2',
  '/opinions/20pdf/20-1084_19m1.pdf',
  'In this federal habeas case, the Eleventh Circuit erred in characterizing the Alabama court’s case-specific analysis as a “categorical rule” that any prisoner will always lose an ineffective-assistance-of-trial-counsel claim if he fails to call and question trial counsel concerning his or her actions and r
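An alternative way to pick out only the month divs is to hand find_all() a compiled regular expression for the id, which Beautiful Soup accepts as an attribute filter. The sketch below runs on invented HTML (the `accordion` wrapper, the case names, and the file names are all hypothetical). It also guards against rows that have no link, which is an assumption about what could happen rather than something observed on the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the "cellN" divs; not copied from the real page.
sample_html = """
<div id="accordion">
  <div id="cell1"><table>
    <tr><th>R-</th><th>Name</th></tr>
    <tr><td>3</td><td><a href="/c.pdf" title="Summary C">Case C</a></td></tr>
  </table></div>
  <div id="cell2"><table>
    <tr><th>R-</th><th>Name</th></tr>
    <tr><td>2</td><td><a href="/b.pdf" title="Summary B">Case B</a></td></tr>
    <tr><td>1</td><td>Case A (no link)</td></tr>
  </table></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
# Only divs whose id starts with "cell" match; the wrapper div is skipped.
for div in sample_soup.find_all("div", id=re.compile(r"^cell")):
    for tr in div.find_all("tr")[1:]:  # skip each table's header row
        row_list = [td.get_text() for td in tr.find_all("td")]
        link = tr.find("a")
        if link is not None:  # only add link info when the row has a link
            row_list.append(link["href"])
            row_list.append(link.get("title"))
        sample_rows.append(row_list)
print(sample_rows)
```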

In [11]:
soup.find_all('id')

[]

## The Guardian: Best Non-Fiction Books of All Time 
I do not endorse this list. However, there are some interesting things within. You will notice that the Internet is filled with rankings, a form that is as readily consumable as it is programmable, because code understands ranking pretty well.

For this task you want to extract different elements of this list separately. You will start by extracting the information for the first entry on the list. You want to get three elements separately: RankNumber, Title_Author_Year, Blurb. And these need to be placed into a Python list with three elements. Once you've accomplished that, you want to loop through all of the entries in that list, 1-100, and make individual Python lists with those same three elements, and then put those lists into a Python list as you go...

Step one, go to this page and take a look at what you're contending with:

https://www.theguardian.com/books/2017/dec/31/the-100-best-nonfiction-books-of-all-time-the-full-list

Note: some of the tags you see in Chrome's "Inspect" panel, and even in "View Source", are not the same as the HTML that is downloaded by requests and parsed by Beautiful Soup.


Step 1: In the next two cells, use requests to download the HTML, and use Beautiful Soup to parse it. Then print the prettify() version of that downloaded HTML, and copy and paste that output into an HTML editor to look at the tags in there. The overall structure will be the same as what you see in Chrome, but some of the "class=" names will be different.

In [12]:
raw_html = requests.get('https://www.theguardian.com/books/2017/dec/31/the-100-best-nonfiction-books-of-all-time-the-full-list').content
book_soup = BeautifulSoup(raw_html, "html.parser")
#print(book_soup.prettify())

Step 2: Find the HTML that contains the first entry on the list. Your output should look like this:

`<p class="dcr-eu20cu"><strong>1. <a data-link-name="in body link" href="https://www.theguardian.com/books/2016/feb/01/100-best-nonfiction-books-of-all-time-the-sixth-extinction-elizabeth-kolbert">The Sixth Extinction by Elizabeth Kolbert (2014)</a> </strong><br/> An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.</p>`

In [13]:
# The class attributes are updated dynamically by JavaScript, so they are not
# always consistent and may even change from download to download.
# Selecting the paragraph by position avoids depending on them.
book_soup.find_all('p')[2]

<p class="dcr-eu20cu"><strong>1. <a data-link-name="in body link" href="https://www.theguardian.com/books/2016/feb/01/100-best-nonfiction-books-of-all-time-the-sixth-extinction-elizabeth-kolbert">The Sixth Extinction by Elizabeth Kolbert (2014)</a> </strong><br/> An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.</p>
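Because the class names cannot be relied on, one option is to recognize entries by their structure instead: a `<p>` whose `<strong>` text starts with a digit. The sketch below uses invented HTML with made-up class names and book titles. Whether every entry on the real page follows this shape is an assumption you would need to check.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; class names and book titles are invented for illustration.
sample_html = """
<p class="abc-123">Intro paragraph with no ranking.</p>
<p class="abc-123"><strong>1. <a href="#">First Book by A. Author (2001)</a> </strong><br/> A blurb.</p>
<p class="xyz-789"><strong>2. <a href="#">Second Book by B. Author (1999)</a></strong><br/>Another blurb.</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

entries = []
for p in sample_soup.find_all("p"):
    strong = p.find("strong")
    # Keep only paragraphs whose bold text begins with a rank number.
    if strong is not None and strong.get_text()[:1].isdigit():
        entries.append(p)

print(len(entries))
print(entries[0].a.get_text())
```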

Step 3: Extract the number from that entry. 

Result: 

`'1. '`


In [14]:
rank = book_soup.find_all('p')[2].get_text()[:3]
rank

'1. '
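Slicing the first three characters works for a one-digit rank, but it would cut off '10. ' or '100. '. As a hedged alternative, a regular expression can grab however many digits there are. The helper name `split_rank` is invented for this sketch.

```python
import re

def split_rank(text):
    """Split 'N. rest' into ('N. ', 'rest'); return (None, text) if there is no rank."""
    match = re.match(r"\d+\.\s?", text)
    if match is None:
        return None, text
    return match.group(0), text[match.end():]

print(split_rank("1. The Sixth Extinction by Elizabeth Kolbert (2014)"))
print(split_rank("100. King James Bible: The Authorised Version (1611)"))
```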

Step 4: extract the title_author_year. 

Result:
`'The Sixth Extinction by Elizabeth Kolbert (2014)'`
        

In [15]:
book_title = book_soup.find_all('p')[2].strong.get_text()[3:].strip()
book_title

'The Sixth Extinction by Elizabeth Kolbert (2014)'

Step 5: Extract the blurb.

Result:

`' An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.'`



In [16]:
book_description = book_soup.find_all('p')[2].br.next
book_description

' An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.'
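Grabbing the text after the `<br/>` depends on how Beautiful Soup navigates the tree, so here is a small sketch contrasting `next_sibling` with `next_element`. The HTML is a shortened, invented version of the entry above. For a tag with no children such as `<br/>` the two give the same result, but for `<strong>` they differ.

```python
from bs4 import BeautifulSoup

# Shortened, invented version of one entry.
sample_html = '<p><strong>1. <a href="#">A Title (2014)</a> </strong><br/> The blurb.</p>'
p = BeautifulSoup(sample_html, "html.parser").p

# <br/> has no children, so what comes next is the text that follows it.
after_br = p.br.next_sibling
# next_element of <strong> is its first child, not what comes after it.
inside_strong = p.strong.next_element
# next_sibling of <strong> is whatever follows the closing </strong>: here, the <br/> tag.
after_strong = p.strong.next_sibling

print(repr(after_br))
print(repr(inside_strong))
print(after_strong.name)
```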

In [17]:
#book_soup.find_all('p')[2].get_text()[len(book_title)+5:]

Step 6: Take those three elements you extracted, and put them into a Python list. If you had success in the three steps above, you don't need Beautiful Soup for this; you just need to take the individual elements that you extracted and place them inside a list.

Result: 

`['1. ',
 'The Sixth Extinction by Elizabeth Kolbert (2014)',
 ' An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.']`

In [18]:
top_book = [rank, book_title, book_description]
top_book


['1. ',
 'The Sixth Extinction by Elizabeth Kolbert (2014)',
 ' An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.']

Step 7: This is more a leap than a step! Take all of the methods you used to isolate that one entry, and apply them to every entry, so that you create the same list you see above, over and over, for each entry, and place each of those into a master list-of-lists.

(Hints:
1) You will need to use some version of find_all() to get all of the entries.

2) You will need a loop to iterate through all of those entries.

3) For each entry you will need to extract each of the three elements (number, title_author_year, description) in the exact same way you did with the first entry.

4) It may help to use print() inside your loop to make sure you're getting everything out correctly.

5) Once you're sure you're getting everything out correctly, you will need to make a Python list that will capture each list that is being built within the loop.

6) It may be helpful to include an **is not None** if statement in this loop as there are some variations and even mistakes in the HTML 

)

Your desired output is in the cell below.



In [19]:
books = []
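for p in book_soup.find_all('p')[2:]:
    if p.strong is not None:
        # Split on the first '. ' so two- and three-digit ranks stay whole;
        # a fixed [:3] slice would turn '100. ' into '100'
        number, dot, rest = p.strong.get_text().partition('. ')
        rank = number + dot
        book_title = rest.strip()
        if p.br is None:
            # no <br/>: the blurb is whatever directly follows </strong>
            book_description = p.strong.next_sibling
        else:
            book_description = p.br.next_sibling
        book = [rank, book_title, book_description]
        books.append(book)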

In [20]:
books[:5]

[['1. ',
  'The Sixth Extinction by Elizabeth Kolbert (2014)',
  ' An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.'],
 ['2. ',
  'The Year of Magical Thinking by Joan Didion (2005)',
  'This steely and devastating examination of the author’s grief following the sudden death of her husband changed the nature of writing about bereavement. '],
 ['3. ',
  'No Logo by Naomi Klein (1999)',
  ' Naomi Klein’s timely anti-branding bible combined a fresh approach to corporate hegemony with potent reportage from the dark side of capitalism. '],
 ['4. ',
  'Birthday Letters by Ted Hughes (1998)',
  ' These passionate, audacious poems addressed to Hughes’s late wife, Sylvia Plath, contribute to the couple’s mythology and are a landmark in English poetry. '],
 ['5. ',
  'Dreams from My Father by Barack Obama (1995)',
  ' This remarkably candid memoir revealed not only a literary talent, but a force that would change the face of US politics for e

Final Result: `
[['1. ',
  'The Sixth Extinction by Elizabeth Kolbert (2014)',
  ' An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.'],
 ['2. ',
  'The Year of Magical Thinking by Joan Didion (2005)',
  'This steely and devastating examination of the author’s grief following the sudden death of her husband changed the nature of writing about bereavement. '],
 ['3. ',
  'No Logo by Naomi Klein (1999)',
  ' Naomi Klein’s timely anti-branding bible combined a fresh approach to corporate hegemony with potent reportage from the dark side of capitalism. '],
 ['4. ',
  'Birthday Letters by Ted Hughes (1998)',
  ' These passionate, audacious poems addressed to Hughes’s late wife, Sylvia Plath, contribute to the couple’s mythology and are a landmark in English poetry. '],
 ['5. ',
  'Dreams from My Father by Barack Obama (1995)',
  ' This remarkably candid memoir revealed not only a literary talent, but a force that would change the face of US politics for ever. '],
 ['6. ',
  'A Brief History of Time by Stephen Hawking (1988)',
  ' The theoretical physicist’s mega-selling account of the origins of the universe is a masterpiece of scientific inquiry that has influenced the minds of a generation. '],
`
and so on until:
`['98. ',
  'The Anatomy of Melancholy by Robert Burton (1621)',
  'Burton’s garrulous, repetitive masterpiece is a compendious study of melancholia, a sublime literary doorstop that explores humanity in all its aspects.'],
 ['99. ',
  'The History of the World by Walter Raleigh (1614)',
  'Raleigh’s most important prose work, close to 1m words in total, used ancient history as a sly commentary on present-day issues.'],
 ['100. ',
  'King James Bible: The Authorised Version (1611)',
  'It is impossible to imagine the English-speaking world celebrated in this series without the King James Bible, which is as universal and influential as Shakespeare.']]`
 
I didn't want to print all 100 elements of this list here. But notice that this list begins with `[[` and ends with `]]`. That is because this is a list of lists. Note the comma between entries, `], [`: it means that the list for each book is one element in the list-of-lists.
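To make the list-of-lists idea concrete, here is a minimal sketch of pulling values back out. The two titles come from the output above, but the blurbs are replaced with placeholders, and the name `sample_books` is invented for the example.

```python
# Two entries in the same shape as the result above; the blurbs are placeholders.
sample_books = [
    ["1. ", "The Sixth Extinction by Elizabeth Kolbert (2014)", "placeholder blurb one"],
    ["2. ", "The Year of Magical Thinking by Joan Didion (2005)", "placeholder blurb two"],
]

# The first index picks the book, the second picks the element within that book's list.
first_title = sample_books[0][1]
print(first_title)

# Each inner list has three elements, so a loop can unpack them by name.
titles = []
for rank, title, blurb in sample_books:
    titles.append(title)
print(titles)
```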