# Best Books of 2020
This notebook scrapes The New York Times' list of notable books of 2020 and loads NPR's best-books list from a JSON file. The goal is a kind of 'peer-reviewed' recommendation: books that appear on both lists, from two organizations that I trust.

## Importing Libraries and Data

In [13]:
import gzip 
import io
import requests
from bs4 import BeautifulSoup
import wget
import json
from pprint import pprint
 
import warnings

### NPR's Best Books of 2020 list
Inspecting the page showed that its front end loads the data from what appeared to be a nicely formatted JSON file, so no scraping was needed.

In [6]:
# NPR's Best Books of 2020 List
wget.download('https://apps.npr.org/best-books/2020.json', 'npr.json')

'npr.json'

It turns out that the downloaded file was gzip-compressed JSON (despite not having a .gz file extension), so it has to be decompressed before parsing:

In [35]:
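# Decompress the downloaded file; 'rb' reads the contents as raw bytes
with gzip.open('npr.json', 'rb') as gzip_file:
    npr_file = gzip_file.read()

# json.loads accepts bytes as well as str
npr = json.loads(npr_file)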
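
Because the file extension gave no hint about the compression, one option is to check the content itself. The following is a minimal sketch, not what this notebook ran: `load_json_maybe_gzipped` is a hypothetical helper, and the sample data is made up. It relies only on the fact that gzip data begins with the two bytes `0x1f 0x8b`.

```python
import gzip
import json


def load_json_maybe_gzipped(raw: bytes):
    """Parse JSON from bytes, decompressing first if they are gzip data."""
    # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return json.loads(raw)


# Self-contained demonstration with made-up sample data (not NPR's file).
sample = [{"title": "Example Title", "id": 1}]
plain_bytes = json.dumps(sample).encode("utf-8")
gzipped_bytes = gzip.compress(plain_bytes)

from_plain = load_json_maybe_gzipped(plain_bytes)
from_gzip = load_json_maybe_gzipped(gzipped_bytes)
```

For the file downloaded above, the same helper could presumably be called on the bytes read with `open('npr.json', 'rb')`.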

In [37]:
npr[0]

{'title': "Dancing Man: A Broadway Choreographer's Journey",
 'author': 'Bob Avian with Tom Santopietro',
 'dimensions': {'width': 397, 'height': 595},
 'cover': '9781496825889',
 'tags': ['staff picks',
  'nonfiction',
  'biography & memoir',
  'for music lovers',
  'no biz like show biz'],
 'id': 1}

In [40]:
npr_books = []
for item in npr:
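    # Strip surrounding whitespace from the title itself (not from the key string)
    book_name = item['title'].strip()
    # Remove non-breaking space characters
    book_name = book_name.replace('\xa0','')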
    npr_books.append(book_name)
    
print('There are', len(npr_books), 'recommended books by NPR:', npr_books)

There are 383 recommended books by NPR: ["Dancing Man: A Broadway Choreographer's Journey", 'American Oligarchs: The Kushners, The Trumps, And The Marriage Of Money And Power', 'A Castle In The Clouds', 'Wagnerism: Art And Politics In The Shadow Of Music', "The Jakarta Method: Washington's Anticommunist Crusade And The Mass Murder Program That Shaped Our World", 'Luster: A Novel', 'Caste: The Origins Of Our Discontents', "A Children's Bible: A Novel", 'All Because You Matter', 'Black Heroes Of The Wild West', 'Echo Mountain', 'Everybody Counts: A Counting Story From 0 To 7.5 Billion', 'Go With The Flow', 'The Lights And Types Of Ships At Night', 'The Most Beautiful Thing', 'The Talk: Conversations About Race, Love & Truth', 'Twins', 'Wink', 'A Game Of Fox & Squirrels', 'Mad, Bad & Dangerous To Know', 'The Invisible Life Of Addie LaRue', 'The Mermaid, The Witch, And The Sea', 'Burnt Sugar: A Novel', 'No Filter: The Inside Story Of Instagram', 'Race For Profit: How Banks And The Real Est

### The New York Times 100 Notable Books of 2020
There was no API here, so I had to scrape.

In [42]:
# NYT 100 Notable Books of 2020
NYT = 'https://www.nytimes.com/interactive/2020/books/notable-books.html'
NYT_list = requests.get(NYT)
NYT_soup = BeautifulSoup(NYT_list.content, 'html.parser')
NYT_soup_1 = NYT_soup.body

In [48]:
NYT_books = []
book_title_list = NYT_soup_1.find_all('div', attrs={'class': 'g-book-title balance-text'})

for item in book_title_list:
    book_name = item.a.text.strip()
    book_name = book_name.replace('\n','')
    NYT_books.append(book_name)
    
print('There are', len(NYT_books), 'recommended books by The New York Times:', NYT_books)

There are 100 recommended books by The New York Times: ['The Aosawa Murders', 'The Beauty in Breaking: A Memoir', 'The Beauty of Your Face', 'Becoming Wild: How Animal Cultures Raise Families, Create Beauty, and Achieve Peace', 'Beheld', 'The Biggest Bluff: How I Learned to Pay Attention, Master Myself, and Win', 'Black Wave: Saudi Arabia, Iran, and the Forty-Year Rivalry That Unraveled Culture, Religion, and Collective Memory in the Middle East', 'Blacktop Wasteland', 'The Book of Eels: Our Enduring Fascination With the Most Mysterious Creature in the Natural World', 'The Boy In The Field', 'Breasts and Eggs', 'A Burning', 'Burning Down the House: Newt Gingrich, the Fall of a Speaker, and the Rise of the New Republican Party', 'Caste: The Origins of our Discontents', "A Children's Bible", 'Cleanness', 'Deacon King Kong', 'The Dead Are Arising: The Life of Malcolm X', 'The Death of Jesus', 'The Death of Vivek Oji', 'Deaths of Despair and the Future of Capitalism', 'Desert Notebooks: A 

### Finding Books Mentioned on Both Lists
I converted each Python list into a set. The lists are small enough that runtime doesn't matter here, but for the record: building a set means iterating over the list once and adding each element in O(1) on average, so converting two lists of length m and n costs O(m + n).

The intersection is computed by iterating over the smaller set and checking whether each element is a member of the larger one. Each membership check is O(1) on average, so the intersection costs O(min(m, n)).

Overall, the work is O(m + n), which is linear in the total number of titles.
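
The cost argument above can be made concrete with a hand-written intersection. This is only a sketch of the idea, not CPython's actual implementation; `intersect` is a hypothetical helper and the titles are sample data rather than the full lists.

```python
def intersect(a: set, b: set) -> set:
    """Return the elements common to both sets, scanning only the smaller one."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    # One average-case O(1) hash lookup per element of the smaller set.
    return {x for x in small if x in large}


# Sample titles only, not the real lists.
list_a = ["Hamnet", "Cleanness", "Luster"]
list_b = ["Cleanness", "Hamnet", "Beheld", "Twins"]

# Building each set is one pass over its list: O(m) and O(n).
common = intersect(set(list_a), set(list_b))
```

The built-in `&` operator used in the next cell gives the same result; the helper just spells out where the O(min(m, n)) comes from.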

In [66]:
npr_books = set(npr_books)
NYT_books = set(NYT_books)
combined_list = npr_books & NYT_books

print(len(combined_list),'options mentioned between NPR and The NYT:', combined_list)

6 options mentioned between NPR and The NYT: {'Uncanny Valley: A Memoir', 'Cleanness', 'The Undocumented Americans', 'Just Us: An American Conversation', 'Hamnet', 'A Promised Land'}
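
An exact-match intersection is sensitive to small differences in how each source writes a title. In the printed lists above, NPR has 'Caste: The Origins Of Our Discontents' and "A Children's Bible: A Novel", while the NYT has 'Caste: The Origins of our Discontents' and "A Children's Bible", so neither pair matches. The sketch below shows one possible way to loosen the comparison, assuming it is acceptable to ignore case and subtitles; `normalize` and `loose_intersection` are hypothetical helpers, and dropping subtitles could produce false matches between different books that share a main title.

```python
def normalize(title: str) -> str:
    """Reduce a title to a comparison key: main title only, case-insensitive."""
    main_title = title.split(":", 1)[0]  # drop any subtitle after the first colon
    return " ".join(main_title.split()).casefold()  # collapse whitespace, ignore case


def loose_intersection(first, second):
    """Return titles from `first` whose normalized key also appears in `second`."""
    second_keys = {normalize(t) for t in second}
    return {t for t in first if normalize(t) in second_keys}


# A few titles copied from the two printed lists above.
npr_sample = [
    "Caste: The Origins Of Our Discontents",
    "A Children's Bible: A Novel",
    "Luster: A Novel",
]
nyt_sample = [
    "Caste: The Origins of our Discontents",
    "A Children's Bible",
    "Cleanness",
]

matches = loose_intersection(npr_sample, nyt_sample)
```

On this small sample, `matches` contains the two NPR titles that the exact comparison misses. Whether the looser rule is appropriate for the full lists would need a manual check of the extra matches.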
