# AIM 5001: Week 12 Assignment
# Working with HTML, XML, JSON, and Web APIs
Pujita Ravichandar

The books of interest that I have chosen are the following: The Oceans by Eelco J. Rohling, Classical Mechanics by John R. Taylor, and Proving Einstein Right: The Daring Expeditions That Changed How We Look at the Universe by James S. Gates and Cathie Pelletier. These are some of my favorite science books, along with some that I am currently reading. The variables I have chosen to include in my JSON, HTML, and XML files are the title, author(s), year published, and length of each book in pages. The process of writing the data files and loading them into pandas dataframes is shown below.


## Part I: Working with HTML, XML, and JSON

### JSON Data

The link to the GitHub file hosting the JSON data can be found here: https://github.com/pujitaravi/AIM-5001/blob/master/wk12assndata.json. Writing this JSON file was a matter of following the appropriate syntax. The values were then loaded into a pandas dataframe using the `pd.read_json()` function. The resulting dataframe and the accompanying code are shown below.



In [242]:
#import necessary packages
import pandas as pd
import json

#use read_json to load the data into a dataframe
data = pd.read_json('https://raw.githubusercontent.com/pujitaravi/AIM-5001/master/wk12assndata.json')
data

Unnamed: 0,title,author,year published,length
0,The Oceans,Eelco J. Rohling,2017,273
1,Classical Mechanics,John R. Taylor,2003,808
2,Proving Einstein Right: The Daring Expeditions...,Sylvester James Gates and Cathie Palletier,2019,451
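
To make "following the appropriate syntax" concrete, here is a minimal sketch of one layout such a file could use: a list of records with one object per book. The layout and the `JSON_TEXT` name are assumptions for illustration; the actual `wk12assndata.json` may be organized differently. Only the standard-library `json` module is used so the sketch runs without downloading anything.

```python
import json

# Hypothetical file contents: a list of records, one object per book.
# The real wk12assndata.json may be laid out differently.
JSON_TEXT = """
[
  {"title": "The Oceans", "author": "Eelco J. Rohling",
   "year published": 2017, "length": 273},
  {"title": "Classical Mechanics", "author": "John R. Taylor",
   "year published": 2003, "length": 808}
]
"""

# Parse the text into a list of Python dicts
books = json.loads(JSON_TEXT)

# Unquoted numbers stay numeric after parsing
for book in books:
    print(book["title"], book["year published"], book["length"])
```

A record-oriented layout like this is one of the shapes `pd.read_json()` can read, producing one row per record and one column per key.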


### HTML

The link to my GitHub file hosting the HTML data can be found here: https://github.com/pujitaravi/AIM-5001/blob/master/data.html. Writing this data was a matter of following the appropriate HTML syntax. The specifics can be seen within the contents of the link provided. These values were loaded into a data frame by reading the HTML with the `pd.read_html()` function, which returns a list of data frames, one per table found; the first (and in this case only) table is then selected by index. The resulting data frame and the accompanying code are shown below.

In [253]:
#read the HTML file housed within the Github repo
tables = pd.read_html('https://raw.githubusercontent.com/pujitaravi/AIM-5001/master/data.html')
len(tables)

1

In [254]:
# the first item (and only item here) in the list is a data frame
html_data = tables[0]
type(html_data)

pandas.core.frame.DataFrame

In [255]:
#display the first few rows of the data frame
html_data.head()

Unnamed: 0.1,Unnamed: 0,title,author,year published,length
0,0,The Oceans,Eelco J. Rohling,2017,273
1,1,Classical Mechanics,John R. Taylor,2003,808
2,2,Proving Einstein Right,Sylvester James Gates and Cathie Pelletier,2019,451
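
As a rough illustration of the table structure that `read_html()` looks for, the sketch below parses a small `<table>` using only the standard library's `html.parser`. This is a swapped-in technique for illustration, not what pandas does internally, and the `HTML_DOC` markup and `TableParser` class are hypothetical; the real `data.html` may be laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical HTML table; the real data.html may differ.
HTML_DOC = """
<table>
  <tr><th>title</th><th>author</th><th>year published</th><th>length</th></tr>
  <tr><td>The Oceans</td><td>Eelco J. Rohling</td><td>2017</td><td>273</td></tr>
  <tr><td>Classical Mechanics</td><td>John R. Taylor</td><td>2003</td><td>808</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []                # start a new row
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")          # start a new, empty cell

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data         # text inside the current cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

parser = TableParser()
parser.feed(HTML_DOC)

# The first row holds the headers; the rest are data rows
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Unlike `read_html()`, this sketch leaves every cell as a string, so numeric columns would still need converting.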


### XML

The link to my GitHub file hosting the XML data can be found here: https://github.com/pujitaravi/AIM-5001/blob/master/week12data.xml. The specifics of the data values can be seen in the contents of the link provided. These values were loaded into a data frame by downloading the file, parsing it with `lxml.objectify`, and collecting each book's child elements into a dictionary. The resulting data frame and the accompanying code are shown below.

In [260]:
# load the urllib.request module
import urllib.request

# load the objectify module from the lxml library
from lxml import objectify

In [262]:
# download the XML file to a temporary local path
path, headers = urllib.request.urlretrieve('https://raw.githubusercontent.com/pujitaravi/AIM-5001/master/week12data.xml')

# objectify.parse() is then used to parse the downloaded file
# (passing the path lets lxml open and close the file itself)
parsed = objectify.parse(path)

#get a reference to the root node of the XML file
root = parsed.getroot()

In [263]:
# define an empty list that will be used to store the parsed data
data = []

# loop over each <book> element in the XML data
for elt in root.book:
    el_data = {}
    for child in elt.iterchildren():
        el_data[child.tag] = child.pyval
    data.append(el_data)
    
data

[{'title': 'The Oceans',
  'author': 'Eelco J. Rohling',
  'year_published': 2017,
  'length': 273},
 {'title': 'Classical Mechanics',
  'author': 'John R. Taylor',
  'year_published': 2003,
  'length': 808},
 {'title': 'Proving Einstein Right: The Daring Expeditions that Changed How We Look at the Universe',
  'author': 'Sylvester James Gates and Cathie Pelletier',
  'year_published': 2019,
  'length': 451}]

In [264]:
#check the results
perf = pd.DataFrame(data)
perf.head()

Unnamed: 0,title,author,year_published,length
0,The Oceans,Eelco J. Rohling,2017,273
1,Classical Mechanics,John R. Taylor,2003,808
2,Proving Einstein Right: The Daring Expeditions...,Sylvester James Gates and Cathie Pelletier,2019,451
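
For comparison, the same element-by-element loop can be sketched with the standard library's `xml.etree.ElementTree` instead of `lxml.objectify`. The `XML_DOC` markup and the `<books>` root name are assumptions inferred from the tags used in the code above; the real `week12data.xml` may differ. One difference worth noting: `objectify` converts numeric text through `.pyval`, while ElementTree returns strings, so the conversion is done by hand here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the <books> root name is an assumption.
XML_DOC = """<books>
  <book>
    <title>The Oceans</title>
    <author>Eelco J. Rohling</author>
    <year_published>2017</year_published>
    <length>273</length>
  </book>
  <book>
    <title>Classical Mechanics</title>
    <author>John R. Taylor</author>
    <year_published>2003</year_published>
    <length>808</length>
  </book>
</books>"""

root = ET.fromstring(XML_DOC)

records = []
for book in root.findall("book"):
    # map each child tag to its text content
    record = {child.tag: child.text for child in book}
    # ElementTree returns strings, so convert the numeric fields explicitly
    record["year_published"] = int(record["year_published"])
    record["length"] = int(record["length"])
    records.append(record)

print(records)
```

The resulting list of dictionaries has the same shape as `data` above and could be passed to `pd.DataFrame()` in the same way.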


## Part II: Working with Web APIs

After signing up for an API key on the New York Times Developer website, I pulled the data for the current hardcover fiction best sellers. The NYT Books API returns JSON data. This was then explored and converted into a pandas dataframe, which puts the data into a workable format. The code detailing this, along with the resulting data frame, is shown below.

In [265]:
import requests

#my API key
API_key = 'wSTu5zYxHoRVVnG0eXYr4G40jTigiZev'

#url of the data with my API Key
url = 'https://api.nytimes.com/svc/books/v3/lists/current/hardcover-fiction.json?api-key='+API_key

#send request
r = requests.get(url)

#convert the response's JSON content into 
# native Python objects (here, a nested dict)
json_data = r.json()
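
One possible variation on the request setup above is to read the key from an environment variable rather than writing it into the notebook, and to let a helper encode the query string. This is only a sketch: the `NYT_API_KEY` variable name and the `build_url` helper are hypothetical, and only the standard library is used so nothing is sent over the network.

```python
import os
from urllib.parse import urlencode

# Endpoint taken from the request above
BASE_URL = "https://api.nytimes.com/svc/books/v3/lists/current/hardcover-fiction.json"

def build_url(api_key):
    """Return the request URL with the key encoded as a query parameter."""
    return BASE_URL + "?" + urlencode({"api-key": api_key})

# NYT_API_KEY is a hypothetical variable name; a placeholder is used if it is unset
api_key = os.environ.get("NYT_API_KEY", "YOUR_KEY_HERE")
url = build_url(api_key)
print(url.split("?")[0])  # print the endpoint without exposing the key
```

With `requests`, a similar effect can be had by passing `params={'api-key': api_key}` to `requests.get()`, which builds the query string itself.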

In [266]:
json_data

{'status': 'OK',
 'copyright': 'Copyright (c) 2020 The New York Times Company.  All Rights Reserved.',
 'num_results': 15,
 'last_modified': '2020-12-03T00:39:01-05:00',
 'results': {'list_name': 'Hardcover Fiction',
  'list_name_encoded': 'hardcover-fiction',
  'bestsellers_date': '2020-11-28',
  'published_date': '2020-12-13',
  'published_date_description': 'latest',
  'next_published_date': '',
  'previous_published_date': '2020-12-06',
  'display_name': 'Hardcover Fiction',
  'normal_list_ends_at': 15,
  'updated': 'WEEKLY',
  'books': [{'rank': 1,
    'rank_last_week': 0,
    'weeks_on_list': 1,
    'asterisk': 0,
    'dagger': 0,
    'primary_isbn10': '1524761338',
    'primary_isbn13': '9781524761332',
    'publisher': 'Ballantine',
    'description': 'In a sequel to “Ready Player One,” Wade Watts discovers a technological advancement and goes on a new quest.',
    'price': 0,
    'title': 'READY PLAYER TWO',
    'author': 'Ernest Cline',
    'contributor': 'by Ernest Cline',
 

In [267]:
#look at how the data is organized
json_data.keys()

dict_keys(['status', 'copyright', 'num_results', 'last_modified', 'results'])

In [268]:
#look at the keys nested under 'results'
json_data['results'].keys()

dict_keys(['list_name', 'list_name_encoded', 'bestsellers_date', 'published_date', 'published_date_description', 'next_published_date', 'previous_published_date', 'display_name', 'normal_list_ends_at', 'updated', 'books', 'corrections'])

In [269]:
#pull the list of book records
data = json_data['results']['books']

In [270]:
#flatten the list of book records into a dataframe
#(pd.json_normalize replaces the deprecated pandas.io.json.json_normalize)
df = pd.json_normalize(data)
df.head()


Unnamed: 0,rank,rank_last_week,weeks_on_list,asterisk,dagger,primary_isbn10,primary_isbn13,publisher,description,price,...,book_image_height,amazon_product_url,age_group,book_review_link,first_chapter_link,sunday_review_link,article_chapter_link,isbns,buy_links,book_uri
0,1,0,1,0,0,1524761338,9781524761332,Ballantine,"In a sequel to “Ready Player One,” Wade Watts ...",0,...,500,https://www.amazon.com/dp/1524761338?tag=NYTBS...,,,,,,"[{'isbn10': '1524761338', 'isbn13': '978152476...","[{'name': 'Amazon', 'url': 'https://www.amazon...",nyt://book/473f18d4-0433-5c42-abaa-54c7c9dd26e5
1,2,0,1,0,0,316420255,9780316420259,"Little, Brown",The 28th book in the Alex Cross series. An inv...,0,...,500,https://www.amazon.com/dp/0316420255?tag=NYTBS...,,,,,,"[{'isbn10': '0316420255', 'isbn13': '978031642...","[{'name': 'Amazon', 'url': 'https://www.amazon...",nyt://book/662286db-c00a-50de-9ba0-dec968f24b50
2,3,5,9,0,0,1538728575,9781538728574,Grand Central,A doctor serving in the Navy in Afghanistan go...,0,...,500,https://www.amazon.com/dp/1538728575?tag=NYTBS...,,,,,,"[{'isbn10': '1538728575', 'isbn13': '978153872...","[{'name': 'Amazon', 'url': 'https://www.amazon...",nyt://book/b9bf792c-a853-54ce-8f33-7117f51be365
3,4,3,7,0,0,385545967,9780385545969,Doubleday,The third book in the Jake Brigance series. A ...,0,...,500,https://www.amazon.com/dp/0385545967?tag=NYTBS...,,,,,,"[{'isbn10': '0385545967', 'isbn13': '978038554...","[{'name': 'Amazon', 'url': 'https://www.amazon...",nyt://book/33a48cf6-d7f3-5113-aa1e-6adcbb3853c3
4,5,2,2,0,0,1538761696,9781538761694,Grand Central,The F.B.I. agent Atlee Pine’s search for her t...,0,...,500,https://www.amazon.com/dp/1538761696?tag=NYTBS...,,,,,,"[{'isbn10': '1538761696', 'isbn13': '978153876...","[{'name': 'Amazon', 'url': 'https://www.amazon...",nyt://book/da8e02be-ca28-5393-9abc-02be863d6cb7
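
To give a feel for what normalizing does, the sketch below is a rough hand-written approximation of the flattening idea: nested dictionaries are collapsed into dot-separated keys, while lists are left untouched (which is why `isbns` and `buy_links` still hold lists in the output above). It is not pandas' implementation, and the `flatten` helper and `sample` record are hypothetical.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dicts, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            # scalars and lists are kept as-is
            flat[full_key] = value
    return flat

# Hypothetical record, not taken from the NYT response
sample = {
    "title": "EXAMPLE",
    "rank": 1,
    "image": {"width": 330, "height": 500},
    "isbns": [{"isbn10": "0000000000"}],
}

flat = flatten(sample)
print(flat)
```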


In [271]:
#display the columns in the dataframe
df.columns

Index(['rank', 'rank_last_week', 'weeks_on_list', 'asterisk', 'dagger',
       'primary_isbn10', 'primary_isbn13', 'publisher', 'description', 'price',
       'title', 'author', 'contributor', 'contributor_note', 'book_image',
       'book_image_width', 'book_image_height', 'amazon_product_url',
       'age_group', 'book_review_link', 'first_chapter_link',
       'sunday_review_link', 'article_chapter_link', 'isbns', 'buy_links',
       'book_uri'],
      dtype='object')

In [272]:
#make a dataframe with selected columns
df[['title', 'author', 'publisher', 'rank', 'weeks_on_list', 'isbns']]

Unnamed: 0,title,author,publisher,rank,weeks_on_list,isbns
0,READY PLAYER TWO,Ernest Cline,Ballantine,1,1,"[{'isbn10': '1524761338', 'isbn13': '978152476..."
1,DEADLY CROSS,James Patterson,"Little, Brown",2,1,"[{'isbn10': '0316420255', 'isbn13': '978031642..."
2,THE RETURN,Nicholas Sparks,Grand Central,3,9,"[{'isbn10': '1538728575', 'isbn13': '978153872..."
3,A TIME FOR MERCY,John Grisham,Doubleday,4,7,"[{'isbn10': '0385545967', 'isbn13': '978038554..."
4,DAYLIGHT,David Baldacci,Grand Central,5,2,"[{'isbn10': '1538761696', 'isbn13': '978153876..."
5,THE AWAKENING,Nora Roberts,St. Martin's,6,1,"[{'isbn10': '1250272610', 'isbn13': '978125027..."
6,THE LAW OF INNOCENCE,Michael Connelly,"Little, Brown",7,3,"[{'isbn10': '0316485624', 'isbn13': '978031648..."
7,RHYTHM OF WAR,Brandon Sanderson,Tor,8,2,"[{'isbn10': '0765326388', 'isbn13': '978076532..."
8,THE SENTINEL,Lee Child and Andrew Child,Delacorte,9,5,"[{'isbn10': 'isbn10 mus', 'isbn13': '978110159..."
9,THE VANISHING HALF,Brit Bennett,Riverhead,10,26,"[{'isbn10': '0525536299', 'isbn13': '978052553..."
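
The `isbns` column still holds a list of dictionaries in each row. One way to unpack it, sketched here in plain Python, is to emit one row per ISBN entry and carry the book's title and rank along. The `books` list is a trimmed stand-in that uses only values visible in the output above; the real records contain more fields and may list several ISBNs per book.

```python
# Trimmed stand-in for two entries of json_data['results']['books'];
# only fields visible in the output above are included.
books = [
    {"title": "READY PLAYER TWO", "rank": 1, "isbns": [{"isbn10": "1524761338"}]},
    {"title": "DEADLY CROSS", "rank": 2, "isbns": [{"isbn10": "0316420255"}]},
]

# One output row per ISBN entry, repeating the parent book's title and rank
rows = [
    {"title": book["title"], "rank": book["rank"], "isbn10": entry["isbn10"]}
    for book in books
    for entry in book["isbns"]
]

for row in rows:
    print(row)
```

In pandas, `pd.json_normalize(data, record_path='isbns', meta=['title', 'rank'])` should produce a similar one-row-per-ISBN table directly from the book records.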
