## ETL Project
## Project Goals:  
#### Extracting Data  
#### Transforming It  
#### Loading it into a database

In [None]:
# Import packages
import pandas as pd
import requests
import json
from pprint import pprint

## First Dataset: Goodreads CSV
We are grateful for this clean and organized dataset! Here's a little information about where to find it and what its creator intended:  
 kaggle source url:  https://www.kaggle.com/jealousleopard/goodreadsbooks
 author description:  "The primary reason for creating this dataset is the requirement of a good clean dataset of books...This prompted me to use the Goodreads API to get a well-cleaned dataset, with the promising features only ( minus the redundant ones ), and the result is the dataset you're at now."

In [17]:
# Read in Goodreads CSV data
file = 'books.csv'
bookpd_df = pd.read_csv(file)

In [21]:
# Print the first two rows to check that the data loaded correctly
bookpd_df.head(2)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,# num_pages,ratings_count,text_reviews_count
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling-Mary GrandPré,4.56,439785960,9780439785969,eng,652,1944099,26249
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling-Mary GrandPré,4.49,439358078,9780439358071,eng,870,1996446,27613


#### Transform
Our first transformation was to drop the columns we did not need: bookID, isbn13, language_code, ratings_count, and text_reviews_count. We kept the title, authors, average_rating, isbn, and page count (`# num_pages`) columns. Although bookID would work as an index, we decided that the ISBN would make a better primary key because it is a universally recognized identifier rather than an ID specific to this dataset.

In [22]:
# Keep only the columns we need
clean_books_df = bookpd_df[['title', 'authors', 'average_rating', 'isbn', '# num_pages']]
clean_books_df.head()

Unnamed: 0,title,authors,average_rating,isbn,# num_pages
0,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling-Mary GrandPré,4.56,0439785960,652
1,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling-Mary GrandPré,4.49,0439358078,870
2,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling-Mary GrandPré,4.47,0439554934,320
3,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439554896,352
4,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling-Mary GrandPré,4.55,043965548X,435
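
Since the ISBN is meant to act as the primary key, it may be worth checking that the values are unique and well formed before loading them. The sketch below is not part of the original notebook: `is_valid_isbn10` is a hypothetical helper that applies the standard ISBN-10 checksum, and `sample_isbns` is a stand-in for `isbn_list` using two ISBNs from the output above.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Return True if `isbn` passes the standard ISBN-10 checksum."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in '0123456789':
            value = int(char)
        elif char in 'Xx' and position == 9:
            value = 10  # 'X' is only allowed as the final check digit
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check digit)
        total += (10 - position) * value
    return total % 11 == 0


# Stand-in for isbn_list (hypothetical name, values taken from the output above)
sample_isbns = ['0439785960', '043965548X']

isbns_are_unique = len(set(sample_isbns)) == len(sample_isbns)
invalid_isbns = [isbn for isbn in sample_isbns if not is_valid_isbn10(isbn)]
print(isbns_are_unique, invalid_isbns)
```

If either check fails on the full list, the ISBN column would need cleaning or de-duplicating before it could serve as a primary key.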


In [42]:
clean_books_df.to_csv('clean_books.csv', index=False)

In [23]:
isbn_list = clean_books_df['isbn'].tolist()

In [24]:
# Print the first four items of the list to check that it loaded correctly
print(isbn_list[0:4])

['0439785960', '0439358078', '0439554934', '0439554896']


## Second Dataset: Open Library API
Our second dataset comes from the Open Library API, which we query by ISBN to get the URL of each book's Open Library page.


In [25]:
# Base URL for the Open Library Books API
url = 'https://openlibrary.org/api/books?'

#### Extraction
Note: the API is easy to work with, so the extraction code itself is simple. However, because we send one request per ISBN and the list is long, this step takes about 5-10 minutes to complete. To save time, we write the completed list of URLs to a file so that the extraction loop does not have to be rerun.

In [26]:
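# Look up each ISBN and collect the URL of the book's Open Library page
books_urls = []

for isbn in isbn_list:
    try:
        query = f'{url}bibkeys=ISBN:{isbn}&format=json'
        response = requests.get(query).json()
        books_urls.append(response[f'ISBN:{isbn}']['info_url'])
    except (requests.RequestException, KeyError, ValueError):
        # The request failed or Open Library has no record for this ISBN
        books_urls.append('no response found')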

In [28]:
print(books_urls[0:4])

['https://openlibrary.org/books/OL24280830M/Harry_Potter_and_the_Half-Blood_Prince', 'https://openlibrary.org/books/OL24330394M/Harry_Potter_and_the_Order_of_the_Phoenix', 'https://openlibrary.org/books/OL26018592M/Harry_Potter_and_the_Sorcerers_Stone', 'no response found']
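
The loop above sends one request per ISBN, which is probably why extraction takes several minutes. The Books API also accepts a comma-separated list in `bibkeys`, so one option is to request the ISBNs in batches. The sketch below is not what we ran: the helper names (`chunked`, `build_params`, `get_json`, `fetch_info_urls`) are hypothetical, the batch size of 50 is an arbitrary assumption rather than a documented limit, and it uses `urllib` from the standard library instead of `requests` to stay dependency-free.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = 'https://openlibrary.org/api/books'
BATCH_SIZE = 50  # assumption: arbitrary batch size, not a documented API limit


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_params(isbns):
    """Build the query parameters for one batched request."""
    return {
        'bibkeys': ','.join(f'ISBN:{isbn}' for isbn in isbns),
        'format': 'json',
    }


def get_json(params):
    """Send one request to Open Library and decode the JSON response."""
    with urlopen(f'{API_URL}?{urlencode(params)}', timeout=30) as response:
        return json.load(response)


def fetch_info_urls(isbns, fetch=get_json, batch_size=BATCH_SIZE):
    """Return one URL (or 'no response found') per ISBN, in input order."""
    found = {}
    for batch in chunked(isbns, batch_size):
        try:
            found.update(fetch(build_params(batch)))
        except (OSError, ValueError):
            # Network error or undecodable response: this batch stays unmatched
            continue
    return [
        found.get(f'ISBN:{isbn}', {}).get('info_url', 'no response found')
        for isbn in isbns
    ]
```

With these helpers, `books_urls = fetch_info_urls(isbn_list)` would stand in for the loop above. The `fetch` parameter exists so the function can be exercised with a fake response instead of a live request.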


In [None]:
# Save the URLs to a file so we don't have to rerun the extraction loop
with open('books_urls.txt', 'w') as output:
    for book in books_urls:
        output.write(book + '\n')

In [None]:
# Read the saved URLs back in, one per line
with open('books_urls.txt', 'r') as saved:
    books_urls = [line.rstrip('\n') for line in saved]
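
The save and reload steps can also be folded into a single helper that only runs the slow extraction when no saved file exists yet. This is a minimal sketch, not code from our notebook: `load_or_fetch` is a hypothetical name, and `fetch` is assumed to be any zero-argument function that returns the list of URL strings.

```python
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return the cached lines if the file exists; otherwise fetch and cache them."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: skip the slow extraction step entirely
        return path.read_text(encoding='utf-8').splitlines()
    values = fetch()
    # Cache miss: store one value per line for the next run
    path.write_text(''.join(f'{value}\n' for value in values), encoding='utf-8')
    return values
```

Deleting `books_urls.txt` would then be enough to force a fresh extraction on the next run.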

In [33]:
# create a dictionary to hold all responses
books_dict = {
    'isbn': isbn_list,
    'library_url': books_urls
}

In [31]:
books_dict

{'isbn': ['0439785960',
  '0439358078',
  '0439554934',
  '0439554896',
  '043965548X',
  '0439682584',
  '0976540606',
  '0439827604',
  '0517226952',
  '0345453743',
  '1400052920',
  '0739322206',
  '0517149257',
  '076790818X',
  '0767915062',
  '0767910435',
  '0767903862',
  '076790382X',
  '0060920084',
  '0380713802',
  '0380727501',
  '0380715430',
  '0345538374',
  '0618517650',
  '0618346244',
  '0618346252',
  '0618260587',
  '0618391002',
  '0618510826',
  '0618153977',
  '193337201X',
  '097669400X',
  '0689840926',
  '1557344493',
  '0385326505',
  '1575606240',
  '1595580271',
  '1595962808',
  '0670059676',
  '0141312629',
  '0595321801',
  '1590301943',
  '0449146979',
  '0061159174',
  '006076273X',
  '0060749911',
  '0273704745',
  '1932386106',
  '096513671X',
  '0374517193',
  '0374280398',
  '0374519749',
  '0374522596',
  '0374518734',
  '0374522871',
  '0374519323',
  '0374516006',
  '0374520658',
  '0822205106',
  '0679734996',
  '1596670231',
  '1581807740',


In [34]:
books_urls_df = pd.DataFrame.from_dict(books_dict)
books_urls_df

Unnamed: 0,isbn,library_url
0,0439785960,https://openlibrary.org/books/OL24280830M/Harr...
1,0439358078,https://openlibrary.org/books/OL24330394M/Harr...
2,0439554934,https://openlibrary.org/books/OL26018592M/Harr...
3,0439554896,no response found
4,043965548X,https://openlibrary.org/books/OL27305590M/Harr...
...,...,...
3495,0618537252,https://openlibrary.org/books/OL3410175M/Gatsb...
3496,0330419129,no response found
3497,0330351699,no response found
3498,0333695275,https://openlibrary.org/books/OL10555114M/Into...


In [43]:
books_urls_df.to_csv('urls.csv', index=False)

In [40]:
# Join the two dataframes: clean_books_df and books_urls_df
combined_all = pd.merge(clean_books_df, books_urls_df, on='isbn', how='left')

# Print the first two rows to check that it merged correctly
combined_all.head(2)

Unnamed: 0,title,authors,average_rating,isbn,# num_pages,library_url
0,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling-Mary GrandPré,4.56,439785960,652,https://openlibrary.org/books/OL24280830M/Harr...
1,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling-Mary GrandPré,4.49,439358078,870,https://openlibrary.org/books/OL24330394M/Harr...
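
Looking at the first two rows does not show whether the merge duplicated or dropped anything. A left merge keeps every row of the left table, but repeated ISBNs on either side would multiply rows. As a hedged sketch, the `validate` and `indicator` options of `pd.merge` can make both problems visible. The two small DataFrames below are made-up stand-ins for `clean_books_df` and `books_urls_df`, and the `example.org` URL is a placeholder.

```python
import pandas as pd

# Made-up stand-ins for clean_books_df and books_urls_df
books = pd.DataFrame({'title': ['A', 'B', 'C'], 'isbn': ['1', '2', '3']})
urls = pd.DataFrame({
    'isbn': ['1', '2'],
    'library_url': ['https://example.org/book-1', 'no response found'],
})

# validate raises an error if an ISBN is repeated on either side;
# indicator adds a '_merge' column showing where each row came from
combined = pd.merge(
    books, urls, on='isbn', how='left',
    validate='one_to_one', indicator=True,
)

row_count_unchanged = len(combined) == len(books)
unmatched = combined[combined['_merge'] == 'left_only']
print(row_count_unchanged, len(unmatched))
```

Rows marked `left_only` are books that found no partner in the URL table at all, which is a different case from books whose lookup returned 'no response found'.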


#### Loading 
The final step is to load our two data tables into a database. We created a new table in the database to confirm that the data loaded properly and that the ISBN worked correctly as a primary key.
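
The loading code is not shown in this notebook, so the following is only a sketch of what the step could look like, assuming SQLite. The table name `books`, the column choice, and the `load_books` helper are all hypothetical, and the sample rows reuse values from the outputs above.

```python
import sqlite3

# Stand-in rows in (isbn, title, average_rating) order; in the notebook they
# could come from combined_all[['isbn', 'title', 'average_rating']]
# .itertuples(index=False, name=None)
rows = [
    ('0439785960', 'Harry Potter and the Half-Blood Prince', 4.56),
    ('0439358078', 'Harry Potter and the Order of the Phoenix', 4.49),
]


def load_books(connection, rows):
    """Create the books table if needed and insert the given rows."""
    connection.execute(
        'CREATE TABLE IF NOT EXISTS books ('
        'isbn TEXT PRIMARY KEY, title TEXT NOT NULL, average_rating REAL)'
    )
    connection.executemany(
        'INSERT INTO books (isbn, title, average_rating) VALUES (?, ?, ?)',
        rows,
    )
    connection.commit()


connection = sqlite3.connect(':memory:')  # assumption: in-memory database for the demo
load_books(connection, rows)
loaded_count = connection.execute('SELECT COUNT(*) FROM books').fetchone()[0]

# Inserting an ISBN that already exists should violate the primary key
try:
    load_books(connection, rows[:1])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    connection.rollback()
    duplicate_rejected = True

print(loaded_count, duplicate_rejected)
```

Storing the ISBN as text rather than as an integer keeps leading zeros and a trailing `X` check digit intact.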

## Final Notes
The most difficult part of the project was finding useful datasets. The Goodreads CSV was a great place to start, but finding something to pair it with proved challenging. Scraping Amazon or another bookseller's website would have been ideal, but automating that process in a loop would have been tedious and challenging, especially since we did not already have a URL for each book. An API was a good alternative, but we could not find a reliable one that showed pricing information. The Open Library API was easy to use and still added a second source of information on top of the original CSV. Someone who searches the dataset by ISBN can locate the book and its Goodreads information, and will likely be able to open it on the Open Library website. From there, they can find seller information and other ways to get access to the text itself.
### Potential Analysis
Some potential analyses to run:  
- how many authors there are in total in the database, and how many books each author has
- how many books have an Open Library URL vs. how many do not
- the average rating of books with an Open Library URL vs. the average rating of books without one
- the number of books attributed to authors with an Open Library URL vs. the number of books attributed to authors without one
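
As a rough sketch of how the first three questions could be answered with pandas, the example below uses a small made-up DataFrame in place of `combined_all`. The author names, ratings, and `example.org` URLs are invented for illustration, and `has_url` is a hypothetical helper column.

```python
import pandas as pd

# Made-up stand-in for combined_all
sample = pd.DataFrame({
    'authors': ['Author A', 'Author A', 'Author B', 'Author C'],
    'average_rating': [4.5, 4.0, 3.5, 4.4],
    'library_url': [
        'https://example.org/book-1',
        'no response found',
        'https://example.org/book-2',
        'no response found',
    ],
})

# Flag the books whose Open Library lookup succeeded
sample['has_url'] = sample['library_url'] != 'no response found'

books_per_author = sample['authors'].value_counts()
author_count = sample['authors'].nunique()
url_counts = sample['has_url'].value_counts()
mean_rating_by_url = sample.groupby('has_url')['average_rating'].mean()

print(author_count)
print(books_per_author)
print(url_counts)
print(mean_rating_by_url)
```

In the real data the `authors` column joins co-authors into a single string (for example `J.K. Rowling-Mary GrandPré`), so per-author counts would probably need that column split first.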