<a href="https://colab.research.google.com/github/maureenwidjaja/PIC16B-Group-Project/blob/main/PIC16B_group_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Scrape Books Using Open Library API
- get by 'Subject' name ->> can be anything, e.g. "fantasy" etc.



In [2]:
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import json
import pandas as pd
import numpy as np

In [None]:
import requests

def get_books_by_subject(subject, limit=100, details=True, ebooks=False, published_in=None, offset=0):
    '''
    Args:
    details: if True, includes related subjects, prolific authors, and publishers.
    ebooks: if True,  filters for books with e-books.
    published_in: filters by publication year.
                  For example:
                  http://openlibrary.org/subjects/love.json?published_in=1500-1600
    limit: num of works to include in the response, controls pagination.
    offset: starting offset in the total works, controls pagination.
    '''
    # Creates the API endpoint URL using the subject provided.
    base_url = (f'https://openlibrary.org/subjects/{subject}.json?limit=1')


    # Sends an HTTP GET request to Open Library's API with the query parameters
    # stored in params.
    # The response is stored in response, which contains JSON data.
    response = requests.get(base_url)#, params=params)

    if response.status_code != 200:
        print(f"Error fetching books for {subject}")
        return []

    data = response.json()
    books = data.get("works", [])

    if not books:
        print(f"No books found for {subject}")
        return []

    book_list = []
    for book in books:
        title = book.get("title", "Unknown Title")
        author = book["authors"][0]["name"] if book.get("authors") else "Unknown Author"
        published_year = book.get("first_publish_year", "Unknown Year")

        # Other details we may need for
        subjects = ", ".join(book.get("subject", ["No subjects available"]))
        description = book.get("description", "No description available")
        ebook_available = book.get("ebook_count_i", 0) > 0
        publishers = ", ".join(book.get("publishers", ["Unknown Publisher"]))

        book_list.append({
            "title": title,
            "author": author,
            "published_year": published_year,
            "subjects": subjects,
            "description": description,
            "ebook_available": ebook_available,
            "publishers": publishers
        })

    return book_list  # Return the list of books


# Step 2: Combine into one genre
For instance, "sci-fi" subject and "science-fiction" subject returns different results. So, our next objective is to combine all of them into one genre "Science Fiction". The same goes for other genres like "Romance" or "Fantasy".


### Function to combine sub-genres into a big genre:

In [None]:
def combine_genre(subject):
    """
    Args:
    subject: book subject

    This function collects book lists under sub-genres and combines them into
    one main genre.

    Returns:
    List of all books under a specific genre.
    """

    if subject is None:
        raise ValueError("Please pass a subject name.")

    # Dictionary of genres and their corresponding lists with adjusted formatting
    genre_dict = {
        "romance": [
            "fiction_romance_general", "fiction_romance_historical_general",
            "romance", "man_woman_relationships", "fiction_romance_suspense",
            "fiction_romance_contemporary",
            "fiction_romance_erotica", "fiction_romance_erotic",
            "marriage_fiction", "fiction_erotica_general", "romance",
            "fiction_christian_romance_general", "fiction_romance_historical"
        ],
        "fantasy": [
            "fiction", "fantasy_fiction", "magic", "fiction_fantasy_general",
            "adventure_and_adventurers_fiction",
            "adventure_and_adventurers", "good_and_evil", "fairies", "dragons",
            "cartoons_and_comics", "witchcraft", "history", "wizards", "fairies_fiction"
        ],
        "historical_fiction": [
            "fiction", "historical_fiction", "history", "fiction_historical_general",
            "fiction_romance_historical_general", "fiction_historical", "fiction_general",
            "fiction_romance_historical", "world_war_1939_1945", "great_britain_fiction"
        ],
        "horror": [
            "fiction", "horror", "horror_stories", "horror_tales", "american_horror_tales",
            "horror_fiction", "detective_and_mystery_stories", "crime", "catalepsy", "murder",
            "burial_vaults"
        ],
        "humor": [
            "anecdotes", "humor_general", "american_wit_and_humor",
            "wit_and_humor", "humour", "humor", "funny"
        ],
        "literature": [
            "philosophy", "in_literature", "theory", "criticism", "criticism_and_interpretation",
            "english_literature", "modern_literature", "american_literature",
            "literature", "litterature"
        ],
        "mystery_thriller": [
            "detective_and_mystery_stories", "mystery_fiction", "murder", "mystery",
            "thriller", "detective", "fiction_thrillers_general",
            "suspense", "fiction_thrillers_suspense", "fiction_suspense",
            "mystery", "thriller", "murder",
            "fiction_thrillers_espionage", "police",
            "suspense_fiction", "fiction_general", "detective_and_mystery_stories",
            "crimes_against", "fiction_psychological", "investigation"
        ],
        "science_fiction": [
            "science_fiction", "fiction_science_fiction_general", "american_science_fiction",
            "extraterrestrial_beings", "life_on_other_planets", "extraterrestrial_beings_fiction",
            "time_travel", "sci_fi", "sci-fi", "science-fiction"
        ]
    }

    if subject not in genre_dict:
        raise ValueError("Invalid genre. Please choose from the predefined genres: \
        romance, fantasy, historical_fiction, horror, humor, literature, \
        mystery_thriller, science_fiction.")

    books_under_genre = []
    seen_books = set()  # To store unique books
    i = 1
    print(f"\nBooks under the genre '{subject}':\n")

    for sub_genre in genre_dict[subject]:
        books = get_books_by_subject(sub_genre)  # Get books for the sub-genre

        if books:
            for book in books:
                # Extract the book title and author for uniqueness check)
                title_author = book['author']
                if title_author is None:
                  print("no author")

                # Ensure no dupicates
                if title_author not in seen_books:
                    print(f"{i}. {book['title']} by {book['author']}")
                    books_under_genre.append(book)
                    seen_books.add(title_author)
                    i += 1

        else:
            print(f"No books found for sub-genre '{sub_genre}'")

    return books_under_genre


### Romance books:

In [None]:
romance_books = combine_genre("romance")


Books under the genre 'romance':

1. Pride and Prejudice by Jane Austen
2. Wuthering Heights by Emily Brontë
3. Is he lying to you? by Dan Crum
4. Rebecca by Daphne du Maurier
5. Loving by Danielle Steel
6. Fifty Shades Freed by E. L. James
7. Memoirs of Fanny Hill by John Cleland
8. Decamerone by Giovanni Boccaccio
9. Far From the Madding Crowd by Thomas Hardy


### Fantasy books:

In [None]:
fantasy_books = combine_genre("fantasy")


Books under the genre 'fantasy':

1. Pride and Prejudice by Jane Austen
2. Alice's Adventures in Wonderland by Lewis Carroll
3. The Marvelous Land of Oz by L. Frank Baum
4. Five Children and It by Edith Nesbit
5. A Christmas Carol by Charles Dickens
6. Harry Potter and the Chamber of Secrets by J. K. Rowling


### Historical Fiction books:

In [None]:
combine_genre("historical_fiction")


Books under the genre 'historical_fiction':

1. Pride and Prejudice by Jane Austen (1813)
2. A Christmas Carol by Charles Dickens (1843)
3. Alice's Adventures in Wonderland by Lewis Carroll (1865)
4. The 12th SS by Meyer, Hubert (2005)


### Horror Books

In [None]:
combine_genre("horror")


Books under the genre 'horror':

1. Pride and Prejudice by Jane Austen (1813)
2. The Picture of Dorian Gray by Oscar Wilde (1890)
3. Frankenstein or The Modern Prometheus by Mary Shelley (1818)
4. Carrie by Stephen King (1974)
5. A Study in Scarlet by Arthur Conan Doyle (1887)
6. The Hound of the Baskervilles by Arthur Conan Doyle (1900)
7. Memoirs of Sherlock Holmes [11 stories] by Arthur Conan Doyle (1893)
8. The Works of Edgar Allan Poe in Five Volumes by Edgar Allan Poe (1903)


### Humor Books

In [None]:
combine_genre("humor")


Books under the genre 'humor':

1. The Second Jungle Book by Rudyard Kipling (1887)
2. Candide by Voltaire (1746)
3. Three Men in a Boat (to say nothing of the dog) by Jerome Klapka Jerome (1889)
4. Adventures of Huckleberry Finn by Mark Twain (1876)
5. Alice's Adventures in Wonderland by Lewis Carroll (1865)
6. The BFG by Roald Dahl (1980)


### Literature Books

In [None]:
combine_genre("literature")


Books under the genre 'literature':

1. The Art of War by Sun Tzu (1900)
2. Bible by Bible (1200)
3. La Poetica by Aristotle (1479)
4. The Merchant of Venice by William Shakespeare (1600)
5. Alice's Adventures in Wonderland by Lewis Carroll (1865)
6. Pride and Prejudice by Jane Austen (1813)
7. Don Quixote by Miguel de Cervantes Saavedra (1600)
8. Adventures of Huckleberry Finn by Mark Twain (1876)
9. Literacy for the 21st Century by Gail E. Tompkins (1996)


### Mystery/Thriller Books

In [None]:
combine_genre("mystery_thriller")


Books under the genre 'mystery_thriller':

1. A Study in Scarlet by Arthur Conan Doyle (1887)
2. The Hound of the Baskervilles by Arthur Conan Doyle (1900)
3. Treasure Island by Robert Louis Stevenson (1880)
4. Murder on the Orient Express by Agatha Christie (1933)
5. A Christmas Carol by Charles Dickens (1843)
6. The Thirty-Nine Steps by John Buchan (1915)
7. The Moonstone by Wilkie Collins (1868)
8. Alice's Adventures in Wonderland by Lewis Carroll (1865)
9. The Da Vinci Code by Dan Brown (2003)
10. Wuthering Heights by Emily Brontë (1846)
11. The Mysterious Affair at Styles by Agatha Christie (1920)


### Science Fiction Books

In [None]:
combine_genre("science_fiction")


Books under the genre 'science_fiction':

1. Alice's Adventures in Wonderland by Lewis Carroll (1865)
2. Frankenstein or The Modern Prometheus by Mary Shelley (1818)
3. Fahrenheit 451 by Ray Bradbury (1953)
4. The War of the Worlds by H. G. Wells (1898)
5. The Time Machine by H. G. Wells (1895)
6. The Giver by Lois Lowry (1993)
7. A Wrinkle in Time by Madeleine L'Engle (1962)


# Create table for putting books based on genre

In [None]:
def book_details(books_under_genre):
  book_data = {'ISBN': [],
               'Title': [],
               'Author': [],
               'Published_year': [],
               'Subject': []
               }


# Step 3 - Scrape User Ratings & Reviews

In [4]:
import requests
from bs4 import BeautifulSoup
import re
link = "https://openlibrary.org/subjects"
data = requests.get(link).text
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
book_link = "https://openlibrary.org/works/OL103123W/Fahrenheit_451#reader-observations"

def link2soup(link):
    data = requests.get(link).text
    return BeautifulSoup(data)

soup = link2soup(book_link)

In [None]:
def parse_to_books(link):
    '''
    Assume we start in "https://openlibrary.org/subjects" (base link) page,
    crawls to each chosen subject link and add it to the base link.
    '''
    # assume start in "https://openlibrary.org/subjects" page
    chosen_subjects = {
        "romance", "fantasy", "historical_fiction", "horror", "humor",
        "literature", "mystery_and_detective_stories", "science_fiction"
    }

    # Extract all subjects links
    all_links = [a['href'] for a in soup.select("div#subjectsPage ul li a") if 'href' in a.attrs]
    # print(all_links[:5])

    # Filter only the chosen subjects
    filtered_links = [
        link for link in all_links
        # ensure only parsing on wanted subjects
        if any(sub in link for sub in chosen_subjects) and "juvenile_literature" not in link]
    # print(filtered_links)

    # assume we wish to click the 'fantasy' (specific subject) page
    # crawling through links
    specific_subject_urls = []
    for i in range(len(filtered_links)):
        next_button = filtered_links[i].split('/')[2]
        specific_subject_url = link + "/" + next_button
        specific_subject_urls.append(specific_subject_url)

    return(specific_subject_urls)

In [None]:
subject_links = parse_to_books(link)
subject_links

['https://openlibrary.org/subjects/fantasy',
 'https://openlibrary.org/subjects/historical_fiction',
 'https://openlibrary.org/subjects/horror',
 'https://openlibrary.org/subjects/humor',
 'https://openlibrary.org/subjects/literature',
 'https://openlibrary.org/subjects/mystery_and_detective_stories',
 'https://openlibrary.org/subjects/romance',
 'https://openlibrary.org/subjects/science_fiction']

In [None]:
# not complete in a function yet
# assumes we want to go to the total "works" under each subject
specific_subject_url_book = []
for url in subject_links:
    response2 = requests.get(url)  # Make request to each URL
    soup2 = BeautifulSoup(response2.text, 'html.parser')  # Parse the HTML

    # Extract total works link
    total_works_link = [a['href'] for a in soup2.select("span#coversCount.count a") if 'href' in a.attrs]

    # Print extracted links (if any)
    print(f"Total works links for {url}: {total_works_link}")

    # Ensure the extracted links list is not empty before iterating
    if total_works_link:
        for i in range(len(total_works_link)):
            next_button2 = total_works_link[i].split('/')[1]  # Extract the relevant part
            print(f"Extracted path: {next_button2}")  # Debugging output

            # Construct the full URL
            specific_subject_url = url + "/" + next_button2  # Using `url`, not `link`
            specific_subject_url_book.append(specific_subject_url)  # Store the URLs

# Print all collected subject URLs
print("\nFinal collected URLs:", specific_subject_url_book)

Total works links for https://openlibrary.org/subjects/fantasy: ['/search?subject=Fantasy']
Extracted path: search?subject=Fantasy
Total works links for https://openlibrary.org/subjects/historical_fiction: ['/search?subject=Historical fiction']
Extracted path: search?subject=Historical fiction
Total works links for https://openlibrary.org/subjects/horror: ['/search?subject=Horror']
Extracted path: search?subject=Horror
Total works links for https://openlibrary.org/subjects/humor: []
Total works links for https://openlibrary.org/subjects/literature: ['/search?subject=Literature']
Extracted path: search?subject=Literature
Total works links for https://openlibrary.org/subjects/mystery_and_detective_stories: ['/search?subject=Mystery and detective stories']
Extracted path: search?subject=Mystery and detective stories
Total works links for https://openlibrary.org/subjects/romance: ['/search?subject=Romance']
Extracted path: search?subject=Romance
Total works links for https://openlibrary.or

In [13]:
# extracts the link to each book on a page
def book_links(subject):
    subject_link = f'https://openlibrary.org/search?subject={subject}'
    data = requests.get(subject_link).text
    response = requests.get(subject_link)
    soup = BeautifulSoup(response.text, 'html.parser')

    book_links = []
    endof_links = [a['href'] for a in soup.select("div#searchResults li.searchResultItem.sri--w-main a.results")
                 if 'href' in a.attrs]
    for link in endof_links:
        book_links.append(subject_link + link )

    return book_links

print(book_links ('Romance'))

['https://openlibrary.org/search?subject=Romance/works/OL77775W?edition=key%3A/books/OL46930495M', 'https://openlibrary.org/search?subject=Romance/works/OL1003017W?edition=key%3A/books/OL44595490M', 'https://openlibrary.org/search?subject=Romance/works/OL44995W?edition=key%3A/books/OL37044526M', 'https://openlibrary.org/search?subject=Romance/works/OL15450151W?edition=key%3A/books/OL20475668M', 'https://openlibrary.org/search?subject=Romance/works/OL274518W?edition=key%3A/books/OL25954563M', 'https://openlibrary.org/search?subject=Romance/works/OL100239W?edition=key%3A/books/OL37847175M', 'https://openlibrary.org/search?subject=Romance/works/OL15437W?edition=key%3A/books/OL26320183M', 'https://openlibrary.org/search?subject=Romance/works/OL1230715W?edition=key%3A/books/OL21214329M', 'https://openlibrary.org/search?subject=Romance/works/OL81294W?edition=key%3A/books/OL44811282M', 'https://openlibrary.org/search?subject=Romance/works/OL15450W?edition=key%3A/books/OL23325250M', 'https://o

In [None]:
# follows the link to the next page so we can continue parsing
def next_button(subject):
    subject_link = f'https://openlibrary.org/search?subject={subject}'
    data = requests.get(subject_link).text
    response = requests.get(subject_link)
    soup = BeautifulSoup(response.text, 'html.parser')

    next_button = soup.select('div.pagination a.ChoosePage')

    last_page = soup.select_one('a.ChoosePage[data-ol-link-track="Pager|LastPage"]').get_text()

    if not next_button:
        return
    return (base_url+next_button[0].attrs['href'])
# print(next_button('Romance'))

# need to somehow loop this so it's not only going from the first page to the secon

In [15]:
import requests
from bs4 import BeautifulSoup

def every_page(subject):
    base_url = 'https://openlibrary.org'
    subject_link = f'https://openlibrary.org/search?subject={subject}'
    response = requests.get(subject_link)
    soup = BeautifulSoup(response.text, 'html.parser')

    last_page = soup.select_one('a.ChoosePage[data-ol-link-track="Pager|LastPage"]').get_text()
    last_page_number = int(last_page)

    page_links = []

    for page in range(1, last_page_number + 1):
        page_link = f'{subject_link}&page={page}'
        response = requests.get(page_link)
        soup = BeautifulSoup(response.text, 'html.parser')

        book_links = [a['href'] for a in soup.select("div#searchResults li.searchResultItem.sri--w-main a.results") if 'href' in a.attrs]

        for link in book_links:
            page_links.append(base_url + link + '\n')  # Adding newline character for each link

    return page_links

print(every_page('Romance'))


KeyboardInterrupt: 

In [None]:
def parse_book_review(subject_links):
  pass

In [None]:
community_reviews = [x.get_text() for x in soup.select("span.reviews__value")]
community_data = [re.search(r'\s*(\w[\w\s]*)',x).group(1).strip() for x in community_reviews]
print(community_reviews)

[]


In [None]:
community_reviews = [x.get_text() for x in soup.select("span.reviews__value")]
community_data = [re.search(r'\s*(\w[\w\s]*)',x).group(1).strip() for x in community_reviews]

review_percentage = [x.get_text() for x in soup.select("span.percentage")]
percentage_data = [re.search(r'\s*(\w[\w\s]*)',x).group(1).strip() for x in review_percentage]

import pandas as pd
df = pd.DataFrame(data = {
    "Review" : community_data,
    "Ratings" : percentage_data
   # "Number of Reviews" : number_of_reviews

})
df

Unnamed: 0,Review,Ratings


In [None]:
number_of_reviews = [x.get_text() for x in soup.select("h2.observation-title")]
number_of_reviews = [re.search(r'\((\d+)\)',x).group(1).strip() for x in number_of_reviews]

In [None]:
book_button1 = [x.get_text() for x in soup.select('h3', class_= 'booktitle')]
book_button2 = [x.get_text() for x in soup.select('div', class_= 'book-cover')]


May need:

1. List of books that we like
2. Books in general (found above)

We can use both data to recommend books we might like (haven't read yet) - do this by finding all users who like the same books as us and then seeing what other books they like. We'll use those results to create that recommendation

## Collaborative Filtering

We only want to see books that have been reviewed more than 15 times.

# What to do next:
1. Build ML model
  - training data: csv file containing books in a specific genre?
  - testing data: our prediction now?

2. Approaches to consider:
  - Collaborative Filtering (based on user ratings, user reviews e.g. Goodreads)
  - Content-Based Filtering (based on genre, content description, etc.)
  - Combination of both Filtering Methods

3. Define Training Data
  - What should the csv file include?
    1. Book Information: Book ID, Title, Author, Genres, Description
    2. User Ratings: User ID, Book ID, Rating, User Reviews

4. Machine Learning Models to consider:
  - Content-Based Filtering: Book descriptions and genres
      - TF-IDF (Term Frequency-Inverse Document Frequency): evaluates the importance of a word in a document : https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
      - Sci-Kit Learn: classifiers, feature-extraction
  - Collaborative Filtering: User ratings and reviews
      - Single Value Decomposition (SVD): can decompose a matrix into 3 matrices, good for ratings: https://www.geeksforgeeks.org/singular-value-decomposition-svd/
  - From surprise: https://surpriselib.com/


5. Hybrid model
  - Step 1: Get the top books for the user through collaborative filtering
  - Step 2: Find the most similar books through content based filtering
  - Step 3: Return the list of recommended books



In [None]:
# create dataframe (csv file) of books


In [None]:
# import SVD, import test train split
from surprise import SVD
from surprise.model_selection import test_train_split