
# Step 1: Scrape Books Using Open Library API
- Fetch works by subject name. The subject can be anything, e.g. "fantasy".



In [1]:
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import json
import pandas as pd
import numpy as np

In [17]:
import requests

def get_books_by_subject(subject, limit, details=True, ebooks=False, published_in=None, offset=0):
    '''
    Prints the works that Open Library lists under a subject.

    Args:
    subject: subject name used in the URL, e.g. "fantasy" or "love".
    limit: num of works to include in the response, controls pagination.
    details: if True, includes related subjects, prolific authors, and publishers.
    ebooks: if True, filters for books with e-books.
    published_in: filters by publication year range.
                  For example:
                  http://openlibrary.org/subjects/love.json?published_in=1500-1600
    offset: starting offset in the total works, controls pagination.
    '''

    if (limit <= 0):
      print("limit must be a positive integer")
      return

    # Keyword lists for differentiating between genres (not used yet)
    genre_fantasy = ["magic", "adventure", "fairy tale", "epic fantasy",
              "illusion", "legend", "wizards", "dragons", "heroes"]
    genre_literary_fiction = ["literary", "prose", "literary fiction"]
    genre_historical_fiction = ["period", "historical", "ancient",
                                "medieval", "war", "past", "historical fiction"]
    genre_sci_fi = ["space", "future", "technology", "aliens", "robots",
                    "cyberpunk", "time travel", "dystopia", "starships",
                    "extraterrestrial"]
    genre_mystery = ["detective", "investigation", "secrets", "puzzle", "crime",
                     "suspect"]
    genre_thriller = ["suspense", "intense", "action", "plot twist", "chase",
                      "danger", "mystery", "cliffhanger", "adrenaline", "crime"]
    genre_horror = ["fear", "supernatural", "monsters", "ghosts", "psychological",
                    "terror", "darkness", "haunted", "death", "creepy"]

    # Creates the API endpoint URL using the subject provided.
    base_url = (f'https://openlibrary.org/subjects/{subject}.json')

    # Dict to store query parameters.
    params = {
        "limit": limit,
        "offset": offset
    }

    # Add the optional filters to the API request only when they are set
    if details:
        params["details"] = "true"
    if ebooks:
        params["ebooks"] = "true"
    if published_in:
        params["published_in"] = published_in  # Example: "2000-2020"

    # Sends an HTTP GET request to Open Library's API with the query parameters
    # stored in params.
    # The response is stored in response, which contains JSON data.
    response = requests.get(base_url, params=params)

    # Checks if the request was successful.
    # Returning early here avoids using `books` before it is defined.
    if response.status_code != 200:
      print(f"Failed to fetch data. Status code: {response.status_code}")
      return

    data = response.json()
    # Retrieves the list of books (works).
    # If missing, falls back to an empty list.
    books = data.get("works", [])

    if not books:
      print("No books found for this subject.")
      return

    # Extract start and end year if range is given
    start_year, end_year = None, None
    if published_in and "-" in published_in:
      # convert string into two integers, e.g: 1900 and 2000.
      start_year, end_year = map(int, published_in.split("-"))

    print(f"Books in {subject}:")
    for index, book in enumerate(books, start=1):
      title = book.get("title", "Unknown Title")
      # Guard against a missing or empty "authors" list
      authors = book.get("authors") or []
      author = authors[0].get("name", "Unknown Author") if authors else "Unknown Author"
      published_year = book.get("first_publish_year", "Unknown Year")


      # Only filter when the published year is an integer and a range was given.
      # `index` is the position in the API response, so the printed numbering
      # skips any books that are filtered out here.
      if isinstance(published_year, int) and start_year and end_year:
        if not (start_year <= published_year <= end_year):
          continue  # Skip books outside the range

      print(f"{index}. {title} by {author} ({published_year})")
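
The `limit` and `offset` parameters described in the docstring control pagination. As a minimal sketch of how they could be combined to walk through a subject page by page, the helper below requests one page at a time and stops when a page comes back short. The names `iter_works`, `fetch_page_from_open_library`, and `fake_fetch` are hypothetical and not part of the notebook; the fake fetcher only stands in for the API so the paging logic can be tried offline.

```python
def iter_works(fetch_page, subject, page_size=50, max_works=200):
    """Yield works for `subject`, requesting `page_size` works per call.

    `fetch_page(subject, limit, offset)` must return a list of work dicts.
    """
    offset = 0
    while offset < max_works:
        limit = min(page_size, max_works - offset)
        works = fetch_page(subject, limit, offset)
        if not works:
            break  # no more results
        yield from works
        if len(works) < limit:
            break  # a short page means this was the last one
        offset += len(works)


def fetch_page_from_open_library(subject, limit, offset):
    """Fetch one page using the same endpoint and parameters as above."""
    import requests  # imported here so the paging logic can be tried offline

    url = f"https://openlibrary.org/subjects/{subject}.json"
    response = requests.get(
        url, params={"limit": limit, "offset": offset}, timeout=10
    )
    response.raise_for_status()
    return response.json().get("works", [])


# Offline demonstration: a fake fetcher standing in for the API.
FAKE_WORKS = [{"title": f"Book {i}"} for i in range(7)]


def fake_fetch(subject, limit, offset):
    return FAKE_WORKS[offset:offset + limit]


titles = [
    work["title"]
    for work in iter_works(fake_fetch, "fantasy", page_size=3, max_works=10)
]
print(titles)
```

Passing `fetch_page_from_open_library` instead of `fake_fetch` would run the same loop against the live API.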





In [None]:

# test cases:
get_books_by_subject("love", limit=20)

Books in love:
1. Wuthering Heights by Emily Brontë (1846)
2. The Great Gatsby by F. Scott Fitzgerald (1920)
3. Ethan Frome by Edith Wharton (1910)
4. Romeo and Juliet by William Shakespeare (1597)
5. Anna Karenina by Лев Толстой (1876)
6. πολιτεία by Πλάτων (1554)
7. Le Comte de Monte Cristo by Alexandre Dumas (1830)
8. Sonnets by William Shakespeare (1609)
9. As You Like It by William Shakespeare (1734)
10. The Importance of Being Earnest by Oscar Wilde (1893)
11. Chronicles of Avonlea by Lucy Maud Montgomery (1912)
12. Le petit prince by Antoine de Saint-Exupéry (1943)
13. Συμπόσιον by Πλάτων (1559)
14. Cyrano de Bergerac by Edmond Rostand (1821)
15. Rose in Bloom by Louisa May Alcott (1876)
16. कामसूत्र by Mallanaga Vātsyāyana (1883)
17. Vita nuova by Dante Alighieri (1829)
18. Γοργίας by Πλάτων (1827)
19. Works [37 plays, 6 poems, sonnets] by William Shakespeare (1730)
20. Чайка by Антон Павлович Чехов (1915)


In [None]:
get_books_by_subject("romance", limit=30)

Books in romance:
1. Wuthering Heights by Emily Brontë (1846)
2. Emma by Jane Austen (1815)
3. Sense and Sensibility by Jane Austen (1811)
4. Little Women by Louisa May Alcott (1848)
5. Northanger Abbey by Jane Austen (1818)
6. Ethan Frome by Edith Wharton (1910)
7. Anna Karenina by Лев Толстой (1876)
8. Le Comte de Monte Cristo by Alexandre Dumas (1830)
9. Uncle Tom's Cabin by Harriet Beecher Stowe (1850)
10. The Moonstone by Wilkie Collins (1868)
11. Women in Love by David Herbert Lawrence (1877)
12. This Side of Paradise by F. Scott Fitzgerald (1920)
13. Cranford by Elizabeth Cleghorn Gaskell (1853)
14. Heart of Darkness by Joseph Conrad (1899)
15. Jude the Obscure by Thomas Hardy (1895)
16. The Woodlanders by Thomas Hardy (1800)
17. The pioneers by James Fenimore Cooper (1800)
18. Under the Greenwood Tree or, The Mellstock quire by Thomas Hardy (1872)
19. The Age of Innocence by Edith Wharton (1920)
20. Framley Parsonage by Anthony Trollope (1861)
21. Daisy Miller by Henry James (

In [None]:
get_books_by_subject("fantasy", limit=20)

Books in fantasy:
1. Alice's Adventures in Wonderland by Lewis Carroll (1865)
2. The Wonderful Wizard of Oz by L. Frank Baum (1899)
3. Treasure Island by Robert Louis Stevenson (1880)
4. Gulliver's Travels by Jonathan Swift (1726)
5. The Prince by Niccolò Machiavelli (1515)
6. Through the Looking-Glass by Lewis Carroll (1865)
7. Five Children and It by Edith Nesbit (1905)
8. The Lost World by Arthur Conan Doyle (1900)
9. The Marvelous Land of Oz by L. Frank Baum (1904)
10. Ozma of Oz by L. Frank Baum (1907)
11. A Midsummer Night's Dream by William Shakespeare (1600)
12. The Emerald City of Oz by L. Frank Baum (1910)
13. Dorothy and the Wizard in Oz by L. Frank Baum (1908)
14. The Lost Princess of Oz by L. Frank Baum (1917)
15. The Story of the Amulet by Edith Nesbit (1905)
16. The Complete Life and Adventures of Santa Claus by L. Frank Baum (1902)
17. Alice's Adventures in Wonderland / Through the Looking Glass by Lewis Carroll (1889)
18. The Road to Oz by L. Frank Baum (1909)
19. The 

In [5]:
get_books_by_subject("sci-fi", limit=20, published_in='1990-2020')

Books in sci-fi:
5. Divergent by Veronica Roth (2010)
6. The Circle by Dave Eggers (2013)
7. Allegiant by Veronica Roth (2001)
8. Artemis by Andy Weir (2017)
10. Children of Blood and Bone by Tomi Adeyemi (2017)
11. The Mayflower Project by Katherine Applegate (2001)
13. The Telling by Ursula K. Le Guin (2000)
14. The knife of never letting go by Patrick Ness (2008)
18. Waterworld (Movie-Tie-in) by Max Allan Collins (1995)
19. Backwards (Red Dwarf) by Rob Grant (1996)


In [3]:
get_books_by_subject("science-fiction", limit=20, published_in='1990-2020')

Books in science-fiction:
5. Mockingjay by Suzanne Collins (2010)
6. The Martian by Andy Weir (2011)
12. The City of Ember (The First Book of Ember) by Jeanne DuPrau (1998)
15. Gathering Blue by Lois Lowry (2000)
17. Pillars of Creation by Terry Goodkind (2001)
18. 3001 by Arthur C. Clarke (1997)
20. The Fall of Hyperion by Dan Simmons (1990)


In [16]:
# Define the base URL for the fantasy subject
base_url = 'https://openlibrary.org/subjects/fantasy.json'
response = requests.get(base_url)

# On success, pretty-print the matching work as formatted JSON
if response.status_code == 200:
    data = response.json()

    # Access the list of works
    books = data.get("works", [])

    # Search for "Alice's Adventures in Wonderland"
    for book in books:
        if "Alice's Adventures in Wonderland" in book.get("title", ""):
            print(json.dumps(book, indent=4))
            break
    else:
        print("Book not found in the fantasy subject.")
else:
    print(f"Failed to fetch data. Status code: {response.status_code}")


{
    "key": "/works/OL138052W",
    "title": "Alice's Adventures in Wonderland",
    "edition_count": 3546,
    "cover_id": 10527843,
    "cover_edition_key": "OL31754751M",
    "subject": [
        "Alice (fictitious character : carroll), fiction",
        "British and irish fiction (fictional works by one author)",
        "Fiction, fantasy, general",
        "JUVENILE FICTION",
        "classics",
        "Fantasy & Magic",
        "Imagination & Play",
        "adventure and adventurers",
        "adventure and adventurers, fiction",
        "adventure stories",
        "adventure travel",
        "animals",
        "anthropomorphism",
        "artists' illustrated books",
        "books and reading",
        "child and youth fiction",
        "children",
        "children's fiction",
        "children's literature",
        "children's literature, english",
        "children's stories",
        "children's stories, english",
        "classic literature",
        "coloring books",

In [9]:
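# A requests.Response is not subscriptable (indexing it directly raises
# "TypeError: 'Response' object is not subscriptable"), so parse the JSON first.
response.json()['works'][2]['subject'] # to find genre and/or key words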

In [12]:
print(response.json())  # or print(response.text) if you want raw JSON text

{'has_fulltext': 'true', 'key': '/subjects', 'm': 'edit', 'title': 'Subjects', 'type': {'key': '/type/i18n_page'}, 'latest_revision': 110, 'revision': 110, 'created': {'type': '/type/datetime', 'value': '2009-10-16T16:46:43.549533'}, 'last_modified': {'type': '/type/datetime', 'value': '2021-10-04T16:52:28.455494'}}


# What to do next:
1. Build ML model
  - training data: csv file containing books in a specific genre?
  - testing data: our prediction now?

2. Approaches to consider:
  - Collaborative Filtering (based on user ratings, user reviews e.g. Goodreads)
  - Content-Based Filtering (based on genre, content description, etc.)
  - Combination of both Filtering Methods

3. Define Training Data
  - What should the csv file include?
    1. Book Information: Book ID, Title, Author, Genres, Description
    2. User Ratings: User ID, Book ID, Rating, User Reviews

4. Machine Learning Models to consider:
  - Content-Based Filtering: Book descriptions and genres
      - TF-IDF (Term Frequency-Inverse Document Frequency): weights a word by how often it appears in a document, discounted by how many documents contain it: https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
      - scikit-learn: classifiers, feature extraction
  - Collaborative Filtering: User ratings and reviews
      - Singular Value Decomposition (SVD): decomposes a matrix into 3 matrices, good for ratings: https://www.geeksforgeeks.org/singular-value-decomposition-svd/
  - SVD implementation from the Surprise library: https://surpriselib.com/


5. Hybrid model
  - Step 1: Get the top books for the user through collaborative filtering
  - Step 2: Find the most similar books through content based filtering
  - Step 3: Return the list of recommended books
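
Step 2 of the hybrid outline relies on content similarity. The sketch below computes TF-IDF weights and cosine similarity from scratch, using only the standard library, to show what is being measured. The book names and descriptions are made-up placeholders, the tokenizer is deliberately naive, and the plain `log(N / df)` weighting is an assumption chosen for clarity; scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers would differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (word.strip(".,!?;:").lower() for word in text.split())
    return [word for word in words if word]


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, vectors):
    """Index of the document most similar to vectors[index]."""
    scores = [
        (cosine_similarity(vectors[index], vector), i)
        for i, vector in enumerate(vectors)
        if i != index
    ]
    return max(scores)[1]


# Placeholder descriptions, not real data.
descriptions = {
    "Book A": "A young wizard learns magic at a school for wizards",
    "Book B": "A wizard and a dragon share a magic adventure",
    "Book C": "A detective investigates a crime in the city",
}
names = list(descriptions)
vectors = tfidf_vectors(list(descriptions.values()))
print(f"Most similar to {names[0]}: {names[most_similar(0, vectors)]}")
```

Words that appear in every description get a weight of zero, which is why two books that share only common words end up with no similarity.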



In [None]:
# create dataframe (csv file) of books
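
One possible way to build that table is to flatten each work returned by the API into a row with the book fields listed in the plan, then write the rows to CSV. This is a sketch under assumptions: the helper names `work_to_row` and `works_to_csv` and the column names are hypothetical, the second sample record is invented to show missing fields, and the description column is left empty because the truncated JSON sample above does not show a description field.

```python
import csv
import io

COLUMNS = ["book_id", "title", "author", "first_publish_year", "genres", "description"]


def work_to_row(work):
    """Flatten one work dict from the subjects endpoint into a flat row."""
    authors = work.get("authors") or []
    return {
        "book_id": work.get("key", ""),
        "title": work.get("title", ""),
        "author": authors[0].get("name", "") if authors else "",
        "first_publish_year": work.get("first_publish_year", ""),
        "genres": "; ".join(work.get("subject", [])),
        "description": "",  # assumption: would be filled from another source
    }


def works_to_csv(works, file_obj):
    """Write the works to `file_obj` as CSV with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=COLUMNS)
    writer.writeheader()
    for work in works:
        writer.writerow(work_to_row(work))


sample_works = [
    {
        "key": "/works/OL138052W",
        "title": "Alice's Adventures in Wonderland",
        "authors": [{"name": "Lewis Carroll"}],
        "first_publish_year": 1865,
        "subject": ["Fiction, fantasy, general", "classics"],
    },
    # Hypothetical sparse record to show how missing fields are handled.
    {"key": "/works/EXAMPLE", "title": "Untitled"},
]

buffer = io.StringIO()
works_to_csv(sample_works, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file on disk, open it with `newline=""` and pass the file object instead of the buffer. The same rows could also be passed to `pd.DataFrame([work_to_row(w) for w in works])` to get a DataFrame.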


In [None]:
# import SVD and the train/test split helper
from surprise import SVD
from surprise.model_selection import train_test_split
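
Surprise's `SVD` is a matrix-factorization model fitted to the observed ratings with stochastic gradient descent, rather than a literal decomposition of the full ratings matrix. The standard-library sketch below illustrates that idea in simplified form: it omits the user and item bias terms, and the ratings, ids, and hyperparameters are all made up for illustration. The function names `train_mf`, `predict`, and `rmse` are hypothetical and are not Surprise's API.

```python
import random


def train_mf(ratings, n_factors=2, n_epochs=300, lr=0.02, reg=0.02, seed=0):
    """Fit user and item factor vectors to (user, item, rating) triples."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting values for every latent factor.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mean + sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - pred
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on squared error with L2 regularization.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mean, p, q


def predict(model, user, item):
    """Predicted rating; falls back to the global mean for unseen ids."""
    mean, p, q = model
    if user not in p or item not in q:
        return mean
    return mean + sum(pf * qf for pf, qf in zip(p[user], q[item]))


def rmse(model, ratings):
    """Root mean squared error of the model on the given ratings."""
    squared = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return (squared / len(ratings)) ** 0.5


# Made-up ratings: u1 and u2 prefer b1/b2, u3 and u4 prefer b3/b4.
ratings = [
    ("u1", "b1", 5), ("u1", "b2", 4), ("u1", "b3", 1),
    ("u2", "b1", 4), ("u2", "b2", 5), ("u2", "b4", 1),
    ("u3", "b3", 5), ("u3", "b4", 4), ("u3", "b1", 1),
    ("u4", "b3", 4), ("u4", "b4", 5), ("u4", "b2", 2),
]

untrained = train_mf(ratings, n_epochs=0)
model = train_mf(ratings)
print(f"RMSE before training: {rmse(untrained, ratings):.3f}")
print(f"RMSE after training:  {rmse(model, ratings):.3f}")
print(f"Predicted rating for u1 on b4: {predict(model, 'u1', 'b4'):.2f}")
```

With Surprise itself, the corresponding steps would be loading the ratings with `Dataset.load_from_df`, splitting them with `train_test_split`, and calling `SVD().fit(trainset)`. Evaluating on a held-out test set rather than the training ratings gives a fairer error estimate.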