## Similarity comparison between books
This *notebook* measures the similarity between pairs of books to check that the book representation and the similarity function behave sensibly when books are compared. To do so, it takes 3 sets of 2000 book pairs each. The first set consists of randomly formed pairs, the second of pairs of books by the same author, and the third of pairs of books that share the same set of genres. The mean similarity is computed for each set. The expected result is that the mean similarity of the random pairs is lower than the other two mean similarities.
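
The measurement described above can be sketched on toy data before running it on the real dataset. This is a minimal illustration, assuming the similarity is a plain dot product between embedding vectors; `mean_pair_similarity` and `toy_vectors` are hypothetical names that do not appear in the notebook.

```python
import numpy as np


def mean_pair_similarity(
    vectors: dict[int, np.ndarray], pairs: list[tuple[int, int]]
) -> tuple[float, float]:
    """Mean and standard deviation of the dot-product similarity over pairs."""
    sims = [float(np.dot(vectors[a], vectors[b])) for a, b in pairs]
    return float(np.mean(sims)), float(np.std(sims))


# Toy unit vectors (hypothetical data, not taken from the dataset)
toy_vectors = {
    1: np.array([1.0, 0.0]),
    2: np.array([1.0, 0.0]),
    3: np.array([0.0, 1.0]),
}

# Books 1 and 2 point the same way; books 1 and 3 are orthogonal
related_mean, _ = mean_pair_similarity(toy_vectors, [(1, 2)])
unrelated_mean, _ = mean_pair_similarity(toy_vectors, [(1, 3)])
print(related_mean, unrelated_mean)
```

A representation that "makes sense" should show the same pattern on real pairs: a higher mean for related books than for unrelated ones.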

In [1]:
import os
import pandas as pd
from ast import literal_eval

dataset_path = os.path.join(os.getcwd(), '..', '..', 'datasets')
goodbooks_path = os.path.join(dataset_path, 'goodbooks_ext', 'books_enriched.csv')
# Goodbooks Extended book dataset
books = pd.read_csv(
    goodbooks_path, index_col=[0], converters={"authors": literal_eval, "genres": literal_eval}
)
raw_path = os.path.join(dataset_path, 'raw', 'books_raw.pkl')
# Dataset with the semantic representation of each book
books_raw: pd.DataFrame = pd.read_pickle(raw_path)
books_raw_ids = books_raw['book_id'].values
books_reduced = books[books['book_id'].isin(books_raw_ids)]
books_info = books_reduced[['book_id', 'title', 'authors', 'genres']].copy()

### Random pairs

In [2]:
import random
import numpy as np

random.seed(42)
N = 4000
# Randomly sample N book ids without replacement and pair them consecutively (N / 2 = 2000 pairs)
random_book_ids: list[int] = random.sample(books_raw_ids.tolist(), N)
pairs = [(random_book_ids[i], random_book_ids[i + 1]) for i in range(0, N, 2)]
# Similarity of each pair: dot product of their 'semantic_sbert' vectors
sims = [
    np.dot(
        books_raw[books_raw['book_id'] == id1]['semantic_sbert'].values[0],
        books_raw[books_raw['book_id'] == id2]['semantic_sbert'].values[0]
    )
    for (id1, id2) in pairs
]
np.mean(sims), np.std(sims)

(0.25669712, 0.11769364)
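
The cell above filters `books_raw` twice for every pair, which scans the whole frame each time. A possible alternative, sketched here on a toy frame, is to index the embeddings by `book_id` once and compute all dot products in a single vectorised call. The names `toy_raw`, `toy_pairs` and `toy_sims` are hypothetical, and the sketch assumes each `semantic_sbert` entry is a 1-D NumPy array of the same length.

```python
import numpy as np
import pandas as pd

# Toy stand-in for books_raw (hypothetical values)
toy_raw = pd.DataFrame({
    "book_id": [10, 20, 30, 40],
    "semantic_sbert": [
        np.array([1.0, 0.0]),
        np.array([0.0, 1.0]),
        np.array([0.6, 0.8]),
        np.array([0.8, 0.6]),
    ],
})

# Index the embeddings by book_id once instead of filtering per pair
vectors = toy_raw.set_index("book_id")["semantic_sbert"]
toy_pairs = [(10, 20), (30, 40)]

# Stack the left and right vectors of every pair into two matrices
left = np.stack([vectors.loc[a] for a, _ in toy_pairs])
right = np.stack([vectors.loc[b] for _, b in toy_pairs])

# Row-wise dot product of the two matrices
toy_sims = np.einsum("ij,ij->i", left, right)
print(toy_sims.mean(), toy_sims.std())
```

Note that a dot product equals the cosine similarity only when the vectors are unit-normalised; the notebook does not state whether `semantic_sbert` is normalised, so that should be checked before interpreting the values as cosines.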

### Pairs of books by the same author

In [3]:
# Keep only the first listed author of each book, then keep authors with at least 2 books
books_info['authors'] = books_info['authors'].apply(lambda x: x[0])
books_info['authors'] = books_info['authors'].apply(lambda x: x.strip('[').strip(']'))
author_count = books_info['authors'].value_counts()
author_count = author_count[author_count > 1]
author_ids = author_count.index.tolist()
books_info = books_info[books_info['authors'].isin(author_ids)]
books_info.head()

| | book_id | title | authors | genres |
|---|---|---|---|---|
| 0 | 1 | The Hunger Games (The Hunger Games, #1) | Suzanne Collins | [young-adult, fiction, fantasy, science-fictio... |
| 1 | 2 | Harry Potter and the Sorcerer's Stone (Harry P... | J.K. Rowling | [fantasy, fiction, young-adult, classics] |
| 2 | 3 | Twilight (Twilight, #1) | Stephenie Meyer | [young-adult, fantasy, romance, fiction, paran... |
| 3 | 4 | To Kill a Mockingbird | Harper Lee | [classics, fiction, historical-fiction, young-... |
| 5 | 6 | The Fault in Our Stars | John Green | [young-adult, romance, fiction, contemporary] |
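
The filter on authors with at least two books can also be written without building an intermediate list of author names. The following is a sketch on a toy frame, not the notebook's own code: `toy_books` and `books_per_author` are hypothetical names, and `transform("size")` attaches to each row the number of books by its author.

```python
import pandas as pd

# Toy stand-in for books_info after reducing 'authors' to the first author
toy_books = pd.DataFrame({
    "book_id": [1, 2, 3, 4, 5],
    "authors": ["A", "B", "A", "C", "B"],
})

# Number of books by the same author, aligned with each row
books_per_author = toy_books.groupby("authors")["book_id"].transform("size")

# Keep only rows whose author has more than one book
toy_filtered = toy_books[books_per_author > 1]
print(toy_filtered["book_id"].tolist())
```

Author "C" has a single book, so book 4 is dropped while the other rows keep their original order.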


In [4]:
# Group by author
author_grouped = books_info.groupby('authors')

In [5]:
# Books by Suzanne Collins
suzanne_collins = author_grouped.get_group('Suzanne Collins')
suzanne_collins

| | book_id | title | authors | genres |
|---|---|---|---|---|
| 0 | 1 | The Hunger Games (The Hunger Games, #1) | Suzanne Collins | [young-adult, fiction, fantasy, science-fictio... |
| 14 | 17 | Catching Fire (The Hunger Games, #2) | Suzanne Collins | [young-adult, fiction, fantasy, science-fictio... |
| 16 | 20 | Mockingjay (The Hunger Games, #3) | Suzanne Collins | [young-adult, fiction, fantasy, science-fictio... |
| 466 | 507 | The Hunger Games Trilogy Boxset (The Hunger Ga... | Suzanne Collins | [young-adult, fiction, fantasy, science-fictio... |
| 1438 | 1531 | Gregor the Overlander (Underland Chronicles, #1) | Suzanne Collins | [fantasy, young-adult, fiction, science-fiction] |
| 2731 | 2935 | Gregor and the Code of Claw (Underland Chronic... | Suzanne Collins | [fantasy, young-adult, fiction] |
| 2955 | 3179 | Gregor and the Curse of the Warmbloods (Underl... | Suzanne Collins | [fantasy, young-adult, fiction] |
| 3426 | 3712 | Gregor and the Prophecy of Bane (Underland Chr... | Suzanne Collins | [fantasy, young-adult, fiction] |
| 4262 | 4720 | Gregor and the Marks of Secret (Underland Chro... | Suzanne Collins | [fantasy, young-adult, fiction] |


In [6]:
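# Build pairs of books for each author.
# Group keys are already unique; iterating over them directly (instead of through set(),
# whose order for strings varies between interpreter runs) keeps the seeded sample reproducible.
authors: list[str] = list(author_grouped.groups.keys())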
paired_books: list[tuple[int, int]] = []
for author in authors:
    author_books = author_grouped.get_group(author)
    book_ids = author_books['book_id'].values
    num_books = len(book_ids)
    if num_books % 2 != 0:
        book_ids = book_ids[:-1]
        num_books -= 1
    paired_books += [(book_ids[i], book_ids[i + 1]) for i in range(0, num_books, 2)]
    
# Random selection of N = 2000 pairs
random.seed(42)
N = 2000
random_paired_books = random.sample(paired_books, N)

# Similarity of each pair: dot product of their 'semantic_sbert' vectors
sims = [
    np.dot(
        books_raw[books_raw['book_id'] == id1]['semantic_sbert'].values[0],
        books_raw[books_raw['book_id'] == id2]['semantic_sbert'].values[0]
    )
    for (id1, id2) in random_paired_books
]
np.mean(sims), np.std(sims)

(0.52206093, 0.18181205)
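
The pairing rule used in the cell above (consecutive books within each group, dropping the last one when the group has an odd size) can be isolated in a small helper. This is a sketch with hypothetical names (`consecutive_pairs`, `toy_groups`), shown on toy ids rather than on the real groups.

```python
def consecutive_pairs(ids: list[int]) -> list[tuple[int, int]]:
    """Pair items as (0, 1), (2, 3), ...; a trailing unpaired item is dropped."""
    return [(ids[i], ids[i + 1]) for i in range(0, len(ids) - 1, 2)]


# Toy groups: author -> book ids (hypothetical values)
toy_groups = {"A": [1, 17, 20], "B": [2, 5], "C": [9]}

# Collect the pairs of every group into one flat list
toy_pairs = [pair for ids in toy_groups.values() for pair in consecutive_pairs(ids)]
print(toy_pairs)
```

With this rule each book appears in at most one pair, so a group of `k` books contributes `k // 2` pairs rather than all `k * (k - 1) / 2` possible combinations.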

### Pairs of books with the same set of genres

In [7]:
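# Convert each genre list to a frozenset so it is hashable and order-insensitive for grouping
# (books_info still holds only the books kept by the author filter in In [3])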
books_info['genres'] = books_info['genres'].apply(lambda x: frozenset(x))
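
The reason for the conversion can be seen on a toy frame: lists cannot be used as group keys because they are unhashable, and two lists with the same genres in a different order would not compare equal anyway. The sketch below uses hypothetical names (`toy_books`, `toy_grouped`) and passes `sort=False` only to avoid ordering the frozenset keys in this illustration.

```python
import pandas as pd

# Toy stand-in for books_info; books 1 and 2 list the same genres in a different order
toy_books = pd.DataFrame({
    "book_id": [1, 2, 3],
    "genres": [["fantasy", "fiction"], ["fiction", "fantasy"], ["romance"]],
})

# frozensets are hashable and ignore element order
toy_books["genres"] = toy_books["genres"].apply(frozenset)
toy_grouped = toy_books.groupby("genres", sort=False)

# Books 1 and 2 fall into the same group, book 3 into its own
print(toy_grouped.ngroups)
```

Grouping this way requires an exact match of the whole genre set, which is stricter than sharing at least one genre.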

In [8]:
genre_grouped = books_info.groupby('genres')

In [9]:
sf_young_genre = frozenset(['young-adult', 'fiction', 'fantasy', 'science-fiction'])
sf_young = genre_grouped.get_group(sf_young_genre)
sf_young.head()

| | book_id | title | authors | genres |
|---|---|---|---|---|
| 55 | 62 | The Golden Compass (His Dark Materials, #1) | Philip Pullman | (fiction, young-adult, fantasy, science-fiction) |
| 194 | 215 | Ready Player One | Ernest Cline | (fiction, young-adult, fantasy, science-fiction) |
| 201 | 223 | Artemis Fowl (Artemis Fowl, #1) | Eoin Colfer | (fiction, young-adult, fantasy, science-fiction) |
| 343 | 376 | The Death Cure (Maze Runner, #3) | James Dashner | (fiction, young-adult, fantasy, science-fiction) |
| 442 | 480 | The Amber Spyglass (His Dark Materials, #3) | Philip Pullman | (fiction, young-adult, fantasy, science-fiction) |


In [10]:
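# Build pairs of books for each set of genres.
# Group keys are already unique; iterating over them directly (instead of through set(),
# whose order varies between interpreter runs) keeps the seeded sample reproducible.
genres: list[frozenset[str]] = list(genre_grouped.groups.keys())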
paired_books: list[tuple[int, int]] = []
for genre in genres:
    genre_books = genre_grouped.get_group(genre)
    book_ids = genre_books['book_id'].values
    num_books = len(book_ids)
    if num_books % 2 != 0:
        book_ids = book_ids[:-1]
        num_books -= 1
    paired_books += [(book_ids[i], book_ids[i + 1]) for i in range(0, num_books, 2)]

# Random selection of N = 2000 pairs
random.seed(42)
N = 2000
random_paired_books = random.sample(paired_books, N)

# Similarity of each pair: dot product of their 'semantic_sbert' vectors
sims = [
    np.dot(
        books_raw[books_raw['book_id'] == id1]['semantic_sbert'].values[0],
        books_raw[books_raw['book_id'] == id2]['semantic_sbert'].values[0]
    )
    for (id1, id2) in random_paired_books
]
np.mean(sims), np.std(sims)

(0.42866492, 0.18337564)
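
The three means come out in the expected order: random pairs (about 0.26) score lower than same-genre pairs (about 0.43) and same-author pairs (about 0.52). A rough way to judge whether those gaps are large relative to sampling noise is to express each difference in units of its standard error. The sketch below uses only the means and standard deviations reported above and assumes that the 2000 pairs in each set are independent samples, which is an approximation. The names `results` and `z_score` are hypothetical.

```python
import math

# Means and standard deviations reported above, each over n = 2000 pairs
results = {
    "random": (0.25669712, 0.11769364),
    "same_author": (0.52206093, 0.18181205),
    "same_genres": (0.42866492, 0.18337564),
}
n_pairs = 2000


def z_score(a: tuple[float, float], b: tuple[float, float], n: int) -> float:
    """Difference of two means in units of its standard error."""
    (mean_a, std_a), (mean_b, std_b) = a, b
    standard_error = math.sqrt(std_a**2 / n + std_b**2 / n)
    return (mean_a - mean_b) / standard_error


z_author = z_score(results["same_author"], results["random"], n_pairs)
z_genres = z_score(results["same_genres"], results["random"], n_pairs)
print(z_author, z_genres)
```

Both differences are many standard errors wide under these assumptions. The comparison is still only indicative, because the author and genre pairs are drawn from the subset of books whose first author has at least two books, while the random pairs are drawn from all books.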