# Link book topics to reviews

In this notebook, we link the book topics found by the top2vec model to the reviews that readers have left after reading those books.

In [9]:
import numpy as np
import pandas as pd
from top2vec import Top2Vec
from scipy.spatial.distance import cosine
from tqdm import tqdm

# custom classes to preprocess and tidy up dataframes
from src.topic_summary import ModelAnalyser, NurGenreMapper, ReviewExtractor

#### Set paths

In [4]:
# please change the following paths to reflect the location of the top2vec model in your local directory

folder_path = "/Users/evaviviani/github/impact-and-fiction/models/" 

#### Load custom-made functions

In [6]:
def normalize_distances(distances):
    """
    Normalize the cosine distances to get a distribution over topics.

    Parameters:
    - distances (list of float): Cosine distances for the n closest topics.

    Returns:
    - normalized_similarities: Normalized cosine similarity scores for the n closest topics.
    """
    # Convert distances to similarities
    similarities = [1 - dist for dist in distances]
    
    # Normalize the similarities
    total_sim = sum(similarities)
    normalized_similarities = [sim / total_sim for sim in similarities]
    
    return normalized_similarities
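
As a quick sanity check of this normalization, here is a minimal NumPy sketch of the same computation on made-up distances. The name `normalize_distances_np` and the zero-sum guard are assumptions added for illustration; they are not part of the notebook's pipeline.

```python
import numpy as np


def normalize_distances_np(distances):
    """Hypothetical NumPy variant of normalize_distances (illustration only)."""
    # Convert cosine distances to similarities
    similarities = 1.0 - np.asarray(distances, dtype=float)
    total = similarities.sum()
    # Assumption: fail loudly instead of dividing by zero
    if np.isclose(total, 0.0):
        raise ValueError("similarities sum to zero; cannot normalize")
    return similarities / total


# Made-up distances for three topics
example = normalize_distances_np([0.2, 0.4, 0.6])
print(example)        # three weights, largest for the closest topic
print(example.sum())  # sums to 1 up to floating-point error
```

The closest topic (smallest distance) receives the largest weight, and the weights of the selected topics always add up to 1.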

In [7]:
# this class is useful to analyse the output of Top2Vec models
analyser = ModelAnalyser()

#### Load top2vec model

In this section we load the top2vec model and flatten all the information to have a dataframe with which we can merge the reviews.

In [8]:
model = analyser.load_model(folder_path + 't2v_model-10921_novels-win_-1-all_win_denoised-min_frac_0.01-max_frac_0.1')

For each document, we select the n = 5 closest topics and normalize their similarity scores.

In [10]:
dv, tv, wv = analyser.get_vectors_from_model(model)

document_topic = []
topics_distance = []

for d in tqdm(dv):
    idx, distances = analyser.closest_topics(d, tv, n=5)
    document_topic.append(idx)
    topics_distance.append(normalize_distances(distances))

100%|████████████████████████████████████| 10921/10921 [00:15<00:00, 706.31it/s]
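
The implementation of `analyser.closest_topics` is not shown in this notebook. The sketch below is one plausible way to select the n nearest topics by cosine distance, assuming the document and topic vectors are NumPy arrays; the function name `closest_topics_sketch` and the toy vectors are hypothetical.

```python
import numpy as np


def closest_topics_sketch(doc_vector, topic_vectors, n=5):
    """Return indices and cosine distances of the n nearest topic vectors."""
    # Cosine similarity between the document and every topic vector
    norms = np.linalg.norm(topic_vectors, axis=1) * np.linalg.norm(doc_vector)
    similarities = (topic_vectors @ doc_vector) / norms
    distances = 1.0 - similarities
    # Smallest distance first
    idx = np.argsort(distances)[:n]
    return idx, distances[idx]


# Toy data: 4 topics in a 3-dimensional embedding space
toy_topics = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
toy_doc = np.array([0.9, 0.1, 0.0])

idx, dist = closest_topics_sketch(toy_doc, toy_topics, n=2)
print(idx, dist)
```

With these toy vectors the document points almost exactly along the first topic, so that topic is returned first.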


Next, we create a dataframe from the output of the top2vec model. The steps are explained in the comments below.

In [11]:
# extract document IDs
document_ids = [i.split('-')[0] for i in model.document_ids] 

# flatten the list of lists of normalized similarity scores (kept under the name `cosine_distance` below)
cosine_distances = [distance for sublist in topics_distance for distance in sublist]

# do the same for topic numbers and positions (in case this information turns out to be useful later)
topic_numbers = [number for sublist in document_topic for number in sublist]

topic_position = [number_pos for sublist in document_topic for number_pos in range(len(sublist))]

# repeated list of document_ids
repeated_ids = [doc_id for doc_id in document_ids for _ in range(len(topics_distance[0]))]

df = pd.DataFrame({
    'isbn': repeated_ids,
    'cosine_distance': cosine_distances,
    'topic_position': topic_position,
    'topic_number': topic_numbers
})
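
The flattening above yields one row per document-topic pair. The toy example below sketches the same long-format layout with `np.repeat` and `np.tile`; all names and values are made up, and it assumes every document has the same number of topics.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for document_ids, document_topic and topics_distance
toy_ids = ["111", "222"]
toy_topics = [[3, 7], [1, 3]]
toy_weights = [[0.6, 0.4], [0.7, 0.3]]

# Assumption: every document has the same number of topics
n_topics = len(toy_topics[0])

toy_long = pd.DataFrame({
    "isbn": np.repeat(toy_ids, n_topics),                            # each id repeated n_topics times
    "topic_number": np.concatenate(toy_topics),                      # flattened topic numbers
    "topic_position": np.tile(np.arange(n_topics), len(toy_ids)),    # 0, 1, 0, 1, ...
    "weight": np.concatenate(toy_weights),                           # flattened normalized scores
})
print(toy_long)
```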

In [12]:
topic_words, word_scores, topic_nums = model.get_topics()

# add topic words
df['topic_words'] = [model.topic_words[i] for i in df['topic_number']]
# add topic vectors
df['topic_vectors'] = [model.topic_vectors[i] for i in df['topic_number']]
# add word scores (similarity of each topic word to its topic vector)
df['topic_word_vectors'] = [word_scores[i] for i in df['topic_number']]

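The resulting dataframe has one row per book-topic pair: it records which topics are closest to each book, together with each topic's words (`topic_words`), its vector (`topic_vectors`) and its word scores (`topic_word_vectors`). Note that the `cosine_distance` column holds the normalized similarity scores computed above, not raw distances.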

In [13]:
df.head(2)

Unnamed: 0,isbn,cosine_distance,topic_position,topic_number,topic_words,topic_vectors,topic_word_vectors
0,9789402522297,0.267482,0,0,"[cara, teemde, privejet, querida, lustgevoelen...","[0.010043722, 0.15458588, -0.021236569, -0.000...","[0.673147, 0.5679335, 0.55956966, 0.55685467, ..."
1,9789402522297,0.197697,1,40,"[bliken, communications, flynn, hudson, blikke...","[0.009354808, 0.13535821, -0.019368501, -0.042...","[0.54636735, 0.48367235, 0.44614616, 0.4025609..."


#### Load reviews

In this section we combine several datasets: the mapping between works and ISBNs, and the corresponding reviews.

In [15]:
# please change the following paths to reflect the location of the following files in your local directory

review_dir = '../../models/reviews/review-impact_matches.tsv.gz'
raw_review_data = '../../models/reviews/reviews-stats.tsv.gz'
isbn_map = "../../data/work-isbn-mapping.tsv"
isbn_work_id_mappings_file = "../../data/work_isbn_genre.tsv"


In [16]:
# this class helps to preprocess the inputs and output a genre mapping file
mapper = NurGenreMapper(isbn_map, isbn_work_id_mappings_file)

# this class produces as output the impact_reviews
extractor = ReviewExtractor(review_dir, raw_review_data)

In [17]:
# get the mapping file, which contains the `work_id` and `isbn` columns needed to merge the reviews
mapped_df = mapper.process_genre_mapping()

# this is our impact reviews dataset:
reviews = extractor.get_impact_reviews()

# NB: a left join keeps every review, even when its work_id has no match in the mapping file
reviews_merged_with_genre = pd.merge(reviews, mapped_df, on = 'work_id', how = 'left')

The dataset consists of impact terms that the impact model extracted from book reviews, each scored on affect, style, narrative and reflection.

In [20]:
# use a shorter name so that it is easier to handle in the code
dt = reviews_merged_with_genre

The step below splits every multi-word impact term on whitespace and gives each word its own row.

In [22]:
dt_expanded = dt.assign(impact_term=dt['impact_term'].str.split()).explode('impact_term')
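
To illustrate what the split-and-explode step does, here is a small sketch on made-up reviews (the dataframe and its values are hypothetical):

```python
import pandas as pd

# Made-up reviews: the first impact term consists of two words
toy_reviews = pd.DataFrame({
    "review_id": ["r1", "r2"],
    "impact_term": ["heel mooi", "spannend"],
})

toy_expanded = (
    toy_reviews
    .assign(impact_term=toy_reviews["impact_term"].str.split())  # string -> list of words
    .explode("impact_term")                                       # one row per word
)
print(toy_expanded)
```

Note that `explode` keeps the original index, so the two rows that come from `r1` share index 0; call `reset_index(drop=True)` if a unique index is needed.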

This is our resulting dataframe containing the info regarding the reviews:

In [23]:
dt_expanded.head(2)

Unnamed: 0,work_id,review_id,affect,style,narrative,reflection,impact_term,review_num_words,isbn,nur_genre
0,impfic-work-3723,impfic-review-1,1,0,0,0,fantastisch,185,9789048416547,Young_adult
1,impfic-work-3723,impfic-review-1,1,0,0,0,fantastisch,185,9789400804876,Young_adult


#### Merge the reviews with the output of top2vec

In [24]:
topic_and_reviews = pd.merge(dt_expanded, df, on = 'isbn', how = 'left')

This is the result of the merge:

In [25]:
topic_and_reviews.head(2)

Unnamed: 0,work_id,review_id,affect,style,narrative,reflection,impact_term,review_num_words,isbn,nur_genre,cosine_distance,topic_position,topic_number,topic_words,topic_vectors,topic_word_vectors
0,impfic-work-3723,impfic-review-1,1,0,0,0,fantastisch,185,9789048416547,Young_adult,,,,,,
1,impfic-work-3723,impfic-review-1,1,0,0,0,fantastisch,185,9789400804876,Young_adult,,,,,,
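
The empty topic columns in these two rows mean that no row in `df` matched their `isbn`. To see how many rows found a match, the merge can be repeated with `indicator=True`. The sketch below shows the idea on made-up ISBNs; casting both keys to strings first is an assumption about how the keys might differ, not something this notebook confirms.

```python
import pandas as pd

# Made-up data: integer ISBNs on the left, string ISBNs on the right
toy_left = pd.DataFrame({
    "isbn": [9789000000001, 9789000000002],
    "impact_term": ["mooi", "saai"],
})
toy_right = pd.DataFrame({"isbn": ["9789000000001"], "topic_number": [12]})

# Assumption: align the key types before merging
toy_left["isbn"] = toy_left["isbn"].astype(str)
toy_right["isbn"] = toy_right["isbn"].astype(str)

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join
toy_merged = pd.merge(toy_left, toy_right, on="isbn", how="left", indicator=True)
print(toy_merged["_merge"].value_counts())
```

Rows flagged `left_only` are reviews of books that have no topic information.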


The dataframe contains many columns that we do not need for the following steps, so we keep only the relevant ones in a smaller dataframe and drop duplicate rows.

In [26]:
minimal_topic_reviews = topic_and_reviews[['isbn', 'nur_genre', 'topic_number', 'impact_term' , 'affect', 'style', 'narrative', 'reflection']].drop_duplicates()

This is our final dataframe:

In [27]:
minimal_topic_reviews.head(2)

Unnamed: 0,isbn,nur_genre,topic_number,impact_term,affect,style,narrative,reflection
0,9789048416547,Young_adult,,fantastisch,1,0,0,0
1,9789400804876,Young_adult,,fantastisch,1,0,0,0


#### Add themes

Below is our topic-to-theme mapping:

In [28]:
genre_topic_mapping = {
    'Romance and sex': [0, 1, 10, 12, 28, 45, 51, 93],
    'Wildlife / nature': [21, 33, 56, 65, 76, 90],
    'Behaviours / feelings': [2, 12, 13, 20, 18, 41, 42, 54, 59, 68, 93],
    'Medicine / health': [3, 20, 72],
    'School': [18, 42],
    'Arts': [7, 13, 14, 24, 40, 41, 46, 48, 49, 50, 53],
    'Culture': [4, 8, 10, 11, 13, 14, 15, 17, 19, 21, 29, 33, 34, 35, 38, 39, 43, 44, 47, 53, 58, 59, 65, 79, 84],
    'Geography and setting': [5, 17, 19, 21, 29, 38, 31, 32, 33, 34, 35, 38, 43, 47, 48, 53, 54, 61, 64, 74, 77, 79, 82, 85, 88, 89, 92, 93, 94],
    'Law': [6, 7, 39, 75, 80, 81, 89],
    'Crime': [6, 15, 16, 25, 39, 55, 62, 63, 73, 75, 78, 94],
    'History': [8, 9, 11, 16, 22, 26, 47, 53, 70, 71, 74, 77, 80],
    'War': [8, 9, 22, 31, 34, 55, 62, 63, 70, 71, 76, 80, 81, 86],
    'Religion, spirituality and philosophy': [27, 30, 34, 66, 87],
    'Politics': [23, 32, 36, 79, 80],
    'Lifestyle and sport': [33, 38, 49, 52, 57, 90],
    'Supernatural, fantasy and sci-fi': [37, 60, 69, 91, 83],
    'Other': [24, 46, 67, 76, 83]
}

I transform the mapping above into a dataframe so that it is easier to work with.

In [29]:
genres = []
topics = []

for genre, topic_list in genre_topic_mapping.items():
    for topic in topic_list:
        genres.append(genre)
        topics.append(topic)

topic_categories = pd.DataFrame({
    'category': genres,
    'topic_number': topics
})

topic_categories

Unnamed: 0,category,topic_number
0,Romance and sex,0
1,Romance and sex,1
2,Romance and sex,10
3,Romance and sex,12
4,Romance and sex,28
...,...,...
162,Other,24
163,Other,46
164,Other,67
165,Other,76


In [30]:
# check that every topic is assigned to at least one theme

all_topics = list(range(0, 95))
provided_topics = [topic for sublist in genre_topic_mapping.values() for topic in sublist]

missing_topics = [topic for topic in all_topics if topic not in provided_topics]

In [31]:
print(f'Do all topics have a corresponding thematic? {not missing_topics}')

Do all topics have a corresponding thematic? True
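
A complementary check is whether a topic is listed more than once under the same theme. In the mapping above, topic 38 appears twice under 'Geography and setting', which produces a duplicate row in `topic_categories`. A minimal sketch of such a check on a toy mapping (theme names and values are made up):

```python
from collections import Counter

# Toy mapping with a deliberate repeat (topic 38 appears twice under one theme)
toy_mapping = {
    "Theme A": [5, 38, 31, 38],
    "Theme B": [6, 7],
}

# For each theme, collect the topics that are listed more than once
repeated = {
    theme: sorted(t for t, count in Counter(topics).items() if count > 1)
    for theme, topics in toy_mapping.items()
}
# Keep only the themes that actually contain repeats
repeated = {theme: dupes for theme, dupes in repeated.items() if dupes}
print(repeated)  # {'Theme A': [38]}
```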


How many topics per theme?

In [32]:
# Count the number of topics per thematic category
topic_categories.groupby(['category']).agg({
    'topic_number': 'nunique'
}).reset_index()

Unnamed: 0,category,topic_number
0,Arts,11
1,Behaviours / feelings,11
2,Crime,12
3,Culture,25
4,Geography and setting,28
5,History,13
6,Law,7
7,Lifestyle and sport,6
8,Medicine / health,3
9,Other,5


#### Prepare dataset to produce circular histograms.

A book can be associated with several topics, so we use the normalized similarity scores as the proportion of each topic in a document.

In [33]:
# the proportions are the normalized scores computed earlier, one list per document
topic_proportions = list(topics_distance)

In [34]:
flat_topic_proportions = [item for sublist in topic_proportions for item in sublist]

In [35]:
df_minimal = df[['isbn','topic_number']].drop_duplicates()

In [37]:
df_minimal['topic_proportion'] = flat_topic_proportions
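
The assignment above relies on `df_minimal` having exactly as many rows as the flattened list, in the same order; if `drop_duplicates` removed a row, the lengths would no longer match. A hedged alternative, sketched on toy data, is to carry the scores along inside the dataframe before deduplicating. The toy column values are made up, and the sketch assumes the scores are already stored in the `cosine_distance` column.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "isbn": ["111", "111", "222", "222"],
    "topic_number": [3, 7, 1, 3],
    "cosine_distance": [0.6, 0.4, 0.7, 0.3],  # normalized scores, as in df
})

toy_minimal = (
    toy_df[["isbn", "topic_number", "cosine_distance"]]
    .rename(columns={"cosine_distance": "topic_proportion"})
    .drop_duplicates(subset=["isbn", "topic_number"])  # values stay aligned with their rows
)

# The proportions of one book should still sum to 1
print(toy_minimal.groupby("isbn")["topic_proportion"].sum())
```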

These are our proportions:

In [38]:
df_minimal.head(2)

Unnamed: 0,isbn,topic_number,topic_proportion
0,9789402522297,0,0.267482
1,9789402522297,40,0.197697


We add the theme categories to these proportions:

In [39]:
df_categories_proportions = pd.merge(df_minimal, topic_categories, on = 'topic_number', how='left')

In [41]:
dt_expanded_minimal = dt_expanded[['isbn', 'nur_genre']]

In [43]:
df_categories_proportions_nur = pd.merge(df_categories_proportions, dt_expanded_minimal, on = 'isbn', how='left')
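
Because the review table has one row per impact term, the same `isbn` can appear many times in a genre lookup built from it unless duplicates are dropped, and a left merge then repeats each topic row once per match. The toy sketch below (made-up values) shows the effect and how `validate="many_to_one"` can guard against it.

```python
import pandas as pd

toy_props = pd.DataFrame({"isbn": ["111", "222"], "topic_proportion": [0.6, 0.7]})

# One row per review term, so the same (isbn, nur_genre) pair repeats
toy_lookup = pd.DataFrame({"isbn": ["111", "111", "111"], "nur_genre": ["Thriller"] * 3})

# Without deduplication, the row for isbn 111 is repeated three times
naive = pd.merge(toy_props, toy_lookup, on="isbn", how="left")

# With deduplication, each isbn matches at most one lookup row;
# validate raises an error if the right-hand keys are not unique
deduped = pd.merge(
    toy_props,
    toy_lookup.drop_duplicates(),
    on="isbn",
    how="left",
    validate="many_to_one",
)
print(len(naive), len(deduped))  # 4 2
```

Whether repeated rows are desired depends on whether the histograms should weight books by their number of review terms.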

This is the final dataset used to create the circular histograms:

In [45]:
df_categories_proportions_nur.head(2)

Unnamed: 0,isbn,topic_number,topic_proportion,category,nur_genre
0,9789402522297,0,0.267482,Romance and sex,
1,9789402522297,40,0.197697,Arts,


The circular histograms are rendered in R, whose plotting capabilities allow more flexibility for this type of figure.

The R script that reproduces the circular plots shown in the paper is `plot_circular_histograms.R`. It takes `circular_histograms_data.csv` as input and saves the plots automatically.

In [46]:
df_categories_proportions_nur.to_csv('circular_histograms_data.csv', index=False)