<hr>
<center>

# Text Mining - Homework 2

</center>
<center>

Authors: 

Timothy, Denis, Pablo

</center>
<hr>


## Part 2

<hr>

Notebook for the creation of the dictionaries of the books dataset.

# 0. Packages

In [2]:
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize  # For tokenizing
from nltk.stem import WordNetLemmatizer  # For lemmatizing
from nltk.corpus import stopwords  # Stopwords list
import re  # For regex expressions
from pandarallel import pandarallel  # For parallelizing pandas row operations
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from typing import Union  # Allows setting as inputs of a function a set of options

# Other utilities
import ast
from collections import Counter
from itertools import chain
import pickle

# 1. Importing the data

In [3]:
# Original (cleaned) data 
df = pd.read_csv('/home/pablo/Downloads/books_and_genres_tim_cleaned.csv')

# Drop the language column (all in English)
df.drop(columns = ['lang'], inplace = True)

df.head()

Unnamed: 0,title,text,genres
0,apocolocyntosis,"Produced by Ted Garvin, Ben Courtney and PG Di...","['literature', 'read-for-school', 'classics', ..."
1,the house on the borderland,"Produced by Suzanne Shell, Sjaani and PG Distr...","['literature', 'mystery', 'speculative-fiction..."
2,the warriors,"Produced by Charles Aldarondo, Charlie Kirschn...","['school', 'non-fiction', 'literary-fiction', ..."
3,a voyage to the moon,"Produced by Christine De Ryck, Stig M. Valstad...","['speculative-fiction', '20th-century', 'scien..."
4,la fiammetta,"Produced by Ted Garvin, Dave Morgan and PG Dis...","['literature', 'read-for-school', 'school', 'c..."


In [4]:
print('Before conversion, genre list is stored as a string:\n', type(df.iloc[0, 2]))

Before conversion, genre list is stored as a string:
 <class 'str'>


In [5]:
# Convert the string representation to a list using ast.literal_eval
df['genres'] = df['genres'].apply(ast.literal_eval)

print('After conversion, genre list is stored as a list:\n', type(df.iloc[0, 2]))

After conversion, genre list is stored as a list:
 <class 'list'>
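The conversion above can be illustrated on a single made-up cell value. This is a minimal sketch: the `raw` string is hypothetical, and the second call only shows that `ast.literal_eval` refuses anything that is not a plain Python literal rather than executing it.

```python
import ast

# Hypothetical cell value, shaped like the entries of the 'genres' column
raw = "['literature', 'mystery', 'classics']"

# literal_eval parses the string into a real Python list
genres = ast.literal_eval(raw)
print(type(genres).__name__, len(genres))

# Expressions that are not plain literals are rejected, not executed
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
else:
    rejected = False
print("Non-literal input rejected:", rejected)
```

This is why `ast.literal_eval` is generally preferred over `eval` for columns read back from a CSV file.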


In [6]:
# Flatten all genre lists into one list
all_genres = list(chain.from_iterable(df['genres']))

# Use Counter to get frequencies
genre_counter = Counter(all_genres)

# Convert the counter to a data frame
genre_df = pd.DataFrame(list(genre_counter.items()), columns=['genre', 'frequency'])

# Sort the data frame by frequency (highest first)
genre_df = genre_df.sort_values(by='frequency', ascending=False).reset_index(drop = True)

# Compute relative frequency of book genres
genre_df['relative_frequency'] = genre_df['frequency'] / len(df)

# Display data frame
genre_df.head()

Unnamed: 0,genre,frequency,relative_frequency
0,fiction,6244,0.690174
1,classics,4565,0.504587
2,historical,3359,0.371283
3,non-fiction,2655,0.293467
4,20th-century,2629,0.290594


Note that the frequencies add up to more than the number of books in our dataset (and the relative frequencies to more than 1), since a book can be associated with more than one genre.
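A small sketch with four made-up books shows why the genre counts exceed the number of books. The `books` list is hypothetical and is not taken from the dataset.

```python
from collections import Counter
from itertools import chain

# Hypothetical genre lists for four books
books = [
    ["fiction", "classics"],
    ["fiction"],
    ["fiction", "historical", "classics"],
    ["non-fiction"],
]

counts = Counter(chain.from_iterable(books))
relative = {genre: count / len(books) for genre, count in counts.items()}

total_labels = sum(counts.values())    # number of (book, genre) pairs
sum_relative = sum(relative.values())  # average number of genres per book

print(total_labels, "labels for", len(books), "books")
print("Sum of relative frequencies:", sum_relative)
```

The sum of the relative frequencies equals the average number of genres per book, so it is above 1 whenever books carry several labels.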

# 2. Text preprocessing

In [7]:
# Initialize parallelization for pandas
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## 2.1. Tokenizing and stopword removal

Below, we lowercase the text, strip punctuation and digits, tokenize it and, optionally, remove stopwords.

In [8]:
def preprocess_lower(text, rm_stopwords = False, stopword_set = None):
    """
    Preprocess text by:
       - Converting to lowercase.
       - Removing punctuation and digits.
       - Tokenizing.
       - Removing stopwords (optional).
    
    Returns:
        str: The lowercased tokens, without punctuation, joined by single spaces.
    """
    text_lower = text.lower()
    # Remove digits and punctuation, replacing them by whitespace (and not by
    # nothing) because of incorrectly spaced punctuation
    text_no_punct = re.sub(r'[^a-zA-Z\s]', ' ', text_lower)
    tokens = word_tokenize(text_no_punct)
    # Remove stopwords if desired
    if rm_stopwords:
        tokens = [token for token in tokens if token not in stopword_set]
    # We return the tokens as a single string so that we can find n-grams later
    return " ".join(tokens)
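The choice of replacing punctuation by a space rather than by nothing can be checked on a made-up sentence with badly spaced punctuation. This sketch uses `str.split()` as a stand-in for `word_tokenize`, which is an assumption made to keep the example dependency-free.

```python
import re

# Hypothetical text with missing spaces after punctuation
sample = "He stopped.Then,without a word,he left in 1911."

# Same pattern as in preprocess_lower, with two different replacements
with_space = re.sub(r"[^a-zA-Z\s]", " ", sample.lower()).split()
with_nothing = re.sub(r"[^a-zA-Z\s]", "", sample.lower()).split()

print(with_space)
print(with_nothing)
```

Replacing by nothing glues neighbouring words together (`stoppedthenwithout`, `wordhe`), which would create spurious vocabulary entries.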

In [9]:
my_stop_words = set(stopwords.words('english'))

# Create set of custom stopwords (optional)
my_custom_stopwords = {'thou', 'thy', 'thee', 'em', 'er', 'ti', 'la', 'ha',
                       'nay', 'etc', 'hence', 'copyright'}  # Mostly archaic English words and words from editorial notes

# Update stopwords (optional)
my_stop_words.update(my_custom_stopwords)

In [10]:
df['text_lower_no_stop'] = df['text'].parallel_apply(
    lambda row: preprocess_lower(text = row, rm_stopwords=True, stopword_set=my_stop_words)
    )


In [11]:
df.head()

Unnamed: 0,title,text,genres,text_lower_no_stop
0,apocolocyntosis,"Produced by Ted Garvin, Ben Courtney and PG Di...","[literature, read-for-school, classics, religi...",produced ted garvin ben courtney pg distribute...
1,the house on the borderland,"Produced by Suzanne Shell, Sjaani and PG Distr...","[literature, mystery, speculative-fiction, cla...",produced suzanne shell sjaani pg distributed p...
2,the warriors,"Produced by Charles Aldarondo, Charlie Kirschn...","[school, non-fiction, literary-fiction, contem...",produced charles aldarondo charlie kirschner o...
3,a voyage to the moon,"Produced by Christine De Ryck, Stig M. Valstad...","[speculative-fiction, 20th-century, science-fi...",produced christine de ryck stig valstad suzann...
4,la fiammetta,"Produced by Ted Garvin, Dave Morgan and PG Dis...","[literature, read-for-school, school, classics...",produced ted garvin dave morgan pg distributed...
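Because stopwords are removed before the tokens are joined back into a string, any bigram built from this column may pair words that were not adjacent in the original text. The sketch below illustrates this with a made-up sentence and a tiny stand-in stopword set; `ngrams` is a hypothetical helper, not part of the notebook.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tiny stand-in for the NLTK stopword list
stop = {"he", "his", "and", "to", "the"}

original = "he shook his head and turned to the door".split()
kept = [token for token in original if token not in stop]

bigrams_original = ngrams(original, 2)
bigrams_kept = ngrams(kept, 2)

print(bigrams_kept)
```

Here `shook head` only exists after stopword removal, which is worth keeping in mind when reading bigrams in the resulting dictionaries.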


## 2.2. Normalization: lemmatizing

Below, we stick to lemmatizing, as it is the normalization option that we think gives the best results in this case.

In [12]:
def preprocess_lemmatize(text):
    """
    Preprocess text by lemmatizing each token.
    The input should be a string that has already been pre-processed, with at least
    the punctuation removed.

    Returns:
        str: A string of lemmatized tokens separated by spaces.
    """

    tokens = text.split()  # Split input text based on whitespaces
    lemmatizer = WordNetLemmatizer()  # Initialize lemmatizer
    lemmatized_text = []  # Initialize empty list to store lemmatized text
    for word in tokens:
        lemmatized_text.append(lemmatizer.lemmatize(word))

    return " ".join(lemmatized_text)

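Lemmatizing with `spaCy` instead of with the `WordNetLemmatizer` of `nltk` leads to similar results. In general, however, it seems that there are no significant changes after lemmatizing the text. One likely reason is that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so verb forms such as `produced` are left unchanged.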

In [13]:
df['text_lemmatized'] = df['text_lower_no_stop'].parallel_apply(
    lambda row: preprocess_lemmatize(text = row)
    )


In [14]:
df.head()

Unnamed: 0,title,text,genres,text_lower_no_stop,text_lemmatized
0,apocolocyntosis,"Produced by Ted Garvin, Ben Courtney and PG Di...","[literature, read-for-school, classics, religi...",produced ted garvin ben courtney pg distribute...,produced ted garvin ben courtney pg distribute...
1,the house on the borderland,"Produced by Suzanne Shell, Sjaani and PG Distr...","[literature, mystery, speculative-fiction, cla...",produced suzanne shell sjaani pg distributed p...,produced suzanne shell sjaani pg distributed p...
2,the warriors,"Produced by Charles Aldarondo, Charlie Kirschn...","[school, non-fiction, literary-fiction, contem...",produced charles aldarondo charlie kirschner o...,produced charles aldarondo charlie kirschner o...
3,a voyage to the moon,"Produced by Christine De Ryck, Stig M. Valstad...","[speculative-fiction, 20th-century, science-fi...",produced christine de ryck stig valstad suzann...,produced christine de ryck stig valstad suzann...
4,la fiammetta,"Produced by Ted Garvin, Dave Morgan and PG Dis...","[literature, read-for-school, school, classics...",produced ted garvin dave morgan pg distributed...,produced ted garvin dave morgan pg distributed...


In [15]:
# Print sample text
book_index = 15
print('Original text: \n', df['text'][book_index])

Original text: 
 Produced by Juliet Sutherland, Dave Morgan and PG Distributed Proofreaders




[Illustration: Darrin's Blow Knocked the Midshipman Down]




DAVE DARRIN'S SECOND YEAR AT ANNAPOLIS

or

Two Midshipmen as Naval Academy "Youngsters"


By

H. IRVING HANCOCK
Illustrated




MCMXI




CONTENTS


CHAPTER

I. A QUESTION OF MIDSHIPMAN HONOR

II. DAVE'S PAP-SHEET ADVICE

III. MIDSHIPMAN PENNINGTON GOES TOO FAR

IV. A LITTLE MEETING ASHORE

V. WHEN THE SECONDS WONDERED

VI. IN TROUBLE ON FOREIGN SOIL

VII. PENNINGTON GETS HIS WISH

VIII. THE TRAGEDY OF THE GALE

IX. THE DESPAIR OF THE "RECALL"

X. THE GRIM WATCH FROM THE WAVES

XI. MIDSHIPMAN PENNINGTON'S ACCIDENT

XII. BACK IN THE HOME TOWN

XIII. DAN RECEIVES A FEARFUL FACER

XIV. THE FIRST HOP WITH THE HOME GIRLS

XV. A DISAGREEABLE FIRST CLASSMAN

XVI. HOW DAN FACED THE BOARD

XVII. LOSING THE TIME-KEEPER'S COUNT

XVIII. FIGHTING THE FAMOUS DOUBLE BATTLE

XIX. THE OFFICER IN CHARGE IS SHOCKED

XX. CONCLUSION




CHAPTER I


A

In [16]:
print('Lowercased text, without stopwords: \n', df['text_lower_no_stop'][book_index])

Lowercased text, without stopwords: 


In [17]:
print('Lemmatized text: \n', df['text_lemmatized'][book_index])

Lemmatized text: 


## 2.3. Vectorizing - *tf-idf*

In this step, we simply vectorize the already-preprocessed text (the vectorizer could also remove stopwords with the `stop_words` parameter, lowercase the text with `lowercase`, etc., but this has been done above). For more information, check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

- **`fit()`** learns the vocabulary from all text in the Series.
- **`transform()`** converts each row into a numerical representation based on that vocabulary.
- The output is a **sparse matrix**, which you convert to dense with `.todense()`.
- **`vectorized_text.shape`** gives the size of the document-term matrix:  
  - Rows = number of documents (i.e., number of books)  
  - Columns = number of unique words in the vocabulary  
- **`cv.get_feature_names_out()`** returns the list of terms that were extracted.
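The two steps listed above can be sketched without scikit-learn. This is only a conceptual illustration: the `fit` and `transform` functions below are hypothetical, split on whitespace, and return dense lists, whereas the real vectorizer uses its own token pattern and sparse storage.

```python
def fit(docs):
    """Learn the vocabulary: each unique token gets a column index."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    return {token: j for j, token in enumerate(vocab)}

def transform(docs, vocab):
    """Count, for each document, how often each vocabulary term occurs."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.split():
            if token in vocab:
                row[vocab[token]] += 1
        rows.append(row)
    return rows

docs = ["ship sea storm", "sea sea captain"]  # made-up documents
vocab = fit(docs)
dtm = transform(docs, vocab)

print(list(vocab))               # terms, in column order
print(dtm)                       # one row per document
print((len(dtm), len(dtm[0])))   # (documents, terms)
```

Rows correspond to documents and columns to terms, which is the same layout as the document-term matrix described above.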

### Applying *tf-idf*

In [18]:
def vectorizer(cv: Union[CountVectorizer, TfidfVectorizer], df: pd.DataFrame, column_text: str) -> tuple[np.matrix, np.ndarray]:

    # Note that we can fit the count vectorizer with a pandas series
    cv.fit(df[column_text])
    dtm = cv.transform(df[column_text])  # Create DTM

    # Return dense interpretation of sparse matrix
    dtm_dense = dtm.todense()

    # Print DTM size
    print("Document-term matrix has size", dtm_dense.shape)

    # Save extracted terms
    terms = cv.get_feature_names_out()

    return dtm_dense, terms

According to the notebook `session4_vectormath` and the one from the 3rd TA session (`vectorization_students_2025`), we can replicate the *tf-idf* function seen in class by setting the following parameters:

- `smooth_idf`: setting it to `True` adds one to every document frequency, which prevents divisions by zero for a term that is included in the vocabulary but isn't seen in any document. Below we set it to `False`; since the vocabulary is learned from the corpus itself, every term appears in at least one document.
- `sublinear_tf`: setting it to `True`, which replaces the raw term frequency `tf` by `1 + log(tf)`, is essential to replicate the regular tf-idf seen in class.

For more information, check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. 
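To make the effect of these two parameters concrete, the sketch below computes a single unnormalized weight using the formulas given in the scikit-learn documentation. The function `tfidf_weight` is hypothetical, the counts are made up, and the default L2 normalization of each row is left out, so the numbers are not comparable with the values in the final matrix.

```python
import math

def tfidf_weight(tf, df, n_docs, sublinear_tf=True, smooth_idf=False):
    """Unnormalized tf-idf weight of one term in one document."""
    if tf == 0:
        return 0.0
    # sublinear_tf=True replaces the raw count by 1 + ln(tf)
    tf_part = 1 + math.log(tf) if sublinear_tf else float(tf)
    if smooth_idf:
        # Acts as if one extra document contained every term once
        idf = math.log((1 + n_docs) / (1 + df)) + 1
    else:
        idf = math.log(n_docs / df) + 1
    return tf_part * idf

# Made-up counts: a term seen 100 times in a book, present in 500 of 9000 books
raw = tfidf_weight(100, 500, 9000, sublinear_tf=False)
sublinear = tfidf_weight(100, 500, 9000, sublinear_tf=True)
everywhere = tfidf_weight(1, 9000, 9000)

print(round(raw, 3), round(sublinear, 3), everywhere)
```

With `sublinear_tf=True`, a term repeated 100 times in a long book weighs far less than 100 times a single occurrence, which matters for a corpus of full-length books.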

In [6]:
# Create a dynamic min_df, computed as two thirds of the relative frequency
# of the least frequent genre for which we construct a dictionary

n_genres = 30  # Number of genres to consider

min_df_top_genres = genre_df['relative_frequency'][n_genres - 1] / 1.5

print(f'Min. df considered for {n_genres} genres: {min_df_top_genres}')

print(f"Relative frequency of the {n_genres}th genre: {genre_df['relative_frequency'][n_genres - 1]}")

Min. df considered for 30 genres: 0.04310821266718249
Relative frequency of the 30th genre: 0.06466231900077374


For simplicity, we round this value up and set `min_df` to 0.05 below.

In [20]:
# Step 1: initialize the tfidf vectorizer
tfidf = TfidfVectorizer(
    lowercase=False,
    stop_words=None,
    sublinear_tf=True,  # Apply tf-idf seen in class 
    smooth_idf=False, 
    ngram_range=(1,2),  # Include unigrams, bigrams
    min_df=0.05,  # Ignore terms appearing in less than 5% of the documents
    max_df=0.5,  # Ignore terms appearing in more than 50% of the documents 
    )

# Step 2: execute the function with the preprocessed (lemmatized) text
dtm_lemmatized, terms_lemmatized = vectorizer(
    cv = tfidf, df = df, column_text='text_lemmatized'
    )

Document-term matrix has size (9047, 40951)
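The fractional `min_df` and `max_df` values can be translated into document counts for this corpus of 9047 books. The sketch below assumes, as the scikit-learn documentation describes, that a float threshold is a proportion of documents and that terms strictly below `min_df` or strictly above `max_df` are dropped.

```python
import math

n_docs = 9047                 # number of rows of the document-term matrix
min_df, max_df = 0.05, 0.5    # values passed to TfidfVectorizer

# Smallest and largest document frequency a term may have to be kept
min_docs = math.ceil(min_df * n_docs)
max_docs = math.floor(max_df * n_docs)

# Approximate number of books in the 30th most frequent genre
books_30th = round(0.06466231900077374 * n_docs)

print(f"Terms kept if they appear in {min_docs} to {max_docs} books")
print(f"Books in the 30th genre: {books_30th}")
```

Since the lower threshold is below the size of the smallest genre considered, a term that occurs in most books of that genre can still enter the vocabulary.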


# 3. Dictionary generation

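Above, we have created the DTM for all of the books included in the corpus. Now, the idea is to **aggregate the *tf-idf* weights by genre**. Note, however, that this is not straightforward, since there is more than one way to aggregate:
- Adding the weights: the totals grow with the number of books in a genre, so they are not directly comparable across genres of different sizes.
- Averaging the weights: the totals are divided by the number of books in the genre, which removes this size effect.

Below, we adopt the last approach (averaging), which tends to capture those terms that are more unique for each genre.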
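The difference between adding and averaging can be seen on a toy example with one large and one small genre. All numbers below are made up, and `aggregate` is a hypothetical helper that mirrors the genre filter and aggregation described in this section.

```python
# Each book: (list of genres, {term: tf-idf weight})
books = [
    (["fiction"], {"ship": 0.1, "love": 0.2}),
    (["fiction"], {"ship": 0.1, "love": 0.2}),
    (["fiction"], {"ship": 0.1, "love": 0.2}),
    (["poetry"], {"ship": 0.0, "love": 0.5}),
]

def aggregate(genre, term, how):
    """Aggregate the weights of a term over the books tagged with a genre."""
    weights = [w.get(term, 0.0) for genres, w in books if genre in genres]
    total = sum(weights)
    return total if how == "sum" else total / len(weights)

sum_fiction = aggregate("fiction", "love", "sum")
sum_poetry = aggregate("poetry", "love", "sum")
mean_fiction = aggregate("fiction", "love", "mean")
mean_poetry = aggregate("poetry", "love", "mean")

print("sum: ", sum_fiction, sum_poetry)
print("mean:", mean_fiction, mean_poetry)
```

With the sum, `love` looks more important for the larger genre simply because it has more books; with the mean, the genre in which the term is typically heavier comes out on top.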

In [21]:
# We create a data frame from the dense document-term matrix, with columns named
# the extracted terms
dtm_df = pd.DataFrame(dtm_lemmatized, columns=terms_lemmatized)

# Append the genres column from the original data frame, considering that the
# order of the documents is preserved after applying tf-idf
dtm_df['genres'] = df['genres'].values

dtm_df.head()

Unnamed: 0,aaron,ab,aback,abandon,abandoned,abandoning,abandonment,abasement,abashed,abate,...,zealand,zealous,zealously,zenith,zephyr,zero,zest,zigzag,zone,genres
0,0.0,0.0,0.032537,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[literature, read-for-school, classics, religi..."
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013291,0.009549,0.0,...,0.0,0.0,0.0,0.010766,0.0,0.0,0.0,0.0,0.0,"[literature, mystery, speculative-fiction, cla..."
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.009385,0.0,0.009709,"[school, non-fiction, literary-fiction, contem..."
3,0.0,0.0,0.0,0.0,0.005152,0.0,0.0,0.0,0.0,0.010075,...,0.0,0.008131,0.0,0.015832,0.0,0.0,0.0,0.0,0.008626,"[speculative-fiction, 20th-century, science-fi..."
4,0.0,0.0,0.0,0.013656,0.011962,0.019864,0.0,0.0,0.019257,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[literature, read-for-school, school, classics..."


Note that the document-term matrix produced by scikit‐learn’s vectorizer preserves the order of the input documents (i.e. the order of rows in the DTM corresponds to the order in the original data frame). This characteristic is considered above to append the genres to the DTM. 

In [22]:
# Keep only the top n genres in a list
top_n_genres = genre_df['genre'][0:n_genres].tolist()
print(top_n_genres)

['fiction', 'classics', 'historical', 'non-fiction', '20th-century', 'literature', 'historical-fiction', 'novels', 'short-stories', 'romance', 'fantasy', 'american', 'literary-fiction', 'adventure', 'childrens', 'adult', 'biography', 'science-fiction', 'mystery', 'school', 'adult-fiction', 'philosophy', 'young-adult', 'unfinished', 'drama', 'poetry', 'contemporary', 'humor', 'religion', 'politics']


In [None]:
# Dictionary to store top terms for each genre.
top_terms_by_genre = {}

# Loop over each genre in the top_n_genres list
for genre in top_n_genres:

    # Select rows where the document's genres include the current genre
    genre_mask = dtm_df['genres'].apply(lambda g: genre in g)
    dtm_genre = dtm_df[genre_mask]
    
    # We drop the genres column to work only with numeric tf-idf scores.
    # Then, we aggregate the tf-idf scores for each term across all documents in this genre.
    # Here we use the mean, but for different results we could also use the sum
    # as an aggregation method
    aggregated_scores = dtm_genre.drop(columns=['genres']).mean(axis=0)  # Compute mean across rows
    
    # Sort the aggregated scores in descending order and select the top 30 terms.
    top_30_terms = aggregated_scores.sort_values(ascending=False).head(30)
    
    # Save the result for this genre.
    top_terms_by_genre[genre] = top_30_terms

# Now, top_terms_by_genre is a dictionary where each key is a genre
# and the value is a pandas Series of the top 30 terms (with their aggregated tf-idf scores).
# Display the results:
for genre, series in top_terms_by_genre.items():
    print(f"Top 30 terms for genre: {genre}")
    print(series)
    print("\n")

Top 30 terms for genre: fiction
honour        0.007794
color         0.007642
car           0.007621
nodded        0.007591
honor         0.007464
dr            0.007406
stared        0.007374
maybe         0.007320
colour        0.007310
paused        0.007256
dollar        0.007214
said mr       0.007212
aunt          0.007170
shook head    0.007115
poet          0.007080
hotel         0.007035
priest        0.006955
job           0.006944
glanced       0.006913
grey          0.006864
paris         0.006847
mary          0.006792
palace        0.006738
staring       0.006715
lad           0.006690
cousin        0.006623
mistress      0.006615
tiny          0.006589
lake          0.006582
kissed        0.006579
dtype: float64


Top 30 terms for genre: classics
honour        0.008814
colour        0.008111
poet          0.007868
aunt          0.007635
honor         0.007393
said mr       0.007332
priest        0.007321
mistress      0.007299
grey          0.007287
paused        0.00726

In [24]:
print(top_terms_by_genre)

{'fiction': honour        0.007794
color         0.007642
car           0.007621
nodded        0.007591
honor         0.007464
dr            0.007406
stared        0.007374
maybe         0.007320
colour        0.007310
paused        0.007256
dollar        0.007214
said mr       0.007212
aunt          0.007170
shook head    0.007115
poet          0.007080
hotel         0.007035
priest        0.006955
job           0.006944
glanced       0.006913
grey          0.006864
paris         0.006847
mary          0.006792
palace        0.006738
staring       0.006715
lad           0.006690
cousin        0.006623
mistress      0.006615
tiny          0.006589
lake          0.006582
kissed        0.006579
dtype: float64, 'classics': honour        0.008814
colour        0.008111
poet          0.007868
aunt          0.007635
honor         0.007393
said mr       0.007332
priest        0.007321
mistress      0.007299
grey          0.007287
paused        0.007262
color         0.007249
lad           0.0

In [25]:
# Save dictionaries into a pickle 
with open('top30_by_genre.pkl', 'wb') as f:
    pickle.dump(top_terms_by_genre, f)

In [26]:
# Save dtm with genre
dtm_df.to_csv('dtm_with_genres.csv')
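When this file is read back, the `genres` column will most likely need the same conversion as in Section 1, since a list written to CSV is stored as its string representation. The sketch below reproduces this round trip with the standard `csv` module on a made-up row; it is an illustration of the behaviour, not a check of the saved file.

```python
import ast
import csv
import io

# Write one made-up row with a list-valued cell to an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["aaron", "genres"])
writer.writerow([0.0, ["literature", "classics"]])

# Read it back: every cell is now a string
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_genres = rows[1][1]

# Restore the list from its string representation
restored = ast.literal_eval(raw_genres)

print(type(raw_genres).__name__, "->", type(restored).__name__)
```

Saving to pickle, as done for the dictionaries above, avoids this step because Python objects are restored with their original types.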