# Tasks

For each of the hotels extract the text on the description (and possibly other text metadata) and do the following:
1. Pre-process the text by removing stop words and stemming. Customize your stopword list if needed.
2. Create two wordclouds before and after pre-processing for each city (a total of four). Comment on the changes in the wordclouds.

# 0. Packages

In [17]:
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize  # For tokenizing
from nltk.stem import PorterStemmer  # For stemming
from nltk.stem import LancasterStemmer  # For stemming
from nltk.stem.snowball import SnowballStemmer  # For stemming
from nltk.stem import WordNetLemmatizer  # For lemmatizing
from nltk.corpus import stopwords  # Stopwords list
import re  # For regular expressions
from pandarallel import pandarallel  # For parallelizing pandas row operations
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from typing import Union  # Lets a parameter accept one of several types
import matplotlib.pyplot as plt  # To create word cloud
from wordcloud import WordCloud  # To create word cloud

# Other utilities
import ast
from collections import Counter
from itertools import chain
import pickle

# 1. Importing the data

In [18]:
# Original (cleaned) data 
df_og = pd.read_csv('/home/pablo/Downloads/books_and_genres_tim_cleaned.csv')

# Create a copy to have original dataset in memory, to avoid loading it again
df = df_og.copy()

df.head()

Unnamed: 0,title,text,genres,lang
0,apocolocyntosis,"Produced by Ted Garvin, Ben Courtney and PG Di...","['literature', 'read-for-school', 'classics', ...",en
1,the house on the borderland,"Produced by Suzanne Shell, Sjaani and PG Distr...","['literature', 'mystery', 'speculative-fiction...",en
2,the warriors,"Produced by Charles Aldarondo, Charlie Kirschn...","['school', 'non-fiction', 'literary-fiction', ...",en
3,a voyage to the moon,"Produced by Christine De Ryck, Stig M. Valstad...","['speculative-fiction', '20th-century', 'scien...",en
4,la fiammetta,"Produced by Ted Garvin, Dave Morgan and PG Dis...","['literature', 'read-for-school', 'school', 'c...",en


In [19]:
# Drop language indicator column
df.drop(columns = ['lang'], inplace = True)

In [20]:
print('Before conversion, genre list is stored as a string:\n', type(df.iloc[0, 2]))

Before conversion, genre list is stored as a string:
 <class 'str'>


In [21]:
# Convert the string representation to a list using ast.literal_eval
df['genres'] = df['genres'].apply(ast.literal_eval)

print('After conversion, genre list is stored as a list:\n', type(df.iloc[0, 2]))

After conversion, genre list is stored as a list:
 <class 'list'>


In [22]:
# Flatten all genre lists into one list
all_genres = list(chain.from_iterable(df['genres']))

# Use Counter to get frequencies
genre_counter = Counter(all_genres)

# Convert the counter to a data frame
genre_df = pd.DataFrame(list(genre_counter.items()), columns=['genre', 'frequency'])

# Sort the data frame by frequency (highest first)
genre_df = genre_df.sort_values(by='frequency', ascending=False).reset_index(drop = True)

# Display data frame
genre_df

Unnamed: 0,genre,frequency
0,fiction,6244
1,classics,4565
2,historical,3359
3,non-fiction,2655
4,20th-century,2629
...,...,...
94,romantic-suspense,58
95,paranormal-romance,36
96,vampires,27
97,bdsm,18


Note that the frequencies add up to more than the number of books in our dataset, since a book can be associated with more than one genre.
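A minimal sketch of why the genre frequencies exceed the number of books. The three genre lists below are made up for illustration and are not taken from the dataset:

```python
from collections import Counter
from itertools import chain

# Hypothetical genre lists for three books
toy_genres = [
    ["fiction", "classics"],
    ["fiction", "historical", "classics"],
    ["non-fiction"],
]

# Same flatten-and-count pattern as above
toy_counter = Counter(chain.from_iterable(toy_genres))
total_labels = sum(toy_counter.values())

print(toy_counter.most_common())
print(f"{len(toy_genres)} books, {total_labels} genre labels")
```

Each book contributes one count per genre it carries, so the total number of labels can only equal the number of books if every book has exactly one genre.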

# 2. Text preprocessing

In [23]:
# Initialize parallelization for pandas
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## 2.1. Tokenizing and stopword removal

Below, we lowercase the text, strip punctuation and digits, tokenize it, and optionally remove stopwords.

In [24]:
def preprocess_lower(text, rm_stopwords = False, stopword_set = None):
    """
    Preprocess text by:
       - Converting to lowercase.
       - Removing punctuation and digits.
       - Tokenizing.
       - Removing stopwords (optional).
    
    Returns:
        str: Lowercased tokens without punctuation, joined by single spaces.
    """
    text_lower = text.lower()
    # Keep only letters and whitespace
    text_no_punct = re.sub(r'[^a-zA-Z\s]', '', text_lower)
    tokens = word_tokenize(text_no_punct)
    # Remove stopwords if desired; fall back to an empty set if none was given
    if rm_stopwords:
        stopword_set = stopword_set or set()
        tokens = [token for token in tokens if token not in stopword_set]
    # We return the whole string of tokens so that we can find n-grams later
    return " ".join(tokens)

In [25]:
my_stop_words = set(stopwords.words('english'))

# Create set of custom stopwords (optional); note that {} would create a dict, not a set
my_custom_stopwords = set()

# Update stopwords (optional)
my_stop_words.update(my_custom_stopwords)
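One detail worth checking when customizing the stopword list: `preprocess_lower` strips apostrophes before filtering, so a contraction such as "you're" becomes "youre" and no longer matches a stopword entry stored with its apostrophe. The sketch below illustrates this with a tiny hypothetical stopword set and `str.split` instead of `word_tokenize`, so that it runs without NLTK data. The helper names are made up for this example.

```python
import re

# Hypothetical mini stopword list; contractions are stored with apostrophes
toy_stopwords = {"the", "and", "you're", "i've"}

def strip_then_filter(text, stopword_set):
    """Lowercase, drop non-letters, split on whitespace, then filter stopwords."""
    text = re.sub(r"[^a-zA-Z\s]", "", text.lower())
    return [tok for tok in text.split() if tok not in stopword_set]

def filter_with_normalized_stopwords(text, stopword_set):
    """Same pipeline, but the stopwords are also stored without apostrophes."""
    normalized = {re.sub(r"[^a-z]", "", word) for word in stopword_set}
    return strip_then_filter(text, stopword_set | normalized)

sample = "You're late, and I've lost the luggage!"
print(strip_then_filter(sample, toy_stopwords))
# ['youre', 'late', 'ive', 'lost', 'luggage']
print(filter_with_normalized_stopwords(sample, toy_stopwords))
# ['late', 'lost', 'luggage']
```

If the same effect applies to the real list, adding apostrophe-free variants to `my_custom_stopwords` is one way to filter tokens such as "youre" or "ive".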

In [26]:
df['text_lower_no_stop'] = df['text'].parallel_apply(
    lambda row: preprocess_lower(text = row, rm_stopwords=True, stopword_set=my_stop_words)
    )


In [27]:
df.head()

Unnamed: 0,title,text,genres,text_lower_no_stop
0,apocolocyntosis,"Produced by Ted Garvin, Ben Courtney and PG Di...","[literature, read-for-school, classics, religi...",produced ted garvin ben courtney pg distribute...
1,the house on the borderland,"Produced by Suzanne Shell, Sjaani and PG Distr...","[literature, mystery, speculative-fiction, cla...",produced suzanne shell sjaani pg distributed p...
2,the warriors,"Produced by Charles Aldarondo, Charlie Kirschn...","[school, non-fiction, literary-fiction, contem...",produced charles aldarondo charlie kirschner o...
3,a voyage to the moon,"Produced by Christine De Ryck, Stig M. Valstad...","[speculative-fiction, 20th-century, science-fi...",produced christine de ryck stig valstad suzann...
4,la fiammetta,"Produced by Ted Garvin, Dave Morgan and PG Dis...","[literature, read-for-school, school, classics...",produced ted garvin dave morgan pg distributed...


## 2.2. Normalization

### A. Stemming

In [28]:
def preprocess_stem(text, stemmer = 'porter'):
    """
    Preprocess text by applying stemming.
    The input should be a string that has already been pre-processed, with at least
    the punctuation removed.

    Returns:
        str: A string of stemmed tokens separated by spaces.

    Raises:
        ValueError: If the stemmer name is not recognized.
    """

    tokens = text.split()  # Split input text based on whitespaces

    if stemmer == 'porter':
        stem_class = PorterStemmer()
    elif stemmer == 'lancaster':
        stem_class = LancasterStemmer()
    elif stemmer == 'snowball':
        stem_class = SnowballStemmer("english")
    else:
        # Fail early: printing and continuing would leave stem_class undefined below
        raise ValueError('Stemmer type not accepted. Choose "porter", "lancaster" or "snowball".')

    stemmed_tokens = [stem_class.stem(token) for token in tokens]

    return " ".join(stemmed_tokens)

In [29]:
df['text_stemmed'] = df['text_lower_no_stop'].parallel_apply(
    lambda row: preprocess_stem(text = row, stemmer = 'porter')
    )


In [30]:
# Print sample text
book_index = 10
print('Original text: \n', df['text'][book_index])

Original text: 
 Produced by Juliet Sutherland, Mary Meehan
and the Online Distributed Proofreading Team.






                         THE HUNT BALL MYSTERY

                       BY SIR WILLIAM MAGNAY, Bt.

Author of "A Prince of Lovers," "The Mystery of the Unicorn," etc., etc.

                                 1918




Contents

Chap

      I THE INTRUDER

     II THE STAINED FLOWERS

    III THE STREAK ON THE CUFF

     IV THE MISSING GUEST

      V THE LOCKED ROOM

     VI THE MYSTERY OF CLEMENT HENSHAW

    VII THE INCREDULITY OF GERVASE HENSHAW

   VIII KELSON'S PERPLEXITY

     IX THE CLOAK OF NIGHT

      X AN ALARMING DISCOVERY

     XI GIFFORD'S COMMISSION

    XII HAD HENSHAW A CLUE?

   XIII WHAT GIFFORD SAW IN THE WOOD

    XIV GIFFORD'S PERPLEXITY

     XV ANOTHER DISCOVERY

    XVI AN EXPLANATION

   XVII WHAT A GIRL SAW

  XVIII THE LOST BROOCH

    XIX IN THE CHURCHYARD

     XX AN INVOLUNTARY EAVESDROPPER

    XXI GIFFORD CONTINUES HIS STORY

   XXII HOW GIFFORD E

In [31]:
print('Lowercased text, without stopwords: \n', df['text_lower_no_stop'][book_index])

Lowercased text, without stopwords: 


In [32]:
print('Stemmed text: \n', df['text_stemmed'][book_index])

Stemmed text: 
 produc juliet sutherland mari meehan onlin distribut proofread team hunt ball mysteri sir william magnay bt author princ lover mysteri unicorn etc etc content chap intrud ii stain flower iii streak cuff iv miss guest v lock room vi mysteri clement henshaw vii incredul gervas henshaw viii kelson perplex ix cloak night x alarm discoveri xi gifford commiss xii henshaw clue xiii gifford saw wood xiv gifford perplex xv anoth discoveri xvi explan xvii girl saw xviii lost brooch xix churchyard xx involuntari eavesdropp xxi gifford continu stori xxii gifford escap xxiii edith morriston stori xxiv stori end xxv defianc xxvi issu join xxvii gifford reward chapter intrud im afraid must gone van sir gone hugh gifford exclaim angrili busi send train till luggag put guard told luggag branchest porter protest deprecatingli see sir train nearli twenti minut late hurri get must overlook suitcas thing want owner return say kelson went address tall soldierli man stroll nice thing happen t

### B. Lemmatizing

In [33]:
def preprocess_lemmatize(text):
    """
    Preprocess text by applying lemmatization.
    The input should be a string that has already been pre-processed, with at least
    the punctuation removed.

    Returns:
        str: A string of lemmatized tokens separated by spaces.
    """

    tokens = text.split()  # Split input text based on whitespaces
    lemmatizer = WordNetLemmatizer()  # Initialize lemmatizer
    # Without a part-of-speech tag, lemmatize() treats every token as a noun
    lemmatized_text = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(lemmatized_text)

In [34]:
df['text_lemmatized'] = df['text_lower_no_stop'].parallel_apply(
    lambda row: preprocess_lemmatize(text = row)
    )


In [35]:
df.head()

Unnamed: 0,title,text,genres,text_lower_no_stop,text_stemmed,text_lemmatized
0,apocolocyntosis,"Produced by Ted Garvin, Ben Courtney and PG Di...","[literature, read-for-school, classics, religi...",produced ted garvin ben courtney pg distribute...,produc ted garvin ben courtney pg distribut pr...,produced ted garvin ben courtney pg distribute...
1,the house on the borderland,"Produced by Suzanne Shell, Sjaani and PG Distr...","[literature, mystery, speculative-fiction, cla...",produced suzanne shell sjaani pg distributed p...,produc suzann shell sjaani pg distribut proofr...,produced suzanne shell sjaani pg distributed p...
2,the warriors,"Produced by Charles Aldarondo, Charlie Kirschn...","[school, non-fiction, literary-fiction, contem...",produced charles aldarondo charlie kirschner o...,produc charl aldarondo charli kirschner onlin ...,produced charles aldarondo charlie kirschner o...
3,a voyage to the moon,"Produced by Christine De Ryck, Stig M. Valstad...","[speculative-fiction, 20th-century, science-fi...",produced christine de ryck stig valstad suzann...,produc christin de ryck stig valstad suzann l ...,produced christine de ryck stig valstad suzann...
4,la fiammetta,"Produced by Ted Garvin, Dave Morgan and PG Dis...","[literature, read-for-school, school, classics...",produced ted garvin dave morgan pg distributed...,produc ted garvin dave morgan pg distribut pro...,produced ted garvin dave morgan pg distributed...


In [36]:
# Print sample text
book_index = 15
print('Original text: \n', df['text'][book_index])

Original text: 
 Produced by Juliet Sutherland, Dave Morgan and PG Distributed Proofreaders




[Illustration: Darrin's Blow Knocked the Midshipman Down]




DAVE DARRIN'S SECOND YEAR AT ANNAPOLIS

or

Two Midshipmen as Naval Academy "Youngsters"


By

H. IRVING HANCOCK
Illustrated




MCMXI




CONTENTS


CHAPTER

I. A QUESTION OF MIDSHIPMAN HONOR

II. DAVE'S PAP-SHEET ADVICE

III. MIDSHIPMAN PENNINGTON GOES TOO FAR

IV. A LITTLE MEETING ASHORE

V. WHEN THE SECONDS WONDERED

VI. IN TROUBLE ON FOREIGN SOIL

VII. PENNINGTON GETS HIS WISH

VIII. THE TRAGEDY OF THE GALE

IX. THE DESPAIR OF THE "RECALL"

X. THE GRIM WATCH FROM THE WAVES

XI. MIDSHIPMAN PENNINGTON'S ACCIDENT

XII. BACK IN THE HOME TOWN

XIII. DAN RECEIVES A FEARFUL FACER

XIV. THE FIRST HOP WITH THE HOME GIRLS

XV. A DISAGREEABLE FIRST CLASSMAN

XVI. HOW DAN FACED THE BOARD

XVII. LOSING THE TIME-KEEPER'S COUNT

XVIII. FIGHTING THE FAMOUS DOUBLE BATTLE

XIX. THE OFFICER IN CHARGE IS SHOCKED

XX. CONCLUSION




CHAPTER I


A

In [37]:
print('Lowercased text, without stopwords: \n', df['text_lower_no_stop'][book_index])

Lowercased text, without stopwords: 


In [38]:
print('Lemmatized text: \n', df['text_lemmatized'][book_index])

Lemmatized text: 


## 2.3. Vectorizing - *tf-idf*

In this step, we only vectorize the already-preprocessed text (although the vectorizer could also remove stopwords via the `stop_words` parameter, lowercase the text via `lowercase`, etc.). For more information, check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

- **`fit()`** learns the vocabulary from all text in the Series.
- **`transform()`** converts each row into a numerical representation based on that vocabulary.
- The output is a **sparse matrix**, which you convert to dense with `.todense()`.
- **`vectorized_text.shape`** gives the size of the document-term matrix:  
  - Rows = number of documents (i.e., number of books)  
  - Columns = number of unique words in the vocabulary  
- **`cv.get_feature_names_out()`** returns the list of terms that were extracted.
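The `fit()` / `transform()` split described above can be sketched with a toy class. This is a hypothetical stand-in written only for illustration (`ToyCountVectorizer`, `terms_` and `index_` are made-up names), not scikit-learn's implementation, and it splits on whitespace instead of using a token pattern:

```python
class ToyCountVectorizer:
    """Toy illustration of the fit/transform pattern (not scikit-learn's code)."""

    def fit(self, documents):
        # Learn the vocabulary: every unique whitespace-separated token, sorted
        self.terms_ = sorted({tok for doc in documents for tok in doc.split()})
        self.index_ = {term: i for i, term in enumerate(self.terms_)}
        return self

    def transform(self, documents):
        # One row per document, one column per vocabulary term
        rows = []
        for doc in documents:
            row = [0] * len(self.terms_)
            for tok in doc.split():
                if tok in self.index_:  # tokens unseen during fit are ignored
                    row[self.index_[tok]] += 1
            rows.append(row)
        return rows

docs = ["hunt ball mystery", "mystery of the unicorn", "ball ball"]
cv = ToyCountVectorizer().fit(docs)
dtm = cv.transform(docs)

print(cv.terms_)              # ['ball', 'hunt', 'mystery', 'of', 'the', 'unicorn']
print(len(dtm), len(dtm[0]))  # 3 6
print(dtm[2])                 # [2, 0, 0, 0, 0, 0]
```

The shape (3 documents by 6 terms) mirrors what `vectorized_text.shape` reports for the real document-term matrix.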

### Applying *tf-idf*

In [39]:
def vectorizer(
    cv: Union[CountVectorizer, TfidfVectorizer], df: pd.DataFrame, column_text: str
) -> tuple[np.matrix, np.ndarray]:
    """Fit `cv` on a text column and return the dense DTM together with its terms."""

    # Note that we can fit the vectorizer directly with a pandas Series
    cv.fit(df[column_text])
    dtm = cv.transform(df[column_text])  # Create DTM (sparse)

    # Return dense representation of the sparse matrix (memory-heavy for large corpora)
    dtm_dense = dtm.todense()

    # Print DTM size
    print("Document-term matrix has size", dtm_dense.shape)

    # Save extracted terms
    terms = cv.get_feature_names_out()

    return dtm_dense, terms

According to the notebook `session4_vectormath` and the notebook from the 3rd TA session (`vectorization_students_2025`), we can replicate the *tf-idf* function seen in class by setting the following parameters:

- `smooth_idf`: setting it to `True` adds one to every document frequency, as if an extra document containing every term had been seen, which prevents division by zero when a vocabulary term appears in no document. We set it to `False` below.
- `sublinear_tf`: setting it to `True` replaces the raw term frequency `tf` with `1 + log(tf)`, which is essential to replicate the regular tf-idf seen in class.

For more information, check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. 
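To make the two parameters concrete, here is a small worked sketch of the formulas documented for scikit-learn's `TfidfVectorizer`, written with the standard library only. The helper names `idf` and `tf` and the example counts are illustrative, and whether these formulas match the version seen in class exactly is an assumption. The sketch also omits the L2 row normalization that the vectorizer applies by default.

```python
import math

def idf(n_docs, doc_freq, smooth_idf):
    """Inverse document frequency as documented for scikit-learn."""
    if smooth_idf:
        # Acts as if one extra document contained every term
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

def tf(count, sublinear_tf):
    """Term-frequency component: raw count, or 1 + ln(count) when sublinear."""
    if count == 0:
        return 0.0
    return 1 + math.log(count) if sublinear_tf else float(count)

n_docs = 100  # illustrative corpus size
for count, doc_freq in [(1, 5), (10, 5), (10, 50)]:
    weight = tf(count, sublinear_tf=True) * idf(n_docs, doc_freq, smooth_idf=False)
    print(f"count={count:>2}, df={doc_freq:>2} -> unnormalized weight {weight:.3f}")
```

With `sublinear_tf=True`, a term that occurs ten times in a book weighs about 3.3 times as much as a term that occurs once, rather than ten times as much, which matters for book-length documents.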

In [None]:
tfidf = TfidfVectorizer(
    lowercase=False,
    stop_words=None,
    sublinear_tf=True,  # Apply tf-idf seen in class 
    smooth_idf=False,
    ngram_range = (1,2),  # Include unigrams, bigrams
    min_df=0.05,  # Ignore terms appearing in less than 5% of the documents
    max_df=0.5,  # Ignore terms appearing in more than 50% of the documents 
    )

dtm_lower, terms_lower = vectorizer(
    cv = tfidf, df = df, column_text='text_lower_no_stop'
    )

In [None]:
# Step 1: initialize the tfidf vectorizer
tfidf = TfidfVectorizer(
    lowercase=False,
    stop_words=None,
    sublinear_tf=True,  # Apply tf-idf seen in class 
    smooth_idf=False, 
    ngram_range=(1,2),  # Include unigrams, bigrams
    min_df=0.05,  # Ignore terms appearing in less than 5% of the documents
    max_df=0.5,  # Ignore terms appearing in more than 50% of the documents 
    )

# Step 2: execute the function with a different preprocessed text column
dtm_stemmed, terms_stemmed = vectorizer(
    cv = tfidf, df = df, column_text='text_stemmed'
    )

Document-term matrix has size (100, 57862)


In [None]:
# Step 1: initialize the tfidf vectorizer
tfidf = TfidfVectorizer(
    lowercase=False,
    stop_words=None,
    sublinear_tf=True,  # Apply tf-idf seen in class 
    smooth_idf=False, 
    ngram_range=(1,2),  # Include unigrams, bigrams
    min_df=0.05,  # Ignore terms appearing in less than 5% of the documents
    max_df=0.5,  # Ignore terms appearing in more than 50% of the documents 
    )

# Step 2: execute the function with a different preprocessed text column
dtm_lemmatized, terms_lemmatized = vectorizer(
    cv = tfidf, df = df, column_text='text_lemmatized'
    )

Document-term matrix has size (100, 54802)


In [None]:
dtm_lemmatized

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.01621914],
        [0.        , 0.        , 0.02567191, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.00726339,
         0.        ]])

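Unexpected outcome: the number of terms after stemming or lemmatizing is larger than for the lowercased text without stopwords. A possible explanation is the document-frequency filter: normalization merges word forms, so a merged term (or a bigram containing it) can reach the `min_df=0.05` threshold even if none of its individual forms did.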
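One mechanism that could produce a larger vocabulary after normalization is the `min_df` filter. The sketch below uses a hypothetical crude suffix stripper (`toy_normalize`) in place of a real stemmer and a made-up four-document corpus. It illustrates the mechanism only and does not establish that this is what happened here.

```python
def vocabulary_after_min_df(documents, min_docs):
    """Return the sorted terms that appear in at least `min_docs` documents."""
    doc_freq = {}
    for doc in documents:
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)

def toy_normalize(doc):
    """Hypothetical crude suffix stripper, standing in for a real stemmer."""
    out = []
    for tok in doc.split():
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return " ".join(out)

raw_docs = ["mystery hunt walked", "mystery hunts walking", "hunting walks", "ball"]
normalized_docs = [toy_normalize(doc) for doc in raw_docs]

print(vocabulary_after_min_df(raw_docs, min_docs=2))         # ['mystery']
print(vocabulary_after_min_df(normalized_docs, min_docs=2))  # ['hunt', 'mystery', 'walk']
```

Before normalization each form of "hunt" and "walk" occurs in a single document and is dropped. After merging, both clear the threshold, so the filtered vocabulary grows.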

# 3. Dictionary generation

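Above, we have created the DTM for all of the books included in the corpus. Now, the idea is to **aggregate the *tf-idf* weights by genre**. Note, however, that this is not straightforward, since each aggregation method has potential issues:
- Adding the weights: the totals grow with the number of books in a genre, so large genres accumulate higher scores.
- Averaging the weights: the scores are normalized by the number of books in the genre, but a term that is prominent in only a few of its books is diluted.

Below, we adopt the second approach (the code uses the mean), which tends to capture those terms that are more unique for each genre.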
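The difference between adding and averaging can be seen on a toy example. All weights and genre assignments below are made up, and `aggregate` is a hypothetical helper that mirrors the per-genre selection used in the notebook with plain dictionaries instead of a data frame:

```python
# Toy tf-idf rows: (genres of the book, {term: weight}); all values are made up
books = [
    (["fiction"], {"ship": 0.25, "love": 0.5}),
    (["fiction"], {"ship": 0.25, "love": 0.0}),
    (["fiction"], {"ship": 0.5, "love": 0.25}),
    (["fiction", "adventure"], {"ship": 0.75, "love": 0.25}),
    (["adventure"], {"ship": 0.75, "love": 0.0}),
]

def aggregate(books, genre, how):
    """Aggregate term weights over all books tagged with `genre`."""
    rows = [weights for genres, weights in books if genre in genres]
    terms = sorted({term for weights in rows for term in weights})
    totals = {term: sum(w.get(term, 0.0) for w in rows) for term in terms}
    if how == "sum":
        return totals
    return {term: totals[term] / len(rows) for term in terms}

for how in ("sum", "mean"):
    fiction = aggregate(books, "fiction", how)["ship"]
    adventure = aggregate(books, "adventure", how)["ship"]
    print(f"{how:>4}: ship in fiction = {fiction}, ship in adventure = {adventure}")
```

With the sum, "ship" scores higher for fiction (1.75 against 1.5) only because fiction has more books. With the mean, adventure comes out ahead (0.75 against 0.4375), reflecting that its books use the term more heavily on average.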

In [None]:
# We create a data frame from the dense document-term matrix, with columns named
# the extracted terms
dtm_df = pd.DataFrame(dtm_lemmatized, columns=terms_lemmatized)

# Append the genres column from the original data frame, considering that the
# order of the documents is preserved after applying tf-idf
dtm_df['genres'] = df['genres'].values

dtm_df.head()

Unnamed: 0,aa,aaron,ab,aback,abandon,abandoned,abandoning,abandonment,abashed,abate,...,zealous,zealously,zenith,zero,zest,zeus,zigzag,zone,zones,genres
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[mystery, adult, love, romance, mystery-thrill..."
1,0.0,0.0,0.0,0.0,0.020328,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[american, amazon, non-fiction, economics, fic..."
2,0.0,0.0,0.0,0.0,0.0,0.01322,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[mythology]
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[middle-grade, classics, biography, fiction, s..."
4,0.0,0.050728,0.0,0.0,0.0,0.0,0.0,0.014189,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[non-fiction, christian, fiction, college]"


Note that the document-term matrix produced by scikit‐learn’s vectorizer preserves the order of the input documents (i.e. the order of rows in the DTM corresponds to the order in the original data frame). This characteristic is considered above to append the genres to the DTM. 

In [None]:
# Keep only the top 20 genres in a list
top_20_genres = genre_df['genre'][0:20].tolist()
print(top_20_genres)

['fiction', 'classics', 'historical', 'non-fiction', '20th-century', 'literature', 'historical-fiction', 'novels', 'young-adult', 'adventure', 'romance', 'adult', 'philosophy', 'adult-fiction', 'fantasy', 'school', 'science-fiction', 'humor', 'biography', 'literary-fiction']


In [None]:
# Dictionary to store top terms for each genre.
top_terms_by_genre = {}

# Loop over each genre in the top_20_genres list
for genre in top_20_genres:

    # Select rows where the document's genres include the current genre
    genre_mask = dtm_df['genres'].apply(lambda g: genre in g)
    dtm_genre = dtm_df[genre_mask]
    
    # We drop the genres column to work only with numeric tf-idf scores.
    # Then, we aggregate the tf-idf scores for each term across all documents in this genre.
    # Here we use the mean, but for different results we could also use the sum
    # as an aggregation method
    aggregated_scores = dtm_genre.drop(columns=['genres']).mean(axis=0)  # Compute mean across rows
    
    # Sort the aggregated scores in descending order and select the top 30 terms.
    top_30_terms = aggregated_scores.sort_values(ascending=False).head(30)
    
    # Save the result for this genre.
    top_terms_by_genre[genre] = top_30_terms

# Now, top_terms_by_genre is a dictionary where each key is a genre
# and the value is a pandas Series of the top 30 terms (with their aggregated tf-idf scores).
# For example, to display the results:
for genre, series in top_terms_by_genre.items():
    print(f"Top 30 terms for genre: {genre}")
    print(series)
    print("\n")

Top 30 terms for genre: fiction
mrs        0.011630
youre      0.011213
ive        0.010584
wouldnt    0.009691
id         0.009242
youll      0.009209
couldnt    0.009146
wasnt      0.008715
isnt       0.008653
ye         0.008650
thee       0.008234
hed        0.008213
thy        0.007785
youd       0.007771
whats      0.007676
em         0.007525
dr         0.007443
youve      0.007347
thou       0.007319
tea        0.007291
shes       0.007217
kitchen    0.007043
uncle      0.007036
hadnt      0.007008
lake       0.006960
car        0.006953
havent     0.006940
kings      0.006714
th         0.006698
bible      0.006665
dtype: float64


Top 30 terms for genre: classics
ye         0.011185
mrs        0.011096
ive        0.010985
youre      0.010898
thee       0.010844
thy        0.010148
id         0.009759
youll      0.009611
kings      0.009577
hath       0.009382
thou       0.009360
aint       0.009269
lake       0.008927
tis        0.008884
wouldnt    0.008868
whats      0.00861

In [None]:
print(top_terms_by_genre)

{'fiction': mrs        0.011630
youre      0.011213
ive        0.010584
wouldnt    0.009691
id         0.009242
youll      0.009209
couldnt    0.009146
wasnt      0.008715
isnt       0.008653
ye         0.008650
thee       0.008234
hed        0.008213
thy        0.007785
youd       0.007771
whats      0.007676
em         0.007525
dr         0.007443
youve      0.007347
thou       0.007319
tea        0.007291
shes       0.007217
kitchen    0.007043
uncle      0.007036
hadnt      0.007008
lake       0.006960
car        0.006953
havent     0.006940
kings      0.006714
th         0.006698
bible      0.006665
dtype: float64, 'classics': ye         0.011185
mrs        0.011096
ive        0.010985
youre      0.010898
thee       0.010844
thy        0.010148
id         0.009759
youll      0.009611
kings      0.009577
hath       0.009382
thou       0.009360
aint       0.009269
lake       0.008927
tis        0.008884
wouldnt    0.008868
whats      0.008611
isnt       0.008603
em         0.008468


In [None]:
# Save dictionaries into a pickle 
with open('top30_by_genre.pkl', 'wb') as f:
    pickle.dump(top_terms_by_genre, f)