# Tasks

For each of the hotels extract the text on the description (and possibly other text metadata) and do the following:
1. Pre-process the text by removing stop words and stemming. Customize your stopword list if needed.
2. Create two wordclouds before and after pre-processing for each city (a total of four). Comment on the changes in the wordclouds.

# 0. Packages

In [139]:
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize  # For tokenizing
from nltk.stem import PorterStemmer  # For stemming
from nltk.stem import LancasterStemmer  # For stemming
from nltk.stem.snowball import SnowballStemmer  # For stemming
from nltk.stem import WordNetLemmatizer  # For lemmatizing
from nltk.corpus import stopwords  # Stopwords list
import re  # For regex expressions
from pandarallel import pandarallel  # For parallelizing pandas row operations
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from typing import Union  # Lets a function parameter accept one of several types
import matplotlib.pyplot as plt  # To create word cloud
from wordcloud import WordCloud  # To create word cloud

# Other utilities
import ast
from collections import Counter
from itertools import chain

# 1. Importing the data

In [None]:
# # Original (cleaned) data 
# df_og = pd.read_csv('/home/pablo/Downloads/books_and_genres_tim_cleaned.csv')

# # Create a copy to have original dataset in memory, to avoid loading it again
# df = df_og.copy()

In [None]:
# # Create and save a subsample of books to work with
# df_sample = df_og.sample(100)

# # Save to .csv
# df_sample.to_csv('books_and_genres_tim_cleaned_100sample.csv', index = False)

In [157]:
# Import sample of books
df_sample = pd.read_csv('books_and_genres_tim_cleaned_100sample.csv')

df_sample.head()

Unnamed: 0,title,text,genres,lang
0,the wreck,"Produced by Marilynda Fraser-Cunliffe, LN Yadd...","['mystery', 'adult', 'love', 'romance', 'myste...",en
1,the robbers,Produced by David Widger\n\n\n\n\n\n ...,"['american', 'amazon', 'non-fiction', 'economi...",en
2,on and near the delaware,Produced by David Widger\n\n\n\n\n\n ...,['mythology'],en
3,the scotch twins,Produced by Lynn Hill and Luana Rodriquez. HT...,"['middle-grade', 'classics', 'biography', 'fic...",en
4,epistle to the hebrews,"Produced by Colin Bell, Thomas Strong and the ...","['non-fiction', 'christian', 'fiction', 'colle...",en


In [158]:
print('Before conversion, genre list is stored as a string:\n', type(df_sample.iloc[0, 2]))

Before conversion, genre list is stored as a string:
 <class 'str'>


In [159]:
# Convert the string representation to a list using ast.literal_eval
df_sample['genres'] = df_sample['genres'].apply(ast.literal_eval)

print('After conversion, genre list is stored as a list:\n', type(df_sample.iloc[0, 2]))

After conversion, genre list is stored as a list:
 <class 'list'>
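
The conversion above is needed because a CSV file stores every cell as text, so a list column comes back as its string representation. A minimal sketch of that round trip, using a made-up value (`raw_cell` is illustrative and not taken from the dataset):

```python
import ast

# Illustrative cell value, as it would look right after reading the CSV
raw_cell = "['mystery', 'adult', 'romance']"

# Indexing the raw string yields a character, not a genre
first_item_before = raw_cell[0]

# literal_eval only parses Python literals (lists, strings, numbers, ...),
# so unlike eval() it does not execute arbitrary code
parsed = ast.literal_eval(raw_cell)
first_item_after = parsed[0]

print(type(raw_cell).__name__, repr(first_item_before))
print(type(parsed).__name__, repr(first_item_after))
```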


In [121]:
# Flatten all genre lists into one list
all_genres = list(chain.from_iterable(df_sample['genres']))

# Use Counter to get frequencies
genre_counter = Counter(all_genres)

# Convert the counter to a data frame
genre_df = pd.DataFrame(list(genre_counter.items()), columns=['genre', 'frequency'])

# Sort the data frame by frequency (highest first)
genre_df = genre_df.sort_values(by='frequency', ascending=False).reset_index(drop = True)

# Display data frame
genre_df

Unnamed: 0,genre,frequency
0,fiction,76
1,classics,51
2,historical,45
3,non-fiction,33
4,20th-century,29
...,...,...
86,romantic-suspense,1
87,modern,1
88,self-help,1
89,writing,1


Note that the frequencies add up to more than the number of books in our dataset, since a book can be associated with more than one genre.

# 2. Text preprocessing

In [122]:
# Initialize parallelization for pandas
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## 2.1. Tokenizing and stopword removal

Below, we lowercase the text, strip punctuation and digits, tokenize it and (optionally) remove stopwords.

In [123]:
def preprocess_lower(text, rm_stopwords = False, stopword_set = None):
    """
    Preprocess text by:
       - Converting to lowercase.
       - Removing punctuation and digits.
       - Tokenizing.
       - Removing stopwords (optional).
    
    Returns:
        str: The lowercased tokens, without punctuation, joined by spaces.
    """
    text_lower = text.lower()
    # Keep only letters and whitespace (apostrophes are dropped too, so "don't" becomes "dont")
    text_no_punct = re.sub(r'[^a-zA-Z\s]', '', text_lower)
    tokens = word_tokenize(text_no_punct)
    # Remove stopwords if desired
    if rm_stopwords:
        stopword_set = stopword_set or set()  # Avoid a TypeError when no set is given
        tokens = [token for token in tokens if token not in stopword_set]
    # We return the tokens as a single string so that we can find n-grams later
    return " ".join(tokens)
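
One side effect of this order of operations is worth illustrating: punctuation is stripped before stopwords are filtered, so contractions lose their apostrophe and no longer match stopword entries that keep it. The sketch below uses a tiny hypothetical stopword set and plain `str.split` instead of `word_tokenize`, so it runs without any NLTK data. Both helper functions are illustrative and not part of the notebook.

```python
import re

# Hypothetical mini stopword set (assumption: stopword lists often store
# contractions with their apostrophe, e.g. "don't")
mini_stopwords = {"you're", "don't", "the"}

def strip_then_filter(text, stopword_set):
    """Same order as preprocess_lower: drop punctuation first, filter afterwards."""
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    return [tok for tok in cleaned.split() if tok not in stopword_set]

def filter_then_strip(text, stopword_set):
    """Alternative order: filter stopwords while apostrophes are still present."""
    kept = [tok for tok in text.lower().split()
            if tok.strip(".,;:!?") not in stopword_set]
    stripped = [re.sub(r"[^a-z]", "", tok) for tok in kept]
    return [tok for tok in stripped if tok]

sample = "You're late, don't miss the train!"
print(strip_then_filter(sample, mini_stopwords))
print(filter_then_strip(sample, mini_stopwords))
```

With the first order, `youre` and `dont` survive the filter. This may be why contractions such as `youre` and `wouldnt` rank high in the term lists further down; adding the apostrophe-free forms to `my_custom_stopwords` is one possible remedy.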

In [124]:
my_stop_words = set(stopwords.words('english'))

# Create set of custom stopwords (optional)
my_custom_stopwords = set()  # Note: {} would create an empty dict, not a set

# Update stopwords (optional)
my_stop_words.update(my_custom_stopwords)

In [125]:
df_sample['text_lower_no_stop'] = df_sample['text'].parallel_apply(
    lambda row: preprocess_lower(text = row, rm_stopwords=True, stopword_set=my_stop_words)
    )

In [86]:
df_sample.head()

Unnamed: 0,title,text,genres,lang,text_lower_no_stop
0,the wreck,"Produced by Marilynda Fraser-Cunliffe, LN Yadd...","[mystery, adult, love, romance, mystery-thrill...",en,produced marilynda frasercunliffe ln yaddanapu...
1,the robbers,Produced by David Widger\n\n\n\n\n\n ...,"[american, amazon, non-fiction, economics, fic...",en,produced david widger robbers frederich schill...
2,on and near the delaware,Produced by David Widger\n\n\n\n\n\n ...,[mythology],en,produced david widger myths legends land charl...
3,the scotch twins,Produced by Lynn Hill and Luana Rodriquez. HT...,"[middle-grade, classics, biography, fiction, s...",en,produced lynn hill luana rodriquez html versio...
4,epistle to the hebrews,"Produced by Colin Bell, Thomas Strong and the ...","[non-fiction, christian, fiction, college]",en,produced colin bell thomas strong online distr...


## 2.2. Normalization

### A. Stemming

In [126]:
def preprocess_stem(text, stemmer = 'porter'):
    """
    Preprocess text by applying stemming.
    The input should be a string that has already been pre-processed, with at least
    the punctuation removed.

    Returns:
        str: A string of stemmed tokens separated by spaces.
    """

    tokens = text.split()  # Split input text based on whitespaces

    if stemmer == 'porter':
        stem_class = PorterStemmer()
    elif stemmer == 'lancaster':
        stem_class = LancasterStemmer()
    elif stemmer == 'snowball':
        stem_class = SnowballStemmer("english")
    else:
        # Fail early: otherwise stem_class would be undefined below
        raise ValueError('Stemmer type not accepted. Choose "porter", "lancaster" or "snowball".')

    stemmed_tokens = [stem_class.stem(token) for token in tokens]

    return " ".join(stemmed_tokens)
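
To get a feel for how the three stemmers accepted by `preprocess_stem` differ, the sketch below runs them side by side on a handful of words. The word list is an arbitrary choice for illustration; none of these stemmers needs extra NLTK data to be downloaded.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# Same three options that preprocess_stem accepts
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

# Arbitrary illustrative words
words = ["running", "beautiful", "publishing", "cities"]

# Map each stemmer name to the stems it produces for the words above
stem_table = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, stems in stem_table.items():
    print(f"{name:<10}", stems)
```

Stems are not guaranteed to be dictionary words (Porter turns "beautiful" into "beauti"), which is worth keeping in mind when reading word clouds or term lists built on stemmed text.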

In [127]:
df_sample['text_stemmed'] = df_sample['text_lower_no_stop'].parallel_apply(
    lambda row: preprocess_stem(text = row, stemmer = 'porter')
    )

In [89]:
# Print sample text
book_index = 10
print('Original text: \n', df_sample['text'][book_index])

Original text: 
 Produced by K Nordquist, Jacqueline Jeremy and the Online
Distributed Proofreading Team at http://www.pgdp.net (This
book was produced from scanned images of public domain
material from the Google Print project.)









    THE JOB

    AN AMERICAN NOVEL

    BY
    SINCLAIR LEWIS

    AUTHOR OF MAIN STREET, BABBITT, ETC.

    GROSSET & DUNLAP
    PUBLISHERS     NEW YORK

    Made in the United States of America


    Copyright, 1917, by Harper & Brothers
    Printed in the United States of America
    Published February, 1917




    TO

    MY WIFE

    WHO HAS MADE "THE JOB" POSSIBLE AND LIFE ITSELF
    QUITE BEAUTIFULLY IMPROBABLE




    CONTENTS


                      Page

    Part I               3
    THE CITY

    Part II            133
    THE OFFICE

    Part III           251
    MAN AND WOMAN




Part I

THE CITY

CHAPTER I


Captain Lew Golden would have saved any foreign observer a great deal of
trouble in studying America. He was an almost perfect t

In [90]:
print('Lowercased text, without stopwords: \n', df_sample['text_lower_no_stop'][book_index])

Lowercased text, without stopwords: 


In [91]:
print('Stemmed text: \n', df_sample['text_stemmed'][book_index])

Stemmed text: 
 produc k nordquist jacquelin jeremi onlin distribut proofread team httpwwwpgdpnet book produc scan imag public domain materi googl print project job american novel sinclair lewi author main street babbitt etc grosset dunlap publish new york made unit state america copyright harper brother print unit state america publish februari wife made job possibl life quit beauti improb content page part citi part ii offic part iii man woman part citi chapter captain lew golden would save foreign observ great deal troubl studi america almost perfect type petti smalltown middleclass lawyer live panama pennsylvania never captain anyth except crescent volunt fire compani own titl collect rent wrote insur meddl lawsuit carri quit visibl mustachecomb wore collar tie warm day appear street shirtsleev discuss compar temperatur past thirti year doctor smith mansion hous busdriv never use word beauti except refer setter dogbeauti word music faith rebellion exist rather fanci larg ambiti ban

### B. Lemmatizing

In [128]:
def preprocess_lemmatize(text):
    """
    Preprocess text by applying lemmatization.
    The input should be a string that has already been pre-processed, with at least
    the punctuation removed.

    Returns:
        str: A string of lemmatized tokens separated by spaces.
    """

    tokens = text.split()  # Split input text based on whitespaces
    lemmatizer = WordNetLemmatizer()  # Initialize lemmatizer
    # Without a POS tag, lemmatize() treats every token as a noun,
    # so verb forms such as "produced" are left unchanged
    lemmatized_text = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(lemmatized_text)

In [129]:
df_sample['text_lemmatized'] = df_sample['text_lower_no_stop'].parallel_apply(
    lambda row: preprocess_lemmatize(text = row)
    )

In [94]:
df_sample.head()

Unnamed: 0,title,text,genres,lang,text_lower_no_stop,text_stemmed,text_lemmatized
0,the wreck,"Produced by Marilynda Fraser-Cunliffe, LN Yadd...","[mystery, adult, love, romance, mystery-thrill...",en,produced marilynda frasercunliffe ln yaddanapu...,produc marilynda frasercunliff ln yaddanapudi ...,produced marilynda frasercunliffe ln yaddanapu...
1,the robbers,Produced by David Widger\n\n\n\n\n\n ...,"[american, amazon, non-fiction, economics, fic...",en,produced david widger robbers frederich schill...,produc david widger robber frederich schiller ...,produced david widger robber frederich schille...
2,on and near the delaware,Produced by David Widger\n\n\n\n\n\n ...,[mythology],en,produced david widger myths legends land charl...,produc david widger myth legend land charl ski...,produced david widger myth legend land charles...
3,the scotch twins,Produced by Lynn Hill and Luana Rodriquez. HT...,"[middle-grade, classics, biography, fiction, s...",en,produced lynn hill luana rodriquez html versio...,produc lynn hill luana rodriquez html version ...,produced lynn hill luana rodriquez html versio...
4,epistle to the hebrews,"Produced by Colin Bell, Thomas Strong and the ...","[non-fiction, christian, fiction, college]",en,produced colin bell thomas strong online distr...,produc colin bell thoma strong onlin distribut...,produced colin bell thomas strong online distr...


In [95]:
# Print sample text
book_index = 15
print('Original text: \n', df_sample['text'][book_index])

Original text: 
 Produced by Juliet Sutherland, Charles Franks
and the Online Distributed Proofreading Team




TRY AND TRUST

Or, Abner Holden's Bound Boy

BY

HORATIO ALGER, JR.
AUTHOR OF "PAUL THE PEDDLER," "FROM FARM BOY TO
SENATOR," "SLOW AND SURE," ETC.

THE MERSHON COMPANY
RAHWAY, N.J.          NEW YORK


TO MY YOUNG FRIEND,

A. FLORIAN HENRIQUES
(BOISIE),

THIS VOLUME IS AFFECTIONATELY DEDICATED



CONTENTS

I.     AROUND THE BREAKFAST TABLE
II.    INTRODUCING THE HERO
III.   A COLLISION
IV.    A DISAGREEABLE SURPRISE
V.     THE ENVELOPE
VI.    ON THE WAY
VII.   A NEW HOME
VIII.  THE GHOST IN THE ATTIC
IX.    EXPOSING A FRAUD
X.     THE CLOUDS GATHER
XI.    A CRISIS
XII.   RALPH THE RANGER
XIII.  A MOMENT OF PERIL
XIV.   TAKEN PRISONER
XV.    A FOUR-FOOTED FOE
XVI.   JUST TOO LATE
XVII.  NEW ACQUAINTANCES
XVIII. A YOUNG ARISTOCRAT
XIX.   A SUSPICIOUS CHARACTER
XX.    FACING A BURGLAR
XXI.   HERBERT'S REWARD
XXII.  ROBBED IN THE NIGHT
XXIII. A BUSINESS CALL
XXIV.  FINDING A BOAR

In [96]:
print('Lowercased text, without stopwords: \n', df_sample['text_lower_no_stop'][book_index])

Lowercased text, without stopwords: 


In [97]:
print('Lemmatized text: \n', df_sample['text_lemmatized'][book_index])

Lemmatized text: 


## 2.3. Vectorizing - *tf-idf*

In this step, we just vectorize the already-preprocessed text (though we could remove stopwords with the parameters `stop_words`, lowercase the text with `lowercase`, etc.). For more information, check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

### Example of construction of a simple dictionary (counts of terms)

#### Vectorization

In [107]:
cv = CountVectorizer(
    ngram_range = (1,2),  # Include unigrams and bigrams
    min_df=0.05,  # Ignore terms appearing in less than 5% of the documents
    max_df=0.5,  # Ignore terms appearing in more than 50% of the documents 
    lowercase=False, 
    stop_words=None)

# Note that we can fit the count vectorizer with a pandas series
cv.fit(df_sample['text_lower_no_stop'])
vectorized_text = cv.transform(df_sample['text_lower_no_stop'])

# Dense representation of the sparse matrix
vectorized_text_dense = vectorized_text.todense()

# Print DTM size
print("Document-term matrix has size", vectorized_text.shape)

# Print terms extracted
terms = cv.get_feature_names_out()
print(terms)

Document-term matrix has size (100, 53634)
['aa' 'aaron' 'ab' ... 'zigzag' 'zone' 'zones']


- **`fit()`** learns the vocabulary from all text in the Series.
- **`transform()`** converts each row into a numerical representation based on that vocabulary.
- The output is a **sparse matrix**, which you convert to dense with `.todense()`.
- **`vectorized_text.shape`** gives the size of the document-term matrix:  
  - Rows = number of documents (i.e., number of books)  
  - Columns = number of unique terms (unigrams and bigrams here) kept in the vocabulary  
- **`cv.get_feature_names_out()`** returns the list of terms that were extracted.

#### Simple dictionary from the counts

In [108]:
# Calculate term frequencies (total counts across all documents)
term_frequencies = vectorized_text.sum(axis=0).A1  # Convert to 1D array

# Create a DataFrame for easier handling
df_terms_description = pd.DataFrame({
    'term': terms,
    'frequency': term_frequencies
})

# Sort the DataFrame by frequency in descending order
df_terms_description = df_terms_description.sort_values(by='frequency', ascending=False).reset_index(drop=True)

# Assign an ID from 1 (most frequent) to V (least frequent)
df_terms_description['id'] = df_terms_description.index + 1

# Display the top 10 terms as a sanity check
df_terms_description.head(10)

Unnamed: 0,term,frequency,id
0,mrs,3065,1
1,thou,1723,2
2,th,1567,3
3,ye,1448,4
4,thy,1404,5
5,ive,1290,6
6,jo,1237,7
7,roger,1187,8
8,german,1092,9
9,thee,1071,10


### Applying *tf-idf*

In [130]:
def vectorizer(cv: Union[CountVectorizer, TfidfVectorizer], df: pd.DataFrame, column_text: str) -> tuple:
    """
    Fit a vectorizer on a text column and build the document-term matrix (DTM).

    Returns:
        tuple: The dense DTM and the array of extracted terms.
    """
    # Note that we can fit the vectorizer with a pandas series
    cv.fit(df[column_text])
    dtm = cv.transform(df[column_text])  # Create DTM

    # Dense representation of the sparse matrix
    dtm_dense = dtm.todense()

    # Print DTM size
    print("Document-term matrix has size", dtm_dense.shape)

    # Save extracted terms
    terms = cv.get_feature_names_out()

    return dtm_dense, terms

According to the notebooks `session4_vectormath` and the one from the 3rd TA session (`vectorization_students_2025`), we can replicate the *tf-idf* function seen in class by setting the following parameters:

- `smooth_idf`: setting it to `True` adds one to every document frequency, as if an extra document contained every term once, which prevents divisions by zero. We set it to `False` below.
- `sublinear_tf`: setting it to `True` replaces the raw term frequency `tf` with `1 + log(tf)`, which is essential to replicate the regular tf-idf seen in class.

For more information, check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. 
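
To make the effect of these two parameters concrete, the sketch below recomputes the weights by hand on a made-up corpus and compares them with `TfidfVectorizer`. It follows scikit-learn's documented behaviour for `sublinear_tf=True` and `smooth_idf=False` (tf becomes `1 + ln(tf)`, idf is `ln(n / df) + 1`, and rows are L2-normalized by default). Whether this matches the class formula term by term is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration
toy_docs = ["cat cat dog", "dog bird", "cat bird bird bird"]

# Raw counts, one row per document and one column per term
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # Number of documents containing each term

# Sublinear tf: 1 + ln(tf) for non-zero counts, 0 otherwise
safe_counts = np.where(counts > 0, counts, 1.0)  # Avoid log(0)
tf = np.where(counts > 0, 1.0 + np.log(safe_counts), 0.0)

# Unsmoothed idf: ln(n / df) + 1
idf = np.log(n_docs / doc_freq) + 1.0

# Weight, then L2-normalize each row (the default norm in TfidfVectorizer)
manual = tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sk_weights = TfidfVectorizer(sublinear_tf=True, smooth_idf=False).fit_transform(toy_docs).toarray()

print(np.allclose(manual, sk_weights))
```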

In [131]:
tfidf = TfidfVectorizer(
    lowercase=False,
    stop_words=None,
    sublinear_tf=True,  # Apply tf-idf seen in class 
    smooth_idf=False,
    ngram_range = (1,2),  # Include unigrams, bigrams
    min_df=0.05,  # Ignore terms appearing in less than 5% of the documents
    max_df=0.5,  # Ignore terms appearing in more than 50% of the documents 
    )

dtm_lower, terms_lower = vectorizer(
    cv = tfidf, df = df_sample, column_text='text_lower_no_stop'
    )

Document-term matrix has size (100, 53634)


In [132]:
# Step 1: initialize the tfidf vectorizer
tfidf = TfidfVectorizer(
    lowercase=False,
    stop_words=None,
    sublinear_tf=True,  # Apply tf-idf seen in class 
    smooth_idf=False, 
    ngram_range=(1,2),  # Include unigrams, bigrams
    min_df=0.05,  # Ignore terms appearing in less than 5% of the documents
    max_df=0.5,  # Ignore terms appearing in more than 50% of the documents 
    )

# Step 2: execute the function with the stemmed text
dtm_stemmed, terms_stemmed = vectorizer(
    cv = tfidf, df = df_sample, column_text='text_stemmed'
    )

Document-term matrix has size (100, 57862)


In [None]:
# Step 1: initialize the tfidf vectorizer
tfidf = TfidfVectorizer(
    lowercase=False,
    stop_words=None,
    sublinear_tf=True,  # Apply tf-idf seen in class 
    smooth_idf=False, 
    ngram_range=(1,2),  # Include unigrams, bigrams
    min_df=0.05,  # Ignore terms appearing in less than 5% of the documents
    max_df=0.5,  # Ignore terms appearing in more than 50% of the documents 
    )

# Step 2: execute the function with the lemmatized text
dtm_lemmatized, terms_lemmatized = vectorizer(
    cv = tfidf, df = df_sample, column_text='text_lemmatized'
    )

Document-term matrix has size (100, 54802)


In [140]:
dtm_lemmatized

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.01621914],
        [0.        , 0.        , 0.02567191, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.00726339,
         0.        ]])

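Unexpected outcome: the number of terms increases after stemming (57,862) or lemmatizing (54,802) compared to the lowercased text without stopwords (53,634). A possible explanation is the `min_df` filter: normalization merges inflected variants into a single term, so variants (and bigrams built from them) that individually appeared in fewer than 5% of the books can jointly pass the threshold.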

# 3. Dictionary generation

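Above, we have created the DTM for all of the books included in the corpus. Now, the idea is to **aggregate the *tf-idf* weights by genre**. Note, however, that this is not straightforward. Some potential issues:
- Adding the weights: the totals grow with the number of books in a genre, so scores are not comparable across genres.
- Averaging the weights: the scores are normalized by genre size, but a term that is very prominent in a single book gets diluted in large genres.

Below, we adopt the second approach (averaging), which tends to capture those terms that are more unique for each genre.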
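
The difference between the two aggregation methods can be seen on a toy matrix. The sketch below is an illustration with made-up weights and an alternative formulation (`explode` plus `groupby`) rather than the loop used in this notebook; `toy` and its column names are assumptions.

```python
import pandas as pd

# Made-up tf-idf weights for two terms, plus a multi-label genre column
toy = pd.DataFrame({
    "ship": [0.4, 0.0, 0.2],
    "love": [0.0, 0.6, 0.2],
    "genres": [["adventure"], ["romance"], ["adventure", "romance"]],
})

# One row per (book, genre) pair, so a book counts once for each of its genres
long_format = toy.explode("genres")

term_columns = ["ship", "love"]
mean_by_genre = long_format.groupby("genres")[term_columns].mean()
sum_by_genre = long_format.groupby("genres")[term_columns].sum()

print(mean_by_genre)
print(sum_by_genre)

# Within a genre, both methods rank the terms identically:
# the mean is just the sum divided by the number of books in that genre
same_ranking = mean_by_genre.rank(axis=1).equals(sum_by_genre.rank(axis=1))
print(same_ranking)
```

The choice between sum and mean therefore matters when comparing scores across genres of different sizes, not when picking the top terms inside a single genre.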

In [147]:
# We create a data frame from the dense document-term matrix, with columns named
# the extracted terms
dtm_df = pd.DataFrame(dtm_lower, columns=terms_lower)

# Append the genres column from the original data frame, considering that the
# order of the documents is preserved after applying tf-idf
dtm_df['genres'] = df_sample['genres'].values

dtm_df.head()

Unnamed: 0,aa,aaron,ab,aback,abandon,abandoned,abandoning,abandonment,abashed,abate,...,zealous,zealously,zenith,zero,zest,zeus,zigzag,zone,zones,genres
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[mystery, adult, love, romance, mystery-thrill..."
1,0.0,0.0,0.0,0.0,0.020328,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[american, amazon, non-fiction, economics, fic..."
2,0.0,0.0,0.0,0.0,0.0,0.01322,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[mythology]
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[middle-grade, classics, biography, fiction, s..."
4,0.0,0.050728,0.0,0.0,0.0,0.0,0.0,0.014189,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[non-fiction, christian, fiction, college]"


Note that the document-term matrix produced by scikit‐learn’s vectorizer preserves the order of the input documents (i.e. the order of rows in the DTM corresponds to the order in the original data frame). This characteristic is considered above to append the genres to the DTM. 

In [148]:
# Keep only the top 20 genres in a list
top_20_genres = genre_df['genre'][0:20].tolist()
print(top_20_genres)

['fiction', 'classics', 'historical', 'non-fiction', '20th-century', 'literature', 'historical-fiction', 'novels', 'young-adult', 'adventure', 'romance', 'adult', 'philosophy', 'adult-fiction', 'fantasy', 'school', 'science-fiction', 'humor', 'biography', 'literary-fiction']


In [149]:
# Dictionary to store top terms for each genre.
top_terms_by_genre = {}

# Loop over each genre in the top_20_genres list
for genre in top_20_genres:

    # Select rows where the document's genres include the current genre
    genre_mask = dtm_df['genres'].apply(lambda g: genre in g)
    dtm_genre = dtm_df[genre_mask]
    
    # We drop the genres column to work only with numeric tf-idf scores.
    # Then, we aggregate the tf-idf scores for each term across all documents in this genre.
    # Here we use the mean. The sum would give the same ranking within a genre
    # (it only rescales by the number of books), but scores across genres would differ
    aggregated_scores = dtm_genre.drop(columns=['genres']).mean(axis=0)  # Compute mean across rows
    
    # Sort the aggregated scores in descending order and select the top 30 terms.
    top_30_terms = aggregated_scores.sort_values(ascending=False).head(30)
    
    # Save the result for this genre.
    top_terms_by_genre[genre] = top_30_terms

# Now, top_terms_by_genre is a dictionary where each key is a genre
# and the value is a pandas Series of the top 30 terms (with their aggregated tf-idf scores).
# For example, to display the results:
for genre, series in top_terms_by_genre.items():
    print(f"Top 30 terms for genre: {genre}")
    print(series)
    print("\n")

Top 30 terms for genre: fiction
mrs        0.011630
youre      0.011213
ive        0.010584
wouldnt    0.009691
id         0.009242
youll      0.009209
couldnt    0.009146
wasnt      0.008715
isnt       0.008653
ye         0.008650
thee       0.008234
hed        0.008213
thy        0.007785
youd       0.007771
whats      0.007676
em         0.007525
dr         0.007443
youve      0.007347
thou       0.007319
tea        0.007291
shes       0.007217
kitchen    0.007043
uncle      0.007036
hadnt      0.007008
lake       0.006960
car        0.006953
havent     0.006940
kings      0.006714
th         0.006698
bible      0.006665
dtype: float64


Top 30 terms for genre: classics
ye         0.011185
mrs        0.011096
ive        0.010985
youre      0.010898
thee       0.010844
thy        0.010148
id         0.009759
youll      0.009611
kings      0.009577
hath       0.009382
thou       0.009360
aint       0.009269
lake       0.008927
tis        0.008884
wouldnt    0.008868
whats      0.00861

In [150]:
print(top_terms_by_genre)

{'fiction': mrs        0.011630
youre      0.011213
ive        0.010584
wouldnt    0.009691
id         0.009242
youll      0.009209
couldnt    0.009146
wasnt      0.008715
isnt       0.008653
ye         0.008650
thee       0.008234
hed        0.008213
thy        0.007785
youd       0.007771
whats      0.007676
em         0.007525
dr         0.007443
youve      0.007347
thou       0.007319
tea        0.007291
shes       0.007217
kitchen    0.007043
uncle      0.007036
hadnt      0.007008
lake       0.006960
car        0.006953
havent     0.006940
kings      0.006714
th         0.006698
bible      0.006665
dtype: float64, 'classics': ye         0.011185
mrs        0.011096
ive        0.010985
youre      0.010898
thee       0.010844
thy        0.010148
id         0.009759
youll      0.009611
kings      0.009577
hath       0.009382
thou       0.009360
aint       0.009269
lake       0.008927
tis        0.008884
wouldnt    0.008868
whats      0.008611
isnt       0.008603
em         0.008468
