# Literary Analysis: Comparing Nonfiction and Fiction through Topic Modeling and Sentiment Analysis 

### ADS 509 Final Project
##### Team 3: Claire Bentzen, Tara Dehdari, Logan Van Dine

##### Introduction

In this project, we conduct a comparative analysis of fiction and nonfiction, centered on two significant literary works: "Pride and Prejudice" by Jane Austen (fiction) and "A Vindication of the Rights of Woman" by Mary Wollstonecraft (nonfiction). Both texts engage deeply with themes of gender, society, and individual rights, making them well suited for exploring the differences in language, themes, and sentiment between fiction and nonfiction. The corpus scraped below also includes three further works per genre from Project Gutenberg.

Using text mining techniques, we will analyze how each genre approaches these themes, examining the stylistic and rhetorical differences that characterize fiction versus nonfiction. Our analysis will involve data cleaning, tokenization, and the application of descriptive statistics, sentiment analysis, and topic modeling. By comparing these works, we aim to uncover the unique ways in which each genre communicates similar ideas, providing insights into the broader distinctions between fiction and nonfiction writing.


### Imports

In [5]:
import requests
import os
import re
import pandas as pd
import nltk

from bs4 import BeautifulSoup  
from nltk.corpus import stopwords
from string import punctuation
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation

### Scraping

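This portion scrapes and saves the full text of four fiction works (Pride and Prejudice, The Awakening, North and South, Wuthering Heights) and four nonfiction works (A Vindication of the Rights of Woman, Enfranchisement of Women, On Liberty, The Subjection of Women) from Project Gutenberg. It ensures the save directory exists, extracts paragraph and heading text from the HTML, saves it to .txt files, and verifies that each file was created by printing a short preview of its content.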
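The extraction step can be tried offline on an inline HTML string. The sketch below is a stand-in that uses the standard library's `html.parser` instead of BeautifulSoup, so it is not the notebook's method; the class name `BlockTextParser` and the sample markup are made up for illustration. It collects the text of `<p>`, `<h1>`, `<h2>`, and `<h3>` elements, skips whitespace-only blocks, and joins the rest with newlines.

```python
from html.parser import HTMLParser


class BlockTextParser(HTMLParser):
    """Collect the text of <p>, <h1>, <h2>, and <h3> elements (hypothetical helper)."""

    TARGET_TAGS = {"p", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks, in document order
        self._current = None   # text pieces of the block being read, or None

    def handle_starttag(self, tag, attrs):
        # Start collecting when a target element opens
        if tag in self.TARGET_TAGS and self._current is None:
            self._current = []

    def handle_data(self, data):
        # Keep text only while inside a target element
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        # Close the block and keep it unless it is whitespace only
        if tag in self.TARGET_TAGS and self._current is not None:
            text = "".join(self._current)
            if text.strip():
                self.blocks.append(text)
            self._current = None


# Made-up sample markup; the real pages come from Project Gutenberg
sample_html = (
    "<html><body><h1>Title</h1><div>skip me</div>"
    "<p>First <i>para</i>.</p><p>   </p><h2>Chapter I</h2></body></html>"
)

parser = BlockTextParser()
parser.feed(sample_html)
book_text = "\n".join(parser.blocks)
print(book_text)
```

Text inside the `<div>` is ignored because only the listed tags are collected, which is also why navigation or license text held in other elements would not reach the saved file.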

In [8]:
# Define the URLs for the books
## fiction
url_pride_prej = 'https://www.gutenberg.org/cache/epub/1342/pg1342-images.html'
url_awakening = 'https://www.gutenberg.org/cache/epub/160/pg160-images.html'
url_north_south = 'https://www.gutenberg.org/cache/epub/4276/pg4276-images.html'
url_wuthering_heights = 'https://www.gutenberg.org/cache/epub/768/pg768-images.html'
## nonfiction
url_vin_of_women = 'https://www.gutenberg.org/cache/epub/3420/pg3420-images.html'
url_enfranchisement = 'https://www.gutenberg.org/cache/epub/73404/pg73404-images.html'
url_on_liberty = 'https://www.gutenberg.org/cache/epub/34901/pg34901-images.html'
url_subjection_women = 'https://www.gutenberg.org/cache/epub/27083/pg27083-images.html'

# Define the directory to save the files
data_dir = './data'

# Ensure the directory exists
os.makedirs(data_dir, exist_ok=True)

# Function to scrape and save books
def scrape_and_save_book(url, file_name):
    # Send a GET request to the URL (the timeout keeps a stalled connection from hanging the cell)
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raise an error if the request was unsuccessful
    
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extract text from paragraph and heading tags (<p>, <h1>, <h2>, <h3>)
    paragraphs = soup.find_all(['p', 'h1', 'h2', 'h3'])
    book_text = '\n'.join([para.get_text() for para in paragraphs if para.get_text().strip()])

    
    # Save the extracted text to a file
    file_path = os.path.join(data_dir, file_name)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(book_text)
    
    print(f"Text from '{file_name}' scraped and saved.")
    
    # Check if the file is saved and contains content
    if os.path.exists(file_path):
        print(f"File '{file_path}' has been created successfully.")
        # Check the first few lines of the file
        with open(file_path, 'r', encoding='utf-8') as file:
            preview = file.read(100)  # Read the first 100 characters
            print("File content preview:\n")
            print(preview)
    else:
        print(f"Failed to create the file '{file_path}'.")

# List of Fiction book tuples
fiction_books = [
    (url_pride_prej, 'pride_and_prejudice.txt'),
    (url_awakening, 'the_awakening.txt'),
    (url_north_south, 'north_and_south.txt'),
    (url_wuthering_heights, 'wuthering_heights.txt')
]

# Loop over the list and scrape/save each book
for url, filename in fiction_books:
    scrape_and_save_book(url, filename)

# List of Nonfiction book tuples
nonfiction_books = [
    (url_vin_of_women, 'vindication_of_rights_of_woman.txt'),
    (url_enfranchisement, 'the_enfranchisement_of_women.txt'),
    (url_on_liberty, 'on_liberty.txt'),
    (url_subjection_women, 'the_subjection_of_women.txt')
]

# Loop over the list and scrape/save each book
for url, filename in nonfiction_books:
    scrape_and_save_book(url, filename)

Text from 'pride_and_prejudice.txt' scraped and saved.
File './data/pride_and_prejudice.txt' has been created successfully.
File content preview:

The Project Gutenberg eBook of Pride and Prejudice
Title: Pride and Prejudice
Author: Jane Austen
Re
Text from 'the_awakening.txt' scraped and saved.
File './data/the_awakening.txt' has been created successfully.
File content preview:

The Project Gutenberg eBook of The Awakening, and Selected Short Stories
Title: The Awakening, and S
Text from 'north_and_south.txt' scraped and saved.
File './data/north_and_south.txt' has been created successfully.
File content preview:

The Project Gutenberg eBook of North and South
Title: North and South
Author: Elizabeth Cleghorn Gas
Text from 'wuthering_heights.txt' scraped and saved.
File './data/wuthering_heights.txt' has been created successfully.
File content preview:

The Project Gutenberg eBook of Wuthering Heights
Title: Wuthering Heights
Author: Emily Brontë
Relea
Text from 'vindication_of_rights

### Data Cleaning and Tokenization

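This section parses each saved text file into one row of a pandas DataFrame. Regular expressions pull the Project Gutenberg header fields (title, author, release date, last update, language, credits), and everything after the `Credits:` line is stored as the book text alongside a genre label.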

In [11]:
# Initialize empty dataframe
books = pd.DataFrame(columns=['Title', 'Author', 'Release_Date', 'Updated_Date', 'Language', 'Credits', 'Text', 'Genre'])

# Function to convert text to DataFrame
def convert_to_df(file_name, genre):
    # Establish file path
    file_path = os.path.join(data_dir, file_name)
    
    # Open contents of file
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            
            # Extract relevant sections with error handling for missing fields
            title = re.search(r'Title:\s*(.*)', content)
            title = title.group(1) if title else 'Unknown'
            
            author = re.search(r'Author:\s*(.*)', content)
            author = author.group(1) if author else 'Unknown'
            
            release_date = re.search(r'Release date:\s*(.*)\[eBook', content)
            release_date = release_date.group(1).strip() if release_date else 'Unknown'
            
            updated_date = re.search(r'Most recently updated:\s*(.*)', content)
            updated_date = updated_date.group(1) if updated_date else 'Not available'
            
            language = re.search(r'Language:\s*(.*)', content)
            language = language.group(1) if language else 'Unknown'
            
            credits = re.search(r'Credits:\s*(.*)', content)
            credits = credits.group(1) if credits else 'Not available'
            
            # Extract book text
            match = re.search(r'Credits:.*?\n(.*)', content, re.DOTALL)
            book_text = match.group(1).strip() if match else 'No text available'
            
            # Dictionary for book information
            book_info = {
                'Title': title,
                'Author': author,
                'Release_Date': release_date,
                'Updated_Date': updated_date,
                'Language': language,
                'Credits': credits,
                'Text': book_text,
                'Genre': genre  
            }
            
            # Add data to books DataFrame
            books.loc[len(books)] = book_info
            
    else:
        print(f"The file '{file_name}' does not exist.")
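The extraction patterns used in `convert_to_df` can be checked on a small header before running them on full books. The sketch below applies the same regular expressions to a made-up header; the sample text and the helper name `first_group` are assumptions for illustration, not Project Gutenberg output.

```python
import re

# Made-up header in the shape of a Project Gutenberg text file
sample = (
    "The Project Gutenberg eBook of Example Book\n"
    "Title: Example Book\n"
    "Author: Jane Doe\n"
    "Release date: January 1, 2000 [eBook #0]\n"
    "Language: English\n"
    "Credits: Example Volunteer\n"
    "CHAPTER I\n"
    "It was a quiet morning.\n"
)


def first_group(pattern, text, default):
    """Return the first capture group of pattern, or default if there is no match."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else default


title = first_group(r'Title:\s*(.*)', sample, 'Unknown')
release_date = first_group(r'Release date:\s*(.*)\[eBook', sample, 'Unknown')
# This field is absent from the sample, so the default is returned
updated_date = first_group(r'Most recently updated:\s*(.*)', sample, 'Not available')

# Body text: everything after the first line break that follows "Credits:"
body_match = re.search(r'Credits:.*?\n(.*)', sample, re.DOTALL)
book_text = body_match.group(1).strip() if body_match else 'No text available'

print(title, release_date, updated_date, sep=" | ")
print(book_text)
```

Because the body pattern stops at the first newline after `Credits:`, a credits entry that wraps onto a second line stays in the extracted text. That may be why the Text column for On Liberty starts with "Distributed Proofreading Team at ..." in the table further down.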



##### Call the DataFrame Function on All Texts

In [13]:
# Combined list of (file name, genre) tuples for Fiction and Nonfiction,
# used below to call the DataFrame function on each file
books_list = [
    ('pride_and_prejudice.txt', 'Fiction'),
    ('the_awakening.txt', 'Fiction'),
    ('north_and_south.txt', 'Fiction'),
    ('wuthering_heights.txt', 'Fiction'),
    ('vindication_of_rights_of_woman.txt', 'Nonfiction'),
    ('the_enfranchisement_of_women.txt', 'Nonfiction'),
    ('on_liberty.txt', 'Nonfiction'),
    ('the_subjection_of_women.txt', 'Nonfiction')
]

# Loop over the combined list and convert to DataFrame
for filename, genre in books_list:
    convert_to_df(filename, genre)

In [14]:
# Display the books DataFrame to check the loaded texts
books.head()

| | Title | Author | Release_Date | Updated_Date | Language | Credits | Text | Genre |
|---|---|---|---|---|---|---|---|---|
| 0 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | PREFACE.\nList of Illustrations.\nChapter: I.,... | Fiction |
| 1 | The Awakening, and Selected Short Stories | Kate Chopin | March 11, 2006 | February 28, 2021 | English | Judith Boss and David Widger | The Awakeningand Selected Short Stories\nby Ka... | Fiction |
| 2 | North and South | Elizabeth Cleghorn Gaskell | July 1, 2003 | February 8, 2024 | English | Produced by Chuck Greif and the Online Distrib... | NORTH AND SOUTH.\n“SHE LAY CURLED UPON THE SOF... | Fiction |
| 3 | Wuthering Heights | Emily Brontë | December 1, 1996 | January 18, 2022 | English | David Price | Wuthering Heights\nby Emily Brontë\nCHAPTER I\... | Fiction |
| 4 | A Vindication of the Rights of Woman | Mary Wollstonecraft | September 1, 2002 | January 8, 2021 | English | This etext was produced by Amy E Zelmer, Col C... | This etext was produced by\nAmy E Zelmer <a.z... | Nonfiction |


This section cleans and tokenizes the Text column with the following steps:
1. Cast to lowercase
2. Remove punctuation
3. Tokenize
4. Remove stopwords
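The four steps can be traced on a single sentence. This is a minimal sketch that assumes a tiny hand-picked stopword set (`STOPWORDS` here is hypothetical, standing in for NLTK's English list) so that it runs without downloading any corpus; the function name `clean` is also made up.

```python
from string import punctuation

PUNCT = set(punctuation)                      # ASCII punctuation only
STOPWORDS = {"she", "it", "is", "a", "and"}   # hypothetical stand-in for NLTK's English list


def clean(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()                                    # 1. cast to lowercase
    text = "".join(ch for ch in text if ch not in PUNCT)   # 2. remove punctuation
    tokens = text.split()                                  # 3. tokenize
    return [tok for tok in tokens if tok not in STOPWORDS] # 4. remove stopwords


sample = 'She said, “It is a truth”, and left.'
tokens = clean(sample)
print(tokens)
```

The curly quotes survive because `string.punctuation` covers ASCII characters only, and a token such as `“it` then no longer matches the stopword `it`. This is consistent with tokens like `“she` in the cleaned output further down.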

In [16]:
# Punctuation
punctuation = set(punctuation) 

# Removes punctuation
def remove_punctuation(text, punct_set=punctuation): 
    
    return("".join([ch for ch in text if ch not in punct_set]))

# Stopwords (stored as a set so membership tests are fast)
sw = set(stopwords.words("english"))

# Removes stopwords
def remove_stop(tokens):
    
    tokens = [word for word in tokens if word not in sw]
    
    return(tokens)
 

# Tokenize the text
def tokenize(text):     
    
    return text.split()

# Applies the pipeline
def pipeline(text): 
    
    text = str.lower(text)
    text = remove_punctuation(text)
    tokens = tokenize(text)
    tokens = remove_stop(tokens)
    
    return(' '.join(tokens))

In [17]:
# Converts Text column to string
books['Text'] = books['Text'].astype(str)

# Cleans Text
books['Cleaned_Text'] = books['Text'].apply(pipeline)

# Tokenizes Text
books['Tokens'] = books['Cleaned_Text'].apply(tokenize)

In [18]:
books

| | Title | Author | Release_Date | Updated_Date | Language | Credits | Text | Genre | Cleaned_Text | Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | PREFACE.\nList of Illustrations.\nChapter: I.,... | Fiction | preface list illustrations chapter ii iii iv v... | [preface, list, illustrations, chapter, ii, ii... |
| 1 | The Awakening, and Selected Short Stories | Kate Chopin | March 11, 2006 | February 28, 2021 | English | Judith Boss and David Widger | The Awakeningand Selected Short Stories\nby Ka... | Fiction | awakeningand selected short stories kate chopi... | [awakeningand, selected, short, stories, kate,... |
| 2 | North and South | Elizabeth Cleghorn Gaskell | July 1, 2003 | February 8, 2024 | English | Produced by Chuck Greif and the Online Distrib... | NORTH AND SOUTH.\n“SHE LAY CURLED UPON THE SOF... | Fiction | north south “she lay curled upon sofa back dra... | [north, south, “she, lay, curled, upon, sofa, ... |
| 3 | Wuthering Heights | Emily Brontë | December 1, 1996 | January 18, 2022 | English | David Price | Wuthering Heights\nby Emily Brontë\nCHAPTER I\... | Fiction | wuthering heights emily brontë chapter 1801—i ... | [wuthering, heights, emily, brontë, chapter, 1... |
| 4 | A Vindication of the Rights of Woman | Mary Wollstonecraft | September 1, 2002 | January 8, 2021 | English | This etext was produced by Amy E Zelmer, Col C... | This etext was produced by\nAmy E Zelmer <a.z... | Nonfiction | etext produced amy e zelmer azelmercqueduau co... | [etext, produced, amy, e, zelmer, azelmercqued... |
| 5 | Enfranchisement of women | Harriet Hardy Taylor Mill | April 16, 2024 | Not available | English | Claudine Corbasson and the Online Distributed ... | To the reader\nFootnotes\n1\nENFRANCHISEMENT O... | Nonfiction | reader footnotes 1 enfranchisement women mrs j... | [reader, footnotes, 1, enfranchisement, women,... |
| 6 | On Liberty | John Stuart Mill | January 10, 2011 | August 12, 2019 | English | Produced by Curtis Weyant, Martin Pettit and t... | Distributed Proofreading Team at http://www.pg... | Nonfiction | distributed proofreading team httpwwwpgdpnet p... | [distributed, proofreading, team, httpwwwpgdpn... |
| 7 | The Subjection of Women | John Stuart Mill | October 28, 2008 | January 25, 2021 | English | Produced by Michael Roe and the Online Distrib... | Proofreading Team at https://www.pgdp.net (Thi... | Nonfiction | proofreading team httpswwwpgdpnet file produce... | [proofreading, team, httpswwwpgdpnet, file, pr... |


### Exploratory Data Analysis

##### Descriptive Statistics

In [21]:
# Function to pull descriptive statistics from clean, tokenized text
def descriptive_stats(tokens, title, num_tokens=5, verbose=True):
    if verbose:
        print(f"Descriptive statistics for '{title}':")
        print(f"There are {len(tokens)} tokens in the text.")
        print(f"There are {len(set(tokens))} unique tokens in the text.")
        print(f"There are {len(''.join(tokens))} characters in the text.")
        print(f"The lexical diversity is {len(set(tokens))/len(tokens):.3f} in the text.")

        counts = Counter(tokens)

        if num_tokens > 0 : 
            print(counts.most_common(num_tokens))
            print('\n') # add spacing for cleaner output
        
    return([len(tokens),
            len(set(tokens)),
            len("".join(tokens)),
            len(set(tokens))/len(tokens)])
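Lexical diversity here is the number of distinct tokens divided by the total number of tokens. The sketch below works through a toy example and then rechecks the ratios printed for the first three books from their reported counts; the function name `lexical_diversity` and the guard for an empty token list are additions for illustration.

```python
def lexical_diversity(tokens):
    """Share of distinct tokens among all tokens; 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Toy example: 8 tokens, 6 of them distinct
toy_tokens = "the cat sat on the mat the end".split()
toy_diversity = lexical_diversity(toy_tokens)
toy_characters = len("".join(toy_tokens))

# (unique tokens, total tokens) as printed in the notebook output
reported = {
    "Pride and Prejudice": (8684, 59460),
    "The Awakening, and Selected Short Stories": (7732, 32102),
    "North and South": (13345, 88256),
}
recomputed = {title: round(unique / total, 3) for title, (unique, total) in reported.items()}

print(toy_diversity, toy_characters)
print(recomputed)
```

This ratio tends to fall as a text gets longer, since common words keep repeating while new words appear more slowly, so differences between books of very different lengths should be read with that in mind.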

In [22]:
# Loop through each title and run descriptive statistics
for index, row in books.iterrows():
    title = row['Title']
    tokens = row['Tokens']
    
    # Call the descriptive_stats function for each book
    descriptive_stats(tokens, title, num_tokens=10)

Descriptive statistics for 'Pride and Prejudice':
There are 59460 tokens in the text.
There are 8684 unique tokens in the text.
There are 373076 characters in the text.
The lexical diversity is 0.146 in the text.
[('mr', 780), ('elizabeth', 580), ('could', 527), ('would', 480), ('said', 380), ('darcy', 356), ('mrs', 346), ('much', 327), ('must', 312), ('miss', 301)]


Descriptive statistics for 'The Awakening, and Selected Short Stories':
There are 32102 tokens in the text.
There are 7732 unique tokens in the text.
There are 191489 characters in the text.
The lexical diversity is 0.241 in the text.
[('edna', 281), ('upon', 259), ('one', 253), ('would', 194), ('little', 186), ('pontellier', 171), ('like', 162), ('said', 160), ('mrs', 159), ('robert', 139)]


Descriptive statistics for 'North and South':
There are 88256 tokens in the text.
There are 13345 unique tokens in the text.
There are 508748 characters in the text.
The lexical diversity is 0.151 in the text.
[('margaret', 1186), (

### Text Classification

In [24]:
# Split text on line breaks (\n); in hard-wrapped sources each row is a single line rather than a full paragraph
books_expanded = books.assign(Text=books['Text'].str.split('\n')).explode('Text').reset_index(drop=True)

# Drop irrelevant columns
books_expanded = books_expanded.drop(columns=['Cleaned_Text', 'Tokens'])

# Remove rows where the Text column doesn't have at least 5 words
# (.copy() makes an independent DataFrame so the column assignments below do not trigger SettingWithCopyWarning)
books_filtered = books_expanded[books_expanded['Text'].str.split().str.len() >= 5].copy()

In [25]:
# Cleans Text
books_filtered['Cleaned_Text'] = books_filtered['Text'].apply(pipeline)

# Tokenizes Text
books_filtered['Tokens'] = books_filtered['Cleaned_Text'].apply(tokenize)


In [26]:
books_filtered

| | Title | Author | Release_Date | Updated_Date | Language | Credits | Text | Genre | Cleaned_Text | Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| 80 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | CHISWICK PRESS:—CHARLES WHITTINGHAM AND CO. | Fiction | chiswick press—charles whittingham co | [chiswick, press—charles, whittingham, co] |
| 81 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | TOOKS COURT, CHANCERY LANE, LONDON. | Fiction | tooks court chancery lane london | [tooks, court, chancery, lane, london] |
| 85 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | Walt Whitman has somewhere a fine and just dis... | Fiction | walt whitman somewhere fine distinction “loving | [walt, whitman, somewhere, fine, distinction, ... |
| 86 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | by allowance” and “loving with personal love.”... | Fiction | allowance” “loving personal love” distinction ... | [allowance”, “loving, personal, love”, distinc... |
| 87 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | to books as well as to men and women; and in t... | Fiction | books well men women case | [books, well, men, women, case] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68597 | The Subjection of Women | John Stuart Mill | October 28, 2008 | January 25, 2021 | English | Produced by Michael Roe and the Online Distrib... | any evil actually caused by it), dries up pro ... | Nonfiction | evil actually caused dries pro tanto | [evil, actually, caused, dries, pro, tanto] |
| 68598 | The Subjection of Women | John Stuart Mill | October 28, 2008 | January 25, 2021 | English | Produced by Michael Roe and the Online Distrib... | the principal fountain of human happiness, and | Nonfiction | principal fountain human happiness | [principal, fountain, human, happiness] |
| 68599 | The Subjection of Women | John Stuart Mill | October 28, 2008 | January 25, 2021 | English | Produced by Michael Roe and the Online Distrib... | leaves the species less rich, to an inappreciable | Nonfiction | leaves species less rich inappreciable | [leaves, species, less, rich, inappreciable] |
| 68600 | The Subjection of Women | John Stuart Mill | October 28, 2008 | January 25, 2021 | English | Produced by Michael Roe and the Online Distrib... | degree, in all that makes life valuable to the | Nonfiction | degree makes life valuable | [degree, makes, life, valuable] |


#### 1. Linear SVM Model

In [28]:
# 80/20 Split
X_train, X_test, Y_train, Y_test = train_test_split(books_filtered['Text'],
                                                    books_filtered['Genre'],
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=books_filtered['Genre'])

print('Size of Training Data ', X_train.shape[0])
print('Size of Test Data ', X_test.shape[0])

Size of Training Data  45964
Size of Test Data  11492


In [29]:
# TF-IDF Vectorization
tfidf = TfidfVectorizer(min_df = 10, ngram_range=(1,2), stop_words="english")
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)
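As a rough illustration of what the weights in `X_train_tf` represent, the sketch below recomputes TF-IDF by hand with the standard library. It follows the smoothed formula that scikit-learn documents for the `TfidfVectorizer` defaults (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization), but it ignores `min_df`, bigrams, stop-word removal, and scikit-learn's own tokenizer, so the numbers will not match the notebook's matrix. The three documents and the helper names are made up.

```python
import math
from collections import Counter

# Made-up mini corpus; each string plays the role of one row of text
docs = [
    "the rights of woman",
    "the pride of the family",
    "rights and liberty",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_vector(tokens):
    """Term count times idf, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


vector = tfidf_vector(tokenized[0])
for term, weight in sorted(vector.items(), key=lambda item: -item[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "woman" gets the largest weight because it appears in no other document, while "the", "rights", and "of" are shared with other documents and are weighted down.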

In [30]:
# Fit SVM Model
model = LinearSVC(random_state=123, dual=True)
model.fit(X_train_tf, Y_train)

In [31]:
# Classify Test Data
Y_pred = model.predict(X_test_tf)
print('Accuracy Score - ', accuracy_score(Y_test, Y_pred))

Accuracy Score -  0.8753045596936999


In [32]:
svm_cm = confusion_matrix(Y_test, Y_pred)
svm_cm

array([[7131,  661],
       [ 772, 2928]])

In [33]:
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

     Fiction       0.90      0.92      0.91      7792
  Nonfiction       0.82      0.79      0.80      3700

    accuracy                           0.88     11492
   macro avg       0.86      0.85      0.86     11492
weighted avg       0.87      0.88      0.87     11492
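The figures in the classification report can be recovered from the confusion matrix by hand. The sketch below does that arithmetic with the standard library, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (Fiction, then Nonfiction). It also computes a majority-class baseline for comparison, which is an addition and not part of the original run.

```python
# Confusion matrix from the SVM run above
# rows = true label, columns = predicted label, order: Fiction, Nonfiction
cm = [[7131, 661],
      [772, 2928]]
labels = ["Fiction", "Nonfiction"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

metrics = {}
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    predicted_as_label = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actually_label = sum(cm[i])                                     # row sum (support)
    metrics[label] = {
        "precision": true_positives / predicted_as_label,
        "recall": true_positives / actually_label,
        "support": actually_label,
    }

# Accuracy of always predicting the larger class
majority_baseline = max(sum(row) for row in cm) / total

print(f"accuracy          {accuracy:.4f}")
print(f"majority baseline {majority_baseline:.4f}")
for label, values in metrics.items():
    print(f"{label:10s} precision {values['precision']:.2f}  recall {values['recall']:.2f}")
```

Roughly two thirds of the test rows are Fiction, so always predicting Fiction would score about 0.68. The SVM's 0.875 is best judged against that baseline, and the lower Nonfiction recall shows where most of the remaining errors fall.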



#### 2. Naive Bayes

In [35]:
# Only keep words with more than 5 occurrences (the check below is count > word_cutoff)
word_cutoff = 5

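# Note: counts are taken from the raw Text column, whereas conv_features below is applied to Cleaned_Text
tokens = [word for text in books_filtered['Text'] for word in text.split()]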

# Word Distribution
word_dist = nltk.FreqDist(tokens)

feature_words = set()

# Create feature words set
for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 8739 as features in the model.


In [36]:
def conv_features(text,fw) :  
    # Initialize empty dictionary
    ret_dict = {}
    
    # Split text into tokens
    tokens = text.split()
    
    # For each token:
    for word in tokens:
        # If word is found in fw, then add to dictionary with value True
        if word in fw:
            ret_dict[word] = True
    
    # Return dictionary
    return(ret_dict)
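A short usage example shows the shape of the feature dictionaries passed to the classifier. The block re-declares a compact equivalent of `conv_features` so that it runs on its own; the feature set and the sample text are made up.

```python
def conv_features(text, fw):
    """Map each token of text that is in the feature set fw to True."""
    return {word: True for word in text.split() if word in fw}


# Hypothetical feature set and cleaned text row
feature_words = {"political", "virtues", "went", "room"}
sample_text = "went room room quietly"

features = conv_features(sample_text, feature_words)
print(features)
```

Each feature records presence only: a repeated word is stored once, and a word outside the feature set leaves no entry at all.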

In [37]:
# Create feature set
featuresets = [(conv_features(text, feature_words), genre) 
               for text, genre in zip(books_filtered['Cleaned_Text'], books_filtered['Genre'])]

In [38]:
# 80/20 split
train_set, test_set = train_test_split(featuresets, test_size=0.2, random_state=123)

In [39]:
# Fit Naive Bayes model
clf = nltk.NaiveBayesClassifier.train(train_set)

In [40]:
# Classify Test Data
print(nltk.classify.accuracy(clf, test_set))

0.8739993038635573


In [41]:
# Most Informative Features
clf.show_most_informative_features(25)

Most Informative Features
               political = True           Nonfic : Fictio =    107.8 : 1.0
                 virtues = True           Nonfic : Fictio =    107.8 : 1.0
                    went = True           Fictio : Nonfic =     79.7 : 1.0
                  social = True           Nonfic : Fictio =     66.8 : 1.0
                majority = True           Nonfic : Fictio =     58.5 : 1.0
              government = True           Nonfic : Fictio =     56.5 : 1.0
                equality = True           Nonfic : Fictio =     54.3 : 1.0
             sensibility = True           Nonfic : Fictio =     52.9 : 1.0
                  attain = True           Nonfic : Fictio =     47.2 : 1.0
                morality = True           Nonfic : Fictio =     46.9 : 1.0
                    room = True           Fictio : Nonfic =     44.4 : 1.0
              cultivated = True           Nonfic : Fictio =     43.0 : 1.0
              principles = True           Nonfic : Fictio =     42.0 : 1.0

### Topic Modeling

In [79]:
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=5, max_df=0.7)
tfidf_vectors = tfidf_vectorizer.fit_transform(books_filtered['Cleaned_Text'])
tfidf_vectors.shape

(57456, 7316)

In [81]:
# Fit an LSA (TruncatedSVD) model
lsa_model = TruncatedSVD(n_components=2, random_state=123)
W_lsa_matrix = lsa_model.fit_transform(tfidf_vectors)
H_lsa_matrix = lsa_model.components_

In [83]:
def display_topics(model, features, no_top_words=5):
    for topic, words in enumerate(model.components_):
        total = words.sum()
        largest = words.argsort()[::-1] # invert sort order
        print("\nTopic %02d" % topic)
        for i in range(0, no_top_words):
            print("  %s (%2.2f)" % (features[largest[i]], abs(words[largest[i]]*100.0/total)))

In [85]:
display_topics(lsa_model, tfidf_vectorizer.get_feature_names_out())


Topic 00
  said (3.25)
  mr (2.97)
  margaret (1.64)
  thornton (1.08)
  hale (1.00)

Topic 01
  mr (53.50)
  thornton (16.50)
  darcy (10.96)
  hale (9.59)
  bell (5.81)


In [87]:
# Get topic from W matrix for each text
# (assign returns a new DataFrame, which avoids SettingWithCopyWarning)
books_filtered = books_filtered.assign(lsa_topic=W_lsa_matrix.argmax(axis=1))

# Count categories by each LSA assigned topic
lsa_comparison = books_filtered.groupby(['lsa_topic', 'Genre']).size().reset_index(name='count')
lsa_comparison_sort = lsa_comparison.sort_values(by=['lsa_topic', 'count'], ascending=[True, False])
lsa_comparison_sort


| | lsa_topic | Genre | count |
|---|---|---|---|
| 0 | 0 | Fiction | 37259 |
| 1 | 0 | Nonfiction | 18480 |
| 2 | 1 | Fiction | 1697 |
| 3 | 1 | Nonfiction | 20 |
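The counts are easier to compare as proportions. The sketch below uses only the four counts reported above and the standard library; the variable names are made up for illustration.

```python
# (lsa_topic, genre) -> number of text rows, copied from the output above
counts = {
    (0, "Fiction"): 37259,
    (0, "Nonfiction"): 18480,
    (1, "Fiction"): 1697,
    (1, "Nonfiction"): 20,
}

total = sum(counts.values())

# Rows assigned to each topic
topic_totals = {}
for (topic, _genre), n in counts.items():
    topic_totals[topic] = topic_totals.get(topic, 0) + n

# Share of all rows that fall in each topic
topic_share = {topic: n / total for topic, n in topic_totals.items()}

# Share of Fiction rows within each topic, and overall
fiction_share = {topic: counts[(topic, "Fiction")] / topic_totals[topic] for topic in topic_totals}
overall_fiction = sum(n for (_t, genre), n in counts.items() if genre == "Fiction") / total

for topic in sorted(topic_totals):
    print(f"topic {topic}: {topic_share[topic]:.1%} of rows, {fiction_share[topic]:.1%} Fiction")
print(f"overall: {overall_fiction:.1%} Fiction")
```

About 97% of rows land in topic 0, whose genre mix is close to the corpus as a whole, while topic 1 is almost entirely Fiction. With two components, the LSA assignment therefore isolates a small fiction-only group and does not separate the two genres.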
