# Literary Analysis: Comparing Nonfiction and Fiction through Topic Modeling and Sentiment Analysis 

### ADS 509 Final Project
##### Team 3: Claire Bentzen, Tara Dehdari, Logan Van Dine

### Introduction

In this project, we will conduct a comparative analysis of two significant literary works: "Pride and Prejudice" by Jane Austen (fiction) and "A Vindication of the Rights of Woman" by Mary Wollstonecraft (nonfiction). Both texts engage deeply with themes of gender, society, and individual rights, making them ideal for exploring the differences in language, themes, and sentiment between fiction and nonfiction.

Using text mining techniques, we will analyze how each genre approaches these themes, examining the stylistic and rhetorical differences that characterize fiction versus nonfiction. Our analysis will involve data cleaning, tokenization, and the application of descriptive statistics, sentiment analysis, and topic modeling. By comparing these works, we aim to uncover the unique ways in which each genre communicates similar ideas, providing insights into the broader distinctions between fiction and nonfiction writing.


### Imports

In [41]:
import requests
from bs4 import BeautifulSoup  
import os
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from string import punctuation

### Scraping

This section downloads the HTML editions of Pride and Prejudice and A Vindication of the Rights of Woman from Project Gutenberg, extracts the text of every `<p>` element, and saves each book to a .txt file in `./data`. It creates the directory if it does not exist, confirms that each file was written, and prints the first 500 characters as a preview.

In [2]:
# Define the URLs for the books
url_pride_prej = 'https://www.gutenberg.org/cache/epub/1342/pg1342-images.html'
url_vin_of_women = 'https://www.gutenberg.org/cache/epub/3420/pg3420-images.html'

# Define the directory to save the files
data_dir = './data'

# Ensure the directory exists
os.makedirs(data_dir, exist_ok=True)

# Function to scrape and save books
def scrape_and_save_book(url, file_name):
    # Send a GET request to the URL (the timeout avoids hanging on a stalled connection)
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses
    
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extract all text from <p> tags
    paragraphs = soup.find_all('p')
    book_text = '\n'.join([para.get_text() for para in paragraphs])
    
    # Save the extracted text to a file
    file_path = os.path.join(data_dir, file_name)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(book_text)
    
    print(f"Text from '{file_name}' scraped and saved.")
    
    # Confirm the file exists, then preview its contents
    if os.path.exists(file_path):
        print(f"File '{file_path}' has been created successfully.")
        # Check the first few lines of the file
        with open(file_path, 'r', encoding='utf-8') as file:
            preview = file.read(500)  # Read the first 500 characters
            print("File content preview:\n")
            print(preview)
    else:
        print(f"Failed to create the file '{file_path}'.")

# Scrape and save Pride and Prejudice
scrape_and_save_book(url_pride_prej, 'pride_and_prejudice.txt')

# Scrape and save A Vindication of the Rights of Woman
scrape_and_save_book(url_vin_of_women, 'vindication_of_rights_of_woman.txt')

Text from 'pride_and_prejudice.txt' scraped and saved.
File './data/pride_and_prejudice.txt' has been created successfully.
File content preview:

Title: Pride and Prejudice
Author: Jane Austen
Release date: June 1, 1998 [eBook #1342]
                Most recently updated: June 17, 2024
Language: English
Credits: Chuck Greif and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images available at The Internet Archive)

PREFACE.
List of Illustrations.
Chapter: I., 
II., 
III., 
IV., 
V., 
VI., 
VII., 
VIII., 
IX., 
X., 
XI., 
XII., 
XIII., 
XIV., 
XV., 
XVI., 
XVII., 
XVIII., 
XIX., 
XX., 
XXI., 

Text from 'vindication_of_rights_of_woman.txt' scraped and saved.
File './data/vindication_of_rights_of_woman.txt' has been created successfully.
File content preview:

Title: A Vindication of the Rights of Woman
Author: Mary Wollstonecraft
Release date: September 1, 2002 [eBook #3420]
                Most recently updated: January 8, 2021
Language: 
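
To see what the paragraph extraction step does without making a network request, the sketch below applies the same idea to a small inline HTML string. It is an illustration under stated assumptions, not the notebook's method: it swaps BeautifulSoup for the standard library's `html.parser`, the `ParagraphCollector` class and `sample_html` string are hypothetical, and it only approximates what `find_all('p')` followed by `get_text()` returns.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside each <p> element, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.current = []      # text fragments of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.current = []

    def handle_data(self, data):
        # Keep text only while inside a <p> element
        if self.in_paragraph:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.paragraphs.append("".join(self.current))
            self.in_paragraph = False


# Hypothetical sample page, not taken from Project Gutenberg
sample_html = (
    "<html><body><h1>Heading is skipped</h1>"
    "<p>First <i>paragraph</i>.</p>"
    "<div>Text outside p tags is skipped</div>"
    "<p>Second paragraph.</p></body></html>"
)

parser = ParagraphCollector()
parser.feed(sample_html)
parser.close()

# Join paragraphs with newlines, as the scraping function does
book_text = "\n".join(parser.paragraphs)
print(book_text)
```

Because only `<p>` content is kept, headings and any text that sits outside paragraph tags are dropped from the saved file, which is worth keeping in mind when interpreting later token counts.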

### Data Cleaning and Tokenization

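This section parses each saved file into one row of a dataframe. The Project Gutenberg header fields (title, author, release date, last update, language, and credits) become metadata columns, and everything after the Credits line becomes the Text column.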

In [38]:
# Initialize empty dataframe
books = pd.DataFrame(columns=['Title', 'Author', 'Release_Date', 'Updated_Date', 'Language', 'Credits', 'Text'])

def convert_to_df(file_name):
    # Establish file path
    file_path = os.path.join(data_dir, file_name)
    
    # Open contents of file
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='utf-8') as file:
            # Read contents of file
            content = file.read()  
            
            # Extract relevant sections
            title = re.search(r'Title:\s*(.*)', content).group(1)
            author = re.search(r'Author:\s*(.*)', content).group(1)
            release_date = re.search(r'Release date:\s*(.*)\[eBook', content).group(1).strip()
            updated_date = re.search(r'Most recently updated:\s*(.*)', content).group(1)
            language = re.search(r'Language:\s*(.*)', content).group(1)
            credits = re.search(r'Credits:\s*(.*)', content).group(1)
            
            # Book text
            match = re.search(r'Credits:.*?\n(.*)', content, re.DOTALL)
            book_text = match.group(1).strip()
            
            # Dictionary for data
            book_info = {
                'Title': title,
                'Author': author,
                'Release_Date': release_date,
                'Updated_Date': updated_date,
                'Language': language,
                'Credits': credits,
                'Text': book_text
            }
            
            # Add data to books dataframe
            books.loc[len(books)] = book_info
            
            return books

    else:
        print(f"The file '{file_name}' does not exist.")
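
The function above calls `.group(1)` directly on each `re.search` result, which raises an `AttributeError` if a header field is missing from a file. The sketch below shows one possible way to guard against that. The `extract_field` helper and the `SAMPLE` header are hypothetical and written only for illustration; the sample deliberately omits a Credits line.

```python
import re

# Hypothetical header in the style of the saved files (no Credits line)
SAMPLE = (
    "Title: Example Book\n"
    "Author: Jane Doe\n"
    "Release date: June 1, 1998 [eBook #0000]\n"
    "Language: English\n"
    "\n"
    "Body text starts here.\n"
)


def extract_field(pattern, text, default=None):
    """Return the first capture group, or a default when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else default


title = extract_field(r"Title:\s*(.*)", SAMPLE)
release_date = extract_field(r"Release date:\s*(.*?)\s*\[eBook", SAMPLE)
credits = extract_field(r"Credits:\s*(.*)", SAMPLE, default="")

print(title)          # field is present
print(release_date)   # text before "[eBook"
print(repr(credits))  # missing field falls back to the default
```

With a helper like this, a file that lacks one field would still produce a row, with the fallback value marking the gap.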

In [39]:
# Convert Pride and Prejudice text
convert_to_df('pride_and_prejudice.txt')

| | Title | Author | Release_Date | Updated_Date | Language | Credits | Text |
|---|---|---|---|---|---|---|---|
| 0 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | `PREFACE.\nList of Illustrations.\nChapter: I.,...` |


In [40]:
# Convert A Vindication of the Rights of Woman text
convert_to_df('vindication_of_rights_of_woman.txt')

| | Title | Author | Release_Date | Updated_Date | Language | Credits | Text |
|---|---|---|---|---|---|---|---|
| 0 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | `PREFACE.\nList of Illustrations.\nChapter: I.,...` |
| 1 | A Vindication of the Rights of Woman | Mary Wollstonecraft | September 1, 2002 | January 8, 2021 | English | This etext was produced by Amy E Zelmer, Col C... | `This etext was produced by\nAmy E Zelmer <a.z...` |


This section cleans and tokenizes the Text column with the following steps:
1. Cast to lowercase
2. Remove punctuation
3. Tokenize
4. Remove stopwords

In [45]:
# Punctuation characters as a set for fast membership checks
punctuation = set(punctuation)

# Removes punctuation
def remove_punctuation(text, punct_set=punctuation):
    return "".join(ch for ch in text if ch not in punct_set)

# Stopwords as a set for fast membership checks
sw = set(stopwords.words("english"))

# Removes stopwords
def remove_stop(tokens):
    return [word for word in tokens if word not in sw]

# Tokenizes the text on whitespace
def tokenize(text):
    return text.split()

# Applies the pipeline: lowercase, strip punctuation, tokenize, drop stopwords
def pipeline(text):
    text = text.lower()
    text = remove_punctuation(text)
    tokens = tokenize(text)
    tokens = remove_stop(tokens)
    return ' '.join(tokens)

In [48]:
# Converts Text column to string
books['Text'] = books['Text'].astype(str)

# Cleans Text
books['Cleaned_Text'] = books['Text'].apply(pipeline)

# Tokenizes Text
books['Tokens'] = books['Cleaned_Text'].apply(tokenize)
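
One side effect of the step order is worth noting. Punctuation is removed before stopwords are filtered, so a contraction such as "don't" becomes "dont" and no longer matches a stopword entry spelled with an apostrophe. The sketch below illustrates this with a small hypothetical stopword set standing in for the NLTK list (which, to our knowledge, includes apostrophe forms such as "don't"). The names `strip_punct`, `clean`, and `STOPWORDS` are assumptions made for this example.

```python
from string import punctuation

PUNCT = set(punctuation)

# Hypothetical stand-in for the NLTK English stopword list
STOPWORDS = {"i", "do", "not", "don't", "the", "to"}


def strip_punct(text):
    """Drop every ASCII punctuation character, including apostrophes."""
    return "".join(ch for ch in text if ch not in PUNCT)


def clean(text, stop):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    tokens = strip_punct(text.lower()).split()
    return [t for t in tokens if t not in stop]


sample = "I don't like the rain."

# Same order as the pipeline: "dont" slips past the stopword filter
naive = clean(sample, STOPWORDS)

# Normalizing the stopwords the same way lets "dont" be filtered too
normalized_stop = {strip_punct(w) for w in STOPWORDS}
fixed = clean(sample, normalized_stop)

print(naive)
print(fixed)
```

Whether to normalize the stopword list this way is a design choice. Leaving it as is keeps tokens such as "dont" in the corpus, which may inflate their counts in later frequency and topic analyses.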

In [50]:
books

| | Title | Author | Release_Date | Updated_Date | Language | Credits | Text | Cleaned_Text | Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pride and Prejudice | Jane Austen | June 1, 1998 | June 17, 2024 | English | Chuck Greif and the Online Distributed Proofre... | `PREFACE.\nList of Illustrations.\nChapter: I.,...` | preface list illustrations chapter ii iii iv v... | `[preface, list, illustrations, chapter, ii, ii...` |
| 1 | A Vindication of the Rights of Woman | Mary Wollstonecraft | September 1, 2002 | January 8, 2021 | English | This etext was produced by Amy E Zelmer, Col C... | `This etext was produced by\nAmy E Zelmer <a.z...` | etext produced amy e zelmer azelmercqueduau co... | `[etext, produced, amy, e, zelmer, azelmercqued...` |
