# Quiz 05: Spark APIs [100 points]

## Author: Hannah Marr

## CS 119

## Accumulators [10 points]

1. [10 points] The title of this Q&A is wrong. It’s really about global variables (aka accumulators). The question shows code that is incorrect.

In [7]:
val data = Array(1,2,3,4,5)
var counter = 0
var rdd = sc.parallelize(data)

# Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)

SyntaxError: invalid syntax (1178760134.py, line 1)

Write a corrected version of the code and demonstrate its intended operation.

The issue with the original code is that Spark actions such as foreach run in parallel on executors, not on the driver. Spark serializes the closure passed to foreach, including the current value of counter, and ships a copy to each task. Each task increments its own copy, and those updates are never sent back to the driver. As a result, the value printed on the driver is not guaranteed: in cluster mode it stays 0, and in local mode the behavior is undefined. For this reason, mutating a driver-side variable inside a distributed operation like foreach should be avoided.

Instead, you should use Accumulators in Spark, which are designed for safe updates across multiple worker nodes. The following is a corrected version of the code using an accumulator to achieve the intended operation.

In [23]:
# Corrected Scala code using an accumulator (Scala, so it will not run in a Python Jupyter kernel)
val data = Array(1, 2, 3, 4, 5)
val counter = sc.longAccumulator("Counter Accumulator")
val rdd = sc.parallelize(data)

rdd.foreach(x => counter.add(x))

println("Counter value: " + counter.value)

SyntaxError: invalid syntax (1628007579.py, line 2)

Explanation:

Accumulator: This is a special variable that allows safe and distributed accumulation of values across different nodes. Here we use a longAccumulator, which is a long-type accumulator initialized to zero.

rdd.foreach: Instead of updating the global variable counter, we now add the values to the accumulator using counter.add(x).

counter.value: After the action completes, we retrieve the accumulated value using counter.value.

Intended Operation:

The RDD data is parallelized across different worker nodes.

Each worker processes part of the data and adds to the shared accumulator.

After all the workers finish processing, the final value of the accumulator (sum of all elements in the array) is printed.

For the array [1, 2, 3, 4, 5], the output would be: Counter value: 15

This ensures that the code runs correctly in parallel while safely aggregating the results across all nodes.
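The difference between the two versions can be mimicked without Spark. The following is a plain-Python simulation, not Spark code: the function names and the two-partition split are assumptions made for illustration. It shows that updates made to per-task copies never reach the driver's variable, while partial results that are explicitly merged do.

```python
import copy

def run_with_closure_copies(partitions, counter):
    """Simulate tasks that each receive their own copy of a driver variable."""
    for part in partitions:
        task_counter = copy.copy(counter)  # each task works on a private copy
        for x in part:
            task_counter += x              # the update stays local to the task
    return counter                         # the driver's value never changed

def run_with_accumulator(partitions):
    """Simulate an accumulator: tasks report partial sums, the driver merges them."""
    partial_sums = [sum(part) for part in partitions]  # computed per task
    return sum(partial_sums)                           # merged on the driver

# Hypothetical split of the data into two partitions
partitions = [[1, 2], [3, 4, 5]]
print("Closure copy:", run_with_closure_copies(partitions, 0))
print("Accumulator: ", run_with_accumulator(partitions))
```

The first call returns the unchanged driver value, which is the failure mode of the original code in cluster mode; the second returns the sum of all elements.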

In [19]:
# Code implemented in a Python environment
!pip install pyspark
from pyspark import SparkContext

sc = SparkContext("local", "Accumulator Example")

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Initialize an accumulator with initial value 0
counter = sc.accumulator(0)

# Use foreach to add each element to the accumulator
rdd.foreach(lambda x: counter.add(x))

# Print the accumulated value
print("Counter value: ", counter.value)

sc.stop()



24/10/17 10:10:25 WARN Utils: Your hostname, Hannahs-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.243.30.11 instead (on interface en0)
24/10/17 10:10:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/17 10:10:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Counter value:  15


---

## Airline Traffic [45 points]

Ontime statistics for domestic airlines are published by the Bureau of Transportation Statistics. The schema is here, but the actual data has 4 additional columns (between B. and C.) which are not documented and may be safely deleted for the purpose of this exercise.
Based on the statistics for June 2024 and July 2024, please report on

1. [15 points] Describe in words and in code (where applicable) the steps you took to set up the environment for gathering the statistical data in the below questions.

Step 1: Unzipping the Files

In [42]:
import zipfile
import os

# Define the paths to the uploaded zip files
zip_files = [
    '/Users/hannahmarr/Desktop/Tufts/CS119/Quizzes/ONTIME.TD.202407.REL01.03SEP2024.zip',
    '/Users/hannahmarr/Desktop/Tufts/CS119/Quizzes/ONTIME.TD.202406.REL01.06AUG2024.zip'
]

# Extract the contents of the zip files
extracted_paths = []
for zip_file in zip_files:
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        extract_path = zip_file.replace('.zip', '')  # Extract to a folder with the same name
        zip_ref.extractall(extract_path)
        extracted_paths.append(extract_path)

# List the extracted files
extracted_files = []
for path in extracted_paths:
    extracted_files.extend(os.listdir(path))

extracted_files, extracted_paths  # Display the extracted files and directories

(['ontime.td.202407.asc', 'ontime.td.202406.asc'],
 ['/Users/hannahmarr/Desktop/Tufts/CS119/Quizzes/ONTIME.TD.202407.REL01.03SEP2024',
  '/Users/hannahmarr/Desktop/Tufts/CS119/Quizzes/ONTIME.TD.202406.REL01.06AUG2024'])

The ZIP files have been successfully extracted, and the contents are as follows:

ontime.td.202407.asc (for July 2024)

ontime.td.202406.asc (for June 2024)

Both files use the .asc extension, which typically indicates plain text in a structured layout (likely delimited or fixed-width columns). Next, I will preview the first few lines of each file to determine the format before loading the data into pandas.

In [47]:
# Load the .asc files to inspect the format and structure
june_file = '/Users/hannahmarr/Desktop/Tufts/CS119/Quizzes/ONTIME.TD.202406.REL01.06AUG2024/ontime.td.202406.asc'
july_file = '/Users/hannahmarr/Desktop/Tufts/CS119/Quizzes/ONTIME.TD.202407.REL01.03SEP2024/ontime.td.202407.asc'

# Read a few lines from each file to inspect the format
with open(june_file, 'r') as june_f, open(july_file, 'r') as july_f:
    june_preview = [next(june_f) for _ in range(10)]
    july_preview = [next(july_f) for _ in range(10)]

june_preview, july_preview  # Display the first 10 lines of both files for inspection

(['DL|4800|||9E|4800|CHS|JFK|20240607|5|700|700|650|900|900|841|0|0|120|111|-10|-19|-9|705|830|N272PQ|15|11|85||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0||FORM-1|N\n',
  'DL|4800|||9E|4800|CHS|JFK|20240608|6|700|700|654|900|900|849|0|0|120|115|-6|-11|-5|708|841|N302PQ|14|8|93||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0||FORM-1|N\n',
  'DL|4800|||9E|4800|CHS|JFK|20240609|7|700|700|656|900|900|848|0|0|120|112|-4|-12|-8|710|840|N676CA|14|8|90||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0||FORM-1|N\n',
  'DL|4800|||9E|4800|CHS|JFK|20240610|1|700|700|1043|900|900|1220|0|0|120|97|223|200|-23|1056|1214|N301PQ|13|6|78||4|0|0|0|196|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0||FORM-1|N\n',
  'DL|4800|||9E|4800|CHS|JFK|20240611|2|700|700|657|900|900|847|0|0|120|110|-3|-13|-10|715|840|N335PQ|18|7|85||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0||FORM-1|N\n',
  'DL|4800|||9E|4800|CHS|JFK|202406

The .asc files are delimited by vertical bars (|), which suggests that they can be treated as delimited text files for easier loading into a DataFrame. The structure contains information about flights, including airline codes, airports, departure times, delays, etc.
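As a quick check of the delimiter, one record can be split with the standard csv module. This is a minimal sketch: the sample string holds only the first fields of the first June record shown in the preview, and the variable names are chosen here for illustration.

```python
import csv
import io

# First fields of one record from the June preview (truncated here for brevity)
sample = "DL|4800|||9E|4800|CHS|JFK|20240607|5|700|700|650\n"

reader = csv.reader(io.StringIO(sample), delimiter="|")
fields = next(reader)

# Positions 2-5 hold the four undocumented columns mentioned in the assignment
carrier, flight_number = fields[0], fields[1]
undocumented = fields[2:6]
origin, destination, flight_date = fields[6], fields[7], fields[8]

print(carrier, flight_number, origin, destination, flight_date)
print("Undocumented fields:", undocumented)
```

The four fields between the flight number and the departure airport line up with the undocumented columns that the assignment says can be dropped.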

Next, I will load these files into pandas DataFrames, clean the data by removing the unnecessary columns, and start analyzing.

In [113]:
# Importing the necessary library
import pandas as pd

# Load the .asc files into pandas DataFrames
column_names = [
    'Carrier', 'Flight_Number', 'UNDOC_1', 'UNDOC_2', 'UNDOC_3', 'UNDOC_4', 'Departure_Airport', 'Arrival_Airport',
    'Date_of_Flight_Operation_YMD', 'Day_of_Week_of_Flight_Operation_M=1', 'Scheduled_Departure_Time1', 'Scheduled_Departure_Time2',
    'Gate_Departure_Time', 'Scheduled_Arrival_Time_OAG', 'Scheduled_Arrival_Time_CRS', 'Gate_Arrival_Time', 
    'Min_Diff_OAG_Scheduled_Depart_Time', 'Min_Diff_OAG_Scheduled_Arrive_Time', 'Elapsed_Time_CRS_Mins', 'Gate_to_Gate_Time_Actual', 
    'Departure_Delay', 'Arrival_Delay', 'Elapsed_Time_Diff', 'Wheels-Off_Time', 'Wheels-On_Time', 'Aircraft_Tail_Number', 
    'Cancellation_Code', 'Mins_Late_E', 'Mins_Late_F', 'Mins_Late_G', 'Mins_Late_H', 'Mins_Late_I', 'UNDOC_33', 'UNDOC_34', 'UNDOC_35', 
    'UNDOC_36', 'UNDOC_37', 'UNDOC_38', 'UNDOC_39', 'UNDOC_40', 'UNDOC_41', 'UNDOC_42', 'UNDOC_43', 'UNDOC_44', 'UNDOC_45', 'UNDOC_46', 
    'UNDOC_47', 'UNDOC_48', 'UNDOC_49', 'UNDOC_50', 'UNDOC_51', 'UNDOC_52', 'UNDOC_53', 'UNDOC_54', 'UNDOC_55', 'UNDOC_56', 'UNDOC_57', 
    'UNDOC_58', 'UNDOC_59', 'UNDOC_60', 'UNDOC_61', 'UNDOC_62', 'UNDOC_63', 'UNDOC_64', 'UNDOC_65', 'UNDOC_66', 'UNDOC_67', 'UNDOC_68', 
    'UNDOC_69', 'UNDOC_70', 'UNDOC_71', 'UNDOC_72', 'UNDOC_73', 'UNDOC_74', 'UNDOC_75', 'UNDOC_76', 'UNDOC_77', 'UNDOC_78', 'UNDOC_79', 
    'UNDOC_80', 'UNDOC_81', 'UNDOC_82', 'UNDOC_83', 'UNDOC_84'
]

# Load June and July data
june_data = pd.read_csv(june_file, sep='|', names = column_names)
july_data = pd.read_csv(july_file, sep='|', names = column_names)

# Combine both months into one dataframe
all_data = pd.concat([june_data, july_data], ignore_index=True)

# Check the number of columns (this was to determine the number of undocumented columns that I would need to drop)
print("Number of columns:", all_data.shape[1])

# Drop the undocumented columns
all_data.drop(columns=['UNDOC_1', 'UNDOC_2', 'UNDOC_3', 'UNDOC_4', 'UNDOC_33', 'UNDOC_34', 'UNDOC_35', 
    'UNDOC_36', 'UNDOC_37', 'UNDOC_38', 'UNDOC_39', 'UNDOC_40', 'UNDOC_41', 'UNDOC_42', 'UNDOC_43', 'UNDOC_44', 'UNDOC_45', 'UNDOC_46', 
    'UNDOC_47', 'UNDOC_48', 'UNDOC_49', 'UNDOC_50', 'UNDOC_51', 'UNDOC_52', 'UNDOC_53', 'UNDOC_54', 'UNDOC_55', 'UNDOC_56', 'UNDOC_57', 
    'UNDOC_58', 'UNDOC_59', 'UNDOC_60', 'UNDOC_61', 'UNDOC_62', 'UNDOC_63', 'UNDOC_64', 'UNDOC_65', 'UNDOC_66', 'UNDOC_67', 'UNDOC_68', 
    'UNDOC_69', 'UNDOC_70', 'UNDOC_71', 'UNDOC_72', 'UNDOC_73', 'UNDOC_74', 'UNDOC_75', 'UNDOC_76', 'UNDOC_77', 'UNDOC_78', 'UNDOC_79', 
    'UNDOC_80', 'UNDOC_81', 'UNDOC_82', 'UNDOC_83', 'UNDOC_84'], inplace=True)

# Display the first few rows of the combined dataset
all_data.head(10)



Number of columns: 84


Unnamed: 0,Carrier,Flight_Number,Departure_Airport,Arrival_Airport,Date_of_Flight_Operation_YMD,Day_of_Week_of_Flight_Operation_M=1,Scheduled_Departure_Time1,Scheduled_Departure_Time2,Gate_Departure_Time,Scheduled_Arrival_Time_OAG,...,Elapsed_Time_Diff,Wheels-Off_Time,Wheels-On_Time,Aircraft_Tail_Number,Cancellation_Code,Mins_Late_E,Mins_Late_F,Mins_Late_G,Mins_Late_H,Mins_Late_I
0,DL,4800,CHS,JFK,20240607,5,700,700,650,900,...,-9,705,830,N272PQ,15,11,85,,0,0
1,DL,4800,CHS,JFK,20240608,6,700,700,654,900,...,-5,708,841,N302PQ,14,8,93,,0,0
2,DL,4800,CHS,JFK,20240609,7,700,700,656,900,...,-8,710,840,N676CA,14,8,90,,0,0
3,DL,4800,CHS,JFK,20240610,1,700,700,1043,900,...,-23,1056,1214,N301PQ,13,6,78,,4,0
4,DL,4800,CHS,JFK,20240611,2,700,700,657,900,...,-10,715,840,N335PQ,18,7,85,,0,0
5,DL,4800,CHS,JFK,20240612,3,700,700,658,900,...,-16,713,836,N932XJ,15,6,83,,0,0
6,DL,4800,CHS,JFK,20240613,4,700,700,1817,900,...,49,1917,2045,N691CA,60,21,88,,677,0
7,DL,4800,CHS,JFK,20240614,5,700,700,659,900,...,-11,711,840,N186PQ,12,8,89,,0,0
8,DL,4800,CHS,JFK,20240615,6,700,700,659,900,...,-7,714,841,N604LR,15,11,87,,0,0
9,DL,4800,CHS,JFK,20240616,7,700,700,658,900,...,-14,707,837,N316PQ,9,7,90,,0,0


2. [6 points] Which US Airline Has the Least Delays? Report by full names, (e.g., Delta Airlines, not DL) 

In [119]:
# Extract the unique departure airport codes (this inspects Departure_Airport, not Carrier)
all_departure_airports = all_data['Departure_Airport'].unique()

# Display the unique airport codes
print(all_departure_airports)

['CHS' 'ATL' 'FSD' 'MSP' 'TRI' 'ABE' 'TYS' 'LGA' 'JFK' 'MCI' 'DTW' 'CLT'
 'RIC' 'ROC' 'CHO' 'IND' 'MQT' 'PWM' 'CVG' 'TVC' 'DSM' 'PIT' 'CHA' 'ORF'
 'ILM' 'CLE' 'ORD' 'CSG' 'CAE' 'GSP' 'BUF' 'MEM' 'RDU' 'STL' 'CWA' 'OMA'
 'SAV' 'PNS' 'GNV' 'XNA' 'BHM' 'MKE' 'JAX' 'GTR' 'EWR' 'BNA' 'MLI' 'MLU'
 'LIT' 'MSN' 'BDL' 'GSO' 'MGM' 'HPN' 'DLH' 'SDF' 'ALB' 'AVL' 'GRR' 'AEX'
 'DAY' 'MYR' 'BGR' 'SYR' 'PVD' 'BTV' 'HSV' 'RAP' 'ROA' 'FAY' 'SHV' 'TUL'
 'ORH' 'CMH' 'EVV' 'BTR' 'DCA' 'LFT' 'BGM' 'MOB' 'TLH' 'ATW' 'MDT' 'OAJ'
 'VLD' 'ITH' 'RST' 'AGS' 'DHN' 'BWI' 'AUS' 'MBS' 'BMI' 'BQK' 'FAR' 'ABY'
 'GRB' 'SFO' 'DFW' 'SRQ' 'LAX' 'PHX' 'SJC' 'SNA' 'MIA' 'PHL' 'SAT' 'STT'
 'MCO' 'SMF' 'MSO' 'SEA' 'FAT' 'BZN' 'TUS' 'MSY' 'OKC' 'BOS' 'SJU' 'ELP'
 'TPA' 'LAS' 'FLL' 'ABQ' 'PDX' 'BFL' 'DEN' 'AVP' 'SAN' 'IAH' 'RSW' 'SBA'
 'PBI' 'ONT' 'JAC' 'CID' 'DRO' 'ECP' 'VPS' 'RNO' 'EYW' 'PSP' 'FCA' 'SLC'
 'SBP' 'BUR' 'MHT' 'DAB' 'IAD' 'GEG' 'LEX' 'MFE' 'RDM' 'OGG' 'KOA' 'MTJ'
 'HNL' 'LIH' 'STS' 'ICT' 'MRY' 'STX' 'LBB' 'COS' 'B

In [133]:
# Calculate the mean departure and arrival delays for each airline
airline_delay = all_data.groupby('Carrier')[['Departure_Delay', 'Arrival_Delay']].mean()
airline_delay['Mean_Delay'] = airline_delay.mean(axis=1)  # Average of departure and arrival delays

# Identify the airline with the least delays
least_delayed_airline = airline_delay.sort_values('Mean_Delay').iloc[0]

# Retrieve the airline carrier code from the index
carrier_of_least_delayed_airline = airline_delay.sort_values('Mean_Delay').index[0]

# Display the result
print("Airline with the least delays:", carrier_of_least_delayed_airline)
print(least_delayed_airline)  # Display the full row for details

Airline with the least delays: HA
Departure_Delay    5.805322
Arrival_Delay      4.851308
Mean_Delay         5.328315
Name: HA, dtype: float64


The airline with the least delays is Hawaiian Airlines (carrier code HA), with a mean delay of 5.33 minutes, taken as the average of its mean departure delay (5.81 minutes) and mean arrival delay (4.85 minutes).
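Since the question asks for full airline names, the code-to-name translation could be done in code rather than by hand. The sketch below uses plain Python; the lookup table is partial and illustrative, and the flight rows are made-up numbers, not BTS data.

```python
from collections import defaultdict
from statistics import mean

# Partial, illustrative lookup table; a full analysis would need every carrier code in the data
CARRIER_NAMES = {
    "HA": "Hawaiian Airlines",
    "DL": "Delta Air Lines",
    "AA": "American Airlines",
}

# Made-up rows of (carrier, departure_delay, arrival_delay) in minutes, NOT real BTS data
toy_flights = [
    ("HA", 5, 3), ("HA", -2, -4),
    ("DL", 12, 10), ("DL", 0, -6),
    ("AA", 30, 25), ("AA", 4, 9),
]

def mean_delay_by_carrier(flights):
    """Average departure and arrival delay per carrier, then average the two."""
    dep, arr = defaultdict(list), defaultdict(list)
    for carrier, dep_delay, arr_delay in flights:
        dep[carrier].append(dep_delay)
        arr[carrier].append(arr_delay)
    return {c: (mean(dep[c]) + mean(arr[c])) / 2 for c in dep}

delays = mean_delay_by_carrier(toy_flights)
best = min(delays, key=delays.get)

# Fall back to the raw code when a carrier is missing from the lookup table
print(CARRIER_NAMES.get(best, best), delays[best])
```

With pandas, the same lookup could be applied to the grouped result by mapping the index through the dictionary.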

3. [6 points] What Departure Time of Day Is Best to Avoid Flight Delays, segmented into 5 time blocks [night (10 pm - 6 am), morning (6 am to 10 am), mid-day (10 am to 2 pm), afternoon (2 pm - 6 pm), evening (6 pm - 10 pm)]

In [141]:
import pandas as pd

# Function to convert local 24-hour time without leading zeros (e.g., 650 -> "06:50")
def convert_to_time_str(time_value):
    try:
        time_value = str(int(time_value))  # Ensure it's a string representation of an integer
        if len(time_value) <= 2:
            # Time is in hours only (e.g., '5' becomes '05:00')
            return f"{time_value.zfill(2)}:00"
        else:
            # Split the last two digits as minutes, the rest as hours (e.g., '650' becomes '06:50')
            return f"{time_value[:-2].zfill(2)}:{time_value[-2:]}"
    except ValueError:
        # If conversion fails, return NaT (Not a Time) to handle bad data
        return pd.NaT

# Apply the conversion function to the Gate_Departure_Time column
all_data['Formatted_Gate_Departure_Time'] = all_data['Gate_Departure_Time'].apply(convert_to_time_str)

# Check if any NaT or invalid values were generated
invalid_times = all_data[all_data['Formatted_Gate_Departure_Time'].isna()]
print("Invalid time entries:", invalid_times)

# Proceed to extract the hour from valid formatted times
all_data['Hour_Gate_Departure_Time'] = pd.to_datetime(all_data['Formatted_Gate_Departure_Time'], format='%H:%M', errors='coerce').dt.hour

# Define time block categories based on the extracted hour
def time_block(hour):
    if pd.isna(hour):
        return 'Unknown'
    elif 22 <= hour or hour < 6:
        return 'Night'
    elif 6 <= hour < 10:
        return 'Morning'
    elif 10 <= hour < 14:
        return 'Mid-day'
    elif 14 <= hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

# Apply the time block categories
all_data['Time_Block'] = all_data['Hour_Gate_Departure_Time'].apply(time_block)

# Calculate the mean delay for each time block
time_block_delay = all_data.groupby('Time_Block')['Departure_Delay'].mean()

# Print the time block delay results
print(time_block_delay)

Invalid time entries: Empty DataFrame
Columns: [Carrier, Flight_Number, Departure_Airport, Arrival_Airport, Date_of_Flight_Operation_YMD, Day_of_Week_of_Flight_Operation_M=1, Scheduled_Departure_Time1, Scheduled_Departure_Time2, Gate_Departure_Time, Scheduled_Arrival_Time_OAG, Scheduled_Arrival_Time_CRS, Gate_Arrival_Time, Min_Diff_OAG_Scheduled_Depart_Time, Min_Diff_OAG_Scheduled_Arrive_Time, Elapsed_Time_CRS_Mins, Gate_to_Gate_Time_Actual, Departure_Delay, Arrival_Delay, Elapsed_Time_Diff, Wheels-Off_Time, Wheels-On_Time, Aircraft_Tail_Number, Cancellation_Code, Mins_Late_E, Mins_Late_F, Mins_Late_G, Mins_Late_H, Mins_Late_I, Formatted_Gate_Departure_Time]
Index: []

[0 rows x 29 columns]
Time_Block
Afternoon     20.004364
Evening       31.640302
Mid-day       13.862641
Morning        6.662696
Night         26.876475
Unknown      107.474552
Name: Departure_Delay, dtype: float64


The best departure time of day to avoid flight delays is Morning, with an average departure delay of only 6.66 minutes.
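One caveat about the time conversion: if the time fields are HHMM integers with leading zeros dropped (an assumption based on values such as 650 and 1043 in the preview), then a value below 100 denotes minutes after midnight, so 45 would mean 00:45 rather than an hour. The sketch below shows a variant that splits hours and minutes arithmetically; the function name hhmm_to_hour is introduced here for illustration. Whether this accounts for the Unknown block above was not verified against the data.

```python
def hhmm_to_hour(value):
    """Return the hour (0-23) of an HHMM integer such as 650 or 45, or None if invalid."""
    try:
        number = int(value)
    except (TypeError, ValueError, OverflowError):
        return None  # NaN, None, or non-numeric input
    hour, minute = divmod(number, 100)  # 650 -> (6, 50), 45 -> (0, 45)
    if number < 0 or hour > 24 or minute > 59 or (hour == 24 and minute > 0):
        return None
    return hour % 24  # treat 2400 as midnight

def time_block(hour):
    """Map an hour to one of the five time blocks from the question."""
    if hour is None:
        return "Unknown"
    if hour >= 22 or hour < 6:
        return "Night"
    if hour < 10:
        return "Morning"
    if hour < 14:
        return "Mid-day"
    if hour < 18:
        return "Afternoon"
    return "Evening"

for raw in (650, 45, 5, 1843, 2359):
    print(raw, hhmm_to_hour(raw), time_block(hhmm_to_hour(raw)))
```

Under this reading, one- and two-digit values fall into the Night block instead of being treated as whole hours.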

4. [5 points] Which Airports Have The Most Flight Delays? Report by full name, (e.g., “Newark Liberty International,” not “EWR,” when the airport code EWR is provided).

In [150]:
# Sum arrival and departure delays for each airport
airport_delays = all_data.groupby('Departure_Airport')[['Departure_Delay', 'Arrival_Delay']].sum()
airport_delays['Total_Delay'] = airport_delays.sum(axis=1)

# Sort by total delay
most_delayed_airports = airport_delays.sort_values('Total_Delay', ascending=False)
most_delayed_airports

Departure_Airport,Departure_Delay,Arrival_Delay,Total_Delay
DFW,1544728,1344845,2889573
CLT,1279648,1124205,2403853
ORD,1283296,1110008,2393304
ATL,1213106,1043339,2256445
DEN,1032185,809265,1841450
...,...,...,...
YKM,-19,-493,-512
WRG,-169,-434,-603
PIH,81,-713,-632
AKN,-644,-48,-692


The five airports with the highest total delay minutes (departure plus arrival delays, summed over all flights departing from the airport), in descending order, are Dallas/Fort Worth International (DFW), Charlotte Douglas International (CLT), Chicago O'Hare International (ORD), Hartsfield-Jackson Atlanta International (ATL), and Denver International (DEN).

5. [5 points] What Are the Top 5 Busiest Airports in the US. Report by full name, (e.g., “Newark Liberty International,” not “EWR”).

In [172]:
# Count the number of flights by airport (arrivals + departures)
busiest_airports_departures = all_data['Departure_Airport'].value_counts()
busiest_airports_arrivals = all_data['Arrival_Airport'].value_counts()

# Combine the counts of arrivals and departures by the airport code (using sum)
busiest_airports = pd.DataFrame({
    'Departures': busiest_airports_departures,
    'Arrivals': busiest_airports_arrivals
}).fillna(0)  # Fill NaN values with 0 where an airport has no departures or no arrivals

# Sum both columns to get the total number of flights for each airport
busiest_airports['Total_Flights'] = busiest_airports['Departures'] + busiest_airports['Arrivals']

# Sort by the busiest airports (most flights)
busiest_airports_sorted = busiest_airports.sort_values('Total_Flights', ascending=False)

# Display the top 5 busiest airports
print(busiest_airports_sorted.head(5))

     Departures  Arrivals  Total_Flights
ATL       59567     59561         119128
DFW       57164     57155         114319
DEN       55986     55976         111962
ORD       55735     55734         111469
CLT       43837     43834          87671


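The top 5 busiest airports are Hartsfield-Jackson Atlanta International (ATL), Dallas/Fort Worth International (DFW), Denver International (DEN), Chicago O'Hare International (ORD), and Charlotte Douglas International (CLT). These are also the five airports with the highest total delay minutes, which is unsurprising because a total summed over flights grows with the number of flights.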
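Because total delay minutes scale with traffic, a per-flight average may be a fairer way to compare airports. The sketch below divides the total departure-delay minutes by the departure counts, both copied from the outputs above. It covers only these five airports, so it says nothing about smaller airports that might have a higher average.

```python
# Total departure-delay minutes and departure counts copied from the outputs above
total_departure_delay = {
    "DFW": 1544728, "CLT": 1279648, "ORD": 1283296, "ATL": 1213106, "DEN": 1032185,
}
departures = {
    "DFW": 57164, "CLT": 43837, "ORD": 55735, "ATL": 59567, "DEN": 55986,
}

# Mean departure delay per departing flight, in minutes
mean_departure_delay = {
    code: total_departure_delay[code] / departures[code] for code in departures
}

# Rank from the highest to the lowest per-flight delay
ranking = sorted(mean_departure_delay, key=mean_departure_delay.get, reverse=True)
for code in ranking:
    print(f"{code}: {mean_departure_delay[code]:.2f} minutes per departure")
```

On a per-departure basis the order among these five changes: Charlotte moves ahead of Dallas/Fort Worth, and Atlanta, the busiest airport, ranks fourth.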

---

## ShortStoryJam [45 pts]

ShortStoryJam is a proposed new business for users to upload their short stories. We wish to set up a framework for analyzing an arbitrarily large number of stories. We would like to be able to deploy hundreds of servers to analyze different stories in parallel.

1. [3 points] To seed the effort, the text of about 22 short stories by Edgar Allan Poe, he of the “quoth the raven” fame, are available in my github repository. Clean the text and remove stopwords,

In [183]:
import requests
import re
import string

# Fetch the stopwords list from the given URL
stopwords_list = requests.get("https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt").content
stopwords = set(stopwords_list.decode().splitlines())  # a set gives fast membership tests

# Function to remove stopwords from a list of words
def remove_stopwords(words):
    # Clean and split the input words
    list_ = re.sub(r"[^a-zA-Z0-9]", " ", words.lower()).split()
    return [itm for itm in list_ if itm not in stopwords]

# Function to clean the text (lowercase, remove punctuation, digits, and stopwords)
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove content inside square brackets (raw string avoids an invalid-escape warning)
    text = re.sub(r'\[.*?\]', '', text)
    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # Remove digits and newlines (raw string avoids an invalid-escape warning)
    text = re.sub(r'[\d\n]', ' ', text)
    # Remove stopwords
    return ' '.join(remove_stopwords(text))



In [221]:
# Read the text file
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Save the cleaned text back to a file
def save_cleaned_text(file_path, cleaned_text):
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(cleaned_text)

# Example usage with A Descent into the Maelstrom story:
if __name__ == "__main__":
    # Specify the path of the downloaded text file
    file_path = '/Users/hannahmarr/Downloads/A_DESCENT_INTO_THE_MAELSTROM.txt'
    cleaned_file_path = '/Users/hannahmarr/Downloads/A_DESCENT_INTO_THE_MAELSTROM_CLEANED.txt'  # Path where cleaned file will be saved

    # Read the original text
    original_text = read_file(file_path)

    # Clean the text
    cleaned_text = clean_text(original_text)

    # Save the cleaned text (optional)
    save_cleaned_text(cleaned_file_path, cleaned_text)

    # Output the first 500 characters of cleaned text to verify
    print(cleaned_text[:500])

ways god nature providence ways models frame commensurate vastness profundity unsearchableness works depth greater democritus joseph glanville reached summit loftiest crag minutes man exhausted speak long ago length guided route youngest sons years happened event happened mortal man man survived hours deadly terror endured broken body soul suppose man single day change hairs jetty black white weaken limbs unstring nerves tremble exertion frightened shadow scarcely cliff giddy cliff edge careless


In [204]:
# Example usage with The Cask of Amontillado story:
if __name__ == "__main__":
    # Specify the path of the downloaded text file
    file_path = '/Users/hannahmarr/Downloads/THE_CASK_OF_AMONTILLADO.txt'
    cleaned_file_path = '/Users/hannahmarr/Downloads/THE_CASK_OF_AMONTILLADO_CLEANED.txt'  # Path where cleaned file will be saved

    # Read the original text
    original_text = read_file(file_path)

    # Clean the text
    cleaned_text = clean_text(original_text)

    # Output the first 500 characters of cleaned text to verify
    print(cleaned_text[:500])

injuries fortunato borne ventured insult vowed revenge nature soul suppose utterance threat length avenged point definitively settled definitiveness resolved precluded idea risk punish punish impunity wrong unredressed retribution overtakes redresser equally unredressed avenger fails felt wrong understood word deed fortunato doubt good continued smile face perceive smile thought immolation weak point fortunato man respected feared prided connoisseurship wine italians true virtuoso spirit enthusi


2. [8 points] Use NLTK to decompose the first story (A_DESCENT_INTO…) into sentences & sentences into tokens. Here is the code for doing that, after you set the variable paragraph to hold the text of the story.

In [209]:
import nltk

# Download NLTK data (punkt for tokenization and averaged_perceptron_tagger for POS tagging)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Function to read the story text from a local file
def read_story_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Decompose the story into sentences and then tokenize each sentence
def decompose_story(paragraph):
    # Split text into sentences
    sent_text = nltk.sent_tokenize(paragraph)
    # Tokenize each sentence and apply POS tagging
    all_tagged = [nltk.pos_tag(nltk.word_tokenize(sent)) for sent in sent_text]
    return all_tagged

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hannahmarr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hannahmarr/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [225]:
# Example usage with A Descent Into the Maelstrom story:
if __name__ == "__main__":
    # Path to the .txt file in your Downloads folder
    file_path = '/Users/hannahmarr/Downloads/A_DESCENT_INTO_THE_MAELSTROM.txt'

    # Read the story text into the 'paragraph' variable
    paragraph = read_story_from_file(file_path)

    # Decompose and tokenize the story text
    all_tagged_sentences = decompose_story(paragraph)

    # Output the first tagged sentence to verify
    print(all_tagged_sentences[0])

[('The', 'DT'), ('ways', 'NNS'), ('of', 'IN'), ('God', 'NNP'), ('in', 'IN'), ('Nature', 'NNP'), (',', ','), ('as', 'IN'), ('in', 'IN'), ('Providence', 'NNP'), (',', ','), ('are', 'VBP'), ('not', 'RB'), ('as', 'IN'), ('our', 'PRP$'), ('ways', 'NNS'), (';', ':'), ('nor', 'CC'), ('are', 'VBP'), ('the', 'DT'), ('models', 'NNS'), ('that', 'IN'), ('we', 'PRP'), ('frame', 'VBP'), ('any', 'DT'), ('way', 'NN'), ('commensurate', 'NN'), ('to', 'TO'), ('the', 'DT'), ('vastness', 'NN'), (',', ','), ('profundity', 'NN'), (',', ','), ('and', 'CC'), ('unsearchableness', 'NN'), ('of', 'IN'), ('His', 'PRP$'), ('works', 'NNS'), (',', ','), ('_which', 'NNS'), ('have', 'VBP'), ('a', 'DT'), ('depth', 'NN'), ('in', 'IN'), ('them', 'PRP'), ('greater', 'JJR'), ('than', 'IN'), ('the', 'DT'), ('well', 'NN'), ('of', 'IN'), ('Democritus_', 'NNP'), ('.', '.')]


3. [11 points] Tag all remaining words in the story as parts of speech using the Penn POS Tags. This SO answer shows how to obtain the POS tag values. Create and print a dictionary with the Penn POS Tags as keys and a list of words as the values.

In [228]:
import nltk
from collections import defaultdict

# Download NLTK data (punkt for tokenization and averaged_perceptron_tagger for POS tagging)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Function to read the story text from a local file
def read_story_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Define the list of allowed Penn POS tags
allowed_pos_tags = {
    'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD',
    'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR',
    'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ',
    'WDT', 'WP', 'WP$', 'WRB'
}

# Function to tag words and create a dictionary of POS tags with corresponding words
def tag_words_by_pos(paragraph):
    # Tokenize the paragraph into sentences
    sentences = nltk.sent_tokenize(paragraph)
    
    # Initialize a dictionary with POS tags as keys and list of words as values
    pos_dict = defaultdict(list)
    
    # Loop through each sentence, tokenize and tag it
    for sentence in sentences:
        words_with_tags = nltk.pos_tag(nltk.word_tokenize(sentence))
        for word, tag in words_with_tags:
            # Only include words with POS tags that are in the allowed_pos_tags list
            if tag in allowed_pos_tags:
                pos_dict[tag].append(word)
    
    return pos_dict

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hannahmarr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hannahmarr/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [358]:
# Example usage
if __name__ == "__main__":
    # Path to the .txt file in your Downloads folder
    file_path = '/Users/hannahmarr/Downloads/A_DESCENT_INTO_THE_MAELSTROM_CLEANED.txt'
    
    # Step 5: Read the story text into the 'paragraph' variable
    paragraph = read_story_from_file(file_path)
    
    # Step 6: Tag words and build the POS dictionary
    pos_tagged_dict = tag_words_by_pos(paragraph)
    
    # Step 7: Print the dictionary to verify the POS tags and associated words
    for pos_tag, words in pos_tagged_dict.items():
        print(f"{pos_tag}: {words[:10]}")  # Print first 10 words for each POS tag to keep the output concise

JJ: ['nature', 'commensurate', 'unsearchableness', 'speak', 'mortal', 'suppose', 'single', 'jetty', 'black', 'white']
NN: ['providence', 'vastness', 'profundity', 'democritus', 'joseph', 'glanville', 'summit', 'crag', 'man', 'length']
RB: ['depth', 'long', 'ago', 'deadly', 'scarcely', 'carelessly', 'beneath', 'deeply', 'length', 'upward']
VB: ['raise', 'timid', 'morrow', 'watch', 'deck', 'elder', 'shake', 'slack', 'keel', 'hold']


4. [11 points] In this framework, each row will represent a story. The columns will be as follows:

The text of the story,

Two-letter prefixes of each tag, for example NN, VB, RB, JJ, etc., and the words belonging to that tag in the story.

Show your code and the tag columns, at least for the one story.
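One possible reading of "two-letter prefixes" is that each column collects every tag sharing that prefix, so NN would also hold NNS, NNP, and NNPS words, and VB would hold VBD, VBG, and so on. The sketch below illustrates that grouping in plain Python on already-tagged tokens; the function name group_by_tag_prefix is hypothetical, and the sample pairs are taken from the tagged first sentence shown in question 2.

```python
from collections import defaultdict

def group_by_tag_prefix(tagged_words, prefixes=("NN", "VB", "JJ", "RB")):
    """Group words under the first two letters of their Penn POS tag."""
    columns = defaultdict(list)
    for word, tag in tagged_words:
        prefix = tag[:2]  # e.g. 'NNS' -> 'NN', 'VBP' -> 'VB'
        if prefix in prefixes:
            columns[prefix].append(word)
    # Return one entry per requested prefix, in a fixed column order
    return {prefix: columns[prefix] for prefix in prefixes}

# (word, tag) pairs taken from the tagged first sentence in question 2
tagged = [("The", "DT"), ("ways", "NNS"), ("of", "IN"), ("God", "NNP"),
          ("are", "VBP"), ("not", "RB"), ("greater", "JJR"), ("depth", "NN")]

# One row per story: the title plus one column per tag prefix
row = {"Title": "Maelstrom", **group_by_tag_prefix(tagged)}
print(row)
```

A list of such rows, one per story, could be passed to pd.DataFrame to build the table, and iterating over a tuple of prefixes keeps the column order stable from run to run.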

In [360]:
import nltk
from collections import defaultdict
import pandas as pd

# Download NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Define the list of allowed Penn POS tags
allowed_pos_tags = {
    'NN', 'VB', 'JJ', 'RB'
}

# Function to read the story text from a local file
def read_story_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Step 3: Function to tag words and create a dictionary of POS tags with corresponding words
def tag_words_by_pos(paragraph):
    # Tokenize the paragraph into sentences
    sentences = nltk.sent_tokenize(paragraph)
    
    # Initialize a dictionary with POS tags as keys and list of words as values
    pos_dict = defaultdict(list)
    
    # Loop through each sentence, tokenize and tag it
    for sentence in sentences:
        words_with_tags = nltk.pos_tag(nltk.word_tokenize(sentence))
        for word, tag in words_with_tags:
            # Only include words with POS tags that are in the allowed_pos_tags list
            if tag in allowed_pos_tags:
                pos_dict[tag].append(word)
    
    return pos_dict

# Step 4: Create a DataFrame for the POS tag results
def create_pos_dataframe(title, text, pos_dict):
    # Prepare the base data with 'Title' and 'Text' columns
    data = {
        'Title': [title],
        'Text': [text]
    }
    
    # Loop through each allowed POS tag and add the corresponding words to the data
    for tag in allowed_pos_tags:
        data[tag] = [', '.join(pos_dict.get(tag, []))]  # Join the words with commas
    
    # Create the DataFrame
    df = pd.DataFrame(data)
    return df

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hannahmarr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hannahmarr/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [362]:
# Example usage with A Descent Into the Maelstrom
if __name__ == "__main__":
    file_path = '/Users/hannahmarr/Downloads/A_DESCENT_INTO_THE_MAELSTROM_CLEANED.txt'

    # Read the story text into the 'paragraph' variable
    paragraph = read_story_from_file(file_path)
    
    # Tag words and build the POS dictionary
    pos_tagged_dict = tag_words_by_pos(paragraph)
    
    # Create the DataFrame
    df = create_pos_dataframe("Maelstrom", paragraph[0:100], pos_tagged_dict) # paragraph[0:100] to not dominate the dataframe with text

In [364]:
# Display the dataframe using Jupyter's built-in format for better readability
df

Unnamed: 0,Title,Text,NN,JJ,VB,RB
0,Maelstrom,ways god nature providence ways models frame commensurate vastness profundity unsearchableness works,"providence, vastness, profundity, democritus, joseph, glanville, summit, crag, man, length, route, event, man, man, body, soul, man, day, change, weaken, exertion, giddy, cliff, edge, rest, portion, body, tenure, slippery, edge, cliff, sheer, precipice, rock, dozen, truth, position, companion, ground, clung, shrubs, glance, sky, idea, fury, reason, courage, sit, distance, view, scene, event, story, spot, eye, manner, coast, latitude, province, nordland, district, lofoden, mountain, sit, helseggen, cloudy, hold, grass, feel, belt, vapor, beneath, sea, expanse, hue, bring, mind, geographer, account, mare, tenebrarum, panorama, imagination, conceive, eye, reach, character, gloom, surf, promontory, apex, distance, bleak, island, position, wilderness, surge, size, cluster, dark, appearance, space, distant, island, time, gale, brig, lay, reefed, trysail, sight, regular, quick, cross, water, direction, wind, foam, vicinity, distance, man, mile, islesen, hotholm, buckholm, moskoe, vurrgh, change, water, caught, glimpse, sea, summit, man, sound, herd, moment, term, character, beneath, velocity, moment, speed, impetuosity, fury, moskoe, coast, bed, conflicting, convulsion, rapidity, water, alteration, surface, distance, combination, gyratory, motion, germ, vast, definite, existence, circle, mile, diameter, edge, whirl, belt, spray, particle, mouth, funnel, interior, eye, fathom, jet, wall, water, horizon, angle, round, motion, voice, half, half, roar, mighty, cataract, mountain, base, rock, face, clung, herbage, agitation, length, man, maelstr, island, moskoe, midway, impart, conception, magnificence, horror, scene, sense, point, view, writer, question, time, summit, helseggen, storm, description, impression, spectacle, moskoe, depth, water, thirty, vurrgh, depth, passage, vessel, risk, stream, country, rapidity, ebb, sea, scarce, ship, attraction, beat, water, 
relaxes, tranquility, flood, calm, quarter, hour, violence, stream, fury, reach, stream, violence, moskoe, borne, pine, rise, broken, degree, craggy, stream, reflux, sea, water, year, morning, sexagesima, sunday, impetuosity, water, vicinity, vortex, reference, moskoe, depth, centre, moskoe, str, proof, fact, glance, abyss, whirl, crag, helseggen, pinnacle, phlegethon, simplicity, jonas, ramus, belief, fact, thing, ship, existence, influence, resist, hurricane, remember, perusal, aspect, idea, collision, reflux, ridge, water, flood, fall, result, whirlpool, vortex, suction, lesser, dia, britannica, imagine, centre, channel, maelstr, globe, gulf, bothnia, instance, opinion, imagination, guide, view, notion, inability, paper, absurd, thunder, abyss, whirl, man, round, crag, lee, roar, water, story, convince, moskoe, str, smack, seventy, habit, fishing, fishing, business, southward, fish, risk, choice, variety, abundance, day, craft, scrape, week, fact, matter, speculation, risk, life, labor, courage, capital, smack, cove, coast, practice, fine, weather, advantage, push, channel, moskoe, pool, drop, anchorage, sandflesen, time, water, expedition, wind, return, seldom, calculation, point, night, anchor, account, calm, thing, remain, week, death, gale, occasion, sea, spite, round, anchor, cross, day, luck, spot, weather, shift, gauntlet, moskoe, accident, heart, mouth, slack, wind, smack, brother, son, assistance, risk, heart, danger, danger, truth, day, day, hurricane, morning, afternoon, breeze, south, sun, shone, seaman, foreseen, clock, fish, day, str, slack, water, wind, quarter, time, rate, danger, reason, aback, breeze, helseggen, boat, wind, headway, return, anchorage, horizon, copper, cloud, velocity, breeze, direction, state, time, minute, storm, sky, spray, hurricane, seaman, thing, board, brother, safety, boat, thing, water, flush, hatch, bow, hatch, custom, str, precaution, circumstance, lay, brother, destruction, opportunity, threw, deck, bow, foot, fore, 
mast, instinct, thing, time, breath, clung, bolt, stand, dog, water, measure, arm, elder, brother, heart, joy, moment, joy, ear, word, moskoe, moment, shook, head, foot, ague, word, understand, wind, crossing, channel, calmest, weather, wait, watch, pool, hurricane, moment, dream, hope, time, fury, spent, feel, change, direction, pitch, overhead, burst, rift, sky, bright, moon, lustre, thing, distinctness, scene, brother, manner, din, word, voice, head, death, thought, watch, fob, face, moonlight, burst, clock, time, slack, whirl, fury, boat, laden, slip, beneath, landsman, sea, phrase, ridden, sea, bore, rise, sweep, slide, plunge, feel, dizzy, mountain, dream, glance, glance, position, instant, moskoe, quarter, mile, day, whirl, race, place, horror, spasm, boat, half, direction, thunderbolt, moment, water, kind, shrill, waste, steam, steam, belt, surf, moment, plunge, abyss, velocity, boat, sink, water, air, surface, surge, whirl, larboard, wall, horizon, mind, hope, deal, terror, despair, truth, thing, die, manner, paltry, consideration, life, view, manifestation, god, power, idea, mind, curiosity, whirl, sacrifice, grief, man, mind, extremity, boat, pool, light, circumstance, possession, cessation, reach, situation, surf, bed, ocean, ridge, sea, form, idea, confusion, mind, spray, blind, deafen, strangle, power, action, reflection, measure, death, prison, doom, circuit, belt, round, hour, surge, nearer, nearer, edge, time, bolt, brother, water, cask, thing, deck, brink, pit, agony, terror, force, secure, grasp, grief, attempt, sheer, care, contest, point, difference, cask, difficulty, smack, sweeps, position, starboard, prayer, god, descent, hold, barrel, destruction, death, water, moment, moment, sense, motion, vessel, exception, courage, scene, forget, horror, admiration, boat, surface, funnel, circumference, depth, bewildering, rapidity, spun, radiance, shot, rift, glory, observe, burst, terrific, grandeur, beheld, gaze, direction, view, manner, smack, hung, 
pool, keel, deck, plane, water, angle, beam, difficulty, situation, level, speed, search, profound, gulf, mist, magnificent, rainbow, bridge, time, eternity, mist, spray, doubt, yell, attempt, slide, abyss, distance, slope, proportionate, round, round, movement, circuit, whirl, progress, revolution, waste, ebony, borne, boat, object, embrace, whirl, house, furniture, curiosity, place, drew, nearer, nearer, doom, company, amusement, time, thing, plunge, wreck, dutch, merchant, ship, overtook, length, making, fact, fact, miscalculation, train, reflection, heart, dawn, hope, memory, observation, variety, matter, moskoe, str, number, chafed, account, difference, supposing, whirl, period, reason, reach, turn, flood, ebb, case, instance, fate, rule, shape, superiority, speed, descent, size, shape, cylinder, school, master, district, explanation, fact, consequence, vortex, resistance, suction, drawn, difficulty, body, form, circumstance, turn, account, revolution, barrel, yard, mast, level, station, lash, water, cask, counter, throw, water, attention, power, length, design, case, shook, head, station, reach, emergency, delay, bitter, struggle, fate, sea, moment, hesitation, result, escape, possession, mode, escape, anticipate, story, conclusion, hour, smack, distance, beneath, succession, chaos, barrel, sunk, half, distance, gulf, spot, change, place, character, whirlpool, slope, rainbow, sky, view, spot, pool, moskoe, hour, slack, sea, str, fatigue, danger, memory, horror, board, traveller, spirit, land, hair, day, expression, countenance, story, faith","nature, commensurate, unsearchableness, speak, mortal, suppose, single, jetty, black, white, unstring, tremble, cliff, thrown, extreme, unobstructed, black, sixteen, half, perilous, vain, danger, long, sufficient, brought, close, norwegian, eighth, degree, great, dreary, giddy, wide, ocean, inky, nubian, desolate, human, outstretched, black, cliff, high, white, opposite, visible, small, discernible, arose, craggy, ocean, 
unusual, strong, landward, double, hull, short, angry, moskoe, ambaaren, suarven, stockholm, true, understand, hear, interior, lofoden, burst, aware, loud, vast, american, prairie, seamen, ocean, current, gazed, current, monstrous, headlong, ungovernable, main, uproar, vast, scarred, gigantic, innumerable, eastward, precipitous, radical, general, smooth, prodigious, apparent, great, form, distinct, broad, terrific, smooth, black, shriek, niagara, agony, threw, scant, nervous, great, whirlpool, moskoe, str, ordinary, prepared, circumstantial, wild, feeble, lofoden, afford, convenient, flood, lofoden, boisterous, impetuous, dreadful, depth, carried, thrown, ebb, weather, boisterous, dangerous, mile, impossible, fruitless, bear, swim, lofoden, stream, large, current, torn, consist, fro, flux, high, low, noise, depth, shore, lofoden, sidelong, difficult, evident, attraction, phenomenon, plausible, unsatisfactory, flux, natural, prodigious, remote, idle, hear, subject, conclusive, unintelligible, good, deaden, proceeded, vurrgh, violent, good, proper, attempt, lofoden, regular, usual, great, preferred, single, desperate, main, str, otterholm, remain, slack, steady, fail, mis, stay, dead, rare, arrival, boisterous, round, fouled, innumerable, lee, good, twentieth, bad, good, str, minute, strong, current, unmanageable, eighteen, great, young, horrible, tenth, blew, terrible, late, steady, follow, smack, fine, plenty, fresh, starboard, great, unusual, feel, uneasy, astern, singular, amazing, dead, long, smack, attempt, experienced, feather, complete, small, cross, seas, foresail, flat, narrow, gunwale, bolt, mere, undoubtedly, hold, clear, rid, collect, grasp, mouth, close, str, violent, fit, meant, wished, whirl, perceive, str, whirl, hope, great, fool, gun, ship, seas, lay, flat, absolute, singular, black, circular, clear, clear, deep, blue, wear, lit, god, light, understand, hear, single, shook, pale, hideous, dragged, ocean, str, deep, strong, gale, large, strange, 
gigantic, sky, high, sick, lofty, quick, sufficient, exact, str, whirlpool, dead, str, foam, sharp, larboard, shot, noise, sound, imagine, whirl, thought, amazing, borne, bubble, starboard, ocean, stood, huge, strange, rid, great, suppose, strung, reflect, magnificent, foolish, individual, wonderful, shame, depths, principal, singular, general, high, black, mountainous, heavy, gale, great, rid, petty, forbidden, uncertain, impossible, middle, horrible, stern, small, coop, counter, gale, large, afford, madman, maniac, bolt, astern, great, fro, immense, lurch, headlong, hurried, sweep, open, instant, lived, foam, magic, interior, vast, prodigious, smooth, ebony, circular, flood, golden, black, inmost, general, downward, unobstructed, surface, parallel, footing, dead, thick, hung, narrow, great, dare, foam, great, descent, uniform, complete, downward, perceptible, wide, liquid, visible, large, timber, unnatural, original, grow, dreadful, watch, strange, numerous, delirious, relative, fir, tree, awful, invariable, tremble, terror, great, buoyant, coast, lofoden, thrown, extraordinary, stuck, disfigured, entered, late, level, ocean, general, descent, equal, extent, spherical, sphere, equal, cylindrical, escape, subject, forgotten, natural, bulky, great, anxious, vessel, high, original, resolved, loose, impossible, cask, tale, bring, vast, wild, rapid, foam, great, vast, steep, violent, gulf, uprise, clear, surface, ocean, mountainous, coast, exhausted, speechless, daily, raven, black, white, told, merry","raise, timid, morrow, watch, deck, elder, shake, slack, keel, hold, slow, channel","depth, long, ago, deadly, scarcely, carelessly, beneath, deeply, length, upward, dizzily, deplorably, horridly, forcibly, ghastly, forever, properly, nearer, hideously, constantly, midway, northward, gradually, rapidly, eastward, vurrgh, sway, suddenly, suddenly, suddenly, dizzily, heaven, excess, exceedingly, weather, moskoe, noise, heard, inevitably, gradually, storm, norway, 
likewise, frequently, terribly, shore, plainly, constantly, early, ground, regard, close, immeasurably, deadly, bodily, generally, decidedly, universally, altogether, schooner, shortly, violently, stout, afterward, brightly, suddenly, folly, norway, cleverly, mainmast, completely, presently, overboard, horror, bound, long, carefully, slack, ear, presently, properly, cleverly, presently, ahead, involuntarily, afterward, suddenly, subside, completely, indistinctly, positively, considerably, gradually, securely, overboard, steadily, scarcely, abyss, felt, instinctively, midway, perfectly, ghastly, accurately, instinctively, scarcely, distinctly, nature, heavily, arose, partly, partly, mind, appearance, distinctly, completely, slowly, early, rapidly, slowly, sphere, equally, longer, securely, despairingly, counter, precisely, brother, headlong, forever, farther, overboard, momently, gradually, slowly, moon, radiantly, borne, violently, scarcely"


[12 points] The conjecture of many linguists is that the number of different parts of speech per thousand words (nouns, verbs, adjectives, adverbs, …) is pretty much the same for all stories in a given language. In this case, with all stories in English and all from the same author, we expect it to be true. Is the conjecture consistent with your findings?

The conjecture says that the rate at which each part of speech (POS) occurs is roughly the same from story to story in a given language. For a single POS tag in a single story, that rate can be expressed as:

POS Frequency Ratio = (Number of words for a specific POS / Total words in the story) × 1000

If the conjecture holds, the POS Frequency Ratio for a given tag (NN, VB, JJ, etc.) should be roughly constant from story to story.

My plan for mathematically evaluating this is as follows:

POS Tag Frequency: For each story, we calculate the number of occurrences of each POS tag (e.g., NN, VB, JJ).

Total Word Count: Calculate the total number of words in each story.

POS Frequency Ratio: Compute the POS frequency ratio (per thousand words) for each POS tag.

Consistency Across Stories: Analyze how the ratios vary between stories. A consistent pattern would support the conjecture.
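The steps above can be sketched on toy data. The tagged tokens below are made up for illustration, and `pos_rates` is a hypothetical helper, not part of the notebook. The sketch makes the choice of denominator explicit, since dividing by every token in the story and dividing by only the tokens in the counted categories give different rates.

```python
from collections import Counter

# Hypothetical (word, tag) pairs standing in for nltk.pos_tag output
tagged = [
    ("storm", "NN"), ("rose", "VBD"), ("suddenly", "RB"),
    ("black", "JJ"), ("water", "NN"), ("the", "DT"),
    ("boat", "NN"), ("sink", "VB"), ("slowly", "RB"), ("of", "IN"),
]

def pos_rates(tagged_tokens, tags, denominator):
    """Occurrences of each tag per 1000 words, for a given denominator."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return {tag: counts[tag] / denominator * 1000 for tag in tags}

tags = ("NN", "VB", "JJ", "RB")

# Denominator 1: every token in the story (the formula as written above)
per_story = pos_rates(tagged, tags, len(tagged))

# Denominator 2: only tokens whose tag is one of the four counted tags
counted = sum(1 for _, tag in tagged if tag in tags)
per_counted = pos_rates(tagged, tags, counted)

print(per_story)
print(per_counted)
```

With the second denominator the four rates always add up to 1000, so they describe the mix among the counted categories rather than the rate per thousand words of the whole story.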

In [366]:
# We already have the POS-tagged words for each story in a DataFrame
# We can compute the frequency of each POS tag and the total word count for each story

import pandas as pd

def calculate_pos_frequencies(pos_dict, total_words):
    # Initialize a dictionary to store frequencies per 1000 words for each POS tag
    pos_frequencies = {}
    
    for pos, words in pos_dict.items():
        pos_frequencies[pos] = (len(words) / total_words) * 1000  # Frequency per 1000 words
    
    return pos_frequencies

# Example usage (with existing DataFrame 'df' from previous steps)
if __name__ == "__main__":
    # Example: for one story (expand this for multiple stories)
    total_words = sum(len(pos_tagged_dict.get(tag, [])) for tag in allowed_pos_tags)  # Number of words with an allowed tag (not every word in the story)
    
    # Calculate POS frequency ratios for the story
    pos_frequencies = calculate_pos_frequencies(pos_tagged_dict, total_words)
    
    # Convert the POS frequencies to a DataFrame for easier comparison across stories
    pos_freq_df = pd.DataFrame([pos_frequencies], index=['Maelstrom'])  # 'Maelstrom' is the example story title

In [368]:
# Print the POS frequency ratios (per 1000 words) for the story using Jupyter's built-in formatting for better viewability
pos_freq_df

Unnamed: 0,JJ,NN,RB,VB
Maelstrom,304.485155,599.49463,88.439672,7.580543


If we wanted to compare across multiple stories, we could repeat the process for each story and compile the results into a single DataFrame where each row represents a story and each column represents the POS frequency ratio for a specific part of speech.
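One way to do that without repeating a cell for every story is to loop over the titles and collect one row per story. The sketch below is an assumption about how this could be organized, not the notebook's code: `rates_for_story` and the toy `tag_lists` input are hypothetical, and in the notebook the per-story dictionaries would come from `tag_words_by_pos`.

```python
ALLOWED_TAGS = ("NN", "VB", "JJ", "RB")

def rates_for_story(pos_dict, tags=ALLOWED_TAGS):
    """Per-1000 rate of each tag among the words carrying one of `tags`."""
    total = sum(len(pos_dict.get(tag, [])) for tag in tags)
    if total == 0:
        # No counted words at all: report zero for every tag
        return {tag: 0.0 for tag in tags}
    return {tag: len(pos_dict.get(tag, [])) / total * 1000 for tag in tags}

# Toy stand-ins for the dictionaries returned by tag_words_by_pos
tag_lists = {
    "Story_A": {"NN": ["sea", "boat", "man"], "JJ": ["black"], "RB": ["slowly"]},
    "Story_B": {"NN": ["cat", "wall"], "JJ": ["old", "mad"], "VB": ["die"]},
}

# One row per story, one column per tag
table = {title: rates_for_story(d) for title, d in tag_lists.items()}

for title, row in table.items():
    print(title, {tag: round(rate, 1) for tag, rate in row.items()})
```

A nested dictionary like `table` can be passed to `pd.DataFrame.from_dict(table, orient='index')` to get the story-by-tag layout described above.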

In [374]:
# Here I will compare across five different stories to get a better sense of the part-of-speech distribution
# Berenice
# Read and clean the file text
if __name__ == "__main__":
    # Specify the path of the downloaded text file
    file_path = '/Users/hannahmarr/Downloads/BERENICE.txt'
    cleaned_file_path = '/Users/hannahmarr/Downloads/BERENICE_CLEANED.txt'  # Path where cleaned file will be saved

    # Read the original text
    original_text = read_file(file_path)

    # Clean the text
    cleaned_text = clean_text(original_text)

    # Step 5: Save the cleaned text (optional)
    save_cleaned_text(cleaned_file_path, cleaned_text)

    # Output the first 500 characters of cleaned text to verify
    print(cleaned_text[:500])

if __name__ == "__main__":
    file_path = '/Users/hannahmarr/Downloads/BERENICE_CLEANED.txt'

    # Read the story text into the 'paragraph' variable
    paragraph = read_story_from_file(file_path)
    
    # Tag words and build the POS dictionary
    pos_tagged_dict = tag_words_by_pos(paragraph)
    
    # Create the DataFrame
    df2 = create_pos_dataframe("Berenice", paragraph[0:100], pos_tagged_dict) # paragraph[0:100] to not dominate the dataframe with text

if __name__ == "__main__":
    # Count only the words that carry one of the allowed tags
    total_words = sum(len(pos_tagged_dict.get(tag, [])) for tag in allowed_pos_tags)  # Number of words with an allowed tag (not every word in the story)
    
    # Calculate POS frequency ratios for the story
    pos_frequencies = calculate_pos_frequencies(pos_tagged_dict, total_words)
    
    # Convert the POS frequencies to a DataFrame for easier comparison across stories
    pos_freq_df2 = pd.DataFrame([pos_frequencies], index=['Berenice'])  # Row label is the story title

dicebant mihi sodales sepulchrum amicae visitarem curas meas aliquar tulum fore levatas ebn zaiat misery manifold wretchedness earth multiform overreaching wide horizon rainbow hues hues arch distinct intimately blended overreaching wide horizon rainbow beauty derived type unloveliness covenant peace simile sorrow ethics evil consequence good fact joy sorrow born memory bliss anguish day agonies origin ecstasies baptismal egaeus family mention towers land time honored gloomy gray hereditary hall


In [372]:
# Print the POS frequency ratios (per 1000 words) for the story using Jupyter's built-in formatting for better viewability
pos_freq_df2

Unnamed: 0,JJ,NN,VB,RB
Berenice,356.635071,541.469194,10.663507,91.232227


In [384]:
# The Cask of Amontillado
# Read and clean the file text
if __name__ == "__main__":
    # Specify the path of the downloaded text file
    file_path = '/Users/hannahmarr/Downloads/THE_CASK_OF_AMONTILLADO.txt'
    cleaned_file_path = '/Users/hannahmarr/Downloads/THE_CASK_OF_AMONTILLADO_CLEANED.txt'  # Path where cleaned file will be saved

    # Read the original text
    original_text = read_file(file_path)

    # Clean the text
    cleaned_text = clean_text(original_text)

    # Step 5: Save the cleaned text (optional)
    save_cleaned_text(cleaned_file_path, cleaned_text)

    # Output the first 500 characters of cleaned text to verify
    print(cleaned_text[:500])

if __name__ == "__main__":
    file_path = '/Users/hannahmarr/Downloads/THE_CASK_OF_AMONTILLADO_CLEANED.txt'

    # Read the story text into the 'paragraph' variable
    paragraph = read_story_from_file(file_path)
    
    # Tag words and build the POS dictionary
    pos_tagged_dict = tag_words_by_pos(paragraph)
    
    # Create the DataFrame
    df3 = create_pos_dataframe("Amontillado", paragraph[0:100], pos_tagged_dict) # paragraph[0:100] to not dominate the dataframe with text

if __name__ == "__main__":
    # Count only the words that carry one of the allowed tags
    total_words = sum(len(pos_tagged_dict.get(tag, [])) for tag in allowed_pos_tags)  # Number of words with an allowed tag (not every word in the story)
    
    # Calculate POS frequency ratios for the story
    pos_frequencies = calculate_pos_frequencies(pos_tagged_dict, total_words)
    
    # Convert the POS frequencies to a DataFrame for easier comparison across stories
    pos_freq_df3 = pd.DataFrame([pos_frequencies], index=['Amontillado'])  # Row label is the story title

injuries fortunato borne ventured insult vowed revenge nature soul suppose utterance threat length avenged point definitively settled definitiveness resolved precluded idea risk punish punish impunity wrong unredressed retribution overtakes redresser equally unredressed avenger fails felt wrong understood word deed fortunato doubt good continued smile face perceive smile thought immolation weak point fortunato man respected feared prided connoisseurship wine italians true virtuoso spirit enthusi


In [386]:
# Print the POS frequency ratios (per 1000 words) for the story using Jupyter's built-in formatting for better viewability
pos_freq_df3

Unnamed: 0,JJ,NN,RB,VB
Amontillado,349.757224,558.544934,85.250338,6.447505


In [388]:
# The Black Cat
# Read and clean the file text
if __name__ == "__main__":
    # Specify the path of the downloaded text file
    file_path = '/Users/hannahmarr/Downloads/THE_BLACK_CAT.txt'
    cleaned_file_path = '/Users/hannahmarr/Downloads/THE_BLACK_CAT_CLEANED.txt'  # Path where cleaned file will be saved

    # Read the original text
    original_text = read_file(file_path)

    # Clean the text
    cleaned_text = clean_text(original_text)

    # Step 5: Save the cleaned text (optional)
    save_cleaned_text(cleaned_file_path, cleaned_text)

    # Output the first 500 characters of cleaned text to verify
    print(cleaned_text[:500])

if __name__ == "__main__":
    file_path = '/Users/hannahmarr/Downloads/THE_BLACK_CAT_CLEANED.txt'

    # Read the story text into the 'paragraph' variable
    paragraph = read_story_from_file(file_path)
    
    # Tag words and build the POS dictionary
    pos_tagged_dict = tag_words_by_pos(paragraph)
    
    # Create the DataFrame
    df4 = create_pos_dataframe("Black_Cat", paragraph[0:100], pos_tagged_dict) # paragraph[0:100] to not dominate the dataframe with text

if __name__ == "__main__":
    # Count only the words that carry one of the allowed tags
    total_words = sum(len(pos_tagged_dict.get(tag, [])) for tag in allowed_pos_tags)  # Number of words with an allowed tag (not every word in the story)
    
    # Calculate POS frequency ratios for the story
    pos_frequencies = calculate_pos_frequencies(pos_tagged_dict, total_words)
    
    # Convert the POS frequencies to a DataFrame for easier comparison across stories
    pos_freq_df4 = pd.DataFrame([pos_frequencies], index=['Black_Cat'])  # Row label is the story title

wild homely narrative pen expect solicit belief mad expect case senses reject evidence mad surely dream morrow die day unburthen soul purpose place plainly succinctly comment series mere household events consequences events terrified tortured destroyed attempt expound presented horror terrible barroques intellect reduce phantasm common place intellect calm logical excitable perceive circumstances awe ordinary succession natural effects infancy docility humanity disposition tenderness heart consp


In [390]:
# Print the POS frequency ratios (per 1000 words) for the story using Jupyter's built-in formatting for better viewability
pos_freq_df4

Unnamed: 0,JJ,RB,NN,VB
Black_Cat,328.947368,77.935223,585.020243,8.097166


In [392]:
# The Fall of the House of Usher
# Read and clean the file text
if __name__ == "__main__":
    # Specify the path of the downloaded text file
    file_path = '/Users/hannahmarr/Downloads/THE_FALL_OF_THE_HOUSE_OF_USHER.txt'
    cleaned_file_path = '/Users/hannahmarr/Downloads/THE_FALL_OF_THE_HOUSE_OF_USHER_CLEANED.txt'  # Path where cleaned file will be saved

    # Read the original text
    original_text = read_file(file_path)

    # Clean the text
    cleaned_text = clean_text(original_text)

    # Step 5: Save the cleaned text (optional)
    save_cleaned_text(cleaned_file_path, cleaned_text)

    # Output the first 500 characters of cleaned text to verify
    print(cleaned_text[:500])

if __name__ == "__main__":
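    # Note: this path is the uncleaned text; the other four stories are tagged from their *_CLEANED.txt files
    file_path = '/Users/hannahmarr/Downloads/THE_FALL_OF_THE_HOUSE_OF_USHER.txt'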

    # Read the story text into the 'paragraph' variable
    paragraph = read_story_from_file(file_path)
    
    # Tag words and build the POS dictionary
    pos_tagged_dict = tag_words_by_pos(paragraph)
    
    # Create the DataFrame
    df5 = create_pos_dataframe("House_of_Usher", paragraph[0:100], pos_tagged_dict) # paragraph[0:100] to not dominate the dataframe with text

if __name__ == "__main__":
    # Count only the words that carry one of the allowed tags
    total_words = sum(len(pos_tagged_dict.get(tag, [])) for tag in allowed_pos_tags)  # Number of words with an allowed tag (not every word in the story)
    
    # Calculate POS frequency ratios for the story
    pos_frequencies = calculate_pos_frequencies(pos_tagged_dict, total_words)
    
    # Convert the POS frequencies to a DataFrame for easier comparison across stories
    pos_freq_df5 = pd.DataFrame([pos_frequencies], index=['House_of_Usher'])  # Row label is the story title

son luth suspendu sit touche sonne ranger dull dark soundless day autumn year clouds hung oppressively low heavens passing horseback singularly dreary tract country length shades evening drew view melancholy house usher glimpse building sense insufferable gloom pervaded spirit insufferable feeling unrelieved half pleasurable poetic sentiment mind receives sternest natural images desolate terrible looked scene mere house simple landscape features domain bleak walls vacant eye windows rank sedges 


In [394]:
# Print the POS frequency ratios (per 1000 words) for the story using Jupyter's built-in formatting for better viewability
pos_freq_df5

Unnamed: 0,NN,JJ,RB,VB
House_of_Usher,496.95987,262.667207,182.813133,57.559789


In [396]:
# Concatenating the 5 POS frequency distributions into one dataframe
pos_freq_df_5stories = pd.concat([pos_freq_df, pos_freq_df2, pos_freq_df3, pos_freq_df4, pos_freq_df5], ignore_index = False)

In [398]:
# Print the POS frequency ratios (per 1000 words) for all 5 stories using Jupyter's built-in formatting for better viewability
pos_freq_df_5stories

Unnamed: 0,JJ,NN,RB,VB
Maelstrom,304.485155,599.49463,88.439672,7.580543
Berenice,356.635071,541.469194,91.232227,10.663507
Amontillado,349.757224,558.544934,85.250338,6.447505
Black_Cat,328.947368,585.020243,77.935223,8.097166
House_of_Usher,262.667207,496.95987,182.813133,57.559789


To assess the distribution of parts of speech across these five stories, I will calculate the coefficient of variation among the five stories for each POS category. The coefficient of variation is a normalized measure of dispersion that is calculated by dividing the standard deviation of the data by the mean of the data. A small CV would indicate that the POS frequency ratio is consistent, supporting the conjecture that the number of different parts of speech per thousand words is pretty much the same for all stories in a given language.
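As a concrete check of the definition, the sketch below computes the CV for the NN column using only the standard library. `coefficient_of_variation` is a hypothetical helper; it uses the sample standard deviation (n - 1 in the denominator), which is also the default of pandas `std()`.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        # CV is undefined when the mean is zero
        return float("nan")
    return stdev(values) / m

# NN rates for the five stories, copied from the table above
nn_rates = [599.49463, 541.469194, 558.544934, 585.020243, 496.95987]

cv_nn = coefficient_of_variation(nn_rates)
print(f"CV for NN: {cv_nn:.3f}")
```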

In [411]:
# Calculate the mean for each column
mean = pos_freq_df_5stories.mean()

# Calculate the standard deviation for each column
std_dev = pos_freq_df_5stories.std()

# Calculate the coefficient of variation (CV) for each column
# Avoid division by zero (set CV to NaN where the mean is 0); float('nan') keeps the result as float64
cv = std_dev / mean.replace(0, float('nan'))

# Display the coefficient of variation for each column
print(cv)

JJ    0.119161
NN    0.072111
RB    0.415724
VB    1.224677
dtype: float64


Interpretation of CV findings:

Small CV (< 10%): Indicates that the data is tightly clustered around the mean, reflecting consistency or low relative variability.

Moderate CV (10-20%): Indicates some variability, but still suggests that the data points are relatively stable with moderate differences across samples.

Large CV (> 20%): Indicates significant variability, meaning the data points are widely spread out relative to the mean. This could suggest inconsistency across the data.
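The three bands above can be applied mechanically. The sketch below is only an illustration: `cv_band` is a hypothetical helper, the 10% and 20% cut-offs are the rule-of-thumb thresholds listed above rather than a formal standard, and the CV values are copied from the printed output.

```python
def cv_band(cv):
    """Map a CV given as a fraction (0.12 means 12%) to the bands listed above."""
    if cv < 0.10:
        return "small"
    if cv <= 0.20:
        return "moderate"
    return "large"

# CV values printed by the cell above
reported_cv = {"JJ": 0.119161, "NN": 0.072111, "RB": 0.415724, "VB": 1.224677}

bands = {tag: cv_band(cv) for tag, cv in reported_cv.items()}

for tag, cv in reported_cv.items():
    print(f"{tag}: {cv:.1%} -> {bands[tag]}")
```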

My findings suggest that nouns (NN; CV of 0.072 or 7.2%) fall in the small range and adjectives (JJ; CV of 0.119 or 11.9%) fall just inside the moderate range, so both are reasonably consistent between stories. Note that the ratios are computed relative to the number of words tagged NN, VB, JJ, or RB rather than the full word count of each story, so they describe the mix among these four categories.

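However, adverbs (RB; CV of 0.416 or 41.6%) and verbs (VB; CV of 1.225 or 122.5%) are highly variable between stories, suggesting that their proportions are inconsistent across stories. Much of this variability comes from House_of_Usher, whose RB (182.8) and VB (57.6) rates are far above those of the other four stories. That story was tagged from the uncleaned text file while the others were tagged from cleaned files, so the difference may reflect preprocessing rather than the writing itself.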

It is likely that the CVs for these parts of speech would look different if I conducted this analysis across all Poe stories. However, with the five stories I was able to analyze with the processing power I have, the conjecture that the number of different parts of speech per thousand words is pretty much the same for all stories in a given language is consistent with my findings for adjectives and nouns, but not for adverbs and verbs.
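Because the House_of_Usher row stands apart from the other four in the five-story table, especially for RB and VB, one quick sensitivity check is to recompute the CV with that row left out. This is a sketch of such a check rather than part of the original analysis; `cv` and `cv_by_tag` are hypothetical helpers, and the numbers are copied from the five-story table.

```python
from statistics import mean, stdev

# Rates copied from the five-story table
rates = {
    "Maelstrom":      {"JJ": 304.485155, "NN": 599.49463,  "RB": 88.439672,  "VB": 7.580543},
    "Berenice":       {"JJ": 356.635071, "NN": 541.469194, "RB": 91.232227,  "VB": 10.663507},
    "Amontillado":    {"JJ": 349.757224, "NN": 558.544934, "RB": 85.250338,  "VB": 6.447505},
    "Black_Cat":      {"JJ": 328.947368, "NN": 585.020243, "RB": 77.935223,  "VB": 8.097166},
    "House_of_Usher": {"JJ": 262.667207, "NN": 496.95987,  "RB": 182.813133, "VB": 57.559789},
}

def cv(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

def cv_by_tag(rows, tags=("JJ", "NN", "RB", "VB")):
    """CV of each tag's rate across the given story rows."""
    return {tag: cv([row[tag] for row in rows]) for tag in tags}

all_five = cv_by_tag(list(rates.values()))
without_usher = cv_by_tag(
    [row for title, row in rates.items() if title != "House_of_Usher"]
)

for tag in all_five:
    print(f"{tag}: all five = {all_five[tag]:.3f}, "
          f"without House_of_Usher = {without_usher[tag]:.3f}")
```

A large drop in the RB and VB values when one story is removed would indicate that the high variability is driven by that single story rather than spread across all five.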