<a href="https://colab.research.google.com/github/humzkhan/Sentence_segmentation/blob/main/job.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Descriptive Alt Text](https://pbs.twimg.com/media/FVx-J8uWUAQ-1CM?format=jpg&name=large)


# Sentence Segmentation (Humzah Khan)


**Steps:**

* Load the short story
  * Use the `requests` library to fetch the story text from a URL
  * Verify the downloaded text by printing the first 500 characters
* Preprocess the text
  * Remove unnecessary separators (e.g., `-----`)
  * Standardize quotation marks and remove duplicates
  * Handle odd quotation cases
* Process the text using spaCy
  * Segment the text into sentences
  * Extract and analyze sentence boundaries
* Post-process the text
  * Stitch sentences with incomplete quotations
  * Handle line breaks embedded in sentences
* Sort and clean the final text
  * Remove duplicates
  * Alphabetize the sentences
  * Ensure consistency and formatting


In [51]:
import requests

# URL of the raw text file on GitHub
url = "https://raw.githubusercontent.com/humzkhan/Sentence_segmentation/refs/heads/main/ShortStory.txt"

# Download the file
response = requests.get(url)
story = response.text

# Print the first 500 characters to verify
print(story[:500])


The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light. The question came about as a result of a five dollar bet over highballs, and it happened this way:
Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac. As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer. They had at least a vague notion of


In [27]:
import spacy

# Preprocessing function to clean the text
def preprocess_text(story: str) -> str:
    """
    Preprocess the story text by removing unnecessary characters and normalizing spaces.
    :param story: Raw story text
    :return: Cleaned story text
    """
    # Remove page boundaries or separators
    story = story.replace('------------------------------------------------', '').strip()
    story = story.replace(""""It was Adell's turn to be contrary.""", "It was Adell's turn to be contrary.").strip()
    story = story.replace("The stars are the power-units, dear", """"The stars are the power-units, dear""").strip()
    story = story.replace(""""The stars and Galaxies died and snuffed out, """, """The stars and Galaxies died and snuffed out, """).strip()


    # Standardize quotation marks
    story = story.replace('“', '"').replace('”', '"')
    story = story.replace('‘', "'").replace('’', "'")


    #story = ' '.join(story.split())

    return story



# Sentence segmentation using spaCy
def segment_sentences_spacy(story: str) -> list[str]:
    """
    Segment the story text into sentences using spaCy.
    :param story: Preprocessed story text
    :return: List of segmented sentences
    """

    # Remove extra spaces or newlines
    story = ' '.join(story.split())

    # Load spaCy's English language model
    nlp = spacy.load("en_core_web_sm")

    # Process the story with spaCy
    doc = nlp(story)

    # Extract sentences
    sentences = [sent.text.strip() for sent in doc.sents]

    return sentences

In [28]:
# Preprocess and segment
cleaned_story = preprocess_text(story)
segmented_sentences = segment_sentences_spacy(cleaned_story)

# Output segmented sentences
for i, sentence in enumerate(segmented_sentences, 1):
    print(f"{i}: {sentence}")

1: The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
2: The question came about as a result of a five dollar bet over highballs, and it happened this way: Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac.
3: As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
4: They had at least a vague notion of the general plan of relays and circuits that had long since grown past the point where any single human could possibly have a firm grasp of the whole.
5: Multivac was self-adjusting and self-correcting.
6: It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough -- so Adell and Lupov attended the monstrous giant only lightly and superficially, yet as well as any men could.
7: They fed it data, adjusted questions to its needs and translated 

In [47]:
import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

def segment_sentences_spacy_2(line: str) -> list[str]:
    """
    Segment the story text into sentences using spaCy.
    :param story: Preprocessed story text
    :return: List of segmented sentences
    """

    doc = nlp(line)
    return [sent.text.strip() for sent in doc.sents]


def fix_odd_quotation_case_1(line: str) -> str:
    """
    Fix a line with odd quotation marks by identifying and removing the invalid one.
    :param line: The input line with odd quotations.
    :return: The corrected line.
    """
    # Identify positions of all quotation marks
    quote_positions = [i for i, char in enumerate(line) if char == '"']

    # If there are not 3 quotation marks, return the line unchanged
    if len(quote_positions) != 3:
        return line

    # Extract text between quotation marks and validate pairs
    valid_pair = None
    for i in range(len(quote_positions) - 1):
        start = quote_positions[i]
        end = quote_positions[i + 1]

        # Check for spaces OUTSIDE the quotes
        valid_start = (start == 0 or line[start - 1] == " ")
        valid_end = (end == len(line) - 1 or line[end + 1] in [" ", "\n"])

        if valid_start and valid_end:
            valid_pair = (start, end)
            break

    # If no valid pair is found, default to the outermost pair
    if valid_pair is None:
        valid_pair = (quote_positions[0], quote_positions[-1])

    # Identify the invalid quotation
    invalid_quote = [pos for pos in quote_positions if pos not in valid_pair][0]

    # Remove the invalid quotation
    line = line[:invalid_quote] + line[invalid_quote + 1:]

    return line




def fix_odd_quotation_case_2_3(line: str) -> str:
    """
    Handle lines with one quotation mark by completing it based on context.
    :param line: The input line with one quotation mark.
    :return: The corrected line.
    """
    # Identify the position of the lone quotation mark
    quote_pos = line.find('"')

    if quote_pos == -1:  # If no quotation mark, return the line unchanged
        return line

    # Split the line into sentences using spaCy
    sentences = segment_sentences_spacy_2(line)

    # Case 0: Check if there is text on both sides of the quotation mark
    text_before = line[:quote_pos].strip()
    text_after = line[quote_pos + 1 :].strip()

    if text_before and text_after:
        # Both sides have text; assume the quotation mark is a typo and remove it
        return line[:quote_pos] + line[quote_pos + 1 :]

    # Case 2: Quotation mark at the beginning of the line
    if quote_pos == 0:
        if len(sentences) == 1:  # If there's only one sentence, add the quotation at the end
            return line + '"'
        else:  # If there are multiple sentences, add the quotation at the end of the first sentence
            first_sentence = sentences[0]
            rest_of_line = line[len(first_sentence):].strip()
            return f'{first_sentence}" {rest_of_line}'

    # Case 3: Quotation mark in the middle of the line
    elif 0 < quote_pos < len(line):  # Lone quotation in the middle or end of the line

        # Check for space after the lone quotation ([quote][space])
        if line[quote_pos + 1 : quote_pos + 2] == " ":
            # Add the missing quotation at the start of the sentence containing the quote
            for sentence in sentences:
                if '"' in sentence:  # Locate the sentence with the lone quotation
                    return f'"{sentence.strip()}" {line[len(sentence):].strip()}'
            # Default: Add the quotation at the start of the line
            return f'"{line}'

        # Check for space before the lone quotation ([space][quote])
        elif line[quote_pos - 1 : quote_pos] == " ":
            # Add the missing quotation at the end of the sentence containing the quote
            for sentence in sentences:
                if '"' in sentence:  # Locate the sentence with the lone quotation
                    return f'{sentence.strip()}" {line[len(sentence):].strip()}'
            # Default: Add the quotation at the end of the line
            return f'{line}"'

        # Case 4: Quotation mark at the end of the line
        else: # Default: Add the quotation at the start of the line
            return f'"{line}'

    return line  # If no specific case applies, return the line unchanged


def fix_odd_quotation_lines(text: str) -> str:
    """
    Fix lines with an odd number of quotation marks in the text, handling misplaced quotes appropriately.
    :param text: The raw input text, line-separated.
    :return: The corrected text with balanced quotation marks.
    """
    lines = text.split("\n")
    fixed_lines = []

    for line in lines:
        stripped_line = line.strip()
        quote_count = stripped_line.count('"')

        # Case 1: Handle lines with 3 quotation marks
        if quote_count % 2 == 1 and quote_count == 3:
            stripped_line = fix_odd_quotation_case_1(stripped_line)

        # Cases 2 & 3: Handle lone quotes or misplaced quotes
        if quote_count == 1:
            stripped_line = fix_odd_quotation_case_2_3(stripped_line)

        # Append corrected lines (if valid and not empty)
        if stripped_line:
            fixed_lines.append(stripped_line)


    return "\n".join(fixed_lines)



# Preprocessing function to clean the text
def preprocess_text(story: str) -> str:
    """
    Preprocess the story text by removing unnecessary characters and normalizing spaces.
    :param story: Raw story text
    :return: Cleaned story text
    """
    # Remove page boundaries or separators
    story = story.replace('------------------------------------------------', '').strip()
    #story = story.replace(""""It was Adell's turn to be contrary.""", "It was Adell's turn to be contrary.").strip()
    #story = story.replace("The stars are the power-units, dear", """"The stars are the power-units, dear""").strip()
    #story = story.replace(""""The stars and Galaxies died and snuffed out, """, """The stars and Galaxies died and snuffed out, """).strip()


    # Standardize quotation marks
    story = story.replace('“', '"').replace('”', '"')
    story = story.replace('‘', "'").replace('’', "'")

    # Fix odd quotation marks
    story = fix_odd_quotation_lines(story)

    #story = ' '.join(story.split())

    return story



# Sentence segmentation using spaCy
def segment_sentences_spacy(story: str) -> list[str]:
    """
    Segment the story text into sentences using spaCy.
    :param story: Preprocessed story text
    :return: List of segmented sentences
    """

    # Remove extra spaces or newlines
    story = ' '.join(story.split())

    # Load spaCy's English language model
    nlp = spacy.load("en_core_web_sm")

    # Process the story with spaCy
    doc = nlp(story)

    # Extract sentences
    sentences = [sent.text.strip() for sent in doc.sents]

    return sentences




In [48]:
# Preprocess and segment
cleaned_story = preprocess_text(story)
segmented_sentences = segment_sentences_spacy(cleaned_story)

# Output segmented sentences
for i, sentence in enumerate(segmented_sentences, 1):
    print(f"{i}: {sentence}")

1: The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
2: The question came about as a result of a five dollar bet over highballs, and it happened this way: Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac.
3: As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
4: They had at least a vague notion of the general plan of relays and circuits that had long since grown past the point where any single human could possibly have a firm grasp of the whole.
5: Multivac was self-adjusting and self-correcting.
6: It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough -- so Adell and Lupov attended the monstrous giant only lightly and superficially, yet as well as any men could.
7: They fed it data, adjusted questions to its needs and translated 

#### Note: stitching quotations together into one coherent phrase (I find this more logical)

In [None]:
def stitch_sentences_with_nested_quotes(sentences: list[str]) -> list[str]:
    """
    Stitch sentences with nested quotes to ensure properly closed quoted sentences.

    :param sentences: List of segmented sentences
    :return: List of stitched sentences
    """
    stitched_sentences = []
    buffer = ""  # Temporary buffer for stitching
    inside_quote = False  # Tracks whether we're inside a quoted section

    for sentence in sentences:
        # Count the number of quotation marks
        quote_count = sentence.count('"')

        if inside_quote:
            # Add the sentence to the buffer
            buffer += f" {sentence}"
            if quote_count % 2 == 1:  # Check if this closes the open quote
                stitched_sentences.append(buffer.strip())
                buffer = ""  # Reset the buffer
                inside_quote = False
        else:
            # Check if this starts a quoted section
            if quote_count % 2 == 1:
                buffer = sentence
                inside_quote = True
            else:
                # If it's not a quoted section, just append as is
                stitched_sentences.append(sentence.strip())

    # If there's anything left in the buffer (edge case), append it
    if buffer:
        stitched_sentences.append(buffer.strip())

    return stitched_sentences


# Stitch sentences with nested quotes
stitched_sentences = stitch_sentences_with_nested_quotes(segmented_sentences)

# Output the stitched sentences with enumeration
for idx, sentence in enumerate(stitched_sentences, 1):
    print(f"{idx}: {sentence}")





1: The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
2: The question came about as a result of a five dollar bet over highballs, and it happened this way: Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac.
3: As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
4: They had at least a vague notion of the general plan of relays and circuits that had long since grown past the point where any single human could possibly have a firm grasp of the whole.
5: Multivac was self-adjusting and self-correcting.
6: It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough -- so Adell and Lupov attended the monstrous giant only lightly and superficially, yet as well as any men could.
7: They fed it data, adjusted questions to its needs and translated 

In [None]:
def split_sentences_with_line_breaks(sentences: list[str], story: str) -> list[str]:
    """
    Split sentences with multiple quotations into separate segments based on line breaks in the raw text.

    :param sentences: List of stitched sentences
    :param story: Raw story text with line breaks
    :return: List of refined sentences split at line breaks
    """
    # Split the raw story into lines for reference
    raw_lines = story.splitlines()

    # Create a new list to store final sentences
    refined_sentences = []

    for sentence in sentences:
        # If the sentence contains two or more quotes, check for line breaks
        if sentence.count('"') >= 2:
            # Initialize a list to store split parts of the sentence
            split_parts = []

            # Track the current working sentence
            current_sentence = sentence

            # Iterate over each line in the raw story
            for line in raw_lines:
                line = line.strip()  # Clean the line
                if line and line in current_sentence:
                    # Split the sentence at the matching line
                    parts = current_sentence.split(line, 1)
                    split_parts.append(line)  # Add the matching line
                    if len(parts) > 1:
                        current_sentence = parts[1]  # Keep the remaining part for further checks

            # Add all split parts and any remaining portion to the refined sentences
            refined_sentences.extend(split_parts)
            if current_sentence.strip():
                refined_sentences.append(current_sentence.strip())
        else:
            # If the sentence doesn't meet the criteria, keep it as is
            refined_sentences.append(sentence.strip())

    return refined_sentences






# Step 3: Split stitched sentences based on embedded line breaks
final_sentences = split_sentences_with_line_breaks(stitched_sentences, cleaned_story)

# Output the final sentences
for idx, sentence in enumerate(final_sentences, 1):
    print(f"{idx}: {sentence}")




1: The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
2: The question came about as a result of a five dollar bet over highballs, and it happened this way: Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac.
3: As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
4: They had at least a vague notion of the general plan of relays and circuits that had long since grown past the point where any single human could possibly have a firm grasp of the whole.
5: Multivac was self-adjusting and self-correcting.
6: It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough -- so Adell and Lupov attended the monstrous giant only lightly and superficially, yet as well as any men could.
7: They fed it data, adjusted questions to its needs and translated 

In [None]:
# Sort the final sentences alphabetically
sorted_sentences = sorted(final_sentences, key=str.lower)

# Print the sorted sentences with enumeration
for idx, sentence in enumerate(sorted_sentences, 1):
    print(f"{idx}: {sentence}")


1: "A hundred billion is not infinite and it's getting less infinite all the time. Consider! Twenty thousand years ago, mankind first solved the problem of utilizing stellar energy, and a few centuries later, interstellar travel became possible. It took mankind a million years to fill one small world and then only fifteen thousand years to fill the rest of the Galaxy. Now the population doubles every ten years --"
2: "A very good point. Already, mankind consumes two sunpower units per year."
3: "All right, but now we can hook up each individual spaceship to the Solar Station, and it can go to Pluto and back a million times without ever worrying about fuel. You can't do THAT on coal and uranium. Ask Multivac, if you don't believe me."
4: "All right, then. Billions and billions of years. Twenty billion, maybe. Are you satisfied?"
5: "All right. Who says they won't?"
6: "All the energy we can possibly ever use for free. Enough energy, if we wanted to draw on it, to melt all Earth into a b

In [None]:
# Sort sentences alphabetically, ignoring leading quotation marks and parentheses
sorted_sentences = sorted(
    final_sentences,
    key=lambda s: s.lstrip('"( ').rstrip(')" ').lower()
)

# Print the sorted sentences with enumeration
for idx, sentence in enumerate(sorted_sentences, 1):
    print(f"{idx}: {sentence}")


1: "A hundred billion is not infinite and it's getting less infinite all the time. Consider! Twenty thousand years ago, mankind first solved the problem of utilizing stellar energy, and a few centuries later, interstellar travel became possible. It took mankind a million years to fill one small world and then only fifteen thousand years to fill the rest of the Galaxy. Now the population doubles every ten years --"
2: A thought came, infinitely distant, but infinitely clear.
3: A timeless interval was spent in doing that.
4: "A very good point. Already, mankind consumes two sunpower units per year."
5: Adell put his glass to his lips only occasionally, and Lupov's eyes slowly closed.
6: Adell was just drunk enough to try, just sober enough to be able to phrase the necessary symbols and operations into a question which, in words, might have corresponded to this: Will mankind one day without the net expenditure of energy be able to restore the sun to its full youthfulness even after it ha

In [None]:
# Remove duplicates and sort sentences alphabetically, ignoring leading quotes and parentheses
unique_sorted_sentences = sorted(
    set(final_sentences),  # Remove duplicates using set
    key=lambda s: s.lstrip('"( ').rstrip(')" ').lower()  # Sort while ignoring quotes and parentheses
)

# Print the sorted unique sentences with enumeration
for idx, sentence in enumerate(unique_sorted_sentences, 1):
    print(f"{idx}: {sentence}")


1: "A hundred billion is not infinite and it's getting less infinite all the time. Consider! Twenty thousand years ago, mankind first solved the problem of utilizing stellar energy, and a few centuries later, interstellar travel became possible. It took mankind a million years to fill one small world and then only fifteen thousand years to fill the rest of the Galaxy. Now the population doubles every ten years --"
2: A thought came, infinitely distant, but infinitely clear.
3: A timeless interval was spent in doing that.
4: "A very good point. Already, mankind consumes two sunpower units per year."
5: Adell put his glass to his lips only occasionally, and Lupov's eyes slowly closed.
6: Adell was just drunk enough to try, just sober enough to be able to phrase the necessary symbols and operations into a question which, in words, might have corresponded to this: Will mankind one day without the net expenditure of energy be able to restore the sun to its full youthfulness even after it ha

test

In [None]:
import re

def extract_dialogue_with_speakers(text: str) -> list[dict]:
    """
    Extract dialogue and associate it with the respective speaker.

    :param text: The input story text
    :return: A list of dictionaries with 'speaker' and 'dialogue'
    """
    dialogue_segments = []
    current_speaker = None

    # Regex patterns to detect dialogue with attribution
    quote_pattern = r'"(.*?)"'  # Matches quotes
    attribution_pattern = r'(.*?)(?:said|asked|replied|interrupted) ([A-Za-z0-9\-]+)'  # Matches attributions

    # Split text into lines for processing
    lines = text.split("\n")

    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip empty lines

        # Check for attribution
        match = re.search(attribution_pattern, line)
        if match:
            # Extract speaker and dialogue
            dialogue, speaker = match.groups()
            dialogue = re.findall(quote_pattern, line)
            if dialogue:
                dialogue_segments.append({
                    "speaker": speaker,
                    "dialogue": " ".join(dialogue)
                })
            current_speaker = speaker  # Update current speaker
        elif '"' in line:
            # Dialogue without attribution
            dialogue = re.findall(quote_pattern, line)
            if dialogue:
                dialogue_segments.append({
                    "speaker": current_speaker if current_speaker else "Unknown",
                    "dialogue": " ".join(dialogue)
                })
        else:
            # Narrative text; associate it with the last known speaker
            if current_speaker:
                dialogue_segments[-1]["dialogue"] += f" {line}"

    return dialogue_segments


# Example Usage
story_text = """
Both seemed in their early twenties, both were tall and perfectly formed.
"Still," said VJ-23X, "I hesitate to submit a pessimistic report to the Galactic Council."
"I wouldn't consider any other kind of report. Stir them up a bit. We've got to stir them up."
VJ-23X sighed. "Space is infinite. A hundred billion Galaxies are there for the taking. More."
"A hundred billion is not infinite and it's getting less infinite all the time. Consider! Twenty thousand years ago, mankind first solved the problem of utilizing stellar energy, and a few centuries later, interstellar travel became possible."
VJ-23X interrupted. "We can thank immortality for that."
"Very well. Immortality exists and we have to take it into account. I admit it has its seamy side, this immortality. The Galactic AC has solved many problems for us, but in solving the problems of preventing old age and death, it has undone all its other solutions."
"""

dialogue_data = extract_dialogue_with_speakers(story_text)

# Output the extracted dialogue
for idx, segment in enumerate(dialogue_data, 1):
    print(f"{idx}. {segment['speaker']}: {segment['dialogue']}")


1. VJ-23X: Still, I hesitate to submit a pessimistic report to the Galactic Council.
2. VJ-23X: I wouldn't consider any other kind of report. Stir them up a bit. We've got to stir them up.
3. VJ-23X: Space is infinite. A hundred billion Galaxies are there for the taking. More.
4. VJ-23X: A hundred billion is not infinite and it's getting less infinite all the time. Consider! Twenty thousand years ago, mankind first solved the problem of utilizing stellar energy, and a few centuries later, interstellar travel became possible.
5. VJ-23X: We can thank immortality for that.
6. VJ-23X: Very well. Immortality exists and we have to take it into account. I admit it has its seamy side, this immortality. The Galactic AC has solved many problems for us, but in solving the problems of preventing old age and death, it has undone all its other solutions.


OLD

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)  # Approve the authentication prompts as needed

import nltk
from nltk.tokenize import sent_tokenize
import string
import os
import re

# Ensure you have downloaded the punkt tokenizer
nltk.download('punkt')
nltk.download('punkt_tab')

# Make sure this matches your exact folder structure in Google Drive
story_path = '/content/drive/MyDrive/ShortStory.txt'

# Read the story from Google Drive
with open(story_path, 'r', encoding='utf-8') as file:
    story = file.read()

if not os.path.exists(story_path):
    print(f"Base path does not exist: {story_path}")
else:
    print(f"Base path exists: {story_path}")

Mounted at /content/drive


In [None]:
# Remove dashed separators and trim extra whitespace
story = story.replace('------------------------------------------------', '').strip()

# Tokenize the story into sentences
sentences = sent_tokenize(story)

# Preprocess function to normalize text for sorting
def preprocess_sentence(sentence):
    # Strip leading/trailing spaces, convert to lowercase, and remove dashes
    sentence = sentence.strip().lower().replace('-', '')
    # Remove punctuation except periods within abbreviations (e.g., U.S.A.)
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace('.', '')))
    return sentence

# Sort sentences alphabetically based on the first alphanumeric character
sorted_sentences = sorted(sentences, key=preprocess_sentence)

# Display sorted sentences
for sentence in sorted_sentences:
    print(sentence)

# Optionally, save the sorted sentences to a new file
output_path = '/content/drive/MyDrive/SortedStory2.txt'
with open(output_path, 'w', encoding='utf-8') as file:
    file.write('\n'.join(sorted_sentences))

print(f"Sorted sentences saved to {output_path}")

A good point.
A hundred billion Galaxies are there for the taking.
"A hundred billion is not infinite and it's getting less infinite all the time.
A thought came, infinitely distant, but infinitely clear.
A timeless interval was spent in doing that.
"A very good point.
A very good point."
AC said, "THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER."
Adell put his glass to his lips only occasionally, and Lupov's eyes slowly closed.
Adell was just drunk enough to try, just sober enough to be able to phrase the necessary symbols and operations into a question which, in words, might have corresponded to this: Will mankind one day without the net expenditure of energy be able to restore the sun to its full youthfulness even after it had died of old age?
After all, our own Galaxy alone pours out a thousand sunpower units a year and we only use two of those."
All collected data had come to a final end.
All Earth ran by invisible beams of sunpower.
All Earth turned off its burning coal

In [None]:

# Read the story content
with open(story_path, 'r', encoding='utf-8') as file:
    story = file.read()

# Remove dashed separators and trim extra whitespace
story = story.replace('------------------------------------------------', '').strip()

# Function to tokenize while preserving double-quoted text as a single sentence
def tokenize_with_double_quotes(text):
    # Regular expression to match text within double quotes
    double_quoted_pattern = r'"([^"]+)"'
    # Find all double-quoted sections
    double_quoted_sections = re.findall(double_quoted_pattern, text)

    # Replace double-quoted sections with placeholders
    text_without_quotes = re.sub(double_quoted_pattern, "QUOTE_PLACEHOLDER", text)

    # Tokenize remaining text into sentences
    sentences = nltk.sent_tokenize(text_without_quotes)

    # Replace placeholders with actual double-quoted text
    result = []
    for sentence in sentences:
        if "QUOTE_PLACEHOLDER" in sentence:
            # Replace each placeholder with the correct quoted text
            for quote in double_quoted_sections:
                if "QUOTE_PLACEHOLDER" in sentence:
                    sentence = sentence.replace("QUOTE_PLACEHOLDER", f'"{quote}"', 1)
        result.append(sentence)

    return result

# Tokenize the story while preserving double-quoted text as a single sentence
sentences = tokenize_with_double_quotes(story)

# Preprocess function to normalize text for sorting
def preprocess_sentence(sentence):
    # Strip leading/trailing spaces, convert to lowercase, and remove dashes
    sentence = sentence.strip().lower().replace('-', '')
    # Remove punctuation except periods within abbreviations (e.g., U.S.A.)
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace('.', '')))
    # Remove leading non-alphanumeric characters
    return sentence.lstrip(string.punctuation + string.whitespace)

# Sort sentences alphabetically based on the first alphanumeric character
sorted_sentences = sorted(sentences, key=preprocess_sentence)

# Display sorted sentences
for sentence in sorted_sentences:
    print(sentence)



A thought came, infinitely distant, but infinitely clear.
Adell put his glass to his lips only occasionally, and Lupov's eyes slowly closed.
All Earth ran by invisible beams of sunpower.
All Earth turned off its burning coal, its fissioning uranium, and flipped the switch that connected all of it to a small station, one mile in diameter, circling the Earth at half the distance of the Moon.
Almost all stars were white dwarfs, fading to the end.
And there was light----
And yet one of them was unique among them all in being the originals Galaxy.
As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
But slowly Multivac learned enough to answer deeper questions more fundamentally, and on May 14, 2061, what had been theory, became fact.
Can that not be done?"It's amazing when you think of it,"THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER."All the energy we can possibly ever use for free

In [None]:
# Make sure this matches your exact folder structure in Google Drive
story_path = '/content/drive/MyDrive/ShortStory.txt'

# Read the story from Google Drive
with open(story_path, 'r', encoding='utf-8') as file:
    story = file.read()

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

print('\n-----\n'.join(sorted(tokenizer.tokenize(story))))

"A hundred billion is not infinite and it's getting less infinite all the time.
-----
"A very good point.
-----
"All right, but now we can hook up each individual spaceship to the Solar Station, and it can go to Pluto and back a million times without ever worrying about fuel.
-----
"All right, then.
-----
"All right.
-----
"All the energy we can possibly ever use for free.
-----
"And don't say we'll switch to another sun."
-----
"And you?"
-----
"Are you sure, Jerrodd?"
-----
"Ask Multivac."
-----
"Ask him how to turn the stars on again."
-----
"Ask the Microvac," wailed Jerrodette I.
-----
"But even so," said Man, "eventually it will all come to an end.
-----
"But how can that be all of Universal AC?"
-----
"But when all energy is gone, our bodies will finally die, and you and I with them."
-----
"Can't you just put in a new power-unit, like with my robot?"
-----
"Cosmic AC," said Man, "How may entropy be reversed?"
-----
"Darn right they will," muttered Lupov.
-----
"Did the men upon