<a href="https://colab.research.google.com/github/humzkhan/Sentence_segmentation/blob/main/job.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Descriptive Alt Text](https://pbs.twimg.com/media/FVx-J8uWUAQ-1CM?format=jpg&name=large)


# **Alphabetically Sorting Sentences in a Short Story**

## **Objective**
The task is to alphabetically sort all sentences in a short story. To achieve this, it’s critical to first correctly identify and segment individual sentences, which is more challenging than it may initially appear.

---

## **Challenges**
1. **Sentence Boundary Detection**
   - Identifying where one sentence ends and another begins is non-trivial due to:
     - **Abbreviations:** Cases like "Dr.", "Mr.", "e.g.", or "etc." can confuse sentence segmentation because their periods look like sentence endings but are not.
     - **Company Names:** Names like "Acme Inc." or "Global Enterprises Ltd." that include periods can cause errors in naïve sentence splitting.
     - **Decimal Values:** Numbers like "3.14" or "1,000.50" may be mistakenly treated as sentence boundaries.
     - **Ellipses:** "..." (three dots) can occur within sentences for pauses or trailing thoughts, adding complexity to identifying actual sentence endings.
     - **Quotation Marks and Dialogue:** Handling nested quotes, dialogue interruptions, and multi-line quotations requires custom handling.
     - **Special Patterns:** Cases like "U.S.A." (abbreviated country names) or email addresses ("example@email.com") must not be split into multiple sentences.


2. **Handling Narrative and Dialogue**
   - Dialogue and narrative text often follow different structural patterns, requiring nuanced handling:
     - Dialogue can span multiple lines or be interrupted by narrative descriptions.
     - Typos such as missing or extra quotation marks need to be addressed.

3. **Edge Cases**
   - Embedded line breaks within sentences.
   - Handling duplicates and ensuring consistency across processed output.
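
To make the boundary problem concrete, here is a minimal sketch of a purely punctuation-based splitter. The helper name `naive_split` and the sample text are assumptions made for illustration; they are not part of the notebook.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' whenever whitespace follows (hypothetical helper)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


# Invented sample: two real sentences containing an abbreviation,
# a decimal value, and a company name.
sample = "Dr. Lee paid 3.14 dollars to Acme Inc. yesterday. He left."
pieces = naive_split(sample)

for piece in pieces:
    print(repr(piece))
```

The sample contains two sentences, yet the splitter returns four pieces because "Dr." and "Inc." are each followed by a space. The decimal survives only because no whitespace follows its period.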

---

## **Part One: Naive Solution**

### **Approach**
1. **Goal:** Alphabetically sort sentences as-is using a basic sentence segmentation approach.
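2. **Method:** Use spaCy's pre-trained language model for sentence segmentation.
   - spaCy's `doc.sents` yields the sentences of a processed document; with `en_core_web_sm`, the boundaries are predicted by the model's statistical dependency parser rather than by hand-written rules.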

### **Strengths of the Naive Approach**
- Quick to implement and immediately identifies many sentences correctly.
- Handles common sentence structures effectively.

### **Limitations**
- Struggles with quotations and dialogue-heavy text.
- Does not handle interruptions or multi-line dialogue properly.
- Relies entirely on the pre-trained model's boundary predictions, which are not tailored to the nuances of this specific task.



In [52]:
import requests

# URL of the raw text file on GitHub
url = "https://raw.githubusercontent.com/humzkhan/Sentence_segmentation/refs/heads/main/ShortStory.txt"

# Download the file
response = requests.get(url)
story = response.text

# Print the first 500 characters to verify
print(story[:500])


The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light. The question came about as a result of a five dollar bet over highballs, and it happened this way:
Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac. As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer. They had at least a vague notion of


In [53]:
import spacy

# Preprocessing function to clean the text
def preprocess_text(story: str) -> str:
    """
    Preprocess the story text by removing separators, patching known quotation typos,
    and standardizing quotation marks.
    :param story: Raw story text
    :return: Cleaned story text
    """
    # Remove page boundaries or separators
    story = story.replace('------------------------------------------------', '').strip()

    # Hand-patch the three quotation typos found in this story
    # 1. Drop the stray opening quote in front of a narrative sentence
    story = story.replace(""""It was Adell's turn to be contrary.""", "It was Adell's turn to be contrary.")
    # 2. Add the missing opening quote
    story = story.replace("The stars are the power-units, dear", """"The stars are the power-units, dear""")
    # 3. Drop the unmatched opening quote
    story = story.replace(""""The stars and Galaxies died and snuffed out, """, """The stars and Galaxies died and snuffed out, """)

    # Standardize quotation marks
    story = story.replace('“', '"').replace('”', '"')
    story = story.replace('‘', "'").replace('’', "'")

    return story



# Sentence segmentation using spaCy
def segment_sentences_spacy(story: str) -> list[str]:
    """
    Segment the story text into sentences using spaCy.
    :param story: Preprocessed story text
    :return: List of segmented sentences
    """

    # Remove extra spaces or newlines
    story = ' '.join(story.split())

    # Load spaCy's English language model
    nlp = spacy.load("en_core_web_sm")

    # Process the story with spaCy
    doc = nlp(story)

    # Extract sentences
    sentences = [sent.text.strip() for sent in doc.sents]

    return sentences

## **Why spaCy?**

### **Justifications for Choosing spaCy**
1. **Pre-Trained Language Model:**
   - spaCy is equipped with a robust sentence boundary detection system trained on diverse datasets.
   - Handles abbreviations, numbers, ellipses, and complex structures with minimal manual intervention.

2. **Low Maintenance:**
   - Unlike manually defined rules or heuristics, the pre-trained model generalizes to sentence patterns it was not explicitly configured for, making it easy to use and maintain.

3. **Scalable and Extendable:**
   - Provides a foundation that can be enhanced with custom logic (e.g., dialogue stitching, typo handling).

4. **Alternatives Considered:**
   - **Manual Heuristics:**
     - Fragile and time-consuming to implement for all edge cases.
   - **Regular Expressions:**
     - Prone to breaking with complex text patterns and difficult to maintain.
   - **Custom Models:**
     - Resource-intensive and unnecessary given spaCy’s pre-trained capabilities.
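
The fragility of manual heuristics can be illustrated with a small sketch. It assumes a hand-maintained abbreviation set; `KNOWN_ABBREVIATIONS`, `heuristic_split`, and both sample strings are hypothetical and do not appear in the notebook.

```python
# Hand-maintained abbreviation list: an assumption for illustration only.
KNOWN_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "e.g.", "etc."}


def heuristic_split(text: str) -> list[str]:
    """End a sentence at any token ending in . ! ? unless it is a known abbreviation."""
    sentences: list[str] = []
    current: list[str] = []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in KNOWN_ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # Trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


# Every abbreviation here is in the list, so the split is correct.
covered = heuristic_split("Dr. Lee works at Acme Inc. in Boston. He left.")

# "Ltd." is not in the list, so the first sentence is cut in two.
uncovered = heuristic_split("She joined Global Enterprises Ltd. in May. He left.")

print(covered)
print(uncovered)
```

The heuristic is only as good as its list: each new abbreviation needs a manual entry, and an abbreviation that genuinely ends a sentence is never treated as a boundary.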


In [54]:
# Preprocess and segment
cleaned_story = preprocess_text(story)
segmented_sentences = segment_sentences_spacy(cleaned_story)

# Output segmented sentences
for i, sentence in enumerate(segmented_sentences, 1):
    print(f"{i}: {sentence}")

1: The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
2: The question came about as a result of a five dollar bet over highballs, and it happened this way: Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac.
3: As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
4: They had at least a vague notion of the general plan of relays and circuits that had long since grown past the point where any single human could possibly have a firm grasp of the whole.
5: Multivac was self-adjusting and self-correcting.
6: It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough -- so Adell and Lupov attended the monstrous giant only lightly and superficially, yet as well as any men could.
7: They fed it data, adjusted questions to its needs and translated 

In [55]:
# Sort the final sentences alphabetically
sorted_sentences = sorted(segmented_sentences, key=str.lower)

# Print the sorted sentences with enumeration
for idx, sentence in enumerate(sorted_sentences, 1):
    print(f"{idx}: {sentence}")

1: "A hundred billion is not infinite and it's getting less infinite all the time.
2: "A very good point.
3: "All right, but now we can hook up each individual spaceship to the Solar Station, and it can go to Pluto and back a million times without ever worrying about fuel.
4: "All right, then.
5: "All right.
6: "All the energy we can possibly ever use for free.
7: "And don't say we'll switch to another sun."
8: "And you?"
9: "Are you sure, Jerrodd?"
10: "Ask Multivac."
11: "Ask the Microvac," wailed Jerrodette I. "Ask him how to turn the stars on again."
12: "But even so," said Man, "eventually it will all come to an end.
13: "But how can that be all of Universal AC?"
14: "But when all energy is gone, our bodies will finally die, and you and I with them."
15: "Can't you just put in a new power-unit, like with my robot?"
16: "Cosmic AC," said Man, "How may entropy be reversed?"
17: "Darn right they will," muttered Lupov.
18: "Did the men upon it die?" asked Zee Prime, startled and witho

---

## **Part Two: Extending the Solution**

### **Enhancements**
1. **Goal:** Improve sentence segmentation to handle dialogue and narrative more accurately.
2. **Key Improvements:**
   - **Dialogue Stitching:**
     - Identify and combine multi-line quotations into single cohesive units.
     - Handle interruptions (e.g., `"I am 200 years -" Bob interrupts, "Woow"`).
   - **Quotation Typo Fixing:**
     - Detect and correct lines with odd numbers of quotation marks.
     - Automatically remove misplaced quotations or complete missing ones.
   - **Edge Case Handling:**
     - Account for embedded line breaks.
     - Handle abbreviations, ellipses, and other tricky patterns.

3. **Result:**
   - A more robust segmentation system that works well with both narrative and dialogue-heavy text.
   - Higher accuracy in processing and sorting sentences alphabetically.





1. **Quotation Completion Automation**
   - **Why Automate?** Quotation marks are critical for maintaining the integrity of dialogue, and missing or misplaced quotes can disrupt sentence segmentation entirely. Manually fixing these errors would be time-intensive and error-prone.
   - **Examples of Investigated Cases:**

     1. **Extra Quotation Marks:**
        - Raw: `"It was Adell's turn to be contrary. "Maybe we can build things up again someday," he said.`
        - Issue: Three quotation marks. The first quote is unnecessary.
        - Solution: Remove the first quote, resulting in: `It was Adell's turn to be contrary. "Maybe we can build things up again someday," he said.`
     
     2. **Missing Opening Quotation Marks:**
        - Raw: `The stars are the power-units, dear. Once they're gone, there are no more power-units."`
        - Issue: The closing quotation is present, but the opening quotation is missing.
        - Solution: Add the opening quote at the beginning: `"The stars are the power-units, dear. Once they're gone, there are no more power-units."`
     
     3. **Missing Closing Quotation Mark:**
        - Raw: `"The stars and Galaxies died and snuffed out, and space grew black after ten trillion years of running down.`
        - Issue: The opening quotation is present, but the closing quotation is missing.
        - Solution: Add the missing quote at the end of the line: `"The stars and Galaxies died and snuffed out, and space grew black after ten trillion years of running down."`
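
Before repairing a line, it helps to decide which of the three cases it falls into. The sketch below is a simplified classifier written for illustration; the function name `classify_quote_issue` and its labels are assumptions, and any odd count above one is lumped into a single "extra" label.

```python
def classify_quote_issue(line: str) -> str:
    """Label a line by the kind of quotation imbalance it shows (hypothetical helper)."""
    stripped = line.strip()
    count = stripped.count('"')
    if count % 2 == 0:
        return "balanced"
    if count == 1 and stripped.startswith('"'):
        return "missing_closing"
    if count == 1 and stripped.endswith('"'):
        return "missing_opening"
    if count == 1:
        return "stray"  # Lone quote with text on both sides
    return "extra"  # Simplification: any odd count of three or more


# The three raw lines quoted in the cases above.
case_1 = '''"It was Adell's turn to be contrary. "Maybe we can build things up again someday," he said.'''
case_2 = '''The stars are the power-units, dear. Once they're gone, there are no more power-units."'''
case_3 = '''"The stars and Galaxies died and snuffed out, and space grew black after ten trillion years of running down.'''

for raw_line in (case_1, case_2, case_3):
    print(classify_quote_issue(raw_line))
```

Counting quotation marks per line is only a heuristic: it assumes each paragraph sits on its own line and that no quotation legitimately spans a line break.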


In [57]:
import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

def segment_sentences_spacy_2(line: str) -> list[str]:
    """
    Segment a single line of text into sentences using spaCy.
    :param line: One line (paragraph) of the story
    :return: List of segmented sentences
    """

    doc = nlp(line)
    return [sent.text.strip() for sent in doc.sents]


def fix_odd_quotation_case_1(line: str) -> str:
    """
    Fix a line with odd quotation marks by identifying and removing the invalid one.
    :param line: The input line with odd quotations.
    :return: The corrected line.
    """
    # Identify positions of all quotation marks
    quote_positions = [i for i, char in enumerate(line) if char == '"']

    # If there are not 3 quotation marks, return the line unchanged
    if len(quote_positions) != 3:
        return line

    # Extract text between quotation marks and validate pairs
    valid_pair = None
    for i in range(len(quote_positions) - 1):
        start = quote_positions[i]
        end = quote_positions[i + 1]

        # Check for spaces OUTSIDE the quotes
        valid_start = (start == 0 or line[start - 1] == " ")
        valid_end = (end == len(line) - 1 or line[end + 1] in [" ", "\n"])

        if valid_start and valid_end:
            valid_pair = (start, end)
            break

    # If no valid pair is found, default to the outermost pair
    if valid_pair is None:
        valid_pair = (quote_positions[0], quote_positions[-1])

    # Identify the invalid quotation
    invalid_quote = [pos for pos in quote_positions if pos not in valid_pair][0]

    # Remove the invalid quotation
    line = line[:invalid_quote] + line[invalid_quote + 1:]

    return line




def fix_odd_quotation_case_2_3(line: str) -> str:
    """
    Handle lines with exactly one quotation mark by completing or removing it.
    :param line: The input line with one quotation mark.
    :return: The corrected line.
    """
    # Work on the stripped line so position checks are not thrown off by padding
    line = line.strip()

    # Identify the position of the lone quotation mark
    quote_pos = line.find('"')

    if quote_pos == -1:  # If no quotation mark, return the line unchanged
        return line

    text_before = line[:quote_pos].strip()
    text_after = line[quote_pos + 1:].strip()

    # Case 0: Text on both sides; assume the quotation mark is a typo and remove it
    if text_before and text_after:
        return line[:quote_pos] + line[quote_pos + 1:]

    # Case 2: Lone opening quote at the start of the line; add the closing quote
    if quote_pos == 0:
        sentences = segment_sentences_spacy_2(line)
        if len(sentences) <= 1:  # Only one sentence: close the quote at the end of the line
            return line + '"'
        # Several sentences: close the quote at the end of the first sentence
        first_sentence = sentences[0]
        rest_of_line = line[len(first_sentence):].strip()
        return f'{first_sentence}" {rest_of_line}'

    # Case 3: Lone closing quote at the end of the line; add the opening quote at the start.
    # With Case 0 ruled out and the line stripped, the quote must be the last character.
    return f'"{line}'


def fix_odd_quotation_lines(text: str) -> str:
    """
    Fix lines with an odd number of quotation marks in the text, handling misplaced quotes appropriately.
    :param text: The raw input text, line-separated.
    :return: The corrected text with balanced quotation marks.
    """
    lines = text.split("\n")
    fixed_lines = []

    for line in lines:
        stripped_line = line.strip()
        quote_count = stripped_line.count('"')

        # Case 1: Handle lines with 3 quotation marks
        if quote_count == 3:
            stripped_line = fix_odd_quotation_case_1(stripped_line)

        # Cases 2 & 3: Handle lines with a single, unpaired quotation mark
        elif quote_count == 1:
            stripped_line = fix_odd_quotation_case_2_3(stripped_line)

        # Other odd counts (5, 7, ...) are not handled and pass through unchanged

        # Append corrected lines (if valid and not empty)
        if stripped_line:
            fixed_lines.append(stripped_line)


    return "\n".join(fixed_lines)



# Preprocessing function to clean the text
def preprocess_text(story: str) -> str:
    """
    Preprocess the story text by removing separators, standardizing quotation marks,
    and balancing lines that have an odd number of quotation marks.
    :param story: Raw story text
    :return: Cleaned story text
    """
    # Remove page boundaries or separators
    story = story.replace('------------------------------------------------', '').strip()

    # Standardize quotation marks
    story = story.replace('“', '"').replace('”', '"')
    story = story.replace('‘', "'").replace('’', "'")

    # Fix odd quotation marks (replaces the hand-written patches used in Part One)
    story = fix_odd_quotation_lines(story)

    return story



# Sentence segmentation using spaCy
def segment_sentences_spacy(story: str) -> list[str]:
    """
    Segment the story text into sentences using spaCy.
    :param story: Preprocessed story text
    :return: List of segmented sentences
    """

    # Collapse newlines and repeated spaces into single spaces
    story = ' '.join(story.split())

    # Reuse the spaCy model loaded at the top of this cell instead of reloading it
    doc = nlp(story)

    # Extract sentences
    return [sent.text.strip() for sent in doc.sents]




In [58]:
# Preprocess and segment
cleaned_story = preprocess_text(story)
segmented_sentences = segment_sentences_spacy(cleaned_story)

# Output segmented sentences
for i, sentence in enumerate(segmented_sentences, 1):
    print(f"{i}: {sentence}")

1: The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
2: The question came about as a result of a five dollar bet over highballs, and it happened this way: Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac.
3: As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
4: They had at least a vague notion of the general plan of relays and circuits that had long since grown past the point where any single human could possibly have a firm grasp of the whole.
5: Multivac was self-adjusting and self-correcting.
6: It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough -- so Adell and Lupov attended the monstrous giant only lightly and superficially, yet as well as any men could.
7: They fed it data, adjusted questions to its needs and translated 

2. **Sentence Stitching**
   - **Why Stitch Sentences?** A single quoted speech often contains several sentences (e.g., `"Hello. How are you? I am fine."`). The segmenter emits each one as a separate item, so only the first carries the opening quotation mark and only the last carries the closing one. Sorted independently, the pieces of one speech end up scattered across the output.
   - **Examples:**
     1. **Unstitched Dialogue:**
        - `22: "All the energy we can possibly ever use for free.`
        - `23: Enough energy, if we wanted to draw on it, to melt all Earth into a big drop of impure liquid iron, and still never miss the energy so used.`
        - `24: All the energy we could ever use, forever and forever and forever."`  
          
        - Issue: The three lines are treated as independent sentences.
     2. **Stitched Dialogue:**
        - `22: "All the energy we can possibly ever use for free. Enough energy, if we wanted to draw on it, to melt all Earth into a big drop of impure liquid iron, and still never miss the energy so used. All the energy we could ever use, forever and forever and forever."`
        - Solution: Combine the lines to preserve the integrity of the dialogue.
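
The idea behind stitching can be shown in a few lines: keep appending segments until the running count of quotation marks is even. This is a compact sketch of that parity rule, not the notebook's function; `stitch_by_quote_parity` is a hypothetical name, and the fourth sample item is an invented narrative sentence.

```python
def stitch_by_quote_parity(sentences: list[str]) -> list[str]:
    """Merge consecutive segments until every opened quotation has been closed."""
    stitched: list[str] = []
    buffer: list[str] = []
    for sentence in sentences:
        buffer.append(sentence)
        # An even total means no quotation is left open in the buffer
        if sum(part.count('"') for part in buffer) % 2 == 0:
            stitched.append(" ".join(buffer))
            buffer = []
    if buffer:  # Unclosed quotation at the end of the input
        stitched.append(" ".join(buffer))
    return stitched


pieces = [
    '"All the energy we can possibly ever use for free.',
    "Enough energy, if we wanted to draw on it, to melt all Earth into a big drop "
    "of impure liquid iron, and still never miss the energy so used.",
    'All the energy we could ever use, forever and forever and forever."',
    "Adell nodded.",  # Invented narrative sentence, for illustration only
]

stitched = stitch_by_quote_parity(pieces)
for idx, item in enumerate(stitched, 1):
    print(f"{idx}: {item}")
```

The first three pieces come back as one item and the narrative sentence stays separate. The rule depends on balanced quotation marks, which is why the typo fixes have to run first.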

In [59]:
def stitch_sentences_with_nested_quotes(sentences: list[str]) -> list[str]:
    """
    Stitch consecutive sentences that belong to one quotation, so that every
    quoted passage is returned as a single item with both of its quotation marks.

    :param sentences: List of segmented sentences
    :return: List of stitched sentences
    """
    stitched_sentences = []
    buffer = ""  # Temporary buffer for stitching
    inside_quote = False  # Tracks whether we're inside a quoted section

    for sentence in sentences:
        # Count the number of quotation marks
        quote_count = sentence.count('"')

        if inside_quote:
            # Add the sentence to the buffer
            buffer += f" {sentence}"
            if quote_count % 2 == 1:  # Check if this closes the open quote
                stitched_sentences.append(buffer.strip())
                buffer = ""  # Reset the buffer
                inside_quote = False
        else:
            # Check if this starts a quoted section
            if quote_count % 2 == 1:
                buffer = sentence
                inside_quote = True
            else:
                # If it's not a quoted section, just append as is
                stitched_sentences.append(sentence.strip())

    # If there's anything left in the buffer (edge case), append it
    if buffer:
        stitched_sentences.append(buffer.strip())

    return stitched_sentences


# Stitch sentences with nested quotes
stitched_sentences = stitch_sentences_with_nested_quotes(segmented_sentences)

# Output the stitched sentences with enumeration
for idx, sentence in enumerate(stitched_sentences, 1):
    print(f"{idx}: {sentence}")





1: The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
2: The question came about as a result of a five dollar bet over highballs, and it happened this way: Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac.
3: As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
4: They had at least a vague notion of the general plan of relays and circuits that had long since grown past the point where any single human could possibly have a firm grasp of the whole.
5: Multivac was self-adjusting and self-correcting.
6: It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough -- so Adell and Lupov attended the monstrous giant only lightly and superficially, yet as well as any men could.
7: They fed it data, adjusted questions to its needs and translated 

3. **Line Break Function**
   - **Why Handle Line Breaks?** In narrative texts, line breaks often separate character voices or distinct ideas. These need to be respected during segmentation.
   - **Examples:**
     1. **Narrative with Line Breaks:**
        - Raw: `"A hundred billion is not infinite and it's getting less infinite all the time"`  
          `VJ-23X interrupted.`
        - Issue: After stitching, a quotation and the narrative that follows it can end up joined in one item, even though the raw text places them on separate lines.
        - Solution: Use the line breaks in the raw text to split such items, resulting in:
          - `"A hundred billion is not infinite and it's getting less infinite all the time"`
          - `VJ-23X interrupted.`
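
A minimal sketch of line-break-aware splitting follows, assuming the raw text keeps one paragraph per line. The helper `split_at_raw_lines` is hypothetical, and the sample reuses the raw example above.

```python
def split_at_raw_lines(item: str, raw_text: str) -> list[str]:
    """Split a stitched item wherever a complete raw line occurs inside it."""
    parts: list[str] = []
    remaining = item
    for raw_line in raw_text.splitlines():
        raw_line = raw_line.strip()
        if not raw_line or raw_line not in remaining:
            continue
        before, remaining = remaining.split(raw_line, 1)
        if before.strip():
            parts.append(before.strip())  # Keep text that precedes the match
        parts.append(raw_line)
    if remaining.strip():
        parts.append(remaining.strip())  # Keep text after the last match
    return parts


raw = (
    '"A hundred billion is not infinite and it\'s getting less infinite all the time"\n'
    "VJ-23X interrupted."
)

# Simulate an item in which the two raw lines were joined by a single space.
joined_item = " ".join(raw.split())

parts = split_at_raw_lines(joined_item, raw)
for part in parts:
    print(part)
```

Matching by substring is a simplification: a short raw line that also occurs inside an unrelated sentence would trigger a false split, so a production version would probably track character offsets instead.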

In [60]:
def split_sentences_with_line_breaks(sentences: list[str], story: str) -> list[str]:
    """
    Split sentences with multiple quotations into separate segments based on line breaks in the raw text.

    :param sentences: List of stitched sentences
    :param story: Raw story text with line breaks
    :return: List of refined sentences split at line breaks
    """
    # Split the raw story into lines for reference
    raw_lines = story.splitlines()

    # Create a new list to store final sentences
    refined_sentences = []

    for sentence in sentences:
        # If the sentence contains two or more quotes, check for line breaks
        if sentence.count('"') >= 2:
            # Initialize a list to store split parts of the sentence
            split_parts = []

            # Track the current working sentence
            current_sentence = sentence

            # Iterate over each line in the raw story
            for line in raw_lines:
                line = line.strip()  # Clean the line
                if line and line in current_sentence:
                    # Split the sentence at the first occurrence of the matching line
                    before, after = current_sentence.split(line, 1)
                    if before.strip():
                        # Keep text that precedes the match instead of silently dropping it
                        split_parts.append(before.strip())
                    split_parts.append(line)  # Add the matching line
                    current_sentence = after  # Keep the remaining part for further checks

            # Add all split parts and any remaining portion to the refined sentences
            refined_sentences.extend(split_parts)
            if current_sentence.strip():
                refined_sentences.append(current_sentence.strip())
        else:
            # If the sentence doesn't meet the criteria, keep it as is
            refined_sentences.append(sentence.strip())

    return refined_sentences

# Step 3: Split stitched sentences based on embedded line breaks
final_sentences = split_sentences_with_line_breaks(stitched_sentences, cleaned_story)

# Output the final sentences
for idx, sentence in enumerate(final_sentences, 1):
    print(f"{idx}: {sentence}")




1: The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
2: The question came about as a result of a five dollar bet over highballs, and it happened this way: Alexander Adell and Bertram Lupov were two of the faithful attendants of Multivac.
3: As well as any human beings could, they knew what lay behind the cold, clicking, flashing face -- miles and miles of face -- of that giant computer.
4: They had at least a vague notion of the general plan of relays and circuits that had long since grown past the point where any single human could possibly have a firm grasp of the whole.
5: Multivac was self-adjusting and self-correcting.
6: It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough -- so Adell and Lupov attended the monstrous giant only lightly and superficially, yet as well as any men could.
7: They fed it data, adjusted questions to its needs and translated 


---

## **Pipeline Overview**

1. **Pre-Processing:**
   - Clean raw text by removing unnecessary separators (e.g., `---`) and standardizing quotation marks.
   - Fix common typos like extra or missing quotation marks.

2. **Processing:**
   - Use spaCy for sentence segmentation.
   - Extend segmentation logic to stitch multi-line quotations and handle interruptions.
   - Rely on spaCy's model for edge cases like abbreviations, ellipses, and company names.

3. **Post-Processing:**
   - Alphabetically sort sentences and remove duplicates.
   - Ensure clean and coherent output with narrative and dialogue preserved.
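
The post-processing step can be sketched on a handful of sentences from the story. The helper names `sort_key` and `dedupe_and_sort` are assumptions made for this illustration; the key mirrors the idea of ignoring leading quotation marks and parentheses when comparing sentences.

```python
def sort_key(sentence: str) -> str:
    """Compare sentences case-insensitively, ignoring surrounding quotes and parentheses."""
    return sentence.lstrip('"( ').rstrip(')" ').lower()


def dedupe_and_sort(sentences: list[str]) -> list[str]:
    """Drop exact duplicates (keeping first-seen order), then sort alphabetically."""
    unique = list(dict.fromkeys(sentences))
    return sorted(unique, key=sort_key)


sample = [
    '"Ask Multivac."',
    "A thought came, infinitely distant, but infinitely clear.",
    '"A very good point.',
    '"Ask Multivac."',  # Exact duplicate
    '"And you?"',
    "A timeless interval was spent in doing that.",
]

plain_order = sorted(set(sample), key=str.lower)
result = dedupe_and_sort(sample)

for idx, sentence in enumerate(result, 1):
    print(f"{idx}: {sentence}")
```

With a plain `str.lower` key, every sentence that starts with a quotation mark sorts ahead of all narrative sentences, because `"` precedes letters in character order. Stripping the quotation marks in the key interleaves dialogue and narrative by their first word.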

In [61]:
# Sort the final sentences alphabetically
sorted_sentences = sorted(final_sentences, key=str.lower)

# Print the sorted sentences with enumeration
for idx, sentence in enumerate(sorted_sentences, 1):
    print(f"{idx}: {sentence}")


1: "A hundred billion is not infinite and it's getting less infinite all the time. Consider! Twenty thousand years ago, mankind first solved the problem of utilizing stellar energy, and a few centuries later, interstellar travel became possible. It took mankind a million years to fill one small world and then only fifteen thousand years to fill the rest of the Galaxy. Now the population doubles every ten years --"
2: "A very good point. Already, mankind consumes two sunpower units per year."
3: "All right, but now we can hook up each individual spaceship to the Solar Station, and it can go to Pluto and back a million times without ever worrying about fuel. You can't do THAT on coal and uranium. Ask Multivac, if you don't believe me."
4: "All right, then. Billions and billions of years. Twenty billion, maybe. Are you satisfied?"
5: "All right. Who says they won't?"
6: "All the energy we can possibly ever use for free. Enough energy, if we wanted to draw on it, to melt all Earth into a b

In [62]:
# Remove duplicates and sort sentences alphabetically, ignoring leading quotes and parentheses
unique_sorted_sentences = sorted(
    dict.fromkeys(final_sentences),  # Remove duplicates while keeping first-seen order, so ties sort deterministically
    key=lambda s: s.lstrip('"( ').rstrip(')" ').lower()  # Sort while ignoring quotes and parentheses
)

# Print the sorted unique sentences with enumeration
for idx, sentence in enumerate(unique_sorted_sentences, 1):
    print(f"{idx}: {sentence}")


1: "A hundred billion is not infinite and it's getting less infinite all the time. Consider! Twenty thousand years ago, mankind first solved the problem of utilizing stellar energy, and a few centuries later, interstellar travel became possible. It took mankind a million years to fill one small world and then only fifteen thousand years to fill the rest of the Galaxy. Now the population doubles every ten years --"
2: A thought came, infinitely distant, but infinitely clear.
3: A timeless interval was spent in doing that.
4: "A very good point. Already, mankind consumes two sunpower units per year."
5: Adell put his glass to his lips only occasionally, and Lupov's eyes slowly closed.
6: Adell was just drunk enough to try, just sober enough to be able to phrase the necessary symbols and operations into a question which, in words, might have corresponded to this: Will mankind one day without the net expenditure of energy be able to restore the sun to its full youthfulness even after it ha



---

## **Conclusion**
By iteratively improving the solution, I’ve built a robust pipeline that leverages spaCy for sentence segmentation, supplemented by custom logic to handle edge cases. The final solution processes narrative and dialogue accurately, producing clean, alphabetically sorted sentences while addressing challenges like quotation handling, typos, and tricky sentence patterns.

This approach balances the strengths of pre-trained AI models with task-specific enhancements, resulting in a solution that is both efficient and effective for processing complex text.