# About This Assignment

Design and implement a complete **Natural Language Processing (NLP)** pipeline for
advanced sequence-to-sequence tasks using the Sherlock Holmes dataset, including:
- text summarisation
- semantic search
- thematic analysis

The focus is on understanding the process, implementing modular steps, and critically evaluating outcomes.

**Objective**

Write a comprehensive report detailing the development, findings, and
results of your NLP pipeline, focusing on:
- How design choices influenced performance.
- Challenges encountered at each stage.
- Insights gained from the dataset and NLP methods used.
- Suggested improvements for each component of the pipeline.


# About this Data

- This collection features all the stories and novels of Sherlock Holmes by Arthur Conan Doyle. 
- Within the Sherlock folder, you'll find multiple .txt files, each containing a unique story.

# Importing necessary libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk 

# Importing the dataset

Because the dataset is a folder containing one `.txt` file per story, we load each story into a dictionary so that individual stories are easy to look up and process.
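The loader sorts filenames by their numeric part before reading them. The small sketch below shows why, assuming the files are named `1.txt`, `2.txt`, `10.txt` and so on (the names here are made up for illustration): a plain string sort compares character by character, so `10.txt` lands before `2.txt`.

```python
import os

# Hypothetical filenames, used only to illustrate the ordering problem
names = ["10.txt", "2.txt", "1.txt"]

# Plain sort compares strings character by character
alphabetical = sorted(names)

# Sorting on the integer stem gives the intended story order
numerical = sorted(names, key=lambda name: int(os.path.splitext(name)[0]))

print(alphabetical)
print(numerical)
```

If any file in the folder has a non-numeric name, `int(...)` raises a `ValueError`, so the folder should contain only the story files.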

In [2]:
path = 'sherlock'

files = os.listdir(path)
stories = {}
# Sort by the numeric part of the filename: os.listdir() returns names in arbitrary order,
# and a plain alphabetical sort would not give numeric order
files.sort(key=lambda x: int(os.path.splitext(x)[0]))

# Iterate over each file in the folder
for idx, file in enumerate(files):
    with open(os.path.join(path, file), 'r') as data:
        contents = data.read()
        stories[idx] = contents  

# Access stories using numeric indices
print(stories[1])  
print(len(stories[1]))





                      THE ADVENTURE OF THE THREE GARRIDEBS

                               Arthur Conan Doyle



     It may have been a comedy, or it may have been a tragedy. It cost one
     man his reason, it cost me a blood-letting, and it cost yet another
     man the penalties of the law. Yet there was certainly an element of
     comedy. Well, you shall judge for yourselves.

     I remember the date very well, for it was in the same month that
     Holmes refused a knighthood for services which may perhaps some day
     be described. I only refer to the matter in passing, for in my
     position of partner and confidant I am obliged to be particularly
     careful to avoid any indiscretion. I repeat, however, that this
     enables me to fix the date, which was the latter end of June, 1902,
     shortly after the conclusion of the South African War. Holmes had
     spent several days in bed, as was his habit from time to time, but he
     emerged that morning with a long fo

# Task 1

Clean the Sherlock Holmes dataset to handle common text preprocessing challenges, and provide a short report detailing those challenges and how they were addressed.

Challenges presented by the unprocessed text:

- It includes punctuation.
- It includes numbers that are dates and times (once the `:` in a time is removed, only a sequence of digits remains).
- It includes uppercase letters.
- Copyright text follows the end of each story, after a `----------` line.
- Some stories include a table of contents.
- The title and the author appear at the beginning.
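The structural challenges in this list (title and author at the top, copyright text at the bottom, chapter headings) can be handled by one small helper. The following is a minimal sketch, not the notebook's final implementation: the function name `strip_front_and_back_matter` and the sample text are made up for illustration, and it assumes the markers `----------`, `CHAPTER I` and `Arthur Conan Doyle` appear as described above.

```python
def strip_front_and_back_matter(raw, author="Arthur Conan Doyle", footer_marker="----------"):
    """Return only the narrative part of a story file (hypothetical helper)."""
    # Everything after the dashed rule is copyright text
    body = raw.split(footer_marker, 1)[0]
    if "CHAPTER I" in body:
        # maxsplit=1 keeps all later chapters: "CHAPTER I" is also a prefix of "CHAPTER II"
        body = body.split("CHAPTER I", 1)[1]
        # Drop what is left of the heading line
        body = body.split("\n", 1)[1] if "\n" in body else ""
    elif author in body:
        # Keep the text after the last occurrence of the author's name
        body = body.rsplit(author, 1)[1]
    return body.strip()

# Made-up sample that mimics the layout of a story file
sample = (
    "   THE ADVENTURE OF THE THREE GARRIDEBS\n\n"
    "   Arthur Conan Doyle\n\n"
    "   It may have been a comedy.\n"
    "----------\n"
    "   Copyright notice goes here.\n"
)
print(strip_front_and_back_matter(sample))
```

One limitation of this sketch: if a table of contents itself lists `CHAPTER I`, the split happens there, and the rest of the table of contents is kept.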

## Remove Copyright Text & Special Characters, Convert to Lowercase & Tokenize

### Version 1

In this version we remove unnecessary text and special characters, convert the text to lowercase, and tokenize it. Stopwords are kept.

In [None]:
nltk.download('punkt')


In [None]:
stories_wstopwords = {}

for i in stories:
    # Split the story at the dashes and keep only the story part (not the information about copyrights)
    stories_wstopwords[i] = stories[i].split("----------")[0]
 
    # Check if 'CHAPTER I' exists before trying to split by it
    if "CHAPTER I" in stories_wstopwords[i]:
        # Keep everything after the first 'CHAPTER I' to drop the table of contents, title, author, etc.
        # maxsplit=1 matters: 'CHAPTER I' is also a prefix of 'CHAPTER II', 'CHAPTER III', ...
        content_after_chapter = stories_wstopwords[i].split("CHAPTER I", 1)[1]
        # Drop the rest of the heading line and the line that follows it
        content_lines = content_after_chapter.split("\n")
        stories_wstopwords[i] = "\n".join(content_lines[2:])
    elif "Arthur Conan Doyle" in stories_wstopwords[i]:
        # Remove text before "Arthur Conan Doyle" if "CHAPTER I" does not exist
        parts = stories_wstopwords[i].split("Arthur Conan Doyle")
        # Keep everything after the last occurrence of the author
        stories_wstopwords[i] = parts[-1] if len(parts) > 1 else parts[0]  
        
    # Remove special characters
    stories_wstopwords[i] = re.sub(r"[^\w ]", "", stories_wstopwords[i], flags=re.I) #using regex to remove special characters 

    # Convert to lowercase
    stories_wstopwords[i] = stories_wstopwords[i].lower() 

    #Tokenization
    #stories[i]= stories[i].split() # Split into words
    stories_wstopwords[i] = nltk.word_tokenize(stories_wstopwords[i]) # Split into words taking account of punctuation 


# Checking the first story  
print(stories_wstopwords[1])
# Checking the length of the first story
print(len(stories_wstopwords[1]))

Original:

```
 I have never known my friend to be in better form, both mental and physical, than in the year '95.
```
Edited (without tokenization):

```
i have never known my friend to be in better form both mental and physical than in the year 95
```
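The transformation shown in this example can be reproduced in isolation. The sketch below applies the same regular expression and lowercasing as the cell above to the example sentence; the helper name `normalise` is made up for illustration.

```python
import re

def normalise(text):
    # Keep only word characters and spaces, then lowercase and trim
    return re.sub(r"[^\w ]", "", text).lower().strip()

original = (
    "I have never known my friend to be in better form, "
    "both mental and physical, than in the year '95."
)
print(normalise(original))

# Side effect worth knowing about: hyphens and apostrophes vanish too
print(normalise("blood-letting"))
```

Because the pattern deletes every character that is not a word character or a space, hyphenated words are fused and newlines are removed as well.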

In [4]:
# Write the cleaned tokens to story_test.txt, because print() does not display the entire content
with open("story_test.txt", "w") as file:
    file.write(" ".join(stories_wstopwords[13]))

### Version 2

This version is the same as the first one but we are removing stopwords as well.

In [None]:
from nltk.corpus import stopwords


stories_processed = {}

# Define the set of stopwords globally
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

for i in stories:
    # Split the story at the dashes and keep only the story part (not the information about copyrights)
    stories_processed[i] = stories[i].split("----------")[0]
    
    # Check if 'CHAPTER I' exists before trying to split by it
    if "CHAPTER I" in stories_processed[i]:
        # Keep everything after the first 'CHAPTER I' to drop the table of contents, title, author, etc.
        # maxsplit=1 matters: 'CHAPTER I' is also a prefix of 'CHAPTER II', 'CHAPTER III', ...
        content_after_chapter = stories_processed[i].split("CHAPTER I", 1)[1]
        # Drop the rest of the heading line and the line that follows it
        content_lines = content_after_chapter.split("\n")
        stories_processed[i] = "\n".join(content_lines[2:])
    elif "Arthur Conan Doyle" in stories_processed[i]:
        # Remove text before "Arthur Conan Doyle" if "CHAPTER I" does not exist
        parts = stories_processed[i].split("Arthur Conan Doyle")
        # Keep everything after the last occurrence of the author
        stories_processed[i] = parts[-1] if len(parts) > 1 else parts[0]  
    

    # Remove special characters
    stories_processed[i] = re.sub(r"[^\w ]", "", stories_processed[i], flags=re.I) #using regex to remove special characters 

    # Convert to lowercase
    stories_processed[i] = stories_processed[i].lower() 

    #Tokenization
    #stories[i]= stories[i].split() # Split into words
    stories_processed[i] = nltk.word_tokenize(stories_processed[i]) # Split into words taking account of punctuation 

    # Remove stopwords
    stories_processed[i] = remove_stopwords(stories_processed[i])


# Checking the first story  
print(stories_processed[1])
# Checking the length of the first story
print(len(stories_processed[1]))

In [4]:
# Write the cleaned content to a new file called story_test.txt as the print method doesn't display the entire content
with open("story_test.txt", "w") as file:
    file.write(" ".join(stories_processed[1]))

In the code above we used `word_tokenize` to tokenize the text. Here it makes little difference whether we use `split` or `word_tokenize`, because the punctuation has already been removed.
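A quick way to see why `split` is sufficient once punctuation is gone is to compare it on raw and cleaned text. This sketch uses only the standard library and a sentence from the story shown earlier; on raw text, `nltk.word_tokenize` would separate the punctuation into its own tokens, whereas `split` leaves it attached to the words.

```python
import re

raw = "Well, you shall judge for yourselves."

# On raw text, split() leaves punctuation glued to the neighbouring word
raw_tokens = raw.split()

# After the same cleaning step as above, split() yields clean word tokens
cleaned = re.sub(r"[^\w ]", "", raw).lower()
cleaned_tokens = cleaned.split()

print(raw_tokens)
print(cleaned_tokens)
```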

### Version 3

This version lets sentence segmentation and word tokenization handle punctuation instead of stripping it with a regular expression.

It also adds part-of-speech (POS) tagging.

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')


In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords



# Define the set of stopwords 
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]


# Iterate through the stories dictionary
for i in stories:
    # Step 1: Remove the copyright section
    story_content = stories[i]
    story_content = story_content.split("----------")[0]

    # Check if 'CHAPTER I' exists before trying to split by it
    # (work on story_content so that the copyright removal above is not lost)
    if "CHAPTER I" in story_content:
        # Keep everything after the first 'CHAPTER I' to drop the table of contents, title, author, etc.
        content_after_chapter = story_content.split("CHAPTER I", 1)[1]
        # Drop the rest of the heading line and the line that follows it
        content_lines = content_after_chapter.split("\n")
        story_content = "\n".join(content_lines[2:])
    elif "Arthur Conan Doyle" in story_content:
        # Remove text before "Arthur Conan Doyle" if "CHAPTER I" does not exist,
        # keeping everything after the last occurrence of the author
        story_content = story_content.split("Arthur Conan Doyle")[-1]

    

    # Step 2: Sentence segmentation
    sentences = sent_tokenize(story_content)

    # Step 3: Tokenization and POS tagging
    tokenized_and_tagged = []
    for sentence in sentences:
        # Tokenize the sentence
        tokens = word_tokenize(sentence)

        #stop word removal
        remove_stop = remove_stopwords(tokens)
        # POS tagging for the tokens
        tagged_tokens = pos_tag(remove_stop)
        # Append tagged tokens to the result
        tokenized_and_tagged.extend(tagged_tokens)

    # Step 4: Overwrite the story with the tagged tokens
    stories[i] = tokenized_and_tagged

# Checking the first story
print(stories[1])
print(len(stories[1]))


In [4]:
# Write the tagged tokens to story_test.txt, because print() does not display the entire content
with open("story_test.txt", "w") as file:
    # Each item is a (word, tag) tuple, so format it as word/TAG before joining
    file.write(" ".join(f"{word}/{tag}" for word, tag in stories[9]))

## Dealing with numbers

Problem:
- How numbers should be handled depends on the task. In some tasks the actual values are irrelevant and can be replaced with placeholders. Simply removing them, however, can change the context or topic. For example, "I have a K9 dog." becomes "I have a K dog."

Solutions:
1. Use a dummy token such as `<NUMBER>`, so that the presence of a number in the original text is preserved without disturbing the syntactic context.

---
We will evaluate how each solution influences performance.

### Solution 1

*A dummy token, such as <NUMBER> can be used, so that the fact that there was a number in the original text is preserved,  without disturbing the syntactic context.*
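The difference between deleting a number and replacing it with a dummy token can be seen on the K9 example. This is a minimal sketch using the standard `re` module; it matches whole runs of digits with `\d+`, which is an illustrative choice rather than the pattern used in the cells below.

```python
import re

text = "I have a K9 dog."

# Deleting the digits silently changes the word
removed = re.sub(r"\d+", "", text)

# A placeholder records that a number was present
replaced = re.sub(r"\d+", "<NUMBER>", text)

print(removed)
print(replaced)
```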

#### version 1

In [None]:
 
for i in stories:
    # Split the story at the dashes and keep only the story part (not the information about copyrights)
    stories[i] = stories[i].split("----------")[0]

   # Check if 'CHAPTER I' exists before trying to split by it
    if "CHAPTER I" in stories[i]:
        content_after_chapter = stories[i].split("CHAPTER I", 1)[1]  # keep everything after the first 'CHAPTER I' (drops the table of contents, title, author, etc.)
      # remove the first line immediately after "CHAPTER I"
        content_lines = content_after_chapter.split("\n")
        stories[i] = "\n".join(content_lines[2:]) 
    elif "Arthur Conan Doyle" in stories[i]:
        # Remove text before "Arthur Conan Doyle" if "CHAPTER I" does not exist
        parts = stories[i].split("Arthur Conan Doyle")
        # Keep everything after the last occurrence of the author
        stories[i] = parts[-1] if len(parts) > 1 else parts[0]  
    
    #using regex to replace numbers with <NUMBER> tag
    stories[i] = re.sub(r"\d", " <NUMBER> ",stories[i]) 

    # using regex to remove special characters
    stories[i] = re.sub(r"[^\w ]", "", stories[i], flags=re.I)    

    # Convert to lowercase
    stories[i] = stories[i].lower() 

    #Tokenization
    stories[i] = nltk.word_tokenize(stories[i]) # Split into words taking account of punctuation 

print(stories[1])





#### version 2

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords



# Define the set of stopwords globally
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]


# Iterate through the stories dictionary
for i in stories:
    # Step 1: Remove the copyright section
    stories[i] = stories[i].split("----------")[0]
   # Check if 'CHAPTER I' exists before trying to split by it
    if "CHAPTER I" in stories[i]:
        content_after_chapter = stories[i].split("CHAPTER I", 1)[1]  # keep everything after the first 'CHAPTER I' (drops the table of contents, title, author, etc.)
      # remove the first line immediately after "CHAPTER I"
        content_lines = content_after_chapter.split("\n")
        stories[i] = "\n".join(content_lines[2:]) 
    elif "Arthur Conan Doyle" in stories[i]:
        # Remove text before "Arthur Conan Doyle" if "CHAPTER I" does not exist
        parts = stories[i].split("Arthur Conan Doyle")
        # Keep everything after the last occurrence of the author
        stories[i] = parts[-1] if len(parts) > 1 else parts[0]  
        
    stories[i] = re.sub(r"\d", " <NUMBER> ",stories[i]) 

    # Step 2: Sentence segmentation
    sentences = sent_tokenize(stories[i])

    # Step 3: Tokenization and POS tagging
    tokenized_and_tagged = []
    for sentence in sentences:
        # Tokenize the sentence
        tokens = word_tokenize(sentence)

        #stop word removal
        remove_stop = remove_stopwords(tokens)
        # POS tagging for the tokens
        tagged_tokens = pos_tag(remove_stop)
        # Append tagged tokens to the result
        tokenized_and_tagged.extend(tagged_tokens)

    # Step 4: Overwrite the story with the tagged tokens
    stories[i] = tokenized_and_tagged

# Checking the first story
print(stories[1])
print(len(stories[1]))


Original:

```
 I have never known my friend to be in better form, both mental and physical, than in the year '95.
```
Edited (without tokenization):

```
i have never known my friend to be in better form both mental and physical than in the year number number
```

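The problem with this solution is that not all numbers play the same role: some are dates (for example, 25th December), others are times (for example, 11:00 AM). Note also that the pattern `\d` matches one digit at a time, so a two-digit year such as '95 yields two placeholders, as the example above shows.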
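One possible way to keep some of that context is to tag times and dates with their own placeholders before falling back to a generic number token. The sketch below is an assumption-laden illustration, not part of the pipeline above: the placeholders `<TIME>` and `<DATE>`, the function name `tag_numbers`, and the two simple patterns (times like `11:00 AM`, dates like `25th December`) are all made up here and would miss many other formats.

```python
import re

MONTHS = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)
# Times such as 11:00, 11:00 AM or 11:00 p.m.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?:\s?[AaPp]\.?[Mm]\b\.?)?")
# Dates such as 25th December or 3 June
DATE_RE = re.compile(rf"\b\d{{1,2}}(?:st|nd|rd|th)?\s+{MONTHS}\b")
# Any remaining run of digits
NUMBER_RE = re.compile(r"\d+")

def tag_numbers(text):
    """Replace times, dates and other numbers with typed placeholders."""
    text = TIME_RE.sub("<TIME>", text)    # most specific pattern first
    text = DATE_RE.sub("<DATE>", text)
    return NUMBER_RE.sub("<NUMBER>", text)  # whole number, not digit by digit

print(tag_numbers("We met at 11:00 AM on 25th December, in the year '95."))
```

If this step runs before the special-character removal, the angle brackets are stripped along with the other punctuation, so either run it afterwards or use placeholders made of word characters only.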

#### version 3

In [None]:
from nltk.corpus import stopwords



# Define the set of stopwords globally
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

for i in stories:
    # Split the story at the dashes and keep only the story part (not the information about copyrights)
    stories[i] = stories[i].split("----------")[0]
    
   # Check if 'CHAPTER I' exists before trying to split by it
    if "CHAPTER I" in stories[i]:
        content_after_chapter = stories[i].split("CHAPTER I", 1)[1]  # keep everything after the first 'CHAPTER I' (drops the table of contents, title, author, etc.)
      # remove the first line immediately after "CHAPTER I"
        content_lines = content_after_chapter.split("\n")
        stories[i] = "\n".join(content_lines[2:]) 
    elif "Arthur Conan Doyle" in stories[i]:
        # Remove text before "Arthur Conan Doyle" if "CHAPTER I" does not exist
        parts = stories[i].split("Arthur Conan Doyle")
        # Keep everything after the last occurrence of the author
        stories[i] = parts[-1] if len(parts) > 1 else parts[0]  
    
    # replace numbers with NUMBER
    stories[i] = re.sub(r"\d", " <NUMBER> ",stories[i]) 

    # Remove special characters
    stories[i] = re.sub(r"[^\w ]", "", stories[i], flags=re.I) #using regex to remove special characters 

    # Convert to lowercase
    stories[i] = stories[i].lower() 

    #Tokenization
    #stories[i]= stories[i].split() # Split into words
    stories[i] = nltk.word_tokenize(stories[i]) # Split into words taking account of punctuation 

    # Remove stopwords
    stories[i] = remove_stopwords(stories[i])


# Checking the first story  
print(stories[1])
# Checking the length of the first story
print(len(stories[1]))

In [None]:
pip install num2words


In [None]:
from num2words import num2words

# Function to convert numbers in a list of tokens to words
def convert_numbers_to_words(tokens):
    return [num2words(word) if word.isdigit() else word for word in tokens]

# Iterate through stories to replace numbers with words
for i in range(len(stories)):
    # Convert numbers to words for each story
    stories[i] = convert_numbers_to_words(stories[i])

# Example: Check the first story after conversion
print(stories[0])


In [4]:
# Write the cleaned content to a new file called story_test.txt as the print method doesn't display the entire content
with open("story_test.txt", "w") as file:
    file.write(" ".join(stories[60]))

# Task 2

Implement a Seq2Seq model using an Encoder-Decoder LSTM architecture for summarizing entire Sherlock Holmes stories into short summaries. 
- Train the model to produce summaries of around 50-100 words.  
- Evaluate the model on both using suitable score metrics based on your research.
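One commonly used family of metrics for summarization is ROUGE, which measures word overlap between a generated summary and a reference summary. The following is a simplified sketch of ROUGE-1 (unigram overlap) written from scratch with whitespace tokenization; the function name `rouge_1` and the example sentences are made up for illustration, and an established ROUGE package would normally be preferred for reported results.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1(
    "holmes solved the case",
    "holmes solved the strange case quickly",
)
print(scores)
```

Recall tells how much of the reference the summary covers, and precision tells how much of the summary is supported by the reference. Both require a reference summary for each story, which this dataset does not provide by itself.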

## Story Summarization

In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Define the stopwords globally
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    """Remove stopwords from a list of tokens."""
    return [word for word in tokens if word.lower() not in stop_words]

def process_story(story):
    """
    Process a single story by performing:
    - Remove content before 'CHAPTER I' or 'Arthur Conan Doyle'.
    - Remove the line immediately following 'CHAPTER I'.
    - Remove special characters.
    - Convert to lowercase.
    - Tokenize text into words.
    - Remove stopwords.

    Returns:
        str: The cleaned story as a single string.
    """
    # Step 1: Remove content before 'CHAPTER I' or 'Arthur Conan Doyle'
    if "CHAPTER I" in story:
        # maxsplit=1 keeps every later chapter ('CHAPTER I' is also a prefix of 'CHAPTER II')
        content_after_chapter = story.split("CHAPTER I", 1)[1]
        content_lines = content_after_chapter.split("\n")
        story = "\n".join(content_lines[1:])  # Drop the rest of the 'CHAPTER I' heading line
    elif "Arthur Conan Doyle" in story:
        story = story.split("Arthur Conan Doyle")[-1]  # Keep text after the last occurrence

    # Step 2: Remove special characters
    story = re.sub(r"[^\w ]", "", story, flags=re.I)

    # Step 3: Convert to lowercase
    story = story.lower()

    # Step 4: Tokenization
    tokens = nltk.word_tokenize(story)

    # Step 5: Remove stopwords
    tokens = remove_stopwords(tokens)

    return " ".join(tokens)  # Return the cleaned story as a string

def create_summary(stories, num_words=10, num_sentences=5):
    """
    Create summaries for each story after cleaning them based on top word frequencies.

    Parameters:
        stories (dict): Dictionary of original stories (not tokenized).
        num_words (int): Number of top words to use for summary generation.
        num_sentences (int): Number of sentences to extract for the summary.

    Returns:
        dict: Dictionary containing summaries for each story.
    """
    # Step 1: Process all stories
    cleaned_stories = {story_id: process_story(story) for story_id, story in stories.items()}

    # Step 2: Compute word frequencies for each cleaned story
    word_frequencies = {}
    for story_id, story in cleaned_stories.items():
        word_freq = {}
        for word in story.split():  # Story is now a cleaned string
            word_freq[word] = word_freq.get(word, 0) + 1
        word_frequencies[story_id] = word_freq

    # Step 3: Generate summaries
    summaries = {}
    for story_id, cleaned_story in cleaned_stories.items():
        # Get the top `num_words` frequent words
        top_words = sorted(word_frequencies[story_id].items(), key=lambda x: x[1], reverse=True)[:num_words]
        top_words = [word for word, freq in top_words]

        # Split the raw story into sentences: the cleaned text has no punctuation left,
        # so splitting it on '. ' would return the whole story as a single "sentence"
        sentences = nltk.sent_tokenize(stories[story_id])

        # Find sentences containing frequent words
        selected_sentences = []
        for sentence in sentences:
            # Lowercase and remove stopwords from each sentence before comparison
            sentence_tokens = remove_stopwords(nltk.word_tokenize(sentence.lower()))
            if any(word in sentence_tokens for word in top_words):
                # Collapse line breaks and indentation inside the sentence
                selected_sentences.append(" ".join(sentence.split()))
            if len(selected_sentences) >= num_sentences:  # Stop after reaching the desired number of sentences
                break

        # Combine selected sentences into a summary
        summaries[story_id] = ' '.join(selected_sentences)

    return summaries

# Example Usage:

# Assuming `stories` is your dictionary of raw story text
summaries = create_summary(stories, num_words=10, num_sentences=5)

# Display summaries for each story
for story_id, summary in summaries.items():
    print(f"Summary of Story {story_id}:")
    print(summary)
    print("\n")

In [None]:
import nltk
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords


# Define the set of stopwords globally
stop_words = set(stopwords.words('english'))

# Function to clean and process a story
def clean_story(story):
    # Remove content before 'CHAPTER I' or 'Arthur Conan Doyle'
    if "CHAPTER I" in story:
        story = story.split("CHAPTER I", 1)[1]  # maxsplit=1 keeps all later chapters
    elif "Arthur Conan Doyle" in story:
        story = story.split("Arthur Conan Doyle")[-1]
    
    # Remove special characters
    story = re.sub(r"[^\w\s.]", "", story)
    
    # Convert to lowercase
    story = story.lower()
    
    return story

# Function to summarize a story
def summarize_story(text):
    # Tokenize words and sentences
    words = word_tokenize(text)
    sentences = sent_tokenize(text)
    
    # Build a frequency table for the words
    freq_table = {}
    for word in words:
        word = word.lower()
        if word not in stop_words:
            freq_table[word] = freq_table.get(word, 0) + 1
    
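    # Score each sentence by summing the frequencies of the words it contains
    # (whole-token matching, so that "man" does not also match "woman")
    sentence_scores = {}
    for sentence in sentences:
        for word in set(word_tokenize(sentence.lower())):
            if word in freq_table:
                sentence_scores[sentence] = sentence_scores.get(sentence, 0) + freq_table[word]

    # Calculate the average sentence score (nothing to summarize if no sentence was scored)
    if not sentence_scores:
        return ""
    avg_score = sum(sentence_scores.values()) / len(sentence_scores)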
    
    # Generate a summary by selecting sentences with a score above a threshold
    summary = ""
    for sentence in sentences:
        if sentence in sentence_scores and sentence_scores[sentence] > 1.2 * avg_score:
            summary += sentence + " "
    
    return summary.strip()

# Clean and summarize each story
summaries = {}
for story_id, story_text in stories.items():
    # Clean the story
    cleaned_story = clean_story(story_text)
    
    # Summarize the story
    summary = summarize_story(cleaned_story)
    
    # Store the summary
    summaries[story_id] = summary

# Display summaries for each story
for story_id, summary in summaries.items():
    print(f"Summary of Story {story_id}:")
    print(summary)
    print("\n")


https://www.topcoder.com/thrive/articles/text-summarization-in-nlp

In [None]:
import nltk
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords


# Define the set of stopwords globally
stop_words = set(stopwords.words('english'))

# Function to clean and process a story
def clean_story(story):
    # Remove content before 'CHAPTER I' or 'Arthur Conan Doyle'
    if "CHAPTER I" in story:
        story = story.split("CHAPTER I", 1)[1]  # maxsplit=1 keeps all later chapters
    elif "Arthur Conan Doyle" in story:
        story = story.split("Arthur Conan Doyle")[-1]
    
    # Remove special characters
    story = re.sub(r"[^\w\s.]", "", story)
    
    # Convert to lowercase
    story = story.lower()
    
    return story

# Function to summarize a story
def summarize_story(text):
    # Tokenize words and sentences
    words = word_tokenize(text)
    sentences = sent_tokenize(text)
    
    # Build a frequency table for the words
    freq_table = {}
    for word in words:
        word = word.lower()
        if word not in stop_words:
            freq_table[word] = freq_table.get(word, 0) + 1
    
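    # Score each sentence by summing the frequencies of the words it contains
    # (whole-token matching, so that "man" does not also match "woman")
    sentence_scores = {}
    for sentence in sentences:
        for word in set(word_tokenize(sentence.lower())):
            if word in freq_table:
                sentence_scores[sentence] = sentence_scores.get(sentence, 0) + freq_table[word]

    # Calculate the average sentence score (nothing to summarize if no sentence was scored)
    if not sentence_scores:
        return ""
    avg_score = sum(sentence_scores.values()) / len(sentence_scores)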
    
    # Select sentences to form the summary (between 50-100 words)
    summary_sentences = []
    total_words = 0
    for sentence in sorted(sentence_scores, key=sentence_scores.get, reverse=True):
        word_count = len(word_tokenize(sentence))
        if total_words + word_count <= 100:
            summary_sentences.append(sentence)
            total_words += word_count
        if total_words >= 50:  # Ensure the summary is at least 50 words
            break
    
    # Combine the selected sentences into a summary
    summary = " ".join(summary_sentences)
    
    return summary

# Clean and summarize each story
summaries = {}
for story_id, story_text in stories.items():
    # Clean the story
    cleaned_story = clean_story(story_text)
    
    # Summarize the story
    summary = summarize_story(cleaned_story)
    
    # Store the summary
    summaries[story_id] = summary

# Display summaries for each story
for story_id, summary in summaries.items():
    print(f"Summary of Story {story_id}:")
    print(summary)
    print(f"Word Count: {len(word_tokenize(summary))}")
    print("\n")


In [None]:
pip install sumy

In [None]:
# Load Required Packages
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer



# Function to clean and prepare story text
def clean_story(story):
    # Remove content before 'CHAPTER I' or 'Arthur Conan Doyle'
    if "CHAPTER I" in story:
        story = story.split("CHAPTER I", 1)[1]  # maxsplit=1 keeps all later chapters
    elif "Arthur Conan Doyle" in story:
        story = story.split("Arthur Conan Doyle")[-1]
    
    # Remove special characters
    story = re.sub(r"[^\w\s.]", "", story)
    
    # Convert to lowercase
    story = story.lower()
    
    return story

# Function to summarize a story using Sumy TextRank
def summarize_story_with_sumy(story_text, num_sentences=2):
    # Clean the story text
    cleaned_text = clean_story(story_text)
    
    # Create text parser using tokenization
    parser = PlaintextParser.from_string(cleaned_text, Tokenizer("english"))
    
    # Initialize TextRank summarizer
    summarizer = TextRankSummarizer()
    
    # Generate summary with the specified number of sentences
    summary = summarizer(parser.document, num_sentences)
    
    # Combine summary sentences into a single string
    text_summary = " ".join(str(sentence) for sentence in summary)
    
    return text_summary

# Summarize each story
summaries = {}
for story_id, story_text in stories.items():
    summaries[story_id] = summarize_story_with_sumy(story_text, num_sentences=2)

# Display summaries
for story_id, summary in summaries.items():
    print(f"Summary of Story {story_id}:")
    print(summary)
    print("\n")


In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
import torch
import re

# Function to clean and prepare story text
def clean_story(story):
    # Remove content before 'CHAPTER I' or 'Arthur Conan Doyle'
    if "CHAPTER I" in story:
        story = story.split("CHAPTER I", 1)[1]  # maxsplit=1 keeps all later chapters
    elif "Arthur Conan Doyle" in story:
        story = story.split("Arthur Conan Doyle")[-1]
    
    # Remove special characters
    #story = re.sub(r"[^\w\s.]", "", story)
    story = re.sub(r"[^\w\s.]", "", story)

    # Convert to lowercase
    #story = story.lower()
    
    return story

# Load the pre-trained BART model and tokenizer once, instead of on every call
model_name = "facebook/bart-large-cnn"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BartForConditionalGeneration.from_pretrained(model_name).to(device)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Function to summarize a story using BART
# Note: max_length and min_length are counted in tokens, not words
def summarize_with_bart(story_text, max_length=100, min_length=50):
    story_text = clean_story(story_text)

    # Tokenize input text (story); anything beyond 1024 tokens is truncated,
    # so only the beginning of a long story is summarized
    inputs = tokenizer(story_text, return_tensors="pt", max_length=1024, truncation=True)

    # Move the inputs to the same device as the model (GPU if available, otherwise CPU)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}
    
    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"], 
        max_length=max_length, 
        min_length=min_length, 
        length_penalty=2.0, 
        num_beams=4, 
        early_stopping=True
    )
    
    # Decode summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    return summary


# Summarize each story
summaries = {}
for story_id, story_text in stories.items():
    # Clean and summarize each story
    summary = summarize_with_bart(story_text)
    summaries[story_id] = summary

# Display summaries
for story_id, summary in summaries.items():
    print(f"Summary of Story {story_id}:")
    print(summary)
    print("\n")


In [None]:
summaries[0]

In [10]:
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize



# Function to clean and prepare story text
def clean_story(story):
    story = story.split("----------")[0]

    # Remove content before 'CHAPTER I' or 'Arthur Conan Doyle'
    if "CHAPTER I" in story:
        story = story.split("CHAPTER I", 1)[1]  # maxsplit=1 keeps all later chapters
    elif "Arthur Conan Doyle" in story:
        story = story.split("Arthur Conan Doyle")[-1]

    # Remove special characters and normalize whitespace
    story = re.sub(r"[^\w\s.,!?]", "", story)
    story = re.sub(r"\s+", " ", story).strip()

    # Remove repetitive phrases
    sentences = story.split(".")
    unique_sentences = list(dict.fromkeys(sentences))  # Remove duplicates
    return ". ".join(unique_sentences)

# Function to rank sentences based on TF-IDF
def rank_sentences(text):
    sentences = sent_tokenize(text)  # Split text into sentences
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(sentences)
    
    # Calculate scores for each sentence (sum of the TF-IDF scores of the words)
    sentence_scores = []
    for i, sentence in enumerate(sentences):
        score = tfidf_matrix[i].sum()
        sentence_scores.append((score, sentence))
    
    # Sort sentences based on their scores in descending order
    ranked_sentences = sorted(sentence_scores, key=lambda x: x[0], reverse=True)
    return ranked_sentences

# Function to summarize text using extractive summarization
def summarize_with_tfidf(story_text, num_sentences=5):
    # Clean the story text
    story_text = clean_story(story_text)
    
    # Rank sentences based on their importance
    ranked_sentences = rank_sentences(story_text)
    
    # Select the top 'num_sentences' sentences to form the summary
    summary_sentences = [sentence for _, sentence in ranked_sentences[:num_sentences]]
    
    # Combine the sentences to form the summary
    summary = " ".join(summary_sentences)
    
    return summary

# Example usage with the stories data
bad_summaries = {}
for story_id, story_text in stories.items():
    summary = summarize_with_tfidf(story_text)
    bad_summaries[story_id] = summary

# Display summaries
for story_id, summary in bad_summaries.items():
    print(f"Summary of Story {story_id}:")
    print(summary)
    print("\n")


Summary of Story 0:
This strange, languid creature spent his waking hours in the bow window of a St.  Jamess Street club and was the receivingstation as well as the transmitter for all the gossip of the metropolis. She rose from a settee as we entered tall, queenly, a perfect figure, a lovely masklike face, with two wonderful Spanish eyes which looked murder at us both. It was quite evident that The Three Gables was under very close surveillance, for as we came round the high hedge at the end of the lane there was the negro prizefighter standing in the shadow. He was in a chatty mood that morning, however, and had just settled me into the wellworn low armchair on one side of the fire, while he had curled down with his pipe in his mouth upon the opposite chair, when our visitor arrived. A minute later we were in an Arabian Nights drawingroom, vast and wonderful, in a half gloom, picked out with an occasional pink electric light.


Summary of Story 1:
It was twilight of a lovely spring e

## Seq2Seq Model 

Open question: take `bad_summaries` and apply a Seq2Seq model using an Encoder-Decoder LSTM architecture?

**What is a Seq2Seq Model?**
- A Seq2Seq model maps an input sequence to an output sequence, even when the two differ in length.
- For example, translating the English phrase "Good Afternoon" (two words) into the Greek phrase "Καλησπέρα" (one word).
- In our case, the full story and its summary differ in length, as the summary is significantly shorter than the story.

**What is Encoder-Decoder LSTM?**
- A recurrent neural network architecture designed to address sequence-to-sequence problems, which has proven very effective.
- This architecture comprises two models:
    - one for reading the input sequence and encoding it into a fixed-length vector
    - a second for decoding the fixed-length vector and outputting the predicted sequence.
- Used together, the two models give the architecture its name: Encoder-Decoder LSTM.
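The two-model structure described above can be made concrete with a toy forward pass. The following is a minimal NumPy sketch with random, untrained weights; all sizes, names (`encode`, `decode`, `lstm_step`) and the greedy decoding loop are illustrative assumptions, and a real model would be trained with a deep learning framework and would stop decoding at an end-of-sequence token.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, HIDDEN = 20, 8, 16  # toy sizes, chosen only for illustration

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(input_size, hidden_size):
    # One stacked weight matrix for the four gates (input, forget, output, candidate)
    return {
        "W": rng.normal(0, 0.1, (4 * hidden_size, input_size + hidden_size)),
        "b": np.zeros(4 * hidden_size),
    }

def lstm_step(params, x, h, c):
    # Standard LSTM cell update for a single time step
    z = params["W"] @ np.concatenate([x, h]) + params["b"]
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

embedding = rng.normal(0, 0.1, (VOCAB, EMBED))
encoder = init_lstm(EMBED, HIDDEN)
decoder = init_lstm(EMBED, HIDDEN)
W_out = rng.normal(0, 0.1, (VOCAB, HIDDEN))

def encode(token_ids):
    # Read the whole input; the final (h, c) is the fixed-length encoding
    h, c = np.zeros(HIDDEN), np.zeros(HIDDEN)
    for t in token_ids:
        h, c = lstm_step(encoder, embedding[t], h, c)
    return h, c

def decode(state, max_len, start_id=0):
    # Start from the encoder state and emit one token per step (greedy choice)
    h, c = state
    token, output = start_id, []
    for _ in range(max_len):
        h, c = lstm_step(decoder, embedding[token], h, c)
        token = int(np.argmax(W_out @ h))
        output.append(token)
    return output

story_ids = rng.integers(0, VOCAB, size=50)   # stand-in for a tokenized story
state = encode(story_ids)
summary_ids = decode(state, max_len=5)        # stand-in for a short summary
print(state[0].shape, len(summary_ids))
```

The encoding has the same size whether the input has 50 tokens or 5,000, which is what lets the output length differ from the input length. It is also the main bottleneck when summarizing whole stories, since everything must be squeezed into one vector.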

--- 

https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346

https://www.geeksforgeeks.org/seq2seq-model-in-machine-learning/

