# About This Assignment

Design and implement a complete **Natural Language Processing (NLP)** pipeline for
advanced sequence-to-sequence tasks using the Sherlock Holmes dataset, including:
- text summarisation
- semantic search
- thematic analysis

The focus is on understanding the process, implementing modular steps, and critically evaluating outcomes.

**Objective**

Write a comprehensive report detailing the development, findings, and
results of your NLP pipeline, focusing on:
- How design choices influenced performance.
- Challenges encountered at each stage.
- Insights gained from the dataset and the NLP methods used.
- Suggested improvements for each component of the pipeline.


# About this Data

- This collection contains all the Sherlock Holmes stories and novels by Arthur Conan Doyle.
- The `sherlock` folder contains multiple `.txt` files, each holding a single story.

# Importing necessary libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk 

# Importing the dataset

The dataset is a folder in which each story is stored in its own `.txt` file, so we load every story into a dictionary keyed by a numeric index to make them easy to handle.

In [2]:
path = 'sherlock'

files = os.listdir(path)
stories = {}

# Iterate over each file in the folder
for idx, file in enumerate(files):
    with open(os.path.join(path, file), 'r') as data:
        contents = data.read()
        stories[idx] = contents  

# Access stories using numeric indices
print(stories[1])  
print(len(stories[1]))





                      THE ADVENTURE OF THE THREE GARRIDEBS

                               Arthur Conan Doyle



     It may have been a comedy, or it may have been a tragedy. It cost one
     man his reason, it cost me a blood-letting, and it cost yet another
     man the penalties of the law. Yet there was certainly an element of
     comedy. Well, you shall judge for yourselves.

     I remember the date very well, for it was in the same month that
     Holmes refused a knighthood for services which may perhaps some day
     be described. I only refer to the matter in passing, for in my
     position of partner and confidant I am obliged to be particularly
     careful to avoid any indiscretion. I repeat, however, that this
     enables me to fix the date, which was the latter end of June, 1902,
     shortly after the conclusion of the South African War. Holmes had
     spent several days in bed, as was his habit from time to time, but he
     emerged that morning with a long fo
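
The numeric keys above depend on the order in which `os.listdir` returns the files, and that order is not guaranteed. A minimal sketch of an alternative loader is shown here, assuming the files are UTF-8 encoded (an assumption, not something the dataset states). The function name `load_stories` is hypothetical; it keys each story by its file name and reads the files in sorted order.

```python
from pathlib import Path


def load_stories(folder):
    """Return {file name: text} for every .txt file in `folder`."""
    stories_by_name = {}
    # sorted() makes the order reproducible between runs and machines
    for txt_path in sorted(Path(folder).glob("*.txt")):
        # Assumption: UTF-8 text; undecodable bytes are replaced instead of raising
        stories_by_name[txt_path.name] = txt_path.read_text(
            encoding="utf-8", errors="replace"
        )
    return stories_by_name
```

Calling `load_stories('sherlock')` would then give a dictionary in which each story can be looked up by the name of its file rather than by an arbitrary index.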

# Task 1

Clean the Sherlock Holmes dataset to handle common text preprocessing challenges, and provide a short report detailing those challenges and how they were addressed.

Challenges presented by the unprocessed text:

- Punctuation
- Numbers, mostly dates and times (once the `:` in a time is removed, only a bare sequence of digits remains)
- Uppercase letters
- Copyright text at the end of each story, after a `----------` separator
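
Before cleaning, it can help to measure how often each of these challenges occurs in a story. The following is a rough sketch using only the standard library; `profile_text` is a hypothetical helper and the sample string is invented for illustration.

```python
import re


def profile_text(text, separator="----------"):
    """Count the features that preprocessing will have to deal with."""
    return {
        # Anything that is neither a word character nor whitespace
        "punctuation": len(re.findall(r"[^\w\s]", text)),
        # Runs of consecutive digits, e.g. "1902" counts once
        "digit_runs": len(re.findall(r"\d+", text)),
        "uppercase_letters": sum(ch.isupper() for ch in text),
        # True if the copyright separator is present
        "has_separator": separator in text,
    }


# Invented sample, not taken from the dataset
sample = "Holmes left at 11:00 in June, 1902.\n----------\nCopyright notice"
print(profile_text(sample))
```

Running it on each raw story (before any cleaning cell has overwritten `stories`) gives a quick picture of which challenges matter most for that text.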

## Remove Copyright Text & Special Characters, Convert to Lowercase & Tokenize

### Version 1

In [5]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nitea\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
for i in stories:
    # Keep only the text before the dashed separator (drops the copyright notice)
    stories[i] = stories[i].split("----------")[0]

    # Remove special characters: everything except word characters and whitespace
    stories[i] = re.sub(r"[^\w\s]", "", stories[i])

    # Convert to lowercase
    stories[i] = stories[i].lower()

    # Tokenization
    # stories[i] = stories[i].split()  # alternative: plain whitespace split
    stories[i] = nltk.word_tokenize(stories[i])  # split into words


# Check the story at index 1
print(stories[1])
# Check the number of tokens in that story
print(len(stories[1]))

['the', 'adventure', 'of', 'the', 'three', 'garridebs', 'arthur', 'conan', 'doyle', 'it', 'may', 'have', 'been', 'a', 'comedy', 'or', 'it', 'may', 'have', 'been', 'a', 'tragedy', 'it', 'cost', 'one', 'man', 'his', 'reason', 'it', 'cost', 'me', 'a', 'bloodletting', 'and', 'it', 'cost', 'yet', 'another', 'man', 'the', 'penalties', 'of', 'the', 'law', 'yet', 'there', 'was', 'certainly', 'an', 'element', 'of', 'comedy', 'well', 'you', 'shall', 'judge', 'for', 'yourselves', 'i', 'remember', 'the', 'date', 'very', 'well', 'for', 'it', 'was', 'in', 'the', 'same', 'month', 'that', 'holmes', 'refused', 'a', 'knighthood', 'for', 'services', 'which', 'may', 'perhaps', 'some', 'day', 'be', 'described', 'i', 'only', 'refer', 'to', 'the', 'matter', 'in', 'passing', 'for', 'in', 'my', 'position', 'of', 'partner', 'and', 'confidant', 'i', 'am', 'obliged', 'to', 'be', 'particularly', 'careful', 'to', 'avoid', 'any', 'indiscretion', 'i', 'repeat', 'however', 'that', 'this', 'enables', 'me', 'to', 'fix',

In [5]:
# Write the cleaned content to a new file called story_test.txt as the print method doesn't display the entire content
with open("story_test.txt", "w") as file:
    file.write(" ".join(stories[9]))

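In the code above we used `word_tokenize` to tokenize the text. Because the punctuation has already been removed, it makes almost no difference here whether we use `split` or `word_tokenize`. The few exceptions are special cases such as `cannot`, which NLTK splits into `can` and `not`.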

Original:

```
 I have never known my friend to be in better form, both mental and physical, than in the year '95.
```
Edited (without tokenization):

```
i have never known my friend to be in better form both mental and physical than in the year 95
```
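
The Version 1 steps can also be wrapped in a function that returns a new token list instead of overwriting `stories`, so the raw text stays available for the later versions. This is a minimal sketch: `clean_story` is a hypothetical name, and it uses `str.split` in place of `nltk.word_tokenize` to stay within the standard library.

```python
import re


def clean_story(raw_text, separator="----------"):
    """Strip the copyright tail, remove punctuation, lowercase and tokenize."""
    # Keep only the part before the dashed separator
    body = raw_text.split(separator)[0]
    # Drop everything that is neither a word character nor whitespace
    body = re.sub(r"[^\w\s]", "", body)
    # Lowercase, then split on whitespace (swap in nltk.word_tokenize if preferred)
    return body.lower().split()


original = " I have never known my friend to be in better form, both mental and physical, than in the year '95."
tokens = clean_story(original)
print(" ".join(tokens))
```

Applied to the example sentence, this reproduces the edited text shown above.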

### Version 2

In this version we let sentence segmentation and word tokenization handle the punctuation instead of stripping it with a regex.

#### Using NLTK

This version also includes part-of-speech (POS) tagging and stopword removal.

In [3]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nitea\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\nitea\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nitea\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords



# Define the set of stopwords globally
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]


# Iterate through the stories dictionary
for i in stories:
    # Step 1: Remove the copyright section
    story_content = stories[i]
    story_content = story_content.split("----------")[0]

    # Step 2: Sentence segmentation
    sentences = sent_tokenize(story_content)

    # Step 3: Tokenization, stopword removal and POS tagging
    tokenized_and_tagged = []
    for sentence in sentences:
        # Tokenize the sentence
        tokens = word_tokenize(sentence)
        # Remove stopwords (this happens before tagging, so the tagger sees incomplete sentences)
        filtered_tokens = remove_stopwords(tokens)
        # POS-tag the remaining tokens
        tagged_tokens = pos_tag(filtered_tokens)
        # Append the tagged tokens to the result
        tokenized_and_tagged.extend(tagged_tokens)

    # Step 4: Overwrite the story with the tagged tokens
    stories[i] = tokenized_and_tagged

# Check the story at index 1
print(stories[1])
print(len(stories[1]))


[('ADVENTURE', 'NNP'), ('THREE', 'NNP'), ('GARRIDEBS', 'NNP'), ('Arthur', 'NNP'), ('Conan', 'NNP'), ('Doyle', 'NNP'), ('may', 'MD'), ('comedy', 'VB'), (',', ','), ('may', 'MD'), ('tragedy', 'VB'), ('.', '.'), ('cost', 'NN'), ('one', 'CD'), ('man', 'NN'), ('reason', 'NN'), (',', ','), ('cost', 'NN'), ('blood-letting', 'NN'), (',', ','), ('cost', 'NN'), ('yet', 'RB'), ('another', 'DT'), ('man', 'NN'), ('penalties', 'VBZ'), ('law', 'NN'), ('.', '.'), ('Yet', 'RB'), ('certainly', 'RB'), ('element', 'JJ'), ('comedy', 'NN'), ('.', '.'), ('Well', 'RB'), (',', ','), ('shall', 'MD'), ('judge', 'NN'), ('.', '.'), ('remember', 'VB'), ('date', 'NN'), ('well', 'RB'), (',', ','), ('month', 'NN'), ('Holmes', 'NNP'), ('refused', 'VBD'), ('knighthood', 'NN'), ('services', 'NNS'), ('may', 'MD'), ('perhaps', 'RB'), ('day', 'NN'), ('described', 'VBD'), ('.', '.'), ('refer', 'NN'), ('matter', 'NN'), ('passing', 'NN'), (',', ','), ('position', 'NN'), ('partner', 'NN'), ('confidant', 'JJ'), ('obliged', 'VBD'
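
In the output above, `comedy` and `tragedy` are tagged `VB` and `penalties` is tagged `VBZ`. A likely cause is that the tagger receives each sentence with its stopwords already removed. One option is to tag the full sentence first and filter afterwards. The sketch below shows only the filtering step; `drop_stopwords_after_tagging` is a hypothetical helper, and the tagged sentence is written by hand to stand in for the result of `pos_tag(word_tokenize(sentence))`.

```python
def drop_stopwords_after_tagging(tagged_tokens, stop_words):
    """Filter (word, tag) pairs after the tagger has seen the full sentence."""
    return [
        (word, tag)
        for word, tag in tagged_tokens
        if word.lower() not in stop_words
    ]


# Hand-written tags for illustration only, not real tagger output
tagged_sentence = [
    ("It", "PRP"), ("may", "MD"), ("have", "VB"), ("been", "VBN"),
    ("a", "DT"), ("comedy", "NN"), (".", "."),
]
# Small illustrative subset of a stopword list
demo_stop_words = {"it", "have", "been", "a"}

print(drop_stopwords_after_tagging(tagged_sentence, demo_stop_words))
```

In the notebook, the same function could be called with the real `stop_words` set and the output of `pos_tag` for each sentence.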

In [4]:
# Write the tagged content to story_test.txt, as the print method doesn't display the entire content
with open("story_test.txt", "w") as file:
    # Each item is a (word, tag) tuple, so format it as word/TAG before joining
    file.write(" ".join(f"{word}/{tag}" for word, tag in stories[9]))

## Dealing with numbers

Problem:
- How numbers should be handled depends on the task. In some tasks the actual values are irrelevant and can be replaced with placeholders. Removing them outright, however, can change the context or topic. For example, "I have a K9 dog." becomes "I have a K dog."

Solutions:
1. Use a dummy token such as `<NUMBER>`, so that the fact that a number appeared in the original text is preserved without disturbing the syntactic context.

---
We will evaluate how each solution influences performance.

### Solution 1

*A dummy token such as `<NUMBER>` can be used, so that the fact that there was a number in the original text is preserved without disturbing the syntactic context.*

In [None]:
 
for i in stories:
    # Keep only the text before the dashed separator (drops the copyright notice)
    stories[i] = stories[i].split("----------")[0]

    # Replace digits with the <NUMBER> tag; \d matches one digit at a time, so "95" yields two tags
    stories[i] = re.sub(r"\d", " <NUMBER> ", stories[i])

    # Remove special characters; this also strips the angle brackets, leaving the bare word NUMBER
    stories[i] = re.sub(r"[^\w\s]", "", stories[i])

    # Convert to lowercase
    stories[i] = stories[i].lower()

    # Tokenization
    stories[i] = nltk.word_tokenize(stories[i])  # split into words

print(stories[1])


In [9]:
# Define a function to split text into chunks of words
def split_into_lines(words, max_words_per_line=10):
    lines = []
    for i in range(0, len(words), max_words_per_line):
        lines.append(" ".join(words[i:i + max_words_per_line]))
    return lines

# Process and write the content to a file
with open("story_test.txt", "w") as file:
    # Replace `6` with the index of the story you want to write
    lines = split_into_lines(stories[6], max_words_per_line=10)
    for line in lines:
        file.write(line + "\n")


Original:

```
 I have never known my friend to be in better form, both mental and physical, than in the year '95.
```
Edited (without tokenization):

```
i have never known my friend to be in better form both mental and physical than in the year number number
```

The problem with this solution is that not all numbers share the same context: some are dates (for example, 25th December), others are times (for example, 11:00 AM), and so on.
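
One possible refinement is to use a different placeholder for each kind of number. The sketch below is an illustration under stated assumptions rather than a complete solution: the placeholder names `<TIME>` and `<ORDINAL>` and the function `replace_numbers` are hypothetical, and the patterns only cover simple cases such as `11:00` and `25th`. It matches whole runs of digits, so `'95` becomes a single `<NUMBER>`. It also protects the placeholders while special characters are stripped, and leaves tokens such as `K9` untouched.

```python
import re

# Most specific pattern first, so "11:00" is not consumed as two plain numbers
PLACEHOLDERS = [
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "<TIME>"),
    (re.compile(r"\b\d+(?:st|nd|rd|th)\b"), "<ORDINAL>"),
    (re.compile(r"\b\d+\b"), "<NUMBER>"),
]
# Matches a placeholder (captured in group 1) or any single special character
PROTECT_OR_STRIP = re.compile(r"(<(?:TIME|ORDINAL|NUMBER)>)|[^\w\s]")


def replace_numbers(text):
    """Lowercase, insert typed placeholders, strip punctuation and tokenize."""
    # Lowercase first so the uppercase placeholders are not altered afterwards
    text = text.lower()
    for pattern, token in PLACEHOLDERS:
        text = pattern.sub(f" {token} ", text)
    # Keep placeholders intact, delete every other special character
    text = PROTECT_OR_STRIP.sub(lambda m: m.group(1) or "", text)
    return text.split()


print(replace_numbers("than in the year '95."))
print(replace_numbers("At 11:00 AM on the 25th December"))
print(replace_numbers("I have a K9 dog."))
```

Whether typed placeholders actually help depends on the downstream task, so this variant would need to be evaluated alongside the plain `<NUMBER>` approach.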