# About This Assignment

Design and implement a complete **Natural Language Processing (NLP)** pipeline for
advanced sequence-to-sequence tasks using the Sherlock Holmes dataset, including:
- text summarisation
- semantic search
- thematic analysis

The focus is on understanding the process, implementing modular steps, and critically evaluating outcomes.

**Objective** 

Write a comprehensive report detailing the development, findings, and
results of your NLP pipeline, focusing on:
- How design choices influenced performance.
- Challenges encountered at each stage.
- Insights gained from the dataset and the NLP methods used.
- Possible improvements for each component of the pipeline.


# About This Data

- This collection features all the stories and novels of Sherlock Holmes by Arthur Conan Doyle. 
- Within the Sherlock folder, you'll find multiple .txt files, each containing a unique story.

# Importing necessary libraries

In [8]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Importing the dataset

The dataset is a folder of .txt files, one per story, so we read each file into a dictionary keyed by a numeric index. This makes the stories easy to access and process later.

In [None]:
path = 'sherlock'

# Note: os.listdir returns names in arbitrary order, so the index
# assigned to a given story may differ between machines
files = os.listdir(path)
stories = {}

# Read each file in the folder and store its text under a numeric index
for idx, file in enumerate(files):
    with open(os.path.join(path, file), 'r') as data:
        stories[idx] = data.read()

# Access stories using numeric indices
print(stories[1])
print(stories[2])






                      THE ADVENTURE OF THE THREE GARRIDEBS

                               Arthur Conan Doyle



     It may have been a comedy, or it may have been a tragedy. It cost one
     man his reason, it cost me a blood-letting, and it cost yet another
     man the penalties of the law. Yet there was certainly an element of
     comedy. Well, you shall judge for yourselves.

     I remember the date very well, for it was in the same month that
     Holmes refused a knighthood for services which may perhaps some day
     be described. I only refer to the matter in passing, for in my
     position of partner and confidant I am obliged to be particularly
     careful to avoid any indiscretion. I repeat, however, that this
     enables me to fix the date, which was the latter end of June, 1902,
     shortly after the conclusion of the South African War. Holmes had
     spent several days in bed, as was his habit from time to time, but he
     emerged that morning with a long fo
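
Because the numeric indices above depend on the order in which the operating system lists the files, it can be hard to tell which story sits behind `stories[1]`. The following is a minimal sketch of one possible alternative, not the method used in this notebook: it keys each story by its file name and reads the files in sorted order. The helper name `load_stories`, the UTF-8 encoding, and the demo file names are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_stories(folder):
    """Return {file stem: text} for every .txt file in `folder`, in sorted order."""
    stories = {}
    for txt_file in sorted(Path(folder).glob("*.txt")):
        # Assumption: files are UTF-8; errors="replace" avoids a crash if one is not
        stories[txt_file.stem] = txt_file.read_text(encoding="utf-8", errors="replace")
    return stories


# Self-contained demonstration on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_story.txt").write_text("second", encoding="utf-8")
    Path(tmp, "a_story.txt").write_text("first", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    demo = load_stories(tmp)

print(list(demo))
```

With the real data, `load_stories('sherlock')` would return the stories keyed by file name, which keeps later results traceable to a specific story.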

# Task 1

Clean the Sherlock Holmes dataset to handle common text preprocessing challenges, and provide a short report detailing those challenges and how they were addressed.

## Remove Special Characters & Convert to Lowercase

In [None]:
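for i in stories:
    # Keep only word characters (letters, digits, underscore) and spaces.
    # Newlines and hyphens are removed too, so "blood-letting" becomes "bloodletting".
    stories[i] = re.sub(r"[^\w ]", "", stories[i])
    stories[i] = stories[i].lower()  # Convert to lowercase

print(stories[1])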

                      the adventure of the three garridebs                               arthur conan doyle     it may have been a comedy or it may have been a tragedy it cost one     man his reason it cost me a bloodletting and it cost yet another     man the penalties of the law yet there was certainly an element of     comedy well you shall judge for yourselves     i remember the date very well for it was in the same month that     holmes refused a knighthood for services which may perhaps some day     be described i only refer to the matter in passing for in my     position of partner and confidant i am obliged to be particularly     careful to avoid any indiscretion i repeat however that this     enables me to fix the date which was the latter end of june 1902     shortly after the conclusion of the south african war holmes had     spent several days in bed as was his habit from time to time but he     emerged that morning with a long foolscap document in his hand and a     twinkl
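
The output above still contains long runs of spaces left over from the original indentation, and hyphenated words have been merged ("bloodletting"). The sketch below shows one way these two issues could be handled. It is an illustrative variant rather than the cleaning step used in this notebook, and the function name `clean_text` is hypothetical. It replaces punctuation with a space instead of deleting it, then collapses all whitespace.

```python
import re


def clean_text(text):
    """Lowercase `text`, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a word character or whitespace with a space,
    # so hyphenated words split into two tokens instead of merging
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of spaces, tabs, and newlines into a single space
    return re.sub(r"\s+", " ", text).strip()


sample = "  It cost me a blood-letting,\n     and it cost yet another\n     man the penalties of the law."
print(clean_text(sample))
# it cost me a blood letting and it cost yet another man the penalties of the law
```

One trade-off of this variant is that apostrophes are also turned into spaces, so a contraction such as "don't" becomes "don t". Whether that is acceptable depends on the downstream task.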