# Preprocessor for Text 2
Preprocessing reduces the text to its main content, lowercases all letters and removes special characters.

### Setting Up

In [1]:
import regex as re                  # Used for matching words in the text
import unidecode                    # To transliterate non-ASCII characters (e.g. accented letters) to ASCII

### Reading the Book

In [2]:
book_name = "The Return of Sherlock Holmes, by Sir Arthur Conan Doyle.txt"
with open(book_name, encoding="utf-8") as book:
    lines = book.readlines()

print(lines[:10])

['\ufeff\n', "Project Gutenberg's The Return of Sherlock Holmes, by Arthur Conan Doyle\n", '\n', 'This eBook is for the use of anyone anywhere at no cost and with\n', 'almost no restrictions whatsoever.  You may copy it, give it away or\n', 're-use it under the terms of the Project Gutenberg License included\n', 'with this eBook or online at www.gutenberg.org\n', '\n', 'Title: The Return of Sherlock Holmes\n', '\n']
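The first element of the output above is `'\ufeff\n'`, a byte order mark (BOM) left at the start of the file. A minimal sketch of one way to drop it while reading, assuming the file is UTF-8 with a BOM: open it with the `utf-8-sig` codec. The temporary file and the names `demo_path`, `plain_lines` and `stripped_lines` are hypothetical and only illustrate the difference.

```python
import os
import tempfile

# Hypothetical demo file standing in for the Project Gutenberg text.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as demo:
        demo.write("\ufeff\nTitle line\n")      # File that starts with a BOM

    # Plain "utf-8" keeps the BOM as a character in the first line
    with open(demo_path, encoding="utf-8") as demo:
        plain_lines = demo.readlines()

    # "utf-8-sig" strips a leading BOM if one is present
    with open(demo_path, encoding="utf-8-sig") as demo:
        stripped_lines = demo.readlines()

print(repr(plain_lines[0]))
print(repr(stripped_lines[0]))
```

In this notebook the BOM sits before the main content and is discarded by the slicing step anyway, so this only matters if the header lines are used.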


### Preprocessing

In [3]:
# Extracting the main content from the text

begin_index = lines.index("THE ADVENTURE OF THE EMPTY HOUSE\n")
end_index = len(lines) - 1 - lines[::-1].index("      THE END\n")
print("The main content is from line numbers {} to {}".format(begin_index, end_index))

lines = lines[begin_index:end_index]        # Reducing lines to main content

The main content is from line numbers 55 to 13427
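The slicing above uses the first occurrence of the start marker and the last occurrence of the end marker. A small sketch of the same index arithmetic on toy data may make it easier to check; the helper name `slice_main_content` and the marker strings are hypothetical, not part of the notebook.

```python
def slice_main_content(lines, start_marker, end_marker):
    """Return (begin, end, content) using the first start marker and the last end marker."""
    begin = lines.index(start_marker)
    # Searching the reversed list finds the last occurrence;
    # the subtraction converts that position back to a forward index.
    end = len(lines) - 1 - lines[::-1].index(end_marker)
    return begin, end, lines[begin:end]     # The end marker line itself is excluded


toy_lines = ["header\n", "START\n", "body 1\n", "END\n", "body 2\n", "END\n", "license\n"]
begin, end, content = slice_main_content(toy_lines, "START\n", "END\n")
print(begin, end)
print(content)
```

Note that `list.index` raises `ValueError` when a marker is missing, so a different edition of the text with different headings would fail loudly rather than silently slice the wrong range.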


In [4]:
# Removing chapter headings and empty lines

chapter_pattern = r"THE ADVENTURE OF [A-Z]+"

temp = []
for line in lines:
    is_skipped = (line == '\n') or bool(re.match(chapter_pattern, line))
    if not is_skipped:              # If the line is neither a chapter name nor an empty line
        temp.append(line)           # include it in the final list

lines = temp
print(lines[:10])


['      It was in the spring of the year 1894 that all London was\n', '      interested, and the fashionable world dismayed, by the murder of\n', '      the Honourable Ronald Adair under most unusual and inexplicable\n', '      circumstances. The public has already learned those particulars\n', '      of the crime which came out in the police investigation, but a\n', '      good deal was suppressed upon that occasion, since the case for\n', '      the prosecution was so overwhelmingly strong that it was not\n', '      necessary to bring forward all the facts. Only now, at the end of\n', '      nearly ten years, am I allowed to supply those missing links\n', '      which make up the whole of that remarkable chain. The crime was\n']
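The filter above relies on `re.match`, which only matches at the start of a line. The sketch below shows that behaviour on toy lines, using the standard-library `re` module as a stand-in for `regex` (an assumption; for this simple pattern the two should behave the same). The helper name `drop_headings_and_blanks` and the toy lines are hypothetical.

```python
import re

chapter_pattern = r"THE ADVENTURE OF [A-Z]+"


def drop_headings_and_blanks(lines):
    """Keep lines that are neither empty nor a heading starting at column 0."""
    return [
        line for line in lines
        if line != "\n" and not re.match(chapter_pattern, line)
    ]


toy_lines = [
    "THE ADVENTURE OF THE EMPTY HOUSE\n",   # Heading at column 0: removed
    "\n",                                   # Empty line: removed
    "      It was in the spring\n",         # Body text: kept
    "      THE ADVENTURE OF X\n",           # Indented heading-like line: kept
]
kept = drop_headings_and_blanks(toy_lines)
print(kept)
```

Because the match is anchored at the start of the line, a heading that is indented would pass through the filter; stripping leading whitespace before matching would be one way to catch those, if the text contains any.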


In [5]:
# Combining all the lines into one string

joined_book = ''.join(lines)                            # Combining all the lines into a single string
joined_book = unidecode.unidecode(joined_book)          # Transliterating non-ASCII characters to ASCII
joined_book = joined_book.lower()                       # Turning all the characters to lower case
joined_book = re.sub('_', '', joined_book)              # Removing all the '_'
joined_book = re.sub(r'\s+', '_', joined_book)          # Replacing runs of whitespace with '_'
joined_book = re.sub(r'\W+', '_', joined_book)          # Replacing runs of non-word characters with '_'
joined_book = re.sub('_+', ' ', joined_book)            # Replacing runs of '_' with a single ' '
joined_book = joined_book.strip()

print(joined_book[:1000])

with open("T2.txt", "w", encoding="utf-8") as T2:
    T2.write(joined_book)


it was in the spring of the year 1894 that all london was interested and the fashionable world dismayed by the murder of the honourable ronald adair under most unusual and inexplicable circumstances the public has already learned those particulars of the crime which came out in the police investigation but a good deal was suppressed upon that occasion since the case for the prosecution was so overwhelmingly strong that it was not necessary to bring forward all the facts only now at the end of nearly ten years am i allowed to supply those missing links which make up the whole of that remarkable chain the crime was of interest in itself but that interest was as nothing to me compared to the inconceivable sequel which afforded me the greatest shock and surprise of any event in my adventurous life even now after this long interval i find myself thrilling as i think of it and feeling once more that sudden flood of joy amazement and incredulity which utterly submerged my mind let me say to t
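The normalization steps above can be sketched as a single reusable function. This is a minimal sketch under stated assumptions, not the notebook's own code: it uses only the standard library, with `unicodedata` as a rough stand-in for `unidecode` (it drops characters it cannot decompose to ASCII instead of transliterating them), and it collapses the three underscore substitutions into one `\W+` substitution, which should give the same result because whitespace is itself a non-word character. The name `normalize_text` and the sample string are hypothetical.

```python
import re
import unicodedata


def normalize_text(text):
    """Lowercase the text and reduce it to ASCII words separated by single spaces."""
    # Stand-in for unidecode: split accented letters into base letter + accent,
    # then drop everything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    text = text.replace("_", "")            # Drop emphasis markers such as _word_
    text = re.sub(r"\W+", " ", text)        # Punctuation and whitespace runs -> one space
    return text.strip()


sample = "  It was in the _spring_ of 1894,\n  said Holmes; a café.\n"
normalized = normalize_text(sample)
print(normalized)
```

Like the original steps, this splits contractions at the apostrophe (for example `don't` becomes `don t`), which may or may not be desirable for later word counts.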