# The Sign of the Four (a Sherlock Holmes novel)
## by Sir Arthur Conan Doyle

In [29]:
# EN source: https://sherlock-holm.es/ascii/
# DE source: https://www.projekt-gutenberg.org/doyle/zeichen/index.html

working_folder = "../books/holmes_four"
en_txt_path = f"{working_folder}/EN_holmes_four.txt"
de_txt_path = f"{working_folder}/DE_holmes_four.txt"

## Scrape the German book from Projekt Gutenberg and save it as a .txt file

In [34]:
import requests
from bs4 import BeautifulSoup

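# Write all 12 chapters into a single UTF-8 text file.
with open(de_txt_path, "w", encoding="utf-8") as book:
    for i in range(12):
        chapter_no = str(i + 1).zfill(2)
        url = f"https://www.projekt-gutenberg.org/doyle/zeichen/chap{chapter_no}.html"
        website = requests.get(url)
        website.raise_for_status()  # stop on a failed download instead of parsing an error page
        soup = BeautifulSoup(website.content, "html.parser")

        # The second <h4> is written first as the chapter heading; the text follows as its siblings.
        target = soup.find_all("h4")[1]
        book.write(f"{target.text}\n")
        for sib in target.find_next_siblings():
            # Stop at the first <hr> after the heading.
            if sib.name == "hr":
                break
            book.write(f"{sib.text}\n")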
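
The loop above requests one page per chapter, with the chapter number zero-padded to two digits. As a minimal sketch, the URL construction can be checked without any network access. The names `BASE_URL` and `chapter_urls` are hypothetical helpers introduced here for illustration, not part of the notebook.

```python
# Base address taken from the scraping cell; BASE_URL and chapter_urls are
# hypothetical names used only for this illustration.
BASE_URL = "https://www.projekt-gutenberg.org/doyle/zeichen"


def chapter_urls(n_chapters=12):
    """Build the chapter page URLs (chap01.html, chap02.html, ...)."""
    return [f"{BASE_URL}/chap{str(i + 1).zfill(2)}.html" for i in range(n_chapters)]


urls = chapter_urls()
print(urls[0])
print(urls[-1])
```

Printing the first and last entries is a quick way to confirm the padding before any request is sent.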

## Clean the English text and run sentence tokenization

In [47]:
import pysbd
import re

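with open(en_txt_path, encoding="utf-8") as f:
    en_text = f.read()
# Remove single line breaks (hard wraps) but keep blank-line paragraph breaks.
en_text = re.sub(r"(?<![\r\n])(\r?\n|\r)(?![\r\n])", "", en_text)
# Collapse runs of spaces into one.
en_text = re.sub(r" +", " ", en_text)
seg = pysbd.Segmenter(language="en", clean=True)
en_out_text = "\n".join(seg.segment(en_text))
# Re-attach a closing quote that ended up alone on its own line.
en_out_text = re.sub(r"\n'\n", "'\n", en_out_text)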

print(f"len chars: {len(en_out_text)}")
print(f"lines: {len(en_out_text.splitlines())}")
print(en_out_text)

len chars: 230674
lines: 1798
CHAPTER I The Science of Deduction
Sherlock Holmes took his bottle from the corner of the mantelpiece and his hypodermic syringe from its neat morocco case.
With his long, white, nervous fingers he adjusted the delicate needle, and rolled back his left shirt-cuff.
For some little time his eyes rested thoughtfully upon the sinewy forearm and wrist all dotted and scarred with innumerable puncture-marks.
Finally he thrust the sharp point home, pressed down the tiny piston, and sank back into the velvet-lined arm-chair with a long sigh of satisfaction.
Three times a day for many months I had witnessed this performance, but custom had not reconciled my mind to it.
On the contrary, from day to day I had become more irritable at the sight, and my conscience swelled nightly within me at the thought that I had lacked the courage to protest.
Again and again I had registered a vow that I should deliver my soul upon the subject, but there was that in the cool, nonchal
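
The two `re.sub` calls in the cell above are easier to follow on a tiny input. The sketch below applies the same patterns to a made-up sample. The sample string and the `unwrap` helper are assumptions for illustration, and the sample is indented on the assumption that the source file indents its wrapped lines. Because a single line break is replaced with an empty string, words on either side stay separated only if the wrapped lines carry leading or trailing whitespace.

```python
import re

# Hypothetical sample imitating hard-wrapped, indented source text.
sample = (
    "     Sherlock Holmes took his bottle\n"
    "     from the corner of the mantelpiece.\n"
    "\n"
    "     With his long, white, nervous fingers\n"
    "     he adjusted the delicate needle.\n"
)


def unwrap(text):
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # A line break with no other line break directly before or after it is a hard wrap.
    text = re.sub(r"(?<![\r\n])(\r?\n|\r)(?![\r\n])", "", text)
    # The indentation left behind is collapsed to a single space.
    return re.sub(r" +", " ", text)


unwrapped = unwrap(sample)
print(unwrapped)
```

Each paragraph ends up on a single line, and the blank line between the two paragraphs survives.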

## Clean the German text and run sentence tokenization

In [54]:
import pysbd
import re

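with open(de_txt_path, encoding="utf-8") as f:
    de_text = f.read()
seg = pysbd.Segmenter(language="de", clean=True)
de_out_text = "\n".join(seg.segment(de_text))
# Re-attach a closing guillemet that was pushed to the start of the next line.
de_out_text = re.sub(r"\n«", "«", de_out_text)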

print(f"len chars: {len(de_out_text)}")
print(f"lines: {len(de_out_text.splitlines())}")
print(de_out_text)

len chars: 245838
lines: 2881
Die Wissenschaft der Deduktion
Sherlock Holmes nahm seine Flasche von der Ecke des Kaminsimses und eine Spritze aus seinem geschmackvollen marokkanischen Etui.
Mit seinen langen, weißen, nervösen Fingern befestigte er die dünne Nadel, und rollte seinen linken Hemdärmel hoch.
Für kurze Zeit ruhten seine Augen nachdenklich auf dem sehnigen Unterarm und dem Handgelenk mit all den sichtbaren und vernarbten Einstichpunkten.
Schließlich stieß er die Nadelspitze ein, drückte den winzigen Kolben und sank mit einem langen, zufriedenen Seufzer in den samtgefütterten Lehnstuhl zurück.
Dreimal täglich und über viele Monate hatte ich diesem Vorgang zugesehen, aber ich hatte mich nie daran gewöhnt.
Im Gegenteil, von Tag zu Tag war ich beim Anblick dieser Prozedur reizbarer geworden, und mein Gewissen bedrückte mich jede Nacht schwerer, obgleich mir der Mut zum Protestieren fehlte.
Wieder und wieder hatte ich mir vorgenommen, daß ich dieses Thema ansprechen sollte, aber 
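
The final substitution in the cell above repairs lines where a closing `«` was pushed to the start of the next line. A minimal sketch on made-up segments shows the effect. The `segments` list is a hypothetical example, not actual segmenter output, and it assumes the German text marks speech as `»…«`.

```python
import re

# Hypothetical segments in which a closing guillemet starts the following line.
segments = ["»Welcher ist es heute?", "« fragte ich.", "»Morphium oder Kokain?«"]

joined = "\n".join(segments)
# Removing the line break before « pulls the closing mark back onto the quoted sentence.
fixed = re.sub(r"\n«", "«", joined)
print(fixed)
```

Note that this also merges the rest of the second segment into the first line, so the repaired text has one line fewer than the segmenter produced.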

## Save files

In [57]:
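from pathlib import Path
# write_text closes the file and returns the number of characters written.
Path(f"{working_folder}/holmes_four_st.en").write_text(en_out_text, encoding="utf-8")
Path(f"{working_folder}/holmes_four_st.de").write_text(de_out_text, encoding="utf-8")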

245838