# sentence-boundary-detection
Several libraries and tools provide sentence boundary detection for natural language processing (NLP). Popular options include:

spaCy:

Website: https://spacy.io/
spaCy is a popular NLP library that includes a pretrained model for sentence boundary detection. You can use it to tokenize and segment text into sentences.
NLTK (Natural Language Toolkit):

Website: https://www.nltk.org/
NLTK is a comprehensive NLP library for Python. It provides tools for tokenization and sentence splitting, making it easy to perform sentence boundary detection.
CoreNLP:

Website: https://stanfordnlp.github.io/CoreNLP/
Stanford CoreNLP is a suite of natural language processing tools developed by Stanford University. It includes a sentence splitter module that can identify sentence boundaries.
OpenNLP:

Website: https://opennlp.apache.org/
Apache OpenNLP is an open-source NLP library that provides various NLP tools, including sentence detection models for multiple languages.
Gensim:

Website: https://radimrehurek.com/gensim/
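Gensim is a library for topic modeling and document similarity analysis. Older releases shipped a simple sentence splitter in the summarization module, which was removed in Gensim 4.0, so it is not a primary choice for sentence boundary detection.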
TextBlob:

Website: https://textblob.readthedocs.io/
TextBlob is a simple NLP library for Python. It includes a sentence tokenizer to segment text into sentences.
Stanza (formerly known as StanfordNLP):

Website: https://stanfordnlp.github.io/stanza/
Stanza is an NLP library that provides pretrained models for various NLP tasks, including sentence splitting.
Apache Tika:

Website: https://tika.apache.org/
Apache Tika is a content analysis toolkit for extracting text and metadata from various document formats. It is useful for the extraction step that precedes sentence splitting rather than for sentence boundary detection itself.
CLTK (Classical Language Toolkit):

Website: https://cltk.org/
CLTK is focused on classical languages and includes tools for sentence segmentation in ancient texts.
SentencePiece:

Website: https://github.com/google/sentencepiece
SentencePiece is a library developed by Google for subword tokenization (BPE and unigram models). Despite its name, it does not split text into sentences; it is listed here because it is often confused with sentence segmentation tools.
These libraries offer a range of options for sentence boundary detection, with some of them providing pretrained models for multiple languages. You can choose the one that best fits your specific NLP project's needs and programming language preferences.

spaCy's rule-based `sentencizer` component splits on sentence-final punctuation and is a fast baseline for many languages, including German. If you are specifically looking for stronger sentence boundary detection for German, consider the following options:

spaCy's Pretrained Models for German:

spaCy provides pretrained pipelines for various languages, including German. Popular German pipelines are de_core_news_sm, de_core_news_md, and de_core_news_lg. By default they derive sentence boundaries from the dependency parser.
BERT-Based Models:

BERT (Bidirectional Encoder Representations from Transformers) models are powerful for sentence boundary detection and many other NLP tasks. You can use multilingual BERT models, like mBERT (multilingual BERT), which can handle multiple languages, including German.
GPT-Based Models:

GPT (Generative Pretrained Transformer) models can in principle be fine-tuned or prompted for specific tasks, including sentence boundary detection, for example a GPT-style model trained or fine-tuned on German data.
Customized Rule-Based Models:

Depending on your specific needs, you can create custom rule-based models for German sentence boundary detection. These rules can be based on language-specific punctuation and grammatical patterns.
Language-Specific NLP Libraries:

You can explore other NLP libraries or tools that specialize in German language processing. Libraries like Stanza (formerly known as StanfordNLP) provide models and tools for German NLP tasks, including sentence boundary detection.
The choice of the best model depends on your specific use case and the trade-off between model accuracy and computational resources. If you are already using spaCy for other NLP tasks in German, it might be convenient to stick with spaCy's built-in models. However, if you require the highest accuracy or need to fine-tune a model for a specific domain, you might consider more advanced models such as BERT- or GPT-based models.
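The custom rule-based option mentioned above can be sketched with the standard library alone. This is a minimal illustration, not a production splitter: the abbreviation list `GERMAN_ABBREVIATIONS` and the function name `split_sentences` are hypothetical, and the rules (skip known abbreviations, skip ordinals such as "3.", require an uppercase letter after the boundary) are assumptions that would need tuning for a real corpus.

```python
import re

# Hypothetical, deliberately short abbreviation list; extend it for your corpus.
GERMAN_ABBREVIATIONS = {"z.b.", "bzw.", "ca.", "vgl.", "nr.", "abs.", "d.h.", "u.a.", "usw."}


def split_sentences(text):
    """Split text into sentences with a few hand-written German rules."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        # Only tokens ending in sentence-final punctuation can close a sentence.
        if not token.endswith((".", "!", "?")):
            continue
        # Known abbreviations do not end a sentence.
        if token.lower() in GERMAN_ABBREVIATIONS:
            continue
        # Ordinals or section numbers such as "3." do not end a sentence.
        if re.fullmatch(r"\d+\.", token):
            continue
        # Require an uppercase start for the next token (or end of text).
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if next_token and not next_token[0].isupper():
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


sample = (
    "Beton wird z.B. nach Abschnitt 5 geprüft. "
    "Die Prüfung erfolgt ca. 28 Tage nach dem 3. Mai. Gilt das auch hier?"
)
for sentence in split_sentences(sample):
    print(sentence)
```

Because German capitalizes all nouns, the uppercase check is a weak signal, which is one reason the trained models listed above tend to do better on real documents.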

In [None]:
import warnings
from sklearn.exceptions import InconsistentVersionWarning

warnings.simplefilter("ignore", InconsistentVersionWarning)


# Importing the PDF

In [None]:
import fitz
fname = r'C:\A\00Master\Demo\data acc\DIN EN 206.pdf'
doc = fitz.open(fname)
rawtext = ''
for page in doc:
    rawtext += page.get_text()

______________________________________________________________________________________________________________________________________________________________________________________

# Clearing text from noise before tokenization

In [None]:
import re


# Sentences to delete
sentences_to_delete = [
    "Technische Universität Darmstadt",
    "Printed copies are uncontrolled",
    "— Leerseite —",

]

# Create a regular expression pattern to match the sentences
pattern = "|".join(map(re.escape, sentences_to_delete))

# Use re.sub to remove the matched sentences
new_text = re.sub(pattern, "", rawtext)

# print(new_text)


In [None]:
print("Length of the raw text: " + str(len(rawtext)) + "\n" +
      "Length after removing the boilerplate phrases: " + str(len(new_text)) + "\n" +
      "Characters removed: " + str(len(rawtext) - len(new_text)))


In [None]:
new_text2 = re.sub(r'\s+', ' ', new_text)

In [None]:
print("Length before collapsing whitespace: " + str(len(new_text)) + "\n" +
      "Length after collapsing whitespace: " + str(len(new_text2)) + "\n" +
      "Characters removed: " + str(len(new_text) - len(new_text2)))

In [None]:
def replace_unknown_characters(text, replacement=" "):
    return text.replace('�', replacement)

# Example usage

new_text3 = replace_unknown_characters(new_text2)
print(new_text3)

In [None]:
print("Length before replacing unknown characters: " + str(len(new_text2)) + "\n" +
      "Length after replacing unknown characters: " + str(len(new_text3)) + "\n" +
      "Characters removed: " + str(len(new_text2) - len(new_text3)))

In [None]:
# Define a regular expression pattern to match the date and time format
pattern = r'\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}'

# Use re.sub to replace all occurrences of the pattern with an empty string
new_text4 = re.sub(pattern, '', new_text3)

# Print the cleaned text
print(new_text4)
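The effect of the timestamp pattern can be checked on a small string before running it on the whole document. The sketch below uses a made-up footer line (the sample text and the helper name `strip_timestamps` are assumptions, not part of the notebook). It also collapses whitespace again after the removal, since deleting a timestamp from the middle of a line leaves two adjacent spaces.

```python
import re

# Same date and time format as in the cell above: DD/MM/YYYY HH:MM:SS
TIMESTAMP = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}")


def strip_timestamps(text):
    """Remove timestamps, then collapse the whitespace they leave behind."""
    without_stamps = TIMESTAMP.sub("", text)
    return re.sub(r"\s+", " ", without_stamps).strip()


# Hypothetical sample line, not taken from the PDF
sample = "Beton nach DIN EN 206 01/02/2023 10:15:30 Festlegung"
print(strip_timestamps(sample))
```

A date without a time component does not match the pattern and is left untouched.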

In [None]:
print("Length before removing timestamps: " + str(len(new_text3)) + "\n" +
      "Length after removing timestamps: " + str(len(new_text4)) + "\n" +
      "Characters removed: " + str(len(new_text3) - len(new_text4)))

In [None]:
print("Length of the raw text: " + str(len(rawtext)) + "\n" +
      "Length after all cleaning steps: " + str(len(new_text4)) + "\n" +
      "Total characters removed: " + str(len(rawtext) - len(new_text4)))

In [None]:
text = new_text4
print("The length of the text: " + str(len(text)))
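The cleaning cells above repeat the same pattern: apply one step, then print the length before and after. One possible way to condense this is sketched below. The helper `run_cleaning_steps` and the sample string are hypothetical; the three steps mirror the boilerplate, whitespace, and unknown-character steps from the notebook in simplified form.

```python
import re


def run_cleaning_steps(text, steps):
    """Apply each (label, function) step in order and record the length change."""
    report = []
    for label, step in steps:
        cleaned = step(text)
        report.append((label, len(text), len(cleaned), len(text) - len(cleaned)))
        text = cleaned
    return text, report


steps = [
    ("boilerplate phrases",
     lambda t: re.sub(re.escape("Printed copies are uncontrolled"), "", t)),
    ("whitespace", lambda t: re.sub(r"\s+", " ", t)),
    ("unknown characters", lambda t: t.replace("\ufffd", " ")),
]

# Hypothetical sample, not taken from the PDF
sample = "Beton\n\nPrinted copies are uncontrolled\nFestigkeit \ufffd Klasse"
cleaned, report = run_cleaning_steps(sample, steps)

for label, before, after, removed in report:
    print(f"{label}: {before} -> {after} characters ({removed} removed)")
```

Replacing a character with a space keeps the length unchanged, so a difference of zero for that step does not mean that nothing was replaced.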


# <center><b>Sentence Boundary Detection </b></center>

# SpaCy

-   de_dep_news_trf: A large Transformer-based model suitable for various NLP tasks, but resource-intensive. Processing one PDF (DIN EN 206) took around 1 min 35.6 s.
-   de_core_news_lg: A large, versatile model for general-purpose NLP with good performance.
-   de_core_news_md: A medium-sized model, balanced for various NLP tasks with moderate resource requirements.
-   de_core_news_sm: A small, lightweight model designed for basic NLP tasks with minimal resource usage.

<!-- ## Sentence tokenization without removing the punctuation -->

In [None]:
text

In [None]:
import string

def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    txt = "".join([x for x in text if x not in string.punctuation])
    return txt
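As an alternative sketch, the same removal can be done with `str.translate`, which avoids the per-character membership test. The name `remove_punctuation_fast` is hypothetical. Note that `string.punctuation` only covers ASCII characters, so typographic German quotation marks such as „ and “ are kept by both versions.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_fast(text):
    """Return text with ASCII punctuation removed via a translation table."""
    return text.translate(_PUNCT_TABLE)


print(remove_punctuation_fast("Hallo, Welt! „Beton“ ist fest."))
```

Removing punctuation before sentence splitting also removes the boundary cues that the splitters rely on, so this step belongs after tokenization.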

In [22]:
import spacy

# Load the spaCy language model
nlp = spacy.load("de_dep_news_trf")

def tokenize(text):
    """Split text into sentences using the loaded spaCy pipeline."""
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    return sentences


# Split the cleaned text into sentences (punctuation is kept)
text_SpaCy = tokenize(text)

# Access individual sentences by index
print(text_SpaCy[0])


In [None]:
print(type(text_SpaCy))

In [None]:
import pandas as pd
df = pd.DataFrame({'Index': range(1, len(text_SpaCy) + 1), 'Sentence': text_SpaCy})

In [None]:
df.head()

In [None]:
df.tail()

## StopWords 

In [None]:
import spacy
import pandas as pd 

# Load the spaCy language model for German
nlp = spacy.load("de_core_news_sm")

# Get the list of stop words in German
stop_words = nlp.Defaults.stop_words

# Print the list of stop words
print("List of stop words in German:")
print(stop_words)
# stop_words is a set; sort it into a list so pandas can build the column
df_stopword = pd.DataFrame(sorted(stop_words), columns=['stop_words_SpaCy'])


# NLTK

In [None]:
# SnowballStemmer ships with NLTK, so installing nltk is sufficient
!pip install nltk

In [None]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

stemmer = SnowballStemmer("german")
stop_words = set(stopwords.words("german"))



def clean_text(text, for_embedding=False):
    """
        - remove any HTML tags (<br /> is often found)
        - keep only ASCII and European letters and whitespace, no digits
        - remove single-letter words
        - convert all whitespace (tabs etc.) to a single space
        if not for embedding (but e.g. for tf-idf):
        - convert to lowercase
        - remove stop words and punctuation, then stem
    """
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    RE_TAGS = re.compile(r"<[^>]+>")
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE)
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE)
    if for_embedding:
        # Keep punctuation
        RE_ASCII = re.compile(r"[^A-Za-zÀ-ž,.!? ]", re.IGNORECASE)
        RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž,.!?]\b", re.IGNORECASE)

    text = re.sub(RE_TAGS, " ", text)
    text = re.sub(RE_ASCII, " ", text)
    text = re.sub(RE_SINGLECHAR, " ", text)
    text = re.sub(RE_WSPACE, " ", text)

    word_tokens = word_tokenize(text, language="german")
    words_tokens_lower = [word.lower() for word in word_tokens]

    if for_embedding:
        # no stemming, lowering and punctuation / stop words removal
        words_filtered = word_tokens
    else:
        words_filtered = [
            stemmer.stem(word) for word in words_tokens_lower if word not in stop_words
        ]

    text_clean = " ".join(words_filtered)
    return text_clean

In [None]:
new_text = clean_text(text, False)

In [None]:
# Print both lengths; a notebook cell only displays the last bare expression
print(len(new_text), len(text))

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt tokenizer for German if you haven't already
nltk.download("punkt")



# Tokenize the text into sentences
sentences_NLTK = sent_tokenize(text, language="german")

# Print the detected sentences
for sentence_NLTK in sentences_NLTK:
    print(sentence_NLTK)


In [None]:
len(sentences_NLTK)

# TextBlob

In [None]:
# !pip install textblob


In [None]:
from textblob import TextBlob


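# Create a TextBlob object
# Note: TextBlob's default sentence tokenizer relies on NLTK's English Punkt model,
# so German abbreviations may be split incorrectly.
blob = TextBlob(text)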

# Get the list of sentences
sentences_Blob = blob.sentences

# Print the detected sentences
for sentence in sentences_Blob:
    print(sentence)


In [None]:
len(sentences_Blob)

# wtpsplit
https://github.com/bminixhofer/wtpsplit

In [None]:
!pip install wtpsplit




In [None]:
from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
sentence_wtp = wtp.split(text, lang_code="de")

In [None]:
len(sentence_wtp)
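Comparing only the sentence counts hides how much the splitters actually agree. The sketch below is one hedged way to summarize the results side by side. The helper names `summarize_splits` and `shared_sentences` are hypothetical, and the toy input stands in for the notebook variables `text_SpaCy`, `sentences_NLTK`, `sentences_Blob`, and `sentence_wtp`. Items are converted with `str()` because TextBlob returns sentence objects rather than plain strings.

```python
from statistics import mean


def summarize_splits(results):
    """Return {name: (sentence_count, mean_sentence_length_in_characters)}."""
    summary = {}
    for name, sentences in results.items():
        texts = [str(s) for s in sentences]
        average = mean(len(t) for t in texts) if texts else 0.0
        summary[name] = (len(texts), average)
    return summary


def shared_sentences(first, second):
    """Sentences that two splitters produced identically, ignoring outer whitespace."""
    return {str(s).strip() for s in first} & {str(s).strip() for s in second}


# Toy results standing in for the real splitter outputs
toy_results = {
    "splitter_a": ["Eins.", "Zwei drei."],
    "splitter_b": ["Eins. Zwei drei."],
}
for name, (count, average) in summarize_splits(toy_results).items():
    print(f"{name}: {count} sentences, mean length {average:.1f} characters")
```

In the notebook this could be called as `summarize_splits({"spaCy": text_SpaCy, "NLTK": sentences_NLTK, "TextBlob": sentences_Blob, "wtpsplit": sentence_wtp})`, provided all four cells have finished running.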