
# Notebook Goals

This notebook compares three sentence tokenization techniques in Python, all available through NLTK: RegexpTokenizer (driven by a hand-written boundary pattern), sent_tokenize, and PunktSentenceTokenizer. We analyze how each one handles various text types, including standard prose, unconventional punctuation, and abbreviations.

The comparison will focus on:

* Accuracy: Which tokenizer correctly identifies sentence boundaries?

* Efficiency: How do their runtimes compare on different texts?

* Robustness: How well do they handle surprising or non-standard text patterns?

The ultimate goal is to determine which tokenizer is the most suitable for a variety of general-purpose Natural Language Processing (NLP) tasks.

In [9]:
# Libraries
import pandas as pd
import re
import nltk
from nltk.tokenize import sent_tokenize

import warnings
warnings.filterwarnings('ignore')

In [10]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [11]:
# Input df (story text)
df = pd.read_csv('/content/care_opinion.csv')

# Output df (original text + technique 1 + technique 2 + technique 3)
outputs_df = pd.DataFrame()
outputs_df['Original Text'] = df['Story Text']

In [12]:
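# RegexpTokenizer
from nltk.tokenize import RegexpTokenizer

# With gaps=True the pattern describes the separators between sentences:
# one whitespace character that follows '.', '?' or '!', unless the text just
# before it looks like an abbreviation ("e.g.", "Dr.", "J.").
tokenizer = RegexpTokenizer(
    r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=[.?!])\s', gaps=True
)
sentences1 = df['Story Text'].apply(tokenizer.tokenize)
for i, sentence in enumerate(sentences1):
    print(f"Sentence {i+1}: {sentence}")

# Quick sanity check on a short sample:
# sample_text = "Dr. Lala went to the store. What did he buy? I don't know!"
# sample_token = tokenizer.tokenize(sample_text)
# sample_token

# Save the result to outputs_df
outputs_df['RegexpTokenizer'] = sentences1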

Sentence 1: ['Very pleasant doctor who listened to all my concerns.', ' Went through treatment plan and to return in 6 weeks to see how I am coping.']
Sentence 2: ['This is by far the best practice I have been registered to since locating to Stoke on Trent nearly 20 years ago.', 'They are efficient yet thorough, friendly and considerate.', 'And most importantly I can ring up and get an appointment!', 'This was a pretty new experience in itself.', 'I can not praise this practice enough.']
Sentence 3: ['I was taken into hospital after a serious attempt on my life.', 'I have had multiple attempts but thankfully never been successful.', 'I felt helpless and as though there was no other solution than to end my life.I feel listened to and understood by staff who have been attentive, friendly, helpful and caring.', 'The cleanliness of the word and the food have also been good.I felt there could have been more explanation of procedures, and more available to do to pass the time.']
Sentence 4: 
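
To make the pattern above easier to read, here is a minimal standard-library sketch of the same idea using `re.split`. The helper name `split_sentences` is hypothetical, and the trailing `\s+` is an assumption that differs from the notebook's single `\s`: it swallows runs of whitespace, which avoids the leading space visible in the first output row. It also shows a case the pattern cannot catch, a missing space after the period.

```python
import re
from typing import List

# Same lookbehinds as the RegexpTokenizer cell; only the final `\s+` differs
# (assumption: consume all whitespace between sentences, not just one char).
BOUNDARY = re.compile(
    r"(?<!\w\.\w.)"       # not after "e.g."-style abbreviations
    r"(?<![A-Z][a-z]\.)"  # not after titles such as "Dr." or "Mr."
    r"(?<![A-Z]\.)"       # not after single initials such as "J."
    r"(?<=[.?!])"         # only directly after sentence-ending punctuation
    r"\s+"                # the whitespace consumed as the gap
)


def split_sentences(text: str) -> List[str]:
    """Split text at whitespace matched by BOUNDARY, dropping empty pieces."""
    return [piece for piece in BOUNDARY.split(text) if piece]


# "Dr." is protected by the title lookbehind, so no split happens after it.
print(split_sentences("Dr. Lala went to the store. What did he buy? I don't know!"))

# No whitespace after "helpless.", so that boundary is missed.
print(split_sentences("I felt helpless.I feel listened to.  Thanks!"))
```

Because the pattern only ever splits on whitespace, joined sentences such as "life.I feel" in the third story stay merged regardless of how the lookbehinds are tuned.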

In [13]:
# PunktSentenceTokenizer
# No training text is passed, so this instance runs with default parameters
# (no learned abbreviations), unlike the pretrained model behind sent_tokenize.
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sentences2 = df['Story Text'].apply(punkt_st.tokenize)
for i, sentence in enumerate(sentences2):
    print(f"Sentence {i+1}: {sentence}")

# Quick sanity check on a short sample:
# sample_text = "Dr. Lala went to the store. What did he buy? I don't know!"
# sample_token = punkt_st.tokenize(sample_text)
# sample_token

# Save the result to outputs_df
outputs_df['PunktSentenceTokenizer'] = sentences2

Sentence 1: ['Very pleasant doctor who listened to all my concerns.', 'Went through treatment plan and to return in 6 weeks to see how I am coping.']
Sentence 2: ['This is by far the best practice I have been registered to since locating to Stoke on Trent nearly 20 years ago.', 'They are efficient yet thorough, friendly and considerate.', 'And most importantly I can ring up and get an appointment!', 'This was a pretty new experience in itself.', 'I can not praise this practice enough.']
Sentence 3: ['I was taken into hospital after a serious attempt on my life.', 'I have had multiple attempts but thankfully never been successful.', 'I felt helpless and as though there was no other solution than to end my life.I feel listened to and understood by staff who have been attentive, friendly, helpful and caring.', 'The cleanliness of the word and the food have also been good.I felt there could have been more explanation of procedures, and more available to do to pass the time.']
Sentence 4: [
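
The `PunktSentenceTokenizer()` above is created without training text, so it starts with an empty abbreviation list. As a hedged illustration of what that means, the sketch below builds a second instance from a `PunktParameters` object with a small hand-picked abbreviation set. The abbreviations and the names `custom_punkt` and `untrained_punkt` are assumptions for illustration, not part of the notebook.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hypothetical abbreviation list; Punkt stores abbreviations lowercased and
# without the trailing period.
params = PunktParameters()
params.abbrev_types = {"dr", "mr", "mrs"}

# Passing a PunktParameters object instead of training text makes the
# tokenizer use those parameters directly.
custom_punkt = PunktSentenceTokenizer(params)

# Same construction as in the notebook: default, empty parameters.
untrained_punkt = PunktSentenceTokenizer()

sample_text = "Dr. Lala went to the store. What did he buy?"
print("with abbreviations:", custom_punkt.tokenize(sample_text))
print("untrained:         ", untrained_punkt.tokenize(sample_text))
```

Comparing the two printed lists on text with titles or initials is a quick way to see how much of Punkt's behavior comes from its parameters rather than from the algorithm itself.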

In [14]:
# sent_tokenize
# Uses NLTK's pretrained English Punkt model (downloaded above).
from nltk.tokenize import sent_tokenize

sentences3 = df['Story Text'].apply(sent_tokenize)
for i, sentence in enumerate(sentences3):
    print(f"Sentence {i+1}: {sentence}")

# Quick sanity check on a short sample:
# sample_text = "Dr. Lala went to the store. What did he buy? I don't know!"
# sample_token = sent_tokenize(sample_text)
# sample_token

# Save the result to outputs_df
outputs_df['sent_tokenize'] = sentences3

Sentence 1: ['Very pleasant doctor who listened to all my concerns.', 'Went through treatment plan and to return in 6 weeks to see how I am coping.']
Sentence 2: ['This is by far the best practice I have been registered to since locating to Stoke on Trent nearly 20 years ago.', 'They are efficient yet thorough, friendly and considerate.', 'And most importantly I can ring up and get an appointment!', 'This was a pretty new experience in itself.', 'I can not praise this practice enough.']
Sentence 3: ['I was taken into hospital after a serious attempt on my life.', 'I have had multiple attempts but thankfully never been successful.', 'I felt helpless and as though there was no other solution than to end my life.I feel listened to and understood by staff who have been attentive, friendly, helpful and caring.', 'The cleanliness of the word and the food have also been good.I felt there could have been more explanation of procedures, and more available to do to pass the time.']
Sentence 4: [

In [15]:
# outputs_df to csv file
outputs_df.to_csv('outputs.csv', index=False)
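
The efficiency question from the goals can be approached with a small timing helper. This is a minimal sketch using only the standard library. `time_tokenizer`, `naive_split`, and `sample_texts` are hypothetical names, and the stand-in tokenizer exists only so the block runs on its own.

```python
import re
import time
from typing import Callable, List, Sequence


def time_tokenizer(tokenize: Callable[[str], List[str]],
                   texts: Sequence[str],
                   repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    return best


def naive_split(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): split after '.', '?' or '!'."""
    return re.split(r"(?<=[.?!])\s+", text)


# Synthetic texts so the sketch is self-contained (assumption).
sample_texts = ["First sentence. Second one? Third!"] * 1000
print(f"naive_split: {time_tokenizer(naive_split, sample_texts):.4f} s")
```

In the notebook, the same helper could be called with `tokenizer.tokenize`, `punkt_st.tokenize`, and `sent_tokenize` on `df['Story Text']`. Taking the minimum over several passes reduces the influence of one-off delays such as model loading on the first call.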

In [16]:
# Comparing number of sentences for each method to determine mismatch
comparison_df = outputs_df.copy()

comparison_df['sent_tokenize_count'] = comparison_df['sent_tokenize'].apply(len)
comparison_df['PunktSentenceTokenizer_count'] = comparison_df['PunktSentenceTokenizer'].apply(len)
comparison_df['RegexpTokenizer_count'] = comparison_df['RegexpTokenizer'].apply(len)

# Create a boolean column that is True if there is a difference in sentence count
comparison_df['has_difference'] = (
    (comparison_df['sent_tokenize_count'] != comparison_df['PunktSentenceTokenizer_count']) |
    (comparison_df['sent_tokenize_count'] != comparison_df['RegexpTokenizer_count']) |
    (comparison_df['PunktSentenceTokenizer_count'] != comparison_df['RegexpTokenizer_count'])
)

# Filter the DataFrame to show only the rows where a difference was found
mismatched_rows = comparison_df[comparison_df['has_difference']]

# Display the rows where the sentence counts do not match
print("Rows with differences in sentence count across the three tokenizers:")
print(mismatched_rows[['sent_tokenize_count', 'PunktSentenceTokenizer_count', 'RegexpTokenizer_count']])

# the index of the rows with differences
mismatched_indices = mismatched_rows.index.tolist()
print("\nRow indices with a difference:", mismatched_indices)

# Print each mismatched row in full to see where each method went wrong
for i in mismatched_indices:
    print(f"\n--- Row {i} ---")
    row = outputs_df.loc[i]
    for col in outputs_df.columns:
        print(f"{col}: {row[col]}")

Rows with differences in sentence count across the three tokenizers:
     sent_tokenize_count  PunktSentenceTokenizer_count  RegexpTokenizer_count
4                      8                             8                      9
28                     3                             4                      3
36                     5                             5                      4
37                     9                             9                      8
40                     3                             2                      3
47                     5                             5                      6
59                    10                            10                      9
63                    20                            20                     18
71                    10                            10                      9
76                     8                             8                      7
78                     2                             2                   
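
Comparing sentence counts only flags rows where the number of sentences differs. Two tokenizers can return the same count with different boundaries. A stricter check compares the sentence lists themselves after stripping whitespace, so that the leading space in the RegexpTokenizer output is not reported as a difference. The sketch below uses hypothetical helper names and made-up tokenizer outputs purely for illustration.

```python
from typing import Dict, List, Tuple


def normalize(sentences: List[str]) -> List[str]:
    """Strip surrounding whitespace so ' Went ...' and 'Went ...' compare equal."""
    return [s.strip() for s in sentences]


def disagreeing_pairs(outputs: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return every pair of tokenizer names whose sentence lists differ."""
    names = list(outputs)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if normalize(outputs[first]) != normalize(outputs[second]):
                pairs.append((first, second))
    return pairs


# Hypothetical outputs for one text: all three have two sentences,
# but tok_b puts the boundary in a different place.
row = {
    "tok_a": ["It was 5 p.m.", "Dr. Lee helped me."],
    "tok_b": ["It was 5 p.m. Dr.", "Lee helped me."],
    "tok_c": [" It was 5 p.m.", "Dr. Lee helped me. "],
}
print(disagreeing_pairs(row))
```

Applied to each row of `outputs_df`, a check like this would surface disagreements that the count-based filter cannot see.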

# Conclusion
The comparison showed that all three methods (RegexpTokenizer, sent_tokenize, and PunktSentenceTokenizer) performed effectively on standard text, but sent_tokenize proved to be the most robust and reliable option on the rows where the tokenizers disagreed.

PunktSentenceTokenizer and RegexpTokenizer handled straightforward text well but struggled with certain unconventional patterns. This is consistent with how they are set up here: the regular expression covers only a few fixed abbreviation shapes, and PunktSentenceTokenizer is instantiated without training, so it lacks the learned parameters that sent_tokenize loads from NLTK's pretrained English Punkt model. In the mismatched cases, sent_tokenize produced the most accurate splits.

Therefore, sent_tokenize is the recommended choice for future natural language processing tasks on diverse textual data.