
# **Token & Sentence Segmentation**
Natural language processing (NLP) aims to enable computers to analyze and interpret human language, and text preprocessing prepares raw data for that analysis. Two core preprocessing steps are sentence segmentation, which splits text into sentences, and tokenization, which splits each sentence into smaller units such as words and punctuation marks. Further steps, such as stopword removal, can reduce noise and processing cost, depending on the task. The NLTK library for Python implements these steps for several languages. They matter in tasks like sentiment analysis, where working with sentences and tokens rather than whole documents allows finer-grained results. Choosing preprocessing methods that fit the task is therefore an important part of building NLP systems, with applications in areas such as market research, customer service, and social media analytics.

In [7]:
# Importing all the necessary libraries:
import nltk
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## **Example: Sentence Segmentation with NLTK**
NLTK automates the task of splitting text into sentences, a step that many NLP tasks depend on. This example works with the DHBB dataset, provided by the Getúlio Vargas Foundation, which contains entries on Brazilian political figures. The first cell runs the full pipeline on a short sample paragraph; the later cells load the DHBB text from `dhbb.txt`.

In [9]:
text = '''
NLTK, or the Natural Language Toolkit, is an open-source Python library designed for working with
human language data. It provides a comprehensive suite of tools for performing text analysis,
including tokenization, stemming, tagging, parsing, and machine learning for linguistic tasks.
'''

# Splitting the sample text into sentences:
sentences = sent_tokenize(text, language='english')

# Tokenizing each sentence into words and punctuation marks:
tokenized_sentences = [word_tokenize(sentence, language='english') for sentence in sentences]

# Building sets for fast membership tests:
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# Tokenizing the whole text and dropping stopwords and single punctuation characters:
tokens = word_tokenize(text, language='english')
tokens_without_stopwords = [token for token in tokens if token.lower() not in stop_words and token not in punctuation]

In [None]:
with open('dhbb.txt') as f:
  text = f.read()

print(text)

In [None]:
# Transforming text into sentences manually:
# Replacing line breaks and collapsing any run of whitespace into a single space:
text = ' '.join(text.split())

text

sentences = text.split('.')

sentences

# Restoring the final period and skipping pieces that are empty or whitespace-only:
sentences = [sentence.strip() + '.' for sentence in sentences if sentence.strip()]

sentences
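
Splitting on every period is simple, but it also breaks the text at abbreviations. The sketch below contrasts the naive split with a slightly more careful rule. The sample sentence and the `ABBREVIATIONS` set are hypothetical and only illustrative; they are not taken from the DHBB file. The helper `split_sentences` is not NLTK's algorithm, just a minimal heuristic.

```python
import re

# Hypothetical sample text, not taken from the DHBB file.
sample = "Dr. Silva was elected in 1950. He served until 1954."

# Naive approach: every period ends a sentence.
naive = [s.strip() + '.' for s in sample.split('.') if s.strip()]

# Illustrative abbreviation list, far from complete.
ABBREVIATIONS = {'Dr', 'Mr', 'Mrs', 'Sr'}

def split_sentences(text):
    """Split at a period followed by whitespace and an ASCII uppercase letter,
    unless the word before the period is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r'\.\s+(?=[A-Z])', text):
        preceding = text[start:match.start()].split()
        last_word = preceding[-1] if preceding else ''
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

careful = split_sentences(sample)
print(naive)
print(careful)
```

The naive version cuts the sample after `Dr.`, while the heuristic keeps the first sentence whole. A hand-written rule like this still misses many cases, which is the motivation for using a trained segmenter such as `sent_tokenize`.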

In [None]:
# Transforming text into sentences with NLTK:
nltk.download('punkt')

text

nltk_sentences = nltk.tokenize.sent_tokenize(text, language='english')

nltk_sentences

In [None]:
# Tokenizing sentences manually:
tokenized_sentences = []

for sentence in nltk_sentences:
  sentence = sentence.replace('.', ' . ')
  sentence = sentence.replace(',', ' , ')
  sentence = sentence.replace(':', ' : ')
  sentence = sentence.replace('(', ' ( ')
  sentence = sentence.replace(')', ' ) ')
  sentence = sentence.replace('-', ' - ')
  sentence = sentence.replace('«', ' « ')
  sentence = sentence.replace('»', ' » ')
  # split() without arguments collapses repeated spaces and never yields empty tokens:
  tokenized_sentences.append(sentence.split())

tokenized_sentences

tokenized_sentences[8]
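
The chain of `replace` calls above has to name every punctuation mark explicitly. A regular expression can express the same idea more compactly; the following is a minimal sketch, where `regex_tokenize` and the sample sentence are hypothetical names introduced for illustration. It is not equivalent to `word_tokenize`, which handles contractions and other cases differently.

```python
import re

def regex_tokenize(sentence):
    # \w+ matches a run of letters, digits, or underscores (Unicode-aware for str);
    # [^\w\s] matches any single character that is neither a word character nor whitespace.
    return re.findall(r'\w+|[^\w\s]', sentence)

example_tokens = regex_tokenize('Born in Rio (RJ), he was elected in 1950.')
print(example_tokens)
```

Each punctuation mark becomes its own token without any extra spaces to clean up, and accented letters such as those in `Getúlio` stay inside their word.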

In [None]:
# Tokenizing sentences with NLTK:
tokenized_sentences_nltk = []

for sentence in nltk_sentences:
    sentence = nltk.tokenize.word_tokenize(sentence)
    tokenized_sentences_nltk.append(sentence)

tokenized_sentences_nltk

tokenized_sentences_nltk[8]

In [None]:
# Removing stopwords manually:
! wget -O stopwords.txt https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt

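with open('stopwords.txt') as f:
    stopword_text = f.read()

print(stopword_text)

# Using a separate name so the `stopwords` object imported from nltk.corpus is not shadowed:
stopword_list = stopword_text.split('\n')

stopword_list

tokenized_sentences_nltk

sentences_without_stopwords = []

for sentence in tokenized_sentences_nltk:
  # Collecting the tokens of the current sentence that are not stopwords:
  filtered_sentence = []
  for token in sentence:
    if token.lower() not in stopword_list:
      filtered_sentence.append(token)
  # Appending once per sentence, after all of its tokens have been checked:
  sentences_without_stopwords.append(filtered_sentence)

sentences_without_stopwords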
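
Checking membership in a list scans it from the start for every token, which gets slow with a long stopword list. A possible refinement, sketched below, converts the list to a set once and wraps the filtering in a function. The name `remove_stopwords` and the example data are hypothetical and chosen only for illustration.

```python
def remove_stopwords(tokenized_sentences, stopword_list):
    """Return a new list of sentences with stopwords removed (case-insensitive)."""
    # Building the set once; empty strings (e.g. from a trailing newline) are skipped.
    stopword_set = {word.lower() for word in stopword_list if word}
    return [
        [token for token in sentence if token.lower() not in stopword_set]
        for sentence in tokenized_sentences
    ]

# Hypothetical example data.
example_sentences = [
    ['The', 'senator', 'was', 'elected', '.'],
    ['He', 'served', 'in', 'Rio', '.'],
]
example_stopwords = ['the', 'was', 'he', 'in']

filtered = remove_stopwords(example_sentences, example_stopwords)
print(filtered)
```

The function leaves its input untouched and keeps punctuation tokens, since they are not in the stopword list.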

In [None]:
# Removing stopwords with NLTK:
string.punctuation

nltk.download('stopwords')

nltk.corpus.stopwords.words('english')

# Building the lookup sets once instead of on every token:
stop_words = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)

sentences_without_stopwords_nltk = []

# Iterating over the tokenized sentences, not over the (still empty) result list:
for sentence in tokenized_sentences_nltk:
  filtered_sentence = []
  for token in sentence:
    if token.lower() not in stop_words and token not in punctuation:
      filtered_sentence.append(token)
  sentences_without_stopwords_nltk.append(filtered_sentence)

sentences_without_stopwords_nltk
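
Filtering against `string.punctuation` reliably catches only single ASCII punctuation characters. Tokenizers can also emit multi-character punctuation tokens, such as the two-backtick and two-apostrophe tokens that `word_tokenize` uses for double quotes, or `...`. The sketch below shows one possible way to catch those as well; `is_punctuation` and the example tokens are hypothetical names introduced here.

```python
import string

PUNCTUATION = set(string.punctuation)

def is_punctuation(token):
    """True when the token is non-empty and made up only of ASCII punctuation characters."""
    return bool(token) and all(char in PUNCTUATION for char in token)

# Hypothetical token list, including multi-character punctuation tokens.
example_tokens = ['elected', '``', 'senator', "''", '...', '.', 'U.S.']
kept = [token for token in example_tokens if not is_punctuation(token)]
print(kept)
```

Tokens such as `U.S.` survive because they contain letters. Non-ASCII marks such as `«` and `»` are not part of `string.punctuation` and would need to be added to the set explicitly.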