<a href="https://colab.research.google.com/github/gustavolio/AI-studies/blob/main/text_mining/Text_Pre_Processing_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

nltk.download('stopwords')

def preprocess_text(text):
    """
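    Preprocess English text for text mining.

    Steps, applied in order:
    1. Lowercase the text.
    2. Remove English stop words (NLTK list).
    3. Remove punctuation.
    4. Stem each word with the Porter stemmer.
    5. Remove words that occur fewer than 2 times in the text.
    6. Remove words shorter than 2 characters.

    Args:
        text (str): The input text.

    Returns:
        str: The remaining stems joined by single spaces
            (empty if no word passes the filters).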
    """

    # Lowercase the text
    text = text.lower()

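    # Remove stop words. Tokens still carry punctuation at this point,
    # so a token such as "it." does not match the stop word "it".
    stop_words = set(stopwords.words('english'))
    words = [word for word in text.split() if word not in stop_words]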

    # Remove punctuation (build the translation table once, not once per word)
    punctuation_table = str.maketrans('', '', string.punctuation)
    words = [word.translate(punctuation_table) for word in words]

    # Stem the words
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]

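    # Remove words that occur fewer than 2 times (count in a single pass)
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    words = [word for word in words if word_counts[word] >= 2]

    # Remove words shorter than 2 characters (also drops empty strings
    # left behind when a token was pure punctuation)
    words = [word for word in words if len(word) >= 2]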

    return ' '.join(words)

# Example usage
text = "This is a sample text with some punctuation and stop words. We will preprocess it."
preprocessed_text = preprocess_text(text)

- Lower case: this is a sample text with some punctuation and stop words. we will preprocess it.
- Remove stop words: ['sample', 'text', 'punctuation', 'stop', 'words.', 'preprocess', 'it.']
- Remove punctuation: ['sample', 'text', 'punctuation', 'stop', 'words', 'preprocess', 'it']
- Stemmer: ['sampl', 'text', 'punctuat', 'stop', 'word', 'preprocess', 'it']
- Remove words with fewer than two occurrences: []
- Remove words with length less than two: []
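
Every stem in the trace above occurs exactly once, which is why the occurrence filter leaves an empty list. Below is a minimal sketch of that filter in isolation, using `collections.Counter` rather than repeated `list.count` calls. The helper name `filter_rare_tokens`, its `min_count` parameter, and the second token list are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def filter_rare_tokens(tokens, min_count=2):
    """Keep tokens that occur at least `min_count` times (hypothetical helper)."""
    counts = Counter(tokens)  # one pass over the list instead of list.count per token
    return [token for token in tokens if counts[token] >= min_count]


# Stems copied from the trace above: each occurs once, so nothing survives.
single_text = ['sampl', 'text', 'punctuat', 'stop', 'word', 'preprocess', 'it']
print(filter_rare_tokens(single_text))  # []

# Made-up tokens with repeats: only the repeated ones are kept, in order.
repeated = ['text', 'mine', 'text', 'clean', 'mine', 'text']
print(filter_rare_tokens(repeated))  # ['text', 'mine', 'text', 'mine', 'text']
```

On a single short sentence, a threshold of 2 will usually discard everything, so this step probably makes more sense on longer texts or with counts gathered over a whole corpus.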


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
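
In the trace, `it.` passes the stop-word filter because the trailing period keeps it from matching `it`. One possible variation, sketched here under stated assumptions, strips punctuation before the stop-word check. The small `STOP_WORDS` set is a stand-in for NLTK's English list so that the block runs without any download, and `strip_then_filter` is a hypothetical name.

```python
import string

# Stand-in stop-word set (assumption); the notebook uses stopwords.words('english').
STOP_WORDS = {'this', 'is', 'a', 'with', 'some', 'and', 'we', 'will', 'it'}

# Build the translation table once instead of once per token.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_then_filter(text):
    """Lowercase, strip punctuation, then drop stop words (hypothetical helper)."""
    tokens = [token.translate(PUNCT_TABLE) for token in text.lower().split()]
    # Drop empty strings left behind by punctuation-only tokens.
    return [token for token in tokens if token and token not in STOP_WORDS]


text = "This is a sample text with some punctuation and stop words. We will preprocess it."
print(strip_then_filter(text))
# ['sample', 'text', 'punctuation', 'stop', 'words', 'preprocess']
```

The trade-off is that contractions lose their apostrophe in the first step (`don't` becomes `dont`), so they no longer match any stop-list entry that is spelled with one. Which order works better likely depends on the text being processed.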
