# Text Preprocessing, Stemming & Lemmatization

## Text Preprocessing
- Text preprocessing cleans and normalizes raw text before analysis. It includes the cleaning steps below as well as stemming and lemmatization, which reduce words to their lexical roots and so minimize word variation in the corpus.

### Common Text Cleaning Steps
- **Remove HTML**: Use libraries like Beautiful Soup to strip out HTML tags from raw text.
- **Convert to Lowercase**: Lowercasing reduces word variation, but it can merge distinct words (e.g., 'May' the month vs. 'may' the verb). Context-based techniques can help address this.
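- **Remove Punctuation**: Use regular expressions (Python's re module) to remove punctuation. Use string replacement to convert specific symbols such as & or @ into words first, if their meaning should be kept.
- **Replace Numbers**: Replace numbers with a standardized placeholder so that many distinct numeric values collapse into one token, reducing data complexity.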
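
The number-replacement step can be sketched with the standard re module. This is a minimal illustration, not a definitive implementation: the placeholder token NUM, the pattern, and the helper name replace_numbers are all assumptions made for this example.

```python
import re

# Hypothetical placeholder token; any consistent string would work.
NUMBER_TOKEN = "NUM"

# Matches integers and simple decimals such as 3, 42 or 19.99.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def replace_numbers(text, token=NUMBER_TOKEN):
    """Replace every number in text with a single placeholder token."""
    return NUMBER_PATTERN.sub(token, text)

print(replace_numbers("I paid 19.99 for 3 books in 2021."))
# I paid NUM for NUM books in NUM.
```

Whether to keep, drop, or replace numbers depends on the task; a placeholder keeps the information that a number was present without adding one vocabulary entry per value.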

## Stemming & Lemmatization
- Both stemming and lemmatization reduce a word to its lexical root form.

### Stemming
- Stemming is an algorithmic process where the ends of words are cut off to arrive at a common root.
- It may not always yield an actual word. Different stemmers may produce different results, with the Porter Stemmer being one of the most common.
- The Porter Stemmer applies a long sequence of suffix-stripping rules to stem a word. Because English word formation is irregular, these rules sometimes cut too much (over-stemming) or too little (under-stemming).
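
To make the idea of cutting off word endings concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm: the suffix list, the minimum stem length, and the function name naive_stem are assumptions chosen only for illustration.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["cats", "jumped", "running", "ponies", "sing"]:
    print(word + " -> " + naive_stem(word))
# cats -> cat
# jumped -> jump
# running -> runn
# ponies -> poni
# sing -> sing
```

The results "runn" and "poni" show why a stem is not always an actual word. Real stemmers add many more rules to handle cases like doubled consonants.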

### Lemmatization
- Lemmatization reduces words to their dictionary forms (lemmas) based on an actual vocabulary and morphological analysis of the word.
- It uses precomputed lemma dictionaries and can take into account the word's part of speech, which is determined from its context in the sentence.
- The WordNet Lemmatizer from the NLTK library is a popular choice.
- The downside is that it cannot reduce words that are missing from its dictionary, such as new, made-up, or informal words; these are typically returned unchanged.
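
The dictionary-lookup idea behind lemmatization can be sketched with a toy table. The table contents and the function name toy_lemmatize are hypothetical; a real lemmatizer such as the WordNet Lemmatizer relies on a far larger precomputed vocabulary plus morphological rules.

```python
# Hypothetical, hand-written lemma table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("mice", "n"): "mouse",
    ("wolves", "n"): "wolf",
    ("ran", "v"): "run",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("mice"))          # mouse
print(toy_lemmatize("saw", pos="v"))  # see
print(toy_lemmatize("saw", pos="n"))  # saw
print(toy_lemmatize("madeupwords"))   # madeupwords (not in the table)
```

The two entries for "saw" show why part of speech matters, and the last call shows the fallback for words the dictionary does not contain.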

### Differences between Stemming and Lemmatization
- Stemming is faster but may yield lower quality results, while lemmatization is more computationally expensive but usually provides higher quality results.
- The choice between the two depends on the context of the analysis and the trade-off between speed and detail.

## Why Preprocess Text with Stemming and Lemmatization?
- Stemming and lemmatization, combined with the removal of stop words, help distill sentences down to the words that carry their core meaning.
- This reduces corpus size and vocabulary, improves data quality, and speeds up processing, which is beneficial when the sentences are fed to deep learning models.
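
Stop-word removal is mentioned above but not shown in the code segment, so here is a minimal sketch. The stop-word set is a tiny hypothetical sample and remove_stop_words is an illustrative name; real stop-word lists, such as the one in NLTK's stopwords corpus, are much longer.

```python
# Hypothetical, tiny stop-word list used only for this example.
STOP_WORDS = {"the", "and", "are", "a", "is", "in", "of"}

def remove_stop_words(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in sentence.lower().split() if tok not in STOP_WORDS]

sentence = "The cats and dogs are running in the park"
print(remove_stop_words(sentence))
# ['cats', 'dogs', 'running', 'park']
```

Nine tokens shrink to four, and stemming or lemmatizing the remaining tokens would reduce the variation further.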


# Code segment

In [5]:
from bs4 import BeautifulSoup
import re

In [6]:
# Strip HTML tags, keeping only the text content.
input_text = "<b> This text is in bold</b>, <i> This text is in italics </i>"
output_text = BeautifulSoup(input_text, "html.parser").get_text()
print('Input: ' + input_text)
print('Output: ' + output_text)

# Lowercase every token so that case variants collapse to one form.
input_text = ['Cat','cat','CAT']
output_text = [x.lower() for x in input_text]
print('Input: ' + str(input_text))
print('Output: ' + str(output_text))

# Remove punctuation: drop every character that is neither a word character nor whitespace.
input_text = "This ,sentence.'' contains-£ no:: punctuation?"
output_text = re.sub(r'[^\w\s]', '', input_text)
print('Input: ' + input_text)
print('Output: ' + output_text)

# Replace a specific symbol with its word equivalent.
input_text = "Cats & dogs"
output_text = input_text.replace("&", "and")
print('Input: ' + input_text)
print('Output: ' + output_text)

Input: <b> This text is in bold</b>, <i> This text is in italics </i>
Output:  This text is in bold,  This text is in italics 
Input: ['Cat', 'cat', 'CAT']
Output: ['cat', 'cat', 'cat']
Input: This ,sentence.'' contains-£ no:: punctuation?
Output: This sentence contains no punctuation
Input: Cats & dogs
Output: Cats and dogs


In [1]:
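import nltk
from nltk import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

# WordNetLemmatizer needs the WordNet data.
nltk.download('wordnet')
# word_tokenize and nltk.pos_tag (used in later cells) need additional NLTK data
# (tokenizer and POS tagger models). If they raise a LookupError, run the
# nltk.download(...) call suggested in the error message.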

[nltk_data] Downloading package wordnet to /home/kprasath/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
porter = PorterStemmer()


In [None]:
word_list = ["see","saw","cat", "cats", "stem", "stemming","lemma","lemmatization","known","knowing","time", "timing","football", "footballers"]
for word in word_list:
    print(word + ' -> ' + porter.stem(word))

In [None]:
def stem_sentence(sentence):
    """Tokenize a sentence, stem each token, and rejoin the stems with spaces."""
    tokens = word_tokenize(sentence)
    stems = [porter.stem(word) for word in tokens]
    return " ".join(stems)

stem_sentence('The cats and dogs are running')

In [None]:
wordnet_lemmatizer = WordNetLemmatizer()

print(wordnet_lemmatizer.lemmatize('horses'))
print(wordnet_lemmatizer.lemmatize('wolves'))
print(wordnet_lemmatizer.lemmatize('mice'))
print(wordnet_lemmatizer.lemmatize('cacti'))

In [None]:

print(wordnet_lemmatizer.lemmatize('madeupwords'))
print(porter.stem('madeupwords'))

In [None]:
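# lemmatize() treats the word as a noun unless pos is given, so 'ran' is not
# mapped to 'run' here; pass pos='v' (wordnet.VERB) to lemmatize it as a verb.
print(wordnet_lemmatizer.lemmatize('ran'))
print(wordnet_lemmatizer.lemmatize('run'))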

In [None]:

sentence = 'The cats and dogs are running'

def return_word_pos_tuples(sentence):
    return nltk.pos_tag(nltk.word_tokenize(sentence))

return_word_pos_tuples(sentence)

In [None]:

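def get_pos_wordnet(pos_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS via its first letter."""
    pos_dict = {"N": wordnet.NOUN,
                "V": wordnet.VERB,
                "J": wordnet.ADJ,
                "R": wordnet.ADV}

    # Fall back to noun for any tag that is not a noun, verb, adjective, or adverb.
    return pos_dict.get(pos_tag[0].upper(), wordnet.NOUN)

get_pos_wordnet('VBG')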

In [None]:

def lemmatize_with_pos(sentence):
    # Tag each token, map its tag to a WordNet POS, then lemmatize with that POS.
    return [
        wordnet_lemmatizer.lemmatize(word, pos=get_pos_wordnet(tag))
        for word, tag in return_word_pos_tuples(sentence)
    ]

print(lemmatize_with_pos(sentence))