# Text Preprocessing, Stemming & Lemmatization

- Stemming and lemmatization are preprocessing steps
- both reduce a word to its lexical root
- both are techniques for reducing word variation in the corpus

Common cleaning steps for text data:

- remove HTML (use Beautiful Soup)
- convert to lowercase (reduces word variations); case is sometimes helpful, e.g. May (the month) versus may (the verb), but contextual techniques can make up for the loss
- remove punctuation (use the re module); use replace for symbols like & or @ that should become words
- replace numbers with a placeholder (standardises outputs)
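
The cleaning steps above can be chained into one function. The following is a minimal sketch using only the standard library: the regex tag removal is a crude stand-in for Beautiful Soup, and the function name `clean_text` and the `NUM` placeholder are illustrative assumptions, not part of any library. Numbers are replaced before punctuation is removed so that a decimal such as 3.5 is treated as one number.

```python
import re

def clean_text(text):
    """Apply the cleaning steps in order and return the normalised string."""
    text = re.sub(r"<[^>]+>", " ", text)          # crude tag removal (stand-in for Beautiful Soup)
    text = text.lower()                            # fold case to reduce word variations
    text = text.replace("&", "and")                # spell out a symbol before punctuation is dropped
    text = re.sub(r"\d+(?:\.\d+)?", "NUM", text)   # replace numbers with a placeholder (hypothetical token)
    text = re.sub(r"[^\w\s]", "", text)            # remove the remaining punctuation
    return " ".join(text.split())                  # collapse repeated whitespace

print(clean_text("<b>Cats & dogs</b> cost 25 dollars!"))
# cats and dogs cost NUM dollars
```

The order of the steps matters: lowercasing first and inserting an upper-case placeholder afterwards keeps the placeholder distinguishable from ordinary words.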


#### Stemming & Lemmatization
- both work mainly on the suffix of a word to shrink it down to its lexical root form

#### Stemming
- Stemming is an algorithmic process in which the ends of words are cut off to arrive at a common form
- might not always end up with a proper word
- different stemmers are available (the most common being the Porter stemmer)
- the Porter stemmer is an algorithm that applies a large number of logical rules to stem a word
- it does have limitations (e.g. saw won't get changed to see)
- the rules by which English words change form vary a lot from word to word, which is why the Porter stemmer has these issues
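
To make the rule-based idea concrete, here is a toy suffix stripper. It is a sketch only and is NOT the Porter algorithm: the rule list, the minimum stem length of 3, and the name `toy_stem` are all assumptions chosen for illustration. It shows both behaviours noted above: the result is not always a proper word, and irregular forms such as saw are left alone.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Strip one known suffix if at least 3 characters of stem remain."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word  # no rule applied

for w in ["cats", "ponies", "running", "saw"]:
    print(w, "->", toy_stem(w))
# cats -> cat
# ponies -> poni
# running -> runn
# saw -> saw
```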


#### Lemmatization
- lemmatization uses a true vocabulary and structural analysis of the word itself to arrive at its true root (the lemma)
- it uses precomputed lemmas as well as the context of the word within the sentence (e.g. its part of speech)
- example: the WordNet Lemmatizer from NLTK
- a con is that it may not generalize to new or made-up words (text language etc.)
- so it is a problem if the text data doesn't follow proper English (too informal, or media data)
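
The lookup idea behind lemmatization can be sketched with a tiny hand-written table. This is an illustrative assumption, not how WordNet is implemented: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and a real lemmatizer has a far larger vocabulary plus morphological rules. The sketch shows why the part of speech matters and why unknown words pass through unchanged.

```python
# Hypothetical precomputed lemmas, keyed by (word, part of speech).
LEMMA_TABLE = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("mice", "n"): "mouse",
    ("ran", "v"): "run",
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the stored lemma, or the word itself if it is not in the table."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("saw", pos="v"))   # see
print(toy_lemmatize("saw"))            # saw (treated as a noun by default)
print(toy_lemmatize("madeupwords"))    # madeupwords (unknown, returned as-is)
```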

#### Differences
- stemming is faster; lemmatization is more computationally expensive
- but stemming gives poorer quality output; lemmatization gives higher quality
- choose based on the analysis context
- it is a question of trading off speed versus detail

#### WHY?

Preprocessing with stemming and lemmatization, coupled with the removal of stop words, reduces our sentences to their core meaning. By removing words that do not significantly contribute to the meaning of a sentence and by reducing the remaining words to their roots or lemmas, we can analyze sentences more efficiently within our deep learning frameworks.

- reduces corpus size
- improves data quality
- speeds up process
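
The reduction in corpus size can be seen by counting distinct tokens before and after normalisation. This is a rough sketch under stated assumptions: the stop-word set, the example sentence, and the trailing-s rule in the hypothetical `normalise` helper are stand-ins for a real stop-word list and a real stemmer or lemmatizer.

```python
STOP_WORDS = {"the", "and", "a"}  # tiny illustrative stop-word set

def normalise(tokens):
    """Lowercase, drop stop words, and strip a plural 's' from longer words."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]  # crude stand-in for stemming
        result.append(token)
    return result

tokens = "The cats chased the cat and the dogs chased a dog".split()
raw_vocab = set(tokens)
reduced_vocab = set(normalise(tokens))

print(len(raw_vocab), "->", len(reduced_vocab))
# 9 -> 3
print(sorted(reduced_vocab))
# ['cat', 'chased', 'dog']
```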

In [5]:
from bs4 import BeautifulSoup
import re

In [6]:
input_text = "<b> This text is in bold</br>, <i> This text is in italics </i>"
output_text =  BeautifulSoup(input_text, "html.parser").get_text()
print('Input: ' + input_text)
print('Output: ' + output_text)

input_text = ['Cat','cat','CAT']
output_text =  [x.lower() for x in input_text]
print('Input: ' + str(input_text))
print('Output: ' + str(output_text))

input_text = "This ,sentence.'' contains-£ no:: punctuation?"
output_text = re.sub(r'[^\w\s]', '', input_text)
print('Input: ' + input_text)
print('Output: ' + output_text)

input_text = "Cats & dogs"
output_text = input_text.replace("&", "and")
print('Input: ' + input_text)
print('Output: ' + output_text)

Input: <b> This text is in bold</br>, <i> This text is in italics </i>
Output:  This text is in bold,  This text is in italics 
Input: ['Cat', 'cat', 'CAT']
Output: ['cat', 'cat', 'cat']
Input: This ,sentence.'' contains-£ no:: punctuation?
Output: This sentence contains no punctuation
Input: Cats & dogs
Output: Cats and dogs


In [1]:
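import nltk
from nltk import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

# WordNet data is required by WordNetLemmatizer.
# Note: word_tokenize and nltk.pos_tag (used below) need their own NLTK data
# packages as well; the exact resource names depend on the NLTK version.
nltk.download('wordnet')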

[nltk_data] Downloading package wordnet to /home/kprasath/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
porter = PorterStemmer()


In [None]:
word_list = ["see","saw","cat", "cats", "stem", "stemming","lemma","lemmatization","known","knowing","time", "timing","football", "footballers"]
for word in word_list:
    print(word + ' -> ' + porter.stem(word))

In [None]:
def stem_sentence(sentence):
    """Tokenize a sentence and return it with every token Porter-stemmed."""
    tokens = word_tokenize(sentence)
    stems = [porter.stem(word) for word in tokens]
    return " ".join(stems)

stem_sentence('The cats and dogs are running')

In [None]:
wordnet_lemmatizer = WordNetLemmatizer()

print(wordnet_lemmatizer.lemmatize('horses'))
print(wordnet_lemmatizer.lemmatize('wolves'))
print(wordnet_lemmatizer.lemmatize('mice'))
print(wordnet_lemmatizer.lemmatize('cacti'))

In [None]:

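# An unknown word: the lemmatizer has no vocabulary entry to rely on,
# while the stemmer still applies its suffix rules.
print(wordnet_lemmatizer.lemmatize('madeupwords'))
print(porter.stem('madeupwords'))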

In [None]:
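# lemmatize() defaults to pos='n' (noun), so verb forms such as 'ran'
# are looked up as nouns unless a part of speech is passed.
print(wordnet_lemmatizer.lemmatize('ran'))
print(wordnet_lemmatizer.lemmatize('run'))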

In [None]:

sentence = 'The cats and dogs are running'

def return_word_pos_tuples(sentence):
    return nltk.pos_tag(nltk.word_tokenize(sentence))

return_word_pos_tuples(sentence)

In [None]:

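def get_pos_wordnet(pos_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS, defaulting to noun."""
    pos_dict = {"N": wordnet.NOUN,
                "V": wordnet.VERB,
                "J": wordnet.ADJ,
                "R": wordnet.ADV}

    # Only the first letter of the tag is needed to pick the broad word class.
    return pos_dict.get(pos_tag[0].upper(), wordnet.NOUN)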

get_pos_wordnet('VBG')

In [None]:

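def lemmatize_with_pos(sentence):
    """Lemmatize each token using the part of speech assigned by the tagger."""
    new_sentence = []
    tuples = return_word_pos_tuples(sentence)
    for tup in tuples:
        # tup is (token, Penn Treebank tag); convert the tag for WordNet.
        pos = get_pos_wordnet(tup[1])
        lemma = wordnet_lemmatizer.lemmatize(tup[0], pos=pos)
        new_sentence.append(lemma)
    return new_sentence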

print(lemmatize_with_pos(sentence))