<h1 style="text-align:center">Chapter 4</h1>

##  Normalization

> Normalization is the task of putting words/tokens in a standard format. Normalization is beneficial despite the spelling information that is lost.

#### Case folding
---

> Mapping everything to the same case is called case folding.

Case folding is helpful for tasks like speech recognition and information retrieval.

For sentiment analysis and other text classification tasks, information
extraction, and machine translation, by contrast, case can be quite helpful and case
folding is generally not done.


For example,

the distinction between 'US' the country and 'us' the pronoun can outweigh the advantage in
generalization that case folding would have provided for other words.
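
A minimal sketch of this ambiguity, assuming a made-up sample sentence and plain whitespace tokenization: before folding, 'US' and 'us' are counted separately; after lowercasing, they collapse into a single token.

```python
from collections import Counter

# Hypothetical sample sentence, split on whitespace for simplicity
tokens = "The US asked us to visit the US embassy".split()

original_counts = Counter(tokens)
folded_counts = Counter(token.lower() for token in tokens)

# Before folding, the country and the pronoun are distinct tokens
print(original_counts["US"], original_counts["us"])  # 2 1

# After folding, both are counted under the same token
print(folded_counts["us"])  # 3
```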

---

#### Case folding using Python

In [1]:
sentence = 'THIS string Has a MIX of lowercase AND UPPERCASE'

# Case folding to lowercase

print("Lower case ->", sentence.lower())

# Case folding to UPPERCASE

print("Upper case ->", sentence.upper())

# Case folding to first letter of first word in uppercase

print("Capitalized case ->", sentence.capitalize())

# Case folding to title case

print("Title case ->", sentence.title())

# More aggressive lower()

print("Casefold -> ", sentence.casefold())

Lower case -> this string has a mix of lowercase and uppercase
Upper case -> THIS STRING HAS A MIX OF LOWERCASE AND UPPERCASE
Capitalized case -> This string has a mix of lowercase and uppercase
Title case -> This String Has A Mix Of Lowercase And Uppercase
Casefold ->  this string has a mix of lowercase and uppercase
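
For the ASCII sentence above, `lower()` and `casefold()` give the same result. The difference shows up with characters such as the German 'ß'. The following sketch uses an illustrative word that is not part of the original example:

```python
word = "Straße"

print(word.lower())     # straße  (the 'ß' is kept)
print(word.casefold())  # strasse (the 'ß' is expanded to 'ss')

# Comparing two spellings of the same word without regard to case
print("STRASSE".lower() == word.lower())        # False
print("STRASSE".casefold() == word.casefold())  # True
```

This is why `casefold()` is usually the safer choice for caseless string comparison.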


#### Lemmatization
---

> Lemmatization is the task of determining that two words have the same root despite their surface differences.


Example,

Dinner & Dinners have the same <strong>lemma</strong> - Dinner

<strong>Why is lemmatization done?</strong>

There can be many reasons. One of them is to reduce the vocabulary so that it does not contain multiple forms of a word with the same meaning.
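
A rough illustration of the vocabulary reduction, assuming a tiny hand-written lemma table (`LEMMAS` is hypothetical; a real lemmatizer derives this mapping from a lexicon and morphological rules):

```python
# Hypothetical lemma table, written by hand for illustration only
LEMMAS = {"dinners": "dinner", "goes": "go", "going": "go", "went": "go"}

tokens = "dinner dinners go goes going went".split()

vocabulary = set(tokens)
# Words missing from the table are kept unchanged
lemmatized_vocabulary = {LEMMAS.get(token, token) for token in tokens}

print(len(vocabulary), "->", len(lemmatized_vocabulary))  # 6 -> 2
```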

<strong>How is lemmatization done?</strong>

Lemmatization involves morphological parsing of words.

---

#### Morphology

> Morphology is the study of how words are built from smaller meaning-bearing units called <strong>morphemes</strong>.

There are two main classes of morphemes,

- <strong>Stem</strong> - The central morpheme of a word, which supplies its main meaning.
- <strong>Affixes</strong> - The additional parts that give the word a variation of that meaning.

For example,

Suppose the word is <strong>Dinners</strong>, here,

<strong>Stem</strong> = Dinner

<strong>Affix</strong> = -s
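
The split above can be sketched with a toy function. The suffix list and the minimum stem length are assumptions made for illustration; real morphological parsers rely on much richer lexicons and rules.

```python
# Hypothetical, tiny inventory of English suffixes
SUFFIXES = ("ing", "ed", "es", "s")

def split_affix(word):
    """Return (stem, affix) by stripping the longest matching suffix."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require at least three characters to remain as the stem
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)], "-" + suffix
    return word, ""

print(split_affix("Dinners"))  # ('Dinner', '-s')
print(split_affix("Dinner"))   # ('Dinner', '')
```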

---

> Lemmatization algorithms can be complex in nature. Hence, sometimes stemming is used for naive morphological analysis.

---

#### Stemming

> Stemming is a naive version of morphological analysis that consists of removing word affixes to normalize the word.

One of the most widely used stemming algorithms is <strong>the Porter stemmer</strong>, which runs the text through a series of steps as a <strong>cascade</strong>: the output of each pass is fed as input to the next.

Because stemming normalizes text through a fixed series of rules, it makes errors of both over-generalization and under-generalization.
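
The cascade idea can be sketched with a toy stemmer. This is not the Porter algorithm itself: `PASSES` is a hypothetical, heavily simplified rule set, and `toy_stem` is an illustrative name. Each pass applies at most one rewrite rule and hands its result to the next pass.

```python
import re

# Ordered passes; each pass is a list of (suffix pattern, replacement).
# Only the first matching rule in a pass fires.
PASSES = [
    [(r"sses$", "ss"), (r"(?<!s)s$", "")],          # plural endings
    [(r"ization$", "ize"), (r"ational$", "ate")],   # derivational suffixes
    [(r"alize$", "al")],
]

def toy_stem(word):
    word = word.lower()
    for rules in PASSES:
        for pattern, replacement in rules:
            new_word, count = re.subn(pattern, replacement, word)
            if count:
                word = new_word
                break  # move on to the next pass
    return word

# generalizations -> generalization -> generalize -> general
print(toy_stem("generalizations"))  # general
print(toy_stem("grasses"))          # grass
print(toy_stem("news"))             # new (an over-generalization error)
```

Even this small rule set shows both kinds of error: 'news' is wrongly merged with 'new', while an irregular plural such as 'geese' is left untouched and never linked to 'goose'.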

---

#### Lemmatization and Stemming using Python

In [2]:
import nltk
from nltk.stem import WordNetLemmatizer

# Download only the WordNet data the lemmatizer needs instead of the whole 'all' collection
# (some NLTK versions may also require the 'omw-1.4' package)
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# lemmatize() works on one word at a time, so split the sentence into tokens first
sentence = 'I am going to the market to get some groceries'
print([lemmatizer.lemmatize(token) for token in sentence.split()])

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# stem() works on one word at a time, so split the sentence into tokens first
sentence = 'I am going to the market to get some groceries'
print([ps.stem(token) for token in sentence.split()])

---

### Sentence segmentation

> Sentence segmentation is the task of dividing text into sentences. This can be done with the help of punctuation marks such as the period (.), exclamation mark (!), and question mark (?).

But this method breaks down when the text contains abbreviations such as Mr., Mrs., or Inc., because the period there does not mark the end of a sentence.
There are better methods for segmenting text into sentences.

We will be discussing smarter methods in later notebooks. Let's look at a few naive approaches for sentence segmentation.

---

##### Sentence segmentation using Python

In [3]:
sentence_1 = 'How was your day? Were you able to get stuff done? I\'ll be taking a leave tomorrow. \
Hope it\'s okay.'

In [4]:
import re
# Split on sentence-ending punctuation only: period, question mark, exclamation mark
segments_1 = re.split(r'[.?!]', sentence_1)
print(len(segments_1))
print(segments_1)

5
['How was your day', ' Were you able to get stuff done', " I'll be taking a leave tomorrow", " Hope it's okay", '']


In [6]:
sentence_2 = 'Hello Mr. Brown. How was your day today?'
segments_2 = re.split(r'[.?!]', sentence_2)
print(len(segments_2))
print(segments_2)
# Notice how the period in 'Mr.' wrongly splits 'Mr' from 'Brown'

4
['Hello Mr', ' Brown', ' How was your day today', '']


> Another heuristic approach for sentence segmentation is to keep a dictionary of common abbreviations and match each candidate boundary against it.
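
A minimal sketch of this heuristic, assuming a small hand-picked abbreviation set (`ABBREVIATIONS` and `segment` are illustrative names, not part of any library). A token ending in `.`, `?`, or `!` closes a sentence unless it is a known abbreviation.

```python
# Hypothetical abbreviation dictionary; a real one would be much larger
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc."}

def segment(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # End the sentence on terminal punctuation, except after abbreviations
        if token[-1] in ".?!" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without terminal punctuation
        sentences.append(" ".join(current))
    return sentences

print(segment('Hello Mr. Brown. How was your day today?'))
# ['Hello Mr. Brown.', 'How was your day today?']
```

This still fails when an abbreviation really does end a sentence (for example, a sentence ending in 'Inc.'), which is one reason smarter methods are needed.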