## What is Text Normalization?

Text normalization is the process of transforming text into a single canonical form that it may not have had before. Normalizing text before storing or processing it allows for separation of concerns, since the input is guaranteed to be consistent before any operations are performed on it. Text normalization requires knowing what kind of text is being normalized and how it will be processed afterwards; there is no all-purpose normalization procedure.

#### Steps Required:

1. **Input Text String**  
   Start with the raw input text that needs to be normalized.

2. **Convert Case**  
   Convert all letters of the string to one case (either **lowercase** or **uppercase**) to maintain consistency.  
   _Example_: "This Is A Sample Text" → "this is a sample text"

3. **Handle Numbers**  
   - If numbers are essential, **convert them to words**.  
   - If not required, **remove all numeric characters**.

4. **Remove Punctuation and Grammar Formalities**  
   Strip out punctuation marks and unnecessary grammatical symbols like `.,!?;:'"` etc.

5. **Remove Extra White Spaces**  
   Normalize white spaces by removing extra spaces, tabs, or newline characters.

6. **Remove Stop Words**  
   Filter out common stop words such as `the`, `is`, `and`, `in`, and `to`, which carry little meaningful information in many NLP tasks.

7. **Other Computations (Optional)**  
   Additional processing may include:
   - **Stemming or Lemmatization**: Reduce words to their base or root form.
   - **Spelling Correction**
   - **Tokenization**
   - **Handling Emojis or Special Characters** depending on the context of use.

---

These steps prepare raw text for downstream natural language processing (NLP) tasks such as classification, clustering, or sentiment analysis. Which of them to apply depends on the task.
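
Two of the optional steps listed above, tokenization and handling special characters, can be sketched with the standard library alone. This is a minimal illustration, not a recommended implementation: the helper names `strip_accents` and `tokenize` are hypothetical, and the regex tokenizer simply splits on anything that is not a word character.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Return the runs of word characters in text; punctuation is discarded."""
    return re.findall(r"\w+", text)


sample = "Café déjà vu, naïve résumé!"
print(strip_accents(sample))                    # Cafe deja vu, naive resume!
print(tokenize(strip_accents(sample).lower()))  # ['cafe', 'deja', 'vu', 'naive', 'resume']
```

Whether accents should be removed at all depends on the language and the task; for some languages they distinguish otherwise identical words.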

#### Case Conversion (Lower Case)

```python
# import required libraries
import re
```

```python
# input string
string = "      Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
```

```python
# convert to lower case
lower_string = string.lower()
print(lower_string)
```

      python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much python 2 code does not run unmodified on python 3. with python 2's end-of-life, only python 3.6.x[30] and later are supported, with older versions still supporting e.g. windows 7 (and old installers not restricted to 64-bit windows).
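
For English text `lower()` is usually enough. When the input may contain other languages, Python's `str.casefold()` is a more aggressive option meant for caseless comparison. The short sketch below uses a made-up word list to show the difference.

```python
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps the German sharp s, so the spellings do not all match
lowered = {w.lower() for w in words}
print(len(lowered))  # 2

# casefold() maps 'ß' to 'ss', so all three collapse to one form
folded = {w.casefold() for w in words}
print(folded)  # {'strasse'}
```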


#### Removing Numbers

```python
# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+', '', lower_string)
print(no_number_string)
```

      python ., released in , was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python . with python 's end-of-life, only python ..x[] and later are supported, with older versions still supporting e.g. windows  (and old installers not restricted to -bit windows).
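
The cell above covers the case where numbers are not needed. When they are essential, the other option from the steps list is to convert them to words. The sketch below is one possible way to do that under simplifying assumptions: `number_to_words` and `spell_out_numbers` are hypothetical helpers, only integers from 0 to 9999 are spelled out, and each run of digits is treated separately, so `3.0` becomes `three.zero`.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..9999 (illustrative helper)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + number_to_words(rest) if rest else "")


def spell_out_numbers(text):
    """Replace each run of up to four digits with its English spelling."""
    def _replace(match):
        digits = match.group()
        # Leave longer numbers untouched rather than guess at a spelling
        return number_to_words(int(digits)) if len(digits) <= 4 else digits
    return re.sub(r"\d+", _replace, text)


print(spell_out_numbers("python 3.0, released in 2008"))
# python three.zero, released in two thousand eight
```

Words are joined with spaces rather than hyphens so that a later punctuation-removal step does not glue them together.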


#### Removing Punctuation

```python
# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+', '', lower_string)

# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]', '', no_number_string)
print(no_punc_string)
```

      python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows
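
Deleting punctuation outright glues hyphenated words together, which is why `end-of-life` turns into `endoflife` in the output above. One alternative, sketched here on a shortened sample sentence, is to replace each punctuation character with a space and then collapse the extra spaces. Note that `\w` also matches the underscore, so underscores survive either variant.

```python
import re

text = "with python 2's end-of-life, only e.g. 64-bit windows"

# Deleting punctuation joins the surrounding word parts ("endoflife")
deleted = re.sub(r"[^\w\s]", "", text)

# Replacing punctuation with a space keeps the parts separate ("end of life")
spaced = re.sub(r"[^\w\s]", " ", text)
# Collapse the runs of spaces that the substitution leaves behind
spaced = re.sub(r"\s+", " ", spaced).strip()

print(deleted)  # with python 2s endoflife only eg 64bit windows
print(spaced)   # with python 2 s end of life only e g 64 bit windows
```

Neither variant is right for every task: the second keeps word parts apart but also splits abbreviations such as `e.g.` into single letters.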


#### Removing Whitespace

```python
# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+', '', lower_string)

# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]', '', no_number_string)

# collapse runs of whitespace into one space, then trim both ends
no_wspace_string = re.sub(r'\s+', ' ', no_punc_string).strip()
print(no_wspace_string)
```

python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows


#### Removing Stop Words

```python
import spacy

# convert to lower case and trim leading/trailing whitespace
lower_string = string.lower().strip()

# load spaCy English model
nlp = spacy.load("en_core_web_sm")

# process the text using spaCy
doc = nlp(lower_string)

# remove stopwords
filtered_words = [token.text for token in doc if not token.is_stop]

# join the filtered words to form a clean text
clean_text = ' '.join(filtered_words)

print(clean_text)
```

python 3.0 , released 2008 , major revision language completely backward compatible python 2 code run unmodified python 3 . python 2 end - - life , python 3.6.x[30 ] later supported , older versions supporting e.g. windows 7 ( old installers restricted 64 - bit windows ) .
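
The spaCy cell above works on the lowercased text only, so numbers and punctuation remain in its output. As a rough sketch of how the individual steps could be chained into a single function, the example below uses only the standard library. The `normalize` function and the tiny `STOP_WORDS` set are illustrative assumptions, not part of spaCy or of the cells above.

```python
import re

# Tiny illustrative stop-word list; a real project would use a fuller one
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "not", "of", "on",
              "the", "to", "was", "with"}


def normalize(text, stop_words=STOP_WORDS):
    """Lowercase, drop digits and punctuation, collapse whitespace, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    tokens = text.split()                # split() also collapses whitespace
    return " ".join(t for t in tokens if t not in stop_words)


print(normalize("   Python 3.0, released in 2008, was a major revision of the language.  "))
# python released major revision language
```

The order of the steps matters: removing stop words before stripping punctuation would miss tokens such as `the,` that still carry a trailing comma.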
