### Text Cleaning 
Text pre-processing is one of the major tasks in NLP. To extract useful insights from the data, we often need to remove punctuation, numbers, special characters, and similar noise. Regular expressions (regex) make this task simpler.

#### A regular expression (regex) is a pattern used to search text. A regex has two kinds of components:
<b>1)</b> Literals - characters with no special meaning that match themselves, such as a, b, 1, 2<br>
<b>2)</b> Metacharacters - symbols, or a backslash followed by a letter, that carry a special meaning, such as ? and *. Ex: . matches any character (except newline), \d matches a digit

<b>To match a metacharacter as a literal, escape it with a backslash "\\"</b>
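
As a small illustration of escaping (the sample string is made up for this example): an unescaped `.` matches any character, while `\.` matches only a literal dot. `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

sample = "v1.2"  # made-up sample string

# Unescaped dot is a metacharacter: it matches every character here.
any_char = re.findall(r".", sample)

# Escaped dot is a literal: it matches only the "." itself.
literal_dot = re.findall(r"\.", sample)

# re.escape() backslash-escapes the metacharacters in an arbitrary string.
escaped = re.escape("1.2")

print(any_char)     # ['v', '1', '.', '2']
print(literal_dot)  # ['.']
print(escaped)      # 1\.2
```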

----
The most common regex operations are:<br>

<b>1)</b> search - Scan the string and return the first location where the pattern matches.<br>
<b>2)</b> match - Check for the pattern only at the beginning of the string.<br>
<b>3)</b> findall - Return all non-overlapping occurrences of the pattern.<br>
<b>4)</b> sub - Replace the matched parts of the string

In [107]:
import re

In [116]:
#Match (string of any length or character)
text = "India won the world cup in 2012" 
result = re.match(r".*",text)# the r prefix marks a raw string; .* matches any run of characters except newline
result

<re.Match object; span=(0, 31), match='India won the world cup in 2012'>

In [115]:
#Match string of any length or character
text = "India won the world cup \n in 2012" 
result = re.match(r".*",text)# . does not match "\n", so the match stops just before the newline
result

<re.Match object; span=(0, 24), match='India won the world cup '>
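
If the match should continue past the newline, the standard `re.DOTALL` flag makes `.` match `\n` as well. A minimal sketch reusing the same text:

```python
import re

text = "India won the world cup \n in 2012"

# Default behaviour: "." stops at the newline.
default_match = re.match(r".*", text)

# With re.DOTALL, "." also matches "\n", so the whole string is consumed.
dotall_match = re.match(r".*", text, flags=re.DOTALL)

print(default_match.group() == "India won the world cup ")  # True
print(dotall_match.group() == text)                         # True
```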

In [117]:
# Search
pattern = 'a'
search_ = re.search(pattern,'Machine Learning')
print("Found match at",search_.start(),"and the string is",search_.group())

Found match at 1 and the string is a


In [119]:
#findall
pat = 'a'
search_ = re.findall(pat,'aeroplane')
search_

['a', 'a']

In [120]:
#sub 
pat = 'a'
re.sub(pat,'*','aeroplane')

'*eropl*ne'
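
The four calls differ mainly in where they look and what they return. A side-by-side sketch on a single string (variable names are chosen for this example only):

```python
import re

text = "Machine Learning"

# match: succeeds only if the pattern occurs at position 0.
at_start = re.match(r"Learning", text)

# search: scans the string and returns the first hit.
first_hit = re.search(r"Learning", text)

# findall: returns every non-overlapping hit as a list of strings.
all_hits = re.findall(r"n", text)

# sub: returns a new string with each hit replaced.
replaced = re.sub(r"n", "N", text)

print(at_start)          # None
print(first_hit.span())  # (8, 16)
print(all_hits)          # ['n', 'n', 'n']
print(replaced)          # MachiNe LearNiNg
```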

### Metacharacters 
* . matches any character except newline
* \d matches a digit
* [ ] defines a character class: any one of the enclosed characters

In [121]:
# character class
s1 = 'defense'
s2 = 'defence'

In [122]:
re.search(pattern = r'defen[sc]e',string= s1)

<re.Match object; span=(0, 7), match='defense'>

In [123]:
re.search(pattern = r'defen[sc]e',string= s2)

<re.Match object; span=(0, 7), match='defence'>

Hence, a character class works like an OR over single characters: [sc] matches either s or c.
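
To make the OR reading concrete, the sketch below compares the character class with an explicit alternation. The third spelling in the sample string is deliberately invalid, to show that only `s` and `c` are accepted.

```python
import re

words = "defense defence defenze"  # the last spelling is deliberately invalid

# Character class: exactly one character, either "s" or "c".
with_class = re.findall(r"defen[sc]e", words)

# Alternation inside a non-capturing group expresses the same choice.
with_alternation = re.findall(r"defen(?:s|c)e", words)

print(with_class)        # ['defense', 'defence']
print(with_alternation)  # ['defense', 'defence']
```

For single characters the class is usually the shorter form; alternation is needed when the options are longer than one character.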

### Different Regex
* Alphabets [a-z] and [A-Z]
* Numbers [0-9]
* ^ Negation [^0-9] -- Search everything except numbers
* [0-9a-zA-Z] -- Alphanumeric
* [^0-9a-zA-Z] -- Everything except alphanumerics (symbols and whitespace)
* \s -- whitespace [ \t\n\r\f\v]
* <i>a plain s, without the backslash, matches the letter "s"</i>
* \w = [a-zA-Z0-9_] (for ASCII text)
* \W = [^a-zA-Z0-9_] (for ASCII text)
* \d = Digits [0-9]
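
A quick check of the shorthand classes on a made-up sample string. The last line shows why the range should be written `A-Z` and not `A-z`: in ASCII, the range `A-z` also covers the symbols that sit between `Z` and `a`.

```python
import re

sample = "user_1 [ok]\t42"  # made-up sample string

word_chars = re.findall(r"\w", sample)  # letters, digits and "_"
non_word = re.findall(r"\W", sample)    # everything else
whitespace = re.findall(r"\s", sample)  # here: one space and one tab
digits = re.findall(r"\d", sample)

# Pitfall: [A-z] also matches "_" and "[" because they lie between "Z" and "a".
too_wide = re.findall(r"[A-z]", "a_[Z")

print(word_chars)  # ['u', 's', 'e', 'r', '_', '1', 'o', 'k', '4', '2']
print(non_word)    # [' ', '[', ']', '\t']
print(whitespace)  # [' ', '\t']
print(digits)      # ['1', '4', '2']
print(too_wide)    # ['a', '_', '[', 'Z']
```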


In [129]:
# Searching Alphabets
text = "Covid-19 is increasing at an alarming rate"
result = re.match(r"[a-zA-Z]+",text)
# the match starts at the beginning of the string and stops at the first non-letter, here the hyphen.

In [130]:
print(result)

<re.Match object; span=(0, 5), match='Covid'>


In [140]:
# Searching numbers & special characters (the first run of non-letters)
text = "Covid-19 is increasing at an alarming rate"
result = re.search(r"[^a-zA-Z]+",text)

In [141]:
print(result)

<re.Match object; span=(5, 9), match='-19 '>


In [142]:
# the re.I (re.IGNORECASE) flag makes the pattern case-insensitive
print("All Alphabets",re.findall(pattern= r'[a-z]',string = "JagABCDhfhio3hifhi821",flags=re.I))

All Alphabets ['J', 'a', 'g', 'A', 'B', 'C', 'D', 'h', 'f', 'h', 'i', 'o', 'h', 'i', 'f', 'h', 'i']


In [143]:
# space, tab, newline
pat = r'\s'  # raw string, so the backslash reaches the regex engine unchanged
string = 'heiiurfihfi....7@#$$\nhduiw523'
print(re.findall(pat,string))

['\n']


In [144]:
Password = input("Enter Password")
obj = re.search(r'\s',Password)  # look for any whitespace character
if obj is None:
    print("Correct Password")
else:
    print("Don't use space")

Enter Passwordhjj 
Don't use space


### Quantifiers
#### - Zero or one occurrence (?)
#### - Zero or more occurrences (*)
#### - At least one occurrence (+)
#### - Exactly n times ({n})
#### - Minimum m times, maximum n times ({m,n})
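
A minimal sketch of each quantifier applied to the letter `b` in a made-up string, so the differences are visible side by side:

```python
import re

text = "a ab abb abbb"  # made-up sample string

zero_or_one = re.findall(r"ab?", text)     # "a" followed by at most one "b"
zero_or_more = re.findall(r"ab*", text)    # "a" followed by any number of "b"
one_or_more = re.findall(r"ab+", text)     # "a" followed by at least one "b"
exactly_two = re.findall(r"ab{2}", text)   # "a" followed by exactly two "b"
one_to_two = re.findall(r"ab{1,2}", text)  # "a" followed by one or two "b"

print(zero_or_one)   # ['a', 'ab', 'ab', 'ab']
print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb']
print(one_or_more)   # ['ab', 'abb', 'abbb']
print(exactly_two)   # ['abb', 'abb']
print(one_to_two)    # ['ab', 'abb', 'abb']
```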

In [145]:
# find numbers that begin with 6, 7, 8 or 9, followed by 9 more digits
pat = r"[6-9]\d{9}"
result = re.match(pat,"785666666666")
result

<re.Match object; span=(0, 10), match='7856666666'>

In [146]:
s1 = "9xm"
s2 = "MTV"
pat = r'[0-9]?'  # an optional digit: zero or one occurrence
searchObj = re.search(pat, s1) 

In [147]:
searchObj

<re.Match object; span=(0, 1), match='9'>

In [148]:
searchObj = re.search( r'[0-9]?', s2)  # no digit at position 0, so ? matches zero times and the result is an empty match

In [149]:
searchObj

<re.Match object; span=(0, 0), match=''>

In [150]:
# Anchors for start and end
pat = r"^[6-9]\d{5}$" # must start with 6, 7, 8 or 9, followed by exactly 5 more digits, and then end
re.findall(pat,string = "987685")

['987685']

In [152]:
text = "1998 was the year when the film titanic was released"
if re.search(r"^1998", text):  # ^ anchors the pattern to the start of the string
    print("Match found")
else:
    print("Match not found")

Match found
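
The two anchors can be compared on the same sentence: `^` only matches at the start of the string and `$` only at the end. The variable names below are chosen for this sketch.

```python
import re

text = "1998 was the year when the film titanic was released"

# "1998" is at the start of the string, so ^1998 matches.
starts_with_year = re.search(r"^1998", text) is not None

# "1998" is not at the end of the string, so 1998$ does not match.
ends_with_year = re.search(r"1998$", text) is not None

# "released" is the last word, so released$ matches.
ends_with_released = re.search(r"released$", text) is not None

print(starts_with_year)    # True
print(ends_with_year)      # False
print(ends_with_released)  # True
```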


### Text Cleaning 
Machines can’t understand the unstructured raw data, and we can’t fit our model on the raw text. We have to clean the text and convert it from the unstructured format to a structured format to get any meaning from the data. Text cleaning is a crucial part of any NLP project.

-----------
<b>Now that we are dealing with text data we need to understand some basic terms:</b>
* Word:- A sequence of letters without any symbols or numbers.
* Term:- A meaningful word in the language, excluding stopwords (very common general words).
* Document:- The atomic unit of the text data. Ex: a single tweet in Twitter data.
* Corpus:- The collection of documents used for the analysis. A corpus is a systematic, computerized collection of authentic language used for linguistic analysis.
----------
<b>Steps in Data Cleaning:</b>
* Remove HTML characters from corpus.- <i>Using Regex</i>
* Convert the text to lower case ( Case Standardization).
* Remove Punctuations.
* Remove Stopwords.
* Tokenization.
* Stemming vs Lemmatization.
* Get Parts of Speech.
----------
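
The first few steps can be sketched with the standard library alone. This is a rough illustration, not a full pipeline: the `clean_text` function and the tiny `STOP_WORDS` set are assumptions made for this example (a real stopword list is much longer), tokenization is done before stopword removal for convenience, and stemming, lemmatization and POS tagging are left out.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "it"}

def clean_text(raw):
    """Return a list of cleaned tokens from a raw string."""
    text = re.sub(r"<[^>]+>", " ", raw)    # 1. drop HTML tags
    text = text.lower()                    # 2. case standardization
    text = re.sub(r"[^a-z\s]", "", text)   # 3. remove punctuation and digits
    tokens = text.split()                  # 4. whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # 5. stopword removal

cleaned = clean_text("<p>Climate change is REAL, and it is happening!</p>")
print(cleaned)  # ['climate', 'change', 'real', 'happening']
```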

In [88]:
import pandas as pd
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet # is for word meanings
import nltk.classify.util
from nltk.stem import WordNetLemmatizer,PorterStemmer
import re

In [89]:
sentence =  """Thank you all so very much. Thank you to the Academy. 
    Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom 
               Hardy. Tom, your talent on screen can only be surpassed by 
               your friendship off screen … thank you for creating a t
               ranscendent cinematic experience. Thank you to everybody at 
               Fox and New Regency … my entire team. I have to thank 
               everyone from the very onset of my career … To my parents; 
               none of this would be possible without you. And to my 
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2015 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

In [90]:
# there are no HTML characters in this text, so we go straight to converting it to lower case
sentence = sentence.lower()

In [91]:
# Remove Punctuations
sentence_P = re.sub(r"[^a-zA-Z ]","",sentence)
# replace every character that is not a letter or a space with nothing (this also drops digits and newlines)
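
Deleting characters outright has side effects worth knowing: an apostrophe disappears inside a word, and a newline with no surrounding space fuses two words. The sketch below shows this on a made-up string and contrasts it with a variant (an assumption, not the approach used in this notebook) that replaces each run of non-letters with a single space.

```python
import re

raw = "man's relationship\nto the world in 2015."  # made-up sample string

# Same pattern as above: keep only ASCII letters and spaces.
letters_only = re.sub(r"[^a-zA-Z ]", "", raw)

# Variant: turn every run of non-letters into one space, then trim the ends.
spaced = re.sub(r"[^a-zA-Z]+", " ", raw).strip()

print(repr(letters_only))  # 'mans relationshipto the world in '
print(repr(spaced))        # 'man s relationship to the world in'
```

Neither variant is strictly better: the first fuses words, the second splits contractions into stray single letters.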

In [92]:
print(stopwords.words('english'))
# these are the English stopwords; we remove them because they carry little meaning on their own

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [93]:
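# Tokenization and removing stopwords
stop_words = set(stopwords.words("english"))  # build the set once for fast lookups
sentence_split = sentence_P.split()  # split on any run of whitespace, which drops empty tokens
new_words = []
for i in sentence_split:
    if i not in stop_words:
        new_words.append(i)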

In [101]:
print(new_words)

['thank', 'much', 'thank', 'academy', 'thank', 'room', 'congratulate', 'incredible', 'nominees', 'year', 'revenant', 'product', 'tireless', 'efforts', 'unbelievable', 'cast', 'crew', 'first', 'brother', 'endeavor', 'mr', 'tom', 'hardy', 'tom', 'talent', 'screen', 'surpassed', 'friendship', 'screen', 'thank', 'creating', 'ranscendent', 'cinematic', 'experience', 'thank', 'everybody', 'fox', 'new', 'regency', 'entire', 'team', 'thank', 'everyone', 'onset', 'career', 'parents', 'none', 'would', 'possible', 'without', 'friends', 'love', 'dearly', 'know', 'lastly', 'want', 'say', 'making', 'revenant', 'mans', 'relationship', 'natural', 'world', 'world', 'collectively', 'felt', 'hottest', 'year', 'recorded', 'history', 'production', 'needed', 'move', 'southern', 'tip', 'planet', 'able', 'find', 'snow', 'climate', 'change', 'real', 'happening', 'right', 'urgent', 'threat', 'facing', 'entire', 'species', 'need', 'work', 'collectively', 'together', 'stop', 'procrastinating', 'need', 'support', 

In [102]:
# Word Correction
from textblob import TextBlob

In [103]:
word = "Englash"
tb = TextBlob(word)
r = tb.correct().raw
r

'English'

In [104]:
clean_tokens_no_typpos = []
for word in new_words:
    tb = TextBlob(word)
    clean_tokens_no_typpos.append(tb.correct().raw)  # may also alter valid words such as proper nouns
clean_tokens_no_typpos

['thank',
 'much',
 'thank',
 'academy',
 'thank',
 'room',
 'congratulate',
 'incredible',
 'nominee',
 'year',
 'covenant',
 'product',
 'tireless',
 'efforts',
 'unbelievable',
 'cast',
 'crew',
 'first',
 'brother',
 'endeavor',
 'mr',
 'tom',
 'hardy',
 'tom',
 'talent',
 'screen',
 'surpassed',
 'friendship',
 'screen',
 'thank',
 'creating',
 'ranscendent',
 'cinematic',
 'experience',
 'thank',
 'everybody',
 'fox',
 'new',
 'regency',
 'entire',
 'team',
 'thank',
 'everyone',
 'onset',
 'career',
 'parents',
 'none',
 'would',
 'possible',
 'without',
 'friends',
 'love',
 'dearly',
 'know',
 'lastly',
 'want',
 'say',
 'making',
 'covenant',
 'man',
 'relationship',
 'natural',
 'world',
 'world',
 'collectively',
 'felt',
 'hottest',
 'year',
 'recorded',
 'history',
 'production',
 'needed',
 'move',
 'southern',
 'tip',
 'planet',
 'able',
 'find',
 'snow',
 'climate',
 'change',
 'real',
 'happening',
 'right',
 'urgent',
 'threat',
 'facing',
 'entire',
 'species',
 'ne

#### TextBlob can be used for word correction, but check the result: above it also changed valid words ('revenant' became 'covenant').

### Stemming v/s Lemmatization
* Stemming reduces inflected words to a root form by mapping a group of related words to the same stem, even if the stem itself is not a valid word in the language. Example: [finally, final, finalized] = fina
* Lemmatization, unlike stemming, reduces inflected words to a root that is a valid word in the language. In lemmatization, the root word is called the lemma. Example: [finally, final, finalized] = final


<b>Building a stemmer is simpler, whereas a lemmatizer needs deeper linguistic knowledge. Choose stemming when speed matters; otherwise prefer lemmatization.</b>
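
The difference can be illustrated with a toy example. The code below is not NLTK's `PorterStemmer` or `WordNetLemmatizer`; `toy_stem`, `toy_lemmatize`, the suffix list and the lookup table are all hypothetical stand-ins written for this sketch. It only shows the idea: a stemmer applies cheap string rules and may produce non-words, while a lemmatizer looks words up and returns valid dictionary forms.

```python
# Toy rules for illustration only; real stemmers use many more rules.
SUFFIXES = ("ies", "ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary.
LEMMAS = {"studies": "study", "studying": "study", "studied": "study"}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

words = ["studies", "studying", "studied"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['stud', 'study', 'studi']
print(lemmas)  # ['study', 'study', 'study']
```

The stems are fast to compute but inconsistent and not always real words; the lemmas agree, at the cost of needing a dictionary.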

In [106]:
nltk.pos_tag(clean_tokens_no_typpos)
# each token is now paired with a part-of-speech tag, so we can filter for nouns (NN*) or adjectives (JJ*).

[('thank', 'RB'),
 ('much', 'JJ'),
 ('thank', 'NN'),
 ('academy', 'NN'),
 ('thank', 'NN'),
 ('room', 'NN'),
 ('congratulate', 'NN'),
 ('incredible', 'JJ'),
 ('nominee', 'JJ'),
 ('year', 'NN'),
 ('covenant', 'NN'),
 ('product', 'NN'),
 ('tireless', 'NN'),
 ('efforts', 'NNS'),
 ('unbelievable', 'JJ'),
 ('cast', 'NN'),
 ('crew', 'NN'),
 ('first', 'RB'),
 ('brother', 'RB'),
 ('endeavor', 'NN'),
 ('mr', 'NN'),
 ('tom', 'NN'),
 ('hardy', 'JJ'),
 ('tom', 'JJ'),
 ('talent', 'NN'),
 ('screen', 'NN'),
 ('surpassed', 'VBD'),
 ('friendship', 'NN'),
 ('screen', 'NN'),
 ('thank', 'NN'),
 ('creating', 'VBG'),
 ('ranscendent', 'JJ'),
 ('cinematic', 'JJ'),
 ('experience', 'NN'),
 ('thank', 'NN'),
 ('everybody', 'NN'),
 ('fox', 'VBZ'),
 ('new', 'JJ'),
 ('regency', 'NN'),
 ('entire', 'JJ'),
 ('team', 'NN'),
 ('thank', 'NN'),
 ('everyone', 'NN'),
 ('onset', 'VBN'),
 ('career', 'NN'),
 ('parents', 'NNS'),
 ('none', 'NN'),
 ('would', 'MD'),
 ('possible', 'JJ'),
 ('without', 'IN'),
 ('friends', 'NNS'),
 ('lo