## REGEX 

* A regular expression (RegEx) combines metacharacters with literal text to describe a pattern, which a regex engine then searches for in a string. 

* Metacharacters are the building blocks of regular expressions. 


<table>
  <tr>
    <th>Meta character</th>
    <th>Meaning</th>
    <th>Example</th>
  </tr>
  <tr> 
    <td> \d </td> 
    <td> Any single digit 0-9 </td> 
    <td> \d\d = 81, \d = 4 </td>
  </tr>
  <tr>
    <td> [a-z], [0-9] </td> 
    <td> Character set: matches exactly one character from the set, unless a quantifier follows it. 
    The order of the characters inside the brackets does not matter. </td>
    <td> [a-z] = a, b, c, ... / [0-9] = 0, 1, 2, ... </td>
  </tr>
  <tr>
    <td> * </td> 
    <td> Matches when the character preceding * occurs 0 or more times </td> 
    <td> tre* = tr / tre / tree / treeeeee </td>
  </tr>
  <tr>
    <td> + </td>
    <td> Matches when the character preceding + matches 1 or more times </td>
    <td> tre+ = tre / tree / treeeeee </td> 
  </tr>
  <tr>
    <td> {n} </td> 
    <td> Matches exactly n occurrences of the preceding character </td>
    <td> \d{3} = 123 / 234 / 987 </td>
  </tr>
  <tr>
    <td> . (period) </td> 
    <td> Matches any single character (letter, digit, or symbol) except a newline </td> 
    <td> ton. = tone / ton3 / ton# </td> 
  </tr>
</table>
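
The examples in the table can be checked with Python's built-in `re` module. The sketch below uses `re.fullmatch`, which succeeds only when the entire string fits the pattern. The helper name `matches` is introduced here purely for illustration.

```python
import re

def matches(pattern, text):
    """Return True when the whole of `text` fits `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"\d\d", "81"))     # two digits
print(matches(r"[a-z]", "ab"))    # a set matches one character, so two letters fail
print(matches(r"tre*", "tr"))     # * allows zero occurrences of "e"
print(matches(r"tre+", "tr"))     # + needs at least one "e"
print(matches(r"\d{3}", "987"))   # exactly three digits
print(matches(r"ton.", "ton#"))   # . accepts any single character
```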


<br>

* What will be the regex for your phone number?
* What will ".*" match?
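
One possible way to approach these two questions is sketched below. It assumes a ten-digit phone number grouped as 3-3-4 digits with optional dashes; that format is an assumption, and real phone numbers vary by country.

```python
import re

# Assumed format: 3 digits, 3 digits, 4 digits, with an optional dash between groups.
phone_pattern = re.compile(r"\d{3}-?\d{3}-?\d{4}")

print(bool(phone_pattern.fullmatch("555-123-4567")))
print(bool(phone_pattern.fullmatch("5551234567")))
print(bool(phone_pattern.fullmatch("555-1234")))

# ".*" means "any character, zero or more times", so it matches a whole line,
# including an empty one. It stops at a newline unless re.DOTALL is set.
first_line = re.match(r".*", "first line\nsecond line").group()
empty_match = re.match(r".*", "").group()
print(first_line)
print(repr(empty_match))
```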





You can use regular expressions to search for URLs, email addresses, dates, and other strings that follow a specific pattern. The key is to understand the pattern well before you write the expression.
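
As a rough illustration of pattern-based searching, the sketch below pulls dates and email addresses out of a sentence. The sample text is invented, the date pattern assumes a `YYYY-MM-DD` layout, and the email pattern is deliberately loose rather than a full validator.

```python
import re

# Sample text invented for illustration.
note = "Contact ana@example.com or raj.k@mail.example.org before 2024-03-15 or 2024-04-01."

# Assumed date shape: YYYY-MM-DD (digits only; calendar validity is not checked).
date_pattern = r"\d{4}-\d{2}-\d{2}"
# Loose email shape: name@domain.tld. Simplified on purpose, not a validator.
email_pattern = r"[\w.]+@[\w-]+(?:\.[\w-]+)+"

dates = re.findall(date_pattern, note)
emails = re.findall(email_pattern, note)
print(dates)
print(emails)
```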



In [None]:
import re

**Search**

In [None]:
txt = "The rain in Spain"
x = re.search(r".*a\.n.*", txt)    # returns a Match object if the pattern matches anywhere in the string, otherwise None; \. requires a literal period

if x:
  print("YES! We have a match!")
else:
  print("No match")

No match
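
The search fails because `\.` demands a literal period between "a" and "n". A small sketch of the contrasting case, with an unescaped `.`, shows what a successful `re.search` returns: a match object that carries the matched text and its position.

```python
import re

txt = "The rain in Spain"

# Unescaped "." matches any character, so "a.n" finds "ain" inside "rain".
match = re.search(r"a.n", txt)
print(match.group())   # the matched text
print(match.span())    # (start, end) indexes of the match

# Escaped "\." only matches a real period, which this string does not contain.
no_match = re.search(r"a\.n", txt)
print(no_match)
```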




**Find**

In [None]:
x = re.findall(r"[^\sin ]", txt)     # returns a list of every character that is not whitespace, 'i', or 'n'
print(x)

['T', 'h', 'e', 'r', 'a', 'S', 'p', 'a']


**Split**

In [None]:
x = re.split(r"\s", txt)   # raw string keeps \s from being read as a string escape
print(x)

['The', 'rain', 'in', 'Spain']


In [None]:
txt2 = "My name is, megha?? I love trees!!"
x = re.split(r"[?!]", txt2)   # inside a character set, ? and ! need no escaping
print(x)

['My name is, megha', '', ' I love trees', '', '']


**Replace**

In [None]:
x = re.sub(r"\s", "a+", txt)   # the "+" in the replacement is inserted literally
print(x)

Thea+raina+ina+Spain


In [None]:
# controlled replace: count limits how many matches are replaced
x = re.sub(r"\s", "+", txt, count=2)   # the replacement is plain text, not a regex pattern
print(x) 

The+rain+in Spain
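
Although the replacement argument is treated as text rather than as a pattern, it can refer back to groups captured by the pattern, or it can be a function. A minimal sketch, with the group layout and the helper name `shout` chosen here for illustration:

```python
import re

txt = "The rain in Spain"

# \1 and \2 in the replacement refer to the first and second captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", txt, count=1)
print(swapped)

# A function can compute each replacement from the match object.
def shout(match):
    return match.group().upper()

loud = re.sub(r"\b\w*ain\b", shout, txt)
print(loud)
```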


# NLTK

In [None]:
import nltk


nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')   # keep runs of word characters, dropping punctuation
text = "Hello, she has build a very good site. She hasn't added any irrelevant info!"
filterdText = tokenizer.tokenize(text)
print(filterdText)

['Hello', 'she', 'has', 'build', 'a', 'very', 'good', 'site', 'She', 'hasn', 't', 'added', 'any', 'irrelevant', 'info']


In [None]:
from nltk.tokenize import WordPunctTokenizer

# \w+|[^\w\s]+
tokenizer = WordPunctTokenizer()
text = "Hello, she has build a very good site. She hasn't added any irrelevant info!"
filterdText = tokenizer.tokenize(text)
print(filterdText)

['Hello', ',', 'she', 'has', 'build', 'a', 'very', 'good', 'site', '.', 'She', 'hasn', "'", 't', 'added', 'any', 'irrelevant', 'info', '!']
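
The commented pattern `\w+|[^\w\s]+` describes runs of word characters or runs of punctuation. As a rough stand-in (not NLTK's own code), passing the same pattern to `re.findall` yields the same tokens for this sentence:

```python
import re

text = "Hello, she has build a very good site. She hasn't added any irrelevant info!"

# Either a run of word characters, or a run of characters that are neither word nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]+", text)
print(tokens)
```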


In [None]:
from nltk.tokenize import word_tokenize

text = "Hello, she has build a very good site. She hasn't added any irrelevant info!"
filterdText = word_tokenize(text)
print(filterdText)

['Hello', ',', 'she', 'has', 'build', 'a', 'very', 'good', 'site', '.', 'She', 'has', "n't", 'added', 'any', 'irrelevant', 'info', '!']


### Sentence Tokenization

In [None]:
from nltk.tokenize import sent_tokenize

text = "Dr. Hello NLTK, You have build a very good site. I love visiting your site!"
filterdText = sent_tokenize(text)
print(filterdText)

['Dr. Hello NLTK, You have build a very good site.', 'I love visiting your site!']
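
For contrast, a naive splitter that breaks after every `.`, `!`, or `?` followed by whitespace treats the abbreviation "Dr." as the end of a sentence. The sketch below is a deliberately simple regex baseline, not a description of how `sent_tokenize` works internally.

```python
import re

text = "Dr. Hello NLTK, You have build a very good site. I love visiting your site!"

# Split on whitespace that directly follows sentence-ending punctuation.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)
```

The naive version produces three pieces, with "Dr." on its own, whereas `sent_tokenize` returned two sentences.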


## Removing StopWords

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(stop_words)
print(len(stop_words))

{"should've", 'yourself', 'why', 'hers', 'until', 's', 'our', 'how', 'mightn', 'all', 'my', "wouldn't", 'on', 'any', 'while', 'after', 'herself', 'nor', 'by', "mustn't", 'to', 'at', 'i', 'this', 'be', 'does', 'them', 'where', 'between', 'doesn', 'not', 'the', "shan't", 'did', 'in', 'ourselves', 'their', 'than', 'him', 'further', 'ain', 'haven', "isn't", 'from', 'whom', 'which', 'is', 'having', 'with', 'hadn', 'these', 'a', 'but', 'won', "hadn't", 'just', 'been', 'when', 'wouldn', 'before', 'against', 'more', 'don', "that'll", 'had', 'ma', "couldn't", "didn't", 'o', 't', 'doing', 'through', 'isn', 'was', 'ours', 'we', 'no', 'hasn', 'what', 'will', 'me', 'out', 'has', 'there', 'about', 'under', 'if', 'very', "you'll", 'above', 'now', 'yourselves', 'am', 'you', 'are', 'll', 'were', 'over', 'that', 'because', 'up', 'both', 've', 'he', 'or', 'she', 'into', 'off', 'd', 'weren', 'each', 'too', 'shouldn', 'wasn', 'have', 'here', 'few', 'it', 'and', 'couldn', 'for', 'itself', 'again', 'of', 'it

In [None]:
from nltk.tokenize import word_tokenize

input_str = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
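
Two details of this result are worth noting: the comparison is case-sensitive, so a capitalised stop word such as "The" would not be removed by a lowercase list, and the final "." is kept as a token. A sketch of a stricter filter follows. It uses a small hand-picked stop word set (an assumption for illustration, standing in for NLTK's list), a slightly altered sample sentence, and a plain regex tokenizer.

```python
import re

# Hand-picked subset for illustration; NLTK's English list is much longer.
sample_stop_words = {"the", "is", "a", "for", "to", "with"}

sentence = "The NLTK is a leading platform for building Python programs to work with human language data."

# \w+ keeps only word tokens, which drops the trailing period.
tokens = re.findall(r"\w+", sentence)
# Compare in lowercase so that "The" is treated the same as "the".
kept = [tok for tok in tokens if tok.lower() not in sample_stop_words]
print(kept)
```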


## Stemming
Stemming is a process of linguistic normalization that reduces words to their root form by chopping off derivational affixes. For example, "connection", "connected", and "connecting" all reduce to the common stem "connect".
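
To make the idea of chopping off affixes concrete, here is a toy suffix stripper. It is not the Porter algorithm or any NLTK stemmer; the suffix list and the function name `toy_stem` are assumptions chosen to cover the example words.

```python
# Illustrative suffix list; longer suffixes come first so they are tried before shorter ones.
SUFFIXES = ("ing", "ion", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["connection", "connected", "connecting", "connects"]:
    print(word, "->", toy_stem(word))
```

Real stemmers apply many more rules and conditions, which is why their output can include non-words.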

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
input_str = "been had done languages cities mice fairly"
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

been
had
done
languag
citi
mice
fairli


In [None]:
from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
input_str = "been had done languages cities mice fairly"
input_str = word_tokenize(input_str)
for word in input_str:
    print(lancaster.stem(word))

been
had
don
langu
city
mic
fair


In [None]:
from nltk.stem.snowball import SnowballStemmer  # The English Snowball stemmer is also known as "Porter2", a refinement of the original Porter stemmer.

snowball = SnowballStemmer("english")
input_str = "been had done languages cities mice fairly"
input_str = word_tokenize(input_str)
for word in input_str:
    print(snowball.stem(word))

been
had
done
languag
citi
mice
fair


## Lemmatization
Lemmatization reduces words to their base form, the lemma, which is a linguistically valid word. It relies on a vocabulary and morphological analysis rather than on simple suffix stripping. 
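
The dictionary look-up at the heart of lemmatization can be sketched with a tiny hand-written table. The table and the function `toy_lemmatize` are illustrative assumptions; WordNet's real vocabulary and morphological rules are far richer.

```python
# Tiny hand-written lookup keyed by (word, part of speech); an illustrative stand-in for WordNet.
LEMMA_TABLE = {
    ("mice", "noun"): "mouse",
    ("cities", "noun"): "city",
    ("better", "adj"): "good",
    ("been", "verb"): "be",
}

def toy_lemmatize(word, pos="noun"):
    """Return the dictionary form if the (word, pos) pair is known, else the word unchanged."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("mice"))            # known noun
print(toy_lemmatize("better", "adj"))   # irregular form needs the adjective entry
print(toy_lemmatize("been"))            # looked up as a noun by default, so unchanged
print(toy_lemmatize("been", "verb"))    # the verb entry gives the base form
```

The part-of-speech argument matters: the same surface word can map to different lemmas, or to none, depending on how it is used.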

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice fairly"
input_str = word_tokenize(input_str)
for word in input_str:
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse
fairly


## Stemming VS Lemmatization
Lemmatization is usually more sophisticated than stemming. A stemmer applies suffix-stripping rules to each word on its own, with no knowledge of context or vocabulary. For example, the word "better" has "good" as its lemma. Stemming misses this link because finding it requires a dictionary look-up.

In [None]:
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "better"

print("Lemmatized Word:", lem.lemmatize(word, wordnet.ADJ)) 
print("Stemmed Word:", stem.stem(word))

print()

word = "flying"
print("Lemmatized Word:", lem.lemmatize(word, wordnet.VERB)) 
print("Stemmed Word:", stem.stem(word))

Lemmatized Word: good
Stemmed Word: better

Lemmatized Word: fly
Stemmed Word: fli


## Part-Of-Speech Tagging

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
text = "Hello NLTK, You have build a very good site. I love visiting your site!"
sentence = nltk.sent_tokenize(text)
for sent in sentence:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

[('Hello', 'NNP'), ('NLTK', 'NNP'), (',', ','), ('You', 'PRP'), ('have', 'VBP'), ('build', 'VBN'), ('a', 'DT'), ('very', 'RB'), ('good', 'JJ'), ('site', 'NN'), ('.', '.')]
[('I', 'PRP'), ('love', 'VBP'), ('visiting', 'VBG'), ('your', 'PRP$'), ('site', 'NN'), ('!', '.')]


## Lemmatization with POS Tagging

In [None]:
from nltk.corpus import wordnet

In [None]:
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None

In [None]:
text = "been had done languages cities mice fairly"
print("{0:20}{1:20}".format("Without POS", "With POS"))
sentence = nltk.sent_tokenize(text)
for sent in sentence:
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    for token, tag in tagged:
        pos = pos_tagger(tag)
        without_pos = lemmatizer.lemmatize(token)
        # Fall back to the default lookup when the tag has no WordNet equivalent
        with_pos = lemmatizer.lemmatize(token, pos) if pos else without_pos
        print("{0:20}{1:20}".format(without_pos, with_pos))

Without POS         With POS            
been                be                  
had                 have                
done                do                  
language            language            
city                city                
mouse               mice                
fairly              fairly              


## Synsets in Wordnet
Synsets are groupings of synonymous words that express the same concept. When you look up a word in WordNet, you get a list of Synset instances. 

In [None]:
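from nltk.corpus import wordnet as wn

syn = wn.synsets('dog')[0]     # first Synset returned for "dog"
print(syn.name())              # identifier of the synset
print(syn.definition())        # dictionary-style gloss
print(syn.lemma_names())       # words that belong to this synset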

