<a href="https://colab.research.google.com/github/marsani/MachineLearning-2021/blob/main/textProcessing_regularExpression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What we are going to cover

<ul>
    <li>Text Processing with standard Python libraries</li>
    <li>Regular Expressions</li>
    <li>Basics of NLP - Text Processing with the spaCy library</li>
    <li>Exploratory Data Analysis</li>
    <li>Sentence Similarity via Vectorization</li>
    <li>Text Generation</li>
</ul>

In [None]:
# If you are using Google Colab, upload the text file via the left panel -> Files tab, then run this cell
with open('cv000_29590.txt') as f:
    text = f.read()

In [None]:
text

'films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there\'s never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid \'80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don\'t dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell\'s directors , albert and allen hughes . \ngetting the hughes brothers to direct this seem

In [None]:
lines = text.split('\n')

In [None]:
sentence = lines[0]

In [None]:
sentence.split(" ")

['films',
 'adapted',
 'from',
 'comic',
 'books',
 'have',
 'had',
 'plenty',
 'of',
 'success',
 ',',
 'whether',
 "they're",
 'about',
 'superheroes',
 '(',
 'batman',
 ',',
 'superman',
 ',',
 'spawn',
 ')',
 ',',
 'or',
 'geared',
 'toward',
 'kids',
 '(',
 'casper',
 ')',
 'or',
 'the',
 'arthouse',
 'crowd',
 '(',
 'ghost',
 'world',
 ')',
 ',',
 'but',
 "there's",
 'never',
 'really',
 'been',
 'a',
 'comic',
 'book',
 'like',
 'from',
 'hell',
 'before',
 '.',
 '']
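
The trailing `''` in the output above appears because the sentence ends with a space, so splitting on a single explicit space leaves an empty string after the last separator. The sketch below shows the difference on a short, made-up sample (the `sample` string is an assumption for illustration, not part of the review file):

```python
# Hypothetical sample that, like the review line, ends with a trailing space
sample = "like from hell before . "

# Splitting on one explicit space keeps the empty string after the last separator
by_space = sample.split(" ")

# Splitting with no argument splits on runs of whitespace and drops empty strings
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

Calling `split()` with no argument is usually the safer default for tokenizing on whitespace, which is what the word-count loop further down relies on.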

In [None]:
"Kapil"[0].isupper()

True

In [None]:
"K@pil".isalnum()

False

In [None]:
"98765".isnumeric()

True

In [None]:
"proper noun".capitalize()

'Proper noun'

In [None]:
number = ["one","two","three"]
"-".join(number)

'one-two-three'

In [None]:
words = []
for s in lines:
  words.extend(s.split())

len(words)

802

In [None]:
count = {}
for w in words:
  if w in count:
    count[w] += 1
  else:
    count[w] = 1

count

{'"': 6,
 "'80s": 1,
 '(': 18,
 ')': 18,
 ',': 43,
 '-': 2,
 '.': 23,
 '00': 1,
 '102': 1,
 '12-part': 1,
 '1888': 1,
 '2': 2,
 '30': 1,
 '500': 1,
 ':': 3,
 '?': 3,
 'a': 15,
 'abberline': 2,
 'ably': 1,
 'about': 4,
 'absinthe': 1,
 'accent': 2,
 'acting': 1,
 'acts': 1,
 'actually': 1,
 'adapted': 1,
 'after': 1,
 'alan': 1,
 'albert': 1,
 'all': 3,
 'allen': 1,
 'almost': 1,
 'amounts': 1,
 'an': 3,
 'and': 20,
 'another': 1,
 'anyone': 1,
 'anything': 1,
 'apes': 1,
 'appearance': 1,
 'are': 1,
 'arriving': 1,
 'arthouse': 1,
 'as': 2,
 'at': 3,
 'attempt': 1,
 'back': 1,
 'bad': 1,
 'batman': 1,
 'be': 3,
 'because': 2,
 'been': 3,
 'before': 1,
 'befriends': 1,
 'behind': 1,
 'better': 1,
 'big': 1,
 'black-and-white': 1,
 'blame': 1,
 'bleak': 1,
 'blindly': 1,
 'block': 1,
 'blow': 1,
 'book': 3,
 'books': 1,
 'both': 2,
 'bother': 1,
 'briefed': 1,
 'british': 1,
 'brothers': 1,
 'brought': 1,
 'burton': 1,
 'but': 7,
 'by': 1,
 'called': 2,
 'calls': 1,
 'campbell': 3,
 'can
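
The manual counting loop above can also be written with `collections.Counter` from the standard library. This is a minimal sketch on a made-up token list (`tokens` is a hypothetical stand-in for `words`), not a replacement for the cell above:

```python
from collections import Counter

# Hypothetical token list standing in for `words`
tokens = "the film follows the book , but the film is bleak .".split()

# Counter builds the same token -> frequency mapping as the manual loop
counts = Counter(tokens)

# most_common(n) returns the n most frequent tokens with their counts
top = counts.most_common(2)
print(top)
```

A `Counter` also returns 0 for tokens it has never seen, so no `if w in count` check is needed.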

In [None]:
sent = "This is the first sentence. This is another sentence? This is the third sentence. This is the last sentence"
sent.split(". ")

['This is the first sentence',
 'This is another sentence? This is the third sentence',
 'This is the last sentence']

## Can we do better? Regular Expressions to the Rescue

In [None]:
import re

### This module provides regular expression matching operations.

Below is a list of expressions and what they match.

| Expression | Matches                                             |
| ---------- | --------------------------------------------------- |
| `abc…`     | Literal letters                                     |
| `123…`     | Literal digits                                      |
| `\d`       | Any digit                                           |
| `\D`       | Any non-digit character                             |
| `.`        | Any character except a newline                      |
| `\.`       | A literal period                                    |
| `[abc]`    | Only a, b, or c                                     |
| `[^abc]`   | Any character except a, b, or c                     |
| `[a-z]`    | Characters a to z                                   |
| `[0-9]`    | Digits 0 to 9                                       |
| `\w`       | Any alphanumeric character or underscore            |
| `\W`       | Any character that is not alphanumeric or underscore |
| `{m}`      | Exactly m repetitions                               |
| `{m,n}`    | m to n repetitions                                  |
| `*`        | Zero or more repetitions                            |
| `+`        | One or more repetitions                             |
| `?`        | Optional (zero or one repetition)                   |
| `\s`       | Any whitespace character                            |
| `\S`       | Any non-whitespace character                        |
| `^…$`      | Start and end of the string                         |
| `(…)`      | Capture group                                       |
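
A few of the expressions from the table, tried on a made-up string. This is only a sketch: the `sample` text and the variable names are assumptions chosen for illustration.

```python
import re

sample = "Room 42 opens at 9.30 am, call ext. 0071!"

digit_runs = re.findall(r"\d+", sample)        # + : one or more digits
lower_runs = re.findall(r"[a-z]+", sample)     # character range a to z
bounded = re.findall(r"\d{2,4}", sample)       # {m,n} : 2 to 4 digits in a row
decimal = re.findall(r"\d\.\d", sample)        # \. : a literal period between digits
start = re.match(r"^R\w+\s\d+", sample)        # ^ anchors the match at the start
spellings = re.findall(r"colou?r", "color colour")  # ? : the "u" is optional

print(digit_runs, lower_runs, bounded, decimal, start.group(), spellings)
```

The patterns are written as raw strings (`r"..."`) so that Python passes backslashes such as `\d` through to the `re` module unchanged.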


In [None]:
re.findall(r"\w+", sentence)

['films',
 'adapted',
 'from',
 'comic',
 'books',
 'have',
 'had',
 'plenty',
 'of',
 'success',
 'whether',
 'they',
 're',
 'about',
 'superheroes',
 'batman',
 'superman',
 'spawn',
 'or',
 'geared',
 'toward',
 'kids',
 'casper',
 'or',
 'the',
 'arthouse',
 'crowd',
 'ghost',
 'world',
 'but',
 'there',
 's',
 'never',
 'really',
 'been',
 'a',
 'comic',
 'book',
 'like',
 'from',
 'hell',
 'before']

In [None]:
re.split("[?.!]",sent)

['This is the first sentence',
 ' This is another sentence',
 ' This is the third sentence',
 ' This is the last sentence']
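
The pieces returned above still carry a leading space, because only the terminator itself is consumed by the split. One possible refinement, sketched here under the assumption that sentences end in `?`, `.` or `!`, is to consume the following whitespace as well and drop empty pieces:

```python
import re

sent = "This is the first sentence. This is another sentence? This is the third sentence. This is the last sentence"

# Split on a terminator plus any whitespace that follows it
parts = re.split(r"[?.!]\s*", sent)

# Drop empty strings, which appear when the text ends with a terminator
sentences = [s for s in parts if s]
print(sentences)
```

This still breaks on abbreviations and decimals such as "ext." or "9.30", which is one reason dedicated NLP libraries are used for sentence segmentation.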

In [None]:
sent = "My phone number is +1-972-1234567. Indian number is +91-987654321"
phone = re.findall(r"\+[0-9\-]*", sent)
phone

['+1-972-1234567', '+91-987654321']

In [None]:
text = "Your otp to login to xyz app is 567846. Go to the following link, https://xyz.co/34567"
otp = re.findall("[0-9]{6}",text)
otp

['567846']
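
The pattern `[0-9]{6}` matches any six consecutive digits, including the first six digits of a longer number. The sketch below, on a made-up message (`message` is an assumption for illustration), shows how word boundaries (`\b`) restrict the match to standalone six-digit numbers:

```python
import re

message = "Order 12345678 shipped. Your otp is 567846."

# Without boundaries, the first six digits of the order number also match
loose = re.findall(r"[0-9]{6}", message)

# \b requires a word boundary on both sides of the six digits
strict = re.findall(r"\b[0-9]{6}\b", message)

print(loose)
print(strict)
```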

### Groups

Groups of text show up everywhere.
<ul>
    <li>Names</li>
    <li>Phone Numbers</li>
    <li>Noun phrases - "The" `< adjective >+` `< noun >` - for example, "The funny man"</li>
</ul>

In [None]:
p = phone[0]
p
re.match(r"(?P<country_code>[\+0-9]+)-(?P<area_code>[0-9]*)-(?P<number>[0-9]*)", p).groupdict()

{'area_code': '972', 'country_code': '+1', 'number': '1234567'}
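
The pattern above requires two hyphens, so it would not match the second number in `sent`, which has no area code. One possible variation, sketched here as an assumption rather than a general phone-number parser, makes the area code optional and uses `finditer` to extract every number in one pass:

```python
import re

sent = "My phone number is +1-972-1234567. Indian number is +91-987654321"

# (?: ... )? makes the "area code plus hyphen" part optional without capturing it
phone_pattern = re.compile(
    r"(?P<country_code>\+[0-9]+)-(?:(?P<area_code>[0-9]+)-)?(?P<number>[0-9]+)"
)

# One dictionary of named groups per phone number found in the text
records = [m.groupdict() for m in phone_pattern.finditer(sent)]
print(records)
```

A named group that does not take part in the match is reported as `None` in `groupdict()`.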

## More complicated patterns - Email IDs, URLs, etc
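
As a rough illustration, the sketch below uses deliberately simplified patterns for email addresses and URLs. Both patterns and the `text_sample` string are assumptions made for this example; real-world addresses and URLs allow many forms these patterns do not cover.

```python
import re

text_sample = "Write to support@example.com or visit https://xyz.co/34567 for details."

# Simplified email: local part, "@", then a domain with at least one dot
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Simplified URL: http or https, then everything up to whitespace or a comma
url_pattern = re.compile(r"https?://[^\s,]+")

emails = email_pattern.findall(text_sample)
urls = url_pattern.findall(text_sample)
print(emails, urls)
```

Note that the URL pattern would also swallow a period that directly follows a URL at the end of a sentence, so trailing punctuation may need to be stripped afterwards.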

## Fun Exercise

Build a regular expression that tests the validity of a password.

A valid password:
<ul>
    <li>must contain at least one digit</li>
    <li>must contain at least one special symbol from [#@!?]</li>
    <li>must contain at least one uppercase letter</li>
    <li>must contain at least one lowercase letter</li>
    <li>must be at least 6 and at most 20 characters long</li>
</ul>

In [None]:
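def is_valid(p):
    # Each lookahead asserts one rule; the final character class enforces
    # the allowed characters and the length of 6 to 20.
    pattern = r"(?=.*[#@!?])(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[A-Za-z\d#@!?]{6,20}"
    regex = re.compile(pattern)
    # fullmatch anchors the pattern at both ends, so passwords longer than
    # 20 characters are rejected instead of matching on their first 20
    return regex.fullmatch(p) is not None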

passwords = ["Regex123", "Regex@123", "Rr@12"]
for p in passwords:
    print(is_valid(p))

False
True
False
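
A single combined pattern only says whether a password is valid, not which rule it breaks. As an alternative sketch (the names `RULES` and `failed_rules` are hypothetical and not part of the exercise), each rule can be checked with its own small pattern:

```python
import re

# Each rule is a (description, pattern) pair; the names are hypothetical
RULES = [
    ("one digit", r"\d"),
    ("one special symbol from #@!?", r"[#@!?]"),
    ("one uppercase letter", r"[A-Z]"),
    ("one lowercase letter", r"[a-z]"),
    ("6 to 20 allowed characters", r"\A[A-Za-z\d#@!?]{6,20}\Z"),
]

def failed_rules(password):
    """Return the descriptions of the rules the password does not satisfy."""
    return [name for name, pattern in RULES if not re.search(pattern, password)]

for candidate in ["Regex123", "Regex@123", "Rr@12"]:
    print(candidate, failed_rules(candidate))
```

An empty list means the password passes every rule, which agrees with the `True`/`False` results printed above.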
