<a href="https://colab.research.google.com/github/mohd-faizy/CAREER-TRACK-Machine-Learning-Scientist-with-Python/blob/main/01_Regular_expressions_and_word_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

--- 
<strong> 
    <h1 align='center'>Regular expressions & word tokenization</h1> 
</strong>

---


> This chapter introduces some basic NLP concepts, such as word tokenization and regular expressions, to help parse text. You'll also learn how to handle non-English text and other difficult tokenization cases.



In [None]:
!git clone https://github.com/mohd-faizy/CAREER-TRACK-Machine-Learning-Scientist-with-Python.git

Cloning into 'CAREER-TRACK-Machine-Learning-Scientist-with-Python'...
remote: Enumerating objects: 744, done.
remote: Counting objects: 100% (415/415), done.
remote: Compressing objects: 100% (368/368), done.
remote: Total 744 (delta 110), reused 325 (delta 45), pack-reused 329
Receiving objects: 100% (744/744), 214.73 MiB | 31.38 MiB/s, done.
Resolving deltas: 100% (236/236), done.
Checking out files: 100% (340/340), done.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

plt.style.use('fivethirtyeight')
#plt.style.use('ggplot')
#sns.set_theme()

%matplotlib inline

In [None]:
os.chdir('/content/CAREER-TRACK-Machine-Learning-Scientist-with-Python/11_Introduction_to_Natural_Language_Processing_in_Python/_dataset')
cwd = os.getcwd()
print('Current working directory is ', cwd)

Current working directory is  /content/CAREER-TRACK-Machine-Learning-Scientist-with-Python/11_Introduction_to_Natural_Language_Processing_in_Python/_dataset


In [None]:
ls

articles.csv           fake_or_real_news.csv  news_articles/
english_stopwords.txt  grail.txt              wikipedia_articles/


In [None]:
import re
from pprint import pprint

## **Introduction to regular expressions**

A **RegEx**, or **Regular Expression**, is a sequence of characters that forms a search pattern.

>RegEx can be used to check **if a string contains the specified search pattern**.

```python
import re
```
The module defines several functions and constants to work with **RegEx**. This notebook focuses on five of its functions:

- __`findall`__: *Returns a list of all non-overlapping matches in the string.*

- __`search`__: *Scans the string like `findall`, but returns a Match object for the first match only, or `None` if nothing matches.*

- __`split`__: *Splits the string at every match and returns the pieces as a list.*

- __`sub`__: *Works like find-and-replace in a text editor: every match is replaced with a string of your choice.*

- __`finditer`__: *Returns an iterator that yields a Match object for every match. Because each Match object carries both the matched text and its position, `finditer` gives more detail than `findall`, which is why a later example uses it.*
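
As a minimal sketch of how `finditer` differs from `findall`, the snippet below runs both on the same string. The variable names `substrings` and `spans` are illustrative and not part of the original notebook.

```python
import re

txt = "The rain in Spain"

# findall returns only the matched substrings
substrings = re.findall(r"ai", txt)

# finditer yields Match objects, which also carry the match positions
spans = [(m.group(), m.span()) for m in re.finditer(r"ai", txt)]

print(substrings)
print(spans)
```

When only the matched text matters, `findall` is usually enough; `finditer` is the better fit when positions are needed as well.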



### **Metacharacters**

<p align='center'>
	<a href='#'><img src='https://github.com/mohd-faizy/CAREER-TRACK-Machine-Learning-Scientist-with-Python/blob/main/11_Introduction_to_Natural_Language_Processing_in_Python/img/Metacharacters.png?raw=true'>
    </a>
</p>
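
The image above is not reproduced here, so the following is only a sketch of a few commonly used metacharacters (`.`, `^`, `$`, `{}` and `|`), assuming these are among the ones it lists. The sample string and the `results` dictionary are made up for illustration.

```python
import re

txt = "hello planet"

results = {
    "dot": re.findall(r"he..o", txt),          # . matches any character except a newline
    "caret": re.findall(r"^hello", txt),       # ^ anchors the match at the start
    "dollar": re.findall(r"planet$", txt),     # $ anchors the match at the end
    "braces": re.findall(r"l{2}", txt),        # {2} requires exactly two repetitions
    "pipe": re.findall(r"hello|world", txt),   # | matches either alternative
}

for name, found in results.items():
    print(name, found)
```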

### **Special Sequences**

A special sequence is a __`\`__ followed by one of the characters in the list below, and has a special meaning:

<p align='center'>
	<a href='#'><img src='https://github.com/mohd-faizy/CAREER-TRACK-Machine-Learning-Scientist-with-Python/blob/main/11_Introduction_to_Natural_Language_Processing_in_Python/img/Special%20Sequences.png?raw=true'>
    </a>
</p>
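
A short sketch of some special sequences in use, assuming the table above includes `\d`, `\D`, `\w` and `\b`. The sample sentence is hypothetical.

```python
import re

txt = "Room 42 is ready"

digits = re.findall(r"\d+", txt)             # runs of digits
non_digits = re.findall(r"\D+", txt)         # runs of non-digit characters
words = re.findall(r"\w+", txt)              # runs of word characters
starts_with_r = re.findall(r"\br\w*", txt)   # \b marks a word boundary

print(digits, non_digits, words, starts_with_r)
```

Note that `\br\w*` is case-sensitive, so it picks up `ready` but not `Room`.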

### **Sets**

<p align='center'>
	<a href='#'><img src='https://github.com/mohd-faizy/CAREER-TRACK-Machine-Learning-Scientist-with-Python/blob/main/11_Introduction_to_Natural_Language_Processing_in_Python/img/Set.png?raw=true'>
    </a>
</p>
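
As a rough illustration of sets (square brackets), the sketch below tries a few typical forms on a made-up string. Which forms the image above covers is an assumption.

```python
import re

txt = "The rain in Spain 2024"

any_of = re.findall(r"[arn]", txt)        # any one of a, r or n
in_range = re.findall(r"[a-h]", txt)      # any lowercase letter from a to h
numbers = re.findall(r"[0-9]+", txt)      # runs of digits
negated = re.findall(r"[^a-z\s]+", txt)   # ^ inside a set negates it

print(any_of, in_range, numbers, negated)
```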

#### __`findall()` Function__

The `findall()` function returns a list containing all matches.

In [None]:
import re

txt = "The rain in Spain"

x = re.findall("ai", txt)
print(x)

y = re.findall("Portugal", txt) # If no matches are found, an empty list[] is returned
print(y)

['ai', 'ai']
[]


#### __`search()` Function__

- The `search()` function searches the string for a match, and returns a Match object if there is a match.

- If there is **more than one match**, *only the first occurrence of the match will be returned*

In [None]:
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

y = re.search("Portugal", txt) # If no matches are found, the value None is returned
print(y)

The first white-space character is located in position: 3
None


#### __`split()` Function__

The `split()` function returns a list where the string has been split at each match.

In [None]:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

y = re.split(r"\s", txt, maxsplit=1) # specifying the `maxsplit` parameter
print(y)

['The', 'rain', 'in', 'Spain']
['The', 'rain in Spain']


#### __`sub()` Function__

The `sub()` function replaces the matches with the text of your choice:

In [None]:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "_", txt) # Replace every white-space character with an underscore
print(x)

y = re.sub(r"\s", "_", txt, count=2) # specifying the `count` parameter
print(y)

The_rain_in_Spain
The_rain_in Spain


In [None]:
import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) # this will print an object

<re.Match object; span=(5, 7), match='ai'>


The Match object has properties and methods used to retrieve information about the search, and the result:

- `.span()`: returns a tuple containing the start and end positions of the match.
- `.string`: returns the string passed into the function.
- `.group()`: returns the part of the string where there was a match.

In [None]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

y = re.search(r"\bS\w+", txt)
print(y.string)

z = re.search(r"\bS\w+", txt)
print(z.group())

(12, 17)
The rain in Spain
Spain


> **Note:** If there is no match, the value ***None*** will be returned, instead of the Match Object.
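
Because of this, calling a method such as `.group()` directly on the result of a failed search raises an `AttributeError`. A minimal sketch of guarding against that (the fallback string `"no match"` is an arbitrary choice for illustration):

```python
import re

txt = "The rain in Spain"

match = re.search(r"Portugal", txt)

# match is None here, so check before calling Match methods
if match is not None:
    result = match.group()
else:
    result = "no match"

print(result)
```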

In [None]:
import re

mystr = '''Tata Limited
Dr. David Landsman, executive director
18, Grosvenor Place
London SW1X 7HSc
Phone: +44 (20) 7235 8281
Fax: +44 (20) 7235 8727
Email: tata@tata.co.uk
Website: www.europe.tata.com
Directions: View map

Tata Sons, North America
1700 North Moore St, Suite 1520
Arlington, VA 22209-1911
USA
Phone: +1 (703) 243 9787
Fax: +1 (703) 243 9791
66-66
455-4545
Email: northamerica@tata.com 
Website: www.northamerica.tata.com
Directions: View map fass
harry potter lekin
bahut hi badia aadmi haiaiinaiiiiiiiiiiii'''

# findall, search, split, sub, finditer
# patt = re.compile(r'fass')
# patt = re.compile(r'.adm')
# patt = re.compile(r'^Tata')
# patt = re.compile(r'iin$')
# patt = re.compile(r'ai{2}')
# patt = re.compile(r'(ai){1}')
# patt = re.compile(r'ai{1}|Fax')


# Special Sequences
# patt = re.compile(r'Fax\b')
# patt = re.compile(r'27\b')
patt = re.compile(r'\d{5}-\d{4}')

# Task
# Given a string with a lot of indian phone numbers starting from +91

matches = patt.finditer(mystr)

for match in matches:
    print(match)

<re.Match object; span=(284, 294), match='22209-1911'>
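
The task mentioned in the comments above is left open in the notebook. One possible sketch follows, under the assumption that a number is written as `+91`, an optional space or hyphen, and then exactly ten digits. The sample text `contacts` and the numbers in it are invented for illustration.

```python
import re

# Hypothetical sample text; the phone numbers are made up
contacts = "Office: +91 9876543210, Home: +91-9123456780, UK: +44 2072358281"

# Assumed format: "+91", an optional space or hyphen, then exactly 10 digits
indian_number = re.compile(r"\+91[\s-]?\d{10}\b")

numbers = [m.group() for m in indian_number.finditer(contacts)]
print(numbers)
```

Real-world phone numbers come in many more layouts (brackets, grouped digits), so the pattern would need to be adapted to the actual data.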


### What is Natural Language Processing?

- **Field of study focused on making sense of language**
    - Using statistics and computers

- **Basics of NLP**
    - *Topic identification*
    - *Text classification*

- [__NLP applications include__](https://www.analyticsvidhya.com/blog/2020/07/top-10-applications-of-natural-language-processing-nlp/)
    - ***Search Autocorrect and Autocomplete***
    - ***Language Translator***
    - ***Social Media Monitoring***
    - ***Chatbots***
    - ***Survey Analysis***
    - ***Sentiment analysis***
    - ***Targeted Advertising***
    - ***Hiring and Recruitment***
    - ***Voice Assistants***
    - ***Grammar Checkers***
    - ***Email Filtering***


- **Regular expressions**

    - Strings with a special syntax
    - Allow us to match patterns in other strings
    - Applications of regular expressions
        - Find all web links in a document
        - Parse email addresses, remove/replace unwanted characters

```python 
import re
re.match('abc', 'abcdef')

# <re.Match object; span=(0, 3), match='abc'>
```
```python
import re
word_regex = r'\w+'
re.match(word_regex,'hi there!')

# <re.Match object; span=(0, 2), match='hi'>
```

- **Common Regex patterns**

| pattern | matches | example |
| ------- | ------- | ------- |
| `\w+` | word | 'Magic' |
| `\d` | digit | 9 |
| `\s` | space | ' ' |
| `.*` | wildcard | 'username74' |
| `+` or `*` | greedy match | 'aaaaaa' |
| `\S` | not space | 'no_spaces' |
| `[a-z]` | lowercase group | 'abcdefg' |


- Python's re Module
    - `split`: split a string on regex
    - `findall`: find all patterns in a string
    - `search`: search for a pattern
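    - `match`: match a pattern at the beginning of a string (use `fullmatch` to match the entire string)
    - Pattern first, and the string second
    - Depending on the function, returns a list, an iterator, a string, or a match object (or `None` when nothing matches)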
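
To make the difference between `match` and `search` concrete, here is a small sketch on the string used earlier. `re.fullmatch` is included for comparison and is not discussed in the original text.

```python
import re

text = "hi there!"

at_start = re.match(r"there", text)    # None: "there" is not at position 0
anywhere = re.search(r"there", text)   # found further into the string
whole = re.fullmatch(r"\w+", text)     # None: the space and "!" are not word characters

print(at_start, anywhere.span(), whole)
```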

In [None]:
my_string = "Let's write RegEx!"
PATTERN = r"\w+"
re.findall(PATTERN, my_string)

['Let', 's', 'write', 'RegEx']

### Practicing regular expressions - re.split() and re.findall()
Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at `my_string` first by printing it in the IPython Shell, to determine how you might best match the different steps.

Note: It's important to prefix your regex patterns with `r` to ensure that your patterns are interpreted in the way you want them to. Else, you may encounter problems to do with escape sequences in strings. For example, `"\n"` in Python is used to indicate a new line, but if you use the `r` prefix, it will be interpreted as the raw string `"\n"` - that is, the character `"\"` followed by the character `"n"` - and not as a new line.
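
A quick sketch of what the `r` prefix changes. The word-boundary example is an added illustration: in a normal string, `"\b"` is the backspace character, so the pattern silently stops meaning "word boundary".

```python
import re

plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n

print(len(plain), len(raw))

# \b only reaches the regex engine as "word boundary" with the r prefix
with_raw = re.findall(r"\bin\b", "The rain in Spain")
without_raw = re.findall("\bin\b", "The rain in Spain")

print(with_raw, without_raw)
```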

Remember from the video that the syntax for the regex library is to always pass the **pattern first**, and then the **string second**.

In [None]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

In [None]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']


## Introduction to tokenization
- Tokenization
    - Turning a string or document into **tokens** (smaller chunks)
    - One step in preparing a text for NLP
    - Many different theories and rules
    - You can create your own rules using regular expressions
    - Some examples:
        - Breaking out words or sentences
        - Separating punctuation
        - Separating all hashtags in a tweet
- Why tokenize?
    - Easier to map part of speech
    - Matching common words
    - Removing unwanted tokens
- Other `nltk` tokenizers
    - `sent_tokenize`: tokenize a document into sentences
    - `regexp_tokenize`: tokenize a string or document based on a regular expression pattern
    - `TweetTokenizer`: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points
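
As a sketch of "creating your own rules using regular expressions", the snippet below tokenizes a made-up tweet with plain `re.findall`, keeping hashtags and mentions whole and splitting punctuation into separate tokens. The pattern is one possible rule set, not the one used by any `nltk` tokenizer.

```python
import re

# Hypothetical example tweet
tweet = "Loving #nlp with @datacamp! So fun."

# Alternatives are tried left to right: hashtags/mentions first,
# then plain words, then any single punctuation character
token_pattern = r"[#@]\w+|\w+|[^\w\s]"

tokens = re.findall(token_pattern, tweet)
print(tokens)
```

The order of the alternatives matters: putting `\w+` first would never produce `#nlp` as one token, because `#` would be consumed separately as punctuation.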

### Word tokenization with NLTK
Here, you'll be using the first scene of Monty Python's Holy Grail, which has been pre-loaded as `scene_one`.

Your job in this exercise is to utilize `word_tokenize` and `sent_tokenize` from `nltk.tokenize` to tokenize both words and sentences from Python strings - in this case, the first scene of Monty Python's Holy Grail.

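> Note: Before using NLTK's tokenizers, you must download the `punkt` tokenizer models, for example with `nltk.download('punkt')` (recent NLTK releases may ask for `punkt_tab` instead).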

In [None]:
with open('grail.txt', 'r') as file:
    holy_grail = file.read()
    scene_one = re.split('SCENE 2:', holy_grail)[0]

In [None]:
scene_one

"SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a t

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

### More regex with re.search()
In this exercise, you'll utilize `re.search()` and `re.match()` to find specific tokens. Both search and match expect regex patterns, similar to those you defined in an earlier exercise. You'll apply these regex library methods to the same Monty Python text from the `nltk` corpora.



In [None]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

In [None]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))
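
Note that `.*` is greedy, so the pattern `\[.*\]` runs to the last closing bracket on the line rather than stopping at the first one. A sketch on the opening line of the scene, comparing it with the non-greedy form `.*?`:

```python
import re

line = "SCENE 1: [wind] [clop clop clop]"

greedy = re.search(r"\[.*\]", line).group()   # extends to the last "]" on the line
lazy = re.search(r"\[.*?\]", line).group()    # stops at the first "]"

print(greedy)
print(lazy)
```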

In [None]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

## Advanced tokenization with NLTK and regex
- Regex groups using or `|`
    - OR is represented using `|`
    - You can define a group using `()`
    - You can define explicit character ranges using `[]`
- Regex ranges and groups

| pattern | matches | example |
| ------- | ------- | ------- |
| `[A-Za-z]+` | upper and lowercase English alphabet | 'ABCDEFghijk' |
| `[0-9]` | numbers from 0 to 9 | 9 |
| `[A-Za-z\-\.]+` | upper and lowercase English alphabet, - and . | 'My-Website.com' |
| `(a-z)` | a, - and z | 'a-z' |
| `(\s+\|,)` | spaces or a comma | ', ' |
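
The rows of the table can be checked directly with `re.findall`. This is only a sketch; the input strings are chosen to mirror the table's examples.

```python
import re

letters = re.findall(r"[A-Za-z]+", "ABCDEFghijk 123")
domain = re.findall(r"[A-Za-z\-\.]+", "Visit My-Website.com today")

# Parentheses form a group, not a range: (a-z) matches the literal text "a-z"
literal_group = re.findall(r"(a-z)", "a-z and abc")

# The group matches either a run of whitespace or a single comma
separators = re.findall(r"(\s+|,)", "one, two")

print(letters, domain, literal_group, separators)
```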

### Choosing a tokenizer
Given the following string, which of the below patterns is the best tokenizer? If possible, you want to retain sentence punctuation as separate tokens, but have `'#1'` remain a single token.
```python
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
```

Additionally, `regexp_tokenize` has been imported from `nltk.tokenize`. You can use `regexp_tokenize(string, pattern)` with `my_string` and one of the patterns as arguments to experiment for yourself and see which is the best tokenizer.

In [None]:
from nltk.tokenize import regexp_tokenize

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

pattern1 = r"(\w+|\?|!)"
pattern2 = r"(\w+|#\d|\?|!)"
pattern3 = r"(#\d\w+\?!)"
pattern4 = r"\s+"

In [None]:
pprint(regexp_tokenize(my_string, pattern2))

### Regex with NLTK tokenization
Twitter is a frequently used source for NLP text and tasks. In this exercise, you'll build a more complex tokenizer for tweets with hashtags and mentions using nltk and regex. The `nltk.tokenize.TweetTokenizer` class gives you some extra methods and attributes for parsing tweets.

Here, you're given some example tweets to parse using both `TweetTokenizer` and `regexp_tokenize` from the `nltk.tokenize` module. 

Unlike the syntax for the regex library, with `regexp_tokenize()` you pass the pattern as the second argument.

In [None]:
tweets = ['This is the best #nlp exercise ive found online! #python',
 '#NLP is super fun! <3 #learning',
 'Thanks @datacamp :) #nlp #python']

In [None]:
from nltk.tokenize import regexp_tokenize, TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

In [None]:
# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"[@#]\w+"

# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)
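
Inside square brackets, `|` is an ordinary character rather than alternation, so it should not be used to separate the members of a set. A short sketch with plain `re.findall`, using a variant of the last tweet with a made-up `|pipe` token appended:

```python
import re

# Last tweet from above with a hypothetical "|pipe" token appended
text = "Thanks @datacamp :) #nlp |pipe"

with_pipe = re.findall(r"[@|#]\w+", text)     # "|" is a literal member of the set
without_pipe = re.findall(r"[@#]\w+", text)   # only mentions and hashtags

print(with_pipe)
print(without_pipe)
```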

In [None]:
# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

### Non-ascii tokenization
In this exercise, you'll practice advanced tokenization by tokenizing some non-ascii based text. You'll be using German with emoji!

Here, you have access to a string called `german_text`, which has been printed for you in the Shell. Notice the emoji and the German characters!

Unicode ranges for emoji are:

`\U0001F300-\U0001F5FF`, `\U0001F600-\U0001F64F`, `\U0001F680-\U0001F6FF`, `\u2600-\u26FF`, and `\u2700-\u27BF`.

In [None]:
german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

In [None]:
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

## Charting word length with NLTK


### Charting practice
Try using your new skills to find and chart the number of words per line in the script using matplotlib. The Holy Grail script is loaded for you, and you need to use regex to find the words per line.

Using list comprehensions here will speed up your computations. For example: `my_lines = [tokenize(l) for l in lines]` will call a function tokenize on each line in the list lines. The new transformed list will be saved in the `my_lines` variable.
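
A minimal sketch of that idea on a hypothetical three-line script, with a simple `tokenize` helper built on `re.findall` (the helper and the sample lines are illustrative, not part of the exercise data):

```python
import re

# Hypothetical lines standing in for the real script
lines = [
    "ARTHUR: Whoa there!",
    "SOLDIER #1: Halt!  Who goes there?",
    "[clop clop clop]",
]

def tokenize(line):
    """Return the word tokens found in a line."""
    return re.findall(r"\w+", line)

# Apply tokenize to every line, then count the tokens per line
my_lines = [tokenize(l) for l in lines]
words_per_line = [len(tokens) for tokens in my_lines]

print(words_per_line)
```

Here the speaker names are still counted as words; stripping them first, as the exercise does, changes the counts.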

In [None]:
import matplotlib.pyplot as plt

# Split the script into lines: lines
lines = holy_grail.split('\n')

# Remove the speaker prefix (e.g. "SOLDIER #1:") from each script line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r'\w+') for s in lines]

# Make a frequency list of lengths: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.figure(figsize=(8,8))
plt.hist(line_num_words)
plt.title('# of words per line in holy_grail')
plt.show()