# Chunking and chinking with RegEx - Geeks for Geeks

## Converting chunk to RegEx Pattern

In [1]:
from nltk.chunk.regexp import tag_pattern2re_pattern

In [2]:
print('Chunk Pattern: ', tag_pattern2re_pattern('<DT>?<NN.*>+'))

Chunk Pattern:  (<(DT)>)?(<(NN[^\{\}<>]*)>)+
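
To get a feel for what the converted pattern matches, it can be run with the standard `re` module against a string of angle-bracketed tags. This is only an illustrative sketch: `tag_string` is a hand-built, hypothetical tag sequence for this example, and the code is not claimed to reproduce how NLTK applies the pattern internally.

```python
import re

# Regex printed above for the tag pattern '<DT>?<NN.*>+'.
chunk_re = re.compile(r'(<(DT)>)?(<(NN[^\{\}<>]*)>)+')

# Hypothetical tag sequence for "the book has many chapters".
tag_string = '<DT><NN><VBZ><JJ><NNS>'

# Each match is an optional determiner followed by one or more noun tags.
chunks = [m.group(0) for m in chunk_re.finditer(tag_string)]
print(chunks)  # ['<DT><NN>', '<NNS>']
```

The two matches line up with the two `NP` subtrees that the parser produces for the same sentence.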


## Parsing the sentence with RegexpParser

In [3]:

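from nltk.chunk import RegexpParser

# Grammar for NP chunks:
#   {...} chunks an optional determiner followed by one or more nouns
#   }...{ chinks (removes) any verb from an existing chunk
chunker = RegexpParser(r'''
NP:
{<DT>?<NN.*>+}
}<VB.*>{
''')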

In [4]:
chunker.parse([('the', 'DT'), ('book', 'NN'), (
    'has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])

Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), Tree('NP', [('chapters', 'NNS')])])

# Regular Expressions - Corey Schafer
A regular expression allows us to search for and match specific patterns of text.

A raw string in Python is a string literal prefixed with *r*, which tells Python not to treat backslashes as escape characters.

In [5]:
import re

In [6]:
print('\tTab')
print(r'\tTab')

	Tab
\tTab
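
The difference between the two literals can also be checked by comparing their lengths. A minimal sketch; the variable names `plain` and `raw` are chosen only for this example.

```python
plain = '\tTab'   # '\t' is interpreted as one tab character
raw = r'\tTab'    # the backslash and the 't' stay as two characters

print(len(plain), len(raw))  # 4 5
print(raw == '\\tTab')       # True: same as escaping the backslash by hand
```

This is why regex patterns are normally written as raw strings: sequences such as `\d` reach the `re` module unchanged.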


In [27]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

. ^ $ * + ? { } [ ]  \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-4321
900-555-4321

cat 
mat 
bat 
pat
'''

In [8]:
pattern = re.compile(r'abc')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(1, 4), match='abc'>


The span gives the start and end index of the match, so the matched text can be recovered by slicing the string.

The search is case-sensitive, so it did not match 'ABC'.

In [9]:
print(text_to_search[1:4])

abc
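
The span does not have to be copied by hand from the printed output: a match object exposes it through `span()`, and `group()` returns the matched text directly. The sketch below uses a short hypothetical `sample` string rather than `text_to_search`, and also shows the `re.IGNORECASE` flag as one way to make the same pattern match 'ABC'.

```python
import re

sample = 'abc ABC'  # hypothetical short string for this example

m = re.search(r'abc', sample)
start, end = m.span()  # same numbers as in the printed Match object
print(start, end, sample[start:end], m.group())  # 0 3 abc abc

# With re.IGNORECASE the pattern matches both spellings.
both = re.findall(r'abc', sample, flags=re.IGNORECASE)
print(both)  # ['abc', 'ABC']
```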


Special characters must be escaped because they have special meanings in regex. For example, an unescaped period matches any character except a newline.

In [22]:
#pattern = re.compile(r'.') 
pattern = re.compile(r'\.') 
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(76, 77), match='.'>
<re.Match object; span=(113, 114), match='.'>
<re.Match object; span=(135, 136), match='.'>
<re.Match object; span=(139, 140), match='.'>


In [11]:
pattern = re.compile(r'coreyms\.com') 
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(106, 117), match='coreyms.com'>
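
When the text to match is known literally, `re.escape` can add the backslashes instead of writing them by hand. A small sketch under that assumption; the `haystack` string is made up for the example.

```python
import re

literal = 'coreyms.com'
escaped = re.escape(literal)
print(escaped)  # coreyms\.com

haystack = 'coreyms.com coreymsXcom'  # hypothetical test string

# Unescaped, the period matches any character, so 'coreymsXcom' also matches.
loose = re.findall(literal, haystack)
# Escaped, only the literal period matches.
strict = re.findall(escaped, haystack)
print(loose)   # ['coreyms.com', 'coreymsXcom']
print(strict)  # ['coreyms.com']
```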


To match any digit we use \d, which lets us build a pattern for a phone number. For the separator we can use a period (.), which matches any character, or a character set ([]), which matches only the listed characters; inside a character set these characters do not need to be escaped.

In [23]:
#pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d') 
#pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d') 
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d') 
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(158, 170), match='800-555-4321'>
<re.Match object; span=(171, 183), match='900-555-4321'>
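
The two commented-out patterns in the cell above differ only in the separator. The sketch below compares them on three of the phone numbers from `text_to_search`, copied into a separate `numbers` string so the example runs on its own.

```python
import re

numbers = '321-555-4321\n123.555.1234\n123*555*1234'

# '.' accepts any separator, including '*'.
any_sep = re.findall(r'\d\d\d.\d\d\d.\d\d\d\d', numbers)
# '[-.]' accepts only a dash or a literal period.
dash_or_dot = re.findall(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d', numbers)

print(any_sep)      # ['321-555-4321', '123.555.1234', '123*555*1234']
print(dash_or_dot)  # ['321-555-4321', '123.555.1234']
```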


Character sets can also express ranges, such as [a-z] or [1-5].

Outside a character set, ^ matches the beginning of the string. As the first character inside a character set it negates the set, meaning 'anything but'.

In [28]:
#pattern = re.compile(r'[a-z]') 
#pattern = re.compile(r'[1-5]') 
#pattern = re.compile(r'[^a-zA-Z]') 
pattern = re.compile(r'[^b]at')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(185, 188), match='cat'>
<re.Match object; span=(190, 193), match='mat'>
<re.Match object; span=(200, 203), match='pat'>
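
The sketch below puts ranges and the two roles of ^ side by side on small hypothetical strings (`words` and `digits`). The `re.MULTILINE` flag is not used in the cells above; it is included only to show that ^ anchors to the start of the whole string unless that flag is set.

```python
import re

words = 'cat\nmat\nbat\npat'
digits = '1234567890'

# A range inside a character set.
low_digits = re.findall(r'[1-5]', digits)
print(low_digits)  # ['1', '2', '3', '4', '5']

# Outside a set, ^ anchors to the start of the string.
at_start = re.findall(r'^.at', words)
print(at_start)  # ['cat']

# With re.MULTILINE, ^ also matches at the start of every line.
each_line = re.findall(r'^.at', words, flags=re.MULTILINE)
print(each_line)  # ['cat', 'mat', 'bat', 'pat']

# Inside a set, ^ negates: any character except 'b', followed by 'at'.
not_b = re.findall(r'[^b]at', words)
print(not_b)  # ['cat', 'mat', 'pat']
```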
