## Compiling and Matching

In Python's re module, the re.compile() function takes a regular expression pattern as an argument and compiles it into a regular expression object, which you can later use to find matching text. The regular expression object below matches exactly four consecutive upper- or lowercase letters.

regular_expression_object = re.compile("[A-Za-z]{4}")

Regular expression objects have a .match() method that takes a string of text as an argument and looks for a single match to the regular expression that starts at the beginning of the string. 

result = regular_expression_object.match("Toto")

If .match() finds a match that starts at the beginning of the string, it will return a match object. The match object lets you know what piece of text the regular expression matched, and at what index the match begins and ends. If there is no match, .match() will return None.

With the match object stored in result, you can access the matched text by calling result.group(0). If you use a regex containing capture groups, you can access these groups by calling .group() with the appropriately numbered capture group as an argument.
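The compile-then-match steps above can be put together as a short sketch. The second pattern and the string "Dorothy Gale" are illustrative assumptions added here to show numbered capture groups; they are not part of the original example.

```python
import re

# Compile once; the object can be reused for many strings.
# The pattern matches exactly four consecutive ASCII letters.
regular_expression_object = re.compile("[A-Za-z]{4}")

result = regular_expression_object.match("Toto")
matched_text = result.group(0)          # the full text that was matched
span = (result.start(), result.end())   # indices where the match begins and ends

# .match() only looks at the start of the string, so this returns None.
no_result = regular_expression_object.match("42 Toto")

# Illustrative pattern with two capture groups (an assumption for this sketch).
name_pattern = re.compile(r"([A-Za-z]+) ([A-Za-z]+)")
name_match = name_pattern.match("Dorothy Gale")
first_name = name_match.group(1)  # text captured by the first group
last_name = name_match.group(2)   # text captured by the second group

print(matched_text, span, no_result, first_name, last_name)
```

Because .match() can return None, real code would normally check the result before calling .group() on it.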

Instead of compiling the regular expression first and then looking for a match in separate lines of code, you can simplify your match to one line:

result = re.match("[A-Za-z]{4}","Toto")

With this syntax, re's .match() function takes a regular expression pattern as the first argument and a string as the second argument.

![i](https://i.imgur.com/Y37Fia4.jpg)

![i1](https://i.imgur.com/r122ela.jpg)

## Searching and Finding

You can make your regular expression matching more flexible with the .search() method. Unlike .match(), which only finds matches at the start of a string, .search() scans a piece of text from left to right and returns a match object for the first match to the given regular expression. If no match is found, .search() returns None. For example, to search for a sequence of 8 word characters in the string "Are you a Munchkin?":

result = re.search(r"\w{8}", "Are you a Munchkin?")

Using .search() on the string above will find a match of "Munchkin", while using .match() on the same string would return None!
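The difference between the two calls can be checked side by side. This is a minimal sketch using the same pattern and string as above; only the variable names are new.

```python
import re

sentence = "Are you a Munchkin?"
pattern = r"\w{8}"  # raw string, so the backslash reaches the regex engine unchanged

# .search() scans the whole string for the first match.
search_result = re.search(pattern, sentence)

# .match() only tries at index 0, where "Are" is too short to match.
match_result = re.match(pattern, sentence)

print(search_result.group(0))  # Munchkin
print(search_result.span())    # (10, 18)
print(match_result)            # None
```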

So far you have used methods that only return one piece of matching text. What if you want to find all the occurrences of a word or keyword in a piece of text to determine a frequency count? Step in the .findall() method!

Given a regular expression as its first argument and a string as its second argument, .findall() will return a list of all non-overlapping matches of the regular expression in the string. Consider the below piece of text:

text = "Everything is green here, while in the country of the Munchkins blue was the favorite color. But the people do not seem to be as friendly as the Munchkins, and I'm afraid we shall be unable to find a place to pass the night."

To find all non-overlapping sequences of 8 word characters in the sentence you can do the following:

list_of_matches = re.findall(r"\w{8}", text)

.findall() will thus return the list ['Everythi', 'Munchkin', 'favorite', 'friendly', 'Munchkin'].

It’s important to note that a raw count only means something relative to the size of the text: the same number of occurrences carries more weight in a short passage than in a long one!
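One way to relate a word's count to the size of the text is to divide it by the total number of words. The sketch below does that for the passage above. The word-splitting pattern (runs of letters and apostrophes, lowercased) is a simplifying assumption for illustration, not a full tokenizer.

```python
import re
from collections import Counter

text = "Everything is green here, while in the country of the Munchkins blue was the favorite color. But the people do not seem to be as friendly as the Munchkins, and I'm afraid we shall be unable to find a place to pass the night."

# All non-overlapping runs of exactly eight word characters.
list_of_matches = re.findall(r"\w{8}", text)

# Assumption: treat each run of letters or apostrophes as one word.
words = re.findall(r"[a-z']+", text.lower())
word_counts = Counter(words)

# Relative frequency = occurrences of the word / total words in the text.
munchkins_count = word_counts["munchkins"]
munchkins_share = munchkins_count / len(words)

print(list_of_matches)
print(munchkins_count, round(munchkins_share, 3))
```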

![i](https://i.imgur.com/iY2MZpp.jpg)

## Part-of-Speech Tagging

While it is useful to match and search for patterns of individual characters in a text, you can often find more meaning by analyzing text on a word-by-word basis, focusing on the part of speech of each word in a sentence. This process of identifying and labeling the part of speech of words is known as part-of-speech tagging!

Consider the parts of speech in the following sentence:

Wow! Ramona and her class are happily studying the new textbook she has on NLP.

Noun: the name of a person (Ramona, class), place, thing (textbook), or idea (NLP)

Pronoun: a word used in place of a noun (her, she)

Determiner: a word that introduces, or “determines”, a noun (the)

Verb: expresses action (studying) or being (are, has)

Adjective: modifies or describes a noun or pronoun (new)

Adverb: modifies or describes a verb, an adjective, or another adverb (happily)

Preposition: a word placed before a noun or pronoun to form a phrase modifying another word in the sentence (on)

Conjunction: a word that joins words, phrases, or clauses (and)

Interjection: a word used to express emotion (Wow)

We can automate the part-of-speech tagging process with nltk's pos_tag() function! The function takes one argument, a list of words in the order they appear in a sentence, and returns a list of tuples, where the first entry in the tuple is a word and the second is the part-of-speech tag.

![i](https://i.imgur.com/kQ6iQ2u.jpg)
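The list-of-tuples format is easy to work with in plain Python. The sketch below does not call pos_tag() itself, since that needs the nltk package and its tagger data installed. Instead it uses a hand-copied list in the same (word, tag) format, taken from the tagged sentence in the Chunking section of these notes. The variable names are illustrative.

```python
from collections import Counter

# A list in the format pos_tag() returns: (word, part-of-speech tag) tuples.
tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
                   ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

# Separate the words from their tags.
words = [word for word, tag in tagged_sentence]
tags = [tag for word, tag in tagged_sentence]

# Count how often each tag occurs.
tag_counts = Counter(tags)

# Keep only words whose tag starts with NN (the noun tags).
nouns = [word for word, tag in tagged_sentence if tag.startswith("NN")]

print(words)
print(tag_counts)
print(nouns)
```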


## Chunking

Given part-of-speech tagged text, we can now use regular expressions to find patterns in sentence structure that give insight into the meaning of a text. This technique of grouping words by their part-of-speech tag is called ***chunking***.

With chunking in nltk, you can define a pattern of parts-of-speech tags using a modified notation of regular expressions. You can then find non-overlapping matches, or chunks of words, in the part-of-speech tagged sentences of a text.

The regular expression you build to find chunks is called chunk grammar. A piece of chunk grammar can be written as follows:

chunk_grammar = "AN: {<JJ><NN>}"

AN is a user-defined name for the kind of chunk you are searching for. You can use whatever name makes sense given your chunk grammar. In this case AN stands for adjective-noun.

A pair of curly braces {} surrounds the actual chunk grammar.

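<JJ> operates similarly to a regex character class, matching any word tagged JJ (adjective).

<NN> matches any word tagged NN, which in the Penn Treebank tag set used by nltk's default tagger means a singular or mass noun. Plural nouns carry the tag NNS, so a tag pattern such as <NN.*> is needed to match both.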

The chunk grammar above will thus match any adjective that is followed by a noun.
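To make the meaning of that pattern concrete, here is a plain-Python sketch that scans a tagged sentence for a JJ tag immediately followed by an NN tag. It only illustrates what the grammar describes and is not how nltk implements chunking. The function name find_adjective_noun_pairs is hypothetical.

```python
def find_adjective_noun_pairs(tagged_words):
    """Return non-overlapping (adjective, noun) word pairs, scanning left to right."""
    pairs = []
    i = 0
    while i < len(tagged_words) - 1:
        word, tag = tagged_words[i]
        next_word, next_tag = tagged_words[i + 1]
        if tag == "JJ" and next_tag == "NN":
            pairs.append((word, next_word))
            i += 2  # skip past the pair so matches never overlap
        else:
            i += 1
    return pairs


tagged = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
          ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]
pairs = find_adjective_noun_pairs(tagged)
print(pairs)  # [('emerald', 'city')]
```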

To use the chunk grammar defined, you must create an nltk RegexpParser object and give it a piece of chunk grammar as an argument.

chunk_parser = RegexpParser(chunk_grammar)

You can then use the RegexpParser object’s .parse() method, which takes a list of part-of-speech tagged words as an argument, and identifies where such chunks occur in the sentence!

Consider the part-of-speech tagged sentence below:

pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

You can chunk the sentence to find any adjectives followed by a noun with the following:

chunked = chunk_parser.parse(pos_tagged_sentence)
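Put together, the chunking steps above might look like the following sketch. It assumes the nltk package is installed; RegexpParser works on already-tagged input, so no tagger data is needed here. The an_chunks variable is an illustrative addition showing one way to pull the matched chunks out of the tree that .parse() returns.

```python
from nltk import RegexpParser

# Chunk grammar: an adjective (JJ) immediately followed by a noun (NN).
chunk_grammar = "AN: {<JJ><NN>}"
chunk_parser = RegexpParser(chunk_grammar)

pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
                       ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

# .parse() returns an nltk Tree; each chunk is a subtree labeled "AN".
chunked = chunk_parser.parse(pos_tagged_sentence)

# Collect the words of every AN chunk.
an_chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "AN")
]

print(chunked)
print(an_chunks)
```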

![i1](https://i.imgur.com/pQw0KcC.jpg)

![i](https://i.imgur.com/jTkMSNX.jpg)
