# Part 04: Parsing, Data Analysis, and Report Generation

# Regular Expression

Regular expressions are used to identify whether a pattern exists in a given sequence of characters (a string). They help in manipulating textual data, which is often a prerequisite for data science projects that involve text mining. You have probably come across some applications of regular expressions: they are used on the server side to validate the format of email addresses or passwords during registration, and for parsing text data files to find, replace, or delete certain strings.

> String containing a combination of normal characters and special meta characters that describes patterns to find text or positions within a text

As powerful as regular expressions are, they can be difficult to learn at first and the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.

That said, learning (and loving!) regular expressions is a worthwhile investment:
- Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
- Regular expressions are often faster to execute than their manual equivalents.
- Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.

We could probably fill a whole course with the intricacies of regular expressions, but instead we're going to give you a tour of the main components.

One thing to keep in mind before we start: don't expect to remember all of the regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details. This will mean you can quickly jog your memory whenever you need regular expressions.

With that in mind, don't be put off if some things in this part don't stick in your memory. As long as you are able to write and understand regular expressions with the help of documentation and/or other reference guides, you have all the skills you need to excel.

[Online tester](https://regexr.com/)

### The Regular Expression Module

When working with regular expressions, we use the term pattern to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string "and" within another string, the regex pattern for that is simply `and`.

In Python, regular expressions are supported by the re module. That means that if you want to start using them in your Python scripts, you have to import this module with the help of import:

In [2]:
import re

[re — Regular expression operations](https://docs.python.org/3/library/re.html)

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
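A minimal sketch of this rule, using made-up sample strings: a str pattern searches a str, a bytes pattern searches bytes, and mixing the two raises a TypeError.

```python
import re

# A str pattern searches a str; a bytes pattern searches bytes.
text_match = re.search(r"and", "hand")
bytes_match = re.search(rb"and", b"hand")
print(text_match.group())   # and
print(bytes_match.group())  # b'and'

# Mixing a bytes pattern with a str subject raises TypeError.
try:
    re.search(rb"and", "hand")
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("TypeError:", exc)
```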

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's use of the same character for the same purpose in string literals: to match a literal backslash, for example, the pattern would have to be written as '\\\\'.

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.

In Python, a backslash followed by certain characters represents an escape sequence — like the \n sequence — which we previously learned represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring \b:

In [3]:
print('hello\b world')

hello world


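The escape sequence \b represents a backspace character, so the characters \b themselves never appear in the output. How the backspace is displayed depends on the environment: a terminal typically moves the cursor back so that the next character overwrites the preceding letter, while a notebook may simply hide the character. The character sequence \b also has a special meaning in regular expressions (which we'll learn about later), so we need a way to write these characters without triggering the escape sequence.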

One way is to add an extra backslash before the "b":

In [4]:
print('hello\\b world')

hello\b world


This can make regular expressions even more difficult to read and interpret, so instead we use raw strings, which we denote by prefixing our string with the r character. Let's take a look at the code from above with a raw string:

In [5]:
print(r'hello\b world')

hello\b world


We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something that causes your regex to break.

The regular expressions in this section are passed directly to module-level functions such as re.match() and re.search(). A pattern can also be compiled into a pattern object with re.compile(), and that object offers the same operations as methods. Only the most significant ones will be covered here; consult the re docs for a complete listing.

<table class="docutils align-default">
<colgroup>
<col style="width: 28%">
<col style="width: 72%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Method/Attribute</p></th>
<th class="head"><p>Purpose</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">match()</span></code></p></td>
<td><p>Determine if the RE matches at the beginning
of the string.</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">search()</span></code></p></td>
<td><p>Scan through a string, looking for any
location where this RE matches.</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">findall()</span></code></p></td>
<td><p>Find all substrings where the RE matches, and
returns them as a list.</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">finditer()</span></code></p></td>
<td><p>Find all substrings where the RE matches, and
returns them as an <a class="reference internal" href="../glossary.html#term-iterator"><span class="xref std std-term">iterator</span></a>.</p></td>
</tr>
</tbody>
</table>

match() and search() return None if no match can be found. If they’re successful, a match object instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.
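A minimal sketch of working with a compiled pattern object created with re.compile(). The sample strings are illustrative only; the four methods are the ones listed in the table.

```python
import re

# Compile once, then reuse the pattern object's methods.
pattern = re.compile(r"dog")

at_start = pattern.match("dog day is today")    # match only at position 0
anywhere = pattern.search("today is dog day")   # first match anywhere
all_hits = pattern.findall("dog eat dog")       # list of matched strings
spans = [m.span() for m in pattern.finditer("dog eat dog")]

print(at_start.group())   # dog
print(anywhere.span())    # (9, 12)
print(all_hits)           # ['dog', 'dog']
print(spans)              # [(0, 3), (8, 11)]
print(pattern.match("today is dog day"))  # None
```

The module-level functions such as re.match() and re.search() take the pattern as their first argument and behave the same way.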

- `re.match(pattern, string, flags=0)`

If zero or more characters at the beginning of string match this regular expression, return a corresponding match object.

Return None if the string does not match the pattern; note that this is different from a zero-length match. Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
If you want to locate a match anywhere in string, use search() instead.

In [6]:
print(re.match(r'dog', 'today is dog day'))

None


In [7]:
match = re.match(r'dog', 'dog day is today')
print(match)

<re.Match object; span=(0, 3), match='dog'>


Now you can query the match object for information about the matching string. Match object instances also have several methods and attributes; the most important ones are:

<table class="docutils align-default">
<colgroup>
<col style="width: 29%">
<col style="width: 71%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Method/Attribute</p></th>
<th class="head"><p>Purpose</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">group()</span></code></p></td>
<td><p>Return the string matched by the RE</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">start()</span></code></p></td>
<td><p>Return the starting position of the match</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">end()</span></code></p></td>
<td><p>Return the ending position of the match</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">span()</span></code></p></td>
<td><p>Return a tuple containing the (start, end)
positions  of the match</p></td>
</tr>
</tbody>
</table>

In [8]:
match.group()

'dog'

In [9]:
match.start()

0

In [10]:
match.end()

3

In [11]:
match.span()

(0, 3)

group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both start and end indexes in a single tuple. Since the match() method only checks if the RE matches at the start of a string, start() will always be zero. However, the search() method of patterns scans through the string, so the match may not start at zero in that case.

- `re.search(pattern, string, flags=0)`

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

In [12]:
re.search(r"and", "hand")

<re.Match object; span=(1, 4), match='and'>

In actual programs, the most common style is to store the match object in a variable, and then check if it was None. This usually looks like:

In [13]:
m = re.search(r"and", "hand")
if m:
    print('Match found: ', m.group())
else:
    print('No match')

Match found:  and


> Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.
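The MULTILINE note for re.match() can be checked with a small sketch. The two-line sample string is made up, and the ^ anchor used with re.search() is covered later in this part.

```python
import re

text = "first line\ndog on second line"

# re.match() is tied to the start of the whole string, even with re.MULTILINE.
match_result = re.match(r"dog", text, flags=re.MULTILINE)
print(match_result)   # None

# re.search() with ^ and re.MULTILINE can match at the start of any line.
m = re.search(r"^dog", text, flags=re.MULTILINE)
print(m.span())       # (11, 14)
```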

- `re.findall(r'regex', string)`

Find all matches of a pattern. 

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

In [14]:
re.findall(r"movies", "Love movies! I had fun yesterday going to the movies")

['movies', 'movies']
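The effect of groups on the return value of findall() is easiest to see side by side. In this sketch the parentheses create groups, a piece of syntax this section has not introduced yet, and the split of "movies" into "mov" and "ies" is arbitrary and only for illustration.

```python
import re

text = "Love movies! I had fun yesterday going to the movies"

# No group: each item is the full match.
no_group = re.findall(r"movies", text)

# One group: each item is the text captured by that group.
one_group = re.findall(r"(mov)ies", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(mov)(ies)", text)

print(no_group)    # ['movies', 'movies']
print(one_group)   # ['mov', 'mov']
print(two_groups)  # [('mov', 'ies'), ('mov', 'ies')]
```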

- `re.finditer(pattern, string, flags=0)`

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

findall() has to create the entire list before it can be returned as the result. The finditer() method returns a sequence of match object instances as an iterator:

In [15]:
iterator = re.finditer(r"movies", "Love movies! I had fun yesterday going to the movies")

In [16]:
iterator

<callable_iterator at 0x7f8895fb6f10>

In [17]:
for match in iterator:
    print(match.span())

(5, 11)
(46, 52)


- `re.split(r'regex', string)`

Split string at each match: 

In [18]:
re.split(r"!", "Nice Place to eat! I'll come back! Excellent meat!")

['Nice Place to eat', " I'll come back", ' Excellent meat', '']

- `re.sub(r'regex', new, string)`

Replace one or many matches with a string: 

In [19]:
re.sub(r"yellow", "nice", "I have a yellow car and a yellow house in a yellow neighborhood")

'I have a nice car and a nice house in a nice neighborhood'
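By default re.sub() replaces every match. A short sketch of limiting it to the first match with the count argument, and of counting replacements with re.subn():

```python
import re

text = "I have a yellow car and a yellow house in a yellow neighborhood"

# count=1 replaces only the first match; the default count=0 replaces all.
first_only = re.sub(r"yellow", "nice", text, count=1)
print(first_only)
# I have a nice car and a yellow house in a yellow neighborhood

# re.subn() returns the new string together with the number of replacements.
replaced, n = re.subn(r"yellow", "nice", text)
print(n)  # 3
```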

- `re.fullmatch(pattern, string, flags=0)`

If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
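A small sketch contrasting re.match() with re.fullmatch() on the "dog" pattern used earlier; the sample strings are illustrative.

```python
import re

# match() succeeds because the pattern is found at the start.
start_only = re.match(r"dog", "dog day")
print(start_only)       # <re.Match object; span=(0, 3), match='dog'>

# fullmatch() fails because the trailing " day" is not covered by the pattern.
whole_fail = re.fullmatch(r"dog", "dog day")
print(whole_fail)       # None

# fullmatch() succeeds only when the pattern accounts for the entire string.
whole_ok = re.fullmatch(r"dog", "dog")
print(whole_ok.span())  # (0, 3)
```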

### Character Sets

The first piece of syntax we'll learn is called a set. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match in square brackets.

In [20]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

If you look closely, you'll notice the first string contains the substring Blue with a capital letter, whereas the third string contains the substring blue in all lowercase. We can use the set [Bb] for the first character so that we can match both variations, and then use that to count how many times Blue or blue occur in the list:

In [21]:
blue_mentions = 0
pattern = r"[Bb]lue"

for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1

print(blue_mentions)

2


We're going to use this technique to find out how many times Python is mentioned in the title of stories in our Hacker News dataset. We'll use a set to check for both Python with a capital 'P' and python with a lowercase 'p'.

In [22]:
import pandas as pd 

hn = pd.read_csv('data/hacker_news.csv')
titles = hn["title"].tolist()
python_mentions = 0
pattern = r"[Pp]ython"

for t in titles:
    if re.search(pattern, t):
        python_mentions += 1

In [23]:
python_mentions

160

### Using Regular Expressions to Select Data

So far, we have used regular expressions to count how many titles contain Python or python. What if we wanted to view those titles?

In [24]:
python_titles = []
for t in titles:
    if re.search(pattern, t):
        python_titles.append(t)

In [25]:
python_titles[:10]

['From Python to Lua: Why We Switched',
 'Ubuntu 16.04 LTS to Ship Without Python 2',
 'Create a GUI Application Using Qt and Python in Minutes',
 "How I Solved GCHQ's Xmas Card with Python and Pycosat. (Explanation and Source)",
 'Unikernel Power Comes to Java, Node.js, Go, and Python Apps',
 'Developing a computational pipeline using the asyncio module in Python 3',
 'Show HN: Minimal, modern embedded V8 for Python',
 'Python integration for the Duktape Javascript interpreter',
 'Python 3 on Google App Engine flexible environment now in beta',
 'IronPython 3 (python for .net) development restarted']

### Character Classes

So far, we've learned how to perform simple matches with sets. Let's continue by looking at a more complex example.

Some stories submitted to Hacker News include a topic tag in brackets, like [pdf]. Here are a few examples of story titles with these tags:

    [video] Google Self-Driving SUV Sideswipes Bus
    New Directions in Cryptography by Diffie and Hellman (1976) [pdf]
    Wallace and Gromit  The Great Train Chase (1993) [video]

Our task here is to find how many titles in our dataset have tags.

Our first inclination may be to create the regex [pdf]. Unfortunately, the brackets would be interpreted as a set, so our pattern would match the single characters p, d, or f.

To match the substring "[pdf]", we can use backslashes to escape both the open and closing brackets: \[pdf\].

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like pdf and video) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use character classes. Character classes allow us to match certain groups of characters. Two examples of character classes are:
- The set notation using brackets to match any of a number of characters.
- The range notation, which matches a range of characters such as digits (like [0-9]).

Let's look at a summary of syntax for some of the regex character classes:

<img alt="character classes one" src="https://s3.amazonaws.com/dq-content/354/character_classes_v2_1.svg">

There are two new things we can observe from this table:
- Ranges can be used for letters as well as numbers.
- Sets and ranges can be combined.

There are also some other common character classes which we'll use a lot.

<img alt="character classes two" src="https://s3.amazonaws.com/dq-content/354/character_classes_v2_2.svg">

Negative character classes match every character except those in a given character class. Let's look at a table of the common negative character classes:

<img alt="negative character classes" src="https://s3.amazonaws.com/dq-content/354/negative_character_classes.svg">
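One possible way to approach the tag-counting task, sketched on the three example titles plus one untagged title instead of the full dataset. It assumes a tag consists only of word characters and uses the + quantifier (one or more), which is explained in the quantifiers section.

```python
import re

titles = [
    "[video] Google Self-Driving SUV Sideswipes Bus",
    "New Directions in Cryptography by Diffie and Hellman (1976) [pdf]",
    "Wallace and Gromit  The Great Train Chase (1993) [video]",
    "From Python to Lua: Why We Switched",
]

# \[ and \] match literal brackets; \w+ matches one or more word characters.
tag_pattern = r"\[\w+\]"
tagged = [t for t in titles if re.search(tag_pattern, t)]
print(len(tagged))  # 3

# Without the escapes, [pdf] is a set: it matches a single p, d, or f.
unescaped = re.findall(r"[pdf]", "pdf files")
print(unescaped)    # ['p', 'd', 'f', 'f']

# \d matches a digit; its negative class \D matches anything that is not a digit.
digits = re.findall(r"\d", "Python 3.8")
non_digits = re.findall(r"\D", "3.8")
print(digits, non_digits)  # ['3', '8'] ['.']
```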

#### Exercise 1: Are they bots?

The company that you are working for asked you to perform a sentiment analysis using a dataset with tweets. First of all, you need to do some cleaning and extract some information.
While printing out some text, you realize that some tweets contain user mentions. Some of these mentions follow a very strange pattern. A few examples that you notice: @robot3!, @robot5& and @robot7#

To analyze if those users are bots, you will do a proof of concept with one tweet and extract them using the .findall() method.

In [26]:
with open('data/short_tweets.csv') as f:
    sentiment_analysis = f.read()

In [27]:
# Write the regex
regex = r"@robot\d\W"

# Find all matches of regex
print(re.findall(regex, sentiment_analysis))

['@robot9!', '@robot4&', '@robot9$', '@robot7%']


> The advantage of regular expressions is that you can use both normal characters and metacharacters. You can easily extract complex patterns, which would be more complicated otherwise.

#### Exercise 2: Find the numbers

While examining the tweet text in your dataset, you detect that some tweets carry extra information. The text contains the number of retweets, user mentions, and likes of that tweet. So, you decide to extract this important information.

The information is given as in this example:

Agh...snow! User_mentions:9, likes: 5, number of retweets: 4

You also bring your list of metacharacters: \d digit, \w word character, \s whitespace.

In [28]:
# Write a regex to obtain user mentions
print(re.findall(r"User_mentions:\d", sentiment_analysis))

['User_mentions:2']


In [29]:
# Write a regex to obtain number of likes
print(re.findall(r"likes:\s\d", sentiment_analysis))

['likes: 9']


In [30]:
# Write a regex to obtain number of retweets
print(re.findall(r"number\sof\sretweets:\s\d", sentiment_analysis))

['number of retweets: 7']


> Using metacharacters in regular expressions will allow you to match types of characters such as digits. Remember to always specify whitespaces as \s.

#### Exercise 3: Match and split

Some of the tweets in your dataset were downloaded incorrectly. Instead of having spaces to separate words, they have strange characters. You decide to use regular expressions to handle this situation. You print some of these tweets to understand which pattern you need to match.

You notice that the sentences are always separated by a special character, followed by a number, the word break, and after that, another special character, e.g. &4break!. The words are always separated by a special character, the word new, and a normal random character, e.g. #newH.

In [31]:
# Write a regex to match pattern separating sentences
regex_sentence = r"\W\dbreak\W"

#print(re.findall(regex_sentence, sentiment_analysis))

# Replace the regex_sentence with a space
sentiment_sub = re.sub(regex_sentence, ' ', sentiment_analysis)
#print(re.findall(regex_sentence, sentiment_sub))

In [32]:
# Write a regex to match pattern separating words
regex_words = r"\Wnew\w"

# Replace each word separator with a space
sentiment_final = re.sub(regex_words, ' ', sentiment_sub)
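
Because the contents of the data file are not shown here, the two substitutions can be traced on a short hypothetical string that follows the separators described in the exercise (the string itself is invented for illustration):

```python
import re

# Hypothetical text: "&4break!" separates sentences, "#newZ" and "%newQ" separate words.
sample = "Hello#newZworld&4break!Second%newQsentence"

# Step 1: replace the sentence separator with a space.
no_breaks = re.sub(r"\W\dbreak\W", " ", sample)
print(no_breaks)  # Hello#newZworld Second%newQsentence

# Step 2: replace the word separators with a space.
clean = re.sub(r"\Wnew\w", " ", no_breaks)
print(clean)      # Hello world Second sentence
```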

### Repetitions / Quantifiers

Problem: Validate the following string: password1234

In [33]:
import re
password = "password1234"

In [34]:
re.search(r"\w\w\w\w\w\w\w\w\d\d\d\d", password) # we can see this is not very elegant

<re.Match object; span=(0, 12), match='password1234'>

> Quantifier: a metacharacter that tells the regex engine how many times to match the character immediately to its left.

This type of regular expression syntax is called a quantifier. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths.

One type of quantifier is the numeric quantifier, written with curly braces. Here are the different types of numeric quantifiers we can use:

<img alt="quantifiers" src="https://s3.amazonaws.com/dq-content/354/quantifiers_numeric.svg">

You might notice that the last two examples above omit the first or last number, leaving that bound open, in the same way that we can omit the first or last index when slicing lists.
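
As a sketch, the password check from the start of this section can be rewritten with numeric quantifiers. The extra digit strings are made up to show an exact count and an open upper bound.

```python
import re

password = "password1234"

# Eight word characters followed by four digits.
m = re.search(r"\w{8}\d{4}", password)
print(m.group())  # password1234

# {2} means exactly two; {3,} means three or more.
pairs = re.findall(r"\d{2}", "12345")
long_runs = re.findall(r"\d{3,}", "7 42 123 98765")
print(pairs)      # ['12', '34']
print(long_runs)  # ['123', '98765']
```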

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

<img alt="quantifiers" src="https://s3.amazonaws.com/dq-content/354/quantifiers_other.svg">

- `Once or more: +`

In [35]:
text = "Date of start: 4-3. Date of registration: 10-04."

In [36]:
re.findall(r"\d+-\d+", text)

['4-3', '10-04']

- `Zero times or more: *`

In [37]:
my_string = "The concert was amazing! @ameli!a @joh&&n @mary90"
re.findall(r"@\w+\W*\w+", my_string)

['@ameli!a', '@joh&&n', '@mary90']

- `Zero times or once: ?`

In [38]:
text = "The color of this image is amazing. However, the colour blue could be brighter."
re.findall(r"colou?r", text)

['color', 'colour']

- `n times at least, m times at most: {n,m}` (no space inside the braces)

In [39]:
phone_number = "John: 1-966-847-3131 Michelle: 54-908-42-42424"

In [40]:
re.findall(r"\d{1,2}-\d{3}-\d{2,3}-\d{4,}", phone_number)

['1-966-847-3131', '54-908-42-42424']

> Immediately to the left: in `r"apple+"`, the + applies to e and not to apple.
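
A short sketch of this rule with made-up strings. The last line uses parentheses to group characters, which is not covered in this section and appears here only for contrast.

```python
import re

# + binds to the single character before it: "appl" followed by one or more "e".
stretched = re.findall(r"apple+", "apple appleee")
print(stretched)   # ['apple', 'appleee']

# So a repeated word is not matched as a whole by apple+ ...
repeated_plain = re.fullmatch(r"apple+", "appleapple")
print(repeated_plain)   # None

# ... unless the word is wrapped in a group so that + applies to all of it.
repeated_group = re.fullmatch(r"(apple)+", "appleapple")
print(repeated_group.group())   # appleapple
```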

#### Exercise 4: Everything clean

Back to your Twitter sentiment analysis project! There are several types of strings that increase your sentiment analysis complexity. But these strings do not provide any useful sentiment. Among them, we can have links and user mentions.

In order to clean the tweets, you want to extract some examples first. You know that most of the time links start with http and do not contain any whitespace, e.g. https://www.datacamp.com. User mentions start with @ and can have letters and numbers only, e.g. @johnsmith3.

You write down some helpful quantifiers to help you: * zero or more times, + once or more, ? zero or once.

In [41]:
sentiment_analysis = ['0,1467962897,Mon Apr 06 23:01:04 PDT 2009,NO_QUERY,aleskywalker,@nick_carter Come to the chat  just 15 minutes  please? http://fanclub.backstreetboys.com/chat.php',
'0,1467962938,Mon Apr 06 23:01:04 PDT 2009,NO_QUERY,jess___x,Boredd. Colddd @blueKnight39 Internet keeps stuffing up. Save me! https://www.tellyourstory.com',
'0,1467963418,Mon Apr 06 23:01:14 PDT 2009,NO_QUERY,Zimily,"I had a horrible nightmare last night @anitaLopez98 @MyredHat31 which affected my sleep, now I\'m really tired"',
'0,1467963477,Mon Apr 06 23:01:15 PDT 2009,NO_QUERY,Augustina22,"im lonely  keep me company @YourBestCompany! @foxRadio https://radio.foxnews.com 22 female, new york"',
'0,1467963715,Mon Apr 06 23:01:18 PDT 2009,NO_QUERY,missmadison,@Born_4_Broadway Lost  and it was St. Ignacius Prepatory School. Haha.']

In [42]:
# Import re module
import re

for tweet in sentiment_analysis:
    # Write regex to match http links and print out result
    print(re.findall(r"http\S+", tweet))

    # Write regex to match user mentions and print out result
    print(re.findall(r"@\w+", tweet))

['http://fanclub.backstreetboys.com/chat.php']
['@nick_carter']
['https://www.tellyourstory.com']
['@blueKnight39']
[]
['@anitaLopez98', '@MyredHat31']
['https://radio.foxnews.com']
['@YourBestCompany', '@foxRadio']
[]
['@Born_4_Broadway']


Regular expressions provide very useful metacharacters that you should not forget to use. \S is one example: it is useful when you know a pattern contains no whitespace, so the first whitespace character marks its end.

#### Exercise 5: Some time ago

In [43]:
sentiment_analysis = ['I would like to apologize for the repeated Video Games Live related tweets. 32 minutes ago', '@zaydia but i cant figure out how to get there / back / pay for a hotel 1st May 2019', 'FML: So much for seniority, bc of technological ineptness 23rd June 2018 17:54']

You are interested in knowing when the tweets were posted. After reading a little bit more, you learn that dates are provided in different ways. You decide to extract the dates using .findall() so you can normalize them afterwards to make them all look the same.

You realize that the dates are always presented in one of the following ways:
- 27 minutes ago
- 4 hours ago
- 23rd june 2018
- 1st september 2019 17:25

In [44]:
# Complete the for loop with a regex to find dates, 27 minutes ago or 4 hours ago
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\s\w+\s\w+", date))

['32 minutes ago']
[]
[]


In [45]:
# Complete the for loop with a regex to find dates, 23rd june 2018
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}", date))

[]
['1st May 2019']
['23rd June 2018']


In [46]:
# Complete the for loop with a regex to find dates, 1st september 2019 17:25
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}\s\d{1,2}:\d{2}", date))

[]
[]
['23rd June 2018 17:54']


Handling dates can become a really difficult task. Fortunately, regular expressions can simplify this task!

#### Exercise 6: Getting tokens

Your next step is to tokenize the text of your tweets. Tokenization is the process of breaking a string into lexical units or, in simpler terms, words. But first, you need to remove hashtags so they do not cloud your process. You realize that hashtags start with a # symbol and contain letters and numbers but never whitespace. After that, you plan to split the text at whitespace matches to get the tokens.

You bring your list of quantifiers to help you: * zero or more times, + once or more, ? zero or once, {n,m} minimum n, maximum m.

In [47]:
sentiment_analysis = 'ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever'

In [48]:
# Write a regex matching the hashtag pattern
regex = r"#\w+"

In [49]:
# Replace the regex by an empty string
no_hashtag = re.sub(regex, "", sentiment_analysis)

In [50]:
no_hashtag

'ITS NOT ENOUGH TO SAY THAT IMISS U    '

In [51]:
# Get tokens by splitting text
print(re.split(r"\s+", no_hashtag))

['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U', '']


Regular expressions can be very useful when replacing and splitting strings using complex patterns.
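
The empty string at the end of the token list appears because the text ends in whitespace after the hashtags are removed. One way to avoid it, sketched here, is to strip the text before splitting:

```python
import re

no_hashtag = 'ITS NOT ENOUGH TO SAY THAT IMISS U    '

# Strip leading and trailing whitespace first so no empty token is produced.
tokens = re.split(r"\s+", no_hashtag.strip())
print(tokens)
# ['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U']
```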

### Regex metacharacters

- `Match any character (except newline): .`

In [52]:
my_links = "Just check out this link: www.amazingpics.com. It has amazing photos!"

In [53]:
re.findall(r"www com", my_links)

[]

In [54]:
re.findall(r"www.+com", my_links)

['www.amazingpics.com']

There are often scenarios where we want to specifically match a pattern at the start or end of a string.

Other than the word boundary anchor, the two most common anchors are the beginning anchor and the end anchor, which represent the start and the end of the string, respectively.

<img alt="positional anchors" src="https://s3.amazonaws.com/dq-content/354/positional_anchors.svg">

Note that the ^ character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a [ or not.

- `Start of the string: ^`

In [55]:
my_string = "the 80s music was much better that the 90s"

In [56]:
re.findall(r"the\s\d+s", my_string)

['the 80s', 'the 90s']

In [57]:
re.findall(r"^the\s\d+s", my_string)

['the 80s']

- `End of the string: $`

In [58]:
my_string = "the 80s music hits were much better that the 90s"

In [59]:
re.findall(r"the\s\d+s$", my_string)

['the 90s']

- `Escape special characters: \`

In [60]:
my_string = "I love the music of Mr.Go. However, the sound was too loud."

In [61]:
print(re.split(r".\s", my_string))

['', 'lov', 'th', 'musi', 'o', 'Mr.Go', 'However', 'th', 'soun', 'wa', 'to', 'loud.']


In [58]:
print(re.split(r"\.\s", my_string))

['I love the music of Mr.Go', 'However, the sound was too loud.']


- `OR operator: Set of characters: [ ]`
- `OR operator: Character: |`


In [62]:
my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day"

In [63]:
re.findall(r"Elephant|elephant", my_string)

['Elephant', 'elephant']

In [64]:
my_string = "Yesterday I spent my afternoon with my friends: MaryJohn2 Clary3"

In [65]:
re.findall(r"[a-zA-Z]+\d", my_string)

['MaryJohn2', 'Clary3']

In [66]:
my_string = "My&name&is#John Smith. I%live$in#London."

In [67]:
re.sub(r"[#$%&]", " ", my_string)

'My name is John Smith. I live in London.'

- `Set of characters: [ ], ^ transforms the expression to negative`

In [68]:
my_links = "Bad website: www.99.com. Favorite site: www.hola.com"
re.findall(r"www[^0-9]+com", my_links)

['www.hola.com']

#### Exercise 7: Finding files

You are not satisfied with your tweets dataset cleaning. There are still extra strings that do not provide any sentiment. Among them are strings that refer to text file names.

You also find a way to detect them:
- They appear at the start of the string.
- They always start with a sequence of 2 or 3 upper or lowercase vowels (a e i o u).
- They always finish with the txt ending.

You are not sure if you should remove them directly. So you write a script to find and store them in a separate dataset.

You write down some metacharacters to help you: ^ anchor to beginning, . any character.

In [69]:
sentiment_analysis = ['AIshadowhunters.txt aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company',
 "ouMYTAXES.txt I am worried that I won't get my $900 even though I paid tax last year"]

In [70]:
# Write a regex to match text file name
regex = r"^[aeiouAEIOU]{2,3}.+txt"

for text in sentiment_analysis:
    # Find all matches of the regex
    print(re.findall(regex, text))

    # Replace all matches with empty string
    print(re.sub(regex, '', text))

['AIshadowhunters.txt']
 aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company
['ouMYTAXES.txt']
 I am worried that I won't get my $900 even though I paid tax last year


The dot . metacharacter is very useful when we want to match all repetitions of any character. However, we need to be very careful how we use it.
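
A sketch of why care is needed: .+ is greedy, so it extends to the last txt in the string. The sample text is made up, and the \S+ variant assumes that file names never contain whitespace.

```python
import re

text = "AInotes.txt is next to report.txt on my desk"

# .+ is greedy: the match runs up to the LAST "txt" in the string.
greedy = re.findall(r"^[aeiouAEIOU]{2,3}.+txt", text)
print(greedy)   # ['AInotes.txt is next to report.txt']

# \S+ cannot cross whitespace, so the match stops at the first file name.
bounded = re.findall(r"^[aeiouAEIOU]{2,3}\S+txt", text)
print(bounded)  # ['AInotes.txt']
```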

#### Exercise 8: Give me your email

A colleague has asked for your help! When a user signs up on the company website, they must provide a valid email address.
The company puts some rules in place to verify that the given email address is valid:
- The first part can contain:
    - Upper A-Z and lowercase letters a-z
    - Numbers
    - Characters: !, #, %, &, *, $, .
- Must have @
- Domain:
    - Can contain any word characters
    - But only .com ending is allowed
    
The project consists of writing a script that checks whether an email address follows the correct pattern. Your colleague gave you a list of email addresses as examples to test.

In [72]:
emails = ['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']

In [73]:
# Write a regex to match a valid email address
# (only the characters listed in the rules are allowed before the @)
regex = r"[A-Za-z0-9!#%&*$.]+@\w+\.com"

for example in emails:
    # Match the regex against the start of the string
    if re.match(regex, example):
        # Report a valid address
        print(f"The email {example} is a valid email")
    else:
        print(f"The email {example} is invalid")

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is invalid


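Validating strings becomes simpler with regular expressions. Square brackets let us list the set of characters allowed at a given position. Notice that we used the .match() method because we want the pattern to match from the beginning of the string. Keep in mind that .match() anchors only the start: extra characters after .com would still be accepted unless we end the pattern with $ or use re.fullmatch().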

#### Exercise 9: Invalid password

The second part of the website project is to write a script that validates the password entered by the user. The company has also put some rules in place to verify valid passwords:
- It can contain lowercase a-z and uppercase letters A-Z
- It can contain numbers
- It can contain the symbols: *, #, $, %, !, &, .
- It must be at least 8 characters long but not more than 20

Your colleague also gave you a list of passwords as examples to test.

In [74]:
passwords = ['Apple34!rose', 'My87hou#4$', 'abc123']

In [75]:
# Write a regex to match a valid password
regex = r"[a-zA-Z0-9!#%&*$\.]{8,20}"

for example in passwords:
    # Scan the strings to find a match
    if re.search(regex, example):
        # Print out the result
        print(f"The password {example} is a valid password")
    else:
        print(f"The password {example} is invalid") 

The password Apple34!rose is a valid password
The password My87hou#4$ is a valid password
The password abc123 is invalid


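Notice that we used the .search() method. The reason is that we want to scan the string for the pattern, and we are not interested in where the regex finds the match. Be aware that this makes the check permissive: .search() succeeds as soon as any run of 8 to 20 allowed characters appears, so a longer password, or one that also contains other symbols, would pass as well. Using re.fullmatch() would enforce the rules on the whole string.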
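To make the difference concrete, here is a minimal sketch comparing re.search() with re.fullmatch() on the password rules. The two test passwords are invented for this illustration: one is too long, the other contains a space.

```python
import re

regex = r"[a-zA-Z0-9!#%&*$.]{8,20}"

# Invented examples: 25 characters long, and one containing a space
tricky = ["A" * 25, "good1234 bad"]

# search() only needs 8 to 20 allowed characters somewhere in the string
search_results = [bool(re.search(regex, p)) for p in tricky]

# fullmatch() requires the pattern to cover the entire string
fullmatch_results = [bool(re.fullmatch(regex, p)) for p in tricky]

print(search_results)     # [True, True]
print(fullmatch_results)  # [False, False]
```

Which function is appropriate depends on whether the rules describe the whole string or only a part of it.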

### Using Flags to Modify Regex Patterns

Up until now, we've been using sets like [Pp] to match different capitalizations in our regular expressions. This strategy works well when there is only one character that has capitalization, but becomes cumbersome when we need to cater for multiple instances.

Within the titles, there are many different formatting styles used to represent the word "email." Here is a list of the variations:

In [77]:
import re
email_tests = ['email', 'Email', 'e Mail', 'e mail', 'E-mail', 'e-mail', 'eMail', 'E-Mail', 'EMAIL']

To write a regular expression for this, we would need to use a set for all five letters in email, which would make our regular expression very hard to read.

Instead, we can use flags to specify that our regular expression should ignore case.

Both re.search() and the pandas regular expression methods accept an optional flags argument. This argument accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.

A [list of all available flags](https://docs.python.org/3/library/re.html#re.A) is in the documentation, but by far the most common and the most useful is the [`re.IGNORECASE` flag](https://docs.python.org/3/library/re.html#re.I), which is also available using the alias `re.I` for convenience.

In [78]:
pattern = r"e[\-\s]?mail"

In [79]:
# without flag
for email in email_tests:
    print(re.match(pattern, email))

<re.Match object; span=(0, 5), match='email'>
None
None
<re.Match object; span=(0, 6), match='e mail'>
None
<re.Match object; span=(0, 6), match='e-mail'>
None
None
None


In [80]:
for email in email_tests:
    print(re.match(pattern, email, flags=re.I))

<re.Match object; span=(0, 5), match='email'>
<re.Match object; span=(0, 5), match='Email'>
<re.Match object; span=(0, 6), match='e Mail'>
<re.Match object; span=(0, 6), match='e mail'>
<re.Match object; span=(0, 6), match='E-mail'>
<re.Match object; span=(0, 6), match='e-mail'>
<re.Match object; span=(0, 5), match='eMail'>
<re.Match object; span=(0, 6), match='E-Mail'>
<re.Match object; span=(0, 5), match='EMAIL'>
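Since the flags argument accepts more than one flag, the sketch below shows one way to combine them with the | operator, and the inline (?i) form that sets the flag inside the pattern itself. The sample strings are invented for this illustration.

```python
import re

# Invented multi-line text for this illustration
text = "Subject line\nE-Mail: hello\nemail me"

# re.I ignores case; re.M lets ^ match at the start of every line
combined = re.findall(r"^e[\-\s]?mail", text, flags=re.I | re.M)

# The inline form (?i) switches on case-insensitivity within the pattern
inline = re.findall(r"(?i)e[\-\s]?mail", "Send an E-MAIL or an eMail")

print(combined)  # ['E-Mail', 'email']
print(inline)    # ['E-MAIL', 'eMail']
```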


### Greedy versus Non-Greedy

Two types of matching methods:
- Greedy
- Non-greedy or lazy

Standard quantifiers are greedy by default: * , + , ? , {num, num}

- `Greedy: match as many characters as possible`

Return the longest match

In [81]:
re.match(r"\d+", "12345bcada")

<re.Match object; span=(0, 5), match='12345'>

The engine backtracks when too many characters are matched, giving up characters one at a time.

In [82]:
re.match(r".*hello", "xhelloxxxxxx")

<re.Match object; span=(0, 6), match='xhello'>

- `Lazy: match as few characters as needed`

Returns the shortest match. Append ? to a greedy quantifier to make it lazy.

In [83]:
re.match(r"\d+?", "12345bcada")

<re.Match object; span=(0, 1), match='1'>

The engine backtracks when too few characters are matched, expanding the match one character at a time.

In [84]:
re.match(r".*?hello", "xhelloxxxxxx")

<re.Match object; span=(0, 6), match='xhello'>
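In the string above both quantifiers return the same match, because hello occurs only once. A minimal sketch with an invented string that contains hello twice shows where the two behaviors diverge:

```python
import re

# Invented string containing "hello" twice
s = "xhelloxxhello!"

# Greedy: .* grabs everything, then backtracks to the LAST "hello"
greedy = re.match(r".*hello", s).group()

# Lazy: .*? expands one character at a time and stops at the FIRST "hello"
lazy = re.match(r".*?hello", s).group()

print(greedy)  # xhelloxxhello
print(lazy)    # xhello
```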

When repeating a regular expression, as in a\*, the resulting action is to consume as much of the pattern as possible. This fact often bites you when you’re trying to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag doesn’t work because of the greedy nature of .\*.

In [85]:
s = '<html><head><title>Title</title>'

In [86]:
len(s)

32

In [87]:
print(re.match('<.*>', s).span())

(0, 32)


In [88]:
print(re.match('<.*>', s).group())

<html><head><title>Title</title>


In this case, the solution is to use the non-greedy quantifiers *?, +?, ??, or {m,n}?, which match as little text as possible. In the above example, the '>' is tried immediately after the first '<' matches, and when it fails, the engine advances a character at a time, retrying the '>' at every step. This produces just the right result:

In [89]:
print(re.match('<.*?>', s).group())

<html>


> Note that parsing HTML or XML with regular expressions is painful. Quick-and-dirty patterns will handle common cases, but HTML and XML have special cases that will break the obvious regular expression; by the time you’ve written a regular expression that handles all of the possible cases, the patterns will be very complicated. Use an HTML or XML parser module for such tasks.
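As a hedged illustration of the parser-based approach, the sketch below uses html.parser.HTMLParser from the standard library to keep only the text between tags. The class name TextExtractor and the sample sentence are choices made for this example; a real project might prefer a dedicated HTML library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside of tags
        self.chunks.append(data)


parser = TextExtractor()
parser.feed("I want to see that <strong>amazing show</strong> again!")
parser.close()  # flush any text still buffered by the parser

plain_text = "".join(parser.chunks)
print(plain_text)  # I want to see that amazing show again!
```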

#### Exercise 10: Understanding the difference

You need to keep working on and cleaning your tweets dataset. You realize that there are some HTML tags present. You need to remove them but keep the content inside, as it is useful for analysis.

Let's take a look at this sentence containing an HTML tag:

    I want to see that <strong>amazing show</strong> again!

You know that to get an HTML tag you need to match anything that sits inside angle brackets < >. But the biggest problem is that the closing tag has the same structure. If you match too much, you will end up removing key information. So you need to decide whether to use a greedy or a lazy quantifier.

In [90]:
string = 'I want to see that <strong>amazing show</strong> again!'

In [91]:
# Import re
import re

# Write a regex to eliminate tags
string_notags = re.sub(r"<.+?>", "", string)
#string_notags = re.sub(r"<.+>", "", string) # greedy

# Print out the result
print(string_notags)

I want to see that amazing show again!


Remember that a greedy quantifier tries to match as much as possible, while a non-greedy quantifier matches as few characters as needed, expanding one character at a time until it gives us the match we are looking for.

#### Exercise 11: Greedy matching

Next, you see that numbers still appear in the text of the tweets. So, you decide to find all of them.

Let's imagine that you want to extract the number contained in the sentence "I was born on April 24th". A lazy quantifier will make the regex return 2 and 4 separately, because it matches as few characters as needed. A greedy quantifier, however, will return the entire 24 because it matches as much as possible.

In [92]:
sentiment_analysis = 'Was intending to finish editing my 536-page novel manuscript tonight, but that will probably not happen. And only 12 pages are left'

In [93]:
# Write a lazy regex expression 
numbers_found_lazy = re.findall(r"\d+?", sentiment_analysis)

# Print out the result
print(numbers_found_lazy)

['5', '3', '6', '1', '2']


In [94]:
# Write a greedy regex expression 
numbers_found_greedy = re.findall(r"\d+", sentiment_analysis)

# Print out the result
print(numbers_found_greedy)

['536', '12']


Even though greedy quantifiers lead to longer matches, they are sometimes the best option. Because lazy quantifiers match as few characters as possible, they can return a shorter match than we expect, as with the single digits above. It is always good to know when to use greedy and lazy quantifiers!

#### Exercise 12: Lazy approach

You have done some cleaning in your dataset but you are worried that there are sentences encased in parentheses that may cloud your analysis.

Again, a greedy or a lazy quantifier may lead to different results.

For example, if you want to extract a word starting with a and ending with e in the string "I like apple pie", you may think that applying the greedy regex a.+e will return apple. However, your match will be apple pie. A way to overcome this is to make the quantifier lazy by writing a.+?e, which will return apple.

In [95]:
sentiment_analysis = "Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)."

In [96]:
# Write a greedy regex expression to match 
sentences_found_greedy = re.findall(r"\(.+\)", sentiment_analysis)

# Print out the result
print(sentences_found_greedy)

["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)"]


In [97]:
# Write a lazy regex expression
sentences_found_lazy = re.findall(r"\(.+?\)", sentiment_analysis)

# Print out the results
print(sentences_found_lazy)

['(They were so cute)', "(I'm crying)"]


Notice that greedy quantifiers can lead to longer matches than desired. Making a quantifier lazy by adding ? so that it matches a shorter pattern is an important consideration to keep in mind when handling data for text mining.
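A common alternative to a lazy quantifier, sketched here on the same tweet, is a negated character set: instead of "any character, as few as possible", the pattern says "any character that is not a closing parenthesis". This is offered as an additional option, not as the exercise's intended solution.

```python
import re

sentiment_analysis = "Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)."

# [^)]+ matches one or more characters that are not ")"
sentences_found = re.findall(r"\([^)]+\)", sentiment_analysis)

print(sentences_found)  # ['(They were so cute)', "(I'm crying)"]
```

Because the set itself cannot run past the closing parenthesis, the quantifier can stay greedy.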

### Capturing groups


We define a capture group by wrapping the part of our pattern we want to capture in parentheses. If we want to capture the whole pattern, we just wrap the whole pattern in a pair of parentheses.

In [98]:
text = "Clary has 2 friends who she spends a lot time with. Susan has 3 brothers while John has 4 sisters."

In [99]:
re.findall(r'[A-Za-z]+\s\w+\s\d+\s\w+', text)

['Clary has 2 friends', 'Susan has 3 brothers', 'John has 4 sisters']

Use parentheses to group and capture characters together

In [100]:
re.findall(r'([A-Za-z]+)\s\w+\s\d+\s\w+', text)

['Clary', 'Susan', 'John']

In [101]:
re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', text)

[('Clary', '2', 'friends'),
 ('Susan', '3', 'brothers'),
 ('John', '4', 'sisters')]

- Match a specific subpattern in a pattern
- Use it for further processing

Organize the data

In [102]:
pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', "Clary has 2 dogs but John has 3 cats")
pets

[('Clary', '2', 'dogs'), ('John', '3', 'cats')]

In [103]:
pets[0][0]

'Clary'

- Apply a quantifier to the entire group

In [104]:
re.search(r"(\d[A-Za-z])+", "My user name is 3e4r5fg")

<re.Match object; span=(16, 22), match='3e4r5f'>

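- Capture a repeated group (\d+) vs. repeat a capturing group (\d)+. In the second form the group is overwritten on each repetition, so only the last digit of each match is kept.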

In [105]:
my_string = "My lucky numbers are 8755 and 33"
re.findall(r"(\d)+", my_string)

['5', '3']

In [106]:
re.findall(r"(\d+)", my_string)

['8755', '33']
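To see why (\d)+ returns only '5' and '3', the sketch below uses re.finditer() to print the whole match next to the captured group. The whole match still covers every digit; the group just keeps its last repetition.

```python
import re

my_string = "My lucky numbers are 8755 and 33"

pairs = []
for m in re.finditer(r"(\d)+", my_string):
    # group(0) is the whole match; group(1) holds only the last repetition
    pairs.append((m.group(0), m.group(1)))

print(pairs)  # [('8755', '5'), ('33', '3')]
```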

#### Exercise 13: Try another name

You are still working on your Twitter sentiment analysis, and you are now looking at some things that caught your attention. You noticed that there are email addresses inserted in some tweets. Now, you are curious to find out which is the most common name.

You want to extract the first part of the email. E.g. if you have the email marysmith90@gmail.com, you are only interested in marysmith90.
You need to match the entire email address so that you extract only names that are part of an email. Also, you are only interested in names containing uppercase letters (e.g. A, B, Z), lowercase letters (e.g. a, d, z) and numbers.

In [107]:
sentiment_analysis = ['Just got ur newsletter, those fares really are unbelievable. Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices', 'I should have paid more attention when we covered photoshop in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.', 'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com']

In [108]:
# Write a regex that matches email
regex_email = r"([A-Za-z0-9]+)@\S+"

for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email, tweet)

    # Print the results
    print(f"Lists of users found in this tweet: {email_matched}")

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']


Remember that placing a subpattern inside parentheses captures that content and stores it temporarily in memory, so it can be reused later.

#### Exercise 14: Flying home

Your boss assigned you to a small project. They are performing an analysis of the travels people made to attend business meetings. You are given a dataset with only the email subjects for each of the people traveling.

You learn that the text follows a pattern. Here is an example:

Here you have your boarding pass LA4214 AER-CDB 06NOV.

You need to extract the information about the flight:

- The two letters indicate the airline (e.g. LA).
- The four digits are the flight number (e.g. 4214).
- The three letters correspond to the departure (e.g. AER).
- The next three letters are the destination (e.g. CDB).
- The last part is the date of the flight (e.g. 06NOV).

All letters are always uppercase.

In [109]:
text = "Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT"

In [110]:
# Import re
import re

# Write regex to capture information of the flight
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

# Find all matches of the flight information
flight_matches = re.findall(regex, text)

print(flight_matches)

#Print the matches
print(f"Airline: {flight_matches[0][0]} Flight number: {flight_matches[0][1]}")
print(f"Departure: {flight_matches[0][2]} Destination: {flight_matches[0][3]}")
print(f"Date: {flight_matches[0][4]}")

[('IB', '3723', 'AMS', 'MAD', '06OCT')]
Airline: IB Flight number: 3723
Departure: AMS Destination: MAD
Date: 06OCT


When the pattern contains more than one capturing group, findall() returns a list of tuples. The nth element of each tuple is the text captured by group n. This provides us with an easy way to access and organize our data.
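Because each tuple has one element per group, it can be unpacked directly into named variables instead of being indexed by position. The sketch below does this for the boarding pass example; the dictionary keys are names chosen for this illustration.

```python
import re

regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"
text = "Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT"

flights = []
# Each tuple from findall() unpacks into one variable per capturing group
for airline, number, departure, destination, date in re.findall(regex, text):
    flights.append({
        "airline": airline,
        "number": number,
        "departure": departure,
        "destination": destination,
        "date": date,
    })

print(flights)
```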

#### Exercise 15: Extracting URL Parts Using Multiple Capture Groups

The task we will be performing first is extracting the different components of the URLs in order to analyze them.

In [111]:
test_urls = [
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param'
]

Each URL is made up of three components that we want to extract:

- Protocol
- Domain
- Page path

In order to do this, we'll create a regular expression with multiple capture groups. Multiple capture groups in regular expressions are defined the same way as single capture groups — using pairs of parentheses.

In [112]:
pattern = r"(.+)://([\w\.]+)/?(.*)"
results = []


for url in test_urls:
    comp = re.findall(pattern, url)
    results.append(comp)
    print(comp)

[('https', 'www.amazon.com', 'Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429')]
[('http', 'www.interactivedynamicvideo.com', '')]
[('http', 'www.nytimes.com', '2007/11/07/movies/07stein.html?_r=0')]
[('http', 'evonomics.com', 'advertising-cannot-maintain-internet-heres-solution/')]
[('HTTPS', 'github.com', 'keppel/pinn')]
[('Http', 'phys.org', 'news/2015-09-scale-solar-youve.html')]
[('https', 'iot.seeed.cc', '')]
[('http', 'www.bfilipek.com', '2016/04/custom-deleters-for-c-smart-pointers.html')]
[('http', 'beta.crowdfireapp.com', '?beta=agnipath')]
[('https', 'www.valid.ly', '?param')]


In [113]:
results

[[('https',
   'www.amazon.com',
   'Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429')],
 [('http', 'www.interactivedynamicvideo.com', '')],
 [('http', 'www.nytimes.com', '2007/11/07/movies/07stein.html?_r=0')],
 [('http',
   'evonomics.com',
   'advertising-cannot-maintain-internet-heres-solution/')],
 [('HTTPS', 'github.com', 'keppel/pinn')],
 [('Http', 'phys.org', 'news/2015-09-scale-solar-youve.html')],
 [('https', 'iot.seeed.cc', '')],
 [('http',
   'www.bfilipek.com',
   '2016/04/custom-deleters-for-c-smart-pointers.html')],
 [('http', 'beta.crowdfireapp.com', '?beta=agnipath')],
 [('https', 'www.valid.ly', '?param')]]

In [114]:
from collections import Counter

In [115]:
# count protocol
Counter([result[0][0].lower() for result in results]).most_common()

[('http', 6), ('https', 4)]
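For comparison, the standard library already ships a URL parser. The sketch below uses urllib.parse.urlsplit() on three of the test URLs. It is an alternative to the regex above rather than a replacement for the exercise, and its output differs slightly: the scheme is lowercased, the path keeps its leading slash, and the query string is reported separately.

```python
from collections import Counter
from urllib.parse import urlsplit

urls = [
    'HTTPS://github.com/keppel/pinn',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'https://www.valid.ly?param',
]

parts = [urlsplit(u) for u in urls]

for p in parts:
    # scheme, netloc, path and query are attributes of the result
    print(p.scheme, p.netloc, p.path, p.query)

# Count protocols; no .lower() needed because urlsplit normalizes the scheme
schemes = Counter(p.scheme for p in parts)
print(schemes.most_common())
```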

### Numbered and named groups 

- `Numbered groups`

In [116]:
text = "Python 3.0 was released on 12-03-2008."

In [117]:
information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)

In [118]:
information

<re.Match object; span=(27, 37), match='12-03-2008'>

In [119]:
information.group(3)

'2008'

In [120]:
information.group(1)

'12'

In [121]:
information.group(0)

'12-03-2008'

- `Named groups`

In order to name a capture group we use the syntax `?P<name>`, where name is the name of our capture group. This syntax goes after the opening parenthesis, but before the regex syntax that defines the capture group:

    (?P<name>regex)

In [122]:
text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)
cities.group("city")

'Austin'

In [123]:
cities.group("zipcode")

'78701'
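Named groups can also be retrieved all at once. A minimal sketch, reusing the same pattern: the match object's groupdict() method returns a dictionary keyed by group name, and named groups remain accessible by number as well.

```python
import re

text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)

# All named groups as a dictionary
record = cities.groupdict()
print(record)  # {'city': 'Austin', 'zipcode': '78701'}

# A named group still has a position, so both lookups agree
same_value = cities.group(1) == cities.group("city")
print(same_value)  # True
```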

#### Exercise 16: Parsing PDF files

You now need to work on another small project you have been delaying. Your company gave you some PDF files of signed contracts. The goal of the project is to create a database with the information you parse from them. Three of these columns should correspond to the day, month, and year when the contract was signed.
The dates appear as "Signed on 05/24/2016" (05 indicating the month, 24 the day). You decide to use capturing groups to extract this information. Also, you would like to store each part separately in its own variable.

You decide to do a proof of concept.

In [128]:
contract = 'Provider will invoice Client for Services performed within 30 days of performance.  Client will pay Provider as set forth in each Statement of Work within 30 days of receipt and acceptance of such invoice. It is understood that payments to Provider for services rendered shall be made in full as agreed, without any deductions for taxes of any kind whatsoever, in conformity with Provider’s status as an independent contractor. Signed on 03/25/2001.'

In [129]:
# Write regex and scan contract to capture the dates described
regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"
dates = re.search(regex_dates, contract)

In [130]:
# Assign to each key the corresponding match
signature = {
    "day": dates.group(2),
    "month": dates.group(1),
    "year": dates.group(3)
}

In [131]:
# Print out the result
print(f"Our first contract is dated back to {signature['year']}. Particularly, the day {signature['day']} of the month {signature['month']}.")

Our first contract is dated back to 2001. Particularly, the day 25 of the month 03.


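Remember that each capturing group is assigned a number according to its position in the regex. The .group() method belongs to the match object, so you can use it only with functions that return one, such as .search() and .match(); .findall() returns plain strings or tuples instead.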
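One practical detail when working with match objects: .search() returns None when nothing matches, and calling .group() on None raises an error. The sketch below wraps the date extraction in a helper that checks for this first. The function name extract_signature and the second test string are made up for this illustration.

```python
import re

regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"


def extract_signature(contract_text):
    """Return day, month and year as a dict, or None if no date is found."""
    match = re.search(regex_dates, contract_text)
    if match is None:
        # No "Signed on MM/DD/YYYY" in this text
        return None
    # groups() returns all captured groups as a tuple, in order
    month, day, year = match.groups()
    return {"day": day, "month": month, "year": year}


found = extract_signature("Payment is due in 30 days. Signed on 03/25/2001.")
missing = extract_signature("This draft has not been signed yet.")

print(found)    # {'day': '25', 'month': '03', 'year': '2001'}
print(missing)  # None
```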



#### Exercise 17: Extracting URL Parts Using Named Groups

In [132]:
pattern = r"(?P<protocol>.+)://(?P<domain>[\w\.]+)/?(?P<path>.*)"

for url in test_urls:
    comp = re.search(pattern, url)
    print(comp.group('protocol'))
    print(comp.group('domain'))
    print(comp.group('path'))
    print('------------------')

https
www.amazon.com
Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429
------------------
http
www.interactivedynamicvideo.com

------------------
http
www.nytimes.com
2007/11/07/movies/07stein.html?_r=0
------------------
http
evonomics.com
advertising-cannot-maintain-internet-heres-solution/
------------------
HTTPS
github.com
keppel/pinn
------------------
Http
phys.org
news/2015-09-scale-solar-youve.html
------------------
https
iot.seeed.cc

------------------
http
www.bfilipek.com
2016/04/custom-deleters-for-c-smart-pointers.html
------------------
http
beta.crowdfireapp.com
?beta=agnipath
------------------
https
www.valid.ly
?param
------------------
