Credits: This notebook contains an excerpt from the *Python Data Science Handbook*
by Jake VanderPlas.

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). <br/>
If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

<a id="home"></a>
# Working with Text Data

<u>[Part 1: Basic Python Strings](#1)</u><br/>

| Section | Section-name | Section | Section-name | Section | Section-name | 
| :- | :- | :- | :- | :- | :- | 
| 1.a. | [Basic strings](#1a) |  1.b. | [String indexing](#1b) |  1.c. | [Basic string operations](#1c) | 
| 1.d. | [Finding a substring](#1d) | 1.e. | [String transformations](#1e) |  1.f. | [split and join](#1f) | 
| 1.g. | [String ops via list comprehensions](#1g) | 1-Ex | [Exercise 1 - Python Strings](#ex1) |

<u>[Part 2: Python regular expressions - Basics](#2)</u><br/>

| Section | Section-name | Section | Section-name | Section | Section-name | 
| :- | :- | :- | :- | :- | :- | 
| 2.a. | [the re Module](#2a) |  2.b. | [Simple (syntax) Examples](#2b) |  2.c. | [the 'search' function](#2c) | 
| 2.d. | [the 'findall' function](#2d) | 2.e. | [the 'split' function](#2e) | 2.f. | [Character sets and ranges](#2f) | 
| 2.g. | [Escape codes](#2g) |  2.h. | [Or expression](#2h) | Ex2 | [Exercise 2 - Regular Expressions](#ex2) | 

<u> [Part 3: Text in Pandas Series](#3) </u><br/>

| Section | Section-name | Section | Section-name | Section | Section-name | 
| :- | :- | :- | :- | :- | :- |  
| 3.a. | [Pandas objects containing strings](#3a) | 3.b. | [Series string methods](#3b) | 3.c. | [Using pandas string methods](#3c) |
| 3.d. | [Series and regular expressions](#3d) | 3.e. | [Series misc. string methods](#3e) | 3.f. | [Series item access and slicing](#3f) | 
| 3.g. | [Indicator variables](#3g) | 

## Import python modules

In [1]:
# --------------------------------------
import os
import warnings
import re

import numpy as np
import pandas as pd
# --------------------------------------
import seaborn as sns
import matplotlib.pyplot as plt
# --------------------------------------
# show several outputs in one cell. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# --------------------------------------
warnings.simplefilter("ignore")
%matplotlib inline
# --------------------------------------

<a id="1"></a>
## 1. Basic Python Strings

[Go to the beginning of the notebook](#home)
<a id="1a"></a>
#### 1.a. Basic strings

In [2]:
print('A string is contained within 2 quotes:')
"John Smith"

print('You can also use single  quotes:')
'John Smith'

print('A string can be spaces and digits:') 
'1 2 3 4 5 6 '

print('A string can also be special characters:') 
'@#2_#]&*^%$'

A string is contained within 2 quotes:


'John Smith'

You can also use single  quotes:


'John Smith'

A string can be spaces and digits:


'1 2 3 4 5 6 '

A string can also be special characters:


'@#2_#]&*^%$'

In [3]:
# multiline string
hi = """Hi there 
Hi again
Bye"""
'''
explanation 
of 
something
'''
print(hi)  

'\nexplanation \nof \nsomething\n'

Hi there 
Hi again
Bye


[Go to the beginning of the notebook](#home)
<a id="1b"></a>
#### 1.b. String indexing

In [4]:
Name= "Jack Smith"
len_min1=len(Name)-1
print('Name       : '+Name)
print('len(Name)  : %d' %(len(Name)))
print('Name[5]    : '+'-'*5+Name[5])
print('Name[-1]   : '+'-'*len_min1+Name[-1])
print('Name[0:4]  : '+Name[0:4])
print('Name[::2]  : '+Name[::2])
print('Name[::-1] : '+Name[::-1])
print('Name[::-2] : '+Name[::-2])
print('Name[1:7:2]: '+Name[1:7:2])
print('hi '*3)

Name       : Jack Smith
len(Name)  : 10
Name[5]    : -----S
Name[-1]   : ---------h
Name[0:4]  : Jack
Name[::2]  : Jc mt
Name[::-1] : htimS kcaJ
Name[::-2] : hiSka
Name[1:7:2]: akS
hi hi hi 


[Go to the beginning of the notebook](#home)
<a id="1c"></a>
#### 1.c. Basic string operations
You can find a list of all string methods in the [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).

In [5]:
s = "he'llo"
print(s)
print(s.capitalize())  # Capitalize a string; prints "He'llo"
print(s.upper())       # Convert a string to uppercase; prints "HE'LLO"
print(s.rjust(7))      # Right-justify a string, padding with spaces; prints " he'llo"
print(s.center(7))     # Center a string, padding with spaces; prints " he'llo" (only one space to add)
print(s.replace('l', '(ell)'))  # Replace all instances of one substring with another;
#                                 prints "he'(ell)(ell)o"
print(s.replace('l', '(ell)', 1))  # Replace only the first instance; prints "he'(ell)lo"
print(s.replace("'", ""))          # Remove the apostrophe; prints "hello"
print('  world '.strip())  # Strip leading and trailing whitespace; prints "world"
print(', hi my name is John.'.strip(',. '))  # Strip any of the given characters from both ends
print('שלום לכולם,'.rstrip(','))  # Strip from the end of the string only
print('שלום לכולם,'.lstrip(','))  # Strip from the start only; the trailing comma stays

he'llo
He'llo
HE'LLO
 he'llo
 he'llo
he'(ell)(ell)o
he'(ell)lo
hello
world
hi my name is John
שלום לכולם
שלום לכולם,


[Go to the beginning of the notebook](#home)
<a id="1d"></a>
#### 1.d. Finding a substring

In [6]:
Name
Name.find('ck')

'Jack Smith'

2

In [7]:
Name
Name.find('lm')

'Jack Smith'

-1
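
The `-1` above is how `find` reports a missing substring, so the result should be checked before it is used as an index. The following is a minimal sketch of the alternatives; `name` is redefined here (mirroring `Name` above) only so the block runs on its own.

```python
name = "Jack Smith"

# find() returns the index of the first occurrence, or -1 if the substring is absent
pos_found = name.find('ck')
pos_missing = name.find('lm')

# The `in` operator is the idiomatic choice when only a yes/no answer is needed
has_smith = 'Smith' in name

# index() behaves like find(), but raises ValueError instead of returning -1
try:
    name.index('lm')
    index_raised = False
except ValueError:
    index_raised = True

print(pos_found, pos_missing, has_smith, index_raised)
```

Using `in` avoids the common mistake of treating `-1` as a valid position (it would silently index the last character).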

[Go to the beginning of the notebook](#home)
<a id="1e"></a>
#### 1.e. String transformations

In [11]:
ord('A')
some_str='string with ABCE'
# Map between one character to another, such that a character can be replaced by another when calling str.translate(translation)
translation =some_str.maketrans('BACD', 'abcd')
translation
some_str
some_str.translate(translation)

65

{66: 97, 65: 98, 67: 99, 68: 100}

'string with ABCE'

'string with bacE'

In [12]:
ord('א')
some_str='Hebrew Letters: אבג'
translation = str.maketrans('אבג', 'abc')
translation
some_str
some_str.translate(translation)

1488

{1488: 97, 1489: 98, 1490: 99}

'Hebrew Letters: אבג'

'Hebrew Letters: abc'

[Go to the beginning of the notebook](#home)
<a id="1f"></a>
#### 1.f. split and join

In [13]:
str_sentence  = 'This is a sentence'
str_sentence2 = 'This is, a sentence'
str_sentence
str_sentence.split(' ')
str_sentence2
str_sentence2.split(', ')

'This is a sentence'

['This', 'is', 'a', 'sentence']

'This is, a sentence'

['This is', 'a sentence']

In [14]:
normalized_tokens = ['This', 'is', 'a', 'sentence','.']
normalized_tokens
norm_sentence = ' '.join(normalized_tokens)
norm_sentence

['This', 'is', 'a', 'sentence', '.']

'This is a sentence .'

[Go to the beginning of the notebook](#home)
<a id="1g"></a>
#### 1.g. String operations via list comprehensions
Use list comprehensions on simple python lists

In [15]:
lst_names1 = ['peter', 'Paul', 'MARY', 'gUIDO']
lst_names1
[s.capitalize() for s in lst_names1]
lst2=[name.capitalize().replace('Pe','Me') for name in lst_names1]
lst2
' '.join(lst2)

['peter', 'Paul', 'MARY', 'gUIDO']

['Peter', 'Paul', 'Mary', 'Guido']

['Meter', 'Paul', 'Mary', 'Guido']

'Meter Paul Mary Guido'

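This is perhaps sufficient to work with some data, <br/>
but **it will break if there are any missing values**: calling a string method on a `None` entry raises an `AttributeError`.<br/>
The list below has no missing values, so the comprehension succeeds: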

In [16]:
lst_names2 = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in lst_names2]

['Peter', 'Paul', 'Mary', 'Guido']
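
To see the failure mode mentioned above, here is a small sketch with a hypothetical list, `names_with_missing`, that contains a `None` entry (this list is an assumption and is not part of the notebook's data).

```python
# Hypothetical data: the same names with one missing value
names_with_missing = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# The plain comprehension fails because None has no .capitalize() method
try:
    [s.capitalize() for s in names_with_missing]
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# One possible workaround: guard each element explicitly
safe = [s.capitalize() if s is not None else None for s in names_with_missing]

print(error_name)
print(safe)
```

The explicit guard works, but it has to be repeated for every operation, which is part of the motivation for the vectorized Pandas string methods in Part 3.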

[Go to the beginning of the notebook](#home)
<a id="ex1"></a>
### Exercise 1
This exercise involves a text, `in_text`, which you need to manipulate using the material above. 
1. Replace every double quotation mark (") with a single quotation mark (')
2. Remove the spaces and punctuation (,;.': ) from the beginning and end
3. Remove all punctuation (,;.:) except the single quotation mark (').
4. Map Hebrew letters to the corresponding English letters (א --> a; ב --> b, etc.)
   * assume there are NO final-form ("suffix") Hebrew letters (ךםןףץ)
   * map to the first 22 English letters
5. Split the text into words. Assume that a space separates words
6. Remove empty words
7. Capitalize every second word and convert the rest of the words to lower case
8. Reverse the order of the words
9. Join and print the sentence (the sentence in reverse order)

In [56]:
in_text = """  Ammm :this is a text.  nothing "really   important"; פשוט   טקסט, that's that. ...  """
print(in_text)

  Ammm :this is a text.  nothing "really   important"; פשוט   טקסט, that's that. ...  


In [58]:
# your solution:
print(in_text.ljust(100), '(original)')
print(in_text.replace('"', "'").ljust(100), '(\' instead of ")')
print(in_text.strip(' ,:.;').ljust(100), '(No punctuation at begin/end)')

import re
print(re.sub(r',|:|\.|;', '', in_text).ljust(100), '(No punctuation everywhere)')
print(in_text.replace(',', '').replace(':', '').replace('.', '').replace(';', '').ljust(100), '(No punctuation everywhere)')

translation = in_text.maketrans('אבגדהוזחטיכלמנסעפצקרשת', 'abcdefghijklmnopqrstuv')
print(in_text.translate(translation).ljust(100), '(Hebrew Translated)')

words = in_text.split(' ')
print(words, '(Split by space)')
print([s for s in words if len(s) > 0], '(Split by space, skip empty)')
print([words[i].capitalize() if i % 2 == 1 else words[i].lower() for i in range(len(words))], '(#7)')
print(words[::-1], '(Words in reverse order)')
print(' '.join(words[::-1]), '(Words in reverse order as sentence)')

  Ammm :this is a text.  nothing "really   important"; פשוט   טקסט, that's that. ...                 (original)
  Ammm :this is a text.  nothing 'really   important'; פשוט   טקסט, that's that. ...                 (' instead of ")
Ammm :this is a text.  nothing "really   important"; פשוט   טקסט, that's that                        (No punctuation at begin/end)
  Ammm this is a text  nothing "really   important" פשוט   טקסט that's that                          (No punctuation everywhere)
  Ammm this is a text  nothing "really   important" פשוט   טקסט that's that                          (No punctuation everywhere)
  Ammm :this is a text.  nothing "really   important"; qufi   isoi, that's that. ...                 (Hebrew Translated)
['', '', 'Ammm', ':this', 'is', 'a', 'text.', '', 'nothing', '"really', '', '', 'important";', 'פשוט', '', '', 'טקסט,', "that's", 'that.', '...', '', ''] (Split by space)
['Ammm', ':this', 'is', 'a', 'text.', 'nothing', '"really', 'important";', 'פשוט', 'טקסט,
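
The cell above applies each step to the original `in_text` separately. As a sketch of one possible way to chain the steps, so that each one works on the result of the previous one, consider the following. It assumes that "every second word" means the 2nd, 4th, ... words; the names `text`, `hebrew_to_latin` and `result` are illustrative.

```python
in_text = """  Ammm :this is a text.  nothing "really   important"; פשוט   טקסט, that's that. ...  """

# Steps 1-3: normalise the quotes, trim both ends, drop the remaining punctuation
text = in_text.replace('"', "'")
text = text.strip(" ,;.':")
for mark in ",;.:":
    text = text.replace(mark, "")

# Step 4: map the 22 (non-final) Hebrew letters to 'a'..'v'
hebrew_to_latin = str.maketrans('אבגדהוזחטיכלמנסעפצקרשת', 'abcdefghijklmnopqrstuv')
text = text.translate(hebrew_to_latin)

# Steps 5-6: split on single spaces and drop the empty strings left by repeated spaces
words = [w for w in text.split(' ') if w]

# Step 7 (assumption): capitalize the words at odd positions, lower-case the others
words = [w.capitalize() if i % 2 == 1 else w.lower() for i, w in enumerate(words)]

# Steps 8-9: reverse the word order and join back into one sentence
result = ' '.join(words[::-1])
print(result)
```

Reassigning `text` at every step is what makes the operations accumulate; strings are immutable, so each method returns a new string rather than changing `in_text`.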

[Go to the beginning of the notebook](#home)
<a id="2"></a>
## 2. Python regular expressions - Basics

**This Section covers the following**:
- Get familiar with regular expressions
- Basic use of main `re` module functions: `search`,`findall`,`split`,`sub` and `match` object.
- Practice basic regular expression syntax
- Get familiar with character sets and character ranges ([] operator)
- Get familiar with special escape codes (such as `\w`,`\d`, etc)
- Or expression
- Useful online regexp debugger

In this part of our exercise we'll learn about regular expressions. <br/>
Regular expressions are text matching patterns described with a formal syntax. <br/>
You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. <br/>
They are very useful to find (and replace) text, to extract structured information such as <br/>
e-mails, phone numbers, etc., or for cleaning up text that was entered by humans, and many other applications. 

In Python, regular expressions are available as part of the [`re`](https://docs.python.org/3/library/re.html#module-re) module. <br/>
There are various [good](https://docs.python.org/3/howto/regex.html) [tutorials](https://developers.google.com/edu/python/regular-expressions) [available](https://github.com/tesla809/intro-to-python-jupyter-notebooks/blob/master/47-Regular%20Expressions.ipynb) on which this document is partially based. 

[Go to the beginning of the notebook](#home)
<a id="2a"></a>
#### 2.a. the `re` Module

In order to use the `re` module in Python, one first needs to import it. <br/>
As mentioned in the lecture, there are 3 main use cases, plus one object that supports them:
1. Find - mainly using the `search` and `findall` functions.
2. Replace - mainly using the `sub` function.
3. Split - using the `split` function.
4. Match object - this is not a use case, but an object returned mainly by the "find" functions that can be used for further text manipulation.


The basic syntax to search for a match in a string is this: 

```python
match = re.search(pattern, text)
```

Here, `pattern` is the regular expression and `text` is the text that the regular expression is applied to. <br/>
`match` holds the search result as a `Match` object, or `None` if the pattern is not found.

[`search()`](https://docs.python.org/3/library/re.html#re.search) returns only the first occurrence of a match, in contrast, [`findall()`](https://docs.python.org/3/library/re.html#re.findall) returns all matches.

Another useful function is [`split()`](https://docs.python.org/3/library/re.html#re.split), which splits a string based on a regex pattern – we'll use all of these functions and others where appropriate. 

Mostly, we'll use search to learn about the syntax, but sometimes we'll use split instead of search to explain a pattern. <br/>
There are other functions which we'll use later.
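
As a quick orientation, the sketch below runs the four functions named above on one small string. The sample text and the literal pattern `"cat"` are made up for illustration; a literal pattern is used so that no special regex syntax is needed yet.

```python
import re

text = "one cat, two cats, three cats"   # hypothetical sample text
pattern = "cat"                          # a plain literal is also a valid regex

first = re.search(pattern, text)         # Match object for the first hit (None if absent)
all_hits = re.findall(pattern, text)     # list with every matching substring
replaced = re.sub(pattern, "dog", text)  # new string with every match replaced
parts = re.split(", ", text)             # list of pieces between the separators

print(first.start())
print(all_hits)
print(replaced)
print(parts)
```

Note that `sub` and `split` return new objects and leave `text` unchanged.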

[Go to the beginning of the notebook](#home)
<a id="2b"></a>
#### 2.b. Simple (syntax) Examples

We'll use a regular expression to demonstrate the syntax and use case: 
```python
'name: \w\w\w\w'
```

We will use it to extract the names of people who submitted inquiries to the forum. <br/>
The pattern matches the substring **'name: '** followed by four word characters, encoded by **'\w\w\w\w'**. <br/>
Let's start with the import.

In [None]:
import re

Below you can find a snippet from the comments we got into the course's forum. <br/>
We'll save it into a string variable

In [59]:
txt="""

name: Dina Ivry
email: dimai@gmail.com
time: 2020-11-02 11:32:11
phone: +972-3-52-3434233
city: Tel-aviv
title: knn  
content: can you explain what does the k hyper-parameter mean???

==============
name: Joseph Katzir
email: joek@myemail.ac.il
time: 2020-12-20 13:34:02
phone: (054) 5444443
city: Tel aviv
title: what a great lecture   
content: avinoam this was one of your best

=============

"""
print(txt)



name: Dina Ivry
email: dimai@gmail.com
time: 2020-11-02 11:32:11
phone: +972-3-52-3434233
city: Tel-aviv
title: knn  
content: can you explain what does the k hyper-parameter mean???

name: Joseph Katzir
email: joek@myemail.ac.il
time: 2020-12-20 13:34:02
phone: (054) 5444443
city: Tel aviv
title: what a great lecture   
content: avinoam this was one of your best





[Go to the beginning of the notebook](#home)
<a id="2c"></a>
#### 2.c.  the `search` function

One of the most common uses for the re module is for finding patterns in text. <br/>
Let's do a quick example of using the search method in the re module to find some text. <br/>
In this case, by finding the first names of the people that wrote in the forum, based on the pattern we mentioned before.

In [60]:
good_pattern="name: \w\w\w\w"
no_match_pattern="first name: \w\w\w\w"

#Check for match on first pattern
if re.search(good_pattern,  txt):
    print ('Match was found for pattern:',good_pattern)
else:
    print ('No Match was found for pattern:',good_pattern)

#Check for match on second pattern
if re.search(no_match_pattern,  txt):
    print ('Match was found for pattern:',no_match_pattern)
else:
    print ('No Match was found for pattern:',no_match_pattern)


Match was found for pattern: name: \w\w\w\w
No Match was found for pattern: first name: \w\w\w\w


This is nice.. we've seen that `re.search()` will take the pattern, scan the text, and return if it finds a match or not. <br/>
But how can we get the actual text it matched?

In order to understand this, we will introduce the `Match` object. When the function `search` is called, <br/>
it returns a `Match` object. If no pattern is found,  `None` is returned. <br/>
To give a clearer picture of this match object, check out the cell below:

In [61]:
match = re.search(good_pattern,  txt)

type(match)

re.Match

This Match object returned by the search() method is more than just a Boolean or None, <br/>
it contains information about the match, including the original input string, <br/>
the regular expression that was used, and the location of the match. <br/>

Let's see the methods we can use on the match object:

In [62]:
# Show start of match
print('match.start():', match.start())
# Show end
print('match.end():', match.end())
# show the text that was found
print('match.group(0):', match.group(0))

match.start(): 2
match.end(): 12
match.group(0): name: Dina
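
The other pieces of information mentioned above (the input string and the regular expression) are also available on the match object. The sketch below uses `sample`, a shortened stand-in for `txt`, so that it runs on its own; the pattern is written as a raw string, a notation explained in section 2.g.

```python
import re

# Shortened stand-in for txt: two leading newlines, then the first post
sample = "\n\nname: Dina Ivry\nemail: dimai@gmail.com\n"
m = re.search(r"name: \w\w\w\w", sample)

span = m.span()                # (start, end) as one tuple
same_input = m.string is sample  # the original input string is kept on the match
used_pattern = m.re.pattern    # the pattern text of the compiled regex
sliced = sample[m.start():m.end()]  # slicing the input with the span gives group(0)

print(span, same_input, used_pattern, sliced)
```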


[Go to the beginning of the notebook](#home)
<a id="2d"></a>
#### 2.d.  Finding all instances of a pattern - the `findall` function

You can use `re.findall()` to find all the instances of a pattern in a string. <br/>

For example, if we want to apply the previous pattern (`good_pattern`) on all the posts in the forum:

In [63]:
print(txt)



name: Dina Ivry
email: dimai@gmail.com
time: 2020-11-02 11:32:11
phone: +972-3-52-3434233
city: Tel-aviv
title: knn  
content: can you explain what does the k hyper-parameter mean???

name: Joseph Katzir
email: joek@myemail.ac.il
time: 2020-12-20 13:34:02
phone: (054) 5444443
city: Tel aviv
title: what a great lecture   
content: avinoam this was one of your best





In [64]:
# Returns a list of all matches
re.findall(good_pattern,txt)

['name: Dina', 'name: Jose']

As you can see, it found a match in both forum posts. <br/>

However, as we mentioned in the lecture, while the first name (Dina) is extracted properly,<br/>
only the first 4 characters of the second name (Joseph) were extracted. We will see later how to fix it.
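
One possible fix, sketched here as an assumption about where the material is heading, is the `+` quantifier, which means "one or more of the preceding item". The string `posts` is a shortened stand-in for `txt`, and the patterns are written as raw strings (see section 2.g).

```python
import re

# Shortened stand-in for txt, keeping only the name lines
posts = "name: Dina Ivry\nname: Joseph Katzir\n"

# \w+ matches a run of word characters of any length (at least one)
with_label = re.findall(r"name: \w+", posts)

# Parentheses create a group; findall then returns only the grouped part
first_names = re.findall(r"name: (\w+)", posts)

# With two groups, findall returns one tuple per match
full_names = re.findall(r"name: (\w+) (\w+)", posts)

print(with_label)
print(first_names)
print(full_names)
```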

[Go to the beginning of the notebook](#home)
<a id="2e"></a>
#### 2.e.  the  `split` function
Split is another useful function in the `re` module. Let's see how we can split with the re syntax. <br/>

This should look similar to how you used the split() method with strings, <br/>
but instead of a fixed separator you can use the regular-expression syntax for a more powerful split. <br/>

We will start with a simple example:

In [65]:
email="myaddress@domain.com"

# Term to split on
split_term = '@'

# Split the phrase
re.split(split_term,email)

['myaddress', 'domain.com']

This splits the email exactly into the alias and its domain. <br/>
Let's take a look at a more sophisticated example. Consider email aliases. <br/>
They can be in the form of "first.last" or "first-last" or "first_last". <br/>

Your task is to split them into first and last. 

For that we'll make use of `character sets` and the split function (more on `character ranges` in the next section).

In [66]:
names=["first last","first_last","first.last","first-last"]
char_range="[ ._-]"

for name in names:
    print('splitting "{}" into:'.format(name),re.split(char_range,name))

splitting "first last" into: ['first', 'last']
splitting "first_last" into: ['first', 'last']
splitting "first.last" into: ['first', 'last']
splitting "first-last" into: ['first', 'last']


[Go to the beginning of the notebook](#home)
<a id="2f"></a>
#### 2.f. Character sets and ranges

Character sets are used when you wish to match any one of a group of characters at a point in the input. <br/>
Brackets are used to construct character set inputs. <br/>

For example: the input **[ab]** searches for occurrences of either a or b.<br/>
As character sets grow larger, typing every character that should (or should not) match could become very tedious. <br/>

A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. 
**The format used is [start-end]**.

A common use case is to search for a specific range of letters in the alphabet: <br/>
for example, [a-f] matches any single letter between a and f.

Let's walk through some examples:

In [67]:
# find every capital letter followed by three word characters
cap_pattern="[A-Z]\w\w\w"

# Returns a list of all matches
re.findall(cap_pattern,txt)

['Dina', 'Ivry', 'Jose', 'Katz']

In [68]:
# find all dates in format yyyy-mm-dd
date_pattern="[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"

# Returns a list of all matches
re.findall(date_pattern,txt)

['2020-11-02', '2020-12-20']

As you can see this is very powerful! However it is a bit tedious to write, for that we introduce the escape codes:

[Go to the beginning of the notebook](#home)
<a id="2g"></a>
#### 2.g. Escape codes
You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more.

For example:

<table class="docutils" border="1">

<thead valign="bottom">
<tr class="row-odd">
<th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even">
<td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd">
<td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even">
<td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd">
<td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even">
<td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and underscore)</td>
</tr>
<tr class="row-odd">
<td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric (anything <tt class="docutils literal"><span class="pre">\w</span></tt> does not match)</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash (\\). <br/>
Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. <br/>

Using raw strings, created by prefixing the literal value with **r**, for creating regular expressions eliminates this problem and maintains readability. 

Let's take a fresh look at the previous pattern for finding date expressions:

In [69]:
# find all dates in format yyyy-mm-dd
date_pattern=r'\d\d\d\d-\d\d-\d\d'

# Returns a list of all matches
re.findall(date_pattern,txt)

['2020-11-02', '2020-12-20']

The **r** prefix, together with the doubled backslash needed to match a literal backslash, is probably one of the things <br/>
that block someone who is not familiar with regex in Python from being able to read regex code at first. <br/>

Hopefully after seeing these examples this syntax will become clear. 

In [72]:
txt2=r"I will eat 1\2\3 oranges"

# Returns the number of oranges I will eat
eat_pattern=r"\d\\\d\\\d"


re.findall(eat_pattern,txt2)

['1\\2\\3']
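
To make the raw-string point concrete, the sketch below writes the same pattern and the same text twice, once as raw strings and once as normal strings, and checks that Python sees identical values. The variable names are illustrative.

```python
import re

# The same regex written as a raw string and as a normal string
raw_pattern = r"\d\\\d\\\d"
plain_pattern = "\\d\\\\\\d\\\\\\d"   # every backslash has to be doubled
same_pattern = raw_pattern == plain_pattern

# The same text written both ways: it contains two real backslashes
raw_text = r"I will eat 1\2\3 oranges"
plain_text = "I will eat 1\\2\\3 oranges"
same_text = raw_text == plain_text

found = re.findall(raw_pattern, raw_text)

print(same_pattern, same_text, len(raw_pattern))
print(found[0])   # print shows the two real backslashes, without the doubling used by repr
```

The list output `['1\\2\\3']` above is the `repr` of the string, which is why each backslash appears doubled there.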

[Go to the beginning of the notebook](#home)
<a id="2h"></a>
#### 2.h. Or expression

We can use the pipe `|` to define an or between regular expressions:

In [73]:
weekdays = "We could meet Monday or Wednesday"
pattern = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
re.findall(pattern , weekdays)

['Monday', 'Wednesday']
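
The pipe has the lowest precedence of all regex operators, so it splits the whole pattern unless a group limits its scope. The following sketch illustrates this with a non-capturing group `(?:...)`; the shortened pattern is an illustration, not part of the original example.

```python
import re

weekdays = "We could meet Monday or Wednesday"

# Without a group, the alternatives are "Mon" and "Wednesday"
ungrouped = re.findall(r"Mon|Wednesday", weekdays)

# With a non-capturing group, the alternation covers only the prefixes,
# and the shared suffix "day" is written once
grouped = re.findall(r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day", weekdays)

print(ungrouped)
print(grouped)
```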

[Go to the beginning of the notebook](#home)
<a id="ex2"></a>
### Exercise 2 - Regular Expressions
1. Write a regular expression to extract all time expressions in the text string `txt`
2. Write a regular expression to extract urls from text (including the http or https prefix)

In [74]:
# for part 2, will be printed later
url1='My favorite site is http://www.md.hit.ac.il/machinelearning I go there every day (yeah right).'
url2='Please post your answer to the form https://www.feedback.com/review_forms/form1.aspx'
url3="I visited https://edition.cnn-news.co.il/sports/nba/games-standing.html"  

In [75]:
# for part 1
print(txt.strip())

name: Dina Ivry
email: dimai@gmail.com
time: 2020-11-02 11:32:11
phone: +972-3-52-3434233
city: Tel-aviv
title: knn  
content: can you explain what does the k hyper-parameter mean???

name: Joseph Katzir
email: joek@myemail.ac.il
time: 2020-12-20 13:34:02
phone: (054) 5444443
city: Tel aviv
title: what a great lecture   
content: avinoam this was one of your best



In [101]:
# your solution 1:
regex = r'\d{4}-\d{2}-\d{2}'
re.findall(regex, txt)
regex = r'\d{2}:\d{2}'
re.findall(regex, txt)

['2020-11-02', '2020-12-20']

['11:32', '13:34']
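
The second pattern above stops at the minutes. If the seconds should be included, one possible variant adds a third `\d{2}` group of digits. In this sketch `log` is a shortened stand-in for the `time:` lines of `txt`.

```python
import re

# Shortened stand-in for the time lines in txt
log = "time: 2020-11-02 11:32:11\ntime: 2020-12-20 13:34:02\n"

# hh:mm:ss
times = re.findall(r"\d{2}:\d{2}:\d{2}", log)

# Full timestamp: date, one space, time
timestamps = re.findall(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", log)

print(times)
print(timestamps)
```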

In [80]:
# for part 2:
print(url1)
print(url2)
print(url3)

My favorite site is http://www.md.hit.ac.il/machinelearning I go there every day (yeah right).
Please post your answer to the form https://www.feedback.com/review_forms/form1.aspx
I visited https://edition.cnn-news.co.il/sports/nba/games-standing.html


In [105]:
# your solution 2:
# '?' after '+' or '*' makes the preceding quantifier non-greedy (it matches as little as possible).
# (?:____) - a non-capturing group: it lets us treat several characters as one unit and then apply '?' or '*' to it.
# Note: a repetition range is written {0,3} (with a comma); {0-3} would be matched literally.
regex = r'https?://(?:www\d{0,3}[.])?[a-zA-Z.\-_]+?(?:/\w+?)*(?:[.]com|[.]il|[.]net|[.]org)(?:/[a-zA-Z.\-_]+[.]?\w*)*'
all_urls = url1 + '\n' + url2 + '\n' + url3
re.findall(regex, all_urls)

['http://www.md.hit.ac.il/machinelearning',
 'https://www.feedback.com/review_forms/form1',
 'https://edition.cnn-news.co.il/sports/nba/games-standing.html']

In [106]:
### solution:
urls=[url1,url2,url3]

pattern=r"https?://[\w-]+\.([\w-]+\.)+(/?[\w\.\-\_]+)*"

re.findall(pattern, all_urls)

for url in urls:
    m=re.search(pattern,url)
    if(m):
        print('url found:',m.group(0))

[('ac.', '/machinelearning'),
 ('feedback.', '/form1.aspx'),
 ('co.', '/games-standing.html')]

url found: http://www.md.hit.ac.il/machinelearning
url found: https://www.feedback.com/review_forms/form1.aspx
url found: https://edition.cnn-news.co.il/sports/nba/games-standing.html
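
The list of tuples printed by `findall` above is a consequence of the capturing groups in the pattern: when a pattern contains groups, `findall` returns the groups instead of the whole match. As a sketch, switching to non-capturing groups `(?:...)` makes `findall` return the full URL. The variable names below are illustrative, and only the third URL is used to keep the example short.

```python
import re

url_text = "I visited https://edition.cnn-news.co.il/sports/nba/games-standing.html"

# Same pattern as above: two capturing groups
capturing = r"https?://[\w-]+\.([\w-]+\.)+(/?[\w\.\-\_]+)*"

# Identical logic, but the groups no longer capture
non_capturing = r"https?://[\w-]+\.(?:[\w-]+\.)+(?:/?[\w\.\-\_]+)*"

groups_only = re.findall(capturing, url_text)
full_urls = re.findall(non_capturing, url_text)

print(groups_only)
print(full_urls)
```

For a repeated group, only the last repetition is kept, which is why the tuple holds `'co.'` rather than every domain part.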


<a id="3"></a>
## 3. Text in Pandas Series

[Go to the beginning of the notebook](#home)
<a id="3a"></a>
#### 3.a.  Pandas Series and Index objects containing strings
String operations on a Series or Index are available via the ``str`` attribute. <br/>
So, for example, suppose we create a Pandas Series with this data:

In [107]:
import pandas as pd
sr_names = pd.Series(lst_names2)
sr_names

0    peter
1     Paul
2     MARY
3    gUIDO
dtype: object

We can now call a single method that converts all the entries to uppercase; any missing values would be skipped over rather than raise an error:

In [108]:
sr_names.str.upper()

0    PETER
1     PAUL
2     MARY
3    GUIDO
dtype: object
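
Since the series above has no missing entries, here is a small sketch with a hypothetical series, `names_with_missing`, that does contain one. It is meant to show that the `str` methods pass missing values through instead of raising an error.

```python
import pandas as pd

# Hypothetical data: the same names with one missing value
names_with_missing = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

upper = names_with_missing.str.upper()
capitalized = names_with_missing.str.capitalize()

print(upper)
print(capitalized)
```

The missing entry stays missing in both results, so no explicit `None` check is needed.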

Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.

[Go to the beginning of the notebook](#home)
<a id="3b"></a>
#### 3.b.  Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. <br/>
Here is a list of Pandas ``str`` methods that mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

Notice that these have various return values. Some, like ``lower()``, return a series of strings, as the examples in the next section show.

[Go to the beginning of the notebook](#home)
<a id="3c"></a>
#### 3.c.  Using Pandas String Methods
Pandas string syntax is similar to basic Python string operations. <br/>
The examples in this section use the following series of names:

In [109]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

In [110]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

But some others return numbers:

In [111]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

Or Boolean values:

In [112]:
monte.str.startswith('Terry')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

Still others return lists or other compound values for each element:

In [113]:
s2 = monte.str.lower().str.split()
s2

0    [graham, chapman]
1       [john, cleese]
2     [terry, gilliam]
3         [eric, idle]
4       [terry, jones]
5     [michael, palin]
dtype: object

In [114]:
monte.str.split("e")

0    [Graham Chapman]
1    [John Cl, , s, ]
2    [T, rry Gilliam]
3        [Eric Idl, ]
4     [T, rry Jon, s]
5    [Micha, l Palin]
dtype: object

We'll see further manipulations of this kind of series-of-lists object as we continue our discussion.

[Go to the beginning of the notebook](#home)
<a id="3d"></a>
#### 3.d.  Series string methods with regular expressions

In addition, there are several methods that accept regular expressions to examine the content of each string element, <br/>
and follow some of the API conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning the groups of the first match as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |

With these, you can do a wide range of interesting operations.<br/>
For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:

In [115]:
monte
monte.str.extract('([A-Za-z]+)', expand=False)

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [116]:
monte.str.extract('([a-z]+)', expand=False)

0     raham
1       ohn
2      erry
3       ric
4      erry
5    ichael
dtype: object

Or we can do something more complicated, like finding all names that start and end with a consonant, <br/>
making use of the start-of-string (``^``) and end-of-string (``$``) regular expression characters:

In [117]:
monte

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

In [118]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

In [125]:
monte.str.findall(r'[^AEIOU](?:\w+ ?)+[^aeiou]')
monte.str.findall(r'[^AEIOU]\w+[^aeiou]')

0    [Graham Chapman]
1        [John Clees]
2     [Terry Gilliam]
3           [ric Idl]
4       [Terry Jones]
5     [Michael Palin]
dtype: object

0    [Graham , Chapman]
1        [John , Clees]
2     [Terry , Gilliam]
3                [ric ]
4       [Terry , Jones]
5     [Michael , Palin]
dtype: object

The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries <br/>
opens up many possibilities for analysis and cleaning of data.
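
Three methods from the table above that were not demonstrated are `contains()`, `count()` and `replace()`. The sketch below applies them to the same `monte` series, redefined here so the block runs on its own. The result names are illustrative, and `regex=True` is passed explicitly to `replace()` so the pattern is treated as a regular expression.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Boolean mask: does the element start with "Terry"?
is_terry = monte.str.contains(r'^Terry')

# Number of lowercase "e" characters in each element
e_count = monte.str.count(r'e')

# Swap "First Last" into "Last, First" using two groups and backreferences
swapped = monte.str.replace(r'^(\w+) (\w+)$', r'\2, \1', regex=True)

print(is_terry.tolist())
print(e_count.tolist())
print(swapped.tolist())
```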

[Go to the beginning of the notebook](#home)
<a id="3e"></a>
#### 3.e. Series  miscellaneous string methods
Finally, there are some miscellaneous methods that enable other convenient operations:

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | extract dummy variables as a dataframe |
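
A few of the methods in this table can be tried on a tiny series. This is only a sketch: the series `letters` is made-up data, chosen so the effect of each call is easy to see.

```python
import pandas as pd

letters = pd.Series(['ab', 'cd'])   # hypothetical data

joined = letters.str.cat(sep='-')                        # one string from all elements
padded = letters.str.pad(4, side='left', fillchar='*')   # pad each element to width 4
repeated = letters.str.repeat(2)                         # repeat each element twice
first_chars = letters.str.get(0)                         # character at position 0

print(joined)
print(padded.tolist())
print(repeated.tolist())
print(first_chars.tolist())
```

Note that `cat()` without a second series collapses the whole series into a single Python string, while the other methods return a new series of the same length.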

[Go to the beginning of the notebook](#home)
<a id="3f"></a>
#### 3.f. Series item access and slicing

The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access from each string.<br/>
For example, we can get a slice of the first three characters of each string using ``str.slice(0, 3)``.<br/>
Note that this behavior is also available through Python's normal indexing syntax; for example, <br/>
``monte.str.slice(0, 3)`` is equivalent to ``monte.str[0:3]``:

In [129]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

Indexing via ``monte.str.get(i)`` and ``monte.str[i]`` is likewise equivalent.

These ``get()`` and ``slice()`` methods also let you access elements of the lists returned by ``split()``.<br/>
For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:

In [130]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object

[Go to the beginning of the notebook](#home)
<a id="3g"></a>
#### 3.g. Indicator variables

Another method that requires a bit of extra explanation is the ``get_dummies()`` method.<br/>
This is useful when your data has a column containing some sort of coded indicator.<br/>
For example, we might have a dataset that contains information in the form of codes, such as <br/>
    A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":

In [131]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


The ``get_dummies()`` routine lets you quickly split out these indicator variables into a ``DataFrame``:

In [132]:
full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1


With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.

We won't dive further into these methods here, but I encourage you to read through ["Working with Text Data"](http://pandas.pydata.org/pandas-docs/stable/text.html) in the Pandas online documentation.