
<br>
<p style="text-align: left;"><img src='https://s3.amazonaws.com/weclouddata/images/logos/sunlife_logo.png' width='35%'></p>
<p style="text-align:left;"><font size='15'><b> Python - Strings and Regex </b></font> <br><font color='#559E54' size='6'>(Instructor Copy)</font></p>
<h2 align='left' > Sunlife Data Science Training </h2>

<h4 align='left'>  Prepared by: <img src='https://s3.amazonaws.com/weclouddata/images/logos/wcd_logo.png' width='15%'>

---

# <font color='#347B98'> 1. Python Strings


## $\Delta$ 1.1 Introduction to Python Unicode and Character Encoding

### History of ASCII

In a computer system, information is represented in `bits`. Each bit has two states: 0 or 1. One byte consists of 8 bits and can therefore represent 2^8 = 256 different states, from `00000000` to `11111111`. 

In 1968, the **`American Standard Code for Information Interchange`**, better known by its acronym **`ASCII`**, was standardized. ASCII defined numeric codes for various characters, with the numeric values running from 0 to 127.

Check out Wikipedia for more details on ASCII [Wikipedia: ASCII](https://zh.wikipedia.org/wiki/ASCII)

<img src="https://s3.amazonaws.com/weclouddata/images/python/ASCII.png" width="50%">

Note that for ASCII codes, only 7 bits are used. The leading bit is set to 0. 
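
The 7-bit property can be checked directly with the built-in `ord`. This is a minimal sketch; the sample text is an arbitrary choice made for illustration.

```python
# Every ASCII character has a code between 0 and 127, so it fits in 7 bits.
sample = "Sun Life"
codes = [ord(ch) for ch in sample]

# format(code, "08b") pads each code to 8 bits; for ASCII the leading bit is always 0
bits = [format(code, "08b") for code in codes]

for ch, code, b in zip(sample, codes, bits):
    print(repr(ch), code, b)
```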

### Problem with ASCII 
ASCII can represent 128 characters, which works fine for English but not for most other languages. The leading bit, unused in ASCII, was therefore put to work in extended 8-bit encodings. For example, in some of these encodings (such as IBM code page 437) the French letter é is encoded as 130 (binary **`10000010`**), so European languages could use the extra bit to represent up to 256 characters. 

However, 256 characters aren’t very many. For example, you can’t fit both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian into the 128–255 range because there are more than 128 such characters.

Chinese has tens of thousands of characters and therefore cannot be encoded in 8 bits. 
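
The ambiguity of the upper range (128 to 255) can be seen with Python's built-in codecs. The sketch below assumes that the value 130 quoted above comes from IBM code page 437 (`cp437`); `latin-1` and `cp1251` are used as examples of a Western European and a Cyrillic 8-bit encoding.

```python
import unicodedata

# Byte 130 (binary 10000010) is é in IBM code page 437
byte_130 = bytes([130])
e_acute = byte_130.decode("cp437")
print(unicodedata.name(e_acute))    # LATIN SMALL LETTER E WITH ACUTE

# The same byte value can mean different characters in different 8-bit encodings
byte_233 = bytes([233])
western = byte_233.decode("latin-1")    # Western European reading
cyrillic = byte_233.decode("cp1251")    # Cyrillic reading
print(unicodedata.name(western))    # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name(cyrillic))   # CYRILLIC SMALL LETTER SHORT I
```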

### Unicode
Prior to Unicode, there were many different encoding methods. The problem is that the same byte value can mean a different character in each language (encoding method). Therefore, to open and display a text file correctly, one needs to know exactly which encoding was used. That's very painful. 

That led to the birth of Unicode, a character set that **tries to assign a unique code to each available character across different languages**. 

Unicode started out using 16-bit characters instead of 8-bit characters. 16 bits means you have 2^16 = 65,536 distinct values available, making it possible to represent many different characters from many different alphabets; an initial goal was to have Unicode contain the alphabets for every single human language. It turns out that even 16 bits isn’t enough to meet that goal, and the modern Unicode specification uses a wider range of codes, `0` through `1,114,111` ( **`0x10FFFF`** in base 16).
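
The upper bound of the code range can be confirmed from Python itself. A small sketch using only the standard library:

```python
import sys

# The largest code point Python's str type can hold
largest = sys.maxunicode
print(largest, hex(largest))   # 1114111 0x10ffff

# chr() accepts any code point in the range 0 .. 0x10FFFF
last_char = chr(0x10FFFF)

# One past the end of the range is rejected with a ValueError
try:
    chr(0x110000)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
print(out_of_range_ok)         # False
```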

### Problem with Unicode
Note that Unicode is just a character set that assigns a unique code point (a number) to each character. However, it doesn't specify how that number should be stored as bytes. 

For example, the Chinese character **`"谢" (thanks)`** has the code point 35874 (written **`&#35874;`** as an HTML entity), which is **`1000110000100010`** in binary. It takes at least 2 bytes to store this value. The problem is: how does a computer tell whether **`1000110000100010`** is one Unicode character or two single-byte characters? The same bytes mean different characters under the two interpretations. 

Because of that, several different storage methods for Unicode were developed. **`UTF-8`** [wikipedia](https://en.wikipedia.org/wiki/UTF-8) is one of the most popular Unicode-based character encodings of the Internet era. Other encodings include UTF-16 and UTF-32. 
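
To make the difference between a code point and its stored bytes concrete, the sketch below encodes the character 谢 with three Unicode encodings. The character is written as the escape `\u8c22` so the snippet does not depend on the source file's own encoding.

```python
ch = "\u8c22"   # the character 谢
code_point = ord(ch)
print(code_point, format(code_point, "b"))   # 35874 1000110000100010

# One code point, three different byte sequences
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, data in encoded.items():
    print(name, len(data), data.hex())
# utf-8 3 e8b0a2
# utf-16-be 2 8c22
# utf-32-be 4 00008c22
```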

### Encodings
A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented as a set of bytes (meaning, values from 0 through 255) in memory. **The rules for translating a Unicode string into a sequence of bytes are called an encoding**.


<img src='https://s3.amazonaws.com/weclouddata/images/python/encode_decode_string_bytes.png' width='30%'>
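
As an illustration of what "rules for translating code points into bytes" means, here is a hand-written sketch of the UTF-8 rules. The function name `utf8_encode_code_point` is hypothetical and exists only for this example; real code should simply call `str.encode('utf-8')`. The sketch does not reject surrogate code points (U+D800 to U+DFFF), which the built-in encoder refuses.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point with the UTF-8 bit layout (illustrative only)."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare with the built-in encoder for characters of each length
for ch in "a\u00e9\u8c22\U0001f600":
    manual = utf8_encode_code_point(ord(ch))
    print(hex(ord(ch)), manual.hex(), manual == ch.encode("utf-8"))
```

Decoding applies the same rules in reverse, which is why bytes must be decoded with the encoding that produced them.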

---

## $\Delta$ 1.2 String Encoding

### Encode unicode string to bytes

> - Since Python 3.0, strings are stored as Unicode, i.e. each character in the string is represented by a code point. So, each string is just a sequence of Unicode code points.

> - To store or transmit these strings, the sequence of code points is converted into a sequence of bytes. This process is known as encoding.

> - Various encodings exist, and each converts a string to bytes differently. Popular encodings include utf-8 and ascii.

### Syntax
> `string.encode(encoding='UTF-8',errors='strict')`

> **Parameters**:
> - **encoding** - the encoding type a string has to be encoded to  
> - **errors** - response when encoding fails. There are six types of error response
>  - `strict` - default response which raises a UnicodeEncodeError exception on failure
>  - `ignore` - drops the unencodable characters from the result
>  - `replace` - replaces each unencodable character with a question mark ?
>  - `xmlcharrefreplace` - inserts an XML character reference instead of the unencodable character
>  - `backslashreplace` - inserts a \uNNNN escape sequence instead of the unencodable character
>  - `namereplace` - inserts a \N{...} escape sequence instead of the unencodable character


#### <font color='#FC7307'> Encode to default utf-8 encoding

In [4]:
u'加拿大永明金融集团'.encode('utf-8')

b'\xe5\x8a\xa0\xe6\x8b\xbf\xe5\xa4\xa7\xe6\xb0\xb8\xe6\x98\x8e\xe9\x87\x91\xe8\x9e\x8d\xe9\x9b\x86\xe5\x9b\xa2'

In [5]:
len(u'加拿大永明金融集团'.encode('utf-8'))

27

In [6]:
len(u'加拿大永明金融集团')

9

#### <font color='#FC7307'> Encoding with Error Parameter

In [7]:
# unicode string
string = 'pythön!'

# print string
print('The string is:', string)

# ignore error
print('The encoded version (with ignore) is:', string.encode("ascii", "ignore"))

# replace error
print('The encoded version (with replace) is:', string.encode("ascii", "replace"))

The string is: pythön!
The encoded version (with ignore) is: b'pythn!'
The encoded version (with replace) is: b'pyth?n!'
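
The remaining error handlers can be tried on the same string. This sketch writes ö as the escape `\u00f6`; the variable names are illustrative.

```python
text = "pyth\u00f6n!"   # 'pythön!'

xml = text.encode("ascii", "xmlcharrefreplace")
backslash = text.encode("ascii", "backslashreplace")
named = text.encode("ascii", "namereplace")
print(xml)         # b'pyth&#246;n!'
print(backslash)   # b'pyth\\xf6n!'
print(named)       # b'pyth\\N{LATIN SMALL LETTER O WITH DIAERESIS}n!'

# The default handler, 'strict', raises instead of substituting
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # start and end give the position of the character that could not be encoded
    print(type(exc).__name__, exc.start, exc.end)   # UnicodeEncodeError 4 5
```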


### Return the code point for unicode strings

In [13]:
[ord(c) for c in u'sunlife'.encode('utf-8').decode('ascii')]

[115, 117, 110, 108, 105, 102, 101]

## $\Delta$ 1.3 Decoding Bytes

### Decode a unicode string to ascii string

In [14]:
unicode_str = u'sunlife financial'
unicode_str

'sunlife financial'

In [15]:
ascii_str = unicode_str.encode('utf-8').decode('ascii')
ascii_str

'sunlife financial'

#### <font color='#FC7307'>The examples below will fail because the UTF-8 bytes fall outside the ASCII range (0-127) and cannot be decoded as ascii

In [16]:
u'加拿大永明金融集团'.encode('utf-8').decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

In [17]:
unicode_str = u'a\xac\u1234\u20ac\U00008000'
ascii_str = unicode_str.encode('utf-8').decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
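
`bytes.decode` accepts an `errors` argument just like `str.encode`, so a failed decode can also be caught or softened. A minimal sketch using the first character of the example above (加, written as `\u52a0`):

```python
data = "\u52a0".encode("utf-8")   # the three UTF-8 bytes e5 8a a0

# Strict decoding (the default) raises UnicodeDecodeError
try:
    data.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc.reason)   # ordinal not in range(128)

replaced = data.decode("ascii", errors="replace")   # each bad byte becomes U+FFFD
ignored = data.decode("ascii", errors="ignore")     # bad bytes are dropped
print(len(replaced), len(ignored))   # 3 0

# Decoding with the encoding that produced the bytes restores the original text
print(data.decode("utf-8") == "\u52a0")   # True
```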

#### <font color='#FC7307'> Use `ord` to return the Unicode Code Point of one character in a string

In [18]:
unicode_str = u'加拿大永明金融集团'
[ord(c) for c in unicode_str] ### in this case, all ordinal code points are larger than 128

[21152, 25343, 22823, 27704, 26126, 37329, 34701, 38598, 22242]

### <font color='#559E54'> Decoding bytes

In [19]:
s = '加拿大永明金融集团'   # str
u = u'加拿大永明金融集团'  # also str: in Python 3 the u prefix is optional
  
# Convert str (Unicode) to bytes
print(u.encode('utf-8'))

# Convert bytes back to str (Unicode)
print(u.encode('utf-8').decode('utf-8'))

b'\xe5\x8a\xa0\xe6\x8b\xbf\xe5\xa4\xa7\xe6\xb0\xb8\xe6\x98\x8e\xe9\x87\x91\xe8\x9e\x8d\xe9\x9b\x86\xe5\x9b\xa2'
加拿大永明金融集团


### Get system default encoding method
> Starting with Python 3, the default string encoding is 'utf-8'

In [20]:
import sys
sys.getdefaultencoding()

'utf-8'

## $\Delta$ 1.4 Raw strings
> raw strings treat backslashes as literal characters, so escape sequences such as `\n` are not interpreted

In [21]:
# raw strings
print (r'This is a raw string...newlines \r\n are ignored.')

This is a raw string...newlines \r\n are ignored.


In [22]:
print ('This is a raw string...newlines \r\n are ignored.')

This is a raw string...newlines 
 are ignored.
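
The difference is easy to measure with `len`. The Windows-style path below is a made-up value used only for illustration.

```python
# A normal string turns \n into one newline character; a raw string keeps both characters
print(len("\n"), len(r"\n"))   # 1 2

# A raw string is equivalent to doubling every backslash
print(r"\n" == "\\n")          # True

# Raw strings are convenient for text with many backslashes
path = r"C:\temp\new"
print(path)                    # C:\temp\new
print(len(path))               # 11
```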


#### <font color='#FC7307'> `chr` to return a Unicode string of one character with ordinal i

In [23]:
help(chr)

Help on built-in function chr in module builtins:

chr(i, /)
    Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.



In [25]:
[ord(c) for c in u'life'.encode('utf-8').decode('ascii')]

[108, 105, 102, 101]

In [26]:
u = u'sun' + chr(108) + chr(105) + chr(102) + chr(101)
print(u)
print (u.encode('utf-8'))
print(list(ord(c) for c in u))

sunlife
b'sunlife'
[115, 117, 110, 108, 105, 102, 101]


---
## $\Delta$ 1.5 Python String Representation


<img src='https://s3.amazonaws.com/weclouddata/images/python/python_string_index_representation_1.png' width='25%'>
<img src='https://s3.amazonaws.com/weclouddata/images/python/python_string_index_representation_2.png' width='26%'>

In [31]:
# A string object is immutable; slicing returns a new string

s = "Sunlife Data Science Training"
print(s)
print(s[21:])

Sunlife Data Science Training
Training
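
Immutability can be shown by attempting an in-place change. A short sketch reusing the string from the cell above:

```python
s = "Sunlife Data Science Training"

# Item assignment is not allowed on a str
try:
    s[0] = "s"
    mutated = True
except TypeError as exc:
    mutated = False
    print(exc)   # 'str' object does not support item assignment

# To "change" a string, build a new one from slices
lowered = "s" + s[1:]
print(lowered)           # sunlife Data Science Training
print(s)                 # Sunlife Data Science Training (unchanged)

# Negative indices count from the end of the string
print(s[-8:], s[8:12])   # Training Data
```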


## $\Delta$ 1.6 - String Operators and Methods


### String <font color='#559E54'> Concatenation </font> - example 1

In [32]:
vendor = 'WeCloudData'
client = 'Sunlife'
courses = ['Python', 'Data Science', 'Machine Learning','Big Data']

print(vendor + ' offers data science training to ' + client)

WeCloudData offers data science training to Sunlife


### String <font color='#559E54'> Concatenation </font> - example 2

In [33]:
# expect an error
vendor + 'will teach ' + len(courses) + ' courses!'

TypeError: Can't convert 'int' object to str implicitly

In [34]:
# casting int to str
vendor + ' will teach ' + str(len(courses)) + ' courses!'

'WeCloudData will teach 4 courses!'

### String <font color='#559E54'> Slicing/Indexing </font> 
> `string` is a sequence in python

In [36]:
# string is a sequence in Python
a = 'sunlife'
b = a[0:1] + a[3] + a[5:6]
b

'slf'

### String <font color='#FC7307'> Replication </font> 


In [37]:
s = "spam"
e = "egg"

print (s*3 + e)

spamspamspamegg


In [38]:
print ("+"*50)

++++++++++++++++++++++++++++++++++++++++++++++++++


## $\Delta$ 1.7 - String Formatting

In [39]:
course = 'Data Science with Python'
vendor = 'WeCloudData'
print ('Welcome to day 2 of the {0} course delivered by {1}'.format(course, vendor))


Welcome to day 2 of the Data Science with Python course delivered by WeCloudData


In [40]:
print ("The training starts on (MM/DD/YYYY) {0:02d}/{1:02d}/{2:04d}".format(5, 28, 2018) )

The training starts on (MM/DD/YYYY) 05/28/2018


In [41]:
print ('Total with tax: ${0:.2f}'.format(2500 * 1.13) )

Total with tax: $2825.00
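
The same format specifiers also work in f-strings, which are available from Python 3.6 onward. This is an alternative sketch, not a replacement for `str.format`; the variable names `month`, `day` and `year` are introduced here for illustration.

```python
course = "Data Science with Python"
vendor = "WeCloudData"
month, day, year = 5, 28, 2018

# Expressions go directly inside the braces; format specs follow the colon
welcome = f"Welcome to day 2 of the {course} course delivered by {vendor}"
start = f"The training starts on {month:02d}/{day:02d}/{year:04d}"
total = f"Total with tax: ${2500 * 1.13:.2f}"

print(welcome)
print(start)   # The training starts on 05/28/2018
print(total)   # Total with tax: $2825.00
```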


## $\Delta$ 1.8 - String Methods

>
`s.lower()`    
`s.upper()`   
`s.capitalize()`        
`s.strip()`   
`s.isalpha()`  
`s.isdigit()`  
`s.isspace()`  
`s.startswith()`  
`s.endswith()`  
`s.replace()`  
`s.split()`    
`s.join()`  
`s.index()`  
`s.count()`  
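
A quick tour of the methods listed above, as a sketch on an arbitrary sample string:

```python
s = "  Sunlife Data Science Training  "

clean = s.strip()                     # remove leading and trailing whitespace
print(clean.lower())                  # sunlife data science training
print(clean.upper())                  # SUNLIFE DATA SCIENCE TRAINING
print("sunlife".capitalize())         # Sunlife

print(clean.startswith("Sunlife"), clean.endswith("Training"))   # True True
print("2018".isdigit(), "Sunlife".isalpha(), " \t".isspace())    # True True True

print(clean.replace("Data Science", "Python"))   # Sunlife Python Training

words = clean.split()                 # split on whitespace
print(words)                          # ['Sunlife', 'Data', 'Science', 'Training']
print("-".join(words))                # Sunlife-Data-Science-Training

print(clean.index("Data"), clean.count("a"))     # 8 3
```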


# <font color='#559E54'> $\Omega$ Python String Labs </font>

## String Lab 1 - Find all 'triangle rewards' keywords in a string

### Retrieve data from a webpage using `urllib.request`

In [43]:
!pip install beautifulsoup4 --user
# urllib is part of the Python standard library, so it does not need to be installed with pip

Collecting beautifulsoup4
  Using cached https://files.pythonhosted.org/packages/1d/5d/3260694a59df0ec52f8b4883f5d23b130bc237602a1411fa670eae12351e/beautifulsoup4-4.7.1-py3-none-any.whl
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Using cached https://files.pythonhosted.org/packages/b9/a5/7ea40d0f8676bde6e464a6435a48bc5db09b1a8f4f06d41dd997b8f3c616/soupsieve-1.9.1-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.7.1 soupsieve-1.9.1


In [55]:
import urllib.request
from bs4 import BeautifulSoup
import re

link = 'https://www.sunlife.ca/ca/Insurance/Life+insurance?vgnLocale=en_CA'
link_html = urllib.request.urlopen(urllib.request.Request(link)).read()
soup_obj = BeautifulSoup(link_html, 'html.parser')

In [65]:
#link_html

### Parse an HTML using `bs4`

In [57]:
texts = soup_obj.findAll(text=True)

def visible(element):
    if element.parent.name in ['style', 'script', 'document']:
        return False
    elif re.match('<!--.*-->', element):
        return False
    elif re.match('\n', element): 
        return False
    return True

#visible_texts = filter(visible, texts)
visible_texts = [text for text in texts if visible(text)]

In [64]:
#visible_texts

### Question: concatenate all string elements in the above list `visible_texts` into a string

In [59]:
#####################
# Your code below
#####################

sunlife_page = ' '.join(visible_texts)

In [66]:
#print(sunlife_page)

### Question: how many times does the word 'insurance' occur on this page?

In [63]:
sunlife_page.lower().count('insurance')

114
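
Note that `str.count` counts substrings, so words that merely contain 'insurance' are included. The sketch below contrasts it with a whole-word count on a small made-up sentence (not the live page); it uses the `re` module, which is introduced in the next part.

```python
import re

# Hypothetical sample text, chosen so the two counts differ
page = "Life insurance and health insurance. Insurance-related reinsurance."

# Substring count: also counts the 'insurance' inside 'reinsurance'
substring_count = page.lower().count("insurance")

# Whole-word count: \b requires a word boundary on both sides
whole_word_count = len(re.findall(r"\binsurance\b", page, flags=re.IGNORECASE))

print(substring_count, whole_word_count)   # 4 3
```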

----

<br>

# <font color='#347B98'> 2. Python Regular Expression Tutorial 

### `re.search()`
The `re` module has a number of top-level functions. To test whether a regular expression matches a specific string in Python, you can use `re.search()`. This function returns `None` if the pattern doesn't match, or a match object (`re.Match`) with additional information about which part of the string matched.

> Note that this method stops after the first match, so this is best suited for testing a regular expression more than extracting data.

### `re.findall()`
Unlike the `re.search()` method above, we can use `re.findall()` to perform a global search over the whole input string. If there are capture groups in the pattern, then it will return a list of all the captured data, but otherwise, it will just return a list of the matches themselves, or an empty list if no matches are found.


----------

**Reference**
> [Python Regex Documentation](https://docs.python.org/3/howto/regex.html)  
> [Python Regex Tester](https://pythex.org/) -- very useful tool

### <font color='#FC7307'> Matching Characters
**`a, X, 9`** Ordinary characters just match themselves exactly.   

**`.`** (a period) 
Matches any single character except newline '\n'  

**`\b`** Boundary between word and non-word  

**`\t, \n, \r`** tab, newline, return  

**`^`** Matches the start of the string  

**`$`** Matches the end of the string  

**`\`** inhibit the "specialness" of a character. So, for example, use `\.` to match a period or `\\` to match a backslash.  

**`\d`**
Matches any decimal digit; this is equivalent to the class [0-9].  

**`\D`**
Matches any non-digit character; this is equivalent to the class [^0-9].  

**`\s`**
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].  

**`\S`**
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].  

**`\w`**
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].  

**`\W`**
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].  

**`Meta-Characters`**
The meta-characters do not match themselves because they have special meanings: 
   `. ^ $ * + ? { } [ ] \ | ( )` 
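
A few of these building blocks in action, as a minimal sketch on a made-up two-line string (the phone number and price are arbitrary):

```python
import re

text = "Call 416-555-0199 at 9am.\nCost: $25.50"

# \d+ : one or more digits
digits = re.findall(r"\d+", text)
print(digits)    # ['416', '555', '0199', '9', '25', '50']

# \b and \w : whole words starting with a capital C
c_words = re.findall(r"\bC\w+", text)
print(c_words)   # ['Call', 'Cost']

# \$ and \. : escaped meta-characters match the literal characters
prices = re.findall(r"\$\d+\.\d+", text)
print(prices)    # ['$25.50']

# ^ and $ : by default they anchor to the start and end of the whole string
print(bool(re.search(r"^Call", text)))   # True
print(bool(re.search(r"^Cost", text)))   # False
print(bool(re.search(r"50$", text)))     # True
```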

## $\Delta$ 2.1 Basic Regex Search - `re.search`
* `search()` scans through a string, looking for any location where this RE matches.

### Find `dates`

In [68]:
import re

regex = r"([a-zA-Z]+) (\d+)"

# Run the search once and reuse the resulting match object
match = re.search(regex, "June 24")
if match:
    print("Match at index {},{}".format(match.start(), match.end()))
    
    # So this will print "June 24"
    print("Full match: {}".format(match.group(0)))
    # So this will print "June"
    print("Month: %s" % (match.group(1)))
    # So this will print "24"
    print("Day: %s" % (match.group(2)))
else:
    # If re.search() does not match, then None is returned
    print("The regex pattern does not match. :(")

Match at index 0,7
Full match: June 24
Month: June
Day: 24


### Find digits

In [69]:
import re

s = 'There\'re 20 participants in the data science training.' 
match = re.search(r'\d+', s)

if match:                      
    print ('Found number: {}'.format(match.group()))
else:
    print ('did not find')
        

Found number: 20


### <font color='#FC7307'> Group Extraction
* The "group" feature of a regular expression allows you to pick out parts of the matching text. 
* Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: `r'([\w.-]+)@([\w.-]+)'`. 
* In this case, the parentheses do not change what the pattern will match. Instead, they establish logical "groups" inside the matched text. 

In [71]:
s = 'Shaohua\'s weclouddata email address is shaohua@weclouddata.com.'
match = re.search(r'([\w.-]+)@([\w.-]+)', s)   # raw string so \w is passed to the regex engine unchanged
if match:
    print ('Email: {}'.format(match.group()))   ## the whole match; the trailing period is included because '.' is in the character class
    print ('Username: {}'.format(match.group(1)))  ## 'shaohua' (the username, group 1)
    print ('Host: {}'.format(match.group(2)))  ## the host, group 2 (also with the trailing period)
else:
    print ('find no match!')

Email: shaohua@weclouddata.com.
Username: shaohua
Host: weclouddata.com.
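
The trailing period in the output above is captured because `.` is part of the host character class. One possible refinement, sketched below, requires every dot in the host to be followed by more host characters and uses named groups; the group names `user` and `host` are illustrative choices.

```python
import re

s = "Shaohua's weclouddata email address is shaohua@weclouddata.com."

# Host = one label, then one or more ".label" parts, so it cannot end with a dot
pattern = r"(?P<user>[\w.-]+)@(?P<host>[\w-]+(?:\.[\w-]+)+)"

match = re.search(pattern, s)
if match:
    print(match.group())         # shaohua@weclouddata.com
    print(match.group("user"))   # shaohua
    print(match.group("host"))   # weclouddata.com
```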


## $\Delta$ 2.2 `re.findall()`
- Unlike the `re.search()` method above, we can use `re.findall()` to perform a global search over the whole input string. If there are capture groups in the pattern, then it will return a list of all the captured data, but otherwise, it will just return a list of the matches themselves, or an empty list if no matches are found.

- If you need additional context for each match, you can use `re.finditer()`, which instead returns an iterator of match objects to walk through. Both functions take the same parameters.

### Find all `dates`

In [72]:
import re
# Lets use a regular expression to match a few date strings.
regex = r"[a-zA-Z]+ \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    print("Full match: %s" % (match))

Full match: June 24
Full match: August 9
Full match: Dec 12


In [73]:
# To capture the specific months of each date we can use the following pattern
regex = r"([a-zA-Z]+) \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    print("Match month: %s" % (match))

Match month: June
Match month: August
Match month: Dec


In [74]:
# If we need the exact positions of each match
regex = r"([a-zA-Z]+) \d+"
matches = re.finditer(regex, "June 24, August 9, Dec 12")
for match in matches:
    print("Match at index: %s, %s" % (match.start(), match.end()))

Match at index: 0, 7
Match at index: 9, 17
Match at index: 19, 25
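
When the pattern contains more than one capture group, `re.findall()` returns a list of tuples, one tuple per match. A short sketch on the same date string; the dictionary `days_by_month` is an illustrative way to use the result.

```python
import re

# Two capture groups: month name and day number
regex = r"([a-zA-Z]+) (\d+)"
pairs = re.findall(regex, "June 24, August 9, Dec 12")
print(pairs)           # [('June', '24'), ('August', '9'), ('Dec', '12')]

# Each tuple can be unpacked directly
days_by_month = {month: int(day) for month, day in pairs}
print(days_by_month)   # {'June': 24, 'August': 9, 'Dec': 12}
```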



# <font color='#559E54'> $\Omega$ Python Regex Labs </font>

## $\Delta$ Regex Lab 1 - Find Valid Emails

<img src='https://s3.amazonaws.com/weclouddata/images/python/emailregex-explained-gui.jpeg' width='50%'>

In [75]:
messy_emails = ['A keen nature photographer and a nature conservationist, he is a resource person with the NParks Biodiversity Centre for research in the field of butterflies. Sin Khoon was one of the pioneers in the creation of a free-ranging butterfly trail in Singapore with the Butterfly Trail at Alexandra Hospital, Butterfly Hill @ Pulau Ubin, Butterfly Garden @ Hort Park and various other trails at Park Connectors and urban gardens. @Kallang River-Bishan Park – is a partnership between the', 
'zengyiwen@u.nus.edu',
'dbstanlw@nus.edu.sg',
'dbssong@nus.edu.sg',
'dbslbhr@nus.edu.sg',
'rebecca.loh.ker@u.nus.edu',
'Email: hello_world@comp.nus.edu.sg where hello_world=wenzy', 
'Email: lingtw@comp.nus.edu.sg',
'Copyright @ ljklonepiece', 
'Teacher @ MOE', 
'@comp.nus.edu.sg','Email: loolf(@)comp(dot)nus(dot)edu(dot)sg',
'[firstname][lastname]@comp.nus.edu.sg'
]

In [76]:
###########################
## Your Code Below
###########################

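import re

# Compile the pattern once; ^ and $ require the whole string to be an email address
email_pattern = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$")

# Search each string once and keep only the successful matches
matches = (email_pattern.search(email) for email in messy_emails)
matched_emails = [match.group() for match in matches if match]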

In [77]:
matched_emails

['zengyiwen@u.nus.edu',
 'dbstanlw@nus.edu.sg',
 'dbssong@nus.edu.sg',
 'dbslbhr@nus.edu.sg',
 'rebecca.loh.ker@u.nus.edu']
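
The anchored pattern only accepts strings that consist of an email address and nothing else. If addresses embedded in longer text should also be extracted, one possible variation is to drop the anchors and use `findall`. This is a sketch on a few strings copied from `messy_emails`, not the expected lab answer; whether such embedded addresses count as valid depends on the lab's requirements.

```python
import re

# Same character classes as the lab pattern, without ^ and $
email_anywhere = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

samples = [
    "Email: hello_world@comp.nus.edu.sg where hello_world=wenzy",
    "Email: lingtw@comp.nus.edu.sg",
    "Copyright @ ljklonepiece",
    "[firstname][lastname]@comp.nus.edu.sg",
]

# Collect every address found in every string
found = [address for text in samples for address in email_anywhere.findall(text)]
print(found)   # ['hello_world@comp.nus.edu.sg', 'lingtw@comp.nus.edu.sg']
```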