# Elements Of Data Processing (2020S1) - Week 4


## Regular expressions 
Regular expressions allow you to match patterns in strings, rather than matching exact characters.  
For example, 
if I wished to find all phone numbers of the form (03) xxxx xxxx, where x is some arbitrary digit, 
I could use a regular expression like this: 
    
\(03\) \d\d\d\d \d\d\d\d

*or*

\(03\) \d{4} \d{4}

The **re** library in Python allows you to use regular expressions.  It provides a number of useful functions, 
including:
    
***search*** - Searches for a particular pattern in a string

***findall*** - Finds all substrings that match a particular pattern

***sub*** - Replaces substrings that match a particular pattern with a new substring
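
As a rough sketch of how these three functions behave, the snippet below applies each of them to a made-up string. The sample text and the variable names are assumptions chosen for illustration only.

```python
import re

# Hypothetical sample text containing two phone numbers in the (03) xxxx xxxx format
text = 'Office: (03) 9923 1123, fax: (03) 9923 4456'
pattern = r'\(03\) \d{4} \d{4}'

# search returns a match object for the first match, or None if nothing matches
first = re.search(pattern, text)
print(first.group(0))

# findall returns a list of every non-overlapping matching substring
numbers = re.findall(pattern, text)
print(numbers)

# sub returns a new string in which every match has been replaced
masked = re.sub(pattern, '<phone>', text)
print(masked)
```

Note that ***search*** only reports the first match, so ***findall*** is the better fit when a string may contain several phone numbers.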


### This example looks for phone numbers that match the format above

In [2]:
#This example looks for phone numbers that match the format above
import re

string = r'Name: Chris, ph: (03) 9923 1123, comments: this is not my real number'
pattern = r'\(03\) \d{4} \d{4,4}'

if re.search(pattern, string) :
    print("Phone number found")
else :
    print("Not found")

Phone number found


### <span style="color:blue"> Exercise 1 </span>

Modify the example above so that it will also find phone numbers starting with 03 that:
    
- are missing brackets and/or
- use hyphens, backslashes and/or spaces as separators instead of a single space.

Your program should match all elements in ***strings*** in the code segment below.

In [None]:
#This example looks for phone numbers that match the format above
import re
strings = [
    r'Name: Chris, ph: (03) 9923 1123, comments: this is not my real number',
    r'Name: John, ph: 03-9923-1123, comments: this might be an old number',
    r'Name: Sara, phone: (03)-9923-1123, comments: there is data quality issues, so far, three people sharig the same number',
    r'Name: Christopher, ph: (03)\-9923 -1123, comments, is this the same Chris in the first record?'
]

#change this line so that every string above is matched
pattern = r'\(03\) \d{4} \d{4}'


for s in strings:
    if re.search(pattern, s) :
        print("Phone number found")
    else :
        print("Not found")

### <span style="color:blue"> Exercise 2 </span>

Write a program that will remove all leading zeros from an IP address.
    
For example, 0216.08.094.102 should become 216.8.94.102.

Your program should work for the ***ip_addr*** value in the code segment below.

In [None]:
#Exercise 2: Write a program that will remove all leading zeros from an IP address
#For example, 0216.08.094.102 should become 216.8.94.102
import re

ip_addr = '0216.08.094.102'

#change this line
revised_addr = ip_addr


print(revised_addr)

## Web scraping ##
The BeautifulSoup library can be used to scrape data from a web page for processing and analysis.  You can find out more about BeautifulSoup at https://www.crummy.com/software/BeautifulSoup/

### This example extracts tennis scores from the 2019 ATP Tour

In [None]:
#This example extracts tennis scores from the 2019 ATP Tour

import requests
from bs4 import BeautifulSoup
import pandas as pd
import unicodedata
import re
import matplotlib.pyplot as plt
%matplotlib inline 
#Specify the page to download
u = 'https://en.wikipedia.org/wiki/2019_ATP_Tour'
page = requests.get(u)
soup = BeautifulSoup(page.text, 'html.parser')

#Locate the section we're interested in.  Here we are after the second table after the element with id 'ATP_ranking'
section = soup.find(id='ATP_ranking')
results = section.find_next('table').find_next('table')

#Iterate through all rows in the resultant table
rows = results.find_all('tr')

i = 0
records = []

#for row in rows[1:2]:
#    cells = row.find_all('th')
#    print("{0}, {1}, {2} ,{3}".format(cells[0].text.strip(), cells[1].text.strip(), cells[2].text.strip(), cells[3].text.strip()))
#    # column headers are #, Player, Points, Tours
    
for row in rows[2:]:
    cells = row.find_all('td')
    record = []
    #print("{0}::{1}::{2}::{3}".format(cells[0].text.strip(), cells[1].text.strip(), cells[2].text.strip(), cells[3].text.strip()))
    # column value: 1::Rafael Nadal (ESP)::9,585::12
    
    #Removes junk characters from string and stores the result
    ranking = int(unicodedata.normalize("NFKD", cells[0].text.strip()))
    record.append(ranking)
    
    player = unicodedata.normalize("NFKD", cells[1].text.strip())
    #Removes the country from the player name, removing surrounding whitespaces.
    player_name = re.sub(r'\(.*\)', '', player).strip()
    #print(player_name)
    record.append(player_name)

    #Remove the thousands separator from the points value and store as an integer
    points = unicodedata.normalize("NFKD", cells[2].text.strip())
    record.append(int(re.sub(',', '', points)))
    
    # number of tours: integer type
    tours = unicodedata.normalize("NFKD", cells[3].text.strip())
    record.append(int(tours))
    
    #Store the country code separately
    country_code = re.search(r'\((.*)\)', player).group(1)
    record.append(country_code)
    #print(record)
    #[1, 'Rafael Nadal', 9585, 12, 'ESP']
    records.append(record)
    i = i+1

column_names = ["ranking", "player", "points", "tours", "country"]
tennis_data = pd.DataFrame(records, columns = column_names)

plt.xticks(rotation='vertical')
plt.bar(tennis_data['player'], tennis_data['points'])
plt.ylabel('points')
plt.title("ATP Tour - player points")
plt.show()
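
The cleaning steps applied to each table cell can be tried without downloading the page. The sketch below assumes a row laid out like the one shown in the comments above (`1::Rafael Nadal (ESP)::9,585::12`). The `raw_cells` list and the `clean_row` helper are hypothetical names introduced only for illustration.

```python
import re
import unicodedata

def clean_row(raw_cells):
    """Convert four raw cell strings into [ranking, player, points, tours, country]."""
    # Normalise compatibility characters (e.g. non-breaking spaces) before parsing
    cells = [unicodedata.normalize('NFKD', c).strip() for c in raw_cells]
    ranking = int(cells[0])
    # Drop the bracketed country code from the player name
    player_name = re.sub(r'\(.*\)', '', cells[1]).strip()
    # Remove the thousands separator so the points parse as an integer
    points = int(cells[2].replace(',', ''))
    tours = int(cells[3])
    # Capture the text between the brackets as the country code
    country_code = re.search(r'\((.*)\)', cells[1]).group(1)
    return [ranking, player_name, points, tours, country_code]

# Hypothetical raw values, including a non-breaking space (\xa0) before the country
raw_cells = ['1', 'Rafael Nadal\xa0(ESP)', '9,585', '12']
record = clean_row(raw_cells)
print(record)
```

Keeping the cleaning logic in one small function makes it easy to test on a handful of strings before running it over a whole scraped table.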

### Side note on *unicodedata.normalize()*

Web pages commonly use Unicode encoding.

ASCII defines 128 characters. Most of them are printable characters such as the letters of the English alphabet (abc, ABC), digits (123) and punctuation (?&!), each represented as a number between 32 and 126.
    
Unicode represents most written languages and still has room for more. This includes left-to-right scripts such as English, right-to-left scripts such as Arabic, and scripts such as Chinese and Japanese.
Every ASCII character has an equivalent within Unicode.
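
A small check of the relationship between ASCII and Unicode, using only built-in functions. The sample characters are arbitrary choices made for this sketch.

```python
# ASCII characters keep the same numeric value as Unicode code points
code_a = ord('A')
# A Latin letter outside the ASCII range (C with cedilla)
code_c_cedilla = ord('\u00c7')
print(code_a, code_c_cedilla)

# Encoding pure-ASCII text as ASCII or as UTF-8 yields identical bytes
ascii_bytes = 'abc'.encode('ascii')
utf8_bytes = 'abc'.encode('utf-8')
print(ascii_bytes == utf8_bytes)

# A non-ASCII character needs more than one byte in UTF-8
cedilla_length = len('\u00c7'.encode('utf-8'))
print(cedilla_length)
```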



In Unicode, several characters can be expressed in various ways. 
For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) 
can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

The Unicode standard defines various normalization forms of a Unicode string, 
based on the definition of canonical equivalence and compatibility equivalence. 
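
The cedilla example can be checked directly. This sketch uses the canonical forms NFD (decompose) and NFC (compose) from the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

composed = '\u00c7'          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = '\u0043\u0327'  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two strings render alike but are different sequences of code points
print(composed == decomposed)

# NFD decomposes and NFC composes, so after normalising both to the
# same form the two spellings compare equal
nfd_equal = unicodedata.normalize('NFD', composed) == decomposed
nfc_equal = unicodedata.normalize('NFC', decomposed) == composed
print(nfd_equal, nfc_equal)
```

This is why scraped text is usually normalised before strings are compared or used as dictionary keys.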

The function ***unicodedata.normalize("NFKD", unistr)*** 
will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents.

#### Example:

In [None]:
import unicodedata

unistr = '\u2460'  # CIRCLED DIGIT ONE
print("{0} is the equivalent character of {1}".format(unicodedata.normalize('NFKD', unistr), unistr))


### <span style="color:blue"> Exercise 3 </span>

Produce a graph similar to the example above for the **2019 ATP Doubles Scores**.

*First locate the section we're interested in.*
    

In [None]:
# Solution of exercise 3

import requests
from bs4 import BeautifulSoup
import pandas as pd
import unicodedata
import re
import matplotlib.pyplot as plt

#Specify the page to download
u = 'https://en.wikipedia.org/wiki/2019_ATP_Tour'
page = requests.get(u)
soup = BeautifulSoup(page.text, 'html.parser')

## complete the program

## Web crawling ##

This example implements a simplified Web crawler that traverses the site 
**[http://books.toscrape.com/](http://books.toscrape.com/)**


In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd
import unicodedata
import re
import matplotlib.pyplot as plt

page_limit = 20

#Specify the initial page to crawl
base_url = 'http://books.toscrape.com/'
seed_item = 'index.html'

seed_url = base_url + seed_item
page = requests.get(seed_url)
soup = BeautifulSoup(page.text, 'html.parser')

visited = {}
visited[seed_url] = True
pages_visited = 1
print(seed_url)

#Remove index.html
links = soup.find_all('a')
seed_link = soup.find_all('a', href=re.compile(r"^index\.html"))
#to_visit_relative = list(set(links) - set(seed_link))
to_visit_relative = [l for l in links if l not in seed_link]


# Resolve to absolute urls
to_visit = []
for link in to_visit_relative:
    to_visit.append(urljoin(seed_url, link['href']))

    
#Find all outbound links on successor pages and explore each one 
while (to_visit):
    # Impose a limit to avoid breaking the site 
    if pages_visited == page_limit :
        break
        
    # consume the list of urls
    link = to_visit.pop(0)
    print(link)

    # link is already an absolute url, resolved from an item such as <a href="catalogue/sharp-objects_997/index.html">
    page = requests.get(link)
    
    # scraping code goes here
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # mark the item as visited, i.e., add to visited list (pop above already removed it from to_visit)
    visited[link] = True
    new_links = soup.find_all('a')
    for new_link in new_links :
        new_item = new_link['href']
        new_url = urljoin(link, new_item)
        if new_url not in visited and new_url not in to_visit:
            to_visit.append(new_url)
        
    pages_visited = pages_visited + 1

print('\nvisited {0:5d} pages; {1:5d} pages in to_visit'.format(len(visited), len(to_visit)))
#print('{0:1d}'.format(pages_visited))
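
The traversal logic of the crawler can be studied offline. The sketch below replaces the network requests with a hypothetical in-memory dictionary, `fake_site`, that maps each URL to the relative links found on that page. The URLs and the `crawl` function are assumptions made for illustration and are not part of the example above.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical site: each URL maps to the relative hrefs found on that page
fake_site = {
    'http://example.com/index.html': ['a/index.html', 'b/index.html'],
    'http://example.com/a/index.html': ['../index.html', '../b/index.html'],
    'http://example.com/b/index.html': ['../a/index.html', 'c.html'],
    'http://example.com/b/c.html': [],
}

def crawl(seed_url, page_limit):
    """Breadth-first traversal that never visits or queues the same URL twice."""
    visited = []                  # order in which pages were fetched
    seen = {seed_url}             # every URL that has been visited or queued
    to_visit = deque([seed_url])
    while to_visit and len(visited) < page_limit:
        url = to_visit.popleft()
        visited.append(url)
        for href in fake_site.get(url, []):
            # Resolve the relative href against the page it was found on
            new_url = urljoin(url, href)
            if new_url not in seen:
                seen.add(new_url)
                to_visit.append(new_url)
    return visited

pages = crawl('http://example.com/index.html', page_limit=20)
print(pages)
```

Here a set handles the "already seen" check and a `deque` serves as the queue, which avoids the linear cost of `list.pop(0)` and of searching a list with `not in`. For a crawl limited to 20 pages the difference is negligible, so this is a design option rather than a required change.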


### <span style="color:blue"> Exercise 4 </span>

The code above can easily end up stuck in a crawler trap.  
Explain three ways this could occur and suggest possible solutions.

### <span style="color:blue"> Exercise 5 </span>

Modify the code above to print the titles of as many books as can be found within the page_limit.


In [None]:
# Solution to exercise 5
# Modify the code above to print the titles of as many books as can be found within the page_limit

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd
import unicodedata
import re
import matplotlib.pyplot as plt

# complete the program


## Natural Language Processing ##
The ***nltk*** library provides you with tools for natural language processing, including tokenizing, stemming and lemmatization.

In [2]:
import nltk
from nltk.stem.porter import *

# if running the first time with errors:
nltk.download('punkt')
nltk.download('stopwords')
#
porterStemmer = PorterStemmer()

speech = 'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.'
wordList = nltk.word_tokenize(speech)

# run the line to download it the first time:
#nltk.download('stopwords')
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

filteredList = [w for w in wordList if w not in stopWords]

wordDict = {}
for word in filteredList:
    stemWord = porterStemmer.stem(word)
    if stemWord in wordDict : 
        wordDict[stemWord] = wordDict[stemWord] +1
    else :
        wordDict[stemWord] = 1

wordDict = {k: v for k, v in sorted(wordDict.items(), key=lambda item: item[1], reverse=True)}
for key in wordDict : print(key, wordDict[key])


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jaypr\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


, 22
. 10
-- 7
dedic 6
nation 5
live 4
great 3
It 3
dead 3
us 3
shall 3
peopl 3
new 2
conceiv 2
men 2
war 2
long 2
We 2
gave 2
consecr 2
the 2
far 2
rather 2
devot 2
four 1
score 1
seven 1
year 1
ago 1
father 1
brought 1
forth 1
contin 1
liberti 1
proposit 1
creat 1
equal 1
now 1
engag 1
civil 1
test 1
whether 1
endur 1
met 1
battle-field 1
come 1
portion 1
field 1
final 1
rest 1
place 1
might 1
altogeth 1
fit 1
proper 1
but 1
larger 1
sens 1
hallow 1
ground 1
brave 1
struggl 1
poor 1
power 1
add 1
detract 1
world 1
littl 1
note 1
rememb 1
say 1
never 1
forget 1
unfinish 1
work 1
fought 1
thu 1
nobli 1
advanc 1
task 1
remain 1
honor 1
take 1
increas 1
caus 1
last 1
full 1
measur 1
highli 1
resolv 1
die 1
vain 1
god 1
birth 1
freedom 1
govern 1
perish 1
earth 1


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jaypr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
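
The counting-and-sorting step in the example above can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the approach used in the example; the `stems` list is a hypothetical stand-in for the stemmed tokens.

```python
from collections import Counter

# Hypothetical list of already-stemmed tokens
stems = ['nation', 'dedic', 'nation', 'live', 'dedic', 'nation']

# Counter tallies every item in a single pass
counts = Counter(stems)

# most_common() returns (item, count) pairs sorted from highest to lowest count
for stem, count in counts.most_common():
    print(stem, count)
```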


### <span style="color:blue"> Exercise 6 </span>

Modify the example above to use a WordNet Lemmatizer instead of a porter stemmer.

Comment on the differences

In [6]:
#Solution to Exercise 6: 
import nltk
from nltk.stem import WordNetLemmatizer 

# if running the first time with errors (e.g. LookupError: Resource wordnet not found):
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')
#

wnl = WordNetLemmatizer()
#porterStemmer = PorterStemmer()

print(wnl.lemmatize('dogs'))

speech = 'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.'
wordList = nltk.word_tokenize(speech)

# run the line to download it the first time:
#nltk.download('stopwords')
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

filteredList = [w for w in wordList if w not in stopWords]

wordDict = {}
for word in filteredList:
    stemWord = wnl.lemmatize(word)
    if stemWord in wordDict : 
        wordDict[stemWord] = wordDict[stemWord] +1
    else :
        wordDict[stemWord] = 1

wordDict = {k: v for k, v in sorted(wordDict.items(), key=lambda item: item[1], reverse=True)}
for key in wordDict : print(key, wordDict[key])




LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\jaypr/nltk_data'
    - 'C:\\Users\\jaypr\\Anaconda3\\nltk_data'
    - 'C:\\Users\\jaypr\\Anaconda3\\share\\nltk_data'
    - 'C:\\Users\\jaypr\\Anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\jaypr\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
