### Chapter 11 - Regular Expressions
Regular expressions are a specialized language that lets us succinctly search strings and extract data from them. It is not essential to know how to use regular expressions, but they can be quite useful and powerful.

- a really clever 'wild card' expression language
- very powerful and very cryptic
- fun once you understand them

- `^` Matches the beginning of a line
- `$` Matches the end of the line
- `.` Matches any character (except a newline)
- `\s` Matches a whitespace character
- `\S` Matches any non-whitespace character
- `*` Repeats a character zero or more times
- `*?` Repeats a character zero or more times (non-greedy)
- `+` Repeats a character one or more times
- `+?` Repeats a character one or more times (non-greedy)
- `[aeiou]` Matches a single character in the listed set
- `[^XYZ]` Matches a single character not in the listed set
- `[a-z0-9]` The set of characters can include a range
- `(` Indicates where string extraction is to start
- `)` Indicates where string extraction is to end

`import re` - imports Python's regular expression module
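As a quick illustration of the symbols listed above, here is a minimal sketch that tries several of them on short strings. The sample line and all variable names are made up for this example.

```python
import re

# Hypothetical sample line, made up for illustration
line = 'From: alice@example.org on Sat Jan  5 09:14:16 2008'

starts_with_from = re.search(r'^From:', line) is not None   # ^ anchors at the start
ends_with_year = re.search(r'2008$', line) is not None      # $ anchors at the end
vowels = re.findall(r'[aeiou]', 'regular')                  # one character from the set
not_xyz = re.findall(r'[^XYZ]', 'XaYbZ')                    # one character NOT in the set
runs = re.findall(r'[a-z0-9]+', 'abc-123_XY')               # ranges inside a set
tokens = re.findall(r'\S+', 'a  b\tc')                      # runs of non-whitespace

print(starts_with_from, ends_with_year)
print(vowels, not_xyz, runs, tokens)
```

The patterns are written as raw strings (`r'...'`) so that backslashes such as `\S` reach the `re` module unchanged.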

#### 11.2 Extracting Data

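`re.search()` - returns a match object if the regular expression matches anywhere in the string, or `None` if it does not, so the result can be used directly as a true/false test in an `if` statement.

`re.findall()` - returns a list of all the matching substrings.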
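The difference between the two functions can be sketched as follows. The sample string mirrors the one used in the cells below; the variable names are only for illustration.

```python
import re

text = 'My 2 favorite numbers are 19 and 42'

# re.search() stops at the first match and returns a Match object (or None)
match = re.search(r'[0-9]+', text)
first = match.group() if match else None
span = match.span() if match else None

# No uppercase vowel in the text, so search() returns None
no_match = re.search(r'[AEIOU]', text)

# re.findall() returns every non-overlapping match as a list of strings
all_numbers = re.findall(r'[0-9]+', text)

print(first, span, no_match, all_numbers)
```

Because `None` is falsy and a match object is truthy, `if re.search(...):` reads like a true/false test even though the function does not return a boolean.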

In [1]:
import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+', x)
print(y)

['2', '19', '42']


In [2]:
# continuation
# find if there is any uppercase vowel
y = re.findall('[AEIOU]+', x)
print(y)

[]


In [4]:
# Greedy Matching
xx = 'From: Using the: character'
y = re.findall('^F.+:', xx)
print(y)

['From: Using the:']


In [5]:
## Non Greedy Match
y = re.findall('^F.+?:', xx)
print(y)

['From:']


#### Fine-Tuning String Extraction
You can refine the match for `re.findall()` and separately determine which portion of the match is to be extracted by using parentheses.

In [6]:
x = 'From random@mail.com.us Sat June 5 09:94:44 20114'
y = re.findall(r'\S+@\S+', x)

# This is greedy extraction
print(y)

['random@mail.com.us']


In [7]:
# Fine tuning the above example
y = re.findall(r'^From (\S+@\S+)', x)

# Match only lines that start with 'From ' and extract just the email address
print(y)

['random@mail.com.us']


Example 1 - `@([^ ]*)`
- `[^ ]` = Match a non-blank character = everything but a space
- `*` = Match zero or more of them
- `@( )` = Match the `@` and everything after it, but don't extract the `@`

Example 2 - `^From .*@([^ ]*)`
- `[^ ]` = Match a non-blank character = everything but a space
- `*` = Match zero or more of them
- ` .*@( )` = Match everything after 'From' up to the `@`, and extract what follows it
- `^` = Start at the beginning of the line
- `^From` = Look for the string 'From' at the start of the line

Example 3 - `re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)`
- `^X-DSPAM-Confidence:` = Match lines that start with this text followed by a space, then extract what's in the parentheses `()`
- `([0-9.]+)` = Look for digits 0-9 and periods, i.e. a floating-point number; `+` = one or more times
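The three example patterns above can be tried on a few sample lines. This is a minimal sketch: the lines below are hypothetical, written in the style of a mailbox file, and the variable names are made up.

```python
import re

# Hypothetical sample lines in the style of a mailbox file
from_line = 'From alice@example.org Sat Jan  5 09:14:16 2008'
reply_line = 'Reply-To: bob@example.com'
spam_line = 'X-DSPAM-Confidence: 0.8475'

# Example 1: extract the host name after any '@'
ex1_from = re.findall(r'@([^ ]*)', from_line)
ex1_reply = re.findall(r'@([^ ]*)', reply_line)

# Example 2: same extraction, but only on lines that start with 'From '
ex2_from = re.findall(r'^From .*@([^ ]*)', from_line)
ex2_reply = re.findall(r'^From .*@([^ ]*)', reply_line)

# Example 3: extract the number and convert it to a float
confidence = [float(v) for v in
              re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', spam_line)]

print(ex1_from, ex1_reply)
print(ex2_from, ex2_reply)
print(confidence)
```

Example 1 matches both lines, while Example 2 returns an empty list for the `Reply-To:` line because of the `^From ` prefix.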

In [41]:
import re
name = input("Enter file:")
if len(name) < 1 : name = "regex_sum_806196.txt"
handle = open(name)

sum_list = list()

for line in handle:
    line = line.rstrip()
    # '[0-9]+' matches runs of digits; a '*' inside the brackets would match a literal asterisk
    x = re.findall('[0-9]+', line)
    #print(x)
    for num in x:
        sum_list.append(int(num))

print(sum(sum_list))


        

Enter file:
432049


#### Regular expression example code

HTML links = `re.findall('href="(http[s]?://.*?)"', line)` (the non-greedy `.*?` stops at the first closing quote)
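Whether the `.*` in a link pattern is greedy matters as soon as a line holds more than one link. A small comparison, assuming a made-up line of HTML with two anchors:

```python
import re

# Hypothetical line containing two links
line = '<a href="http://example.com/a">A</a> <a href="https://example.com/b">B</a>'

# Greedy: .* runs on to the LAST double quote on the line
greedy = re.findall(r'href="(http[s]?://.*)"', line)

# Non-greedy: .*? stops at the FIRST closing double quote after each href
non_greedy = re.findall(r'href="(http[s]?://.*?)"', line)

print(greedy)
print(non_greedy)
```

The greedy version returns a single, overly long string, while the non-greedy version returns the two URLs separately.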

### Chapter 12 - Network Technology

#### Step One - Make Connection - [Transport Protocol]
Python has built-in support for TCP sockets. Just pass the host name and port number.

`import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
`

#### Step Two - Application Protocol
HTTP (Hypertext Transfer Protocol) - has a request-response cycle


In [43]:
import socket

# Step One  - Make a connection
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

# Step Two - Send a GET request
# send a HTTP command
cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

# Step Three - Get the data back
while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
# close the connection
mysock.close()



HTTP/1.1 200 OK
Date: Tue, 21 Jul 2020 21:03:49 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "1d3-54f6609240717"
Accept-Ranges: bytes
Content-Length: 467
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

Why should you learn to write programs?

Writing programs (or programming) is a very creative 
and rewarding activity.  You can write programs
 for 
many reasons, ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else solve a problem.  This book assumes that 
everyone needs to know how to program, and that once 
you know how to program you will figure out what you want 
to do with your newfound skills.  



`.encode()` and `.decode()` are string and bytes methods: `.encode()` converts a Unicode string to bytes (UTF-8 by default) and `.decode()` converts bytes back to a string.
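A minimal round trip shows what the two methods do. The request text is modelled on the GET command used in the socket cell above; the variable names are illustrative.

```python
# A str holds Unicode text; a socket needs bytes
text = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'

raw = text.encode()     # str -> bytes (UTF-8 is the default encoding)
back = raw.decode()     # bytes -> str (UTF-8 is the default here too)

print(type(raw).__name__, type(back).__name__)
print(back == text)
```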

#### 12.3 Unicode Characters and Strings
- In the older single-byte encodings, each character is represented by a number between 0 and 255 stored in 8 bits of memory (plain ASCII uses only 0 to 127).
- We refer to "8 bits" of memory as a *byte* of memory.
- The `ord()` function tells us the numeric value of a simple ASCII character.

In [44]:
print(ord('H'))

72


To represent the wide range of characters computers must handle, we represent characters with more than one byte.
- UTF-8 is the recommended practice for encoding data to be exchanged between systems.
- In Python 3, all strings are internally Unicode.
- Working with string variables in Python programs and reading data from files usually "just works".
- When Python talks to a network resource using sockets or talks to a database, we have to encode and decode the data (usually to UTF-8).
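To see the "more than one byte" point concretely, the sketch below encodes a few characters to UTF-8 and counts the resulting bytes. The characters are chosen only for illustration and are written as escape sequences so the example does not depend on the file's own encoding.

```python
# Characters picked for illustration: H, e-acute, euro sign, grinning-face emoji
samples = ['H', '\u00e9', '\u20ac', '\U0001F600']

# Unicode code point of each character
code_points = [ord(ch) for ch in samples]

# Number of bytes each character needs in UTF-8
byte_lengths = [len(ch.encode('utf-8')) for ch in samples]

for ch, cp, n in zip(samples, code_points, byte_lengths):
    print(repr(ch), cp, n, ch.encode('utf-8'))
```

Plain ASCII characters still take a single byte in UTF-8, which is one reason it works well as an exchange format.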

#### Python Strings to Bytes

When we talk to an external resource like a network socket, we send bytes, so we need to encode Python 3 strings to a given character encoding. Similarly, when we read data from an external resource, we must decode it based on its character set so it is properly represented in Python 3 as a string.
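One detail worth sketching: data from a socket arrives in chunks, and a chunk boundary can fall in the middle of a multi-byte character. The example below uses hypothetical, hand-written chunks (no real network call) and joins the bytes before decoding; this is one possible approach, not the only one.

```python
# Hypothetical chunks, as they might arrive from successive mysock.recv() calls.
# The split falls in the middle of the two-byte UTF-8 sequence for e-acute.
chunks = [
    b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\ncaf\xc3',
    b'\xa9 au lait\n',
]

# Decoding the first chunk on its own fails because the character is incomplete
try:
    chunks[0].decode('utf-8')
    chunk_decodes_alone = True
except UnicodeDecodeError:
    chunk_decodes_alone = False

# Joining the bytes first and decoding once avoids the problem
raw = b''.join(chunks)
text = raw.decode('utf-8')

# Headers and body are separated by a blank line
header_block, body = text.split('\r\n\r\n', 1)
status_line = header_block.split('\r\n')[0]

print(chunk_decodes_alone)
print(status_line)
print(body)
```

For pure ASCII responses, decoding chunk by chunk works fine; the join-then-decode pattern only matters once multi-byte characters appear.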

In [None]:
while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    mystring = data.decode()
    print(mystring)


#### 12.4 Retrieving Web Pages
Since HTTP is so common, Python 3 has a library, `urllib`, that does all the socket work for us and makes a web page look like a file.
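The "looks like a file" idea can be tried without a network connection. In this sketch a `data:` URL, which `urllib` can open directly, stands in for a real address such as `http://data.pr4e.org/romeo.txt`; the two lines of text inside it are just sample content.

```python
import urllib.request

# A data: URL stands in for a real web address so the sketch runs offline.
# %20 is an encoded space and %0A an encoded newline.
url = 'data:text/plain;charset=utf-8,But%20soft%20what%20light%0AIt%20is%20the%20east'

# The response object can be iterated line by line, like a file handle
with urllib.request.urlopen(url) as fhand:
    lines = [line.decode().strip() for line in fhand]

print(lines)
```

Each line comes back as bytes, which is why `.decode()` is still needed before treating it as a string.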

In [46]:
import urllib.request, urllib.parse, urllib.error
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

for line in fhand:
    print(line.decode().strip())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [47]:
import urllib.request, urllib.parse, urllib.error
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

print(counts)

{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}


#### 12.5 Parsing Web Pages
#### Web Scraping
Web scraping is when a program or script pretends to be a browser: it retrieves web pages, looks at those pages, extracts information, and then looks at more web pages.
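The "extract information" step can be sketched on an inline HTML string. The cells in these notes use BeautifulSoup; this example deliberately swaps in the standard library's `html.parser` so it runs without third-party packages. The HTML, the base URL, and the `LinkCollector` class are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


# Hypothetical page content and base address
html = '''<html><body>
<a href="http://example.com/page1.html">One</a>
<p>No link here</p>
<a href="/relative/page2.html">Two</a>
</body></html>'''
base = 'http://example.com/'

collector = LinkCollector()
collector.feed(html)

# Relative links need the page's own address before they can be followed
absolute = [urljoin(base, link) for link in collector.links]

print(collector.links)
print(absolute)
```

Resolving relative links with `urljoin` is what lets a scraper go on to "look at more web pages".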

In [58]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter - ')
if len(url) < 1 : url = "http://www.google.com/"
    
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
    

Enter - 
http://www.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=US&tab=w1
http://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/
/advanced_search?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/


In [160]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

url = input('Enter - ')
if len(url) < 1 : url = "http://www.google.com/"
    
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
    
html = urllib.request.urlopen(url, context = ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # print(tag)
    print(tag.get('href', None))
 
print('********')
# Print specific Tag
print(tags[3].get('href', None))

Enter - 
http://www.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=US&tab=w1
http://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/
/advanced_search?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
********
http://www.youtube.com/?gl=US&tab=w1


You can also use BeautifulSoup to pull out various parts of each tag.

In [179]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
if len(url) < 1 : url = "http://www.google.com"
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

Enter - 
TAG: <a class="gb1" href="http://www.google.com/imghp?hl=en&amp;tab=wi">Images</a>
URL: http://www.google.com/imghp?hl=en&tab=wi
Contents: Images
Attrs: {'class': ['gb1'], 'href': 'http://www.google.com/imghp?hl=en&tab=wi'}
TAG: <a class="gb1" href="http://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a>
URL: http://maps.google.com/maps?hl=en&tab=wl
Contents: Maps
Attrs: {'class': ['gb1'], 'href': 'http://maps.google.com/maps?hl=en&tab=wl'}
TAG: <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a>
URL: https://play.google.com/?hl=en&tab=w8
Contents: Play
Attrs: {'class': ['gb1'], 'href': 'https://play.google.com/?hl=en&tab=w8'}
TAG: <a class="gb1" href="http://www.youtube.com/?gl=US&amp;tab=w1">YouTube</a>
URL: http://www.youtube.com/?gl=US&tab=w1
Contents: YouTube
Attrs: {'class': ['gb1'], 'href': 'http://www.youtube.com/?gl=US&tab=w1'}
TAG: <a class="gb1" href="http://news.google.com/nwshp?hl=en&amp;tab=wn">News</a>
URL: http://news.google.com/nwshp?hl=en&

#### Assignment: Scraping Numbers from HTML using BeautifulSoup
We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment. You are to find all the `<span>` tags in the file, pull out the numbers from the tags, and sum the numbers.

In [181]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl


url = input('Enter - ')
if len(url) < 1 : url = "http://py4e-data.dr-chuck.net/comments_806198.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

collection = list()

# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # print(tag.contents[0])
    collection.append(int(tag.contents[0]))

print(sum(collection))

Enter - 
2688


#### Assignment: Following Links in HTML Using BeautifulSoup
In this assignment you will write a Python program that expands on https://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position from the top and follow that link, repeat the process a number of times, and report the last name you find.
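The "pick the link at a position, follow it, repeat" logic can be sketched without any network access. Here a dictionary plays the role of the web: every page name, link, and person name in it is hypothetical, and the `follow` function is only an illustration of the loop, not the assignment's required solution.

```python
# Hypothetical stand-in for the web: each "page" lists its (href, name) links in order
pages = {
    'start.html': [('a.html', 'Ann'), ('b.html', 'Bob'), ('c.html', 'Cy')],
    'a.html': [('b.html', 'Bob'), ('c.html', 'Cy'), ('start.html', 'Home')],
    'b.html': [('a.html', 'Ann'), ('c.html', 'Cy'), ('start.html', 'Home')],
    'c.html': [('start.html', 'Home'), ('a.html', 'Ann'), ('b.html', 'Bob')],
}


def follow(url, position, count):
    """Follow the link at 1-based `position` on each page, `count` times.

    Returns the name attached to the last link followed.
    """
    name = None
    for _ in range(count):
        href, name = pages[url][position - 1]   # convert to a 0-based index
        print('Retrieving:', href)
        url = href                              # the next round starts from this page
    return name


last_name = follow('start.html', position=2, count=3)
print('Last name in sequence:', last_name)
```

In the real program, the dictionary lookup is replaced by fetching the page and collecting its anchor tags.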

In [1]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
if len(url) < 1 : 
    # Creating default value for assignment
    url = "http://py4e-data.dr-chuck.net/known_by_Pietro.html"
    
position = int(input("Enter position: "))-1
count = int(input("Enter count: "))


html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Creating a List of names
Sequence = list()

# Retrieve all of the anchor tags
tags = soup('a')

for name in range(count):
    # Step 1 - get the specific positioned link
    link = tags[position].get('href', None)
    # Step 2 - print the link
    print("Retrieving: ",link)
    # Step 3 - append this particular name on the sequence
    Sequence.append(tags[position].contents[0])
    # Step 4 - open page --> to the positioned link and redo
    html = urllib.request.urlopen(link, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')

# print the Last name of the sequence
print('*****************')
print('Last name in sequence: ', Sequence[-1])

Enter - 
Enter position: 3
Enter count: 3
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Ridley.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Edwin.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Morran.html
*****************
Last name in sequence:  Morran


### Chapter 13 - Data on the Web
In this section, we learn how to retrieve and parse XML (eXtensible Markup Language) and JSON data.

With the HTTP Request/Response cycle well understood and well supported, there was a natural move toward exchanging data between programs using these protocols. There are two commonly used data formats that can be used between applications and across networks: XML and JSON.

#### Serialization
The act of converting data from one program's or programming language's internal form into a more widely used 'wire' format.
#### Deserialization
The act of converting the 'wire' format back into a program-specific or language-specific form.
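A minimal round trip with JSON illustrates both directions. The record below is made-up sample data.

```python
import json

# A Python-specific structure (hypothetical sample data)
record = {'name': 'Chuck', 'languages': ['Python', 'C'], 'active': True}

wire = json.dumps(record)      # serialize: Python dict -> JSON text
restored = json.loads(wire)    # deserialize: JSON text -> Python dict

print(wire)
print(restored == record)
```

Note that the Python value `True` travels as the JSON literal `true`: the wire format has its own conventions, independent of any one language.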

#### XML data format - eXtensible Markup Language
- Primary purpose: share structured data
- Components:
  - Start tag, end tag
  - Attributes (always in the start tag)
  - Content
  - Self-closing tag
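The components listed above map onto what `xml.etree.ElementTree` exposes for each element. A small sketch, using a made-up document:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <name> has a start tag, an attribute, content and an end tag;
# <email/> is a self-closing tag with an attribute and no content
data = '<person><name lang="en">Chuck</name><email hide="yes"/></person>'

root = ET.fromstring(data)
name = root.find('name')
email = root.find('email')

print(name.tag, name.attrib, name.text)
print(email.tag, email.attrib, email.text)
```

A self-closing tag has no content, so its `.text` is `None`.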

#### 13.3 XML Schema
- Describes a 'contract' as to what is acceptable XML
- A way to establish that contract outside of any one program or programming language
- Helps to validate the data, so both sides understand it before sending/receiving
- XML Schema from W3C (XSD) is a popular one

#### 13.4 Parsing XML

In [183]:
import xml.etree.ElementTree as ET
data =  '''
<person> 
<name> Prashant </name>
<phone> +1 347 204 00044</phone>
<email hide =  "yes"/>
</person>
'''

tree = ET.fromstring(data)
print('Name: ', tree.find('name').text)
print('Attr:  ', tree.find('email').get('hide'))

Name:   Prashant 
Attr:   yes


In [197]:
import xml.etree.ElementTree as ET
data =  ''' <stuff>
<users> 

<user x = "2">
<name> Prashant </name>
<phone> +1 347 204 00044</phone>
<email hide =  "yes"/>
</user>

<user x = "7">
<name> Another Prashant </name>
<phone> +1 347 204 3344</phone>
<email hide =  "yes"/>
</user>

</users>
</stuff>  '''

# Named 'data' rather than 'input' so the built-in input() is not shadowed
stuff = ET.fromstring(data)
something = stuff.findall('users/user')
print('User Count: ', len(something) )

for item in something:
    print ('Name:  ', item.find('name').text)
    print ('Email  Attr: ',  item.find('email').get('hide'))
    print ('    - ')

User Count:  2
Name:    Prashant 
Email  Attr:  yes
    - 
Name:    Another Prashant 
Email  Attr:  yes
    - 


#### Assignment - Extracting Data from XML
In this assignment you will write a Python program somewhat similar to https://py4e.com/code3/geoxml.py. The program will prompt for a URL, read the XML data from that URL using urllib and then parse and extract the comment counts from the XML data, compute the sum of the numbers in the file and enter the sum.

In [17]:
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Enter the URL
url2 = input('Enter - ')
if len(url2) < 1 : 
    # Creating default value for assignment
    url2 = "http://py4e-data.dr-chuck.net/comments_806200.xml"

html2 = urllib.request.urlopen(url2, context=ctx).read()
print('Retrieved  ',len(html2), ' characters')

tree = ET.fromstring(html2)
something = tree.findall('comments/comment')
print('Count: ', len(something))

generic_list = list()

for item in something:
    generic_list.append(int(item.find('count').text))
    
print('Sum: ', sum(generic_list))



Enter - 
Retrieved   4206  characters
Count:  50
Sum:  2468


The same thing with less code:

In [21]:
import urllib.request
import xml.etree.ElementTree as ET

url = 'http://py4e-data.dr-chuck.net/comments_806200.xml'


html2 = urllib.request.urlopen(url).read()
print('Retrieved',len(html2),'characters')

tree = ET.fromstring(html2)


results = tree.findall('.//count')
nums = [int(node.text) for node in results]
counts = sum(nums)
print('Sum:  ', counts)

Retrieved 4206 characters
Sum:   2468


#### 13.5 JavaScript Object Notation  (JSON)
- JSON represents data as nested 'lists' and 'dictionaries'
- In Python 3, `json.loads()` typically gives you either a list or a dictionary object
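A short sketch of the list case, which also shows the difference between `json.loads()` (parses a string) and `json.load()` (reads from a file-like object). The sample data is made up, and `io.StringIO` stands in for an open file.

```python
import io
import json

# Hypothetical sample: a JSON list of dictionaries
data = '''
[
  {"id": "001", "x": "2", "name": "Chuck"},
  {"id": "009", "x": "7", "name": "Brent"}
]
'''

info = json.loads(data)                  # parse from a string
same = json.load(io.StringIO(data))      # parse from a file-like object

names = [item['name'] for item in info]

print(type(info).__name__, len(info))
print(names)
```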

In [23]:
import  json
data = '''
{
"name"  : "Chuck",
"phone" : {
"type" : "intl",
"number": "+1 234 567 8900"
},
"email" : {
"hide" : "yes"
}
}
'''

info  =  json.loads(data)

#  the  'info' here  is  a python dictionary
print('Name: ', info["name"])
print('Hide: ', info["email"]["hide"])

Name:  Chuck
Hide:  yes


#### 13.6 Service Oriented Approach (SOA)
By introducing service layers between systems, the Service Oriented Approach helps multiple systems connect to each other and provide data. SOA defines and documents "contracts" between different applications.


#### 13.7 Application Programming Interface (API)
The general name for application-to-application contracts is Application Programming Interface (API). When we use an API, generally one program makes a set of services available for use by other applications and publishes the APIs (the rules) that must be followed to access the services provided by the program.

#### 13.8 Secure API Requests


#### Assignment - Extracting Data from JSON
The program will prompt for a URL, read the JSON data from that URL using urllib and then parse and extract the comment counts from the JSON data, compute the sum of the numbers in the file.
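The parsing and summing step can be tried on an inline sample before pointing the program at a URL. The data below is hypothetical; its shape (a top-level `comments` list whose items carry a `count`) is assumed from the keys the solution code reads.

```python
import json

# Hypothetical sample in the assumed shape of the assignment data
data = '''
{
  "note": "sample data",
  "comments": [
    {"name": "Ann", "count": 97},
    {"name": "Bob", "count": 3},
    {"name": "Cy", "count": 42}
  ]
}
'''

info = json.loads(data)
comments = info['comments']

# JSON numbers arrive as Python ints, so no int() conversion is needed
total = sum(item['count'] for item in comments)

print('Count:', len(comments))
print('Sum:', total)
```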

In [31]:
import json
import ssl
import urllib.request


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# URL = http://py4e-data.dr-chuck.net/comments_806201.json

while True:
    url3 = input('Enter location: ')
    if len(url3) < 1 : break

    # Print URL
    print ('Retrieving ', url3)
    
    # Print Characters
    html3 = urllib.request.urlopen(url3, context=ctx).read()
    print('Retrieved  ',len(html3), ' characters')
    
    ## Print Count
    info = json.loads(html3)
    comments = info["comments"]
    print ('Count:', len(comments))
    
    # Print Sum (named 'total' so the built-in sum() is not shadowed)
    total = 0
    for item in comments:
        total = total + item['count']
    print ('Sum:', total)

Enter location: http://py4e-data.dr-chuck.net/comments_806201.json
Retrieving  http://py4e-data.dr-chuck.net/comments_806201.json
Retrieved   2716  characters
Count: 50
Sum: 2285
Enter location: 


#### Assignment - Using the GeoJSON API
In this program you will use a GeoLocation lookup API modelled after the Google API to look up some universities and parse the returned data.
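Two steps of this kind of API call can be sketched offline: building the request URL with `urllib.parse.urlencode()` and pulling a field out of the JSON reply. The service URL and parameters match the ones used in the cell below; the response body is a hypothetical, heavily trimmed stand-in, and `EXAMPLE_PLACE_ID` is a placeholder value.

```python
import json
import urllib.parse

serviceurl = 'http://py4e-data.dr-chuck.net/json?'
parms = {'address': 'University of Oxford', 'key': 42}

# urlencode() escapes the values (spaces become '+') and joins pairs with '&'
url = serviceurl + urllib.parse.urlencode(parms)
print('Retrieving', url)

# Hypothetical, heavily trimmed response body (no network call is made here)
data = '{"status": "OK", "results": [{"place_id": "EXAMPLE_PLACE_ID"}]}'

js = json.loads(data)

# Check the status and that at least one result exists before indexing into it
if js.get('status') == 'OK' and js['results']:
    place_id = js['results'][0]['place_id']
else:
    place_id = None

print('Place id', place_id)
```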

In [50]:
import urllib.request, urllib.parse, urllib.error
import json
import ssl

api_key = False

if api_key is False:
    api_key = 42
    serviceurl = 'http://py4e-data.dr-chuck.net/json?'
else :
    serviceurl = 'https://maps.googleapis.com/maps/api/geocode/json?'


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Address - University of Oxford

while True:
    address = input('Enter location: ')
    if len(address) < 1 : break
    
    parms = dict()
    parms['address'] = address
    if api_key is not False: parms['key'] = api_key
    url = serviceurl + urllib.parse.urlencode(parms)
    print ('Retrieving', url)
    
    
    # Print the length of the retrieved data.
    # uh.read() returns the whole response as bytes; decode() turns it into a string
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')
    

    # Now, we parse the data into list/dictionary form via json.loads()
    # This step turns the text into Python-readable lists/dictionaries
    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None
    
    if not js or 'status' not in js or js['status'] != 'OK':
        print('==== Failure To Retrieve ====')
        print(data)
        continue
    
    # This will pretty print json for more clarity
    # print(json.dumps(js, indent=4))
    
    place_id = js['results'][0]['place_id']
    print('Place id ', place_id)
    
    break

Enter location: University of Oxford
Retrieving http://py4e-data.dr-chuck.net/json?address=University+of+Oxford&key=42
Retrieved 1773 characters
Place id  ChIJW0iM76nGdkgR7a8BoIMY_9I
