# Advanced Python Topics - Retrieve Data from the Web

In this notebook, we will cover some advanced topics in Python. As programs grow, we usually want to divide the code into smaller blocks and files that are easier to manage and whose functions can be reused. This builds on the functions we learned before: a Python module is a file ending in ".py" that collects related functions, classes, and variables. Modules play the role that libraries play in other high-level programming languages, and multiple modules can be bundled together into a package. This modular approach improves a program's readability and provides building blocks for large projects, since modules can be written by different people at different times.

With prebuilt or user-defined modules, we can put a program together quickly. When we need the functions inside a library, we simply "import" it into our code. We don't have to read or fully understand the source code of the libraries we import, but we do have to know how to call their functions. Libraries usually come with documentation that describes how to use them.
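To make the idea of a module concrete, here is a minimal sketch that writes a tiny module to a temporary folder and then imports it. The module name `shapes_demo`, the constant `PI_APPROX`, and the function `circle_area` are made up for this illustration; in practice you would simply save the file next to your notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny module to a temporary folder (name and contents are made up for this sketch).
module_dir = tempfile.mkdtemp()
Path(module_dir, "shapes_demo.py").write_text(
    "PI_APPROX = 3.14\n"
    "\n"
    "def circle_area(radius):\n"
    "    return PI_APPROX * radius * radius\n"
)

# Python looks for modules in the folders listed in sys.path.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

import shapes_demo

# Functions and variables of the module are reached through the module name.
area = shapes_demo.circle_area(2)
print(area)
print(shapes_demo.PI_APPROX)
```

The file name (without ".py") becomes the module name, and everything defined at the top level of the file becomes an attribute of the module.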

In [0]:
#When Python is installed, a set of standard libraries is installed with it. They are called standard because
#users don't have to install them manually. If you see an error message saying a module is not found, install
#that particular module using "pip install" or "conda install".

#import the math library into our program. 
import math  

In [2]:
#All the functions defined in the math library are called with the math. prefix
#Type math. and press the Tab key to see all the available functions
math.factorial(5)

120

In [3]:
print(1*2*3*4*5)

120
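Tab completion only works in an interactive editor. As a rough alternative, the built-in `dir()` function lists the names a module defines. The sketch below filters out the private names; the variable name `public_names` is just a choice made for this example.

```python
import math

# Public names are the ones that do not start with an underscore.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(len(public_names))
print(public_names[:5])

# The first line of a function's docstring summarizes what it does.
summary = (math.factorial.__doc__ or "").splitlines()[0]
print(summary)
```

For the full description of a function, `help(math.factorial)` prints its documentation.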


In [0]:
#We can also use from 'lib' import 'function' to import only specific functions,
#or import all functions using *
#Functions imported this way are bound directly in our program's namespace,
#so they are called without the math. prefix
from math import factorial

In [0]:
#we can use "as" to create shorter aliases for libraries
#we will call math m in our program
import math as m
#we will call factorial fac
from math import factorial as fac
#we can import all the functions into our namespace using *
#this can be dangerous if you have defined your own functions with the same names as the imported functions
from math import *

In [9]:
print(m.factorial(5))

120


In [10]:
print(fac(5))

120

In [0]:
from math import log2

In [0]:
from math import *

In [13]:
#log2() is available because we imported *
print(log2(1024))

#math is called m now
print(m.factorial(5))

#and factorial is called fac now
print(fac(5))

#why doesn't m.fac() exist? (fac is a name in our program, not an attribute of the math module)
print(m.fac(5))

10.0
120
120


AttributeError: ignored
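The error above is worth a closer look. An alias created with `as` is a name in our own program; it does not add anything to the math module. A small sketch of how to check this (the variable names `same_object`, `module_has_fac`, and `fallback` are chosen for this example):

```python
import math as m
from math import factorial as fac

# The alias fac lives in our own namespace and points at the same function object.
same_object = fac is m.factorial
print(same_object)

# The math module itself has no attribute called fac.
module_has_fac = hasattr(m, "fac")
print(module_has_fac)

# getattr with a default avoids the AttributeError when a name may be missing.
fallback = getattr(m, "fac", None)
print(fallback)
```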

In [14]:
#Now, let's define our own factorial function

def factorial(n):
    fac=1
    for i in range(n):
        fac *= i+1
    #to make the function different from the factorial function defined in the math library
    #we make the return value a string
    return str(fac)

factorial(5)  
#we can see that the function called is the one we defined in our code
#when an imported function and our own function share a name, the most recent
#definition wins; here our def ran after the import, so it replaced math's factorial

'120'
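A name always refers to whatever was bound to it last, so the order of the definition and the import matters. The following sketch, which uses the same string-returning trick as the cell above, shows the import taking the name back when it runs after our `def`:

```python
def factorial(n):
    """Our own version, returning a string so it is easy to tell apart."""
    result = 1
    for i in range(1, n + 1):
        result *= i
    return str(result)

ours = factorial(5)          # our function is the current binding
print(repr(ours))

from math import factorial   # this import rebinds the name factorial

theirs = factorial(5)        # now the math version is called
print(repr(theirs))
```

Using `import math` and calling `math.factorial` avoids this kind of collision entirely, because the library's functions stay behind the module name.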

In [15]:
#Not only functions can be imported; module-level variables such as constants can be imported as well
print(m.pi) #import math as m
print(pi)  #from math import *
print('{0:.6}'.format(pi))

3.141592653589793
3.141592653589793
3.14159
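The format specification `.6` above means six significant digits, which is easy to confuse with six decimal places. A short sketch comparing a few common specifications (the variable names are only for this example):

```python
import math

six_significant = '{0:.6}'.format(math.pi)    # 6 significant digits
six_decimals = '{0:.6f}'.format(math.pi)      # 6 digits after the decimal point
with_fstring = f'{math.pi:.2f}'               # f-string form, 2 decimals

print(six_significant)
print(six_decimals)
print(with_fstring)
```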


# Use Standard and External Python Modules
Now that we have learned how to import Python modules and libraries, let's use them to quickly solve data retrieval challenges.
We will first look at how to retrieve text from web pages. <br>
[urllib library](https://docs.python.org/3/library/urllib.html) <br>
[regular expression - re library](https://docs.python.org/3/library/re.html) <br>
[beautifulsoup library](https://www.crummy.com/software/BeautifulSoup/) <br>
[beautifulsoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [16]:
#use the urllib library to open a URL and read the web page contents
import urllib.request

#use the urlopen function in the request submodule
#the page lists popular given names around the world, categorized by country
html = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_most_popular_given_names")

#read the page contents the same way we read from an open file
htmltext=html.read()

#htmltext was read in as bytes
print(type(htmltext))

#convert bytes to a string so the string can be written to a text file
#note: str() on bytes keeps the b'...' wrapper and escaped \n sequences, as the output below shows;
#htmltext.decode('utf-8') would return the decoded text instead
htmltext = str(htmltext)

#write what we retrieved from the url to an html file and check that we got the right content
f=open('test.html','w')
f.write(htmltext)
#close() needs parentheses; a bare f.close does not close the file
f.close()

print(htmltext)

<class 'bytes'>
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of most popular given names - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_most_popular_given_names","wgTitle":"List of most popular given names","wgCurRevisionId":894448075,"wgRevisionId":894448075,"wgArticleId":375845,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: dates","CS1 Chinese-language sources (zh)","CS1 errors: external links","Dynamic lists","All articles with unsourced statements","Articles with unsourced statements from December 2008","Articles with unsourced statements from Septemb
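The leading `b'` and the literal `\n` sequences in the output come from calling `str()` on a bytes object. A minimal sketch of the difference between `str()` and `decode()`, using a short made-up bytes value in place of the downloaded page and a temporary folder in place of the notebook folder:

```python
import os
import tempfile

# A short stand-in for the bytes returned by html.read() (made-up sample).
raw = b'<title>Caf\xc3\xa9 names</title>\n<p>John</p>'

as_repr = str(raw)              # keeps the b'...' wrapper and the escape sequences
as_text = raw.decode('utf-8')   # decodes the bytes into real text

print(as_repr)
print(as_text)

# Writing inside a with block closes the file automatically.
path = os.path.join(tempfile.mkdtemp(), 'test.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write(as_text)

# Read the file back to confirm that the text survived the round trip.
with open(path, encoding='utf-8') as f:
    round_trip = f.read()
```

Decoding assumes the page is encoded as UTF-8, which the `<meta charset="UTF-8"/>` tag in the output above suggests for this page.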

In [17]:
#Collect all the words into a list using split()
#by default split() separates on whitespace, including spaces, tabs, and newlines
words = htmltext.split()
words

["b'<!DOCTYPE",
 'html>\\n<html',
 'class="client-nojs"',
 'lang="en"',
 'dir="ltr">\\n<head>\\n<meta',
 'charset="UTF-8"/>\\n<title>List',
 'of',
 'most',
 'popular',
 'given',
 'names',
 '-',
 'Wikipedia</title>\\n<script>document.documentElement.className',
 '=',
 'document.documentElement.className.replace(',
 '/(^|\\\\s)client-nojs(\\\\s|$)/,',
 '"$1client-js$2"',
 ');</script>\\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_most_popular_given_names","wgTitle":"List',
 'of',
 'most',
 'popular',
 'given',
 'names","wgCurRevisionId":894448075,"wgRevisionId":894448075,"wgArticleId":375845,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1',
 'errors:',
 'dates","CS1',
 'Chinese-language',
 'sources',
 '(zh)","CS1',
 'errors:',
 'external',
 'links","Dynamic',
 'lists","All',
 'articles',
 'wit

In [18]:
#Now we want to find the names that start with the letter J or T and are between 4 and 7 characters long
#we start by trying the search() function in the regular expression library

import re

for word in words:
    
    if re.search(r'John',word):
        print(word)

#the result is very messy. We have to find another way, or other libraries, to better process web page contents

href="/wiki/John_(given_name)"
title="John
name)">John</a>,
href="/wiki/John_(first_name)"
title="John
name)">John</a></td>\n<td><a
href="/wiki/John_(given_name)"
title="John
href="/wiki/John_(given_name)"
title="John
href="/wiki/John_(name)"
title="John
(name)">John</a>
href="/wiki/John_(name)"
title="John
href="/wiki/John_(given_name)"
title="John
href="/wiki/John_(given_name)"
title="John
href="/wiki/John_(first_name)"
title="John
href="/wiki/John_(given_name)"
title="John
href="/wiki/John_(given_name)"
title="John
href="/wiki/John_(given_name)"
title="John
href="/wiki/John_(name)"
title="John
(name)">John/Jean/Jonathan/Juan/Gan</a></td>\n<td><a
href="/wiki/John_(name)"
title="John
href="/wiki/John_(given_name)"
title="John
href="/wiki/John_(first_name)"
title="John
href="/wiki/John_(first_name)"
title="John
href="/wiki/John_(given_name)"
title="John
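Once the names are separated from the HTML tags, the original question (names starting with J or T that are 4 to 7 characters long) fits in a single pattern. The sketch below runs on a made-up string of names, not on the Wikipedia page, and assumes names consist of one capital letter followed by lowercase ASCII letters:

```python
import re

# Made-up sample text standing in for names already separated from the HTML tags.
text = "John, Thomas, Jo, Tim, Jacqueline, Teresa, Maria, Juan, Jonathan"

# \b marks a word boundary; [JT] is the first letter; [a-z]{3,6} adds 3 to 6 more letters.
pattern = re.compile(r'\b[JT][a-z]{3,6}\b')
matches = pattern.findall(text)

print(matches)
print(len(matches))
```

The word boundaries matter: without them, the pattern would also match the first seven letters of longer names such as Jacqueline.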


In [0]:
#Let's try the BeautifulSoup library, another solution for processing data from HTML pages. It is much better at
#recognizing HTML tags and extracting the data between them
from bs4 import BeautifulSoup

#BeautifulSoup usually works hand in hand with the requests package. requests fetches a web page,
#much like opening a file for processing.
import requests

#The following page contains the system vulnerabilities reported by Symantec
#Let's retrieve the content of the page for processing
url = 'https://www.symantec.com/security_response/landing/vulnerabilities.jsp'

In [0]:
# Request content from web page
response = requests.get(url)
content = response.content
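A request can fail or hang, so it may be worth wrapping the two lines above in a small helper. This is only a sketch: the function name `fetch_content` and the 10-second default are choices made for this example, while `timeout` and `raise_for_status()` are part of the requests library.

```python
import requests

def fetch_content(url, timeout=10):
    """Return the raw bytes of a page, or raise an exception if the request fails."""
    # timeout stops the call from waiting forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() turns HTTP error codes such as 404 or 500 into exceptions.
    response.raise_for_status()
    return response.content
```

With this helper, `content = fetch_content(url)` would take the place of the two lines in the cell above.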

In [0]:
soup = BeautifulSoup(content, 'lxml')

In [22]:
print(soup)

<!DOCTYPE html>
<html lang="en">
<head>
<!--[if gt IE 8]><!-->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<!--<![endif]-->
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta name="keywords"/>
<meta name="description"/>
<!--[if ! lte IE 8]><!-->
<!-- <link rel="stylesheet" href="/etc/designs/symantec/clientlib/css/devkit.css"> -->
<!--script src="/etc/designs/symantec/clientlib/js/vendor/modernizr.custom.34906.js"></script-->
<!--<![endif]-->
<!--[if ! lte IE 6]><!-->
<!-- <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css"> -->
<!--<link rel="stylesheet" href="//pro.fontawesome.com/releases/v5.0.10/css/all.css" integrity="sha384-KwxQKNj2D0XKEW5O/Y6haRH39PE/xry8SAoLbpbCMraqlX7kUP6KHOnrlrtvuJLR" crossorigin="anonymous">
	<link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">-->
<!--<![endif]-->
<!--[if (gt IE 6) & (lt IE 9)]>
	<link rel="stylesheet

In [23]:
# Use a browser to open or download the page
# I used Firebug, a Firefox/Chrome plug-in, to analyze the page's tag structure
# Locate the division that contains the table
# Find all the tables in this division
division_of_interest = soup.find("div",{'class':'bckSolidWht bckPadLarge clearfix'})
tables = division_of_interest.find_all('table')
print(tables[0])

AttributeError: ignored
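An AttributeError at this step most likely means that `find()` did not locate the division and returned `None`, for example because the page layout has changed since the cell was written. The sketch below reproduces the situation on a made-up HTML snippet and guards against it; it uses the built-in "html.parser" so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Made-up HTML: note that it has no div with the class we are about to look for.
sample_html = "<html><body><div class='other'><table><tr><td>x</td></tr></table></div></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

division = soup.find("div", {"class": "bckSolidWht bckPadLarge clearfix"})
print(division)   # find() returns None when nothing matches

# Guard against None before calling methods on the result.
if division is None:
    tables = []
else:
    tables = division.find_all("table")

print(len(tables))
```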

In [24]:
#retrieve all the rows from the first table 
rows = tables[0].find_all('tr')

NameError: ignored

In [0]:
#create lists to store retrieved data

#vulnerability names
vul_names =[]

#discovered dates
date_values=[]

#severity level from 1 to 5
severity_values = []

#the URLs of the vulnerability description pages
url_values = []

#in each row
for tr in rows: 
    # if it's not the first row
    # use tr(text=True) to determine if it's the first row
    if "Discovered" not in tr(text=True):
        #severity level is the value of the title attribute in the img tag
        severity = tr.find('img')['title']
        #append to the list
        severity_values.append(severity)
        
        #find the first <a> tag in the row
        a_tag = tr.find('a')
        #vulnerability name is the first text of the <a> tag
        vul = a_tag(text=True)[0]
        vul_names.append(vul)
        
        #construct the URLs 
        url_values.append("https://www.symantec.com/"+a_tag['href'])
        
        #date is the text of the last <td> tag
        td = tr.find_all('td')
        #remove the <td></td> tags from the string
        date = str(td[-1]).replace("<td>",'').replace("</td>","")
        #add date to the date_values list
        date_values.append(date)
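Since the live page may no longer be available, the same extraction can be tried on a small made-up table. The layout below (severity icon, linked name, date, with a `<th>` header row) is an assumption modeled on what the loop above expects, not a copy of the Symantec page. The sketch also uses `get_text()` instead of removing the `<td>` tags by hand, and stores each row as one dictionary rather than in four parallel lists.

```python
from bs4 import BeautifulSoup

# Made-up table with the layout the loop above expects: severity icon, linked name, date.
sample_html = """
<table>
  <tr><th>Severity</th><th>Name</th><th>Discovered</th></tr>
  <tr><td><img title="4"/></td><td><a href="/vuln?id=1">Sample Flaw One</a></td><td>12/13/2016</td></tr>
  <tr><td><img title="2"/></td><td><a href="/vuln?id=2">Sample Flaw Two</a></td><td>01/05/2017</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

records = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # the header row uses <th> cells, so it has no <td>
    a_tag = tr.find("a")
    records.append({
        "severity": int(tr.find("img")["title"]),
        "name": a_tag.get_text(strip=True),
        "href": a_tag["href"],
        "date": cells[-1].get_text(strip=True),
    })

for record in records:
    print(record)
```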

In [0]:
#test to see the values
print(date_values[30])
print(url_values[30])

12/13/2016
https://www.symantec.com//security_response/vulnerability.jsp?bid=94715
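Notice the double slash after the host name in the URL above: the base ends with "/" and the href starts with "/". Many servers tolerate this, but `urljoin` from the standard library builds the URL cleanly. A short sketch using the values from the output above:

```python
from urllib.parse import urljoin

base = "https://www.symantec.com/"
href = "/security_response/vulnerability.jsp?bid=94715"

concatenated = base + href       # keeps both slashes
joined = urljoin(base, href)     # resolves the path against the base

print(concatenated)
print(joined)
```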


In [0]:
#use input() to get user input
#cast the input to integer
user_severity = int(input("please choose a severity level equal to or above : "))

#find all vulnerabilities associated with "Adobe" with a severity equal to or above user_severity
for index in range(len(vul_names)):
    if int(severity_values[index]) >= user_severity:
        if 'Adobe' in vul_names[index]:
            print(vul_names[index])

please choose a severity level equal to or above : 4
Adobe Flash Player APSB16-39 Unspecified Use After Free Remote Code Execution Vu...
Adobe Flash Player CVE-2016-7855 Use After Free Remote Code Execution Vulnerabil...
Adobe Flash Player APSB16-29 Multiple Unspecified Memory Corruption Vulnerabilit...
Adobe Flash Player APSB16-25 Multiple Use After Free Remote Code Execution Vulne...
Adobe Acrobat and Reader APSB16-26 Multiple Unspecified Memory Corruption Vulner...
Adobe Flash Player APSB16-25 Multiple Unspecified Memory Corruption Vulnerabilit...
Adobe Flash Player Multiple Unspecified Security Vulnerabilities
Adobe Flash Player CVE-2015-3113 Unspecified Heap Buffer Overflow Vulnerability
Adobe Flash Player and AIR CVE-2015-8651 Unspecified Integer Overflow Vulnerabil...
Adobe Flash Player and AIR APSB15-32 Multiple Unspecified Memory Corruption Vuln...
Adobe Flash Player ActionScript 3 ByteArray Use After Free Remote Memory Corrupt...
Adobe Flash Player CVE-2015-7645 Remote Code E
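The index-based loop works, but pairing the lists with `zip()` is a common alternative. The sketch below uses made-up names and severities in the same parallel-list layout, and a fixed `user_severity` instead of `input()` so that it runs without typing anything:

```python
# Made-up sample data in the same parallel-list layout as vul_names and severity_values.
vul_names = ["Adobe Reader Sample Flaw", "Other Vendor Sample Flaw", "Adobe Flash Sample Flaw"]
severity_values = ["5", "4", "2"]
user_severity = 4

# zip() pairs each name with its severity, so no index arithmetic is needed.
selected = [
    name
    for name, severity in zip(vul_names, severity_values)
    if int(severity) >= user_severity and "Adobe" in name
]
print(selected)
```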

In [0]:
#the date_values are strings
#we need to convert them to datetime objects so we can compare them with
#the time boundaries users set
date_values[40]

'12/13/2016'

In [0]:
#after searching the internet, we found the datetime module for this purpose
from datetime import datetime

In [0]:
#define a time based on a specific format
a = datetime.strptime('1/30/2011',"%m/%d/%Y")

In [0]:
#compare the time with 2015/12/2
#the result is correct
print(a > datetime(2015,12,2))

False
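The same idea extends to a whole list: parse every string once, then keep the dates that fall between two boundaries. In this sketch the sample dates and the `start` and `end` boundaries are made up; in the notebook they would come from `date_values` and from the user.

```python
from datetime import datetime

# Made-up sample strings in the same month/day/year format as date_values.
date_values = ["12/13/2016", "01/30/2011", "06/15/2015"]

# Parse every string once, then compare datetime objects directly.
parsed = [datetime.strptime(value, "%m/%d/%Y") for value in date_values]

start = datetime(2015, 1, 1)
end = datetime(2016, 12, 31)
in_range = [d.strftime("%Y-%m-%d") for d in parsed if start <= d <= end]
print(in_range)
```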


In [0]:
#open the vulnerability description pages one at a time
for index in range(100):
    response = requests.get(url_values[index])
    content = response.content
    soup = BeautifulSoup(content,"lxml")
    