# Modules for retrieving data from the Internet

## Programming tools, 2016 Winter semester, CEU
_Jeno Pal_, Jan-Feb 2016

## Regular expressions

Regular expressions are a powerful tool for extracting information from text. Essentially, they form a pattern language in which you can describe complex rules for finding what you are looking for.

In [39]:
import re

To use a regular expression, define a pattern, compile it, and then call `match` or `search` to look for the pattern in a string.

* `match` looks for the pattern only at the beginning of the string
* `search` looks for the pattern anywhere in the string

In [40]:
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent semper ex sed erat sodales, consectetur molestie augue blandit. Cras ac nisl elementum, mollis ex vitae, accumsan eros. Phasellus nulla purus, bibendum sit amet ultrices et, dignissim vitae erat. Pellentesque in gravida eros, et condimentum nisl. Praesent lobortis fermentum ipsum et lobortis. Proin ultricies dignissim eros ac cursus. Proin blandit libero ac lectus dictum rhoncus. Etiam condimentum, ligula gravida gravida rhoncus, purus erat facilisis tortor, ut elementum purus justo vitae enim. Vestibulum auctor condimentum porttitor. Quisque blandit augue vitae justo facilisis interdum. Sed vel facilisis leo, vel pulvinar enim."

In [41]:
print(text)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent semper ex sed erat sodales, consectetur molestie augue blandit. Cras ac nisl elementum, mollis ex vitae, accumsan eros. Phasellus nulla purus, bibendum sit amet ultrices et, dignissim vitae erat. Pellentesque in gravida eros, et condimentum nisl. Praesent lobortis fermentum ipsum et lobortis. Proin ultricies dignissim eros ac cursus. Proin blandit libero ac lectus dictum rhoncus. Etiam condimentum, ligula gravida gravida rhoncus, purus erat facilisis tortor, ut elementum purus justo vitae enim. Vestibulum auctor condimentum porttitor. Quisque blandit augue vitae justo facilisis interdum. Sed vel facilisis leo, vel pulvinar enim.


In [42]:
regex = "Lorem"
pattern = re.compile(regex)

In [43]:
# if there is a match, it gives back a match object
pattern.match(text)

<_sre.SRE_Match object; span=(0, 5), match='Lorem'>

In [44]:
# if there is no match, it gives back nothing (None)
pattern = re.compile("Foobar")
pattern.match(text)

In [45]:
# see the difference between match and search: searching only at the beginning or
# in the whole string
pattern = re.compile("ipsum")

result_match = pattern.match(text)
result_search = pattern.search(text)

In [46]:
print(bool(result_match))
print(bool(result_search))

False
True
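
As a small sketch of working with the result: a match object exposes the matched text and its position, while a failed match is `None`, so it is worth checking before calling methods on it. The string `sample` is a shortened stand-in for the paragraph above, not the full text.

```python
import re

sample = "Lorem ipsum dolor sit amet"  # shortened stand-in for `text`
pattern = re.compile("ipsum")

at_start = pattern.match(sample)    # only tries position 0
anywhere = pattern.search(sample)   # scans the whole string

print(at_start)            # None
print(anywhere.group())    # ipsum
print(anywhere.span())     # (6, 11)

# guard against None before using the result
found = anywhere.group() if anywhere else None
```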


### Elements of patterns

In [47]:
# search at the beginning: ^
# search at the end: $

text = "Python is a lot of fun"

pattern = re.compile("^Python")
pattern.search(text)

<_sre.SRE_Match object; span=(0, 6), match='Python'>

In [48]:
pattern = re.compile("^is")
print(bool(pattern.search(text)))

pattern = re.compile("is")
print(bool(pattern.search(text)))

pattern = re.compile("lot")
print(bool(pattern.search(text)))

pattern = re.compile("lot$")
print(bool(pattern.search(text)))

False
True
True
False


In [49]:
# \d: any numeric digit

phone_num = "(800) 456-567"

In [50]:
# there is a numeric character in the string
# (raw strings, r"...", keep Python from interpreting the backslash itself)
pattern = re.compile(r"\d")
print(bool(pattern.search(phone_num)))

# it does not begin with one
pattern = re.compile(r"^\d")
print(bool(pattern.search(phone_num)))

# it does end with one
pattern = re.compile(r"\d$")
print(bool(pattern.search(phone_num)))

True
False
True


In [51]:
# x*: x is present zero or more times
# x+: x is present one or more times

phone_num_1 = "(800) 456-567"
phone_num_2 = "800 456-567"

# parentheses have to be escaped with a backslash because they are special characters in regex;
# to tell Python that we are looking for a literal parenthesis in the text, we have to use
# \( \)
pattern_plus = re.compile(r"\(+\d\d\d\)+")
pattern_star = re.compile(r"\(*\d\d\d\)*")

print(bool(pattern_plus.search(phone_num_1)))
print(bool(pattern_plus.search(phone_num_2)))

print(bool(pattern_star.search(phone_num_1)))
print(bool(pattern_star.search(phone_num_2)))

True
False
True
True
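
To see why the results differ, it can help to print what was actually matched rather than only whether something matched. This is a sketch using the same two patterns and phone numbers; only the loop is new.

```python
import re

pattern_plus = re.compile(r"\(+\d\d\d\)+")
pattern_star = re.compile(r"\(*\d\d\d\)*")

for number in ["(800) 456-567", "800 456-567"]:
    plus = pattern_plus.search(number)
    star = pattern_star.search(number)
    # `plus` is None when the parentheses are missing, `star` still finds the digits
    print(number, "|", plus.group() if plus else None, "|", star.group())

# (800) 456-567 | (800) | (800)
# 800 456-567 | None | 800
```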


In [52]:
# {n}: match exactly n times
# {n,m}: match at least n times, not more than m times

phone_num_1 = "(800) 456-567"
phone_num_2 = "(36) 345-456"

pattern = re.compile(r"\(\d{3}\) \d{3}-\d{3}")
print(bool(pattern.search(phone_num_1)))
print(bool(pattern.search(phone_num_2)))

# now allow for 2 or 3 digits in the area code
pattern = re.compile(r"\(\d{2,3}\) \d{3}-\d{3}")
print(bool(pattern.search(phone_num_2)))

True
False
True
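
One caveat worth sketching: `search` is satisfied as soon as the pattern occurs somewhere in the string, so it also accepts inputs with extra characters. If the goal is to validate a whole string, `fullmatch` (or anchoring with `^` and `$`) is one option. The input `too_long` is a made-up example.

```python
import re

pattern = re.compile(r"\(\d{2,3}\) \d{3}-\d{3}")

too_long = "(800) 456-5678"   # hypothetical input with one digit too many

print(bool(pattern.search(too_long)))            # True: "(800) 456-567" occurs inside
print(bool(pattern.fullmatch(too_long)))         # False: the whole string must fit
print(bool(pattern.fullmatch("(36) 345-456")))   # True
```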


In [53]:
# (a|b|c): matches either a or b or c

phone_number_1 = "36-234-345"
phone_number_2 = "HUN-234-567"

# [A-Za-z] matches any ASCII letter
# ([A-z] would also match the characters between "Z" and "a", such as "_" and "^")

pattern = re.compile(r"(\d+|[A-Za-z]+)-\d{3}-\d{3}")
print(bool(pattern.search(phone_number_1)))
print(bool(pattern.search(phone_number_2)))

True
True
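
A note on character ranges: a range such as `[A-z]` covers every ASCII character between `A` and `z`, which includes a few non-letters. The sketch below compares it with `[A-Za-z]`; the candidate strings are made up for illustration.

```python
import re

loose = re.compile(r"^[A-z]+$")       # ASCII range from "A" to "z"
strict = re.compile(r"^[A-Za-z]+$")   # letters only

for candidate in ["HUN", "hun", "H_N", "H^N"]:
    print(candidate, bool(loose.search(candidate)), bool(strict.search(candidate)))

# HUN True True
# hun True True
# H_N True False
# H^N True False
```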


In [54]:
# (x) is stored as a group; the matched values are available via the `.groups()` method of the match object

phone_number_1 = "36-234-345"
phone_number_2 = "HUN-234-567"

# \w means a "word" character: a letter, a digit or an underscore
pattern = re.compile(r"(\w+)-(\d{3})-(\d{3})")

gr = pattern.search(phone_number_1).groups()
print(type(gr))
print(gr)

<class 'tuple'>
('36', '234', '345')


In [75]:
gr = pattern.search(phone_number_2).groups()
print("Area is {}".format(gr[0]))
print("First part of the number is {}".format(gr[1]))
print("Second part of the number is {}".format(gr[2]))

Area is HUN
First part of the number is 234
Second part of the number is 567


In [80]:
# named groups, written (?P<name>...), can be retrieved as a dictionary, too

pattern = re.compile(r"(?P<area>\w+)-(?P<first_part>\d{3})-(?P<second_part>\d{3})")
gr = pattern.search(phone_number_2).groupdict()

print(type(gr))
print(gr)

<class 'dict'>
{'area': 'HUN', 'first_part': '234', 'second_part': '567'}
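
Named groups can also be read one at a time, and `finditer` yields a match object for every occurrence in a longer text. A minimal sketch, assuming a made-up sentence `note` that contains both example numbers:

```python
import re

pattern = re.compile(r"(?P<area>\w+)-(?P<first_part>\d{3})-(?P<second_part>\d{3})")

# hypothetical text containing both numbers from the examples above
note = "Call 36-234-345 or HUN-234-567 tomorrow."

for m in pattern.finditer(note):
    print(m.group("area"), m.groupdict())

areas = [m.group("area") for m in pattern.finditer(note)]
print(areas)   # ['36', 'HUN']
```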


There is a lot more to regular expressions. Google is your friend!

## urllib.request

`urllib` is a standard-library package for working with URLs. We will use its `urllib.request` module solely to retrieve a web page as HTML text, but again, there is a lot more to it.

In [56]:
from urllib.request import urlopen

In [57]:
url = "http://scores.espn.go.com/nba/boxscore?gameId=400579201"

response = urlopen(url)
html_text = response.read()
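
Note that `response.read()` returns raw bytes, not a string. The following is a minimal sketch of decoding them into text, not the only way to do it. The helper `fetch_text` and the variable `demo_url` are hypothetical names, and the `data:` URL merely stands in for a real address so that the snippet runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Return the body of `url` as a string (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes
        # use the charset declared by the server, if there is one
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")


# stand-in for a real web address: a data: URL carrying a tiny HTML snippet
demo_url = "data:text/html;charset=utf-8," + quote("<h1>An Interesting Title</h1>")

html = fetch_text(demo_url)
print(type(html).__name__, html)
```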

## Beautiful Soup

This is a module used to parse HTML files: to make sense of their content, extract relevant fields, etc. We are again just scratching the surface here. For more, see the book [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do) on which this part is based.

* Where does the name come from? See [here](http://en.wikipedia.org/wiki/Beautiful_Soup)

In [58]:
from bs4 import BeautifulSoup

In [59]:
url = "http://pythonscraping.com/pages/page1.html"

# go and check this url! use Right click -> Inspect element to discover what kind of tags
# enclose information you are looking for

response = urlopen(url)
html_text = response.read()

soup = BeautifulSoup(html_text)



(Warning output, truncated: no parser was specified. Beautiful Soup suggests changing `BeautifulSoup([your markup])` to `BeautifulSoup([your markup], "lxml")`.)


In [60]:
# name the HTML parser explicitly:

def BS(html):
    """
    Overcome annoying warning message: set the html parser
    """
    return BeautifulSoup(html, "lxml")

soup = BS(html_text)

If you want to parse parts of a web page, you have to inspect its structure closely. A great way to do that in Chrome is `Right click - Inspect element`, which lets you descend the tree of tags until you reach the information you want. This shows how the information is embedded, so that you can search for it with Beautiful Soup. Information is always contained within tags; the tree structure of the tags, together with their attributes, defines the paths you search along.
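
Because a live page may change or be unavailable, the same idea can be tried on a small inline document. This is only a sketch: the markup in `html_doc` is made up for illustration, and it uses Python's built-in `"html.parser"` instead of `"lxml"` so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# hypothetical markup, standing in for a downloaded page
html_doc = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div id="content"><p class="intro">First paragraph.</p></div>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html_doc, "html.parser")

# walk down the tree tag by tag
print(demo_soup.h1.get_text())             # An Interesting Title
print(demo_soup.body.div.p["class"])       # ['intro']

# or jump directly to a tag identified by an attribute
content = demo_soup.find("div", {"id": "content"})
print(content.p.get_text())                # First paragraph.
```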

In [62]:
# we can get the first h1 tag
print(soup.h1)
print(type(soup.h1))

<h1>An Interesting Title</h1>
<class 'bs4.element.Tag'>


In [63]:
# this is identical to

print(soup.body.h1)
print(soup.html.body.h1)

<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>


### Searching for tags by attribute values

In [64]:
url = "http://www.pythonscraping.com/pages/warandpeace.html"
response = urlopen(url)
html_text = response.read()
soup = BS(html_text)

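Use `findAll` to find all tags whose name is given in the first argument (a single tag name or a collection of names) and whose attributes match the key-value pairs of the dictionary in the second argument. In Beautiful Soup 4 the method is primarily called `find_all`; `findAll` is the older spelling.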

In [65]:
# the green text marks character names

# find all <span> tags with attribute class equal to "green"
name_list = soup.findAll("span", {"class": "green"})
for name in name_list:
    print(name.get_text())   # use .get_text() to find text contained within a tag

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
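
How the attribute dictionary filters tags can be sketched on a tiny inline document. The markup below is invented for illustration (it is not the actual page), and `"html.parser"` is used so that the snippet does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# hypothetical markup with the same kind of colored <span> tags
html_doc = """
<p><span class="green">Anna Pavlovna</span> said to
<span class="red">Well, Prince</span>
<span class="green" id="second">Prince Vasili</span></p>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

green = demo_soup.find_all("span", {"class": "green"})
print([tag.get_text() for tag in green])   # ['Anna Pavlovna', 'Prince Vasili']

# every key in the dictionary has to match
both = demo_soup.find_all("span", {"class": "green", "id": "second"})
print([tag.get_text() for tag in both])    # ['Prince Vasili']

# a list of values for one attribute matches any of them
either = demo_soup.find_all("span", {"class": ["green", "red"]})
print(len(either))                         # 3
```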


### Searching tags based on their position

Children: tags one level below the current one. Descendants are _somewhere_ below the current one. (That is, all children are descendants but not all descendants are children.)

In [66]:
def url_to_BS(url):
    response = urlopen(url)
    html_text = response.read()
    soup = BS(html_text)
    return soup

url = "http://www.pythonscraping.com/pages/page3.html"
soup = url_to_BS(url)

In [67]:
# "find" finds the first occurrence
bs_table = soup.find("table", {"id": "giftList"})

In [68]:
for child in bs_table.children:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


In [69]:
# there are many more descendants than children: every nested element appears there separately

print(len(list(bs_table.children)))
print(len(list(bs_table.descendants)))

13
86
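
The counts can look surprisingly high because whitespace between tags is kept as text nodes, and these count as children and descendants as well. A minimal sketch on a made-up one-row table (parsed with `"html.parser"`, which is an assumption made here to avoid needing `lxml`):

```python
from bs4 import BeautifulSoup, Tag

# hypothetical one-row table; "\n" marks the line breaks between tags
html_doc = "<table>\n<tr><td>Vegetable Basket</td><td>$15.00</td></tr>\n</table>"
table = BeautifulSoup(html_doc, "html.parser").table

children = list(table.children)
descendants = list(table.descendants)
tag_children = [c for c in children if isinstance(c, Tag)]  # drop text nodes

print(len(children))      # 3: "\n", <tr>, "\n"
print(len(descendants))   # 7: also the two <td> tags and their two text nodes
print(len(tag_children))  # 1
```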


Siblings are tags that sit on the same level.

In [70]:
bs_table.tr

<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

In [71]:
# with next_siblings we can get subsequent tags at the same level of the tree.
# similarly, there is previous_siblings, too.
for sibling in bs_table.tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

### Searching tags based on regular expressions

Using regular expressions, we can make our search queries more refined.

In [72]:
import re

soup.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})

[<img src="../img/gifts/img1.jpg"/>,
 <img src="../img/gifts/img2.jpg"/>,
 <img src="../img/gifts/img3.jpg"/>,
 <img src="../img/gifts/img4.jpg"/>,
 <img src="../img/gifts/img6.jpg"/>]

### Access attribute values of tags

Often we are looking for information inside a tag that we have found. We have already seen `.get_text()` to get the text enclosed within a tag. Attribute values can be retrieved, too.

In [73]:
image_tags = soup.find("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})

print(type(image_tags))

<class 'bs4.element.Tag'>


In [74]:
# access attribute values like in a dictionary

image_tags["src"]

'../img/gifts/img1.jpg'
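
Dictionary-style access raises a `KeyError` when the attribute is missing, so `.get()` can be the safer choice when not every tag carries it. A short sketch on a single inline tag (the markup is a stand-in for the tag found above, parsed with `"html.parser"`):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<img src="../img/gifts/img1.jpg"/>', "html.parser").img

print(tag["src"])              # ../img/gifts/img1.jpg
print(tag.attrs)               # {'src': '../img/gifts/img1.jpg'}
print(tag.get("alt"))          # None, whereas tag["alt"] would raise KeyError
print(tag.get("alt", "n/a"))   # n/a
```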