# Web Scraping Lotto Results

Here I'll be learning web scraping in Python.

In [2]:
from bs4 import BeautifulSoup
import re

The first thing to do is create a `BeautifulSoup` object from the starting URL. We will then search it for links to follow.

In [3]:
# Here is where we will begin scraping
url = "http://lottoresults.co.nz/lotto/archive"

def read_url(url):
    # Read a URL and return a BeautifulSoup object,
    # or None (after printing the error) if something
    # goes wrong
    from urllib.request import urlopen
    from urllib.error import URLError
    try:
        html = urlopen(url)
    except URLError as e:
        # URLError also covers HTTPError, its subclass
        print(e)
        return None
    try:
        soup = BeautifulSoup(html.read(), "lxml")
    except AttributeError as e:
        print(e)
        return None
    return soup

soup = read_url(url)

Next, we take a peek at all of the hyperlinks on the page, looking for the ones that point to the results data.

In [4]:
soup.findAll("a")

[<a class="navbar-brand" href="/">LottoResults.co.nz</a>,
 <a class="game" href="/lotto/" title="Lotto">Lotto</a>,
 <a class="game" href="/keno/" title="Keno">Keno</a>,
 <a class="game" href="/bullseye/" title="Bullseye">Bullseye</a>,
 <a class="game" href="/play3/" title="Play 3">Play 3</a>,
 <a class="world" href="/worldwide-lotteries/" title="Worldwide Lotteries">Worldwide Lotteries</a>,
 <a class="faq" href="/faq/" title="FAQ">FAQ</a>,
 <a class="tools" href="/tools/" title="Tools">Tools</a>,
 <a href="/lotto/faq">Lotto FAQ</a>,
 <a href="/" title="Home">Home</a>,
 <a href="/lotto/" title="Lotto">Lotto</a>,
 <a aria-controls="archive-1" aria-expanded="false" class="toggler collapsed" data-toggle="collapse" href="#archive-1" role="button">2016 Lotto Results</a>,
 <a href="/lotto/december-2016">December 2016</a>,
 <a href="/lotto/november-2016">November 2016</a>,
 <a href="/lotto/october-2016">October 2016</a>,
 <a href="/lotto/september-2016">September 2016</a>,
 <a href="/lotto/aug

There are quite a few tags that show up. We are mainly interested in tags of the form:

`<a href="/lotto/[month]-[year]">[Month] [year]</a>`

We will retrieve these using a regular expression, and compile the results into an object from which we will construct the links for the next level of analysis.

In [5]:
domain = "http://lottoresults.co.nz"
month_tags = soup.findAll("a", href=re.compile(r"/lotto/[a-z]+-\d+"))

def get_links(domain, tags):
    return [domain + tag.get("href") for tag in tags]

month_links = get_links(domain, month_tags)
month_links[:5]

['http://lottoresults.co.nz/lotto/december-2016',
 'http://lottoresults.co.nz/lotto/november-2016',
 'http://lottoresults.co.nz/lotto/october-2016',
 'http://lottoresults.co.nz/lotto/september-2016',
 'http://lottoresults.co.nz/lotto/august-2016']
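
As a quick offline check of the pattern, the sketch below filters a handful of hrefs copied from the output above and builds absolute URLs with `urllib.parse.urljoin` rather than string concatenation. The helper name `month_links_from_hrefs` is made up for this illustration, and both the `fullmatch` anchoring and the four-digit year are assumptions about how strict we want the filter to be.

```python
import re
from urllib.parse import urljoin

DOMAIN = "http://lottoresults.co.nz"
# Month pages look like /lotto/<month>-<year>; draw pages start with a day number.
MONTH_HREF = re.compile(r"/lotto/[a-z]+-\d{4}")

def month_links_from_hrefs(hrefs, domain=DOMAIN):
    """Keep only month-archive hrefs and turn them into absolute URLs."""
    return [urljoin(domain, h) for h in hrefs if MONTH_HREF.fullmatch(h)]

# A few hrefs taken from the list of <a> tags printed earlier.
sample_hrefs = [
    "/",
    "/lotto/",
    "/lotto/faq",
    "#archive-1",
    "/lotto/december-2016",
    "/lotto/november-2016",
]
links = month_links_from_hrefs(sample_hrefs)
print(links)
```

Navigation links such as `/lotto/faq` are dropped because they have no `-<year>` suffix.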

Now we pick one of these links (the last in the list) and look at each lotto draw in that month.

In [6]:
month_link = month_links[-1]
month_link

'http://lottoresults.co.nz/lotto/august-1987'

In [7]:
month_soup = read_url(month_link)
month_soup

<!DOCTYPE html>
<!--[if lt IE 7 ]> <html class="ie6"> <![endif]--><!--[if IE 7 ]> <html class="ie7"> <![endif]--><!--[if IE 8 ]> <html class="ie8"> <![endif]--><!--[if IE 9 ]> <html class="ie9"> <![endif]--><!--[if (gt IE 9)|!(IE)]><!--><html> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="#2D728F" name="theme-color"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1,maximum-scale=1, user-scalable=no" name="viewport"/>
<meta content="This is where you can find all the Lotto, Powerball and Strike! results for August, 1987." name="description"/>
<meta content="LottoResults.co.nz" name="author"/>
<link href="/favicon.ico" rel="icon"/>
<title>Lotto Results for August, 1987</title>
<link href="/css/style.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>
<link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,700,400italic" 

The breakdown for each month includes the draw number and jackpot result for every draw. So we will attempt to fetch these details at this level before continuing on.

We will use these to form a results dictionary for each of the draws, and build on this object as we scrape the extra details for each month. We will index each result by the number of the draw.
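
A minimal sketch of what indexing by draw number could look like, using the two values that the first card on this page turns out to contain. The names `results` and `add_result` are hypothetical, and converting the draw number to an `int` key is my assumption, not something the site requires.

```python
# Values mirror the first result card on the August 1987 page.
card = {"draw number": "5", "jackpot": "Rollover"}

results = {}  # hypothetical container: draw number -> result dictionary

def add_result(results, result):
    """Store a result under its integer draw number and return that key."""
    key = int(result["draw number"])
    # setdefault keeps details already stored; update merges in the new ones
    results.setdefault(key, {}).update(result)
    return key

key = add_result(results, card)
# A later pass can add more detail to the same entry.
add_result(results, {"draw number": "5", "bonus ball": "15"})
print(results[key])
```

Merging into an existing entry means the month-level pass and the detail-level pass can both write to the same dictionary.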

In [8]:
result_card_tags = month_soup.findAll("div", {"class":"result-card"})
result_card_tags

[<div class="result-card">
 <div class="result-card__heading">
 <h2 class="result-card__title result-card__title--medium"><a href="/lotto/29-august-1987">Lotto Result for Saturday, 29 August 1987 <span></span></a></h2>
 </div>
 <div class="result-card__details">
 <ul class="result-meta">
 <li class="result-meta__title">Draw Number: <span class="result-meta__detail">5</span></li>
 <li class="result-meta__title">Jackpot:  <span class="result-meta__detail">Rollover</span></li>
 </ul>
 </div>
 <div class="result-card__content">
 <div class="row">
 <div class="col-sm-4">
 <div class="result-card__image-header">
 <figure class="result-card__image result-card__image--long">
 <img alt="Lotto Logo" class="img-responsive game-logo" src="/img/lotto-logo.svg"/>
 </figure>
 </div>
 </div>
 <div class="col-sm-8">
 <ol class="draw-result">
 <li class="draw-result__ball draw-result__ball--blue-border">3</li>
 <li class="draw-result__ball draw-result__ball--blue-border">7</li>
 <li class="draw-result__

`result_card_tags` contains the results for each draw, as well as a link to a page with more detailed results. Since this neatly divides the results for each month, we will use it to grab the basic breakdown for the draw.

We will write a function that can parse a card and return a dictionary.

In [9]:
result_dict = {}

In [10]:
result_dict['link'] = domain + result_card_tags[0].find("a").get("href")
result_dict

{'link': 'http://lottoresults.co.nz/lotto/29-august-1987'}

In [11]:
result_dict['title'] = result_card_tags[0].h2.get_text()
result_dict['title']

'Lotto Result for Saturday, 29 August 1987 '
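
The title carries the draw date, so it may be worth turning it into a real date object. This is only a sketch: `title_to_date` is a made-up helper, it assumes every title starts with "Lotto Result for " as this one does, and `strptime` assumes English day and month names in the current locale.

```python
from datetime import datetime

def title_to_date(title):
    """Extract the draw date from a title like the one shown above."""
    prefix = "Lotto Result for "
    text = title.strip()  # the scraped title has trailing whitespace
    if not text.startswith(prefix):
        raise ValueError(f"unexpected title: {title!r}")
    # The remainder looks like "Saturday, 29 August 1987"
    return datetime.strptime(text[len(prefix):], "%A, %d %B %Y").date()

draw_date = title_to_date("Lotto Result for Saturday, 29 August 1987 ")
print(draw_date.isoformat())
```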

In [12]:
detail_tags = result_card_tags[0].findAll("li", {"class":"result-meta__title"})
[tag.get_text() for tag in detail_tags]

['Draw Number: 5', 'Jackpot:  Rollover']

In [13]:
for tag in detail_tags:
    if 'draw number' in tag.get_text().lower():
        result_dict['draw number'] = tag.span.get_text()
    elif 'jackpot' in tag.get_text().lower():
        result_dict['jackpot'] = tag.span.get_text()

In [14]:
result_dict

{'draw number': '5',
 'jackpot': 'Rollover',
 'link': 'http://lottoresults.co.nz/lotto/29-august-1987',
 'title': 'Lotto Result for Saturday, 29 August 1987 '}

In [15]:
draw_result_tags = result_card_tags[0].findAll("ol", {"class":"draw-result"})
draw_result_tags

[<ol class="draw-result">
 <li class="draw-result__ball draw-result__ball--blue-border">3</li>
 <li class="draw-result__ball draw-result__ball--blue-border">7</li>
 <li class="draw-result__ball draw-result__ball--blue-border">8</li>
 <li class="draw-result__ball draw-result__ball--gold-border">11</li>
 <li class="draw-result__ball draw-result__ball--gold-border">12</li>
 <li class="draw-result__ball draw-result__ball--green-border">23</li>
 <li class="draw-result__sep"></li><li class="draw-result__ball draw-result__ball--blue-border">15</li>
 </ol>, <ol class="draw-result draw-result--sub">
 <li class="draw-result__logo"><img alt="NZ Strike! Logo" src="/img/strike-logo.svg" width="80px"/></li>
 <li class="draw-result__ball draw-result__ball--green-border">23</li>
 <li class="draw-result__ball draw-result__ball--blue-border">7</li>
 <li class="draw-result__ball draw-result__ball--blue-border">3</li>
 <li class="draw-result__ball draw-result__ball--blue-border">8</li>
 </ol>]

In [16]:
re.findall(r"\d+", draw_result_tags[0].get_text())

['3', '7', '8', '11', '12', '23', '15']

In [17]:
re.findall(r"\d+", draw_result_tags[1].get_text())

['23', '7', '3', '8']

In [18]:
for tag in draw_result_tags:
    nums = re.findall(r"\d+", tag.get_text())
    
    if len(nums) == 7:
        result_dict["bonus ball"]    = nums.pop()
        result_dict["lotto result"]  = nums
    elif len(nums) == 4:
        result_dict["strike result"] = nums
    elif len(nums) == 1:
        result_dict["powerball"]     = nums[0]
    
result_dict

{'bonus ball': '15',
 'draw number': '5',
 'jackpot': 'Rollover',
 'link': 'http://lottoresults.co.nz/lotto/29-august-1987',
 'lotto result': ['3', '7', '8', '11', '12', '23'],
 'strike result': ['23', '7', '3', '8'],
 'title': 'Lotto Result for Saturday, 29 August 1987 '}
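
The steps so far can be gathered into one function, roughly as follows. This is a sketch under a few assumptions: `parse_card` is my own name for it, it uses `find_all` (the current spelling of `findAll`), it strips the title, and the test markup is a trimmed-down copy of the card shown earlier parsed with the built-in `html.parser` so it runs without network access.

```python
import re
from bs4 import BeautifulSoup

DOMAIN = "http://lottoresults.co.nz"

def parse_card(card, domain=DOMAIN):
    """Turn one result-card tag into a dictionary (hypothetical helper)."""
    result = {
        "link": domain + card.find("a").get("href"),
        "title": card.h2.get_text().strip(),
    }
    for tag in card.find_all("li", {"class": "result-meta__title"}):
        text = tag.get_text().lower()
        if "draw number" in text:
            result["draw number"] = tag.span.get_text()
        elif "jackpot" in text:
            result["jackpot"] = tag.span.get_text()
    for tag in card.find_all("ol", {"class": "draw-result"}):
        nums = re.findall(r"\d+", tag.get_text())
        if len(nums) == 7:      # six main balls plus the bonus ball
            result["bonus ball"] = nums.pop()
            result["lotto result"] = nums
        elif len(nums) == 4:    # Strike! numbers
            result["strike result"] = nums
        elif len(nums) == 1:    # Powerball
            result["powerball"] = nums[0]
    return result

# Trimmed-down copy of the card markup shown earlier, for an offline check.
html = """
<div class="result-card">
 <h2><a href="/lotto/29-august-1987">Lotto Result for Saturday, 29 August 1987 <span></span></a></h2>
 <ul class="result-meta">
  <li class="result-meta__title">Draw Number: <span class="result-meta__detail">5</span></li>
  <li class="result-meta__title">Jackpot:  <span class="result-meta__detail">Rollover</span></li>
 </ul>
 <ol class="draw-result">
  <li>3</li> <li>7</li> <li>8</li> <li>11</li> <li>12</li> <li>23</li> <li>15</li>
 </ol>
 <ol class="draw-result draw-result--sub">
  <li>23</li> <li>7</li> <li>3</li> <li>8</li>
 </ol>
</div>
"""
card = BeautifulSoup(html, "html.parser").find("div", {"class": "result-card"})
parsed = parse_card(card)
print(parsed)
```

On the live page this would be called as `[parse_card(card) for card in result_card_tags]`. Telling the games apart by how many numbers they contain follows the cells above; it would break if two games ever drew the same count of numbers.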

Now we fetch the `soup` for a draw's detailed results page, here using the draw from 21 April 2012 as the example. We will then add these details to the dictionary.

In [19]:
result_soup = read_url('http://lottoresults.co.nz/lotto/21-april-2012')

In [20]:
result_soup

<!DOCTYPE html>
<!--[if lt IE 7 ]> <html class="ie6"> <![endif]--><!--[if IE 7 ]> <html class="ie7"> <![endif]--><!--[if IE 8 ]> <html class="ie8"> <![endif]--><!--[if IE 9 ]> <html class="ie9"> <![endif]--><!--[if (gt IE 9)|!(IE)]><!--><html> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="#2D728F" name="theme-color"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1,maximum-scale=1, user-scalable=no" name="viewport"/>
<meta content="These are the Lotto results for draw number 1298 Saturday 21 April 2012." name="description"/>
<meta content="LottoResults.co.nz" name="author"/>
<link href="/favicon.ico" rel="icon"/>
<title>Lotto Results for Saturday, 21 April, 2012</title>
<link href="/css/style.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>
<link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,700,400italic" rel="

In [25]:
result_soup.findAll("ul", {"class":"draw-details-meta"})

[]

In [26]:
# findAll returns a ResultSet (a list of tags), so call get_text() on each tag
print([tag.get_text() for tag in result_soup.findAll("ul", {"class":"draw-details-meta"})])

In [23]:
draw_details = result_soup.findAll("ul", {"class":"draw-details-meta"})
draw_details[0]

IndexError: list index out of range

In [None]:
l = [1,2,3]
l.reverse()
l

In [22]:
# draw_details is a ResultSet, so visit each <ul> and then each of its <li> items
for details in draw_details:
    for tag in details.findAll("li"):
        text = tag.get_text().lower()
        # the value is the last space-separated token of the line
        append_text = text.rsplit(" ", 1)[-1]
        if 'total prize pool' in text:
            result_dict['prize pool'] = append_text
        elif 'number of winners' in text:
            result_dict['winner count'] = append_text
        elif 'average prize per winner' in text:
            result_dict['average prize'] = append_text

result_dict
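
Since the page did not give us any matching tags to test against, the text-matching logic can at least be tried on plain strings. In this sketch the helper `parse_detail_lines` and the sample lines are invented; the real wording on the site was not captured in this notebook, so treat the phrases as assumptions carried over from the cell above.

```python
# Maps a phrase to look for onto the key used in result_dict.
DETAIL_KEYS = {
    "total prize pool": "prize pool",
    "number of winners": "winner count",
    "average prize per winner": "average prize",
}

def parse_detail_lines(lines):
    """Pick out the last space-separated token of each recognised line."""
    details = {}
    for line in lines:
        text = line.strip().lower()
        for phrase, key in DETAIL_KEYS.items():
            if phrase in text:
                details[key] = text.rsplit(" ", 1)[-1]
                break
    return details

# Invented sample text, only to exercise the function.
sample = ["Total Prize Pool: $100", "Number of Winners: 7", "Unrelated line"]
print(parse_detail_lines(sample))
```

Lines that match none of the phrases are simply skipped.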

In [None]:
table_soup = result_soup.findAll("table")[0]
table_soup

In [None]:
table_soup.findAll("tr", {"data-row":re.compile(r"\d")})

In [None]:
result_dict

In [None]:
import pandas as pd

In [None]:
columns = ["Division", "Matches", "Winners", "Prize"]
result_dict["lotto divisions"]     = pd.DataFrame([], columns=columns)
result_dict["powerball divisions"] = pd.DataFrame([], columns=columns)
result_dict["strike divisions"]    = pd.DataFrame([], columns=columns)

# Sort each table row into the matching divisions table.
# Assumes every row splits into exactly four non-empty cells.
for row_soup in table_soup.findAll("tr", {"data-row": re.compile(r"\d")}):
    row_text = row_soup.get_text()
    row = [text for text in row_text.split("\n") if text != ""]

    if 'powerball' in row_text.lower():
        label = 'powerball divisions'
    elif 'exact order' in row_text.lower():
        label = 'strike divisions'
    else:
        label = 'lotto divisions'

    length = len(result_dict[label])
    result_dict[label].loc[length] = row

In [None]:
result_dict
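
The row-sorting rule can be checked without the live table. Below is a small sketch using plain lists; `classify_row` and `split_row` are hypothetical helpers, and the two row strings are invented because the real table cells were not captured in this notebook.

```python
def classify_row(row_text):
    """Return the result_dict key that a division row belongs to."""
    lowered = row_text.lower()
    if "powerball" in lowered:
        return "powerball divisions"
    if "exact order" in lowered:
        return "strike divisions"
    return "lotto divisions"

def split_row(row_text):
    """Split a row's text into its non-empty, stripped cells."""
    return [cell.strip() for cell in row_text.split("\n") if cell.strip()]

# Invented row text, only to exercise the two helpers.
tables = {}
for row_text in ["\nDivision 1\n6\n0\n$0\n", "\nDivision 1\n4 exact order\n0\n$0\n"]:
    tables.setdefault(classify_row(row_text), []).append(split_row(row_text))
print(tables)
```

Collecting the rows in lists first also makes it possible to build each table in one go with `pd.DataFrame(rows, columns=[...])` instead of growing it a row at a time.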

There. I've basically scraped everything that I need to at this point. Now all that remains to do is package the code into functions and create the loops that will run this code over the entire site.
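
One way the outer loop might look, as a rough sketch only. The function `crawl` is hypothetical, the five-second pause between requests is my assumption about what counts as polite, and `fetch` and `parse` are passed in as arguments so the loop can be tried offline with stand-ins before pointing it at the site.

```python
import time

def crawl(links, fetch, parse, delay=5.0, sleep=time.sleep):
    """Fetch and parse each link in turn, pausing between requests.

    `fetch` and `parse` are passed in so the loop can be tried offline;
    in the notebook they would be read_url and a card-parsing function.
    """
    results = []
    for i, link in enumerate(links):
        if i:                     # no need to wait before the first request
            sleep(delay)
        page = fetch(link)
        if page is None:          # read_url returns None on failure
            continue
        results.extend(parse(page))
    return results

# Offline stand-ins (hypothetical) so the sketch runs without network access.
fake_pages = {"a": ["draw 1", "draw 2"], "b": None, "c": ["draw 3"]}
pauses = []
found = crawl(["a", "b", "c"], fetch=fake_pages.get, parse=list,
              delay=5.0, sleep=pauses.append)
print(found, pauses)
```

A failed page is skipped rather than stopping the whole run, which matters when looping over several hundred month links.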

In [None]:
import time
time.sleep(5)
print("hello")