# Web scraping

The prize money and the number of winners for each draw are published on [The Lott](https://www.thelott.com/tattslotto/results).

How can we scrape this to get all the data in a format that will make analysis possible?

In [1]:
from urllib.request import urlopen

In [2]:
url = "https://www.thelott.com/tattslotto/results"

In [3]:
page = urlopen(url)

In [4]:
page

<http.client.HTTPResponse at 0x7f9550089490>

In [5]:
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)

<!doctype html>
<html lang="en">
<head state-context-override="vic">
    <title>TattsLotto ✓ Results | Australia&#39;s Official Lotteries |  The Lott</title>
    <meta name="keywords"/>
    <meta name="description" content="Official Site - View the latest Saturday TattsLotto results, check your ticket or search past draws at the Lott today! Safe &amp; Secure Site · Government Regulated · Australian Based · Responsible Play."/>
    <meta name="theme-color" content="#ffffff"/>
    
    <meta name="robots" content="index, follow"/>
    <meta name="viewport" content="width=device-width"/>

    

    

    


<script>
  (function() {
    let cookieStr =  decodeURIComponent(document.cookie).split(';').sort().reverse();
    var localeStr = altFind(cookieStr, function(v) {
       return v === 'locale';
    });

    let locale = (localeStr || "=").split('=')[1];

    let redirects = {"nsw":{"url":"/saturday-lotto/results","ignoreQuery":false,"path":"","search":""},"act":{"url":"/saturday-lotto/

On the web page itself, the part we want is the prize table. It has the headings "Division", "Division Prize Pool" and "Winners", and these are the columns we are interested in for each date the lottery was drawn.
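One way to check whether that table is actually present in the downloaded HTML is to collect the text of every `<th>` cell. The following is a minimal sketch using only the standard library. The `HeaderCollector` class, the `table_headers` function and the `sample` markup are all hypothetical, and the real page may build its table with JavaScript, in which case the headings would not appear in `html` at all.

```python
from html.parser import HTMLParser


class HeaderCollector(HTMLParser):
    """Collect the text inside every <th> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_th = False
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.in_th = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self.in_th = False

    def handle_data(self, data):
        # Only keep text that sits inside a header cell.
        if self.in_th:
            self.headers[-1] += data


def table_headers(html_text):
    """Return the stripped text of each <th> cell in html_text."""
    parser = HeaderCollector()
    parser.feed(html_text)
    parser.close()
    return [h.strip() for h in parser.headers]


# Made-up markup for illustration; not copied from thelott.com.
sample = (
    "<table><tr><th>Division</th><th>Division Prize Pool</th>"
    "<th>Winners</th></tr></table>"
)
wanted = {"Division", "Division Prize Pool", "Winners"}
found = table_headers(sample)
print(found, wanted.issubset(found))
```

Calling `table_headers(html)` on the page downloaded earlier would show whether the three headings are in the static HTML or only appear after scripts run.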

I found references to the `thelott.com` API at https://github.com/ShawInnes/coding-challenge and elsewhere, so I pivoted to trying POST requests against this API to find out whether it is still available.

The page on [Geeks for Geeks](https://www.geeksforgeeks.org/get-post-requests-using-python/) covers making GET and POST requests using Python, so we will try that. A specific example of the API is on [StackOverflow](https://stackoverflow.com/questions/56659895/json-post-import), but that example does not use Python.

In [1]:
import requests

First try with a Google API.

In [2]:
# api endpoint
URL = "http://maps.googleapis.com/maps/api/geocode/json"

In [3]:
# location given here
location = "delhi technological university"
  
# defining a params dict for the parameters to be sent to the API
PARAMS = {'address':location}

Send the request to the API. This is a GET request.

In [4]:
# sending get request and saving the response as response object
r = requests.get(url = URL, params = PARAMS)

In [5]:
r

<Response [200]>

In [6]:
# extracting data in json format
data = r.json()

In [7]:
data

{'error_message': 'You must use an API key to authenticate each request to Google Maps Platform APIs. For additional information, please refer to http://g.co/dev/maps-no-account',
 'results': [],
 'status': 'REQUEST_DENIED'}

So the request reached the API (hence the 200 response), but it was denied because I did not supply the API key needed for authentication.
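Because the HTTP status was 200 even though the request was denied, it may be worth checking the JSON body as well as the status code. A minimal sketch, assuming the error is signalled by an `error_message` key as in the output above; the `api_error` helper is hypothetical and not part of `requests` or of any Google library.

```python
def api_error(payload):
    """Return a short error string from a JSON body, or None if no error is reported."""
    if "error_message" in payload:
        return f"{payload.get('status', 'UNKNOWN')}: {payload['error_message']}"
    return None


# The body returned by the geocoding request above.
data = {
    "error_message": (
        "You must use an API key to authenticate each request to Google Maps "
        "Platform APIs. For additional information, please refer to "
        "http://g.co/dev/maps-no-account"
    ),
    "results": [],
    "status": "REQUEST_DENIED",
}

print(api_error(data))
```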

Now let's try a POST request.

In [12]:
# importing the requests library
import requests
  
# defining the api-endpoint 
API_ENDPOINT = "https://pastebin.com/api/api_post.php"
  
# your API key here
API_KEY = "1MVlLdnUkXgKn-qaBqV9lBACXRtCgLWn"
  
# your source code here
source_code = '''
print("Hello, world!")
a = 1
b = 2
print(a + b)
'''
  
# data to be sent to api
data = {'api_dev_key': API_KEY,
        'api_option': 'paste',
        'api_paste_code': source_code,
        'api_paste_format': 'python'}
  

In [13]:
# sending post request and saving response as response object
r = requests.post(url = API_ENDPOINT, data = data)
  
# extracting response text 
pastebin_url = r.text
print("The pastebin URL is:%s"%pastebin_url)

The pastebin URL is:https://pastebin.com/KNQ5aPju


Now let us try doing the same thing with The Lott. The StackOverflow example is written in JavaScript (Google Apps Script) rather than Python:

```javascript
function main() {
  // Make a POST request with a JSON payload.
  var data = {
    'CompanyId': 'GoldenCasket',
    'MaxDrawCount': 1,
    'OptionalProductFilter': ['OzLotto']
  };

  var options = {
    'method': 'post',
    'contentType': 'application/json',
    // Convert the JavaScript object to a JSON string.
    'payload': JSON.stringify(data)
  };

  var response = UrlFetchApp.fetch('https://data.api.thelott.com/sales/vmax/web/data/lotto/opendraws', options);

  Logger.log('output: ' + response);

  var json = JSON.parse(response);
  var DIV1 = json["Div1Amount"];

  Logger.log(DIV1);
}
```

In [29]:
thelott_api = "https://data.api.thelott.com/sales/vmax/web/data/lotto/results/search/daterange/"

thelott_data = {
    "DateStart": "2020-12-31T13:00:00Z",
    "DateEnd": "2021-01-31T12:59:59Z",
    "ProductFilter": ["TattsLotto"],
    "CompanyFilter": ["Tattersalls"]
}

In [40]:
r = requests.post(url = thelott_api, json = thelott_data)

In [41]:
r

<Response [200]>

In [42]:
r.text

'{"Draws":[{"ProductId":"TattsLotto","DrawNumber":4125,"DrawDate":"2021-01-30T00:00:00","PrimaryNumbers":[20,3,31,30,15,6],"SecondaryNumbers":[2,26],"TicketNumbers":null,"Dividends":[{"Division":1,"BlocNumberOfWinners":4,"BlocDividend":1473168.7600,"CompanyId":"Tattersalls","CompanyNumberOfWinners":3,"CompanyDividend":1473168.7600,"PoolTransferType":"NONE","PoolTransferredTo":0},{"Division":2,"BlocNumberOfWinners":79,"BlocDividend":8429.4500,"CompanyId":"Tattersalls","CompanyNumberOfWinners":31,"CompanyDividend":8429.4500,"PoolTransferType":"NONE","PoolTransferredTo":0},{"Division":3,"BlocNumberOfWinners":1442,"BlocDividend":698.9500,"CompanyId":"Tattersalls","CompanyNumberOfWinners":547,"CompanyDividend":698.9500,"PoolTransferType":"NONE","PoolTransferredTo":0},{"Division":4,"BlocNumberOfWinners":66790,"BlocDividend":22.3500,"CompanyId":"Tattersalls","CompanyNumberOfWinners":25342,"CompanyDividend":22.3500,"PoolTransferType":"NONE","PoolTransferredTo":0},{"Division":5,"BlocNumberOfWin
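To get this into a shape suitable for analysis, the nested JSON can be flattened into one row per draw and division. The sketch below works on a trimmed copy of the response above (first two divisions only, with some fields dropped), so it runs without a network call. The `flatten_draws` helper and the output column names are my own choices, and whether `BlocDividend` corresponds to the "Division Prize Pool" column on the web page is not confirmed here.

```python
import json

# Trimmed copy of the response text above: first draw, first two divisions.
sample_text = """
{"Draws": [{"ProductId": "TattsLotto", "DrawNumber": 4125,
            "DrawDate": "2021-01-30T00:00:00",
            "Dividends": [
                {"Division": 1, "BlocNumberOfWinners": 4, "BlocDividend": 1473168.7600},
                {"Division": 2, "BlocNumberOfWinners": 79, "BlocDividend": 8429.4500}
            ]}]}
"""


def flatten_draws(payload):
    """Return one dict per (draw, division) pair."""
    rows = []
    for draw in payload["Draws"]:
        for div in draw["Dividends"]:
            rows.append({
                "DrawNumber": draw["DrawNumber"],
                "DrawDate": draw["DrawDate"][:10],  # keep the date part only
                "Division": div["Division"],
                "Winners": div["BlocNumberOfWinners"],
                "Dividend": div["BlocDividend"],
            })
    return rows


rows = flatten_draws(json.loads(sample_text))
for row in rows:
    print(row)
```

With a live response, `flatten_draws(r.json())` would do the same thing, and the resulting list of dicts can be written to CSV or loaded into a data frame.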

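Note the cell numbers: the next two outputs come from an earlier run (cells 33 and 39, before cell 40), when the request was apparently sent with `data=` rather than `json=`. The body is form-encoded and the response is 247 bytes of XML, not the JSON shown above.

In [33]:
r.headers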

{'Pragma': 'no-cache', 'Content-Type': 'application/xml; charset=utf-8', 'Access-Control-Allow-Origin': '*', 'Content-Length': '247', 'Strict-Transport-Security': 'max-age=16070400; includeSubDomains', 'Cache-Control': 'no-cache, no-store', 'Expires': 'Sat, 27 Mar 2021 10:12:38 GMT', 'Date': 'Sat, 27 Mar 2021 10:12:38 GMT', 'Connection': 'close'}

In [39]:
r.request.body

'DateStart=2020-12-31T13%3A00%3A00Z&DateEnd=2021-01-31T12%3A59%3A59Z&ProductFilter=TattsLotto&CompanyFilter=Tattersalls'
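The request body shown here is form-encoded, which is what `requests` sends when a dict is passed as `data=`; passing the same dict as `json=` sends a JSON document instead. The two encodings can be compared without sending anything. This is a sketch using only the standard library: it imitates the encodings rather than calling `requests` itself.

```python
import json
from urllib.parse import urlencode

thelott_data = {
    "DateStart": "2020-12-31T13:00:00Z",
    "DateEnd": "2021-01-31T12:59:59Z",
    "ProductFilter": ["TattsLotto"],
    "CompanyFilter": ["Tattersalls"],
}

# Form encoding (application/x-www-form-urlencoded); doseq=True expands lists
# into repeated key=value pairs.
form_body = urlencode(thelott_data, doseq=True)

# JSON encoding (application/json).
json_body = json.dumps(thelott_data)

print(form_body)
print(json_body)
```

The first line printed matches the `r.request.body` output above, which suggests the API expects the second form.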

Now try a GET request against a different endpoint.

In [22]:
thelott_api_get = "https://api.tatts.com/svc/sales/vmax/web/data/lotto/companies"


In [23]:
r = requests.get(url=thelott_api_get)

In [24]:
r

<Response [405]>
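A 405 status means the server does not allow this HTTP method on this resource, so the endpoint presumably expects something other than GET (possibly POST, like the endpoint above, though that is untested here). The standard library can translate a status code into its standard phrase. The `explain` helper below is hypothetical.

```python
from http import HTTPStatus


def explain(code):
    """Return the status code with its standard reason phrase and description."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase}: {status.description}"


print(explain(405))
print(explain(200))
```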

In [25]:
r.text

'\ufeff<?xml version="1.0" encoding="utf-8"?>\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n  <head>\r\n    <title>Service</title>\r\n    <style>BODY { color: #000000; background-color: white; font-family: Verdana; margin-left: 0px; margin-top: 0px; } #content { margin-left: 30px; font-size: .70em; padding-bottom: 2em; } A:link { color: #336699; font-weight: bold; text-decoration: underline; } A:visited { color: #6699cc; font-weight: bold; text-decoration: underline; } A:active { color: #336699; font-weight: bold; text-decoration: underline; } .heading1 { background-color: #003366; border-bottom: #336699 6px solid; color: #ffffff; font-family: Tahoma; font-size: 26px; font-weight: normal;margin: 0em 0em 10px -20px; padding-bottom: 8px; padding-left: 30px;padding-top: 16px;} pre { font-size:small; background-color: #e5e5cc; padding: 5px; font-family: Courier N