# Basic Scraping

The website we want to scrape is https://covid19dev.jatimprov.go.id/xweb/grafikpublik/

It's a normal web page, not js generated or any other fancy stuff. Likely built with php, but who cares.

It accepts a POST request with the following parameters:
1. kabko, a select element with a string value. It's the city or regency (kabupaten/kota) whose data will be shown.
2. tanggal, a select element with a date string value in yyyy-mm-dd format. It's the start date of the data shown.
3. sampai, just like tanggal, is a select element with a date string value in yyyy-mm-dd format. It's the end date of the data shown.

They're all at the top of the page, in a form. The submit button has an onclick attached but the function is commented out, so we'll ignore that. 

The main values we want to get are the ones in the cards with various colors; positif, pdp, odp, etc, and their groups. We may want to get the age groups, gender groups, and other details, but they will increase the model complexity so they will have to wait. Aside from that, they're shown as graphs and their data are in the page as plain text javascript variables. That requires extra work and won't be used anytime soon so let's just save it for later.

The parameter tanggal is more or less ignored for our main values. It's only used for some graphs. We'll just use the same value as sampai.
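Since tanggal barely matters for the main values, the payload can be built from just a kabko and a sampai date. Here's a minimal sketch of that idea. The names `build_payload` and `is_iso_date` are hypothetical helpers, not something from the site or from requests-html, and the date check only assumes the yyyy-mm-dd format described above.

```python
from datetime import datetime

def is_iso_date(text):
    """Return True if text is a valid, zero-padded yyyy-mm-dd date string."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    # round-trip so that loosely padded input like '2020-3-5' is rejected
    return parsed.strftime("%Y-%m-%d") == text

def build_payload(kabko, sampai, tanggal=None):
    """Build the form data; tanggal falls back to sampai when not given."""
    tanggal = sampai if tanggal is None else tanggal
    for name, value in (("tanggal", tanggal), ("sampai", sampai)):
        if not is_iso_date(value):
            raise ValueError(f"{name} must be yyyy-mm-dd, got {value!r}")
    return {"kabko": kabko, "tanggal": tanggal, "sampai": sampai}

payload = build_payload("KAB. BANGKALAN", "2020-03-23")
print(payload)
```

Failing early on a malformed date is a design choice here; it's easier to debug than a request that quietly returns default values.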

We'll scrape with requests-html and pull out the values we need with good ol' jQuery-style CSS selectors. To do that, we'll need to understand the page structure.

There are a few things we'll do:
1. Send a GET request first to get our parameter values.
2. Get all the available kabko.
3. Get all the available tanggal/sampai.
4. Send a POST request with specific parameters.
5. Get our main values.

Init. I'd like to use HTTP instead of HTTPS to avoid the TLS handshake, but the server just keeps redirecting me to HTTPS.

In [23]:
from requests_html import HTMLSession

endpoint = "https://covid19dev.jatimprov.go.id/xweb/grafikpublik/"
session = HTMLSession()
session

<requests_html.HTMLSession at 0x7e457c8>

Get request

In [24]:
r = session.get(endpoint)
r

<Response [200]>

## Getting Form Parameters: Select

Get kabko values

In [25]:
#find the select element
select_kabko = r.html.find('#kabko')[0]
select_kabko

<Element 'select' class=('form-control',) name='kabko' id='kabko'>

In [26]:
#get the children
kabko_els = select_kabko.find("option")
kabko_els[:5]

[<Element 'option' value=''>,
 <Element 'option' value='- (STATUS PENDING)'>,
 <Element 'option' value='AWAK BUAH KAPAL'>,
 <Element 'option' value='KAB. BANGKALAN'>,
 <Element 'option' value='KAB. BANYUWANGI'>]

In [27]:
#The value and the text are the same except for the first item.
#You can just take the values and special-case the first, or make a dictionary for each item
#Python objects are basically dictionaries anyway
kabko_vals = [k.attrs["value"] for k in kabko_els]
kabko_vals[:5]

['',
 '- (STATUS PENDING)',
 'AWAK BUAH KAPAL',
 'KAB. BANGKALAN',
 'KAB. BANYUWANGI']

In [28]:
kabko_dicts = [{'value':k.attrs["value"], 'text':k.text} for k in kabko_els]
kabko_dicts[:5]

[{'value': '', 'text': 'JAWA TIMUR'},
 {'value': '- (STATUS PENDING)', 'text': '- (STATUS PENDING)'},
 {'value': 'AWAK BUAH KAPAL', 'text': 'AWAK BUAH KAPAL'},
 {'value': 'KAB. BANGKALAN', 'text': 'KAB. BANGKALAN'},
 {'value': 'KAB. BANYUWANGI', 'text': 'KAB. BANYUWANGI'}]

Tanggal and sampai are pretty much the same as kabko, so let's just make this into a function.

In [29]:
def get_select_opts(r, query):
    #find the select element
    select = r.html.find(query)[0]
    els = select.find("option")
    return els

get_select_opts(r, "#tanggal")[:5]

[<Element 'option' value=''>,
 <Element 'option' value='2020-03-20'>,
 <Element 'option' value='2020-03-21'>,
 <Element 'option' value='2020-03-22'>,
 <Element 'option' value='2020-03-23'>]

In [48]:
def get_select_vals(r, query):
    els = get_select_opts(r, query)
    vals = [k.attrs["value"] for k in els] 
    #we want to skip the empty date since it's useless, but not the empty kabko because it's the aggregate of all other kabko
    #the empty value is always the first item in the list
    if 'kabko' not in query:
        vals.pop(0)
    return vals

tanggal_vals = get_select_vals(r, "#tanggal")
sampai_vals = get_select_vals(r, "#sampai")
sampai_vals[:5]

['2020-03-20', '2020-03-21', '2020-03-22', '2020-03-23', '2020-03-24']
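The `'kabko' not in query` check ties the behavior to the selector string, and `pop(0)` assumes the empty value always comes first. A possible alternative is to pass the choice explicitly and filter by value instead of by position. This is only a sketch; `drop_empty_values` and its `keep_empty` flag are hypothetical names, and the sample lists are copied from the outputs above.

```python
def drop_empty_values(values, keep_empty=False):
    """Return option values, optionally keeping the empty-string placeholder."""
    if keep_empty:
        return list(values)
    # filter by value rather than position, so order no longer matters
    return [v for v in values if v != ""]

# sample values copied from the outputs above
kabko_sample = ["", "- (STATUS PENDING)", "AWAK BUAH KAPAL"]
tanggal_sample = ["", "2020-03-20", "2020-03-21"]

print(drop_empty_values(kabko_sample, keep_empty=True))
print(drop_empty_values(tanggal_sample))
```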

In [47]:
def get_select_dicts(r, query):
    els = get_select_opts(r, query)
    dicts = [{k.attrs["value"]:k.text} for k in els]
    #we want to skip the empty date since it's useless, but not the empty kabko because it's the aggregate of all other kabko
    #the empty value is always the first item in the list
    if 'kabko' not in query:
        dicts.pop(0)
    return dicts

tanggal_dicts = get_select_dicts(r, "#tanggal")
sampai_dicts = get_select_dicts(r, "#sampai")
sampai_dicts[:5]

[{'2020-03-20': '2020-03-20'},
 {'2020-03-21': '2020-03-21'},
 {'2020-03-22': '2020-03-22'},
 {'2020-03-23': '2020-03-23'},
 {'2020-03-24': '2020-03-24'}]

## Sending Post Request

Now that we have our parameters, let's send a POST request. By default, the page uses the first kabko, the first tanggal, and the last sampai. Let's use the fourth of each instead.

In [75]:
#list indices start at 0, so the fourth item is at index 3
payload = {
    'kabko': kabko_vals[3],
    'tanggal': tanggal_vals[3],
    'sampai': sampai_vals[3]
}
payload

{'kabko': 'KAB. BANGKALAN', 'tanggal': '2020-03-23', 'sampai': '2020-03-23'}

In [89]:
r = session.post(endpoint, data=payload)
r

<Response [200]>

## Getting the Values

Getting the values is a bit tricky and prone to breaking when the page is updated. The cards don't have a specific class or id, and they're not even using rows properly. We'll have to make do with picking the n-th children for each value and hope they stay that way. requests-html doesn't seem to fully implement jQuery selectors, because :eq and :nth-child act weird, so we'll just select them the good old way.

In [77]:
row = r.html.find(".container .row")[0]
row.find("#tanggal")
#check if the row has #tanggal, because that's the correct row according to inspect element

[<Element 'select' class=('form-control',) name='tanggal' id='tanggal'>]

The cards are split into two regions, left and right. They're the first and second div with the col-md-6 class.

In [78]:
card_groups = row.find("div.col-md-6")[:2]
card_groups

[<Element 'div' class=('col-md-6',) style='text-align:center;'>,
 <Element 'div' class=('col-md-6',) style='text-align:center;'>]

Let's extract the cards. They're found with the query div.card. The + operator concatenates the two lists.

In [79]:
cards = card_groups[0].find("div.card") + card_groups[1].find("div.card")
cards

[<Element 'div' class=('card', 'text-white', 'bg-danger')>,
 <Element 'div' class=('card', 'text-white', 'bg-success')>,
 <Element 'div' class=('card', 'text-white', 'bg-secondary')>,
 <Element 'div' class=('card', 'text-white', 'bg-secondary')>]

Now that we have the cards, let's extract the values.

In [80]:
#name the cards and extract further
#they're all in card-block except for the grey cards
positif_card = cards[0].find(".card-block")[0]
odp_card = cards[1].find(".card-block")[0]
pdp_card = cards[2].find(".card-block")[0]
otg_card = cards[3].find(".card-body")[0]
odr_card = cards[4].find(".card-body")[0]

pdp_card

<Element 'div' class=('card-block',)>

In [81]:
otg_card

<Element 'div' class=('card-body',)>

In [82]:
#the values are all in h3
#there should be 7 values
positif_els = positif_card.find("h3")
len(positif_els)

7

In [83]:
#now let's wrap it up for positif
#define fields, parse values, then just zip em
positif_fields = ['total', 'dirawat', 'sembuh', 'meninggal', 'rumah', 'gedung', 'rs']
positif_vals = [int(e.text) for e in positif_els]
positif = dict(zip(positif_fields, positif_vals))
positif

{'total': 10886,
 'dirawat': 6452,
 'sembuh': 3619,
 'meninggal': 815,
 'rumah': 2560,
 'gedung': 948,
 'rs': 2944}
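Because the values are picked by position, a cheap sanity check helps confirm they landed in the right fields. In the output above, total equals dirawat + sembuh + meninggal, and dirawat equals rumah + gedung + rs. The sketch below checks exactly that; `total_matches` is a hypothetical helper, and the relationship is only an observation from this one sample, so it may not hold for every kabko or date.

```python
def total_matches(record, part_fields, total_field="total"):
    """Return True if the total field equals the sum of the given part fields."""
    return record[total_field] == sum(record[f] for f in part_fields)

# values copied from the positif output above
positif_sample = {
    'total': 10886, 'dirawat': 6452, 'sembuh': 3619, 'meninggal': 815,
    'rumah': 2560, 'gedung': 948, 'rs': 2944,
}

print(total_matches(positif_sample, ['dirawat', 'sembuh', 'meninggal']))
print(total_matches(positif_sample, ['rumah', 'gedung', 'rs'], total_field='dirawat'))
```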

Now let's make this a function instead

In [84]:
def parse_card(card, fields):
    els = card.find("h3")
    vals = [int(e.text) for e in els]
    parsed = dict(zip(fields, vals))
    return parsed

odp_fields = ['total', 'belum_dipantau', 'dipantau', 'selesai_dipantau', 'meninggal', 'rumah', 'gedung', 'rs']
odp = parse_card(odp_card, odp_fields)
odp

{'total': 29166,
 'belum_dipantau': 0,
 'dipantau': 5116,
 'selesai_dipantau': 23885,
 'meninggal': 165,
 'rumah': 4863,
 'gedung': 1,
 'rs': 252}
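One thing to watch: `zip` silently stops at the shorter input, so if the page gains or loses an h3, the dict still gets built, just with missing or shifted values. A stricter variant could refuse a length mismatch. This is a sketch only; `zip_fields` is a hypothetical helper that works on plain strings rather than requests-html elements, and the sample texts are copied from the odp output above.

```python
def zip_fields(fields, texts):
    """Pair field names with parsed ints, refusing a length mismatch."""
    if len(fields) != len(texts):
        raise ValueError(f"expected {len(fields)} values, got {len(texts)}")
    return dict(zip(fields, (int(t) for t in texts)))

odp_fields_sample = ['total', 'belum_dipantau', 'dipantau', 'selesai_dipantau',
                     'meninggal', 'rumah', 'gedung', 'rs']
# h3 texts as they would come out of the odp card above
odp_texts_sample = ['29166', '0', '5116', '23885', '165', '4863', '1', '252']

print(zip_fields(odp_fields_sample, odp_texts_sample))
```

Inside `parse_card`, the same check would go right before the `zip` call.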

In [85]:
pdp_fields = ['total', 'belum_diawasi', 'dirawat', 'sehat', 'meninggal', 'rumah', 'gedung', 'rs']
pdp = parse_card(pdp_card, pdp_fields)
pdp

{'total': 10137,
 'belum_diawasi': 0,
 'dirawat': 4374,
 'sehat': 4554,
 'meninggal': 1209,
 'rumah': 1723,
 'gedung': 3,
 'rs': 2648}

The rest are just single numbers, so no dicts needed.

In [86]:
otg = int(otg_card.find("h3")[0].text)
otg

161636

In [87]:
odr = int(odr_card.find("h3")[0].text)
odr

616435

## Redirection Issue

We got values, alright. However, they're incorrect: what we got are the latest values for the whole of Jawa Timur. It's as if the POST data weren't sent or were ignored.

In [92]:
req = r.request
req

<PreparedRequest [GET]>

In [93]:
req.headers

{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'rand_session=vd7jaemk9hfif70fv7m0ik2i0m3hlkda'}

In [94]:
req.body

In [95]:
req.method

'GET'

In [96]:
req.url

'https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/'

In [97]:
r.history

[<Response [301]>]

In [98]:
r.history[0]

<Response [301]>

In [100]:
r.history[0].text

'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href="https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/">here</a>.</p>\n</body></html>\n'

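So it turns out the URL we should be using is https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/. The original URL answers with a 301 redirect, and requests follows it with a GET, so our POST data got dropped along the way.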
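This kind of silent downgrade is easy to miss because the response still comes back as 200. A small guard can flag it. The sketch below is hypothetical: `post_was_redirected` is not part of requests or requests-html, it only reads the `history` and `request.method` attributes inspected above, and the two stand-in objects are fakes that mimic the responses seen in this notebook.

```python
from types import SimpleNamespace

def post_was_redirected(response):
    """Return True if the response was redirected or did not end as a POST."""
    return bool(response.history) or response.request.method != "POST"

# stand-ins mimicking a redirected response and a direct one
redirected = SimpleNamespace(history=["<Response [301]>"],
                             request=SimpleNamespace(method="GET"))
direct = SimpleNamespace(history=[],
                         request=SimpleNamespace(method="POST"))

print(post_was_redirected(redirected))
print(post_was_redirected(direct))
```

With a real response, the call would simply be `post_was_redirected(r)` right after `session.post(...)`.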

In [101]:
endpoint2 = "https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/"
r = session.post(endpoint2, data=payload)
r

<Response [200]>

In [102]:
r.history

[]

Good. No redirect. Is it working though?

In [104]:
#get cards
row = r.html.find(".container .row")[0]
card_groups = row.find("div.col-md-6")[:2]
cards = card_groups[0].find("div.card") + card_groups[1].find("div.card")

#name the cards
positif_card = cards[0].find(".card-block")[0]
odp_card = cards[1].find(".card-block")[0]
pdp_card = cards[2].find(".card-block")[0]
otg_card = cards[3].find(".card-body")[0]
odr_card = cards[4].find(".card-body")[0]

#get and pack the values
#each card is parsed with its own field list
vals = {
    'positif':parse_card(positif_card, positif_fields),
    'odp':parse_card(odp_card, odp_fields),
    'pdp':parse_card(pdp_card, pdp_fields)
}
#show the odp values as a spot check
vals['odp']

{'total': 16,
 'belum_dipantau': 0,
 'dipantau': 1,
 'selesai_dipantau': 15,
 'meninggal': 0,
 'rumah': 0,
 'gedung': 0,
 'rs': 0}

Nice. The low counts indicate this is data from the early dates, so the POST parameters are being applied.

This concludes the basic scraping.