# Putting Things Together and Multiprocessing

Now, let's take it to the next level. We'll need to scrape each day for each kabko.

First, let's wrap these in functions. We already have:
1. get_select_vals/get_select_dicts
2. parse_card

Let's also put them here, along with the previous imports.

In [4]:
from requests_html import HTMLSession

endpoint = "https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/"
session = HTMLSession()

def get_select_opts(r, query):
    #find the select element
    select = r.html.find(query)[0]
    els = select.find("option")
    return els

def get_select_vals(r, query):
    els = get_select_opts(r, query)
    vals = [k.attrs["value"] for k in els] 
    #we want to skip empty dates since they're useless, but not the empty kabko because it's the aggregate of all other kabko
    #it's always the first item from a list
    if 'kabko' not in query:
        vals.pop(0)
    return vals

def get_select_dicts(r, query):
    els = get_select_opts(r, query)
    dicts = [{k.attrs["value"]:k.text} for k in els]
    #we want to skip empty dates since they're useless, but not the empty kabko because it's the aggregate of all other kabko
    #it's always the first item from a list
    if 'kabko' not in query:
        dicts.pop(0)
    return dicts

def parse_card(card, fields):
    els = card.find("h3")
    vals = [int(e.text) for e in els]
    parsed = dict(zip(fields, vals))
    return parsed


In [5]:
#editing parse_card a bit so that empty text is parsed as 0 instead of raising an error
def parse_int(text):
    if not text:
        return 0
    val = int(text)
    return val

def parse_card(card, fields):
    els = card.find("h3")
    vals = [parse_int(e.text) for e in els]
    parsed = dict(zip(fields, vals))
    return parsed
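
To see what the empty-text guard buys us without hitting the site, here is a small sketch that runs the same parsing logic against stand-in objects. `FakeElement` and `FakeCard` are hypothetical stubs invented for this example; they only mimic the `.text` attribute and `.find()` method used above and are not part of requests_html.

```python
# Hypothetical stand-ins for requests_html elements (assumption, not library classes)
class FakeElement:
    def __init__(self, text):
        self.text = text

class FakeCard:
    def __init__(self, texts):
        self._els = [FakeElement(t) for t in texts]

    def find(self, query):
        # a real element would filter by selector; the stub returns every element
        return self._els

def parse_int(text):
    # an empty card value is treated as 0 instead of raising ValueError
    if not text:
        return 0
    return int(text)

def parse_card(card, fields):
    els = card.find("h3")
    vals = [parse_int(e.text) for e in els]
    return dict(zip(fields, vals))

fields = ['total', 'dirawat', 'sembuh']
parsed = parse_card(FakeCard(["15", "", "3"]), fields)
print(parsed)
```

Note that `zip` silently stops at the shorter sequence, so a card with fewer `h3` elements than fields yields a dict with missing keys rather than an error.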

However, what we need are functions to:
1. get all kabkos and dates
2. get new dates (get dates past a given date from date list)
3. get data of a day & kabko
4. get data of every day & kabko
5. get data for new dates (after a given date)

We'll do them one by one.

## Getting Form Parameters

First, let's write a function to get all parameter values.

In [6]:
def get_all_param_vals(session, endpoint, dicts=False, params=['kabko', 'tanggal', 'sampai']):
    r = session.get(endpoint)
    f = get_select_vals
    if dicts:
        f = get_select_dicts
    vals = dict([(p, f(r, "#"+p)) for p in params])
    return vals

vals = get_all_param_vals(session, endpoint)
vals

{'kabko': ['',
  '- (STATUS PENDING)',
  'AWAK BUAH KAPAL',
  'KAB. BANGKALAN',
  'KAB. BANYUWANGI',
  'KAB. BLITAR',
  'KAB. BOJONEGORO',
  'KAB. BONDOWOSO',
  'KAB. GRESIK',
  'KAB. JEMBER',
  'KAB. JOMBANG',
  'KAB. KEDIRI',
  'KAB. LAMONGAN',
  'KAB. LUMAJANG',
  'KAB. MADIUN',
  'KAB. MAGETAN',
  'KAB. MALANG',
  'KAB. MOJOKERTO',
  'KAB. NGANJUK',
  'KAB. NGAWI',
  'KAB. PACITAN',
  'KAB. PAMEKASAN',
  'KAB. PASURUAN',
  'KAB. PONOROGO',
  'KAB. PROBOLINGGO',
  'KAB. SAMPANG',
  'KAB. SIDOARJO',
  'KAB. SITUBONDO',
  'KAB. SUMENEP',
  'KAB. TRENGGALEK',
  'KAB. TUBAN',
  'KAB. TULUNGAGUNG',
  'KOTA BATU',
  'KOTA BLITAR',
  'KOTA KEDIRI',
  'KOTA MADIUN',
  'KOTA MALANG',
  'KOTA MOJOKERTO',
  'KOTA PASURUAN',
  'KOTA PROBOLINGGO',
  'KOTA SURABAYA',
  'RS LAPANGAN INDRAPURA'],
 'tanggal': ['2020-03-20',
  '2020-03-21',
  '2020-03-22',
  '2020-03-23',
  '2020-03-24',
  '2020-03-25',
  '2020-03-26',
  '2020-03-27',
  '2020-03-28',
  '2020-03-29',
  '2020-03-30',
  '2020-03-31',
  
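
The "skip the first option for dates, keep it for kabko" rule can be tried offline. The sketch below swaps requests_html for the standard library's `html.parser` so that it runs without a network connection. `OptionCollector`, `select_vals` and the sample HTML are all made up for illustration and are only assumed to resemble the real form.

```python
from html.parser import HTMLParser

class OptionCollector(HTMLParser):
    """Collects the value of every <option> inside the <select> with a given id."""
    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == self.select_id:
            self.inside = True
        elif tag == "option" and self.inside:
            self.values.append(attrs.get("value", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self.inside = False

def select_vals(html, select_id):
    parser = OptionCollector(select_id)
    parser.feed(html)
    vals = parser.values
    # drop the empty placeholder for dates, keep it for kabko (the aggregate)
    if 'kabko' not in select_id and vals:
        vals = vals[1:]
    return vals

# made-up markup, assumed to resemble the real form
sample_html = """
<select id="kabko"><option value="">ALL</option><option value="KAB. GRESIK">KAB. GRESIK</option></select>
<select id="tanggal"><option value="">-</option><option value="2020-03-20">2020-03-20</option></select>
"""
kabko_vals = select_vals(sample_html, "kabko")
tanggal_vals = select_vals(sample_html, "tanggal")
print(kabko_vals, tanggal_vals)
```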

## Filter New Dates

Next, a function to select the dates that come after a given date in the date list.

We first need a function to parse a date string into a date object. We'll use the datetime class from the datetime module.

In [7]:
from datetime import datetime

def parse_date(date_str):
    return datetime.strptime(date_str, "%Y-%m-%d").date()

parse_date('2020-03-20')

datetime.date(2020, 3, 20)

And vice versa

In [8]:
def format_date(date):
    #call strftime on the date object itself
    return date.strftime("%Y-%m-%d")

format_date(parse_date('2020-03-20'))

'2020-03-20'

Now we filter dates. Let's say we want dates after 1 April.

In [9]:
def get_new_dates(dates, after):
    after = parse_date(after)
    new_dates = [d for d in dates if d and parse_date(d) > after]
    return new_dates

get_new_dates(vals['tanggal'], '2020-04-01')[:5]

['2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06']
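
The filter can also be checked on a tiny hand-made list, including an empty entry like the placeholder option. This is a self-contained sketch. The `sample_dates` list is invented for the example. It also shows that, because the dates are zero-padded ISO strings, a plain string comparison selects the same items.

```python
from datetime import datetime

def parse_date(date_str):
    return datetime.strptime(date_str, "%Y-%m-%d").date()

def get_new_dates(dates, after):
    after = parse_date(after)
    # "d and ..." skips empty strings before trying to parse them
    return [d for d in dates if d and parse_date(d) > after]

# invented sample, with an empty placeholder entry
sample_dates = ['', '2020-03-31', '2020-04-01', '2020-04-02', '2020-04-03']

new_dates = get_new_dates(sample_dates, '2020-04-01')

# zero-padded YYYY-MM-DD strings sort the same way as the dates they represent
new_dates_str = [d for d in sample_dates if d and d > '2020-04-01']

print(new_dates, new_dates_str)
```

Parsing is still the safer choice if the site ever changes its date format, since `strptime` will fail loudly instead of comparing garbage.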

## Single Request Scraping

Next, we make a function to get the data of one kabko on one day. But first, we define the fields.

In [10]:
#define the fields
positif_fields = ['total', 'dirawat', 'sembuh', 'meninggal', 'rumah', 'gedung', 'rs']
odp_fields = ['total', 'belum_dipantau', 'dipantau', 'selesai_dipantau', 'meninggal', 'rumah', 'gedung', 'rs']
pdp_fields = ['total', 'belum_diawasi', 'dirawat', 'sehat', 'meninggal', 'rumah', 'gedung', 'rs']

OTG and ODR are just single values, so they don't need to be dicts and can't use parse_card. Let's make a function for them.

In [11]:
def parse_card_single(card):
    val = parse_int(card.find("h3")[0].text)
    return val

In [12]:
def get_covid_data(session, endpoint, kabko, tanggal, sampai=None):
    #prepare the request
    if not sampai:
        sampai = tanggal
    data = {
        'kabko': kabko,
        'tanggal': tanggal,
        'sampai': sampai
    }
    
    #send request
    r = session.post(endpoint, data=data)
    
    #get cards
    row = r.html.find(".container .row")[0]
    card_groups = row.find("div.col-md-6")[:2]
    cards = card_groups[0].find("div.card") + card_groups[1].find("div.card")
    
    #name the cards
    positif_card = cards[0].find(".card-block")[0]
    odp_card = cards[1].find(".card-block")[0]
    pdp_card = cards[2].find(".card-block")[0]
    otg_card = cards[3].find(".card-body")[0]
    odr_card = cards[4].find(".card-body")[0]
    
    #get and pack the values
    vals = {
        'kabko':kabko,
        'tanggal':tanggal,
        'sampai':sampai,
        'positif':parse_card(positif_card, positif_fields),
        'odp':parse_card(odp_card, odp_fields),
        'pdp':parse_card(pdp_card, pdp_fields),
        'otg':parse_card_single(otg_card),
        'odr':parse_card_single(odr_card)
    }
    
    return vals

get_covid_data(session, endpoint, "", "2020-03-20")

{'kabko': '',
 'tanggal': '2020-03-20',
 'sampai': '2020-03-20',
 'positif': {'total': 15,
  'dirawat': 0,
  'sembuh': 0,
  'meninggal': 1,
  'rumah': 0,
  'gedung': 0,
  'rs': 0},
 'odp': {'total': 542,
  'belum_dipantau': 0,
  'dipantau': 0,
  'selesai_dipantau': 0,
  'meninggal': 0,
  'rumah': 0,
  'gedung': 0,
  'rs': 0},
 'pdp': {'total': 53,
  'belum_diawasi': 0,
  'dirawat': 0,
  'sehat': 0,
  'meninggal': 0,
  'rumah': 0,
  'gedung': 0,
  'rs': 0},
 'otg': 0,
 'odr': 11025}

Now that we can get one, let's do it in bulk.

## Bulk Scraping

In [10]:
def get_covid_data_bulk(session, endpoint, kabko, tanggal):
    data = {k:{t:get_covid_data(session, endpoint, k, t) for t in tanggal} for k in kabko}
    return data

params = get_all_param_vals(session, endpoint)
kabko = params['kabko'][:2]
tanggal = params['tanggal'][:2]
get_covid_data_bulk(session, endpoint, kabko, tanggal)

{'': {'2020-03-20': {'kabko': '',
   'tanggal': '2020-03-20',
   'sampai': '2020-03-20',
   'positif': {'total': 15,
    'dirawat': 0,
    'sembuh': 0,
    'meninggal': 1,
    'rumah': 0,
    'gedung': 0,
    'rs': 0},
   'odp': {'total': 542,
    'belum_dipantau': 0,
    'dipantau': 0,
    'selesai_dipantau': 0,
    'meninggal': 0,
    'rumah': 0,
    'gedung': 0,
    'rs': 0},
   'pdp': {'total': 53,
    'belum_diawasi': 0,
    'dirawat': 0,
    'sehat': 0,
    'meninggal': 0,
    'rumah': 0,
    'gedung': 0,
    'rs': 0},
   'otg': 0,
   'odr': 11025},
  '2020-03-21': {'kabko': '',
   'tanggal': '2020-03-21',
   'sampai': '2020-03-21',
   'positif': {'total': 26,
    'dirawat': 0,
    'sembuh': 0,
    'meninggal': 1,
    'rumah': 0,
    'gedung': 0,
    'rs': 0},
   'odp': {'total': 791,
    'belum_dipantau': 0,
    'dipantau': 0,
    'selesai_dipantau': 0,
    'meninggal': 0,
    'rumah': 0,
    'gedung': 0,
    'rs': 0},
   'pdp': {'total': 78,
    'belum_diawasi': 0,
    'dirawat

In [11]:
data_count = len(params['kabko']) * len(params['tanggal'])
data_count

4510

## Efficiency and Benchmarking

Well, damn. It takes so long because the loop is synchronous: we're basically loading 4510 web pages one by one. If we were to open that many web pages manually in a browser, we'd open many tabs at once so they load at the same time instead of opening them one by one. That is the idea behind running tasks in parallel with multiprocessing. However, we can't use Python's standard multiprocessing library here in Jupyter; we should use the multiprocess library instead. (Ref: https://stackoverflow.com/questions/50937362/multiprocessing-on-python-3-jupyter)


But first, let's time the single function.

In [12]:
import time

def f_timer(f, args=None, kwargs=None):
    args = args or []
    kwargs = kwargs or {}
    start = time.time()
    ret = f(*args, **kwargs)
    end = time.time()
    return end-start

f_timer(get_covid_data, (session, endpoint, "", "2020-03-20"))

7.237887144088745

About 7 seconds. Pretty long, eh? Processing all 4510 records will take quite a long time.

Let's try the average of several runs.

In [13]:
def f_timer_avg(f, count, args=None, kwargs=None):
    args = args or []
    kwargs = kwargs or {}
    times = [f_timer(f, args, kwargs) for i in range(count)]
    avg = sum(times)/len(times)
    return avg

run_avg_time = f_timer_avg(get_covid_data, 10, (session, endpoint, "", "2020-03-20"))
run_avg_time

6.342255163192749
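
The timing helpers can be tried without touching the site. This is a minimal sketch under two assumptions: `fake_request` is a hypothetical stand-in that just sleeps, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time

def f_timer(f, args=None, kwargs=None):
    args = args or []
    kwargs = kwargs or {}
    start = time.perf_counter()
    f(*args, **kwargs)
    return time.perf_counter() - start

def f_timer_avg(f, count, args=None, kwargs=None):
    times = [f_timer(f, args, kwargs) for _ in range(count)]
    return sum(times) / len(times)

def fake_request(delay):
    # hypothetical stand-in for a slow page load
    time.sleep(delay)

avg = f_timer_avg(fake_request, 3, (0.05,))
print(avg)
```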

About the same. What makes it long? If it's the page loading then we can't really do much.

In [14]:
f_timer_avg(session.post, 10, (endpoint,))

5.2201550006866455

It is. So we can't do much. How long will it take to process all data though?

In [15]:
seconds_needed = run_avg_time * data_count
seconds_needed

28603.570785999298

In [16]:
minutes_needed = seconds_needed / 60
minutes_needed

476.726179766655

In [17]:
hours_needed = minutes_needed / 60
hours_needed

7.94543632944425

We need close to 8 hours to process all data. Heroku will have long since terminated our program by then. Thus, we really need to speed this up or split it into parts.
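
The same arithmetic can be folded into one helper that also gives a rough idea of what parallel workers might buy us. This is an idealised back-of-the-envelope sketch: `estimated_hours` is a hypothetical helper, and it assumes the work divides evenly across workers and that the server responds just as fast under parallel load, which may well not hold.

```python
avg_seconds = 6.342255163192749   # average time per request measured above
data_count = 4510                 # kabko count * date count from above

def estimated_hours(avg_seconds, data_count, workers=1):
    # ideal case: perfect speedup, no extra slowdown on the server side
    return avg_seconds * data_count / workers / 3600

sequential = estimated_hours(avg_seconds, data_count)
with_10_workers = estimated_hours(avg_seconds, data_count, workers=10)
print(round(sequential, 2), round(with_10_workers, 2))
```

Under those assumptions, ten workers would bring the run down from roughly eight hours to under one hour.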

## Multiprocessing

Now we do the multiprocessing. First, let's prepare for it.

In [13]:
#requests-html github readme says that we automatically get connection pool so no need to pool the sessions

#get the params like before
params = get_all_param_vals(session, endpoint)

#let's start small
tanggal = params['tanggal'][:2]
kabko = params['kabko'][:2]

tanggal

['2020-03-20', '2020-03-21']

In [14]:
kabko

['', '- (STATUS PENDING)']

In [15]:
#we still need to prepare the dict, but just make the value empty
data = {k:{t:None for t in tanggal} for k in kabko}

data

{'': {'2020-03-20': None, '2020-03-21': None},
 '- (STATUS PENDING)': {'2020-03-20': None, '2020-03-21': None}}

In [16]:
#prepare the args
#get_covid_data(session, endpoint, k, t)
#we create a new session for each run because we can't make one session load many pages in parallel
args = [(HTMLSession(), endpoint, k, t) for t in tanggal for k in kabko]

args

[(<requests_html.HTMLSession at 0xa8f7448>,
  'https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/',
  '',
  '2020-03-20'),
 (<requests_html.HTMLSession at 0xa8f7c08>,
  'https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/',
  '- (STATUS PENDING)',
  '2020-03-20'),
 (<requests_html.HTMLSession at 0xa906308>,
  'https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/',
  '',
  '2020-03-21'),
 (<requests_html.HTMLSession at 0xa906d88>,
  'https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/',
  '- (STATUS PENDING)',
  '2020-03-21')]

In [17]:
len(args)

4

We have 4 packs of arguments, meaning we should have 4 tasks. The pool is set to 10 workers, so no task has to wait and all 4 should run at the same time. One request took about 6 seconds in the benchmark above, so the 4 requests would take roughly 25 seconds if they ran one after another; in parallel, they should finish in about the time of the slowest single request.

Just to be safe, let's run each of them one by one and see if any of them breaks.

In [18]:
get_covid_data(*(args[0]))['odr']

11025

In [19]:
get_covid_data(*(args[1]))['odr']

0

In [20]:
get_covid_data(*(args[2]))['odr']

13392

In [21]:
get_covid_data(*(args[3]))['odr']

0

In [22]:
#from multiprocess import Pool
from multiprocessing.pool import ThreadPool as Pool

max_process_count = 10

#we prepare the pool
#pool in this context is like collection of available tabs
pool = Pool(processes=max_process_count)

#now we execute it
#we use starmap instead of map because there are multiple arguments
output = pool.starmap(get_covid_data, args)

pool.close()
pool.join()

output

[{'kabko': '',
  'tanggal': '2020-03-20',
  'sampai': '2020-03-20',
  'positif': {'total': 15,
   'dirawat': 0,
   'sembuh': 0,
   'meninggal': 1,
   'rumah': 0,
   'gedung': 0,
   'rs': 0},
  'odp': {'total': 542,
   'belum_dipantau': 0,
   'dipantau': 0,
   'selesai_dipantau': 0,
   'meninggal': 0,
   'rumah': 0,
   'gedung': 0,
   'rs': 0},
  'pdp': {'total': 53,
   'belum_diawasi': 0,
   'dirawat': 0,
   'sehat': 0,
   'meninggal': 0,
   'rumah': 0,
   'gedung': 0,
   'rs': 0},
  'otg': 0,
  'odr': 11025},
 {'kabko': '- (STATUS PENDING)',
  'tanggal': '2020-03-20',
  'sampai': '2020-03-20',
  'positif': {'total': 0,
   'dirawat': 0,
   'sembuh': 0,
   'meninggal': 0,
   'rumah': 0,
   'gedung': 0,
   'rs': 0},
  'odp': {'total': 0,
   'belum_dipantau': 0,
   'dipantau': 0,
   'selesai_dipantau': 0,
   'meninggal': 0,
   'rumah': 0,
   'gedung': 0,
   'rs': 0},
  'pdp': {'total': 0,
   'belum_diawasi': 0,
   'dirawat': 0,
   'sehat': 0,
   'meninggal': 0,
   'rumah': 0,
   'gedung':

It works with multithreading (ThreadPool). The sections below are for multiprocessing.
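
To see the effect of the thread pool in isolation, here is a sketch that replaces the real request with a sleep. `fake_get_covid_data` is a hypothetical stand-in and the 0.2-second delay is arbitrary. Only `ThreadPool.starmap` is the same call used above.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_get_covid_data(kabko, tanggal, delay=0.2):
    # hypothetical stand-in: sleeps instead of requesting the page
    time.sleep(delay)
    return {'kabko': kabko, 'tanggal': tanggal}

kabko = ['', '- (STATUS PENDING)']
tanggal = ['2020-03-20', '2020-03-21']
args = [(k, t) for t in tanggal for k in kabko]

# one after another
start = time.perf_counter()
sequential = [fake_get_covid_data(*a) for a in args]
sequential_time = time.perf_counter() - start

# all four at once; starmap returns results in the same order as args
start = time.perf_counter()
with ThreadPool(processes=4) as pool:
    parallel = pool.starmap(fake_get_covid_data, args)
parallel_time = time.perf_counter() - start

print(sequential_time, parallel_time)
```

Threads are enough here because the tasks spend nearly all their time waiting on the network, not computing.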

## Multiprocessing: Encapsulation

With a process pool, it doesn't work. It seems to not recognize the function from outside its scope. Multiprocessing always comes with extra headaches. Yes, multiprocessing can't use variables from the global scope, but I just realized that this includes global functions. Maybe wrapping them in something will help?

In [None]:
class Scrapper:
    positif_fields = ['total', 'dirawat', 'sembuh', 'meninggal', 'rumah', 'gedung', 'rs']
    odp_fields = ['total', 'belum_dipantau', 'dipantau', 'selesai_dipantau', 'meninggal', 'rumah', 'gedung', 'rs']
    pdp_fields = ['total', 'belum_diawasi', 'dirawat', 'sehat', 'meninggal', 'rumah', 'gedung', 'rs']
    
    def __init__(self, endpoint="https://covid19dev.jatimprov.go.id/index.php/xweb/grafikpublik/"):
        self.endpoint = endpoint
        
        #multiprocess doesn't even recognize Scrapper type
        self.positif_fields = Scrapper.positif_fields
        self.odp_fields = Scrapper.odp_fields
        self.pdp_fields = Scrapper.pdp_fields

    def get_select_opts(self, r, query):
        #find the select element
        select = r.html.find(query)[0]
        els = select.find("option")
        return els

    def get_select_vals(self, r, query):
        els = self.get_select_opts(r, query)
        vals = [k.attrs["value"] for k in els] 
        #we want to skip empty dates since they're useless, but not the empty kabko because it's the aggregate of all other kabko
        #it's always the first item from a list
        if 'kabko' not in query:
            vals.pop(0)
        return vals

    def get_select_dicts(self, r, query):
        els = self.get_select_opts(r, query)
        dicts = [{k.attrs["value"]:k.text} for k in els]
        #we want to skip empty dates since they're useless, but not the empty kabko because it's the aggregate of all other kabko
        #it's always the first item from a list
        if 'kabko' not in query:
            dicts.pop(0)
        return dicts
    
    def parse_int(self, text):
        if not text:
            return 0
        val = int(text)
        return val

    def parse_card(self, card, fields):
        els = card.find("h3")
        vals = [self.parse_int(e.text) for e in els]
        parsed = dict(zip(fields, vals))
        return parsed
    
    def get_all_param_vals(self, dicts=False, params=['kabko', 'tanggal', 'sampai']):
        session = HTMLSession()
        r = session.get(self.endpoint)
        f = self.get_select_vals
        if dicts:
            f = self.get_select_dicts
        vals = dict([(p, f(r, "#"+p)) for p in params])
        return vals
    
    def parse_date(self, date_str):
        return datetime.strptime(date_str, "%Y-%m-%d").date()
    
    def format_date(self, date):
        return datetime.strftime(date, "%Y-%m-%d")
    
    def get_new_dates(self, dates, after):
        after = self.parse_date(after)
        new_dates = [d for d in dates if d and self.parse_date(d) > after]
        return new_dates
    
    def parse_card_single(self, card):
        val = self.parse_int(card.find("h3")[0].text)
        return val
    
    def get_covid_data(self, kabko, tanggal, sampai=None):
        from requests_html import HTMLSession
        session = HTMLSession()
        
        #prepare the request
        if not sampai:
            sampai = tanggal
        data = {
            'kabko': kabko,
            'tanggal': tanggal,
            'sampai': sampai
        }

        #send request
        r = session.post(self.endpoint, data=data)

        #get cards
        row = r.html.find(".container .row")[0]
        card_groups = row.find("div.col-md-6")[:2]
        cards = card_groups[0].find("div.card") + card_groups[1].find("div.card")

        #name the cards
        positif_card = cards[0].find(".card-block")[0]
        odp_card = cards[1].find(".card-block")[0]
        pdp_card = cards[2].find(".card-block")[0]
        otg_card = cards[3].find(".card-body")[0]
        odr_card = cards[4].find(".card-body")[0]

        #get and pack the values
        vals = {
            'kabko':kabko,
            'tanggal':tanggal,
            'sampai':sampai,
            'positif':self.parse_card(positif_card, self.positif_fields),
            'odp':self.parse_card(odp_card, self.odp_fields),
            'pdp':self.parse_card(pdp_card, self.pdp_fields),
            'otg':self.parse_card_single(otg_card),
            'odr':self.parse_card_single(odr_card)
        }

        return vals
    
    def get_covid_data_bulk(self, kabko, tanggal):
        data = {k:{t:self.get_covid_data(k, t) for t in tanggal} for k in kabko}
        return data
    

In [None]:
scrapper = Scrapper()
scrapper.get_covid_data("", "2020-03-22")['odr']

In [None]:
#prepare the args
#get_covid_data(kabko, tanggal)
args2 = [(k, t) for t in tanggal for k in kabko]

args2

In [None]:
scrapper.get_covid_data(*(args2[0]))['odr']

In [None]:
scrapper.get_covid_data(*(args2[1]))['odr']

In [None]:
scrapper.get_covid_data(*(args2[2]))['odr']

In [None]:
scrapper.get_covid_data(*(args2[3]))['odr']

In [None]:
#we prepare the pool
#pool in this context is like collection of available tabs
pool = Pool(processes=max_process_count)

#now we execute it
#we use starmap instead of map because there are multiple arguments
output = pool.starmap(scrapper.get_covid_data, args2)

pool.close()
pool.join()

output

Hell yeah. However, we have yet to fill the dict with the results.

In [None]:
data

In [None]:
for o in output:
    data[o['kabko']][o['sampai']] = o

data
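
The regrouping step can be written so it doesn't need a pre-built dict of `None` values. Below is a minimal sketch using `dict.setdefault`. `group_by` is a hypothetical helper, and the records are trimmed-down versions of the outputs above (only `odr` is kept).

```python
def group_by(output, outer_key, inner_key):
    grouped = {}
    for o in output:
        # create the inner dict the first time an outer key is seen
        grouped.setdefault(o[outer_key], {})[o[inner_key]] = o
    return grouped

# trimmed records based on the outputs shown earlier
output = [
    {'kabko': '', 'tanggal': '2020-03-20', 'odr': 11025},
    {'kabko': '- (STATUS PENDING)', 'tanggal': '2020-03-20', 'odr': 0},
    {'kabko': '', 'tanggal': '2020-03-21', 'odr': 13392},
    {'kabko': '- (STATUS PENDING)', 'tanggal': '2020-03-21', 'odr': 0},
]

by_kabko = group_by(output, 'kabko', 'tanggal')
by_tanggal = group_by(output, 'tanggal', 'kabko')
print(by_kabko['']['2020-03-21']['odr'])
```

The same helper covers both orientations (kabko first or date first) by swapping the two key names.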

## Parallel Bulk Scraping

Now we only need to put them in a function.

In [None]:
def get_covid_data_bulk_parallel(self, kabko, tanggal, max_process_count=10):
    
    #we might not need this after all, but whatever
    data = {k:{t:None for t in tanggal} for k in kabko}
    
    #prepare the args
    #get_covid_data(kabko, tanggal)
    args = [(k, t) for t in tanggal for k in kabko]
    
    #we prepare the pool
    #pool in this context is like collection of available tabs
    pool = Pool(processes=max_process_count)

    #now we execute it
    #we use starmap instead of map because there are multiple arguments
    output = pool.starmap(self.get_covid_data, args)
    pool.close()
    pool.join()
    
    return output

Scrapper.get_covid_data_bulk_parallel = get_covid_data_bulk_parallel
scrapper = Scrapper()
scrapper.get_covid_data_bulk_parallel

Now we time it. We're fetching 4 records, so at the roughly 6 seconds per request measured above, it would take about 25 seconds if it were synchronous.

In [None]:
f_timer(scrapper.get_covid_data_bulk_parallel, (kabko, tanggal))

Wow, that's almost twice as fast as synchronous. Definitely an improvement worth the headache.

Next is the wrapper to return dictionaries.

In [None]:
def get_covid_data_bulk_parallel_dict_kabko(self, kabko, tanggal, max_process_count=10):
    output = self.get_covid_data_bulk_parallel(kabko, tanggal, max_process_count)
    
    #build the result dict here instead of relying on a global one
    data = {k:{} for k in kabko}
    for o in output:
        data[o['kabko']][o['sampai']] = o

    return data

Scrapper.get_covid_data_bulk_parallel_dict_kabko = get_covid_data_bulk_parallel_dict_kabko
scrapper = Scrapper()
scrapper.get_covid_data_bulk_parallel_dict_kabko

In [None]:
def get_covid_data_bulk_parallel_dict_tanggal(self, kabko, tanggal, max_process_count=10):
    output = self.get_covid_data_bulk_parallel(kabko, tanggal, max_process_count)
    
    #build the result dict here, keyed by date first
    data = {t:{} for t in tanggal}
    for o in output:
        data[o['sampai']][o['kabko']] = o

    return data

Scrapper.get_covid_data_bulk_parallel_dict_tanggal = get_covid_data_bulk_parallel_dict_tanggal
scrapper = Scrapper()
scrapper.get_covid_data_bulk_parallel_dict_tanggal

In [None]:
scrapper.get_covid_data_bulk_parallel(kabko, tanggal)

## Bulk Scraping: All

Next, we make one for all data. However, we won't test it here since it would take a very long time.

In [None]:
def get_covid_data_all(self, parallel=True):
    params = self.get_all_param_vals()
    f = self.get_covid_data_bulk_parallel
    if not parallel:
        f = self.get_covid_data_bulk
    data = f(params['kabko'], params['sampai'])
    return data

Scrapper.get_covid_data_all = get_covid_data_all
scrapper = Scrapper()
scrapper.get_covid_data_all

## Bulk Scraping: New

Finally, make one for new data. We'll test this one first before testing getting all data.

In [None]:
def get_covid_data_new(self, after, parallel=True):
    params = self.get_all_param_vals()
    new_dates = self.get_new_dates(params['sampai'], after)
    f = self.get_covid_data_bulk_parallel
    if not parallel:
        f = self.get_covid_data_bulk
    data = f(params['kabko'], new_dates)
    return data

Scrapper.get_covid_data_new = get_covid_data_new
scrapper = Scrapper()
scrapper.get_covid_data_new

Now let's test it. We will use the second-to-last date as the cutoff, so we'll only be getting data for the last date. This will take a few minutes (41 records).

In [None]:
yesterday = params['tanggal'][-2]
start = time.time()
data = scrapper.get_covid_data_new(yesterday)
end = time.time()
data

In [None]:
end-start

Hey, that's really fast. We might want to adjust the max process count.

This concludes the main data scraping. 