# Web Scrape Tirto.id
In this notebook I will discuss how to dig data out of the Tirto.id website. Note that this method can fail, because the structure of the Tirto.id website may change at any time, at its owner's discretion.

These web scraping steps are published purely for educational purposes. I obtained the material from the Big Data training of the Digital Talent Scholarship 2018.

Let's start digging!

## Step 1: Import the required libraries

In [1]:
# Import the libraries needed for scraping
import requests # for requesting the website
from bs4 import BeautifulSoup # for parsing the page's HTML so its elements can be searched
import json # for reading & writing JSON data
import pandas as pd # for managing data frames
import time # for pausing between requests

## Step 2: Create the functions to scrape the data
The Tirto.id website has a news index at https://tirto.id/indeks/ . This tutorial accesses that data. However, based on recent experiments, the news index is served via lazy loading, so the scraping steps differ slightly from those for a website without lazy loading.

The Tirto.id news index data can be found in a "script" element, so we can take the data from there.
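To illustrate the idea, here is a minimal sketch that pulls a `window.__NUXT__` payload out of an HTML string and parses it as JSON. The sample HTML, the helper name `extract_nuxt_payload`, and the use of a regular expression from the standard library (instead of BeautifulSoup) are assumptions made for illustration only; the real page may be laid out differently.

```python
import json
import re

# Hypothetical, heavily simplified stand-in for a Tirto.id index page
SAMPLE_HTML = """
<html><body>
<script src="/app.js"></script>
<script>window.__NUXT__={"data":[{"listarticle":[{"judul":"Contoh A"},{"judul":"Contoh B"}]}]};</script>
</body></html>
"""

def extract_nuxt_payload(html):
    """Return the parsed window.__NUXT__ object, or None if it is absent."""
    # Capture everything between the assignment and the optional trailing ";"
    match = re.search(
        r"<script[^>]*>\s*window\.__NUXT__=(.*?);?\s*</script>",
        html,
        re.DOTALL,
    )
    if match is None:
        return None
    return json.loads(match.group(1))

payload = extract_nuxt_payload(SAMPLE_HTML)
articles = payload["data"][0]["listarticle"]
```

In this sketch `articles` holds the list of article dictionaries from the sample, which mirrors the `article["data"][0]["listarticle"]` lookup used in the scraping function.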

In [2]:
# Function to scrape Tirto.id articles
def crawl_tirto_article(i, dataArticle):
    print("Start Scraping Tirto.id Page " + str(i))

    # Get page content; fail early on HTTP errors instead of parsing an error page
    page = requests.get("https://tirto.id/indeks/" + str(i), timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "lxml")

    # Tirto.id lazy-loads its index: the article data is embedded in a
    # <script> element as "window.__NUXT__=...;"
    # Search for that script by its prefix instead of relying on a fixed position
    prefix = "window.__NUXT__="
    scriptContent = None
    for script in soup.find_all("script"):
        text = (script.string or "").strip()
        if text.startswith(prefix):
            scriptContent = text[len(prefix):].rstrip(";")
            break
    if scriptContent is None:
        raise ValueError("window.__NUXT__ data not found on page " + str(i))

    # The embedded data is JSON-compatible, so parse it with the json module
    article = json.loads(scriptContent)

    # Extract the article list and append it to the accumulated results
    listArticle = article["data"][0]["listarticle"]
    dataArticle = dataArticle + listArticle

    # Sleep to prevent your scraper from being too spammy
    time.sleep(5)

    return dataArticle
    
# Function to save crawled data into CSV files
def save_tirto_articles(dataArticle):
    dataFrame = pd.DataFrame(dataArticle)
    dataFrame.to_csv("Data Tirto.ID.csv", sep=";")
    print("Done saving!")
    return dataFrame

## Step 3: Run the web scrape

In [3]:
# Let's start scraping
firstPage = int(input("Insert first page you want to scrape (minimum is 1): "))
lastPage = int(input("Insert last page you want to scrape (minimum is 1): "))

# Both pages must be at least 1, and the last page must not precede the first
if firstPage < 1 or lastPage < firstPage:
    print("Please input valid page!")

else:
    # Initialize list first to store scrape result
    data = []
    for i in range(firstPage, lastPage+1):
        data = crawl_tirto_article(i, data)

    # Finally, let's save it
    dataFr = save_tirto_articles(data)

Insert first page you want to scrape (minimum is 1): 1
Insert last page you want to scrape (minimum is 1): 5
Start Scraping Tirto.id Page 1
Start Scraping Tirto.id Page 2
Start Scraping Tirto.id Page 3
Start Scraping Tirto.id Page 4
Start Scraping Tirto.id Page 5
Done saving!


## Bonus step 4: Display the first 5 rows of the scraped data

In [4]:
# Display the first rows, just for fun
dataFr.head()

Unnamed: 0,articleUrl,articleUrlNew,date_news,flag_tvr,id_topic_pialadunia,image,image_infografik,judul,label_kanal,label_navbar,match_id,player_id,ringkasan,team_id
0,/laba-bersih-jasa-marga-capai-rp177-t-pada-tri...,laba-bersih-jasa-marga-capai-rp177-t-pada-triw...,2018-10-26 14:20:39,0,,[{'url': '2018/03/23/antarafoto-rencana-penuru...,False,"Laba Bersih Jasa Marga Capai Rp1,77 T pada Tri...",Hard News,Ekonomi,,,"Dari sisi usaha di luar konstruksi, pendapatan...",
1,/menteri-pariwisata-nilai-transportasi-kita-ma...,menteri-pariwisata-nilai-transportasi-kita-mas...,2018-10-26 14:13:00,0,,[{'url': '2016/06/06/TIRTO-AriefYahya_ratio-16...,False,Menteri Pariwisata: Nilai Transportasi Kita Ma...,Hard News,Ekonomi,,,Sektor pariwisata masih menghadapi kendala pad...,
2,/kemenko-pmk-sebut-pkn-revolusi-mental-sebarka...,kemenko-pmk-sebut-pkn-revolusi-mental-sebarkan...,2018-10-26 14:09:20,0,,[{'url': '2018/10/24/rapat-akhir-persiapan-pkn...,False,Kemenko PMK Sebut PKN Revolusi Mental Sebarkan...,Hard News,Sosial Budaya,,,&quot;Agar masyarakat lebih mengetahui berbaga...,
3,/kasus-suap-meikarta-kpk-periksa-sopir-hingga-...,kasus-suap-meikarta-kpk-periksa-sopir-hingga-p...,2018-10-26 14:06:54,0,,[{'url': '2018/10/26/hl-meikarta--5-tirto.id-h...,False,Kasus Suap Meikarta: KPK Periksa Sopir Hingga ...,Hard News,Hukum,,,"Hari ini, Jumat (26/10/2018) KPK memanggil dua...",
4,/10-tahun-android-besar-karena-ponsel-pintar-c...,10-tahun-android-besar-karena-ponsel-pintar-ci...,2018-10-26 14:01:01,0,,[{'url': '2018/06/17/ilustrasi-android--istock...,2018/10/25/10-tahun-android--mild--quita-01.jpg,10 Tahun Android: Besar karena Ponsel Pintar Cina,Mild Report,Teknologi,,,"Jika bukan karena hasrat menyaingi iPhone,&amp...",


## Future Development
1. Modify the save function so that it can update the CSV file when further scraping is done
2. Create a function to download images
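
As a starting point for the first item, the following is a rough sketch of an incremental save that appends only articles not yet stored in the file. It is not part of the original notebook: the helper name `append_articles` is hypothetical, it uses the standard `csv` module rather than pandas to stay dependency-free, and it assumes `articleUrl` (one of the columns shown in the output above) uniquely identifies an article. It also writes its own file layout without the index column that `DataFrame.to_csv` adds, so it should target a separate file from the one produced in Step 2.

```python
import csv
import os

def append_articles(path, articles, key="articleUrl"):
    """Append articles whose `key` is not yet in the CSV at `path`.

    Returns the number of rows actually written.
    """
    seen = set()
    fieldnames = None

    # Collect the keys already stored, and reuse the existing header if any
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter=";")
            fieldnames = reader.fieldnames
            seen = {row[key] for row in reader}

    # Keep only articles that are new, also skipping duplicates in the batch
    new_rows = []
    for article in articles:
        if article[key] not in seen:
            seen.add(article[key])
            new_rows.append(article)
    if not new_rows:
        return 0

    # Write a header only when the file is new or empty
    write_header = fieldnames is None
    if write_header:
        fieldnames = list(new_rows[0].keys())

    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=fieldnames,
            delimiter=";",
            extrasaction="ignore",  # drop fields missing from the header
            restval="",             # fill fields missing from a row
        )
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

Calling `append_articles("tirto_incremental.csv", data)` after each scraping run (the file name is only an example) would then add just the new articles, so repeated runs over overlapping pages do not duplicate rows.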