
# How to _Web Scrape_

1. _Targeting_ `.css` _objects_.
    - CLI: _command line interface_.
    - No _browser_ needed.
    - __R__ or ___Python___ effectively acts as the _browser_.
    - Plus: the process is fast.
    - Minus: if the site is guarded by anti-_bot_ protection, it is hard to get through.
1. __API__-_based_.
    - _Targeting_ the service provider's _database_ directly.
    - This is most commonly available on government sites.
    - Plus: faster and more _reliable_.
    - Minus: not every site has an __API__.
1. _Mimicking a browser_.
    - Behaves as if a real _browser_ were visiting the site.
    - The _killer method_: it works on almost any site.
    - Plus: it still works even when the site is guarded by anti-_bot_ protection.
    - Minus: the process is slower and the _coding_ is harder.
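
The second approach in the list above usually comes down to requesting a JSON endpoint and parsing the response, rather than picking elements out of HTML. A minimal sketch, assuming the `jsonlite` package is installed; the JSON payload and the name `respons_json` are made up for illustration and are not something Cookpad is known to provide:

```r
library(jsonlite)

# Hypothetical API response body; a real endpoint would be fetched over HTTP.
respons_json = '[
  {"nama_resep": "Bolu Kukus Anak Ayam", "bahan": "250 gr tepung terigu"},
  {"nama_resep": "Bolu Kukus Anak Ayam", "bahan": "2 butir telur"}
]'

# fromJSON() turns an array of flat JSON objects into a data frame
data_api = fromJSON(respons_json)
data_api
```

Because the response is already structured, no CSS selectors or text cleaning are needed, which is why this route tends to be the more reliable one when it exists.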

In [9]:
# clear the environment
rm(list=ls())

# libraries
library(dplyr) # data cleaning
library(rvest) # the core package for web scraping
library(stringr) # text manipulation

# example
# suppose I want to scrape the following recipe
url = "https://cookpad.com/id/resep/15229535-bolu-kukus-anak-ayam"

In [13]:
# step 1: get the recipe name
nama_resep = 
  url %>% 
  read_html() %>% 
  html_nodes(".field-group--no-container-xs") %>% 
  html_text()

nama_resep = nama_resep %>% str_squish() %>% str_trim()
nama_resep
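
The selector step can be tried offline before pointing it at a live page. The sketch below parses a small inline HTML string; the markup and the class names `judul` and `bahan-list` are invented for illustration and do not reflect Cookpad's actual page structure:

```r
library(rvest)
library(stringr)

# Hypothetical markup; the real page structure may differ.
halaman = read_html('<html><body>
  <h1 class="judul">
     Bolu   Kukus
     Anak Ayam
  </h1>
  <div class="bahan-list"><div>250 gr tepung terigu</div><div>2 butir telur</div></div>
</body></html>')

# ".judul" selects by class; str_squish() collapses the stray whitespace
judul = halaman %>% html_nodes(".judul") %>% html_text() %>% str_squish()

# ".bahan-list div" selects every <div> nested inside the .bahan-list element
bahan_contoh = halaman %>% html_nodes(".bahan-list div") %>% html_text() %>% str_squish()

judul
bahan_contoh
```

Note that `str_squish()` already removes leading and trailing whitespace, so it is enough on its own for this kind of cleanup.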

In [14]:
# step 2: get the ingredients
bahan = 
  url %>% 
  read_html() %>%
  html_nodes(".border-dashed div") %>% 
  html_text()

bahan = bahan %>% str_squish() %>% str_trim()
bahan

In [15]:
# step 3: combine the data
data = data.frame(nama_resep,bahan)
data

nama_resep,bahan
<chr>,<chr>
Bolu Kukus Anak Ayam,250 gr tepung terigu
Bolu Kukus Anak Ayam,200 gr gula pasir
Bolu Kukus Anak Ayam,2 butir telur
Bolu Kukus Anak Ayam,150 ml santan (me kara kecil + air)
Bolu Kukus Anak Ayam,1/2 sdt baking powder
Bolu Kukus Anak Ayam,1 sdt SP
Bolu Kukus Anak Ayam,"Pewarna makanan : saya biru, merah rose, kuning"
Bolu Kukus Anak Ayam,mata Choco chips untuk hiasan
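
The recipe name fills every row because `data.frame()` recycles a length-1 vector to match the longer column. A small sketch with made-up values showing both the recycling and what happens when the lengths do not fit:

```r
# a length-1 value is repeated to match the longer column
contoh = data.frame(
  nama_resep = "Bolu Kukus Anak Ayam",
  bahan = c("250 gr tepung terigu", "200 gr gula pasir", "2 butir telur")
)
nrow(contoh)

# lengths that do not divide evenly (2 vs 3) raise an error instead;
# tryCatch() captures the message so the script keeps running
gagal = tryCatch(
  data.frame(nama_resep = c("A", "B"), bahan = c("x", "y", "z")),
  error = function(e) conditionMessage(e)
)
gagal
```

So if the name selector ever matches more than one element on a page, the combining step may fail or silently pair names with the wrong ingredients.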


In [16]:
# scraping function, version 1
scrape_cookpad = function(url){
  nama_resep = 
    url %>% 
    read_html() %>% 
    html_nodes(".field-group--no-container-xs") %>% 
    html_text()

  nama_resep = nama_resep %>% str_squish() %>% str_trim()
  
  bahan = 
    url %>% 
    read_html() %>%
    html_nodes(".border-dashed div") %>% 
    html_text()

  bahan = bahan %>% str_squish() %>% str_trim()

  data = data.frame(nama_resep,bahan)
  return(data)
}

In [17]:
url_baru = "https://cookpad.com/id/resep/14454088-kentang-goreng-ala-kfc"
scrape_cookpad(url_baru)

nama_resep,bahan
<chr>,<chr>
Kentang goreng ala KFC,1 kg kentang
Kentang goreng ala KFC,5 sdm tepung terigu
Kentang goreng ala KFC,"2,5 tepung maizena"
Kentang goreng ala KFC,1 sdm garam


In [18]:
# scraping function, version 2
# should be faster because the page is read only once
scrape_cookpad_v2 = function(url){
  data = 
    url %>% 
    read_html() %>% {tibble(
      nama = html_nodes(.,".field-group--no-container-xs") %>% html_text() %>% str_squish() %>% str_trim(),
      bahan = html_nodes(.,".border-dashed div") %>% html_text() %>% str_squish() %>% str_trim()
  )
    } 
  return(data)
}

scrape_cookpad_v2(url_baru)

nama,bahan
<chr>,<chr>
Kentang goreng ala KFC,1 kg kentang
Kentang goreng ala KFC,5 sdm tepung terigu
Kentang goreng ala KFC,"2,5 tepung maizena"
Kentang goreng ala KFC,1 sdm garam
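
One thing to watch with the `tibble()` version: `tibble()` only recycles length-1 vectors, so a page where the name selector matches nothing would make the call fail. A defensive sketch, where `ambil_satu()` and `parse_resep()` are hypothetical helpers (not part of the notebook) and the test page is made up:

```r
library(dplyr)
library(rvest)
library(stringr)

# hypothetical helper: keep the first match, or NA when nothing matched
ambil_satu = function(x) if (length(x) == 0) NA_character_ else x[1]

# hypothetical variant of the parser that works on an already-read page
parse_resep = function(halaman){
  nama = halaman %>% html_nodes(".field-group--no-container-xs") %>% html_text() %>% str_squish()
  bahan = halaman %>% html_nodes(".border-dashed div") %>% html_text() %>% str_squish()
  tibble(nama = ambil_satu(nama), bahan = bahan)
}

# made-up page in which the name selector matches nothing
halaman_uji = read_html('<html><body><div class="border-dashed">
  <div>1 kg kentang</div><div>1 sdm garam</div>
</div></body></html>')

hasil_uji = parse_resep(halaman_uji)
hasil_uji
```

Separating parsing from downloading also makes the selectors testable without network access.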


In [24]:
scrape_cookpad_v2("https://cookpad.com/id/resep/13042002-rendang-daging-tanpa-santan")

nama,bahan
<chr>,<chr>
Rendang daging tanpa santan,500 gr daging sapi
Rendang daging tanpa santan,6 butir bawang merah
Rendang daging tanpa santan,4 butir bawang putih
Rendang daging tanpa santan,2 butir kemiri
Rendang daging tanpa santan,1/2 sendok ketumbar
Rendang daging tanpa santan,3 cm kunyit
Rendang daging tanpa santan,3 cm jahe
Rendang daging tanpa santan,2 batang sereh
Rendang daging tanpa santan,2 lembar daun salam
Rendang daging tanpa santan,2 lembar daun jeruk


In [31]:
daftar_link = readLines("resep cookpad.txt")
daftar_link

In [32]:
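hasil_scrape = data.frame()

# scrape the first 5 links only
for(i in 1:5){
  temp = scrape_cookpad_v2(daftar_link[i])
  Sys.sleep(10) # pause 10 seconds between requests
  hasil_scrape = rbind(temp,hasil_scrape) # the newest result goes on top
}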

hasil_scrape

nama,bahan
<chr>,<chr>
Melted Cheese Chicken,"300 gr Ayam fillet, iris tipis menjadi 4 bagian"
Melted Cheese Chicken,200 gr Melted Cheese (saya: quick melt cheese)
Melted Cheese Chicken,Secukupnya Minyak untuk menggoreng
Melted Cheese Chicken,150 ml Susu cair
Melted Cheese Chicken,"2 siung Bawang Putih, haluskan atau 1/2 sdt garlic powder"
Melted Cheese Chicken,1 sdt Garam
Melted Cheese Chicken,1/4 sdt lada bubuk
Melted Cheese Chicken,1/2 sdt Kaldu bubuk (opsional)
Melted Cheese Chicken,Bahan pelapis:
Melted Cheese Chicken,150 gr Tepung terigu
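
When scraping many links, a single unreadable page stops the whole loop. One possible alternative, sketched here with `lapply()` and `tryCatch()` instead of the `for` loop, collects results in a list and binds them once at the end. `scrape_tiruan()` and the links are hypothetical stand-ins so the sketch runs offline; in practice `scrape_cookpad_v2()` and `daftar_link` would take their place:

```r
library(dplyr)

# hypothetical stand-in for scrape_cookpad_v2(); it fails on purpose for one link
scrape_tiruan = function(link){
  if (grepl("rusak", link)) stop("halaman tidak bisa dibaca")
  tibble(nama = link, bahan = c("bahan 1", "bahan 2"))
}

link_contoh = c("resep-a", "resep-rusak", "resep-b")  # made-up links

hasil_list = lapply(link_contoh, function(link){
  # Sys.sleep(10)  # keep the pause when hitting a real site
  # a failed link yields NULL instead of stopping the whole run
  tryCatch(scrape_tiruan(link), error = function(e) NULL)
})

# bind_rows() skips the NULL entries and keeps the original link order
hasil_aman = bind_rows(hasil_list)
hasil_aman
```

Binding once at the end also avoids growing a data frame with `rbind()` on every iteration.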


In [None]:
# if you want to add a user agent
uastring = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
# note: in rvest >= 1.0.0, html_session() is deprecated in favor of session()
x = html_session(url,httr::user_agent(uastring))

read_html(x)
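
The same idea can also be written with `httr` directly, which makes the HTTP status check explicit. This is only a sketch under the assumption that `httr` is installed; `baca_dengan_ua()` is a hypothetical helper name, not part of the notebook:

```r
library(httr)
library(rvest)

# hypothetical helper: fetch a page with a custom user agent, then parse it
baca_dengan_ua = function(url, uastring){
  respons = GET(url, user_agent(uastring))
  stop_for_status(respons)  # raise an error on 4xx/5xx responses
  read_html(content(respons, as = "text", encoding = "UTF-8"))
}

# usage (needs network access):
# halaman = baca_dengan_ua(url, uastring)
# halaman %>% html_nodes(".border-dashed div") %>% html_text()
```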