<a href="https://colab.research.google.com/github/ikanx101/G-Colab/blob/main/Shopee_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Shopee Scraper v2.0**

## Usage instructions

1. Upload a `.txt` file containing the links of the __Shopee__ products whose data you want to scrape.
1. Make sure each line of the `.txt` file contains exactly one product link.
1. Enter the full name of the `.txt` file, including the extension, in the field provided. For example: `link produk.txt`.
1. Select `Runtime` > `Run all` and wait for the run to finish.
1. When the run finishes, download the result file, `Hasil Scrape Shopee.csv`.

_Created by:_ [Ikang](https://ikanx101.com/)
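
As a rough illustration of steps 1 to 3, the sketch below writes a small input file and reads it back the way the notebook does. The file name `link produk.txt` is the example name from the instructions and the two links come from the sample vector in the notebook. The temporary directory and the names `links_contoh` and `baris_valid` are assumptions made only for this sketch.

```r
# Sketch only: build an input file with one product link per line.
links_contoh <- c(
  "https://shopee.co.id/Tropicana-Slim-Kecap-Manis-200Ml-i.12656836.95387848",
  "https://shopee.co.id/HiLo-Active-Chocolate-500-gr-i.12656836.1309066998"
)

# Assumption: a temporary folder stands in for the Colab upload folder.
nama_file <- file.path(tempdir(), "link produk.txt")
writeLines(links_contoh, nama_file)

# Read the file back; each line becomes one element of the vector.
links <- readLines(nama_file)

# Every line should hold one link containing the literal "-i." marker.
baris_valid <- grepl("-i.", links, fixed = TRUE)
print(length(links))
print(all(baris_valid))
```

If `all(baris_valid)` is `FALSE`, at least one line is not a product link and would be dropped by the notebook's filter.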

In [18]:
#@title Name of the `.txt` file
rm(list=ls())
nama_file <- "Contoh.txt" #@param {type:"string"}


In [19]:
#@title
library(jsonlite)
library(dplyr)

contoh = c("https://shopee.co.id/Tropicana-Slim-Kecap-Manis-200Ml-i.12656836.95387848",
           "https://shopee.co.id/Tropicana-Slim-Choco-Spread-300-gr-i.12656836.4134919883",
           "https://shopee.co.id/HiLo-Active-Chocolate-500-gr-i.12656836.1309066998",
           "https://shopee.co.id/Tropicana-Slim-Caffe-Latte-10's-i.12656836.95387852",
           "https://shopee.co.id/Tropicana-Slim-Sugar-Free-Cookies-Choco-200G-Tropicana-Slim-Hokkaido-Cheese-Cookies-100gr-i.12656836.6149828589",
           "https://shopee.co.id/HiLo-Thai-Tea-15-gr-10's-i.12656836.1389108883")

links = readLines(nama_file)

link = unique(links)

# Keep only product links, then split out the shop id and the item id.
# A product link ends in "-i.<shop id>.<item id>".
dummy = data.frame(id = seq_along(link),
                   url = link,
                   asli = link) %>% 
  filter(grepl('-i.',url,fixed = TRUE)) %>% 
  filter(!grepl("help",url)) %>% 
  # sep is a regular expression: the dot is escaped so that it matches
  # a literal "-i." and not, for example, "-in" or "-im"
  tidyr::separate(url,into = c('hapus','pakai'),sep = '-i\\.') %>% 
  tidyr::separate(pakai, into = c('info1','info2'),sep = '\\.') %>%
  # info1 holds the shop id, info2 holds the item id
  mutate(link_final = paste0('https://shopee.co.id/api/v2/item/get?itemid=',
                             info2,
                             '&shopid=',
                             info1)) %>% 
  filter(!is.na(info2))

url = dummy$link_final
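
The link-parsing step above can be sketched on a single URL with base R only. This is an illustration under assumptions, not the notebook's own code: the helper name `parse_shopee_link` is hypothetical, and it assumes product links end in `-i.<shop id>.<item id>`, which is the pattern the sample links follow.

```r
# Hypothetical helper (not part of the notebook): pull the shop id and
# item id out of a product link of the form ...-i.<shopid>.<itemid>
parse_shopee_link <- function(link) {
  # The dots are escaped so they match literal dots only.
  pola <- "-i\\.([0-9]+)\\.([0-9]+)"
  cocok <- regmatches(link, regexec(pola, link))[[1]]
  if (length(cocok) < 3) {
    return(NULL)  # not a product link
  }
  list(
    shopid = cocok[2],
    itemid = cocok[3],
    # Same API address pattern that the notebook builds.
    link_final = paste0("https://shopee.co.id/api/v2/item/get?itemid=",
                        cocok[3], "&shopid=", cocok[2])
  )
}

hasil <- parse_shopee_link(
  "https://shopee.co.id/Tropicana-Slim-Kecap-Manis-200Ml-i.12656836.95387848"
)
print(hasil$link_final)
```

Returning `NULL` for links that do not match plays the same role as the `filter()` calls in the notebook: such links are skipped.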


In [20]:
#@title
scrape_shopee = function(url){
  # read the JSON response
  tes = read_json(url)
  # take the third category level only if the item has one,
  # so that items with fewer levels do not raise an error
  kat = tes$item$categories
  kategori = if(length(kat) >= 3 && !is.null(kat[[3]]$display_name)) kat[[3]]$display_name else NA
  # build a one-row data frame
  data = data.frame(
    nama = tes$item$name,
    merek = ifelse(is.null(tes$item$brand),NA,tes$item$brand),
    harga_before_disc = tes$item$price_before_discount/100000,
    harga = tes$item$price/100000,
    terjual = tes$item$sold,
    lokasi = tes$item$shop_location,
    status = tes$item$item_status,
    kategori = kategori,
    link = url
  )
  return(data)
}

In [21]:
#@title
# scrape every link and stack the rows into one data frame;
# unlike a loop over 2:length(url), this also works for a single link
data = do.call(rbind, lapply(url, scrape_shopee))

data$waktu.scrape = Sys.time()
raw = distinct(data)

In [22]:
#@title
# cleaning
data_clean = 
  raw %>% 
  # standardise the brand from keywords in the product name;
  # products that match none of the keywords get NA
  mutate(merek = case_when(
    grepl("tropicana",nama,ignore.case = TRUE) ~ "Tropicana Slim",
    grepl("l-men",nama,ignore.case = TRUE) ~ "L-Men",
    grepl("nutri",nama,ignore.case = TRUE) ~ "NutriSari",
    grepl("teen",nama,ignore.case = TRUE) ~ "HiLo Teen",
    grepl("school",nama,ignore.case = TRUE) ~ "HiLo School",
    grepl("hilo",nama,ignore.case = TRUE) & grepl("rtd",nama,ignore.case = TRUE) ~ "HiLo RTD",
    grepl("hilo",nama,ignore.case = TRUE) & !grepl("teen|school|rtd",nama,ignore.case = TRUE) ~ "HiLo Active/Gold",
    grepl("lokala",nama,ignore.case = TRUE) ~ "Lokalate",
    grepl("wdan|dank|wedan",nama,ignore.case = TRUE) ~ "WDank"
  )) %>% 
  arrange(merek,nama,waktu.scrape) %>% 
  # discount is NA when there is no pre-discount price
  mutate(discount = ifelse(harga_before_disc > 0,
                           harga_before_disc - harga,
                           NA)) %>%
  # add 7 hours (presumably UTC to Western Indonesia Time, UTC+7)
  mutate(waktu.scrape = as.POSIXct(waktu.scrape) + 7*60*60)
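
The last step above shifts the timestamp by seven hours, presumably to turn the UTC clock of the Colab runtime into Western Indonesia Time (UTC+7). As a hedged alternative, base R can render the same instant in a named time zone instead of adding seconds by hand. The sketch assumes the `Asia/Jakarta` zone is what the author intended, and it uses a fixed example timestamp rather than scraped data; the names `waktu_utc`, `waktu_manual` and `waktu_wib` are illustrative.

```r
# Fixed example instant in UTC (illustrative, not scraped data).
waktu_utc <- as.POSIXct("2021-01-01 00:00:00", tz = "UTC")

# Manual approach, as in the notebook: add 7 hours' worth of seconds.
waktu_manual <- waktu_utc + 7 * 60 * 60

# Alternative: keep the instant and print it in the Asia/Jakarta zone.
waktu_wib <- format(waktu_utc, "%Y-%m-%d %H:%M:%S", tz = "Asia/Jakarta")

# Both lines print the same clock time.
print(format(waktu_manual, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
print(waktu_wib)
```

Unlike the manual shift, formatting with a named zone leaves the stored value unchanged, so the clock time stays correct even when the runtime is not on UTC.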

In [23]:
#@title
# write the cleaned data to a CSV file in the working directory
judul = "Hasil Scrape Shopee.csv"
write.csv(data_clean,judul)
print("-- DONE --")
print("The process has finished")

[1] "-- DONE --"
[1] "The process has finished"
