#Scrape data

Scrape data from the freelance platform [projects.co.id](https://projects.co.id/). The scrape covers the [web development](https://projects.co.id/public/browse_services/listing/6_website-development) service category.

##Import libraries

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import clear_output

##Scraping

###Preparing the page

Fetch the HTML of the first page of the web development service category.

In [None]:
url = 'https://projects.co.id/public/browse_services/listing/6_website-development'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

###Finding the number of pages

Take the last pagination button and read the value of its 'paramval' attribute to get the number of the last page.

In [None]:
pagination = soup.find_all('a', class_='ajax-url')
total_pages = int(pagination.pop().get('paramval'))
print(f"Number of pages: {total_pages}")

Number of pages: 129
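The lookup above can be tried offline on a small HTML fragment. This is only a sketch: the markup in `sample_html` is an assumption made up for illustration (the real page may be structured differently), and `soup_demo`, `links`, and `last_page` are names introduced here. It also adds a guard for the case where no pagination links are found.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup, assumed for illustration only
sample_html = """
<ul class="pagination">
  <li><a class="ajax-url" paramval="1">1</a></li>
  <li><a class="ajax-url" paramval="2">2</a></li>
  <li><a class="ajax-url" paramval="129">Last</a></li>
</ul>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")
links = soup_demo.find_all("a", class_="ajax-url")

# Fail with a clear message instead of an IndexError on an empty list
if not links:
    raise ValueError("No pagination links found; the page layout may have changed")

# The last link carries the number of the last page in 'paramval'
last_page = int(links[-1].get("paramval"))
print(last_page)
```

Indexing with `links[-1]` reads the last element without removing it from the list, unlike `pop()`.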


###Fetching all pages

Loop over every page number, fetch each page's HTML, and store the parsed pages in a list.

In [None]:
pages = []
for i in range(1, total_pages + 1):
  print(f"Fetching page {i}")
  url = f"https://projects.co.id/public/browse_services/listing/6_website-development?page={i}&ajax=1"
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  pages.append(soup)
  if i != total_pages:
    clear_output()

Fetching page 129
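With more than a hundred requests in a row, it may be worth pausing between them and keeping the fetching logic separate from the loop. The sketch below is one possible arrangement, not the notebook's method: `build_page_url`, `fetch_all`, `fake_fetch`, and the `delay_seconds` parameter are hypothetical names, and the one-second default is an arbitrary assumption. The fetcher is passed in as a function so the loop can be tried without network access.

```python
import time

# Base URL taken from the first request in this notebook
BASE_URL = "https://projects.co.id/public/browse_services/listing/6_website-development"


def build_page_url(page_number):
    """Build the listing URL for one page, using the same query string as the loop above."""
    return f"{BASE_URL}?page={page_number}&ajax=1"


def fetch_all(total_pages, fetch, delay_seconds=1.0):
    """Call fetch(url) for pages 1..total_pages, pausing between requests."""
    html_pages = []
    for page_number in range(1, total_pages + 1):
        html_pages.append(fetch(build_page_url(page_number)))
        # No need to wait after the final page
        if page_number != total_pages:
            time.sleep(delay_seconds)
    return html_pages


# Offline demonstration with a stand-in fetcher that just echoes the URL
def fake_fetch(url):
    return f"<html>{url}</html>"


demo_pages = fetch_all(3, fake_fetch, delay_seconds=0)
print(len(demo_pages))
```

In the notebook itself, `fetch` could be something like `lambda url: requests.get(url, timeout=30).text`, optionally calling `raise_for_status()` on the response first so that a failed request stops the loop instead of storing an error page.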


###Extracting the data for each service

Loop over every page to extract each field for each service. To find out which element holds a given piece of information, use the browser's Inspect Element tool.

In [None]:
# Prepare the lists that will hold the data for each field
services = ['Nama Layanan']
services_rating = ['Rating Layanan (n/10)']
services_sales = ['Jumlah Penjualan Layanan']
services_price = ['Harga Layanan (Rp)']
services_duration = ['Durasi Pengerjaan (hari)']
sellers = ['Nama Penjual']
sellers_rating = ['Rating Penjual (n/10)']
sellers_sales = ['Jumlah Penjualan Penjual']
descriptions = ['Deskripsi Layanan']

# Extract the data for each field
for page in pages:

  # Get the service title
  title_elements = page.find_all('h2')
  for title_element in title_elements:
    title = title_element.get_text()
    services.append(title)

  rating_sales_elements = page.find_all('div', class_='col-md-5 align-left')
  for rating_sales_element in rating_sales_elements:
    rating_sales_text = rating_sales_element.get_text()
    rating_sales_info = rating_sales_text.split()
    # Get the service rating
    rating = rating_sales_info[2].split('/')[0]
    services_rating.append(rating)
    # Get the number of sales of the service
    try:
      sales = rating_sales_info[3].split("(")[1]
    except IndexError:
      sales = 0
    services_sales.append(sales)

  price_duration_seller_selrating_elements = page.find_all('div', class_='col-md-7 align-left')
  for price_duration_seller_selrating_element in price_duration_seller_selrating_elements:
    price_duration_seller_selrating_text = price_duration_seller_selrating_element.get_text()
    price_duration_seller_selrating_info = price_duration_seller_selrating_text.split()
    # Get the service price (digits only)
    price_raw = price_duration_seller_selrating_info[1]
    price = ''.join(filter(str.isdigit, price_raw))
    services_price.append(price)
    # Get the service duration
    duration = price_duration_seller_selrating_info[2]
    services_duration.append(duration)
    # Get the seller name
    seller = price_duration_seller_selrating_info[5]
    sellers.append(seller)
    # Get the seller rating
    rating = price_duration_seller_selrating_info[6].split('/')[0]
    sellers_rating.append(rating)
    # Get the number of sales of the seller
    try:
      sales = price_duration_seller_selrating_info[7].split("(")[1]
    except IndexError:
      sales = 0
    sellers_sales.append(sales)

  # Get the service description; only the odd-indexed matches are used
  potensial_description_elements = page.find_all('div', class_='col-md-12 align-left')
  for index, potensial_description_element in enumerate(potensial_description_elements):
    if (index % 2) == 1:
      description_element = potensial_description_element.find('p')
      description = description_element.get_text()
      descriptions.append(description)
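The positional parsing used in the cell above can be checked in isolation on plain strings. The sketch below mirrors the same steps (split on whitespace, take the token at a fixed position, fall back to 0 when the sales token is missing). The sample strings are assumptions invented for illustration and are not the site's actual text; `parse_rating_and_sales` and `digits_only` are hypothetical helper names.

```python
def parse_rating_and_sales(text):
    """Return (rating, sales) from a whitespace-separated text, by token position."""
    tokens = text.split()
    # The third token is assumed to look like '9.89/10.00'
    rating = tokens[2].split('/')[0]
    try:
        # The fourth token is assumed to look like '(28'
        sales = tokens[3].split('(')[1]
    except IndexError:
        # Token missing or without '(': treat as no sales
        sales = 0
    return rating, sales


def digits_only(raw):
    """Keep only the digit characters of a raw price string."""
    return ''.join(filter(str.isdigit, raw))


# Hypothetical sample inputs, assumed for illustration only
with_sales = parse_rating_and_sales("Rating : 9.89/10.00 (28 reviews)")
without_sales = parse_rating_and_sales("Rating : 0.00/10.00")
price = digits_only("Rp300,000")

print(with_sales, without_sales, price)
```

Because the extraction depends on token positions, any change in the wording of the page shifts the indices, so spot-checking a few parsed rows against the site is a reasonable precaution.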

##Combining the data for each field into a DataFrame

In [None]:
combined = list(zip(services, services_rating, services_sales, services_price, services_duration, sellers, sellers_rating, sellers_sales, descriptions))
df = pd.DataFrame(combined, columns=['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8'])
df.head()

Unnamed: 0,c0,c1,c2,c3,c4,c5,c6,c7,c8
0,Nama Layanan,Rating Layanan (n/10),Jumlah Penjualan Layanan,Harga Layanan (Rp),Durasi Pengerjaan (hari),Nama Penjual,Rating Penjual (n/10),Jumlah Penjualan Penjual,Deskripsi Layanan
1,Jasa Pembuatan Diagram Wireframe Web / App...,0.00,0,300000,7,ilyasrii.ahmadr96,0.00,0,BACA DESKRIPSI SEBELUM ORDER! Kami m...
2,Web Desain,0.00,0,3500000,30,nonamefacecompany,0.00,0,Jasa Pembuatan Website Profesional ...
3,Program Aplikasi & Web,0.00,0,3000000,120,nonamefacecompany,0.00,0,JASA PEMBUATAN PROGRAM APLIKASI & WEB ...
4,Jasa Pembuatan Domain IP (Website Dengan I...,10.00,3,260000,2,siamangaja,9.89,28,Apakah bisa membuat website dengan ip add...
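Note that `zip` stops at the shortest input, so if one field list ended up shorter than the others (for example because an element was missing on some page), rows would be dropped silently. A minimal sketch of a length check that could run before combining the lists follows; `check_equal_lengths` and the `demo_` lists are hypothetical names introduced here, with the demo values borrowed from the output above.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length; return the shared length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Small stand-in lists; in the notebook these would be the nine field lists
demo_services = ['Nama Layanan', 'Web Desain']
demo_sellers = ['Nama Penjual', 'nonamefacecompany']

row_count = check_equal_lengths(services=demo_services, sellers=demo_sellers)
print(row_count)
```

The error message lists the length of every column, which makes it easier to see which field fell out of step.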


##Exporting the data to CSV

In [None]:
df.to_csv('website_development_services_data.csv', index=False)