## Tools Preparation
Scrapping is basically one of the way to retrieve the data and this process is very important to know as a data scientist since sometimes we cannot get data easily as we querying the data from the database or download Kaggle. We're going to scrape Gramedia.com in this lesson using Beautifulsoup. Before we're going further, please install beautifulsoup.

To install beautifulsoup, you may run one of the following commands on Anaconda Prompt (Windows) or Terminal (Linux/Mac/VSCode):

```
pip install bs4 selenium
```

and also you need to install requests to acces a web address by running:

```
pip install requests
```

### Selenium WebDriver

Selenium WebDriver is a powerful tool for automating browser interactions and testing web applications. It provides a programming interface to control browser behavior and perform actions such as clicking buttons, filling forms, and navigating through web pages.

To get started with Selenium WebDriver for different browsers, you'll need to ensure that you have the appropriate browser drivers installed and set up correctly. Each browser requires its specific driver to communicate with Selenium.

1. Google Chrome:
   - You need to download the ChromeDriver executable and place it in a location that is in your system's PATH.
   - Official ChromeDriver download page: https://sites.google.com/chromium.org/driver/

2. Safari:
   - SafariDriver is automatically installed with Safari on macOS.
   - To enable it, go to Safari preferences, then to the 'Advanced' tab, and check the "Show Develop menu in menu bar" option.
   - After that, in the Develop menu, go to "Allow Remote Automation" to enable SafariDriver.

3. Firefox:
   - You need to download the geckodriver executable and place it in a location that is in your system's PATH.
   - Official geckodriver download page: https://github.com/mozilla/geckodriver/releases

4. Microsoft Edge:
   - For Microsoft Edge (Chromium-based version), you need to download the Microsoft Edge Driver (also known as MSEdgeDriver) and place it in a location that is in your system's PATH.
   - Official MSEdgeDriver download page: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

Once you have set up the appropriate drivers, you can use Selenium WebDriver in your preferred programming language (Python, Java, C#, etc.) to automate interactions with the browsers.

Here's the codes of how to define Selenium WebDriver with Python for Chrome, Safari, Firefox, and Microsoft Edge:

```python
from selenium import webdriver


# Create a new instance of the Chrome browser
driver = webdriver.Chrome("/path/to/chromedriver")

# Create a new instance of the Safari browser
driver = webdriver.Safari()

# Create a new instance of the Firefox browser
driver = webdriver.Firefox("/path/to/geckodriver")

# Create a new instance of the Microsoft Edge browser
driver = webdriver.Edge("/path/to/msedgedriver")
```

Similarly, you can use WebDriver with other browsers by using the appropriate driver for each browser and modifying the setup accordingly.

Always ensure you are using the latest versions of Selenium WebDriver and browser drivers to avoid compatibility issues. You can check the Selenium official website (https://www.selenium.dev/) and the respective browser driver download pages for updates and documentation.

## Basic Web Component

The website that you are scraping in this lesson contains several components. Those are:
- HTML — the main content of the page.
- CSS — used to add styling to make the page look nicer.
- JS — Javascript files add interactivity to web pages.
- Images — image formats, such as JPG and PNG, allow web pages to show pictures.

There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look primarily at the HTML. Hence, you need to know some HTML structure to ease your scraping works. But don't worry, you don't need to dive in deeply into it.

### HTML Structure

HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python, though. It’s a markup language that tells a browser how to display content.

HTML has many functions that are similar to what you might find in a word processor like Microsoft Word — it can make text bold, create paragraphs, and so on.

Below an example of HTML structure:

```html
<HTML>
    <HEAD>
        <TITLE>My cool title</TITLE>
    </HEAD>
    <BODY>
        <H1>This is a Header</H1>
        <ul id="list" class="coolList">
            <li>item 1</li>
            <li>item 2</li>
            <li>item 3</li>
        </ul>
    </BODY>
</HTML>
```

- The red items are called as tag or element. Usually, tag follows "<".
- HTML, HEAD, and BODY are the main elements and the rests are the content. For your attention, we will focus on the contents.
- The orange items are attribut that give information about the tag.
- The blue texts are the attribute value.


## Accessing the Web

Now, we will access https://www.gramedia.com/categories/buku for this lesson. Before we go further, we need to understand how to access the url in Python. To do it, we use requests library.

In [1]:
pip install -q bs4 selenium

Note: you may need to restart the kernel to use updated packages.


In [1]:
import requests
requests.get("https://www.gramedia.com/categories/buku")

<Response [200]>

If you see the output is <Response [200]>, then you are success to access the url. "200" refers to HTTP status codes. You can read https://id.wikipedia.org/wiki/Daftar_kode_status_HTTP for further explaination.

Now, you can check the HTML content of the page in Python. However, you can also check it on your browser by right click and choose Inspect element to ease your understanding od the web structure.

Above is the HTML structure that Python successfully access. We need to parsing the structure using Beautifulsoup to make it clear and accessible to scrape.

In [2]:
from bs4 import BeautifulSoup
from selenium import webdriver

# Instance webdriver
driver = webdriver.Chrome()

# Nunjukin akan ambil data darimana/tautan/link
url = "https://www.gramedia.com/categories/buku"

# Nyuruh webdriver buat akses tautannya
driver.get(url)

# Kerangka HMTL-nya disimpan dalam variabel `html` tapi masih satu baris
html = driver.page_source

In [3]:
html

'<html lang="id" class=""><head>\n    <!-- Google Tag Manager -->\n    <script type="text/javascript" async="" src="https://app.yellowmessenger.com/widget/main.js"></script><script type="text/javascript" async="" src="https://www.google-analytics.com/plugins/ua/ec.js"></script><script type="text/javascript" async="" src="https://analytics.tiktok.com/i18n/pixel/static/identify_7bf75739.js"></script><script async="" src="//cdnt.netcoresmartech.com/webactivity/ADGMOT35CHFLVDHBJNIG50K96B33CLCBGFVBD608PHJ3ICES10U0.js"></script><script async="" src="//cdnt.netcoresmartech.com/webp/ADGMOT35CHFLVDHBJNIG50K96B33CLCBGFVBD608PHJ3ICES10U0_webp.js"></script><script async="" src="//cdnt.netcoresmartech.com/webactivity/ADGMOT35CHFLVDHBJNIG50K96B33CLCBGFVBD608PHJ3ICES10U0.js"></script><script async="" src="//cdnt.netcoresmartech.com/webactivity/ADGMOT35CHFLVDHBJNIG50K96B33CLCBGFVBD608PHJ3ICES10U0.js"></script><script type="text/javascript" async="" src="https://analytics.tiktok.com/i18n/pixel/static/m

In [5]:
# Variabel `html` tadi di-prettify agar lebih jelas lagi strukturnya
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify()[:1200])

<html class="" lang="id">
 <head>
  <!-- Google Tag Manager -->
  <script async="" src="https://app.yellowmessenger.com/widget/main.js" type="text/javascript">
  </script>
  <script async="" src="https://www.google-analytics.com/plugins/ua/ec.js" type="text/javascript">
  </script>
  <script async="" src="https://analytics.tiktok.com/i18n/pixel/static/identify_7bf75739.js" type="text/javascript">
  </script>
  <script async="" src="//cdnt.netcoresmartech.com/webactivity/ADGMOT35CHFLVDHBJNIG50K96B33CLCBGFVBD608PHJ3ICES10U0.js">
  </script>
  <script async="" src="//cdnt.netcoresmartech.com/webp/ADGMOT35CHFLVDHBJNIG50K96B33CLCBGFVBD608PHJ3ICES10U0_webp.js">
  </script>
  <script async="" src="//cdnt.netcoresmartech.com/webactivity/ADGMOT35CHFLVDHBJNIG50K96B33CLCBGFVBD608PHJ3ICES10U0.js">
  </script>
  <script async="" src="//cdnt.netcoresmartech.com/webactivity/ADGMOT35CHFLVDHBJNIG50K96B33CLCBGFVBD608PHJ3ICES10U0.js">
  </script>
  <script async="" data-id="CL0A31RC77U3K90HMAA0" src="htt

<img src="https://i.ibb.co/vsz2M33/message-Image-1636690176458.jpg"></img>

Let that we want to retrieve the books' title, so let's check the title position on the HTML out using Inspect element!

We know that based on the Inspect element, the books' title lie on this code:

```html
<div _ngcontent-web-gramedia-c53="" class="list-title">Creepy Case Club 4: Kasus Pohon Pemanggil</div>
```

"Creepy Case Club 4: Kasus Pohon Pemanggil" located at **div** tag with attribute **class** and the value of "*list-title*". So we will use the information to inform the soup where the titles exist.

So we need to find all div elements that contain attribute class and value "list-title".

To do that, we use ```soup.find_all("<element>",{"<attribute>":"<attribute value>"})```

In [6]:
# Akses SATU judul pertama dalam web tersebut
soup.find('div',{"class":"list-title"}).get_text()

'Solo Leveling 6'

In [7]:
var1 = soup.find('div',{"class":"list-title"})
var1.get_text()

'Solo Leveling 6'

In [8]:
# Akses SEMUA judul dalam web tersebut
soup.find_all('div',{"class":"list-title"})

[<div _ngcontent-web-gramedia-c26="" class="list-title">Solo Leveling 6</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Frieren : After the End 02</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Muhammad The Messenger: Periode Mekah</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Haikyu!! - Fly High! volleyball! 14</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Youth X Machinegun Aoharu x Kikanju 09</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">KOLONI 5 Menit Sebelum Tayang Vol. 2</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Ketika Ibu Kami Marah</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Happy Coloring : Untuk Anak 3 Tahun</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Moriarty the Patriot 13</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Baby Shark Buku Aktivitas dan Mewarnai Sweet Dessert</div>,
 <div _ngcontent-web-gramedia-c26="" class

We see that the soup found all div elements that contain attribute class and value "list-title" but we need the title text only. To extract it, just add .get_text() method to each list element.

In [14]:
for i in soup.find_all('div',{"class":"list-title"}):
    print(i.get_text())

Solo Leveling 6
Frieren : After the End 02
Muhammad The Messenger: Periode Mekah
Haikyu!! - Fly High! volleyball! 14
Youth X Machinegun Aoharu x Kikanju 09
KOLONI 5 Menit Sebelum Tayang Vol. 2
Ketika Ibu Kami Marah
Happy Coloring : Untuk Anak 3 Tahun
Moriarty the Patriot 13
Baby Shark Buku Aktivitas dan Mewarnai Sweet Dessert
Kuromi Buku Mewarnai City Of Lights
Komik Next G Edisi Spesial: Seharian Bareng Bestie
Tetangga Kanibal
Fairy Tail 100 years Quest 07
Fairy Tail : 100 years Quest Vol. 6
KUHP & KUHAP : Berdasarkan UU RI No. 1 Tahun 2023 dan UU RI No. 8 Tahun 1981
One-Stop English Vocabulary : Terlengkap dan Terupdate
Revolusi Industri 5.0
Undang-Undang Kesehatan : Undang-Undang Nomor 17 Tahun 2023
Bill Gates Perjalanan Sang Visioner


In [16]:
for i in soup.find_all("p", {"class":"formats-price"}):
    print(i.get_text())

Rp 92.000
Rp 36.000
Rp 212.000
Rp 22.400
Rp 45.000
Rp 44.000
Rp 59.500
Rp 53.000
Rp 25.600
Rp 25.500
Rp 25.500
Rp 53.100
Rp 57.500
Rp 32.000
Rp 32.000
Rp 53.500
Rp 55.500
Rp 59.500
Rp 55.500
Rp 45.500


In [30]:
# Inisiasi wadah kosong
list_author = []

# Looping untuk akses nama author
for i in soup.find_all("span", {"class":"list-author"}):
    # Masukin setiap data dalam list kita
    # list_author += i.get_text()
    # list_author = list_author + i.get_text()
    try :
        #Kalo ada author 
        list_author.append(i.get_text().strip())

    except:
        #kalo tidak ada author
        list_author.append("No author")

list_author

['CHUGONG',
 'Kanehito Yamada',
 'Samih Athif Az-Zain',
 'Haruichi Furudate',
 '',
 'Matto Haq',
 'Ockto Barimbing',
 'Qin Zi',
 'Herdhika Puspitasari',
 'Miyoshi Hikaru',
 'The Pinkfong Company',
 'Sanrio Co, Ltd',
 'Syifa Tsabita Wiangga, Dkk',
 'Erby S',
 'HIRO MASHIMA,ATSUO UEDA',
 'HIRO MASHIMA,ATSUO UEDA',
 '',
 'Deviana Maria Ulfa',
 'Jewellius Kistom M.',
 '',
 'Hasna Wijayati']

In [31]:
for i in soup.find_all("span", {"class":"list-author"}):
    print(i.get_text())

 CHUGONG  
 Kanehito Yamada 
 Samih Athif Az-Zain 
 Haruichi Furudate 
 
 Matto Haq 
 Ockto Barimbing 
 Qin Zi 
 Herdhika Puspitasari 
 Miyoshi Hikaru 
 The Pinkfong Company 
 Sanrio Co, Ltd 
 Syifa Tsabita Wiangga, Dkk 
 Erby S 
 HIRO MASHIMA,ATSUO UEDA 
 HIRO MASHIMA,ATSUO UEDA 
 
 Deviana Maria Ulfa 
 Jewellius Kistom M. 
 
 Hasna Wijayati 


It is easy, isn't it?

Next, we will do more. Our task is to get information about Title, Author, Price, Link to the book's page, and link refers to image.

Based on the Inspect element, we know that those information locate on:
- Title: ```<div _ngcontent-web-gramedia-c26="" class="list-title">Frieren : After the End 02</div>```
- Author: ```<span _ngcontent-web-gramedia-c26="" class="list-author ng-star-inserted"> Kanehito Yamada </span>```
- Price: ```<p _ngcontent-web-gramedia-c26="" class="formats-price">Rp 36.000</p>```
- Link: ```<a _ngcontent-web-gramedia-c26 href="/products/frieren-after-the-end-02">```
- Image: ```<img _ngcontent-web-gramedia-c36="" class="product-list-img ng-star-inserted ng-lazyloaded" src="https://cdn.gramedia.com/uploads/items/FRIEREN-2-COV-Ina__w149_hauto.jpg" alt="Frieren : After the End 02">```

Let's we wrap up the code and then input the data into Pandas DataFrame.

In [27]:
list_title = [title.get_text() for title in soup.find_all( 'div', {"class":"list-title"} )]

In [28]:
list_title

['Solo Leveling 6',
 'Frieren : After the End 02',
 'Muhammad The Messenger: Periode Mekah',
 'Haikyu!! - Fly High! volleyball! 14',
 'Youth X Machinegun Aoharu x Kikanju 09',
 'KOLONI 5 Menit Sebelum Tayang Vol. 2',
 'Ketika Ibu Kami Marah',
 'Happy Coloring : Untuk Anak 3 Tahun',
 'Moriarty the Patriot 13',
 'Baby Shark Buku Aktivitas dan Mewarnai Sweet Dessert',
 'Kuromi Buku Mewarnai City Of Lights',
 'Komik Next G Edisi Spesial: Seharian Bareng Bestie',
 'Tetangga Kanibal',
 'Fairy Tail 100 years Quest 07',
 'Fairy Tail : 100 years Quest Vol. 6',
 'KUHP & KUHAP : Berdasarkan UU RI No. 1 Tahun 2023 dan UU RI No. 8 Tahun 1981',
 'One-Stop English Vocabulary : Terlengkap dan Terupdate',
 'Revolusi Industri 5.0',
 'Undang-Undang Kesehatan : Undang-Undang Nomor 17 Tahun 2023',
 'Bill Gates Perjalanan Sang Visioner']

In [32]:
list_author = [author.get_text().strip() for author in soup.find_all( 'span', {"class":"list-author"} )]

list_author

['CHUGONG',
 'Kanehito Yamada',
 'Samih Athif Az-Zain',
 'Haruichi Furudate',
 '',
 'Matto Haq',
 'Ockto Barimbing',
 'Qin Zi',
 'Herdhika Puspitasari',
 'Miyoshi Hikaru',
 'The Pinkfong Company',
 'Sanrio Co, Ltd',
 'Syifa Tsabita Wiangga, Dkk',
 'Erby S',
 'HIRO MASHIMA,ATSUO UEDA',
 'HIRO MASHIMA,ATSUO UEDA',
 '',
 'Deviana Maria Ulfa',
 'Jewellius Kistom M.',
 '',
 'Hasna Wijayati']

In [36]:
for img in soup.find_all( 'img',{"class":"product-list-img"} ):
    #slicing setelah class 'src'
    print(img['src'])

https://cdn.gramedia.com/uploads/picture_meta/2024/6/21/3847kgwiyhubaaragwwgrt__w149_hauto.jpg
https://cdn.gramedia.com/uploads/items/FRIEREN-2-COV-Ina__w149_hauto.jpg
https://cdn.gramedia.com/uploads/picture_meta/2024/3/10/endv8cysnir2ndmzmwbbqw__w149_hauto.jpg
https://cdn.gramedia.com/uploads/items/9786024289003_Haikyu---Fl__w149_hauto.jpg
https://cdn.gramedia.com/uploads/products/3tjnu-2i3h__w149_hauto.jpg
https://cdn.gramedia.com/uploads/items/5MENITVol02_-_cover_-_CU__w149_hauto.jpg
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png
/assets/default-images/product.png


In [38]:
soup.find_all( 'div',{"class":"ng-star-inserted"} )

[<div class="mat-form-field-underline ng-tns-c29-0 ng-star-inserted"><span class="mat-form-field-ripple"></span></div>,
 <div class="mat-form-field-hint-wrapper ng-tns-c29-0 ng-trigger ng-trigger-transitionMessages ng-star-inserted" style="opacity:1;transform:translateY(0%);0:opacity;1:transform;opacity:1;transform:translateY(0%);webkit-opacity:1;webkit-transform:translateY(0%);"><!-- --><div class="mat-form-field-hint-spacer"></div></div>,
 <div class="mat-form-field-underline ng-tns-c29-8 ng-star-inserted"><span class="mat-form-field-ripple"></span></div>,
 <div class="mat-form-field-hint-wrapper ng-tns-c29-8 ng-trigger ng-trigger-transitionMessages ng-star-inserted" style="opacity:1;transform:translateY(0%);0:opacity;1:transform;opacity:1;transform:translateY(0%);webkit-opacity:1;webkit-transform:translateY(0%);"><!-- --><div class="mat-form-field-hint-spacer"></div></div>,
 <div _ngcontent-web-gramedia-c26="" class="ng-star-inserted"><a _ngcontent-web-gramedia-c26="" href="/product

In [20]:
soup.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href']

'/help'

In [21]:
for i in soup.find_all( 'p', {"class":"div-author"} ):
    print(i.get_text())

 Kanehito Yamada 
 Samih Athif Az-Zain 
 Haruichi Furudate 
 
 Matto Haq  Ockto Barimbing 
 Qin Zi 
 Herdhika Puspitasari 
 Miyoshi Hikaru 
 The Pinkfong Company 
 Sanrio Co, Ltd 
 Syifa Tsabita Wiangga, Dkk 
 Erby S 
 HIRO MASHIMA,ATSUO UEDA 
 HIRO MASHIMA,ATSUO UEDA 
 
 Deviana Maria Ulfa 
 Jewellius Kistom M. 
 
 Hasna Wijayati 
 Heru Kurniawan dan Fitri Septianti 


In [44]:
soup.find_all('a',{"_ngcontent-web-gramedia-c26":""})[1]

<a _ngcontent-web-gramedia-c0="" class="header-logo" href="/" title="Gramedia.com"><img _ngcontent-web-gramedia-c0="" alt="Gramedia Logo" src="/assets/gramedia-icon-2.png"/></a>

In [39]:
import pandas as pd

# Inisiasi dataframe kosong untuk menampung hasil scraping
data = pd.DataFrame()

# Bikin kolom baru (Title) yang menangkap lopping scrape judul buku
data['Title'] = [ title.get_text() for title in soup.find_all( 'div', {"class":"list-title"} ) ]
data['Author'] = [ author.get_text() for author in soup.find_all( 'p', {"class":"div-author"} ) ]
data['Price'] = [ price.get_text() for price in soup.find_all( 'p', {"class":"formats-price"} ) ]
data['Image'] = [ img['src'] for img in soup.find_all( 'img',{"class":"product-list-img"} ) ]

links = []
for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
    try:
        links.append("https://www.gramedia.com"+tag.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href'])
    except:
        pass

data['Link'] = links

data

Unnamed: 0,Title,Author,Price,Image,Link
0,Solo Leveling 6,CHUGONG,Rp 92.000,https://cdn.gramedia.com/uploads/picture_meta/...,https://www.gramedia.com/products/solo-leveling-6
1,Frieren : After the End 02,Kanehito Yamada,Rp 36.000,https://cdn.gramedia.com/uploads/items/FRIEREN...,https://www.gramedia.com/products/frieren-afte...
2,Muhammad The Messenger: Periode Mekah,Samih Athif Az-Zain,Rp 212.000,https://cdn.gramedia.com/uploads/picture_meta/...,https://www.gramedia.com/products/muhammad-the...
3,Haikyu!! - Fly High! volleyball! 14,Haruichi Furudate,Rp 22.400,https://cdn.gramedia.com/uploads/items/9786024...,https://www.gramedia.com/products/haikyu-fly-h...
4,Youth X Machinegun Aoharu x Kikanju 09,,Rp 45.000,https://cdn.gramedia.com/uploads/products/3tjn...,https://www.gramedia.com/products/youth-x-mach...
5,KOLONI 5 Menit Sebelum Tayang Vol. 2,Matto Haq Ockto Barimbing,Rp 44.000,https://cdn.gramedia.com/uploads/items/5MENITV...,https://www.gramedia.com/products/koloni-5-men...
6,Ketika Ibu Kami Marah,Qin Zi,Rp 59.500,/assets/default-images/product.png,https://www.gramedia.com/products/ketika-ibu-k...
7,Happy Coloring : Untuk Anak 3 Tahun,Herdhika Puspitasari,Rp 53.000,/assets/default-images/product.png,https://www.gramedia.com/products/happy-colori...
8,Moriarty the Patriot 13,Miyoshi Hikaru,Rp 25.600,/assets/default-images/product.png,https://www.gramedia.com/products/moriarty-the...
9,Baby Shark Buku Aktivitas dan Mewarnai Sweet D...,The Pinkfong Company,Rp 25.500,/assets/default-images/product.png,https://www.gramedia.com/products/baby-shark-b...


## Multipage

Currently, we are working on a page. However, the rest of the web consist of more pages like below:

<img src="https://i.ibb.co/CQ6JQLv/message-Image-1636716930335.jpg"></img>

If we look at the next page such as page 2, we can see that the url change to https://www.gramedia.com/categories/buku?page=2 and page 3: https://www.gramedia.com/categories/buku?page=3. Then we know each page has a numbering format on url so we can access many pages one time automatically using loop. We exclude the image since image loader is very depended on your connection. Let's check the code below.

In [45]:
for i in range(1,3):
    print(i)

1
2


In [47]:
from time import sleep

# Inisiasi wadah kosong
title = []
author = []
price = []
image = []
Links = []

# Instance webdriver
driver = webdriver.Chrome()

# Looping akses beberapa halaman
for i in range(1,4):
    # Ngasih tau tautan/link
    url="https://www.gramedia.com/categories/buku?page={}".format(i)
    sleep(1)
    # url=f"https://www.gramedia.com/categories/buku?page={i}"
    # Nyusuh driver akses web
    driver.get(url)

    # Menambah agar scroll otomatis
    for scroll in range(6):    # Scroll n-kali
        driver.execute_script("window.scrollBy(0,250)") # Sekali scroll 250 pixel
        sleep(1)

    # Ambil html dan prettify 
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")

    # Ambil data masing-masing elemen
    title += [ title.get_text() for title in soup.find_all( 'div', {"class":"list-title"} ) ]
    author += [ author.get_text() for author in soup.find_all( 'p', {"class":"div-author"} ) ]
    price += [ price.get_text() for price in soup.find_all( 'p', {"class":"formats-price"} ) ]
    image += [ img['src'] for img in soup.find_all( 'img',{"class":"product-list-img"} ) ]

    links = []
    for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
        try:
            links.append("https://www.gramedia.com"+tag.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href'])
        except:
            pass
    Links += links

# Menutup webdriver
driver.close()

# Bikin dataframe dari masing-masing list hasil scrape
data_multipage = pd.DataFrame()
data_multipage['Title'] = title
data_multipage['Author'] = author
data_multipage['Price'] = price
data_multipage['Image'] = image
data_multipage['Link'] = Links

# Dipanggil dataframenya
data_multipage

Unnamed: 0,Title,Author,Price,Image,Link
0,Solo Leveling 6,CHUGONG,Rp 92.000,https://cdn.gramedia.com/uploads/picture_meta/...,https://www.gramedia.com/products/solo-leveling-6
1,Frieren : After the End 02,Kanehito Yamada,Rp 36.000,https://cdn.gramedia.com/uploads/items/FRIEREN...,https://www.gramedia.com/products/frieren-afte...
2,Muhammad The Messenger: Periode Mekah,Samih Athif Az-Zain,Rp 212.000,https://cdn.gramedia.com/uploads/picture_meta/...,https://www.gramedia.com/products/muhammad-the...
3,Haikyu!! - Fly High! volleyball! 14,Haruichi Furudate,Rp 22.400,https://cdn.gramedia.com/uploads/items/9786024...,https://www.gramedia.com/products/haikyu-fly-h...
4,Youth X Machinegun Aoharu x Kikanju 09,,Rp 45.000,https://cdn.gramedia.com/uploads/products/3tjn...,https://www.gramedia.com/products/youth-x-mach...
5,"Ayo, Belajar Wudhu",Laksmi P Manohara,Rp 89.000,https://cdn.gramedia.com/uploads/products/y5a1...,https://www.gramedia.com/products/ayo-belajar-...
6,KOLONI 5 Menit Sebelum Tayang Vol. 2,Matto Haq Ockto Barimbing,Rp 44.000,https://cdn.gramedia.com/uploads/items/5MENITV...,https://www.gramedia.com/products/koloni-5-men...
7,Ketika Ibu Kami Marah,Qin Zi,Rp 59.500,https://cdn.gramedia.com/uploads/products/7i91...,https://www.gramedia.com/products/ketika-ibu-k...
8,Aku Pintar Membaca & Berkarakter Hebat,Kak Kaysa,Rp 50.000,https://cdn.gramedia.com/uploads/products/-ram...,https://www.gramedia.com/products/aku-pintar-m...
9,Happy Coloring : Untuk Anak 3 Tahun,Herdhika Puspitasari,Rp 53.000,https://cdn.gramedia.com/uploads/products/8r0a...,https://www.gramedia.com/products/happy-colori...


## Accessing Individual Page

<img src="https://i.ibb.co/F8D5bCy/message-Image-1637134633305.jpg"></img>

Suppose that we want to get more detail information about the books, but the information are on the individual page. So, we will access the individual page and scrape some information on it. We will catch title, author, price, description, number of pages, date of issue and publisher.

In [49]:

#Inisiasi wadah kosong
title = []
author = []
price = []
desc = []
num_pages = []
date_issue = []
publisher = []

driver = webdriver.Chrome()

# Looping akses berapa halaman
for i in range(1,2):
    url=f"https://www.gramedia.com/categories/buku?page={i}"
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    #setiap kali ganti halaman stop setengah detik
    sleep(0.5)

    # Looping untuk akses link
    for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
        try:
            link = "https://www.gramedia.com"+tag.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href']
            #nyuruh driver untuk akses masing masing link 
            driver.get(link)

            # Menambah agar scroll otomatis
            for scroll in range(6):    # Scroll n-kali
                driver.execute_script("window.scrollBy(0,250)") # Sekali scroll 250 pixel
                sleep(0.5)

            #menyimpan variabel baru untuk masing-masing html setiap produk
            html_ind = driver.page_source
            soup_ind = BeautifulSoup(html_ind, "html.parser")

            # Judul
            try:
                title.append( soup_ind.find( 'div', {"class":"book-title"} ).get_text() )
            except:
                title.append("No Title")
            # Penulis
            try:
                author.append( soup_ind.find('span', {"class":"title-author"}).get_text() )
            except:
                author.append("No Author")
            # Harga
            try:
                price.append(soup_ind.find('div', {'ins-init-condition':'#LnByaWNlLWZyb20='}).get_text())
            except:
                price.append("Rp. 0")
            # Deskripsi
            try:
                desc.append( soup_ind.find('div', {"class":"product-desc"}).get_text() )
            except:
                desc.append("No Description")
            # Jumlah Halaman
            try:
                num_pages.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[0].get_text() )
            except:
                num_pages.append("0")
            # Tanggal Publish
            try:
                date_issue.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[2].get_text() )
            except:
                date_issue.append("Dunno")
            # Penerbit
            try:
                publisher.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[1].get_text() )
            except:
                publisher.append("No Publisher")

        except:
            pass

driver.close()

In [53]:
pages = pd.DataFrame()
pages['Title'] = title
pages['Author'] = author
pages['Price'] = price
pages['Desc'] = desc
pages['Num Pages'] = num_pages
pages['Date Issue'] = date_issue
pages['Publisher'] = publisher

pages

Unnamed: 0,Title,Author,Price,Desc,Num Pages,Date Issue,Publisher
0,Solo Leveling 6,CHUGONG,Rp 92.000,“Semoga semua yang ingin kau lindungi hangus ...,496,13 Jul 2024,m&c!
1,Frieren: After the End 2,Kanehito Yamada,Rp. 0,Cerita dimulai mengikuti sekelompok orang yan...,0,Dunno,No Publisher
2,Muhammad The Messenger: Periode Mekah,Samih Athif Az-Zain,Rp 212.000,Berita gembira itu telah terbukti! Berita gem...,496,10 Mar 2024,Elex Media Komputindo
3,Haikyu!! - Fly High! volleyball! 14,Haruichi Furudate,Rp. 0,Perempat final pertandingan penentuan perwaki...,0,Dunno,No Publisher
4,Youth X Machinegun Aoharu x Kikanju 09,Naoe,Rp 45.000,“Aku sudah kehilangan segalanya… sudah tidak ...,194,13 Sep 2024,Elex Media Komputindo
5,"Ayo, Belajar Wudhu",Laksmi P Manohara,Rp 89.000,Sinopsis : \n\nHanif asyik bermain kelereng b...,24,16 Sep 2024,Dar! Mizan
6,KOLONI 5 Menit Sebelum Tayang Vol. 2,"Ockto Barimbing,",Rp 25.600,Banyak alas an Budi bertahan menjadi editor s...,168,21 Sep 2022,m&c!
7,Ketika Ibu Kami Marah,Qin Zi,Rp 59.500,Sinopsis :\n\nDidedikasikan untuk ibu kami ya...,40,15 Mei 2024,Penerbit Bestari
8,Aku Pintar Membaca & Berkarakter Hebat,Kak Kaysa,Rp 50.000,Sinopsis :\n\nApakah Anda ingin anak Anda tum...,68,15 Agt 2024,Jendela Penerbit
9,Happy Coloring : Untuk Anak 3 Tahun,Herdhika Puspitasari,Rp 53.000,Sinopsis :\n\nMewarnai adalah kegiatan yang s...,40,15 Sep 2024,C-Klik Media


In [54]:
soup_ind.find('div', {"class":"detail-section"}).find_all('p')

[<p>152</p>,
 <p><a href="/vendor/anak-hebat-indonesia">Anak Hebat Indonesia</a></p>,
 <p>15 Sep 2024</p>,
 <p>0.13 kg</p>,
 <p>9786235150727</p>,
 <p>14 cm</p>,
 <p>Indonesia</p>,
 <p>20cm</p>]

In [55]:

#Inisiasi wadah kosong
title = []
author = []
price = []
desc = []
num_pages = []
date_issue = []
publisher = []
weight = []
isbn=[]
width=[]
language=[]
length=[]

driver = webdriver.Chrome()

# Looping akses berapa halaman
for i in range(1,2):
    url=f"https://www.gramedia.com/categories/buku?page={i}"
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    #setiap kali ganti halaman stop setengah detik
    sleep(0.5)

    # Looping untuk akses link
    for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
        try:
            link = "https://www.gramedia.com"+tag.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href']
            #nyuruh driver untuk akses masing masing link 
            driver.get(link)

            # Menambah agar scroll otomatis
            for scroll in range(6):    # Scroll n-kali
                driver.execute_script("window.scrollBy(0,250)") # Sekali scroll 250 pixel
                sleep(0.5)

            #menyimpan variabel baru untuk masing-masing html setiap produk
            html_ind = driver.page_source
            soup_ind = BeautifulSoup(html_ind, "html.parser")

            # Judul
            try:
                title.append( soup_ind.find( 'div', {"class":"book-title"} ).get_text() )
            except:
                title.append("No Title")
            # Penulis
            try:
                author.append( soup_ind.find('span', {"class":"title-author"}).get_text() )
            except:
                author.append("No Author")
            # Harga
            try:
                price.append(soup_ind.find('div', {'ins-init-condition':'#LnByaWNlLWZyb20='}).get_text())
            except:
                price.append("Rp. 0")
            # Deskripsi
            try:
                desc.append( soup_ind.find('div', {"class":"product-desc"}).get_text() )
            except:
                desc.append("No Description")
            # Jumlah Halaman
            try:
                num_pages.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[0].get_text() )
            except:
                num_pages.append("0")
            # Tanggal Publish
            try:
                date_issue.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[2].get_text() )
            except:
                date_issue.append("Dunno")
            # Penerbit
            try:
                publisher.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[1].get_text() )
            except:
                publisher.append("No Publisher")

            # Bobot buku
            try:
                weight.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[3].get_text() )
            except:
                weight.append("0 kg")
            # ISBN
            try:
                isbn.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[4].get_text() )
            except:
                isbn.append("ILLEGAL")
            
            # Lebar buku
            try:
                width.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[5].get_text() )
            except:
                width.append("No Dimension")
            # Bahasa 
            try:
                language.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[5].get_text() )
            except:
                language.append("No Dimension")
            # Panjang buku
            try:
                length.append( soup_ind.find('div', {"class":"detail-section"}).find_all('p')[7].get_text() )
            except:
                length.append("No Dimension")

        except:
            pass

driver.close()

In [56]:
pages = pd.DataFrame()
pages['Title'] = title
pages['Author'] = author
pages['Price'] = price
pages['Desc'] = desc
pages['Num Pages'] = num_pages
pages['Date Issue'] = date_issue
pages['Publisher'] = publisher
pages['language'] = language
pages['weight'] = weight
pages['isbn'] = isbn
pages['width'] = width
pages['length'] = length

pages

Unnamed: 0,Title,Author,Price,Desc,Num Pages,Date Issue,Publisher,language,weight,isbn,width,length
0,Solo Leveling 6,CHUGONG,Rp 92.000,“Semoga semua yang ingin kau lindungi hangus ...,496,13 Jul 2024,m&c!,13 cm,0.395 kg,9786230314353,13 cm,20cm
1,Frieren: After the End 2,Kanehito Yamada,Rp. 0,Cerita dimulai mengikuti sekelompok orang yan...,200,20 Jul 2022,m&c!,11.4 cm,0.125 kg,9786230307881,11.4 cm,17.2cm
2,Muhammad The Messenger: Periode Mekah,Samih Athif Az-Zain,Rp 212.000,Berita gembira itu telah terbukti! Berita gem...,496,10 Mar 2024,Elex Media Komputindo,17.5 cm,0.5 kg,9786235331294,17.5 cm,25cm
3,Haikyu!! - Fly High! volleyball! 14,Haruichi Furudate,Rp 22.400,Perempat final pertandingan penentuan perwaki...,192,12 Sep 2018,m&c!,11.4 cm,0.11 kg,9786024289003,11.4 cm,17.2cm
4,Youth X Machinegun Aoharu x Kikanju 09,Naoe,Rp 45.000,“Aku sudah kehilangan segalanya… sudah tidak ...,194,13 Sep 2024,Elex Media Komputindo,12 cm,0.135 kg,9786230060373,12 cm,18cm
5,KOLONI 5 Menit Sebelum Tayang Vol. 2,"Ockto Barimbing,",Rp 25.600,Banyak alas an Budi bertahan menjadi editor s...,168,21 Sep 2022,m&c!,13.2 cm,0.165 kg,9786230308536,13.2 cm,20cm
6,Ketika Ibu Kami Marah,Qin Zi,Rp 59.500,Sinopsis :\n\nDidedikasikan untuk ibu kami ya...,40,15 Mei 2024,Penerbit Bestari,21 cm,0.18 kg,9786231361288,21 cm,29cm
7,Happy Coloring : Untuk Anak 3 Tahun,Herdhika Puspitasari,Rp 53.000,Sinopsis :\n\nMewarnai adalah kegiatan yang s...,40,15 Sep 2024,C-Klik Media,18 cm,0.16 kg,9786233571692,18 cm,24cm
8,Moriarty the Patriot 13,Miyoshi Hikaru,Rp 22.400,Disclaimer\r\nCerita dalam komik ini mengandu...,208,15 Mar 2022,Elex Media Komputindo,11.4 cm,0.11 kg,9786230031502,11.4 cm,17.2cm
9,Baby Shark Buku Aktivitas dan Mewarnai Sweet D...,The Pinkfong Company,Rp 25.500,Buku mewarnai ini berisi gambar-gambar menari...,20,24 Agt 2024,Pt. Adinata Melodi Kreasi,20 cm,0.1 kg,9786022437918,20 cm,28cm
