# BeautifulSoup
Task description:
- Goal: Build a program that scrapes quotes, their authors, and their associated tags from https://quotes.toscrape.com and saves the results as a CSV file.
- Data source: A web page listing quotes together with author and tag information.
- Tools: Python, the requests library for HTTP requests, the BeautifulSoup library for parsing HTML, and the Pandas library for data manipulation and CSV export.
- Steps:
  1. Use the requests library to send a GET request to the URL of the page containing the quotes.
  2. Use BeautifulSoup to extract the quotes, authors, and tags from the page's HTML structure.
  3. Store the extracted data in a suitable data structure (for example, a list or a DataFrame).
  4. Use Pandas to write the data to a CSV file with the `to_csv()` method.
- Output: A CSV file containing the quotes, authors, and associated tags.
- Validation: Check that the extracted data matches the expected structure and that the CSV file was created and contains the correct data.
- Conclusion: After the program runs, the quotes, authors, and tags have been retrieved from the page and saved to a CSV file, ready for further analysis or other uses.

In [35]:
import requests

url = "https://quotes.toscrape.com"

page = requests.get(url)
page.headers

{'Date': 'Wed, 01 May 2024 15:59:09 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '11054', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload'}
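The request above assumes the server answers promptly and successfully. A minimal sketch of a slightly more defensive fetch is shown here; the helper name `fetch_html` and the 10-second timeout are assumptions for illustration and are not part of the original notebook, while `timeout=` and `raise_for_status()` are standard requests features.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the decoded HTML at `url` (hypothetical helper, not from the notebook)."""
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request):
# html = fetch_html("https://quotes.toscrape.com")
```

Failing early on a bad status code keeps an error page from being parsed as if it contained quotes.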

In [36]:
page.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <sp

In [37]:
type(page.text)

str

In [38]:
type(page.content)

bytes
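The two cells above show that `page.text` is a `str` and `page.content` is `bytes`. The text form is the byte form decoded with the encoding requests infers for the response (the headers above declare `charset=utf-8`). The sketch below imitates that relationship with hand-made bytes taken from the first quote, so it is a stand-in and not a live response.

```python
# Bytes as they would arrive over the wire (a stand-in for page.content).
raw = "“The world as we have created it is a process of our thinking.”".encode("utf-8")

# Decoding with the declared charset (a stand-in for page.text).
decoded = raw.decode("utf-8")

print(type(raw).__name__, len(raw))
print(type(decoded).__name__, len(decoded))

# Decoding with the wrong charset does not raise here, but it garbles the curly quotes.
garbled = raw.decode("latin-1")
print(repr(garbled[:3]))
```

BeautifulSoup accepts either form; when given bytes it works out the encoding itself.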

In [39]:
from bs4 import BeautifulSoup
# print(type(page.content))

soup = BeautifulSoup(page.content, 'html.parser')

In [40]:
type(soup)

In [42]:
quotes = soup.find_all("div", class_="quote")  # find_all returns a ResultSet, which behaves like a list
type(quotes)

In [44]:
len(quotes)

10
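The count of 10 comes from matching every `div` whose class is `quote`. The sketch below, run on a tiny hand-written HTML string instead of the real page, illustrates that the result of `find_all` can be treated as a list and that the same match can be written as a CSS selector with `select`.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page: two quote blocks and one unrelated div.
html = """
<div class="quote"><span class="text">first</span></div>
<div class="quote"><span class="text">second</span></div>
<div class="footer">not a quote</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_find_all = soup.find_all("div", class_="quote")
by_selector = soup.select("div.quote")  # same match expressed as a CSS selector

print(isinstance(by_find_all, list))  # ResultSet is a subclass of list
print(len(by_find_all), len(by_selector))
print(by_find_all[0].get_text(strip=True))
```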

In [45]:
quotes[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
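The printed block shows where each field lives: the quote in `span.text`, the author in `small.author`, and the tags both as `a.tag` links and as a comma-separated `content` attribute on the `meta.keywords` element. The following sketch pulls the fields out of a trimmed copy of that block; reading the tags from the `meta` element is an alternative offered here for comparison, not the approach the notebook uses.

```python
from bs4 import BeautifulSoup

# The first quote block as printed above, trimmed to the parts used here.
html = """
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small></span>
<div class="tags">
<meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
"""
quote = BeautifulSoup(html, "html.parser").find("div", class_="quote")

author = quote.find("small", class_="author").get_text(strip=True)
text = quote.find("span", class_="text").get_text(strip=True)

# Option 1: read the visible tag links.
tags_from_links = [a.get_text(strip=True) for a in quote.find_all("a", class_="tag")]

# Option 2: split the comma-separated keywords attribute.
keywords = quote.find("meta", class_="keywords")["content"]
tags_from_meta = keywords.split(",") if keywords else []

print(author)
print(tags_from_links == tags_from_meta)
```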

In [52]:
words = []
authors = []
tags_list = []

In [53]:
import pandas as pd

for quote in quotes:
    author = quote.find("small", class_="author").text.strip() # Get the author's name
    text = quote.find("span", class_="text").text.strip() # Get the quote text

    # Get all tags
    tags = quote.find('div', class_='tags').find_all('a', class_='tag')
    tag_texts = [tag.text.strip() for tag in tags]

    # Append to the lists
    authors.append(author)
    words.append(text)
    tags_list.append(tag_texts)

# Build the DataFrame
df = pd.DataFrame({
    "authors": authors,
    "quotes": words,
    "tags": tags_list
})

print(df)

             authors                                             quotes  \
0    Albert Einstein  “The world as we have created it is a process ...   
1       J.K. Rowling  “It is our choices, Harry, that show what we t...   
2    Albert Einstein  “There are only two ways to live your life. On...   
3        Jane Austen  “The person, be it gentleman or lady, who has ...   
4     Marilyn Monroe  “Imperfection is beauty, madness is genius and...   
5    Albert Einstein  “Try not to become a man of success. Rather be...   
6         André Gide  “It is better to be hated for what you are tha...   
7   Thomas A. Edison  “I have not failed. I've just found 10,000 way...   
8  Eleanor Roosevelt  “A woman is like a tea bag; you never know how...   
9       Steve Martin  “A day without sunshine is like, you know, nig...   

                                             tags  
0        [change, deep-thoughts, thinking, world]  
1                            [abilities, choices]  
2  [inspirational,
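Because `words`, `authors`, and `tags_list` are created in a separate cell, re-running only the loop cell appends the same ten quotes again. One way to avoid that, sketched below under the assumption of a hypothetical helper `parse_quotes` (not part of the original notebook), is to build a fresh list of records on every call. The two-quote HTML string is a made-up stand-in so the sketch runs offline.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_quotes(soup: BeautifulSoup) -> list[dict]:
    """Return one dict per quote block (hypothetical helper, not from the notebook)."""
    records = []
    for quote in soup.find_all("div", class_="quote"):
        records.append({
            "authors": quote.find("small", class_="author").get_text(strip=True),
            "quotes": quote.find("span", class_="text").get_text(strip=True),
            "tags": [a.get_text(strip=True) for a in quote.find_all("a", class_="tag")],
        })
    return records


# Two-quote stand-in for the real page.
html = """
<div class="quote"><span class="text">“A”</span>
<small class="author">Author One</small>
<div class="tags"><a class="tag">x</a><a class="tag">y</a></div></div>
<div class="quote"><span class="text">“B”</span>
<small class="author">Author Two</small>
<div class="tags"></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Calling the function twice still yields two rows, because no state is shared between calls.
df = pd.DataFrame(parse_quotes(soup))
df = pd.DataFrame(parse_quotes(soup))
print(df.shape)
```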

In [54]:
# Save the DataFrame to a CSV file
df.to_csv('quotes_data.csv', index=False)

print("Data saved to 'quotes_data.csv'.")

Data saved to 'quotes_data.csv'.
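The task description lists a validation step, which the notebook does not carry out. A minimal sketch of one is given below, using a two-row stand-in DataFrame built from the authors and tags shown earlier and a temporary directory in place of the working directory. It also shows a detail worth knowing: a column of Python lists is written to CSV as text, so it needs a converter such as `ast.literal_eval` to come back as lists.

```python
import ast
import os
import tempfile

import pandas as pd

# Small stand-in for the scraped DataFrame (quote texts abbreviated as in the printed output).
df = pd.DataFrame({
    "authors": ["Albert Einstein", "J.K. Rowling"],
    "quotes": ["“The world as we have created it is a process ...", "“It is our choices, Harry, that show what we t..."],
    "tags": [["change", "deep-thoughts", "thinking", "world"], ["abilities", "choices"]],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quotes_data.csv")
    df.to_csv(path, index=False, encoding="utf-8")

    # Read back plainly: the tags column comes back as text such as "['abilities', 'choices']".
    plain = pd.read_csv(path)
    # Read back with a converter that turns that text into a list again.
    restored = pd.read_csv(path, converters={"tags": ast.literal_eval})

# Basic structural checks, matching the validation step in the task description.
assert list(restored.columns) == ["authors", "quotes", "tags"]
assert len(restored) == len(df)
assert not restored[["authors", "quotes"]].isna().any().any()
print(type(plain.loc[0, "tags"]).__name__, type(restored.loc[0, "tags"]).__name__)
```

If the lists are not needed downstream, joining the tags into one comma-separated string before saving is a simpler alternative.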
