# **Week #6 - Ingestion: Data Scraping & Cleaning Data Scraping**

Intro to Data Engineering Course - Sekolah Engineer - Pacmann Academy

**Outline**

1. Review:
    - Ingestion: Data Scraping
    - Cleaning Data Scraping
2. Case Study 1: Home Depo
3. Case Study 2: CNN News

# <font color='blue'>Review
---

## Ingestion: Data Scraping
---

### What is Data Scraping
---

- Data Scraping adalah sebuah proses mengambil informasi dari sebuah website dan disimpan ke dalam bentuk file (csv, xlsx, json) atau database
- Proses tersebut biasanya dilakukan secara otomatis untuk pengambilan datanya
- Data Scraping juga merupakan bagian dari **Ingestion** pada Data Engineering Lifecycle
- Cara kerja dari Data Scraping adalah mengekstrak data atau informasi yang ada di sebuah web page
- Hasil dari Data Scraping dapat digunakan oleh tim lain untuk diolah dan bisa dijadikan untuk kebutuhan analisis lebih lanjut
- Benefit melakukan Data Scraping:
    - Melakukan Data Collection secara efektif
    - Menghemat Biaya
    - Mempermudah untuk untuk mendapatkan data
    - Otomasi


### Data Scraping Legality
---

- Perihal legalitas dari Data Scraping ini masih menjadi perdebatan hingga sekarang
- Ada yang mengatakan legal, tetapi juga ada yang mengatakan ilegal. Sehingga Data Scraping masih di ranah abu - abu
- Status legalitas Data Scraping tergantung dari tujuannya:
    - Jika menjual hasil data scraping akan menjadi ilegal
    - Jika hanya menjadi bahan belajar akan menjadi legal, dengan catatan mencantumkan sumber website dan memberikan disclaimer
- Tips ketika melakukan Data Scraping:
    - Ambil informasi yang hanya ditampilkan pada website saja
    - Jangan mengambil informasi pribadi user seperti username, password, email, dsb
    - Baca Terms of Use dari website



### Data Scraping Workflow
---

- Untuk workflow dari Data Scraping bisa berbeda - beda, tetapi umumnya:
    1. Identify Data Sources
    2. Understanding Web Structures
        - Pada tahap ini kita harus memahami struktur web page dan url dari website
    3. Developing Scraping Code
    4. Store Data
    5. Data Processing
    6. Analysis

- Untuk melakukan Data Scraping, kita membutuhkan library `beautifulsoup` dan `requests`
- Setelah berhasil melakukan proses Data Scraping, kita bisa menyimpan datanya ke dalam format:
    - File
    - JSON
    - Database
    

### Data Scraping using API
---

- Terdapat website yang memiliki sifat dinamis sehingga tidak bisa dilakukan proses Data Scraping menggunakan struktur HTML
- Sehingga, kita bisa memanfaatkan API yang disediakan untuk mengambil data tersebut yang ada di website

# <font color='blue'>1. Study Case 1: Home Depo
---

- Diberikan sebuah link Home Depo Pacmann https://homedepo.minio-devops-class.pacmann.ai/
- Task yang harus dilakukan adalah harus mengambil detail seluruh informasi dari masing - masing card informasi barang

**Tampilan Home**

<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-intro-to-data-eng/Live_11-1.png" width=80%>
</center>

**Tampilan Detail**

<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-intro-to-data-eng/Live_11-2.png" width=80%>
</center>

- Informasi yang bisa di extract dari tampilan detail itu adalah:
    - Judul
    - Deskripsi
    - Harga
    - Currency
    - Brand
    - Availability
    - Product Id
    - Product Index



## **1**
---

### **1a**
---

Dengan link http://homedepo.pacmann.ai/ sambungkan koneksi ke website tersebut menggunakan library `requests` dan cek status code nya

In [7]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

In [89]:
resp = requests.get("https://scrollmagic.io/examples/advanced/infinite_scrolling.html")

resp.status_code

200

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
resp = requests.get("http://homedepo.pacmann.ai/")

resp.status_code
```

</details>

---

### **1b**
---

- Ternyata setelah diliat, untuk mengakses detail barang ternyata memiliki URL yang berbeda
- URL nya menjadi `https://homedepo.pacmann.ai/show/{product_index}`
- Untuk pengecekan, kita coba ambil salah satu index dari product tersebut. Kita akan coba menggunakan index `0` dan cek status code nya

In [10]:
resp = requests.get("https://a9b1-16-78-86-155.ngrok-free.app/show/0")

resp.text



<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
resp = requests.get("https://homedepo.pacmann.ai/show/0")

resp.status_code
```

</details>

---

### **1c**
---

- Setelah itu, kita buat object `BeautifulSoup` untuk melakukan Web Scraping

In [90]:
soup = BeautifulSoup(resp.text, "html.parser")

soup


<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="width=500" name="viewport">
<meta content="ScrollMagic, example, scrolling, attaching, scrollbar, infinite, dynamic, pages" name="keywords">
<meta content="Jan Paepke (www.janpaepke.de)" name="author"/>
<title>Infinite Scrolling - Examples - ScrollMagic</title>
<link href="http://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,700,400italic|Josefin+Slab:400,700italic" rel="stylesheet" type="text/css"/>
<link href="../../assets/css/normalize.css" rel="stylesheet" type="text/css"/>
<link href="../../assets/css/style.css" rel="stylesheet" type="text/css"/>
<link href="../../assets/css/examples.css" rel="stylesheet" type="text/css"/>
<link href="../../assets/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<script src="../../assets/js/lib/jquery.min.js" type="text/javascript"></script>
<script src="../../assets/js/li

In [6]:
section

<section id="reviews">
<h5>Review Section</h5>
</section>

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
soup = BeautifulSoup(resp.text, "html.parser")

soup
```

</details>

---

## **2**
---

### **2a**
---

- Setelah itu, kita akan mengambil komponen struktur HTML dari detail barang
- Pada case ini, informasi mengenai detail item terdapat pada class `card-body`


In [16]:
raw_data = soup.find_all("div", class_ = "card-body")

raw_data

[<div class="card-body">
 <h5 class="card-title">Men's 3X Large Carbon Heather Cotton/Polyester Rain Defender Paxton Heavyweight Hooded Zip-Front Sweatshirt</h5>
 <p class="card-text description">This heavyweight, water-repellent hooded sweatshirt has a zip front for fast layering. ORIGINAL FIT. 13 oz., 75% cotton/25% polyester blend with Rain Defender durable water repellent. Attached, jersey-lined three-piece hood with drawcord closure. Antique-finish brass front zipper. Two front hand-warmer pockets have a hidden security pocket inside. Stretchable, spandex-reinforced rib-knit cuffs and waistband. Locker loop facilitates hanging.</p>
 <p class="card-text">Price: $<span class="price">64.99</span> <span class="currency">USD</span></p>
 <p class="card-text">Brand: <span class="brand">Carhartt</span></p>
 <p class="card-text">Availibility: <span class="badge bg-success availability">InStock</span></p>
 <p class="card-text"><small class="text-body-secondary">Product Id: <span class="prod

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
raw_data = soup.find_all("div", class_ = "card-body")

raw_data
```

</details>

---

### **2b**
---

- Kita akan extract value dari detail informasi produk dengan menggunakan **Inspect Element** untuk melihat struktur HTML
- Informasi produk yang di extract adalah:
    - `Product Index`
    - `Product Id`
    - `Title`
    - `Description`
    - `Price`
    - `Currency`
    - `Availability`
- Tampung data yang sudah di extract tadi ke dalam dictionary

In [25]:
# your answer here
product = []
for data in raw_data:
    # get all the component
    index = data.find("span", class_ = "index").text
    product_id = data.find("span", class_ = "product_id").text
    title = data.find("h5").text
    description = data.find("p", class_ = "card-text description").text
    price = data.find("span", class_ = "price").text
    currency = data.find("span", class_ = "currency").text
    availability = data.find("span", class_ = "badge bg-success availability").text

    # save it to dictionary
    product_data = {
        "index": index,
        "product_id": product_id,
        "title": title,
        "description": description,
        "price": price,
        "currency": currency,
        "availability": availability
    }
    product.append(product_data)

pd.DataFrame(product)

Unnamed: 0,index,product_id,title,description,price,currency,availability
0,0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"This heavyweight, water-repellent hooded sweat...",64.99,USD,InStock


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
for data in raw_data:
    # get all the component
    index = data.find("span", class_ = "index").text
    product_id = data.find("span", class_ = "product_id").text
    title = data.find("h5").text
    description = data.find("p", class_ = "card-text description").text
    price = data.find("span", class_ = "price").text
    currency = data.find("span", class_ = "currency").text
    availability = data.find("span", class_ = "badge bg-success availability").text

    # save it to dictionary
    product_data = {
        "index": index,
        "product_id": product_id,
        "title": title,
        "description": description,
        "price": price,
        "currency": currency,
        "availability": availability
    }

product_data
```

</details>

---

## **3**
---

- Setelah berhasil mendapatkan sample dari data yang ingin kita scrape, maka kita akan coba untuk mengambil seluruh data yang ada di website tersebut
- Pada case ini, kita cukup pakai sampai index ke `500` karena keterbatasan resouce server website nya

In [26]:
full_data = []

num_of_idx = 500

for idx in tqdm(range(0, num_of_idx + 1)):
    res = requests.get(f"http://127.0.0.1:5000/show/{idx}")

    soup = BeautifulSoup(res.text, "html.parser")

    raw_data = soup.find_all("div", class_ = "card-body")

    for data in raw_data:
        index = data.find("span", class_ = "index").text
        product_id = data.find("span", class_ = "product_id").text
        title = data.find("h5").text
        description = data.find("p", class_ = "card-text description").text
        price = data.find("span", class_ = "price").text
        currency = data.find("span", class_ = "currency").text
        availability = data.find("span", class_ = "badge bg-success availability").text

        product_data = {
            "index": index,
            "product_id": product_id,
            "title": title,
            "description": description,
            "price": price,
            "currency": currency,
            "availability": availability
        }

        full_data.append(product_data)

  0%|          | 0/501 [00:00<?, ?it/s]

100%|██████████| 501/501 [00:39<00:00, 12.54it/s]


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
full_data = []

num_of_idx = 500

for idx in tqdm(range(0, num_of_idx + 1)):
    res = requests.get(f"https://homedepo.pacmann.ai/show/{idx}")

    soup = BeautifulSoup(res.text, "html.parser")

    raw_data = soup.find_all("div", class_ = "card-body")

    for data in raw_data:
        index = data.find("span", class_ = "index").text
        product_id = data.find("span", class_ = "product_id").text
        title = data.find("h5").text
        description = data.find("p", class_ = "card-text description").text
        price = data.find("span", class_ = "price").text
        currency = data.find("span", class_ = "currency").text
        availability = data.find("span", class_ = "badge bg-success availability").text

        product_data = {
            "index": index,
            "product_id": product_id,
            "title": title,
            "description": description,
            "price": price,
            "currency": currency,
            "availability": availability
        }

        full_data.append(product_data)
```

</details>

---

In [27]:
full_data

[{'index': '0',
  'product_id': '310090686',
  'title': "Men's 3X Large Carbon Heather Cotton/Polyester Rain Defender Paxton Heavyweight Hooded Zip-Front Sweatshirt",
  'description': 'This heavyweight, water-repellent hooded sweatshirt has a zip front for fast layering. ORIGINAL FIT. 13 oz., 75% cotton/25% polyester blend with Rain Defender durable water repellent. Attached, jersey-lined three-piece hood with drawcord closure. Antique-finish brass front zipper. Two front hand-warmer pockets have a hidden security pocket inside. Stretchable, spandex-reinforced rib-knit cuffs and waistband. Locker loop facilitates hanging.',
  'price': '64.99',
  'currency': 'USD',
  'availability': 'InStock'},
 {'index': '1',
  'product_id': '206724580',
  'title': 'Turmode 30 ft. RP TNC Female to RP TNC Male Adapter Cable',
  'description': "If you need more length between your existing wireless device and Hi-Gain Antenna, this is the product for you. It's compatible with most Wi-Fi Antennas, so it is

## **4**
---

### **4a**
---

- Setelah berhasil mengambil informasi detail produk, masukkan ke dalam DataFrame

In [28]:
# your answer here
raw_data = pd.DataFrame(full_data)

raw_data

Unnamed: 0,index,product_id,title,description,price,currency,availability
0,0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"This heavyweight, water-repellent hooded sweat...",64.99,USD,InStock
1,1,206724580,Turmode 30 ft. RP TNC Female to RP TNC Male Ad...,If you need more length between your existing ...,71.61,USD,InStock
2,2,310347105,Large Tapestry Bolster Bed,Polyester cover resembling rich Italian tapest...,166.83,USD,InStock
3,3,312338711,16-Gauge-Sinks Vessel Sink in White with Faucet,It features a rectangle shape. This vessel set...,507.63,USD,InStock
4,4,308561619,Men's Crazy Horse 9'' Logger Boot - Steel Toe ...,This 9 in. black full grain leather logger boo...,103.59,USD,InStock
...,...,...,...,...,...,...,...
496,496,314895163,Border Fill 2 ft. x 4 ft. Vintage Metal Glue U...,Fasade Glue-up decorative thermoplastic ceilin...,184.12,USD,InStock
497,497,203868630,Camino Lincroft 48 in. - 86 in. Adjustable 3/4...,The Camino Lincroft Decorative Rod Series comb...,27.99,USD,InStock
498,498,205230260,5 gal. #N290-7 Marrakech Brown Flat Exterior P...,For a classic look on your home's exterior wal...,200,USD,InStock
499,499,307905811,Brake Hydraulic Hose,The Centric Parts brake hydraulic program is t...,41.99,USD,InStock


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
raw_data = pd.DataFrame(full_data)

raw_data
```

</details>

---

### **4b**
---

- Setelah itu simpan raw data yang berhasil kita scrape dengan filename `home_depo_raw.csv`

In [29]:
# your answer here
raw_data.to_csv("home_depo_raw.csv", index = False)

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
raw_data.to_csv("home_depo_raw.csv", index = False)
```

</details>

---

- Kita cek apakah sudah menyimpan data yang baru saja di save

In [30]:
pd.read_csv("home_depo_raw.csv")

Unnamed: 0,index,product_id,title,description,price,currency,availability
0,0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"This heavyweight, water-repellent hooded sweat...",64.99,USD,InStock
1,1,206724580,Turmode 30 ft. RP TNC Female to RP TNC Male Ad...,If you need more length between your existing ...,71.61,USD,InStock
2,2,310347105,Large Tapestry Bolster Bed,Polyester cover resembling rich Italian tapest...,166.83,USD,InStock
3,3,312338711,16-Gauge-Sinks Vessel Sink in White with Faucet,It features a rectangle shape. This vessel set...,507.63,USD,InStock
4,4,308561619,Men's Crazy Horse 9'' Logger Boot - Steel Toe ...,This 9 in. black full grain leather logger boo...,103.59,USD,InStock
...,...,...,...,...,...,...,...
496,496,314895163,Border Fill 2 ft. x 4 ft. Vintage Metal Glue U...,Fasade Glue-up decorative thermoplastic ceilin...,184.12,USD,InStock
497,497,203868630,Camino Lincroft 48 in. - 86 in. Adjustable 3/4...,The Camino Lincroft Decorative Rod Series comb...,27.99,USD,InStock
498,498,205230260,5 gal. #N290-7 Marrakech Brown Flat Exterior P...,For a classic look on your home's exterior wal...,200.00,USD,InStock
499,499,307905811,Brake Hydraulic Hose,The Centric Parts brake hydraulic program is t...,41.99,USD,InStock


## **5**
---

- Sekarang kita masuk ke dalam proses Data Wrangling
- Read data yang baru di save pada proses sebelumnya, yaitu file `home_depo_raw.csv`

In [31]:
depo_data = pd.read_csv("home_depo_raw.csv")

depo_data.head()

Unnamed: 0,index,product_id,title,description,price,currency,availability
0,0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"This heavyweight, water-repellent hooded sweat...",64.99,USD,InStock
1,1,206724580,Turmode 30 ft. RP TNC Female to RP TNC Male Ad...,If you need more length between your existing ...,71.61,USD,InStock
2,2,310347105,Large Tapestry Bolster Bed,Polyester cover resembling rich Italian tapest...,166.83,USD,InStock
3,3,312338711,16-Gauge-Sinks Vessel Sink in White with Faucet,It features a rectangle shape. This vessel set...,507.63,USD,InStock
4,4,308561619,Men's Crazy Horse 9'' Logger Boot - Steel Toe ...,This 9 in. black full grain leather logger boo...,103.59,USD,InStock


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
depo_data = pd.read_csv("home_depo_raw.csv")

depo_data.head()
```

</details>

---

## **6**
---

### **6a**
---

- Pada tahap ini, kita tidak akan menggunakan kolom `index`
- Oleh karena itu kita perlu drop kolom tersebut

In [32]:
depo_data = depo_data.drop(["index"], axis = 1)

depo_data.head()

Unnamed: 0,product_id,title,description,price,currency,availability
0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"This heavyweight, water-repellent hooded sweat...",64.99,USD,InStock
1,206724580,Turmode 30 ft. RP TNC Female to RP TNC Male Ad...,If you need more length between your existing ...,71.61,USD,InStock
2,310347105,Large Tapestry Bolster Bed,Polyester cover resembling rich Italian tapest...,166.83,USD,InStock
3,312338711,16-Gauge-Sinks Vessel Sink in White with Faucet,It features a rectangle shape. This vessel set...,507.63,USD,InStock
4,308561619,Men's Crazy Horse 9'' Logger Boot - Steel Toe ...,This 9 in. black full grain leather logger boo...,103.59,USD,InStock


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
depo_data = depo_data.drop(["index"], axis = 1)

depo_data.head()
```

</details>

---

### **6b**
---

- Kita akan coba melakukan Feature Engineering terhadap pada data yang dimiliki
- Coba buat fitur baru dengan nama `price_idr`, yang dimana itu didapatkan dari

$$
\text{price_idr} = \text{price} \cdot \text{idr_currency}
$$

- Asumsikan `$1` itu adalah `Rp 16.000`

In [33]:
IDR_CURRENCY = 16_000

depo_data["price_idr"] = depo_data["price"] * IDR_CURRENCY

In [34]:
depo_data

Unnamed: 0,product_id,title,description,price,currency,availability,price_idr
0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"This heavyweight, water-repellent hooded sweat...",64.99,USD,InStock,1039840.0
1,206724580,Turmode 30 ft. RP TNC Female to RP TNC Male Ad...,If you need more length between your existing ...,71.61,USD,InStock,1145760.0
2,310347105,Large Tapestry Bolster Bed,Polyester cover resembling rich Italian tapest...,166.83,USD,InStock,2669280.0
3,312338711,16-Gauge-Sinks Vessel Sink in White with Faucet,It features a rectangle shape. This vessel set...,507.63,USD,InStock,8122080.0
4,308561619,Men's Crazy Horse 9'' Logger Boot - Steel Toe ...,This 9 in. black full grain leather logger boo...,103.59,USD,InStock,1657440.0
...,...,...,...,...,...,...,...
496,314895163,Border Fill 2 ft. x 4 ft. Vintage Metal Glue U...,Fasade Glue-up decorative thermoplastic ceilin...,184.12,USD,InStock,2945920.0
497,203868630,Camino Lincroft 48 in. - 86 in. Adjustable 3/4...,The Camino Lincroft Decorative Rod Series comb...,27.99,USD,InStock,447840.0
498,205230260,5 gal. #N290-7 Marrakech Brown Flat Exterior P...,For a classic look on your home's exterior wal...,200.00,USD,InStock,3200000.0
499,307905811,Brake Hydraulic Hose,The Centric Parts brake hydraulic program is t...,41.99,USD,InStock,671840.0


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
IDR_CURRENCY = 15_500

depo_data["price_idr"] = depo_data["price"] * IDR_CURRENCY
```

</details>

---

In [22]:
depo_data.head()

Unnamed: 0,product_id,title,description,price,currency,availability,price_idr
0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"This heavyweight, water-repellent hooded sweat...",64.99,USD,InStock,1007345.0
1,206724580,Turmode 30 ft. RP TNC Female to RP TNC Male Ad...,If you need more length between your existing ...,71.61,USD,InStock,1109955.0
2,310347105,Large Tapestry Bolster Bed,Polyester cover resembling rich Italian tapest...,166.83,USD,InStock,2585865.0
3,312338711,16-Gauge-Sinks Vessel Sink in White with Faucet,It features a rectangle shape. This vessel set...,507.63,USD,InStock,7868265.0
4,308561619,Men's Crazy Horse 9'' Logger Boot - Steel Toe ...,This 9 in. black full grain leather logger boo...,103.59,USD,InStock,1605645.0


### **6c**
---

- Proses selanjutnya adalah kita akan mengubah seluruh value yang ada di `description` menjadi lowercase semua

In [35]:
depo_data["description"] = depo_data["description"].str.lower()

depo_data.head()

Unnamed: 0,product_id,title,description,price,currency,availability,price_idr
0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"this heavyweight, water-repellent hooded sweat...",64.99,USD,InStock,1039840.0
1,206724580,Turmode 30 ft. RP TNC Female to RP TNC Male Ad...,if you need more length between your existing ...,71.61,USD,InStock,1145760.0
2,310347105,Large Tapestry Bolster Bed,polyester cover resembling rich italian tapest...,166.83,USD,InStock,2669280.0
3,312338711,16-Gauge-Sinks Vessel Sink in White with Faucet,it features a rectangle shape. this vessel set...,507.63,USD,InStock,8122080.0
4,308561619,Men's Crazy Horse 9'' Logger Boot - Steel Toe ...,this 9 in. black full grain leather logger boo...,103.59,USD,InStock,1657440.0


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
depo_data["description"] = depo_data["description"].str.lower()

depo_data.head()
```

</details>

---

## **7**
---

- Simpan hasil yang sudah kita lakukan proses Data Wrangling ke dalam bentuk `csv`
- Simpan dengan nama `home_depo_clean.csv`

In [36]:
# your answer here
depo_data.to_csv("home_depo_clean.csv", index = False)

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
depo_data.to_csv("home_depo_clean.csv", index = False)
```

</details>

---

- Cek outputnya apakah sudah tersimpan

In [25]:
pd.read_csv("home_depo_clean.csv").head()

Unnamed: 0,product_id,title,description,price,currency,availability,price_idr
0,310090686,Men's 3X Large Carbon Heather Cotton/Polyester...,"this heavyweight, water-repellent hooded sweat...",64.99,USD,InStock,1007345.0
1,206724580,Turmode 30 ft. RP TNC Female to RP TNC Male Ad...,if you need more length between your existing ...,71.61,USD,InStock,1109955.0
2,310347105,Large Tapestry Bolster Bed,polyester cover resembling rich italian tapest...,166.83,USD,InStock,2585865.0
3,312338711,16-Gauge-Sinks Vessel Sink in White with Faucet,it features a rectangle shape. this vessel set...,507.63,USD,InStock,7868265.0
4,308561619,Men's Crazy Horse 9'' Logger Boot - Steel Toe ...,this 9 in. black full grain leather logger boo...,103.59,USD,InStock,1605645.0


# <font color='blue'>2. Study Case 2: CNN News
---

**[DISCLAIMER]: For learning purposes only!**

- Albudi ingin melakukan Sentiment Analysis terhadap suatu berita
- Tetapi, Albudi tidak memiliki data beritanya untuk melakukan pemodelan
- Akhirnya Albudi meminta tolong ke Data Engineer untuk menyediakan data nya dengan cara melakukan Data Scrape ke Portal Berita
- Portal berita yang akan dipakai disini adalah [CNN Indonesia](https://www.cnnindonesia.com/indeks)

**Tampilan Home**

<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-intro-to-data-eng/Live_11-3.png" width=60%>
</center>

**Tampilan Detail**

<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-intro-to-data-eng/Live_11-4.png" width=60%>
</center>


- **Step yang harus dilakukan adalah:**
    - Memahami cara kerja flow dari website CNN
    - Memahami struktur URL dan HTML website
    - Proses Data Scraping
    - Simpan output ke dalam database


## **1**
---

### **1a**
---

- Diberikan URL https://www.cnnindonesia.com/indeks
- Lakukan koneksi dengan menggunakan library `requests` dan cek status code nya

In [37]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import time

In [44]:
resp = requests.get("https://www.cnnindonesia.com/nasional/20241223210200-12-1180430/bupati-nonaktif-sidoarjo-muhdlor-ali-divonis-45-tahun-penjara")

resp.status_code

200

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
resp = requests.get("https://www.cnnindonesia.com/indeks")

resp.status_code
```

</details>

---

### **1b**
---

- Setelah kita cek, ternyata jika ingin mengambil detail informasi dari masing - masing berita memiliki URL yang berbeda - beda
- Oleh karena itu, kita harus mendapatkan URL dari masing - masing berita sebelum kita mengambil detail informasinya
- Caranya adalah kita mengambil URL dari **Home Portal Berita**
- Untuk tag HTML yang menampung URL dari masing - masing berita ada di tags `<a class="flex group items-center gap-4">`

Buat object BeautifulSoup

In [45]:
# create soup object
soup = BeautifulSoup(resp.text, "html.parser")

In [46]:
soup

<!DOCTYPE html>

<html class="scroll-smooth scroll-pt-[88px]" id="anchor" lang="id-ID">
<head>
<title>Bupati Nonaktif Sidoarjo Muhdlor Ali Divonis 4,5 Tahun Penjara</title>
<link href="https://cdn.cnnindonesia.com" rel="dns-prefetch"/>
<link href="https://cdn.detik.net.id" rel="dns-prefetch"/>
<link href="https://securepubads.g.doubleclick.net" rel="dns-prefetch"/>
<link href="https://cdnstatic.detik.com" rel="dns-prefetch"/>
<link href="https://akcdn.detik.net.id" rel="dns-prefetch"/>
<link href="https://www.gstatic.com" rel="dns-prefetch"/>
<link href="https://www.google-analytics.com" rel="dns-prefetch"/>
<link href="https://partner.googleadservices.com" rel="dns-prefetch"/>
<link href="https://connect.detik.com" rel="dns-prefetch"/>
<link href="https://www.googletagmanager.com" rel="dns-prefetch"/>
<link href="https://pubads.g.doubleclick.net" rel="dns-prefetch"/>
<link href="https://analytic.detik.com" rel="dns-prefetch"/>
<link href="https://newcomment.detik.com" rel="dns-prefetc

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
ORDERED_COLS = ["Order Date", "Product", "Quantity Ordered",
                "Price Each", "Purchase Address"]

sales_data = sales_data[ORDERED_COLS]

sales_data
```

</details>

---

Ambil URL dari masing - masing berita

In [40]:
raw_link = soup.find_all("a", class_ = "flex group items-center gap-4")

raw_link

[<a aria-label="link description" class="flex group items-center gap-4" href="https://www.cnnindonesia.com/nasional/20241223210200-12-1180430/bupati-nonaktif-sidoarjo-muhdlor-ali-divonis-45-tahun-penjara">
 <span class="flex-none overflow-hidden block relative w-[270px]">
 <span class="block aspect-w-16 aspect-h-9">
 <img alt="Bupati Nonaktif Sidoarjo Muhdlor Ali Divonis 4,5 Tahun Penjara" class="object-cover w-full group-hover:scale-110" src="https://akcdn.detik.net.id/visual/2024/12/17/eks-bupati-sidoarjo-gus-muhdlor-terisak-bacakan-pledoi-kasus-korupsi_169.jpeg?w=280&amp;q=90"/>
 </span>
 </span>
 <span class="flex-grow">
 <h2 class="text-cnn_black_light dark:text-white mb-2 inline leading-normal text-xl group-hover:text-cnn_red">Bupati Nonaktif Sidoarjo Muhdlor Ali Divonis 4,5 Tahun Penjara</h2>
 <span class="block mt-1">
 <span class="text-xs text-cnn_red">Nasional</span>
 <span class="text-xs text-cnn_black_light3"> • 6 menit yang lalu<!--2024-12-23 21:17:01--> </span>
 </span>
 

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
raw_link = soup.find_all("a", class_ = "flex group items-center gap-4")

raw_link
```

</details>

---

### **1c**
---

- Setelah sudah mendapatkan komponen - komponen HTML dari website, sekarang kita akan ambil URL nya saja
- URL link ada di tag `<a href="link"`>
- Oleh karena itu, kita akan extract yang ada di `href` saja
- Untuk mendapatkan value nya, kita bisa menggunakan function `get`

In [43]:
for link in raw_link:
    get_link = link.get("href")

    print(get_link)

https://www.cnnindonesia.com/nasional/20241223210200-12-1180430/bupati-nonaktif-sidoarjo-muhdlor-ali-divonis-45-tahun-penjara
https://www.cnnindonesia.com/olahraga/20241223190216-142-1180400/martinez-sebut-ruang-ganti-mu-penuh-amarah-usai-dihajar-bournemouth
https://www.cnnindonesia.com/internasional/20241220135940-115-1179467/foto-anggota-dpr-taiwan-baku-hantam-di-gedung-parlemen
https://www.cnnindonesia.com/ekonomi/20241223205339-92-1180428/12-perusahaan-ternama-as-tumbang-sepanjang-2024
https://www.cnnindonesia.com/hiburan/20241223120253-220-1180224/mengenal-7-bintang-baru-dan-karakter-superman-era-james-gunn
https://www.cnnindonesia.com/otomotif/20241223205717-579-1180429/resmi-honda-dan-nissan-merger
https://www.cnnindonesia.com/nasional/20241223204852-20-1180426/warga-terdampak-banjir-makassar-2551-jiwa-diprediksi-terus-bertambah
https://www.cnnindonesia.com/ekonomi/20241223203100-85-1180422/wamen-esdm-sebut-stok-bbm-di-medan-tambah-5-persen-saat-nataru
https://www.cnnindonesia.c

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
for link in raw_link:
    get_link = link.get("href")

    print(get_link)
```

</details>

---

### **1d**
---

- Setelah berhasil mendapatkan URL dari masing - masing berita, kita akan ambil URL berita dari page 1 sampai 15
- URL untuk next page nya berubah menjadi `https://www.cnnindonesia.com/indeks/2/{page_num}`
- Oleh karena itu kita perlu menyesuaikan untuk membuat koneksi ke Portal Berita
- Tiap proses scraping, kita berikan delay `0.25`
- Simpan URL ke dalam list kosongan dan simpan ke `cnn_url_news.csv`

In [51]:
links = []

for page in tqdm(range(1, 6)):
    resp = requests.get(f"https://www.cnnindonesia.com/indeks/2?page={page}")

    soup = BeautifulSoup(resp.text, "html.parser")

    raw_link = soup.find_all("a", class_ = "flex group items-center gap-4")

    # print(raw_link)
    for link in raw_link:
        get_link = link.get("href")
        links.append(get_link)

    time.sleep(0.25)

100%|██████████| 5/5 [00:11<00:00,  2.24s/it]


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
links = []

for page in tqdm(range(1, 16)):
    resp = requests.get(f"https://www.cnnindonesia.com/indeks/2/{page}")

    soup = BeautifulSoup(resp.text, "html.parser")

    raw_link = soup.find_all("a", class_ = "flex group items-center gap-4")

    for link in raw_link:
        get_link = link.get("href")
        links.append(get_link)

    time.sleep(0.25)

```

</details>

---

In [52]:
links

['https://www.cnnindonesia.com/nasional/20241223210200-12-1180430/bupati-nonaktif-sidoarjo-muhdlor-ali-divonis-45-tahun-penjara',
 'https://www.cnnindonesia.com/olahraga/20241223190216-142-1180400/martinez-sebut-ruang-ganti-mu-penuh-amarah-usai-dihajar-bournemouth',
 'https://www.cnnindonesia.com/internasional/20241220135940-115-1179467/foto-anggota-dpr-taiwan-baku-hantam-di-gedung-parlemen',
 'https://www.cnnindonesia.com/ekonomi/20241223205339-92-1180428/12-perusahaan-ternama-as-tumbang-sepanjang-2024',
 'https://www.cnnindonesia.com/hiburan/20241223120253-220-1180224/mengenal-7-bintang-baru-dan-karakter-superman-era-james-gunn',
 'https://www.cnnindonesia.com/otomotif/20241223205717-579-1180429/resmi-honda-dan-nissan-merger',
 'https://www.cnnindonesia.com/nasional/20241223204852-20-1180426/warga-terdampak-banjir-makassar-2551-jiwa-diprediksi-terus-bertambah',
 'https://www.cnnindonesia.com/ekonomi/20241223203100-85-1180422/wamen-esdm-sebut-stok-bbm-di-medan-tambah-5-persen-saat-nat

In [53]:
news_cnn = pd.DataFrame(links, columns = ["url"])
news_cnn


Unnamed: 0,url
0,https://www.cnnindonesia.com/nasional/20241223...
1,https://www.cnnindonesia.com/olahraga/20241223...
2,https://www.cnnindonesia.com/internasional/202...
3,https://www.cnnindonesia.com/ekonomi/202412232...
4,https://www.cnnindonesia.com/hiburan/202412231...
5,https://www.cnnindonesia.com/otomotif/20241223...
6,https://www.cnnindonesia.com/nasional/20241223...
7,https://www.cnnindonesia.com/ekonomi/202412232...
8,https://www.cnnindonesia.com/ekonomi/202412231...
9,https://www.cnnindonesia.com/ekonomi/202412232...


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
news_cnn = pd.DataFrame(links, columns = ["url"])

news_cnn
```

</details>

---

In [54]:
news_cnn.to_csv("cnn_url_news.csv", index = False)

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
imdb_data = pd.read_csv("IMDB_Movies.csv")

imdb_data.head()
```

</details>

---

## **2**
---

### **2a**
---

- Sekarang kita akan gunakan `cnn_url_news.csv` data untuk melakukan scaraping data
- Langkah awal kita akan read data tersebut

In [56]:
# your answer here
links = pd.read_csv("cnn_url_news.csv")

links

Unnamed: 0,url
0,https://www.cnnindonesia.com/nasional/20241223...
1,https://www.cnnindonesia.com/olahraga/20241223...
2,https://www.cnnindonesia.com/internasional/202...
3,https://www.cnnindonesia.com/ekonomi/202412232...
4,https://www.cnnindonesia.com/hiburan/202412231...
5,https://www.cnnindonesia.com/otomotif/20241223...
6,https://www.cnnindonesia.com/nasional/20241223...
7,https://www.cnnindonesia.com/ekonomi/202412232...
8,https://www.cnnindonesia.com/ekonomi/202412231...
9,https://www.cnnindonesia.com/ekonomi/202412232...


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
links = pd.read_csv("cnn_url_news.csv")

links.head()
```

</details>

---

### **2b**
---

In [58]:
get_link

'https://www.cnnindonesia.com/internasional/20241223170944-110-1180358/video-momen-erdogan-sempat-tinggalkan-ktt-d-8-saat-prabowo-pidato'

- Kita akan coba test dengan menggunakan satu link terlebih dahulu untuk melihat struktur dari HTML Portal Berita yang ingin kita scrape
- Kita akan menggunakan data yang ada di index `0`

In [59]:
# your answer here
resp = requests.get(get_link)

resp.status_code

200

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
get_link = links["url"].iloc[0]

get_link
```

</details>

---

In [None]:
resp = requests.get(get_link)

resp.status_code

Buatlah objek BeautifulSoup

In [60]:
# your answer here
soup = BeautifulSoup(resp.text, "html.parser")

In [61]:
soup

<!DOCTYPE html>

<html class="scroll-smooth scroll-pt-[88px]" id="anchor" lang="id-ID">
<head>
<title>VIDEO: Momen Erdogan Sempat Tinggalkan KTT D-8 saat Prabowo Pidato</title>
<link href="https://cdn.cnnindonesia.com" rel="dns-prefetch"/>
<link href="https://cdn.detik.net.id" rel="dns-prefetch"/>
<link href="https://securepubads.g.doubleclick.net" rel="dns-prefetch"/>
<link href="https://cdnstatic.detik.com" rel="dns-prefetch"/>
<link href="https://akcdn.detik.net.id" rel="dns-prefetch"/>
<link href="https://www.gstatic.com" rel="dns-prefetch"/>
<link href="https://www.google-analytics.com" rel="dns-prefetch"/>
<link href="https://partner.googleadservices.com" rel="dns-prefetch"/>
<link href="https://connect.detik.com" rel="dns-prefetch"/>
<link href="https://www.googletagmanager.com" rel="dns-prefetch"/>
<link href="https://pubads.g.doubleclick.net" rel="dns-prefetch"/>
<link href="https://analytic.detik.com" rel="dns-prefetch"/>
<link href="https://newcomment.detik.com" rel="dns-pre

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
soup = BeautifulSoup(resp.text, "html.parser")
```

</details>

---

### **2c**
---

- Komponen yang perlu kita extract adalah:
    - `News Title`
    - `Author`
    - `Article Content` (Isi berita)
    - `News Created`

- Hasilnya, kita masukkan ke dalam dictionary

- Setelah itu, kita bisa tambahkan informasi baru yaitu kapan ketika kita melakukan proses Data Scraping dengan nama `scraped_at`

In [64]:
# code for generate current timestamp

# Get the current timestamp
current_timestamp = pd.Timestamp.now()

# # Convert the Timestamp object to a formatted string
formatted_timestamp = current_timestamp.strftime('%Y-%m-%d %H:%M:%S')

In [65]:
formatted_timestamp

'2024-12-23 21:36:29'

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
get_title = soup.find("h1").text.strip()

get_article = soup.find_all("p")

final_article = ""

for article in get_article:
    final_article += article.text + "\n"


get_author = soup.find("span", class_ = "text-cnn_red").text

news_created = soup.find("div", class_ = "text-cnn_grey text-sm mb-4").text.strip()

news_data = {
    "news_title": get_title,
    "author": get_author,
    "article": final_article.strip(),
    "news_created": news_created,
    "scrapped_at": formatted_timestamp
}
```

</details>

---

In [66]:
get_title = soup.find("h1").text.strip()

get_article = soup.find_all("p")

final_article = ""

for article in get_article:
    final_article += article.text + "\n"


get_author = soup.find("span", class_ = "text-cnn_red").text

news_created = soup.find("div", class_ = "text-cnn_grey text-sm mb-4").text.strip()

news_data = {
    "news_title": get_title,
    "author": get_author,
    "article": final_article.strip(),
    "news_created": news_created,
    "scrapped_at": formatted_timestamp
}

In [67]:
news_data

{'news_title': 'VIDEO: Momen Erdogan Sempat Tinggalkan KTT D-8 saat Prabowo Pidato',
 'author': 'CNN Indonesia',
 'article': 'Rekaman video menunjukkan Presiden Turki Recep Tayyip Erdogan sempat tinggalkan ruangan saat Presiden RI Prabowo Subianto menyampaikan pidato di KTT Developing Eight D-8 di Kairo, Mesir pada Kamis (19/12).\nSejumlah delegasi pun terlihat berjalan hendak meninggalkan ruangan saat Prabowo berpidato.\nSalah satu yang jelas terlihat dalam video itu adalah Erdogan yang berjalan di belakang kursi Prabowo, bahkan sempat menyenggol kursi sang Presiden RI ketika sedang berpidato.\nMomen itu terekam ketika Prabowo berbicara mengecam pelanggaran Israel terhadap hukum internasional.\nPihak Duta Besar Turki untuk Indonesia kemudian mengklarifikasi dan membenarkan pernyataan Kemlu RI yang membantah kabar bahwa tindakan Erdogan merupakan walk out. Erdogan disebut keluar ruangan terkait pertemuan bilateral harus dihadiri Presiden Turki yang kebetulan paralel atau bersamaan deng

## **3**
---

### **3a**
---

- Setelah berhasil mengetest URL di atas, sekarang kita akan mengambil seluruh detail informasi berita dengan menggunakan seluruh URL yang sudah didapatkan sebelumnya
- Kita tambahkan proses `try-except` dan akan raise Error apabila terjadi kesalahan pada kode yang kita miliki
- Setelah itu tambahkan delay atau pause sekitar `0.5` setiap berhasil scrape data

In [69]:
links.head()

Unnamed: 0,url
0,https://www.cnnindonesia.com/nasional/20241223...
1,https://www.cnnindonesia.com/olahraga/20241223...
2,https://www.cnnindonesia.com/internasional/202...
3,https://www.cnnindonesia.com/ekonomi/202412232...
4,https://www.cnnindonesia.com/hiburan/202412231...


In [70]:
full_data = []

for idx in tqdm(range(len(links))):
    try:
        print(f"Scraping news {idx} of {len(links)}")
        get_link = links["url"].loc[idx]

        resp = requests.get(get_link)

        soup = BeautifulSoup(resp.text, "html.parser")

        get_title = soup.find("h1").text.strip()

        get_article = soup.find_all("p")

        final_article = ""

        for article in get_article:
            if "ADVERTISEMENT" not in article.text and "SCROLL TO CONTINUE WITH CONTENT" not in article.text:
                final_article += article.text + "\n"

        get_author = soup.find("span", class_ = "text-cnn_red").text

        news_created = soup.find("div", class_ = "text-cnn_grey text-sm mb-4").text.strip()

        news_data = {
            "news_title": get_title,
            "author": get_author,
            "article": final_article.strip(),
            "news_created": news_created,
            "scrapped_at": formatted_timestamp
        }

        full_data.append(news_data)

        time.sleep(0.5)

    except Exception as e:
        raise Exception("There's some error", e)

  0%|          | 0/50 [00:00<?, ?it/s]

Scraping news 0 of 50


  2%|▏         | 1/50 [00:01<01:08,  1.39s/it]

Scraping news 1 of 50


  4%|▍         | 2/50 [00:03<01:32,  1.92s/it]

Scraping news 2 of 50


  4%|▍         | 2/50 [00:04<01:53,  2.36s/it]


Exception: ("There's some error", AttributeError("'NoneType' object has no attribute 'text'"))

In [71]:
links.iloc[2]

url    https://www.cnnindonesia.com/internasional/202...
Name: 2, dtype: object

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
full_data = []

for idx in tqdm(range(len(links))):
    try:
        get_link = links["url"].loc[idx]

        resp = requests.get(get_link)

        soup = BeautifulSoup(resp.text, "html.parser")

        get_title = soup.find("h1").text.strip()

        get_article = soup.find_all("p")

        final_article = ""

        for article in get_article:
            if "ADVERTISEMENT" not in article.text and "SCROLL TO CONTINUE WITH CONTENT" not in article.text:
                final_article += article.text + "\n"

        get_author = soup.find("span", class_ = "text-cnn_red").text

        news_created = soup.find("div", class_ = "text-cnn_grey text-sm mb-4").text.strip()

        news_data = {
            "news_title": get_title,
            "author": get_author,
            "article": final_article.strip(),
            "news_created": news_created,
            "scrapped_at": formatted_timestamp
        }

        full_data.append(news_data)

        time.sleep(0.5)

    except:
        raise Exception("There's some error")
```

</details>

---

### **3b**
---

- Ternyata ada error dari code yang kita miliki
- Setelah ditelusuri, ternyata yang menyebabkan error adalah code scraper yang kita buat tidak bisa mengambil konten yang isi berita nya hanya ada Gambar saja
- Oleh karena itu, kita perlu update code scraper kita untuk filter kata `FOTO:` dan `SMASHOT:` yang dimana kedua keyword tersebut mengandung Gambar saja

In [None]:
links["url"].iloc[5]

In [72]:
full_data = []

for idx in tqdm(range(len(links))):
    try:
        print(f"Scraping news {idx} of {len(links)}")
        get_link = links["url"].loc[idx]

        resp = requests.get(get_link)

        soup = BeautifulSoup(resp.text, "html.parser")

        get_title = soup.find("h1").text.strip()

        if 'FOTO:' in get_title or "SMASHOT:" in get_title:
            continue

        get_article = soup.find_all("p")

        final_article = ""

        for article in get_article:
            if "ADVERTISEMENT" not in article.text and "SCROLL TO CONTINUE WITH CONTENT" not in article.text:
                final_article += article.text + "\n"

        get_author = soup.find("span", class_ = "text-cnn_red").text
        
        news_created = soup.find("div", class_ = "text-cnn_grey text-sm mb-4").text.strip()

        news_data = {
            "news_title": get_title,
            "author": get_author,
            "article": final_article.strip(),
            "news_created": news_created,
            "scrapped_at": formatted_timestamp
        }

        full_data.append(news_data)

        time.sleep(0.5)

    except Exception as e:  
        raise Exception("There's some error",e)

  0%|          | 0/50 [00:00<?, ?it/s]

Scraping news 0 of 50


  2%|▏         | 1/50 [00:01<01:03,  1.29s/it]

Scraping news 1 of 50


  4%|▍         | 2/50 [00:04<02:00,  2.52s/it]

Scraping news 2 of 50


  6%|▌         | 3/50 [00:07<02:01,  2.59s/it]

Scraping news 3 of 50


  8%|▊         | 4/50 [00:09<01:54,  2.48s/it]

Scraping news 4 of 50


  8%|▊         | 4/50 [00:12<02:24,  3.14s/it]


Exception: ("There's some error", AttributeError("'NoneType' object has no attribute 'text'"))

In [74]:
import os

full_data = []

for idx in tqdm(range(len(links))):
    try:
        print(f"Scraping news {idx} of {len(links)}")
        get_link = links["url"].loc[idx]

        resp = requests.get(get_link)

        soup = BeautifulSoup(resp.text, "html.parser")

        get_title = soup.find("h1").text.strip()

        if 'FOTO:' in get_title or "SMASHOT:" in get_title:
            continue

        get_article = soup.find_all("p")

        final_article = ""

        for article in get_article:
            if "ADVERTISEMENT" not in article.text and "SCROLL TO CONTINUE WITH CONTENT" not in article.text:
                final_article += article.text + "\n"

        get_author = soup.find("span", class_ = "text-cnn_red").text
        
        # there is some html tag for news dates
        if soup.find("div", class_ = "text-cnn_grey text-sm mb-4"):
            news_created = soup.find("div", class_ = "text-cnn_grey text-sm mb-4").text.strip()
        else:
            news_created = soup.find("div", class_ = "text-cnn_grey text-sm mb-6").text.strip()

        # get image
        get_image = soup.find("img", class_ = "w-full")
        if get_image:
            # download the image
            image_url = get_image.get("src")
            image_name = image_url.split("/")[-1]
            image_name = image_name.split("?")[0]
            print(image_name)

            with open(f"{image_name}", "wb") as f:
                f.write(requests.get(image_url).content)
        else:
            image_url=None

        news_data = {
            "news_title": get_title,
            "author": get_author,
            "article": final_article.strip(),
            "news_created": news_created,
            "scrapped_at": formatted_timestamp,
            "image": image_url
        }

        full_data.append(news_data)

        time.sleep(0.5)

    except Exception as e:  
        raise Exception("There's some error",e)

  0%|          | 0/50 [00:00<?, ?it/s]

Scraping news 0 of 50
eks-bupati-sidoarjo-gus-muhdlor-terisak-bacakan-pledoi-kasus-korupsi_169.jpeg


  2%|▏         | 1/50 [00:06<05:30,  6.75s/it]

Scraping news 1 of 50
mu-vs-bournemouth-6_169.jpeg


  4%|▍         | 2/50 [00:12<05:05,  6.36s/it]

Scraping news 2 of 50


  6%|▌         | 3/50 [00:14<03:21,  4.30s/it]

Scraping news 3 of 50
tupperware-4_169.jpeg


  8%|▊         | 4/50 [00:19<03:27,  4.51s/it]

Scraping news 4 of 50
film-superman-2025-1_169.png


 10%|█         | 5/50 [00:22<03:05,  4.12s/it]

Scraping news 5 of 50
ilustrasi-honda-dan-nissan-merger-5_169.jpeg


 12%|█▏        | 6/50 [00:26<02:59,  4.08s/it]

Scraping news 6 of 50
banjir-rendam-permukiman-warga-di-makassar-1_169.jpeg


 14%|█▍        | 7/50 [00:32<03:10,  4.43s/it]

Scraping news 7 of 50
pertamina-turunkan-harga-pertamax-17_169.jpeg


 16%|█▌        | 8/50 [00:36<03:04,  4.39s/it]

Scraping news 8 of 50
zulkifli-hasan_169.jpeg


 18%|█▊        | 9/50 [00:42<03:24,  5.00s/it]

Scraping news 9 of 50
bnr-pertamina-2_169.jpeg


 20%|██        | 10/50 [00:47<03:16,  4.91s/it]

Scraping news 10 of 50
ilustrasi-batik_169.jpeg


 22%|██▏       | 11/50 [00:52<03:12,  4.95s/it]

Scraping news 11 of 50
tol-cimanggis-cibitung-seksi-2b-yang-akan-segera-bertarif-pada-2-agustus-2024-pukul-0000-wib-1_169.jpeg


 24%|██▍       | 12/50 [00:57<03:12,  5.05s/it]

Scraping news 12 of 50
ekanit-panya_169.jpeg


 26%|██▌       | 13/50 [01:02<03:00,  4.87s/it]

Scraping news 13 of 50
10bbab18-65c2-411d-b382-b1714c92fea5_169.jpg


 28%|██▊       | 14/50 [01:08<03:10,  5.30s/it]

Scraping news 14 of 50
tabrak-lari-berujung-kecelakaan-beruntun-di-surabaya-1-tewas-1_169.jpeg


 30%|███       | 15/50 [01:14<03:08,  5.40s/it]

Scraping news 15 of 50
thumbnail-video-2_169.jpeg


 32%|███▏      | 16/50 [01:17<02:44,  4.83s/it]

Scraping news 16 of 50
empat-orang-dilaporkan-meninggal-dunia-dalam-insiden-kecelakaan-bus-rombongan-siswa-smp-dengan-truk-bermuatan-di-km-77200-a-ar_169.jpeg


 34%|███▍      | 17/50 [01:21<02:26,  4.45s/it]

Scraping news 17 of 50
pembayaran-digital-qris-5_169.jpeg


 36%|███▌      | 18/50 [01:24<02:09,  4.04s/it]

Scraping news 18 of 50
ilustrasi-bendera-arab-saudi_169.jpeg


 38%|███▊      | 19/50 [01:29<02:15,  4.36s/it]

Scraping news 19 of 50
manchester-city-vs-manchester-united-7_169.jpeg


 40%|████      | 20/50 [01:33<02:05,  4.19s/it]

Scraping news 20 of 50


 42%|████▏     | 21/50 [01:34<01:35,  3.30s/it]

Scraping news 21 of 50
2cc2065f-d66e-49f5-8cdc-6474ad49537d_169.jpg


 44%|████▍     | 22/50 [01:38<01:39,  3.57s/it]

Scraping news 22 of 50
timnas-indonesia-melawan-arab-saudi-14_169.jpeg


 46%|████▌     | 23/50 [01:42<01:35,  3.53s/it]

Scraping news 23 of 50
suasana-pameran-automotif-gaikindo-indonesia-internasional-auto-show-giias-2024-1_169.jpeg


 48%|████▊     | 24/50 [01:46<01:36,  3.72s/it]

Scraping news 24 of 50
bri-10_169.jpeg


 50%|█████     | 25/50 [01:49<01:27,  3.51s/it]

Scraping news 25 of 50
sekjen-partai-gerindra-ahmad-muzani_169.jpeg


 52%|█████▏    | 26/50 [01:52<01:22,  3.43s/it]

Scraping news 26 of 50


 54%|█████▍    | 27/50 [01:55<01:13,  3.21s/it]

Scraping news 27 of 50
moses-itauma_169.jpeg


 56%|█████▌    | 28/50 [01:57<01:07,  3.05s/it]

Scraping news 28 of 50
218f9cf5-0dd8-4896-9679-2f6df74ac36a_169.jpg


 58%|█████▊    | 29/50 [02:00<01:00,  2.90s/it]

Scraping news 29 of 50
ilustrasi-logo-cnn-indonesia_169.jpeg


 60%|██████    | 30/50 [02:04<01:03,  3.17s/it]

Scraping news 30 of 50


 62%|██████▏   | 31/50 [02:05<00:52,  2.74s/it]

Scraping news 31 of 50
moses-itauma_169.jpeg


 64%|██████▍   | 32/50 [02:09<00:53,  2.99s/it]

Scraping news 32 of 50
218f9cf5-0dd8-4896-9679-2f6df74ac36a_169.jpg


 66%|██████▌   | 33/50 [02:13<00:57,  3.41s/it]

Scraping news 33 of 50
ilustrasi-logo-cnn-indonesia_169.jpeg


 68%|██████▊   | 34/50 [02:17<00:56,  3.54s/it]

Scraping news 34 of 50
britney-spears_169.jpeg


 70%|███████   | 35/50 [02:20<00:50,  3.38s/it]

Scraping news 35 of 50


 72%|███████▏  | 36/50 [02:27<01:00,  4.30s/it]

Scraping news 36 of 50
demo-tangkap-harun-masiku-di-kpk_169.jpeg


 74%|███████▍  | 37/50 [02:30<00:53,  4.09s/it]

Scraping news 37 of 50
bkpm-1_169.jpeg


 76%|███████▌  | 38/50 [02:35<00:51,  4.30s/it]

Scraping news 38 of 50


 78%|███████▊  | 39/50 [02:37<00:40,  3.72s/it]

Scraping news 39 of 50
44477bd9-caa2-4ac8-af82-f279121210f9_169.jpeg


 80%|████████  | 40/50 [02:43<00:41,  4.13s/it]

Scraping news 40 of 50
sidang-putusan-korupsi-tata-niaga-timah-7_169.jpeg


 82%|████████▏ | 41/50 [02:51<00:48,  5.36s/it]

Scraping news 41 of 50
803b1081-121c-4035-b949-47fd3f52f655_169.jpeg


 84%|████████▍ | 42/50 [02:55<00:39,  4.95s/it]

Scraping news 42 of 50
84479c03-6b7a-4e4a-828a-870b0875aeab_169.jpg


 86%|████████▌ | 43/50 [02:59<00:32,  4.69s/it]

Scraping news 43 of 50
taman-literasi-martha-christina-tiahahu-2_169.jpeg


 88%|████████▊ | 44/50 [03:03<00:27,  4.55s/it]

Scraping news 44 of 50
pertamina-1_169.jpeg


 90%|█████████ | 45/50 [03:07<00:21,  4.37s/it]

Scraping news 45 of 50
singapura-vs-thailand_169.jpeg


 92%|█████████▏| 46/50 [03:11<00:17,  4.35s/it]

Scraping news 46 of 50
e54a8e3f-7fa5-485c-92d4-4159e8d2b619_169.jpeg


 94%|█████████▍| 47/50 [03:17<00:13,  4.66s/it]

Scraping news 47 of 50
le-minerale_169.png


 96%|█████████▌| 48/50 [03:21<00:09,  4.67s/it]

Scraping news 48 of 50
fea34e6c-1626-4c2a-9518-05ab542fd9cf_169.jpg


 98%|█████████▊| 49/50 [03:25<00:04,  4.45s/it]

Scraping news 49 of 50
thumbnail-video-2_169.jpeg


100%|██████████| 50/50 [03:28<00:00,  4.17s/it]


In [None]:
"https://akcdn.detik.net.id/visual/2024/12/20/film-superman-2025-1_169.png?w=650&amp;q=80"

In [75]:
full_data

[{'news_title': 'Bupati Nonaktif Sidoarjo Muhdlor Ali Divonis 4,5 Tahun Penjara',
  'author': 'CNN Indonesia',
  'article': 'Bupati Sidoarjo\xa0nonaktif Ahmad Muhdlor Ali alias Gus Muhdlor\xa0divonis 4,5 tahun penjara dalam kasus korupsi pemotongan dana insentif pegawai atau ASN Badan Pelayanan Pajak Daerah (BPPD) Kabupaten Sidoarjo.\nAmar putusan itu dibacakan Ketua Majelis Hakim Ni Putu Sri Indayani dalam persidangan di Pengadilan Tipikor Surabaya, Senin (23/12).\n"Menjatuhkan pidana terhadap terdakwa dengan pidana selama 4 tahun dan 6 bulan penjara," kata Ketua Majelis Hakim Ni Putu Sri Indayani.\nSelain itu, Muhdlor juga dikenakan pidana denda Rp300 juta subsider 3 bulan kurungan dan pidana tambahan uang pengganti Rp1,4 miliar subsider 1,5 tahun penjara.\n\nMuhdlor, kata hakim, terbukti melanggar Pasal 12 huruf F jo pasal 18 UU Tipikor jo Pasal 55 ayat 1 ke-1 jo Pasal 64 ayat 1 KUHP, sesuai dengan dakwaan alternatif pertama.\nHakim menyatakan sejumlah pertimbangan yang membuat huku

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
full_data = []

for idx in tqdm(range(len(links))):
    try:
        get_link = links["url"].loc[idx]

        resp = requests.get(get_link)

        soup = BeautifulSoup(resp.text, "html.parser")

        get_title = soup.find("h1").text.strip()

        if 'FOTO:' in get_title or "SMASHOT:" in get_title:
            continue

        get_article = soup.find_all("p")

        final_article = ""

        for article in get_article:
            if "ADVERTISEMENT" not in article.text and "SCROLL TO CONTINUE WITH CONTENT" not in article.text:
                final_article += article.text + "\n"

        get_author = soup.find("span", class_ = "text-cnn_red").text

        news_created = soup.find("div", class_ = "text-cnn_grey text-sm mb-4").text.strip()

        news_data = {
            "news_title": get_title,
            "author": get_author,
            "article": final_article.strip(),
            "news_created": news_created,
            "scrapped_at": formatted_timestamp
        }

        full_data.append(news_data)

        time.sleep(0.5)

    except:
        raise Exception("There's some error")
```

</details>

---

In [50]:
full_data

[{'news_title': 'Manajer Timnas Respons #STYOut dan #ErickOut: Jangan Adu Domba',
  'author': 'CNN Indonesia',
  'article': 'Manajer Timnas Indonesia sekaligus Ketua Badan Tim Nasional (BTN) PSSI\xa0Sumardji menanggapi tanda pagar atau tagar \'#STYOut\' dan \'#ErickOut\' yang ramai di media sosial.\nTanda pagar \'#STYOut\' mulai ramai beredar begitu Timnas Indonesia gagal lolos ke babak semifinal Piala AFF 2024. Tak berselang lama tanda pagar \'#Erick Out\' juga menggema.\nBagi Sumardji, ini kontraproduktif. Menurutnya tidak perlu ada pihak yang dikambinghitamkan atas situasi saat ini. Ia berharap publik menahan diri untuk saling menyerang.\n"Ini harus diluruskan. Satu, kita harus membuat narasi yang bagus. Kritik tentu boleh, tapi tolong jangan benturkan pak Erick Thohir dengan coach Shin Tae Yong."\n"Kalau menurut saya keduanya tidak boleh diadu domba. Menurut saya STY harus tetap pegang Timnas dan pak Erick tetap di PSSI," kata Sumardji kepada CNNIndonesia.com, Senin (23/12).\nMenur

### **3c**
---

- Setelah berhasil melakukan Data Scraping, kita simpan hasilnya ke dalam `cnn_news_raw.csv`

In [76]:
cnn_news = pd.DataFrame(full_data)

cnn_news.to_csv("cnn_news_raw.csv", index = False)

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
cnn_news = pd.DataFrame(full_data)

cnn_news.to_csv("cnn_news_raw.csv", index = False)
```

</details>

---

### **4**
---

- Sekarang kita akan masuk ke dalam proses Data Wrangling
- Read data `cnn_news_raw.csv`

In [65]:
cnn_news = pd.read_csv("cnn_news_raw.csv")

# show full text
pd.set_option("display.max_colwidth", None)
cnn_news.image

0    https://akcdn.detik.net.id/visual/2024/03/14/soccer_169.jpeg?w=650&q=90
Name: image, dtype: object

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
cnn_news = pd.read_csv("cnn_news_raw.csv")

cnn_news.head()
```

</details>

---

Cek data shape dan tipe data dari data yang setelah di scrape

In [77]:
cnn_news.shape

(47, 6)

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
cnn_news.shape
```

</details>

---

In [78]:
cnn_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   news_title    47 non-null     object
 1   author        47 non-null     object
 2   article       47 non-null     object
 3   news_created  47 non-null     object
 4   scrapped_at   47 non-null     object
 5   image         44 non-null     object
dtypes: object(6)
memory usage: 2.3+ KB


<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
cnn_news.info()
```

</details>

---

## **5**
---

- Setelah dilihat, ternyata pada bagian `article` terdapat value `\n`
- Oleh karena itu, kita bisa ganti value nya dengan whitespace `" "`

In [79]:
# value before
cnn_news["article"].iloc[6]

'Wakil Menteri Energi Sumber Daya Mineral (ESDM)\xa0Yuliot\xa0Tanjung mengatakan stok\xa0BBM\xa0di wilayah Kota Medan, Sumatera\xa0Utara ditambah\xa0untuk mengantisipasi lonjakan permintaan\xa0selama perayaan Natal 2024 dan Tahun Baru\xa02025 (Nataru).\n"Kita menambah stok BBM di seluruh stasiun pengisian, karena adanya permintaan penambahan rata-rata sebanyak 5 persen dari suplai normal. Stok sudah ditambah dari suplai normal, sehingga tidak akan ada kelangkaan," ujar Yuliot saat kunjungan ke Medan, Senin (23/12/2024).\nKarenanya, meski permintaan meningkat,\xa0Yuliot memastikan ketersediaan BBM dan LPG selama periode Nataru ini dalam keadaan aman.\n"Kami dari posko nasional Kementerian ESDM melakukan pengecekan lapangan terhadap ketersediaan BBM dan LPG. Alhamdulillah ini aman untuk ketersediaan dalam rangka Natal 2024 dan menyambut Tahun Baru 2025. Ketersediaan distribusi dilakukan secara baik sehingga ketersediaan BBM dan LPG, relatif aman," ujar Yuliot.\n\nYuliot memastikan kebutu

In [81]:
cnn_news["article"] = cnn_news["article"].str.replace("\xa0", " ")

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
cnn_news["article"] = cnn_news["article"].str.replace("\n", " ")
```

</details>

---

In [82]:
# value after
cnn_news["article"].iloc[6]

'Wakil Menteri Energi Sumber Daya Mineral (ESDM) Yuliot Tanjung mengatakan stok BBM di wilayah Kota Medan, Sumatera Utara ditambah untuk mengantisipasi lonjakan permintaan selama perayaan Natal 2024 dan Tahun Baru 2025 (Nataru). "Kita menambah stok BBM di seluruh stasiun pengisian, karena adanya permintaan penambahan rata-rata sebanyak 5 persen dari suplai normal. Stok sudah ditambah dari suplai normal, sehingga tidak akan ada kelangkaan," ujar Yuliot saat kunjungan ke Medan, Senin (23/12/2024). Karenanya, meski permintaan meningkat, Yuliot memastikan ketersediaan BBM dan LPG selama periode Nataru ini dalam keadaan aman. "Kami dari posko nasional Kementerian ESDM melakukan pengecekan lapangan terhadap ketersediaan BBM dan LPG. Alhamdulillah ini aman untuk ketersediaan dalam rangka Natal 2024 dan menyambut Tahun Baru 2025. Ketersediaan distribusi dilakukan secara baik sehingga ketersediaan BBM dan LPG, relatif aman," ujar Yuliot.  Yuliot memastikan kebutuhan BBM dan LPG secara keseluruh

## **6**
---

- Simpanlah output yang sudah dilakukan proses Data Wrangling ke dalam table `cnn_news`
- Schema table
    ```sql
    CREATE TABLE public."cnn_news" (
	news_title varchar NULL,
	author varchar NULL,
	article varchar NULL,
	news_created varchar NULL,
	scrapped_at timestamp NULL
    );
    ```

In [83]:
from sqlalchemy import create_engine

In [84]:
# create connection to database
conn = create_engine("postgresql://postgres:aku@localhost/test")

cnn_news.to_sql(name = "cnn_news", con = conn,
                if_exists = "replace", index = False)

47

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
conn = create_engine("postgresql://postgres:cobapassword@localhost/data-wrangling")

cnn_news.to_sql(name = "cnn_news", con = conn,
                if_exists = "append", index = False)
```

</details>

---

- Validasi apakah datanya sudah tersimpan di database
- Lakukan query untuk mengambil datanya di database

In [None]:
pd.read_sql("SELECT * FROM cnn_news LIMIT 5", con = conn)

<details>
    <summary><b>Klik untuk melihat kunci jawaban</b></summary>

```python
pd.read_sql("SELECT * FROM cnn_news LIMIT 5", con = conn)
```

</details>

---