**DEMO - NEWS SCRAPING**

**Install Libraries**

In [1]:
%pip install newspaper3k "lxml[html_clean]" pandas openpyxl




**Import Libraries / Packages**

In [2]:
from newspaper import Article
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

**Define Functions (if any)**

In [3]:
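# Scrape every article listed on a Detik tag page into a DataFrame.
# tag_url must end with "page=" so the page number can be appended.
def scrape_news_tag(tag_url):

    page = 1
    news_data = []

    while True:
        url = f"{tag_url}{page}"
        response = requests.get(url, timeout=30)

        # The page > 1 check limits this demo to the first page;
        # drop it to keep crawling until the tag runs out of pages.
        if page > 1 or response.status_code != 200:
            print("Failed to access the main page or no more pages.")
            break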

        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("article")

        if not articles:
            print("No more articles found, stopping crawl.")
            break

        for article in articles:
            link_tag = article.find("a", href=True)
            if link_tag:
                article_url = link_tag["href"]
                try:
                    article_response = requests.get(article_url)
                    if article_response.status_code == 200:
                        article_soup = BeautifulSoup(
                            article_response.content, "html.parser"
                        )
                        title = (
                            article_soup.find("h1").text
                            if article_soup.find("h1")
                            else "No Title"
                        )
                        content = " ".join([p.text for p in article_soup.find_all("p")])

                        # Extract and format published date from detail__date class
                        date_div = article_soup.find("div", class_="detail__date")
                        if date_div:
                            raw_date = date_div.text.strip()
                            # Handle date format variations by extracting date using regex
                            date_match = re.search(
                                r"\d{2} \w+ \d{4} \d{2}:\d{2}", raw_date
                            )
                            if date_match:
                                try:
                                    # An unparseable date is coerced to NaT, and
                                    # NaT.strftime raises, so we fall back below
                                    published_date = pd.to_datetime(
                                        date_match.group(),
                                        format="%d %b %Y %H:%M",
                                        errors="coerce",
                                    ).strftime("%Y-%m-%d %H:%M:%S")
                                except Exception:
                                    published_date = "Unknown"
                            else:
                                published_date = "Unknown"
                        else:
                            published_date = "Unknown"

                        author = (
                            article_soup.find(class_="detail__author").text.strip()
                            if article_soup.find(class_="detail__author")
                            else "Unknown"
                        )
                        news_data.append(
                            {
                                "Title": title,
                                "Content": content,
                                "Published Date": published_date,
                                "Author": author,
                            }
                        )
                    else:
                        print(f"Failed to access article: {article_url}")
                except Exception as e:
                    print(f"Error scraping article: {article_url}, Error: {e}")

        print(f"Page {page} processed.")
        page += 1

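    # Convert to a DataFrame; the explicit column list keeps the date
    # conversion below from failing when nothing was scraped
    df = pd.DataFrame(
        news_data, columns=["Title", "Content", "Published Date", "Author"]
    )
    # "Unknown" dates become NaT; sorting is left to the caller
    df["Published Date"] = pd.to_datetime(df["Published Date"], errors="coerce")

    return df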
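
The `%b` directive used in `scrape_news_tag` matches English month abbreviations under the default locale, so Indonesian abbreviations that differ from English would be coerced to `"Unknown"`. The sketch below is one possible workaround, not part of the original notebook. The helper name `parse_detik_date`, the `ID_MONTHS` table, and the sample date strings are assumptions made for illustration; the exact strings Detik emits should be checked against live pages.

```python
import re
from datetime import datetime

# Assumed Indonesian month abbreviations (hypothetical table, verify against the site).
ID_MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "Mei": 5, "Jun": 6,
    "Jul": 7, "Agu": 8, "Sep": 9, "Okt": 10, "Nov": 11, "Des": 12,
}

# Same shape as the regex in scrape_news_tag, with capture groups added.
DATE_RE = re.compile(r"(\d{1,2}) (\w+) (\d{4}) (\d{2}):(\d{2})")


def parse_detik_date(raw_date):
    """Return a datetime for strings like '12 Feb 2025 12:23', or None."""
    match = DATE_RE.search(raw_date)
    if not match:
        return None
    day, month_name, year, hour, minute = match.groups()
    # Use the first three letters so full names such as 'Agustus' also resolve.
    month = ID_MONTHS.get(month_name[:3].capitalize())
    if month is None:
        return None
    try:
        return datetime(int(year), month, int(day), int(hour), int(minute))
    except ValueError:
        # Out-of-range values such as day 32 end up here.
        return None


print(parse_detik_date("Rabu, 12 Feb 2025 12:23 WIB"))
print(parse_detik_date("Senin, 05 Mei 2025 08:00 WIB"))
```

Returning `None` instead of the string `"Unknown"` lets `pd.to_datetime` turn missing dates into `NaT` without a second parsing pass.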

**Sample - Scrape from a pre-defined article**

In [4]:
url = "https://finance.detik.com/berita-ekonomi-bisnis/d-7774570/pengusaha-lirik-peluang-di-balik-ancaman-perang-dagang-trump"
article = Article(url)
article.download()
article.parse()
print("Title: ", article.title)
print("Content: ", article.text)

Title:  Pengusaha Lirik Peluang di Balik Ancaman Perang Dagang Trump
Content:  Presiden Amerika Serikat (AS) Donald Trump memulai perang dagang dengan menaikkan tarif impor di sejumlah negara, termasuk China. Soal ini, pihak pengusaha mengaku harus pintar-pintar dalam melihat peluang yang timbul dari perang dagang ini.

Ketua Dewan Pertimbangan Kamar Dagang dan Industri (Kadin) Indonesia, Arsjad Rasjid, mengatakan ada dua sisi yang bisa dilihat dari hingar-bingar perang dagang Trump.

"Yang dilakukan adalah lebih baik untuk Amerika sendiri. Untuk kita, kita lihat peluangnya. Misalnya, kalau mereka tidak mau beli produk China, kalau bisa dari Indonesia, bagaimana?" Ucap Arsjad saat ditemui di konferensi pers Indonesia Economic Summit, Jakarta, Rabu (12/2/2025).

ADVERTISEMENT SCROLL TO CONTINUE WITH CONTENT

Menurut Arsjad, dengan begitu, ada potensi ke depan bahwa pengusaha China akan lebih banyak investasi di Indonesia. Hal ini juga ditujukan supaya usaha tetap berjalan.

"Karena kala

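The extracted text above still contains the in-page marker `ADVERTISEMENT SCROLL TO CONTINUE WITH CONTENT`. A minimal sketch of one way to drop it before analysis follows; the helper name `strip_ad_marker` is hypothetical, and it assumes the marker always appears with exactly this wording.

```python
# Marker text copied from the sample output; assumed to be constant across articles.
AD_MARKER = "ADVERTISEMENT SCROLL TO CONTINUE WITH CONTENT"


def strip_ad_marker(text):
    """Remove the ad marker and rejoin the remaining paragraphs."""
    cleaned = text.replace(AD_MARKER, "")
    # Collapse runs of whitespace inside each line, then drop empty lines.
    paragraphs = [" ".join(line.split()) for line in cleaned.split("\n")]
    return "\n\n".join(p for p in paragraphs if p)


sample = "First paragraph.\n\n" + AD_MARKER + "\n\nSecond paragraph."
print(strip_ad_marker(sample))
```

With the cell above, this would be applied as `strip_ad_marker(article.text)`.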
**Sample - Scrape news articles tagged 'perang dagang' (trade war) from Detik**

In [5]:
# Scrape news articles tagged 'perang dagang' (trade war) from Detik.

tag_url = "https://www.detik.com/tag/perang-dagang/?sortby=time&page="

news_df = scrape_news_tag(tag_url)
news_df.sort_values(by="Published Date", ascending=False, inplace=True)
print(news_df.head())

Page 1 processed.
Failed to access the main page or no more pages.
                                               Title  \
0  \r\n        Pengusaha Lirik Peluang di Balik A...   
1  \r\n        IHSG Rontok hingga ke Level 6.500-...   
2  \r\n        Perang Dagang AS Vs China Bisa Unt...   
3  \r\n        Bisnis di Perbatasan AS Bisa Kocar...   
4  \r\n        Trump Bakal Umumkan Tarif Impor Ba...   

                                             Content      Published Date  \
0  Presiden Amerika Serikat (AS) Donald Trump mem... 2025-02-12 12:23:00   
1  Indeks Harga Saham Gabungan (IHSG) kembali ter... 2025-02-10 14:20:00   
2  Presiden Amerika Serikat (AS) Donald Trump mem... 2025-02-10 13:40:00   
3  Ketidakpastian seputar usulan tarif dan kebija... 2025-02-08 20:45:00   
4  Presiden Amerika Serikat (AS) Donald Trump ber... 2025-02-08 19:15:00   

                                Author  
0     Amanda Christabel - detikFinance  
1     Amanda Christabel - detikFinance  
2       Retno Ay
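
The `Title` column in the output above starts with `\r\n` and a run of spaces, because `.text` returns the heading's raw contents. A small post-processing step such as the sketch below could tidy it; `clean_titles` is a hypothetical helper, and the one-row frame only stands in for `news_df`.

```python
import pandas as pd


def clean_titles(df, column="Title"):
    """Return a copy of df with surrounding whitespace removed from one column."""
    cleaned = df.copy()
    cleaned[column] = cleaned[column].str.strip()
    return cleaned


# Stand-in for news_df, using the title from the single-article sample.
raw_df = pd.DataFrame(
    {"Title": ["\r\n        Pengusaha Lirik Peluang di Balik Ancaman Perang Dagang Trump    "]}
)
print(clean_titles(raw_df)["Title"].tolist())
```

Inside the scraper itself, `get_text(strip=True)` on the `h1` tag would have the same effect at extraction time.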

**Save to Excel (if necessary)**

In [6]:
# to_excel needs an Excel writer engine such as openpyxl installed
if not news_df.empty:
    news_df.to_excel("perang_dagang_news_detik.xlsx", index=False)
    print("News articles saved to perang_dagang_news_detik.xlsx")

News articles saved to perang_dagang_news_detik.xlsx
