   
# 01 Data Types

In this session, we’ll explore one of the most fundamental concepts in Python programming: Data Types. Understanding data types is essential because everything in Python is an object with a specific type.

📘 **What You’ll Learn:**

*   What data types are in Python
*   The most common built-in data types:
    *   `int` – integer numbers
    *   `float` – decimal numbers
    *   `str` – text data (strings)
    *   `bool` – Boolean values (True / False)
    *   `list`, `tuple`, `dict`, `set` – collection types
*   How to check a variable’s data type
*   Type conversion (casting)
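
As a quick preview of type checking and casting, here is a minimal sketch. The variable names (`price`, `price_int`, and so on) are made up for illustration and are not used elsewhere in this notebook.

```python
# Check a value's type with type() and isinstance()
price = "1500"                      # a number stored as text
print(type(price))                  # <class 'str'>
print(isinstance(price, str))       # True

# Convert (cast) between types
price_int = int(price)              # str -> int
price_float = float(price)          # str -> float
price_text = str(price_int + 500)   # int -> str
flag = bool(0)                      # 0 converts to False

print(price_int, price_float, price_text, flag)

# Casting fails when the text is not a valid number
try:
    int("Finance")
except ValueError as err:
    print("Cannot convert:", err)
```

Casting only works when the value is valid for the target type, which is why checking the content first (for example with `str.isdigit()`) is often useful when handling mixed data.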

In [None]:
#Example of Primitive Data Types - Introduction


# Example: Declare different types of variables
a = 10          # Integer
b = 3.14        # Float
c = "Hello"     # String
d = True        # Boolean

# Check their types
print(type(a))
print(type(b))
print(type(c))
print(type(d))

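print("\nOperations can produce different data types")
print(type(3 + 5.0))      # int + float -> float
print(type("3" + "5"))    # str + str -> str (concatenation gives "35")
print(type(10 > 3))       # a comparison -> bool

print("\nisinstance() returns a Boolean")
print(isinstance(3, float))   # False, because 3 is an int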

In [None]:
# TASK 01
# Goal        : Practice basic operations
# Description :
# - Practice using Gemini in Google Colab
# - You have a list with the items [1, '2', "Finance", "4", "News", 10]
#   Work out the logic to print the total of all numbers in the list,
#   counting both integers and numeric strings
# - Duration : 5 mins

data_list = [1, '2', "Finance", "4", "News", 10]
total_value = 0

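for item in data_list:
  # Numbers can be added directly
  if isinstance(item, (int, float)):
    total_value += item
  # Strings made up only of digits are converted to int first
  elif isinstance(item, str) and item.isdigit():
    total_value += int(item)

print(total_value)  # 1 + 2 + 4 + 10 = 17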


**Introducing another data type popular in data analysis: the DataFrame**

*   Provided by the pandas library, not a built-in Python type.
*   A class that represents tabular data (think of it as an Excel sheet).
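
To connect this with the data types above: each column of a DataFrame carries its own data type. The sketch below uses made-up sample values (the `views` and `is_finance` columns are hypothetical) purely to show how to inspect that.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
sample = pd.DataFrame({
    "newsportal": ["Detik", "Kompas"],
    "views": [1200, 950],
    "is_finance": [True, True],
})

print(type(sample))            # the object itself is a pandas DataFrame
print(sample.shape)            # (rows, columns)
print(sample.dtypes)           # one data type per column
print(type(sample["views"]))   # a single column is a pandas Series
```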





In [None]:
# Create an empty DataFrame with pandas (1)
import pandas as pd

# A completely empty DataFrame (no rows, no columns)
df = pd.DataFrame()

# Re-create it with named columns so rows can be added
df = pd.DataFrame(columns=["newsportal", "title", "date"])

newsportal = "Detik"
title = "BI Ungkap Strategi Menjaga Inflasi"
date = "2025-05-22"

# Add the values as a new row; len(df) is the next free row label
df.loc[len(df)] = [newsportal, title, date]

# View the DataFrame
df

In [None]:
# Create dataframe from pandas
import pandas as pd

# Create a DataFrame with news portal data
df = pd.DataFrame({
    'newsportal': ['Detik', 'Kompas'],
    'title': [
        'BI Ungkap Strategi Menjaga Inflasi',
        'Kerangka Ekonomi Makro 2026: Pertumbuhan Ekonomi 5,2 hingga 5,8 Persen'
    ],
    'date': ['2025-05-22', '2025-05-20']
})

# Print the type of the DataFrame
print(type(df))

In [None]:
# Show dataframe
df.head()

In [None]:
# Save to CSV in the Colab working directory
df.to_csv("news_articles.csv", index=False)

# Save to Excel (requires openpyxl for .xlsx)
df.to_excel("news_articles.xlsx", index=False)

In [None]:
# TASK 02
# Goal        : Practice loading a CSV / Excel file into Google Colab
# Description :
# - Create an Excel sheet with a header row, 3 columns, and 3 rows
# - Upload it to the Colab directory (folder icon on the left side)
# - Use Gemini to get the syntax for loading the file into this notebook
# - Duration : 10 mins

# [Your script here]


# 02 String Operations

📌 If you can’t handle strings, you can’t clean scraped data.

String operations are your basic toolbox for transforming raw scraped content into clean, usable data — whether you’re building a dataset, summarizing news, or tracking trends.
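
As an illustration of that toolbox, the sketch below chains a few common cleaning steps on a made-up string. The text and the `example.com` URL are hypothetical, and the regular expression is one simple way to drop a URL, not the only one.

```python
import re

# Hypothetical scraped text, used only for illustration
raw = "  Sample Headline About Inflation | Read more at https://example.com/news/123  "

text = raw.strip()                           # drop leading/trailing whitespace
text = re.sub(r"https?://\S+", "", text)     # remove anything that looks like a URL
text = text.replace("| Read more at", "")    # remove the leftover call-to-action
text = " ".join(text.split())                # collapse repeated whitespace
print(text.lower())                          # sample headline about inflation
```

Each step returns a new string, because Python strings are immutable; that is why the result is assigned back to `text` every time.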

In [None]:
# TASK 03
# Goal        : Practice string operations as a foundation for text (scraping) analysis
# Description :
# Use Gemini to get the syntax for the use cases below:
# Clean the headline - remove leading/trailing whitespace, then convert to lowercase
# Expected print result : "bi ungkap strategi menjaga inflasi"

# Extract the author’s name from author_info
# Expected print result : "Retno Ayuningrum"

# Extract the date from author_info
# Expected print result : "22 Mei 2025"

# Keep only the sentence, not the URL
# Expected print result : "Bank Indonesia (BI) menyampaikan bahwa meskipun inflasi global diperkirakan meningkat pada 2025, inflasi di Indonesia tetap stabil dan berada dalam sasaran yang ditetapkan, berkat sinergi antara BI dan pemerintah melalui Tim Pengendalian Inflasi Pusat dan Daerah (TPIP-TPID) dalam menjaga stabilitas harga pangan."


headline = "   BI Ungkap Strategi Menjaga Inflasi   "
author_info = "By Retno Ayuningrum - 22 Mei 2025"
article_snippet = "Bank Indonesia (BI) menyampaikan bahwa meskipun inflasi global \
diperkirakan meningkat pada 2025, inflasi di Indonesia tetap stabil dan berada dalam sasaran yang ditetapkan, \
berkat sinergi antara BI dan pemerintah melalui Tim Pengendalian Inflasi Pusat dan Daerah (TPIP-TPID) \
dalam menjaga stabilitas harga pangan. \
 Read more at https://finance.detik.com/berita-ekonomi-bisnis/d-7926539/bi-ungkap-strategi-menjaga-inflasi"

# [Your script here]

# Clean the headline - Remove leading/trailing whitespace, then Convert to lowercase
cleaned_headline = headline.strip().lower()
print(cleaned_headline)

# Extract the author’s name from author_info
# Split the string by ' - ', take the first element, then remove 'By '
author_name = author_info.split(' - ')[0].replace('By ', '')
print(author_name)

# Extract the date from author_info
# Split the string by ' - ' and take the second element
date_info = author_info.split(' - ')[1]
print(date_info)


# Keep only the sentence, not the URL
# Find the index of 'Read more at' and slice the string before that
url_start_index = article_snippet.find(" Read more at")
if url_start_index != -1:
  article_sentence = article_snippet[:url_start_index].strip()
else:
  article_sentence = article_snippet.strip()

print(article_sentence)


In [None]:
# Sample news scraping script for a single article from one news portal, using BeautifulSoup

import requests
from bs4 import BeautifulSoup

# URL of the Detik Finance article
url = "https://finance.detik.com/berita-ekonomi-bisnis/d-7926539/bi-ungkap-strategi-menjaga-inflasi"

# Send a GET request to the URL
response = requests.get(url)
response.raise_for_status()  # Raise an exception for HTTP errors

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Extract the title
title_tag = soup.find("h1")
title = title_tag.get_text(strip=True) if title_tag else "Title not found"

# Extract the author from the meta tag
author_meta = soup.find("meta", attrs={"name": "author"})
author = author_meta["content"] if author_meta else "Author not found"

# Extract the date
date_tag = soup.find("div", class_="detail__date")
date = date_tag.get_text(strip=True) if date_tag else "Date not found"

# Extract the article content
content_div = soup.find("div", class_="detail__body-text")
if content_div:
    paragraphs = content_div.find_all("p")
    content = "\n".join(p.get_text(strip=True) for p in paragraphs)
else:
    content = "Content not found"

# Display the extracted information
print("Title:", title)
print("Author:", author)
print("Date:", date)
print("Content:\n", content)


In [None]:
# TASK 04
# Goal        : Execute String Operations (real example)
# Description :
# Practice using Gemini to get the syntax for below use cases:
# Extract the date from date variable "Kamis, 22 Mei  2025 12:43 WIB."
# Duration : 10mins

date="Kamis, 22 Mei  2025 12:43 WIB."

# [Your script here]


# Split the string by comma and take the second part
date_part = date.split(',')[1].strip()

# Split the date part by space and take the first three parts (day, month, year)
date_components = date_part.split()[:3]

# Join the components back with a space
extracted_date = ' '.join(date_components)

print(extracted_date)

In [None]:
# TASK 05
# Goal        : Create a DataFrame that contains Title, Author, Date, and URL
# Description :
# Create a variable detik_data that holds a dictionary
# Create a DataFrame called df_news that contains the Detik article's title, author, date, and url (4 columns)
# Duration : 10mins

import pandas as pd

# [Your script here]

detik_data = {
    'Title': [title],
    'Author': [author],
    'Date': [date],
    'URL': [url]
}

# Create the DataFrame
df_news = pd.DataFrame(detik_data)
df_news


# 03 Scrape a single news page

Beautiful Soup parses the HTML of a web page so that you can extract content such as article titles, links, and dates. It treats the HTML as a tree of elements that you can search.
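
To make the "tree of elements" idea concrete, here is a minimal sketch that walks several repeated elements and reads both their text and an attribute. The HTML, class names, and `example.com` URLs are invented for illustration and do not match any real news portal.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
sample_html = """
<div class="list">
  <article><a href="https://example.com/a1">First article</a><span class="date">22 Mei 2025</span></article>
  <article><a href="https://example.com/a2">Second article</a><span class="date">23 Mei 2025</span></article>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
# find_all returns every matching element; find returns only the first match
for article in sample_soup.find_all("article"):
    link_tag = article.find("a")
    date_tag = article.find("span", class_="date")
    records.append({
        "title": link_tag.get_text(strip=True),   # text inside the tag
        "url": link_tag["href"],                  # value of the href attribute
        "date": date_tag.get_text(strip=True),
    })

print(records)
```

Searching inside each `article` element, rather than the whole page, keeps the title, URL, and date of one article together in the same record.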

In [None]:
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Hello, world!</h1>
    <p class="intro">Welcome to scraping</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract elements
print(soup.h1.text)                # Hello, world!
print(soup.find("p").text)         # Welcome to scraping


In [None]:
# TASK 06
# Goal        : Practice news scraping with BeautifulSoup
# Description :
# Using the scraping syntax provided previously, scrape an article from Kompas
# [Attention] Use URL = https://money.kompas.com/read/2025/05/20/145400226/kerangka-ekonomi-makro-2026--pertumbuhan-ekonomi-5-2-hingga-5-8-persen
# Why are Date and Content not found? [Hint] Inspect the page and notice that the content sits inside a <script> tag
# Duration : 15 mins

import requests
from bs4 import BeautifulSoup

import re

# URL of the Kompas article
url = "https://money.kompas.com/read/2025/05/20/145400226/kerangka-ekonomi-makro-2026--pertumbuhan-ekonomi-5-2-hingga-5-8-persen"

# [Your script here]

# Send a GET request to the URL
response = requests.get(url)
response.raise_for_status()  # Raise an exception for HTTP errors

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Extract the title
title_tag = soup.find("h1")
title = title_tag.get_text(strip=True) if title_tag else "Title not found"

# Extract the author from the meta tag
author_meta = soup.find("meta", attrs={"name": "author"})
author = author_meta["content"] if author_meta else "Author not found"

# Extract the date
date_tag = soup.find("div", class_="read__time")
date = date_tag.get_text(strip=True) if date_tag else "Date not found"

# Find all script tags
script_tags = soup.find_all("script")

# Default value, so the print below still works if the variable is not found
content = "Content not found"

# Iterate through script tags and search for the variable
for script in script_tags:
    if script.string and "var keywordBrandSafety" in script.string:
        # Use regex to find the value of keywordBrandSafety
        match = re.search(r'var keywordBrandSafety\s*=\s*"(.*?)";', script.string)
        if match:
            # Extract the captured group (the content within the quotes)
            content = match.group(1)
            break  # Stop searching once found

# Display the extracted information
print("Title:", title)
print("Author:", author)
print("Date:", date)
print("Content:\n", content)

In [None]:
# TASK 07
# Goal        : Append to the existing DataFrame (do not create a new one)
# Description :
# Append the Kompas article fields from the result above (Title, Author, Date, and URL) to df_news (created previously).
# Expected result : df_news now has 4 columns and 2 rows
# Duration : 10mins

# [Your script here]

# Create a dictionary for the Kompas article data
kompas_data = {
    'Title': [title],
    'Author': [author],
    'Date': [date],
    'URL': [url]
}

# Create a new DataFrame from the Kompas data
df_kompas = pd.DataFrame(kompas_data)

# Append the Kompas DataFrame to the existing df_news DataFrame
# Use ignore_index=True to renumber the index after appending
df_news = pd.concat([df_news, df_kompas], ignore_index=True)

# Display the updated DataFrame
df_news



# 04 Scrape a news portal with a keyword

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Step 1: Define search URL
search_url = "https://www.detik.com/search/searchall?query=inflasi&siteid=29&source_kanal=true"

# Step 2: Fetch search result page
response = requests.get(search_url)
soup = BeautifulSoup(response.text, "html.parser")

# Step 3: Extract top 5 article URLs
article_links = []
for tag in soup.select("article a[href]"):
    url = tag["href"]
    if url.startswith("https://") and url not in article_links:
        article_links.append(url)
    if len(article_links) == 5:
        break

print(f"\nFound {len(article_links)} article links.\n")

# Step 4: Prepare a list to collect all articles
rows = []

# Step 5: Scrape each article
for link in article_links:
    try:
        article_resp = requests.get(link)
        article_soup = BeautifulSoup(article_resp.text, "html.parser")

        title = article_soup.find("h1").get_text(strip=True) if article_soup.find("h1") else "No Title"
        author_meta = article_soup.find("meta", attrs={"name": "author"})
        author = author_meta["content"] if author_meta else "No Author"
        date_tag = article_soup.find("div", class_="detail__date")
        date = date_tag.get_text(strip=True) if date_tag else "No Date"

        # Print info
        print("Title :", title)
        print("Author:", author)
        print("Date  :", date)
        print("URL   :", link)
        print("-" * 60)

        # Append to list
        rows.append({
            "title": title,
            "author": author,
            "date": date,
            "url": link
        })

        time.sleep(1)
    except Exception as e:
        print(f"Error scraping {link}: {e}")

# Step 6: Convert to DataFrame
df_multinews= pd.DataFrame(rows)
df_multinews


In [None]:
# TASK 08
# Goal        : Scrape news from a single news portal using multiple keywords
# Description :
# Create a list of keywords, keyword_list=['inflasi','perang dagang'], to search on Detik Finance
# Create a function that wraps the news scraping script and returns the top 3 articles per keyword
# Create a loop that reads every item in the list and passes it to the function
# Duration : 10 mins

keyword_list=['inflasi','perang dagang']

def extract_multikeyword(keyword):

  # Step 1: Define search URL
  search_url = f"https://www.detik.com/search/searchall?query={keyword}&siteid=29&source_kanal=true"

  # Step 2: Fetch search result page
  response = requests.get(search_url)
  soup = BeautifulSoup(response.text, "html.parser")

  # Step 3: Extract top 3 article URLs
  article_links = []
  for tag in soup.select("article a[href]"):
      url = tag["href"]
      if url.startswith("https://") and url not in article_links:
          article_links.append(url)
      if len(article_links) == 3:
          break

  print(f"\nFound {len(article_links)} article links.\n")

  # Step 4: Prepare a list to collect all articles
  rows = []

  # Step 5: Scrape each article
  for link in article_links:
      try:
          article_resp = requests.get(link)
          article_soup = BeautifulSoup(article_resp.text, "html.parser")

          title = article_soup.find("h1").get_text(strip=True) if article_soup.find("h1") else "No Title"
          author_meta = article_soup.find("meta", attrs={"name": "author"})
          author = author_meta["content"] if author_meta else "No Author"
          date_tag = article_soup.find("div", class_="detail__date")
          date = date_tag.get_text(strip=True) if date_tag else "No Date"

          # Print info
          print("Title :", title)
          print("Author:", author)
          print("Date  :", date)
          print("URL   :", link)
          print("-" * 60)

          # Append to list
          rows.append({
              "title": title,
              "author": author,
              "date": date,
              "url": link
          })

          time.sleep(1)
      except Exception as e:
          print(f"Error scraping {link}: {e}")

  return rows

# Collect the rows returned for every keyword into a single list
rows_all = []
for keyword in keyword_list:
  result = extract_multikeyword(keyword)
  rows_all.extend(result)

df_multinewskeyword = pd.DataFrame(rows_all)
df_multinewskeyword

In [None]:
# TASK 09
# Goal        : Sort df_multinewskeyword by date
# Description :
# [Hint] The date comes in a different format and should be split first.
# [Prompt Example] How to split df_multinewskeyword['date'] to extract
#                  only "14 Mei 2025" from "Rabu, 14 Mei 2025 11:12 WIB", then convert it to a date type.
#                  The month is an Indonesian text abbreviation (Mei, Apr, Nov, Okt), so it needs a mapping.
# Duration : 15mins

# [Your script here]

import pandas as pd


# Assuming df_multinewskeyword DataFrame already exists with the 'date' column

# Manual mapping for 3-character Indonesian month abbreviations to English full names
month_abbr_to_english = {
    'Jan': 'January',
    'Feb': 'February',
    'Mar': 'March',
    'Apr': 'April',
    'Mei': 'May',
    'Jun': 'June',
    'Jul': 'July',
    'Agu': 'August',
    'Sep': 'September',
    'Okt': 'October',
    'Nov': 'November',
    'Des': 'December'
}


# Function to extract, map month, and convert to date
def extract_map_and_convert_date(date_str):
    if date_str == "No Date":
        return pd.NaT # Return Not a Time for missing dates

    try:
        # Split by comma and take the second part (which contains date and time)
        date_time_part = date_str.split(',')[1].strip()

        # Split the date_time_part by space
        parts = date_time_part.split()

        if len(parts) < 3:  # Ensure we have at least day, month, year
            return pd.NaT

        # Extract day, month abbreviation, and year
        day = parts[0]
        month_abbr_id = parts[1]
        year = parts[2]

        # Map Indonesian month abbreviation to English full name
        month_en = month_abbr_to_english.get(month_abbr_id, month_abbr_id) # Use original if not found

        # Construct a string in a format that pd.to_datetime can easily understand
        # using the full English month name
        cleaned_date_str = f"{day} {month_en} {year}"

        # Convert to datetime using pd.to_datetime
        # Use the format string that matches "14 May 2025"
        return pd.to_datetime(cleaned_date_str, format='%d %B %Y', errors='coerce')

    except Exception as e:
        print(f"Error processing date '{date_str}': {e}")
        return pd.NaT # Return Not a Time if conversion fails


# Apply the function to the 'date' column and store in a new column
df_multinewskeyword['date_converted_mapped'] = df_multinewskeyword['date'].apply(extract_map_and_convert_date)

# Display the DataFrame with the new converted date column
print(df_multinewskeyword[['title', 'date', 'date_converted_mapped']])

# Now you can sort by the 'date_converted_mapped' column
df_multinewskeyword_sorted = df_multinewskeyword.sort_values(by='date_converted_mapped', ascending=True)

df_multinewskeyword_sorted


In [None]:
# TASK 10
# Goal        : Save the final result to the local desktop
# Description : Save df_multinewskeyword_sorted to "detik_finance_news_sorted.xlsx" and download it to the local desktop


# [Your script here]

# Define the filename for the Excel file
excel_filename = "detik_finance_news_sorted.xlsx"

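# Save the DataFrame to an Excel file in the Colab working directory
# index=False prevents writing the DataFrame index as a column in the Excel file
df_multinewskeyword_sorted.to_excel(excel_filename, index=False)

# Download the file from Colab to the local machine (works only inside Google Colab)
from google.colab import files
files.download(excel_filename)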
