<a href="https://colab.research.google.com/github/ms624atyale/NLP_2025/blob/main/12_Crawling_Saving.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color = 'red'> 🐹 👀 🐾 **Crawling or Text Mining or Scraping**</font>

In [2]:
pip install requests



In [9]:
import requests

def get_wikipedia_page(title):
    URL = "https://en.wikipedia.org/w/api.php"

    PARAMS = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": title,
        "explaintext": 1
    }

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/123.0 Safari/537.36"
    }

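    try:
        # A timeout keeps the call from hanging indefinitely on a stalled connection
        response = requests.get(URL, params=PARAMS, headers=headers, timeout=10)
    except requests.RequestException as e:
        print("Request failed:", e)
        return None

    if response.status_code != 200:
        print("HTTP error:", response.status_code)
        return None

    try:
        data = response.json()
    except ValueError:  # raised when the body is not valid JSON
        print("JSON decode error")
        print("Raw response:", response.text[:500])
        return None

    # The API keys "pages" by page ID, so take the first (and only) entry;
    # fall back to an empty dict if the response contains no pages.
    pages = data.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), {})
    return page.get("extract", "")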

In [8]:
text = get_wikipedia_page("KPop Demon Hunters")
print(text[:500])

KPop Demon Hunters is a 2025 American animated musical urban fantasy film directed by Maggie Kang and Chris Appelhans from a screenplay they co-wrote with Danya Jimenez and Hannah McMechan, based on a story conceived by Kang. Produced by Sony Pictures Animation for Netflix, the film stars the voices of Arden Cho, Ahn Hyo-seop, May Hong, Ji-young Yoo, Yunjin Kim, Daniel Dae Kim, Ken Jeong, and Lee Byung-hun. The film follows a K-pop girl group, Huntr/x, who lead double lives as demon hunters; the


In [15]:
titles = [
    "K-pop",
    "Korean Wave",
    "KPop Demon Hunters",
    "Hybe",
    "BTS",
    "2024 Nobel Prize in Literature",
    "Han Kang",
    "Bong Joon Ho",
    "Pachinko",
    "Minjung Son"
]

corpus = {}

for t in titles:
    txt = get_wikipedia_page(t)
    if txt:
        corpus[t] = txt
    else:
        print("Failed:", t)

# Show first 200 chars for each
for title, text in corpus.items():
    print("\n====", title, "====")
    print(text[:200])

Failed: Minjung Son

==== K-pop ====
K-pop (Korean: 케이팝; RR: Keipap; an abbreviation of "Korean popular music") is a form of popular music originating in South Korea. The music genre that the term is used to refer to colloquially emerged

==== Korean Wave ====
The Korean Wave, or hallyu (Korean: 한류; ), is the dramatic rise in global interest in South Korean popular culture since the 1990s—led by K-pop, K-dramas, and films, with keystone successes including 

==== KPop Demon Hunters ====
KPop Demon Hunters is a 2025 American animated musical urban fantasy film directed by Maggie Kang and Chris Appelhans from a screenplay they co-wrote with Danya Jimenez and Hannah McMechan, based on a

==== Hybe ====
Hybe Co., Ltd. (Korean: 하이브; haibeu), commonly known as simply Hybe, is a South Korean multinational entertainment company established in 2005 by Bang Si-hyuk as Big Hit Entertainment Co., Ltd.
The co

==== BTS ====
BTS (Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit
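
The `Failed:` line in the output above appears whenever `get_wikipedia_page` returns an empty string, which happens when the API has no extract for the requested title. The sketch below shows one way to tell a nonexistent title apart from a page whose text is simply empty. `classify_response` is a hypothetical helper, and the sample dictionaries are illustrative stand-ins (not captured from a real request) for the JSON the API returns in its default format, where a nonexistent title is reported with a `missing` key.

```python
def classify_response(data):
    """Return (status, text) for a parsed query response (hypothetical helper)."""
    pages = data.get("query", {}).get("pages", {})
    if not pages:
        # Nothing to inspect, e.g. the API returned an error payload
        return "no_pages", ""
    page = next(iter(pages.values()))
    if "missing" in page:
        # The title does not exist on the wiki
        return "missing", ""
    text = page.get("extract", "")
    return ("ok" if text else "empty"), text


# Illustrative payloads with assumed shapes
missing_payload = {"query": {"pages": {"-1": {"ns": 0, "title": "No Such Page", "missing": ""}}}}
found_payload = {"query": {"pages": {"123": {"pageid": 123, "title": "Example", "extract": "Example text."}}}}

print(classify_response(missing_payload))
print(classify_response(found_payload))
```

Returning a status alongside the text lets the calling loop report why a title failed instead of printing a generic message.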

✅ Script A: Create one TXT file per title

- All scripts use the Wikipedia API.

In [16]:
import os

os.makedirs("wiki_txts", exist_ok=True)

for title in titles:
    txt = get_wikipedia_page(title)
    if not txt:
        print(f"Skipping: {title}")
        continue

    fname = title.replace(" ", "_").replace("'", "") + ".txt"
    path = os.path.join("wiki_txts", fname)

    with open(path, "w", encoding="utf-8") as f:
        f.write(txt)

    print(f"Saved: {path}")

Saved: wiki_txts/K-pop.txt
Saved: wiki_txts/Korean_Wave.txt
Saved: wiki_txts/KPop_Demon_Hunters.txt
Saved: wiki_txts/Hybe.txt
Saved: wiki_txts/BTS.txt
Saved: wiki_txts/2024_Nobel_Prize_in_Literature.txt
Saved: wiki_txts/Han_Kang.txt
Saved: wiki_txts/Bong_Joon_Ho.txt
Saved: wiki_txts/Pachinko.txt
Skipping: Minjung Son
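
Script A builds file names by replacing spaces and dropping apostrophes, which works for the titles in this list. A title containing a character such as `/` would be treated as a path separator and the `open` call could fail. The following is a minimal sketch of a stricter conversion; `safe_filename` is a hypothetical helper, and the set of replaced characters is an assumption based on characters commonly rejected in file names.

```python
import re


def safe_filename(title, ext=".txt"):
    """Turn a page title into a filesystem-friendly file name (hypothetical helper)."""
    # Collapse runs of whitespace into one underscore, as Script A does for spaces
    name = re.sub(r"\s+", "_", title.strip())
    # Replace characters that are invalid or risky in file names on common systems
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    # Drop apostrophes, matching the original script
    name = name.replace("'", "")
    return name + ext


print(safe_filename("Korean Wave"))
print(safe_filename("AC/DC"))
```

In Script A this would stand in for the `title.replace(...)` expression, with the rest of the loop unchanged.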


✅ Script B: Create one TXT file with records separated by `@@@@@` lines

In [18]:
output = []

for title in titles:
    txt = get_wikipedia_page(title)
    if not txt:
        txt = ""   # store empty if missing
    block = f"@@@@@\nTITLE: {title}\n{txt}\n"
    output.append(block)

final = "\n".join(output)

with open("wiki_corpus_delimited.txt", "w", encoding="utf-8") as f:
    f.write(final)

print("Saved: wiki_corpus_delimited.txt")

Saved: wiki_corpus_delimited.txt
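
To check that the `@@@@@` layout can be read back, the sketch below splits such text into records again. `parse_delimited` is a hypothetical helper. It assumes the exact layout Script B writes (a separator line, then a `TITLE:` line, then the text) and that no article contains a line consisting only of `@@@@@`.

```python
def parse_delimited(raw, sep="@@@@@"):
    """Split Script B style text into a {title: text} dict (hypothetical helper)."""
    records = {}
    title, lines = None, []
    for line in raw.split("\n"):
        if line == sep:
            # A separator closes the previous record, if there was one
            if title is not None:
                records[title] = "\n".join(lines).strip()
            title, lines = None, []
        elif title is None and line.startswith("TITLE: "):
            # The first line after a separator carries the title
            title = line[len("TITLE: "):]
        else:
            lines.append(line)
    # Store the final record, which has no trailing separator
    if title is not None:
        records[title] = "\n".join(lines).strip()
    return records


# Small inline sample in the same layout as wiki_corpus_delimited.txt
sample = "@@@@@\nTITLE: K-pop\nFirst line.\nSecond line.\n\n@@@@@\nTITLE: Missing page\n\n"
print(parse_delimited(sample))
```

For the real file, pass the result of `open("wiki_corpus_delimited.txt", encoding="utf-8").read()` as `raw`. Note that leading and trailing whitespace of each record is stripped.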


✅ Script C: Create one CSV with two columns (title + text)

# 🐹 🐾 📌 **Use this!!!** 📌

In [17]:
import csv

rows = []

for title in titles:
    txt = get_wikipedia_page(title)
    rows.append([title, txt])

with open("wiki_corpus.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "text"])
    writer.writerows(rows)

print("Saved: wiki_corpus.csv")

Saved: wiki_corpus.csv
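
Reading the CSV back with Python's `csv` module needs one extra step, because the module rejects any field longer than its default limit of 131072 characters and a full article can exceed that. The sketch below raises the limit before reading; `load_corpus_csv` and the `max_field_chars` value are assumptions chosen for illustration.

```python
import csv
import os


def load_corpus_csv(path, max_field_chars=10_000_000):
    """Load a title/text CSV into a {title: text} dict (hypothetical helper)."""
    # Raise the per-field limit so long article texts do not trigger csv.Error
    csv.field_size_limit(max_field_chars)
    # newline="" lets the csv module handle line breaks inside quoted fields
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        return {row["title"]: row["text"] for row in reader}


# Only runs when the file written by Script C is present
if os.path.exists("wiki_corpus.csv"):
    loaded = load_corpus_csv("wiki_corpus.csv")
    for title, text in loaded.items():
        print(title, len(text))
```

Titles whose fetch failed come back as empty strings, since `csv.writer` writes `None` as an empty field.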
