<a href="https://colab.research.google.com/github/mufi2/LLM-Engineer/blob/main/OpenAI_webscrapper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import userdata

api_key = userdata.get("OPENAI_API_KEY")

assert api_key is not None, "API key not found"
print("API key loaded.")  # avoid echoing any part of the secret into saved notebook output


API key loaded.


In [None]:
!pip -q install trafilatura requests


In [None]:
import requests
import trafilatura

def get_site_text(url: str, *, timeout: int = 20, max_chars: int = 1500000) -> str:
    """
    Fetch a webpage and extract its main readable text.
    Best for article-like pages. Returns plain text.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36"
    }

    r = requests.get(url, headers=headers, timeout=timeout)
    r.raise_for_status()

    text = trafilatura.extract(r.text)
    if not text:
        return "Could not extract text (page may be JS-rendered or blocked)."

    text = text.strip()
    if len(text) > max_chars:
        text = text[:max_chars].rstrip() + "\n\n[...truncated...]"
    return text
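
`get_site_text` truncates at `max_chars`, so everything past the cutoff is dropped. For long pages, one alternative is to split the extracted text into chunks and summarize each chunk separately. The following is a minimal sketch under that assumption; `chunk_text` and `chunk_chars` are hypothetical names that are not part of this notebook, and the 8000-character default is an arbitrary illustration rather than a recommended value.

```python
def chunk_text(text: str, chunk_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most chunk_chars characters.

    Hypothetical helper: prefers to break at line boundaries and only
    hard-splits a line when that single line exceeds the limit.
    """
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")

    chunks: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into fixed-size pieces.
        while len(line) > chunk_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:chunk_chars])
            line = line[chunk_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > chunk_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks reproduces the original text exactly, so no content is lost; each chunk could then be passed to the model in turn and the partial summaries combined afterwards.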


In [None]:
print(get_site_text("https://en.wikipedia.org/wiki/Bangladesh"))

Bangladesh
People's Republic of Bangladesh | |
|---|---|
| Anthem: আমার সোনার বাংলা (Bengali) Amar Sonar Bangla "My Golden Bengal" | |
| Government Seal | |
| Capital and largest city | Dhaka 23°45′50″N 90°23′20″E / 23.76389°N 90.38889°E |
| Official language and national language | Bengali[1][2] |
| Recognised foreign language | English[3][4] |
| Ethnic groups (2022)[5] | 99% Bengali |
| Religion (2022)[6] | |
| Demonym | Bangladeshi |
| Government | Unitary parliamentary republic under an interim government |
| Mohammed Shahabuddin | |
| Muhammad Yunus | |
| Zubayer Rahman Chowdhury | |
| Legislature | Jatiya Sangsad (currently suspended) |
| Independence from Pakistan | |
| 15 August 1947 | |
| 14 October 1955 | |
| 26 March 1971 | |
| 10 April 1971 | |
• Victory in the Liberation War | 16 December 1971 |
| 16 December 1972 | |
| Area | |
• Total | 148,460[7] km2 (57,320 sq mi) (92nd) |
• Water (%) | 6.4 |
• Land area | 130,170 km2[8] |
• Water area | 18,290 km2[8] |
| Population | 

In [None]:
import os
from dotenv import load_dotenv

def load_and_check_api_key():
    """
    Loads OPENAI_API_KEY from:
    - .env / environment variables (local)
    - Colab Secrets (if running in Colab)

    Prints a clear diagnostic message for each failure case.
    """
    load_dotenv(override=True)

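    # Try Colab Secrets first
    api_key = None
    try:
        from google.colab import userdata
        api_key = userdata.get("OPENAI_API_KEY")
    except Exception:
        # Not running in Colab, or the secret could not be read
        # (for example, it is missing or not shared with this notebook).
        pass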

    # Fallback to environment (.env / shell)
    if not api_key:
        api_key = os.getenv("OPENAI_API_KEY")

    # ---- Checks ----
    if not api_key:
        print("❌ No API key was found. Please add OPENAI_API_KEY.")
        return None

    # Check whitespace before the prefix: a leading space would also fail
    # startswith("sk-") and produce a misleading message.
    if api_key.strip() != api_key:
        print("⚠️ API key has leading or trailing spaces. Please remove them.")
        return None

    if not api_key.startswith("sk-"):
        print("⚠️ API key was found, but it doesn't look like a valid OpenAI key (should start with 'sk-').")
        return None

    print("✅ API key found and looks good so far!")
    return api_key


In [None]:
api_key = load_and_check_api_key()


✅ API key found and looks good so far!


In [None]:
from openai import OpenAI
openai = OpenAI(api_key=api_key)

def summarize_scrap(url: str,
                    system_prompt: str,
                    user_prompt: str,
                    model: str = "gpt-5") -> str:
    """Scrape `url` and return the model's response to `user_prompt` about its text."""
    text = get_site_text(url)
    # Delimit the scraped text so the model can tell it apart from the instructions.
    final_user = (
        f"{user_prompt}\n"
        "--- SCRAPED TEXT START ---\n"
        f"{text}\n"
        "--- SCRAPED TEXT END ---"
    )
    response = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": final_user},
        ],
    )
    return response.choices[0].message.content
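
Separating prompt assembly from the API call makes it possible to inspect and test the exact messages without a network connection or an API key. The sketch below illustrates the idea; `build_messages` is a hypothetical helper that is not part of this notebook, and the choice to raise `ValueError` on empty text is an assumption about how a failed scrape might be handled.

```python
def build_messages(system_prompt: str, user_prompt: str, text: str) -> list[dict]:
    """Assemble the chat messages for a summarization request.

    Hypothetical helper: mirrors the delimiters used in summarize_scrap
    and refuses to build a request when there is nothing to summarize.
    """
    if not text.strip():
        raise ValueError("No scraped text to summarize")

    final_user = (
        f"{user_prompt}\n"
        "--- SCRAPED TEXT START ---\n"
        f"{text}\n"
        "--- SCRAPED TEXT END ---"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": final_user},
    ]


# Inspect the prompt locally before sending anything to the API.
messages = build_messages(
    "You are a helpful assistant. Summarize only from the provided text.",
    "Summarize the website content in bullet points.",
    "Example page text.",
)
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` argument passed to `openai.chat.completions.create` in the cell above.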


In [None]:
system_prompt = "You are a helpful assistant. Summarize only from the provided text."
user_prompt = "Summarize the website content in bullet points."

url = "https://edwarddonner.com"

result = summarize_scrap(
    url=url,
    system_prompt=system_prompt,
    user_prompt=user_prompt
)

print(result)


- Ed is a coder who experiments with LLMs; he also enjoys amateur electronic music production and browsing Hacker News.
- Co-founder and CTO of Nebula.io, applying AI to help people discover their potential and purpose.
- Previously founder and CEO of AI startup untapt, acquired in 2021.
- Created Udemy courses on LLMs after friends’ encouragement; they’re best-selling, top-rated, with 400,000 learners across 190 countries. Full curriculum available on his site.
- Promises infrequent, value-focused emails.
- Contact: ed [at] edwarddonner [dot] com.


In [None]:
from IPython.display import Markdown, display

def summarize_url_and_show_markdown(
    url: str,
    system_prompt: str,
    user_prompt: str,
):
    """
    Calls summarize_scrap()
    and displays the result as rendered Markdown.
    """

    result = summarize_scrap(
        url=url,
        system_prompt=system_prompt,
        user_prompt=user_prompt
    )

    display(Markdown(result))


In [None]:
summarize_url_and_show_markdown(url = "https://en.wikipedia.org/wiki/Bangladesh",
                                system_prompt = system_prompt,
                                user_prompt = user_prompt)

- Overview: Bangladesh is a South Asian country of about 171.4 million people (2023) in 148,460 km², bordered by India and Myanmar with a Bay of Bengal coastline; Dhaka is the capital and Chittagong the main port; Bengali is the official language (English recognised), 99% of people are Bengali and about 91% are Muslim.
- Historical arc: From ancient Hindu–Buddhist polities to the Bengal Sultanate and Mughal prosperity, British rule followed the 1757 Battle of Plassey; East Bengal became East Pakistan in 1947, and after a 1971 war marked by genocide and Indian support, Bangladesh became independent; politics since saw Mujib’s rule and assassination, Zia and Ershad eras, alternating BNP–Awami League rivalry, and in August 2024 Sheikh Hasina was ousted, with a Muhammad Yunus–led interim government installed.
- Government and politics: A unitary Westminster-style parliamentary republic with a powerful prime minister, ceremonial president, a 350-seat unicameral Jatiya Sangsad (50 reserved for women, anti-defection Article 70), and a Supreme Court; parliament is currently suspended, the judiciary faces large backlogs, and overall democratic performance is rated low by International IDEA.
- Economy: Lower-middle-income mixed economy (nominal GDP 2025 est. $475b; PPP $1.78t; per capita nominal $2,730; PPP $10,260), with services 51.5%, industry 34.6%, agriculture 11%; garments account for 84% of exports (second-largest globally); remittances were ~$27b in 2024; challenges include inflation, corruption, power constraints and slow reforms; HDI 0.685 (2023), Gini 33.4 (2025).
- Energy and infrastructure: Achieved 100% electrification by 2022, generation capacity rose to 25.5 GW (plan 50 GW by 2041); world’s largest off-grid solar program; Rooppur nuclear plant’s first unit is expected in 2025; gas shortages drive LNG imports; significant roles for private power firms and US-made turbines.
- Foreign relations: A middle power active in UN, Commonwealth, SAARC (pioneer), OIC, D-8 and hosts BIMSTEC HQ; seeks ASEAN membership; relations are strained with Myanmar over >700,000 Rohingya refugees; key ties include India (water/border issues), China (largest trading partner and arms supplier), and Japan (largest aid provider); 59% of remittances come from the Middle East; leads climate-vulnerable diplomacy.
- Military: About 230,000 active personnel, among South Asia’s larger forces; largest contributor to UN peacekeeping; defence budget ~1.3% of GDP; Navy includes frigates, corvettes and submarines; equipment largely from China, with growing cooperation with India; ratified the UN nuclear ban treaty (2019).
- Geography, climate, environment: Dominated by the Ganges–Brahmaputra–Meghna delta and low elevations, with the Sundarbans mangroves and haor wetlands; tropical monsoon climate with frequent floods and cyclones; highly climate-vulnerable (sea-level rise could inundate ~20% of land by 2050); Bangladesh Delta Plan 2100 underway; forest cover ~14%; biodiversity is rich but threatened by pollution and habitat loss.
- Demographics and social indicators: One of the world’s most densely populated countries; TFR 1.9 (below replacement), ~40% urban, median age ~28; hosts one of the largest refugee populations (Rohingya); literacy 76%; education is free/compulsory but public spending is low (1.8% of GDP); health spending ~2.36% of GDP with high out-of-pocket costs, life expectancy 74, significant malnutrition, severe air pollution, and arsenic in drinking water.
- Society, culture, media and sport: Vibrant civil society (e.g., BRAC) amid shrinking civic space; human-rights concerns (e.g., US sanctions on RAB), Digital Security Act replaced by Cyber Security Act (2023); Freedom House rates it “partly free,” press freedom ranked 149/180 (2025); rich cultural heritage (Pahela Baishakh, Eid, Durga Puja; literature from Tagore to Nazrul; Jamdani UNESCO); cuisine centers on rice and fish (hilsa); kabaddi is the national sport, cricket is most popular with notable men’s and women’s achievements; three UNESCO World Heritage Sites support modest tourism.

In [None]:
summarize_url_and_show_markdown(url = "https://edwarddonner.com",
                                system_prompt = system_prompt,
                                user_prompt = user_prompt)

- Ed is a coder who experiments with LLMs; hobbies include amateur electronic music production and browsing Hacker News.
- Co-founder and CTO of Nebula.io, applying AI to help people discover their potential and purpose.
- Previously founder and CEO of AI startup untapt, acquired in 2021.
- Created Udemy courses on LLMs after friends’ encouragement; they are best-selling, top-rated, with 400,000 learners across 190 countries.
- Expresses gratitude to learners visiting from his courses; mentions a full curriculum is available.
- Promises infrequent, value-added contact.
- Contact: ed [at] edwarddonner [dot] com.