
In [4]:
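# Getting and parsing HTML content from a Wikipedia page
# (requests is imported under its own name; aliasing it as `re` would shadow the standard regex module)
import requests
from bs4 import BeautifulSoup

def get_html_content(url):
    # Returns a parsed BeautifulSoup tree, or None when the status code is not 200
    html_response = requests.get(url, timeout=10)  # timeout so a stalled request cannot hang forever
    if html_response.status_code == 200:
        return BeautifulSoup(html_response.text, 'html.parser')
    else:
        return None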

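# Extracting the article title (the first <h1> on the page)
def extract_article_title(soup):
    title_tag = soup.find('h1')
    # Guard against pages without an <h1> instead of raising AttributeError
    return title_tag.text if title_tag else None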

# Extracting the article text, mapping each <h2> heading to a paragraph that follows it.
# Note: every assignment overwrites the previous one, so a heading followed by
# several paragraphs keeps only its last paragraph.
def extract_article_text(soup):
    article_text = {}
    for paragraph in soup.find_all('p'):
        heading = paragraph.find_previous('h2')
        if heading:
            article_text[heading.text.strip()] = paragraph.text.strip()
    return article_text

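# Collecting every link that points to another Wikipedia article
# (hrefs starting with /wiki/; namespaced pages such as File: or Category: contain ':' and are skipped)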
def collect_internal_links(soup):
    internal_links = []
    for link in soup.find_all('a', href=True):
        if link['href'].startswith('/wiki/') and ':' not in link['href']:
            internal_links.append("https://en.wikipedia.org" + link['href'])
    return internal_links

# Wrapping all the previous functions into a single function that takes a Wikipedia URL as its parameter
def process_wikipedia_page(url):
    soup = get_html_content(url)
    if soup:
        article_title = extract_article_title(soup)
        article_text = extract_article_text(soup)
        internal_links = collect_internal_links(soup)

        return {
            "title": article_title,
            "content": article_text,
            "internal_links": internal_links
        }
    else:
        return None

result = process_wikipedia_page('https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart')

# Displaying Results
if result:
    print("Article Title:", result["title"])
    print("\n--- Article Content ---")
    for heading, paragraph in result["content"].items():
        print(f"{heading}\n{paragraph}\n")

    print("\n--- Internal Links ---")
    for link in result["internal_links"][:10]:  # Show only first 10 links
        print(link)
else:
    print("Failed to fetch Wikipedia page.")
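
Because `extract_article_text` assigns to the same dictionary key for every paragraph under a heading, only the last paragraph per heading survives, which is why the "Contents" entry in the output holds a single paragraph. One possible variant, sketched here, collects all paragraphs under each heading in a list. The names `extract_article_sections` and `sample_html` are hypothetical and not part of the original notebook, and the sample markup is made up purely for illustration.

```python
from bs4 import BeautifulSoup


def extract_article_sections(soup):
    """Map each <h2> heading to the list of all paragraphs found after it."""
    sections = {}
    for paragraph in soup.find_all('p'):
        heading = paragraph.find_previous('h2')
        text = paragraph.get_text().strip()
        # Skip empty paragraphs and paragraphs with no preceding <h2>.
        if heading is None or not text:
            continue
        # setdefault creates the list the first time a heading is seen,
        # so later paragraphs are appended instead of overwriting.
        sections.setdefault(heading.get_text().strip(), []).append(text)
    return sections


# Hypothetical sample markup (an assumption, not real Wikipedia HTML).
sample_html = """
<h2>Early life</h2>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h2>Career</h2>
<p>Third paragraph.</p>
"""

sections = extract_article_sections(BeautifulSoup(sample_html, 'html.parser'))
for heading, paragraphs in sections.items():
    print(heading, len(paragraphs))
```

With this shape, the display loop would need to iterate over each list of paragraphs rather than print a single string per heading.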

Article Title: Wolfgang Amadeus Mozart

--- Article Content ---
Contents
While visiting Vienna in 1781, Mozart was dismissed from his Salzburg position. He stayed in Vienna, where he achieved fame but little financial security. During Mozart's early years in Vienna, he produced several notable works, such as the opera Die Entführung aus dem Serail, the Great Mass in C minor, the "Haydn" Quartets and a number of symphonies. Throughout his Vienna years, Mozart composed over a dozen piano concertos, many considered some of his greatest achievements. In the final years of his life, Mozart wrote many of his best-known works, including his last three symphonies, culminating in the Jupiter Symphony, the serenade Eine kleine Nachtmusik, his Clarinet Concerto, the operas The Marriage of Figaro, Don Giovanni, Così fan tutte and The Magic Flute and his Requiem. The Requiem was largely unfinished at the time of his death at age 35, the circumstances of which are uncertain and much mythologised.

L