# Web Scraping Wikipedia - France Page
## A Step-by-Step Tutorial using BeautifulSoup

This notebook shows how to collect information from the France Wikipedia page, clean it, and store the results.

## Step 1: Import Required Libraries

We need `requests` to download the page, `BeautifulSoup` to parse it, and `pandas` to present the results cleanly.

In [1]:
# Import required libraries for web scraping and data handling
import requests  # Library to send HTTP requests and fetch web page content
from bs4 import BeautifulSoup  # Library for parsing HTML and XML documents
import pandas as pd  # Library for data manipulation and creating structured tables (DataFrames)

## Step 2: Fetch the Wikipedia Page

Wikipedia can reject requests that look like they come from an unidentified script, so we send a browser-style `User-Agent` header and check the status code before continuing.

In [2]:
# Define the URL of the Wikipedia page we want to scrape
url = "https://en.wikipedia.org/wiki/France"

# Set headers to mimic a real browser, helping to avoid being blocked by the server
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/123.0 Safari/537.36"
}

# Send a GET request with the headers; the timeout stops the cell from hanging if the server never responds
response = requests.get(url, headers=headers, timeout=10)

# Check the response status code to ensure the request was successful (200 means OK)
if response.status_code == 200:
    print("Page fetched successfully!")
    print(f"Status Code: {response.status_code}")
    print(f"Content Length: {len(response.content)} bytes")
else:
    print(f"Failed to fetch page. Status Code: {response.status_code}")

Page fetched successfully!
Status Code: 200
Content Length: 1172254 bytes
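
The fetch above can also be wrapped in a small helper that raises an exception on failure instead of printing a message. The sketch below is one possible shape, not part of the original notebook: `fetch_html`, its `session` parameter, and the `USER_AGENT` string are hypothetical names chosen for illustration. It sets a timeout, calls `raise_for_status()` so HTTP errors surface as exceptions, and sends a descriptive User-Agent that identifies the script, which is an alternative to imitating a browser.

```python
import requests

# Hypothetical identifier: replace the project name and contact address with your own.
USER_AGENT = "france-scraper-tutorial/0.1 (contact: you@example.com)"


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising on network or HTTP errors."""
    # Use the supplied session if there is one, otherwise the requests module itself.
    http = session if session is not None else requests
    response = http.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # turns 4xx/5xx responses into requests.HTTPError
    return response.text
```

Calling `fetch_html("https://en.wikipedia.org/wiki/France")` would return the same HTML that `response.text` holds in the cell above. Passing a `requests.Session()` as `session` reuses one connection across several pages.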


## Step 3: Parse HTML with BeautifulSoup

Once the page downloads, we hand the raw HTML to BeautifulSoup so we can search it.

In [3]:
# Parse the raw HTML content from the response using BeautifulSoup with the 'lxml' parser for efficient HTML handling
soup = BeautifulSoup(response.content, 'lxml')

# Confirm parsing was successful and show the type of the soup object
print("HTML parsed successfully!")
print(f"Document type: {type(soup)}")

HTML parsed successfully!
Document type: <class 'bs4.BeautifulSoup'>
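
To see what searching the parsed document looks like, here is a small self-contained sketch that runs on a hand-written HTML snippet rather than the live page. The snippet, the `intro` class, and the `soup_demo` name are made up for illustration. It uses Python's built-in `html.parser`, so it runs without `lxml` installed.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 class="firstHeading">Demo</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup_demo = BeautifulSoup(html, "html.parser")

first_p = soup_demo.find("p")            # first matching tag, or None if absent
all_p = soup_demo.find_all("p")          # list of every matching tag
intro = soup_demo.select_one("p.intro")  # lookup by CSS selector

print(first_p.get_text())  # First paragraph.
print(len(all_p))          # 2
print(intro["class"])      # ['intro']
```

`find` and `select_one` return `None` when nothing matches, which is why the later cells check the result before using it.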


## Step 4: Extract the Page Title

We read both the browser tab title (`<title>`) and the main heading (`<h1>`).

In [4]:
# Method 1: Extract the page title from the <title> tag in the HTML head
page_title = soup.title.string
print("Page Title (from <title> tag):")
print(page_title)
print()

# Method 2: Find the main heading (h1) which is typically the article title
main_heading = soup.find('h1', class_='firstHeading')
if main_heading:
    article_title = main_heading.text
    print("Article Title (from <h1> tag):")
    print(article_title)
else:
    # Fallback for newer Wikipedia layouts: look for the title in a span with specific class
    main_heading = soup.find('span', class_='mw-page-title-main')
    if main_heading:
        article_title = main_heading.text
        print("Article Title (from span):")
        print(article_title)

Page Title (from <title> tag):
France - Wikipedia

Article Title (from <h1> tag):
France
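
The nested `if`/`else` fallback above can be flattened into a loop over candidate selectors. This is a sketch under assumptions: `get_article_title` is a hypothetical helper, the first two selectors are the class names used in the cell above, and the bare `h1` fallback is an extra guess added here.

```python
from bs4 import BeautifulSoup


def get_article_title(soup):
    """Return the title from the first selector that matches, else None."""
    for selector in ("h1.firstHeading", "span.mw-page-title-main", "h1"):
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    return None


# Hand-written snippets standing in for different page layouts.
old_layout = BeautifulSoup('<h1 class="firstHeading">France</h1>', "html.parser")
new_layout = BeautifulSoup(
    '<h1><span class="mw-page-title-main">France</span></h1>', "html.parser"
)
no_title = BeautifulSoup("<p>No heading here</p>", "html.parser")

print(get_article_title(old_layout))  # France
print(get_article_title(new_layout))  # France
print(get_article_title(no_title))    # None
```

Returning `None` when nothing matches lets the caller decide what to do, rather than leaving `article_title` undefined.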


## Step 5: Extract the First 5 Headings

Wikipedia marks top-level sections with `<h2>` tags. We read the first five to understand the article layout. The page's own "Contents" heading is also an `<h2>`, so it shows up first in the list.

In [5]:
# Find all <h2> elements in the HTML, which represent section headings
all_headings = soup.find_all('h2')

# Loop through the first 5 headings, extract their text, and print them
for i, heading in enumerate(all_headings[:5], 1):
    # Get the text content of the heading, stripping extra whitespace
    heading_text = heading.get_text(strip=True)
    print(f"{i}. {heading_text}")

print(f"\nTotal headings found: {len(all_headings)}")

1. Contents
2. Etymology
3. History
4. Geography
5. Politics

Total headings found: 13
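
The first entry in the output, "Contents", is the table-of-contents heading rather than an article section. One simple way to leave it out is to filter by text, as in this sketch. `section_headings` and its `skip` parameter are hypothetical names, and the sample HTML is hand-written.

```python
from bs4 import BeautifulSoup


def section_headings(soup, skip=("Contents",)):
    """Return <h2> texts in document order, leaving out empty and skipped ones."""
    texts = [h.get_text(strip=True) for h in soup.find_all("h2")]
    return [t for t in texts if t and t not in skip]


sample = BeautifulSoup(
    "<h2>Contents</h2><h2>Etymology</h2><h2>History</h2><h2></h2>",
    "html.parser",
)
print(section_headings(sample))  # ['Etymology', 'History']
```

Filtering by text is fragile if an article has a real section with the same name. Restricting the search to the article body container is an alternative worth trying.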


## Step 6: Extract a Specific Paragraph

We gather every main paragraph, clean out empty ones, and then display whichever paragraph number we want.

In [6]:
# Locate the main content container and find all <p> tags within it, or fallback to all <p> tags if container not found
content_div = soup.find('div', class_='mw-page-container')
all_paragraphs = content_div.find_all('p') if content_div else soup.find_all('p')

# Filter the list to include only paragraphs with non-empty text content
valid_paragraphs = [p for p in all_paragraphs if p.get_text(strip=True)]


print(f"Total valid paragraphs found: {len(valid_paragraphs)}")
print()

Total valid paragraphs found: 143



In [7]:
# Define a function to display a specific paragraph by its number
def display_paragraph(paragraph_number):
    """
    Display the text of a specified paragraph from the list of valid paragraphs.
    
    Args:
        paragraph_number (int): The 1-based index of the paragraph to display.
    """
    # Validate the paragraph number is within range
    if 1 <= paragraph_number <= len(valid_paragraphs):
        # Retrieve the paragraph (adjusting for 0-based indexing)
        paragraph = valid_paragraphs[paragraph_number - 1]
        # Extract and print the cleaned text
        paragraph_text = paragraph.get_text(strip=True)
        print(f"Paragraph #{paragraph_number}:")
        print(paragraph_text)
    else:
        print(f"Invalid paragraph number. Please choose between 1 and {len(valid_paragraphs)}")

# Example usage: Display the first paragraph
print("Example 1: Displaying Paragraph 1")
display_paragraph(1)

Example 1: Displaying Paragraph 1
Paragraph #1:
France,[h]officially theFrench Republic,[i]is a country primarily located inWestern Europe.Its overseas regions and territoriesincludeFrench Guianain South America,Saint Pierre and Miquelonin the North Atlantic, theFrench West Indies, andmany islandsinOceaniaand theIndian Ocean, giving itthe largest discontiguous exclusive economic zone in the world.Metropolitan Franceshares borders withBelgiumandLuxembourgto the north;Germanyto the northeast;Switzerlandto the east;ItalyandMonacoto the southeast;AndorraandSpainto the south; and a maritime border with theUnited Kingdomto the northwest. Its metropolitan area extends from theRhineto the Atlantic Ocean and from theMediterranean Seato theEnglish Channeland theNorth Sea. Its 18integral regions—five of which are overseas—span a combined area of 632,702 km2(244,288 sq mi) and havean estimated total populationof over 68.6 million as of January 2025[update]. France is asemi-presidential republic. Its

In [8]:
# Display the third paragraph as another example
print("Example 2: Displaying Paragraph 3")
display_paragraph(3)

Example 2: Displaying Paragraph 3
Paragraph #3:
TheFrench Revolutionof 1789 overthrew theAncien Régimeand produced theDeclaration of the Rights of Man, which expresses the nation's ideals to this day. France reached its political and military zenith in the early 19th century underNapoleon Bonaparte, subjugating part of continental Europe and establishing theFirst French Empire. Its collapse initiated a period of relative decline in which France endured theBourbon Restorationuntil the founding of theFrench Second Republicwhich was succeeded by theSecond French EmpireuponNapoleon III's takeover. His empire collapsed during theFranco-Prussian Warin 1870. This led to the establishment of theFrench Third Republic, with a period of economic prosperity and cultural and scientific flourishing known as theBelle Époque. France was one of themajor participantsofWorld War I, from whichit emerged victoriousat great human and economic cost. It was among theAllies of World War II, but it surrendered 

In [9]:
# Display the fifth paragraph as a third example
print("Example 3: Displaying Paragraph 5")
display_paragraph(5)

Example 3: Displaying Paragraph 5
Paragraph #5:
Originally applied to the wholeFrankish Empire, the nameFrancecomes from theLatinFrancia, or 'realm of theFranks'.[11]Thename of the Franksis related to the English wordfrank('free'): the latter stems from theOld Frenchfranc('free, noble, sincere'), and ultimately from theMedieval Latinwordfrancus('free, exempt from service; a freeman, a Frank'), a generalisation of the tribal name that emerged as aLate Latinborrowing of the reconstructedFrankishendonym*Frank.[12][13]It has been suggested that the meaning 'free' was adopted because after the conquest ofGaul, only Franks were free of taxation,[14]or more generally because they had the status of freemen in contrast to servants or slaves.[13]The etymology of*Frankis uncertain. It is traditionally derived from theProto-Germanicword*frankōn, which translates as 'javelin' or 'lance' (the throwing axe of the Franks was known as thefrancisca),[15]although these weapons may have been named because
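
The paragraph outputs above run words together ("theFrench Republic") because `get_text(strip=True)` strips each text fragment and then joins the fragments with an empty separator, so the spaces around links are lost. One possible fix, sketched below on a hand-written snippet, is to take the unstripped text and collapse the whitespace afterwards. `clean_text` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def clean_text(tag):
    """Return the tag's text with runs of whitespace collapsed to single spaces."""
    return " ".join(tag.get_text().split())


p = BeautifulSoup(
    '<p>France, officially the <a href="#">French Republic</a>, is a country.</p>',
    "html.parser",
).p

print(p.get_text(strip=True))  # France, officially theFrench Republic, is a country.
print(clean_text(p))           # France, officially the French Republic, is a country.
```

Swapping `paragraph.get_text(strip=True)` for a helper like this inside `display_paragraph` should keep the spacing of the original article text.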

## Step 7: Extract the Infobox

The infobox on the right side stores key facts about France. We turn it into a list of label/value pairs.

In [10]:
# The re module is used below to strip citation markers
import re

# Locate the infobox table using its class name
infobox = soup.find('table', class_='infobox')

# Initialize the list up front so later cells can test it even if no infobox is found
info_data = []

if infobox:
    print("✓ Infobox table found!")
    print("\nExtracting France Information...\n")
    
    # Iterate over each row in the infobox table
    rows = infobox.find_all('tr')
    
    for row in rows:
        # Find the header (th) and data (td) cells in the row
        header = row.find('th')
        data = row.find('td')
        
        if header and data:
            # Extract and clean the text from header and data
            label = header.get_text(strip=True)
            value = data.get_text(strip=True)
            
            # Remove numeric citation markers like [1], [2]
            value = re.sub(r'\[\d+\]', '', value)
            
            # Append the cleaned label-value pair to the list
            info_data.append([label, value])
    
    # Print the extracted information in a formatted way
    print("France - Key Information:")
    print("=" * 80)
    for label, value in info_data:
        print(f"{label:30} : {value}")
        
else:
    print("✗ Infobox table not found")

✓ Infobox table found!

Extracting France Information...

France - Key Information:
================================================================================
Capitaland largest city        : Paris48°51′N2°21′E﻿ / ﻿48.850°N 2.350°E﻿ /48.850; 2.350
Official languageand national language : French[a]
Nationality(2021)[1]           : 92.2%French7.8%other
Religion(2021)[2]              : 50%Christianity33%irreligion4%Islam4%other religions
Demonym                        : French
Government                     : Unitarysemi-presidential republic
•President                     : Emmanuel Macron
•Prime Minister                : Sébastien Lecornu
•President of the Senate       : Gérard Larcher
•President of the National Assembly : Yaël Braun-Pivet
Legislature                    : Parliament
•Upper house                   : Senate
•Lower house                   : National Assembly
•Kingdom of the West Franks—Treaty of Verdun : 10 August 843
•French Republic—French First Republic : 22 September 1792
•Current constitution—French Fifth Republic : 4 October 1958
• Total   
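
The labels and values above still contain fused words ("Capitaland largest city") and bracketed notes ("French[a]", "Nationality(2021)[1]"). The sketch below shows one way to tidy both on a hand-written table: it joins text fragments with a space, removes any bracketed marker from labels as well as values, and collapses whitespace. `cell_text`, `infobox_pairs`, and the sample HTML are hypothetical and only loosely modelled on the real infobox.

```python
import re

from bs4 import BeautifulSoup

SAMPLE = """
<table class="infobox">
  <tr><th>Capital<br>and largest city</th><td><a>Paris</a></td></tr>
  <tr><th>Official language</th><td>French<sup>[a]</sup></td></tr>
  <tr><th colspan="2">Government</th></tr>
</table>
"""

# Matches [1], [a], [update] and similar bracketed markers.
BRACKET_NOTE = re.compile(r"\[[^\]]*\]")


def cell_text(cell):
    """Return cell text with fragments space-separated and bracketed notes removed."""
    text = cell.get_text(" ", strip=True)
    text = BRACKET_NOTE.sub("", text)
    return " ".join(text.split())


def infobox_pairs(table):
    """Return (label, value) tuples for rows that have both a <th> and a <td>."""
    pairs = []
    for row in table.find_all("tr"):
        header, data = row.find("th"), row.find("td")
        if header and data:
            pairs.append((cell_text(header), cell_text(data)))
    return pairs


demo_soup = BeautifulSoup(SAMPLE, "html.parser")
pairs = infobox_pairs(demo_soup.find("table", class_="infobox"))
print(pairs)
# [('Capital and largest city', 'Paris'), ('Official language', 'French')]
```

A space separator can leave a stray space before punctuation when a link is followed by a comma, so treat this as a starting point rather than a complete cleaner.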

## Step 8: Create a DataFrame

Turning the list into a DataFrame lets us inspect and export it easily.

In [11]:
# Convert the list of info_data into a pandas DataFrame for structured data handling
if info_data:
    df = pd.DataFrame(info_data, columns=['Property', 'Value'])
    print("France Information Table (as DataFrame):")
    display(df)
    print(f"\nTotal rows extracted: {len(df)}")
else:
    print("No data to create DataFrame")

France Information Table (as DataFrame):


|   | Property | Value |
|---|----------|-------|
| 0 | Capitaland largest city | Paris48°51′N2°21′E / 48.850°N 2.350°E /48.8... |
| 1 | Official languageand national language | French[a] |
| 2 | Nationality(2021)[1] | 92.2%French7.8%other |
| 3 | Religion(2021)[2] | 50%Christianity33%irreligion4%Islam4%other rel... |
| 4 | Demonym | French |
| 5 | Government | Unitarysemi-presidential republic |
| 6 | •President | Emmanuel Macron |
| 7 | •Prime Minister | Sébastien Lecornu |
| 8 | •President of the Senate | Gérard Larcher |
| 9 | •President of the National Assembly | Yaël Braun-Pivet |



Total rows extracted: 38
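
Once the data is in a DataFrame, rows can be filtered with pandas string methods instead of Python loops. The sketch below rebuilds a few rows by hand from the output above, strips the leading bullet characters from the labels, and selects rows whose label mentions a keyword. The `sample_df` and `matches` names are hypothetical.

```python
import pandas as pd

# A handful of rows copied from the infobox output above.
rows = [
    ["Demonym", "French"],
    ["•President", "Emmanuel Macron"],
    ["•Prime Minister", "Sébastien Lecornu"],
    ["Legislature", "Parliament"],
]
sample_df = pd.DataFrame(rows, columns=["Property", "Value"])

# Remove the leading bullet (and any space after it) from sub-item labels.
sample_df["Property"] = sample_df["Property"].str.lstrip("• ")

# Keep only rows whose label contains the keyword, ignoring case.
matches = sample_df[sample_df["Property"].str.contains("president", case=False)]
print(matches)
```

Here the filter keeps only the "President" row, because "Prime Minister" does not contain the keyword.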


## Step 9: Pull Specific Facts

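We search the infobox labels for a keyword such as “Capital” or “Currency” and return the value of the first label that contains it. The keyword has to match the page's own wording, which is why the population lookup uses “January 2025 estimate”.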

In [12]:
# Define a function to search for and retrieve specific information by keyword
def get_info_value(search_term):
    """
    Search the info_data list for a label containing the search term and return its value.
    
    Args:
        search_term (str): The keyword to search for in labels (case-insensitive).
    
    Returns:
        str: The corresponding value if found, otherwise "Not found".
    """
    # Loop through the info_data to find a matching label
    for label, value in info_data:
        # Check if the search term is in the label, ignoring case
        if search_term.lower() in label.lower():
            return value
    return "Not found"

# Use the function to extract and print specific facts about France
print("Specific Information about France:")
print("=" * 80)
print(f"Capital         : {get_info_value('Capital')}")
print(f"Population      : {get_info_value('January 2025 estimate')}")
print(f"Area            : {get_info_value('Total')}")
print(f"Government      : {get_info_value('Government')}")
print(f"Currency        : {get_info_value('Currency')}")
print(f"Language        : {get_info_value('language')}")
print(f"Religion        : {get_info_value('Religion')}")

Specific Information about France:
================================================================================
Capital         : Paris48°51′N2°21′E﻿ / ﻿48.850°N 2.350°E﻿ /48.850; 2.350
Population      : 68,605,616(21st)
Area            : 632,702.3 km2(244,287.7 sq mi)(includingmetropolitan Franceandoverseas Franceand excludingTerre Adelie)(42nd)
Government      : Unitarysemi-presidential republic
Currency        : Euro(€) (EUR)[c]CFP franc(XPF)[d]
Language        : French[a]
Religion        : 50%Christianity33%irreligion4%Islam4%other religions
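
Because `get_info_value` returns the first label that contains the keyword, a short keyword such as "Total" or "President" can silently pick the wrong row. A variant that returns every match makes that ambiguity visible. This is a sketch with a hypothetical helper name, `find_matches`, and hand-written sample pairs.

```python
def find_matches(pairs, search_term):
    """Return every (label, value) pair whose label contains search_term, ignoring case."""
    needle = search_term.lower()
    return [(label, value) for label, value in pairs if needle in label.lower()]


sample_pairs = [
    ("Official language", "French"),
    ("Government", "Unitary semi-presidential republic"),
    ("President", "Emmanuel Macron"),
    ("President of the Senate", "Gérard Larcher"),
]

print(find_matches(sample_pairs, "president"))
# [('President', 'Emmanuel Macron'), ('President of the Senate', 'Gérard Larcher')]
print(find_matches(sample_pairs, "currency"))
# []
```

An empty list also avoids mixing the string "Not found" with real values, so callers can test the result with a plain `if`.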


## Step 10: Save to CSV

Saving the cleaned table means we can reuse it later without re-scraping.

In [13]:
# Save the DataFrame to a CSV file for future use, without including the index column
if info_data:
    csv_filename = 'france_wikipedia_data.csv'
    df.to_csv(csv_filename, index=False, encoding='utf-8')
    print(f"✓ Data saved to '{csv_filename}'")
    print(f"Location: {csv_filename}")
else:
    print("No data to save")

✓ Data saved to 'france_wikipedia_data.csv'
Location: france_wikipedia_data.csv
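
A quick way to check that a saved file is reusable is to read it back and compare it with the original. The sketch below does this in a temporary directory with a two-row table copied from the output above, so it does not touch `france_wikipedia_data.csv`. The `sample.csv` filename and the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame(
    [["Demonym", "French"], ["Legislature", "Parliament"]],
    columns=["Property", "Value"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # index=False keeps the row numbers out of the file.
    original.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

# Compare column names and cell values of the two tables.
round_trip_ok = (
    list(restored.columns) == list(original.columns)
    and restored.values.tolist() == original.values.tolist()
)
print(round_trip_ok)  # True
```

If the file were written without `index=False`, the row numbers would come back as an extra unnamed column when the CSV is read again.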


## Summary

In this walkthrough you:

1. Imported the libraries and fetched a live page with browser-style headers.
2. Parsed the HTML with BeautifulSoup.
3. Read the page title and key headings.
4. Gathered paragraphs and built a helper to display them on demand.
5. Extracted the infobox into a tidy DataFrame.
6. Looked up specific facts and saved everything to CSV.

Keep experimenting with other Wikipedia pages. Most articles follow the same layout, so the code should transfer with only minor tweaks, such as adjusting the infobox labels you search for.