### Step 1: Install Required Libraries
- First, ensure you have the necessary libraries installed.
- You need `requests` to fetch the webpage content, `beautifulsoup4` to parse it, and `pandas` to organize and clean the extracted data.

In [None]:
# Uncomment the next line if the libraries are not already installed
# !pip install requests beautifulsoup4 pandas
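
Before running the rest of the notebook, it can help to confirm that the packages are importable. The sketch below is one possible check that uses only the standard library. The helper name `check_packages` is hypothetical, and note that `beautifulsoup4` is imported under the module name `bs4`.

```python
from importlib.util import find_spec


def check_packages(module_names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in module_names}


# beautifulsoup4 installs a module called "bs4", so that is the name to look up
status = check_packages(["requests", "bs4", "pandas"])
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

Any package reported as missing can be installed with the `pip` command above.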

### Step 2: Import Libraries
Start by importing the necessary libraries in your Python script.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Step 3: Fetch the Webpage
Use the `requests` library to get the HTML content of the Wikipedia page.

In [2]:
# Define the URL of the Wikipedia page containing FIFA World Cup information
url = "https://en.wikipedia.org/wiki/FIFA_World_Cup"


# Send a GET request to the specified URL and store the response
response = requests.get(url)

# Check if the request was successful (status code 200 indicates success)
if response.status_code == 200:
    # If successful, print a success message
    print("Successfully fetched the webpage!")
else:
    # If not successful, print the status code to help diagnose the failure
    print(f"Failed to fetch the webpage (status code {response.status_code}).")

Successfully fetched the webpage!
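
A slightly more defensive variant of the fetch step is sketched below. It is not part of the original notebook: the function name `fetch_html`, the 10-second timeout, and the User-Agent string are all assumptions chosen for illustration. It relies on `raise_for_status()` to turn 4xx/5xx responses into exceptions and catches `requests.RequestException`, the base class for errors raised by `requests`.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    # A descriptive User-Agent is a courtesy to the site; this value is a placeholder
    headers = {"User-Agent": "worldcup-attendance-tutorial/0.1 (educational example)"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx responses instead of checking the code by hand
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.content


# A malformed URL is rejected before any network traffic is sent
print(fetch_html("not-a-valid-url"))
```

Returning `None` on failure lets the calling cell decide whether to stop or retry, rather than continuing with an empty response.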


### Step 4: Parse the HTML Content
Use Beautiful Soup to parse the HTML content.

In [3]:
# Parse the content of the fetched webpage using BeautifulSoup
# The 'html.parser' is a built-in parser in Python for parsing HTML and XHTML documents
soup = BeautifulSoup(response.content, 'html.parser')

##### `response.content`:

- `response.content` is an attribute of the response object obtained from making a request using the requests library in Python.
- It contains the raw content of the response, which is essentially the HTML content of the webpage in this case.
- The content is in bytes format, which means it is not yet decoded into a string or other more human-readable format.
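
The difference between bytes and decoded text can be seen without making a request. In the sketch below, `raw_bytes` is a made-up stand-in for `response.content`, and UTF-8 is assumed as the encoding.

```python
# A made-up stand-in for response.content: HTML encoded as UTF-8 bytes
raw_bytes = "<p>café</p>".encode("utf-8")
print(type(raw_bytes).__name__)

# Decoding turns the bytes into a regular Python string
text = raw_bytes.decode("utf-8")
print(type(text).__name__)

# The accented character takes two bytes in UTF-8 but counts as one character
print(len(raw_bytes), len(text))
```

BeautifulSoup accepts either form; when given bytes, it works out the encoding itself.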

##### Parsing:

- Parsing refers to the process of analyzing a string of symbols, either in natural language or computer languages, to determine its grammatical structure with respect to a given set of rules.
- In the context of web scraping, parsing HTML means converting the HTML content (plain text made up of tags and data) into a structured format that is easier to work with, such as a parse tree.
- This structured format allows for more efficient data extraction and manipulation.
- Libraries like BeautifulSoup are used for parsing HTML documents, creating a tree structure from the raw HTML content, allowing for easy navigation and data extraction.
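
To make the idea of a parse tree concrete, the sketch below parses a tiny, made-up HTML document and navigates it by tag name. The HTML string and the variable name `sample_soup` are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Heading</h1>
    <p class="intro">First <b>bold</b> paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# Tags become nodes that can be reached by name or searched for
page_title = sample_soup.title.get_text()                          # text inside <title>
bold_text = sample_soup.find("p", class_="intro").b.get_text()     # <b> nested in the first <p>
paragraph_count = len(sample_soup.find_all("p"))                   # number of <p> elements
heading_parent = sample_soup.h1.parent.name                        # tag that contains <h1>

print(page_title, bold_text, paragraph_count, heading_parent)
```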

### Step 5: Locate the Data Table
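- Inspect the Wikipedia page to find the table containing attendance data.
- At the time of writing, it is the 4th table returned by `soup.find_all('table')` (index 3). This position can change whenever the page is edited, so verify it before relying on it.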

In [None]:
soup.find_all('table')
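
Because a table's position on a page can shift, one alternative is to pick the table by its header text instead of by index. The sketch below is an assumption-laden illustration, not the notebook's method: the helper `find_table_by_header` is hypothetical, and the HTML is made up rather than taken from Wikipedia.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables; only the second one has an attendance header
sample_html = """
<table><tr><th>Year</th><th>Champions</th></tr><tr><td>2000</td><td>Team A</td></tr></table>
<table><tr><th>Year</th><th>Total attendance</th></tr><tr><td>2000</td><td>1,000</td></tr></table>
"""


def find_table_by_header(soup, keyword):
    """Return the first table whose header cells mention keyword, else None."""
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True).lower() for th in table.find_all("th")]
        if any(keyword.lower() in header for header in headers):
            return table
    return None


sample_soup = BeautifulSoup(sample_html, "html.parser")
table = find_table_by_header(sample_soup, "attendance")
print(table.find_all("td")[1].get_text(strip=True))
```

The helper returns `None` when no header matches, so the caller can detect a changed page layout instead of silently scraping the wrong table.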

### Step 6: Extract Data from Table
- There are many tables in the webpage, so we filter to just the 4th table (index 3)
- Then iterate through each row in the table to extract the attendance data. Store the data in a list of dictionaries for easy manipulation later.

In [4]:
# Select the table that contains the World Cup attendance data (4th table, index 3)
attendance_section = soup.find_all('table')[3]

In [None]:
attendance_section

In [5]:
# Initialize an empty list to store the World Cup attendance data
world_cups = []

# Extract all the rows (tr elements) from the attendance table
rows = attendance_section.find_all('tr')

# Iterate over each row in the table, starting from the second row (skipping the header row)
for row in rows[1:]:
    # Find all the cells (td and th elements) in the current row
    cols = row.find_all(['td', 'th'])
    
    # Check if the current row has at least 8 columns (to ensure we capture all the data)
    if len(cols) >= 8:
        # Extract the data from the cells and store them in variables
        year = cols[0].get_text(strip=True)  
        hosts = cols[1].get_text(strip=True)
        venues_cities = cols[2].get_text(strip=True)
        total_attendance = cols[3].get_text(strip=True)
        matches = cols[4].get_text(strip=True)
        avg_attendance = cols[5].get_text(strip=True)
        highest_attendance_number = cols[6].get_text(strip=True)
        highest_attendance_venue = cols[7].get_text(strip=True)
        # The strip=True parameter ensures that any leading or trailing whitespace is removed from the extracted text
        
        # Create a dictionary with the extracted data and append it to the world_cups list
        world_cups.append({
            "Year": year,
            "Hosts": hosts,
            "Venues/Cities": venues_cities,
            "Total Attendance": total_attendance,
            "Matches": matches,
            "Average Attendance": avg_attendance,
            "Highest Attendance Number": highest_attendance_number,
            "Highest Attendance Venue": highest_attendance_venue
        })

##### In HTML, `<tr>` and `<td>` are tags used to define table rows and table data cells, respectively:

- **`<tr>` (Table Row)**: This tag defines a row in a table.
- A `<tr>` element contains one or more cells: `<td>` elements for data and, in some tables, `<th>` elements for headers.

- **`<td>` (Table Data)**: This tag defines a cell in a table row. 
- It is used to contain the data for that particular cell. 
- A `<td>` element is always nested inside a `<tr>` element.


- **`attendance_section.find_all('tr')` finds all the rows in the table.**
- **`row.find_all(['td', 'th'])` finds all the data and header cells in the current row, in document order.**

##### Regarding the `<th>` (Table Header) element:

- The `<th>` element is used to define a header cell in an HTML table. 
- It is similar to the `<td>` (Table Data) element, but it is typically used to represent the column headers or row headers in a table.
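- You can use `find_all('th')` to find the header cells, just as `find_all('tr')` finds the rows and `row.find_all('td')` finds the data cells in a row. Passing a list, as in `row.find_all(['td', 'th'])`, returns both kinds of cell, which matters when the first cell of a row (such as the year) is marked up as `<th>`.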
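
The row and cell tags described above can be tried out on a small table. In this sketch the HTML is made up, and pairing cells with header names through `zip` is one possible approach rather than the one used in Step 6.

```python
from bs4 import BeautifulSoup

# Made-up table in which the first cell of each data row is a <th> row header
sample_html = """
<table>
  <tr><th>Year</th><th>Hosts</th><th>Matches</th></tr>
  <tr><th>2000</th><td> Country A </td><td>32</td></tr>
  <tr><th>2004</th><td>Country B</td><td>64</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser").find("table")
rows = table.find_all("tr")

# The first row holds the column names
header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    # Searching for both tags keeps row-header cells (<th>) such as the year
    cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
    # Skip rows whose cell count does not match the header
    if len(cells) == len(header):
        records.append(dict(zip(header, cells)))

print(records)
```

Searching only for `td` here would drop the year from every row, because those cells are `<th>` elements.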

### Step 7: Clean and Convert Data
Convert the extracted data into a pandas DataFrame and clean it up.

In [6]:
# Convert list of dictionaries to DataFrame
df_world_cups = pd.DataFrame(world_cups)

# Display the DataFrame
df_world_cups

Unnamed: 0,Year,Hosts,Venues/Cities,Total Attendance,Matches,Average Attendance,Highest Attendance Number,Highest Attendance Venue
0,1,1930,Uruguay,Uruguay,4–2,Argentina,United States,–[n 1]
1,2,1934,Italy,Italy,2–1(a.e.t.),Czechoslovakia,Germany,3–2
2,3,1938,France,Italy,4–2,Hungary,Brazil,4–2
3,4,1950,Brazil,Uruguay,2–1[n 2],Brazil,Sweden,3–1[n 2]
4,5,1954,Switzerland,West Germany,3–2,Hungary,Austria,3–1
5,6,1958,Sweden,Brazil,5–2,Sweden,France,6–3
6,7,1962,Chile,Brazil,3–1,Czechoslovakia,Chile,1–0
7,8,1966,England,England,4–2(a.e.t.),West Germany,Portugal,2–1
8,9,1970,Mexico,Brazil,4–1,Italy,West Germany,1–0
9,10,1974,West Germany,West Germany,2–1,Netherlands,Poland,1–0


In [None]:
df_world_cups.info()

In [None]:
# Remove footnote markers from the 'Hosts' and 'Highest Attendance Number' columns.
# regex=False treats the brackets literally; as a regex, '[n 1]' would be a character class.
df_world_cups['Hosts'] = df_world_cups['Hosts'].str.replace('[n 1]', '', regex=False)

df_world_cups['Highest Attendance Number'] = df_world_cups['Highest Attendance Number'].str.replace('[96]', '', regex=False)
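
Bracketed footnote markers need care because square brackets have a special meaning in regular expressions. The sketch below uses the standard `re` module and a made-up value to show the difference between a regex match and a literal match.

```python
import re

# Made-up cell value with a footnote marker appended
text = "England[n 1]"

# As a regex, [n 1] is a character class: it matches a single "n", space, or "1"
as_regex = re.sub("[n 1]", "", text)

# A plain string replacement removes only the exact marker
as_literal = text.replace("[n 1]", "")

# re.escape turns the marker into a literal pattern for regex-based tools
escaped = re.sub(re.escape("[n 1]"), "", text)

# The character class also "finds" the marker in text that never contained it
false_match = bool(re.search("[n 1]", "Argentina"))

print(as_regex, as_literal, escaped, false_match)
```

In pandas, `str.contains` interprets its pattern as a regex by default, so passing `regex=False` (or escaping the pattern) avoids this kind of false match.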

In [None]:
# Remove commas from specified columns so we can convert to numeric later
specified_columns = ['Total Attendance', 'Average Attendance', 'Highest Attendance Number']
for col in specified_columns:
    df_world_cups[col] = df_world_cups[col].str.replace(',', '')

In [None]:
# Replace empty cells with NaN
import numpy as np

df_world_cups.replace('', np.nan, inplace=True)

In [None]:
# Convert numerical columns to numeric dtype
numeric_columns = ['Year', 'Total Attendance', 'Matches', 'Average Attendance', 'Highest Attendance Number']
df_world_cups[numeric_columns] = df_world_cups[numeric_columns].astype(pd.Int64Dtype())
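
If some cells still contain text that cannot be read as a number, a more forgiving conversion is possible. The sketch below is an alternative under stated assumptions, using made-up values: `pd.to_numeric` with `errors="coerce"` turns unparseable entries into missing values before the cast to the nullable `Int64` dtype.

```python
import pandas as pd

# Made-up values: a thousands separator, an empty cell, and a non-numeric entry
raw = pd.Series(["1930", "1,234", "", "n/a"])

cleaned = (
    raw.str.replace(",", "", regex=False)      # drop thousands separators
       .pipe(pd.to_numeric, errors="coerce")   # unparseable values become NaN
       .astype("Int64")                        # nullable integers keep the gaps as <NA>
)
print(cleaned)
```

The trade-off is that coercion hides bad values instead of raising an error, so it is worth checking how many entries ended up missing.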

In [None]:
# Separate 'Venues/Cities' into two columns
df_world_cups[['Venues', 'Cities']] = df_world_cups['Venues/Cities'].str.split('/', expand=True)

# Drop the original 'Venues/Cities' column
df_world_cups.drop(columns=['Venues/Cities'], inplace=True)
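
Assigning the split result to exactly two new columns only works when the split yields two columns. A minimal sketch with made-up values, assuming each cell has the form `venues/cities`, passes `n=1` so that only the first `/` is used:

```python
import pandas as pd

# Made-up values, including one missing cell
sample = pd.DataFrame({"Venues/Cities": ["3/1", "17/14", None]})

# n=1 splits on the first "/" only, so the result has at most two columns
parts = sample["Venues/Cities"].str.split("/", n=1, expand=True)
sample[["Venues", "Cities"]] = parts
print(sample)
```

The new columns hold strings, so they need their own numeric conversion if numbers are wanted.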

In [None]:
# Rearrange columns
df_world_cups = df_world_cups[['Year', 'Hosts', 'Venues', 'Cities', 'Total Attendance', 
                               'Matches', 'Average Attendance', 'Highest Attendance Number', 
                               'Highest Attendance Venue']]

In [None]:
# Display the cleaned DataFrame
df_world_cups

In [None]:
df_world_cups.info()

### Step 8: Save the Data (Optional)
Finally, save the cleaned data to a CSV file.

In [None]:
df_world_cups.to_csv('fifa_worldcup_attendance.csv', index=False)
print("Data saved to 'fifa_worldcup_attendance.csv'.")
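
A CSV file does not store column dtypes, so nullable integer columns may come back as floats when the file is read again. The sketch below uses a made-up two-row frame and an in-memory buffer in place of a file, and shows one way to restore the dtype by passing `dtype` to `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up frame with a nullable integer column that contains a missing value
sample = pd.DataFrame({
    "Year": pd.array([1930, None], dtype="Int64"),
    "Hosts": ["Country A", "Country B"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Request the nullable integer dtype explicitly when reading the data back
restored = pd.read_csv(buffer, dtype={"Year": "Int64"})
print(restored.dtypes)
```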