### In-Class Assignment: Web Scraping and Data Extraction from a New Webpage
In this assignment, each group will:
- Use the `requests` library to fetch a new webpage.
- Parse the HTML content using BeautifulSoup.
- Extract various elements such as figures, tables, and text.
- Work collaboratively to practice web scraping and present its findings.

### Task 1: Select a Webpage
Select a webpage of interest (e.g., a news article, an educational resource, or a data-driven website). Ensure that the selected webpage contains a variety of elements, such as tables, figures, and text content.

### Task 2: Fetch and Parse the Webpage

In [4]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_manufacturers_by_motor_vehicle_production'
response = requests.get(url, timeout=10)  # give up after 10 seconds instead of hanging indefinitely

In [5]:

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the webpage!")
else:
    print("Failed to fetch the webpage.")

Successfully fetched the webpage!
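
The check above only distinguishes status 200 from everything else, so a failure message gives no hint about the cause. As a minimal sketch, the hypothetical helper `describe_status` below (not part of the original notebook) uses the standard-library `http.HTTPStatus` enum to add the status code and its reason phrase to the message. It assumes that any 2xx code counts as success.

```python
from http import HTTPStatus


def describe_status(status_code: int) -> str:
    """Build a fetch message that includes the HTTP status code and reason phrase."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Codes outside the standard registry have no known phrase.
        phrase = "Unknown status"
    # Assumption: treat every 2xx response as a successful fetch.
    outcome = "Successfully fetched" if 200 <= status_code < 300 else "Failed to fetch"
    return f"{outcome} the webpage (HTTP {status_code} {phrase})."


print(describe_status(404))  # Failed to fetch the webpage (HTTP 404 Not Found).
```

In the notebook this could be called as `print(describe_status(response.status_code))`.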


In [6]:

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

### Task 3: Extract Elements

In [7]:
# Find all images and extract their src attributes.
images = soup.find_all('img')
image_urls = [img['src'] for img in images if 'src' in img.attrs]
print("Image URLs:", image_urls)

Image URLs: ['/static/images/icons/wikipedia.png', '/static/images/mobile/copyright/wikipedia-wordmark-en.svg', '/static/images/mobile/copyright/wikipedia-tagline-en.svg', '//upload.wikimedia.org/wikipedia/commons/thumb/5/53/Ambox_current_red_Americas.svg/42px-Ambox_current_red_Americas.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/2/22/Motor_vehicles_produced_by_country_2013.png/220px-Motor_vehicles_produced_by_country_2013.png', '//upload.wikimedia.org/wikipedia/en/thumb/1/1d/Information_icon4.svg/40px-Information_icon4.svg.png', '//upload.wikimedia.org/wikipedia/en/thumb/9/96/Symbol_category_class.svg/16px-Symbol_category_class.svg.png', '//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/12px-Commons-logo.svg.png', '//upload.wikimedia.org/wikipedia/en/thumb/d/db/Symbol_list_class.svg/16px-Symbol_list_class.svg.png', '//upload.wikimedia.org/wikipedia/en/thumb/e/e2/Symbol_portal_class.svg/16px-Symbol_portal_class.svg.png', 'https://login.wikimedia.org/wiki/Sp

In [22]:
from IPython.display import Image, display
image_urls = ['/static/images/icons/wikipedia.png', '/static/images/mobile/copyright/wikipedia-wordmark-en.svg', '/static/images/mobile/copyright/wikipedia-tagline-en.svg', '//upload.wikimedia.org/wikipedia/commons/thumb/5/53/Ambox_current_red_Americas.svg/42px-Ambox_current_red_Americas.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/2/22/Motor_vehicles_produced_by_country_2013.png/220px-Motor_vehicles_produced_by_country_2013.png', '//upload.wikimedia.org/wikipedia/en/thumb/1/1d/Information_icon4.svg/40px-Information_icon4.svg.png', '//upload.wikimedia.org/wikipedia/en/thumb/9/96/Symbol_category_class.svg/16px-Symbol_category_class.svg.png', '//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/12px-Commons-logo.svg.png', '//upload.wikimedia.org/wikipedia/en/thumb/d/db/Symbol_list_class.svg/16px-Symbol_list_class.svg.png', '//upload.wikimedia.org/wikipedia/en/thumb/e/e2/Symbol_portal_class.svg/16px-Symbol_portal_class.svg.png', 'https://login.wikimedia.org/wiki/Special:CentralAutoLogin/start?type=1x1', '/static/images/footer/wikimedia-button.svg', '/static/images/footer/poweredby_mediawiki.svg']  # Replace with your actual list
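image_index = 4  # position of the production map in image_urls

# Guard against an empty list or an out-of-range index
if image_urls and 0 <= image_index < len(image_urls):
    image_url = image_urls[image_index]
    # Wikipedia uses protocol-relative URLs ('//upload...'); add the scheme explicitly
    if image_url.startswith('//'):
        image_url = 'https:' + image_url

    # Display the image
    display(Image(url=image_url))
else:
    print("Invalid index or no images found.")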


In [38]:
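# Locate and extract all tables on the webpage, converting them into Pandas DataFrames.
from io import StringIO

import pandas as pd

tables = soup.find_all('table')
# Wrap each table's HTML in StringIO: recent pandas versions deprecate passing literal HTML strings.
dataframes = [pd.read_html(StringIO(str(table)))[0] for table in tables]
table_index = 2  # zero-based, so this selects the third table on the page
print(f"Table {table_index + 1}:\n", dataframes[table_index])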

Table 3:
    Rank             Group        Country  Sold vehicles (2023)[18]
0     1            Toyota          Japan                  10307395
1     2  Volkswagen Group        Germany                   9239575
2     3       Hyundai/Kia    South Korea                   7302451
3     4        Stellantis    Netherlands                   6392600
4     5    General Motors  United States                   6188476
5     6              Ford  United States                   4413545
6     7             Honda          Japan                   4188039
7     8            Nissan          Japan                   3374271
8     9               BMW        Germany                   2555341
9    10           Changan          China                   2553052
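
Selecting the table by a fixed position is fragile, because the index shifts whenever the page gains or loses a table. As a sketch of an alternative, the hypothetical helper `find_table_index` below picks the first table whose column headers contain a set of required names. Only the last header list in the example comes from the output above; the other two are made-up placeholders.

```python
from typing import Optional, Sequence


def find_table_index(headers_per_table: Sequence[Sequence[str]],
                     required: Sequence[str]) -> Optional[int]:
    """Return the index of the first table whose headers contain every required name."""
    wanted = [name.lower() for name in required]
    for index, headers in enumerate(headers_per_table):
        lowered = [str(header).lower() for header in headers]
        # A required name matches when it is a substring of some header,
        # so "Sold vehicles" matches "Sold vehicles (2023)[18]".
        if all(any(name in header for header in lowered) for name in wanted):
            return index
    return None


# In the notebook, each inner list could be list(frame.columns) for one of the
# DataFrames built in the cell above.
example_headers = [
    ["Note"],                                               # hypothetical placeholder
    ["Year", "Total"],                                      # hypothetical placeholder
    ["Rank", "Group", "Country", "Sold vehicles (2023)[18]"],  # from the output above
]

print(find_table_index(example_headers, ["Group", "Sold vehicles"]))  # 2
```

The function returns `None` when no table matches, which lets the caller report a missing table instead of raising an `IndexError`.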


In [47]:
# Extract the main text content from the paragraphs.
paragraphs = soup.find_all('p')
text_content = ' '.join([para.get_text() for para in paragraphs])
print(text_content[:908])  # Print the first 908 characters


 This is a list of manufacturers by motor vehicle production, by year, based on Organisation Internationale des Constructeurs d'Automobiles (OICA).
 Figures include passenger cars, light commercial vehicles, minibuses, trucks, buses and coaches. OICA defines these entries as follows:[1]
 Motor vehicle production by manufacturer (top five groups)
 The summary chart includes the five largest worldwide automotive manufacturing groups as of 2017 by number of vehicles produced. Those same groups held the top 5 positions 2007 to 2019; Hyundai / Kia had a lower rank until it took the fifth spot in 2007 from the at that time split German-American auto manufacturer DaimlerChrysler, while Ford became surpassed by Honda in 2020, and even Nissan in 2021, before surpassing them again in 2022. Figures were compiled by the International Organization of Motor Vehicle Manufacturers (OICA) before the year 2018:
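
The extracted text still carries Wikipedia citation markers such as `[1]` and stray line breaks. A minimal cleanup sketch follows; `clean_text` is a hypothetical helper, and it assumes that every bracketed number is a citation marker, which may not hold on other pages.

```python
import re


def clean_text(text: str) -> str:
    """Remove bracketed numeric citation markers and collapse runs of whitespace."""
    # Assumption: "[<digits>]" always denotes a citation marker.
    without_citations = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", without_citations).strip()


sample = ("OICA defines these entries as follows:[1]\n"
          " Motor vehicle production by manufacturer (top five groups)")
print(clean_text(sample))
```

In the notebook, `clean_text(text_content)` could be applied before printing or analyzing the text.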



### Task 4: Analyze and Discuss Findings
Each group will analyze the extracted data and discuss the following:
- What figures (images) were extracted and what do they represent?
- What information is contained in the tables, and how does it contribute to the overall content of the webpage?
- What is the main focus of the text content extracted? How does it relate to the images and tables?
- Discuss the challenges faced during extraction, such as dealing with complex HTML structures or incomplete data.

1. The main map image is a heat map of vehicles produced by country: the darker a country's shading, the more vehicles it produced.
2. The table lists the number of vehicles sold by each manufacturing group in 2023. It contributes to the page by giving the 2023 figures, which can be compared with other years.
3. The extracted text mainly explains the tables and states which companies produced the most vehicles in which years.
4. We had a hard time isolating a single image from the full list of image URLs.
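
One way to address the difficulty of isolating a single image is to filter by filename instead of by position in the list. The sketch below assumes the wanted image can be recognized by a keyword in its path. `find_images` is a hypothetical helper, and the sample list is a subset of the URLs printed in Task 3. `urljoin` from the standard library also turns relative and protocol-relative paths into absolute URLs.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/List_of_manufacturers_by_motor_vehicle_production"


def find_images(image_urls, keyword, base_url=PAGE_URL):
    """Return absolute URLs of images whose path contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [urljoin(base_url, src) for src in image_urls if keyword in src.lower()]


# A subset of the image URLs extracted in Task 3.
sample_urls = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/2/22/Motor_vehicles_produced_by_country_2013.png/220px-Motor_vehicles_produced_by_country_2013.png",
    "//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/12px-Commons-logo.svg.png",
]

for match in find_images(sample_urls, "motor_vehicles"):
    print(match)
```

The returned list may hold zero, one, or several matches, so the caller should check its length before displaying an image.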

### Task 5: Present Findings
Share your analysis of the extracted elements.
Discuss any patterns, relationships, or insights gained from the data.

Each group should submit their Jupyter notebook (or Python script) with the code, analysis, and any additional notes or reflections on the exercise.

We found that Toyota of Japan, and Asia more broadly, produce a lot of vehicles, along with a few German companies and some American companies. We also found that South America and Africa produce very few vehicles compared with other continents.