# Web Scraping Exercise

Web scraping lets you gather large volumes of data from diverse online sources, including pages that are updated in near real time. This data can enrich your datasets, fill gaps, and supply current information that improves the quality and relevance of your analysis. It also gives you access to data that is not available through APIs or databases. Because collection is automated, it saves time and resources and scales to keeping your datasets continuously up to date.

Ethical web scraping involves respecting website terms of service, avoiding overloading servers, and ensuring that the collected data is used responsibly and in compliance with privacy laws and regulations.
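One way to act on the first two points is to consult a site's `robots.txt` before fetching and to honor any crawl delay it declares. The sketch below uses the standard library's `urllib.robotparser`. The robots.txt content, the `course-scraper` user agent, and the `example.com` URLs are invented for illustration, and `robots.txt` is only one signal; it does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only to illustrate the check.
# Against a live site you would call parser.set_url(...) and parser.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
user_agent = "course-scraper"  # hypothetical user agent name

allowed_chart = parser.can_fetch(user_agent, "https://example.com/charts/hot-100/")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/data")
delay_seconds = parser.crawl_delay(user_agent)  # None if no Crawl-delay is declared

print(allowed_chart, allowed_private, delay_seconds)
```

If `delay_seconds` is not `None`, waiting that long between requests (for example with `time.sleep`) is a simple way to avoid overloading the server.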

Use Python, `requests`, `BeautifulSoup`, and/or `pandas` to scrape web data:

## Import Libraries

In [1]:
# TODO
import requests
from bs4 import BeautifulSoup
import pandas as pd


## Define the Target URL

In [2]:
#url = # TODO
url = 'https://www.billboard.com/charts/hot-100/'


## Send a Request to the Website

Do not forget to check the response status code.

In [3]:
# TODO
response = requests.get(url)

if response.status_code == 200:
    print("Request successful!")
else:
    print(f"Request failed with status code {response.status_code}")

Request successful!
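A status check can also decide what happens next: some codes (such as 429 or 503) are often temporary and worth retrying after a pause, while others (such as 404) are not. The following is a minimal sketch of that idea, not part of the exercise solution. The helper `fetch_with_retry`, the set of retryable codes, and the doubling delay are assumptions chosen for illustration. The `get` argument is any callable returning a `(status_code, body)` pair, so the logic can be tried without network access.

```python
import time
from typing import Callable, Tuple

# Assumption: these status codes are treated as temporary failures.
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_retry(
    get: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get() until it returns status 200 or the attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = get()
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_attempts:
            raise RuntimeError(f"Request failed with status code {status}")
        # Wait longer after each failed attempt: 1x, 2x, 4x the base delay.
        sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


# Demo with canned responses instead of a real HTTP call.
responses = iter([(503, ""), (200, "<html>ok</html>")])
waits = []  # records the delays instead of actually sleeping
body = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(body, waits)
```

With `requests`, `get` could be a small function that calls `requests.get(url, timeout=10)` and returns `(response.status_code, response.text)`.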


## Parse the HTML Content

Use a library to access the HTML content.

In [4]:
# TODO
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

## Identify the Data to be Scraped

Write a couple of sentences about the data you want to scrape.

TODO: 
Data is scraped from the Billboard Hot 100 website. The goal is to extract the song titles, artists, and chart positions from the weekly charts. This information will be compared with Spotify features to analyze the characteristics of successful songs.

## Extract Data

Find specific elements and extract text or attributes from them (handle pagination if necessary).

In [7]:
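# TODO
import requests
from bs4 import BeautifulSoup

# Fetch and parse the chart page again so this cell runs on its own.
url = "https://www.billboard.com/charts/hot-100"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

songs = []

# Each chart entry contains a list item with this class.
for entry in soup.select("li.o-chart-results-list__item"):
    title_tag = entry.find("h3")      # song title
    artist_tag = entry.find("span")   # first <span> in the item, which holds the artist here

    # Skip list items that do not contain both a title and an artist.
    if title_tag and artist_tag:
        title = title_tag.get_text(strip=True)
        artist = artist_tag.get_text(strip=True)

        songs.append({
            'Title': title,
            'Artist': artist
        })


# Show the first ten entries as a quick check.
for song in songs[:10]:
    print(song)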


{'Title': 'Ordinary', 'Artist': 'Alex Warren'}
{'Title': 'What I Want', 'Artist': 'Morgan Wallen Featuring Tate McRae'}
{'Title': 'Just In Case', 'Artist': 'Morgan Wallen'}
{'Title': 'Luther', 'Artist': 'Kendrick Lamar & SZA'}
{'Title': "I'm The Problem", 'Artist': 'Morgan Wallen'}
{'Title': 'A Bar Song (Tipsy)', 'Artist': 'Shaboozey'}
{'Title': 'Die With A Smile', 'Artist': 'Lady Gaga & Bruno Mars'}
{'Title': 'Lose Control', 'Artist': 'Teddy Swims'}
{'Title': 'Beautiful Things', 'Artist': 'Benson Boone'}
{'Title': 'Nokia', 'Artist': 'Drake'}
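The goal stated earlier also mentions chart positions, which the extracted records do not contain. A possible way to add them, sketched below, is to number the kept entries in the order they appear. This assumes the page lists entries in chart order and that every kept item corresponds to exactly one position. The HTML snippet is a simplified, made-up stand-in for the real page, and `extract_chart` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is nested more deeply.
SAMPLE_HTML = """
<ul>
  <li class="o-chart-results-list__item"><h3> Ordinary </h3><span>Alex Warren</span></li>
  <li class="o-chart-results-list__item"><span>no title in this item</span></li>
  <li class="o-chart-results-list__item"><h3>Just In Case</h3><span>Morgan Wallen</span></li>
</ul>
"""


def extract_chart(html: str) -> list:
    """Return one dict per chart entry, numbered in document order."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for entry in soup.select("li.o-chart-results-list__item"):
        title_tag = entry.find("h3")
        artist_tag = entry.find("span")
        if not (title_tag and artist_tag):
            continue  # skip items without both fields
        rows.append({
            "Rank": len(rows) + 1,  # assumption: document order equals chart order
            "Title": title_tag.get_text(strip=True),
            "Artist": artist_tag.get_text(strip=True),
        })
    return rows


chart = extract_chart(SAMPLE_HTML)
for row in chart:
    print(row)
```

Deriving the rank from the number of rows kept so far, rather than from the loop index, keeps the numbering gap-free when an item is skipped.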


## Store Data in a Structured Format

Give a brief overview of the data collected (e.g. count, fields, ...)

In [8]:
# TODO
import pandas as pd
df = pd.DataFrame(songs)
print(df.head())
print(f"\n{len(df)} songs were collected with the fields: {list(df.columns)}")


             Title                              Artist
0         Ordinary                         Alex Warren
1      What I Want  Morgan Wallen Featuring Tate McRae
2     Just In Case                       Morgan Wallen
3           Luther                Kendrick Lamar & SZA
4  I'm The Problem                       Morgan Wallen

100 songs were collected with the fields: ['Title', 'Artist']
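Beyond the row count and field names, an overview can report missing values, duplicate rows, and the most frequent artists. The sketch below shows one way to compute these with pandas. It uses only the five rows printed above as sample data, and the names `df_sample` and `overview` are illustrative.

```python
import pandas as pd

# Sample data: the five rows shown in the output above.
sample = [
    {"Title": "Ordinary", "Artist": "Alex Warren"},
    {"Title": "What I Want", "Artist": "Morgan Wallen Featuring Tate McRae"},
    {"Title": "Just In Case", "Artist": "Morgan Wallen"},
    {"Title": "Luther", "Artist": "Kendrick Lamar & SZA"},
    {"Title": "I'm The Problem", "Artist": "Morgan Wallen"},
]
df_sample = pd.DataFrame(sample)

overview = {
    "rows": len(df_sample),
    "fields": list(df_sample.columns),
    "missing_values": int(df_sample.isna().sum().sum()),
    "duplicate_rows": int(df_sample.duplicated().sum()),
}
print(overview)

# Number of entries per artist string, most frequent first.
artist_counts = df_sample["Artist"].value_counts()
print(artist_counts.head(3))
```

Note that `value_counts` compares the full artist string, so "Morgan Wallen" and "Morgan Wallen Featuring Tate McRae" are counted separately.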


## Save the Data

In [9]:
# TODO
df.to_csv('songArtist.csv', index=False)
print("Data saved to 'songArtist.csv'.")

Data saved to 'songArtist.csv'.
